Sunday, October 6, 2013

The Code that remove HTML tag using regular expression



I need some source code that extract text from HTML.
Using regular expression it is very simple.

It removes scripts, style, tags, entities, and whitespace.

 private String getText(String content) {  
      Pattern SCRIPTS = Pattern.compile("<(no)?script[^>]*>.*?</(no)?script>",Pattern.DOTALL);  
      Pattern STYLE = Pattern.compile("<style[^>]*>.*</style>",Pattern.DOTALL);  
      Pattern TAGS = Pattern.compile("<(\"[^\"]*\"|\'[^\']*\'|[^\'\">])*>");  
      Pattern nTAGS = Pattern.compile("<\\w+\\s+[^<]*\\s*>");  
      Pattern ENTITY_REFS = Pattern.compile("&[^;]+;");  
      Pattern WHITESPACE = Pattern.compile("\\s\\s+");  
        
      Matcher m;  
        
      m = SCRIPTS.matcher(content);  
      content = m.replaceAll("");  
      m = STYLE.matcher(content);  
      content = m.replaceAll("");  
      m = TAGS.matcher(content);  
      content = m.replaceAll("");  
      m = ENTITY_REFS.matcher(content);  
      content = m.replaceAll("");  
      m = WHITESPACE.matcher(content);  
      content = m.replaceAll(" ");             
        
      return content;  
 }  

If you are interested, please try to test it.

No comments:

Post a Comment