Mini's Programming: Code that removes HTML tags using regular expressions

I needed some source code that extracts the text from an HTML document.
With regular expressions this is quite simple.

The method below removes scripts, style blocks, tags, and entity references, and then collapses runs of whitespace into a single space.

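 import java.util.regex.Matcher;
 import java.util.regex.Pattern;

 // Strips scripts, styles, tags and entity references, then collapses whitespace.
 private String getText(String content) {
       // DOTALL lets '.' span line breaks; CASE_INSENSITIVE also catches <SCRIPT>, <Style>, ...
       Pattern SCRIPTS = Pattern.compile("<(no)?script[^>]*>.*?</(no)?script>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
       // Lazy .*? so that text between two separate <style> blocks is kept
       Pattern STYLE = Pattern.compile("<style[^>]*>.*?</style>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
       // Any tag; quoted attribute values may themselves contain '>'
       Pattern TAGS = Pattern.compile("<(\"[^\"]*\"|'[^']*'|[^'\">])*>");
       // Named (&amp;), decimal (&#38;) and hex (&#x26;) entity references
       Pattern ENTITY_REFS = Pattern.compile("&(#\\d+|#[xX][0-9a-fA-F]+|\\w+);");
       // Two or more whitespace characters in a row
       Pattern WHITESPACE = Pattern.compile("\\s\\s+");

       Matcher m;

       m = SCRIPTS.matcher(content);
       content = m.replaceAll("");
       m = STYLE.matcher(content);
       content = m.replaceAll("");
       m = TAGS.matcher(content);
       content = m.replaceAll("");
       m = ENTITY_REFS.matcher(content);
       content = m.replaceAll("");
       m = WHITESPACE.matcher(content);
       content = m.replaceAll(" ");

       return content;
 }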

If you are interested, please try it out.
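
One way to try it out is a small standalone class like the sketch below. The class name `HtmlTextDemo` and the sample HTML are made up for illustration. This copy of `getText` also assumes a lazy `.*?` in the style pattern, case-insensitive matching for script and style blocks, and an entity pattern limited to named and numeric references.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only getText comes from the post.
public class HtmlTextDemo {
    private static final Pattern SCRIPTS = Pattern.compile(
            "<(no)?script[^>]*>.*?</(no)?script>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    private static final Pattern STYLE = Pattern.compile(
            "<style[^>]*>.*?</style>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    private static final Pattern TAGS = Pattern.compile(
            "<(\"[^\"]*\"|'[^']*'|[^'\">])*>");
    private static final Pattern ENTITY_REFS = Pattern.compile(
            "&(#\\d+|#[xX][0-9a-fA-F]+|\\w+);");
    private static final Pattern WHITESPACE = Pattern.compile("\\s\\s+");

    // Same steps as in the post: scripts, styles, tags, entities, whitespace.
    static String getText(String content) {
        content = SCRIPTS.matcher(content).replaceAll("");
        content = STYLE.matcher(content).replaceAll("");
        content = TAGS.matcher(content).replaceAll("");
        content = ENTITY_REFS.matcher(content).replaceAll("");
        content = WHITESPACE.matcher(content).replaceAll(" ");
        return content;
    }

    public static void main(String[] args) {
        // Sample input: a style block, a script containing '<',
        // and an attribute value containing '>'.
        String html = "<html><head><style>p { color: red; }</style>"
                + "<script>var ok = 1 < 2;</script></head>"
                + "<body><p title=\"a>b\">Tom &amp; <b>Jerry</b></p>\n\n<p>Bye</p></body></html>";

        System.out.println(getText(html)); // prints "Tom Jerry Bye"
    }
}
```

Note that the entity reference `&amp;` is dropped rather than decoded to `&`, so the two spaces around it are then collapsed into one.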

Sunday, October 6, 2013
