
Sunday, October 6, 2013

Code that removes HTML tags using regular expressions



I needed some source code that extracts text from HTML.
With regular expressions, this is simple enough for typical pages.

It removes scripts, style blocks, tags, and entity references, then collapses runs of whitespace into a single space.

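 import java.util.regex.Pattern;

 // Patterns are compiled once and reused. CASE_INSENSITIVE also catches <SCRIPT>, <Style>, etc.
 private static final Pattern SCRIPTS = Pattern.compile(
      "<(no)?script[^>]*>.*?</(no)?script>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
 // Reluctant .*? so that text between two separate <style> blocks is kept
 private static final Pattern STYLE = Pattern.compile(
      "<style[^>]*>.*?</style>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
 // Any remaining tag; quoted attribute values may contain '>'
 private static final Pattern TAGS = Pattern.compile("<(\"[^\"]*\"|'[^']*'|[^'\">])*>");
 // Named and numeric entity references such as &nbsp; or &#160;
 private static final Pattern ENTITY_REFS = Pattern.compile("&#?\\w+;");
 // Any run of whitespace, including a single newline or tab
 private static final Pattern WHITESPACE = Pattern.compile("\\s+");

 private String getText(String content) {
      // Order matters: drop script and style bodies before stripping the tags around them
      content = SCRIPTS.matcher(content).replaceAll("");
      content = STYLE.matcher(content).replaceAll("");
      content = TAGS.matcher(content).replaceAll("");
      content = ENTITY_REFS.matcher(content).replaceAll("");
      content = WHITESPACE.matcher(content).replaceAll(" ");
      return content;
 }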
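
To try the method outside of a larger project, it can be wrapped in a small standalone class. The sketch below is only an illustration: the class name `HtmlTextDemo` and the sample HTML are made up for this example, and the patterns assume a non-greedy style match and a narrower entity pattern (`&#?\w+;`) than `&[^;]+;`.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper class, used only to make the method runnable on its own
class HtmlTextDemo {
    private static final Pattern SCRIPTS = Pattern.compile(
            "<(no)?script[^>]*>.*?</(no)?script>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    private static final Pattern STYLE = Pattern.compile(
            "<style[^>]*>.*?</style>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    private static final Pattern TAGS = Pattern.compile("<(\"[^\"]*\"|'[^']*'|[^'\">])*>");
    // Assumption: only well-formed named or numeric entity references are removed
    private static final Pattern ENTITY_REFS = Pattern.compile("&#?\\w+;");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String getText(String content) {
        content = SCRIPTS.matcher(content).replaceAll("");
        content = STYLE.matcher(content).replaceAll("");
        content = TAGS.matcher(content).replaceAll("");
        content = ENTITY_REFS.matcher(content).replaceAll("");
        content = WHITESPACE.matcher(content).replaceAll(" ");
        return content;
    }

    public static void main(String[] args) {
        // Made-up sample page; note the '>' inside the quoted class attribute
        String html = "<html><head><style>p { color: red; }</style>"
                + "<script>alert('hi');</script></head>"
                + "<body><p class=\"a>b\">Hello,&nbsp;\n   <b>world</b>!</p></body></html>";
        System.out.println(getText(html)); // prints "Hello, world!"
    }
}
```

Entity references are deleted rather than decoded here, so `&amp;` disappears instead of becoming `&`. That is fine for rough text extraction but loses characters if the exact text matters.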

If you are interested, please try it out.
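
One case worth testing is a page with more than one style block. With a greedy `.*`, a single match runs from the first `<style>` to the last `</style>` and swallows the text in between, while a reluctant `.*?` stops at the first closing tag. The following is a minimal sketch of that check; the class name `StyleQuantifierCheck`, the helper `strip`, and the input string are invented for the example.

```java
import java.util.regex.Pattern;

// Hypothetical test class comparing greedy and reluctant quantifiers
class StyleQuantifierCheck {
    // Removes every match of the given regex from the input (DOTALL so '.' spans newlines)
    static String strip(String regex, String html) {
        return Pattern.compile(regex, Pattern.DOTALL).matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<style>a{}</style>KEEP<style>b{}</style>";

        // Greedy: one match covers both blocks, so KEEP is lost
        System.out.println("[" + strip("<style[^>]*>.*</style>", html) + "]");   // prints "[]"

        // Reluctant: each block is matched separately, so KEEP survives
        System.out.println("[" + strip("<style[^>]*>.*?</style>", html) + "]");  // prints "[KEEP]"
    }
}
```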