I needed some code that extracts the text from an HTML page.
With regular expressions it is quite simple.
The method below removes scripts, styles, tags, and entity references, then collapses runs of whitespace.
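import java.util.regex.Pattern;

// Compile each pattern once instead of on every call.
private static final Pattern SCRIPTS = Pattern.compile(
        "<(no)?script[^>]*>.*?</(no)?script>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
// Reluctant .*? so that a match stops at the first </style>, not the last one in the page.
private static final Pattern STYLE = Pattern.compile(
        "<style[^>]*>.*?</style>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
// A tag may carry quoted attribute values, which can themselves contain '>'.
private static final Pattern TAGS = Pattern.compile("<(\"[^\"]*\"|'[^']*'|[^'\">])*>");
// Named and numeric references only, so a stray '&' cannot swallow text up to the next ';'.
private static final Pattern ENTITY_REFS = Pattern.compile("&#?\\w+;");
private static final Pattern WHITESPACE = Pattern.compile("\\s+");

private String getText(String content) {
    // Order matters: scripts and styles go first, while their tags still delimit them.
    content = SCRIPTS.matcher(content).replaceAll("");
    content = STYLE.matcher(content).replaceAll("");
    content = TAGS.matcher(content).replaceAll("");
    content = ENTITY_REFS.matcher(content).replaceAll("");
    // Collapse every run of whitespace to a single space.
    content = WHITESPACE.matcher(content).replaceAll(" ");
    return content.trim();
}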
If you are interested, please give it a try.
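When testing, it is worth including a page with more than one style block. With DOTALL enabled, a greedy `.*` between `<style>` and `</style>` runs to the last closing tag in the input and takes the text in between with it, while a reluctant `.*?` stops at the first one. The following is only a small sketch of that difference; the class name `StyleQuantifierDemo`, the helper `strip`, and the sample markup are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: compares a greedy and a reluctant style pattern.
public class StyleQuantifierDemo {
    // Greedy: .* extends to the LAST </style> in the input.
    static final Pattern GREEDY =
            Pattern.compile("<style[^>]*>.*</style>", Pattern.DOTALL);
    // Reluctant: .*? stops at the FIRST </style> after each opening tag.
    static final Pattern RELUCTANT =
            Pattern.compile("<style[^>]*>.*?</style>", Pattern.DOTALL);

    // Removes every match of the given pattern from the input.
    static String strip(Pattern p, String html) {
        return p.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Assumed sample input with two separate style blocks.
        String html = "<style>a{}</style><p>Hello</p><style>b{}</style><p>World</p>";
        System.out.println(strip(GREEDY, html));    // prints <p>World</p>
        System.out.println(strip(RELUCTANT, html)); // prints <p>Hello</p><p>World</p>
    }
}
```

The greedy version silently drops the "Hello" paragraph, which is easy to miss on a page that happens to contain only one style block.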