Jericho HTML Parser
I needed to parse and merge HTML for the project I am currently working on. After searching for and trying out several HTML parsers, I decided to use Jericho, a lightweight and easy-to-use HTML parser. I also checked the CyberNeko HTML Parser, another strong alternative, but did not choose it for a simple reason.
The Maven dependency is:
<dependency>
    <groupId>net.htmlparser.jericho</groupId>
    <artifactId>jericho-html</artifactId>
    <version>3.1</version>
</dependency>
Here is a simple example that uses Source to wrap the input, which can be a String, Reader, InputStream, URLConnection or URL, and then extracts the plain text of the document:
public String read(InputStream is) throws IOException {
    Source source = new Source(is);
    return source.getTextExtractor().toString();
}
You can merge HTML using StreamedSource, which parses the document on the fly instead of loading it all into memory and is therefore preferred when memory usage matters. Now let's insert a title into the head element:
public String merge(InputStream is) {
    Writer writer = new StringWriter();
    try {
        StreamedSource source = new StreamedSource(is);
        for (Segment s : source) {
            // Copy every segment (tags and text) through unchanged.
            writer.write(s.toString());
            // Right after the opening <head> tag, write the new title.
            if (s instanceof StartTag
                    && ((StartTag) s).getName().equals(HTMLElementName.HEAD)) {
                writer.write("<title>TEST</title>");
            }
        }
        writer.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return writer.toString();
}
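Both methods above take an InputStream, so to try them on a small in-memory snippet you first need to wrap a String. Here is a minimal sketch that uses only the JDK. The class name HtmlInputs and the method fromString are hypothetical names chosen for illustration, not part of Jericho, and UTF-8 is an assumed encoding:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Jericho.
final class HtmlInputs {

    // Wraps an HTML string as an InputStream (UTF-8 is an assumption).
    static InputStream fromString(String html) {
        return new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head></head><body><p>Hello</p></body></html>";
        try (InputStream is = fromString(html)) {
            // In the post's code, this stream would be handed to read(is) or merge(is).
            System.out.println(is.available() + " bytes ready to parse");
        }
    }
}
```

The returned stream can then be passed along as read(HtmlInputs.fromString(html)) or merge(HtmlInputs.fromString(html)).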
As you can see, Jericho is really easy to use!