Jericho HTML Parser

I needed to parse and merge HTML for the project I am currently working on. After searching for and trying out several HTML parsers, I decided to use Jericho, a lightweight and easy-to-use HTML parser. I also checked the CyberNeko HTML Parser, which is another strong alternative, but here is the simple reason why I did not choose it :)

The Maven dependency is:

        <dependency>
            <groupId>net.htmlparser.jericho</groupId>
            <artifactId>jericho-html</artifactId>
            <version>3.1</version>
        </dependency>

Here is a simple example that uses Source to wrap the input data, which can be a String, Reader, InputStream, URLConnection or URL, and then extracts the plain text from it.

    public String read(InputStream is) throws IOException {
        Source source = new Source(is);
        return source.getTextExtractor().toString();
    }
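The same idea works for HTML that is already in memory. The following is a minimal sketch, assuming the Jericho 3.1 dependency above is on the classpath; the class name `TextExtractionExample` and the sample markup are made up for illustration.

```java
import net.htmlparser.jericho.Source;

public class TextExtractionExample {

    // Builds a Source from an in-memory string and returns its text content.
    // Unlike the InputStream variant, this constructor does no I/O.
    public static String extractText(String html) {
        Source source = new Source(html);
        return source.getTextExtractor().toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample document, used only for this sketch.
        String html = "<html><body><h1>Hello</h1><p>Jericho parser</p></body></html>";
        System.out.println(extractText(html));
    }
}
```

The extractor drops the markup and keeps only the text, so the printed result should contain the heading and paragraph text without any tags.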

You can also merge HTML using StreamedSource, which reads the document as a stream of segments and is therefore the better choice when memory use matters. Now let’s insert a title into the head element:

    public String merge(InputStream is) {
        Writer writer = new StringWriter();
        try {
            StreamedSource source = new StreamedSource(is);
            for (Segment s : source) {
                // Copy every segment (tags and text) to the output unchanged.
                writer.write(s.toString());
                // Right after the opening <head> tag, write the new title.
                if (s instanceof StartTag
                        && HTMLElementName.HEAD.equals(((StartTag) s).getName())) {
                    writer.write("<title>TEST</title>");
                }
            }
            writer.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return writer.toString();
    }
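To try the merge without a file or network stream, the same loop can be run over a String. This is only a sketch under a few assumptions: the class name `TitleInserter`, the method `insertTitle` and its `title` parameter are hypothetical, the title is written as-is without any escaping, and the input is expected to contain a head start tag.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Segment;
import net.htmlparser.jericho.StartTag;
import net.htmlparser.jericho.StreamedSource;

public class TitleInserter {

    // Copies the document segment by segment and writes a <title> element
    // right after the first <head> start tag. The title is not escaped.
    public static String insertTitle(CharSequence html, String title) throws IOException {
        Writer writer = new StringWriter();
        StreamedSource source = new StreamedSource(html);
        try {
            boolean inserted = false;
            for (Segment segment : source) {
                writer.write(segment.toString());
                if (!inserted && segment instanceof StartTag
                        && HTMLElementName.HEAD.equals(((StartTag) segment).getName())) {
                    writer.write("<title>" + title + "</title>");
                    inserted = true;
                }
            }
        } finally {
            source.close();
        }
        return writer.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample document, used only for this sketch.
        String html = "<html><head></head><body><p>Hello</p></body></html>";
        System.out.println(insertTitle(html, "TEST"));
    }
}
```

Declaring `throws IOException` instead of catching it lets the caller decide how to handle a failed read, and the `inserted` flag guards against writing the title twice if the input happens to contain more than one head start tag.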

As you can see Jericho is really easy to use!
