HTML Parsers
jsoup is a Java library that simplifies working with real-world HTML and XML. It offers an easy-to-use API for URL fetching, data parsing, extraction, and manipulation using DOM API methods, CSS selectors, and XPath selectors. jsoup implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers.
Last Release on Mar 4, 2025
4. TagSoup (187 usages)
org.ccil.cowan.tagsoup » tagsoup (license: Apache)
TagSoup is a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either ...
Last Release on Aug 22, 2011
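To illustrate what a SAX interface buys you, here is a minimal sketch of a handler that collects link targets from SAX events. It is written against the standard `org.xml.sax.XMLReader` interface only. The class name `LinkCollector` and the sample markup are hypothetical, and the JDK's built-in XML parser is used as a stand-in so the sketch runs without third-party jars. The stand-in accepts only well-formed markup; the assumption is that a TagSoup reader (its `org.ccil.cowan.tagsoup.Parser` class implements `XMLReader`) could be passed to `collect` in its place to handle messy HTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: gathers href values from <a> elements.
public class LinkCollector extends DefaultHandler {
    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // qName is checked because the stand-in reader below is not namespace-aware.
        if ("a".equalsIgnoreCase(qName)) {
            String href = atts.getValue("href");
            if (href != null) {
                links.add(href);
            }
        }
    }

    public List<String> getLinks() {
        return links;
    }

    // Runs the handler against any SAX XMLReader implementation.
    public static List<String> collect(XMLReader reader, String markup) throws Exception {
        LinkCollector handler = new LinkCollector();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.getLinks();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader from the JDK; requires well-formed input.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        String xhtml = "<html><body><a href=\"https://example.org/\">one</a>"
                + "<a href=\"/two\">two</a></body></html>";
        System.out.println(collect(reader, xhtml));
    }
}
```

Because the handler depends only on SAX interfaces, the choice of parser stays a one-line decision at the call site.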
JTidy is a Java port of HTML Tidy, an HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively lets you use JTidy as a DOM parser for real-world HTML.
Last Release on Jul 20, 2010
6. HTML Parser Jar (152 usages)
org.htmlparser » htmlparser (licenses: CPL, LGPL)
HTML Parser is a high-level syntactical analyzer.
Last Release on Apr 24, 2011
HtmlCleaner is an HTML parser written in Java. It transforms dirty HTML to well-formed XML following the same rules that most web browsers use.
Last Release on Jun 19, 2023
JTidy is a Java port of HTML Tidy, an HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM parser for real-world HTML.
Last Release on Sep 11, 2024
Relocated → net.sf.jtidy » jtidy
9. Jericho HTML Parser (138 usages)
net.htmlparser.jericho » jericho-html (licenses: Apache, EPL, LGPL)
Jericho HTML Parser is a Java library allowing analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognised or invalid HTML.
Last Release on Oct 25, 2015
10. HtmlParser (73 usages)
nu.validator.htmlparser » htmlparser (licenses: BSD, MIT)
The Validator.nu HTML Parser is an implementation of the HTML5 parsing algorithm in Java for applications. The parser is designed to work as a drop-in replacement for the XML parser in applications that already support XHTML 1.x content with an XML parser and use SAX, DOM or XOM to interface with the parser.
Last Release on Jun 7, 2012
Relocated → nu.validator » htmlparser