Monday, March 30, 2015

3 Examples of Parsing HTML File in Java using Jsoup

HTML is the core of the web. Every page you see on the internet is HTML, whether it is generated dynamically by JavaScript, JSP, PHP, ASP, or any other web technology. Your browser parses that HTML and renders it for you. But what would you do if you needed to parse an HTML document from a Java program to find particular elements, tags, or attributes, or to check whether a particular element exists?

If you have been programming in Java for some years, I am sure you have done some XML parsing with parsers like DOM and SAX, but there is a good chance you have never parsed HTML. The need rarely comes up in a core Java application, that is, one that doesn't involve Servlets or other Java web technologies. To make matters worse, the core JDK does not include a convenient HTML parsing library, or at least I am not aware of one. That's why, when it comes to parsing an HTML file, many Java programmers have to turn to Google to find out how to get the value of an HTML tag in Java.

When I needed this, I was sure there would be an open source library that did it for me, but I didn't know it would be as wonderful and feature-rich as Jsoup. It not only reads and parses HTML documents, but also lets you extract any element from an HTML file, along with its attributes and CSS classes, using jQuery-style selectors, and lets you modify them as well. You can do almost anything with an HTML document using Jsoup. In this article, we will parse an HTML file and find the values of the title and heading tags. We will also see examples of loading and parsing HTML from a file as well as from a URL, by parsing Google's home page in Java.
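To give a feel for what reading the title and heading tags looks like, here is a minimal sketch. It assumes the Jsoup jar is on the classpath. The class name `JsoupSketch`, the helper methods, and the sample markup are hypothetical and made up for illustration. It parses an in-memory string so that it runs without network access.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSketch {

    // Hypothetical sample markup, not taken from any real page.
    static final String SAMPLE_HTML =
        "<html><head><title>Sample Page</title></head>"
      + "<body><h1 class=\"main\">Hello Jsoup</h1>"
      + "<p>Some text</p></body></html>";

    // Returns the text of the <title> tag, or "" if there is none.
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    // Returns the text of the first <h1> tag, or "" if there is none.
    static String firstHeadingOf(String html) {
        Document doc = Jsoup.parse(html);
        // select() takes a CSS-style selector, much like jQuery.
        Element h1 = doc.select("h1").first();
        return h1 == null ? "" : h1.text();
    }

    public static void main(String[] args) {
        System.out.println("Title: " + titleOf(SAMPLE_HTML));            // Title: Sample Page
        System.out.println("Heading: " + firstHeadingOf(SAMPLE_HTML));   // Heading: Hello Jsoup
    }
}
```

The same `Document` methods apply when the HTML comes from elsewhere: `Jsoup.parse(File, String)` reads a local file with a given charset name, and `Jsoup.connect(url).get()` fetches a page over HTTP.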
