Asked  6 Months ago    Answers:  5   Viewed   32 times

I code a lot of parsers. Up until now, I was using HtmlUnit headless browser for parsing and browser automation.

Now I want to separate the two tasks.

Since 80% of my work involves just parsing, I want to use a lightweight HTML parser, because HtmlUnit takes a long time to first load a page, then get the source, and then parse it.

I want to know which HTML parser is the best. Ideally it would be close to the HtmlUnit parser.


EDIT:

By best, I want at least the following features:

  1. Speed
  2. Ease of locating any HtmlElement by its "id", "name", or "tag type".

It is fine if it doesn't clean up dirty HTML code. I don't need to clean any HTML source. I just need the easiest way to move across HtmlElements and harvest data from them.

 Answers

80

Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.

Its party trick is a CSS selector syntax to find elements, e.g.:

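import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);              // parse the string into a DOM
Elements links = doc.select("a");              // all <a> elements (none in this sample)
Element head = doc.select("head").first();     // first element matching the selector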

See the Selector javadoc for more info.

This is a new project, so any ideas for improvement are very welcome!

Tuesday, June 1, 2021
 
joostvandriel
answered 6 Months ago
97

…or a shorter try:

library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("https://en.wikipedia.org/wiki/Brazil_national_football_team",.opts = list(ssl.verifypeer = FALSE) )
tables <- readHTMLTable(theurl)
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))

The table picked is the longest one on the page:

tables[[which.max(n.rows)]]
Tuesday, June 1, 2021
 
Tak
answered 6 Months ago
73

In Html Agility Pack I could not find any option that makes an HTML page tidy. There is one option that inserts missing closing tags, but it works only on some HTML pages. That option is:

  HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
  doc.OptionFixNestedTags = true;

I also tried regex for this, but it too works only on some HTML pages.

The best HTML tidy pack I found is described here:

http://www.devx.com/dotnet/Article/20505/1763/page/2

The article shows how to import the DLL and how to use the tidy pack, and sample code is also available. It works very well: it can insert the missing closing tags and make your HTML page tidy.

Thanks for the help, everyone.

Thursday, July 29, 2021
 
dotoree
answered 4 Months ago
12

JProfiler works very well for us.

http://www.ej-technologies.com/products/jprofiler/overview.html

Tuesday, August 3, 2021
 
moister
answered 4 Months ago
80
.*

This is a greedy operation that will match any character, including the quotes.

Try something like:

"href="([^"]*)""
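
To make the difference concrete, here is a minimal sketch comparing the greedy pattern with the negated character class. The answer does not say which language the question used, so Java is an assumption here, and the class name HrefRegexDemo and the sample HTML string are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefRegexDemo {
    // Greedy: .* runs on to the LAST quote on the line, swallowing everything between.
    static final Pattern GREEDY = Pattern.compile("href=\"(.*)\"");
    // Negated class: [^"]* cannot cross a quote, so it stops at the first closing quote.
    static final Pattern BOUNDED = Pattern.compile("href=\"([^\"]*)\"");

    // Collects capture group 1 from every match of the pattern in the input.
    static List<String> capture(Pattern p, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample input with two links on one line.
        String html = "<a href=\"/one\" class=\"x\">1</a> <a href=\"/two\">2</a>";
        System.out.println(capture(GREEDY, html));   // one oversized capture spanning both tags
        System.out.println(capture(BOUNDED, html));  // [/one, /two]
    }
}
```

With the greedy pattern, the single capture starts at /one and runs through to /two, including the markup in between. The bounded pattern yields each attribute value separately.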
Monday, August 16, 2021
 
rold2007
answered 4 Months ago