The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Please Recommend: HTML Parsing Library

Does anyone know of a nice C++ HTML parsing library?  I am primarily interested in walking through a document and extracting information that might be contained in TABLE elements.  I don't particularly want to use the Microsoft DOM, although this is a Windows project.
Meganonymous Rex
Saturday, September 01, 2007
 
 
The problem is that you'll need something that corrects HTML as it's read (lots of HTML isn't properly constructed; web browsers fix up the markup before rendering it).

Have you considered taking what you need out of Mozilla?
Lally Singh
Saturday, September 01, 2007
 
 
Lally:
I think taking what I need out of Mozilla would probably take as long as writing a simple thing myself.
Meganonymous Rex
Saturday, September 01, 2007
 
 
Extracting text contained in TABLE elements is not that difficult. Just use a regex library such as Boost to find <TABLE>, <TR>, etc. This might give a few false positives, but I'm assuming your app doesn't need a perfect HTML parser.
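
A minimal sketch of that approach, assuming flat (non-nested) tables; std::regex is used here so the snippet stands alone, but Boost.Regex offers essentially the same interface:

#include <iostream>
#include <regex>
#include <string>
#include <vector>

// Pull the text of every <td>...</td> out of an HTML fragment.
// Case-insensitive and tolerant of attributes on the tag, but it makes
// no attempt to handle nested tables or badly unbalanced markup.
std::vector<std::string> extractCells(const std::string& html) {
    static const std::regex cell("<td[^>]*>([\\s\\S]*?)</td>", std::regex::icase);
    std::vector<std::string> cells;
    for (std::sregex_iterator it(html.begin(), html.end(), cell), end; it != end; ++it)
        cells.push_back((*it)[1].str());
    return cells;
}

int main() {
    std::string page = "<table><tr><td>alpha</td><td class=\"x\">beta</td></tr></table>";
    for (const std::string& c : extractCells(page))
        std::cout << c << "\n";   // prints "alpha" then "beta"
}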
Arun
Saturday, September 01, 2007
 
 
I don't think parsing HTML is a good use of regex. I could be wrong.

I can't recommend a specific library, but in the past I've found stuff on codeproject.com that was useful for one-off parsing projects.
Jason
Saturday, September 01, 2007
 
 
You could look at my product, CMarkup: it was written originally for XML, but it also works with HTML, including ill-formed HTML.
http://firstobject.com/dn_markup.htm
Ben Bryant
Saturday, September 01, 2007
 
 
Oh, and here's the link to how CMarkup works with HTML:
http://www.firstobject.com/dn_markhtml.htm
Ben Bryant
Saturday, September 01, 2007
 
 
You can use regex to retrieve table elements. I have done that before.

If you need to parse messy HTML, you may consider using the IE engine (mshtml.dll). The DOM is accessible through the IHTMLDocument2 interface. It is a little bit tricky to retrieve the interface - you need to host the web browser control in a hidden window or use IE automation in order to get it.
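
One untested sketch of walking the MSHTML DOM, assuming the document object is created directly with CoCreateInstance rather than through a hosted browser control (error handling mostly omitted; link against uuid.lib for the CLSIDs):

#include <windows.h>
#include <mshtml.h>
#include <wchar.h>
#include <iostream>

int main() {
    CoInitialize(NULL);
    IHTMLDocument2* doc = NULL;
    if (FAILED(CoCreateInstance(CLSID_HTMLDocument, NULL, CLSCTX_INPROC_SERVER,
                                IID_IHTMLDocument2, (void**)&doc)) || !doc)
        return 1;

    // Feed the (possibly ill-formed) HTML to the parser via IHTMLDocument2::write.
    SAFEARRAY* sa = SafeArrayCreateVector(VT_VARIANT, 0, 1);
    VARIANT* param = NULL;
    SafeArrayAccessData(sa, (void**)&param);
    param->vt = VT_BSTR;
    param->bstrVal = SysAllocString(
        L"<html><body><table><tr><td>cell text</td></tr></table></body></html>");
    SafeArrayUnaccessData(sa);
    doc->write(sa);
    doc->close();
    SafeArrayDestroy(sa);   // also frees the BSTR the array owns

    // Walk every element and print the text of each TABLE.
    IHTMLElementCollection* all = NULL;
    doc->get_all(&all);
    long count = 0;
    all->get_length(&count);
    for (long i = 0; i < count; ++i) {
        VARIANT name, index;
        VariantInit(&name); VariantInit(&index);
        name.vt = VT_I4; name.lVal = i;
        IDispatch* disp = NULL;
        if (SUCCEEDED(all->item(name, index, &disp)) && disp) {
            IHTMLElement* elem = NULL;
            if (SUCCEEDED(disp->QueryInterface(IID_IHTMLElement, (void**)&elem))) {
                BSTR tag = NULL, text = NULL;
                elem->get_tagName(&tag);
                if (tag && _wcsicmp(tag, L"TABLE") == 0 &&
                    SUCCEEDED(elem->get_innerText(&text)) && text) {
                    std::wcout << text << std::endl;
                    SysFreeString(text);
                }
                if (tag) SysFreeString(tag);
                elem->Release();
            }
            disp->Release();
        }
    }
    all->Release();
    doc->Release();
    CoUninitialize();
    return 0;
}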
Glitch
Saturday, September 01, 2007
 
 
I needed one recently and just wrote my own; it isn't that difficult.
Tony Edgecombe
Sunday, September 02, 2007
 
 
I'm not sure what's worse, Tony - the fact that you reinvented the wheel, or the fact that you seem to be proud of it. The likelihood that your homemade HTML parser is both correct and robust is vanishingly small, though I'm sure it works just fine on the tiny and non-representative sample of HTML that you've been testing it with.

Best practice is to use something that has been widely tested parsing real HTML - in this case, as it's a Windows project, the Microsoft solution that's built into Windows would be the logical choice, while for a cross-platform project something like Mozilla's Gecko, or KHTML, would do the job nicely.
Iago
Sunday, September 02, 2007
 
 
Iago

If the OP needed a universal HTML parser, you would be right. But if all he needs is data within table rows and cells, a homemade regex-based solution does the job just fine.

We parse a lot of HTML for a living, so I know exactly what I am talking about.

The DOM tree suffers from one defect. If one cell is
<td>text</td> and another is <td><span><font><b>text</b></font></span></td>, the tree gets messy. What you'd need for HTML parsing is a 'DOM with shortcuts' where you can throw away some tags (especially formatting) at will.

When I use regex, I typically pre-process HTML and throw away tags not needed for the task, most often formatting tags.
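
In rough terms, that pre-processing step can be a single regex_replace; the tag list below is illustrative, not complete:

#include <iostream>
#include <regex>
#include <string>

// Strip purely presentational tags so a later, simpler pattern only has
// to deal with <table>/<tr>/<td>.
std::string stripFormatting(const std::string& html) {
    static const std::regex formatting(
        "</?(span|font|b|i|u|em|strong|small|big)[^>]*>", std::regex::icase);
    return std::regex_replace(html, formatting, "");
}

int main() {
    std::string cell = "<td><span><font><b>text</b></font></span></td>";
    std::cout << stripFormatting(cell) << "\n";   // prints "<td>text</td>"
}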

If people reinvent the wheel, it's not always because they are stupid. Often the wheel is not the right one.
Yury @ Xtransform
Sunday, September 02, 2007
 
 
Why not use a tool like HTML Tidy (http://tidy.sourceforge.net/) to convert HTML into well-formed XHTML and then use a standard XML library to extract what you want?
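
A rough sketch of that pipeline, assuming libtidy and libxml2 are available (the header and option names below are from current releases of those libraries; older HTML Tidy builds used buffio.h instead of tidybuffio.h):

#include <iostream>
#include <tidy.h>
#include <tidybuffio.h>
#include <libxml/parser.h>
#include <libxml/xpath.h>

int main() {
    const char* html = "<table><tr><td>alpha<td>beta</table>";   // deliberately sloppy

    // 1. Let Tidy repair the markup and emit well-formed XHTML.
    TidyDoc tdoc = tidyCreate();
    tidyOptSetBool(tdoc, TidyXhtmlOut, yes);
    tidyParseString(tdoc, html);
    tidyCleanAndRepair(tdoc);
    TidyBuffer out = {0};
    tidySaveBuffer(tdoc, &out);

    // 2. Hand the cleaned-up document to an ordinary XML parser and use XPath.
    xmlDocPtr doc = xmlReadMemory((const char*)out.bp, (int)out.size, "page.xhtml", NULL, 0);
    xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
    // local-name() sidesteps the XHTML namespace Tidy puts on the document.
    xmlXPathObjectPtr cells =
        xmlXPathEvalExpression(BAD_CAST "//*[local-name()='td']", ctx);

    for (int i = 0; cells && cells->nodesetval && i < cells->nodesetval->nodeNr; ++i) {
        xmlChar* text = xmlNodeGetContent(cells->nodesetval->nodeTab[i]);
        std::cout << text << "\n";          // prints "alpha" then "beta"
        xmlFree(text);
    }

    xmlXPathFreeObject(cells);
    xmlXPathFreeContext(ctx);
    xmlFreeDoc(doc);
    tidyBufFree(&out);
    tidyRelease(tdoc);
    return 0;
}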
Arethuza
Sunday, September 02, 2007
 
 
>>Best practice is to use something that has been widely tested parsing real HTML

Or maybe best practice is to avoid the bloat of linking to IE or Mozilla for what is a reasonably trivial problem.

My software runs as a service. What if IE pops up a dialog or message box unexpectedly? What if the user has an add-in which interferes with my code? What if someone tries to exploit a security problem in IE through my program?

Rolling your own code is often the right decision.
Tony Edgecombe
Sunday, September 02, 2007
 
 
Thanks everyone: I've looked into it a bit and I'm casting my lot with Tony.  My experience with the Microsoft DOM stuff is extensive and all negative.  I was able to write what I need in about 45 minutes and it's pretty robust.  I guess that makes me an idiot, huh Iago?
Meganonymous Rex
Sunday, September 02, 2007
 
 
Hahaha, a "pretty robust" HTML parser in only 45 minutes!  Good one!
SomeBody
Thursday, September 06, 2007
 
 
Somebody: I said I was able to write "what I needed" in 45 minutes, not that I wrote a whole parser.  Maybe if you spent more time actually reading and less time attempting to reply with zingers, you would have gotten that part.
Meganonymous Rex
Sunday, September 09, 2007
 
 
Regexes aren't suitable for parsing HTML; they would return a wrong match if there were nested tables, for example. If speed isn't imperative, the suggestion to pass the content through Tidy and then use an XML parser is a good one. It's robust, and you don't need to pull in bloated libraries for the task.
Troels Knak-Nielsen
Monday, September 10, 2007
 
 

This topic is archived. No further replies will be accepted.
