The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

C# HTML parser

Dear all,

I am looking for a C# HTML parser for my current project. Any suggestions? Thanks.
cqcoder
Wednesday, May 10, 2006
 
 
A quick Google search turned up this:

http://www.theserverside.net/discussions/thread.tss?thread_id=36886

I can't speak for GOLD, but I've used ANTLR and really liked it.

Those are both parser generators, however. You'll have to create your own HTML parser from an actual grammar (also linked in that thread).

Good luck!
N
Wednesday, May 10, 2006
 
 
You need an HTML parser...for what?

Are you going to try to render HTML, or do you just need to extract specific bits of information from it, like link targets? Do you know that your HTML sources will be well-formed (ROFL!!) or will you have to parse poorly formed HTML as well?

The potential solution will depend heavily on what you need to do with the HTML that you parse.
BenjiSmith
Wednesday, May 10, 2006
 
 
I am trying to parse the results returned by a search engine and extract some information from them.
cqcoder
Wednesday, May 10, 2006
 
 
Hmm.  Maybe you'd be better off just running some regular expressions with something like Perl or Ruby...
N
Thursday, May 11, 2006
 
 
Sorry.  Forgot about the C# bit.  Still, have you considered using regular expressions?
N
Thursday, May 11, 2006
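
For reference, the regular-expression route suggested above might look roughly like this in C#. It is only a sketch: the class name `RegexLinkExtractor` and the pattern are illustrative and do not come from anyone in this thread, and it assumes the goal is to pull link targets out of a results page. It handles quoted `href` values only and will miss unquoted attributes or markup that is broken in other ways.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not from the thread: extracts href values with a regex.
public static class RegexLinkExtractor
{
    // Matches <a ... href="..."> or <a ... href='...'>, case-insensitively.
    // Deliberately simple: unquoted attribute values are not handled.
    private static readonly Regex HrefPattern = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)')",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            // Only one of the two "url" alternatives takes part in a match.
            links.Add(m.Groups["url"].Value);
        }
        return links;
    }
}
```

Calling `RegexLinkExtractor.ExtractLinks(pageSource)` returns the link targets in document order. A pattern like this is tied to the exact markup it was written against, so it tends to break when the page layout changes.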
 
 
I have made an HTML parser assembly available at http://ezlag.com/files/HTMLParser.rar

It tolerates poorly formed HTML tags and parses them to build an XML document representing the structure of the HTML document.

I use it to transform web pages into well-formed XML documents and then extract information from that XML into RSS feeds, for cases where the source pages do not provide Atom/RSS news feeds.

It has some basic i18n support, so it can parse most web pages in Chinese, Japanese, Korean, and Russian (tested over 2 years).
Alec Yu
Thursday, May 11, 2006
 
 
It should also be able to handle the source code of JSP pages, since it was originally created to extract plain-text messages from JSP source code into Java resource bundles and to replace them in the source with retrieval code that reads the messages from those bundles.
Alec Yu
Thursday, May 11, 2006
 
 
Yes, I am using regular expressions now. Thanks.

Thursday, May 11, 2006
 
 
>> Yes, I am using regular expressions now.

Ugh.

Immediately - I mean right now - go to Google and track down HtmlAgilityPack. It is a free library written by a Microsoftie that takes HTML - even 'real world' malformed HTML - and gives you back a nice XML DOM. Once you have that, you can do what you like "in one line of code", as the saying goes.
Larry Lard
Thursday, May 11, 2006
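
As a rough illustration of the HtmlAgilityPack suggestion above, the sketch below loads a page into the library's document model and selects every anchor that has an `href` attribute with XPath. It assumes the HtmlAgilityPack package is referenced by the project (it is a third-party library, not part of .NET itself); the class name `AgilityLinkExtractor` is made up for this example.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package, must be added to the project

// Hypothetical helper, not from the thread: extracts href values via XPath.
public static class AgilityLinkExtractor
{
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // accepts unclosed and badly nested tags

        var links = new List<string>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return links;
        }

        foreach (HtmlNode anchor in anchors)
        {
            links.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return links;
    }
}
```

Because the selection is an XPath query over a parsed tree rather than a pattern over raw text, the same call keeps working on pages whose tags are unclosed or badly nested, which is where a hand-written regular expression usually fails.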
 
 
Try BeautifulSoup with IronPython (the .NET implementation of Python):
http://www.crummy.com/software/BeautifulSoup/
Lorenzo
Thursday, May 11, 2006
 
 
We run HTML through HTML Tidy (available on sourceforge with a very unrestrictive licence), with options set to produce XHTML, then load this into an XML DOM. Works a treat.
J
Friday, May 12, 2006
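
The second half of the HTML Tidy approach above (loading the resulting XHTML into an XML DOM) could be sketched as follows. Running Tidy itself is not shown; the code assumes you already have its XHTML output as a string and that any named entities such as `&nbsp;` were emitted in numeric form, since the DTD is skipped here. The class name `XhtmlLinkExtractor` is hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper, not from the thread: reads XHTML with the standard XML DOM.
public static class XhtmlLinkExtractor
{
    private const string XhtmlNamespace = "http://www.w3.org/1999/xhtml";

    public static List<string> ExtractLinks(string xhtml)
    {
        var settings = new XmlReaderSettings
        {
            // Skip the DOCTYPE instead of downloading and processing the DTD.
            DtdProcessing = DtdProcessing.Ignore,
            XmlResolver = null
        };

        var doc = new XmlDocument();
        using (XmlReader reader = XmlReader.Create(new StringReader(xhtml), settings))
        {
            doc.Load(reader);
        }

        // XHTML elements live in a namespace, so XPath needs a prefix for it.
        var ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("x", XhtmlNamespace);

        var links = new List<string>();
        foreach (XmlNode anchor in doc.SelectNodes("//x:a[@href]", ns))
        {
            links.Add(anchor.Attributes["href"].Value);
        }
        return links;
    }
}
```

The namespace prefix is the detail that most often trips people up: an unprefixed `//a` query matches nothing in a document whose root declares the XHTML namespace.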
 
 

This topic is archived. No further replies will be accepted.
