The Joel on Software Discussion Group (CLOSED)

A place to discuss Joel on Software. Now closed.


Parsing HTML

I just wrote an HTML Parser.  An afternoon project.  It reads in the .htm and spits out a tree with the tag/attributes/text for each node.  Big whoop.  It was too easy in .NET.

At any rate, the reason why I did that was to alleviate the incredible slowness of using the MSHTML control from .NET.  I think it's slow because of all the data marshalling.  (I believe MSHTML is an old C++/COM beast.)

My code has to run under terminal services.  So let's say you have one machine with 5 sessions and you're using the MSHTML control.  The darn thing slows to a halt (well it's still going but barely.)  Don't know why.  Is MSHTML like a single threaded distributed object where all 4 sessions block until 1 is done?  How stupid is that and how stupid am I for not understanding it all.

Parsing HTML should be a relatively canned task no?  Roll back time to this morning when I'm googling for some clues as to MSHTML slowness and a possible replacement.  To no avail did I find anything satisfactory.  Looney.

CodePlex, CodeProject, MSDN forums, all kinds of stupidity I searched.  You would think one of them would have something decent.

Anyway, my point is, why is MSHTML soOooOOooooOOooo slooOOooowww when accessed from multiple terminal services sessions?

And why is it just plain too darn easy to write an HTML parser in .NET?  I must have done it wrong.
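For illustration, a minimal sketch of the kind of afternoon parser described above (not the poster's actual code): it tokenizes tags and text with regular expressions and builds a tree of tag/attributes/text nodes. It assumes double-quoted attributes and ignores comments, scripts, entities and void tags such as <br>.

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// A node in the tree: Tag is null for text nodes, Text is null for element nodes.
class HtmlNode
{
    public string Tag;
    public string Text;
    public Dictionary<string, string> Attributes = new Dictionary<string, string>();
    public List<HtmlNode> Children = new List<HtmlNode>();
}

static class TinyHtmlParser
{
    // Matches either a tag (<...>) or a run of text between tags.
    static readonly Regex Token = new Regex(@"<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>|([^<]+)", RegexOptions.Singleline);
    // Matches double-quoted attributes only (name="value").
    static readonly Regex Attr = new Regex(@"([a-zA-Z-]+)\s*=\s*""([^""]*)""");

    public static HtmlNode Parse(string html)
    {
        var root = new HtmlNode { Tag = "#root" };
        var open = new Stack<HtmlNode>();
        open.Push(root);

        foreach (Match m in Token.Matches(html))
        {
            if (m.Groups[4].Success)                        // text between tags
            {
                string text = m.Groups[4].Value.Trim();
                if (text.Length > 0)
                    open.Peek().Children.Add(new HtmlNode { Text = text });
            }
            else if (m.Groups[1].Value == "/")              // close tag: pop until the matching element
            {
                while (open.Count > 1 && !open.Peek().Tag.Equals(m.Groups[2].Value, StringComparison.OrdinalIgnoreCase))
                    open.Pop();
                if (open.Count > 1) open.Pop();             // stray close tags are simply ignored
            }
            else                                            // open tag: attach to the current parent
            {
                var node = new HtmlNode { Tag = m.Groups[2].Value.ToLowerInvariant() };
                foreach (Match a in Attr.Matches(m.Groups[3].Value))
                    node.Attributes[a.Groups[1].Value.ToLowerInvariant()] = a.Groups[2].Value;
                open.Peek().Children.Add(node);
                if (!m.Groups[3].Value.TrimEnd().EndsWith("/"))   // <img ... /> style tags stay closed
                    open.Push(node);
            }
        }
        return root;
    }
}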
Eat WhiteSpace
Sunday, June 22, 2008
 
 
"I must have done it wrong."

Indubitably.
yessir
Sunday, June 22, 2008
 
 
Writing a basic HTML parser is easy. But somewhere along the line they decided that it would be OK to allow invalid HTML in documents. This was a good decision for speeding the adoption of the web, but it also meant millions (billions?) of incorrectly formatted web pages. For example, IE will cope with <SPAN><STRONG>Hello</SPAN></STRONG>. How well your parser works depends on how it handles these.
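To make that concrete, a small illustration (not from the original thread) of how a strict XML-style parser reacts to that exact fragment, and what a lenient parser has to do instead:

using System;
using System.Xml;

class StrictVsLenient
{
    static void Main()
    {
        // The mis-nested fragment from the post above: fine for IE, fatal for an XML parser.
        const string broken = "<SPAN><STRONG>Hello</SPAN></STRONG>";

        var doc = new XmlDocument();
        try
        {
            doc.LoadXml(broken);        // strict well-formedness rules: overlapping tags are rejected
        }
        catch (XmlException ex)
        {
            Console.WriteLine("Strict parse failed: " + ex.Message);
        }

        // A browser-style parser instead recovers, e.g. by implicitly closing STRONG
        // at </SPAN> and dropping the stray </STRONG>.
    }
}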
Craig
Sunday, June 22, 2008
 
 
The reality is, with all the new HTML development tools and validators, seriously malformed documents are likely to be a lot rarer than they used to be.
Almost H. Anonymous
Monday, June 23, 2008
 
 
"The reality is, with all the new HTML development tools and validators, seriously malformed documents are likely to be a lot rarer than they used to be. "

Possibly less than 90% even.
Craig
Monday, June 23, 2008
 
 
A pragmatic way to support such malformed documents is to first try the fast, strict method. Upon failure, apply the standard, slow version.
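A minimal sketch of that two-pass idea in .NET; the repairToXhtml delegate is a stand-in for whatever tidy step or tolerant parser you have, not a real library call:

using System;
using System.Xml;

static class TwoPassParser
{
    // Fast path: assume the input is well-formed and let the strict parser handle it.
    // Slow path: only on failure, run a caller-supplied repair step (tidy, lenient
    // parser, etc.) and parse the repaired markup.
    public static XmlDocument Parse(string html, Func<string, string> repairToXhtml)
    {
        var doc = new XmlDocument();
        try
        {
            doc.LoadXml(html);                  // strict and fast; throws on malformed markup
        }
        catch (XmlException)
        {
            doc.LoadXml(repairToXhtml(html));   // tolerant and slower; only pays the cost when needed
        }
        return doc;
    }
}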

Monday, June 23, 2008
 
 
MSHTML is COM based and apartment threaded. This means that a specific instance runs within the context of a single thread. Calls to that specific instance are, therefore, serialized. However multiple instances - certainly instances running inside separate processes hosted in separate TS sessions - are not serialized.

Why then is MSHTML so slow compared to your parser? My guess is that MSHTML does much more than your simple parser. One difference that was cited here is parsing invalid HTML. Another is handling HTML extensions such as style sheets, XML and scripts.

A significant task performed by MSHTML is the construction of a dynamically modifiable DOM. You gave no indication what the output of your parser is.
Dan Shappir
Monday, June 23, 2008
 
 
HtmlAgilityPack
Larry Lard
Monday, June 23, 2008
 
 
I've always found that using a tool to convert HTML to well-formed XHTML and then parsing this in the normal way (e.g. using XmlDocument) works quite well.
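Roughly, and assuming the tidy step has already produced the well-formed xhtml string below (real XHTML declaring the http://www.w3.org/1999/xhtml namespace would also need an XmlNamespaceManager for the XPath query):

using System;
using System.Xml;

class XhtmlQuery
{
    static void Main()
    {
        // Output of some HTML-to-XHTML tidy tool, already well-formed.
        string xhtml = "<html><body><table><tr><td>Name</td><td>Fred</td></tr></table></body></html>";

        var doc = new XmlDocument();
        doc.LoadXml(xhtml);

        // Once it's XML, "the normal way" applies: XPath, DOM traversal and so on.
        foreach (XmlNode cell in doc.SelectNodes("//td"))
            Console.WriteLine(cell.InnerText);
    }
}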
Arethuza
Monday, June 23, 2008
 
 
> A significant task performed by MSHTML is the construction of a dynamically modifiable DOM.

Constructing a DOM is a significant feature, as is laying out the rendered text; but that feature alone needn't take significant ("soOooOOooooOOooo slooOOooowww") time (e.g. my .NET parser doesn't).

Anyway, Google suggests a few hints; e.g. http://www.pcreview.co.uk/forums/thread-1359382.php tries to explain why the exact thing you're trying to do ("spits out a tree with the tag/attributes/text for each node") would be slow across the marshalled API.

It also ends with, "By the way, traversing the same DOM in C++ is virtually instantaneously".

So a solution if you want might be for you to write an unmanaged C++ DLL (perhaps, I don't know, with its own unmanaged thread), which would drive the DOM and extract the data that you want ... and then return that data to your .NET application via a single "get the overall result as a single string" function call.

The above doesn't explain the slowness when "accessed from multiple terminal services sessions", though; however http://www.velocityreviews.com/forums/t66955-mshtml-aspnet-web-application-slow.html mentions something that points to http://www.google.ca/search?hl=en&q=sta+thread+c%23, which (among other things) suggests you might want to specify the [STAThread] attribute in your code if it isn't there already.
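If it isn't obvious where that attribute goes: it decorates the application's entry point. In C# that would be, roughly:

using System;

static class Program
{
    // MSHTML is apartment-threaded COM; making the calling thread a
    // single-threaded apartment can avoid cross-apartment marshalling
    // on every call into the control.
    [STAThread]
    static void Main()
    {
        // ... create and drive the MSHTML / WebBrowser control from this thread ...
    }
}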

> And why is it just plain too darn easy to write an HTML parser in .NET?

I'm going to parse that as:

a) I'm happy with what I did
b) .NET is the bee's knees
c) Am I missing something?

There isn't enough information in your OP, though, to answer c).

> XHTML ... (e.g. using XmlDocument)

I use a forward-only XmlReader fwiw.
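For example, a forward-only pass that pulls the cell text out of already well-formed markup, without building a DOM, might look roughly like this (the xhtml string is just sample input):

using System;
using System.IO;
using System.Xml;

class ForwardOnlyScan
{
    static void Main()
    {
        string xhtml = "<table><tr><td>Name</td><td>Fred</td></tr><tr><td>Age</td><td>42</td></tr></table>";

        // XmlReader streams through the document without building a DOM,
        // which keeps memory flat and is usually faster for a single pass.
        using (var reader = XmlReader.Create(new StringReader(xhtml)))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "td")
                    Console.Write(reader.ReadElementContentAsString() + " ");  // advances past </td>
                else
                    reader.Read();
            }
        }
    }
}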
Christopher Wells
Monday, June 23, 2008
 
 
My HTML parser is top notch quality software.  It'll parse the worst HTML in the best way.

So I add <STAThread()> to my program and whacko-shmacko nothin happens.  So I start to think.  Haven't done that in a while.

I traverse the DOM smartly.  Instead of starting from the beginning at each request, I attempt to write an algorithm that gathers the key:value pairs from the HTML table in order.  Problem is, there's only so much you can do.  Sped up a tad, but not like a mondo speed-up, know what I'm sayin?
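One way to read "gathers the key:value pairs ... in order" is a single pass that indexes a two-column table into a dictionary, so later lookups never re-walk the document. A rough sketch, assuming well-formed markup and exactly two cells per row:

using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class TableIndex
{
    // One pass over the table, pairing each row's first cell (key) with its
    // second cell (value). After this, lookups are dictionary hits instead of
    // repeated DOM traversals.
    public static Dictionary<string, string> Load(string xhtml)
    {
        var pairs = new Dictionary<string, string>();
        string key = null;

        using (var reader = XmlReader.Create(new StringReader(xhtml)))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "td")
                {
                    string cell = reader.ReadElementContentAsString().Trim();
                    if (key == null) key = cell;               // first cell in the row
                    else { pairs[key] = cell; key = null; }    // second cell completes the pair
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return pairs;
    }
}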

HTML is definitely not a good database.  Not sure why customers insist on using MS Word-generated HTML documents as a database, but they do.  Wild.  Hey, everybody knows Word and Excel.
Eat WhiteSpace
Monday, June 23, 2008
 
 
So yeah, I'm searching the web for extraordinarily bad HTML pages, know any?  Found a top-10 list and some other stuff, but I'm sure the 3l1t3 h4x0rz on this board know more than stupid Google.
Eat WhiteSpace
Monday, June 23, 2008
 
 
I find it a little frustrating that it's so difficult to just buy this sort of thing off the shelf.

I needed an HTML parser/renderer last year and could only find one solution, which was very poorly documented and buggy. I'd have spent quite a bit of money for an off-the-shelf solution.
Tony Edgecombe
Monday, June 23, 2008
 
 
If what you need is performance, you should avoid the Microsoft XML parser and instead grab the Gnome libxml2 parser.  Very fast, and when properly instantiated very tolerant of crappy HTML.
Clay Dowling
Monday, June 23, 2008
 
 
> I find it a little frustrating that it's so difficult to just buy this sort of thing off the shelf.

Did you consider Firefox, for example?

I guess there are business reasons why it's hard to justify developing a COTS solution. One reason is entrenched competition (e.g. Microsoft and Firefox).

Another reason is that it's hard to know in advance (because I for one haven't identified who would buy such a thing, and for what purpose) how much development would be enough, for example:

* What platforms?
* How many language-specific bindings?
* How much support for CSS?
* How many versions of HTML?
* What embedded Javascript?
* How many of the DOM APIs?
Christopher Wells
Monday, June 23, 2008
 
 
I needed something light; embedding Firefox really didn't look like a sensible solution.

If it had been a trivial problem then I wouldn't have been looking for a third party solution.

I don't have any answers; I just know I was willing to spend a big wad of cash to solve this problem and couldn't.
Tony Edgecombe
Tuesday, June 24, 2008
 
 

This topic is archived. No further replies will be accepted.
