The Design of Software (CLOSED)

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

Event or DOM parsing?

I've been working with XML lately in Python, Perl, PHP, and JavaScript, and have run into a big quandary.  Which type of parsing is better: event-based (SAX, expat) or DOM?

I've mostly read the documentation for its example code, trying to figure out how parsing works in each language.  So I've not seen any benchmarks on speed (either processor or developer).  Is there a big difference?  Should I focus on one?  Do they fit different situations better, or are they merely options?

Or is this tantamount to asking if vi or Emacs is a better editor?  If it is a religious question along those lines, I apologize and beseech the moderator to delete this post.
Andrew Burton
Wednesday, July 06, 2005
In most cases, DOM parsers are built on top of SAX parsers. The SAX parser reads the file and generates events for the elements, attributes, etc. of the XML file. It then executes a callback function for each event encountered (like a createElementNode("myElementName") function, or something like that).

DOM parsing is nearly always slower than SAX parsing, because there's almost always a SAX parser underneath the DOM parser. A DOM parser also requires more memory because it keeps the entire document's structure in memory at the same time, whereas a SAX parser can discard each event object once it has called the appropriate callback function.

So SAX is almost always faster and definitely always requires less memory.

DOM is usually more convenient, since it provides methods for traversing the document's tree structure. With SAX, you generally have to invent your own mechanism for populating and traversing application objects. You have to write your own callback functions.
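
To make the difference concrete, here is a minimal sketch in Java that counts <item> elements both ways, using only the parsers that ship with the JDK. The class name ParsingStyles and the sample document are made up for illustration; treat it as a sketch of the two styles, not a benchmark.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; the name and the sample XML are illustrative only.
class ParsingStyles {
    static final String XML =
        "<order><item sku=\"a1\"/><item sku=\"b2\"/><note>rush</note></order>";

    // SAX: the parser calls startElement for each opening tag.
    // The only state we keep is a counter, so memory use stays small.
    static int countItemsWithSax(String xml) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("item".equals(qName)) {
                    count[0]++;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return count[0];
    }

    // DOM: the whole document is built in memory first, then queried.
    static int countItemsWithDom(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getElementsByTagName("item").getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countItemsWithSax(XML)); // prints 2
        System.out.println(countItemsWithDom(XML)); // prints 2
    }
}
```

The SAX version needs a callback and its own bookkeeping; the DOM version is one query against a tree that already exists.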

Hope that information is helpful.
BenjiSmith
Wednesday, July 06, 2005
Yes, all the above is correct, but let me add:

If you're looking to use an XML parser somewhere to fire events in a linear manner, SAX is all you'll need.  For example, something along the lines of detecting nodes, grabbing their content and moving along, no problem.

If you're looking to work with a data structure that allows for iterative actions and comparisons, then you'll need to go with DOM.
KC
Thursday, July 07, 2005
Thanks for the answers.  They both told me what I needed to know:  DOM is slower code-wise, but has the potential to be faster development-wise.  Thanks again.
Andrew Burton
Thursday, July 07, 2005
Which one is better depends entirely on the problem you're solving.

If, for instance, you'd like to read a whole XML configuration file, you'd probably use SAX. Then again, if you'd like to read only a few attributes or sections, you'd use DOM.

SAX is more suited to things like big XML streams and processing documents à la XSL. DOM, on the other hand, lets you make queries into an XML document that has been fully loaded into memory (documents read using SAX need only a stack and your parse buffer).
Thursday, July 07, 2005
I can't think of a single problem that would be *easier* to solve by using SAX rather than DOM.

As far as I'm concerned, using SAX is an optimization that I'd only pursue if I needed exceptional speed or low memory usage during the XML-parsing phases of my application. And, in general, my applications spend less (usually far less) than a tenth of a percent of their time (and only a tiny fraction of their memory consumption) performing XML parsing tasks. So, for me, it has *never* made sense to use SAX.
BenjiSmith
Thursday, July 07, 2005
Actually, while we're discussing DOM/SAX..

I've been looking at whether to use DOM or SAX myself, but for the reason that I basically want to create custom objects for each tag type.

Now, the SAX way of doing this would make me a tree of my tag objects by calling a factory to make them based on the names, using the "here's a tag" callback.
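
A rough sketch of that SAX approach in Java might look like the following. Every class name here (TagObject, Paragraph, TagTreeBuilder) is hypothetical, and the factory is reduced to a single if-statement; it is only meant to show the shape of the idea, using the SAX parser bundled with the JDK.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical base class for the custom tag objects.
class TagObject {
    final String name;
    final List<TagObject> children = new ArrayList<>();
    TagObject(String name) { this.name = name; }
}

// Hypothetical subclass created only for <p> tags.
class Paragraph extends TagObject {
    Paragraph() { super("p"); }
}

class TagTreeBuilder extends DefaultHandler {
    // Stack of elements that have been opened but not yet closed.
    private final Deque<TagObject> open = new ArrayDeque<>();
    TagObject root;

    // The "factory": picks a subclass based on the tag name.
    static TagObject create(String tagName) {
        if ("p".equals(tagName)) {
            return new Paragraph();
        }
        return new TagObject(tagName);
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        TagObject node = create(qName);
        if (open.isEmpty()) {
            root = node;                     // first element is the root
        } else {
            open.peek().children.add(node);  // attach to the current parent
        }
        open.push(node);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.pop();
    }

    static TagObject parse(String xml) throws Exception {
        TagTreeBuilder builder = new TagTreeBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), builder);
        return builder.root;
    }

    public static void main(String[] args) throws Exception {
        TagObject doc = parse("<doc><p/><p/><hr/></doc>");
        System.out.println(doc.name + " has " + doc.children.size() + " children");
    }
}
```

The stack is the part you have to invent yourself with SAX: the parser only reports "tag opened" and "tag closed", so the handler has to remember which element is the current parent.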

Is the DOM way really to read it all into "generic" tag objects and then go crawling around that tree turning the data into my work objects? Or have I missed a "factory" object somewhere in DOM which will allow me to create descendants of the basic tag class on a tag-name by tag-name basis?
Katie Lucas
Thursday, July 07, 2005
Actually, Katie, I generally just create a whole Document object and then crawl through its tree to create my classes. It's generally not as painful as it sounds.

I'll call a method like this...

List<MyClass> myObjects = getObjectsFromDom(n);  // n is an org.w3c.dom.Node
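
For illustration, one way a method like getObjectsFromDom could be written is sketched below. The fields of MyClass, the <thing> element name, and the DomToObjects helper class are all invented for the example; only the JDK's built-in DOM parser is used.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// MyClass stands in for a business class; its fields are invented for this sketch.
class MyClass {
    final String id;
    final String label;
    MyClass(String id, String label) {
        this.id = id;
        this.label = label;
    }
}

// Hypothetical DOM-to-objects conversion class.
class DomToObjects {
    // Walks the direct children of n and turns each <thing> element into a MyClass.
    static List<MyClass> getObjectsFromDom(Node n) {
        List<MyClass> result = new ArrayList<>();
        NodeList children = n.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            // Skip whitespace text nodes, comments, and unrelated elements.
            if (child.getNodeType() == Node.ELEMENT_NODE
                    && "thing".equals(child.getNodeName())) {
                Element e = (Element) child;
                result.add(new MyClass(e.getAttribute("id"), e.getTextContent().trim()));
            }
        }
        return result;
    }

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
            "<things><thing id=\"1\">first</thing><thing id=\"2\">second</thing></things>");
        List<MyClass> myObjects = getObjectsFromDom(doc.getDocumentElement());
        for (MyClass m : myObjects) {
            System.out.println(m.id + " = " + m.label);
        }
    }
}
```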

I've rarely had an XML Schema where I needed more than a handful of business classes (and one DOM-to-objects conversion class) to parse and represent all of the business objects pulled from the XML. But if I were going to deal with anything larger, I'd learn about JAXB, the Java Architecture for XML Binding. Here's an article...

I've never used it, but it seems like it'd be handy for a system that needed to read/write lots of XML, particularly if I had to deal with sets of XML files with complex schemata.
BenjiSmith
Thursday, July 07, 2005
XPath is really the way to go - forget about coding and use a mini-language - same benefits as SQL and all...

(for Java, try dom4j)

Of course, most XPath implementations need to load the entire XML file - which is not so good for large files and web service stuff.

(Again, for Java, XOM apparently offers a streaming solution to this)

In fact, an XPath implementation on top of SAX would be nice...oh wait, that's what I'm working on :) (sorry, it's commercial, so I'll shut up now...)
Richard Rodger
Friday, July 08, 2005
I knew that XPath provided a way to query for elements and attributes along some traversal path within the XML tree structure. But I didn't know that it could actually provide a binding mechanism between the XML elements and the Java objects which those elements represent. Is that the case?
BenjiSmith
Friday, July 08, 2005
XPath doesn't do binding, you'd need a proper binding component like Castor for that. But it does reduce the need for binding by making data extraction really easy.
In fact, on one project, I initialised objects using a dom4j Node and just pulled the data out directly using XPath expressions in the constructor. Of course you can get all fancy and do this with factories if you like.
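
As a rough illustration of that constructor pattern, the sketch below uses the JDK's own javax.xml.xpath API with a W3C DOM Node rather than dom4j, so the calls differ from what the project above would have used. The Customer class, its fields, and the sample XML are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical business class that initialises itself from a DOM node.
class Customer {
    final String id;
    final String name;
    final String city;

    Customer(Node node) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Each expression is evaluated relative to the node passed in.
        // A path that matches nothing yields an empty string.
        this.id = xpath.evaluate("@id", node);
        this.name = xpath.evaluate("name", node);
        this.city = xpath.evaluate("address/city", node);
    }

    // Convenience helper for the example: parse a string and use the root element.
    static Customer fromXml(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return new Customer(doc.getDocumentElement());
    }

    public static void main(String[] args) throws Exception {
        Customer c = fromXml(
            "<customer id=\"42\"><name>Ada</name>"
            + "<address><city>London</city></address></customer>");
        System.out.println(c.id + " " + c.name + " " + c.city);
    }
}
```

There is no tree-walking code at all here; the path expressions do the navigation, which is the sense in which XPath reduces the need for a binding layer.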
Richard Rodger
Saturday, July 09, 2005
The major advantage of SAX over DOM is that with a SAX parser, you don't need the whole document in memory; you just need enough to handle the current element.

So, if you're processing BIG documents, and you only need local information (i.e. don't have to look for stuff on the ancestor or sibling axes) SAX may work where DOM will roll over due to memory constraints.
Chris Tavares
Sunday, July 10, 2005
The most common way to use XPath is with the DOM.

But yeah, as everyone above mentioned, most DOM implementations are tree representations of the entire XML file in memory, which is pretty easy to work with. Most event-based mechanisms are streams which give you callbacks or event loops to pass over the file in a linear fashion. They can also be used if you're looking for one thing in the file and want to abort early.
mb
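
One common way to abort early with SAX in Java is to throw an exception from the callback once you have what you need. The sketch below assumes the JDK's bundled SAX parser; the EarlyExit class, the Found exception, and the <config version="..."> element are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class EarlyExit {
    // Hypothetical marker exception used only to unwind out of the parser.
    static class Found extends SAXException {
        final String value;
        Found(String value) {
            super("found");
            this.value = value;
        }
    }

    // Returns the "version" attribute of the first <config> element,
    // or null if the document contains no <config> element.
    static String findVersion(String xml) throws Exception {
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) throws SAXException {
                if ("config".equals(qName)) {
                    // Stop parsing: no further callbacks run after this throw.
                    throw new Found(atts.getValue("version"));
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Found f) {
            return f.value;
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><config version=\"1.2\"/><data><row/><row/></data></root>";
        System.out.println(findVersion(xml));
    }
}
```

Using an exception for control flow is not pretty, but the SAX callback interface offers no "stop" return value, so this is a typical workaround.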
Monday, July 11, 2005

This topic is archived. No further replies will be accepted.