The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

when to use a parser

This is a follow-up to the following post:
http://discuss.joelonsoftware.com/default.asp?design.4.211580

When does one know he needs a dedicated parser (like ANTLR, Yacc etc) instead of trying to use regex to extract information from a file? 

In my case, I am trying extract information from someone else's file format which I have no control over. It may or may not change in time although it's been rather stable for a few years now. It is a text file with a tag-like structure reminiscent of XML, but it is not XML or any other standard format. It reminds me of XML because it has that <tag> </tag> type of architecture. The real data is between these tags.

I've never used a parser before. I played with GOLD parser over the weekend. I looked at a few others such as ANTLR and Lemon parser, but I found GOLD parser to be a little more user friendly. It comes with a nice grammar builder and tester.

The only thing is, I need to write my own grammar and that appears to be a daunting task at the moment given the fact that I have no experience writing grammars.

What are some of the questions I should be asking myself to figure out if I should go ahead with the parser or just try using regex and a big switch statement?
newbie Send private email
Monday, October 10, 2005
 
 
If the format is straightforward, and the programs will be run with a babysitter, as in a data conversion project where you will watch what happens, regular expressions are probably enough.

You can do amazing things with complex regular expressions with Perl style extensions.

If the files are more like a programming language, that is, quite complex with "syntax", then a parser is probably indicated.
Arthur, King of the Britons, MCSE
Monday, October 10, 2005
 
 
At this point, you'll be better off if you pick up at least a cursory understanding of formal languages. Read up on the Chomsky Hierarchy and the Pumping Lemma. That should give you enough to know whether you need a regular expression or a context-free grammar in each situation that you encounter.
comp.lang.c refugee
Monday, October 10, 2005
 
 
Although I've already let my opinion be known on the original thread you've indicated, if you are going to try to define a language (even a relatively simple one like the one you indicated), I'd suggest reading up on the Backus Naur Form, as this will definitely help you use tools such as ANTLR.

http://cui.unige.ch/db-research/Enseignement/analyseinfo/AboutBNF.html
Andrey Butov Send private email
Monday, October 10, 2005
 
 
way too complicated to get into things that require a lot of dependencies. I would recommend simple text string searching like csDoc.Find("<TAG>") (MFC) etc.
If you are using C++, CMarkup may handle it as is. CMarkup is an XML parser but also works pretty well on generic markup and HTML. You can evaluate it without signing up or anything (just a .cpp/.h pair), though it is not free for commercial use (note: this is my own product I am recommending so you decide). Here's the page about non-XML usage, but basically all the same methods are used as for normal XML, it is just flexible enough to deal with non-XML and ill-formed markup:
http://www.firstobject.com/dn_marknonxml.htm
Ben Bryant
Monday, October 10, 2005
 
 
Thanks for all the replies.

I've tried the regex approach before. It tends to get out of hand for anything but the trivial cases. It looks like I am going to have to try the parser approach and where I end up.
newbie
Monday, October 10, 2005
 
 
Let me just put in another plug because people may not realize how easy it is in CMarkup. Say you have a file something like this:

<codename>56gh</codename>
<user>John</user>
<diagnostic id=b5>
<jobid>4</jobid>

Note it is not good XML because there is no root element, and the id attribute is not quoted, and the diagnostic element is not ended. The CMarkup code looks like this:

CMarkup doc;
doc.Load( "filename.dat" );
doc.FindElem( "codename" );
string strCodeName = doc.GetData();
doc.FindElem( "diagnostic" );
string strDID = doc.GetAttrib( "id" );

Alternatively, if you don't know the order of tag names:

while ( doc.FindElem() )
{
  string strTag = doc.GetTagName();
  if ( strTag == "codename" )
    ...
}

So it doesn't matter if it is not quite XML.
http://www.firstobject.com/
Best wishes
Ben Bryant
Monday, October 10, 2005
 
 
Be careful with parser programs.  Many programmers write parser programs to sell but later realize how tiny the market is.  So, a lot of people do nothing but collect your money and send you the product; they ignore support requests and never issue an upgrade or bug fix.  Use Google to research customer experiences with any parser product before you buy it.
Daniel Howard Send private email
Monday, October 10, 2005
 
 
The application I am trying to write is for my own personal education. I am not interested in commercial components.
newbie
Monday, October 10, 2005
 
 
In that case, start with the underlying theory and then learn how tools like parser generators fit into the picture.
comp.lang.c refugee
Monday, October 10, 2005
 
 
Let me soften my statement since it was harder than I intended.  I'm not saying, "Don't buy parser tools."  I'm saying, "Do research on your potential parser tool provider to see what other customers have to say."

Some parser tool providers do a good job and provide good support.  Make sure that you do research and get one of them.
Daniel Howard Send private email
Monday, October 10, 2005
 
 
being about 75% of the way through writing a parser from scratch its *interesting* to say the least... it amazed me just how little information there was out there, given that old 8-bit machines could handle parsing its not too hard in theory.

making it work in practice isn't too hard either, making it bullet proof is harder.

the difficult bit seems to be the tokeniser, actually parsing the tokens isn't to hard. before you start though make sure you know the format the input will take, note you do *not* need to know every command etc just the format, just setup a default command and pass anything thats not recognised to it.

the thingy I've written took about 48 hours or coding over a few weeks, but that started from the problem and includes actually bulling through the language definition & design etc.

the trouble is a parser tends to either be:

a, very application specific

or

b, useless

but once you have written your own its not too hard to make one work, making it work *fast* is another matter...
Claire Rand Send private email
Tuesday, October 11, 2005
 
 
> When does one know he needs a dedicated parser (like ANTLR, Yacc etc) instead of trying to use regex to extract information from a file?

First question: Is the data context-sensitive or context-free?

If it is context-free, you can use regular expression on it. Context-free data is when the meaning of the data does not change depending on where it is. For example, a plus sign might mean add two numbers together. If it always means this, it is context-free. In some languages a plus sign also means concatenate two strings together. In those, the plus sign is context-sensitive.

Second question: Do I need all the data in the file?

If yes, you will probably need a parser. If you only need a small percentage, you can usually get by with just regular expressions. If you're not sure, see question three.

Question three: After disregarding everything in the file I'm not interested in, does my data have the same meaning throughout?

If yes, use regular expressions. If not, use a parser.


In your case, you will probably have to use a parser. Consider, for example, the sequence 'tag <tag> tag </tag>' The symbol 'tag' appears four different times in four different contexts. The first one in outside of any element. The second is part of a start tag for an element called tag. The third one is inside the element tag. The fourth one is the end tag for the element tag. But if all you are required to do is count the number of times 'tag' appears in the data, you can use regular expressions. Is not how the data appears in the file, it's how your program uses it. Question three is the most important. If after throwing away everything you're not interested in, will rearranging the data change the results of my program? If no, regular expressions are good enough. Else, parser.
Mr. Shawn H. Corey, B.Sc. Send private email
Tuesday, October 11, 2005
 
 
I hate to be contrary, but you have a strange idea of what "context free" means.  "Context free" means "is accepted by some pushdown automaton".  The operational semantics of the various portions of the language are irrelevant to the context-free-ness of the language.
Eric Lippert Send private email
Wednesday, October 12, 2005
 
 
You don't have to learn things like pumping lemma or details of chomsky's work just to decide between a 'parser' and regex.  I should also note that context free is not the same as a regular language.

A parser such as yacc or antlr is basically the same as regular expression parser...EXCEPT that it allows you to nest regular expressions within regular expressions.

In other words, if you need to parse a comma delimited file containing numbers, you should use regular expressions.  For example:
123,-345.3,2234.3
the above line can be parsed using regular expression of the following form: [int],[float],[float]
(instead of [int] and [float] would obviously be regular expressions)

If you have to parse something like this (arbitray expression):
(123+(-345/2234.3))
This can't be parse with regular expression. 

This is how you would parse this with regular expression:
[(][int][+][(][float][/][2234.4][)][)]

But what if you had the following expression:
(123+(-345/(2234.3 * 48)))

Now you have to write a different regex for it.  It turns out that you can abstract out some complexity by understanding that the reason these expressions seem so complicated is because expression are nested inside other expressions...or the expression seems to have a recursive structure.

Instead you could use a CFG (context free grammar) parser:
expression := [(] <expression> <operator> <expression [)] | [int] | [float]
operator := [+] | [-] | [/] | [*]

(this is an extremely quick and rough attempt at defining a parser...also understand that yacc, antlr,
etc. probably have different actual syntax...this is more like pseudo code...probably wrong ;) )

If you need to parse something that is of recursive nature, you simply can't use a simple regular expression for parsing (from what I understand, PERL's regular expression are actually not regular expression, they are more powerful)

A few notes:
1. BNF is the name of the syntax generally used to express  a CFG (context free grammar)...uhh...actually EXTENDED BNF (EBNF) is whats used by most tools.
2. The idea of CFG/BNF is fairly simple and elegant, the implementation gets messy
3. YACC/ANTLR are generally used not as parsers but as parser generators (these tools generate code which can be used to parse a language...by themselves don't parse anything)
4. Some people have mentioned the Chomsky heirarchy, in this heirarchy regular expressions are simpler than CFG, which is simpler than context sensitive grammars.
5. Most programming languages are NOT context free but context sensitive!  In a CFG (again, context free grammar), you can't do things like specify a type system (class inheritance, variable declaration constraints, etc.)  Unfortunately there is no common syntax for context sensitive grammars (like there is for CFG and Regex)
6. A CFG also allows you to be more flexible.  In a Regex, you can generally do two things: Check if your expression matches (so 23.K9 is not a float...and can be checked easily with regex) or split a text according to your regex (so hello12hello43hello56 can be split into just 'hello's by splitting a text on [int]).  In a YACC/ANTLR/ETC. you can build up trees which you can manipulate, write 'actions' which execute code, etc., etc.
7. If you have an XML type file, you almost certainly can't do it with Regex...a regular expression may work for something like this:
<tag> blah blah blah </tag>
But it won't work if tags can be nested:
<tag> <tag> blah <tag> blah blah </tag> </tag> </tag>
Luckily you can use one of many xml parsers out there (it doesn't even have to be actual xml, as long as it has a tag structure ismilar to xml, you can generally use an xml parsing api provided with .NET, Java, C++, etc.)
8. Please excuse my spelling!
Falcon Send private email
Wednesday, October 12, 2005
 
 
Mr. Shawn and Falcon,

Thank you both very much! The file I need to parse has a nested tag structure. It is not too terribly complicated, but looks like a parser will do better than simple regex.

Thanks again!
newbie
Thursday, October 13, 2005
 
 

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
 
Powered by FogBugz