The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

And who says regular expressions are that hard?

Derek
Thursday, May 18, 2006
 
 
There's a bug in that regular expression, but you can easily spot it.
Mr. Powers
Thursday, May 18, 2006
 
 
This is a great example of the theory of language parsing.

Regular expressions match regular languages. Email addresses are NOT regular languages. Therefore you cannot parse email addresses with a regular expression.

The fact that people keep trying brings me constant amusement.
Chris Tavares
Thursday, May 18, 2006
 
 
Regexp haters are always looking for extreme examples of their position.  That is clearly an extreme example; an exception rather than the rule.
Almost H. Anonymous
Thursday, May 18, 2006
 
 
> Regular expressions match regular languages.

What is it that makes email addresses so irregular?
son of parnas
Thursday, May 18, 2006
 
 
I think that that is the ugliest piece of code I have ever seen. And I used to program a lot in 8-bit 8051 assembly language. Wow.
Steve Hirsch
Thursday, May 18, 2006
 
 
It's unfortunate that I couldn't find (quickly) the source that went with this article on using perl regexes to translate to braille:
http://www.foo.be/docs/tpj/issues/vol4_3/tpj0403-0003.html

The regular expression was six pages long. It was beautiful.
dwayne
Thursday, May 18, 2006
 
 
"What is it that makes email addresses so irregular?"

Because not even the @ is guaranteed.
TheDavid
Thursday, May 18, 2006
 
 
Can any of you recommend a really top notch RegExp tutorial and reference?  Most of my knowledge comes from a JavaScript book I have and I've always been a bit embarrassed about my RegExp weakness.
cipher
Thursday, May 18, 2006
 
 
a former big-fiver
Thursday, May 18, 2006
 
 
Just remember that the IETF's email accommodates all sorts of proprietary gateways and bridges. If the world were perfect (er, or rather imperfect) and SMTP/MIME was the only game in town--the regex would be simpler.

It would be interesting to see this fully commented in a more verbose syntax--such as the one .NET regex offers (disclaimer: you can be verbose in many regex implementations, but that's what I am using nowadays)
Li-fan Chen
Thursday, May 18, 2006
 
 
Regular Expressions ARE horrible.

They are cryptic to read

They are fragile to even minor errors

They do not have any (or any worthwhile) compile-time error checking.  If you make an error or typo, it causes a problem at run time.  We all should know by now that run-time problems are much more expensive to find and debug than compile-time problems.

They are virtually impossible to test. The only real option is by throwing a lot of data at them, and seeing if the data comes out right: you can't easily unit test a bit of complicated regular expression at a time.

A big regular expression is generally non-modular; it can't easily be understood in chunks. This causes development, testing and readability problems.



> The regular expression was six pages long. It was beautiful.

Six pages of unreadable, non-modular, untestable, densely packed cryptic hieroglyphics are considered "beautiful" by many programmers (I don't know if the example I'm quoting was being a little sarcastic - but if he was, there are plenty of other programmers who would use "beautiful" to describe such a regex with no sarcasm intended)...

...Yet a function, with plenty of white space, clearly readable, typically gets a ton of flak if it's a little longer than a screen or two, or doesn't use an absolutely clear variable name for one variable, or uses a #define macro to save a bit of typing, etc.
S. Tanna
Thursday, May 18, 2006
 
 
The problem with plenty of code is plenty of logic. Every "if then else" causes a bifurcation and it makes it hard to test. A regular expression at least makes it simpler altogether. I doubt many people test all the logic variations in their code -- at best they will test the common case, which won't break the regular expression either.
Lostacular
Thursday, May 18, 2006
 
 
>>>The problem with plenty of code is plenty of logic. Every "if then else" causes a bifurcation and it makes it hard to test.

Very much agreed.

>>>A regular expression at least makes it simpler altogether.

???? The fact that you do not see if-then-elses or other control structures does not mean that they are not there. In fact regular expressions are full of them. Just watch the code generated by a regexp compiler or develop a regexp matcher yourself and you will know what I mean. In fact regexps make things worse by removing the clues about the complexity that is actually involved.

Code coverage, for example, is as big a problem with regexps as with functionally equivalent procedural code. However, do not expect much help from analysis tools or a debugger even.
Karel Thönissen
Thursday, May 18, 2006
 
 
It's a false dichotomy to compare regex just to masses of "if then else".  They are not the only two ways to handle complicated string manipulations!

But I'll run with it for now...

> Every "if then else" cause a bifurcation and it makes it hard to test. A regular expression at least makes it simpler altogether.

The complexity is still there in both cases, the complexity is merely expressed in different ways.

In fact, I would argue that regular expressions are MORE complex, because there is an interpretation at run-time of the regex, which is an extra layer of complexity. 

If you mis-nest brackets or statements for "if then else", you get a compile-time error

If you mis-nest the equivalents of brackets/statements for regex logic, your program does something, usually something fairly close to what you intended, with no warning, and a subtle bug.



> I doubt many people test all the logic variations in their code -- at best they will test the common case, which won't break the regular expression either.

Er, it may not be practically possible to test all combinations of all the many "if then else" sequences in your program, but you can (and you're a fool if you don't) test the "then" and the "else" for each individual conditional.  You can also use code coverage, to ensure each statement in your program has been tested at least once, even if you haven't tested every possible combination of statement orders.

With regex's you can not do any of this. It's impossible to check that each statement-equivalent in a regex has been covered. It's impossible to do code coverage to see if you've fully tried out your regex.  All you can do is throw lots of data at it, and hope you have enough test cases to catch any important bugs.


I stand by my earlier comment: regex's are HORRIBLE!!!

I will add: they are for lazy programmers.

I will concede, when you need a quick'n'dirty duct-tape type solution to some problem that you must solve immediately, or where you need a throwaway solution, they have their uses. 

But otherwise, they suck.


And BTW, why is it that there are umpteen very similar but slightly different (different enough not to be compatible with each other) versions of regexes even in the same environment.... think about the consequences of that (e.g. passing the slightly wrong kind of regex) given that regex has basically no worthwhile error checking.  That's just one more spoonful of nasty radioactive gunk in the whiffy alphabet soup of regular expressions.
S. Tanna
Thursday, May 18, 2006
 
 
Sure, and I do not expect help from _developers_! Given enough rope, they will hang themselves, with debugger, code coverage, code analysis, or without them.

Personally, I wouldn't want to write something complex in regular expressions -- I would prefer to break it into code + regex, or use something more suitable.
Lostacular
Thursday, May 18, 2006
 
 
> Sure, and I do not expect help from _developers_! Given enough rope, they will hang themselves, with debugger, code coverage, code analysis, or without them.

If that truly were the case: Why bother doing anything? Why bother trying to write better code? Why bother trying to use better tools? Why bother debugging or using a debugger? Because whatever you do, it will suck, because developers will "hang themselves".
S. Tanna
Thursday, May 18, 2006
 
 
Check this out:
http://www.artima.com/lejava/articles/javaone_2006_wed_ideas.html

Specifically, listen to this one:
"Alberto Savoia, founder and CTO of Agitar Software discusses why more people don't do unit testing in practice. (5:10)"

There he says that few developers do unit testing, even after some of them tried it. Then he goes on to say that "automatic unit testing" would make most developers do unit testing, probably. These "automatic" things have been promised for 50 years already, but they are insufficient at best.

Why bother?
"You don't have to be well liked to succeed"
http://money.cnn.com/2006/05/15/magazines/business2/welllikedsuccess/index.htm

I don't need to please everyone, thus I have my own take on things, which isn't perfect either, but nothing is perfect.  Just don't shut your eyes to try to make the monsters go away, because they won't go away that easily -- fight them, learn, accept.
Lostacular
Thursday, May 18, 2006
 
 
What does either of your links have to do with whether regular expressions are a good thing (which you seem to think) or a horrible thing (which I think)?

Your MP3 for example, simply says only 20% of programmers do unit tests, and then the guy tries to sell his tool to encourage more people to do better tests.

Fact is, as I've repeatedly pointed out, it's impossible to test a six page regex, except by cross-your-fingers testing. And that is one reason (of many) why I think regexes are a bad thing.

Maybe I misunderstood you, but you seemed to be arguing previously that nobody does testing anyway, and that therefore you felt my concern was misplaced.
S. Tanna
Thursday, May 18, 2006
 
 
"What does either of your links have to do with whether regulation expressions are good thing (which you seem to think) or are a horrible thing (which I think)?"

Think about all the beasties that men have been able to domesticate and use in a good way, like horses, cattle, even elephants somewhere in Asia and Africa. I tend to think that regexes can be domesticated.

"Your MP3 for example, simply says only 20% of programmers do unit tests, and then the guy tries to sell his tool to encourage more people to do better tests."

Yes, lots of advertising, but it's hard to separate where the hype ends and the real benefits begin. For instance, some tools from Sun have been marketed a lot, but they still suck bigtime. Just take the message with a grain of salt.

"Fact is, as I've repeatedly pointed out, it's impossible to test a six page regex, except by cross-your-fingers testing. And that one reason (of many reasons) is why I think  regexs are a bad think."

I think small regexes could be tested, with the sample data that you mentioned above. Maybe you could even try some random data.

"Maybe I misunderstood you, but you seemed to be arguing previously that nobody does testing anyway, and that you therefore you felt my concern was misplaced."

Nope, I just think that tests serve code and regexes alike, but they may be different tests, like you mentioned.

Lastly, see this quote:

<quote>
    In the EastMedia/VeriSign project we were seeing a bunch of attack attempts from a “security company”. [...] After they ran the automated scans we saw a few “hand coded” attacks which probably means someone at this “security company” was very intrigued by what Mongrel was doing.

    The funniest part of this is that all Mongrel does is use a correctly coded parser based on a real grammar and a parser generator (Ragel). Other web servers use hand coded HTTP parsers that turn out to be vulnerable, difficult to compare to the real HTTP 1.1 RFC grammar, and are just a pain to manage. Using Ragel makes Mongrel robust against many of these attacks without actually having to create specific logic for detecting “attacks”.
</quote>
http://www.oreillynet.com/ruby/blog/2006/05/post.html

Which should remind us that code is very generic, and you have good code and bad code, thus my assumption that "given enough rope..."
Lostacular
Thursday, May 18, 2006
 
 
Regular expressions are a ***HUGE*** part of my job, so I'll step up and defend them for a bit.

I absolutely agree that regular expressions are often very difficult to write, and usually even more difficult to read. Authoring good regular expressions requires an intimate knowledge of the underlying execution engine, including an understanding of finite state machines, zero-width lookaround assertions, and other weird arcana.

The tool support for regular expressions is very very poor. I've never seen a development environment that could perform decent syntax highlighting of a regular expression. I've never seen a debugger that could set breakpoints within a regular expression, monitoring the matching progress of the underlying state machine. And, although it's certainly possible to write unit tests for regular expressions, very few people ever do.

Having said all of that, though, there are certain types of computational problems where regular expressions are really the only suitable solution.

Let's say I need to perform a transient search operation over a large document. I'll only ever look at the document once, ever, and then I need to throw it away. So there's no time to parse and index it. Plus, I need to search the document for twenty different search terms simultaneously. For example, let's say I'm searching for the names of twenty different people:

ann
anne
amy
daniel
david
dolores
donald
gart
gary
maria
marie
mark
marsha
marvin
melissa
melvin
michael
michelle
stacy
steve

Searching for those names one-at-a-time would require me to scan through the document twenty separate times. That's a lot of work, and not very efficient.

Instead, I'd construct a regular expression like this:

\b(?:m(?:ar(?:i[ae]|k|sha|vin)|ich(?:ael|elle)|el(?:issa|vin))|d(?:a(?:niel|vid)|o(?:lores|nald))|a(?:nne?|my)|gar[ty]|st(?:acy|eve))\b

That particular expression would perform the most efficient possible search over my target document, and I'd only need to scan it once.

Some of us care about efficiency sometimes.

In many circumstances, it might be possible to run twenty separate searches over my document. Maybe I'm not overly time-constrained. But what if I had a list of a thousand different search terms? And what if the document I'm searching contains several megabytes of text? And what if I need to search ten documents like that per second? What if my search terms have wildcards? It's easy to overwhelm an algorithm that searches for each string separately. Algorithmically speaking, such a search algorithm scales at O(n), whereas my regular expression above scales at O(log n). Much much better.
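For a rough sense of the mechanics, here's a minimal sketch (Python, purely as an illustration; it skips the hand-factored trie above and just leans on the engine's alternation):

<code language="python">
import re

# A handful of search terms; in practice the list might be much longer
# or come from user input.
names = ["ann", "anne", "amy", "daniel", "david", "mark", "michelle", "steve"]

# Escape each term, join them into a single alternation, compile once.
# One pass over the document then finds every term.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, names)) + r")\b",
                     re.IGNORECASE)

document = "Anne emailed Mark and Steve about the meeting."
print(pattern.findall(document))   # ['Anne', 'Mark', 'Steve']
</code>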

Anyhow, those are the kinds of problems that I often have to work with. And using a regular expression engine is the only way I know to acceptably solve those problems.

Even without those kinds of constraints, though, there are certain things you just can't do without regular expressions. How would you validate an email address? How are you going to express something like this, in ordinary code?

[a-z](?:[\w\.\-]*[a-z0-9])?@[a-z0-9][\w\-]*\.(?:[a-z0-9][\w\-]*\.)*[a-z]{2,}

Would you just prefer not to validate email addresses? Is that an option in your system?

Because I do so much work with regular expressions, I've written tools that compile most of my expressions from other sources of data. And I always write a number of tests (including both positive and negative examples). And I always write documentation about how the expression works, so that other people can use them too.

I'll be the first to admit that regular expressions can be very difficult to learn, and even more difficult to consistently work with.

But I just don't see that there's any viable alternative. You guys that are poo-pooing regular expressions...what would you suggest instead?
BenjiSmith
Friday, May 19, 2006
 
 
"The complexity is still there in both cases, the complexity is merely expressed in different ways."

It's about the same as comparing a C++ application to the equivalent assembler.  The complexity is still there in both cases -- which one would you rather edit?

"there is an interpretation at run-time of the regex, which is an extra layer of complexity."

So interpreted languages are inherently more complicated than compiled languages?  There's no reason that regular expressions couldn't be a first-class part of any programming language and natively compiled just like regular code.

"If you mis-nest the equivalents of brackets/statementes for regex logic, your program does something, usually something fairly close to what you intended, with no warning, and a subtle bug."

That's so ignorant.  If you mismatch brackets in a regexp then you get a run-time error something along the line of "mismatched bracket"!  In fact, I even get nice line/column numbers.

The problem is that people think that a regexp is easily representable by a couple of if/then statements and that's all.  What about the backtracking to find a match?  All the looping?  The equivalent code to a single one-line regexp could be thousands of lines.  Which would you rather debug?  The truth is, without regexp the code simply wouldn't be written at all.

"I will add: they are for lazy programmers."

So are compilers!  Please get back to writing machine code with notepad.

"But otherwise, they suck."

I have a question...  how do you validate phone numbers, email addresses, and URLs in your application?  Do you do it at all?  Why not?

What one should also note is that regular expressions pre-date computers (even the syntax) and are used in language and set theory.  This was not created to be some cryptic language for computer geeks but as a useful scientific and mathematical tool.  Those who can read/write regexp can do so as easily as reading any other code. 

Anyway It's not even code, it's a representation of what you want to match.  Hell it's more like SQL.  But I guess you don't like SQL either...  can't debug that.  Can't prove coverage.  Better just write some if/then statements and a few loops...
Almost H. Anonymous
Friday, May 19, 2006
 
 
Um, what would you use instead of regular expressions? It's like saying using lex and yacc is evil because you can write something that fails at runtime.

That's kind of the point. Something has infinite combinations, how do you test that with a finite set of if tests?
son of parnas
Friday, May 19, 2006
 
 
Sorry, sop, you can't use lex or yacc either.

Have you ever seen the syntax? It's almost exactly like a regular expression.
BenjiSmith
Friday, May 19, 2006
 
 
See, Tanna? :-)
Lostacular
Friday, May 19, 2006
 
 
Firstly, correct me if I'm wrong, but won't Benji's email regex fail to match "my.e_mail.address@foo.bar.com"? And herein lies the problem. Yes, regexes are handy for matching strings, but by all the Gods does the syntax suck like a vacuum.

The entire point of programming languages, for those who seem to have forgotten their first CompSci class, is to be an intermediary between how computers speak (machine code) and how people speak (English, French, Spanish etc.). This is why, other than a few poor unfortunates, we don't program in assembler. This is why the database language "M" is so rarely used. This is why we don't use one character variable names anymore.

Yes, you can learn to read regex. I can, albeit slowly and with some difficulty. I can also read assembler. So what?

Can somebody not come up with a nice, structured, meaningful way of writing regex? If your email matching syntax is six pages long then that doesn't matter IF IT IS READABLE. It can be twenty pages, or even a hundred, if it can be understood. Indentation, whitespace, named references to sections, all of these things are quite possible, but everyone in the "I'm superior because I can read five pages of junk that make everyone else's eyes go out of focus" camp can't see why it is necessary.

Try understanding this (completely fictional) regex format and tell me if I'm wrong:

<START>
  <{"a" - "z"} CASE = "ANY">
  <{"a"-"z", "0"-"9", "-", "_", "."} CASE="ANY" REPEAT="ANY">
  <"@">
  <{"a"-"z", "0"-"9", "-", "_", "."} CASE="ANY" REPEAT="ANY">
  <".">
  <{"com", "net", "org", "ac", "gov"} CASE="ANY">
<END>

(I know that this wouldn't work because you'd need all of the top level domain tags in the last part, but this is a demonstration of an idea and I'm not going to spend the rest of the day looking up all of the country domains)
Paul Brown
Friday, May 19, 2006
 
 
"I'm not going to spend the rest of the day looking up all of the country domains"

You don't have to: http://en.wikipedia.org/wiki/Country_code_top-level_domain
Berislav Lopac
Friday, May 19, 2006
 
 
Benji:

You acknowledge that regular expression tools and readability are poor.

As for your search and other examples, you're setting up strawmen.  Of course validation of user input is a good thing - but come on, regular expressions are not the only way to do validation - and you're being silly if you're pretending they are.  Of course, fast string search is a good thing, but you are contrasting regex to multiple successive linear searches - haven't you ever heard of Boyer-Moore for example?



Almost H.:

>> "there is an interpretation at run-time of the regex, which is an extra layer of complexity."

> So interpreted languages are inherently more complicated than compiled languages? 

That isn't what I meant. I also pointed out that the syntax checking of regexes is done at runtime (late), and fixing problems at runtime (late) is more expensive.

To clarify: Regexes are like a language within a language.  There's a validation (syntax checking etc.) that is done on the overall program; this is done very early, e.g. at compile time.  There's a validation that is done on regexes, which is done very late, when the regexes are used.



> There's no reason that regular expressions couldn't be a first-class part of any programming language and natively compiled just like regular code.

Except they pretty much never are, and for a reason (point 2 below)

1. If you want to use a regex that was created by the programmer before the program was compiled, it *could* be validated early. But no popular programming language (AFAIK) does

2. If you want to be able to use a regex dynamically generated at runtime, then the validation is necessarily late... therefore more expensive in development terms... and I somehow suspect all the regex fanboys would howl at a language which didn't let them do it this way. So they never are a first class part of a programming language, they are always a language within a language.



>> "If you mis-nest the equivalents of brackets/statementes for regex logic, your program does something, usually something fairly close to what you intended, with no warning, and a subtle bug."

> That's so ignorant.  If you mismatch brackets in a regexp then you get a run-time error something along the line of "mismatched bracket"!  In fact, I even get nice line/column numbers.

I meant the functional equivalents of brackets **and statements**,  not necessarily literal bracket characters!

It is very easy to make a tiny mistake in a regex which doesn't generate a runtime error, and does generate runtime misbehaviour.  In any case runtime errors/misbehaviour are more expensive than compile time errors (which goes back to the previous point).


> Anyway It's not even code, it's a representation of what you want to match.  Hell it's more like SQL.  But I guess you don't like SQL either...  can't debug that.  Can't prove coverage.  Better just write some if/then statements and a few loops...

Another strawman, this time based on a foundation of sand.

1. A regex may be "A representation of what you want to match" - but I'd argue it's a horrible representation (that's my whole point!) for a whole variety of reasons many of which I stated earlier.

2. A regex IS code. It's a language (admittedly a limited problem-domain one) within a language. It is being used to solve programming problems which would otherwise have to be solved with other programming. It rarely (never?) is a first class element of the language that it is embedded within.  And so on.

Yes SQL has some limitations and is not perfect (like all programming languages), but it's not nearly as bad as regex's.


> Better just write some if/then statements and a few loops...

Why do regex fanboys assume the only alternatives for handling strings are regex versus "if then else" (you, Almost H.)

or doing a linear string search instead of Regex (Benji)

There are many different ways to solve string problems.  Like I said earlier, regexs (particularly the UNIX syntax type regular expressions) are universal duct tape when you want to quickly glue together something that more or less works. But that doesn't make them any less horrible.




Re: Last point on email validation problem

Paul has an interesting idea for the email problem. His neater syntax for regexes would be one way to improve on regex.  You could also take his idea and run with it in different ways.  You could have a set-based language.  You could have a set-based set of library calls. And so on.

For people who think that regexes are a good way to solve the email validation problem: If you were to modify the problem to be "You must validate an email address, and if it is invalid say why it is invalid, for example, missing @, address includes bad characters, domain includes bad characters, domain is of incorrect form, domain is not a valid TLD, etc.." it suddenly becomes very difficult to do that with a regex, which helps demonstrate why the regex solution is fragile.
S. Tanna
Friday, May 19, 2006
 
 
What started as an interesting discussion was in danger of degenerating into a slanging match when S.Tanna suddenly brought it back into sharp relief with her last comment:

<quote>
If you were to modify the problem to be "You must validate an email address, and if it is invalid say why it is invalid, for example, missing @, address includes bad characters, domain includes bad characters, domain is of incorrect form, domain is not a valid TLD, etc.." it suddenly becomes very difficult to do that with a regex"
</quote>

This is a telling comment. Regexes work wonders for certain problems, but S.Tanna's example above shows a common situation where a single regex just won't cut the mustard. You end up having to break the problem down into smaller steps to test different parts of the email address. The "gung-ho", "macho" approach of a single regex (entirely appropriate in other situations) is no longer appropriate. As with most things, you need to choose the right solution for each problem you encounter.

I love the power of regexes, but hate the syntax and the way you have to use them. My regex cheat sheet is always with me when I have to write a new regex, so a loud "YES PLEASE" to whoever comes up with a great regex "intellisense" and debugging tool integrated into their IDE!
Regular diet of roughage
Friday, May 19, 2006
 
 
i've done lots of useful things, easily, with regular expressions. when i find something else which is as useful and easy as regexes, i will use that. regexes are useful.
Anonymous poster
Friday, May 19, 2006
 
 
> Have you ever seen the syntax? It's almost exactly
> like a regular expression.

Yes, I've used them many times, which is exactly why I can't imagine validating a program with if-tests.
son of parnas
Friday, May 19, 2006
 
 
"Firstly, correct me if I'm wrong"

Ahem. You're wrong.

The regular expression I wrote validates the email address you provided. If you're going to make a counter-argument, at least take a fucking second to check your facts. But then again, I guess you can't be bothered--as a computer scientist--to use the tools that are available to you in our field. It's better to grouse that things aren't exactly as you'd like them.

Yeah, I like the imaginary regex format that you invented. But, unless you've written a compiler for it (it certainly can't be runtime compiled, because that would be WRONG!!!) then it's just a pleasant little fantasy. Those of us in the real world have got projects to work on, and we can't just sit around fantasizing about tools we'd use if they, uh, existed.

You guys still haven't provided any insight into what *actual* tools you'd use for your real-world validation problems. Or would you just continue sitting around complaining, refusing to use the industry-standard tools sitting right in front of you?

And, no, the Boyer-Moore algorithm wouldn't work for the example I provided. Boyer-Moore is a great algorithm, but it doesn't work well for large sets of competing search strings (not to mention search patterns that contain wildcards, repetitions, or choices). Since it relies on indexing the characters in the search term, it utterly fails if I'm searching for 500 different string patterns which (combined) use all 26 characters of the alphabet (or, the whole ASCII range, or whatever).

"Can somebody not come up with a nice, structured, meaningful way of writing regex?"

Ahem...they already have. Just because you can't be bothered to learn how it works doesn't mean it doesn't exist.

And furthermore, you guys seem to be implying that regular expressions are some kind of weird voodoo that's only used by off-the-wall, fringe programmers. You might not have noticed, but regular expressions are the *standard* way of performing complex string searches. A Perl-compatible regex engine is included in the standard libraries of Perl, Python, Ruby, Java, PHP, C#, Visual Basic, Cold Fusion, Delphi, VBScript, and JavaScript. It's one of the standard tools of our profession.
BenjiSmith
Friday, May 19, 2006
 
 
Right on.

Writing your own if..then..else structures to process strings ultimately means you're implementing a (non)finite automaton by hand.  That's bound to be more error prone than using a debugged regex engine.
Grant
Friday, May 19, 2006
 
 
And, dragging something in from the JoS forum, even PowerBuilder has regex handling.  Very handy at times.
a former big-fiver
Friday, May 19, 2006
 
 
The sheer arrogance and ignorance of the regexp haters is pretty telling.  You have the audacity to claim that because you don't understand something used by millions of programmers every day in production code that you know better.  You claim that because regexps are short and quick that they're bad; but short and quick is why they are good.  You've failed to back up your claims with any real code.  I have refuted, and will refute, all your points in detail. 

Let's go:

"They do not have any (or any worthwhile) compile-time error checking.  If you make an error or typo, it causes a problem at run time."

That's true.  But a typo will cause an error anytime the statement is executed.  Assuming one has unit tests (and one should have unit tests) this is entirely a non-issue.

"They are virtually impossible to test. The only real option is by throwing a lot of data at them, and seeing if the data comes out right"

If we assume we have two functions named ValidEmail(), one written with a regexp and the other written in code, how do you test them?  How is the testing any different?  This is entirely moot.
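For instance, a minimal sketch of what such a test might look like (Python's unittest; valid_email is just a stand-in name here, and the regexp inside it is a simplified assumption -- swap in a hand-coded version and the tests stay identical):

<code language="python">
import re
import unittest

# Stand-in implementation; replace the body with hand-rolled code and
# the tests below do not change at all.
EMAIL = re.compile(r"^[A-Za-z0-9._+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+$")

def valid_email(addr):
    return bool(EMAIL.match(addr))

class ValidEmailTest(unittest.TestCase):
    def test_accepts_plausible_addresses(self):
        self.assertTrue(valid_email("someone@example.com"))
        self.assertTrue(valid_email("first.last@mail.example.co.uk"))

    def test_rejects_obvious_garbage(self):
        self.assertFalse(valid_email("123 Main Street, Springfield"))
        self.assertFalse(valid_email("no-at-sign.example"))

if __name__ == "__main__":
    unittest.main()
</code>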

Furthermore, there are lots of tools for working with regular expressions.  There are visual editors that, in real time over a set of sample data, show you what you are matching (and check the syntax).  I don't use them because my regular expressions are pretty easy to understand or break down.

"you can't easily unit test a bit of complicated regular expression at a time. A big regular expression is generally non-modular, it can't easily be understood in chunks."

This is showing the ignorance again.  Complex regular expressions are generally modular and when I write a regular expression I do it iteratively.  I might start out with a regular expression for matching email like this: "\S+?@\S+" which is relatively simple and then iteratively add to it until I get the matches I want.  Some regular expressions, like matching URLs, are inherently modular: you match the scheme, the login, the host name, and then the path.

"You can also use code coverage, to ensure each statement in your program has been tested at least once, even if you haven't tested every possible combination of statement orders."

This is completely moot.  Code coverage gains you absolutely nothing with string matching.  You can have 100% code coverage and still not know if you are matching all the strings you should or not matching the strings you shouldn't.

"given that regex has basically no worthwhile error checking."

Sure it does.  It has syntax checking.  A unit test would catch a syntax error immediately.

"Can somebody not come up with a nice, structured, meaningful way of writing regex?"

Paul Brown's XML-like regexp is equivalent to writing "#define start {" and "#define end }" in C -- it does nothing to improve the language.  A VB programmer might not understand what the brackets are for in C but does that make the language bad or the programmer ignorant?  Does making the language more verbose for verboseness's sake improve it in any way?  I would again say that the regular expression syntax has been around since the 1950's -- there isn't (yet) any better way to represent it. 

"I meant the functional equivalents of brackets **and statements**,  not necessarily literal bracket characters!"

Well, bracket characters in regexp are the functional equivalent of brackets...  I really have no idea where you are going here.

"It is very easy to make a tiny mistake in a regex which doesn't generate a runtime error, and does generate runtime misbehaviour."

Holy cow.  Yes, it's possible to make a tiny mistake in a regexp that means it does/doesn't match the string (I'm not sure "runtime misbehavior" applies).  However, it's possible to make a tiny mistake in any programming language that doesn't result in a compile-time error.  This is completely ridiculous -- what are you trying to prove?  I can make a tiny mistake in a C program and it'll compile yet completely crash out -- in this way regexp is more robust than C!

"but I'd argue it's a horrible represenation (that's my whole point!) for a whole variety of reasons many of which I stated earlier."

Unfortunately, none of your reasons hold any merit.  You might have an opinion but it's a poor one at that.  Interestingly enough, I asked you how (or even if) you'd solve the problem and you've ignored it. 

"For people who think that regex's a good way to solve email validation problem"

Come on, it takes a giant document to describe a valid email address -- doing it in code or by regexp is going to be ugly.  However, that's not really the point.  I do *basic* email address validation -- it lets in a lot of false positives but that doesn't matter -- the whole point is to keep people from typing their mailing address into the email address field (and yes, it happens, a lot).  It wouldn't be worth my time to do without a regexp.

S. Tanna, you never did answer my questions: how do you validate phone numbers, email addresses, and URLs in your application?  Do you do it at all?  Why not?
Almost H. Anonymous
Friday, May 19, 2006
 
 
1.
> You might not have noticed, but regular expressions are the *standard* way of performing complex string searches.

Yes so standard, there are about a dozen different variations of regex syntax in just the UNIX shell...



2. To people who think that regex is the *only* way to solve a problem (e.g. string search example).  Think about that for a moment.

When a program hits a regex, what happens is something in the regex library decodes the alphabet soup that is regex.  Then something else in the library does the actual operation, e.g. the search.  There is no reason why the operation couldn't be done without decrypting some alphabet-soup string.

Another reason why the string search is a really bad example for regex fanboys: what if the search targets (let's say typed in by the user) include syntactically significant strings?

Let's say the user types in a list of search terms, Anne, Bob, Charlie, and you build a regex from whatever they type in.

Well what happens if the user types in (for example) something like ^A-D  because they are actually looking for the literal ^A-D.

So then you need to escape the user's input before calling a regex function.  (Then internally the regex library unescapes it and turns it back into the literal.)  That's more complexity, not less!
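For reference, the escaping step itself usually comes straight from the library (a quick sketch with Python's re.escape, just as an example):

<code language="python">
import re

# User-supplied search terms; "^A-D" is meant literally, not as regex syntax.
terms = ["Anne", "Bob", "^A-D"]

# Each term has to be escaped before it is glued into the pattern,
# otherwise its metacharacters change the meaning of the search.
pattern = re.compile("|".join(re.escape(t) for t in terms))

print(pattern.search("looking for ^A-D in this text"))   # finds the literal ^A-D
</code>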



3. It's also a mistake to think that people who don't like regex, have never written any.  I have written many, modified many, and debugged many.

- I just happen to think the cryptic syntax of UNIX-style regexs is horrible (of course that one's a matter of opinion). 

- And I know from bitter experience that regex's are fragile to modification of the problem (e.g. my email example, where you have to say why the email is invalid)

- And I know from bitter experience that regex's are fragile to hard-to-spot slight typos

- And I know from bitter experience that regex's are hard to fully test

- And I believe that many of these issues are fundamentally rooted in the fact that in common programming languages, regexes are languages within a language that are dealt with very late in the compile/link/run sequence.



4. Almost H.

How would I validate an email (for example).  Didn't I say?

I'd have a function that would validate a particular address, say IsValidEmail. 

I would look for the @ and split on it.  No @, then it's invalid; otherwise it's valid if the address part (before the @) and the domain part (after the @) are valid, which leads me to the functions isValidAddress and isValidDomain.

The isValidAddress function could check that the string contains only allowable characters, for example, and isValidDomain would check that the string contains only allowed characters, contains a . not at the beginning or end, and is a valid domain name.
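A rough sketch of that decomposition (Python, purely for illustration; the allowed-character sets here are simplified assumptions, nowhere near the real RFC rules):

<code language="python">
# Simplified character rules, just to show the shape of the approach.
ADDRESS_CHARS = set("abcdefghijklmnopqrstuvwxyz0123456789._+-")
DOMAIN_CHARS = set("abcdefghijklmnopqrstuvwxyz0123456789.-")

def is_valid_address(local):
    return bool(local) and set(local.lower()) <= ADDRESS_CHARS

def is_valid_domain(domain):
    if not domain or domain.startswith(".") or domain.endswith("."):
        return False
    return "." in domain and set(domain.lower()) <= DOMAIN_CHARS

def is_valid_email(email):
    # No @ (or more than one) means it can't be split into the two halves.
    if email.count("@") != 1:
        return False
    local, domain = email.split("@")
    return is_valid_address(local) and is_valid_domain(domain)

print(is_valid_email("someone@example.com"))   # True
print(is_valid_email("no-at-sign.example"))    # False
</code>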

It's the exact same complexity as regex's

But my version is (a) easier to read, (b) consists of reusable functions which can be individually unit tested, (c) and is flexible to change - for example:

- If I wanted to tell the user the reason why an email was invalid, I could. 

- If I wanted to later add a final validation of the domain part by adding a WHOIS lookup inside IsValidDomain, I could do that too.

I'm not going to harp on this point. But when something can be expressed by rules (e.g. BNF), it's silly to assume the only way to express those rules in a program is UNIX-style regex's.  Are people's minds really so closed?
S. Tanna
Friday, May 19, 2006
 
 
"The isValidAddress function could check the string is only allowable characters for example, and isValidDomain would check the string is allow characters, contains a . not at the beginning or end, and is a valid domain name."

Like with all pseudo-code you skip over the complicated bits.  Do you then have an isValidDomainName() function?  Anyway, you've created the *exact* same code but using 40x the lines at minimum.  I'm not saying it's a bad solution, but it's a waste of time.

"But my version is (a) easier to read"

Subjective.

"(b) consists of reusable functions which can be individually unit tested"

Fair enough, but moot.  Mine would consist of a single function, and my unit tests could easily cover the extra cases.

"If I wanted to tell the user the reason why an email was invalid, I could."

That's a good point.  But certainly overengineering.  If I wanted to give the reason why the email was invalid, I'd use a hybrid approach: your functions, but each with its own regexp.

Of course, your method is going to be WAY slower than my regexp.  And if I have to do a lot of matches (which comes up on FruitShow all the time) then I'm wasting a lot of cycles for nothing.  Now, I admit, I could code a finite state machine by hand; it would be faster than the regexp but more complicated than both it and your code.

"If I wanted to later add a final validation of the domain part by adding a WHOIS lookup inside IsValidDomain"

Sure, I could also extract out the domain and then validate it.

Not every solution requires one giant regexp.  I usually use regexp in concert with functions, code, and loops.  It certainly doesn't have to be all or none.

"it's silly to assume the only way to express those rules in a program is UNIX-style regex's.  Are people's minds really so closed?"

No one is saying that it's the only way.  Simply that for most things it's the most efficient, quickest, and easiest way to do things.
Almost H. Anonymous
Friday, May 19, 2006
 
 
If someone were to suggest that anything is a solution to all programming problems, they would be set afire faster than jet fuel.  I do not believe "the sheer arrogance and ignorance of the regexp haters" is as telling as the unwillingness to admit it has limitations.

I am guessing the OP was more an amusing "look at this crazy stuff I found."  Instead it started a flame war over silliness.  I use regex nearly daily, but I also recognize its shortcomings. 

Regex has functional use, like a wrench. Sure, you can hammer with it, but it sucks as a hammer. Sure you can edit an email address but as someone pointed out, an edit where a response is expected other than "bad email address" is going to require multiple passes at the field in question to identify the error meaningfully.

As to some of the specific issues:
 - regex is not well understood by a majority of developers.  That makes it harder for the next person to easily figure out what is going on.  This can be fixed with documentation but reading code is harder.
 - testing is comparable, but only to the point of discovering an error.
 - syntax errors can be caught easily, semantic errors will allow invalid results and are more difficult to identify once a problem is discovered.
 - a single pass regex is more difficult to debug than if/then/else logic. It has nothing to do with the person's ability as a developer it has to do with modularity of the problem domain.  Sure you can write modular ones, but more often we see ones like the OP.  A personal shrine to the developer's ego.
 - Being verbose for "verboseness sake" is never good.  But brevity for the sake of brevity is no better.
 - COBOL has been around since the 50s too, but that does not make it the language of choice today.

As for how I validate data, it depends on a few things:
 1. does a common routine exist?  Why regex at all when most languages contain a method to do this edit?
 2. the language is the language that everyone on the team (or most everyone) understands. Knowing regex is good.  Being the only person on your team who does and implementing change in it is bad.
 3. if feedback needs to be made to a user.  Batch systems can do things differently from online systems because they are never going to hear back from the user.  It does not matter why the email address is bad, just that it is. That said, see rule #1.  The edits should match.

Regex may be a fine answer depending on your need, but this is really a lot of emotion around support.  You said “it would not be worth your time without regex.”  I find it hard to believe you would quit if they took it away.
anon in this war
Friday, May 19, 2006
 
 
"as telling as the unwillingness to admit it has limitations."

It has a lot of limitations (the OP is a perfect example).  In fact, it's these sorts of examples that are trotted out to show how horrible regexp is.  But really, that regexp doesn't represent reality.  If I saw that, I'd kill the programmer who put that in.  That's an extreme case.

"Regex has functional use, like a wrench. Sure, you can hammer with it, but it sucks as a hammer."

Exactly.  However an equivalent number of people here think that there's no point in having a wrench when you can just use your hands and a hammer!

"regex is not well understood by a majority of developers."

Depends on your environment.  I suspect the vast majority of Perl developers know regexp pretty well but the vast majority of C developers do not.  I expect to see far more regexp in Perl than in C.

"syntax errors can be caught easily, semantic errors will allow invalid results and are more difficult to identify once a problem is discovered."

See there's no evidence to back that up.  If my regexp doesn't match something that it should, I can see what it's not matching and I can understand the regexp and it would be an easy change.  Nothing magical is happening.  If anything, regexp is a far simpler language than the average programming language.

"a single pass regex is more difficult to debug than if/then/else logic."

Except that to match most regexps you basically need to build a finite state machine in code.  Those are very hard to debug.

"But brevity for the sake of brevity is no better."

As pointed out, the regexp syntax has been around since the 1950's.  It's not brevity for brevity's sake.  Surely if it was, someone would have come up with something better by now. 

I am, however, annoyed by some of the more advanced features of Regexp basically being symbol-soup.  The core features (that all regexp implementations share) could not be made any simpler.

"does a common routine exist?  Why regex at all when most languages contain a method to do this edit?"

Agreed.  I had a similar discussion like this but the example was to validate an IP address.  My reply was why not use the ip_to_long() function!

We're still talking about fairly simple examples here.  We're asking, "is this an email address".  I do use regexp's for that purpose but more commonly (and more importantly) I use regexps to do "does this long string contain a URL and if so tell me where so I can replace it with a link".

"You said you “it would not be worth your time without regex.”  I find it hard to believe you would quit if they took it away."

This regexp matches all typed in links in my forum software (FruitShow, click my link).  It matches a wide range of user intent.  It also deals with common issues like wrapping a URL in quotes, a period at the end (which Joel's forum screws up), etc.

/\b(?:(?:https?:\/\/)|(?:www[.]))([A-Za-z0-9_-]+(?:[.][A-Za-z0-9_-]+)+)([^\s"]*[\w\-\@?^=%&\/~\+#])?/xims

It's long and maybe even a little bit ugly.  But the comparative code to search through a posting to replace matching strings with links would be *huge* and complicated.  It would be difficult to tweak to just the matches I want.  In the end, I doubt I would have bothered.
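For a rough sense of how that kind of replacement gets wired up, a much-simplified sketch (Python; the pattern here is deliberately far less thorough than the one above):

<code language="python">
import re

# Deliberately simple URL pattern: scheme or leading www, then non-space
# characters, ending on something that isn't trailing punctuation.
URL = re.compile(r'\b(?:https?://|www\.)[^\s<>"]+[A-Za-z0-9/]', re.IGNORECASE)

def linkify(text):
    # Wrap anything that looks like a URL in an HTML anchor tag.
    def repl(match):
        url = match.group(0)
        href = url if url.lower().startswith("http") else "http://" + url
        return '<a href="%s">%s</a>' % (href, url)
    return URL.sub(repl, text)

print(linkify("See www.example.com/page. for details"))
</code>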
Almost H. Anonymous
Friday, May 19, 2006
 
 
"I'm not going to harp on this point. But when something can be expressed by rules (e.g. BNF), it's silly to assume the only way to express those rules in a program is UNIX-style regex's.  Are people's minds really so closed?"

To paraphrase Winston Churchill: Regexes are the worst way to do string pattern matching, except for all of the ones that have been tried.

Let's follow the logical course of action if you manually write your own string validation.

You start up a new project that also needs some validation, and decide you can reuse your code.  At first you cut-n-paste, but then you realize that this is pretty general purpose stuff.  So you make your own library that becomes part of your personal tool kit.  You find a bug here and there, and fix it.  More forms of validation get added and then you refactor.

Then one day you decide that the library is really good and stable, but it's annoying to have to write new functions and recompile the library every time a new minor variation comes up.  So you decide it'd be cool to be able to store validation definitions in metadata.  You come up with an XML definition and write a factory function that dispatches to the appropriate validation code, and it works great!

Then one day you decide that the XML is annoying.  It's a little too verbose, and since your programming language doesn't have multi-line strings you need to create a separate file to store the definition and load it in at runtime.  So you create a more succinct representation, and write a processor that translates that to XML and subsequently calls your validation code.  Awesome! Now you can create amazingly complex string validation routines instantly in one line of code! Life doesn't get any better than this!  You've got to tell someone about this!

And then it hits you.  You've just reinvented regular expressions.

Have a nice day.
Grant
Friday, May 19, 2006
 
 
I think there is a different, larger point to be made here: validating email addresses is a waste of time. Just send mail to it and let the internet sort it out.

Programmers seem to enjoy doing extra work. I don't know how many times I've seen someone check to see if a file is present (or readable) before opening it. Just open it!
UNIX Guy
Friday, May 19, 2006
 
 
> Just send mail to it and let the internet sort it out.

It depends. If someone could give me a simple to call function that magically validated an email address and gave me a good error for all failures then I would use it.

Your way email just doesn't arrive and nobody has any idea why without going to a lot of work. Checking the email address removes one potential problem.
son of parnas
Friday, May 19, 2006
 
 
Funny, all this heat and nobody mentioned the Perl 6 regex syntax. Larry Wall's Apocalypse 5 addresses many of the complaints seen in this thread.

http://dev.perl.org/perl6/doc/design/apo/A05.html

---
Six pages of unreadable, non-modular, untestable, densely packed cryptic hieroglyphics are considered "beautiful" by many programmers (I don't know if the example I'm quoting was being a little sarcastic - but if he was, there are plenty of other programmers who would use "beautiful" to describe such a regex with no sarcasm intended)...
---

Well, beautiful in a Rube Goldberg kind of way. The article is worth reading if only as an exploration of the capability of regular expressions, and for his musings on the similarities between regular expressions and natural language.
dwayne
Friday, May 19, 2006
 
 
Sorry, but there is simply no better way to do string matching, substitution, or splitting than through a regex.  String matching is complex.  Deal with it.

Are they a bitch to read?  Yes.  I'll let you in on the secret, it only takes one or two extra lines:

# looking for all instances of stuff like bob:is:rad
$string =~ m/^(\S*):(\S*):(\S*)$/g;

Did you see that?  As a programmer, I *know* regex's are hard  to read, especially if you aren't sure what the surrounding logic is trying to do.  So I make what is known as a CODE COMMENT and place it right above the regex so a person reading it (like myself in six months) can start to figure out what is going on.  Problem solved.

It ain't that hard.  Sure you need your cheat sheet, but it sure beats 100 lines of if/then/else logic you'd find in some languages with poor searching (old VB code).
Cory R. King
Friday, May 19, 2006
 
 
Let me throw my two cents in with several points.

1.  Regular expressions are useful.  They fill a niche between fixed text searches and full blown parsers.  They can also handle tasks where the source is not formal/strict enough for a (e.g. recursive descent) parser but is consistent enough to extract information from.

2.  The example in the original post is *NOT* the way the regex is expressed in the source file.  It is just a dump of the final expression.  The regex is built out of parts and subparts, using meaningful variable names.  It is fairly easy to follow and actually reads like a parser.

3.  Language support varies.  Some languages have regexes as first class objects.  Constant regexes are compiled at compile time, with the attendant errors for mismatched brackets etc.  If the Mail::RFC822::Address module were written today, each subpart would be an actual regex object, and there would be no explicit string representation.
  Further, with the 'x' flag, regexes can be split over multiple lines and each part commented *within* the regexp.
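  As one concrete example of that commented, multi-line style (Python's re.VERBOSE here, with a deliberately simplified pattern rather than anything RFC 822-complete):

<code language="python">
import re

# A simplified email pattern, written in verbose mode so each piece
# can be commented; this is an illustration, not RFC 822.
EMAIL = re.compile(r"""
    ^
    [a-z0-9]                # local part starts with a letter or digit
    [\w.+-]*                # then word chars, dots, plus signs, hyphens
    @
    (?: [a-z0-9-]+ \. )+    # one or more dot-terminated domain labels
    [a-z]{2,}               # top-level domain
    $
""", re.IGNORECASE | re.VERBOSE)

print(bool(EMAIL.match("someone@example.com")))   # True
print(bool(EMAIL.match("not an email")))          # False
</code>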
Camel Jockey
Friday, May 19, 2006
 
 
Full-force modern regular expressions, as in Perl and other languages, have much more power than the classic regular expressions of the Chomsky classification.

Heh, heh, he said Chomsky, heh-heh.
dot for this one
Friday, May 19, 2006
 
 
Benji - "correct me if I'm wrong" was a polite way of saying "I have checked my fucking facts and your regex doesn't match against underscores, like in my example". I'll know not to bother being polite in future.

My point isn't that regex aren't useful - they are. I never at any point claimed not to use or understand them, because I do. My point was that the (PERL) syntax dates back, as others have pointed out, to the fifties when every byte was sacred.

Yes, you can learn to read regex, but the syntax is a mess because the major concern when it was devised was brevity rather than readability. Feel free to comment above your regex, but when I have to debug your code to find out why it isn't matching what it should be matching, telling me what you think it does won't help me - I already know that from the spec.

No, I haven't written a compiler for this format - it was made up on the spot and I have already spotted problems with it, but that doesn't mean that a better syntax for regex can't or shouldn't be created. If everyone took the attitude of just using the tools that were available to them and never questioning anything we'd be posting to this forum using punched cards.

If anyone else is up for the challenge of creating a better regex syntax then drop me a line; I'm sure we can get something done. Maybe in twenty years' time you'll be defending it to people who've found an even better way.
Paul Brown
Saturday, May 20, 2006
 
 
"My point was that the (PERL) syntax dates back, as others have pointed out, to the fifties when every byte was sacred."

Haha..  the syntax was for language and set theory -- nobody cared about bytes. 

"If anyone else is up for the challenge of creating a better regex syntax then drop me a line"

As pointed out above, a better regexp syntax from the creator of Perl for Perl6:
http://dev.perl.org/perl6/doc/design/apo/A05.html

I didn't bother to look before, but you're right about that regexp not matching the underscore.  I didn't have to test it, I looked at it for 3 seconds and it was plainly obvious.  It's unfortunate those regexp's are so damn unreadable, maybe I would have seen it in 2 seconds instead.
Almost H. Anonymous
Saturday, May 20, 2006
 
 
<quote>
Benji - "correct me if I'm wrong" was a polite way of saying "I have checked my fucking facts and your regex doesn't match against underscores, like in my example". I'll know not to bother being polite in future.
</quote>

Uh huh. I stand by my original premise. You did not check your facts. Try running this little script:

<code language="perl">
  #!/usr/bin/perl
  use strict;

  my $str = 'my.e_mail.address@foo.bar.com';

  if ($str =~ m/
    [a-z](?:[\w\.\-]*[a-z0-9])?@[a-z0-9][\w\-]*\.(?:[a-z0-9][\w\-]*\.)*[a-z]{2,}
  /x) {
    print "matched\n";
  } else {
    print "failed\n";
  }
</code>

Or, if perl isn't your thing, try this on for size:

<code language="Java">
  package net.benjismith.junk;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class RegexTest {
   
    private static String addr = "my.e_mail.address@foo.bar.com";
    private static String emailPatternStr =
 
      // Username
      "[a-z](?:[\\w\\.\\-]*[a-z0-9])?" +
 
      // At
      "@" +
 
      // Subdomain block (with dot)
      "[a-z0-9][\\w\\-]*\\." +
 
      // Optional additional subdomains (each w/ dot)
      "(?:[a-z0-9][\\w\\-]*\\.)" +
 
      // Top-level domain extension
      "*[a-z]{2,}";
   
    public static void main (String[] args) {
      Pattern emailPattern = Pattern.compile(emailPatternStr);
      Matcher emailMatcher = emailPattern.matcher(addr);
      if (emailMatcher.matches()) {
        System.out.println("Matches");
      } else {
        System.out.println("Fails");
      }
    }
  }
</code>

Or, maybe this:

<code language="Python">
  import re
 
  pattern = re.compile(
    r'[a-z](?:[\w\.\-]*[a-z0-9])?@[a-z0-9][\w\-]*\.(?:[a-z0-9][\w\-]*\.)*[a-z]{2,}'
  )
  addr = 'my.e_mail.address@foo.bar.com'
 
  match = pattern.match(addr)
  if match:
    print "Matched"
  else:
    print "Failed"
</code>

I think if you'll run any of those examples, you'll see that the regular expression DOES match. Every time.

The \w shorthand means "word character"; it's equivalent to the character class [a-zA-Z0-9_], which includes the underscore.
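
(If you don't feel like running any of the scripts above, a one-line check in Python will do:)

<code language="Python">
  import re
  print(bool(re.match(r'\w', '_')))  # True, because \w includes the underscore
</code>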

So, when I said you should "at least take a fucking second to check your facts", I was being deliberately IMPOLITE because I get tired of people talking out of their ass, attempting to appear authoritative on some subject when they clearly have no idea what they're talking about.

I'm sure you're a nice guy, Paul, and I'm generally a pretty friendly person myself. Sorry for the impoliteness. But this is one of my biggest pet peeves. I get really annoyed when I'm trying to have a discussion, and somebody jumps into the conversation, making a bold claim that's clearly not true (and that would have only taken a few seconds to verify). "Check your facts" means actually *check* them, rather than just running on your assumptions.

Anyhow, cheers. :) I have pretty strong opinions about things. I'm sure you do too. I don't mean anything personal by it.
BenjiSmith Send private email
Saturday, May 20, 2006
 
 
Damn, I'm an idiot.  Should have looked at it for 4 seconds. ;)
Almost H. Anonymous Send private email
Saturday, May 20, 2006
 
 
To go on a bit of a tangent for a second, if you want to see a language with a ridiculously terse syntax, check this out:

http://www.blogjava.net/evanwhj/archive/2006/03/11/34837.html

Personally, I don't know how anyone can possibly use that kind of syntax for anything useful. Why can't they just come up with a syntax that's actually UNDERSTANDABLE by actual human beings???

I personally propose this:

http://en.wikipedia.org/wiki/Lojban

...or maybe this...

http://en.wikipedia.org/wiki/Esperanto
BenjiSmith Send private email
Saturday, May 20, 2006
 
 
1) No two regular expression implementations are identical.* Even CL-PPCRE, the Common Lisp regular expression library, is more Perl-compatible than half the Perl installations currently in use. Perl has changed its regular expression engine so often (admittedly hardly ever breaking backwards compatibility) that the phrase "perl-compatible" makes me laugh.

2) There are too many cryptic standards, with meaningless differences only in syntax. What's up with that?

3) Why does Benji's expression not give a full match in Ruby, which is supposedly Perl-compatible? (I tested it.)

str = "my.e_mail.address@foo.bar.com"
str.match(Regexp.new("[a-z](?:[\w\.\-]*[a-z0-9])?@[a-z0-9][\w\-]*\.(?:[a-z0-9][\w\-]*\.)*[a-z]{2,}"))
=> returns "ss@foo.bar.com"

However, if you change the regex to "[a-z]*(?: ... " it works fine, because the "ss@" part indicates that the regular expression is not sufficiently greedy. I suppose that adding word boundary tags would work even better.

4) If I feed the regular expression to grep -Po, which is supposedly PERL COMPATIBLE, the match returned is "mail.address@foo.bar.com". So it matches something different than my Ruby PERL COMPATIBLE regular expression did, but I still can't get an identical match without (slightly) adjusting your regex.

5) Escape sequences. Why are the escape sequences in Emacs, Vim, UltraEdit, Visual Studio, PHP, Perl, Ruby and grep different? Do I need one slash, or two slashes? Really, why do I have to write test expressions just to test how the escape sequences work in the environment?
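
Within a single language you can at least learn the rule. In Python, for example (using a deliberately simplified pattern, not the one from this thread), a raw string and an ordinary string with doubled backslashes hand the engine exactly the same pattern text; the trouble starts when you carry the same pattern into Vim or grep and have to re-count the slashes:

<code language="Python">
  import re

  # Two spellings of the same simplified pattern: a raw string, where
  # backslashes pass straight through to the regex engine, and an ordinary
  # string, where each backslash has to be doubled.
  raw_form     = r'[a-z0-9._-]+@[a-z0-9.-]+\.[a-z]{2,}'
  escaped_form = '[a-z0-9._-]+@[a-z0-9.-]+\\.[a-z]{2,}'
  assert raw_form == escaped_form  # identical pattern text either way

  print(re.search(raw_form, 'my.e_mail.address@foo.bar.com').group(0))
  # my.e_mail.address@foo.bar.com
</code>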

I use regular expressions a lot, and there are many things you can do with regular expressions that would otherwise require a complete state machine. But the notation? It's horrible. And yes, it's obscure for brevity's sake. If you don't believe terseness is a goal in itself, read the part where Larry Wall compares regular expression escape sequences to Huffman encoding: the more commonly used escape codes should have fewer characters.

*] Statement not backed up by facts; feel free to prove me wrong.
L. Lort
Sunday, May 21, 2006
 
 
I agree, Lort. It's a pain in the ass.

It's like the fact that no two C++ compilers implement the same subset of the C++ spec. Or the fact that every single database implements a slightly different version of SQL. Most of the time, most of the stuff is the same, but there are a few gotchas. When you come across those gotchas, it's always an enormous pain.

This problem is not unique to regex engines. I'd venture to say that no two independent implementations of *anything* are ever sufficiently similar that you can just ignore their differences.

I can't say anything about the Ruby regex, cuz I've never used Ruby, but the grep one surprises me because I thought grep used the PCRE implementation (which is the same implementation underlying the Python regex engine).

Oh well.
BenjiSmith Send private email
Sunday, May 21, 2006
 
 
L. Lort,

The compatibility issue between regular expression languages is pretty moot -- it's basically the same issue one has with different SQL implementations.  You code for the one you are working with.  For me, that's often PCRE, but I also code regexps in Javascript (which is much more limited).  It'd be *nice* if they were all the same, but I accept the differences.
Almost H. Anonymous Send private email
Sunday, May 21, 2006
 
 
You can be modular with regular expressions, just like you would with any other form of code. Build them up from small pieces, and give those pieces descriptive names.
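
A rough sketch of what I mean, in Python, reusing the sub-patterns posted earlier in this thread (the names are just for illustration):

<code language="Python">
  import re

  # Small named pieces, composed into the full pattern.
  USERNAME = r'[a-z](?:[\w.\-]*[a-z0-9])?'   # starts and ends alphanumeric
  LABEL    = r'[a-z0-9][\w\-]*'              # one domain label
  TLD      = r'[a-z]{2,}'                    # top-level domain

  EMAIL = USERNAME + '@' + r'(?:' + LABEL + r'\.)+' + TLD
  email_re = re.compile(EMAIL)

  print(bool(email_re.match('my.e_mail.address@foo.bar.com')))  # True
</code>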

Attacking that giant Regexp is a bit of a straw man. It was likely generated by some kind of compiler, and likely the code that generated it / the grammar it was generated from were perfectly readable. It's just an intermediate form.
Matt Send private email
Monday, May 22, 2006
 
 
"it's basically the same issue one has with different SQL implementations"

Not really. With SQL, either the database server knows how to read the query and you get the answer you expect, or it will gracefully fail on you. It's not even possible (I think) to write an SQL query that will run but isn't "proper", in the sense that the spec makes no guarantee about its output.

With regular expressions it's different. The rules for greediness and non-greediness mean that different regex implementations can give different answers to *correctly formulated* expressions, and you have to be very careful while writing your regular expressions, and maximize their greediness, to ensure you get identical behavior across implementations.

Try writing a regular expression that matches HTML tags.

If you do something like this: <(.*?)>(.*?)<(.*)>
chances are the regex engine will stop scanning (.*) as soon as the last '>' is hit. So it might seem to work. But clearly the expression above is incorrect.
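
To make it concrete, here's what that expression does in Python on a trivial input versus a slightly longer one (any backtracking engine I know of behaves the same way here, but I'm not promising they all do):

<code language="Python">
  import re

  tag_re = re.compile(r'<(.*?)>(.*?)<(.*)>')

  print(tag_re.search('<b>bold</b>').groups())
  # ('b', 'bold', '/b')  -- looks right on a toy input

  print(tag_re.search('<b>bold</b> and <i>italic</i>').groups())
  # ('b', 'bold', '/b> and <i>italic</i')  -- the greedy (.*) runs to the last '>'
</code>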

This kind of nondeterminism is *unnecessary*. That's why I object to it.
L. Lort
Monday, May 22, 2006
 
 
"This kind of nondetermism is *unnecessary*."

I suppose all implementations could have agreed on something..  just like Mozilla and Microsoft have agreed on the standard for HTML and CSS...

Sure it's unnecessary, but it doesn't really invalidate the usefulness of the tool.  I code in one single language 90% of the time, so I'm well aware of the syntax and the expected matching. 

Admittedly, regexp is hard to code without testing.  I don't write a regexp and expect to get it right the first time (although that's true of any non-trivial amount of code I write) -- I test the regexp against the input as I write it.
Almost H. Anonymous Send private email
Monday, May 22, 2006
 
 
"With SQL, either the database server knows how to read the query, and you get the answer you expect, or it will gracefully fail on you."

This has not been my experience. I only have significant experience with MySQL and MS SQL (and a tiny bit of experience with Postgres and Oracle), and I've encountered quite a few scenarios where the same (valid) query will return different results depending on the DB engine.

Off the top of my head...

Incrementing and decrementing dates often works differently from db to db. Some databases will throw errors on invalid dates (February 31), while others will either attempt to automatically correct the error (rolling it forward to March 3) or will always return an empty set from such a query.

Some database servers have a case-insensitive LIKE function. In other servers, LIKE is case-sensitive.

Some database servers have a BOOLEAN type. Others don't. For the ones that do have it, there's a lot of confusion about what to do with NULLs. If you allow nulls in a BOOLEAN field, it's not really boolean anymore, is it? Some db's allow it. Some don't.

Oracle, MySQL, and MS SQL all exhibit different behavior when a UNIQUE constraint encounters fields with NULL values.

There are probably quite a few other similar gotchas with various SQL queries seeming to work correctly but silently producing incorrect results if run against a different DB engine with a slightly different SQL dialect.
BenjiSmith Send private email
Monday, May 22, 2006
 
 
Almost H. Anonymous Send private email
Monday, May 22, 2006
 
 
"Searching for those names one-at-a-time would require me to scan through the document twenty separate times. That's a lot of work, and not very efficient.

Instead, I'd construct a regular expression like this:"

How do you know that the underlying code evaluating your regular expression isn't still scanning the data once per token, per list item?  (Just curious.)
~Eric
Tuesday, May 23, 2006
 
 
Good question, Eric. This has actually been a topic of quite a bit of research at my company. At first, when I started investigating this technique for constructing regular expressions, I assumed that it would provide no benefit because I thought the regex compiler was smart enough to detect redundancies in the state graph and merge those nodes while constructing the search plan.

As it turns out, I was wrong. Through extensive testing, I demonstrated that the regex engine makes no effort at all to optimize its search pattern. It searches for *exactly* what you tell it to search for, without attempting to optimize away any redundancies.

So I wrote a regex parser of my own, and a library for performing arbitrary redundancy reductions in lists of regular expressions. I can successfully create a single regular expression that contains tens of thousands of search terms. And it consistently provides O(log n) performance, when compared to the one-at-a-time search method.

The regular expressions that are generated by this process are very messy-looking (much like the regex posted at the top of this thread), but it's important to emphasize that it was generated by a compiler, not by an error-prone human being. And I don't expect anyone to *ever* edit those expressions. If we find a bug (and it's just as common to find a regex bug as to find a code bug), then we edit the source expressions and then recompile. These massive expressions are not ever intended to be human-editable on their own.
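
To give a flavor of the idea (nothing like our actual implementation, which works on lists of full regular expressions rather than literals and does a lot more besides): the toy sketch below just merges a list of literal search terms into a single alternation that shares common prefixes.

<code language="Python">
  import re

  def build_trie(terms):
    # Character trie over the literal search terms; '' marks end-of-term.
    root = {}
    for term in terms:
      node = root
      for ch in term:
        node = node.setdefault(ch, {})
      node[''] = {}
    return root

  def trie_to_regex(node):
    # Collapse the trie back into a regex, sharing common prefixes.
    end = '' in node
    branches = [re.escape(ch) + trie_to_regex(child)
                for ch, child in sorted(node.items()) if ch != '']
    if not branches:
      return ''
    body = '|'.join(branches)
    if len(branches) > 1 or end:
      body = '(?:' + body + ')'
    if end:
      body += '?'
    return body

  terms = ['washington', 'wilson', 'williams', 'willis']
  combined = re.compile(r'\b' + trie_to_regex(build_trie(terms)) + r'\b')

  print(combined.pattern)
  print([m.group(0) for m in combined.finditer('willis and wilson met williams')])
</code>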
BenjiSmith Send private email
Tuesday, May 23, 2006
 
 
* I should mention that my company is in the process of patenting this implementation.
BenjiSmith Send private email
Tuesday, May 23, 2006
 
 
Benji,

"The tool support for regular expressions is very very poor. I've never seen a development environment that could perform decent syntax highlighting of a regular expression. I've never seen a debugger that could set breakpoints within a regular expression, monitoring the matching progress of the underlying state machine."

Have you looked at RegexBuddy? It has a pretty good debugger, and is great for testing regexes.

(Not affiliated in any way; just a happy user.)

Ken
Ken
Wednesday, May 24, 2006
 
 
I've heard of it, but I've never used it.

It would be great to have a tool that would let me see the internal representation of the state-graph generated by a particular regex, and to set breakpoints between the nodes of that graph. Then I could step through the state-machine's matching steps, watching it perform lookahead and backtracking operations.

That'd be very very cool.

Anyhow, I don't know if RegexBuddy is that sophisticated, but I'll definitely take a look at it.
BenjiSmith Send private email
Wednesday, May 24, 2006
 
 
"So I wrote a regex parser of my own, and a library for performing arbitrary redundancy reductions in lists of regular expressions. I can successfully create a single regular expression that contains tens of thousands of search terms. And it consistently provides O(log n) performance, when compared to the one-at-a-time search method."

Benji - Not to take the wind out of your sails, but why wouldn't you translate from regex directly to DFA?  The regex->shorter-regex->parser path seems circuitous, when there are well-known implementations of optimal regular language parsing.  Take a look at lex or flex, which is something like 30 years old now.

As far as "O(log n) performance", is that expression metaphorical somehow?  Results in language theory show that parsing by matching text of length m against a regular expression of length n may require time proportional to  max(2^n, m).  See
http://en.wikipedia.org/wiki/Regular_expression#Implementations_and_running_times
Even common sense says you've got to read all the characters in the regular expression and at least that number of characters in the input, for O(n).

flex is your friend if you want to do fast regex matching.

Regards,

Will
Will Dowling Send private email
Thursday, May 25, 2006
 
 
N isn't the length of the input string, but the number of expressions that have to be considered.

With a linear search, you would check whether the string matches any of the expressions, one at a time. That takes O(N) time, where N is the number of expressions. If you also account for the length M of the input string, it takes O(N*M) time.

However, if you use a minimized DFA, O(M * log N) is all you need. A trie might be an even better solution.
Homer Simpson
Thursday, May 25, 2006
 
 
"The regex->shorter-regex->parser path seems circuitous, when there are well-known implementations of optimal regular language parsing."

Good point. Using a parser generator is an interesting idea. With the current architecture of our software, it would probably be a deployment problem, but that's definitely something to think about. On the other hand, the underlying implementation of a regular expression engine is *already* a DFA, so why build one myself? Using a regular expression gives me a well-defined syntax for defining the state graph of the DFA. Deploying a new regex is much simpler than deploying a new parser, written in C, for every new set of search criteria.

"Even common sense says you've got to read all the characters in the regular expression and at least that number of characters in the input, for O(n)."

My back-of-the-matchbook O(log n) calculation actually doesn't include parsing the regular expression and creating the internal state graph used by the matcher (since that compiled regex can be used again and again, to search different input documents). But, because of the way I construct my regular expressions, they always form tree structures. Matching against a tree versus matching against a list provides a relative efficiency gain equivalent to O(log n). That's what I was referring to.
BenjiSmith Send private email
Thursday, May 25, 2006
 
 

This topic is archived. No further replies will be accepted.
