The Joel on Software Discussion Group (CLOSED)

A place to discuss Joel on Software. Now closed.

Regular expression (POSIX)

Hello, I have a question.

I want to write an expression to match a string with a series of rules.

The problem is that the expression correctly matches strings that are well formed, but it also matches strings that are well formed and have extra stuff after them.

[0-9]{1}(1|0){1}

What must be inserted in order to invalidate further chars?

I tried using:

[0-9]{1}(1|0){1}\w{0}
and
[0-9]{1}(1|0){1}[a-zA-Z0-9]{0}

to make it match 0 letters and numbers after the first string, but both give the same results.

Thanks.
Drunkie
Monday, February 25, 2008
 
 
"What must be inserted in order to invalidate further chars?"

If you're matching an entire string, you want to anchor it with the ^ and $ characters:

^[0-9]{1}(1|0){1}$

This matches only 2 characters.  Without the ^ and $ it merely matches any string that contains that pattern somewhere in it.
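
A minimal sketch of the difference, using Python's `re` module as a stand-in (an assumption: the thread is about POSIX regexes and PHP, and the helper name `found` is only illustrative):

```python
import re

UNANCHORED = r"[0-9](1|0)"    # pattern from the question, without anchors
ANCHORED = r"^[0-9](1|0)$"    # same pattern anchored at both ends

def found(pattern, text):
    """Return True if the pattern matches somewhere in text."""
    return re.search(pattern, text) is not None

for sample in ["41", "41abc", "x41", "4"]:
    # Print the sample, then the unanchored and anchored results.
    print(sample, found(UNANCHORED, sample), found(ANCHORED, sample))
```

Only "41" passes the anchored version, while "41abc" and "x41" also pass the unanchored one. One caveat for Python specifically: `$` also matches just before a trailing newline, so "41\n" would still be accepted.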
Almost H. Anonymous
Monday, February 25, 2008
 
 
Thanks, it works.
Drunkie
Monday, February 25, 2008
 
 
Are you sure about that? This may be specific to the Bash shell (which I'm familiar with) but I believe ^ and $ match the beginning and end of a line. In that case, adding those characters would only match when the two characters are the only ones on the line.

Question for the OP: Are you looking for instances of the two characters set off by whitespace? If so you can try

\s\([0-9]{1}(1|0){1}\)\s

The \s matches space, tab, return and newline. The \( and \) will cause the expression to only return the enclosed segment of what is matched.

(Here's hoping nothing there gets stripped from this post.)
Drew Kime
Monday, February 25, 2008
 
 
Bah, race condition.
Drew Kime
Monday, February 25, 2008
 
 
Though you're complicating things needlessly: {1} is meaningless, so you can just write ^[0-9](1|0)$.  (Or ^[0-9]([01])$, which may be infinitesimally faster on some implementations.)
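
One rough way to check that the three spellings really accept the same strings is to compare them exhaustively over a small alphabet. This is only a sketch in Python's `re` (not a POSIX engine), and the alphabet and length limit are arbitrary choices:

```python
import itertools
import re

VARIANTS = [
    r"^[0-9]{1}(1|0){1}$",   # original, with the redundant {1}
    r"^[0-9](1|0)$",         # {1} removed
    r"^[0-9]([01])$",        # alternation replaced by a character class
]

def accepted(pattern, alphabet="0129a", max_len=3):
    """Return every string over alphabet (length 0..max_len) the pattern matches."""
    compiled = re.compile(pattern)
    return {
        "".join(chars)
        for n in range(max_len + 1)
        for chars in itertools.product(alphabet, repeat=n)
        if compiled.search("".join(chars))
    }

results = [accepted(p) for p in VARIANTS]
# True when all three variants accept exactly the same set of strings.
print(all(r == results[0] for r in results))
```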
Iago
Monday, February 25, 2008
 
 
There's no such thing as \s (or \w) in POSIX regular expressions (there is [[:space:]], but who wants to use that?).  And if you're using an extended type of regex you will also have access to zero-width assertions like \b (Perl etc) or \< and \> (Emacs etc) that will match a whitespace boundary _or_ the end of a line, and will also not require an extra match group to be introduced, so there aren't many cases where \s(foo)\s is optimal.
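
To illustrate with an engine that does support these extensions, here is a small sketch in Python's `re`, where `\b` is a zero-width word boundary. The sample sentence and function names are made up for the example:

```python
import re

SPACED = re.compile(r"\s([0-9][01])\s")   # needs real whitespace on both sides
BOUNDED = re.compile(r"\b[0-9][01]\b")    # zero-width boundaries, no group needed

def spaced_hits(text):
    # findall returns the contents of group 1 for each match.
    return SPACED.findall(text)

def bounded_hits(text):
    # No group in the pattern, so findall returns the whole match.
    return BOUNDED.findall(text)

text = "41 and 90 end with 30"
print(spaced_hits(text))    # misses the pairs at the start and end of the line
print(bounded_hits(text))   # finds all three pairs
```

The whitespace version finds only "90" here, because "41" has nothing before it and "30" has nothing after it.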

And then people wonder why regular expressions have a reputation for being complicated... :D
Iago
Monday, February 25, 2008
 
 
(quote)
Though you're complicating things needlessly: {1} is meaningless, so you can just write ^[0-9](1|0)$.  (Or ^[0-9]([01])$, which may be infinitesimally faster on some implementations.)
(endquote)

{1} is redundant, but I think it makes the expression more readable.

Thanks.
Drunkie
Monday, February 25, 2008
 
 
> {1} is redundant, but I think it makes the expression more readable.

I think the overwhelming majority of people who know regular expressions would disagree with you.

Why stop at {1}? If that's more readable, surely you could make the same case for {1}{1}, {1}{1}{1}, etc.
clcr
Monday, February 25, 2008
 
 
Drunkie said: "{1} is redundant, but I think it makes the expression more readable."

Sure sure, in in the the same same way way that that extra extra words words make make sentences sentences more more readable readable.

Monday, February 25, 2008
 
 
It just occurred to me that this is essentially the same as the argument over whether to compare booleans to true, e.g.:

  if (succeeded)

vs.

  if (succeeded == true)

In both cases the more verbose form makes newbies somewhat more comfortable but most experienced people find that it adds clutter and suggests that the author didn't really understand what was going on.
clcr
Monday, February 25, 2008
 
 
@Iago

Thanks for the clarification. I had assumed bash was pretty close to POSIX standard. Guess not.
Drew Kime
Monday, February 25, 2008
 
 
Someone needs to come up with a tool that SIGNIFICANTLY reduces the pain involved with the creation of a regex.

Some tools exist, but they are garbage at best....not intuitive to use....

RegEx is the bane of current software development right along with multi-threading, application protection and executable installation protocols.
Brice Richard
Monday, February 25, 2008
 
 
Brice Richard: "Someone needs to come up with a tool that SIGNIFICANTLY reduces the pain involved with the creation of a regex."

Done. See www.regexbuddy.com - it's great for designing and testing regexes, allows you to build reusable libraries and automatically paste the regex in various dialects and IDEs. It's also pretty cheap (< $30, IIRC). Also allows GREPping in files or arbitrary text pasted into a window.

No relation, just been using it for a while.
Ken White
Tuesday, February 26, 2008
 
 
(quote)
Why stop at {1}? If that's more readable, surely you could make the same case for {1}{1}, {1}{1}{1}, etc.
(end of quote)

Well, that can't be defended with the same argument, because it introduces a lot of entropy into the expression.  Using "(1|2){1}" serves as clarification. It adds some entropy, but it is minimal and serves a purpose.

Does this have any non-negligible performance impact? I'm using PHP btw.
Drunkie
Tuesday, February 26, 2008
 
 
(quote)

if (succeeded)

vs.

  if (succeeded == true)

(end of quote)

Sometimes you actually need a form similar to the second. For example some PHP functions return stuff like:

* N number, if (...)
* boolean false, otherwise

And you actually need

if ( $x === false ) {
}

Using

if ( !$x ) {
}

Could lead to incorrect results.
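
The same pitfall can be sketched in Python, used here only for illustration since the point above is about PHP. The helper `find_offset` is hypothetical and simply imitates the "number or boolean false" convention:

```python
def find_offset(haystack, needle):
    """Hypothetical helper: return the integer offset of needle,
    or the boolean False when needle is absent."""
    index = haystack.find(needle)
    return False if index == -1 else index

def contains_loose(haystack, needle):
    # Buggy: offset 0 is falsy, so a match at the very start looks like a failure.
    return bool(find_offset(haystack, needle))

def contains_strict(haystack, needle):
    # Only the boolean False itself signals "not found".
    return find_offset(haystack, needle) is not False

print(contains_loose("abc", "a"), contains_strict("abc", "a"))
```

For a match at offset 0 the loose check reports a miss while the strict check reports a hit, which is the case the `===` comparison guards against.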

Regards.
Drunkie
Tuesday, February 26, 2008
 
 
"RegEx is the bane of current software development right along with multi-threading, "

Regular expressions have been around for 25+ years and there is a huge body of knowledge around them, including many examples. They are absolutely nothing new.

What I do is take pieces of regular expression examples from outside sources, and build my expression outward, checking it against test data as I go. I'm good enough to get close with a regular expression written from scratch but not so good that I can wing it without testing.
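
A minimal sketch of that workflow, assuming Python's `re` and a hand-written table of test data for the two-character pattern from this thread (the case table and the `failures` helper are illustrative, not a specific tool):

```python
import re

# Test data: input string -> whether it should be accepted.
CASES = {"41": True, "90": True, "4": False, "42": False, "41abc": False, "": False}

def failures(pattern, cases):
    """Return the inputs whose whole-string match result differs from the expectation."""
    compiled = re.compile(pattern)
    return [
        text
        for text, expected in cases.items()
        if (compiled.fullmatch(text) is not None) != expected
    ]

# Build the expression outward one piece at a time, re-running the cases each step.
STEPS = [r"[0-9]", r"[0-9][01]"]
for step in STEPS:
    print(step, failures(step, CASES))
```

The first step still fails on "41", "90" and "4"; the second step clears the table.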

Multi-threading being a "bane"? No, not really. Unless one doesn't understand it at all. It adds complexity but it's manageable.
Bored Bystander
Tuesday, February 26, 2008
 
 
RegEx may be old and venerable and whatnot, but they are still a horror.
ping?
Friday, February 29, 2008
 
 

This topic is archived. No further replies will be accepted.
