The Joel on Software Discussion Group (CLOSED)

A place to discuss Joel on Software. Now closed.

Your hosts:
Albert D. Kallal
Li-Fan Chen
Stephen Jones

More verbose alternative to regex?

I'd like to put a disclaimer here at the beginning that what I am about to say will probably reveal my ignorance.

Here's my question: is it possible to make a more reader-friendly version of regex for parsing strings?  Instead of the nasty one-line /////////\/\/\/\/\/||||||||[][][][]//// strings, could we use something, you know... readable?

In my mind I'm trying to equate Regex parsing to the rest of the programming we do.  We COULD put our entire file parsing function on one huge line, and we COULD replace operators like IF with things like the | pipe character, and we COULD replace loop structures with a single postfix operator...but we don't.

So why do we do this with regex?  Maybe it is good enough for most, but I haven't seen anyone put out an alternative to the standard regex library (or language-integrated syntax), anywhere.

Does this exist?  Is this possible?  Am I just ignorant?
pds
Sunday, March 26, 2006
From the Wikipedia:

"A regular expression, often called a pattern, is an expression that describes a set of strings. They are usually used to give a concise description of a set, without having to list all elements. For example, the set containing the three strings Handel, Händel, and Haendel can be described by the pattern "H(ä|ae?)ndel" (or alternatively, it is said that the pattern matches each of the three strings). In most formalisms, if there is any regex that matches a particular set then there is an infinite number of such expressions. Most formalisms provide the following operations to construct regular expressions."
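As a quick check of the quoted example, here is a minimal sketch in Python (the choice of Python's `re` module is mine, not something from the quote). It tests the pattern against the three spellings plus two near misses that I made up:

```python
import re

# Pattern taken from the quoted Wikipedia passage.
handel = re.compile(r"H(ä|ae?)ndel")

# The last two names are invented near misses for illustration.
names = ["Handel", "Händel", "Haendel", "Hendel", "Haandel"]

# fullmatch requires the whole string to match, not just a prefix.
matched = [n for n in names if handel.fullmatch(n)]
print(matched)
```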


"The origins of regular expressions lies in automata theory and formal language theory, both of which are part of theoretical computer science. These fields study models of computation (automata) and ways to describe and classify formal languages. In the 1940s, Warren McCulloch and Walter Pitts described the nervous system by modelling neurons as small simple automata. The mathematician Stephen Kleene later described these models using his mathematical notation called regular sets. Ken Thompson built this notation into the editor QED, and then into the Unix editor ed, which eventually led to grep's use of regular expressions. Ever since that time, regular expressions have been widely used in Unix and Unix-like utilities such as: expr, awk, Emacs, vi, lex, and Perl."

That is to say, regular expressions are quite powerful in their domain, and they have a proven history of growing organically. How can an alternative become as powerful as regex without becoming regex? I don't think it's possible. Sometimes we use only a subset of what regexes offer, and for those subsets there are alternatives, but regexes go beyond the simple use cases.

With a combination of Ruby + Regexes, people write parsers and have fun doing it.

But I have never seen people who use Ruby trying to avoid regexes, because even though Ruby is very expressive, regex is much better in its own domain. The nice thing about this combination is that we can break regexes into smaller pieces to keep things under control. Other languages that embrace regexes fully can approach the power of Ruby by doing the same.
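To illustrate the "smaller pieces" idea, here is a rough sketch of a tiny tokenizer. It is written in Python rather than Ruby, and the token names and the sample input are assumptions of mine, not anything from a real parser:

```python
import re

# Each token kind gets its own small, readable regex (names are illustrative).
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=()]"),
    ("SKIP",   r"\s+"),
]

# Join the pieces into one alternation of named groups.
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Return (kind, lexeme) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("total = price * 1.2"))
```

No single piece is hard to read, and adding a new token kind means adding one line to the table.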

Summing up, I think people should feel comfortable with the basics of Regexes because it's good for them.
Sunday, March 26, 2006
Why don't you give a shot at designing your ideal regex language, and tell us how it works out?  I imagine you'll find the answer to your own question that way.
Alyosha`
Sunday, March 26, 2006
Regex syntax is very simple and fairly regular among the various languages and implementations.

But these simple tokens can be joined into rules that look ugly and are difficult to understand.

Changing the syntax would not reduce the complexity of the regex; it would only make it look better.

Trust me, it's better to get a good introduction to standard regex than to try non-standard (and also limited) regex engines.  I know because I made that mistake many years ago.

I suggest reading O'Reilly's "Mastering Regular Expressions, Second Edition", and you will master regex in any language and engine.
Marcello Morsello
Sunday, March 26, 2006
I believe SNOBOL and the Icon/Unicon Library have regex done in abbreviations rather than obscure symbols.
setsquare
Sunday, March 26, 2006
"SNOBOL was widely used in the 1970s and 1980s as a text manipulation language in the humanities, but in recent years, its popularity has faded as newer languages such as Awk and Perl have made string manipulation by means of regular expressions popular; it is now mostly a special interest language used mainly by enthusiasts, and new implementations are rare. However, SNOBOL's pattern matching algorithm is in many ways more powerful than regular expressions."

"Icon is a very high-level programming language featuring goal directed execution and excellent facilities for managing strings and textual patterns. It is related to SNOBOL, a string processing language. Icon is not object-oriented, but an object-oriented extension called Idol was developed in 1996 which eventually became Unicon."

Interesting. Thanks for mentioning them.

I found these comments, also:

"escargo 30 Dec 2003 - I will just note that the successor to Snobol, Icon, also has a page on this wiki. It's pattern matching is different than Snobol's, but also very powerful. It is not based on regular expressions, but does pattern matching with other facilities.

If it ain't regular expressions, it ain't no good. - LES

Les's ideas are no better than his grammar. The only advantage of regular expressions over Snobol pattern matching and Icon string scanning is that regular expressions are very terse. Perhaps because of their terseness, they quickly become unreadable as they become more complex. Pattern matching and string scanning are far more powerful, are quicker to write, and are far easier to debug. One writer said that if you have a problem and you solve it with a regular expression, you end up with two problems. If you need to do anything complex with strings, your best bet is Icon string scanning. Larry"

Sunday, March 26, 2006

"I've been looking at some pretty hairy regular expressions (in Perl)
recently that are virtually unreadable. The thought crossed my mind
that maybe there are some alternatives to classic regular expression
syntax out there. Note that I'm only talking about the syntax - it's
hard to beat the semantics of regular expressions.

Does anybody know of anything?"

No easy way out, though.
Sunday, March 26, 2006
You could use POSIX character classes instead of the Perl-style shorthands,

e.g. [[:digit:]] instead of \d (the [:digit:] class has to sit inside a bracket expression).
Jonathan Ellis
Sunday, March 26, 2006
Regex may look cryptic, but the alternative is a fully written-out language. You learn to appreciate the brevity once you know it.

It is like saying "C is hard for me, let's make a verbose version of C that I can read better," but then once you learn the logic you'd want to go back to C syntax after all.
Sunday, March 26, 2006
In Perl you can avoid leaning toothpick syndrome "\/\/\/" by using a different delimiter from the default, so instead of m/.../ use m{...}. Also in Perl, use the x modifier so you can break down the regex into separate lines and comment each piece.
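Other languages have a similar facility. As a small sketch, Python's `re.VERBOSE` flag plays roughly the role of Perl's x modifier; the date pattern and its group names below are just an example I picked:

```python
import re

# With re.VERBOSE, whitespace and # comments inside the pattern are ignored,
# so each piece can sit on its own line with an explanation.
DATE = re.compile(
    r"""
    (?P<year>\d{4})    # four-digit year
    -
    (?P<month>\d{2})   # two-digit month
    -
    (?P<day>\d{2})     # two-digit day
    """,
    re.VERBOSE,
)

m = DATE.fullmatch("2006-03-26")
print(m.group("year"), m.group("month"), m.group("day"))
```

Python patterns are ordinary strings with no delimiter, so forward slashes never need escaping there in the first place.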
Ian Boys
Sunday, March 26, 2006
I know what you mean - I have lots of little AWK scripts and some of the regex look like line noise.

I could imagine a regex builder though - for instance one in which you could view a sequence of strings & try and build a regex without having to remember what the ^ or $ symbols do.
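A code-only version of such a builder might look like the sketch below. The `PatternBuilder` class and all of its method names are hypothetical, invented here for illustration; it only wraps Python's standard `re` module:

```python
import re

class PatternBuilder:
    """Hypothetical fluent builder; every method name here is made up."""

    def __init__(self):
        self._parts = []

    def start_of_line(self):
        self._parts.append("^")
        return self

    def end_of_line(self):
        self._parts.append("$")
        return self

    def literal(self, text):
        # re.escape neutralises any regex metacharacters in the text.
        self._parts.append(re.escape(text))
        return self

    def digits(self, count):
        # Produces \d{count}, e.g. \d{5}.
        self._parts.append(rf"\d{{{count}}}")
        return self

    def build(self):
        return re.compile("".join(self._parts))

zip_code = (
    PatternBuilder()
    .start_of_line()
    .digits(5)
    .literal("-")
    .digits(4)
    .end_of_line()
    .build()
)

print(zip_code.pattern)
print(bool(zip_code.match("12345-6789")))
```

Nobody has to remember what ^ or $ do, but the call chain is several times longer than the regex it produces, which is the trade-off this thread keeps circling around.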

A fun exercise would be to try to mock up a GUI form that lets you create and/or view a regex; as somebody else says, even doing a crayon sketch on some scrap paper will help you understand the problem domain.

You could start by looking at something like:

something like this:

may be more what you want but shows how complex it can get.
Grant
Sunday, March 26, 2006
"C is hard for me, let's make a verbose version of C that I can read better"

Oh, that exists. It's called Java.
Berislav Lopac
Monday, March 27, 2006
I personally think BNF is more readable than regex.
Monday, March 27, 2006
Parsec can do what regexes do:

It seems like an argument in favour of regex syntax, though, because why would you want to write \((a|b)\) as

do{ char '('
  ; char 'a' <|> char 'b'
  ; char ')'
  }
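
For readers who don't know Haskell, the same combinator idea can be imitated in a few lines of Python. This is a hand-rolled sketch, not a real library: `char`, `either` and `sequence` are stand-ins I wrote for Parsec's `char`, `<|>` and do-sequencing.

```python
# A parser here is a function taking (text, pos) and returning the new
# position on success, or None on failure.

def char(c):
    def parse(text, pos):
        # Slicing avoids an IndexError at the end of the input.
        return pos + 1 if text[pos:pos + 1] == c else None
    return parse

def either(p, q):
    def parse(text, pos):
        result = p(text, pos)
        return result if result is not None else q(text, pos)
    return parse

def sequence(*parsers):
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

# Counterpart of the regex \((a|b)\)
paren_ab = sequence(char("("), either(char("a"), char("b")), char(")"))

print(paren_ab("(a)", 0))
print(paren_ab("(c)", 0))
```

It works, but it takes a page of scaffolding to say what the regex says in nine characters.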
Monday, March 27, 2006
Just as you imply, there is such a thing as good regex style. Generally, most examples of monstrosities are either jokes, self-consciously clever code, or "death of a thousand cuts" increases in the complexity of the code. If you come across these in production code, treat them as any other kind of unreadable code and take the opportunity to rewrite them!

I'm not sure that regular expressions are very much like code. They're small (in terms of the characteristic operators that can be applied to them: alternation, concatenation and Kleene closure), and inherently recursive, q.v. "(a|b)|(ab)|(a(a|b)b)". The only language I can think of which really emulates that is Lisp, which makes a fair amount of use of pattern matching, the big brother of regular expressions, which are really just pattern matching over strings.

You might also be interested in Larry Wall's statement of intentions for regular expressions in Perl 6:
R. C. James Harlow
Monday, March 27, 2006
There are systems (like that adopted by Felix, see ) which allow common subexpressions to be bound to names - which, curiously enough, makes complex regexen both terser AND easier to read!
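
The same effect can be approximated with plain string interpolation. The sketch below is in Python and I am not claiming it mirrors Felix's syntax; the IPv4 example and the names `octet` and `ipv4` are my own:

```python
import re

# Bind a small subexpression to a name...
octet = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"

# ...then reuse it. The doubled braces produce a literal {3} quantifier.
ipv4 = rf"{octet}(?:\.{octet}){{3}}"

IPV4 = re.compile(ipv4)

print(bool(IPV4.fullmatch("192.168.0.1")))
print(bool(IPV4.fullmatch("256.1.1.1")))
```

Written out in full, the octet expression would appear four times; naming it once makes the pattern both shorter and easier to follow.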

Hopefully this type of thing will gain more mainstream popularity.
Monday, March 27, 2006
+1 for ... regex training wheels!

This is a wonderful example of a domain where a solution that works for the master is hard for the beginner (regex), but one that is easy for the beginner is often insufferable for the master. 

And as someone else noted, if a beginner creates a solution that manages the complexities, in the process they learn enough that it's not useful anymore.
rkj
Monday, March 27, 2006

This topic is archived. No further replies will be accepted.
