The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Regular expressions

Is it possible to specify a regex pattern that would match on different permutations on the order of a sequence without needing to specify every single permutation?

For example, let's say I want to match 4 letter words consisting of the letters a,b,c & d. Each letter could appear only once but in any order.

I could use a regular expression like this:

(a|b|c|d){4}

Which would match on inputs like:

abcd
dcba
abdc

But would also match on inputs like abab or dddd -- something I don't want.

Am I missing something (which is possible 'cause I'm not a frequent writer of regular expressions) or is what I want something beyond what can be defined in a regular expression?
Jason A Send private email
Thursday, May 31, 2007
 
 
One way that may be prohibitively complex and expensive would be to write a code generator to spit out all n! permutations for you, and use that code.
blake8086 Send private email
Thursday, May 31, 2007
 
 
You could look at Jeffery Friedl's book but I don't think this problem matches the regular expression model very well.

How long will the strings be? And how big is the character set?

There are 24 permutations of {1, 2, 3, 4} but 358,800 permutations of {A, B, C, ..., Z}, for example. For the first, one could imagine an exhaustive regular expression with all 24 matches, for the second it's out of the question.
frustrated
Thursday, May 31, 2007
 
 
"appear only once but in any order" is not something regular expressions do very well, implying as it does some memory about what has already appeared.

I'm not an RE expert, though.
AllanL5
Thursday, May 31, 2007
 
 
Agreed -- most likely the only regular expression that will do the job is one that enumerates all the possible permutations.

The easiest way to handle this is to simply keep track of which characters have been used already. In Perl:

sub isvalid {
  my $s = shift;
  my %chars = ();
  return 0 unless $s =~ /^[a-d]*$/;

  for (my $i = 0; $i < length($s); $i++) {
    my $c = substr($s, $i, 1);

    return 0 if defined $chars{$c};
    $chars{$c} = 1;
  }

  return 1;  # OK
}


Translation into other languages that support regular expressions and dictionaries, sets, maps, or any other type of associative container is straightforward.
clcr
Thursday, May 31, 2007
 
 
I'd be tempted to try some 'state machine' implementation, myself.
AllanL5
Thursday, May 31, 2007
 
 
"I'd be tempted to try some 'state machine' implementation, myself."

A regular expression is just that: a state machine. So unless you introduce some extra construct to keep track of which characters you've seen, as I did, you'll end up with one state per possible permutation.
clcr
Thursday, May 31, 2007
 
 
I've been doing tons of regular expression work (even to the extent of writing regular expression parsers & generators, and writing a regex-to-DFA library).

What you're asking for is, unfortunately, not possible with raw regular expressions. However, if you're willing to run a subsequent confirmation regex, it's actually not so bad. Here's a quick perl script that demonstrates what I'm talking about:

    #!/usr/bin/perl
    use strict;
   
    my @texts = (
      "abcd",
      "cdab",
      "ccab",
    );
   
    foreach my $text (@texts) {
      if ($text =~ m/([abcd]{4})/) {
        my $capture = $1;
        if ($capture =~ m/(.).*\1/) {
          print "found repeat $1 in '$text'\n";
        } else {
          print "no repeats in $text\n";
        }
      }
    }

The second regular expression looks for repeated characters from the text that was captured in the first regular expression. So, to fulfill your requirements, I'd validate a match from the first regex, and I'd validate that NO matches were generated from the second regex.

Have fun!!
BenjiSmith Send private email
Thursday, May 31, 2007
 
 
Couldn't you just create a table with one slot for each valid letter, initialize it to zero, then update the corresponding slot as each letter is encountered?  If the slot is zero, you set it to one.  If it's already one, you fail because that letter was already encountered.  At the end, check all the slots for ones. 

Seems like a solution that would be orders of magnitude simpler than a regex and significantly faster (particularly if the number of valid letters is small enough to fit in something like a 32-bit integer where initialization and verification would be simple operations).
SomeBody Send private email
Thursday, May 31, 2007
 
 
Table/slot type of solution would make sense for the "abcd" language.
But I need to scale up the complexity of the match.
The goal was to see how completely I could spec out data coming from a serial port using a regular expression.

I was hoping to be able to use a single regular expression...
but it looks like some extra logic's required to do the extra checks.

Thanks for all the great feedback!
Jason A Send private email
Thursday, May 31, 2007
 
 
A charming but hideously impractical solution for the pythonistas in the crowd:

import re

def gen_perm(l):
    if len(l) < 2:
        yield l
    else:
        for i in range(len(l)):
            for l2 in gen_perm(l[:i] + l[i+1:]):
                yield [l[i]] + l2
               

def create_perm_re(s):
    perms = []
    for l in gen_perm(list(s)):
        perms.append(''.join(l))
    return '|'.join(perms)

regex = create_perm_re('abcd')
regex_compiled = re.compile(regex)
m = regex_compiled.match('blahblahabcdblahblah')

Memory usage for create_perm_re('abcdefg'): ~3MB
Memory usage for create_perm_re('abcdefgi'): ~31MB
Memory usage for create_perm_re('abcdefgi'): 300MB when I finally killed the process.

A little lesson in the factorial complexity....
Kevin Atkinson Send private email
Thursday, May 31, 2007
 
 
Just to be precise, there are 358,800 sequences of four letters without repetition.

If you had all 26 letters it would be 26 factorial which is a big number indeed.
frustrated
Thursday, May 31, 2007
 
 
Kevin, do they have "h" in your version of the alphabet?

;-)
BenjiSmith Send private email
Friday, June 01, 2007
 
 
Easiest ( definitely not the fastest, though): just sort the string and see if it matches 'abcd' ?
Ruatara P Send private email
Tuesday, June 05, 2007
 
 
The solution is indeed possible with a plain regex. If you're just checking that a string is a permutation of "abcd" you can use

^(?=.*a)(?=.*b)(?=.*c)(?=.*d)[abcd]{4}$

where (?=.*a) checks that the "a" char is found anywhere in the string, but without consuming chars. If you want to actually search for all words that are anagrams of "abcd" in a longer text, this is the way to go

\b(?=.{0,3}a)(?=.{0,3}b)(?=.{0,3}c)(?=.{0,3}d)[abcd]{4}\b

I have test these in .NET but the (?= ) construct is standard, thus they should work in any language that support regexes
Francesco Balena Send private email
Wednesday, June 06, 2007
 
 
That's awesome... next chance I get I'll have to startup RegexDesigner.Net and play with this (personally) unexplored corner of the syntax.
Jason A Send private email
Wednesday, June 06, 2007
 
 
Ah, very nice solution, Francesco. I hadn't thought of that.
BenjiSmith Send private email
Friday, June 08, 2007
 
 
A point that has not been covered: Regular expressions are much less powerful than many think.  RE handlers often add handling for things that are not actually part of REs.  Bear this in mind when considering porting issues and whether a given "RE" handler can deal with your situation.  The name is not the thing (unless you are reading Ursula K. LeGuin's Earthsea stories).

Sincerely,

Gene Wirchenko
Gene Wirchenko Send private email
Tuesday, June 12, 2007
 
 

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
 
Powered by FogBugz