The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

PHP Regex speed vs. C/C++

Okay, I need a wizard's opinion here. My PHP site at http://www.SmarterReviews.com takes product reviews from around the web and attempts to discover patterns in what people are saying about various attributes of each product, like "screen" or "battery life" for an mp3 player.

I have a huge array of regular expressions to match literally hundreds of thousands of different text patterns. A simplistic example would be "/(good|great|awesome){1} battery life/". For each sentence in the reviews, I loop through dozens of regexs like this looking for matches. It takes ~1 minute to process a few hundred reviews on my dual-core Athlon XP.

All this is currently written in PHP (w/ PCRE). My question is, if I wrote a C/C++ module specifically for regex matching, which I would call from PHP, would that speed things up significantly? Or does PHP basically do this anyway when it invokes the regex engine? I know some languages have a "compile" feature for regular expressions, which I think validates the syntax for quicker loading later. Does PHP have this?

I don't know how things work under the hood in PHP regex's. I haven't been able to find a lot of info on optimizing speed for large regex's. It's all working correctly in PHP now, so there would have to be a pretty big (20%+) speedup for me to (convert|pay someone to convert) to C/C++.

One more thing, I'm using the "i" modifier for all patterns (case-insensitive); would it be faster to convert the whole string to lowercase first and leave out the "i" modifier?

TIA
Sam Sanders Send private email
Friday, September 07, 2007
 
 
Do yourself a favour and read a proper book on the subject; Mastering Regular Expressions is a classic, there may be other good ones too.  That'll give you the information you need to write efficient regexes, and to choose between regex implementations.

(The short form is that you're unlikely to get much improvement from switching to C/C++, since you'd probably end up using the exact same PCRE library under the hood!  But there's more to it than that.)
Iago
Friday, September 07, 2007
 
 
What iago said, plus:

Could you cache the lower-case'd version of your search text?  That way you only pay the conversion cost once (and insensitive matching does slow things down a little; whether that little is significant of course depends on too many factors to list such as traffic, text size, etc.).  (I say "once" but of course if you're periodically refreshing your source data that's "once per fetch".  Of course I'd bet your fetch cycle is far, far longer than the amount of time between reader accesses so the cost still amortizes down to almost nil on a per-read basis.)

Also have you considered a full-text index of some form?  MySQL and Postgres both have solutions along these lines, you might be able to use Lucene, etc. etc.  It's just a slightly different way of looking at the "extract meaning from textual data" problem than straight-up regex'ing.  (Plus, these things tend to be pretty fast by virtue of their narrow focus.)
son of anon
Friday, September 07, 2007
 
 
I should mention that this matching doesn't run live for each website viewer I have; I don't know if I gave that impression. I have a cron job to periodically scrape new reviews, process them looking for matches, and then cache those matches & aggregate data.

It's pretty fast from the website user's standpoint because everything's cached at that point. I just wanted to be able to speed up the cron job. I periodically tweak the regex's and then I have to run them all over again.
Sam Sanders Send private email
Friday, September 07, 2007
 
 
"I wrote a C/C++ module specifically for regex matching, which I would call from PHP, would that speed things up significantly?"

Not the regexp's themselves, no.  You might get some performance benefit but it's not likely to be worth the effort.  Shuffling strings in C/C++ is never enjoyable.

"Or does PHP basically do this anyway when it invokes the regex engine?"

Basically.  PCRE is a C library.

"I know some languages have a "compile" feature for regular expressions, which I think validates the syntax for quicker loading later. Does PHP have this?"

It does have it but it doesn't make much difference.  You can read about it in the PHP manual under PCRE.

You could try combining your regexp's together and using the preg_match_all() and see if that improves matching performance.

The final idea, is to install the XDebug extension and profile you code.  That way you can tell exactly where the performance bottlenecks are.
Almost H. Anonymous Send private email
Friday, September 07, 2007
 
 
I would expect converting strings to lower case and leaving the /i modifier off the regex to, if anything, slow things down slightly rather than speed things up. But either way, I wouldn’t expect it to make a noticeable difference.

Saturday, September 08, 2007
 
 
You won't see an appreciable speedup redoing it in C/C++.  You'll probably see it slow down or get less reliable, as the libraries PHP is using are mature and heavily used.

Btw, what you're doing is essentially a data mining operation.  You may want to look into some data mining and/or information retrieval books.
Lally Singh Send private email
Saturday, September 08, 2007
 
 
You could take the simple option and upgrade to a high end Core 2 Quad - it'll probably run 3-4x faster.
Michael Reitzenstein Send private email
Monday, September 10, 2007
 
 
You might want to try other regex libraries. According to the tests at http://boost.org/libs/regex/doc/performance.html there are significant performance differences between them for some combinations of expressions and input.

Of course performance will vary between hardware, compiler, etc, but it's probably worth testing some other regex libraries on your data.

Another option that springs to mind is to make use of something like http://re2c.org/ but it's probably going to be more difficult to use.
Adam
Monday, September 10, 2007
 
 
If you're any good at C, you might want to consider using Lex/Flex.

The big problem is that you're checking the string against regex 1, then reprocessing the entire string again against regex 2, then reprocessing the entire string again against regex 3, etc.

A lexical analyzer lets you specify regexes that match "tokens". You specify a series of separate regular expressions, and through the magic of state machines, the output program actually combines all the regexes into one big, efficient one.

The net result is you scan the string ONCE looking for all the matches simultaneously.
Chris Tavares Send private email
Saturday, September 15, 2007
 
 
Before attempting anything else I'd try to optimize the regexes. I was able to speed up your regex by over 20% with the following:

/\b(?:g(?:ood|reat)|awesome) battery life/i

I added a word boundary at the start so the expression will only be attempted at the start of a word.

Changed to non-capturing parantheses which is a bit faster if you only use them for grouping.

Alternation is expensive, so I moved to g out of the alternation so the alternation for good|great is only attempted at words starting with g.

Removed the {1}, it's not neccessary.

I'd also reorder the alternatives so that the most frequently used alternatives are put first.

I have no idea whether these optimizations are appropriate for your data though. The speed improvement varied from 20-40% depending on the type of data I tested against.

It's really important that you do some profiling against your own data so you can accurately measure whether your optimizations are effective or not. Sometimes it can be very counter-intuitive which optimizations work in a specific context.
Pocoyo
Saturday, September 15, 2007
 
 

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
 
Powered by FogBugz