The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Is there a way to stop a regex from looking inside HTML tags?

The task is simple: take some text and run it through a set of regexes with PHP's preg_replace(), replacing certain character sequences with others (e.g. straight quotes with smart quotes, double hyphens with the mdash entity, double newlines with HTML paragraphs, etc.). The problem is that the replacements must happen only outside any HTML tags the text may contain.

So far I have not managed to write a regex that refrains from acting on a match. It all seems aimed at *doing* stuff, not *not doing* stuff on a match... Any ideas?
ping?
Sunday, December 11, 2005
 
 
> preg_replace()

Is there another function other than replace? Usually there's just a match of some kind.
son of parnas
Sunday, December 11, 2005
 
 
Couldn't you just use the strip_tags function first, then pass the result to preg_replace?

http://us2.php.net/strip_tags

Like this:

$whatever = preg_replace('|pattern|', $replacement, strip_tags($subject));

(That's from memory, so it's probably wrong, but you get the idea.)

Deane
Deane
Sunday, December 11, 2005
 
 
This is possible with regex, but don't ask me how. The O'Reilly regex book was pretty good.

The stuff you're doing already is pretty basic; regular expressions can be extraordinarily complex. I used to be really good at them, but I still needed to sit down with the book and troubleshoot my expressions, and I made sure to keep snippets to base future expressions on.

Basically what you want to say is "not between < and >", but you also need to throw in a little bit to say "and the next >"; otherwise the < of the first <html> and the > of the last </html> will be taken as the pair, and everything in between will count as inside a tag.

Sorry... that's all I remember. Google some regex tutorials.
MarkTAW
Sunday, December 11, 2005
 
 
Someone posted a solution in the comments for
http://us2.php.net/preg_replace
 
"'(?!<.*?)$string(?![^<>]*?>)'si"

I don't completely understand this, but it seems to work. I'm not sure what it means to have a negative assertion at the beginning of the pattern.
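
For illustration, here is a minimal sketch of how the trailing assertion in that pattern behaves. The helper name replace_outside_tags_lookahead() and the sample markup are made up for this example, and it assumes PHP 7 or later; it uses "/" as the delimiter instead of the quote character in the posted pattern.

```php
<?php
// Sketch only: replace_outside_tags_lookahead() is a hypothetical helper name.
function replace_outside_tags_lookahead(string $needle, string $replacement, string $html): string
{
    // Escape the needle so it is matched literally; "/" is the delimiter used here.
    $quoted = preg_quote($needle, '/');

    // (?![^<>]*>) fails when a ">" can be reached without first crossing
    // a "<" or another ">", which is the situation inside a tag.
    return preg_replace('/' . $quoted . '(?![^<>]*>)/', $replacement, $html);
}

// Hypothetical sample: "--" appears in two attribute values and twice in text.
$html = '<a href="x--y.html" title="a--b">one--two</a> three--four';
echo replace_outside_tags_lookahead('--', '&mdash;', $html), "\n";
// prints: <a href="x--y.html" title="a--b">one&mdash;two</a> three&mdash;four
```

The lookahead never consumes anything, so only the needle itself is replaced and the tags are left exactly as they were.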
JW
Sunday, December 11, 2005
 
 
If it's easier to act on the text that you find than to skip the text that you don't want, then stop looking for tags in order to ignore them. Instead, think of the text outside the tags as a kind of "inverse tag" delimited by > and <. Look for the patterns:

^([^<]*)$  - string with no tags at all
^([^<]*)<  - up to the first tag
>([^<]*)<  - between tags
>([^<]*)$  - after the last tag

Then do what you want on the text that you find.
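
One way to sketch this in PHP is to fold the four patterns into a single expression and hand each run of text to a callback. This is a variation, not the patterns exactly as listed: it uses (^|>) for the left edge and [^<>]+ for the text. The names replace_text_between_tags() and $dashes are hypothetical, and PHP 7 or later is assumed.

```php
<?php
// Sketch only: replace_text_between_tags() is a hypothetical helper name.
function replace_text_between_tags(string $html, callable $transform): string
{
    return preg_replace_callback(
        // Group 1: start of input or the ">" that closes a tag.
        // Group 2: the run of text up to the next "<" (or the end of input).
        '/(^|>)([^<>]+)/',
        function (array $m) use ($transform): string {
            // Put the delimiter back untouched and transform only the text.
            return $m[1] . $transform($m[2]);
        },
        $html
    );
}

// Example transform (hypothetical): double hyphen to the mdash entity.
$dashes = function (string $text): string {
    return str_replace('--', '&mdash;', $text);
};

echo replace_text_between_tags('<p class="a--b">x--y</p>', $dashes), "\n";
// prints: <p class="a--b">x&mdash;y</p>
```

Because the text is passed to ordinary PHP code, the callback can apply several replacements to each run in one go.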

Just a thought...
EKB
Monday, December 12, 2005
 
 
Also, apparently the (?!string) syntax is provided by PHP's PCRE (Perl-compatible regular expression) extension. See: http://us3.php.net/pcre

(?!string) is a negative "lookahead assertion" saying that the next few characters are not "string". From the PHP web site:

<quote>
Lookahead assertions start with (?= for positive assertions and (?! for negative assertions. For example, \w+(?=;)  matches a word followed by a semicolon, but does not include the semicolon in the match, and foo(?!bar)  matches any occurrence of "foo" that is not followed by "bar". Note that the apparently similar pattern (?!foo)bar  does not find an occurrence of "bar" that is preceded by something other than "foo"; it finds any occurrence of "bar" whatsoever, because the assertion (?!foo) is always TRUE  when the next three characters are "bar".
</quote>
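
The three patterns from the quoted passage can be tried directly. The sample strings below are made up, and the last line adds a negative lookbehind, (?<!foo)bar, which is not in the quote but is the PCRE construct that does reject "bar" preceded by "foo".

```php
<?php
// \w+(?=;) : a word followed by a semicolon; the semicolon is not in the match.
preg_match('/\w+(?=;)/', 'alpha; beta', $m);
echo $m[0], "\n";                                             // prints: alpha

// foo(?!bar) : "foo" not followed by "bar"; only the second "foo" qualifies.
echo preg_match_all('/foo(?!bar)/', 'foobar foobaz'), "\n";   // prints: 1

// (?!foo)bar : the assertion inspects "bar" itself, so it always holds.
echo preg_match_all('/(?!foo)bar/', 'foobar'), "\n";          // prints: 1

// (?<!foo)bar : lookbehind; rejects the "bar" in "foobar", keeps the one in "bazbar".
echo preg_match_all('/(?<!foo)bar/', 'foobar bazbar'), "\n";  // prints: 1
```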
EKB
Monday, December 12, 2005
 
 
JW's version appears to be working, or at least working in most cases. Kudos to EKB for explaining why it works. :) Thanks, folks.
ping?
Tuesday, December 13, 2005
 
 
EKB,
That's what I don't understand. The expression I quoted *starts* with a negative lookahead assertion. That doesn't make sense, according to the last sentence of the paragraph you quoted.
JW
Tuesday, December 13, 2005
 
 
Split to an array using a pattern of /(<.*?>)/

The <.*?> will match a tag, and the parentheses make sure that each tag is included in the array along with the text between the tags.

Run your search and replace on each non-tag member of the array.

Recombine everything into a single string.
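
In PHP the three steps might look like the sketch below. It assumes preg_split() with the PREG_SPLIT_DELIM_CAPTURE flag, which is what makes the parenthesized tags show up in the result; replace_by_splitting() is a hypothetical helper name, and PHP 7 or later is assumed.

```php
<?php
// Sketch only: replace_by_splitting() is a hypothetical helper name.
function replace_by_splitting(string $html, string $pattern, string $replacement): string
{
    // Step 1: split on tags, keeping each tag in the result array.
    $parts = preg_split('/(<.*?>)/s', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    // Step 2: with a single capturing group and no other flags, text chunks
    // sit at even indexes and the captured tags at odd indexes.
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $parts[$i] = preg_replace($pattern, $replacement, $part);
        }
    }

    // Step 3: recombine everything into a single string.
    return implode('', $parts);
}

echo replace_by_splitting('<a title="a--b">x--y</a>--z', '/--/', '&mdash;'), "\n";
// prints: <a title="a--b">x&mdash;y</a>&mdash;z
```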
DAH
Tuesday, December 13, 2005
 
 
Make sure your data is XHTML. Use an XML parser, and run your regex on each cdata node.

Seriously, it'll be a lot more maintainable than messing with overcomplex regexes.  There's a reason perl requires special goggles to read, and it's not $just $the @crufty $yntax.
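
A rough sketch of the parser route, assuming PHP's DOM extension is available and the input is a single well-formed XML element. It visits text nodes, which is presumably what "cdata node" refers to here. The helper name replace_in_text_nodes() is hypothetical, and PHP 7 or later is assumed.

```php
<?php
// Sketch only: replace_in_text_nodes() is a hypothetical helper name.
function replace_in_text_nodes(string $xhtml, string $pattern, string $replacement): string
{
    $doc = new DOMDocument();
    // loadXML() returns false when the input is not well-formed XML.
    if ($doc->loadXML($xhtml) === false) {
        throw new InvalidArgumentException('Input is not well-formed XML');
    }

    // Select every text node; attribute values are not text nodes,
    // so they are never touched.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//text()') as $node) {
        $node->nodeValue = preg_replace($pattern, $replacement, $node->nodeValue);
    }

    // Serialize only the root element, without the XML declaration.
    return $doc->saveXML($doc->documentElement);
}

echo replace_in_text_nodes('<p title="a--b">x--y</p>', '/--/', ' to '), "\n";
// prints: <p title="a--b">x to y</p>
```

The parser takes care of quoted attributes, so the regex only has to describe the replacement itself.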
Iago
Tuesday, December 13, 2005
 
 
DAH - thanks, that's exactly what I was trying to avoid. There is already enough stuff done to that data - in a single pass, but with multiple regexes a couple of which invoke custom callbacks. I'd rather have the smart text replacement happen in the same pass than having to call it for a hundred chunks apiece.

By the way, the pattern, though starting with a negative assertion, does work for most cases. I saw some cases of it consistently failing on a certain kind of A tags. I haven't found out why it fails, yet.
ping?
Tuesday, December 13, 2005
 
 
JW: I agree - it looks like it shouldn't work. Maybe it's a "do nothing" part of the regexp? If you delete that, what happens?
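
A quick way to check is to run the pattern with and without the leading part and compare. Since .*? is lazy and may match nothing, (?!<.*?) reduces to (?!<), which only says "the next character is not <"; that is always true when the search string itself does not start with "<". The sketch below assumes such a search string ("--"); same_result() and the samples are hypothetical, and PHP 7 or later is assumed.

```php
<?php
// Sketch only: same_result() is a hypothetical helper name.
function same_result(string $needle, string $subject): bool
{
    $quoted  = preg_quote($needle, '/');
    // The posted pattern, including the leading negative lookahead.
    $with    = '/(?!<.*?)' . $quoted . '(?![^<>]*?>)/si';
    // The same pattern with the leading assertion deleted.
    $without = '/' . $quoted . '(?![^<>]*?>)/si';

    return preg_replace($with, '&mdash;', $subject)
        === preg_replace($without, '&mdash;', $subject);
}

// Hypothetical samples: text in tags, plain text, and text right after a tag.
$samples = [
    '<a title="a--b">one--two</a>',
    'plain--text',
    '<b>--</b>--<i>x</i>',
];
foreach ($samples as $sample) {
    var_dump(same_result('--', $sample));   // bool(true) each time
}
```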
EKB
Tuesday, December 13, 2005
 
 
I'm pretty certain this is unsolvable with PCRE. To properly handle HTML tags you need to recognize quoted attributes and their content. I've only seen it solved in Perl with a dynamic regex (that is, code placed inside the regex is evaluated during matching and its return value is treated as a regex). AFAIK, PCRE doesn't have this feature.
Egor
Tuesday, December 13, 2005
 
 
Egor, I think this problem is a bit simpler. The OP doesn't need to handle the HTML tags. Instead, I think he wants to ignore them, but without deleting them. They should stay exactly where and as they are, while he munges the text outside the tags.

Is that right?
EKB
Wednesday, December 14, 2005
 
 
Either way, he still needs to recognize them. And I think that an angle bracket within an attribute value will break any non-dynamic regex. If it has to be PHP, then a combination of finite state machine code and regexes should probably be used.
Egor
Thursday, December 15, 2005
 
 
If it's valid XHTML, there shouldn't be any angle brackets within attributes; they should all be &lt; or &gt;.
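
The point can be checked against the lookahead pattern posted earlier in the thread. In the sketch below, dash_outside_tags() and the sample markup are hypothetical, and PHP 7 or later is assumed. With this particular pattern it is a raw "<" inside an attribute value that causes a wrong replacement; the escaped form and a raw ">" both leave the tag alone in these samples.

```php
<?php
// Sketch only: dash_outside_tags() is a hypothetical helper name.
function dash_outside_tags(string $html): string
{
    // Replace "--" unless a ">" follows before any other "<" or ">".
    return preg_replace('/--(?![^<>]*>)/', '&mdash;', $html);
}

// Raw "<" in an attribute value: the lookahead stops at it and wrongly
// concludes that the "--" in title is outside a tag.
$raw_lt  = '<a title="x--y" onclick="if (a < b) go()">link</a>';
// Same markup with the bracket written as an entity.
$escaped = '<a title="x--y" onclick="if (a &lt; b) go()">link</a>';
// Raw ">" in an attribute value.
$raw_gt  = '<a title="x--y" onclick="if (a > b) go()">link</a>';

echo dash_outside_tags($raw_lt), "\n";
// prints: <a title="x&mdash;y" onclick="if (a < b) go()">link</a>
echo dash_outside_tags($escaped), "\n";
// prints: <a title="x--y" onclick="if (a &lt; b) go()">link</a>
echo dash_outside_tags($raw_gt), "\n";
// prints: <a title="x--y" onclick="if (a > b) go()">link</a>
```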
JW
Thursday, December 15, 2005
 
 
OK - yes, the intent was to leave the tags in the text and merely stop the regex from looking inside them, so that it doesn't replace tag attribute quotes with smart quotes, for instance.

As for the angle bracket (greater-than sign) appearing inside an attribute value, it's technically correct that it would break the recognition. However, the only *valid* case when it might appear there is as a "greater-than" operator in a Javascript expression in an onSomething handler. In the rest of such cases the bracket should be the "&gt;" entity anyway. And this single valid appearance I can safely avoid - function calls serve better for events than inline Javascript.

Is regex 100% fit for parsing HTML? No. Does it do what I need to do in this case, or 99% of it anyway? Yes! Do I want to complicate my little script tenfold just to cover that 1% which is unlikely to ever show up? Emphatic no. :)

This is a case where "worse is better" for real.
ping?
Thursday, December 15, 2005
 
 

This topic is archived. No further replies will be accepted.
