Google Answers: Matching non-HTML in PHP with preg

Q: Matching non-HTML in PHP with preg_replace() (Answered, 5 out of 5 stars)

Question

Subject: Matching non-HTML in PHP with preg_replace()
Category: Computers > Programming
Asked by: endquote-ga
List Price: $20.00

Posted: 17 Jun 2002 21:03 PDT
Expires: 24 Jun 2002 21:03 PDT
Question ID: 28275

I have a large string of HTML which I want to use preg_replace() on to
place <span> tags around certain words. So I want to search for the
word unless it occurs within an HTML tag, or HTML comment block. It
should also work if my word is preceded or followed by punctuation, or
is at the beginning or end of the large string. I think this should
work out to just a lot of matching of angle-brackets, but I can't seem
to get it right.

Answer

Subject: Re: Matching non-HTML in PHP with preg_replace()
Answered By: runix-ga on 17 Jun 2002 22:44 PDT
Rated: 5 out of 5 stars

Hi! :) I got the regex! :)

<?php
$words=array('WORD');
$data="WORD this is my WORD dont <p class='WORD'> or <!-- DONT WORD --> WORD";
foreach($words as $word){
  $data=preg_replace("/(?!<!--)(?!<)(^|[\s\.,>])($word)($|[\s,\.])(?!>)(?!-->)/", "$1<span>$2</span>$3",$data);
}
print $data;
print "\n\n\n";
?>

prints:

<span>WORD</span> this is my <span>WORD</span> dont <p class='WORD'> or <!-- DONT WORD --> <span>WORD</span>

You have to add the words you want to replace to the array '$words'. I used WORD here.

Additional Links

I recommend this book: 'Mastering Regular Expressions'
[ http://www.amazon.com/exec/obidos/ASIN/0596002890/qid=1024379027/sr=8-1/ref=sr_8_1/102-4897136-7732161 ]

Search Strategy

Personal Experience

Good luck!

Request for Answer Clarification by endquote-ga on 18 Jun 2002 00:09 PDT

Almost! Except:

$sWord = "too";
$str = "<!-- bar too foo -->";
$str = preg_replace("/(?!<!--)(?!<)(^|[\s\.,>])($sWord)($|[\s,\.])(?!>)(?!-->)/i", "\\1<b>\\2</b>\\3", $str);

returns

<!-- bar <b>too</b> foo -->

Not sure why that is. That's not good though. This is going to be run on HTML from a wide variety of users, so hopefully it can handle non-standard weirdness, too.

Also, will this have problems with punctuation other than periods and commas? Could a non-word character be used instead of \., to be more generic?

I actually have that book, but honestly don't have time to read it before I need the answer to this, and find it to be pretty difficult to read anyway.

Thanks though, this is the best I've seen yet! If you could just fix the comment thing...
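
On the punctuation question: one way to be more generic than listing \. and , is the \b word-boundary assertion, which matches between a word character and a non-word character (or the start or end of the string). The following is only a minimal sketch of that idea. The helper name wrap_whole_word() is hypothetical, and it deliberately ignores HTML tags and comments altogether. Note that \b treats digits and underscores as word characters too.

```php
<?php
// Hypothetical helper (not from the thread): wraps whole-word matches in <b> tags.
// It ignores HTML structure entirely; it only illustrates the boundary handling.
function wrap_whole_word(string $text, string $word): string
{
    // preg_quote() escapes regex metacharacters in the word, including the "/" delimiter.
    $pattern = '/\b(' . preg_quote($word, '/') . ')\b/i';
    return preg_replace($pattern, '<b>$1</b>', $text);
}

print wrap_whole_word('Too late; (too) tattoo, too!', 'too') . "\n";
// prints: <b>Too</b> late; (<b>too</b>) tattoo, <b>too</b>!
```

The "too" inside "tattoo" is left alone because a letter precedes it, while the occurrences next to parentheses, a comma, and an exclamation mark are all matched.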

Request for Answer Clarification by endquote-ga on 18 Jun 2002 00:31 PDT

Messing with it some more here. Your example does work, unless you include another word before the end of the comment. For example:

WORD this is my WORD dont <p class='WORD'> or <!-- DONT WORD foo --> WORD

would give

<span>WORD</span> this is my <span>WORD</span> dont <p class='WORD'> or <!-- DONT <span>WORD</span> foo --> <span>WORD</span>

Also, I put an i modifier on there to make it case-insensitive.

This'll be very exciting for me if you can make it go. :) It's kind of fundamental to a largish project.

Request for Answer Clarification by endquote-ga on 18 Jun 2002 01:09 PDT

Damn, okay, I just came up with another requirement for this, but since it wasn't in the original I can make it a new question you can answer for the same price.

Words should also not be matched if they are linked with <a> tags. So...

<!-- some WORD here -->

and

<a href="http://word.com/">some word here</a>

should both not match.

Clarification of Answer by runix-ga on 18 Jun 2002 15:25 PDT

Ok! Here's version 2.0 :) Now it doesn't replace the words inside comments or HTML tags.

<?php
$data="this is my WORD <p class='biri WORD biri'> or <!-- DONT WORD CRI --> WORD";
$words=array('WORD');
$data=add_spans($data,$words);
print $data;
print "\n\n\n";

function add_spans($data,$words){
  $data=">".$data."<";
  foreach($words as $word){
    $data=preg_replace("/(>[^<]*)($word)([^<>]*)/", "$1<span>$2</span>$3",$data);
  }
  $data=substr($data,1,strlen($data)-2);
  return $data;
}
?>

prints:

this is my <span>WORD</span> <p class='biri WORD biri'> or <!-- DONT WORD CRI --> <span>WORD</span>

The only problem I found is that if the word you're looking for occurs twice without an HTML tag in between, the first occurrence doesn't get wrapped in a <span>. It's not possible to fix this problem using regular expressions (or, at least, I don't know how to), so if you want, we can think of another way of doing it.

Good luck!
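
For what it's worth, a negative lookahead is one regex-only way to get at the repeated-word problem described above: instead of consuming the text between a ">" and the word, the pattern only checks that no ">" follows the word before the next "<". This is a sketch under stated assumptions, not a fix for everything in this thread. The function name add_spans_lookahead() is hypothetical, and the pattern does not protect text inside <a>...</a> links, comments that themselves contain tags, or text containing a stray ">".

```php
<?php
// Hypothetical helper (not from the thread); name and signature are illustrative only.
function add_spans_lookahead(string $html, array $words): string
{
    foreach ($words as $word) {
        // (?![^<]*>) rejects a match when a ">" follows before any "<",
        // i.e. when the word appears to sit inside a tag or a simple comment.
        $pattern = '/\b(' . preg_quote($word, '/') . ')\b(?![^<]*>)/i';
        $html = preg_replace($pattern, '<span>$1</span>', $html);
    }
    return $html;
}

print add_spans_lookahead(
    "WORD and WORD <p class='WORD'> or <!-- DONT WORD CRI --> WORD",
    array('WORD')
) . "\n";
// prints: <span>WORD</span> and <span>WORD</span> <p class='WORD'> or <!-- DONT WORD CRI --> <span>WORD</span>
```

Because the lookahead consumes nothing, both occurrences of WORD before the first tag are wrapped.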

Clarification of Answer by runix-ga on 19 Jun 2002 11:34 PDT

Ok! Let's try this one! And don't worry about the links issue: I've modified it so it doesn't replace anything inside a tag (i.e. <span class='WORD'> won't be replaced!)

<?php
$data=file("sampletext.txt");
$words=array('more');
$data=add_spans($data,$words);
print $data;
print "\n\n\n";

function add_spans($data,$words){
  if (is_array($data)){$data=join('',$data);}
  $data=">".$data."<";
  foreach($words as $word){
    $data=preg_replace("/((?=>)[^<]?[^\w])($word)([^\w][^<>])/i", "$1<span>$2</span>$3",$data);
  }
  $data=substr($data,1,strlen($data)-2);
  return $data;
}
?>

Request for Answer Clarification by endquote-ga on 19 Jun 2002 18:43 PDT

It's still matching stuff in comments. Try it with:

$words=array('more', 'foxy', 'while');

and you'll see. Perhaps mohsen's comment would be helpful? I'll try and hack on it some, too.

Request for Answer Clarification by endquote-ga on 19 Jun 2002 19:56 PDT

I made some progress:

<?php
$data=file("sampletext.txt");
$words=array('more', 'foxy', 'while', 'phoenix');
$data=add_spans($data,$words);
print $data;

function add_spans($data,$words){
  if(is_array($data)) { $data=join('',$data); }
  foreach($words as $word){
    // put tags around the word
    $data = preg_replace("/($word)/i", "<b>\\1</b>", $data);
    // remove them if it was in an html tag
    $data = preg_replace("/(<[^<>])<b>($word)<\/b>([^<>]>)/i", "\\1\\2\\3", $data);
    // remove them if it was in a link
    $data = preg_replace("/(<a [^<>]>.)<b>($word)<\/b>(.<\/a>)/i", "\\1\\2\\3", $data);
    // remove them if it was in a comment
    // $data = preg_replace("/(<\!--[.\r\n]-->)/", "", $data);
    // remove them if it was in a comment
    // $data = preg_replace("/(<\!--.)<b>($word)<\/b>(.-->)/i", "\\1\\2\\3", $data);
  }
  return $data;
}
?>

Still can't seem to match a comment though!

Clarification of Answer by runix-ga on 21 Jun 2002 06:15 PDT

This is the correct version, without regexes:

<?php
$data=file("sampletext.txt");
$words=array('more', 'foxy', 'while', 'Phoenix', 'phoenixfest', 'real');
$data=add_spans($data,$words);
print $data;

function add_spans($data,$words){
  $pre = '<b><a href="http://tangent.cx/r.php?url=dev.endquote.com%2Findex.php%3Fid%3D376" onmouseout="doTangent()" onmouseover="doTangent(\'dev.endquote.com/index.php?id=376\',\'Pre-Phoenix Festival.\',\'2001-07-04\',\'ly The Phoenix Festival is the day aft\',\'dev.endquote.com/index.php?id=396\',\'Money and DSL.\',\'2001-08-27\',\'CA World Sound Festival and maybe kick\')">';
  $post = '</a></b>';
  if (is_array($data)){$data=join('',$data);}

  // $forb collects the "forbidden" ranges as array(start, end) pairs
  $forb=array();

  // forbidden: comments
  $end=0;
  do{
    $start=strpos($data,"<!--",$end);
    if ($start === false){ break; }
    $end=strpos($data,"-->",$start);
    $forb[]=array($start,$end);
  }while($start<strlen($data));

  // forbidden: tags
  $end=0;
  do{
    $start=strpos($data,"<",$end);
    if ($start === false){ break; }
    $end=strpos($data,">",$start);
    $forb[]=array($start,$end);
  }while($start<strlen($data));

  // forbidden: links, from "<a" to "</a>"
  $end=0;
  do{
    $start=strpos($data,"<a",$end);
    if ($start === false){ break; }
    $end=strpos($data,"</a>",$start);
    $forb[]=array($start,$end);
  }while($start<strlen($data));

  // search in a lowercased copy so the match is case-insensitive
  $dataL=strtolower($data);
  foreach($words as $word){
    $word=strtolower($word);
    $pos=0;
    do{
      $pos=strpos($dataL,$word,$pos);
      if ($pos===false){break;}
      if (check($pos,$forb)){
        if ($pos>=1){
          $before=substr($data,$pos-1,1);
        }else{$before='';}
        if ($pos<strlen($data)){
          $after=substr($data,$pos+strlen($word),1);
        }else{$after='';}
        // skip hits that have a letter directly before or after them
        if (preg_match("/[a-z]/i",$before.$after)){
          $pos++;
          continue;
        }
        $NEW=$pre.substr($data,$pos,strlen($word)).$post;
        $data=substr($data,0,$pos).$NEW.substr($data,$pos+strlen($word));
        $dataL=substr($dataL,0,$pos).$NEW.substr($dataL,$pos+strlen($word));
        // the tags just inserted become forbidden ranges too
        $end=0;
        do{
          $start=strpos($NEW,"<",$end);
          if ($start === false){ break; }
          $end=strpos($NEW,">",$start);
          # print "new forbidden areas: ".($start+$pos)." , ".($end+$pos)."\n";
          $forb[]=array($start+$pos,$end+$pos);
        }while($start<strlen($NEW));
        $forb=updateForb($pos,strlen($NEW)-strlen($word),$forb);
        $pos=$pos+7;
        break;
      }else{
        $pos++;
      }
    }while($pos!=false and $pos<strlen($data));
  }
  return $data;
}

// shift every range that starts at or after $pos by $sum characters
function updateForb($pos,$sum,$forb){
  $ret=array();
  foreach($forb as $f){
    list($start,$end)=$f;
    if ($pos<=$start){
      $start=$start+$sum;
      $end=$end+$sum;
    }
    $ret[]=array($start,$end);
  }
  return $ret;
}

// returns 0 if $pos lies inside a forbidden range, 1 otherwise
function check($pos,$forb){
  foreach($forb as $f){
    list($start,$end)=$f;
    if ($pos>=$start and $pos<=$end){
      return 0;
    }
  }
  return 1;
}
?>
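
The forbidden-range bookkeeping above can also be approached by cutting the HTML into pieces first. The following is a rough sketch of that alternative, not the answerer's method: preg_split() with PREG_SPLIT_DELIM_CAPTURE separates comments, whole <a>...</a> links, and other tags from the plain text, and only the plain-text pieces are searched. The function name add_spans_tokenized() is hypothetical, and the sketch assumes reasonably well-formed markup; it makes no attempt to handle <script> or <style> contents or unterminated comments.

```php
<?php
// Hypothetical helper (not from the thread): a sketch of the
// "protected regions" idea using preg_split().
function add_spans_tokenized(string $html, array $words): string
{
    if (!$words) {
        return $html; // nothing to highlight
    }

    // Protected regions: comments, whole <a>...</a> links, then any other tag.
    $protected = '/(<!--.*?-->|<a\b[^>]*>.*?<\/a>|<[^>]*>)/is';
    $parts = preg_split($protected, $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    // Build one alternation so every word is handled in a single pass.
    $quoted = array_map(function ($w) { return preg_quote($w, '/'); }, $words);
    $pattern = '/\b(' . implode('|', $quoted) . ')\b/i';

    foreach ($parts as $i => $part) {
        // Even indexes hold plain text; odd indexes hold the captured protected regions.
        if ($i % 2 === 0) {
            $parts[$i] = preg_replace($pattern, '<span>$1</span>', $part);
        }
    }
    return implode('', $parts);
}

print add_spans_tokenized(
    'more <a href="http://word.com/">more here</a> <!-- more foo --> <p class="more">More more</p>',
    array('more', 'foxy')
) . "\n";
// prints: <span>more</span> <a href="http://word.com/">more here</a> <!-- more foo --> <p class="more"><span>More</span> <span>more</span></p>
```

Handling all words in a single pass also avoids re-scanning the <span> tags inserted for an earlier word, which is what the updateForb() step takes care of in the version above.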

endquote-ga rated this answer: 5 out of 5 stars

Very very helpful!

Comments

Subject: Re: Matching non-HTML in PHP with preg_replace()
From: mohsen-ga on 19 Jun 2002 08:59 PDT

Greetings!

Just a quick note. The point here is that it is easier to construct a regex
for the WORDs that should NOT be matched than for those that should match.
So if I were you, I would do this in two phases. First, I would replace all
occurrences of WORD with <span>WORD</span>. That's easy and fast. Second,
I would change the WORDs that should not have been changed back to
their original form. To do that, you need to write a regex that matches any
"<span>WORD</span>" which is either inside a comment, within an HTML tag, or
between <a> and </a>. I am sure that's far simpler to handle and understand.

Hope it was helpful.
regards,
mohsen-ga
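
The two phases described above might be sketched in PHP roughly as follows. This is a minimal sketch under a few assumptions, not a definitive implementation: the function name add_spans_two_phase() is hypothetical, and instead of inserting literal <span> tags in phase 1 it uses two control characters as placeholders (assumed never to occur in the input), so that the tag matching in phase 2 is not confused by extra angle brackets. A search word that equals a tag name (for example "a") would defeat the link rule.

```php
<?php
// Hypothetical helper (not from the thread) sketching the two-phase idea.
function add_spans_two_phase(string $html, string $word): string
{
    // Assumption: these control characters never occur in the input.
    $open = "\x01";
    $close = "\x02";

    // Phase 1: mark every whole-word occurrence, wherever it is.
    $marked = preg_replace(
        '/\b(' . preg_quote($word, '/') . ')\b/i',
        $open . '$1' . $close,
        $html
    );

    // Phase 2: strip the markers again inside comments, links and tags.
    $marked = preg_replace_callback(
        '/<!--.*?-->|<a\b[^>]*>.*?<\/a>|<[^>]*>/is',
        function (array $m) use ($open, $close) {
            return str_replace(array($open, $close), '', $m[0]);
        },
        $marked
    );

    // Whatever markers survive are in plain text; turn them into real tags.
    return str_replace(array($open, $close), array('<span>', '</span>'), $marked);
}

print add_spans_two_phase(
    "WORD <p class='WORD'> <!-- DONT WORD foo --> <a href=\"#\">a word</a> WORD.",
    'WORD'
) . "\n";
// prints: <span>WORD</span> <p class='WORD'> <!-- DONT WORD foo --> <a href="#">a word</a> <span>WORD</span>.
```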

Subject: Re: Matching non-HTML in PHP with preg_replace()
From: cbra-ga on 26 Jun 2002 09:28 PDT

Thanks to mohsen-ga!!

I had nearly the same problem in Perl:
Highlight words in a HTML page, but don't destroy the HTML tags.
Your two stage approach works for me:

my @farben = ( 'yellow', '#ffa0a0', '#a0ffa0', '#d0d0ff' );
foreach ( @pattern ) {
  $color = shift @farben;
  $resultColor = "<span style=\"background-color:$color\">"; 

   # now change EVERY match
  $text =~ s/$_/$resultColor$_<\/span>/sixg;  
   # now search for highlights within HTML: 
   # < not followed by closing >
   # following the span
   # store original HTML tags in $1 and $2
  $text =~ s#(<[^>]*)<span [^>]+>([^<]+)<\/span>#$1$2#gsx;
}


