Google Answers

Q: Matching non-HTML in PHP with preg_replace() (Answered, 5 out of 5 stars, 2 Comments)
Question  
Subject: Matching non-HTML in PHP with preg_replace()
Category: Computers > Programming
Asked by: endquote-ga
List Price: $20.00
Posted: 17 Jun 2002 21:03 PDT
Expires: 24 Jun 2002 21:03 PDT
Question ID: 28275
I have a large string of HTML which I want to use preg_replace() on to
place <span> tags around certain words. So I want to search for the
word unless it occurs within an HTML tag, or HTML comment block. It
should also work if my word is preceded or followed by punctuation, or
is at the beginning or end of the large string. I think this should
work out to just a lot of matching of angle-brackets, but I can't seem
to get it right.
Answer  
Subject: Re: Matching non-HTML in PHP with preg_replace()
Answered By: runix-ga on 17 Jun 2002 22:44 PDT
Rated: 5 out of 5 stars
 
Hi! :)

I got the regex! :)

<?
$words=array('WORD');
$data="WORD this is my WORD dont <p class='WORD'> or <!-- DONT WORD --> WORD";
foreach($words as $word){
    $data=preg_replace("/(?!<!--)(?!<)(^|[\s\.,>])($word)($|[\s,\.])(?!>)(?!-->)/",
                            "$1<span>$2</span>$3",$data);
}
print $data;
print "\n\n\n";
?>

prints: 
<span>WORD</span> this is my <span>WORD</span> dont <p class='WORD'>
or <!-- DONT WORD --> <span>WORD</span>

Add the words you want to wrap to the '$words' array. I used WORD
here.


Additional Links

I recommend this book: 'Mastering Regular Expressions'
[ http://www.amazon.com/exec/obidos/ASIN/0596002890/qid=1024379027/sr=8-1/ref=sr_8_1/102-4897136-7732161 ]

Search Strategy

Personal Experience


 Good luck!

Request for Answer Clarification by endquote-ga on 18 Jun 2002 00:09 PDT
Almost! Except:

$sWord = "too";
$str = "<!-- bar too foo -->";
$str = preg_replace("/(?!<!--)(?!<)(^|[\s\.,>])($sWord)($|[\s,\.])(?!>)(?!-->)/i",
"\\1<b>\\2</b>\\3", $str);

returns <!-- bar <b>too</b> foo -->

Not sure why that is. That's not good though. This is going to be run
on HTML from a wide variety of users, so hopefully it can handle
non-standard weirdness, too.

Also, will this have problems with punctuation other than periods and
commas? Could a non-word character be used instead of \., to be more
generic?
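
On the punctuation question: one common option is the \b word-boundary assertion, which accepts any non-word character (or the start or end of the string) on either side of the word. The snippet below is only a minimal sketch of that idea, assuming PHP 7 or later; wrap_word() is a hypothetical helper name that does not appear in the thread, and it does nothing yet about tags or comments.

```php
<?php
// Hypothetical helper (not from the thread): wrap whole-word matches of $word in <b> tags.
function wrap_word(string $text, string $word): string
{
    // preg_quote() escapes regex metacharacters in $word.
    // \b asserts a word boundary, so any punctuation, whitespace,
    // or the start/end of the string may surround the word.
    $pattern = '/\b(' . preg_quote($word, '/') . ')\b/i';
    return preg_replace($pattern, '<b>$1</b>', $text);
}

echo wrap_word('Too late! Is that (too) much, or tool-free?', 'too'), "\n";
// prints: <b>Too</b> late! Is that (<b>too</b>) much, or tool-free?
```

Note that \b treats digits and underscores as word characters, so "too_much" or "too2" would not match.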

I actually have that book, but honestly don't have time to read it
before I need the answer to this, and find it to be pretty difficult
to read anyway. Thanks though, this is the best I've seen yet! If you
could just fix the comment thing...

Request for Answer Clarification by endquote-ga on 18 Jun 2002 00:31 PDT
Messing with it some more, here. Your example does work, unless you
include another word before the end of the comment. For example:

WORD this is my WORD dont <p class='WORD'> or <!-- DONT WORD foo -->
WORD

would give

<span>WORD</span> this is my <span>WORD</span> dont <p class='WORD'>or
<!-- DONT <span>WORD</span> --> <span>WORD</span>

Also I put an i modifier on there to make it case-insensitive.

This'll be very exciting for me if you can make it go. :) It's kind of
fundamental to a largish project.

Request for Answer Clarification by endquote-ga on 18 Jun 2002 01:09 PDT
Damn, okay, I just came up with another requirement for this, but
since it wasn't in the original I can make it a new question you can
answer for the same price. Words should also not be matched if they
are linked with <a> tags. So...

<!-- some WORD here -->
and
<a href="http://word.com/">some word here</a>

should both *not* match.

Clarification of Answer by runix-ga on 18 Jun 2002 15:25 PDT
Ok! 
Here's version 2.0 :)

Now it doesn't replace the words inside comments or HTML tags.
<?
$data="this is my WORD <p class='biri WORD biri'> or <!-- DONT WORD CRI -->  WORD";
$words=array('WORD');

$data=add_spans($data,$words);
print $data;
print "\n\n\n";

function add_spans($data,$words){
    $data=">".$data."<";
    foreach($words as $word){
        $data=preg_replace("/(>[^<]*)($word)([^<>]*)/",
"$1<span>$2</span>$3",$data);
    }
    $data=substr($data,1,strlen($data)-2);
    return $data;
}

?>

prints: this is my <span>WORD</span> <p class='biri WORD biri'> or
<!-- DONT WORD CRI --> <span>WORD</span>



The only problem I found is that if the word you're looking for
appears twice with no HTML tag in between, only the last occurrence
gets wrapped in a <span>, because the greedy [^<]* skips ahead to the
last match in each stretch of text.
It may not be possible to fix this problem using regular expressions
(or, at least, I don't know how to), so if you want we can think of
another way of doing it.
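
One regex-only possibility for the repeated-word case is a negative lookahead: match the word only when it is not followed by a ">" before the next "<". This is a sketch under assumptions, not the method used in this thread. add_spans_lookahead() is a hypothetical name, PHP 7 or later is assumed, and the pattern can be fooled by a stray ">" in ordinary text or by comments that themselves contain tags.

```php
<?php
// Hypothetical helper (not from the thread): wrap every whole-word match that is
// not followed by a ">" before the next "<", i.e. that does not sit inside a tag.
function add_spans_lookahead(string $data, array $words): string
{
    foreach ($words as $word) {
        // (?![^<>]*>) fails when only non-bracket characters separate the word
        // from a closing ">", which is the case inside a tag or a simple comment.
        $pattern = '/\b(' . preg_quote($word, '/') . ')\b(?![^<>]*>)/i';
        $data = preg_replace($pattern, '<span>$1</span>', $data);
    }
    return $data;
}

$data = "WORD and WORD again <p class='WORD'> or <!-- DONT WORD CRI --> WORD";
echo add_spans_lookahead($data, ['WORD']), "\n";
// prints: <span>WORD</span> and <span>WORD</span> again <p class='WORD'> or <!-- DONT WORD CRI --> <span>WORD</span>
```

Because nothing before the word is consumed by the match, both occurrences in the same stretch of text are wrapped.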

Good luck!

Clarification of Answer by runix-ga on 19 Jun 2002 11:34 PDT
ok! let's try this one!
And don't worry about the links issue, I've modified it so it doesn't
replace anything inside a tag (ie <span class='WORD'> won't be
replaced!)


<?
$data=file("sampletext.txt");
$words=array('more');

$data=add_spans($data,$words);
print $data;
print "\n\n\n";

function add_spans($data,$words){
    if (is_array($data)){$data=join('',$data);}
    $data=">".$data."<";
    foreach($words as $word){
        $data=preg_replace("/((?=>)[^<]*?[^\w])($word)([^\w][^<>]*)/i",
                            "$1<span>$2</span>$3",$data);
    }
    $data=substr($data,1,strlen($data)-2);
    return $data;
}

?>

Request for Answer Clarification by endquote-ga on 19 Jun 2002 18:43 PDT
It's still matching stuff in comments. Try it with:

$words=array('more', 'foxy', 'while'); 

and you'll see. Perhaps mohsen's comment would be helpful? I'll try
and hack on it some, too.

Request for Answer Clarification by endquote-ga on 19 Jun 2002 19:56 PDT
I made some progress:

<? 
$data=file("sampletext.txt"); 
$words=array('more', 'foxy', 'while', 'phoenix'); 

$data=add_spans($data,$words); 
print $data; 

function add_spans($data,$words){ 
    if(is_array($data)) { $data=join('',$data); } 
    foreach($words as $word){ 
    	$data = preg_replace("/($word)/i", "<b>\\1</b>", $data); // put tags around the word
    	$data = preg_replace("/(<[^<>]*)<b>($word)<\/b>([^<>]*>)/i", "\\1\\2\\3", $data); // remove them if it was in an html tag
    	$data = preg_replace("/(<a [^<>]*>.*)<b>($word)<\/b>(.*<\/a>)/i", "\\1\\2\\3", $data); // remove them if it was in a link
//		$data = preg_replace("/(<\!--[.\r\n]*-->)/", "", $data); // remove them if it was in a comment
//		$data = preg_replace("/(<\!--.*)<b>($word)<\/b>(.*-->)/i", "\\1\\2\\3", $data); // remove them if it was in a comment
    } 
    return $data; 
} 
 
?> 

Still can't seem to match a comment though!
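
A likely reason the two commented-out comment patterns fail: inside a character class the dot is a literal ".", and outside one it does not match a newline unless the s modifier is set. The sketch below illustrates both points and then strips the <b> markers inside comments with preg_replace_callback(). It assumes PHP 7 or later; unwrap_in_comments() is a hypothetical helper name and is not code from this thread.

```php
<?php
$html = "<!-- one\n<b>more</b> line -->";

// Inside [...] the dot is a literal ".", so [.\r\n]* cannot match the letters in a comment.
var_dump(preg_match('/<!--[.\r\n]*-->/', $html));   // int(0)

// Without the s modifier, "." does not match a newline either.
var_dump(preg_match('/<!--.*-->/', $html));         // int(0)

// With s, "." matches newlines; the lazy .*? stops at the first "-->".
var_dump(preg_match('/<!--.*?-->/s', $html));       // int(1)

// Hypothetical helper: remove <b>...</b> markers, but only inside comments.
function unwrap_in_comments(string $html): string
{
    return preg_replace_callback(
        '/<!--.*?-->/s',
        function (array $m): string {
            // $m[0] is one whole comment; keep the text, drop the <b> tags.
            return preg_replace('/<b>(.*?)<\/b>/s', '$1', $m[0]);
        },
        $html
    );
}

echo unwrap_in_comments($html . " <b>more</b>"), "\n";
```

The callback handles every marker inside a comment in one pass, so a comment containing several highlighted words needs no repeated runs.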

Clarification of Answer by runix-ga on 21 Jun 2002 06:15 PDT
This is the correct version, without regexes:

<?

$data=file("sampletext.txt");
$words=array('more', 'foxy', 'while', 'Phoenix', 'phoenixfest',
'real');

$data=add_spans($data,$words);
print $data;

function add_spans($data,$words){
    $pre = '<b><a href="http://tangent.cx/r.php?url=dev.endquote.com%2Findex.php%3Fid%3D376"
onmouseout="doTangent()"
onmouseover="doTangent(\'dev.endquote.com/index.php?id=376\',\'Pre-Phoenix
Festival.\',\'2001-07-04\',\'ly The Phoenix Festival is the day
aft\',\'dev.endquote.com/index.php?id=396\',\'Money and
DSL.\',\'2001-08-27\',\'CA World Sound Festival and maybe kick\')">';

    $post = '</a></b>';
	if (is_array($data)){$data=join('',$data);}
	$forb=array();
	$end=0;
	do{
		$start=strpos($data,"<!--",$end);
		if ($start === false){
			break;
		}
		$end=strpos($data,"-->",$start);
		$forb[]=array($start,$end);
	}while($start<strlen($data));
	$end=0;
	do{
		$start=strpos($data,"<",$end);
		if ($start === false){
			break;
		}
		$end=strpos($data,">",$start);
		$forb[]=array($start,$end);
	}while($start<strlen($data));
	$end=0;
	do{
		$start=strpos($data,"<a",$end);
		if ($start === false){
			break;
		}
		$end=strpos($data,"</a>",$start);
		$forb[]=array($start,$end);
	}while($start<strlen($data));


	$dataL=strtolower($data);
	foreach($words as $word){
		$word=strtolower($word);
		$pos=0;
		do{
			$pos=strpos($dataL,$word,$pos);
			if ($pos===false){break;}
			if (check($pos,$forb)){
				if ($pos>=1){
					$before=substr($data,$pos-1,1);
				}else{$before='';}
				if ($pos<strlen($data)){
					$after=substr($data,$pos+strlen($word),1);
				}else{$after='';}
				if (preg_match('/[a-z]/i',$before.$after)){ // eregi() was removed in PHP 7
					$pos++;
					continue;
				}
				$NEW=$pre.substr($data,$pos,strlen($word)).$post;
				$data=substr($data,0,$pos).$NEW.substr($data,$pos+strlen($word));
				$dataL=substr($dataL,0,$pos).$NEW.substr($dataL,$pos+strlen($word));
				$end=0;
				do{
					$start=strpos($NEW,"<",$end);
					if ($start === false){ break; }
					$end=strpos($NEW,">",$start);
#					print "new forbidden areas: ".($start+$pos)." , ".($end+$pos)."\n";
					$forb[]=array($start+$pos,$end+$pos);
				}while($start<strlen($NEW));

			        $forb=updateForb($pos,strlen($NEW)-strlen($word),$forb);

				$pos=$pos+7;
				break;
			}else{
			 $pos++;
			}
		}while($pos!=false and $pos<strlen($data));
	}
	return $data;
}
function updateForb($pos,$sum,$forb){
	$ret=array();
	foreach($forb as $f){
		list($start,$end)=$f;
		if ($pos<=$start){
			$start=$start+$sum;
			$end=$end+$sum;
		}
		$ret[]=array($start,$end);
	}
	return $ret;
}
function check($pos,$forb){
	foreach($forb as $f){
		list($start,$end)=$f;
		if ($pos>=$start and $pos<=$end){
			return 0;
		}
	}
	return 1;
}

?>
endquote-ga rated this answer: 5 out of 5 stars
Very very helpful!

Comments  
Subject: Re: Matching non-HTML in PHP with preg_replace()
From: mohsen-ga on 19 Jun 2002 08:59 PDT
 
Greetings!

Just a quick note. The point here is that it is easier to construct a
regex for the WORDs that should NOT be matched than for those that should.
So if I were you, I would do this in 2 phases. First, I would replace all
occurrences of WORD with <span>WORD</span>. That's easy and fast. Second,
I would change the WORDs that should not have been changed back to
their original form. To do that, you need to write a regex that matches any
"<span>WORD</span>" that is either inside a comment, within an HTML tag, or
between <a> and </a>. I am sure that's far simpler to handle and understand.

Hope it was helpful.
regards,
mohsen-ga
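
The two-phase idea described above could be sketched in PHP roughly as follows. This is an illustration under assumptions rather than tested production code. highlight_two_phase() is a hypothetical name, PHP 7 or later is assumed, and the control characters \x01 and \x02 are used as temporary markers on the assumption that they never occur in the input. Angle-bracket-free markers keep the tag patterns in phase 2 simple.

```php
<?php
// Hypothetical helper, not from the thread.
function highlight_two_phase(string $html, array $words): string
{
    $open = "\x01";   // temporary stand-in for <span>
    $close = "\x02";  // temporary stand-in for </span>

    // Phase 1: mark every whole-word occurrence, wherever it appears.
    foreach ($words as $word) {
        $pattern = '/\b(' . preg_quote($word, '/') . ')\b/i';
        $html = preg_replace($pattern, $open . '$1' . $close, $html);
    }

    // Phase 2: drop the markers again inside comments, links, and tags (in that order).
    $protected = [
        '/<!--.*?-->/s',            // HTML comments, possibly multi-line
        '/<a\b[^>]*>.*?<\/a>/is',   // <a ...>...</a>, including the link text
        '/<[^>]*>/',                // any remaining tag and its attributes
    ];
    foreach ($protected as $pattern) {
        $html = preg_replace_callback(
            $pattern,
            function (array $m) use ($open, $close): string {
                return str_replace([$open, $close], '', $m[0]);
            },
            $html
        );
    }

    // Finally turn the surviving markers into real tags.
    return str_replace([$open, $close], ['<span>', '</span>'], $html);
}

echo highlight_two_phase(
    "WORD, <!-- some WORD\nhere --> <a href=\"http://word.com/\">some word here</a> <p class='WORD'>WORD</p>",
    ['word']
), "\n";
```

In this example only the first WORD and the one between <p ...> and </p> end up wrapped; the occurrences in the comment, the link, and the class attribute are restored unchanged.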
Subject: Re: Matching non-HTML in PHP with preg_replace()
From: cbra-ga on 26 Jun 2002 09:28 PDT
 
Thanks to mohsen-ga!!

I had nearly the same problem in Perl:
Highlight words in an HTML page, but don't destroy the HTML tags.
Your two stage approach works for me:

my @farben = ( 'yellow', '#ffa0a0', '#a0ffa0', '#d0d0ff' );
foreach ( @pattern ) {
  $color = shift @farben;
  $resultColor = "<span style=\"background-color:$color\">"; 

   # now change EVERY match
  $text =~ s/$_/$resultColor$_<\/span>/sixg;  
   # now search for highlights within HTML:
   # a < not followed by a closing >,
   # followed by the span;
   # store the original tag text in $1 and the highlighted word in $2
  $text =~ s#(<[^>]*)<span [^>]+>([^<]+)<\/span>#$1$2#gsx;
}
