| Subject:
A simple Perl question
Category: Computers Asked by: corn_jan-ga List Price: $3.50 |
Posted:
25 Jun 2002 05:27 PDT
Expires: 25 Jul 2002 05:27 PDT Question ID: 32850 |
Hi,

Suppose I have a string $data which can contain a link with the following format:

<a href=/HTML LINK class=SubLink>TEXT<img src=IMAGE LINK width=8 height=7 border=0 hspace=4></a>

An example of such a link is the following:

<a href="/artikelen/Nieuws/1024894334943.html" class="SubLink">LPF-achterban boos over uitspraken Herben<img src="/ad.nl/images/more.gif" width="8" height="7" border="0" hspace="4"></a>

Note that only the items typed in capitals are variable. Suppose I would then like to remove the following portion from the link:

<img src="IMAGE LINK " width="8" height="7" border="0" hspace="4">

I need a command which removes this bit. Can you give me a Perl command, usable in a Perl script, which does this?

Suppose also that I have an array of strings containing several of these links. Can you also give me a Perl command which removes <img src="IMAGE LINK " width="8" height="7" border="0" hspace="4"> from all of the links in the array in one go?

Thanks,
Corne
|
| Subject:
Re: A simple Perl question
Answered By: leapinglizard-ga on 27 Jun 2002 09:08 PDT |
The following line of Perl declares a variable called $imagelink and assigns to it the first IMG tag it finds in a string called $data. If the string contains no IMG tags, $imagelink will be undefined, and any later attempt to use it will trigger "uninitialized value" warnings (or errors, if warnings are made fatal). Note that the character inside the square brackets is a zero, not a capital O.

my $imagelink = ($data =~ m/(<img\s+src.*?>)/is)[0];

The following Perl fragment assumes that you have an array of strings called @data_arr. It declares an array called @imagelink_arr, and adds to it the first IMG tag, if any, found in each element of @data_arr. You may later iterate over @imagelink_arr using a foreach loop, or assign it wholesale to some other array.

my @imagelink_arr = ();
push @imagelink_arr, ($_ =~ m/(<img\s+src.*?>)/is) foreach @data_arr;

These examples have been tested using version 5.6 of the Perl interpreter. Consult Perldoc.com for further detail on the syntax of Perl regular expressions.

http://www.perldoc.com/perl5.6/pod/perlre.html

Suggested search terms: perl regex regular expression manual help

Regards,
leapinglizard
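Since the lines above capture the IMG tag while the question asks for it to be removed, a minimal sketch of both steps may help. It reuses the same pattern; the sample string and the names $stripped and the second array element are made up for illustration, and this is only one way to do it, not necessarily what the answerer intended.

```perl
use strict;
use warnings;

# Hypothetical sample data in the format described in the question.
my $data = '<a href="/x.html" class="SubLink">TEXT<img src="/more.gif" width="8" height="7" border="0" hspace="4"></a>';

# Capture the first IMG tag; the slice yields undef when nothing matches.
my $imagelink = ($data =~ m/(<img\s+src.*?>)/is)[0];

if (defined $imagelink) {
    print "found: $imagelink\n";
} else {
    print "no IMG tag in \$data\n";
}

# To delete the tag instead of capturing it, substitute the empty string.
# Copying into $stripped first leaves $data itself untouched.
(my $stripped = $data) =~ s/<img\s+src.*?>//is;
print "$stripped\n";

# For an array, the loop variable aliases each element,
# so s/// edits the elements in place.
my @data_arr = ($data, 'no image here');
s/<img\s+src.*?>//is for @data_arr;
print "$_\n" for @data_arr;
```

Elements without an IMG tag are simply left unchanged, so no defined-check is needed for the substitution.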
|
| Subject:
Re: A simple Perl question
From: dagoon-ga on 25 Jun 2002 06:14 PDT |
This piece of code should do it:
use strict;
use warnings;
# sample data; the string is wrapped, so it contains line breaks
my @table;
$table[1] = $table[0] = '<a
href="/artikelen/Nieuws/1024894334943.html"
class="SubLink">LPF-achterban boos over uitspraken Herben<img
src="/ad.nl/images/more.gif" width="8" height="7" border="0"
hspace="4"></a>';
# parse: \s+ and the /s flag let the pattern match across the line breaks
for (my $i = 0; $i < @table; $i++)
{
    $table[$i] =~ s/(<a\s+href.*?\s+class="SubLink">.*)<img\s.*?>(.*)/$1$2/gis;
}
# test
foreach my $t (@table)
{
    print("$t\n");
}
| Subject:
Re: A simple Perl question
From: sean_projectscim_com-ga on 25 Jun 2002 07:47 PDT |
# Assuming your variable $data is already set to the string you mentioned,
# this code will strip the image tag from the $data string:
$data =~ s/<img[^>]*>//;
# If you have a Perl array, @data, the easiest way is like this:
for ($x = 0; $x <= $#data; $x++) {
    $data[$x] =~ s/<img[^>]*>//;
    print $data[$x]; print "\n";
}
| Subject:
Re: A simple Perl question
From: juerd-ga on 02 Jul 2002 14:33 PDT |
> my $imagelink = ($data =~ m/(<img\s+src.*?>)/is)[0];
That can also be written (prettier) as:
my ($imagelink) = $data =~ /(<img\s+src.*?>)/is;
As for parsing HTML with a regex: don't. The perldoc "perlfaq9" says:
***
How do I remove HTML from a string?
The most correct way (albeit not the fastest) is
to use HTML::Parser from CPAN. Another mostly
correct way is to use HTML::FormatText which not
only removes HTML but also attempts to do a little
simple formatting of the resulting plain text.
Many folks attempt a simple-minded regular
expression approach, like "s/<.*?>//g", but that
fails in many cases because the tags may continue
over line breaks, they may contain quoted
angle-brackets, or HTML comments may be present.
Plus, folks forget to convert entities--like
"&lt;" for example.
Here's one "simple-minded" approach, that works
for most files:
#!/usr/bin/perl -p0777
s/<(?:[^>'"]*|(['"]).*?\1)*>//gs
If you want a more complete solution, see the
3-stage striphtml program in
http://www.perl.com/CPAN/authors/Tom_Christiansen/scripts/striphtml.gz .
Here are some tricky cases that you should think
about when picking a solution:
<IMG SRC = "foo.gif" ALT = "A > B">
<IMG SRC = "foo.gif"
ALT = "A > B">
<!-- <A comment> -->
<script>if (a<b && a>c)</script>
<# Just data #>
<![INCLUDE CDATA [ >>>>>>>>>>>> ]]>
If HTML comments include other tags, those
solutions would also break on text like this:
<!-- This section commented out.
<B>You can't see me!</B>
-->
***
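To make the FAQ's warning concrete, here is a small sketch that runs both patterns quoted above against the first tricky case, a tag with a quoted angle bracket. The trailing word "text" and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# First tricky case from the FAQ, followed by some hypothetical plain text.
my $html = '<IMG SRC = "foo.gif" ALT = "A > B">text';

# Naive pattern: stops at the first ">", which sits inside the ALT value.
(my $naive = $html) =~ s/<.*?>//g;

# The FAQ's "simple-minded" pattern: skips over quoted attribute values.
(my $quote_aware = $html) =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;

print "naive:       [$naive]\n";        # prints: naive:       [ B">text]
print "quote-aware: [$quote_aware]\n";  # prints: quote-aware: [text]
```

The quote-aware pattern handles this case but, as the FAQ itself notes, it still breaks on comments, scripts, and the other cases listed, which is the argument for using a real parser.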
HTML::TreeBuilder's look_down method can be quite handy for simple HTML parsing.