Google Answers: A simple Perl question

Q: A simple Perl question ( Answered, 3 Comments )

Question

Subject: A simple Perl question
Category: Computers
Asked by: corn_jan-ga
List Price: $3.50

Posted: 25 Jun 2002 05:27 PDT
Expires: 25 Jul 2002 05:27 PDT
Question ID: 32850

Hi,

Suppose I have a string $data which can contain a link with the
following format:

<a href="/HTML LINK" class="SubLink">TEXT<img src="IMAGE LINK"
width="8" height="7" border="0" hspace="4"></a>

An example of such a link is the following:

<a href="/artikelen/Nieuws/1024894334943.html"
class="SubLink">LPF-achterban boos over uitspraken Herben<img
src="/ad.nl/images/more.gif" width="8" height="7" border="0"
hspace="4"></a>

Note that only the items typed in capitals are variable. 

Suppose I would then like to remove the following portion from the
link:
<img src="IMAGE LINK " width="8" height="7" border="0" hspace="4">

I need a command which removes this bit.
Can you give me a Perl command which I can use in a Perl script which
does this?

Suppose also that I have an array of strings containing several of
these links. Can you also give me a Perl command which removes <img
src="IMAGE LINK " width="8" height="7" border="0" hspace="4"> from all
of the links in the array in one go?

Thanks,
Corne

Answer

Subject: Re: A simple Perl question
Answered By: leapinglizard-ga on 27 Jun 2002 09:08 PDT

The following line of Perl declares a variable called $imagelink and
assigns to it the first IMG tag it finds in a string called $data. If
the string contains no IMG tags, $imagelink will be undefined, and any
later use of it will produce "uninitialized value" warnings when
warnings are enabled. Note that the character inside the square
brackets is a zero, not a capital O.

my $imagelink = ($data =~ m/(<img\s+src.*?>)/is)[0];

The following Perl fragment assumes that you have an array of strings
called @data_arr. It declares an array called @imagelink_arr, and adds
to it the first IMG tag, if any, found in each element of @data_arr.
You may later iterate over @imagelink_arr using the foreach operator,
or assign it wholesale to some other array.

my @imagelink_arr = ();
push @imagelink_arr, ($_ =~ m/(<img\s+src.*?>)/is) foreach @data_arr;

These examples have been tested using version 5.6 of the Perl
interpreter.
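
As a minimal sketch of how the two fragments above behave when a string has no IMG tag: the scalar version leaves $imagelink undefined, and the array version simply adds nothing for that element, so @imagelink_arr can be shorter than @data_arr. The sample strings and the $status variable are hypothetical and only serve the illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data: one link with an IMG tag, one without.
my $with_img    = '<a href="/a.html" class="SubLink">TEXT<img src="/more.gif" width="8"></a>';
my $without_img = '<a href="/b.html" class="SubLink">TEXT</a>';

# Scalar case: the match fails, so the list slice yields undef.
my $imagelink = ($without_img =~ m/(<img\s+src.*?>)/is)[0];
my $status = defined $imagelink ? "found $imagelink" : "no IMG tag";

# Array case: a failed match returns an empty list, so nothing is pushed
# for elements that contain no IMG tag.
my @data_arr = ($with_img, $without_img);
my @imagelink_arr = ();
push @imagelink_arr, ($_ =~ m/(<img\s+src.*?>)/is) foreach @data_arr;

print "$status\n";
print scalar(@imagelink_arr), " tag(s) captured\n";
```

Checking with defined before using $imagelink avoids the warnings described above.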

Consult Perldoc.com for further detail on the syntax of Perl regular
expressions.
http://www.perldoc.com/perl5.6/pod/perlre.html
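
The fragments above capture the IMG tag. If the goal is to delete it, as the question asks, a substitution is one option. The following is only a sketch: it assumes that no attribute value inside the IMG tag contains a ">" character, and the variable $stripped is a hypothetical name introduced here.

```perl
use strict;
use warnings;

# Sample input taken from the example link in the question.
my $data = '<a href="/artikelen/Nieuws/1024894334943.html" class="SubLink">'
         . 'LPF-achterban boos over uitspraken Herben'
         . '<img src="/ad.nl/images/more.gif" width="8" height="7" border="0" hspace="4"></a>';

# Single string: copy $data, then delete every IMG tag from the copy.
# Assumption: attribute values never contain a literal ">".
(my $stripped = $data) =~ s/<img\s[^>]*>//gi;

# Array of strings: the statement modifier aliases $_ to each element,
# so the substitution edits the array in place.
my @data_arr = ($data, $data);
s/<img\s[^>]*>//gi for @data_arr;

print "$stripped\n";
```

To change $data itself rather than a copy, apply the substitution directly: $data =~ s/<img\s[^>]*>//gi;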

Suggested search terms:
perl regex regular expression manual help

Regards,
leapinglizard

Comments

Subject: Re: A simple Perl question
From: dagoon-ga on 25 Jun 2002 06:14 PDT

This piece of code should do:

$table[1] = $table[0] = '<a
href="/artikelen/Nieuws/1024894334943.html"
class="SubLink">LPF-achterban boos over uitspraken Herben<img
src="/ad.nl/images/more.gif" width="8" height="7" border="0"
hspace="4"></a>';

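# parse
# \s+ and the /s modifier let the pattern match even when the link
# is wrapped over several lines, as in the sample string above
for ($i=0; $i<@table; $i++)
{
	$table[$i] =~ s/(<a\s+href.*?\s+class="SubLink">.*?)<img\s.*?>/$1/gis;
} 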

# test
foreach  $t (@table) 
{
	print("$t\n");
}

Subject: Re: A simple Perl question
From: sean_projectscim_com-ga on 25 Jun 2002 07:47 PDT

# Assuming your variable $data is already set to the string you mentioned,
# this code will strip the image from the $data string:
$data =~ s/<img[^>]*>//;

# If you have a Perl array, @data, the easiest way is like this:

for($x=0; $x<= $#data; $x++) {
  $data[$x] =~ s/<img[^>]*>//;
  print $data[$x]; print "\n";
}

Subject: Re: A simple Perl question
From: juerd-ga on 02 Jul 2002 14:33 PDT

> my $imagelink = ($data =~ m/(<img\s+src.*?>)/is)[0]; 

That can also be written (prettier) as:

my ($imagelink) = $data =~ /(<img\s+src.*?>)/is;


As for parsing HTML with a regex: don't. The perldoc "perlfaq9" says:

***

       How do I remove HTML from a string?

               The most correct way (albeit not the fastest) is
               to use HTML::Parser from CPAN.  Another mostly
               correct way is to use HTML::FormatText which not
               only removes HTML but also attempts to do a little
               simple formatting of the resulting plain text.

               Many folks attempt a simple-minded regular
               expression approach, like "s/<.*?>//g", but that
               fails in many cases because the tags may continue
               over line breaks, they may contain quoted
               angle-brackets, or HTML comment may be present.
               Plus, folks forget to convert entities--like "&lt;"
               for example.

               Here's one "simple-minded" approach, that works
               for most files:

                   #!/usr/bin/perl -p0777
                   s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

               If you want a more complete solution, see the
               3-stage striphtml program in
               http://www.perl.com/CPAN/authors/Tom_Christiansen/scripts/striphtml.gz .

               Here are some tricky cases that you should think
               about when picking a solution:

                   <IMG SRC = "foo.gif" ALT = "A > B">

                   <IMG SRC = "foo.gif"
                        ALT = "A > B">

                   <!-- <A comment> -->

                   <script>if (a<b && a>c)</script>

                   <# Just data #>

                   <![INCLUDE CDATA [ >>>>>>>>>>>> ]]>

               If HTML comments include other tags, those
               solutions would also break on text like this:

                   <!-- This section commented out.
                       <B>You can't see me!</B>
                   -->

***
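
To make the first tricky case concrete, here is a small sketch that runs both patterns quoted above on a tag whose ALT text contains ">". The trailing word "after" and the variable names are hypothetical additions for the demonstration.

```perl
use strict;
use warnings;

# First tricky case from the FAQ, followed by some plain text.
my $html = '<IMG SRC = "foo.gif" ALT = "A > B">after';

# Naive pattern: stops at the ">" inside the quoted ALT value.
(my $naive = $html) =~ s/<.*?>//g;

# Quote-aware pattern from the FAQ: skips over quoted attribute values.
(my $careful = $html) =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;

print "naive:   [$naive]\n";
print "careful: [$careful]\n";
```

The naive pattern leaves part of the tag behind, while the quote-aware one removes the whole tag. Neither handles the comment and script cases listed in the FAQ.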

HTML::TreeBuilder's look_down method can be quite handy for simple HTML parsing.
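
A rough sketch of that approach, assuming the HTML::TreeBuilder module from CPAN is installed (it is not part of core Perl); the method is spelled look_down. The sample string is the link from the question, and @links is a hypothetical name.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # CPAN module; must be installed separately

# Sample input taken from the example link in the question.
my $html = '<a href="/artikelen/Nieuws/1024894334943.html" class="SubLink">'
         . 'LPF-achterban boos over uitspraken Herben'
         . '<img src="/ad.nl/images/more.gif" width="8" height="7" border="0" hspace="4"></a>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# In list context look_down returns every element matching the criteria;
# delete detaches each IMG element from the tree.
$_->delete for $tree->look_down(_tag => 'img');

# Serialize only the SubLink anchors, not the html/body wrapper
# that the parser adds around the fragment.
my @links = map { $_->as_HTML } $tree->look_down(_tag => 'a', class => 'SubLink');
print "$_\n" for @links;

# Free the tree explicitly when done.
$tree->delete;
```

The serialized HTML may differ from the input in whitespace and attribute quoting, so compare results by content rather than byte for byte.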
