Subject: A simple Perl question
Category: Computers
Asked by: corn_jan-ga
List Price: $3.50
Posted: 25 Jun 2002 05:27 PDT
Expires: 25 Jul 2002 05:27 PDT
Question ID: 32850
Hi,

Suppose I have a string $data which can contain a link with the following format:

    <a href=/HTML LINK class=SubLink>TEXT<img src=IMAGE LINK width =8 height=7 border=0 hspace=4></a>

An example of such a link is the following:

    <a href="/artikelen/Nieuws/1024894334943.html" class="SubLink">LPF-achterban boos over uitspraken Herben<img src="/ad.nl/images/more.gif" width="8" height="7" border="0" hspace="4"></a>

Note that only the items typed in capitals are variable.

Suppose I would then like to remove the following portion from the link:

    <img src="IMAGE LINK " width="8" height="7" border="0" hspace="4">

Can you give me a Perl command that removes this portion, which I can use in a Perl script?

Suppose also that I have an array of strings containing several of these links. Can you also give me a Perl command which removes

    <img src="IMAGE LINK " width="8" height="7" border="0" hspace="4">

from all of the links in the array in one go?

Thanks,
Corne
Subject: Re: A simple Perl question
Answered By: leapinglizard-ga on 27 Jun 2002 09:08 PDT
The following line of Perl declares a variable called $imagelink and assigns to it the first IMG tag it finds in a string called $data. If the string contains no IMG tags, $imagelink will remain undefined, and any later attempt to use it may trigger warnings or errors. Note that the character inside the square brackets is a zero, not a capital O.

    my $imagelink = ($data =~ m/(<img\s+src.*?>)/is)[0];

The following Perl fragment assumes that you have an array of strings called @data_arr. It declares an array called @imagelink_arr, and adds to it the first IMG tag, if any, found in each element of @data_arr. You may later iterate over @imagelink_arr using a foreach loop, or assign it wholesale to some other array.

    my @imagelink_arr = ();
    push @imagelink_arr, ($_ =~ m/(<img\s+src.*?>)/is) foreach @data_arr;

These examples have been tested using version 5.6 of the Perl interpreter. Consult Perldoc.com for further detail on the syntax of Perl regular expressions.

http://www.perldoc.com/perl5.6/pod/perlre.html

Suggested search terms: perl regex regular expression manual help

Regards,
leapinglizard
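Note that the lines above extract the IMG tag; they do not delete it from the string. A minimal sketch of the removal the question asks for, reusing the same pattern in a substitution, might look like the following. The sample strings and the name $img_re are made up for illustration, and the pattern is assumed to be adequate only for links in the exact format described in the question.

```perl
use strict;
use warnings;

# Same pattern as in the answer above, compiled once ($img_re is a hypothetical name).
my $img_re = qr/<img\s+src.*?>/is;

# Single string: delete the first IMG tag, if there is one.
my $data = '<a href="/a.html" class="SubLink">Headline<img src="/more.gif" width="8" height="7" border="0" hspace="4"></a>';
$data =~ s/$img_re//;

# Array of strings: foreach aliases $_ to each element, so the
# substitution modifies the array in place.
my @data_arr = (
    '<a href="/b.html" class="SubLink">One<img src="/more.gif" width="8"></a>',
    '<a href="/c.html" class="SubLink">Two</a>',    # no IMG tag: left unchanged
);
s/$img_re// foreach @data_arr;

print "$_\n" for $data, @data_arr;
```

Because a substitution that finds no match simply leaves the string alone, no check for a missing IMG tag is needed here.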
Subject: Re: A simple Perl question
From: dagoon-ga on 25 Jun 2002 06:14 PDT
This piece of code should do it:

    $table[1] = $table[0] = '<a href="/artikelen/Nieuws/1024894334943.html" class="SubLink">LPF-achterban boos over uitspraken Herben<img src="/ad.nl/images/more.gif" width="8" height="7" border="0" hspace="4"></a>';

    # parse
    for ($i=0; $i<@table; $i++) {
        $table[$i] =~ s/(<a href.*? class="SubLink">.*)<img .*?>(.*)/$1$2/gi;
    }

    # test
    foreach $t (@table) {
        print("$t\n");
    }
Subject: Re: A simple Perl question
From: sean_projectscim_com-ga on 25 Jun 2002 07:47 PDT
    # You already have the string you mentioned in the variable $data.
    # This code strips the image from the $data string:
    $data =~ s/<img[^>]*>//;

    # If you have a Perl array, @data, the easiest way is like this:
    for ($x = 0; $x <= $#data; $x++) {
        $data[$x] =~ s/<img[^>]*>//;
        print $data[$x];
        print "\n";
    }
Subject: Re: A simple Perl question
From: juerd-ga on 02 Jul 2002 14:33 PDT
> my $imagelink = ($data =~ m/(<img\s+src.*?>)/is)[0];

That can also be written (prettier) as:

    my ($imagelink) = $data =~ /(<img\s+src.*?>)/is;

As for parsing HTML with a regex: don't. The perldoc "perlfaq9" says:

***
How do I remove HTML from a string?

The most correct way (albeit not the fastest) is to use HTML::Parser from CPAN. Another mostly correct way is to use HTML::FormatText which not only removes HTML but also attempts to do a little simple formatting of the resulting plain text.

Many folks attempt a simple-minded regular expression approach, like "s/<.*?>//g", but that fails in many cases because the tags may continue over line breaks, they may contain quoted angle-brackets, or HTML comment may be present. Plus, folks forget to convert entities--like "&lt;" for example.

Here's one "simple-minded" approach, that works for most files:

    #!/usr/bin/perl -p0777
    s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

If you want a more complete solution, see the 3-stage striphtml program in http://www.perl.com/CPAN/authors/Tom_Christiansen/scripts/striphtml.gz .

Here are some tricky cases that you should think about when picking a solution:

    <IMG SRC = "foo.gif" ALT = "A > B">

    <IMG SRC = "foo.gif"
         ALT = "A > B">

    <!-- <A comment> -->

    <script>if (a<b && a>c)</script>

    <# Just data #>

    <![INCLUDE CDATA [ >>>>>>>>>>>> ]]>

If HTML comments include other tags, those solutions would also break on text like this:

    <!-- This section commented out.
        <B>You can't see me!</B>
    -->
***

HTML::TreeBuilder's look_down method can be quite handy for simple HTML parsing.
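To illustrate the first tricky case from the FAQ excerpt, this sketch compares the naive pattern posted earlier in this thread with the quote-aware pattern quoted from perlfaq9. The sample string and the variable names are invented for the example, and neither pattern is a substitute for a real HTML parser.

```perl
use strict;
use warnings;

# An IMG tag whose ALT attribute contains a quoted '>' (made-up sample input).
my $html = '<IMG SRC = "foo.gif" ALT = "A > B">caption';

# Naive pattern: stops at the first '>', even when it sits inside quotes.
(my $naive = $html) =~ s/<img[^>]*>//i;

# Quote-aware pattern from perlfaq9: a quoted string is consumed as a unit,
# so a '>' inside it does not end the tag.
(my $faq = $html) =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;

print "naive: [$naive]\n";
print "faq:   [$faq]\n";
```

The naive substitution leaves the tail of the attribute behind, while the perlfaq9 pattern removes the whole tag. Note that the perlfaq9 pattern strips every tag it finds, not only IMG tags, so it serves a different purpose from the one in the original question.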