Q: Regex ( Answered 5 out of 5 stars,   1 Comment )
Question  
Subject: Regex
Category: Computers > Programming
Asked by: grabby-ga
List Price: $20.00
Posted: 28 Jul 2004 09:20 PDT
Expires: 27 Aug 2004 09:20 PDT
Question ID: 380299
I want to extract a URL from some HTML into a variable using a Perl
regex. The URL *always* ends in page0001.html.

e.g. http://some.long.protracted.url.domain/and/some/other/stuff/page0001.html

Clarification of Question by grabby-ga on 28 Jul 2004 09:32 PDT
By the way, I just need the regex, not the whole script.

Request for Question Clarification by palitoy-ga on 28 Jul 2004 10:47 PDT
After greyknight's comments, do you require any further information on
this subject?  If you do I would be happy to help in any way that I
can.

Request for Question Clarification by palitoy-ga on 28 Jul 2004 10:58 PDT
Assuming you have the HTML page in a variable called $html, the
following few lines will extract the URL you require:

$html =~ m/href\=(.*)page0001\.html/ ;
$url_matched = $1 ;

The first line finds an href attribute and captures everything between
"href=" and the last "page0001.html" on the same line (the .* is
greedy). The capture does not include "page0001.html" itself. The
second line copies the captured text ($1) into $url_matched.

$url_matched may begin with a ' or " character if the attribute value
is quoted in the html code. To exclude it, add \' or \" to the regex
immediately before (.*).

If this is sufficient for your answer please let me know and I will
post it as an official answer.
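
The href-based match described above could be sketched as follows. This is only an illustration under stated assumptions: the helper name extract_href_url and the sample HTML are hypothetical, and unlike the two lines above it captures the whole URL, including page0001.html, and skips an optional quote after href=.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first href value ending in
# page0001.html, or undef when nothing matches.
sub extract_href_url {
    my ($html) = @_;

    # ['"]? skips an optional opening quote. The character class
    # [^'"\s>] cannot run past the end of the attribute value, so the
    # capture stays inside a single href.
    return $html =~ m/href=['"]?([^'"\s>]*page0001\.html)/i ? $1 : undef;
}

# Assumed sample input; the real page would come from your scraper.
my $html = q{<a href="http://some.long.protracted.url.domain/and/some/other/stuff/page0001.html">report</a>};

my $url_matched = extract_href_url($html);
print defined $url_matched ? "$url_matched\n" : "no match\n";
```

Testing the return value before using it avoids picking up a stale $1 left over from an earlier successful match.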

Clarification of Question by grabby-ga on 28 Jul 2004 11:00 PDT
Thanks greyknight! This method from the O'Reilly book seems a little
overboard really. I simply need to match the
http://---->/page0001.html. I'd need a code fragment to put this into a
variable. (This is a continuation of the question from yesterday!)
Palitoy, can you help?

Clarification of Question by grabby-ga on 28 Jul 2004 11:05 PDT
These are the first 3 lines of the HTML that I am trying to grab the URL from:

<HTML>
<head><title>report</title></head>
<SCRIPT>location.replace('http://foo.bar.com/something/here/blah/page0001.html');
Answer  
Subject: Re: Regex
Answered By: palitoy-ga on 28 Jul 2004 11:29 PDT
Rated: 5 out of 5 stars
 
Hello grabby

This is the regex you require.

Assuming that the above text is in a variable called $html after you
have scraped the page:

$html =~ m/replace\(\'(.*)page0001\.html/ ;
$url_matched = $1 ;

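For your example above $url_matched will now be equal to:
http://foo.bar.com/something/here/blah/

Note that the capture stops just before page0001.html. To capture the
complete URL, move the closing parenthesis to after page0001\.html.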

If you need any more information on this please ask for clarification
and I will do my best to help.  Similarly if you would like this
explained more fully I would be glad to help.
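
As a hedged sketch of the regex above applied to the three sample lines from the question: the helper name extract_redirect_url is hypothetical, and this variant captures the complete URL (including page0001.html) and accepts either quote character, which goes slightly beyond the two lines given in the answer.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the URL passed to location.replace(...)
# if it ends in page0001.html, otherwise undef.
sub extract_redirect_url {
    my ($html) = @_;

    # [^'"]* keeps the capture inside the quoted string, so it cannot
    # spill over into later text on the same line.
    return $html =~ m/replace\(\s*['"]([^'"]*page0001\.html)['"]/ ? $1 : undef;
}

# The sample HTML supplied in the question clarification.
my $html = <<'END_HTML';
<HTML>
<head><title>report</title></head>
<SCRIPT>location.replace('http://foo.bar.com/something/here/blah/page0001.html');
END_HTML

my $url_matched = extract_redirect_url($html);
print defined $url_matched ? "$url_matched\n" : "no match\n";
# prints http://foo.bar.com/something/here/blah/page0001.html
```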

Request for Answer Clarification by grabby-ga on 28 Jul 2004 13:26 PDT
hmmm, this doesn't seem to work for me, any chance of a standalone
perl script that I can pipe stdout of my script into so that I can
check the regex? (and my own sanity!)   :-)
Again, I'll be nice with the tips :-)

Clarification of Answer by palitoy-ga on 29 Jul 2004 00:38 PDT
Does this suit your needs?

#!/usr/bin/perl
use strict;
use warnings;

# so what are we using?
use LWP::UserAgent;

# start the browser
my $browser = LWP::UserAgent->new();

# the url of the file to fetch
# for my example I saved the html code you supplied in the clarification
# as test.html on my local web server; it can be any URL on the internet
my $url = "http://localhost/test.html";

# get the file and stop if the request failed
my $response = $browser->get($url);
die "Could not fetch $url: ", $response->status_line, "\n"
    unless $response->is_success;

# put the page into a variable
my $html = $response->content;

# find the section you require with the regex,
# and only use $1 if the match succeeded
if ( $html =~ m/replace\(\'(.*)page0001\.html/ ) {
    my $url_matched = $1;

    # print the matched url onto the screen
    print $url_matched, "\n";
}
else {
    print "No match found\n";
}

# quit the program
exit(0);
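
Since the request was for something that the output of another script can be piped into, here is a minimal sketch of a filehandle-based variant. The helper name extract_url_from_handle is hypothetical, and the demo reads from an in-memory filehandle so that it runs on its own. It also captures the full URL, which differs from the regex in the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: slurps everything from a filehandle and returns
# the URL ending in page0001.html, or undef when nothing matches.
sub extract_url_from_handle {
    my ($fh) = @_;

    local $/;               # slurp mode: read the whole input at once
    my $html = <$fh>;
    return unless defined $html;

    return $html =~ m/replace\(\s*['"]([^'"]*page0001\.html)/ ? $1 : undef;
}

# Demo input held in a string; an in-memory filehandle stands in for
# the piped data.
my $sample = "<SCRIPT>location.replace('http://foo.bar.com/something/here/blah/page0001.html');\n";
open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";

my $url = extract_url_from_handle($fh);
print defined $url ? "$url\n" : "no match\n";
```

To use it as a filter, pass \*STDIN instead of $fh and run it as, for example, "yourscript | perl extract.pl" (both names are placeholders).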

Clarification of Answer by palitoy-ga on 29 Jul 2004 01:36 PDT
Thanks for the 5-star rating and generous tip.  If you need any more
information on regular expressions let me know.
grabby-ga rated this answer: 5 out of 5 stars and gave an additional tip of: $25.00
you beauty!

Comments  
Subject: Re: Regex
From: greyknight-ga on 28 Jul 2004 10:41 PDT
 
\b
# Match the leading part (proto://hostname, or just hostname)
(
    # http://, or https:// leading part
    (https?)://[-\w]+(\.\w[-\w]*)+
  |
    # or, try to find a hostname with our more specific sub-expression
    (?i: [a-z0-9] (?:[-a-z0-9]*[a-z0-9])? \. )+ # sub domains
    # Now ending .com, etc. For these, require lowercase
    (?-i: com\b
        | edu\b
        | biz\b
        | gov\b
        | in(?:t|fo)\b # .int or .info
        | mil\b
        | net\b
        | org\b
        | [a-z][a-z]\b # two-letter country codes
    )
)

# Allow an optional port number
( : \d+ )?

# We definitely need at least one /

(/)


# This part of the URL is optional 
(
     # The rest are heuristics for what seems to work well
     [^.!,?;"'<>()\[\]{}\s\x7F-\xFF]*
     (?:
        [.!,?]+  [^.!,?;"'<>()\[\]{}\s\x7F-\xFF]+
     )*
)?

# It should end in page0001.html

( page0001\.html )



Most of this regular expression was borrowed from Jeffrey Friedl, who
wrote some excellent books on regular expressions (e.g. Mastering
Regular Expressions). Because it contains whitespace and # comments, it
has to be used with the /x modifier.
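
To show how a commented pattern like this is applied in Perl, here is a deliberately simplified sketch. It is not Friedl's full expression: the pattern only handles http:// and https:// URLs, and the names $url_re and find_report_url are hypothetical. The part to note is the /x modifier, which lets the whitespace and # comments sit inside the regex.

```perl
use strict;
use warnings;

# Simplified pattern, assumed for illustration only.
my $url_re = qr{
    \b
    (
        https?:// [-\w]+ (?: \. \w [-\w]* )+   # protocol and hostname
        (?: : \d+ )?                           # optional port number
        /                                      # at least one slash
        [^"'<>()\s]*?                          # path, as short as possible
        page0001\.html                         # required ending
    )
}x;

# Hypothetical helper: returns the first matching URL in the text,
# or undef when there is none.
sub find_report_url {
    my ($text) = @_;
    return $text =~ $url_re ? $1 : undef;
}

my $text = q{see http://some.long.protracted.url.domain/and/some/other/stuff/page0001.html for details};
my $found = find_report_url($text);
print defined $found ? "$found\n" : "no match\n";
```

Compiling the pattern once with qr and reusing it keeps the match site short, and the {} delimiters avoid having to escape the slashes in the pattern.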
