Google Answers: Regex

Question

Subject: Regex
Category: Computers > Programming
Asked by: grabby-ga
List Price: $20.00

Posted: 28 Jul 2004 09:20 PDT
Expires: 27 Aug 2004 09:20 PDT
Question ID: 380299

I want to extract a URL from some HTML into a variable using a Perl regex. The URL always ends in page0001.html, e.g. http://some.long.protracted.url.domain/and/some/other/stuff/page0001.html
Clarification of Question by grabby-ga on 28 Jul 2004 09:32 PDT
By the way, I just need the regex, not the whole script.

Request for Question Clarification by palitoy-ga on 28 Jul 2004 10:47 PDT
After greyknight's comments, do you require any further information on this subject? If you do, I would be happy to help in any way that I can.

Request for Question Clarification by palitoy-ga on 28 Jul 2004 10:58 PDT
Assuming you have the HTML page in a variable called $html, the following two lines will extract the URL you require:

$html =~ m/href\=(.*)page0001\.html/ ;
$url_matched = $1 ;

The first line looks for an href reference in the HTML code, matches everything up to "page0001.html" and remembers it. The second line stores the remembered text in $url_matched. Depending on how the HTML was written, $url_matched may begin with a ' or " character; if it does, it can be removed by adding \' or \" to the regex immediately before (.*).
If this is sufficient for your answer, please let me know and I will post it as an official answer.

Clarification of Question by grabby-ga on 28 Jul 2004 11:00 PDT
Thanks greyknight! This method from the O'Reilly book seems a little overboard really. I simply need to match the http://---->/page0001.html and I'd need a code fragment to put this into a variable. (This is a continuation of the question from yesterday!) Palitoy, can you help?

Clarification of Question by grabby-ga on 28 Jul 2004 11:05 PDT
These are the first 3 lines of the HTML that I am trying to grab the URL from:

<HTML>
<head><title>report</title></head>
<SCRIPT>location.replace('http://foo.bar.com/something/here/blah/page0001.html');

Answer

Subject: Re: Regex
Answered By: palitoy-ga on 28 Jul 2004 11:29 PDT
Rated: 5 out of 5 stars

Hello grabby,

This is the regex you require. Assuming that the above text is in a variable called $html after you have scraped the page:

$html =~ m/replace\(\'(.*)page0001\.html/ ;
$url_matched = $1 ;

For your example above, $url_matched will now be equal to:

http://foo.bar.com/something/here/blah/

If you need any more information on this, please ask for clarification and I will do my best to help. Similarly, if you would like this explained more fully, I would be glad to help.
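
Note that the regex above captures only the part of the URL before page0001.html, and that $1 keeps its old value if the match fails. If the complete URL is wanted, one possible variation is to move the closing parenthesis and test the match before using $1. This is only a sketch and not part of the original answer; it assumes the URL is wrapped in single quotes inside location.replace(), as in the sample HTML from the question.

```perl
use strict;
use warnings;

# Sample input taken from the asker's clarification.
my $html = q{<HTML> <head><title>report</title></head> <SCRIPT>location.replace('http://foo.bar.com/something/here/blah/page0001.html');};

# Capture the whole URL, including page0001.html, and only use $1
# when the match actually succeeded.
my $url_matched;
if ($html =~ m/replace\('([^']*page0001\.html)'/) {
    $url_matched = $1;
}

print defined $url_matched ? "$url_matched\n" : "no match\n";
```

Using [^']* instead of .* keeps the capture inside the quoted string, so other text on the same line cannot be swallowed.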
Request for Answer Clarification by grabby-ga on 28 Jul 2004 13:26 PDT
Hmmm, this doesn't seem to work for me. Any chance of a standalone Perl script that I can pipe the stdout of my script into, so that I can check the regex (and my own sanity!)? :-) Again, I'll be nice with the tips :-)
Clarification of Answer by palitoy-ga on 29 Jul 2004 00:38 PDT
Does this suit your needs?

#!/usr/bin/perl

# so what are we using?
use LWP;

# start the browser
$browser = LWP::UserAgent->new();

# the url of the file to fetch
# for my example I created a file called test.html with the html code
# you supplied in the clarification; it can, though, be anything on
# the internet
$url = "http://localhost/test.html";

# get the file
$response = $browser->get($url);

# put the page into a variable
$html = $response->content ;

# find the section you require with the regex
$html =~ m/replace\(\'(.*)page0001\.html/ ;
$url_matched = $1 ;

# print the matched url onto the screen
print $url_matched;

# quit the program
exit(0);
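
The script above fetches the page itself, whereas the request was for something that reads piped input. A minimal sketch of that variation follows. It is an assumption about what would suit the asker, not something posted in the thread: the helper name match_prefix and the file name check_regex.pl are made up for illustration, and the // operator needs Perl 5.10 or later.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: applies the regex from the answer and returns the
# captured text that precedes page0001.html, or undef when nothing matches.
sub match_prefix {
    my ($html) = @_;
    return $html =~ m/replace\('(.*)page0001\.html/ ? $1 : undef;
}

# Read everything piped in on STDIN; skip the read when run interactively.
my $html = '';
unless (-t STDIN) {
    local $/;                 # slurp mode: read all input at once
    $html = <STDIN> // '';
}

my $prefix = match_prefix($html);
print defined $prefix ? "matched: $prefix\n" : "no match\n";
```

It could be used as: ./yourscript | perl check_regex.pl (where yourscript stands for whatever produces the HTML). A "no match" line then points at the input rather than at the regex handling.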
Clarification of Answer by palitoy-ga on 29 Jul 2004 01:36 PDT
Thanks for the 5-star rating and generous tip. If you need any more information on regular expressions, let me know.

grabby-ga rated this answer: 5 out of 5 stars

and gave an additional tip of: $25.00

you beauty!

Comments

Subject: Re: Regex
From: greyknight-ga on 28 Jul 2004 10:41 PDT

\b
# Match the leading part (proto://hostname, or just hostname)
(
    # http://, or https:// leading part
    (https?)://[-\w]+(\.\w[-\w]*)+
  |
    # or, try to find a hostname with our more specific sub-expression
    (?i: [a-z0-9] (?:[-a-z0-9]*[a-z0-9])? \. )+ # sub domains
    # Now ending .com, etc. For these, require lowercase
    (?-i: com\b
        | edu\b
        | biz\b
        | gov\b
        | in(?:t|fo)\b # .int or .info
        | mil\b
        | net\b
        | org\b
        | [a-z][a-z]\b # two-letter country codes
    )
)

# Allow an optional port number
( : \d+ )?

# We definitely need at least one /

(/)


# This part of the URL is optional 
(
     # The rest are heuristics for what seems to work well
     [^.!,?;"'<>()\[\]{}\s\x7F-\xFF]*
     (?:
        [.!,?]+  [^.!,?;"'<>()\[\]{}\s\x7F-\xFF]+
     )*
)?

# It should end in page0001.html

( page0001\.html )



Most of this regular expression was borrowed from Jeffrey Friedl, who
wrote some excellent books on regular expressions (e.g. Mastering
Regular Expressions).
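
A pattern laid out like this, with whitespace and # comments inside it, only works in Perl when compiled with the /x modifier. The following is a cut-down sketch of how that might look in a script. It keeps only the http/https branch and a simplified path rule, so it is an illustration of the layout rather than a copy of the expression above; the variable names are made up.

```perl
use strict;
use warnings;

# Simplified sketch: with /x, whitespace and # comments in the pattern
# are ignored, so the regex can be spread over several lines.
my $url_re = qr{
    \b
    ( https? :// [-\w]+ (?: \. \w [-\w]* )+ )   # scheme and host name
    ( : \d+ )?                                  # optional port number
    /                                           # at least one slash
    [^\s"'<>]*?                                 # path, as little as needed
    page0001\.html                              # required ending
}x;

# Sample line from the asker's clarification.
my $text = q{location.replace('http://foo.bar.com/something/here/blah/page0001.html');};

# Wrap the compiled pattern in one more group so $1 holds the whole URL.
my $url;
if ($text =~ /($url_re)/) {
    $url = $1;
}

print defined $url ? "$url\n" : "no match\n";
```

Using qr{...} rather than qr/.../ avoids having to escape the slashes in the pattern; the comments inside it must then not contain unbalanced braces.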
