Google Answers: PHP/Regex for extracting Second-Level domains from URLs

Question

Subject: PHP/Regex for extracting Second-Level domains from URLs
Category: Computers > Programming
Asked by: fattymelt-ga
List Price: $100.00

Posted: 05 Mar 2005 08:32 PST
Expires: 06 Mar 2005 06:44 PST
Question ID: 485160

Given URLs of the form:

http://www.example.com
http://www.example.com/
http://www.example.net (any top-level domain or suffix: .net, .co.uk, etc.)
http://www.example.com/example.html?a=1&b=2
http://www2.example.com (any sub-domain)
http://example.com (no sub-domain)

I need PHP code that uses a Regular Expression to extract just the
second-level domain "example.com" from all forms of the full URL.

Code should take a URL as input and give second-level domain
"example.com" as output.

Clarification of Question by fattymelt-ga on 05 Mar 2005 08:36 PST

To clarify...

Each of the URLs I gave as examples describes one aspect of variation,
but the code needs to handle any URL with any combination of these
variations.

Clarification of Question by fattymelt-ga on 05 Mar 2005 14:22 PST

eliteskills -

I appreciate your time, but my question states

"I need PHP code that uses a Regular Expression to extract just the
second-level domain "example.com" from all forms of the full URL.

Code should take a URL as input and give second-level domain
"example.com" as output."

So, for example, your code should take in
"http://www.jimmyr.com/index.php" and give back "jimmyr.com"

Keep in mind, it needs to handle all of the variations I listed.

thanks

Answer

There is no answer at this time.

Comments

Subject: Re: PHP/Regex for extracting Second-Level domains from URLs
From: eliteskillsdotcom-ga on 05 Mar 2005 13:32 PST

Can't get it to work perfectly, but this is what I could come up with:


<?php
$urls = 'bsldkfs http://www.jimmyr.com/index.php and
https://eliteskill.com asdf asd http:/www.eliteskills.com/tacos/ sdf
sd sd f http://www.eliteskills.com/';

preg_match_all('/(http|ftp)+(s)?:(\/\/)((\w|\.)+)(\/)?(\S+)?/i', $urls, $return);

echo '<pre>';
print_r($return[0]);
echo '</pre>';

?>

It doesn't grab directories. Maybe someone else can figure it out from there.

Subject: Re: PHP/Regex for extracting Second-Level domains from URLs
From: fattymelt-ga on 05 Mar 2005 15:04 PST

eliteskills -

I appreciate your time, but my question states

"I need PHP code that uses a Regular Expression to extract just the
second-level domain "example.com" from all forms of the full URL.

Code should take a URL as input and give second-level domain
"example.com" as output."

So, for example, your code should take in
"http://www.jimmyr.com/index.php" and give back "jimmyr.com"

Keep in mind, it needs to handle all of the variations I listed.

thanks

Subject: Re: PHP/Regex for extracting Second-Level domains from URLs
From: eliteskillsdotcom-ga on 05 Mar 2005 16:44 PST

I was not providing an answer.

"Can't get it to work perfectly but this is what I could come up with"

It's just what I could do so the next guy might be able to finish it up.

That's basically the function, but it has to be modified to accept
directories and crop the domain part. If that can't all be done in
preg_match_all, then loop over the array and use preg_replace.

Subject: Re: PHP/Regex for extracting Second-Level domains from URLs
From: fattymelt-ga on 05 Mar 2005 17:28 PST

Just to make sure there is no confusion here...

The code I am asking for takes a URL as input.  I will be supplying
that. I do not need any code that extracts URLs from a string.  No one
who will be trying to answer this should need any code that extracts
a URL from a string.

I already have the URLs. I need code that extracts the second-level
domain from a URL.

thanks
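
The single-URL requirement above can be sketched roughly as follows. This is only a minimal illustration, not code from the thread: the function name last_two_labels is hypothetical, and it assumes the input has a scheme followed by "://" and no "user:pass@" part. It captures the host with one regular expression and keeps the last two dot-separated labels, so a two-part suffix such as ".co.uk" comes back as "co.uk" and needs extra handling.

```php
<?php
// Hypothetical helper (the name is illustrative, not from the thread).
// Returns the last two dot-separated labels of the URL's host, or null
// when the input has no "scheme://" prefix or the host has no dot.
function last_two_labels(string $url): ?string
{
    // Capture the host: everything after "scheme://" up to the first
    // "/", "?", "#" or ":" (assumption: no "user:pass@" part is present).
    if (!preg_match('~^[a-z][a-z0-9+.-]*://([^/?#:]+)~i', $url, $m)) {
        return null;
    }
    $host = strtolower($m[1]);

    // Keep only the final two labels, e.g. "www2.example.com" -> "example.com".
    if (preg_match('~([^.]+\.[^.]+)$~', $host, $m2)) {
        return $m2[1];
    }
    return null;
}

echo last_two_labels('http://www.example.com/example.html?a=1&b=2'), "\n"; // example.com
echo last_two_labels('http://www2.example.com'), "\n";                     // example.com
echo last_two_labels('http://www.example.co.uk/'), "\n";                   // co.uk (limitation)
```

Because the path and query string are cut off before any dots are counted, file extensions and values like "price=1.23" do not affect the result.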

Subject: Re: PHP/Regex for extracting Second-Level domains from URLs
From: eliteskillsdotcom-ga on 05 Mar 2005 19:31 PST

I spent a bit of time on this. If you want to contribute anything go
to http://www.eliteskills.com/donate.php . Let me know if it doesn't
work or more information is needed.

Coding:
----------------------------------------------------------------

<?php
$urls = 'http://www.jimmyr.com/index.php
https://eliteskill.com
http://www.eliteskills.com/tacos/
http://google.com/search?q=query%20string%20from%20hell%20here
ftp://www.ftp.com/
http://us.mail.yahoo.com/
://www.google.co.uk/
http://us.f526.mail.yahoo.com/
http://www.eliteskills.com/dmozsubmit/categ/Kids_and_Teens/Arts/';

preg_match_all('/(http|ftp)+(s)?:(\/\/)((\w|\.)+)(\/)?(\S+)?/i', $urls, $return);
// Grab the url list and put into array return

$numElements = count($return[0]);
// Count how many elements in array

$foo=array();
$foo=$return[0];
for($counter=0; $counter < $numElements; $counter++)
{
// loop through array contents outputting spliced url

$url=$foo[$counter];
echo "In: $url";
$url=preg_replace("/\.(php|asp|html|htm|cfm)/", "", $url);
// Add any other extensions; these were all I could think of that might be linked.
// This keeps the ending of a URL from being counted as part of the domain.

$urlcount = explode(".",$url);
$urlcount1 = count($urlcount);
$urlcount1--;

if (preg_match("/co\.uk/", $url)){
$urlcount1--;
}
// Accommodates the two-part co.uk ending



// Below it divides the url by how many subdomains it has to properly crop it
if ($urlcount1==1){
$url=preg_replace("/(http(s)?|ftp):(\/\/)/i", "", $url);
$url=preg_replace("/([^\/]+)(.*)/", "\\1", $url);
}
if ($urlcount1==2){
$url=preg_replace("/(http(s)?|ftp):(\/\/)[^\.]+\./i", "", $url);
$url=preg_replace("/([^\/]+)(.*)/", "\\1", $url);
}
if ($urlcount1==3){
$url=preg_replace("/(http(s)?|ftp):(\/\/)[^\.]+\.[^\.]+\./i", "", $url);
$url=preg_replace("/([^\/]+)(.*)/", "\\1", $url);
}
if ($urlcount1==4){
$url=preg_replace("/(http(s)?|ftp):(\/\/)[^\.]+\.[^\.]+\.[^\.]+\./i", "", $url);
$url=preg_replace("/([^\/]+)(.*)/", "\\1", $url);
}
echo "<br />Out: $url, $urlcount1<br /><br />";
}


?>




--Output--
------------------------------------------------------------------

In: http://www.jimmyr.com/index.php
Out: jimmyr.com, 2

In: https://eliteskill.com
Out: eliteskill.com, 1

In: http://www.eliteskills.com/tacos/
Out: eliteskills.com, 2

In: http://google.com/search?q=query%20string%20from%20hell%20here
Out: google.com, 1

In: ftp://www.ftp.com/
Out: ftp.com, 2

In: http://us.mail.yahoo.com/
Out: yahoo.com, 3

In: ://www.google.co.uk/
Out: google.co.uk, 2

In: http://us.f526.mail.yahoo.com/
Out: yahoo.com, 4

In: http://www.eliteskills.com/dmozsubmit/categ/Kids_and_Teens/Arts/
Out: eliteskills.com, 2

Subject: Re: PHP/Regex for extracting Second-Level domains from URLs
From: eliteskillsdotcom-ga on 05 Mar 2005 19:41 PST

<?php


$url="whatever url you want to enter";
// Or just $url=$_POST["whateveryounamedtheinputbox"]; if you're
// grabbing it from a form.

echo "In: $url";
$url=preg_replace("/\.(php|asp|html|htm|cfm)/", "", $url);
// Add any other extensions; these were all I could think of that might be linked.
// This keeps the ending of a URL from being counted as part of the domain.

$urlcount = explode(".",$url);
$urlcount1 = count($urlcount);
$urlcount1--;

if (preg_match("/co\.uk/", $url)){
$urlcount1--;
}
// Accommodates the two-part co.uk ending



// Below it divides the url by how many subdomains it has to properly crop it
if ($urlcount1==1){
$url=preg_replace("/(http(s)?|ftp):(\/\/)/i", "", $url);
$url=preg_replace("/([^\/]+)(.*)/", "\\1", $url);
}
if ($urlcount1==2){
$url=preg_replace("/(http(s)?|ftp):(\/\/)[^\.]+\./i", "", $url);
$url=preg_replace("/([^\/]+)(.*)/", "\\1", $url);
}
if ($urlcount1==3){
$url=preg_replace("/(http(s)?|ftp):(\/\/)[^\.]+\.[^\.]+\./i", "", $url);
$url=preg_replace("/([^\/]+)(.*)/", "\\1", $url);
}
if ($urlcount1==4){
$url=preg_replace("/(http(s)?|ftp):(\/\/)[^\.]+\.[^\.]+\.[^\.]+\./i", "", $url);
$url=preg_replace("/([^\/]+)(.*)/", "\\1", $url);
}
echo "<br />Out: $url, $urlcount1<br /><br />";


?>

Subject: Re: PHP/Regex for extracting Second-Level domains from URLs
From: garyking-ga on 05 Mar 2005 20:23 PST

No offense, but I don't think this question is worth $100.

Try asking at a PHP forum; you will probably get a better response at
one such as: http://www.phpbuilder.com/board/

Good luck!

Subject: Re: PHP/Regex for extracting Second-Level domains from URLs
From: fattymelt-ga on 05 Mar 2005 21:39 PST

That code almost does the trick, but:

1) Hard-coding "co.uk" doesn't take into account other two-part
suffixes (e.g. "co.in", etc.)

2) Because you split on ".", your code breaks if the query string were
to include a "." (e.g. .../index.asp?price=1.23)

3) I don't want to have to come up with an exhaustive list of file
extensions (there are plenty more than php, asp, html, htm, cfm)
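
One way to address all three points at once is sketched below. It is a hedged illustration, not the code from this thread: it swaps the regex-based URL splitting for PHP's built-in parse_url(), the function name registrable_domain is hypothetical, and the default suffix list is a tiny illustrative sample (a real list would have to come from somewhere like the Public Suffix List). Isolating the host first means dots in the path or query string never get counted, so no list of file extensions is needed.

```php
<?php
// Hypothetical helper; the default suffix list is an illustrative sample only.
function registrable_domain(
    string $url,
    array $twoPartSuffixes = ['co.uk', 'co.in', 'com.au']
): ?string {
    // parse_url() returns null when there is no host, false on badly malformed input.
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null;
    }

    $labels = explode('.', strtolower($host));
    $n = count($labels);
    if ($n < 2) {
        return null; // e.g. "localhost"
    }

    // Keep three labels when the host ends in a known two-part suffix, else two.
    $lastTwo = $labels[$n - 2] . '.' . $labels[$n - 1];
    $keep = in_array($lastTwo, $twoPartSuffixes, true) ? 3 : 2;
    if ($n < $keep) {
        return null; // the host is just the suffix itself, e.g. "co.uk"
    }
    return implode('.', array_slice($labels, -$keep));
}

echo registrable_domain('http://www.example.com/index.asp?price=1.23'), "\n"; // example.com
echo registrable_domain('http://google.co.in'), "\n";                         // google.co.in
echo registrable_domain('http://us.f526.mail.yahoo.com/'), "\n";              // yahoo.com
```

The suffix list is passed as a parameter so it can be extended without touching the function body.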

Subject: Re: PHP/Regex for extracting Second-Level domains from URLs
From: eliteskillsdotcom-ga on 05 Mar 2005 22:05 PST

You're right. This has been fun, playing with regular expressions. I think
I've got it now... 0.o maybe this time.



Much cleaner anyways.



<?php
$urls = 'http://www.jimmyr.com/in.d.e.x.php
https://eliteskill.com
http://www.eliteskills.com/tacos/
http://google.com/search?q=query%20string%20from%20hell%20here
http://google.com/search?q=255.255.255.255
http://google.co.in
https://google.ru/*&!0@3#)($*)__Q)(E
ftp://www.ftp.com/
http://us.mail.yahoo.com/
://www.google.co.uk/
http://us.f526.mail.yahoo.com/
http://www.eliteskills.com/dmozsubmit/categ/Kids_and_Teens/Arts/';

preg_match_all('/(http|ftp)+(s)?:(\/\/)((\w|\.)+)(\/)?(\S+)?/i', $urls, $return);
// Grab the url list and put into array return

$numElements = count($return[0]);
$foo=array();
$foo=$return[0];
for($counter=0; $counter < $numElements; $counter++)
{

$url=$foo[$counter];
echo "In: $url";

$url=preg_replace("/((http(s)?|ftp):\/\/)/", "", $url);
$url=preg_replace("/([^\/]+)(.*)/", "\\1", $url);


$urlcount = explode(".",$url);
$urlcount1 = count($urlcount);
$urlcount1--;

if (preg_match("/co\./", $url)){
$urlcount1--;
}

$url=preg_replace("/([^\.]+)\./i", "", $url,$urlcount1-1);

echo "<br />Out: $url, $urlcount1<br /><br />";
}


?>

Subject: Re: PHP/Regex for extracting Second-Level domains from URLs
From: fattymelt-ga on 06 Mar 2005 06:44 PST

eliteskills -

very nice. thanks. I'm cancelling the question and will be hitting up
your "donate" button!


FYI.. this code:

if (preg_match("/co\./", $url)){
$urlcount1--;
}

screws things up for a domain that ends in "co" such as www.AcmeCo.com

I changed the regex to "\.co\." to get around that problem. Otherwise,
this is some good code. Thanks, again.
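
The difference between the two patterns can be seen in a small sketch like the one below. The helper name has_co_suffix and the sample hosts are hypothetical. The third pattern is an extra assumption that goes beyond the thread: it treats ".co." as a suffix only when it is followed by a two-letter ending at the very end of the host, which also avoids a misfire on a host such as "www.co.example.com".

```php
<?php
// Hypothetical helper: true when the host ends in ".co." plus a two-letter
// ending (assumption: "co" second-level suffixes sit under two-letter codes).
function has_co_suffix(string $host): bool
{
    return preg_match('/\.co\.[a-z]{2}$/i', $host) === 1;
}

$hosts = ['www.acmeco.com', 'www.google.co.uk', 'www.co.example.com'];
foreach ($hosts as $host) {
    printf(
        "%-20s  co\\.=%d  \\.co\\.=%d  anchored=%d\n",
        $host,
        preg_match('/co\./', $host),   // also matches inside "acmeco."
        preg_match('/\.co\./', $host), // requires a dot on both sides
        (int) has_co_suffix($host)     // additionally anchored to the end
    );
}
```

For "www.acmeco.com" only the first pattern matches, which is the problem described above; for "www.co.example.com" the first two patterns match and the anchored one does not.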
