Q: Scrape domains (only) from google results. (Answered, 5 out of 5 stars, 4 Comments)
Question  
Subject: Scrape domains (only) from google results.
Category: Computers > Programming
Asked by: author20-ga
List Price: $20.00
Posted: 08 Mar 2004 09:50 PST
Expires: 07 Apr 2004 10:50 PDT
Question ID: 314579
I want to have a list of the domains from a google search, so I wind
up with a list of domains -- and no other information, not even files.
So if the results shows:

http://www.domain.com/otherstuff/files/page.html

It saves the following

http://www.domain.com

along with other domains. It would be nice to have an option to save
the sub-directories and files also, but it must work with google.

Just identify the shareware program that does it and I'm going to pay
$20 immediately. If the program doesn't meet my requirements, I'll pay
you $10 if I use it anyway.
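
For reference, the reduction asked for above (full URL in, scheme and host out) can be sketched in a few lines. This is a Python illustration only, not a shareware product and not the script supplied later in this thread; the function name domain_only is made up for the example.

```python
from urllib.parse import urlsplit


def domain_only(url):
    """Return just the scheme and host of an absolute URL.

    The path, query string and fragment are dropped.
    """
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {url!r}")
    return f"{parts.scheme}://{parts.netloc}"


print(domain_only("http://www.domain.com/otherstuff/files/page.html"))
# prints http://www.domain.com
```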

Request for Question Clarification by sycophant-ga on 10 Mar 2004 04:53 PST
What platform does this program need to run on?

Do you care about duplicates, i.e. can www.domain.com show up more
than once, or do you want the software to handle duplicate removal?

Are you willing to save the Google results pages yourself and pass
them through a parser, or does the program need to be able to execute
the search too - and if so, how many pages of results do you want it
to go through?

If you have a web account or server capable of running a PHP script, I
can create you a script that will perform this functionality.

Regards,
Sycophant-ga

Clarification of Question by author20-ga on 10 Mar 2004 07:01 PST
1. duplicates not an issue, I can delete later
2. must execute save automatically, complete automation of parse also
3. PHP definitely OK, but I was told about some freeware program that
does it, will close post as soon as person identifies it
Answer  
Subject: Re: Scrape domains (only) from google results.
Answered By: sycophant-ga on 10 Mar 2004 16:32 PST
Rated:5 out of 5 stars
 
Hi author20, 

I have done a number of searches online, and been unable to find a
product that does what you describe. There are a number of products
able to extract URLs from text (or saved HTML pages), but none that I
could find would execute a search and return only the domains rather
than full URLs.

However, I have written a simple script in PHP which does fulfil your
demands - it is currently operational at this URL:
http://dylan.wibble.net/programming/ga-test/314579/grab.php

The required files can be downloaded here:
http://dylan.wibble.net/programming/ga-test/314579/google-grab.zip

This program is written to work with current Google results, and
should continue to do so. However, as it is simply parsing the
contents of Google's HTML results, any change in formatting could
affect the program's functionality.

All required files are included in the zip file. They are the grab.php
script and an HTTP class. This application should run on any PHP
server with outbound HTTP access.
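
The general technique described here (pull the links out of an HTML results page, then keep only the host part) might look roughly like the sketch below. It is written in Python for illustration and is not the contents of grab.php; the class and function names are hypothetical, and it treats every absolute link on the page as a hit, whereas a real results page would also need its navigation and advertising links filtered out.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that points at an absolute URL."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.links.append(value)


def domains_in_html(html_text):
    """Return the lower-cased host of each absolute link, in page order."""
    collector = LinkCollector()
    collector.feed(html_text)
    return [urlsplit(link).netloc.lower() for link in collector.links]


sample = (
    '<p><a href="http://www.domain.com/otherstuff/files/page.html">hit</a> '
    '<a href="/local">nav</a></p>'
)
print(domains_in_html(sample))  # ['www.domain.com']
```

Because this depends on the page's markup, it shares the weakness noted above: a change in the results format can break it.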

I hope this small application will do what you need. 

Regards,
Sycophant-ga

Request for Answer Clarification by author20-ga on 10 Mar 2004 17:21 PST
Checking this right now, if I like, I will accept answer.

Clarification of Answer by sycophant-ga on 12 Mar 2004 23:07 PST
Hi author20,

As you haven't gotten back to me about it, I assume you have had a
chance to look over the script I submitted for you and hope you find
it suitable. If there are any problems, let me know.

Regards,
Sycophant-ga

Request for Answer Clarification by author20-ga on 13 Mar 2004 00:17 PST
Hello,

I'm downloading now, and due to internet problems last night, didn't
get to it as fast as I wanted. Stand by...

Request for Answer Clarification by author20-ga on 13 Mar 2004 00:20 PST
Hi,

This looks perfect, let me test and I will accept answer tomorrow
morning. I apologize for the delay, and I probably will ask you for
help on another project soon.

Request for Answer Clarification by author20-ga on 13 Mar 2004 00:48 PST
Hi,

This works pretty good, but I can't control it. 

Can you have it display domains in alphabetical order?  For example, if I put:

"amazon associate"+800  ....it returns a whole bunch of domains, but I
know that I am missing a lot that are too many hits to bring up. But
if it were to qualify and allow user to get only the domains that
start with "a"...and then "b" and then "c"...and so on....

I could be sure to get more total...Is there a way you can add this
filter and make it work during the search? The point is, to get more
names in an orderly, manageable way.

Clarification of Answer by sycophant-ga on 13 Mar 2004 01:20 PST
Hi author20, 

If I understand your query, you want to limit your search to domains
beginning with a certain letter, or range of letters.

I can do so with the results I return from the program, but it will
only do so from the results returned from the search. Google's search
does not allow for any scope filtering such as that. You can restrict
a search to a specific domain, but you are not able to limit it to a
wildcarded group of domains.
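
To illustrate what filtering after the fact means: once a list of domains has been scraped, it can be narrowed to those beginning with given letters, but this only ever selects from results the search already returned. A small Python sketch (the function name and the choice to ignore a leading "www." are assumptions made for the example):

```python
def starting_with(domains, letters, ignore_www=True):
    """Return, sorted, the domains whose name begins with any of `letters`.

    With ignore_www=True a leading "www." is skipped before comparing,
    so "www.apple.com" counts as starting with "a".
    """
    wanted = tuple(letter.lower() for letter in letters)
    kept = []
    for domain in domains:
        name = domain.lower()
        if ignore_www and name.startswith("www."):
            name = name[4:]
        if name.startswith(wanted):
            kept.append(domain)
    return sorted(kept)


found = ["www.apple.com", "ads.blah.com", "www.books.example", "cheap.example"]
print(starting_with(found, "ab"))
# ['ads.blah.com', 'www.apple.com', 'www.books.example']
```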

Can you think of any other measures that may make this work better for
you? I can only scrape the pages returned by a Google search, so if it
can be done there somehow, I can do it.

Regards,
Sycophant

Request for Answer Clarification by author20-ga on 13 Mar 2004 06:33 PST
Hi,

Could we save the results to a temporary file on the client PC, then
parse it with a script after the fact?

It would possibly require 

1. limiting the save to -- perhaps -- 100 pages (so it will not take
more than 4 minutes or so to save)
2. parsing the results -- using your current script
3. parsing the results of your script -- with a new script that is
able to then check the values of 1)page title 2)first text on page
3)meta tag info inside 4)other useful data that is descriptive

Would this work? I could obviously pay more for the script in #3, but
I am also checking another solution submitted by another:

http://www.domainpunch.com/products/domainfilter/

Let me know what you think is the best way to scrape results
accurately, without losing any data, and without making the search
results unmanageable.

Request for Answer Clarification by author20-ga on 13 Mar 2004 06:34 PST
Oh -- we could use either Microsoft Access as the database or mySQL.
Both are supported on Windows PCs, and that is our platform for
execution.

Request for Answer Clarification by author20-ga on 13 Mar 2004 06:46 PST
Oh -- here are some other ideas:

My first use is to identify the most successful amazon associate
sites, so the first google search criterion is "amazon associate". 
But here are ways to further limit the results:

1. only sites that have google traffic statistics 
2. only sites that list a phone number, street address, city, state
listed somewhere in their site (perform a site search after "amazon
associate" is found)
3. only sites that feature more than one amazon product, amazon
products have prices, so finding multiple instances of currency
strings is a way to find multiple listings of products (on same or
different pages, both are desirable)
4. sites that use the following terms a lot, "discount" "savings"
"closeout" "10% off" "20% off" etc.
5. sites that are not in geocities, yahoo, aol and the other online
services that attract a lot of loser Amazon associate sites.  So, we
should build in a filter to avoid the huge shared domains
6. filter the title of the page (?) or the first text words (is this possible)?
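
Ideas 3, 4 and 5 are the easiest to express as code, assuming the text of each hit page has already been downloaded. The Python sketch below is only an illustration of those three checks: the host list, the keyword list, the dollar-only price pattern and the threshold of two prices are all placeholder assumptions.

```python
import re

# Placeholder values for illustration; a real list would be longer.
SHARED_HOSTS = ("geocities.com", "yahoo.com", "aol.com")
KEYWORDS = ("discount", "savings", "closeout")
PRICE = re.compile(r"\$\d+(?:\.\d{2})?")  # matches e.g. $8 or $12.99


def on_shared_host(domain):
    """True if the domain is one of the shared hosts or a subdomain of one."""
    domain = domain.lower()
    return any(domain == host or domain.endswith("." + host) for host in SHARED_HOSTS)


def score_page(text):
    """Count price strings and keyword occurrences in the page text."""
    lowered = text.lower()
    return {
        "prices": len(PRICE.findall(text)),
        "keywords": sum(lowered.count(word) for word in KEYWORDS),
    }


def looks_promising(domain, text, min_prices=2):
    """Keep pages off shared hosts that list at least `min_prices` prices."""
    return not on_shared_host(domain) and score_page(text)["prices"] >= min_prices


page = "Closeout! Book A $12.99, Book B $8.50 - big savings"
print(score_page(page))  # {'prices': 2, 'keywords': 2}
```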

Just brainstorming. I believe that there is a way, and I hope we don't
have to save then parse, this would take forever, and if Google will
message the search for us, why not use them?

Request for Answer Clarification by author20-ga on 13 Mar 2004 07:07 PST
Hi,

I think a 2 part process is necessary, and probably must be done
offline, but possibly online and real time.

1. search for a term
2. save results from as many pages as the user specifies (up to 1000)
into a  disk file.
3. parse the file and separate the domains from the URLs from the data
in the pages that are in the hit list (not entire site but the title,
meta tags and text contents of the page)

4. Then parse the database with a database query tool that allows you
to search for specific text in the page that is associated with the
domain or URL.
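
Steps 3 and 4 could be sketched as follows. Note the substitution: this example uses Python's built-in sqlite3 module with an in-memory database rather than Microsoft Access or MySQL, purely to stay self-contained. The table layout and function names are assumptions, and the (url, title, text) tuples stand in for pages saved in step 2.

```python
import sqlite3
from urllib.parse import urlsplit


def build_db(pages):
    """Load (url, title, text) tuples into an in-memory table, one row per page."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE hits (domain TEXT, url TEXT, title TEXT, body TEXT)")
    db.executemany(
        "INSERT INTO hits VALUES (?, ?, ?, ?)",
        [(urlsplit(url).netloc.lower(), url, title, text) for url, title, text in pages],
    )
    db.commit()
    return db


def domains_mentioning(db, phrase):
    """Return the distinct domains whose page title or text contains `phrase`."""
    pattern = f"%{phrase}%"  # note: % and _ inside phrase act as LIKE wildcards
    rows = db.execute(
        "SELECT DISTINCT domain FROM hits "
        "WHERE title LIKE ? OR body LIKE ? ORDER BY domain",
        (pattern, pattern),
    )
    return [row[0] for row in rows]


pages = [
    ("http://www.domain.com/a.html", "Discount books", "new titles every week"),
    ("http://shop.example.org/x", "Home", "closeout sale on now"),
    ("http://www.domain.com/b.html", "About", "contact us"),
]
db = build_db(pages)
print(domains_mentioning(db, "discount"))  # ['www.domain.com']
```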


You might want to check out the tool at:

http://www.domainpunch.com/products/domainfilter/

It can do the following:

Extract domain names from lists of URLs by removing http://, www, html
file names, etc.

Change or remove domain name extensions in a large list of domain
names with a single click.

Add domain name extensions like .COM, .INFO etc to a large list of
words with a single click.

Extract domain names that do not contain repeated letters or digits
(eg: zzsleep.com, aaweb.com, etc)

Domain Name Filter can process very large domain name lists with
little CPU overhead.

Clarification of Answer by sycophant-ga on 13 Mar 2004 20:56 PST
Hi,

I understand what you are looking for, but unfortunately it falls
outside the scope of what I can do in the time I currently have
available to me.

I don't really think I am going to be able to add that sort of
functionality to the script I've written. Also, I am not really sure
that PHP is the language to do it in.

As I am unable to create the sort of application you are looking for,
I am willing to withdraw my answer - let me know.

Regards,
Sycophant-ga

Request for Answer Clarification by author20-ga on 13 Mar 2004 21:37 PST
I think I can use this for now, but I'll probably ask you to enhance later on. 

I will test your program a bit more, and try to get used to the
limitations and pay the $20 tomorrow.

Request for Answer Clarification by author20-ga on 14 Mar 2004 06:34 PST
Hi,
The program is so powerful, and I will pay for it definitely, but wanted to ask:

1. could you include the found hyperlink to the right of the domain
(leave 10 or more spaces), allowing user to click and see the page?
2. could you include the date to the right of the link?

3. Also, why does it hang if you ask for more than 3 pages?  Is there
a way to limit the first query and put, "See next 5 pages" at the
bottom? Right now, everything beyond the scope of the first search is
lost.

With a feature that allows you to see "Next 5 pages" it seems that in
theory, we won't lose any records.

I will pay a tip of $10 if you can add the link to the right of the
domain. I hope the date is possible.  Listen, I do want to pay you
next week for enhancements, and I think they can be done within PHP.

Request for Answer Clarification by author20-ga on 14 Mar 2004 06:54 PST
Hi,

Again -- the program is very powerful.  I definitely want to use it,
and will send payment if you can address the above issues. Also,
definately will give you a super-high rating, and additional jobs in
the near future.

Just wanted to let you know that I am aware you put a lot of time in
the script, and that it will be valuable.

One more request - to the right of the domains and URLs, and date (if
possible), can you put the hit count? That is, if a particular domain
shows up 12 times, can you put "12" to the right?

Clarification of Answer by sycophant-ga on 14 Mar 2004 21:57 PST
Hi again,

Here we go... Revised operation:

1) Choose search term, and start page, and pages to return at once.
2) Search - This performs the search on Google, and strips the results
for the specified number of pages.
3) Repeat Step 2 as many times as necessary (manually at the moment)
until enough results have been gathered.
4) Output collected data.
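
The batching in steps 1 to 3 comes down to simple arithmetic on a start page and a page count. The Python sketch below shows that arithmetic only; it assumes 10 results per page, and the function names are invented for the example rather than taken from grab2.php.

```python
RESULTS_PER_PAGE = 10  # assumption: ten results on each results page


def batch_offsets(start_page, pages):
    """Zero-based result offsets for `pages` pages, starting at 1-based `start_page`."""
    if start_page < 1 or pages < 1:
        raise ValueError("start_page and pages must both be at least 1")
    return [(page - 1) * RESULTS_PER_PAGE for page in range(start_page, start_page + pages)]


def next_start_page(start_page, pages):
    """The start page to use when repeating step 2 for the following batch."""
    return start_page + pages


print(batch_offsets(1, 3))    # [0, 10, 20]
print(next_start_page(1, 3))  # 4
print(batch_offsets(4, 3))    # [30, 40, 50]
```

Feeding next_start_page back in as the new start page is what lets consecutive batches cover the results without gaps or overlap.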

The data is returned as a list, ordered alphabetically by domain name
(ads.blah.com comes before www.apple.com), one line per unique domain
name. Each line has beside it a count of individual links found within
that domain, and a link to each one.
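
A report of that shape (one line per domain, sorted by full host name, with a count and the individual links) can be produced along these lines. This is a Python sketch of the grouping idea, not the revised PHP code; the function names and the plain-text line format are assumptions.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_domain(urls):
    """Group URLs by lower-cased host, dropping exact duplicate URLs.

    Returns (domain, links) pairs sorted alphabetically by the full host
    name, so "ads.blah.com" sorts before "www.apple.com".
    """
    grouped = defaultdict(list)
    for url in urls:
        domain = urlsplit(url).netloc.lower()
        if url not in grouped[domain]:
            grouped[domain].append(url)
    return sorted(grouped.items())


def format_report(urls):
    """One text line per domain: the domain, its link count, then each link."""
    return [
        f"{domain}  {len(links)}  {' '.join(links)}"
        for domain, links in group_by_domain(urls)
    ]


hits = [
    "http://www.apple.com/a",
    "http://ads.blah.com/x",
    "http://www.apple.com/b",
    "http://www.apple.com/a",
]
for line in format_report(hits):
    print(line)
# ads.blah.com  1  http://ads.blah.com/x
# www.apple.com  2  http://www.apple.com/a http://www.apple.com/b
```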

See http://dylan.wibble.net/programming/ga-test/314579/grab2.php for a
working example, and
http://dylan.wibble.net/programming/ga-test/314579/grab2.zip to
download the revised code.

Regards,
Sycophant-ga

Clarification of Answer by sycophant-ga on 14 Mar 2004 22:02 PST
Sorry, I forgot to mention, I haven't been able to add the date
details, as not every (in fact not many) result includes a date
listing. Also, as the individual links are being consolidated into a
single line, there's no place to put the date for each individual
matching page for a domain.

Regards,
Sycophant

Request for Answer Clarification by author20-ga on 15 Mar 2004 06:25 PST
Hi,

Good design, like the enhancements.  There are two bugs.  I will
accept and pay ASAP if you can:

1. fix bug that appears on main page:
"Warning: Undefined variable: domains in
/home/www/docs/dylan.wibble.net/programming/ga-test/314579/grab2.php
on line 138"

2. Make it so selecting the link opens a new window, by taking user to
selected link (use of "#" was nice), some pages will not let you
return to the referring page, so you have to go to the script page by
entering the URL manually.
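
Both requests have common fixes, sketched here in Python rather than the script's PHP. The guess that the warning comes from the result list being unset before any search has run is an assumption, as are the function names; target="_blank" is the standard HTML attribute for opening a link in a new window.

```python
from html import escape


def link_html(url):
    """Render a link that opens in a new window via target="_blank"."""
    safe = escape(url, quote=True)
    return f'<a href="{safe}" target="_blank">{safe}</a>'


def render_rows(domains=None):
    """Render one HTML line per domain; `domains` maps a domain to its links.

    Defaulting to an empty mapping means the first page view, before any
    search has been run, renders nothing instead of failing.
    """
    domains = domains or {}
    rows = []
    for domain in sorted(domains):
        links = " ".join(link_html(url) for url in domains[domain])
        rows.append(f"{escape(domain)} ({len(domains[domain])}) {links}")
    return rows


print(render_rows())  # []
print(link_html("http://www.domain.com/page.html?a=1&b=2"))
# <a href="http://www.domain.com/page.html?a=1&amp;b=2" target="_blank">http://www.domain.com/page.html?a=1&amp;b=2</a>
```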

Clarification of Answer by sycophant-ga on 15 Mar 2004 20:43 PST
Hi,

Both those bugs have been corrected.

I may continue to develop this application as an independent project
over time, but that will not be for some time. I'll let you know somehow
if that happens.

Regards,
Sycophant-ga
author20-ga rated this answer:5 out of 5 stars and gave an additional tip of: $15.00
One of the most skilled Google API experts I have seen, I am hiring
him again to enhance this super Google search tool.

Comments  
Subject: Re: Scrape domains (only) from google results.
From: pinkfreud-ga on 10 Mar 2004 16:47 PST
 
I wonder whether this software would meet your needs:

http://www.domainpunch.com/products/domainfilter/
Subject: Re: Scrape domains (only) from google results.
From: author20-ga on 10 Mar 2004 17:20 PST
 
I will check right now...
Subject: Re: Scrape domains (only) from google results.
From: edschmoe-ga on 18 Mar 2004 18:01 PST
 
Why do screen scrapes when Google offers an API to allow applications
to query it???????

www.google.com/apis/
Subject: Re: Scrape domains (only) from google results.
From: sycophant-ga on 18 Mar 2004 20:19 PST
 
I chose to implement an HTTP parse, or scrape, because the Google API
conditions are very specific, and I wasn't sure that the intended use
would fall within the usage guidelines.

I have used the API for other projects and I may do in future.
