Google Answers: Scrape domains (only) from google results.


Question

Subject: Scrape domains (only) from google results.
Category: Computers > Programming
Asked by: author20-ga
List Price: $20.00

Posted: 08 Mar 2004 09:50 PST
Expires: 07 Apr 2004 10:50 PDT
Question ID: 314579

I want to have a list of the domains from a google search, so I wind
up with a list of domains -- and no other information, not even files.
So if the results show:

http://www.domain.com/otherstuff/files/page.html

It saves the following:

http://www.domain.com

along with other domains. It would be nice to have an option to save
the sub-directories and files also, but it must work with google.

Just identify the shareware program that does it and I'm going to pay
$20 immediately. If the program doesn't meet my requirements, I'll pay
you $10 if I use it anyway.
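
As a rough illustration of the reduction asked for above (this is not
the shareware program being sought, only a minimal Python sketch using
the standard library; the function name domain_only is made up for the
example):

```python
from urllib.parse import urlsplit


def domain_only(url):
    """Return scheme://host for a URL, dropping path, query and fragment.

    domain_only is a hypothetical helper name used only for this sketch.
    """
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


print(domain_only("http://www.domain.com/otherstuff/files/page.html"))
# prints: http://www.domain.com
```

Keeping the path instead (the optional "sub-directories and files"
mode) would simply mean returning the URL unchanged.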

Request for Question Clarification by sycophant-ga on 10 Mar 2004 04:53 PST

What platform does this program need to run on?

Do you care about duplicates - i.e. can www.domain.com show up more
than once, or do you want the software to handle duplicate removal?

Are you willing to save the Google results pages yourself and pass
them through a parser, or does the program need to be able to execute
the search too - and if so, how many pages of results do you want it
to go through?

If you have a web account or server capable of running a PHP script, I
can write you a script that does this.

Regards,
Sycophant-ga

Clarification of Question by author20-ga on 10 Mar 2004 07:01 PST

1. duplicates not an issue, I can delete later
2. must execute save automatically, complete automation of parse also
3. PHP definitely OK, but I was told about some freeware program that
does it, will close post as soon as a person identifies it

Answer

Subject: Re: Scrape domains (only) from google results.
Answered By: sycophant-ga on 10 Mar 2004 16:32 PST
Rated: 5 out of 5 stars

Hi author20,

I have done a number of searches online and been unable to find a
product that does what you describe. There are a number of products
able to extract URLs from text (or saved HTML pages), but none that I
could find would execute a search and return only the domains rather
than full URLs.

However, I have written a simple script in PHP which does fulfil your
requirements. It is currently operational at this URL:

http://dylan.wibble.net/programming/ga-test/314579/grab.php

The required files can be downloaded here:

http://dylan.wibble.net/programming/ga-test/314579/google-grab.zip

This program is written to work with current Google results, and
should continue to do so. However, as it is simply parsing the
contents of Google's HTML results, any change in formatting could
affect the program's functionality.

All required files are included in the zip file. They are the grab.php
script and an HTTP class. This application should run on any PHP
server with outbound HTTP access.

I hope this small application will do what you need.

Regards,
Sycophant-ga
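
The general idea of parsing links out of a results page can be
sketched as follows. This is not the grab.php script from the zip
file, which is not reproduced in this thread; it is a small Python
illustration using only the standard library. The sample HTML and the
names LinkCollector and domains_from_html are assumptions made for the
example, and real Google result markup will differ.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that holds an absolute http(s) URL."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip relative links such as site navigation.
        if href and href.startswith(("http://", "https://")):
            self.links.append(href)


def domains_from_html(html):
    """Return the host part of every absolute link found in the HTML."""
    collector = LinkCollector()
    collector.feed(html)
    return [urlsplit(link).netloc for link in collector.links]


# Made-up stand-in for a saved results page.
SAMPLE_HTML = """
<p><a href="http://www.domain.com/otherstuff/files/page.html">First result</a></p>
<p><a href="/relative/link">Navigation link</a></p>
<p><a href="http://ads.blah.com/index.html">Second result</a></p>
"""

print(domains_from_html(SAMPLE_HTML))
# prints: ['www.domain.com', 'ads.blah.com']
```

Because the parser only looks at anchor tags, it is exposed to the
same risk mentioned in the answer: a change in the page's markup can
change which links are picked up.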

Request for Answer Clarification by author20-ga on 10 Mar 2004 17:21 PST

Checking this right now. If I like it, I will accept the answer.

Clarification of Answer by sycophant-ga on 12 Mar 2004 23:07 PST

Hi author20,

As you haven't gotten back to me about it, I assume you have had a
chance to look over the script I submitted for you, and I hope you
find it suitable. If there are any problems, let me know.

Regards,
Sycophant-ga

Request for Answer Clarification by author20-ga on 13 Mar 2004 00:17 PST

Hello, I'm downloading now. Due to internet problems last night, I
didn't get to it as fast as I wanted. Stand by...

Request for Answer Clarification by author20-ga on 13 Mar 2004 00:20 PST

Hi, this looks perfect. Let me test and I will accept the answer
tomorrow morning. I apologize for the delay, and I probably will ask
you for help on another project soon.

Request for Answer Clarification by author20-ga on 13 Mar 2004 00:48 PST

Hi, this works pretty well, but I can't control it. Can you have it
display domains in alphabetical order? For example, if I put:

"amazon associate"+800

....it returns a whole bunch of domains, but I know that I am missing
a lot because there are too many hits to bring up. But if it were to
qualify the search and allow the user to get only the domains that
start with "a"...and then "b" and then "c"...and so on.... I could be
sure to get more in total...

Is there a way you can add this filter and make it work during the
search? The point is to get more names in an orderly, manageable way.

Clarification of Answer by sycophant-ga on 13 Mar 2004 01:20 PST

Hi author20,

If I understand your query, you want to limit your search to domains
beginning with a certain letter, or range of letters.

I can do so with the results my program returns, but the filter would
apply only to the results Google has already returned for the search.
Google's search does not allow for any scope filtering such as that.
You can restrict a search to a specific domain, but you are not able
to limit it to a wildcarded group of domains.

Can you think of any other measures that may make this work better for
you? I can only scrape the pages returned by a Google search, so if it
can be done there somehow, I can do it.

Regards,
Sycophant
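
To illustrate what filtering "the results returned from the search"
could look like, here is a small Python sketch. It only narrows a list
that has already been collected; it cannot make Google return
different results. The helper name starts_with_letter, the sample
domains, and the choice to ignore a leading "www." are assumptions
made for the example.

```python
def starts_with_letter(domain, letters):
    """True if the domain, ignoring a leading 'www.', starts with one of letters.

    Ignoring 'www.' is an assumption for this sketch; drop that step to
    match on the full host name instead.
    """
    host = domain.lower()
    if host.startswith("www."):
        host = host[4:]
    return bool(host) and host[0] in letters


# Made-up list standing in for domains already scraped from a search.
domains = ["www.apple.com", "ads.blah.com", "www.books.org", "cheap.example.net"]

print([d for d in domains if starts_with_letter(d, "ab")])
# prints: ['www.apple.com', 'ads.blah.com', 'www.books.org']
```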

Request for Answer Clarification by author20-ga on 13 Mar 2004 06:33 PST

Hi, could we save the results to a temporary file on the client PC,
then parse it with a script after the fact? It would possibly require:

1. limiting the save to -- perhaps -- 100 pages (so it will not take
more than 4 minutes or so to save)
2. parsing the results -- using your current script
3. parsing the results of your script -- with a new script that is
able to then check the values of
   1) page title
   2) first text on page
   3) meta tag info inside
   4) other useful data that is descriptive

Would this work? I could obviously pay more for the script in #3, but
I am also checking another solution submitted by another person:

http://www.domainpunch.com/products/domainfilter/

Let me know what you think is the best way to scrape results
accurately, without losing any data, and without making the search
results unmanageable.

Request for Answer Clarification by author20-ga on 13 Mar 2004 06:34 PST

Oh -- we could use either Microsoft Access or MySQL as the database.
Both are supported on Windows PCs, and that is our platform for
execution.

Request for Answer Clarification by author20-ga on 13 Mar 2004 06:46 PST

Oh -- here are some other ideas:

My first use is to identify the most successful amazon associate
sites, so the first google search criterion is "amazon associate". But
here are ways to further limit the results:

1. only sites that have google traffic statistics
2. only sites that list a phone number, street address, city and state
somewhere in their site (perform a site search after "amazon
associate" is found)
3. only sites that feature more than one amazon product. Amazon
products have prices, so finding multiple instances of currency
strings is a way to find multiple listings of products (on the same or
different pages, both are desirable)
4. sites that use the following terms a lot: "discount" "savings"
"closeout" "10% off" "20% off" etc.
5. sites that are not in geocities, yahoo, aol and the other online
services that attract a lot of loser Amazon associate sites. So, we
should build in a filter to avoid the huge shared domains
6. filter the title of the page (?) or the first text words (is this
possible)?

Just brainstorming. I believe that there is a way, and I hope we don't
have to save then parse. This would take forever, and if Google will
massage the search for us, why not use them?
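
A few of the ideas above (items 3, 4 and 5) are simple text checks
that could be sketched like this in Python. This is only an
illustration of the brainstormed filters, not part of the script
discussed in this thread. The list of shared hosts, the dollar-only
price pattern, and all function names are assumptions for the example,
and the page text would have to be fetched separately.

```python
import re

# Assumed examples of large shared-hosting domains to skip (item 5).
SHARED_HOSTS = ("geocities.com", "yahoo.com", "aol.com")

# Assumed price format: a dollar sign, digits, optional cents (item 3).
PRICE_PATTERN = re.compile(r"\$\d+(?:\.\d{2})?")

# Terms quoted in item 4.
DISCOUNT_TERMS = ("discount", "savings", "closeout", "10% off", "20% off")


def is_shared_host(domain):
    """True if the domain is one of SHARED_HOSTS or a subdomain of one."""
    domain = domain.lower()
    return any(domain == h or domain.endswith("." + h) for h in SHARED_HOSTS)


def count_prices(text):
    """Count currency strings such as $19.99 in the page text."""
    return len(PRICE_PATTERN.findall(text))


def count_discount_terms(text):
    """Count how often any of the discount terms appears, ignoring case."""
    lowered = text.lower()
    return sum(lowered.count(term) for term in DISCOUNT_TERMS)


# Made-up page text for demonstration.
page_text = "Closeout sale: was $24.99, now $19.99. Extra 10% off with savings code."

print(is_shared_host("www.geocities.com"))  # prints: True
print(is_shared_host("www.domain.com"))     # prints: False
print(count_prices(page_text))              # prints: 2
print(count_discount_terms(page_text))      # prints: 3
```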

Request for Answer Clarification by author20-ga on 13 Mar 2004 07:07 PST

Hi, I think a 2 part process is necessary, and probably must be done
offline, but possibly online and in real time.

1. search for a term
2. save results from as many pages as the user specifies (up to 1000)
into a disk file
3. parse the file and separate the domains from the URLs from the data
in the pages that are in the hit list (not the entire site but the
title, meta tags and text contents of the page)
4. then parse the database with a database query tool that allows you
to search for specific text in the page that is associated with the
domain or URL

You might want to check out the tool at:

http://www.domainpunch.com/products/domainfilter/

It can do the following:

- Extract domain names from lists of URLs by removing http://, www,
html file names, etc.
- Change or remove domain name extensions in a large list of domain
names with a single click.
- Add domain name extensions like .COM, .INFO etc to a large list of
words with a single click.
- Extract domain names that do not contain repeated letters or digits
(eg: zzsleep.com, aaweb.com, etc)

Domain Name Filter can process very large domain name lists with
little CPU overhead.
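
The "save, then query" half of that process (steps 3 and 4) could look
roughly like the Python sketch below. It uses SQLite from the standard
library as a stand-in for the Access or MySQL database mentioned
earlier, purely so the example is self-contained. The table layout,
the function names and the sample rows are all assumptions.

```python
import sqlite3
from urllib.parse import urlsplit

# In-memory database for the sketch; a file path would persist the data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (domain TEXT, url TEXT, title TEXT, body TEXT)")


def save_hit(conn, url, title, body):
    """Store one result page together with the domain taken from its URL."""
    conn.execute(
        "INSERT INTO hits VALUES (?, ?, ?, ?)",
        (urlsplit(url).netloc, url, title, body),
    )


def domains_mentioning(conn, phrase):
    """Return the distinct domains whose title or body contains the phrase."""
    pattern = f"%{phrase}%"
    rows = conn.execute(
        "SELECT DISTINCT domain FROM hits "
        "WHERE title LIKE ? OR body LIKE ? ORDER BY domain",
        (pattern, pattern),
    )
    return [row[0] for row in rows]


# Made-up rows standing in for parsed result pages.
save_hit(conn, "http://www.domain.com/books/page.html",
         "Book deals", "We are an Amazon associate.")
save_hit(conn, "http://ads.blah.com/index.html",
         "Blah", "Nothing relevant here.")

print(domains_mentioning(conn, "amazon associate"))
# prints: ['www.domain.com']
```

SQLite's LIKE ignores case for ASCII letters by default, which is why
the lowercase phrase matches "Amazon associate" here; other databases
may behave differently.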

Clarification of Answer by sycophant-ga on 13 Mar 2004 20:56 PST

Hi,

I understand what you are looking for, but unfortunately it falls
outside the scope of what I can do in the time I currently have
available to me. I don't really think I am going to be able to add
that sort of functionality to the script I've written. Also, I am not
really sure that PHP is the language to do it in.

As I am unable to create the sort of application you are looking for,
I am willing to withdraw my answer - let me know.

Regards,
Sycophant-ga

Request for Answer Clarification by author20-ga on 13 Mar 2004 21:37 PST

I think I can use this for now, but I'll probably ask you to enhance
it later on. I will test your program a bit more, try to get used to
the limitations, and pay the $20 tomorrow.

Request for Answer Clarification by author20-ga on 14 Mar 2004 06:34 PST

Hi, the program is so powerful, and I will definitely pay for it, but
I wanted to ask:

1. could you include the found hyperlink to the right of the domain
(leave 10 or more spaces), allowing the user to click and see the
page?
2. could you include the date to the right of the link?
3. Also, why does it hang if you ask for more than 3 pages? Is there a
way to limit the first query and put "See next 5 pages" at the bottom?
Right now, everything beyond the scope of the first search is lost.
With a feature that allows you to see "Next 5 pages" it seems that, in
theory, we won't lose any records.

I will pay a tip of $10 if you can add the link to the right of the
domain. I hope the date is possible. Listen, I do want to pay you next
week for enhancements, and I think they can be done within PHP.

Request for Answer Clarification by author20-ga on 14 Mar 2004 06:54 PST

Hi, again -- the program is very powerful. I definitely want to use
it, and will send payment if you can address the above issues. Also, I
will definitely give you a super-high rating, and additional jobs in
the near future. Just wanted to let you know that I am aware you put a
lot of time into the script, and that it will be valuable.

One more request - to the right of the domains and URLs, and the date
(if possible), can you put the hit count? That is, if a particular
domain shows up 12 times, can you put "12" to the right?

Clarification of Answer by sycophant-ga on 14 Mar 2004 21:57 PST

Hi again,

Here we go... Revised operation:

1) Choose the search term, the start page, and the number of pages to
return at once.
2) Search - This performs the search on Google, and scrapes the
results for the specified number of pages.
3) Repeat Step 2 as many times as necessary (manually at the moment)
until enough results have been gathered.
4) Output the collected data.

The data is returned as a list, ordered alphabetically by domain name
(ads.blah.com comes before www.apple.com), one line per unique domain
name. Each line has beside it a count of the individual links found
within that domain, and a link to each one.

See http://dylan.wibble.net/programming/ga-test/314579/grab2.php for a
working example, and
http://dylan.wibble.net/programming/ga-test/314579/grab2.zip to
download the revised code.

Regards,
Sycophant-ga
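
The output format described in step 4 (one line per unique domain,
sorted by host name, with a link count and the links themselves) can
be sketched in Python as follows. This is not the code in grab2.zip,
which is not shown in this thread; the function name group_by_domain
and the sample URLs are made up for the example.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_domain(urls):
    """Group URLs by host name and return (host, links) pairs sorted by host.

    Exact duplicate URLs are counted once. Sorting uses the full host
    name, so ads.blah.com sorts before www.apple.com.
    """
    grouped = defaultdict(list)
    for url in urls:
        host = urlsplit(url).netloc.lower()
        if url not in grouped[host]:
            grouped[host].append(url)
    return sorted(grouped.items())


# Made-up URLs standing in for links gathered over several searches.
urls = [
    "http://www.apple.com/store/",
    "http://ads.blah.com/a.html",
    "http://www.apple.com/ipod/",
    "http://ads.blah.com/a.html",
]

for host, links in group_by_domain(urls):
    print(host, len(links), " ".join(links))
# prints:
# ads.blah.com 1 http://ads.blah.com/a.html
# www.apple.com 2 http://www.apple.com/store/ http://www.apple.com/ipod/
```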

Clarification of Answer by sycophant-ga on 14 Mar 2004 22:02 PST

Sorry, I forgot to mention, I haven't been able to add the date
details, as not every (in fact not many) result includes a date
listing. Also, as the individual links are being consolidated into a
single line, there's no place to put the date for each individual
matching page for a domain.

Regards,
Sycophant

Request for Answer Clarification by author20-ga on 15 Mar 2004 06:25 PST

Hi, good design, I like the enhancements. There are two bugs. I will
accept and pay ASAP if you can:

1. fix the bug that appears on the main page: "Warning: Undefined
variable: domains in
/home/www/docs/dylan.wibble.net/programming/ga-test/314579/grab2.php
on line 138"
2. make it so that selecting a link opens the selected page in a new
window (the use of "#" was nice). Some pages will not let you return
to the referring page, so you have to get back to the script page by
entering the URL manually.

Clarification of Answer by sycophant-ga on 15 Mar 2004 20:43 PST

Hi,

Both those bugs have been corrected.

I may continue to develop this application as an independent project
over time, but that will not be for some time. I'll let you know
somehow if that happens.

Regards,
Sycophant-ga

author20-ga rated this answer: 5 out of 5 stars

and gave an additional tip of: $15.00

One of the most skilled Google API experts I have seen, I am hiring
him again to enhance this super Google search tool.

Comments

Subject: Re: Scrape domains (only) from google results.
From: pinkfreud-ga on 10 Mar 2004 16:47 PST

I wonder whether this software would meet your needs:

http://www.domainpunch.com/products/domainfilter/

Subject: Re: Scrape domains (only) from google results.
From: author20-ga on 10 Mar 2004 17:20 PST

I will check right now...

Subject: Re: Scrape domains (only) from google results.
From: edschmoe-ga on 18 Mar 2004 18:01 PST

Why do screen scrapes when Google offers an API to allow applications
to query it???????

www.google.com/apis/

Subject: Re: Scrape domains (only) from google results.
From: sycophant-ga on 18 Mar 2004 20:19 PST

I chose to implement an HTTP parse, or scrape, because the Google API
conditions are very specific, and I wasn't sure that the intended use
would fall within the usage guidelines.

I have used the API for other projects and I may do so in future.
