Q: Automatically download all 157 entries from a google search ( Answered 5 out of 5 stars, 2 Comments )
Question  
Subject: Automatically download all 157 entries from a google search
Category: Computers > Internet
Asked by: unclebob3322-ga
List Price: $20.00
Posted: 31 Jan 2005 08:54 PST
Expires: 02 Mar 2005 08:54 PST
Question ID: 466383
I google for "daily minutes site:www.mydomain.com" and google tells me:
    Results 1 - 10 of about 857 from www.mydomain.com for daily minutes

I want to automatically download the HTML of all 857 pages into c:\my temp folder\

I know how to do it manually one URL at a time, but it would take me hours.

Any ideas?

Request for Question Clarification by secret901-ga on 31 Jan 2005 11:14 PST
Hi unclebob3322-ga,
The Google API <http://www.google.com/apis/> allows you to write a
computer program to do that automatically.  Would you be willing to run
or write a computer program to do that?
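
If you would like to see the shape of such a program first, here is a
minimal sketch, assuming the GoogleSearch, GoogleSearchResult, and
GoogleSearchResultElement classes shipped in googleapi.jar; the class
name and the paging loop are illustrative, not code I have run against
your query:

  import com.google.soap.search.GoogleSearch;
  import com.google.soap.search.GoogleSearchResult;
  import com.google.soap.search.GoogleSearchResultElement;

  public class FetchResultUrls {
      public static void main(String[] args) throws Exception {
          String key = args[0];  // your free Google API key
          String query = "daily minutes site:www.mydomain.com";
          // The SOAP API returns at most 10 results per request,
          // so page through the ~857 hits 10 at a time.
          for (int start = 0; start < 860; start += 10) {
              GoogleSearch search = new GoogleSearch();
              search.setKey(key);
              search.setQueryString(query);
              search.setStartResult(start);
              search.setMaxResults(10);
              GoogleSearchResult result = search.doSearch();
              GoogleSearchResultElement[] hits = result.getResultElements();
              if (hits.length == 0) break;  // past the last page
              for (int i = 0; i < hits.length; i++) {
                  // Print each result URL; feed this list to a downloader.
                  System.out.println(hits[i].getURL());
              }
          }
      }
  }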

secret901-ga

Clarification of Question by unclebob3322-ga on 31 Jan 2005 14:55 PST
I know a little vba and some scripting, so I might have the talent. 

But I certainly don't have the time. 

If it can't be delivered within two days, I'll just download the pages
manually.   I would be willing to increase the price for someone to do
this, but $100 would be the highest I could go.

Request for Question Clarification by secret901-ga on 31 Jan 2005 17:40 PST
Hi unclebob3322-ga,
I have a Java program written for another project that essentially
does what you want.  I can make some minor modifications to it to suit
your needs.  Can you run Java applications and would you be willing to
obtain a Google API key (free) for it?

secret901-ga

Clarification of Question by unclebob3322-ga on 31 Jan 2005 20:33 PST
I never played with Java, so it is hard to say.  I followed the Google
instructions, got a key, and downloaded and extracted the googleapi.
The readme.txt says:

To quickly try the API, run
  java -cp googleapi.jar com.google.soap.search.GoogleAPIDemo <key> search Foo

BUT, IT DOESN'T TELL ME HOW TO REGISTER JAVA!!!
No setup.exe that I can see.  What am I missing?


Separate subject: Jobo may very well be what I want.  Its web site
says it will run a lot faster if I install Java first.  Does the
googleapi Java qualify, or is Jobo talking about a different Java?

Thanks for your help

Clarification of Question by unclebob3322-ga on 31 Jan 2005 20:34 PST
Plus, Jobo doesn't say it will run on Windows XP.

Am I out of luck?

Request for Question Clarification by rainbow-ga on 31 Jan 2005 22:03 PST
Hi unclebob3322,

Would you like me to download these URLs manually into a Word
document? I can then upload this document and post a URL for you to
download the document from.

Waiting to hear your views.

Regards,
Rainbow

Request for Question Clarification by secret901-ga on 31 Jan 2005 22:38 PST
Hi unclebob3322-ga,
The Java runtime environment (JRE) can be downloaded for free from the
Sun website at http://java.sun.com.  Download the latest version
(1.5): http://java.sun.com/j2se/1.5.0/download.jsp.

Since Jobo is a Java program, it will run on any system that has Java
installed, including Windows XP.
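
Once the JRE is installed, you can confirm it from a command prompt
and then run the demo from the folder where you extracted
googleapi.jar, with your actual key in place of <key>:

  java -version
  java -cp googleapi.jar com.google.soap.search.GoogleAPIDemo <key> search "daily minutes site:www.mydomain.com"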

Let me know if this works out for you.

secret901-ga

Request for Question Clarification by secret901-ga on 31 Jan 2005 22:45 PST
By the way, the Jobo Windows installer
<http://www.matuschek.net/software/jobo/download.html> has a built-in
Java Runtime Environment, so you don't have to download one from Sun
if you only want to try Jobo.

secret901-ga

Request for Question Clarification by palitoy-ga on 31 Jan 2005 22:51 PST
Hello unclebob3322-ga,

If you like the look of Jobo, you may also want to assess HTTrack
(http://www.httrack.com); as far as I can tell it does the same job as
Jobo without the need to install Java.  I have used HTTrack
extensively and it is a very good piece of software.

If this solves your problem, please let me know and I will post this as your answer.

Request for Question Clarification by pafalafa-ga on 01 Feb 2005 03:27 PST
Bob et al,

I've been following this question with interest, but I'm not sure if
you want to download 157 entire websites, or just the URLs.

Can you clarify?

If the latter, this link may be of use:

http://hacks.oreilly.com/pub/h/1090

although I have trouble actually getting the thing to work.  But then
again, I'm totally ignorant of things Java, so others may have better
luck.

pafalafa-ga

Clarification of Question by unclebob3322-ga on 01 Feb 2005 10:14 PST
pafalafa-ga:  I need the content of the pages, not just the URLs.

rainbow-ga:  Your offer to download them for me is very smart (cutting
the gordian knot!), but my goal was to be able to repeat this every
few weeks.
 
Secret901-ga:  I'll play with JRE and Jobo after I evaluate
palitoy-ga's suggestion of HTTrack.  (Download programming can be
complicated and I hope that HTTrack will be easier than programming it
myself.)

palitoy-ga:  HTTrack has an awful lot of settings that I don't understand.

Here is what I did, but it didn't work.  What am I missing?

1. I Googled for "excel shortcut hotkeys site:www.experts-exchange.com"
(this is a simple test that only gets 5 hits instead of 850!)
2. I copied the resulting URL to the clipboard:
http://www.google.com/search?hl=en&q=excel+shortcut+hotkeys+site%3Awww.experts-exchange.com
3. I opened HTTrack and pasted it into the web address URL field
4. I used all the default options except I set maximum mirroring depth
and external depth to 2. (I also unchecked "use proxy for ftp
transfers".)
5. I clicked on Finish

I get this message:
  HTTrack has detected that the current mirror is empty. If it was 
  an update, the previous mirror has been restored.
  Reason: the first page(s) either could not be found, or a 
  connection problem occured.
 => Ensure that the website still exists, and/or check your proxy settings! <=

What am I doing wrong?

P.S.  Here is the HTTrack log:

HTTrack3.32-2+swf launched on Tue, 01 Feb 2005 13:00:11 at
http://www.google.com/search?q=conference+call+minutes+site:www.sl.universalservice.org&hl=en&lr=&start=10&sa=N
+*.png +*.gif +*.jpg +*.css +*.js -ad.doubleclick.net/*
(winhttrack -qir2%e2C2%Ps2u1%sN0%I0p3DaK0H0%kf2A25000%f0#f -F
"Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)" -%F "<!--
Mirrored from %s%s by HTTrack Website Copier/3.x [XR&CO'2004], %s -->"
-%l "en, en, *" ://www.google.com/search?q=conference+call+minutes+site:www.sl.universalservice.org&hl=en&lr=&start=10&sa=N
-O "C:\My Web Sites\google for minutes,C:\My Web Sites\google for
minutes" +*.png +*.gif +*.jpg +*.css +*.js -ad.doubleclick.net/* -%A
php3,php,php2,asp,jsp,pl,cfm,nsf=text/html )
Information, Warnings and Errors reported for this mirror:
note: the hts-log.txt file, and hts-cache folder, may contain
sensitive information,
 such as username/password authentication for websites mirrored in this project
 do not share these files/folders if you want these information to remain private
13:00:13 Info:  No data seems to have been transfered during this
session! : restoring previous one!

Request for Question Clarification by palitoy-ga on 01 Feb 2005 10:47 PST
You were correct to keep all the default settings and to also alter
the depth settings to 2.  The other thing you need to do is to go to
the "Spider" tab in the options box and then make sure all the boxes
there are checked (you may not need some of them checked but when I
tested this I had them all checked!).  You also need to alter the
drop-down box to "no robots.txt rules".

I also changed the BrowserID to "Mozilla/4.05 [fr] (Win98; I)" on the
BrowserID tab but again this may not be completely necessary.

With all these settings, HTTrack should download a number of pages
from different websites and place them in separate directories under
the URL for each website (for example if a page from
www.1234567890.com was saved it would be saved in a directory called
www.1234567890.com).  By default this would be placed at "C:\My Web
Sites\httptrackprojectname\www.1234567890.com".
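
Incidentally, the WinHTTrack GUI settings map onto command-line
options (the winhttrack line in the log you posted is exactly that),
so once a combination works you can rerun it without re-entering
anything. As a rough sketch, with the flag meanings taken from my
reading of the HTTrack documentation (-r mirror depth, -%e external
depth, -s0 no robots.txt rules, -F browser ID); double-check them
against httrack --help for your version:

  httrack "http://www.google.com/search?q=conference+call+minutes+site:www.sl.universalservice.org&num=100" -O "C:\My Web Sites\google for minutes" -r2 -%e2 -s0 -F "Mozilla/4.05 [fr] (Win98; I)" +*.png +*.gif +*.jpg +*.css +*.js -ad.doubleclick.net/*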

Let me know if you have any success with these settings.

Clarification of Question by unclebob3322-ga on 01 Feb 2005 13:46 PST
rainbow-ga:  I changed my mind. 

I'm running out of time, so I would love it if you would download them for me.
How do I give you my e-mail address?

The desired Google query is:
  Conference call minutes site:www dot sl dot universalservice dot org

If rainbow will do it, I will close this question.
----------------------

For the other folks, I greatly appreciate your help, but I'm out of time.

For posterity, here is what made me run out of time.

Secret901-ga:  Mozilla 4.5 was certainly needed.  But things were still not right.

Part of the problem was that my Experts Exchange test query led to some
massive pages, so I reverted to the desired query (see above).

The conference call minutes query gave me 157 hits distributed on one
and a half pages (my Google preferences are set at 100).
But, for some reason HTTrack did not honor my Google preference, and
gave me 16 pages of 10 items.

Even worse, the URLs all seemed to continue to point to www dot
sl dot universalservice dot org.

I tried changing the link limits to 3 & 3, but it ran a lot longer so
I had to cancel it.  I then tried 2 & 3 and excluded google.com in the
scan rules tab.

This worked, but only for the first of the 15 pages.  The other pages
still kept pointing to the www addresses.

Plus, every time I changed the HTTrack "project name", I had to
re-enter all the parameters.

Sorry, this is fun, but I am running out of time; I have already spent
3 hours and I can't afford any more.

Hope rainbow can get the job done for me.

Request for Question Clarification by rainbow-ga on 01 Feb 2005 14:09 PST
Hi unclebob3322,

Thank you for your clarification.

I have a few things I need to clarify before proceeding.

Please verify this is the desired google query:

http://www.google.com/search?num=50&hl=en&lr=&newwindow=1&rls=GGLD%2CGGLD%3A2004-17%2CGGLD%3Aen&q=site%3A.www.sl.universalservice.org+Conference+call+minutes+

Also, in your question you state you want the HTML of the 157
pages. Later you say you need the content of the pages:

"Clarification of Question by unclebob3322-ga on 01 Feb 2005 10:14 PST 
pafalafa-ga:  I need the content of the pages, not just the URLs."

Please clarify. If the former is correct, I will be glad to list the
URLs in a Word document for you to download. However, it is beyond my
expertise if you are seeking the content of the pages.

Regarding your email address, all answers must be posted here, as
Researchers are not allowed personal contact.

Waiting to hear from you.

Best regards,
Rainbow

Clarification of Question by unclebob3322-ga on 01 Feb 2005 20:54 PST
Each of the 153 search results contains a URL.

Here is what I would do manually.  Hope you can automate it.

Open a new Word document.

for each url in the search,
  goto the url (each one contains minutes from a meeting; that's what
I want downloaded)
  ctrl a to select the whole page
  goto the word document
  edit > paste special > unformatted text
next url

So, when I am done, I will have 153 pages of English words. 
I don't mind if a bunch of extra junk comes along, but the English words are critical.
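
For reference, a minimal Java sketch of that loop, assuming the result
URLs are already saved one per line in a file; the file names
(urls.txt, minutes.txt) and the tag-stripping regular expressions are
illustrative (a crude stand-in for "edit > paste special > unformatted
text", not a robust HTML-to-text converter):

  import java.io.*;
  import java.net.URL;

  public class SaveMinutes {
      public static void main(String[] args) throws Exception {
          // One result URL per line in urls.txt; the plain text of
          // every page accumulates in minutes.txt.
          BufferedReader urls = new BufferedReader(new FileReader("urls.txt"));
          PrintWriter out = new PrintWriter(new FileWriter("minutes.txt"));
          String line;
          while ((line = urls.readLine()) != null) {
              // Fetch the raw HTML of the page.
              StringBuffer html = new StringBuffer();
              BufferedReader page = new BufferedReader(
                      new InputStreamReader(new URL(line).openStream()));
              String s;
              while ((s = page.readLine()) != null) {
                  html.append(s).append('\n');
              }
              page.close();
              // Drop scripts and styles, then strip the remaining tags.
              String text = html.toString()
                      .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ")
                      .replaceAll("(?s)<[^>]*>", " ");
              out.println("==== " + line + " ====");
              out.println(text);
          }
          out.close();
          urls.close();
      }
  }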

Clarification of Question by unclebob3322-ga on 01 Feb 2005 21:03 PST
By the way, yes, you do have the correct Google query.

Also, it is nearly trivial to get the URLs listed.

If that is all you planned on giving me, I'm sorry to disappoint you.

For instance, it took me one minute to manually get a list of the first 100 entries.

www dot sl dot universalservice dot org/vendor/agenda/061902min dot asp 
www dot sl dot universalservice dot org/vendor/agenda/100301min dot asp 
www dot sl dot universalservice dot org/vendor/agenda/103101min dot asp 
www dot sl dot universalservice dot org/vendor/agenda/102302min dot asp 
www dot sl dot universalservice dot org/vendor/agenda/082703min dot asp 
... and 95 more ....

Request for Question Clarification by rainbow-ga on 01 Feb 2005 22:29 PST
Hi unclebob3322,

For each page downloaded, do you want the URL of the page included, or
just the contents of the page? Also, do you require each page to be
downloaded into a separate Word document, or all of them in one
document? If one, do you require some sort of separation mark?

Regards,
Rainbow

Clarification of Question by unclebob3322-ga on 02 Feb 2005 10:40 PST
I don't care if you put the URL into the Word document.
I don't care if they are all in one document, or in 150 separate documents.
I don't care if there is a separation mark.

I don't know why everyone seems so confused about my needs. But just
to be clear, here are even more details of what I want.

I browsed to www dot sl dot universalservice dot
org/vendor/agenda/061902min dot asp

I did a Ctrl+A, Ctrl+C, and in Word: Edit > Paste Special > Unformatted Text.
The results are shown below.  Naturally, I want this repeated 153 times.

--- first part of pasted text is junk, but I don't care if you give it
to me or not ------------
---- chances are it will be a lot easier if you give it to me
----------------------
----------  just so you can see what this junk looks like, here are
the first 7 lines ------------
Graphics OffGraphics Off
About the SLD

2004 SP Training
WebEx Recordings
Training Presentations
Submit a Question

-------- I omit several hundred lines of further junk  ---------
------which is followed by the real data that I actually need. Here
are the first 8 lines of payload-------------
Minutes from the Wednesday Service Provider Conference Call
June 19, 2002 
UPDATES
Year 2002

For Funding Year 2002 we've committed $504 million so far, which means
that, 63% of the expected
---------------I omit several hundred lines of payload  -------------

Clarification of Question by unclebob3322-ga on 02 Feb 2005 10:43 PST
When you are done, there will be maybe 20,000 lines of text.  
I do NOT want that to show on Google Answers for the next 5 years.
How can you get it to me with a little privacy?  Perhaps you can post an FTP link?
Answer  
Subject: Re: Automatically download all 157 entries from a google search
Answered By: rainbow-ga on 03 Feb 2005 12:06 PST
Rated: 5 out of 5 stars
 
Hi unclebob3322,

At the following URL you will find the Word document containing the
information you requested. Since the document is over 2 MB in
size, I compressed it into a zipped format.

This file is available for download for a limited time only.

http://www.darkfriends.net/misc/contents.zip

My apologies for all the questions asked before this work was
undertaken. However, I considered it very important to clarify all
aspects of the
question at the outset so as to answer the question to your full satisfaction.

If you have any questions regarding my answer, please don't hesitate to
ask before rating.

Best regards,
Rainbow

Request for Answer Clarification by unclebob3322-ga on 03 Feb 2005 14:57 PST
Great job!
Three questions:

Which tool did you end up using?

How hard would it be to modify it to go one level lower if the top page
has a button that says "click here to access conference call minutes"?

How much of a bonus do you deserve?

Clarification of Answer by rainbow-ga on 03 Feb 2005 15:32 PST
Hi unclebob3322,

I'm pleased you are happy with my work.

Actually, I didn't use any tool. I manually copied and pasted the
contents into a Word document. It was a bit time-consuming, but it was
the only way I could think of to get the job done immediately, since I
felt you needed it as soon as possible.

If you would like, I can go back to the pages that say "click here to
access conference call minutes" and copy and paste the contents into a
new Word document. I did notice these, but as I wasn't sure what to
copy and paste, I left them as they were.

As for the bonus I deserve, thank you for asking; however, that would
be entirely up to you.

Best wishes,
Rainbow
unclebob3322-ga rated this answer: 5 out of 5 stars and gave an additional tip of $15.00
Thanks again

Comments  
Subject: Re: Automatically download all 157 entries from a google search
From: secret901-ga on 31 Jan 2005 17:52 PST
 
An idea:
Download Jobo <http://www.matuschek.net/software/jobo/> and tell it to
download only 1 level deep.  Give it the URL of the first page of
search results, then the second page, etc. (make sure Google is set to
show 100 results per page).  So instead of saving 857 pages manually,
you will only have to save 9 pages.
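
For example, with 100 results per page, the nine result-page URLs
differ only in the start parameter (num and start are the same Google
query-string parameters visible in the search URLs quoted in the
clarifications above; the query here is just your original example):

  http://www.google.com/search?q=daily+minutes+site:www.mydomain.com&num=100&start=0
  http://www.google.com/search?q=daily+minutes+site:www.mydomain.com&num=100&start=100
  ...
  http://www.google.com/search?q=daily+minutes+site:www.mydomain.com&num=100&start=800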

Regards,
secret901-ga
Google Answers Researcher
Subject: Re: Automatically download all 157 entries from a google search
From: rainbow-ga on 03 Feb 2005 16:36 PST
 
Thank you for the rating and tip.

Best regards,
Rainbow
