Q: writing web spider (Answered, 3 out of 5 stars, 3 Comments)
Question  
Subject: writing web spider
Category: Computers > Software
Asked by: googley-ga
List Price: $5.00
Posted: 14 Jul 2002 19:01 PDT
Expires: 13 Aug 2002 19:01 PDT
Question ID: 39614
I am developing a web spider which will check for broken images (image
tags which don't have corresponding images existing on the web server)
on a web site.
I want code in Visual Basic which will check for the existence of
images on the web server without creating an instance of a browser or any
browser control (since that takes time). I want some efficient code which
will send an HTTP request for a .jpg file and get a response from the server
which will not contain the file but will specify whether the file exists or
not. The code should be complete with error handling routines.
Answer  
Subject: Re: writing web spider
Answered By: wengland-ga on 15 Jul 2002 11:12 PDT
Rated: 3 out of 5 stars
 
Greetings!

While I am not a VB Coder, I can provide you with a link to sample
code that will make your application do what you want.

The DevX.com website has code and an article from their Spring 1998
"Getting Started with Visual Basic" magazine at:

http://www.devx.com/free/codelib/view.asp?id=342155

The WinInet .dll file (provided by Microsoft) provides the calls to
directly query a web server to retrieve or check for the existence of
a document.  The WinInet DLL gives complete Internet functionality to
any VB app.

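This .dll file provides a *ton* of connection methods and utilities.
The suggested one to use is InternetOpenUrl, which connects to a web
server and requests the file. For HTTP, a missing file can still
return a valid handle, so check the status code afterwards (for
example with HttpQueryInfo) to confirm that the file exists. This
should fulfil your requirement.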

The sample code provided in the article shows the exact steps to take
to make a connection and check for the existence of a file.

Sounds like a neat project; I hope you publish it when you are
finished.  I could use a tool like this.


Related Links

WinInet: Enable HTTP Communication in Windows-Based Client
Applications
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnmag01/html/USEMON.asp

VBinet.exe - samples of WinInet code in VB
http://support.microsoft.com/default.aspx?scid=kb;EN-US;q185519

Using WinInet Asynchronously in VB
http://support.microsoft.com/default.aspx?scid=kb;EN-US;q189850

Vbhttp.exe Demonstrates How to Use HTTP WinInet APIs in Visual Basic
http://support.microsoft.com/default.aspx?scid=kb;EN-US;q259100


Search terms:

wininet @ microsoft.com
http://search.microsoft.com/default.asp?boolean=ALL&nq=NEW&so=RECCNT&ig=01&ig=02&ig=03&ig=04&ig=05&ig=06&ig=07&ig=08&ig=09&ig=10&i=00&i=01&i=02&i=03&i=04&i=05&i=06&i=07&i=08&i=09&qu=wininet

visual basic (within results above) @ microsoft.com
http://search.microsoft.com/Default.asp?so=RECCNT&boolean=ALL&siteid=us&p=1&nq=WITHIN&fqu=%2522WININET%2522&qu=wininet&qu=visual+basic&nso=RECCNT&ig=1&ig=2&ig=3&ig=4&ig=5&ig=6&ig=7&ig=8&ig=9&ig=10&i=00&i=01&i=02&i=03&i=04&i=05&i=06&i=07&i=08&i=09

vb http library @ google.com
http://www.google.com/search?q=vb+http+library
googley-ga rated this answer: 3 out of 5 stars
Thanks. Your answer was helpful in pointing me in the right direction.
I already got it done using the Internet Transfer Control.

Comments  
Subject: Re: writing web spider
From: philip_lynx-ga on 14 Jul 2002 19:44 PDT
 
And all that for $5? Unless there is open source for that (which I
doubt, as you specifically require VB), good luck!
Subject: Re: writing web spider
From: iaint-ga on 15 Jul 2002 03:20 PDT
 
My knowledge of Visual Basic is insufficient to allow me to give you
the answer you requested, but I can give you some tips which may help
you (or someone else) look in the right direction.

All you need to do is open a TCP/IP connection to your target
webserver and then use the HTTP "HEAD" request to determine whether or
not your required file exists. The format of the HEAD command (and the
rest of HTTP/1.1) is fully covered by RFC 2616 but in essence all you
will need to send to the server is the following three lines:

HEAD /path/to/target/file.jpg HTTP/1.1
Host: www.webservername.com
Connection: Close

(followed by two CR/LF pairs)
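
As a rough illustration of the request format above, here is a minimal
sketch in Python rather than Visual Basic (the thread contains no VB
code to build on). The function name build_head_request is my own
invention, and the host and path are simply the placeholders from the
example above.

```python
def build_head_request(host: str, path: str) -> bytes:
    """Return the raw bytes of a minimal HTTP/1.1 HEAD request."""
    lines = [
        f"HEAD {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: Close",
    ]
    # Every line ends in CR/LF, and one extra CR/LF (an empty line)
    # marks the end of the headers, so the request ends in two pairs.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


# Placeholder host and path taken from the example request above.
request = build_head_request("www.webservername.com", "/path/to/target/file.jpg")
print(request)
```

The same string could be assembled in any language that can write to a
socket; the only details that matter are the CR/LF line endings and the
empty line at the end.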

You then need to capture the output from the server, which will likely
consist of 5 to 10 lines of text. If the requested file is
accessible, the server should return 'HTTP/1.1 200 OK' as
its first line; if not, you will most probably get 'HTTP/1.1 404'
(although if it exists but is not available for other reasons, other
statuses could occur; consult the HTTP documentation for full
details).
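
Reading the status out of the first line could look something like the
sketch below. It is in Python, not Visual Basic, and the names
parse_status_code and classify are hypothetical helpers. Treating
everything other than 200 and 404 as "other" is my own simplification;
a real spider would probably want to look at redirects and server
errors separately.

```python
def parse_status_code(response: str) -> int:
    """Extract the numeric status code from the first line of an HTTP response."""
    lines = response.splitlines()
    status_line = lines[0] if lines else ""
    parts = status_line.split(" ", 2)  # e.g. ["HTTP/1.1", "200", "OK"]
    if len(parts) < 2 or not parts[0].startswith("HTTP/") or not parts[1].isdigit():
        raise ValueError(f"not an HTTP status line: {status_line!r}")
    return int(parts[1])


def classify(code: int) -> str:
    """Map a status code onto the three cases described above."""
    if code == 200:
        return "ok"        # the file is accessible
    if code == 404:
        return "missing"   # the file was not found
    return "other"         # redirects, access denied, server errors, ...


sample = "HTTP/1.1 404 Not Found\r\nConnection: close\r\n\r\n"
print(classify(parse_status_code(sample)))
```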

Most computer languages make it fairly easy to use TCP/IP sockets,
often with library files or modules which can make it almost as simple
as writing to a local file. A quick Google search:
http://www.google.co.uk/search?q=visual+basic+tcp/ip+socket+open

has revealed, amongst many others, the site
http://www.15seconds.com/issue/990408.htm

which should give you some tips and source code examples that will
help you continue your software development.
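
To show how little socket code the whole check needs, here is a sketch
in Python (again, not Visual Basic). The names head_status and
image_exists, the 10 second timeout and the default port 80 are all
assumptions of mine. It speaks plain HTTP only, and it treats any
network failure as "does not exist", which may be too blunt for a real
spider.

```python
import socket


def head_status(host: str, path: str, port: int = 80, timeout: float = 10.0) -> int:
    """Send a HEAD request over a plain TCP socket and return the status code."""
    request = (
        f"HEAD {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: Close\r\n"
        "\r\n"
    ).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        chunks = []
        # "Connection: Close" asks the server to hang up after replying,
        # so reading until recv() returns b"" collects the whole response.
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    response = b"".join(chunks).decode("iso-8859-1")
    lines = response.splitlines()
    parts = lines[0].split(" ", 2) if lines else []
    if len(parts) < 2 or not parts[1].isdigit():
        raise ValueError("server did not send an HTTP status line")
    return int(parts[1])


def image_exists(host: str, path: str, port: int = 80) -> bool:
    """Return True only for a 200 response; errors count as 'not found'."""
    try:
        return head_status(host, path, port) == 200
    except (OSError, ValueError):
        # Connection refused, timeout, DNS failure, or a malformed reply.
        return False
```

A call such as image_exists("www.webservername.com",
"/path/to/target/file.jpg") would then give a yes/no answer without
downloading the image itself.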

Regards
iaint-ga

HTTP 1.1 Specification:
http://www.ietf.org/rfc/rfc2616.txt
Subject: Re: writing web spider
From: saulg-ga on 11 Sep 2002 07:46 PDT
 
Hi googley-ga 
I've been writing VB code since 1994 (VB2) and have previously written
a spider that downloaded some 25,000 pages automatically (took some 18
hours!)

I believe that the program should first parse the links and then
attempt to fetch & report errors when resources are not available.
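
The "parse first, then fetch" step might be sketched like this. It is
Python using only the standard library, not Visual Basic, and the
names ImageCollector and find_image_urls are made up for illustration.
It only looks at the src attribute of img tags and resolves relative
paths against the page address.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect absolute URLs for every <img src="..."> in a page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names for us.
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src" and value:
                # Resolve relative paths against the page's own URL.
                self.image_urls.append(urljoin(self.base_url, value))


def find_image_urls(html: str, base_url: str) -> list:
    """Return the image URLs referenced by the given HTML, in page order."""
    collector = ImageCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.image_urls


page = '<html><body><IMG SRC="logo.jpg"><img src="/img/a.png" alt="a"/></body></html>'
for url in find_image_urls(page, "http://www.example.com/dir/page.html"):
    print(url)
```

Each URL in the resulting list can then be checked on its own, and the
ones that do not come back as available reported as broken.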

I wouldn't mind having a go if you're still interested.

saulg-ga
