Q: VisualBasic.Net Web Browser control (Answered, 5 out of 5 stars, 0 Comments)
Question  
Subject: VisualBasic.Net Web Browser control
Category: Computers > Programming
Asked by: snowman5000-ga
List Price: $50.00
Posted: 05 Dec 2002 15:48 PST
Expires: 04 Jan 2003 15:48 PST
Question ID: 120006
I'm working with web browser control in VisualBasic.Net. I want to
know how to automatically extract the source code from a webpage and
load it into a variable so my program can conduct analysis on it. What
code can I use to achieve this?
Answer  
Subject: Re: VisualBasic.Net Web Browser control
Answered By: mathtalk-ga on 06 Dec 2002 09:42 PST
Rated: 5 out of 5 stars
 
Hi, snowman5000-ga:

Since you are already working in VB.Net with what I assume is the "Web
browser" COM component (from shdocvw.dll), I will focus on the
question of how to extract the "source code" of a Web page loaded in
it.  By this I assume you mean the HTML of a typical Web page.

As you probably know, while the .Net environment "wraps" COM
components with "interop" interfaces that mirror the methods and
properties of the underlying COM objects, there is no native .Net "Web
browser control" per se.

For a discussion of this see the thread at:

[Web Browser Control for DotNet?]
http://developersdex.com/vb/message.asp?p=1120&r=2339929

On the other hand there are some classes "native" to the .Net
framework:

 System.Net.HttpWebRequest
 System.Net.HttpWebResponse

which would suffice for submitting URLs to the Web and returning the
text of the HTTP responses.  For example, this project does that sort
of thing in VB.Net:

http://www.c-sharpcorner.com/vbnet/httpdwnloader.asp
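
As a rough sketch of that approach (my own illustration, not code taken
from the linked project), a function along these lines would return the
text of a page using only the two classes named above.  The names
HttpSourceDemo and DownloadHtml and the example URL are assumptions made
for illustration, and the sketch simply uses StreamReader's default
encoding rather than detecting the page's character set:

```vbnet
Imports System.IO
Imports System.Net

Module HttpSourceDemo
    ' Hypothetical helper: returns the text of the HTTP response for a URL.
    Function DownloadHtml(ByVal url As String) As String
        Dim request As HttpWebRequest = CType(WebRequest.Create(url), HttpWebRequest)
        Dim response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
        Dim reader As New StreamReader(response.GetResponseStream())
        Try
            ' Read the whole response body into one string
            Return reader.ReadToEnd()
        Finally
            ' Release the stream and the connection even if reading fails
            reader.Close()
            response.Close()
        End Try
    End Function

    Sub Main()
        ' Example URL is a placeholder
        Dim html As String = DownloadHtml("http://www.example.com/")
        Console.WriteLine(html.Length)
    End Sub
End Module
```

This route returns the text exactly as the server sent it, without
needing a browser control on a form at all.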

But let's suppose you already have a simple Web browser application
like the one described here:

[Web Browser in C# and VB.Net]
http://www.c-sharpcorner.com/Internet/WebBrowserInCSMDB.asp

Perhaps you are thinking of adding a button that "extracts" the HTML
text of the current page?  This could allow you to navigate around
with links and then extract the source of selected pages
interactively.

In any case we should look at the Document property of the Web browser
control, which returns (the automation object of) the active document.
When this active document is an HTML page, per your question, the
object returned is of type HTMLDocument.

The Web browser control can "contain" other types of documents, such
as Word, Acrobat, etc.  So it might be useful to know that the Web
browser control's Type property returns a string which identifies the
type of document object that it contains.

By default, if you simply drag and drop a Web browser control from the
toolbox onto a form in your VB.Net project, it winds up being named
AxWebBrowser1.  Let's assume that to be your case here.

Note that the HTMLDocument is a completely different automation object
from the Web browser control itself.  The HTMLDocument interface is
provided by mshtml.dll, and to use this in your project you will need
to right-click on the References folder and select Add Reference.  Go
to the .Net tab of the Add Reference dialog box and double click the
component named Microsoft.mshtml.  Click OK.

The text of the HTML document can now be obtained this way:

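Dim HTMLBody As String
Dim HTMLDoc As mshtml.HTMLDocument
' Document is typed as Object, so cast it to the MSHTML document type
HTMLDoc = CType(AxWebBrowser1.Document, mshtml.HTMLDocument)
' outerHTML of the body element: the <body> tag and everything inside it
HTMLBody = HTMLDoc.body.outerHTML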

To test this approach I created a new VB.Net "Windows application"
project called DOMDemo, added the Web browser control, a label "URL",
a textbox (to enter URLs), and a "Go" button.  With the addition only
of the following code as the handler for clicks on Button1:

        Cursor.Current = Cursors.WaitCursor
        AxWebBrowser1.Navigate(TextBox1.Text)
        Cursor.Current = Cursors.Default

Voila, a working Web browser!  I then added a second button
("Extract") to the form,  and the following code as handler for clicks
on Button2:

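        Dim HTMLBody, HTMLTrunc As String
        Dim HTMLDoc As mshtml.HTMLDocument
        ' Document is typed as Object, so cast it explicitly
        HTMLDoc = CType(AxWebBrowser1.Document, mshtml.HTMLDocument)
        HTMLBody = HTMLDoc.body.outerHTML
        ' Show only the first 100 characters in the message box
        HTMLTrunc = Mid(HTMLBody, 1, 100)
        MsgBox(HTMLTrunc)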

Here I'm truncating the HTML body down to a hundred characters just
because there's a limit of about a thousand characters.  I set a debug
breakpoint here anyway so I could see whatever might be of interest.
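
If the extraction should happen automatically rather than on a button
click, one option is to handle the control's DocumentComplete event,
since Navigate returns before the page has finished loading.  The
following is only a sketch under the same assumptions as above (a
control named AxWebBrowser1 and a reference to Microsoft.mshtml); the
field name PageSource is hypothetical, and the code belongs inside the
form class:

```vbnet
    ' Hypothetical field that receives the HTML of the most recently loaded page
    Private PageSource As String

    Private Sub AxWebBrowser1_DocumentComplete(ByVal sender As Object, _
            ByVal e As AxSHDocVw.DWebBrowserEvents2_DocumentCompleteEvent) _
            Handles AxWebBrowser1.DocumentComplete
        Dim doc As Object = AxWebBrowser1.Document

        ' The control may hold no document, or a non-HTML one (Word, Acrobat, etc.)
        If doc Is Nothing Then Return
        If Not TypeOf doc Is mshtml.IHTMLDocument2 Then Return

        Dim HTMLDoc As mshtml.IHTMLDocument2 = CType(doc, mshtml.IHTMLDocument2)
        If HTMLDoc.body Is Nothing Then Return

        ' On pages with frames this event can fire more than once;
        ' each firing simply overwrites the previous value.
        PageSource = HTMLDoc.body.outerHTML
    End Sub
```

If the <head> section matters to your analysis, the document object
also exposes documentElement, whose outerHTML covers the whole <html>
element rather than just the body.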
        
Obviously there could be various things you might want to change about
this, depending on the sort of analysis you plan on doing.  In
particular I'd point out the collections HTMLDocument.anchors and
HTMLDocument.links that might expedite your analysis, if you were
interested in checking links between pages.  If you replace outerHTML
with outerText, you get the plain text (without tags).  If you replace
outerHTML with innerHTML (or innerText), you get only the content
between the element's opening and closing tags.
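
To illustrate the links collection mentioned above, here is a short
sketch that walks HTMLDocument.links and writes each href to the debug
output.  The method name ListLinks is an assumption for illustration;
the code belongs inside the form class and should be called only after
the page has finished loading:

```vbnet
    ' Hypothetical helper: lists the href of every link on the current page
    Private Sub ListLinks()
        Dim HTMLDoc As mshtml.HTMLDocument
        HTMLDoc = CType(AxWebBrowser1.Document, mshtml.HTMLDocument)

        Dim element As mshtml.IHTMLElement
        For Each element In HTMLDoc.links
            ' getAttribute returns an Object; 0 is the default flag value
            Dim href As Object = element.getAttribute("href", 0)
            If Not href Is Nothing Then
                System.Diagnostics.Debug.WriteLine(CStr(href))
            End If
        Next
    End Sub
```

The same loop could collect the addresses into an ArrayList instead of
printing them, if the analysis needs to follow them later.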

More references that may be of interest:

[MSDN Library MSHTML Reference]
http://msdn.microsoft.com/library/default.asp?url=/workshop/browser/mshtml/reference/reference.asp

[Mastering IE: The Web Browser Control]
(an introductory VB6 take, slightly disorganised)
http://www.vbwm.com/art_2001/IE05/

[Accessing the DHTML DOM from C#]
(ok, it's got C# code, but very relevant anyway)
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dv_vstechart/html/vsgrfwalkthroughaccessingdhtmldomfromc.asp

[Trapping DHTML events from the WebBrowser control]
(deals with duplicate interface names, not crucial above)
http://www.vb2themax.com/Item.asp?PageID=TipBank&ID=561

regards, mathtalk-ga


Search Strategy:

Keywords: "VB.Net" "Web browser control"
://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=%22VB.Net%22+%22Web+browser+control%22&btnG=Google+Search

Keywords: VB "mshtml.htmldocument"
://www.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=VB+%22mshtml.htmldocument%22&btnG=Google+Search

Clarification of Answer by mathtalk-ga on 06 Dec 2002 09:45 PST
Oops!  When I said there's a limit of about a thousand characters, I
did not make clear that I meant the limit of what MsgBox function can
display.  There is no such limit in the DHTML DOM implementation, or
in the Web browser itself.

regards, mathtalk-ga

Clarification of Answer by mathtalk-ga on 06 Dec 2002 16:09 PST
Thanks, snowman5000, for the kind words (and tip!).  After working on
your interesting question, the feedback means a lot to me.

best wishes, mathtalk-ga

Clarification of Answer by mathtalk-ga on 09 Dec 2002 20:52 PST
Hi, snowman5000-ga:

While researching your question, I tended to ignore sites that require
user registration, even if it is free to do so, because in some cases
it might turn out to be "spam bait".

However this site is operated by Wrox Press, and I've been registered
there for a long time & have a lot of respect for them.  I've never
had any suspicion they might be sharing my email address with third
parties, and IIRC their TOS promise not to do this.  So you might want
to look here (this is a free article but you'll have to register at
the site to read it):

[Programming Internet Explorer in C#]
http://www.csharptoday.com/content.asp?id=1980&csharp0161

from the abstract:

"You can also consider this case study to be about COM to .NET
interoperability."

My perception is that 9 times out of 10 the translation of C# lines to
VB.Net lines is fairly obvious (at least after someone shows it to
you!), so if you are pursuing this IE topic further, the above article
may be helpful.

regards and thanks again,
mathtalk-ga
snowman5000-ga rated this answer: 5 out of 5 stars and gave an additional tip of $5.00
This answer was exactly what I was looking for. This will be very
useful to me. Much Thanks.
