Hi, snowman5000-ga:
Since you are already working in VB.Net with what I assume is the "Web
browser" COM component (from shdocvw.dll), I will focus on the
question of how to extract "source code" from a Web page in it. By
this I assume you mean the HTML of a typical Web page.
As you probably know, while the .Net environment "wraps" COM
components with "interop" interfaces that mirror the methods and
properties of the underlying COM objects, there is no native .Net "Web
browser control" per se.
For a discussion of this see the thread at:
[Web Browser Control for DotNet?]
http://developersdex.com/vb/message.asp?p=1120&r=2339929
On the other hand there are some classes "native" to the .Net
framework:
System.Net.HttpWebRequest
System.Net.HttpWebResponse
which would suffice for submitting URLs to the Web and returning the
text of the HTTP responses. For example, this project does that sort
of thing in VB.Net:
http://www.c-sharpcorner.com/vbnet/httpdwnloader.asp
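To give a rough idea of that route, here is a minimal sketch of my own (not taken from the project linked above). The module name FetchHtml, the function name DownloadHtml and the example.com URL are placeholders I made up for illustration, and the reader is left on its default text encoding, which may not suit every page:

```vbnet
Imports System.IO
Imports System.Net

Module FetchHtml
    ' Hypothetical helper: returns the body of the HTTP response as text.
    Function DownloadHtml(ByVal url As String) As String
        Dim request As HttpWebRequest = CType(WebRequest.Create(url), HttpWebRequest)
        Dim response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
        Dim reader As New StreamReader(response.GetResponseStream())
        Try
            Return reader.ReadToEnd()
        Finally
            ' Release the reader and the connection even if reading fails.
            reader.Close()
            response.Close()
        End Try
    End Function

    Sub Main()
        ' Placeholder URL; substitute the page you want to fetch.
        Console.WriteLine(DownloadHtml("http://www.example.com/"))
    End Sub
End Module
```

This returns the text exactly as the server sent it, with no browser involved, so scripts are not run and nothing is rendered.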
But let's suppose you already have a simple Web browser application
like the one described here:
[Web Browser in C# and VB.Net]
http://www.c-sharpcorner.com/Internet/WebBrowserInCSMDB.asp
Perhaps you are thinking of adding a button that "extracts" the HTML
text of the current page? This could allow you to navigate around
with links and then extract the source of selected pages
interactively.
In any case we should look at the Document property of the Web browser
control, which returns (the automation object of) the active document.
When this active document is an HTML page, per your question, the
object returned is of type HTMLDocument.
The Web browser control can "contain" other types of documents, such
as Word, Acrobat, etc. So it might be useful to know that the Web
browser control's Type property returns a string which identifies the
type of document object that it contains.
By default, if you simply drag and drop a Web browser control from the
toolbox onto a form in your VB.Net project, it winds up being named
AxWebBrowser1. Let's assume that to be your case here.
Note that the HTMLDocument is a completely different automation object
from the Web browser control itself. The HTMLDocument interface is
provided by mshtml.dll, and to use this in your project you will need
to right-click on the References folder and select Add Reference. Go
to the .Net tab of the Add Reference dialog box and double click the
component named Microsoft.mshtml. Click OK.
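The HTML of the document's body can now be obtained this way:
    Dim HTMLBody As String
    Dim HTMLDoc As mshtml.HTMLDocument
    ' Document is returned as Object, so cast it (required under Option Strict)
    HTMLDoc = CType(AxWebBrowser1.Document, mshtml.HTMLDocument)
    HTMLBody = HTMLDoc.body.outerHTML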
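Note that body.outerHTML covers only the BODY element. If you also want the HEAD section, a variation along the following lines may help. This is a sketch only: it assumes the same AxWebBrowser1 control and Microsoft.mshtml reference as above, the variable name FullHTML is mine, and the cast will throw an InvalidCastException if the control is showing something other than an HTML page:

```vbnet
' Assumes a form with AxWebBrowser1 and a reference to Microsoft.mshtml.
Dim HTMLDoc As mshtml.HTMLDocument
Dim FullHTML As String
HTMLDoc = CType(AxWebBrowser1.Document, mshtml.HTMLDocument)
If Not HTMLDoc Is Nothing Then
    ' documentElement is the root HTML element, so this includes HEAD and BODY.
    FullHTML = HTMLDoc.documentElement.outerHTML
End If
```

Either way, what comes back is the browser's own serialization of the parsed page, so it can differ in detail (tag case, attribute quoting) from the bytes the server sent.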
To test this approach I created a new VB.Net "Windows application"
project called DOMDemo, added the Web browser control, a label "URL",
a textbox (to enter URLs), and a "Go" button. With the addition only
of the following code as the handler for clicks on Button1:
    Cursor.Current = Cursors.WaitCursor
    ' Navigate returns right away; the page itself loads asynchronously
    AxWebBrowser1.Navigate(TextBox1.Text)
    Cursor.Current = Cursors.Default
Voila, a working Web browser! I then added a second button
("Extract") to the form, and the following code as handler for clicks
on Button2:
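    Dim HTMLBody, HTMLTrunc As String
    Dim HTMLDoc As mshtml.HTMLDocument
    ' Document is returned as Object, so cast it (required under Option Strict)
    HTMLDoc = CType(AxWebBrowser1.Document, mshtml.HTMLDocument)
    HTMLBody = HTMLDoc.body.outerHTML
    ' Show only the first 100 characters in the message box
    HTMLTrunc = Mid(HTMLBody, 1, 100)
    MsgBox(HTMLTrunc)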
Here I'm truncating the HTML body down to a hundred characters just
because MsgBox limits its prompt to about a thousand characters. I set
a debug breakpoint here anyway so I could inspect the full string.
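One caveat: Navigate returns before the page has finished loading, so a click on "Extract" that comes too early may find no document, or an incomplete one. A possible refinement, sketched here on the assumption that the interop wrapper names the event argument type as shown (check the name the IDE generates when you pick DocumentComplete from the event drop-down), is to enable Button2 only once the control reports that a document has completed:

```vbnet
' Goes inside the form class that hosts AxWebBrowser1 and Button2.
' The handler signature is what I would expect the wrapper to generate;
' treat it as an assumption and compare with your own project.
Private Sub AxWebBrowser1_DocumentComplete(ByVal sender As Object, _
        ByVal e As AxSHDocVw.DWebBrowserEvents2_DocumentCompleteEvent) _
        Handles AxWebBrowser1.DocumentComplete
    ' A document has finished loading, so extraction is now safe.
    Button2.Enabled = True
End Sub
```

For this to be useful Button2 would start out disabled (and be disabled again when "Go" is clicked). On pages with frames the event can fire more than once, once per frame and once for the top-level page.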
Obviously there could be various things you might want to change about
this, depending on the sort of analysis you plan on doing. In
particular I'd point out the collections HTMLDocument.anchors and
HTMLDocument.links that might expedite your analysis, if you were
interested in checking links between pages. If you replace outerHTML
by outerText, you get plain text (without tags). If you replace
outerHTML by innerHTML, you get the markup between the element's
opening and closing tags, without the tags themselves; innerText works
the same way for plain text.
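As an illustration of the links collection, here is a short sketch that writes the target of every link on the current page to the debugger's output window. It assumes the same AxWebBrowser1 control and Microsoft.mshtml reference as before; the variable name Element is mine, and I have not tried it against pages with frames:

```vbnet
' Assumes a form with AxWebBrowser1 and a reference to Microsoft.mshtml.
Dim HTMLDoc As mshtml.HTMLDocument
Dim Element As mshtml.IHTMLElement
HTMLDoc = CType(AxWebBrowser1.Document, mshtml.HTMLDocument)
For Each Element In HTMLDoc.links
    ' Read the href attribute of each link element in the collection.
    System.Diagnostics.Debug.WriteLine( _
        Convert.ToString(Element.getAttribute("href", 0)))
Next
```

The same loop should work with HTMLDoc.anchors, although anchors that only carry a name will have no useful href.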
More references that may be of interest:
[MSDN Library MSHTML Reference]
http://msdn.microsoft.com/library/default.asp?url=/workshop/browser/mshtml/reference/reference.asp
[Mastering IE: The Web Browser Control]
(an introductory VB6 take, slightly disorganised)
http://www.vbwm.com/art_2001/IE05/
[Accessing the DHTML DOM from C#]
(ok, it's got C# code, but very relevant anyway)
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dv_vstechart/html/vsgrfwalkthroughaccessingdhtmldomfromc.asp
[Trapping DHTML events from the WebBrowser control]
(deals with duplicate interface names, not crucial above)
http://www.vb2themax.com/Item.asp?PageID=TipBank&ID=561
regards, mathtalk-ga
Search Strategy:
Keywords: "VB.Net" "Web browser control"
://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=%22VB.Net%22+%22Web+browser+control%22&btnG=Google+Search
Keywords: VB "mshtml.htmldocument"
://www.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=VB+%22mshtml.htmldocument%22&btnG=Google+Search
Clarification of Answer by mathtalk-ga on 09 Dec 2002 20:52 PST
Hi, snowman5000-ga:
While researching your question, I tended to ignore sites that require
user registration, even if it is free to do so, because in some cases
it might turn out to be "spam bait".
However this site is operated by Wrox Press, and I've been registered
there for a long time & have a lot of respect for them. I've never
had any suspicion they might be sharing my email address with third
parties, and IIRC their TOS promise not to do this. So you might want
to look here (this is a free article but you'll have to register at
the site to read it):
[Programming Internet Explorer in C#]
http://www.csharptoday.com/content.asp?id=1980&csharp0161
from the abstract:
"You can also consider this case study to be about COM to .NET
interoperability."
My perception is that 9 times out of 10 the translation of C# lines to
VB.Net lines is fairly obvious (at least after someone shows it to
you!), so if you are pursuing this IE topic further, the above article
may be helpful.
regards and thanks again,
mathtalk-ga