Q: parsing in html using java ( No Answer,   1 Comment )
Question  
Subject: parsing in html using java
Category: Computers > Programming
Asked by: naveen82-ga
List Price: $50.00
Posted: 14 Nov 2005 19:58 PST
Expires: 25 Nov 2005 15:55 PST
Question ID: 593060
Sample code in Java for parsing table tags and then searching for a keyword in <td> cells.

Clarification of Question by naveen82-ga on 14 Nov 2005 20:04 PST
I need to search among the tables in a text file (the stored source
of a web page) for the table that is larger than the others, and also
for the table that contains a specific keyword. I also need to count
the occurrences of the keyword. Your help would be greatly appreciated.

Request for Question Clarification by leapinglizard-ga on 14 Nov 2005 23:49 PST
I don't think it's necessary to fully parse the HTML in order to find
tables containing a keyword. It suffices to look for occurrences of
the keyword between pairs of <table> and </table> tags. I can give you
more detailed assistance if you specify more precisely the
requirements of the task. Furthermore, what do you mean by the "size"
of a table? And what does it mean to "check for the count"?

leapinglizard
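
The suggestion above, looking for the keyword between pairs of <table>
and </table> tags without fully parsing the HTML, could be sketched
roughly as follows. This is only an illustration under assumptions that
are not in the thread: the class and method names are made up, tables
are assumed not to be nested, and matching is case-insensitive. Note
that it counts hits in the raw table text, so a keyword inside an
attribute such as href is counted too.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is illustrative only.
class TableKeywordScan {
    // Matches one <table ...> ... </table> block. The lazy match stops at
    // the first closing tag, so nested tables are NOT handled correctly.
    private static final Pattern TABLE = Pattern.compile(
            "<table\\b[^>]*>.*?</table\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the raw text of every (non-nested) table block in the page. */
    static List<String> findTables(String html) {
        List<String> tables = new ArrayList<>();
        Matcher m = TABLE.matcher(html);
        while (m.find()) {
            tables.add(m.group());
        }
        return tables;
    }

    /** Counts case-insensitive, non-overlapping occurrences of keyword in text. */
    static int countOccurrences(String text, String keyword) {
        if (keyword.isEmpty()) {
            return 0;
        }
        String haystack = text.toLowerCase();
        String needle = keyword.toLowerCase();
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) != -1) {
            count++;
            from += needle.length();
        }
        return count;
    }

    public static void main(String[] args) {
        String page = "<p>java</p>"
                + "<table><tr><td>java</td><td>Java tips</td></tr></table>"
                + "<TABLE border=1><tr><td>nothing here</td></tr></TABLE>";
        for (String table : findTables(page)) {
            System.out.println(countOccurrences(table, "java") + " hit(s) in: " + table);
        }
    }
}
```

For a stored page, the file contents would first be read into a String
and passed to findTables.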

Clarification of Question by naveen82-ga on 15 Nov 2005 03:18 PST
Since my interest is only in the tables, I need not look at the other
tags of the page. I need to search the text in the <td> cells of the
tables. A typical file of mine contains <a href> tags and also text
(any general web page). I need to find the table that has the largest
number of <td> cells and also count the repetitions of a particular
word (like a search string) in the text. Based on this observation I
need to return only one table. Third-party software such as SourceForge
HTML or JDOM or any kind of DOM can be used for this.

Request for Question Clarification by leapinglizard-ga on 15 Nov 2005 06:36 PST
I can give you sample code for this, but I'll need to know a few more details.

How do you want to identify the table with the greatest number of <td>
pairs? Should the program output the full text of the table, or just
the beginning and ending indices in the text file?

How do you want to handle ties? Would it do to pick the first table
that has the maximum number of <td> pairs?

Finally, I want to make sure I understand the keyword task. Do you
want to count the occurrences of a given keyword only in the table
that has the maximum number of <td> pairs?

I'll be able to help you in short order once we get these matters resolved.

leapinglizard

Clarification of Question by naveen82-ga on 15 Nov 2005 07:34 PST
What I need is to mine a web page (a search engine result). I have
already done the part that stores the web page. A web page actually
contains many tables. My result is stored in the table (general
observation) that has the maximum size (as we see in a search engine,
it has the largest number of <tr> and <td> tags in it). The search
query is also present in the <td> cells of that table (IT ALSO HAS HREF
TAGS IN IT). In general, the search string is repeated more often in
the result table (another general observation). Hence I need to
extract the table that has the most <tr> tags, also taking the search
string into account. My output is to present the whole table as an
HTML page again. So I need that table (if two tables match exactly,
then both are to be presented) in an HTML file. As far as I know, a DOM
would be sufficient for this. But I am very new to the use of DOM. Any
help regarding this would be greatly appreciated.
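
One way the selection described above might look in Java is sketched
below. The thread never settles how cell count and keyword count should
be combined, so the ranking here is an assumption: tables without the
keyword are skipped, the rest are ranked by number of <td> pairs, ties
are broken by keyword hits in the cell text, and every table that ties
on both counts is returned. The class and method names are
hypothetical, and nested tables are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names and ranking rule are assumptions.
class BestTablePicker {
    private static final Pattern TABLE = Pattern.compile(
            "<table\\b[^>]*>.*?</table\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern CELL = Pattern.compile(
            "<td\\b[^>]*>(.*?)</td\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Number of <td>...</td> pairs in one table block. */
    static int countCells(String table) {
        Matcher m = CELL.matcher(table);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    /** Keyword hits in the cell text only; tags such as <a href> are stripped first. */
    static int countKeywordInCells(String table, String keyword) {
        if (keyword.isEmpty()) return 0;
        Pattern word = Pattern.compile(Pattern.quote(keyword), Pattern.CASE_INSENSITIVE);
        Matcher cells = CELL.matcher(table);
        int n = 0;
        while (cells.find()) {
            String text = TAG.matcher(cells.group(1)).replaceAll(" ");
            Matcher hits = word.matcher(text);
            while (hits.find()) n++;
        }
        return n;
    }

    /** Returns the best table, or several tables if they tie on both counts. */
    static List<String> pickBest(String html, String keyword) {
        List<String> best = new ArrayList<>();
        int bestCells = -1;
        int bestHits = -1;
        Matcher m = TABLE.matcher(html);
        while (m.find()) {
            String table = m.group();
            int cells = countCells(table);
            int hits = countKeywordInCells(table, keyword);
            if (hits == 0) continue; // assumption: the result table must contain the keyword
            if (cells > bestCells || (cells == bestCells && hits > bestHits)) {
                best.clear();
                bestCells = cells;
                bestHits = hits;
            }
            if (cells == bestCells && hits == bestHits) best.add(table);
        }
        return best;
    }

    /** Wraps the selected tables in a minimal HTML page. */
    static String toHtmlPage(List<String> tables) {
        StringBuilder sb = new StringBuilder("<html><body>\n");
        for (String t : tables) sb.append(t).append('\n');
        return sb.append("</body></html>\n").toString();
    }

    public static void main(String[] args) {
        String page = "<table><tr><td>menu</td></tr></table>"
                + "<table><tr><td><a href=\"x\">java</a></td><td>java java</td></tr></table>";
        System.out.print(toHtmlPage(pickBest(page, "java")));
    }
}
```

The string returned by toHtmlPage can then be written to an .html file
with the usual file APIs.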

Clarification of Question by naveen82-ga on 15 Nov 2005 07:41 PST
One of the important things I didn't emphasize is that I need the code
in Java. Any parser built in Java can be used, I guess.

Clarification of Question by naveen82-ga on 24 Nov 2005 19:06 PST
Hi,
I was just curious about my question. Can anyone please respond to it?
The comment received from larkas was great! I went through it, but
what I understood was that it is for XML. I don't want to do something
like converting from HTML to XML and then back to HTML again! Does
anyone know how HTML parsing is done to search for a keyword in an
HTML page? Thanks in advance! Any help would be really appreciated!
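
For parsing the HTML directly, with no conversion to XML, one option
that needs no third-party library is the HTML parser bundled with the
JDK in javax.swing.text.html. The sketch below is an illustration, not
an answer from the thread: the class name TdKeywordCounter is made up,
and it counts <td> cells and keyword hits across the whole page rather
than per table. Only text inside <td> cells is searched, so attribute
values such as href are ignored.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback class; the name and fields are illustrative only.
class TdKeywordCounter extends HTMLEditorKit.ParserCallback {
    private final String keyword;
    private int tdDepth = 0; // greater than zero while inside a <td>
    int cells = 0;           // number of <td> start tags seen
    int hits = 0;            // keyword occurrences in text inside <td> cells

    TdKeywordCounter(String keyword) {
        this.keyword = keyword.toLowerCase();
    }

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.TD) {
            tdDepth++;
            cells++;
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.TD && tdDepth > 0) {
            tdDepth--;
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (tdDepth == 0 || keyword.isEmpty()) {
            return; // ignore text outside table cells
        }
        String text = new String(data).toLowerCase();
        int from = 0;
        while ((from = text.indexOf(keyword, from)) != -1) {
            hits++;
            from += keyword.length();
        }
    }

    /** Parses the page and returns the counter holding the totals. */
    static TdKeywordCounter scan(Reader html, String keyword) {
        TdKeywordCounter counter = new TdKeywordCounter(keyword);
        try {
            // true = ignore any charset declared in a <meta> tag
            new ParserDelegator().parse(html, counter, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counter;
    }

    public static void main(String[] args) {
        String page = "<html><body><table><tr><td>java rocks</td>"
                + "<td><a href=\"java.html\">more Java</a></td></tr></table></body></html>";
        TdKeywordCounter c = scan(new StringReader(page), "java");
        System.out.println(c.cells + " cells, " + c.hits + " keyword hits");
    }
}
```

To keep counts per table, the same callback could also watch for
HTML.Tag.TABLE start and end tags and reset the totals there.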
Answer  
There is no answer at this time.

Comments  
Subject: Re: parsing in html using java
From: larkas-ga on 17 Nov 2005 00:17 PST
 
Take a look at this article:

http://www-128.ibm.com/developerworks/java/library/j-jtp03225.html?ca=sgr-lnxw07XQuery

It shows how to screen scrape HTML using XQuery. It may be overkill
for what you want to do, but it has sample code that will work fine
and has plenty of room to grow with you.
