Google Answers: parsing in html using java


Q: parsing in html using java ( No Answer, 1 Comment )

Question

Subject: parsing in html using java
Category: Computers > Programming
Asked by: naveen82-ga
List Price: $50.00

Posted: 14 Nov 2005 19:58 PST
Expires: 25 Nov 2005 15:55 PST
Question ID: 593060

Sample code in Java for parsing table tags and then searching for a keyword in <td>.

Clarification of Question by naveen82-ga on 14 Nov 2005 20:04 PST
I need to search among the tables in a text file (this is the source of a web page that has been stored), find the table that is larger than the other tables, and also find the table that contains a specific keyword. I also need to count the keyword's occurrences. Your help would be greatly appreciated.

Request for Question Clarification by leapinglizard-ga on 14 Nov 2005 23:49 PST
I don't think it's necessary to fully parse the HTML in order to find tables containing a keyword. It suffices to look for occurrences of the keyword between pairs of <table> and </table> tags. I can give you more detailed assistance if you specify the requirements of the task more precisely. Furthermore, what do you mean by the "size" of a table? And what does it mean to "check for the count"?

leapinglizard
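
The approach suggested above, looking for the keyword between pairs of <table> and </table> tags without fully parsing the HTML, could be sketched in Java roughly as follows. This is only an illustration, not code posted by anyone in this thread. The class name TableKeywordScan and the method tablesContaining are made up for the example, and the sketch assumes that tables are not nested and that a case-insensitive match anywhere inside the table block (markup included) counts as a hit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption made for this sketch.
class TableKeywordScan {
    // Matches one <table ...> ... </table> block.
    // Assumption: tables are not nested, so the first </table> closes the block.
    private static final Pattern TABLE = Pattern.compile(
            "<table\\b[^>]*>.*?</table>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the full text of every table block that contains the keyword. */
    static List<String> tablesContaining(String html, String keyword) {
        List<String> hits = new ArrayList<>();
        String needle = keyword.toLowerCase();
        Matcher m = TABLE.matcher(html);
        while (m.find()) {
            String table = m.group();
            // Case-insensitive substring test over the whole block, tags included.
            if (table.toLowerCase().contains(needle)) {
                hits.add(table);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>java parser</td></tr></table>"
                + "<p>java</p>"
                + "<TABLE><tr><td>nothing here</td></tr></TABLE>";
        for (String table : tablesContaining(html, "java")) {
            System.out.println(table);
        }
    }
}
```

A regular expression like this is fragile on malformed or nested markup; a real HTML parser would be the safer choice if the saved pages are messy.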

Clarification of Question by naveen82-ga on 15 Nov 2005 03:18 PST
Since my interest is only in the tables, I need not look at the other tags on the page. I need to search the text in the <td> cells of the tables. A typical file of mine contains <a href> tags and also text (any general web page). I need to find the table that has the greatest number of <td> cells and also count the repetitions of a particular word (like a search string) in the text. Based on this, I need to return only one table. Third-party software such as SourceForge HTML, JDOM, or any kind of DOM can be used for this.

Request for Question Clarification by leapinglizard-ga on 15 Nov 2005 06:36 PST
I can give you sample code for this, but I'll need to know a few more details. How do you want to identify the table with the greatest number of <td> pairs? Should the program output the full text of the table, or just the beginning and ending indices in the text file? How do you want to handle ties? Would it do to pick the first table that has the maximum number of <td> pairs? Finally, I want to make sure I understand the keyword task. Do you want to count the occurrences of a given keyword only in the table that has the maximum number of <td> pairs? I'll be able to help you in short order once we get these matters resolved.

leapinglizard

Clarification of Question by naveen82-ga on 15 Nov 2005 07:34 PST
What I need is to mine a web page (a search engine result). I have already done the part that stores the web page. A web page actually contains many tables. As a general observation, my result is stored in the table that has the maximum size (as we see in a search engine, it has the greatest number of <tr> and <td> tags in it). The search query is also present in the <td> cells of that table (IT ALSO HAS HREF TAGS IN IT). The search string is generally repeated more often in the result table (another general observation). Hence I need to extract the table that has the most <tr> tags, also taking the search string into account. My output is to present the whole table as an HTML page again. So I need that table (if two tables match exactly, then both are to be presented) in an HTML file. As far as I know, a DOM would be sufficient for this. But I am very new to the use of DOM. Any help regarding this would be greatly appreciated.
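
The requirement described above (keep the table with the most <tr> rows that also contains the search string in its <td> cells, present ties together, and write the result as an HTML page) might be sketched with the standard library alone, as below. The ranking rule is an assumption, since the thread never settles it: tables with no keyword hit in a <td> are skipped, the rest are ranked by <tr> count and then by keyword count, and tables tied on both are all kept. The class and method names are hypothetical, tables are assumed not to be nested, every <td> is assumed to have a closing </td>, and the saved page is assumed to be UTF-8. Keyword hits inside markup within a cell, such as an href value, are counted too.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class for this sketch; not from any library mentioned in the thread.
class ResultTablePicker {
    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;
    // Assumption: tables are not nested and every <td> has a closing </td>.
    private static final Pattern TABLE = Pattern.compile("<table\\b[^>]*>.*?</table>", FLAGS);
    private static final Pattern ROW = Pattern.compile("<tr\\b[^>]*>", FLAGS);
    private static final Pattern CELL = Pattern.compile("<td\\b[^>]*>(.*?)</td>", FLAGS);

    /** Counts non-overlapping matches of the pattern in the text. */
    static int count(Pattern p, String text) {
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    /** Counts case-insensitive keyword occurrences inside the <td> cells of one table. */
    static int keywordCount(String table, String keyword) {
        Pattern word = Pattern.compile(Pattern.quote(keyword), Pattern.CASE_INSENSITIVE);
        Matcher cells = CELL.matcher(table);
        int n = 0;
        while (cells.find()) n += count(word, cells.group(1));
        return n;
    }

    /** Returns the best table, or several if they tie on row count and keyword count. */
    static List<String> pick(String html, String keyword) {
        List<String> best = new ArrayList<>();
        int bestRows = -1, bestHits = -1;
        Matcher m = TABLE.matcher(html);
        while (m.find()) {
            String table = m.group();
            int hits = keywordCount(table, keyword);
            if (hits == 0) continue;              // keyword must appear in some <td>
            int rows = count(ROW, table);
            if (rows > bestRows || (rows == bestRows && hits > bestHits)) {
                best.clear();                     // strictly better: drop earlier candidates
                bestRows = rows;
                bestHits = hits;
            }
            if (rows == bestRows && hits == bestHits) best.add(table);
        }
        return best;
    }

    /** Wraps the selected tables in a minimal HTML page. */
    static String toHtmlPage(List<String> tables) {
        StringBuilder sb = new StringBuilder("<html><body>\n");
        for (String t : tables) sb.append(t).append('\n');
        return sb.append("</body></html>\n").toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: java ResultTablePicker <saved-page> <keyword> <output.html>");
            return;
        }
        // Assumption: the saved page is UTF-8 encoded.
        String html = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        List<String> best = pick(html, args[1]);
        Files.write(Paths.get(args[2]), toHtmlPage(best).getBytes(StandardCharsets.UTF_8));
    }
}
```

Swapping the regular expressions for a DOM built by an HTML parser would handle nested tables and unclosed cells better; the selection logic in pick would stay much the same.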

Clarification of Question by naveen82-ga on 15 Nov 2005 07:41 PST
One important thing I didn't emphasize is that I need the code in Java. Any parser built in Java can be used, I guess.

Clarification of Question by naveen82-ga on 24 Nov 2005 19:06 PST
Hi, I was just curious about my question. Can anyone please respond to it? The comment received from larkas was great! I went through it, but what I understood was that it is for XML. I don't want to do something like converting from HTML to XML and then back to HTML again! Does anyone know how HTML parsing is done to search for a keyword in an HTML page? Thanks in advance! Any help would be really appreciated!

Answer

There is no answer at this time.

Comments

Subject: Re: parsing in html using java
From: larkas-ga on 17 Nov 2005 00:17 PST

Take a look at this article:

http://www-128.ibm.com/developerworks/java/library/j-jtp03225.html?ca=sgr-lnxw07XQuery

It shows how to screen-scrape HTML using XQuery. It may be overkill
for what you want to do, but it has sample code that will work fine
and has plenty of room to grow with you.
