Subject: Extracting text from multiple web pages: Sources of Source Code Wanted
Category: Computers > Programming
Asked by: parallel54-ga
List Price: $35.00
Posted: 18 Oct 2004 14:18 PDT
Expires: 17 Nov 2004 13:18 PST
Question ID: 416630
We are seeking to extract articles from a number of media sites on the web, so we need a tool that strips out the HTML and gives us the headline and article text as a text or XML file. We want code that works for large numbers of sites, as we don't want to individually code extraction rules for each site. I know there are quite a few companies that offer this as a service, but we really want actual source code so we can amend it and integrate it into a larger application. Ideally there are open source projects we can tap into for this. Happy to provide more details if required.
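For illustration, a minimal sketch of the kind of extractor the question describes, using only the Python standard library. The class and function names here are hypothetical, and real article pages would need smarter heuristics than "all visible text":

```python
from html.parser import HTMLParser


class ArticleText(HTMLParser):
    """Collect the <title> text and the visible body text, skipping scripts and styles."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._stack = []  # open-tag stack, so we know what element the text is inside

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if self._stack and self._stack[-1] == "title":
            self.title += data
        elif not (self._stack and self._stack[-1] in self.SKIP):
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract(html):
    """Return (headline, body_text) for one page of HTML."""
    parser = ArticleText()
    parser.feed(html)
    return parser.title.strip(), " ".join(parser.chunks)
```

The output pair could then be serialized to XML or plain text as the question asks; the hard part, as the later comments note, is that every site's markup differs.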
|
There is no answer at this time. |
|
Subject:
Re: Extracting text from multiple web pages: Sources of Source Code Wanted
From: mrd3nny-ga on 18 Oct 2004 15:30 PDT
I found several projects on SourceForge that look like what you might want. One that looked good was http://sourceforge.net/projects/ipoddernet/. It's written in C#, so it should be easily modifiable for your needs. What you are looking for is called an RSS reader. Most news sites offer RSS feeds. Search SourceForge for "RSS" for other options. Denny
Subject:
Re: Extracting text from multiple web pages: Sources of Source Code Wanted
From: parallel54-ga on 18 Oct 2004 20:06 PDT
Thanks, but no, we don't need an RSS reader. We are looking to extract data in a structured format (like RSS) from sites that do not offer RSS. They simply publish their articles on the web, usually via a content management system, but following varying standards. That's the tricky bit: the varying standards.
Subject:
Re: Extracting text from multiple web pages: Sources of Source Code Wanted
From: samudbhava-ga on 26 Oct 2004 06:07 PDT
I would suggest you divide the work between different applications. For example, use HTTrack (open source/free) for downloading the websites. Then find some text processing tools (or roll your own in Perl or Python) to intelligently filter out the HTML and give you nearly (nearly) structured text. Then pay someone to do the remaining clean-up. Keep optimising the filter till you reach an equilibrium between the tool and the person's job.
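The filter stage of the pipeline suggested above might be sketched in Python like this. The regex-based tag stripping and the assumption that the HTTrack mirror is a directory tree of `.htm`/`.html` files are illustrative, not a tested recipe:

```python
import re
from pathlib import Path


def strip_html(html):
    """Reduce raw HTML to roughly structured plain text."""
    # Drop script/style blocks entirely, then remove remaining tags,
    # then collapse runs of whitespace.
    html = re.sub(r"(?is)<(script|style)[^>]*>.*?</\1>", " ", html)
    html = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", html).strip()


def filter_mirror(root):
    """Yield (path, text) for every HTML file under a downloaded mirror directory."""
    for path in Path(root).rglob("*.htm*"):
        yield path, strip_html(path.read_text(errors="replace"))
```

The output of `filter_mirror` is what would go to the manual clean-up step; each improvement to `strip_html` shrinks that person's job, which is the equilibrium the comment describes.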
Subject:
Re: Extracting text from multiple web pages: Sources of Source Code Wanted
From: psethi-ga on 15 Nov 2004 16:27 PST
Hi, I can provide you with a script that reads URLs from a text file, fetches each web page, parses the HTML, and returns the content you are looking for. One way to do it would be to have one or more regular expressions alongside each URL in the text file, match those expressions against the page, and output whatever matches. Let me know if you need more clarification or if you are looking for something else. Thanks,
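A rough Python sketch of the script this comment describes. The tab-separated rules-file format (URL, then a regex with one capture group) and the function names are assumptions for illustration:

```python
import re
import sys
import urllib.request


def extract_matches(html, pattern):
    """Return all capture-group matches of a regex against a page's HTML."""
    return re.findall(pattern, html, re.DOTALL)


def run(rules_file):
    # Each line of the rules file: URL <TAB> regex with one capture group,
    # e.g.  http://example.com/story.html    <h1>(.*?)</h1>
    with open(rules_file) as f:
        for line in f:
            url, _, pattern = line.rstrip("\n").partition("\t")
            if not pattern:
                continue
            html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
            for match in extract_matches(html, pattern):
                print(match)


if __name__ == "__main__":
    run(sys.argv[1])
```

Note this still requires a per-site regex in the rules file, so it answers the "script" part of the comment but not the asker's wish to avoid per-site extraction rules.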