Q: Extracting text from multiple web pages: Sources of Source Code Wanted (No Answer, 4 Comments)
Question  
Subject: Extracting text from multiple web pages: Sources of Source Code Wanted
Category: Computers > Programming
Asked by: parallel54-ga
List Price: $35.00
Posted: 18 Oct 2004 14:18 PDT
Expires: 17 Nov 2004 13:18 PST
Question ID: 416630
We are seeking to extract articles from a number of media sites on the
web, so we want a tool that strips out the HTML and gives us the
headline and article text as a text or XML file. We want code that
does this for large numbers of sites, as we don't want to have to
code extraction rules individually for each site. I know there are quite a few
companies that offer this as a service, but we really want actual
source code so we can amend it and integrate it into a larger
application. Ideally there would be open source projects that we can
tap into for this. Happy to provide more details if required.
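For a rough idea of what a site-agnostic extractor involves, here is a minimal Python sketch that treats the first <h1> as the headline and the page's <p> blocks as the article body. That heuristic is an assumption, not a rule that holds on every site, and a real tool would need stronger content-detection logic:

```python
# Heuristic, site-agnostic article extraction using only the standard
# library: first <h1> = headline, concatenated <p> blocks = body.
from html.parser import HTMLParser

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.headline = None
        self.paragraphs = []
        self._in_h1 = False
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and self.headline is None:
            self._in_h1 = True
            self._buf = []
        elif tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self.headline = "".join(self._buf).strip()
            self._in_h1 = False
        elif tag == "p" and self._in_p:
            text = "".join(self._buf).strip()
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        # Collect text only while inside the headline or a paragraph.
        if self._in_h1 or self._in_p:
            self._buf.append(data)

def extract_article(html):
    parser = ArticleExtractor()
    parser.feed(html)
    return parser.headline, "\n\n".join(parser.paragraphs)

page = """<html><body><h1>Example Headline</h1>
<p>First paragraph of the article.</p>
<p>Second paragraph.</p></body></html>"""
headline, body = extract_article(page)
print(headline)   # Example Headline
```

The hard part, as the discussion in the comments shows, is that the "first <h1>, all <p> tags" assumption breaks on pages with navigation text, sidebars, or non-standard markup, which is exactly why per-site rules are usually needed.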
Answer  
There is no answer at this time.

Comments  
Subject: Re: Extracting text from multiple web pages: Sources of Source Code Wanted
From: mrd3nny-ga on 18 Oct 2004 15:30 PDT
 
I found several projects on SourceForge that look like what you might
want.  One I found that looked good was
http://sourceforge.net/projects/ipoddernet/.  It's written in C#, so
it should be easily modifiable for your needs.

What you are looking for is called an RSS reader.  Most news sites
offer RSS feeds.  Search SourceForge for "RSS" for other options.

Denny
Subject: Re: Extracting text from multiple web pages: Sources of Source Code Wanted
From: parallel54-ga on 18 Oct 2004 20:06 PDT
 
Thanks, but no, we don't need an RSS reader. We are looking to extract
data in a structured format (like RSS) but from sites that do not use
RSS. They simply publish their articles on the web, usually via a
content management system, but to varying standards. That's the
tricky bit: the varying standards.
Subject: Re: Extracting text from multiple web pages: Sources of Source Code Wanted
From: samudbhava-ga on 26 Oct 2004 06:07 PDT
 
I would suggest you divide the work between different applications.
For example, use HTTrack (open source/free) to download the websites,
then find text-processing tools (or roll your own in Perl or
Python) that intelligently filter out the HTML and give you
nearly (nearly) structured text.  Then pay someone to do the
remaining clean-up, and keep optimising the filter until you reach an
equilibrium between the tool and the person's job.
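The filtering stage of that pipeline could be roughed out in Python's standard library alone. This sketch only shows the tag-stripping filter that produces "nearly structured" text; the preceding download step (HTTrack mirroring pages to disk) and the directory walk over its output are assumed and left out:

```python
# Strip HTML down to "nearly structured" text: drop script/style
# content, insert line breaks at block-level tags, collapse whitespace.
import re
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "br", "li"}

class TagStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = False  # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def html_to_text(html):
    stripper = TagStripper()
    stripper.feed(html)
    text = "".join(stripper.parts)
    # Collapse runs of spaces/tabs, squeeze blank-line runs to one break.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    return text.strip()

print(html_to_text("<div><h1>Title</h1><p>Body text.</p></div>"))
```

Pointing this at every file HTTrack saved would give the "nearly structured" output described above, ready for the manual clean-up pass.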
Subject: Re: Extracting text from multiple web pages: Sources of Source Code Wanted
From: psethi-ga on 15 Nov 2004 16:27 PST
 
Hi,

I can provide you with a script that reads URLs from a text file,
fetches the web page at each URL, parses the HTML, and returns
the content that you are looking for from the page.

One way to do it would be to have one or more regular expressions
alongside each URL in the text file, match those regular expressions
against the fetched page, and output whatever is matched.

Let me know if you need more clarification or if you are looking for
something else.

Thanks,
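The script itself was never posted, but the approach described could look something like the following sketch. The rules-file format (URL and regex separated by a tab) and the helper names are illustrative assumptions; the fetch step is stubbed out in the demo so the logic can run without network access:

```python
# Per-URL regex extraction: each rules line holds "URL<TAB>pattern";
# the pattern's first capture group is what gets extracted.
import re
from urllib.request import urlopen

def load_rules(lines):
    """Parse 'URL<TAB>regex' lines into (url, compiled pattern) pairs."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        url, pattern = line.split("\t", 1)
        rules.append((url, re.compile(pattern, re.DOTALL)))
    return rules

def extract(rules, fetch=lambda url: urlopen(url).read().decode()):
    """Fetch each URL and yield what its pattern's first group matches."""
    for url, pattern in rules:
        page = fetch(url)
        match = pattern.search(page)
        yield url, match.group(1) if match else None

# Demo with a stubbed fetcher in place of a live HTTP request:
rules = load_rules(["http://example.com/a\t<h1>(.*?)</h1>"])
fake_pages = {"http://example.com/a": "<html><h1>Hello</h1></html>"}
for url, text in extract(rules, fetch=fake_pages.get):
    print(url, "->", text)   # http://example.com/a -> Hello
```

Note that this still amounts to one regex per site, which is the per-site-rules burden the questioner is trying to avoid; it works best for a modest, known set of sources.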
