I have a list of a few hundred thousand URLs in an Access database
(but I guess I could put them in any common format).
I need a piece of software that can
1) Look at the list of URLs
2) Go to each one and search the entire site for a certain keyword
3) Return a new list of the URLs that contain that keyword
For example, for a list of 5 sites:
http://www.nasa.com
http://www.mickeymouse.com
http://www.londonzoo.com
http://www.americanidol.com
http://www.johndoe.com
And a keyword "space shuttle"
The results would be something like:
http://www.nasa.com/spaceshuttle/page1.html
http://www.nasa.com/spaceshuttle/page2.html
http://www.nasa.com/spaceshuttle/page3.html
http://www.nasa.com/spaceshuttle/page4.html
http://www.nasa.com/spaceshuttle/page5.html
http://www.mickeymouse.com/spaceshuttle/index.html
...because they are pages that contain the phrase "space shuttle"
somewhere on the page. Looking at file names or meta-tags is not
enough - the program needs to search entire pages.
The program needs to use its own spidering - we don't want a
metasearch product or one that relies on search engines like Google or
Yahoo.
The spider needs to follow every internal link from the one given, and
check the entire site. So, if we gave it nasa.com, it would follow
links from the nasa.com home page, and follow links of the resulting
pages and so on until it had looked at the entire navigable site -
thousands of pages.
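To make the spidering requirement concrete, here is a minimal Python sketch of the behaviour described above. It is an illustration under stated assumptions, not a finished tool: the names `crawl_site`, `PageParser`, `fetch` and the `max_pages` cap are hypothetical, it treats "internal" as "same host name as the starting URL", it does a simple case-insensitive phrase match on page text (script and style text is not filtered out), and it ignores robots.txt, politeness delays and JavaScript-generated links. Pages are checked in memory and discarded, so nothing is copied to disk.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects href values and text fragments from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text.append(data)


def fetch(url):
    """Default fetcher (assumption): page body as text, or "" on error."""
    try:
        with urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return ""


def crawl_site(start_url, phrase, fetch_page=fetch, max_pages=10000):
    """Breadth-first crawl of one host; returns URLs whose text has the phrase."""
    host = urlparse(start_url).netloc.lower()
    wanted = " ".join(phrase.lower().split())
    queue = deque([start_url])
    seen = {start_url}
    hits = []
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        fetched += 1
        parser = PageParser()
        parser.feed(fetch_page(url))
        # Collapse whitespace so "Space   Shuttle" still matches the phrase.
        page_text = " ".join(" ".join(parser.text).split()).lower()
        if wanted in page_text:
            hits.append(url)
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # absolute, no #fragment
            parts = urlparse(link)
            if parts.scheme not in ("http", "https"):
                continue
            # Follow internal links only: same host as the starting URL.
            if parts.netloc.lower() == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return hits
```

Calling `crawl_site("http://www.nasa.com", "space shuttle")` would return the list of matching page URLs for that one site; a real product would also need error reporting, rate limiting and a decision on whether `www.` and non-`www.` hosts count as the same site.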
If it has Boolean-type features, such as searching for "space shuttle"
but not "discovery", that would be preferable.
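As a rough illustration of the Boolean-type matching meant here, the following Python sketch accepts a page only if it contains every required phrase and none of the excluded ones. The function name `matches` and its parameters are hypothetical, and the matching is plain case-insensitive substring search, so "discovery" would also match inside "rediscovery".

```python
def matches(text, required, excluded=()):
    """True if text has every required phrase and no excluded phrase."""
    normalized = " ".join(text.lower().split())

    def has(phrase):
        # Normalize the phrase the same way as the page text.
        return " ".join(phrase.lower().split()) in normalized

    return all(has(p) for p in required) and not any(has(p) for p in excluded)
```

For example, `matches(page_text, ["space shuttle"], ["discovery"])` expresses the search "space shuttle" but not "discovery".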
If a program can do the job but several hundred thousand URLs is too
much for it, something that can do blocks of 1,000 URLs would be ok.
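Splitting the list into blocks of 1,000 is straightforward to sketch in Python. This assumes the URLs have been exported from the database into any iterable (for instance, one URL per line in a text file); the helper name `batches` is hypothetical.

```python
from itertools import islice


def batches(urls, size=1000):
    """Yield successive lists of up to `size` URLs from any iterable."""
    it = iter(urls)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block
```

Each block can then be handed to the spider in turn, so a failure partway through only requires rerunning the affected block.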
We have found two programs that almost do what we want:
VelocityScape WebScraper Plus
http://www.velocityscape.com/
This looks like it will do what we require, but we are unable to get
the evaluation version to run, and are waiting to hear from their
support people.
NetTools Spider
http://www.questronixsoftware.com/Products/NetToolsSpider/NetToolsSpider.aspx
We got the evaluation version to work, but we could not get it to do
most of the things it should.
Please use the above as examples of the type of product we are looking for.
We are not after help to get those 2 programs running.
We do not need to copy entire sites to our hard drive, and if the
keyword search aspect requires entire sites to be downloaded and
stored, we are not interested. We believe this would require several
terabytes and wouldn't be cost effective.
Price range of the product - free to $10,000
|
Request for Question Clarification by tox-ga on 29 Jul 2005 19:13 PDT
Dear Joel,
I've identified a reputable and reliable firm through recommendations
and research that specializes in building custom software using web
spidering, search, and archiving technologies.
The firm is able to create software that can do everything
you ask for, with a customized interface, and is guaranteed to run
perfectly on the platform of your choice. They also have
award-winning post-sales support in case any further questions arise.
While I am not allowed to publish the estimated price in a public
forum, I am able to say that the software is within your budget.
If I am able to refer you to such a firm and reserve a project spot
for immediate start, would it be acceptable as an answer?
Cheers,
Tox-ga
|
Clarification of Question by joel1357-ga on 30 Jul 2005 01:24 PDT
Tox,
We noticed that the question was locked up for an extended period of
time, which we will assume means that you spent quite a bit of time
trying to secure the answer we were looking for. We spent at least 50
hours trying to figure this out on our own and in the meantime believe
we have identified 2 companies that can build the tool as well. Our
willingness to pay $400 for the answer was based on finding an
out-of-the-box solution that we could begin using immediately. We would
like to hear the name of the company that you came up with, so that we
have another resource to add to our list of potential candidates for
building this tool, but we cannot pay $400. Your solution will still
require many hours of hard work before we are able to use the tool,
and we won't necessarily use the company that you provide to build our
tool. With that in mind, we propose to cancel this question and ask
another question, in the amount of $50, for only you to answer. If that
is acceptable, once you post the name of the company we will contact
them to see if they would be the ones that we want to build our tool.
If we select this company we would then add a tip to that question in
the amount of $100. If this is acceptable let me know.
Joel
|
Request for Question Clarification by tox-ga on 30 Jul 2005 02:42 PDT
Dear Joel,
Yes, that would be acceptable.
Since my last request for clarification, I've been researching
further, comparing the different alternatives in more detail, as I'd
like to be certain that the option you go with is the best possible
one to meet your unique needs. I've also consulted with
professionals/experts in the area and professors at technical
colleges, and taking all the given factors into consideration, I am
confident that the custom software company in question is the best
solution.
Also, with the company that I am speaking of, you will be able to
utilize the software within a matter of days, with a similar
amount of effort to implementing an out-of-the-box solution.
I will unlock this question now so that you can cancel the question.
Also, on the new question, could you post the names of the two
companies that you've found, just to make sure that they do not overlap
with mine?
Cheers,
Tox-ga
|
Request for Question Clarification by tox-ga on 30 Jul 2005 03:07 PDT
I forgot to mention in my other post that the company also has
ready-made software available that can do what you ask, but
because it was custom made for a different company, it will be a lot
more expensive than other out-of-the-box software due to licensing
issues (almost the same price as custom software).
The price is still within the budget you indicated, but for the same
price I, personally, would have had the software custom made. However,
it seems that time and effort are very critical for you at the moment, so
I will mention it now.
Would you like me to post the answer in this question, or in a new question?
Cheers,
Tox-ga
|
Clarification of Question by joel1357-ga on 30 Jul 2005 03:43 PDT
Tox,
I've posted a new question per our agreement. Please provide the answer there.
Thanks
|