I have a question for a person with experience implementing intranet
search engine software.
I am assembling a very large collection of government documents, all
of which are in PDF format. (Some are text-searchable; others are
scanned images that will be made text-searchable by OCR.)
Each document will have a limited amount of additional information
associated with it that in some cases is not contained in the document
itself (for instance, date authored, agency which created it, etc.).
I would like to implement a robust, professional, configurable,
scalable (tens of thousands of documents) and *fast* indexed search
engine capable of doing complex searches on this data. I would like
users to have the choice between "natural language" keyword searching
and complex boolean searches. Most important is a thorough proximity
operator (for instance, "farmer within 15 words of barn"). Finally, I
need users to be able to limit searches by the associated information
(for instance, "no documents created more than a year ago"), which I
presume would mean somehow integrating the PDF files with a SQL
database or some such thing.
I would like to know what strategies an experienced person would
recommend, including server types, programming languages, and any
off-the-shelf products that could be configured to do what I describe.
Would freelance programmers be good for this?
No need for excessive detail; I am just looking for some initial thoughts.
Request for Question Clarification by rhansenne-ga on 14 Aug 2003 02:00 PDT
Hi tjsnodgrass-ga,
Have you tried Jakarta Lucene (http://jakarta.apache.org/lucene)? It's
extremely fast, flexible, modular and robust, and many third-party
plugins exist for it. It can do proximity-based, fuzzy, wildcard,
boolean and weighted searches. I've used it successfully in a number
of projects.
If you're into Java programming and don't want to spend a fortune on a
commercial engine, you might want to check it out.
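To make the proximity and metadata requirements concrete, here is a minimal sketch in Python of the underlying idea: a positional index that records where each word occurs, so "farmer within 15 words of barn" becomes a comparison of word positions, with an optional cut-off on a per-document creation date. This is not Lucene's implementation, and every name in it (PositionalIndex, add, within, created, created_after) is hypothetical and chosen only for illustration. In Lucene itself, the query parser writes a search of this kind as a phrase with a slop value, e.g. "farmer barn"~15.

```python
import re
from collections import defaultdict
from datetime import date


class PositionalIndex:
    """Toy positional inverted index (illustrative only, not Lucene)."""

    def __init__(self):
        # term -> doc_id -> list of word positions where the term occurs
        self.postings = defaultdict(lambda: defaultdict(list))
        # doc_id -> metadata that is not part of the document text
        self.metadata = {}

    def add(self, doc_id, text, **meta):
        """Index the text of one document and store its metadata."""
        self.metadata[doc_id] = meta
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for position, token in enumerate(tokens):
            self.postings[token][doc_id].append(position)

    def within(self, term_a, term_b, max_distance, created_after=None):
        """Return ids of documents where term_a and term_b occur at most
        max_distance word positions apart, optionally keeping only
        documents whose 'created' date is on or after created_after."""
        docs_a = self.postings.get(term_a.lower(), {})
        docs_b = self.postings.get(term_b.lower(), {})
        hits = []
        for doc_id in sorted(docs_a.keys() & docs_b.keys()):
            created = self.metadata[doc_id].get("created")
            if created_after is not None and (
                created is None or created < created_after
            ):
                continue  # filtered out by the metadata restriction
            if any(
                abs(pos_a - pos_b) <= max_distance
                for pos_a in docs_a[doc_id]
                for pos_b in docs_b[doc_id]
            ):
                hits.append(doc_id)
        return hits


# Hypothetical sample documents, with the text already extracted from the PDFs.
index = PositionalIndex()
index.add("doc1", "The farmer walked to the barn.", created=date(2003, 6, 1))
index.add("doc2", "The farmer " + "word " * 20 + "barn", created=date(2003, 7, 1))
index.add("doc3", "A barn owned by a farmer", created=date(2001, 1, 1))

all_hits = index.within("farmer", "barn", 15)
recent_hits = index.within("farmer", "barn", 15, created_after=date(2002, 8, 14))
```

A real engine stores these postings on disk and compresses them, but the principle is the same: keeping the metadata alongside the index lets the date restriction be applied in the same query, without a separate round trip to a SQL database.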
Kind regards,
rhansenne-ga.