Google Answers: Search engine software

Q: Search engine software ( No Answer, 1 Comment )

Question

Subject: Search engine software
Category: Computers > Programming
Asked by: tjsnodgrass-ga
List Price: $25.00

Posted: 03 Aug 2003 13:57 PDT
Expires: 02 Sep 2003 13:57 PDT
Question ID: 238557

I have a question for a person with experience implementing intranet search engine software. I am assembling a very large collection of government documents, all of which are in PDF format. (Some are text-searchable; others are scanned images that will be converted to text-searchable format by OCR.) Each document will have a limited amount of additional information associated with it that in some cases is not contained in the document itself (for instance, date authored, agency which created it, etc.). I would like to implement a robust, professional, configurable, scalable (tens of thousands of documents) and fast indexed search engine capable of doing complex searches on this data. I would like users to have the choice between "natural language" keyword searching and complex Boolean searches. Most important is a thorough proximity operator (for instance, "farmer within 15 words of barn"). Finally, I need users to be able to limit searches by the associated information (for instance, "no documents created more than a year ago"), which I presume would mean somehow integrating the PDF files with a SQL database or some such thing. I would like to know what strategies an experienced person would recommend, including server types, programming languages, and any off-the-shelf products that could be configured to do what I describe. Would freelance programmers be good for this? No need for excessive detail, just looking for some initial thoughts.
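To make the two core requirements above concrete, here is a minimal Python sketch of a positional inverted index that supports a "within N words" proximity check plus a filter on per-document metadata. It is only an illustration under stated assumptions, not a recommendation of any product: the class name `PositionalIndex`, its methods, and the `authored` and `agency` metadata fields are all hypothetical, the text is assumed to have already been extracted from the PDFs, and a real engine would use far more efficient posting-list merging.

```python
import re
from collections import defaultdict
from datetime import date

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return TOKEN_RE.findall(text.lower())


class PositionalIndex:
    """Hypothetical in-memory index: term -> doc_id -> list of word positions."""

    def __init__(self):
        self.postings = defaultdict(lambda: defaultdict(list))
        self.metadata = {}  # doc_id -> dict of associated information

    def add(self, doc_id, text, **metadata):
        """Index already-extracted document text and store its metadata."""
        for position, term in enumerate(tokenize(text)):
            self.postings[term][doc_id].append(position)
        self.metadata[doc_id] = metadata

    def within(self, term_a, term_b, distance):
        """Return doc ids where term_a occurs within `distance` words of term_b."""
        a_docs = self.postings.get(term_a.lower(), {})
        b_docs = self.postings.get(term_b.lower(), {})
        hits = set()
        for doc_id in a_docs.keys() & b_docs.keys():
            # Brute-force comparison of positions; fine for a sketch.
            if any(abs(pa - pb) <= distance
                   for pa in a_docs[doc_id]
                   for pb in b_docs[doc_id]):
                hits.add(doc_id)
        return hits

    def authored_since(self, doc_ids, cutoff):
        """Keep only documents whose (assumed) 'authored' date is on or after cutoff."""
        return {
            doc_id for doc_id in doc_ids
            if self.metadata[doc_id].get("authored") is not None
            and self.metadata[doc_id]["authored"] >= cutoff
        }


if __name__ == "__main__":
    index = PositionalIndex()
    index.add("doc1", "The farmer walked slowly to the old red barn.",
              authored=date(2003, 6, 1), agency="USDA")
    index.add("doc2", "farmer " + "filler " * 20 + "barn",
              authored=date(2003, 7, 1), agency="USDA")
    index.add("doc3", "A barn stood near the farmer.",
              authored=date(2001, 1, 15), agency="EPA")

    nearby = index.within("farmer", "barn", 15)
    recent = index.authored_since(nearby, date(2002, 8, 3))
    print(sorted(recent))
```

Storing word positions alongside each posting is what makes the proximity operator possible; the metadata filter is applied to the proximity hits afterwards, which is roughly the role the SQL database would play in the setup described above.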
Clarification of Question by tjsnodgrass-ga on 03 Aug 2003 13:59 PDT
I forgot one thing, not sure if it's relevant. When the search results are displayed, I want snippets of text showing how the keywords appear in the file. Thanks!!!
Request for Question Clarification by joseleon-ga on 04 Aug 2003 01:52 PDT
Hello: Do you want a customized solution or an already existing product? Regards.
Clarification of Question by tjsnodgrass-ga on 04 Aug 2003 07:58 PDT
I presume there are no existing products that would provide all the functions I describe, so I assume some degree of customization would be required. I'm not awash in funds, however, so it's not like I can afford to pay $100k for a Google Search Appliance license.
Request for Question Clarification by joseleon-ga on 04 Aug 2003 08:35 PDT
Hello: A good starting point for what you want is mnoGoSearch (http://www.mnogosearch.org). Check out the feature list at http://search.mnogo.ru/features.html. We use it at my current job (intranet level). I have made some small customizations to make it work better in our environment, and we are very happy with it. It's open source (GPL), so you can get developers to add the features you want or even pay the authors to create a customized version for you. What do you think? Regards.
Clarification of Question by tjsnodgrass-ga on 05 Aug 2003 11:31 PDT
Yeah, I've already considered mnoGoSearch. It just seems to lack some of the key features I'm looking for. I was kinda hoping I'd find someone who knows whether or not something like that is easily customizable to my description. Oh well, thanks for taking the time to offer your suggestions!!
Request for Question Clarification by rhansenne-ga on 14 Aug 2003 02:00 PDT
Hi tjsnodgrass-ga, Have you tried Jakarta Lucene (http://jakarta.apache.org/lucene)? It's extremely fast, flexible, modular and robust, and many third-party plugins exist. It can do proximity-based, fuzzy, wildcard, Boolean and weighted searches. I've used it successfully in a number of projects. If you're into Java programming and don't want to spend a fortune on a commercial engine, you might want to check it out. Kind regards, rhansenne-ga.

Answer

There is no answer at this time.

Comments

Subject: Re: Search engine software
From: tuhadasevadar-ga on 11 Aug 2003 19:01 PDT

I suggest http://www.aspseek.org/

aspseek is very, very fast. It has most of the "regular" features you
are looking for, for example searching documents by date and showing
how the text appears in the document. It might not have a complex
natural language system, but I am amazed by its relevance. Users can
also adjust the relevance settings if needed.

For your PDF documents, there is a plugin that will go through a PDF
document and pick up text while indexing.

Since it is client/server based, you can make your client as smart as
you want it to be. You don't have to use s.cgi (the client that
connects to the server). You can ask someone to modify it, modify the
PHP client, or even write one in Java or ASP.NET or whatever you
want.

I think the flexibility, power, and features offered by aspseek are
very close to what you want.

Hope that helps.
Paul

