Q: Search engine software (No Answer, 1 Comment)
Question  
Subject: Search engine software
Category: Computers > Programming
Asked by: tjsnodgrass-ga
List Price: $25.00
Posted: 03 Aug 2003 13:57 PDT
Expires: 02 Sep 2003 13:57 PDT
Question ID: 238557
I have a question for a person with experience implementing intranet
search engine software.

I am assembling a very large collection of government documents, all
of which are in the .PDF format.  (Some are text-searchable, others
are scanned images that will be converted to text-searchable format by
OCR conversion.)

Each document will have a limited amount of additional information
associated with it that in some cases is not contained in the document
itself (for instance, date authored, agency which created it, etc.).

I would like to implement a robust, professional, configurable,
scalable (10,000s of documents) and *fast* indexed search engine
capable of doing complex searches on this data.  I would like users to
have the choice between "natural language" keyword searching and
complex Boolean searches.  Most important is a thorough proximity
operator (for instance, "farmer within 15 words of barn").  Finally, I
need the ability for users to be able to limit searches by the
associated information (for instance, "no documents created more than
a year ago"), which I presume would mean somehow integrating the PDF
files with a SQL database or some such thing.
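For readers unfamiliar with how a proximity operator and a metadata limit fit together, here is a minimal sketch, assuming the text has already been extracted from the PDFs. The class `PositionalIndex`, its method `within`, and the `authored` metadata field are hypothetical names chosen for illustration; they do not come from any product mentioned in this thread. The idea is to store the word positions of every term per document, so that "farmer within 15 words of barn" becomes a comparison of position lists, while the associated information is kept beside the index and used as a filter.

```python
import re
from collections import defaultdict
from datetime import date


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class PositionalIndex:
    """Toy in-memory positional index (illustrative only, not production code)."""

    def __init__(self):
        # term -> doc_id -> list of word positions where the term occurs
        self.postings = defaultdict(lambda: defaultdict(list))
        # doc_id -> associated information not contained in the document text
        self.metadata = {}

    def add(self, doc_id, text, **meta):
        self.metadata[doc_id] = meta
        for pos, term in enumerate(tokenize(text)):
            self.postings[term][doc_id].append(pos)

    def within(self, a, b, k, newer_than=None):
        """Return sorted ids of documents where `a` occurs within `k` words of `b`.

        If `newer_than` is given, documents whose hypothetical `authored`
        date is missing or earlier than that date are skipped.
        """
        docs_a = self.postings.get(a.lower(), {})
        docs_b = self.postings.get(b.lower(), {})
        hits = []
        for doc_id in docs_a.keys() & docs_b.keys():
            authored = self.metadata[doc_id].get("authored")
            if newer_than is not None and (authored is None or authored < newer_than):
                continue
            # Brute-force comparison of positions; fine for a sketch.
            if any(abs(i - j) <= k for i in docs_a[doc_id] for j in docs_b[doc_id]):
                hits.append(doc_id)
        return sorted(hits)


if __name__ == "__main__":
    idx = PositionalIndex()
    idx.add("doc1", "The farmer walked to the red barn", authored=date(2003, 6, 1))
    idx.add("doc2", "farmer barn", authored=date(1999, 1, 1))
    print(idx.within("farmer", "barn", 15, newer_than=date(2002, 8, 3)))
```

In a real deployment the metadata would more likely live in a SQL table keyed by document id, with the date limit applied as a query before or after the text search; the in-memory dictionary here only stands in for that.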

I would like to know what strategies an experienced person would
recommend, including server types, programming languages, and any
off-the-shelf products that could be configured to do what I describe.
Would freelance programmers be good for this?

No need for excessive detail, just looking for some initial thoughts.

Clarification of Question by tjsnodgrass-ga on 03 Aug 2003 13:59 PDT
I forgot one thing, not sure if it's relevant.  When the search
results are displayed, I want snippets of text showing how the
keywords appear in the file.  Thanks!!!
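The snippet display described here is often called "keyword in context". A rough sketch of the idea follows, assuming the document text is available as a plain string. The function name `snippets`, the `width` and `max_snippets` parameters, and the asterisk highlighting are all assumptions made for illustration, not features of any particular engine.

```python
import re


def snippets(text, keywords, width=40, max_snippets=3):
    """Return short excerpts of `text` around each keyword match.

    Matches are case-insensitive and whole-word. Each excerpt keeps up to
    `width` characters on either side of the match, marks matched keywords
    with asterisks, and gets "..." where the excerpt was cut from a longer text.
    Excerpts are cut by character count, so they may begin or end mid-word.
    """
    if not keywords:
        return []
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    results = []
    last_end = -1
    for match in pattern.finditer(text):
        if match.start() < last_end:
            continue  # this match is already shown in the previous excerpt
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        fragment = " ".join(text[start:end].split())  # collapse whitespace
        fragment = pattern.sub(lambda m: "*" + m.group(0) + "*", fragment)
        prefix = "..." if start > 0 else ""
        suffix = "..." if end < len(text) else ""
        results.append(prefix + fragment + suffix)
        last_end = end
        if len(results) >= max_snippets:
            break
    return results


if __name__ == "__main__":
    sample = "The farmer kept the hay in the old barn behind the house."
    for line in snippets(sample, ["farmer", "barn"], width=15):
        print(line)
```

An engine that stores word positions in its index can produce such excerpts without rescanning the whole file, but scanning the extracted text as above is the simplest way to see what the feature involves.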

Request for Question Clarification by joseleon-ga on 04 Aug 2003 01:52 PDT
Hello:
  Do you want a customized solution or an already existing product?

Regards.

Clarification of Question by tjsnodgrass-ga on 04 Aug 2003 07:58 PDT
I presume there are no existing products that would provide all the
functions I describe, so I assume some degree of customization would
be required.  I'm not awash in funds, however, so it's not like I can
afford to pay $100k for a Google search appliance license.

Request for Question Clarification by joseleon-ga on 04 Aug 2003 08:35 PDT
Hello:
  A good starting point for what you want is mnoGoSearch:

mnoGoSearch
http://www.mnogosearch.org

Check out the feature list:

Features
http://search.mnogo.ru/features.html  

We use it at my current job (at the intranet level).  I have made some
small customizations to make it work better in our environment, and we
are very happy with it.

It's open source (GPL), so you can get developers to add the features
you want or even pay the authors to create a customized version for
you.

What do you think?

Regards.

Clarification of Question by tjsnodgrass-ga on 05 Aug 2003 11:31 PDT
Yeah, I've already considered mnogosearch -- it just seems to lack
some of the key features I'm looking for.  I was kinda hoping I'd find
someone who knows whether or not something like that is easily
customizable to my description.  Oh well -- thanks for taking the time
to offer your suggestions!!

Request for Question Clarification by rhansenne-ga on 14 Aug 2003 02:00 PDT
Hi tjsnodgrass-ga,

Have you tried Jakarta Lucene (http://jakarta.apache.org/lucene)? It's
extremely fast, flexible, modular and robust, and many third-party
plugins exist. It can do proximity-based, fuzzy, wildcard, Boolean and
weighted searches. I've used it successfully in a number of projects.

If you're into Java programming and don't want to spend a fortune on a
commercial engine, you might want to check it out.

Kind regards,

rhansenne-ga.
Answer  
There is no answer at this time.

Comments  
Subject: Re: Search engine software
From: tuhadasevadar-ga on 11 Aug 2003 19:01 PDT
 
I suggest aspseek: http://www.aspseek.org/

aspseek is very fast. It has most of the "regular" features you are
looking for, for example limiting searches by document date and
showing how the text appears in the document. It might not have a
complex natural language system, but I am amazed by its relevance.
Users can also set the relevance if needed.

For your PDF documents, there is a plugin that will go through a PDF
document and pick up the text while indexing.

Since it is client/server based, you can make your client as smart as
you want it to be. You don't have to use s.cgi (the client that
connects to the server). You can ask someone to modify it, modify the
PHP client, or even write one in Java or ASP.NET or whatever you
want.

I think the flexibility, power and features offered by aspseek are
very close to what you want.

Hope that helps.
Paul
