Q: Keyword filtration ( Answered,   0 Comments )
Question  
Subject: Keyword filtration
Category: Miscellaneous
Asked by: cre8tive-ga
List Price: $100.00
Posted: 18 Nov 2005 03:24 PST
Expires: 18 Dec 2005 03:24 PST
Question ID: 594623
We have a list of keywords that people search for through search engines.
One keyword per line.
There are many keywords that include trademarked terms such as
"microsoft office" or "britney spears" that we want to eliminate. We
also want to eliminate all made-up or nonsense keywords such as
"yellow cat".
Our goal is to end up with only those keywords that are generic and
have a meaning, such as "search engine", "black cat", "phone service".
We are looking for criteria on which we can base software
that will filter the list.
Thank you.

Request for Question Clarification by pafalafa-ga on 18 Nov 2005 04:26 PST
Hmmm.  This seems like one of those tasks that a human can do easily,
but would be pretty well-nigh impossible for a computer.

Sort of like face recognition.

How large is the list?

Clarification of Question by cre8tive-ga on 18 Nov 2005 05:06 PST
Hi,
The list contains over 2 million entries.
It's going to take forever for a human to read and analyze the list.
I'm looking for some useful criteria that would at least considerably
shorten the list by eliminating some trademarked and nonsense keywords.
Answer  
Subject: Re: Keyword filtration
Answered By: webadept-ga on 18 Nov 2005 08:11 PST
 
Hi,

You are looking for criteria for cleaning phrase lists. I noticed that
you said keywords, but what you show are key phrases, one phrase per
line.

What we need here are black lists, gray lists and some white
lists. What does confuse me is your example of "Yellow Cat" being
nonsense but "Black Cat" being okay, when "yellow cat" is a perfectly
good search term. That is up to you, but you are going to need
to figure out the criteria for what you believe is nonsense and what
you believe isn't before you can filter on it with a program.

[ ://www.google.com/search?q=site:wikipedia.org&q=%22Yellow+Cat ]

Your first criteria seem pretty easy: names of famous people and
company trademarks. By using databases available on the web you
should be able to filter these out, and by using "LIKE" statements in
the database you can even catch them when they are hidden in longer key
phrases. (Yellow Cat, by the way, is also a company in the UK and is
registered as a trademark name.)
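
The "LIKE" matching mentioned above can be sketched as follows. This is
only a minimal illustration in Python with an in-memory SQLite database;
the table names, the function name find_blacklisted and the sample data
are assumptions made for the example, not part of any existing system.

```python
import sqlite3

def find_blacklisted(phrases, blacklist):
    """Return the phrases that contain any black-listed term as a substring."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE phrases (phrase TEXT)")
    conn.execute("CREATE TABLE blacklist (term TEXT)")
    conn.executemany("INSERT INTO phrases VALUES (?)", [(p,) for p in phrases])
    conn.executemany("INSERT INTO blacklist VALUES (?)", [(t,) for t in blacklist])
    # '%' || term || '%' matches the term anywhere inside the phrase.
    # SQLite's LIKE ignores case for ASCII letters by default.
    rows = conn.execute(
        "SELECT DISTINCT p.phrase FROM phrases p "
        "JOIN blacklist b ON p.phrase LIKE '%' || b.term || '%'"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]

hits = find_blacklisted(
    ["buy microsoft office cheap", "black cat", "Britney Spears tickets"],
    ["microsoft office", "britney spears"],
)
print(hits)
```

Note that plain substring matching also flags a short term inside a longer
word ("tool" inside "toolbox"), and that "%" or "_" inside a term act as
wildcards, so a real list would probably need word-boundary handling.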

The UK Patent Office [ http://webdb4.patent.gov.uk/tm/text ]
The US Trademark Office [http://www.uspto.gov/main/profiles/acadres.htm ] 

It is a simple matter to work out a Perl script that can go through
these two online databases and query the list of names you have, while
building your own black list file so that you don't need to do the
online search again.
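
A rough sketch of that caching idea, written in Python here rather than
Perl. The lookup argument is a hypothetical stand-in for whatever query
you run against an online database (no real service is called), and the
JSON cache file format is an assumption. Every result is stored, hit or
miss, so no phrase is looked up twice.

```python
import json
import os

def build_blacklist(phrases, cache_path, lookup):
    """Check each phrase once via lookup() and remember the result on disk."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as fh:
            cache = json.load(fh)  # results from earlier runs
    for phrase in phrases:
        key = phrase.strip().lower()
        if key and key not in cache:
            # Only phrases never seen before trigger a (slow) lookup.
            cache[key] = bool(lookup(key))
    with open(cache_path, "w", encoding="utf-8") as fh:
        json.dump(cache, fh, indent=2, sort_keys=True)
    # The black list is every cached phrase whose lookup came back positive.
    return {key for key, hit in cache.items() if hit}
```

With 2 million entries the lookups would still need to be throttled and
run in batches; the sketch only shows the bookkeeping.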

The next one you have here is Famous names, and again there are
several resources on the Internet to build these lists from.

Cyndi's List - Famous People [http://www.cyndislist.com/famous.htm ]
Text file of Famous Scientology/Celebrities
[http://home.snafu.de/tilman/faq-you/celeb.txt ]

[ http://dir.yahoo.com/Entertainment/music/artists/ ]


There are plenty of lists out there that can be built into your black lists. 

Gray lists are what I think of as phrases which might be okay. "Tool",
for example, is the name of a rock band. I don't want every phrase with
the word "Tool" in it removed, however. It is a pretty common word.
This is where dictionaries of white words come in, for example
[http://java.sun.com/docs/books/tutorial/collections/interfaces/ex5/dictionary.txt ]
and
[http://research.yale.edu/swahili/software/Glossary/swahili/IT_glossary_sw.csv ].

These are very easy to find. Just search on Google for 
dictionary filetype:txt  
or
dictionary filetype:csv 

and you will come up with several. So, back to a gray list, which is
the combination of a black list word with a white list word. "Tool" on
its own is not okay (no empty fields in the white word lists), "Tool
Madonna" is also not okay (two black list words). "Acorn Tool" is fine
(white list word with black list word or phrase).
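
One possible reading of that rule as code, in Python. The function name
classify, the three labels and the tiny word sets are assumptions for
illustration; a word that appears on both lists (like "tool") is treated
as black here, so it cannot vouch for itself.

```python
def classify(phrase, black_words, white_words):
    """Label a phrase 'keep', 'reject' or 'unknown' using the two word lists."""
    words = phrase.lower().split()
    if not words:
        return "reject"
    black = [w for w in words if w in black_words]
    # A word only counts as white if it is not also black-listed.
    white = [w for w in words if w in white_words and w not in black_words]
    if not black:
        # No black-listed word: keep only if every word is a known white word.
        return "keep" if len(white) == len(words) else "unknown"
    # Gray case: a black-listed word is acceptable next to a white word.
    return "keep" if white else "reject"

black = {"tool", "madonna"}
white = {"tool", "acorn", "black", "cat"}
for p in ["Tool", "Tool Madonna", "Acorn Tool", "black cat"]:
    print(p, "->", classify(p, black, white))
```

Taken literally, this also keeps a phrase like "Acorn Madonna", so you
may want a stricter rule for names that have no dictionary meaning.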

Nonsense words, as I would look at them, are word phrases that aren't
used in normal literature or speech. As I showed before, one person's
nonsense word can be another's company logo or literary descriptive
term. If most of these key phrases are two to three words long, we can
build white lists of common word combinations. Doing this requires a
bit of time and some scripting, but can be effective.

I would start with some of the blogs out there which are written by
people who use good language skills. For example [
http://www.searchengine-weblog.com/  ]. Rip the text from these
pages, remove words of 3 characters or fewer (replacing them
with marks that indicate their existence), and then slice two- and
three-word phrases out of the remaining words to build phrase check
lists.
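
The slicing step could look roughly like this in Python. It is a sketch
under assumptions: the marker "_", the function name phrase_counts, and
the choice to let a marker break a phrase (so no phrase spans a removed
short word) are mine, and sentence boundaries are ignored for brevity.

```python
import re
from collections import Counter

MARK = "_"  # stands in for a removed short word

def phrase_counts(text, min_len=4, sizes=(2, 3)):
    """Count two- and three-word phrases built from words of min_len+ letters."""
    words = re.findall(r"[a-z']+", text.lower())
    marked = [w if len(w) >= min_len else MARK for w in words]
    counts = Counter()
    run = []
    for token in marked + [MARK]:  # the trailing mark flushes the last run
        if token != MARK:
            run.append(token)
            continue
        for n in sizes:
            for i in range(len(run) - n + 1):
                counts[" ".join(run[i:i + n])] += 1
        run = []
    return counts

sample = ("The search engine market is growing. Every search engine "
          "company wants more search engine traffic.")
print(phrase_counts(sample).most_common(3))
```

Phrases that show up often across many well-written pages go on the
phrase white list. Keep in mind that with a cutoff of 3 characters,
words such as "cat" are dropped, so min_len may need adjusting.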

Another resource is the Google API, which lets you send queries to Google's
search engine to check for nonsense search phrases.
[://www.google.com/apis/ ]


Pages and Documents of interest

Multiword Expression Filtering for Building Knowledge Maps (PDF file) 
http://acl.ldc.upenn.edu/acl2004/mwe/pdf/venkatsubramanyan.pdf

Text Mining and Web-Based Information
http://filebox.vt.edu/users/wfan/text_mining.html

Text Mining
http://en.wikipedia.org/wiki/Text_mining

Text mining tools take on unstructured data
http://www.computerworld.com/databasetopics/businessintelligence/story/0,10801,93968,00.html



If you find that you need more information regarding this question,
please feel free to use the Clarification Function, and I will get
back to you as soon as I can.

Thanks, 

Webadept-ga