Hi,
You are looking for criteria for cleaning phrase lists. I noticed that
you said keywords, but what you show are key phrases, one phrase per
line.
What we need here are black lists, gray lists, and some white
lists. What confuses me is your example of "Yellow Cat" being
nonsense but "Black Cat" being okay, when yellow cat is a perfectly
good search term. That is up to you, but you are going to need
to work out the criteria for what you believe is nonsense and what
you believe isn't before you can do so with a program.
[ ://www.google.com/search?q=site:wikipedia.org&q=%22Yellow+Cat ]
Your first criteria seem pretty easy: names of famous people and
company trademarks. By using databases available on the web you
should be able to filter these out, and by using "LIKE" statements in
the database you can even catch them when they are hidden in longer key
phrases. (Yellow Cat, by the way, is also a company in the UK and is
registered as a trademark name.)
The UK Patent Office [ http://webdb4.patent.gov.uk/tm/text ]
The US Trademark Office [http://www.uspto.gov/main/profiles/acadres.htm ]
It is a simple matter to work out a Perl script that can go through
these two online databases and query the list of names you have, while
building your own black list file so that you don't need to do the
online search again.
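
To make that concrete, here is a minimal sketch of the local black list idea. It is in Python rather than Perl purely for illustration, and everything in it is my own assumption: the table name `checked`, the function names, and above all the `lookup` callable, which stands in for whatever query you end up writing against the trademark sites (I am not showing a real interface for them). Each term is looked up once, the answer is stored, and a "LIKE" query then finds stored marks hidden inside longer phrases.

```python
import sqlite3

def open_blacklist(path=":memory:"):
    """Open (or create) the local black list store. Pass a file path to persist it."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checked ("
        "term TEXT PRIMARY KEY, is_mark INTEGER NOT NULL)"
    )
    return conn

def is_blacklisted(conn, term, lookup):
    """Return True if term is a known mark; call lookup only for unseen terms.

    lookup is a hypothetical callable (term -> bool) wrapping the online search.
    """
    key = " ".join(term.lower().split())
    row = conn.execute(
        "SELECT is_mark FROM checked WHERE term = ?", (key,)
    ).fetchone()
    if row is not None:
        return bool(row[0])  # answered from the local list, no online search
    found = bool(lookup(key))
    conn.execute(
        "INSERT INTO checked (term, is_mark) VALUES (?, ?)", (key, int(found))
    )
    conn.commit()
    return found

def blacklisted_terms_in(conn, phrase):
    """List stored marks that appear as whole words inside a longer phrase."""
    # Pad with spaces so 'tool' does not match inside 'toolbox'.
    padded = " " + " ".join(phrase.lower().split()) + " "
    rows = conn.execute(
        "SELECT term FROM checked "
        "WHERE is_mark = 1 AND ? LIKE '% ' || term || ' %' "
        "ORDER BY term",
        (padded,),
    )
    return [r[0] for r in rows]
```

One caveat with this sketch: a stored term containing % or _ would be treated as a wildcard by LIKE, so such terms would need escaping first.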
The next one you have here is famous names, and again there are
several resources on the Internet to build these lists from.
Cyndi's List - Famous People [http://www.cyndislist.com/famous.htm ]
Text file of Famous Scientology/Celebrities
[http://home.snafu.de/tilman/faq-you/celeb.txt ]
[ http://dir.yahoo.com/Entertainment/music/artists/ ]
There are plenty of lists out there that can be built into your black lists.
Gray lists are what I think of as phrases which might be okay. "Tool"
for example is the name of a rock band. I don't want every phrase with
the word "Tool" in it removed, however. It is a pretty common word.
This is where dictionaries of white words come in.
For example:
[http://java.sun.com/docs/books/tutorial/collections/interfaces/ex5/dictionary.txt ]
[http://research.yale.edu/swahili/software/Glossary/swahili/IT_glossary_sw.csv ]
These are very easy to find. Just search on Google for
dictionary filetype:txt
or
dictionary filetype:csv
and you will come up with several. So, back to the gray list, which is
the combination of a black list word with a white list word. "Tool" on
its own is not okay (the white list side cannot be empty). "Tool
Madonna" is also not okay (two black list words). "Acorn Tool" is fine
(white list word with black list word or phrase).
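
Here is how I would sketch that rule, again in Python and again only as one reading of it. The function name `classify_phrase` and the "keep"/"reject" labels are made up for the example, and it only handles single-word black list entries. My assumption is that a phrase passes when it has at most one black list word and that word is accompanied by at least one word found only in the white list.

```python
def classify_phrase(phrase, black_words, white_words):
    """Apply the gray list rule to one phrase; returns 'keep' or 'reject'.

    black_words and white_words are sets of lowercase single words.
    """
    words = phrase.lower().split()
    black = [w for w in words if w in black_words]
    # A word on both lists (like 'tool') counts as black, not white.
    white = [w for w in words if w in white_words and w not in black_words]

    if not black:
        return "keep"    # nothing black-listed, so this filter does not apply
    if len(black) == 1 and white:
        return "keep"    # one black list word paired with a white list word
    return "reject"      # black word alone, or two or more black words


black_words = {"tool", "madonna"}
white_words = {"acorn", "tool"}
for phrase in ("Tool", "Tool Madonna", "Acorn Tool"):
    print(phrase, "->", classify_phrase(phrase, black_words, white_words))
```

Note that under this reading "Acorn Madonna" would also be kept. If that is not what you want, you could additionally require the black list word itself to appear in the dictionary, which is true of "tool" but not of most names.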
Nonsense words, as I would look at them, are word phrases that aren't
found in normal literature or usage. As I showed before, one person's
nonsense word can be another's company logo or literary descriptive
term. If most of these key phrases are two to three words long, we can
build white lists of common word combinations. Doing this requires a
bit of time and some scripting, but it can be effective.
I would start with some of the blogs out there which are written by
people who use good language skills. For example [
http://www.searchengine-weblog.com/ ]. Ripping the text from these
pages, then removing words of three characters or fewer (replacing them
with marks that indicate their existence), and then slicing two- and
three-word phrases out of the remaining words can build phrase check
lists.
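
A rough sketch of that slicing step follows, in Python. The marker character `_`, the function names, and the choice to skip any slice that begins or ends on a marker are all my own assumptions; fetching and cleaning the page text is left out entirely.

```python
import re
from collections import Counter

MARK = "_"  # stands in for a removed short word, so the gap stays visible

def mark_short_words(text, max_short=3):
    """Lowercase the text, split it into words, and replace short words with MARK."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w if len(w) > max_short else MARK for w in words]

def phrase_counts(tokens, sizes=(2, 3)):
    """Count two- and three-token slices that start and end on a real word."""
    counts = Counter()
    for n in sizes:
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] == MARK or gram[-1] == MARK:
                continue  # a slice hanging off a removed word tells us little
            counts[" ".join(gram)] += 1
    return counts


tokens = mark_short_words("Search engines rank pages by using link analysis")
checklist = phrase_counts(tokens)
print(checklist.most_common(5))
```

Keeping the marker inside a slice (for example "pages _ using") records that the two words were not adjacent in the source text, so "pages using" does not get counted as a common pair by accident. Phrases from your key phrase list that never show up in the counts are the candidates for nonsense.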
Another resource is the Google API, which you can use to send queries
to Google's search engine for checking nonsense search phrases.
[://www.google.com/apis/ ]
Pages and Documents of interest
Multiword Expression Filtering for Building Knowledge Maps (PDF file)
http://acl.ldc.upenn.edu/acl2004/mwe/pdf/venkatsubramanyan.pdf
Text Mining and Web-Based Information
http://filebox.vt.edu/users/wfan/text_mining.html
Text Mining
http://en.wikipedia.org/wiki/Text_mining
Text mining tools take on unstructured data
http://www.computerworld.com/databasetopics/businessintelligence/story/0,10801,93968,00.html
If you find that you need more information regarding this question,
please feel free to use the Clarification Function, and I will get
back to you as soon as I can.
Thanks,
Webadept-ga