Hello again,
As something of a word-sleuth myself, I couldn't let this question
pass with it only half-answered, so I've collected up a number of
word-tools for you that should be able to help address the various
questions you raised in your clarifications. Some of them are pretty
simple, while one or two are quite esoteric in how they operate. But
they all have enormous power for dealing with words in a manner that
should prove valuable to your work.
So...let's get to it. You specifically asked about the following:
--I need to find the possible misspellings of the word
"molecular"...I'm thinking about any big newsgroup material split into
words and a program that finds anything close to the word I input.
--What can be used to compare any word in the dictionary and find similar ones
Since some of the tools below can be applied to both these questions,
I want to describe the tools in turn, and then also discuss how they
could be of use to you in your work.
==========
There are a number of lists of commonly misspelled/mistyped words that
can serve as a basic reference.
This list from Cornell includes some frequency-of-occurrence
information for misspelled words:
http://www.library.cornell.edu/tsmanual/TSSU/comis1.html
Other lists of misspellings include:
http://www.wsu.edu:8080/~brians/errors/misspelled.html
Commonly misspelled words
-----
http://www.careers.cam.ac.uk/students/work/spelling.asp
Some Common Spelling Mistakes
==========
A great tool for finding words, word variations, and so on, is the
advanced search feature on this page of crossword-puzzle solver tools:
http://www.puzzlers.org/wordlists/grepdict_full.php
The National Puzzlers League
As the "Advanced Search" instructions indicate, you can use wildcards
to substitute individual letters or whole groups of letters, with some
spiffy additional controls to specify end-of-word, etc.:
"...the wildcard character is a dot: ".", not a question mark. To
specify any number of characters, say ".*" rather than "*"; e.g.
"s.*py" will find "spy", "scrappy" and "slaphappy". It will also find
"espy", "sulfapyridine" and "unspying". To avoid these, use "^" to
mark the beginning of a word, and "$" to mark the end: "^s.*py$" will
match "spy" and "sappy", but not "espy" or "spying"."
For instance, a search on [ .olecu. ] returns about a hundred words
containing -olecu-, including molecule, of course, as well as:
molecula
molecular
molecularist
molecularity
molecularly
molecule's
molecules
moleculon
Play around here. This is a great site.
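
If you would rather run this kind of pattern search on your own
machine, the same dot/star/caret/dollar syntax is ordinary regular
expression syntax, so a few lines of Python will do it. What follows is
only a rough sketch: the WORDS list is a tiny hypothetical sample I made
up for illustration (in practice you would load a full word list from a
file), and grep_words is my own helper name, not anything the puzzlers
site provides.

```python
import re

# Hypothetical sample list; a real run would read a full word list from a file.
WORDS = ["spy", "sappy", "espy", "spying", "scrappy",
         "molecule", "molecular", "molecules", "moleculon"]


def grep_words(pattern, words):
    """Return the words that contain a match for the regular expression."""
    rx = re.compile(pattern)
    return [w for w in words if rx.search(w)]


# Unanchored: the pattern may match anywhere inside a word.
unanchored = grep_words(r"s.*py", WORDS)

# Anchored: the word must start with "s" and end with "py".
anchored = grep_words(r"^s.*py$", WORDS)

# One arbitrary character on each side of the letter group "olecu".
olecu = grep_words(r".olecu.", WORDS)

print(unanchored)
print(anchored)
print(olecu)
```

The unanchored search picks up "espy" and "spying" as well, which is
exactly the behavior the instructions quoted above warn about.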
==========
Spell-checkers already have built-in criteria for identifying
misspellings and suggesting replacement words. One of the most
versatile free online spell-checkers is Aspell:
http://aspell.net/suggest/
which describes itself this way:
"Welcome to the Aspell Spell Helper. Its goal is to help out all the
bad spellers on the net by doing a really good job of coming up with
suggestions for misspelled words."
Using the online search box, and asking for suggestions for: molicul
resulted in the following list:
molecule
molecular
Mogul
mogul
molal
molecules
millijoule
modicum
helical
magical
medical
musical
follicle
local
monocle
Miguel
Moll
moil
moll
follicular
molehill
monocular
milieu
Moloch
Moluccas
Malcolm
Felicle
Mikol
Mobil
Mosul
calculi
colic
molecule's
slickly
and offered additional options to:
--Try Harder
--Try using the Huge Dictionary
-----
A similar tool is Ispell:
http://fmg-www.cs.ucla.edu/geoff/ispell.html
"Ispell is a fast screen-oriented spelling checker that shows you your
errors in the context of the original file, and suggests possible
corrections when it can figure them out."
Their site includes a description of the differences between Ispell and Aspell:
-----
What's the Difference Between Ispell and Aspell?
Aspell is a spelling checker written by Kevin Atkinson. Its primary
advantage is that it is better at making suggestions when a word is
seriously misspelled. For example, when given "trubble", ispell will
suggest only "rubble", where aspell suggests "trouble" (as its first
choice) as well as "dribble", "rubble", and a lot of other words.
http://aspell.sourceforge.net/
GNU Aspell is a Free and Open Source spell checker designed to
eventually replace Ispell. It can either be used as a library or as an
independent spell checker. Its main feature is that it does a much
better job of coming up with possible suggestions than just about any
other spell checker out there for the English language, including
Ispell and Microsoft Word. It also has many other technical
enhancements over Ispell such as using shared memory for dictionaries
and intelligently handling personal dictionaries when more than one
Aspell process is open at once.
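
To get a feel for the "find anything close to the word I input" idea
without installing a spell-checker, Python's standard difflib module can
rank dictionary words by similarity to a given string. To be clear, this
is NOT how Aspell or Ispell produce their suggestions; it is a
swapped-in, much simpler similarity measure, shown only as a sketch. The
DICTIONARY list is a small hypothetical sample (a few words borrowed from
the suggestion list above), and suggest is my own helper name.

```python
import difflib

# Hypothetical mini-dictionary; a real run would load a full word list.
DICTIONARY = ["molecule", "molecular", "molecules", "mogul",
              "modicum", "medical", "musical", "local"]


def suggest(word, dictionary, n=5, cutoff=0.6):
    """Return up to n dictionary words most similar to `word`.

    Similarity is difflib's SequenceMatcher ratio (0.0 to 1.0);
    candidates scoring below `cutoff` are dropped.
    """
    return difflib.get_close_matches(word, dictionary, n=n, cutoff=cutoff)


suggestions = suggest("molicul", DICTIONARY)
print(suggestions)
```

Raising the cutoff gives fewer but closer matches; lowering it behaves a
bit like the "Try Harder" option, at the cost of more noise.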
==========
There is a tool at this site which I haven't tried yet, but have heard
good things about:
-----
http://www.siu.edu/~nmc/busca.html
BUSCA: A SEARCHER FOR WORD PATTERNS IN TEXTS - Version 3 -- December 1997
Busca is a DOS-based program that searches a set of text files for a
specified pattern of words or for a string of characters.
When searching for a word pattern, Busca uses the punctuation of the
text to search sentence by sentence. The word pattern is defined in
terms of a focus word, with possibilities for specifying the first,
second, and/or third neighboring word before and/or after it, as well
as a "floating" word located anywhere in the sentence. Words in the
search template can be defined in terms of their beginning (xxx-),
their ending (-xxx), a contained string (-xxx-), or their entirety
(xxx). Each word position in the template may contain up to ten
alternative forms.
When searching for a character string, Busca works much like the
"Search" function of a conventional text editor.
Busca can be directed to search a set of texts that are located in a
large number of files, and these files may reside in different DOS
directories.
-----
In short, you can use BUSCA to examine a body of text (such as a few
megabytes of newsgroup discussions that you might download). By
searching on, e.g., -olecu-, you should return molecule, and
molecular, along with misspellings of these words (as long as the
misspellings still retain the core group of letters -olecu-).
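
Since I haven't run BUSCA myself, I can't show its actual commands, but
the contained-string idea (-olecu-) is easy to sketch in Python if you
want to try it on a downloaded text file. Everything below is an
assumption made for illustration: the sample text is invented, and
find_containing is a hypothetical helper, not part of BUSCA.

```python
import re
from collections import Counter


def find_containing(text, core):
    """Count every word in `text` that contains the letter group `core`."""
    # Lower-case the text and split it into runs of letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if core in t)


# Invented sample standing in for a file of newsgroup postings.
sample = ("The molecular weight was off. Another poster wrote "
          "'moleculer' and 'molecualr' in a thread on Molecular "
          "biology, and someone else typed molicular.")

hits = find_containing(sample, "olecu")
for word, count in hits.most_common():
    print(count, word)
```

Note that "molicular" is not reported, because the error falls inside
the core letters; a shorter core such as "lecu" or "cul" casts a wider
net but also returns more unrelated words.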
==========
The BUSCA tool, above, is valuable because you can search any body of
text with it (unlike some of the other tools, which only search
dictionaries where everything is presumably already spelled
correctly).
Another tool that also works with real-world text, and can hence find
actual misspellings, is at:
http://www.itri.bton.ac.uk/~Adam.Kilgarriff/bnc-readme.html
This page offers numerous word frequency lists from the British
National Corpus, a massive collection of electronic texts of all sorts
-- books, newspapers, transcripts, school reports, etc -- that contain
a good sampling of misspelled words along with all the properly
spelled words.
The files for "Unlemmatised lists" (lists of word forms exactly as they
appear in the texts, not grouped under a base form) are described this way:
-----
These are all available in 6 forms:
sorted alphabetically ("al") or by frequency (highest frequency first) ("num");
the complete lists, or a smaller file containing only those items
occurring over five times (suffix "o5");
all lists are available compressed using gzip (".gz").
-----
I downloaded and unzipped the "all words by frequency" file. As you
might guess, "the" is at the top of the list as the most frequently
used word in the BNC. But the trick here is to scroll down to the
rare words that only appear a few times in the corpus. Many of these
are misspellings. For instance, the list of words that appear only
one time in the BNC includes:
1 becayse nn1 1
1 becauses nn2-vvz 1
1 becausehe nn1 1
1 becausec nn1 1
1 becausebecausebecause nn1-vvb 1
1 becaue nn1-vvb 1
1 becauase nn1-vvb 1
1 becasuse nn1-vvb 1
making an appearance at a frequency of "3" each is:
3 becasue vvb 3
3 becasue nn1-vvb 3
[I'm not sure of the meaning of the nn1, vvb, etc., but they probably
aren't relevant to your work.]
As you can see, this list can not only be explored for misspellings,
but you can sort the misspellings by the frequency of their
occurrence.
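
Scrolling through a file that size by hand gets tedious, so here is a
rough Python sketch of how the hunt could be automated: keep only the
rare entries, then keep only those that look like the word you care
about. I am assuming each line has four whitespace-separated columns
(frequency, word, tag, and a final count), which is what the excerpt
above looks like; adding together the counts of a word that is listed
under more than one tag is also my own choice. The similarity test uses
Python's difflib, and rare_lookalikes is a hypothetical helper name.

```python
from difflib import SequenceMatcher

# Lines copied from the excerpt above (assumed format:
# frequency, word, tag, count).
LINES = """\
1 becayse nn1 1
1 becauses nn2-vvz 1
1 becausehe nn1 1
1 becausebecausebecause nn1-vvb 1
1 becaue nn1-vvb 1
3 becasue vvb 3
3 becasue nn1-vvb 3
"""


def rare_lookalikes(lines, target, max_freq=10, cutoff=0.8):
    """Return (word, frequency) pairs for rare words resembling `target`."""
    totals = {}
    for line in lines.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip anything that does not fit the assumed layout
        freq, word = int(parts[0]), parts[1]
        # Add up the counts when a word appears under several tags.
        totals[word] = totals.get(word, 0) + freq
    hits = [(w, f) for w, f in totals.items()
            if w != target and f <= max_freq
            and SequenceMatcher(None, target, w).ratio() >= cutoff]
    # Most frequent misspellings first, then alphabetical.
    return sorted(hits, key=lambda wf: (-wf[1], wf[0]))


result = rare_lookalikes(LINES, "because")
for word, freq in result:
    print(freq, word)
```

With a real download you would read the lines from the unzipped file
instead of the LINES string, and swap in "molecular" as the target.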
==========
This next tool is perhaps the most complex of the bunch, but in many
ways it is also the most powerful. It has the intimidating title of
"The MRC Psycholinguistic Database":
http://www.psy.uwa.edu.au/mrcdatabase/uwa_mrc.htm
I highly recommend spending some time experimenting with the many
options made available here. You can do some incredible things.
For instance, in the first set of options:
1). Select the database fields to be displayed in the output
I simply selected "Word" to list only words as the output file.
In the "Simple Letter Match" Patterns box, I entered:
?olec*
All other boxes were unchecked. By clicking on go, I received the
following list of words that contain -olec- beginning with the second
letter of the word:
BOLECTION
COLECTOMY
MOLECULAR
MOLECULE
MOLECULES
POLECAT
POLECATS
SOLECISM
SOLECISMS
SOLECIZE
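
Judging from the results, the "?" stands for exactly one letter and the
"*" for any run of letters, which is how shell-style wildcards work. If
that reading is right (it is my inference, not something I found
documented on the site), the same kind of match can be sketched in
Python with the standard fnmatch module. The WORDS list mixes a few
words from the output above with two hypothetical non-matches, and
letter_match is my own helper name.

```python
from fnmatch import fnmatchcase

# A few words from the output above, plus two hypothetical non-matches.
WORDS = ["BOLECTION", "MOLECULAR", "MOLECULE", "POLECAT",
         "SOLECISM", "ELECTRON", "MOLE"]


def letter_match(pattern, words):
    """Return words matching a shell-style pattern.

    "?" matches exactly one character and "*" matches any run of
    characters, including none.
    """
    return [w for w in words if fnmatchcase(w, pattern)]


matches = letter_match("?OLEC*", WORDS)
print(matches)
```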
==========
Lastly, there are some interesting papers and tools at this site on "A
note on undetected typing errors":
http://portal.acm.org/citation.cfm?id=6146&jmp=abstract&dl=GUIDE&dl=ACM
You may find some of the information here spurs further thought about
finding the most common typos and misspellings.
==========
I hope these tools -- which I've collected over the years, as they
fascinate me -- are just as fascinating and useful to you in your
work. Let's see you use them to find out how many spelling mistakes
I've made in your answer....!
If anything here is not clear -- or if you need additional information
-- please let me know before rating this answer. Just post a Request
for Clarification to let me know how I can assist you further.
Best of luck.
pafalafa-ga
search strategy: made use of bookmarked sites