Google Answers: Guessing misprint possibility and frequency


Q: Guessing misprint possibility and frequency ( Answered, 4 Comments )

Question

Subject: Guessing misprint possibility and frequency
Category: Computers > Algorithms
Asked by: advert2k2p-ga
List Price: $150.00

Posted: 23 Feb 2004 05:56 PST
Expires: 24 Mar 2004 05:56 PST
Question ID: 309795

I've got an unusual logical task. Let's say we have any word, e.g. "molecular". We need to determine what misprints can be made when this word is typed rapidly on a keyboard. I also need to determine the possible misprints and their frequency (probability). Of course I understand that it's a really hard task, but maybe there is some research on this problem. It depends on the keyboard, of course (I mean a usual QWERTY keyboard). Actually, it's a programming task, but I need to understand the logic first of all. To sum up: I need an algorithm to find the most frequent misprints for any input and, if possible, the frequency of each misprint. Help is highly appreciated.

Request for Question Clarification by pafalafa-ga on 23 Feb 2004 09:24 PST

Hello there.

Interesting challenge you have. I can't, myself, think of an algorithm that would do the trick, but perhaps there's another way to approach the problem at hand.

Have you thought of using an off-the-shelf (or off-the-web) tool to serve a similar purpose? Feeding an unrecognized, and potentially misspelled, word into any of a number of available spell-check systems should return a suggestion for a correct spelling, and can even provide quantitative data on misspellings.

For instance, typing "molicular" (without the quote marks) into the Google search bar not only returns the corrected spelling:

-----
Did you mean: molecular
-----

but also gives you some information about how often "molicular" appears on the web:

-----
Searched the web for molicular. Results 1 - 100 of about 180.
-----

On the other hand, a search for "moleculat" returns a much greater number of misspelled results:

-----
Searched the web for moleculat. Results 1 - 100 of about 584
-----

Just wanted to toss this all out to see if it might be of any use to you. Let me know what you think.

pafalafa-ga (correctly spelled!)

Clarification of Question by advert2k2p-ga on 23 Feb 2004 10:45 PST

OK, half of the problem is solved. Let's say I know how to count a word's misspelling frequency using Google (great idea btw, though you're promoting Google). That leaves the other part of the problem: I need to find the possible misspellings of the word "molecular". Right now I'm thinking about any big newsgroup material split into words and a program that finds anything close to the word I input.

Clarification of Question by advert2k2p-ga on 23 Feb 2004 10:47 PST

I've got another question then. What can be used to compare any word in the dictionary and find similar ones (people can transpose letters or even leave a letter out)?

Request for Question Clarification by pafalafa-ga on 24 Feb 2004 06:57 PST

Hello again,

It's good to hear that the first part of your question is answered, but I'm still not certain about the best way to proceed on the next part of your question.

Please take a look at the following links, and let me know what you think. These are misspellings lists from various sources, along with some spelling software. My reason for presenting them is to make clear that there ARE techniques out there for identifying common and not-so-common misspellings -- how else could Google (or any other spell-checker) "recognize" a misspelled word and suggest a correction?

I'm just not sure which tool(s) would best meet your needs, or what sort of information you're after at this point. So have a look at the links below, and let me know if anything here strikes a note. Thanks.

==========

http://mviscript.hypermart.net/out.html
common misspellings of common words

-----

://www.google.com/jobs/britney.html
an occasionally humorous collection of misspellings of Britney Spears (brutany spears?)

-----

http://www.barnsdle.demon.co.uk/spell/error.html
A Study of Some of the Most Commonly Misspelled Words

-----

http://directory.google.com/Top/Arts/Writers_Resources/Software/Spelling_and_Grammar/Spell_Checkers/
List of spellcheck software

-----

Let me know if any of this looks like it might be useful (and if not, I'd appreciate your thoughts on what WOULD work for your project).

Answer

Subject: Re: Guessing misprint possibility and frequency
Answered By: pafalafa-ga on 21 Mar 2004 16:41 PST

Hello again,

As something of a word-sleuth myself, I couldn't let this question
pass with it only half-answered, so I've collected up a number of
word-tools for you that should be able to help address the various
questions you raised in your clarifications.  Some of them are pretty
simple,  while one or two are quite esoteric in how they operate.  But
they all have enormous power for dealing with words in a manner that
should prove valuable to your work.

So...let's get to it.  You specifically asked about the following:

--I need to find the possible misspellings of the word
"molecular"... I'm thinking about any big newsgroup material split into
words and a program that finds anything close to the word I input.


--What can be used to compare any word in the dictionary and find similar ones 

Since some of the tools below can be applied to both these questions,
I want to describe the tools in turn, and then also discuss how they
could be of use to you in your work.


==========

There are a number of lists of commonly misspelled/mistyped words that
can serve as a basic reference.

This list from Cornell includes some frequency-of-occurrence
information for misspelled words:

http://www.library.cornell.edu/tsmanual/TSSU/comis1.html




Other lists of misspellings include:


http://www.wsu.edu:8080/~brians/errors/misspelled.html

Commonly misspelled words


-----


http://www.careers.cam.ac.uk/students/work/spelling.asp

Some Common Spelling Mistakes

==========


A great tool for finding words, word variations, and so on, is the
advanced search feature on this page of crossword-puzzle solver tools:

http://www.puzzlers.org/wordlists/grepdict_full.php

The National Puzzlers League
  
As the "Advanced Search" instructions indicate, you can use wildcards
to substitute individual letters or whole groups of letters, with some
spiffy additional controls to specify end-of-word, etc.:

"...the wildcard character is a dot: ".", not a question mark. To
specify any number of characters, say ".*" rather than "*"; e.g.
"s.*py" will find "spy", "scrappy" and "slaphappy". It will also find
"espy", "sulfapyridine" and "unspying". To avoid these, use "^" to
mark the beginning of a word, and "$" to mark the end: "^s.*py$" will
match "spy" and "sappy", but not "espy" or "spying"."

For instance, a search on [ .olecu. ] returns about a hundred words
containing -olecu-, including molecule, of course, as well as:

molecula 
molecular 
molecularist 
molecularity 
molecularly 
molecule's 
molecules 
moleculon 

Play around here.  This is a great site.  
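
The wildcard rules quoted above are ordinary regular-expression syntax,
so the same searches can be tried offline. Here is a minimal sketch in
Python; the short word list is only a stand-in for a real dictionary
file, and the helper name grep_words is made up for illustration.

```python
import re

# Tiny stand-in word list; a real run would load a full dictionary file.
words = ["spy", "sappy", "scrappy", "espy", "spying", "molecule", "molecular"]

def grep_words(pattern, wordlist):
    """Return the words in which the regular expression matches anywhere."""
    rx = re.compile(pattern)
    return [w for w in wordlist if rx.search(w)]

# Unanchored: "s.*py" may match in the middle of a word.
unanchored = grep_words(r"s.*py", words)
# Anchored: "^" and "$" force the match to cover the whole word.
anchored = grep_words(r"^s.*py$", words)
# A dot on each side of "olecu" requires one letter before and after it.
olecu = grep_words(r".olecu.", words)

print(unanchored)
print(anchored)
print(olecu)
```

The unanchored pattern picks up "espy" and "spying" as well, which is
exactly the behaviour the site's instructions warn about.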

==========

Spell-checkers already have built-in criteria for identifying
misspellings and suggesting replacement words.  One of the most
versatile free, online spell-checkers is Aspell:

http://aspell.net/suggest/

which describes itself this way:

"Welcome to the Aspell Spell Helper. Its goal is to help out all the
bad spellers on the net by doing a really good job of coming up with
suggestions for misspelled words."

Using the online search box, and asking for suggestions for: molicul

resulted in the following list:


molecule
molecular
Mogul
mogul
molal
molecules
millijoule
modicum
helical
magical
medical
musical
follicle
local
monocle
Miguel
Moll
moil
moll
follicular
molehill
monocular
milieu
Moloch
Moluccas
Malcolm
Felicle
Mikol
Mobil
Mosul
calculi
colic
molecule's
slickly

and offered additional options to:

--Try Harder
--Try using the Huge Dictionary

-----

A similar tool is Ispell:


http://fmg-www.cs.ucla.edu/geoff/ispell.html


"Ispell is a fast screen-oriented spelling checker that shows you your
errors in the context of the original file, and suggests possible
corrections when it can figure them out."

Their site includes a description of the differences between Ispell and Aspell:

-----
 
What's the Difference Between Ispell and Aspell?

Aspell is a spelling checker written by Kevin Atkinson. Its primary
advantage is that it is better at making suggestions when a word is
seriously misspelled. For example, when given "trubble", ispell will
suggest only "rubble", where aspell suggests "trouble" (as its first
choice) as well as "dribble", "rubble", and a lot of other words.
http://aspell.sourceforge.net/

GNU Aspell is a Free and Open Source spell checker designed to
eventually replace Ispell. It can either be used as a library or as an
independent spell checker. Its main feature is that it does a much
better job of coming up with possible suggestions than just about any
other spell checker out there for the English language, including
Ispell and Microsoft Word. It also has many other technical
enhancements over Ispell such as using shared memory for dictionaries
and intelligently handling personal dictionaries when more than one
Aspell process is open at once.

==========

There is a tool at this site which I haven't tried yet, but have heard
good things about:

-----

http://www.siu.edu/~nmc/busca.html

BUSCA: A SEARCHER FOR WORD PATTERNS IN TEXTS - Version 3 -- December 1997

Busca is a DOS-based program that searches a set of text files for a
specified pattern of words or for a string of characters.
When searching for a word pattern, Busca uses the punctuation of the
text to search sentence by sentence. The word pattern is defined in
terms of a focus word, with possibilities for specifying the first,
second, and/or third neighboring word before and/or after it, as well
as a "floating" word located anywhere in the sentence. Words in the
search template can be defined in terms of their beginning (xxx-),
their ending (-xxx), a contained string (-xxx-), or their entirety
(xxx). Each word position in the template may contain up to ten
alternative forms.

When searching for a character string, Busca works much like the
"Search" function of a conventional text editor.

Busca can be directed to search a set of texts that are located in a
large number of files, and these files may reside in different DOS
directories.
-----

In short, you can use BUSCA to examine a body of text (such as a few
megabytes of newsgroup discussions that you might download).  By
searching on, e.g., -olecu-, you should return molecule, and
molecular, along with misspellings of these words (as long as the
misspellings still retain the core group of letters -olecu-).
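
If a DOS program is not convenient, the same kind of search can be
roughed out in a few lines of Python. This is only a sketch of the idea
(count every word that contains a core letter group), not a
reimplementation of BUSCA; the function name and the sample sentence
are invented for illustration.

```python
import re
from collections import Counter

def words_containing(text, core):
    """Count every word in `text` that contains the letter group `core`."""
    # Lower-case the text and split it into runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if core in t)

# Invented sample; in practice this would be the downloaded newsgroup text.
sample = ("The molecular weight of a molecule; each moleculer bond. "
          "Molecular biology.")
hits = words_containing(sample, "olecu")

for word, n in hits.most_common():
    print(n, word)
```

Like the BUSCA search, this only finds misspellings that keep the core
group of letters intact.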


==========

The BUSCA tool, above, is valuable because you can search any body of
text with it (unlike some of the other tools, which only search
dictionaries where everything is, presumably, already spelled
correctly).

Another tool that also works with real-world text, and can hence find
actual misspellings, is at:


http://www.itri.bton.ac.uk/~Adam.Kilgarriff/bnc-readme.html

This page offers numerous word frequency lists from the British
National Corpus, a massive collection of electronic texts of all sorts
-- books, newspapers, transcripts, school reports, etc -- that contain
a good sampling of misspelled words along with all the properly
spelled words.


The files for "Unlemmatised lists" (lists of word forms as they appear
in the texts, not grouped under a base form) are described this way:

-----

These are all available in 6 forms: 

sorted alphabetically ("al") or by frequency (highest frequency first) ("num"); 
the complete lists, or a smaller file containing only those items
occurring over five times (suffix "o5");

all lists are available compressed using gzip (".gz").

-----

I downloaded and unzipped the "all words by frequency" file.  As you
might guess, "the" is at the top of the list as the most frequently
used word in the BNC.  But the trick here is to scroll down to the
rare words that only appear a few times in the corpus.  Many of these
are misspellings.  For instance, the list of words that appear only
one time in the BNC include:

1 becayse nn1 1
1 becauses nn2-vvz 1
1 becausehe nn1 1
1 becausec nn1 1
1 becausebecausebecause nn1-vvb 1
1 becaue nn1-vvb 1
1 becauase nn1-vvb 1
1 becasuse nn1-vvb 1

making an appearance at a frequency of "3" each is:

3 becasue vvb 3
3 becasue nn1-vvb 3

[I'm not sure of the exact meaning of the nn1, vvb, etc. (they look
like part-of-speech tags), but they probably aren't relevant to your
work.]

As you can see, this list can not only be explored for misspellings,
but you can sort the misspellings by the frequency of their
occurrence.
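
Scrolling by eye gets tedious, so here is a rough Python sketch of how
the list could be filtered for near-misses of one word. It assumes each
line has the layout seen in the samples above (a count, then the word,
then further columns), adds up the counts when a word appears on more
than one line, and uses the standard library's difflib similarity ratio
with an arbitrary 0.8 cut-off. The function name and the cut-off are my
own choices, not anything defined by the BNC files.

```python
from collections import Counter
from difflib import SequenceMatcher

def near_misses(lines, target, min_ratio=0.8):
    """Rank words that look like `target` by their total listed count."""
    totals = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or not fields[0].isdigit():
            continue  # skip blank or malformed lines
        # Assumed layout: first column is the count, second is the word.
        totals[fields[1]] += int(fields[0])
    close = {
        w: n for w, n in totals.items()
        if w != target
        and SequenceMatcher(None, target, w).ratio() >= min_ratio
    }
    # Highest count first, ties broken alphabetically.
    return sorted(close.items(), key=lambda item: (-item[1], item[0]))

# A few of the lines quoted above.
sample_lines = [
    "1 becayse nn1 1",
    "1 becausebecausebecause nn1-vvb 1",
    "1 becaue nn1-vvb 1",
    "3 becasue vvb 3",
    "3 becasue nn1-vvb 3",
]
ranked = near_misses(sample_lines, "because")
for word, count in ranked:
    print(count, word)
```

The cut-off is a knob to experiment with: set it too low and unrelated
rare words creep in, too high and heavier typos are missed.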


==========

This next tool is perhaps the most complex of the bunch, but in many
ways it is also the most powerful.  It has the intimidating title of
"The MRC Psycholinguistic Database":

http://www.psy.uwa.edu.au/mrcdatabase/uwa_mrc.htm

I highly recommend spending some time experimenting with the many
options made available here.  You can do some incredible things.

For instance, in the first set of options:

1). Select the database fields to be displayed in the output

I simply selected "Word" to list only words as the output file.  

In the "Simple Letter Match" Patterns box, I entered:

?olec*

All other boxes were unchecked.  By clicking on go, I received the
following list of words that contain -olec- beginning with the second
letter of the word:

BOLECTION                
COLECTOMY                
MOLECULAR                
MOLECULE                 
MOLECULES                
POLECAT                  
POLECATS                 
SOLECISM                 
SOLECISMS                
SOLECIZE  
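
The "?olec*" pattern behaves like a shell-style wildcard, at least
judging by the output above: "?" stands for exactly one letter and "*"
for any run of letters. Assuming that reading is right, the same match
can be tried locally with Python's fnmatch module; the word list here
is just a small made-up sample.

```python
from fnmatch import fnmatchcase

# Small sample list; a real run would use a full word list.
words = ["BOLECTION", "MOLECULAR", "POLECAT", "SOLECISM",
         "ELECTRON", "MOLE", "HOLE"]

# "?" matches exactly one character, "*" matches any run (even empty).
matches = [w for w in words if fnmatchcase(w, "?OLEC*")]
print(matches)
```

Words such as ELECTRON and MOLE drop out because -OLEC- has to start at
the second letter.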


==========

Lastly, there are some interesting papers and tools at this site on "A
note on undetected typing errors":

http://portal.acm.org/citation.cfm?id=6146&jmp=abstract&dl=GUIDE&dl=ACM


You may find some of the information here spurs further thought about
finding the most common typos and misspellings.

==========

I hope these tools -- which I've collected over the years, as they
fascinate me -- are just as fascinating and useful to you in your
work.  Let's see you use them to find out how many spelling mistakes
I've made in your answer....!

If anything here is not clear -- or if you need additional information
-- please let me know before rating this answer.  Just post a Request
for Clarification to let me know how I can assist you further.

Best of luck.

pafalafa-ga





search strategy:  made use of bookmarked sites

Comments

Subject: Re: Guessing misprint possibility and frequency
From: mathtalk-ga on 25 Feb 2004 21:04 PST

It seems to me that the discussion on this Question began in terms of
wanting an a priori model of the likelihood of mistyping and then turned
in the direction of obtaining a posteriori data, e.g. by Google searches.

Given a word, one can consider several possible sources of error:

 - the typist remembers the spelling incorrectly but attempts a
phonetically related string

 - the typist fails to press the key for one or more letters of an
attempted spelling

 - the typist hits an incorrect key (or more than one key) during an
attempted spelling

While the layout of a keyboard would perhaps be a guide to
constructing relative frequency distributions for the latter two
categories, and the soundex algorithm may be of some help in assessing
the relative likelihood of errors from the first category, it is still
required to put in "by hand" what the absolute probabilities of
mistakes are.  In effect the same model may apply to both a careful
and a careless typist, but some parameters are needed to describe the
difference between them.

regards, mathtalk-ga
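
For the two keyboard-driven categories in the comment above (a missed
key and a wrong key), plus the swapped letters the asker mentioned,
candidate misprints can be generated mechanically. The following Python
sketch is one possible way to do it under a deliberately crude
assumption: a letter can only be replaced by a letter key touching it
on a staggered QWERTY layout. The adjacency rule and function names are
illustrative, and no probabilities are assigned; as noted above, those
still have to be supplied by hand or measured.

```python
# Letter rows of a QWERTY keyboard; each row sits slightly to the right
# of the one above it, which is what the index offsets below model.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

def neighbours(ch):
    """Return the letter keys assumed to touch `ch` on the keyboard."""
    for r, row in enumerate(ROWS):
        c = row.find(ch)
        if c == -1:
            continue
        out = set()
        for dc in (-1, 1):  # left and right on the same row
            if 0 <= c + dc < len(row):
                out.add(row[c + dc])
        if r > 0:  # row above: keys at the same index and one to the right
            up = ROWS[r - 1]
            out.update(up[i] for i in (c, c + 1) if i < len(up))
        if r < len(ROWS) - 1:  # row below: one to the left and same index
            down = ROWS[r + 1]
            out.update(down[i] for i in (c - 1, c) if 0 <= i < len(down))
        return out
    return set()

def misprints(word):
    """Map each single-slip misprint of `word` to the kind of slip."""
    out = {}
    for i, ch in enumerate(word):
        # The key was not pressed at all.
        out.setdefault(word[:i] + word[i + 1:], "omission")
        # A neighbouring key was pressed instead.
        for n in sorted(neighbours(ch)):
            out.setdefault(word[:i] + n + word[i + 1:], "wrong key")
        # Two adjacent letters were typed in the wrong order.
        if i + 1 < len(word) and ch != word[i + 1]:
            swapped = word[:i] + word[i + 1] + ch + word[i + 2:]
            out.setdefault(swapped, "transposition")
    out.pop(word, None)
    return out

cands = misprints("molecular")
for typo in ("moleculat", "molcular", "nolecular", "molecualr"):
    print(typo, cands[typo])
```

Note that "molicular" is not generated: "i" is nowhere near "e" on the
keyboard, so that one belongs to the first category (misremembered
spelling), which a keyboard model alone will not capture.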

Subject: Re: Guessing misprint possibility and frequency
From: rpcxdr-ga on 03 Mar 2004 17:15 PST

Google is fast - just try all letter combinations.  Google will map
them to their most likely spelling.  Let me get out my calculator... I
would guess it would take about a week to find all 5 letter
misspellings using the Google mapping.  Using a few rough heuristics
like the user getting the first letter right and maybe getting a vowel
in there you might be able to search all 7 or 9 letter combinations.

Interestingly, Google will correct spellings on non-dictionary words
like "gogle" and "googte".  So a mass search of Google will turn up
misspellings of words that you could not find in a dictionary.  That
is, until "google" is added.

rpc
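
To put rough numbers on the "about a week" guess above, here is the
arithmetic as a short Python sketch. It assumes a plain 26-letter
alphabet and one query per letter string, and it says nothing about
whether any search engine would tolerate that query rate.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60

def letter_strings(length, alphabet_size=26):
    """Number of distinct strings of the given length."""
    return alphabet_size ** length

def queries_per_second(length, seconds=SECONDS_PER_WEEK):
    """Sustained query rate needed to try every string in `seconds`."""
    return letter_strings(length) / seconds

for n in (5, 7, 9):
    print(n, letter_strings(n), round(queries_per_second(n), 1))
```

For 5 letters that works out to roughly 20 queries per second, kept up
for the whole week. Each extra pair of letters multiplies the work by
676, which is why heuristics are needed for 7 or 9 letters.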

Subject: Re: Guessing misprint possibility and frequency
From: biocomp-ga on 10 Mar 2004 16:56 PST

We could get a little bit pragmatic here. The 
following is inspired by research in bio-computing.
In this field, there is a precise definition
of the *distance* between two words: missing letters,
replaced letters, or added letters are typical errors
that contribute to a numerical evaluation of the
similarity, or distance, between words.

Here is what I suggest:

1) Obtain a huge corpus of possibly mistyped text.
[This could be the hard part. Downloading millions
of 'home-pages'???]

2) Use an approximate string-matching algorithm
-- like the ones used for finding similar biological
sequences in whole genomes search -- with your word
as a query sequence. There exist some really fast ones:
Search for 'agrep' on the web.

3) The results -- if the sampling of the texts is really huge --
could be displayed by the number of occurrences of each
misspelled word,  by increasing distances. For example:

molecular 12450
molicular 23
molcular  12
nolecular 7
and so on..

4) Discard the first line -- which should correspond 
to the correct spelling -- and take the rest as your 
empirical distribution of mistypings. 

I am aware that several parameters have to be adjusted,
such as the number of misspellings allowed in a word of a 
fixed length, but it could be a start. If the part
about 'approximate string-matching algorithms' is unclear,
do not hesitate to ask, I am a specialist.
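
Steps 2 to 4 above can be sketched in Python as follows. To be clear
about the assumptions: this uses the textbook dynamic-programming
(Levenshtein) edit distance and a brute-force pass over every word,
not agrep or any of the fast genome-search algorithms, and the sample
text and function names are invented for illustration.

```python
import re
from collections import Counter

def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        prev = cur
    return prev[-1]

def misspelling_counts(text, query, max_dist=1):
    """Count words within `max_dist` edits of `query`, excluding `query`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(
        t for t in tokens if 0 < edit_distance(query, t) <= max_dist
    )
    # Order by increasing distance, then by decreasing count.
    return sorted(
        counts.items(),
        key=lambda kv: (edit_distance(query, kv[0]), -kv[1], kv[0]),
    )

# Invented sample standing in for a huge corpus.
corpus = ("molecular molicular molecular molcular molicular "
          "nolecular molecule")
table = misspelling_counts(corpus, "molecular")
for word, n in table:
    print(word, n)
```

The max_dist argument is the "number of misspellings allowed" parameter
mentioned above; note that with a distance of 2 the legitimate word
"molecule" would be counted too, so real words need to be filtered out
against a dictionary.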

Subject: Re: Guessing misprint possibility and frequency
From: smcinmass-ga on 10 Mar 2004 17:13 PST

Since you mentioned identifying the misspelling as the user is typing,
have you considered using tries to autocomplete common words as they
are being typed?

http://www.guides.sk/CRCDict/HTML/trie.html
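
For anyone unfamiliar with tries, here is a minimal Python sketch of
the idea: each node holds one letter's worth of branching, so every
completion of a typed prefix can be read off by walking down from that
prefix. The class and method names are my own, and the four-word
dictionary is just an example.

```python
class Trie:
    """A node in a prefix tree; the root node represents the empty prefix."""

    def __init__(self):
        self.children = {}    # next letter -> child node
        self.is_word = False  # True if a word ends exactly here

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def complete(self, prefix):
        """Return all stored words that start with `prefix`, sorted."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []  # nothing in the dictionary starts this way
        found = []
        stack = [(node, prefix)]
        while stack:
            cur, text = stack.pop()
            if cur.is_word:
                found.append(text)
            for ch, child in cur.children.items():
                stack.append((child, text + ch))
        return sorted(found)

trie = Trie()
for w in ("mole", "molar", "molecule", "molecular"):
    trie.insert(w)

print(trie.complete("molec"))
print(trie.complete("molic"))
```

An empty result, as for "molic", is a cheap signal that the typist has
already left the dictionary, even before the word is finished.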

