Q: Automatic detection of language in a text document ( No Answer,   3 Comments )
Question  
Subject: Automatic detection of language in a text document
Category: Computers > Software
Asked by: jgrahamc-ga
List Price: $20.00
Posted: 25 Oct 2006 06:14 PDT
Expires: 24 Nov 2006 05:14 PST
Question ID: 776717
I need a way to take a text document and tell me what language it was
written in.  At the least I need to be able to recognize Western
European languages (English, French, Spanish, etc.).   My preference
is either an open source solution, or a document that describes how I
could build this myself.

Request for Question Clarification by rainbow-ga on 27 Oct 2006 12:25 PDT
Let me know if this answers your question:

About automatic language detection
http://office.microsoft.com/en-us/assistance/HP052585571033.aspx

Best regards,
Rainbow
Answer  
There is no answer at this time.

Comments  
Subject: Re: Automatic detection of language in a text document
From: lafra-ga on 25 Oct 2006 06:40 PDT
 
Hello there.
If you just need a way to recognize the language, open the document
in Microsoft Word and run the "Spelling and Grammar" check. The window
should automatically show the language to use for checking the entire
document; if the wrong language is used, you will find a lot of
mistakes.
If you need software or a plugin that does something similar, I can't help you. Bye :-)
Subject: Re: Automatic detection of language in a text document
From: harrysnet-ga on 25 Oct 2006 17:55 PDT
 
A first approximation, assuming you have dictionaries for your
target languages, would be to run some command-line tool (such as the
spell command in Unix) once for each of the languages. As output you would
get a file of unknown words for each language, which can be counted.
The language that gives you the fewest unknown words would be your target
language.

All of this can be done with existing commands and shell scripts in a
Unix environment. On Windows you are probably out of luck with scripting,
although you could probably do this with a fairly simple program.

I have no idea what the failure rate of this approach is. For higher accuracy
you would need to parse the text syntactically as well. That would increase
the difficulty a lot, though, and a script solution goes out the window.
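
The unknown-word counting idea described above could be sketched roughly as follows. This is only an illustration in Python rather than a shell script around the spell command: the tiny word lists and the names DICTIONARIES, unknown_counts and guess_language are made up for the example, and a real setup would load full dictionaries for each target language.

```python
import re

# Hypothetical, tiny word lists for illustration only. A real setup would
# load a full dictionary file for each target language.
DICTIONARIES = {
    "english": {"the", "and", "is", "of", "to", "in", "that", "it", "was", "a"},
    "french": {"le", "la", "et", "est", "de", "les", "un", "une", "que", "dans"},
    "spanish": {"el", "la", "y", "es", "de", "los", "un", "una", "que", "en"},
}


def words(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def unknown_counts(text, dictionaries=DICTIONARIES):
    """Return {language: number of words missing from that dictionary}."""
    tokens = words(text)
    return {
        lang: sum(1 for word in tokens if word not in vocab)
        for lang, vocab in dictionaries.items()
    }


def guess_language(text, dictionaries=DICTIONARIES):
    """Pick the language whose dictionary leaves the fewest unknown words."""
    counts = unknown_counts(text, dictionaries)
    return min(counts, key=counts.get)
```

With such small word lists the result is only meaningful for text that contains many common function words; ties are broken by dictionary order, which a real tool would want to handle more carefully.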
Subject: Re: Automatic detection of language in a text document
From: owain-ga on 27 Oct 2006 12:09 PDT
 
Although not conclusive, you could count the number of each letter a-z
in the document and plot a histogram of this. Different languages have
different letter frequency histograms.
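
A minimal sketch of the histogram comparison, assuming you have a sample text in each candidate language to build reference histograms from. The function names and the distance measure (sum of absolute differences) are choices made for this example, not something the comment above specifies. Accented letters are ignored because only a-z is counted.

```python
from collections import Counter
import string


def letter_histogram(text):
    """Return the relative frequency of each letter a-z in the text."""
    counts = Counter(c for c in text.lower() if c in string.ascii_lowercase)
    total = sum(counts.values())
    if total == 0:
        # No a-z letters at all: return an all-zero histogram.
        return {letter: 0.0 for letter in string.ascii_lowercase}
    return {letter: counts[letter] / total for letter in string.ascii_lowercase}


def distance(hist_a, hist_b):
    """Sum of absolute differences between two letter histograms."""
    return sum(abs(hist_a[l] - hist_b[l]) for l in string.ascii_lowercase)


def closest_language(text, reference_texts):
    """Return the language whose sample histogram is nearest to the text's.

    reference_texts maps a language name to a sample text in that language
    (supplied by the caller; longer samples give more stable histograms).
    """
    profiles = {
        lang: letter_histogram(sample)
        for lang, sample in reference_texts.items()
    }
    target = letter_histogram(text)
    return min(profiles, key=lambda lang: distance(target, profiles[lang]))
```

As the comment says, this is not conclusive: short documents and closely related languages can produce very similar histograms.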
