Subject:
Automatic detection of language in a text document
Category: Computers > Software Asked by: jgrahamc-ga List Price: $20.00 |
Posted:
25 Oct 2006 06:14 PDT
Expires: 24 Nov 2006 05:14 PST Question ID: 776717 |
I need a way to take a text document and determine what language it was written in. At the least, I need to be able to recognize Western European languages (English, French, Spanish, etc.). My preference is either an open source solution or a document that describes how I could build this myself. |
There is no answer at this time. |
Subject:
Re: Automatic detection of language in a text document
From: lafra-ga on 25 Oct 2006 06:40 PDT |
Hello there. If you just need a way to recognize the language, open the document in Microsoft Word and run the "Spelling and Grammar Check". The window should automatically show the language to use for checking the entire document; with any other language you will find a lot of mistakes. If you need software or a plug-in for this, I can't help you. Bye :-) |
Subject:
Re: Automatic detection of language in a text document
From: harrysnet-ga on 25 Oct 2006 17:55 PDT |
As a first approximation, and assuming you have dictionaries for your target languages, you could run a command-line tool (such as the spell command in Unix) once for each language. The output for each language is a list of unknown words, which can be counted. The language that gives you the fewest unknown words would be your target language. All of this can be done with existing commands and shell scripts in a Unix environment. On Windows you are probably out of luck with scripting, although you could probably do this with a fairly simple program. I have no idea of the failure rate for this approach. For higher accuracy you would need to do syntactic parsing of the text as well. This would increase the difficulty a lot, though, and a script solution goes out the window. |
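The fewest-unknown-words idea described above can be sketched in a few lines of Python rather than shell. This is only a minimal illustration: the tiny word lists, the `WORDLISTS` name, and the helper functions are assumptions made up for the example, and a real run would load full dictionaries for each language.

```python
import re

# Tiny illustrative word lists (an assumption for this sketch); a real run
# would load complete dictionary files for each target language.
WORDLISTS = {
    "english": {"the", "and", "is", "of", "to", "in", "a", "that", "it", "cat", "on"},
    "french": {"le", "la", "et", "est", "de", "un", "une", "que", "il", "chat", "sur"},
    "spanish": {"el", "la", "y", "es", "de", "un", "una", "que", "en", "gato", "sobre"},
}


def tokenize(text):
    """Split text into lowercase words made of letters only."""
    return re.findall(r"[^\W\d_]+", text.lower())


def count_unknown(words, wordlist):
    """Count the words that do not appear in the given word list."""
    return sum(1 for w in words if w not in wordlist)


def guess_language(text, wordlists=WORDLISTS):
    """Return (best_language, unknown_counts); fewest unknown words wins."""
    words = tokenize(text)
    counts = {lang: count_unknown(words, wl) for lang, wl in wordlists.items()}
    return min(counts, key=counts.get), counts


if __name__ == "__main__":
    print(guess_language("Le chat est sur la table"))
```

With lists this small the result is only suggestive; ties are broken by dictionary order, and short documents may not contain enough known words to separate closely related languages.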
Subject:
Re: Automatic detection of language in a text document
From: owain-ga on 27 Oct 2006 12:09 PDT |
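Although not conclusive, you could count the occurrences of each letter a-z in the document and plot a histogram of the result. Different languages have different letter frequency histograms, so comparing the document's histogram against known histograms for each language hints at the likely language. |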
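A rough sketch of the letter-histogram comparison, assuming you supply your own sample text for each candidate language to serve as the reference. The function names and the sum-of-absolute-differences distance are choices made for this illustration, not something specified in the comment above.

```python
from collections import Counter
import string

LETTERS = string.ascii_lowercase  # only a-z is counted, as suggested above


def letter_histogram(text):
    """Return the relative frequency of each letter a-z in the text."""
    counts = Counter(c for c in text.lower() if c in LETTERS)
    total = sum(counts.values())
    if total == 0:
        return {c: 0.0 for c in LETTERS}
    return {c: counts[c] / total for c in LETTERS}


def distance(h1, h2):
    """Sum of absolute differences between two letter histograms."""
    return sum(abs(h1[c] - h2[c]) for c in LETTERS)


def closest_language(text, samples):
    """Pick the language whose sample text has the most similar histogram.

    `samples` maps a language name to a reference text in that language
    (supplied by the caller; the longer the text, the better the profile).
    """
    target = letter_histogram(text)
    profiles = {lang: letter_histogram(s) for lang, s in samples.items()}
    return min(profiles, key=lambda lang: distance(target, profiles[lang]))
```

Usage would look like `closest_language(document_text, {"english": english_sample, "french": french_sample})`, where the sample variables hold text you have collected yourself. Because only a-z is counted, accented characters are ignored, which discards some of the signal that separates Western European languages.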