Google Answers: TextCat in .NET

Question

Subject: TextCat in .NET
Category: Computers > Programming
Asked by: kvl-ga
List Price: $50.00

Posted: 16 Feb 2005 01:01 PST
Expires: 18 Mar 2005 01:01 PST
Question ID: 475320

Hi,

There is an open-source Perl program that can guess the
language of a series of words. More info at
http://odur.let.rug.nl/~vannoord/TextCat/ BUT we need the same
software in a Microsoft C# .NET version.

The question is answered if you can tell us:

- if a .NET version would be available, where/how we can get it license free.
- if not available, can we have it developed (is it technically possible)?
- if it could be developed, some tips and tricks how you see it done
(some references).

Thanks in advance,
Kris Vlaemynck

Answer

Subject: Re: TextCat in .NET
Answered By: cyclometh-ga on 20 Feb 2005 00:04 PST
Rated: 5 out of 5 stars

Hi, kvl-ga!

Thanks for the opportunity to tackle this question for you. I've done
some research and I believe I can provide you with the information
you're looking for.

To summarize the question, you were seeking information on a .NET
implementation of the TextCat utility, which is a Perl script. TextCat
implements the "N-Gram-Based Text Categorization" algorithm, by
William B. Cavnar and John M. Trenkle.

I located the original document that TextCat's algorithm is based on
via CiteSeer. The following links go to the CiteSeer system, to the
specific citation page for Cavnar and Trenkle's paper, and to a PDF of
the paper itself. The paper is also available in several other
formats, including PostScript, from the citation page.

[CiteSeer, the Scientific Digital Research Library at PSU]
http://citeseer.ist.psu.edu/

[N-Gram-Based Text Categorization]
http://citeseer.ist.psu.edu/68861.html

[N-Gram-Based Text Categorization Paper, PDF Format]
http://citeseer.ist.psu.edu/cache/papers/cs/810/http:zSzzSzwww.info.unicaen.frzSz~giguetzSzclassifzSzcavnar_trenkle_ngram.pdf/n-gram-based-text.pdf

Your first question:

"If a .NET version would be available, where/how we can get it license free?"

I'm afraid that my research has not provided any information about a
.NET version of TextCat or a similar utility for categorizing text by
language that is implemented in C# or any other .NET language.

However, I was able to locate several other applications and libraries
that implement the same or similar algorithm. Most of these come from
the TextCat web site's "Competitors" list. Although not specifically
covered under your question, I am listing them here as they may be of
interest or use to you, and in the interest of completeness.

[TextCat "Competitors" list]
http://odur.let.rug.nl/~vannoord/TextCat/competitors.html

Lextek International has an SDK (Software Development Kit) available
called "Lextek Language Identifier".

[Lextek Language Identifier]
http://www.lextek.com/langid/

The Lextek Language Identifier library is a C library, and is
commercial, which means you are likely not interested in using it, as
you have indicated you're most interested in free/open source
implementations. However, I include it as something for you to review,
because although it is commercial it claims to support approximately
260 languages, which is far in excess of TextCat's 69.

It is likely that even if you were interested in purchasing the Lextek
product, it would not be trivial to use, as it is a C library, which
requires some effort to invoke safely from a C# application.

During my research, I was initially excited by a tool called
"JTextCat", which is a Java tool that performs the same task that
TextCat does. The reason I found this interesting initially is that
Java and C# are very close syntactically, and a port of a Java
application to C# would not be a difficult task.

[JTextCat]
http://www.jedi.be/JTextCat/index.html

Unfortunately, JTextCat is not a pure implementation of the algorithm
as TextCat is; it is more appropriately described as an interface to a
library called "libtextcat", a Linux/FreeBSD/UNIX library written in
C. It is possible that libtextcat could be ported to Windows and used
from a C# application, but this would be subject to the same issues as
using Lextek's product above, in addition to the requirements of
porting the C code to the Windows platform, which would not be a
trivial task.

[libtextcat]
http://software.wise-guys.nl/libtextcat/

"Libtextcat is a library with functions that implement the
classification technique described in Cavnar & Trenkle, 'N-Gram-Based
Text Categorization'" [Libtextcat web site,
http://software.wise-guys.nl/libtextcat/].

Although a very powerful piece of software, as noted above, libtextcat
is primarily a UNIX package, only claiming to work on Linux, IRIX64
and FreeBSD.

Your second question:

"If not available, can we have it developed (is it technically possible)?"

The answer to this question is an unqualified yes. The main TextCat
Perl script is only 229 lines long, much of which is taken up by the
"usage" information. It should be possible to port the TextCat script
to a C# implementation relatively easily for any developer expert in
both Perl and C#. Essentially, the program takes the input text or
file and breaks it into "chunks" called N-Grams. See the paper by
Cavnar and Trenkle, cited above, for a detailed explanation of
exactly what an N-Gram is.
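
To give a rough idea of what that "chunking" involves, here is a minimal C# sketch of character N-gram extraction. It is my own illustration and not a translation of the TextCat source. The class name NGramSketch, the choice of separators (whitespace and digits) and the maxN parameter are assumptions, and it assumes a current C# compiler with generics, which is newer than the .NET Framework 1.1.

```csharp
using System;
using System.Collections.Generic;

static class NGramSketch
{
    // Returns every character N-gram (N = 1..maxN) of each word in the text.
    // Each word is padded with '_' on both sides so that the start and end
    // of a word are visible in the N-grams.
    public static List<string> Extract(string text, int maxN)
    {
        var result = new List<string>();

        // Assumption: words are separated by whitespace and digits.
        char[] separators = " \t\r\n0123456789".ToCharArray();

        foreach (string word in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            string padded = "_" + word + "_";
            for (int start = 0; start < padded.Length; start++)
            {
                // Take substrings of length 1..maxN that fit inside the word.
                for (int n = 1; n <= maxN && start + n <= padded.Length; n++)
                {
                    result.Add(padded.Substring(start, n));
                }
            }
        }
        return result;
    }
}
```

Calling NGramSketch.Extract("to be", 2) yields N-grams such as "_t", "to" and "o_" for the first word. A real port would then count how often each N-gram occurs rather than keep a flat list.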

In summary, the document is broken into N-grams, which are ranked by
frequency and checked against a set of "language models". In TextCat,
the models are text files containing frequency data for N-grams in
various languages, and have a file extension of ".lm". The language
whose model matches the document most closely is reported as the
document language.
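
The comparison step can be sketched as follows, using the "out-of-place" measure described in Cavnar and Trenkle's paper: for each N-gram in the document, take how far its rank differs from its rank in the language model, and add these up. This is a simplified illustration, not TextCat's code. The names OutOfPlaceSketch, Distance and Closest are hypothetical, and the penalty used for N-grams missing from a model (the model's length) is an assumption on my part.

```csharp
using System;
using System.Collections.Generic;

static class OutOfPlaceSketch
{
    // Both lists hold N-grams ordered from most to least frequent.
    // A smaller result means the document is closer to the language.
    public static int Distance(IList<string> documentProfile, IList<string> languageProfile)
    {
        // Map each N-gram in the language model to its rank.
        var rankInLanguage = new Dictionary<string, int>();
        for (int i = 0; i < languageProfile.Count; i++)
        {
            rankInLanguage[languageProfile[i]] = i;
        }

        // Assumption: an N-gram absent from the model costs the model's length.
        int missingPenalty = languageProfile.Count;

        int total = 0;
        for (int i = 0; i < documentProfile.Count; i++)
        {
            int rank;
            if (rankInLanguage.TryGetValue(documentProfile[i], out rank))
                total += Math.Abs(rank - i);
            else
                total += missingPenalty;
        }
        return total;
    }

    // Returns the name of the model with the smallest distance, or null if there are none.
    public static string Closest(IList<string> documentProfile,
                                 Dictionary<string, List<string>> models)
    {
        string best = null;
        int bestDistance = int.MaxValue;
        foreach (KeyValuePair<string, List<string>> model in models)
        {
            int d = Distance(documentProfile, model.Value);
            if (d < bestDistance)
            {
                bestDistance = d;
                best = model.Key;
            }
        }
        return best;
    }
}
```

Note that the "best" language here is the one with the lowest total, since the measure is a distance rather than a score.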

If you wanted to develop your own version of TextCat, I would
recommend porting the Perl script to a C# application as a straight
functional port, using the same structure and logic, but simply in C#
instead of Perl. You would then be able to take advantage of the
existing language models. You would need someone who is an expert in
both Perl and C# to do the port.
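
Reusing the existing language models from C# mostly comes down to reading the ".lm" files. The sketch below shows one way this might look. It assumes each line starts with an N-gram followed by whitespace and a count, with the most frequent N-gram first; please check this against the actual files in the LM directory before relying on it. The class name LanguageModelSketch is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class LanguageModelSketch
{
    // Reads a language model from any TextReader, for example a
    // StreamReader opened on an .lm file, and returns the N-grams
    // in file order.
    // Assumption: one entry per line, N-gram first, then whitespace
    // and a frequency count, most frequent entry first.
    public static List<string> Load(TextReader reader)
    {
        var ngrams = new List<string>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string[] fields = line.Split(new[] { ' ', '\t' },
                                         StringSplitOptions.RemoveEmptyEntries);
            if (fields.Length > 0)
            {
                ngrams.Add(fields[0]);   // keep the N-gram, ignore the count
            }
        }
        return ngrams;
    }
}
```

When opening the real files, you would also need to pick a text encoding that matches how the models were saved, since many of them contain accented characters.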

Your third question:

"if it could be developed, some tips and tricks how you see it done
(some references)."

The simplest solution is not to write your own version of TextCat, but
to simply "wrap" the existing version in a C# wrapper. To do this
would require the following:

* A Perl interpreter for Windows
* The TextCat Perl code and support files
* A C# "wrapper"

I was able to do this successfully using free/open-source software on
my machine, and produced a C# application that would return the
language of a document by calling the TextCat Perl script. This
technique has the advantage of being easy and fast to implement, with
the minor disadvantage of the requirement to install a Perl
interpreter on any system where the code will be executed. However,
this is a minor issue, and comes with the additional advantage of
being able to use other Perl scripts on the target machine.

I will detail how to get TextCat working in a C# wrapper, including
the required software and potential snags you might run into. In
addition, I will provide the C# code I wrote to demonstrate the
concept, as it is very short and quite simple to use.

Firstly, you'll need a Perl interpreter for Windows. The one I
recommend and use myself is ActiveState's ActivePerl, the most popular
and robust Windows Perl interpreter available. It can be found at the
following link.

[ActiveState's ActivePerl]
http://www.activestate.com/Products/ActivePerl/

The installer is approximately 12 MB in size, and comes in several
varieties; you'll have to choose the one most appropriate for your
target environment. I used the MSI installer, at 12.6 MB. Installing
is simple: just run the installer. There is no
configuration required. ActivePerl is free for both personal and
commercial use, and the source code is available from the ActiveState
web site noted above.

Next, you'll need the TextCat Perl scripts and support files.

[TextCat Sources]
http://odur.let.rug.nl/~vannoord/TextCat/text_cat.tgz

I had a problem opening this archive in WinZip, because for some
reason when I downloaded the archive using Firefox, it appended a .gz
extension to the end of the archive, and this confused WinZip. If you
already have the Perl scripts or don't use Firefox, this shouldn't be
a problem. If you find the file is named text_cat.tgz.gz after you
have downloaded it, simply remove the .gz extension and extract it with
WinZip or another similar archival utility.

Alternatively, you could use Cygwin, which is a Unix-like environment
for Windows, and allows you to run many Unix/Linux tools (including
Perl) on Windows from a Unix-like command line.

[Cygwin]
http://www.cygwin.com

Cygwin is NOT required to use TextCat or extract the files from the
archive on Windows. It is simply included here because I used it to
run TextCat to work out how it operated before trying to use it on
Windows.

After downloading the TextCat Perl scripts, extract the archive as
noted above into a directory of your choice. I used
"C:\apps\text_cat". You'll need to open the "text_cat" Perl script in
a text editor and change line 17 to something like this:

$opt_d ||= 'c:\apps\text_cat\LM';

Obviously, change the path "c:\apps\text_cat" to wherever you
installed TextCat. This tells TextCat where to find its language
models.

At this point, if you have installed ActivePerl as noted above, you
can test TextCat by opening a command prompt and entering the
following commands:

C:\>cd apps\text_cat

C:\Apps\text_cat>perl text_cat Copyright
english

This assumes, of course, that you have extracted text_cat into
c:\Apps. The command line "perl text_cat Copyright" tells TextCat to
evaluate the file called "Copyright" in the main TextCat directory. It
should respond with one word: "english", which is correct as the
TextCat copyright document is written in English. Feel free to use
other documents if you like to test out TextCat's functionality.

If you didn't take the step of modifying line 17, you can override it
by using the following command line:

perl text_cat -d c:\apps\text_cat\LM Copyright

The "-d c:\apps\text_cat\LM" option tells TextCat where to find the
language models instead of using the default in the script.

If everything works, you're ready to take the next step, which is to
construct a C# wrapper for TextCat.

Create a C# Console application and call it "TextCatWrapper" in Visual
Studio, then copy/paste the following code into the main module
window, replacing what Visual Studio creates for you as defaults:

--BEGIN SOURCE CODE--

using System;
using System.Diagnostics;

namespace TextCatWrapper
{
    /// <summary>
    /// Proof-of-concept wrapper that runs the TextCat Perl script
    /// and prints the language it reports.
    /// </summary>
    class Class1
    {
        /// <summary>
        /// The main entry point for the application.
        /// </summary>
        [STAThread]
        static void Main(string[] args)
        {
            ProcessStartInfo info;
            Process proc;
            string result;

            // Use this line if you did NOT modify the text_cat script to
            // point to your language models directory.
            //info = new ProcessStartInfo("perl", "text_cat -d c:\\apps\\text_cat\\LM Copyright");

            // Use this line if your text_cat script has been modified to
            // point to your language models directory.
            info = new ProcessStartInfo("perl", "text_cat Copyright");

            // Capture the script's standard output instead of displaying it.
            info.RedirectStandardOutput = true;
            info.UseShellExecute = false;   // must be false to redirect output
            info.WorkingDirectory = "C:\\apps\\text_cat";
            info.WindowStyle = ProcessWindowStyle.Hidden;

            proc = Process.Start(info);
            result = proc.StandardOutput.ReadToEnd();
            proc.WaitForExit();             // let perl exit before we finish
            Console.WriteLine("TextCat Produced: {0}", result);
        }
    }
}

--END SOURCE CODE--

Save and rebuild your application, then execute it from a DOS prompt.
If you run it from the IDE, it will start, finish and then close the
window before you can view it! The following is a sample of what it
should produce:

C:\Projects\TextCatWrapper\bin\Debug>textcatwrapper
TextCat Produced: english

The code above uses the System.Diagnostics.Process and
System.Diagnostics.ProcessStartInfo classes to execute the ActivePerl
Perl interpreter and pass the appropriate command line options to the
TextCat script. It uses the redirected output from the process to get
TextCat's output, then displays what TextCat returned.

As I noted above, this is the method I would recommend to use TextCat
from C#. It has the advantage of being able to be implemented in only
a few minutes and is based entirely on open-source or free software:
both TextCat and ActivePerl are available free of charge. The sample
code above is of course merely a proof of concept, and you would want
to implement a more robust wrapper class for it, probably as a class
library that you could then use in your application(s). For example,
all the options are hard coded in this file, and you would want more
flexibility in any code you were planning to use in a real
application.
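
As a rough idea of what a more flexible wrapper might look like, here is a minimal sketch of a reusable class that takes the interpreter, the TextCat directory and the language model directory as parameters instead of hard coding them. The class name TextCatRunner and its members are hypothetical, and it assumes a current C# compiler rather than the one that ships with Visual Studio .NET 2003. The only TextCat option it uses is the "-d" option discussed earlier.

```csharp
using System;
using System.Diagnostics;

// Hypothetical wrapper class; the names are illustrative and not part of TextCat.
public class TextCatRunner
{
    private readonly string perlExe;     // e.g. "perl" if it is on the PATH
    private readonly string textCatDir;  // directory containing the text_cat script
    private readonly string modelDir;    // directory containing the .lm files

    public TextCatRunner(string perlExe, string textCatDir, string modelDir)
    {
        this.perlExe = perlExe;
        this.textCatDir = textCatDir;
        this.modelDir = modelDir;
    }

    // Builds the argument string passed to the Perl interpreter.
    // Paths are quoted so that spaces in them survive; a path ending
    // in a backslash would need extra care.
    public string BuildArguments(string documentPath)
    {
        return "text_cat -d \"" + modelDir + "\" \"" + documentPath + "\"";
    }

    // Runs text_cat on one document and returns its trimmed output.
    public string Classify(string documentPath)
    {
        var info = new ProcessStartInfo(perlExe, BuildArguments(documentPath));
        info.RedirectStandardOutput = true;
        info.UseShellExecute = false;    // must be false to redirect output
        info.CreateNoWindow = true;      // do not flash a console window
        info.WorkingDirectory = textCatDir;

        using (Process proc = Process.Start(info))
        {
            string output = proc.StandardOutput.ReadToEnd();
            proc.WaitForExit();
            if (proc.ExitCode != 0)
            {
                throw new InvalidOperationException(
                    "perl exited with code " + proc.ExitCode);
            }
            return output.Trim();
        }
    }
}
```

With the directory layout used in this answer, a call would look like new TextCatRunner("perl", @"C:\apps\text_cat", @"C:\apps\text_cat\LM").Classify("Copyright"). Keeping the argument building in its own method makes it possible to check the command line without actually launching Perl.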

I was able to put this together in just a few minutes using the
software noted above. The single biggest time investment was
downloading and installing ActivePerl. Because of the ease with which
this can be done, this is the method I would recommend.

A few caveats to the above method should be noted. Using the Process
and ProcessStartInfo classes may require appropriate .NET Framework
security attributes to be set, due to the fact that invoking external
processes can be considered "unsafe". Also, the above example was
coded using Visual Studio .NET 2003 and the .NET Framework 1.1.

In the MSDN documentation, I would recommend reading about the following:

System.Diagnostics.Process class
System.Diagnostics.ProcessStartInfo class
System.Security Namespace
System.Security.Permissions.SecurityPermission Class

The MSDN documentation is available on the MSDN website at Microsoft.

[MSDN Online]
http://msdn.microsoft.com

If your requirements don't allow you to use ActivePerl, or you require
a "native" C# implementation, a port of the TextCat Perl code is going
to be necessary. The developer(s) that do this would need to be
familiar with both Perl and C#. In particular, the TextCat script uses
hashes extensively as well as a significant amount of string
manipulation. Because Perl is essentially a language intended for
processing and manipulating strings, it is exceedingly efficient at
it. A C# implementation of TextCat would probably be somewhat larger,
due to the need for more code to properly handle the string
manipulation. However, C# and the .NET Framework come with their own
robust string manipulation tools.
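
To illustrate how the hash-based counting might translate, here is a small sketch that uses a Dictionary in place of a Perl hash to count N-grams and then ranks them by frequency. It is an illustration only and not a port of the TextCat code. The names ProfileSketch and Rank, the maxEntries cutoff and the alphabetical tie-break are assumptions, and the LINQ calls require a .NET version considerably newer than 1.1.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ProfileSketch
{
    // Counts how often each N-gram occurs and returns the N-grams
    // ordered from most to least frequent, keeping at most maxEntries.
    public static List<string> Rank(IEnumerable<string> ngrams, int maxEntries)
    {
        // The Dictionary plays the role of the Perl hash.
        var counts = new Dictionary<string, int>();
        foreach (string gram in ngrams)
        {
            int current;
            counts.TryGetValue(gram, out current);  // current is 0 if absent
            counts[gram] = current + 1;
        }

        return counts
            .OrderByDescending(pair => pair.Value)
            // Assumption: ties are broken by ordinal string order.
            .ThenBy(pair => pair.Key, StringComparer.Ordinal)
            .Take(maxEntries)
            .Select(pair => pair.Key)
            .ToList();
    }
}
```

The resulting list has the same shape as a language model read from disk (N-grams in rank order), which is what makes the two directly comparable.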

If you intend to do a native C# implementation, I would work with a
developer who is an expert in Perl to provide you with a plan for
doing the port and identify any potential stumbling blocks. Unless I
am sorely mistaken, the porting process should be a fairly
straightforward one for any developer who is an expert in both Perl
and C#.

One caveat to be considered for any port is that TextCat is released
under the GNU General Public License and the Perl Artistic License.
That means you can modify it or redistribute it, but only under the
terms of those licenses, so a port of TextCat to C# would have to be
open-source as well. I am not certain how a "wrapper" as shown above
would fall under the terms of the GNU License.

[TextCat Copyright Information]
http://odur.let.rug.nl/~vannoord/TextCat/Copyright

If you're not familiar with the GNU General Public License, you can
find information about it at the Open Source Initiative web site.

[Open Source Initiative]
http://www.opensource.org

TextCat is also alternatively licensed under the Perl Artistic
License, which may provide more flexibility.

[Perl Artistic License]
http://www.perl.com/language/misc/Artistic.html

In summary, your original three questions:

If a .NET version would be available, where/how we can get it license free?

No .NET version is available that I could locate. However, several
other alternatives exist.

If not available, can we have it developed (is it technically possible)?

Yes. A port from Perl to C# is possible. It would require the services
of a developer expert in both Perl and C#. A port would likely be
subject to the terms of the GNU General Public License or the Perl
Artistic License and might have to be open source as well.

If it could be developed, some tips and tricks how you see it done?
(some references).

My preferred approach, as detailed above, is not to rewrite anything
but simply to wrap the existing script in a C# wrapper that calls an
external Perl processor to invoke the TextCat script.

If you cannot use the "wrapper" approach, a straight port is the only
alternative. As noted above, a developer familiar with both Perl and
C# would be required to perform the port. Any port would be subject to
the terms of either the GNU General Public License or the Perl
Artistic License, according to the Copyright information for TextCat.

I hope that this has answered your question to your satisfaction. If
you require any clarification on the information I have provided,
please do not hesitate to ask. Once again, I appreciate the
opportunity to answer this question for you, and certainly enjoyed
researching it for you.

Best regards,

cyclometh-ga

Clarification of Answer by cyclometh-ga on 21 Feb 2005 01:52 PST
Thank you for the rating and the feedback, and thanks for using Google Answers!

Regards,

cyclometh-ga

kvl-ga rated this answer: 5 out of 5 stars

Thanks a lot!

Comments

There are no comments at this time.
