Q: I need to filter Outlook (2002) emails based on language (english and Italian) ( No Answer,   2 Comments )
Question  
Subject: I need to filter Outlook (2002) emails based on language (english and Italian)
Category: Computers > Programming
Asked by: diegosala-ga
List Price: $25.00
Posted: 31 May 2003 11:00 PDT
Expires: 30 Jun 2003 11:00 PDT
Question ID: 211184
Hello!

This is my case:

I'm an Italian computer technician, very experienced with the
Windows environment but inexperienced in programming (and in the
English language :)).

Italian law now allows Italian spammers to be prosecuted and about
$100 to be obtained from them through a court ruling. Any non-personal
email sent without express prior authorization counts as illicit.

I use Outlook XP (2002) as my email client.
I receive about 3000 spam emails every month and I redirect them
(some manually, most automatically through the SpamNet filter) into
an Outlook folder named "spam".
About 99% of this spam is in English and the remaining 1% is mostly
in Italian.

I want to prosecute all the Italian spammers, and because all of them
send me their spam in Italian, I first need to filter the "spam"
folder and move the emails written in Italian into another folder
(named, for example, "italian spam").

I will pay the person who can give me a ready-made, final (and
free) automated solution for filtering Outlook (2002) emails based on
language (Italian).
The solution must have the "identification quality" of the programs
listed below, which are based on full dictionaries.
I must be able to implement the solution in less than 15 minutes of
manual work. I don't want to, and am not able to, program in Perl,
ASP, VBScript, etc. If you want to write a Perl program that
interfaces with one of the sites shown below, I have enough expertise
to get it running on a web server without too much explanation.

Here's a good list of many small language identification tools:
http://odur.let.rug.nl/~vannoord/TextCat/competitors.html
These programs identify the language of a text essentially by using
dictionaries.
I currently don't know of a practical way to interface them with
Outlook, but some of them offer a good API for the job. Also, Outlook
can natively export emails in text (or Excel) format.
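As a rough illustration of the dictionary idea described above, here is a minimal Python sketch. It is not taken from any of the listed tools: the word lists are tiny hand-picked samples (an assumption for illustration only), and `guess_language` is a hypothetical helper name. A real identifier would use full dictionaries or character n-grams.

```python
import re

# Tiny illustrative word lists (assumed for this sketch); a real tool
# would load full dictionaries for each language.
WORDLISTS = {
    "italian": {"il", "la", "di", "che", "per", "non", "una", "sono",
                "con", "questo", "anche", "più", "gli", "della"},
    "english": {"the", "of", "and", "to", "for", "is", "that", "with",
                "this", "you", "not", "are", "your"},
    "spanish": {"el", "los", "las", "que", "para", "una", "con", "por",
                "es", "del", "más", "este", "y"},
}


def guess_language(text):
    """Return the language whose word list matches the most words."""
    # Split into lowercase runs of letters (accented letters included).
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(1 for w in words if w in vocab)
        for lang, vocab in WORDLISTS.items()
    }
    best = max(scores, key=scores.get)
    # If no known word was seen at all, do not guess.
    return best if scores[best] > 0 else "unknown"
```

For example, `guess_language("Questo è il prezzo della offerta che non puoi perdere")` counts five Italian list words and none from the other lists, so it returns `"italian"`.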


Thanks

Request for Question Clarification by poormattie-ga on 01 Jun 2003 10:55 PDT
Are you using Outlook XP with an MS Exchange backend for your email?
If not, are you using IMAP or POP3?

Clarification of Question by diegosala-ga on 01 Jun 2003 12:02 PDT
In answer to poormattie-ga:

> Are you using Outlook XP with an MS Exchange
> backend for your email?
> If not, are you using IMAP or POP3?

Thanks for the question.
The answer is: no. I'm using Outlook XP and it downloads my emails
directly from some POP3 accounts. The "spam" folder is an ordinary
folder in a local PST file.
I need to filter the OLD spam in the "spam" folder now (about 60000
emails) and then regularly, let's say once a month, the new spam that
I receive in the future.
I don't want a procedure that gets the emails directly from the POP3
server before Outlook does, or that bypasses Outlook.
It would be good instead to have, for example, something that
interfaces with Outlook's rules, considering that Outlook's rules can
also act on old messages (by using the "Apply" button instead of the
"Ok" button). I know, for example, that Outlook's rules can call a
VBScript, but I'm not able to program in VBScript.

Clarification of Question by diegosala-ga on 01 Jun 2003 15:18 PDT
Thanks poormattie and endo for your comments. I appreciate the effort.
However, the reasons I cannot accept the POPFile (+ Outclass) solution
are essentially these:
- I require "a ready final automated solution" and "the solution must
be implementable by me in less than (about) 15 minutes of manual
work";
- Filtering the 60000 old messages in my "spam" folder (that is, the
last 2 years of spam) is the main reason I need the automated
procedure, because doing this filtering manually takes me an entire
day. I don't know how to serve these 60000 old messages over POP3;
- A trained POPFile could be good, but I don't think this solution
"has the identification quality of the language identification
programs based on dictionaries" after 10 minutes of training (let's
say giving it 10 Italian + 1000 English messages from my spam
archive).

Plus, leaving a program installed and running *all the time* on my PC
between Outlook and the POP3 server, when I need to run the filtering
procedure just once a month, is very onerous.
_______________

I have two tips for you and others that are working on my problem:

In VBA, the method for saving an Outlook mail to a text file is
"SaveAs(Path, Type)", e.g. MailItem.SaveAs "C:\Temp\mail.txt", olTXT
(the path must include the file name).

I found a program, InboxRULES with the IRSasave module, that adds a
custom rule to Outlook for automatically exporting mails to text
files. It's highly customizable (it can also execute a program, like
the TextCat language identifier, with the text file's name as a
parameter). Being an Outlook rule, it can obviously act on old
archived Outlook messages too. The only "problem" is that it costs too
much.
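Once messages have been exported to text files (by whatever means), the sorting step on disk could look roughly like the Python sketch below. Everything here is an assumption for illustration: the folder layout, the `.txt` extension, the tiny word lists, and the helper names `looks_italian` and `sort_exported_mail`. It only moves files on disk and does not put anything back into Outlook.

```python
import re
import shutil
from pathlib import Path

# Tiny illustrative word lists (assumed); a real tool would use full
# dictionaries or call an external language identifier instead.
ITALIAN_WORDS = {"il", "la", "di", "che", "per", "non", "una", "sono",
                 "con", "questo", "anche", "gli", "della"}
ENGLISH_WORDS = {"the", "of", "and", "to", "for", "is", "that", "with",
                 "this", "you", "not", "are", "your"}


def looks_italian(text):
    """True if the text contains more Italian than English list words."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    italian_hits = sum(w in ITALIAN_WORDS for w in words)
    english_hits = sum(w in ENGLISH_WORDS for w in words)
    return italian_hits > english_hits


def sort_exported_mail(export_dir, italian_dir):
    """Move Italian-looking .txt files from export_dir to italian_dir.

    Returns the names of the files that were moved.
    """
    export_dir, italian_dir = Path(export_dir), Path(italian_dir)
    italian_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(export_dir.glob("*.txt")):
        # Undecodable bytes are replaced rather than raising an error,
        # since the encoding of the exported files is not known here.
        text = path.read_text(encoding="utf-8", errors="replace")
        if looks_italian(text):
            shutil.move(str(path), str(italian_dir / path.name))
            moved.append(path.name)
    return moved
```

A call such as `sort_exported_mail("C:/Temp/spam", "C:/Temp/italian spam")` (both paths hypothetical) would leave the non-Italian files where they are and return the list of moved file names.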

Request for Question Clarification by endo-ga on 01 Jun 2003 18:12 PDT
Hi,
Outclass will actually be able to go through the emails you already
have saved to disk in Outlook.

>Plus, leaving a program installed and running *all the time* on my PC
>between Outlook and the POP3 server, when I need to run the filtering
>procedure just once a month, is very onerous.

Well, this would replace spamcatcher, which you are already running,
do the same job at least as well, and do the extra language sorting
too. Although you might consider it a waste of time to run popfile
all the time, it would do the sorting a bit at a time. I cannot
imagine how long it would take to save every email from Outlook to
disk, then try and sort them, for them to somehow be reorganised
within Outlook again. Sorry to insist, but since popfile is a ready
implemented solution, which has been tried and tested by a lot of
people with a lot of success, it still seems to be the best
solution.

Thanks.
endo

Clarification of Question by diegosala-ga on 03 Jun 2003 13:24 PDT
Hi!

> Outclass will actually be able to go through the emails
> you already have saved to disk in Outlook.

Thanks for the information. I didn't know this. You convinced me to
give it a try.
I installed POPFile and Outclass.
I spent 4 hours (not 15 minutes) manually sorting, without errors, a
set of about 7000 Italian messages (not only spam) and about 4300
English messages, that I used to train the program.
When it was time to check the results, I found that Outclass has a
big bug: when I select more than 500-2000 messages and press
"Classify Now", it does nothing. That is, I have to select less than
a month of messages at a time for it to work. It doesn't accept more
than 500-2000 messages at a time.
Anyway, I tested it on two months of messages and it made many
errors. For example, it classified Spanish messages as Italian.
Spanish is similar to Italian, but the point is that I think a
dictionary-based language identifier would be able to tell them
apart: 9 words out of 10 differ between Italian and Spanish.
 
> I cannot imagine how long it would take to save every email
> from Outlook to disk, then try and sort them, for them to
> somehow be reorganised within Outlook again.

Yes. I think you are right. I think that a solution that is completely
integrated in Outlook would be the best.

> popfile is a ready implemented solution

My idea of a "ready solution", even interpreted generously (I accept
15 minutes of manual setup), is not compatible with hours of
training...

Thanks anyway for your availability.

Request for Question Clarification by endo-ga on 03 Jun 2003 14:22 PDT
Hi,

I'm sorry you were disappointed by popfile. Popfile and outclass are
still under intense development and I'm sure the bug you mention will
be fixed in a future release.

When you say "many errors", can you express this as a percentage?

Also, you mention that for training you used a:
>set of about 7000 Italian messages (not only spam) and about 4300
>English messages, that I used to train the program.

But you hadn't trained it on any Spanish messages? And it's important
to separate the Italian messages that are spam from the genuine
Italian messages, or you won't get the desired results. I'm sorry
you've spent so much time on this, but usually with popfile you'd
train it daily on the emails you receive that day. It would then
improve day after day, and only once you see that it has reached a
level of separation you can accept would I run it on the emails you
have stored.

On one hand I don't want to waste your time, but I am still convinced
that popfile is your best available solution. Maybe you were expecting
too much of it from the first day? Just keep on training it day after
day and I'm sure that after a while it will reach an accuracy that
you will find acceptable.

Let me further support my claim that this will probably be your best
available solution: about 3 weeks ago, I was at a University 3rd year
project session. These are projects presented by students in their 3rd
year, who have spent half of their time during that year on them.
One of the projects was text classification based on text. My point
is that I don't believe there is a simple implementation, and that if
there was, it would be worth a lot more than the price you have set.

Sorry if you think this is not satisfactory, but I think it would work
out in the end. It just needs a couple of days of training, not
necessarily intensive but on a day-to-day basis.
Thanks.
endo

Clarification of Question by diegosala-ga on 03 Jun 2003 17:44 PDT
> Hi,
>
> I'm sorry you were disappointed by popfile. Popfile and outclass are
> still under intense development and I'm sure the bug you mention
> will be fixed in a future release.

I hope so. I just wrote an email to the author of Outclass to inform
him of the bug.

> When you say "many errors", can you express this as a percentage?

In the "italian" destination folder, out of 50 messages I see about
30 English, 10 Spanish and 10 Italian messages.
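Expressed as a percentage, the counts reported above work out as follows. This is plain arithmetic on the three numbers given in the post; the variable names are only illustrative.

```python
# Messages found in the "italian" destination folder, by actual language.
counts = {"english": 30, "spanish": 10, "italian": 10}

total = sum(counts.values())           # 50 messages in the folder
precision = counts["italian"] / total  # share that really is Italian
error_rate = 1 - precision             # share that was misfiled

print(f"correct: {precision:.0%}, wrong: {error_rate:.0%}")
```

So about 20% of the messages filed as Italian were actually Italian, and about 80% were misfiled.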
 
I understand that I can improve the program's identification quality
by training and correcting it (by reclassifying), but my main need was
to save myself 10 hours of manual work: I wouldn't want to end up
spending the same amount of time solving the problem with a
semi-automatic solution...

I think, anyway, that POPFile is a very good program for generic (and
spam) use. However, I need to experiment with it more for my specific
purpose.
If the author replies with a fix for the bug, I'll give the program a
second try.
In the meantime, I'm leaving the question open for a dictionary-based
solution.

Thanks again for your contribution, endo.
Answer  
There is no answer at this time.

Comments  
Subject: Re: I need to filter Outlook (2002) emails based on language (english and Italian)
From: poormattie-ga on 01 Jun 2003 13:44 PDT
 
Well, considering you clarified that you do not want a utility that
accesses the POP3 server directly, then the option I describe is not
appropriate. Yet, I hope that this might end up being an unnecessary
restriction, and, thus, am posting this comment. If this can qualify
as an acceptable alternative for you, please give me the opportunity
to post this as an answer.

You should definitely look into the tool POPFile. You can obtain and
learn more about it here:
http://popfile.sourceforge.net

You can see how to configure it with Outlook here:
http://popfile.sourceforge.net/manual/email.html

And their excellent FAQ can be found here:
https://sourceforge.net/docman/display_doc.php?docid=14421&group_id=63137

POPFile is a very powerful little utility that I use on a day-to-day
basis. It's an email classifier that learns (extraordinarily well)
from past history. Basically, it works as a POP3 proxy. Whenever
Outlook (or other email client) goes to retrieve email from your POP3
server, it asks POPFile first. POPFile retrieves the email from the
server and classifies it into categories based on what it has learned.
It can then place that category name into the subject of the email or
into a line in the mail's headers.

This turns out to be very powerful, and quite easy to use with email
rule systems. For instance, at home I use Outlook Express. POPFile
places the category name (I'll touch more on categories later) in the
subject line of the email. I then use my Outlook Express email rules
as normal to redirect to different folders, though I have added a
number of new rules that take advantage of POPFile's classification
subject-prefix. POPFile has been working so well for me that I've even
taken to using it as the primary basis for many rules that Outlook
Express just can't express (e.g. mailing lists without their own
subject headers, automatic online purchase receipts, etc.).

You may have heard of POPFile as primarily a spam-fighting tool. Well,
while it's awfully good at that, it is truly just an email
classification tool. You can create any categories you wish without
any need to deal with spam in any way. Its ability to classify into
dynamic categories is why this tool can come in so handy for your
problem.

For instance, you could create 3 categories-- GoodEmail,
NonItalianSpam, ItalianSpam. You could then train it using email you
already have (e.g. some of your 60000 via their "insert.pl" tool), or
allow it to build over time with new email (which is recommended). No
matter what, POPFile doesn't know anything about those categories in
the beginning, so you're going to need to teach it at first. Once it
gets the general idea, though, it does an excellent job classifying
emails.

I'd recommend teaching POPFile with new emails for a while until it's
accurate enough for you, and then try classifying your older email
(perhaps as text files via their 'bayes.pl' command-line tool).

POPFile works by assigning each word a probability of how likely it
belongs in a certain category. While you train it, POPFile builds up
dictionaries containing the odds of these words and their likely
categories. Whenever an email comes in, POPFile will combine the
probabilities of all of the words in the email to determine the
most likely category.
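The word-probability idea in the paragraph above can be sketched as a small naive Bayes classifier in Python. This is not POPFile's actual code: the class name `WordBayes`, the tokenizer, and the add-one smoothing are assumptions made for illustration. Note that the sketch sums log-probabilities rather than raw probabilities, which is the usual way to combine per-word evidence.

```python
import math
import re
from collections import Counter, defaultdict


class WordBayes:
    """Toy per-word naive Bayes classifier (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> messages seen

    @staticmethod
    def tokenize(text):
        # Lowercase runs of letters, accented letters included.
        return re.findall(r"[^\W\d_]+", text.lower())

    def train(self, category, text):
        self.doc_counts[category] += 1
        self.word_counts[category].update(self.tokenize(text))

    def classify(self, text):
        vocab = {w for counts in self.word_counts.values() for w in counts}
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for category, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Start from the prior: how common the category is overall.
            score = math.log(self.doc_counts[category] / total_docs)
            for word in self.tokenize(text):
                # Add-one smoothing so unseen words do not zero the score.
                score += math.log(
                    (counts[word] + 1) / (total_words + len(vocab))
                )
            if score > best_score:
                best, best_score = category, score
        return best
```

After training one category on an Italian message and another on an English one, a new Italian message scores higher in the Italian category because its words were seen there. A classifier like this can only choose among the categories it was trained on, so a message in an untrained language still gets placed in whichever trained category its words fit best.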

Because you're specifically interested in different languages, it's
quite likely that POPFile will work extraordinarily well for you. This
is because it'll be quite rare that the same words are used in normal
(English) spam and in Italian spam. POPFile is very unlikely to get
the two categories confused even with only a small bit of training.

There are lots of different ways to categorize your email based on
your current needs. If the detection of Italian spam is all you care
about, though, then those three categories (Good, NonItalianSpam, and
ItalianSpam) are what I'd recommend. The jury is still out on how much
difference you'll see when you increase the number of categories you
want for other uses. I, for one, use 7-8 different categories for my
everyday mail, and POPFile does an almost supernatural job of
classifying things correctly.

You mentioned that you already use a tool (i.e. SpamNet) for spam, and
I suspect you can find a post in the POPFile forums of people who have
(or have not) been successful using SpamNet alongside POPFile. I do
know from my experience with the project that POPFile works best if it
sees the email before other spam filters have injected their bias into
the situation. That's why it works great as an initial POP3 proxy,
and, in some cases, two email filters can even enhance one another.
Regardless, I'd still recommend POPFile as the only email filter,
simply because it's so accurate-- 99.21% correct classification for me
in the last 6 months. (Better than if I paid some other human to try
to help me classify it.)

This is the tool that I believe would require the least work for you
and still produce excellent results. If I were developing a solution
similar to what you need to detect Italian spam, POPFile is precisely
the approach I would take.

As I mentioned above, if this could possibly work as an answer to your
question, I can add additional information (if you need) and submit it
as an answer.

Thanks for your question, and best of luck with your attempts to solve
your problem!
Subject: Re: I need to filter Outlook (2002) emails based on language (english and Italian)
From: endo-ga on 01 Jun 2003 14:27 PDT
 
I was going to propose popfile as well but was too late :)
Anyways you can use Outclass with Outlook and popfile to do what you
want. Outclass can be found here:
http://www.vargonsoft.com/Outclass/default.aspx
