Since you clarified that you do not want a utility that accesses the
POP3 server directly, the option I describe may not be appropriate.
Still, I hope that restriction turns out to be unnecessary, which is
why I am posting this comment. If this can qualify as an acceptable
alternative for you, please give me the opportunity to post it as an
answer.
You should definitely look into the tool POPFile. You can obtain and
learn more about it here:
http://popfile.sourceforge.net
You can see how to configure it with Outlook here:
http://popfile.sourceforge.net/manual/email.html
And their excellent FAQ can be found here:
https://sourceforge.net/docman/display_doc.php?docid=14421&group_id=63137
POPFile is a very powerful little utility that I use on a day-to-day
basis. It's an email classifier that learns (extraordinarily well)
from past history. Basically, it works as a POP3 proxy. Whenever
Outlook (or another email client) goes to retrieve email from your POP3
server, it asks POPFile first. POPFile retrieves the email from the
server and classifies it into categories based on what it has learned.
It can then place that category name into the subject of the email or
into a line in the mail's headers.
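
To make that tagging step concrete, here is a rough Python sketch of
what a classifying proxy does to each message before handing it to the
mail client. This is not POPFile's code (POPFile is written in Perl).
The function name tag_message, the "[category]" subject format, and the
X-Text-Classification header name are assumptions for illustration, so
check the POPFile manual for what it actually emits.

```python
from email import message_from_string


def tag_message(raw_message: str, category: str) -> str:
    """Return the message with the category added to subject and headers.

    Hypothetical helper: the "[category]" prefix and the header name are
    illustrative assumptions, not taken from POPFile's source.
    """
    msg = message_from_string(raw_message)

    # Rewrite the subject so ordinary mail rules can match on the prefix.
    subject = msg.get("Subject", "")
    del msg["Subject"]  # no error if the header is absent
    msg["Subject"] = f"[{category}] {subject}"

    # Also record the category in a dedicated header line.
    msg["X-Text-Classification"] = category
    return msg.as_string()


if __name__ == "__main__":
    raw = "From: a@example.com\nSubject: Offerta speciale\n\nCiao\n"
    print(tag_message(raw, "ItalianSpam"))
```

Either tag is enough for a mail client's rules to act on; the subject
prefix is the one that works even in clients that cannot filter on
arbitrary headers.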
This turns out to be very powerful, and quite easy to use with email
rule systems. For instance, at home I use Outlook Express. POPFile
places the category name (I'll touch more on categories later) in the
subject line of the email. I then use my Outlook Express email rules
as normal to redirect to different folders, though I have added a
number of new rules that take advantage of POPFile's classification
subject-prefix. POPFile has been working so well for me that I've even
taken to using it as the primary basis for many rules that Outlook
Express just can't express (e.g. mailing lists without their own
subject headers, automatic online purchase receipts, etc.).
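
The kind of rule I mean boils down to "if the subject starts with this
category tag, move the message to that folder". A minimal Python sketch
of that logic follows. The category names, folder names, and the
"[category]" prefix format are made-up examples, and in practice the
rule lives in your mail client rather than in a script.

```python
import re

# Matches a leading "[Category]" tag plus any whitespace after it.
PREFIX = re.compile(r"^\[([^\]]+)\]\s*")

# Hypothetical mapping from classifier category to mail folder.
FOLDERS = {
    "MailingList": "Lists",
    "Receipts": "Purchases",
}


def route(subject: str, default: str = "Inbox") -> tuple[str, str]:
    """Return (folder, subject without the tag) for a tagged subject."""
    match = PREFIX.match(subject)
    if not match:
        return default, subject
    category = match.group(1)
    # Unknown categories fall back to the default folder.
    return FOLDERS.get(category, default), subject[match.end():]


if __name__ == "__main__":
    print(route("[Receipts] Your order has shipped"))
    print(route("Lunch on Friday?"))
```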
You may have heard of POPFile as primarily a spam-fighting tool. Well,
while it's awfully good at that, it is truly just an email
classification tool. You can create any categories you wish without
any need to deal with spam in any way. Its ability to classify into
dynamic categories is why this tool can come in so handy for your
problem.
For instance, you could create three categories: GoodEmail,
NonItalianSpam, and ItalianSpam. You could then train it using email you
already have (e.g. some of your 60000 via their "insert.pl" tool), or
allow it to build over time with new email (which is recommended). No
matter what, POPFile doesn't know anything about those categories in
the beginning, so you're going to need to teach it at first. Once it
gets the general idea, though, it does an excellent job classifying
emails.
I'd recommend teaching POPFile with new emails for a while until it's
accurate enough for you, and then try classifying your older email
(perhaps as text files via their 'bayes.pl' command-line tool).
POPFile works by assigning each word a probability of belonging to
each category. As you train it, POPFile builds up dictionaries
containing the odds of each word for each category. Whenever an email
comes in, POPFile combines the probabilities of all of the words in
the email and picks the category that comes out as most likely.
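
If it helps to see the idea in code, below is a small Python sketch of
a word-probability (naive Bayes) classifier of the general kind
described above. It is only an illustration under my own assumptions,
not POPFile's implementation: the class name TinyBayes, the whitespace
tokenizer, and the add-one smoothing are all choices I made to keep it
short. Per-word probabilities are combined by summing their logarithms,
which is the same as multiplying them but avoids underflow.

```python
import math
from collections import Counter, defaultdict


class TinyBayes:
    """Toy word-probability classifier (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.message_counts = Counter()          # category -> messages seen

    def train(self, category, text):
        """Record the words of one message under the given category."""
        self.word_counts[category].update(text.lower().split())
        self.message_counts[category] += 1

    def classify(self, text):
        """Return the most likely category, or None if untrained."""
        vocab = {w for counts in self.word_counts.values() for w in counts}
        total_messages = sum(self.message_counts.values())
        best, best_score = None, -math.inf
        for category, counts in self.word_counts.items():
            # Start from how common the category is overall.
            score = math.log(self.message_counts[category] / total_messages)
            total_words = sum(counts.values())
            for word in text.lower().split():
                # Add-one smoothing so unseen words do not zero the score.
                p = (counts[word] + 1) / (total_words + len(vocab))
                score += math.log(p)
            if score > best_score:
                best, best_score = category, score
        return best


if __name__ == "__main__":
    nb = TinyBayes()
    nb.train("GoodEmail", "meeting tomorrow at noon about the project")
    nb.train("NonItalianSpam", "buy cheap pills now free offer")
    nb.train("ItalianSpam", "offerta gratis compra subito pillole")
    print(nb.classify("offerta gratis per te"))
```

Notice that the English and Italian training messages share no words
at all, so even this toy version separates them after one example
each.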
Because you're specifically interested in different languages, it's
quite likely that POPFile will work extraordinarily well for you. This
is because it'll be quite rare that the same words are used in normal
(English) spam and in Italian spam. POPFile is very unlikely to get
the two categories confused even with only a small bit of training.
There are lots of different ways to categorize your email based on
your current needs. If the detection of Italian spam is all you care
about, though, then those three categories (Good, NonItalianSpam, and
ItalianSpam) are what I'd recommend. The jury is still out on how much
difference you'll see when you increase the number of categories you
want for other uses. I, for one, use 7-8 different categories for my
everyday mail, and POPFile does an almost supernatural job of
classifying things correctly.
You mentioned that you already use a tool (i.e. SpamNet) for spam, and
I suspect you can find posts in the POPFile forums from people who have
(or have not) been successful using SpamNet alongside POPFile. I do
know from my experience with the project that POPFile works best if it
sees the email before other spam filters have injected their bias into
the situation. That's why it works great as an initial POP3 proxy,
and, in some cases, two email filters can even enhance one another.
Regardless, I'd still recommend POPFile as the only email filter,
simply because it's so accurate-- 99.21% correct classification for me
in the last 6 months. (Better than if I paid some other human to try
to help me classify it.)
This is the tool that I believe would require the least work for you
and still produce excellent results. If I were developing a solution
similar to what you need to detect Italian spam, POPFile is precisely
the approach I would take.
As I mentioned above, if this could possibly work as an answer to your
question, I can add additional information (if you need) and submit it
as an answer.
Thanks for your question, and best of luck with your attempts to solve
your problem!