Google Answers: Classification of Emails with Neural Networks


Question

Subject: Classification of Emails with Neural Networks
Category: Computers > Programming
Asked by: spinalwiz-ga
List Price: $7.00

Posted: 04 Nov 2002 04:06 PST
Expires: 04 Dec 2002 04:06 PST
Question ID: 98084

I was wondering if you could give me a bit of advice. I'm not looking
for technical detail, just a few pointers to point me in the right
direction.

I was thinking of developing a neural network for classification of
emails. Basically, the user creates a number of different folders for
emails to be classified in. The user then sorts the email data to
be used for training the network into the appropriate folders. From
these emails, keywords are extracted (all words in the body except
standard word types, prepositions etc.) and entered into a hash table
for each folder. The hash table contains all words extracted from the
folder and a number indicating the number of times it occurs in that
folder.
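
A minimal sketch of the per-folder hash table described above, assuming Python's `collections.Counter` stands in for the hash table. The stop-word list, the function names and the sample emails are hypothetical and only there to make the example run.

```python
from collections import Counter
import re

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "in", "on", "at", "to", "of", "and", "is", "for"}

def extract_keywords(body):
    """Lowercase the body, split it into words, and drop stop words."""
    words = re.findall(r"[a-z']+", body.lower())
    return [w for w in words if w not in STOP_WORDS]

def build_folder_counts(training_emails):
    """Map each folder name to a Counter of word -> occurrences in that folder."""
    counts = {}
    for folder, body in training_emails:
        counts.setdefault(folder, Counter()).update(extract_keywords(body))
    return counts

# Made-up training data: (folder, email body) pairs sorted by the user.
training_emails = [
    ("WORK", "The meeting is on Monday"),
    ("FRIENDS", "Holiday in Spain and a holiday party"),
]
folder_counts = build_folder_counts(training_emails)
print(folder_counts["FRIENDS"]["holiday"])  # 2
```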

Then, when an email is used as training data for the net, the input
vector is formed for each word from the number of times it occurs in
each folder. E.g., if you had folders: WORK, FRIENDS, INTERNET, GAMES
and the word to be trained was "holiday", which occurs 20 times in WORK,
90 times in FRIENDS, 30 times in INTERNET and not at all in GAMES,
then the input vector might be (0.2, 0.9, 0.3, 0).
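
As a rough illustration of that input vector, here is a sketch that reproduces the "holiday" numbers. The question does not say how counts are scaled, so the fixed divisor of 100 is an assumption chosen only because it matches the example; `word_vector` is a hypothetical name.

```python
FOLDERS = ["WORK", "FRIENDS", "INTERNET", "GAMES"]

# Counts for "holiday" taken from the example in the question.
folder_counts = {
    "WORK": {"holiday": 20},
    "FRIENDS": {"holiday": 90},
    "INTERNET": {"holiday": 30},
    "GAMES": {},
}

def word_vector(word, folder_counts, folders=FOLDERS, scale=100.0):
    """Return one scaled count per folder; `scale` is an assumed constant."""
    return [folder_counts[f].get(word, 0) / scale for f in folders]

print(word_vector("holiday", folder_counts))  # [0.2, 0.9, 0.3, 0.0]
```

A fixed divisor would push values above 1 for very common words, so dividing by the largest count or by the folder's total word count might be a safer choice in practice.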

The output from the net would be a number representing which category
it fell into (and possibly the strength of the match in this
category?). This value would then be fed into another net that takes
values from all of the words in the email as inputs.

I was thinking of using a backpropagation network to implement it. I
am not concerned with the programming of the network in particular, so
I was going to use one of the java packages available on the internet
to implement it. I have a moderate amount of experience with java, and
none at all with neural networks. Also, I thought Perl might be the
most suitable language to use, just using Java to implement the neural
network (I couldn't find any suitable neural network modules for
Perl).

Does this sound feasible to you and can you give me any pointers?

I already have texts available covering neural networks and Perl.

Thanks

Answer

Subject: Re: Classification of Emails with Neural Networks
Answered By: josh_g-ga on 06 Nov 2002 08:02 PST
Rated: 4 out of 5 stars

Hello spinalwiz,

The idea you've described sounds interesting.  However, I suspect that
your choice of a neural network may not be the most directly practical
choice of algorithm.  Other approaches are known to better handle
textual input, and will be more flexible in terms of selecting word
importance.

The nature of neural nets is that they are very good at taking
numerical input and creating some nonlinear function to produce
reasonably correct numerical output.  When the output required is a
simple on/off, the numerical representation is just fine.  When you
want output with a range of possible answers, and these answers have
no sort of numerical relation, this representation may make things
difficult.

How will you quantify the error of output during training?  It is easy
to map the folders to integer values, e.g. WORK=0, FRIENDS=1, etc.
What may make this problematic is that the ordering of the folders is
arbitrary, and yet backpropagation of error will consider some folders
to be "closer" to each other.  For example, misclassifying a work
email as FRIENDS will be considered less erroneous than labelling it
GAMES.  While the neural net may still converge to a working solution,
the nature of this approach doesn't really fit well with the problem. 
(It also would make it very difficult to represent the strength of a
match to a particular category.  Closeness to a certain number might
mean a stronger match - or it might mean that it is being pulled
erroneously in both directions away from that category.)
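
To make the ordering problem concrete, here is a small sketch comparing squared error under the integer encoding with a one-output-per-folder ("one-hot") encoding. The one-hot encoding is a common alternative brought in here for contrast, not something proposed in the question, and the function names are hypothetical.

```python
FOLDERS = ["WORK", "FRIENDS", "INTERNET", "GAMES"]

def squared_error_integer(true_folder, predicted_folder):
    """Squared error when each folder is encoded as its position in FOLDERS."""
    diff = FOLDERS.index(true_folder) - FOLDERS.index(predicted_folder)
    return diff ** 2

def one_hot(folder):
    """Encode a folder as a vector with a 1.0 in its own slot and 0.0 elsewhere."""
    return [1.0 if f == folder else 0.0 for f in FOLDERS]

def squared_error_one_hot(true_folder, predicted_folder):
    """Squared error between the one-hot vectors of the two folders."""
    t, p = one_hot(true_folder), one_hot(predicted_folder)
    return sum((a - b) ** 2 for a, b in zip(t, p))

# Integer encoding: the penalty depends on the arbitrary folder order.
print(squared_error_integer("WORK", "FRIENDS"))  # 1
print(squared_error_integer("WORK", "GAMES"))    # 9

# One-hot encoding: every misclassification costs the same.
print(squared_error_one_hot("WORK", "FRIENDS"))  # 2.0
print(squared_error_one_hot("WORK", "GAMES"))    # 2.0
```

With one output per folder, the size of each output can also be read as a rough match strength for that category.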

There is also the question of whether context matters.  In your
approach, the words hashed must contain the classification information
outside of context - that is, all you will consider is the number of
times those words appear, with no relation to what other words were in
the same phrase, sentence, or paragraph.  For example, the
significance of the word "free" in "FREE XXX PICS" will be stored
identically to its use in "Are you free to go out Saturday night?"
This isn't only a downside of your neural net approach - some
language analysis approaches will consider context, but some will
ignore it.
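
A quick sketch of that loss of context, assuming a plain word count (the `bag_of_words` helper is hypothetical): both messages contribute exactly the same count for "free".

```python
from collections import Counter
import re

def bag_of_words(text):
    """Count lowercase words, discarding order and punctuation."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

spam = bag_of_words("FREE XXX PICS")
personal = bag_of_words("Are you free to go out Saturday night?")

print(spam["free"], personal["free"])  # 1 1
```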

Now, what is intriguing about your approach is that it is very similar
to the Naive Bayesian approach, in terms of what information you are
extracting.  Naive Bayes analysis does not contain any context
information either, and gives weight to words based on how frequently
they appear in that specific folder.  In fact, if you would alter your
inputs from the sum word count to a probability of the word appearing
in that folder [ P(word|folder) ] , and multiply that by the
probability of any email belonging to that folder [ P(folder) ], you'd
be using a form of Naive Bayes to create the input for your neural
net.

In probability notation,

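P(folder|email) = P(email|folder) * P(folder) / P(email)

where P(email) can be ignored during comparison since it is common
to all folders.  Under the "naive" independence assumption,
P(email|folder) is taken as the product of P(word|folder) over the
words in the email.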
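
A minimal sketch of a classifier built on that formula, treating P(email|folder) as the product of P(word|folder) over the email's words (the "naive" part). The function names and training data are made up, and two details are assumptions not stated above: probabilities are summed as logarithms to avoid underflow, and add-one smoothing keeps unseen words from zeroing out a folder.

```python
import math
from collections import Counter

def train(labelled_emails):
    """labelled_emails: list of (folder, list_of_words) pairs."""
    word_counts = {}          # folder -> Counter of word occurrences
    email_counts = Counter()  # folder -> number of training emails
    for folder, words in labelled_emails:
        word_counts.setdefault(folder, Counter()).update(words)
        email_counts[folder] += 1
    vocabulary = {w for c in word_counts.values() for w in c}
    return word_counts, email_counts, vocabulary

def classify(words, model):
    """Return the folder maximising log P(folder) + sum of log P(word|folder)."""
    word_counts, email_counts, vocabulary = model
    total_emails = sum(email_counts.values())
    best_folder, best_score = None, -math.inf
    for folder, counts in word_counts.items():
        score = math.log(email_counts[folder] / total_emails)  # log P(folder)
        denominator = sum(counts.values()) + len(vocabulary)
        for word in words:
            # Add-one smoothing (an assumption) avoids log(0) for unseen words.
            score += math.log((counts[word] + 1) / denominator)
        if score > best_score:
            best_folder, best_score = folder, score
    return best_folder

# Made-up training data for illustration only.
model = train([
    ("WORK", ["meeting", "report", "deadline"]),
    ("WORK", ["report", "budget"]),
    ("FRIENDS", ["holiday", "party", "beach"]),
    ("FRIENDS", ["holiday", "pub"]),
])
print(classify(["holiday", "party"], model))  # FRIENDS
```

P(email) never appears in the code because it is the same for every folder and so does not change which score is largest.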

This is the approach used by the Naive Bayes email classifier that
wilsong-ga commented on.
(http://sourceforge.net/docman/display_doc.php?docid=13334&group_id=63137)

Now, what I suspect would happen if you tried this is that all
of the important information is already determined in your input
generation, and the neural net would be redundant.  Or worse, it may
add too much possible complexity to what should map to a very simple,
linear function, and the backpropagation process would fail to
converge.  Of course, neural nets and any probabilistic machine
learning approach are very situation dependent, and results can often
be surprising.  So if this isn't a total headache for you to
implement, it may still be worth trying. :)

However, since other people are showing successful applications of a
more pure Bayesian approach in exactly this application, I would
suggest that you look into that first.  Implementing a purely Bayesian
method will most likely be easier to do in Perl than trying to
integrate a neural net, and will probably work as well if not better.

Some topics for you to look for, if you choose to read up on these
ideas:

 - Markov models, which are essentially a probabilistic approach that
assumes the present thing under consideration depends on at most a
short window of what came before it.  In the simplest (zeroth-order)
case it is independent of the past entirely, i.e. context is
deliberately ignored, which is exactly what you'd be doing

 - context-sensitive statistical approaches, such as Mutual
Information Clustering.  This approach uses decision trees along with
statistical information.  Decision trees themselves are a machine
learning approach worth learning about on their own merit, as well.


References:

A Plan for Spam
http://www.paulgraham.com/spam.html
 * the approach used here to filter email is definitely worth looking
at!  I didn't have time to summarize it here.

POPFile Automatic Email Sorting using Naive Bayes
http://popfile.sourceforge.net/

And both of these books are worth finding at a library, if a little
expensive to buy:

_Artificial Intelligence, 3rd edition_, G F Luger & W A Stubblefield.
Addison Wesley Longman 1998. ISBN 0-805-31196-3

_Machine Learning_, Tom Mitchell. WCB/McGraw-Hill 1997.  ISBN
0-07-042807-7


I hope this helps!

 - josh_g-ga

spinalwiz-ga rated this answer: 4 out of 5 stars

Comments

Subject: Re: Classification of Emails with Neural Networks
From: michael3-ga on 04 Nov 2002 07:02 PST

I suspect you might run into problems with short emails where there
may be insufficient data for the network to act on effectively.  For
example, how would the network deal with the many emails I get which
just say something like 'OK', or 'I agree'?  The majority of my emails
are - taken out of context - likewise so general as to be difficult for
a program, or indeed anyone else, to categorise for me.

Subject: Re: Classification of Emails with Neural Networks
From: wilsong-ga on 04 Nov 2002 07:07 PST

Naive Bayesian algorithm (Perl code available) for your perusal:
http://popfile.sourceforge.net/
It may work for your application or give you a good starting point for
your neural net. If speed isn't an issue, Perl could be your solution.

Subject: Re: Classification of Emails with Neural Networks
From: spinalwiz-ga on 04 Nov 2002 07:55 PST

michael3,

Good point. I suppose extracting words from the subject as well as the
body might be a solution to this . . .

Subject: Re: Classification of Emails with Neural Networks
From: michael2-ga on 04 Nov 2002 08:55 PST

It would probably help to feed in the From and CC fields somehow, so
that emails exchanged within the same group of people will tend to get
categorised together.
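
One possible way to feed those fields in, sketched with Python's standard `email` package: turn each address into a prefixed token that can be counted alongside ordinary words. The `from:` and `cc:` prefixes, the `header_tokens` name and the sample message are assumptions for illustration.

```python
from email import message_from_string
from email.utils import getaddresses

def header_tokens(raw_email):
    """Return tokens like 'from:alice@example.com' for the From and Cc headers."""
    msg = message_from_string(raw_email)
    tokens = []
    for header in ("From", "Cc"):
        values = msg.get_all(header, [])
        for _name, address in getaddresses(values):
            if address:
                tokens.append(f"{header.lower()}:{address.lower()}")
    return tokens

# Made-up message for illustration.
raw = (
    "From: Alice <alice@example.com>\n"
    "Cc: bob@example.com, carol@example.com\n"
    "Subject: Lunch\n"
    "\n"
    "OK\n"
)
print(header_tokens(raw))
```

Because the prefix keeps address tokens distinct from body words, their influence could later be weighted separately if they turn out to dominate.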

Subject: Re: Classification of Emails with Neural Networks
From: spinalwiz-ga on 05 Nov 2002 08:08 PST

Using the sender or CC fields might help, but possibly the sender
field might end up carrying so much weight that the actual content of
the email will become unimportant. For example, if one of your work
colleagues became your friend, most emails from him would still end up
in the WORK category.
