Hello spinalwiz,
The idea you've described sounds interesting. However, I suspect that
a neural network may not be the most practical choice of algorithm.
Other approaches are known to handle textual input better, and will
be more flexible in weighting word importance.
The nature of neural nets is that they are very good at taking
numerical input and learning some nonlinear function that produces
reasonably correct numerical output. When the required output is a
simple on/off, a numerical representation is just fine. When you
want output with a range of possible answers, and those answers have
no numerical relation to one another, that representation can make
things difficult.
How will you quantify the output error during training? It is easy
to map the folders to integer values, e.g. WORK=0, FRIENDS=1, etc.
What makes this problematic is that the ordering of the folders is
arbitrary, and yet backpropagation of error will consider some
folders to be "closer" to each other. For example, misclassifying a
WORK email as FRIENDS will be considered less erroneous than
labelling it GAMES. While the neural net may still converge to a
working solution, the nature of this approach doesn't fit the
problem well.
(It would also make it very difficult to represent the strength of a
match to a particular category. Closeness to a certain number might
mean a stronger match - or it might mean the output is being pulled
erroneously in both directions away from that category.)
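To make that concrete, here is a minimal Python sketch. The
squared-error loss and the GAMES=2 assignment are illustrative
assumptions on my part, not anything from your description:

  # Why integer folder labels distort training error.
  # The folder numbering and squared-error loss are illustrative.
  folders = {"WORK": 0, "FRIENDS": 1, "GAMES": 2}

  def squared_error(true_folder, predicted_value):
      # Error as the plain numerical distance that backpropagation
      # with a squared-error loss would see.
      return (folders[true_folder] - predicted_value) ** 2

  # The same WORK email, misclassified two different ways:
  print(squared_error("WORK", 1.0))  # output near FRIENDS -> error 1.0
  print(squared_error("WORK", 2.0))  # output near GAMES   -> error 4.0

Both outputs are equally wrong classifications, yet training pushes
four times as hard against the second one, purely because of the
arbitrary folder numbering.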
There is also the question of whether context matters. In your
approach, the words hashed must contain the classification information
outside of context - that is, all you will consider is the number of
times those words appear, with no relation to what other words were in
the same phrase, sentence, or paragraph. For example, the
significance of the word "free" in "FREE XXX PICS" will be stored
identically to its significance in "Are you free to go out Saturday
night?" This isn't unique to your neural net approach: some
language analysis approaches consider context, while others ignore
it.
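Here is a quick sketch of what that looks like as code, assuming a
simple bag-of-words count over lowercased, whitespace-split words
(the tokenization details are illustrative):

  # A bag-of-words representation discards all context.
  from collections import Counter

  spam    = "FREE XXX PICS"
  genuine = "Are you free to go out Saturday night?"

  def bag_of_words(text):
      # Lowercase, split on whitespace, strip basic punctuation.
      return Counter(w.strip("?.,!") for w in text.lower().split())

  print(bag_of_words(spam)["free"])     # 1
  print(bag_of_words(genuine)["free"])  # 1

Both emails contribute an identical count for "free"; the
surrounding words play no part in the representation.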
Now, what is intriguing about your approach is that it is very
similar to the Naive Bayesian approach in terms of the information
you are extracting. Naive Bayes analysis does not use any context
information either, and weights words by how frequently they appear
in each specific folder. In fact, if you altered your inputs from
raw word counts to the probability of the word appearing in that
folder [ P(word|folder) ], and multiplied that by the probability of
any email belonging to that folder [ P(folder) ], you'd be using a
form of Naive Bayes to create the input for your neural net.
In probability notation,
P(folder|email) = P(email|folder) * P(folder) / P(email)
where P(email) can be ignored during comparison, since it is common
to all folders.
This is the approach used by the Naive Bayes email classifier that
wilsong-ga commented on.
(http://sourceforge.net/docman/display_doc.php?docid=13334&group_id=63137)
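For comparison, here is a minimal sketch of that computation,
written in Python rather than Perl for brevity. The folder names,
training emails, and add-one smoothing are illustrative assumptions
on my part:

  # Minimal Naive Bayes folder classifier, following the formula
  # above. Training data and smoothing are illustrative only.
  import math
  from collections import Counter, defaultdict

  class NaiveBayesFolders:
      def __init__(self):
          self.word_counts = defaultdict(Counter)  # folder -> word tallies
          self.email_counts = Counter()            # folder -> email tally
          self.vocabulary = set()

      def train(self, folder, words):
          self.word_counts[folder].update(words)
          self.email_counts[folder] += 1
          self.vocabulary.update(words)

      def classify(self, words):
          total_emails = sum(self.email_counts.values())
          scores = {}
          for folder in self.email_counts:
              # log P(folder)
              score = math.log(self.email_counts[folder] / total_emails)
              total_words = sum(self.word_counts[folder].values())
              for word in words:
                  # P(word|folder) with add-one smoothing so unseen
                  # words don't zero out the whole score.
                  count = self.word_counts[folder][word] + 1
                  score += math.log(count /
                                    (total_words + len(self.vocabulary)))
              scores[folder] = score
          # P(email) is omitted: it is common to all folders.
          return max(scores, key=scores.get)

  nb = NaiveBayesFolders()
  nb.train("WORK", ["meeting", "budget", "report"])
  nb.train("FRIENDS", ["free", "saturday", "night"])
  print(nb.classify(["are", "you", "free", "saturday"]))  # FRIENDS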
What I suspect would happen if you tried this is that all of the
important information would already be captured in your input
generation, making the neural net redundant. Or worse, it may add
too much complexity to what should map to a very simple, linear
function, and the backpropagation process could fail to converge.
Of course, neural nets and probabilistic machine learning approaches
in general are very situation dependent, and results can often be
surprising. So if this isn't a total headache for you to implement,
it may still be worth trying. :)
However, since other people have shown successful applications of a
purer Bayesian approach to exactly this problem, I would suggest
that you look into that first. Implementing a purely Bayesian method
will most likely be easier in Perl than integrating a neural net,
and will probably work as well, if not better.
Some topics for you to look for, if you choose to read up on these
ideas:
- Markov models, which are essentially a Bayesian approach that
assumes the item presently under consideration is independent of
items in the past, i.e. something which deliberately ignores
context, exactly what you'd be doing (see the formula after this
list)
- context-sensitive statistical approaches, such as Mutual
Information Clustering, which uses decision trees along with
statistical information. Decision trees themselves are a machine
learning approach worth studying on their own merit.
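To make the independence assumption concrete, in the same notation
as before (this is the "naive" step that gives Naive Bayes its
name):

P(word1, word2, ..., wordN | folder)
    = P(word1|folder) * P(word2|folder) * ... * P(wordN|folder)

Each word contributes on its own; the model never asks which words
appeared next to each other.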
References:
A Plan for Spam
http://www.paulgraham.com/spam.html
* the approach used here to filter email is definitely worth looking
at! I didn't have time to summarize it here.
POPFile Automatic Email Sorting using Naive Bayes
http://popfile.sourceforge.net/
And both of these books are worth finding at a library, if a little
expensive to buy:
_Artificial Intelligence, 3rd edition_, G F Luger & W A Stubblefield.
Addison Wesley Longman 1998. ISBN 0-805-31196-3
_Machine Learning_, Tom Mitchell. WCB/McGraw-Hill 1997. ISBN
0-07-042807-7
I hope this helps!
- josh_g-ga