After a detailed research, I located the type of papers and documents
that will help you with your research. I have provided a brief
description of every algorithm and some resources for the basic
information but as you need more details on the appropriate usage and
the usefulness of these classifiers, I did extensive research to find
the kind of papers that describe such comparisons and the
effectiveness/problems of these algorithms.
AGAIN, As far as I can assume you may already have knowledge of the
algorithms, therefore instead of Repeating the information or writing
my own review, I tried hard to find the type of papers that can help
you with your research.
You may ignore a brief description of each algorithm and go directly
to the 'DOCUMENTS & PAPERS' section or you may find the resources with
the algorithms very useful.
A great resources for such type of papers is
The Ricchio's Algorithm is based on the Relevancy Feedback Algorithms
for Document Relevancy. Relevancy Feedback Models are an effective way
of modifying and expanding user queries (such as search engines).
Ricchio's Algorithm is one of the earliest methods used for queries.
This algorithm is basesd on the idea that if the relevance for a query
is known, an optimal query vector will maximize the average
query-document similarity for relevant documents, and will
query-document similarity for non relevant documents.
If we look at the basic formula for the Ricchio's Algorithm, the
intuitive idea of Ricchio's Algorithm is to iteratively increase the
weights of those terms contained in labeled relevant documents while
penalizing the terms in the irrelevant documents.
Recent studies have shown that Ricchio's Algorithm has a poor
performance when the proportion of relevant documents in the whole
corpus is low.
Naive Byes algorithms are among the most successful known algorithms
for learning to classify text documents. It predicts by reading a set
of examples in attribute value-representation and than by using the
Bayes Theorem to estimate the posterior probabilities of all
qualifications. For each
instance of the example language a classification with the highest
posterier probability is chosen
as the prediction.
Suppose your data consist of fruits, described by their color and
shape. Bayesian classifiers operate by saying "If you see a fruit
that is red and round, which type of fruit is it most likely to be,
based on the observed data sample? In future, classify red and round
fruit as that type of fruit."
A difficulty arises when you have more than a few variables and
classes -- you would require an enormous number of observations
(records) to estimate these probabilities.
Naive Bayes classification gets around this problem by not requiring
that you have lots of observations for each possible combination of
the variables. Rather, the variables are assumed to be independent of
one another and, therefore the probability that a fruit that is red,
round, firm, 3" in diameter, etc. will be an apple can be calculated
from the independent probabilities that a fruit is red, that it is
round, that it is firm, that is 3" in diameter, etc.
In other words, Na´ve Bayes classifiers assume that the effect of an
variable value on a given class is independent of the values of other
variable. This assumption is called class conditional independence. It
is made to simplify the computation and in this sense considered to be
This assumption is a fairly strong assumption and is often not
applicable. However, bias in estimating probabilities often may not
make a difference in practice -- it is the order of the probabilities,
not their exact values, that determine the classifications.
Studies comparing classification algorithms have found the Na´ve
Bayesian classifier to be comparable in performance with
classification trees and with neural network classifiers. They have
also exhibited high accuracy and speed when applied to large
The problem with multinomial Naive Bayes is that when one class has
more training examples than another, Naive Bayes selects poor weights
for the decision boundary. This is due to an under-studied bias effect
that shrinks weights for classes with few training examples. ANother
systematic problem with Naive Bayes is that features are assumed to be
independent. As a result, even when the words are dependent, each word
contributes evidence individually. Thus the magnitude for the weights
for classes with strong word dependencies is larger than for classes
with weak word dependencies.
PORBLEMS WITH NAIVE BAYES
The goal of this clustering method is to simply separate the data
based on the assumed similarities between various classes. Thus, the
classes can be differentiated from one another by searching for
similarities between the data provided.
The K-Nearest Neighbor is suitable for data streams. KNN does not
build a classifier in advance. When a new sample arrives, KNN finds
the K neighbors nearest to the new samples from the training space
based on some suitable similarity or distance metric.
KNN is a good choice when simplicity and accuracy are the predominant
issues. KNN can be superior when a resident, trained and tested
classifiers has a short useful lifespan, such as in the case with the
data streams where new data is added rapidly and the training set is
ever changing. KNN does not rely on prior probabilities, and it is
computationally efficient. The main computation is the sorting of the
training documents in order to find out the K nearest neighbors for
the test document.
K-Nearest Neighbor is useful when their are less than 20 attributes
per instance, there is lots of training data, training is very fast,
learning complex target functions and don't want to loose information.
The disadvantages of using such a function is that it is slow in
sorting out queries and irrelevant attributes can fool the neighbor.
The Decision Tree exploration engine, new to PolyAnalyst 4.1, helps
solve the task of classifying cases into multiple categories. Decision
Tree is PolyAnalyst's fastest algorithm when dealing with large
amounts of attributes. Decision Tree report provides an easily
interpreted decision tree diagram and a predicted versus real table.
Problems to Solve:
Classification of cases into multiple categories
Categorical or Boolean (Yes/No) attribute
Predicted versus Real table (confusion matrix)
Decision Tree diagram
Optimal Number of Records:
Minimum of 100 records
Maximum of 5,000,000 records
Summary Statistics - to deselect attributes that contain to many
values to provide any useful insight to the exploration engine.
Information Gain splitting criteria
Shannon information theory and statistical significance tests.
The Data Used:
Decision Tree works on data of any type. The DT algorithm is
well-poised for analyzing very large databases because it does not
require loading all the data in machine main memory simultaneously.
PolyAnalyst takes a full advantage of this feature by implementing
incremental DT learning with the help of the OLE DB for Data Mining
The DT algorithm calculation time scales very well (grows only
linearly) with increasing number of data columns. At the same time, it
grows more than linearly with the growing number of data records - as
N*log(N), where N is the number of records. Yet, for data of about
100,000 records, the DT algorithm is often the fastest exploration
algorithm of PolyAnalyst.
Problems to Solve:
Decision Tree algorithm helps solving the task of classifying cases
into multiple categories. In many cases, this is the fastest, as well
as easily interpreted machine learning algorithm of PolyAnalyst. The
DT algorithm provides intuitive rules for solving a great variety of
classification tasks ranging from predicting buyers/non-buyers in
database marketing, to automatically diagnosing patient in medicine,
and to determining customer attrition causes
in banking and insurance.
The target attribute of a Decision Tree exploration must be of a
Boolean (yes/no) or categorical data type.
When to Use This Algorithm:
The Decision Tree exploration engine is used for task such as
classifying records or predicting outcomes. You should use decision
trees when you goal is to assign your records to a few broad
categories. Decision Trees provide easily understood rules that can
help you identify the best fields for further exploration.
The Decision Tree report starts of by giving measures resulting from
the decision tree. These measures are the Number of non-terminal
nodes, Number of leaves, and depth of the constructed tree. Next, the
report provides classification statistics on the decision tree.
The drawback of this algorithm is that large number of gini indices
have to be computed at each node of the decision tree. In order to
decide which attribute is to be split at each node, the gini indices
have to be computed for all the attributes and for each successive
pair of values for all patterns which have not been classified
OVERVIEW WITH AN EXAMPLE
Support Vector Machine
A support vector machine is a supervised learning algorithm developed
over the past decade by Vapnik and others (Vapnik, Statistical
Learning Theory, 1998). The algorithm addresses the general problem of
learning to discriminate between positive and negative members of a
given class of n-dimensional vectors. For example, if you have a
series of mRNA expression level measurements for each of a large
number of genes, the SVM can learn to answer questions such as, ``Does
the given gene belong to functional class X?'' where X is some
category such as ``ribosomal genes'' or ``sugar and carbohydrate
transporters.'' If you have a collection of digitized images of
handwritten digits, the SVM can say whether a
given image is, say, the number 9.
The SVM algorithm operates by mapping the given training set into a
possibly high-dimensional feature space and attempting to locate in
that space a plane that separates the positive from the negative
examples. Having found such a plane, the SVM can then predict the
classification of an unlabeled example by mapping it into the feature
space and asking on which side of the separating plane the example
lies. Much of the SVM's power comes from its criterion for selecting a
separating plane when many candidates planes exist: the SVM chooses
the plane that maintains a maximum margin from any point in the
training set. Statistical learning theory suggests that, for some
classes of well-behaved data, the choice of the maximum margin
hyperplane will lead to maximal generalization when predicting the
classification of previously unseen examples (Vapnik, Statistical
Learning Theory, 1998). The SVM algorithm can also be extended to cope
with noise in the training set and with multiple classes (Cristianini
and Shawe-Taylor, An Introduction to Support Vector Machines, 2000).
Say that we have a training data set containing n examples, each of
which is a vector of m numbers. These vectors may be thought of as
points in an m-dimensional space. In theory, a simple way to build a
binary classifier is to construct a hyperplane (i.e., a plane in a
space with more than three dimensions) separating class members
(positive examples) from non-members (negative examples) in this
space. Unfortunately, most real-world problems involve non-separable
data for which there does not exist a hyperplane that successfully
separates the positive from the negative examples. One solution to the
inseparability problem is to map the data into a higher-dimensional
space and define a separating hyperplane there. This
higher-dimensional space is called the feature space, as opposed to
the input space occupied by the training examples. With an
appropriately chosen feature space of sufficient dimensionality, any
consistent training set can be made separable. However, translating
the training set into a higher-dimensional space incurs both
computational and learning-theoretic costs. Furthermore, artificially
separating the data in this way exposes the learning system to the
risk of finding trivial solutions that overfit the data.
SVMs elegantly sidestep both difficulties. They avoid overfitting by
choosing the maximum margin separating hyperplane from among the many
that can separate the positive from negative examples in the feature
space. Also, the decision function for classifying points with respect
to the hyperplane only involves dot products between points in the
feature space. Because the algorithm that finds a separating
hyperplane in the feature space can be stated entirely in terms of
vectors in the input space and dot products in the feature space, a
support vector machine can locate the hyperplane without ever
representing the space explicitly, simply by defining a function,
called a kernel function, that plays the role of the dot product in
the feature space. This technique avoids the computational burden of
explicitly representing the feature vectors.
Latent Semantic Analysis
Latent Semantic Analysis (LSA) analyzes word-word, word-passage, and
passage-passage relationships. There's a good relationship between
what LSA extracts and what people understand. LSA doesn't use
first-hand knowledge of the world, but extracts from "episodes" of
word content. LSA doesn't use word order or logic relationships. LSA
uses unitary expressions of meaning instead of relationships between
successive words. A word is a kind of average of meaning through all
Dimensionality is reduced to approximate human cognition. LSA is a
theory of knowledge representation. Dimensionality reduction solves
the problem of "insufficient evidence" or "poverty of the stimulus".
LSA uses a matrix decomposition algorithm to reduce the
dimensionality. Ideally, the dimension of the reconstruction equals
that of the passages. Results show that the meaning similarities are
close to that of humans, LSA's rate of knowledge acquisition
approximates that of humans, and LSA depends on the dimensionality.
LSA can be use to test theories of human cognition. LSA skips over the
order of words to capture relationships in word choice. LSA uses a
pre-processing step for word correlation over many passages and
contexts. LSA uses a very large number of relationships. Theories of
human cognition cannot be settled by theoretical and philosophical
LSA is an automatic mathematical algorithm for extracting
relationships in word usage in passages. It doesn't use dictionaries,
external knowledge, or grammar rules. First, represent words as a
matrix, each row is a word, and each column is a text passage or
context. Next, do a preliminary transformation of the matrix. Each
word frequency in the Next, LSA applies singular value decomposition
(SVD) to the matrix, which is factor analysis. The decomposition
results in dimensionality reduction. Extract words from the passage
into a word matrix, do a linear decomposition of the matrix, then
reduce the dimensionality of the matrix. The LSA matrix adds words not
in the passage, like human minor knowledge acquisition. LSA is
intuitively sensible, with a three-fourths gain in total comprehension
vocabulary inferred from knowledge about words not in the passage or
paragraph. Human children have a rapid growth of vocabulary and
knowledge. Humans draw conclusions from missing data. Reducing the
dimensionality of representation is useful when the representation
matches the data. The data should not be perfectly regenerated. The
similarity of dimensionality reduction is the cosine between vectors.
Before the SVD is computed, LSA does a data preprocessing matrix data
transformation. Save the log word frequency, and the entropy for each
row and column of the word. Weight each word occurrence by an estimate
of its importance in the passage. Knowing a word provides information
about the passage it appeared in. Matrix transformations are used in
information retrieval and human cognition models. A web site provides
LSA based word or passage vectors, similarities between words and
words, words and passages, and passages and passages. LSA is able to
model human conceptual knowledge.
LSA links information retrieval and human semantic memory. Latent
Semantic Indexing (LSI), like LSA, was tested against pre-examined
documents. Direct comparisons were muddied by preprocessing words. LSA
does synonym tests, since most near neighbors are related by the
cosine. LSA vectors were created from many passages. LSA captures
synonymity by knowledge
of captured vocabularies. TOEFL vocabulary simulates human performance
between word choice. LSA errors were compared to student errors. The
role of dimension reduction was analyzed. LSA simulates word sorting
and word relationships. Subject-matter knowledge and sematic priming
are in LSA. Predictive learning and text comprehension for humans.
************DOCUMENTS & PAPERS***************************
The research compares the effectiveness of five different text
algorithms (Rocchio's Algorithm, Decision Tree, Naive Bayes, Bayes
Nets & Support Vector Machines) in terms of learning speed, real time
classification speed and classification accuracy.
The report discusses a variety of ways of evaluating the effectiveness
of text categorization systems.
This is a very detailed study of machine learning in Automated Text
Classification, different algorithms and issue pertaining to document
representation, classifier construction and classifier evaluation.
This document describes some of the basics of text categorization and
how to access the quality of each.
The paper surveys the machine learning text classification algorithms
and measure their efficiency to predict the best classification
This is a thesis written by one of the doctorate candidates of the
PAce University. It not only clearly defines different sets of text
categorization algorithms but also compare and contrast them for
different datasets so as to help system designers chose an appropriate
algorithm given their system constraints and application domains. (see
page 37 of the thesis which is Page 47 of the acrobat reader).
This paper reports a controlled study with statistical significant
tests on five text categorization methods and their robustness in
dealing with different situations.
This paper focuses on a comparative evaluation of a wide-range of text
categorization methods, including previously published results on the
Reuters corpus and new results of additional experiments. A controlled
study using three classifiers, kNN, LLSF and WORD, was conducted to
examine the impact of configuration variations in five versions of
Reuters on the observed performance of classifiers.
Analysis and empirical evidence suggest that the evaluation results on
some versions of Reuters were significantly affected by the inclusion
of a large portion of unlabelled documents, making those results
difficult to interpret and leading to considerable confusions in the
literature. Using the results evaluated on the other versions of
Reuters which exclude the unlabelled documents, the performance of
twelve methods are compared directly or indirectly. For indirect
comparisons, kNN, LLSF and WORD were used as baselines, since they
were evaluated on all versions of Reuters that exclude the unlabelled
documents. As a global observation, kNN, LLSF and neural network
method had the best performance, except for a Naive Bayes approach,
the other algorithms performed relatively well.
The paper describes four machine learning techniques that are common
with the Text categorization and then evaluate their performance
Power Point presentation of five text classifiers (Read the
Hope this will help you. I will try to post more resources if I come
across any. Please clarify if you have any questions regarding this
research or if you need more help.