A classifier is an algorithm which takes as input a training set of
(m+1)-tuples of the form (Att1, Att2, ..., Attm, class), and a testing
set of m-tuples of the form (Att1, Att2, ..., Attm), where
Att1, ..., Attm are the "attributes" or "variables" of the tuple.
Here I am concerned with problems where the class is a discrete
variable. In these cases there is usually a small number of possible
values for the class, typically between 2 and 10.
The number of tuples, on the other hand, can go up to 10^5 or 10^6.
By "looking" at the training set, the classifier finds a function
which correlates the class with
the attributes.
It then uses this function to classify all the tuples of the testing
set, that is, to add to each
of them one element: the class to which it "thinks" this tuple belongs.
In discrete classification problems, the class has to be one of n
known values. The accuracy of the classifier's predictions is then
calculated as:
(number of correctly classified tuples in the testing set) /
(number of tuples in the testing set).
It is usually expressed as a percentage.
Example:
Training set (3 4-tuples):
Att1 Att2 Att3 Class
0 0 1 A
1 1 0 B
1 0 1 A
Testing set (3 3-tuples):
Att1 Att2 Att3
1 1 1
0 0 0
0 1 0
The classifier might decide that the class is A if Att2 = 0 and B
otherwise. It would then return:
Att1 Att2 Att3
1 1 1 B
0 0 0 A
0 1 0 B
If this is the correct answer, the classifier gets 100% accuracy; if
only two of these tuples were properly classified, it gets 67%; and so on.
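In code, this example looks as follows (a Python sketch; all names are
mine):

    def classify(tup):
        # The rule this hypothetical classifier settled on.
        att1, att2, att3 = tup
        return "A" if att2 == 0 else "B"

    testing_set = [(1, 1, 1), (0, 0, 0), (0, 1, 0)]
    predicted = [classify(t) for t in testing_set]
    print(predicted)  # ['B', 'A', 'B']

    # If these happen to be the right answers, accuracy is 100%:
    true_classes = ["B", "A", "B"]
    correct = sum(p == c for p, c in zip(predicted, true_classes))
    print(correct / len(testing_set))  # 1.0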
Remark: mistakes are not independent. In our very simple example, the
classifier deduced:
If (Att2 = 0) -> class = A, else -> class = B
Maybe these tuples were generated following the rule:
If (Att3 = 0) -> class = B, else -> class = A
Which would mean that the result
Att1 Att2 Att3
1 1 1 B -> Should be A
0 0 0 A -> Should be B
0 1 0 B
has two dependent mistakes in it, for an accuracy of 33%. (They are
'dependent' because they were both caused by the same, wrong, rule.)
This simple measure is useful when comparing the results of several
n-class problems, where n is constant. If on problem one the class has
3 possible values and the classifier gets 98% accuracy, while on
problem two the class has 3 possible values and the classifier gets
81% accuracy, we can say that this classifier did better on problem one.
Not so if in problem one the class has 2 possible values while in
problem two the class has 4 possible values.
If the class has n possible values and the classifier is answering at
random, the probability of correctly classifying r out of t tuples is:
C(t, r) * (1/n)^r * (1 - 1/n)^(t - r), with C(t, r) = t! / (r! (t - r)!).
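For instance, as a quick sanity check in Python (the function name is
mine; just a sketch):

    from math import comb

    def p_random(t, r, n):
        # Probability that a random classifier gets exactly r of t
        # tuples right when the class can take n equally likely values.
        return comb(t, r) * (1 / n) ** r * (1 - 1 / n) ** (t - r)

    # E.g. randomly getting 2 of the 3 testing tuples right, 2 classes:
    print(p_random(3, 2, 2))  # 0.375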
We could say that, as long as the accuracy is higher than (1 / number
of possible class values), the less likely the result, the better it
is. So if on problem one the class has 2 possible values and the
classifier gets 98% accuracy, while on problem two the class has 4
possible values and the classifier gets 81% accuracy, both problems
having 100 tuples to classify:
P(98% accuracy for problem one (2 class values) if the classification is random) =
0.00000000000000000000000000390486148084400843678806
first significant digit at 10^(-27)
P(81% accuracy for problem two (4 class values) if the classification is random) =
0.00000000000000000000000000000009571962731167119452
first significant digit at 10^(-32)
This result says that the classifier did better on problem two: if it
classified randomly (that is, if it didn't know what it was doing!),
it would be less likely to get 81% accuracy distributing 100 tuples
into 4 groups than 98% accuracy distributing 100 tuples into 2 groups.
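These two numbers are easy to reproduce, e.g. in Python (a sketch;
math.comb keeps the binomial coefficient exact):

    from math import comb

    # Problem one: 98 of 100 tuples correct, 2 class values.
    p1 = comb(100, 98) * (1 / 2) ** 98 * (1 / 2) ** 2
    # Problem two: 81 of 100 tuples correct, 4 class values.
    p2 = comb(100, 81) * (1 / 4) ** 81 * (3 / 4) ** 19
    print(p1)  # about 3.905e-27
    print(p2)  # about 9.572e-32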
Problems:
If we duplicate the testing set (and there is nothing to prevent us
from doing so), the problem
gets neither easier nor more difficult for the classifier. For a
2-class problem:
The probability of randomly classifying 15 out of 20 tuples correctly
is 0.0147858.
If we duplicate the testing set:
The probability of randomly classifying 30 out of 40 tuples correctly
is 0.000770943.
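Both figures follow from the binomial formula above (quick check):

    from math import comb

    print(comb(20, 15) * (1 / 2) ** 20)  # 0.01478576...
    print(comb(40, 30) * (1 / 2) ** 40)  # 0.00077094...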
Did the classifier do better the second time?
No. The problem was equivalent. The classifier got 75% accuracy each time.
It didn't do better, it did the same.
This shows that when comparing accuracies, only the percentage should be taken into
account. Formulas which would consider separately the number of tuples
in the testing
set and the number of correctly classified tuples won't work.
What is needed is a function of (accuracy, number of classes), not a
function of 3 variables,
(number of tuples in the testing set, number of correctly classified
tuples, number of classes).
As the 'reference' is usually a 2-class (binary) problem, the nicest
formula would look like:
f(accuracy, number of classes) = 'the corresponding accuracy if this
had been a binary problem'.
Do we really care about the probability of getting exactly 81%
accuracy? We could just as
well use the probability of getting at least 81%, the probability of
getting less, or
some other such measurement. As long as it makes sense, any measurement is fine.
'Exact' functions would be better, but using Stirling's approximation
(n! is approximately sqrt(2*pi*n) * (n/e)^n) for the number of tuples
in the testing set or the number of correctly classified tuples is
acceptable, because these numbers are usually large.
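For example, at t = 100 the approximation is already very close (a
quick check; math.lgamma returns ln(n!) directly):

    from math import lgamma, log, pi

    n = 100
    exact = lgamma(n + 1)                                # ln(100!)
    stirling = 0.5 * log(2 * pi * n) + n * (log(n) - 1)  # Stirling
    print(exact)     # 363.7393755...
    print(stirling)  # 363.73854..., off by roughly 1/(12 * n)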
Real-life databases now often contain hundreds of thousands of tuples
(rows in their tables).
Ideally, we would like a result which could be calculated on a small
hand-held calculator, instead of having to write a program using large
variables to hold all the digits. Even for my hypothetical 100-tuple
example I had to use the computer to get the result, and even 8-byte
variables won't let me do the same calculation for 10 000 tuples.
So formulas which have the number of classes as a power are alright,
but formulas which
would have the number of tuples as a power are impractical.
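For what it is worth, the standard workaround when one does have a
computer is to stay in log space; a sketch (the function name is mine):

    from math import lgamma, log

    def log10_p_random(t, r, n):
        # Base-10 logarithm of the probability that a random classifier
        # gets exactly r of t tuples right with n class values.
        # lgamma(k + 1) = ln(k!), so nothing here ever overflows.
        log_comb = lgamma(t + 1) - lgamma(r + 1) - lgamma(t - r + 1)
        log_p = log_comb + r * log(1 / n) + (t - r) * log(1 - 1 / n)
        return log_p / log(10)

    # 8100 of 10 000 tuples correct, 4 class values:
    print(log10_p_random(10000, 8100, 4))  # about -3004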
Here are some general guidelines, with the accuracy x a real number
between 0 and 1 (corresponding to the percentage) and n an integer,
n > 1:
f(accuracy, number of classes) = f(x, n)
f(0, n) = 0
f(1, n) = 1
f((1/n), n) = 1/2
f(x, 2) = x
Though n will never be anything other than an integer, f can be a
function of two real numbers; continuity would be nice in [0, 1] for x
and [2, +infinity) for n. The behaviour of the function outside these
intervals is irrelevant.
Someone told me that an infinity of functions would fit these 4
conditions, and that I needed to add more specifications. But my added
specification is (maybe unfortunately!) completely informal: the
formula has to make sense. It has to come from somewhere, to be
arguably the right formula in this context. (Which is why I talked
about probabilities.)
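To illustrate how weak the four formal conditions are on their own,
here is one function of my own invention which satisfies all of them
yet does not come from anywhere in particular:

    from math import log

    def f(x, n):
        # f(x, n) = x^(ln 2 / ln n): gives f(0, n) = 0, f(1, n) = 1,
        # f(1/n, n) = 1/2 and f(x, 2) = x, but there is no probabilistic
        # argument behind it.
        return x ** (log(2) / log(n))

    print(f(1 / 4, 4))  # 0.5
    print(f(0.9, 2))    # 0.9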
Any additional ideas or comments are welcome.
Mary
PS: I know about ROC curves; I am looking for a way to compare accuracies.