Question  
Subject: I need a function to compare classifier confusion matrices in discrete problems
Category: Science > Math
Asked by: mariane0-ga
List Price: $40.00
Posted: 26 May 2005 08:58 PDT
Expires: 25 Jun 2005 08:58 PDT
Question ID: 525894
This is a follow-up to the question:
"I need a function to compare classifier accuracies in discrete problems"
http://answers.google.com/answers/threadview?id=285691

The answer provided by rincewind-ga was perfect, but I would
now need to go more deeply into this problem. Please read the
previous question as I'm picking up where the last one left off.

A confusion matrix is a table of classifier results:

         P    N   <- Predicted class of the examples
   P    TP   FN
   N    FP   TN
   ^
   actual class of the examples

TP means True Positives. It is often expressed as a frequency: the
number of correctly classified positive examples divided by the total
number of examples.

FP, False Positives, is the frequency with which negative examples have
been (mis)classified as positive examples, and so on for FN and TN.

Many measures of classifier performance are built from these four
numbers.
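
For instance, a minimal sketch (my own illustration, not from the
thread) of a few standard measures computed from the four cells:

def measures(tp, fn, fp, tn):
    total = tp + fn + fp + tn
    return {
        "accuracy":  (tp + tn) / total,   # fraction classified correctly
        "precision": tp / (tp + fp),      # of predicted positives, fraction correct
        "recall":    tp / (tp + fn),      # of actual positives, fraction found
    }

print(measures(tp=40, fn=10, fp=5, tn=45))
# {'accuracy': 0.85, 'precision': 0.888..., 'recall': 0.8}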

So, how would you reduce bigger square matrices, representing N-class
problems, to these 4 numbers?

Should we distinguish between examples of class A misclassified as
belonging to class B and examples of class A misclassified as belonging
to class C in the 3-class confusion matrix:

         A    B    C   <- Predicted class of the examples
   A    AA   AB   AC
   B    BA   BB   BC
   C    CA   CB   CC
   ^
   actual class of the examples

Where XY means the number of examples of class X classified as
belonging to class Y, divided by the total number of examples.

Any ideas welcome if they are consistent with rincewind-ga's answer :-).
Please explain step by step if you go into calculus or something.

Mary

Clarification of Question by mariane0-ga on 26 May 2005 09:44 PDT
It may be useful to consider grouping the results as one class vs.
all the others.

In this case, if we consider class A vs (B and C)

         A    B    C   <- Predicted class of the examples
   A    AA   AB   AC
   B    BA   BB   BC
   C    CA   CB   CC
   ^
   actual class of the examples

TP <- AA
FP <- BA and CA
FN <- AB and AC
TN <- BB and CC (and possibly BC and CB, which have "correctly" been
classified as not belonging to class A).
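
For what it's worth, a minimal sketch of this one-vs-rest reduction in
Python with NumPy (my own illustration; the counts below are made up):

import numpy as np

def one_vs_rest(cm, k):
    """Collapse an NxN confusion matrix of counts (rows = actual,
    cols = predicted) to TP/FP/FN/TN for class k vs. all the others."""
    tp = cm[k, k]
    fp = cm[:, k].sum() - tp        # others predicted as k (BA and CA)
    fn = cm[k, :].sum() - tp        # k predicted as others (AB and AC)
    tn = cm.sum() - tp - fp - fn    # everything else (BB, BC, CB, CC)
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

cm3 = np.array([[50,  3,  2],   # actual A
                [ 4, 45,  6],   # actual B
                [ 1,  5, 44]])  # actual C
print(one_vs_rest(cm3, 0))      # TP=50, FP=5, FN=5, TN=100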

Mary

Clarification of Question by mariane0-ga on 28 May 2005 11:01 PDT
The random classifier answers at random, with each class equally likely.

2 class values:
pos = number of positive examples
p2 = probability of properly classifying a positive example = 1/2
q2 = probability of misclassifying a positive example = 1 - p2 = 1/2
P2(X) = probability of properly classifying X of the pos positive examples
P2(X) = Combin(pos, X) * p2^X * q2^(pos-X)

N class values:
pos = number of examples of class A = number of positive examples
pN = probability of properly classifying an example of class A = 1/N
qN = probability of misclassifying an example of class A = 1 - pN
PN(X) = probability of properly classifying X of the examples of class A
PN(X) = Combin(pos, X) * pN^X * qN^(pos-X)

with Combin(pos, X) = pos! / (X! * (pos-X)!)

So a possibility might be to find a relationship between these two
binomial distributions.

We know:

pos
pN
qN

We want: a way to calculate P2(X) from PN(X) for all X
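
A quick way to tabulate both distributions side by side (a sketch
assuming SciPy is available; pos = 20 is an arbitrary example):

from scipy.stats import binom

pos = 20                    # number of examples of class A (arbitrary)
N = 3                       # number of classes

PN = binom(pos, 1.0 / N)    # PN(X): success probability pN = 1/N
P2 = binom(pos, 0.5)        # P2(X): success probability p2 = 1/2

for X in range(pos + 1):
    print(X, PN.pmf(X), P2.pmf(X))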

PS Are you still around mathtalk-ga? I'm sure this would be 
easy for you :-). Please help. 

Mary

Clarification of Question by mariane0-ga on 28 May 2005 11:14 PDT
I'm not saying I don't want somebody else to answer, just that 
it was mathtalk-ga who gave me this formula in the first place. 
And said it was "simple". 

"Simple" I find rather encouraging, it's "trivial" which I resent ;-). 

Mary

Clarification of Question by mariane0-ga on 30 May 2005 09:28 PDT
PS I realise that in my way of calculating this I lose even
true/false information about classes other than A. To be more 
precise, I consider that examples of class B classified as C 
and examples of class C classified as B are correct, in that 
they don't belong to class A and have "accurately" been classified 
as not belonging to class A.

Request for Question Clarification by mathtalk-ga on 25 Jun 2005 07:56 PDT
Hi, mariane0-ga:

I'm not sure if you are still interested in this Question, and I was
reluctant to post an Answer for which you would be charged if, as it
seemed might be the case, you had lost the ability to log into your
Google Answers account.

Since the Question is about to expire, I'll post a link for you to an
article which discusses the issue of classifier comparison in a
multiple outcome context:

[Methods for Multi-Category Cancer Diagnosis from Gene Expression Data:
  A Comprehensive Evaluation to Inform Decision Support System Development]
http://discover1.mc.vanderbilt.edu/discover/public/Publications/MCSVM.pdf

See in particular the Sections titled "Performance metrics" and
"Statistical comparison among classifiers".  As you will see, the
first metric used in this study is accuracy, i.e. percentage of correct
classifications.  However as the authors note (and you intuit, and I
was elaborating on), this measure of performance is sensitive to the
overall distributions of the true outcomes.  To give a trivial
example, since a small number of people have hearts on the right side
of the body, an accurate classifier could be obtained by predicting
the heart is on the left in every case, without using any input
information.  But if it were important to identify that small subset,
accuracy would not be a suitable measure of performance.

For this reason the authors also use a second metric called "relative
classifier information" (RCI) which gives one approach to correcting
for the number and distribution of categories:

"RCI is an entropy-based measure that quantifies how much the
uncertainty of a decision problem is reduced by a classifier relative
to classifying using the priors [16]." [Reference given to paper by V.
Sindwani et al]

They also reference some other multi-category comparison methodologies
which they decided not to pursue in this study.

Hope this helps!  You should still be able to post Comments on this
thread after the Question expires if you would like further
discussion.


regards, mathtalk-ga
Answer  
There is no answer at this time.

Comments  
Subject: Re: I need a function to compare classifier confusion matrices in discrete problems
From: mathtalk-ga on 29 May 2005 21:58 PDT
 
Hi, Mary:

I'm glad I used the word "simple" then, in my Answer long ago in Nov.
2003, but perhaps a clarification of my meaning is called for.

I've tried to review the recent Questions posted by you, esp. the
detailed Comment posted by Lars (rincewind-ga, hooray for Terry
Pratchett fans!).

The present Question concerns relating the probabilities P2(X) and
PN(X).  I think there is a thread of thought here leading all the way
back to my earlier Answer, and I feel it best to try and reconstruct
the chain as best I can.

The phrase I used then, "a simple binomial distribution", certainly
refers to a context in which the outcome is "binary".  I guess
"trivial" should be reserved for those cases in which only one outcome
is possible!  When more than two outcomes are possible, repeated
independent trials (assuming a fixed underlying probability for the
outcomes) could be said to have a "multinomial distribution".

Now it's true that the multinomial (N-outcome) way of looking at
"classifier" results might be simplified to a binary framework of just
recording "right" or "wrong" results.  In general the N outcomes are
not equally likely, but there is a "Bayesian" approach for describing
the probabilities of "right vs. wrong" in terms of the conditional
probabilities of N-outcomes given a particular "class" of test data.
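
(To make that collapse concrete, a small sketch with made-up numbers:
by the law of total probability, the overall chance of being "right" is
the prior-weighted sum of the per-class accuracies.)

priors  = [0.5, 0.3, 0.2]   # P(actual class is A, B, C) -- made up
correct = [0.9, 0.7, 0.6]   # P(classifier right | actual class) -- made up
p_right = sum(p * c for p, c in zip(priors, correct))
print(p_right)              # 0.45 + 0.21 + 0.12 = 0.78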

But let's back up a bit and try to pin down better what is really
sought.  After all, your Subject line says, "I need a function to
compare classifier confusion matrices in discrete problems".  In that
regard I would suggest that a statistic which attempts to "compare"
all classifiers needs to take into account the underlying
probabilities of the actual outcomes.

In that connection I would choose as the "benchmark" not a random
classifier which picks all the outcomes with equal chances (without
regard to the attributes or "input variables" of the test data), but
rather one which yields the outcomes according to their natural
frequencies.

To get past that point, at least temporarily, let's assume for the
sake of discussion that the N outcomes do occur in reality with equal
probability, so that the random classifier as Mary has stipulated it
would agree with my frame of comparison.

The main point to be made is that the expedient of recording only the
rightness or wrongness of a classifier's output leads only to a 2xN
tabulation of results, not a 2x2 tabulation.  The confusion matrix
presumes that the "real" situation is binary, and that the truth (or
falsity) of the classifier can then be equated to "true positive" or
"true negative" (resp. "false positive" or "false negative").

Suppose the "real" outcomes are A, B, or C as in the illustration.  We
could indeed summarize a classifier's results (on a suitably broad test
set) as:

  True A   True B   True C
  False A  False B  False C

By further restricting attention to classifying cases as A or non-A, we could summarize as:

  True A   True non-A
  False A  False non-A

Even if we correctly simulate the proper frequencies of A,B,C in our
test data, it is hard to avoid the prospective conclusion that we are
throwing away information about the classifier's performance by
restricting attention in this way to A vs. non-A discrimination.

Perhaps this is precisely what Mary wants to do here, and I'm game to
provide the necessary formulas from probability to link the NxN
matrices to 2x2 matrices under the appropriate simplifying
assumptions.  I just thought it best to drag out the rationale (or at
least the enthusiasm!) for doing so.

regards, mathtalk-ga
Subject: Re: I need a function to compare classifier confusion matrices in discrete problems
From: mariane0-ga on 30 May 2005 09:23 PDT
 
Hi mathtalk, nice to hear from you again. 

> a statistic which attempts to "compare" all classifiers needs to
> take into account the underlying probabilities of the actual outcomes

They are unknown... All we know is the number of examples of each
class in our training dataset. This can be used as a substitute, I
guess.

I'm actually trying to write this comparison function; here's
the justification:

Any N-class classification problem can be transformed into several
2-class problems. When the classes can't be ordered, this is usually
done by classifying one class vs. the N-1 others, and repeating the
process for each class, but there are many other, dataset-dependent,
possibilities. When the classes can be ordered, even if the order is
empirical, it is usually done by classifying the top-most class vs.
the others, then the two top-most classes together vs. the others,
etc. It is quite possible that the classifier which worked best on the
original dataset won't be the same as the one which gets the best
results on the two-class problems.

There are several reasons why someone would do that. A researcher may
wish to experiment with several classifiers on a 3-class dataset while
one of these classifiers is only built to handle 2-class problems. A
data miner may wish to focus upon a particular class. A user may find
large trees difficult to read and so reduce the number of class values
in order to reduce the size of the induced decision tree. Supposing he
were using C4.5 on the iris dataset, he would notice that grouping the
virginicas with the versicolors reduced the number of leaves from 5 to
2 and increased the accuracy from 96% to 99%, and he would then wonder
which tree was "best".

Anybody presenting this kind of result now ends up saying something
like "on the two-class problems the average accuracy was 81%, while
on the 5-class problems it was only 75%. Of course, a 5-class
problem is intrinsically more difficult than a two-class problem".
But he can't say on which dataset his classifier worked best while
taking into account this increased difficulty, which is a bit
unsatisfying for him (and for his audience too, if they notice).

As you noticed, this has been bugging me for a while. I used rincewind's
formula to compare accuracies, but if I want to go any further I need a
way to reduce an NxN confusion matrix into a 2x2 confusion matrix,
because many measures of classifier performance are based upon formulas
with TP, TN, FP and FN.

Let us then consider class A vs. the others. 

Now, using frequencies (these numbers over the total number of examples), 
I can use Rincewind's formula to know TP + TN (= the accuracy on 2 classes). 

Call the accuracy on the original 3-class problem "Acc3", and the
corresponding accuracy on the 2-class problem "Acc2".

To be consistent, the error must of course be 1 - the accuracy,
so (FP + FN) = 1 - Acc2.

The contribution of class A towards accuracy should be scaled:
TP = AA * Acc2 / Acc3 and TN = Acc2 - TP.

The proportion of examples of class A vs. the total number of examples
should not be modified. This is a descriptor of the data set, not a
classifier result.

So TP + FN = AA + AB + AC -> FN = AA + AB + AC - TP.
Finally, the sum of frequencies being 1,
FP = 1 - (TP + TN + FN).
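
Putting those steps together as code (a sketch of my scaling only;
Acc2 is assumed to come from Rincewind's formula, which I haven't
reproduced here, so it is just a parameter):

def reduce_to_2x2(freq, acc2):
    """freq: 3x3 nested list of frequencies (rows = actual, cols =
    predicted, class A is index 0); entries sum to 1."""
    acc3 = freq[0][0] + freq[1][1] + freq[2][2]   # Acc3 = AA + BB + CC
    tp = freq[0][0] * acc2 / acc3                 # TP = AA * Acc2 / Acc3
    tn = acc2 - tp                                # TN = Acc2 - TP
    fn = sum(freq[0]) - tp                        # TP + FN = AA + AB + AC
    fp = 1.0 - (tp + tn + fn)                     # frequencies sum to 1
    return tp, fp, fn, tn

freq = [[0.30, 0.03, 0.02],   # actual A
        [0.04, 0.28, 0.03],   # actual B
        [0.02, 0.03, 0.25]]   # actual C
print(reduce_to_2x2(freq, acc2=0.90))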

This is as far as I got, BUT how on earth does this correspond to multinomial 
probabilities??? 

You see, I can understand Rincewind's formula, but I don't understand 
yours. I mean, I can use it, but I wouldn't know how to derive it. 

So I'm wondering whether my way of calculating this makes any sense from 
a statistical point of view? 

I'm hoping that a way of relating the probabilities P2(X) and
PN(X) would somehow "prove" that Rincewind's formula is the right 
choice... ? 

As for throwing away information, yes, well, I think this can't be 
helped: there's no way a 2x2 matrix can contain as much information 
as an NxN matrix, but I'm hoping that it would be a valid answer 
to this problem to repeat the calculation for all classes (taken 
one by one vs. all the others) and to average the results... 
What do you think?  

> I'm game to provide the necessary formulas from probability to link
> the NxN matrices to 2x2 matrices under the appropriate simplifying
> assumptions.

Please do so. If you can think of any way to connect this with
Rincewind's formula, all the better. If not, I'll just try a few
numerical examples to see how the results of both transformations
compare to each other.

Regards. 

Mary
Subject: Re: I need a function to compare classifier confusion matrices in discrete problems
From: mathtalk-ga on 25 Jun 2005 08:10 PDT
 
The formulas for RCI (relative classifier information), as computed
from a confusion matrix Q, are summarized in the last two pages of this
presentation:

[Comparison of Machine Learning Algorithms on Different Microarray Data Sets]
http://www.dkfz.de/biostatistics/Reisensburg2004/folien/Brors-2004.pdf
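
For reference, here is a hedged sketch of one common entropy-based
formulation of such a measure: the mutual information between actual
and predicted labels, normalized by the entropy of the actual labels.
The exact formulas in the slides may differ in detail.

import numpy as np

def rci(cm):
    p = cm / cm.sum()              # joint frequencies P(actual, predicted)
    p_actual = p.sum(axis=1)       # priors P(actual)
    p_pred = p.sum(axis=0)         # P(predicted)
    # mutual information I(actual; predicted), in bits
    mi = sum(p[i, j] * np.log2(p[i, j] / (p_actual[i] * p_pred[j]))
             for i in range(p.shape[0]) for j in range(p.shape[1])
             if p[i, j] > 0)
    h_actual = -sum(q * np.log2(q) for q in p_actual if q > 0)
    return mi / h_actual           # 0 = no better than priors, 1 = perfect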


best wishes, mathtalk-ga
