Subject: I need a function to compare classifier confusion matrices in discrete problems
Category: Science > Math
Asked by: mariane0-ga
List Price: $40.00
Posted: 26 May 2005 08:58 PDT
Expires: 25 Jun 2005 08:58 PDT
Question ID: 525894
This is a follow-up to the question "I need a function to compare classifier accuracies in discrete problems": http://answers.google.com/answers/threadview?id=285691 . The answer provided by rincewind-ga was perfect, but I would now need to go more deeply into this problem. Please read the previous question, as I'm picking up where the last one left off.

A confusion matrix is a table of classifier results:

      P    N    <- predicted class of the examples
  P   TP   FN
  N   FP   TN
  ^
  actual class of the examples

TP means True Positives. It is often expressed as a frequency: the number of accurately classified positive examples / the total number of examples. FP, False Positives, is the frequency with which negative examples have been (mis)classified as positive examples, and so on for the other cells. Many measures of classifier performance are built up from these 4 numbers.

So, how would you reduce bigger square matrices, representing N-class problems, to these 4 numbers? Should we distinguish between examples of class A misclassified as belonging to class B and examples of class A misclassified as belonging to class C in the 3-class confusion matrix:

      A    B    C    <- predicted class of the examples
  A   AA   AB   AC
  B   BA   BB   BC
  C   CA   CB   CC
  ^
  actual class of the examples

where XY means the number of examples of class X classified as belonging to class Y over the total number of examples?

Any ideas welcome if they are coherent with rincewind-ga's answer :-). Please explain step by step if you go into calculus or something.

Mary
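For reference, the most common convention for getting these 4 numbers from a larger matrix is the one-vs-rest reduction: for a chosen positive class A, TP = AA, FN = AB + AC, FP = BA + CA, and TN is everything else. Below is a minimal Python sketch of that standard convention (the function name and the example frequencies are made up, and this is not necessarily the reduction the thread ends up recommending):

# Minimal sketch: one-vs-rest reduction of an NxN confusion matrix
# (rows = actual class, columns = predicted class, entries = frequencies)
# to the four numbers TP, FN, FP, TN for one chosen "positive" class.
# The function name and the example matrix are illustrative only.

def one_vs_rest(matrix, positive):
    """Collapse matrix[actual][predicted] to (TP, FN, FP, TN) for `positive`."""
    classes = list(matrix)
    tp = matrix[positive][positive]
    fn = sum(matrix[positive][c] for c in classes if c != positive)
    fp = sum(matrix[c][positive] for c in classes if c != positive)
    tn = sum(matrix[r][c] for r in classes for c in classes
             if r != positive and c != positive)
    return tp, fn, fp, tn

# Example 3-class matrix of frequencies (each row sums to that class's prior):
m = {"A": {"A": 0.30, "B": 0.03, "C": 0.02},
     "B": {"A": 0.04, "B": 0.25, "C": 0.01},
     "C": {"A": 0.02, "B": 0.03, "C": 0.30}}
print(one_vs_rest(m, "A"))   # (0.3, 0.05, 0.06, 0.59)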
There is no answer at this time.
Subject: Re: I need a function to compare classifier confusion matrices in discrete problems
From: mathtalk-ga on 29 May 2005 21:58 PDT
Hi, Mary:

I'm glad I used the word "simple" then, in my Answer long ago in Nov. 2003, but perhaps a clarification of my meaning is called for. I've tried to review the recent Questions posted by you, esp. the detailed Comment posted by Lars (rincewind-ga, hooray for Terry Pratchett fans!). The present Question concerns relating the probabilities P2(X) and PN(X). I think there is a thread of thought here leading all the way back to my earlier Answer, and I feel it best to try and reconstruct the chain as best I can.

The phrase I used then, "a simple binomial distribution", certainly refers to a context in which the outcome is "binary". I guess "trivial" should be reserved for those cases in which only one outcome is possible! When more than two outcomes are possible, repeated independent trials (assuming a fixed underlying probability for the outcomes) could be said to have a "multinomial distribution".

Now it's true that the multinomial (N-outcome) way of looking at "classifier" results might be simplified to a binary framework of just recording "right" or "wrong" results. In general the N outcomes are not equally likely, but there is a "Bayesian" approach for describing the probabilities of "right vs. wrong" in terms of the conditional probabilities of N outcomes given a particular "class" of test data.

But let's back up a bit and try to pin down better what is really sought. After all, your Subject line says, "I need a function to compare classifier confusion matrices in discrete problems". In that regard I would suggest that a statistic which attempts to "compare" all classifiers needs to take into account the underlying probabilities of the actual outcomes. In that connection I would choose as the "benchmark" not a random classifier which picks all the outcomes with equal chances (without regard to the attributes or "input variables" of the test data), but rather one which yields the outcomes according to their natural frequencies.

To get past that point, at least temporarily, let's assume for the sake of discussion that the N outcomes do occur in reality with equal probability, so that the random classifier as Mary has stipulated it would agree with my frame of comparison.

The main point to be made is that the expedient of recording only the rightness or wrongness of a classifier's output leads only to a 2xN tabulation of results, not a 2x2 tabulation. The confusion matrix presumes that the "real" situation is binary, and that the truth (or falsity) of the classifier can then be equated to "true positive" or "true negative" (resp. "false positive" or "false negative").

Suppose the "real" outcomes are A, B, or C as in the illustration. We could indeed summarize a classifier's results (on a suitably broad test set) as:

  True A    True B    True C
  False A   False B   False C

By further restricting attention to cases as A or non-A, we could summarize as:

  True A    True non-A
  False A   False non-A

Even if we correctly simulate the proper frequencies of A, B, C in our test data, it is hard to avoid the prospective conclusion that we are throwing away information about the classifier's performance by restricting attention in this way to A vs. non-A discrimination.

Perhaps this is precisely what Mary wants to do here, and I'm game to provide the necessary formulas from probability to link the NxN matrices to 2x2 matrices under the appropriate simplifying assumptions. I just thought it best to drag out the rationale (or at least the enthusiasm!) for doing so.

regards, mathtalk-ga
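A small Python sketch of the 2xN "right vs. wrong" tabulation described above, purely for illustration (the function name and the example frequencies are made up):

# For each actual class X, "True X" is the frequency of correctly
# classified X examples and "False X" is the frequency of X examples
# classified as anything else. Rows of `matrix` are actual classes,
# columns are predicted classes, entries are frequencies.

def right_wrong_table(matrix):
    table = {}
    for actual, row in matrix.items():
        correct = row[actual]
        wrong = sum(freq for pred, freq in row.items() if pred != actual)
        table[actual] = {"True": correct, "False": wrong}
    return table

m = {"A": {"A": 0.30, "B": 0.03, "C": 0.02},
     "B": {"A": 0.04, "B": 0.25, "C": 0.01},
     "C": {"A": 0.02, "B": 0.03, "C": 0.30}}
for cls, counts in right_wrong_table(m).items():
    print(cls, counts)   # e.g. A {'True': 0.3, 'False': 0.05}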
Subject: Re: I need a function to compare classifier confusion matrices in discrete problems
From: mariane0-ga on 30 May 2005 09:23 PDT
Hi mathtalk, nice to hear from you again.

> a statistic which attempts to "compare" all classifiers needs to take into account the underlying probabilities of the actual outcomes

They are unknown... All we know is the number of examples of each class in our training dataset. This can be used as a substitute, I guess.

I'm actually trying to write this comparison function; here's the justification. Any N-class classification problem can be transformed into several 2-class problems. When the classes can't be ordered, this is usually done by classifying one class vs. the N-1 others and repeating the process for each class, but there are many other, dataset-dependent, possibilities. When the classes can be ordered, even if the order is empirical, it is usually done by classifying the top-most class vs. the others, then the two top-most classes together vs. the others, etc. It is quite possible that the classifier which worked best on the original dataset won't be the same as the one which gets the best results on the 2-class problems.

There are several reasons why someone would do that. A researcher may wish to experiment with several classifiers on a 3-class-values dataset while one of these classifiers is only built to handle 2-class-values problems. A data miner may wish to focus upon a particular class. A user may find large trees difficult to read and so reduce the number of class values in order to reduce the size of the induced decision tree. Supposing he were using C4.5 on the iris dataset, he would notice that grouping the virginicas with the versicolors reduced the number of leaves from 5 to 2 and increased the accuracy from 96% to 99%, and he would then wonder which tree was "best".

Anybody presenting this kind of result now ends up saying something like "on the 2-class problems the average accuracy was 81%, while on the 5-class problems it was only 75%; of course, a 5-class problem is intrinsically more difficult than a 2-class problem". But he can't say on which dataset his classifier worked best while taking into account this increased difficulty, which is a bit unsatisfying for him (and for his audience too, if they notice). As you noticed, this has been bugging me for a while.

I used Rincewind's formula to compare accuracies, but if I want to go any further I need a way to reduce an NxN confusion matrix into a 2x2 confusion matrix, because many measures of classifier performance are based upon formulas with TP, TN, FP and FN.

Let us then consider class A vs. the others. Now, using frequencies (the numbers above over the total number of examples), I can use Rincewind's formula to know TP + TN (= the accuracy on 2 classes). Call the accuracy on the original 3-class problem "Acc3", and the corresponding accuracy on the 2-class problem "Acc2". To be consistent, the error must of course be 1 - the accuracy, so (FP + FN) = 1 - Acc2.

The contribution of class A towards accuracy should be scaled: TP = AA * Acc2 / Acc3, and TN = Acc2 - TP.

The proportion of examples of class A vs. the total number of examples should not be modified: this is a descriptor of the dataset, not a classifier result. So TP + FN = AA + AB + AC -> FN = AA + AB + AC - TP.

Finally, the sum of the frequencies being 1, FP = 1 - (TP + TN + FN).

This is as far as I got, BUT how on earth does this correspond to multinomial probabilities??? You see, I can understand Rincewind's formula, but I don't understand yours. I mean, I can use it, but I wouldn't know how to derive it.
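A literal Python transcription of the steps above, purely as a sketch: Acc2 and Acc3 are taken as given inputs (Rincewind's formula itself is not reproduced here), and the function name, the example matrix and the placeholder value acc2 = 0.90 are made up:

# Reduce a 3x3 frequency matrix to a 2x2 (class A vs. the rest), given
# Acc3 (accuracy on the 3-class problem) and Acc2 (the corresponding
# 2-class accuracy, e.g. from Rincewind's formula, taken as an input here).
# Rows of `matrix` are actual classes, columns are predicted classes.

def reduce_to_2x2(matrix, positive, acc2, acc3):
    classes = list(matrix)
    prior = sum(matrix[positive][c] for c in classes)   # AA + AB + AC
    tp = matrix[positive][positive] * acc2 / acc3       # scale class A's contribution
    tn = acc2 - tp                                       # TP + TN = Acc2
    fn = prior - tp                                      # keep the class prior fixed
    fp = 1.0 - (tp + tn + fn)                            # frequencies sum to 1
    return tp, tn, fp, fn

m = {"A": {"A": 0.30, "B": 0.03, "C": 0.02},
     "B": {"A": 0.04, "B": 0.25, "C": 0.01},
     "C": {"A": 0.02, "B": 0.03, "C": 0.30}}
acc3 = 0.30 + 0.25 + 0.30                                # trace of m = 0.85
print(reduce_to_2x2(m, "A", acc2=0.90, acc3=acc3))       # acc2=0.90 is a placeholder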
So I'm wondering whether my way of calculating this makes any sense from a statistical point of view. I'm hoping that a way of relating the probabilities P2(X) and PN(X) would somehow "prove" that Rincewind's formula is the right choice...?

As for throwing away information, yes, well, I think this can't be helped: there's no way a 2x2 matrix can contain as much information as an NxN matrix, but I'm hoping that it would be a valid answer to this problem to repeat the calculation for all classes (taken one by one vs. all the others) and to average the results... What do you think?

> I'm game to provide the necessary formulas from probability to link the NxN matrices to 2x2 matrices under the appropriate simplifying assumptions

Please do so. If you can think of any way to link this with Rincewind's formula, all the better. If not, I'll just try a few numerical examples to see how the results of both transformations compare to each other.

Regards,
Mary
Subject: Re: I need a function to compare classifier confusion matrices in discrete problems
From: mathtalk-ga on 25 Jun 2005 08:10 PDT
The formulas for RCI (relative classifier information), as based on a confusion matrix Q, are summarized in the last two pages of this presentation:

[Comparison of Machine Learning Algorithms on Different Microarray Data Sets]
http://www.dkfz.de/biostatistics/Reisensburg2004/folien/Brors-2004.pdf

best wishes, mathtalk-ga
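A hedged reading of RCI, for anyone who cannot reach the slides: as I understand it, RCI measures the reduction in uncertainty about the true class provided by the classifier's output, i.e. the mutual information between actual and predicted class divided by the entropy of the actual class distribution, with Q holding joint frequencies. The Python sketch below implements that interpretation; please check it against the formulas in the presentation before relying on it.

# Interpretation of RCI (not a transcription of the slides):
# mutual information between actual and predicted class, normalized by
# the entropy of the actual class distribution. Q[actual][predicted]
# holds joint frequencies that sum to 1.
import math

def rci(q):
    classes = list(q)
    p_actual = {a: sum(q[a][p] for p in classes) for a in classes}
    p_pred = {p: sum(q[a][p] for a in classes) for p in classes}
    h_actual = -sum(v * math.log2(v) for v in p_actual.values() if v > 0)
    mi = sum(q[a][p] * math.log2(q[a][p] / (p_actual[a] * p_pred[p]))
             for a in classes for p in classes if q[a][p] > 0)
    return mi / h_actual

q = {"A": {"A": 0.30, "B": 0.03, "C": 0.02},
     "B": {"A": 0.04, "B": 0.25, "C": 0.01},
     "C": {"A": 0.02, "B": 0.03, "C": 0.30}}
print(rci(q))   # between 0 (no information) and 1 (output determines the class)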