Augh-ROC

Area under the Receiver Operating Characteristic curve, also known as AUROC, is a very popular metric of accuracy in machine learning; still, there are some misconceptions and lesser-known facts about it, which I would like to highlight here.

Somewhat embarrassingly, AUROC is not strictly a measure of accuracy like error rate, precision or recall; it does not assess how well predicted classes agree with the true ones, but how well some predicted confidence score agrees with the true classes. To this end, AUROC pulls much more data from the model and effectively post-optimises it. This improvement can sometimes be realised in practice, mostly when the model is sane overall but the training procedure mis-estimated the internal positive/negative score threshold due to class imbalance; the threshold can then be easily tweaked to regain sane accuracy. It may be spurious as well, though. On the other hand, in comparison with a simple error rate, AUROC effectively weights the misclassification penalty by the model’s belief in such errors, which may or may not make sense (it does not when there is noise in the decision, for instance). Finally, some models simply won’t give you confidence scores, so AUROC makes no sense for them.
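To make the threshold point concrete, here is a small sketch in R. The scores, the class sizes and both thresholds are made up for illustration (nothing here comes from a real model); the helper `accuracyAt` is hypothetical as well. The scores separate the classes perfectly, yet the default cut-off of 0.5 finds no positives at all, while a shifted one recovers everything.

```r
# Toy data: 3 positives, 7 negatives; every positive scores above every
# negative, but all scores sit below the default 0.5 cut-off.
cls   <- c(TRUE, TRUE, TRUE, rep(FALSE, 7))
score <- c(0.40, 0.35, 0.30, 0.20, 0.18, 0.15, 0.12, 0.10, 0.08, 0.05)

# Accuracy of the hard classification obtained at a given threshold
accuracyAt <- function(score, cls, threshold) mean((score > threshold) == cls)

accDefault <- accuracyAt(score, cls, 0.50)  # everything predicted negative
accTuned   <- accuracyAt(score, cls, 0.25)  # threshold moved between the classes

# AUROC as the fraction of positive/negative pairs ranked correctly
# (a tied pair counts as half)
d <- outer(score[cls], score[!cls], "-")
aurocValue <- mean((d > 0) + 0.5 * (d == 0))
```

AUROC only looks at the ordering of the scores, so it is blind to where the threshold was put; the accuracy is not.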

Despite popular belief, AUROC is not invariant to the class order, neither formally nor effectively; swapping the classes transforms an AUROC of $a$ into $1-a$. Consequently, there should be a defined positive class whose objects should get higher scores than the negative ones, which gives AUROC its three pivotal values: AUROC=1, for perfect classification, when there is some threshold score $t$ such that all objects with score $s_i>t$ are positive and all others are negative; AUROC=.5, for random guessing; finally AUROC=0, for anti-perfect classification, when there is some $t$ such that all objects with score $s_i<t$ are positive and all others are negative.
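A quick way to convince yourself about the $1-a$ behaviour is to try it. The sketch below uses a rank-based AUROC function and simulated scores; the seed, the sample sizes and the shift of 1 are arbitrary choices of mine.

```r
# Rank-based AUROC; cls is a logical vector, TRUE marks the positive class
auroc <- function(score, cls) {
  nNeg <- sum(!cls); nPos <- sum(cls)
  U <- sum(rank(score)[!cls]) - nNeg * (nNeg + 1) / 2
  1 - U / nNeg / nPos
}

set.seed(1)                      # arbitrary seed, for reproducibility only
cls   <- rep(c(TRUE, FALSE), c(8, 12))
score <- rnorm(20) + cls         # positives shifted up by 1

a        <- auroc(score, cls)    # original class order
aSwapped <- auroc(score, !cls)   # positive and negative labels exchanged
aNegated <- auroc(-score, cls)   # reversing the scores has the same effect
```

Both `aSwapped` and `aNegated` come out as `1 - a`.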

While the last situation is pretty esoteric and in principle as hard to obtain as the first, people tend to handwave it away and blindly flip every AUROC below one half. And, what’s worse, implement this behaviour in software. Sure, most often this is just a harmless compensation for messing up the class order in the modelling code, but there are situations when it can hurt.

Imagine we have a model fit to some data and actually classifying it well, with an AUROC of, say, 0.9. Now we pull some brand new data through it and get an AUROC of 0.1; does it mean it is basically the same as before? No, obviously; it means everything is seriously screwed up, probably because there are paradoxes in the data (very similar groups of objects with different classes) or you have a bad model and a small sample. Still, overly smart software could silently flip the value into 0.9, leaving you convinced the model is good. On the other hand, you may also want to produce a sample of AUROC values for a simulation or a permutation test; flipping the $<.5$ elements can substantially bias it.
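The bias from flipping is easy to see in a simulation. Below is a minimal sketch, with sample sizes, seed and number of permutations picked arbitrarily by me: it builds a null sample of AUROC values by shuffling the classes, then applies the "smart" flip to it.

```r
# Rank-based AUROC; cls is a logical vector, TRUE marks the positive class
auroc <- function(score, cls) {
  nNeg <- sum(!cls); nPos <- sum(cls)
  U <- sum(rank(score)[!cls]) - nNeg * (nNeg + 1) / 2
  1 - U / nNeg / nPos
}

set.seed(42)                            # arbitrary seed
cls   <- rep(c(TRUE, FALSE), c(10, 10))
score <- rnorm(20)                      # scores unrelated to the classes

# Null sample: AUROC after randomly shuffling the class labels
nullAuroc <- replicate(2000, auroc(score, sample(cls)))

# What over-eager software would report instead
flipped <- pmax(nullAuroc, 1 - nullAuroc)

c(honest = mean(nullAuroc), flipped = mean(flipped))
```

The honest null sample is centred around .5, while the flipped one never goes below .5, so its mean is pushed upwards and any p-value derived from it is off.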

You may have heard about the Mann-Whitney-Wilcoxon test, mostly (wrongly) known as a non-parametric version of Student’s t-test. MWW takes samples from two populations and checks whether it is significantly more likely for a member of one to be higher than a member of the other. In some special cases it is a test of median difference, distribution equivalence or even mean difference; honestly, I don’t get why people are excited about that, since I find the general setting most useful and intuitive, but this is a topic for another entry. Anyway, like most tests, MWW converts the given data into a test statistic, here called $U$, measuring the deviation from the null, and provides a way to convert its value into a p-value (i.e. the distribution of the statistic assuming $H_0$). As it turns out, AUROC is just a normalised $U$; explicitly, AUROC $=1-\frac{U}{n_1n_2}$, where $n_i$ is the number of objects in the $i$-th class. Since $U$ is just the sum of ranks of the scores given to, say, the negative class, minus a correction for this class size, it is much easier to calculate AUROC this way than by integrating the ROC; here is the code in R:

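#cls has to be a logical vector, with TRUEs for positive objects
auroc<-function(score,cls){
 #n1: number of negative objects, n2: number of positive objects
 n1<-sum(!cls); n2<-sum(cls);
 #U: sum of ranks of the negative class minus the class size correction
 U<-sum(rank(score)[!cls])-n1*(n1+1)/2;
 return(1-U/n1/n2);
}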

#... or, because everything is better as a cryptic one-liner:
auroc1l<-function(score,cls)
 mean(rank(score)[cls]-1:sum(cls))/sum(!cls)
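If the rank trick looks like magic, it can be checked against the plain definition: AUROC is the fraction of positive/negative pairs in which the positive object scores higher, with tied pairs counted as half. Here is a sketch of such a check; `aurocPairs` is a name I made up, and the data is simulated and rounded on purpose to produce ties.

```r
# Rank-based AUROC, as in the text
auroc <- function(score, cls) {
  n1 <- sum(!cls); n2 <- sum(cls)
  U <- sum(rank(score)[!cls]) - n1 * (n1 + 1) / 2
  1 - U / n1 / n2
}

# Brute force: compare every positive with every negative
aurocPairs <- function(score, cls) {
  d <- outer(score[cls], score[!cls], "-")
  mean((d > 0) + 0.5 * (d == 0))
}

set.seed(7)                             # arbitrary seed
cls   <- rep(c(TRUE, FALSE), c(12, 18))
score <- round(rnorm(30) + cls, 1)      # rounding creates some ties

byRank  <- auroc(score, cls)
byPairs <- aurocPairs(score, cls)
```

Both numbers agree, ties included, because `rank` assigns average ranks to tied scores by default.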

Moreover, the question whether AUROC is significantly higher than achievable at random is equivalent to the question whether scores for the positive class are higher than for the negative one according to MWW. Assuming no ties in the scores, the distribution of $U$ can be computed exactly just by counting how many possible positive/negative class arrangements yield each value; there is a recursive formula in the Mann and Whitney paper from 1947 (while Wilcoxon invented the test in 1945, M&W did all the boring work to make it usable). R has a built-in *wilcox distribution function family (dwilcox, pwilcox, qwilcox and rwilcox; C source here), which one can use to implement distribution and quantile functions for AUROC:

#p-value of having AUROC of auroc or more
# with nx positive and ny negative objects
pAuroc<-function(auroc,nx,ny){
 W<-round((1-auroc)*nx*ny);
 pwilcox(W,nx,ny)
}

#Minimal AUROC with a p-value smaller than p
# with nx positive and ny negative objects
qAuroc<-function(nx,ny,p=.05){
 ca<-1-(qwilcox(p,nx,ny)-1)/nx/ny
 #Even AUROC=1 won't be significant
 if(!is.finite(ca) || (ca>1)) return(NA);
 return(ca);
}
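The "counting arrangements" idea can be verified by brute force for a tiny sample. The sketch below enumerates every placement of 3 positives among 7 distinct ranks (sizes chosen by me to keep the enumeration small) and compares the counted p-values with what `pwilcox` returns.

```r
nx <- 3; ny <- 4                 # positives and negatives; small on purpose
n  <- nx + ny

# Every way to place the nx positives among n distinct ranks
arrangements <- combn(n, nx)

# AUROC of each arrangement, taking the ranks 1..n as scores
nullAuroc <- apply(arrangements, 2, function(pos) {
  cls <- seq_len(n) %in% pos
  U <- sum(which(!cls)) - ny * (ny + 1) / 2
  1 - U / nx / ny
})

# p-value of each attainable AUROC: by counting, and from pwilcox
observed <- sort(unique(nullAuroc))
pCount   <- sapply(observed, function(a) mean(nullAuroc >= a - 1e-9))
pWilcox  <- pwilcox(round((1 - observed) * nx * ny), nx, ny)
```

`pCount` and `pWilcox` match; in particular, a perfect AUROC of 1 is produced by exactly one of the 35 arrangements.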

When there are ties, it is best to run a permutation test; it is also worth knowing that for larger sample sizes $U$ is approximately normal, $U\sim N(n_1n_2/2,\sqrt{n_1n_2(n_1+n_2+1)/12})$, where the second parameter is the standard deviation.
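Both options fit in a few lines of R. This is only a sketch under my own assumptions: `permPAuroc` and `normPAuroc` are hypothetical names, the number of permutations is arbitrary, and the normal approximation is used as written above, without any continuity or tie correction.

```r
# Rank-based AUROC; cls is a logical vector, TRUE marks the positive class
auroc <- function(score, cls) {
  nNeg <- sum(!cls); nPos <- sum(cls)
  U <- sum(rank(score)[!cls]) - nNeg * (nNeg + 1) / 2
  1 - U / nNeg / nPos
}

# Permutation p-value of getting the observed AUROC or more
permPAuroc <- function(score, cls, nPerm = 5000) {
  observed <- auroc(score, cls)
  permuted <- replicate(nPerm, auroc(score, sample(cls)))
  # +1 counts the observed arrangement itself, so the result is never 0
  (sum(permuted >= observed) + 1) / (nPerm + 1)
}

# Normal approximation of the same p-value, for nx positives, ny negatives
normPAuroc <- function(a, nx, ny) {
  U <- (1 - a) * nx * ny
  pnorm(U, mean = nx * ny / 2, sd = sqrt(nx * ny * (nx + ny + 1) / 12))
}

set.seed(3)                                # arbitrary seed
cls   <- rep(c(TRUE, FALSE), c(15, 15))
score <- round(rnorm(30) + 2 * cls)        # integer scores, heavily tied

pPerm <- permPAuroc(score, cls)
pNorm <- normPAuroc(auroc(score, cls), sum(cls), sum(!cls))
```

With many ties the two can differ noticeably, and the permutation result is the one to trust.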

Nevertheless, we can see how the minimal significant AUROC depends on the sample size and class imbalance:

[Figure: Space of significant AUROC values.]

The minimal number of objects required to have even an AUROC of 1 significant at .05 is 7, for example 4 in one class and 3 in the other; when there is only one object in one class, the other has to have at least 20. On the other hand, for more than 100 objects even discouraging AUROC values like .55 can become significant. Consequently, it is clear that AUROC is neither robust to the sample size nor, despite another popular belief, to the class imbalance; it is much harder to randomly get a high AUROC when classes are balanced, so comparing AUROCs from samples with different class compositions will not be fair (especially, obviously, when they are small). Sure, as I mentioned earlier, AUROC helps in finding models broken by imbalance (this is likely the root of this misconception), but I hope it is clear that this is an entirely different thing.
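The small-sample numbers can be reproduced without the plot. A perfect AUROC of 1 corresponds to $U=0$, so its p-value is just `pwilcox(0, nx, ny)`; the sketch below scans a grid of class sizes (the grid limits are my arbitrary choice, and `pPerfect` is a made-up helper name).

```r
# p-value of a perfect AUROC = 1 (that is, U = 0)
# for nx positive and ny negative objects
pPerfect <- function(nx, ny) pwilcox(0, nx, ny)

# Scan all class size combinations up to 10 by 10
sizes   <- expand.grid(nx = 1:10, ny = 1:10)
sizes$p <- mapply(pPerfect, sizes$nx, sizes$ny)

# Smallest total sample for which AUROC = 1 is significant at .05
significant <- sizes[sizes$p < 0.05, ]
minTotal    <- min(significant$nx + significant$ny)
smallest    <- significant[significant$nx + significant$ny == minTotal, ]

# With a single object in one class: smallest size of the other class
otherClass <- 1:40
pSingle    <- sapply(otherClass, function(n) pPerfect(1, n))
minOther   <- min(otherClass[pSingle < 0.05])
```

Here `smallest` lists the class splits that reach significance with `minTotal` objects, and `minOther` is the size needed opposite a single object.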

Well, so much for now; the code to reproduce the plot is, as usual, available here, and here is a PDF with a printable version of it.


CC-BY mbq, written 5-6-2016, last revised 28-7-2018.
