ROC & AUC

We saw some new things in the logistic regression example, so let’s try to understand them a bit more.

Sigmoid or S-curve

Let’s start with the S-curve. Imagine that your car battery has been giving you trouble, especially when the weather gets colder. Being the awesome data nerd you are, you start collecting data and see that when the temperature drops below 0 Celsius, it becomes a struggle to start the car. Now look at this S-curve again and think of the values on the x-axis as the temperature, and the y-axis as the probability of your car actually starting.

Now looking at the chart, you can see that the chance of the battery being useful at -6c is zero. As the temperature goes up, the probability goes up, rising faster as the temperature approaches 0c, where you stand a 50% chance of actually getting the engine going. And at around the 4c mark, the chance of getting your car going stands close to 100%. This sharp rise in the probability between -2c and 2c is what gives this curve the S shape. So basically, the 0c mark is kind of the activation point for your battery, at which point the likelihood of it working is 50%. Trivial, right?
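
To make the S-curve concrete, here is a minimal sketch of a logistic (sigmoid) function centred at 0c. The midpoint and steepness values are assumptions picked so the numbers roughly match the description above.

import numpy as np

def start_probability(temp_c, midpoint=0.0, steepness=1.2):
    # toy sigmoid: probability that the car starts at a given temperature
    # midpoint and steepness are made-up values for illustration
    return 1.0 / (1.0 + np.exp(-steepness * (temp_c - midpoint)))

for t in [-6, -2, 0, 2, 4]:
    print(f"{t:>3}c -> {start_probability(t):.2f}")  # ~0.00 at -6c, 0.50 at 0c, ~0.99 at 4c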

Prediction vs Reality

Now you go and sit in the car, the temperature outside is 0c and you turn on the ignition; it is anyone’s guess whether the car will start or not. Here are the possible outcomes.

  • You predicted it would start and it did, that’s a True Positive (TP)
  • You predicted it would start but it didn’t, that’s a False Positive (FP)
  • You predicted it wouldn’t start and it didn’t, that’s a True Negative (TN)
  • You predicted it wouldn’t start but it did, that’s a False Negative (FN)

If you accumulate the data across several days, it would look something like this

Temperature (c) | Prediction  | Reality      | Outcome
0               | Will start  | Didn’t start | FP
-1              | Won’t start | Started      | FN
-1              | Won’t start | Didn’t start | TN
2               | Will start  | Didn’t start | FP
0               | Will start  | Started      | TP
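
If you want to produce that Outcome column in code, a small sketch like the one below does the trick (the data is just the five rows from the table above, with True meaning “started”).

def outcome(prediction, reality):
    # prediction and reality are booleans: True means the car started
    if prediction and reality:
        return "TP"
    if prediction and not reality:
        return "FP"
    if not prediction and not reality:
        return "TN"
    return "FN"

# (prediction, reality) pairs, one per row of the table above
rows = [(True, False), (False, True), (False, False), (True, False), (True, True)]
print([outcome(p, r) for p, r in rows])  # ['FP', 'FN', 'TN', 'FP', 'TP']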

Confusion matrix

A neat and concise way to represent the above prediction vs reality table is a confusion matrix, which looks something like this.

                          | Actual: Car Started   | Actual: Didn’t Start
Predicted: Car Started    | # of True Positives   | # of False Positives
Predicted: Didn’t Start   | # of False Negatives  | # of True Negatives

From the above, you can easily tell how good the predictions have been. It can be a bit confusing to decipher at first but, after all, it’s a confusion matrix.
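
If you are using scikit-learn, the confusion_matrix helper gives you these four counts directly. A minimal sketch, assuming the classes are encoded as 1 for “started” and 0 for “didn’t start”, using the five observations from the table above:

from sklearn.metrics import confusion_matrix

y_true = [0, 1, 0, 0, 1]  # reality: 1 = started, 0 = didn't start
y_pred = [1, 0, 0, 1, 1]  # predictions

# for binary 0/1 labels, ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, tn, fn)  # 1 2 1 1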

Threshold

Being someone who plans for the worst case, you want to figure out a safe temperature threshold at which your car would always start, and you find it’s best to set that at 6c. So at 6c and above, your car always revs up (True Positive), BUT this also means that you are not covering the instances below 6c where it also sometimes starts (those become False Negatives).

Now you also have an optimistic side, so you want to catch the situations below 6c where the car will still rev up, and you decide to lower the threshold to 4c. The probability of your car revving up is still pretty high, close to 95%, but sometimes (a 5% chance) you will be wrong. So you continue to lower the threshold and see that as you go lower, you get more True Positives but also more False Positives. You bring it all the way down to -6c, and at first you are pretty excited to see that you now catch all the situations where your car actually does start up, BUT that smiley face immediately turns into a frowny face when you realise that you no longer correctly predict any of the situations where the car won’t start.
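
To see this trade-off in numbers, here is a rough sketch that sweeps a temperature threshold over some made-up observations and counts the True and False Positives at each one (the data is invented purely for illustration).

# made-up (temperature, actually_started) observations
observations = [(-6, False), (-4, False), (-2, False), (-1, True),
                (0, False), (0, True), (2, True), (4, True), (6, True)]

for threshold in [6, 4, 0, -6]:
    # predict "will start" whenever the temperature is at or above the threshold
    tp = sum(1 for t, started in observations if t >= threshold and started)
    fp = sum(1 for t, started in observations if t >= threshold and not started)
    print(f"threshold {threshold:>2}c: TP={tp}, FP={fp}")

Lowering the threshold catches more of the starts (more True Positives) but also racks up more False Positives, exactly as described above.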

Sensitivity and Specificity

The property of the model that aims to minimise the False Positives is called Specificity. Conversely, the property that aims to minimise False Negatives is called Sensitivity. Ideally ..

  • Our model should be 100% sensitive so we don’t get any False Negatives
  • Our model should be 100% specific so we don’t get any False Positives

.. but those situations are hard to come by, so we have to settle somewhere in between. Life is all about give and take .. so depending on your situation, you have to find that sweet spot (test threshold) that gives you the best results.

Maybe in some situations, you don’t want to see any False Positives, so you go for higher specificity, e.g. picking good candidates to loan your hard-earned $100,000 to. In other situations, you don’t want to have any False Negatives, so you go for more sensitivity, e.g. detecting rare diseases.

The math

The numbers from the confusion matrix can be fed into the equations below to figure out the sensitivity and specificity.

    \[ Sensitivity = \frac{True\; Positives}{True\; Positives + False\; Negatives}  \]

    \[ Specificity = \frac{True\; Negatives}{True\; Negatives + False\; Positives}  \]

By the way, there is another term that you probably need to know, it’s called the False Positive Rate, which is basically 1 – Specificity or ..

    \[ False\; Positive\; Rate\; (FPR) = 1 - Specificity = \frac{False\; Positives}{False\; Positives + True\; Negatives}  \]
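
Plugging the counts from a confusion matrix into these formulas is straightforward. A small sketch, with made-up counts:

def sensitivity(tp, fn):
    # of all the times the car actually started, how often did we predict it would?
    return tp / (tp + fn)

def specificity(tn, fp):
    # of all the times the car didn't start, how often did we predict that?
    return tn / (tn + fp)

def false_positive_rate(tn, fp):
    return 1 - specificity(tn, fp)  # equivalently fp / (fp + tn)

print(sensitivity(tp=30, fn=10))          # 0.75
print(specificity(tn=40, fp=20))          # ~0.67
print(false_positive_rate(tn=40, fp=20))  # ~0.33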

ROC (Receiver Operating Characteristic)

The ROC curve is essentially a neat way of showing you the False Positive Rate (1 – specificity) on the x-axis and the True Positive Rate (sensitivity) on the y-axis, at different probability thresholds. Now, look at the chart below again from our logistic regression example.

Looking at the top right corner, we see that the sensitivity is at its highest at 1 and specificity is 0 (as the x-axis shows 1 – specificity). Using the car example above, this is probably the point where you set the threshold at -6c, where you cover ALL of the times when the car starts up (True Positives) BUT, because of your low threshold, your predictions would also tell you that the car started in situations where it actually didn’t (False Positives). Moving to the left of the chart, the FPR goes down (specificity goes up) but the sensitivity goes down too.

In short, the ROC helps you identify the trade-off between sensitivity and specificity at different thresholds. In an ideal world, the chart would look like the orange line above, where the sensitivity is 1 (100%) and the False Positive Rate is 0 (specificity is 1 or 100%). If the model is bad, then it would be closer to the blue line, which indicates a classifier that is no better than the flip of a coin.
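
If you want to draw this curve yourself, scikit-learn’s roc_curve does the heavy lifting. A minimal sketch, assuming you already have a fitted classifier called model and a held-out test set X_test, y_test (those names are placeholders):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# probability of the positive class ("car started") for each test example
probs = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probs)

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="coin flip")  # the diagonal line
plt.xlabel("False Positive Rate (1 - specificity)")
plt.ylabel("True Positive Rate (sensitivity)")
plt.legend()
plt.show()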

To clear it up even more, if we had a situation where the distinction between the 2 outcomes (classes) was very clear, we would end up with a 100% specific and 100% sensitive model. A situation like that could be .. the car never starts below 0c, a very clear boundary. In this situation, we wouldn’t even need probability and could just use

def predict_start(temp_c):
    # a perfectly clean boundary: below 0c the car never starts
    if temp_c < 0:
        return "won't start"
    return "will start"

Unfortunately, this isn’t the case and we have to make an educated guess based on probability. The more muddled the boundary between the 2 classes, the worse our model would perform at any threshold and the closer our ROC will be to the diagonal blue line.

AUC – Area under the curve

The AUC is just that, the area under the ROC curve, expressed as a fraction of the total plot area. In our logistic regression example, it was around 0.82 (82%), which is pretty good. In our ideal-world ROC image above, the AUC would be 1 (100%). The higher this value, the better; if it’s around 0.5, then the model is no better than a coin flip (50-50 probability).
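
scikit-learn can compute this number for you as well; a quick sketch using the same placeholder model and test set as before:

from sklearn.metrics import roc_auc_score

probs = model.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, probs))  # e.g. ~0.82 in the logistic regression example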

Conclusion with a word of warning

The ROC helps us identify the trade-off between the sensitivity and specificity, across all the thresholds. It is a great indicator of the distribution of data across the threshold spectrum and allows us to identify room for improvement in our modelling based on what we are optimising for e.g. specificity vs sensitivity. As for the AUC, the more area under the curve, the better.

It is important to point out here that with simple logistic regression classifiers like scikit-learn’s LogisticRegression, the predict method uses a fixed threshold of 0.5. So anything above a 50% probability is classified as a 1 and anything below as a 0.

What this means is that you might have an AUC of 95%, but looking at the probabilities you can see that the best threshold for your situation is 0.3 instead of 0.5. In this case, you can play around with the model’s weighting parameters, but unfortunately there is no way to pass your own threshold to predict.

If you decide to use a different threshold, you can take the probabilities from the predict_proba function and use those to manually classify your results.
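
A minimal sketch of that manual classification, again assuming a fitted model and a test set, with a threshold of 0.3 picked purely for illustration:

threshold = 0.3  # chosen by inspecting the ROC / probabilities, not a magic number
probs = model.predict_proba(X_test)[:, 1]

# classify as 1 ("will start") whenever the probability clears our own threshold
custom_predictions = (probs >= threshold).astype(int)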

By the way, if you are still struggling with the whole ROC and AUC thing, this video might help.
