How to Compute the Statistical Significance of Two Classifiers’ Performance Difference

A question I see quite often in scientific forums is how to determine whether the observed performance difference between two classifiers is statistically significant. Let us work through an example of how to compute this statistical significance (click on the link below):

How to Compute the Statistical Significance of two Classifiers’ Performance Difference
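
Since the linked walkthrough is not reproduced here, the following is a minimal sketch of the usual procedure: score both classifiers on the same k folds and run a paired t-test on the per-fold accuracy differences Dj. The dataset, the two models, and the fold count are placeholder choices for illustration, not the ones from the original post.

```python
# Minimal sketch: paired t-test on per-fold accuracy differences D_j.
# The dataset and the two classifiers below are placeholders.
import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)  # same folds for both models

acc_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
acc_b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

# Paired t-test on D_j = acc_a_j - acc_b_j across the k folds
t_stat, p_value = ttest_rel(acc_a, acc_b)
print(f"mean difference: {np.mean(acc_a - acc_b):+.4f}")
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # p < 0.05 => significant at the 95% level
```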

Published by alitheia15, Data Mining-Analytics Software Consultant

6 thoughts on “How to Compute the Statistical Significance of Two Classifiers’ Performance Difference”

  1. Great post. My question is: in order for Dj to be normally distributed, we need a large enough number of folds (minimum 25-30). However, in real applications, we typically use 10-fold cross-validation, which complicates the normality assumption. What do you suggest in such cases? Thanks!


    1. Hi Xiang,
      You are right that 10-fold cross-validation is the most typical. In this case, we do what we usually do when we deal with non-normal data: use a non-parametric test. A non-parametric counterpart of the parametric paired t-test is the sign test. As for the confidence interval, it gets a bit tricky, but it can still be calculated. Here is a reference where the confidence interval for the sign test is computed (see also the sketch after the comments below):

      Click to access rank.pdf


  2. Great post. Thank you for sharing this. It made me wonder: given that there is no statistically significant performance difference between the two classifiers you covered, would you therefore only care about using the one that runs fastest, or is there still some reason to consider using both?


    1. Hi Kevin,
      Performance testing and statistical significance testing are both important, and they address different issues. Performance testing can pertain either to the speed of the algorithm or to its accuracy; in this post it pertains to accuracy. Statistical significance testing tells you how the accuracy difference you might see between the two classifiers generalizes to the population of data, and so it allows you to confirm (or not) accuracy differences beyond the specific data you used.


      1. Sure, I got the difference. The question is really: once you have used the test you shared to demonstrate accuracy, is speed the only other differentiator?


      2. Yes, if the accuracy difference between the two classifiers is found to be statistically insignificant, then the only differentiator would be speed.

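As a companion to the reply above, here is a minimal sketch of the sign test via scipy's binomial test: under the null hypothesis of no accuracy difference, positive and negative per-fold differences Dj are equally likely. The Dj values below are made-up illustrations, and the reported interval is a confidence interval for the proportion of positive signs, not the median-difference interval derived in the rank.pdf reference.

```python
# Minimal sketch of the sign test on per-fold accuracy differences D_j.
# Under H0 (no difference), P(D_j > 0) = 0.5, so the count of positive
# signs among non-zero differences follows Binomial(n, 0.5).
import numpy as np
from scipy.stats import binomtest

# Hypothetical D_j values from a 10-fold cross-validation run (illustrative only)
d = np.array([0.012, -0.004, 0.021, 0.008, -0.001,
              0.015, 0.006, -0.009, 0.011, 0.003])

d = d[d != 0]                        # ties (zero differences) are discarded
n_pos, n = int(np.sum(d > 0)), len(d)

result = binomtest(n_pos, n, p=0.5, alternative="two-sided")
ci = result.proportion_ci(confidence_level=0.95)
print(f"positive signs: {n_pos}/{n}, p = {result.pvalue:.4f}")
print(f"95% CI for P(D_j > 0): [{ci.low:.3f}, {ci.high:.3f}]")
```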
