How to Compute the Statistical Significance of Two Classifiers’ Performance Difference

A question I see quite often in scientific forums is how to determine whether the observed performance difference between two classifiers is statistically significant. Let us work through an example of how to compute this statistical significance (click on the link below):

How to Compute the Statistical Significance of two Classifiers’ Performance Difference
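
Since the linked walkthrough is not reproduced here, the following is a minimal sketch of the usual procedure: score both classifiers on the same k folds and run a paired t-test on the per-fold accuracy differences Dj. The dataset, the two models, and the fold count are placeholder choices for illustration, not the ones from the original post.

```python
# Minimal sketch: paired t-test on per-fold accuracy differences D_j.
# The dataset and the two classifiers below are placeholders.
import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)  # same folds for both models

acc_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
acc_b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

# Paired t-test on D_j = acc_a_j - acc_b_j across the k folds
t_stat, p_value = ttest_rel(acc_a, acc_b)
print(f"mean difference: {np.mean(acc_a - acc_b):+.4f}")
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # p < 0.05 => significant at the 95% level
```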

Published by alitheia15, Data Mining-Analytics Software Consultant

6 thoughts on “How to Compute the Statistical Significance of Two Classifiers’ Performance Difference”

  1. Great post. My question is: in order for Dj to be normally distributed, we need a large enough number of folds (minimum 25-30). However, in real applications, we typically use 10-fold cross-validation, which complicates the normality assumption. What do you suggest in such cases? Thanks!


    1. Hi Xiang,
      You are right that 10-fold cross-validation is the most typical. In this case, we do what we usually do when we deal with non-normal data: use a non-parametric test. A non-parametric counterpart of the parametric paired t-test is the sign test. As for the confidence interval, it gets a bit tricky, but it can still be calculated. Here is a reference where the confidence interval for the sign test is computed (see also the sketch after the comments below):

      Click to access rank.pdf


  2. Great post. Thank you for sharing this. It made me wonder: given that there is no statistically significant performance difference between the two classifiers you covered, would you therefore only care about using the one that runs fastest, or is there still some reason to consider using both?


    1. Hi Kevin,
      Performance testing and statistical significance testing are both important, and they address different issues. Performance testing can pertain either to the speed of the algorithm or to its accuracy; in this post it pertains to accuracy. Statistical significance testing tells you how the accuracy difference you might see between the two classifiers generalizes to the population of data, and so it allows you to confirm (or not) accuracy differences beyond the specific data you used.


      1. Sure, I got the difference. The question is really: once you have used the test you shared to demonstrate accuracy, is speed the only other differentiator?


      2. Yes, if the accuracy difference between the two classifiers is found to be statistically insignificant, then the only differentiator would be speed.

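As a companion to the reply above, here is a minimal sketch of the sign test via scipy's binomial test: under the null hypothesis of no accuracy difference, positive and negative per-fold differences Dj are equally likely. The Dj values below are made-up illustrations, and the reported interval is a confidence interval for the proportion of positive signs, not the median-difference interval derived in the rank.pdf reference.

```python
# Minimal sketch of the sign test on per-fold accuracy differences D_j.
# Under H0 (no difference), P(D_j > 0) = 0.5, so the count of positive
# signs among non-zero differences follows Binomial(n, 0.5).
import numpy as np
from scipy.stats import binomtest

# Hypothetical D_j values from a 10-fold cross-validation run (illustrative only)
d = np.array([0.012, -0.004, 0.021, 0.008, -0.001,
              0.015, 0.006, -0.009, 0.011, 0.003])

d = d[d != 0]                        # ties (zero differences) are discarded
n_pos, n = int(np.sum(d > 0)), len(d)

result = binomtest(n_pos, n, p=0.5, alternative="two-sided")
ci = result.proportion_ci(confidence_level=0.95)
print(f"positive signs: {n_pos}/{n}, p = {result.pvalue:.4f}")
print(f"95% CI for P(D_j > 0): [{ci.low:.3f}, {ci.high:.3f}]")
```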
