Category: regression
Leave-one-out cross-validation in R and computation of the predicted residual sum of squares (PRESS) statistic
We will use the gala dataset from the faraway package to demonstrate leave-one-out cross-validation. In this type of validation, one case of the data set is held out as the testing set while the remaining cases are used as the training set for the regression. The process is repeated until every case has served once as the testing set.
The key to building the iterative leave-one-out process in R is creating a column vector c1 and attaching it to gala, as shown below. This lets us uniquely identify the row to be left out in each iteration.
> library(faraway)
> gala[1:3,]
Species Endemics Area Elevation Nearest Scruz Adjacent
Baltra 58 23 25.09 346 0.6 0.6 1.84
Bartolome 31 21 1.24 109 0.6 26.3 572.33
Caldwell 3 3 0.21 114 2.8 58.7 0.78
> c1<-1:30
> gala2<-cbind(gala,c1)
> gala2[1:3,]
Species Endemics Area Elevation Nearest Scruz Adjacent c1
Baltra 58 23 25.09 346 0.6 0.6 1.84 1
Bartolome 31 21 1.24 109 0.6 26.3 572.33 2
Caldwell 3 3 0.21 114 2.8 58.7 0.78 3
> diff1<-numeric(30)
> for(i in 1:30){
+ # fit the model on all rows except row i
+ model1<-lm(Species~Endemics+Area+Elevation,subset=(c1!=i),data=gala2)
+ # predict the held-out case and store its leave-one-out residual
+ specpr<-predict(model1,newdata=gala2[i,])
+ diff1[i]<-gala2$Species[i]-specpr
+ }
> # PRESS is the sum of the squared leave-one-out residuals
> summ1<-sum(diff1^2)
> summ1
[1] 259520.5
The variable summ1 holds the value of the PRESS statistic.
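For a linear model, the same quantity can be obtained without refitting, because the leave-one-out residual equals the ordinary residual divided by one minus the corresponding leverage (hat value). A minimal sketch of this shortcut, using the same three-predictor model (the object names fit and press are ours):
> fit<-lm(Species~Endemics+Area+Elevation,data=gala)
> # PRESS via the identity: LOO residual = e_i/(1 - h_ii)
> press<-sum((residuals(fit)/(1-hatvalues(fit)))^2)
> press # should reproduce the value of summ1 above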
The Akaike and Bayesian information criteria for regression quality
R squared, the coefficient of determination, tells us about the goodness of fit of our regression: it is the proportion of the variation of the dependent variable explained by the regression model, so the greater the explanatory power of our model, the larger the R squared. However, R squared never decreases as variables are added to the model, so it offers no protection against overfitting. Two criteria that do protect us from overfitting are the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Both penalize a model for additional, but not very useful, terms. To put it simply, AIC and BIC encourage model conciseness, while R squared does not. In contrast to R squared, the smaller the AIC and BIC, the better.
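Concretely, AIC = 2k - 2log(L) and BIC = k*log(n) - 2log(L), where L is the maximized likelihood, k the number of estimated parameters, and n the sample size; BIC's log(n) penalty exceeds AIC's for all but the smallest samples, so it punishes extra terms more harshly. A minimal sketch verifying this against R's built-in functions (the object names m, ll, k and n are ours):
> m<-lm(Species~.,data=gala)
> ll<-logLik(m)             # maximized log-likelihood
> k<-attr(ll,"df")          # parameters counted: coefficients plus error variance
> n<-nobs(m)
> 2*k-2*as.numeric(ll)      # should match AIC(m)
> log(n)*k-2*as.numeric(ll) # should match BIC(m)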
Let’s compute AIC and BIC in R, using two different models for the gala dataset. The dataset records the number of species found on each of 30 Galápagos islands, together with geographic variables such as area and elevation.
> library(faraway)
> gala[1:3,]
Species Endemics Area Elevation Nearest Scruz Adjacent
Baltra 58 23 25.09 346 0.6 0.6 1.84
Bartolome 31 21 1.24 109 0.6 26.3 572.33
Caldwell 3 3 0.21 114 2.8 58.7 0.78
> model<-lm(Species~.,data=gala)
> summary(model)
……………………..
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -15.337942 9.423550 -1.628 0.117
Endemics 4.393654 0.481203 9.131 4.13e-09 ***
Area 0.013258 0.011403 1.163 0.257
Elevation -0.047537 0.047596 -0.999 0.328
Nearest -0.101460 0.500871 -0.203 0.841
Scruz 0.008256 0.105884 0.078 0.939
Adjacent 0.001811 0.011879 0.152 0.880
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 28.96 on 23 degrees of freedom
Multiple R-squared: 0.9494, Adjusted R-squared: 0.9362
F-statistic: 71.88 on 6 and 23 DF, p-value: 9.674e-14
Now we use the step function in R, which performs stepwise selection using AIC and returns the model it judges best. This model turns out to be more concise than the full model above. We call it model2, and it is shown below.
> model2<-step(model)
> # step() returns the selected model; here we also refit it explicitly
> model2<-lm(Species~Endemics+Area+Elevation,data=gala)
> summary(model2)
………………………..
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -15.891237 7.569210 -2.099 0.0456 *
Endemics 4.331791 0.304686 14.217 8.97e-14 ***
Area 0.012669 0.008936 1.418 0.1681
Elevation -0.041439 0.023653 -1.752 0.0916 .
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 27.29 on 26 degrees of freedom
Multiple R-squared: 0.9492, Adjusted R-squared: 0.9433
F-statistic: 161.8 on 3 and 26 DF, p-value: < 2.2e-16
> AIC(model)
[1] 295.128
> AIC(model2)
[1] 289.243
> BIC(model)
[1] 306.3376
> BIC(model2)
[1] 296.249
We see that both criteria reward the more concise model2: its AIC and BIC are smaller than those of the full model.
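One caveat when comparing with the values printed by step() during its search: step() reports AIC computed by extractAIC(), which for lm drops an additive constant from the likelihood. Those numbers therefore differ from AIC() output, although they rank models the same way. A quick check:
> extractAIC(model2)   # c(edf, AIC) on step()'s scale
> AIC(model2)          # same ranking, different additive constant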