Reduction of Regression Prediction Error by Incorporating Variable Interactions and Factor Coding

In this post, we work with the mtcars dataset in R. The dataset has 32 observations and 11 variables. Various regression models were fitted to the data, and each one was optimized with respect to AIC using stepwise regression. The prediction error was computed using leave-one-out cross-validation.

The smallest prediction error, and also the smallest regression standard error, was achieved when we incorporated as much knowledge as possible about our independent variables. Specifically, the correlation matrix of the data shows that some of the variables are correlated, and an interaction term was included in the model to account for this. In addition, some of the variables are discrete, taking only a few unique values; this knowledge was incorporated by entering these variables as factors in the model. The complete code for the development and testing of the models is in the link below.

Regression Code Link
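For readers who cannot follow the link, here is a minimal sketch of the kind of model meant above. The particular variable choices (wt, hp, cyl, gear, am) are my own illustrative assumptions, not the linked code:

# Illustrative sketch: an interaction term for correlated predictors plus
# factor coding for the discrete variables, selected by stepwise AIC.
data(mtcars)
full<-lm(mpg~wt*hp+factor(cyl)+factor(gear)+factor(am),data=mtcars)
reduced<-step(full,direction="both",trace=0)  # stepwise regression by AIC
summary(reduced)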

Below is a version that takes into account that some of the categorical variables are ordered. However, the prediction and regression standard errors remain the same as above:

Regression Code Link
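For concreteness, an ordered-factor version of the same idea might look like the following (again an assumed sketch, not the linked code); ordered() makes R use polynomial contrasts for the ordered variables:

# Illustrative sketch: the discrete variables entered as ordered factors.
full.ord<-lm(mpg~wt*hp+ordered(cyl)+ordered(gear)+factor(am),data=mtcars)
reduced.ord<-step(full.ord,direction="both",trace=0)
summary(reduced.ord)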

Alternative ways to compute the cross-validation and test errors in R linear regression

In this post, we will look at two different ways to compute the cross-validation error and two different ways to compute the test error in R linear regression. The data set we will use is the gala set in package faraway.

> library(faraway)

> data(gala)

> index<-1:nrow(gala)

> testindex<-sample(index,trunc(length(index)/3))

> testset<-gala[testindex,]

> trainset<-gala[-testindex,]

> nn<-nrow(trainset) #number of training rows; used in the loops below

> c1<-c(1:nn)

> trainset2<-cbind(trainset,c1)

> diff1<-numeric(nn)

> foo1<-numeric(nn)

> foo2<-numeric(nn)

> diff2<-numeric(nn)

> #Cross-validation using predict() in two different ways: (a) the data points are explicitly defined using a list(); (b) the data points are implicitly given by deleting the output column (column 1).

> newdata<-trainset[,-5:-7]

> for(i in 1:nn){model1<-lm(Species~Endemics+Area+Elevation,subset=(c1!=i),data=trainset2)

+ foo1[i]<-predict(model1,list(Endemics=trainset2[i,2],Area=trainset2[i,3],Elevation=trainset2[i,4]))

+ foo2[i]<-predict(model1,newdata[i,-1])

+ diff1[i]<-trainset2[i,1]-foo1[i]

+ diff2[i]<-trainset2[i,1]-foo2[i]

+ }

#As we see below, the two ways are equivalent.

> diff1

[1]    9.9654289   -0.4143694  -14.5205954   47.9751592    8.1244225

 [6]   14.0009744   -2.5409240   30.3767094   -5.6358873   12.5939696

[11]  -34.3419570 -428.6932984    9.7597007  -25.9936027  -13.6063039

[16]  -12.0757886   86.7751835  -27.2387452  -10.6432398   -9.8346438

> diff2

[1]    9.9654289   -0.4143694  -14.5205954   47.9751592    8.1244225

 [6]   14.0009744   -2.5409240   30.3767094   -5.6358873   12.5939696

[11]  -34.3419570 -428.6932984    9.7597007  -25.9936027  -13.6063039

[16]  -12.0757886   86.7751835  -27.2387452  -10.6432398   -9.8346438

> sum11=0

> for(i in 1:nn) {

+ sum11=sum11+diff1[i]^2

+ }

> sum11<-sum11/nn

#The number below is the cross-validation mean squared error.

> sum11

[1] 9926.731
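As an aside, the summation loop above is equivalent to a one-line vectorized computation:

# Vectorized equivalent of the summation loop:
mean(diff1^2)  # cross-validation mean squared error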


#Computation of the test error in two different ways: (a) using the predict() function to compute the predicted values; (b) using the coefficients of the regression equation to compute the predicted values.

> tt<-nrow(testset)


> model2<-lm(Species~Endemics+Area+Elevation,data=trainset)

> difftest<-numeric(tt)

> newdata2<-testset[,-5:-7]

#Using the predict() function for the computation of the predicted values.

> lmpred<-predict(model2,newdata2[,-1])

> for(i in 1:tt) {

+ difftest[i]=testset[i,1]-lmpred[i]}

> summ2=0

> for(i in 1:tt) {

+ summ2=summ2+difftest[i]^2}

> summ2<-summ2/tt

> summ2

[1] 1091.469

#This is the mean squared test error.

#Now let’s proceed to the second way of computing the test error.

> cc<-as.matrix(cbind(rep(1,tt),testset[,c(2,3,4)]))

> pred.value<-cc%*%model2$coef

> for(i in 1:tt) {

+ difftest[i]<-testset[i,1]-pred.value[i]}

> sum3=0

> for(i in 1:tt) {

+ sum3=sum3+difftest[i]^2}

> sum3<-sum3/tt

> sum3

[1] 1091.469

#As expected, the two different methods of computing the test error yield the same number.
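Both test-error computations can also be collapsed into a single vectorized line:

# Vectorized equivalent of either method above:
mean((testset$Species-predict(model2,newdata2[,-1]))^2)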

Support vector machine regression in R: Tuning the parameters yields the best predictive power

In this example, we will compare four types of regression on the benchmark dataset BostonHousing in package “mlbench”. Specifically, we will compare linear regression, regression trees, support vector machine regression using default parameters, and support vector machine regression using tuned parameters. The predictive power will be computed using the PRESS metric, described in previous posts. Here is the code:

> library(mlbench)

> library(e1071)

> library(rpart)

> data(BostonHousing)

> c1<-c(1:506)

> BostonHousingmod<-cbind(BostonHousing,c1)

Note that 506 is the number of rows in BostonHousing (nrow(BostonHousing)). The c1 variable is introduced purely for the computation of the PRESS statistic, as shown below.
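As an aside, the same leave-one-out loops used below can also be written without the helper column by indexing rows directly; a minimal sketch of mine, using the linear model:

# Alternative leave-one-out indexing without the c1 column:
n<-nrow(BostonHousing)
diffalt<-numeric(n)
for(i in 1:n){
  fit<-lm(medv~.,data=BostonHousing[-i,])
  diffalt[i]<-BostonHousing$medv[i]-predict(fit,BostonHousing[i,])
}
sum(diffalt^2)  # PRESS; should match the looped computation below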

LINEAR REGRESSION CODE AND COMPUTATION OF PRESS METRIC

> difflm<-numeric(506)
> for(i in 1:506) {model1m<-lm(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+b+lstat,subset=(c1!=i), data=BostonHousingmod)
+ newData<-BostonHousingmod[i,-c(14,15)] #drop the response (medv) and c1 columns
+ specpr<-predict(model1m,newData)
+ difflm[i]<-BostonHousingmod[i,14]-specpr
+ }
> summ1=0
> for(i in 1:506) {summ1=summ1+difflm[i]^2}
> summ1
[1] 12005.23

So the PRESS metric is 12005.23.
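For the linear model specifically, the looped PRESS can be cross-checked in closed form via the leave-one-out identity e(i) = e_i/(1 - h_ii), with no refitting. This helper is a sketch of mine, not part of the original post:

# PRESS for a linear model from residuals and hat values:
press.lm<-function(fit) sum((residuals(fit)/(1-hatvalues(fit)))^2)
fit.all<-lm(medv~.-c1,data=BostonHousingmod)
press.lm(fit.all)  # should reproduce the 12005.23 above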

SUPPORT VECTOR MACHINE REGRESSION USING DEFAULT PARAMETERS
> diffsvm<-numeric(506)
> for(i in 1:506) {modelsvm<-svm(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+b+lstat,subset=(c1!=i), data=BostonHousingmod)
+ newData<-BostonHousingmod[i,-c(14,15)]
+ specpr<-predict(modelsvm,newData)
+ diffsvm[i]<-BostonHousingmod[i,14]-specpr
+ }
> summ2=0
> for(i in 1:506) {summ2=summ2+diffsvm[i]^2}
> summ2
[1] 6436.89

So the PRESS metric here is 6436.89.

REGRESSION TREE CODE AND COMPUTATION OF PRESS
> diffrpart<-numeric(506)
> for(i in 1:506) {modelrpart<-rpart(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+b+lstat,subset=(c1!=i), data=BostonHousingmod)
+ newData<-BostonHousingmod[i,-c(14,15)]
+ specpr<-predict(modelrpart,newData)
+ diffrpart[i]<-BostonHousingmod[i,14]-specpr
+ }
> summ3=0
> for(i in 1:506) {summ3=summ3+diffrpart[i]^2}
> summ3
[1] 11452.59

So, the PRESS metric here is 11452.59.

CODE FOR THE TUNING OF THE SUPPORT VECTOR MACHINE REGRESSION PARAMETERS

> difftune<-numeric(506)
> cc<-seq(0.1,0.9,by=0.1)
> cc2<-seq(100,1000,by=100)
> modeltune<-tune.svm(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+b+lstat,gamma=cc,cost=cc2, data=BostonHousingmod)
> modeltune

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma cost
   0.1  100

- best performance: 10.65047
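If desired, the search can be repeated on a finer grid around the optimum. This refinement step is my own suggestion, not part of the original run:

# Optional finer grid around gamma=0.1, cost=100:
cc.fine<-seq(0.05,0.2,by=0.05)
cc2.fine<-seq(50,200,by=50)
modeltune2<-tune.svm(medv~.-c1,gamma=cc.fine,cost=cc2.fine,data=BostonHousingmod)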

CODE FOR THE SUPPORT VECTOR MACHINE REGRESSION USING TUNED PARAMETERS AND COMPUTATION OF THE PRESS METRIC

> for(i in 1:506) {finalmodel<-svm(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+b+lstat,gamma=0.1,cost=100,subset=(c1!=i), data=BostonHousingmod)
+ newData<-BostonHousingmod[i,-c(14,15)]
+ specpr<-predict(finalmodel,newData)
+ difftune[i]<-BostonHousingmod[i,14]-specpr
+ }
> sumnew=0
> for(i in 1:506) {sumnew=sumnew+difftune[i]^2}
> sumnew
[1] 5627.534

So, the tuned support vector machine regression has the BEST (SMALLEST) PRESS metric, with value 5627.534. Therefore, for this example, the tuned support vector machine regression has shown the best predictive power.
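Collecting the four PRESS values side by side makes the comparison explicit:

# PRESS values from the four models above:
c(lm=summ1,svm.default=summ2,rpart=summ3,svm.tuned=sumnew)
# lm: 12005.23, svm.default: 6436.89, rpart: 11452.59, svm.tuned: 5627.534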

By the way, this is what finalmodel contains:
> finalmodel

Call:
svm(formula = medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio +
b + lstat, data = BostonHousingmod, gamma = 0.1, cost = 100, subset = (c1 != i))
Parameters:
SVM-Type: eps-regression
SVM-Kernel: radial
cost: 100
gamma: 0.1
epsilon: 0.1
Number of Support Vectors: 329


My posts on AnalyticBridge regarding regression, cross-validation, and predictive power


http://www.analyticbridge.com/profiles/blogs/cross-validation-in-r-a-do-it-yourself-and-a-black-box-approach

http://www.analyticbridge.com/profiles/blogs/use-press-not-r-squared-to-judge-predictive-power-of-regression

Leave-one-out cross-validation in R and computation of the predicted residual sum of squares (PRESS) statistic

We will use the gala dataset in the faraway package to demonstrate leave-one-out cross-validation. In this type of validation, one case of the data set is left out and used as the testing set, and the remaining data are used as the training set for the regression. This process is repeated until each case in the data set has served as the testing set.

The key step in creating the iterative leave-one-out process in R is creating a column vector c1 and attaching it to gala, as shown below. This allows us to uniquely identify the row that is to be left out in each iteration.

> library(faraway)

> gala[1:3,]

          Species Endemics  Area Elevation Nearest Scruz Adjacent

Baltra         58       23 25.09       346     0.6   0.6     1.84

Bartolome      31       21  1.24       109     0.6  26.3   572.33

Caldwell        3        3  0.21       114     2.8  58.7     0.78

> c1<-c(1:30)

> gala2<-cbind(gala,c1)

> gala2[1:3,]

          Species Endemics  Area Elevation Nearest Scruz Adjacent c1

Baltra         58       23 25.09       346     0.6   0.6     1.84  1

Bartolome      31       21  1.24       109     0.6  26.3   572.33  2

Caldwell        3        3  0.21       114     2.8  58.7     0.78  3

> diff1<-numeric(30)

> for(i in 1:30){model1<-lm(Species~Endemics+Area+Elevation,subset=(c1!=i),data=gala2)

+ specpr<-predict(model1,list(Endemics=gala2[i,2],Area=gala2[i,3],Elevation=gala2[i,4]))

+ diff1[i]<-gala2[i,1]-specpr }


> summ1<-0

> for(i in 1:30){summ1<-summ1+diff1[i]^2}

> summ1

[1] 259520.5

The variable summ1 holds the value of the PRESS statistic.
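As a check, the same PRESS value can be obtained for a linear model without any loop, via the leave-one-out identity e(i) = e_i/(1 - h_ii). This snippet is a sketch of mine, not part of the original post:

# Closed-form PRESS for the gala model, no refitting required:
fit<-lm(Species~Endemics+Area+Elevation,data=gala)
sum((residuals(fit)/(1-hatvalues(fit)))^2)  # should equal summ1 = 259520.5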