Prediction Using Caret Model Ensembling

 

INTRODUCTION

In this blog post, we will use the caret R package to predict the median California housing price. The original dataset can be found on the Kaggle website (https://www.kaggle.com/camnugent/california-housing-prices/kernels); it has 10 columns and 20,640 rows.

The caret package is one of the most useful in R, offering a wide array of capabilities, ranging from data exploration and feature selection to implementation of a large number of models.

SUMMARY OF WORK IN THIS POST

First we sample 4000 random rows from the dataset for faster processing. Then we engineer features: we remove the longitude and latitude columns and create new ratio features, such as people per household. We also remove rows with missing data.

Then we use caret for the following:

  • Center and scale.
  • Creation of the training and test set. 
  • One hot encoding (dummy vars).
  • Feature selection using caret’s RFE method.
  • Implementation of PLS, Lasso, Random Forest, XGB Tree, and SVMpoly regression.
  • Model Comparison and model ensembling.

We also use the doParallel library to spread the model training across multiple CPU cores for faster processing. Finally, we evaluate the performance of the models using the test set error.
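The workflow above can be sketched roughly as follows. This is a minimal illustration, not the code from the RMarkdown script: it assumes the caret and doParallel packages are installed, uses the built-in mtcars data as a stand-in for the housing sample (mpg plays the role of the house value), trains only two simple models ("lm" and "knn"), and leaves out the RFE and stacking steps. The number of workers, the split proportion, and the seed are arbitrary choices.

```r
library(caret)       # createDataPartition, dummyVars, preProcess, train, resamples
library(doParallel)  # registerDoParallel; also attaches foreach and parallel

# Stand-in data (assumption): mtcars instead of the housing sample.
d <- mtcars
d$cyl <- factor(d$cyl)                 # one categorical predictor to encode

cl <- makePSOCKcluster(2)              # 2 workers is an arbitrary choice
registerDoParallel(cl)

# Training / test split, stratified on the response.
set.seed(42)
idx   <- createDataPartition(d$mpg, p = 0.75, list = FALSE)
train <- d[idx, ]
test  <- d[-idx, ]

# One-hot encoding of the predictors (column 1 is the response mpg).
enc     <- dummyVars(~ ., data = train[, -1], fullRank = TRUE)
x_train <- predict(enc, newdata = train[, -1])
x_test  <- predict(enc, newdata = test[, -1])

# Center and scale, using statistics from the training rows only.
pp        <- preProcess(x_train, method = c("center", "scale"))
x_train_s <- as.data.frame(predict(pp, x_train))
x_test_s  <- as.data.frame(predict(pp, x_test))

# Two models trained on the same 5-fold cross-validation splits.
ctrl <- trainControl(method = "cv", number = 5)
set.seed(42)
fit_lm  <- train(x = x_train_s, y = train$mpg, method = "lm",  trControl = ctrl)
set.seed(42)
fit_knn <- train(x = x_train_s, y = train$mpg, method = "knn", trControl = ctrl)

stopCluster(cl)
registerDoSEQ()                        # return foreach to sequential mode

# Compare resampled performance, then check the test set error.
summary(resamples(list(lm = fit_lm, knn = fit_knn)))
rmse_lm  <- RMSE(predict(fit_lm,  x_test_s), test$mpg)
rmse_knn <- RMSE(predict(fit_knn, x_test_s), test$mpg)
```

Fitting the encoder and the scaler on the training rows and then applying them to the test rows keeps test information out of the preprocessing.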

Here is the link to the RMarkdown script:

CODE

CONCLUSION

As expected, the stacked model yielded the smallest test set error, although only by a small margin.

 

 

 

 

Reduction of Regression Prediction Error by Incorporating Variable Interactions and Factorization

In this post, we work with the mtcars dataset in R. The dataset has 32 observations and 11 variables. Various regression models were tried on the data. Each of these models was optimized with respect to AIC, using stepwise regression. The prediction error was computed using leave-one-out cross-validation.

The smallest prediction error, and also the smallest regression standard error, was achieved when we incorporated as much knowledge as possible about the independent variables. Specifically, the correlation matrix of the data shows that some of the variables are correlated; to account for that, an interaction term was included in the model. In addition, some of the variables are discrete and take only a few unique values. This knowledge was incorporated by entering these variables into the model as factors. The complete code for the development and testing of the models is in the link below.
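As a rough illustration of the approach described above, the base-R sketch below enters two discrete variables as factors, adds one interaction term, runs step() for AIC-based selection, and computes the leave-one-out prediction error. The particular choices (cyl and am as factors, the wt:hp interaction, the starting formula) are assumptions made for the example and are not necessarily those of the linked code. For simplicity, the formula selected on the full data is refit in each leave-one-out round rather than repeating the stepwise search each time.

```r
d <- mtcars
d$cyl <- factor(d$cyl)    # discrete variable with 3 unique values
d$am  <- factor(d$am)     # discrete variable with 2 unique values

# Starting model with an interaction between two correlated predictors.
full <- lm(mpg ~ wt * hp + cyl + am + qsec, data = d)

# Backward stepwise selection guided by AIC (trace = 0 silences the log).
best <- step(full, trace = 0)

# Leave-one-out cross-validation: refit without row i, predict row i.
loo_err <- sapply(seq_len(nrow(d)), function(i) {
  fit_i <- lm(formula(best), data = d[-i, ])
  d$mpg[i] - predict(fit_i, newdata = d[i, ])
})

press  <- sum(loo_err^2)          # predicted residual sum of squares
sigma0 <- summary(best)$sigma     # regression standard error of the chosen model
```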

Regression Code Link

Below is a version that takes into account that some of the categorical variables are ordered. However, the prediction error and the regression standard error remain the same as above:

Regression Code Link

3-way Variable Selection in R Regression (lasso, stepwise, and best subset)

In this post, you can find code (link below) for doing variable selection in R regression in three different ways. The variable selection was done on the well-known prostate dataset. The data come already split into training and test cases. The regressions were fitted on the training data and the prediction mean squared error was then computed on the test data.

  • Stepwise regression: Here we use the R function step(), where the AIC criterion serves as a guide to add or delete variables. The model returned by step() is the one with the lowest AIC found during the search.
  • Lasso regression: This is a form of penalized regression that does feature selection inherently. Penalized regression adds bias to the coefficient estimates in order to reduce their variance and therefore reduce prediction error and avoid overfitting. Lasso regression sets some coefficients exactly to zero, and therefore does implicit feature selection.
  • Best subset regression: Here we use the R package leaps and specifically the function regsubsets(), which returns the best model of each size m = 1, …, n, where n is the number of input variables.

Regarding which variables are removed, it is interesting to note that:

  • Lasso regression and stepwise regression result in the removal of the same variable (gleason).
  • In best subset selection, when we select the model with the smallest Cp (Mallows' Cp), the best subset is the one of size 7, with one variable removed (gleason again). When we select the subset with the smallest BIC (Bayesian Information Criterion), the best subset is the one of size 2 (the two variables that remain are lcavol and lweight).

The smallest test errors are achieved by lasso regression and by the best subset model of size 2.
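The three approaches can be sketched side by side as below. This is an illustration under stated assumptions, not the linked code: it assumes the leaps and glmnet packages are installed, uses mtcars as a stand-in for the prostate data (with an arbitrary random train/test split, since mtcars has no built-in one), and picks the lasso penalty by 5-fold cross-validation.

```r
library(leaps)    # regsubsets
library(glmnet)   # cv.glmnet

# Stand-in data and split (assumptions): mtcars, 8 random test rows.
set.seed(1)
test_id <- sample(nrow(mtcars), 8)
train   <- mtcars[-test_id, ]
test    <- mtcars[test_id, ]
mse     <- function(pred) mean((test$mpg - pred)^2)

# 1) Stepwise regression guided by AIC.
step_fit <- step(lm(mpg ~ ., data = train), trace = 0)
mse_step <- mse(predict(step_fit, newdata = test))

# 2) Lasso (alpha = 1); coefficients shrunk to exactly zero are dropped.
x_train   <- as.matrix(train[, -1])
x_test    <- as.matrix(test[, -1])
cv_fit    <- cv.glmnet(x_train, train$mpg, alpha = 1, nfolds = 5)
mse_lasso <- mse(drop(predict(cv_fit, newx = x_test, s = "lambda.min")))
lasso_coef <- coef(cv_fit, s = "lambda.min")

# 3) Best subset of every size, then choose a size by Cp or by BIC.
subsets  <- regsubsets(mpg ~ ., data = train, nvmax = 10)
ss       <- summary(subsets)
size_cp  <- which.min(ss$cp)
size_bic <- which.min(ss$bic)

# regsubsets has no predict method, so predictions are computed by hand.
predict_subset <- function(size) {
  b <- coef(subsets, id = size)
  X <- model.matrix(mpg ~ ., data = test)
  drop(X[, names(b), drop = FALSE] %*% b)
}
mse_cp  <- mse(predict_subset(size_cp))
mse_bic <- mse(predict_subset(size_bic))
```

BIC penalizes model size more heavily than Cp, which is why it tends to pick a smaller subset.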

Code for regression variable selection

 

Prediction in R using Ridge Regression

Ridge regression is a regularization method in which the coefficients are shrunk in order to reduce the variance of the solution and therefore improve prediction accuracy. Below we will implement ridge regression on the longley and prostate data sets using two methods: the lm.ridge() function and the linearRidge() function. Pay special attention to the scaling of the coefficients and the offsetting of the predicted values for the lm.ridge() function.
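The scaling and offsetting issue can be illustrated with a short sketch, assuming lm.ridge() is the one from the MASS package and using the built-in longley data. The penalty value lambda = 0.01 is an arbitrary choice for the example. The $coef component of the fitted object holds coefficients for the standardized predictors, so they must be divided by $scales and combined with an intercept built from $ym and $xm before predicting on the original scale; coef() does the same conversion.

```r
library(MASS)   # lm.ridge

fit <- lm.ridge(Employed ~ ., data = longley, lambda = 0.01)

# Predictors in the same column order as the fitted coefficients.
X <- model.matrix(Employed ~ ., data = longley)[, -1]

# Manual conversion: undo the scaling, then add the offset (intercept).
slopes    <- fit$coef / fit$scales
intercept <- fit$ym - sum(slopes * fit$xm)
pred_manual <- drop(intercept + X %*% slopes)

# Same thing via coef(), which returns intercept + original-scale slopes.
pred_coef <- drop(cbind(1, X) %*% coef(fit))

# With lambda = 0 the ridge fit should reduce to ordinary least squares.
fit0  <- lm.ridge(Employed ~ ., data = longley, lambda = 0)
pred0 <- drop(cbind(1, X) %*% coef(fit0))
ols   <- fitted(lm(Employed ~ ., data = longley))
```

Using fit$coef directly on unscaled predictors, or forgetting the intercept, gives predictions that are off by a scale factor or a constant.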

Ridge regression in R examples

Support vector machine regression in R: Tuning the parameters yields the best predictive power

In this example, we will compare four types of regression on the benchmark dataset BostonHousing in the package “mlbench”. Specifically, we will compare linear regression, a regression tree, support vector machine regression using default parameters, and support vector machine regression using tuned parameters. The predictive power will be computed using the PRESS metric, described in previous posts. Here is the code:

> library(mlbench)
> library(e1071)
> library(rpart)
> data(BostonHousing)
> c1 <- 1:nrow(BostonHousing)
> BostonHousingmod <- cbind(BostonHousing, c1)

NOTE: c1 is a row index running from 1 to 506, the number of rows in BostonHousing. It is introduced purely for the computation of the PRESS statistic, as shown below.

LINEAR REGRESSION CODE AND COMPUTATION OF PRESS METRIC

> difflm<-numeric(506)
> for(i in 1:506) {model1m<-lm(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+b+lstat,subset=(c1!=i), data=BostonHousingmod)
+ newData<-BostonHousingmod[i,-c(14,15)]   # row i without medv (column 14) and c1 (column 15)
+ specpr<-predict(model1m,newData)
+ difflm[i]<-BostonHousingmod[i,14]-specpr
+ }
> summ1<-sum(difflm^2)
> summ1
[1] 12005.23

So the PRESS metric is 12005.23.
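For ordinary linear regression, the leave-one-out residuals can also be obtained from a single fit: each ordinary residual is divided by one minus the corresponding leverage (hat value). The sketch below, which assumes the mlbench package is installed, computes PRESS both ways so the shortcut can be checked against the explicit loop. It uses negative row indexing instead of the c1 column. This shortcut applies only to the linear model, not to the tree or support vector machine fits.

```r
library(mlbench)
data(BostonHousing)

# Single fit on all rows; medv ~ . uses the same 13 predictors as above.
fit <- lm(medv ~ ., data = BostonHousing)

# Shortcut: leave-one-out residual = residual / (1 - leverage).
press_shortcut <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)

# Explicit leave-one-out loop for comparison.
press_loop <- sum(sapply(seq_len(nrow(BostonHousing)), function(i) {
  fit_i <- lm(medv ~ ., data = BostonHousing[-i, ])
  (BostonHousing$medv[i] - predict(fit_i, newdata = BostonHousing[i, ]))^2
}))

all.equal(press_shortcut, press_loop)
```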

SUPPORT VECTOR MACHINE REGRESSION USING DEFAULT PARAMETERS
> diffsvm<-numeric(506)
> for(i in 1:506) {modelsvm<-svm(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+b+lstat,subset=(c1!=i), data=BostonHousingmod)
+ newData<-BostonHousingmod[i,-c(14,15)]   # row i without medv (column 14) and c1 (column 15)
+ specpr<-predict(modelsvm,newData)
+ diffsvm[i]<-BostonHousingmod[i,14]-specpr
+ }
> summ2<-sum(diffsvm^2)
> summ2
[1] 6436.89

So the PRESS metric here is 6436.89.

REGRESSION TREE CODE AND COMPUTATION OF PRESS
> diffrpart<-numeric(506)
> for(i in 1:506) {modelrpart<-rpart(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+b+lstat,subset=(c1!=i), data=BostonHousingmod)
+ newData<-BostonHousingmod[i,-c(14,15)]   # row i without medv (column 14) and c1 (column 15)
+ specpr<-predict(modelrpart,newData)
+ diffrpart[i]<-BostonHousingmod[i,14]-specpr
+ }
> summ3<-sum(diffrpart^2)
> summ3
[1] 11452.59

So, the PRESS metric here is 11452.59.

CODE FOR THE TUNING OF THE SUPPORT VECTOR MACHINE REGRESSION PARAMETERS

> difftune<-numeric(506)
> cc<-seq(0.1,0.9,by=0.1)
> cc2<-seq(100,1000,by=100)
> modeltune<-tune.svm(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+b+lstat,gamma=cc,cost=cc2, data=BostonHousingmod)
> modeltune

Parameter tuning of ‘svm’:

- sampling method: 10-fold cross validation

- best parameters:
 gamma cost
   0.1  100

- best performance: 10.65047

CODE FOR THE SUPPORT VECTOR MACHINE REGRESSION USING TUNED PARAMETERS AND COMPUTATION OF THE PRESS METRIC

> for(i in 1:506) {finalmodel<-svm(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+b+lstat,gamma=0.1,cost=100,subset=(c1!=i), data=BostonHousingmod)
+ newData<-BostonHousingmod[i,-c(14,15)]   # row i without medv (column 14) and c1 (column 15)
+ specpr<-predict(finalmodel,newData)
+ difftune[i]<-BostonHousingmod[i,14]-specpr
+ }
> sumnew<-sum(difftune^2)
> sumnew
[1] 5627.534

So the tuned support vector machine regression has the best (smallest) PRESS metric, with a value of 5627.534. Therefore, for this example, the tuned support vector machine regression has shown the best predictive power of the four models.
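The tuned parameters do not have to be copied by hand from the printed output: the object returned by tune.svm() stores them in its best.parameters component. The sketch below is an illustration only. It assumes e1071 is installed, uses mtcars with two predictors as a small stand-in so that it runs quickly, and uses a hypothetical helper press_loo() that is not part of any package. The grid values and the seed are arbitrary.

```r
library(e1071)   # svm, tune.svm

# Hypothetical helper: leave-one-out PRESS for any fitting function.
# fit_fun takes a data frame and returns a model with a predict() method.
press_loo <- function(fit_fun, data, response) {
  errs <- vapply(seq_len(nrow(data)), function(i) {
    fit <- fit_fun(data[-i, , drop = FALSE])
    data[[response]][i] -
      as.numeric(predict(fit, newdata = data[i, , drop = FALSE]))
  }, numeric(1))
  sum(errs^2)
}

# Grid search over gamma and cost with 10-fold cross-validation (the default).
set.seed(1)
tuned <- tune.svm(mpg ~ wt + hp, data = mtcars,
                  gamma = c(0.1, 0.5, 1), cost = c(1, 10, 100))
best  <- tuned$best.parameters    # one-row data frame: gamma, cost

# PRESS with default and with tuned parameters.
press_default <- press_loo(function(d) svm(mpg ~ wt + hp, data = d),
                           mtcars, "mpg")
press_tuned   <- press_loo(function(d) svm(mpg ~ wt + hp, data = d,
                                           gamma = best$gamma,
                                           cost  = best$cost),
                           mtcars, "mpg")
```

Because the helper drops row i by negative indexing, no extra index column is needed in the data.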

By the way, this is what the finalmodel contains:
> finalmodel

Call:
svm(formula = medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio +
b + lstat, data = BostonHousingmod, gamma = 0.1, cost = 100, subset = (c1 != i))
Parameters:
SVM-Type: eps-regression
SVM-Kernel: radial
cost: 100
gamma: 0.1
epsilon: 0.1
Number of Support Vectors: 329

 

 

My posts on AnalyticBridge regarding regression, cross-validation, and predictive power

 

http://www.analyticbridge.com/profiles/blogs/cross-validation-in-r-a-do-it-yourself-and-a-black-box-approach

http://www.analyticbridge.com/profiles/blogs/use-press-not-r-squared-to-judge-predictive-power-of-regression

Leave-one-out cross-validation in R and computation of the predicted residual sum of squares (PRESS) statistic

We will use the gala dataset in the faraway package to demonstrate leave-one-out cross-validation. In this type of validation, one case of the data set is left out and used as the testing set, and the remaining data are used as the training set for the regression. This process is repeated until each case in the data set has served as the testing set.

The key step in creating the iterative leave-one-out process in R is creating a column vector c1 and attaching it to gala, as shown below. This allows us to uniquely identify the row that is to be left out in each iteration.
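In symbols, the procedure can be written as follows, where the notation is introduced here for illustration: $y_i$ is the observed response of case $i$, $\hat{y}_{(i)}$ is the prediction for case $i$ from the model fitted without that case, and $n$ is the number of cases.

```latex
\mathrm{PRESS} \;=\; \sum_{i=1}^{n} \left( y_i - \hat{y}_{(i)} \right)^2
```

A smaller PRESS means the model predicts held-out cases more accurately.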

> library(faraway)
> gala[1:3,]
          Species Endemics  Area Elevation Nearest Scruz Adjacent
Baltra         58       23 25.09       346     0.6   0.6     1.84
Bartolome      31       21  1.24       109     0.6  26.3   572.33
Caldwell        3        3  0.21       114     2.8  58.7     0.78
> c1<-c(1:30)
> gala2<-cbind(gala,c1)
> gala2[1:3,]
          Species Endemics  Area Elevation Nearest Scruz Adjacent c1
Baltra         58       23 25.09       346     0.6   0.6     1.84  1
Bartolome      31       21  1.24       109     0.6  26.3   572.33  2
Caldwell        3        3  0.21       114     2.8  58.7     0.78  3
> diff1<-numeric(30)
> for(i in 1:30){model1<-lm(Species~Endemics+Area+Elevation,subset=(c1!=i),data=gala2)
+ specpr<-predict(model1,list(Endemics=gala2[i,2],Area=gala2[i,3],Elevation=gala2[i,4]))
+ diff1[i]<-gala2[i,1]-specpr }
> summ1<-sum(diff1^2)
> summ1
[1] 259520.5

The variable summ1 holds the value of the PRESS statistic.