Prediction Using Caret Model Ensembling

 

INTRODUCTION

In this blog post, we will use the caret R package to predict the median California housing price. The original dataset can be found on Kaggle at https://www.kaggle.com/camnugent/california-housing-prices/kernels and it has 10 columns and 20,640 rows.

The caret package is one of the most useful in R, offering a wide array of capabilities, ranging from data exploration and feature selection to implementation of a large number of models.

SUMMARY OF WORK IN THIS POST

First, we sample 4,000 random rows from the dataset for faster processing. Then we do feature engineering by removing the longitude and latitude columns and creating new proportion features, such as people per household. We also remove rows with missing data.
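Below is a minimal sketch of this preparation step. It assumes the Kaggle CSV has been saved as housing.csv and uses that file's column names; the particular ratio features shown are illustrative.

# Sketch of the data preparation described above (assumed file name: housing.csv)
housing <- read.csv("housing.csv")

set.seed(123)
housing <- housing[sample(nrow(housing), 4000), ]              # 4000 random rows

housing <- subset(housing, select = -c(longitude, latitude))   # drop the coordinates

# Example proportion features
housing$people_per_household <- housing$population / housing$households
housing$rooms_per_household  <- housing$total_rooms / housing$households

housing <- na.omit(housing)                                    # drop rows with missing data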

Then we use caret for the following:

  • Center and scale.
  • Creation of the training and test set. 
  • One hot encoding (dummy vars).
  • Feature selection using caret’s RFE method.
  • Implementation of PLS, Lasso, Random Forest, XGB Tree, and SVMpoly regression.
  • Model Comparison and model ensembling.

It is also noteworthy that we utilize multiple CPU cores for faster processing via the doParallel library. Finally, we evaluate the performance of the models using the test set error.
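Below is a sketch of how these steps can be wired together with caret and caretEnsemble. It assumes the prepared data frame from the previous step is called housing and that the response column is named median_house_value, as in the Kaggle file; the exact tuning settings in the linked RMarkdown script may differ.

library(caret)
library(caretEnsemble)
library(doParallel)

cl <- makeCluster(detectCores() - 1)     # use multiple cores
registerDoParallel(cl)

# One hot encoding of the ocean_proximity factor (dummy variables)
dmy <- dummyVars(median_house_value ~ ., data = housing)
x   <- data.frame(predict(dmy, newdata = housing))
y   <- housing$median_house_value

# Training and test sets
set.seed(123)
in_train <- createDataPartition(y, p = 0.8, list = FALSE)
x_train <- x[in_train, ];  y_train <- y[in_train]
x_test  <- x[-in_train, ]; y_test  <- y[-in_train]

# Center and scale (parameters estimated on the training set only)
pp      <- preProcess(x_train, method = c("center", "scale"))
x_train <- predict(pp, x_train)
x_test  <- predict(pp, x_test)

# Feature selection with caret's RFE method
rfe_ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
rfe_fit  <- rfe(x_train, y_train, sizes = 2:ncol(x_train), rfeControl = rfe_ctrl)
keep     <- predictors(rfe_fit)

# PLS, lasso, random forest, xgbTree and svmPoly models, then a stacked ensemble
tr_ctrl <- trainControl(method = "cv", number = 5, savePredictions = "final")
models  <- caretList(x_train[, keep], y_train, trControl = tr_ctrl,
                     methodList = c("pls", "lasso", "rf", "xgbTree", "svmPoly"))
stack   <- caretStack(models, method = "glm")

# Model comparison on the resamples, and test set error for the stack
summary(resamples(models))
stack_pred <- as.numeric(unlist(predict(stack, newdata = x_test[, keep])))
postResample(stack_pred, y_test)     # RMSE, R-squared, MAE on the test set

stopCluster(cl)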

Here is the link to the RMarkdown script:

CODE

CONCLUSION

As expected, the stacked model yielded the smallest test set error, although not by a large margin.

 

 

 

 

Principal Components-Based Modeling for the Prediction of a Brand’s Image Rating

All brands strive to have an excellent Image in their respective markets, where Image comprises components such as reputation and trustworthiness. For this purpose, companies often conduct surveys among their customers to ask about their satisfaction, their complaints, and whether their expectations are being met. An interesting question is whether, given such survey results, we can predict the brand's Image.

Package plspm in R contains a dataset called mobile, which contains variables that pertain to this issue. Specifically, it contains 24 variables that encapsulate the following latent (underlying) concepts: Image, Expectations, Quality, Value, Satisfaction, Complaints, and Loyalty. The variables are on a 0-100 scale. For example, the Image latent concept has five corresponding variables in the dataset, and the Expectations latent concept has three. There are 250 observations in total.

A methodology often used for data with latent concepts is Principal Components Analysis (PCA), which is based on the eigenvectors and eigenvalues of the data covariance matrix. In this post, we will employ Principal Components Regression (PCR), Partial Least Squares Regression (PLSR), and Ridge Regression (all related to PCA) to build models that predict the average Image rating of the products in the dataset. The response variable is therefore the average of the five Image-related variables, and the predictors are the variables related to the other concepts (Value, Satisfaction, etc.). Although all of these regressions are based on the principal components idea, they differ in how they treat the low- and high-variance directions. PCR keeps a certain number of high-variance directions and discards the rest. PLSR inflates some of the high-variance directions while shrinking the low-variance directions [Has13]. Ridge Regression, on the other hand, shrinks all directions, but with a preference for the high-variance ones (i.e., it shrinks the low-variance directions more). Finally, for comparison purposes, we also fit an ordinary linear regression on the same predictors. As shown in the code below, PCR and PLSR do best in predicting the average Image rating. This suggests that, for this dataset, keeping or inflating the high-variance directions produces better predictions than Ridge Regression's graduated shrinkage.
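The sketch below outlines these four regressions. It assumes that data(mobile, package = "plspm") loads the dataset described above and that the Image indicators can be picked out by column names containing "IMAG"; if the columns are named differently, substitute their positions.

library(plspm)   # mobile dataset
library(pls)     # pcr() and plsr()
library(MASS)    # lm.ridge()

data(mobile, package = "plspm")

image_cols <- grep("IMAG", names(mobile))          # assumed naming of the Image indicators
y   <- rowMeans(mobile[, image_cols])              # response: average Image rating
dat <- data.frame(y = y, mobile[, -image_cols])    # predictors: all other indicators

# Principal Components Regression and Partial Least Squares Regression,
# with the number of components chosen by cross-validation
pcr_fit  <- pcr(y ~ ., data = dat, scale = TRUE, validation = "CV")
plsr_fit <- plsr(y ~ ., data = dat, scale = TRUE, validation = "CV")
validationplot(pcr_fit, val.type = "RMSEP")        # CV error versus number of components

# Ridge regression over a grid of shrinkage parameters (GCV choice of lambda)
ridge_fit <- lm.ridge(y ~ ., data = dat, lambda = seq(0, 10, by = 0.1))
select(ridge_fit)

# Ordinary linear regression for comparison
lm_fit <- lm(y ~ ., data = dat)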

[Has13] Hastie, T., Tibshirani, R. and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2013.

Below is the link for the full R code:

Code for Regressions based on Principal Components

An Example of R Logistic Regression for Weather Prediction (updated 10/17)

In this post, we will see how we can use logistic regression in R, invoked via the generalized linear model function glm(), to predict the value of a dichotomous variable that indicates whether it will rain in Australia. We use the weather dataset from the rattle package.
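A minimal sketch of the approach is shown below; the predictor subset is illustrative and not necessarily the one used in the linked script.

library(rattle)   # provides the weather dataset (one year of daily observations)

data(weather)
w <- na.omit(weather)

set.seed(42)
train_idx <- sample(nrow(w), round(0.7 * nrow(w)))

# Logistic regression via glm() with a binomial family
fit <- glm(RainTomorrow ~ MinTemp + MaxTemp + Sunshine + WindGustSpeed +
             Humidity3pm + Pressure3pm + Cloud3pm,
           data = w[train_idx, ], family = binomial)

# Predicted probability of rain for the held-out days, turned into Yes/No
p    <- predict(fit, newdata = w[-train_idx, ], type = "response")
pred <- ifelse(p > 0.5, "Yes", "No")
table(predicted = pred, actual = w$RainTomorrow[-train_idx])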

Below is the link for the full code:

https://drive.google.com/file/d/0B5vwqG-vGN-ZUE03QkJNYU1jMUk/view?usp=sharing

Code for R Logistic Regression

Two-way ANOVA in R

In this post we look at how to compute the two-way ANOVA of a balanced design. The dataset is weightgain from the HSAUR package; it records the weight gains of rats put on four different diets defined by two factors: the source of protein (beef or cereal) and the type of protein (high or low).
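A short sketch of the computation (the linked code also produces the plots):

library(HSAUR)   # weightgain dataset

data("weightgain", package = "HSAUR")

# Two-way ANOVA with interaction: weight gain explained by protein source,
# protein type, and their interaction
fit <- aov(weightgain ~ source * type, data = weightgain)
summary(fit)

# Main effects and interaction plots
plot.design(weightgain)
with(weightgain,
     interaction.plot(x.factor = type, trace.factor = source, response = weightgain))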

The R Code and graphs (main effects, interactions) are shown in the link below.

Two-way ANOVA Code Link

Reduction of Regression Prediction Error by Incorporating Variable Interactions and Factor Variables

In this post, we work with the mtcars dataset in R, which has 32 observations and 11 variables. Various regression models were fitted to the data, and each one was optimized with respect to AIC using stepwise regression. The prediction error was computed using leave-one-out cross-validation.

The smallest prediction error, and also the smallest regression standard error, was achieved when we incorporated as much knowledge as possible about the independent variables. Specifically, the correlation matrix of the data shows that some of the variables are correlated, so an interaction term was included in the model to account for this. In addition, some of the variables are discrete, taking only a few unique values, and this was incorporated by entering them as factors in the regression. The complete code for the development and testing of the models is in the link below.
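The sketch below illustrates the strategy. The particular interaction term (hp:wt) and the choice of which variables to convert to factors are illustrative assumptions; the linked code contains the models that were actually compared.

data(mtcars)
d <- mtcars
d$cyl  <- factor(d$cyl)     # discrete variables entered as factors
d$gear <- factor(d$gear)

full <- lm(mpg ~ . + hp:wt, data = d)     # model with an interaction term
best <- step(full, trace = 0)             # AIC-guided stepwise selection

# Leave-one-out cross-validation of the selected model
n   <- nrow(d)
err <- numeric(n)
for (i in 1:n) {
  fit_i  <- lm(formula(best), data = d[-i, ])
  err[i] <- d$mpg[i] - predict(fit_i, newdata = d[i, ])
}
mean(err^2)            # leave-one-out mean squared prediction error
summary(best)$sigma    # residual (regression) standard error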

Regression Code Link

Below is a version that takes into account that some categorical variables are ordered. However, the prediction and standard regression errors remain the same as above:

Regression Code Link

3-way Variable Selection in R Regression (lasso, stepwise, and best subset)

In this post, you can find code (link below) for doing variable selection in R regression in three different ways. The variable selection was performed on the well-known prostate dataset, which comes already split into training and test cases. The regressions were fitted on the training data, and the prediction mean squared error was then computed on the test data.

  • Stepwise regression: Here we use the R function step(), with the AIC criterion serving as a guide for adding/deleting variables. The model returned by step() is the one that achieved the lowest AIC.
  • Lasso regression: This is a form of penalized regression that performs feature selection inherently. Penalized regression adds bias to the regression in order to reduce variance, and therefore reduce prediction error and avoid overfitting. The lasso sets some coefficients exactly to zero, thereby performing implicit feature selection.
  • Best subset regression: Here we use the R package leaps, and specifically the function regsubsets(), which returns the best model of each size m = 1, …, n, where n is the number of input variables. (A short sketch of all three approaches follows this list.)
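Here is a short sketch of the three approaches. It assumes the version of the prostate data shipped with the ElemStatLearn package (which carries a train indicator column), and it uses glmnet for the lasso; other lasso implementations work equally well.

library(ElemStatLearn)   # prostate dataset (assumed source)
library(glmnet)
library(leaps)

data(prostate)
train <- subset(prostate, train == TRUE,  select = -train)
test  <- subset(prostate, train == FALSE, select = -train)

# 1. Stepwise regression guided by AIC
full_fit <- lm(lpsa ~ ., data = train)
step_fit <- step(full_fit, trace = 0)

# 2. Lasso regression, with the penalty chosen by cross-validation
x_tr <- as.matrix(subset(train, select = -lpsa))
cv   <- cv.glmnet(x_tr, train$lpsa, alpha = 1)
coef(cv, s = "lambda.min")      # coefficients set to zero correspond to dropped variables

# 3. Best subset selection
subs <- regsubsets(lpsa ~ ., data = train, nvmax = 8)
summary(subs)$cp                # Mallows' Cp for each subset size
summary(subs)$bic               # BIC for each subset size

# Test set mean squared prediction error, e.g. for the stepwise model
mean((test$lpsa - predict(step_fit, newdata = test))^2)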

Regarding which variables are removed, it is interesting to note that:

  • Lasso regression and stepwise regression result in the removal of the same variable (gleason).
  • In best subset selection, when we select the regression with the smallest Cp (Mallows' Cp), the best subset is the one of size 7, with one variable removed (gleason again). When we select the subset with the smallest BIC (Bayesian Information Criterion), the best subset is the one of size 2 (the two variables that remain are lcavol and lweight).

Regarding the test error, the smallest values are achieved with lasso regression and with best subset selection using the size-2 model.

Code for regression variable selection

 

Prediction in R using Ridge Regression

Ridge regression is a regularization method in which the coefficients are shrunk in order to reduce the variance of the solution and therefore improve prediction accuracy. Below we implement ridge regression on the longley and prostate datasets using two functions: lm.ridge() from MASS and linearRidge() from the ridge package. Pay special attention to the scaling of the coefficients and the offsetting of the predicted values when using lm.ridge().
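Here is a minimal sketch for the longley data (the prostate case is handled analogously); the lambda values are illustrative. Note how lm.ridge() requires the predictions to be assembled by hand from the rescaled coefficients and the intercept, while linearRidge() predictions come directly from predict().

library(MASS)    # lm.ridge()
library(ridge)   # linearRidge()

data(longley)

# MASS: lm.ridge(); coef() rescales the coefficients back to the original
# predictor scale and supplies the intercept, so we offset the fit manually.
fit_mass  <- lm.ridge(Employed ~ ., data = longley, lambda = 0.05)
b         <- coef(fit_mass)
x         <- as.matrix(longley[, names(longley) != "Employed"])
pred_mass <- b[1] + x %*% b[-1]

# ridge: linearRidge() has a predict() method that does this for us.
fit_ridge  <- linearRidge(Employed ~ ., data = longley, lambda = 0.05)
pred_ridge <- predict(fit_ridge, newdata = longley)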

Ridge regression in R examples

Alternative ways to compute the cross-validation and test errors in R linear regression

In this post, we will look at two different ways to compute the cross-validation error and two different ways to compute the test error in R linear regression. The dataset we will use is gala from the faraway package.

> library(faraway)

> data(gala)

> index<-1:nrow(gala)

> testindex<-sample(index,trunc(length(index)/3))

> testset<-gala[testindex,]

> trainset<-gala[-testindex,]

> nn<-nrow(trainset)

> c1<-c(1:nn)

> trainset2<-cbind(trainset,c1)

> diff1<-numeric(nn)

> foo1<-numeric(nn)

> foo2<-numeric(nn)

> diff2<-numeric(nn)

> #Cross validation using predict in two different ways:(a) the data points are explicitly defined using a list() (b) the data points are implicitly given by deleting the output column (column 1).

> newdata<-trainset[,-5:-7]

> for(i in 1:nn){
+ model1<-lm(Species~Endemics+Area+Elevation,subset=(c1!=i),data=trainset2)
+ foo1[i]<-predict(model1,list(Endemics=trainset2[i,2],Area=trainset2[i,3],Elevation=trainset2[i,4]))
+ foo2[i]<-predict(model1,newdata[i,-1])
+ diff1[i]<-trainset2[i,1]-foo1[i]
+ diff2[i]<-trainset2[i,1]-foo2[i]
+ }

#As we see below, the two ways are equivalent.

> diff1

[1]    9.9654289   -0.4143694  -14.5205954   47.9751592    8.1244225

 [6]   14.0009744   -2.5409240   30.3767094   -5.6358873   12.5939696

[11]  -34.3419570 -428.6932984    9.7597007  -25.9936027  -13.6063039

[16]  -12.0757886   86.7751835  -27.2387452  -10.6432398   -9.8346438

> diff2

[1]    9.9654289   -0.4143694  -14.5205954   47.9751592    8.1244225

 [6]   14.0009744   -2.5409240   30.3767094   -5.6358873   12.5939696

[11]  -34.3419570 -428.6932984    9.7597007  -25.9936027  -13.6063039

[16]  -12.0757886   86.7751835  -27.2387452  -10.6432398   -9.8346438

> sum11=0

> for(i in 1:nn) {

+ sum11=sum11+diff1[i]^2

+ }

> sum11<-sum11/nn

#The number below is the Mean Square Error.

> sum11

[1] 9926.731

 

#Computation of the test error in 2 different ways: (a) Using the predict() function to compute the predicted values (b) Use the coefficients of the regression equation to compute the predicted values.

> tt<-nrow(testset)

> diff3<-numeric(tt)

> model2<-lm(Species~Endemics+Area+Elevation,data=trainset)

> difftest<-numeric(tt)

> newdata2<-testset[,-5:-7]

#Using the predict() function for the computation of the predicted values.

> lmpred<-predict(model2,newdata2[,-1])

> for(i in 1:tt) {

+ difftest[i]=testset[i,1]-lmpred[i]}

> summ2=0

> for(i in 1:tt) {

+ summ2=summ2+difftest[i]^2}

> summ2<-summ2/tt

> summ2

[1] 1091.469

#This is the mean squared test error.

#Now let’s proceed to the second way of computing the test error.

> cc<-as.matrix(cbind(rep(1,10),testset[,c(2,3,4)]))

> pred.value<-cc%*% model2$coef

> for(i in 1:tt) {

+ difftest[i]<-testset[i,1]-pred.value[i]}

> sum3=0

> for(i in 1:tt) {

+ sum3=sum3+difftest[i]^2}

> sum3<-sum3/tt

> sum3

[1] 1091.469

#As expected, the two different methods of computing the test error yield the same number.

Graphical and Non-graphical Detection of Outliers and Influential Cases in Linear Regression Using R

 OUTLIER DETECTION

In this discussion, we are using the dataset gala in package faraway.

> gala[1:3,]

          Species Endemics  Area Elevation Nearest Scruz Adjacent
Baltra         58       23 25.09       346     0.6   0.6     1.84
Bartolome      31       21  1.24       109     0.6  26.3   572.33
Caldwell        3        3  0.21       114     2.8  58.7     0.78

and the model we will use is:

> model<-lm(Species~Endemics+Area+Elevation,data=gala)

A. Graphical Approaches for Outlier Detection         

A.1. From the graph of standardized residuals

Cases of interest are those where standardized residual > 2. Cases with values > 3 are very likely outliers.

> places<-row.names(gala)

> plot(rstandard(model))

> identify(1:30,rstandard(model),places)

NOTE: The identify function enables the interactive labeling of index graphs in R.

[Figure 1: Index plot of standardized residuals]

A.2. From the graph of studentized (jackknife) residuals

Studentized residuals are computed by refitting the model without the current data point, which is why they are sometimes called jackknifed residuals.

> library(MASS)

> plot(rstudent(model),main="Jackknifed residuals")

> identify(1:30,rstudent(model),places)

 

[Figure 2: Index plot of jackknifed (studentized) residuals]

 

B. NON-GRAPHICAL APPROACHES.

B.1. Looking at the magnitudes of studentized residuals

> jack<-rstudent(model)

> absjack<-abs(jack)

> absjack[absjack >2]

    Isabela SanSalvador   SantaCruz
   3.060831    3.287014    4.092622

 

Now let's do a Bonferroni test.

> library(car)

> outlierTest(model)

          rstudent unadjusted p-value Bonferroni p
SantaCruz 4.092622         0.00039043     0.011713

So, according to the Bonferroni test, only SantaCruz is an outlier.

B.2. Looking at the magnitudes of standardized residuals

> standard<-abs(rstandard(model))

> standard[standard > 2]

    Isabela SanSalvador   SantaCruz
   2.662224    2.801042    3.229701

Note that these magnitudes are smaller than those of the corresponding studentized residuals.

INFLUENTIAL CASES

An influential case is one that, if removed, causes a large change in the fitted equation. An influential case is not necessarily an outlier, although it can be.

A. Non-Graphical Approaches for Detection of Influential Cases

A.1. Influential Cases From influence.measures()

> modi<-influence.measures(model)

> summary(modi)

Potentially influential observations of

lm(formula = Species ~ Endemics + Area + Elevation, data = gala) :

            dfb.1_   dfb.Endm  dfb.Area  dfb.Elvt  dffit     cov.r    cook.d   hat
Fernandina  -0.28    -0.87     -0.70      1.40_*    1.44_*    4.29_*   0.52    0.75_*
Isabela     -1.69_*   2.07_*  -15.46_*    1.62_*  -19.88_*   14.14_*  74.71_*  0.98_*
SanSalvador  0.65    -1.16_*    0.69     -0.08     -1.73_*    0.35_*   0.54    0.22
SantaCruz   -0.85     2.41_*   -0.32     -0.86      2.80_*    0.22_*   1.22_*  0.32
SantaMaria  -0.05     0.16     -0.08     -0.03      0.19      1.46_*   0.01    0.22

A.2. Influential Cases From Cook's Distance

Cook's distance is the best way to identify influential cases: it measures the influence of a single case on the fitted model as a whole. Cases with distances > 1 can be an indicator of an unstable model.

> cook<-cooks.distance(model)

> cook[cook > 1]

  Isabela SantaCruz
74.709410  1.223785

A. 3. Influential Cases From The Computation of Leverage

Leverage varies between 0 (no influence) and 1 (complete influence). We first compute the average leverage as

average leverage = (number of predictors + 1) / sample size

Then we look for cases with leverage greater than 2 or 3 times the average leverage. NOTE: A case has high leverage when its predictor values lie far from the bulk of the data; a high-leverage case is not necessarily an outlier in the response.

In R, the leverage is measured by the hat values:

> x<-model.matrix(model)

> lev<-hat(x)

> places<-row.names(gala)

> names(lev)<-places

> lev[lev > 2*sum(lev)/30]

Fernandina    Isabela  SantaCruz
 0.7547450  0.9768328  0.3193987

 

 B. Graphical Detection of Influential Cases

B.1  From Cook’s distance

> cook<-cooks.distance(model)

> plot(cook,ylab="Cook's distance")

> identify(1:30,cook,places)

[Figure 3: Index plot of Cook's distances]

 

B.2. From Leverage

> plot(lev,ylab="Leverages",main="Index plot of leverages")

> sum(lev)

[1] 4

> abline(h=8/30)   # NOTE: this horizontal line is drawn at 2*sum(lev)/N = 8/30

> identify(1:30,lev,places)

[1] 12 16 25

 

[Figure 4: Index plot of leverages]

B.3. From the function plot(model)

[Figure 5: Diagnostic plots produced by plot(model)]

Support vector machine regression in R: Tuning the parameters yields the best predictive power

In this example, we will compare four types of regression on the benchmark BostonHousing dataset from the mlbench package: linear regression, a regression tree, support vector machine regression with default parameters, and support vector machine regression with tuned parameters. The predictive power will be computed using the PRESS metric, described in previous posts. Here is the code:

> library(mlbench)

> library(e1071)

> library(rpart)

> data(BostonHousing)

> c1<-c(1:506)   # 506 is the number of rows in BostonHousing, i.e. nrow(BostonHousing)

> BostonHousingmod<-cbind(BostonHousing,c1)

NOTE: The c1 variable is introduced purely for the computation of the PRESS statistic, as shown below.

LINEAR REGRESSION CODE AND COMPUTATION OF PRESS METRIC

> difflm<-numeric(506)
> for(i in 1: 506) {model1m<-lm(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+b+lstat,subset=(c1!=i), data=BostonHousingmod)
+ newData<-BostonHousingmod[i,-c(14,15)]   # drop the medv and c1 columns
+ specpr<-predict(model1m,newData)
+ difflm[i]<-BostonHousingmod[i,14]-specpr
+ }
> summ1=0
> for(i in 1:506) {summ1=summ1+difflm[i]^2}
> summ1
[1] 12005.23

So the PRESS metric is 12005.23.

SUPPORT VECTOR MACHINE REGRESSION USING DEFAULT PARAMETERS
> diffsvm<-numeric(506)
> for(i in 1: 506) {modelsvm<-svm(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+b+lstat,subset=(c1!=i), data=BostonHousingmod)
+ newData<-BostonHousingmod[i,-c(14,15)]
+ specpr<-predict(modelsvm,newData)
+ diffsvm[i]<-BostonHousingmod[i,14]-specpr
+ }
>
> summ2=0
> for(i in 1:506) {summ2=summ2+diffsvm[i]^2}
> summ2
[1] 6436.89

So the PRESS metric here is 6436.89.

REGRESSION TREE CODE AND COMPUTATION OF PRESS
> diffrpart<-numeric(506)
> for(i in 1: 506) {modelrpart<-rpart(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+b+lstat,subset=(c1!=i), data=BostonHousingmod)
+ newData<-BostonHousingmod[i,-c(14,15)]
+ specpr<-predict(modelrpart,newData)
+ diffrpart[i]<-BostonHousingmod[i,14]-specpr
+ }
> summ3=0
> for(i in 1:506) {summ3=summ3+diffrpart[i]^2}
> summ3
[1] 11452.59

So, the PRESS metric here is 11452.59.

CODE FOR THE TUNING OF THE SUPPORT VECTOR MACHINE REGRESSION PARAMETERS

> difftune<-numeric(506)
> cc<-seq(0.1,0.9,by=0.1)
> cc2<-seq(100,1000,by=100)
> modeltune<-tune.svm(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+b+lstat,gamma=cc,cost=cc2, data=BostonHousingmod)
> modeltune

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma cost
   0.1  100

- best performance: 10.65047

CODE FOR THE SUPPORT VECTOR MACHINE REGRESSION USING TUNED PARAMETERS AND COMPUTATION OF THE PRESS METRIC

> for(i in 1: 506) {finalmodel<-svm(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+b+lstat,gamma=0.1,cost=100,subset=(c1!=i), data=BostonHousingmod)
+ newData<-BostonHousingmod[i,-c(14,15)]
+ specpr<-predict(finalmodel,newData)
+ difftune[i]<-BostonHousingmod[i,14]-specpr
+ }
> sumnew=0
> for(i in 1:506) {sumnew=sumnew+difftune[i]^2}
> sumnew
[1] 5627.534

So, the tuned support vector machine regression has the best (smallest) PRESS metric, 5627.534, and therefore shows the best predictive power of the four models in this example.

By the way, this is what the finalmodel contains:
> finalmodel

Call:
svm(formula = medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio +
b + lstat, data = BostonHousingmod, gamma = 0.1, cost = 100, subset = (c1 != i))
Parameters:
SVM-Type: eps-regression
SVM-Kernel: radial
cost: 100
gamma: 0.1
epsilon: 0.1
Number of Support Vectors: 329