R – Theo's Data Science Blog

Python And R for Data Wrangling: Compare Pandas and Tidyverse Code Side-By-Side, and Learn Speed-Up Tips.

Medium Post

Prediction Using Caret Model Ensembling

INTRODUCTION

In this blog post, we will use the caret R package to predict the median California housing price. The original dataset can be found on the Kaggle website: https://www.kaggle.com/camnugent/california-housing-prices/kernels and it has 10 columns and 20641 rows.

The caret package is one of the most useful in R, offering a wide array of capabilities, ranging from data exploration and feature selection to implementation of a large number of models.

SUMMARY OF WORK IN THIS POST

First we sample 4000 random rows from the dataset for faster processing times. Then we do feature engineering, by removing the longitude and latitude columns, and creating new proportion features, such as people per household. We also remove rows with missing data.

Then we use caret for the following:

Center and scale.
Creation of the training and test set.
One hot encoding (dummy vars).
Feature selection using caret’s RFE method.
Implementation of PLS, Lasso, Random Forest, XGB Tree, and SVMpoly regression.
Model Comparison and model ensembling.

It is also noteworthy that we will utilize the multiple cores of our PC for faster processing, by using the doParallel library. Finally, we evaluate the performance of the models using the test set error.

Here is the link to the RMarkdown script:

CODE

CONCLUSION

As expected, the stacked model yielded the smallest test set error (not by much), but still the smallest.

Principal Components-Based Modeling for the Prediction of a Brand’s Image Rating

All brands strive to have an excellent Image in their respective markets, where Image is comprised of such components as reputation and trustworthiness. For this purpose, companies often conduct surveys among their customers to inquire about their satisfaction, complaints, and whether expectations are being met. An interesting question that arises is that, given such survey results, whether we can predict the brand’s Image.

Package plspm in R contains a dataset called mobile, which contains variables that pertain to the above issue. Specifically it contains 24 variables that encapsulate the following latent (underyling) concepts: Image, Expectations, Quality, Value, Satisfaction, Complaints, and Loyalty. The variable scale is from 0-100. For example, the Image latent concept, has five corresponding variables in the dataset and the Expectations latent concept has three corresponding variables in the dataset. There are a total of 250 observations in the dataset.

A methodology often used for data with latent concepts is that of Principal Components Analysis (PCA), which is based on the computation of eigenvectors and eigenvalues of the data covariance matrix. In this post, we will employ Principal Components Regression(PCR), Partial Least Regression(PLSR), and Ridge Regression (all related to PCA) to build a model and create a prediction for the average Image ratings of the products in the dataset. So the response variable will be the average of the 5 variables related to Image, and predictors will be the variables related to the other concepts (Value, Satisfaction, etc). Although, all aforementioned regressions are based on the Principal Components idea, they differ in how they treat the low/high variance directions. PCR just keeps a certain number of high variance directions and throws away the rest. PLSR inflates some of the high variance directions, while shrinking the low variance directions [Has13]. On the other hand, Ridge Regression shrinks all directions, exhibiting a preference for the high-variance directions (i.e., it shrinks low-variance more). Finally, for comparison purposes, we also compute the linear regression of the predictors. As is shown in the code below, PCR and PLSR do the best in predicting the average Image rating of each product. This could indicate that keeping/inflating the high variance directions produces better predictions for the dataset, than shrinking high-variance directions less than low-variance directions as is done in Ridge Regression.

[Has13] Hastie, T., Tibshirani, R. and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2013.

Below is the link for the full R code:

Code for Regressions based on Principal Components

An Example of R Logistic Regression for Weather Prediction (updated 10/17)

In this post, we will see how we can use R Logistic Regression invoked via the Generalized Linear Model function glm() to predict the value of a dichotomous variable that indicates whether it will rain or not in Australia. We use the weather dataset in the rattle package.

Below is the link for the full code:

https://drive.google.com/file/d/0B5vwqG-vGN-ZUE03QkJNYU1jMUk/view?usp=sharing

Code for R Logistic Regression

Prediction using the R SuperLearner package

In this post, the R SuperLearner package is used to predict the values of the testset part of the prostate dataset.
In the SuperLearner approach, prediction is performed by combining weighted versions of different learners. As shown in the code below, the mean square prediction error is 0.319. As a comparison, in an earlier post (3-way variable selection in R regression), the mean square errors of linear regression and lasso regression were 0.516 and 0.493 respectively.

Here is the SuperLearner code:
> library(SuperLearner)
>library(ElemStatLearn)
> data(prostate)
#Separate the training and test sets.
> head(prostate)
lcavol lweight age lbph svi lcp gleason pgg45 lpsa train
1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0 -0.4307829 TRUE
2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0 -0.1625189 TRUE
3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20 -0.1625189 TRUE
4 -1.2039728 3.282789 58 -1.386294 0 -1.386294 6 0 -0.1625189 TRUE
5 0.7514161 3.432373 62 -1.386294 0 -1.386294 6 0 0.3715636 TRUE
6 -1.0498221 3.228826 50 -1.386294 0 -1.386294 6 0 0.7654678 TRUE
#The variable to be predicted is lpsa. The train variable is a dummy variable that indicates whether a case belongs to the trainset or the testset.
> trainset testset testset1 testset2 trainset1 trainset2<-trainset1[,-9]
#Specify the learners that will be used by the superlearner.
mylibrary X newX<-testset2
ay out out$SL.predict
[,1]
7 1.814375
9 1.110389
10 1.237715
15 1.871654
22 2.699901
25 1.943528
26 1.977818
28 1.965107
32 1.988913
34 1.227339
36 2.875185
42 2.231019
44 2.346486
48 2.783155
49 2.419382
50 2.120854
53 2.388720
54 3.046706
55 3.001810
57 1.612384
62 3.444458
64 3.635944
65 2.350286
66 2.748382
73 2.853494
74 3.341276
80 3.117150
84 3.186377
95 3.241076
97 3.657895
#Let’s compute now the mean square error between the predicted values and the actual testset values.
> sum=0
> tt for(i in 1:tt) {
+ sum sumg sumg
[1] 0.3191994
>

Two-way ANOVA in R

In this post we look at how we can compute the two-way ANOVA of a balanced design. The dataset is weightgain in package HSAUR and it shows the weight gains of rats put on four different diets, with two varying factors (the source of protein, which can be beef or cereal and the type which can be high or low).

The R Code and graphs (main effects, interactions) are shown in the link below.

Two-way ANOVA Code Link

Reduction of Regression Prediction Error by Incorporating Var Interactions and Factorization

In this post, we work with dataset mtcars in R. The dataset has 32 observations and 11 variables. Various regression models were tried on the model. Each one of these models was optimized in regards to AIC, using stepwise regression. The prediction error was computed using leave-one-out cross validation.

The smallest prediction error and also the smallest regression standard error was achieved, when we incorporated as much knowledge as possible about our independent variables. Specifically, looking at the correlation matrix of the data one can see that some of the variables are correlated and to account for that an interaction term was included in the model. In addition, some of the variables were of discrete nature taking only a few unique values. Knowledge about this was incorporated in the regression, by entering these variables as factors in the model. The complete code for the development and testing of the models is in the link below.

Regression Code Link

Below is a version that takes into account that some categorical variables are ordered. However, the prediction and standard regression errors remain the same as above:

Regression Code Link

Partial Correlation in R

When computing the correlation between two variables, an interesting question is how it will be affected if we control for another variable(s), where control means hold constant.

R offers the function pcor.test() in the ppcor package. This function has the format pcor.test(X,Y,Z) where X and Y are the variables for which we want to compute the correlation, while Z consists of column vectors containing the controlling variable(s).

In the link below, we apply pcor.test() to a dataset and compare it will computing correlations by group. The results show that computing correlations by group, provides useful insights, while pcor.test() does not.

Code for partial correlation and correlation by group

3-way Variable Selection in R Regression (lasso,stepwise,and best subset)

In this post, you can find code (link below) for doing variable selection in R regression in three different ways. The variable selection was done on the well-known R dataset prostate. The data is inherently separated in train and test cases. The regressions were applied on the training data and then the prediction mean square error was computed for the test data.

Stepwise regression: Here we use the R function step(), where the AIC criterion serves as a guide to add/delete variables. The regression implementation that is returned by step() has achieved the lowest AIC.
Lasso regression: This is a form of penalized regression that does feature selection inherently. Penalized regression adds bias to the regression equation in order to reduce variance and therefore, reduce prediction error and avoid overfitting. Lasso regression sets some coefficients to zero, and therefore does implicit feature selection.
Best subset regression: Here we use the R package leaps and specifically the function regsubsets(), which returns the best model of size m=1…,n where n is the number of input variables.

Regarding which variables are removed, it is interesting to note that:

Lasso regression and stepwise regression result in the removal of the same variable (gleason).
In best subset selection, when we select the regression with the smallest cp (mallow’s cp), the best subset is the one of size 7, with one variable removed (gleason again). When we select, the subset with the smallest BIC (Bayes Information Criterion), the best subset is the one of size 2 (the two variables that remain are lcavol and lweight).

Regarding the test error, the smallest values are achieved with lasso regression and best subset selection with regression of size 2.

Code for regression variable selection

Alternative ways to compute the cross-validation and test errors in R linear regression

In this post, we will look at two different ways to compute the cross-validation error and two different ways to compute the test error in R linear regression. The data set we will use is the gala set in package faraway.

> library(faraway)

> data(gala)

> index<-1:nrow(gala)

> testindex<-sample(index,trunc(length(index)/3))

testset<-gala[testindex,]

trainset<-gala[-testindex,]

c1<-c(1:nn)

> trainset2<-cbind(trainset,c1)

> diff1<-numeric(nn)

> foo1<-numeric(nn)

> foo2<-numeric(nn)

> diff2<-numeric(nn)

> #Cross validation using predict in two different ways:(a) the data points are explicitly defined using a list() (b) the data points are implicitly given by deleting the output column (column 1).

> newdata<-trainset[,-5:-7]

> for(i in 1:nn){model1<-lm(Species~Endemics+Area+Elevation,subset=(c1!=i),data=trainset2)

+ foo1[i]<-predict(model1,list(Endemics=trainset2[i,2],Area=trainset2[i,3],Elevation=trainset2[i,

+ 4]))

+ foo2[i]<-predict(model1,newdata[i,-1])

+ diff1[i]<-trainset2[i,1]-foo1[i]

+ diff2[i]<-trainset2[i,1]-foo2[i]

+ }

#As we see below, the two ways are equivalent.

> diff1

[1] 9.9654289 -0.4143694 -14.5205954 47.9751592 8.1244225

[6] 14.0009744 -2.5409240 30.3767094 -5.6358873 12.5939696

[11] -34.3419570 -428.6932984 9.7597007 -25.9936027 -13.6063039

[16] -12.0757886 86.7751835 -27.2387452 -10.6432398 -9.8346438

> diff2

[1] 9.9654289 -0.4143694 -14.5205954 47.9751592 8.1244225

[6] 14.0009744 -2.5409240 30.3767094 -5.6358873 12.5939696

[11] -34.3419570 -428.6932984 9.7597007 -25.9936027 -13.6063039

[16] -12.0757886 86.7751835 -27.2387452 -10.6432398 -9.8346438

> sum11=0

> for(i in 1:nn) {

+ sum11=sum11+diff1[i]^2

+ }

> sum11<-sum11/nn

#The number below is the Mean Square Error.

> sum11

[1] 9926.731

#Computation of the test error in 2 different ways: (a) Using the predict() function to compute the predicted values (b) Use the coefficients of the regression equation to compute the predicted values.

> tt<-nrow(testset)

> diff3<-numeric(tt)

> model12<-lm(Species~Endemics+Area+Elevation,data=trainset)

> difftest<-numeric(tt)

> newdata2<-testset[,-5:-7]

#Using the predict() function for the computation of the predicted values.

> lmpred<-predict(model2,newdata2[,-1])

> for(i in 1:tt) {

+ difftest[i]=testset[i,1]-lmpred[i]}

> summ2=0

> for(i in 1:tt) {

+ summ2=summ2+difftest[i]^2}

> summ2<-summ2/tt

> summ2

#This is the mean squared test error.

[1] 1091.469

#Now let’s proceed to the second way of computing the test error.

> cc<-as.matrix(cbind(rep(1,10),testset[,c(2,3,4)]))

> pred.value<-cc%*% model2$coef

> for(i in 1:tt) {

+ difftest[i]<-testset[i,1]-pred.value[i]}

> sum3=0

> for(i in 1:tt) {

+ sum3=sum3+difftest[i]^2}

> sum3<-sum3/tt

> sum3

[1] 1091.469

#As expected the two different methods of computing the test error, yield the same numbers.