feature selection – Theo's Data Science Blog

INTRODUCTION

In this blog post, we will use the caret R package to predict the median California housing price. The original dataset can be found on the Kaggle website: https://www.kaggle.com/camnugent/california-housing-prices/kernels and it has 10 columns and 20641 rows.

The caret package is one of the most useful in R, offering a wide array of capabilities, ranging from data exploration and feature selection to implementation of a large number of models.

SUMMARY OF WORK IN THIS POST

First we sample 4000 random rows from the dataset for faster processing times. Then we do feature engineering, by removing the longitude and latitude columns, and creating new proportion features, such as people per household. We also remove rows with missing data.

Then we use caret for the following:

Center and scale.
Creation of the training and test set.
One hot encoding (dummy vars).
Feature selection using caret’s RFE method.
Implementation of PLS, Lasso, Random Forest, XGB Tree, and SVMpoly regression.
Model Comparison and model ensembling.

It is also noteworthy that we will utilize the multiple cores of our PC for faster processing, by using the doParallel library. Finally, we evaluate the performance of the models using the test set error.

Here is the link to the RMarkdown script:

CODE

CONCLUSION

As expected, the stacked model yielded the smallest test set error (not by much), but still the smallest.

In this post, you can find code (link below) for doing variable selection in R regression in three different ways. The variable selection was done on the well-known R dataset prostate. The data is inherently separated in train and test cases. The regressions were applied on the training data and then the prediction mean square error was computed for the test data.

Stepwise regression: Here we use the R function step(), where the AIC criterion serves as a guide to add/delete variables. The regression implementation that is returned by step() has achieved the lowest AIC.
Lasso regression: This is a form of penalized regression that does feature selection inherently. Penalized regression adds bias to the regression equation in order to reduce variance and therefore, reduce prediction error and avoid overfitting. Lasso regression sets some coefficients to zero, and therefore does implicit feature selection.
Best subset regression: Here we use the R package leaps and specifically the function regsubsets(), which returns the best model of size m=1…,n where n is the number of input variables.

Regarding which variables are removed, it is interesting to note that:

Lasso regression and stepwise regression result in the removal of the same variable (gleason).
In best subset selection, when we select the regression with the smallest cp (mallow’s cp), the best subset is the one of size 7, with one variable removed (gleason again). When we select, the subset with the smallest BIC (Bayes Information Criterion), the best subset is the one of size 2 (the two variables that remain are lcavol and lweight).

Regarding the test error, the smallest values are achieved with lasso regression and best subset selection with regression of size 2.

Code for regression variable selection

Tag: feature selection

Prediction Using Caret Model Ensembling

3-way Variable Selection in R Regression (lasso,stepwise,and best subset)