In this post, you can find code (link below) for doing variable selection for linear regression in R in three different ways. The variable selection was done on the well-known R dataset prostate. The dataset comes already split into training and test cases. The regressions were fitted on the training data, and the mean squared prediction error was then computed on the test data.
- Stepwise regression: Here we use the R function step(), which uses the AIC as a guide for adding or deleting variables. The model returned by step() is the one with the lowest AIC among the models visited during the search.
- Lasso regression: This is a form of penalized regression. The penalty introduces bias into the coefficient estimates in order to reduce their variance, which can lower prediction error and help avoid overfitting. Because the lasso penalty sets some coefficients exactly to zero, it performs feature selection implicitly.
- Best subset regression: Here we use the R package leaps and specifically the function regsubsets(), which returns the best model of each size m = 1, …, n, where n is the number of input variables.
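As a rough illustration of the stepwise approach and of the train/test evaluation described above, here is a minimal sketch. It does not use the prostate data: the data frame `dat`, the column names `x1` to `x8`, `y`, `is_train`, and the 67/30 split are all hypothetical stand-ins generated for this example, so the numbers it produces are not comparable to the results in this post.

```r
# Synthetic stand-in data (NOT the prostate dataset): eight predictors,
# of which only x1 and x2 actually influence the response.
set.seed(1)
n <- 97
dat <- as.data.frame(matrix(rnorm(n * 8), nrow = n))
names(dat) <- paste0("x", 1:8)
dat$y <- 1 + 0.7 * dat$x1 + 0.4 * dat$x2 + rnorm(n, sd = 0.7)
dat$is_train <- seq_len(n) <= 67   # hypothetical split: first 67 rows train

# Separate training and test rows, dropping the indicator column.
train_df <- dat[dat$is_train, names(dat) != "is_train"]
test_df  <- dat[!dat$is_train, names(dat) != "is_train"]

# Start from the full model and let step() add/delete terms guided by AIC.
full_fit <- lm(y ~ ., data = train_df)
step_fit <- step(full_fit, direction = "both", trace = 0)

# Variables kept by the stepwise search.
kept <- attr(terms(step_fit), "term.labels")
print(kept)

# Mean squared prediction error on the held-out test rows.
pred <- predict(step_fit, newdata = test_df)
test_mse <- mean((test_df$y - pred)^2)
print(test_mse)
```

Starting from the full model with direction = "both" is only one possible choice; step() can also start from an intercept-only model and search forward if a scope is supplied.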
Regarding which variables are removed, it is interesting to note that:
- Lasso regression and stepwise regression result in the removal of the same variable (gleason).
- In best subset selection, when we select the model with the smallest Cp (Mallows' Cp), the best subset is the one of size 7, with one variable removed (gleason again). When we select the subset with the smallest BIC (Bayesian Information Criterion), the best subset is the one of size 2 (the two remaining variables are lcavol and lweight).
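The comparison of Cp and BIC for best subset selection can be sketched as follows. This assumes the leaps package is installed. The data are again a synthetic stand-in (the names `X`, `x1` to `x8` and `y` are hypothetical), so the selected sizes need not match the sizes 7 and 2 reported above for the prostate data.

```r
library(leaps)

# Synthetic stand-in training data (NOT the prostate dataset).
set.seed(1)
n <- 67
X <- as.data.frame(matrix(rnorm(n * 8), nrow = n))
names(X) <- paste0("x", 1:8)
X$y <- 1 + 0.7 * X$x1 + 0.4 * X$x2 + rnorm(n, sd = 0.7)

# Best model of each size 1..8, found by exhaustive search.
subsets <- regsubsets(y ~ ., data = X, nvmax = 8)
s <- summary(subsets)

# Model size preferred by each criterion.
size_cp  <- which.min(s$cp)    # Mallows' Cp
size_bic <- which.min(s$bic)   # Bayesian Information Criterion

# Coefficients (including the intercept) of the chosen models.
print(coef(subsets, id = size_cp))
print(coef(subsets, id = size_bic))
```

Because BIC penalizes each extra variable more heavily than Cp at this sample size, it tends to pick a smaller model, which is consistent with the pattern described in the list above.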
Regarding the test error, the smallest values are achieved with lasso regression and with the best subset model of size 2.
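For the lasso, one common way to fit the model and compute a test error is with the glmnet package, sketched below. This is an assumption about tooling rather than a description of the linked code; choosing the penalty by cross-validation ("lambda.min") is likewise just one reasonable option. The data are a synthetic stand-in with hypothetical names, not the prostate data.

```r
library(glmnet)

# Synthetic stand-in data (NOT the prostate dataset).
set.seed(1)
n <- 97
x <- matrix(rnorm(n * 8), nrow = n,
            dimnames = list(NULL, paste0("x", 1:8)))
y <- 1 + 0.7 * x[, 1] + 0.4 * x[, 2] + rnorm(n, sd = 0.7)
is_train <- seq_len(n) <= 67   # hypothetical split: first 67 rows train

# Lasso (alpha = 1) with the penalty strength chosen by cross-validation.
cv_fit <- cv.glmnet(x[is_train, ], y[is_train], alpha = 1)

# Coefficients at the selected penalty; exact zeros are dropped variables.
b <- as.matrix(coef(cv_fit, s = "lambda.min"))[, 1]
dropped <- names(b)[b == 0]
print(dropped)

# Mean squared prediction error on the held-out test rows.
pred <- as.vector(predict(cv_fit, newx = x[!is_train, ], s = "lambda.min"))
test_mse <- mean((y[!is_train] - pred)^2)
print(test_mse)
```

Using s = "lambda.1se" instead of "lambda.min" would give a sparser model at the cost of slightly more shrinkage.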
Code for regression variable selection
Hi Natalia,
The lasso does variable selection inherently, and it keeps the variables it considers important. So, I am not sure why you want to do linear regression on top of the lasso as well. I would just do the lasso. You do have to check, though, whether the variables the lasso keeps also make sense in your study setting. Here is a very useful and relevant discussion:
http://stats.stackexchange.com/questions/7935/what-are-disadvantages-of-using-the-lasso-for-variable-selection-for-regression
Thanks a lot for your fast reply! Do you have any suggestions on how to report the results from a lasso in a paper? Normally reviewers ask to see a test statistic and its associated p-value. I've been looking into the literature, and it's not quite clear to me how to do this.
Thanks again
Regards
Hi again,
Here is a good reference on how to report results from a lasso:
Variable_Selection.pdf