INTRODUCTION
In this blog post, we will use the caret R package to predict the median California housing price. The original dataset, which has 10 columns and 20640 rows, can be found on the Kaggle website: https://www.kaggle.com/camnugent/california-housing-prices/kernels
The caret package is one of the most useful in R, offering a wide array of capabilities, ranging from data exploration and feature selection to implementation of a large number of models.
SUMMARY OF WORK IN THIS POST
First, we sample 4000 random rows from the dataset to speed up processing. We then engineer features by removing the longitude and latitude columns and creating new proportion features, such as people per household. We also remove rows with missing data.
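As a rough illustration of these preparation steps, here is a minimal base R sketch. It runs on a synthetic data frame that only mimics the column names of the Kaggle file, so all values are made up. The names rooms_per_household and bedrooms_per_room are assumptions added for illustration; only people per household is named in this post.

```r
set.seed(123)

# Synthetic stand-in for the Kaggle data (made-up values, real column names).
n <- 10000
housing <- data.frame(
  longitude          = runif(n, -124, -114),
  latitude           = runif(n, 32, 42),
  housing_median_age = sample(1:52, n, replace = TRUE),
  total_rooms        = sample(500:5000, n, replace = TRUE),
  total_bedrooms     = sample(100:1000, n, replace = TRUE),
  population         = sample(300:3000, n, replace = TRUE),
  households         = sample(100:1000, n, replace = TRUE),
  median_income      = runif(n, 0.5, 15),
  median_house_value = runif(n, 15000, 500000),
  ocean_proximity    = factor(sample(c("INLAND", "NEAR BAY", "NEAR OCEAN"),
                                     n, replace = TRUE))
)
# Inject some missing values so the NA-removal step has something to do.
housing$total_bedrooms[sample(n, 100)] <- NA

# 1. Sample 4000 random rows for faster processing.
sampled <- housing[sample(nrow(housing), 4000), ]

# 2. Drop the longitude and latitude columns.
sampled$longitude <- NULL
sampled$latitude  <- NULL

# 3. Create proportion features (the last two names are hypothetical).
sampled$people_per_household <- sampled$population / sampled$households
sampled$rooms_per_household  <- sampled$total_rooms / sampled$households
sampled$bedrooms_per_room    <- sampled$total_bedrooms / sampled$total_rooms

# 4. Remove rows with missing data.
sampled <- na.omit(sampled)

str(sampled)
```

With the real file, the synthetic data frame would be replaced by a call such as read.csv() on the downloaded CSV.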
Then we use caret for the following:
- Center and scale.
- Creation of the training and test set.
- One hot encoding (dummy vars).
- Feature selection using caret’s RFE method.
- Implementation of PLS, Lasso, Random Forest, XGB Tree, and SVMpoly regression.
- Model Comparison and model ensembling.
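To make a few of these steps concrete, here is a minimal sketch of the split, one-hot encoding, centering and scaling, and a single PLS fit with caret. It is not the script behind this post: the data is synthetic, the 80/20 split and 5-fold cross-validation are assumed settings, and the split is done before scaling (an assumption) so that the scaling statistics come from the training set only. The "pls" method also needs the pls package installed.

```r
library(caret)  # method = "pls" additionally needs the 'pls' package

set.seed(42)

# Synthetic stand-in for the engineered housing data (values are made up).
n <- 300
housing <- data.frame(
  median_income        = runif(n, 1, 10),
  housing_median_age   = sample(1:52, n, replace = TRUE),
  people_per_household = runif(n, 1.5, 5),
  ocean_proximity      = factor(sample(c("INLAND", "NEAR_BAY", "NEAR_OCEAN"),
                                       n, replace = TRUE))
)
housing$median_house_value <- 40000 * housing$median_income +
  rnorm(n, sd = 20000)

# Training/test split (assumed 80/20), stratified on the outcome.
train_idx <- createDataPartition(housing$median_house_value,
                                 p = 0.8, list = FALSE)
train_raw <- housing[train_idx, ]
test_raw  <- housing[-train_idx, ]

# One-hot encoding: learn the encoding on the training set, apply to both.
encoder <- dummyVars(median_house_value ~ ., data = train_raw)
train_x <- as.data.frame(predict(encoder, newdata = train_raw))
test_x  <- as.data.frame(predict(encoder, newdata = test_raw))

# Center and scale using training-set statistics only.
scaler  <- preProcess(train_x, method = c("center", "scale"))
train_x <- predict(scaler, train_x)
test_x  <- predict(scaler, test_x)

# Fit one model (PLS) with 5-fold cross-validation (assumed setting).
ctrl    <- trainControl(method = "cv", number = 5)
pls_fit <- train(x = train_x, y = train_raw$median_house_value,
                 method = "pls", trControl = ctrl, tuneLength = 3)

# Test set error for this model.
test_pred    <- predict(pls_fit, newdata = test_x)
test_metrics <- postResample(pred = test_pred,
                             obs = test_raw$median_house_value)
print(test_metrics["RMSE"])
```

Other models can be tried by changing the method argument of train(), for example "rf", "xgbTree", or "svmPoly", each of which requires its own backing package.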
We also use the doParallel library to spread the computation across the multiple cores of our PC for faster processing. Finally, we evaluate the performance of the models using the test set error.
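A parallel backend for caret can be registered roughly as follows. This is a minimal sketch: leaving one core free is an assumption rather than a setting taken from this post, and the small foreach loop is only there to show that the backend is active.

```r
library(doParallel)  # also attaches foreach and parallel

# Use all cores but one (assumed choice); fall back to 1 if detection fails.
n_cores <- parallel::detectCores()
if (is.na(n_cores)) n_cores <- 1
n_cores <- max(1, n_cores - 1)

cl <- makeCluster(n_cores)
registerDoParallel(cl)

# Any caret::train() call made here runs its resampling in parallel,
# because trainControl() has allowParallel = TRUE by default.

# Tiny check that the backend works: square four numbers on the workers.
squares <- foreach(i = 1:4, .combine = c) %dopar% i^2
print(squares)

# Release the workers and return to sequential execution.
stopCluster(cl)
registerDoSEQ()
```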
Here is the link to the RMarkdown script:
CONCLUSION
As expected, the stacked model yielded the smallest test set error, though only by a small margin.