This blog contains case studies, tips, and examples in data science. In most cases, code is included.
Disclaimer: This is a personal blog. There are no guarantees or representations regarding the currentness, validity, completeness, or accuracy of any information presented in the blog. The author is not liable for any error, damage, or loss that results from use of the blog's contents. Feel free to add comments, but I reserve the right to delete comments I deem inappropriate.
Dear Mr. Mitsa,
I'm a research assistant, Emre DÜNDER, Department of Statistics, Samsun Ondokuz Mayıs University. I have read your blog post about lasso regression. Your blog is very instructive; thanks a lot. I'd like to ask you for some information. I ran some simulations for tuning-parameter selection based on the BIC, and I saw an interesting result: although the obtained BIC value is small, the prediction error is high. That is, models with lower BIC values turn out to be models with higher prediction error, whereas in your article BIC selects the models with lower prediction error. Since I'm familiar with R, I wrote some R code.
In fact, I saw that as the selected lambda values increase, the prediction error increases. BIC tends to select larger lambda values, and this causes the increase in prediction error. But in many articles, BIC shows better performance in terms of prediction error. I computed the prediction error as mean((y - yhat)^2).
Why do I obtain such a result? I would appreciate your help with it. This seems like a big conflict. I use R, and I'm sending you some R code below. I look forward to your kind response.
Best regards.
########################################################################
########################################################################
############################## CODES ###################################
########################################################################
########################################################################
library(mpath)
##### Data simulation
n <- 100
p <- 8
x <- scale(matrix(rnorm(n*p), n, p))   #### columns of x already centered and scaled
beta <- c(3, 1.5, 5, 0, 0, 0, 0, 0)
y <- scale(x %*% beta + rnorm(n))      #### standardized response
v <- data.frame(x, y)
#### BIC computation for the lasso fit at a given lambda
BIC.lasso <- function(x, y, lam){
  n <- length(y)
  m <- glmreg(y~x, data=v, family="gaussian", lambda=lam)
  kat <- as.matrix(coef(m))[-1]    #### slope estimates
  K <- length(kat[kat != 0])       #### number of nonzero coefficients
  fit <- coef(m)[1] + x %*% coef(m)[-1]
  result <- log(sum((y - fit)^2)/n) + (log(n)*K/n)
  return(result)
}
#### Lambda grid: 100 log-spaced points from 10^-2 to 10^2
l <- 0:99
lam <- 10^(-2 + 4*l/99)
b <- numeric(100)
#### BIC value for every lambda
for(i in 1:100)
  b[i] <- BIC.lasso(x, y, lam[i])
min.lam <- lam[which.min(b)]   ##### first minimum-BIC lambda value
plot(lam, b)                   ##### BIC against lambda
b                              #### vector of BIC values
##### Comparison of the methods
l <- cv.glmreg(y~x, family="gaussian", data=v)$lambda.optim   #### CV-selected lambda
mg <- glmreg(y~x, data=v, family="gaussian", lambda=l)
opt <- glmreg(y~x, data=v, family="gaussian", lambda=min.lam)
cbind(as.matrix(coef(mg)), as.matrix(coef(opt)))  ####### BIC shrinks too much
#### Prediction error computation (in-sample MSE)
MSE <- function(x, y, lam){
  v <- data.frame(x, y)
  m <- glmreg(y~x, data=v, family="gaussian", lambda=lam)
  fit <- coef(m)[1] + x %*% coef(m)[-1]
  mte <- mean((y - fit)^2)
  return(mte)
}
#### The apparent inconsistency: lower BIC but higher in-sample MSE
BIC.lasso(x, y, l)
BIC.lasso(x, y, min.lam)
MSE(x, y, l)
MSE(x, y, min.lam)
Emre,
As far as BIC is concerned, the one thing we know about it is that it favors regressions with relatively fewer coefficients. There is no literature saying that BIC is better than Cp at prediction tasks. It really all depends on your data, and neither BIC nor AIC nor Cp is a panacea, i.e., universally the best at prediction.
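One detail worth checking in the code above is that MSE is computed on the same sample the model was fit to. In-sample MSE keeps falling as lambda shrinks toward zero, so ranking lambdas by in-sample error will always favor the least-shrunk fit; the comparison that matters is error on an independent test set, which is what criteria like BIC try to approximate. Here is a minimal, self-contained sketch of that point in base R (my illustration, using a simple coordinate-descent lasso rather than mpath's glmreg, on centered data with no intercept; the simulation setup mirrors the one in the question):

```r
set.seed(1)
n <- 100; p <- 8
beta <- c(3, 1.5, 5, rep(0, 5))

# Training and independent test samples from the same model
x_tr <- scale(matrix(rnorm(n * p), n, p))
y_tr <- x_tr %*% beta + rnorm(n); y_tr <- y_tr - mean(y_tr)
x_te <- scale(matrix(rnorm(n * p), n, p))
y_te <- x_te %*% beta + rnorm(n); y_te <- y_te - mean(y_te)

# Soft-thresholding operator used by the lasso update
soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Minimal coordinate-descent lasso (no intercept; standardized columns)
lasso_cd <- function(x, y, lambda, iters = 200) {
  b <- rep(0, ncol(x)); n <- nrow(x)
  for (it in 1:iters)
    for (j in 1:ncol(x)) {
      r <- y - x[, -j, drop = FALSE] %*% b[-j]   # partial residual
      b[j] <- soft(crossprod(x[, j], r) / n, lambda) / (crossprod(x[, j]) / n)
    }
  b
}

lams <- 10^seq(-2, 1, length.out = 30)
mse  <- function(x, y, b) mean((y - x %*% b)^2)
fits <- lapply(lams, function(l) lasso_cd(x_tr, y_tr, l))
train_mse <- sapply(fits, function(b) mse(x_tr, y_tr, b))
test_mse  <- sapply(fits, function(b) mse(x_te, y_te, b))

# Training error favors the smallest lambda on the grid; test error need not.
which.min(train_mse)   # index of the lambda minimizing in-sample MSE
which.min(test_mse)    # index of the lambda minimizing out-of-sample MSE
```

So the "conflict" may simply be that BIC is being compared against an in-sample error measure, which by construction rewards the smallest lambda.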