3-Way Variable Selection in R Regression (Lasso, Stepwise, and Best Subset)

In this post, you can find code (link below) for doing variable selection in R regression in three different ways. The variable selection was done on the well-known R dataset prostate. The data is inherently separated into training and test cases. The regressions were fitted on the training data, and the prediction mean squared error was then computed on the test data.

  • Stepwise regression: Here we use the R function step(), where the AIC criterion serves as a guide for adding/deleting variables. The model returned by step() is the one that achieves the lowest AIC.
  • Lasso regression: This is a form of penalized regression. Penalized regression adds bias to the regression equation in order to reduce variance and, therefore, reduce the prediction error and avoid overfitting. The lasso shrinks some coefficients exactly to zero and therefore performs implicit feature selection.
  • Best subset regression: Here we use the R package leaps and specifically the function regsubsets(), which returns the best model of each size m = 1, …, n, where n is the number of input variables.

Regarding which variables are removed, it is interesting to note that:

  • Lasso regression and stepwise regression result in the removal of the same variable (gleason).
  • In best subset selection, when we select the regression with the smallest Cp (Mallows' Cp), the best subset is the one of size 7, with one variable removed (gleason again). When we select the subset with the smallest BIC (Bayesian Information Criterion), the best subset is the one of size 2 (the two variables that remain are lcavol and lweight).

Regarding the test error, the smallest values are achieved with lasso regression and with the best subset model of size 2.

Code for regression variable selection
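Since the post links to the code rather than reproducing it, here is a minimal sketch of the three approaches (a reconstruction, not the original linked code). It assumes the prostate data from the ElemStatLearn package, which carries a logical train indicator, and it uses glmnet for the lasso; both package choices are assumptions.

# Minimal sketch, not the original linked code. Assumes ElemStatLearn::prostate
# (with its logical 'train' column) and the glmnet implementation of the lasso.
library(ElemStatLearn)   # prostate data (assumption)
library(glmnet)          # lasso
library(leaps)           # best subset selection

data(prostate)
train <- subset(prostate, train,  select = -train)
test  <- subset(prostate, !train, select = -train)

# (1) Stepwise selection guided by AIC
step.model <- step(lm(lpsa ~ ., data = train), trace = 0)

# (2) Lasso: cross-validation picks the penalty; some coefficients become exactly zero
x <- as.matrix(train[, names(train) != "lpsa"])
cv.fit <- cv.glmnet(x, train$lpsa)
coef(cv.fit, s = "lambda.min")

# (3) Best subset of each size; choose among them by Mallows' Cp or by BIC
subs <- summary(regsubsets(lpsa ~ ., data = train))
which.min(subs$cp)
which.min(subs$bic)

# Prediction mean squared error on the test data, e.g. for the stepwise model
mean((test$lpsa - predict(step.model, test))^2)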

 

Prediction in R using Ridge Regression

Ridge regression is a regularization method in which the coefficients are shrunk in order to reduce the variance of the solution and thereby improve prediction accuracy. Below we will apply ridge regression to the longley and prostate data sets using two methods: the lm.ridge() function and the linearRidge() function. Pay special attention to the scaling of the coefficients and the offsetting of the predicted values for the lm.ridge() function.

Ridge regression in R examples
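To give a flavor of what the linked examples do, here is a minimal sketch with MASS::lm.ridge() on the longley data; the ridge constant lambda = 0.1 is an arbitrary choice for illustration, not a tuned value.

# Minimal sketch (illustrative lambda). lm.ridge() works internally with centred
# and scaled data, so predictions must be rebuilt by hand.
library(MASS)
data(longley)

rr <- lm.ridge(Employed ~ GNP + Unemployed + Armed.Forces, data = longley, lambda = 0.1)

# rr$coef holds the coefficients on the scaled predictors; coef(rr) back-transforms
# them to the original scale and supplies the intercept offset.
co <- coef(rr)
X  <- as.matrix(longley[, c("GNP", "Unemployed", "Armed.Forces")])
pred <- drop(co[1] + X %*% co[-1])   # intercept offset + original-scale slopes

head(cbind(predicted = pred, observed = longley$Employed))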

Alternative ways to compute the cross-validation and test errors in R linear regression

In this post, we will look at two different ways to compute the cross-validation error and two different ways to compute the test error in R linear regression. The data set we will use is the gala set in package faraway.

> library(faraway)

> data(gala)

> index<-1:nrow(gala)

> testindex<-sample(index,trunc(length(index)/3))

> testset<-gala[testindex,]

> trainset<-gala[-testindex,]

> nn<-nrow(trainset)

> c1<-c(1:nn)

> trainset2<-cbind(trainset,c1)

> diff1<-numeric(nn)

> foo1<-numeric(nn)

> foo2<-numeric(nn)

> diff2<-numeric(nn)

> #Cross-validation using predict() in two different ways: (a) the data points are explicitly defined using a list(); (b) the data points are given implicitly by deleting the output column (column 1).

> newdata<-trainset[,-5:-7]

> for(i in 1:nn){model1<-lm(Species~Endemics+Area+Elevation,subset=(c1!=i),data=trainset2)

+ foo1[i]<-predict(model1,list(Endemics=trainset2[i,2],Area=trainset2[i,3],Elevation=trainset2[i,4]))

+ foo2[i]<-predict(model1,newdata[i,-1])

+ diff1[i]<-trainset2[i,1]-foo1[i]

+ diff2[i]<-trainset2[i,1]-foo2[i]

+ }

#As we see below, the two ways are equivalent.

> diff1

[1]    9.9654289   -0.4143694  -14.5205954   47.9751592    8.1244225

 [6]   14.0009744   -2.5409240   30.3767094   -5.6358873   12.5939696

[11]  -34.3419570 -428.6932984    9.7597007  -25.9936027  -13.6063039

[16]  -12.0757886   86.7751835  -27.2387452  -10.6432398   -9.8346438

> diff2

[1]    9.9654289   -0.4143694  -14.5205954   47.9751592    8.1244225

 [6]   14.0009744   -2.5409240   30.3767094   -5.6358873   12.5939696

[11]  -34.3419570 -428.6932984    9.7597007  -25.9936027  -13.6063039

[16]  -12.0757886   86.7751835  -27.2387452  -10.6432398   -9.8346438

> sum11=0

> for(i in 1:nn) {

+ sum11=sum11+diff1[i]^2

+ }

> sum11<-sum11/nn

> #The number below is the leave-one-out cross-validation mean squared error.

> sum11

[1] 9926.731

 

> #Computation of the test error in two different ways: (a) using the predict() function to compute the predicted values; (b) using the coefficients of the regression equation to compute the predicted values.

> tt<-nrow(testset)

> diff3<-numeric(tt)

> model2<-lm(Species~Endemics+Area+Elevation,data=trainset)

> difftest<-numeric(tt)

> newdata2<-testset[,-5:-7]

#Using the predict() function for the computation of the predicted values.

> lmpred<-predict(model2,newdata2[,-1])

> for(i in 1:tt) {

+ difftest[i]=testset[i,1]-lmpred[i]}

> summ2=0

> for(i in 1:tt) {

+ summ2=summ2+difftest[i]^2}

> summ2<-summ2/tt

> #This is the mean squared test error.

> summ2

[1] 1091.469

#Now let’s proceed to the second way of computing the test error.

> cc<-as.matrix(cbind(rep(1,tt),testset[,c(2,3,4)]))

> pred.value<-cc%*% model2$coef

> for(i in 1:tt) {

+ difftest[i]<-testset[i,1]-pred.value[i]}

> sum3=0

> for(i in 1:tt) {

+ sum3=sum3+difftest[i]^2}

> sum3<-sum3/tt

> sum3

[1] 1091.469

#As expected, the two different methods of computing the test error yield the same number.

Graphical and Non-graphical Detection of Outliers and Influential Cases in Linear Regression Using R

 OUTLIER DETECTION

In this discussion, we are using the dataset gala in package faraway.

> gala[1:3,]

          Species Endemics  Area Elevation Nearest Scruz Adjacent

Baltra         58       23 25.09       346     0.6   0.6     1.84

Bartolome      31       21  1.24       109     0.6  26.3   572.33

Caldwell        3        3  0.21       114     2.8  58.7     0.78

and the model we will use is:

> model<-lm(Species~Endemics+Area+Elevation,data=gala)

A. Graphical Approaches for Outlier Detection         

 A. 1. From the graph of standardized residuals.

Cases of interest are those whose standardized residual exceeds 2 in absolute value. Cases with absolute values above 3 are very likely outliers.

 >places<-row.names(gala)

> plot(rstandard(model))

> identify(1:30,rstandard(model),places)

NOTE: The identify function enables the interactive labeling of index graphs in R.

[fig1: index plot of the standardized residuals]

A.2. From the graph of studentized (or jackknife) residuals

Studentized residuals: for each case, the model is refitted with that case left out, and the residual is computed from this fit. (They are sometimes called jackknife residuals.)

>library(MASS)

> plot(rstudent(model),main="Jackknifed residuals")

> identify(1:30,rstudent(model),places)

 

[fig2: index plot of the jackknife (studentized) residuals]

 

B. NON-GRAPHICAL APPROACHES.

B.1 Looking at the magnitudes of  studentized residuals.

>jack<-rstudent(model)

> absjack<-abs(jack)

> absjack[absjack >2]

    Isabela SanSalvador   SantaCruz
   3.060831    3.287014    4.092622

 

Now let's do a Bonferroni test.

>library(car)

> outlierTest(model)

          rstudent unadjusted p-value Bonferroni p
SantaCruz 4.092622         0.00039043     0.011713

So, according to the Bonferroni test, only SantaCruz is an outlier.

B.2. Looking at the magnitudes of standardized residuals

> standard<-abs(rstandard(model))

> standard[standard > 2]

Isabela SanSalvador   SantaCruz

2.662224    2.801042    3.229701

NOTE THE SMALLER MAGNITUDES

INFLUENTIAL CASES

An influential case is one that, if removed, causes a large change in the fitted equation. It is not necessary for an influential case to be an outlier, but it could be.

A. Non-Graphical Approaches for Detection of Influential Cases

A.1. Influential Cases From influence.measures()

> modi<-influence.measures(model)

> summary(modi)

Potentially influential observations of

lm(formula = Species ~ Endemics + Area + Elevation, data = gala) :

             dfb.1_   dfb.Endm dfb.Area dfb.Elvt dffit    cov.r    cook.d   hat

Fernandina   -0.28    -0.87    -0.70     1.40_*   1.44_*   4.29_*   0.52     0.75_*

Isabela      -1.69_*   2.07_* -15.46_*   1.62_* -19.88_*  14.14_*  74.71_*   0.98_*

SanSalvador   0.65    -1.16_*   0.69    -0.08    -1.73_*   0.35_*   0.54     0.22

SantaCruz    -0.85     2.41_*  -0.32    -0.86     2.80_*   0.22_*   1.22_*   0.32

SantaMaria   -0.05     0.16    -0.08    -0.03     0.19     1.46_*   0.01     0.22

 A. 2. Influential Cases From Cook’s distance

Cook's distance is one of the most widely used measures for identifying influential cases. It measures the influence of a single case on the fitted model as a whole. Cases with distances greater than 1 may indicate an unstable model.

> cook<-cooks.distance(model)

> cook[cook > 1]

Isabela SantaCruz
74.709410 1.223785

A. 3. Influential Cases From The Computation of Leverage

Leverage varies between 0 (no influence) and 1 (complete influence). We first compute the average leverage as:

average leverage = (number of predictors + 1) / sample size

Then we look for cases whose leverage exceeds 2 or 3 times the average leverage. NOTE: A case has high leverage when its predictor values lie far from those of the bulk of the data; such a case need not be an outlier in the response.
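For the model used here (3 predictors, 30 cases), the numbers work out as in this short sketch:

# Average leverage = (number of predictors + 1) / sample size
p <- 3                    # Endemics, Area, Elevation
n <- nrow(gala)           # 30 islands
avlev <- (p + 1)/n        # 4/30, about 0.133
c(average = avlev, twice = 2*avlev, thrice = 3*avlev)   # cut-offs about 0.27 and 0.4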

In R, the leverage is measured by the hat values:

> x<-model.matrix(model)

> lev<-hat(x)

> places<-row.names(gala)

> names(lev)<-places

> lev[lev > 2*sum(lev)/30]
Fernandina Isabela SantaCruz

0.7547450 0.9768328 0.3193987

 

 B. Graphical Detection of Influential Cases

B.1  From Cook’s distance

>cook<-cooks.distance(model)

> plot(cook,ylab="Cooks distance")

> identify(1:30,cook,places)

[fig3: index plot of Cook's distances]

 

B.2. From Leverage

> plot(lev,ylab="Leverages",main="Index plot of leverages")

> sum(lev)

[1] 4

> abline(h=8/30)   #NOTE: this line is drawn at 2*sum(lev)/n = 8/30

> identify(1:30,lev,places)

[1] 12 16 25

 

[fig4: index plot of the leverages]

B.3. From the function plot(model)

[fig5: diagnostic plots produced by plot(model)]

Statistical Learning: Maximum Likelihood and Maximum A Posteriori Examples

 

1. The Maximum Likelihood (ML) Approach

Here the parameter to be estimated is considered to be fixed but unknown.

Example: A hardware defect started appearing in the assembly line of a computer manufacturer. In the past week, the following observations of the defect were made: Monday (2), Tuesday (2), Wednesday(3), Thursday(1), Friday(0). The assembly line manager knows that the distribution of the defects follows the Poisson distribution, which is a discrete distribution used often to model the number of random events that occur in an interval of time. The probability that K events occur in a given interval is given by:

P(K) = e^{-\lambda} \frac{\lambda^K}{K!}

lambda is the parameter of the distribution, and it is equal to both the mean and the variance of the distribution. The manager will use maximum likelihood to estimate lambda.

The likelihood function to maximize is proportional to:

P(k_{1}=2, k_{2}=2, k_{3}=3, k_{4}=1, k_{5}=0) = e^{-5\lambda}\, \lambda^{\sum_{n=1}^{5} k_{n}}

In order to maximize this function, we take the derivative and set it to 0, which eventually yields:

\hat{\lambda} = \frac{\sum_{n=1}^{5} k_{n}}{5} = 8/5
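The estimate is easy to check numerically in R (a small sketch, not part of the original derivation):

# Observed defect counts for the five days
defects <- c(2, 2, 3, 1, 0)

# For a Poisson model the ML estimate of lambda is simply the sample mean
mean(defects)                      # 1.6, i.e. 8/5

# The same answer by direct maximization of the log-likelihood
optimize(function(l) sum(dpois(defects, l, log = TRUE)),
         interval = c(0.01, 10), maximum = TRUE)$maximum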

Statistical learning review question: What is the relationship between ML and least squares estimation? When the observations are independent and normally distributed with constant variance, ML and least squares estimation produce the same parameter values [1].
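A quick illustration of this equivalence, using the gala data from the earlier posts (glm() with a gaussian family fits by maximum likelihood, lm() by least squares):

# Least squares and Gaussian maximum likelihood give identical coefficients
library(faraway)
fit.ls <- lm(Species ~ Endemics + Area + Elevation, data = gala)
fit.ml <- glm(Species ~ Endemics + Area + Elevation, family = gaussian, data = gala)
all.equal(coef(fit.ls), coef(fit.ml))   # TRUE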

2.  The Maximum A Posteriori (MAP) Approach

Here the parameter to be estimated is treated as a random variable, for which prior information is available.

Example : A technical institute in Greece accepts students from two pools each year: (a) Most of the accepted students come from students that apply for the first time (b) A smaller number of the accepted students come from students that apply for a second time. All students are required to take an entrance examination (the max. grade is 100). A second-try student who scored 70 wants to estimate the probability he will be accepted.

The following information is available from previous years: (a) the grade distribution of  accepted second-try-students is a Gaussian with mean 80 and st.dev 10. (b)The grade distribution of not accepted second-try-students is a Gaussian with mean 40 and st. dev 20. (c) The probability of a second-try-student being accepted is 1/5.

P(\text{accepted 2nd try} \mid 70) = \frac{P(70 \mid \text{accepted 2nd try})\, P(\text{accepted 2nd try})}{P(70 \mid \text{NOT accepted 2nd try})\, P(\text{NOT accepted 2nd try}) + P(70 \mid \text{accepted 2nd try})\, P(\text{accepted 2nd try})}

Doing computations in R, we get:

dnorm(70,mean=80,sd=10) gives P(70 | accepted 2nd try) = 0.024.

dnorm(70,mean=40,sd=20) gives P(70 | NOT accepted 2nd try) = 0.0064.

So, P(accepted 2nd try | 70) = (0.024 × 1/5) / (0.024 × 1/5 + 0.0064 × 4/5) = 0.0048/(0.0048 + 0.00512) = 0.0048/0.00992 = 0.484.

Similarly,

P(\text{NOT accepted 2nd try} \mid 70) = \frac{P(70 \mid \text{NOT accepted 2nd try})\, P(\text{NOT accepted 2nd try})}{P(70 \mid \text{NOT accepted 2nd try})\, P(\text{NOT accepted 2nd try}) + P(70 \mid \text{accepted 2nd try})\, P(\text{accepted 2nd try})}

So, P(NOT accepted 2nd try | 70) = (0.0064 × 4/5) / (0.024 × 1/5 + 0.0064 × 4/5) = 0.00512/(0.0048 + 0.00512) = 0.00512/0.00992 = 0.516.
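The whole calculation can also be done directly in R (a small sketch using the numbers above):

# Prior and likelihoods from the example (note sd = 20 for the not-accepted group)
p.accept      <- 1/5
lik.accept    <- dnorm(70, mean = 80, sd = 10)   # about 0.024
lik.notaccept <- dnorm(70, mean = 40, sd = 20)   # about 0.0065

# Posterior probability of acceptance given a grade of 70 (Bayes' rule)
post.accept <- lik.accept*p.accept / (lik.accept*p.accept + lik.notaccept*(1 - p.accept))
post.accept        # about 0.48
1 - post.accept    # about 0.52, the probability of not being accepted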

Unfortunately, the probability of not being accepted is greater than that of being accepted, so the student is not too happy about this.

A bibliographical note: In reference [2], ML and MAP were compared for the estimation of physiological parameters, and it was concluded that MAP gave a more precise estimate than ML. In general, the authors conclude that when one has few or noise-impaired measurements, a priori information can be very useful when incorporated into a MAP-type estimation.

[1] Myung, I. J., "Tutorial on Maximum Likelihood Estimation", Journal of Mathematical Psychology, vol. 47, no. 1, pp. 90-100, 2003.

[2] Sparacino, G. et al., "Maximum-Likelihood Versus Maximum a Posteriori Parameter Estimation of Physiological System Models: The C-peptide Impulse Response Case Study", IEEE Transactions on Biomedical Engineering, vol. 47, no. 6, pp. 801-811, June 2000.

Support vector machine regression in R: Tuning the parameters yields the best predictive power

In this example, we will compare four types of regression on the benchmark dataset BostonHousing in package mlbench. Specifically, we will compare linear regression, regression trees, support vector machine regression using default parameters, and support vector machine regression using tuned parameters. The predictive power will be computed using the PRESS metric, described in previous posts. Here is the code:

>library(mlbench)

>library(e1071)

>data(BostonHousing)

> c1<-c(1:506)   #506 is the number of rows in BostonHousing, i.e. nrow(BostonHousing)

> BostonHousingmod<-cbind(BostonHousing,c1)

NOTE: The introduction of the c1 variable is purely for the computation of the PRESS statistic as shown below.

LINEAR REGRESSION CODE AND COMPUTATION OF PRESS METRIC

> difflm<-numeric(506)
> for(i in 1: 506) {model1m<-lm(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+b+lstat,subset=(c1!=i), data=BostonHousingmod)
+ newData<-data.frame(BostonHousingmod[i,-c(14,15)])
+ specpr<-predict(model1m,newData)
+ difflm[i]<-BostonHousingmod[i,14]-specpr
+ }
> summ1=0
> for(i in 1:506) {summ1=summ1+difflm[i]^2}
> summ1
[1] 12005.23

So the PRESS metric is 12005.23.

SUPPORT VECTOR MACHINE REGRESSION USING DEFAULT PARAMETERS
> diffsvm<-numeric(506)
> for(i in 1: 506) {modelsvm<-svm(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+b+lstat,subset=(c1!=i), data=BostonHousingmod)
+ newData<-data.frame(BostonHousingmod[i,-c(14,15)])
+ specpr<-predict(modelsvm,newData)
+ diffsvm[i]<-BostonHousingmod[i,14]-specpr
+ }
>
> summ2=0
> for(i in 1:506) {summ2=summ2+diffsvm[i]^2}
> summ2
[1] 6436.89

So the PRESS metric here is 6436.89.

REGRESSION TREE CODE AND COMPUTATION OF PRESS
> library(rpart)
> diffrpart<-numeric(506)
> for(i in 1: 506) {modelrpart<-rpart(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+b+lstat,subset=(c1!=i), data=BostonHousingmod)
+ newData<-data.frame(BostonHousingmod[i,-c(14,15)])
+ specpr<-predict(modelrpart,newData)
+ diffrpart[i]<-BostonHousingmod[i,14]-specpr
+ }
> summ3=0
> for(i in 1:506) {summ3=summ3+diffrpart[i]^2}
> summ3
[1] 11452.59

So, the PRESS metric here is 11452.59.

CODE FOR THE TUNING OF THE SUPPORT VECTOR MACHINE REGRESSION PARAMETERS

> difftune<-numeric(506)
> cc<-seq(0.1,0.9,by=0.1)
> cc2<-seq(100,1000,by=100)
> modeltune<-tune.svm(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+b+lstat,gamma=cc,cost=cc2, data=BostonHousingmod)
> modeltune

Parameter tuning of ‘svm’:

- sampling method: 10-fold cross validation

- best parameters:
 gamma cost
   0.1  100

- best performance: 10.65047

CODE FOR THE SUPPORT VECTOR MACHINE REGRESSION USING TUNED PARAMETERS AND COMPUTATION OF THE PRESS METRIC

> for(i in 1: 506) {finalmodel<-svm(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+b+lstat,gamma=0.1,cost=100,subset=(c1!=i), data=BostonHousingmod)
+ newData<-data.frame(BostonHousingmod[i,-c(14,15)])
+ specpr<-predict(finalmodel,newData)
+ difftune[i]<-BostonHousingmod[i,14]-specpr
+ }
> sumnew=0
> for(i in 1:506) {sumnew=sumnew+difftune[i]^2}
> sumnew
[1] 5627.534

So, the tuned support vector machine regression has the best (smallest) PRESS metric, with a value of 5627.534. Therefore, for this example, the tuned support vector machine regression has shown the best predictive power.

By the way, this is what the finalmodel contains:
> finalmodel

Call:
svm(formula = medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio +
b + lstat, data = BostonHousingmod, gamma = 0.1, cost = 100, subset = (c1 != i))
Parameters:
SVM-Type: eps-regression
SVM-Kernel: radial
cost: 100
gamma: 0.1
epsilon: 0.1
Number of Support Vectors: 329

 

 

My posts on AnalyticBridge regarding regression, cross-validation, and predictive power

 

http://www.analyticbridge.com/profiles/blogs/cross-validation-in-r-a-do-it-yourself-and-a-black-box-approach

http://www.analyticbridge.com/profiles/blogs/use-press-not-r-squared-to-judge-predictive-power-of-regression

Leave-one-out cross-validation in R and computation of the predicted residual sum of squares (PRESS) statistic

We will use the gala dataset in the faraway package to demonstrate leave-one-out cross-validation. In this type of validation, one case of the data set is left out and used as the testing set and the remaining data are used as the training set for the regression. This process is repeated until each case in  the data set has served as the testing set.

The key concept in creating the iterative leave-one-out process in R is creating a column vector c1 and attaching it to gala as shown below. This allows us to uniquely identify the row that is to be left out in each iteration.

> library(faraway)

> gala[1:3,]

          Species Endemics  Area Elevation Nearest Scruz Adjacent

Baltra         58       23 25.09       346     0.6   0.6     1.84

Bartolome      31       21  1.24       109     0.6  26.3   572.33

Caldwell        3        3  0.21       114     2.8  58.7     0.78

>c1<-c(1:30)

> gala2<-cbind(gala,c1)

> gala2[1:3,]

          Species Endemics  Area Elevation Nearest Scruz Adjacent c1
Baltra         58       23 25.09       346     0.6   0.6     1.84  1
Bartolome      31       21  1.24       109     0.6  26.3   572.33  2
Caldwell        3        3  0.21       114     2.8  58.7     0.78  3

> diff1<-numeric(30)

> for(i in 1:30){model1<-lm(Species~Endemics+Area+Elevation,subset=(c1!=i),data=gala2)

+ specpr<-predict(model1,list(Endemics=gala2[i,2],Area=gala2[i,3],Elevation=gala2[i,4]))

+ diff1[i]<-gala2[i,1]-specpr }

 

>summ1<-numeric(1)

>summ1=0

> for(i in 1:30){summ1<-summ1+diff1[i]^2}

> summ1

[1] 259520.5

The variable summ1 holds the value of  the PRESS statistic.

The Akaike and Bayesian information criteria for regression quality

R squared, the coefficient of determination, tells us about the goodness of fit of our regression, because it is defined as the percentage of the variation of the dependent variable that is explained by the regression model. The higher the explanatory power of our model, the larger the R squared. However, R squared increases with the number of variables in the model, which is a drawback because it does not protect us from overfitting. Two criteria that can protect us from overfitting are the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). They both penalize a model for additional, but not very useful, terms. To put it simply, AIC and BIC encourage model conciseness, while R squared does not. In contrast to R squared, the smaller the AIC and BIC, the better.

Let's compute AIC and BIC in R, using two different models for the gala dataset. The dataset records the number of plant species on each of 30 Galapagos islands, together with predictors such as area and elevation.

> library(faraway)
> gala[1:3,]
          Species Endemics  Area Elevation Nearest Scruz Adjacent
Baltra         58       23 25.09       346     0.6   0.6     1.84
Bartolome      31       21  1.24       109     0.6  26.3   572.33
Caldwell        3        3  0.21       114     2.8  58.7     0.78

>model<-lm(Species~.,data=gala)

> summary(model)

……………………..

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -15.337942   9.423550  -1.628    0.117
Endemics      4.393654   0.481203   9.131 4.13e-09 ***
Area          0.013258   0.011403   1.163    0.257
Elevation    -0.047537   0.047596  -0.999    0.328
Nearest      -0.101460   0.500871  -0.203    0.841
Scruz         0.008256   0.105884   0.078    0.939
Adjacent      0.001811   0.011879   0.152    0.880

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 28.96 on 23 degrees of freedom
Multiple R-squared: 0.9494, Adjusted R-squared: 0.9362
F-statistic: 71.88 on 6 and 23 DF, p-value: 9.674e-14

Now we use R's step() function, which selects the model with the lowest AIC. This turns out to be a more concise model than the full model above. We call it model2, and it is shown below.

> model2<-step(model)

> model2<-lm(Species~Endemics+Area+Elevation,data=gala)
> summary(model2)

………………………..

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -15.891237   7.569210  -2.099   0.0456 *
Endemics      4.331791   0.304686  14.217 8.97e-14 ***
Area          0.012669   0.008936   1.418   0.1681
Elevation    -0.041439   0.023653  -1.752   0.0916 .

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 27.29 on 26 degrees of freedom
Multiple R-squared: 0.9492, Adjusted R-squared: 0.9433
F-statistic: 161.8 on 3 and 26 DF, p-value: < 2.2e-16

> AIC(model)
[1] 295.128

> AIC(model2)
[1] 289.243
> BIC(model)
[1] 306.3376
> BIC(model2)
[1] 296.249

We see that the more concise model2 is rewarded by both AIC and BIC (AIC and BIC are both smaller for model2).
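As a side note, both criteria can be reproduced from the log-likelihood of the fitted model; the sketch below does this for model2 (the parameter count returned by logLik() includes the residual variance).

# Recompute AIC and BIC for model2 from its log-likelihood
ll <- logLik(model2)      # log-likelihood of the fitted model
k  <- attr(ll, "df")      # number of estimated parameters (4 coefficients + sigma)
n  <- nobs(model2)        # 30 observations

-2*as.numeric(ll) + 2*k          # same value as AIC(model2)
-2*as.numeric(ll) + k*log(n)     # same value as BIC(model2)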

Correlation of the columns of a dataframe in R

1. Columns contain numerical values.

Here we can use the function cor().

Example:

> library(MASS)
> cor(trees)
           Girth    Height    Volume
Girth  1.0000000 0.5192801 0.9671194
Height 0.5192801 1.0000000 0.5982497
Volume 0.9671194 0.5982497 1.0000000

2. Some columns contain numerical and some contain ordinal values.

Here we can use the function hetcor() in the R package polycor. This function computes Pearson correlations between numeric variables, polyserial correlations between numeric and ordinal variables, and polychoric correlations between ordinal variables. We will try it on the dataset quine in the MASS package.

Example:

> library(MASS)
> library(polycor)
> quine[1:4,]
  Eth Sex Age Lrn Days
1   A   M  F0  SL    2
2   A   M  F0  SL   11
3   A   M  F0  SL   14
4   A   M  F0  AL    5
> hetcor(quine)

Two-Step Estimates

Correlations/Type of Correlation:
           Eth        Sex        Age        Lrn       Days
Eth          1 Polychoric Polychoric Polychoric Polyserial
Sex   0.008333          1 Polychoric Polychoric Polyserial
Age   -0.02581   -0.08348          1 Polychoric Polyserial
Lrn    0.03389    -0.2393    -0.3187          1 Polyserial
Days   -0.3504     0.1048     0.1773    0.05657          1

Standard Errors:
         Eth    Sex     Age    Lrn
Eth
Sex   0.1305
Age   0.1109 0.1103
Lrn   0.1307 0.1259  0.1026
Days 0.09246 0.1025 0.08507 0.1039

n = 146

P-values for Tests of Bivariate Normality:
          Eth      Sex       Age      Lrn
Eth
Sex      <NA>
Age    0.8355  0.01814
Lrn      <NA>     <NA> 7.895e-11
Days 0.008092 0.009787  0.007213 0.005257