Graphical and Non-graphical Detection of Outliers and Influential Cases in Linear Regression Using R

OUTLIER DETECTION

In this discussion, we are using the dataset gala in package faraway.

> gala[1:3,]

Species Endemics Area Elevation Nearest Scruz Adjacent

Baltra 58 23 25.09 346 0.6 0.6 1.84

Bartolome 31 21 1.24 109 0.6 26.3 572.33

Caldwell 3 3 0.21 114 2.8 58.7 0.78

and the model we will use is:

> model<-lm(Species~Endemics+Area+Elevation,data=gala)

A. Graphical Approaches for Outlier Detection

A. 1. From the graph of standardized residuals.

Cases of interest are those where standardized residual > 2. Cases with values > 3 are very likely outliers.

>places<-row.names(gala)

> plot(rstandard(model))

> identify(1:30,rstandard(model),places)

NOTE: The identify function enables the interactive labeling of index graphs in R.

A.2.From the graph of studentized (OR JACKNIFE RESIDUALS)

The Studentized residuals: These are fitted ignoring the current data point. (They are sometimes called jackknifed residuals).

>library(MASS)

>plot(rstudent(model),main=”Jackknifed residuals”)

> identify(1:30,rstudent(model),places)

B. NON-GRAPHICAL APPROACHES.

B.1 Looking at the magnitudes of studentized residuals.

>jack<-rstudent(model)

> absjack<-abs(jack)

> absjack[absjack >2]

Isabela SanSalvador SantaCruz

3.060831 3.287014 4.092622

Now let’s do a Bonferonni test.

>library(car)

> outlierTest(model)

rstudent unadjusted p-value Bonferonni p

SantaCruz 4.092622 0.00039043 0.011713

So, according to the Bonferroni test, only Santacruz is an outlier.

B,2 Looking at the magnitude of Standard residuals

> standard<-abs(rstandard(model))

> standard[standard > 2]

Isabela SanSalvador SantaCruz

2.662224 2.801042 3.229701

NOTE THE SMALLER MAGNITUDES

INFLUENTIAL CASES

An influential case is one that, if removed, causes a large change in the fitted equation. It is not necessary for an influential case to be an outlier, but it could be.

A. Non-Graphical Approaches for Detection of Influential Cases

A.1. Influential Cases From lm.influence()

> modi<-influence.measures(model)

> summary(modi)

Potentially influential observations of

lm(formula = Species ~ Endemics + Area + Elevation, data = gala) :

dfb.1_ dfb.Endm dfb.Area dfb.Elvt dffit cov.r cook.d hat

Fernandina -0.28 -0.87 -0.70 1.40_* 1.44_* 4.29_* 0.52 0.75_*

Isabela -1.69_* 2.07_* -15.46_* 1.62_* -19.88_* 14.14_* 74.71_* 0.98_*

SanSalvador 0.65 -1.16_* 0.69 -0.08 -1.73_* 0.35_* 0.54 0.22

SantaCruz -0.85 2.41_* -0.32 -0.86 2.80_* 0.22_* 1.22_* 0.32

SantaMaria -0.05 0.16 -0.08 -0.03 0.19 1.46_* 0.01 0.22

A. 2. Influential Cases From Cook’s distance

Cook’s distance is the best way to identify influential cases. Cook’s distance: This distance measures the influence of a single case on the entirety of the fitted model. If we have cases with distances > 1, this could be an indicator of an unstable model.

> cook<-cooks.distance(model)

> cook[cook > 1]

Isabela SantaCruz
74.709410 1.223785

A. 3. Influential Cases From The Computation of Leverage

Leverage: it varies between 0 (no influence) and 1 (complete influence). We first compute the average leverage as follows:

av. leverage=(number of predictors + 1 /divided by sample size)

Then we look for cases with leverage values > 2 or 3 X av. Leverage. NOTE: A case gets leverage if for example it is an outlier but has some other outliers close to it.

In R, the leverage is measured by the hat values:

> x<-model.matrix(model)

> lev<-hat(x)

> places<-row.names(gala)

> names(lev)<-places

> lev[lev > (2*sum(y))/30]
Fernandina Isabela SantaCruz

0.7547450 0.9768328 0.3193987

B. Graphical Detection of Influential Cases

B.1 From Cook’s distance

>cook<-cooks.distance(model)

> plot(cook,ylab=”Cooks distance”)

> identify(1:30,cook,places)

B.2. From Leverage

>plot(lev,ylab=”Leverages”,main=”Index plot of leverages”)

> sum(lev)

[1] 4

> abline(h=8/30) (NOTE: This line is at 2*sum(lev)/ N.)

> identify(1:30,lev,places)

[1] 12 16 25

B.3.From the function plot(model)

Support vector machine regression in R: Tuning the parameters yields the best predictive power

In this example, we will compare 4 types of regression on the benchmark dataset BostonHousing in package “mlbench”. Specifically, we will compare linear regression, regression tree regression, support vector machine regression using default parameters, and support vector machine regression using tuned parameters. The predictive power will be computed using the PRESS metric, described in previous posts. Here is the code:

>library(mlbench)

>library(e1071)

>data(BostonHousing)

>c1<-c(1:506) . Note that 506 is the number of rows in BostonHousing (nrow(BostonHousing)).

BostonHousingmod<-cbind(BostonHousing,c1).

NOTE: The introduction of the c1 variable is purely for the computation of the PRESS statistic as shown below.

LINEAR REGRESSION CODE AND COMPUTATION OF PRESS METRIC

> difflm<-numeric(506)
> for(i in 1: 506) {model1m<-lm(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+b+lstat,subset=(c1!=i), data=BostonHousingmod)
+ newData<-data.frame(BostonHousingmod[i,-14,-15])
+ specpr<-predict(model1m,newData)
+ difflm[i]<-BostonHousingmod[i,14]-specpr
+ }
> summ1=0
> for(i in 1:506) {summ1=summ1+difflm[i]^2}
> summ1
[1] 12005.23

So the PRESS metric is 12005.23.

SUPPORT VECTOR MACHINE REGRESSION USING DEFAULT PARAMETERS
> diffsvm<-numeric(506)
> for(i in 1: 506) {modelsvm<-svm(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+b+lstat,subset=(c1!=i), data=BostonHousingmod)
+ newData<-data.frame(BostonHousingmod[i,-14,-15])
+ specpr<-predict(modelsvm,newData)
+ diffsvm[i]<-BostonHousingmod[i,14]-specpr
+ }
>
> summ2=0
> for(i in 1:506) {summ2=summ2+diffsvm[i]^2}
> summ2
[1] 6436.89

So the PRESS metric here is 6436.89.

REGRESSION TREE CODE AND COMPUTATION OF PRESS
> diffrpart<-numeric(506)
> for(i in 1: 506) {modelrpart<-rpart(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+b+lstat,subset=(c1!=i), data=BostonHousingmod)
+ newData<-data.frame(BostonHousingmod[i,-14,-15])
+ specpr<-predict(modelrpart,newData)
+ diffrpart[i]<-BostonHousingmod[i,14]-specpr
+ }
> summ3=0
> for(i in 1:506) {summ3=summ3+diffrpart[i]^2}
> summ3
[1] 11452.59

So, the PRESS metric here is 11452.59.

CODE FOR THE TUNING OF THE SUPPORT VECTOR MACHINE REGRESSION PARAMETERS

difftune<-numeric(506)
> cc<-seq(0.1,0.9,by=0.1)
> cc2<-seq(100,1000,by=100)
> modeltune<-tune.svm(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+b+lstat,gamma=cc,cost=cc2, data=BostonHousingmod)
> modeltune

Parameter tuning of ‘svm’:

– sampling method: 10-fold cross validation

– best parameters:
gamma cost
0.1 100

– best performance: 10.65047

CODE FOR THE SUPPORT VECTOR MACHINE REGRESSION USING TUNED PARAMETERS AND COMPUTATION OF THE PRESS METRIC

> for(i in 1: 506) {finalmodel<-svm(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+b+lstat,gamma=0.1,cost=100,subset=(c1!=i), data=BostonHousingmod)
+ newData<-data.frame(BostonHousingmod[i,-14,-15])
+ specpr<-predict(finalmodel,newData)
+ difftune[i]<-BostonHousingmod[i,14]-specpr
+ }
> sumnew=0
> for(i in 1:506) {sumnew=sumnew+difftune[i]^2}
> sumnew
[1] 5627.534

SO, the tuned support machine regression has the BEST (SMALLEST) PRESS metric with value 5627.534. Therefore, for this example, the tuned support vector machine regression has shown the best predictive power.

By the way, this is what the finalmodel contains:
> finalmodel

Call:
svm(formula = medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio +
b + lstat, data = BostonHousingmod, gamma = 0.1, cost = 100, subset = (c1 != i))
Parameters:
SVM-Type: eps-regression
SVM-Kernel: radial
cost: 100
gamma: 0.1
epsilon: 0.1
Number of Support Vectors: 329

My posts on Analytic bridge regarding regression,cross validation, and predictive power

http://www.analyticbridge.com/profiles/blogs/cross-validation-in-r-a-do-it-yourself-and-a-black-box-approach

http://www.analyticbridge.com/profiles/blogs/use-press-not-r-squared-to-judge-predictive-power-of-regression

Leave-one-out cross validation in R and computation of the predicted residual sum of squares(PRESS) statistic

We will use the gala dataset in the faraway package to demonstrate leave-one-out cross-validation. In this type of validation, one case of the data set is left out and used as the testing set and the remaining data are used as the training set for the regression. This process is repeated until each case in the data set has served as the testing set.

The key concept in creating the iterative leave-one-out process in R is creating a column vector c1 and attaching it to gala as shown below, This allows us to uniquely identify the row that is to be left out in each iteration.

> library(faraway)

> gala[1:3,]

Species Endemics Area Elevation Nearest Scruz Adjacent

Baltra 58 23 25.09 346 0.6 0.6 1.84

Bartolome 31 21 1.24 109 0.6 26.3 572.33

Caldwell 3 3 0.21 114 2.8 58.7 0.78

>c1<-c(1:30)

> gala2<-cbind(gala,c1)

> gala2[1:3,]

Species Endemics Area Elevation Nearest Scruz Adjacent c1

Baltra 58 23 25.09 346 0.6 0.6 1.84 1

Bartolome 31 21 1.24 109 0.6 26.3 572.33 2

> diff1<-numeric(30)

> for(i in 1:30){model1<-lm(Species~Endemics+Area+Elevation,subset=(c1!=i),data=gala2)

+ specpr<-predict(model1,list(Endemics=gala2[i,2],Area=gala2[i,3],Elevation=gala2[i,4]),data=gala2)

+ diff1[i]<-gala2[i,1]-specpr }

>summ1<-numeric(1)

>summ1=0

> for(i in 1:30){summ1<-summ1+diff1[i]^2}

> summ1

[1] 259520.5

The variable summ1 holds the value of the PRESS statistic.

The Akaike and Bayesian information criteria for regression quality

R squared, the coefficient of determination, tells us about the goodness of fit of our regression. Why? Because it is defined as the percentage of variation of the dependent variable explained by the regression model. The higher the explaining power of our model, the larger the R squared. R squared increases with the number of variables in our model, and this is not very good because it does not protect us from overfitting. Two criteria that can protect us from overfitting are the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). They both penalize a model for additional, but not very useful, terms. To put it simply, AIC and BIC encourage model conciseness, while R squared does not. In contrast to R squared, the smaller AIC and BIC, the better.

Let’s compute AIC and BIC in R, using two different models for the gala dataset in R. It shows the number of species in an area, based on several parameters such as area, elevation, etc.

> library(faraway)
> gala[1:3,]
Species Endemics Area Elevation Nearest Scruz Adjacent
Baltra 58 23 25.09 346 0.6 0.6 1.84
Bartolome 31 21 1.24 109 0.6 26.3 572.33
Caldwell 3 3 0.21 114 2.8 58.7 0.78

>model<-lm(Species~.,data=gala)

> summary(model)

……………………..

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -15.337942 9.423550 -1.628 0.117
Endemics 4.393654 0.481203 9.131 4.13e-09 ***
Area 0.013258 0.011403 1.163 0.257
Elevation -0.047537 0.047596 -0.999 0.328
Nearest -0.101460 0.500871 -0.203 0.841
Scruz 0.008256 0.105884 0.078 0.939
Adjacent 0.001811 0.011879 0.152 0.880
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 28.96 on 23 degrees of freedom
Multiple R-squared: 0.9494, Adjusted R-squared: 0.9362
F-statistic: 71.88 on 6 and 23 DF, p-value: 9.674e-14

Now we use the step function R, that gives us the best model according to R. This model turns out to be a more concise model than the full model above. We call it model2, and it is shown below.

> model2<-step(model)

> model2<-lm(Species~Endemics+Area+Elevation,data=gala)
> summary(model2)

………………………..

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -15.891237 7.569210 -2.099 0.0456 *
Endemics 4.331791 0.304686 14.217 8.97e-14 ***
Area 0.012669 0.008936 1.418 0.1681
Elevation -0.041439 0.023653 -1.752 0.0916 .
—
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 27.29 on 26 degrees of freedom
Multiple R-squared: 0.9492, Adjusted R-squared: 0.9433
F-statistic: 161.8 on 3 and 26 DF, p-value: < 2.2e-16

> AIC(model)
[1] 295.128

> AIC(model2)
[1] 289.243
> BIC(model)
[1] 306.3376
> BIC(model2)
[1] 296.249

We see that the more concise model2 is rewarded by both AIC and BIC (AIC and BIC are both smaller for model2).

Correlation of the columns of a dataframe in R

1. Columns contain numerical values.

Here we can use the function cor().

Example:

> library(MASS)
> cor(trees)
Girth Height Volume
Girth 1.0000000 0.5192801 0.9671194
Height 0.5192801 1.0000000 0.5982497
Volume 0.9671194 0.5982497 1.0000000

2. Some columns contain numerical and some contain ordinal values.

Here we can use the function hetcor() in R package polycor. This function computes Pearson correlations between numeric columns, polyserial correlations between numeric and ordinal variables, and polychoric correlations between ordinal variables. We will try it on dataset quine in the faraway package.

Example:

>library(faraway)
> library(polycor)
> quine[1:4,]
Eth Sex Age Lrn Days
1 A M F0 SL 2
2 A M F0 SL 11
3 A M F0 SL 14
4 A M F0 AL 5
> hetcor(quine)

Two-Step Estimates

Correlations/Type of Correlation:
Eth Sex Age Lrn Days
Eth 1 Polychoric Polychoric Polychoric Polyserial
Sex 0.008333 1 Polychoric Polychoric Polyserial
Age -0.02581 -0.08348 1 Polychoric Polyserial
Lrn 0.03389 -0.2393 -0.3187 1 Polyserial
Days -0.3504 0.1048 0.1773 0.05657 1

Standard Errors:
Eth Sex Age Lrn
Eth
Sex 0.1305
Age 0.1109 0.1103
Lrn 0.1307 0.1259 0.1026
Days 0.09246 0.1025 0.08507 0.1039

n = 146

P-values for Tests of Bivariate Normality:
Eth Sex Age Lrn
Eth
Sex <NA>
Age 0.8355 0.01814
Lrn <NA> <NA> 7.895e-11
Days 0.008092 0.009787 0.007213 0.005257

Kolmogorov-Smirnov test

TIP OF THE DAY:
For Comparing two sample populations
While t-tests can be used to detect differences in the mean and Levene’s test can be used to detect differences in the variance, the Kolmogorov-Smirnov test can be used to detect a change either in the mean or the difference of even in the shape of the corresponding population distributions.
An example in R:

> s3<-c(1,5,3,9,2,6,7,9)

> s4<-c(5,5,1,10,4,12,3,3)

> ks.test(s3,s4)

Two-sample Kolmogorov-Smirnov test

data: s3 and s4

D = 0.25, p-value = 0.964

alternative hypothesis: two-sided

Thus, the test shows that the null hypothesis (that the two samples come from the same distribution) can not be rejected.