Graphical and Non-graphical Detection of Outliers and Influential Cases in Linear Regression Using R

 OUTLIER DETECTION

In this discussion, we are using the dataset gala in package faraway.

> gala[1:3,]

     Species Endemics  Area Elevation Nearest Scruz Adjacent

Baltra         58       23 25.09       346     0.6   0.6     1.84

Bartolome      31       21  1.24       109     0.6  26.3   572.33

Caldwell        3        3  0.21       114     2.8  58.7     0.78

and the model we will use is:

> model<-lm(Species~Endemics+Area+Elevation,data=gala)

A. Graphical Approaches for Outlier Detection         

 A. 1. From the graph of standardized residuals.

Cases of interest are those where standardized residual > 2. Cases with values > 3 are very likely outliers.

 >places<-row.names(gala)

> plot(rstandard(model))

> identify(1:30,rstandard(model),places)

NOTE: The identify function enables the interactive labeling of index graphs in R.

fg1

 A.2.From the graph of studentized (OR JACKNIFE RESIDUALS)

The Studentized residuals: These are fitted ignoring the current data point. (They are sometimes called jackknifed residuals).

>library(MASS)

>plot(rstudent(model),main=”Jackknifed residuals”)

> identify(1:30,rstudent(model),places)

 

fig2

 

B. NON-GRAPHICAL APPROACHES.

B.1 Looking at the magnitudes of  studentized residuals.

>jack<-rstudent(model)

> absjack<-abs(jack)

> absjack[absjack >2]

Isabela SanSalvador SantaCruz

3.060831 3.287014 4.092622

 

 Now let’s do a Bonferonni test.

>library(car)

> outlierTest(model)

rstudent unadjusted p-value Bonferonni p

SantaCruz 4.092622         0.00039043     0.011713

So, according to the Bonferroni test, only Santacruz is an outlier.

B,2 Looking at the magnitude of Standard residuals

> standard<-abs(rstandard(model))

> standard[standard > 2]

Isabela SanSalvador   SantaCruz

2.662224    2.801042    3.229701

NOTE THE SMALLER MAGNITUDES

INFLUENTIAL CASES

An influential case is one that, if removed, causes a large change in the fitted equation. It is not necessary for an influential case to be an outlier, but it could be.

A. Non-Graphical Approaches for Detection of Influential Cases

A.1. Influential Cases From lm.influence()

> modi<-influence.measures(model)

> summary(modi)

Potentially influential observations of

lm(formula = Species ~ Endemics + Area + Elevation, data = gala) :

dfb.1_   dfb.Endm dfb.Area dfb.Elvt dffit    cov.r    cook.d   hat

Fernandina   -0.28    -0.87    -0.70     1.40_*   1.44_*   4.29_*   0.52     0.75_*

Isabela      -1.69_*   2.07_* -15.46_*   1.62_* -19.88_*  14.14_*  74.71_*   0.98_*

SanSalvador   0.65    -1.16_*   0.69    -0.08    -1.73_*   0.35_*   0.54     0.22

SantaCruz    -0.85     2.41_*  -0.32    -0.86     2.80_*   0.22_*   1.22_*   0.32

SantaMaria   -0.05     0.16    -0.08    -0.03     0.19     1.46_*   0.01     0.22

 A. 2. Influential Cases From Cook’s distance

Cook’s distance is the best way to identify influential cases. Cook’s distance: This distance measures the influence of a single case on the entirety of the fitted model. If we have cases with distances > 1, this could be an indicator of an unstable model.

> cook<-cooks.distance(model)

> cook[cook > 1]

Isabela SantaCruz
74.709410 1.223785

A. 3. Influential Cases From The Computation of Leverage

Leverage:  it varies between 0 (no influence) and 1 (complete influence). We first compute the average leverage as follows:

av. leverage=(number of predictors + 1 /divided by sample size)

Then we look for cases with leverage values > 2 or 3 X av. Leverage.  NOTE: A case gets leverage if for example it is an outlier but has some other outliers close to it.

In R, the leverage is measured by the hat values:

> x<-model.matrix(model)

> lev<-hat(x)

> places<-row.names(gala)

> names(lev)<-places

> lev[lev > (2*sum(y))/30]
Fernandina Isabela SantaCruz

0.7547450 0.9768328 0.3193987

 

 B. Graphical Detection of Influential Cases

B.1  From Cook’s distance

>cook<-cooks.distance(model)

> plot(cook,ylab=”Cooks distance”)

> identify(1:30,cook,places)

 fig3

 

B.2. From Leverage

>plot(lev,ylab=”Leverages”,main=”Index plot of leverages”)

> sum(lev)

[1] 4

> abline(h=8/30)  (NOTE: This line is at 2*sum(lev)/ N.)

> identify(1:30,lev,places)

[1] 12 16 25

 

fig4

B.3.From the function  plot(model)

 fig5