OUTLIER DETECTION

In this discussion, we are using the dataset gala in package faraway.

> gala[1:3,]

Species Endemics Area Elevation Nearest Scruz Adjacent

Baltra 58 23 25.09 346 0.6 0.6 1.84

Bartolome 31 21 1.24 109 0.6 26.3 572.33

Caldwell 3 3 0.21 114 2.8 58.7 0.78

and the model we will use is:

> model<-lm(Species~Endemics+Area+Elevation,data=gala)

A. Graphical Approaches for Outlier Detection

A. 1. From the graph of standardized residuals.

Cases of interest are those where standardized residual > 2. Cases with values > 3 are very likely outliers.

>places<-row.names(gala)

> plot(rstandard(model))

> identify(1:30,rstandard(model),places)

NOTE: The identify function enables the interactive labeling of index graphs in R.

A.2.From the graph of studentized (OR JACKNIFE RESIDUALS)

The Studentized residuals: These are fitted ignoring the current data point. (They are sometimes called jackknifed residuals).

>library(MASS)

>plot(rstudent(model),main=”Jackknifed residuals”)

> identify(1:30,rstudent(model),places)

B. NON-GRAPHICAL APPROACHES.

B.1 Looking at the magnitudes of studentized residuals.

>jack<-rstudent(model)

> absjack<-abs(jack)

> absjack[absjack >2]

Isabela SanSalvador SantaCruz

3.060831 3.287014 4.092622

Now let’s do a Bonferonni test.

>library(car)

> outlierTest(model)

rstudent unadjusted p-value Bonferonni p

SantaCruz 4.092622 0.00039043 0.011713

So, according to the Bonferroni test, only Santacruz is an outlier.

B,2 Looking at the magnitude of Standard residuals

> standard<-abs(rstandard(model))

> standard[standard > 2]

Isabela SanSalvador SantaCruz

2.662224 2.801042 3.229701

NOTE THE SMALLER MAGNITUDES

INFLUENTIAL CASES

An influential case is one that, if removed, causes a large change in the fitted equation. It is not necessary for an influential case to be an outlier, but it could be.

A. Non-Graphical Approaches for Detection of Influential Cases

A.1. Influential Cases From lm.influence()

> modi<-influence.measures(model)

> summary(modi)

Potentially influential observations of

lm(formula = Species ~ Endemics + Area + Elevation, data = gala) :

dfb.1_ dfb.Endm dfb.Area dfb.Elvt dffit cov.r cook.d hat

Fernandina -0.28 -0.87 -0.70 1.40_* 1.44_* 4.29_* 0.52 0.75_*

Isabela -1.69_* 2.07_* -15.46_* 1.62_* -19.88_* 14.14_* 74.71_* 0.98_*

SanSalvador 0.65 -1.16_* 0.69 -0.08 -1.73_* 0.35_* 0.54 0.22

SantaCruz -0.85 2.41_* -0.32 -0.86 2.80_* 0.22_* 1.22_* 0.32

SantaMaria -0.05 0.16 -0.08 -0.03 0.19 1.46_* 0.01 0.22

A. 2. Influential Cases From Cook’s distance

Cook’s distance is the best way to identify influential cases. Cook’s distance: This distance measures the influence of a single case on the entirety of the fitted model. If we have cases with distances > 1, this could be an indicator of an unstable model.

> cook<-cooks.distance(model)

> cook[cook > 1]

Isabela SantaCruz
74.709410 1.223785

A. 3. Influential Cases From The Computation of Leverage

Leverage: it varies between 0 (no influence) and 1 (complete influence). We first compute the average leverage as follows:

av. leverage=(number of predictors + 1 /divided by sample size)

Then we look for cases with leverage values > 2 or 3 X av. Leverage. NOTE: A case gets leverage if for example it is an outlier but has some other outliers close to it.

In R, the leverage is measured by the hat values:

> x<-model.matrix(model)

> lev<-hat(x)

> places<-row.names(gala)

> names(lev)<-places

> lev[lev > (2*sum(y))/30]
Fernandina Isabela SantaCruz

0.7547450 0.9768328 0.3193987

B. Graphical Detection of Influential Cases

B.1 From Cook’s distance

>cook<-cooks.distance(model)

> plot(cook,ylab=”Cooks distance”)

> identify(1:30,cook,places)

B.2. From Leverage

>plot(lev,ylab=”Leverages”,main=”Index plot of leverages”)

> sum(lev)

[1] 4

> abline(h=8/30) (NOTE: This line is at 2*sum(lev)/ N.)

> identify(1:30,lev,places)

[1] 12 16 25

B.3.From the function plot(model)

2 thoughts on “Graphical and Non-graphical Detection of Outliers and Influential Cases in Linear Regression Using R”

Martin Maechler says:

March 27, 2014 at 8:53 am

[As I also noted in the LinkedIn thread:]

Note that really reliable outlier detection is known to be impossible with the diagnostics mentioned. Rather you need to use so called robust methods such as cov.robust() and rlm() in package MASS (recommended standard R package),
more extensively and modern in the CRAN R packages robustbase robustbase, rrcov, FRB, mvoutlier, etc.

In particular, the “really robust” methods are so called “high
breakdown”, i.e., do not break down in face of a considerable percentage of outliers, often up to 50%.
On the other hand, the lm() based diagnostics that you mention can fail in cases with only two outliers (so called “masking effect”).
See the CRAN Task
View (“ctv”) on robustness and the references mentioned in those R
packages.

Note also that there is dedicated (very low volume)
mailing list
about Robust Statistics and R where issues like this can be asked
about or discussed.

LikeLike

1. admin says:
  
  March 27, 2014 at 10:27 am
  
  Thanks for the really useful comment.
  
  LikeLike

Graphical and Non-graphical Detection of Outliers and Influential Cases in Linear Regression Using R

OUTLIER DETECTION

A. Graphical Approaches for Outlier Detection