Graphical and Non-graphical Detection of Outliers and Influential Cases in Linear Regression Using R

 OUTLIER DETECTION

In this discussion, we are using the dataset gala in package faraway.

> gala[1:3,]

     Species Endemics  Area Elevation Nearest Scruz Adjacent

Baltra         58       23 25.09       346     0.6   0.6     1.84

Bartolome      31       21  1.24       109     0.6  26.3   572.33

Caldwell        3        3  0.21       114     2.8  58.7     0.78

and the model we will use is:

> model<-lm(Species~Endemics+Area+Elevation,data=gala)

A. Graphical Approaches for Outlier Detection         

 A. 1. From the graph of standardized residuals.

Cases of interest are those where standardized residual > 2. Cases with values > 3 are very likely outliers.

 >places<-row.names(gala)

> plot(rstandard(model))

> identify(1:30,rstandard(model),places)

NOTE: The identify function enables the interactive labeling of index graphs in R.

fg1

 A.2.From the graph of studentized (OR JACKNIFE RESIDUALS)

The Studentized residuals: These are fitted ignoring the current data point. (They are sometimes called jackknifed residuals).

>library(MASS)

>plot(rstudent(model),main=”Jackknifed residuals”)

> identify(1:30,rstudent(model),places)

 

fig2

 

B. NON-GRAPHICAL APPROACHES.

B.1 Looking at the magnitudes of  studentized residuals.

>jack<-rstudent(model)

> absjack<-abs(jack)

> absjack[absjack >2]

Isabela SanSalvador SantaCruz

3.060831 3.287014 4.092622

 

 Now let’s do a Bonferonni test.

>library(car)

> outlierTest(model)

rstudent unadjusted p-value Bonferonni p

SantaCruz 4.092622         0.00039043     0.011713

So, according to the Bonferroni test, only Santacruz is an outlier.

B,2 Looking at the magnitude of Standard residuals

> standard<-abs(rstandard(model))

> standard[standard > 2]

Isabela SanSalvador   SantaCruz

2.662224    2.801042    3.229701

NOTE THE SMALLER MAGNITUDES

INFLUENTIAL CASES

An influential case is one that, if removed, causes a large change in the fitted equation. It is not necessary for an influential case to be an outlier, but it could be.

A. Non-Graphical Approaches for Detection of Influential Cases

A.1. Influential Cases From lm.influence()

> modi<-influence.measures(model)

> summary(modi)

Potentially influential observations of

lm(formula = Species ~ Endemics + Area + Elevation, data = gala) :

dfb.1_   dfb.Endm dfb.Area dfb.Elvt dffit    cov.r    cook.d   hat

Fernandina   -0.28    -0.87    -0.70     1.40_*   1.44_*   4.29_*   0.52     0.75_*

Isabela      -1.69_*   2.07_* -15.46_*   1.62_* -19.88_*  14.14_*  74.71_*   0.98_*

SanSalvador   0.65    -1.16_*   0.69    -0.08    -1.73_*   0.35_*   0.54     0.22

SantaCruz    -0.85     2.41_*  -0.32    -0.86     2.80_*   0.22_*   1.22_*   0.32

SantaMaria   -0.05     0.16    -0.08    -0.03     0.19     1.46_*   0.01     0.22

 A. 2. Influential Cases From Cook’s distance

Cook’s distance is the best way to identify influential cases. Cook’s distance: This distance measures the influence of a single case on the entirety of the fitted model. If we have cases with distances > 1, this could be an indicator of an unstable model.

> cook<-cooks.distance(model)

> cook[cook > 1]

Isabela SantaCruz
74.709410 1.223785

A. 3. Influential Cases From The Computation of Leverage

Leverage:  it varies between 0 (no influence) and 1 (complete influence). We first compute the average leverage as follows:

av. leverage=(number of predictors + 1 /divided by sample size)

Then we look for cases with leverage values > 2 or 3 X av. Leverage.  NOTE: A case gets leverage if for example it is an outlier but has some other outliers close to it.

In R, the leverage is measured by the hat values:

> x<-model.matrix(model)

> lev<-hat(x)

> places<-row.names(gala)

> names(lev)<-places

> lev[lev > (2*sum(y))/30]
Fernandina Isabela SantaCruz

0.7547450 0.9768328 0.3193987

 

 B. Graphical Detection of Influential Cases

B.1  From Cook’s distance

>cook<-cooks.distance(model)

> plot(cook,ylab=”Cooks distance”)

> identify(1:30,cook,places)

 fig3

 

B.2. From Leverage

>plot(lev,ylab=”Leverages”,main=”Index plot of leverages”)

> sum(lev)

[1] 4

> abline(h=8/30)  (NOTE: This line is at 2*sum(lev)/ N.)

> identify(1:30,lev,places)

[1] 12 16 25

 

fig4

B.3.From the function  plot(model)

 fig5

Published by

Unknown's avatar

alitheia15

Data Mining-Analytics Software Consultant

2 thoughts on “Graphical and Non-graphical Detection of Outliers and Influential Cases in Linear Regression Using R”

  1. [As I also noted in the LinkedIn thread:]

    Note that really reliable outlier detection is known to be impossible with the diagnostics mentioned. Rather you need to use so called robust methods such as cov.robust() and rlm() in package MASS (recommended standard R package),
    more extensively and modern in the CRAN R packages robustbase robustbase, rrcov, FRB, mvoutlier, etc.

    In particular, the “really robust” methods are so called “high
    breakdown”
    , i.e., do not break down in face of a considerable percentage of outliers, often up to 50%.
    On the other hand, the lm() based diagnostics that you mention can fail in cases with only two outliers (so called “masking effect”).
    See the CRAN Task
    View (“ctv”) on robustness
    and the references mentioned in those R
    packages.

    Note also that there is dedicated (very low volume)
    mailing list
    about Robust Statistics and R
    where issues like this can be asked
    about or discussed.

    Like

Leave a comment