OUTLIER DETECTION
In this discussion, we are using the dataset gala in package faraway.
> gala[1:3,]
Species Endemics Area Elevation Nearest Scruz Adjacent
Baltra 58 23 25.09 346 0.6 0.6 1.84
Bartolome 31 21 1.24 109 0.6 26.3 572.33
Caldwell 3 3 0.21 114 2.8 58.7 0.78
and the model we will use is:
> model<-lm(Species~Endemics+Area+Elevation,data=gala)
A. Graphical Approaches for Outlier Detection
A. 1. From the graph of standardized residuals.
Cases of interest are those where standardized residual > 2. Cases with values > 3 are very likely outliers.
>places<-row.names(gala)
> plot(rstandard(model))
> identify(1:30,rstandard(model),places)
NOTE: The identify function enables the interactive labeling of index graphs in R.
A.2.From the graph of studentized (OR JACKNIFE RESIDUALS)
The Studentized residuals: These are fitted ignoring the current data point. (They are sometimes called jackknifed residuals).
>library(MASS)
>plot(rstudent(model),main=”Jackknifed residuals”)
> identify(1:30,rstudent(model),places)
B. NON-GRAPHICAL APPROACHES.
B.1 Looking at the magnitudes of studentized residuals.
>jack<-rstudent(model)
> absjack<-abs(jack)
> absjack[absjack >2]
Isabela SanSalvador SantaCruz
3.060831 3.287014 4.092622
Now let’s do a Bonferonni test.
>library(car)
> outlierTest(model)
rstudent unadjusted p-value Bonferonni p
SantaCruz 4.092622 0.00039043 0.011713
So, according to the Bonferroni test, only Santacruz is an outlier.
B,2 Looking at the magnitude of Standard residuals
> standard<-abs(rstandard(model))
> standard[standard > 2]
Isabela SanSalvador SantaCruz
2.662224 2.801042 3.229701
NOTE THE SMALLER MAGNITUDES
INFLUENTIAL CASES
An influential case is one that, if removed, causes a large change in the fitted equation. It is not necessary for an influential case to be an outlier, but it could be.
A. Non-Graphical Approaches for Detection of Influential Cases
A.1. Influential Cases From lm.influence()
> modi<-influence.measures(model)
> summary(modi)
Potentially influential observations of
lm(formula = Species ~ Endemics + Area + Elevation, data = gala) :
dfb.1_ dfb.Endm dfb.Area dfb.Elvt dffit cov.r cook.d hat
Fernandina -0.28 -0.87 -0.70 1.40_* 1.44_* 4.29_* 0.52 0.75_*
Isabela -1.69_* 2.07_* -15.46_* 1.62_* -19.88_* 14.14_* 74.71_* 0.98_*
SanSalvador 0.65 -1.16_* 0.69 -0.08 -1.73_* 0.35_* 0.54 0.22
SantaCruz -0.85 2.41_* -0.32 -0.86 2.80_* 0.22_* 1.22_* 0.32
SantaMaria -0.05 0.16 -0.08 -0.03 0.19 1.46_* 0.01 0.22
A. 2. Influential Cases From Cook’s distance
Cook’s distance is the best way to identify influential cases. Cook’s distance: This distance measures the influence of a single case on the entirety of the fitted model. If we have cases with distances > 1, this could be an indicator of an unstable model.
> cook<-cooks.distance(model)
> cook[cook > 1]
Isabela SantaCruz
74.709410 1.223785
A. 3. Influential Cases From The Computation of Leverage
Leverage: it varies between 0 (no influence) and 1 (complete influence). We first compute the average leverage as follows:
av. leverage=(number of predictors + 1 /divided by sample size)
Then we look for cases with leverage values > 2 or 3 X av. Leverage. NOTE: A case gets leverage if for example it is an outlier but has some other outliers close to it.
In R, the leverage is measured by the hat values:
> x<-model.matrix(model)
> lev<-hat(x)
> places<-row.names(gala)
> names(lev)<-places
> lev[lev > (2*sum(y))/30]
Fernandina Isabela SantaCruz
0.7547450 0.9768328 0.3193987
B. Graphical Detection of Influential Cases
B.1 From Cook’s distance
>cook<-cooks.distance(model)
> plot(cook,ylab=”Cooks distance”)
> identify(1:30,cook,places)
B.2. From Leverage
>plot(lev,ylab=”Leverages”,main=”Index plot of leverages”)
> sum(lev)
[1] 4
> abline(h=8/30) (NOTE: This line is at 2*sum(lev)/ N.)
> identify(1:30,lev,places)
[1] 12 16 25





[As I also noted in the LinkedIn thread:]
Note that really reliable outlier detection is known to be impossible with the diagnostics mentioned. Rather you need to use so called robust methods such as cov.robust() and rlm() in package MASS (recommended standard R package),
more extensively and modern in the CRAN R packages robustbase robustbase, rrcov, FRB, mvoutlier, etc.
In particular, the “really robust” methods are so called “high
breakdown”, i.e., do not break down in face of a considerable percentage of outliers, often up to 50%.
On the other hand, the lm() based diagnostics that you mention can fail in cases with only two outliers (so called “masking effect”).
See the CRAN Task
View (“ctv”) on robustness and the references mentioned in those R
packages.
Note also that there is dedicated (very low volume)
mailing list
about Robust Statistics and R where issues like this can be asked
about or discussed.
LikeLike
Thanks for the really useful comment.
LikeLike