OUTLIER DETECTION
In this discussion, we are using the dataset gala in package faraway.
> gala[1:3,]
Species Endemics Area Elevation Nearest Scruz Adjacent
Baltra 58 23 25.09 346 0.6 0.6 1.84
Bartolome 31 21 1.24 109 0.6 26.3 572.33
Caldwell 3 3 0.21 114 2.8 58.7 0.78
and the model we will use is:
> model<-lm(Species~Endemics+Area+Elevation,data=gala)
A. Graphical Approaches for Outlier Detection
A. 1. From the graph of standardized residuals.
Cases of interest are those where standardized residual > 2. Cases with values > 3 are very likely outliers.
>places<-row.names(gala)
> plot(rstandard(model))
> identify(1:30,rstandard(model),places)
NOTE: The identify function enables the interactive labeling of index graphs in R.

A.2.From the graph of studentized (OR JACKNIFE RESIDUALS)
The Studentized residuals: These are fitted ignoring the current data point. (They are sometimes called jackknifed residuals).
>library(MASS)
>plot(rstudent(model),main=”Jackknifed residuals”)
> identify(1:30,rstudent(model),places)

B. NON-GRAPHICAL APPROACHES.
B.1 Looking at the magnitudes of studentized residuals.
>jack<-rstudent(model)
> absjack<-abs(jack)
> absjack[absjack >2]
Isabela SanSalvador SantaCruz
3.060831 3.287014 4.092622
Now let’s do a Bonferonni test.
>library(car)
> outlierTest(model)
rstudent unadjusted p-value Bonferonni p
SantaCruz 4.092622 0.00039043 0.011713
So, according to the Bonferroni test, only Santacruz is an outlier.
B,2 Looking at the magnitude of Standard residuals
> standard<-abs(rstandard(model))
> standard[standard > 2]
Isabela SanSalvador SantaCruz
2.662224 2.801042 3.229701
NOTE THE SMALLER MAGNITUDES
INFLUENTIAL CASES
An influential case is one that, if removed, causes a large change in the fitted equation. It is not necessary for an influential case to be an outlier, but it could be.
A. Non-Graphical Approaches for Detection of Influential Cases
A.1. Influential Cases From lm.influence()
> modi<-influence.measures(model)
> summary(modi)
Potentially influential observations of
lm(formula = Species ~ Endemics + Area + Elevation, data = gala) :
dfb.1_ dfb.Endm dfb.Area dfb.Elvt dffit cov.r cook.d hat
Fernandina -0.28 -0.87 -0.70 1.40_* 1.44_* 4.29_* 0.52 0.75_*
Isabela -1.69_* 2.07_* -15.46_* 1.62_* -19.88_* 14.14_* 74.71_* 0.98_*
SanSalvador 0.65 -1.16_* 0.69 -0.08 -1.73_* 0.35_* 0.54 0.22
SantaCruz -0.85 2.41_* -0.32 -0.86 2.80_* 0.22_* 1.22_* 0.32
SantaMaria -0.05 0.16 -0.08 -0.03 0.19 1.46_* 0.01 0.22
A. 2. Influential Cases From Cook’s distance
Cook’s distance is the best way to identify influential cases. Cook’s distance: This distance measures the influence of a single case on the entirety of the fitted model. If we have cases with distances > 1, this could be an indicator of an unstable model.
> cook<-cooks.distance(model)
> cook[cook > 1]
Isabela SantaCruz
74.709410 1.223785
A. 3. Influential Cases From The Computation of Leverage
Leverage: it varies between 0 (no influence) and 1 (complete influence). We first compute the average leverage as follows:
av. leverage=(number of predictors + 1 /divided by sample size)
Then we look for cases with leverage values > 2 or 3 X av. Leverage. NOTE: A case gets leverage if for example it is an outlier but has some other outliers close to it.
In R, the leverage is measured by the hat values:
> x<-model.matrix(model)
> lev<-hat(x)
> places<-row.names(gala)
> names(lev)<-places
> lev[lev > (2*sum(y))/30]
Fernandina Isabela SantaCruz
0.7547450 0.9768328 0.3193987
B. Graphical Detection of Influential Cases
B.1 From Cook’s distance
>cook<-cooks.distance(model)
> plot(cook,ylab=”Cooks distance”)
> identify(1:30,cook,places)

B.2. From Leverage
>plot(lev,ylab=”Leverages”,main=”Index plot of leverages”)
> sum(lev)
[1] 4
> abline(h=8/30) (NOTE: This line is at 2*sum(lev)/ N.)
> identify(1:30,lev,places)
[1] 12 16 25

B.3.From the function plot(model)
