OUTLIER DETECTION

In this discussion, we are using the dataset gala in package faraway.

> gala[1:3,]

Species Endemics Area Elevation Nearest Scruz Adjacent

Baltra 58 23 25.09 346 0.6 0.6 1.84

Bartolome 31 21 1.24 109 0.6 26.3 572.33

Caldwell 3 3 0.21 114 2.8 58.7 0.78

and the model we will use is:

> model<-lm(Species~Endemics+Area+Elevation,data=gala)

A. Graphical Approaches for Outlier Detection

A. 1. From the graph of standardized residuals.

Cases of interest are those where standardized residual > 2. Cases with values > 3 are very likely outliers.

>places<-row.names(gala)

> plot(rstandard(model))

> identify(1:30,rstandard(model),places)

NOTE: The identify function enables the interactive labeling of index graphs in R.

A.2.From the graph of studentized (OR JACKNIFE RESIDUALS)

The Studentized residuals: These are fitted ignoring the current data point. (They are sometimes called jackknifed residuals).

>library(MASS)

>plot(rstudent(model),main=”Jackknifed residuals”)

> identify(1:30,rstudent(model),places)

B. NON-GRAPHICAL APPROACHES.

B.1 Looking at the magnitudes of studentized residuals.

>jack<-rstudent(model)

> absjack<-abs(jack)

> absjack[absjack >2]

Isabela SanSalvador SantaCruz

3.060831 3.287014 4.092622

Now let’s do a Bonferonni test.

>library(car)

> outlierTest(model)

rstudent unadjusted p-value Bonferonni p

SantaCruz 4.092622 0.00039043 0.011713

So, according to the Bonferroni test, only Santacruz is an outlier.

B,2 Looking at the magnitude of Standard residuals

> standard<-abs(rstandard(model))

> standard[standard > 2]

Isabela SanSalvador SantaCruz

2.662224 2.801042 3.229701

NOTE THE SMALLER MAGNITUDES

INFLUENTIAL CASES

An influential case is one that, if removed, causes a large change in the fitted equation. It is not necessary for an influential case to be an outlier, but it could be.

A. Non-Graphical Approaches for Detection of Influential Cases

A.1. Influential Cases From lm.influence()

> modi<-influence.measures(model)

> summary(modi)

Potentially influential observations of

lm(formula = Species ~ Endemics + Area + Elevation, data = gala) :

dfb.1_ dfb.Endm dfb.Area dfb.Elvt dffit cov.r cook.d hat

Fernandina -0.28 -0.87 -0.70 1.40_* 1.44_* 4.29_* 0.52 0.75_*

Isabela -1.69_* 2.07_* -15.46_* 1.62_* -19.88_* 14.14_* 74.71_* 0.98_*

SanSalvador 0.65 -1.16_* 0.69 -0.08 -1.73_* 0.35_* 0.54 0.22

SantaCruz -0.85 2.41_* -0.32 -0.86 2.80_* 0.22_* 1.22_* 0.32

SantaMaria -0.05 0.16 -0.08 -0.03 0.19 1.46_* 0.01 0.22

A. 2. Influential Cases From Cook’s distance

Cook’s distance is the best way to identify influential cases. Cook’s distance: This distance measures the influence of a single case on the entirety of the fitted model. If we have cases with distances > 1, this could be an indicator of an unstable model.

> cook<-cooks.distance(model)

> cook[cook > 1]

Isabela SantaCruz
74.709410 1.223785

A. 3. Influential Cases From The Computation of Leverage

Leverage: it varies between 0 (no influence) and 1 (complete influence). We first compute the average leverage as follows:

av. leverage=(number of predictors + 1 /divided by sample size)

Then we look for cases with leverage values > 2 or 3 X av. Leverage. NOTE: A case gets leverage if for example it is an outlier but has some other outliers close to it.

In R, the leverage is measured by the hat values:

> x<-model.matrix(model)

> lev<-hat(x)

> places<-row.names(gala)

> names(lev)<-places

> lev[lev > (2*sum(y))/30]
Fernandina Isabela SantaCruz

0.7547450 0.9768328 0.3193987

B. Graphical Detection of Influential Cases

B.1 From Cook’s distance

>cook<-cooks.distance(model)

> plot(cook,ylab=”Cooks distance”)

> identify(1:30,cook,places)

B.2. From Leverage

>plot(lev,ylab=”Leverages”,main=”Index plot of leverages”)

> sum(lev)

[1] 4

> abline(h=8/30) (NOTE: This line is at 2*sum(lev)/ N.)

> identify(1:30,lev,places)

[1] 12 16 25

Category: Uncategorized

A Guide to 21 Feature Importance Methods and Packages in Machine Learning (with Code)

Python Factories for Scalable, Reusable, and Elegant Code

Discovering the Treasures of 22 R Exploratory Analysis Packages

Python And R for Data Wrangling: Compare Pandas and Tidyverse Code Side-By-Side, and Learn Speed-Up Tips.

A concise summary of basic plotting in R (with code)

Graphical and Non-graphical Detection of Outliers and Influential Cases in Linear Regression Using R