The purpose of this post is to explore the outlier detection side of a linear model. To be honest with myself, in my 33 years I have never actually simulated a data set and used the influence measures needed to understand its outliers.

Here is a plot of the Y and X variables in the raw data:

> z1 <- read.csv("test.csv", header = T)
> plot(z1$X, z1$Y, pch = 19, col = "blue", xlab = "X", ylab = "Y",
+     xlim = c(0, 20), ylim = c(0, 15))
> points(7, 14.3, cex = 3, pch = 19, type = "p", col = "red")
> text(7, 14.3, "A", col = "white", cex = 1.1)
> points(17, 14.3, cex = 3, pch = 19, type = "p", col = "red")
> text(17, 14.3, "B", col = "white", cex = 1.1)
> points(17, 10, cex = 3, pch = 19, type = "p", col = "red")
> text(17, 10, "C", col = "white", cex = 1.1)
> abline(h = 1:15, col = "lightgrey")
> abline(v = 1:20, col = "lightgrey")

[OutlierDetection-001.jpg: scatter plot of Y versus X with the candidate outliers A, B and C highlighted in red]

The three candidate outliers are marked A, B and C.

Questions of Interest

  • What is the impact of A on the constant and slope parameters of the linear model?
  • What is the impact of B on the constant and slope parameters of the linear model?
  • What is the impact of C on the constant and slope parameters of the linear model?
  • Are there ways to quantify the influence of these outliers?
  • What are the various influence measures of the outliers?
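The four model summaries below come from the original data (z1) and from the same data with A, B or C appended (z2, z3 and z4). The construction code is a minimal sketch, assuming test.csv has the two columns X and Y; it mirrors the rbind pattern used later for z5:

> z1 <- read.csv("test.csv", header = T)   # original data
> z2 <- rbind(z1, c(7, 14.3))              # plus point A
> z3 <- rbind(z1, c(17, 14.3))             # plus point B
> z4 <- rbind(z1, c(17, 10))               # plus point C
> summary(lm(z1$Y ~ z1$X))                 # the four summaries reproduced below
> summary(lm(z2$Y ~ z2$X))
> summary(lm(z3$Y ~ z3$X))
> summary(lm(z4$Y ~ z4$X))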

Original Data

Call:
lm(formula = z1$Y ~ z1$X)
Residuals:
    Min      1Q  Median      3Q     Max 
-1.8770 -0.9261  0.1428  1.0190  1.7508 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.70157    0.53867   1.302    0.205    
z1$X         0.80794    0.06278  12.870 1.58e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.081 on 25 degrees of freedom
Multiple R-squared: 0.8689,     Adjusted R-squared: 0.8636
F-statistic: 165.6 on 1 and 25 DF,  p-value: 1.578e-12

Original Data with the outlier A

Call:
lm(formula = z2$Y ~ z2$X)
Residuals:
    Min      1Q  Median      3Q     Max 
-2.1226 -1.2429 -0.1929  0.7821  7.6384 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.1709     0.9196   1.273    0.214    
z2$X          0.7844     0.1078   7.275 1.00e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.859 on 26 degrees of freedom
Multiple R-squared: 0.6706,     Adjusted R-squared: 0.6579
F-statistic: 52.92 on 1 and 26 DF,  p-value: 1.001e-07

Original Data with the outlier B

Call:
lm(formula = z3$Y ~ z3$X)
Residuals:
     Min       1Q   Median       3Q      Max 
-1.86815 -0.88338  0.01995  1.02709  1.74852 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.72290    0.49302   1.466    0.155    
z3$X         0.80476    0.05467  14.720    4e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.06 on 26 degrees of freedom
Multiple R-squared: 0.8929,     Adjusted R-squared: 0.8887
F-statistic: 216.7 on 1 and 26 DF,  p-value: 4.003e-14

Original Data with the outlier C

Call:
lm(formula = z4$Y ~ z4$X)
Residuals:
     Min       1Q   Median       3Q      Max 
-3.37300 -0.88246 -0.02908  0.94129  1.94549 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.39441    0.60611   2.301   0.0297 *  
z4$X         0.70462    0.06721  10.483 7.86e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.304 on 26 degrees of freedom
Multiple R-squared: 0.8087,     Adjusted R-squared: 0.8013
F-statistic: 109.9 on 1 and 26 DF,  p-value: 7.86e-11

Takeaway

  • Point A shifts the constant (0.70 → 1.17) more than the slope (0.808 → 0.784), and it inflates the residual standard error (1.081 → 1.859).
  • Point B affects neither the slope nor the constant.
  • Point C affects both the slope (0.808 → 0.705) and the constant (0.70 → 1.39).
  • When point B is added, the residual standard error actually goes down (1.081 → 1.06).

It is interesting to compare B and C: both sit at X = 17, far from the bulk of the data, but B falls almost exactly on the fitted line while C does not.
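One way to put numbers on this comparison is to line up the coefficients and residual standard errors of the four fits side by side. A minimal sketch, assuming fits named fit1 to fit4 on z1 to z4 (the names are illustrative, the summaries above are the source of truth):

> fit1 <- lm(Y ~ X, data = z1)   # original data
> fit2 <- lm(Y ~ X, data = z2)   # with A
> fit3 <- lm(Y ~ X, data = z3)   # with B
> fit4 <- lm(Y ~ X, data = z4)   # with C
> # intercept, slope and residual standard error of each fit, one column per fit
> sapply(list(original = fit1, A = fit2, B = fit3, C = fit4),
+        function(f) c(coef(f), sigma = summary(f)$sigma))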

Now let’s look at the Cook’s distance and hat values for each of the points A, B and C.
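As a quick refresher, the hat value of an observation is its leverage (the corresponding diagonal element of the hat matrix), and Cook’s distance combines that leverage with the size of the studentised residual. A minimal sketch, assuming a fitted lm object called fit (the name is illustrative), shows how the two measures can be computed by hand:

> X <- model.matrix(fit)                           # design matrix: intercept and X columns
> h <- diag(X %*% solve(crossprod(X)) %*% t(X))    # hat values = leverage of each observation
> r <- rstandard(fit)                              # internally studentised residuals
> p <- length(coef(fit))                           # number of coefficients (2 here)
> D <- r^2 * h / (p * (1 - h))                     # Cook's distance
> all.equal(unname(h), unname(hatvalues(fit)))     # both comparisons should be TRUE
> all.equal(unname(D), unname(cooks.distance(fit)))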

> z5 <- read.csv("test.csv", header = T)
> z5 <- rbind(z5, c(7, 14.3))    # add point A
> z5 <- rbind(z5, c(17, 14.3))   # add point B
> z5 <- rbind(z5, c(17, 10))     # add point C
> fit5 <- lm(z5$Y ~ z5$X)
> n <- dim(z5)[1]   # number of observations (30)
> p <- 2            # number of model parameters (intercept and slope)
> influence <- t(rbind(tail(cooks.distance(fit5), 3), tail(hatvalues(fit5),
+     3)))
> colnames(influence) <- c("Cook's D", "HatValue")
> rownames(influence) <- c("A", "B", "C")
> par(mfrow = c(1, 2))
> plot(cooks.distance(fit5), pch = 19, col = "blue", cex = 1.3, ylab = "Cook's Distance",
+     xlab = "")
> abline(h = 4/(n - p), col = "grey")
> plot(hatvalues(fit5), pch = 19, col = "blue", cex = 1.3, ylab = "Hat Values",
+     xlab = "")
> abline(h = c(2, 3) * p/n, col = "grey")
> print(influence)
    Cook's D   HatValue
A 0.32082994 0.03823869
B 0.01953717 0.19334699
C 0.50738037 0.19334699
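
To read this table against common rules of thumb (conventions rather than hard cutoffs): hat values above roughly 2p/n or 3p/n indicate high leverage, and Cook’s distances above roughly 4/(n - p) indicate influential points. With n = 30 and p = 2:

> c("2p/n" = 2 * p/n, "3p/n" = 3 * p/n, "4/(n-p)" = 4/(n - p))
     2p/n      3p/n   4/(n-p) 
0.1333333 0.2000000 0.1428571 

By these cutoffs, B and C stand out on leverage, while A and C exceed the Cook’s distance cutoff; only C is flagged by both measures.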

[OutlierDetection-007.jpg: Cook's distance (left) and hat values (right) for all 30 points, with the cutoff reference lines]

The above example clearly shows that B and C have the largest hat values, and of the two, C has by far the larger Cook’s distance. Hence the outlier that affects the fit the most is C.
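Base R also bundles several influence diagnostics into a single call; a minimal sketch on the same model fit5, using only functions from the stats package:

> infl <- influence.measures(fit5)   # DFBETAS, DFFITS, covariance ratios, Cook's D, hat values
> summary(infl)                      # prints only the cases flagged as potentially influential
> tail(dfbetas(fit5), 3)             # standardised change in each coefficient when A, B or C is dropped

A common rule of thumb treats |DFBETAS| above 2/sqrt(n) (about 0.37 here) as noteworthy.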

Reference: Cook, R. D. and Weisberg, S. (1982), Residuals and Influence in Regression, Chapman and Hall.