Outlier Detection
The purpose of this post is to explore the outlier detection aspect of a linear model. To be honest with myself, in my 33 years of life I have never actually simulated a set of data and used the influence measures needed to understand the outliers in it.
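The raw data lives in a file called test.csv, which is not reproduced here. For anyone who wants to follow along, here is a minimal sketch of how a comparable dataset could be simulated; the coefficients, noise level, and X range below are assumptions chosen only to roughly match the fit reported later (the sample size of 27 matches the 25 residual degrees of freedom in the original summary).
> # Hypothetical sketch, not the original data: a noisy straight line
> set.seed(1)
> z0 <- data.frame(X = runif(27, 1, 15))
> z0$Y <- 0.7 + 0.8 * z0$X + rnorm(27, sd = 1)
> write.csv(z0, "test.csv", row.names = FALSE)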
Here is a plot of the Y and X variables representing the raw data:
> z1 <- read.csv("test.csv", header = T)
> plot(z1$X, z1$Y, pch = 19, col = "blue", xlab = "X", ylab = "Y",
+     xlim = c(0, 20), ylim = c(0, 15))
> points(7, 14.3, cex = 3, pch = 19, type = "p", col = "red")
> text(7, 14.3, "A", col = "white", cex = 1.1)
> points(17, 14.3, cex = 3, pch = 19, type = "p", col = "red")
> text(17, 14.3, "B", col = "white", cex = 1.1)
> points(17, 10, cex = 3, pch = 19, type = "p", col = "red")
> text(17, 10, "C", col = "white", cex = 1.1)
> abline(h = 1:15, col = "lightgrey")
> abline(v = 1:20, col = "lightgrey")
One can clearly see the candidate outliers, which have been marked as A, B, and C.
Questions of Interest
- What is the impact of A on the constant and slope parameters of the linear model?
- What is the impact of B on the constant and slope parameters of the linear model?
- What is the impact of C on the constant and slope parameters of the linear model?
- Are there ways to quantify the influence of these outliers?
- What are the various influence measures for these outliers?
Original Data
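The summary below comes from regressing Y on X in the raw data. The call is visible in the output; a minimal way to reproduce it would be the following, where the object name fit1 is just an assumed label:
> fit1 <- lm(z1$Y ~ z1$X)   # baseline fit on the raw data
> summary(fit1)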
Call:
lm(formula = z1$Y ~ z1$X)

Residuals:
    Min      1Q  Median      3Q     Max
-1.8770 -0.9261  0.1428  1.0190  1.7508

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.70157    0.53867   1.302    0.205
z1$X         0.80794    0.06278  12.870 1.58e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.081 on 25 degrees of freedom
Multiple R-squared: 0.8689,	Adjusted R-squared: 0.8636
F-statistic: 165.6 on 1 and 25 DF,  p-value: 1.578e-12
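The next three summaries each refit the model after appending one of the marked points to the raw data. The point coordinates are the ones plotted above, and the dataset names z2, z3, z4 match the Call lines below; the construction sketched here mirrors the rbind() idiom used later for z5, while the fit object names are assumptions:
> z2 <- rbind(z1, c(7, 14.3))    # raw data plus point A
> z3 <- rbind(z1, c(17, 14.3))   # raw data plus point B
> z4 <- rbind(z1, c(17, 10))     # raw data plus point C
> fit2 <- lm(z2$Y ~ z2$X)
> fit3 <- lm(z3$Y ~ z3$X)
> fit4 <- lm(z4$Y ~ z4$X)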
Original Data with the outlier A
Call:
lm(formula = z2$Y ~ z2$X)

Residuals:
    Min      1Q  Median      3Q     Max
-2.1226 -1.2429 -0.1929  0.7821  7.6384

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.1709     0.9196   1.273    0.214
z2$X          0.7844     0.1078   7.275 1.00e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.859 on 26 degrees of freedom
Multiple R-squared: 0.6706,	Adjusted R-squared: 0.6579
F-statistic: 52.92 on 1 and 26 DF,  p-value: 1.001e-07
Original Data with the outlier B
Call:
lm(formula = z3$Y ~ z3$X)

Residuals:
     Min       1Q   Median       3Q      Max
-1.86815 -0.88338  0.01995  1.02709  1.74852

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.72290    0.49302   1.466    0.155
z3$X         0.80476    0.05467  14.720    4e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.06 on 26 degrees of freedom
Multiple R-squared: 0.8929,	Adjusted R-squared: 0.8887
F-statistic: 216.7 on 1 and 26 DF,  p-value: 4.003e-14
Original Data with the outlier C
Call:
lm(formula = z4$Y ~ z4$X)

Residuals:
     Min       1Q   Median       3Q      Max
-3.37300 -0.88246 -0.02908  0.94129  1.94549

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.39441    0.60611   2.301   0.0297 *
z4$X         0.70462    0.06721  10.483 7.86e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.304 on 26 degrees of freedom
Multiple R-squared: 0.8087,	Adjusted R-squared: 0.8013
F-statistic: 109.9 on 1 and 26 DF,  p-value: 7.86e-11
Takeaway
- Point A affects the slope.
- Point B does not affect either the slope or the constant.
- Point C affects both the slope and the constant.
- When point B is added, the standard errors of the estimates actually go down.
It is interesting to compare B and C: both sit at the same X value of 17, but B lies almost exactly on the fitted line while C lies well below it.
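One quick way to see these shifts side by side is to tabulate the coefficients of the four fits. This assumes the hypothetical fit objects fit1 to fit4 sketched above:
> # Assumes fit1..fit4 from the earlier sketches
> fits <- list(original = fit1, with_A = fit2, with_B = fit3, with_C = fit4)
> coefs <- sapply(fits, coef)          # 2 x 4 matrix: intercept and slope per fit
> rownames(coefs) <- c("Intercept", "Slope")
> round(coefs, 3)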
Now let’s look at the Cook’s distance and hat values for each of the points A, B, and C.
> library(car)  # older versions of car provide cookd(); base R's cooks.distance() is the equivalent
> z5 <- read.csv("test.csv", header = T)
> z5 <- rbind(z5, c(7, 14.3))    # point A
> z5 <- rbind(z5, c(17, 14.3))   # point B
> z5 <- rbind(z5, c(17, 10))     # point C
> fit5 <- lm(z5$Y ~ z5$X)
> n <- dim(z5)[1]
> p <- 2
> influence <- t(rbind(tail(cookd(fit5), 3), tail(hatvalues(fit5), 3)))
> colnames(influence) <- c("Cook's D", "HatValue")
> rownames(influence) <- c("A", "B", "C")
> par(mfrow = c(1, 2))
> plot(cookd(fit5), pch = 19, col = "blue", cex = 1.3, ylab = "Cook's Distance",
+     xlab = "")
> abline(h = c(2, 3) * p/n, col = "grey")
> plot(hatvalues(fit5), pch = 19, col = "blue", cex = 1.3, ylab = "Hat Values",
+     xlab = "")
> abline(h = 4/(n - p), col = "grey")
> print(influence)
    Cook's D   HatValue
A 0.32082994 0.03823869
B 0.01953717 0.19334699
C 0.50738037 0.19334699
The above example clearly shows that B and C have the greater hat values, and between them C has the much larger Cook’s distance; hence C is the outlier that affects the fit the most.
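For reference, both measures can be reproduced by hand from fit5 using base R (cooks.distance() is the base-R counterpart of cookd()): the hat values are the diagonal of the hat matrix, and Cook’s distance combines the squared standardized residual with the leverage. This is a sketch, not output from the original session:
> # Hat values = diagonal of the hat matrix H = X (X'X)^{-1} X'
> X <- model.matrix(fit5)
> h <- unname(diag(X %*% solve(t(X) %*% X) %*% t(X)))
> all.equal(h, unname(hatvalues(fit5)))
> # Cook's D = (standardized residual)^2 / p * h / (1 - h), with p = 2 here
> r <- unname(rstandard(fit5))
> all.equal(r^2 / p * h / (1 - h), unname(cooks.distance(fit5)))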
Reference: Cook & Weisberg, Residuals and Influence in Regression