Chap 6 - Playing with Outliers
Purpose: Examine the PFC - RECLTD pair and look for possible outliers in the data, based on the regression relationship between the two stocks.
> library(RSQLite)
> temp <- hdata[, c("RECLTD", "PFC")]
> dates <- hdata[, 1]
> rownames(temp) <- dates
> fit <- lm(RECLTD ~ PFC + 0, data = temp)
> summary(fit)

Call:
lm(formula = RECLTD ~ PFC + 0, data = temp)

Residuals:
     Min       1Q   Median       3Q      Max
-33.8711  -9.3639   0.5983   8.4083  25.7757

Coefficients:
    Estimate Std. Error t value Pr(>|t|)
PFC 0.922983   0.003482   265.1   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.36 on 211 degrees of freedom
Multiple R-squared:  0.997,  Adjusted R-squared:  0.997
F-statistic: 7.026e+04 on 1 and 211 DF,  p-value: < 2.2e-16

> plot(resid(fit), col = "blue", type = "l")
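The `+ 0` in the formula drops the intercept, so the single fitted coefficient is the hedge ratio and the residuals are the spread of the pair. As a rough sketch of what `lm` computes here, the snippet below reproduces the coefficient by hand as sum(x*y)/sum(x^2). The `sim` data frame is simulated and purely hypothetical; it only stands in for the real `hdata` prices, which are not reproduced here.

```r
# Simulated stand-in for the price data; all values are hypothetical.
set.seed(1)
n <- 212
sim <- data.frame(PFC = runif(n, 150, 250))
sim$RECLTD <- 0.92 * sim$PFC + rnorm(n, sd = 12)

# No-intercept regression: the slope is the hedge ratio.
fit0 <- lm(RECLTD ~ PFC + 0, data = sim)
hedge_lm <- unname(coef(fit0)["PFC"])

# Closed form for a single regressor without intercept.
hedge_manual <- sum(sim$PFC * sim$RECLTD) / sum(sim$PFC^2)

# The spread of the pair is the regression residual.
spread <- sim$RECLTD - hedge_lm * sim$PFC
```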
Now look at some diagnostics.
> library(car)
Hat Values
> plot(hatvalues(fit), pch = 19, col = "blue", ylim = c(0, 0.1))
> abline(h = c(2, 3) * 2/212, lty = 2)
As one can clearly see, none of the hat values exceeds the dashed cutoffs at 2h or 3h (h being the average hat value, taken as 2/212 in the plot), so one can assume that no observation has a dangerous hat value that unduly influences the hedge ratio of the pair.
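To make the cutoff concrete: hat values sum to the number of fitted coefficients p, so their average is p/n. The sketch below checks this on simulated prices (the `sim` data frame is hypothetical, not the real PFC/RECLTD series). Note that for a no-intercept fit with one regressor p = 1, so the average is 1/n; the 2/212 used in the plot above matches a fit with an intercept and is therefore the more lenient of the two cutoffs.

```r
# Simulated stand-in for the price data; all values are hypothetical.
set.seed(1)
n <- 212
sim <- data.frame(PFC = runif(n, 150, 250))
sim$RECLTD <- 0.92 * sim$PFC + rnorm(n, sd = 12)

fit0 <- lm(RECLTD ~ PFC + 0, data = sim)  # no intercept: p = 1
h <- hatvalues(fit0)

# With a single regressor and no intercept, h_i = x_i^2 / sum(x^2).
h_manual <- sim$PFC^2 / sum(sim$PFC^2)

p <- length(coef(fit0))
h_bar <- p / n                   # average hat value
flagged <- which(h > 2 * h_bar)  # observations above twice the average
```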
dfbetas - Function that computes the standardized change in each coefficient (here, the hedge ratio) when one of the observations is removed.
> fit <- lm(RECLTD ~ PFC, data = temp)
> dfbs.fit <- dfbetas(fit)
> plot(dfbs.fit, pch = 19, col = "blue")
The problem with dfbetas is too much output: there is one value for each coefficient for each observation. Cook's distance condenses this into a single influence measure per observation.
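One way to cope with the volume, sketched here on simulated data, is to reduce the dfbetas matrix to the largest absolute value per observation and compare it with a cutoff. The 2/sqrt(n) cutoff is a common rule of thumb and is an assumption here, not something taken from the analysis above; the `sim` data and the planted outlier are hypothetical.

```r
# Simulated stand-in for the price data; all values are hypothetical.
set.seed(1)
n <- 212
sim <- data.frame(PFC = runif(n, 150, 250))
sim$RECLTD <- 0.92 * sim$PFC + rnorm(n, sd = 12)

# Plant one outlier at the highest-leverage point (illustration only).
i_out <- which.max(sim$PFC)
sim$RECLTD[i_out] <- sim$RECLTD[i_out] + 150

fit1 <- lm(RECLTD ~ PFC, data = sim)
dfbs <- dfbetas(fit1)              # n x 2 matrix: (Intercept), PFC

worst <- apply(abs(dfbs), 1, max)  # largest |dfbetas| per observation
cutoff <- 2 / sqrt(n)              # rule-of-thumb cutoff (assumption)
suspects <- which(worst > cutoff)
```

`suspects` then holds the row indices worth a closer look, instead of two numbers for every one of the 212 observations.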
> fit <- lm(RECLTD ~ PFC + 0, data = temp)
> plot(cooks.distance(fit), pch = 19, col = "blue")
> abline(h = 4/(212 - 1 - 1), lty = 2)
- About 7 observations have a Cook's distance above the heuristic cutoff of 4/(n - k - 1), drawn as the dashed line.
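Rather than counting points off the plot, the observations above the cutoff can be listed directly. The following is a minimal sketch on simulated data with one planted outlier (the `sim` data frame and the outlier are hypothetical); with the real data one would use `fit` and `temp` in place of `fit0` and `sim`.

```r
# Simulated stand-in for the price data; all values are hypothetical.
set.seed(1)
n <- 212
sim <- data.frame(PFC = runif(n, 150, 250))
sim$RECLTD <- 0.92 * sim$PFC + rnorm(n, sd = 12)

# Plant one outlier at the highest-leverage point (illustration only).
i_out <- which.max(sim$PFC)
sim$RECLTD[i_out] <- sim$RECLTD[i_out] + 150

fit0 <- lm(RECLTD ~ PFC + 0, data = sim)
cd <- cooks.distance(fit0)

k <- 1                      # number of predictors
cutoff <- 4 / (n - k - 1)   # same heuristic as the dashed line
influential <- which(cd > cutoff)

# Rows to inspect, ordered by decreasing Cook's distance.
sim[influential[order(-cd[influential])], ]
```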