Faraway-2-Intro
Purpose
To understand various aspects of extending a linear model.
```r
> library(faraway)
> library(car)
> data(gavote)
> head(gavote)
         equip   econ perAA rural    atlanta gore bush other votes ballots
APPLING  LEVER   poor 0.182 rural notAtlanta 2093 3940    66  6099    6617
ATKINSON LEVER   poor 0.230 rural notAtlanta  821 1228    22  2071    2149
BACON    LEVER   poor 0.131 rural notAtlanta  956 2010    29  2995    3347
BAKER    OS-CC   poor 0.476 rural notAtlanta  893  615    11  1519    1607
BALDWIN  LEVER middle 0.359 rural notAtlanta 5893 6041   192 12126   12785
BANKS    LEVER middle 0.024 rural notAtlanta 1220 3202   111  4533    4773
```
The variables in the dataset are:

- equip - type of voting equipment used
- econ - economic level of the county
- perAA - proportion of African Americans in the county
- rural - whether the county is rural or urban
- atlanta - whether the county is part of the Atlanta metropolitan area
- gore - votes for Gore
- bush - votes for Bush
- other - votes for other candidates
- votes - total votes counted
- ballots - ballots issued
```r
> summary(gavote)
   equip        econ        perAA          rural          atlanta   
 LEVER:74   middle:69   Min.   :0.0000   rural:117   Atlanta   : 15  
 OS-CC:44   poor  :72   1st Qu.:0.1115   urban: 42   notAtlanta:144  
 OS-PC:22   rich  :18   Median :0.2330                               
 PAPER: 2               Mean   :0.2430                               
 PUNCH:17               3rd Qu.:0.3480                               
                        Max.   :0.7650                               
      gore             bush            other            votes       
 Min.   :   249   Min.   :   271   Min.   :   5.0   Min.   :   832  
 1st Qu.:  1386   1st Qu.:  1804   1st Qu.:  30.0   1st Qu.:  3506  
 Median :  2326   Median :  3597   Median :  86.0   Median :  6299  
 Mean   :  7020   Mean   :  8929   Mean   : 381.7   Mean   : 16331  
 3rd Qu.:  4430   3rd Qu.:  7468   3rd Qu.: 210.0   3rd Qu.: 11846  
 Max.   :154509   Max.   :140494   Max.   :7920.0   Max.   :263211  
    ballots      
 Min.   :   881  
 1st Qu.:  3694  
 Median :  6712  
 Mean   : 16927  
 3rd Qu.: 12251  
 Max.   :280975  
> gavote$undercount <- (gavote$ballots - gavote$votes)/gavote$ballots
> boxplot(gavote$undercount)
```
One important lesson is that you should always look at the magnitude of the possible y values and transform them accordingly. In this case, the relative undercount proportion (ballots minus counted votes, divided by ballots) makes far more sense than the raw counts, since counties differ greatly in size.
```r
> hist(gavote$undercount, n.bins(gavote$undercount))
> plot(density(gavote$undercount))
> rug(gavote$undercount)
> pie(table(gavote$equip))
> barplot(sort(table(gavote$equip), decreasing = TRUE), las = 2)
> gavote$pergore <- gavote$gore/gavote$votes
> plot(pergore ~ perAA, gavote)
> plot(undercount ~ equip, gavote)
> xtabs(~ atlanta + rural, gavote)
            rural
atlanta      rural urban
  Atlanta        1    14
  notAtlanta   116    28
```
Basic Modeling
```r
> gavote$cpergore <- gavote$pergore - mean(gavote$pergore)
> gavote$cperAA <- gavote$perAA - mean(gavote$perAA)
> lmodi <- lm(undercount ~ cperAA + cpergore * rural + equip, gavote)
> summary(lmodi)

Call:
lm(formula = undercount ~ cperAA + cpergore * rural + equip, data = gavote)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.059530 -0.012904 -0.002180  0.009013  0.127496 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)          0.043297   0.002839  15.253  < 2e-16 ***
cperAA               0.028264   0.031092   0.909   0.3648    
cpergore             0.008237   0.051156   0.161   0.8723    
ruralurban          -0.018637   0.004648  -4.009 9.56e-05 ***
equipOS-CC           0.006482   0.004680   1.385   0.1681    
equipOS-PC           0.015640   0.005827   2.684   0.0081 ** 
equipPAPER          -0.009092   0.016926  -0.537   0.5920    
equipPUNCH           0.014150   0.006783   2.086   0.0387 *  
cpergore:ruralurban -0.008799   0.038716  -0.227   0.8205    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.02335 on 150 degrees of freedom
Multiple R-squared:  0.1696,	Adjusted R-squared:  0.1253 
F-statistic: 3.829 on 8 and 150 DF,  p-value: 0.0004001

> par(mfrow = c(2, 2))
> plot(lmodi)
```
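Beyond reading the summary table, the fitted model can be interrogated directly. A minimal sketch (it refits the same model so the chunk is self-contained; the use of `confint` and `drop1` here is my illustration, not the book's code):

```r
# Refit the interaction model from above, then inspect it:
# confint() gives 95% CIs for each coefficient, and drop1() tests
# each term with an F-test while respecting the hierarchy (it will
# not test a main effect whose interaction is still in the model).
library(faraway)
data(gavote)
gavote$undercount <- (gavote$ballots - gavote$votes)/gavote$ballots
gavote$pergore  <- gavote$gore/gavote$votes
gavote$cpergore <- gavote$pergore - mean(gavote$pergore)
gavote$cperAA   <- gavote$perAA - mean(gavote$perAA)
lmodi <- lm(undercount ~ cperAA + cpergore * rural + equip, gavote)

confint(lmodi)            # e.g. the equipOS-PC interval excludes zero
drop1(lmodi, test = "F")  # term-wise F-tests
```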
Robust regression
Robust regression is available via `rlm` in the MASS package.
```r
> library(MASS)
> x <- security.db1[, c("UNIONBANK", "PNB")]
> head(x)
  UNIONBANK    PNB
1    148.95 404.75
2    152.35 408.90
3    149.80 406.50
4    149.00 402.45
5    149.05 408.55
6    140.85 391.25
> rlm(UNIONBANK ~ PNB, x)
Call:
rlm(formula = UNIONBANK ~ PNB, data = x)
Converged in 6 iterations

Coefficients:
(Intercept)         PNB 
 50.5755479   0.2426357 

Degrees of freedom: 236 total; 234 residual
Scale estimate: 9.94 
> lm(UNIONBANK ~ PNB, x)

Call:
lm(formula = UNIONBANK ~ PNB, data = x)

Coefficients:
(Intercept)          PNB  
    52.5559       0.2419  
```
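Since `security.db1` is a local dataset, here is a self-contained sketch on simulated data (all object names are illustrative, not from the book) showing why `rlm` and `lm` diverge when outliers are present:

```r
# Sketch: rlm's Huber M-estimation downweights gross outliers,
# while ordinary least squares is pulled toward them.
library(MASS)

set.seed(1)
d <- data.frame(x = 1:50)
d$y <- 2 + 0.5 * d$x + rnorm(50, sd = 0.5)   # true slope 0.5
d$y[48:50] <- d$y[48:50] + 30                # contaminate three points

fit_ols <- lm(y ~ x, d)    # least squares: slope biased by outliers
fit_rob <- rlm(y ~ x, d)   # robust fit: slope stays near 0.5

coef(fit_ols)
coef(fit_rob)
round(fit_rob$w[48:50], 2)  # the outliers receive small weights
```

The `w` component of the `rlm` fit makes the downweighting visible: clean points get weight 1, contaminated points get weights near zero.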
Other learnings from chapter 1 of the book are:

- You can use `step` to prune a big model down to a smaller one
- You can prune manually by comparing nested models with `anova` and checking the F-statistics
- `regsubsets` from the leaps package is also very useful for variable selection
- The hierarchy principle says that lower-order terms should be retained in the model whenever a higher-order interaction involving them is present
- Weighted least squares is useful when you know the residual variance differs across observations, e.g., when each response is an average over a known number of cases
- Robust regression (`rlm`) is available in the MASS package
- A point can have high leverage but low influence, e.g., an extreme x value whose response still lies close to the fitted line
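Two of these ideas can be sketched on the gavote data. This is my illustration, not the book's code, and the choice of starting model and of `ballots` as the weight is an assumption:

```r
# Illustrative sketch: AIC-based pruning with step(), and weighted
# least squares on the gavote undercount data.
library(faraway)
data(gavote)
gavote$undercount <- (gavote$ballots - gavote$votes)/gavote$ballots
gavote$pergore <- gavote$gore/gavote$votes

big <- lm(undercount ~ perAA + pergore + rural + equip, gavote)

# step() drops terms as long as AIC improves
small <- step(big, trace = 0)
formula(small)

# WLS: counties with more ballots measure the undercount proportion
# more precisely, so weight by ballots (an illustrative choice)
wmod <- lm(undercount ~ perAA + rural + equip, gavote, weights = ballots)
summary(wmod)$coef
```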