Purpose
To understand various aspects of extending a linear model.

> library(faraway)
> library(car)
> data(gavote)
> head(gavote)
         equip   econ perAA rural    atlanta gore bush other votes ballots
APPLING  LEVER   poor 0.182 rural notAtlanta 2093 3940    66  6099    6617
ATKINSON LEVER   poor 0.230 rural notAtlanta  821 1228    22  2071    2149
BACON    LEVER   poor 0.131 rural notAtlanta  956 2010    29  2995    3347
BAKER    OS-CC   poor 0.476 rural notAtlanta  893  615    11  1519    1607
BALDWIN  LEVER middle 0.359 rural notAtlanta 5893 6041   192 12126   12785
BANKS    LEVER middle 0.024 rural notAtlanta 1220 3202   111  4533    4773

The variables in the dataset are:

  • equip - type of voting equipment
  • econ - economic level of the county
  • perAA - proportion of African Americans
  • rural - whether the county is rural or urban
  • atlanta - whether the county is part of the Atlanta metropolitan area
  • gore - votes for Gore
  • bush - votes for Bush
  • other - votes for other candidates
  • votes - total votes cast
  • ballots - ballots issued

> summary(gavote)
   equip        econ        perAA          rural           atlanta
 LEVER:74   middle:69   Min.   :0.0000   rural:117   Atlanta   : 15
 OS-CC:44   poor  :72   1st Qu.:0.1115   urban: 42   notAtlanta:144
 OS-PC:22   rich  :18   Median :0.2330
 PAPER: 2               Mean   :0.2430
 PUNCH:17               3rd Qu.:0.3480
                        Max.   :0.7650
      gore             bush            other            votes
 Min.   :   249   Min.   :   271   Min.   :   5.0   Min.   :   832
 1st Qu.:  1386   1st Qu.:  1804   1st Qu.:  30.0   1st Qu.:  3506
 Median :  2326   Median :  3597   Median :  86.0   Median :  6299
 Mean   :  7020   Mean   :  8929   Mean   : 381.7   Mean   : 16331
 3rd Qu.:  4430   3rd Qu.:  7468   3rd Qu.: 210.0   3rd Qu.: 11846
 Max.   :154509   Max.   :140494   Max.   :7920.0   Max.   :263211
    ballots
 Min.   :   881
 1st Qu.:  3694
 Median :  6712
 Mean   : 16927
 3rd Qu.: 12251
 Max.   :280975
> gavote$undercount = (gavote$ballots - gavote$votes)/gavote$ballots
> boxplot(gavote$undercount)

One important lesson is that you should always look at the magnitude of the possible y values and transform them accordingly. In this case, the relative undercount proportion makes far more sense than the raw counts.

> hist(gavote$undercount, n.bins(gavote$undercount))
> plot(density(gavote$undercount))
> rug(gavote$undercount)
> pie(table(gavote$equip))
> barplot(sort(table(gavote$equip), decreasing = T), las = 2)
> gavote$pergore <- gavote$gore/gavote$votes
> plot(pergore ~ perAA, gavote)
> plot(undercount ~ equip, gavote)
> xtabs(~atlanta + rural, gavote)
            rural
atlanta      rural urban
  Atlanta        1    14
  notAtlanta   116    28
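A handy follow-up to xtabs is prop.table, when proportions are easier to read than counts. A minimal sketch on a toy data frame standing in for gavote (the values are hypothetical):

```r
# Toy stand-in for gavote (hypothetical values)
df <- data.frame(
  atlanta = c("Atlanta", "Atlanta", "notAtlanta", "notAtlanta", "notAtlanta"),
  rural   = c("urban",   "urban",   "rural",      "rural",      "urban")
)
tab <- xtabs(~ atlanta + rural, df)   # counts, as in the gavote table above
prop.table(tab)                        # overall cell proportions
prop.table(tab, margin = 1)            # proportions within each row
```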


Basic Modeling

> gavote$cpergore <- gavote$pergore - mean(gavote$pergore)
> gavote$cperAA <- gavote$perAA - mean(gavote$perAA)
> lmodi <- lm(undercount ~ cperAA + cpergore * rural + equip, gavote)
> summary(lmodi)
Call:
lm(formula = undercount ~ cperAA + cpergore * rural + equip,
    data = gavote)
Residuals:
      Min        1Q    Median        3Q       Max
-0.059530 -0.012904 -0.002180  0.009013  0.127496

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)          0.043297   0.002839  15.253  < 2e-16 ***
cperAA               0.028264   0.031092   0.909   0.3648
cpergore             0.008237   0.051156   0.161   0.8723
ruralurban          -0.018637   0.004648  -4.009 9.56e-05 ***
equipOS-CC           0.006482   0.004680   1.385   0.1681
equipOS-PC           0.015640   0.005827   2.684   0.0081 **
equipPAPER          -0.009092   0.016926  -0.537   0.5920
equipPUNCH           0.014150   0.006783   2.086   0.0387 *
cpergore:ruralurban -0.008799   0.038716  -0.227   0.8205
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.02335 on 150 degrees of freedom
Multiple R-squared: 0.1696,  Adjusted R-squared: 0.1253
F-statistic: 3.829 on 8 and 150 DF,  p-value: 0.0004001
> par(mfrow = c(2, 2))
> plot(lmodi)
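Centering the predictors (as done above with cperAA and cpergore) does not change the slopes; it only makes the intercept interpretable as the expected response at average predictor values. A minimal sketch on simulated data (all names and values are hypothetical):

```r
# Centering a predictor shifts the intercept but leaves the slope unchanged
set.seed(1)
x <- runif(50)
y <- 2 + 3 * x + rnorm(50, sd = 0.1)

fit_raw      <- lm(y ~ x)
fit_centered <- lm(y ~ I(x - mean(x)))

coef(fit_raw)[2]        # slope
coef(fit_centered)[2]   # same slope
coef(fit_centered)[1]   # intercept: expected y at the mean of x (equals mean(y))
```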

Robust regression
This is present in the MASS package.

> library(MASS)
> x <- security.db1[, c("UNIONBANK", "PNB")]
> head(x)
  UNIONBANK    PNB
1    148.95 404.75
2    152.35 408.90
3    149.80 406.50
4    149.00 402.45
5    149.05 408.55
6    140.85 391.25
> rlm(UNIONBANK ~ PNB, x)
Call:
rlm(formula = UNIONBANK ~ PNB, data = x)
Converged in 6 iterations
Coefficients:
(Intercept)         PNB
 50.5755479   0.2426357

Degrees of freedom: 236 total; 234 residual
Scale estimate: 9.94
> lm(UNIONBANK ~ PNB, x)
Call:
lm(formula = UNIONBANK ~ PNB, data = x)

Coefficients:
(Intercept)         PNB
    52.5559      0.2419
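The two fits above barely differ because this data contains no gross outliers. Where outliers are present, rlm downweights them while lm does not. A minimal sketch on simulated data with one planted outlier (values hypothetical):

```r
library(MASS)  # for rlm

set.seed(42)
x <- 1:30
y <- 2 * x + rnorm(30, sd = 1)
y[30] <- 200                 # plant a gross outlier at a high-leverage point

coef(lm(y ~ x))["x"]         # OLS slope, pulled up by the outlier
coef(rlm(y ~ x))["x"]        # robust slope, stays close to the true value 2
```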

Other takeaways from Chapter 1 of the book:

  • You can use step to prune a big model down to a smaller one
  • You can prune manually with anova, comparing F statistics between nested models
  • regsubsets from the leaps package is also very useful
  • The hierarchy principle says that lower-order terms should be retained whenever a higher-order interaction involving them appears in the model
  • Weighted least squares is useful when the residual variance differs across observations, e.g. when each response aggregates a different number of cases
  • Robust regression from the MASS Package
  • You can have a point which has high leverage but low influence
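The step point in the list above can be sketched on simulated data (all names and values are hypothetical): step repeatedly drops the term whose removal lowers AIC the most.

```r
# AIC-guided backward selection with step() on a toy model
set.seed(7)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
d$y <- 1 + 2 * d$x1 + rnorm(100, sd = 0.5)   # only x1 actually matters

big   <- lm(y ~ x1 + x2 + x3, d)
small <- step(big, trace = 0)                # prune terms by AIC, silently
names(coef(small))                           # x1 should survive the pruning
```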