Faraway-2-Binomial Data

Purpose
To understand the way to model a linear relation if the Y variable is Binomial. There are lot of times when you encounter dependent variable being a Binomial.

The data being discusses is this : First column is the temperature at the time of launch Second column is the # of rings that got damaged.

> library(faraway)
> data(orings)
> orings$dam.prop <- orings$damage/6
> orings
   temp damage  dam.prop
1    53      5 0.8333333
2    57      1 0.1666667
3    58      1 0.1666667
4    63      1 0.1666667
5    66      0 0.0000000
6    67      0 0.0000000
7    67      0 0.0000000
8    67      0 0.0000000
9    68      0 0.0000000
10   69      0 0.0000000
11   70      1 0.1666667
12   70      0 0.0000000
13   70      1 0.1666667
14   70      0 0.0000000
15   72      0 0.0000000
16   73      0 0.0000000
17   75      0 0.0000000
18   75      1 0.1666667
19   76      0 0.0000000
20   76      0 0.0000000
21   78      0 0.0000000
22   79      0 0.0000000
23   81      0 0.0000000
> par(mfrow = c(1, 1))
> plot(damage/6 ~ temp, orings, xlim = c(25, 85), ylim = c(0, 1),
+     xlab = "Temperature", ylab = "Prob of damage", pch = 19,
+     col = "blue")

What would you do ?

> fit <- lm(dam.prop ~ temp, orings)
> abline(fit)

The above thing is obviously crap . So try fitting a logit model

> logitmod <- glm(cbind(damage, 6 - damage) ~ temp, family = binomial,
+     orings)
> summary(logitmod)
Call:
glm(formula = cbind(damage, 6 - damage) ~ temp, family = binomial,
    data = orings)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.9529  -0.7345  -0.4393  -0.2079   1.9565

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) 11.66299    3.29626   3.538 0.000403 ***
temp        -0.21623    0.05318  -4.066 4.78e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 38.898  on 22  degrees of freedom
Residual deviance: 16.912  on 21  degrees of freedom
AIC: 33.675

Number of Fisher Scoring iterations: 6
> plot(damage/6 ~ temp, orings, xlim = c(25, 85), ylim = c(0, 1))
> x <- seq(25, 85, 1)
> lines(x, ilogit(11.663 - 0.21628 * x))

Now lets fit a probit

> probitmod <- glm(cbind(damage, 6 - damage) ~ temp, family = binomial(link = probit),
+     orings)
> summary(probitmod)
Call:
glm(formula = cbind(damage, 6 - damage) ~ temp, family = binomial(link = probit),
    data = orings)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.0134  -0.7760  -0.4467  -0.1581   1.9982

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  5.59145    1.71055   3.269  0.00108 **
temp        -0.10580    0.02656  -3.984 6.79e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 38.898  on 22  degrees of freedom
Residual deviance: 18.131  on 21  degrees of freedom
AIC: 34.893

Number of Fisher Scoring iterations: 6
> plot(damage/6 ~ temp, orings, xlim = c(25, 85), ylim = c(0, 1))
> x <- seq(25, 85, 1)
> lines(x, pnorm(5.59145 - 0.1058 * x))

Residual deviance is the deviance of the current model when the null deviance is the deviance for a model with no predictors and just an intercept term Best way to test the model is as follows

> dev <- summary(probitmod)$null.deviance - summary(probitmod)$deviance
> pchisq(dev, 1, lower = F)
[1] 5.186666e-06

Thus you can see that the pvalue is so small the effect of temp is stat significant

Lets work on a probit model

> data(babyfood)
> head(babyfood)
  disease nondisease  sex   food
1      77        381  Boy Bottle
2      19        128  Boy  Suppl
3      47        447  Boy Breast
4      48        336 Girl Bottle
5      16        111 Girl  Suppl
6      31        433 Girl Breast
> babyfood$odds = babyfood$disease/babyfood$nondisease
> xtabs(disease/(disease + nondisease) ~ sex + food, babyfood)
      food
sex        Bottle     Breast      Suppl
  Boy  0.16812227 0.09514170 0.12925170
  Girl 0.12500000 0.06681034 0.12598425
> mdl <- glm(cbind(disease, nondisease) ~ sex + food, family = binomial,
+     babyfood)
> summary(mdl)
Call:
glm(formula = cbind(disease, nondisease) ~ sex + food, family = binomial,
    data = babyfood)

Deviance Residuals:
      1        2        3        4        5        6
 0.1096  -0.5052   0.1922  -0.1342   0.5896  -0.2284

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.6127     0.1124 -14.347  < 2e-16 ***
sexGirl      -0.3126     0.1410  -2.216   0.0267 *
foodBreast   -0.6693     0.1530  -4.374 1.22e-05 ***
foodSuppl    -0.1725     0.2056  -0.839   0.4013
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 26.37529  on 5  degrees of freedom
Residual deviance:  0.72192  on 2  degrees of freedom
AIC: 40.24

Number of Fisher Scoring iterations: 4
> drop1(mdl, test = "Chi")
Single term deletions

Model:
cbind(disease, nondisease) ~ sex + food
       Df Deviance    AIC    LRT   Pr(Chi)
<none>       0.722 40.240
sex     1    5.699 43.217  4.977   0.02569 *
food    2   20.899 56.417 20.177 4.155e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The choice of link function is crucial

> data(bliss)
> bliss
  dead alive conc
1    2    28    0
2    8    22    1
3   15    15    2
4   23     7    3
5   27     3    4
> modl <- glm(cbind(dead, alive) ~ conc, family = binomial, data = bliss)
> modp <- glm(cbind(dead, alive) ~ conc, family = binomial(link = probit),
+     data = bliss)
> modc <- glm(cbind(dead, alive) ~ conc, family = binomial(link = cloglog),
+     data = bliss)
> x <- seq(-2, 8, 0.2)
> pl <- ilogit(modl$coef[1] + modl$coef[2] * x)
> pp <- pnorm(modp$coef[1] + modp$coef[2] * x)
> pc <- 1 - exp(-exp((modc$coef[1] + modc$coef[2] * x)))
> par(mfrow = c(1, 3))
> plot(x, pl, type = "l", ylab = "Probability", xlab = "Dose")
> lines(x, pp, lty = 2)
> lines(x, pc, lty = 5)
> matplot(x, cbind(pp/pl, (1 - pp)/(1 - pl)), type = "l", xlab = "Dose",
+     ylab = "Ratio")
> matplot(x, cbind(pc/pl, (1 - pc)/(1 - pl)), type = "l", xlab = "Dose",
+     ylab = "Ratio")
> par(mfrow = c(1, 1))

When does estimation fail ? I am tired of this chapter now as the data is increasingly non financial I have to cull out a dataset to make it interesting!!!