Hobson GLM - Exercise 6.2

Purpose
To work out exercise 6.2 from Hobson’s book '' 6.2 shows response of a grass and legume pasture system to various quantities of phosphorus fertilizer (data from D. F. Sinclair; the results were reported in Sinclair and Probert, 1986). The total yield, ofgrass and legume together, and amount ofphosphorus (K) are both given in kilograms per hectare. Find a suitable model for describing the relationship between yield and quantity off ertilizer.

(a) Plot yield against phosphorus to obtain an approximately linear relationship you may need to try several transformations of either or both variables in order to achieve approximate linearity.

> folder <- "C:/Cauldron/garage/R/soulcraft/Volatility/Learn/Dobson-GLM/"
> file.input <- paste(folder, "Table 6.16 Pasture yield.csv", sep = "")
> data <- read.csv(file.input, header = T, stringsAsFactors = F)

> par(mfrow = c(1, 1))
> plot(data$K, data$yield, pch = 19, col = "blue")

> lambda <- 2
> temp <- (data$yield^lambda - 1)/lambda
> par(mfrow = c(1, 1))
> plot(data$K, temp, pch = 19, col = "blue")

Looks like quadratic is a nice fit. Lets fit the model and check it out
(b) Use the results of(a) to specify a possible model. Fit the model.

> fit <- (lm(yield ~ K + I(K^2), data))
> summary(fit)
Call:
lm(formula = yield ~ K + I(K^2), data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-876.18 -257.45   82.54  287.88  722.17

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2471.2637   196.1115  12.601 4.51e-12 ***
K             73.8059    19.8999   3.709  0.00110 **
I(K^2)        -0.5152     0.3894  -1.323  0.19827
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 438.6 on 24 degrees of freedom
Multiple R-squared: 0.7915,     Adjusted R-squared: 0.7742
F-statistic: 45.56 on 2 and 24 DF,  p-value: 6.736e-09

(c) Calculate the standardized residuals for the model and use appropriate plots to check for any systematic effects that might suggest alternative models and to investigate the validity ofan y assumptions made.

> library(car)
> qq.plot(fit, simulate = T)
[1] 19

All points seem to lie in the confidence bands

Influence stats

> plot(hatvalues(fit), ylim = c(0, 0.5))
> abline(h = c(2, 3) * 3/27)
> identify(1:27, hatvalues(fit), row.names(data))
[1]  3 12 26

Cooksdistance

> plot(cookd(fit), ylim = c(0, 0.5))
> abline(h = 4/24)
> identify(1:27, cookd(fit), row.names(data))
[1] 1

DFBETAS

> dfbs.fit <- dfbetas(fit)
> par(mfrow = c(3, 1))
> plot(1:27, dfbs.fit[, 1], pch = 19, col = "blue")
> plot(1:27, dfbs.fit[, 2], pch = 19, col = "blue")
> plot(1:27, dfbs.fit[, 3], pch = 19, col = "blue")

As you can clearly see that the first point has
a big influence on the second parameter of the model and hence has a bigger cooks distance. This point concurs with the previous cooksd graph