Purpose
ANOVA is a concept that is probably familiar to anyone who has taken an elementary statistics course. When I tried racking my brains to remember what I could do with R relating to ANOVA, my first instinct was this:

> x <- rnorm(1000, 2, 3)
> y <- rnorm(1000, 2, 9)

I was trying to do this: an ANOVA of two random variables with the same mean and different variances. But ANOVA is not something you use on two bare random variables. It is, for heaven's sake, analysis of variance, and there is nothing to be gained by analysing the variance of x and y here. I can instead do a t-test:

> t.test(y - x)
        One Sample t-test
data:  y - x
t = 0.1947, df = 999, p-value = 0.8457
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.5097545  0.6220424
sample estimates:
 mean of x 
0.05614391 

This tests the null hypothesis that the mean difference is 0 against the alternative that it is not 0.
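Incidentally, if the two samples are set up as a one-way layout with a grouping factor, ANOVA and the pooled two-sample t-test are the same test: the F statistic is exactly the square of the t statistic and the p-values agree. A minimal sketch, reusing the x and y simulated above (the grp and val names are just illustrative):

> # stack the two samples and label which sample each value came from
> grp <- factor(rep(c("x", "y"), each = 1000))
> val <- c(x, y)
> anova(lm(val ~ grp))                  # one-way ANOVA on two groups
> t.test(val ~ grp, var.equal = TRUE)   # pooled t-test: t^2 equals the F above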

While reading about ANOVA I came across strip plots, and hence explored the stripchart function in R a bit.

> par(mfrow = c(1, 1))
> x <- stats::rnorm(50)
> xr <- round(x, 1)
> stripchart(x)
> m <- mean(par("usr")[1:2])
> text(m, 1.04, "stripchart(x, \"overplot\")")
> stripchart(xr, method = "stack", add = TRUE, at = 1.2)
> text(m, 1.35, "stripchart(round(x,1), \"stack\")")
> stripchart(xr, method = "jitter", add = TRUE, at = 0.7)
> text(m, 0.85, "stripchart(round(x,1), \"jitter\")")

[Anova-003.jpg: stripchart of x drawn with the overplot, stack and jitter methods]

The basic idea is to use ANOVA to compare two different models. The coagulation data from the faraway package (blood coagulation times under four diets) makes a handy example:

> library(faraway)
> data(coagulation)
> boxplot(coag ~ diet, coagulation)
> par(mfrow = c(1, 2))
> plot(coag ~ diet, coagulation, ylab = "coagulation time")
> with(coagulation, stripchart(coag ~ diet, vertical = TRUE, method = "stack",
+     xlab = "diet", ylab = "coagulation time"))

Let's fit a model:

> g <- lm(coag ~ diet, coagulation)
> summary(g)
Call:
lm(formula = coag ~ diet, data = coagulation)
Residuals:
       Min         1Q     Median         3Q        Max 
-5.000e+00 -1.250e+00 -2.141e-17  1.250e+00  5.000e+00 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  64.0000     0.4979 128.537  < 2e-16 ***
diet1        -3.0000     0.9736  -3.081 0.005889 ** 
diet2         2.0000     0.8453   2.366 0.028195 *  
diet3         4.0000     0.8453   4.732 0.000128 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.366 on 20 degrees of freedom
Multiple R-squared: 0.6706,     Adjusted R-squared: 0.6212 
F-statistic: 13.57 on 3 and 20 DF,  p-value: 4.658e-05 

> g <- lm(coag ~ diet - 1, coagulation)
> summary(g)
Call:
lm(formula = coag ~ diet - 1, data = coagulation)

Residuals:
       Min         1Q     Median         3Q        Max 
-5.000e+00 -1.250e+00  1.743e-16  1.250e+00  5.000e+00 

Coefficients:
      Estimate Std. Error t value Pr(>|t|)    
dietA  61.0000     1.1832   51.55   <2e-16 ***
dietB  66.0000     0.9661   68.32   <2e-16 ***
dietC  68.0000     0.9661   70.39   <2e-16 ***
dietD  61.0000     0.8367   72.91   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.366 on 20 degrees of freedom
Multiple R-squared: 0.9989,     Adjusted R-squared: 0.9986 
F-statistic:  4399 on 4 and 20 DF,  p-value: < 2.2e-16
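The two fits above differ only in parameterisation, which shows up in the design matrix. A small sketch (not part of the original output) to see it:

> # coag ~ diet keeps an intercept plus contrast columns for diet, while
> # coag ~ diet - 1 drops the intercept and uses one indicator per diet level,
> # so its coefficients are simply the four group means.
> head(model.matrix(coag ~ diet, coagulation))
> head(model.matrix(coag ~ diet - 1, coagulation))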

Basically, to test the model against the null that none of the coefficients matter (i.e. diet has no effect), there is a simple anova() comparison of nested models:

> gi <- lm(coag ~ diet - 1, coagulation)
> gnull <- lm(coag ~ 1, coagulation)
> anova(gnull, gi)
Analysis of Variance Table
Model 1: coag ~ 1
Model 2: coag ~ diet - 1
  Res.Df RSS Df Sum of Sq      F    Pr(>F)    
1     23 340                                  
2     20 112  3       228 13.571 4.658e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
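The F value in that table can be reproduced by hand from the two residual sums of squares, which makes the logic explicit: the drop in RSS per extra parameter is compared against the residual variance of the bigger model. A quick check:

> # F = ((RSS_null - RSS_full) / extra parameters) / (RSS_full / residual df)
> ((340 - 112) / 3) / (112 / 20)
[1] 13.57143
> 1 - pf(13.57143, 3, 20)   # reproduces the p-value of roughly 4.658e-05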
> options(contrasts = c("contr.sum", "contr.poly"))
> gs <- lm(coag ~ diet, coagulation)
> summary(gs)
Call:
lm(formula = coag ~ diet, data = coagulation)
Residuals:
       Min         1Q     Median         3Q        Max 
-5.000e+00 -1.250e+00 -2.141e-17  1.250e+00  5.000e+00 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  64.0000     0.4979 128.537  < 2e-16 ***
diet1        -3.0000     0.9736  -3.081 0.005889 ** 
diet2         2.0000     0.8453   2.366 0.028195 *  
diet3         4.0000     0.8453   4.732 0.000128 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.366 on 20 degrees of freedom
Multiple R-squared: 0.6706,     Adjusted R-squared: 0.6212 
F-statistic: 13.57 on 3 and 20 DF,  p-value: 4.658e-05 

> par(mfrow = c(1, 1))
> qqnorm(residuals(g))
> plot(jitter(fitted(g)), residuals(g), xlab = "Fitted", ylab = "Residuals")
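The Q-Q plot and the residuals-versus-fitted plot are informal checks of the normality and equal-variance assumptions behind the F-test. If formal tests are wanted alongside them, one option (a sketch, not part of the original analysis) is:

> # Shapiro-Wilk test of normality of the residuals
> shapiro.test(residuals(g))
> # Bartlett's test of equal variances across the diet groups
> bartlett.test(coag ~ diet, coagulation)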

There are a bunch of options for producing the contrast matrix:

contr.helmert(n, contrasts = TRUE)
contr.poly(n, scores = 1:n, contrasts = TRUE)
contr.sum(n, contrasts = TRUE)
contr.treatment(n, base = 1, contrasts = TRUE)
contr.SAS(n, contrasts = TRUE)
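To see what these codings actually look like, the contrast matrices for a four-level factor can be printed directly (a small sketch, not in the original post):

> # default treatment coding: each column compares a level with the baseline
> contr.treatment(4)
> # sum-to-zero coding, as switched on above with options(contrasts = ...):
> # coefficients become deviations of each group mean from their average
> contr.sum(4)
> # the coding R will use for the diet factor
> contrasts(coagulation$diet)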

> med <- with(coagulation, tapply(coag, diet, median))
> ar <- with(coagulation, abs(coag - med[diet]))
> anova(lm(ar ~ diet, coagulation))
Analysis of Variance Table
Response: ar
          Df Sum Sq Mean Sq F value Pr(>F)
diet       3  4.333   1.444  0.6492 0.5926
Residuals 20 44.500   2.225               

> qt(0.975, 20)
[1] 2.085963
> c(5 - 2.086 * 1.53, 5 + 2.086 * 1.53)
[1] 1.80842 8.19158
> qtukey(0.95, 4, 20)/sqrt(2)
[1] 2.798936
> c(5 - 2.8 * 1.53, 5 + 2.8 * 1.53)
[1] 0.716 9.284
> TukeyHSD(aov(coag ~ diet, coagulation))
  Tukey multiple comparisons of means
    95% family-wise confidence level
Fit: aov(formula = coag ~ diet, data = coagulation)
$diet
             diff         lwr        upr     p adj
B-A  5.000000e+00   0.7245544  9.275446 0.0183283
C-A  7.000000e+00   2.7245544 11.275446 0.0009577
D-A -1.421085e-14  -4.0560438  4.056044 1.0000000
C-B  2.000000e+00  -1.8240748  5.824075 0.4766005
D-B -5.000000e+00  -8.5770944 -1.422906 0.0044114
D-C -7.000000e+00 -10.5770944 -3.422906 0.0001268

> qt(1 - 0.05/12, 20)
[1] 2.927119
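Two things happen in that block. The anova(lm(ar ~ diet, ...)) call is a Levene/Brown-Forsythe style check of equal variances: it runs an ANOVA on the absolute deviations from each group's median, and the large p-value (0.59) gives no evidence against equal spreads. The qt/qtukey arithmetic then builds confidence intervals for the B - A difference of 5; the 1.53 plugged in is the standard error of that difference, which can be reproduced by hand (a sketch using the MSE of 5.6 from the ANOVA table above and the group sizes nA = 4, nB = 6):

> # standard error of the B - A difference: sqrt(MSE * (1/nA + 1/nB))
> sqrt(5.6 * (1/4 + 1/6))
[1] 1.527525
> # unadjusted 95% t interval, roughly (1.81, 8.19)
> 5 + c(-1, 1) * qt(0.975, 20) * 1.527525
> # Tukey HSD interval, matching the B-A row of TukeyHSD above
> 5 + c(-1, 1) * (qtukey(0.95, 4, 20) / sqrt(2)) * 1.527525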

I started going over "The Lady Tasting Tea" to understand why on earth one wants ANOVA in the first place. What are we getting at by doing an ANOVA, and why is it called analysis of variance?

> y <- rnorm(1000)
> x <- rep(letters[1:2], 500)
> summary(lm(y ~ x))
Call:
lm(formula = y ~ x)
Residuals:
      Min        1Q    Median        3Q       Max 
-3.157256 -0.724636 -0.008918  0.680274  3.430869 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.05189    0.03298   1.573    0.116
x1          -0.04674    0.03298  -1.417    0.157

Residual standard error: 1.043 on 998 degrees of freedom
Multiple R-squared: 0.002008,   Adjusted R-squared: 0.001008 
F-statistic: 2.008 on 1 and 998 DF,  p-value: 0.1568 

> anova(lm(y ~ x))
Analysis of Variance Table

Response: y
           Df  Sum Sq Mean Sq F value Pr(>F)
x           1    2.18    2.18  2.0083 0.1568
Residuals 998 1085.63    1.09
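That ANOVA table is literally a decomposition of the variation in y, which is where the name comes from: the total sum of squares splits into a between-group piece (2.18, attributed to x) and a within-group piece (1085.63, the residuals), and the F statistic is the ratio of their mean squares. A small sketch doing the same decomposition by hand, reusing the y and x just simulated:

> # total, between-group and within-group sums of squares
> grand <- mean(y)
> ss_total   <- sum((y - grand)^2)
> ss_between <- sum(tapply(y, x, function(g) length(g) * (mean(g) - grand)^2))
> ss_within  <- sum(tapply(y, x, function(g) sum((g - mean(g))^2)))
> c(ss_between, ss_within, ss_between + ss_within - ss_total)  # last entry ~ 0
> # F statistic: between-group mean square over within-group mean square
> (ss_between / 1) / (ss_within / 998)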

The historical importance of ANOVA is something I intend to explore soon.