Purpose
The purpose of this post is to look at regression, ground up. Most of what I have learnt in the past, in my MBA, in my masters has been as follows ? There is some data and you have to run regression on it to get some output. One is taught to interpret the output and get an intuitive sense of the model. May be one explores a step further and uses a software to get the model output. The problem with this approach is that you do not appreciate the finer aspects.

+ Let me post something which ideally is a nice way to understand closely the various aspects of regression

I will start with a simple regression workout and then move on to multiple regression exercise. The structure of each of these exercises will be as follows :
- Simulate the data so that you are already aware of the population parameters. - Run the model using R functions. - Generate Output using R functions. - Manually generate the same output by not using very many functions from R. - Infer from the simulated data.

Using R

> set.seed(12345)
> x <- rep(seq(1, 10, 1), 100)
> beta <- rnorm(1000, 7, 0.5)
> e <- rnorm(1000)
> y <- x * beta + e
> z <- cbind(y, x)
> z <- data.frame(z[order(z[, 2]), ])
> plot(z$x, z$y, pch = 19, col = "blue")
> fit <- lm(y ~ x + 0, z)
> fit.sum <- summary(fit)
> fit.sum
Call:
lm(formula = y ~ x + 0, data = z)
Residuals: Min 1Q Median 3Q Max -14.280894 -1.535239 0.003702 1.792795 12.184642
Coefficients: Estimate Std. Error t value Pr(>|t|) x 7.01435 0.01689 415.4 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.313 on 999 degrees of freedom Multiple R-squared: 0.9942, Adjusted R-squared: 0.9942 F-statistic: 1.725e+05 on 1 and 999 DF, p-value: < 2.2e-16

RegressionFromScratch-001.jpg

Lets generate the output from fit.sum using the first priciples

Using Very Minimal Functions from R

> beta.mu <- solve(crossprod(z$x, z$x), crossprod(z$x, z$y))
> round(beta.mu - fit.sum$coef[1], 6) == 0
     [,1]
[1,] TRUE

Finding Sigma of the error term

> resid.val <- z$y - beta.mu * z$x
> n <- 1000
> p <- 1
> df <- n - p
> resid.sigma <- sqrt(sum(resid.val^2) * (1/df))
> round(resid.sigma - fit.sum$sigma, 6) == 0
[1] TRUE
> resid.sigma.R <- sqrt(deviance(fit)/df.residual(fit))
> round(resid.sigma - resid.sigma.R, 6) == 0
[1] TRUE

Finding sigma of estimate

> beta.sigma <- sqrt(solve(crossprod(z$x, z$x))) * resid.sigma
> xtxi <- fit.sum$cov.unscaled
> beta.stderror.R1 <- sqrt(diag(xtxi)) * resid.sigma
> beta.stderror.R2 <- fit.sum$coef[, 2]
> round(beta.sigma - beta.stderror.R1, 6) == 0
     [,1]
[1,] TRUE
> round(beta.sigma - beta.stderror.R2, 6) == 0
     [,1]
[1,] TRUE

calculating t value

> tvalue <- (beta.mu/beta.sigma)
> tvalue.R <- fit.sum$coef[, 3]
> round(tvalue - tvalue.R, 0) == 0
     [,1]
[1,] TRUE

calculating p value

> pvalue <- 1 - pt(tvalue, 999)
> pvalue.R <- fit.sum$coef[, 4]
> round(pvalue - pvalue.R, 0) == 0
     [,1]
[1,] TRUE

Rsquare of the model

> rsq <- 1 - sum(resid.val^2)/sum((z$y)^2)
> rsq.R1 <- 1 - deviance(fit)/sum((z$y)^2)
> rsq.R2 <- fit.sum$r.squared
> round(rsq - rsq.R1, 6) == 0
[1] TRUE
> round(rsq - rsq.R2, 6) == 0
[1] TRUE

For a 1 variable regression equation F stat = sqrt(t) as both test the same thing

> adj.rsq <- 1 - (sum(resid.val^2)/sum((z$y)^2)) * (n - 1)/(n -
+     p - 1)
> adj.rsq.R <- fit.sum$adj.r.squared
> round(adj.rsq - adj.rsq.R, 6) == 0
[1] TRUE
> beta.mu + c(-1, 1) * qt(0.975, 499) * beta.sigma
[1] 6.981168 7.047525
> confint(fit)
     2.5 %   97.5 %
x 6.981208 7.047485

There are two types of forecasts one can do with the model. One is conditional mean forecast and one is predict for a given xi

> xi <- 4.5
> ymean.forecast <- beta.mu * xi
> yforecast <- beta.mu * xi
> ymean.forecast.error <- resid.sigma * sqrt(xi * solve(crossprod(z$x,
+     z$x)) * xi)
> yforecast.error <- resid.sigma * sqrt(1 + xi * solve(crossprod(z$x,
+     z$x)) * xi)
> t.sig <- qt(0.975, 999)
> ymean.forecast + c(-1, 1) * t.sig * ymean.forecast.error
[1] 31.41544 31.71368
> predict(fit, data.frame(x = xi), interval = "confidence")
       fit      lwr      upr
1 31.56456 31.41544 31.71368
> yforecast + c(-1, 1) * t.sig * yforecast.error
[1] 25.06070 38.06842
> predict(fit, data.frame(x = xi), interval = "prediction")
       fit      lwr      upr
1 31.56456 25.06070 38.06842

This post summarizes the method of finding all the parameters of the model using limited R builtin functions.