Regression First Principles
Purpose
The purpose of this post is to look at regression, ground up. Most of what I have learnt in the past, in my MBA, in my masters has been as follows ? There is some data and you have to run regression on it to get some output. One is taught to interpret the output and get an intuitive sense of the model. May be one explores a step further and uses a software to get the model output. The problem with this approach is that you do not appreciate the finer aspects.
+ Let me post something which ideally is a nice way to understand closely the various aspects of regression
I will start with a simple regression workout and then move on to multiple regression exercise. The structure of each of these exercises will be as follows :
- Simulate the data so that you are already aware of the population parameters. - Run the model using R functions. - Generate Output using R functions. - Manually generate the same output by not using very many functions from R. - Infer from the simulated data.
Using R
> set.seed(12345) > x <- rep(seq(1, 10, 1), 100) > beta <- rnorm(1000, 7, 0.5) > e <- rnorm(1000) > y <- x * beta + e > z <- cbind(y, x) > z <- data.frame(z[order(z[, 2]), ]) > plot(z$x, z$y, pch = 19, col = "blue") > fit <- lm(y ~ x + 0, z) > fit.sum <- summary(fit) > fit.sum Call: lm(formula = y ~ x + 0, data = z) Residuals: Min 1Q Median 3Q Max -14.280894 -1.535239 0.003702 1.792795 12.184642 Coefficients: Estimate Std. Error t value Pr(>|t|) x 7.01435 0.01689 415.4 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 3.313 on 999 degrees of freedom Multiple R-squared: 0.9942, Adjusted R-squared: 0.9942 F-statistic: 1.725e+05 on 1 and 999 DF, p-value: < 2.2e-16 |
Lets generate the output from fit.sum using the first priciples
Using Very Minimal Functions from R
> beta.mu <- solve(crossprod(z$x, z$x), crossprod(z$x, z$y)) > round(beta.mu - fit.sum$coef[1], 6) == 0 [,1] [1,] TRUE |
Finding Sigma of the error term
> resid.val <- z$y - beta.mu * z$x > n <- 1000 > p <- 1 > df <- n - p > resid.sigma <- sqrt(sum(resid.val^2) * (1/df)) > round(resid.sigma - fit.sum$sigma, 6) == 0 [1] TRUE > resid.sigma.R <- sqrt(deviance(fit)/df.residual(fit)) > round(resid.sigma - resid.sigma.R, 6) == 0 [1] TRUE |
Finding sigma of estimate
> beta.sigma <- sqrt(solve(crossprod(z$x, z$x))) * resid.sigma > xtxi <- fit.sum$cov.unscaled > beta.stderror.R1 <- sqrt(diag(xtxi)) * resid.sigma > beta.stderror.R2 <- fit.sum$coef[, 2] > round(beta.sigma - beta.stderror.R1, 6) == 0 [,1] [1,] TRUE > round(beta.sigma - beta.stderror.R2, 6) == 0 [,1] [1,] TRUE |
calculating t value
> tvalue <- (beta.mu/beta.sigma) > tvalue.R <- fit.sum$coef[, 3] > round(tvalue - tvalue.R, 0) == 0 [,1] [1,] TRUE |
calculating p value
> pvalue <- 1 - pt(tvalue, 999) > pvalue.R <- fit.sum$coef[, 4] > round(pvalue - pvalue.R, 0) == 0 [,1] [1,] TRUE |
Rsquare of the model
> rsq <- 1 - sum(resid.val^2)/sum((z$y)^2) > rsq.R1 <- 1 - deviance(fit)/sum((z$y)^2) > rsq.R2 <- fit.sum$r.squared > round(rsq - rsq.R1, 6) == 0 [1] TRUE > round(rsq - rsq.R2, 6) == 0 [1] TRUE |
For a 1 variable regression equation F stat = sqrt(t) as both test the same thing
> adj.rsq <- 1 - (sum(resid.val^2)/sum((z$y)^2)) * (n - 1)/(n - + p - 1) > adj.rsq.R <- fit.sum$adj.r.squared > round(adj.rsq - adj.rsq.R, 6) == 0 [1] TRUE > beta.mu + c(-1, 1) * qt(0.975, 499) * beta.sigma [1] 6.981168 7.047525 > confint(fit) 2.5 % 97.5 % x 6.981208 7.047485 |
There are two types of forecasts one can do with the model. One is conditional mean forecast and one is predict for a given xi
> xi <- 4.5 > ymean.forecast <- beta.mu * xi > yforecast <- beta.mu * xi > ymean.forecast.error <- resid.sigma * sqrt(xi * solve(crossprod(z$x, + z$x)) * xi) > yforecast.error <- resid.sigma * sqrt(1 + xi * solve(crossprod(z$x, + z$x)) * xi) > t.sig <- qt(0.975, 999) > ymean.forecast + c(-1, 1) * t.sig * ymean.forecast.error [1] 31.41544 31.71368 > predict(fit, data.frame(x = xi), interval = "confidence") fit lwr upr 1 31.56456 31.41544 31.71368 > yforecast + c(-1, 1) * t.sig * yforecast.error [1] 25.06070 38.06842 > predict(fit, data.frame(x = xi), interval = "prediction") fit lwr upr 1 31.56456 25.06070 38.06842 |
This post summarizes the method of finding all the parameters of the model using limited R builtin functions.