Omitted Variable Problem
Purpose

For some vague reason, I had never simulated and tested for myself what is being said about the omitted variable problem. This note does exactly that.

Omitted Variable Bias
Simulating a dataset with two independent variables and no correlation between them: Y = alpha + Beta_1 X_1 + Beta_2 X_2 + epsilon
> N <- 10000
> x <- cbind(1, runif(N), runif(N))
> beta.true <- c(2, 3, 5)
> error.var <- 3
> indep <- as.matrix(x)
> dep <- indep %*% beta.true + sqrt(error.var) * rnorm(N)
> fit1 <- lm(dep ~ indep[, c(2, 3)])
> summary(fit1)

Call:
lm(formula = dep ~ indep[, c(2, 3)])

Residuals:
      Min        1Q    Median        3Q       Max
-6.681732 -1.174800  0.005124  1.160683  7.613576

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        2.04134    0.04606   44.32   <2e-16 ***
indep[, c(2, 3)]1  2.98309    0.06077   49.09   <2e-16 ***
indep[, c(2, 3)]2  4.94039    0.06078   81.28   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.749 on 9997 degrees of freedom
Multiple R-squared: 0.4762,     Adjusted R-squared: 0.4761
F-statistic:  4544 on 2 and 9997 DF,  p-value: < 2.2e-16

> fit2 <- lm(dep ~ indep[, c(2)])
> summary(fit2)

Call:
lm(formula = dep ~ indep[, c(2)])

Residuals:
       Min         1Q     Median         3Q        Max
-7.5494252 -1.5544859  0.0004559  1.5569615  8.7648413

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    4.49838    0.04479  100.44   <2e-16 ***
indep[, c(2)]  3.02685    0.07831   38.65   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.254 on 9998 degrees of freedom
Multiple R-squared: 0.13,       Adjusted R-squared: 0.1299
F-statistic:  1494 on 1 and 9998 DF,  p-value: < 2.2e-16

> fit3 <- lm(dep ~ indep[, c(3)])
> summary(fit3)

Call:
lm(formula = dep ~ indep[, c(3)])

Residuals:
     Min       1Q   Median       3Q      Max
-7.08114 -1.31229 -0.02188  1.29275  8.40231

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    3.50248    0.03916   89.44   <2e-16 ***
indep[, c(3)]  4.96683    0.06770   73.36   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.948 on 9998 degrees of freedom
Multiple R-squared: 0.3499,     Adjusted R-squared: 0.3499
F-statistic:  5382 on 1 and 9998 DF,  p-value: < 2.2e-16
As you can see, Beta_1 and Beta_2 can each be estimated without any bias, even in the shorter regressions that drop the other variable.
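Why is there no bias here? The classic omitted-variable result says the short-regression slope converges to Beta_1 + Beta_2 * delta, where delta is the slope from regressing the omitted X_2 on the included X_1; with uncorrelated regressors, delta is close to zero. A minimal sketch that checks this on the simulated data, assuming the indep and beta.true objects from the session above are still in the workspace:

> # The two runif() draws are independent, so the sample correlation
> # between the regressors should be near 0
> cor(indep[, 2], indep[, 3])
> # delta: slope from the auxiliary regression of the omitted X_2
> # (column 3) on the included X_1 (column 2)
> delta <- coef(lm(indep[, 3] ~ indep[, 2]))[2]
> # Predicted bias in the short regression of dep on X_1 is
> # Beta_2 * delta, which is near 0 here
> beta.true[3] * delta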
Now let's say the two explanatory variables are correlated: let the correlation between X_1 and X_2 be 0.8.
> library(mnormt)
> sample.cov <- matrix(data = NA, nrow = 2, ncol = 2)
> sample.cov[1, 1] <- 1
> sample.cov[1, 2] <- 0.8
> sample.cov[2, 1] <- 0.8
> sample.cov[2, 2] <- 1
> x <- rmnorm(N, mean = 0, varcov = sample.cov)
> x <- cbind(1, x)
> beta.true <- c(2, 3, 3)
> error.var <- 3
> indep <- as.matrix(x)
> dep <- indep %*% beta.true + sqrt(error.var) * rnorm(N)
> fit1 <- lm(dep ~ indep[, c(2, 3)])
> summary(fit1)

Call:
lm(formula = dep ~ indep[, c(2, 3)])

Residuals:
     Min       1Q   Median       3Q      Max
-6.57254 -1.16425  0.01150  1.15195  6.17648

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        1.98840    0.01725   115.3   <2e-16 ***
indep[, c(2, 3)]1  3.00990    0.02880   104.5   <2e-16 ***
indep[, c(2, 3)]2  2.99156    0.02860   104.6   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.725 on 9997 degrees of freedom
Multiple R-squared: 0.918,      Adjusted R-squared: 0.918
F-statistic: 5.596e+04 on 2 and 9997 DF,  p-value: < 2.2e-16

> fit2 <- lm(dep ~ indep[, c(2)])
> summary(fit2)

Call:
lm(formula = dep ~ indep[, c(2)])

Residuals:
     Min       1Q   Median       3Q      Max
-9.20541 -1.66141  0.03458  1.70168 10.26445

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    1.96109    0.02496   78.58   <2e-16 ***
indep[, c(2)]  5.43369    0.02474  219.60   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.495 on 9998 degrees of freedom
Multiple R-squared: 0.8283,     Adjusted R-squared: 0.8283
F-statistic: 4.823e+04 on 1 and 9998 DF,  p-value: < 2.2e-16

> fit3 <- lm(dep ~ indep[, c(3)])
> summary(fit3)

Call:
lm(formula = dep ~ indep[, c(3)])

Residuals:
       Min         1Q     Median         3Q        Max
-11.749998  -1.669863   0.008223   1.679512   9.647873

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    1.99631    0.02495   80.01   <2e-16 ***
indep[, c(3)]  5.39703    0.02456  219.71   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.495 on 9998 degrees of freedom
Multiple R-squared: 0.8284,     Adjusted R-squared: 0.8284
F-statistic: 4.827e+04 on 1 and 9998 DF,  p-value: < 2.2e-16
One can see that the coefficients are badly biased if you omit a variable from the regression. An omitted variable hurts you whenever it is correlated with the explanatory variables that remain in the model, and in practice that is almost always a live possibility: take any regression with a couple of explanatory variables, and there may well be some left-out variable that is correlated with one you included.
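The numbers above line up with the bias formula from earlier: with Beta_1 = Beta_2 = 3 and delta roughly 0.8 (both variances are 1, so delta is approximately the correlation), the short-regression slope should be about 3 + 3 * 0.8 = 5.4, and the fitted values of 5.43 and 5.40 are exactly in that neighbourhood. A sketch of the check, again assuming the objects from the correlated session above are still in the workspace:

> # Auxiliary regression of the omitted regressor on the included one
> delta <- coef(lm(indep[, 3] ~ indep[, 2]))[2]
> delta                                  # should be close to 0.8
> # Predicted short-regression slope: Beta_1 + Beta_2 * delta, about 5.4
> beta.true[2] + beta.true[3] * delta
> # Actual short-regression slope (5.43 in the run above)
> coef(lm(dep ~ indep[, 2]))[2]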
The lesson from the simulation is simple: the short regression is harmless only when the omitted variable is uncorrelated with the included ones; the moment they are correlated, the bias shows up exactly as the formula predicts.