PredictorProblems

Purpose
If you have two variables X and Y, do you regress Y on X or X on Y?

> data(cars)
> head(cars)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
> g <- lm(dist ~ speed, cars)
> summary(g)
Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max
-29.069  -9.525  -2.272   9.215  43.201

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791     6.7584  -2.601   0.0123 *
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511,     Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.490e-12
> g <- lm(speed ~ dist, cars)
> summary(g)
Call:
lm(formula = speed ~ dist, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max
-7.5293 -2.1550  0.3615  2.4377  6.4179

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  8.28391    0.87438   9.474 1.44e-12 ***
dist         0.16557    0.01749   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.156 on 48 degrees of freedom
Multiple R-squared: 0.6511,     Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.490e-12
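
The two fits share the same R-squared and t value for the slope, yet the fitted lines are not inverses of each other. A minimal sketch, using only the built-in cars data, checks this: the slope of one fit is not the reciprocal of the other, and the product of the two slopes equals the squared correlation. The object names (fit_dist_on_speed, b_yx and so on) are illustrative and not part of the original session.

```r
# Fit the regression in both directions on the built-in cars data
data(cars)
fit_dist_on_speed <- lm(dist ~ speed, data = cars)
fit_speed_on_dist <- lm(speed ~ dist, data = cars)

b_yx <- unname(coef(fit_dist_on_speed)["speed"])  # slope of dist on speed
b_xy <- unname(coef(fit_speed_on_dist)["dist"])   # slope of speed on dist

# The slopes are not reciprocals; their product is the squared correlation
slope_product <- b_yx * b_xy
r_squared <- cor(cars$speed, cars$dist)^2

print(c(b_yx = b_yx, inverse_of_b_xy = 1 / b_xy))
print(c(slope_product = slope_product, r_squared = r_squared))
```

So the choice of which variable to treat as the response does change the fitted line, even though the strength of the linear association is the same either way.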

What if there are errors in the measurement of X and Y?

Let there be a random measurement error in the dependent variable

> g <- lm(dist ~ speed, cars)
> summary(g)
Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max
-29.069  -9.525  -2.272   9.215  43.201

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791     6.7584  -2.601   0.0123 *
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511,     Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.490e-12
> g <- lm(I(dist + rnorm(50)) ~ speed, cars)
> summary(g)
Call:
lm(formula = I(dist + rnorm(50)) ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max
-28.599  -9.819  -2.601   9.365  43.668

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -16.9210     6.7803  -2.496   0.0161 *
speed         3.8877     0.4169   9.326 2.36e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.43 on 48 degrees of freedom
Multiple R-squared: 0.6444,     Adjusted R-squared: 0.637
F-statistic: 86.98 on 1 and 48 DF,  p-value: 2.363e-12

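The parameter estimates barely change: the slope moves from 3.93 to 3.89, and the residual standard error rises slightly from 15.38 to 15.43. Error in the response adds to the residual noise but does not bias the slope.

Let there be a random measurement error in the independent variable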
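
A single noisy draw can land slightly above or below the original slope, so one run is not conclusive. A rough sketch that repeats the experiment many times is given here, assuming the error in dist is independent N(0, 1) noise as in the rnorm(50) call above. The seed, the replication count and the helper name slope_once are arbitrary choices for illustration.

```r
# Assumption: measurement error in the response is independent N(0, 1) noise
data(cars)
set.seed(1)      # arbitrary seed, only for reproducibility
n_rep <- 2000    # arbitrary number of repetitions

# Refit the regression once with fresh noise added to dist
slope_once <- function() {
  noisy_dist <- cars$dist + rnorm(nrow(cars))
  unname(coef(lm(noisy_dist ~ cars$speed))[2])
}

slopes <- replicate(n_rep, slope_once())
baseline <- unname(coef(lm(dist ~ speed, data = cars))["speed"])

# The average noisy slope should sit close to the baseline slope
print(c(baseline = baseline, mean_noisy = mean(slopes), sd_noisy = sd(slopes)))
```

The individual slopes scatter around the baseline, but their average should stay close to it, which is what an unbiased estimate looks like.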

> g <- lm(dist ~ speed, cars)
> ge1 <- lm(dist ~ I(speed + rnorm(50)), cars)
> coef(ge1)
         (Intercept) I(speed + rnorm(50))
           -13.61135              3.70888
> ge1 <- lm(dist ~ I(speed + rnorm(50)), cars)
> coef(ge1)
         (Intercept) I(speed + rnorm(50))
          -14.768600             3.755804
> ge2 <- lm(dist ~ I(speed + 2 * rnorm(50)), cars)
> coef(ge2)
             (Intercept) I(speed + 2 * rnorm(50))
              -11.690288                 3.617516
> ge2 <- lm(dist ~ I(speed + 4 * rnorm(50)), cars)
> coef(ge2)
             (Intercept) I(speed + 4 * rnorm(50))
               0.5258011                2.6991028
> ge2 <- lm(dist ~ I(speed + 6 * rnorm(50)), cars)
> coef(ge2)
             (Intercept) I(speed + 6 * rnorm(50))
               11.541484                 1.966775

As you can see, the slope becomes flatter and flatter as the measurement error in the independent variable increases.
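
Because each call to rnorm(50) gives a different draw, the coefficients above will not reproduce exactly. A minimal sketch that averages the slope over repeated draws for each error level is shown here, assuming independent normal errors added to speed. The seed, the replication count and the names error_sds and mean_slope are illustrative.

```r
# Assumption: measurement error in speed is independent normal noise
data(cars)
set.seed(42)                    # arbitrary seed, only for reproducibility
n_rep <- 500                    # arbitrary number of repetitions per level
error_sds <- c(0, 1, 2, 4, 6)   # standard deviation of the added error

# For each error level, average the fitted slope over many noisy refits
mean_slope <- sapply(error_sds, function(s) {
  mean(replicate(n_rep, {
    noisy_speed <- cars$speed + s * rnorm(nrow(cars))
    coef(lm(cars$dist ~ noisy_speed))[2]
  }))
})

print(data.frame(error_sd = error_sds, mean_slope = round(mean_slope, 3)))
```

With an error standard deviation of 0 the original slope is recovered, and the averaged slope should shrink steadily toward zero as the error grows.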

One important thing to note is that the new beta and the old beta are related, and the size of the bias depends on how large the variance of the measurement error is compared with the variance of the independent variable itself. If the standard deviation of the measurement error is small relative to the standard deviation of the independent variable, the bias induced by the measurement error can safely be ignored.
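
The relationship can be written down under the classical errors-in-variables assumptions, namely that the measurement error is independent of both the true predictor and the regression error. The symbols here are assumptions introduced for illustration: \beta is the slope without measurement error, \beta^{*} is the slope obtained with it, \sigma_x^2 is the variance of the true predictor and \sigma_\delta^2 is the variance of the measurement error.

```latex
\beta^{*} \approx \beta \cdot \frac{\sigma_x^2}{\sigma_x^2 + \sigma_\delta^2}
```

For the cars data the variance of speed is roughly 27.96. An error standard deviation of 1 therefore gives a factor of about 27.96 / 28.96 = 0.97, while an error standard deviation of 6 gives about 27.96 / 63.96 = 0.44. This is broadly in line with the slopes of about 3.7 and 2.0 seen in the single draws above.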

Controlled variables are hypothetical in finance. In the hard sciences, if you are experimenting in a lab, you can choose to control a variable. In finance, one can't even think of such a thing…

Another takeaway is that in a controlled-variable environment, you can take whichever variable has the smaller estimation or measurement error as the independent variable and run the experiment.