Purpose
If you think about the model matrix, there is no guarantee that the independent variables are truly independent of one another. Many of them may contribute little to explaining the variance of the dependent variable.

Let's take a simple dataset with fat as the dependent variable and 100 independent variables.

> library(faraway)
> data(meatspec)
> model <- meatspec[1:172, ]
> test <- meatspec[173:215, ]
> fit <- lm(fat ~ ., model)
> fit.sum <- summary(fit)
> fit.sum$r.squared
[1] 0.9970196

The model looks too good to be true.

> rmse <- function(x, y) sqrt(mean((x - y)^2))
> rmse(fit$fit, model$fat)
[1] 0.6903167
> rmse(predict(fit, test), test$fat)
[1] 3.814000

As you can see, the RMSE looks good on the data used for development but is far worse on the out-of-sample test.
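Part of the trouble is that the 100 predictors are far from independent of each other. A quick sketch (not run here; the object name R is just illustrative) to inspect how correlated they are:

R <- cor(model[, -101])        # correlation matrix of the 100 predictors
summary(R[upper.tri(R)])       # distribution of the pairwise correlations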

A look at the coefficient standard errors gives you an indication of how sensitive and how useful the various predictors are.

> plot(fit.sum$coef[, 2], pch = 19, col = "blue")

PCARegression-003.jpg
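Those large standard errors are a symptom of a nearly singular model matrix. A minimal sketch for quantifying the collinearity via the condition number (large values mean the columns are close to linearly dependent; X and sv are illustrative names):

X <- as.matrix(model[, -101])
sv <- svd(scale(X, center = TRUE, scale = FALSE))$d  # singular values of the centered predictors
max(sv) / min(sv)                                    # condition number: very large for collinear data
kappa(X)                                             # base R's quick estimate of the same quantity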

Let's do a stepwise regression and see what we get.

> fit2 <- step(fit, trace = 0)
> rmse(fit2$fit, model$fat)
[1] 0.7095069
> rmse(predict(fit2, test), test$fat)
[1] 3.590245

There is hardly any reduction in the out-of-sample RMSE for the stepwise-regression model.

So, what's the alternative? Do a PCA of the model matrix.

> modelpca <- prcomp(model[, -101])
> round(modelpca$sdev, 3)
  [1] 5.055 0.511 0.282 0.168 0.038 0.025 0.014 0.011 0.005 0.003 0.002 0.002
 [13] 0.001 0.001 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
 [25] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
 [37] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
 [49] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
 [61] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
 [73] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
 [85] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
 [97] 0.000 0.000 0.000 0.000
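The standard deviations fall off very quickly. A short sketch of the cumulative proportion of variance explained makes the same point numerically (summary(modelpca) reports the same quantities; pvar is an illustrative name):

pvar <- modelpca$sdev^2 / sum(modelpca$sdev^2)  # proportion of variance per component
round(cumsum(pvar)[1:6], 4)                     # the first few components capture almost everything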

Let's take the first 4 principal components and use them in the regression.

> fit3 <- lm(fat ~ modelpca$x[, 1:4], model)
> rmse(fit3$fit, model$fat)
[1] 4.064745

Actually not bad for merely 4 PCA components: the in-sample RMSE has risen, but it has not shot up unreasonably. Let's compare the regression output from plain OLS and the PCA-based method.

> par(mfrow = c(1, 2))
> plot(fit.sum$coef[, 2], pch = 19, col = "blue", ylab = "Coefficient",
+     main = "OLS")
> svb <- modelpca$rot[, 1:4] %*% fit3$coef[-1]
> plot(svb, ylab = "Coefficient", pch = 19, col = "blue", main = "PCA based OLS")

PCARegression-007.jpg

It is clear from the above figure that the range of the coefficients is drastically reduced, which is why the PCA-based approach is described as a shrinkage method.
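To put a number on the shrinkage, compare the spread of the two coefficient vectors directly (a quick sketch using the objects already created above):

range(coef(fit)[-1])   # the 100 slopes from the plain OLS fit
range(svb)             # the slopes implied by the 4-component PCA fit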

> par(mfrow = c(1, 1))
> plot(modelpca$sdev[1:10], type = "l", ylab = "SD of PC", xlab = "PC number")

PCARegression-008.jpg

The scree plot shows how quickly the contribution of successive principal components drops off.

> mm <- apply(model[, -101], 2, mean)
> tx <- as.matrix(sweep(test[, -101], 2, mm))
> nx <- tx %*% modelpca$rot[, 1:4]
> pv <- cbind(1, nx) %*% fit3$coef
> rmse(pv, test$fat)
[1] 4.533982

Thus the 4 PCA components give an out-of-sample RMSE of about 4.53. Obviously, the next thing to explore is increasing the number of components.

> rmsmeat <- numeric(50)
> for (i in 1:50) {
+     nx <- tx %*% modelpca$rot[, 1:i]
+     model3 <- lm(fat ~ modelpca$x[, 1:i], model)
+     pv <- cbind(1, nx) %*% model3$coef
+     rmsmeat[i] <- rmse(pv, test$fat)
+ }
> which.min(rmsmeat)
[1] 27
> min(rmsmeat)
[1] 1.854858

The above number tells us that the test RMSE is minimized by choosing 27 principal components.
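Plotting the whole curve makes the choice easy to see (a sketch using the rmsmeat vector computed above):

plot(1:50, rmsmeat, type = "l",
     xlab = "Number of principal components", ylab = "Test RMSE")
abline(v = which.min(rmsmeat), lty = 2)   # minimum at 27 components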

One aspect that is critically important: when you run prcomp in R, the eigenvectors you get are computed on centered data. So using those eigenvectors on fresh data should be treated with caution: you must center the fresh data with the training means before applying the eigenvectors. The use of sweep in the above example is mainly to center the data.
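Alternatively, predict() on a prcomp object applies the stored centering and the rotation for you, so the manual sweep can be avoided. A sketch of the equivalent computation (nx.alt and pv.alt are illustrative names):

nx.alt <- predict(modelpca, newdata = test[, -101])[, 1:4]  # centered with training means, then rotated
pv.alt <- cbind(1, nx.alt) %*% fit3$coef
rmse(pv.alt, test$fat)                                      # should reproduce the value above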

In the above example, though, a brute-force search is done over the first 50 principal components, and at the end one chooses whichever number of components minimizes the root mean squared error on the test set. The pls package provides a cleaner route, with built-in cross-validation, via pcr (principal components regression) and plsr (partial least squares).

> library(pls)
> pcrmod <- pcr(fat ~ ., data = model, validation = "CV", ncomp = 50)
> plsg <- plsr(fat ~ ., data = model, ncomp = 50, validation = "CV")
> par(mfrow = c(1, 2))
> coefplot(plsg, ncomp = 4, xlab = "Frequency", main = "Coef Plot")
> validationplot(plsg, main = " Validation Plot")
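The validation plot gives a visual suggestion for the number of components; the cross-validated RMSEP can also be read off the fitted object directly. A sketch (cv.err is an illustrative name; the first entry of the returned array is the intercept-only model, hence the minus one):

cv.err <- drop(RMSEP(plsg, estimate = "CV")$val)  # CV error for 0, 1, ..., 50 components
which.min(cv.err) - 1                             # component count with the smallest CV error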

Let's use the PLS fit to check the root mean squared error using 21 components.

> ypred <- predict(plsg, ncomp = 21)
> rmse(ypred, model$fat)
[1] 1.46433
> ytpred <- predict(plsg, test, ncomp = 21)
> rmse(ytpred, test$fat)
[1] 3.267872

Thus the out-of-sample root mean squared error has been reduced to about 3.27 by choosing 21 components.

The basic idea is this: instead of regressing on the raw model matrix, rotate it and regress on the rotated variables, so that you can concentrate on just a few of them and get the maximum bang for the buck.
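As a closing sketch, that rotation is exactly what the earlier PCA fit used: the score matrix is just the centered model matrix multiplied by the rotation, so regressing on the first few score columns is ordinary least squares on a rotated and truncated version of the original predictors (X and Xc are illustrative names).

X <- as.matrix(model[, -101])
Xc <- scale(X, center = TRUE, scale = FALSE)                 # center the predictors
all.equal(unname(Xc %*% modelpca$rot), unname(modelpca$x))   # should be TRUE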