I was familiar with this technique and had used it in a couple of places. This week I had a chance to apply it to a problem at hand. However, this time I did not want to rush in. I had in my archives an old book called “A User’s Guide to Principal Components”, so I took some time out to learn a bit more about the technique. As always, there is some new learning to be had from reading old stuff.


A bit of history, to begin with. The technique was first introduced by Karl Pearson in 1901 (refer to his other contributions here). The general procedure for carrying it out had to wait until 1933, when Harold Hotelling introduced it in his paper. Thus the 1930s and 1940s saw a great deal of activity relating to PCA. Then things subsided and development took place as a Poisson process :). With the advent of computers, things changed. People were now able to find factors and invert matrices with ease, and the application of PCA became widespread in various fields. An unfortunate development was that many social scientists extended PCA to factor analysis and interpreted the factors according to their own whims and fancies.

The author of this book, J. Edward Jackson, has put in his thoughts from over 20 years of working experience at Eastman Kodak. Books from practitioners are gems, for they show how to marry theory and application and, more so, make you feel that there is no single answer to any problem. Smartness lies in applying a specific technique to a situation with an adequate mix of intuition and quantification. Another book, Super Crunchers, makes the same point.

Principal Component Analysis, abbreviated as PCA, is a data exploration technique as well as an inferential technique. The methodology is the same for populations as for samples (a greatly comforting aspect). Also, most aspects of PCA are distribution-free.

Let’s say you are a police officer in charge of keeping tabs on militant threats in Mumbai. Let’s put some numbers to the situation. Imagine you track 10 variables at different points in the city. You figure out that these 10 variables should lie within specific bands for you to be fairly confident that there is no militant threat in Mumbai. You have put police infrastructure, informer infrastructure, etc. in place across the city to get a 10-dimensional vector from each potential threat point, let’s say 5000 points, ending up with a 5000 x 10 matrix of observations. All the variables can be quantified on a scale of 1-10. Say the first variable x1 has a mean of 5 and bands at 4.5 and 5.5; any observation beyond this band might make you suspicious. You keep track of all 10 variables on a big chart, adding daily observations to your dashboard.
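A minimal sketch of this monitoring setup in Python; the data, the noise scale, and the band width are all illustrative assumptions of mine, not anything from the book:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup from the example: 5000 monitored points, 10 variables,
# each scored on a 1-10 scale with mean 5.
n_points, n_vars = 5000, 10
X = rng.normal(loc=5.0, scale=0.25, size=(n_points, n_vars))

# Per-variable bands, e.g. 4.5 and 5.5 for x1 (about 2 sigma here,
# so roughly a 95% band per variable).
lower, upper = 4.5, 5.5

# Flag any point where at least one variable falls outside its band.
out_of_band = (X < lower) | (X > upper)
suspicious = out_of_band.any(axis=1)
print(f"{suspicious.sum()} of {n_points} points breach at least one band")
```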

If all 10 variables are within their bands (let’s say these are 95% confidence interval bands), would you be 95% confident that there is no militant threat?

At the outset, it might seem so. You are tracking 95% intervals for each of the variables. However, on thinking about it a little, you realize that the Type I error shoots up. The reason: with 10 independent variables, the probability that at least one falls outside its band purely by chance is 1 - 0.95^10 = 0.4. This means that if you monitor all 10 variables, there is a 40% chance that at least one of them is outside the bands even when nothing is wrong!!! Now, this assumes the variables are uncorrelated. But think about it: variables are often correlated, especially in a militant-threat kind of situation, so the simple per-variable bands no longer control your joint Type I error at all. What do you do? Well, there are a couple of things you can do. One is to see to it that the bands are designed so that the Type I and Type II errors are built into the band mechanism. OR do something relatively easier: do not track the variables x1, x2, x3, ..., x10 for the 5000 sites, but a different set of linear combinations of these variables, call them X1, X2, X3, X4, which have one great property: the transformed vectors are uncorrelated.
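The arithmetic is easy to check, as below; the simulation assumes the 10 variables are independent:

```python
import numpy as np

# With 10 independent variables, each watched at a 95% band, the chance
# that *at least one* breaches its band is 1 - 0.95**10, not 5%.
p_false_alarm = 1 - 0.95 ** 10
print(f"P(at least one band breached) = {p_false_alarm:.3f}")  # ~0.401

# A quick simulation confirms it.
rng = np.random.default_rng(0)
breaches = rng.random((100_000, 10)) > 0.95   # True = variable outside band
print(breaches.any(axis=1).mean())            # ~0.40
```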

Basically, this is my pathetic attempt :) at explaining PCA without any matrix notation. PCA helps you view the set of 10 variables from a different perspective, one that gives a lot more clarity on the relational structure in the data. By looking at the bands and the Hotelling T-square, you can act on the variable realizations. OK, now to say the same thing in matrix notation. Whatever your input matrix, let S be its covariance matrix. One can find an orthonormal matrix U such that U'SU = L, a diagonal matrix. The diagonal elements of L are the eigenvalues, which can also be obtained by solving |S - m*I| = 0; the characteristic vectors are obtained from (S - m*I)Xp = 0, where Xp is the characteristic vector. Thus one can transform the original data to z, where z = U'(x - xbar). (Without LaTeX support, writing math is particularly difficult on Typepad.) One use of this transformation is that the L matrix is very handy, because the trace of L is the sum of the original variances and the determinant of S is just the product of the eigenvalues.
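Here is a hedged sketch of the whole pipeline in numpy: the eigendecomposition of S, the z = U'(x - xbar) transformation, the trace/determinant identities, and a per-observation Hotelling T-square. The correlated three-variable toy data is my own stand-in for the 5000 x 10 threat matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
# Illustrative correlated data (3 variables to keep the covariance short).
X = rng.multivariate_normal(mean=np.full(3, 5.0),
                            cov=[[1.0, 0.6, 0.3],
                                 [0.6, 1.0, 0.4],
                                 [0.3, 0.4, 1.0]],
                            size=5000)

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)            # sample covariance matrix

# Eigendecomposition: U'SU = L, with L diagonal and U orthonormal.
eigvals, U = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]      # largest eigenvalue first
eigvals, U = eigvals[order], U[:, order]

# Transform to uncorrelated scores: z = U'(x - xbar), done row-wise here.
Z = (X - xbar) @ U

# The two identities quoted above: trace(L) = sum of original variances,
# det(S) = product of the eigenvalues.
print(np.isclose(eigvals.sum(), np.trace(S)))        # True
print(np.isclose(eigvals.prod(), np.linalg.det(S)))  # True

# Hotelling T-square per observation: sum of squared scores, each scaled
# by its eigenvalue. Large values flag multivariate outliers.
T2 = (Z ** 2 / eigvals).sum(axis=1)
print(T2[:5])
```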

Another important thing I learnt from this book concerns the stopping criterion. I had been using the proportion of variance explained as the stopping rule for the number of components selected. Why? I don’t know... I just accepted somebody’s word. I was completely dumb in not even thinking about it. This book revealed to me that there is nothing special about that rule. Bartlett’s test provides a more scientific way to set a stopping condition. This was new to me; somehow it is never mentioned in the usual 10,000 ft view of PCA. There are at least a dozen stopping criteria, and I guess crunching at least half a dozen of them would bring a healthy dose of skepticism to your views.
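For the curious, here is a sketch of one common form of Bartlett’s test used as a stopping rule: for each k, test whether the last p - k eigenvalues are equal (i.e., whether the remaining components are just noise). Texts differ on the exact small-sample correction factor in the statistic, so treat this as illustrative, not as the book’s exact prescription:

```python
import numpy as np
from scipy.stats import chi2

def bartlett_stop(eigvals, n, alpha=0.05):
    """eigvals: eigenvalues sorted in descending order; n: sample size.
    Returns the number of components to retain. Uses a simple
    likelihood-ratio form of Bartlett's test (uncorrected)."""
    p = len(eigvals)
    for k in range(p - 1):
        rest = np.asarray(eigvals[k:])     # eigenvalues not yet retained
        q = len(rest)
        lbar = rest.mean()
        # Statistic compares arithmetic vs. geometric mean of the rest.
        stat = (n - 1) * (q * np.log(lbar) - np.log(rest).sum())
        df = (q + 2) * (q - 1) / 2
        if chi2.sf(stat, df) > alpha:      # remaining roots look equal: stop
            return k
    return p

# Usage with the eigenvalues from the decomposition above:
# n_components = bartlett_stop(eigvals, n=5000)
```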

What I have summarized above is just 5% of the book. The title is definitely an understatement; it should have been “An All-Encompassing Guide”, because whatever field you are in and whatever problem you want to use PCA for, there is a guideline in the book. This is one of those good books that focus less on abstract theorems and more on practical applications.