Correlation Phantom

Day in, day out, one comes across reports, newspaper articles, and business channel anchors using the word correlation: correlation between NIFTY movement and some sector index, pollution rates vs. the number of vehicles in the city, y vs. x, etc. If one stops to think about the word correlation and asks a few questions, the opinions, arguments, and citations all tend to fall apart.

In the statistical sense of the word, let me take r to be the Pearson correlation and bring out the first, often neglected, aspect of correlation. Like everything in the world of numbers, it is an estimate. This means that it has got to have a sampling distribution.

If you have, let's say, 1000 values of X and 1000 values of Y and you want to find the correlation, one of the first things one usually does is to report Summation(x*y) / (n*Sx*Sy), where x and y are deviations of X and Y from their respective sample means, and Sx and Sy are the standard deviations of X and Y computed with divisor n (if they are computed with divisor n-1, the n in the formula becomes n-1). Now this number is not sacrosanct. It is, after all, an estimate. If you sample, let's say, 1000 (X, Y) pairs with replacement, keeping each X together with its own Y, and calculate the correlation, you end up with another value for the correlation. If you repeat this exercise umpteen times you get a sampling distribution for r, meaning r can take any number of values, and the least you can do is look at the frequency distribution of such numbers... Can you form confidence bands for this?
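The resampling exercise described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data are made up (Y is built as 0.5 * X plus noise), the helper names `pearson_r`, `bootstrap_r` and `percentile_band` are hypothetical, and the band is a crude percentile band read off the sorted resampled values rather than anything more refined.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson r as sum(x*y) / (n * Sx * Sy), with x, y deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # Standard deviations use divisor n so that they match the n in the formula.
    sx = math.sqrt(sum(d * d for d in dx) / n)
    sy = math.sqrt(sum(d * d for d in dy) / n)
    return sum(a * b for a, b in zip(dx, dy)) / (n * sx * sy)


def bootstrap_r(xs, ys, n_boot=500, seed=0):
    """Resample (x, y) pairs with replacement and recompute r each time."""
    rng = random.Random(seed)
    n = len(xs)
    rs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # same index for x and y keeps pairs intact
        rs.append(pearson_r([xs[i] for i in idx], [ys[i] for i in idx]))
    return sorted(rs)


def percentile_band(sorted_rs, level=0.95):
    """Crude percentile band taken from the sorted bootstrap values."""
    alpha = (1 - level) / 2
    lo = sorted_rs[int(alpha * len(sorted_rs))]
    hi = sorted_rs[int((1 - alpha) * len(sorted_rs)) - 1]
    return lo, hi


# Made-up data (an assumption for illustration): Y depends partly on X, plus noise.
data_rng = random.Random(42)
X = [data_rng.gauss(0, 1) for _ in range(1000)]
Y = [0.5 * x + data_rng.gauss(0, 1) for x in X]

r_hat = pearson_r(X, Y)
rs = bootstrap_r(X, Y)
band = percentile_band(rs)
print("r =", round(r_hat, 3), "band =", tuple(round(b, 3) for b in band))
```

The point is not the particular numbers but that the single reported r sits inside a whole spread of plausible values.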

Well, one thing to note is that it is not a normal distribution. Fisher realized this and applied a transformation to the r values that makes them approximately normal, calculated the bands in the transformed world, and brought them back to the r world. Thus there is a long story behind this simple task of forming confidence bands for r. Well, if r is »0.4, then you can assume it to be approximately normal and report the standard error as (1-r^2)/sqrt(n).
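Here is a minimal sketch of the two routes to a band. It assumes the usual textbook form of Fisher's transformation, z = atanh(r), with approximate standard error 1/sqrt(n-3), and a 95% level (critical value 1.96). None of these specifics are spelled out above, and the function names `fisher_band` and `normal_band` are hypothetical.

```python
import math


def fisher_band(r, n, z_crit=1.96):
    """Approximate band for r via Fisher's z-transformation."""
    z = math.atanh(r)              # z = 0.5 * ln((1 + r) / (1 - r))
    se = 1.0 / math.sqrt(n - 3)    # approximate standard error in the z world
    # Build the band in the z world, then map it back to the r world.
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


def normal_band(r, n, z_crit=1.96):
    """Band using the standard error (1 - r^2) / sqrt(n) directly on r."""
    se = (1 - r * r) / math.sqrt(n)
    return r - z_crit * se, r + z_crit * se


# Illustrative inputs only: a moderate r and a high r on small samples.
for r, n in [(0.3, 50), (0.9, 20)]:
    print(r, n, "fisher:", fisher_band(r, n), "normal:", normal_band(r, n))
```

The Fisher band always stays inside (-1, 1) and is lopsided when r is close to 1, while the direct band is symmetric around r by construction.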

OK. Assume you have done all this, and then you take two variables totally different in nature, where you don't see any reason for them to be correlated. Let's say the number of runs scored by Tendulkar in each match since his debut, and the number of car accidents in Mumbai on the day of each of those matches. If the two variables are strongly (positively or negatively) correlated, common sense would say that it is totally meaningless. Spurious correlation is everywhere.
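A small simulation can make this concrete. The sketch below is an illustration under an assumption of my own choosing: each series is a random walk (a running sum of independent shocks), and every pair of series is generated independently, so any correlation found is meaningless by construction. No cricket or accident data is involved, and the helper names are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def random_walk(rng, n):
    """Running sum of independent standard normal shocks."""
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0, 1)
        path.append(level)
    return path


def white_noise(rng, n):
    """Independent standard normal values with no memory."""
    return [rng.gauss(0, 1) for _ in range(n)]


def share_of_large_r(make_series, n_trials=300, n=200, cutoff=0.5, seed=1):
    """Fraction of independent pairs whose |r| exceeds the cutoff."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        a = make_series(rng, n)
        b = make_series(rng, n)  # generated separately, so unrelated to a
        if abs(pearson_r(a, b)) > cutoff:
            hits += 1
    return hits / n_trials


walk_share = share_of_large_r(random_walk)
noise_share = share_of_large_r(white_noise)
print("random walks:", walk_share, "white noise:", noise_share)
```

Pairs of unrelated random walks show a large |r| far more often than pairs of unrelated white-noise series, even though neither kind of pair has any real connection.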

So, is there something wrong with the metric? Or is there something missing in the way we think correlation works? Or is there an assumption behind correlation that gets violated big time when you just use it on any two variables? This was a question answered by a British statistician named Yule almost 80 years ago... Sadly, the results have not percolated into everyday life...

Thus correlation has become one of the most notoriously used words for lending phantom quantitative props to whatever opinion one has to give!

Next time you hear the word correlation being casually tossed around by someone, do try to ask a simple question: "Can you give some confidence bands for the correlation that you are talking about?" In all probability they will go back, sincerely try to work out a possible answer, and think it through... OR... they will never ever use the word again, at least with you :)