Occam's Window
Most of us would have come across Occam's razor in the context of variable selection, the essence of which is "parsimony wins". However, not many would have heard about "Occam's window", which is relevant in the context of model selection, i.e. choosing a set of models out of an ocean of potential models. In the statistics literature, Occam's window appears under Bayesian model selection. In this post, I will try to summarize some of the main points from this fantastic paper by Adrian Raftery. In many disciplines, more so in the social sciences, an associative analysis between a dependent variable and a set of predictors can be done in multiple ways. Think of a simple regression between a dependent variable and a large set of independent variables. If there are n predictors, there are 2^n possible linear models. The way one might go about taming this model explosion is via forward stepwise selection, backward stepwise selection, or a mixture of the two. Inevitably, this exercise of choosing one final model gives rise to many problems.

Introduction:

The paper starts with a brief introduction to the problems that crop up while analyzing data in a social science setting:

  • Large samples result in a greater likelihood of rejecting the null. (Why? With more data the sampling variance under the alternative shrinks and the power of the test increases, hence there is a larger likelihood of detecting a pattern.)

  • Most of the elimination or addition of variables is based on p-values. p-values tend to indicate rejection of the null hypothesis even when the null model seems reasonable theoretically and inspection of the data fails to reveal any striking discrepancies with it.

  • Several different models may all seem reasonable given the data but nevertheless lead to different conclusions about questions of interest.

The author pitches for BIC, a criterion for model selection, as the way out of this model selection conundrum.

What are the practical difficulties with p-values?

This section elaborates on the problems mentioned in the previous section.

  • Why should one stick to the conventional alpha of 0.05 or 0.01, which is based on Fisher's experience with relatively small agricultural experiments?

  • Generally, only vague advice is given on balancing power and significance in hypothesis testing, such as that the Type I error rate should be set lower for large sample sizes.

  • The entire standard hypothesis testing framework rests on the basic assumption that only two models are ever entertained. This is far from being the case in many sociological studies.

  • Whenever there is a conflict with intuition, frequentists say that statistical significance need not mean substantive significance. The author feels that this is a bogus reason and says that in most cases the conflict arises due to miscalibration of statistical significance using p-values, rather than any real conflict between statistical and substantive significance.

  • To show the problems with procedures such as forward/backward regression, the author simulates a data set containing 51 variables (50 predictors, 1 response variable) that are all iid. Running regressions on this dataset and pruning out the variables that don't have significant p-values results in a seemingly strong model containing a few explanatory variables. A totally nonsensical result is produced: standard procedures run the risk of fitting a model to the noise (a small simulation sketch after this list illustrates the point).

  • When there are multiple models that seem to show up as statistically significant, a frequentist analyst takes one of two routes: pick one model and adopt the conclusions that flow from it rather than from the other defensible models, or present the analyses based on all the plausible models without choosing between them. Neither route is satisfactory in most situations.

  • A case study is presented to show that conflicting results can arise by choosing various explanatory models from the same dataset.

  • Often in sociology, the comparison is between two non-nested models and the standard frequentist methods break down.

  • The standard significance tests allow one either to reject the null hypothesis or to fail to reject it, but they do not provide any measure of evidence for the null hypothesis. A standard test allows us to say only that the data have failed to reject the null, but gives no indication of whether the data support it.
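
To see how easily the "fitting a model to the noise" problem arises, here is a minimal sketch of the idea in Python. This is my own toy version, not the paper's simulation; the sample size, the 0.05 cutoff and the seed are arbitrary choices.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_obs, n_pred = 100, 50
X = rng.standard_normal((n_obs, n_pred))   # 50 predictors, all pure noise
y = rng.standard_normal(n_obs)             # response, independent of every predictor

# Backward elimination driven purely by p-values: keep dropping the least
# significant predictor until everything left has p < 0.05.
cols = list(range(n_pred))
while cols:
    fit = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
    pvals = np.asarray(fit.pvalues)[1:]    # skip the intercept
    if pvals.max() < 0.05:
        break
    cols.pop(int(pvals.argmax()))          # drop the least significant one

print("predictors surviving backward elimination:", cols)
if cols:
    print("their p-values:", np.round(np.asarray(fit.pvalues)[1:], 4))
```

Even though every variable is independent noise, this kind of pruning usually leaves a handful of predictors with impressively small p-values, which is exactly the nonsense result the author warns about.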

Bayesian Hypothesis Testing:

This section introduces the Bayesian estimation procedure and subsequently the Bayes factor, which seems intuitively easy to understand and apply. If there are two models, you compare the posterior probabilities of the two models given the data and choose whichever is higher, as spelled out below. The critical thing is that you are not limited to just two models: given a baseline model, you can compare a ton of models that need not be nested.
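
In symbols (standard notation, not a quotation from the paper), for data D and two candidate models M1 and M0:

```latex
\frac{\Pr(M_1 \mid D)}{\Pr(M_0 \mid D)}
  = \underbrace{\frac{\Pr(D \mid M_1)}{\Pr(D \mid M_0)}}_{\text{Bayes factor } B_{10}}
    \times \frac{\Pr(M_1)}{\Pr(M_0)},
\qquad
\Pr(D \mid M_k) = \int \Pr(D \mid \theta_k, M_k)\,\Pr(\theta_k \mid M_k)\, d\theta_k .
```

Since each marginal likelihood integrates out that model's own parameters, the two models being compared do not have to be nested.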

The BIC approximation:

This section derives the BIC criterion and gives the asymptotic approximation behind the statistic. The procedure is fairly straightforward: you write down a Taylor series expansion of the marginal probability of the data given a model, invoke Laplace's method for integrals, invoke Fisher's likelihood theory, and end up with an expression for BIC that is a function of the likelihood ratio statistic, the degrees of freedom and the sample size. The author goes on to give BIC forms for simple regression, logistic regression, log-linear modeling, event-history analysis and structural equation models. Often one has to choose the baseline model as either a model with no parameters or a fully saturated model; the section shows that in either case there is no change in the final BIC calculations. A connection between the BIC statistic and the t statistic is used, and various tables are provided to grade the evidence corresponding to the value of the t stat.
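
For reference, the commonly quoted form of the approximation (my notation; the paper expresses it relative to a chosen baseline model):

```latex
\mathrm{BIC}_k = -2\log \hat{L}_k + p_k \log n ,
\qquad
2\log B_{10} \;\approx\; \chi^2_{10} - df_{10}\,\log n ,
```

where \hat{L}_k is the maximized likelihood of model k, p_k its number of parameters, n the sample size, \chi^2_{10} the likelihood ratio statistic comparing M1 with M0, and df_{10} the difference in their numbers of parameters.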

Model Uncertainty and Occam's Window:

This section is the highlight of the entire paper. Let me attempt to write down its essence without resorting to math. Let's say you have a response variable and a dozen predictor variables, and you are interested in the associative relationship of the first predictor, say X1. Since there are 12 predictors, one can in principle build 2^12 models. Here's where Occam's window comes in. You take the best model based on BIC and exclude all the models that are, say, 20 times less likely than the best. As a second step of pruning, you remove all the models that contain effects for which there is no evidence, i.e. models that receive less support from the data than a simpler model nested within them. You end up with a set of models within which you can now analyze the effect of X1. A very simple way to get started is to find the posterior-probability-weighted proportion of the remaining models in which X1 is present. If this probability is very low, there is no point in even thinking about the effect size. If the probability is high, one can find the posterior distribution of the coefficient of X1 and compute its mean and variance. This procedure is far more elegant as it takes model uncertainty into consideration; a schematic sketch of the recipe follows below. The paper ends with a section that revisits the case studies mentioned earlier via Bayes factors, and all the inconsistencies mentioned earlier with the p-value approach are resolved.
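
To make the recipe concrete, here is a schematic Python sketch. It assumes you have already fitted every candidate model and recorded its BIC, whether it includes X1, and the fitted coefficient of X1 (taken as zero when X1 is absent). The function name, the equal prior over models and the factor-of-20 cutoff are my choices, and the second, nested-model pruning step is omitted for brevity.

```python
import numpy as np

def occams_window_summary(bics, includes_x1, coef_x1, cutoff=20.0):
    bics = np.asarray(bics, dtype=float)
    # Approximate posterior model probabilities: p(M_k | D) ~ exp(-BIC_k / 2),
    # assuming equal prior probabilities over the candidate models.
    logpost = -bics / 2.0
    logpost -= logpost.max()                  # stabilise before exponentiating
    post = np.exp(logpost)
    # Occam's window: drop models more than `cutoff` times less likely than the best.
    keep = post >= post.max() / cutoff
    post = post[keep] / post[keep].sum()
    inc = np.asarray(includes_x1)[keep]
    coef = np.asarray(coef_x1, dtype=float)[keep]
    # Posterior probability that X1 belongs in the model, and the
    # model-averaged estimate of its coefficient.
    p_x1 = post[inc].sum()
    mean_x1 = (post * coef).sum()
    return p_x1, mean_x1

# Hypothetical usage: three candidate models with made-up BICs and coefficients.
p, b = occams_window_summary(bics=[210.3, 212.1, 245.0],
                             includes_x1=[True, False, True],
                             coef_x1=[0.42, 0.0, 0.39])
print(f"P(X1 in model | data) = {p:.2f}, model-averaged coefficient = {b:.2f}")
```

The point of the sketch is the weighting: instead of conditioning on one winning model, every surviving model contributes to the conclusion about X1 in proportion to its approximate posterior probability.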

My summary is about 1200 words. The author summarizes the findings of the paper towards the end in a way that is very precise and to the point. I am merely replicating it here for my future reference.

  • Bayes factors provide a better assessment of the evidence for a hypothesis than p-values, particularly with large samples

  • Bayes factors allow the direct comparison of non-nested models in a simple way

  • Bayes factors can quantify the evidence for a null hypothesis of interest. They can distinguish between the situation where a null hypothesis is not rejected because there is not enough data, and that where the data provide evidence for the null hypothesis

  • BIC provides a simple and accurate approximation of Bayes factors

  • When there are many candidate independent variables, standard model selection procedures are misleading and tend to find strong evidence for effects that do not exist. By conditioning on a single model, they also ignore model uncertainty and so understate uncertainty about quantities of interest

  • Bayesian model averaging enables one to take model uncertainty into account and to avoid the difficulties of standard model selection

  • The Occam’s window algorithm is a manageable way to implement Bayesian model averaging, even with many models, and allows effective communication of model uncertainty

  • BIC can be used to guide an iterative model selection process