The paper, “Determining the best forecasting models”, is about testing 55 models that belong to the GARCH family. If we have just one model and a straw-man benchmark, it is easy to compute some statistic on the test sample showing that the hypothesized model is superior. But how do we go about testing a whole set of competing models? There are many wonderful techniques in the Bayesian world; this paper, however, is frequentist in nature. It uses a method called the “Model Confidence Set” (MCS) to decide on the best forecasting models. This is akin to forming a confidence interval for a parameter rather than reporting a point estimate. What are the advantages of the Model Confidence Set?

  • The method neither assumes knowledge of the correct specification nor requires that the “true” model be available as one of the competing models

  • The method does not discard a model unless it is found to be significantly inferior to the other models

  • The method is appealing when working with a set of forecasting models because, in practice, it often cannot be ruled out that two or more competing models are equally good

  • The method accounts for the uncertainty about which model is best across the whole set of forecasting models, rather than committing to a single “best” model

  • The method selects a set of models, unlike the commonly used model selection criteria, which pick a single model

What does the MCS procedure involve?

Let’s say you have four models with different specifications. The objective is to obtain the best three-model set, the best two-model set, and the best single model. Why does one need sets of varying cardinality? More often than not, we have a set of competing models in which no particular model outperforms every other model; we might instead be looking for a set of models that perform more or less equally well. MCS puts a framework around this in the following way (a rough code sketch of the procedure follows the list):

  1. Assume a loss function: mean squared error, MAD, or whatever you think is an appropriate function.

  2. Estimate the parameters of all the models on a training set

  3. On the test set, compute the loss series for each model

  4. Generate bootstrap samples from the test dataset using a block bootstrap. One can look at the models being tested and take the maximum lag length as the block length

  5. Start with all the models in the model set M0

  6. For every pair of models considered, compute the standardized distance between the two models. This standardized distance is nothing but the ratio of the difference in average losses to the standard error of that difference. The difference in average losses is straightforward to compute, but how does one get the standard error of the difference? Use the bootstrap samples to estimate it.

  7. For each model in the model set, compute the average distance between that model and the rest of the models in the set, and standardize it. Again, the bootstrap samples can be used to get the standard error of the distance

  8. Compute the test statistic for the model set M0. This test statistic can be the maximum standardized distance between any two models in the set, or the sum of squared standardized distances over all pairs of models in the set

  9. Test whether this statistic is significant. To do so, use the bootstrap samples to estimate the statistic’s distribution under the null

  10. If the null hypothesis that all the models in the current model set are equally good is not rejected at the chosen confidence level, then we are done. Else, drop the worst performing model from the current set and repeat the procedure with the new, smaller set of models. How does one choose the worst performer? The worst performing model is the one that is farthest, in standardized distance, from the rest of the models in the set
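
To make the loop concrete, here is a minimal Python sketch of the procedure described above. It uses the per-model deviation of step 7 and its maximum as the test statistic (one of the two choices mentioned in step 8), rather than the pairwise distances of step 6. This is only an illustration of the steps, not the paper’s code: the function names (`mcs`, `block_bootstrap_indices`) and the default block length and number of bootstrap draws are placeholder choices.

```python
import numpy as np

def block_bootstrap_indices(n_obs, block_len, n_boot, rng):
    """Moving-block bootstrap: resample blocks of consecutive test-set indices (step 4)."""
    n_blocks = int(np.ceil(n_obs / block_len))
    starts = rng.integers(0, n_obs - block_len + 1, size=(n_boot, n_blocks))
    idx = (starts[:, :, None] + np.arange(block_len)).reshape(n_boot, -1)
    return idx[:, :n_obs]

def mcs(losses, alpha=0.10, block_len=5, n_boot=1000, seed=0):
    """losses: (n_obs, n_models) array of out-of-sample losses (steps 1-3).
    Returns the surviving model indices and the equivalence-test p-value
    recorded at each iteration of the elimination loop."""
    rng = np.random.default_rng(seed)
    n_obs, n_models = losses.shape
    boot_idx = block_bootstrap_indices(n_obs, block_len, n_boot, rng)
    in_set = list(range(n_models))            # step 5: start with all models in M0
    step_pvals = []

    while len(in_set) > 1:
        L = losses[:, in_set]
        # step 7: deviation of each model's average loss from the set average
        # (proportional to the average distance to the rest of the set)
        d = L.mean(axis=0) - L.mean()
        # steps 4 and 7: bootstrap the same deviations to get their standard errors
        boot_mean = losses[boot_idx][:, :, in_set].mean(axis=1)   # (n_boot, m)
        boot_d = boot_mean - boot_mean.mean(axis=1, keepdims=True)
        se = np.sqrt(((boot_d - d) ** 2).mean(axis=0))
        t = d / se
        # step 8: test statistic = maximum standardized deviation within the set
        t_stat = t.max()
        # step 9: re-centred bootstrap copies of the statistic give its null distribution
        p_value = (((boot_d - d) / se).max(axis=1) >= t_stat).mean()
        step_pvals.append(p_value)
        if p_value >= alpha:                  # step 10: equivalence not rejected, stop
            break
        worst = in_set[int(np.argmax(t))]     # step 10: the model farthest from the rest
        in_set.remove(worst)

    return in_set, step_pvals
```

For the paper’s exercise, `losses` would hold the 55 columns of out-of-sample losses from the volatility models, and `block_len` would be set from the longest lag structure among the competing models, as suggested in step 4.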

Coming back to the paper, the author uses the above methodology to test 55 volatility models. The advantage of this method is that it attaches a probability (an MCS p-value) to each model in each model set. Thus one gets a quantitative measure of how the models stack up against one another across the various model sets.
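
As a brief aside on how such a per-model probability can be read off the procedure: in the standard MCS construction, the p-value attached to the model eliminated at step k is the largest equivalence-test p-value observed up to and including that step. Assuming the elimination loop in the sketch above is run all the way down to a single model, the MCS p-values are then just a running maximum over the recorded step p-values; the function name `mcs_pvalues` below is illustrative.

```python
import numpy as np

def mcs_pvalues(step_pvals):
    """MCS p-value of the k-th eliminated model: the running maximum of the
    equivalence-test p-values up to and including step k."""
    return np.maximum.accumulate(np.asarray(step_pvals))

# e.g. step p-values [0.01, 0.04, 0.02, 0.30] -> MCS p-values [0.01, 0.04, 0.04, 0.30];
# a model with MCS p-value >= alpha belongs to the (1 - alpha) model confidence set.
```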

What are the results of applying the MCS procedure to the 55 models?

  • The best performing models have a leverage effect, i.e., the asymmetric GARCH models all perform about equally well

  • The VGARCH(2,2) model turns out to be a superior model if the trimming of the model set is taken all the way down to cardinality 1

  • There are also some surprising models, such as NAGARCH(2,2), that show up as superior. This leads the author to do a simulation study of the MCS technique, and the conclusion from the study is this: MCS yields a set of models that contains all, or almost all, truly superior models, but rarely exactly captures the “true” set of superior models, unless the difference in expected performance of superior and inferior models is large

I think the MCS is an extremely useful method for comparing a set of models. Several questions come to my mind right away:

  • Does any model beat the GARCH(1,1) volatility estimate for a specific security or index?

  • What is the model set in which GARCH(1,1) is an element?

  • How does the HAR-RV model compare with the GARCH family of models?

  • Can we combine the forecasts of the models in a given model set?

  • How does this approach differ from, let’s say, applying a simple Bayes factor for model selection?

  • Is a forecast built from the best two- or three-model set from the MCS different from a Bayesian model averaging forecast?