Introduction to Probability Simulation and Gibbs Sampling with R : Summary

Gibbs Sampling is an important method in the context of Bayesian estimation. There are other books on R that focus on Bayesian Simulation (Jim Albert), Monte Carlo simulation (Robert and Casella) that are very good and little advanced. So, where does this book fit in? If one has never done any kind of simulation in R, this book is the ideal reference. For someone who is already aware of the basic simulation stuff, there are 3 chapters in the book that talk about Markov Chain, Bayesian Estimation and Gibbs Sampler that are worth going over. I will try to summarize the main points of the book.

The authors suggest going over Appendix to R newbie. In the appendix, the book talks about basic installation procedure, vectorization present in R, basic syntax for loops, basic graphs using base package and the sample function. In all likelihood a person reading this book might already be very familiar with the basic syntax and operations in R. So, appendix is more a formality in this book. The book contains 10 chapters. The first 7 chapters of the book can be used as a supplementary material for any frequentist based course and the last three chapters for Bayesian Course.

Chapter 1: Introductory Examples: Simulation, Estimation, and Graphics:

The entire chapter is about sample function that is used to generate bootstrap samples to answer some probability questions like birthday matching problem, Envelope mismatch problem etc. By going over this section, a reader will understand all the arguments of the function “sample” in R. The chapter then talks about coverage probabilities for confidence interval, which is nothing but the proportion of time that the interval contains the true value of interest. Using a simple binomial case, the coverage probabilities are shown for various population parameter values. A first timer would find this topic illuminating and will start developing a skeptical eye towards confidence intervals based on frequentist methods. For a simple binomial proportion case, once the true parameter moves away from 0.5, the coverage probabilities drop a lot. There is a mention about Agresti-Coull confidence intervals that addresses coverage probability problem. The best thing about this chapter is the last exercise where Beta distribution is used to model binomial proportion and Bayes is used to form credible intervals for the parameter. Frankly a person exposed to Bayesian Stats will just ignore all these fancy methods like Agresti-Coull method etc and go ahead with prior+ likelihood + posterior modeling of the situation.

Chapter 2: Generating Probabilities:

This chapter starts off by talking about the ways to generate random numbers. Firstly, the random numbers that are generated by computer are pseudo random numbers as there is nothing that is perfectly random. The word “random” is more appropriately applied to the process by which a number is produced than the number itself. The section briefly traces the history of pseudo random numbers starting from Linear Congruential generator, RANDU, Mersenne twister. A few conditions are mentioned for any generator to be useful in real life, i.e. 1) large modulus and full period, 2) histogram of scaled values should look uniform, 3) pair wise independence of the values. R has Mersenne twister algo behind runif() command and its power is really amazing. Its behavior has been tested up to 623 consecutive dimensions and has a period of 4.32 followed by 60001 zeroes. These distinct values from runif() are mapped in to roughly 4.3 billion numbers that can be expressed within the precision of R. That’s the power you have when you use generators from R. The section then moves to generating random variables from uniform random variates. The quantile transformation method is used to simulate values from distributions such as beta, binomial, chi-square, exponential, standard normal, normal, Poisson, Rayleigh etc. Knowing functional forms of the density for these distributions are useful in general, as they help one immediately recognize them when they appear in various contexts. For example if you see f(x) = constant * x ^(constant- 1) , you can immediately decipher that the distribution can be a beta in some form. I mean, these are little things but are very useful. Let’s say you see a function whose cumulative density function is sqrt(x) , you should immediately of relate this to beta(1/2,1) distribution. These little linkages will help one decide the components of model better. This chapter will enable a reader to simulate values from the popular distributions, either using uniforms / standard normal variates.

Chapter 3: Monte Carlo Integration and Limit Theorems

The chapter starts off with solving an integral problem with Riemann integration method, Monte Carlo Integration, Accept-Reject Method and Random sampling method. It then explores law of large numbers and explains convergence in probability using visuals. You can’t use Law of Large numbers for Cauchy; a fact that is showed using integrals and drilled in to every student in an elementary course on statistics. But a simple visual shows it instantly. The section then talks about limiting behavior of Markov process and shows that the convergence of Markov process to be very slow. Convergence in distribution is explained using Central Limit theorem, the crucial difference between LLN and CLT being that former doesn’t given any indication about the uncertainty of the estimate while the latter gives an idea about the error possibility. The section ends with showing connections between Riemann integral and Monte Carlo simulation. It then shows the importance of Monte Carlo Simulation in higher dimensional parameter problems where Riemann integral using grid based approach turns out to be computationally inefficient.

Chapter 4: Sampling from Applied Probability Models

This chapter goes in to sampling from known distributions to estimate probabilities of situations where closed form solutions are painful/ sometimes intractable. It starts off with looking at Poisson process, exponential holding times of the process and tries to apply sampling procedures to solve simple problems from Queuing theory. Based on the assumptions of inter arrival times of customer and distribution of servicing time of various servers, many kinds of estimates can be obtained by approximating values from sampling distribution. Another application explored is order statistics. Most of the distribution related questions about Order statistics can be solved using painful integration procedures and it is a classic area where simulation gives you quick answers. Another advantage of simulation is that you can quickly check your intuition. Most of the intro stat books tell you that sample mean and sample variance are independent variables for a normal distribution. For other distributions like exponential, it doesn’t hold well. To test this, one can easily simulate some random samples from whatever non-normal distribution you want to check, look at the joint distribution of sample variance and sample mean visually. Visuals do a good job of showing the independence of these variables, though merely relying on some correlation metric could be dicey(as correlation is merely a measure of linear dependence). The chapter ends with an introduction to Bootstrapping method. I vividly remember the days when I first got introduced to bootstrapping. I was so kicked about this concept as it relieved me of memorizing painful formulae (2 sample mean comparison tests equal / unequal sample sizes). It’s a shame that I actually memorized these things way back during my MBA days and never could understand the real practical applications. Thankfully Bootstrapping got my interest back in to stats. The section provides a few exercises that illustrate the difference between non-parametric bootstrapping method vis-à-vis parametric bootstrapping methods.

A cool thing one can learn from this chapter is the projection of n dim space in to sufficient statistic space. Let’s say you simulate 5 Random variables from Beta (0.3, 0.3) distribution and then you plot the sample mean vs. sample sd. The data fall inside the 5-dimensional unit hypercube, this has 32 vertices. A large proportion of data points will fall near the vertices, edges and faces of the hypercube. The horns in the below plots are images of these vertices under the transformation from the 5-dimensional data space to the 2-dimensional space of sufficient statistics. The horns correspond to the various vertices and actually each of horns correspond to choose(5,0) , choose(5,1), choose(5,2), choose(5,3), choose(5,4), choose(5,5) vertices of the hypercube. I guess such learning is valuable as one starts visualizing stuff in higher dimensional / multivariate cases.

Chapter 5: Screening Tests

This chapter gives an intro in to an estimation problem where traditional methods fail/ give absurd values. It subsequently introduces Bayes theorem, thus showing the importance of conditional probabilities in statistical inference problems. The relevance of these conditional distributions to sampling from target distributions is evident from the Gibbs Sampling method covered in the subsequent chapters.

Chapter 6: Markov Chains with Two States

Finally after 5 chapters, a reader gets to see Gibbs sampling procedure. The chapter starts off by introducing a Markov chain, a particular kind of stochastic process that allows for the possibility of a limited dependence amongst a set of random variables. Basic terminology is introduced like state space, transition matrix, homogeneity etc. The book is more a “DO” book where you are expected to understand stuff by coding up things. An example is shown where a sample R code is used to show the long run frequency of states in a simple 2 state Markov chain. acf plots are used as an exploratory tool to test out the rapidity of convergence of the Markov chain. Transition matrices are then introduced to show the ease of computation by looking at problems from a linear algebra perspective. The r step transition matrix boils down to multiplication of the transition matrix r times. Subsequently, the limiting behavior of the chain is shown. One crucial point that these examples bring out is that the rate of convergence of powers of transition matrix is not the same thing as the rate of convergence of the chain. The examples make it very clear that the transition matrix might converge very quickly but the chain might take a very long time to converge, and vice-versa. The last section of this chapter introduces Gibbs sampling where a set of conditional distributions are used to simulate the Markov chain. It is amply clear from the simple example mentioned in the chapter that it is difficult to simulate values from just one conditional distribution. The Gibbs sampling algo is meant to address this situation where the chain moves from one conditional distribution to another in a systematic manner so that the resulting Markov chain, after the initial burn out period, converges to the limiting distribution.

Chapter 7: Example of Markov Chains with Larger State Spaces

Personally, the main reason for going over this book was to understand the content in this chapter. I was looking for some simple examples to understand and internalize the core logic of Metropolis algo, Metropolis-Hastings algo and Gibbs Sampling algo and this chapter was perfect fit for my requirement. The chapter starts off with expanding on the state space of the markov chain, where K state, countably many and a continuum of states are considered. These types of state spaces are what we come across in real life situations. In the context of K state space markov chain, the chapter mentions 5 methods of computing the long run distribution. They are

Powers of the transition matrix
Simulation
Steady-state distribution
Explicit algebraic solution
Means of geometric distributions

Out of the above methods, the text uses the first two methods only. Powers of the transition matrix is obviously one of the straightforward but rather a lame way of doing stuff. Till what power do you multiply? What if the matrix converges very slowly? etc are some of the questions that one must have a clue before using this method. I guess this method is at best a diagnostic tool. The second method is based on simulation. You simulate a Kstate space markov chain and then examine the limiting behavior of the markov chain. The downside of this method is that we must be sure that it is an ergodic chain that we are simulating. The section also has an example where the long-run behavior for a non-ergodic chain is the absorption of the chain in to the absorbing states of the markov chain. The section briefly talks about countably infinite state spaces with a few examples of simple random walk with varying drifts. Finally, Continuous state spaces are covered where the transition density plays the role of transition matrix that was relevant for the discrete case. An important point illustrated from an example is that “A state space of finite length does not ensure useful long-run behavior”. The highlight of the chapter is using a simple bivariate normal variates to show the application of Metropolis Algorithm , its tweak for asymmetric jumps,i.e Metropolis-Hastings algorithm and the most important Gibbs Sampling algorithm. I think the example mentioned in the context of Gibbs sampling is probably easier to remember and internalize. Since the conditional distribution of a normal random variable given another normal random variable is again normal, one can easily apply Gibbs sampling for a bivariate normal case to see its effectiveness. Once the basic algo is internalized, one can then use it for complicated cases. The beauty of this book is that there are enough R based exercises that you get a pretty good idea of the concepts involved. Exercises for this chapter mainly helped me in understanding the following aspects

What’s a doubly stochastic matrix? Can it be non-ergodic ?
How to code a reflecting barrier in an efficient way?
Assuming that there is a continuous state space Markov chain that is over a finite interval. One cannot conclude that it has a long run behavior. Using a markov chain based on beta distributions, there are 2 problems mentioned where one has a long run behavior and the other doesn’t.
How to code Metropolis-Hastings algo efficiently ?
How to simulate a finite markov chain efficiently?
How to code a Gibbs sampler efficiently?

There are R codes given for each of these algos that make this chapter a very useful one.

Chapter 8: Introduction to Bayesian Estimation

The chapter talks about basic applications of Bayesian statistics. The crux of the Bayes stats is : Posterior probability is proportional to likelihood times prior probability. There are 4 examples mentioned that cover the most popular distributions:

Beta Prior + Binomial likelihood = Beta Posterior
Gamma Prior + Poisson likelihood = Gamma posterior
Unknown mu - Normal prior + Normal likelihood = Normal Posterior
Unknown sigma - Gamma for tolerance + Normal likelihood = Inverse Gamma for sigma.

Chapter 9: Using Gibbs Samplers to Compute Bayesian Posterior Distributions

In previous chapter, analytical form for posterior distribution was derived and used in the computation. However closed form solutions for posterior distributions are very difficult and more so in higher dimensional parameter space. This chapter delves in to the utility of Gibbs Sampler in computing posterior distributions. If you assume a normal prior and want to estimate mean , assuming you know the population standard deviation, then posterior distribution of mean turns out to be another normal distribution. If you assume a gamma prior for the precision and want to estimate standard deviation, assuming you know the population mean, then posterior distribution of standard deviations turns out to be inverse gamma distribution. What if you are unaware of mean and sd of the population ? This is where things become slightly complicated. Through a simple example of height differences, the chapter introduces this problem and solves it using Gibbs sampler algorithm.

Here is a hypothetical example, let’s say you are interested in knowing a variable X . You can take a random sample of 40 data points , call it xi. You are interested in knowing the mean and standard deviation of X and Y. Let’s say you observe the data as follows.( Actually I generated this data with the population mean of 40 and sd as 4, so that it will help me in comparing the effectiveness of the Gibb sampler)

Data (40 numbers) :

39 41 39 38 36 32 40 47 37 41 40 46 42 43 41 38 39 32 35 42 41 34 32 38 44 41 42 37 40 35 34 36 42 30 48 32 39 43 32 37

Lets say I assume a prior for mu as normal (30, 5) .. This is a vague prior looking at data I have assumed that the mean might be 30 and the sd for the same be 5.Lets say I assume a prior for sd^2 Inverse Gamma(5, 38) . How do I get these numbers ? Well, my vague prior for sigma is that it lies between 2 and 6 - 95% of the times, i.e sigma square lies between 4 and 36 – 95% of the times, i.e 1/ sigma square lies between 1/4 and 1/36 – 95% of the times.. I assume Gamma ( alpha, kappa) in such a way that this area under the prob distribution between 1/4 and 1/36 is 95%. A simple trial and error will give the values as 5 and 38 for the gamma distribution. This means that 1/sigma^square is gamma(5,38) and hence sigma.squared is Inverse Gamma(5, 38). Now I have my priors ready, I can use Gibbs sampling to get the posterior distribution of the random variable. Since marginals (mu | x,sigma.squared ) and (sigma.squared| x,mu) have closed forms , it is easy to actually code up the Gibbs sampler manually and see the results. I generated large # of sample paths, discarded the first 50% of the steps and obtained the posterior mean as 38.9 and posterior. sd as 4.19. As one can see the posterior mean and posterior sd come very close to the population mean and sd of 40 and 4 respectively with merely 40 data points. Once you increase the sample, the credible interval for the parameter is narrowed and more robust estimates are obtained.

Not always you manage to get closed form for partial conditionals, i.e. parameter1 given (data, parameter2,….parameter n) need not always lead to closed form. In fact most often than not, it won’t be a closed form. Here is where BUGS comes in to picture and makes our life easy. Anyways that is covered in the last chapter, so will summarize about BUGS at the relevant section of this post.

The chapter introduces another example where frequentist stats fall short in giving answers. In a two step manufacturing process, there is a need to understand whether there is variance at the batch level or whether there is variable at the individual unit level with in the batch. Using gibbs sampling, the book makes a strong case for using Bayes to get a clear understanding of the variability. This example is pretty involved and it took me 3-4 hours to understand it in its entirety. However the effort was well worth it as it gave me a better understanding of stuff. With Bayes, all you need is a good specification of priors + specification of likelihood— you don’t even have to code the partial conditionals…BUGS will do the job for you . The last chapter in this book is all about BUGS.

**
Chapter 10: Using Win BUGS for Bayesian Estimation**

The last chapter in the book is like a wonderful dessert after a nice meal. If you want to do anything in Bayes world, you got to learn BUGS. Like any other language/ tool, there is the initial pregnancy period one needs to go through where things grow in you , you have nurture it and only then you can actually produce something real Smile Well, my analogy might be stretched but it is something which I think is very apt. Some of the people I have met think that learning skill should be done very quickly. If they fail to get it in the first few days, they either assume that the whole thing is boring or think it is not worth it. If they do get a part of stuff correctly, they assume that they know everything about it. I had a lot of trouble teaching undergrads calculus as adjunct faculty. There were some kids who were willing to slog , but the majority were looking for instant solutions/ instant techniques to solve problems. Alas! They failed to recognize that “what comes easy never sticks”. I think I am going tangential here. Let me stick to the intent of this post, i.e summarizing the book.

This last chapter gives a very good introduction to BUGS. After going through this BUGS with the examples mentioned, one can fairly confident of testing out various models and meta-models. As such BUGS UI is not all that difficult. The syntax is very much R like and one can easily learn it. The content this chapter will make a reader curious to wrap Bugs functions with R/ Matlab/ whatever be the programming environment that he/she is comfortable. That’s when things start getting interesting. Let’s say you want to use a truncated normal prior, you cannot do it with plain vanilla BUGS. You have to look around and install a few plugins to specify truncated normal prior. Things like these can be learnt/ internalized as a consequence of this wonderful chapter. This chapter alone can be read from the book, by someone who is looking for BUGS 101 principles.

Takeaway:
This book is much more than an exposition of “Gibbs Sampling Method”. By systematically explaining the various algorithms that go with Bayesian estimation, it changes the way you look and formulate a statistical model in R. The tight linkage of all relevant concepts with excellent R code is the highlight of this book. Not many books manage this kind of balance between theory and practice.