Probability Theory and its Applications ( Volume I ) - Feller

A classic is a book that has never finished saying what it has to say – Italo Calvino.

These words definitely apply to Feller’s books. Both the volumes by Feller on probability can be considered as classics. The first book deal with the discrete variables while second is far advanced as it deals with measure theory and continuous variables. Markov processes are one of the main sources for Martingales and I had to go back to Feller to work on Markov Processes and managed to go over the entire book instead of only looking up stuff on Markov Processes. Let me attempt to various probability topics that are covered in Volume I. This post is going to serve mainly as my reference to various ideas and thoughts that the book brings out. This is definitely one of the longest posts that I have written till date. It is a ~5500 word summary and hence this post will definitely have a list of some of the powerful ideas in probability.

The author starts off with mentioning three aspects of any theory, a) formal logical content, b) intuitive background, c) the applications. From what I could make out, I think he urges the reader to work on all the three areas for a good understanding of the theory. Formal logical content comprises the axioms, lemmas and theorems based on which the entire probability theory is built. The intuitive background, I guess refers to a constant re-evaluation of our beliefs. Applications, the third and I think the most important part for a guy like me whose sole purpose in slogging through this stuff is to build useful models.

So, in one sense this book needs to be read in such a way that one must check the progress in all the three directions at the end of every chapter by asking these questions to oneself

What axiomatic theory has been developed in the chapter?
What intuitive notions that the theory supports or refutes?
Where can I apply this theory?

Unless you keep asking these questions at regular intervals, it is likely that you will have a limited understanding of the theory. A trivia from the introduction - “Sample space“ – term was coined by Von Mises.

Chapter 1: Sample Spaces

The chapter starts off with a basic description of sample spaces. As the book is all about discrete sample spaces, the number of events in the probability space is considered to be denumerable. The chapter provides umpteen number of examples of sample spaces and provides the necessary understanding to distinguish outcomes to events in a probability space. A fantastic point mentioned in this chapter is about “how probability and experience go hand in hand in selecting the models?” If there are r balls to be places in n slots in such a way that there are r1, r2, r3,….., rn balls in each of the slots then the probability of occurrence of this event is n! /( Π ri! * rn) . This is what is taught in elementary courses in probability. However who says this model is correct? Well there are models for which one cannot assign the above probability. In the case of Bose-Einstein Statistics experience leads to an assignment of 1/ n+r-1cr probability whereas Fermi-Dirac computes the probability as 1/n+r-1cr. It is impossible to select or justify the probability model by a priori reasoning. Thus there is an intricate interplay of theory and experience while assigning probabilities. There is always this argument you get to hear the finance models are so different from physics models and one must be cautious about it. Well, here is a argument which says physics and finance models do have something in common. Experience and Theory should go hand in hand while assigning probabilities in finance, much like assignment of probabilities in Bose-Einstein/Fermi-Dirac/Maxwell-Boltzman worlds.

Chapter 2: Elements of Combinatorial Analysis

This chapter deals with all the necessary tools to think about combinatorial problems. Tons of examples are given to make the reader understand sampling without replacement, sampling with replacement. The highlight of this chapter is it shows the way to crack occupancy problems. Based on your notions about the probability spaces, a partition of r balls in n slots can take various forms. Hyper geometric distribution is introduced and a nifty linkage is provided between the usage of the distribution to Maximum Likelihood estimate. A glimpse in to “waiting times” is provided using simple balls/urns example. This intuition / simple examples are needed to understand sophisticated concepts such as hitting times in martingales and ultimately develop a model for American option. Yes everything is connected in one way. If you want to evaluate an American option, hitting times become important. To understand hitting times via Martingales, you might have to understand hitting times in simple coin toss examples, which in turn means that you need to understand stuff using simple ball/urn models. It is often seen that students are interested in fancy models and no one is interested in spending time in understanding a few basics. Fancy models look good only in “talks” and not in “action”. Anyways coming back to the chapter, it concludes with a bunch of equalities involving binomial coefficients , stirlings approximation etc. This chapter is a reference chapter of sorts and one can come back to the content at frequent times during the book.

Chapter 3: Fluctuation in Coin Tossing and Random Walks

This chapter is a brilliant and an eye opening one; It can be read independently of the entire book. Many myths are shattered by looking at the world of coin tosses. It explores the concepts of random walks using simple coin tossing experiments. Sophisticated concepts can be extrapolated once random walks are seen as realization of infinite coin tosses. The chapter starts off with “Refection Principle” which finds its applications almost everywhere throughout the chapter. A few examples are given at the beginning of the chapter to provide motivation to any reader. Ballot Problem, Galton’s rank order test, Kolmogorov-Smirnov tests become analytically tractable by viewing it from coin toss perspective. The following questions are explored in this chapter:

What is the probability that no return to the origin occurs up to and including epoch 2n ?
What is the probability that the first return to the origin happens at epoch 2n ?
What is the probability that a random walk always stays to one side until epoch 2n ?
Probability that the last visit to the origin occurs at epoch 2k considering that the random walk is at epoch 2n now?
How does the above probability relate to arc sine distribution?
What is the probability that the random walk spends k % of the time of the total time ?
Surprising connection between Sojourn times and Last visit times
What is the probability that the first passage time through r occurs at epoch n?
What is the probability that the rth return to the origin occurs at epoch n ?
Connection between first passage time and return time
What is the probability that the maximum value of the random walk = r in its journey till epoch = 2n

Each of the above questions is solved in a detailed way so that user understands the logic + intuition + practical application of the theory behind these questions.

This chapter is a little overwhelming and needs a re-read atleast a few times to appreciate the results. The applications of concepts from this chapter are wide and far reaching. For example, the sojourn times and last visit to origin’s distribution – ArcSine law can be used in deciding the holding period in a pair trading model. Assuming the spread reverts to origin x number of times in a year, one can fit an arc sine distribution to decide the holding period of the pair.

Chapter 4: Combination of Events

If you consider a set of events A1, A2, …AN, How does one calculate the probability of the event that at least one of the events happen? How does one calculate the probability of the event that exactly k out of N events happen? Once the number of events grows beyond 2, one needs a systematic way to solve these problems. This chapter introduces an approach to calculate the probability that at least one of event happens in a set of events. This approach is then applied to “Matching Problem”, “Occupancy Problem”. Occupancy problem is described in detail and a connection is made to the fact that probabilities reach Poisson distribution asymptotically. This chapter is essential to compute the probability of events such as

Probability that atleast one of events happens from the event set (A1, A2, …AN)?
Probability that exactly k events happen from the event set (A1, A2, …AN)?
Probability that atleast k events happen from the event set (A1, A2, …AN)?

Even though closed form solutions are given for these kind of problems, I wonder whether one really applies these closed forms. For example, if you had 1000 slots and 5000 balls and you distribute these balls randomly in to 1000 slots, Let’s say you wanted to find the probability that exactly 3 boxes are empty ; Just put this command in a loop that goes over N times

( 1000 - length( unique( c ( sample( 1:1000, 5000, T ) ) ) ) == 2 )
Count the number of TRUE values for the condition in the loop, (count / N) gives the probability.

You don’t need to know any of the stuff mentioned in this chapter. However the reason you probably NEED to know the closed form probability is that it builds the intuition behind such stuff. Simulation gives you an estimate but it will not help you build a probabilistic mindset I guess.

Chapter 5: Conditional Probability and Stochastic Independence

Conditional probability is introduced for discrete variables. The real power of Conditional probability and Conditional expectation is usually seen in the continuous case. Dealing with discrete is easy and appeals to our intuitive notions. The highlight of the chapter is the description of urn models and its usage in the formulating conditional probability models. Polya’s urn scheme is discussed so as to model contagions. Independence is also explored where conditions are given in order for a group of variables to be independent. Product spaces are merely introduced as a thorough description would need measure theory. There are several starred sections in this chapter which can be skipped during the first read.

Chapter 6: The Binomial and the Poisson Distributions

The most popular distributions in the probability theory are introduced in this chapter, namely binomial and Poisson. As is well known that binomial distribution B(r;n,p) increases and then decreases as r increases. Nifty tail approximations are provided in the book that are useful for quick back of envelope calculations. Weak law of large numbers is introduced where convergence in probability is discussed. An easy derivation of weak law of large numbers is shown applying tail approximations of binomial distribution. Subsequently Poisson as a limiting distribution for binomial is introduced. A dozen examples or so are provided to see the relevance of Poisson process. The chapter ends with a brief description on multinomial distribution

Chapter 7: The Normal approximation to the Binomial distribution

For very many cases, we usually deal with large values of n and hence it makes sense to approximate the distribution with a normal or a Poisson distribution. Both versions of binomial distributions, symmetric & asymmetric are shown asymptotically to converge to normal distribution. However the approximation sucks as we move away from the mean of the binomial distribution, a subtle point often not paid attention to. Demoivre-Laplace theorem is then stated and proved which gives a convenient way to formulate probabilities for events that restrict the random variable realizations. One thing to be noticed is that binomial approximation to Poisson is better when lambda is small whereas it does not matter whether you approximate with Poisson or Normal if lambda is large and n is large.

Chapter 8: Unlimited Sequences of Bernoulli Trials

This chapter deals with Borel Cantelli Lemmas that are tremendously useful in formulating probabilities on an infinite probability space. One of the lemmas is subsequently used to prove strong law of large numbers, i.e the cases where the law does not hold good, form a negligible set. Law of iterated logarithm is also introduced in this chapter which I have skipped and would revisit at a later date. Author clearly suggests to skip certain sections of the book so that the reader does lose sight of the forest for too many trees.

Chapter 9: Random Variables; Expectation

Basic definition of random variable, density function, distribution function, expectation of a random variable is given. The chapter can be quickly covered as most of the concepts any reader would have come across in some elementary course. The author makes a few interesting points like

The reduction of probability theory to random variables is a short-cut to the use of analysis and simplifies the theory in many ways. However, it also has the drawback of obscuring the probability background. The notion of random variable easily remains vague as “something that takes on different values with different probabilities.” But random variables are ordinary functions, and this notion is by no means peculiar to probability theory.

BASICALLY measure theory is a MUST to get clarity in your understanding.

Negative binomial distribution and its connection to exponential distribution is shown. Examples on waiting times that figure out in Negative binomial distributions can be very crucial in understanding stuff about Martingales. I had never come across Kolmogorov inequality till date and was realized that it is a scaled version of Chebyshev’s inequality.

One thing I consciously keep in mind while reading is that – “Is this content/chapter/theorem/lemma changing the way I think about stuff? Sometimes I forget this principle and I end up in a state of “Awareness”, instead of “Understanding”. Let me give an elementary example. If you were asked to compute the mean and variance of a hypergeometric distribution, then you can as well write its probability mass function and then religiously compute the mean and variance. However that is not the elegant way to do things once you see the covariance formula. Covariance formula gives you the power to break up a big problem in to series of small problems. Applying to this hyper geometric example, you can actually look at the distribution as sum of iids taking 1 or 0 based on the selection of defective ball. Once you zero on to the individual block, you can calculate mean and variance at an individual level, covariance between these individual elements and then just sum up the expectation and variance to get the metrics at a aggregate level. So, this is an example where the existence covariance formula changes the way you think /approach / solve problems. In this book, the author delights at so many places that I bet even a well trained probabilist will have aha moments from this book

Chapter 10: Law of Large numbers

Law of Large numbers and Central Limit theorem are stated at the very beginning of the chapter. Subsequently, the book makes an important point that there is a need to formulate an analogous form for variables which do not have finite expectation. It is a mistake to dismiss variables which do not have finite expectation from our study as they frequently occur in many stochastic processes. The Proof of Law of Large numbers is given by using “method of truncation”, a technique which is commonly used to prove limit theorems.

The chapter talks about “Petersburg game”, the puzzle which stumps any beginner who needs to deal with a situation where there is a need to formulate long term probabilities but the random variables have no finite expectation. Petersburg game is one such example where the random variable has infinite expectation but one can still one form some sort of bounds in order to compute the cost of entering in to the game.

Some nice examples are given to show the wide applicability of Central limit theorem. So, one should not have the notion that central limit theorem is a stricter condition and applies to a subset of the sequences for which weak law holds good. Sufficient conditions for Weak law and Central limit theorem are mentioned clearly and derived. In this context, Lindeberg theorem is mentioned as the sufficient condition for CLT to hold good. For Law of Large numbers to hold good the variables need to be uniformly bounded. The chapter shows that Strong law of large numbers deals with almost sure convergence whereas Weak law of large numbers deals with convergence in measure. It is put in words brilliantly by the author as follows:

“Weak law does not imply the average remains small as you increase the sample size.The value continues to fluctuate between finite and infinite limits. However one can conclude that large values occur at infrequent intervals.”

Strong law on the other hand talks about the entire tail of the sequence of random variables and hence is a stricter condition, meaning Convergence almost sure implies Convergence in measure

Chapter 11: Integral-Valued Variables. Generating Functions

Generating function is a very useful tool for solving a multitude of problems in the discrete variable space. Events for which probability computations appear very complex become tractable with the usage of generating function. What is a generating function in the first place? If you are given a set of real numbers, in this context a set of discrete probabilities, you can create a summary function for all the numbers in the form of a generating function. The name is apt as the function appears like generating all the probability values of a discrete variable. One of the powerful applications of generating function is to compute convolutions. If you have a convolution, i.e sum of certain number of random variables, the probability that the sum of random variables behave in a specific way is analytically intensive from first principles but becomes solvable using generating function. For example, if you toss n coins, what is the probability that no 3 heads appear together in all the trials? Attempt this problem from first principles and then attempt the same using generating function, you will clearly see the power of generating functions. In chapter III of the book, various interesting events such as first passage times, equilibrium times, return times to origin are derived using nifty tricks and observations that involves reformulating the problem so that the problem changes to a known form. However the same problems are solved using generating functions in this chapter and the advantage that one gets from using this tool is that you get to see not only the probabilities but also the various moments.

For example, the probability that in a simple random walk, the random walk always stays equal to or below 0 and at epoch n , reaches 1 can be solved using generating function. The resulting generating function gives much more information of the entire process and helps one get to a ton of results at once. Bivariate generating functions are also touched upon where a brief introduction is given to compound Poisson distribution. The applications of compound Poisson distribution is immense in field of quant finance. Think about it, the bid and ask processes are generally modelled using Poisson distribution and there is a need to understand the generating function of the combined bid and ask process. I have never built a model till date on bid and ask processes but certainly hope to do so in my working life. The chapter ends with a discussion on “Continuity theorem”, which says generating functions convergence is necessary and sufficient condition for a distribution to converge to the limiting distributions. For example when binomial generating function converges to Poisson generating function, then it means that in the limiting case, binomial converges to Poisson. One way to visualize generating function is that it is the DNA of discrete variable. You can study all the properties of the creature using this basic DNA. You can extend the analogy to characteristic functions for more generic random variables.

Chapter 12: Compound Distributions, Branching Processes

One often deals with a sum of random variables in real life applied math problems. The previous chapter dealt with the power of using generating functions to simplify things; a generating function of a convolution can easily be written as product of generating functions of the involved variables. Most often we come across convolutions where the number of independent variables involved in the convolution is a Poisson variable and hence the resulting distribution is called a compound Poisson distribution. A few examples are given relating to compound Poisson distribution so that a reader has a clear idea of the place where such distributions arise. The chapter then moves on to discussing branching processes.

I was little skeptical to go over this stuff mainly because my naive brain said that these processes would not be used as such in math finance. However a quick perusal of math fin literature showed that there was a serious paper on pricing barrier options using branching process. Instead of using standard GBM for the price path evolution, the authors have used a branching process to evaluate a barrier option. Motivated by seeing such an application, I began understanding various aspects of branching process. Frankly till this date, I had always though people use GBM for prices primarily and never thought there could be something else to model the price evolution. I hope someday in the future I will check the validity of using a branching process for price evolution. In theory, the problem that branching process deals is “Given an ancestor which spawns children with a certain probability distribution, what is the probability of certain event after n generations”? The math is little tricky here as , one does not compute a generating function per se but one needs to come up with a recurrence relation which is then used to find out the probability and expectation of certain events. The chapter ends with a discussion of a detailed example of total progeny in a branching process. This chapter makes one realize the immense power of generating functions in probability models.

Chapter 13: Recurrent Events, Renewal Theory

This chapter serves as a superb introduction to Renewal theory which is crucial to OR Models. From a Quant finance perspective, I came across an order book model where Renewal processes where used. There is also a usage of Renewal processes in high frequency trading. So, this chapter has important implications for those looking to develop quant fin models on high frequency data.

The chapter introduces about 9 examples of recurrent events.

Return to equilibrium in a sequence of Bernoulli trials and event is “Accumulated number of successes and failures are equal”
Return to equilibrium through negative values
Accumulated number of successes equals K times the accumulated number of failures
Partial sums exceed all the preceding sums
Success runs in Bernoulli trials
Continuation – Repeated patterns
Geiger counters
Simple Queuing process where the event that “server is free “constitutes a recurrent pattern.
Servicing of machines

In all the above examples, the inter arrival times are iid. They need not be exponential as in the case of Poisson distribution. This is what differentiates Renewal theory from the usual Poisson processes. If the inter-arrival is geometric distribution then the recurrent events/ renewal process is Bernoulli.

Some basic terminology of event types is introduced such as persistent event, transient event, and periodicity of the event. The crucial link between inter-arrival distribution and the waiting time is summarized by generating function, which is then used to arrive at waiting distribution under some of the above recurrent events mentioned. Elementary Renewal theorem is derived; Central limit version for Renewal process is derived. This chapter took me the longest to understand and I have left about 3 sections to revisit at a later date. It is too overwhelming for me on the first read.

Chapter 14: Random Walks and Ruin Problems

This chapter gives a firsthand introduction to stochastic processes. Coin tossing examples are extremely useful to get an intuitive sense of stochastic processes. The chapter begins with the classic ruin problem where difference equations are used to answer the following questions

What is the probability that a gambler gets ruined?
What is the probability of a ruin if the casino has infinite amount of money?
What is the expected duration of gambler’s ruin?
What is the expected duration of gambler’s ruin if casino has infinite capital?

The chapter is fairly advanced for the first time reader. Generating function approach is used to compute duration and first passage times. The highlight of this chapter is where the author makes the connection between coin tosses and diffusion processes. By imposing some conditions on the size of the simple random walk and time duration of the random walk, one gets a good intuition about general stochastic processes. By providing two approaches, one by explicit probability density computations and second by differential equation approach, the author shows the 2 basic approaches that are used to formulate various stochastic processes.

Chapter 15: Markov Chains

Markov chains are extremely important in developing analytical techniques for solving classical probability problems. Before encountering Markov chains, I had always thought that the only way to approach problems like the following, was to set up recurrence relations and solving them.

In a simple random walk, what is the probability that the random walk is between a and b?
What if the absorbing barrier at a is made in to a reflecting barrier?
What if the absorbing barrier at b is made in to a reflecting barrier?
What is the mean first passage time of a particular coordinate in the simple random walk?
What is the mean recurrence time of a particular coordinate in a simple random walk?

This was slightly posing a resistance in my mind as I wanted to computationally solve such problems. I was oblivious to Markov chains which were developed almost 100 years ago and that were used to solve precisely the problems such as above, using matrices. As I spend more time doing math, I am realizing that any problem especially which involves lot of repeated calculations, one should use matrices in some way or the other.

Coming back to the chapter, it starts off by giving idea of a Markov chain using examples like random walk, urn models. It makes it clear that Markov chain by definition generalizes the outcomes to be dependent on the previous outcome, thus it is defined in terms of conditional probabilities. Since one is dealing with a number of possible states, one needs a matrix to represent these conditional probability movements. This kind of matrix is called transition probability matrix. A dozen examples are given with their transition matrices so that a reader gets an idea of the whole stuff. At this stage I was left wondering the reason for not thinking/using Markov chains in my work. I had vaguely heard of Monte Carlo using Markov chain but had never ventured in to understanding it. Last night I was hearing a introductory lecture on Markov Chain and one of the Stanford professors says that “Where do I use MCMC methods is like asking an undergrad, where do you use quadratic equation?” I felt ashamed of the fact that I had never worked on MCMC till date. Have made a resolution that I will learn this technique and see where all its usage makes a tremendous sense in math finance.

The chapter introduces Chapman-Kolmogorov equations which talk about the transition probabilities from state i to state j after n trials. Subsequently concepts such as Closures and Closed Sets are introduced. The terms Closed Sets and Closures might seem abstract but they are extremely useful in breaking up the transition matrix in to stochastic independent matrices which can then be analyzed. This logically leads to the definition of irreducible chain where every state can be reached from every other state. One must be careful to note that an irreducible chain need not be a regular chain , but every regular chain is an irreducible chain.

After the first 4 sections, the chapter then moves in to classifying states. The terms that one must be clear while classification are

Persistent - means that the unconditional probability that the path moves from state j to state j is 1.
Transient - means that the unconditional probability that the path moves from state j to state j is <1.
Periodic - States are visited periodically
Aperiodic - There is no sense of periodicity in the way states are visited.
Null - a persistent state where mean recurrence time is infinite.
Not Null - a persistent state where mean recurrence time is finite.

The above definitions are useful in classifying states. So, one can classify the Markov chain in to the nature of states. Typically one is interested in “aperiodic-persistent -Not Null “states. These are called Ergodic states.

The chapter subsequently talks about invariant distributions, a key concept that will be used in Markov Chain Monte Carlo. In an ergodic markov chain , if one looks at the probability of reaching a specific state from any of the initial starting states, it converges to a fixed number. This means that after n iterations, you will end up with a transition matrix that is a fixed row matrix. This means that it does not matter what initial distribution you start from. You end up with a constant probability in a specific state. Now one can flip the question – What is the condition that the fixed weight vector that you end up is the same as the initial distribution? The answer is not easy, given any transition matrix. However there are certain types of transition matrices for which one can work out. A number of examples that you come across have transition matrices that are band matrices with values only on the upper diagonal and lower diagonal. For such type of matrices, the book gives a condition for the presence of an invariant distribution. This condition is then applied to a variety of situations, post which, one can formulate an opinion about the stability of the transition matrix. The examples where the invariant distribution is checked are

Reflecting barriers
Ehrenfest model for diffusion
Bernoulli- Laplace model for diffusion
Recurrent events

Once you read this section, you wonder how easy computations become once you formulate problems using Markov chain. Choosing the states of the Markov chain is not always obvious. Look at this card shuffle example where choosing states in a smart way reduces the computational complexity of the problem.

After reading this chapter, you will understand the difference between Markov chain and stochastic process. In a Markov Chain, all that is needed to understand the conditional probability for a state n is the previous state. In a stochastic process, it is not just the previous state but the entire history of the process will play a role in the prediction of future state. Markov property is related to conditional probability whereas Martingale is related to Conditional expectation. So, there is a lot of difference between the two mathematical objects. A Markov process need not be a martingale, though Markov processes are a good source for martingales. Similarly not every martingale is a Markov process.

Chapter 16: Algebraic Treatment of Markov Chains

I skipped this chapter and instead preferred Matrix approach presented in “Finite Markov Chains”.

Chapter 17: Simplest Time Dependent Stochastic Process

This chapter uses the concepts of conditional probability to set recurrence relations which in turn could be looked as a PDE. You get to see the construction of simple time dependent stochastic processes like Poisson, Birth process, Birth-Death Process etc. The last chapter gives a flavour of the stochastic process from a discrete world view.

Takeaway:

The author’s enthusiasm for the subject is infectious, as evident from the way the book is written. The book has outstanding breadth and depth. Any seasoned probabalist would still find something new in this book, even on an nth reading. Every year there are professors/ practitioners/ instructors/ modelers referring to / debating on /agreeing with /pondering over various aspects mentioned in this book. In that sense, the content of the book is timeless.