Ergodic Markov Chains

A Markov chain is ergodic if and only if it has at most one recurrent class and is aperiodic. A sketch of a proof of this theorem hinges on an intuitive probabilistic idea called "coupling" that is worth understanding.

An important theorem for Markov chains is the following:

Theorem 1: A Markov chain is ergodic if and only if it has at most one recurrent class and is aperiodic.

This is a very useful theorem. It means that in many cases, we can tell if a Markov chain is ergodic just by analyzing its state diagram. The goal of this post is not to provide a full proof of this theorem. Rather, I want to provide some intuition for why the theorem is true. In particular, the “if” direction contains an intuitive probabilistic idea called coupling that is worth understanding.

The coupling method was invented by Wolfgang Doeblin (Doeblin, 1939). See A1 for a brief biography of his short but impactful life. Before explaining Doeblin’s idea of coupling, I review Markov chains to introduce notation. This section can be skimmed if one is already familiar with these ideas. Finally, I provide a sketch of a proof of Theorem 1.

Markov chains

A Markov chain is a powerful mathematical object. It is a stochastic model that represents a sequence of events in which each event depends only on the previous event. Formally,

Definition 1: Let $D$ be a finite set. A random process $X_1, X_2, \dots$ with values in $D$ is called a Markov chain if

$$ \mathbb{P}\{X_{n+1} = x_{n+1} \mid X_n = x_n, \dots, X_0 = x_0\} = \mathbb{P}\{X_{n+1} = x_{n+1} \mid X_n = x_n\}. \tag{1} $$

We can think of $X_n$ as a random state at time $n$, and the Markovian assumption is that the probability of transitioning from $x_n$ to $x_{n+1}$ only depends on $x_n$. In words, the future depends only on the present. Let $p_{ij}$ be the probability of transitioning from state $i$ to state $j$. A Markov chain can be defined by a transition probability matrix:

Definition 2: The matrix $\mathbf{P} = \big( p_{ij} \big)_{i,j \in D}$ is called the transition probability matrix.

Thus, $\mathbf{P}$ is a $|D| \times |D|$ matrix, where $|D|$ denotes the cardinality of $D$; the cell value $p_{ij}$ is the probability of transitioning from state $i$ to state $j$; and the rows of $\mathbf{P}$ must each sum to one. In this post, we will restrict ourselves to time homogeneous Markov chains:

Definition 3: A Markov chain is called time homogeneous if

$$ \mathbb{P}\{X_{n+1} = j \mid X_n = i\} = p_{ij}, \quad \forall n. \tag{2} $$

In words, the transition probabilities do not change as a function of time. Finally, let’s introduce some useful notation for the initial state of the Markov chain. Let

$$ \mathbb{P}_{x_0}\{\;\cdot\;\} \triangleq \mathbb{P}\{\;\cdot \mid X_0 = x_0\}. \tag{3} $$

For example, I will write $\mathbb{P}_a\{X_1 = b\}$ rather than $\mathbb{P}\{X_1 = b \mid X_0 = a\}$. This is because all the conditional probabilities depend on the initial state, and the usual notation is cumbersome.

Consider a simple Markov chain modeling the weather. We assume the weather has two states: rainy and sunny. Therefore $D = \{r, s\}$, and $X_n$ is the “weather on day $n$”. The Markov chain model is as follows. If today is rainy ($r$), tomorrow it is sunny ($s$) with probability $p$. If today it is sunny, tomorrow it is rainy with probability $q$. The transition matrix is then

$$ \mathbf{P} = \begin{array}{|c|cc|} \hline & r & s \\ \hline r & 1 - p & p \\ s & q & 1 - q \\ \hline \end{array} $$

and the state diagram of the Markov chain is Figure 1.

Figure 1: State diagram for a Markov chain with states $D = \{r, s\}$. The probability of moving from $r$ to $s$ is $p$ and from $s$ to $r$ is $q$. The remaining probabilities can be computed because each distribution (row) must sum to unity.
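To make this concrete, here is a minimal sketch in Python (NumPy) that builds the weather chain’s transition matrix and samples a short trajectory. The values of $p$ and $q$ are hypothetical, since the post leaves them unspecified:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical transition probabilities (not specified in the post).
p, q = 0.3, 0.4  # P(rain -> sun), P(sun -> rain)

states = ["r", "s"]
P = np.array([
    [1 - p, p],      # row r: stay rainy, become sunny
    [q,     1 - q],  # row s: become rainy, stay sunny
])
assert np.allclose(P.sum(axis=1), 1.0)  # each row must sum to one

# Sample a ten-day trajectory starting from a rainy day (X_0 = r).
x = 0
path = [states[x]]
for _ in range(10):
    x = rng.choice(2, p=P[x])  # next state drawn from row X_n
    path.append(states[x])
print(" -> ".join(path))
```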

As a consequence of the Markovian assumptions, the probability of any path on a Markov chain is just the product of the transition probabilities along the path. For example, for some Markov chain with states $D = \{a, b, c, d\}$, the probability of a particular path, say $a \rightarrow b \rightarrow b \rightarrow d \rightarrow c$, factorizes as

$$ \mathbb{P}_a\{X_1 = b, X_2 = b, X_3 = d, X_4 = c\} = p_{ab} \; p_{bb} \; p_{bd} \; p_{dc}. \tag{4} $$

You can convince yourself of this easily by repeatedly applying the chain rule and the Markovian assumption (or see A2).
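As a quick sanity check of Eq. (4), we can pick a hypothetical transition matrix over $D = \{a, b, c, d\}$ and multiply the entries along the path:

```python
import numpy as np

# A hypothetical transition matrix over D = {a, b, c, d} (rows sum to one).
P = np.array([
    [0.1, 0.4, 0.3, 0.2],
    [0.2, 0.3, 0.1, 0.4],
    [0.5, 0.1, 0.2, 0.2],
    [0.1, 0.2, 0.6, 0.1],
])
idx = {"a": 0, "b": 1, "c": 2, "d": 3}

# Probability of the path a -> b -> b -> d -> c, per Eq. (4).
path = ["a", "b", "b", "d", "c"]
prob = np.prod([P[idx[s], idx[t]] for s, t in zip(path, path[1:])])
print(prob)  # p_ab * p_bb * p_bd * p_dc = 0.4 * 0.3 * 0.4 * 0.6 = 0.0288
```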

Furthermore, we have an easy formula to compute quantities such as

$$ \mathbb{P}_i\{X_n = j\} \tag{5} $$

for $n \geq 1$. For example, let $n = 2$. Using marginalization and the Markovian assumption, we have

$$ \begin{aligned} \mathbb{P}_i\{X_2 = j\} &= \sum_{d \in D} \mathbb{P}_i\{X_2 = j, X_1 = d\} \\ &= \sum_{d \in D} \mathbb{P}_i\{X_2 = j \mid X_1 = d\} \, \mathbb{P}_i\{X_1 = d\} \\ &= \sum_{d \in D} p_{id} \; p_{dj}. \end{aligned} \tag{6} $$

What is this? We are just performing a dot product between the $i$th row and $j$th column of the transition matrix $\mathbf{P}$ (Figure 2), or

$$ \mathbb{P}\{X_2 = j \mid X_0 = i\} = \sum_{d \in D} p_{id} \; p_{dj} = \big( \mathbf{P}^2 \big)_{ij}. \tag{7} $$

You can check that this idea generalizes so that,

$$ \mathbb{P}_i\{X_n = j\} = \big(\mathbf{P}^{n}\big)_{ij}. \tag{8} $$

Since we are repeatedly applying the transition matrix, $\mathbf{P}^n$ is sometimes called the $n$th iterate of $\mathbf{P}$.

In essence, the dot product performs marginalization, and everything works out nicely because of the Markovian assumption. If the $|D|$-dimensional vector $\mathbf{v}$ represents a discrete distribution over initial states $X_0$, then $\mathbf{v}^{\top} \mathbf{P}$ is a $|D|$-dimensional vector representing the probability distribution over $X_1$. This is because a dot product of $\mathbf{v}$ with a single column of $\mathbf{P}$ performs marginalization over $X_0$. If this is unclear, I’d recommend writing out the probabilities explicitly using a small $2 \times 2$ matrix.
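Continuing the weather example (with the same hypothetical $p$ and $q$ as before), here is a small sketch of Eq. (8) and of propagating a distribution over initial states:

```python
import numpy as np

p, q = 0.3, 0.4  # hypothetical values, as before
P = np.array([[1 - p, p], [q, 1 - q]])

# Eq. (8): P_i{X_n = j} is the (i, j) entry of the n-th iterate P^n.
n = 5
Pn = np.linalg.matrix_power(P, n)
print(Pn[0, 1])  # probability of sun on day 5 given rain on day 0

# A distribution v over X_0 propagates to v^T P^n, a distribution over X_n.
v = np.array([0.5, 0.5])  # uniform over {r, s} on day 0
print(v @ Pn)             # distribution of the weather on day 5
```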

Figure 2: The second iterate of $\mathbf{P}$. Since each element $(\mathbf{P}^2)_{ij}$ is the dot product between the $i$th row and $j$th column, the cell values of $\mathbf{P}^2$ represent the probability of starting the Markov chain on state $i$ and arriving at $j$ by $n = 2$.

We are interested in ergodicity, the property of a random process that its time average is the same as its average over the probability space. Formally,

Definition 4: A Markov chain $\{X_n\}$ is called ergodic if the limit

$$ \boldsymbol{\pi}(j) = \lim_{n \rightarrow \infty} \mathbb{P}_i\{X_n = j\} \tag{9} $$

exists for every state $j$ and does not depend on the initial state $i$. The $|D|$-vector $\boldsymbol{\pi}$ is called the stationary probability.

In other words, the probability $\boldsymbol{\pi}(j)$ of being on state $j$ after a long time is independent of the initial state $i$. We can also write this as

$$ \boldsymbol{\pi}(j) = \lim_{n \rightarrow \infty} (\mathbf{P}^n)_{ij}. \tag{10} $$

Consider the following important implication. If a Markov chain is ergodic, then

$$ \begin{aligned} \boldsymbol{\pi}(j) &\triangleq \lim_{n \rightarrow \infty} (\mathbf{P}^n)_{ij} \\ &\stackrel{\star}{=} \lim_{n \rightarrow \infty} (\mathbf{P}^{n+1})_{ij} \\ &= \lim_{n \rightarrow \infty} (\mathbf{P}^n \mathbf{P})_{ij} \\ &= \lim_{n \rightarrow \infty} \sum_{d \in D} (\mathbf{P}^n)_{id} \, p_{dj} \\ &= \sum_{d \in D} \boldsymbol{\pi}(d) \, p_{dj}. \end{aligned} \tag{11} $$

Step $\star$ holds because if the limit exists, the distinction between $n$ and $n+1$ does not matter. So we can write this as

$$ \boldsymbol{\pi}^{\top} = \boldsymbol{\pi}^{\top} \mathbf{P}, \tag{12} $$

where $\boldsymbol{\pi}$ is a column vector. Hence the name “stationary probability” for $\boldsymbol{\pi}$. It is a distribution that does not change over time.
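Eq. (12) says that $\boldsymbol{\pi}$ is a left eigenvector of $\mathbf{P}$ with eigenvalue one. For the weather chain, $\boldsymbol{\pi} = (q/(p+q),\, p/(p+q))$ in closed form; here is a minimal numerical sketch (hypothetical $p$ and $q$ as before):

```python
import numpy as np

p, q = 0.3, 0.4  # hypothetical values, as before
P = np.array([[1 - p, p], [q, 1 - q]])

# pi is a left eigenvector of P with eigenvalue 1 (Eq. 12).
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()  # normalize to a probability vector

print(pi)                        # [0.5714..., 0.4285...]
print(q / (p + q), p / (p + q))  # closed form for the two-state chain
assert np.allclose(pi @ P, pi)   # pi does not change over time

# The rows of P^n also converge to pi (Eq. 10).
print(np.linalg.matrix_power(P, 50))
```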

Since we are interested in ergodicity, let’s now introduce some properties related to reachability and long-term behavior.

Definition 5: We say that $i \leadsto j$ (path from $i$ to $j$) if there is a nonzero probability that, starting at $i$, we can reach $j$ at some point in the future.

Definition 6: A state $i$ is called transient if there exists a state $j$ such that $i \leadsto j$ but $j \cancel{\leadsto} i$.

Definition 7: A state $i$ is called recurrent if for all states $j$ such that $i \leadsto j$, we also have $j \leadsto i$.

Intuitively, a transient state is one that the chain can leave and never return to, while a recurrent state is one the chain can always return to. The two conditions are opposites: a transient state is not recurrent, and a recurrent state is not transient. Note that if a state is recurrent, every state reachable from it is also recurrent. We think of such a set of recurrent states as a “class” or a “recurrent class”.

For example, consider Figure 3. In this Markov chain, states $c$, $g$, and $h$ are transient, while $a$ and $b$ form one recurrent class and $d$, $e$, and $f$ form another.

Figure 3: A Markov chain with $D = \{a, b, c, d, e, f, g, h\}$. The states $c$, $g$, and $h$ are transient, while there are two recurrent classes: $\{a, b\}$ and $\{d, e, f\}$.
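Definitions 5–7 suggest a mechanical procedure: compute reachability ($\leadsto$) as the transitive closure of the graph with an edge $i \to j$ whenever $p_{ij} > 0$, then check, for each state, whether every reachable state can reach back. Here is a minimal sketch; `classify_states` is a helper name of my own, and the example matrix is hypothetical (it happens to mirror the structure of Figure 4 below):

```python
import numpy as np

def classify_states(P):
    """Split states into (recurrent, transient) per Definitions 5-7."""
    n = P.shape[0]
    reach = P > 0  # reach[i, j]: i ~> j in one step
    for k in range(n):  # transitive closure (Floyd-Warshall on booleans)
        reach = reach | (reach[:, [k]] & reach[[k], :])
    recurrent = {i for i in range(n)
                 if all(reach[j, i] for j in range(n) if reach[i, j])}
    return recurrent, set(range(n)) - recurrent

# A small hypothetical chain: {a} and {c, d} are recurrent classes,
# and b is transient.
P = np.array([
    [1.0, 0.0, 0.0, 0.0],  # a: absorbing
    [0.5, 0.0, 0.5, 0.0],  # b: leaks into {a} or {c, d}
    [0.0, 0.0, 0.0, 1.0],  # c -> d
    [0.0, 0.0, 1.0, 0.0],  # d -> c
])
print(classify_states(P))  # ({0, 2, 3}, {1})
```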

As I mentioned in the preamble, defining ergodicity in terms of recurrent states is useful because we can sometimes just look at a Markov chain and know whether it is ergodic, e.g. the Markov chain in Figure 1 is ergodic, while the Markov chain in Figure 3 is not. We now have a common framework and notation to tackle Theorem 1.

Proof sketch

We first show the “only if” direction: if a Markov chain is periodic or contains more than one recurrent class, it cannot be ergodic. We conclude with the “if” direction, which contains the notion of coupling mentioned in the introduction.

“Only if” direction

Theorem 1 implies that there are two blockers to ergodicity. First, if a Markov chain has more than one recurrent class, then it is not ergodic. Second, if a Markov chain is periodic, then it is not ergodic. Consider the Markov chain with two recurrent classes in Figure 4.

Figure 4: A Markov chain with two recurrent classes, $\{a\}$ and $\{c,d\}$, and a single transient state $b$. The Markov chain cannot be ergodic because the long-term probability of being on a given state depends on the initial state.

The important thing to note is that it has two recurrent classes. If we start on state $a$, we’ll stay on $a$ forever. If we start in $\{c, d\}$, we’ll stay in this recurrent class forever. If we start on $b$, we’ll end up in one of the two recurrent classes. Therefore, the long-term probability of being on a particular state depends on the initial state. So having multiple recurrent classes is an obvious obstruction to ergodicity.
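We can see this numerically with the hypothetical matrix from the earlier sketch, which has the same structure as Figure 4: the rows of $\mathbf{P}^n$ never agree, so no initial-state-independent limit exists.

```python
import numpy as np

# Same hypothetical matrix as above (structure of Figure 4):
# {a} and {c, d} are recurrent classes; b is transient.
P = np.array([
    [1.0, 0.0, 0.0, 0.0],  # a
    [0.5, 0.0, 0.5, 0.0],  # b
    [0.0, 0.0, 0.0, 1.0],  # c
    [0.0, 0.0, 1.0, 0.0],  # d
])
print(np.linalg.matrix_power(P, 50))
# Row a concentrates on a; rows c and d stay inside {c, d}. The rows
# disagree, so P_i{X_n = j} depends on the initial state i: not ergodic.
```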

Next, consider the periodic Markov chain in Figure 5.

Figure 5: A periodic Markov chain with $D = \{a,b,c\}$.

The probability of being on a state at time $n$ is completely dependent upon the initial state. It is easy to see this if we write out two sequences of probabilities side-by-side:

$$ \begin{aligned} \mathbb{P}_a\{X_0 = a\} = 1 && \mathbb{P}_b\{X_0 = a\} = 0 \\ \mathbb{P}_a\{X_1 = a\} = 0 && \mathbb{P}_b\{X_1 = a\} = 0 \\ \mathbb{P}_a\{X_2 = a\} = 0 && \mathbb{P}_b\{X_2 = a\} = 1 \\ \mathbb{P}_a\{X_3 = a\} = 1 && \mathbb{P}_b\{X_3 = a\} = 0 \\ \mathbb{P}_a\{X_4 = a\} = 0 && \mathbb{P}_b\{X_4 = a\} = 0 \\ \mathbb{P}_a\{X_5 = a\} = 0 && \mathbb{P}_b\{X_5 = a\} = 1 \\ \mathbb{P}_a\{X_6 = a\} = 1 && \mathbb{P}_b\{X_6 = a\} = 0 \\ \mathbb{P}_a\{X_7 = a\} = 0 && \mathbb{P}_b\{X_7 = a\} = 0 \\ \mathbb{P}_a\{X_8 = a\} = 0 && \mathbb{P}_b\{X_8 = a\} = 1 \\ &\qquad\vdots \end{aligned} \tag{13} $$

Thus, periodicity is another obstruction to ergodicity, and so both aperiodicity and at most one recurrent class are necessary conditions for ergodicity. If a Markov chain is ergodic, it cannot be periodic or have more than one recurrent class. Perhaps surprisingly, these conditions are also sufficient.
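We can verify Eq. (13) numerically. Assuming Figure 5 depicts the deterministic cycle $a \to b \to c \to a$ (an assumption, but it is the structure consistent with Eq. 13), the iterates $\mathbf{P}^n$ never converge; they cycle with period three:

```python
import numpy as np

# Deterministic 3-cycle a -> b -> c -> a (assumed structure of Figure 5).
P = np.array([
    [0.0, 1.0, 0.0],  # a -> b
    [0.0, 0.0, 1.0],  # b -> c
    [1.0, 0.0, 0.0],  # c -> a
])

a, b = 0, 1
for n in range(9):
    Pn = np.linalg.matrix_power(P, n)
    # Reproduces the two columns of Eq. (13).
    print(f"n={n}:  P_a(X_n=a) = {Pn[a, a]:.0f}   P_b(X_n=a) = {Pn[b, a]:.0f}")
```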

Note that a Markov chain can be periodic even if it is random, meaning even if we cannot predict which state it will go to next given its current state. This is because periodicity refers to the chain being “forced” to switch between sets of states in a predictable manner:

Definition 8: A Markov chain is periodic if its states can be partitioned into $K \geq 2$ periodic classes $S_1, \dots, S_K$ such that $\mathbb{P}\{X_{n+1} \in S_{k+1} \mid X_n \in S_k \} = 1$, where the indices cycle, i.e. $S_{K+1} \triangleq S_1$.

For example, Figure 6 shows a $2$-periodic Markov chain. While a random walk on the chain would indeed be random, we can predict whether the chain will be in $S_1 \triangleq \{a, b\}$ or $S_2 \triangleq \{c, d\}$, conditional on the current set of states.

Figure 6: A $2$-periodic Markov chain with $D = \{a,b,c,d\}$. The chain is periodic since from states $\{a,b\}$ one must always go to states $\{c,d\}$.
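One standard way to compute a state’s period, which for chains like Figure 6 agrees with Definition 8’s partition view, is the greatest common divisor of all return times $n$ with $(\mathbf{P}^n)_{ii} > 0$; a chain is aperiodic when every period is one. A sketch for the chain in Figure 6, with hypothetical probabilities since the figure only fixes the structure:

```python
import numpy as np
from math import gcd
from functools import reduce

def period(P, i, n_max=24):
    """gcd of all n <= n_max with (P^n)_{ii} > 0."""
    returns, Pn = [], np.eye(len(P))
    for n in range(1, n_max + 1):
        Pn = Pn @ P
        if Pn[i, i] > 0:
            returns.append(n)
    return reduce(gcd, returns) if returns else 0

# Figure 6's structure: from {a, b} one must jump to {c, d}, and back.
# The individual probabilities are hypothetical.
P = np.array([
    [0.0, 0.0, 0.5, 0.5],  # a -> {c, d}
    [0.0, 0.0, 0.7, 0.3],  # b -> {c, d}
    [0.4, 0.6, 0.0, 0.0],  # c -> {a, b}
    [0.9, 0.1, 0.0, 0.0],  # d -> {a, b}
])
print([period(P, i) for i in range(4)])  # [2, 2, 2, 2]
```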

“If” direction

Let’s assume that we have an aperiodic Markov chain with one recurrent class. We need to show that the Markov chain “forgets” its initial state or

$$ \lim_{n \rightarrow \infty} \big| \mathbb{P}_i\{X_n = k\} - \mathbb{P}_j\{X_n = k\} \big| = 0 \tag{14} $$

for all initial states $i$ and $j$. To see this, we need Wolfgang Doeblin’s idea of coupling. Imagine an aperiodic Markov chain with a single recurrent class in which each state is a lily pad, and imagine that there are two frogs born on two separate lily pads (Figure 7).

Figure 7: An aperiodic Markov chain with states $D = \{a, b, c\}$ and a single recurrent class re-imagined as a lily pad field. Two frogs begin hopping around on the lily pad field from two different initial states, $a$ and $b$.

Initially, these two frogs hop around independently, moving according to the same transition matrix but with different initial conditions. However, the moment they first land on the same lily pad, they are coupled. That is, at each time, the next state is selected according to the transition probabilities of the Markov chain, and both frogs jump together to the next state. Thus, once they meet, they fall madly in love and go everywhere together. The time at which the frogs meet is the coupling time.

Let’s embrace heteronormativity for ease of notation, and say that one frog is a boy and the other is a girl. The trajectory of the boy frog is $\{B_n\}$ and of the girl frog $\{G_n\}$. For example, the movement of the frogs might look something like this,

$$ \begin{array}{|c|ccccccccc|} \hline n & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & \dots \\ \hline \{B_n\} & a & a & a & b & b & c & c & a & \dots \\ \{G_n\} & b & b & c & a & b & c & c & a & \dots \\ \hline \end{array} $$

where the coupling time is $n = 5$. On and after $n = 5$, the frogs are linked. Why are we doing this? This is a way of defining two different Markov chains, $\{B_n\}$ and $\{G_n\}$, with two different initial conditions on the same meta-Markov chain $\{X_n\}$. Imagine the boy frog starts on state $a$ and the girl frog on state $b$. Then note

$$ \begin{aligned} \mathbb{P}_a \{X_n = k\} &= \mathbb{P}\{B_n = k\}, \\ \mathbb{P}_b \{X_n = k\} &= \mathbb{P}\{G_n = k\}. \end{aligned} \tag{15} $$

Recall that $\mathbb{P}(A) = \mathbb{E}[\mathbf{1}_{A}]$. It follows that

$$ \begin{aligned} \big| \mathbb{P}_a \{X_n = k\} - \mathbb{P}_b \{X_n = k\} \big| &= \big| \mathbb{E}[\mathbf{1}_{B_n = k}] - \mathbb{E}[\mathbf{1}_{G_n = k}] \big| \\ &= \big| \mathbb{E}[\mathbf{1}_{B_n = k} - \mathbf{1}_{G_n = k}] \big| \\ &\stackrel{\star}{\leq} \mathbb{E}[|\mathbf{1}_{B_n = k} - \mathbf{1}_{G_n = k}|] \\ &\stackrel{\dagger}{\leq} \mathbb{E}[\mathbf{1}_{B_n \neq G_n}] \\ &= \mathbb{P}(B_n \neq G_n). \end{aligned} \tag{16} $$

Step $\star$ holds because of Jensen’s inequality. Step $\dagger$ holds because $|\mathbf{1}_{B_n = k} - \mathbf{1}_{G_n = k}| \leq \mathbf{1}_{B_n \neq G_n}$: the left-hand side is one precisely when the two indicators disagree, which can only happen when $B_n \neq G_n$, and is zero otherwise.

This mathematical statement is intuitive: the difference in probability between the two Markov chains $\{B_n\}$ and $\{G_n\}$ is upper bounded by the probability that the two frogs have not yet met. We can drive this difference to zero by showing that $\mathbb{P}(B_n \neq G_n) \rightarrow 0$ as $n \rightarrow \infty$. In words, the frogs will eventually meet. Intuitively, periodicity hurts ergodicity because it can force the two frogs to simply never meet.

So we need to show that the two frogs eventually meet. Note that the pair $\{(B_n, G_n)\}$ is itself a Markov chain, but on the state space $\{(i, j): i,j \in \{a,b,c\}\}$. In order to show that the frogs eventually meet, it suffices to show that every state $(i, j)$ where $i \neq j$ is a transient state of the Markov chain $\{(B_n, G_n)\}$. We know that once the two frogs are coupled, they are never uncoupled. That is,

$$ (k, k) \cancel{\leadsto} (i, j) \tag{17} $$

for all $i \neq j$. Therefore, to show that $(i, j)$ is transient, we just need to show that $(i, j) \leadsto (k, k)$ for some state $k$. This is a mathematical formulation of the idea that the frogs eventually meet.

Let $k$ be any recurrent state in the original Markov chain. Since there is only one recurrent class, we know that $i \leadsto k$ and $j \leadsto k$. If $(i, j) \leadsto (k, k)$ were to fail, we would have a weird situation: even though both frogs can reach $k$ infinitely often, they would never be there at the same time. The only way this can happen is if the frogs are always out of sync, that is, if the chain is periodic.
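We can make the pair chain concrete. For the two-state weather chain (hypothetical $p$ and $q$ as before), a sketch that builds the transition matrix of $\{(B_n, G_n)\}$ and checks, via reachability, that an off-diagonal state is transient:

```python
import numpy as np

# Pair chain (B_n, G_n) for the weather chain. Off-diagonal pairs move
# independently; diagonal pairs ("coupled" frogs) jump together.
p, q = 0.3, 0.4  # hypothetical values, as before
P = np.array([[1 - p, p], [q, 1 - q]])

m = 2
pairP = np.zeros((m * m, m * m))
for i in range(m):
    for j in range(m):
        for k in range(m):
            if i == j:  # coupled: both frogs jump to state k together
                pairP[i * m + j, k * m + k] = P[i, k]
            else:       # uncoupled: independent jumps to (k, l)
                for l in range(m):
                    pairP[i * m + j, k * m + l] = P[i, k] * P[j, l]

# Reachability (transitive closure) on the pair chain.
reach = pairP > 0
for k in range(m * m):
    reach = reach | (reach[:, [k]] & reach[[k], :])

rr, rs = 0, 1  # pair states (r, r) and (r, s)
print(reach[rs, rr], reach[rr, rs])
# True False: (r, s) reaches the diagonal but not vice versa (Eq. 17),
# so (r, s) is transient per Definition 6.
```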

Therefore, if the original Markov chain has only one recurrent class and is aperiodic, the two frogs must eventually meet in finite time. Thus, as $n \rightarrow \infty$, the Markov chain forgets its initial state and is therefore ergodic.
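Finally, here is a minimal simulation of Doeblin’s coupling for the weather chain (hypothetical $p$ and $q$ as before). The two frogs hop independently until they first meet, then move together; the estimated $\mathbb{P}(B_n \neq G_n)$, which by Eq. (16) bounds how far the chain is from forgetting its initial state, decays geometrically (with these values, like $0.54^n$):

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 0.3, 0.4  # hypothetical values, as before
P = np.array([[1 - p, p], [q, 1 - q]])

def uncoupled_prob(n, trials=10_000):
    """Monte Carlo estimate of P(B_n != G_n), frogs started at r and s."""
    count = 0
    for _ in range(trials):
        b, g = 0, 1  # boy frog on r, girl frog on s
        for _ in range(n):
            if b == g:  # coupled: jump together forever after
                b = g = rng.choice(2, p=P[b])
            else:       # not yet met: independent jumps
                b = rng.choice(2, p=P[b])
                g = rng.choice(2, p=P[g])
        count += int(b != g)
    return count / trials

# The bound in Eq. (16) decays to zero: the chain forgets its start.
for n in [1, 2, 4, 8]:
    print(n, uncoupled_prob(n))
```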


Acknowledgements

This blog post relies heavily on definitions, notation, and intuition (the story of two frogs) provided by Ramon van Handel in Princeton’s course “Probability and Random Processes”.


Appendix

A1. Wolfgang Doeblin

Wolfgang Doeblin was born in Berlin in 1915 to Alfred and Erna Döblin. His father was a famous writer. The family was German-Jewish, and Alfred and his family, save Wolfgang, fled Berlin a few days after the Reichstag fire in February 1933. Wolfgang stayed behind until April to complete his gymnasium studies. The family was reunited and settled in Paris later that year, and Wolfgang continued his mathematical studies at the Sorbonne.

In the summer of 1936, Doeblin submitted his work on coupling (Doeblin, 1939) for publication. In March 1938, Doeblin defended his thesis, more or less (Doeblin, 1938), and was drafted into the French army in November. After the start of World War II in September 1939, Doeblin was moved into active duty. In April 1940, Doeblin’s last mathematical note was filed with l’Académie des Sciences by Émile Borel. The note was submitted as a pli cacheté (sealed envelope), which was designed to remain sealed, granting its author intellectual priority if necessary without disclosing the work.

In 1940, Doeblin lost contact with his company while on a mission to the village of Housseras. With Nazi troops a few minutes away, Doeblin burned his mathematical notes and took his own life. Doeblin’s pli cacheté was opened in 2000, revealing that he had already proven a version of what is now known as Itô’s lemma for random processes.

A2. Path probabilities

$$ \begin{aligned} &\mathbb{P}_a\{X_1 = b, X_2 = b, X_3 = d, X_4 = c\} \\ &= \mathbb{P}_a\{X_4 = c \mid X_1 = b, X_2 = b, X_3 = d\} \, \mathbb{P}_a\{X_1 = b, X_2 = b, X_3 = d\} \\ &= \mathbb{P}_a\{X_4 = c \mid X_3 = d\} \, \mathbb{P}_a\{X_1 = b, X_2 = b, X_3 = d\} \\ &= \mathbb{P}_a\{X_4 = c \mid X_3 = d\} \, \mathbb{P}_a\{X_3 = d \mid X_1 = b, X_2 = b\} \, \mathbb{P}_a\{X_1 = b, X_2 = b\} \\ &= \mathbb{P}_a\{X_4 = c \mid X_3 = d\} \, \mathbb{P}_a\{X_3 = d \mid X_2 = b\} \, \mathbb{P}_a\{X_1 = b, X_2 = b\} \\ &= \mathbb{P}_a\{X_4 = c \mid X_3 = d\} \, \mathbb{P}_a\{X_3 = d \mid X_2 = b\} \, \mathbb{P}_a\{X_2 = b \mid X_1 = b\} \, \mathbb{P}_a\{X_1 = b\} \\ &= p_{ab} \; p_{bb} \; p_{bd} \; p_{dc}. \end{aligned} \tag{A2.1} $$

  1. Doeblin, W. (1939). Sur les sommes d’un grand nombre de variables aléatoires indépendantes. Bull. Sci. Math., 63(2), 23–32.
  2. Doeblin, W. (1938). Sur les propriétés asymptotiques de mouvements régis par certains types de chaînes simples... [PhD thesis]. Univ. de Paris.