This is the first post in a series of notes taken from the Stochastic Systems class (MS&E 321) taught by Professor Peter Glynn. This series will be updated as the course progresses.
A fundamental property of Markov chains is that their long-run behavior “forgets” the initial state. More precisely, for a finite, irreducible, aperiodic Markov chain with transition matrix $P$, the $n$-step transition probabilities $P^n(x,y)$ converge to a stationary distribution $\pi(y)$ at a geometric rate. The key idea behind the proof is a clean algebraic decomposition $P^m = \delta \Lambda + (1-\delta)Q$, which separates a “memoryless” rank-1 component $\Lambda$ from a remainder $Q$. This decomposition admits a natural probabilistic interpretation via coupling, and extends to non-stationary transitions, infinite state spaces, and continuous state spaces through Doeblin’s condition.
Setup
Consider a Markov chain (MC) with:
- Finite state space: $\lvert S\rvert = d < \infty$
- Stationary transition matrix $P$
- $P$ is irreducible and aperiodic
We begin with a technical lemma that connects irreducibility and aperiodicity to a uniform positivity condition on $P^m$.
Lemma 1. Let $P$ be the transition matrix of an irreducible Markov chain on a finite state space $S$. Then $P$ is aperiodic if and only if there exists $m \geq 1$ such that $P^m(x,y) > 0$ for all $x, y \in S$.
Remark. This lemma guarantees the existence of $m$ such that $P^m \gg 0$ (all entries strictly positive), which is the starting point of the convergence analysis below. We will revisit the necessity of finiteness in Lemma 1 at the end of this note.
Convergence Analysis
We first consider the case $m = 1$, i.e., $P \gg 0$.
1. Decomposition of $P$
Let $\Lambda$ be a stochastic matrix ($\Lambda \mathbf{e} = \mathbf{e}$) with identical rows, i.e., $\Lambda = \begin{bmatrix} \lambda \\ \vdots \\ \lambda \end{bmatrix}$ where $\lambda\ \mathbf{e}= 1$.
Since $P \gg 0$, there exists $\delta \in (0,1)$ such that $P \geq \delta \Lambda$. Define
\[P = \delta \Lambda + (1 - \delta) Q, \qquad Q = \frac{P - \delta \Lambda}{1 - \delta}.\]Then $Q \geq 0$, and $Q$ is stochastic: $Q\mathbf{e} = \frac{P\mathbf{e} - \delta \Lambda\mathbf{e}}{1 - \delta}= \mathbf{e}$.
Remark. The parameter $\delta$ measures how much of $P$ can be “explained” by the memoryless component $\Lambda$. A larger $\delta$ means faster mixing.
2. Computing powers of $P$
A key property: $R\Lambda = \Lambda$ for any stochastic matrix $R$, since $R\Lambda = R \mathbf{e} \lambda = \mathbf{e} \lambda = \Lambda$.
Expanding $P^2$:
\[\begin{aligned} P^2 &= P(\delta \Lambda + (1 - \delta)Q) \\ &= \delta \Lambda + (1 - \delta)PQ \\ &= \delta \Lambda + (1 - \delta)(\delta \Lambda + (1 - \delta)Q)Q \\ &= \delta \Lambda + (1 - \delta)\delta \Lambda Q + (1 - \delta)^2 Q^2 \end{aligned}\]By induction, the general formula is:
\[P^n = \sum_{j=0}^{n-1}\delta(1 - \delta)^{j}\Lambda Q^{j} + (1-\delta)^n Q^n.\]As $n \to \infty$, the remainder $(1-\delta)^n Q^n \to 0$, and:
\[P^n \to \sum_{j=0}^{\infty} \delta(1 - \delta)^j \Lambda Q^j \quad \triangleq \quad \Pi.\]3. Structure of $\Pi$
- $\Pi$ is stochastic: since $Q^j\mathbf{e} = \mathbf{e}$ for all $j$, we have $\Lambda Q^j\mathbf{e} = \Lambda\mathbf{e} = \mathbf{e}$, and the geometric series preserves row sums.
- $\Pi$ has identical rows: since $\Lambda Q^j= \mathbf{e} (\lambda Q^j)$, every term in the series has identical rows.
Therefore $\Pi$ is a rank-1 stochastic matrix whose common row is the stationary distribution $\pi$.
4. Rate of convergence
Define the matrix operator norm induced by the $\ell_\infty$ vector norm. For any stochastic matrix $P$, $\lVert P\rVert = 1$. Then:
\[\lVert P^n - \Pi\rVert = O\!\left((1 - \delta)^n\right) \qquad \text{(rate of mixing)}\]5. Generalization to $m > 1$
By Lemma 1, for a finite irreducible aperiodic chain, there exists $m$ such that $P^m \gg 0$. Applying the $m=1$ argument to $P^m$ yields
\[P^{m\,k} \to \Pi := \sum_{j=0}^\infty \delta(1-\delta)^j \Lambda Q^j, \qquad \lVert\,P^{m\,k}-\Pi\,\rVert = O\!\left((1-\delta)^k\right).\]For general $n$, write $n=km+r$ with $0\le r<m$. Then $P^n = (P^m)^k P^r$. Since $\Pi$ has identical rows, $\Pi P^r = \Pi$, so:
\[\lVert P^n-\Pi\rVert = \lVert(P^{mk}-\Pi)P^r\rVert \leq \lVert P^{mk}-\Pi\rVert= O\!\left((1-\delta)^k\right).\]Generalizations and Doeblin’s Condition
The convergence analysis relies entirely on the decomposition $P^m = \delta \Lambda + (1 - \delta) Q$. This decomposition admits a natural probabilistic interpretation and extends to several broader settings.
Probabilistic Interpretation (Coupling). At each block of $m$ steps, with probability $\delta$, the chain “forgets” its past and transitions according to the memoryless kernel $\Lambda$; with probability $1-\delta$, it follows the residual kernel $Q$. The number of blocks until the first “reset” is geometric with parameter $\delta$, which directly gives the exponential mixing rate.
Non-Stationary Transitions. Transition matrices $P(1), P(2), \ldots$ may vary over time, so we do not expect a stationary equilibrium. However, if a uniform lower bound holds:
\[P(i, x, y) \geq \delta \lambda(y), \quad \forall\, x \in S,\; i \geq 1,\]then each $P(i) = \delta \Lambda + (1 - \delta) Q_i$, and the same telescoping argument applies:
\[\begin{aligned} P(1)P(2) &= P(1)\bigl(\delta \Lambda + (1 - \delta)Q_2\bigr) = \delta \Lambda + (1 - \delta) P(1) Q_2 \\ &= \delta \Lambda + (1 - \delta)\delta \Lambda Q_2 + (1 - \delta)^2 Q_1 Q_2. \end{aligned}\]In general, $\lVert P(1) \cdots P(n) - \text{(rank-1)}\rVert \to 0$ at a geometric rate, even without stationarity.
Infinite Discrete State Space. For $\lvert S\rvert = \infty$, Lemma 1 may fail (see discussion below), so the decomposition $P^m \gg 0$ cannot be guaranteed. One must instead directly assume Doeblin’s condition: there exist $m$, $\delta > 0$, and a probability measure $\lambda$ such that $P^m(x, \cdot) \geq \delta \lambda(\cdot)$ for all $x \in S$.
Continuous State Space. Doeblin’s condition generalizes naturally: assume there exist $m$, $\delta > 0$, and a probability measure $\lambda$ on $(S, \mathcal{B})$ such that
\[\mathbb{P}_x(X_m \in B) \geq \delta \lambda(B), \quad \forall\, x \in S,\; B \in \mathcal{B}.\]Then $\lvert\mathbb{P}_x(X_n \in B) - \pi(B)\rvert = O(r^n)$ for some $r < 1$, where $\pi$ is the unique stationary measure.
Discussion on Lemma 1
Recall Lemma 1: $P$ is irreducible and aperiodic on a finite state space if and only if $\exists\, m$ such that $P^m(x,y) > 0$ for all $x,y$.
This equivalence relies critically on finiteness of the state space. For infinite state spaces, an irreducible aperiodic chain may have no $m$ for which $P^m \gg 0$.
Counterexample: Birth-death chain. Consider the chain on $\lbrace 0, 1, 2, \ldots\rbrace$ with nearest-neighbor transitions. Starting from state $s$, in any finite number of steps $m$, the chain can reach at most state $s + m$, so $P^m(s, s+m+1) = 0$ for all $m$. Hence no single $m$ makes $P^m$ uniformly positive, even though the chain is irreducible and aperiodic.
This is precisely why Doeblin’s condition must be imposed as an additional assumption in the infinite state space setting.