This blog post is a reading note on the following paper:
[1] Taylor, Adrien, Bryan Van Scoy, and Laurent Lessard. "Lyapunov functions for first-order methods: Tight automated convergence guarantees." International Conference on Machine Learning. PMLR, 2018.
Background
A key technique for proving the convergence of optimization algorithms is the use of a Lyapunov function (also called a potential function). Such a function is designed to decrease (or contract) at every iteration, thereby establishing the algorithm’s convergence rate. However, constructing a suitable Lyapunov function is often nontrivial and typically requires expertise.
Main idea of the paper. Paper [1] is among the earliest works on automated Lyapunov function construction. It studies first-order methods (FOMs), including gradient descent, the heavy-ball method, and Nesterov’s accelerated gradient method, for minimizing $L$-smooth and $\mu$-strongly convex functions. The authors propose a semidefinite programming (SDP) formulation that automatically generates Lyapunov functions and provides linear convergence guarantees.
Scope of this blog. The paper presents the SDP formulation directly, followed by a proof of its equivalence to the existence of a valid Lyapunov function. However, I found the derivation of the SDP formulation somewhat non-intuitive. To better understand it, I attempted to reconstruct the derivation in the simplest setting: gradient descent with step size $\alpha$,
\[x_1 = x_0 - \alpha \nabla f (x_0) .\]
About potential functions. The potential function is defined on the state of the algorithm,
\[\xi_k = (\mathbf{x}_k, \mathbf{g}_k, \mathbf{f}_k) := (x_k - x_{\star}, \nabla f (x_k), f (x_k) - f (x_{\star})),\]and $\xi_{\star} = (0, 0, 0)$. The paper considers quadratic potential functions of the form
\[\mathcal{V} (\xi_k) = \left[ \begin{array}{l} \mathbf{x}_k\\ \mathbf{g}_k \end{array} \right]^{\top} (P \otimes I_d) \left[ \begin{array}{l} \mathbf{x}_k\\ \mathbf{g}_k \end{array} \right] + q \mathbf{f}_k,\]where $P \in \mathbb{S}^2$ and $q \in \mathbb{R}$ are the coefficients to be determined, and $I_d$ denotes the $d \times d$ identity matrix. To prove linear convergence for gradient descent with rate $0 \leq \rho < 1$, the potential function must satisfy:
(1) $\mathcal{V} (\xi) \geq 0$ for all $\xi$,
(2) $\mathcal{V} (\xi) = 0$ if and only if $\xi = \xi_{\star}$,
(3) $\mathcal{V} (\xi) \rightarrow \infty$ as $| \xi | \rightarrow \infty$,
(4) $\mathcal{V} (\xi_{k + 1}) \leq \rho^2 \mathcal{V} (\xi_k)$ for all $k$.
Under these conditions, linear convergence follows:
\[\| x_k - x_{\star} \| = \mathcal{O} (\rho^k), \quad \| \nabla f (x_k) \| = \mathcal{O} (\rho^k), \quad f (x_k) - f_{\star} = \mathcal{O} (\rho^{2 k}) .\]
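To see how these rates follow, here is a short derivation sketch; the strict positivity assumptions ($P \succ 0$ and $q > 0$) are mine, made to keep the argument elementary, while the paper works directly with conditions (1)-(3). Applying condition (4) repeatedly gives
\[\mathcal{V} (\xi_k) \;\leq\; \rho^2\, \mathcal{V} (\xi_{k - 1}) \;\leq\; \cdots \;\leq\; \rho^{2 k}\, \mathcal{V} (\xi_0) .\]If moreover $P \succ 0$ and $q > 0$, then $\lambda_{\min} (P) \left( \| \mathbf{x}_k \|^2 + \| \mathbf{g}_k \|^2 \right) \leq \mathcal{V} (\xi_k)$ (the term $q \mathbf{f}_k$ is nonnegative) and $q \mathbf{f}_k \leq \mathcal{V} (\xi_k)$ (the quadratic part is nonnegative), so
\[\| \mathbf{x}_k \|, \ \| \mathbf{g}_k \| \;\leq\; \sqrt{\mathcal{V} (\xi_0) / \lambda_{\min} (P)}\, \rho^k, \qquad \mathbf{f}_k \;\leq\; \mathcal{V} (\xi_0)\, \rho^{2 k} / q .\]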
Given rate $\rho$, find the potential function
The problem is to find a feasible $\mathcal{V} (\xi)$. Formally,
\[\text{find } P, q, \quad \text{subject to conditions (1)-(4) being satisfied}\]
Translating condition (4)
Condition (4) requires,
\[\left[ \begin{array}{l} \mathbf{x}_1\\ \mathbf{g}_1 \end{array} \right]^{\top} (P \otimes I_d) \left[ \begin{array}{l} \mathbf{x}_1\\ \mathbf{g}_1 \end{array} \right] + q \mathbf{f}_1 \leq \rho^2 \left[ \begin{array}{l} \mathbf{x}_0\\ \mathbf{g}_0 \end{array} \right]^{\top} (P \otimes I_d) \left[ \begin{array}{l} \mathbf{x}_0\\ \mathbf{g}_0 \end{array} \right] + \rho^2 q \mathbf{f}_0.\]Since the left- and right-hand sides are defined over different variables, we rewrite the inequality by introducing the stacked vector
\[\mathbf{b} = \begin{bmatrix} \mathbf{x}_0 \\ \mathbf{g}_0 \\ \mathbf{g}_1 \\ \mathbf{f}_0 \\ \mathbf{f}_1 \end{bmatrix}.\]With appropriate selector matrices, we can write
\[\begin{aligned} \begin{bmatrix}\mathbf{x}_1 \\ \mathbf{g}_1 \end{bmatrix} &= \Bigl( \begin{bmatrix} I^x_1 \\ I^g_1 \end{bmatrix} \otimes I_d \Bigr)\mathbf{b}, &\quad \mathbf{f}_1 &= I^f_1 \mathbf{b}, \\ \begin{bmatrix}\mathbf{x}_0 \\ \mathbf{g}_0 \end{bmatrix} &= \Bigl( \begin{bmatrix} I^x_0 \\ I^g_0 \end{bmatrix} \otimes I_d \Bigr)\mathbf{b}, &\quad \mathbf{f}_0 &= I^f_0 \mathbf{b}. \end{aligned}\]Therefore, condition (4) is equivalent to
\[\mathbf{b}^{\top} \Biggl( \Bigl( \rho^2 \begin{bmatrix} I^x_0 \\ I^g_0 \end{bmatrix}^{\top} P \begin{bmatrix} I^x_0 \\ I^g_0 \end{bmatrix} - \begin{bmatrix} I^x_1 \\ I^g_1 \end{bmatrix}^{\top} P \begin{bmatrix} I^x_1 \\ I^g_1 \end{bmatrix} \Bigr)\otimes I_d \Biggr)\mathbf{b} + (\rho^2 q I^f_0 - q I^f_1)\mathbf{b} \geq 0. \tag{C4}\]
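To make the selector matrices concrete for gradient descent, note that the update gives $\mathbf{x}_1 = \mathbf{x}_0 - \alpha\, \mathbf{g}_0$, so every quantity appearing above is a linear function of $\mathbf{b}$. With the ordering of $\mathbf{b}$ used here, one explicit choice (my reconstruction) is
\[I^x_0 = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \end{bmatrix}, \quad I^g_0 = \begin{bmatrix} 0 & 1 & 0 & 0 & 0 \end{bmatrix}, \quad I^g_1 = \begin{bmatrix} 0 & 0 & 1 & 0 & 0 \end{bmatrix},\]\[I^x_1 = \begin{bmatrix} 1 & - \alpha & 0 & 0 & 0 \end{bmatrix}, \quad I^f_0 = \begin{bmatrix} 0 & 0 & 0 & 1 & 0 \end{bmatrix}, \quad I^f_1 = \begin{bmatrix} 0 & 0 & 0 & 0 & 1 \end{bmatrix}.\]The only nontrivial row is $I^x_1$, which encodes the gradient-descent update; this is the one place where the algorithm enters the formulation. The selectors for the optimum, $I^x_{\star}$, $I^g_{\star}$, $I^f_{\star}$, are zero rows because $\xi_{\star} = (0, 0, 0)$.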
Using interpolation conditions. Inequality (C4) is difficult to verify directly. However, it becomes tractable once we use the interpolation properties of the function class. So far, we have not exploited the fact that the points $(x_0,g_0,f_0)$ and $(x_1,g_1,f_1)$ must lie on the same $L$-smooth, $\mu$-strongly convex function.
From interpolation theory, the following are equivalent:
- There exists $f \in \mathcal{F}_{\mu,L}$ such that
\[g_k = \nabla f(x_k), \quad f_k = f(x_k), \quad \forall k \in \{0,1,\star\},\]where $g_{\star} = 0$, so that $x_{\star}$ is the minimizer of $f$;
- For all $i,j \in \{0,1,\star\}$, the quadratic inequalities hold:
\[\phi_{ij} := (L-\mu)(f_i - f_j) + \begin{bmatrix} x_i \\ x_j \\ g_i \\ g_j \end{bmatrix}^{\top} (M^1 \otimes I_d) \begin{bmatrix} x_i \\ x_j \\ g_i \\ g_j \end{bmatrix} \geq 0, \tag{I}\]where $M^1 \in \mathbb{S}^4$ is a fixed symmetric matrix depending only on $L,\mu$.
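For reference, the inequality behind (I) is the smooth strongly convex interpolation condition of Taylor, Hendrickx, and Glineur, on which paper [1] builds: for each pair $i, j$,
\[f_i \;\geq\; f_j + \langle g_j, x_i - x_j \rangle + \frac{1}{2 L} \| g_i - g_j \|^2 + \frac{\mu}{2 (1 - \mu / L)} \left\| x_i - x_j - \tfrac{1}{L} (g_i - g_j) \right\|^2 .\]Multiplying through by $L - \mu$ and collecting the quadratic terms in $(x_i, x_j, g_i, g_j)$ gives the matrix $M^1$; the $(L - \mu)$ rescaling is my reading of the normalization used in (I).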
(I) can be expressed in terms of $\mathbf{b}$ as
\[\mathbf{b}^{\top} \Bigl( \underbrace{ \begin{bmatrix} I^x_i \\ I^x_j \\ I^g_i \\ I^g_j \end{bmatrix}^{\top} M^1 \begin{bmatrix} I^x_i \\ I^x_j \\ I^g_i \\ I^g_j \end{bmatrix} }_{M^1_{ij}} \otimes I_d \Bigr)\mathbf{b} + \underbrace{(L-\mu)(I^f_i - I^f_j)}_{m_{ij}}\mathbf{b} \geq 0.\]
The SDP formulation. Paper [1] shows that inequality (C4) holds for every $\mathbf{b}$ generated by some $f \in \mathcal{F}_{\mu,L}$ if and only if there exist nonnegative multipliers $\eta_{ij} \geq 0$, $i,j \in \mathcal{I} := \{0,1,\star\}$, such that
\[\begin{aligned} \sum_{i,j \in \mathcal{I}} \eta_{ij} M^1_{ij} &\;\preceq\; \rho^2 \begin{bmatrix} I^x_0 \\ I^g_0 \end{bmatrix}^{\top} P \begin{bmatrix} I^x_0 \\ I^g_0 \end{bmatrix} - \begin{bmatrix} I^x_1 \\ I^g_1 \end{bmatrix}^{\top} P \begin{bmatrix} I^x_1 \\ I^g_1 \end{bmatrix}, \\ \sum_{i,j \in \mathcal{I}} \eta_{ij} m_{ij} &\;\leq\; \rho^2 q I^f_0 - q I^f_1, \end{aligned} \tag{SDP-1}\]where the second inequality is understood entrywise.
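The "if" direction is worth spelling out. Write $Q$ for the matrix on the right-hand side of the first inequality in (SDP-1) and $\ell := \rho^2 q I^f_0 - q I^f_1$ for the row vector on the right-hand side of the second (shorthand introduced here only for brevity). For any $\mathbf{b}$ generated by some $f \in \mathcal{F}_{\mu,L}$, every $\phi_{ij}$ is nonnegative and $\mathbf{f}_0, \mathbf{f}_1 \geq 0$, so
\[\mathbf{b}^{\top} (Q \otimes I_d) \mathbf{b} + \ell\, \mathbf{b} \;\geq\; \sum_{i,j \in \mathcal{I}} \eta_{ij} \Bigl( \mathbf{b}^{\top} (M^1_{ij} \otimes I_d) \mathbf{b} + m_{ij} \mathbf{b} \Bigr) \;=\; \sum_{i,j \in \mathcal{I}} \eta_{ij}\, \phi_{ij} \;\geq\; 0,\]where the quadratic part uses $\sum_{i,j} \eta_{ij} M^1_{ij} \preceq Q$ and the linear part uses the entrywise inequality together with the fact that $\ell$ and the $m_{ij}$ act only on the nonnegative entries $\mathbf{f}_0, \mathbf{f}_1$ of $\mathbf{b}$. This recovers (C4); the converse direction is the nontrivial part established in the paper.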
Translating condition (1)
Condition (1) requires
\[\begin{bmatrix} \mathbf{x}_0 \\ \mathbf{g}_0 \end{bmatrix}^{\top} (P \otimes I_d) \begin{bmatrix} \mathbf{x}_0 \\ \mathbf{g}_0 \end{bmatrix} + q \mathbf{f}_0 \;\;\geq\; 0 \tag{C1}\]for every state $(\mathbf{x}_0, \mathbf{g}_0, \mathbf{f}_0)$ generated by some $f \in \mathcal{F}_{\mu,L}$. Using interpolation between $(x_0, g_0, f_0)$ and the reference point $(x_{\star}, g_{\star}, f_{\star}) = (0,0,0)$, we have
\[\phi := (L-\mu)\, \mathbf{f}_0 + \begin{bmatrix} \mathbf{x}_0 \\ \mathbf{g}_0 \end{bmatrix}^{\top} (M^0 \otimes I_d) \begin{bmatrix} \mathbf{x}_0 \\ \mathbf{g}_0 \end{bmatrix} \;\;\geq 0,\]where $M^0 \in \mathbb{S}^2$ is a fixed symmetric matrix depending only on $L,\mu$.
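Specializing the interpolation inequality quoted earlier to the pair $(0, \star)$, with $(\mathbf{x}_{\star}, \mathbf{g}_{\star}, \mathbf{f}_{\star}) = (0, 0, 0)$, the quadratic part works out to (my computation)
\[M^0 = \begin{bmatrix} - \tfrac{\mu L}{2} & \tfrac{\mu}{2} \\ \tfrac{\mu}{2} & - \tfrac{1}{2} \end{bmatrix}, \qquad \phi = (L - \mu)\, \mathbf{f}_0 - \tfrac{\mu L}{2} \| \mathbf{x}_0 \|^2 + \mu\, \langle \mathbf{g}_0, \mathbf{x}_0 \rangle - \tfrac{1}{2} \| \mathbf{g}_0 \|^2 \;\geq\; 0 .\]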
Therefore, inequality (C1) holds if there exists $\lambda \geq 0$ such that
\[\begin{aligned} \lambda M^0 &\;\preceq\; P, \\ \lambda (L-\mu) &\;\leq\; q. \end{aligned}\tag{SDP-2}\]
Conditions (2) and (3) are automatically ensured when (SDP-1) and (SDP-2) hold.
The complete SDP formulation
Putting everything together: given a contraction rate $\rho$, the SDP for finding a quadratic potential function is
($\rho$-SDP): Find $P \in \mathbb{S}^2$, $q \in \mathbb{R}$, multipliers $\eta_{ij} \geq 0$ for $i,j \in \{0,1,\star\}$, and $\lambda \geq 0$ such that
\[\begin{aligned} 0 &\;\preceq\; \rho^2 \begin{bmatrix} I^x_0 \\ I^g_0 \end{bmatrix}^{\top} P \begin{bmatrix} I^x_0 \\ I^g_0 \end{bmatrix} - \begin{bmatrix} I^x_1 \\ I^g_1 \end{bmatrix}^{\top} P \begin{bmatrix} I^x_1 \\ I^g_1 \end{bmatrix} - \sum_{i,j \in \mathcal{I}} \eta_{ij} M^1_{ij}, \\ 0 &\;\leq\; (\rho^2 q I^f_0 - q I^f_1) - \sum_{i,j \in \mathcal{I}} \eta_{ij} m_{ij}, \\ 0 &\;\preceq\; P - \lambda M^0, \\ 0 &\;\leq\; q - \lambda (L-\mu). \end{aligned}\]
Paper [1] proves that a quadratic potential function with contraction rate $\rho$ exists if and only if this SDP is feasible.
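To make the construction concrete, below is a small feasibility check written in Python with CVXPY. Everything in it is my own reconstruction for illustration: the interpolation matrices are assembled from the inequality quoted earlier, the selectors are the ones written out above, the step size is fixed to $\alpha = 1/L$, and the normalization $q \geq 1$ is added only to exclude the trivial solution $P = 0$, $q = 0$, which, as far as I can tell, otherwise satisfies all four constraints with $\eta_{ij} = \lambda = 0$; none of these choices come from the paper.

```python
import itertools
import numpy as np
import cvxpy as cp

# Illustrative problem data (my choices): L-smooth, mu-strongly convex,
# gradient descent with step size alpha = 1/L.
L, mu = 1.0, 0.1
alpha = 1.0 / L

def psd(Z):
    """PSD constraint on the symmetric part of a square expression."""
    return 0.5 * (Z + Z.T) >> 0

def interp_quad(Xi, Xj, Gi, Gj):
    """Quadratic part of the (mu, L)-interpolation inequality
    (L - mu)(f_i - f_j) + [.]^T M [.] >= 0, assembled from the row
    selectors Xi, Xj, Gi, Gj (my reconstruction of M^1 / M^0)."""
    dX, dG = Xi - Xj, Gi - Gj
    M = (-(L - mu) * Gj.T @ dX
         - 0.5 * dG.T @ dG
         - 0.5 * mu * L * dX.T @ dX
         + mu * dG.T @ dX)
    return 0.5 * (M + M.T)

# Row selectors on b = (x0, g0, g1, f0, f1); the optimum maps to zero rows.
e = np.eye(5)
Ix = {0: e[[0]], 1: e[[0]] - alpha * e[[1]], 's': np.zeros((1, 5))}
Ig = {0: e[[1]], 1: e[[2]],                  's': np.zeros((1, 5))}
If = {0: e[[3]], 1: e[[4]],                  's': np.zeros((1, 5))}

def rho_sdp_feasible(rho):
    """Feasibility of (rho-SDP) for gradient descent (a sketch)."""
    pairs = [(i, j) for i, j in itertools.product([0, 1, 's'], repeat=2) if i != j]
    P = cp.Variable((2, 2), symmetric=True)
    q = cp.Variable()
    lam = cp.Variable(nonneg=True)
    eta = {p: cp.Variable(nonneg=True) for p in pairs}

    N0 = np.vstack([Ix[0], Ig[0]])   # maps b to (x0, g0)
    N1 = np.vstack([Ix[1], Ig[1]])   # maps b to (x1, g1)

    # Pieces of condition (4) / (SDP-1).
    Q = rho**2 * N0.T @ P @ N0 - N1.T @ P @ N1
    ell = rho**2 * q * If[0] - q * If[1]
    sum_M = sum(eta[i, j] * interp_quad(Ix[i], Ix[j], Ig[i], Ig[j]) for i, j in pairs)
    sum_m = sum(eta[i, j] * (L - mu) * (If[i] - If[j]) for i, j in pairs)

    # Piece of condition (1) / (SDP-2): interpolation of (x0, g0, f0) with the optimum.
    M0 = interp_quad(np.eye(2)[[0]], np.zeros((1, 2)),
                     np.eye(2)[[1]], np.zeros((1, 2)))

    constraints = [
        psd(Q - sum_M),            # matrix inequality of (SDP-1)
        ell - sum_m >= 0,          # entrywise inequality of (SDP-1)
        psd(P - lam * M0),         # matrix inequality of (SDP-2)
        q - lam * (L - mu) >= 0,   # scalar inequality of (SDP-2)
        q >= 1,                    # my normalization: rules out V = 0 (not in the paper)
    ]
    problem = cp.Problem(cp.Minimize(0), constraints)
    problem.solve(solver=cp.SCS)   # SCS ships with CVXPY and handles SDPs
    return problem.status in (cp.OPTIMAL, cp.OPTIMAL_INACCURATE)
```

With these choices one would expect ($\rho$-SDP) to become feasible around $\rho = 1 - \mu / L$, the classical contraction rate of gradient descent with $\alpha = 1/L$ on this function class, but I have not verified the numerics, so treat the sketch as indicative rather than as the paper's implementation.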
Find $\rho$
Given $\rho$, we solve ($\rho$-SDP):
- If the problem is feasible, the algorithm admits a Lyapunov function with contraction rate at most $\rho$.
- If the problem is infeasible, no quadratic potential function of this form certifies contraction at rate $\rho$.
To obtain the tightest convergence guarantee, we perform a bisection search on the interval $[0,1]$. Starting from the midpoint, we test feasibility:
- if feasible, we continue the search on the left subinterval;
- if infeasible, we continue on the right subinterval.
The procedure terminates once the interval is smaller than a prescribed tolerance, yielding (up to that tolerance) the smallest $\rho$ for which ($\rho$-SDP) is feasible; a short sketch of this bisection is given below.
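A minimal sketch of the bisection loop, assuming a feasibility oracle such as the `rho_sdp_feasible` function above and assuming, as the procedure described here does, that feasibility is monotone in $\rho$:

```python
def bisect_rate(feasible, tol=1e-3):
    """Bisection on [0, 1] for the smallest rho with feasible(rho) True,
    assuming feasibility is monotone in rho (larger rho is easier)."""
    lo, hi = 0.0, 1.0
    if not feasible(hi):
        return None              # not even rho = 1 can be certified
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if feasible(mid):
            hi = mid             # feasible: keep searching to the left
        else:
            lo = mid             # infeasible: move to the right
    return hi                    # smallest certified rho, up to tol

# Example usage with the sketch above:
# rho_star = bisect_rate(rho_sdp_feasible)
```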