High Probability Analysis for SGD

Long time no see! This one is the longest post I have written so far—so grab a drink, it will take a little time to read! For better readability, you can refer to the pdf version.

I am learning how to derive high probability bounds for stochastic optimization algorithms. This is my learning note when reading the following wonderful papers:

Attia, Amit, and Tomer Koren. ‘‘A note on high-probability analysis of algorithms with exponential, sub-gaussian, and general light tails.”arXiv preprint arXiv:2403.02873(2024).
Attia, Amit, and Tomer Koren. ‘‘SGD with AdaGrad stepsizes: Full adaptivity with high probability to unknown parameters, unbounded gradients and affine variance.”International Conference on Machine Learning. PMLR, 2023.

As these notes are not targeted at SGD, I reorganized the problem settings and completed the proofs for SGD under different assumptions by adapting proof techniques in these notes.

Problem Setup

Consider the following optimization problem:

\[\min_{x \in \mathcal{X}} f (x)\]

We may have different assumptions for the problem. In this note, we assume $f$ is convex, and admits a minimizer $x_{\star}$, and we consider the following two settings.

F1. (Lipschitz) Let $\mathcal{X}$ be a convex set with diameter $D$. $f$ is $G$-Lipschitz.
F2. (smooth) $\mathcal{X}$ is $\mathbb{R}^n$. $f$ is $L$-smooth.

We use stochastic gradient descent (SGD) to solve this problem. In each iteration, we query a stochastic gradient oracle and update:

\[x_{t + 1} = x_t - \eta \nabla f (x_t, \xi_t) .\]

In this note, we assume that the stochastic oracle is unbiased, i.e., $\mathbb{E}_{\xi} [\nabla f (x, \xi)] = \nabla f (x)$. We further consider two setups:

G1. (bounded noise) $| \nabla f (x, \xi) - \nabla f (x) | \leq B$.
G2. (affine noise) $| \nabla f (x, \xi) - \nabla f (x) | \leq \sqrt{\sigma_0^2 + \sigma_1^2 |\nabla f (x) |^2}$.

We focus on deriving high probability bounds for SGD with different combinations of the above assumptions about the objective function (F) and the stochastic gradient oracles (G).

Concentration inequalities

In the high probability analysis of SGD, one important component is to give a high probability upper bound for the sum of martingale difference sequence (MDS). We present some frequently used concentration inequalities here and show how to use them in the following sections.

Let $Z_0, \ldots, Z_{T - 1}$ be an MDS:

L1. (Azuma’s Inequality) If $| Z_t | \leq B$ for all $t$ almost surely, then for every $\delta \in (0, 1)$,

\[\Pr \left( \sum_{t = 0}^{T - 1} Z_t \geq B \sqrt{2 T \log \frac{1}{\delta}} \right) \leq \delta .\]

L2. (Empirical Bernstein Concentration Inequality) If $| Z_t | \leq 1$ for all $t$ w.p. 1. Then for every $\delta \in (0, 1)$, and any $\hat Z_t \in \mathcal{F}_{t - 1}$ such that $| \hat{Z}_t | \leq 1$ w.p. 1,

\[\Pr \left( \exists t, \left| \sum_{s = 0}^{t - 1} Z_s -\mathbb{E} [Z_s \mid \mathcal{F}_{s - 1}] \right| \geq \sqrt{A_t (\delta) \sum_{s = 0}^{t - 1} (Z_s - \hat{Z}_s)^2 + B_t (\delta)} \right) \leq \delta,\]

where $A_t (\delta) = 16 \log \left( \frac{60 \log (6 t)}{\delta} \right)$ and $B_t (\delta) = 16 \log^2 \left( \frac{60 \log (6 t)}{\delta} \right)$.

L3. If $\mathbb{E} [\exp (Z_t^2 / \sigma_t^2) \mid \mathcal{F}_{t - 1}] \leq \exp(1)$ for all $t$. Then, for any fixed $\lambda > 0$ and $\delta \in (0, 1)$,

\[\Pr \left( \sum_{t = 0}^{T - 1} Z_t \geq \frac{3}{4} \lambda \sum_{t = 0}^{T - 1} \sigma_t^2 + \frac{1}{\lambda} \log \frac{1}{\delta} \right) \leq \delta .\]

Lipschitz, bounded noise

First, we use Assumption F1, G1 to illustrate the basic ideas of high probability analysis.

Since $f$ is convex, we use function value gap as the performance measure.

\[\begin{array}{rcl} f (x_t) - f (x_{\star}) & \leq & \langle \nabla f (x_t), x_t - x_{\star} \rangle\\ & = & \underbrace{\langle \nabla f (x_t, \xi_t), x_t - x_{\star} \rangle}_{(\ast)} + \underbrace{\langle \nabla f (x_t) - \nabla f (x_t, \xi_t), x_t - x_{\star} \rangle}_{(\ast \ast)} \end{array}\]

Note: in subsequent sections, we continue using such decomposition.

For the first part $(\ast)$,

\[\begin{array}{rcl} \langle \nabla f (x_t, \xi_t), x_t - x_{\star} \rangle & = & \frac{1}{\eta} \langle x_t - x_{t + 1}, x_t - x_{\star} \rangle\\ & = & \frac{1}{2 \eta} [\| x_t - x_{\star} \|^2 - \| x_{t + 1} - x_{\star} \|^2 + \| x_t - x_{t + 1} \|^2]\\ & = & \frac{1}{2 \eta} [\| x_t - x_{\star} \|^2 - \| x_{t + 1} - x_{\star} \|^2] + \frac{\eta}{2} \| \nabla f (x_t, \xi_t) \|^2\\ & \leq & \frac{1}{2 \eta} [\| x_t - x_{\star} \|^2 - \| x_{t + 1} - x_{\star} \|^2] + \eta [\| \nabla f (x_t) \|^2 + \| \nabla f (x_t, \xi_t) - \nabla f (x_t) \|^2]\\ & \leq & \frac{1}{2 \eta} [\| x_t - x_{\star} \|^2 - \| x_{t + 1} - x_{\star} \|^2] + \eta (G^2 + B^2) \end{array}\]

Telescope,

\[\begin{array}{rcl} \sum_{t = 0}^{T - 1} \langle \nabla f (x_t, \xi_t), x_t - x_{\star} \rangle & \leq & \frac{\| x_0 - x_{\star} \|^2}{2 \eta} + \eta (G^2 + B^2) T \end{array}\]

For deterministic gradient, the analysis is nearly complete - choose $\eta = \frac{D}{\sqrt{2 (G^2 + B^2) T}}$ and we have, $\begin{array}{rcl} \frac{1}{T} \sum_{t = 0}^{T - 1} (\ast) = \frac{1}{T} \sum_{t = 0}^{T - 1} \langle \nabla f (x_t, \xi_t), x_t - x_{\star} \rangle & \leq & 2 D \sqrt{\frac{G^2 + B^2}{T}} \end{array}$

For stochastic gradient, we need to tackle the second part $(\ast \ast)$. Let $Z_t = \langle \nabla f (x_t) - \nabla f (x_t, \xi_t), x_t - x_{\star}\rangle$, it is an MDS. The goal of the remaining proof is to give a high probability bound for $\sum_{t = 0}^{T - 1} Z_t$.

Under Assumption F1, G1, $| Z_t | \leq D \cdot B$, so we apply Azuma’s inequality, w.p. $1 - \delta$, $\begin{array}{rcl} \frac{1}{T} \sum_{t = 0}^{T - 1} (\ast \ast) = \frac{1}{T} \sum_{t = 0}^{T - 1} \langle\nabla f (x_t) - \nabla f (x_t, \xi_t), x_t - x_{\star}\rangle & \leq & D B \sqrt{\frac{2 \log \frac{1}{\delta}}{T}} \end{array}$ Combine $(\ast)$ and $(\ast \ast)$, and by Jensen’s inequality, w.p. $1 - \delta$, $\begin{array}{rcl} f \left( \frac{1}{T} \sum_{t = 0}^{T - 1} x_t \right) - f (x_{\star}) & \leq & 2 D \sqrt{\frac{G^2 + B^2}{T}} + D B \sqrt{\frac{2 \log \frac{1}{\delta}}{T}}. \end{array}$

Remarks. We can compare different analysis frameworks, and they mainly differ in how to tackle the second part $(\ast \ast)$.

Expectation: $(\ast \ast)$ has conditional expectation 0. the final bound is $2 D \sqrt{\frac{G^2 + B^2}{T}}$.
High probability: $(\ast \ast)$ has a high probability sublinear bound after telescoping. the final bound is $2 D \sqrt{\frac{G^2 + B^2}{T}} + D B \sqrt{\frac{2 \log˙\frac{1}{\delta}}{T}}$.
Worst case: $(\ast \ast)$ is a constant. the final bound is $2 D \sqrt{\frac{G^2 + B^2}{T}} + D B$.

Smooth, bounded noise

In this section, we use Assumption F2, G1.

The analysis of the first part $(\ast)$ is very similar the previous section, except that we can not upper bound $| \nabla f (x_t) |^2$ by $G^2$,

Choose $\eta = \min ( \frac{1}{4 L}, \frac{\alpha}{B \sqrt{T}})$,

\[\begin{array}{rcl} \sum_{t = 0}^{T - 1} \langle \nabla f (x_t, \xi_t), x_t - x_{\star} \rangle & \leq & \frac{\| x_0 - x_{\star} \|^2}{2 \eta} + \eta B^2 T + 2 \eta L \sum_{t = 0}^{T - 1} (f (x_t) - f (x_{\star}))\\ & \leq & \| x_0 - x_{\star} \|^2 \left( 2 L + \frac{2 B \sqrt{T}}{\alpha} \right) + \alpha B \sqrt{T} + \frac{1}{2} \sum_{t = 0}^{T - 1} (f (x_t) - f (x_{\star}))\\ & \leq & 2 L \| x_0 - x_{\star} \|^2 + \left( \alpha + \frac{2 \| x_0 - x_{\star} \|^2}{\alpha} \right) B \sqrt{T} + \frac{1}{2} \sum_{t = 0}^{T - 1} (f (x_t) - f (x_{\star})) \end{array}\]

Consider the second part $(\ast \ast)$. Let $Z_t = \langle \nabla f (x_t) - \nabla f (x_t, \xi_t), x_t - x_{\star}˙\rangle$, it is an MDS because $\mathbb{E} [Z_t \mid \mathcal{F}_{t - 1}] = 0$. However, under Assumption F2, we cannot simply derive an upper bound for $Z_t$ because the domain is unbounded. Here, we need to use a useful lemma:

Lemma 1. Under Assumption F2, G1, $| x_t - x_{\star} | \leq D, \forall t, \text{w.p.} 1 - \delta .$

The definition of $D$ and the proof will be deferred to the end of this section.

As a corollary of Lemma 1, $| Z_t | \leq D \cdot B$, so we apply Azuma’s inequality and union bound, w.p. $1 - 2 \delta$,

\[\frac{1}{T} \sum_{t = 0}^{T - 1} (\nabla f (x_t) - \nabla f (x_t, \xi_t))^{\top} (x_t - x_{\star}) \leq D B \sqrt{\frac{2 \log \frac{1}{\delta}}{T}}\]

Combine the two parts, and by Jensen’s inequality, w.p. $1 - 2 \delta$,

\[f \left( \frac{1}{T} \sum_{t = 0}^{T - 1} x_t \right) - f (x_{\star}) \leq \frac{4 L \| x_0 - x_{\star} \|^2}{T} + \left( 2 \alpha + \frac{4 \| x_0 - x_{\star} \|^2}{\alpha} \right) \frac{B}{\sqrt{T}} + 2 D B \sqrt{\frac{2 \log \frac{1}{\delta}}{T}} .\]

This result does not rely on the bound of gradient norms and the diameter of the domain. We proceed to characterize $D$ in Lemma 1. Define $d_t = | x_t-x_{\star} |$, $\bar d_t = \max_{s \leq t} d_t$.

\[\begin{array}{rcl} & & \| x_{t + 1} - x_{\star} \|^2\\ & = & \| x_t - x_{\star} \|^2 - 2 \eta \langle \nabla f (x_t, \xi_t), x_t - x_{\star} \rangle + \eta^2 \| \nabla f (x_t, \xi_t) \|^2\\ & \leq & \| x_t - x_{\star} \|^2 - 2 \eta \langle \nabla f (x_t), x_t - x_{\star} \rangle + 2 \eta \langle \nabla f (x_t) - \nabla f (x_t, \xi_t), x_t - x_{\star} \rangle + \eta^2 \| \nabla f (x_t, \xi_t) \|^2\\ & \leq & \| x_t - x_{\star} \|^2 - 2 \eta (1 - \eta L) (f (x_t) - f (x_{\star})) + 2 \eta \langle \nabla f (x_t) - \nabla f (x_t, \xi_t), x_t - x_{\star} \rangle\\ & \leq & \| x_t - x_{\star} \|^2 + 2 \eta \langle \nabla f (x_t) - \nabla f (x_t, \xi_t), x_t - x_{\star} \rangle \end{array}\]

the last inequality holds because $\eta \leq \frac{1}{4 L}$. Then we have, $\begin{array}{rcl} \| x_{t + 1} - x_{\star} \|^2 \leq & \| x_0 - x_{\star} \|^2 + 2 \eta \sum_{s = 0}^t \langle \nabla f (x_s) - \nabla f (x_s, \xi_s), x_s - x_{\star} \rangle \end{array}$ To apply the Bernstein empirical concentration bound, we define

\[X_s = \frac{\langle \nabla f (x_s) - \nabla f (x_s, \xi_s), x_s - x_{\star} \rangle}{B \bar{d}_{t - 1}}, \quad \hat{X}_s = 0\]

Note that $X_s \leq 1$ w.p. 1, and $\mathbb{E} [X_s \mid \mathcal{F}_{s - 1}] = 0$. Then, for all $0 \leq t \leq T - 1$, and $\delta’ \in (0, 1)$, w.p. at least $1 - \delta’$,

\[\left| \sum_{s = 0}^t X_s \right| \leq \sqrt{A_t (\delta') \sum_{s = 0}^t X_s^2 + B_t (\delta')}\]

We can upper bound $X_s^2$ by 1. Then we return to the inequality, w.p. at least $1 - \delta’$, $\begin{array}{rcl} \| x_{t + 1} - x_{\star} \|^2 & \leq & \| x_0 - x_{\star} \|^2 + 2 \eta B \bar{d}_t \sum_{s = 0}^t X_s\\ & \leq & \| x_0 - x_{\star} \|^2 + 2 \eta B \bar{d}_t \sqrt{A_t (\delta') (t + 1) + B_t (\delta')}\\ & \leq & \| x_0 - x_{\star} \|^2 + \frac{1}{2} \bar{d}_t^2 + 2 \eta^2 B^2 (A_t (\delta') (t + 1) + B_t (\delta'))\\ & \leq & \frac{1}{2} \bar{d}_t^2 + d_0^2 + 2 \frac{\alpha^2}{T} (A_t (\delta') (t + 1) + B_t (\delta')) \end{array}$

Define $D^2 = 2 d_0^2 + 4 \alpha^2 A_T (\delta’) + 4 \frac{\alpha^2}{T} B_T (\delta’)˙= \mathcal{O} (1)$, then $\bar{d}_{t + 1}^2 \leq \frac{\bar{d}_t^2}{2} + \frac{D^2}{2}$. Rolling to $t = 0$, we have $\bar{d}_T \leq D$. Therefore, $d_t \leq D$ for all $0 \leq t \leq T - 1$.

Smooth, affine noise

In this section, we use Assumption F2, G2. The setting with Assumption F1, G2 is easier to tackle than this case, so we skip this setting.

For the first part $(\ast)$,

\[\begin{array}{rcl} \langle \nabla f (x_t, \xi_t), x_t - x_{\star} \rangle & \leq & \frac{1}{2 \eta} [\| x_t - x_{\star} \|^2 - \| x_{t + 1} - x_{\star} \|^2] + \eta [\| \nabla f (x_t) \|^2 + \| \nabla f (x_t, \xi_t) - \nabla f (x_t) \|^2]\\ & \leq & \frac{1}{2 \eta} [\| x_t - x_{\star} \|^2 - \| x_{t + 1} - x_{\star} \|^2] + \eta [\sigma_0^2 + (1 + \sigma_1^2) \| \nabla f (x_t) \|^2]\\ & \leq & \frac{1}{2 \eta} [\| x_t - x_{\star} \|^2 - \| x_{t + 1} - x_{\star} \|^2] + \eta [\sigma_0^2 + 2 (1 + \sigma_1^2) L (f (x_t) - f (x_{\star}))] \end{array}\]

Choose $\eta = \min (\frac{1}{8 (1 + \sigma_1^2) L},\frac{\alpha}{\sigma_0\sqrt{T}})$, and telescope,

\[\begin{array}{rcl} & & \sum_{t = 0}^{T - 1} \langle \nabla f (x_t, \xi_t), x_t - x_{\star} \rangle\\ & \leq & \frac{\| x_0 - x_{\star} \|^2}{2 \eta} + \eta \sigma_0^2 T + 2 (1 + \sigma_1^2) \eta L \sum_{t = 0}^{T - 1} (f (x_t) - f (x_{\star}))\\ & \leq & 4 (1 + \sigma_1^2) L \| x_0 - x_{\star} \|^2 + \left( \alpha + \frac{2 \| x_0 - x_{\star} \|^2}{\alpha} \right) \sigma_0 \sqrt{T} + \frac{1}{4} \sum_{t = 0}^{T - 1} (f (x_t) - f (x_{\star})) \end{array}\]

Consider the second part $(\ast \ast)$. Let $Z_t = \langle \nabla f (x_t) - \nabla f (x_t, \xi_t), x_t - x_{\star}\rangle$, it is an MDS. Here, we still need the existence of the upper bound on $| x_t - x_{\star} |$, but the uniformed bound on gradient noise does not exists. We use the following Lemma and delay the proof to the end.

Lemma 2. Under Assumption F2, G2, $| x_t - x_{\star} | \leq D, \forall t, \text{w.p.} 1 - 2 \delta .$

With Lemma 2, we state that $| Z_t | \leq d_t \sqrt{\sigma_0^2 + \sigma_1^2 | \nabla f (x_t) |^2}$, where $d_t = | x_t - x_{\star} |$. Apply inequality L3, w.p. $1 - \delta$,

\[\begin{array}{rcl} \frac{1}{T} \sum_{t = 0}^{T - 1} Z_t & \leq & \frac{3}{4} \frac{\lambda}{T} \sum_{t = 0}^{T - 1} d_t^2 (\sigma_0^2 + \sigma_1^2 \| \nabla f (x_t) \|^2) + \frac{1}{\lambda T} \log \frac{1}{\delta}\\ & \leq & \frac{3}{4} \frac{\lambda}{T} \sum_{t = 0}^{T - 1} d_t^2 (\sigma_0^2 + 2 \sigma_1^2 L (f (x_t) - f (x_{\star}))) + \frac{1}{\lambda T} \log \frac{1}{\delta} \end{array}\]

Let $\lambda = \left( \sigma_0 D \sqrt{T / \log \frac{1}{\delta}} + 6 D^2˙\sigma_1^2 L \right)^{- 1}$, and by a union bound of Lemma 2 and L3, w.p. $1 - 3 \delta$,

\[\begin{array}{rcl} \frac{1}{T} \sum_{t = 0}^{T - 1} Z_t & \leq & \frac{3}{4} \frac{\bar{d}_T^2 \sigma_0}{D \sqrt{T / \log \frac{1}{\delta}}} + \frac{1}{4 T} \sum_{t = 0}^{T - 1} \frac{\bar{d}_T^2}{D^2} (f (x_t) - f (x_{\star})) + \frac{1}{T} \log \frac{1}{\delta} \left( \sigma_0 D \sqrt{\frac{T}{\log \frac{1}{\delta}}} + 6 D^2 \sigma_1^2 L \right)\\ & \leq & 2 D \sigma_0 \sqrt{\frac{\log \frac{1}{\delta}}{T}} + \frac{1}{4 T} \sum_{t = 0}^{T - 1} (f (x_t) - f (x_{\star})) + \frac{6 D^2 \sigma_1^2 L \log \frac{1}{\delta}}{T} \end{array}\]

Combine the two parts, and by Jensen’s inequality, w.p. $1 - 3 \delta$,

\[f \left( \frac{1}{T} \sum_{t = 0}^{T - 1} x_t \right) - f (x_{\star}) \leq \frac{\sigma_0 D}{\sqrt{T}} \left( \frac{1}{\alpha} + 2 \alpha + 4 \sqrt{\log \frac{1}{\delta}} \right) + \frac{D^2 L}{T} \left( 8 (1 + \sigma_1^2) + 12 \sigma_1^2 \log \frac{1}{\delta} \right) .\]

Finally, we return to prove Lemma 2. Define $d_t = | x_t - x_{\star} |$, $\bar d_t = \max_{s \leq t} d_s$, $\Delta_t = f (x_t) - f (x_{\star})$, $\bar \Delta_t = \max_{s \leq t} \Delta_s$.

The same as the previous section, we have,

\[\| x_{t + 1} - x_{\star} \|^2 \leq \| x_0 - x_{\star} \|^2 + 2 \eta \sum_{s = 0}^t \langle \nabla f (x_s) - \nabla f (x_s, \xi_s), x_s - x_{\star} \rangle.\]

To apply the Bernstein empirical concentration bound L2, we define

\[X_s = \frac{\langle \nabla f (x_s) - \nabla f (x_s, \xi_s), x_s - x_{\star} \rangle}{\bar{d}_t \sqrt{\sigma_0^2 + 2 \sigma_1^2 L \bar{\Delta}_t^2}}, \quad \hat{X}_s = 0\]

Note that $X_s \leq 1$ w.p. 1, and $\mathbb{E} [X_s \mid \mathcal{F}_{s - 1}] = 0$. Then, for all $0 \leq t \leq T - 1$, and $\delta’ \in (0, 1)$, w.p. at least $1 - \delta’$,

\[\| \sum_{s = 0}^{t} X_s \| \leq \sqrt{A_t (\delta') (t + 1) + B_t (\delta')}\]

Then we return to the inequality, w.p. at least $1 - \delta’$, $\begin{array}{rcl} \| x_{t + 1} - x_{\star} \|^2 & \leq & \| x_0 - x_{\star} \|^2 + 2 \eta \bar{d}_t \sqrt{\sigma_0^2 + 2 \sigma_1^2 L \bar{\Delta}_t^2} \sum_{s = 0}^t X_s\\ & \leq & d_0^2 + 2 \eta \bar{d}_t \sqrt{\sigma_0^2 + 2 \sigma_1^2 L \bar{\Delta}_t^2} \sqrt{A_t (\delta') (t + 1) + B_t (\delta')}\\ & \leq & d_0^2 + \frac{1}{2} \bar{d}_t^2 + 2 \eta^2 (\sigma_0^2 + 2 \sigma_1^2 L \bar{\Delta}_t^2) (A_t (\delta') (t + 1) + B_t (\delta'))\\ & \leq & \frac{1}{2} \bar{d}_t^2 + d_0^2 + 2 \frac{\alpha^2}{T} \frac{\sigma_0^2 + 2 \sigma_1^2 L \bar{\Delta}_t^2}{\sigma_0^2} (A_t (\delta') (t + 1) + B_t (\delta')) \end{array}$

Finally, we need to show that:

Lemma 3. With probability at least $1 - \delta$, $\bar{\Delta}_T \leq F$.

Let $D^2 = 2 d_0^2 + 4 \frac{\alpha^2}{T} \frac{\sigma_0^2 + 2 \sigma_1^2 L˙F^2}{\sigma_0^2} (A_t (\delta’) (t + 1) + B_t (\delta’)) = \mathcal{O} (1)$ and apply a union bound with Lemma 3 proves Lemma 2.

We prove Lemma 3 by induction. Suppose $\bar \Delta_t \leq F$, we show that $\bar \Delta_{t + 1} \leq F$. We begin with the L-smooth condition,

\[\begin{array}{rcl} f (x_{t + 1}) & = & f (x_t) - \eta \nabla f (x_t, \xi_t)^{\top} \nabla f (x_t) + \frac{L}{2} \eta^2 \| \nabla f (x_t, \xi_t) \|^2\\ & \leq & f (x_0) - \eta \sum_{s = 0}^t \left( \nabla f (x_s, \xi_s)^{\top} \nabla f (x_s) + \frac{L}{2} \eta^2 \| \nabla f (x_s, \xi_s) \|^2 \right) \end{array}\]

We move to bound the first term on the r.h.s.,

\[\begin{array}{rcl} - \sum_{s = 0}^t \nabla f (x_s, \xi_s)^{\top} \nabla f (x_s) & = & - \sum_{s = 0}^t \| \nabla f (x_s) \|^2 + \sum_{s = 0}^t \langle \nabla f (x_s), \nabla f (x_s) - \nabla f (x_s, \xi_s) \rangle \end{array}\]

Then apply the inequality L3, w.p. $1 - \delta$,

\[\begin{array}{rcl} \sum_{s = 0}^t \langle \nabla f (x_s), \nabla f (x_s) - \nabla f (x_s, \xi_s) \rangle & \leq & \frac{3}{4} \lambda \sum_{s = 0}^t \| \nabla f (x_s) \|^2 (\sigma_0^2 + \sigma_1^2 \| \nabla f (x_s) \|^2) + \frac{1}{\lambda} \log \frac{T}{\delta}\\ & \leq & \frac{3}{4} \lambda \sum_{s = 0}^t \| \nabla f (x_s) \|^2 (\sigma_0^2 + 2 \sigma_1^2 L F) + \frac{1}{\lambda} \log \frac{T}{\delta} \end{array}\]

Let $\lambda = \frac{2}{3 \sqrt{\sigma_0^2 + 2 \sigma_1^2 L F}} < 1$,

\[\begin{array}{rcl} \sum_{s = 0}^t \langle \nabla f (x_s), \nabla f (x_s) - \nabla f (x_s, \xi_s) \rangle & \leq & \frac{\sum_{s = 0}^t \| \nabla f (x_s) \|^2}{2 \sqrt{\sigma_0^2 + 2 \sigma_1^2 L F}} + \frac{3 \sqrt{\sigma_0^2 + 2 \sigma_1^2 L F}}{2} \log \frac{T}{\delta}\\ & \leq & \frac{\sum_{s = 0}^t \| \nabla f (x_s) \|^2}{2 \sqrt{\sigma_0^2 + 2 \sigma_1^2 L F}} + \frac{9 (\sigma_0^2 + 2 \sigma_1^2 L F)}{16} + \log^2 \frac{T}{\delta} \end{array}\] \[\begin{array}{rcl} - \eta \sum_{s = 0}^t \nabla f (x_s, \xi_s)^{\top} \nabla f (x_s) & \leq & \frac{9 \eta (\sigma_0^2 + 2 \sigma_1^2 L F)}{16} + \eta \log^2 \frac{T}{\delta} \leq \frac{F}{2} + \frac{9 \alpha \sigma_0}{16 \sqrt{T}} + \frac{\alpha \log^2 \frac{T}{\delta}}{\sigma_0 \sqrt{T}} \end{array}\]

We return to the inequality uses $L$-smooth condition,

\[\begin{array}{rcl} f (x_{t + 1}) & \leq & f (x_0) - \eta \sum_{s = 0}^t \nabla f (x_s, \xi_s)^{\top} \nabla f (x_s)\\ \bar{\Delta}_{t + 1} & \leq & \frac{F}{2} + \Delta_0 + \frac{9 \alpha \sigma_0}{16 \sqrt{T}} + \frac{\alpha \log^2 \frac{T}{\delta}}{\sigma_0 \sqrt{T}} \end{array}\]

Let $F = 2 \left( \Delta_0 + \frac{9 \alpha \sigma_0}{16 \sqrt{T}} + \frac{\alpha \log^2 \frac{T}{\delta}}{\sigma_0 \sqrt{T}} \right)$ would derive Lemma 3.

High Probability Analysis for SGD

Beyond Bounded Domain and Bounded Gradients

Problem Setup

Concentration inequalities

Lipschitz, bounded noise

Smooth, bounded noise

Smooth, affine noise

CATALOG

FEATURED TAGS