Machine Learning Notes: Optimization Algorithms (13): On the Quadratic Upper Bound Lemma


Introduction

This section introduces what the quadratic upper bound is used for and how it is proved.

Review:

Lipschitz continuity

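The section on the convergence proof for the $\text{Wolfe}$ conditions briefly introduced Lipschitz continuity $(\text{Lipschitz Continuity})$ . A function $f(\cdot)$ is Lipschitz continuous if:
$\exists \mathcal L \geq 0, \forall x,\hat x \in \mathbb R^n: \quad ||f(x) - f(\hat x)|| \leq \mathcal L \cdot ||x - \hat x||$
If $f(\cdot)$ is Lipschitz continuous, then for $x \neq \hat x$ dividing both sides by $||x - \hat x||$ gives:
$\frac{||f(x) - f(\hat x)||}{||x - \hat x||} \leq \mathcal L$
For a differentiable function of one variable, the left-hand side can be rewritten with the Lagrange mean value theorem:
$\exists \xi \in (x,\hat x) \Rightarrow \frac{|f(x) - f(\hat x)|}{|x - \hat x|} = |f'(\xi)| \leq \mathcal L$
This means that the rate of change of $f(\cdot)$ is bounded above by $\mathcal L$ wherever the derivative exists.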
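The bound on the rate of change can be checked numerically. The following is a minimal sketch, assuming $f = \sin$ as an illustrative function (not taken from the original notes) and an arbitrary sampling grid; `lipschitz_ratio` is a hypothetical helper name. Since $|\cos| \leq 1$, the ratio should never exceed $\mathcal L = 1$.

```python
import math

def lipschitz_ratio(f, x, x_hat):
    """Return |f(x) - f(x_hat)| / |x - x_hat| for two distinct points."""
    return abs(f(x) - f(x_hat)) / abs(x - x_hat)

# Sample pairs of points on a grid over [-5, 5] (the grid is an arbitrary choice).
grid = [-5.0 + 0.1 * i for i in range(101)]
largest_ratio = max(
    lipschitz_ratio(math.sin, a, b)
    for a in grid
    for b in grid
    if a != b
)

# sin has derivative cos and |cos| <= 1, so L = 1 is a valid Lipschitz constant.
print(largest_ratio)
```

The largest sampled ratio stays below 1 and gets close to it for neighboring grid points where $|\cos|$ is near 1.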

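Introduction to gradient descent

The section "Groundwork for Gradient Descent: An Overview" gave a first look at gradient descent. First of all, gradient descent is a typical line search method $(\text{Line Search Method})$ . Its iteration is written as:
$x_{k+1} = x_k + \alpha_k \cdot \mathcal P_k$

Here $\mathcal P_k \in \mathbb R^n$ is the direction in which the numerical solution is updated. Gradient descent takes the direction opposite to the gradient of the objective $f(\cdot)$ at $x_k$, that is $-\nabla f(x_k)$, which is also called the steepest descent direction:
$\mathcal P_k = -\nabla f(x_k)$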
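$\alpha_k$ is the step size. Depending on how the step size is chosen, line search is either exact or inexact. Inexact search, which approximates the optimal step size through an iteratively generated sequence of trial values, is covered in:
- $\text{Armijo}$ condition
- $\text{Goldstein}$ condition
- $\text{Wolfe}$ conditions

This section covers finding the optimal step size for gradient descent by exact search, together with the condition that exact search relies on: the quadratic upper bound lemma.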
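The iteration $x_{k+1} = x_k + \alpha_k \cdot \mathcal P_k$ with $\mathcal P_k = -\nabla f(x_k)$ can be sketched as follows. The objective $f(x) = x_1^2 + 2x_2^2$, the fixed step size, and the names `grad_f` and `gradient_descent` are assumptions made for illustration only; the original notes do not prescribe them.

```python
def grad_f(x):
    """Gradient of the example objective f(x) = x1^2 + 2 * x2^2."""
    return [2.0 * x[0], 4.0 * x[1]]

def gradient_descent(x0, step, num_iters):
    """Iterate x_{k+1} = x_k + step * P_k with P_k = -grad_f(x_k)."""
    x = list(x0)
    for _ in range(num_iters):
        p = [-g for g in grad_f(x)]  # steepest descent direction P_k
        x = [xi + step * pi for xi, pi in zip(x, p)]
    return x

# A fixed step is used here purely for simplicity (an assumption).
x_final = gradient_descent([3.0, -2.0], step=0.1, num_iters=200)
print(x_final)
```

With this step size each coordinate shrinks by a constant factor per iteration, so the iterates approach the minimizer at the origin.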

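The quadratic upper bound lemma: introduction and purpose

When solving for the exact step size of gradient descent, one more condition is imposed on the objective $f(\cdot)$ beyond differentiability on its domain: the gradient $\nabla f(\cdot)$ must be Lipschitz continuous.
If $\nabla f(\cdot)$ is Lipschitz continuous and $f(\cdot)$ is twice differentiable, the same reasoning as above gives:
$||\nabla^2 f(\cdot)|| \leq \mathcal L$
The second derivative describes how fast the gradient $\nabla f(\cdot)$ changes, so this says that $\nabla f(\cdot)$ cannot change too abruptly. Conversely, if $\nabla f(\cdot)$ does change too abruptly, even a tiny update during iteration can change the function value enormously; consider how $\nabla f(\cdot)$ behaves for $\begin{aligned}f(x) = \frac{1}{x}\end{aligned}$ on the interval $(0,1]$ . During iteration this can lead to exploding gradients.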
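The $\frac{1}{x}$ example can be made concrete with a short sketch. The pairs of points $(a, a/2)$ and the helper name `gradient_ratio` are illustrative assumptions; the code measures $|f'(a) - f'(b)| / |a - b|$ as the points move toward 0.

```python
def grad(x):
    """Derivative of f(x) = 1 / x."""
    return -1.0 / (x * x)

def gradient_ratio(a, b):
    """Return |f'(a) - f'(b)| / |a - b|, the local rate of change of f'."""
    return abs(grad(a) - grad(b)) / abs(a - b)

# Compare pairs (a, a / 2) that move toward 0 inside (0, 1].
ratios = [gradient_ratio(10.0 ** -k, 0.5 * 10.0 ** -k) for k in range(5)]
print(ratios)
```

For these pairs the ratio equals $6 / a^3$, so it grows without bound as $a \to 0$: no single constant $\mathcal L$ can bound the rate of change of the gradient on $(0,1]$.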

Under these conditions, $f(\cdot)$ has a quadratic upper bound:
$\forall x,y \in \mathbb R^n \Rightarrow f(y) \leq f(x) + [\nabla f(x)]^T \cdot (y-x) + \frac{\mathcal L}{2}||y - x||^2$
Before, we only knew that the rate of change of $\nabla f(\cdot)$ is bounded; this result gives that bound an explicit form.
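The inequality can be spot-checked on random points. This is a minimal sketch, assuming the hypothetical test function $f(x) = \sin(x_1) + \cos(x_2)$: its Hessian is $\text{diag}(-\sin x_1, -\cos x_2)$, so $\mathcal L = 1$ is a valid Lipschitz constant for the gradient. The sampling range and tolerance are arbitrary choices.

```python
import math
import random

L = 1.0  # Lipschitz constant of the gradient for the example f below

def f(x):
    return math.sin(x[0]) + math.cos(x[1])

def grad_f(x):
    return [math.cos(x[0]), -math.sin(x[1])]

def quadratic_upper_bound(x, y):
    """Right-hand side: f(x) + grad_f(x)^T (y - x) + (L / 2) * ||y - x||^2."""
    d = [yi - xi for xi, yi in zip(x, y)]
    linear = sum(g * di for g, di in zip(grad_f(x), d))
    return f(x) + linear + 0.5 * L * sum(di * di for di in d)

rng = random.Random(0)
violations = 0
for _ in range(10_000):
    x = [rng.uniform(-10, 10), rng.uniform(-10, 10)]
    y = [rng.uniform(-10, 10), rng.uniform(-10, 10)]
    if f(y) > quadratic_upper_bound(x, y) + 1e-12:
        violations += 1
print(violations)
```

A count of zero violations is consistent with the lemma; it is a sanity check on sampled points, not a proof.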
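First, use a figure to see what each part of this result means:
[Figure: the quadratic upper bound for a function of one variable]
This is clearly the one-dimensional case $(\mathbb R \mapsto\mathbb R)$ . The blue dashed arrow is $f (y)$ and the black dashed arrow is $[\nabla f(x)]^T \cdot (y - x)$ . According to the result above, the gap between $f(y)$ and the tangent-line value $f(x) + [\nabla f(x)]^T \cdot (y - x)$ (green solid line) cannot grow without limit; it is bounded above:
$f(y) - \left[f(x) + [\nabla f(x)]^T \cdot (y-x)\right] \leq \frac{\mathcal L}{2}||y -x||^2$
Suppose this gap were far larger than $\begin{aligned}\frac{\mathcal L}{2}||y -x||^2\end{aligned}$ . For example:
[Figure: a function whose gap from its tangent line far exceeds the bound]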

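The figure shows that if the gap between $f (y)$ and $f(x) + [\nabla f(x)]^T (y - x)$ is too large, it must be because the slope at $y$ differs too much from the slope at $x$. The bound $\begin{aligned}\frac{\mathcal L}{2}||y - x||^2\end{aligned}$ on this gap is therefore, in essence, still a constraint on the rate of change of $\nabla f(\cdot)$ .
Exploding gradients are more likely in that situation.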

The relationship between the quadratic upper bound and the optimal step size

Taking the quadratic upper bound lemma as known, consider what it contributes to solving for the exact step size.
$\forall x,y \in \mathbb R^n \Rightarrow f(y) \leq f(x) + [\nabla f(x)]^T \cdot (y-x) + \frac{\mathcal L}{2}||y - x||^2$
Since the lemma holds for all $x,y \in \mathbb R^n$ , we can take $x, y$ to be $x_k,x_{k+1}$ at some iteration $k$ :
We keep writing $x, y$ below.
$\begin{cases} x \Rightarrow x_k \\ y \Rightarrow x_{k+1} \\ y = x + \alpha_k \cdot \mathcal P_k \end{cases}$
Because $x \Rightarrow x_k$ is the position produced by the previous iteration, it is known. The right-hand side of the inequality is therefore a quadratic function of the variable $y \Rightarrow x_{k+1}$ , denoted $\phi(y)$ :
$\begin{cases} \phi(y) \triangleq f(x) + [\nabla f(x)]^T \cdot (y - x) + \frac{\mathcal L}{2}||y - x||^2 \\ \quad \\ f(y) \leq \phi(y) \end{cases}$
The coefficient $\begin{aligned}\frac{\mathcal L}{2}\end{aligned}$ of the quadratic term in $y$ is positive, so the graph of $\phi(y)$ opens upward and $\phi(y)$ has a minimum. Solve for the minimizer:
$y_{min} = \mathop{\arg\min}\limits_{y \in \mathbb R^n} \phi(y)$

First take the gradient of $\phi(y)$ with respect to $y$ , treating every term that depends only on $x$ as a constant:
$\begin{aligned} \nabla \phi(y) & = 0 + \nabla f(x) \cdot 1 + \frac{\mathcal L}{2} \cdot 2 \cdot (y-x) \\ & = \nabla f(x) + \mathcal L \cdot (y-x) \end{aligned}$
Setting $\nabla \phi(y) = 0$ gives:
$y_{min} = -\frac{\nabla f(x)}{\mathcal L} + x$
The corresponding minimum value $\min \phi(y)$ is:
$\begin{aligned} \min \phi(y) & = \phi(y_{min}) \\ & = f(x) + [\nabla f(x)]^T \cdot \left(-\frac{\nabla f(x)}{\mathcal L}\right) + \frac{\mathcal L}{2} \cdot \frac{[- \nabla f(x)]^T [- \nabla f(x)]}{\mathcal L^2}\\ & = f(x) - \frac{||\nabla f(x)||^2}{2\mathcal L} \end{aligned}$
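The closed forms for $y_{min}$ and $\min \phi(y)$ can be verified on a small example. This sketch assumes the illustrative objective $f(x) = x_1^2 + 2x_2^2$ (Hessian $\text{diag}(2, 4)$, so $\mathcal L = 4$) and the point $x = (3, -2)$; neither comes from the original notes.

```python
def f(x):
    return x[0] ** 2 + 2.0 * x[1] ** 2

def grad_f(x):
    return [2.0 * x[0], 4.0 * x[1]]

L = 4.0  # largest eigenvalue of the Hessian diag(2, 4)

def phi(y, x):
    """Quadratic upper bound phi(y) built at the point x."""
    d = [yi - xi for xi, yi in zip(x, y)]
    linear = sum(g * di for g, di in zip(grad_f(x), d))
    return f(x) + linear + 0.5 * L * sum(di * di for di in d)

x = [3.0, -2.0]
g = grad_f(x)
y_min = [xi - gi / L for xi, gi in zip(x, g)]             # y_min = x - grad_f(x) / L
closed_form = f(x) - sum(gi * gi for gi in g) / (2.0 * L)  # f(x) - ||grad||^2 / (2L)

print(phi(y_min, x), closed_form)  # prints: 4.5 4.5
```

Here $f(x) = 17$ and $||\nabla f(x)||^2 = 100$, so both expressions evaluate to $17 - 100/8 = 4.5$.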

Substitute $\alpha_k \cdot \mathcal P_k$ and compare:

$\mathcal P_k$ is the vector describing the update direction; it corresponds to the negative gradient direction $-\nabla f(x)$ .
Likewise, $\alpha_k$ corresponds to $\begin{aligned}\frac{1}{\mathcal L}\end{aligned}$ .
$\begin{cases} \begin{aligned} y & = x + \alpha_k \cdot \mathcal P_k \\ y_{min} & = x + \frac{1}{\mathcal L} \cdot [-\nabla f(x)] \end{aligned} \end{cases} \Rightarrow \begin{cases} \begin{aligned}\alpha_k & = \frac{1}{\mathcal L} \\ \mathcal P_k & = - \nabla f(x) \end{aligned} \end{cases}$

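Note, however, that $f(y) \leq \phi(y)$ , and $y_{min}$ is only the minimizer of $\phi(y)$ . In other words, $\phi(y_{min})$ is the smallest value taken by the upper bound on $f (y)$ , not necessarily the smallest value of $f (y)$ itself. Under this condition, we regard $\begin{aligned}\alpha_k = \frac{1}{\mathcal L}\end{aligned}$ as the best step size that the bound can guarantee.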
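Combining $f(y_{min}) \leq \phi(y_{min})$ with the minimum value derived above gives $f(x_{k+1}) \leq f(x_k) - \frac{||\nabla f(x_k)||^2}{2\mathcal L}$ for the step $\alpha_k = \frac{1}{\mathcal L}$ . The sketch below checks this decrease at every iteration, assuming the illustrative one-dimensional function $f(x) = \sqrt{1 + x^2}$ , for which $f''(x) = (1 + x^2)^{-3/2} \leq 1$ and hence $\mathcal L = 1$ . The starting point and iteration count are arbitrary.

```python
import math

L = 1.0  # f''(x) = (1 + x^2)^(-3/2) <= 1, so the gradient is 1-Lipschitz

def f(x):
    return math.sqrt(1.0 + x * x)

def grad_f(x):
    return x / math.sqrt(1.0 + x * x)

x = 5.0
decrease_ok = True
for _ in range(50):
    g = grad_f(x)
    x_next = x - g / L                     # step size alpha_k = 1 / L
    guaranteed = f(x) - g * g / (2.0 * L)  # minimum of the quadratic upper bound
    decrease_ok = decrease_ok and f(x_next) <= guaranteed + 1e-12
    x = x_next
print(x, decrease_ok)
```

The iterates move toward the minimizer at 0, and each step decreases $f$ by at least the amount the bound promises.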

Proof of the quadratic upper bound lemma

Assumption: $f(\cdot)$ is differentiable and $\nabla f(\cdot)$ is Lipschitz continuous;
Conclusion: $f(\cdot)$ has a quadratic upper bound:
$\forall x,y \in \mathbb R^n \Rightarrow f(y) \leq f(x) + [\nabla f(x)]^T \cdot (y - x) + \frac{\mathcal L}{2}||y - x||^2$

Proof:
Since $x, y \in \mathbb R^n$ above are arbitrary points of the domain, the relationship between $f (x)$ and $f (y)$ cannot be read off directly from the assumptions. Fix two such points and introduce the auxiliary function $\mathcal G(\theta)$ :
With $x, y \in \mathbb R^n$ fixed, this is a function of $\theta$ alone; varying $\theta$ sweeps through the values of $f$ along the line segment from $x$ to $y$ .
$\begin{aligned} \mathcal G(\theta) & = f [\theta \cdot y + (1 - \theta) \cdot x] \\ & = f [x + \theta(y - x)] \quad \theta \in [0,1] \end{aligned}$
This gives $\mathcal G(0) = f(x);\mathcal G(1) = f(y)$ . Substitute these for the corresponding terms in the conclusion:
It then suffices to prove the substituted inequality.
$\begin{aligned} & \quad \quad \mathcal G(1) \leq \mathcal G(0) + [\nabla f(x)]^T \cdot (y - x) + \frac{\mathcal L}{2} ||y - x||^2 \\ & \Rightarrow \mathcal G(1) - \mathcal G(0) - [\nabla f(x)]^T \cdot (y - x) \leq \frac{\mathcal L}{2} ||y - x||^2 \end{aligned}$
Consider the left-hand side of the inequality:
By the Newton-Leibniz formula, $\mathcal G(1) - \mathcal G(0)$ can be written as:
$\mathcal G(1) - \mathcal G(0) = \mathcal G(\theta) |_{0}^1 = \int_{0}^1 \mathcal G'(\theta) d\theta$
The term $[\nabla f(x)]^T \cdot (y - x)$ can likewise be written as a definite integral. It does not contain $\theta$ , so it is treated as a constant.
$\begin{aligned} [\nabla f(x)]^T \cdot(y - x) & = [\nabla f(x)]^T \cdot (y - x) \cdot 1 \\ & = [\nabla f(x)]^T \cdot (y - x) \cdot \theta |_0^1 \\ & = [\nabla f(x)]^T \cdot (y - x) \cdot \int_0^1 1 d\theta \\ & = \int_{0}^1 [\nabla f(x)]^T \cdot (y - x) d\theta \end{aligned}$
The left-hand side of the inequality can now be written as follows, where the chain rule gives $\mathcal G'(\theta) = [\nabla f(x + \theta \cdot (y - x))]^T\cdot (y - x)$ :
$\begin{aligned} \mathcal I_{left} & = \int_{0}^1 \mathcal G'(\theta) d\theta - \int_{0}^1 [\nabla f(x)]^T \cdot (y - x) d\theta \\ & = \int_0^1 \left \{[\nabla f(x + \theta \cdot (y - x))]^T\cdot (y - x) - [\nabla f(x)]^T \cdot (y - x) \right\} d\theta \end{aligned}$
Factor out the common part $y - x$ and combine what remains:
$\mathcal I_{left} = \int_{0}^1 \left\{\nabla f[x + \theta \cdot (y - x)] - \nabla f(x)\right\}^T \cdot (y - x) d\theta$
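The integral form of $\mathcal I_{left}$ can be compared against the direct expression $f(y) - f(x) - [\nabla f(x)]^T \cdot (y - x)$ . This is a minimal sketch, assuming the hypothetical test function $f(x) = \sin(x_1) + \cos(x_2)$ , two arbitrary points, and a midpoint-rule approximation of the integral; `integrand` and `midpoint_integral` are illustrative helper names.

```python
import math

def f(x):
    return math.sin(x[0]) + math.cos(x[1])

def grad_f(x):
    return [math.cos(x[0]), -math.sin(x[1])]

def integrand(theta, x, y):
    """{grad_f(x + theta (y - x)) - grad_f(x)}^T (y - x)."""
    d = [yi - xi for xi, yi in zip(x, y)]
    point = [xi + theta * di for xi, di in zip(x, d)]
    diff = [a - b for a, b in zip(grad_f(point), grad_f(x))]
    return sum(df * di for df, di in zip(diff, d))

def midpoint_integral(func, n=2000):
    """Approximate the integral of func over [0, 1] with the midpoint rule."""
    h = 1.0 / n
    return h * sum(func((i + 0.5) * h) for i in range(n))

x, y = [0.3, -1.2], [1.5, 0.4]
d = [yi - xi for xi, yi in zip(x, y)]
direct = f(y) - f(x) - sum(g * di for g, di in zip(grad_f(x), d))
integral = midpoint_integral(lambda theta: integrand(theta, x, y))
print(direct, integral)
```

The two numbers agree up to the discretization error of the midpoint rule.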
The term inside the integral is the inner product of the vector $\nabla f[x + \theta \cdot (y - x)] - \nabla f(x)$ and the vector $y - x$ . Writing $\gamma$ for the angle between the two vectors (a separate symbol from the interpolation parameter $\theta$ ), we have:
The inequality holds because $\cos \gamma \in [-1,1]$ ; this is the Cauchy-Schwarz inequality.
$\begin{aligned} \left\{\nabla f[x + \theta \cdot (y - x)] - \nabla f(x)\right\}^T \cdot (y - x) & = ||\nabla f[x + \theta \cdot (y - x)] - \nabla f(x)|| \cdot ||y - x|| \cdot \cos \gamma \\ & \leq ||\nabla f[x + \theta \cdot (y - x)] - \nabla f(x)|| \cdot ||y - x|| \end{aligned}$
Substituting this inequality back into $\mathcal I_{left}$ gives:
$\mathcal I_{left} \leq \int_0^1 ||\nabla f[x + \theta \cdot (y - x)] - \nabla f(x)|| \cdot ||y - x|| d\theta$
Since $\nabla f(\cdot)$ is Lipschitz continuous, we have:
Here $\theta \in [0,1]$ is non-negative, so it can be pulled out of the norm.
$||\nabla f[x + \theta \cdot (y - x)] - \nabla f(x)|| \leq \mathcal L \cdot ||x + \theta \cdot (y -x) - x|| = \mathcal L \cdot \theta \cdot ||y - x||$
Rearranging:
$\mathcal I_{left} \leq \int_0^1 \mathcal L \cdot \theta \cdot ||y - x||^2 d\theta$
Since $\mathcal L,||y - x||^2$ do not depend on $\theta$ , pull them out of the integral:
$\begin{aligned} \mathcal I_{left} & \leq \mathcal L \cdot ||y - x||^2 \cdot \int_0^1 \theta d\theta \\ & = \mathcal L \cdot ||y - x||^2 \cdot \frac{1}{2} \theta^2|_0^1 \\ & = \frac{\mathcal L}{2} \cdot ||y - x||^2 \\ & = \mathcal I_{right} \end{aligned}$
Q.E.D.
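The chain of bounds used in the proof, $\mathcal I_{left} \leq \int_0^1 ||\nabla f[x + \theta \cdot (y - x)] - \nabla f(x)|| \cdot ||y - x|| d\theta \leq \mathcal I_{right}$ , can be evaluated numerically for one pair of points. The sketch assumes the hypothetical test function $f(x) = \sin(x_1) + \cos(x_2)$ with $\mathcal L = 1$ , two arbitrary points, and a midpoint-rule approximation of the middle integral; it illustrates the ordering and is not a substitute for the proof.

```python
import math

L = 1.0  # gradient Lipschitz constant of the example f below

def f(x):
    return math.sin(x[0]) + math.cos(x[1])

def grad_f(x):
    return [math.cos(x[0]), -math.sin(x[1])]

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

x, y = [0.3, -1.2], [1.5, 0.4]
d = [yi - xi for xi, yi in zip(x, y)]
gx = grad_f(x)

# I_left = f(y) - f(x) - grad_f(x)^T (y - x)
i_left = f(y) - f(x) - sum(g * di for g, di in zip(gx, d))

# Middle bound: integral over [0, 1] of ||grad_f(x + theta d) - grad_f(x)|| * ||d||
n = 2000
h = 1.0 / n
middle = 0.0
for i in range(n):
    theta = (i + 0.5) * h
    point = [xi + theta * di for xi, di in zip(x, d)]
    diff = [a - b for a, b in zip(grad_f(point), gx)]
    middle += h * norm(diff) * norm(d)

# I_right = (L / 2) * ||y - x||^2
i_right = 0.5 * L * norm(d) ** 2
print(i_left, middle, i_right)
```

For these points the three printed values appear in non-decreasing order, matching the order of the inequalities in the proof.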