3.1 Moreau-Yosida regularization
From the name we can know that, this interpretation is closely related to the Moreau decomposition.
And the proximal operator has the same formula as the moreau-vosida regularization.
Definition 3.1: The infimal convolution of closed proper convex function f and g on \(\mathbb{R}^{n}\), denoted \(f \square g\)
is defined as :
\[(f \square g)(v) = \inf_{x}(f(x) + g(v-x))\]
with \(\mathbf{dom}(f\square g) = \mathbf{dom}f + \mathbf{g}\)
The main example related is the Moreau envelope or Moreau-Yosida regularization \(M_{\lambda f}\) of the function \(\lambda f\),
(which has the same definition as \(\bar f_{\mu}\) in the page “Proerties”, the proof of Moreau decomposition) is defined as:
\[M_{\lambda f}(v) = \lambda f \square (1/2)\| \cdot \|^{2}_{2}\]
\[M_{\lambda f}(v) = \inf_{x}(f(x) + (1/2\lambda) \| x- v\|^{2}_{2})\]
3.1.1 Convexity
Note \(L(x,y) = f(y) + \frac{1}{2*\mu} \| x- y\|^{2}_{2}\), then is jointly convex in x and y, Then \(M_{\lambda f}(v) = \inf_{y}L(x,y)\)
must be convex (since its epigraph is the projection of a convex set).
3.1.2 Smoothed or Regularized form of f
Theorem 3.2 : \((f \square g)^{*} = f^{*} + g^{*}\).
Proof of Theorem 3.2 :
\[\begin{split}\begin{align*}
(f\square g)^{*}(x) &= \sup_{u}(y^{T}u - (f\square g)(u)) \\
&= \sup_{u}(y^{T}u - \inf_{v}(f(v) + g(u-v))) \\
&= \sup_{u} \sup_{v} (y^{T}u - f(v) - g(u-v)) \\
& \quad ( t \triangleq u - v \Rightarrow u = t+v) \\
&= \sup_{t}sup_{v}(y^{T}(t+v) - f(v) - g(t)) \\
&= \sup_{t}(y^{T}t -g(t)) + \sup_{v}(y^{T}v - f(v))\\
&= f^{*} + g^{*}
\end{align*}\end{split}\]
Proof of Theorem 3.2 with duality :
First, we can reform the Moreau envelope as a equation constrained convex optimization problem:
\[\begin{split}\begin{align*}
M_{\lambda f}(x) &= \inf_{y}(f(y) + (1/2\lambda)\|x-y\|^{2}) \\
&= \inf_{y, s.t. x=y=z}(f(y) + (1/2\lambda)\|z\|^{2})
\end{align*}\end{split}\]
The largangian of this problem is :
\[\begin{split}\begin{align*}
\mathbb{L}(y,z,\mu) &= f(y) + (1/2\lambda)\|z\|^{2} + \mu^{T}(x-y-z) \\
&=(f(y)- \mu^{T}y) + ((1/2\lambda)\|z\|^{2} - \mu^{T}z) + \mu^{T}x
\end{align*}\end{split}\]
The dual problem is :
\[\begin{split}\begin{align*}
g(\mu) &= \inf_{y,z} \mathbb{L}(y,z,\mu) \\
&= - \sup_{y}(\mu^{T}y - f(y)) - (\lambda/2)\|\mu\|^{2} + \mu^{T}x \\
&= - f^{*}(\mu) - (\lambda/2)\|\mu\|^{2} + \mu^{T}x
\end{align*}\end{split}\]
As strong duality holds for strickly feasible convex optimization problem, we have the dualty gap equals zero:
\[\begin{split}\begin{align*}
M_{\lambda f}(x) &= \sup_{\mu} g(\mu)\\
&= \sup_{\mu} (-f^{*}(\mu) - (\lambda/2)\|\mu\|^{2} + \mu^{T}x) \\
&= (f^{*} + (\lambda/2)\|\cdot\|^{2})^{*}(x)
\end{align*}\end{split}\]
As a result of Theorem 3.2, we have :
\[\begin{split}\begin{align*}
M_{\lambda f} = M_{\lambda f}^{**} &= (f \square (1/2\lambda)\|\cdot\|^{2}_{2})^{**} \\
&= (f^{*} + ( (1/2\lambda)\|\cdot\|^{2}_{2})^{*})^{*} \\
& \quad (\|\cdot\|^{2}_{2} \quad is \quad self-dual) \\
&=(f^{*} + (1/2\lambda)\|\cdot\|^{2}_{2})^{*}
\end{align*}\end{split}\]
From the upper equation, we can interperte the Moreau envelope \(M_{\lambda f}\) as a smooth approximation to a function
by taking its conjugate, adding regulization. Mreau envelope obtains a smooth approximation via:
- Take the conjugate of f : \(f^{*}\).
- Regularize : \(f^{*} + (1/2\lambda)\|\cdot\|^{2}_{2}\)
- Take the conjugate again : \((\cdot)^{*} = M_{\lambda f}\)
In my point of view, this three steps is the most important part of the interpretation of Moreau envelope : a smoothed or regularized form of f.
3.1.3 Moreau envelope of L1 norm
Moreau envelope of \(\mid \cdot \mid\) is the Huber function:
\[\begin{split}\begin{align*}
M_{L1} &= (\mid \cdot \mid^{*} + (1/2)\|x\|^{2}_{2})^{*}\\
&=\sup_x(x^{T}y - \sup_{v}(-\mid v\mid + v^{T}x) - (1/2)\|x\|^{2}_{2})
\end{align*}\end{split}\]
Consider firstly the variable v:
\[\begin{split}\sup_{v \ge 0} ( - \mid v \mid + v^{T}x) =
\begin{cases}
0 \quad x < 1\\
+ \infty \quad x >1
\end{cases}\end{split}\]
\[\begin{split}\sup_{v \le 0} ( - \mid v \mid + v^{T}x) =
\begin{cases}
0 \quad x > -1\\
+ \infty \quad x < -1
\end{cases}\end{split}\]
In summary :
\[\begin{split}k(x) = \sup_{v} ( - \mid v \mid + v^{T}x) =
\begin{cases}
0 \quad \mid x\mid < 1\\
+ \infty \quad \mid x \mid >1
\end{cases}\end{split}\]
And:
\[M_{L1} =\sup_x(-\frac{1}{2}(x-y)^{2} - k(x) + (1/2)y^{2})\]
- If \(\mid y \mid \le 1\), take \(x=y\), \(\mid x \mid \le 1\), \(k(x) = 0\), We will have \(M_{L1} = \frac{1}{2}y^{2}\).
- If \(\mid y \mid \ge 1\), \(\mid x \mid \ge 1\), \(k(x) = \infty\), we should also take \(\mid x \mid \le 1\), as a result \(\mid x \mid = 1\). We will have \(M_{L1} = \mid y \mid - \frac{1}{2}\)
We end up with Huber function.
3.1.4 Gardient of Moreau envelope
Consider the definition of Proximal operator, we have :
\[M_{\lambda f}(x) = f(\mathbf{prox}_{\lambda f}(x)) + \frac{1}{2\lambda}\| x - \mathbf{prox}_{\lambda f}(x)\|^{2}_{2}\]
To find the gradient of Moreau envelope, we reform the expression first:
\[\begin{split}\begin{align*}
M_{\lambda f}(x) &= \inf_{y}(f(y) + (1/2\lambda)\|y-x\|^{2}_{2}) \\
&=\inf_{y}(f(y) + (1/2\lambda)(\|x\|^{2} + \|y\|^{2} - 2x^{T}y)) \\
&= (1/2\lambda)\|x\|^{2} + (1/\lambda) \inf_{y}(\lambda f(y) - x^{T}y + (1/2)\|y\|^{2}) \\
&= (1/2\lambda)\|x\|^{2} - (1/\lambda) \sup_{y}(-\lambda f(y) + x^{T}y - (1/2)\|y\|^{2}) \\
&= (1/2\lambda)\|x\|^{2} - (1/\lambda) (\lambda f + (1/2)\|\cdot\|^{2})^{*}(x)
\end{align*}\end{split}\]
Then take the gradient of both sides, we will have :
\[\begin{split}\begin{align*}
\Delta M_{\lambda f}(x) &= x/\lambda - (1/\lambda)\arg\max_{y}(x^{T}y - \lambda f(y) - (1/2)\|y\|^{2}) \\
& (as \quad x_{best} \in \partial f^{*}(y) \quad from \quad Properties \quad page) \\
&= (1/\lambda)(x- \mathbf{prox}_{\lambda f}(x))
\end{align*}\end{split}\]
\[\mathbf{prox}_{\lambda f}(x) = x- \lambda \Delta M_{\lambda f}(x)\]
The proximal operator is a gradient update step of a smoothed version of f, with step size \(\lambda\).
3.2 Resolvent of subdifferential operator
The subdifferential operator \(\partial f\) is a point-to-set mapping (project to the subgradient set).
From the optimal condition of the evaluation of proximal operator, we have :
\[0 \in \partial f(x) + (1/\lambda)(x-v)\]
\[v \in \lambda \partial f(x) + x = (I + \partial f)(x)\]
\[x \in (I + \partial f)^{-1}(v)\]
\[prox_{\lambda f} = (I + \partial f)^{-1}\]
Where \((I + \partial f)\) is a projection operator. And its inverse \((I + \partial f)^{-1}\) is called
the resolvent of the operator \(\partial f\).
3.3 Modified gradient step
3.3.1 First order approximation
Take the first order taylor expansion of function f:
\[\hat{f}^{(1)}_{v}(x) = f(v) + \Delta f(v)^{T}(x-v)\]
Substitute into the proximal operator:
\[\mathbf{prox}_{\lambda \hat{f}^{(1)}_{v}} = \arg\min_{x} (f(v) + \Delta f(v)^{T}(x-v) + (1/2\lambda)\|x-v\|^{2}_{2})\]
\[\frac{\partial}{\partial x}(\cdot) = \Delta f(v) + (1/\lambda)(x-v) = 0\]
\[\mathbf{prox}_{\lambda \hat{f}^{(1)}_{v}} = v - \lambda \Delta f(v)\]
It is a standard gradient update with step size \(\lambda\).
3.3.2 Second order approximation
Take the seconf order taylor expansion of function f:
\[\hat{f}^{(2)}_{v}(x) = f(v) + \Delta f(v)^{T}(x-v) + \frac{1}{2}(x-v)^{T}\Delta^{2} f(v)(x-v)\]
Substitute into the proximal operator:
\[\mathbf{prox}_{\lambda \hat{f}^{(2)}_{v}} = \arg\min_{x} (f(v) + \Delta f(v)^{T}(x-v) + \frac{1}{2}(x-v)^{T}\Delta^{2} f(v)(x-v) + (1/2\lambda)\|x-v\|^{2}_{2})\]
\[\frac{\partial}{\partial x}(\cdot) = \Delta f(v) + \Delta^{2} f(v)(x-v) + (1/\lambda)(x-v) = 0\]
\[\Delta f(v) + ( \Delta^{2} f(v) + (1/\lambda) I) (1/\lambda)(x-v) = 0\]
\[\mathbf{prox}_{\lambda \hat{f}^{(2)}_{v}} = v - ( \Delta^{2} f(v) + (1/\lambda) I)^{-1}\Delta f(v)\]
It is a Tikhonovregularized Newton update, also known as a Levenberg-Marquardt up-date
or a modified Hessian Newton update. Thus, gradient and Levenberg-Marquardt steps can be viewed as proximal operators
of first and second-order approximations of f.