3. Interpretations ============================== 3.1 Moreau-Yosida regularization ------------------------------ From the name we can know that, this interpretation is closely related to the Moreau decomposition. And the proximal operator has the same formula as the moreau-vosida regularization. **Definition 3.1**: The **infimal convolution** of closed proper convex function f and g on :math:`\mathbb{R}^{n}`, denoted :math:`f \square g` is defined as : .. math:: (f \square g)(v) = \inf_{x}(f(x) + g(v-x)) with :math:`\mathbf{dom}(f\square g) = \mathbf{dom}f + \mathbf{g}` The main example related is the **Moreau envelope** or **Moreau-Yosida regularization** :math:`M_{\lambda f}` of the function :math:`\lambda f`, (which has the same definition as :math:`\bar f_{\mu}` in the page "Proerties", the proof of Moreau decomposition) is defined as: .. math:: M_{\lambda f}(v) = \lambda f \square (1/2)\| \cdot \|^{2}_{2} .. math:: M_{\lambda f}(v) = \inf_{x}(f(x) + (1/2\lambda) \| x- v\|^{2}_{2}) 3.1.1 Convexity ~~~~~~~~~~~~~~~~~~~~~~ Note :math:`L(x,y) = f(y) + \frac{1}{2*\mu} \| x- y\|^{2}_{2}`, then is jointly convex in x and y, Then :math:`M_{\lambda f}(v) = \inf_{y}L(x,y)` must be convex (since its epigraph is the projection of a convex set). 3.1.2 Smoothed or Regularized form of f ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Theorem 3.2** : :math:`(f \square g)^{*} = f^{*} + g^{*}`. **Proof of Theorem 3.2** : .. math:: \begin{align*} (f\square g)^{*}(x) &= \sup_{u}(y^{T}u - (f\square g)(u)) \\ &= \sup_{u}(y^{T}u - \inf_{v}(f(v) + g(u-v))) \\ &= \sup_{u} \sup_{v} (y^{T}u - f(v) - g(u-v)) \\ & \quad ( t \triangleq u - v \Rightarrow u = t+v) \\ &= \sup_{t}sup_{v}(y^{T}(t+v) - f(v) - g(t)) \\ &= \sup_{t}(y^{T}t -g(t)) + \sup_{v}(y^{T}v - f(v))\\ &= f^{*} + g^{*} \end{align*} **Proof of Theorem 3.2 with duality** : First, we can reform the Moreau envelope as a equation constrained convex optimization problem: .. math:: \begin{align*} M_{\lambda f}(x) &= \inf_{y}(f(y) + (1/2\lambda)\|x-y\|^{2}) \\ &= \inf_{y, s.t. x=y=z}(f(y) + (1/2\lambda)\|z\|^{2}) \end{align*} The largangian of this problem is : .. math:: \begin{align*} \mathbb{L}(y,z,\mu) &= f(y) + (1/2\lambda)\|z\|^{2} + \mu^{T}(x-y-z) \\ &=(f(y)- \mu^{T}y) + ((1/2\lambda)\|z\|^{2} - \mu^{T}z) + \mu^{T}x \end{align*} The dual problem is : .. math:: \begin{align*} g(\mu) &= \inf_{y,z} \mathbb{L}(y,z,\mu) \\ &= - \sup_{y}(\mu^{T}y - f(y)) - (\lambda/2)\|\mu\|^{2} + \mu^{T}x \\ &= - f^{*}(\mu) - (\lambda/2)\|\mu\|^{2} + \mu^{T}x \end{align*} As strong duality holds for strickly feasible convex optimization problem, we have the dualty gap equals zero: .. math:: \begin{align*} M_{\lambda f}(x) &= \sup_{\mu} g(\mu)\\ &= \sup_{\mu} (-f^{*}(\mu) - (\lambda/2)\|\mu\|^{2} + \mu^{T}x) \\ &= (f^{*} + (\lambda/2)\|\cdot\|^{2})^{*}(x) \end{align*} As a result of Theorem 3.2, we have : .. math:: \begin{align*} M_{\lambda f} = M_{\lambda f}^{**} &= (f \square (1/2\lambda)\|\cdot\|^{2}_{2})^{**} \\ &= (f^{*} + ( (1/2\lambda)\|\cdot\|^{2}_{2})^{*})^{*} \\ & \quad (\|\cdot\|^{2}_{2} \quad is \quad self-dual) \\ &=(f^{*} + (1/2\lambda)\|\cdot\|^{2}_{2})^{*} \end{align*} From the upper equation, we can interperte the Moreau envelope :math:`M_{\lambda f}` as a smooth approximation to a function by taking its conjugate, adding regulization. **Mreau envelope obtains a smooth approximation via**: 1. Take the conjugate of f : :math:`f^{*}`. 2. Regularize : :math:`f^{*} + (1/2\lambda)\|\cdot\|^{2}_{2}` 3. Take the conjugate again : :math:`(\cdot)^{*} = M_{\lambda f}` In my point of view, this three steps is the most important part of the interpretation of Moreau envelope : a **smoothed or regularized form of f**. 3.1.3 Moreau envelope of L1 norm ~~~~~~~~~~~~~~~~~~~~~~~~~ Moreau envelope of :math:`\mid \cdot \mid` is the Huber function: .. math:: \begin{align*} M_{L1} &= (\mid \cdot \mid^{*} + (1/2)\|x\|^{2}_{2})^{*}\\ &=\sup_x(x^{T}y - \sup_{v}(-\mid v\mid + v^{T}x) - (1/2)\|x\|^{2}_{2}) \end{align*} Consider firstly the variable v: .. math:: \sup_{v \ge 0} ( - \mid v \mid + v^{T}x) = \begin{cases} 0 \quad x < 1\\ + \infty \quad x >1 \end{cases} .. math:: \sup_{v \le 0} ( - \mid v \mid + v^{T}x) = \begin{cases} 0 \quad x > -1\\ + \infty \quad x < -1 \end{cases} In summary : .. math:: k(x) = \sup_{v} ( - \mid v \mid + v^{T}x) = \begin{cases} 0 \quad \mid x\mid < 1\\ + \infty \quad \mid x \mid >1 \end{cases} And: .. math:: M_{L1} =\sup_x(-\frac{1}{2}(x-y)^{2} - k(x) + (1/2)y^{2}) * If :math:`\mid y \mid \le 1`, take :math:`x=y`, :math:`\mid x \mid \le 1`, :math:`k(x) = 0`, We will have :math:`M_{L1} = \frac{1}{2}y^{2}`. * If :math:`\mid y \mid \ge 1`, :math:`\mid x \mid \ge 1`, :math:`k(x) = \infty`, we should also take :math:`\mid x \mid \le 1`, as a result :math:`\mid x \mid = 1`. We will have :math:`M_{L1} = \mid y \mid - \frac{1}{2}` We end up with **Huber function**. 3.1.4 Gardient of Moreau envelope ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Consider the definition of Proximal operator, we have : .. math:: M_{\lambda f}(x) = f(\mathbf{prox}_{\lambda f}(x)) + \frac{1}{2\lambda}\| x - \mathbf{prox}_{\lambda f}(x)\|^{2}_{2} To find the gradient of Moreau envelope, we reform the expression first: .. math:: \begin{align*} M_{\lambda f}(x) &= \inf_{y}(f(y) + (1/2\lambda)\|y-x\|^{2}_{2}) \\ &=\inf_{y}(f(y) + (1/2\lambda)(\|x\|^{2} + \|y\|^{2} - 2x^{T}y)) \\ &= (1/2\lambda)\|x\|^{2} + (1/\lambda) \inf_{y}(\lambda f(y) - x^{T}y + (1/2)\|y\|^{2}) \\ &= (1/2\lambda)\|x\|^{2} - (1/\lambda) \sup_{y}(-\lambda f(y) + x^{T}y - (1/2)\|y\|^{2}) \\ &= (1/2\lambda)\|x\|^{2} - (1/\lambda) (\lambda f + (1/2)\|\cdot\|^{2})^{*}(x) \end{align*} Then take the gradient of both sides, we will have : .. math:: \begin{align*} \Delta M_{\lambda f}(x) &= x/\lambda - (1/\lambda)\arg\max_{y}(x^{T}y - \lambda f(y) - (1/2)\|y\|^{2}) \\ & (as \quad x_{best} \in \partial f^{*}(y) \quad from \quad Properties \quad page) \\ &= (1/\lambda)(x- \mathbf{prox}_{\lambda f}(x)) \end{align*} .. math:: \mathbf{prox}_{\lambda f}(x) = x- \lambda \Delta M_{\lambda f}(x) The proximal operator is a **gradient update step** of a smoothed version of f, with step size :math:`\lambda`. 3.2 Resolvent of subdifferential operator ----------------------------------------- The subdifferential operator :math:`\partial f` is a point-to-set mapping (project to the subgradient set). From the optimal condition of the evaluation of proximal operator, we have : .. math:: 0 \in \partial f(x) + (1/\lambda)(x-v) .. math:: v \in \lambda \partial f(x) + x = (I + \partial f)(x) .. math:: x \in (I + \partial f)^{-1}(v) .. math:: prox_{\lambda f} = (I + \partial f)^{-1} Where :math:`(I + \partial f)` is a projection operator. And its inverse :math:`(I + \partial f)^{-1}` is called the **resolvent of the operator** :math:`\partial f`. 3.3 Modified gradient step --------------------------------------- 3.3.1 First order approximation ~~~~~~~~~~~~~~~~~~~~ Take the first order taylor expansion of function f: .. math:: \hat{f}^{(1)}_{v}(x) = f(v) + \Delta f(v)^{T}(x-v) Substitute into the proximal operator: .. math:: \mathbf{prox}_{\lambda \hat{f}^{(1)}_{v}} = \arg\min_{x} (f(v) + \Delta f(v)^{T}(x-v) + (1/2\lambda)\|x-v\|^{2}_{2}) .. math:: \frac{\partial}{\partial x}(\cdot) = \Delta f(v) + (1/\lambda)(x-v) = 0 .. math:: \mathbf{prox}_{\lambda \hat{f}^{(1)}_{v}} = v - \lambda \Delta f(v) It is a standard gradient update with step size :math:`\lambda`. 3.3.2 Second order approximation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Take the seconf order taylor expansion of function f: .. math:: \hat{f}^{(2)}_{v}(x) = f(v) + \Delta f(v)^{T}(x-v) + \frac{1}{2}(x-v)^{T}\Delta^{2} f(v)(x-v) Substitute into the proximal operator: .. math:: \mathbf{prox}_{\lambda \hat{f}^{(2)}_{v}} = \arg\min_{x} (f(v) + \Delta f(v)^{T}(x-v) + \frac{1}{2}(x-v)^{T}\Delta^{2} f(v)(x-v) + (1/2\lambda)\|x-v\|^{2}_{2}) .. math:: \frac{\partial}{\partial x}(\cdot) = \Delta f(v) + \Delta^{2} f(v)(x-v) + (1/\lambda)(x-v) = 0 .. math:: \Delta f(v) + ( \Delta^{2} f(v) + (1/\lambda) I) (1/\lambda)(x-v) = 0 .. math:: \mathbf{prox}_{\lambda \hat{f}^{(2)}_{v}} = v - ( \Delta^{2} f(v) + (1/\lambda) I)^{-1}\Delta f(v) It is a Tikhonovregularized Newton update, also known as a Levenberg-Marquardt up-date or a modified Hessian Newton update. Thus, gradient and Levenberg-Marquardt steps can be viewed as proximal operators of first and second-order approximations of f. 3.4 Trust region problem ------------------------- The second term of the proximal operator :math:`\|x-v\|^{2}_{2}` could be seen as a squared penalty, to restraint the variable in a nearby region.