6. Evaluating Proximal Operators ============================= To use the proximal algorithms, we need to evaluate (solve) the proximal operator at each step. The proximal operator is a convex optimization problem. * We will investigate solving using generic optimization algorithms. * Proximal methods can be very useful even in cases when a closed form solution for the proximal operator is not available. * Proximal operators and certain projection operators have a close connection. 6.1 Generic methods ------------------------- .. math:: \begin{cases} Unconstrained \begin{cases} Non-smooth : \quad Subgradient \quad methods, etc. \\ Smooth : \quad Gradient \quad methods (Newton, quasi-Newton, etc), etc \end{cases} \\ Constrained \begin{cases} Non-smooth: \quad Projected \quad subgradient, etc. \\ Smooth: \quad Projected \quad gradient, Interior-point \quad methods, etc \end{cases} \end{cases} We can also review the solve of SDP, we can transform SDP to a cone problem, and solve it with a generic interior-point method based cone solver. 6.1.1 Quariatic problems ~~~~~~~~~~~~~~~~~~~~~~~~ .. math:: f(x) = (1/2)x^{T}Ax + b^{T}x + c With :math:`A \in \mathbf{S}^{n}_{+}`, then : .. math:: \mathbf{prox}_{\lambda f}(v) = \arg\min_{x} ((1/2)x^{T}Ax + b^{T}x + c + (1/2\lambda) (x^{T}x + v^{T}v - 2v^{T}x)) .. math:: \frac{\partial}{\partial x} = Ax + (b - (1/\lambda)v) + (1/\lambda)x = 0 .. math:: \mathbf{prox}_{\lambda f}(v) = (I + \lambda A)^{-1}(v- \lambda b) In summary, evaluate the proximal operator is to solve the linear equation: .. math:: (I + \lambda A)x = v- \lambda b * We can explore the sparse structure of A, to solve faster. * If we use direct method (factorize), we could save the factorization for re-use. * If we use an iterative method, e.g. CG, we can warm start with previous solution. 6.1.2 Smooth functions ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Same idea here as the descussion above, in most our method, we will try solve : .. math:: (I + \lambda H)x = v- \lambda b * We can use the previous solution as a warm start. * We can exploit structue of Hessian(H) of f. 6.1.3 General scalar functions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Localization method**: this method, is taught in Subgradient Chapter of EE364b, as a method suitable for solving non-smooth problem. It is basicly iteratively reduce the candidate region of the solution. We can use the subgradient to find a update direction. It can be guaranteed to converge. In practice, this method works, however in a very slow way. **Guarded Newton method**: this method is basicly the same as localization method, except it requires the objective function to betwice differentiable, to use the hessian and gradient to find a update direction. It converges faster but more expensive for each iteration (a nature result as it exploit more the objective function). 6.2 Polyhedra ----------------------- Consider the problem : projection to a polyhedra: .. math:: \begin{align*} &minimize \quad (1/2)\|x - v\|^{2}_{2} \\ &subject\quad to\quad Ax = b, \quad Cx \le d \end{align*} Where :math:`A \in \mathbf{R}^{m \times n}`, and :math:`C \in \mathbf{R}^{p \times n}`. The dual problem of it is a convex quadratic problem. So it is the same as evulative the proximal operator of a convex quadratic function restricted to a polyhedron. 6.2.1 Solution via duality ~~~~~~~~~~~~~~~~~~~~~~ It is much more efficient to solve via duality, when m and p are both much smaller than n. m and n are much smaller means the dual variables have much smaller dimension, which is to say the cost to solve the dual problem is much lower. This corresponding to a case where we wanto to project a high-dimensional point onto a polyhedron described by just a few equalities and inequalities. .. math:: \mathbb{L}(x, \mu, \eta) = (1/2)(x^{T}x + v^{T}v - 2v^{T}x) + \mu^{T}(Ax-b) + \eta^{T}(Cx-d) The optimal conditions are : .. math:: \begin{align*} &\frac{\partial \mathbb{L}}{\partial x} = x - v + A^{T}\mu + C^{T}\eta = 0 \\ &\frac{\partial \mathbb{L}}{\partial \mu} = Ax - b = 0 \\ &\frac{\partial \mathbb{L}}{\partial \eta} = Cx - d = 0 \end{align*} From the first condition we have : .. math:: x = v - A^{T}\mu - C^{T}\eta Them the dual function is the concave quadratic: .. math:: \begin{align*} g(\mu, \eta) &= - (1/2)\|v - A^{T}\mu - C^{T}\eta\|^{2}_{2} + (1/2)v^{T}v - \mu^{T}b - \eta^{T}d \\ &= - \frac{1}{2}\|A^{T}\mu - C^{T}\eta\|^{2}_{2} + ((Av)^{T} - b^{T})\mu + ((Cv)^{T} - d^{T})\eta \\ &= - \frac{1}{2}\| \begin{bmatrix}A\\C \end{bmatrix}^{T}\begin{bmatrix} \mu \\ \eta\end{bmatrix} \|^{2}_{2} + (\begin{bmatrix}A\\ C \end{bmatrix} v - \begin{bmatrix} b\\d\end{bmatrix})^{T}\begin{bmatrix}\mu \\ \eta \end{bmatrix} \end{align*} The dual problem is : .. math:: \begin{align*} &maximize g(\mu, \eta) \\ &subject\quad to\quad \eta \ge 0 \end{align*} The solution of the primal problem is : .. math:: x^{*} = v - A^{T}\mu^{*} - C^{T}\eta^{*} **Gram matrix caching**, is n is large but m+p is modest. we define : .. math:: G = \begin{bmatrix}A\\ C \end{bmatrix} \in \mathbf{R}^{(m+p)\times n} As a result, we have : .. math:: g(\mu, \eta) = - \frac{1}{2}\| G^{T}\begin{bmatrix} \mu \\ \eta\end{bmatrix} \|^{2}_{2} + (G v - \begin{bmatrix} b\\d\end{bmatrix})^{T}\begin{bmatrix}\mu \\ \eta \end{bmatrix} We can parallely compute :math:`GG^{T}` : .. math:: GG^{T} = \sum_{i= 1}^{n} \begin{bmatrix}a_{i}\\c_{i}\end{bmatrix} \begin{bmatrix}a_{i}\\c_{i}\end{bmatrix}^{T} The computation of this part is the most expensive, as a result we can use the similar gestion as we deal with the Newton's step. For an example, we can limit the frequence of the update of :math:`GG^{T}`. 6.2.2 Affine set ~~~~~~~~~~~~~~~~~~~~~ see the paper 6.2.3 Halfspace ~~~~~~~~~~~~~~~~~~~~~ see the paper 6.2.4 Box ~~~~~~~~~~~~~~~~~~~~~ see the paper 6.2.5 Simplex ~~~~~~~~~~~~~~~~~~~~~ see the paper 6.3 Cones --------------------- .. math:: \begin{align*} &minimize \quad (1/2)\|x - v\|^{2}_{2} \\ &subject\quad to\quad x \in \mathcal{K} \end{align*} Where :math:`\mathcal{K}` is a proper cone with dual cone :math:`\mathcal{K}^{*}`. It is a projection onto a proper cone, using the Moreau theorem, we have : .. math:: v = \Pi_{\mathcal{K}}(v) + \Pi_{\mathcal{K}^{*}}(v) So we have : .. math:: x \in \mathcal{K}, \quad v = x - \lambda, \quad \lambda \in \mathcal{K}^{*}, \quad \lambda^{T}x = 0 It also corresponding to the duality properties of proper cone. 6.5 Norms and norm balls -------------------------------- .. math:: \begin{align*} \mathbf{prox}_{\lambda f}(v)&= v - \lambda \mathbf{prox}_{f^{*}/\lambda}(v/\lambda) \\ &= v - \lambda \Pi_{\mathcal{B}}(v/\lambda) \end{align*} 6.5.1 Euclidean norm ~~~~~~~~~~~~~~~~~~~~~ .. math:: \mathbf{prox}_{\lambda f}(v) = (1-\lambda/\|v\|_{2})_{+}v = \begin{cases} (1-\lambda/\|v\|v_{2})v \quad \|v\|_{2} \ge \lambda \\ 0 \quad \quad \|v\|_{2} < \lambda \end{cases} In :mathbb{R}: space, it is a ReLU function. In :math:`\mathbb{R}^{2}`, its magnitude looks like: .. image:: images/magitude_l2_r2.png :align: center