7. Applications =========================== We can always use the following Matlab code to calculate the proximal operator:: function x = prox_f_cvx(v, lambda, f, A) % For testing purposes. [n T] = size(v); cvx_begin quiet variable x(n,T) minimize(f(x,A) + (1/(2*lambda))*square_pos(norm(x - v,'fro'))) subject to x <= 1; x >= -1; cvx_end end 7.1 Lasso ------------------- Lasso is short of **Least Absoluate Shrinkage and Selection Operator**. The problem is : .. math:: minimize \quad (1/2)\|Ax-b\|^{2}_{2} + \gamma \|x\|_{1} 7.1.1 Proximal gradient method ~~~~~~~~~~~~~~~~~~~~~ It carries out variable selection (by the l1 heuristic), and model fitting (by the least square). Consider the splitting : .. math:: f(x) = (1/2)\|Ax-b\|^{2}_{2} ,\quad g(x) = \gamma \|x\|_{1} With the gradient and proximal operator: .. math:: \Delta f(x) = A^{T}(Ax - b), \quad \mathbf{prox}_{g}(x) = S_{\gamma}(x) = (x - \gamma)_{+} - (-x -\gamma)_{+} The proximal gradient method will be : .. math:: \begin{align*} x^{k+1} &:= \mathbf{prox}_{\lambda^{k} g} (x^{k} - \lambda^{k} \Delta f(x^{k})) \\ & = S_{\lambda^{k}\gamma}(x^{k} - \lambda^{k}A^{T}(Ax - b)) \end{align*} The proximal gradient method is also called **ISTA** (iterative shrinkage-thresholding algorithm), while the accelerated version is known as **FISTA** (fast ISTA), the fast version is basicly adding a momentum. As the proximal gradient method can be interpreted as seperately optimize f and g. * We can further accelerate the algorithm by parallex matrix-vector multiplication. * Or even use the **Gram matrix** as mentioned in Chapter "Evaluating Proximal Operator". * If we want solution for multiply :math:`\gamma`, we can use the solution of the largest :math:`\gamma` as warm starting. 7.1.2 ADMM ~~~~~~~~~~~~~~~~~~~~~~~~~~ f is quadratic function, we can use the tricks as before: * Reuse of factorization. * Warm start with previous gradient, if use an iterative method. * if n much smaller, we can precompute the Gram matrix. .. math:: \begin{align*} &x^{k+1} := (A^{T}A + (1/\lambda))^{-1}(A^{T}b + (1/\lambda)(z^{k} - u^{k}) ) \\ &z^{k+1} := S_{\lambda^{k}\gamma}(x^{k+1} + u^{k})\\ &u^{k+1} := u^{k} + x^{k+1} - z^{k+1} \end{align*} 7.1.3 Test ~~~~~~~~~~~~~~~~~~~~~~~~ Here we should pay attention to the update of x. With :math:`A\in \mathbb{R}^{m \times n}`, if m larger than n, we should use the expression upper. While in the case when m is smaller than n, we can reform the process to accelerate: .. math:: (A^{T}A + (1/\lambda))x = A^{T}b + (1/\lambda)(z^{k} - u^{k}) \triangleq q .. math:: A(A^{T}A + (1/\lambda))x = Aq .. math:: (AA^{T}A + A(1/\lambda))x = Aq .. math:: (AA^{T} + (1/\lambda))Ax = Aq .. math:: A^{T}Ax = A^{T}(AA^{T} + (1/\lambda))^{-1}Aq \triangleq p Using this in the original equation we have: .. math:: (A^{T}A + (1/\lambda))x = q .. math:: p + (1/\lambda)x = q .. math:: x = \lambda (q - p) we will have the corresponding code in matlab as :: L = chol(speye(m) + lambda*(A*A'), 'lower'); L = sparse(L); U = sparse(L'); q = Atb + rho*(z - u); x = lambda*(q - lambda*(A'*(U \ ( L \ (A*q) )))); The original codes could be found `here `_, or in `my github page `_ The result run times are : * CVX time elapsed: 25.06 seconds. * Proximal gradient time elapsed: 0.35 seconds. * Fast prox gradient time elapsed: 0.17 seconds. * ADMM time elapsed: 0.04 seconds. .. image:: images/lasso.jpg :align: center 7.2 Matrix decomposition ------------------------ The problem is to decompose matrix A into a sum of components :math:`X_{i}` .. math:: \begin{align*} &minimize \quad \phi_{1}(X_{1}) + \gamma_{2}\phi_{2}(X_{2}) + \cdot\cdot\cdot + \gamma_{N}\phi_{N}(X_{N}) \\ &subject\quad to\quad X_{1} + X_{2} + \cdot\cdot\cdot + X_{N} = A \end{align*} The function :math:`\phi(X)` can usually be seen as 'penalties', to drive :math:`X_{i}` to have our objective properties. * **Squared Frobenius norm**: :math:`\phi(X) = \|X\|_{F}^{2} = \sum_{i,j}X_{i,j}^{2}`, to encourage X to be small. * **Entrywise l1 norm**: :math:`\phi(X) = \|X\|_{1} = \sum_{i,j}\mid X_{i,j}\mid`, to encourage X to be sparse. * **Sum-column-norm**: :math:`\phi(X) = \sum_{j}\|x_{j}\|_{2}`, to encourage column sparsity. (can be interpreted as group lasso regulization) * **Elementwise constraints**: :math:`X_{i,j}\in C_{i,j}`, for instant, we want to fixed some entries (fixed sparse pattern). * **Separable convex function**: :math:`\phi(X) = \sum_{i=1}^{m}\sum_{j=1}^{n}\phi_{i,j}(X_{i,j})`. For instant, constrain the subblock of the matrix. * **Semidefinite cone constraint**: :math:`X \succeq 0`. * **Nuclear norm**: :math:`\phi(X) = \|X\|_{*} = tr(X^{T}X)`, encourage X to be low rank. For an example, take :math:`\phi_{1}` be the Squred Frobenius norm,:math:`\phi_{2}` be the entrywise l1 norm, :math:`\phi_{3}` be the Nuclear norm, the problem can be reformed into: .. math:: minimize \quad \|A-(X_{2} + X_{3})\|_{F}^{2} + \gamma_{2}\|X_{2}\|_{1} + \gamma_{3}\|X_{3}\|_{*} So we will decompose A into a sum of a small matrix :math:`X_{1}`, a sparse matrix :math:`X_{2}`, and a low rank matrix :math:`X_{3}`. 7.2.1 ADMM ~~~~~~~~~~~~~~~~~~~~ Consider the splitting: .. math:: f(X) = \sum_{i = 1}^{N}\phi_{i}(X_{i}), \quad g(X)= I_{\mathcal{C}}(X) where :math:`X = (X_{1}, ..., X_{N})`, and : .. math:: \mathcal{C} = \left\{ (X_{1},...,X_{N}) \mid \sum_{i=1}^{N}X_{i} = A \right\} f is to evulate the objective function, and g is to project onto :math:`\mathcal{C}`: the feasible set. The projection is fairly simple, which is similar to a translation of centroid: .. math:: \Pi_{\mathcal{C}}(X) = X - \bar X + (1/N)A So the final algorithms looks as follows: .. math:: \begin{align*} &X_{i}^{k+1} := \mathbf{prox}_{\lambda \phi_{i}}(X_{i}^{k} - \bar X^{k} + (1/N)A - U^{k}) \\ &U^{k+1} := U^{k} + \bar X^{k+1} - (1/N)A \end{align*} 7.2.2 Test ~~~~~~~~~~~~~~~~~~~~ Take the former example : :math:`\phi_{1}` be the Squred Frobenius norm,:math:`\phi_{2}` be the entrywise l1 norm, :math:`\phi_{3}` be the Nuclear norm. Note :math:`B=\bar X^{k} - (1/N)A + U^{k}` So our updates of X is: .. math:: \begin{align*} &X_{1} = \mathbf{prox}_{\lambda L2}(X_{1}-B) = \frac{1}{1+\lambda}(X_{1} -B) \\ &X_{2} = \mathbf{prox}_{\lambda L1}(X_{2}-B) = S_{\gamma_{2} \lambda}(X_{2} -B) \\ &X_{3} = \mathbf{prox}_{\lambda Nuclear}(X_{3}-B) = U \mathbf{diag}(\mathbf{prox}_{\lambda f}(\sigma_{s}(X_{3}-B)))V^{T} \end{align*} Corresponding codes are:: X_1 = (1/(1+lambda))*(X_1 - B); X_2 = prox_l1(X_2 - B, lambda*g2); X_3 = prox_matrix(X_3 - B, lambda*g3, @prox_l1); where prox_matrix is defined as :: function [ Vout ] = prox_matrix(X, eta, prox_l1) [U,S,V] = svd(X); % X= U*S*V' Spos = prox_l1(S, eta); Vout = U * Spos * V'; end We get the output :: CVX (vs true): |V| = 0.31; |X_1| = 26.23 nnz(S) = 49; nnz(X_2) = 53 rank(L) = 4; rank(X_3) = 4 ADMM (vs true): |V| = 0.31; |X_1| = 26.18 nnz(S) = 49; nnz(X_2) = 52 rank(L) = 4; rank(X_3) = 4 ADMM vs CVX solutions (in Frobenius norm): X_1: 3.59e-01; X_2: 6.15e-01; X_3: 5.30e-01 7.3 Multi-period portfolio optimization ---------------------------- Optimize the sum of a risk-adjusted negative return f and a transaction cost g, for a period of portfolio investment. .. math:: minimize \quad \sum_{t=1}^{T}f_{t}(x_{t}) + \sum_{t=1}^{T}g_{t}(x_{t} - x_{t-1}) With the constraints that indicate any short position (x>0), and limit the sum o f liquid. .. math:: x_{t} \ge 0, \quad \sum_{i=1}^{N} x_{t,i} \le 1 Assume that :math:`f_{t}` and :math:`g_{t}` are closed proper convex and that :math:`f_{t}` are fully separable, i.e. the transaction cost in any period is the sum of the transaction costs for each asset. Let :math:`X = [x_{1},...,x_{T}] \in \mathbb{R}^{n \times T}` donate the matrix of the portfoilo sequence. Consider the splitting: .. math:: f(X) = \sum_{t=1}^{T}f_{t}(x_{t}) \quad g(X) = \sum_{t=1}^{T}g_{t}(x_{t} - x_{t-1}) **Where f is separable across the columns of X and g is separable across the rows of X.** Recall the update formula of ADMM : .. math:: \begin{align*} &x^{k+1}:=\mathbf{prox}_{\lambda f}(z^{k} - u^{k}) \\ &z^{k+1}:=\mathbf{prox}_{\lambda g}(x^{k+1} + u^{k}) \\ &u^{k+1} := u^{k} + x^{k+1} - z^{k+1} \end{align*} The update of each column of x will be solved using a CVX solver: .. math:: \begin{align*} &minimize \quad f_{t}(x_{t}) + (1/2\lambda)\|x_{t} - z^{k} + u^{k}\|^{2}_{2}\\ & subject\ to \quad x_{t}\ge 0, \quad \sum_{i=1}^{N} x_{t,i} \le 1 \end{align*} The update of each rows of z will be solving the following problem: .. math:: \begin{align*} &minimize \quad \sum_{t=1}^{N} g_{t,i}(z_{t,i}- z_{t-1,i}) + (1/2\lambda)\|z_{t,i} - x^{k+1} - u^{k}\|_{2}^{2} \\ & subject\ to \quad z_{1} = 0 \end{align*} Code could be found in `Stanford page `_, or in `my github `_ . The following image shows the time series of asset holdings. .. image:: images/fin_asset_holdings.png :align: center In this case, ADMM method converges to the same optimal point as CVX solver, however it is much slower. 7.4 Stochastic optimization ----------------------------- Optimize the stochastic optimization, with :math:`\pi` be a probability distribution, and :math:`f_{(k)}` is the closed proper convex objective function for scenario k. .. math:: minimize \quad \mathcal{E}(f(x)) = \sum_{k = 1}^{K} \pi_{k}f^{(k)}(x) Reform the problem into consensus form by introducing a consensus constraint. .. math:: \begin{align*} & minimize \quad \mathcal{E}(f(x)) = \sum_{k = 1}^{K} \pi_{k}f^{(k)}(x^{(k)}) \\ & subject \ to \quad x^{(1)} = \cdot\cdot\cdot = x^{(K)} \end{align*} To use ADMM method, the function f take the form of the upper objective function, while g take the form of a pojection onto the feasible set, by take the average of the local solutions :math:`x^{(k)}`. Example code could be found `here `_. ADMM and CVX method solved in similar amount of time, while we should notice that in ADMM, the updates of variables could be processed in parallex.