8. Distributed Model Fitting

It to find a model best fit the measurements. It normally has a observation error term (‘l’ for loss), and a regularization term (r). l and r are chosen to be convex. As the following shows a problem to fit a linear model:

\[minimize \quad l(Ax - b) + r(x)\]

Assume l is additive:

\[l(Ax-b) = \sum_{i=1}^{m}l_{i}(a_{i}^{T}x - b_{i})\]

\(l_{i}\) is the loss of ith training example. \(a_{i}\) is the feature vector of example i (the ith system input), and \(b_{i}\) is the ouput(response) of the example i (the ith observation).

  • r choose l2 : \(r(x) = \lambda\|x\|_{2}^{2}\), is the tikhonov regularization or a ridge penalty.
  • r choose l1 : \(r(x) = \lambda\|x\|_{1}\) is a lasso penalty.
  • In some case, part of the parameters should not be regularized (e.g. offset parameters).
  • Split across training examples.
  • Split across features.

8.1 Examples

  • Regression
  • Classification
  • Image segmentation, denoise, decomposition.

8.2 Splitting across examples

Large amount of relatively low-dimensional data. Goal: solve in a distributed way. Partition A and b (example inputs and measurements):

\[\begin{split}A = \begin{bmatrix} A_{1} \\ : \\ A_{N} \end{bmatrix}, \quad b = \begin{bmatrix} b_{1} \\ : \\ b_{N} \end{bmatrix},\end{split}\]

The problem will be :

\[minimize \quad \sum_{i=1}^{N}l_{i}(A_{i}x - b_{i}) + r(x)\]

Reform the problem into consensus form to enable distributed calculation (turn into a standard ADMM type of problem):

\[\begin{split}\begin{align*} &minimize \quad \sum_{i=1}^{N}l_{i}(A_{i}x_{i} - b_{i}) + r(z) \\ & subject\ to \quad x_{i}-z = 0, \ i = 1,..., N \end{align*}\end{split}\]

Using the scaled form of ADMM updates (see 7.3 sharing problems for more details):

\[\begin{split}\begin{align*} & x_{i}^{k+1} := \arg\min_{x_{i}} (l_{i}(A_{i}x_{i} - b_{i}) + (\rho/2)\|x_{i} - z^{k} + u_{i}^{k} \|_{2}^{2}) \\ & z^{k+1} := \arg\min_{z} (r(z) + (N\rho/2)\|\bar{x}^{k+1} - z + \bar{u}^{k} \|_{2}^{2}) \\ & u_{i}^{k+1} := u_{i}^{k} + x_{i}^{k+1} - z^{k+1} \end{align*}\end{split}\]
  • Lasso
  • Sparse Logistic Regression
  • Support Vector Machine

8.2.1 Group Lasso

For Lasso have the function f being squared l2 norm, and r being the l1 norm. Then the ADMM udpates are:

\[\begin{split}\begin{align*} &x_{i}^{k+1} := \arg\min_{x_{i}} ((1/2)\|A_{i}x_{i} -b_{i}\|_{2}^{2} + (\rho/2)\|x_{i}-z^{k}+u_{i}^{k}\|_{2}^{2}) \\ &z^{k+1}:= S_{\lambda/\rho N}(\bar{x}^{k+1} + \bar{u}^{k}) \\ &u^{k+1}_{i} := u_{i}^{k} + x_{i}^{k+1} - z^{k+1} \end{align*}\end{split}\]

See here for some details about the update of x. The difference from the serial version is that :

  • The update of different group of variables \(x_{i}\) could be carry out in parallel.
  • The collection and averaging steps.

8.2.2 Distributed l_1-regularized logistic regression

Could be compared with the serial version. Could be seen function and Script using L-BFGS for distributed calculations.

8.2.3 Support Vector Machine

Here we model a linear support vector machine problem, which is a linear model fitting problem. Which is to find a best linear model applied to feature variables x (\(w^{T}x_{j} + b\)) to best fit the observation y (\(y_{j}\)), where y is a binary variable.

Which is to say, if the observation \(y_{j}\) is 1, we want \(w^{T}x_{j} + b \to 1\) and if the observation \(y_{j}\) is -1, we want \(w^{T}x_{j} + b \to -1\). Which is a optimization problem :

\[minimize \quad \sum_{j=1}^{M} (1-y_{j}(w^{T}x_{j}+b))\]

In partice, we can truncate the results of the model to 1 or -1, so the problem will be better if we optimize this:

\[minimize \quad \sum_{j=1}^{M} (1-y_{j}(w^{T}x_{j}+b))_{+}\]

Where we have M obervations in total. The problem is equivalent to :

\[\begin{split}minimize \quad \sum_{j=1}^{M}(1 + \begin{bmatrix} -y_{j}x_{j}^{T} & -y_{j} \end{bmatrix} \begin{bmatrix} w \\ b \end{bmatrix})_{+}\end{split}\]

By forming :

\[\begin{split}A = \begin{bmatrix} -y_{1}x_{1}^{T} & -y_{1} \\ : & :\\ -y_{M}x_{M}^{T} & -y_{M} \end{bmatrix}, \quad x = \begin{bmatrix} w \\ b \end{bmatrix},\end{split}\]

We have the reformed problem:

\[minimize \quad (Ax + \mathbb{1})_{+}\]

Adding the regularization term of the linear model weights w:

\[minimize \quad (Ax + \mathbb{1})_{+} + (1/2\lambda)\|w\|_{2}^{2}\]

If we apply the distributed model where i indicates a sub-set of samples, we have :

\[minimize \quad \sum_{i=1}^{N}(A_{i}x + \mathbb{1})_{+} + (1/2\lambda)\|w\|_{2}^{2}\]

Applying the consensus variable z :

\[\begin{split}\begin{align*} &minimize \quad \sum_{i=1}^{N}(A_{i}x_{i} + \mathbb{1})_{+} + (1/2\lambda)\|w\|_{2}^{2} \\ &subject\ to \quad x_{i} = z \end{align*}\end{split}\]

We further simplify the problem with a small adjustment in the regularization term : instead of regularize w we will regularize z directly. Then we will have the problem:

\[\begin{split}\begin{align*} &minimize \quad \mathbb{1}^{T}(A_{i}x_{i} + \mathbb{1})_{+} + (1/2\lambda)\|z\|_{2}^{2} \\ &subject\ to \quad x_{i} = z \end{align*}\end{split}\]

The corresponding ADMM updates are :

\[\begin{split}\begin{align*} &x^{k+1}_{i} := \arg\min_{x_{i}} (\mathbb{1}^{T}(A_{i}x_{i} + \mathbb{1})_{+} + (\rho/2)\|x_{i} - z^{k} + u^{k}_{i}\|_{2}^{2}) \\ &z^{k+1} := \arg\min_{z} ((1/2\lambda)\|z\|_{2}^{2} + \sum_{i=1}^{N} (\rho/2)\|x_{i}^{k+1} - z + u^{k}_{i}\|_{2}^{2}) \\ &u^{k+1}_{i} := u_{i}^{k} + x^{k+1}_{i} - z^{k+1} \end{align*}\end{split}\]

The update of x will be solved by another optimization problem:

cvx_begin
    variable x_var(n)
    minimize ( sum(pos(A{i}*x_var + 1)) + rho/2*sum_square(x_var - z(:,i) + u(:,i)) )
cvx_end

The update of z is simple, using the first order optimal condition we have :

\[(1/\lambda)z^{k+1} + \sum_{i=1}^{N}(-\rho(x_{i}^{k+1}- z^{k+1} + u^{k}_{i})) = 0\]
\[z^{k+1} = \frac{\rho N}{(1/\lambda) + N \rho}(\bar{x}^{k+1} + \bar{u}^{k})\]

Then, we get the final updates of ADMM of linear SVM :

\[\begin{split}\begin{align*} &x^{k+1}_{i} := \arg\min_{x_{i}} (\mathbb{1}^{T}(A_{i}x_{i} + \mathbb{1})_{+} + (\rho/2)\|x_{i} - z^{k} + u^{k}_{i}\|_{2}^{2}) \\ &z^{k+1} := \frac{\rho N}{(1/\lambda) + N \rho}(\bar{x}^{k+1} + \bar{u}^{k})\\ &u^{k+1}_{i} := u_{i}^{k} + x^{k+1}_{i} - z^{k+1} \end{align*}\end{split}\]

Code and Script could be found in ADMM Stanford page (A distributed version but solved in serial).

8.3 Splitting across Features

Model fitting problems with a modest number of examples and a large number of features.

  • NLP(natural language processing) : pairs of adjucent words (bigrams), etc.
  • Bioinformatics: DNA mutation, etc.

Partition of the parameter vector x as \(x = (x_{1}, ..., x_{N})\), and A as \(A = [A_{1},...,A_{N}]\), the problem will be:

\[minimize\quad l(\sum_{i=1}^{N} A_{i}x_{i} -b) + \sum_{i=1}^{N}r_{i}(x_{i})\]

Reform into consensus problem:

\[\begin{split}\begin{align*} & minimize \quad l(\sum_{i=1}^{N} z_{i} -b) + \sum_{i=1}^{N}r_{i}(x_{i}) \\ & subject\ to \quad A_{i}x_{i} - z_{i} =0,\ i=1,...,N \end{align*}\end{split}\]

The corresponding scaled form of ADMM is :

\[\begin{split}\begin{align*} & x_{i}^{k+1} := \arg\min_{x_{i}} (r_{i}(x_{i}) + (\rho/2)\|A_{i}x_{i} - z_{i}^{k} + u_{i}^{k} \|_{2}^{2}) \\ & z^{k+1} := \arg\min_{z} (l(\sum_{i=1}^{N}z_{i}-b) + \sum_{i=1}^{N}(\rho/2)\|A_{i}x_{i}^{k+1} - z_{i} + u_{i}^{k} \|_{2}^{2}) \\ & u_{i}^{k+1} := u_{i}^{k} + A_{i}x_{i}^{k+1} - z^{k+1}_{i} \end{align*}\end{split}\]

8.3.1 Group Lasso

Code and Script .