Misplaced Pages

Random coordinate descent

Article snapshot taken from Wikipedia with creative commons attribution-sharealike license. Give it a read and then ask your questions in the chat. We can research this topic together.

Randomized (Block) Coordinate Descent Method is an optimization algorithm popularized by Nesterov (2010) and Richtárik and Takáč (2011). The first analysis of this method, when applied to the problem of minimizing a smooth convex function, was performed by Nesterov (2010). In Nesterov's analysis the method needs to be applied to a quadratic perturbation of the original function with an unknown scaling factor. Richtárik and Takáč (2011) give iteration complexity bounds which do not require this, i.e., the method is applied to the objective function directly. Furthermore, they generalize the setting to the problem of minimizing a composite function, i.e., sum of a smooth convex and a (possibly nonsmooth) convex block-separable function:

F ( x ) = f ( x ) + Ψ ( x ) , {\displaystyle F(x)=f(x)+\Psi (x),}

where Ψ ( x ) = i = 1 n Ψ i ( x ( i ) ) , {\displaystyle \Psi (x)=\sum _{i=1}^{n}\Psi _{i}(x^{(i)}),} x R N {\displaystyle x\in R^{N}} is decomposed into n {\displaystyle n} blocks of variables/coordinates: x = ( x ( 1 ) , , x ( n ) ) {\displaystyle x=(x^{(1)},\dots ,x^{(n)})} and Ψ 1 , , Ψ n {\displaystyle \Psi _{1},\dots ,\Psi _{n}} are (simple) convex functions.

Example (block decomposition): If x = ( x 1 , x 2 , , x 5 ) R 5 {\displaystyle x=(x_{1},x_{2},\dots ,x_{5})\in R^{5}} and n = 3 {\displaystyle n=3} , one may choose x ( 1 ) = ( x 1 , x 3 ) , x ( 2 ) = ( x 2 , x 5 ) {\displaystyle x^{(1)}=(x_{1},x_{3}),x^{(2)}=(x_{2},x_{5})} and x ( 3 ) = x 4 {\displaystyle x^{(3)}=x_{4}} .

Example (block-separable regularizers):

  1. n = N ; Ψ ( x ) = x 1 = i = 1 n | x i | {\displaystyle n=N;\Psi (x)=\|x\|_{1}=\sum _{i=1}^{n}|x_{i}|}
  2. N = N 1 + N 2 + + N n ; Ψ ( x ) = i = 1 n x ( i ) 2 {\displaystyle N=N_{1}+N_{2}+\dots +N_{n};\Psi (x)=\sum _{i=1}^{n}\|x^{(i)}\|_{2}} , where x ( i ) R N i {\displaystyle x^{(i)}\in R^{N_{i}}} and 2 {\displaystyle \|\cdot \|_{2}} is the standard Euclidean norm.

Algorithm

Consider the optimization problem

min x R n f ( x ) , {\displaystyle \min _{x\in R^{n}}f(x),}

where f {\displaystyle f} is a convex and smooth function.

Smoothness: By smoothness we mean the following: we assume the gradient of f {\displaystyle f} is coordinate-wise Lipschitz continuous with constants L 1 , L 2 , , L n {\displaystyle L_{1},L_{2},\dots ,L_{n}} . That is, we assume that

| i f ( x + h e i ) i f ( x ) | L i | h | , {\displaystyle |\nabla _{i}f(x+he_{i})-\nabla _{i}f(x)|\leq L_{i}|h|,}

for all x R n {\displaystyle x\in R^{n}} and h R {\displaystyle h\in R} , where i {\displaystyle \nabla _{i}} denotes the partial derivative with respect to variable x ( i ) {\displaystyle x^{(i)}} .

Nesterov, and Richtarik and Takac showed that the following algorithm converges to the optimal point:

Algorithm Random Coordinate Descent Method
    Input: 
  
    
      
        
          x
          
            0
          
        
        
        
          R
          
            n
          
        
      
    
    {\displaystyle x_{0}\in R^{n}}
  
 //starting point
    Output: 
  
    
      
        x
      
    
    {\displaystyle x}
  

    set x := x_0
    for k := 1, ... do
        choose coordinate 
  
    
      
        i
        
        {
        1
        ,
        2
        ,
        
        ,
        n
        }
      
    
    {\displaystyle i\in \{1,2,\dots ,n\}}
  
, uniformly at random
        update 
  
    
      
        
          x
          
            (
            i
            )
          
        
        =
        
          x
          
            (
            i
            )
          
        
        
        
          
            1
            
              L
              
                i
              
            
          
        
        
          
          
            i
          
        
        f
        (
        x
        )
      
    
    {\displaystyle x^{(i)}=x^{(i)}-{\frac {1}{L_{i}}}\nabla _{i}f(x)}
  
 
    end for
  • "←" denotes assignment. For instance, "largestitem" means that the value of largest changes to the value of item.
  • "return" terminates the algorithm and outputs the following value.

Convergence rate

Since the iterates of this algorithm are random vectors, a complexity result would give a bound on the number of iterations needed for the method to output an approximate solution with high probability. It was shown in that if k 2 n R L ( x 0 ) ϵ log ( f ( x 0 ) f ϵ ρ ) {\displaystyle k\geq {\frac {2nR_{L}(x_{0})}{\epsilon }}\log \left({\frac {f(x_{0})-f^{*}}{\epsilon \rho }}\right)} , where R L ( x ) = max y max x X { y x L : f ( y ) f ( x ) } {\displaystyle R_{L}(x)=\max _{y}\max _{x^{*}\in X^{*}}\{\|y-x^{*}\|_{L}:f(y)\leq f(x)\}} , f {\displaystyle f^{*}} is an optimal solution ( f = min x R n { f ( x ) } {\displaystyle f^{*}=\min _{x\in R^{n}}\{f(x)\}} ), ρ ( 0 , 1 ) {\displaystyle \rho \in (0,1)} is a confidence level and ϵ > 0 {\displaystyle \epsilon >0} is target accuracy, then Prob ( f ( x k ) f > ϵ ) ρ {\displaystyle {\text{Prob}}(f(x_{k})-f^{*}>\epsilon )\leq \rho } .

Example on particular function

The following Figure shows how x k {\displaystyle x_{k}} develops during iterations, in principle. The problem is

f ( x ) = 1 2 x T ( 1 0.5 0.5 1 ) x ( 1.5 1.5 ) x , x 0 = ( 0 0 ) T {\displaystyle f(x)={\tfrac {1}{2}}x^{T}\left({\begin{array}{cc}1&0.5\\0.5&1\end{array}}\right)x-\left({\begin{array}{cc}1.5&1.5\end{array}}\right)x,\quad x_{0}=\left({\begin{array}{cc}0&0\end{array}}\right)^{T}}

Extension to block coordinate setting

Blocking coordinate directions into Block coordinate directions

One can naturally extend this algorithm not only just to coordinates, but to blocks of coordinates. Assume that we have space R 5 {\displaystyle R^{5}} . This space has 5 coordinate directions, concretely e 1 = ( 1 , 0 , 0 , 0 , 0 ) T , e 2 = ( 0 , 1 , 0 , 0 , 0 ) T , e 3 = ( 0 , 0 , 1 , 0 , 0 ) T , e 4 = ( 0 , 0 , 0 , 1 , 0 ) T , e 5 = ( 0 , 0 , 0 , 0 , 1 ) T {\displaystyle e_{1}=(1,0,0,0,0)^{T},e_{2}=(0,1,0,0,0)^{T},e_{3}=(0,0,1,0,0)^{T},e_{4}=(0,0,0,1,0)^{T},e_{5}=(0,0,0,0,1)^{T}} in which Random Coordinate Descent Method can move. However, one can group some coordinate directions into blocks and we can have instead of those 5 coordinate directions 3 block coordinate directions (see image).

See also

References

  1. Nesterov, Yurii (2010), "Efficiency of coordinate descent methods on huge-scale optimization problems", SIAM Journal on Optimization, 22 (2): 341–362, CiteSeerX 10.1.1.332.3336, doi:10.1137/100802001
  2. Richtárik, Peter; Takáč, Martin (2011), "Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function", Mathematical Programming, Series A, 144 (1–2): 1–38, arXiv:1107.2848, doi:10.1007/s10107-012-0614-z
Category: