Quasi-Newton method
The Newton method for solving a system of non-linear equations $F ( x ) = 0$ is defined by the iteration
\begin{equation*} x ^ { k + 1 } = x ^ { k } - [ D F ( x ^ { k } ) ] ^ { - 1 } F ( x ^ { k } ), \end{equation*}
\begin{equation*} k = 0,1 , \dots , \end{equation*}
where $x ^ { 0 } \in \mathbf{R} ^ { n}$ is a starting point and $D F$ denotes the Jacobi matrix of $F$. In fact, the point $x ^ { k + 1 }$ solves the linearized equation at $x ^ { k }$: $F ( x ^ { k } ) + D F ( x ^ { k } ) ( x - x ^ { k } ) = 0$. The method can be applied to find a local minimizer of a smooth real-valued function $f$ by searching for the stationary points of $f$, i.e. the zeros of $F = D ^ { T } f$. The advantage of this method is that it works extremely well if the starting point $x ^ { 0 }$ is chosen sufficiently close to a local minimizer $x ^ { * }$ with positive-definite Hessian $D ^ { 2 } f ( x ^ { \color{blue}* } ) = D ( D ^ { T } f ( x ^ {\color{blue } * } ) )$. The method even exhibits quadratic convergence in this case, if $D ^ { 2 } f$ depends Lipschitzian on $x$ (cf. also Lipschitz condition).
Apart from the lack of global convergence of Newton's method, the use of Hessian matrices is a disadvantage, since explicit expressions for second-order derivatives are often hard to obtain. Even if Hessians are available, the solution of the corresponding linear system $D ^ { 2 } f ( x ^ { k } ) \cdot d = - D ^ { T } f ( x ^ { k } )$ for the determination of the displacement $d ^ { k }$ in each step might be numerically expensive.
One is led to methods which use derivatives of at most order one, so-called gradient methods (cf. also Gradient method). A wide class of these methods is included in the following modified Newton algorithm:
1) Choose a starting point $x ^ { 0 } \in \mathbf{R} ^ { n}$ and a symmetric positive-definite matrix $H _ { 0 }$ (e.g. the identity matrix).
2) Put $d ^ { k } = - H _ { k } D ^ { T } f ( x ^ { k } )$.
3) Determine a step length $\alpha _ { k } > 0$ by approximate minimization of the function $\alpha \mapsto f ( x ^ { k } + \alpha d ^ { k } )$ and put $x ^ { k + 1 } = x ^ { k } + \alpha _ { k } d ^ { k }$.
4) Generate a symmetric positive-definite matrix $H _ { k+1 } $, put $ k = k + 1$ and go to step 2).
The remainder of this article focuses on the choice of the matrices $H _ { k }$.
First note that the search direction $d ^ { k }$ is the negative gradient of $f$ with respect to the scalar product induced by $H _ { k } ^{- 1}$ at the point $x ^ { k }$. In fact, each symmetric positive-definite $( n \times n )$-matrix $R$ defines a scalar product $\langle x , y \rangle _ { R } = x ^ { T } R y$ on ${\bf R} ^ { n }$, and the gradient of $f$ with respect to $R$ is the vector which solves $\langle \operatorname { grad } _ { R } f ( x ) , v \rangle _ { R } = D f ( x ) . v$ for all $v \in {\bf R} ^ { n }$. It follows that $\operatorname{grad}_R f ( x ) = R ^ { - 1 } D ^ { T } f ( x )$. If $R$ depends on $x$, then it is called a Riemannian metric or variable metric. Since the search direction in step 2 can obviously be rewritten as
\begin{equation*} d ^ { k } = - \operatorname { grad } _ { H _ { k } ^ { - 1 } } f ( x ^ { k } ), \end{equation*}
methods of the considered type are also called variable metric methods ([a4]). Due to the relation
\begin{equation*} \frac { d } { d \alpha } f ( x ^ { k } + \alpha d ^ { k } ) | _ { \alpha = 0 } = D f ( x ^ { k } ) d ^ { k } = \end{equation*}
\begin{equation*} = - D f ( x ^ { k } ) H _ { k } D ^ { T } f ( x ^ { k } ) < 0, \end{equation*}
the vector $d ^ { k }$ is always a descent direction for $f$, if $D f ( x ^ { k } ) \neq 0$ (cf. also Descent, method of). In particular, the step-length $\alpha _ { k }$ can be chosen to be strictly positive.
A quasi-Newton method is generated if in step 4) of the modified Newton algorithm the matrix $H _ { k+1 } $ satisfies the quasi-Newton condition (or secant equation) $H _ { k + 1 } y ^ { k } = s ^ { k }$, where $y ^ { k } = D ^ { T } f ( x ^ { k + 1 } ) - D ^ { T } f ( x ^ { k } )$ and $s ^ { k } = x ^ { k + 1 } - x ^ { k }$. In order to obtain the matrix $H _ { k+1 } $ in a numerically efficient way, it is assumed to be a symmetric rank-$1$ or rank-$2$ update of $H _ { k }$:
\begin{equation*} H _ { k + 1 } = H _ { k } + \beta _ { k } u ^ { k } ( u ^ { k } ) ^ { T } + \gamma _ { k } v ^ { k } ( v ^ { k } ) ^ { T } \end{equation*}
with scalars $\beta _ { k }$, $\gamma _ { k }$ and vectors $u ^ { k }$, $v ^ { k }$. An important class of update formulas is given by the Broyden family [a2]. Putting $H = H _ { k }$, $H _ { \text{new} } = H _ { k + 1 }$, etc., it can be written in the form
\begin{equation*} H _ { \operatorname {new} } = H - \frac { H y y ^ { T } H } { y ^ { T } H y } + \frac { s s ^ { T } } { s ^ { T } y } + \phi \cdot w v ^ { T }, \end{equation*}
where $\phi = \phi ^ { k }$ is an additional parameter and
\begin{equation*} v = \sqrt { y ^ { T } H y } \left( \frac { s } { s ^ { T } y } - \frac { H y } { y ^ { T } H y } \right). \end{equation*}
For $\phi = s ^ { T } y ( s ^ { T } y - y ^ { T } H y ) ^ { - 1 }$, one obtains the symmetric rank-$1$ update SR1 ([a1], [a7]), whereas the choices $\phi = 0$ and $\phi = 1$ correspond to the DFP formula ([a4], [a9]) and to the BFGS formula ([a3], [a6], [a11], [a14]; cf. also Broyden–Fletcher–Goldfarb–Shanno method), respectively. The DFP and BFGS formulas are dual in the following sense. The Sherman–Morrison–Woodbury formula implies that the update for the matrix inverse $B = H ^ { - 1 }$ takes the form
\begin{equation*} B _ { \operatorname{new} } = B - \frac { B s s ^ { T } B } { s ^ { T } B s } + \frac { y y ^ { T } } { y ^ { T } s } + \theta . w w ^ { T }, \end{equation*}
where $\theta = \theta ^ { k }$ is now the corresponding parameter and
\begin{equation*} w = \sqrt { s ^ { T } B s } \left( \frac { y } { y ^ { T } s } - \frac { B s } { s ^ { T } B s } \right). \end{equation*}
For given $\phi$, the parameter $\theta$ becomes ([a7])
\begin{equation*} \theta = \frac { \phi - 1 } { \phi - 1 - \phi \mu }, \end{equation*}
with
\begin{equation*} \mu = \frac { y ^ { T } H y . s ^ { T } B s } { ( s ^ { T } y ) ^ { 2 } }. \end{equation*}
Consequently, the DFP and BFGS updates correspond now to the parameter values $\theta = 1$ and $\theta = 0$, respectively. Here, duality means that the BFGS update for $H$ is obtained from the DFP update for $B$ by interchanging $H$ with $B$ and $s$ with $y$, respectively. The SR1 formula is self-dual in this sense.
It is clear that all updates from the Broyden family inherit symmetry of the matrix $H$. Each formula with $\phi \in [ 0,1 ]$, i.e. a formula from the so-called convex class, inherits also positive definiteness of $H$. This holds in particular for the DFP and BFGS updates. The SR1 formula is not in the convex class and might give rise to indefinite or singular matrices $H_{\text{new}}$.
If, for any update formula from the Broyden family, the matrices $H _ { k }$ remain non-singular, then the following properties can be derived (for quadratic $f$): The algorithm terminates at the solution $x ^ { * }$ after at most $n$ iterations, and $B _ { n } = H _ { n } ^ { - 1 } = D ^ { 2 } f ( x ^ { * } )$. The search directions are conjugate, i.e. orthogonal with respect to the scalar product $\langle \cdot , \cdot \rangle _ { D ^{ 2} f ( x ^ { * } )}$, and they are equivalent to the directions from the conjugate-gradient method (cf. Conjugate gradients, method of; [a10]) if $H _ { 0 }$ is the identity matrix.
If the line searches are carried out exactly, then all methods from the Broyden family generate the same sequence of iterates $( x ^ { k } ) _ { k \in \mathbf{N} }$ ([a5]). However, implementation of inaccurate line searches leads to extreme differences between the methods. In particular, the BFGS update behaves less sensitively with regard to inexact line searches than the DFP update. Moreover, results in [a13] indicate that DFP reduces large eigenvalues of $B _ { k }$ much slower than BFGS.
Nevertheless, for the case that $D ^ { 2 } f ( x ^ { * } )$ is positive definite and that $D ^ { 2 } f$ is locally Lipschitz continuous, all updates from the Broyden family with exact line searches give rise to superlinear convergence, if $x ^ { 0 }$ and $H _ { 0 }$ are chosen sufficiently close to $x ^ { * }$ and $D ^ { 2 } f ( x ^ { * } )$, respectively. The DFP and BFGS methods even converge superlinearly if the Armijo rule is implemented for inexact line searches. For more details on DFP and BFGS see [a8].
The Broyden family is contained in the larger Oren–Luenberger class ([a12]) of quasi-Newton methods. This class includes, in particular, the self-scaling variable metric algorithms (SSVM algorithms), which share most properties of the Broyden family and automatically compensate for poor scaling of the objective function.
References
[a1] | C.G. Broyden, "A class of methods for solving non-linear simultaneous equations" Math. Comput. , 19 (1965) pp. 577–593 |
[a2] | C.G. Broyden, "Quasi-Newton methods and their application to function minimisation" Math. Comput. , 21 (1967) pp. 368–381 |
[a3] | C.G. Broyden, "The convergence of a class of double-rank minimization algorithms; the new algorithm" J. Inst. Math. Appl. , 6 (1970) pp. 222–231 |
[a4] | W.C. Davidon, "Variable metric method for minimisation" SIAM J. Optim. , 1 (1991) pp. 1–17 (Also: Argonne Nat. Lab. Report ANL-5990 (rev.) (1959)) |
[a5] | L.C.W. Dixon, "Quasi-Newton algorithms generate identical points" Math. Progr. , 2 (1972) pp. 383–387 |
[a6] | R. Fletcher, "A new approach for variable metric algorithms" Comput. J. , 13 (1970) pp. 317–322 |
[a7] | R. Fletcher, "Practical methods of optimization" , Wiley (1986) (Edition: Second) |
[a8] | R. Fletcher, "An overview of unconstrained optimization" Dundee Numerical Analysis Report , NA/149 (1993) |
[a9] | R. Fletcher, M.J.D. Powell, "A rapidly convergent descent method for minimisation" Comput. J. , 6 (1963) pp. 163–168 |
[a10] | R. Fletcher, C.M. Reeves, "Function minimization by conjugate gradients" Comput. J. , 7 (1964) pp. 149–154 |
[a11] | D. Goldfarb, "A family of variable metric methods derived by variational means" Math. Comput. , 24 (1970) pp. 23–26 |
[a12] | S.S. Oren, D.G. Luenberger, "Self-scaling variable metric (SSVM) algorithms I" Management Science , 20 (1974) pp. 845–862 |
[a13] | M.J.D. Powell, "How bad are the BFGS and DFP methods when the objective function is quadratic?" Math. Progr. , 34 (1987) pp. 34–47 |
[a14] | D.F. Shanno, "Conditioning of quasi-Newton methods for function minimization" Math. Comput. , 24 (1970) pp. 647–656 |
Quasi-Newton method. Encyclopedia of Mathematics. URL: http://encyclopediaofmath.org/index.php?title=Quasi-Newton_method&oldid=55412