Namespaces
Variants
Actions

Broyden-Fletcher-Goldfarb-Shanno method

From Encyclopedia of Mathematics
Revision as of 16:55, 1 July 2020 by Maximilian Janisch (talk | contribs) (AUTOMATIC EDIT (latexlist): Replaced 93 formulas out of 93 by TEX code with an average confidence of 2.0 and a minimal confidence of 2.0.)
Jump to: navigation, search
The printable version is no longer supported and may have rendering errors. Please update your browser bookmarks and please use the default browser print function instead.

BFGS method

The unconstrained optimization problem is to minimize a real-valued function $f$ of $N$ variables. That is, to find a local minimizer, i.e. a point $x ^ { * }$ such that

\begin{equation} \tag{a1} f ( x ^ { * } ) \leq f ( x ) \text { for all } x \text{ near } x ^ { * }; \end{equation}

$f$ is typically called the cost function or objective function.

A classic approach for doing this is the method of steepest descent (cf. Steepest descent, method of). The iteration, here described in terms of the transition from a current approximation $x _ { c }$ to a local minimizer $x ^ { * }$, to an update (and hopefully better) approximation is

\begin{equation} \tag{a2} x _ { + } = x _ { c } - \lambda \nabla f ( x _ { c } ). \end{equation}

By elementary calculus, $- \nabla f ( x _ { c } )$ is the direction of most rapid decrease (steepest descent) in $f$ starting from $x _ { c }$. The step length $\lambda$ must be a part of the algorithm in order to ensure that $f ( x _ { + } ) < f ( x _ { c } )$ (which must be so for a sufficiently small $\lambda$).

There are several methods for selecting an appropriate $\lambda$, [a3], [a7], for instance the classical Armijo rule, [a1], in which $\lambda = \beta ^ { m }$ for some $\beta \in ( 0,1 )$ and where $m \geq 0$ is the least integer such that the sufficient decrease condition

\begin{equation} \tag{a3} f ( x _ { c } + \lambda d ) \leq f ( x _ { c } ) + \alpha \lambda d ^ { T } \nabla f ( x _ { c } ) \end{equation}

holds. In (a3), $- \nabla f ( x _ { c } )$ is replaced by a general descent direction $d$, a vector satisfying $d ^ { T } \nabla f ( x _ { c } ) < 0$, and a parameter $\alpha$ has been introduced (typically $10 ^ { - 4 }$). One might think that simple decrease ($f ( x _ { c } + \lambda d ) < f ( x _ { c } )$) would suffice, and it usually does in practice. However, the theory requires (a3) to eliminate the possibility of stagnation.

The steepest descent iteration with the Armijo rule will, if $f$ is smooth and bounded away from $- \infty$, which is assumed throughout below, produce a sequence $\{ x _ { n } \}$ such that

\begin{equation} \tag{a4} \operatorname { lim } _ { n \rightarrow \infty } \nabla f ( x _ { n } ) = 0. \end{equation}

Hence, any limit point $x ^ { * }$ of the steepest descent sequence satisfies the first-order necessary conditions $\nabla f ( x ^ { * } ) = 0$. This does not imply convergence to a local minimizer and even if the steepest descent iterations do converge, and they often do, the convergence is usually very slow in the terminal phase when the iterates are near $x ^ { * }$. Scaling the steepest descent direction by multiplying it with the inverse of a symmetric positive-definite scaling matrix $H _ { c }$ can preserve the convergence of $\nabla f$ to zero and dramatically improve the convergence near a minimizer.

Now, if $x ^ { * }$ is a local minimizer that satisfies the standard assumptions, i.e. the Hessian of $f$, $\nabla ^ { 2 } f$, is symmetric, positive definite, and Lipschitz continuous near $x ^ { * }$, then the Newton method,

\begin{equation} \tag{a5} x _ { + } = x _ { c } - ( \nabla ^ { 2 } f ( x _ { c } ) ) ^ { - 1 } \nabla f ( x _ { c } ), \end{equation}

will converge rapidly to the minimizer if the initial iterate $x _ { 0 }$ is sufficiently near $x ^ { * }$, [a3], [a7]. However, when far from $x ^ { * }$, the Hessian need not be positive definite and cannot be used as a scaling matrix.

Quasi-Newton methods (cf. also Quasi-Newton method) update an approximation of $\nabla ^ { 2 } f ( x ^ { * } )$ as the iteration progresses. In general, the transition from current approximations $x _ { c }$ and $H _ { c }$ of $x ^ { * }$ and $\nabla f ( x ^ { * } )$ to new approximations $x_{+}$ and $H _ { + }$ is given (using a line search paradigm) by:

1) compute a search direction $d = - H _ { c } ^ { - 1 } \nabla f ( x _ { c } )$;

2) find $x _ { + } = x _ { c } + \lambda d$ using a line search to ensure sufficient decrease;

3) use $x _ { c }$, $x_{+}$, and $H _ { c }$ to update $H _ { c }$ and obtain $H _ { + }$. The way in which $H _ { + }$ is computed determines the method.

The BFGS method, [a2], [a4], [a5], [a6], updates $H _ { c }$ with the rank-two formula

\begin{equation} \tag{a6} H _ { + } = H _ { c } + \frac { y y ^ { T } } { y ^ { T } s } - \frac { ( H _ { c } s ) ( H _ { c } s ) ^ { T } } { s ^ { T } H _ { c } s }. \end{equation}

In (a6),

\begin{equation*} s = x _ { + } - x _ { c } , \quad y = \nabla f ( x _ { + } ) - \nabla f ( x _ { c } ). \end{equation*}

Local and global convergence theory.

The local convergence result, [a8], [a3], [a7], assumes that the standard assumptions hold and that the initial data is sufficiently good (i.e., $x _ { 0 }$ is near to $x ^ { * }$ and $H _ { 0 }$ is near $\nabla ^ { 2 } f ( x ^ { * } )$). Under these assumptions the BFGS iterates exist and converge $q$-superlinearly to $x ^ { * }$, i.e.,

\begin{equation} \tag{a7} \operatorname { lim } _ { n \rightarrow \infty } \frac { \left\| x _ { n + 1} - x ^ { * } \right\| } { \left\| x _ { n } - x ^ { * } \right\| } = 0. \end{equation}

By a global convergence theory one means that one does not assume that the initial data is good and one manages the poor data by making certain that $f$ is decreased as the iteration progresses. There are several variants of global convergence theorems for BFGS and related methods, [a9], [a11], [a10], [a12]. All these results make strong assumptions on the function and some require line search methods more complicated that the simple Armijo rule discussed above.

The result from [a9] is the most general and a special case illustrating the idea now follows. One must assume that the set

\begin{equation*} D = \{ x : f ( x ) \leq f ( x _ { 0 } ) \} \end{equation*}

is convex, $f$ is Lipschitz twice continuously differentiable in $D$, and that the Hessian and its inverse are uniformly bounded and positive definite in $D$. This assumption implies that $f$ has a unique minimizer $x ^ { * }$ in $D$ and that the standard assumptions hold. If $H _ { 0 }$ is symmetric and positive definite, the BFGS iterations, using the Armijo rule to enforce (a3), converge $q$-superlinearly to $x ^ { * }$.

Implementation.

The recursive implementation from [a7], described below, stores the history of the iteration and uses that information recursively to compute the action of $H _ { k } ^{- 1}$ on a vector. This idea was suggested in [a13], [a14], [a17]. Other implementations may be found in [a15], [a3], [a7], and [a16].

One can easily show that

\begin{equation} \tag{a8} H _ { + } ^ { - 1 } = \left( I - \frac { s y ^ { T } } { y ^ { T } s } \right) H _ { c } ^ { - 1 } \left( I - \frac { y s ^ { T } } { y ^ { T } s } \right) + \frac { s s ^ { T } } { y ^ { T } s }. \end{equation}

The algorithm $\operatorname{bfgsrec}$ below overwrites a given vector $d$ with $H _ { n}^{ - 1} d$. The storage needed is one vector for $d$ and $2 n$ vectors for the sequences $\{ s _ { k } , y _ { k } \} _ { k = 0 } ^ { n - 1 }$. A method for computing the product of $H _ { 0 } ^ { - 1 }$ and a vector must also be provided.

'

1 IF $n = 0$, $d = H _ { 0 } ^ { - 1 } d$; return
2 $\alpha = s _ { n-1 } ^ { T } d / y _ { n-1 } ^ { T }s$; $d = d - \alpha y _ { n - 1}$
call $ \operatorname{bfgsrec} ( n - 1 , \{ s _ { k } \} , \{ y _ { k } \} , H _ { 0 } ^ { - 1 } , d )$;
3 $d = d + ( \alpha - ( y _ { n-1 } ^ { T } { d } / y _ { n - 1 } ^ { T } s _ { n - 1 } ) s _ { n - 1 }$.

References

[a1] L. Armijo, "Minimization of functions having Lipschitz-continuous first partial derivatives" Pacific J. Math. , 16 (1966) pp. 1–3
[a2] C.G. Broyden, "A new double-rank minimization algorithm" Notices Amer. Math. Soc. , 16 (1969) pp. 670
[a3] J.E. Dennis, R.B. Schnabel, "Numerical Methods for Nonlinear Equations and Unconstrained Optimization" , Classics in Applied Math. : 16 , SIAM (Soc. Industrial Applied Math.) (1996)
[a4] R. Fletcher, "A new approach to variable metric methods" Comput. J. , 13 (1970) pp. 317–322
[a5] D. Goldfarb, "A family of variable metric methods derived by variational means" Math. Comp. , 24 (1970) pp. 23–26
[a6] D.F. Shanno, "Conditioning of quasi-Newton methods for function minimization" Math. Comp. , 24 (1970) pp. 647–657
[a7] C.T. Kelley, "Iterative methods for optimization" , Frontiers in Appl. Math. , 18 , SIAM (Soc. Industrial Applied Math.) (1999)
[a8] C.G. Broyden, J.E. Dennis, J.J. Moré, "On the local and superlinear convergence of quasi-Newton methods" J. Inst. Math. Appl. , 12 (1973) pp. 223–246
[a9] R.H. Byrd, J. Nocedal, "A tool for the analysis of quasi-Newton methods with application to unconstrained minimization" SIAM J. Numer. Anal. , 26 (1989) pp. 727–739
[a10] R.H. Byrd, J. Nocedal, Y. Yuan, "Global convergence of a class of quasi-Newton methods on convex problems" SIAM J. Numer. Anal. , 24 (1987) pp. 1171–1190
[a11] M.J.D. Powell, "Some global convergence properties of a variable metric algorithm without exact line searches" R. Cottle (ed.) C. Lemke (ed.) , Nonlinear Programming , Amer. Math. Soc. (1976) pp. 53–72
[a12] J. Werner, "Über die globale konvergenz von Variable-Metric Verfahren mit nichtexakter Schrittweitenbestimmung" Numer. Math. , 31 (1978) pp. 321–334
[a13] K.J. Bathe, A.P. Cimento, "Some practical procedures for the solution of nonlinear finite element equations" Comput. Meth. Appl. Mech. Eng. , 22 (1980) pp. 59–85
[a14] H. Matthies, G. Strang, "The solution of nonlinear finite element equations" Internat. J. Numerical Methods Eng. , 14 (1979) pp. 1613–1626
[a15] R.H. Byrd, J. Nocedal, R.B. Schnabel, "Representation of quasi-Newton matrices and their use in limited memory methods" Math. Progr. , 63 (1994) pp. 129–156
[a16] J.L. Nazareth, "Conjugate gradient methods less dependent on conjugacy" SIAM Review , 28 (1986) pp. 501–512
[a17] J. Nocedal, "Updating quasi-Newton matrices with limited storage" Math. Comp. , 35 (1980) pp. 773–782
How to Cite This Entry:
Broyden-Fletcher-Goldfarb-Shanno method. Encyclopedia of Mathematics. URL: http://encyclopediaofmath.org/index.php?title=Broyden-Fletcher-Goldfarb-Shanno_method&oldid=50094
This article was adapted from an original article by C.T. Kelley (originator), which appeared in Encyclopedia of Mathematics - ISBN 1402006098. See original article