Broyden-Fletcher-Goldfarb-Shanno method (BFGS method)
The unconstrained optimization problem is to minimize a real-valued function f of N variables. That is, to find a local minimizer, i.e. a point x ^ { * } such that
\begin{equation} \tag{a1} f ( x ^ { * } ) \leq f ( x ) \text { for all } x \text{ near } x ^ { * }; \end{equation}
f is typically called the cost function or objective function.
A classic approach for doing this is the method of steepest descent (cf. Steepest descent, method of). The iteration, here described in terms of the transition from a current approximation x _ { c } to a local minimizer x ^ { * }, to an updated (and hopefully better) approximation, is
\begin{equation} \tag{a2} x _ { + } = x _ { c } - \lambda \nabla f ( x _ { c } ). \end{equation}
By elementary calculus, - \nabla f ( x _ { c } ) is the direction of most rapid decrease (steepest descent) in f starting from x _ { c }. The step length \lambda must be a part of the algorithm in order to ensure that f ( x _ { + } ) < f ( x _ { c } ) (which will be the case for all sufficiently small \lambda).
There are several methods for selecting an appropriate \lambda, [a3], [a7], for instance the classical Armijo rule, [a1], in which \lambda = \beta ^ { m } for some \beta \in ( 0,1 ) and where m \geq 0 is the least integer such that the sufficient decrease condition
\begin{equation} \tag{a3} f ( x _ { c } + \lambda d ) \leq f ( x _ { c } ) + \alpha \lambda d ^ { T } \nabla f ( x _ { c } ) \end{equation}
holds. In (a3), - \nabla f ( x _ { c } ) is replaced by a general descent direction d, a vector satisfying d ^ { T } \nabla f ( x _ { c } ) < 0, and a parameter \alpha has been introduced (typically 10 ^ { - 4 }). One might think that simple decrease (f ( x _ { c } + \lambda d ) < f ( x _ { c } )) would suffice, and it usually does in practice. However, the theory requires (a3) to eliminate the possibility of stagnation.
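For concreteness, the following is a minimal sketch of the Armijo backtracking rule in Python/NumPy; the routine names f and grad_f and the defaults beta = 0.5 and alpha = 1e-4 are illustrative choices, not prescribed above.

```python
import numpy as np

def armijo_step(f, grad_f, x_c, d, alpha=1e-4, beta=0.5, max_backtracks=50):
    """Return lambda = beta**m, where m >= 0 is the least integer such that
    the sufficient decrease condition (a3) holds:
        f(x_c + lam*d) <= f(x_c) + alpha*lam*d^T grad f(x_c)."""
    f_c = f(x_c)
    slope = d @ grad_f(x_c)          # d^T grad f(x_c); negative for a descent direction
    lam = 1.0
    for _ in range(max_backtracks):
        if f(x_c + lam * d) <= f_c + alpha * lam * slope:
            return lam
        lam *= beta                  # reduce the step: lambda = beta**m
    raise RuntimeError("no step satisfying (a3) found")
```

With d = - grad_f(x_c), one steepest descent step (a2) is then x_plus = x_c + armijo_step(f, grad_f, x_c, d) * d.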
The steepest descent iteration with the Armijo rule will, if f is smooth and bounded away from - \infty, which is assumed throughout below, produce a sequence \{ x _ { n } \} such that
\begin{equation} \tag{a4} \operatorname { lim } _ { n \rightarrow \infty } \nabla f ( x _ { n } ) = 0. \end{equation}
Hence, any limit point x ^ { * } of the steepest descent sequence satisfies the first-order necessary conditions \nabla f ( x ^ { * } ) = 0. This does not imply convergence to a local minimizer, and even if the steepest descent iterates do converge (as they often do), the convergence is usually very slow in the terminal phase, when the iterates are near x ^ { * }. Scaling the steepest descent direction by multiplying it with the inverse of a symmetric positive-definite scaling matrix H _ { c } can preserve the convergence of \nabla f to zero and dramatically improve the convergence near a minimizer.
Now, if x ^ { * } is a local minimizer that satisfies the standard assumptions, i.e. the Hessian of f, \nabla ^ { 2 } f, is symmetric, positive definite, and Lipschitz continuous near x ^ { * }, then the Newton method,
\begin{equation} \tag{a5} x _ { + } = x _ { c } - ( \nabla ^ { 2 } f ( x _ { c } ) ) ^ { - 1 } \nabla f ( x _ { c } ), \end{equation}
will converge rapidly to the minimizer if the initial iterate x _ { 0 } is sufficiently near x ^ { * }, [a3], [a7]. However, when far from x ^ { * }, the Hessian need not be positive definite and cannot be used as a scaling matrix.
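As an illustration of (a5), here is a minimal Python/NumPy sketch of one Newton iteration; the routines grad_f and hess_f, returning the gradient vector and the Hessian matrix, are assumed names, and the Newton system is solved rather than forming the inverse.

```python
import numpy as np

def newton_step(grad_f, hess_f, x_c):
    """One Newton iteration (a5): x_+ = x_c - (hess f(x_c))^{-1} grad f(x_c).

    Near a minimizer satisfying the standard assumptions the Hessian is
    symmetric positive definite, so the linear solve is well posed there."""
    g = grad_f(x_c)
    H = hess_f(x_c)
    return x_c - np.linalg.solve(H, g)
```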
Quasi-Newton methods (cf. also Quasi-Newton method) update an approximation of \nabla ^ { 2 } f ( x ^ { * } ) as the iteration progresses. In general, the transition from current approximations x _ { c } and H _ { c } of x ^ { * } and \nabla ^ { 2 } f ( x ^ { * } ) to new approximations x_{+} and H _ { + } is given (using a line search paradigm) by:
1) compute a search direction d = - H _ { c } ^ { - 1 } \nabla f ( x _ { c } );
2) find x _ { + } = x _ { c } + \lambda d using a line search to ensure sufficient decrease;
3) use x _ { c }, x_{+}, and H _ { c } to update H _ { c } and obtain H _ { + }. The way in which H _ { + } is computed determines the method.
The BFGS method, [a2], [a4], [a5], [a6], updates H _ { c } with the rank-two formula
\begin{equation} \tag{a6} H _ { + } = H _ { c } + \frac { y y ^ { T } } { y ^ { T } s } - \frac { ( H _ { c } s ) ( H _ { c } s ) ^ { T } } { s ^ { T } H _ { c } s }. \end{equation}
In (a6),
\begin{equation*} s = x _ { + } - x _ { c } , \quad y = \nabla f ( x _ { + } ) - \nabla f ( x _ { c } ). \end{equation*}
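Putting the pieces together, the following is a minimal Python/NumPy sketch of one transition from ( x _ { c } , H _ { c } ) to ( x _ { + } , H _ { + } ) via steps 1)–3) with the update (a6). It reuses the armijo_step sketch above; all names are illustrative, not a reference implementation.

```python
import numpy as np

def bfgs_update(H_c, s, y):
    """Rank-two BFGS update (a6) of the Hessian approximation."""
    Hs = H_c @ s
    return H_c + np.outer(y, y) / (y @ s) - np.outer(Hs, Hs) / (s @ Hs)

def bfgs_iteration(f, grad_f, x_c, H_c):
    """One transition (x_c, H_c) -> (x_+, H_+) in the line search paradigm."""
    d = -np.linalg.solve(H_c, grad_f(x_c))   # step 1): d = -H_c^{-1} grad f(x_c)
    lam = armijo_step(f, grad_f, x_c, d)     # step 2): sufficient decrease (a3)
    x_plus = x_c + lam * d
    s = x_plus - x_c
    y = grad_f(x_plus) - grad_f(x_c)
    H_plus = bfgs_update(H_c, s, y)          # step 3): the update (a6)
    return x_plus, H_plus
```

The update (a6) preserves symmetry and positive definiteness of H _ { c } exactly when y ^ { T } s > 0; practical codes skip the update, or use a line search that enforces this curvature condition, when it fails.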
Local and global convergence theory.
The local convergence result, [a8], [a3], [a7], assumes that the standard assumptions hold and that the initial data is sufficiently good (i.e., x _ { 0 } is near to x ^ { * } and H _ { 0 } is near \nabla ^ { 2 } f ( x ^ { * } )). Under these assumptions the BFGS iterates exist and converge q-superlinearly to x ^ { * }, i.e.,
\begin{equation} \tag{a7} \operatorname { lim } _ { n \rightarrow \infty } \frac { \left\| x _ { n + 1} - x ^ { * } \right\| } { \left\| x _ { n } - x ^ { * } \right\| } = 0. \end{equation}
By a global convergence theory one means that one does not assume that the initial data is good and one manages the poor data by making certain that f is decreased as the iteration progresses. There are several variants of global convergence theorems for BFGS and related methods, [a9], [a11], [a10], [a12]. All these results make strong assumptions on the function and some require line search methods more complicated than the simple Armijo rule discussed above.
The result from [a9] is the most general; a special case illustrating the idea now follows. One must assume that the set
\begin{equation*} D = \{ x : f ( x ) \leq f ( x _ { 0 } ) \} \end{equation*}
is convex, f is Lipschitz twice continuously differentiable in D, and that the Hessian and its inverse are uniformly bounded and positive definite in D. This assumption implies that f has a unique minimizer x ^ { * } in D and that the standard assumptions hold. If H _ { 0 } is symmetric and positive definite, the BFGS iterations, using the Armijo rule to enforce (a3), converge q-superlinearly to x ^ { * }.
Implementation.
The recursive implementation from [a7], described below, stores the history of the iteration and uses that information recursively to compute the action of H _ { k } ^{- 1} on a vector. This idea was suggested in [a13], [a14], [a17]. Other implementations may be found in [a15], [a3], [a7], and [a16].
One can easily show that
\begin{equation} \tag{a8} H _ { + } ^ { - 1 } = \left( I - \frac { s y ^ { T } } { y ^ { T } s } \right) H _ { c } ^ { - 1 } \left( I - \frac { y s ^ { T } } { y ^ { T } s } \right) + \frac { s s ^ { T } } { y ^ { T } s }. \end{equation}
The algorithm \operatorname{bfgsrec} below overwrites a given vector d with H _ { n}^{ - 1} d. The storage needed is one vector for d and 2 n vectors for the sequences \{ s _ { k } , y _ { k } \} _ { k = 0 } ^ { n - 1 }. A method for computing the product of H _ { 0 } ^ { - 1 } and a vector must also be provided.
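A minimal Python/NumPy sketch of such a recursion, derived directly from (a8), is given below; the function and argument names, in particular h0inv_prod for the product with H _ { 0 } ^ { - 1 }, are illustrative assumptions.

```python
import numpy as np

def bfgsrec(n, s_list, y_list, h0inv_prod, d):
    """Return H_n^{-1} d using the recursion implied by (a8).

    s_list[k], y_list[k] hold s_k, y_k for k = 0, ..., n-1; h0inv_prod(v)
    applies H_0^{-1} to a vector v.  The recursion peels off the most
    recent update, applies H_{n-1}^{-1} recursively, and reassembles (a8)."""
    if n == 0:
        return h0inv_prod(d)
    s, y = s_list[n - 1], y_list[n - 1]
    rho = 1.0 / (y @ s)
    alpha = rho * (s @ d)
    d = bfgsrec(n - 1, s_list, y_list, h0inv_prod, d - alpha * y)
    return d + (alpha - rho * (y @ d)) * s
```

Only the 2 n pairs \{ s _ { k } , y _ { k } \} and the work vector d are stored, as noted above; limited-memory variants, which retain only a few of the most recent pairs, are treated in [a15] and [a17].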
References
[a1] L. Armijo, "Minimization of functions having Lipschitz-continuous first partial derivatives", Pacific J. Math., 16 (1966), pp. 1–3
[a2] C.G. Broyden, "A new double-rank minimization algorithm", Notices Amer. Math. Soc., 16 (1969), pp. 670
[a3] J.E. Dennis, R.B. Schnabel, "Numerical Methods for Nonlinear Equations and Unconstrained Optimization", Classics in Applied Math. 16, SIAM (Soc. Industrial Applied Math.) (1996)
[a4] R. Fletcher, "A new approach to variable metric methods", Comput. J., 13 (1970), pp. 317–322
[a5] D. Goldfarb, "A family of variable metric methods derived by variational means", Math. Comp., 24 (1970), pp. 23–26
[a6] D.F. Shanno, "Conditioning of quasi-Newton methods for function minimization", Math. Comp., 24 (1970), pp. 647–657
[a7] C.T. Kelley, "Iterative Methods for Optimization", Frontiers in Appl. Math. 18, SIAM (Soc. Industrial Applied Math.) (1999)
[a8] C.G. Broyden, J.E. Dennis, J.J. Moré, "On the local and superlinear convergence of quasi-Newton methods", J. Inst. Math. Appl., 12 (1973), pp. 223–246
[a9] R.H. Byrd, J. Nocedal, "A tool for the analysis of quasi-Newton methods with application to unconstrained minimization", SIAM J. Numer. Anal., 26 (1989), pp. 727–739
[a10] R.H. Byrd, J. Nocedal, Y. Yuan, "Global convergence of a class of quasi-Newton methods on convex problems", SIAM J. Numer. Anal., 24 (1987), pp. 1171–1190
[a11] M.J.D. Powell, "Some global convergence properties of a variable metric algorithm without exact line searches", in R. Cottle, C. Lemke (eds.), Nonlinear Programming, Amer. Math. Soc. (1976), pp. 53–72
[a12] J. Werner, "Über die globale Konvergenz von Variable-Metric Verfahren mit nichtexakter Schrittweitenbestimmung", Numer. Math., 31 (1978), pp. 321–334
[a13] K.J. Bathe, A.P. Cimento, "Some practical procedures for the solution of nonlinear finite element equations", Comput. Meth. Appl. Mech. Eng., 22 (1980), pp. 59–85
[a14] H. Matthies, G. Strang, "The solution of nonlinear finite element equations", Internat. J. Numerical Methods Eng., 14 (1979), pp. 1613–1626
[a15] R.H. Byrd, J. Nocedal, R.B. Schnabel, "Representation of quasi-Newton matrices and their use in limited memory methods", Math. Progr., 63 (1994), pp. 129–156
[a16] J.L. Nazareth, "Conjugate gradient methods less dependent on conjugacy", SIAM Review, 28 (1986), pp. 501–512
[a17] J. Nocedal, "Updating quasi-Newton matrices with limited storage", Math. Comp., 35 (1980), pp. 773–782
Broyden-Fletcher-Goldfarb-Shanno method. Encyclopedia of Mathematics. URL: http://encyclopediaofmath.org/index.php?title=Broyden-Fletcher-Goldfarb-Shanno_method&oldid=55449