BroydenFletcherGoldfarbShanno method
BFGS method
The unconstrained optimization problem is to minimize a realvalued function $f$ of $N$ variables. That is, to find a local minimizer, i.e. a point $x ^ { * }$ such that
\begin{equation} \tag{a1} f ( x ^ { * } ) \leq f ( x ) \text { for all } x \text{ near } x ^ { * }; \end{equation}
$f$ is typically called the cost function or objective function.
A classic approach for doing this is the method of steepest descent (cf. Steepest descent, method of). The iteration, here described in terms of the transition from a current approximation $x _ { c }$ to a local minimizer $x ^ { * }$, to an update (and hopefully better) approximation is
\begin{equation} \tag{a2} x _ { + } = x _ { c }  \lambda \nabla f ( x _ { c } ). \end{equation}
By elementary calculus, $ \nabla f ( x _ { c } )$ is the direction of most rapid decrease (steepest descent) in $f$ starting from $x _ { c }$. The step length $\lambda$ must be a part of the algorithm in order to ensure that $f ( x _ { + } ) < f ( x _ { c } )$ (which must be so for a sufficiently small $\lambda$).
There are several methods for selecting an appropriate $\lambda$, [a3], [a7], for instance the classical Armijo rule, [a1], in which $\lambda = \beta ^ { m }$ for some $\beta \in ( 0,1 )$ and where $m \geq 0$ is the least integer such that the sufficient decrease condition
\begin{equation} \tag{a3} f ( x _ { c } + \lambda d ) \leq f ( x _ { c } ) + \alpha \lambda d ^ { T } \nabla f ( x _ { c } ) \end{equation}
holds. In (a3), $ \nabla f ( x _ { c } )$ is replaced by a general descent direction $d$, a vector satisfying $d ^ { T } \nabla f ( x _ { c } ) < 0$, and a parameter $\alpha$ has been introduced (typically $10 ^ {  4 }$). One might think that simple decrease ($f ( x _ { c } + \lambda d ) < f ( x _ { c } )$) would suffice, and it usually does in practice. However, the theory requires (a3) to eliminate the possibility of stagnation.
The steepest descent iteration with the Armijo rule will, if $f$ is smooth and bounded away from $ \infty$, which is assumed throughout below, produce a sequence $\{ x _ { n } \}$ such that
\begin{equation} \tag{a4} \operatorname { lim } _ { n \rightarrow \infty } \nabla f ( x _ { n } ) = 0. \end{equation}
Hence, any limit point $x ^ { * }$ of the steepest descent sequence satisfies the firstorder necessary conditions $\nabla f ( x ^ { * } ) = 0$. This does not imply convergence to a local minimizer and even if the steepest descent iterations do converge, and they often do, the convergence is usually very slow in the terminal phase when the iterates are near $x ^ { * }$. Scaling the steepest descent direction by multiplying it with the inverse of a symmetric positivedefinite scaling matrix $H _ { c }$ can preserve the convergence of $\nabla f$ to zero and dramatically improve the convergence near a minimizer.
Now, if $x ^ { * }$ is a local minimizer that satisfies the standard assumptions, i.e. the Hessian of $f$, $\nabla ^ { 2 } f$, is symmetric, positive definite, and Lipschitz continuous near $x ^ { * }$, then the Newton method,
\begin{equation} \tag{a5} x _ { + } = x _ { c }  ( \nabla ^ { 2 } f ( x _ { c } ) ) ^ {  1 } \nabla f ( x _ { c } ), \end{equation}
will converge rapidly to the minimizer if the initial iterate $x _ { 0 }$ is sufficiently near $x ^ { * }$, [a3], [a7]. However, when far from $x ^ { * }$, the Hessian need not be positive definite and cannot be used as a scaling matrix.
QuasiNewton methods (cf. also QuasiNewton method) update an approximation of $\nabla ^ { 2 } f ( x ^ { * } )$ as the iteration progresses. In general, the transition from current approximations $x _ { c }$ and $H _ { c }$ of $x ^ { * }$ and $\nabla f ( x ^ { * } )$ to new approximations $x_{+}$ and $H _ { + }$ is given (using a line search paradigm) by:
1) compute a search direction $d =  H _ { c } ^ {  1 } \nabla f ( x _ { c } )$;
2) find $x _ { + } = x _ { c } + \lambda d$ using a line search to ensure sufficient decrease;
3) use $x _ { c }$, $x_{+}$, and $H _ { c }$ to update $H _ { c }$ and obtain $H _ { + }$. The way in which $H _ { + }$ is computed determines the method.
The BFGS method, [a2], [a4], [a5], [a6], updates $H _ { c }$ with the ranktwo formula
\begin{equation} \tag{a6} H _ { + } = H _ { c } + \frac { y y ^ { T } } { y ^ { T } s }  \frac { ( H _ { c } s ) ( H _ { c } s ) ^ { T } } { s ^ { T } H _ { c } s }. \end{equation}
In (a6),
\begin{equation*} s = x _ { + }  x _ { c } , \quad y = \nabla f ( x _ { + } )  \nabla f ( x _ { c } ). \end{equation*}
Local and global convergence theory.
The local convergence result, [a8], [a3], [a7], assumes that the standard assumptions hold and that the initial data is sufficiently good (i.e., $x _ { 0 }$ is near to $x ^ { * }$ and $H _ { 0 }$ is near $\nabla ^ { 2 } f ( x ^ { * } )$). Under these assumptions the BFGS iterates exist and converge $q$superlinearly to $x ^ { * }$, i.e.,
\begin{equation} \tag{a7} \operatorname { lim } _ { n \rightarrow \infty } \frac { \left\ x _ { n + 1}  x ^ { * } \right\ } { \left\ x _ { n }  x ^ { * } \right\ } = 0. \end{equation}
By a global convergence theory one means that one does not assume that the initial data is good and one manages the poor data by making certain that $f$ is decreased as the iteration progresses. There are several variants of global convergence theorems for BFGS and related methods, [a9], [a11], [a10], [a12]. All these results make strong assumptions on the function and some require line search methods more complicated that the simple Armijo rule discussed above.
The result from [a9] is the most general and a special case illustrating the idea now follows. One must assume that the set
\begin{equation*} D = \{ x : f ( x ) \leq f ( x _ { 0 } ) \} \end{equation*}
is convex, $f$ is Lipschitz twice continuously differentiable in $D$, and that the Hessian and its inverse are uniformly bounded and positive definite in $D$. This assumption implies that $f$ has a unique minimizer $x ^ { * }$ in $D$ and that the standard assumptions hold. If $H _ { 0 }$ is symmetric and positive definite, the BFGS iterations, using the Armijo rule to enforce (a3), converge $q$superlinearly to $x ^ { * }$.
Implementation.
The recursive implementation from [a7], described below, stores the history of the iteration and uses that information recursively to compute the action of $H _ { k } ^{ 1}$ on a vector. This idea was suggested in [a13], [a14], [a17]. Other implementations may be found in [a15], [a3], [a7], and [a16].
One can easily show that
\begin{equation} \tag{a8} H _ { + } ^ {  1 } = \left( I  \frac { s y ^ { T } } { y ^ { T } s } \right) H _ { c } ^ {  1 } \left( I  \frac { y s ^ { T } } { y ^ { T } s } \right) + \frac { s s ^ { T } } { y ^ { T } s }. \end{equation}
The algorithm $\operatorname{bfgsrec}$ below overwrites a given vector $d$ with $H _ { n}^{  1} d$. The storage needed is one vector for $d$ and $2 n$ vectors for the sequences $\{ s _ { k } , y _ { k } \} _ { k = 0 } ^ { n  1 }$. A method for computing the product of $H _ { 0 } ^ {  1 }$ and a vector must also be provided.
'

References
[a1]  L. Armijo, "Minimization of functions having Lipschitzcontinuous first partial derivatives" Pacific J. Math. , 16 (1966) pp. 1–3 
[a2]  C.G. Broyden, "A new doublerank minimization algorithm" Notices Amer. Math. Soc. , 16 (1969) pp. 670 
[a3]  J.E. Dennis, R.B. Schnabel, "Numerical Methods for Nonlinear Equations and Unconstrained Optimization" , Classics in Applied Math. : 16 , SIAM (Soc. Industrial Applied Math.) (1996) 
[a4]  R. Fletcher, "A new approach to variable metric methods" Comput. J. , 13 (1970) pp. 317–322 
[a5]  D. Goldfarb, "A family of variable metric methods derived by variational means" Math. Comp. , 24 (1970) pp. 23–26 
[a6]  D.F. Shanno, "Conditioning of quasiNewton methods for function minimization" Math. Comp. , 24 (1970) pp. 647–657 
[a7]  C.T. Kelley, "Iterative methods for optimization" , Frontiers in Appl. Math. , 18 , SIAM (Soc. Industrial Applied Math.) (1999) 
[a8]  C.G. Broyden, J.E. Dennis, J.J. Moré, "On the local and superlinear convergence of quasiNewton methods" J. Inst. Math. Appl. , 12 (1973) pp. 223–246 
[a9]  R.H. Byrd, J. Nocedal, "A tool for the analysis of quasiNewton methods with application to unconstrained minimization" SIAM J. Numer. Anal. , 26 (1989) pp. 727–739 
[a10]  R.H. Byrd, J. Nocedal, Y. Yuan, "Global convergence of a class of quasiNewton methods on convex problems" SIAM J. Numer. Anal. , 24 (1987) pp. 1171–1190 
[a11]  M.J.D. Powell, "Some global convergence properties of a variable metric algorithm without exact line searches" R. Cottle (ed.) C. Lemke (ed.) , Nonlinear Programming , Amer. Math. Soc. (1976) pp. 53–72 
[a12]  J. Werner, "Über die globale konvergenz von VariableMetric Verfahren mit nichtexakter Schrittweitenbestimmung" Numer. Math. , 31 (1978) pp. 321–334 
[a13]  K.J. Bathe, A.P. Cimento, "Some practical procedures for the solution of nonlinear finite element equations" Comput. Meth. Appl. Mech. Eng. , 22 (1980) pp. 59–85 
[a14]  H. Matthies, G. Strang, "The solution of nonlinear finite element equations" Internat. J. Numerical Methods Eng. , 14 (1979) pp. 1613–1626 
[a15]  R.H. Byrd, J. Nocedal, R.B. Schnabel, "Representation of quasiNewton matrices and their use in limited memory methods" Math. Progr. , 63 (1994) pp. 129–156 
[a16]  J.L. Nazareth, "Conjugate gradient methods less dependent on conjugacy" SIAM Review , 28 (1986) pp. 501–512 
[a17]  J. Nocedal, "Updating quasiNewton matrices with limited storage" Math. Comp. , 35 (1980) pp. 773–782 
Broyden–Fletcher–Goldfarb–Shanno method. Encyclopedia of Mathematics. URL: http://encyclopediaofmath.org/index.php?title=Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_method&oldid=22202