Minimization methods for functions depending strongly on a few variables


Numerical methods for finding minima of functions of several variables. Suppose one is given a function that is bounded from below and twice continuously differentiable,

$$ J ( x) = J ( x _ {1} \dots x _ {m} ) , $$

and that is known to take its least value at a vector $ x ^ {*} = ( x _ {1} ^ {*} \dots x _ {m} ^ {*} ) ^ {T} $ (where $ {} ^ {T} $ denotes transposition). One wishes to construct a sequence of vectors $ \{ x _ {n} \} $, $ x _ {n} = ( x _ {1n} \dots x _ {mn} ) ^ {T} $, such that

$$ \lim\limits _ {n \rightarrow \infty } J ( x _ {n} ) = J ( x ^ {*} ) . $$

There are many methods for obtaining such a sequence. However, the performance of most algorithms deteriorates sharply when the level surfaces $ J ( x) = \textrm{ const } $ of the function to be minimized have a structure which is far from spherical. In such cases a domain $ Q $ in which the norm of the gradient vector

$$ J ^ \prime ( x) = \ \left ( \frac{\partial J }{\partial x _ {1} } \dots \frac{\partial J }{\partial x _ {m} } \right ) ^ {T} $$

is essentially smaller than in the rest of the space is called a bottom of a valley, and the function itself is called a valley function. If the dimension of the space of arguments of the function to be minimized exceeds two, the structure of the level surfaces of the valley function may be very complicated. $ ( m - k ) $-dimensional valleys occur, where $ k $ varies from 1 to $ m - 1 $. In three-dimensional space, for example, one- and two-dimensional valleys are possible.

Functions of valley type are locally characterized by an ill-conditioned matrix of second derivatives (Hesse matrix or Hessian)

$$ J ^ {\prime\prime} ( x) = \ \left \| \frac{\partial ^ {2} J ( x) }{\partial x _ {i} \partial x _ {j} } \ \right \| ,\ \ i , j = 1 \dots m , $$

which leads to strong variations of $ J ( x) $ along directions coinciding with eigen vectors of the Hesse matrix corresponding to large eigen values, and to small variations along directions corresponding to small eigen values.

Most known methods of optimization allow a fairly rapid descent to the bottom $ Q $ of the valley, leading sometimes to a considerable reduction in the values of $ J ( x) $ as compared with its value at the starting point (descent into the bottom of the valley). However, the process then slows down sharply, and practically stops at some point of $ Q $ which may be far from the required minimum point.

A twice continuously-differentiable function $ J ( x) $ is called a valley function (see [1]) if there exists a domain $ G \subset \mathbf R ^ {m} $ in which the eigen values of the Hesse matrix $ J ^ {\prime\prime} ( x) $, in order of decreasing modulus, satisfy, at any point $ x \in G $, the inequality

$$ \tag{1 } 0 < \left | \min _ { i } \lambda _ {i} ( x) \right | \ll \lambda _ {1} ( x) . $$

The degree of steepness of the valley is characterized by the number

$$ \tag{2 } S = \frac{\lambda _ {1} }{\left | \min _ {\lambda _ {i} \neq 0 } \lambda _ {i} \right | } . $$

If the eigen values of $ J ^ {\prime\prime} ( x) $ in $ G $ satisfy the inequalities

$$ | \lambda _ {m} ( x) | \leq \dots \leq | \lambda _ {m-r+1} ( x) | \ll \lambda _ {m-r} ( x) \leq \dots \leq \lambda _ {1} ( x) , $$

then the number $ r $ is called the dimension of the valley of the function $ J ( x) $ at $ x \in G $ (see [1]).

The system of differential equations describing the trajectory of descent of $ J ( x) $,

$$ \tag{3 } \frac{dx}{dt} = - J ^ \prime ( x) ,\ x ( 0) = x _ {0} , $$

is a stiff differential system.

In particular, when $ J ( x) $ is strictly convex and the Hesse matrix is positive definite (all its eigen values are strictly positive), the inequalities (1) coincide with the well-known requirement for ill-conditioning of the Hesse matrix:

$$ k ( J ^ {\prime\prime} ( x) ) = \ \frac{\max _ { i } \lambda _ {i} ( x) }{\min _ { i } \lambda _ {i} ( x) } \gg 1 . $$

In this case the spectral condition number is the same as the degree of steepness of the valley.
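
As a concrete illustration (not part of the article; the matrix, the use of `numpy` and all names are assumptions), the following sketch computes the degree of steepness (2) and the spectral condition number for a fixed symmetric Hessian with one large and one small eigen value:

```python
import numpy as np

# Illustrative only: a hypothetical 2x2 Hessian with one large and one small
# eigenvalue, i.e. a steep one-dimensional valley.
D = np.array([[1000.0, 999.0],
              [ 999.0, 999.0]])

lam = np.linalg.eigvalsh(D)            # eigenvalues of the symmetric matrix D
nonzero = lam[lam != 0.0]
S = lam.max() / abs(nonzero.min())     # degree of steepness, formula (2)
kappa = lam.max() / lam.min()          # spectral condition number k(D)

print(S, kappa)   # for a positive-definite Hessian the two numbers coincide
```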

The coordinate-wise descent method (see [2])

$$ \tag{4 } J ( x _ {1,k+1} \dots x _ {i-1,k+1} , x _ {i,k+1} , x _ {i+1,k} \dots x _ {m,k} ) = $$

$$ = \ \min _ { y } J ( x _ {1,k+1} \dots x _ {i-1,k+1} , y , x _ {i+1,k} \dots x _ {m,k} ) , $$

$$ i = 1 \dots m ; \ \ k = 0 , 1 ,\dots $$

in spite of its simplicity and universality, is effective in the valley situation only in the rare case when the orientation of the valleys is along the coordinate axes.
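
A minimal sketch of the coordinate-wise descent iteration (4) for a quadratic function $ J ( x) = \frac{1}{2} x ^ {T} D x - b ^ {T} x $, where each one-dimensional minimization can be written in closed form; the test matrix, the number of sweeps and all names are illustrative assumptions:

```python
import numpy as np

def coordinate_descent(D, b, x0, sweeps=100):
    """Coordinate-wise descent (4) for J(x) = 0.5 x^T D x - b^T x.

    Minimizing over the i-th coordinate with the others fixed gives
    x_i = (b_i - sum_{j != i} D_ij x_j) / D_ii, i.e. a Gauss-Seidel sweep.
    """
    x = np.array(x0, dtype=float)
    for _ in range(sweeps):
        for i in range(len(b)):
            x[i] = (b[i] - D[i].dot(x) + D[i, i] * x[i]) / D[i, i]
    return x

# A valley oriented at roughly 45 degrees to the axes: progress is very slow.
D = np.array([[1000.0, 999.0], [999.0, 999.0]])
b = np.array([1.0, 1.0])
print(coordinate_descent(D, b, x0=[0.0, 0.0]))
print(np.linalg.solve(D, b))   # exact minimizer, for comparison
```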

A modernization of the method (4) was proposed in [2], consisting of a rotation of the coordinate axes so that one of the axes lies along $ x _ {k} - x _ {k-1} $, after which the search begins at the $ ( k + 1 ) $-st step. With this approach one of the axes tends to align with a generator of the bottom of the valley, which in several cases makes it possible to minimize functions with one-dimensional valleys successfully. The method is unsuitable for multi-dimensional valleys.

The scheme of the method of steepest descent (cf. Steepest descent, method of) is given by the difference equation

$$ \tag{5 } x _ {k+1} = x _ {k} - h _ {k} J _ {k} ^ \prime ,\ \ J _ {k} ^ \prime = J ^ \prime ( x _ {k} ) , $$

where $ h _ {k} $ is chosen by the condition

$$ J ( x _ {k+1} ) = \ \min _ {h > 0 } J ( x _ {k} - h J _ {k} ^ \prime ) . $$

For strictly-convex valley functions, in particular for quadratic ones

$$ \tag{6 } J ( x) = \frac{1}{2} x ^ {T} D x - b ^ {T} x , $$

the sequence $ \{ x _ {k} \} $ constructed by algorithm (5) converges geometrically to the minimum point $ x ^ {*} $ of the function (see [3]):

$$ \| x _ {k} - x ^ {*} \| \leq C q ^ {k} , $$

where $ C = \textrm{ const } $ and

$$ q = \frac{k ( J ^ {\prime\prime} ( x ^ {*} ) ) - 1 }{k ( J ^ {\prime\prime} ( x ^ {*} ) ) + 1 } . $$

Since $ k ( J ^ {\prime\prime} ( x) ) \gg 1 $ for a valley function, $ q \simeq 1 $ and convergence is effectively absent.
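
The slow convergence is easy to observe numerically. The sketch below (the test matrix and all names are assumptions) implements scheme (5) for the quadratic function (6), using the fact that for (6) the exact line search has a closed form:

```python
import numpy as np

def steepest_descent(D, b, x0, iters=1000):
    """Steepest descent (5) for the quadratic function (6): J(x) = 0.5 x^T D x - b^T x.

    The gradient is g = D x - b, and the exact line search
    min_{h>0} J(x - h g) gives h = (g^T g) / (g^T D g).
    """
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        g = D.dot(x) - b
        h = g.dot(g) / g.dot(D.dot(g))
        x -= h * g
    return x

D = np.array([[1000.0, 999.0], [999.0, 999.0]])   # k(D) is roughly 4*10^3, so q is about 0.9995
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(D, b)
x = steepest_descent(D, b, x0=[0.0, 0.0])
# The error shrinks only by roughly the factor q per step, so it is still
# of the same order as the initial error after a thousand iterations.
print(np.linalg.norm(x - x_star) / np.linalg.norm(x_star))
```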

A similar picture can be seen for the simple gradient scheme (see [4]; Gradient method)

$$ \tag{7 } x _ {k+1} = x _ {k} - h J _ {k} ^ \prime ,\ \ J _ {k+1} = J ( x _ {k+1} ) ,\ h = \textrm{ const } . $$

Acceleration of its convergence is based on using the results of the previous iterations to locate the bottom of the valley more precisely. The gradient method (7) could be used (see [4], [5]) with computation of the ratio $ q = \| J _ {k} ^ \prime \| / \| J _ {k-1} ^ \prime \| $ at each iteration. When it becomes steady near a constant value $ q $ close to one, a large accelerating step is performed in accordance with the expression

$$ x _ {k+1} = x _ {k} - \frac{h}{1 - q } J _ {k} ^ \prime . $$

Then descent by the gradient method is continued until the next accelerating step.
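
A sketch of this acceleration strategy on top of the gradient scheme (7); the rule used to decide when the ratio $ q $ has become steady, the step $ h $ and all names are illustrative assumptions, not the article's prescription:

```python
import numpy as np

def gradient_with_acceleration(grad, x0, h, n_iters=500, tol=0.02):
    """Gradient scheme (7) with occasional accelerating steps.

    The ratio q = ||J'_k|| / ||J'_{k-1}|| is monitored; once it is steady
    near 1, one long step x_{k+1} = x_k - h/(1 - q) J'_k is taken, after
    which ordinary gradient descent resumes.
    """
    x = np.array(x0, dtype=float)
    g_prev = grad(x)
    for _ in range(n_iters):
        x = x - h * g_prev
        g = grad(x)
        q = np.linalg.norm(g) / np.linalg.norm(g_prev)
        if q < 1.0 and abs(q - 1.0) < tol:      # bottom of the valley reached
            x = x - h / (1.0 - q) * g           # accelerating step
            g = grad(x)
        g_prev = g
    return x

# Quadratic valley (6): J'(x) = D x - b; h must be below 2 / lambda_max.
D = np.array([[1000.0, 999.0], [999.0, 999.0]])
b = np.array([1.0, 1.0])
x = gradient_with_acceleration(lambda x: D.dot(x) - b, x0=[0.0, 0.0], h=1.0 / 2000.0)
print(np.linalg.norm(x - np.linalg.solve(D, b)))
```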

Various versions of the method of parallel tangents (see [4]–[6]) are based on carrying out the accelerating step along the direction $ x _ {k+2} - x _ {k} $ given by the points $ x _ {k} , x _ {k+2} $ in the gradient method. In the "heavy-sphere" method (see [4]; Heavy sphere, method of the) the next approximation has the form

$$ x _ {k+1} = x _ {k} - \alpha J _ {k} ^ \prime + \beta ( x _ {k} - x _ {k-1} ) . $$
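
A sketch of the heavy-sphere iteration; the particular choice of $ \alpha $ and $ \beta $ below is a classical choice for quadratic functions and is an assumption, not taken from the article:

```python
import numpy as np

def heavy_sphere(grad, x0, alpha, beta, n_iters=2000):
    """'Heavy-sphere' iteration: x_{k+1} = x_k - alpha*J'(x_k) + beta*(x_k - x_{k-1})."""
    x_prev = np.array(x0, dtype=float)
    x = x_prev.copy()
    for _ in range(n_iters):
        x, x_prev = x - alpha * grad(x) + beta * (x - x_prev), x
    return x

# For the quadratic function (6) an illustrative classical parameter choice is
# alpha = 4/(sqrt(l_max) + sqrt(l_min))^2 and beta = ((sqrt(k)-1)/(sqrt(k)+1))^2.
D = np.array([[1000.0, 999.0], [999.0, 999.0]])
b = np.array([1.0, 1.0])
l_min, l_max = np.linalg.eigvalsh(D)[[0, -1]]
kappa = l_max / l_min
alpha = 4.0 / (np.sqrt(l_max) + np.sqrt(l_min)) ** 2
beta = ((np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)) ** 2
x = heavy_sphere(lambda x: D.dot(x) - b, [0.0, 0.0], alpha, beta)
print(np.linalg.norm(x - np.linalg.solve(D, b)))   # much closer than (5) or (7) after comparable work
```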

In the valley method (see [7]) local descents are carried out by the gradient method (7) from two arbitrarily chosen starting points, and then an accelerating step is made in a direction given by two of the points at the bottom of the valley.

These methods are not much more complicated than the gradient method (7), and they are based on it. Accelerated convergence is obtained for a one-dimensional valley. In more general cases of multi-dimensional valleys, where convergence of these methods slows down sharply, one has to turn to the more powerful methods of quadratic approximation, based on Newton's method (cf. Newton method)

$$ \tag{8 } x _ {k+1} = x _ {k} - ( J _ {k} ^ {\prime\prime} ) ^ {-1} J _ {k} ^ \prime ,\ \ J _ {k} ^ {\prime\prime} = J ^ {\prime\prime} ( x _ {k} ) . $$

The minimum of the function (6) satisfies the system of linear equations

$$ \tag{9 } D \mathbf x = \mathbf b , $$

and if all computations are carried out with absolute accuracy, then for a quadratic function Newton's method is independent of the degree of steepness (2) and of the dimension of the valley, and reaches the minimum in one step. In reality, with a large condition number $ k ( D) $ and a restricted number of digits in the computation, the problem of solving (9) may be ill-posed, and small changes in the elements of the matrix $ D $ and the vector $ \mathbf b $ can lead to large variations in $ \mathbf x ^ {*} $.

With moderate degrees of steepness in a convex situation, Newton's method often turns out to be preferable, with respect to speed of convergence, to other methods, such as the gradient method.

A large class of quadratic (quasi-Newtonian) methods is based on the use of conjugate directions (see [2], [3], [8]). For the minimization of a convex function these algorithms are very effective, since, while possessing quadratic termination, they do not require computation of the matrix of second derivatives.
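
As one representative of this class, the standard conjugate-gradient recurrences applied to the quadratic (6) can be sketched as follows (the exact step length and restart after $ m $ steps are standard choices for quadratics, not taken from the references above):

  import numpy as np

  def conjugate_gradient(D, b, x0, tol=1e-12):
      # Minimizes (6): J(x) = 0.5 x^T D x - b^T x, i.e. solves D x = b, by
      # conjugate directions; terminates in at most m steps in exact arithmetic.
      x = np.asarray(x0, dtype=float)
      r = b - D @ x                      # negative gradient at x
      p = r.copy()
      for _ in range(len(b)):
          if np.linalg.norm(r) < tol:
              break
          Dp = D @ p
          alpha = (r @ r) / (p @ Dp)     # exact minimizer along p for a quadratic
          x = x + alpha * p
          r_new = r - alpha * Dp
          beta = (r_new @ r_new) / (r @ r)
          p = r_new + beta * p           # next conjugate direction
          r = r_new
      return x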

Sometimes (see [8]) the iteration is performed according to the scheme

$$ \tag{10 } x _ {k+1} = x _ {k} - ( \beta _ {k} E + J _ {k} ^ {\prime\prime} ) ^ {-1} J _ {k} ^ \prime , $$

where $ E $ is the identity matrix. The scalars $ \beta _ {k} $ are chosen so that the matrix $ J _ {k} ^ {\prime\prime} + \beta _ {k} E $ is positive definite and

$$ \| x _ {k+1} - x _ {k} \| \leq \epsilon _ {k} . $$
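
A sketch of one step of scheme (10); the rule for increasing $ \beta _ {k} $ until positive definiteness is reached, and the step-length test against $ \epsilon _ {k} $, are illustrative assumptions.

  import numpy as np

  def regularized_newton_step(grad, hess, x, beta0=1e-3, eps=1.0):
      # One step of (10): x_{k+1} = x_k - (beta_k E + J''_k)^{-1} J'_k, with
      # beta_k increased until beta_k E + J''_k is positive definite and the
      # step satisfies ||x_{k+1} - x_k|| <= eps (eps plays the role of eps_k).
      H, g = hess(x), grad(x)
      beta = beta0
      while True:
          M = H + beta * np.eye(len(x))
          try:
              np.linalg.cholesky(M)              # positive-definiteness test
              step = np.linalg.solve(M, g)
              if np.linalg.norm(step) <= eps:
                  return x - step
          except np.linalg.LinAlgError:
              pass
          beta *= 10.0                           # illustrative increase rule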

There are several similar approaches (see [8]) based on obtaining strictly positive-definite approximations to the Hesse matrix. For the minimization of valley functions such algorithms turn out to be inefficient because of the difficulty in choosing the parameters $ \beta _ {k} , \epsilon _ {k} $, etc. The choice of these parameters relies on information about the eigen values of smallest modulus of the Hesse matrix, and in real calculations with a high degree of steepness this information is strongly distorted.

A more appropriate generalization of Newton's method to the minimization of valley functions is based on the continuous principle of optimization. The function $ J ( x) $ is associated with a differential system (3), which is integrated by the system method (see Stiff differential system). The minimization algorithm takes the form

$$ \tag{11a } x _ {k+1} = x _ {k} - \Phi ( 2 ^ {N} h _ {k} ^ {0} ) J _ {k} ^ \prime , $$

$$ \tag{11b } \Phi ( 2 ^ {N} h _ {k} ^ {0} ) = \int\limits _ { 0 } ^ { 2 ^ {N} h _ {k} ^ {0} } \mathop{\rm exp} ( - J _ {k} ^ {\prime\prime} \tau ) \ d \tau , $$

$$ \tag{11c } J _ {k+1} = \min _ { N } \ J ( x _ {k} - \Phi ( 2 ^ {N} h _ {k} ^ {0} ) J _ {k} ^ \prime ) , $$

$$ h _ {k} ^ {0} \leq \frac{1}{\| J _ {k} ^ {\prime\prime} \| } , $$

$$ \tag{11d } \Phi ( h _ {k} ^ {0} ) = \ h _ {k} ^ {0} \left [ E - \frac{h _ {k} ^ {0} }{2} J _ {k} ^ {\prime\prime} + \frac{( h _ {k} ^ {0} ) ^ {2} }{6} ( J _ {k} ^ {\prime\prime} ) ^ {2} - \dots \right ] , $$

$$ \tag{11e } \Phi ( 2 ^ {s+1} h _ {k} ^ {0} ) = \Phi ( 2 ^ {s} h _ {k} ^ {0} ) [ 2 E - J _ {k} ^ {\prime\prime} \Phi ( 2 ^ {s} h _ {k} ^ {0} ) ] , $$

$$ s = 0 \dots N - 1 . $$
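
A Python sketch of one step of scheme (11a)–(11e) might look as follows; the truncation order of the series (11d) and the number of doublings N_max are assumptions of the example.

  import numpy as np

  def phi_series(H, h, terms=10):
      # (11d): Phi(h) = h [ E - (h/2) H + (h^2/6) H^2 - ... ], truncated.
      m = H.shape[0]
      Phi, term = h * np.eye(m), h * np.eye(m)
      for j in range(2, terms + 1):
          term = term @ (-h * H) / j          # next term of h * sum (-hH)^{j-1}/j!
          Phi = Phi + term
      return Phi

  def stiff_step(J, grad, hess, x, N_max=30):
      # One step of (11a)-(11c): Phi(2^N h_k^0) is built by the doubling
      # formula (11e), and the N giving the smallest value of J is kept.
      H, g = hess(x), grad(x)
      h0 = 1.0 / np.linalg.norm(H, 2)         # step bound h_k^0 <= 1 / ||J''_k||
      Phi = phi_series(H, h0)
      best_x, best_J = x - Phi @ g, J(x - Phi @ g)
      for _ in range(N_max):
          Phi = Phi @ (2 * np.eye(len(x)) - H @ Phi)   # doubling formula (11e)
          cand = x - Phi @ g
          if J(cand) < best_J:
              best_x, best_J = cand, J(cand)
      return best_x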

In [1], an algorithm for the minimization of valley functions has been proposed based on the use of properties of stiff systems. Let the function $ J ( x) $ be approximated in a neighbourhood of $ x _ {0} $ by the quadratic function (6). The matrix $ D $ and the vector $ \mathbf b $ are computed by, for example, finite-difference approximation. From the representation of the elements of the matrix

$$ d _ {ij} = \ \sum _ {s=1} ^ {m} u _ {is} u _ {js} \lambda _ {s} , $$

where $ u _ {s} = ( u _ {1s} \dots u _ {ms} ) ^ {T} $, $ s = 1 \dots m $, is an orthonormal basis of eigen vectors of $ D $, it follows that imprecise measurement of these elements distorts the information about the smallest eigen values of an ill-conditioned matrix, and hence leads to ill-posedness of the problem of minimization of (6).
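
This can be made quantitative by the standard perturbation bound for symmetric matrices, $ | \lambda _ {s} ( D + \delta D ) - \lambda _ {s} ( D ) | \leq \| \delta D \| $: a relative error $ \epsilon $ in the entries $ d _ {ij} $ produces a perturbation $ \delta D $ of norm of order $ \epsilon \lambda _ {1} $, and hence a relative error of order $ \epsilon k ( D) $ in the smallest eigen value $ \lambda _ {m} $.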

Moreover, the system of differential equations of descent for the valley function (6),

$$ \frac{dx}{dt} = - D x + b ,\ \ x ( 0) = x _ {0} , $$

has a solution in which, by virtue of condition (1), the terms with factors $ \mathop{\rm exp} ( - \lambda _ {1} t ) $ have an influence only on a small initial interval of length $ \tau _ {ps} = O ( \lambda _ {1} ^ {-1} ) $.
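
For reference (a standard formula, written out here only to make the role of the factors $ \mathop{\rm exp} ( - \lambda _ {s} t ) $ explicit), the solution of this system in the eigen basis of $ D $ is

$$ x ( t) = D ^ {-1} b + \sum _ {s=1} ^ {m} \left ( u _ {s} ^ {T} ( x _ {0} - D ^ {-1} b ) \right ) \mathop{\rm exp} ( - \lambda _ {s} t ) u _ {s} . $$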

In other words, the components of the vector $ x ( t) $ satisfy the equation

$$ x ^ {T} ( t) u _ {1} - \lambda _ {1} ^ {-1} b ^ {T} u _ {1} = \ ( x _ {0} ^ {T} u _ {1} - \lambda _ {1} ^ {-1} b ^ {T} u _ {1} ) \mathop{\rm exp} ( - \lambda _ {1} t ) , $$

which quickly goes over to the stationary relation

$$ \tag{12 } \sum _ {i=1} ^ {m} u _ {i1} \overline{x} _ {i} - \lambda _ {1} ^ {-1} \sum _ {i=1} ^ {m} b _ {i} u _ {i1} = 0 , $$

where $ \overline{x} _ {i} $ are the components of a vector satisfying (12). This property is used in the algorithm. Expressing the $ j $-th component of the vector $ \overline{\mathbf x} $, namely the one corresponding to the maximal component of the vector $ u _ {1} $, in terms of the remaining components, one replaces $ J ( x) $ by a new function whose argument has dimension $ ( m - 1 ) $:

$$ \tag{13 } J \left ( \overline{x} _ {1} \dots \overline{x} _ {j-1} ,\ - u _ {j1} ^ {-1} \sum _ {\begin{array}{c} i = 1 \\ i \neq j \end{array} } ^ { m } ( u _ {i1} \overline{x} _ {i} - \lambda _ {1} ^ {-1} b _ {i} u _ {i1} ) ,\ \overline{x} _ {j+1} \dots \overline{x} _ {m} \right ) = \overline{J} ( \overline{x} _ {1} \dots \overline{x} _ {j-1} , \overline{x} _ {j+1} \dots \overline{x} _ {m} ) . $$

One then finds a new matrix $ \overline{D} $ of order $ ( m - 1 ) $ and a new vector $ \overline{\mathbf b} $ from the function (13) by means of finite-difference approximation. Here it is not so much the reduction of the dimension of the search space that is important as the reduction of the degree of steepness, so that in the minimization of the new function in the subspace orthogonal to the vector $ u _ {1} $ the largest eigen value no longer influences the computation. The requirement that $ \overline{D} $ and $ \overline{\mathbf b} $ be obtained in terms of the function (13), and not in terms of $ D $ and $ \mathbf b $, is a very important point here. The coefficients of relation (12) are found by the power method, as the coefficients of an arbitrary equation of the system

$$ D ^ {k} \mathbf x = \ D ^ {k-1} \mathbf b . $$

If the degree of steepness is not reduced or is reduced insignificantly, then the process of elimination of coordinates of the vector $ \mathbf x $ is continued recursively until the necessary reduction is obtained.
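
A Python sketch of the elimination step described above; the power-method approximation of $ u _ {1} $, $ \lambda _ {1} $, the reference point x_ref used to assemble the full argument, and all names are illustrative assumptions, and the $ j $-th coordinate is recovered here by solving (12) directly.

  import numpy as np

  def dominant_eigvec(D, iters=50):
      # Power method: approximates the eigen vector u_1 belonging to the
      # largest eigen value lambda_1 of D.
      v = np.ones(D.shape[0])
      for _ in range(iters):
          v = D @ v
          v = v / np.linalg.norm(v)
      return v, v @ (D @ v)                    # (u_1, lambda_1)

  def eliminate_steep_coordinate(J, D, b, x_ref):
      # Builds the reduced function (13): the coordinate corresponding to the
      # maximal component of u_1 is expressed from the stationary relation (12),
      # and J is restricted to the remaining m - 1 coordinates.
      u1, lam1 = dominant_eigvec(D)
      j = int(np.argmax(np.abs(u1)))
      others = np.arange(len(u1)) != j

      def J_reduced(y):                        # y has dimension m - 1
          x = np.asarray(x_ref, dtype=float).copy()
          x[others] = y
          x[j] = (b @ u1 / lam1 - u1[others] @ x[others]) / u1[j]   # solve (12) for x_j
          return J(x)

      return J_reduced, j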

References

[1] Yu.V. Rakitskii, S.M. Ustinov, I.G. Chernorustskii, "Computational methods for the solution of stiff systems" , Moscow (1979) (In Russian)
[2] J. Céa, "Optimisation. Théorie et algorithmes" , Dunod (1971)
[3] B.N. Pshenichnyi, Yu.M. Danilin, "Numerical methods in extremal problems" , MIR (1978) (Translated from Russian)
[4] B.T. Polyak, "Methods for minimizing functions of several variables" Ekonom. i Mat. Metody , 3 : 6 (1967) pp. 881–902 (In Russian)
[5] D.K. Faddeev, V.N. Faddeeva, "Computational methods of linear algebra" , Freeman (1963) (Translated from Russian)
[6] D.J. Wilde, "Optimum seeking methods" , Prentice-Hall (1964)
[7] I.M. Gel'fand, M.L. Tsetlin, "The principle of nonlocal search in automatic optimization systems" Soviet Phys. Dokl. , 6 (1961) pp. 192–194 Dokl. Akad. Nauk SSSR , 137 : 2 (1961) pp. 295–298
[8] P.E. Gill (ed.) W. Murray (ed.) , Numerical methods for constrained optimization , Acad. Press (1974)

Comments

References

[a1] D.G. Luenberger, "Linear and nonlinear programming" , Addison-Wesley (1984)