# Minimization methods for functions depending strongly on a few variables

Numerical methods for finding minima of functions of several variables. Suppose one is given a function that is bounded from below and twice continuously differentiable,

$$J ( x) = J ( x _ {1} \dots x _ {m} ) ,$$

and that is known to take its least value at a vector $x ^ {*} = ( x _ {1} ^ {*} \dots x _ {m} ^ {*} ) ^ {T}$( where ${} ^ {T}$ denotes transposition). One wishes to construct a sequence of vectors $\{ x _ {n} \}$, $x _ {n} = ( x _ {1n} \dots x _ {mn} ) ^ {T}$, such that

$$\lim\limits _ {n \rightarrow \infty } J ( x _ {n} ) = J ( x ^ {*} ) .$$

There are many methods for obtaining such a sequence. However, the majority of algorithms suffer from a sharp deterioration when the level surfaces $J ( x) = \textrm{ const }$ of the function to be minimized have a structure which is far from spherical. In this case a domain $Q$ in which the norm of the gradient vector

$$J ^ \prime ( x) = \ \left ( \frac{\partial J }{\partial x _ {1} } \dots \frac{\partial J }{\partial x _ {m} } \right ) ^ {T}$$

is essentially smaller than in the rest of the space is called a bottom of a valley, and the function itself is called a valley function. If the dimension of the space of arguments of the function to be minimized exceeds two, the structure of the level surfaces of the valley function may be very complicated. $( m - k )$- dimensional valleys occur, where $k$ varies from 1 to $m - 1$. In three-dimensional space, for example, one and two-dimensional valleys are possible.

Functions of valley type are locally characterized by an ill-conditioned matrix of second derivatives (Hesse matrix or Hessian)

$$J ^ {\prime\prime} ( x) = \ \left \| \frac{\partial ^ {2} J ( x) }{\partial x _ {i} \partial x _ {j} } \ \right \| ,\ \ i , j = 1 \dots m ,$$

which leads to strong variations of $J ( x)$ along directions coinciding with eigen vectors of the Hesse matrix corresponding to large eigen values, and to small variations along directions corresponding to small eigen values.

Most known methods of optimization allow a fairly rapid descent to the bottom $Q$ of the valley, leading sometimes to a considerable reduction in the values of $J ( x)$ as compared with its value at the starting point (descent into the bottom of the valley). However, the process then slows down sharply, and practically stops at some point of $Q$ which may be far from the required minimum point.

A twice continuously-differentiable function $J ( x)$ is called a valley function (see [1]) if there exists a domain $G \subset \mathbf R ^ {m}$ in which the eigen values of the Hesse matrix $J ^ {\prime\prime} ( x)$, in order of decreasing modulus, satisfy, at any point $x \in G$, the inequality

$$\tag{1 } 0 < \left | \min _ { i } \lambda _ {i} ( x) \right | \ll \lambda _ {1} ( x) .$$

The degree of steepness of the valley is characterized by the number

$$\tag{2 } S = \frac{\lambda _ {1} }{\left | \min _ {\lambda _ {i} \neq 0 } \ \lambda _ {i} \right | } .$$

If the eigen values of $J ^ {\prime\prime} ( x)$ in $G$ satisfy the inequalities

$$| \lambda _ {m} ( x) | \leq \dots \leq | \lambda _ {m-} r+ 1 ( x) | \ll \lambda _ {m-} r ( x) \ \leq \dots \leq \lambda _ {1} ( x) ,$$

then the number $r$ is called the dimension of the valley of the function $J ( x)$ at $x \in G$( see [1]).

The system of differential equations describing the trajectory of descent of $J ( x)$,

$$\tag{3 } \frac{dx}{dt} = - J ^ \prime ( x) ,\ x ( 0) = x _ {0} ,$$

In particular, when $J ( x)$ is strictly convex and the Hesse matrix is positive definite (all its eigen values are strictly positive), the inequalities (1) coincide with the well-known requirement for ill-conditioning of the Hesse matrix:

$$k ( J ^ {\prime\prime} ( x) ) = \ \frac{\max _ { i } \lambda _ {i} ( x) }{\min _ { i } \lambda _ {i} ( x) } \gg 1 .$$

In this case the spectral condition number is the same as the degree of steepness of the valley.

$$\tag{4 } J ( x _ {1,k+} 1 \dots x _ {i-} 1,k+ 1 , x _ {i,k+} 1 , x _ {i+} 1,k \dots x _ {m,k} ) =$$

$$= \ \min _ { y } J ( x _ {1,k+} 1 \dots x _ {i-} 1,k , y, x _ {i+} 1,k \dots x _ {m,k} ) ,$$

$$k = 0 , 1 \dots$$

in spite of its simplicity and universality, is effective in the valley situation only in the rare case when the orientation of the valleys is along the coordinate axes.

A modernization of the method (4) was proposed in [2], consisting of a rotation of the coordinate axes so that one of the axes lies along $x _ {k} - x _ {k-} 1$, after which the search begins at the $( k + 1 )$- st step. Such an approach leads to one of the axes having a tendency to align along a generator of the bottom of the valley, allowing one to perform successfully the minimization of functions with one-dimensional valleys in several cases. The method is unsuitable for multi-dimensional valleys.

The scheme of the method of steepest descent (cf. Steepest descent, method of) is given by the difference equation

$$\tag{5 } x _ {k+} 1 = x _ {k} - h _ {k} J _ {k} ^ \prime ,\ \ J _ {k} ^ \prime = J ^ \prime ( x _ {k} ) ,$$

where $h _ {k}$ is chosen by the condition

$$J ( x _ {k+} 1 ) = \ \min _ {h > 0 } J ( x _ {k} - h J _ {k} ^ \prime ) .$$

For strictly-convex valley functions, in particular for quadratic ones

$$\tag{6 } J ( x) = \frac{1}{2} x ^ {T} D x - b ^ {T} x ,$$

the sequence $\{ x _ {k} \}$ constructed by algorithm (5) converges geometrically to the minimum point $x ^ {*}$ of the function (see [3]):

$$\| x _ {k} - x ^ {*} \| \leq C q ^ {k} ,$$

where $C = \textrm{ const }$ and

$$q = \frac{k ( J ^ {\prime\prime} ( x ^ {*} ) ) - 1 }{k ( J ^ {\prime\prime} ( x ^ {*} ) ) + 1 } .$$

Since $k ( J ^ {\prime\prime} ( x) ) \gg 1$ for a valley function, $q \simeq 1$ and convergence is effectively absent.

A similar picture can be seen for the simple gradient scheme (see [4]; Gradient method)

$$\tag{7 } x _ {k+} 1 = x _ {k} - h J _ {k} ^ {\prime\prime} ,\ \ J _ {k+} 1 = J ( x _ {k+} 1 ) ,\ h = \textrm{ const } .$$

Acceleration of its convergence is based on using the results of the previous iterations to make the bottom of the valley more precise. The gradient method (7) could be used (see [4], [5]) with computation of the ratio $q = \| J _ {k} ^ \prime \| / \| J _ {k-} 1 ^ \prime \|$ at each iteration. When it becomes steady near the constant value $q = 1$, a large accelerating step is performed in accordance with the expression

$$x _ {k+} 1 = x _ {k} - \frac{h}{1-} q J _ {k} ^ \prime .$$

Then descent by the gradient method is continued until the next accelerating step.

Various versions of the method of parallel tangents (see [4][6]) are based on carrying out the accelerating step along the direction $x _ {k+} 2 - x _ {k}$ given by the points $x _ {k} , x _ {k+} 2$ in the gradient method. In the "heavy-sphere" method (see [4]; Heavy sphere, method of the) the next approximation has the form

$$x _ {k+} 1 = x _ {k} - \alpha J _ {k} ^ \prime + \beta ( x _ {k} - x _ {k-} 1 ) .$$

In the valley method (see [7]) local descents are carried out by the gradient method (7) from two arbitrarily chosen starting points, and then an accelerating step is made in a direction given by two of the points at the bottom of the valley.

These methods are not much more complicated than the gradient method (7), and they are based on it. Accelerated convergence is obtained for a one-dimensional valley. In more general cases of multi-dimensional valleys, where convergence of these methods slows down sharply, one has to turn to the more powerful methods of quadratic approximation, based on Newton's method (cf. Newton method)

$$\tag{8 } x _ {k+} 1 = x _ {k} - ( J _ {k} ^ {\prime\prime} ) ^ {-} 1 J _ {k} ^ \prime ,\ \ J _ {k} ^ {\prime\prime} = J ^ {\prime\prime} ( x _ {k} ) .$$

The minimum of the function (6) satisfies the system of linear equations

$$\tag{9 } D \mathbf x = \mathbf b ,$$

and under the condition of absolute accuracy of all computations, for a quadratic function Newton's method is independent of the degree of steepness of the valley (2) and the dimension of the valley, and leads to the minimum in one step. In reality, with a large condition number $k ( D)$ and a restricted number of digits in computation, the problem of solving (9) may be ill-posed, and small changes in the elements of the matrix $D$ and the vector $\mathbf b$ can lead to large variations in $\mathbf x ^ {*}$.

With moderate degrees of steepness in a convex situation, Newton's method often turns out to be preferable, with respect to speed of convergence, to other methods, such as the gradient method.

A large class of quadratic (quasi-Newtonian) methods are based on the use of conjugate directions (see [2], [3], [8]). For the case of minimization of a convex function these algorithms are very effective, since having a quadratic termination they do not require a computation of the matrix of second derivatives.

Sometimes (see [8]) the iteration is performed according to the scheme

$$\tag{10 } x _ {k+} 1 = x _ {k} - ( \beta _ {k} E + J _ {k} ^ {\prime\prime} ) ^ {-} 1 J _ {k} ^ {\prime\prime} ,$$

where $E$ is the identity matrix. The scalars $\beta _ {k}$ are chosen so that the matrix $J _ {k} ^ {\prime\prime} + \beta _ {k} E$ is positive definite and

$$\| x _ {k+} 1 - x _ {k} \| \leq \epsilon _ {k} .$$

There are several similar approaches (see [8]) based on obtaining strictly positive definite approximations to the Hesse matrix. For the minimization of valley functions such algorithms turn out to be inefficient because of the difficulty in choosing the parameters $\beta _ {k} , \epsilon _ {k}$, etc. The choice of these parameters is based on information on the size of the eigen values of smallest modulus of the Hesse matrix, and for real calculations and a high degree of steepness this information is strongly distorted.

A more appropriate generalization of Newton's method for the case of minimization of valley functions is based on the continuous principle of optimization. The function $J ( x)$ is juxtaposed to a differential system (3), which is integrable by the system method (see Stiff differential system). The minimization algorithm takes the form

$$\tag{11a } x _ {k+} 1 = x _ {k} - \Phi ( 2 ^ {N} h _ {k} ^ {0} ) J _ {k} ^ \prime ,$$

$$\tag{11b } \Phi ( 2 ^ {N} h _ {k} ^ {0} ) = \int\limits _ { 0 } ^ { {2 } N h _ {k} ^ {0} } \mathop{\rm exp} ( - J _ {k} ^ {\prime\prime} \tau ) \ d \tau ,$$

$$\tag{11c } J _ {k+} 1 = \min _ { N } \ J ( x _ {k} - \Phi ( 2 ^ {N} h _ {k} ^ {0} ) J _ {k} ^ \prime ) ,$$

$$h _ {k} ^ {0} \leq \frac{1}{\| J _ {k} ^ {\prime\prime} \| } ,$$

$$\tag{11d } \Phi ( h _ {k} ^ {0} ) = \ h _ {k} ^ {0} \left [ E - \frac{h _ {k} ^ {0} }{2} J _ {k} ^ {\prime\prime} + \frac{( h _ {k} ^ {0} ) ^ {2} }{6} ( J _ {k} ^ {\prime\prime} ) ^ {2} - \dots \right ] ,$$

$$\tag{11e } \Phi ( 2 ^ {s+} 1 h _ {k} ^ {0} ) = \Phi ( 2 ^ {s} h _ {k} ^ {0} ) [ 2 E - J _ {k} ^ {\prime\prime} \Phi ( 2 ^ {s} h _ {k} ^ {0} ) ] ,$$

$$s = 0 \dots N - 1 .$$

In [1], an algorithm for the minimization of valley functions has been proposed based on the use of properties of stiff systems. Let the function $J ( x)$ be approximated in a neighbourhood of $x _ {0}$ by the quadratic function (6). The matrix $D$ and the vector $\mathbf b$ are computed by, for example, finite-difference approximation. From the representation of the elements of the matrix

$$d _ {ij} = \ \sum _ { s= } 1 ^ { m } u _ {is} u _ {js} \lambda _ {s} ,$$

where $u _ {s} = ( u _ {1s} \dots u _ {ms} ) ^ {T}$, $s = 1 \dots m$, is an orthonormal basis of eigen vectors of $D$, it follows that imprecise measurement of these elements distorts the information about the smallest eigen values of an ill-conditioned matrix, and hence leads to ill-posedness of the problem of minimization of (6).

Moreover, the system of differential equations of descent for the valley function (6),

$$\frac{dx}{dt} = - D x + b ,\ \ x ( 0) = x _ {0} ,$$

has a solution in which, by virtue of condition (1), the terms with factors $\mathop{\rm exp} ( - \lambda _ {1} t )$ have an influence only on a small initial interval of length $\tau _ {ps} = O ( \lambda _ {1} ^ {-} 1 )$.

In other words, the components of the vector $x ( t)$ satisfy the equation

$$x ^ {T} ( t) u _ {1} - \lambda _ {1} ^ {-} 1 b ^ {T} u _ {1} = \ ( x _ {0} ^ {T} u _ {1} - \lambda _ {1} ^ {-} 1 b ^ {T} u _ {1} ) \mathop{\rm exp} ( - \lambda _ {1} t ) ,$$

which quickly goes over to the stationary relation

$$\tag{12 } \sum _ { i= } 1 ^ { m } u _ {i1} \overline{x}\; _ {i} + \lambda _ {1} ^ {-} 1 \sum _ { i= } 1 ^ { m } b _ {i} u _ {i1} = 0 ,$$

where $\overline{x}\; _ {i}$ are the components of a vector satisfying (12). This property is used in the algorithm. Expressing the $j$- th component of the vector $\overline{\mathbf x}\;$ to which the maximal component of the vector $u _ {1}$ corresponds in terms of the remaining components, one replaces $J ( x)$ by a new function with argument of dimension $( m - 1 )$:

$$\tag{13 } J \left ( \overline{x}\; _ {1} \dots \overline{x}\; _ {i-} 1 ,\right .$$

$$\left . - u _ {j1} ^ {-} 1 \sum _ {\begin{array}{c} i = 1 \\ i \neq j \end{array} } ^ { m } ( u _ {i1} \overline{x}\; _ {i} - \lambda _ {1} ^ {-} 1 b _ {i} u _ {i1} ) , x _ {j+} 1 \dots x _ {m} \right ) =$$

$$= \ \overline{J}\; ( \overline{x}\; _ {1} \dots \overline{x}\; _ {j-} 1 , \overline{x}\; _ {j+} 1 \dots \overline{x}\; _ {m} ) .$$

One then finds a new matrix $\overline{D}\;$ of order $( m - 1 )$ and a new vector $\overline{\mathbf b}\;$ from the function (13) by means of finite-difference approximation. Here it is not so much the reduction of the dimension of the search space that is important, as the reduction of the degree of steepness, so that in the minimization of the new function in the subspace orthogonal to the vector $u _ {1}$, the largest eigen value no longer influences the computation. The requirement that $\overline{D}\;$ and $\overline{\mathbf b}\;$ are obtained in terms of the function (13), and not in terms of $D$ and $\mathbf b$, is a very important point here. The coefficients of relation (12) are found by the power method as coefficients of an arbitrary equation of the system

$$D ^ {k} \mathbf x = \ D ^ {k-} 1 \mathbf b .$$

If the degree of steepness is not reduced or is reduced insignificantly, then the process of elimination of coordinates of the vector $\mathbf x$ is continued recursively until the necessary reduction is obtained.

#### References

 [1] Yu.V. Rakitskii, S.M. Ustinov, I.G. Chernorustskii, "Computational methods for the solution of stiff systems" , Moscow (1979) (In Russian) [2] J. Céa, "Optimisation. Théorie et algorithmes" , Dunod (1971) [3] B.N. Pshenichnyi, Yu.M. Danilin, "Numerical methods in extremal problems" , MIR (1978) (Translated from Russian) [4] B.T. Polyak, "Methods for minimizing functions of several variables" Ekonom. i Mat. Metody , 3 : 6 (1967) pp. 881–902 (In Russian) [5] D.K. Faddeev, V.N. Faddeeva, "Computational methods of linear algebra" , Freeman (1963) (Translated from Russian) [6] D.J. Wilde, "Optimum seeking methods" , Prentice-Hall (1964) [7] I.M. Gel'fand, M.L. Tsetlin, "The principle of nonlocal search in automatic optimization systems" Soviet Phys. Dokl. , 6 (1961) pp. 192–194 Dokl. Akad. Nauk SSSR , 137 : 2 (1961) pp. 295–298 [8] P.E. Gill (ed.) W. Murray (ed.) , Numerical methods for constrained optimization , Acad. Press (1974)