# Stochastic approximation

A method for solving a class of problems of statistical estimation, in which the new value of the estimator is a modification of an existing estimator, based on new information. The first procedure of stochastic approximation was proposed in 1951 by H. Robbins and S. Monro.

Let every measurement $Y _ {n} ( X _ {n} )$ of a function $R( X)$, $x \in \mathbf R ^ {1}$, at a point $X _ {n}$ contain a random error with mean zero. The Robbins–Monro procedure of stochastic approximation for finding a root of the equation $R( x) = \alpha$ takes the form

$$\tag{1 } X _ {n+} 1 = X _ {n} + a _ {n} ( Y _ {n} ( X _ {n} ) - \alpha ).$$

If $\sum a _ {n} = \infty$, $\sum a _ {n} ^ {2} < \infty$, if $R( x)$ is, for example, an increasing function, if $| R( x) |$ increases no faster than a linear function, and if the random errors are independent, then $X _ {n}$ tends to a root $x _ {0}$ of the equation $R( x) = \alpha$ with probability 1 and in the quadratic mean (see [1], [2]). It is clear from (1) that the process of stochastic approximation is recursive, i.e. a new value of the estimator can be obtained without recourse to the old measurement $Y _ {n}$, and is convenient in cases where the moment at which the estimator is to be represented is not known in advance. The estimator is formed continuously on the basis of observations relating to a given moment. These characteristics also pertain to stochastic approximation with recursive filters, and explain the popularity of stochastic approximation in theoretical and practical applications. The procedure (1) can be directly generalized to the multi-dimensional case.

Another procedure of stochastic approximation, used in finding a maximum point of a regression function $R( x)$, is attributed to J. Kiefer and J. Wolfowitz. Let $Y _ {n} ( x)$ be an observation at the point $x$. The Kiefer–Wolfowitz procedure then takes the form

$$\tag{2 } X _ {n+} 1 - X _ {n} = \ \frac{Y _ {n} ( X _ {n} + C _ {n} ) - Y _ {n} ( X _ {n} - C _ {n} ) }{2C _ {n} } .$$

It has been proved that $X _ {n}$ converges to a maximum point $x _ \max$ of the function $R( x)$ if, for example, $R ^ \prime ( x)( x - x _ \max ) < 0$ when $x \neq x _ \max$, if the regression function and the variance of the random errors do not increase too rapidly when $| x | \rightarrow \infty$, and if the conditions

$$a _ {n} > 0 ,\ \sum C _ {n} = \infty ,\ C _ {n} > 0,$$

$$\sum a _ {n} C _ {n} < \infty ,\ \sum \frac{a _ {n} ^ {2} }{C _ {n} ^ {2} } < \infty ,\ C _ {n} \rightarrow 0$$

are fulfilled. The Kiefer–Wolfowitz procedure of stochastic approximation also permits a multi-dimensional generalization: instead of the right-hand side in (2), an approximate value of the gradient of the function $Y _ {n} ( x)$ has to be substituted.

Procedures of stochastic approximation can naturally be generalized to a continuous observation process. For example, if an observation process is disturbed by a Gaussian white noise, then the analogue of (1) takes the form

$$dX( t) = a( t) dY( t),$$

where

$$dY( t) = R( X( t)) dt + \sigma ( X( t), t) dw( t)$$

is the differential of the process under observation and $w( t)$ is a Wiener process. The conditions of convergence of continuous processes are analogous to those mentioned above for discrete time (see [2]). The basic instrument for proving the convergence of procedures of stochastic approximation is the theorem on the convergence of non-negative supermartingales (see Martingale).

The limit behaviour for an appropriate normalization of the difference $X _ {n} - x _ {0}$ when $n \rightarrow \infty$ has been studied. In (1), let

$$a _ {n} = an ^ {-} 1 ,\ \ Y _ {n} ( X _ {n} ) = R( X _ {n} ) + G( n, X _ {n} , \omega )$$

and let $X _ {n} \rightarrow x _ {0}$ almost certainly when $n \rightarrow \infty$. Given certain restrictions, foremost among which are the requirements

$$R ^ \prime ( x _ {0} ) < 0,\ \ aR ^ \prime ( x _ {0} ) + \frac{1}{2} < 0,$$

$${\mathsf E} G ^ {2} ( n, x, \omega ) \rightarrow S _ {0}$$

where $n \rightarrow \infty$, $x \rightarrow x _ {0}$, the asymptotic normality of the variable $Z _ {n} = \sqrt n ( X _ {n} - x _ {0} )$ with parameters 0, $a ^ {2} S / (- 2aR ^ \prime ( x _ {0} ) - 1)$ has been proved. The least variance of the limit distribution is obtained for $a _ {0} = -[ R ^ \prime ( x _ {0} )] ^ {-} 1$. This choice of $a$ is impossible, since the function $R( x)$ and its derivative are unknown values to be observed. However, in a number of works, adaptive procedures have been constructed in which $a = a( n)$ depends on the observations and approximates $a _ {0}$ when $n \rightarrow \infty$. These procedures possess properties that are asymptotically optimal in the sense of the asymptotic variance.

The results of asymptotic normality are also known in the multi-dimensional case. Let all roots of the matrix

$$A = a \frac{\partial R }{\partial x } ( x _ {0} ) + \frac{1}{2} I$$

have negative real parts ( $I$ is the identity matrix), let

$${\mathsf E} G( n, x, \omega ) G ^ \star ( n, x, \omega ) = \ S( n, x) \rightarrow S _ {0}$$

when $n \rightarrow \infty$, let $x \rightarrow x _ {0}$ and $X _ {n} \rightarrow x _ {0}$ almost certainly, and let certain other not too-restrictive conditions be fulfilled. Then the vector $Z _ {n} = \sqrt n ( X _ {n} - x _ {0} )$ is asymptotically normal with mean zero and with covariance matrix

$$S = a ^ {2} \int\limits _ { 0 } ^ \infty \mathop{\rm exp} ( At) S _ {0} \mathop{\rm exp} ( A ^ \star t) dt.$$

The above result for an asymptotically-optimal Robbins–Monro procedure can also be generalized to the multi-dimensional case. It has been proved that the random process $Z _ {n}$ converges to a Gaussian Markov process in a logarithmic scale. Given certain conditions, the convergence of the moments of the random variable $X _ {n}$ to the moments of the limit law has been proved.

Stochastic approximation-type procedures are convenient in non-parametric situations, since they can be used when a priori information on the regression function is scarce. However, they are also used in estimating the parameter $\theta$ of the density $f( x, \theta )$ through independent observations $X _ {1} \dots X _ {n}$ with this density. Given certain restrictions, the recursive procedure

$$\theta _ {n+} 1 - \theta _ {n} = \ \frac{1}{n} ( I ^ {-} 1 ) ( \theta _ {n} ) \mathop{\rm grad} _ \theta \mathop{\rm ln} f( X _ {n+} 1 ,\ \theta _ {n} )$$

( $I( \theta )$ is the Fisher information matrix of the density $f$) is a consistent and asymptotically-efficient recursive estimator of the parameter $\theta$. The same process is also possible in the case of continuous time.

The behaviour of procedures of stochastic approximation has been studied in the case where the regression function has several zeros (several extremal points), and for different modifications and generalizations of procedures of stochastic approximation.

#### References

 [1] M.T. Wasan, "Stochastic approximation" , Cambridge Univ. Press (1969) [2] M.B. Nevel'son, R.Z. Khas'minskii, "Stochastic approximation and recursive estimation" , Amer. Math. Soc. (1976) (Translated from Russian) [3] Ya.Z. Tsypkin, "Adaption and learning in automatic systems" , Acad. Press (1973) (Translated from Russian) [4] Ya.Z. Tsypkin, "Foundations of the theory of learning systems" , Acad. Press (1973) (Translated from Russian)