Stochastic approximation

A method for solving a class of problems of statistical estimation, in which the new value of the estimator is a modification of an existing estimator, based on new information. The first procedure of stochastic approximation was proposed in 1951 by H. Robbins and S. Monro.

Let every measurement $Y_{n}(X_{n})$ of a function $R(x)$, $x \in \mathbf R^{1}$, at a point $X_{n}$ contain a random error with mean zero. The Robbins–Monro procedure of stochastic approximation for finding a root of the equation $R(x) = \alpha$ takes the form

$$ \tag{1} X_{n+1} = X_{n} + a_{n} ( Y_{n}(X_{n}) - \alpha ). $$

If $\sum a_{n} = \infty$, $\sum a_{n}^{2} < \infty$, if $R(x)$ is, for example, a decreasing function, if $|R(x)|$ increases no faster than a linear function, and if the random errors are independent, then $X_{n}$ tends to a root $x_{0}$ of the equation $R(x) = \alpha$ with probability 1 and in quadratic mean (see [1], [2]). It is clear from (1) that the process of stochastic approximation is recursive: the new value of the estimator is obtained from the current value and the new measurement alone, without recourse to the old measurements. This makes the procedure convenient in cases where the moment at which the estimator has to be presented is not known in advance: the estimator is formed continuously on the basis of all observations up to the given moment. These characteristics, shared also by recursive filters, explain the popularity of stochastic approximation in theoretical and practical applications. The procedure (1) can be directly generalized to the multi-dimensional case.
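
The following minimal numerical sketch illustrates procedure (1). The particular regression function $R(x) = 2 - x$ (a decreasing function, so that the root of $R(x) = \alpha$ with $\alpha = 0.5$ is $x_{0} = 1.5$), the Gaussian measurement errors and the gains $a_{n} = 1/n$ are assumptions made only for this example.

<pre>
import numpy as np

rng = np.random.default_rng(0)

def R(x):
    # assumed decreasing regression function, used only for this example
    return 2.0 - x

alpha = 0.5        # the root of R(x) = alpha is x_0 = 1.5
x = 0.0            # arbitrary starting point X_1
N = 100_000

for n in range(1, N + 1):
    a_n = 1.0 / n                       # sum a_n = infinity, sum a_n^2 finite
    y_n = R(x) + rng.normal(0.0, 1.0)   # measurement Y_n(X_n) with mean-zero error
    x = x + a_n * (y_n - alpha)         # the Robbins-Monro update (1)

print(x)  # close to the root x_0 = 1.5
</pre>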

Another procedure of stochastic approximation, used in finding a maximum point of a regression function $ R( x) $, is attributed to J. Kiefer and J. Wolfowitz. Let $ Y _ {n} ( x) $ be an observation at the point $ x $. The Kiefer–Wolfowitz procedure then takes the form

$$ \tag{2} X_{n+1} - X_{n} = a_{n}\, \frac{Y_{n}(X_{n} + C_{n}) - Y_{n}(X_{n} - C_{n})}{2C_{n}}. $$

It has been proved that $ X _ {n} $ converges to a maximum point $ x _ \max $ of the function $ R( x) $ if, for example, $ R ^ \prime ( x)( x - x _ \max ) < 0 $ when $ x \neq x _ \max $, if the regression function and the variance of the random errors do not increase too rapidly when $ | x | \rightarrow \infty $, and if the conditions

$$ a_{n} > 0,\quad \sum a_{n} = \infty,\quad C_{n} > 0, $$

$$ \sum a_{n} C_{n} < \infty,\quad \sum \frac{a_{n}^{2}}{C_{n}^{2}} < \infty,\quad C_{n} \rightarrow 0 $$

are fulfilled. The Kiefer–Wolfowitz procedure of stochastic approximation also permits a multi-dimensional generalization: instead of the right-hand side in (2), an approximate value of the gradient of the function $ Y _ {n} ( x) $ has to be substituted.
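
A similar sketch of the Kiefer–Wolfowitz procedure (2); the regression function $R(x) = -(x-1)^{2}$ (maximum point $x_{\max} = 1$), the noise level and the sequences $a_{n} = 1/n$, $C_{n} = n^{-1/3}$ (which satisfy the conditions above) are again assumptions chosen only for illustration.

<pre>
import numpy as np

rng = np.random.default_rng(1)

def Y(x):
    # noisy observation of the assumed regression function R(x) = -(x - 1)^2,
    # whose maximum point is x_max = 1
    return -(x - 1.0) ** 2 + rng.normal(0.0, 0.5)

x = 3.0            # starting point
N = 200_000

for n in range(1, N + 1):
    a_n = 1.0 / n           # sum a_n = infinity
    c_n = n ** (-1.0 / 3)   # C_n -> 0, sum a_n C_n finite, sum a_n^2 / C_n^2 finite
    # Kiefer-Wolfowitz update (2): finite-difference estimate of the slope of R at X_n
    x = x + a_n * (Y(x + c_n) - Y(x - c_n)) / (2.0 * c_n)

print(x)  # close to the maximum point x_max = 1
</pre>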

Procedures of stochastic approximation can naturally be generalized to a continuous observation process. For example, if the observation process is disturbed by Gaussian white noise, then the analogue of (1) takes the form

$$ dX(t) = a(t)\, dY(t), $$

where

$$ dY(t) = R(X(t))\, dt + \sigma(X(t), t)\, dw(t) $$

is the differential of the observed process and $w(t)$ is a Wiener process. The conditions for convergence of the continuous procedures are analogous to those given above for discrete time (see [2]). The basic tool for proving the convergence of procedures of stochastic approximation is the theorem on the convergence of non-negative supermartingales (see Martingale).
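
A rough simulation of the continuous-time analogue can be obtained by an Euler–Maruyama discretization of $dX(t) = a(t)\, dY(t)$; the choices $R(x) = -x$ (root $x_{0} = 0$), $\sigma \equiv 1$ and $a(t) = 1/(1+t)$ below are assumptions made only for this sketch.

<pre>
import numpy as np

rng = np.random.default_rng(2)

# Euler-Maruyama discretization of dX(t) = a(t) dY(t), with
# dY(t) = R(X(t)) dt + sigma dw(t).  The choices R(x) = -x (root x_0 = 0),
# sigma = 1 and a(t) = 1/(1 + t) are assumptions for this sketch.
dt = 0.01
T = 1_000.0
x = 5.0
t = 0.0
while t < T:
    a_t = 1.0 / (1.0 + t)
    dy = -x * dt + np.sqrt(dt) * rng.normal()   # increment of the observed process
    x += a_t * dy                               # dX = a(t) dY
    t += dt

print(x)  # close to the root x_0 = 0
</pre>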

The limit behaviour of the suitably normalized difference $X_{n} - x_{0}$ as $n \rightarrow \infty$ has been studied. In (1), let

$$ a_{n} = a n^{-1}, \qquad Y_{n}(X_{n}) = R(X_{n}) + G(n, X_{n}, \omega) $$

and let $X_{n} \rightarrow x_{0}$ almost surely as $n \rightarrow \infty$. Under certain restrictions, foremost among which are the requirements

$$ R^{\prime}(x_{0}) < 0, \qquad aR^{\prime}(x_{0}) + \frac{1}{2} < 0, $$

$$ {\mathsf E} G^{2}(n, x, \omega) \rightarrow S_{0} $$

(where $n \rightarrow \infty$, $x \rightarrow x_{0}$), the asymptotic normality of the variable $Z_{n} = \sqrt{n}\,(X_{n} - x_{0})$ with parameters $0$ and $a^{2}S_{0}/(-2aR^{\prime}(x_{0}) - 1)$ has been proved. The least variance of the limit distribution is obtained for $a_{0} = -[R^{\prime}(x_{0})]^{-1}$. This choice of $a$ is not possible in practice, since the function $R(x)$ and its derivative are unknown and must themselves be estimated from the observations. However, in a number of works adaptive procedures have been constructed in which $a = a(n)$ depends on the observations and approximates $a_{0}$ as $n \rightarrow \infty$. These procedures are asymptotically optimal in the sense of the asymptotic variance.
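
The asymptotic normality statement can be checked by simulation. In the sketch below the linear example $R(x) = -x$ (so $x_{0} = 0$, $R^{\prime}(x_{0}) = -1$), noise variance $S_{0} = 1$, $\alpha = 0$ and $a = 1$ are assumptions; for these choices the predicted limit variance $a^{2}S_{0}/(-2aR^{\prime}(x_{0}) - 1)$ equals $1$.

<pre>
import numpy as np

rng = np.random.default_rng(3)

# Monte Carlo check of the limit distribution of Z_n = sqrt(n) (X_n - x_0) for
# the assumed linear example R(x) = -x (x_0 = 0, R'(x_0) = -1), alpha = 0,
# noise variance S_0 = 1 and gains a_n = a / n with a = 1.  The predicted limit
# variance a^2 S_0 / (-2 a R'(x_0) - 1) equals 1 for these choices.
a, n_steps, n_runs = 1.0, 2_000, 1_000
z = np.zeros(n_runs)
for r in range(n_runs):
    x = 1.0
    for n in range(1, n_steps + 1):
        y = -x + rng.normal()       # Y_n(X_n) = R(X_n) + G(n, X_n, omega)
        x = x + (a / n) * y         # procedure (1) with alpha = 0
    z[r] = np.sqrt(n_steps) * x     # normalized deviation from x_0 = 0

print(np.var(z))  # close to the predicted limit variance 1
</pre>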

Results on asymptotic normality are also known in the multi-dimensional case. Let all eigenvalues of the matrix

$$ A = a \frac{\partial R}{\partial x}(x_{0}) + \frac{1}{2} I $$

have negative real parts ( $ I $ is the identity matrix), let

$$ {\mathsf E} G(n, x, \omega) G^{\star}(n, x, \omega) = S(n, x) \rightarrow S_{0} $$

as $n \rightarrow \infty$ and $x \rightarrow x_{0}$, let $X_{n} \rightarrow x_{0}$ almost surely, and let certain other, not too restrictive, conditions be fulfilled. Then the vector $Z_{n} = \sqrt{n}\,(X_{n} - x_{0})$ is asymptotically normal with mean zero and covariance matrix

$$ S = a^{2} \int_{0}^{\infty} \exp(At)\, S_{0} \exp(A^{\star} t)\, dt. $$
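
Since $A$ has eigenvalues with negative real parts, this integral is also the unique solution of the Lyapunov equation $AS + SA^{\star} = -a^{2}S_{0}$, which is how it is usually computed. The sketch below checks this on an assumed $2 \times 2$ example (the matrices $A$ and $S_{0}$ are illustrative choices only).

<pre>
import numpy as np
from scipy.linalg import expm, solve_continuous_lyapunov

a = 1.0
A = np.array([[-1.0, 0.3],
              [0.0, -2.0]])     # eigenvalues -1 and -2: negative real parts
S0 = np.array([[1.0, 0.2],
               [0.2, 0.5]])     # assumed limit covariance of the noise

# S = a^2 int_0^infinity exp(At) S0 exp(A^T t) dt solves A S + S A^T = -a^2 S0,
# so it can be obtained from a Lyapunov solver without quadrature.
S = solve_continuous_lyapunov(A, -a**2 * S0)

# Sanity check: midpoint-rule quadrature of the defining integral.
dt = 0.01
S_quad = a**2 * sum(expm(A * (t + dt / 2)) @ S0 @ expm(A.T * (t + dt / 2)) * dt
                    for t in np.arange(0.0, 20.0, dt))

print(np.round(S, 3))
print(np.round(S_quad, 3))  # agrees with S up to the quadrature error
</pre>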

The above result for an asymptotically-optimal Robbins–Monro procedure can also be generalized to the multi-dimensional case. It has been proved that the random process $ Z _ {n} $ converges to a Gaussian Markov process in a logarithmic scale. Given certain conditions, the convergence of the moments of the random variable $ X _ {n} $ to the moments of the limit law has been proved.

Stochastic approximation-type procedures are convenient in non-parametric situations, since they can be used when a priori information on the regression function is scarce. However, they are also used in estimating the parameter $\theta$ of the density $f(x, \theta)$ from independent observations $X_{1}, \dots, X_{n}$ with this density. Given certain restrictions, the recursive procedure

$$ \theta_{n+1} - \theta_{n} = \frac{1}{n}\, I^{-1}(\theta_{n}) \mathop{\rm grad}_{\theta} \ln f(X_{n+1}, \theta_{n}) $$

($I(\theta)$ is the Fisher information matrix of the density $f$) defines a consistent and asymptotically-efficient recursive estimator of the parameter $\theta$. The same approach is also possible in the case of continuous time.
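
A minimal sketch of this recursive parametric procedure for an assumed one-parameter model, the $N(\theta, 1)$ density, for which $\mathop{\rm grad}_{\theta} \ln f(x, \theta) = x - \theta$ and $I(\theta) = 1$; the true value $\theta = 2$ is likewise an assumption of the example.

<pre>
import numpy as np

rng = np.random.default_rng(4)

# Recursive estimation of theta for the assumed N(theta, 1) model:
# grad_theta ln f(x, theta) = x - theta and the Fisher information I(theta) = 1.
theta_true = 2.0     # assumed true parameter value
theta = 0.0          # initial estimate theta_1
N = 100_000

for n in range(1, N + 1):
    x_next = rng.normal(theta_true, 1.0)      # new observation X_{n+1}
    score = x_next - theta                    # grad_theta ln f(X_{n+1}, theta_n)
    theta = theta + (1.0 / n) * 1.0 * score   # I(theta)^{-1} = 1 for this model

print(theta)  # close to theta_true = 2; the error variance is of order 1/(n I(theta))
</pre>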

The behaviour of procedures of stochastic approximation has been studied in the case where the regression function has several zeros (several extremal points), as well as for various modifications and generalizations of these procedures.

References

[1] M.T. Wasan, "Stochastic approximation", Cambridge Univ. Press (1969)
[2] M.B. Nevel'son, R.Z. Khas'minskii, "Stochastic approximation and recursive estimation", Amer. Math. Soc. (1976) (Translated from Russian)
[3] Ya.Z. Tsypkin, "Adaptation and learning in automatic systems", Acad. Press (1973) (Translated from Russian)
[4] Ya.Z. Tsypkin, "Foundations of the theory of learning systems", Acad. Press (1973) (Translated from Russian)

Comments

Convergence properties of stochastic approximation and other recursive algorithms have been the subject of much research. One approach is the "ordinary differential equations" method ([a1], [a2]), which is based on interpreting suitably rescaled versions of (1) and (2) as Euler approximations to the solution of an ordinary or stochastic differential equation. Asymptotic properties of the algorithm can then be related to stability properties of the corresponding ordinary or stochastic differential equation. Methods based on compactness and weak convergence have also been introduced.
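
A small sketch of the heuristic behind the ODE method: with $a_{n} = 1/n$, the iterates of (1), viewed on the rescaled time $t_{n} = a_{1} + \dots + a_{n}$, are compared with an Euler solution of the limiting equation $\dot{x} = R(x) - \alpha$ on the same grid. The function $R$, the noise and the gains are assumptions for illustration.

<pre>
import numpy as np

rng = np.random.default_rng(5)

# ODE-method heuristic for procedure (1): with a_n = 1/n the iterates, viewed on
# the rescaled time t_n = a_1 + ... + a_n, shadow the ODE dx/dt = R(x) - alpha.
# R(x) = 2 - x and alpha = 0.5 (root x_0 = 1.5) are assumed for illustration.
def R(x):
    return 2.0 - x

alpha = 0.5
x_sa = 5.0      # stochastic approximation iterate X_n
x_ode = 5.0     # Euler approximation of the ODE on the same (rescaled) time grid
N = 50_000

for n in range(1, N + 1):
    a_n = 1.0 / n
    x_sa += a_n * (R(x_sa) + rng.normal() - alpha)   # noisy update (1)
    x_ode += a_n * (R(x_ode) - alpha)                # Euler step of the limiting ODE

print(x_sa, x_ode)  # both end up close to the root x_0 = 1.5
</pre>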

References

[a1] H.J. Kushner, D.S. Clark, "Stochastic approximation methods for constrained and unconstrained systems", Springer (1978)
[a2] L. Ljung, "Analysis of recursive stochastic algorithms", IEEE Trans. Autom. Control, AC-22 (1977), pp. 551–575
How to Cite This Entry:
Stochastic approximation. Encyclopedia of Mathematics. URL: http://encyclopediaofmath.org/index.php?title=Stochastic_approximation&oldid=17572
This article was adapted from an original article by R.Z. Khas'minskii (originator), which appeared in Encyclopedia of Mathematics - ISBN 1402006098. See original article