Adaptive algorithm
Latest revision as of 15:31, 1 July 2020
The use of adaptive algorithms is now (1998) very widespread across such varied applications as system identification, adaptive control, transmission systems, adaptive filtering for signal processing, and several aspects of pattern recognition and learning. Many examples can be found in the literature [a8], [a1], [a5]; see also, e.g., Adaptive sampling. The aim of an adaptive algorithm is to estimate an unknown time-invariant (or slowly varying) parameter vector, traditionally denoted by $\theta$.
The study of adaptive algorithms is generally made through the theory of stochastic approximation.
Assume that observations $X_n$ are related to the true parameter $\theta ^ { * }$ via an equation
\begin{equation*} \mathsf{E} _ { \theta } [ H ( \theta , X ) ] = 0 , \quad \text { if } \theta = \theta ^ { * }, \end{equation*}
where $H$ is known, but the distribution of $X$ is unknown and may or may not depend on $\theta$ (as indicated by the subscript $\theta$ on the expectation $\mathsf{E}$). In many situations, $H$ is the gradient of a functional to be minimized.
The general structure considered for stochastic algorithms is the following:
\begin{equation*} \theta _ { n } = \theta _ { n - 1 } - \gamma _ { n } H ( \theta _ { n - 1 } , X _ { n } ), \end{equation*}
where $\gamma _ { n }$ is a non-negative non-increasing sequence, typically $1 / n$ or a constant, $X_n$ is the "somehow stationary" observed sequence and $\theta _ { n }$ is at each $n$ an improved estimate of $\theta ^ { * }$.
Note that when $H ( \theta , X ) = \theta - X$, and $\gamma _ { n } = 1 / n$, $\theta _ { n }$ is simply the arithmetic mean of the $X_n$.
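This special case can be checked directly in a few lines. The sketch below, with illustrative data, runs the general recursion $\theta_n = \theta_{n-1} - \gamma_n H(\theta_{n-1}, X_n)$ with $H(\theta, x) = \theta - x$ and $\gamma_n = 1/n$, and confirms it reproduces the arithmetic mean:

```python
# Minimal sketch: the recursion theta_n = theta_{n-1} - gamma_n * H(theta_{n-1}, x_n)
# with H(theta, x) = theta - x and gamma_n = 1/n is exactly the running
# arithmetic mean of the observations.  The data are illustrative.

def sa_step(theta, x, gamma):
    """One step of the general recursion with H(theta, x) = theta - x."""
    return theta - gamma * (theta - x)

xs = [4.0, 2.0, 6.0, 8.0]
theta = 0.0  # theta_0; since gamma_1 = 1, the initial value is forgotten
for n, x in enumerate(xs, start=1):
    theta = sa_step(theta, x, 1.0 / n)

print(theta)              # running estimate after 4 observations: 5.0
print(sum(xs) / len(xs))  # arithmetic mean: also 5.0
```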
The easiest situation is when $X_n$ is a given sequence of independent identically-distributed random variables. However, this is not always the case: the classical Robbins–Monro algorithms correspond to the situation where a parameter $\theta$ (e.g., a dosage of a chemical product) needs to be tuned so that the effect measured by $X$ is at an average given level $\alpha$. In this case $H ( \theta , X ) = X - \alpha$ and $X_n$ is the result of an experiment made with $\theta _ { n - 1} $. The study of stochastic algorithms is generally restricted to those with the Markovian structure introduced in [a1], where $X_n$ is a random variable depending only on $( \theta _ { n - 1} , X _ { n - 1} )$; more precisely, the distribution of $X_n$ conditional on the whole past $( X _ { n - 1 } , \theta _ { n - 1 } , \ldots )$ is given by a transition kernel $P _ { \theta _ { n - 1 } } ( X _ { n - 1 } , d x )$. In this case, the expectation above has to be taken with respect to the stationary distribution of the Markov chain $X_n$ when $\theta$ is fixed.
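The Robbins–Monro setting above can be sketched with a simple simulated experiment. The response model $X = \theta + \text{noise}$ below is an assumption made purely for illustration (so that the optimum is $\theta^* = \alpha$); it is not part of the general theory:

```python
# Hedged sketch of the Robbins-Monro "dosage" example: theta is tuned so that
# the mean response E[X] reaches the target level alpha.  The linear response
# model X = theta + noise is an illustrative assumption; then theta* = alpha.

import random

random.seed(0)

alpha = 3.0  # target average response level

def response(theta):
    """One experiment run at dosage theta (illustrative model)."""
    return theta + random.gauss(0.0, 0.5)

theta = 0.0
for n in range(1, 20001):
    x = response(theta)                      # X_n depends on theta_{n-1}
    theta = theta - (1.0 / n) * (x - alpha)  # H(theta, X) = X - alpha

print(round(theta, 2))  # close to theta* = alpha = 3.0
```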
Decreasing gain.
In the case of decreasing gain, typical results which are proved are the almost-sure convergence of $\theta _ { n }$ to a solution of the equation $h ( \theta ) = 0$ (where $h ( \theta ) = \mathsf{E} _ { \theta } [ H ( \theta , X ) ]$), and, if $\gamma _ { n } = 1 / n$, the asymptotic normality of $\sqrt { n } ( \theta _ { n } - \theta ^ { * } )$ and the law of the iterated logarithm ([a3], [a6], [a1], [a4]). The asymptotic covariance matrix $V$ of $\sqrt { n } ( \theta _ { n } - \theta ^ { * } )$ is the solution of the Lyapunov equation
\begin{equation*} \left( h _ { \theta } ^ { * } - \frac { I } { 2 } \right) V + V \left( h _ { \theta } ^ { * } - \frac { I } { 2 } \right) ^ { T } = R ( \theta ^ { * } ), \end{equation*}
where
\begin{equation*} h _ { \theta } ^ { * } = \nabla h ( \theta ^ { * } ), \end{equation*}
\begin{equation*} R ( \theta ^ { * } ) = \sum _ { n = - \infty } ^ { \infty } \operatorname { cov } ( H ( \theta ^ { * } , X _ { n } ) , H ( \theta ^ { * } , X _ { 0 } ) ). \end{equation*}
The expectation is taken with respect to the distribution of the sequence $( X _ { n } )$ (in the general Markovian situation, one has to consider here the stationary distribution of the chain of transition kernels $P _ { \theta ^ *} ( X _ { n - 1 }, d x )$). In particular, asymptotic normality occurs only if $I / 2 - h _ { \theta } ^ { * }$ has only negative eigenvalues (otherwise the Lyapunov equation above has no solution). This covariance may be improved (i.e., $V$ made smallest) by inserting a matrix gain $\Gamma$:
\begin{equation*} \theta _ { n } = \theta _ { n - 1 } - \gamma _ { n } \Gamma H ( \theta _ { n - 1 } , X _ { n } ), \end{equation*}
and a simple computation shows that the optimal gain is
\begin{equation*} \Gamma ^ { * } = ( h _ { \theta } ^ { * } ) ^ { - 1 }. \end{equation*}
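The Lyapunov equation for $V$ can be solved numerically by vectorization, using $\operatorname{vec}(AV + VA^T) = (I \otimes A + A \otimes I)\operatorname{vec}(V)$. The matrices $h_\theta^*$ and $R(\theta^*)$ below are arbitrary illustrative choices (with eigenvalues of $h_\theta^*$ larger than $1/2$, as the normality condition requires), not values from the article:

```python
# Numerical sketch: solve (h - I/2) V + V (h - I/2)^T = R for the asymptotic
# covariance V via the Kronecker-product trick, then form the optimal gain
# Gamma* = inverse of the Jacobian h.  The matrices h, r are illustrative.

import numpy as np

def solve_lyapunov(a, r):
    """Solve a v + v a.T = r for v by vectorizing the equation."""
    n = a.shape[0]
    eye = np.eye(n)
    m = np.kron(eye, a) + np.kron(a, eye)
    return np.linalg.solve(m, r.reshape(-1)).reshape(n, n)

h = np.array([[1.5, 0.2],
              [0.0, 2.0]])   # Jacobian h_theta^*, eigenvalues > 1/2
r = np.array([[1.0, 0.1],
              [0.1, 0.5]])   # asymptotic covariance R(theta^*)

a = h - 0.5 * np.eye(2)
v = solve_lyapunov(a, r)

print(np.allclose(a @ v + v @ a.T, r))  # residual vanishes: True

gamma_opt = np.linalg.inv(h)            # optimal matrix gain (h_theta^*)^{-1}
```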
This matrix may be estimated during the algorithm, but convergence becomes quite difficult to prove. With this choice of the gain, the Cramér–Rao bound is attained. Another way is to use the Polyak–Ruppert averaging method [a7]:
\begin{equation*} \theta _ { n } = \theta _ { n - 1 } - \gamma _ { n } H ( \theta _ { n - 1 } , X _ { n } ), \end{equation*}
\begin{equation*} \overline { \theta } _ { n } = \overline { \theta } _ { n - 1 } + \frac { 1 } { n } ( \theta _ { n - 1 } - \overline { \theta } _ { n - 1 } ). \end{equation*}
with a "larger" $\gamma _ { n }$ (typically $\gamma _ { n } = n ^ { - 2 / 3 }$). One can prove the asymptotic optimality of this algorithm.
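The two-line averaging scheme above can be sketched on the mean-estimation example: run the raw recursion with the larger gain $\gamma_n = n^{-2/3}$, and keep the running average of the iterates as the final estimate. The choice $H(\theta, x) = \theta - x$ and the data distribution are illustrative assumptions:

```python
# Illustrative sketch of Polyak-Ruppert averaging: a noisy raw iterate driven
# by the larger gain gamma_n = n**(-2/3), averaged with step 1/n as in the
# two equations above.  Mean estimation (H(theta, x) = theta - x) is chosen
# purely for illustration.

import random

random.seed(1)

theta = 0.0      # raw iterate theta_n
theta_bar = 0.0  # Polyak-Ruppert average

for n in range(1, 50001):
    theta_bar += (theta - theta_bar) / n       # average of theta_{n-1}
    x = 5.0 + random.gauss(0.0, 1.0)           # observations with mean 5
    theta -= n ** (-2.0 / 3.0) * (theta - x)   # larger gain gamma_n

print(round(theta_bar, 2))  # close to the true mean 5.0
```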
Constant gain.
Constant-gain algorithms are used for tracking a slowly varying optimal parameter $\theta _ { n } ^ { * }$. The asymptotic behaviour of the algorithm may be studied when the speed of $\theta _ { n } ^ { * }$ and $\gamma$ tend to zero [a1]; this so-called method of diffusion approximation consists in proving that, if $\theta _ { n } ^ { * }$ is fixed, then the trajectory of $\theta _ { n }$, suitably centred and rescaled in space and time, is well approximated by a diffusion. One shows that in the case of a smooth variation of $\theta _ { n } ^ { * }$, the gain has to be tuned approximately proportional to $v ^ { 2 / 3 }$, where $v$ is the speed of variation of $\theta _ { n } ^ { * }$. On the other hand, if $\theta _ { n } ^ { * }$ moves along a random walk, the gain has to be chosen proportional to the average amplitude of $| \theta _ { n + 1 } ^ { * } - \theta _ { n } ^ { * } |$. Other authors study the stationary limit distribution of $\theta _ { n }$ when $\theta _ { n } ^ { * }$ has a given distribution ([a2] and references therein). The direct on-line estimation of a good gain $\gamma$ without a priori knowledge on the variation of $\theta ^ { * }$ remains an important open problem.
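The tracking behaviour can be illustrated with a small simulation: the target $\theta_n^*$ drifts smoothly and a fixed gain $\gamma$ lets the estimate follow it with a bounded lag. The drift model, noise level, and gain value below are illustrative assumptions, not prescriptions from the theory:

```python
# Sketch of constant-gain tracking: theta*_n drifts slowly, and the recursion
# with fixed gain gamma follows it.  Drift speed, noise, and gamma are
# illustrative; the theory above relates the good gamma to the drift speed.

import random

random.seed(2)

gamma = 0.05
theta = 0.0
theta_star = 0.0
errors = []

for n in range(1, 20001):
    theta_star += 0.001                         # slow smooth drift of target
    x = theta_star + random.gauss(0.0, 0.2)     # noisy observation of target
    theta -= gamma * (theta - x)                # H(theta, x) = theta - x
    if n > 1000:                                # skip the initial transient
        errors.append(abs(theta - theta_star))

print(round(sum(errors) / len(errors), 3))      # small steady tracking error
```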
References
[a1] | A. Benveniste, M. Métivier, P. Priouret, "Adaptive Algorithms and Stochastic Approximations" , Springer (1990) |
[a2] | B. Delyon, A. Juditsky, "Asymptotical study of parameter tracking algorithms" SIAM J. Control and Optimization , 33 : 1 (1995) pp. 323–345 |
[a3] | P. Hall, C.C. Heyde, "Martingale limit theory and its applications" , Acad. Press (1980) |
[a4] | H.J. Kushner, D.S. Clark, "Stochastic Approximation Methods for Constrained and Unconstrained Systems" , Springer (1978) |
[a5] | L. Ljung, T. Söderström, "Theory and practice of recursive identification" , MIT (1983) |
[a6] | B.M. Nevel'son, R.Z. Khas'minskii, "Stochastic approximation and recursive estimation" , Transl. Math. Monogr. , 47 , Amer. Math. Soc. (1976) |
[a7] | B.T. Polyak, "New stochastic approximation type procedures" Autom. Remote Contr. , 51 : 7 (1990) pp. 937–946 |
[a8] | G.N. Saridis, "Stochastic approximation methods for identification and control — a survey" IEEE-AC , 19 : 6 (1974) |
Adaptive algorithm. Encyclopedia of Mathematics. URL: http://encyclopediaofmath.org/index.php?title=Adaptive_algorithm&oldid=49937