Robust statistics
The branch of mathematical statistics concerned with the construction and investigation of statistical procedures (such as parameter estimators and tests) that still behave well when the usual assumptions are not satisfied. For instance, the observations may not follow the normal (Gaussian) distribution exactly, because in practice the data often contain outliers (aberrant values) which are caused by transcription errors, misplaced decimal points, exceptional phenomena, etc.
It has been known for many years that the classical procedures (such as the sample average) lose their optimality, and can in fact give arbitrarily bad results, when even a small percentage of the data are outlying. In such situations applied statisticians have turned to alternative methods (e.g. the sample median) which they knew to be less affected by possible outlying values. However, the intuitive notion of robustness was not formalized until the 1960s. The theoretical foundations of robust statistics have been developed in the three stages described below.
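As a minimal numerical illustration of this sensitivity (the data here are invented for the example and are not taken from the article), a single gross error is enough to drag the sample average far from the bulk of the data, while the sample median barely moves:

```python
# One gross outlier (e.g. a misplaced decimal point) ruins the mean;
# the median is nearly unaffected.
import statistics

clean = [9.8, 10.1, 10.0, 9.9, 10.2]
contaminated = clean + [1000.0]  # a single transcription error

print(statistics.mean(clean), statistics.median(clean))
# -> 10.0, 10.0
print(statistics.mean(contaminated), statistics.median(contaminated))
# -> 175.0, 10.05
```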
The first mathematical approach is due to P.J. Huber [a1], who found the solution $T^{*}$ to a minimax variational problem:
$$ \sup_{G \in \mathcal{P}_\epsilon} V(T^{*}, G) = \inf_{T} \, \sup_{G \in \mathcal{P}_\epsilon} V(T, G), $$
where $V(T, G)$ denotes the asymptotic variance of an estimator $T$ at a distribution $G$. The set $\mathcal{P}_\epsilon$ (for some $\epsilon > 0$) consists of all mixture distributions $G = (1-\epsilon)\Phi + \epsilon H$, where $\Phi$ is the standard normal distribution and $H$ is arbitrary (e.g., the Cauchy distribution). In the framework of a univariate location parameter, the solution $T^{*}$ has come to be known as the Huber estimator. For a survey of this work see [a2].
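The Huber estimator has no closed form, but it can be computed iteratively. The following sketch shows one common route (iteratively reweighted averaging, with the MAD as a fixed scale estimate and tuning constant $k = 1.345$); these algorithmic choices are illustrative assumptions, not prescribed by [a1] or [a2]:

```python
# A minimal sketch of Huber's M-estimator of location, solved by
# iteratively reweighted averaging. Scale is held fixed at the
# (normal-consistent) MAD; k = 1.345 is a conventional tuning constant.
import numpy as np

def huber_location(x, k=1.345, tol=1e-8, max_iter=100):
    """M-estimate of location with Huber's psi function."""
    x = np.asarray(x, dtype=float)
    t = np.median(x)                                   # robust starting value
    s = np.median(np.abs(x - t)) / 0.6745              # MAD scale estimate
    for _ in range(max_iter):
        u = (x - t) / s
        w = np.minimum(1.0, k / np.maximum(np.abs(u), 1e-12))  # Huber weights
        t_new = np.sum(w * x) / np.sum(w)
        if abs(t_new - t) < tol:
            break
        t = t_new
    return t

rng = np.random.default_rng(0)
# 95 "good" normal observations plus 5 gross outliers near 50
sample = np.concatenate([rng.normal(0.0, 1.0, 95), rng.normal(50.0, 1.0, 5)])
print(sample.mean(), np.median(sample), huber_location(sample))
# mean is pulled towards the outliers; median and Huber estimate stay near 0
```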
The second approach was based on the influence function [a3]
$$ \mathrm{IF}(x; T, \Phi) = \left. \frac{\partial}{\partial \epsilon} \, T\bigl((1-\epsilon)\Phi + \epsilon \Delta_{x}\bigr) \right|_{\epsilon = 0}, $$
where $\Delta_{x}$ is the probability distribution which puts all its mass at the point $x$. The influence function describes the effect of a single outlier at $x$ on (the asymptotic version of) the estimator $T$. In this approach one obtains optimally robust estimators and tests by maximizing the asymptotic efficiency subject to an upper bound on the supremum norm of the influence function. Unlike the minimax variance approach, it can also be applied to multivariate problems. The main results of this methodology are found in [a4].
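A finite-sample analogue of the influence function is the sensitivity curve $SC_n(x) = n\,[\,T(x_1, \dots, x_{n-1}, x) - T(x_1, \dots, x_{n-1})\,]$, which measures the effect of one added observation at $x$. The sketch below (the estimators and the random sample are illustrative choices, not taken from [a3]) contrasts the unbounded influence of the sample mean with the bounded influence of the median:

```python
# A minimal sketch of the sensitivity curve, the finite-sample
# analogue of the influence function.
import numpy as np

def sensitivity_curve(estimator, sample, x):
    """Effect of one added observation at x, scaled by the sample size."""
    n = len(sample) + 1
    return n * (estimator(np.append(sample, x)) - estimator(sample))

rng = np.random.default_rng(1)
base = rng.normal(size=99)
for x in [0.0, 2.0, 10.0, 100.0]:
    print(x,
          sensitivity_curve(np.mean, base, x),    # grows without bound in x
          sensitivity_curve(np.median, base, x))  # stays bounded
```

For the sample mean the curve equals $x$ minus the sample mean, so a single sufficiently extreme observation has arbitrarily large effect, exactly the behaviour a bounded influence function rules out.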
The third approach aims for a high breakdown point. The breakdown point of an estimator is the largest fraction of arbitrary outliers it can tolerate without becoming unbounded. This goes beyond the second approach, where only the effect of a single outlier was considered. In linear regression analysis, the first equivariant high-breakdown estimator was the least median of squares [a5], defined by
$$ \mathop{\textrm{minimize}}_{\widehat{\beta}} \; \mathop{\textrm{median}}_{i = 1, \dots, n} \; r_{i}^{2}, $$
where $\widehat{\beta}$ is the vector of regression coefficients and $r_{i}$ is the residual of the $i$-th observation. High-breakdown estimators also exist for other multivariate situations. A survey of these methods is given in [a6], which also describes applications.
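Exact minimization of the median of squared residuals is combinatorial, so in practice least median of squares is usually approximated by fitting many random subsets of $p$ observations and keeping the fit with the smallest criterion, in the spirit of the resampling algorithm described in [a6]. The sketch below follows that idea; the subset count and the simulated data are illustrative assumptions:

```python
# A minimal sketch of least median of squares (LMS) regression by
# random p-subsets: each subset of p points determines an exact fit,
# and the fit with the smallest median squared residual is kept.
import numpy as np

def lms_regression(X, y, n_subsets=3000, seed=0):
    """Approximate argmin over beta of median_i r_i^2 (X includes intercept)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    best_beta, best_crit = None, np.inf
    for _ in range(n_subsets):
        idx = rng.choice(n, size=p, replace=False)    # exact fit to p points
        try:
            beta = np.linalg.solve(X[idx], y[idx])
        except np.linalg.LinAlgError:
            continue                                  # singular subset, skip
        crit = np.median((y - X @ beta) ** 2)
        if crit < best_crit:
            best_beta, best_crit = beta, crit
    return best_beta

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 0.3, 100)
y[:30] += 20.0                                        # 30% gross outliers
X = np.column_stack([np.ones_like(x), x])
print(np.linalg.lstsq(X, y, rcond=None)[0])           # least squares: distorted
print(lms_regression(X, y))                           # LMS: near the true (2.0, 0.5)
```

Because the criterion only involves the median residual, a majority of well-behaved points suffices to recover the fit, which is what gives the estimator its high breakdown point.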
At present (1990s), research is focused on the construction of estimators that have a high breakdown point as well as a bounded influence function and good asymptotic efficiency. The more recently developed estimators make increasing use of computing resources.
References
[a1] P.J. Huber, "Robust estimation of a location parameter", Ann. Math. Stat., 35 (1964), pp. 73–101.
[a2] P.J. Huber, "Robust statistics", Wiley (1981).
[a3] F.R. Hampel, "The influence curve and its role in robust estimation", J. Amer. Statist. Assoc., 69 (1974), pp. 383–393.
[a4] F.R. Hampel, E.M. Ronchetti, P.J. Rousseeuw, W.A. Stahel, "Robust statistics. The approach based on influence functions", Wiley (1986).
[a5] P.J. Rousseeuw, "Least median of squares regression", J. Amer. Statist. Assoc., 79 (1984), pp. 871–880.
[a6] P.J. Rousseeuw, A.M. Leroy, "Robust regression and outlier detection", Wiley (1987).
Robust statistics. Encyclopedia of Mathematics. URL: http://encyclopediaofmath.org/index.php?title=Robust_statistics&oldid=17575