Regression analysis


A branch of mathematical statistics that unifies various practical methods for investigating dependence between variables using statistical data (see Regression). The problem of regression in mathematical statistics is characterized by the fact that there is insufficient information about the distributions of the variables under consideration. Suppose, for example, that there are reasons for assuming that a random variable $ Y $ has a given probability distribution at a fixed value $ x $ of another variable, so that

$$ {\mathsf E} ( Y \mid x ) = g ( x , \beta ) , $$

where $ \beta $ is a set of unknown parameters determining the function $ g ( x , \beta ) $, and that it is required to determine the values of these parameters from the results of observations. Depending on the nature of the problem and the aims of the analysis, the results of an experiment $ ( x _ {1} , y _ {1} ) , \dots, ( x _ {n} , y _ {n} ) $ are interpreted in different ways in relation to the variable $ x $. To ascertain the connection between the variables in the experiment, one often uses a model based on simplified assumptions: $ x $ is a controllable variable, whose values are given in advance for the design of the experiment, and the observed value $ y $ can be written in the form

$$ y _ {i} = g ( x _ {i} , \beta ) + \epsilon _ {i} ,\ \ i = 1 , \dots, n , $$

where the variables $ \epsilon _ {i} $ characterize the errors, which are independent for different measurements and identically distributed with mean zero and constant variance. In the case of an uncontrollable variable, the results of the observations $ ( x _ {1} , y _ {1} ) , \dots, ( x _ {n} , y _ {n} ) $ represent a sample from a certain two-dimensional population. The methods of regression analysis are the same in both cases, although the interpretations of the results differ (in the latter case, the analysis is substantially supplemented by methods from the theory of correlation (in statistics)).
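As a minimal numerical sketch of the controllable-variable scheme just described (the linear form of $ g $, the sample size and the error variance are arbitrary choices made for this illustration, not part of the article):

    # Sketch: simulate y_i = g(x_i, beta) + eps_i for a controllable variable x,
    # with g(x, beta) = beta_0 + beta_1 x; all numbers are purely illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    beta0, beta1, sigma = 1.0, 2.0, 0.5        # assumed "true" parameters
    x = np.linspace(0.0, 1.0, 20)              # values fixed in advance by the design
    eps = rng.normal(0.0, sigma, size=x.size)  # independent errors, mean 0, constant variance
    y = beta0 + beta1 * x + eps                # observed values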

The study of regression for experimental data is carried out using methods based on the principles of mean-square regression. Regression analysis solves the following fundamental problems: 1) the choice of a regression model, that is, of assumptions about the dependence of the regression function on $ x $ and $ \beta $; 2) the estimation of the parameters $ \beta $ in the selected model, for example by the method of least squares; and 3) the testing of statistical hypotheses about the regression.

From the point of view of a unified method for estimating the unknown parameters, the most natural choice is a regression model that is linear in these parameters:

$$ g ( x , \beta ) = \beta _ {0} g _ {0} ( x) + \dots + \beta _ {m} g _ {m} ( x) . $$

The choice of the functions $ g _ {i} ( x) $ is sometimes arrived at by plotting the experimental values $ ( x , y ) $ on a scattergram or, more often, from theoretical considerations. It is assumed here that the variance $ \sigma ^ {2} $ of the results of the observations is constant (or proportional to a known function of $ x $). The standard method of regression estimation is based on the use of a polynomial of some degree $ m $, $ 1 \leq m < n $:

$$ g ( x , \beta ) = \beta _ {0} + \beta _ {1} x + \dots + \beta _ {m} x ^ {m} , $$

or, in the simplest case, of a linear function (linear regression)

$$ g ( x , \beta ) = \beta _ {0} + \beta _ {1} x . $$

There are criteria for testing linearity and for choosing the degree of the approximating polynomial.
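One common criterion of this kind is an $ F $-test comparing nested polynomial fits; the article does not single out a particular criterion, so the following is only an illustrative sketch of that choice:

    # Sketch: compare a degree-1 and a degree-2 polynomial fit by an F-test.
    # The criterion is an illustrative choice, not one prescribed by the article.
    import numpy as np
    from scipy import stats

    def rss(x, y, degree):
        """Residual sum of squares of a least-squares polynomial fit."""
        coeffs = np.polyfit(x, y, degree)
        return float(np.sum((y - np.polyval(coeffs, x)) ** 2))

    def f_test_degree(x, y, low=1, high=2):
        n = len(y)
        rss_low, rss_high = rss(x, y, low), rss(x, y, high)
        df1, df2 = high - low, n - (high + 1)
        f = ((rss_low - rss_high) / df1) / (rss_high / df2)
        p_value = stats.f.sf(f, df1, df2)
        return f, p_value  # a small p-value favours the higher-degree model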

According to the principles of mean-square regression, the estimation of the unknown regression coefficients $ \beta _ {0} , \dots, \beta _ {m} $ (cf. Regression coefficient) and the variance $ \sigma ^ {2} $ (cf. Dispersion) is realized by the method of least squares. Thus, as statistical estimators of $ \beta _ {0} , \dots, \beta _ {m} $ one chooses values $ \widehat \beta _ {0} , \dots, \widehat \beta _ {m} $ which minimize the expression

$$ \sum_{i=1} ^ { n } ( y _ {i} - g ( x _ {i} ) ) ^ {2} . $$
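A minimal computational sketch of this minimization for the polynomial model (the data arrays and the degree are placeholders to be supplied by the user):

    # Sketch: minimize sum_i (y_i - g(x_i))^2 for g(x, beta) = beta_0 + ... + beta_m x^m
    # by solving the least-squares problem for the corresponding design matrix.
    import numpy as np

    def fit_polynomial(x, y, m):
        """Return the least-squares estimates (beta_0, ..., beta_m)."""
        X = np.vander(x, N=m + 1, increasing=True)   # columns 1, x, ..., x^m
        beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        return beta_hat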

The polynomial $ \widehat{g} ( x) = \widehat \beta _ {0} + \widehat \beta _ {1} x + \dots + \widehat \beta _ {m} x ^ {m} $ thus obtained is called the empirical regression curve, and is a statistical estimator of the unknown true regression curve. Assuming linearity of regression, the equation of the empirical regression curve has the form

$$ \widehat{g} ( x) = \widehat \beta _ {0} + \widehat \beta _ {1} x , $$

where

$$ \widehat \beta _ {0} = \overline{y} - \widehat \beta _ {1} \overline{x} ,\ \ \widehat \beta _ {1} = \frac{\sum _ { i } ( x _ {i} - \overline{x} ) ( y _ {i} - \overline{y} ) }{\sum _ { i } ( x _ {i} - \overline{x} ) ^ {2} } , $$

$$ \overline{x} = \frac{1}{n} \sum _ { i } x _ {i} ,\ \ \overline{y} = \frac{1}{n} \sum _ { i } y _ {i} . $$
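These expressions follow from the minimization itself: setting the partial derivatives of $ \sum _ { i } ( y _ {i} - \beta _ {0} - \beta _ {1} x _ {i} ) ^ {2} $ with respect to $ \beta _ {0} $ and $ \beta _ {1} $ equal to zero gives the normal equations

$$ \sum _ { i } ( y _ {i} - \widehat \beta _ {0} - \widehat \beta _ {1} x _ {i} ) = 0 ,\ \ \sum _ { i } x _ {i} ( y _ {i} - \widehat \beta _ {0} - \widehat \beta _ {1} x _ {i} ) = 0 , $$

whose solution is the pair $ ( \widehat \beta _ {0} , \widehat \beta _ {1} ) $ written above.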

The random variables $ \widehat \beta _ {0} , \dots, \widehat \beta _ {m} $ are called sample regression coefficients (or estimated regression coefficients). An unbiased estimator of the parameter $ \sigma ^ {2} $ is given by

$$ s ^ {2} = \frac{\sum_{i=1}^ { n } ( y _ {i} - \widehat{g} ( x _ {i} ) ) ^ {2} }{n - m - 1 } . $$

If the variance depends on $ x $, the method of least squares is applicable with certain modifications.

If one studies the dependence of a random variable $ Y $ on several variables $ x _ {1} , \dots, x _ {k} $, then it is more convenient to write the general linear regression model in matrix form: an observation vector $ y $ with independent components $ y _ {1} , \dots, y _ {n} $ has mean value and covariance matrix given by

$$ \tag{* } {\mathsf E} ( Y \mid x _ {1} , \dots, x _ {k} ) = X \beta ,\ \ {\mathsf D} ( Y \mid x _ {1} , \dots, x _ {k} ) = \sigma ^ {2} I , $$

where $ \beta = ( \beta _ {1} , \dots, \beta _ {k} ) $ is the vector of regression coefficients, $ X = \| x _ {ij} \| $, $ i = 1 , \dots, n $, $ j = 1 , \dots, k $, is a matrix of known variables which, generally speaking, may be related to each other in an arbitrary fashion, and $ I $ is the identity matrix of order $ n $; moreover, $ n > k $ and $ | X ^ {T} X | \neq 0 $. More generally, one can assume that there is correlation between the observations $ y _ {i} $:

$$ {\mathsf E} ( Y \mid x _ {1} , \dots, x _ {k} ) = X \beta ,\ \ {\mathsf D} ( Y \mid x _ {1} , \dots, x _ {k} ) = \sigma ^ {2} A , $$

for some known matrix $ A $; this scheme can be reduced to the model (*), as shown below.
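Indeed, assuming (as one may for a covariance matrix of full rank) that $ A $ is positive definite, write $ A = C C ^ {T} $ and pass to the transformed observations

$$ y ^ \prime = C ^ {-1} y ,\ \ X ^ \prime = C ^ {-1} X , $$

for which $ {\mathsf E} ( Y ^ \prime \mid x _ {1} , \dots, x _ {k} ) = X ^ \prime \beta $ and $ {\mathsf D} ( Y ^ \prime \mid x _ {1} , \dots, x _ {k} ) = \sigma ^ {2} I $, that is, the scheme (*); applying least squares to $ ( X ^ \prime , y ^ \prime ) $ yields the generalized least-squares estimator $ \widehat \beta = ( X ^ {T} A ^ {-1} X ) ^ {-1} X ^ {T} A ^ {-1} y $.

An unbiased estimator for $ \beta $ by the method of least squares is given by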

$$ \widehat \beta = ( X ^ {T} X ) ^ {-1} X ^ {T} y , $$

and an unbiased estimator for $ \sigma ^ {2} $ is given by

$$ s ^ {2} = \frac{1}{n-k} ( y ^ {T} y - \widehat \beta {} ^ {T} X ^ {T} y ) . $$
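A compact numerical sketch of these two estimators in the matrix model (*) (the design matrix and observation vector are placeholders to be supplied by the user):

    # Sketch: least-squares estimation in the matrix model (*).
    # X is an (n, k) design matrix with |X^T X| != 0, y an observation vector of length n.
    import numpy as np

    def ols(X, y):
        n, k = X.shape
        XtX_inv = np.linalg.inv(X.T @ X)
        beta_hat = XtX_inv @ X.T @ y                      # (X^T X)^{-1} X^T y
        s2 = float(y @ y - beta_hat @ X.T @ y) / (n - k)  # unbiased estimator of sigma^2
        cov_beta_hat = s2 * XtX_inv                       # estimate of D(beta_hat) = sigma^2 (X^T X)^{-1}
        return beta_hat, s2, cov_beta_hat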

Model (*) is the most general linear model, in that it is applicable to various regression situations and encompasses all forms of polynomial regression of $ Y $ with respect to $ x _ {1} , \dots, x _ {k} $ (in particular, the above polynomial regression of $ Y $ with respect to $ x $ of order $ m $ can be reduced to the model (*), in which $ m $ of the regression variables are functionally connected). In this linear interpretation of regression analysis, the problem of estimating $ \beta $ and of calculating the covariance matrix of the estimators, $ {\mathsf D} \widehat \beta = \sigma ^ {2} ( X ^ {T} X ) ^ {-1} $, reduces to the problem of inverting the matrix $ X ^ {T} X $.
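To make the parenthetical remark concrete, polynomial regression of $ Y $ with respect to a single variable $ x $ of order $ m $ corresponds to the scheme (*) with $ k = m + 1 $ and the design matrix

$$ X = \begin{pmatrix} 1 & x _ {1} & \cdots & x _ {1} ^ {m} \\ \vdots & \vdots & & \vdots \\ 1 & x _ {n} & \cdots & x _ {n} ^ {m} \end{pmatrix} , $$

whose columns, being powers of the same variable, are functionally connected.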

The above method for constructing an empirical regression assuming a normal distribution of the results of the observations leads to estimators for $ \beta $ and $ \sigma ^ {2} $ that coincide with the maximum-likelihood estimators. However, the estimators obtained by this method are, in a certain sense, also optimal in the case of deviation from normality, provided only that the sample size is sufficiently large.

In the given matrix form, the general linear regression model (*) admits a simple extension to the case when the observed variables $ y _ {i} $ are random vector variables. This does not give rise to any new statistical problem (see Regression matrix).

The problems of regression analysis are not restricted to the construction of point estimators of the parameters $ \beta $ and $ \sigma ^ {2} $ in the general linear model (*). The problem of the accuracy of a constructed empirical relation is most effectively solved under the assumption that the observation vector $ y $ is normally distributed. If $ y $ is normally distributed, then, since the estimator $ \widehat \beta $ is a linear function of $ y $, each variable $ \widehat \beta _ {i} $ is normally distributed with mean $ \beta _ {i} $ and variance $ {\mathsf D} \widehat \beta _ {i} = \sigma ^ {2} b _ {ii} $, where $ b _ {ii} $ is the $ i $-th diagonal entry of the matrix $ ( X ^ {T} X ) ^ {-1} $. Apart from this, the estimator $ s ^ {2} $ for $ \sigma ^ {2} $ is distributed independently of every component of $ \widehat \beta $, and the variable $ ( n - k ) s ^ {2} / \sigma ^ {2} $ has a "chi-squared" distribution with $ n - k $ degrees of freedom. Hence the statistic

$$ t = \frac{( \widehat \beta _ {i} - \beta _ {i} ) }{[ s ^ {2} b _ {ii} ] ^ {1/2} } $$

has the Student distribution with $ n - k $ degrees of freedom. This fact is used to construct confidence intervals for the parameters $ \beta _ {i} $ and to test hypotheses about the values taken by them. One can also find confidence intervals for $ {\mathsf E} ( Y \mid x _ {1} , \dots, x _ {k} ) $ for fixed values of all the regression variables, and confidence intervals containing the next, $ ( n + 1 ) $-th, value of $ y $ (called prediction intervals). Finally, starting from the vector of sample regression coefficients $ \widehat \beta $ one can construct a confidence ellipsoid for $ \beta $, or for any set of unknown regression coefficients, and also a confidence region for the entire regression curve.
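A numerical sketch of such a confidence interval for a single coefficient (the confidence level is an arbitrary choice for the illustration):

    # Sketch: (1 - alpha) confidence interval for beta_i in the model (*),
    # based on the Student distribution with n - k degrees of freedom.
    import numpy as np
    from scipy import stats

    def beta_confidence_interval(X, y, i, alpha=0.05):
        n, k = X.shape
        XtX_inv = np.linalg.inv(X.T @ X)
        beta_hat = XtX_inv @ X.T @ y
        s2 = float(y @ y - beta_hat @ X.T @ y) / (n - k)
        half_width = stats.t.ppf(1 - alpha / 2, n - k) * np.sqrt(s2 * XtX_inv[i, i])
        return beta_hat[i] - half_width, beta_hat[i] + half_width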

Regression analysis is one of the most widely used methods for processing experimental data when investigating relations in physics, biology, economics, technology, and other fields. Such branches of mathematical statistics as dispersion analysis and the design of experiments are based on models of regression analysis, and these models are widely used in multi-dimensional statistical analysis.

References

[1] M.G. Kendall, A. Stuart, "The advanced theory of statistics", Vol. 2: Inference and relationship, Griffin (1979)
[2] N.V. Smirnov, I.V. Dunin-Barkovskii, "Mathematische Statistik in der Technik", Deutsch. Verlag Wissenschaft. (1969) (Translated from Russian)
[3] S.A. Aivazyan, "Statistical research on dependence", Moscow (1968) (In Russian)
[4] C.R. Rao, "Linear statistical inference and its applications", Wiley (1965)
[5] N.R. Draper, H. Smith, "Applied regression analysis", Wiley (1981)

Comments

Modern research, inspired by the availability of powerful computational facilities, is aimed at developing methods of regression analysis for situations in which the classical assumptions do not hold. For instance, one can estimate the function $ {\mathsf E} ( Y \mid X = x ) $ under smoothness assumptions only, by adapting methods from density estimation, or one can produce robust estimators (cf. Robust statistics) in the linear regression model by minimizing the sum of absolute deviations from the regression line instead of the sum of their squares.
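A minimal sketch of the second point, fitting a line by least absolute deviations (the derivative-free optimizer and the starting point are arbitrary choices for this illustration):

    # Sketch: robust fit of y = b0 + b1 x by minimizing the sum of absolute deviations
    # instead of the sum of squares; Nelder-Mead is merely a convenient derivative-free choice.
    import numpy as np
    from scipy.optimize import minimize

    def lad_line(x, y):
        def total_absolute_deviation(b):
            return np.sum(np.abs(y - b[0] - b[1] * x))
        start = np.array([np.median(y), 0.0])   # crude starting point
        result = minimize(total_absolute_deviation, start, method="Nelder-Mead")
        return result.x                         # estimates of (b0, b1)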

References

[a1] W. Härdle, "Applied nonparametric regression", Cambridge Univ. Press (1990)
This article was adapted from an original article by A.V. Prokhorov (originator), which appeared in the Encyclopedia of Mathematics, ISBN 1402006098.