# Regression analysis

A branch of mathematical statistics that unifies various practical methods for investigating dependence between variables using statistical data (see Regression). The problem of regression in mathematical statistics is characterized by the fact that there is insufficient information about the distributions of the variables under consideration. Suppose, for example, that there are reasons for assuming that a random variable $Y$ has a given probability distribution at a fixed value $x$ of another variable, so that

$${\mathsf E} ( Y \mid x ) = g ( x , \beta ) ,$$

where $\beta$ is a set of unknown parameters determining the function $g ( x)$, and that it is required to determine the values of these parameters from results of observations. Depending on the nature of the problem and the aims of the analysis, the results of an experiment $( x _ {1} , y _ {1} ) \dots ( x _ {n} , y _ {n} )$ are interpreted in different ways in relation to the variable $x$. To ascertain the connection between the variables in the experiment, one often uses a model based on simplified assumptions: $x$ is a controllable variable, whose values are given in advance for the design of the experiment, and the observed value $y$ can be written in the form

$$y _ {i} = g ( x _ {i} , \beta ) + \epsilon _ {i} ,\ \ i = 1 \dots n ,$$

where the variables $\epsilon _ {i}$ characterize the errors, which are independent for different measurements and identically distributed with mean zero and constant variance. In the case of an uncontrollable variable, the results of the observations $( x _ {1} , y _ {1} ) \dots ( x _ {n} , y _ {n} )$ represent a sample from a certain two-dimensional population. The methods of regression analysis are the same in both cases, although the interpretations of the results differ (in the latter case, the analysis is substantially supplemented by methods from the theory of correlation, cf. Correlation (in statistics)).

The study of regression for experimental data is carried out using methods based on the principles of mean-square regression. Regression analysis solves the following fundamental problems: 1) the choice of a regression model, which involves assumptions about the dependence of the regression function on $x$ and $\beta$; 2) the estimation of the parameters $\beta$ in the selected model, for example by the method of least squares; and 3) the testing of statistical hypotheses about the regression.

From the point of view of a single method for estimating unknown parameters, the most natural one is a regression model that is linear in these parameters:

$$g ( x , \beta ) = \beta _ {0} g _ {0} ( x) + {} \dots + \beta _ {m} g _ {m} ( x) .$$

The choice of the functions $g _ {i} ( x)$ is sometimes arrived at by arranging the experimental values $( x , y )$ on a scattergram or, more often, by theoretical considerations. It is also assumed that the variance $\sigma ^ {2}$ of the results of the observations is constant (or proportional to a known function of $x$). The standard method of regression estimation is based on the use of a polynomial of some degree $m$, $1 \leq m < n$:

$$g ( x , \beta ) = \ \beta _ {0} + \beta _ {1} x + \dots + \beta _ {m} x ^ {m}$$

or, in the simplest case, of a linear function (linear regression)

$$g ( x , \beta ) = \beta _ {0} + \beta _ {1} x .$$

There are criteria for testing linearity and for choosing the degree of the approximating polynomial.

According to the principles of mean-square regression, the estimation of the unknown regression coefficients $\beta _ {0} \dots \beta _ {m}$ (cf. Regression coefficient) and the variance $\sigma ^ {2}$ (cf. Dispersion) is realized by the method of least squares. Thus, as statistical estimators of $\beta _ {0} \dots \beta _ {m}$ one chooses values $\widehat \beta _ {0} \dots \widehat \beta _ {m}$ which minimize the expression

$$\sum _ {i=1} ^ {n} ( y _ {i} - g ( x _ {i} , \beta ) ) ^ {2} .$$

The polynomial $\widehat{g} ( x) = \widehat \beta _ {0} + \dots + \widehat \beta _ {m} x ^ {m}$ thus obtained is called the empirical regression curve, and is a statistical estimator of the unknown true regression curve. Assuming linearity of regression, the equation of the empirical regression curve has the form

$$\widehat{g} ( x) = \widehat \beta _ {0} + \widehat \beta _ {1} x ,$$

where

$$\widehat \beta _ {0} = \overline{y}\; - \widehat \beta _ {1} \overline{x}\; ,\ \ \widehat \beta _ {1} = \frac{\sum _ { i } ( x _ {i} - \overline{x}\; ) ( y _ {i} - \overline{y}\; ) }{\sum _ { i } ( x _ {i} - \overline{x}\; ) ^ {2} } ,$$

$$\overline{x}\; = \frac{1}{n} \sum _ { i } x _ {i} ,\ \overline{y}\; = \frac{1}{n} \sum _ { i } y _ {i} .$$
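The closed-form expressions for the linear case can be evaluated directly. The following is a minimal sketch in plain Python; the function name `fit_line` and the data are illustrative assumptions, not part of the article.

```python
def fit_line(xs, ys):
    """Return (b0, b1) minimizing sum((y - b0 - b1 * x) ** 2)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of squared deviations of x about its mean (denominator of the slope)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values coincide; the slope is undefined")
    # Sum of products of deviations of x and y about their means
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Made-up data lying exactly on the line y = 1 + 2x
b0, b1 = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(b0, b1)
```

Because the sample points lie on a line, the fitted coefficients reproduce its intercept and slope; with noisy data they would only approximate them.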

The random variables $\widehat \beta _ {0} \dots \widehat \beta _ {m}$ are called sample regression coefficients (or estimated regression coefficients). An unbiased estimator of the parameter $\sigma ^ {2}$ is given by

$$s ^ {2} = \ \frac{\sum _ {i=1} ^ {n} ( y _ {i} - \widehat{g} ( x _ {i} ) ) ^ {2} }{n - m - 1 } .$$
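As an illustration of polynomial regression, the sketch below fits a polynomial of degree $m$ by least squares and computes the residual variance estimate, dividing the residual sum of squares by the number of observations minus the number of fitted coefficients ($m + 1$). It assumes NumPy is available; the function name `fit_polynomial` and the data are hypothetical.

```python
import numpy as np


def fit_polynomial(x, y, m):
    """Least-squares polynomial of degree m; returns (beta_hat, s2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    if n <= m + 1:
        raise ValueError("need more observations than coefficients")
    # Columns 1, x, ..., x^m, so beta_hat[j] multiplies x**j
    design = np.vander(x, m + 1, increasing=True)
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta_hat
    # Residual sum of squares over the residual degrees of freedom
    s2 = residuals @ residuals / (n - (m + 1))
    return beta_hat, s2


# Made-up observations, roughly quadratic in x
x_obs = [0.0, 1.0, 2.0, 3.0, 4.0]
y_obs = [1.1, 1.9, 5.2, 9.8, 17.1]
beta_hat, s2 = fit_polynomial(x_obs, y_obs, m=2)
print(beta_hat, s2)
```

Using `np.linalg.lstsq` rather than solving the normal equations by hand is a design choice: powers of $x$ can make the design matrix badly conditioned.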

If the variance depends on $x$, the method of least squares is applicable with certain modifications.
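One common modification, named here as an assumption since the text does not specify one, is weighted least squares: if the variance of $y _ {i}$ is $\sigma ^ {2} w ( x _ {i} )$ for a known function $w$, dividing each observation by $\sqrt {w ( x _ {i} ) }$ restores constant variance. A minimal NumPy sketch, with the hypothetical name `weighted_least_squares` and made-up data:

```python
import numpy as np


def weighted_least_squares(X, y, w):
    """Minimize sum_i (y_i - x_i^T beta)^2 / w_i, where Var(y_i) = sigma^2 * w_i."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    if np.any(w <= 0):
        raise ValueError("variance function values must be positive")
    scale = 1.0 / np.sqrt(w)
    # After rescaling, every observation has the same variance sigma^2
    X_scaled = X * scale[:, None]
    y_scaled = y * scale
    beta_hat, *_ = np.linalg.lstsq(X_scaled, y_scaled, rcond=None)
    return beta_hat


# Made-up data; the standard deviation is assumed proportional to x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([2.2, 3.7, 6.5, 7.4, 11.0])
beta_wls = weighted_least_squares(X, y, w=x ** 2)
print(beta_wls)
```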

If one studies the dependence of a random variable $Y$ on several variables $x _ {1} \dots x _ {k}$, then it is more convenient to write the general linear regression model in matrix form: An observation vector $y$ with independent components $y _ {1} \dots y _ {n}$ has mean value and covariance matrix given by

$$\tag{* } {\mathsf E} ( Y \mid x _ {1} \dots x _ {k} ) = X \beta ,\ \ {\mathsf D} ( Y \mid x _ {1} \dots x _ {k} ) = \sigma ^ {2} I ,$$

where $\beta = ( \beta _ {1} \dots \beta _ {k} )$ is the vector of regression coefficients, $X = \| x _ {ij} \|$, $i = 1 \dots n$, $j = 1 \dots k$, is a matrix of known variables related to each other, generally speaking, in an arbitrary fashion, and $I$ is the identity matrix of order $n$; moreover, $n > k$ and $| X ^ {T} X | \neq 0$. More generally one can assume that there is correlation between the observations $y _ {i}$:

$${\mathsf E} ( Y \mid x _ {1} \dots x _ {k} ) = X \beta ,\ \ {\mathsf D} ( Y \mid x _ {1} \dots x _ {k} ) = \sigma ^ {2} A ,$$

for some known matrix $A$. But this scheme can be reduced to the model (*). An unbiased estimator for $\beta$ by the method of least squares is given by

$$\widehat \beta = ( X ^ {T} X ) ^ {-1} X ^ {T} y ,$$

and an unbiased estimator for $\sigma ^ {2}$ is given by

$$s ^ {2} = \ \frac{1}{n-k} ( y ^ {T} y - \widehat \beta {} ^ {T} X ^ {T} y ) .$$
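The two matrix estimators can be computed in a few lines. The sketch below assumes NumPy; the function name `ols` and the data are illustrative. It solves the normal equations $X ^ {T} X \widehat \beta = X ^ {T} y$ rather than forming the inverse explicitly.

```python
import numpy as np


def ols(X, y):
    """Return (beta_hat, s2) for the linear model with E(y) = X beta, D(y) = sigma^2 I."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    if n <= k:
        raise ValueError("the model requires n > k")
    xtx = X.T @ X
    xty = X.T @ y
    # np.linalg.solve raises LinAlgError if X^T X is singular
    beta_hat = np.linalg.solve(xtx, xty)
    # y^T y - beta_hat^T X^T y equals the residual sum of squares
    s2 = (y @ y - beta_hat @ xty) / (n - k)
    return beta_hat, s2


# Made-up design: a column of ones and two regression variables
X = np.array([[1, 0, 1], [1, 1, 0], [1, 2, 2], [1, 3, 1], [1, 4, 3]], dtype=float)
y = np.array([1.0, 2.1, 4.9, 5.2, 8.8])
beta_hat, s2 = ols(X, y)
print(beta_hat, s2)
```

The expression used for `s2` is algebraically the residual sum of squares divided by $n - k$; computing it from the residuals $y - X \widehat \beta$ directly is numerically safer when the fit is very close.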

Model (*) is the most general linear model, in that it is applicable to various regression situations and encompasses all forms of polynomial regression of $Y$ with respect to $x _ {1} \dots x _ {k}$ (in particular, the above polynomial regression of $Y$ with respect to $x$ of order $m$ can be reduced to the model (*), in which $m$ of the regression variables are functionally connected). In this linear interpretation of regression analysis, the problem of estimating $\beta$ and the calculation of the covariance matrix of estimators ${\mathsf D} \widehat \beta = \sigma ^ {2} ( X ^ {T} X ) ^ {-1}$ reduces to the problem of inverting the matrix $X ^ {T} X$.

The above method for constructing an empirical regression assuming a normal distribution of the results of the observations leads to estimators for $\beta$ and $\sigma ^ {2}$ that coincide with the maximum-likelihood estimators. However, the estimators obtained by this method are, in a certain sense, also optimal in the case of deviation from normality, provided only that the sample size is sufficiently large.

In the given matrix form, the general linear regression model (*) admits a simple extension to the case when the observed variables $y _ {i}$ are random vector variables. This does not give rise to any new statistical problem (see Regression matrix).

The problems of regression analysis are not restricted to the construction of point estimators of the parameters $\beta$ and $\sigma ^ {2}$ in the general linear model (*). The problem of the accuracy of a constructed empirical relation is most effectively solved under the assumption that the observation vector $y$ is normally distributed. If $y$ is normally distributed, then, since the estimator $\widehat \beta$ is a linear function of $y$, each component $\widehat \beta _ {i}$ is normally distributed with mean $\beta _ {i}$ and variance ${\mathsf D} \widehat \beta _ {i} = \sigma ^ {2} b _ {ii}$, where $b _ {ii}$ is the $i$-th diagonal entry of the matrix $( X ^ {T} X ) ^ {-1}$. Moreover, the estimator $s ^ {2}$ for $\sigma ^ {2}$ is distributed independently of every component of $\widehat \beta$, and the variable $( n - k ) s ^ {2} / \sigma ^ {2}$ has a "chi-squared" distribution with $n - k$ degrees of freedom. Hence the statistic

$$t = \frac{( \widehat \beta _ {i} - \beta _ {i} ) }{[ s ^ {2} b _ {ii} ] ^ {1/2} }$$

has the Student distribution with $n - k$ degrees of freedom. This fact is used to construct confidence intervals for the parameters $\beta _ {i}$ and to test hypotheses about the values taken by them. One can also find confidence intervals for ${\mathsf E} ( Y \mid x _ {1} \dots x _ {k} )$ for fixed values of all the regression variables, and intervals containing the next, $( n + 1 )$-th, value of $y$ (called prediction intervals). Finally, starting from a vector of sample regression coefficients $\widehat \beta$ one can construct a confidence ellipsoid for $\beta$, or for any set of unknown regression coefficients, and also a confidence region for the entire regression curve.
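The interval construction for individual coefficients can be sketched as follows, assuming NumPy. The function name `coefficient_intervals` is hypothetical, and the Student quantile `t_crit` is supplied by the caller (for example from a table) so that no additional library is needed; the data are made up.

```python
import numpy as np


def coefficient_intervals(X, y, t_crit):
    """Estimates, standard errors, t statistics for beta_i = 0, and interval bounds."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta_hat = xtx_inv @ X.T @ y
    residuals = y - X @ beta_hat
    s2 = residuals @ residuals / (n - k)
    # Standard error of beta_hat[i] is sqrt(s2 * b_ii), b_ii on the diagonal of (X^T X)^{-1}
    std_err = np.sqrt(s2 * np.diag(xtx_inv))
    # t statistic for the hypothesis beta_i = 0
    t_zero = beta_hat / std_err
    lower = beta_hat - t_crit * std_err
    upper = beta_hat + t_crit * std_err
    return beta_hat, std_err, t_zero, lower, upper


# Made-up data: n = 6 observations, k = 2 coefficients, hence 4 degrees of freedom
x = np.arange(1.0, 7.0)
X = np.column_stack([np.ones_like(x), x])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])
# 2.776 is the tabulated 0.975 quantile of the Student distribution with 4 degrees of freedom
beta_hat, std_err, t_zero, lower, upper = coefficient_intervals(X, y, t_crit=2.776)
print(lower, upper)
```

With this choice of `t_crit` the bounds are two-sided 95% confidence intervals, valid under the normality assumption stated in the text.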

Regression analysis is one of the most widely used methods for processing experimental data when investigating relations in physics, biology, economics, technology, and other fields. Such branches of mathematical statistics as dispersion analysis and the design of experiments are based on models of regression analysis, and these models are widely used in multi-dimensional statistical analysis.

Modern research, stimulated by modern computing facilities, is aimed at developing methods for regression analysis when the classical assumptions of regression analysis do not hold. For instance, one can estimate the function ${\mathsf E} ( Y \mid X = x)$ using only smoothness assumptions by adapting methods from density estimation, or one can produce robust estimators (cf. Robust statistics) in the linear regression model by minimizing the sum of absolute deviations from the regression line instead of the sum of their squares.
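One estimator of the smoothness-based kind, named here as an assumption because the text does not single one out, is the Nadaraya-Watson kernel estimator, which averages the observed $y _ {i}$ with weights taken from a kernel centred at the point of interest. A minimal sketch in plain Python with a Gaussian kernel; the function name `kernel_regression`, the bandwidth and the data are illustrative.

```python
import math


def kernel_regression(x0, xs, ys, bandwidth):
    """Nadaraya-Watson estimate of E(Y | X = x0) with a Gaussian kernel."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    # Kernel weights decay with the distance between x0 and each observed x
    weights = [math.exp(-0.5 * ((x0 - x) / bandwidth) ** 2) for x in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("x0 is too far from the data for this bandwidth")
    # Weighted average of the observed responses
    return sum(w * y for w, y in zip(weights, ys)) / total


# Made-up observations
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 4.2, 8.8, 16.3]
estimate = kernel_regression(2.5, xs, ys, bandwidth=0.8)
print(estimate)
```

The bandwidth controls the smoothness of the fitted curve: small values follow the data closely, large values approach the overall mean of the $y _ {i}$.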