Least squares, method of
A method in the theory of errors (cf. Errors, theory of) for estimating unknown quantities on the basis of results of measurement involving random errors. It is also used for the approximate representation of a given function by other (simpler) functions and often proves useful in the processing of observations. It was first proposed by C.F. Gauss (1794–1795) and A. Legendre (1805–1806). A rigorous basis for the method of least squares, and the delineation of the limits of its meaningful applicability, were provided by A.A. Markov and A.N. Kolmogorov. In the simplest case, that of linear relationships (see below) and observations containing no systematic errors but only random ones, the least-square estimators of the unknown quantities are linear functions of the observed values. These estimators involve no systematic errors, i.e. they are unbiased (see Unbiased estimator). If the random errors in the observations are independent and follow a normal distribution law, the method yields estimators with minimum variance, i.e. efficient estimators (see Statistical estimation). It is in this sense that the method of least squares is optimal among all methods for the determination of unbiased estimators. However, if the distribution of the random errors departs significantly from the normal one, the method is no longer necessarily the best.
In laying rigorous foundations for the method of least squares (in the Gaussian version), one assumes that the "loss" incurred by replacing the (unknown) exact value of some quantity $ \mu $ by an approximate value $ X $, calculated on the basis of observations, is proportional to the squared error $ ( X - \mu ) ^ {2} $; as an optimal estimator one takes a value of $ X $, free of systematic errors, for which the mean "loss" $ {\mathsf E} ( X - \mu ) ^ {2} $ is minimal. This requirement is fundamental to the method of least squares. In the general case, the task of finding an optimal estimator $ X $ in the least-square sense is extremely complex, and so in practice one confines oneself to a narrower goal: one takes $ X $ to be a linear function of the observed results which involves no systematic errors, and such that the mean squared error is minimal relative to the class of all linear functions. If the random errors of the observations are normally distributed and the estimated quantity $ \mu $ depends linearly on the mean values of the observed results (a case which arises very frequently in applications of the method of least squares), the solution of this restricted problem turns out to be also the solution of the general problem. Under these conditions, the optimal estimator $ X $ also obeys a normal law with mean $ \mu $, and so the probability density of $ X $,
$$ p ( x; \mu , \sigma ) = \ \frac{1}{\sqrt {2 \pi } \sigma } \mathop{\rm exp} \left \{ - \frac{( x - \mu ) ^ {2} }{2 \sigma ^ {2} } \right \} , $$
attains a maximum at $ x = \mu $ (this property is a rigorous expression of the statement, very common in the theory of errors, that "the estimator $ X $ furnished by the method of least squares is the most probable value of the unknown parameter $ \mu $").
The case of one unknown.
Assume that an unknown constant $ \mu $ is to be estimated on the basis of $ n $ independent observations, which have yielded results $ Y _ {1} \dots Y _ {n} $, i.e. $ Y _ {1} = \mu + \delta _ {1} , \dots Y _ {n} = \mu + \delta _ {n} $, where the $ \delta _ {1} \dots \delta _ {n} $ are random errors (by the usual definition in the classical theory of errors, random errors are independent random variables with expectation zero: $ {\mathsf E} \delta _ {i} = 0 $; if $ {\mathsf E} \delta _ {i} \neq 0 $, then the $ \delta _ {i} $ are called systematic errors). According to the method of least squares, one adopts an estimator $ X $ for $ \mu $ that minimizes the sum of the squares (hence the name of the method)
$$ S ( X) = \ \sum _ {i = 1 } ^ { n } p _ {i} ( X - Y _ {i} ) ^ {2} , $$
where
$$ p _ {i} = \frac{k}{\sigma _ {i} ^ {2} } \ \ \textrm{ and } \ \ \sigma _ {i} ^ {2} = {\mathsf D} \delta _ {i} = \ {\mathsf E} \delta _ {i} ^ {2} $$
(the coefficient $ k > 0 $ may be chosen arbitrarily). The number $ p _ {i} $ is known as the weight and $ \sigma _ {i} ^ {2} $ as the variance of the $ i $-th measurement. In particular, if all measurements are equally precise, then $ \sigma _ {1} = \dots = \sigma _ {n} $, and one can then put $ p _ {1} = \dots = p _ {n} = 1 $; but if each $ Y _ {i} $ is the arithmetic mean of $ n _ {i} $ equally-precise measurements, one puts $ p _ {i} = n _ {i} $.
The sum $ S ( X) $ attains its minimum when $ dS/dX = 2 \sum p _ {i} ( X - Y _ {i} ) = 0 $, i.e. when $ X $ is chosen to be the weighted average
$$ X = \overline{Y}\; = { \frac{1}{P} } \sum p _ {i} Y _ {i} ,\ \ \textrm{ where } \ P = \sum p _ {i} . $$
The estimator $ \overline{Y} $ for $ \mu $ is unbiased, has weight $ P $ and variance $ {\mathsf D} \overline{Y} = k/P $. In particular, if all measurements are equally precise, then $ \overline{Y} $ is the arithmetic mean of the results of measurement:
$$ \overline{Y}\; = { \frac{1}{n} } \sum Y _ {i} \ \ \textrm{ and } \ \ {\mathsf D} \overline{Y}\; = \frac{\sigma ^ {2} }{n} . $$
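In modern practice this computation is carried out directly; the following minimal Python sketch implements the formulas above (the data, variances and names are hypothetical, chosen only for illustration).

```python
# Least-square estimate of a single constant mu from measurements of
# unequal precision; Y and sigma2 are hypothetical.

Y      = [10.1, 9.8, 10.3]     # measurement results Y_i
sigma2 = [0.04, 0.01, 0.09]    # error variances D(delta_i)

k = 1.0                         # arbitrary positive factor in p_i = k / sigma_i^2
p = [k / s2 for s2 in sigma2]   # weights p_i
P = sum(p)

# weighted average: the X minimizing S(X) = sum_i p_i (X - Y_i)^2
Y_bar = sum(pi * yi for pi, yi in zip(p, Y)) / P
var_Y_bar = k / P               # variance D(Y_bar) of the estimator

print(Y_bar, var_Y_bar)
```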
Subject to certain general assumptions one can show that if the number $ n $ of observations is sufficiently large, then the distribution of the estimator $ \overline{Y} $ differs only slightly from the normal distribution with expectation $ \mu $ and variance $ k/P $. In this case the absolute error of the approximate equality $ \mu \approx \overline{Y} $ is less than $ t \sqrt {k/P } $ with probability approximately equal to
$$ \tag{1 } I ( t) = \ \frac{2}{\sqrt {2 \pi } } \int\limits _ { 0 } ^ { t } e ^ {- u ^ {2} /2 } \, du $$
(e.g. $ I ( 1.96) = 0.950 $; $ I ( 2.58) = 0.990 $; $ I ( 3.00) = 0.997 $).
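The integral (1) equals $ \mathop{\rm erf} ( t/ \sqrt 2 ) $, so the quoted values are easily checked; a short Python sketch:

```python
# I(t) from formula (1) equals erf(t / sqrt(2)); check of the quoted values.
from math import erf, sqrt

for t in (1.96, 2.58, 3.00):
    print(t, round(erf(t / sqrt(2)), 3))   # -> 0.95, 0.99, 0.997
```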
If the measurement weights $ p _ {i} $ are given and the factor $ k $ remains undetermined before the observations are taken, then this factor and the variance of $ \overline{Y}\; $ can be estimated by
$$ k \approx \ \frac{S ( \overline{Y} ) }{n - 1 } $$
and
$$ {\mathsf D} \overline{Y} = \ \frac{k}{P} \approx s ^ {2} = \ \frac{S ( \overline{Y} ) }{( n - 1) P } $$
(both these estimators are also unbiased).
In the practically important case in which the errors $ \delta _ {i} $ are normally distributed, one can determine the exact probability with which the absolute error of the equality $ \mu \approx \overline{Y} $ will be less than $ ts $ (where $ t $ is an arbitrary positive number) as
$$ \tag{2 } I _ {n - 1 } ( t) = \ C _ {n - 1 } \int\limits _ { 0 } ^ { t } \left ( 1 + \frac{v ^ {2} }{n - 1 } \right ) ^ {- n/2 } dv, $$
where the constant $ C _ {n - 1 } $ is so chosen that $ I _ {n - 1 } ( \infty ) = 1 $ (the Student distribution with $ n - 1 $ degrees of freedom). For large $ n $, formula (2) can be replaced by (1). For relatively small $ n $, however, the use of formula (1) may cause large errors. For example, according to (1), a value $ I = 0.99 $ corresponds to $ t = 2.58 $; the true values of $ t $, defined for small $ n $ as the solutions of the equations $ I _ {n - 1 } ( t) = 0.99 $, are shown in the following table:
$ n $ : 2, 3, 5, 10, 20, 30

$ t $ : 63.66, 9.92, 4.60, 3.25, 2.86, 2.76
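These values are two-sided $ 0.99 $ critical values of the Student distribution and can be recomputed directly; a short sketch, assuming the SciPy library is available:

```python
# Solutions t of I_{n-1}(t) = 0.99, i.e. two-sided 0.99 critical values
# of the Student distribution with n - 1 degrees of freedom.
from scipy.stats import t as student_t

for n in (2, 3, 5, 10, 20, 30):
    print(n, round(student_t.ppf(0.995, df=n - 1), 2))   # upper 0.995 quantile
```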
Example. To determine the mass of a body, it was weighed ten times in exactly the same way, the different weighings being assumed independent. The results were values $ Y _ {i} $ (in grams), with $ Y _ {i} $ observed $ n _ {i} $ times and $ n = \sum n _ {i} = 10 $.
Since the different weighings were assumed to be equally precise, one puts $ p _ {i} = n _ {i} $, and the estimate for the unknown mass $ \mu $ is chosen to be the weighted mean $ \overline{Y} = \sum n _ {i} Y _ {i} / \sum n _ {i} = 18.431 $. Putting, say, $ I _ {9} ( t) = 0.95 $ and consulting tables of the Student distribution with nine degrees of freedom, one finds that $ t = 2.262 $, so that the maximum absolute error in the approximate equality $ \mu \approx 18.431 $ is
$$ ts = \ t \sqrt {\frac{\sum n _ {i} ( Y _ {i} - \overline{Y} ) ^ {2} }{90} } = \ 2.262 \cdot 0.0048 = 0.011. $$
Thus, $ 18.420 < \mu < 18.442 $ with probability $ 0.95 $.
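A computation of this kind is easily mechanized. The following Python sketch repeats the procedure of the example with hypothetical grouped data (the original data table is not reproduced above); the pairs $ ( Y _ {i} , n _ {i} ) $ are illustrative only.

```python
# One unknown, grouped equally-precise measurements: p_i = n_i.
from math import sqrt
from scipy.stats import t as student_t

data = [(18.42, 2), (18.43, 4), (18.44, 4)]    # hypothetical (Y_i, n_i) pairs

n = sum(ni for _, ni in data)                  # total number of measurements
Y_bar = sum(ni * yi for yi, ni in data) / n    # weighted mean with p_i = n_i

# S(Y_bar) and the unbiased estimate s^2 = S(Y_bar) / ((n - 1) P), P = n
S = sum(ni * (yi - Y_bar) ** 2 for yi, ni in data)
s = sqrt(S / ((n - 1) * n))

t = student_t.ppf(0.975, df=n - 1)             # two-sided 0.95 level
print(f"mu = {Y_bar:.3f} +/- {t * s:.3f} with probability 0.95")
```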
The case of several unknowns (linear relationships).
Suppose that $ n $ measurement results $ Y _ {1} \dots Y _ {n} $ are related to $ m $ unknown constants $ x _ {1} \dots x _ {m} $ ($ m < n $) by means of independent linear relationships
$$ \tag{3 } Y _ {i} = \ \sum _ {j = 1 } ^ { m } a _ {ij} x _ {j} + \delta _ {i} ,\ \ i = 1 \dots n, $$
where the $ a _ {ij} $ are known coefficients and the $ \delta _ {i} $ are independent random errors in the measurements. One wishes to estimate the unknown quantities $ x _ {j} $ (this problem can be regarded as a generalization of the previous one, in which $ \mu = x _ {1} $, $ m = 1 $ and $ a _ {i1} = 1 $, $ i = 1 \dots n $).
Since $ {\mathsf E} \delta _ {i} = 0 $, the mean values of the results, $ y _ {i} = {\mathsf E} Y _ {i} $, are related to the unknowns $ x _ {1} \dots x _ {m} $ by means of the following linear equations:
$$ \tag{4 } y _ {i} = \ \sum _ {j = 1 } ^ { m } a _ {ij} x _ {j} ,\ \ i = 1 \dots n. $$
Consequently, the desired values of $ x _ {j} $ are a solution of the system (4) (on the assumption that the equations are consistent). The precise values of the measured magnitudes $ y _ {i} $ and the random errors $ \delta _ {i} $ are usually unknown, and so one generally replaces the systems (3) and (4) by the conditional equations:
$$ Y _ {i} = \ \sum _ {j = 1 } ^ { m } a _ {ij} x _ {j} ,\ \ i = 1 \dots n. $$
In the method of least squares, estimators for the unknowns $ x _ {j} $ are the quantities $ X _ {j} $ that minimize the sum of the squared deviations
$$ S = \ \sum _ {i = 1 } ^ { n } p _ {i} \left ( Y _ {i} - \sum _ {j = 1 } ^ { m } a _ {ij} X _ {j} \right ) ^ {2} . $$
(As in the previous case, $ p _ {i} $ is the weight of the measurement $ Y _ {i} $, inversely proportional to the variance of the random error $ \delta _ {i} $.) The conditional equations are as a rule inconsistent, i.e. for arbitrary values of $ X _ {j} $ the differences
$$ \Delta _ {i} = \ Y _ {i} - \sum _ {j = 1 } ^ { m } a _ {ij} X _ {j} ,\ \ i = 1 \dots n, $$
cannot all vanish, in general. The method of least squares prescribes taking as estimators those values of $ X _ {j} $ that minimize the sum $ S $. In those exceptional cases in which the conditional equations are consistent, and therefore solvable, the solution consists precisely of the estimators furnished by the method of least squares.
The sum $ S $ of the squares is a quadratic polynomial in the variables $ X _ {j} $; this polynomial attains a minimum at the values of $ X _ {1} \dots X _ {m} $ for which all the first partial derivatives vanish:
$$ \frac{\partial S }{\partial X _ {j} } = \ - 2 \sum _ {i = 1 } ^ { n } p _ {i} a _ {ij} \Delta _ {i} = 0,\ \ j = 1 \dots m. $$
Hence it follows that the $ X _ {j} $ furnished by the method of least squares must satisfy the so-called normal equations. In the notation proposed by Gauss, these equations are:
$$ \tag{5 } \sum _ {j = 1 } ^ { m } [ pa _ {j} a _ {l} ] X _ {j} = [ pYa _ {l} ],\ \ l = 1 \dots m, $$
where
$$ [ pa _ {j} a _ {l} ] = \ \sum _ {i = 1 } ^ { n } p _ {i} a _ {ij} a _ {il} $$
and
$$ [ pYa _ {l} ] = \ \sum _ {i = 1 } ^ { n } p _ {i} Y _ {i} a _ {il} . $$
Estimators obtained by solving the system of normal equations are unbiased ($ {\mathsf E} X _ {j} = x _ {j} $); the variances $ {\mathsf D} X _ {j} $ are $ kd _ {jj} /d $, where $ d $ is the determinant of the system (5) and $ d _ {jj} $ is the minor corresponding to the diagonal entry $ [ pa _ {j} a _ {j} ] $ (in other words, $ d _ {jj} /d $ is the weight of the estimator $ X _ {j} $). If the proportionality factor $ k $ (known as the variance per unit weight) is not known in advance, it can be estimated, together with the variances $ {\mathsf D} X _ {j} $, by the formulas:
$$ k \approx \frac{S}{n - m } \ \ \textrm{ and } \ \ {\mathsf D} X _ {j} \approx \ s _ {j} ^ {2} = \ \frac{Sd _ {jj} }{d ( n - m) } $$
($ S $ is the minimum value of the original sum of the squares). Subject to certain general assumptions one can show that if the number of observations $ n $ is sufficiently large, the absolute error of the approximate equality $ x _ {j} \approx X _ {j} $ is less than $ ts _ {j} $ with probability approximately given by the integral (1). If the random errors $ \delta _ {i} $ obey a normal law, then all quotients $ ( X _ {j} - x _ {j} )/s _ {j} $ are distributed according to the Student law with $ n - m $ degrees of freedom (a precise estimation of the absolute error of the approximate equality makes use of the integral (2), just as in the case of one unknown). In addition, the minimum value of $ S $ is independent of $ X _ {1} \dots X _ {m} $ (in the probabilistic sense), and therefore the approximate values $ {\mathsf D} X _ {j} \approx s _ {j} ^ {2} $ of the variances are independent of the estimators $ X _ {j} $ themselves.
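In matrix form the normal equations (5) read $ ( A ^ {T} P A ) X = A ^ {T} P Y $, where $ A = ( a _ {ij} ) $ and $ P $ is the diagonal matrix of the weights; $ d _ {jj} /d $ is then the $ j $-th diagonal entry of the inverse of $ A ^ {T} P A $. The following numpy sketch (with illustrative coefficients, observations and weights) solves the normal equations and estimates the variances $ {\mathsf D} X _ {j} $:

```python
# Weighted linear least squares via the normal equations (5); the
# coefficients a_ij, observations Y_i and weights p_i are illustrative.
import numpy as np

A = np.array([[1.0, 0.0],          # a_ij, here n = 4 observations, m = 2 unknowns
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
Y = np.array([0.1, 1.1, 1.9, 3.2])     # hypothetical observations
p = np.array([1.0, 1.0, 1.0, 1.0])     # weights

N = A.T @ (p[:, None] * A)         # matrix of the normal equations, [p a_j a_l]
b = A.T @ (p * Y)                  # right-hand sides [p Y a_l]
X = np.linalg.solve(N, b)          # least-square estimators X_j

n, m = A.shape
S = np.sum(p * (Y - A @ X) ** 2)           # minimum of the sum of squares
k = S / (n - m)                            # variance per unit weight
var_X = k * np.diag(np.linalg.inv(N))      # D X_j ~ s_j^2 = k d_jj / d
print(X, var_X)
```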
One of the most typical applications of the method of least squares is in the "smoothing" of observed results $ Y _ {i} $ for which $ a _ {ij} = a _ {j} ( t _ {i} ) $ in equations (3), where the $ a _ {j} ( t) $ are known functions of some parameter $ t $ (if $ t $ is time, then $ t _ {1} , t _ {2} \dots $ are the times at which the observations are made). A case encountered quite frequently in practical work is that known as parabolic interpolation, when the $ a _ {j} ( t) $ are polynomials (e.g. $ a _ {1} ( t) = 1 $, $ a _ {2} ( t) = t $, $ a _ {3} ( t) = t ^ {2} ,\dots $); if $ t _ {2} - t _ {1} = \dots = t _ {n} - t _ {n - 1 } $ and if the observations are equally precise, then the estimators $ X _ {j} $ can be calculated by using tables of orthogonal polynomials. Another case of practical importance is harmonic interpolation, when the $ a _ {j} ( t) $ are trigonometric functions (e.g. $ a _ {j} ( t) = \cos ( j - 1) t $, $ j = 1 \dots m $).
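For instance, parabolic smoothing with $ a _ {1} ( t) = 1 $, $ a _ {2} ( t) = t $, $ a _ {3} ( t) = t ^ {2} $ amounts to a least-square fit of a quadratic; a brief numpy sketch with hypothetical observations:

```python
# Parabolic smoothing: a_1(t) = 1, a_2(t) = t, a_3(t) = t^2, equally-
# precise observations; the observations Y are hypothetical.
import numpy as np

t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = np.array([1.1, 1.9, 4.2, 8.1, 13.8])

A = np.vander(t, 3, increasing=True)       # columns 1, t, t^2
X, *_ = np.linalg.lstsq(A, Y, rcond=None)  # least-square coefficients X_j
print(X)                                   # smoothed parabola X_1 + X_2 t + X_3 t^2
```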
Example. In order to estimate the precision of a certain method of chemical analysis, the method was used to determine the concentration of CaO in ten standard samples of known composition ($ i $ is the number of the experiment, $ t _ {i} $ the true concentration of CaO, $ T _ {i} $ the concentration determined by the chemical analysis, and $ Y _ {i} = T _ {i} - t _ {i} $ the error of the analysis).
If the results of the chemical analysis do not involve systematic errors, then $ {\mathsf E} Y _ {i} = 0 $. But if such errors exist, they may be represented as a first approximation by $ {\mathsf E} Y _ {i} = \alpha + \beta t _ {i} $ ($ \alpha $ is known as the constant bias, $ \beta t _ {i} $ as the methodical bias) or, equivalently,
$$ {\mathsf E} Y _ {i} = ( \alpha + \beta \overline{t} ) + \beta ( t _ {i} - \overline{t} ), $$
where
$$ \overline{t}\; = { \frac{1}{10} } \sum _ {i = 1 } ^ { 10 } t _ {i} = 23.25. $$
To find estimates of $ \alpha $ and $ \beta $, it will suffice to estimate the quantities $ x _ {1} = \alpha + \beta \overline{t}\; $ and $ x _ {2} = \beta $. The conditional equations here are
$$ Y _ {i} = x _ {1} + x _ {2} ( t _ {i} - \overline{t} ),\ \ i = 1 \dots 10, $$
so that $ a _ {i1} = 1 $, $ a _ {i2} = t _ {i} - \overline{t} $ (since by assumption the observations are equally precise, or homoscedastic (cf. Homoscedasticity), $ p _ {i} = 1 $). Since $ [ a _ {1} a _ {2} ] = [ a _ {2} a _ {1} ] = \sum ( t _ {i} - \overline{t} ) = 0 $, the system of normal equations assumes a particularly simple form:
$$ [ a _ {1} a _ {1} ] X _ {1} = [ Ya _ {1} ]; \ \ [ a _ {2} a _ {2} ] X _ {2} = [ Ya _ {2} ], $$
where
$$ [ a _ {1} a _ {1} ] = 10,\ \ [ a _ {2} a _ {2} ] = \sum ( t _ {i} - \overline{t} ) ^ {2} = \ 1569, $$
$$ [ Ya _ {1} ] = \sum Y _ {i} = - 3.5, $$
$$ [ Ya _ {2} ] = \sum Y _ {i} ( t _ {i} - \overline{t} ) = - 8.225. $$
The variances of the components of the solution are
$$ {\mathsf D} X _ {1} = \frac{k}{[ a _ {1} a _ {1} ] } = { \frac{k}{10} } \ \ \textrm{ and } \ \ {\mathsf D} X _ {2} = \frac{k}{[ a _ {2} a _ {2} ] } = { \frac{k}{1569} } , $$
where $ k $ is the unknown variance per unit weight (in this case, $ k $ is the variance of any of the quantities $ Y _ {i} $). Since here the components of the solution are $ X _ {1} = - 0.35 $ and $ X _ {2} = - 0.00524 $, one finds
$$ k \approx \frac{S}{n - m } = \ { \frac{1}{8} } \sum _ {i = 1 } ^ { 10 } [ Y _ {i} - X _ {1} - X _ {2} ( t _ {i} - \overline{t} )] ^ {2} = 0.0427, $$
$$ {\mathsf D} X _ {1} \approx s _ {1} ^ {2} = 0.00427,\ \ {\mathsf D} X _ {2} \approx s _ {2} ^ {2} = 0.0000272, $$
$$ s _ {1} = 0.065,\ s _ {2} = 0.00522. $$
If the random errors in the observations follow a normal law, then the quotients $ | X _ {j} - x _ {j} | /s _ {j} $, $ j = 1, 2 $, are distributed according to a Student $ t $-distribution. In particular, if the results of the observations are free of systematic errors, then $ x _ {1} = x _ {2} = 0 $, so that the quotients $ | X _ {1} | /s _ {1} $ and $ | X _ {2} | /s _ {2} $ must follow a Student $ t $-distribution. Using tables of the Student distribution with $ n - m = 8 $ degrees of freedom, one verifies that if indeed $ x _ {1} = x _ {2} = 0 $, then with probability 0.999 each of these quotients will not exceed 5.04, and with probability 0.95 they will not exceed 2.31. In this case $ | X _ {1} | /s _ {1} = 5.38 > 5.04 $, and therefore the hypothesis that there is no constant bias should be rejected. At the same time, the hypothesis that there is no methodical bias ($ x _ {2} = 0 $) is not contradicted by the results of the observations, since $ | X _ {2} | /s _ {2} = 1.004 < 2.31 $. It follows that the approximate formula $ t = T + 0.35 $ is a reasonable choice for the determination of $ t $ on the basis of the observed value $ T $.
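The numerical part of this example can be reproduced from the summary quantities quoted above (a Python sketch assuming SciPy; the value $ k = 0.0427 $ is taken from the text, since the raw data table is not reproduced here):

```python
# Reproducing the bias test from the summary quantities in the text;
# k is quoted from the text because the raw data table is not given.
from math import sqrt
from scipy.stats import t as student_t

a1a1, a2a2 = 10, 1569            # [a1 a1], [a2 a2]
Ya1, Ya2 = -3.5, -8.225          # [Y a1], [Y a2]
k = 0.0427                       # variance per unit weight, S / (n - m)

X1, X2 = Ya1 / a1a1, Ya2 / a2a2            # -0.35 and -0.00524
s1, s2 = sqrt(k / a1a1), sqrt(k / a2a2)    # 0.065 and 0.00522

t999 = student_t.ppf(0.9995, df=8)         # two-sided 0.999 level, about 5.04
t95  = student_t.ppf(0.975, df=8)          # two-sided 0.95 level, about 2.31
print(abs(X1) / s1 > t999)   # True: the constant bias is significant
print(abs(X2) / s2 > t95)    # False: no evidence of a methodical bias
```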
The case of several unknowns (nonlinear relationships).
Suppose that the $ n $ results $ Y _ {i} $ of measurements are related to the $ m $ unknowns $ x _ {j} $ ($ m < n $) by means of functions $ Y _ {i} = f _ {i} ( x _ {1} \dots x _ {m} ) + \delta _ {i} $, $ i = 1 \dots n $, where the $ \delta _ {i} $ are independent random errors and the functions $ f _ {i} $ (which need not be linear) are differentiable. According to the method of least squares, estimators $ X _ {j} $ for the $ x _ {j} $ are those values for which the sum of squares $ S = \sum _ {i} p _ {i} ( Y _ {i} - f _ {i} ( X _ {1} \dots X _ {m} )) ^ {2} $ is smallest. Since the functions $ f _ {i} $ are nonlinear, solving the normal equations $ \partial S/ \partial X _ {j} = 0 $ may present considerable difficulties. Sometimes nonlinear relationships may be reduced to linear form by a transformation.
For example, in the magnetization of iron, the magnetic field strength $ H $ is related to the magnetic induction $ B $ via the empirical formula $ B = H/( x _ {1} + H x _ {2} ) $ (the coefficients $ x _ {1} $ and $ x _ {2} $ being determined by measuring the values $ B _ {i} $ for various preassigned values $ H _ {i} $). The induction $ B $ is a nonlinear function of $ x _ {1} $ and $ x _ {2} $; its reciprocal, however, $ 1/B = x _ {1} /H + x _ {2} $, is a linear function of them. Application of the method of least squares to the original and the transformed equations may well yield different estimates for the unknowns $ x _ {1} $ and $ x _ {2} $; but if the random errors in measuring the induction are small in comparison with the measured quantities $ B _ {i} $, then $ {\mathsf D} ( 1/B) \approx {\mathsf D} B/B ^ {4} $, so the quantities $ 1/B _ {i} $ should be assigned weights proportional to $ B _ {i} ^ {4} $; it is natural to expect that under these conditions the difference between the estimates in the nonlinear and the linear formulation is negligible for practical purposes.
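A sketch of this transformation in Python (the field strengths and inductions below are hypothetical): one fits $ 1/B = x _ {1} ( 1/H) + x _ {2} $ by weighted least squares with weights proportional to $ B _ {i} ^ {4} $.

```python
# Linearization by transformation: fit 1/B = x1 * (1/H) + x2 by weighted
# least squares with weights ~ B^4; H and B below are hypothetical.
import numpy as np

H = np.array([100.0, 200.0, 400.0, 800.0])   # preassigned field strengths
B = np.array([55.0, 85.0, 115.0, 140.0])     # hypothetical measured inductions

A = np.column_stack([1.0 / H, np.ones_like(H)])  # columns for x1 and x2
y = 1.0 / B
w = B ** 4                                       # weights ~ 1 / D(1/B)

N = A.T @ (w[:, None] * A)                       # weighted normal equations
x1, x2 = np.linalg.solve(N, A.T @ (w * y))
print(x1, x2)
```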
In cases when it is not possible to replace the nonlinear equations by linear ones via identity transformations, one resorts to another linearization technique. From the original $ n $ equations one singles out some $ m $ whose solution $ X _ {1} ^ {0} \dots X _ {m} ^ {0} $ serves as a zeroth approximation to the unknowns $ x _ {j} $. Putting $ \xi _ {j} = X _ {j} - X _ {j} ^ {0} $, the conditional equations may be rewritten as
$$ Y _ {i} = f _ {i} ( X _ {1} ^ {0} + \xi _ {1} \dots X _ {m} ^ {0} + \xi _ {m} ),\ \ i = 1 \dots n. $$
Expanding the terms on the right in powers of $ \xi _ {j} $ and dropping all but the linear terms, one obtains
$$ Y _ {i} - ( f _ {i} ) _ {0} = \ \sum _ {j = 1 } ^ { m } \left ( \frac{\partial f _ {i} }{\partial x _ {j} } \right ) _ {0} \xi _ {j} ,\ \ i = 1 \dots n, $$
where $ ( f _ {i} ) _ {0} $ and $ ( \partial f _ {i} / \partial x _ {j} ) _ {0} $ are the values of $ f _ {i} $ and of its derivatives with respect to $ x _ {j} $ at the point $ x _ {1} = X _ {1} ^ {0} \dots x _ {m} = X _ {m} ^ {0} $. This is a linear system of equations, so the method of least squares is readily applicable and yields estimates of the unknowns $ \xi _ {j} $. One thus obtains a first approximation $ X _ {j} ^ {1} = X _ {j} ^ {0} + \xi _ {j} $ for the unknowns. These quantities are then treated as the initial approximation, and the entire procedure is repeated until two consecutive approximations coincide to within the desired precision. If the variances of the errors $ \delta _ {i} $ are sufficiently small, the procedure converges.
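This linearization procedure is essentially the Gauss–Newton method. A minimal Python sketch for an illustrative model $ f _ {i} ( x _ {1} , x _ {2} ) = x _ {1} e ^ {x _ {2} t _ {i} } $ (all data hypothetical):

```python
# The linearization iteration above (essentially the Gauss-Newton method)
# for the illustrative model f_i(x1, x2) = x1 * exp(x2 * t_i).
import numpy as np

t = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
Y = np.array([2.1, 2.6, 3.5, 4.4, 5.9])     # hypothetical observations

def f(X):
    return X[0] * np.exp(X[1] * t)

def jacobian(X):                 # (partial f_i / partial x_j) at the current point
    return np.column_stack([np.exp(X[1] * t), X[0] * t * np.exp(X[1] * t)])

X = np.array([2.0, 0.5])         # zeroth approximation X_j^0
for _ in range(20):
    xi, *_ = np.linalg.lstsq(jacobian(X), Y - f(X), rcond=None)
    X = X + xi                   # next approximation
    if np.max(np.abs(xi)) < 1e-10:   # consecutive approximations coincide
        break
print(X)
```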
Very commonly, if $ {\mathsf D} \delta _ {i} $ is small, the first approximation alone is quite sufficient; there is no point in requiring $ X _ {j} $ to be determined with precision significantly exceeding $ \sqrt { {\mathsf D} X _ {j} } $.
In many cases of practical importance (in particular, in the estimation of complicated nonlinear relationships), the number of unknown parameters becomes very large, and so implementation of the method of least squares becomes effective only if modern computing techniques are employed.
References
[1] A.A. Markov, "Wahrscheinlichkeitsrechnung", Teubner (1912) (Translated from Russian)
[2] A.N. Kolmogorov, "On substantiation of the method of least squares" Uspekhi Mat. Nauk, 1 : 1 (1946) pp. 57–70 (In Russian)
[3] Yu.V. Linnik, "Die Methode der kleinsten Quadrate in moderner Darstellung", Deutsch. Verlag Wissenschaft. (1961) (Translated from Russian)
[4] V.V. Nalimov, "The application of mathematical statistics to chemical analysis", Pergamon (1963) (Translated from Russian)
[5] F.R. Helmert, "Die Ausgleichungsrechnung nach der Methode der kleinsten Quadrate", Teubner (1907)
Comments
In fact, Gauss proposed the normal distribution as a way of justifying the method of least squares. Moreover, his original derivation of the least-square estimator as the "best linear unbiased estimator" (as it is nowadays interpreted; the so-called Gauss–Markov theorem) was actually a derivation of this estimator as the "most likely value of $ \mu $" in a Bayesian sense (cf. Bayesian approach), with respect to a uniform prior on $ \mu $, cf. [a1].
For (the results of) the use of least squares in estimating the coefficients of (presumed) linear relationships $ y = A x $ between repeatedly measured vectors of variables $ y $ and $ x $ (linear regression), cf. Regression matrix and Regression analysis.
References
[a1] S.M. Stigler, "The history of statistics", Harvard Univ. Press (1986)
[a2] N.R. Draper, H. Smith, "Applied regression analysis", Wiley (1981)
[a3] C.R. Rao, "Linear statistical inference and its applications", Wiley (1965)
[a4] C. Daniel, F.S. Wood, "Fitting equations to data: computer analysis of multifactor data for scientists and engineers", Wiley (1971)
Least squares, method of. Encyclopedia of Mathematics. URL: http://encyclopediaofmath.org/index.php?title=Least_squares,_method_of&oldid=47599