Correlation (in statistics)
A dependence between random variables not necessarily expressed by a rigorous functional relationship. Unlike functional dependence, a correlation is, as a rule, considered when one of the random variables depends not only on the other (given) one, but also on several random factors. The dependence between two random events is manifested in the fact that the conditional probability of one of them, given the occurrence of the other, differs from the unconditional probability. Similarly, the influence of one random variable on another is characterized by the conditional distributions of one of them, given fixed values of the other. Let $X$ and $Y$ be random variables with a given joint distribution, let $m_X$ and $m_Y$ be the expectations of $X$ and $Y$, let $\sigma_X^2$ and $\sigma_Y^2$ be the variances of $X$ and $Y$, and let $\rho$ be the correlation coefficient of $X$ and $Y$. Assume that for every possible value $X = x$ the conditional mathematical expectation $y(x) = \mathsf{E}[Y \mid X = x]$ of $Y$ is defined; then the function $y(x)$ is known as the regression of $Y$ given $X$, and its graph is the regression curve of $Y$ given $X$. The dependence of $Y$ on $X$ is manifested in the variation of the mean values of $Y$ as $X$ varies, although for each fixed value $X = x$, $Y$ remains a random variable with a well-defined spread. In order to determine to what degree of accuracy the regression reproduces the variation of $Y$ as $X$ varies, one uses the conditional variance of $Y$ for a given $X = x$ or its mean value (a measure of the spread of $Y$ about the regression curve):

$$\sigma_{Y\mid X}^2 = \mathsf{E}\bigl[Y - \mathsf{E}(Y \mid X)\bigr]^2.$$
If $X$ and $Y$ are independent, then all conditional mathematical expectations of $Y$ are independent of $x$ and coincide with the unconditional expectation: $y(x) = m_Y$; and then also $\sigma_{Y\mid X}^2 = \sigma_Y^2$. When $Y$ is a function of $X$ in the strict sense of the word, then for each $X = x$ the variable $Y$ takes only one definite value and $\sigma_{Y\mid X}^2 = 0$. Similarly one defines $x(y) = \mathsf{E}[X \mid Y = y]$ (the regression of $X$ given $Y$). A natural index of the concentration of the distribution near the regression curve $y(x)$ is the correlation ratio

$$\eta_{Y\mid X}^2 = 1 - \frac{\sigma_{Y\mid X}^2}{\sigma_Y^2}.$$
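Equivalently, since by the formula for the total variance $\sigma_Y^2 = \sigma_{Y\mid X}^2 + \operatorname{Var} y(X)$, the correlation ratio may be written as

$$\eta_{Y\mid X}^2 = \frac{\operatorname{Var} y(X)}{\sigma_Y^2},$$

so that $0 \le \eta_{Y\mid X}^2 \le 1$ and the correlation ratio measures the share of the variance of $Y$ accounted for by the regression.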
One has $\eta_{Y\mid X}^2 = 0$ if and only if the regression has the form $y(x) = m_Y$, and in that case the correlation coefficient $\rho$ vanishes and $Y$ is not correlated with $X$. If the regression of $Y$ given $X$ is linear, i.e. the regression curve is the straight line

$$y(x) = m_Y + \rho\frac{\sigma_Y}{\sigma_X}(x - m_X),$$

then

$$\sigma_{Y\mid X}^2 = \sigma_Y^2(1 - \rho^2), \qquad \eta_{Y\mid X}^2 = \rho^2;$$
if, moreover, $|\rho| = 1$, then $Y$ is related to $X$ through an exact linear dependence; but if $\eta_{Y\mid X}^2 = \rho^2 < 1$, there is no functional dependence between $Y$ and $X$. There is an exact functional dependence of $Y$ on $X$, other than a linear one, if and only if $\rho^2 < \eta_{Y\mid X}^2 = 1$. With rare exceptions, the practical use of the correlation coefficient as a measure of the lack of dependence is justifiable only when the joint distribution of $X$ and $Y$ is normal (or close to normal), since in that case $\rho = 0$ implies that $X$ and $Y$ are independent. Use of $\rho$ as a measure of dependence for arbitrary random variables $X$ and $Y$ frequently leads to erroneous conclusions, since $\rho$ may vanish even when a functional dependence exists. If the joint distribution of $X$ and $Y$ is normal, then both regression curves are straight lines and $\rho$ uniquely determines the concentration of the distribution near the regression curves: when $|\rho| = 1$ the regression curves merge into one straight line, corresponding to an exact linear dependence between $X$ and $Y$; when $\rho = 0$ one has independence.
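A standard illustration of the last point: if $X$ has a distribution symmetric about zero with finite third moment (e.g. standard normal) and $Y = X^2$, then $\operatorname{Cov}(X, Y) = \mathsf{E}X^3 = 0$, so $\rho = 0$, even though $Y$ is an exact (non-linear) function of $X$, and hence $\sigma_{Y\mid X}^2 = 0$ and $\eta_{Y\mid X}^2 = 1$.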
When studying the interdependence of several random variables $X_1, \ldots, X_n$ with a given joint distribution, one uses multiple and partial correlation ratios and coefficients. The latter are evaluated using the ordinary correlation coefficients between $X_i$ and $X_j$, the totality of which forms the correlation matrix. A measure of the linear relationship between $X_1$ and the totality of the other variables $X_2, \ldots, X_n$ is provided by the multiple-correlation coefficient. If the mutual relationship of $X_1$ and $X_2$ is assumed to be determined by the influence of the other variables $X_3, \ldots, X_n$, then the partial correlation coefficient of $X_1$ and $X_2$ with respect to $X_3, \ldots, X_n$ is an index of the linear relationship between $X_1$ and $X_2$ relative to $X_3, \ldots, X_n$.
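As a small computational sketch, both coefficients can be obtained from the inverse of the correlation matrix. The code below (NumPy; the function names and the example matrix are illustrative, not taken from the article) shows one way of doing this.

```python
import numpy as np

def partial_correlation(R, i, j):
    """Partial correlation of X_i and X_j with respect to all remaining
    variables, computed from the correlation matrix R."""
    P = np.linalg.inv(R)
    return -P[i, j] / np.sqrt(P[i, i] * P[j, j])

def multiple_correlation(R, i):
    """Multiple-correlation coefficient of X_i with the totality of the
    other variables (R is assumed to have unit diagonal)."""
    P = np.linalg.inv(R)
    return np.sqrt(1.0 - 1.0 / P[i, i])

# Illustrative correlation matrix of three variables X_1, X_2, X_3.
R = np.array([[1.0, 0.6, 0.4],
              [0.6, 1.0, 0.5],
              [0.4, 0.5, 1.0]])
print(partial_correlation(R, 0, 1))  # partial correlation of X_1 and X_2 w.r.t. X_3
print(multiple_correlation(R, 0))    # multiple correlation of X_1 with (X_2, X_3)
```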
For measures of correlation based on rank statistics (cf. Rank statistic) see Kendall coefficient of rank correlation; Spearman coefficient of rank correlation.
Mathematical statistics has developed methods for estimating the coefficients that characterize the correlation between random variables or tests, as well as methods for testing hypotheses concerning their values, using their sampling analogues. These methods are collectively known as correlation analysis. Correlation analysis of statistical data consists of the following basic practical steps: 1) the construction of a scatter plot and the compilation of a correlation table; 2) the computation of sample correlation ratios or correlation coefficients; 3) the testing of statistical hypotheses concerning the significance of the dependence. Further investigation may consist in establishing the concrete form of the dependence between the variables (see Regression).
Among the aids to analysis of two-dimensional sample data $(x_1, y_1), \ldots, (x_n, y_n)$ are the scatter plot and the correlation table. The scatter plot is obtained by plotting the sample points on the coordinate plane. Examination of the configuration formed by the points of the scatter plot yields a preliminary idea of the type of dependence between the random variables (e.g. whether one of the variables increases or decreases on the average as the other increases). Prior to numerical processing, the results are usually grouped and presented in the form of a correlation table. In each entry of this table one writes the number $n_{ij}$ of pairs $(x, y)$ with components in the appropriate grouping intervals. Assuming that the grouping intervals (in each of the variables) are equal in length, one takes the centres $x_i$ (or $y_j$) of the intervals and the numbers $n_{ij}$ as the basis for calculation.
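A minimal sketch of how such a correlation table might be compiled from raw paired observations, assuming equal-length grouping intervals; the helper name and the simulated data are illustrative only.

```python
import numpy as np

def correlation_table(x, y, k_x=8, k_y=8):
    """Group paired observations into a correlation table: the counts n_ij
    of pairs falling into each cell, plus the interval centres."""
    counts, x_edges, y_edges = np.histogram2d(x, y, bins=[k_x, k_y])
    x_centres = (x_edges[:-1] + x_edges[1:]) / 2   # centres of the x grouping intervals
    y_centres = (y_edges[:-1] + y_edges[1:]) / 2   # centres of the y grouping intervals
    return counts, x_centres, y_centres

# Illustrative data: 500 simulated pairs with a roughly linear relationship.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.8 * x + rng.normal(scale=0.6, size=500)
n_ij, x_c, y_c = correlation_table(x, y)
```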
For more accurate information about the nature and strength of the relationship than that provided by the scatter plot, one turns to the correlation coefficient and the correlation ratio. The sample correlation coefficient is defined by the formula

$$\hat\rho = \frac{\sum_{i,j} n_{ij}(x_i - \bar x)(y_j - \bar y)}{n s_X s_Y},$$

where

$$\bar x = \frac{1}{n}\sum_i n_{i\cdot}x_i, \qquad \bar y = \frac{1}{n}\sum_j n_{\cdot j}y_j, \qquad n_{i\cdot} = \sum_j n_{ij}, \quad n_{\cdot j} = \sum_i n_{ij}, \quad n = \sum_{i,j} n_{ij},$$

and

$$s_X^2 = \frac{1}{n}\sum_i n_{i\cdot}(x_i - \bar x)^2, \qquad s_Y^2 = \frac{1}{n}\sum_j n_{\cdot j}(y_j - \bar y)^2.$$
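Continuing the illustrative sketch above, $\hat\rho$ can be computed from the table entries as follows; the array names are carried over from the previous fragment and are not the article's notation.

```python
import numpy as np

def sample_correlation(n_ij, x_c, y_c):
    """Sample correlation coefficient computed from the correlation table."""
    n = n_ij.sum()
    n_i = n_ij.sum(axis=1)                          # marginal counts over the x-intervals
    n_j = n_ij.sum(axis=0)                          # marginal counts over the y-intervals
    x_bar = (n_i * x_c).sum() / n
    y_bar = (n_j * y_c).sum() / n
    s_x = np.sqrt((n_i * (x_c - x_bar) ** 2).sum() / n)
    s_y = np.sqrt((n_j * (y_c - y_bar) ** 2).sum() / n)
    cov = (n_ij * np.outer(x_c - x_bar, y_c - y_bar)).sum() / n
    return cov / (s_x * s_y)

rho_hat = sample_correlation(n_ij, x_c, y_c)        # n_ij, x_c, y_c from the sketch above
```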
In the case of a large number of independent observations governed by one and the same near-normal distribution, $\hat\rho$ is a good approximation to the true correlation coefficient $\rho$. In all other cases the correlation ratio is recommended as the characteristic of the strength of the relationship, since its interpretation does not depend on the type of dependence being studied. The sample value $\hat\eta_{Y\mid X}^2$ is computed from the entries in the correlation table:

$$\hat\eta_{Y\mid X}^2 = \frac{\frac{1}{n}\sum_i n_{i\cdot}(\bar y_i - \bar y)^2}{s_Y^2},$$
where $\bar y_i = \frac{1}{n_{i\cdot}}\sum_j n_{ij}y_j$ and the numerator represents the spread of the conditional mean values $\bar y_i$ about the unconditional mean $\bar y$ (the sample value $\hat\eta_{X\mid Y}^2$ is defined analogously). The quantity $\hat\eta_{Y\mid X}^2 - \hat\rho^2$ is used as an indicator of the deviation of the regression from linearity.
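In the same illustrative setting, the sample correlation ratio and the linearity indicator can be computed as follows.

```python
import numpy as np

def sample_correlation_ratio(n_ij, y_c):
    """Sample correlation ratio (of Y on X) computed from the correlation table."""
    n = n_ij.sum()
    n_i = n_ij.sum(axis=1)                          # counts in each x grouping interval
    n_j = n_ij.sum(axis=0)
    y_bar = (n_j * y_c).sum() / n                   # unconditional mean of y
    s_y2 = (n_j * (y_c - y_bar) ** 2).sum() / n     # unconditional variance of y
    nonempty = n_i > 0
    y_bar_i = (n_ij[nonempty] * y_c).sum(axis=1) / n_i[nonempty]   # conditional means
    between = (n_i[nonempty] * (y_bar_i - y_bar) ** 2).sum() / n
    return between / s_y2

eta2_hat = sample_correlation_ratio(n_ij, y_c)
nonlinearity = eta2_hat - rho_hat ** 2    # indicator of the deviation of the regression from linearity
```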
The testing of hypotheses concerning the significance of a relationship is based on the distributions of the sample correlation characteristics. In the case of a normal distribution, the value of the sample correlation coefficient $\hat\rho$ is significantly distinct from zero if

$$|\hat\rho| > \left(1 + \frac{n - 2}{t_\alpha^2}\right)^{-1/2},$$

where $t_\alpha$ is the critical value of the Student $t$-distribution with $n - 2$ degrees of freedom corresponding to the chosen significance level $\alpha$. If $\rho \neq 0$, one usually uses the Fisher $z$-transform, replacing $\hat\rho$ by $z$ according to the formula

$$z = \frac{1}{2}\ln\frac{1 + \hat\rho}{1 - \hat\rho}.$$

Even for relatively small values of $n$ the distribution of $z$ is a good approximation to the normal distribution with mathematical expectation

$$\frac{1}{2}\ln\frac{1 + \rho}{1 - \rho} + \frac{\rho}{2(n - 1)}$$

and variance $1/(n - 3)$. On this basis one can define approximate confidence intervals for the true correlation coefficient $\rho$.
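A sketch of both procedures (the significance threshold for $\hat\rho$ and an approximate Fisher-$z$ confidence interval), using SciPy quantile functions; the code is illustrative, not part of the article.

```python
import numpy as np
from scipy import stats

def rho_is_significant(rho_hat, n, alpha=0.05):
    """Two-sided test: is the sample correlation coefficient significantly
    distinct from zero under a (near-)normal model?"""
    t_alpha = stats.t.ppf(1 - alpha / 2, df=n - 2)      # Student critical value
    return abs(rho_hat) > (1 + (n - 2) / t_alpha ** 2) ** -0.5

def rho_confidence_interval(rho_hat, n, alpha=0.05):
    """Approximate confidence interval for rho based on the Fisher z-transform
    (the small bias term rho / (2(n - 1)) is neglected)."""
    z = 0.5 * np.log((1 + rho_hat) / (1 - rho_hat))
    half_width = stats.norm.ppf(1 - alpha / 2) / np.sqrt(n - 3)
    return np.tanh(z - half_width), np.tanh(z + half_width)

print(rho_is_significant(0.35, n=50))       # True at the 5% level
print(rho_confidence_interval(0.35, n=50))  # roughly (0.08, 0.57)
```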
For the distribution of the sample correlation ratio and for tests of the linearity hypothesis for the regression, see [3].
References
[1] H. Cramér, "Mathematical methods of statistics", Princeton Univ. Press (1946)
[2] B.L. van der Waerden, "Mathematische Statistik", Springer (1957)
[3] M.G. Kendall, A. Stuart, "The advanced theory of statistics", 2. Inference and relationship, Griffin (1979)
[4] S.A. Aivazyan, "Statistical research on dependence", Moscow (1968) (In Russian)
Correlation (in statistics). Encyclopedia of Mathematics. URL: http://encyclopediaofmath.org/index.php?title=Correlation_(in_statistics)&oldid=11629