Difference between revisions of "Correlation (in statistics)"

From Encyclopedia of Mathematics
(Importing text file)
 
m (tex encoded by computer)
 
Line 1: Line 1:
A dependence between random variables not necessarily expressed by a rigorous functional relationship. Unlike functional dependence, a correlation is, as a rule, considered when one of the random variables depends not only on the other (given) one, but also on several random factors. The dependence between two random events is manifested in the fact that the conditional probability of one of them, given the occurrence of the other, differs from the unconditional probability. Similarly, the influence of one random variable on another is characterized by the conditional distributions of one of them, given fixed values of the other. Let <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c0265601.png" /> and <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c0265602.png" /> be random variables with given joint distribution, let <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c0265603.png" /> and <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c0265604.png" /> be the expectations of <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c0265605.png" /> and <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c0265606.png" />, let <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c0265607.png" /> and <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c0265608.png" /> be the variances of <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c0265609.png" /> and <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656010.png" />, and let <img align="absmiddle" border="0" 
src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656011.png" /> be the correlation coefficient of <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656012.png" /> and <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656013.png" />. Assume that for every possible value <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656014.png" /> the conditional mathematical expectation <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656015.png" /> of <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656016.png" /> is defined; then the function <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656017.png" /> is known as the [[Regression|regression]] of <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656018.png" /> given <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656019.png" />, and its graph is the regression curve of <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656020.png" /> given <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656021.png" />. 
The dependence of <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656022.png" /> on <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656023.png" /> is manifested in the variation of the mean values of <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656024.png" /> as <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656025.png" /> varies, although for each fixed value <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656026.png" />, <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656027.png" /> remains a random variable with a well-defined spread. In order to determine to what degree of accuracy the regression reproduces the variation of <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656028.png" /> as <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656029.png" /> varies, one uses the conditional variance of <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656030.png" /> for a given <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656031.png" /> or its mean value (a measure of the spread of <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656032.png" /> about the regression curve):
+
<!--
 +
c0265601.png
 +
$#A+1 = 129 n = 0
 +
$#C+1 = 129 : ~/encyclopedia/old_files/data/C026/C.0206560 Correlation (in statistics)
 +
Automatically converted into TeX, above some diagnostics.
 +
Please remove this comment and the {{TEX|auto}} line below,
 +
if TeX found to be correct.
 +
-->
  
<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656033.png" /></td> </tr></table>
+
{{TEX|auto}}
 +
{{TEX|done}}
  
If <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656034.png" /> and <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656035.png" /> are independent, then all conditional mathematical expectations of <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656036.png" /> are independent of <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656037.png" /> and coincide with the unconditional expectations: <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656038.png" />; and then also <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656039.png" />. When <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656040.png" /> is a function of <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656041.png" /> in the strict sense of the word, then for each <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656042.png" /> the variable <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656043.png" /> takes only one definite value and <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656044.png" />. 
Similarly one defines <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656045.png" /> (the regression of <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656046.png" /> given <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656047.png" />). A natural index of the concentration of the distribution near the regression curve <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656048.png" /> is the [[Correlation ratio|correlation ratio]]
+
A dependence between random variables not necessarily expressed by a rigorous functional relationship. Unlike functional dependence, a correlation is, as a rule, considered when one of the random variables depends not only on the other (given) one, but also on several random factors. The dependence between two random events is manifested in the fact that the conditional probability of one of them, given the occurrence of the other, differs from the unconditional probability. Similarly, the influence of one random variable on another is characterized by the conditional distributions of one of them, given fixed values of the other. Let  $  X $
 +
and $  Y $
 +
be random variables with given joint distribution, let  $  m _ {X} $
 +
and  $  m _ {Y} $
 +
be the expectations of  $  X $
 +
and  $  Y $,
 +
let  $  \sigma _ {X}  ^ {2} $
 +
and  $  \sigma _ {Y}  ^ {2} $
 +
be the variances of  $  X $
 +
and  $  Y $,
 +
and let  $  \rho $
 +
be the correlation coefficient of  $  X $
 +
and $  Y $.  
 +
Assume that for every possible value  $  X = x $
 +
the conditional mathematical expectation  $  y ( x) = {\mathsf E} [ Y \mid  X = x] $
 +
of  $  Y $
 +
is defined; then the function $  y ( x) $
 +
is known as the [[Regression|regression]] of $  Y $
 +
given  $  X $,
 +
and its graph is the regression curve of  $  Y $
 +
given  $  X $.  
 +
The dependence of  $  Y $
 +
on  $  X $
 +
is manifested in the variation of the mean values of  $  Y $
 +
as  $  X $
 +
varies, although for each fixed value  $  X = x $,
 +
$  Y $
 +
remains a random variable with a well-defined spread. In order to determine to what degree of accuracy the regression reproduces the variation of  $  Y $
 +
as  $  X $
 +
varies, one uses the conditional variance of $  Y $
 +
for a given $  X = x $
 +
or its mean value (a measure of the spread of $  Y $
 +
about the regression curve):
  
<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656049.png" /></td> </tr></table>
+
$$
 +
\sigma _ {Y \mid  X }  ^ {2}  = \
 +
{\mathsf E} [ Y - {\mathsf E} ( Y \mid  X = x)]  ^ {2} .
 +
$$
  
One has <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656050.png" /> if and only if the regression has the form <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656051.png" />, and in that case the correlation coefficient <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656052.png" /> vanishes and <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656053.png" /> is not correlated with <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656054.png" />. If the regression of <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656055.png" /> given <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656056.png" /> is linear, i.e. the regression curve is the straight line
+
If  $  X $
 +
and  $  Y $
 +
are independent, then all conditional mathematical expectations of  $  Y $
 +
are independent of  $  x $
 +
and coincide with the unconditional expectations:  $  y ( x) = m _ {Y} $;
 +
and then also  $  \sigma _ {Y \mid  X }  ^ {2} = \sigma _ {Y}  ^ {2} $.  
 +
When  $  Y $
 +
is a function of  $  X $
 +
in the strict sense of the word, then for each  $  X = x $
 +
the variable  $  Y $
 +
takes only one definite value and $  \sigma _ {Y \mid  X }  ^ {2} = 0 $.  
 +
Similarly one defines  $  x ( y) = {\mathsf E} [ X \mid  Y = y] $
 +
(the regression of $  X $
 +
given $  Y  $).  
 +
A natural index of the concentration of the distribution near the regression curve $  y ( x) $
 +
is the [[Correlation ratio|correlation ratio]]
  
<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656057.png" /></td> </tr></table>
+
$$
 +
\eta _ {Y \mid  X }  ^ {2}  = \
 +
1 -  
 +
\frac{\sigma _ {Y \mid  X }  ^ {2} }{\sigma _ {Y}  ^ {2} }
 +
.
 +
$$
 +
 
 +
One has  $  \eta _ {Y \mid  X }  ^ {2} = 0 $
 +
if and only if the regression has the form  $  y ( x) = m _ {Y} $,
 +
and in that case the correlation coefficient  $  \rho $
 +
vanishes and  $  Y $
 +
is not correlated with  $  X $.  
 +
If the regression of  $  Y $
 +
given  $  X $
 +
is linear, i.e. the regression curve is the straight line
 +
 
 +
$$
 +
y ( x)  =  m _ {Y} + \rho
 +
 
 +
\frac{\sigma _ {Y} }{\sigma _ {X} }
 +
 
 +
( x - m _ {X} ),
 +
$$
  
 
then
 
then
  
<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656058.png" /></td> </tr></table>
+
$$
 +
\sigma _ {Y \mid  X }  ^ {2}  = \
 +
\sigma _ {Y}  ^ {2} ( 1 - \rho  ^ {2} ) \ \
 +
\textrm{ and } \ \
 +
\eta _ {Y \mid  X }  ^ {2}  = \rho  ^ {2} ;
 +
$$
  
if, moreover, <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656059.png" />, then <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656060.png" /> is related to <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656061.png" /> through an exact linear dependence; but if <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656062.png" />, there is no functional dependence between <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656063.png" /> and <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656064.png" />. There is an exact functional dependence of <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656065.png" /> on <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656066.png" />, other than a linear one, if and only if <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656067.png" />. 
With rare exceptions, the practical use of the correlation coefficient as a measure of the lack of dependence is justifiable only when the joint distribution of <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656068.png" /> and <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656069.png" /> is normal (or close to normal), since in that case <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656070.png" /> implies that <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656071.png" /> and <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656072.png" /> are independent. Use of <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656073.png" /> as a measure of dependence for arbitrary random variables <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656074.png" /> and <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656075.png" /> frequently leads to erroneous conclusions, since <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656076.png" /> may vanish even when a functional dependence exists. 
If the joint distribution of <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656077.png" /> and <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656078.png" /> is normal, then both regression curves are straight lines and <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656079.png" /> uniquely determines the concentration of the distribution near the regression curves: When <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656080.png" /> the regression curves merge into one, corresponding to linear dependence between <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656081.png" /> and <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656082.png" />; when <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656083.png" /> one has independence.
+
if, moreover, $  | \rho | = 1 $,  
 +
then $  Y $
 +
is related to $  X $
 +
through an exact linear dependence; but if $  \eta _ {Y \mid  X }  ^ {2} = \rho  ^ {2} < 1 $,  
 +
there is no functional dependence between $  Y $
 +
and $  X $.  
 +
There is an exact functional dependence of $  Y $
 +
on $  X $,  
 +
other than a linear one, if and only if $  \rho  ^ {2} < \eta _ {Y \mid  X }  ^ {2} = 1 $.  
 +
With rare exceptions, the practical use of the correlation coefficient as a measure of the lack of dependence is justifiable only when the joint distribution of $  X $
 +
and $  Y $
 +
is normal (or close to normal), since in that case $  \rho = 0 $
 +
implies that $  X $
 +
and $  Y $
 +
are independent. Use of $  \rho $
 +
as a measure of dependence for arbitrary random variables $  X $
 +
and $  Y $
 +
frequently leads to erroneous conclusions, since $  \rho $
 +
may vanish even when a functional dependence exists. If the joint distribution of $  X $
 +
and $  Y $
 +
is normal, then both regression curves are straight lines and $  \rho $
 +
uniquely determines the concentration of the distribution near the regression curves: when $  | \rho | = 1 $
 +
the regression curves merge into one, corresponding to linear dependence between $  X $
 +
and $  Y $;  
 +
when $  \rho = 0 $
 +
one has independence.
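The warning that $ \rho $ may vanish even under a functional dependence can be checked on a small discrete example. The sketch below is illustrative only: the three-point distribution and the names `support` and `expect` are assumptions, not taken from the article. It takes $ X $ uniform on $ \{ -1, 0, 1 \} $ and $ Y = X ^ {2} $, and computes the covariance and the correlation ratio exactly.

```python
from fractions import Fraction

# Hypothetical toy distribution: X is -1, 0 or 1 with equal probability, Y = X**2.
support = [Fraction(-1), Fraction(0), Fraction(1)]
p = Fraction(1, 3)

def expect(f):
    """Expectation of f(X) under the uniform distribution on `support`."""
    return sum(p * f(x) for x in support)

m_x = expect(lambda x: x)                                # E[X]
m_y = expect(lambda x: x ** 2)                           # E[Y]
cov_xy = expect(lambda x: (x - m_x) * (x ** 2 - m_y))    # numerator of rho
var_y = expect(lambda x: (x ** 2 - m_y) ** 2)            # sigma_Y^2

# Y is a function of X, so Var(Y | X = x) = 0 for every x and the
# mean conditional variance sigma_{Y|X}^2 is 0.
mean_cond_var = Fraction(0)
eta_sq = 1 - mean_cond_var / var_y                       # correlation ratio

print("cov(X, Y) =", cov_xy, " eta^2 =", eta_sq)
```

In this example the covariance, and hence $ \rho $, is zero while $ \eta _ {Y \mid X} ^ {2} = 1 $, which is the case $ \rho ^ {2} < \eta _ {Y \mid X} ^ {2} = 1 $ of an exact non-linear functional dependence.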
  
When studying the interdependence of several random variables <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656084.png" /> with a given joint distribution, one uses multiple and partial correlation ratios and coefficients. The latter are evaluated using the ordinary correlation coefficients between <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656085.png" /> and <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656086.png" />, the totality of which form the [[Correlation matrix|correlation matrix]]. A measure of the linear relationship between <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656087.png" /> and the totality of the other variables <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656088.png" /> is provided by the [[Multiple-correlation coefficient|multiple-correlation coefficient]]. 
If the mutual relationship of <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656089.png" /> and <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656090.png" /> is assumed to be determined by the influence of the other variables <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656091.png" />, then the [[Partial correlation coefficient|partial correlation coefficient]] of <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656092.png" /> and <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656093.png" /> with respect to <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656094.png" /> is an index of the linear relationship between <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656095.png" /> and <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656096.png" /> relative to <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656097.png" />.
+
When studying the interdependence of several random variables $  X _ {1} \dots X _ {n} $
 +
with a given joint distribution, one uses multiple and partial correlation ratios and coefficients. The latter are evaluated using the ordinary correlation coefficients between $  X _ {i} $
 +
and $  X _ {j} $,  
 +
the totality of which form the [[Correlation matrix|correlation matrix]]. A measure of the linear relationship between $  X _ {1} $
 +
and the totality of the other variables $  X _ {2} \dots X _ {n} $
 +
is provided by the [[Multiple-correlation coefficient|multiple-correlation coefficient]]. If the mutual relationship of $  X _ {1} $
 +
and $  X _ {2} $
 +
is assumed to be determined by the influence of the other variables $  X _ {3} \dots X _ {n} $,  
 +
then the [[Partial correlation coefficient|partial correlation coefficient]] of $  X _ {1} $
 +
and $  X _ {2} $
 +
with respect to $  X _ {3} \dots X _ {n} $
 +
is an index of the linear relationship between $  X _ {1} $
 +
and $  X _ {2} $
 +
relative to $  X _ {3} \dots X _ {n} $.
  
 
For measures of correlation based on rank statistics (cf. [[Rank statistic|Rank statistic]]) see [[Kendall coefficient of rank correlation|Kendall coefficient of rank correlation]]; [[Spearman coefficient of rank correlation|Spearman coefficient of rank correlation]].
 
For measures of correlation based on rank statistics (cf. [[Rank statistic|Rank statistic]]) see [[Kendall coefficient of rank correlation|Kendall coefficient of rank correlation]]; [[Spearman coefficient of rank correlation|Spearman coefficient of rank correlation]].
Line 23: Line 146:
 
Mathematical statisticians have developed methods for estimating coefficients that characterize the correlation between random variables or tests; there are also methods to test hypotheses concerning their values, using their sampling analogues. These methods are collectively known as correlation analysis. Correlation analysis of statistical data consists of the following basic practical steps: 1) the construction of a scatter plot and the compilation of a correlation table; 2) the computation of sampling correlation ratios or correlation coefficients; 3) testing statistical hypotheses concerning the significance of the dependence. Further investigation may consist in establishing the concrete form of the dependence between the variables (see [[Regression|Regression]]).
 
Mathematical statisticians have developed methods for estimating coefficients that characterize the correlation between random variables or tests; there are also methods to test hypotheses concerning their values, using their sampling analogues. These methods are collectively known as correlation analysis. Correlation analysis of statistical data consists of the following basic practical steps: 1) the construction of a scatter plot and the compilation of a correlation table; 2) the computation of sampling correlation ratios or correlation coefficients; 3) testing statistical hypotheses concerning the significance of the dependence. Further investigation may consist in establishing the concrete form of the dependence between the variables (see [[Regression|Regression]]).
  
Among the aids to analysis of two-dimensional sample data are the scatter plot and the correlation table. The scatter plot is obtained by plotting the sample points on the coordinate plane. Examination of the configuration formed by the points of the scatter plot yields a preliminary idea of the type of dependence between the random variables (e.g. whether one of the variables increases or decreases on the average as the other increases). Prior to numerical processing, the results are usually grouped and presented in the form of a correlation table. In each entry of this table one writes the number <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656098.png" /> of pairs <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c02656099.png" /> with components in the appropriate grouping intervals. Assuming that the grouping intervals (in each of the variables) are equal in length, one takes the centres <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560100.png" /> (or <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560101.png" />) of the intervals and the numbers <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560102.png" /> as the basis for calculation.
+
Among the aids to analysis of two-dimensional sample data are the scatter plot and the correlation table. The scatter plot is obtained by plotting the sample points on the coordinate plane. Examination of the configuration formed by the points of the scatter plot yields a preliminary idea of the type of dependence between the random variables (e.g. whether one of the variables increases or decreases on the average as the other increases). Prior to numerical processing, the results are usually grouped and presented in the form of a correlation table. In each entry of this table one writes the number $  n _ {ij} $
 +
of pairs $  ( x, y) $
 +
with components in the appropriate grouping intervals. Assuming that the grouping intervals (in each of the variables) are equal in length, one takes the centres $  x _ {i} $
 +
(or $  y _ {j} $)
 +
of the intervals and the numbers $  n _ {ij} $
 +
as the basis for calculation.
  
 
For more accurate information about the nature and strength of the relationship than that provided by the scatter plot, one turns to the correlation coefficient and correlation ratio. The sample correlation coefficient is defined by the formula
 
For more accurate information about the nature and strength of the relationship than that provided by the scatter plot, one turns to the correlation coefficient and correlation ratio. The sample correlation coefficient is defined by the formula
  
<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560103.png" /></td> </tr></table>
+
$$
 +
\widehat \rho    = \
 +
 
 +
\frac{\sum _ { i } \sum _ { j }
 +
( x _ {i} - \overline{x}\; ) ( y _ {j} - \overline{y}\; ) n _ {ij} }{\sqrt {\sum _ { i } n _ {i  \cdot }  ( x _ {i} - \overline{x}\; )  ^ {2} }
 +
\sqrt {\sum _ { j } n _ {\cdot  j }  ( y _ {j} - \overline{y}\; )  ^ {2} } }
 +
,
 +
$$
  
 
where
 
where
  
<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560104.png" /></td> </tr></table>
+
$$
 +
n _ {i  \cdot }  = \
 +
\sum _ { j } n _ {ij} ,\ \
 +
n _ {\cdot  j }  = \
 +
\sum _ { i } n _ {ij}  $$
  
 
and
 
and
  
<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560105.png" /></td> </tr></table>
+
$$
 +
\overline{x}\;  = \
 +
 
 +
\frac{\sum _ { i } n _ {i  \cdot }  x _ {i} }{n}
 +
,\ \
 +
\overline{y}\;  = \
 +
 
 +
\frac{\sum _ { j } n _ {\cdot  j }  y _ {j} }{n}
 +
.
 +
$$
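As a rough illustration of the formula for $ \widehat \rho $, the following sketch computes it from a correlation table. The function name `grouped_correlation`, the argument layout (`counts[i][j]` holding $ n _ {ij} $) and the example table are assumptions made for this illustration, not part of the article.

```python
import math

def grouped_correlation(x_centres, y_centres, counts):
    """Sample correlation coefficient from a correlation table.

    counts[i][j] is n_ij, the number of pairs in the i-th x-interval
    and the j-th y-interval; x_centres and y_centres are the interval centres.
    """
    n_row = [sum(row) for row in counts]           # n_{i.} = sum_j n_ij
    n_col = [sum(col) for col in zip(*counts)]     # n_{.j} = sum_i n_ij
    n = sum(n_row)
    x_bar = sum(w * x for w, x in zip(n_row, x_centres)) / n
    y_bar = sum(w * y for w, y in zip(n_col, y_centres)) / n
    cross = sum(
        counts[i][j] * (x - x_bar) * (y - y_bar)
        for i, x in enumerate(x_centres)
        for j, y in enumerate(y_centres)
    )
    ss_x = sum(w * (x - x_bar) ** 2 for w, x in zip(n_row, x_centres))
    ss_y = sum(w * (y - y_bar) ** 2 for w, y in zip(n_col, y_centres))
    return cross / (math.sqrt(ss_x) * math.sqrt(ss_y))

# Hypothetical 3 x 3 correlation table, for illustration only.
table = [[4, 2, 0],
         [1, 6, 1],
         [0, 2, 4]]
rho_hat = grouped_correlation([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], table)
print(round(rho_hat, 4))
```

The sketch assumes that both marginal sums of squares are non-zero; a table concentrated in a single row or column would lead to a division by zero.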
 +
 
 +
In the case of a large number of independent observations, governed by one and the same near-normal distribution,  $  \widehat \rho  $
 +
is a good approximation to the true correlation coefficient  $  \rho $.
 +
In all other cases, the correlation ratio is recommended as a characteristic of the strength of the relationship; its interpretation is independent of the type of dependence being studied. The sample value  $  \widehat \eta  {} _ {Y \mid  X }  ^ {2} $
 +
is computed from the entries in the correlation table:
 +
 
 +
$$
 +
\widehat \eta  {} _ {Y \mid  X }  ^ {2}  = \
 +
 
 +
\frac{ {
 +
\frac{1}{n}
 +
} \sum _ { i } n _ {i  \cdot }
 +
( \overline{y}\; _ {i} - \overline{y}\; )  ^ {2} }{ {
 +
\frac{1}{n}
 +
} \sum _ { j } n _ {\cdot  j }
 +
( y _ {j} - \overline{y}\; )  ^ {2} }
 +
,
 +
$$
 +
 
 +
where the numerator represents the spread of the conditional mean values  $  \overline{y}\; _ {i} $
 +
about the unconditional mean  $  \overline{y}\; $
 +
(the sample value  $  \widehat \eta  {} _ {X \mid  Y }  ^ {2} $
 +
is defined analogously). The quantity  $  \widehat \eta  {} _ {Y \mid  X }  ^ {2} - \widehat \rho  {}  ^ {2} $
 +
is used as an indicator of the deviation of the regression from linearity.
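The sample correlation ratio can be sketched in the same way. The code below assumes that the conditional mean $ \overline{y}\; _ {i} $ is the mean of the $ y $-centres in the $ i $-th row of the table, weighted by $ n _ {ij} $; the function name and the two example tables are hypothetical. The common factor $ 1/n $ cancels and is omitted.

```python
def grouped_correlation_ratio(y_centres, counts):
    """Sample correlation ratio of Y given X from a correlation table.

    counts[i][j] is n_ij; the x-centres are not needed, because only the
    grouping by rows enters the formula.
    """
    n_row = [sum(row) for row in counts]           # n_{i.}
    n_col = [sum(col) for col in zip(*counts)]     # n_{.j}
    n = sum(n_row)
    y_bar = sum(w * y for w, y in zip(n_col, y_centres)) / n
    between = 0.0
    for row, w in zip(counts, n_row):
        if w == 0:
            continue                               # empty row: no conditional mean
        y_bar_i = sum(c * y for c, y in zip(row, y_centres)) / w
        between += w * (y_bar_i - y_bar) ** 2      # spread of conditional means
    total = sum(w * (y - y_bar) ** 2 for w, y in zip(n_col, y_centres))
    return between / total

# Hypothetical table whose conditional means are linear in the row index.
linear_table = [[4, 2, 0],
                [1, 6, 1],
                [0, 2, 4]]
eta_linear = grouped_correlation_ratio([10.0, 20.0, 30.0], linear_table)

# Hypothetical table in which y is an exact function of the row (first and third rows share a value).
exact_table = [[0, 4],
               [4, 0],
               [0, 4]]
eta_exact = grouped_correlation_ratio([0.0, 1.0], exact_table)
print(eta_linear, eta_exact)
```

For the second table the ratio equals 1 up to rounding, although the pattern is not linear in the rows; this is the situation in which the difference between the squared correlation ratio and the squared correlation coefficient is large.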
 +
 
 +
The testing of hypotheses concerning the significance of a relationship is based on the distributions of the sample correlation characteristics. In the case of a normal distribution, the value of the sample correlation coefficient  $  \widehat \rho  $
 +
is significantly distinct from zero if
 +
 
 +
$$
 +
( \widehat \rho  )  ^ {2}  > \
 +
\left [ 1 +
 +
 
 +
\frac{n - 2 }{t _  \alpha  ^ {2} }
  
In the case of a large number of independent observations, governed by one and the same near-normal distribution, <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560106.png" /> is a good approximation to the true correlation coefficient <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560107.png" />. In all other cases, as characteristic of strength of the relationship the correlation ratio is recommended, the interpretation of which is independent of the type of dependence being studied. The sample value <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560108.png" /> is computed from the entries in the correlation table:
+
\right ]  ^ {-1} ,
 +
$$
  
<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560109.png" /></td> </tr></table>
+
where  $  t _  \alpha  $
 +
is the critical value of the Student  $  t $-
 +
distribution with  $  ( n - 2) $
 +
degrees of freedom corresponding to the chosen significance level  $  \alpha $.
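The significance criterion can be sketched as follows. The function names are hypothetical, and the critical value $ t _ \alpha $ is passed in by the caller (from a table or a statistics library), so the choice between a one-sided and a two-sided point is left open here.

```python
import math

def critical_rho_squared(n, t_alpha):
    """Threshold [1 + (n - 2) / t_alpha**2]**(-1) for the squared sample correlation."""
    if n <= 2:
        raise ValueError("need n > 2 observations")
    return 1.0 / (1.0 + (n - 2) / t_alpha ** 2)

def is_significant(rho_hat, n, t_alpha):
    """True if rho_hat**2 exceeds the threshold, i.e. rho_hat differs significantly from zero."""
    return rho_hat ** 2 > critical_rho_squared(n, t_alpha)

# Illustrative input: 2.060 is the commonly tabulated two-sided 5% point of the
# Student t-distribution with 25 degrees of freedom (n = 27 observations).
threshold = critical_rho_squared(27, 2.060)
print(math.sqrt(threshold))    # smallest |rho_hat| judged significant
print(is_significant(0.5, 27, 2.060), is_significant(0.3, 27, 2.060))
```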
 +
If  $  \rho \neq 0 $
 +
one usually uses the Fisher  $  z $-
 +
transform, with  $  \widehat \rho  $
 +
replaced by  $  z $
 +
according to the formula
  
where the numerator represents the spread of the conditional mean values <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560110.png" /> about the unconditional mean <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560111.png" /> (the sample value <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560112.png" /> is defined analogously). The quantity <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560113.png" /> is used as an indicator of the deviation of the regression from linearity.
+
$$
 +
= {
 +
\frac{1}{2}
 +
}
 +
\mathop{\rm ln} \left (
  
The testing of hypotheses concerning the significance of a relationship are based on the distributions of the sample correlation characteristics. In the case of a normal distribution, the value of the sample correlation coefficient <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560114.png" /> is significantly distinct from zero if
+
\frac{1 + \widehat \rho  }{1 - \widehat \rho  }
  
<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560115.png" /></td> </tr></table>
+
\right ) .
 +
$$
  
where <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560116.png" /> is the critical value of the Student <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560117.png" />-distribution with <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560118.png" /> degrees of freedom corresponding to the chosen significance level <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560119.png" />. If <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560120.png" /> one usually uses the Fisher <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560121.png" />-transform, with <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560122.png" /> replaced by <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560123.png" /> according to the formula
+
Even at relatively small values  $  n $
 +
the distribution of $  z $
 +
is a good approximation to the normal distribution with mathematical expectation
  
<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560124.png" /></td> </tr></table>
+
$$
  
Even at relatively small values <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560125.png" /> the distribution of <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560126.png" /> is a good approximation to the normal distribution with mathematical expectation
+
\frac{1}{2}
 +
  \mathop{\rm ln} 
 +
\frac{1+ \rho }{1 - \rho }
 +
+
 +
\frac \rho {2( n - 1) }
  
<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560127.png" /></td> </tr></table>
+
$$
  
and variance <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560128.png" />. On this basis one can now define approximate confidence intervals for the true correlation coefficient <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/c/c026/c026560/c026560129.png" />.
+
and variance $  1/( n - 3) $.  
 +
On this basis one can now define approximate confidence intervals for the true correlation coefficient $  \rho $.
  
 
For the distribution of the sample correlation ratio and for tests of the linearity hypothesis for the regression, see [[#References|[3]]].
 
For the distribution of the sample correlation ratio and for tests of the linearity hypothesis for the regression, see [[#References|[3]]].

Latest revision as of 17:31, 5 June 2020


A dependence between random variables not necessarily expressed by a rigorous functional relationship. Unlike functional dependence, a correlation is, as a rule, considered when one of the random variables depends not only on the other (given) one, but also on several random factors. The dependence between two random events is manifested in the fact that the conditional probability of one of them, given the occurrence of the other, differs from the unconditional probability. Similarly, the influence of one random variable on another is characterized by the conditional distributions of one of them, given fixed values of the other. Let $ X $ and $ Y $ be random variables with given joint distribution, let $ m _ {X} $ and $ m _ {Y} $ be the expectations of $ X $ and $ Y $, let $ \sigma _ {X} ^ {2} $ and $ \sigma _ {Y} ^ {2} $ be the variances of $ X $ and $ Y $, and let $ \rho $ be the correlation coefficient of $ X $ and $ Y $. Assume that for every possible value $ X = x $ the conditional mathematical expectation $ y ( x) = {\mathsf E} [ Y \mid X = x] $ of $ Y $ is defined; then the function $ y ( x) $ is known as the regression of $ Y $ given $ X $, and its graph is the regression curve of $ Y $ given $ X $. The dependence of $ Y $ on $ X $ is manifested in the variation of the mean values of $ Y $ as $ X $ varies, although for each fixed value $ X = x $, $ Y $ remains a random variable with a well-defined spread. In order to determine to what degree of accuracy the regression reproduces the variation of $ Y $ as $ X $ varies, one uses the conditional variance of $ Y $ for a given $ X = x $ or its mean value (a measure of the spread of $ Y $ about the regression curve):

$$ \sigma _ {Y \mid X } ^ {2} = \ {\mathsf E} [ Y - {\mathsf E} ( Y \mid X = x)] ^ {2} . $$

If $ X $ and $ Y $ are independent, then all conditional mathematical expectations of $ Y $ are independent of $ x $ and coincide with the unconditional expectations: $ y ( x) = m _ {Y} $; and then also $ \sigma _ {Y \mid X } ^ {2} = \sigma _ {Y} ^ {2} $. When $ Y $ is a function of $ X $ in the strict sense of the word, then for each $ X = x $ the variable $ Y $ takes only one definite value and $ \sigma _ {Y \mid X } ^ {2} = 0 $. Similarly one defines $ x ( y) = {\mathsf E} [ X \mid Y = y] $ (the regression of $ X $ given $ Y $). A natural index of the concentration of the distribution near the regression curve $ y ( x) $ is the correlation ratio

$$ \eta _ {Y \mid X } ^ {2} = \ 1 - \frac{\sigma _ {Y \mid X } ^ {2} }{\sigma _ {Y} ^ {2} } . $$

One has $ \eta _ {Y \mid X } ^ {2} = 0 $ if and only if the regression has the form $ y ( x) = m _ {Y} $, and in that case the correlation coefficient $ \rho $ vanishes and $ Y $ is not correlated with $ X $. If the regression of $ Y $ given $ X $ is linear, i.e. the regression curve is the straight line

$$ y ( x) = m _ {Y} + \rho \frac{\sigma _ {Y} }{\sigma _ {X} } ( x - m _ {X} ), $$

then

$$ \sigma _ {Y \mid X } ^ {2} = \ \sigma _ {Y} ^ {2} ( 1 - \rho ^ {2} ) \ \ \textrm{ and } \ \ \eta _ {Y \mid X } ^ {2} = \rho ^ {2} ; $$

if, moreover, $ | \rho | = 1 $, then $ Y $ is related to $ X $ through an exact linear dependence; but if $ \eta _ {Y \mid X } ^ {2} = \rho ^ {2} < 1 $, there is no functional dependence between $ Y $ and $ X $. There is an exact functional dependence of $ Y $ on $ X $, other than a linear one, if and only if $ \rho ^ {2} < \eta _ {Y \mid X } ^ {2} = 1 $. With rare exceptions, the practical use of the correlation coefficient as a measure of the lack of dependence is justifiable only when the joint distribution of $ X $ and $ Y $ is normal (or close to normal), since in that case $ \rho = 0 $ implies that $ X $ and $ Y $ are independent. Use of $ \rho $ as a measure of dependence for arbitrary random variables $ X $ and $ Y $ frequently leads to erroneous conclusions, since $ \rho $ may vanish even when a functional dependence exists. If the joint distribution of $ X $ and $ Y $ is normal, then both regression curves are straight lines and $ \rho $ uniquely determines the concentration of the distribution near the regression curves: When $ | \rho | = 1 $ the regression curves merge into one, corresponding to linear dependence between $ X $ and $ Y $; when $ \rho = 0 $ one has independence.
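The warning that $ \rho $ may vanish even when a functional dependence exists can be checked numerically. The following is a minimal sketch, not part of the original article: the helper name `rho_and_eta2` and the toy distribution ($ X $ uniform on $ \{ -1, 0, 1 \} $ with $ Y = X ^ {2} $) are assumptions chosen for illustration. It evaluates $ \rho $ and $ \eta _ {Y \mid X } ^ {2} $ directly from the definitions above for a finite joint distribution.

```python
from math import sqrt


def rho_and_eta2(pmf):
    """Return (rho, eta2_Y_given_X) for a finite joint distribution.

    pmf maps (x, y) pairs to probabilities that sum to 1.
    (Hypothetical helper, written only for this illustration.)
    """
    m_x = sum(p * x for (x, y), p in pmf.items())
    m_y = sum(p * y for (x, y), p in pmf.items())
    var_x = sum(p * (x - m_x) ** 2 for (x, y), p in pmf.items())
    var_y = sum(p * (y - m_y) ** 2 for (x, y), p in pmf.items())
    cov = sum(p * (x - m_x) * (y - m_y) for (x, y), p in pmf.items())

    # Marginal distribution of X and the regression y(x) = E[Y | X = x].
    p_x = {}
    y_weighted = {}
    for (x, y), p in pmf.items():
        p_x[x] = p_x.get(x, 0.0) + p
        y_weighted[x] = y_weighted.get(x, 0.0) + p * y
    regression = {x: y_weighted[x] / p_x[x] for x in p_x}

    # Mean conditional variance: the spread of Y about the regression curve.
    cond_var = sum(p * (y - regression[x]) ** 2 for (x, y), p in pmf.items())

    rho = cov / sqrt(var_x * var_y)
    eta2 = 1.0 - cond_var / var_y
    return rho, eta2


# Y = X^2 with X uniform on {-1, 0, 1}: an exact, non-linear dependence.
pmf = {(-1, 1): 1 / 3, (0, 0): 1 / 3, (1, 1): 1 / 3}
rho, eta2 = rho_and_eta2(pmf)
print(rho, eta2)
```

For this distribution the covariance is zero while the conditional variance vanishes, so the sketch reproduces the case $ \rho ^ {2} < \eta _ {Y \mid X } ^ {2} = 1 $ described above.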

When studying the interdependence of several random variables $ X _ {1}, \dots, X _ {n} $ with a given joint distribution, one uses multiple and partial correlation ratios and coefficients. The latter are evaluated using the ordinary correlation coefficients between $ X _ {i} $ and $ X _ {j} $, the totality of which forms the correlation matrix. A measure of the linear relationship between $ X _ {1} $ and the totality of the other variables $ X _ {2}, \dots, X _ {n} $ is provided by the multiple-correlation coefficient. If the mutual relationship of $ X _ {1} $ and $ X _ {2} $ is assumed to be determined by the influence of the other variables $ X _ {3}, \dots, X _ {n} $, then the partial correlation coefficient of $ X _ {1} $ and $ X _ {2} $ with respect to $ X _ {3}, \dots, X _ {n} $ is an index of the linear relationship between $ X _ {1} $ and $ X _ {2} $ relative to $ X _ {3}, \dots, X _ {n} $.
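The article does not state the formulas by which these coefficients are obtained from the correlation matrix. As a hedged illustration for the simplest case of three variables, the sketch below uses the standard textbook expressions $ \rho _ {12 \cdot 3} = ( \rho _ {12} - \rho _ {13} \rho _ {23} ) / \sqrt {( 1 - \rho _ {13} ^ {2} ) ( 1 - \rho _ {23} ^ {2} ) } $ and $ R _ {1 \cdot 23} ^ {2} = ( \rho _ {12} ^ {2} + \rho _ {13} ^ {2} - 2 \rho _ {12} \rho _ {13} \rho _ {23} ) / ( 1 - \rho _ {23} ^ {2} ) $. The function names and the numerical values are assumptions made for this example only.

```python
from math import sqrt


def partial_correlation(r12, r13, r23):
    """First-order partial correlation of X1 and X2 with respect to X3."""
    return (r12 - r13 * r23) / sqrt((1 - r13 ** 2) * (1 - r23 ** 2))


def multiple_correlation_squared(r12, r13, r23):
    """Squared multiple-correlation coefficient of X1 on (X2, X3)."""
    return (r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23) / (1 - r23 ** 2)


# Illustrative pairwise correlation coefficients (not taken from any data set).
r12, r13, r23 = 0.6, 0.5, 0.5
partial = partial_correlation(r12, r13, r23)
multiple_sq = multiple_correlation_squared(r12, r13, r23)
print(partial, multiple_sq)
```

For more than three variables the same quantities are usually computed from the inverse of the correlation matrix; that general case is not sketched here.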

For measures of correlation based on rank statistics (cf. Rank statistic) see Kendall coefficient of rank correlation; Spearman coefficient of rank correlation.

Mathematical statisticians have developed methods for estimating coefficients that characterize the correlation between random variables or tests; there are also methods to test hypotheses concerning their values, using their sampling analogues. These methods are collectively known as correlation analysis. Correlation analysis of statistical data consists of the following basic practical steps: 1) the construction of a scatter plot and the compilation of a correlation table; 2) the computation of sampling correlation ratios or correlation coefficients; 3) the testing of statistical hypotheses concerning the significance of the dependence. Further investigation may consist in establishing the concrete form of the dependence between the variables (see Regression).

Among the aids to analysis of two-dimensional sample data are the scatter plot and the correlation table. The scatter plot is obtained by plotting the sample points on the coordinate plane. Examination of the configuration formed by the points of the scatter plot yields a preliminary idea of the type of dependence between the random variables (e.g. whether one of the variables increases or decreases on the average as the other increases). Prior to numerical processing, the results are usually grouped and presented in the form of a correlation table. In each entry of this table one writes the number $ n _ {ij} $ of pairs $ ( x, y) $ with components in the appropriate grouping intervals. Assuming that the grouping intervals (in each of the variables) are equal in length, one takes the centres $ x _ {i} $ (or $ y _ {j} $) of the intervals and the numbers $ n _ {ij} $ as the basis for calculation.
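The grouping step can be sketched as follows. This is only an illustration: the function name `correlation_table`, the convention of half-open intervals $ [ a, b) $ and the sample points are assumptions, not part of the original text.

```python
from bisect import bisect_right


def correlation_table(pairs, x_edges, y_edges):
    """Group (x, y) pairs into counts n[i][j] over a rectangular grid.

    x_edges and y_edges are increasing lists of interval boundaries.
    Cell (i, j) counts the pairs with x_edges[i] <= x < x_edges[i + 1]
    and y_edges[j] <= y < y_edges[j + 1] (half-open intervals, an
    assumption of this sketch). Returns (x_centres, y_centres, n).
    """
    x_centres = [(a + b) / 2 for a, b in zip(x_edges, x_edges[1:])]
    y_centres = [(a + b) / 2 for a, b in zip(y_edges, y_edges[1:])]
    n = [[0] * len(y_centres) for _ in x_centres]
    for x, y in pairs:
        i = bisect_right(x_edges, x) - 1
        j = bisect_right(y_edges, y) - 1
        if not (0 <= i < len(x_centres) and 0 <= j < len(y_centres)):
            raise ValueError(f"point {(x, y)} lies outside the grouping intervals")
        n[i][j] += 1
    return x_centres, y_centres, n


# Five illustrative sample points grouped into a 2 x 2 table.
pairs = [(0.2, 1.1), (0.7, 1.9), (1.4, 2.5), (1.6, 3.2), (0.9, 0.4)]
x_centres, y_centres, n = correlation_table(pairs, [0, 1, 2], [0, 2, 4])
print(x_centres, y_centres, n)
```

The returned centres and counts are exactly the quantities $ x _ {i} $, $ y _ {j} $ and $ n _ {ij} $ on which the subsequent calculations are based.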

For more accurate information about the nature and strength of the relationship than that provided by the scatter plot, one turns to the correlation coefficient and correlation ratio. The sample correlation coefficient is defined by the formula

$$ \widehat \rho = \ \frac{\sum _ { i } \sum _ { j } ( x _ {i} - \overline{x}\; ) ( y _ {j} - \overline{y}\; ) n _ {ij} }{\sqrt {\sum _ { i } n _ {i \cdot } ( x _ {i} - \overline{x}\; ) ^ {2} } \sqrt {\sum _ { j } n _ {\cdot j } ( y _ {j} - \overline{y}\; ) ^ {2} } } , $$

where

$$ n _ {i \cdot } = \ \sum _ { j } n _ {ij} ,\ \ n _ {\cdot j } = \ \sum _ { i } n _ {ij} $$

and

$$ \overline{x}\; = \ \frac{\sum _ { i } n _ {i \cdot } x _ {i} }{n} ,\ \ \overline{y}\; = \ \frac{\sum _ { j } n _ {\cdot j } y _ {j} }{n} . $$
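The three formulas above translate directly into a short computation. The sketch below is an illustration under stated assumptions: the function name `sample_correlation` and the small correlation table are invented for the example.

```python
from math import sqrt


def sample_correlation(x_centres, y_centres, n):
    """Sample correlation coefficient computed from a correlation table.

    n[i][j] is the number of pairs in the cell with centres
    (x_centres[i], y_centres[j]).
    """
    total = sum(sum(row) for row in n)
    n_row = [sum(row) for row in n]                                    # n_{i.}
    n_col = [sum(row[j] for row in n) for j in range(len(y_centres))]  # n_{.j}

    x_bar = sum(c * x for c, x in zip(n_row, x_centres)) / total
    y_bar = sum(c * y for c, y in zip(n_col, y_centres)) / total

    numerator = sum(
        n[i][j] * (x_centres[i] - x_bar) * (y_centres[j] - y_bar)
        for i in range(len(x_centres))
        for j in range(len(y_centres))
    )
    spread_x = sum(c * (x - x_bar) ** 2 for c, x in zip(n_row, x_centres))
    spread_y = sum(c * (y - y_bar) ** 2 for c, y in zip(n_col, y_centres))
    return numerator / (sqrt(spread_x) * sqrt(spread_y))


# Illustrative 3 x 2 table with 14 observations.
table = [[4, 1], [2, 2], [1, 4]]
rho_hat = sample_correlation([1, 2, 3], [10, 20], table)
print(rho_hat)
```

For this table $ \overline{x}\; = 2 $, $ \overline{y}\; = 15 $, the numerator equals 30 and the denominator equals $ \sqrt {3500} $, so that $ \widehat \rho = 3 / \sqrt {35} $.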

In the case of a large number of independent observations, governed by one and the same near-normal distribution, $ \widehat \rho $ is a good approximation to the true correlation coefficient $ \rho $. In all other cases the correlation ratio is recommended as a characteristic of the strength of the relationship, since its interpretation is independent of the type of dependence being studied. The sample value $ \widehat \eta {} _ {Y \mid X } ^ {2} $ is computed from the entries in the correlation table:

$$ \widehat \eta {} _ {Y \mid X } ^ {2} = \ \frac{ { \frac{1}{n} } \sum _ { i } n _ {i \cdot } ( \overline{y}\; _ {i} - \overline{y}\; ) ^ {2} }{ { \frac{1}{n} } \sum _ { j } n _ {\cdot j } ( y _ {j} - \overline{y}\; ) ^ {2} } , $$

where the numerator represents the spread of the conditional mean values $ \overline{y}\; _ {i} $ about the unconditional mean $ \overline{y}\; $ (the sample value $ \widehat \eta {} _ {X \mid Y } ^ {2} $ is defined analogously). The quantity $ \widehat \eta {} _ {Y \mid X } ^ {2} - \widehat \rho {} ^ {2} $ is used as an indicator of the deviation of the regression from linearity.
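A minimal sketch of this computation is given below. The function name `sample_correlation_ratio` and the table are assumptions made for illustration; the conditional means $ \overline{y}\; _ {i} $ are taken to be the averages of the column centres $ y _ {j} $ weighted by the counts in row $ i $, and empty rows are skipped.

```python
def sample_correlation_ratio(y_centres, n):
    """Sample correlation ratio of Y given X from a correlation table.

    n[i][j] is the number of pairs in row i (x-interval) and column j
    (y-interval). The common factor 1/n cancels and is omitted.
    """
    total = sum(sum(row) for row in n)
    n_row = [sum(row) for row in n]                                    # n_{i.}
    n_col = [sum(row[j] for row in n) for j in range(len(y_centres))]  # n_{.j}
    y_bar = sum(c * y for c, y in zip(n_col, y_centres)) / total

    # Spread of the conditional means about the overall mean.
    between = 0.0
    for row, count in zip(n, n_row):
        if count == 0:
            continue  # an empty row has no conditional mean
        y_bar_i = sum(c * y for c, y in zip(row, y_centres)) / count
        between += count * (y_bar_i - y_bar) ** 2

    overall = sum(c * (y - y_bar) ** 2 for c, y in zip(n_col, y_centres))
    return between / overall


# Illustrative table in which the conditional means rise and then fall.
table = [[4, 1], [1, 4], [4, 1]]
eta2_hat = sample_correlation_ratio([10, 20], table)
print(eta2_hat)
```

With row centres $ x _ {i} = 1, 2, 3 $ the sample correlation coefficient of this table is zero, whereas $ \widehat \eta {} _ {Y \mid X } ^ {2} = 1/3 $; the positive difference $ \widehat \eta {} _ {Y \mid X } ^ {2} - \widehat \rho {} ^ {2} $ reflects the non-linear shape of the regression.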

The testing of hypotheses concerning the significance of a relationship is based on the distributions of the sample correlation characteristics. In the case of a normal distribution, the value of the sample correlation coefficient $ \widehat \rho $ is significantly distinct from zero if

$$ ( \widehat \rho ) ^ {2} > \ \left [ 1 + \frac{n - 2 }{t _ \alpha ^ {2} } \right ] ^ {-1} , $$

where $ t _ \alpha $ is the critical value of the Student $ t $-distribution with $ ( n - 2) $ degrees of freedom corresponding to the chosen significance level $ \alpha $. If $ \rho \neq 0 $ one usually uses the Fisher $ z $-transform, with $ \widehat \rho $ replaced by $ z $ according to the formula

$$ z = { \frac{1}{2} } \mathop{\rm ln} \left ( \frac{1 + \widehat \rho }{1 - \widehat \rho } \right ) . $$

Even for relatively small values of $ n $ the distribution of $ z $ is a good approximation to the normal distribution with mathematical expectation

$$ \frac{1}{2} \mathop{\rm ln} \frac{1+ \rho }{1 - \rho } + \frac \rho {2( n - 1) } $$

and variance $ 1/( n - 3) $. On this basis one can now define approximate confidence intervals for the true correlation coefficient $ \rho $.
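Both procedures can be sketched in a few lines. The code below is an illustration under explicit assumptions: the function names are hypothetical, the critical value $ t _ \alpha $ is supplied by the caller (for example from tables), and the confidence interval neglects the bias term $ \rho / ( 2 ( n - 1) ) $ in the expectation of $ z $, a simplification that is not prescribed by the article.

```python
from math import log, sqrt, tanh
from statistics import NormalDist


def is_significant(rho_hat, n, t_alpha):
    """Check the criterion rho_hat^2 > [1 + (n - 2) / t_alpha^2]^(-1)."""
    return rho_hat ** 2 > 1.0 / (1.0 + (n - 2) / t_alpha ** 2)


def fisher_confidence_interval(rho_hat, n, level=0.95):
    """Approximate confidence interval for rho via the Fisher z-transform.

    Assumes z is normal with mean (1/2) ln((1 + rho) / (1 - rho)) and
    variance 1 / (n - 3); the bias term rho / (2 (n - 1)) is neglected.
    """
    if n <= 3:
        raise ValueError("the approximation requires n > 3")
    z = 0.5 * log((1 + rho_hat) / (1 - rho_hat))
    u = NormalDist().inv_cdf(0.5 + level / 2)  # standard normal quantile
    half_width = u / sqrt(n - 3)
    # tanh inverts the z-transform: rho = tanh(z).
    return tanh(z - half_width), tanh(z + half_width)


# Illustrative values: n = 28 observations, rho_hat = 0.5, and the
# two-sided 5% critical value 2.056 of Student's t with 26 degrees of freedom.
significant = is_significant(0.5, 28, 2.056)
lower, upper = fisher_confidence_interval(0.5, 28)
print(significant, lower, upper)
```

The interval is not symmetric about $ \widehat \rho $, because the inverse transform compresses values near $ \pm 1 $.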

For the distribution of the sample correlation ratio and for tests of the linearity hypothesis for the regression, see [3].

References

[1] H. Cramér, "Mathematical methods of statistics", Princeton Univ. Press (1946)
[2] B.L. van der Waerden, "Mathematische Statistik", Springer (1957)
[3] M.G. Kendall, A. Stuart, "The advanced theory of statistics", 2. Inference and relationship, Griffin (1979)
[4] S.A. Aivazyan, "Statistical research on dependence", Moscow (1968) (In Russian)
How to Cite This Entry:
Correlation (in statistics). Encyclopedia of Mathematics. URL: http://encyclopediaofmath.org/index.php?title=Correlation_(in_statistics)&oldid=11629
This article was adapted from an original article by A.V. Prokhorov (originator), which appeared in Encyclopedia of Mathematics - ISBN 1402006098.