Discriminant analysis

discriminance analysis, discriminatory analysis

A branch of mathematical statistics dealing with the development and study of statistical methods for solving the following discrimination (distinguishing) problem: to identify, from the results of observations, the set from which an object drawn at random from a family of sets originally came. This task is of practical importance when it is impossible to observe the mark by which the object could be unambiguously assigned to a definite set, or when the time and labour required for this purpose are excessive; also in cases in which the information concerning such a mark has been lost and must be recovered, or when future events are to be predicted from available data. Situations of the first type are encountered in medical practice, e.g. when a diagnosis must be based on a large number of non-specific symptoms. An example of the second type is the determination of the sex of a person long dead from the remains found during archeological excavations. A situation of the third type arises, for example, in a statistical prognosis of the long-range results of a therapy. The methods of discriminant analysis are methods of multi-dimensional statistical analysis, serving for the quantitative expression and processing of the available information in accordance with the chosen criterion of optimality of the solution.

The problem of discrimination may be put in the following general form. Let the observations carried out on a random object result in the realization of a $p$-dimensional random vector $x' = (x_1, \dots, x_p)$ (where the prime denotes transposition) of the values of $p$ marks of the object. The task is to find a rule assigning the object to one of the possible sets $\pi_i$, $i = 1, \dots, k$, on the basis of the value of the vector $x$. A rule of discrimination is constructed by subdividing the entire sampling space $R$ of values of the vector $x$ into domains $R_i$, $i = 1, \dots, k$, such that if $x$ falls in $R_i$, the object is assigned to the set $\pi_i$. The selection of the rule of discrimination out of all possible rules is based on an optimality principle, in the light of the a priori information available on the sets $\pi_i$ and on the probabilities $q_i$ that the object is drawn from $\pi_i$; allowance is made for the size of the loss incurred by incorrect discrimination. The a priori information on the sets $\pi_i$ may consist of knowledge of the distribution functions of the vector of marks of the objects in each set; it may also be given in the form of samples from each of these sets, in which case the a priori probabilities $q_i$ of the sets may or may not be known. Clearly, the more complete the initial information, the more precise the recommendations will be.

Consider the case of two sets $\pi_1$ and $\pi_2$ when complete initial information is available: the distribution functions of the vector of marks are known for both sets, and the a priori probabilities are also known (the Bayesian approach). Let $P_1(x)$ and $P_2(x)$ be the distribution functions of the vector of marks in $\pi_1$ and $\pi_2$ respectively, $p_1(x)$ and $p_2(x)$ the corresponding densities, and $C(i \mid j)$, $i, j = 1, 2$, the loss incurred when an object from the set $\pi_j$ is assigned to the set $\pi_i$. Then the probabilities of incorrect discrimination of objects from $\pi_1$ and $\pi_2$ are, respectively,

$$ P(2 \mid 1; R) = \int\limits_{R_2} p_1(x)\, dx, \qquad P(1 \mid 2; R) = \int\limits_{R_1} p_2(x)\, dx $$

(where the symbol $P(i \mid j; R)$ denotes the probability of assigning an object from $\pi_j$ to the set $\pi_i$ under the rule $R$), while the mathematical expectation of the losses connected with incorrect discrimination is

$$ C(2 \mid 1)\, P(2 \mid 1; R)\, q_1 + C(1 \mid 2)\, P(1 \mid 2; R)\, q_2 . $$

It is natural in this situation to take as the optimality principle the minimization of this quantity, which in the present case leads to the following partition of the sampling space [1]:

$$ \tag{1a} R_1: \ \frac{p_1(x)}{p_2(x)} \geq \frac{C(1 \mid 2)\, q_2}{C(2 \mid 1)\, q_1}, $$

$$ \tag{1b} R_2: \ \frac{p_1(x)}{p_2(x)} < \frac{C(1 \mid 2)\, q_2}{C(2 \mid 1)\, q_1}. $$

If the conditions

$$ {\mathsf P} \left\{ \left. \frac{p_1(x)}{p_2(x)} = \frac{C(1 \mid 2)\, q_2}{C(2 \mid 1)\, q_1} \; \right| \; \pi_i \right\} = 0, \qquad i = 1, 2, $$

are met, such a partition is unique, up to a set of probability zero. A similar rule of discrimination in this case may also be arrived at in other ways — e.g. using the Neyman–Pearson lemma, which forms part of the theory of testing statistical hypotheses.
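
As a concrete illustration of the rule (1a)–(1b), the following is a minimal numerical sketch; the univariate normal densities, the losses $C(i \mid j)$ and the a priori probabilities $q_i$ used here are hypothetical choices made for the example, not values taken from the article.

```python
# Sketch of the Bayes discrimination rule (1a)-(1b) for two sets, assuming
# (hypothetically) univariate normal densities p1 and p2 of the single mark.
from scipy.stats import norm

q1, q2 = 0.6, 0.4                 # hypothetical a priori probabilities
C_1_2, C_2_1 = 1.0, 1.0           # losses C(1|2) and C(2|1)

p1 = norm(loc=0.0, scale=1.0)     # density of the mark in pi_1
p2 = norm(loc=2.0, scale=1.0)     # density of the mark in pi_2

threshold = (C_1_2 * q2) / (C_2_1 * q1)   # right-hand side of (1a)

def discriminate(x):
    """Assign x to pi_1 if the likelihood ratio satisfies (1a), else to pi_2."""
    return 1 if p1.pdf(x) / p2.pdf(x) >= threshold else 2

print(discriminate(0.3))   # close to the centre of pi_1, assigned to pi_1
print(discriminate(1.9))   # close to the centre of pi_2, assigned to pi_2
```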

Once a specific optimality criterion has been selected, the quality of a rule of discrimination is measured by the mathematical expectation of the losses, and of two rules the one giving the smaller expectation is preferred.

If, in a discrimination problem, the a priori probabilities $q_i$ are not known, the natural course is to look for a solution in the class of admissible rules by choosing the rule which minimizes the maximum, over all the $q_i$, of the mathematical expectation of the losses (such a rule is called a minimax rule). The mathematical expectations of the losses, if the observations were carried out on objects from $\pi_1$ or $\pi_2$ respectively, are

$$ C(2 \mid 1)\, P(2 \mid 1; R) = r(1, R), \qquad C(1 \mid 2)\, P(1 \mid 2; R) = r(2, R). $$

The following theorem is valid [1]: If the conditions

$$ {\mathsf P} \left\{ \left. \frac{p_1(x)}{p_2(x)} = k \; \right| \; \pi_i \right\} = 0, \qquad i = 1, 2, \quad 0 \leq k \leq \infty, $$

are met, the class of Bayesian methods is the minimal complete class. The minimax rule $R^*$ of this class is obtained for the value of $q_i$ which satisfies the condition $P(2 \mid 1; R^*) = P(1 \mid 2; R^*)$. In the important case when $P_1$ and $P_2$ are multi-dimensional normal distributions with mean vectors $\mu^{(1)}$ and $\mu^{(2)}$ and common covariance matrix $\Sigma$, the rule of discrimination (1a)–(1b) assumes the form

$$ \tag{2a} R_1: \ x' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}) - \frac{1}{2} (\mu^{(1)} + \mu^{(2)})' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}) \geq \ln k, $$

$$ \tag{2b} R_2: \ x' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}) - \frac{1}{2} (\mu^{(1)} + \mu^{(2)})' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}) < \ln k, $$

where $k = q_2 C(1 \mid 2) / (q_1 C(2 \mid 1))$. If $C(1 \mid 2) = C(2 \mid 1)$ and $q_2 = q_1$, then $\ln k = 0$ and

$$ \tag{3} R_1: \ D(x) = x' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}) \geq \frac{1}{2} (\mu^{(1)} + \mu^{(2)})' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}). $$

If the a priori probabilities are not known, one may choose the threshold $\ln k = c$, for example, from the condition of minimal error of incorrect discrimination, or by stipulating that the mathematical expectations of the losses caused by incorrect discrimination be zero. Generally speaking, the choice of the optimality criterion is determined by the nature of the problem itself. The expression on the left-hand side of (3) is called the discriminant function of the problem; equated to the threshold, it defines a surface in the sampling space separating the sets $\pi_1$ and $\pi_2$. In the above example this function is linear, i.e. the separating surface is a hyperplane. If, in the above example, the covariance matrices are different, the discriminant function is a quadratic function of $x$. In order to simplify the calculations, the minimal complete class of linear discrimination procedures has been found for this case [3].
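
The following sketch applies the rule (2a)–(2b) with known parameters; the mean vectors and the common covariance matrix are illustrative values invented for the example. With equal losses and equal a priori probabilities the threshold reduces to $\ln k = 0$, as in (3).

```python
# Sketch of the linear discrimination rule (2a)-(2b), reduced to (3) by taking
# equal losses and equal a priori probabilities; all numbers are illustrative.
import numpy as np

mu1 = np.array([1.0, 0.0])               # mean vector of pi_1
mu2 = np.array([-1.0, 0.5])              # mean vector of pi_2
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])           # common covariance matrix

w = np.linalg.solve(Sigma, mu1 - mu2)    # Sigma^{-1} (mu^(1) - mu^(2))
c = 0.5 * (mu1 + mu2) @ w                # threshold from (2a) with ln k = 0

def classify(x):
    """Assign x to pi_1 if D(x) = x' Sigma^{-1}(mu^(1) - mu^(2)) >= c."""
    return 1 if x @ w >= c else 2

print(classify(np.array([1.2, -0.1])))   # near mu^(1), assigned to pi_1
print(classify(np.array([-0.8, 0.6])))   # near mu^(2), assigned to pi_2
```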

As regards the applications of discriminant analysis, the most important situation is one in which the initial information on the distributions is given by samples from them. The problem is then formulated as follows. Let $x_1^{(i)}, \dots, x_{n_i}^{(i)}$ be a sample from the set $\pi_i$, $i = 1, \dots, k$; let $x_j^{(i)} = (x_{j_1}^{(i)}, \dots, x_{j_p}^{(i)})$ be the vector of marks of the $j$-th object selected from the set $\pi_i$, and let an additional observation $x' = (x_1, \dots, x_p)$ have been made on an object belonging to one of the sets $\pi_i$. One is required to find a rule assigning the observation $x$ to one of these sets. The first to solve this problem in the case of two sets was R.A. Fisher [4], who laid the foundations of discriminant analysis. Instead of using the vector of marks itself as the characterization of the object, he used a linear combination of the marks, a hyperplane which, in a certain sense, best separates the two sets of sample points, and arrived at the discriminant function (3).

The case most frequently studied is when the distributions of the vectors of marks in each set are known to be normal, but the parameters of the distributions are not known. Here, the obvious approach is to replace the (unknown) distribution parameters in the discriminant function (3) by their optimal estimates [5], [6]. As in the case when the distributions are known, the rule of discrimination may be based on the likelihood ratio [7], [8].
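
A minimal sketch of the plug-in approach just described: sample means and a pooled covariance estimate, computed from hypothetical training samples, replace the unknown parameters in the discriminant function (3).

```python
# Sketch of the sample-based (plug-in) discriminant rule: the unknown means
# and common covariance in (3) are replaced by estimates from hypothetical
# training samples X1 and X2 drawn from pi_1 and pi_2.
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[1.0, 0.0], scale=1.0, size=(50, 2))    # sample from pi_1
X2 = rng.normal(loc=[-1.0, 0.5], scale=1.0, size=(40, 2))   # sample from pi_2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# Pooled estimate of the common covariance matrix
S = ((X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)) / (len(X1) + len(X2) - 2)

w = np.linalg.solve(S, m1 - m2)
c = 0.5 * (m1 + m2) @ w

def classify(x):
    """Assign a new observation x by the estimated rule (3)."""
    return 1 if x @ w >= c else 2

print(classify(np.array([0.9, -0.2])))   # lies near the first sample mean
```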

In the great majority of cases, the results obtained in discriminant analysis concern normal distributions. Studies have also been made of the applicability of the rules derived for the normal case in situations where the normal distribution is only an approximation [9]. In such studies the problems of discriminant analysis are considered in the framework of the general theory of decision functions, and the properties of the rules of discrimination are investigated with respect to the so-called $Q$-optimality principle, which includes in a natural way both the Bayesian and the minimax approach. Indeed, let $R(\xi, \delta)$ be the probability of error involved in applying the rule of discrimination $\delta$ when the vector of a priori probabilities is $\xi$. Further, assume that $\xi \in Q$, where $Q$ is some set in the space of vectors $\xi$. A rule $\delta^*$ is said to be $Q$-optimal if

$$ \tag{4} \sup_{\xi \in Q} R(\xi, \delta^*) = \inf_{\delta \in D} \, \sup_{\xi \in Q} R(\xi, \delta) = R_Q, $$

where $D$ is the set of all possible rules of discrimination. Suppose the functional form $P_i(x, \lambda_i)$ of the distribution of the vector of marks in each set, $i = 1, 2$, is known, but that the parameter $\lambda$ is unknown and must be estimated from samples. If now the $P_i(x, \lambda_i)$ are such that there exists a $Q$-optimal rule of discrimination $\delta^*(\lambda_1, \lambda_2)$ for the distributions $P_i(x, \lambda_i)$, $i = 1, 2$, when the value of the parameter $\lambda = (\lambda_1, \lambda_2)$ is known, and $\{\lambda_i^{(n_i)}\}$ is a strongly consistent estimator of the parameter $\lambda_i$ based on a sample of size $n_i$, then, under certain additional conditions, the sequence of rules $\{\delta^*(\lambda_1^{(n_1)}, \lambda_2^{(n_2)})\}$ is asymptotically $Q$-optimal as $n_1, n_2 \rightarrow \infty$, i.e. one has, with probability one,

$$ \tag{5} \lim_{n_1, n_2 \rightarrow \infty} \, \sup_{\xi \in Q} R \bigl( \xi, \delta^*(\lambda_1^{(n_1)}, \lambda_2^{(n_2)}) \bigr) = R_Q, $$

where the risk $R$ on the left-hand side of (5) may be computed both for the true parameter values and with those values replaced by their estimates $\lambda_i^{(n_i)}$. If only consistency of the estimators is required, a somewhat weaker statement holds.

Non-parametric discrimination methods, in which the exact functional form of the distributions need not be known, and which make it possible to solve discrimination problems on the basis of a small amount of a priori information about the sets, are particularly valuable in practical applications [2], [10].
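
As a sketch of one such non-parametric approach (plug-in density estimation in the spirit of [10]), kernel density estimates of $p_1$ and $p_2$ can be substituted into the likelihood-ratio rule (1a)–(1b); the training samples and the choice of equal losses and a priori probabilities below are purely illustrative.

```python
# Sketch of a non-parametric discrimination rule: kernel density estimates of
# p1 and p2, built from hypothetical training samples, are plugged into the
# likelihood-ratio rule (1a)-(1b) with equal losses and a priori probabilities.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
sample1 = rng.normal(0.0, 1.0, size=200)     # training sample from pi_1
sample2 = rng.normal(2.5, 1.5, size=200)     # training sample from pi_2

p1_hat = gaussian_kde(sample1)               # estimate of the density p1
p2_hat = gaussian_kde(sample2)               # estimate of the density p2

def discriminate(x):
    # Equal losses and equal a priori probabilities give the threshold 1 in (1a)
    return 1 if p1_hat(x)[0] >= p2_hat(x)[0] else 2

print(discriminate(0.2), discriminate(2.8))  # typically 1 and 2
```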

Problems in discriminant analysis may involve random observations of both qualitative and quantitative marks (a mixed case is also possible); there is no difference in principle between the two. If the marks are qualitative, the concept of a multi-dimensional state of the object is introduced, and one considers the distribution over these states. The manner in which the distribution function of the vector of marks is estimated depends on the nature of the observations. The Bayesian and the minimax approaches are again employed whenever suitable, and a discrimination procedure based on the likelihood ratio can be set up. It is sometimes expedient to pass from quantitative to qualitative magnitudes by subdividing the range of the frequency function into classes and, conversely, from qualitative to quantitative ones by introducing suitable fictitious variables which encode the qualitative information. In so doing it must be ensured that this does not involve a significant deterioration in the quality of the rule.

So far, only problems of discriminant analysis with a fixed dimension of the space of vectors of marks have been considered. In practice, however, it is often the worker himself who has to choose the dimension. It might appear at first sight that introducing a new mark into the discriminant function cannot impair (and may improve) its quality. In fact, however, the effectiveness of the discrimination may be impaired for various reasons (including the fact that estimates rather than the true values of the distribution parameters are often employed). Moreover, an increase in the number of marks results in a considerable increase in the computational labour. The choice of marks may be effected in various ways, which in many cases are dictated by simple common sense. The method based on computing the Mahalanobis distance between the distributions [11] is theoretically the best substantiated one. Methods involving successive selection of marks are of special interest.
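
A short sketch of the quantity just mentioned: the squared sample Mahalanobis distance between two groups, which can be compared across candidate subsets of marks; the data and the subsets used here are hypothetical.

```python
# Sketch: squared sample Mahalanobis distance between two groups, the quantity
# used (as in [11]) to compare candidate subsets of marks. Data are made up.
import numpy as np

rng = np.random.default_rng(2)
X1 = rng.normal([0.0, 0.0, 1.0], 1.0, size=(60, 3))   # group from pi_1
X2 = rng.normal([1.5, 0.2, 1.0], 1.0, size=(60, 3))   # group from pi_2

def mahalanobis_sq(A, B, marks):
    """Squared Mahalanobis distance using only the selected marks (columns)."""
    A, B = A[:, marks], B[:, marks]
    mA, mB = A.mean(axis=0), B.mean(axis=0)
    S = ((A - mA).T @ (A - mA) + (B - mB).T @ (B - mB)) / (len(A) + len(B) - 2)
    d = mA - mB
    return float(d @ np.linalg.solve(S, d))

print(mahalanobis_sq(X1, X2, [0, 1, 2]))   # all three marks
print(mahalanobis_sq(X1, X2, [0]))         # the first mark alone
```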

Problems involving the assignment of an object to one of several possible sets were for a long time studied under the heading of classification problems. M.G. Kendall [2] subdivided all problems of choosing one out of several equivalent possibilities into three classes, and his terminology is used in the present article. He gave the term "discrimination problems" to problems of the kind discussed above, and reserved the term "classification" for the task of partitioning a given sample or set of objects into groups that are, as far as possible, homogeneous. In discrimination problems the existence of such groups is given as part of the statement of the problem; in classification problems it is itself the subject of the investigation. In the discrimination problems discussed above the object under study was the result of a random choice from some finite family of distributions. A more general situation is also possible, in which the object under study is the realization of some random process in continuous time.

Discriminant analysis is closely connected with the theory of pattern recognition.

References

[1] T.W. Anderson, "An introduction to multivariate statistical analysis" , Wiley (1958)
[2] M.G. Kendall, A. Stuart, "The advanced theory of statistics" , 3. Design and analysis, and time-series , Griffin (1966)
[3] T.W. Anderson, R.R. Bahadur, "Classification into two multivariate normal distributions with different covariance matrices" Ann. Math. Stat. , 33 : 2 (1962) pp. 420–431
[4] R.A. Fisher, Ann. Eugenics , 7 : 11 (1936) pp. 179–188
[5] A. Wald, "On a statistical problem arising in the classification of an individual into one of two groups" Ann. Math. Stat. , 15 : 2 (1944) pp. 145–162
[6] S. John, "On some classification problems I, II" Sankhya , 22 : 3–4 (1960) pp. 301–316
[7] B.L. Welch, "Note on discriminant functions" Biometrika , 31 : 1–2 (1939) pp. 218–220
[8] S.D. Gupta, "Optimum classification rules for classification into two multivariate normal distributions" Ann. Math. Stat. , 36 : 4 (1965) pp. 1174–1184
[9] O. Bunke, "Stabilität statistischer Entscheidungsprobleme und Anwendungen in der Diskriminanzanalyse" Z. Wahrsch. Verw. Geb. , 7 : 2 (1967) pp. 131–146
[10] J. van Ryzin, "Bayes risk consistency of classification procedures using density estimation" Sankhya Ser. A , 28 (1966) pp. 261–270
[11] A. Kudô, Mem. Fac. Sci. Kyushu Univ. Ser. A , 17 : 1 (1963) pp. 63–75
[12] V.Yu. Urbakh, "Statistical methods of classification" , 1 , Moscow (1969) pp. 79–173 (In Russian) ((See the list of references))
This article was adapted from an original article by N.M. Mitrofanova and A.P. Khusu (originator), which appeared in Encyclopedia of Mathematics, ISBN 1402006098.