Discriminant analysis
discriminance analysis, discriminatory analysis
A branch of mathematical statistics dealing with the development and study of statistical methods for solving the following discrimination (distinguishing) problem: to determine, from the results of observations, which of the sets in a given family originally contained an object drawn from the family at random. This task is of practical importance when it is impossible to observe the mark by which the object could be assigned unambiguously to a definite set, or when the time and labour required for this purpose are excessive; also in cases in which the information concerning such a mark has been lost and must be recovered, or when future events are to be predicted from available data. Situations of the first type are encountered in medical practice, e.g. when a diagnosis must be based on a large number of non-specific symptoms. An example of the second type is the determination of the sex of a person dead for a long time from the remains found during archaeological excavations. A situation of the third type arises, for example, in a statistical prognosis of the long-range results of a therapy. The methods of discriminant analysis are methods of multi-dimensional statistical analysis, serving for the quantitative expression and processing of the available information in accordance with a chosen criterion for an optimal solution.
The problem of discrimination may be put in the following general form. Let the observations carried out on a random object result in the realization of a $ p $-dimensional random vector $ x ^ \prime = ( x _ {1} \dots x _ {p} ) $ (where the prime denotes transposition) of the values of $ p $ marks of the object. The task is to find a rule assigning the object to one of the possible sets $ \pi _ {i} $, $ i= 1 \dots k $, on the basis of the value of the vector $ x $. A rule of discrimination is constructed by subdividing the entire sampling space $ R $ of values of the vector $ x $ into domains $ R _ {i} $, $ i= 1 \dots k $, such that if $ x $ falls in $ R _ {i} $, the object is assigned to the set $ \pi _ {i} $. The choice of a rule of discrimination among all possible such rules is based on an optimality principle, in the light of the a priori information available on the sets $ \pi _ {i} $ and on the a priori probabilities $ q _ {i} $ that the object is drawn from $ \pi _ {i} $. In so doing, allowance is made for the size of the losses caused by incorrect discrimination. The a priori information on the sets $ \pi _ {i} $ may consist of the distribution functions of the vector of marks of the objects in each such set; it may also be given in the form of samples from each of these sets, in which case the a priori probabilities $ q _ {i} $ may or may not be known. Clearly, the more complete the initial information, the more exact the recommendations.
Consider the case of two sets $ \pi _ {1} $ and $ \pi _ {2} $ when complete initial information is available: the distribution functions of the vector of marks are known for both sets, and the a priori probabilities are also known (the Bayesian approach). Let $ P _ {1} ( x) $ and $ P _ {2} ( x) $ be the distribution functions of the vector of marks in $ \pi _ {1} $ and $ \pi _ {2} $ respectively, $ p _ {1} ( x) $ and $ p _ {2} ( x) $ the corresponding densities, and $ C ( i \mid j ) $, $ i, j= 1, 2 $, the loss incurred when an object from $ \pi _ {j} $ is assigned to the set $ \pi _ {i} $. Then the probabilities of incorrect discrimination of objects from $ \pi _ {1} $ and $ \pi _ {2} $ are, respectively,
$$ P ( 2 \mid 1; R) = \int\limits _ {R _ {2} } p _ {1} ( x) dx,\ P( 1 \mid 2 ; R ) = \int\limits _ {R _ {1} } p _ {2} ( x) dx $$
(where the symbol $ P( i \mid j ; R) $ denotes the probability of assigning an object from $ \pi _ {j} $ to the set $ \pi _ {i} $ under the rule $ R $), while the mathematical expectation of the losses connected with incorrect discrimination is
$$ C ( 2 \mid 1) P ( 2 \mid 1 ; R ) q _ {1} + C ( 1 \mid 2 ) P ( 1 \mid 2; R ) q _ {2} . $$
It is natural in this situation to adopt, as the optimality principle, the minimization of this quantity, which, in the present case, leads to the following partition of the sampling space [1]:
$$ \tag{1a } R _ {1} : \frac{p _ {1} ( x) }{p _ {2} ( x) } \geq \frac{C ( 1 \mid 2 ) q _ {2} }{C ( 2 \mid 1 ) q _ {1} } , $$
$$ \tag{1b } R _ {2} : \frac{p _ {1} ( x) }{p _ {2} ( x) } < \frac{C ( 1 \mid 2 ) q _ {2} }{C ( 2 \mid 1 ) q _ {1} } . $$
If the conditions
$$ {\mathsf P} \left \{ \frac{p _ {1} ( x) }{p _ {2} ( x) } = \frac{C ( 1 \mid 2) q _ {2} }{C ( 2 \mid 1 ) q _ {1} } \left | \pi _ {i} \right . \right \} = 0,\ i= 1, 2, $$
are met, such a partition is unique, up to a set of probability zero. A similar rule of discrimination in this case may also be arrived at in other ways — e.g. using the Neyman–Pearson lemma, which forms part of the theory of testing statistical hypotheses.
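The rule (1a)–(1b) lends itself to a direct computational illustration. The following minimal sketch (not part of the original exposition) assumes that the densities $ p _ {1} $, $ p _ {2} $, the losses $ C( i \mid j ) $ and the a priori probabilities $ q _ {1} $, $ q _ {2} $ are all given; the particular bivariate normal densities used here are chosen purely for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def bayes_discriminate(x, p1, p2, q1, q2, c12, c21):
    """Rule (1a)-(1b): assign x to pi_1 when p1(x)/p2(x) >= C(1|2) q2 / (C(2|1) q1)."""
    threshold = (c12 * q2) / (c21 * q1)
    return 1 if p1(x) / p2(x) >= threshold else 2

# Illustrative densities: two bivariate normal populations (assumed, not from the text).
p1 = multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2)).pdf
p2 = multivariate_normal(mean=[2.0, 1.0], cov=np.eye(2)).pdf

# Classify one new observation with equal losses and equal a priori probabilities.
print(bayes_discriminate(np.array([0.2, -0.1]), p1, p2, q1=0.5, q2=0.5, c12=1.0, c21=1.0))
```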
Once a specific optimality criterion has been selected, the quality of a rule of discrimination is measured by the mathematical expectation of the losses, and of two rules the one giving the smaller expectation is preferred.
If, in a discrimination problem, the a priori probabilities $ q _ {i} $ are not known, the natural course is to look for a solution in the class of admissible rules, choosing the rule which minimizes the maximum over the $ q _ {i} $ of the mathematical expectation of the losses (such a rule is called a minimax rule). The mathematical expectations of the losses, when the observations are carried out on objects from $ \pi _ {1} $ or $ \pi _ {2} $ respectively, are
$$ C ( 2 \mid 1 ) P ( 2 \mid 1 ; R ) = r ( 1 , R ) ,\ C ( 1 \mid 2 ) P ( 1 \mid 2 ; R ) = r ( 2 , R ) . $$
The following theorem is valid [1]: If the conditions
$$ {\mathsf P} \left \{ \frac{p _ {1} ( x) }{p _ {2} ( x) } = k \left | \pi _ {i} \right . \right \} = 0 ,\ i = 1 , 2 ,\ 0 \leq k \leq \infty , $$
are met, the class of Bayes procedures is the minimal complete class. The minimax rule $ R ^ {*} $ of this class is obtained for those values of $ q _ {i} $ which satisfy the condition $ P( 2 \mid 1 ; R ^ {*} ) = P ( 1 \mid 2 ; R ^ {*} ) $. In the important case when $ P _ {1} $ and $ P _ {2} $ are multi-dimensional normal distributions with mean vectors $ \mu ^ {(1)} $ and $ \mu ^ {(2)} $ and common covariance matrix $ \Sigma $, the rule of discrimination (1a)–(1b) assumes the form
$$ \tag{2a } R _ {1} : x ^ \prime \Sigma ^ {-1} ( \mu ^ {(1)} - \mu ^ {(2)} ) - \frac{1}{2} ( \mu ^ {(1)} + \mu ^ {(2)} ) ^ \prime \Sigma ^ {-1} ( \mu ^ {(1)} - \mu ^ {(2)} ) \geq \mathop{\rm ln} k , $$

$$ \tag{2b } R _ {2} : x ^ \prime \Sigma ^ {-1} ( \mu ^ {(1)} - \mu ^ {(2)} ) - \frac{1}{2} ( \mu ^ {(1)} + \mu ^ {(2)} ) ^ \prime \Sigma ^ {-1} ( \mu ^ {(1)} - \mu ^ {(2)} ) < \mathop{\rm ln} k , $$
where $ k = q _ {2} C ( 1 \mid 2 ) / ( q _ {1} C ( 2 \mid 1) ) $. If $ C ( 1 \mid 2 ) = C ( 2 \mid 1 ) $ and $ q _ {2} = q _ {1} $, then $ \mathop{\rm ln} k = 0 $ and
$$ \tag{3 } R _ {1} : D ( x) = x ^ \prime \Sigma ^ {-1} ( \mu ^ {(1)} - \mu ^ {(2)} ) \geq \frac{1}{2} ( \mu ^ {(1)} + \mu ^ {(2)} ) ^ \prime \Sigma ^ {-1} ( \mu ^ {(1)} - \mu ^ {(2)} ) . $$
If the a priori probabilities are not known, one may choose $ \mathop{\rm ln} k = c $, for example, from the condition that the probability of incorrect discrimination be minimal, or by stipulating that the mathematical expectations of the losses caused by incorrect discrimination for the two sets be equal. Generally speaking, the choice of the optimality criterion is determined by the nature of the problem itself. The expression on the left-hand side of (3) is called the discriminant function of the problem; equated to a constant, it defines a surface in the sampling space separating the sets $ \pi _ {1} $ and $ \pi _ {2} $. In the above example this function is linear, so the separating surface is a hyperplane. If, in the above example, the covariance matrices of the two sets are different, the discriminant function is a quadratic function of $ x $. In order to simplify the calculations, the minimal complete class of linear discrimination procedures has been found for this case [3].
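In the normal case with known parameters, the rule (2a)–(2b) reduces to evaluating a single linear expression. A minimal numerical sketch follows; the mean vectors, covariance matrix and observation are values assumed only for illustration.

```python
import numpy as np

def normal_discriminate(x, mu1, mu2, sigma, ln_k=0.0):
    """Rule (2a)-(2b) for two normal populations with common covariance matrix sigma."""
    w = np.linalg.solve(sigma, mu1 - mu2)      # Sigma^{-1} (mu^(1) - mu^(2))
    score = x @ w - 0.5 * (mu1 + mu2) @ w      # left-hand side of (2a)
    return 1 if score >= ln_k else 2

# Illustrative parameter values (assumed, not taken from the article).
mu1 = np.array([1.0, 0.0])
mu2 = np.array([-1.0, 0.5])
sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])

# With ln k = 0 this is exactly the rule (3).
print(normal_discriminate(np.array([0.8, 0.1]), mu1, mu2, sigma))
```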
As regards the applications of discriminant analysis, the most important situation is the one in which the initial information on the distributions is given by samples from them. The problem is then formulated as follows. Let $ x _ {1} ^ {(i)} \dots x _ {n _ {i} } ^ {(i)} $ be a sample from the set $ \pi _ {i} $, $ i = 1 \dots k $, let $ x _ {j} ^ {(i)} = ( x _ {j _ {1} } ^ {(i)} \dots x _ {j _ {p} } ^ {(i)} ) $ be the vector of marks of the $ j $-th object selected from the set $ \pi _ {i} $, and let an additional observation $ x ^ \prime = ( x _ {1} \dots x _ {p} ) $ have been made on an object belonging to one of these sets. One is required to find a rule assigning the observation $ x $ to one of the sets. The first to solve this problem in the case of two sets was R.A. Fisher [4], who laid the foundations of discriminant analysis. Instead of characterizing the object by the vector of marks itself, he used a linear combination of the marks, corresponding to the hyperplane which, in a certain sense, best separates the two sets of sample points, and arrived at the discriminant function (3).
The case most frequently studied is when the distributions of the vectors of marks in each set are known to be normal, but the parameters of the distributions are not known. Here, the obvious approach is to replace the (unknown) distribution parameters in the discriminant function (3) by their optimal estimates [5], [6]. As in the case when the distributions are known, the rule of discrimination may be based on the likelihood ratio [7], [8].
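A hedged sketch of this plug-in approach: the means and the common covariance matrix in (3) are replaced by sample means and a pooled sample covariance. The simulated samples and the new observation below are assumptions made purely for illustration, not data from the article.

```python
import numpy as np

# Simulated training samples from the two sets (assumed for illustration only).
rng = np.random.default_rng(0)
sample1 = rng.multivariate_normal([1.0, 0.0], np.eye(2), size=50)
sample2 = rng.multivariate_normal([-1.0, 0.5], np.eye(2), size=60)

# Plug-in estimates: sample means and pooled sample covariance replace the
# unknown parameters in the discriminant function (3).
m1, m2 = sample1.mean(axis=0), sample2.mean(axis=0)
n1, n2 = len(sample1), len(sample2)
pooled = ((n1 - 1) * np.cov(sample1, rowvar=False)
          + (n2 - 1) * np.cov(sample2, rowvar=False)) / (n1 + n2 - 2)

w = np.linalg.solve(pooled, m1 - m2)
threshold = 0.5 * (m1 + m2) @ w

# Assign a new observation (assumed value) to set 1 or set 2.
x_new = np.array([0.4, 0.2])
print(1 if x_new @ w >= threshold else 2)
```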
In the great majority of cases, the results obtained in discriminant analysis concern normal distributions. Studies have also been made of the applicability of the results obtained for the normal case in situations where the distributions are only approximately normal [9]. In such studies the problems of discriminant analysis are considered in the framework of the general theory of decision functions, and the properties of the rules of discrimination are investigated with respect to the so-called $ Q $-optimality principle, which includes in a natural way both the Bayesian and the minimax approach. In fact, let $ R( \xi , \delta ) $ be the probability of error involved in applying the rule of discrimination $ \delta $ when the vector of a priori probabilities is $ \xi $. Further, assume that $ \xi \in Q $, where $ Q $ is some set in the space of vectors $ \xi $. A rule $ \delta ^ {*} $ is said to be $ Q $-optimal if
$$ \tag{4 } \sup _ {\xi \in Q } R ( \xi , \delta ^ {*} ) = \ \inf _ {\delta \in D } \sup _ {\xi \in Q } R ( \xi , \delta ) = R _ {Q} , $$
where $ D $ is the set of all possible rules of discrimination. Suppose that the functional form $ P _ {i} ( x , \lambda _ {i} ) $ of the distribution of the vector of marks in each set is known, $ i = 1, 2 $, but that the parameter $ \lambda = ( \lambda _ {1} , \lambda _ {2} ) $ is unknown and must be estimated from samples. If the $ P _ {i} ( x , \lambda _ {i} ) $ are such that there exists a $ Q $-optimal rule of discrimination $ \delta ^ {*} ( \lambda _ {1} , \lambda _ {2} ) $ for the distributions $ P _ {i} ( x , \lambda _ {i} ) $, $ i = 1 , 2 $, when the value of the parameter $ \lambda $ is known, and $ \{ \lambda _ {i} ^ {( n _ {i} ) } \} $ is a strongly consistent estimator of the parameter $ \lambda _ {i} $ based on a sample of size $ n _ {i} $, then, under certain additional conditions, the sequence of rules $ \{ \delta ^ {*} ( \lambda _ {1} ^ {( n _ {1} ) } , \lambda _ {2} ^ {( n _ {2} ) } ) \} $ is asymptotically $ Q $-optimal as $ n _ {1} , n _ {2} \rightarrow \infty $, i.e. one has, with probability one,
$$ \tag{5 } \lim\limits _ {n _ {1} , n _ {2} \rightarrow \infty } \sup _ {\xi \in Q } R ( \xi , \delta ^ {*} ( \lambda _ {1} ^ {( n _ {1} ) } , \lambda _ {2} ^ {( n _ {2} ) } ) ) = R _ {Q} , $$
where the risk $ R $ on the left-hand side of (5) may be computed both for the true parameter values and with these values replaced by their estimates $ \lambda _ {i} ^ {( n _ {i} ) } $. If the estimators are only required to be consistent, a somewhat weaker statement holds.
Non-parametric discrimination methods, in which the exact functional form of the distributions need not be known, and which make it possible to solve discrimination problems on the basis of a small amount of a priori information about the sets, are particularly valuable in practical applications [2], [10].
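As an illustration of the non-parametric approach, the following sketch replaces the unknown densities in the rule (1a)–(1b) by kernel density estimates built from the two samples, in the spirit of the density-estimation procedures of [10]; equal losses and equal a priori probabilities are assumed, and the simulated data are for illustration only.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Simulated samples from the two sets (assumed for illustration only).
rng = np.random.default_rng(1)
sample1 = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
sample2 = rng.normal(loc=2.0, scale=1.5, size=(200, 2))

# Kernel density estimates stand in for the unknown densities p1, p2;
# gaussian_kde expects the data with shape (dimension, number of points).
kde1 = gaussian_kde(sample1.T)
kde2 = gaussian_kde(sample2.T)

# With equal losses and equal a priori probabilities, rule (1a)-(1b) reduces
# to comparing the two estimated densities at the new observation.
x_new = np.array([[0.5], [0.3]])
print(1 if kde1(x_new)[0] >= kde2(x_new)[0] else 2)
```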
Problems in discriminant analysis may involve observations of qualitative as well as quantitative marks (a mixed case is also possible); there is no difference in principle between the two. If the marks are qualitative, the concept of a multi-dimensional state of the object is introduced, and one considers the distribution over these states. The manner in which the distribution of the vector of marks is estimated depends on the nature of the observations. The Bayesian and the minimax approaches are again employed whenever suitable, and a discrimination procedure can be based on the likelihood ratio. It is sometimes expedient to pass from quantitative to qualitative marks by grouping the values of a variable into classes and, conversely, from qualitative to quantitative marks by introducing suitable fictitious variables which encode the qualitative information numerically. In so doing it must clearly be ensured that this does not involve a significant deterioration in the quality of the rule.
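A brief sketch of the passage from a qualitative mark to fictitious (indicator) variables mentioned above; the list of categories is an assumption made for illustration.

```python
import numpy as np

# A qualitative mark with an assumed list of categories is replaced by
# fictitious (indicator) variables, one per category, so that it can enter
# a discriminant function alongside quantitative marks.
categories = ["A", "B", "C"]

def to_indicators(value):
    return np.array([1.0 if value == c else 0.0 for c in categories])

print(to_indicators("B"))   # -> [0. 1. 0.]
```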
So far, problems in discriminant analysis have been considered only for a fixed dimension of the space of vectors of marks. In practice, however, it is often the investigator himself who has to choose this dimension. It might appear at first sight that introducing a new mark into the discriminant function cannot impair (and may improve) its quality. In fact, however, the effectiveness of the discrimination may deteriorate for various reasons (among them the fact that estimates rather than the true values of the distribution parameters are usually employed). Moreover, an increase in the number of marks results in a considerable increase in the computational labour. The choice of marks may be effected in various ways, which in many cases are dictated by simple common sense. The theoretically best substantiated method is the one based on computing the Mahalanobis distance between the distributions [11]. Methods involving the successive selection of marks are of special interest.
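A sketch of the sample Mahalanobis distance between two sets, which may serve as a basis for comparing candidate subsets of marks as described above; the pooled covariance estimate and the simulated data are assumptions made for illustration.

```python
import numpy as np

def mahalanobis_distance(sample1, sample2):
    """Sample Mahalanobis distance between two sets, using a pooled covariance."""
    m1, m2 = sample1.mean(axis=0), sample2.mean(axis=0)
    n1, n2 = len(sample1), len(sample2)
    pooled = np.atleast_2d(((n1 - 1) * np.cov(sample1, rowvar=False)
                            + (n2 - 1) * np.cov(sample2, rowvar=False)) / (n1 + n2 - 2))
    diff = m1 - m2
    return float(np.sqrt(diff @ np.linalg.solve(pooled, diff)))

# Simulated data (assumed): the distance computed from both marks is compared
# with the distance computed from the first mark alone, as one might do when
# deciding whether the second mark is worth retaining.
rng = np.random.default_rng(2)
s1 = rng.multivariate_normal([1.0, 0.0], np.eye(2), size=80)
s2 = rng.multivariate_normal([-1.0, 0.1], np.eye(2), size=80)
print(mahalanobis_distance(s1, s2))
print(mahalanobis_distance(s1[:, :1], s2[:, :1]))
```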
Problems involving the assignment of an object to one of several possible sets were long studied under the general heading of classification problems. M.G. Kendall [2] subdivided all problems of choosing one out of several possibilities into three classes, and his terminology is used in the present article. He gave the name "discrimination problems" to problems of the kind discussed above, and reserved the term "classification" for the task of partitioning a given sample or set of objects into groups that are, as far as possible, homogeneous. In discrimination problems the existence of such groups is given as part of the conditions of the problem; in classification it is itself the subject of the investigation. In the discrimination problems discussed above the object under study was the result of a random choice from a finite family of sets. A more general situation is also possible, in which the object under study is the realization of some random process in continuous time.
Discriminant analysis is closely connected with the theory of pattern recognition.
References
[1] T.W. Anderson, "An introduction to multivariate statistical analysis", Wiley (1958)
[2] M.G. Kendall, A. Stuart, "The advanced theory of statistics", 3. Design and analysis, and time-series, Griffin (1966)
[3] T.W. Anderson, R.R. Bahadur, "Classification into two multivariate normal distributions with different covariance matrices", Ann. Math. Stat., 33 : 2 (1962) pp. 420–431
[4] R.A. Fisher, "The use of multiple measurements in taxonomic problems", Ann. Eugenics, 7 : 11 (1936) pp. 179–188
[5] A. Wald, "On a statistical problem arising in the classification of an individual into one of two groups", Ann. Math. Stat., 15 : 2 (1944) pp. 145–162
[6] S. John, "On some classification problems I, II", Sankhya, 22 : 3–4 (1960) pp. 301–316
[7] B.L. Welch, "Note on discriminant functions", Biometrika, 31 : 1–2 (1939) pp. 218–220
[8] S.D. Gupta, "Optimum classification rules for classification into two multivariate normal distributions", Ann. Math. Stat., 36 : 4 (1965) pp. 1174–1184
[9] O. Bunke, "Stabilität statistischer Entscheidungsprobleme und Anwendungen in der Discriminanzanalyse", Z. Wahrsch. Verw. Geb., 7 : 2 (1967) pp. 131–146
[10] J. van Ryzin, "Bayes risk consistency of classification procedures using density estimation", Sankhya Ser. A, 28 (1966) pp. 261–270
[11] A. Kudô, Mem. Fac. Sci. Kyushu Univ. Ser. A, 17 : 1 (1963) pp. 63–75
[12] V.Yu. Urbakh, "Statistical methods of classification", 1, Moscow (1969) pp. 79–173 (In Russian)
Discriminant analysis. Encyclopedia of Mathematics. URL: http://encyclopediaofmath.org/index.php?title=Discriminant_analysis&oldid=46740