Difference between revisions of "Statistical manifold"
(Importing text file) |
Ulf Rehmann (talk | contribs) m (tex encoded by computer) |
Latest revision as of 08:23, 6 June 2020
A manifold $ S $ endowed with a symmetric connection $ \nabla $ and a Riemannian metric $ g $. This structure is abstracted from parametric statistics, i.e. inference from data distributed according to some unknown member of a parametrized family of probability distributions. The most cited such family is the multivariate normal distribution for data $ x \in \mathbf R ^ {n} $, given by
$$ ( 2 \pi ) ^ {- n/2 } ( { \mathop{\rm det} } A ) ^ {- 1/2 } { \mathop{\rm exp} } \left [ - \frac{1}{2} ( x - \mu ) ^ {t} A ^ {- 1 } ( x - \mu ) \right ] dx _ {1} \dots dx _ {n} , $$
whose parameters are the mean $ \mu \in \mathbf R ^ {n} $ and the covariance matrix $ A $. One thinks of the distributions themselves as points on a "surface" and of the parameters as coordinates for these points. In this way any parametric family constitutes a manifold $ S $, with allowable parametrizations providing admissible coordinate systems.
One can think of measures on a set as analogous to points in a plane and measurable functions (or random variables) as analogous to arrows which translate one point to another. The random variable $ f $ translates the measure $ \mu $ to another $ \nu $ by the formula $ d \nu = e ^ {f} d \mu $, meaning that $ \nu $ has density $ e ^ {f} $ with respect to $ \mu $. Composition of translation operations corresponds to adding the random variables (or arrows), and for points $ p $ and $ q $ there is a unique translation moving $ p $ to $ q $, provided one stays within an equivalence class of measures, subject to some regularity conditions. Such translation by a vector space of "arrows" is called an affine structure, and it is taken to be the essence of flat geometry.
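The translation picture can be checked numerically on a finite set, where a measure is a vector of positive point masses and a random variable is a vector of values. The following is a minimal sketch under that assumption; the names `translate` and `displacement` are illustrative and do not come from the text.

```python
import numpy as np

def translate(mu, f):
    """Translate the measure mu by the random variable f: d(nu) = exp(f) d(mu)."""
    return np.exp(f) * mu

def displacement(p, q):
    """The unique f with q = translate(p, f), for strictly positive p and q."""
    return np.log(q) - np.log(p)

mu = np.array([0.5, 1.0, 2.0])    # a positive measure on a three-point set
f = np.array([0.3, -1.0, 0.2])    # two "arrows" (random variables)
h = np.array([-0.7, 0.4, 1.1])

# Translating by f and then by h agrees with translating once by f + h.
two_steps = translate(translate(mu, f), h)
one_step = translate(mu, f + h)
```

On a finite set the equivalence class of `mu` consists of the measures with the same null points, which is why strict positivity is assumed in `displacement`.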
Probability measures can be regarded as finite measures up to scale, since any finite (non-negative) measure can be uniquely scaled into one. In this sense, probability distributions live inside the flat geometry of an equivalence class of measures, as the finite measures taken up to scale. By choosing a finite number of linearly independent random variables $ T _ {1} ( x ) , \dots, T _ {k} ( x ) $, one obtains finite-dimensional affine subspaces of measures of the form
$$ { \mathop{\rm exp} } [ \eta _ {1} T _ {1} ( x ) + \dots + \eta _ {k} T _ {k} ( x ) ] d \mu . $$
For an open subset of the parameters $ \eta \in \mathbf R ^ {k} $, these measures are finite and can be scaled to probability measures
$$ { \mathop{\rm exp} } [ \eta _ {1} T _ {1} ( x ) + \dots + \eta _ {k} T _ {k} ( x ) - K ( \eta ) ] d \mu. $$
They are the well-known exponential families. Their flatness can be related directly to their characterization in terms of sufficiency reduction.
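As an illustration, the normalization by $ K ( \eta ) $ can be sketched on a finite set. The choice of base measure (binomial coefficients on $ \{ 0, 1, 2, 3 \} $) and of the single statistic $ T ( x ) = x $ is an assumption made for this example only; with it, the family is the binomial family with three trials.

```python
import numpy as np

def log_normalizer(eta, T, mu):
    """K(eta): log of the total mass of exp(eta . T) d(mu) on a finite set."""
    return np.log(np.sum(np.exp(T @ eta) * mu))

def exponential_family(eta, T, mu):
    """Point masses of the probability measure exp(eta . T - K(eta)) d(mu)."""
    return np.exp(T @ eta - log_normalizer(eta, T, mu)) * mu

x = np.arange(4.0)                     # sample space {0, 1, 2, 3}
T = x[:, None]                         # one statistic, T(x) = x; rows index points
mu = np.array([1.0, 3.0, 3.0, 1.0])    # base measure: binomial coefficients C(3, x)

p = 0.3
eta = np.array([np.log(p / (1 - p))])  # natural parameter of Binomial(3, p)
probs = exponential_family(eta, T, mu)
```

Here `probs` should agree with the binomial probabilities $ \binom{3}{x} p ^ {x} ( 1 - p ) ^ {3 - x} $.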
General families of probability distributions $ \mu ( \theta ) $ are usually expressed as $ d \mu ( \theta ) = e ^ {l ( x, \theta ) } d \mu $, where $ \mu $ is fixed. Geometrically this amounts to choosing an origin $ \mu $ and describing each distribution $ \mu ( \theta ) $ in terms of its displacement vector $ l ( x, \theta ) $ from that origin. Differences in these displacement vectors give the displacement of one point in the family from another, and the derivatives
$$ l _ {i} ( x, \theta ) = { \frac{\partial l ( x, \theta ) }{\partial \theta ^ {i} } } $$
give infinitesimal displacements or tangent vectors to the family, or manifold $ S $, at the point $ \mu ( \theta ) $. The vector of random variables $ l _ {i} ( -, \theta ) $ is called the score and its components span the tangent space to the manifold $ S $ at $ \mu ( \theta ) $.
By restricting random variables $ f $ and $ h $ to the span of the score components, $ {\mathsf E} _ {\mu ( \theta ) } ( fh ) $ defines an inner product on the tangent space at $ \mu ( \theta ) $. Its matrix with respect to the score basis is $ {\mathsf E} _ {\mu ( \theta ) } ( l _ {i} ( x, \theta ) l _ {j} ( x, \theta ) ) $, known as the Fisher information matrix. In this sense, the Fisher information defines an inner product on each tangent space of $ S $, i.e. a Riemannian metric $ g $. This is an observation going back to C.R. Rao, who noted that the multivariate normal family becomes a space of constant negative curvature under this metric [a7].
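The Fisher information matrix can be computed directly from this definition when the sample space is finite. The sketch below assumes a user-supplied function returning the vector of values $ l ( x, \theta ) $ and approximates the score by central finite differences; the function names and the Bernoulli example are illustrative assumptions, not part of the text.

```python
import numpy as np

def score(log_density, theta, h=1e-5):
    """Central-difference approximation of l_i(x, theta); shape (points, parameters)."""
    theta = np.asarray(theta, dtype=float)
    columns = []
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = h
        columns.append((log_density(theta + step) - log_density(theta - step)) / (2 * h))
    return np.stack(columns, axis=1)

def fisher_information(log_density, theta, mu):
    """Matrix of expectations E[l_i l_j] under d(mu(theta)) = exp(l) d(mu)."""
    theta = np.asarray(theta, dtype=float)
    prob = np.exp(log_density(theta)) * mu
    s = score(log_density, theta)
    return s.T @ (prob[:, None] * s)

def bernoulli_log_density(theta):
    """l(x, theta) on {0, 1} with respect to counting measure, theta = (p,)."""
    p = theta[0]
    return np.array([np.log(1 - p), np.log(p)])

mu = np.ones(2)   # counting measure on {0, 1}
g = fisher_information(bernoulli_log_density, [0.3], mu)
```

For the Bernoulli family with success probability $ p $ the result should be close to $ 1 / ( p ( 1 - p ) ) $.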
Because each score component $ l _ {i} $ provides, as $ \theta $ varies, a tangent vector at each point of $ S $, the score components are vector fields on $ S $ (cf. Vector field). The second derivatives
$$ l _ {ij } ( x, \theta ) = { \frac{\partial ^ {2} l ( x, \theta ) }{\partial \theta ^ {i} \partial \theta ^ {j} } } $$
give rates of change of these vector fields, but not intrinsically on $ S $, since these random variables will not generally lie in the span of the score components. By using the Fisher information $ g $ one can project the second derivatives onto the tangent spaces, thus defining intrinsically the rate of change of these vector fields on $ S $. Via linearity in $ X $ and the Leibniz rule in $ Y $ one defines $ \nabla _ {X} Y $, the rate of change of the vector field $ Y $ along the vector field $ X $, for any two vector fields $ X $ and $ Y $ on $ S $. $ \nabla $ is called the Amari $ 1 $-connection [a1]. S.-I. Amari noted that the dual connection $ \nabla ^ {*} $ with respect to $ g $ is generally different from $ \nabla $, i.e. $ \nabla $ is not the Riemannian or Levi-Civita connection of $ g $. One can therefore define a whole one-parameter family of connections $ \nabla ^ \alpha = \alpha \nabla + ( 1 - \alpha ) \nabla ^ {*} $, so that $ \nabla ^ \alpha $ and $ \nabla ^ {1 - \alpha } $ are dual with respect to $ g $, and, in particular, $ \nabla ^ {1/2 } $ is the Levi-Civita connection.
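As a worked illustration (a sketch, assuming enough regularity to differentiate under the integral sign; the symbols $ \Gamma _ {ij,m} $ and $ \partial _ {i} $ are notation introduced here, not taken from the text above), the projection of $ l _ {ij } $ onto the tangent space is determined by its inner products with the score components, which gives the coefficients of $ \nabla $ in the coordinates $ \theta $:

$$ \Gamma _ {ij,m} ( \theta ) = g ( \nabla _ {\partial _ {i} } \partial _ {j} , \partial _ {m} ) = {\mathsf E} _ {\mu ( \theta ) } ( l _ {ij } ( x, \theta ) l _ {m} ( x, \theta ) ) . $$

For an exponential family in the parameters $ \eta $ one has $ l ( x, \eta ) = \eta _ {1} T _ {1} ( x ) + \dots + \eta _ {k} T _ {k} ( x ) - K ( \eta ) $, so that $ l _ {ij } = - \partial ^ {2} K / \partial \eta _ {i} \partial \eta _ {j} $ does not depend on $ x $. Since each score component has expectation zero,

$$ \Gamma _ {ij,m} ( \eta ) = - \frac{\partial ^ {2} K ( \eta ) }{\partial \eta _ {i} \partial \eta _ {j} } {\mathsf E} _ {\mu ( \eta ) } ( l _ {m} ( x, \eta ) ) = 0 , $$

which is one way to see the flatness of exponential families with respect to $ \nabla $ in these coordinates.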
Amari showed that statistical divergences, such as Kullback–Leibler distances (cf. Kullback–Leibler-type distance measures), could be defined in terms of these structures. By duality, the rate of change of vector fields allows one to define the rate of change of $ 1 $-forms and hence of differentials of functions. For any function $ f $, $ \nabla _ {X} df ( Y ) $ is bilinear and symmetric in $ X $ and $ Y $. It therefore makes sense to try to realize the Fisher information in this form, i.e. to solve $ \nabla df = g $ [a6]. Solutions, if they exist, are uniquely determined by specifying the value of $ df $ at a single point. Unfortunately $ \nabla $ must be flat in order that a good supply of solutions exists. Let $ \Theta ( p, - ) $ be the solution whose differential vanishes at $ p $. Amari calls $ \Theta ( p,q ) $ the statistical divergence between $ p $ and $ q $. For the $ 1 $-connection it is the Kullback–Leibler distance. It is easy to show that the minimum value of $ \Theta ( p,q ) $, for $ p $ fixed and $ q $ ranging over a submanifold $ U $, occurs at the point $ q $ where the geodesic of $ \nabla $ joining $ p $ to $ q $ meets $ U $ orthogonally according to $ g $.
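For an exponential family this can be made concrete. In the parameters $ \eta $ the Hessian of $ K $ is the Fisher information, so the function of $ \eta ^ \prime $ given by $ K ( \eta ^ \prime ) - K ( \eta ) - ( \eta ^ \prime - \eta ) \cdot \nabla K ( \eta ) $ has Hessian $ g $ and vanishing differential at $ \eta $. The sketch below checks numerically, on a finite set, that this expression coincides with the Kullback–Leibler distance from the distribution at $ \eta $ to the one at $ \eta ^ \prime $. The sample space, the statistics and the argument order of the divergence are assumptions made for the example.

```python
import numpy as np

def log_normalizer(eta, T, mu):
    """K(eta) for the family exp(eta . T - K(eta)) d(mu) on a finite set."""
    return np.log(np.sum(np.exp(T @ eta) * mu))

def density(eta, T, mu):
    """Point masses of the family member with parameter eta."""
    return np.exp(T @ eta - log_normalizer(eta, T, mu)) * mu

def kullback_leibler(p, q):
    """Expectation under p of log(p / q), for strictly positive p and q."""
    return np.sum(p * (np.log(p) - np.log(q)))

def bregman(eta_p, eta_q, T, mu):
    """K(eta_q) - K(eta_p) - (eta_q - eta_p) . grad K(eta_p)."""
    mean_T = density(eta_p, T, mu) @ T   # grad K(eta_p) equals the mean of T
    return (log_normalizer(eta_q, T, mu) - log_normalizer(eta_p, T, mu)
            - (eta_q - eta_p) @ mean_T)

x = np.arange(4.0)
T = np.stack([x, x ** 2], axis=1)   # two statistics, T_1(x) = x and T_2(x) = x^2
mu = np.ones(4)                     # counting measure on {0, 1, 2, 3}
eta_p = np.array([0.2, -0.1])
eta_q = np.array([-0.5, 0.3])

kl = kullback_leibler(density(eta_p, T, mu), density(eta_q, T, mu))
br = bregman(eta_p, eta_q, T, mu)
```

The two quantities `kl` and `br` should agree up to rounding error.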
Much research on statistical manifolds centres on finding improved or more insightful asymptotic formulas. A connection is equivalent to specifying the rates of change of differentials $ df $, and thereby the geometric form of the second-order Taylor expansion. The "strings" [a3] and "yokes" [a5] of O.E. Barndorff-Nielsen and P. Blaesild (cf. also Yoke) define full geometric Taylor expansions in terms related to parametric statistics, seeking insight into results such as the Bartlett adjustment and the Barndorff-Nielsen $ p ^ {*} $ formula [a4].
See also Differential geometry in statistical inference.
References
[a1] S.-I. Amari, "Differential-geometrical methods in statistics", Lecture Notes in Statistics, 28, Springer (1985)
[a2] S.-I. Amari, O.E. Barndorff-Nielsen, R.E. Kass, S.L. Lauritzen, C.R. Rao, "Differential geometry in statistical inference", Lecture Notes Monograph Ser., 10, Inst. Math. Statistics, Hayward, California (1987)
[a3] O.E. Barndorff-Nielsen, "Strings, tensorial combinants, and Bartlett adjustments", Proc. R. Soc. London A, 406 (1986) pp. 127–137
[a4] O.E. Barndorff-Nielsen, "Parametric statistical models and likelihood", Lecture Notes in Statistics, 50, Springer (1988)
[a5] P. Blaesild, "Yokes and tensors derived from yokes", Ann. Inst. Statist. Math., 43 : 1 (1991) pp. 95–113
[a6] M.K. Murray, J.W. Rice, "Differential geometry and statistics", Monographs on Statistics and Applied Probability, 48, Chapman and Hall (1993)
[a7] C.R. Rao, "Information and the accuracy attainable in the estimation of statistical parameters", Bull. Calcutta Math. Soc., 37 (1945) pp. 81–91
Statistical manifold. Encyclopedia of Mathematics. URL: http://encyclopediaofmath.org/index.php?title=Statistical_manifold&oldid=48815