Statistical manifold


A manifold $ S $ endowed with a symmetric connection $ \nabla $ and a Riemannian metric $ g $. This structure is abstracted from parametric statistics, i.e. inference from data distributed according to some unknown member of a parametrized family of probability distributions. The most cited such family is the multivariate normal distribution for data $ x \in \mathbf R ^ {n} $, given by

$$ ( 2 \pi ) ^ {- n/2 } ( { \mathop{\rm det} } A ) ^ {- 1/2 } { \mathop{\rm exp} } \left [ - { \frac{1}{2} } ( x - \mu ) ^ {t} A ^ {- 1 } ( x - \mu ) \right ] dx _ {1} \dots dx _ {n} , $$

with the mean $ \mu \in \mathbf R ^ {n} $ and the covariance matrix $ A $ as parameters. One thinks of the distributions themselves as points on a "surface" and the parameters as coordinates for these points. In this way any parametric family constitutes a manifold $ S $, with allowable parametrizations providing admissible coordinate systems.
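
As a concrete illustration, for $ n = 1 $ the family of univariate normal distributions

$$ ( 2 \pi \sigma ^ {2} ) ^ {- 1/2 } { \mathop{\rm exp} } \left [ - { \frac{( x - \mu ) ^ {2} }{2 \sigma ^ {2} } } \right ] dx $$

is a two-dimensional manifold: the parameters $ ( \mu, \sigma ) $, $ \sigma > 0 $, are admissible coordinates identifying $ S $ with the upper half-plane.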

One can think of measures on a set as analogous to points in a plane and measurable functions (or random variables) as analogous to arrows which translate one point to another. The random variable $ f $ translates the measure $ \mu $ to another $ \nu $ by the formula $ d \nu = e ^ {f} d \mu $, meaning that $ \nu $ has density $ e ^ {f} $ with respect to $ \mu $. Composition of translation operations corresponds to adding the random variables (or arrows), and for points $ p $ and $ q $ there is a unique translation moving $ p $ to $ q $, provided one stays within an equivalence class of measures, subject to some regularity conditions. Such translation by a vector space of "arrows" is called an affine structure, and it is taken to be the essence of flat geometry.
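
For example, if $ \mu $ is the standard normal distribution on $ \mathbf R $ and $ f ( x ) = ax - a ^ {2} /2 $ for a constant $ a $, then

$$ d \nu = e ^ {f} d \mu = ( 2 \pi ) ^ {- 1/2 } { \mathop{\rm exp} } \left [ - { \frac{( x - a ) ^ {2} }{2} } \right ] dx , $$

so that translation by $ f $ moves the standard normal distribution to the normal distribution with mean $ a $; composing two such translations adds the corresponding random variables, in accordance with the rule above.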

Probability measures can be regarded as finite measures up to scale, since any finite (non-negative) measure can be uniquely scaled into one. In this sense, probability distributions live inside the flat geometry of an equivalence class of finite measures. By choosing a finite number of linearly independent random variables $ T _ {1} ( x ) \dots T _ {k} ( x ) $, one obtains finite-dimensional affine subspaces of measures of the form

$$ { \mathop{\rm exp} } [ \eta _ {1} T _ {1} ( x ) + \dots + \eta _ {k} T _ {k} ( x ) ] d \mu . $$

For an open subset of the parameters $ \eta \in \mathbf R ^ {k} $, these measures are finite and can be scaled to probability measures

$$ { \mathop{\rm exp} } [ \eta _ {1} T _ {1} ( x ) + \dots + \eta _ {k} T _ {k} ( x ) - K ( \eta ) ] d \mu. $$

They are the well-known exponential families. Their flatness can be related directly to their characterization in terms of sufficiency reduction.
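
For example, taking $ d \mu = dx $ on $ \mathbf R $, $ T _ {1} ( x ) = x $ and $ T _ {2} ( x ) = x ^ {2} $ recovers the univariate normal family: with $ \eta _ {1} = \mu / \sigma ^ {2} $, $ \eta _ {2} = - 1/ ( 2 \sigma ^ {2} ) $ and

$$ K ( \eta ) = { \frac{\mu ^ {2} }{2 \sigma ^ {2} } } + { \frac{1}{2} } { \mathop{\rm log} } ( 2 \pi \sigma ^ {2} ) , $$

the measure $ { \mathop{\rm exp} } [ \eta _ {1} x + \eta _ {2} x ^ {2} - K ( \eta ) ] dx $ is the normal distribution with mean $ \mu $ and variance $ \sigma ^ {2} $; the measures are finite precisely on the open half-plane $ \eta _ {2} < 0 $.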

General families of probability distributions $ \mu ( \theta ) $ are usually expressed as $ d \mu ( \theta ) = e ^ {l ( x, \theta ) } d \mu $, where $ \mu $ is fixed. Geometrically this amounts to choosing an origin $ \mu $ and describing each distribution $ \mu ( \theta ) $ in terms of its displacement vector $ l ( x, \theta ) $ from that origin. Differences in these displacement vectors give the displacement of one point in the family from another, and the derivatives

$$ l _ {i} ( x, \theta ) = { \frac{\partial l ( x, \theta ) }{\partial \theta ^ {i} } } $$

give infinitesimal displacements or tangent vectors to the family, or manifold $ S $, at the point $ \mu ( \theta ) $. The vector of random variables $ l _ {i} ( -, \theta ) $ is called the score and its components span the tangent space to the manifold $ S $ at $ \mu ( \theta ) $.
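
For the univariate normal family with $ \theta = ( \mu, \sigma ) $ and Lebesgue measure as the origin, $ l ( x, \theta ) = - ( x - \mu ) ^ {2} / ( 2 \sigma ^ {2} ) - { \mathop{\rm log} } \sigma - ( 1/2 ) { \mathop{\rm log} } ( 2 \pi ) $, and the score components are

$$ l _ {\mu } ( x, \theta ) = { \frac{x - \mu }{\sigma ^ {2} } } , \qquad l _ {\sigma } ( x, \theta ) = { \frac{( x - \mu ) ^ {2} }{\sigma ^ {3} } } - { \frac{1}{\sigma } } . $$

Both have expectation zero under $ \mu ( \theta ) $, and together they span the two-dimensional tangent space at $ \mu ( \theta ) $.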

By restricting random variables $ f $ and $ h $ to the span of the score components, $ {\mathsf E} _ {\mu ( \theta ) } ( fh ) $ defines an inner product on the tangent space at $ \mu ( \theta ) $. Its matrix with respect to the score basis is $ {\mathsf E} _ {\mu ( \theta ) } ( l _ {i} ( x, \theta ) l _ {j} ( x, \theta ) ) $, known as the Fisher information matrix. In this sense, the Fisher information defines an inner product on each tangent space of $ S $, i.e. a Riemannian metric $ g $. This is an observation going back to C.R. Rao, who noted that the multivariate normal family becomes a space of constant negative curvature under this metric [a7].
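
In the univariate normal example above, the Fisher information matrix in the coordinates $ ( \mu, \sigma ) $ is

$$ g _ {\mu \mu } = { \frac{1}{\sigma ^ {2} } } , \qquad g _ {\sigma \sigma } = { \frac{2}{\sigma ^ {2} } } , \qquad g _ {\mu \sigma } = 0 , \qquad \textrm{ i.e. } \  ds ^ {2} = { \frac{d \mu ^ {2} + 2 d \sigma ^ {2} }{\sigma ^ {2} } } , $$

which is, up to the constant factor $ 2 $ multiplying $ d \sigma ^ {2} $, the hyperbolic metric of the upper half-plane; its curvature is the constant $ - 1/2 $, illustrating Rao's observation.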

Because, as $ \theta $ varies, each score component $ l _ {i} $ provides a tangent vector at each point of $ S $, they are vector fields on $ S $ (cf. Vector field). The second derivatives

$$ l _ {ij } ( x, \theta ) = { \frac{\partial ^ {2} l ( x, \theta ) }{\partial \theta ^ {i} \partial \theta ^ {j} } } $$

give rates of change of these vector fields, but not intrinsically on $ S $, since these random variables will not generally lie in the span of the score components. By using the Fisher information $ g $ one can project the second derivatives onto the tangent spaces, thus defining intrinsically the rate of change of these vector fields on $ S $. Via linearity in $ X $ and the Leibniz rule in $ Y $ one defines $ \nabla _ {X} Y $, the rate of change of the vector field $ Y $ along the vector field $ X $, for any two vector fields $ X $ and $ Y $ on $ S $. $ \nabla $ is called the Amari $ 1 $-connection [a1]. S.-I. Amari noted that the dual connection $ \nabla ^ {*} $ with respect to $ g $ is generally different from $ \nabla $, i.e. $ \nabla $ is not the Riemannian or Levi-Civita connection of $ g $. One can therefore define a whole one-parameter family of connections $ \nabla ^ \alpha = \alpha \nabla + ( 1 - \alpha ) \nabla ^ {*} $, so that $ \nabla ^ \alpha $ and $ \nabla ^ {1 - \alpha } $ are dual with respect to $ g $, and, in particular, $ \nabla ^ {1/2 } $ is the Levi-Civita connection.
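
In coordinates, the projection just described gives

$$ g ( \nabla _ {\partial _ {i} } \partial _ {j} , \partial _ {k} ) = {\mathsf E} _ {\mu ( \theta ) } ( l _ {ij } ( x, \theta ) l _ {k} ( x, \theta ) ) , $$

where $ \partial _ {i} $ denotes the coordinate vector field corresponding to $ l _ {i} $, while duality of $ \nabla $ and $ \nabla ^ {*} $ with respect to $ g $ means that

$$ X g ( Y, Z ) = g ( \nabla _ {X} Y, Z ) + g ( Y, \nabla _ {X} ^ {*} Z ) $$

for all vector fields $ X $, $ Y $, $ Z $ on $ S $.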

Amari showed that statistical divergences, such as Kullback–Leibler distances (cf. Kullback–Leibler-type distance measures), could be defined in terms of these structures. By duality, the rate of change of vector fields allows one to define the rate of change of $ 1 $-forms and hence of differentials of functions. For any function $ f $, $ \nabla _ {X} df ( Y ) $ is bilinear and symmetric in $ X $ and $ Y $. It therefore makes sense to try to realize the Fisher information in this form, i.e. to solve $ \nabla df = g $ [a6]. Solutions, if they exist, are uniquely determined by specifying the value of $ df $ at a single point. Unfortunately, $ \nabla $ must be flat in order that a good supply of solutions exists. Let $ \Theta ( p, - ) $ be the solution whose differential vanishes at $ p $. Amari calls $ \Theta ( p,q ) $ the statistical divergence between $ p $ and $ q $. For the $ 1 $-connection it is the Kullback–Leibler distance. It is easy to show that the minimum value of $ \Theta ( p,q ) $, for $ p $ fixed and $ q $ ranging over a submanifold $ U $, occurs at the point $ q $ where the geodesic of $ \nabla $ joining $ p $ to $ q $ meets $ U $ orthogonally according to $ g $.
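
Explicitly, for the $ 1 $-connection the divergence takes the familiar form

$$ \Theta ( p, q ) = \int { \mathop{\rm log} } \left ( { \frac{dp }{dq } } \right ) dp , $$

up to the ordering of the two arguments, which depends on the sign conventions adopted for $ \nabla $ and $ \nabla ^ {*} $.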

Much research on statistical manifolds centres on finding improved or more insightful asymptotic formulas. A connection is equivalent to specifying the rates of change of differentials $ df $, and hence the geometric form of second-order Taylor expansion. The "strings" [a3] and "yokes" [a5] of O.E. Barndorff-Nielsen and P. Blaesild (cf. also String (in statistics); Yoke) define full geometric Taylor expansions in terms related to parametric statistics, seeking insight into results such as the Bartlett adjustment and the Barndorff-Nielsen $ p ^ {*} $ formula [a4].
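
In the notation commonly used for the latter ($ {\widehat \theta  } $ the maximum-likelihood estimator, $ a $ an ancillary statistic, $ {\widehat{j}  } $ the observed information at $ {\widehat \theta  } $, $ L $ the likelihood and $ c $ a norming constant), the $ p ^ {*} $ formula approximates the conditional density of $ {\widehat \theta  } $ given $ a $ by

$$ p ^ {*} ( {\widehat \theta  } \mid  a ; \theta ) = c | {\widehat{j}  } | ^ {1/2 } { \frac{L ( \theta ) }{L ( {\widehat \theta  } ) } } . $$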

See also Differential geometry in statistical inference.

References

[a1] S.-I. Amari, "Differential-geometrical methods in statistics" , Lecture Notes in Statistics , 28 , Springer (1985)
[a2] S.-I. Amari, O.E. Barndorff-Nielsen, R.E. Kass, S.L. Lauritzen, C.R. Rao, "Differential geometry in statistical inference" , Lecture Notes Monograph Ser. , 10 , Inst. Math. Statistics, Hayward, California (1987)
[a3] O.E. Barndorff-Nielsen, "Strings, tensorial combinants, and Bartlett adjustments" Proc. R. Soc. London A , 406 (1986) pp. 127–137
[a4] O.E. Barndorff-Nielsen, "Parametric statistical models and likelihood" , Lecture Notes in Statistics , 50 , Springer (1988)
[a5] P. Blaesild, "Yokes and tensors derived from yokes" Ann. Inst. Statist. Math. , 43 : 1 (1991) pp. 95–113
[a6] M.K. Murray, J.W. Rice, "Differential geometry and statistics" , Monographs on Statistics and Applied Probability , 48 , Chapman and Hall (1993)
[a7] C.R. Rao, "Information and the accuracy attainable in the estimation of statistical parameters" Bull. Calcutta Math. Soc. , 37 (1945) pp. 81–91