In 1949 G.K. Zipf published [[#References|[a25]]]. A large, if not the main, part of it was devoted to the principles of human use of a language and the following was the main thesis of the author [[#References|[a25]]], pp. 20–21: "From the viewpoint of the speaker, there would doubtless exist an important latent economy in a vocabulary that consisted exclusively of one single word — a single word that would mean whatever the speaker wanted it to mean. Thus, if there were $m$ different meanings to be verbalized, this word would have $m$ different meanings. [...] But from the viewpoint of the auditor, a single word vocabulary would represent the acme of verbal labour, since he would be faced by the impossible task of determining the particular meaning to which the single word in a given situation might refer. Indeed, from the viewpoint of the auditor [...] the important internal economy of speech would be found rather in vocabulary of such size that [...] if there were $m$ different meanings, there would be $m$ different words, with one meaning per word.
 
Thus, if there are an $m$ number of different distinctive meanings to be verbalized, there will be 1) a speaker's economy in possessing a vocabulary of one word which will refer to all the distinctive meanings; and there will also be 2) an opposing auditor's economy in possessing a vocabulary of $m$ different words with one distinctive meaning for each word. Obviously the two opposing economies are in extreme conflict.
 
In the language of these two [economies or forces] we may say that the vocabulary of a given stream of speech is constantly subject to the opposing Forces of Unification and Diversification which will determine both the number $n$ of actual words in the vocabulary, and also the meaning of those words."
  
As far as this was the main thesis, the main data and the main empirical fact discussed thoroughly in the book is: "James Joyce's novel Ulysses, with its $260,430$ running words, represents a sizable sample of running speech that may fairly be said to have served successfully in the communication of ideas. An index to the number of different words therein, together with the actual frequencies of their respective occurrences, has already been made with exemplary methods by Dr. Miles L. Hanley and associates ([[#References|[a27]]]) [...].
  
To the above published index has been added an appendix from the careful hands of Dr. M. Joos, in which is set forth all the qualitative information that is necessary for our present purposes. For Dr. Joos not only tells us that there are $29,899$ different words in the $260,430$ running words; [...] and tells us the actual frequency, $f$, with which the different ranks, $r$, occur.
  
It is evident that the relationship between the various ranks, $r$, of these words and their respective frequencies, $f$, is potentially quite instructive about the entire matter of vocabulary balance, not only because it involves the frequencies with which the different words occur but also because the terminal rank of the list tells us the number of different words in the sample. And we remember that both the frequencies of occurrence and the number of different words will be important factors in the counterbalancing of the Forces of Unification and Diversification in the hypothetical vocabulary balance of any sample of speech."
  
Now, let $\{ f _ { i n } \} _ { i = 1 } ^ { N }$ be the frequencies of $N$, $N \leq \infty$, different disjoint events in a [[Sample|sample]] of size $n$, like occurrences of different words in a text of $n$ running words, and let $\mu _ { n } = \sum _ { i = 1 } ^ { N } 1 _ { \{ f _ { i n } \geq 1 \} }$ be an  "empirical vocabulary" , that is, the number of all different events in the sample. Furthermore, let $f_{ ( 1 , n )} \geq \ldots \geq f _{( \mu _ { n } , n )}$ be the [[Order statistic|order statistic]] based on the frequencies and $G _ { n } ( x ) = \sum _ { i = 1 } ^ { N } 1_{ \{ f _ { i n } \geq x \} }$. Clearly, $\Delta G _ { n } ( x ) \equiv \mu _ { n } ( x ) = \sum _ { i = 1 } ^ { N } 1 _ { \{ f _ { i n } = x \} }$ is the number of events that occurred in the sample exactly $x$ times and $G _ { n } ( 1 ) = \mu _ { n }$. Consider the statement that
  
\begin{equation} \tag{a1} k f _{( k , n )} \approx \mu _ { n } ,\; k = 1,2 , \ldots, \end{equation}
  
in some sense [[#References|[a25]]], II.A, p.24. There is a bit of a problem with such a statement for high values of the rank $k$. Indeed, though in general $G _ { n } ( f _ { ( k , n ) } ) = \operatorname { max } \left\{ k ^ { \prime } : f _ { ( k ^ { \prime } , n ) } = f _ { ( k , n ) } \right\}$ for $k = 1,2 , \dots$, typically $G _ { n } ( f_{( k , n )} ) = k$ and one can replace (a1) by
  
\begin{equation} \tag{a2} G _ { n } ( x ) x \approx \mu _ { n } ,\; x = f _{( 1 , n )} , f _{( 2 , n )}, \dots . \end{equation}
  
For $k$ close to $\mu _ { n }$ there will be plenty of frequencies of the same value which must have different ranks, and therefore the meaning of (a1) becomes obscure. However, one can use (a2) for all values of $x = 1 , \dots , f_{( 1 , n )}$. In the form
  
\begin{equation} \tag{a3} \frac { \mu _ { n } ( x ) } { \mu _ { n } } \approx \Delta \frac { 1 } { x } = \frac { 1 } { x ( x + 1 ) } , x = 1,2 , \dots, \end{equation}
  
it becomes especially vivid, for it says that even for very large $n$ (of many thousands) half of the empirical vocabulary should come from events which occurred in the sample only once, $x = 1$. The events that occurred only twice, $x = 2$, should constitute $1 / 6$ of $\mu _ { n }$, and the events that occurred not more than $10$ times should constitute $10 / 11$-ths of the empirical vocabulary. So, somewhat counterintuitively, even in very big samples rare events should be present in plenty and should constitute almost the total of all different events that occurred. Any of the statements (a1), (a2) or (a3) is called Zipf's law.
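
A minimal sketch (in Python; the function name is ad hoc and the short token list is only a placeholder for a real corpus, such as the Ulysses index mentioned above) of how the spectral statistics $\mu _ { n } ( x ) / \mu _ { n }$ can be computed from a text and set against the prediction $1 / ( x ( x + 1 ) )$ of (a3):

<syntaxhighlight lang="python">
from collections import Counter

def spectral_statistics(tokens):
    """Return mu_n and the map x -> mu_n(x)/mu_n, the proportion of distinct
    words that occur exactly x times in the sample."""
    freqs = Counter(tokens)              # word -> f_{in}
    mu_n = len(freqs)                    # empirical vocabulary mu_n
    spectrum = Counter(freqs.values())   # x -> mu_n(x)
    return mu_n, {x: m / mu_n for x, m in sorted(spectrum.items())}

# Placeholder sample; replace by the list of running words of a real text.
tokens = "to be or not to be that is the question".split()
mu_n, spectrum = spectral_statistics(tokens)
for x, proportion in spectrum.items():
    print(x, proportion, 1 / (x * (x + 1)))   # empirical value vs. Zipf's law (a3)
</syntaxhighlight>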
  
Assume for simplicity that $\{ f _ { i n } \} _ { i = 1 } ^ { N }$ are drawn from a [[Multinomial distribution|multinomial distribution]] with probabilities $\{ p _ { i n } \}^ { N } _ { 1 }$ which form an array with respect to $n$. The asymptotic behaviour of the statistics $\{ f _{( k , n )} \} _ { k = 1 } ^ { \mu _ { n } }$, $\{ \mu _ { n } ( k ) \} _ { k = 1 } ^ { \mu _ { n } }$, $G _ { n } ( \cdot )$, all of which are symmetric, is governed by the distribution function of the probabilities $\{ p _ { i n } \}^ { N } _ { 1 }$:
  
\begin{equation*} G _ { p , n } ( x ) = \sum _ { i = 1 } ^ { N } 1 _ { \{ n p _ { i n } \geq x \} }. \end{equation*}
  
 
Namely, if
  
\begin{equation*} \frac { 1 } { n } G _ { p , n } \stackrel { \omega } { \rightarrow } G, \end{equation*}
  
 
then
  
\begin{equation} \tag{a4} \frac { \mu _ { n } ( x ) } { n } \stackrel { \mathsf{P} } { \rightarrow } - \int _ { 0 } ^ { \infty } \frac { \lambda ^ { x } } { x ! } e ^ { - \lambda } G ( d \lambda ), \end{equation}
  
 
while if
  
\begin{equation*} R _ { n } \stackrel { \omega } { \rightarrow } R \text { and } \operatorname { lim } _ { \varepsilon \rightarrow 0 } \operatorname { sup } _ { n } \int _ { 0 } ^ { \varepsilon } z R _ { n } ( d z ) = 0, \end{equation*}
  
 
where
  
\begin{equation*} R _ { n } ( x ) = \frac { G _ { p , n } ( x ) } { \int _ { 0 } ^ { \infty } ( 1 - e ^ { - z } ) G _ { p , n } ( d z ) }, \end{equation*}
  
 
then
  
\begin{equation} \tag{a5} \frac { \mu _ { n } ( x ) } { \mu _ { n } } \stackrel { \mathsf{P} } { \rightarrow } \alpha ( x ) = - \int _ { 0 } ^ { \infty } \frac { \lambda ^ { x } e ^ { - \lambda } } { x ! } R ( d \lambda ) \end{equation}
  
(see, e.g., [[#References|[a7]]]). Therefore the limiting expression $1 / ( x ( x + 1 ) )$ is by no means unique and there are as many possible limiting expressions for the spectral statistics $\mu _ { n } ( x ) / \mu _ { n }$ as there are measures with $\int _ { 0 } ^ { \infty } ( 1 - e ^ { - \lambda } ) R ( d \lambda ) = 1$. The right-hand side of (a5) gives Zipf's law (a3) if and only if
  
\begin{equation*} R ( x ) = \int _ { 0 } ^ { \infty } \frac { 1 } { 1 + z } e ^ { - z x } d z. \end{equation*}
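
Indeed, a short verification: writing $- R ( d \lambda ) = \left( \int _ { 0 } ^ { \infty } \frac { z } { 1 + z } e ^ { - z \lambda } d z \right) d \lambda$ and interchanging the order of integration in (a5),

\begin{equation*} \alpha ( x ) = \int _ { 0 } ^ { \infty } \frac { z } { 1 + z } \int _ { 0 } ^ { \infty } \frac { \lambda ^ { x } e ^ { - ( 1 + z ) \lambda } } { x ! } d \lambda \, d z = \int _ { 0 } ^ { \infty } \frac { z } { ( 1 + z ) ^ { x + 2 } } d z = \int _ { 0 } ^ { 1 } ( 1 - u ) u ^ { x - 1 } d u = \frac { 1 } { x ( x + 1 ) }, \end{equation*}

where the last step uses the substitution $u = 1 / ( 1 + z )$.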
  
 
The concepts of C. Shannon in [[Information theory|information theory]] were relatively new when B. Mandelbrot [[#References|[a11]]] used them to make Zipf's reasoning on a compromise between speaker's and auditor's economies rigorous and derive (a3) formally. Therefore the statement
  
\begin{equation} \tag{a6} \frac { \mu _ { n } ( x ) } { \mu _ { n } } \approx \frac { 1 } { ( a + b x ) ^ { 2 } } \end{equation}
  
 
is frequently called the Zipf–Mandelbrot law. Among earlier work on distributions of the sort,
  
\begin{equation} \tag{a7} f _{( k , n )} \sim A k ^ { - ( 1 + q ) } \end{equation}
  
 
is, of course, the famous [[Pareto distribution|Pareto distribution]] of population over income [[#References|[a17]]].
 
In lexical data, both Mandelbrot [[#References|[a12]]] and Zipf [[#References|[a25]]] note that the so-called Zipf's law was first observed by J.B. Estoup [[#References|[a2]]]. (Note that Zipf's role was not so much the observation itself, which, by the way, he never claimed to be the first to make, as the serious attempt to explain the mechanism behind it.)
  
[[Yule, George Udny|G.U. Yule]] [[#References|[a24]]], in his analysis of data of J.G. Willis [[#References|[a22]]] on problems of biological taxonomy, introduced the following expression for $\alpha ( x )$:
  
\begin{equation} \tag{a8} \alpha ( x ) = \frac { \Gamma ( \beta + 1 ) \Gamma ( x ) } { \Gamma ( x + \beta + 1 ) }, \end{equation}
  
depending on some parameter $\beta$.
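
Note that for $\beta = 1$ the expression (a8) reduces exactly to Zipf's law (a3), while for general $\beta$ it is again of power-law type:

\begin{equation*} \frac { \Gamma ( 2 ) \Gamma ( x ) } { \Gamma ( x + 2 ) } = \frac { 1 } { x ( x + 1 ) } , \quad \frac { \Gamma ( \beta + 1 ) \Gamma ( x ) } { \Gamma ( x + \beta + 1 ) } \sim \Gamma ( \beta + 1 ) x ^ { - ( \beta + 1 ) } \quad \text { as } x \rightarrow \infty . \end{equation*}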
  
H.A. Simon [[#References|[a19]]] proposed and studied a certain Markov model for $\{ \mu _ { n } ( x ) : x = 1,2 , \ldots \}$ as a process in $n$ and obtained (a8) and a somewhat more general form of it as a limiting expression for $\mu _ { n } ( x ) / \mu _ { n }$. His core assumption was most natural: as there are $x \mu _ { n } ( x )$ words used $x$ times in the first $n$ draws, the probability that a word from this group will be used on the $( n + 1 )$st draw is proportional to $x \mu _ { n } ( x )$. In other words,
  
\begin{equation} \tag{a9} \mathsf{E} [ \mu _ { n + 1 } ( x ) | \mu _ { n } ( \cdot ) ] - \mu _ { n } ( x ) = k ( n ) [ ( x - 1 ) \mu _ { n } ( x - 1 ) - x \mu _ { n } ( x ) ]. \end{equation}
  
His other assumption that a new (unused) word would occur on each trial with the same constant probability $\alpha$ was, perhaps, a little bit controversial. Indeed, it implies that the empirical vocabulary $\mu _ { n }$ increases at the same rate as $n$, while for Zipf's law in its strict sense of (a3) (or for $q = 1$ in (a7)) one obviously has
  
\begin{equation*} \frac { n } { \mu _ { n } } = \frac { \sum _ { x = 1 } ^ { n } x \mu _ { n } ( x ) } { \mu _ { n } } \sim \sum _ { x = 1 } ^ { n } \frac { 1 } { x + 1 } \rightarrow \infty . \end{equation*}
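
A minimal simulation sketch (in Python; the function name is ad hoc and the value of $\alpha$ and the sample size are illustrative) of the birth process defined by the two assumptions above, with the resulting spectrum set against the shape of (a8) with $\beta = 1 / ( 1 - \alpha )$:

<syntaxhighlight lang="python">
import random
from collections import Counter
from math import gamma

def simon_process(n, alpha, seed=0):
    """Simon's birth process: with probability alpha the next token is a new
    word; otherwise it repeats a uniformly chosen earlier token, so a word
    already used x times is picked with probability proportional to x*mu_n(x)."""
    rng = random.Random(seed)
    tokens = [0]
    next_word = 1
    for _ in range(n - 1):
        if rng.random() < alpha:
            tokens.append(next_word)
            next_word += 1
        else:
            tokens.append(rng.choice(tokens))
    return tokens

alpha = 0.1
spec = Counter(Counter(simon_process(200_000, alpha)).values())   # x -> mu_n(x)
rho = 1 / (1 - alpha)

def yule(x):
    return gamma(x) * gamma(rho + 1) / gamma(x + rho + 1)          # B(x, rho+1), cf. (a8)

for x in range(1, 6):
    print(x, spec[x] / spec[1], yule(x) / yule(1))                 # compare the shapes
</syntaxhighlight>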
  
This and some other concerns of analytic nature (cf. the criticism in [[#References|[a12]]], [[#References|[a13]]] and the replies [[#References|[a20]]], [[#References|[a21]]]) led Simon to the third assumption:  "[a word] may be dropped with the same average rate" . This is described in very clear form in [[#References|[a20]]]: "We consider a sequence of $k$ words. We add words to the end of the sequence, and drop words from the sequence at the same average rate, so that the length of the sequence remains $k$. For the birth process we assume: the probability that the next word added is one that now occurs $i$ times is proportional to $( i + c ) \mu ( i )$. The probability that the next word is a new word is $\alpha$, a constant. For the death process we assume: the probability that the next word dropped is one that now occurs $i$ times is proportional to $( i + d ) \mu ( i )$. The terms $c$ and $d$ are constant parameters. The steady state relation is:
  
\begin{equation*} \mu ( i , m + 1 ) - \mu ( i , m ) = \frac { 1 - \alpha } { k + c n _ { k } } [ ( i - 1 + c ) \mu ( i - 1 , m ) - ( i + c ) \mu ( i , m ) ] - \frac { 1 } { k + d n _ { k } } [ ( i + d ) \mu ( i , m ) - ( i + d + 1 ) \mu ( i + 1 , m ) ] = 0. \end{equation*}
  
A solution to this equation, independent of $m$, is:
  
\begin{equation} \tag{a10} \mu ( i , m ) = A \lambda ^ { i } B ( i + c , d - c + 1 ), \end{equation}
  
 
where
  
\begin{equation*} \lambda = \frac { ( 1 - \alpha ) ( k + d n _ { k } ) } { ( k + cn _ { k } ) } \end{equation*}
  
and $B$ is the beta function."
  
See [[#References|[a12]]] for an analytically beneficial replacement of the difference equations (a9) (for expectations $\mathsf E \mu _ { n } ( x )$) by certain partial differential equations. See also [[#References|[a3]]] for the exact solution of these difference equations. Simon suggested calling (a8) the Yule distribution, but (a8) and (a10) are more appropriately and more commonly called the Yule–Simon distribution.
  
Approximately at the same time, R.H. MacArthur [[#References|[a10]]] published a note on a model for the relative abundancy of species. His model of  "non-overlapping niches"  was very simple: "The environment is compared with a stick of unit length on which $m - 1$ points are thrown at random. The stick is broken at these points and lengths of the $m$ resulting segments are proportional to the abundance of the $m$ species. [...] The expected length of $r$-th shortest interval is given by
  
\begin{equation} \tag{a11} \frac { 1 } { m } \sum _ { i = 1 } ^ { r } \frac { 1 } { m - i + 1 } = p ( r ) \end{equation}
  
so that the expected abundance of the $r$-th rarest species among $m$ species and $n$ individuals is $n p ( r )$."

It appears (see [[#References|[a26]]]) that  "bird censuses"  from tropical forests and many temperate regions fit this almost perfectly.
  
The rate of $p ( r )$ in $r$ is, certainly, very different from $r ^ { - ( 1 + q ) }$. Very interestingly, MacArthur then suggests viewing the divergence of a distribution from (a11) as an indication and a measure of the heterogeneity of a population. His idea was to  "use several sticks of very different size"  to produce other curves.
  
 
In a certain sense, this was actualized in a very elegant model of B. Hill and M. Woodroofe (see [[#References|[a4]]], [[#References|[a5]]], [[#References|[a23]]]).
  
According to Hill, if $N$ individuals are to be distributed to $M$ non-empty genera in accordance with [[Bose–Einstein statistics|Bose–Einstein statistics]], i.e. all allocations are equi-probable, each with probability
  
\begin{equation*} \frac { 1 } { \left( \begin{array} { c } { N - 1 } \\ { M - 1 } \end{array} \right) }, \end{equation*}
  
and if $\mathsf{P} \{ M / N \leq x \} \stackrel { \omega } { \rightarrow } F ( x )$, $0 \leq x \leq 1$, $F ( 0 ) = 0$, then for the number of genera with exactly $x$ species one has
  
\begin{equation} \tag{a12} \frac { \mu _ { N } ( x ) } { M } \stackrel { d } { \rightarrow } U ( 1 - U ) ^ { x - 1 }, \end{equation}
  
where the random variable $U$ has distribution $F$. Note that if $U$ has a [[Uniform distribution|uniform distribution]], then, as a consequence of (a12),
  
\begin{equation*} \textsf{E} \frac { \mu _ { N } ( x ) } { M } \rightarrow \frac { 1 } { x ( x + 1 ) }, \end{equation*}
  
which is exactly (a3). By considering a [[Triangular array|triangular array]] of independent (or exchangeable, as the authors have it) families of genera $M _ { i k }$ with allocations of $N _ { i k }$ species in each family $i = 1 , \ldots , k$ (the different  "sticks"  of MacArthur [[#References|[a10]]]!), Hill and Woodroofe have shown that for combined statistics $\mu _ { N _ { k } } ( x ) = \sum _ { i = 1 } ^ { k } \mu _ { i N _ { i } } ( x )$ and $M _ { k } = \sum _ { i = 1 } ^ { k } M _ { i k }$ one has
  
\begin{equation} \tag{a13} \frac { \mu _ { N } ( x ) } { M } \stackrel { \mathsf{P} } { \rightarrow } \int _ { 0 } ^ { 1 } u ( 1 - u ) ^ { x - 1 } F ( d u ). \end{equation}
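
The uniform case can be illustrated by a small simulation (in Python; a Bose–Einstein allocation is sampled as a uniformly chosen composition of $N$ into $M$ positive parts, the function name is ad hoc, and the parameter values are illustrative):

<syntaxhighlight lang="python">
import random
from collections import Counter

def average_spectrum(N, x_max=5, trials=400, seed=0):
    """Draw M with M/N roughly uniform on (0,1), allocate N individuals to M
    non-empty genera uniformly over all compositions (Bose-Einstein statistics)
    and average mu_N(x)/M over the trials."""
    rng = random.Random(seed)
    acc = [0.0] * (x_max + 1)
    for _ in range(trials):
        M = max(1, int(rng.random() * N))
        cuts = sorted(rng.sample(range(1, N), M - 1))   # M-1 distinct cut points
        sizes = [b - a for a, b in zip([0] + cuts, cuts + [N])]
        spec = Counter(sizes)
        for x in range(1, x_max + 1):
            acc[x] += spec[x] / M
    return [acc[x] / trials for x in range(1, x_max + 1)]

for x, value in enumerate(average_spectrum(20_000), start=1):
    print(x, round(value, 3), round(1 / (x * (x + 1)), 3))   # cf. (a3)
</syntaxhighlight>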
  
 
A. Rouault [[#References|[a18]]] considered a certain [[Markov process|Markov process]] as a model for formation of a text word by word and obtained the following expression for the limit of spectral statistics:
  
\begin{equation} \tag{a14} \alpha ( x ) = \frac { \beta \Gamma ( x - \beta ) } { \Gamma ( 1 - \beta ) \Gamma ( x + 1 ) }, \end{equation}
  
depending on a parameter $\beta$.
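
Since $\Gamma ( x - \beta ) / \Gamma ( x + 1 ) \sim x ^ { - ( 1 + \beta ) }$ as $x \rightarrow \infty$, this spectrum is again of power-law type:

\begin{equation*} \alpha ( x ) \sim \frac { \beta } { \Gamma ( 1 - \beta ) } x ^ { - ( 1 + \beta ) } , \quad x \rightarrow \infty . \end{equation*}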
  
This Karlin–Rouault law appeared in a very different context in [[#References|[a8]]]: Let the unit interval be divided into two in the ratio $a : 1 - a$. Let each of these two be again subdivided in the same ratio, etc. At step $q$ one obtains probabilities $p _ { i }$, $i = 1 , \dots , 2 ^ { q }$, of the form $a ^ { k } ( 1 - a ) ^ { q - k }$ for some $k = 0 , \dots , q$. One can think of filling a questionnaire with  "yes-no"  questions at random with probability $a$ for one of the answers in each question. Then the $p _ { i }$ are the probabilities of each particular answer to the questions or  "opinion" . It was proved that if $a \neq 1 / 2$, then
  
\begin{equation} \tag{a15} \mu _ { n } \rightarrow \infty \quad \text { but } \frac { \mu _ { n } } { n } \rightarrow 0 \end{equation}
  
 
and that
  
\begin{equation} \tag{a16} \frac { \mu _ { n } ( x ) } { \mu _ { n } } \rightarrow \frac { \int _ { - \infty } ^ { \infty } \alpha ^ { s ( x + \beta ) } e ^ { - \alpha ^ { s } } d N ( s ) } { \Gamma ( x + 1 ) \int _ { - \infty } ^ { \infty } \alpha ^ { s  \beta } e ^ { - \alpha ^ { s } } d N ( s ) }, \end{equation}
  
with $\alpha = a / ( 1 - a )$ and $\beta = \beta ( \alpha , c ) < 1$ and with $N ( s )$ being a counting measure (concentrated on integers). With $d N ( s )$ replaced by the [[Lebesgue measure|Lebesgue measure]] $d s$ and using the change of variables $u = \alpha ^ { s }$, the right-hand side of (a16) will coincide with (a14).
  
A possible heuristic interpretation of (a15) is that if $a \neq 1 / 2$, then saying  "as many men as many minds"  is incorrect — though the number of  "minds"  is infinitely large it is still asymptotically smaller than the number of  "men" .
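
A small simulation sketch (in Python; the function name is ad hoc, the values of $n$, $q$ and $a$ are illustrative, and $n$ is taken large relative to the number of  "typical"  opinions so that repeated opinions actually occur) of the questionnaire model described above:

<syntaxhighlight lang="python">
import random
from collections import Counter

def opinions(n, q, a, seed=0):
    """Fill n questionnaires of q yes-no questions, answering "yes" with
    probability a; each distinct string of answers is one "opinion"."""
    rng = random.Random(seed)
    return [tuple(rng.random() < a for _ in range(q)) for _ in range(n)]

n, q, a = 100_000, 16, 0.3                    # a != 1/2
freqs = Counter(opinions(n, q, a))
mu_n = len(freqs)                             # number of distinct opinions
spec = Counter(freqs.values())                # spectral statistics mu_n(x)
print("mu_n =", mu_n, " mu_n/n =", mu_n / n)  # many "minds", yet noticeably fewer than "men"
print({x: round(spec[x] / mu_n, 3) for x in range(1, 6)})   # cf. (a16)
</syntaxhighlight>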
  
 
One of the present time (2000) characteristics of the study of Zipf's law is the increase of interest in the more general concept of  "distributions with a large number of rare events"  (LNRE distributions), which was introduced and first systematically studied in [[#References|[a6]]] (partly reproduced in [[#References|[a7]]]).
  
For a systematic application of LNRE distributions to the analysis of texts, see [[#References|[a1]]]. On the other hand, studies on specific laws continue. In a series of papers, Yu. Orlov (see, e.g., [[#References|[a14]]], [[#References|[a15]]], [[#References|[a16]]]) promoted the hypothesis that the presence of Zipf's law is the characteristic property of a complete, finished and well-balanced text. Though the hypothesis looks beautiful and attractive, for some readers the evidence may look thin. The problem of non-parametric estimation of the functions $G$ and $R$ through the expressions (a3) and (a5) provides another most natural example of an  "inverse problem" , studied recently by C. Klaassen and R. Mnatsakanov [[#References|[a9]]].
  
 
In conclusion, the following quote from [[#References|[a12]]], whose views I share, is given: "The form of Zipf's law [see (a7)] is so striking and also so very different from any classical distribution of statistics that it is quite widely felt that it  "should"  have a basically simple reason, possibly as  "weak"  and general as the reasons which account for the role of Gaussian distribution. But, in fact, these laws turn out to be quite resistant to such an analysis. Thus, irrespective of any claim as to their practical importance, the  "explanation"  of their role has long been one of the best defined and most conspicuous challenges to those interested in statistical laws of nature.
Line 162: Line 170:
  
 
====References====
 
====References====
<table><TR><TD valign="top">[a1]</TD> <TD valign="top">  R.H. Baayen,  "Word frequency distribution" , Kluwer Acad. Publ.  (2000)</TD></TR><TR><TD valign="top">[a2]</TD> <TD valign="top">  J.B. Estoup,  "Gammes stenographiques" , Paris  (1916)  (Edition: Fourth)</TD></TR><TR><TD valign="top">[a3]</TD> <TD valign="top">  R. Gunther,  B. Schapiro,  P. Wagner,  "Physical complexity and Zipf's law"  ''Internat. J. Theoret. Phys.'' , '''31''' :  3  (1999)  pp. 525–543</TD></TR><TR><TD valign="top">[a4]</TD> <TD valign="top">  B.M. Hill,  "Zipf's law and prior distributions for the composition of a population"  ''J. Amer. Statist. Assoc.'' , '''65'''  (1970)  pp. 1220–1232</TD></TR><TR><TD valign="top">[a5]</TD> <TD valign="top">  B.M. Hill,  M. Woodroofe,  "Stronger forms of Zipf's law"  ''J. Amer. Statist. Assoc.'' , '''70'''  (1975)  pp. 212–219</TD></TR><TR><TD valign="top">[a6]</TD> <TD valign="top">  E.V. Khmaladze,  "The statistical analysis of large number of rare events"  ''Techn. Rept. Dept. Math. Statist. CWI, Amsterdam'' , '''MS-R8804'''  (1987)</TD></TR><TR><TD valign="top">[a7]</TD> <TD valign="top">  E.V. Khmaladze,  R.J. Chitashvili,  "Statistical analysis of large number of rate events and related problems"  ''Trans. Tbilisi Math. Inst.'' , '''91'''  (1989)  pp. 196–245</TD></TR><TR><TD valign="top">[a8]</TD> <TD valign="top">  E.V. Khmaladze,  Z.P. Tsigroshvili,  "On polynomial distributions with large number of rare events"  ''Math. Methods Statist.'' , '''2''' :  3  (1993)  pp. 240–247</TD></TR><TR><TD valign="top">[a9]</TD> <TD valign="top">  C.A.J. Klaassen,  R.M. Mnatsakanov,  "Consistent estimation of the structural distribution function"  ''Scand. J. Statist.'' , '''27'''  (2000)  pp. 1–14</TD></TR><TR><TD valign="top">[a10]</TD> <TD valign="top">  R.H. MacArthur,  "On relative abundancy of bird species"  ''Proc. Nat. Acad. Sci. USA'' , '''43'''  (1957)  pp. 293–295</TD></TR><TR><TD valign="top">[a11]</TD> <TD valign="top">  B. Mandelbrot,  "An information theory of the statistical structure of language"  W.E. Jackson (ed.) , ''Communication Theory'' , Acad. Press  (1953)  pp. 503–512</TD></TR><TR><TD valign="top">[a12]</TD> <TD valign="top">  B. Mandelbrot,  "A note on a class of skew distribution functions: Analysis and critque of a paper by H.A. Simon"  ''Inform. &amp; Control'' , '''2'''  (1959)  pp. 90–99</TD></TR><TR><TD valign="top">[a13]</TD> <TD valign="top">  B. Mandelbrot,  "Final note on a class of skew distribution functions: Analysis and critique of a model due to H.A. Simon"  ''Inform. &amp; Control'' , '''4'''  (1961)  pp. 198–216; 300–304</TD></TR><TR><TD valign="top">[a14]</TD> <TD valign="top">  Ju.K. Orlov,  "Dynamik der Haufigkeitsstrukturen"  H. Guiter (ed.)  M.V. Arapov (ed.) , ''Studies on Zipf's law'' , Brockmeyer, Bochum  (1983)  pp. 116–153</TD></TR><TR><TD valign="top">[a15]</TD> <TD valign="top">  Ju.K. Orlov,  "Ein Model der Haufigskeitstruktur des Vokabulars"  H. Guiter (ed.)  M.V. Arapov (ed.) , ''Studies on Zipf's law'' , Brockmeyer, Bochum  (1983)  pp. 154–233</TD></TR><TR><TD valign="top">[a16]</TD> <TD valign="top">  Ju.K. Orlov,  R.Y. Chitashvili,  "On the statistical interpretation of Zipf's law"  ''Bull. Acad. Sci. Georgia'' , '''109'''  (1983)  pp. 505–508</TD></TR><TR><TD valign="top">[a17]</TD> <TD valign="top">  V. Pareto,  "Course d'economie politique" , Lausanne and Paris  (1897)</TD></TR><TR><TD valign="top">[a18]</TD> <TD valign="top">  A. Rouault,  "Loi de Zipf et sources markoviennes"  ''Ann. Inst. H. 
Poincaré'' , '''14'''  (1978)  pp. 169–188</TD></TR><TR><TD valign="top">[a19]</TD> <TD valign="top">  H.A. Simon,  "On a class of skew distribution functions"  ''Biometrika'' , '''42'''  (1955)  pp. 435–440</TD></TR><TR><TD valign="top">[a20]</TD> <TD valign="top">  H.A. Simon,  "Some further notes on a class of skew distribution functions"  ''Inform. &amp; Control'' , '''3'''  (1960)  pp. 80–88</TD></TR><TR><TD valign="top">[a21]</TD> <TD valign="top">  H.A. Simon,  "Reply to `final note' by Benoit Mandelbrot"  ''Inform. &amp; Control'' , '''4'''  (1961)  pp. 217–223</TD></TR><TR><TD valign="top">[a22]</TD> <TD valign="top">  J.G. Willis,  "Age and area" , Cambridge Univ. Press  (1922)</TD></TR><TR><TD valign="top">[a23]</TD> <TD valign="top">  M. Woodroofe,  B.M. Hill,  "On Zipf's law"  ''J. Appl. Probab.'' , '''12''' :  3  (1975)  pp. 425–434</TD></TR><TR><TD valign="top">[a24]</TD> <TD valign="top">  G.U. Yule,  "A mathematical theory of evolution, based on the conclusions of Dr.J.C. Willis, F.R.S."  ''Philos. Trans. Royal Soc. London Ser. B'' , '''213'''  (1924)  pp. 21–87</TD></TR><TR><TD valign="top">[a25]</TD> <TD valign="top">  G.K. Zipf,  "Human behaviour and the principles of least effort" , Addison-Wesley  (1949)</TD></TR><TR><TD valign="top">[a26]</TD> <TD valign="top">  L.I. Davis,  ''Aud. Field Notes'' , '''9'''  (1955)  pp. 425</TD></TR><TR><TD valign="top">[a27]</TD> <TD valign="top">  L.M. Hanley,  "Word index to James Joyce's Ulysses" , Madison, Wisc.  (1937)  (statistical tabulations by M. Joos)</TD></TR></table>
+
<table><tr><td valign="top">[a1]</td> <td valign="top">  R.H. Baayen,  "Word frequency distribution" , Kluwer Acad. Publ.  (2000)</td></tr><tr><td valign="top">[a2]</td> <td valign="top">  J.B. Estoup,  "Gammes stenographiques" , Paris  (1916)  (Edition: Fourth)</td></tr><tr><td valign="top">[a3]</td> <td valign="top">  R. Gunther,  B. Schapiro,  P. Wagner,  "Physical complexity and Zipf's law"  ''Internat. J. Theoret. Phys.'' , '''31''' :  3  (1999)  pp. 525–543</td></tr><tr><td valign="top">[a4]</td> <td valign="top">  B.M. Hill,  "Zipf's law and prior distributions for the composition of a population"  ''J. Amer. Statist. Assoc.'' , '''65'''  (1970)  pp. 1220–1232</td></tr><tr><td valign="top">[a5]</td> <td valign="top">  B.M. Hill,  M. Woodroofe,  "Stronger forms of Zipf's law"  ''J. Amer. Statist. Assoc.'' , '''70'''  (1975)  pp. 212–219</td></tr><tr><td valign="top">[a6]</td> <td valign="top">  E.V. Khmaladze,  "The statistical analysis of large number of rare events"  ''Techn. Rept. Dept. Math. Statist. CWI, Amsterdam'' , '''MS-R8804'''  (1987)</td></tr><tr><td valign="top">[a7]</td> <td valign="top">  E.V. Khmaladze,  R.J. Chitashvili,  "Statistical analysis of large number of rate events and related problems"  ''Trans. Tbilisi Math. Inst.'' , '''91'''  (1989)  pp. 196–245</td></tr><tr><td valign="top">[a8]</td> <td valign="top">  E.V. Khmaladze,  Z.P. Tsigroshvili,  "On polynomial distributions with large number of rare events"  ''Math. Methods Statist.'' , '''2''' :  3  (1993)  pp. 240–247</td></tr><tr><td valign="top">[a9]</td> <td valign="top">  C.A.J. Klaassen,  R.M. Mnatsakanov,  "Consistent estimation of the structural distribution function"  ''Scand. J. Statist.'' , '''27'''  (2000)  pp. 1–14</td></tr><tr><td valign="top">[a10]</td> <td valign="top">  R.H. MacArthur,  "On relative abundancy of bird species"  ''Proc. Nat. Acad. Sci. USA'' , '''43'''  (1957)  pp. 293–295</td></tr><tr><td valign="top">[a11]</td> <td valign="top">  B. Mandelbrot,  "An information theory of the statistical structure of language"  W.E. Jackson (ed.) , ''Communication Theory'' , Acad. Press  (1953)  pp. 503–512</td></tr><tr><td valign="top">[a12]</td> <td valign="top">  B. Mandelbrot,  "A note on a class of skew distribution functions: Analysis and critque of a paper by H.A. Simon"  ''Inform. &amp; Control'' , '''2'''  (1959)  pp. 90–99</td></tr><tr><td valign="top">[a13]</td> <td valign="top">  B. Mandelbrot,  "Final note on a class of skew distribution functions: Analysis and critique of a model due to H.A. Simon"  ''Inform. &amp; Control'' , '''4'''  (1961)  pp. 198–216; 300–304</td></tr><tr><td valign="top">[a14]</td> <td valign="top">  Ju.K. Orlov,  "Dynamik der Haufigkeitsstrukturen"  H. Guiter (ed.)  M.V. Arapov (ed.) , ''Studies on Zipf's law'' , Brockmeyer, Bochum  (1983)  pp. 116–153</td></tr><tr><td valign="top">[a15]</td> <td valign="top">  Ju.K. Orlov,  "Ein Model der Haufigskeitstruktur des Vokabulars"  H. Guiter (ed.)  M.V. Arapov (ed.) , ''Studies on Zipf's law'' , Brockmeyer, Bochum  (1983)  pp. 154–233</td></tr><tr><td valign="top">[a16]</td> <td valign="top">  Ju.K. Orlov,  R.Y. Chitashvili,  "On the statistical interpretation of Zipf's law"  ''Bull. Acad. Sci. Georgia'' , '''109'''  (1983)  pp. 505–508</td></tr><tr><td valign="top">[a17]</td> <td valign="top">  V. Pareto,  "Course d'economie politique" , Lausanne and Paris  (1897)</td></tr><tr><td valign="top">[a18]</td> <td valign="top">  A. Rouault,  "Loi de Zipf et sources markoviennes"  ''Ann. Inst. H. 
Poincaré'' , '''14'''  (1978)  pp. 169–188</td></tr><tr><td valign="top">[a19]</td> <td valign="top">  H.A. Simon,  "On a class of skew distribution functions"  ''Biometrika'' , '''42'''  (1955)  pp. 435–440</td></tr><tr><td valign="top">[a20]</td> <td valign="top">  H.A. Simon,  "Some further notes on a class of skew distribution functions"  ''Inform. &amp; Control'' , '''3'''  (1960)  pp. 80–88</td></tr><tr><td valign="top">[a21]</td> <td valign="top">  H.A. Simon,  "Reply to `final note' by Benoit Mandelbrot"  ''Inform. &amp; Control'' , '''4'''  (1961)  pp. 217–223</td></tr><tr><td valign="top">[a22]</td> <td valign="top">  J.G. Willis,  "Age and area" , Cambridge Univ. Press  (1922)</td></tr><tr><td valign="top">[a23]</td> <td valign="top">  M. Woodroofe,  B.M. Hill,  "On Zipf's law"  ''J. Appl. Probab.'' , '''12''' :  3  (1975)  pp. 425–434</td></tr><tr><td valign="top">[a24]</td> <td valign="top">  G.U. Yule,  "A mathematical theory of evolution, based on the conclusions of Dr.J.C. Willis, F.R.S."  ''Philos. Trans. Royal Soc. London Ser. B'' , '''213'''  (1924)  pp. 21–87</td></tr><tr><td valign="top">[a25]</td> <td valign="top">  G.K. Zipf,  "Human behaviour and the principles of least effort" , Addison-Wesley  (1949)</td></tr><tr><td valign="top">[a26]</td> <td valign="top">  L.I. Davis,  ''Aud. Field Notes'' , '''9'''  (1955)  pp. 425</td></tr><tr><td valign="top">[a27]</td> <td valign="top">  L.M. Hanley,  "Word index to James Joyce's Ulysses" , Madison, Wisc.  (1937)  (statistical tabulations by M. Joos)</td></tr></table>

Latest revision as of 14:54, 18 March 2023

In 1949 G.K. Zipf published [a25]. A large, if not the main, part of it was devoted to the principles of human use of language, and the following was the author's main thesis [a25], pp. 20–21: "From the viewpoint of the speaker, there would doubtless exist an important latent economy in a vocabulary that consisted exclusively of one single word — a single word that would mean whatever the speaker wanted it to mean. Thus, if there were $m$ different meanings to be verbalized, this word would have $m$ different meanings. [...] But from the viewpoint of the auditor, a single word vocabulary would represent the acme of verbal labour, since he would be faced by the impossible task of determining the particular meaning to which the single word in a given situation might refer. Indeed, from the viewpoint of the auditor [...] the important internal economy of speech would be found rather in a vocabulary of such size that [...] if there were $m$ different meanings, there would be $m$ different words, with one meaning per word.

Thus, if there are an $m$ number of different distinctive meanings to be verbalized, there will be 1) a speaker's economy in possessing a vocabulary of one word which will refer to all the distinctive meanings; and there will also be 2) an opposing auditor's economy in possessing a vocabulary of $m$ different words with one distinctive meaning for each word. Obviously the two opposing economies are in extreme conflict.

In the language of these two [economies or forces] we may say that the vocabulary of a given stream of speech is constantly subject to the opposing Forces of Unification and Diversification which will determine both the number $n$ of actual words in the vocabulary, and also the meaning of those words.

"

This being the main thesis, the main data and the main empirical fact discussed thoroughly in the book are the following: "James Joyce's novel Ulysses, with its $260,430$ running words, represents a sizable sample of running speech that may fairly be said to have served successfully in the communication of ideas. An index to the number of different words therein, together with the actual frequencies of their respective occurrences, has already been made with exemplary methods by Dr. Miles L. Hanley and associates ([a27]) [...].

To the above published index has been added an appendix from the careful hands of Dr. M. Joos, in which is set forth all the quantitative information that is necessary for our present purposes. For Dr. Joos not only tells us that there are $29,899$ different words in the $260,430$ running words; [...] and tells us the actual frequency, $f$, with which the different ranks, $r$, occur.

It is evident that the relationship between the various ranks, $r$, of these words and their respective frequencies, $f$, is potentially quite instructive about the entire matter of vocabulary balance, not only because it involves the frequencies with which the different words occur but also because the terminal rank of the list tells us the number of different words in the sample. And we remember that both the frequencies of occurrence and the number of different words will be important factors in the counterbalancing of the Forces of Unification and Diversification in the hypothetical vocabulary balance of any sample of speech.

"

Now, let $\{ f _ { i n } \} _ { i = 1 } ^ { N }$ be the frequencies of $N$, $N \leq \infty$, different disjoint events in a sample of size $n$, like occurrences of different words in a text of $n$ running words, and let $\mu _ { n } = \sum _ { i = 1 } ^ { N } 1 _ { \{ f _ { i n } \geq 1 \} }$ be an "empirical vocabulary" , that is, the number of all different events in the sample. Furthermore, let $f_{ ( 1 , n )} \geq \ldots \geq f _{( \mu _ { n } , n )}$ be the order statistic based on the frequencies and $G _ { n } ( x ) = \sum _ { i = 1 } ^ { N } 1_{ \{ f _ { i n } \geq x \} }$. Clearly, $\Delta G _ { n } ( x ) \equiv \mu _ { n } ( x ) = \sum {\bf 1} _ { \{ f _ { i n } = x \} }$ is the number of events that occurred in the sample exactly $x$ times and $G _ { n } ( 1 ) = \mu _ { n }$. Consider the statement that

\begin{equation} \tag{a1} k f _{( k , n )} \approx \mu _ { n } ,\; k = 1,2 , \ldots, \end{equation}

in some sense [a25], II.A, p.24. There is a bit of a problem with such a statement for high values of the rank $k$. Indeed, though in general $G _ { n } ( f _ { ( k , n ) } ) = \operatorname { max } \left\{ k ^ { \prime } : f _ { ( k ^ { \prime } , n ) } = f _ { ( k , n ) } \right\}$ for $k = 1,2 , \dots$, typically $G _ { n } ( f _ { ( k , n ) } ) = k$ and one can replace (a1) by

\begin{equation} \tag{a2} G _ { n } ( x ) x \approx \mu _ { n } ,\; x = f _{( 1 , n )} , f _{( 2 , n )}, \dots . \end{equation}

For $k$ close to $\mu _ { n }$ there will be plenty of frequencies of the same value which must have different ranks, and therefore the meaning of (a1) becomes obscure. However, one can use (a2) for all values of $x = 1 , \dots , f_{( 1 , n )}$. In the form

\begin{equation} \tag{a3} \frac { \mu _ { n } ( x ) } { \mu _ { n } } \approx \Delta \frac { 1 } { x } = \frac { 1 } { x ( x + 1 ) } , x = 1,2 , \dots, \end{equation}

it becomes especially vivid, for it says that even for very large $n$ (of many thousands) half of the empirical vocabulary should come from events which occurred in the sample only once, $x = 1$. The events that occurred only twice, $x = 2$, should constitute $1 / 6$ of $\mu _ { n }$, and the events that occurred not more than $10$ times should constitute $10 / 11$-th of the empirical vocabulary. So, somewhat counter-intuitively, even in very big samples rare events should be present in abundance and should make up almost the whole of the different events that occurred. Each of the statements (a1), (a2) and (a3) is called Zipf's law.
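
The statistics $\mu _ { n }$ and $\mu _ { n } ( x )$ are straightforward to compute from any tokenized text. The following minimal sketch (Python, standard library only) does the counting and prints the observed proportions $\mu _ { n } ( x ) / \mu _ { n }$ next to the Zipf values $1 / ( x ( x + 1 ) )$ of (a3); the synthetic "text", drawn from a $1 / k$ rank law over a hypothetical vocabulary of $50000$ word types, is purely illustrative, and how closely a given source follows (a3) depends, of course, on the source.

<pre>
import random
from collections import Counter

def spectral_statistics(tokens):
    """Return mu_n (number of distinct words) and the counter x -> mu_n(x),
    the number of distinct words occurring exactly x times."""
    freq = Counter(tokens)              # f_{in}: frequency of each distinct word
    return len(freq), Counter(freq.values())

if __name__ == "__main__":
    # A purely illustrative synthetic "text": 260430 tokens drawn from a 1/k
    # rank law over a hypothetical vocabulary of 50000 word types.
    random.seed(0)
    V, n = 50000, 260430
    weights = [1.0 / k for k in range(1, V + 1)]
    tokens = random.choices(range(V), weights=weights, k=n)
    mu_n, mu_n_x = spectral_statistics(tokens)
    print("n =", n, " mu_n =", mu_n)
    for x in range(1, 11):
        print(x, round(mu_n_x.get(x, 0) / mu_n, 4), round(1.0 / (x * (x + 1)), 4))
</pre>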

Assume for simplicity that $\{ f _ { i n } \} _ { i = 1 } ^ { N }$ are drawn from a multinomial distribution with probabilities $\{ p _ { i n } \}^ { N } _ { 1 }$ which form an array with respect to $n$. The asymptotic behaviour of the statistics $\{ f _{( k , n )} \} _ { k = 1 } ^ { \mu _ { n } }$, $\{ \mu _ { n } ( k ) \} _ { k = 1 } ^ { \mu _ { n } }$, $G _ { n } ( \cdot )$, which are all symmetric, is governed by the distribution function of the probabilities $\{ p _ { i n } \}^ { N } _ { 1 }$:

\begin{equation*} G _ { p , n } ( x ) = \sum _ { i = 1 } ^ { N } 1 _ { \{ n p _ { i n } \geq x \} }. \end{equation*}

Namely, if

\begin{equation*} \frac { 1 } { n } G _ { p , n } \stackrel { \omega } { \rightarrow } G, \end{equation*}

then

\begin{equation} \tag{a4} \frac { \mu _ { n } ( x ) } { n } \stackrel { \mathsf{P} } { \rightarrow } - \int _ { 0 } ^ { \infty } \frac { \lambda ^ { x } } { x ! } e ^ { - \lambda } G ( d \lambda ), \end{equation}

while if

\begin{equation*} R _ { n } \stackrel { \omega } { \rightarrow } R \text { and } \operatorname { lim } _ { \varepsilon \rightarrow 0 } \operatorname { sup } _ { n } \int _ { 0 } ^ { \varepsilon } z R _ { n } ( d z ) = 0, \end{equation*}

where

\begin{equation*} R _ { n } ( x ) = \frac { G _ { p , n } ( x ) } { \int _ { 0 } ^ { \infty } ( 1 - e ^ { - z } ) G _ { p , n } ( d z ) }, \end{equation*}

then

\begin{equation} \tag{a5} \frac { \mu _ { n } ( x ) } { \mu _ { n } } \stackrel { \mathsf{P} } { \rightarrow } \alpha ( x ) = - \int _ { 0 } ^ { \infty } \frac { \lambda ^ { x } e ^ { - \lambda } } { x ! } R ( d \lambda ) \end{equation}

(see, e.g., [a7]). Therefore the limiting expression $1 / x ( x + 1 )$ is by no means unique and there are as many possible limiting expressions for the spectral statistics $\mu _ { n } ( x ) / \mu _ { n }$ as there are measures with $\int _ { 0 } ^ { \infty } ( 1 - e ^ { - \lambda } ) R ( d \lambda ) = 1$. The right-hand side of (a5) gives Zipf's law (a3) if and only if

\begin{equation*} R ( x ) = \int _ { 0 } ^ { \infty } \frac { 1 } { 1 + z } e ^ { - z x } d z. \end{equation*}
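
This particular $R$ can be checked numerically. In the sketch below (Python, assuming SciPy is available), $- R ( d \lambda )$ has density $\int _ { 0 } ^ { \infty } z e ^ { - z \lambda } ( 1 + z ) ^ { - 1 } d z$; interchanging the order of integration, the $\lambda$-integral is a Gamma integral, which leaves a single integral over $z$ to be compared with $1 / ( x ( x + 1 ) )$.

<pre>
from scipy.integrate import quad

def alpha_zipf(x):
    """Right-hand side of (a5) for R(lambda) = int_0^inf e^(-z*lambda)/(1+z) dz.
    Here -R(d lambda) has density int_0^inf z e^(-z*lambda)/(1+z) dz; exchanging
    the order of integration, the lambda-integral is the Gamma integral
    int_0^inf lambda^x e^(-(1+z)*lambda) d lambda = x!/(1+z)^(x+1), leaving a
    single integral over z."""
    value, _ = quad(lambda z: z / (1.0 + z) ** (x + 2), 0.0, float("inf"))
    return value

for x in range(1, 7):
    print(x, alpha_zipf(x), 1.0 / (x * (x + 1)))
</pre>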

The concepts of C. Shannon in information theory were relatively new when B. Mandelbrot [a11] used them to make Zipf's reasoning on a compromise between speaker's and auditor's economies rigorous and derive (a3) formally. Therefore the statement

\begin{equation} \tag{a6} \frac { \mu _ { n } ( x ) } { \mu _ { n } } \approx \frac { 1 } { ( a + b x ) ^ { 2 } } \end{equation}

is frequently called the Zipf–Mandelbrot law. Among earlier work on distributions of the sort,

\begin{equation} \tag{a7} f _{( k , n )} \sim A k ^ { - ( 1 + q ) } \end{equation}

is, of course, the famous Pareto distribution of population over income [a17].
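
The rank-size form (a7) is easy to reproduce from an i.i.d. Pareto sample: if incomes have tail $\mathsf{P} \{ X > t \} = t ^ { - \theta }$, $t \geq 1$, then the $k$-th largest of $n$ observations is approximately $( n / k ) ^ { 1 / \theta }$, a power law in the rank $k$ with exponent $1 / \theta$, so that $1 + q$ in (a7) corresponds to $1 / \theta$. A small illustration follows (a Python sketch; the tail form above and the choice $\theta = 1 / 2$, i.e. $q = 1$, are assumptions for the example, and the agreement is naturally rough for the few largest observations).

<pre>
import random

def pareto_sample(n, theta, seed=1):
    """i.i.d. incomes with P(X > t) = t**(-theta) for t >= 1 (inverse-CDF sampling)."""
    rng = random.Random(seed)
    return [(1.0 - rng.random()) ** (-1.0 / theta) for _ in range(n)]

if __name__ == "__main__":
    n, theta = 100000, 0.5        # theta = 0.5 corresponds to rank exponent 1/theta = 2
    incomes = sorted(pareto_sample(n, theta), reverse=True)
    for k in (1, 10, 100, 1000, 10000):
        # k-th largest observation versus the rank-size prediction (n/k)**(1/theta)
        print(k, round(incomes[k - 1], 1), round((n / k) ** (1.0 / theta), 1))
</pre>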

In lexical data, both Mandelbrot [a12] and Zipf [a25] note that the so-called Zipf's law was first observed by J.B. Estoup [a2]. (Note that Zipf's role was not so much the observation itself, which, by the way, he never claimed to be the first to make, as the serious attempt to explain the mechanism behind it.)

G.U. Yule [a24], in his analysis of data of J.G. Willis [a22] on problems of biological taxonomy, introduced the following expression for $\alpha ( x )$:

\begin{equation} \tag{a8} \alpha ( x ) = \frac { \Gamma ( \beta + 1 ) \Gamma ( x ) } { \Gamma ( x + \beta + 1 ) }, \end{equation}

depending on some parameter $\beta$.
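
By Stirling's formula, $\Gamma ( x ) / \Gamma ( x + \beta + 1 ) \sim x ^ { - ( 1 + \beta ) }$ as $x \rightarrow \infty$, so (a8) has a power-law tail $\alpha ( x ) \sim \Gamma ( \beta + 1 ) x ^ { - ( 1 + \beta ) }$. A quick numerical check (a Python sketch; $\beta = 1.5$ is an arbitrary choice, and log-gamma is used to avoid overflow):

<pre>
from math import exp, lgamma, log

def alpha_yule(x, beta):
    """Yule's expression (a8), Gamma(beta+1)Gamma(x)/Gamma(x+beta+1),
    computed via log-gamma to avoid overflow."""
    return exp(lgamma(beta + 1.0) + lgamma(x) - lgamma(x + beta + 1.0))

beta = 1.5
for x in (10, 100, 1000, 10000):
    tail = exp(lgamma(beta + 1.0) - (beta + 1.0) * log(x))   # Gamma(beta+1) * x**(-(1+beta))
    print(x, alpha_yule(x, beta), tail)
</pre>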

H.A. Simon [a19] proposed and studied a certain Markov model for $\{ \mu _ { n } ( x ) : x = 1,2 , \ldots \}$ as a process in $n$ and obtained (a8) and a somewhat more general form of it as a limiting expression for $\mu _ { n } ( x ) / \mu _ { n }$. His core assumption was most natural: as there are $x \mu _ { n } ( x )$ words used $x$ times in the first $n$ draws, the probability that a word from this group will be used on the $( n + 1 )$st draw is proportional to $x \mu _ { n } ( x )$. In other words,

\begin{equation} \tag{a9} \mathsf{E} [ \mu _ { n + 1 } ( x ) | \mu _ { n } ( \cdot ) ] - \mu _ { n } ( x ) = k ( n ) [ ( x - 1 ) \mu _ { n } ( x - 1 ) - x \mu _ { n } ( x ) ]. \end{equation}

His other assumption that a new (unused) word would occur on each trial with the same constant probability $\alpha$ was, perhaps, a little bit controversial. Indeed, it implies that the empirical vocabulary $\mu _ { n }$ increases with the same rate as $n$, while for Zipf's law in its strict sense of (a3) (or for $q = 1$ in (a7)) one obviously has

\begin{equation*} \frac { n } { \mu _ { n } } = \frac { \sum _ { x = 1 } ^ { n } x \mu _ { n } ( x ) } { \mu _ { n } } \sim \sum _ { x = 1 } ^ { n } \frac { 1 } { x + 1 } \rightarrow \infty . \end{equation*}
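
Simon's scheme itself is easy to simulate. In the sketch below (Python; the function names and parameter values are illustrative choices, not Simon's), a text grows word by word: with a fixed probability a new word is added, and otherwise a position of the existing text is chosen uniformly and its word is repeated, so that the probability of reusing a word is proportional to its current frequency, the core assumption behind (a9). The observed spectrum $\mu _ { n } ( x ) / \mu _ { n }$ is compared with the normalized Yule–Simon probabilities $\rho B ( x , \rho + 1 )$ with $\rho = 1 / ( 1 - \alpha )$, i.e. with (a8) for $\beta = \rho$ multiplied by the normalizing constant $\rho$; this standard correspondence of parameters is assumed here only for the purpose of the sketch.

<pre>
import random
from collections import Counter
from math import gamma

def simulate_simon(n, alpha_new, seed=0):
    """Grow a text of n words: with probability alpha_new append a new word,
    otherwise repeat the word at a uniformly chosen position of the text, so
    the chance of reusing a word is proportional to its current frequency."""
    rng = random.Random(seed)
    text = [0]
    next_word = 1
    for _ in range(n - 1):
        if rng.random() < alpha_new:
            text.append(next_word)
            next_word += 1
        else:
            text.append(rng.choice(text))
    return text

def yule_simon(x, rho):
    # rho * B(x, rho + 1); assumption for this sketch: rho = 1/(1 - alpha_new)
    return rho * gamma(rho + 1) * gamma(x) / gamma(x + rho + 1)

if __name__ == "__main__":
    alpha_new = 0.2
    text = simulate_simon(200000, alpha_new)
    freq = Counter(text)                # word -> frequency
    spectrum = Counter(freq.values())   # x -> mu_n(x)
    mu_n = len(freq)
    rho = 1.0 / (1.0 - alpha_new)
    for x in range(1, 7):
        print(x, round(spectrum.get(x, 0) / mu_n, 4), round(yule_simon(x, rho), 4))
</pre>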

This divergence, and some other concerns of an analytic nature (cf. the criticism in [a12], [a13] and the replies [a20], [a21]), led Simon to the third assumption: "[a word] may be dropped with the same average rate". This is described in very clear form in [a20]: "We consider a sequence of $k$ words. We add words to the end of the sequence, and drop words from the sequence at the same average rate, so that the length of the sequence remains $k$. For the birth process we assume: the probability that the next word added is one that now occurs $i$ times is proportional to $( i + c ) \mu ( i )$. The probability that the next word is a new word is $\alpha$, a constant. For the death process we assume: the probability that the next word dropped is one that now occurs $i$ times is proportional to $( i + d ) \mu ( i )$. The terms $c$ and $d$ are constant parameters. The steady state relation is:

\begin{equation*} \mu ( i , m + 1 ) - \mu ( i , m ) = \frac { 1 - \alpha } { k + c n _ { k } } [ ( i - 1 + c ) \mu ( i - 1 , m ) - ( i + c ) \mu ( i , m ) ] - \frac { 1 } { k + d n _ { k } } [ ( i + d ) \mu ( i , m ) - ( i + d + 1 ) \mu ( i + 1 , m ) ] = 0. \end{equation*}

A solution to this equation, independent of $m$, is:

\begin{equation} \tag{a10} \mu ( i , m ) = A \lambda ^ { i } B ( i + c , d - c + 1 ), \end{equation}

where

\begin{equation*} \lambda = \frac { ( 1 - \alpha ) ( k + d n _ { k } ) } { ( k + cn _ { k } ) } \end{equation*}

and $B$ is the beta function.

"

See [a12] for an analytically beneficial replacement of the difference equations (a9) (for the expectations $\mathsf E \mu _ { n } ( x )$) by certain partial differential equations. See also [a3] for the exact solution of these difference equations. Simon suggested calling (a8) the Yule distribution, but (a8) and (a10) are more appropriately and more commonly called the Yule–Simon distribution.
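
One can also verify numerically that (a10) solves the steady-state relation quoted above. In the following sketch (Python; the values of $\alpha$, $c$, $d$, $k$ and $n _ { k }$ are arbitrary illustrative choices, and $A = 1$), the left-hand side of the relation is evaluated with $\mu ( i , m + 1 ) = \mu ( i , m )$ taken from (a10); it should vanish up to rounding error.

<pre>
from math import gamma

def beta_fn(a, b):
    return gamma(a) * gamma(b) / gamma(a + b)

def mu(i, lam, c, d):
    """Candidate steady-state solution (a10) with A = 1."""
    return lam ** i * beta_fn(i + c, d - c + 1.0)

def residual(i, alpha, c, d, k, n_k):
    """Left-hand side of the quoted steady-state relation, with
    mu(i, m + 1) = mu(i, m) given by (a10); should be zero."""
    lam = (1.0 - alpha) * (k + d * n_k) / (k + c * n_k)
    birth = (1.0 - alpha) / (k + c * n_k) * (
        (i - 1 + c) * mu(i - 1, lam, c, d) - (i + c) * mu(i, lam, c, d))
    death = 1.0 / (k + d * n_k) * (
        (i + d) * mu(i, lam, c, d) - (i + d + 1) * mu(i + 1, lam, c, d))
    return birth - death

for i in range(1, 8):
    print(i, residual(i, alpha=0.1, c=0.5, d=1.5, k=1000.0, n_k=200.0))
</pre>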

Approximately at the same time, R.H. MacArthur [a10] published a note on a model for the relative abundance of species. His model of "non-overlapping niches" was very simple: "The environment is compared with a stick of unit length on which $m - 1$ points are thrown at random. The stick is broken at these points and lengths of the $m$ resulting segments are proportional to the abundance of the $m$ species. [...] The expected length of $r$-th shortest interval is given by

\begin{equation} \tag{a11} \frac { 1 } { m } \sum _ { i = 1 } ^ { r } \frac { 1 } { m - i + 1 } = p ( r ) \end{equation}

so that the expected abundance of the $r$-th rarest species among $m$ species and $n$ individuals is $n p ( r )$.

" It appears (see [a26]) that "bird censuses" from tropical forests and many temperate regions fit this almost perfectly.

The rate of decay of $p ( r )$ in $r$ is, certainly, very different from $r ^ { - ( 1 + q ) }$. Very interestingly, MacArthur then suggests viewing the divergence of a distribution from (a11) as an indication and a measure of the heterogeneity of a population. His idea was to "use several sticks of very different size" to produce other curves.

In a certain sense, this was realized in a very elegant model of B. Hill and M. Woodroofe (see [a4], [a5], [a23]).

According to Hill, if $N$ individuals are to be distributed to $M$ non-empty genera in accordance with Bose–Einstein statistics, i.e. all allocations are equi-probable, each with probability

\begin{equation*} \frac { 1 } { \left( \begin{array} { c } { N - 1 } \\ { M - 1 } \end{array} \right) }, \end{equation*}

and if $\mathsf{P} \{ M / N \leq x \} \stackrel { \omega } { \rightarrow } F ( x )$, $0 \leq x \leq 1$, $F ( 0 ) = 0$, then for the number of genera with exactly $x$ species one has

\begin{equation} \tag{a12} \frac { \mu _ { N } ( x ) } { M } \stackrel { d } { \rightarrow } U ( 1 - U ) ^ { x - 1 }, \end{equation}

where the random variable $U$ has distribution $F$. Note that if $U$ has a uniform distribution, then, as a consequence of (a12),

\begin{equation*} \textsf{E} \frac { \mu _ { N } ( x ) } { M } \rightarrow \frac { 1 } { x ( x + 1 ) }, \end{equation*}

which is exactly (a3). By considering a triangular array of independent (or exchangeable, as the authors have it) families of genera $M _ { i k }$ with allocations of $N _ { i k }$ species in each family $i = 1 , \ldots , k$ (the different "sticks" of MacArthur [a10]!), Hill and Woodroofe have shown that for combined statistics $\mu _ { N _ { k } } ( x ) = \sum _ { i = 1 } ^ { k } \mu _ { i N _ { i } } ( x )$ and $M _ { k } = \sum _ { i = 1 } ^ { k } M _ { i k }$ one has

\begin{equation} \tag{a13} \frac { \mu _ { N _ { k } } ( x ) } { M _ { k } } \stackrel { \mathsf{P} } { \rightarrow } \int _ { 0 } ^ { 1 } u ( 1 - u ) ^ { x - 1 } F ( d u ). \end{equation}
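
Hill's scheme is also easy to simulate: a Bose–Einstein allocation of $N$ species to $M$ non-empty genera is a uniformly random composition of $N$ into $M$ positive parts, obtained by choosing $M - 1$ of the $N - 1$ gaps as cut points. In the sketch below (Python; the values of $N$ and of the number of replications are arbitrary choices), $M / N$ is drawn approximately uniform and the averaged proportions $\mu _ { N } ( x ) / M$ are compared with $1 / ( x ( x + 1 ) )$, as suggested by (a12).

<pre>
import random

def bose_einstein_composition(N, M, rng):
    """Uniformly random allocation of N species to M non-empty genera: every
    composition has probability 1/C(N-1, M-1), realized by choosing M-1 of the
    N-1 gaps as cut points."""
    cuts = sorted(rng.sample(range(1, N), M - 1))
    bounds = [0] + cuts + [N]
    return [bounds[i + 1] - bounds[i] for i in range(M)]

if __name__ == "__main__":
    rng = random.Random(3)
    N, reps, max_x = 1000, 3000, 5
    acc = [0.0] * (max_x + 1)
    for _ in range(reps):
        U = rng.random()                        # so that M/N is roughly uniform
        M = max(1, round(U * N))
        parts = bose_einstein_composition(N, M, rng)
        counts = {}
        for p in parts:
            counts[p] = counts.get(p, 0) + 1    # mu_N(x): number of genera of size x
        for x in range(1, max_x + 1):
            acc[x] += counts.get(x, 0) / M
    for x in range(1, max_x + 1):
        print(x, round(acc[x] / reps, 4), round(1.0 / (x * (x + 1)), 4))
</pre>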

A. Rouault [a18] considered a certain Markov process as a model for formation of a text word by word and obtained the following expression for the limit of spectral statistics:

\begin{equation} \tag{a14} \alpha ( x ) = \frac { \beta \Gamma ( x - \beta ) } { \Gamma ( 1 - \beta ) \Gamma ( x + 1 ) }, \end{equation}

depending on a parameter $\beta$.

This Karlin–Rouault law appeared in a very different context in [a8]: Let the unit interval be divided into two in the ratio $a : 1 - a$. Let each of these two be again subdivided in the same ratio, etc. At step $q$ one obtains probabilities $p _ { i }$, $i = 1 , \dots , 2 ^ { q }$, of the form $a ^ { k } ( 1 - a ) ^ { q - k }$ for some $k = 0 , \dots , q$. One can think of filling in a questionnaire of "yes-no" questions at random, with probability $a$ for one of the answers in each question. Then the $p _ { i }$ are the probabilities of the particular ways of answering all the questions, that is, of the possible "opinions". It was proved that if $a \neq 1 / 2$, then

\begin{equation} \tag{a15} \mu _ { n } \rightarrow \infty \quad \text { but } \frac { \mu _ { n } } { n } \rightarrow 0 \end{equation}

and that

\begin{equation} \tag{a16} \frac { \mu _ { n } ( x ) } { \mu _ { n } } \rightarrow \frac { \int _ { - \infty } ^ { \infty } \alpha ^ { s ( x + \beta ) } e ^ { - \alpha ^ { s } } d N ( s ) } { \Gamma ( x + 1 ) \int _ { - \infty } ^ { \infty } \alpha ^ { s \beta } e ^ { - \alpha ^ { s } } d N ( s ) }, \end{equation}

with $\alpha = a / ( 1 - a )$ and $\beta = \beta ( \alpha , c ) < 1$ and with $N ( s )$ being a counting measure (concentrated on integers). With $d N ( s )$ replaced by the Lebesgue measure $d s$ and using the change of variables $u = \alpha ^ { s }$, the right-hand side of (a16) will coincide with (a14).
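
The questionnaire model is simple to simulate for finite $q$. In the sketch below (Python; $q = 30$ and the values of $a$ and $n$ are arbitrary choices), each sampled "opinion" is a tuple of $q$ yes/no answers, each "yes" with probability $a$, and $\mu _ { n }$ is the number of distinct opinions observed. For $a$ far from $1 / 2$ the ratio $\mu _ { n } / n$ visibly decays as $n$ grows while $\mu _ { n }$ itself keeps increasing, illustrating (a15); the theorem itself, of course, concerns the limit in which $q$ grows together with $n$, so the finite-$q$ run is only an illustration.

<pre>
import random

def count_distinct_opinions(n, q, a, seed=4):
    """Sample n questionnaires of q yes/no answers (each "yes" with probability a)
    and return mu_n, the number of distinct answer patterns ("opinions")."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(n):
        seen.add(tuple(rng.random() < a for _ in range(q)))
    return len(seen)

if __name__ == "__main__":
    q = 30                                      # 2**30 possible opinions
    for a in (0.5, 0.7, 0.9):
        for n in (10**3, 10**4, 10**5):
            mu_n = count_distinct_opinions(n, q, a)
            print("a =", a, " n =", n, " mu_n =", mu_n, " mu_n/n =", round(mu_n / n, 3))
</pre>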

A possible heuristic interpretation of (a15) is that if $a \neq 1 / 2$, then the saying "as many men as many minds" is not quite correct: though the number of "minds" is infinitely large, it is still asymptotically smaller than the number of "men".

A characteristic of the present-time (2000) study of Zipf's law is the increased interest in the more general concept of "distributions with a large number of rare events" (LNRE distributions), which was introduced and first systematically studied in [a6] (partly reproduced in [a7]).

For a systematic application of LNRE distributions to the analysis of texts, see [a1]. On the other hand, studies of specific laws continue. In a series of papers, Yu. Orlov (see, e.g., [a14], [a15], [a16]) promoted the hypothesis that the presence of Zipf's law is the characteristic property of a complete, finished and well-balanced text. Though the hypothesis looks beautiful and attractive, for some readers the evidence may seem thin. The problem of non-parametric estimation of the functions $G$ and $R$ from the expressions (a4) and (a5) provides another natural example of an "inverse problem", studied recently by C. Klaassen and R. Mnatsakanov [a9].

In conclusion, the following quote from [a12], whose views I share, is given: "The form of Zipf's law [see (a7)] is so striking and also so very different from any classical distribution of statistics that it is quite widely felt that it "should" have a basically simple reason, possibly as "weak" and general as the reasons which account for the role of Gaussian distribution. But, in fact, these laws turn out to be quite resistant to such an analysis. Thus, irrespective of any claim as to their practical importance, the "explanation" of their role has long been one of the best defined and most conspicuous challenges to those interested in statistical laws of nature.

"

References

[a1] R.H. Baayen, "Word frequency distribution" , Kluwer Acad. Publ. (2000)
[a2] J.B. Estoup, "Gammes sténographiques" , Paris (1916) (fourth ed.)
[a3] R. Gunther, B. Schapiro, P. Wagner, "Physical complexity and Zipf's law" Internat. J. Theoret. Phys. , 31 : 3 (1999) pp. 525–543
[a4] B.M. Hill, "Zipf's law and prior distributions for the composition of a population" J. Amer. Statist. Assoc. , 65 (1970) pp. 1220–1232
[a5] B.M. Hill, M. Woodroofe, "Stronger forms of Zipf's law" J. Amer. Statist. Assoc. , 70 (1975) pp. 212–219
[a6] E.V. Khmaladze, "The statistical analysis of large number of rare events" Techn. Rept. Dept. Math. Statist. CWI, Amsterdam , MS-R8804 (1987)
[a7] E.V. Khmaladze, R.J. Chitashvili, "Statistical analysis of large number of rare events and related problems" Trans. Tbilisi Math. Inst. , 91 (1989) pp. 196–245
[a8] E.V. Khmaladze, Z.P. Tsigroshvili, "On polynomial distributions with large number of rare events" Math. Methods Statist. , 2 : 3 (1993) pp. 240–247
[a9] C.A.J. Klaassen, R.M. Mnatsakanov, "Consistent estimation of the structural distribution function" Scand. J. Statist. , 27 (2000) pp. 1–14
[a10] R.H. MacArthur, "On relative abundancy of bird species" Proc. Nat. Acad. Sci. USA , 43 (1957) pp. 293–295
[a11] B. Mandelbrot, "An information theory of the statistical structure of language" W.E. Jackson (ed.) , Communication Theory , Acad. Press (1953) pp. 503–512
[a12] B. Mandelbrot, "A note on a class of skew distribution functions: Analysis and critique of a paper by H.A. Simon" Inform. & Control , 2 (1959) pp. 90–99
[a13] B. Mandelbrot, "Final note on a class of skew distribution functions: Analysis and critique of a model due to H.A. Simon" Inform. & Control , 4 (1961) pp. 198–216; 300–304
[a14] Ju.K. Orlov, "Dynamik der Häufigkeitsstrukturen" H. Guiter (ed.) M.V. Arapov (ed.) , Studies on Zipf's law , Brockmeyer, Bochum (1983) pp. 116–153
[a15] Ju.K. Orlov, "Ein Modell der Häufigkeitsstruktur des Vokabulars" H. Guiter (ed.) M.V. Arapov (ed.) , Studies on Zipf's law , Brockmeyer, Bochum (1983) pp. 154–233
[a16] Ju.K. Orlov, R.Y. Chitashvili, "On the statistical interpretation of Zipf's law" Bull. Acad. Sci. Georgia , 109 (1983) pp. 505–508
[a17] V. Pareto, "Cours d'économie politique" , Lausanne and Paris (1897)
[a18] A. Rouault, "Loi de Zipf et sources markoviennes" Ann. Inst. H. Poincaré , 14 (1978) pp. 169–188
[a19] H.A. Simon, "On a class of skew distribution functions" Biometrika , 42 (1955) pp. 435–440
[a20] H.A. Simon, "Some further notes on a class of skew distribution functions" Inform. & Control , 3 (1960) pp. 80–88
[a21] H.A. Simon, "Reply to `final note' by Benoit Mandelbrot" Inform. & Control , 4 (1961) pp. 217–223
[a22] J.G. Willis, "Age and area" , Cambridge Univ. Press (1922)
[a23] M. Woodroofe, B.M. Hill, "On Zipf's law" J. Appl. Probab. , 12 : 3 (1975) pp. 425–434
[a24] G.U. Yule, "A mathematical theory of evolution, based on the conclusions of Dr. J.C. Willis, F.R.S." Philos. Trans. Royal Soc. London Ser. B , 213 (1924) pp. 21–87
[a25] G.K. Zipf, "Human behaviour and the principles of least effort" , Addison-Wesley (1949)
[a26] L.I. Davis, Aud. Field Notes , 9 (1955) pp. 425
[a27] L.M. Hanley, "Word index to James Joyce's Ulysses" , Madison, Wisc. (1937) (statistical tabulations by M. Joos)
This article was adapted from an original article by E. Khmaladze (originator), which appeared in Encyclopedia of Mathematics - ISBN 1402006098.