Zipf law
In 1949 G.K. Zipf published [a25]. A large, if not the main, part of it was devoted to the principles of human use of a language, and the following was the main thesis of the author [a25], pp. 20–21: "From the viewpoint of the speaker, there would doubtless exist an important latent economy in a vocabulary that consisted exclusively of one single word — a single word that would mean whatever the speaker wanted it to mean. Thus, if there were $m$ different meanings to be verbalized, this word would have $m$ different meanings. [] But from the viewpoint of the auditor, a single word vocabulary would represent the acme of verbal labour, since he would be faced by the impossible task of determining the particular meaning to which the single word in a given situation might refer. Indeed, from the viewpoint of the auditor [] the important internal economy of speech would be found rather in a vocabulary of such size that [] if there were $m$ different meanings, there would be $m$ different words, with one meaning per word.
Thus, if there are an $m$ number of different distinctive meanings to be verbalized, there will be 1) a speaker's economy in possessing a vocabulary of one word which will refer to all the distinctive meanings; and there will also be 2) an opposing auditor's economy in possessing a vocabulary of $m$ different words with one distinctive meaning for each word. Obviously the two opposing economies are in extreme conflict.
In the language of these two [economies or forces] we may say that the vocabulary of a given stream of speech is constantly subject to the opposing Forces of Unification and Diversification which will determine both the number $n$ of actual words in the vocabulary, and also the meaning of those words.
"
This being the main thesis, the main data and the main empirical fact discussed thoroughly in the book are the following: "James Joyce's novel Ulysses, with its $260,430$ running words, represents a sizable sample of running speech that may fairly be said to have served successfully in the communication of ideas. An index to the number of different words therein, together with the actual frequencies of their respective occurrences, has already been made with exemplary methods by Dr. Miles L. Hanley and associates ([a27]) [].
To the above published index has been added an appendix from the careful hands of Dr. M. Joos, in which is set forth all the qualitative information that is necessary for our present purposes. For Dr. Joos not only tells us that there are $29,899$ different words in the $260,430$ running words; [] and tells us the actual frequency, $f$, with which the different ranks, $r$, occur.
It is evident that the relationship between the various ranks, $r$, of these words and their respective frequencies, $f$, is potentially quite instructive about the entire matter of vocabulary balance, not only because it involves the frequencies with which the different words occur but also because the terminal rank of the list tells us the number of different words in the sample. And we remember that both the frequencies of occurrence and the number of different words will be important factors in the counterbalancing of the Forces of Unification and Diversification in the hypothetical vocabulary balance of any sample of speech.
"
Now, let $\{ f _ { i n } \} _ { i = 1 } ^ { N }$ be the frequencies of $N$, $N \leq \infty$, different disjoint events in a sample of size $n$, like occurrences of different words in a text of $n$ running words, and let $\mu _ { n } = \sum _ { i = 1 } ^ { N } 1 _ { \{ f _ { i n } \geq 1 \} }$ be the "empirical vocabulary", that is, the number of all different events in the sample. Furthermore, let $f_{ ( 1 , n )} \geq \ldots \geq f _{( \mu _ { n } , n )}$ be the order statistics based on the frequencies and let $G _ { n } ( x ) = \sum _ { i = 1 } ^ { N } 1_{ \{ f _ { i n } \geq x \} }$. Clearly, $\Delta G _ { n } ( x ) \equiv \mu _ { n } ( x ) = \sum _ { i = 1 } ^ { N } 1 _ { \{ f _ { i n } = x \} }$ is the number of events that occurred in the sample exactly $x$ times, and $G _ { n } ( 1 ) = \mu _ { n }$. Consider the statement that
\begin{equation} \tag{a1} k f _{( k , n )} \approx \mu _ { n } ,\; k = 1,2 , \ldots, \end{equation}
in some sense [a25], II.A, p. 24. There is a bit of a problem with such a statement for high values of the rank $k$. Indeed, though in general $G _ { n } ( f _ {( k , n )} ) = \operatorname { max } \left\{ k ^ { \prime } : f _ { ( k ^ { \prime } , n ) } = f _ {( k , n )} \right\}$ for $k = 1,2 , \dots$, typically $G _ { n } ( f_{( k , n )} ) = k$, and one can replace (a1) by
\begin{equation} \tag{a2} G _ { n } ( x ) x \approx \mu _ { n } ,\; x = f _{( 1 , n )} , f _{( 2 , n )}, \dots . \end{equation}
For $k$ close to $\mu _ { n }$ there will be plenty of frequencies of the same value which must have different ranks, and therefore the meaning of (a1) becomes obscure. However, one can use (a2) for all values of $x = 1 , \dots , f_{( 1 , n )}$. In the form
\begin{equation} \tag{a3} \frac { \mu _ { n } ( x ) } { \mu _ { n } } \approx \Delta \frac { 1 } { x } = \frac { 1 } { x ( x + 1 ) } , x = 1,2 , \dots, \end{equation}
it becomes especially vivid, for it says that even for very large $n$ (of many thousands) half of the empirical vocabulary should come from events which occurred in the sample only once, $x = 1$. The events that occurred only twice, $x = 2$, should constitute $1 / 6$ of $\mu _ { n }$, and the events that occurred not more than $10$ times should constitute $10 / 11$-th of the empirical vocabulary. So, somewhat counterintuitively, even in very big samples rare events should be present in plenty and should constitute almost the total of all different events that occurred. Any of the statements (a1), (a2) or (a3) is called Zipf's law.
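For a concrete feel of these statistics, the following minimal sketch (in Python, with a toy token list standing in for a real corpus; any tokenized text can be substituted) computes $\mu _ { n }$ and the spectrum $\mu _ { n } ( x )$ and compares $\mu _ { n } ( x ) / \mu _ { n }$ with the right-hand side of (a3).

```python
from collections import Counter

def spectral_statistics(tokens):
    """Return mu_n (the number of distinct words) and the spectrum
    {x: mu_n(x)}, the number of distinct words occurring exactly x times."""
    freq = Counter(tokens)             # f_{in}: frequency of each distinct word
    return len(freq), Counter(freq.values())

# toy stand-in for a tokenized corpus; replace with real data
tokens = "the quick brown fox jumps over the lazy dog the fox".split()
mu_n, mu_nx = spectral_statistics(tokens)

for x in sorted(mu_nx):
    print(x, mu_nx[x] / mu_n, 1.0 / (x * (x + 1)))   # empirical share vs. (a3)
```

Of course, on such a tiny sample the comparison is only a placeholder; (a3) is meant for large $n$.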
Assume for simplicity that $\{ f _ { i n } \} _ { i = 1 } ^ { N }$ are drawn from a multinomial distribution with probabilities $\{ p _ { i n } \}^ { N } _ { 1 }$ which form an array with respect to $n$. The asymptotic behaviour of the statistics $\{ f _{( k , n )} \} _ { k = 1 } ^ { \mu _ { n } }$, $\{ \mu _ { n } ( k ) \} _ { k = 1 } ^ { \mu _ { n } }$, $G _ { n } ( \cdot )$, all of which are symmetric, is governed by the distribution function of the probabilities $\{ p _ { i n } \}^ { N } _ { 1 }$:
\begin{equation*} G _ { p , n } ( x ) = \sum _ { i = 1 } ^ { N } 1 _ { \{ n p _ { i n } \geq x \} }. \end{equation*}
Namely, if
\begin{equation*} \frac { 1 } { n } G _ { p , n } \stackrel { \omega } { \rightarrow } G, \end{equation*}
then
\begin{equation} \tag{a4} \frac { \mu _ { n } ( x ) } { n } \stackrel { \mathsf{P} } { \rightarrow } - \int _ { 0 } ^ { \infty } \frac { \lambda ^ { x } } { x ! } e ^ { - \lambda } G ( d \lambda ), \end{equation}
while if
\begin{equation*} R _ { n } \stackrel { \omega } { \rightarrow } R \text { and } \operatorname { lim } _ { \varepsilon \rightarrow 0 } \operatorname { sup } _ { n } \int _ { 0 } ^ { \varepsilon } z R _ { n } ( d z ) = 0, \end{equation*}
where
\begin{equation*} R _ { n } ( x ) = \frac { G _ { p , n } ( x ) } { \int _ { 0 } ^ { \infty } ( 1 - e ^ { - z } ) G _ { p , n } ( d z ) }, \end{equation*}
then
\begin{equation} \tag{a5} \frac { \mu _ { n } ( x ) } { \mu _ { n } } \stackrel { \mathsf{P} } { \rightarrow } \alpha ( x ) = - \int _ { 0 } ^ { \infty } \frac { \lambda ^ { x } e ^ { - \lambda } } { x ! } R ( d \lambda ) \end{equation}
(see, e.g., [a7]). Therefore the limiting expression $1 / ( x ( x + 1 ) )$ is by no means unique, and there are as many possible limiting expressions for the spectral statistics $\mu _ { n } ( x ) / \mu _ { n }$ as there are measures with $\int _ { 0 } ^ { \infty } ( 1 - e ^ { - \lambda } ) R ( d \lambda ) = 1$. The right-hand side of (a5) gives Zipf's law (a3) if and only if
\begin{equation*} R ( x ) = \int _ { 0 } ^ { \infty } \frac { 1 } { 1 + z } e ^ { - z x } d z. \end{equation*}
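Indeed, since for this $R$ one has $R ( d \lambda ) = - \left( \int _ { 0 } ^ { \infty } \frac { z } { 1 + z } e ^ { - z \lambda } d z \right) d \lambda$, a short direct computation, using the Gamma integral $\int _ { 0 } ^ { \infty } \lambda ^ { x } e ^ { - ( 1 + z ) \lambda } d \lambda = x ! / ( 1 + z ) ^ { x + 1 }$, verifies the claim:
\begin{equation*} - \int _ { 0 } ^ { \infty } \frac { \lambda ^ { x } e ^ { - \lambda } } { x ! } R ( d \lambda ) = \int _ { 0 } ^ { \infty } \frac { z } { 1 + z } \left( \int _ { 0 } ^ { \infty } \frac { \lambda ^ { x } } { x ! } e ^ { - ( 1 + z ) \lambda } d \lambda \right) d z = \int _ { 0 } ^ { \infty } \frac { z \, d z } { ( 1 + z ) ^ { x + 2 } } = \frac { 1 } { x } - \frac { 1 } { x + 1 } = \frac { 1 } { x ( x + 1 ) }. \end{equation*}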
The concepts of C. Shannon in information theory were relatively new when B. Mandelbrot [a11] used them to make Zipf's reasoning on a compromise between speaker's and auditor's economies rigorous and derive (a3) formally. Therefore the statement
\begin{equation} \tag{a6} \frac { \mu _ { n } ( x ) } { \mu _ { n } } \approx \frac { 1 } { ( a + b x ) ^ { 2 } } \end{equation}
is frequently called the Zipf–Mandelbrot law. Among earlier work on distributions of the sort,
\begin{equation} \tag{a7} f _{( k , n )} \sim A k ^ { - ( 1 + q ) } \end{equation}
is, of course, the famous Pareto distribution of population over income [a17].
In lexical data, both Mandelbrot [a12] and Zipf [a25] note that the so-called Zipf's law was first observed by J.B. Estoup [a2]. (Note that Zipf's role was not so much the observation itself, which, by the way, he never claimed to be the first to make, as the serious attempt to explain the mechanism behind it.)
G.U. Yule [a24], in his analysis of the data of J.C. Willis [a22] on problems of biological taxonomy, introduced the following expression for $\alpha ( x )$:
\begin{equation} \tag{a8} \alpha ( x ) = \frac { \Gamma ( \beta + 1 ) \Gamma ( x ) } { \Gamma ( x + \beta + 1 ) }, \end{equation}
depending on some parameter $\beta$.
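One may note that for $\beta = 1$ the expression (a8) reduces exactly to the Zipf form (a3):
\begin{equation*} \frac { \Gamma ( 2 ) \Gamma ( x ) } { \Gamma ( x + 2 ) } = \frac { ( x - 1 ) ! } { ( x + 1 ) ! } = \frac { 1 } { x ( x + 1 ) }. \end{equation*}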
H.A. Simon [a19] proposed and studied a certain Markov model for $\{ \mu _ { n } ( x ) : x = 1,2 , \ldots \}$ as a process in $n$ and obtained (a8), and a somewhat more general form of it, as a limiting expression for $\mu _ { n } ( x ) / \mu _ { n }$. His core assumption was most natural: since the words used exactly $x$ times account for $x \mu _ { n } ( x )$ of the first $n$ draws, the probability that a word from this group will be used on the $( n + 1 )$st draw is proportional to $x \mu _ { n } ( x )$. In other words,
\begin{equation} \tag{a9} \mathsf{E} [ \mu _ { n + 1 } ( x ) | \mu _ { n } ( \cdot ) ] - \mu _ { n } ( x ) = k ( n ) [ ( x - 1 ) \mu _ { n } ( x - 1 ) - x \mu _ { n } ( x ) ]. \end{equation}
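For illustration, here is a minimal simulation sketch of this birth process, together with Simon's further assumption (discussed below) that a new word occurs on each trial with constant probability $\alpha$: otherwise the next token repeats a uniformly chosen earlier token, which makes the choice of an existing word proportional to its current frequency. The empirical spectrum is compared with the normalized form of (a8), $\beta \Gamma ( \beta + 1 ) \Gamma ( x ) / \Gamma ( x + \beta + 1 )$, taking the standard value $\beta = 1 / ( 1 - \alpha )$ for this process; the sample size and $\alpha$ below are arbitrary choices, and the code is only a sketch, not Simon's own computation.

```python
import random
from collections import Counter
from math import gamma

def simon_process(n, alpha, seed=0):
    """Simon's birth process: with probability alpha the next token is a new
    word; otherwise it repeats a uniformly chosen earlier token, so an
    existing word is picked with probability proportional to its frequency."""
    rng = random.Random(seed)
    tokens, new_word = [0], 1          # the first token introduces word 0
    for _ in range(n - 1):
        if rng.random() < alpha:
            tokens.append(new_word)
            new_word += 1
        else:
            tokens.append(rng.choice(tokens))
    return tokens

alpha = 0.3
freq = Counter(simon_process(200_000, alpha))
mu = Counter(freq.values())            # mu_n(x)
vocabulary = len(freq)                 # mu_n
beta = 1.0 / (1.0 - alpha)
for x in range(1, 6):
    yule = beta * gamma(beta + 1) * gamma(x) / gamma(x + beta + 1)   # normalized (a8)
    print(x, round(mu[x] / vocabulary, 4), round(yule, 4))
```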
His other assumption, that a new (unused) word would occur on each trial with the same constant probability $\alpha$, was, perhaps, a little bit controversial. Indeed, it implies that the empirical vocabulary $\mu _ { n }$ increases at the same rate as $n$, while for Zipf's law in its strict sense of (a3) (or for $q = 1$ in (a7)) one obviously has
\begin{equation*} \frac { n } { \mu _ { n } } = \frac { \sum _ { x = 1 } ^ { n } x \mu _ { n } ( x ) } { \mu _ { n } } \sim \sum _ { x = 1 } ^ { n } \frac { 1 } { x + 1 } \rightarrow \infty . \end{equation*}
This and some other concerns of an analytic nature (cf. the criticism in [a12], [a13] and the replies [a20], [a21]) led Simon to a third assumption: "[a word] may be dropped with the same average rate". This is described in very clear form in [a20]: "We consider a sequence of $k$ words. We add words to the end of the sequence, and drop words from the sequence at the same average rate, so that the length of the sequence remains $k$. For the birth process we assume: the probability that the next word added is one that now occurs $i$ times is proportional to $( i + c ) \mu ( i )$. The probability that the next word is a new word is $\alpha$, a constant. For the death process we assume: the probability that the next word dropped is one that now occurs $i$ times is proportional to $( i + d ) \mu ( i )$. The terms $c$ and $d$ are constant parameters. The steady state relation is:
\begin{equation*} \mu ( i , m + 1 ) - \mu ( i , m ) = \frac { ( 1 - \alpha ) } { k + c n _ { k } } [ ( i - 1 + c ) \mu ( i - 1 , m ) - ( i + c ) \mu ( i , m ) ] - \frac { 1 } { k + d n _ { k } } [ ( i + d ) \mu ( i , m ) - ( i + d + 1 ) \mu ( i + 1 , m ) ] = 0. \end{equation*}
A solution to this equation, independent of $m$, is:
\begin{equation} \tag{a10} \mu ( i , m ) = A \lambda ^ { i } B ( i + c , d - c + 1 ), \end{equation}
where
\begin{equation*} \lambda = \frac { ( 1 - \alpha ) ( k + d n _ { k } ) } { ( k + cn _ { k } ) } \end{equation*}
and $B$ is the beta function.
"
See [a12] for an analytically beneficial replacement of the difference equations (a9) (for the expectations $\mathsf E \mu _ { n } ( x )$) by certain partial differential equations, and see [a3] for the exact solution of these difference equations. Simon suggested calling (a8) the Yule distribution, but (a8) and (a10) are more appropriately and more commonly called the Yule–Simon distribution.
At approximately the same time, R.H. MacArthur [a10] published a note on a model for the relative abundance of species. His model of "non-overlapping niches" was very simple: "The environment is compared with a stick of unit length on which $m - 1$ points are thrown at random. The stick is broken at these points and the lengths of the $m$ resulting segments are proportional to the abundance of the $m$ species. [] The expected length of the $r$-th shortest interval is given by
\begin{equation} \tag{a11} \frac { 1 } { m } \sum _ { i = 1 } ^ { r } \frac { 1 } { m - i + 1 } = p ( r ), \end{equation}
so that the expected abundance of the $r$-th rarest species among $m$ species and $n$ individuals is $n p ( r )$.
" It appears (see [a26]) that "bird censuses" from tropical forests and many temperate regions fit this almost perfectly.
The behaviour of $p ( r )$ in $r$ is, certainly, very different from $r ^ { - ( 1 + q ) }$. Very interestingly, MacArthur then suggests viewing the divergence of a distribution from (a11) as an indication and a measure of the heterogeneity of a population. His idea was to "use several sticks of very different size" to produce other curves.
In a certain sense, this was realized in a very elegant model of B. Hill and M. Woodroofe (see [a4], [a5], [a23]).
According to Hill, if $N$ species are to be distributed over $M$ non-empty genera in accordance with Bose–Einstein statistics, i.e. all allocations are equiprobable, each with probability
\begin{equation*} \frac { 1 } { \left( \begin{array} { c } { N - 1 } \\ { M - 1 } \end{array} \right) }, \end{equation*}
and if $\mathsf{P} \{ M / N \leq x \} \stackrel { \omega } { \rightarrow } F ( x )$, $0 \leq x \leq 1$, $F ( 0 ) = 0$, then for the number of genera with exactly $x$ species one has
\begin{equation} \tag{a12} \frac { \mu _ { N } ( x ) } { M } \stackrel { d } { \rightarrow } U ( 1 - U ) ^ { x - 1 }, \end{equation}
where the random variable $U$ has distribution $F$. Note that if $U$ has a uniform distribution, then, as a consequence of (a12),
\begin{equation*} \textsf{E} \frac { \mu _ { N } ( x ) } { M } \rightarrow \frac { 1 } { x ( x + 1 ) }, \end{equation*}
which is exactly (a3). By considering a triangular array of independent (or exchangeable, as the authors have it) families of genera $M _ { i k }$ with allocations of $N _ { i k }$ species in each family $i = 1 , \ldots , k$ (the different "sticks" of MacArthur [a10]!), Hill and Woodroofe have shown that for combined statistics $\mu _ { N _ { k } } ( x ) = \sum _ { i = 1 } ^ { k } \mu _ { i N _ { i } } ( x )$ and $M _ { k } = \sum _ { i = 1 } ^ { k } M _ { i k }$ one has
\begin{equation} \tag{a13} \frac { \mu _ { N _ { k } } ( x ) } { M _ { k } } \stackrel { \mathsf{P} } { \rightarrow } \int _ { 0 } ^ { 1 } u ( 1 - u ) ^ { x - 1 } F ( d u ). \end{equation}
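The uniform-$U$ case is easy to probe numerically: since $\int _ { 0 } ^ { 1 } u ( 1 - u ) ^ { x - 1 } d u = 1 / ( x ( x + 1 ) )$, averaging $\mu _ { N } ( x ) / M$ over an approximately uniform $M / N$ should reproduce (a3). The Monte-Carlo sketch below (all sizes are arbitrary) samples a Bose–Einstein allocation as a uniform composition of $N$ into $M$ positive parts and does exactly this.

```python
import random
from collections import Counter

def bose_einstein_spectrum(N, M, rng):
    """Uniform composition of N into M positive parts (Bose-Einstein statistics):
    choose M-1 distinct cut points among the N-1 internal gaps."""
    cuts = sorted(rng.sample(range(1, N), M - 1))
    parts = [b - a for a, b in zip([0] + cuts, cuts + [N])]
    return Counter(parts)              # mu_N(x): number of genera with exactly x species

rng = random.Random(2)
N, reps = 5_000, 2_000
avg = Counter()
for _ in range(reps):
    M = max(1, int(rng.random() * N))  # M/N approximately uniform on (0, 1)
    mu = bose_einstein_spectrum(N, M, rng)
    for x in range(1, 6):
        avg[x] += mu[x] / M / reps

for x in range(1, 6):
    print(x, round(avg[x], 4), round(1.0 / (x * (x + 1)), 4))
```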
A. Rouault [a18] considered a certain Markov process as a model for the formation of a text word by word and obtained the following expression for the limit of the spectral statistics:
\begin{equation} \tag{a14} \alpha ( x ) = \frac { \beta \Gamma ( x - \beta ) } { \Gamma ( 1 - \beta ) \Gamma ( x + 1 ) }, \end{equation}
depending on a parameter $\beta$.
This Karlin–Rouault law appeared in a very different context in [a8]: Let the unit interval be divided into two parts in the ratio $a : 1 - a$. Let each of these two be again subdivided in the same ratio, etc. At step $q$ one obtains probabilities $p _ { i }$, $i = 1 , \dots , 2 ^ { q }$, of the form $a ^ { k } ( 1 - a ) ^ { q - k }$ for some $k = 0 , \dots , q$. One can think of filling in a questionnaire of $q$ "yes-no" questions at random, with probability $a$ for one of the answers to each question. Then the $p _ { i }$ are the probabilities of each particular sequence of answers, or "opinion". It was proved that if $a \neq 1 / 2$, then
\begin{equation} \tag{a15} \mu _ { n } \rightarrow \infty \quad \text { but } \frac { \mu _ { n } } { n } \rightarrow 0 \end{equation}
and that
\begin{equation} \tag{a16} \frac { \mu _ { n } ( x ) } { \mu _ { n } } \rightarrow \frac { \int _ { - \infty } ^ { \infty } \alpha ^ { s ( x + \beta ) } e ^ { - \alpha ^ { s } } d N ( s ) } { \Gamma ( x + 1 ) \int _ { - \infty } ^ { \infty } \alpha ^ { s \beta } e ^ { - \alpha ^ { s } } d N ( s ) }, \end{equation}
with $\alpha = a / ( 1 - a )$ and $\beta = \beta ( \alpha , c ) < 1$ and with $N ( s )$ being a counting measure (concentrated on integers). With $d N ( s )$ replaced by the Lebesgue measure $d s$ and using the change of variables $u = \alpha ^ { s }$, the right-hand side of (a16) will coincide with (a14).
A possible heuristic interpretation of (a15) is that if $a \neq 1 / 2$, then the saying "so many men, so many minds" is incorrect: though the number of "minds" is infinitely large, it is still asymptotically smaller than the number of "men".
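This tendency is already visible in a rough Monte-Carlo sketch of the questionnaire model (the skewed value of $a$ and the fixed, large $q$ below are arbitrary choices, so the sketch only illustrates the drift in (a15), not the limit itself): $\mu _ { n }$ keeps growing while $\mu _ { n } / n$ falls.

```python
import random

def empirical_vocabulary(n, q, a, seed=0):
    """Draw n random 'opinions' (vectors of q yes/no answers, each 'yes' with
    probability a) and return the number of distinct opinions observed."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(n):
        opinion = 0
        for j in range(q):
            if rng.random() < a:
                opinion |= 1 << j      # bit j records the answer to question j
        seen.add(opinion)
    return len(seen)

a, q = 0.02, 60                        # a != 1/2; 2**q possible opinions
for n in (10**3, 10**4, 10**5):
    mu_n = empirical_vocabulary(n, q, a)
    print(n, mu_n, round(mu_n / n, 3))
```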
One of the present-time (2000) characteristics of the study of Zipf's law is the increased interest in the more general concept of "distributions with a large number of rare events" (LNRE distributions), which was introduced and first systematically studied in [a6] (partly reproduced in [a7]).
For a systematic application of LNRE distributions to the analysis of texts, see [a1]. On the other hand, studies of specific laws continue. In a series of papers, Yu. Orlov (see, e.g., [a14], [a15], [a16]) promoted the hypothesis that the presence of Zipf's law is the characteristic property of a complete, finished and well-balanced text. Though the hypothesis looks beautiful and attractive, for some readers the evidence may look thin. The problem of non-parametric estimation of the functions $G$ and $R$ through the expressions (a4) and (a5) provides another most natural example of an "inverse problem", studied recently by C. Klaassen and R. Mnatsakanov [a9].
In conclusion, the following quote from [a12], whose views I share, is given: "The form of Zipf's law [see (a7)] is so striking and also so very different from any classical distribution of statistics that it is quite widely felt that it "should" have a basically simple reason, possibly as "weak" and general as the reasons which account for the role of the Gaussian distribution. But, in fact, these laws turn out to be quite resistant to such an analysis. Thus, irrespective of any claim as to their practical importance, the "explanation" of their role has long been one of the best defined and most conspicuous challenges to those interested in statistical laws of nature.
"
References
[a1] | R.H. Baayen, "Word frequency distribution" , Kluwer Acad. Publ. (2000) |
[a2] | J.B. Estoup, "Gammes sténographiques" , Paris (1916) (Edition: Fourth) |
[a3] | R. Gunther, B. Schapiro, P. Wagner, "Physical complexity and Zipf's law" Internat. J. Theoret. Phys. , 31 : 3 (1999) pp. 525–543 |
[a4] | B.M. Hill, "Zipf's law and prior distributions for the composition of a population" J. Amer. Statist. Assoc. , 65 (1970) pp. 1220–1232 |
[a5] | B.M. Hill, M. Woodroofe, "Stronger forms of Zipf's law" J. Amer. Statist. Assoc. , 70 (1975) pp. 212–219 |
[a6] | E.V. Khmaladze, "The statistical analysis of large number of rare events" Techn. Rept. Dept. Math. Statist. CWI, Amsterdam , MS-R8804 (1987) |
[a7] | E.V. Khmaladze, R.J. Chitashvili, "Statistical analysis of large number of rare events and related problems" Trans. Tbilisi Math. Inst. , 91 (1989) pp. 196–245 |
[a8] | E.V. Khmaladze, Z.P. Tsigroshvili, "On polynomial distributions with large number of rare events" Math. Methods Statist. , 2 : 3 (1993) pp. 240–247 |
[a9] | C.A.J. Klaassen, R.M. Mnatsakanov, "Consistent estimation of the structural distribution function" Scand. J. Statist. , 27 (2000) pp. 1–14 |
[a10] | R.H. MacArthur, "On the relative abundance of bird species" Proc. Nat. Acad. Sci. USA , 43 (1957) pp. 293–295 |
[a11] | B. Mandelbrot, "An information theory of the statistical structure of language" W.E. Jackson (ed.) , Communication Theory , Acad. Press (1953) pp. 503–512 |
[a12] | B. Mandelbrot, "A note on a class of skew distribution functions: Analysis and critique of a paper by H.A. Simon" Inform. & Control , 2 (1959) pp. 90–99 |
[a13] | B. Mandelbrot, "Final note on a class of skew distribution functions: Analysis and critique of a model due to H.A. Simon" Inform. & Control , 4 (1961) pp. 198–216; 300–304 |
[a14] | Ju.K. Orlov, "Dynamik der Häufigkeitsstrukturen" H. Guiter (ed.) M.V. Arapov (ed.) , Studies on Zipf's law , Brockmeyer, Bochum (1983) pp. 116–153 |
[a15] | Ju.K. Orlov, "Ein Modell der Häufigkeitsstruktur des Vokabulars" H. Guiter (ed.) M.V. Arapov (ed.) , Studies on Zipf's law , Brockmeyer, Bochum (1983) pp. 154–233 |
[a16] | Ju.K. Orlov, R.Y. Chitashvili, "On the statistical interpretation of Zipf's law" Bull. Acad. Sci. Georgia , 109 (1983) pp. 505–508 |
[a17] | V. Pareto, "Cours d'économie politique" , Lausanne and Paris (1897) |
[a18] | A. Rouault, "Loi de Zipf et sources markoviennes" Ann. Inst. H. Poincaré , 14 (1978) pp. 169–188 |
[a19] | H.A. Simon, "On a class of skew distribution functions" Biometrika , 42 (1955) pp. 435–440 |
[a20] | H.A. Simon, "Some further notes on a class of skew distribution functions" Inform. & Control , 3 (1960) pp. 80–88 |
[a21] | H.A. Simon, "Reply to `final note' by Benoit Mandelbrot" Inform. & Control , 4 (1961) pp. 217–223 |
[a22] | J.C. Willis, "Age and area" , Cambridge Univ. Press (1922) |
[a23] | M. Woodroofe, B.M. Hill, "On Zipf's law" J. Appl. Probab. , 12 : 3 (1975) pp. 425–434 |
[a24] | G.U. Yule, "A mathematical theory of evolution, based on the conclusions of Dr. J.C. Willis, F.R.S." Philos. Trans. Royal Soc. London Ser. B , 213 (1924) pp. 21–87 |
[a25] | G.K. Zipf, "Human behaviour and the principles of least effort" , Addison-Wesley (1949) |
[a26] | L.I. Davis, Aud. Field Notes , 9 (1955) pp. 425 |
[a27] | M.L. Hanley, "Word index to James Joyce's Ulysses" , Madison, Wisc. (1937) (statistical tabulations by M. Joos) |