##### Adaptive sampling

Adaptive sampling [a1] is a probabilistic algorithm invented by M. Wegman (unpublished) around 1980. It provides an unbiased estimator of the number of distinct elements (the "cardinality") of a file (a sequence of data items) of potentially large size that contains unpredictable replications. The algorithm is useful in database query optimization and in information retrieval. By standard hashing techniques [a3], [a6] the problem reduces to the following.

A sequence $(h_1, \dots, h_n)$ of real numbers is given. The sequence has been formed by drawing an unknown number $N$ of real numbers independently and uniformly at random from $[0,1]$, after which the elements are replicated and permuted in some unknown fashion. The problem is to estimate the cardinality $N$ in a computationally efficient manner.

Three algorithms can perform this task; a minimal code sketch of all three follows the list.

1) Straight scan computes incrementally the sets $U_j = \{h_1, \dots, h_j\}$, where replications are eliminated on the fly. (This can be achieved by keeping the successive $U_j$ in sorted order.) The cardinality is then determined exactly as $N = \operatorname{card}(U_n)$, but the auxiliary memory needed is $N$, which may be as large as $n$, resulting in a complexity that is prohibitive in many applications.

2) Static sampling is based on a fixed sampling ratio $p$, where $0 < p \leq 1$ (e.g., $p = 1/100$). One computes sequentially the samples $U_j^* = \{h_1, \dots, h_j\} \cap [0,p]$. The cardinality estimate returned is $N^* = \operatorname{card}(U_n^*)/p$. The estimator $N^*$ is unbiased and the memory used is $Np$ on average.

3) Adaptive sampling is based on a design parameter $b \geq 2$ (e.g., $b = 100$) and maintains a dynamically changing sampling rate $p$ together with a sequence of samples $U_j^{**}$. Initially, $p = 1$ and $U_0^{**} = \emptyset$. The rule is like that of static sampling, but with $p$ divided by $2$ each time the cardinality of $U_j^{**}$ would exceed $b$, and with $U_j^{**}$ modified accordingly so as to contain only $U_j \cap [0,p]$. The estimator $N^{**} = \operatorname{card}(U_n^{**})/p$ (where the final value of $p$ is used) is provably unbiased, and the memory used is at most $b$.
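
The following is a minimal Python sketch of the three estimators, assuming the hashed values arrive as an iterable of floats in $[0,1)$; the function names and default parameters are illustrative choices, not part of the source.

```python
import random


def straight_scan(values):
    """1) Exact cardinality: keep every distinct value (memory of order N)."""
    return len(set(values))


def static_sample(values, p=0.01):
    """2) Static sampling with fixed ratio p: unbiased, memory about N*p."""
    return len({h for h in values if h <= p}) / p


def adaptive_sample(values, b=100):
    """3) Adaptive sampling: unbiased, memory bounded by b."""
    p = 1.0
    sample = set()
    for h in values:
        if h <= p:
            sample.add(h)
            # If the sample would exceed its budget b, halve the sampling
            # rate and discard values falling outside the new interval [0, p].
            while len(sample) > b:
                p /= 2
                sample = {x for x in sample if x <= p}
    return len(sample) / p


# Example: 10,000 distinct values, each replicated a few times and shuffled.
random.seed(42)
distinct = [random.random() for _ in range(10_000)]
stream = [h for h in distinct for _ in range(random.randint(1, 5))]
random.shuffle(stream)
print(adaptive_sample(stream, b=100))  # typically within ~12% of 10,000
```

Note that the `while` loop, rather than a single halving, covers the (unlikely) case where more than $b$ of the retained values already lie below $p/2$.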

The accuracy of any such unbiased estimator $\widetilde{N}$ of $N$ is measured by the standard deviation of $\widetilde{N}$ divided by $N$. For adaptive sampling, the accuracy is almost constant as a function of $N$ and asymptotically close to

$$\frac{1.20}{\sqrt{b}},$$

a result established in [a1] by generating functions and Mellin transform techniques. An alternative algorithm, called probabilistic counting [a2], provides an estimator $N^{***}$ of cardinalities that is unbiased only asymptotically but has a better accuracy, of about $0.78/\sqrt{b}$.
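
As a quick empirical check of the $1.20/\sqrt{b}$ figure, one can measure the relative standard deviation of $N^{**}$ over repeated trials; the snippet below reuses the `adaptive_sample` sketch above, and the trial count and parameters are arbitrary choices.

```python
import random
import statistics

# Requires adaptive_sample from the sketch above. With b = 100 the
# predicted accuracy is about 1.20 / sqrt(100) = 0.12; replications are
# omitted since the estimator depends only on the distinct values.
N, b, trials = 10_000, 100, 200
estimates = [
    adaptive_sample([random.random() for _ in range(N)], b)
    for _ in range(trials)
]
print(statistics.pstdev(estimates) / N)  # roughly 0.12
```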

A typical application of the adaptive sampling algorithm is gathering statistics on word usage in a large text. Well-designed hashing transformations are then known to fulfill the uniformity assumption in practice [a4]. A general perspective on probabilistic algorithms may be found in [a5].
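
As an illustration of the hashing reduction, a well-mixed hash such as SHA-256 can map words to reals that behave as if drawn uniformly from $[0,1)$; the helper below reuses `adaptive_sample` from the earlier sketch, and its name is a hypothetical choice, not from the source.

```python
import hashlib


def word_to_real(word: str) -> float:
    # Interpret the first 8 bytes of the digest as a fraction in [0, 1).
    # Identical words map to identical reals, so replications survive,
    # while distinct words spread out nearly uniformly.
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


words = "the quick brown fox jumps over the lazy dog".split()
print(adaptive_sample([word_to_real(w) for w in words], b=100))  # 8.0
```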
