# Sample method

A statistical method for the study of the general properties of a certain population of objects by studying the properties of only a sample (a part) of these objects. The mathematical theory of sample methods is based on two important sections of mathematical statistics — the theory of sampling from a finite population and the theory of sampling from an infinite population. The fundamental difference between the sampling theory for finite and infinite populations consists in the fact that in the former case the theory is usually applied to objects of a non-random, determined nature (for example, the number of defective articles in a given industrial batch of products is not a random variable: it is an unknown constant which must be estimated from the sampling data). In the latter case the theory is usually employed to study the properties of random objects (for example, to study the properties of continuously-distributed random experimental errors, each one of which may be interpreted, in principle, as the realization of one out of an infinite set of possible results).

Samples from finite populations and their theory form the basis of statistical quality control methods and are often employed in sociological studies. According to probability theory, the sample will correctly reproduce the properties of the population as a whole if the sampling is conducted at random, i.e. so that each of the possible samples of a given size $n$ out of a population of size $N$ (the number of such samples is $N!/(n!(N-n)!)$) has an equal chance of actually being selected.
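The count of possible samples and the equal-chance selection can be illustrated numerically; a minimal sketch in Python (the population and sample sizes below are arbitrary illustrative values, not taken from the article):

```python
import math
import random

# Number of distinct samples of size n from a population of size N:
# the binomial coefficient C(N, n) = N! / (n! * (N - n)!).
N, n = 10, 3
num_samples = math.comb(N, n)

# Simple random sampling: every one of the C(N, n) possible subsets
# is equally likely to be the sample actually drawn.
population = list(range(N))
sample = random.sample(population, n)
```

Here `random.sample` implements exactly the "each subset equally likely" requirement for sampling without replacement.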

The method most often used in practice is sampling without replacement, in which an item already chosen is not returned to the population under study before the next item of the sample is drawn (for example, in drawing winning lottery tickets, in statistical quality control and in long-lasting demographic investigations). Sampling with replacement is usually employed in theoretical studies only (an example of this technique is the recording of the number of particles colliding with the container walls during a given period of time in the study of Brownian motion). If $n \ll N$, the two techniques give practically equivalent results.
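The practical equivalence of the two schemes for $n \ll N$ can be seen in a short sketch (the sizes chosen here are arbitrary illustrations):

```python
import random

random.seed(0)
N, n = 100_000, 30
population = range(N)

# Without replacement: no item can occur twice in the sample.
without = random.sample(population, n)

# With replacement: repeats are possible in principle, but for
# n << N the chance of any repeat is roughly n*(n-1)/(2*N),
# here about 0.4%, so the two schemes behave almost identically.
with_repl = random.choices(population, k=n)
```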

The properties of the populations studied by the sample method may be qualitative or quantitative. In the former case the task of investigating the sample consists in finding the number $M$ of items in the population having a certain characteristic (e.g. in statistical quality control the parameter of interest is often the number $M$ of defective items in a batch of $N$ items). $M$ is estimated by the ratio $mN/n$, where $m$ is the number of items displaying the characteristic under study in a sample of size $n$. In the case of a quantitative characteristic the task consists in determining the mean value $\overline{x} = (x_1 + \dots + x_N)/N$ of the population. The value $\overline{x}$ is estimated by means of the sample average

$$\overline{X} = \frac{X_1 + \dots + X_n}{n},$$

where $X_1, \dots, X_n$ are those of the quantities $x_1, \dots, x_N$ under study which belong to the sample. From the mathematical point of view, the former situation is a special case of the latter, which occurs if $M$ of the variables $x_i$ are equal to one while the remaining $N - M$ are zero; in this situation $\overline{x} = M/N$ and $\overline{X} = m/n$.
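Both estimators, and the 0/1 special case relating them, can be checked with a small simulation (batch size, defect count and sample size are hypothetical illustrative values):

```python
import random

random.seed(1)

# Hypothetical batch: N = 1000 items, M = 80 of them defective.
# Defective items are coded as 1, non-defective as 0.
N, M, n = 1000, 80, 100
batch = [1] * M + [0] * (N - M)
random.shuffle(batch)

sample = random.sample(batch, n)
m = sum(sample)            # defectives observed in the sample

M_hat = m * N / n          # qualitative case: estimate of M
x_bar = sum(batch) / N     # population mean, here M/N
X_bar = sum(sample) / n    # sample average, here m/n
```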

In the mathematical theory of sample methods, estimating the mean value is the key operation, since this value forms the basis of a quantitative description of the variability of the characteristic within the population; in fact, this variability is usually defined as the variance

$$\sigma^2 = \frac{(x_1 - \overline{x})^2 + \dots + (x_N - \overline{x})^2}{N},$$

which is the average of the squares of the deviations of the $x_i$ from the mean value $\overline{x}$. If a qualitative characteristic is studied, then

$$\sigma ^ {2} = \frac{M ( N - M) }{N ^ {2} } .$$
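That this closed form agrees with the general definition of the variance for a 0/1 population can be verified directly (the values of $N$ and $M$ below are arbitrary):

```python
# For a population of M ones and (N - M) zeros, the general variance
# sum((x_i - x_bar)^2) / N should equal the closed form M*(N - M)/N^2.
N, M = 50, 12
xs = [1] * M + [0] * (N - M)
x_bar = sum(xs) / N
direct = sum((x - x_bar) ** 2 for x in xs) / N
formula = M * (N - M) / N ** 2
```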

The accuracy of the estimators $m/n$ and $\overline{X}$ is determined by their variances

$$\sigma_{m/n}^2 = {\mathsf E}\left( \frac{m}{n} - \frac{M}{N} \right)^2 \quad \textrm{and} \quad \sigma_{\overline{X}}^2 = {\mathsf E}(\overline{X} - \overline{x})^2,$$

which are expressed, in terms of the variance $\sigma^2$ of the finite population, as the ratios $\sigma^2/n$ (in the case of sampling with replacement) and $\sigma^2(N-n)/(n(N-1))$ (in the case of sampling without replacement). Since in many problems of practical interest the random variables $m/n$ and $\overline{X}$ approximately follow a normal distribution if $n \geq 30$, deviations of $m/n$ from $M/N$ and of $\overline{X}$ from $\overline{x}$ with absolute values larger than $2\sigma_{m/n}$ and $2\sigma_{\overline{X}}$, respectively, occur, on average, in about one case in twenty.
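The without-replacement variance formula can be checked by Monte Carlo simulation; the population below is an arbitrary fixed set of values, not a random object:

```python
import random

random.seed(2)

# Monte Carlo check: for simple random sampling without replacement,
# Var(X_bar) should equal sigma^2 * (N - n) / (n * (N - 1)).
N, n, trials = 200, 30, 20_000
xs = [random.gauss(0.0, 1.0) for _ in range(N)]  # fixed finite population
x_bar = sum(xs) / N
sigma2 = sum((x - x_bar) ** 2 for x in xs) / N

theory = sigma2 * (N - n) / (n * (N - 1))
means = [sum(random.sample(xs, n)) / n for _ in range(trials)]
empirical = sum((m - x_bar) ** 2 for m in means) / trials
```

With 20,000 trials the empirical variance agrees with the formula to within a few percent of sampling noise.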

More complete information about the distribution of a quantitative characteristic in a given population may be obtained from the empirical distribution of this characteristic in the sample.

### Sampling from an infinite population.

It is usual in mathematical statistics to describe as a sample the results of homogeneous observations (mostly independent ones), even though this differs from the concept of a sample from a finite population with or without replacement. Thus, measurements of angles, which involve continuously-distributed random errors, would be described as a sample from an infinite population. It is assumed that any desired number of such observations can, in principle, be carried out. The results obtained form a so-called sample from an infinite set of possible results, which is called the general aggregate. The concept of a general aggregate is neither logically unobjectionable nor indispensable: in solving practical problems one needs not the infinite general aggregate itself, but only certain characteristics corresponding to it. From the point of view of probability theory, these characteristics are numerical or functional properties of a certain probability distribution, while the sample items are random variables subject to this distribution. Such an interpretation makes it possible to apply the general theory of statistical estimation to sample estimates. This is why, for example, in the probability-theoretic processing of observations, the concept of an infinite general aggregate is replaced by the concept of a probability distribution involving unknown parameters. The results of the observations are treated as experimentally obtained values of random variables subject to this distribution. The objective of the processing is to compute from the observations statistical estimators that are optimal (in some sense) for the unknown parameters of the distribution.
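This viewpoint can be sketched concretely: the observations are modeled as draws from a distribution with unknown parameters, and standard estimators recover those parameters (the normal law and its parameter values below are hypothetical choices for illustration):

```python
import random
import statistics

random.seed(3)

# "Sample from an infinite population": n independent observations
# modeled as draws from a distribution with unknown parameters;
# here a normal law with (nominally unknown) mu = 5, sigma = 2.
mu, sigma, n = 5.0, 2.0, 1000
obs = [random.gauss(mu, sigma) for _ in range(n)]

# Standard statistical estimators of the distribution parameters:
mu_hat = statistics.fmean(obs)     # sample mean estimates mu
sigma_hat = statistics.stdev(obs)  # sample standard deviation estimates sigma
```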

So far, the concern has been with sampling from a single population of objects. In practice, however, sampling is often performed from several similar populations (e.g. in estimating the fraction of defective articles in several batches of finished industrial products). In such a situation the object of study is no longer a single number $M$, but several unknown numbers $M_1, M_2, \dots$. For instance, let each batch of the finished product contain $N$ articles, let $M_1, M_2, \dots$ be the numbers of defective articles in these batches, and let $m_1, m_2, \dots$ be the corresponding numbers of defective articles found in samples of size $n$. If the so-called principle of defect-free acceptance is adopted, the $i$-th batch is delivered to the customer if $m_i = 0$ and is rejected otherwise. If it is assumed that the control of the articles involves their destruction, then the customer obtains either a batch of size $R_i = 0$ (if $m_i > 0$) or a batch of size $R_i = N - n$ containing $D_i = M_i$ (if $m_i = 0$) defective articles, the values of $R_1, R_2, \dots$ (and thus their sum as well) being known, while the value of $D_1 + D_2 + \dots$ is not. The ratio $(D_1 + D_2 + \dots)/(R_1 + R_2 + \dots)$ is known as the fraction of passed defectives, and its mathematical expectation $q$ is known as the average fraction of passed defectives. The task of mathematical statistics is to estimate $q$ from the values of $R_1, R_2, \dots$, which are determined using the sample method. If the values $M_1, M_2, \dots$ may be treated as realizations of independent identically-distributed random variables with a known distribution law ${\mathsf P}\{M_i = r\} = p_r$, then, by the Bayes formula, a statistical estimator of the average number of passed defective articles in the accepted batches is given by

$$\widetilde{D} = {\mathsf E}\{M \mid m = 0\} = \frac{\sum_{r=1}^{N-n} r \frac{C_{N-r}^n}{C_N^n} p_r}{{\mathsf P}\{m = 0\}},$$

and

$$\widetilde{D} \leq \frac{( N - n ) {\mathsf P} \{ m = 1 \} }{n {\mathsf P} \{ m= 0 \} } ,$$

where

$${\mathsf P}\{m = k\} = \sum_{r=0}^{N-n} \frac{C_r^k C_{N-r}^{n-k}}{C_N^n} p_r, \quad k = 0, \dots, n.$$

For this reason the estimator

$$\widetilde{q} = \frac{\widetilde{D} }{( N - n) }$$

of the average fraction of passed defectives in the accepted batches satisfies the inequality

$$\widetilde{q} \leq \frac{ {\mathsf P} \{ m = 1 \} }{n {\mathsf P} \{ m= 0 \} } \approx \frac{s _ {1} }{ns _ {0} } ,$$

where $s_0$ is the number of accepted batches, while $s_1$ is the number of rejected batches whose samples yielded exactly one defective article.
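These formulas can be evaluated numerically for an assumed prior law $p_r$; the binomial prior and the values of $N$, $n$ below are hypothetical choices, not part of the article:

```python
from math import comb

# Assumed (hypothetical) prior on the number of defectives per batch:
# M_i ~ Binomial(N, 0.05), i.e. p_r = C(N, r) * 0.05^r * 0.95^(N - r).
N, n, theta = 50, 10, 0.05
p = [comb(N, r) * theta**r * (1 - theta) ** (N - r) for r in range(N + 1)]

def P_m(k):
    """P{m = k}: mixture of hypergeometric probabilities over the prior p_r."""
    return sum(comb(r, k) * comb(N - r, n - k) * p[r]
               for r in range(N - n + 1)) / comb(N, n)

# Bayes estimator of the average number of passed defectives in the
# accepted batches, its upper bound, and the passed-defectives fraction.
D_tilde = sum(r * comb(N - r, n) * p[r]
              for r in range(1, N - n + 1)) / comb(N, n) / P_m(0)
bound = (N - n) * P_m(1) / (n * P_m(0))
q_tilde = D_tilde / (N - n)
```

The inequality $\widetilde{D} \leq (N-n){\mathsf P}\{m=1\}/(n{\mathsf P}\{m=0\})$ holds term by term, since $C_{N-r}^n / C_{N-r}^{n-1} = (N-r-n+1)/n \leq (N-n)/n$ for every $r \geq 1$.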
