How to construct a discrete distribution series. Discrete statistical series

When processing large volumes of information, which is especially relevant in modern scientific research, the researcher faces the important task of grouping the source data correctly. If the data are discrete, then, as we have seen, no problems arise: it is enough to count the frequency of each value of the variable. If the variable under study is continuous (which is more common in practice), choosing the optimal number of grouping intervals is by no means a trivial task.
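For the discrete case, counting frequencies can be sketched in a few lines of Python. The function name `discrete_series` and the sample values are hypothetical and serve only as an illustration.

```python
from collections import Counter

def discrete_series(values):
    """Return sorted (value, frequency, relative frequency) triples."""
    counts = Counter(values)
    n = len(values)
    return [(v, m, m / n) for v, m in sorted(counts.items())]

# Hypothetical sample: number of defects found in 10 inspected items.
sample = [0, 1, 1, 2, 0, 1, 3, 1, 0, 2]
for value, freq, rel in discrete_series(sample):
    print(value, freq, rel)
```

No choice of intervals is involved here: each distinct value forms its own group.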

To group a continuous random variable, the range of variation is divided into some number of intervals k.

A grouped interval (continuous) variation series is the set of ranked intervals of the variable's values, listed together with the corresponding frequencies m_i (the number of observations that fall into the i-th interval) or the relative frequencies:

Intervals of the variable's values

Frequency m_i
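A minimal sketch of how such a series could be built, assuming k equal-width intervals that are closed on the left and open on the right (the last one also includes the maximum). The function name `interval_series` and the sample data are hypothetical.

```python
def interval_series(data, k):
    """Group data into k equal-width, non-overlapping intervals.

    Returns a list of (left, right, frequency, relative_frequency) tuples.
    Each interval is [left, right), except the last, which also includes
    the maximum value.
    """
    lo, hi = min(data), max(data)
    if hi == lo:
        raise ValueError("all values are equal; intervals have zero width")
    width = (hi - lo) / k
    freqs = [0] * k
    for x in data:
        # Clamp the index so that x == hi falls into the last interval.
        i = min(int((x - lo) / width), k - 1)
        freqs[i] += 1
    n = len(data)
    return [(lo + i * width, lo + (i + 1) * width, m, m / n)
            for i, m in enumerate(freqs)]

# Hypothetical measurements of a continuous variable.
data = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0]
for left, right, m, w in interval_series(data, 4):
    print(f"[{left:.1f}, {right:.1f})  m = {m}  w = {w:.3f}")
```

Because neighbouring intervals share an endpoint, they cover the whole range of variation without gaps or overlaps.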

The histogram and the cumulative curve (ogive), which we have already discussed, are an excellent means of visualizing data and give a first impression of the data structure. Such graphs (Fig. 1.15) are built for continuous data in the same way as for discrete data, with the one difference that continuous data completely fill the range of their possible values, taking any value within it.

Fig. 1.15.

Therefore the columns of the histogram and of the cumulative curve must touch one another, leaving no sections that values of the variable cannot fall into (i.e., the histogram and the cumulative curve should not have "holes" along the abscissa axis where values of the variable under study cannot occur, as in Fig. 1.16). The height of a column corresponds to the frequency of observations in that interval, or to the relative frequency. The intervals must not overlap, and they usually have the same width.

Fig. 1.16.

The histogram and the polygon are approximations of the probability density curve (differential function) f(x) of the theoretical distribution considered in the course of probability theory. That is why constructing them is so important in the primary statistical processing of quantitative continuous data: their shape suggests the hypothetical distribution law.

The cumulative curve is the curve of accumulated frequencies (or relative frequencies) of the interval variation series. It is compared with the graph of the cumulative (integral) distribution function F(x), also considered in the course of probability theory.

The concepts of histogram and cumulative curve are associated mainly with continuous data and their interval variation series, since their graphs are empirical estimates of the probability density function and the distribution function, respectively.
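The accumulated frequencies that the cumulative curve is drawn from can be obtained with a running sum over the interval frequencies. The sketch below uses hypothetical frequencies and a hypothetical function name, `cumulative_frequencies`.

```python
from itertools import accumulate

def cumulative_frequencies(freqs):
    """Return accumulated frequencies and accumulated relative frequencies."""
    cum = list(accumulate(freqs))
    n = cum[-1]  # total number of observations
    return cum, [c / n for c in cum]

# Hypothetical frequencies of five consecutive intervals.
freqs = [2, 5, 9, 3, 1]
cum, cum_rel = cumulative_frequencies(freqs)
print(cum)      # accumulated frequencies
print(cum_rel)  # empirical estimate of F(x) at the right end of each interval
```

The accumulated relative frequencies never decrease and reach 1 at the last interval, which is the behaviour expected of an estimate of F(x).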

Constructing an interval variation series begins with determining the number of intervals k. This is perhaps the most difficult, important, and ambiguous part of the whole problem.

The number of intervals should not be too small, because the histogram then comes out oversmoothed and loses all the features of the variability of the original data. Fig. 1.17 shows the same data used for the graphs in Fig. 1.15, plotted as a histogram with a smaller number of intervals (left graph).

At the same time, the number of intervals should not be too large, otherwise we cannot estimate the distribution density of the data along the number line: the histogram comes out undersmoothed, with empty intervals and a ragged outline (see Fig. 1.17, right graph).

Fig. 1.17.
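The effect of the number of intervals can also be seen numerically. The sketch below, with a hypothetical simulated sample and hypothetical helper names, counts the empty intervals for several values of k; it does not use the data behind the figures.

```python
import random

def bin_counts(data, k):
    """Frequencies of data in k equal-width intervals over [min, max]."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    counts = [0] * k
    for x in data:
        # Clamp so that the maximum falls into the last interval.
        counts[min(int((x - lo) / width), k - 1)] += 1
    return counts

def empty_bins(data, k):
    """Number of intervals that received no observations."""
    return sum(1 for c in bin_counts(data, k) if c == 0)

# Hypothetical sample: 50 values drawn from a normal distribution.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(50)]

for k in (3, 8, 200):
    print(k, "intervals:", empty_bins(data, k), "empty")
```

With k far larger than the sample size most intervals are necessarily empty (at least k minus n of them), which is the undersmoothed picture; with very small k every column pools many observations and the shape of the distribution is lost.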

How to determine the most preferred number of intervals?

Back in 1926, Herbert Sturges proposed a formula for calculating the number of intervals into which the initial set of values of the studied variable should be divided. This formula became extremely popular: most statistics textbooks recommend it, and many statistical packages use it by default. How justified this is, and whether it holds in all cases, is a very serious question.
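Sturges' rule is commonly written as k = 1 + log2(n), where n is the number of observations. A minimal sketch follows; rounding the result up to an integer is an assumption made here (rounding conventions differ between sources), and the function name `sturges_k` is hypothetical.

```python
import math

def sturges_k(n):
    """Number of intervals by Sturges' rule, k = 1 + log2(n), rounded up.

    Rounding up is an assumption of this sketch; some sources round
    to the nearest integer instead.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return math.ceil(1 + math.log2(n))

for n in (64, 100, 1000):
    print(n, sturges_k(n))
```

Note how slowly k grows: increasing the sample from 100 to 1000 observations adds only three intervals.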

So, what is Sturges' formula based on?

Consider the binomial distribution)