How to compose a discrete distribution series. Discrete statistical series

When processing large amounts of information, which is especially relevant in modern scientific research, the researcher faces the serious task of grouping the initial data correctly. If the data are discrete, then, as we have seen, no problems arise: you simply count the frequency of each value of the feature. If the feature under study is continuous (which is more common in practice), then choosing the optimal number of grouping intervals is by no means a trivial task.

To group continuous random variables, the entire range of variation of the feature is split into a certain number of intervals k.

A grouped, or interval (continuous), variation series is a set of intervals ranked by the value of the feature, each given together with its frequency m_i (the number of observations that fall into the i-th interval) or its relative frequency:

Intervals of feature values

Frequencies m_i
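As a rough illustration of how such an interval series could be built, here is a minimal Python sketch. It assumes equal-width, half-open intervals, with the sample maximum placed in the last interval. The function name interval_series, the sample values and the choice k = 4 are illustrative and do not come from the text.

```python
def interval_series(data, k):
    """Group data into k equal-width intervals.

    Returns (edges, freqs, rel_freqs): the k + 1 interval boundaries,
    the frequency of each interval and its relative frequency.
    """
    lo, hi = min(data), max(data)
    if hi == lo:
        raise ValueError("all values are equal; the range cannot be split")
    width = (hi - lo) / k
    edges = [lo + i * width for i in range(k + 1)]
    freqs = [0] * k
    for x in data:
        # Half-open intervals [left, right); the maximum goes into the last one.
        i = min(int((x - lo) / width), k - 1)
        freqs[i] += 1
    n = len(data)
    rel_freqs = [m / n for m in freqs]
    return edges, freqs, rel_freqs


# Made-up sample, used only to show the layout of the resulting series.
sample = [1.0, 1.4, 1.9, 2.2, 2.5, 2.7, 3.1, 3.3, 3.6, 3.8, 4.4, 5.0]
edges, freqs, rel_freqs = interval_series(sample, k=4)
for i, (m, w) in enumerate(zip(freqs, rel_freqs)):
    print(f"[{edges[i]:.1f}, {edges[i + 1]:.1f})  m_i = {m}  relative = {w:.3f}")
```

The frequencies always sum to the sample size and the relative frequencies to one, which is a convenient check on any grouping.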

The histogram and the cumulative curve (ogive), which we have already discussed in detail, are excellent data visualization tools that give a first idea of the structure of the data. Such graphs (Fig. 1.15) are constructed for continuous data in the same way as for discrete data, with one difference: continuous data completely fill the range of their possible values and can take any value in it.

Fig. 1.15.

Therefore, the bars of the histogram and of the cumulative plot must touch one another: there should be no regions within the range of possible values into which the feature values cannot fall (i.e., the histogram and the cumulative plot should not have "holes" along the abscissa that exclude values of the variable under study, as in Fig. 1.16). The height of a bar corresponds to the frequency, that is, the number of observations that fall into the given interval, or to the relative frequency, that is, the proportion of observations. The intervals must not overlap and usually have the same width.

Fig. 1.16.

The histogram and the polygon are approximations of the probability density curve (differential function) f(x) of the theoretical distribution, which is studied in the course on probability theory. This is why constructing them is so important in the primary statistical processing of quantitative continuous data: from their shape one can judge the hypothetical distribution law.

The cumulative curve is the curve of the accumulated frequencies (or accumulated relative frequencies) of an interval variation series. It is compared with the graph of the cumulative distribution function F(x), also studied in the course on probability theory.
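The accumulated frequencies behind such a curve are simply running sums of the interval frequencies. A minimal sketch, assuming a hypothetical list of interval frequencies (the values and the helper name cumulative are illustrative only):

```python
from itertools import accumulate


def cumulative(freqs):
    """Return accumulated frequencies and accumulated relative frequencies."""
    n = sum(freqs)
    cum = list(accumulate(freqs))       # running totals of the frequencies
    cum_rel = [c / n for c in cum]      # running shares; the last one equals 1
    return cum, cum_rel


# Hypothetical frequencies of four adjacent intervals.
cum, cum_rel = cumulative([3, 3, 4, 2])
print(cum)      # [3, 6, 10, 12]
print(cum_rel)
```

Plotting the accumulated relative frequencies against the right-hand boundaries of the intervals gives the empirical counterpart of F(x).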

In general, histograms and cumulative curves are associated with continuous data and their interval variation series, since these graphs are empirical estimates of the probability density function and of the distribution function, respectively.

The construction of an interval variation series begins with determining the number of intervals k. This is perhaps the most difficult, important and controversial part of the problem under study.

The number of intervals should not be too small, because then the histogram is too smooth (oversmoothed) and loses all the features of the variability of the initial data: in Fig. 1.17 (left graph), the same data that were used for the graphs in Fig. 1.15 are used to build a histogram with a smaller number of intervals.

At the same time, the number of intervals should not be too large, otherwise we cannot estimate the distribution density of the data along the number axis: the histogram turns out insufficiently smooth (undersmoothed) and uneven, with empty intervals (see Fig. 1.17, right graph).

Fig. 1.17.
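The two extremes can also be seen numerically. The sketch below is only an illustration: the data are synthetic (100 normally distributed values from a seeded generator), and the helper bin_counts and the values of k are assumptions rather than anything taken from the figures.

```python
import random


def bin_counts(data, k):
    """Frequencies of data in k equal-width intervals over [min, max]."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    counts = [0] * k
    for x in data:
        # The maximum would land at index k, so clamp it into the last interval.
        counts[min(int((x - lo) / width), k - 1)] += 1
    return counts


random.seed(0)  # fixed seed so that the run is repeatable
data = [random.gauss(0.0, 1.0) for _ in range(100)]  # synthetic sample

for k in (3, 8, 200):
    counts = bin_counts(data, k)
    print(f"k = {k:3d}: empty intervals = {counts.count(0)}, tallest bar = {max(counts)}")
```

With very few intervals almost all detail is lost in a handful of tall bars; with k larger than the sample size, at least k - n intervals must stay empty, which is the "unfilled" histogram described above.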

How do you determine the most preferred number of intervals?

Back in 1926, Herbert Sturges proposed a formula for the number of intervals into which the original set of values of the feature under study should be split. The formula has become extremely popular: most statistics textbooks give it, and many statistical packages use it by default. Whether this is justified, and whether it is justified in all cases, is a very serious question.
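In its commonly quoted form, Sturges' rule gives the number of intervals for a sample of size n as k = 1 + log2(n), which is approximately 1 + 3.322 lg(n). A minimal sketch follows; rounding up to an integer is one common convention and is an assumption here, as is the helper name sturges_k.

```python
import math


def sturges_k(n):
    """Number of intervals by Sturges' rule, rounded up (assumed convention)."""
    if n < 1:
        raise ValueError("sample size must be positive")
    return math.ceil(1 + math.log2(n))


for n in (64, 100, 1000):
    print(n, sturges_k(n))  # 64 -> 7, 100 -> 8, 1000 -> 11
```

Note how slowly k grows: increasing the sample from 100 to 1000 observations adds only three intervals.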

So what is the Sturges formula based on?

Consider the binomial distribution.