Algorithm for constructing an interval variation series with equal intervals. Distribution series

If the random variable under study is continuous, ranking and grouping the observed values often fails to reveal the characteristic features of how its values vary. This is because individual values of the random variable can differ from one another by an arbitrarily small amount, so identical values rarely occur in the observed data and the frequencies of the variants differ little from one another.

It is also impractical to construct a discrete series for a discrete random variable that has a large number of possible values. In such cases, an interval variation series of the distribution should be constructed.

To construct such a series, the entire range of variation of the observed values of the random variable is divided into a number of partial intervals, and the frequency with which the values fall into each partial interval is counted.

An interval variation series is an ordered set of intervals of variation of the values of a random variable, together with the corresponding frequencies or relative frequencies with which the values fall into each of them.

To build an interval series, you must:

determine the number of partial intervals;
determine the width of the intervals;
set the lower and upper boundaries of each interval;
group the observation results.

1. The number and width of the grouping intervals have to be chosen in each specific case based on the goals of the research, the sample size, and the degree of variation of the feature in the sample.

The approximate number of intervals k can be estimated from the sample size n alone in one of the following ways:

by Sturges' formula: $k = 1 + 3.32 \lg n$, where lg is the base-10 logarithm;
using Table 1.

Table 1
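As a rough illustration of Sturges' formula, the following Python sketch computes k for a few sample sizes. The function name `sturges_intervals` is hypothetical, and rounding to the nearest whole number is an assumption: the text does not say how a fractional k should be rounded.

```python
import math


def sturges_intervals(n: int) -> int:
    """Approximate number of intervals k for a sample of size n."""
    if n < 1:
        raise ValueError("sample size must be positive")
    # Sturges' formula: k = 1 + 3.32 * lg(n), with lg the base-10 logarithm.
    k = 1 + 3.32 * math.log10(n)
    # Assumption: round to the nearest whole number of intervals.
    return round(k)


for n in (50, 100, 1000):
    print(n, sturges_intervals(n))
```

For n = 100 the formula gives 1 + 3.32 · 2 = 7.64, which this sketch rounds to 8 intervals.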

2. Intervals of equal width are generally preferred. To determine the interval width h, calculate:

the range of variation R of the sample values: $R = x_{\max} - x_{\min}$,

where $x_{\max}$ and $x_{\min}$ are the maximum and minimum variants in the sample;

the width h of each interval, by the formula $h = R / k$.
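The two quantities can be computed directly from a sample. The sketch below is only an illustration: the function name `interval_width` and the sample values are made up for the example.

```python
def interval_width(sample, k):
    """Return (R, h): the range of variation and the common interval width."""
    if k < 1:
        raise ValueError("the number of intervals must be at least 1")
    x_min, x_max = min(sample), max(sample)
    r = x_max - x_min  # range of variation R = x_max - x_min
    h = r / k          # interval width h = R / k
    return r, h


# Hypothetical sample of 10 observations, grouped with k = 4.
sample = [12, 15, 11, 19, 14, 17, 13, 16, 18, 11]
r, h = interval_width(sample, k=4)
print(r, h)  # 8 2.0
```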

3. The lower boundary of the first interval, $x_{h,1}$, is chosen so that the minimum sample variant $x_{\min}$ falls approximately in the middle of this interval: $x_{h,1} = x_{\min} - 0.5h$.

The subsequent boundaries are obtained by adding the partial-interval length h to the boundary of the previous interval:

$x_{h,i} = x_{h,i-1} + h$.

The scale of intervals is extended in this way as long as the boundary $x_{h,i}$ satisfies the inequality

$x_{h,i} < x_{\max} + 0.5h$.
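One possible reading of this rule is sketched below in Python. The function name `interval_boundaries` is hypothetical. The sketch assumes that a new boundary is added while the last one is still below x_max + 0.5h, so the final boundary closes the last interval. Each boundary is computed by multiplication rather than repeated addition to limit accumulated rounding error. For values that are not exactly representable in floating point, the comparison at the stopping point may still need a small tolerance.

```python
def interval_boundaries(x_min, x_max, h):
    """Boundaries of the interval scale, starting at x_min - 0.5 * h."""
    if h <= 0:
        raise ValueError("interval width must be positive")
    first = x_min - 0.5 * h   # lower boundary of the first interval
    limit = x_max + 0.5 * h   # boundaries are added while they are below this
    boundaries = [first]
    i = 1
    while boundaries[-1] < limit:
        boundaries.append(first + i * h)
        i += 1
    return boundaries


print(interval_boundaries(11, 19, 2))
# [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
```

Note that with h = R / k this construction yields k + 1 intervals, because the scale starts half an interval below the minimum and ends at least half an interval above the maximum.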

4. The values of the feature are then grouped according to the scale of intervals: for each partial interval, the frequency $n_i$ is calculated as the sum of the frequencies of the variants that fall into the i-th interval. An interval includes the values of the random variable that are greater than or equal to its lower boundary and less than its upper boundary.
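The grouping step can be sketched as follows. The function name `group_into_intervals`, the sample, and the boundaries are illustrative assumptions. The standard-library `bisect_right` is used to find the half-open interval (lower boundary included, upper boundary excluded) that contains each value.

```python
from bisect import bisect_right


def group_into_intervals(sample, boundaries):
    """Count how many sample values fall into each interval [b_i, b_{i+1})."""
    counts = [0] * (len(boundaries) - 1)
    for x in sample:
        # Index of the interval whose lower boundary is the largest one <= x.
        i = bisect_right(boundaries, x) - 1
        if not 0 <= i < len(counts):
            raise ValueError(f"value {x} lies outside the interval scale")
        counts[i] += 1
    return counts


# Hypothetical sample and interval scale.
sample = [12, 15, 11, 19, 14, 17, 13, 16, 18, 11]
boundaries = [10, 12, 14, 16, 18, 20]
print(group_into_intervals(sample, boundaries))  # [2, 2, 2, 2, 2]
```

A value equal to a boundary, such as 12 here, is counted in the interval that starts at that boundary, in line with the rule stated above.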

Polygon and histogram

For clarity, various graphs of the statistical distribution are plotted.

From the data of a discrete variation series, a polygon of frequencies or of relative frequencies is constructed.

A polygon of frequencies is a broken line whose segments connect the points $(x_1; n_1), (x_2; n_2), \ldots, (x_k; n_k)$. To build it, the variants $x_i$ are plotted on the abscissa axis and the corresponding frequencies $n_i$ on the ordinate axis. The points $(x_i; n_i)$ are then connected by straight line segments (Fig. 1).

A polygon of relative frequencies is a broken line whose segments connect the points $(x_1; W_1), (x_2; W_2), \ldots, (x_k; W_k)$. To build it, the variants $x_i$ are plotted on the abscissa axis and the corresponding relative frequencies $W_i$ on the ordinate axis. The points $(x_i; W_i)$ are then connected by straight line segments.
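The vertices of both polygons can be computed from raw observations as in the sketch below. The function name `polygon_points` and the data are hypothetical. The relative frequency is taken as $W_i = n_i / n$, the usual definition, which the text does not spell out.

```python
from collections import Counter


def polygon_points(observations):
    """Return the vertices (x_i, n_i) and (x_i, W_i) of the two polygons."""
    counts = Counter(observations)
    n = len(observations)
    variants = sorted(counts)  # variants x_i in increasing order
    freq_points = [(x, counts[x]) for x in variants]
    rel_points = [(x, counts[x] / n) for x in variants]  # W_i = n_i / n
    return freq_points, rel_points


# Hypothetical discrete sample of 8 observations.
freq_points, rel_points = polygon_points([2, 3, 3, 4, 4, 4, 5, 5])
print(freq_points)  # [(2, 1), (3, 2), (4, 3), (5, 2)]
print(rel_points)   # [(2, 0.125), (3, 0.25), (4, 0.375), (5, 0.25)]
```

Connecting consecutive points with straight segments in any plotting tool gives the polygon.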

For a continuous feature, it is advisable to build a histogram.

A frequency histogram is a stepped figure made up of rectangles whose bases are the partial intervals of length h and whose heights are equal to the ratio $n_i / h$ (the frequency density).

To construct a frequency histogram, the partial intervals are plotted on the abscissa axis, and above each of them a segment parallel to the abscissa axis is drawn at a height of $n_i / h$.
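A small sketch of the height calculation, with made-up interval frequencies and a hypothetical function name `histogram_heights`. It also checks a direct consequence of the definition: since each rectangle has area $(n_i / h) \cdot h = n_i$, the total area of the frequency histogram equals the sample size n.

```python
def histogram_heights(counts, h):
    """Heights n_i / h (frequency density) of the histogram rectangles."""
    if h <= 0:
        raise ValueError("interval width must be positive")
    return [n_i / h for n_i in counts]


# Hypothetical interval frequencies for intervals of width h = 2.
counts = [2, 5, 8, 4, 1]
h = 2.0
heights = histogram_heights(counts, h)
print(heights)  # [1.0, 2.5, 4.0, 2.0, 0.5]

# Total area of the rectangles equals the sample size n = 20.
area = sum(height * h for height in heights)
print(area)  # 20.0
```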

When processing large amounts of information, which is especially important in modern scientific work, the researcher faces the serious task of grouping the initial data correctly. If the data are discrete, then, as we have seen, no problems arise: you only need to count the frequency of each value of the feature. If the feature under study is continuous (which is more common in practice), choosing the optimal number of intervals for grouping it is by no means a trivial task.

To group a continuous random variable, the entire range of variation of the feature is split into some number of intervals k.

A grouped interval (continuous) variation series is a set of intervals, ranked by the value of the feature, listed together with the corresponding frequencies $m_i$ (the numbers of observations that fall into the i-th interval) or relative frequencies:

Intervals of feature values
Frequency $m_i$

The histogram and the cumulative curve (ogive), which we have already discussed in detail, are excellent data visualization tools that give a first impression of the structure of the data. Such graphs (Fig. 1.15) are constructed for continuous data in the same way as for discrete data, with one difference: continuous data completely fill the region of their possible values and can take any value in it.

Fig. 1.15.

For this reason, the bars of the histogram and of the cumulative curve must touch one another, with no gaps in which values of the feature cannot fall even though they lie within its possible range (that is, the histogram and the cumulative curve should not have "holes" along the abscissa that exclude values of the variable under study, as in Fig. 1.16). The height of a bar corresponds to the frequency, that is, the number of observations that fell into the given interval, or to the relative frequency, that is, the proportion of observations. The intervals must not overlap and are generally of the same width.

Fig. 1.16.

The histogram and the polygon are approximations of the probability density curve (differential function) f(x) of the theoretical distribution, which is studied in probability theory. This is why constructing them is so important in the primary statistical processing of quantitative continuous data: their shape suggests the hypothetical distribution law.

The cumulative curve is the curve of accumulated frequencies (or relative frequencies) of the interval variation series. It is compared with the graph of the cumulative distribution function F(x), which is also studied in probability theory.
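The accumulated frequencies behind the cumulative curve are running sums of the interval frequencies. The sketch below uses made-up frequencies and a hypothetical function name `cumulative_series`. Dividing by the sample size gives accumulated relative frequencies, which end at 1 in the same way that F(x) tends to 1.

```python
from itertools import accumulate


def cumulative_series(counts):
    """Return accumulated frequencies and accumulated relative frequencies."""
    n = sum(counts)
    if n == 0:
        raise ValueError("the series contains no observations")
    cumulative = list(accumulate(counts))     # running sums of frequencies
    relative = [c / n for c in cumulative]    # running sums divided by n
    return cumulative, relative


# Hypothetical interval frequencies, n = 20.
cumulative, relative = cumulative_series([2, 5, 8, 4, 1])
print(cumulative)  # [2, 7, 15, 19, 20]
print(relative)    # [0.1, 0.35, 0.75, 0.95, 1.0]
```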

The concepts of the histogram and the cumulative curve are associated mainly with continuous data and their interval variation series, since these graphs are empirical estimates of the probability density function and the distribution function, respectively.

The construction of an interval variation series begins with determining the number of intervals k. This is perhaps the most difficult, important, and controversial part of the problem.

The number of intervals should not be too small, since the histogram then turns out too smooth (oversmoothed) and loses all the features of the variability of the initial data. Fig. 1.17 (left graph) shows a histogram with a smaller number of intervals built from the same data as the graphs in Fig. 1.15.

At the same time, the number of intervals should not be too large, otherwise we cannot estimate the distribution density of the data along the number axis: the histogram turns out insufficiently smooth (undersmoothed), uneven, and with unfilled intervals (see Fig. 1.17, right graph).

Fig. 1.17.
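The effect of the number of intervals can also be seen numerically, without a plot. The sketch below uses a made-up sample and a hypothetical helper `bin_counts`. For brevity it assumes a simpler scale than the one described earlier: k equal intervals spanning exactly from the minimum to the maximum, with the maximum assigned to the last interval. With very few intervals the counts hide all detail. With more intervals than observations, some intervals are necessarily empty.

```python
def bin_counts(sample, k):
    """Counts in k equal-width intervals spanning [min(sample), max(sample)]."""
    x_min, x_max = min(sample), max(sample)
    if k < 1 or x_max == x_min:
        raise ValueError("need k >= 1 and at least two distinct values")
    h = (x_max - x_min) / k
    counts = [0] * k
    for x in sample:
        # The maximum value is assigned to the last interval.
        i = min(int((x - x_min) / h), k - 1)
        counts[i] += 1
    return counts


# Hypothetical sample of 12 observations.
sample = [1.2, 1.9, 2.4, 2.5, 3.1, 3.3, 3.4, 3.8, 4.0, 4.6, 5.2, 6.0]
for k in (2, 5, 30):
    counts = bin_counts(sample, k)
    print(k, counts, "empty intervals:", counts.count(0))
```

With k = 30 and only 12 observations, at least 18 intervals must be empty, which is the "unfilled intervals" effect described above.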

How do you determine the most preferred number of intervals?

Back in 1926, Herbert Sturges proposed a formula for the number of intervals into which the original set of values of the feature under study should be split. The formula has become extremely popular: most statistics textbooks present it, and many statistical packages use it by default. Whether this is justified, and whether it is justified in all cases, is a serious question.

So what is the Sturges formula based on?

Consider the binomial distribution.