Construct a discrete variation series using the following data. Interval distribution series

When constructing an interval distribution series, three questions are resolved:

  • 1. How many intervals should you take?
  • 2. How long are the intervals?
  • 3. What is the order of inclusion of population units in the boundaries of intervals?

1. The number of intervals k can be determined by the Sturges formula:

$$k = 1 + 3.322 \lg n, \qquad (2.5)$$

where n is the number of units in the population.

2. The interval length, or interval step, h is usually determined by the formula

$$h = \frac{R}{k} = \frac{x_{\max} - x_{\min}}{k},$$

where R is the range of variation and k is the number of intervals.

3. The order in which population units are assigned to the interval boundaries may vary, but for any given interval series it must be strictly defined.

For example, the rule [ ) means that a unit equal to the lower bound of an interval is included in that interval, while a unit equal to the upper bound is not included and is carried over to the next interval. The exception is the last interval, whose upper bound includes the largest value of the ranked series.
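The inclusion rule above can be sketched as a small function. This is only an illustration under stated assumptions: the name `interval_index` and its parameters are hypothetical, and the intervals are assumed to be of equal width.

```python
def interval_index(value, lower, step, k):
    """Return the 0-based index of the interval that contains value.

    Intervals i < k - 1 are half-open, [lower + i*step, lower + (i+1)*step);
    the last interval is closed on both sides.
    """
    upper = lower + step * k
    if not lower <= value <= upper:
        raise ValueError("value lies outside the range of the series")
    # Floor division places a value equal to a boundary in the interval
    # that starts at that boundary, i.e. in the next interval.
    index = int((value - lower) // step)
    # Without this clamp the maximum would land in a non-existent interval k.
    return min(index, k - 1)


# Hypothetical series: 3 intervals of width 10 starting at 0.
examples = {v: interval_index(v, lower=0, step=10, k=3) for v in (0, 9, 10, 20, 30)}
print(examples)
```

With a non-integer step, floating-point rounding may move a value that sits exactly on a boundary, so boundaries are best chosen as round numbers.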

The boundaries of the intervals are:

  • closed: both extreme values of the attribute are specified;
  • open: only one extreme value of the attribute is specified (up to a given number, or above a given number).

To help consolidate the theoretical material, we introduce background information for a cross-cutting task.

There are hypothetical data on the average number of sales managers, the quantity of a homogeneous product they sold, the individual market price of this product, and the sales volume of 30 firms in one of the regions of the Russian Federation in the first quarter of the reporting year (Table 2.1).

Table 2.1

Background information for the cross-cutting task

Columns (repeated in two side-by-side halves): Number of managers, people; Number of goods sold, pcs.; Price, thousand rubles; Sales volume, RUB mln

Based on this initial information, together with additional information, we formulate individual assignments. We then present the method for solving each and the solutions themselves.

Cross-cutting task. Assignment 2.1

Using the initial data in Table 2.1, it is required to build a discrete distribution series of firms by the quantity of goods sold (Table 2.2).

Solution:

Table 2.2

Discrete series of distribution of firms by the amount of goods sold in one of the regions of the Russian Federation in the first quarter of the reporting year

Cross-cutting task. Assignment 2.2

It is required to build a ranked series of the 30 firms by the average number of managers.

Solution:

15; 17; 18; 20; 20; 20; 22; 22; 24; 25; 25; 25; 27; 27; 27; 28; 29; 30; 32; 32; 33; 33; 33; 34; 35; 35; 38; 39; 39; 45.
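As a minimal sketch, the ranked series above can be reproduced by sorting, and a discrete series of the same attribute obtained by counting how often each value occurs. The variable names are illustrative, and counting the manager numbers (rather than the goods sold, whose values are not listed here) is an assumption made for the example.

```python
from collections import Counter

# Average number of managers per firm, as listed in the ranked series above.
managers = [15, 17, 18, 20, 20, 20, 22, 22, 24, 25, 25, 25, 27, 27, 27,
            28, 29, 30, 32, 32, 33, 33, 33, 34, 35, 35, 38, 39, 39, 45]

# Ranking is a sort in ascending order (a no-op here, since the values
# are already ranked).
ranked = sorted(managers)

# A discrete series pairs each distinct value with its frequency.
discrete_series = sorted(Counter(ranked).items())

for value, frequency in discrete_series:
    print(f"{value:>3} managers: {frequency} firm(s)")
```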

Cross-cutting task. Assignment 2.3

Using the initial data in Table 2.1, it is required to:

  • 1. Construct an interval series of distribution of firms by the number of managers.
  • 2. Calculate the relative frequencies of the distribution series of firms.
  • 3. Draw conclusions.

Solution:

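We calculate the number of intervals by the Sturges formula (2.5):

$$k = 1 + 3.322 \lg 30 = 1 + 3.322 \cdot 1.477 \approx 5.91 \approx 6.$$

Thus, we take 6 intervals (groups).

We calculate the interval length, or interval step, by the formula

$$h = \frac{x_{\max} - x_{\min}}{k} = \frac{45 - 15}{6} = 5 \text{ people}.$$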

Note. The order of inclusion of population units in the interval boundaries is as follows: [ ), in which units are included at the lower boundary and not included at the upper boundary, but are carried over to the next interval. The exception to this rule is the last interval [ ], whose upper bound includes the last number of the ranked series.

We build an interval series (Table 2.3).

Table 2.3

Interval distribution series of firms by the average number of managers in one of the regions of the Russian Federation in the first quarter of the reporting year

Conclusion. The most numerous group of firms is a group with an average number of managers of 25-30 people, which includes 8 firms (27%); the smallest group with an average number of managers of 40-45 people includes only one firm (3%).
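The whole construction can be retraced with a short script. It is a sketch rather than the textbook's own procedure: it takes the ranked series from Assignment 2.2 as input, rounds the Sturges value to the nearest integer, and applies the [ ) rule with a closed last interval; all variable names are illustrative.

```python
import math

# Ranked series of the average number of managers (Assignment 2.2).
managers = [15, 17, 18, 20, 20, 20, 22, 22, 24, 25, 25, 25, 27, 27, 27,
            28, 29, 30, 32, 32, 33, 33, 33, 34, 35, 35, 38, 39, 39, 45]

n = len(managers)
k = round(1 + 3.322 * math.log10(n))   # Sturges formula: about 5.91, so 6
lo, hi = min(managers), max(managers)
h = (hi - lo) / k                      # interval step: (45 - 15) / 6 = 5

frequencies = [0] * k
for x in managers:
    # [ ) rule; min(...) keeps the maximum in the last, closed interval.
    i = min(int((x - lo) // h), k - 1)
    frequencies[i] += 1

for i, f in enumerate(frequencies):
    left, right = lo + i * h, lo + (i + 1) * h
    print(f"{left:.0f}-{right:.0f}: {f} firms ({f / n:.0%})")
```

The resulting frequencies are 3, 6, 8, 7, 5 and 1, which agrees with the conclusion: 8 firms in the 25-30 group and one firm in the 40-45 group.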

Using the initial data in Table 2.1 and the interval distribution series of firms by the number of managers (Table 2.3), it is required to construct an analytical grouping of the relationship between the number of managers and the sales volume of firms and, on that basis, to conclude whether the two attributes are related.

Solution:

An analytical grouping is built on the factor attribute. In our task the factor attribute (x) is the number of managers, and the resultant attribute (y) is the sales volume (Table 2.4).

Let us now build the analytical grouping (Table 2.5).

Conclusion. The analytical grouping shows that as the number of sales managers increases, the average sales volume per firm in the group also increases, which indicates a direct relationship between the two attributes.
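The idea of an analytical grouping can be sketched as follows: firms are grouped by intervals of the factor attribute x, and the mean of the resultant attribute y is computed for each group. The (managers, sales) pairs below are hypothetical and are not the values of Table 2.1; the interval grid (start 15, step 5, 6 groups) is assumed from Assignment 2.3.

```python
from collections import defaultdict

# HYPOTHETICAL (managers, sales volume in RUB mln) pairs, for illustration only.
firms = [(15, 9.1), (18, 9.6), (22, 9.9), (24, 10.3), (27, 10.2),
         (29, 10.8), (33, 11.0), (34, 11.4), (38, 11.9), (45, 12.5)]

lower, step, k = 15, 5, 6   # interval grid assumed from Assignment 2.3

groups = defaultdict(list)
for x, y in firms:
    i = min((x - lower) // step, k - 1)   # [ ) rule, last interval closed
    groups[i].append(y)

group_means = {}
for i in sorted(groups):
    left, right = lower + i * step, lower + (i + 1) * step
    group_means[(left, right)] = sum(groups[i]) / len(groups[i])
    print(f"{left}-{right}: {len(groups[i])} firm(s), "
          f"mean sales {group_means[(left, right)]:.2f}")

# A direct relationship shows up as group means that grow with x.
means = list(group_means.values())
is_direct = all(a < b for a, b in zip(means, means[1:]))
print("direct relationship:", is_direct)
```

Strictly increasing group means are a simple indication of a direct relationship; they do not measure how strong it is.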

Table 2.4

Auxiliary table for constructing an analytical grouping

Columns: Number of managers, people (x); Firm number; Sales volume, RUB mln (y)

Mean sales volume per firm, RUB mln (values legible in the source): 9.97; 10.22; 10.61 (group of 7 firms); 10.31 (all 30 firms)

Table 2.5

Dependence of sales volumes on the number of managers of firms in one of the regions of the Russian Federation in the first quarter of the reporting year

CONTROL QUESTIONS
  • 1. What is the essence of statistical observation?
  • 2. What are the stages of statistical observation?
  • 3. What are the organizational forms of statistical observation?
  • 4. Name the types of statistical observation.
  • 5. What is a statistical summary?
  • 6. What are the types of statistical summaries?
  • 7. What is statistical grouping?
  • 8. Name the types of statistical groupings.
  • 9. What is a distribution series?
  • 10. Name the structural elements of a distribution series.
  • 11. What is the order of constructing a distribution series?

In many cases, when the statistical population includes a large or even infinite number of variants, which is most common with continuous variation, it is practically impossible and impractical to form a group of units for each variant. In such cases, statistical units can be combined into groups only on the basis of an interval, i.e. a group that has definite limits for the values of the varying attribute. These limits are given by two numbers indicating the lower and upper boundaries of each group. The use of intervals leads to the formation of an interval distribution series.

An interval series is a variation series whose variants are presented as intervals.

The interval series can be formed with equal or unequal intervals; the choice of principle depends mainly on how representative and convenient the statistical population is. If the population is large enough (representative) in the number of units and is fully homogeneous in composition, it is advisable to base the interval series on equal intervals. This principle is usually applied to populations where the range of variation is relatively small, i.e. the maximum and minimum variants differ by no more than a few times. In this case, the size of an equal interval is calculated as the ratio of the range of variation of the attribute to the specified number of intervals. To determine the size of an equal interval, the Sturges formula can be used (usually when the variation of the attribute is small and the number of units in the statistical population is large):

$$i = \frac{x_{\max} - x_{\min}}{1 + 3.322 \lg n}, \qquad (5.1)$$

where i is the size of an equal interval; x_max and x_min are the maximum and minimum variants in the statistical population; n is the number of units in the population.

Example. It is required to calculate the size of an equal interval for the density of radioactive contamination with cesium-137 in 100 settlements of the Krasnopolsky district of the Mogilev region, given that the initial (minimum) variant is 1 Ci/km² and the final (maximum) variant is 65 Ci/km². Using formula 5.1, we get:

$$i = \frac{65 - 1}{1 + 3.322 \lg 100} = \frac{64}{7.644} \approx 8.4 \approx 8 \text{ Ci/km}^2.$$

Consequently, to form an interval series with equal intervals for the density of cesium-137 contamination of settlements in the Krasnopolsky district, the size of an equal interval can be taken as 8 Ci/km².
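The calculation in this example can be checked with a few lines of code. The function name `equal_interval` is hypothetical; the formula is the range of variation divided by the Sturges number of intervals, as in formula 5.1.

```python
import math


def equal_interval(x_min, x_max, n):
    """Size of an equal interval: range of variation / (1 + 3.322 lg n)."""
    return (x_max - x_min) / (1 + 3.322 * math.log10(n))


# Values from the example: 100 settlements, density from 1 to 65 Ci/km^2.
size = equal_interval(x_min=1, x_max=65, n=100)
print(f"interval size: {size:.2f} Ci/km^2, rounded to {round(size)}")
```

Rounding the result to a whole number is a matter of convenience: round interval boundaries make the series easier to read.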

When the distribution is uneven, i.e. when the maximum and minimum variants differ by hundreds of times, the principle of unequal intervals can be applied in forming the interval series. Unequal intervals usually widen as the attribute values increase.

Intervals can be closed or open. Closed intervals are those for which both the lower and the upper boundary are specified. Open intervals have only one boundary: the first interval has only an upper boundary, and the last only a lower one.

It is advisable to evaluate interval series, especially those with unequal intervals, using the distribution density; the simplest way to calculate it is as the ratio of the local frequency (or relative frequency) to the size of the interval.
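A minimal sketch of the density calculation follows. The series is hypothetical and was chosen so that the interval widths differ; only the ratio "frequency divided by interval size" comes from the text.

```python
# HYPOTHETICAL series with unequal intervals: (lower, upper, frequency).
series = [(0, 10, 40), (10, 30, 50), (30, 70, 60), (70, 150, 40)]

# Distribution density: local frequency divided by the size of the interval.
densities = [f / (upper - lower) for lower, upper, f in series]

for (lower, upper, f), d in zip(series, densities):
    print(f"{lower:>3}-{upper:<3} frequency {f:>2}, density {d:.2f}")
```

In this illustration the third interval has the highest frequency but not the highest density, which is why raw frequencies of unequal intervals should not be compared directly.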

For the practical formation of an interval series, the layout of Table 5.3 can be used.

Table 5.3. Procedure for forming the interval series of settlements in the Krasnopolsky district by density of radioactive contamination with cesium-137

The main advantage of the interval series is its extreme compactness. At the same time, the individual variants of the attribute are hidden within the corresponding intervals of the interval distribution series.

When an interval series is depicted graphically in a rectangular coordinate system, the upper boundaries of the intervals are plotted on the abscissa axis and the local frequencies of the series on the ordinate axis. The graph of an interval series differs from a distribution polygon in that each interval has a lower and an upper boundary, so two abscissas correspond to each ordinate value. Therefore, what is marked on the graph of an interval series is not a point, as in a polygon, but a line connecting two points. These horizontal lines are joined by vertical lines, producing a stepped figure that is usually called a distribution histogram (Figure 5.3).

When an interval series is plotted for a sufficiently large statistical population, the histogram approaches a symmetrical form of distribution. When the statistical population is small, the histogram is, as a rule, asymmetrical.

In some cases it is advisable to form a series of accumulated frequencies, i.e. a cumulative series. A cumulative series can be formed from a discrete or an interval distribution series. When a cumulative series is depicted graphically in a rectangular coordinate system, the variants are plotted on the abscissa axis and the accumulated frequencies (relative frequencies) on the ordinate axis. The resulting curve is usually called the cumulate of the distribution (Figure 5.4).
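Accumulated frequencies are running totals of the ordinary frequencies. The sketch below uses illustrative, assumed frequencies for six consecutive intervals; the last accumulated value always equals the size of the population.

```python
from itertools import accumulate

# ILLUSTRATIVE frequencies of six consecutive intervals (assumed values).
frequencies = [3, 6, 8, 7, 5, 1]

# Accumulated frequencies: the running total up to and including each interval.
cumulative = list(accumulate(frequencies))

# Accumulated relative frequencies: shares of the whole population.
total = cumulative[-1]
cumulative_share = [c / total for c in cumulative]

for f, c, s in zip(frequencies, cumulative, cumulative_share):
    print(f"frequency {f}, accumulated {c:>2}, share {s:.2f}")
```

Plotting `cumulative` against the upper boundaries of the intervals gives the points of the cumulate.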

The formation and graphical representation of different types of variation series simplifies the calculation of the main statistical characteristics, which are discussed in detail in topic 6, and helps to better understand the distribution laws of the statistical population. The analysis of a variation series is especially important when it is necessary to identify and trace the relationship between the variants and the frequencies (relative frequencies). This relationship shows itself in the fact that the number of cases falling on each variant is related in a definite way to the magnitude of that variant, i.e. as the values of the varying attribute increase, the frequencies (relative frequencies) of these values undergo definite, systematic changes. This means that the numbers in the frequency (relative frequency) column do not fluctuate chaotically but change in a definite direction, order and sequence.

If the frequencies in their changes reveal a certain systematicity, then this means that we are on the way to identifying patterns. The system, order, sequence in changing frequencies is a reflection of common causes, general conditions characteristic of the entire population.

It should not be assumed that the distribution pattern is always given in finished form. There are quite a few variation series in which the frequencies jump erratically, now increasing, now decreasing. In such cases it is advisable to find out what kind of distribution the researcher is dealing with: either the distribution has no inherent regularity at all, or its nature has not yet been revealed. The first case is rare; the second is quite common and widespread.

For example, when an interval series is formed, the total number of statistical units may be small, so that each interval contains only a few variants (for example, 1-3 units). In such cases no regularity can be expected to show itself. For a regular result to emerge from random observations, the law of large numbers must come into force, i.e. each interval must contain not a few but tens or hundreds of statistical units. To this end, one should try to increase the number of observations as much as possible. This is the surest way of detecting patterns in mass processes. If there is no real opportunity to increase the number of observations, a pattern can be identified by reducing the number of intervals in the distribution series. Reducing the number of intervals in a variation series increases the frequencies in each interval. As a result, the random fluctuations of individual statistical units are superimposed on one another and "smoothed out", turning into a regularity.

The formation and construction of variation series gives only a general, approximate picture of the distribution of the statistical population. For example, the histogram expresses the relationship between the values of the attribute and its frequencies (relative frequencies) only in rough form. Therefore, variation series are essentially only a basis for further, in-depth study of the internal regularity of the statistical distribution.

CONTROL QUESTIONS FOR TOPIC 5

1. What is a variation? What causes the variation of a feature in a statistical population?

2. What kinds of varying features can take place in statistics?

3. What is a variation series? What types of variation series can there be?

4. What is the ranked series? What are its advantages and disadvantages?

5. What is a discrete series and what are its advantages and disadvantages?

6. What is the order of formation of the interval series, what are its advantages and disadvantages?

7. How are ranked, discrete and interval distribution series represented graphically?

8. What is the cumulative distribution and what does it characterize?

When processing large amounts of information, which is especially relevant in modern scientific research, the researcher faces the serious task of grouping the initial data correctly. If the data are discrete, then, as we have seen, no problems arise: one simply counts the frequency of each value of the attribute. If the attribute under study is continuous (which is more common in practice), then choosing the optimal number of intervals for grouping it is by no means a trivial task.

To group continuous random variables, the entire range of variation of the attribute is split into a certain number of intervals k.

A grouped interval (continuous) variation series is the set of intervals, ranked by the value of the attribute, given together with the corresponding frequencies m_i (the numbers of observations that fall into the i-th interval) or relative frequencies:

Columns: intervals of attribute values; frequency m_i

The histogram and the cumulate (ogive), which we have already discussed in detail, are excellent data visualization tools that give a first idea of the structure of the data. Such graphs (Fig. 1.15) are constructed for continuous data in the same way as for discrete data, except that continuous data fill the range of their possible values completely, taking any value within it.

Fig. 1.15.

Therefore, the bars of the histogram and the cumulate must touch one another and must not leave gaps where values of the attribute within its possible range are not covered (i.e. the histogram and the cumulate should not have "holes" along the abscissa axis that exclude values of the variable under study, as in Fig. 1.16). The height of a bar corresponds to the frequency, i.e. the number of observations that fall into the given interval, or to the relative frequency, i.e. the proportion of observations. Intervals must not overlap and are usually of equal width.

Fig. 1.16.

The histogram and the polygon are approximations of the probability density curve (differential function) f(x) of the theoretical distribution, which is considered in the course of probability theory. This is why their construction is so important in the primary statistical processing of continuous quantitative data: their appearance suggests the hypothetical distribution law.

The cumulate is the curve of accumulated frequencies (relative frequencies) of an interval variation series. The cumulate is compared with the graph of the cumulative distribution function F(x), also considered in the course of probability theory.

Basically, the concepts of histograms and cumulates are associated with continuous data and their interval variation series, since their graphs are empirical estimates of the probability density function and distribution function, respectively.

The construction of an interval variation series begins with determining the number of intervals k. And this task, perhaps, is the most difficult, important and controversial in the issue under study.

The number of intervals should not be too small, since in that case the histogram is too smooth (oversmoothed) and loses all the features of the variability of the initial data: Fig. 1.17 shows the same data used for the graphs in Fig. 1.15 plotted as a histogram with fewer intervals (left graph).

At the same time, the number of intervals should not be too large, otherwise we will not be able to estimate the distribution density of the data along the number axis: the histogram will be insufficiently smooth (undersmoothed), uneven, with empty intervals (see Fig. 1.17, right graph).

Fig. 1.17.
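The effect of the number of intervals can be seen without drawing anything, by counting how the same observations fall into the bins. The sketch below uses synthetic normally distributed data (an assumption, not the data behind the figures) and a hypothetical helper `bin_counts`; k = 7 is roughly what the Sturges formula gives for 50 observations.

```python
import random


def bin_counts(data, k):
    """Frequencies of data in k equal-width intervals ([ ) rule, last closed)."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    counts = [0] * k
    for x in data:
        # min(...) keeps the maximum in the last interval.
        counts[min(int((x - lo) / width), k - 1)] += 1
    return counts


random.seed(0)  # fixed seed so that the sketch is repeatable
sample = [random.gauss(0, 1) for _ in range(50)]  # SYNTHETIC data

for k in (2, 7, 200):
    counts = bin_counts(sample, k)
    print(f"k={k:>3}: largest bin {max(counts)}, empty bins {counts.count(0)}")
```

With k = 2 all detail is lost in two bars, while with k = 200 at least 150 of the intervals must be empty, because only 50 observations are available to fill them.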

How do you determine the most preferred number of intervals?

Back in 1926, Herbert Sturges proposed a formula for calculating the number of intervals into which the original set of values of the attribute under study should be split. This formula has become extremely popular: most statistics textbooks present it, and many statistical packages use it by default. Whether this is justified, and whether it is justified in every case, is a serious question.

So what is the Sturges formula based on?

Consider the binomial distribution)