What is a discrete series? Constructing a statistical distribution series

Grouped statistical data are presented in the form of distribution series.

A distribution series is a type of grouping.

A distribution series is an ordered distribution of the units of the studied population into groups according to a certain varying characteristic.

Depending on the characteristic underlying the series, attributive and variational distribution series are distinguished:

Attributive distribution series are built according to qualitative characteristics.
Variational distribution series are built in ascending or descending order of the values of a quantitative characteristic.

A variation distribution series consists of two columns:

The first column contains the quantitative values of the varying characteristic, which are called variants. A discrete variant is expressed as an integer. An interval variant is given as a range with a lower and an upper limit ("from ... to ..."). Depending on the type of variants, you can build a discrete or an interval variation series.
The second column contains the number of occurrences of each variant, expressed as frequencies or relative frequencies:

Frequencies are absolute numbers showing how many times a given value of the characteristic occurs in the population. The sum of all frequencies should equal the number of units in the entire population.

Relative frequencies are frequencies expressed as a share of the total. The sum of all relative frequencies should equal 100% when expressed as percentages, or 1 when expressed as fractions.

Graphical representation of distribution series

Distribution series are visualized using graphical representations.

Distribution series are depicted as:

Polygon
Histograms
Cumulates
Ogives

Polygon

When constructing a polygon, the values of the varying characteristic are plotted on the horizontal axis (abscissa), and the frequencies or relative frequencies on the vertical axis (ordinate).

The polygon in Fig. 6.1 is built from the data of the 1994 microcensus of the population of Russia.

Fig. 6.1. Distribution of households by size

Condition: Data are given on the distribution of 25 employees of an enterprise by wage grade:
4; 2; 4; 6; 5; 6; 4; 1; 3; 1; 2; 5; 2; 6; 3; 1; 2; 3; 4; 5; 4; 6; 2; 3; 4
Task: Build a discrete variation series and display it graphically as a distribution polygon.
Solution:
In this example, the variants are the employees' wage grades. To determine the frequencies, count the number of employees in each wage grade.
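As a minimal sketch of this counting step (Python is used here only for illustration, and the variable names are assumptions rather than part of the original solution), the discrete variation series could be tabulated like this:

```python
from collections import Counter

# Wage grades of the 25 employees, copied from the condition above.
grades = [4, 2, 4, 6, 5, 6, 4, 1, 3, 1, 2, 5, 2,
          6, 3, 1, 2, 3, 4, 5, 4, 6, 2, 3, 4]

# Count how many employees have each grade, then order the variants ascending.
series = sorted(Counter(grades).items())

print("Wage grade  Number of employees")
for grade, frequency in series:
    print(f"{grade:>10}  {frequency:>19}")

# The frequencies must add up to the size of the population.
assert sum(frequency for _, frequency in series) == len(grades)
```

The resulting (grade, frequency) pairs are also the points of the distribution polygon: grades go on the abscissa and frequencies on the ordinate.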

The polygon is used for discrete variation series.

To build a distribution polygon (Fig. 1), we plot the quantitative values of the varying characteristic (the variants) along the abscissa axis (X), and the frequencies or relative frequencies along the ordinate axis.

If the values of a characteristic are expressed as intervals, the series is called an interval series.
Interval distribution series are displayed graphically as histograms, cumulates or ogives.

Statistical table

Condition: Data are given on the size of the deposits of 20 individuals in one bank (thousand rubles): 60; 25; 12; 10; 68; 35; 2; 17; 51; 9; 3; 130; 24; 85; 100; 152; 6; 18; 7; 42.
Task: Build an interval variation series with equal intervals.
Solution:

The original population consists of 20 units (N = 20).
Using the Sturges formula, we determine the required number of groups: n = 1 + 3.322 * lg 20 ≈ 5.32, which is rounded to 5.
We calculate the width of an equal interval: i = (152 - 2) / 5 = 30 thousand rubles.
We divide the initial population into 5 groups with an interval of 30 thousand rubles.
The grouping results are presented in the table:

Deposit size, thousand rubles	Number of depositors
2 - 32	11
32 - 62	4
62 - 92	2
92 - 122	1
122 - 152	2
Total	20

When a continuous characteristic is recorded this way, the same value appears twice: as the upper limit of one interval and as the lower limit of the next. A unit with exactly this value is assigned to the group in which the value is the upper limit.
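The steps of this solution can be sketched in code. This is only an illustration under stated assumptions: the number of groups is obtained by rounding the Sturges value to the nearest integer, and `group_index` is a hypothetical helper that applies the boundary rule just described (a boundary value goes to the group where it is the upper limit).

```python
import math

# Deposit sizes of the 20 individuals (thousand rubles), from the condition above.
deposits = [60, 25, 12, 10, 68, 35, 2, 17, 51, 9,
            3, 130, 24, 85, 100, 152, 6, 18, 7, 42]

N = len(deposits)

# Sturges formula; about 5.32 here, rounded to the nearest whole number (assumption).
n_groups = round(1 + 3.322 * math.log10(N))

# Width of an equal interval.
x_min, x_max = min(deposits), max(deposits)
width = (x_max - x_min) / n_groups

# Interval boundaries: x_min, x_min + width, ..., x_max.
bounds = [x_min + k * width for k in range(n_groups + 1)]


def group_index(value):
    """Return the 0-based group of `value` (hypothetical helper).

    A value lying exactly on a boundary is assigned to the group
    in which it is the upper limit.
    """
    if value == x_min:
        return 0
    return math.ceil((value - x_min) / width) - 1


frequencies = [0] * n_groups
for d in deposits:
    frequencies[group_index(d)] += 1

for k, f in enumerate(frequencies):
    print(f"{bounds[k]:g} - {bounds[k + 1]:g}: {f}")
```

None of the 20 deposits falls exactly on an interior boundary, so the boundary rule does not change the counts for this particular data set.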

Histogram

To construct a histogram, the boundaries of the intervals are marked along the abscissa, and rectangles are built on them whose heights are proportional to the frequencies (or relative frequencies).

Fig. 6.2 shows a histogram of the distribution of the population of Russia in 1997 by age group.

Fig. 6.2. Distribution of the population of Russia by age groups

Condition: The distribution of 30 employees of the company by the size of the monthly salary is given

Task: Display the interval variation series graphically as a histogram and a cumulate.
Solution:

The open (first) interval has no stated lower limit, so its width is taken equal to the width of the second interval: 7000 - 5000 = 2000 rubles. Using this width, we find the lower limit of the first interval: 5000 - 2000 = 3000 rubles.
To construct a histogram in a rectangular coordinate system, we mark off along the abscissa axis the segments that correspond to the intervals of the variation series.
These segments serve as the bases of the rectangles, and the corresponding frequencies (or relative frequencies) as their heights.
Let's build a histogram:

To construct a cumulate, it is necessary to calculate the accumulated frequencies (or accumulated relative frequencies). They are determined by successively summing the frequencies of the preceding intervals and are denoted by S. The accumulated frequencies show how many units of the population have a value of the characteristic no greater than the one considered.

Cumulate

The distribution of a characteristic in a variation series by accumulated frequencies (or accumulated relative frequencies) is depicted with a cumulate.

The cumulate, or cumulative curve, unlike the polygon, is built from accumulated frequencies or accumulated relative frequencies. The values of the characteristic are placed on the abscissa axis, and the accumulated frequencies or relative frequencies on the ordinate axis (Fig. 6.3).

Fig. 6.3. Cumulative distribution of households by size

4. Let's calculate the accumulated frequencies:
The accumulated frequency of the first interval is calculated as 0 + 4 = 4; for the second, 4 + 12 = 16; for the third, 4 + 12 + 8 = 24; and so on.
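The running sums above can be reproduced with a short sketch. Only the three frequencies quoted in the calculation (4, 12, 8) are used; the frequencies of the remaining intervals are not given in the text and are deliberately left out.

```python
from itertools import accumulate

# Frequencies of the first three intervals, as quoted in the calculation above;
# the remaining intervals of the original table are not reproduced here.
frequencies = [4, 12, 8]

# Each accumulated frequency is the running total up to and including its interval.
accumulated = list(accumulate(frequencies))

print(accumulated)  # [4, 16, 24]
```

For a complete series, the last accumulated frequency equals the total number of units in the population.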

When constructing a cumulate, the accumulated frequency (or relative frequency) of each interval is assigned to its upper limit:

Ogive

The ogive is constructed like the cumulate, with the only difference that the accumulated frequencies are placed on the abscissa axis and the values of the characteristic on the ordinate axis.

A variety of the cumulate is the concentration curve, or Lorenz curve. To plot it, a percentage scale from 0 to 100 is applied to both axes of a rectangular coordinate system. The accumulated relative frequencies are plotted on the abscissa, and the accumulated shares (in percent) of the total volume of the characteristic on the ordinate.

A uniform distribution of the characteristic corresponds to the diagonal of the square on the graph (Fig. 6.4). With an uneven distribution, the graph is a concave curve whose curvature depends on the level of concentration of the characteristic.

Fig. 6.4. Concentration curve
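To make the construction of the concentration curve concrete, here is a rough sketch that computes its points. The input numbers are taken from the profit example later in this text (7, 5 and 4 enterprises with total profits of 154, 272 and 391 million rubles); using that example here, and the helper name `cumulative_percent`, are assumptions made purely for illustration.

```python
from itertools import accumulate

# Group data taken from the profit example later in this text:
# number of enterprises and total profit (million rubles) per group,
# with groups ordered by increasing profit.
counts = [7, 5, 4]
volumes = [154, 272, 391]


def cumulative_percent(values):
    """Running totals expressed as a percentage of the grand total."""
    total = sum(values)
    return [100 * s / total for s in accumulate(values)]


# Points of the concentration curve, starting from the origin.
xs = [0.0] + cumulative_percent(counts)    # accumulated frequencies, %
ys = [0.0] + cumulative_percent(volumes)   # accumulated share of the volume, %
points = list(zip(xs, ys))

for x, y in points:
    print(f"{x:6.2f}  {y:6.2f}")
```

Every point lies on or below the diagonal, which is the concave shape described above; the further the points sag below the diagonal, the higher the concentration.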

What grouping of statistical data is and how it relates to distribution series was discussed in the corresponding lecture, which also explains what discrete and interval variation series are.

Distribution series are one of the varieties of statistical series (besides them, statistics also uses time series); they are used to analyze data on the phenomena of social life. Constructing a variation series is a manageable task for anyone. However, there are rules to remember.

How to construct a discrete variation series

Example 1. There are data on the number of children in 20 surveyed families. Construct a discrete variation series of the distribution of families by the number of children.

0 1 2 3 1
2 1 2 1 0
4 3 2 1 1
1 0 1 0 2

Solution:

We start with a layout for the table, which we will then fill with data. Since a distribution series has two elements, the table consists of two columns. The first column is always the variant, that is, what we are studying. We take its name from the task (the end of the sentence stating the task), "by the number of children", so our variant is the number of children.

The second column is the frequency, that is, how often each variant occurs in the studied phenomenon. We also take the name of this column from the task, "distribution of families", so our frequency is the number of families with the corresponding number of children.

Now, from the initial data, we select the values that occur at least once. In our case these are 0, 1, 2, 3 and 4.

We arrange these values in the first column of the table in a logical order, in this case increasing from 0 to 4.

Finally, let's count how many times each variant occurs.

0 1 2 3 1

2 1 2 1 0

4 3 2 1 1

1 0 1 0 2

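As a result, we get the completed table, that is, the required distribution series of families by the number of children: 4 families have no children, 8 have one child, 5 have two, 2 have three and 1 has four, 20 families in total.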

Exercise. There are data on the wage grades of 30 workers of an enterprise. Construct a discrete variation series of the distribution of workers by wage grade.

2 3 2 4 4 5 5 4 6 3
1 4 4 5 5 6 4 3 2 3
4 5 4 5 5 6 6 3 3 4

How to construct an interval variation series

Let's construct an interval distribution series and see how its construction differs from that of a discrete series.

Example 2. There are data on the profit received by 16 enterprises (million rubles): 23 48 57 12 118 9 16 22 27 48 56 87 45 98 88 63. Construct an interval variation series of the distribution of enterprises by profit, forming 3 groups with equal intervals.

The general principle of constructing the series stays the same: the same two columns, the same variants and frequencies. Here, however, the variants are intervals, and the frequencies are counted differently.

Solution:

As in the previous task, we start by building a layout for the table, which we will then fill with data. Since a distribution series has two elements, the table consists of two columns. The first column is always the variant, that is, what we are studying. We take its name from the task (the end of the sentence stating the task), "by profit", so our variant is the amount of profit received.

The second column is the frequency, that is, how often each variant occurs in the studied phenomenon. We also take the name of this column from the task, "distribution of enterprises", so our frequency is the number of enterprises with the corresponding profit, in this case the number falling into each interval.

As a result, the table has two columns: the amount of profit (in intervals) and the number of enterprises.

The width of an equal interval is calculated by the formula i = (Xmax - Xmin) / n,

where i is the width (length) of the interval,

Xmax and Xmin are the maximum and minimum values of the characteristic,

n is the required number of groups according to the problem statement.

Let's calculate the width of the interval for our example. To do this, we find the largest and the smallest values among the initial data:

23 48 57 12 118 9 16 22 27 48 56 87 45 98 88 63. The maximum value is 118 million rubles, and the minimum is 9 million rubles. Substituting into the formula: i = (118 - 9) / 3 = 36.(3).

The calculation gives 36.(3), with 3 repeating. In such situations the width of the interval must be rounded up, so that the maximum value is not lost when the intervals are built. This is why the width of the interval is taken as 36.4 million rubles.

Now let's build the intervals, which are the variants in this problem. The first interval starts from the minimum value; adding the width of the interval gives its upper limit. The upper limit of the first interval then becomes the lower limit of the second interval; adding the width again gives the second interval. This is repeated as many times as there are intervals required by the condition.

Note that if we had not rounded the width up to 36.4 but had left it at 36.3, the upper limit of the last interval would have been 117.9, which is below the maximum of 118. It is precisely to avoid this loss of data that the width of the interval must be rounded up.
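The effect of rounding up versus rounding down can be checked with a small sketch. Rounding to one decimal place follows the worked example; the helper `boundaries` and the variable names are assumptions introduced for illustration.

```python
import math

# Smallest and largest profit (million rubles) and the required number of groups.
x_min, x_max, n = 9, 118, 3

exact_width = (x_max - x_min) / n  # 36.333...

# Round the width UP to one decimal place so the last interval reaches x_max.
width_up = math.ceil(exact_width * 10) / 10      # 36.4
# Rounding DOWN is shown only for comparison.
width_down = math.floor(exact_width * 10) / 10   # 36.3


def boundaries(width):
    """Interval boundaries starting from x_min, rounded to one decimal place."""
    return [round(x_min + k * width, 1) for k in range(n + 1)]


print(boundaries(width_up))    # [9.0, 45.4, 81.8, 118.2]
print(boundaries(width_down))  # [9.0, 45.3, 81.6, 117.9]
```

With the width rounded down, the last boundary (117.9) falls short of the maximum value 118, so that enterprise would not belong to any interval.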

Let's count the number of enterprises that fall into each interval. When processing the data, remember that the lower limit of an interval is included in it, while the upper limit is not: a value equal to the upper limit is counted in the next interval. The only exception is the last interval, which includes its upper limit.

When processing the data, it is convenient to mark the selected values with symbols or colors.

23 48 57 12 118 9 16 22

27 48 56 87 45 98 88 63

We mark the first interval in yellow and determine how many values fall in the interval from 9 to 45.4; a value of exactly 45.4, if it were present in the data, would be counted in the second interval. As a result, we get 7 enterprises in the first interval. The other intervals are processed the same way.

(Additional step) Let's calculate the total profit received by the enterprises in each interval and overall. To do this, we add up the values marked with each color.

For the first interval - 23 + 12 + 9 + 16 + 22 + 27 + 45 = 154 million rubles.

For the second interval - 48 + 57 + 48 + 56 + 63 = 272 million rubles.

For the third interval - 118 + 87 + 98 + 88 = 391 million rubles.
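The counting rule and the totals above can be verified with a short sketch. It assumes the interval boundaries 9, 45.4, 81.8 and 118.2 that follow from the rounded-up width of 36.4; the variable names are illustrative only.

```python
# Profit of the 16 enterprises (million rubles), from the condition of Example 2.
profits = [23, 48, 57, 12, 118, 9, 16, 22,
           27, 48, 56, 87, 45, 98, 88, 63]

# Interval boundaries obtained with the rounded-up width of 36.4.
bounds = [9, 45.4, 81.8, 118.2]

counts = [0] * (len(bounds) - 1)
totals = [0] * (len(bounds) - 1)

for p in profits:
    for k in range(len(counts)):
        last = (k == len(counts) - 1)
        # The lower limit is included; the upper limit is included
        # only in the last interval.
        if bounds[k] <= p < bounds[k + 1] or (last and p == bounds[k + 1]):
            counts[k] += 1
            totals[k] += p
            break

for k in range(len(counts)):
    print(f"{bounds[k]} - {bounds[k + 1]}: "
          f"{counts[k]} enterprises, {totals[k]} million rubles")
print(f"Total: {sum(counts)} enterprises, {sum(totals)} million rubles")
```

The group totals add up to 817 million rubles for all 16 enterprises, which is a convenient check that no value was lost or counted twice.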

Exercise. There are data on the size of the bank deposits of 30 depositors (thousand rubles):

150, 120, 300, 650, 1500, 900, 450, 500, 380, 440,
600, 80, 150, 180, 250, 350, 90, 470, 1100, 800,
500, 520, 480, 630, 650, 670, 220, 140, 680, 320

Construct an interval variation series of the distribution of depositors by deposit size, forming 4 groups with equal intervals. For each group, calculate the total size of the deposits.

When processing large amounts of information, which is especially relevant in modern scientific research, the researcher faces the serious task of grouping the initial data correctly. If the data are discrete, then, as we have seen, no problems arise: you just need to count the frequency of each value. If the characteristic under study is continuous (which is more common in practice), then choosing the optimal number of intervals for grouping is by no means a trivial task.

To group continuous random variables, the entire range of variation of the characteristic is split into a certain number of intervals k.

A grouped interval (continuous) variation series is the set of intervals, ranked by the value of the characteristic, listed together with the corresponding frequencies mi (the numbers of observations that fall into the i-th interval) or relative frequencies:

Characteristic value intervals
Frequency mi

The histogram and the cumulate (ogive), already discussed in detail, are excellent data visualization tools that give a first idea of the data structure. Such graphs (Fig. 1.15) are constructed for continuous data in the same way as for discrete data, taking into account that continuous data completely fill the range of their possible values and can take any value within it.

Fig. 1.15.

That is why the bars of the histogram and the cumulate must touch each other, with no gaps inside the range of possible values of the characteristic (that is, the histogram and the cumulate should not have "holes" along the abscissa into which no values of the studied variable fall, as in Fig. 1.16). The height of a bar corresponds to the frequency (the number of observations within the given interval) or to the relative frequency (the proportion of observations). Intervals must not overlap and are generally of the same width.

Fig. 1.16.

The histogram and the polygon are approximations of the probability density curve (differential function) f(x) of the theoretical distribution, studied in the course of probability theory. This is why their construction is so important in the primary statistical processing of quantitative continuous data: from their shape one can judge the hypothetical distribution law.

The cumulate is the curve of accumulated frequencies (or accumulated relative frequencies) of an interval variation series. It is compared with the graph of the cumulative distribution function F(x), also studied in the course of probability theory.

Basically, the concepts of histograms and cumulates are associated with continuous data and their interval variation series, since their graphs are empirical estimates of the probability density function and distribution function, respectively.

The construction of an interval variation series begins with determining the number of intervals k. And this task, perhaps, is the most difficult, important and controversial in the issue under study.

The number of intervals should not be too small, since in that case the histogram is oversmoothed and loses all the features of the variability of the initial data. In Fig. 1.17 (left graph) you can see the same data as in Fig. 1.15 used to build a histogram with a smaller number of intervals.

At the same time, the number of intervals should not be too large, otherwise we will not be able to estimate the distribution density of the studied data along the number axis: the histogram will be undersmoothed, uneven and with empty intervals (see Fig. 1.17, right graph).

Fig. 1.17.

How do you determine the most preferred number of intervals?

Back in 1926, Herbert Sturges proposed a formula for calculating the number of intervals into which the original set of values of the characteristic under study should be split. This formula has become extremely popular: most statistics textbooks present it, and many statistical packages use it by default. Whether this is justified in all cases is a serious question.

So what is the Sturges formula based on?

Consider the binomial distribution.

Example. The researcher is interested in the knowledge of applicants in mathematics. 10 applicants are selected and their grades in this subject are recorded. The following sample was obtained: 5; 4; 4; 3; 2; 5; 4; 3; 4; 5.

a) Present the sample in the form of a variation series;

b) build a statistical series of frequencies and relative frequencies;

c) draw a polygon of relative frequencies for the resulting series.

a) Let's rank the sample, i.e. we arrange the members of the sample in non-decreasing order. We get the variation series: 2; 3; 3; 4; 4; 4; 4; 5; 5; 5.

b) Let's construct a statistical series of frequencies (the correspondence between the sample variants and their frequencies) and a statistical series of relative frequencies (the correspondence between the sample variants and their relative frequencies)


Statistical series of frequencies:

Variant	2	3	4	5
Frequency	1	2	4	3

Statistical series of relative frequencies:

Variant	2	3	4	5
Relative frequency	0.1	0.2	0.4	0.3

Check: 1 + 2 + 4 + 3 = 10 = n; 0.1 + 0.2 + 0.4 + 0.3 = 1.
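Parts a) and b) of this example can be reproduced with a brief sketch. Exact fractions are used for the relative frequencies so that their sum is exactly 1; this choice and the variable names are assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction

# Grades of the 10 applicants, from the example above.
sample = [5, 4, 4, 3, 2, 5, 4, 3, 4, 5]
n = len(sample)

# a) Variation series: the sample in non-decreasing order.
variation_series = sorted(sample)

# b) Frequencies and relative frequencies of each variant.
frequencies = dict(sorted(Counter(sample).items()))
relative = {x: Fraction(m, n) for x, m in frequencies.items()}

for x in frequencies:
    print(x, frequencies[x], float(relative[x]))

# Checks: frequencies sum to n, relative frequencies sum to 1.
assert sum(frequencies.values()) == n
assert sum(relative.values()) == 1
```

The pairs (variant, relative frequency) are the points that would be joined by line segments to draw the polygon asked for in part c).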

Polygon of relative frequencies.