What is a discrete series? Building a statistical distribution series

The results of grouping are presented in the form of distribution series and are laid out as tables.

A distribution series is one type of grouping.

A distribution series is an ordered distribution of the units of the studied population into groups according to a certain varying attribute.

Depending on the attribute underlying the formation of a distribution series, attributive and variational distribution series are distinguished:

Attributive series are distribution series built on qualitative attributes.
Variational series are distribution series constructed in ascending or descending order of the values of a quantitative attribute.

A variational distribution series consists of two columns:

The first column contains the quantitative values of the varying attribute, which are called variants. A discrete variant is expressed as an integer. An interval variant is given as a range "from ... to ...". Depending on the type of variants, a discrete or an interval variational series can be constructed.
The second column contains the number of occurrences of each variant, expressed as frequencies or relative frequencies:

Frequencies are absolute numbers showing how many times a given value of the attribute occurs in the population. The sum of all frequencies must equal the number of units in the entire population.

Relative frequencies are frequencies expressed as a share of the total. Their sum must equal 100% when expressed as percentages, or 1 when expressed as fractions of one.

Graphical representation of distribution series

Distribution series are visualized graphically.

They are displayed as:

Polygons
Histograms
Cumulates
Ogives

Polygon

When constructing a polygon, the values of the varying attribute are plotted on the horizontal axis (abscissa), and the frequencies or relative frequencies on the vertical axis (ordinate).

The polygon in Fig. 6.1 was built from the 1994 microcensus of the population of Russia.

Fig. 6.1. Distribution of households by size

Condition: Data are given on the distribution of 25 employees of an enterprise by tariff category:
4; 2; 4; 6; 5; 6; 4; 1; 3; 1; 2; 5; 2; 6; 3; 1; 2; 3; 4; 5; 4; 6; 2; 3; 4
Task: Build a discrete variational series and depict it graphically as a distribution polygon.
Solution:
In this example, the variant is the tariff category of the employee. To determine the frequencies, count the number of employees in each tariff category.

The polygon is used for discrete variation series.

To build a distribution polygon (Fig. 1), we plot the quantitative values of the varying attribute (the variants) along the abscissa (X) axis, and the frequencies or relative frequencies along the ordinate axis.
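
The solution for the 25 employees can be sketched in Python. This is a minimal illustration and not part of the original text: the names `categories`, `series` and `polygon_points` are assumptions, and the polygon is represented only by its vertex coordinates instead of being drawn.

```python
from collections import Counter

# Tariff categories of the 25 employees, as given in the condition.
categories = [4, 2, 4, 6, 5, 6, 4, 1, 3, 1, 2, 5, 2,
              6, 3, 1, 2, 3, 4, 5, 4, 6, 2, 3, 4]

# Discrete variational series: variant -> frequency, ordered by variant.
series = dict(sorted(Counter(categories).items()))

# Vertices of the distribution polygon: (variant, frequency).
polygon_points = list(series.items())

for variant, frequency in polygon_points:
    print(f"category {variant}: {frequency} employees")
```

Connecting the points in `polygon_points` with straight segments gives the polygon; the frequencies add up to 25, the size of the population.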

If the attribute values are expressed as intervals, the series is called an interval series.
Interval distribution series are shown graphically as a histogram, cumulate or ogive.

Statistical table

Condition: Data are given on the size of the deposits of 20 individuals in one bank (thousand rubles): 60; 25; 12; 10; 68; 35; 2; 17; 51; 9; 3; 130; 24; 85; 100; 152; 6; 18; 7; 42.
Task: Build an interval variational series with equal intervals.
Solution:

The initial population consists of 20 units (N = 20).
Using the Sturges formula, we determine the required number of groups: n = 1 + 3.322*lg 20 ≈ 5.3, which we round to 5.
Let's calculate the width of an equal interval: i = (152 - 2) / 5 = 30 thousand rubles.
We divide the initial population into 5 groups with an interval of 30 thousand rubles.
The grouping results are presented in the table:

With this way of recording a continuous attribute, the same value can occur twice: as the upper limit of one interval and the lower limit of the next. Such a value is assigned to the group in which it is the upper limit.
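
The steps of this solution can be reproduced with a short Python sketch. It is an illustration under stated assumptions: the Sturges result is rounded to the nearest whole number (the text only shows the result 5), the variable names are hypothetical, and the boundary rule is the one described above (a shared boundary value goes to the group where it is the upper limit).

```python
import math

deposits = [60, 25, 12, 10, 68, 35, 2, 17, 51, 9,
            3, 130, 24, 85, 100, 152, 6, 18, 7, 42]  # thousand rubles

N = len(deposits)
# Sturges formula, rounded to the nearest whole number of groups (assumption).
n_groups = round(1 + 3.322 * math.log10(N))
# Width of an equal interval.
width = (max(deposits) - min(deposits)) / n_groups

lower = min(deposits)
bounds = [lower + k * width for k in range(n_groups + 1)]

# A value equal to a shared boundary goes to the group where it is the
# upper limit; the minimum itself belongs to the first group.
frequencies = [0] * n_groups
for x in deposits:
    for g in range(n_groups):
        in_first = g == 0 and x == bounds[0]
        if in_first or bounds[g] < x <= bounds[g + 1]:
            frequencies[g] += 1
            break

for g in range(n_groups):
    print(f"{bounds[g]:g}-{bounds[g + 1]:g}: {frequencies[g]}")
```

With these data the five groups 2-32, 32-62, 62-92, 92-122 and 122-152 receive 11, 4, 2, 1 and 2 deposits, which add up to 20.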

Histogram

To build a histogram, the interval boundaries are marked along the abscissa and rectangles are constructed on them, with heights proportional to the frequencies (or relative frequencies).

Fig. 6.2 shows a histogram of the distribution of the population of Russia in 1997 by age group.

Fig. 6.2. Distribution of the population of Russia by age groups

Condition: The distribution of 30 employees of a company by monthly salary is given.

Task: Display the interval variational series graphically as a histogram and a cumulate.
Solution:

The unknown boundary of the open (first) interval is determined from the width of the second interval: 7000 - 5000 = 2000 rubles. Using the same width, we find the lower limit of the first interval: 5000 - 2000 = 3000 rubles.
To construct a histogram in a rectangular coordinate system, we mark off segments along the abscissa axis corresponding to the intervals of the variational series.
These segments serve as the bases of the rectangles, and the corresponding frequency (or relative frequency) serves as their height.
Let's build the histogram:

To construct the cumulate, we need the accumulated frequencies (or relative frequencies). They are obtained by successively summing the frequencies of the preceding intervals and are denoted by S. An accumulated frequency shows how many units of the population have an attribute value no greater than the one under consideration.

Cumulate

The distribution of an attribute in a variational series by accumulated frequencies (or relative frequencies) is depicted with a cumulate.

The cumulate, or cumulative curve, unlike the polygon, is built from accumulated frequencies or relative frequencies. The attribute values are placed on the abscissa axis, and the accumulated frequencies or relative frequencies on the ordinate axis (Fig. 6.3).

Fig. 6.3. Cumulative distribution of households by size

Calculate the accumulated frequencies:
The accumulated frequency of the first interval is 0 + 4 = 4; of the second, 4 + 12 = 16; of the third, 4 + 12 + 8 = 24; and so on.

When constructing the cumulate, the accumulated frequency (or relative frequency) of each interval is assigned to its upper bound.
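
The running sums can be sketched with `itertools.accumulate`. Only the three frequencies quoted in the text (4, 12, 8) are used, because the full salary table is not reproduced here. The upper bounds 5000 and 7000 come from the solution; 9000 is an assumption that the third interval has the same 2000-ruble width.

```python
from itertools import accumulate

# Frequencies of the first three intervals quoted in the text.
frequencies = [4, 12, 8]
# Upper bounds in rubles: 5000 and 7000 come from the text; 9000 is an
# assumption (same 2000-ruble width for the third interval).
upper_bounds = [5000, 7000, 9000]

# Accumulated frequencies S: running sums of the frequencies.
cumulative = list(accumulate(frequencies))

# Each accumulated frequency is plotted against the upper bound of its interval.
cumulate_points = list(zip(upper_bounds, cumulative))
print(cumulate_points)  # [(5000, 4), (7000, 16), (9000, 24)]
```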

Ogive

An ogive is constructed in the same way as a cumulate, the only difference being that the accumulated frequencies are placed on the abscissa axis and the attribute values on the ordinate axis.

A variety of the cumulate is the concentration curve, or Lorenz curve. To plot it, both axes of a rectangular coordinate system are scaled in percent from 0 to 100. The abscissa shows the accumulated relative frequencies, and the ordinate shows the accumulated share (in percent) of the total volume of the attribute.

A uniform distribution of the attribute corresponds to the diagonal of the square on the graph (Fig. 6.4). With an uneven distribution, the graph is a concave curve whose curvature depends on the level of concentration of the attribute.

Fig. 6.4. Concentration curve
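
The coordinates of a concentration curve can be computed as in the sketch below. The grouped data are hypothetical (they do not come from the text) and are assumed to be ordered from the group with the smallest attribute value per unit to the largest; the helper name `cumulative_percent` is likewise an assumption.

```python
from itertools import accumulate

# Hypothetical grouped data (not from the text): number of units in each
# group and the total volume of the attribute held by that group.
units = [50, 30, 20]
volume = [20, 30, 50]

def cumulative_percent(values):
    """Return accumulated shares of the total, in percent."""
    total = sum(values)
    return [100 * s / total for s in accumulate(values)]

x = [0.0] + cumulative_percent(units)    # accumulated relative frequencies, %
y = [0.0] + cumulative_percent(volume)   # accumulated share of the attribute, %

lorenz_points = list(zip(x, y))
# Under a uniform distribution every point would satisfy y == x.
below_diagonal = all(yi <= xi for xi, yi in lorenz_points)
print(lorenz_points, below_diagonal)
```

In this example the first 50% of units hold 20% of the attribute and the first 80% hold 50%, so the curve lies below the diagonal.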

What the grouping of statistical data is, and how it relates to distribution series, was covered in a separate lecture, which also explains what discrete and variational distribution series are.

Distribution series are one variety of statistical series (time series are the other kind used in statistics); they are used to analyze data on social phenomena. Constructing a variational series is a feasible task for anyone, but there are rules to remember.

How to build a discrete variational distribution series

Example 1. Data are available on the number of children in 20 surveyed families. Construct a discrete variational series of the distribution of families by number of children.

0 1 2 3 1
2 1 2 1 0
4 3 2 1 1
1 0 1 0 2

Solution:

Let's start with the layout of the table into which we will then enter the data. Since a distribution series has two elements, the table will consist of two columns. The first column is always the variant, i.e. what we are studying. We take its name from the end of the task statement, "by number of children", so our variant is the number of children.

The second column is the frequency, i.e. how often each variant occurs in the phenomenon under study. We also take the column name from the task, "distribution of families", so our frequency is the number of families with the corresponding number of children.

Now, from the initial data, we select the values that occur at least once. In our case, these are 0, 1, 2, 3 and 4.

We arrange these values in the first column of the table in a logical order, in this case increasing from 0 to 4.

Finally, let's count how many times each value of the variant occurs.

As a result, we obtain the completed table, i.e. the required distribution series of families by number of children: 4 families with no children, 8 with one child, 5 with two, 2 with three and 1 with four.
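
The same steps (select the distinct values, order them, count occurrences) can be followed literally in Python. This is a minimal sketch; the names `children`, `variants` and `table` are illustrative, not from the source.

```python
# Number of children in the 20 surveyed families (Example 1).
children = [0, 1, 2, 3, 1,
            2, 1, 2, 1, 0,
            4, 3, 2, 1, 1,
            1, 0, 1, 0, 2]

# Step 1: values that occur at least once, in increasing order.
variants = sorted(set(children))

# Step 2: how many times each variant occurs.
table = [(v, children.count(v)) for v in variants]

print("Number of children | Number of families")
for v, f in table:
    print(f"{v:>18} | {f}")
print(f"{'Total':>18} | {sum(f for _, f in table)}")
```

Calling `list.count` once per variant mirrors the manual procedure; for large data sets a single pass with `collections.Counter` would be the more usual choice.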

Exercise. Data are available on the tariff categories of 30 workers of an enterprise. Construct a discrete variational series of the distribution of workers by tariff category.

2 3 2 4 4 5 5 4 6 3
1 4 4 5 5 6 4 3 2 3
4 5 4 5 5 6 6 3 3 4

How to build an interval variation series of distribution

Let's build an interval distribution series, and see how its construction differs from a discrete series.

Example 2. Data are available on the profit received by 16 enterprises, million rubles: 23 48 57 12 118 9 16 22 27 48 56 87 45 98 88 63. Construct an interval variational series of the distribution of enterprises by profit, with 3 groups at equal intervals.

The general principle of constructing the series is the same: the same two columns, the same variants and frequencies. In this case, however, the variants are intervals, and the frequencies are counted differently.

Solution:

As in the previous task, let's start by building the table layout into which we will then enter the data. Since a distribution series has two elements, the table will consist of two columns. The first column is always the variant, i.e. what we are studying. We take its name from the end of the task statement, "by profit", so our variant is the amount of profit received.

The second column is the frequency, i.e. how often each variant occurs in the phenomenon under study. We also take the column name from the task, "distribution of enterprises", so our frequency is the number of enterprises whose profit falls into the corresponding interval.

As a result, the layout of our table will look like this:

To build the intervals, we first calculate the width of an equal interval by the formula

i = (Xmax - Xmin) / n,

where i is the width (length) of the interval,

Xmax and Xmin are the maximum and minimum values of the attribute,

n is the required number of groups according to the condition of the problem.

Let's calculate the interval width for our example. To do this, we find the largest and smallest values among the initial data.

23 48 57 12 118 9 16 22 27 48 56 87 45 98 88 63: the maximum value is 118 million rubles, and the minimum is 9 million rubles. Substituting into the formula gives i = (118 - 9) / 3 = 36.333... million rubles.

The result is 36.(3), i.e. 36.3 with the 3 repeating. In such situations the interval width must be rounded up, so that the maximum value is not lost once the intervals are built. That is why the interval width used in the calculation is 36.4 million rubles.

Now let's build the intervals, which are the variants in this problem. The first interval starts from the minimum value; adding the interval width gives its upper limit. The upper limit of the first interval then becomes the lower limit of the second; adding the interval width again gives the second interval. This is repeated as many times as needed to build the required number of intervals.

Note that if we had rounded the interval width down to 36.3 instead of up to 36.4, the upper limit of the last interval would be 117.9, and the maximum value of 118 would fall outside it. This is why the interval width must be rounded up.
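
The effect of rounding the interval width up versus down can be checked with a short sketch. Rounding up to one decimal place with `math.ceil` is an assumption chosen to match the 36.4 used in the text; the helper name `bounds_for` is hypothetical.

```python
import math

profits = [23, 48, 57, 12, 118, 9, 16, 22,
           27, 48, 56, 87, 45, 98, 88, 63]  # million rubles
n_groups = 3

raw_width = (max(profits) - min(profits)) / n_groups   # 36.333...
# Round UP to one decimal place so the maximum is not lost.
width = math.ceil(raw_width * 10) / 10

def bounds_for(w):
    """Interval boundaries starting from the minimum, rounded to 0.1."""
    return [round(min(profits) + k * w, 1) for k in range(n_groups + 1)]

print(bounds_for(width))   # width rounded up
print(bounds_for(36.3))    # width rounded down: last bound falls short of 118
```

With the width rounded up, the last boundary is 118.2 and covers the maximum of 118; with 36.3 it is 117.9 and the maximum would be left out.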

Let's count the number of enterprises that fall into each interval. When processing the data, remember that the lower limit of an interval is included in that interval, while the upper limit is not: a value equal to the upper limit is counted in the next interval. The only exception is the last interval, which includes its upper limit.

When processing the data, it helps to mark the selected values with symbols or colors.

23 48 57 12 118 9 16 22

27 48 56 87 45 98 88 63

We mark the first interval in yellow and count how many values fall into the interval from 9 to 45.4; a value of exactly 45.4, if it were present in the data, would be counted in the second interval. As a result, we get 7 enterprises in the first interval. Proceeding in the same way, we get 5 enterprises in the second interval and 4 in the third.

(Additional step) Let's calculate the total profit received by the enterprises in each interval and overall. To do this, we add up the values marked with each color.

For the first interval: 23 + 12 + 9 + 16 + 22 + 27 + 45 = 154 million rubles.

For the second interval: 48 + 57 + 48 + 56 + 63 = 272 million rubles.

For the third interval: 118 + 87 + 98 + 88 = 391 million rubles.

Overall: 154 + 272 + 391 = 817 million rubles.
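
The counting rule and the group totals can be verified with the sketch below. The boundaries are 9 + k * 36.4 for k = 0..3, following the interval width of 36.4 used in the text; the names `groups`, `counts` and `totals` are illustrative.

```python
profits = [23, 48, 57, 12, 118, 9, 16, 22,
           27, 48, 56, 87, 45, 98, 88, 63]  # million rubles
bounds = [9.0, 45.4, 81.8, 118.2]           # 9 + k * 36.4, k = 0..3

groups = [[] for _ in range(len(bounds) - 1)]
last = len(groups) - 1
for x in profits:
    for g in range(len(groups)):
        lo, hi = bounds[g], bounds[g + 1]
        # Lower limit included, upper limit excluded,
        # except for the last interval, which includes both.
        if lo <= x < hi or (g == last and x == hi):
            groups[g].append(x)
            break

counts = [len(g) for g in groups]
totals = [sum(g) for g in groups]
print(counts, totals, sum(totals))
```

The counts come out as 7, 5 and 4 enterprises, and the totals as 154, 272 and 391 million rubles, matching the manual calculation.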

Exercise. Data are available on the size of the bank deposits of 30 depositors, thousand rubles.

150, 120, 300, 650, 1500, 900, 450, 500, 380, 440,
600, 80, 150, 180, 250, 350, 90, 470, 1100, 800,
500, 520, 480, 630, 650, 670, 220, 140, 680, 320

Build an interval variational series of the distribution of depositors by deposit size, with 4 groups at equal intervals. Calculate the total amount of deposits for each group.

When processing large amounts of information, which is especially relevant in modern research, the researcher faces the serious task of grouping the initial data correctly. If the data are discrete, then, as we have seen, there is no problem: one simply counts the frequency of each value. If the attribute under study is continuous (which is more common in practice), choosing the optimal number of intervals for grouping is by no means a trivial task.

To group continuous random variables, the entire range of variation of the attribute is divided into a number of intervals k.

A grouped interval (continuous) variational series is a set of intervals ranked by the value of the attribute and given together with the corresponding frequencies m_i (the number of observations that fell into the i-th interval) or relative frequencies:

Attribute value intervals
Frequency m_i

The histogram and the cumulate (ogive), already discussed in detail above, are excellent data visualization tools that give a first impression of the data structure. Such graphs (Fig. 1.15) are built for continuous data in the same way as for discrete data, except that continuous data completely fill the range of their possible values and can take any value in it.

Fig. 1.15.

That is why the bars of the histogram and the cumulate must touch one another, with no gaps inside the range of possible values (i.e., the histogram and cumulate should not have "holes" along the abscissa axis into which no values of the variable fall, as in Fig. 1.16). The height of a bar corresponds to the frequency (the number of observations that fall into the given interval) or to the relative frequency (the proportion of observations). Intervals must not overlap and are usually of the same width.

Fig. 1.16.

The histogram and the polygon are approximations of the probability density curve (differential function) f(x) of the theoretical distribution, studied in probability theory. This is why constructing them matters so much in the primary statistical processing of quantitative continuous data: their shape suggests a hypothesis about the distribution law.

The cumulate is the curve of the accumulated frequencies (or relative frequencies) of an interval variational series. It is compared with the graph of the integral distribution function F(x), also studied in probability theory.

The concepts of histogram and cumulate are associated mainly with continuous data and their interval variational series, since these graphs are empirical estimates of the probability density function and the distribution function, respectively.

The construction of an interval variational series begins with determining the number of intervals k. This is perhaps the most difficult, important and controversial part of the problem.

The number of intervals should not be too small, otherwise the histogram will be oversmoothed and lose all the features of the variability of the initial data. Fig. 1.17 (left graph) shows a histogram with fewer intervals built from the same data as the graphs in Fig. 1.15.

At the same time, the number of intervals should not be too large, otherwise we will not be able to estimate the distribution density of the data along the numerical axis: the histogram will be undersmoothed, uneven, and will contain empty intervals (see Fig. 1.17, right graph).

Fig. 1.17.

How to determine the most preferred number of intervals?

Back in 1926, Herbert Sturges proposed a formula for calculating the number of intervals into which the initial set of values of the studied attribute should be divided. The formula has become extremely popular: most statistics textbooks present it, and many statistical packages use it by default. Whether this is justified in all cases is a serious question.

So what is the Sturges formula based on?

Consider the binomial distribution.
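
One common way to motivate the formula, offered here as a sketch and not necessarily the author's own derivation, idealizes a histogram with k intervals whose frequencies are binomial coefficients. The symbols k (number of intervals) and N (sample size) follow their use elsewhere in the text; the index j is introduced only for this illustration.

```latex
% Idealized histogram: interval j (j = 0, ..., k-1) has frequency C(k-1, j).
\begin{align}
N &= \sum_{j=0}^{k-1} \binom{k-1}{j} = 2^{\,k-1}, \\
k &= 1 + \log_2 N = 1 + \frac{\lg N}{\lg 2} \approx 1 + 3.322\,\lg N .
\end{align}
```

For N = 20 this gives k ≈ 5.3, i.e. 5 intervals, as in the bank deposit example. The argument presupposes data that are roughly bell-shaped, which is one reason the formula is questioned for other kinds of data.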

Example. The researcher is interested in the knowledge of applicants in mathematics. 10 applicants are selected and their school grades in this subject are recorded. The following sample was received: 5;4;4;3;2;5;4;3;4;5.

a) Present the sample as a variation series;

b) build a statistical series of frequencies and relative frequencies;

c) draw a polygon of relative frequencies for the resulting series.

a) Let's rank the sample, i.e. arrange its members in non-decreasing order. We get the variation series: 2; 3; 3; 4; 4; 4; 4; 5; 5; 5.

b) We construct a statistical series of frequencies (the correspondence between the sample variants and their frequencies) and a statistical series of relative frequencies (the correspondence between the sample variants and their relative frequencies).


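Statistical series of frequencies:
Grade	2	3	4	5
Frequency	1	2	4	3

Statistical series of relative frequencies:
Grade	2	3	4	5
Relative frequency	0.1	0.2	0.4	0.3

Check: 1 + 2 + 4 + 3 = 10 = n; 0.1 + 0.2 + 0.4 + 0.3 = 1.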
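
Parts a) to c) of this example can be checked with the sketch below. Using `fractions.Fraction` for the relative frequencies is a design choice made here, not something the source prescribes: it lets the sum be verified as exactly 1 without floating-point error. All variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction

grades = [5, 4, 4, 3, 2, 5, 4, 3, 4, 5]

# a) Ranked sample (variation series).
ranked = sorted(grades)

# b) Frequencies and relative frequencies of each grade.
n = len(grades)
freq = dict(sorted(Counter(grades).items()))
rel_freq = {x: Fraction(m, n) for x, m in freq.items()}

# Exact check: the relative frequencies sum to one.
assert sum(rel_freq.values()) == 1

# c) Vertices of the polygon of relative frequencies.
polygon = [(x, float(w)) for x, w in rel_freq.items()]
print(ranked, freq, polygon)
```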

Polygon of relative frequencies.