Building an interval variational distribution series online. Summary and grouping of statistical data

The results of summarizing and grouping statistical data are presented in the form of distribution series and are formatted as tables.

A distribution series is one type of grouping.

A distribution series is an ordered distribution of the units of the studied population into groups according to a certain varying attribute.

Depending on the attribute underlying the formation of a distribution series, attributive and variational distribution series are distinguished:

  • Distribution series built on qualitative attributes are called attributive.
  • Distribution series built in ascending or descending order of the values of a quantitative attribute are called variational.

A variational distribution series consists of two columns:

The first column contains the quantitative values of the varying attribute, which are called variants (options). A discrete variant is expressed as an integer. An interval variant is given as a range "from ... to". Depending on the type of variants, either a discrete or an interval variational series can be constructed.
The second column contains the number of occurrences of each variant, expressed as frequencies or relative frequencies:

Frequencies are absolute numbers showing how many times a given attribute value occurs in the population. The sum of all frequencies must equal the number of units in the entire population.

Relative frequencies are frequencies expressed as a share of the total. The sum of all relative frequencies must equal 100% when expressed as percentages, or 1 when expressed as fractions of one.

Graphical representation of distribution series

Distribution series are visualized using graphs.

Distribution series are displayed as:
  • Polygons
  • Histograms
  • Cumulates (cumulative curves)
  • Ogives

Polygon

When constructing a polygon, the values of the varying attribute are plotted on the horizontal axis (abscissa), and the frequencies or relative frequencies on the vertical axis (ordinate).

The polygon in Fig. 6.1 was built from the 1994 micro-census of the population of Russia.

Fig. 6.1. Distribution of households by size

Condition: Data are given on the distribution of 25 employees of an enterprise by tariff category:
4; 2; 4; 6; 5; 6; 4; 1; 3; 1; 2; 5; 2; 6; 3; 1; 2; 3; 4; 5; 4; 6; 2; 3; 4
Task: Build a discrete variational series and depict it graphically as a distribution polygon.
Solution:
In this example, the variant is the worker's tariff category. To determine the frequencies, count the number of employees in each tariff category.
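The counting step can be sketched in Python. This is a minimal illustration, not part of the original solution; the variable names `categories` and `series` are chosen here for the example.

```python
from collections import Counter

# Tariff categories of the 25 employees, as listed in the condition
categories = [4, 2, 4, 6, 5, 6, 4, 1, 3, 1, 2, 5, 2,
              6, 3, 1, 2, 3, 4, 5, 4, 6, 2, 3, 4]

# Count how many employees fall into each category (the frequencies)
counts = Counter(categories)

# Discrete variational series: variants in ascending order with frequencies
series = sorted(counts.items())

for category, frequency in series:
    print(f"{category}\t{frequency}")
```

Plotting the pairs in `series` as points joined by line segments would give the distribution polygon.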

The polygon is used for discrete variation series.

To build a distribution polygon (Fig. 1), the quantitative values of the varying attribute (the variants) are plotted along the abscissa (X), and the frequencies or relative frequencies along the ordinate.

If the attribute values are expressed as intervals, the series is called an interval series.
Interval distribution series are shown graphically as a histogram, cumulate or ogive.

Statistical table

Condition: Data are given on the size of the deposits of 20 individuals in one bank (thousand rubles): 60; 25; 12; 10; 68; 35; 2; 17; 51; 9; 3; 130; 24; 85; 100; 152; 6; 18; 7; 42.
Task: Build an interval variation series with equal intervals.
Solution:

  1. The initial population consists of 20 units (N = 20).
  2. Using the Sturges formula, we determine the required number of groups: n = 1 + 3.322*lg 20 ≈ 5.32, which is rounded to 5
  3. We calculate the width of the equal interval: i = (152 - 2) / 5 = 30 thousand rubles
  4. We divide the initial population into 5 groups with an interval of 30 thousand rubles.
  5. The grouping results are presented in the table:

With this way of recording a continuous attribute, when the same value occurs twice (as the upper limit of one interval and the lower limit of the next), the value is assigned to the group in which it is the upper limit.
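The steps of this solution can be sketched in Python. This is an illustrative sketch only: the variable names are chosen here, and rounding the Sturges value to the nearest integer is an assumption consistent with the result n = 5 above. A boundary value is assigned to the group in which it is the upper limit, as the rule above states.

```python
import math

# Deposits of 20 individuals, thousand rubles (from the condition)
deposits = [60, 25, 12, 10, 68, 35, 2, 17, 51, 9,
            3, 130, 24, 85, 100, 152, 6, 18, 7, 42]

# Step 2: Sturges formula, rounded to the nearest whole number of groups
n_groups = round(1 + 3.322 * math.log10(len(deposits)))

# Step 3: width of the equal interval
x_min, x_max = min(deposits), max(deposits)
width = (x_max - x_min) / n_groups

# Step 4: intervals are closed on the right, so a boundary value
# belongs to the group where it is the upper limit
frequencies = [0] * n_groups
for x in deposits:
    index = max(math.ceil((x - x_min) / width) - 1, 0)
    frequencies[index] += 1

# Step 5: print the grouping as a small table
for g, f in enumerate(frequencies):
    lower = x_min + g * width
    print(f"{lower:g} - {lower + width:g}\t{f}")
```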

Histogram

To build a histogram, the boundaries of the intervals are marked along the abscissa, and rectangles are constructed on them whose heights are proportional to the frequencies (or relative frequencies).

Fig. 6.2 shows the histogram of the distribution of the population of Russia in 1997 by age group.

Fig. 6.2. Distribution of the population of Russia by age groups

Condition: The distribution of 30 employees of a company by monthly salary is given

Task: Display the interval variation series graphically as a histogram and a cumulate.
Solution:

  1. The unknown boundary of the open (first) interval is determined from the width of the second interval: 7000 - 5000 = 2000 rubles. Using the same width, we find the lower limit of the first interval: 5000 - 2000 = 3000 rubles.
  2. To construct a histogram in a rectangular coordinate system, we mark off along the abscissa axis the segments that correspond to the intervals of the variation series.
    These segments serve as the lower bases of the rectangles, and the corresponding frequencies (relative frequencies) serve as their heights.
  3. Let's build the histogram:

To construct the cumulate, the accumulated frequencies (relative frequencies) must be calculated. They are obtained by successively summing the frequencies (relative frequencies) of the preceding intervals and are denoted by S. The accumulated frequencies show how many units of the population have an attribute value no greater than the one under consideration.

Cumulate

The distribution of an attribute in a variational series by accumulated frequencies (relative frequencies) is depicted using the cumulate.

The cumulate, or cumulative curve, unlike the polygon, is built from the accumulated frequencies or relative frequencies. The values of the attribute are placed on the abscissa axis, and the accumulated frequencies or relative frequencies on the ordinate axis (Fig. 6.3).

Fig. 6.3. Cumulative distribution of households by size

4. Calculate the accumulated frequencies:
The accumulated frequency of the first interval is calculated as 0 + 4 = 4; for the second, 4 + 12 = 16; for the third, 4 + 12 + 8 = 24; and so on.
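The successive summation can be sketched in Python. Only the first three frequencies (4, 12, 8) appear in the text, so the list below is deliberately incomplete; the remaining intervals of the original table are not reproduced here.

```python
from itertools import accumulate

# Frequencies of the first three intervals, as given in the text
frequencies = [4, 12, 8]

# Each accumulated frequency is the sum of all frequencies up to that interval
accumulated = list(accumulate(frequencies))

print(accumulated)  # [4, 16, 24]
```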

When constructing the cumulate, the accumulated frequency (relative frequency) of each interval is assigned to its upper bound:

Ogive

The ogive is constructed in the same way as the cumulate, with the only difference that the accumulated frequencies are placed on the abscissa axis and the attribute values on the ordinate axis.

A variation of the cumulate is the concentration curve, or Lorenz curve. To plot the concentration curve, both axes of the rectangular coordinate system are scaled in percent from 0 to 100. The accumulated relative frequencies are plotted on the abscissa, and the accumulated shares (in percent) of the total volume of the attribute on the ordinate.

A uniform distribution of the attribute corresponds to the diagonal of the square on the graph (Fig. 6.4). With an uneven distribution, the graph is a concave curve whose shape depends on the level of concentration of the attribute.

Fig. 6.4. Concentration curve
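The points of a concentration curve can be computed as in the following Python sketch. As an assumption made only for illustration, it reuses the 20 bank deposits from the interval series example above; the names `cum_units`, `cum_volume` and `lorenz_points` are hypothetical.

```python
from itertools import accumulate

# Deposits (thousand rubles) reused from the interval series example,
# sorted in ascending order as the concentration curve requires
deposits = sorted([60, 25, 12, 10, 68, 35, 2, 17, 51, 9,
                   3, 130, 24, 85, 100, 152, 6, 18, 7, 42])

n = len(deposits)
total = sum(deposits)

# Abscissa: accumulated share of population units, in percent
cum_units = [100 * (i + 1) / n for i in range(n)]

# Ordinate: accumulated share of the total volume of the attribute, in percent
cum_volume = [100 * s / total for s in accumulate(deposits)]

# The curve starts at the origin and ends at (100, 100)
lorenz_points = [(0.0, 0.0)] + list(zip(cum_units, cum_volume))

for x, y in lorenz_points:
    print(f"{x:6.1f}  {y:6.1f}")
```

Every point lies on or below the diagonal; the further the curve sags below it, the more concentrated the attribute is.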

An example of solving a test in mathematical statistics

Task 1

Initial data: the students of a group of 30 people took the exam in the course "Informatics". The grades received by the students form the following series of numbers:

I. Compose a variational series

Table columns: x (grade), m x (frequency), w x (relative frequency), m x accumulated, w x accumulated; the last row gives the totals.

II. Graphical representation of statistical information.

III. Numerical characteristics of the sample.

1. Arithmetic mean

2. Geometric mean

3. Mode

4. Median

2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 | 3 3 4 4 4 4 4 4 4 4 4 5 5 5 5

5. Sample variance

6. Sample standard deviation

7. Coefficient of variation

8. Asymmetry

9. Asymmetry coefficient

10. Kurtosis

11. Kurtosis coefficient
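The numerical characteristics listed above can be sketched in Python. The sample is reconstructed from the ordered series shown under "Median" (six 2s, eleven 3s, nine 4s, four 5s). The formulas are assumptions, since the original ones are not preserved: the variance divides by n, and the asymmetry and kurtosis coefficients are the standardized third and fourth central moments (with 3 subtracted for kurtosis).

```python
import math
import statistics

# Ordered sample reconstructed from the series shown under "Median"
grades = [2] * 6 + [3] * 11 + [4] * 9 + [5] * 4
n = len(grades)

mean = statistics.fmean(grades)
geo_mean = statistics.geometric_mean(grades)
mode = statistics.mode(grades)
median = statistics.median(grades)


def central_moment(data, k):
    """k-th central moment, dividing by n (assumed convention)."""
    m = sum(data) / len(data)
    return sum((x - m) ** k for x in data) / len(data)


variance = central_moment(grades, 2)
std = math.sqrt(variance)
cv_percent = std / mean * 100                      # coefficient of variation
asymmetry = central_moment(grades, 3) / std ** 3   # asymmetry coefficient
kurtosis = central_moment(grades, 4) / std ** 4 - 3  # excess kurtosis

print(f"mean={mean:.4f} geometric mean={geo_mean:.4f}")
print(f"mode={mode} median={median}")
print(f"variance={variance:.4f} std={std:.4f} CV={cv_percent:.2f}%")
print(f"asymmetry={asymmetry:.4f} kurtosis={kurtosis:.4f}")
```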

Task 2

Initial data: the students of a group wrote a final test. The group consists of 30 people. The scores obtained by the students form the following series of numbers

Solution

I. Since the attribute takes many different values, we construct an interval variation series for it. To do this, we first set the interval width h, using the Sturges formula

Let's make a scale of intervals. In this case, for the upper boundary of the first interval we will take the value determined by the formula:

The upper bounds of subsequent intervals are determined by the following recursive formula:

We finish building the scale of intervals once the upper limit of the next interval becomes greater than or equal to the maximum value of the sample.

II. Graphical display of the interval variation series

III. Numerical characteristics of the sample

To determine the numerical characteristics of the sample, we will compile an auxiliary table

Sum:

1. Arithmetic mean

2. Geometric mean

3. Mode

4. Median

10 11 12 12 13 13 13 13 14 14 14 14 15 15 15 |15 15 15 16 16 16 16 16 17 17 18 19 19 20 20

5. Sample variance

6. Sample standard deviation

7. Coefficient of variation

8. Asymmetry

9. Asymmetry coefficient

10. Kurtosis

11. Kurtosis coefficient

Task 3

Condition: the scale division of an ammeter is 0.1 A. Readings are rounded to the nearest whole division. Find the probability that the reading error exceeds 0.02 A.

Solution.

The rounding error can be regarded as a random variable X that is uniformly distributed in the interval between two adjacent whole divisions. The density of the uniform distribution is f(x) = 1/(b - a),

where
(b - a) is the length of the interval that contains the possible values of X; outside this interval f(x) = 0.
In this problem, the length of the interval containing the possible values of X is 0.1, so f(x) = 1/0.1 = 10.

The reading error will exceed 0.02 if it lies in the interval (0.02; 0.08). Then P(0.02 < X < 0.08) = 10 * (0.08 - 0.02) = 0.6.

Answer: P = 0.6
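The answer can be checked with a short Python sketch. The helper `uniform_interval_probability` is a hypothetical name introduced for this illustration, and the simulation with a fixed seed is only a sanity check of the exact value.

```python
import random


def uniform_interval_probability(a, b, lo, hi):
    """P(lo < X < hi) for X uniformly distributed on (a, b)."""
    lo, hi = max(lo, a), min(hi, b)
    return max(hi - lo, 0.0) / (b - a)


# Exact value: length of (0.02; 0.08) divided by the division length 0.1
exact = uniform_interval_probability(0.0, 0.1, 0.02, 0.08)

# Simulation check with a fixed seed
rng = random.Random(0)
trials = 100_000
hits = sum(0.02 < rng.uniform(0.0, 0.1) < 0.08 for _ in range(trials))
estimate = hits / trials

print(exact, estimate)
```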

Task 4

Initial data: the mathematical expectation and the standard deviation of a normally distributed attribute X are 10 and 2, respectively. Find the probability that, as a result of a trial, X takes a value in the interval (12, 14).

Solution.

Let's use the formula
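The formula itself is not preserved in the text. A minimal sketch, assuming the standard result for a normal distribution, P(a < X < b) = Φ((b - μ)/σ) - Φ((a - μ)/σ), where Φ is the standard normal distribution function, can be written in Python as follows; `normal_cdf` is a helper name introduced here.

```python
import math


def normal_cdf(x, mu, sigma):
    """Distribution function of a normal law with mean mu and std sigma."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))


mu, sigma = 10, 2

# P(12 < X < 14) as the difference of the distribution function values
p = normal_cdf(14, mu, sigma) - normal_cdf(12, mu, sigma)

print(round(p, 4))  # 0.1359
```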


  • Grouping is the division of a population into groups that are homogeneous with respect to some attribute.

    Purpose of the service. With the online calculator you can:

    • build a variation series, a histogram and a polygon;
    • find the indicators of variation (mean, mode (including graphically), median, range of variation, quartiles, deciles, quartile coefficient of differentiation, coefficient of variation and other indicators).

    Instructions. To group a series, select the type of variation series to be produced (discrete or interval) and specify the amount of data (number of rows). The resulting solution is saved in a Word file (see the example of grouping statistical data).


    If the grouping has already been done and a discrete or interval variation series is given, use the online calculator Variation indicators. The hypothesis about the type of distribution is tested using the service Study of the form of distribution.

    Types of statistical groupings

    Variation series. When a discrete random variable is observed, the same value may occur several times. Each such value x i of the random variable is recorded together with n i, the number of times it appears in n observations; this is the frequency of that value.
    In the case of a continuous random variable, grouping is used in practice.
    1. Typological grouping is the division of the studied qualitatively heterogeneous population into classes, socio-economic types, or homogeneous groups of units. To build this grouping, use the Discrete variational series parameter.
    2. Structural grouping is a grouping in which a homogeneous population is divided into groups that characterize its structure according to some varying attribute. To build this grouping, use the Interval series parameter.
    3. A grouping that reveals the relationships between the studied phenomena and their attributes is called an analytical grouping (see analytical grouping of series).

    Principles of building statistical groupings

    A series of observations arranged in ascending order is called a variational series. The grouping attribute is the attribute by which the population is divided into separate groups. It is called the basis of the grouping. A grouping can be based on either quantitative or qualitative attributes.
    After the basis of the grouping has been determined, the number of groups into which the studied population should be divided must be decided.

    When personal computers are used to process statistical data, the units of an object are grouped using standard procedures.
    One such procedure uses the Sturges formula to determine the optimal number of groups:

    k = 1+3.322*lg(N)

    where k is the number of groups and N is the number of population units.

    The length of the partial intervals is calculated as h = (x max - x min)/k

    Then the number of observations falling into each interval is counted and taken as the frequency n i. Sparse frequencies, with values less than 5 (n i < 5), should be merged; in that case the corresponding intervals must be merged as well.
    The midpoints of the intervals x i = (c i-1 + c i)/2 are taken as the new values.
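Merging sparse intervals and taking midpoints can be sketched in Python. The text does not say which neighbour a sparse interval should be merged with, so the left-to-right strategy below is an assumption, and `merge_sparse` is a hypothetical helper. The boundaries and frequencies are those of the deposit example (k = 5, h = 30).

```python
def merge_sparse(edges, counts, min_count=5):
    """Merge adjacent intervals until every frequency is at least min_count.

    edges  - interval boundaries c_0 < c_1 < ... < c_k
    counts - frequencies n_1 ... n_k of the k intervals
    Intervals are accumulated left to right; a sparse remainder at the
    end is merged into the last interval already formed (assumption).
    """
    new_edges = [edges[0]]
    new_counts = []
    acc = 0
    for upper, n in zip(edges[1:], counts):
        acc += n
        if acc >= min_count:
            new_edges.append(upper)
            new_counts.append(acc)
            acc = 0
    if acc > 0:
        if new_counts:
            new_counts[-1] += acc
            new_edges[-1] = edges[-1]
        else:
            new_counts.append(acc)
            new_edges.append(edges[-1])
    return new_edges, new_counts


# Boundaries and frequencies of the deposit example (thousand rubles)
edges = [2, 32, 62, 92, 122, 152]
counts = [11, 4, 2, 1, 2]

merged_edges, merged_counts = merge_sparse(edges, counts)

# Midpoints x_i = (c_{i-1} + c_i) / 2 of the merged intervals
midpoints = [(lo + hi) / 2 for lo, hi in zip(merged_edges, merged_edges[1:])]

print(merged_edges, merged_counts, midpoints)
```

With so few observations the merge is drastic (two intervals remain), which shows why the n i < 5 rule is usually applied to larger samples.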

    A discrete variational series is constructed for discrete attributes.

    To build a discrete variation series, do the following:

    1) order the units of observation in ascending order of the studied attribute value;

    2) determine all possible values of the attribute x i and sort them in ascending order;

    3) count how many times each attribute value occurs, that is, the frequency f i of each attribute value;

    4) write the data obtained into a table of two rows (columns), x i and f i.

    The sum of all frequencies of the series is equal to the number of elements in the studied population.

    Example 1 .

    List of grades obtained by students in exams: 3; 4; 3; 5; 4; 2; 2; 4; 4; 3; 5; 2; 4; 5; 4; 3; 4; 3; 3; 4; 4; 2; 2; 5; 5; 4; 5; 2; 3; 4; 4; 3; 4; 5; 2; 5; 5; 4; 3; 3; 4; 2; 4; 4; 5; 4; 3; 5; 3; 5; 4; 4; 5; 4; 4; 5; 4; 5; 5; 5.

    Here X, the grade, is a discrete random variable, and the resulting list of grades is the statistical (observed) data.

    1) order the units of observation in ascending order of the studied attribute value:

    2; 2; 2; 2; 2; 2; 2; 2; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5.

    2) determine all possible values of the attribute x i and sort them in ascending order:

    In this example, all grades can be divided into four groups with the following values: 2; 3; 4; 5.

    The value of the random variable corresponding to a separate group of observed data is called the attribute value, or variant (option), and is denoted x i.

    3) count how often each value occurs. The number that shows how many times the corresponding attribute value occurs in the series of observations is called the frequency of the attribute value and is denoted f i.

    In our example:

    grade 2 occurs 8 times,

    grade 3 occurs 12 times,

    grade 4 occurs 23 times,

    grade 5 occurs 17 times.

    There are 60 grades in total.

    4) write the data obtained into a table of two rows (columns), x i and f i.

    From these data a discrete variational series can be constructed.

    A discrete variation series is a table in which the observed values of the studied attribute are listed as separate values in ascending order together with their frequencies.
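The table for Example 1 can be produced with the following Python sketch. The columns for relative and accumulated frequencies go beyond the two-column table described here and are added only for illustration; the variable names are chosen for this example.

```python
from collections import Counter
from itertools import accumulate

# The 60 exam grades of Example 1
grades = [3, 4, 3, 5, 4, 2, 2, 4, 4, 3, 5, 2, 4, 5, 4, 3, 4, 3, 3, 4,
          4, 2, 2, 5, 5, 4, 5, 2, 3, 4, 4, 3, 4, 5, 2, 5, 5, 4, 3, 3,
          4, 2, 4, 4, 5, 4, 3, 5, 3, 5, 4, 4, 5, 4, 4, 5, 4, 5, 5, 5]

counts = Counter(grades)

# Variants x_i in ascending order and their frequencies f_i
values = sorted(counts)
freqs = [counts[v] for v in values]

n = sum(freqs)
rel_freqs = [f / n for f in freqs]   # relative frequencies
acc_freqs = list(accumulate(freqs))  # accumulated frequencies

print("x_i\tf_i\trel.\tacc.")
for x, f, w, s in zip(values, freqs, rel_freqs, acc_freqs):
    print(f"{x}\t{f}\t{w:.3f}\t{s}")
```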

    1. Construction of an interval variation series

    In addition to the discrete variational series, data are often grouped as an interval variational series.

    An interval series is built if:

      the attribute varies continuously;

      there are many discrete values (more than 10);

      the frequencies of the discrete values are very small (do not exceed 1-3 with a relatively large number of observation units);

      many discrete values of the attribute have the same frequencies.

    An interval variation series is a way of grouping data in the form of a table with two columns (the attribute values as an interval of values, and the frequency of each interval).

    Unlike in a discrete series, the attribute values of an interval series are represented not by individual values but by an interval of values ("from - to").

    The number that shows how many observation units fell into each selected interval is called the frequency of the attribute value and is denoted f i. The sum of all frequencies of the series is equal to the number of elements (observation units) in the studied population.

    If a unit has an attribute value equal to the upper limit of an interval, it should be assigned to the next interval.

    For example, a child with a height of 100 cm falls into the second interval, not the first; and a child with a height of 130 cm falls into the last interval, not the third.
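This boundary rule can be sketched in Python with `bisect_right`, which places a value equal to a boundary into the interval to its right. The boundaries 100 and 130 come from the example; the boundary 110 is inferred from the statement that the closed interval adjacent to "over 130" has a width of 20, and should be read as an assumption.

```python
from bisect import bisect_right

# Boundaries between the four height intervals (cm); 110 is inferred
boundaries = [100, 110, 130]


def interval_index(height_cm):
    """0-based index of the interval; a boundary value goes to the next one."""
    return bisect_right(boundaries, height_cm)


print(interval_index(100))  # 1: the second interval
print(interval_index(130))  # 3: the last interval
```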

    Based on these data, it is possible to construct an interval variation series.

    Each interval has a lower limit (x lower), an upper limit (x upper) and an interval width (i).

    An interval boundary is an attribute value that lies on the border of two intervals.

    Table: children's height (cm) and number of children; the last interval is "over 130".

    If an interval has both an upper and a lower bound, it is called a closed interval. If the interval has only a lower or only an upper bound, it is an open interval. Only the very first or the very last interval can be open. In the above example, the last interval is open.

    Interval width (i) is the difference between the upper and lower bounds.

    i = x upper - x lower

    The width of an open interval is assumed to be the same as the width of the adjacent closed interval.

    For the open interval "over 130", the upper bound used for calculations is 130 + 20 = 150, and the interval width is 20 (because the width of the adjacent closed interval is 20).

    All interval series are divided into interval series with equal intervals and interval series with unequal intervals. In interval series with equal intervals, the width of all intervals is the same. In interval series with unequal intervals, the widths of the intervals differ.

    This example is an interval series with unequal intervals.

    When large amounts of information are processed, which is especially relevant in modern scientific research, the researcher faces the serious task of grouping the initial data correctly. If the data are discrete, then, as we have seen, there are no problems: one only needs to count the frequency of each attribute value. If the attribute under study is continuous (which is more common in practice), then choosing the optimal number of intervals for grouping is by no means a trivial task.

    To group continuous random variables, the entire range of variation of the attribute is divided into a number of intervals k.

    A grouped interval (continuous) variational series is a set of intervals ranked by the value of the attribute and given together with the corresponding frequencies (the number of observations that fell into the i-th interval) or relative frequencies. It is written as a table with two columns: intervals of attribute values and frequency m i.

    The histogram and the cumulate (ogive), already discussed in detail, are excellent data visualization tools that give a first impression of the structure of the data. Such graphs (Fig. 1.15) are built for continuous data in the same way as for discrete data, with the one difference that continuous data completely fill the range of their possible values and can take any value in it.

    Fig. 1.15.

    That is why the columns of the histogram and the cumulate must touch one another, with no regions along the abscissa axis into which values of the attribute cannot fall (i.e., the histogram and the cumulate should not have "holes" along the abscissa axis, as in Fig. 1.16). The height of a bar corresponds to the frequency (the number of observations that fall into the given interval) or to the relative frequency (the proportion of observations). Intervals must not overlap and usually have the same width.

    Fig. 1.16.

    The histogram and the polygon are approximations of the probability density curve (differential function) f(x) of the theoretical distribution, considered in the course on probability theory. This is why constructing them is so important in the primary statistical processing of quantitative continuous data: their shape suggests the hypothetical distribution law.

    The cumulate is the curve of the accumulated frequencies (relative frequencies) of the interval variation series. The cumulate is compared with the graph of the integral distribution function F(x), also considered in the course on probability theory.

    The concepts of the histogram and the cumulate are associated mainly with continuous data and their interval variation series, since their graphs are empirical estimates of the probability density function and the distribution function, respectively.

    The construction of an interval variation series begins with determining the number of intervals k. This task is perhaps the most difficult, important and controversial part of the subject.

    The number of intervals should not be too small, because the histogram then becomes too smooth (oversmoothed) and loses all the features of the variability of the original data: Fig. 1.17 (left graph) shows a histogram built with a smaller number of intervals from the same data as the graphs in Fig. 1.15.

    At the same time, the number of intervals should not be too large, otherwise the distribution density of the data along the numerical axis cannot be estimated: the histogram turns out to be undersmoothed, uneven, with unfilled intervals (see Fig. 1.17, right graph).

    Fig. 1.17.
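The effect of the number of intervals can be seen in a small Python sketch. The data behind Fig. 1.15 and 1.17 are not available, so the 20 bank deposits from the earlier example are reused here as an assumption; `histogram_counts` is a hypothetical helper that uses left-closed intervals, with the maximum placed in the last one.

```python
def histogram_counts(data, k):
    """Frequencies of k equal-width intervals covering [min, max]."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    counts = [0] * k
    for x in data:
        # The maximum value is placed in the last interval
        index = min(int((x - lo) / width), k - 1)
        counts[index] += 1
    return counts


# Deposits (thousand rubles) reused from the interval series example
deposits = [60, 25, 12, 10, 68, 35, 2, 17, 51, 9,
            3, 130, 24, 85, 100, 152, 6, 18, 7, 42]

for k in (2, 5, 20):
    counts = histogram_counts(deposits, k)
    print(k, counts, "empty intervals:", counts.count(0))
```

With k = 2 almost all structure disappears, while k = 20 leaves several intervals empty, which corresponds to the oversmoothed and undersmoothed cases described above.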

    How to determine the most preferred number of intervals?

    Back in 1926, Herbert Sturges proposed a formula for calculating the number of intervals into which the initial set of values of the studied attribute should be divided. This formula has become extremely popular: most statistics textbooks present it, and many statistical packages use it by default. Whether this is justified in all cases is a very serious question.

    So what is the Sturges formula based on?

    Consider the binomial distribution )