The formula for the variance of a feature. Variance of a discrete random variable

The variance of a random variable is a measure of the spread of that variable, that is, of its deviations from the mathematical expectation. In statistics, the notation $\sigma^2$ (sigma squared) is often used to denote variance. The square root of the variance, $\sigma$, is called the standard deviation or standard spread. The standard deviation is measured in the same units as the random variable itself, and the variance is measured in the squares of those units.

Although it is very convenient to use a single value (such as the mean, the mode, or the median) to summarize an entire sample, this approach can easily lead to incorrect conclusions. The reason lies not in the value itself, but in the fact that a single value does not in any way reflect the spread of the data.

For example, in the sample:

the average is 5.

However, the sample itself does not contain a single item with a value of 5. You may need to know how close each item in the sample is to the mean. In other words, you need to know the variance of the values. Knowing how much the data vary, you can better interpret the mean, the median, and the mode. The degree of variation in the sample values is determined by calculating their variance and standard deviation.



The variance and its square root, called the standard deviation, characterize the average deviation from the sample mean. Of these two quantities, the standard deviation is the more important. It can be thought of as the average distance of the items from the sample mean.

The variance is difficult to interpret meaningfully. Its square root, the standard deviation, is expressed in the units of the data and is therefore easy to interpret.

The standard deviation is calculated by first determining the variance and then calculating the square root of the variance.
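The two steps can be sketched in Python as follows. This is only an illustration: the data are hypothetical, and the function name `population_variance` is not taken from the text.

```python
import math


def population_variance(values):
    """Mean of the squared deviations from the arithmetic mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


# Hypothetical data, chosen only for illustration.
data = [1, 1, 1, 1, 9, 9, 9, 9]

variance = population_variance(data)  # step 1: the variance
std_dev = math.sqrt(variance)         # step 2: its square root

print(variance, std_dev)  # 16.0 4.0
```

Here every item lies exactly 4 units from the mean of 5, so the standard deviation of 4 matches the intuitive "typical distance from the mean".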

For example, for the data array shown in Picture 1, the following values are obtained:

Picture 1

Here, the mean of the squared differences is 717.43. To get the standard deviation, all that remains is to take the square root of that number.

The result is approximately 26.78.

It should be remembered that the standard deviation is interpreted as the mean distance of the items from the sample mean.

The standard deviation shows how well the mean describes the entire sample.

Suppose you are the head of a production department that assembles PCs. The quarterly report says that 2,500 PCs were assembled in the last quarter. Is this good or bad? You ask for the report to show the standard deviation for this data (or the report already contains this column). The standard deviation turns out to be, for example, 2,000. It becomes clear to you, as the head of the department, that the production line requires better management (the deviations in the number of assembled PCs are too large).

Recall that when the standard deviation is large, the data is widely scattered about the mean, and when the standard deviation is small, they are grouped close to the mean.

The four statistical functions VAR(), VARP(), STDEV() and STDEVP() are designed to calculate the variance and standard deviation of the numbers in a range of cells. Before calculating the variance and standard deviation of a dataset, you need to determine whether the data represent the general population or a sample from the general population. For a sample from the general population, use the VAR() and STDEV() functions; for the general population, use the VARP() and STDEVP() functions:

Data                 Variance   Standard deviation
General population   VARP()     STDEVP()
Sample               VAR()      STDEV()
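The spreadsheet functions themselves are not reproduced here. As an analogous illustration, Python's standard `statistics` module makes the same distinction between a general population and a sample; the data below are hypothetical.

```python
import statistics

# Hypothetical measurements, used only to illustrate the two conventions.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# The data ARE the whole general population: the sum of squares is divided by n.
pop_var = statistics.pvariance(data)
pop_std = statistics.pstdev(data)

# The data are a SAMPLE from a larger population: the divisor is n - 1.
sample_var = statistics.variance(data)
sample_std = statistics.stdev(data)

print(pop_var, pop_std)        # population variance and standard deviation
print(sample_var, sample_std)  # the sample versions are slightly larger
```

The sample versions are always somewhat larger for the same data, because the divisor is smaller.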

The variance (like the standard deviation), as we noted, indicates how widely the values in the dataset are scattered around the arithmetic mean.

A small variance or standard deviation indicates that the data are concentrated around the arithmetic mean, while a large one indicates that the data are scattered over a wide range of values.

The variance is rather difficult to interpret meaningfully (what counts as a small value or a large value?). Task 3 lets you show the meaning of variance for a dataset visually, on a graph.

Tasks

· Task 1.

· 2.1. Define the concepts of variance and standard deviation and give their symbolic notation in statistical data processing.

· 2.2. Draw up a worksheet in accordance with Figure 1 and make the necessary calculations.

· 2.3. Provide the basic formulas used in the calculations

· 2.4. Explain all notation (,,)

· 2.5. Explain the practical meaning of variance and standard deviation.

Task 2.

1.1. Define the concepts: general population and sample; expected value and arithmetic mean; and give their symbolic notation in statistical data processing.

1.2. In accordance with Figure 2, draw up a worksheet and make calculations.

1.3. Provide the basic formulas used in the calculations (for the general population and the sample).

Picture 2

1.4. Explain why arithmetic mean values such as 46.43 and 48.78 can be obtained in samples (see the file Appendix). Draw a conclusion.

Task 3.

There are two samples with different datasets, but the mean for them will be the same:

Figure 3

3.1. Draw up a worksheet in accordance with Figure 3 and make the necessary calculations.

3.2. Give the basic calculation formulas.

3.3. Build graphs in accordance with Figures 4, 5.

3.4. Explain the resulting dependencies.

3.5. Perform similar calculations for these two samples.

Original sample: 1, 1, 1, 1, 9, 9, 9, 9

Select the values ​​of the second sample so that the arithmetic mean for the second sample is the same, for example:

Choose the values ​​for the second sample yourself. Design calculations and graphs like Figures 3, 4, 5. Show the basic formulas that were used in the calculations.

Draw the appropriate conclusions.

All tasks should be drawn up in the form of a report with all the necessary pictures, graphs, formulas and brief explanations.

Note: the construction of graphs must be explained with pictures and brief explanations.

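Variance in statistics is the mean square of the deviations of the individual values of the attribute from the arithmetic mean. Depending on the initial data, it is calculated by the formula for simple or weighted variance:

1. Simple variance (for ungrouped data) is calculated by the formula:

$$\sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n}$$

where $n$ is the number of observations.

2. Weighted variance (for the variation series):

$$\sigma^2 = \frac{\sum (x_i - \bar{x})^2 n_i}{\sum n_i}$$

where $n_i$ is the frequency (the number of times the value $x_i$ occurs)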
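A minimal Python sketch of the simple and weighted variance, assuming the usual definitions (mean squared deviation, with frequencies as weights in the grouped case). The function names and data are hypothetical.

```python
def simple_variance(values):
    """Simple variance: mean squared deviation of ungrouped values."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def weighted_variance(values, freqs):
    """Weighted variance for a variation series of (value, frequency) pairs."""
    total = sum(freqs)
    mean = sum(x * f for x, f in zip(values, freqs)) / total
    return sum((x - mean) ** 2 * f for x, f in zip(values, freqs)) / total


# Hypothetical data: the same observations, ungrouped and grouped.
ungrouped = [2, 3, 3, 5]
values, freqs = [2, 3, 5], [1, 2, 1]

print(simple_variance(ungrouped))        # 1.1875
print(weighted_variance(values, freqs))  # 1.1875
```

Both forms give the same result, since the weighted formula only counts each repeated value once and multiplies by its frequency.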

An example of finding the variance

This section works through a standard example of finding the variance.

Example 1. The following data are available for a group of 20 correspondence students. It is necessary to build an interval series of the distribution of the feature, calculate the mean value of the feature, and study its variance.

Let's build an interval grouping. The width of the interval is determined by the formula:

$$h = \frac{X_{\max} - X_{\min}}{n}$$

where $X_{\max}$ is the maximum value of the grouping attribute;
$X_{\min}$ is the minimum value of the grouping attribute;
$n$ is the number of intervals.

We take n = 5. The step is: h = (192 - 159) / 5 = 6.6

Let's compose an interval grouping

For further calculations, we will build an auxiliary table:

X'i is the midpoint of the interval (for example, the midpoint of the interval 159 - 165.6 is 162.3).
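The interval boundaries and midpoints can be reproduced with a short Python sketch. Only the minimum, the maximum and the number of intervals come from the text; the variable names are illustrative.

```python
x_min, x_max, n_intervals = 159, 192, 5  # values quoted in the text

h = (x_max - x_min) / n_intervals  # interval width (step)

# Boundaries of the intervals, rounded to one decimal to hide float noise.
edges = [round(x_min + k * h, 1) for k in range(n_intervals + 1)]
intervals = list(zip(edges[:-1], edges[1:]))

# Midpoint of each interval: half the sum of its boundaries.
midpoints = [round((lo + hi) / 2, 2) for lo, hi in intervals]

print(h)          # 6.6
print(intervals)
print(midpoints)  # the first midpoint is 162.3, as in the text
```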

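The average height of the students is determined by the formula of the weighted arithmetic mean:

$$\bar{x} = \frac{\sum x'_i n_i}{\sum n_i}$$

where $n_i$ is the frequency of the i-th interval.

Let's define the variance by the formula:

$$\sigma^2 = \frac{\sum (x'_i - \bar{x})^2 n_i}{\sum n_i}$$

The variance formula can be transformed like this:

$$\sigma^2 = \overline{x^2} - (\bar{x})^2$$

It follows from this formula that the variance is the difference between the mean of the squares of the values and the square of the mean.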
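The equivalence of the two forms can be checked numerically. In the sketch below the midpoints follow from the step h = 6.6 given in the text, while the frequencies are hypothetical, because the source table is not available.

```python
import math

# Midpoints follow from the step h = 6.6 in the text.
# The frequencies are hypothetical (the source table is not available).
midpoints = [162.3, 168.9, 175.5, 182.1, 188.7]
freqs = [2, 5, 7, 4, 2]

n = sum(freqs)
mean = sum(x * f for x, f in zip(midpoints, freqs)) / n
mean_of_squares = sum(x * x * f for x, f in zip(midpoints, freqs)) / n

# Definition: weighted mean of the squared deviations.
direct = sum((x - mean) ** 2 * f for x, f in zip(midpoints, freqs)) / n

# Transformed formula: mean of the squares minus the square of the mean.
shortcut = mean_of_squares - mean ** 2

print(direct, shortcut, math.isclose(direct, shortcut, rel_tol=1e-9))
```

The two results agree up to floating-point rounding. The transformed form subtracts two large, nearly equal numbers, so the direct form is the safer choice in code.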

In variation series with equal intervals, the variance can be calculated by the method of moments, using the second property of the variance (dividing all the values by the size of the interval). Calculating the variance by the method of moments with the following formula is less laborious:

$$\sigma^2 = i^2 (m_2 - m_1^2)$$

where i is the size of the interval;
A is the conditional zero, for which it is convenient to use the midpoint of the interval with the highest frequency;
$m_1^2$ is the square of the first-order moment, $m_1 = \frac{\sum \frac{x - A}{i} n}{\sum n}$;
$m_2$ is the second-order moment, $m_2 = \frac{\sum \left(\frac{x - A}{i}\right)^2 n}{\sum n}$
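A hedged sketch of the method of moments in Python, assuming the usual form of the formula (the square of the interval size times the second moment minus the squared first moment). The frequencies are hypothetical; the midpoints and the interval size 6.6 come from the example.

```python
import math

midpoints = [162.3, 168.9, 175.5, 182.1, 188.7]  # from the step h = 6.6
freqs = [2, 5, 7, 4, 2]                          # hypothetical frequencies

i = 6.6                                 # size of the interval
A = midpoints[freqs.index(max(freqs))]  # conditional zero: midpoint with the highest frequency
n = sum(freqs)

# Coded values (x - A) / i are the small integers -2, -1, 0, 1, 2 (up to float noise).
coded = [(x - A) / i for x in midpoints]

m1 = sum(u * f for u, f in zip(coded, freqs)) / n      # first-order moment
m2 = sum(u * u * f for u, f in zip(coded, freqs)) / n  # second-order moment

variance_moments = i ** 2 * (m2 - m1 ** 2)
mean_moments = A + i * m1  # the mean recovered from the coded values

# Direct calculation for comparison.
mean = sum(x * f for x, f in zip(midpoints, freqs)) / n
variance_direct = sum((x - mean) ** 2 * f for x, f in zip(midpoints, freqs)) / n

print(variance_moments, variance_direct)
```

The method saves labor by hand because the coded values are small whole numbers. In code the direct formula is just as easy.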

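The variance of an alternative attribute (if in a statistical population the attribute varies so that there are only two mutually exclusive options, such variability is called alternative) can be calculated by the formula:

$$\sigma^2 = p q$$

where p is the share of units that have the attribute and q is the share of units that do not. Substituting q = 1 - p into this formula, we get:

$$\sigma^2 = p (1 - p)$$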
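As a quick check, the variance of an alternative attribute can be compared with the ordinary variance of the same data coded as 1 (has the attribute) and 0 (does not). The data and the function name are hypothetical.

```python
def alternative_variance(p):
    """Variance of an alternative (two-valued) attribute with share p."""
    return p * (1 - p)


# Hypothetical population: 1 means the unit has the attribute, 0 means it does not.
units = [1, 0, 0, 0]

p = sum(units) / len(units)  # share of units with the attribute
direct = sum((u - p) ** 2 for u in units) / len(units)

print(p, alternative_variance(p), direct)  # 0.25 0.1875 0.1875
```

The variance is largest when p = 0.5 and falls to zero when all units are alike.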

Types of variance

Total variance measures the variation of a trait across the population as a whole under the influence of all the factors that cause this variation. It is equal to the mean square of the deviations of the individual values of the attribute x from the overall mean of x and can be calculated as a simple variance or as a weighted variance.

Intragroup variance characterizes random variation, i.e. the part of the variation that is due to the influence of unaccounted factors and does not depend on the factor attribute underlying the grouping. This variance is equal to the mean square of the deviations of the individual values of the trait within the group from the arithmetic mean of the group and can be calculated as a simple variance or as a weighted variance.

Thus, intragroup variance measures the variation of a feature within a group and is determined by the formula:

$$\sigma_i^2 = \frac{\sum (x - \bar{x}_i)^2}{n_i}$$

where $\bar{x}_i$ is the group mean;
$n_i$ is the number of units in the group.

For example, in a study of the influence of workers' qualifications on the level of labor productivity in a shop, the intragroup variances show the variation in output within each group caused by all possible factors (technical condition of the equipment, provision of tools and materials, age of the workers, labor intensity, etc.) except for differences in qualifications (within a group, all workers have the same qualifications).

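The average of the intragroup variances reflects the random variation, i.e. the part of the variation that occurred under the influence of all other factors, with the exception of the grouping factor. It is calculated by the formula:

$$\overline{\sigma^2} = \frac{\sum \sigma_i^2 n_i}{\sum n_i}$$

Intergroup variance characterizes the systematic variation of the resulting trait, which is due to the influence of the factor attribute underlying the grouping. It is equal to the mean square of the deviations of the group means from the overall mean. Intergroup variance is calculated using the formula:

$$\delta^2 = \frac{\sum (\bar{x}_i - \bar{x})^2 n_i}{\sum n_i}$$

where $\bar{x}_i$ is the mean of group i, $n_i$ is the number of units in it, and $\bar{x}$ is the overall mean.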

Variance addition rule in statistics

According to the variance addition rule, the total variance $\sigma^2$ is equal to the sum of the average of the intragroup variances $\overline{\sigma^2}$ and the intergroup variance $\delta^2$:

$$\sigma^2 = \overline{\sigma^2} + \delta^2$$

The meaning of this rule is that the total variance, which arises under the influence of all factors, is equal to the sum of the variance that arises under the influence of all other factors and the variance that arises due to the grouping factor.

Using the formula for adding variances, it is possible to determine the third unknown from two known variances, and also to judge the strength of the influence of the grouping attribute.
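The addition rule can be verified numerically. The sketch below uses two small hypothetical groups and Python's standard `statistics` module; the group labels and variable names are illustrative.

```python
import math
import statistics

# Hypothetical data split into two groups by some grouping attribute.
groups = {"A": [2, 4], "B": [6, 8, 10]}

all_values = [x for g in groups.values() for x in g]
n = len(all_values)
overall_mean = statistics.fmean(all_values)

# Total variance of the whole population.
total_var = statistics.pvariance(all_values)

# Average of the intragroup variances, weighted by group size.
avg_within = sum(statistics.pvariance(g) * len(g) for g in groups.values()) / n

# Intergroup variance: weighted mean square deviation of the group means.
between = sum(
    (statistics.fmean(g) - overall_mean) ** 2 * len(g) for g in groups.values()
) / n

print(total_var, avg_within, between)  # 8, about 2.0, 6.0
```

Knowing any two of the three quantities, the third follows by addition or subtraction.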

Properties of variance

1. If all the values of the attribute are reduced (increased) by the same constant, the variance does not change.
2. If all the values of the attribute are reduced (increased) by a factor of n, the variance decreases (increases) by a factor of n^2.
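Both properties are easy to confirm on a small hypothetical dataset with Python's `statistics.pvariance`:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
base = statistics.pvariance(data)

# Property 1: adding a constant to every value leaves the variance unchanged.
shifted = statistics.pvariance([x + 10 for x in data])

# Property 2: multiplying every value by 3 multiplies the variance by 3^2 = 9.
scaled = statistics.pvariance([x * 3 for x in data])

print(base, shifted, scaled)  # 4 4 36
```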

Types of variance:

Total variance characterizes the variation of the trait in the entire population under the influence of all the factors that caused this variation. This value is determined by the formula

$$\sigma^2 = \frac{\sum (x - \bar{x})^2}{n}$$

where $\bar{x}$ is the overall arithmetic mean of the entire study population.

Average intragroup variance indicates random variation that may arise under the influence of any unaccounted factors and that does not depend on the factor attribute underlying the grouping. This variance is calculated as follows: first, the variances for the individual groups ($\sigma_i^2$) are calculated, then the average intragroup variance is calculated:

$$\overline{\sigma^2} = \frac{\sum \sigma_i^2 n_i}{\sum n_i}$$

where $n_i$ is the number of units in the group

Intergroup variance (variance of the group means) characterizes systematic variation, i.e. differences in the size of the trait under study that arise under the influence of the factor attribute on which the grouping is based.

$$\delta^2 = \frac{\sum (\bar{x}_i - \bar{x})^2 n_i}{\sum n_i}$$

where $\bar{x}_i$ is the mean value for an individual group.

All three types of variance are related: the total variance is equal to the sum of the average intragroup variance and the intergroup variance:

$$\sigma^2 = \overline{\sigma^2} + \delta^2$$

Properties:

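25. Relative measures of variation

Oscillation coefficient: $V_R = \frac{R}{\bar{x}} \cdot 100\%$

Relative linear deviation: $V_{\bar{d}} = \frac{\bar{d}}{\bar{x}} \cdot 100\%$

Coefficient of variation: $V_\sigma = \frac{\sigma}{\bar{x}} \cdot 100\%$

Here R is the range of variation, $\bar{d}$ is the mean linear deviation, $\sigma$ is the standard deviation and $\bar{x}$ is the mean. The oscillation coefficient reflects the relative fluctuation of the extreme values of the attribute around the mean. The relative linear deviation characterizes the ratio of the mean absolute deviation to the mean value. The coefficient of variation is the most common measure of variability and is used to assess how typical the mean is.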

In statistics, populations with a coefficient of variation greater than 30–35% are considered to be heterogeneous.
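The three relative measures can be sketched in Python, assuming the usual definitions (the range, the mean linear deviation and the standard deviation, each divided by the mean and expressed as a percentage). The data are hypothetical.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

mean = statistics.fmean(data)
value_range = max(data) - min(data)
mean_abs_dev = sum(abs(x - mean) for x in data) / len(data)
std_dev = statistics.pstdev(data)

oscillation = value_range / mean * 100       # oscillation coefficient, %
relative_linear = mean_abs_dev / mean * 100  # relative linear deviation, %
variation = std_dev / mean * 100             # coefficient of variation, %

print(oscillation, relative_linear, variation)  # about 140, 30 and 40
```

With a coefficient of variation of about 40%, this hypothetical population would count as heterogeneous by the 30–35% threshold quoted above.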

    The regularity of the distribution series. Distribution moments. Distribution form indicators

In variation series, there is a relationship between the frequencies and the values of the varying feature: as the feature increases, the frequency first increases up to a certain limit and then decreases. Such changes are called distribution patterns.

The shape of the distribution is studied using indicators of asymmetry and kurtosis. When calculating these indicators, distribution moments are used.

The k-th order moment is the mean of the k-th powers of the deviations of the values of the attribute from some constant. The order of the moment is determined by the value of k. When analyzing variation series, one is usually limited to calculating the moments of the first four orders. When calculating moments, frequencies or relative frequencies can be used as weights. Depending on the choice of the constant, initial, conditional and central moments are distinguished.

Distribution form indicators:

Asymmetry (As) is an indicator characterizing the degree of asymmetry of the distribution.

With left-sided asymmetry the indicator is negative (As < 0); with right-sided asymmetry it is positive (As > 0).

Central moments can be used to calculate the asymmetry. Then:

$$As = \frac{\mu_3}{\sigma^3}$$

where $\mu_3$ is the central moment of the third order.

Kurtosis ($E_k$) characterizes the peakedness of the distribution curve in comparison with a normal distribution with the same degree of variation:

$$E_k = \frac{\mu_4}{\sigma^4}$$

where $\mu_4$ is the central moment of the fourth order.
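A minimal sketch of both shape indicators computed from central moments. It assumes the moment-ratio definitions (third central moment over the cubed standard deviation, fourth central moment over its fourth power, without subtracting 3, which is the convention under which a normal distribution has kurtosis 3). The function names and data are hypothetical.

```python
def central_moment(values, k):
    """k-th central moment: mean of the k-th powers of deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)


def asymmetry(values):
    """Third central moment divided by the cube of the standard deviation."""
    sigma = central_moment(values, 2) ** 0.5
    return central_moment(values, 3) / sigma ** 3


def kurtosis(values):
    """Fourth central moment divided by the squared variance (no '- 3' correction)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


symmetric = [1, 2, 3, 4, 5]     # hypothetical symmetric data
right_skewed = [1, 1, 1, 1, 6]  # hypothetical data with a long right tail

print(asymmetry(symmetric))     # 0.0
print(asymmetry(right_skewed))  # 1.5, positive: right-sided asymmetry
print(kurtosis(symmetric))      # about 1.7, flatter than a normal curve
```

Some textbooks report kurtosis minus 3, so that the normal distribution scores 0. The convention in use should be checked before comparing values.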

    Normal distribution law

For a normal distribution (Gaussian distribution), the distribution density has the following form:

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - a)^2}{2\sigma^2}}$$

where a is the expected value and $\sigma$ is the standard deviation.

The normal distribution is symmetric and is characterized by the following relationship: Xav = Me = Mo (the mean, the median and the mode coincide).

The kurtosis of the normal distribution is 3 and the skewness coefficient is 0.

The normal distribution curve is a symmetric bell-shaped line.
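The standard normal density formula can be written out directly and compared with `statistics.NormalDist` from the Python standard library. The parameter values are hypothetical, and `normal_pdf` is an illustrative name.

```python
import math
from statistics import NormalDist


def normal_pdf(x, a, sigma):
    """Density of a normal distribution with expected value a and standard deviation sigma."""
    return math.exp(-((x - a) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


a, sigma = 10.0, 2.0  # hypothetical parameters

# The curve is symmetric about a and highest at x = a.
left, peak, right = normal_pdf(9.0, a, sigma), normal_pdf(10.0, a, sigma), normal_pdf(11.0, a, sigma)

# Cross-check against the standard library implementation.
library_peak = NormalDist(mu=a, sigma=sigma).pdf(10.0)

print(left, peak, right, library_peak)
```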

    Types of variance. Variance addition rule. The essence of the empirical coefficient of determination.

If the initial population is divided into groups according to some essential feature, then the following types of variances are calculated:

    Total variance of the original population:

$$\sigma^2 = \frac{\sum (x - \bar{x})^2 f}{\sum f}$$

where $\bar{x}$ is the overall mean of the original population; f are the frequencies of the original population. The total variance characterizes the deviation of the individual values of a trait from the overall mean of the original population.

    Intragroup variances:

$$\sigma_j^2 = \frac{\sum (x - \bar{x}_j)^2 f_j}{\sum f_j}$$

where j is the number of the group; $\bar{x}_j$ is the mean value in the j-th group; $f_j$ are the frequencies of the j-th group. Intragroup variances characterize the deviation of the individual values of the trait in each group from the group mean. The average of all the intragroup variances is calculated by the formula: $\overline{\sigma^2} = \frac{\sum \sigma_j^2 n_j}{\sum n_j}$, where $n_j$ is the number of units in the j-th group.

    Intergroup variance:

$$\delta^2 = \frac{\sum (\bar{x}_j - \bar{x})^2 n_j}{\sum n_j}$$

Intergroup variance characterizes the deviation of the group means from the overall mean of the original population.

The variance addition rule states that the total variance of the original population is equal to the sum of the intergroup variance and the average of the intragroup variances:

$$\sigma^2 = \delta^2 + \overline{\sigma^2}$$

The empirical coefficient of determination shows the proportion of the variation of the studied trait that is due to the variation of the grouping trait, and is calculated by the formula:

$$\eta^2 = \frac{\delta^2}{\sigma^2}$$
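A short sketch of the empirical coefficient of determination, assuming its usual definition as the ratio of the intergroup variance to the total variance. The groups are hypothetical.

```python
import statistics

# Hypothetical values of the studied trait, split by a grouping trait.
groups = [[2, 4], [6, 8, 10]]

all_values = [x for g in groups for x in g]
overall_mean = statistics.fmean(all_values)

total_var = statistics.pvariance(all_values)
between_var = sum(
    len(g) * (statistics.fmean(g) - overall_mean) ** 2 for g in groups
) / len(all_values)

# Share of the total variation explained by the grouping trait.
eta_squared = between_var / total_var

print(eta_squared)  # 0.75
```

A value of 0.75 would mean that three quarters of the variation is associated with the grouping trait and one quarter with all other factors.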

    Method of counting from a conditional zero (method of moments) for calculating the mean and variance

The calculation of the variance by the method of moments is based on properties 3 and 4 of the variance.

(3. If all the values of the attribute are increased (decreased) by some constant A, the variance of the new population does not change.

4. If all the values of the attribute are increased (decreased) by a factor of K, where K is a constant, the variance of the new population increases (decreases) by a factor of K^2.)

We obtain the formula for calculating the variance in variational series with equal intervals by the method of moments:

A is the conditional zero, equal to the value with the maximum frequency (the midpoint of the interval with the maximum frequency)

The calculation of the mean by the method of moments is also based on the use of the properties of the mean.

    The concept of sample observation. Stages of the study of economic phenomena by the sampling method

Sample observation is observation in which not all units of the original population are examined and studied, but only a part of them, while the result of surveying that part is extended to the entire original population. The population from which the units are selected for further examination and study is called the general population, and all indicators characterizing it are called general.

The possible limits of the deviation of the sample mean from the general mean are called the sampling error.

The set of selected units is called the sample, and all indicators characterizing it are called sample indicators.

The sample study includes the following stages:

Characteristics of the research object (mass economic phenomena). If the general population is small, then sampling is not recommended, a continuous survey is necessary;

Sample size calculation. It is important to determine the optimal volume that will allow obtaining sampling error within the acceptable range at the lowest cost;

Selection of observation units, taking into account the requirements of randomness, proportionality.

Proof of representativeness based on an estimate of the sampling error. For a random sample, the error is calculated using formulas. For a purposive sample, representativeness is assessed using qualitative methods (comparison, experiment);

Sample analysis. If the formed sample meets the requirements of representativeness, then it is analyzed using analytical indicators (average, relative, etc.)

According to the sample survey, the depositors were grouped according to the size of their deposit in the city's Sberbank:

Define:

1) the range of variation;

2) the average size of the deposit;

3) average linear deviation;

4) variance;

5) standard deviation;

6) coefficient of variation of the deposits.

Solution:

This distribution series contains open intervals. In such series, the width of the interval of the first group is conventionally assumed to be equal to the width of the interval of the next one, and the width of the interval of the last group equal to the width of the interval of the previous one.

The width of the interval of the second group is 200, so the width of the first group is also 200. The width of the interval of the penultimate group is 200, which means that the last interval also has a width of 200.

1) Let's define the range of variation as the difference between the largest and the smallest value of the feature:

$$R = x_{\max} - x_{\min} = 1200 - 200 = 1000$$

The range of variation in the size of the deposit is 1000 rubles.

2) The average size of the deposit is determined by the formula of the weighted arithmetic mean.

Let us first determine a discrete value of the attribute in each interval. To do this, using the formula for the simple arithmetic mean, we find the midpoints of the intervals.

The midpoint of the first interval is:

$$\frac{200 + 400}{2} = 300$$

of the second, 500, etc.

Let's enter the results of the calculations into the table:

Deposit amount, rub.   Number of depositors, f   Midpoint of the interval, x   xf
200-400                32                        300                           9600
400-600                56                        500                           28000
600-800                120                       700                           84000
800-1000               104                       900                           93600
1000-1200              88                        1100                          96800
Total                  400                       -                             312000

The average size of a deposit in the city's Sberbank is 780 rubles:

$$\bar{x} = \frac{\sum x f}{\sum f} = \frac{312000}{400} = 780$$
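The range, the midpoints and the weighted mean can be reproduced in Python from the table above. The intervals and frequencies are taken from the text; the variable names are illustrative.

```python
# Intervals (rub.) and numbers of depositors from the table in the text.
intervals = [(200, 400), (400, 600), (600, 800), (800, 1000), (1000, 1200)]
freqs = [32, 56, 120, 104, 88]

value_range = intervals[-1][1] - intervals[0][0]     # 1200 - 200
midpoints = [(lo + hi) / 2 for lo, hi in intervals]  # 300, 500, ...

total_xf = sum(x * f for x, f in zip(midpoints, freqs))
mean_deposit = total_xf / sum(freqs)

print(value_range, total_xf, mean_deposit)  # 1000 312000.0 780.0
```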

3) The average linear deviation is the arithmetic mean of the absolute deviations of the individual values of the attribute from the overall mean:

$$\bar{d} = \frac{\sum |x - \bar{x}| f}{\sum f}$$

The procedure for calculating the mean linear deviation in an interval distribution series is as follows:

1. Calculate the weighted arithmetic mean, as shown in item 2).

2. Determine the absolute deviations of the values from the mean: $|x - \bar{x}|$

3. Multiply the resulting deviations by the frequencies: $|x - \bar{x}| f$

4. Find the sum of the weighted deviations without taking the sign into account: $\sum |x - \bar{x}| f$

5. Divide the sum of the weighted deviations by the sum of the frequencies: $\frac{\sum |x - \bar{x}| f}{\sum f}$

It is convenient to use a table of calculated data:

Deposit amount, rub.   f     x      x - mean   |x - mean|   |x - mean| f
200-400                32    300    -480       480          15360
400-600                56    500    -280       280          15680
600-800                120   700    -80        80           9600
800-1000               104   900    120        120          12480
1000-1200              88    1100   320        320          28160
Total                  400   -      -          -            81280

$$\bar{d} = \frac{81280}{400} = 203.2$$

The average linear deviation of the size of the deposit of Sberbank clients is 203.2 rubles.
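The five steps of the procedure map directly onto a few lines of Python. The midpoints and frequencies come from the table; the variable names are illustrative.

```python
import math

midpoints = [300, 500, 700, 900, 1100]  # interval midpoints from the table
freqs = [32, 56, 120, 104, 88]          # numbers of depositors

n = sum(freqs)
mean = sum(x * f for x, f in zip(midpoints, freqs)) / n  # step 1: 780.0

# Steps 2 and 3: absolute deviations, multiplied by the frequencies.
weighted_abs = [abs(x - mean) * f for x, f in zip(midpoints, freqs)]

# Steps 4 and 5: sum the weighted deviations and divide by the sum of frequencies.
mean_linear_deviation = sum(weighted_abs) / n

print(weighted_abs)           # [15360.0, 15680.0, 9600.0, 12480.0, 28160.0]
print(mean_linear_deviation)  # 203.2
```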

4) The variance is the arithmetic mean of the squares of the deviations of each value of the feature from the arithmetic mean.

The variance in an interval distribution series is calculated by the formula:

$$\sigma^2 = \frac{\sum (x - \bar{x})^2 f}{\sum f}$$

The procedure for calculating the variance in this case is as follows:

1. Determine the weighted arithmetic mean, as shown in item 2).

2. Find the deviations of the values from the mean: $x - \bar{x}$

3. Square the deviation of each value from the mean: $(x - \bar{x})^2$

4. Multiply the squared deviations by the weights (frequencies): $(x - \bar{x})^2 f$

5. Sum the resulting products: $\sum (x - \bar{x})^2 f$

6. Divide the resulting sum by the sum of the weights (frequencies): $\frac{\sum (x - \bar{x})^2 f}{\sum f}$

Let's set out the calculations in the table:

Deposit amount, rub.   f     x      x - mean   (x - mean)^2   (x - mean)^2 f
200-400                32    300    -480       230400         7372800
400-600                56    500    -280       78400          4390400
600-800                120   700    -80        6400           768000
800-1000               104   900    120        14400          1497600
1000-1200              88    1100   320        102400         9011200
Total                  400   -      -          -              23040000
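The source breaks off at the table, so the remaining items of the problem (variance, standard deviation and coefficient of variation) are sketched here in Python. The figures follow from the table totals; the coefficient of variation assumes the usual definition, the standard deviation divided by the mean.

```python
import math

midpoints = [300, 500, 700, 900, 1100]  # interval midpoints from the table
freqs = [32, 56, 120, 104, 88]          # numbers of depositors

n = sum(freqs)
mean = sum(x * f for x, f in zip(midpoints, freqs)) / n

# Squared deviations weighted by frequency: the last column of the table.
weighted_sq = [(x - mean) ** 2 * f for x, f in zip(midpoints, freqs)]

variance = sum(weighted_sq) / n  # 23040000 / 400
std_dev = math.sqrt(variance)    # standard deviation, rub.
cv = std_dev / mean * 100        # coefficient of variation, %

print(variance, std_dev, round(cv, 2))  # 57600.0 240.0 30.77
```

On these figures the variance is 57600, the standard deviation is 240 rubles, and the coefficient of variation is about 30.8%.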

Variance in statistics is defined as the mean square of the deviations of the individual values of a feature from the arithmetic mean. The usual method is to calculate the squared deviations of the values from the mean and then average them.

In economic and statistical analysis, the variation of a feature is usually assessed using the standard deviation, which is the square root of the variance:

$$\sigma = \sqrt{\sigma^2} \qquad (3)$$

It characterizes the absolute variability of the values of the varying attribute and is expressed in the same units of measurement as the values themselves. In statistics, it is often necessary to compare the variation of various features. For such comparisons, a relative measure of variation, the coefficient of variation, is used.

Properties of variance:

1) if you subtract any number from all the values, the variance does not change;

2) if all the values are divided by some number b, the variance decreases by a factor of b^2, i.e.

$$\sigma^2_{x/b} = \frac{\sigma^2_x}{b^2}$$

3) if you calculate the mean square of the deviations from any number c that is not equal to the arithmetic mean, it will be greater than the variance, and by a well-defined amount: the square of the difference between the mean and c:

$$\frac{\sum (x - c)^2}{n} = \sigma^2 + (\bar{x} - c)^2$$

Variance can be defined as the difference between the mean of the squares and the square of the mean.
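The third property can be checked on a small hypothetical dataset: the mean square of the deviations from an arbitrary number c exceeds the variance by exactly the square of the difference between the mean and c.

```python
data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
c = 3                            # any number other than the mean

n = len(data)
mean = sum(data) / n                               # 5.0
variance = sum((x - mean) ** 2 for x in data) / n  # 4.0

# Mean square of deviations measured from c instead of from the mean.
mean_sq_from_c = sum((x - c) ** 2 for x in data) / n

print(mean_sq_from_c, variance + (mean - c) ** 2)  # 8.0 8.0
```

This is why the arithmetic mean is the reference point that gives the smallest mean square of deviations.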

17. Group and intergroup variance. Variance addition rule

If the statistical population is divided into groups or parts according to the studied attribute, then the following types of variance can be calculated for such a population: group (partial), average group (average partial), and intergroup.

Total variance reflects the variation of a feature due to all the conditions and causes operating in a given statistical population.

Group variance is equal to the mean square of the deviations of the individual values of a feature within a group from the arithmetic mean of this group, called the group mean. The group mean does not, in general, coincide with the overall mean for the entire population.

Group variance reflects the variation of a trait due only to the conditions and causes operating within the group.

Average group variance is defined as the weighted arithmetic mean of the group variances, where the weights are the sizes of the groups.

Intergroup variance is equal to the mean square of the deviations of the group means from the overall mean.

Intergroup variance characterizes the variation of the resulting trait due to the grouping trait.

There is a definite relationship between these types of variance: the total variance is equal to the sum of the average group variance and the intergroup variance.

This relationship is called the variance addition rule.

18. Time series and their constituent elements. Types of time series.

Series in statistics are numerical data showing the change of a phenomenon in time or in space and making it possible to compare phenomena statistically, both in the process of their development in time and across different forms and types of processes. This makes it possible to discover the mutual dependence of phenomena.

The development of social phenomena over time is usually called dynamics in statistics. To display the dynamics, time series (also called chronological or dynamic series) are constructed: series of time-varying values of a statistical indicator (for example, the number of convicts over 10 years), arranged in chronological order. Their constituent elements are the numerical values of the indicator and the periods or points in time to which they relate.

The most important characteristic of a time series is the size (volume, magnitude) of a phenomenon achieved in a certain period or by a certain moment. Accordingly, the size of a member of the time series is called its level. The initial, middle and final levels of a time series are distinguished. The initial level is the value of the first member of the series, and the final level is the value of the last. The average level is the chronological average of the series and is calculated depending on whether the time series is an interval series or a moment series.

Another important characteristic of a time series is the time elapsed from the initial to the final observation, or the number of such observations.

There are various types of time series; they can be classified according to the following criteria.

1) Depending on the way the levels are expressed, time series are subdivided into series of absolute and derived indicators (relative and average values).

2) Depending on whether the levels of the series express the state of the phenomenon at certain points in time (at the beginning of a month, quarter, year, etc.) or its value over certain time intervals (for example, per day, month, year, etc.), moment and interval time series are distinguished, respectively. Moment series are used relatively rarely in the analytical work of law enforcement agencies.

In the theory of statistics, time series are also distinguished by a number of other classification features: depending on the distance between the levels, with equally spaced and unequally spaced levels in time; depending on the presence of a main trend in the studied process, stationary and non-stationary. When analyzing time series, the levels of the series are represented as the sum of the following components:

$$Y_t = TP + E(t)$$

where TP is the deterministic component that determines the general tendency of change over time, or the trend;

E(t) is the random component that causes fluctuations in the levels.
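As an illustration of this decomposition, the sketch below splits a short hypothetical series into a trend and a random component. The text does not say how the trend is estimated; a least-squares straight line is assumed here purely for the example.

```python
import math

# Hypothetical yearly levels of some indicator.
levels = [10, 13, 12, 15, 16]
t = list(range(len(levels)))

# Assumption: the trend TP is a straight line fitted by least squares.
t_mean = sum(t) / len(t)
y_mean = sum(levels) / len(levels)
slope = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, levels)) / sum(
    (ti - t_mean) ** 2 for ti in t
)
intercept = y_mean - slope * t_mean

trend = [intercept + slope * ti for ti in t]               # deterministic component TP
random_part = [yi - tp for yi, tp in zip(levels, trend)]   # random component E(t)

print(slope, intercept)
print(random_part)
```

By construction each level equals its trend value plus its random component, and with a least-squares line the random components sum to zero.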