Formula for the coefficient of variation of a random variable. Calculating the coefficient of variation

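The square root of the variance is called the standard deviation from the mean. For a series with frequencies it is calculated as follows:

$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2 f_i}{\sum f_i}}$

An elementary algebraic transformation of the standard deviation formula brings it to the following form:

$\sigma = \sqrt{\overline{x^2} - \bar{x}^2}$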

This formula is often more convenient in the practice of calculations.
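The equivalence of the two forms (the defining formula and the transformed one, the square root of the mean of the squares minus the square of the mean) can be checked numerically. The sketch below is only an illustration: the data are hypothetical and the function names are not from the source.

```python
import math

def std_definition(values):
    """Standard deviation from the defining formula (population form)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)

def std_shortcut(values):
    """Same quantity as sqrt(mean of squares - square of mean)."""
    n = len(values)
    mean = sum(values) / n
    mean_of_squares = sum(x * x for x in values) / n
    return math.sqrt(mean_of_squares - mean ** 2)

ages = [17, 18, 19, 20, 21]  # hypothetical ungrouped data
print(std_definition(ages), std_shortcut(ages))
```

Both functions return the same value up to rounding; the second form needs only the running sums of x and x², which is why it is convenient for manual calculation.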

The standard deviation, like the mean linear deviation, shows how much, on average, the specific values of the attribute deviate from their mean. The standard deviation is never smaller than the mean linear deviation. For distributions close to normal, the following approximate ratio holds between them:

$\sigma \approx 1.25\,\bar{d}$

Knowing this ratio, it is possible to determine an unknown indicator from a known one, for example, to calculate σ from $\bar{d}$ and vice versa. The standard deviation measures the absolute size of the variability of the attribute and is expressed in the same units of measurement as the values of the attribute (rubles, tons, years, etc.). It is an absolute measure of variation.

For alternative attributes (for example, the presence or absence of higher education or of insurance), the variance and standard deviation formulas are:

$\sigma^2 = pq, \quad \sigma = \sqrt{pq}$

where p is the share of units that have the attribute and q = 1 - p is the share of units that do not.
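As a quick numerical illustration for an alternative (yes/no) attribute, the sketch below assumes the standard result that the variance equals the product of the share p of units with the attribute and the share q = 1 - p without it. The share of 20% is a hypothetical figure.

```python
import math

def alternative_attribute_variation(p):
    """Return (variance, standard deviation) for a yes/no attribute with share p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a share between 0 and 1")
    q = 1.0 - p
    variance = p * q
    return variance, math.sqrt(variance)

# Hypothetical example: 20% of the units have higher education.
variance, sigma = alternative_attribute_variation(0.2)
print(variance, sigma)
```

The variance is largest at p = 0.5, where it equals 0.25.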

Let us show the calculation of the standard deviation according to the data of a discrete series characterizing the distribution of students of one of the faculties of the university by age (Table 6.2).

Table 6.2.

The results of auxiliary calculations are given in columns 2-5 of Table 6.2.

The average age of a student, in years, is determined by the weighted arithmetic mean formula (column 2):

The squared deviations of each individual student age from the mean are contained in columns 3-4, and the products of the squared deviations and the corresponding frequencies are in column 5.

The variance of the students' age, in years squared, is found by formula (6.2):

Then σ = √3.43 ≈ 1.85 years, i.e. on average each specific value of a student's age deviates from the mean by 1.85 years.
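The same sequence of steps (weighted mean, squared deviations multiplied by frequencies, variance, square root) can be sketched in Python. The ages and frequencies below are hypothetical and are not the data of Table 6.2, so the results differ from the 3.43 and 1.85 quoted in the text.

```python
import math

def weighted_mean(values, freqs):
    """Weighted arithmetic mean of a discrete series."""
    return sum(x * f for x, f in zip(values, freqs)) / sum(freqs)

def weighted_variance(values, freqs):
    """Mean squared deviation from the weighted mean."""
    mean = weighted_mean(values, freqs)
    return sum((x - mean) ** 2 * f for x, f in zip(values, freqs)) / sum(freqs)

ages = [17, 18, 19, 20, 21]      # hypothetical ages, years
counts = [10, 20, 40, 20, 10]    # hypothetical numbers of students

mean_age = weighted_mean(ages, counts)
variance_age = weighted_variance(ages, counts)
sigma_age = math.sqrt(variance_age)
print(mean_age, variance_age, round(sigma_age, 2))
```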

The coefficient of variation

In absolute terms, the standard deviation depends not only on the degree of variation of the attribute, but also on the absolute levels of the variants and of the mean. Therefore, the standard deviations of variation series with different mean levels cannot be compared directly. To make such a comparison possible, one finds the share of the average deviation (linear or quadratic) in the arithmetic mean, expressed as a percentage, i.e. one calculates relative measures of variation.

The linear coefficient of variation is calculated by the formula

$V_{\bar{d}} = \frac{\bar{d}}{\bar{x}} \cdot 100\%$

The coefficient of variation is determined by the following formula:

$V = \frac{\sigma}{\bar{x}} \cdot 100\%$

The coefficients of variation eliminate not only the incomparability associated with different units of measurement of the studied attribute, but also the incomparability arising from differences in the value of the arithmetic means. In addition, the indicators of variation characterize the homogeneity of the population. The population is considered homogeneous if the coefficient of variation does not exceed 33%.

According to Table 6.2 and the above calculation results, we determine the coefficient of variation, in %, by formula (6.3):

If the coefficient of variation exceeds 33%, this indicates the heterogeneity of the studied population. The value obtained in our case indicates that the population of students is homogeneous in terms of age. Thus, an important function of generalizing indicators of variation is the assessment of the reliability of the mean. The smaller $\bar{d}$, $\sigma^2$ and V, the more homogeneous the set of phenomena and the more reliable the obtained mean.

According to the "three sigma rule" of mathematical statistics, in series that are normally distributed or close to normal, deviations from the arithmetic mean not exceeding ±3σ occur in 997 cases out of 1000. Thus, knowing $\bar{x}$ and σ, you can get a general initial idea of the variation series. If, for example, the average wage of an employee in a company is 25,000 rubles and σ equals 100 rubles, then with a probability close to certainty it can be argued that the wages of the company's employees fluctuate within 25,000 ± 3 × 100, i.e. from 24,700 to 25,300 rubles.
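The wage example can be reproduced in a few lines. The sketch below also computes the share of a normal distribution lying within three standard deviations of the mean, which is where the figure of about 997 cases out of 1000 comes from. The function name is illustrative.

```python
import math

def three_sigma_interval(mean, sigma):
    """Bounds mean - 3*sigma and mean + 3*sigma."""
    return mean - 3 * sigma, mean + 3 * sigma

low, high = three_sigma_interval(25_000, 100)
print(low, high)  # 24700 25300

# Share of a normal distribution within +/- 3 sigma of the mean.
coverage = math.erf(3 / math.sqrt(2))
print(round(coverage, 4))  # 0.9973
```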

Researchers often encounter variability of the studied attribute across individual units of a population, that is, its fluctuation around a certain value, or its variation. This must be taken into account in order to obtain reliable information about the course of a particular study.

When determining the interval over which the value of a parameter changes, most researchers resort to absolute and relative indicators. Among the latter, the coefficient of variation is the most widespread; if the studied quantity follows a normal distribution, it serves as a criterion for the homogeneity of the population. This indicator shows how widely the values of the studied parameter are dispersed, regardless of the scale and unit of measurement.

The coefficient of variation is calculated by dividing the standard deviation by the arithmetic mean of the variable and expressing the result as a percentage. The result can range from zero to infinity, increasing as the variation of the attribute increases. If the obtained value is less than 33.3%, the variation of the attribute is weak; if it is greater, the variation is strong. In the latter case the studied data set is heterogeneous, and its mean is recognized as atypical and therefore cannot serve as a generalizing indicator. For such a population it is worth applying other indicators.
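A minimal sketch of this calculation in Python, assuming the population form of the standard deviation and a strictly positive mean. The 33% threshold is the rule of thumb quoted in the text, and both data sets are hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """Standard deviation divided by the mean, in percent."""
    mean = statistics.fmean(values)
    if mean <= 0:
        raise ValueError("the coefficient is only meaningful for a positive mean")
    return statistics.pstdev(values) / mean * 100.0

def is_homogeneous(values, threshold=33.0):
    """Rule of thumb: homogeneous if the coefficient does not exceed the threshold."""
    return coefficient_of_variation(values) <= threshold

tight = [10, 12, 11, 13, 14]   # hypothetical, weak variation
spread = [1, 2, 3, 20, 50]     # hypothetical, strong variation
print(round(coefficient_of_variation(tight), 1), is_homogeneous(tight))
print(round(coefficient_of_variation(spread), 1), is_homogeneous(spread))
```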

It should be noted that the coefficient of variation not only characterizes the homogeneity of a population, but is also used for comparative assessment. For example, it is used when fluctuations of an attribute need to be compared across populations whose mean values differ. In this case the absolute scatter of the data does not allow an objective assessment. The coefficient of variation characterizes the relative variability of a variable and can therefore serve as a relative measure of fluctuations in the value of the studied parameter.

However, there are some limitations here. In particular, the degree of fluctuation of the parameter values can be assessed only for a specific attribute and for a population of a certain composition. Equal values of the coefficient may indicate strong variation in one case and weak variation in another. This happens if the attributes are different or the studies are carried out on different populations. Such a result arises for objective reasons, and this should be taken into account when processing the experimental data.

The coefficient of variation is widely used in various fields of science and technology. In particular, it is actively used to assess fluctuations of parameters in economics and sociology. However, the coefficient cannot be applied when assessing the variability of variables that can change sign. In that case the calculation yields incorrect values: the mean may be close to zero, which makes the coefficient disproportionately large, or the coefficient takes a negative sign. In the latter case, it is worth checking the correctness of the calculations performed.

Thus, the coefficient of variation is a parameter that allows you to estimate the degree of variation and the variability relative to the mean. Using this indicator helps to identify the most significant factors, and focusing on them helps to achieve the goals and solve the tasks at hand.

INTRODUCTION

The methodological guidelines for practical and laboratory work in statistics contain the requirements for completing the work and the procedure for performing the calculations manually and with MS Excel and the Statistica software package.

Part II of the guidelines covers the calculation of the indicators of variation: the range of variation, quartiles and quartile deviation, the mean linear deviation, variance and standard deviation, the coefficients of oscillation, variation, asymmetry, kurtosis and others.

The calculation of the indicators of variation, along with the construction of interval and discrete variation series and the calculation of average values presented in Part I of the guidelines, is of great importance for the analysis of distribution series.

CALCULATION OF VARIATION INDICATORS

Purpose of the work: obtaining practical skills in calculating various indicators (measures) of variation depending on the tasks set by the research.

Work order:

Determine the type and form (simple or weighted) of the indicators of variation.

Formulate conclusions.

An example of calculating the indicators of variation

Determination of the type and form of indicators of variation.

Variation indicators are divided into two groups: absolute and relative. The absolute ones include the range of variation, quartile deviation, mean linear deviation, variance and standard deviation. The relative indicators are the coefficient of oscillation, the coefficient of variation, the relative linear deviation, etc.

The range of variation (R) is the simplest measure of the variation of an attribute and is determined by the following formula:

$R = x_{\max} - x_{\min}$

where $x_{\max}$ is the largest value of the varying attribute;

$x_{\min}$ is the smallest value of the varying attribute.

Quartile deviation (Q) is used to characterize the variation of an attribute in the population. It can be used in place of the range of variation to avoid the disadvantages of relying on extreme values.

Quartiles are the values of the attribute in a ranked distribution series chosen so that 25% of the population units are smaller than $Q_1$; 25% of the units lie between $Q_1$ and $Q_2$; 25% of the units lie between $Q_2$ and $Q_3$; and the remaining 25% exceed $Q_3$.

$Q_1 = x_{Q_1} + i \cdot \frac{\frac{1}{4}\sum f - S_{Q_1 - 1}}{f_{Q_1}}$

where $x_{Q_1}$ is the lower boundary of the interval in which the first quartile is located;

i is the width of that interval;

$S_{Q_1 - 1}$ is the sum of the accumulated frequencies of the intervals preceding the interval in which the first quartile is located;

$f_{Q_1}$ is the frequency of the interval in which the first quartile is located.

$Q_2 = Me$

where Me is the median of the series;

$Q_3 = x_{Q_3} + i \cdot \frac{\frac{3}{4}\sum f - S_{Q_3 - 1}}{f_{Q_3}}$

the notation is the same as for $Q_1$.

In symmetric or moderately asymmetric distributions, $Q \approx \frac{2}{3}\sigma$. Since the quartile deviation is not affected by the deviations of all values of the attribute, its use should be limited to cases when determining the standard deviation is difficult or impossible.
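As a rough illustration, quartiles of an interval series can be located by linear interpolation inside the interval that contains the quartile: the lower boundary plus the interval width times the share of that interval's frequency still needed to reach 25%, 50% or 75% of the total. In the sketch below the intervals and frequencies are hypothetical, equal interval widths are assumed, and `interval_quantile` is an illustrative name rather than anything defined in the guidelines.

```python
def interval_quantile(lower_bounds, width, freqs, share):
    """Interpolated quantile of an interval series (share = 0.25, 0.5, 0.75, ...)."""
    target = share * sum(freqs)
    cumulative = 0  # accumulated frequency of the preceding intervals
    for lower, f in zip(lower_bounds, freqs):
        if f > 0 and cumulative + f >= target:
            return lower + width * (target - cumulative) / f
        cumulative += f
    raise ValueError("share must lie between 0 and 1")

# Hypothetical series: intervals 10-20, 20-30, 30-40, 40-50.
lower_bounds = [10, 20, 30, 40]
freqs = [10, 30, 40, 20]

q1 = interval_quantile(lower_bounds, 10, freqs, 0.25)
median = interval_quantile(lower_bounds, 10, freqs, 0.50)
q3 = interval_quantile(lower_bounds, 10, freqs, 0.75)
quartile_deviation = (q3 - q1) / 2
print(q1, median, q3, quartile_deviation)
```

The quartile deviation is taken here as half the distance between the third and first quartiles.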

The mean linear deviation ($\bar{d}$) is the average of the absolute deviations of the attribute values from their mean. It can be calculated using the arithmetic mean formula, either unweighted or weighted, depending on the absence or presence of frequencies in the distribution series.

$\bar{d} = \frac{\sum |x_i - \bar{x}|}{n}$ (6) - unweighted mean linear deviation,

$\bar{d} = \frac{\sum |x_i - \bar{x}| f_i}{\sum f_i}$ (7) - weighted mean linear deviation.

Variance ($\sigma^2$) is the mean square of the deviations of individual values of the attribute from their mean. The variance is calculated using unweighted and weighted formulas.

$\sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n}$ (8) - unweighted,

$\sigma^2 = \frac{\sum (x_i - \bar{x})^2 f_i}{\sum f_i}$ (9) - weighted.

The standard deviation (σ), the most common measure of variation, is the square root of the variance: $\sigma = \sqrt{\sigma^2}$.

The range of variation, quartile deviation, mean linear deviation and standard deviation are dimensioned quantities: they are expressed in the same units as the attribute being averaged.

To compare the variability of different attributes in the same population, or of the same attribute in several populations, relative indicators of variation are calculated. The basis for comparison is the arithmetic mean. Most often, relative indicators are expressed as a percentage; they provide not only a comparative assessment of variation but also a characteristic of the homogeneity of the population.

The oscillation coefficient is calculated by the formula:

$V_R = \frac{R}{\bar{x}} \cdot 100\%$

Relative linear deviation (linear coefficient of variation), the ratio of the mean linear deviation to the arithmetic mean, $V_{\bar{d}} = \frac{\bar{d}}{\bar{x}} \cdot 100\%$:

(13) or (14)

The coefficient of variation:

$V = \frac{\sigma}{\bar{x}} \cdot 100\%$

The most commonly used indicator of relative variability in statistics is the coefficient of variation. It is used not only for the comparative assessment of variation, but also as a characteristic of the homogeneity of the population. The population is considered homogeneous if the coefficient of variation does not exceed 33% (Efimova M.R., Ryabtsev V.M. General theory of statistics: Textbook. M.: Finance and statistics, 1991, p. 105).
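The absolute and relative indicators described above can be collected in one small helper. This is only a sketch for ungrouped data, using the unweighted formulas and the population standard deviation. The data and the dictionary keys are illustrative assumptions.

```python
import statistics

def variation_indicators(values):
    """Range, mean linear deviation, standard deviation and their relative forms (%)."""
    mean = statistics.fmean(values)
    value_range = max(values) - min(values)
    mean_linear = sum(abs(x - mean) for x in values) / len(values)
    std = statistics.pstdev(values)
    return {
        "range": value_range,
        "mean_linear_deviation": mean_linear,
        "std": std,
        "oscillation_pct": value_range / mean * 100.0,
        "relative_linear_pct": mean_linear / mean * 100.0,
        "variation_pct": std / mean * 100.0,
    }

sample = [10, 12, 11, 13, 14]  # hypothetical data
indicators = variation_indicators(sample)
for name, value in indicators.items():
    print(name, round(value, 2))
```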

To obtain an approximate idea of the shape of the distribution, distribution graphs are built (polygon and histogram).

In the practice of statistical research, one meets a variety of distributions. When studying homogeneous populations we deal, as a rule, with unimodal distributions. Multiple peaks indicate the heterogeneity of the studied population; the appearance of two or more peaks indicates the need to regroup the data in order to identify more homogeneous groups. Establishing the general nature of the distribution involves assessing the degree of its homogeneity, as well as calculating the indicators of asymmetry and kurtosis.

A distribution is symmetrical if the frequencies of any two variants equally spaced on both sides of the distribution center are equal to each other. For symmetric distributions, the arithmetic mean, mode and median are equal to each other. For this reason the simplest indicator of asymmetry is based on the relationship between the indicators of the distribution center: the greater the difference between them, the greater the asymmetry of the series.

For a comparative analysis of the degree of asymmetry of several distributions, the relative indicator As is calculated:

$As = \frac{\bar{x} - Mo}{\sigma}$

The value of As can be positive or negative. A positive value indicates right-sided asymmetry (the right branch is more elongated relative to the maximum ordinate than the left). With right-sided asymmetry, the indicators of the distribution center are related as follows: $Mo < Me < \bar{x}$. A negative value indicates left-sided asymmetry (Figure 1). In this case the indicators of the distribution center are related as follows: $Mo > Me > \bar{x}$.

Figure 1. Distribution: 1 - with right-sided asymmetry; 2 - with left-sided asymmetry.

Another indicator, proposed by the Swedish mathematician Lindberg, is calculated using the formula:

$As = P - 50$

where P is the percentage of those values of the attribute that exceed the arithmetic mean.

The most accurate and widespread indicator is based on the central moment of the third order (in a symmetric distribution its value is zero):

$As = \frac{\mu_3}{\sigma^3}$

where $\mu_3$ is the central moment of the third order:

$\mu_3 = \frac{\sum (x_i - \bar{x})^3}{n}$ (19) - for ungrouped data;

$\mu_3 = \frac{\sum (x_i - \bar{x})^3 f_i}{\sum f_i}$ (20) - for grouped data.

σ is the standard deviation.

This indicator makes it possible not only to determine the amount of asymmetry, but also to answer the question of whether asymmetry is present in the distribution of the attribute in the general population. The significance of the indicator is assessed using its root-mean-square error, which depends on the number of observations n and is calculated by the formula:

$\sigma_{As} = \sqrt{\frac{6(n-1)}{(n+1)(n+3)}}$

If the ratio $|As| / \sigma_{As} > 3$, the asymmetry is significant and the distribution of the attribute in the general population is not symmetrical. If $|As| / \sigma_{As} < 3$, the asymmetry is insignificant and its presence can be explained by the influence of random circumstances.
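The moment-based asymmetry indicator and its significance check can be sketched as follows. The error formula used, sqrt(6(n-1)/((n+1)(n+3))), and the threshold of 3 are the commonly quoted textbook versions and are stated here as assumptions. The data are hypothetical.

```python
import math

def central_moment(values, order):
    """Central moment of the given order for ungrouped data."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** order for x in values) / n

def asymmetry(values):
    """Third central moment divided by the cube of the standard deviation."""
    sigma = math.sqrt(central_moment(values, 2))
    return central_moment(values, 3) / sigma ** 3

def asymmetry_error(n):
    """Root-mean-square error of the asymmetry indicator (assumed formula)."""
    return math.sqrt(6 * (n - 1) / ((n + 1) * (n + 3)))

skewed = [1, 1, 1, 2, 10]  # hypothetical right-skewed data
a_s = asymmetry(skewed)
ratio = abs(a_s) / asymmetry_error(len(skewed))
print(round(a_s, 3), round(ratio, 3), "significant" if ratio > 3 else "insignificant")
```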

For symmetric distributions, the indicator of kurtosis (peakedness) is calculated. Lindberg proposed the following indicator for assessing kurtosis:

$Ex = P - 38.29$

where P is the proportion (%) of the number of variants lying in an interval equal to half of the standard deviation on either side of the arithmetic mean.

The most accurate indicator uses the central moment of the fourth order:

$Ex = \frac{\mu_4}{\sigma^4} - 3$

where $\mu_4$ is the central moment of the fourth order:

$\mu_4 = \frac{\sum (x_i - \bar{x})^4}{n}$ (24) - for ungrouped data;

$\mu_4 = \frac{\sum (x_i - \bar{x})^4 f_i}{\sum f_i}$ (25) - for grouped data.

Figure 2 shows two distributions: one is peaked (the kurtosis is positive), the other is flat-topped (the kurtosis is negative). Kurtosis is the displacement of the peak of the empirical distribution above or below the peak of the normal distribution curve. In a normal distribution, the ratio $\mu_4 / \sigma^4 = 3$.

Figure 2. Distribution: 1, 4 - normal; 2 - peaked; 3 - flat-topped

The root-mean-square error of kurtosis is calculated by the formula:

$\sigma_{Ex} = \sqrt{\frac{24\,n\,(n-2)(n-3)}{(n+1)^2 (n+3)(n+5)}}$

where n is the number of observations.

If $|Ex| / \sigma_{Ex} > 3$, the kurtosis is significant; if $|Ex| / \sigma_{Ex} < 3$, it is insignificant.

Assessing the significance of the indicators of asymmetry and kurtosis allows us to conclude whether the empirical distribution under study can be attributed to the normal type.
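A matching sketch for kurtosis, again for ungrouped data. It assumes the excess form (fourth central moment over the fourth power of the standard deviation, minus 3) and the commonly quoted error formula sqrt(24n(n-2)(n-3)/((n+1)²(n+3)(n+5))). The data are hypothetical.

```python
import math

def kurtosis_excess(values):
    """Fourth central moment over sigma^4, minus 3 (zero for a normal curve)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3

def kurtosis_error(n):
    """Root-mean-square error of the kurtosis indicator (assumed formula)."""
    return math.sqrt(24 * n * (n - 2) * (n - 3) / ((n + 1) ** 2 * (n + 3) * (n + 5)))

flat = [1, 2, 3, 4, 5]  # hypothetical, evenly spread data
ex = kurtosis_excess(flat)
ratio = abs(ex) / kurtosis_error(len(flat))
print(round(ex, 2), round(ratio, 2), "significant" if ratio > 3 else "insignificant")
```

A negative result corresponds to a flat-topped distribution, a positive one to a peaked distribution.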

Consider the methodology for calculating the indicators of variation.

Table 1. Data on the volume of currency sales of several branches of the Central Bank.

Determine the average sales of foreign currency for a set of branches, calculate the absolute and relative indicators of variation.

Let's calculate the range of variation:

$R = x_{\max} - x_{\min}$ = 24.3 - 10.2 = 14.1 million rubles.

To determine the deviations of the attribute values \u200b\u200bfrom the mean and their squares, we build an auxiliary table:

Table 2. Calculation table

We find the average value using the simple arithmetic mean formula:

Average linear deviation:

Variance:

Oscillation coefficient:

The coefficient of variation:

To calculate the indicators of the distribution form, we build an auxiliary table:

Table 3. Calculation table

Table 4. Data on the turnover of enterprises in one of the industries.

Determine the average turnover, the structural averages, and the absolute and relative indicators of variation, and assess how closely the actual distribution agrees with the normal one (using the indicators of the distribution form).

To calculate the indicators, let's build an auxiliary table.

Table 5. Calculation table

Range of variation: R = 40 million rubles.

We find the average value by the weighted arithmetic mean formula: $\bar{x}$ = 30 million rubles.

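In an interval distribution series, the mode is determined by the formula:

$Mo = x_{Mo} + i \cdot \frac{f_{Mo} - f_{Mo-1}}{(f_{Mo} - f_{Mo-1}) + (f_{Mo} - f_{Mo+1})}$

where $x_{Mo}$ is the lower boundary of the modal interval, i is its width, and $f_{Mo}$, $f_{Mo-1}$ and $f_{Mo+1}$ are the frequencies of the modal, preceding and following intervals.

In our case, the mode is 31.4 million rubles.

In an interval variation series, the median is determined by the formula:

$Me = x_{Me} + i \cdot \frac{\frac{1}{2}\sum f - S_{Me-1}}{f_{Me}}$

where $x_{Me}$ is the lower boundary of the median interval, $S_{Me-1}$ is the accumulated frequency of the preceding intervals and $f_{Me}$ is the frequency of the median interval.

In our case, the median is 30.5 million rubles.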

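Quartile deviation:

$Q = \frac{Q_3 - Q_1}{2}$

where $Q_1$ and $Q_3$ are the first and third quartiles of the distribution, respectively.

Quartiles are determined by the formulas:

Mean linear deviation: $\bar{d}$ = 6.5 million rubles.

Variance: $\sigma^2$ = 65.

Standard deviation: $\sigma = \sqrt{65} \approx 8.1$ million rubles.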

Let's calculate the relative indicators of variation.

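Oscillation coefficient: $V_R = \frac{R}{\bar{x}} \cdot 100\% = \frac{40}{30} \cdot 100\% = 133.3\%$

Relative linear deviation: $V_{\bar{d}} = \frac{\bar{d}}{\bar{x}} \cdot 100\% = \frac{6.5}{30} \cdot 100\% = 21.7\%$

Relative measure of quartile variation: $V_Q = \frac{Q}{Me} \cdot 100\% = \frac{5}{30.5} \cdot 100\% = 16.4\%$

The coefficient of variation: $V = \frac{\sigma}{\bar{x}} \cdot 100\% = \frac{8.1}{30} \cdot 100\% = 27\%$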

Let's define the indicators of the distribution form:

Formulation of conclusions.

Let us formulate conclusions on the calculated indicators of variation of example 2, which presents the interval series of the distribution of enterprises by the volume of turnover, million rubles.

The range of variation shows that the difference between the maximum and minimum values is 40 million rubles. The average turnover is 30 million rubles. The most frequently encountered turnover in the considered set of enterprises is 31.4 million rubles; 50% of the enterprises (40 enterprises) have a turnover of less than 30.5 million rubles, and 50% have more.

A quartile deviation of 5 indicates a moderate asymmetry of the distribution, since in symmetric or moderately asymmetric distributions $Q \approx \frac{2}{3}\sigma$ (in the example under consideration, $\frac{2}{3} \cdot 8.1 = 5.4$).

The mean linear deviation and the standard deviation show how much, on average, the value of the attribute fluctuates among the units of the studied population. Thus, the average fluctuation in the turnover of the enterprises in the industry is 6.5 million rubles by the mean linear deviation (absolute deviation) and 8.1 million rubles by the standard deviation. The mean square of the deviations of the individual values of the attribute from their mean (the variance) is 65.

The difference between the extreme values of the attribute is 33.3% higher than the mean ($V_R$ = 133.3%).

The relative linear deviation ($V_{\bar{d}}$ = 21.7%) and the relative indicator of quartile variation ($V_Q$ = 16.4%) characterize the homogeneity of the studied population, which is confirmed by the calculated coefficient of variation of 27% (V = 27% is less than 33%).
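The relative indicators quoted in these conclusions can be re-derived from the absolute figures given in the text. The sketch below uses only values stated above. Dividing the quartile deviation by the median is an assumption, chosen because it reproduces the quoted 16.4%.

```python
# Figures quoted in the conclusions (million rubles).
value_range = 40.0
mean = 30.0
median = 30.5
quartile_deviation = 5.0
mean_linear_deviation = 6.5
std = 8.1  # rounded square root of the variance, 65

oscillation = value_range / mean * 100
relative_linear = mean_linear_deviation / mean * 100
relative_quartile = quartile_deviation / median * 100  # assumed definition
variation = std / mean * 100

print(round(oscillation, 1), round(relative_linear, 1),
      round(relative_quartile, 1), round(variation, 1))
```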

According to the calculated indicators of asymmetry and kurtosis, it can be concluded that the distribution is flat-topped (Ex < 0) and that left-sided asymmetry is observed (As < 0). The asymmetry and kurtosis are insignificant.

The variation of an attribute is determined by various factors. Some of these factors can be isolated if the statistical population is divided into groups according to a certain criterion. Then, along with the variation of the attribute in the population as a whole, it is possible to study the variation within each of its constituent groups and between these groups. In the simplest case, when the population is divided into groups according to one factor, variation is studied by calculating and analyzing three types of variance: total, intergroup and intragroup.

Empirical coefficient of determination

The empirical coefficient of determination is widely used in statistical analysis. It represents the share of intergroup variance in the total variance of the resultant attribute and characterizes the strength of the influence of the grouping attribute on the formation of the total variation. It can be calculated using the formula:

$\eta^2 = \frac{\delta^2}{\sigma^2}$

where $\delta^2$ is the intergroup variance and $\sigma^2$ is the total variance.

It shows the proportion of the variation of the resultant attribute y that is due to the factor attribute x, and it equals the square of the empirical correlation ratio. In the absence of a relationship, the empirical coefficient of determination is equal to zero, and in the case of a functional relationship it is equal to one.

For example, if the dependence of workers' labor productivity on their qualifications is studied and the coefficient of determination is 0.7, then 70% of the variation in productivity is due to differences in qualifications and 30% to the influence of other factors.

The empirical correlation ratio is the square root of the coefficient of determination. It shows the closeness of the relationship between the grouping attribute and the resultant attribute. The empirical correlation ratio takes values from 0 to 1. If there is no relationship, the correlation ratio is equal to zero, i.e. all group means are equal and there is no intergroup variation. This means that the grouping attribute does not affect the formation of the total variation.

If the connection is functional, then the correlation ratio is equal to one. In this case, the variance of the group means is equal to the total variance, i.e. there is no intra-group variation. This means that the grouping characteristic completely determines the variation of the effective characteristic.
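A minimal sketch of this decomposition, assuming the coefficient of determination is the intergroup variance (the variance of the group means weighted by group size) divided by the total variance. The groups are hypothetical and the function name is illustrative.

```python
import math

def empirical_determination(groups):
    """Intergroup variance divided by total variance."""
    all_values = [x for group in groups for x in group]
    n = len(all_values)
    grand_mean = sum(all_values) / n
    total_var = sum((x - grand_mean) ** 2 for x in all_values) / n
    if total_var == 0:
        raise ValueError("no variation in the data")
    between_var = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2 for group in groups
    ) / n
    return between_var / total_var

# Hypothetical output per worker, grouped by qualification level.
groups = [[1, 2, 3], [7, 8, 9]]
eta_squared = empirical_determination(groups)
eta = math.sqrt(eta_squared)  # empirical correlation ratio
print(round(eta_squared, 3), round(eta, 3))
```

With identical values inside each group the result is 1 (functional relationship); with equal group means it is 0.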

The closer the value of the correlation ratio is to one, the stronger the relationship between the attributes and the closer it is to a functional dependence. For a qualitative assessment of the strength of the relationship based on the empirical correlation ratio, the Chaddock scale can be used.

Chaddock scale

The relationship is very close - Rxy = 0.9 - 0.99
The relationship is close - Rxy = 0.7 - 0.9
The relationship is noticeable - Rxy = 0.5 - 0.7
The relationship is moderate - Rxy = 0.3 - 0.5
The relationship is weak - Rxy = 0.1 - 0.3
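The scale can be turned into a simple lookup. In the sketch below the handling of boundary values (each boundary assigned to the higher category), the treatment of values from 0.99 to 1 as very close, and the label for values under 0.1 are assumptions, since the scale as listed does not specify them.

```python
def chaddock_label(value):
    """Qualitative strength of a relationship on the Chaddock scale."""
    strength = abs(value)
    if strength > 1:
        raise ValueError("the indicator cannot exceed 1 in absolute value")
    if strength >= 0.9:
        return "very close"
    if strength >= 0.7:
        return "close"
    if strength >= 0.5:
        return "noticeable"
    if strength >= 0.3:
        return "moderate"
    if strength >= 0.1:
        return "weak"
    return "below the scale"  # assumed label, not part of the listed scale

print(chaddock_label(0.84))  # close
```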

Variation is the difference between the values of the same statistical quantity for different objects, caused by the peculiarities of their own development as well as by differences in the conditions in which they are found. Variation has an objective character and helps to understand the essence of the studied phenomenon. While the average smooths out individual differences, variation, on the contrary, emphasizes them, establishing whether the average found for a particular statistical population is typical or not. This makes it possible to judge the quality of the selected statistical data.

Variation is measured using relative quantities called coefficients of variation, defined as the ratio of the average deviation to the average. Since the average deviation can be determined by linear and quadratic methods, there are correspondingly two coefficients of variation. They are determined by the formulas

$\lambda = \frac{\bar{d}}{\bar{X}}$ – linear; (1.28)

$\nu = \frac{\sigma}{\bar{X}}$ – quadratic. (1.29)

Here $\bar{d}$ is the mean linear deviation, σ is the standard deviation and $\bar{X}$ is the mean. The values of the coefficient of variation typically lie between 0 and 1, and the closer the value is to zero, the more typical the found average is for the studied statistical population, and therefore the better the statistical data are selected. The criterion value of the coefficient of variation is 1/3.

That is, the average is considered typical for a given population at λ ≤ 0.333 or at ν ≤ 0.333. Otherwise, the average is not typical and it is necessary to revise the statistical population in order to include more objective statistical values in it.

Typically, the quadratic coefficient of variation is somewhat (about 25%) greater than the linear one calculated from the same data. This means that a case is possible when λ < 0.333 and ν > 0.333; then it is necessary to take the average of these coefficients and draw the final conclusion about the typicality or non-typicality of the found average from its value.

With the linear coefficient of variation, the basic conclusion about whether the average is typical can be reached more easily and quickly than with the quadratic one. However, the quadratic coefficient is used more often because there are several ways to calculate the variance.

This method of evaluating variation also has a significant drawback. Suppose, for example, that an initial population of workers with an average experience of 15 years and a standard deviation σ = 10 years has "aged" by another 15 years. Now $\bar{X}$ = 30 years, and the standard deviation is still 10. A population that was previously heterogeneous (10 / 15 × 100 = 66.7%) thus turns out over time to be quite homogeneous (10 / 30 × 100 = 33.3%).
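The effect is easy to reproduce: adding a constant to every value leaves the standard deviation unchanged but raises the mean, so the coefficient of variation falls. The sketch below first repeats the arithmetic from the text and then checks the same behavior on a small hypothetical data set.

```python
import statistics

def cv_percent(values):
    """Population standard deviation over the mean, in percent."""
    return statistics.pstdev(values) / statistics.fmean(values) * 100.0

# Figures from the text: sigma = 10 years, mean 15 years, then 30 years.
text_before = 10 / 15 * 100
text_after = 10 / 30 * 100
print(round(text_before, 1), round(text_after, 1))  # 66.7 33.3

# Hypothetical data: the same workers 15 years later.
experience = [5, 15, 25]
aged = [x + 15 for x in experience]
cv_before = cv_percent(experience)
cv_after = cv_percent(aged)
print(round(cv_before, 1), round(cv_after, 1))
```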

Therefore, additional analysis of the statistical population is possible using the oscillation coefficient, determined by the formula

$V_R = \frac{R}{\bar{X}}$

where R is the range of variation, i.e. the difference between the largest and smallest values in the set of statistical values:

$R = X_{\max} - X_{\min}$, (1.31)

where $X_{\max}$ and $X_{\min}$ are the maximum and minimum values in the population.

By ordering the statistical quantities in the aggregate, grouping intervals are formed. Then under the notation ∆X the range of the interval is understood, and the average interval value is denoted CHI... In the case of focusing only on the quadratic coefficient of variation, different methods for determining the variance can be used.