Analysis of variation series. Variational and statistical distribution series

The grouping method also makes it possible to measure the variation (variability, fluctuation) of attributes. When the number of population units is relatively small, variation is measured on the basis of a ranked series of the units that make up the population. A series is called ranked if the units are arranged in ascending (or descending) order of the attribute.

However, ranked series are not informative enough when a comparative characterization of variation is needed. In addition, one often has to deal with statistical populations consisting of a large number of units, which are difficult to present as an itemized series. For this reason, to get a first general picture of the data and especially to make the variation of attributes easier to study, the phenomena and processes under study are usually combined into groups, and the results of the grouping are presented as group tables.

If the group table has only two columns, the groups by the selected attribute (variants) and the number of units in each group (frequencies or relative frequencies), it is called a distribution series.

A distribution series is the simplest type of structural grouping by a single attribute, displayed in a group table with two columns that contain the variants and the frequencies of the attribute. In many cases the study of the initial statistical material begins with such a structural grouping, i.e. with compiling distribution series.

Structural grouping in the form of a distribution series can be turned into a true structural grouping if the selected groups are characterized not only by frequencies, but also by other statistical indicators. The main purpose of distribution series is to study the variation of features. The theory of distribution series is developed in detail by mathematical statistics.

Distribution series are divided into attributive (grouping by qualitative characteristics, for example the division of the population by sex, nationality, marital status, etc.) and variational (grouping by quantitative characteristics).

A variation series is a group table that contains two columns: a grouping of units by one quantitative attribute and the number of units in each group. The intervals in a variation series are usually equal and closed. An example of a variation series is the following grouping of the population of Russia by per capita cash income (Table 3.10).

Table 3.10. Distribution of Russia's population by average per capita income in 2004-2009

Columns: population groups by average per capita cash income (rub./month); population in the group (% of the total).

Rows: 8,000.1-10,000.0; 10,000.1-15,000.0; 15,000.1-25,000.0; over 25,000.0; all population.

Variational series, in turn, are divided into discrete and interval series. Discrete variation series combine the variants of discrete attributes that vary within narrow limits. An example of a discrete variation series is the distribution of Russian families by the number of children they have.

Interval variational series combine the variants of either continuous attributes or discrete attributes that vary over a wide range. An example of an interval series is the distribution of the Russian population by average per capita cash income.

Discrete variational series are not used very often in practice. Compiling them is not difficult, however, since the composition of the groups is determined by the specific variants that the grouping attribute actually takes.

Interval variational series are more widespread. Compiling them raises the difficult question of how many groups to form and how wide the intervals should be.

The principles for resolving this question are set out in the chapter on the methodology for constructing statistical groupings (see paragraph 3.3).

Variation series are a means of collapsing or compressing diverse information into a compact form; they can be used to make a fairly clear judgment about the nature of the variation, to study the differences in the signs of the phenomena included in the set under study. But the most important significance of the variational series is that on their basis the special generalizing characteristics of the variation are calculated (see Chapter 7).

(definition of a variational series; components of a variational series; three forms of a variational series; expediency of constructing an interval series; conclusions that can be drawn from the constructed series)

A variational series is the sequence of all elements of a sample arranged in non-decreasing order. Equal elements are repeated.

Variational series are series built on a quantitative attribute.

Variational distribution series consist of two elements: variants and frequencies:

Variants are the numerical values of a quantitative attribute in a variational distribution series. They can be positive or negative, absolute or relative. For example, when enterprises are grouped by the results of their economic activity, positive variants represent profit and negative variants represent loss.

Frequencies are the numbers of individual variants or each group of the variation series, i.e. these are numbers showing how often certain options occur in a distribution series. The sum of all frequencies is called the volume of the population and is determined by the number of elements of the entire population.

Relative frequencies are frequencies expressed as relative values (fractions of one or percentages). The sum of the relative frequencies is equal to one or 100%. Replacing frequencies with relative frequencies makes it possible to compare variational series with different numbers of observations.
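The relationship between variants, frequencies and relative frequencies can be sketched in a few lines of Python. The sample below (number of children in ten families) is hypothetical and serves only as an illustration.

```python
from collections import Counter

# Hypothetical sample: number of children in 10 families (illustrative data only).
children = [0, 1, 1, 2, 0, 1, 3, 2, 1, 0]

# Frequencies: how many times each variant occurs, in ascending order of the variant.
frequencies = dict(sorted(Counter(children).items()))

# Volume of the population: the sum of all frequencies.
volume = sum(frequencies.values())

# Relative frequencies: each frequency as a share of the volume.
relative = {x: f / volume for x, f in frequencies.items()}

for x, f in frequencies.items():
    print(x, f, f"{relative[x]:.0%}")
```

Because the relative frequencies always sum to one, two series with different volumes can be compared column by column.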

There are three forms of variation series: ranked series, discrete series and interval series.

A ranked series is the distribution of individual units of the population in ascending or descending order of the trait under study. Ranking makes it easy to divide quantitative data into groups, immediately detect the smallest and largest values ​​of a feature, highlight the values ​​that are most often repeated.

Other forms of the variation series are group tables compiled according to the nature of the variation in the values ​​of the trait under study. By the nature of the variation, discrete (discontinuous) and continuous signs are distinguished.

A discrete series is a variational series built on attributes that change discontinuously (discrete attributes). Examples are the wage grade, the number of children in a family, the number of employees at an enterprise, etc. Such attributes can take only a finite number of definite values.

A discrete variational series is a table that consists of two columns. The first column indicates the specific value of the attribute, and the second - the number of population units with a specific value of the attribute.

If a sign has a continuous change (the amount of income, work experience, the cost of fixed assets of an enterprise, etc., which can take any value within certain limits), then an interval variation series must be built for this sign.



The group table here also has two columns. The first gives the value of the attribute as an interval "from - to" (variants), the second the number of units that fall into the interval (frequency).

Frequency (frequency of repetition) is the number of times a particular variant of the attribute occurs; it is denoted $f_i$. The sum of the frequencies equals the volume of the population under study and is denoted

$$\sum_{i=1}^{k} f_i = n,$$

where k is the number of variants of the attribute.

Very often, the table is supplemented with a column in which the accumulated frequencies S are calculated, which show how many units of the population have a feature value no greater than this value.
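A minimal sketch of the cumulative-frequency column, assuming a small hypothetical discrete series; the name S follows the text, the data do not come from the source.

```python
from itertools import accumulate

# Hypothetical discrete series: variants in ascending order and their frequencies.
variants = [0, 1, 2, 3]
frequencies = [3, 4, 2, 1]

# Cumulative frequency S for each variant: the number of units whose value
# does not exceed that variant (running sum of the frequencies).
S = list(accumulate(frequencies))

for x, f, s in zip(variants, frequencies, S):
    print(x, f, s)
# S is [3, 7, 9, 10]; the last entry equals the volume of the population.
```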

A discrete variational distribution series is a series in which groups are composed according to a feature that varies discretely and takes only integer values.

The interval variation series of distribution is a series in which the grouping attribute, which forms the basis of the grouping, can take any values ​​in a certain interval, including fractional ones.

An interval variational series is an ordered set of intervals of variation of a random variable, together with the frequencies or relative frequencies with which its values fall into each interval.

It is expedient to build an interval distribution series, first of all, with a continuous variation of a trait, and also if a discrete variation manifests itself over a wide range, i.e. the number of options for a discrete feature is quite large.

Several conclusions can already be drawn from such a series. For example, the middle element of a variational series (the median) can serve as an estimate of the most probable result of a measurement. The first and last elements (i.e. the minimum and maximum of the sample) show the spread of the sample. Sometimes, if the first or last element differs sharply from the rest of the sample, it is excluded from the measurement results on the assumption that it arose from a gross error, for example an equipment failure.
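These first conclusions can be read off a ranked sample directly. The sketch below uses hypothetical measurement values, one of which is deliberately far from the rest; whether such a value should be discarded remains a judgment call that the code does not make.

```python
from statistics import median

# Hypothetical repeated measurements (illustrative only); 9.8 looks like a gross error.
sample = [5.1, 4.9, 5.0, 5.2, 9.8, 5.0, 4.8]

# Variational series: all elements in non-decreasing order, equal elements repeated.
series = sorted(sample)

smallest, largest = series[0], series[-1]   # first and last elements show the spread
middle = median(series)                     # middle element of the ranked series

print(series)
print(smallest, largest, middle)
```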

Variation is the difference in the values of an attribute among different units of a given population in the same period (or at the same point in time). It is caused by the different conditions in which different units of the population exist. For example, even twins acquire differences over their lives in height and weight, as well as in attributes such as level of education, income, number of children, etc.

Variation arises as a result of the fact that the values ​​of the attribute themselves are formed under the total influence of various conditions that are combined in different ways in each individual case. Thus, the value of any option is objective.

Variation is characteristic of all phenomena of nature and society, with the exception of legally fixed normative values of individual social characteristics. Studying variation is of great importance in statistics because it helps to understand the essence of the phenomenon under study. Measuring variation, clarifying its causes and identifying the influence of individual factors provide important information for evidence-based management decisions.

The average value gives a generalized characteristic of the feature of the population, but it does not reveal its structure. The average value does not show how the variants of the average feature are located around it, whether they are distributed near the average or deviate from it. The average in two sets may be the same, but in one variant all individual values ​​differ slightly from it, and in the other, these differences are large, i.e. in the first case, the variation of the trait is small, and in the second case, it is large; this is very important for characterizing the significance of the average value.

So that the head of an organization, a manager or a researcher can study variation and manage it, statistics has developed special methods for studying variation (a system of indicators). With their help variation is measured and its properties are characterized. The indicators of variation include the range of variation, the mean linear deviation and the coefficient of variation.

Variation series and its forms

A variation series is an ordered distribution of the units of a population, usually by increasing (less often by decreasing) values of the attribute, together with a count of the units that have each value. When the number of population units is large, a ranked series becomes cumbersome and takes a long time to build. In that case a variational series is constructed by grouping the population units according to the values of the attribute under study.

A variation series can take the following forms:

  1. A ranked series is a list of the individual units of the population in ascending (descending) order of the attribute under study.
  2. A discrete variation series is a table consisting of two rows or columns: the specific values of the varying attribute x and the number of units in the population with each value, the frequency f. It is built when the attribute takes a small number of distinct values.
  3. An interval series.

The range of variation is defined as the absolute value of the difference between the maximum and minimum values (variants) of the attribute:

$$R = |x_{\max} - x_{\min}|$$

The range of variation shows only extreme deviations of the trait and does not reflect individual deviations of all variants in the series. It characterizes the limits of change of a variable attribute and is dependent on the fluctuations of the two extreme options and is absolutely not related to the frequencies in the variation series, that is, to the nature of the distribution, which gives this value a random character. To analyze variation, you need an indicator that reflects all the fluctuations of the variation trait and gives general characteristics. The simplest indicator of this kind is the average linear deviation.
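The limitation of the range can be illustrated with two hypothetical series that share the same extremes. The mean linear deviation is computed here by its usual definition, the average absolute deviation from the arithmetic mean; the formula is not given in the text above, so treat it as an assumption of this sketch.

```python
def variation_range(values):
    """Range of variation: difference between the largest and smallest value."""
    return max(values) - min(values)

def mean_linear_deviation(values):
    """Average absolute deviation of the values from their arithmetic mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)

# Two hypothetical series with the same extremes but different inner spread.
tight = [1, 5, 5, 5, 9]
loose = [1, 2, 5, 8, 9]

print(variation_range(tight), variation_range(loose))              # identical ranges
print(mean_linear_deviation(tight), mean_linear_deviation(loose))  # different deviations
```

The range is the same for both series, while the mean linear deviation separates them because it uses every value, not only the two extremes.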

A variation series is a statistical series that shows the distribution of the phenomenon under study by the value of some quantitative attribute, for example patients by age or duration of treatment, newborns by weight, etc.

Variant: an individual value of the attribute by which the grouping is carried out (denoted V).

Frequency: a number indicating how often a given variant occurs (denoted P). The sum of all frequencies gives the total number of observations and is denoted n. The difference between the largest and smallest variants of a variation series is called the range or amplitude.

There are variation series:

1. Discontinuous (discrete) and continuous.

The series is considered continuous if the grouping attribute can be expressed in fractional values ​​(weight, height, etc.), discontinuous if the grouping attribute is expressed only as an integer (days of disability, number of heartbeats, etc.).

2. Simple and weighted.

A simple variational series is a series in which the quantitative value of a variable attribute occurs once. In a weighted variational series, the quantitative values ​​of a varying trait are repeated with a certain frequency.

3. Grouped (interval) and ungrouped.

A grouped series has options combined into groups that unite them in size within a certain interval. In an ungrouped series, each individual variant corresponds to a certain frequency.

4. Even and odd.

In even variational series the sum of the frequencies, i.e. the total number of observations, is an even number; in odd series it is odd.

5. Symmetrical and asymmetrical.

In a symmetrical variation series, all types of averages coincide or are very close (mode, median, arithmetic mean).

Depending on the nature of the phenomena being studied, on the specific tasks and objectives of the statistical study, as well as on the content of the source material, in sanitary statistics the following types of averages are used:

structural averages (mode, median);

arithmetic mean;

harmonic mean;

geometric mean;

progressive mean.

Mode (Mo): the value of the varying attribute that occurs most often in the population under study, i.e. the variant with the highest frequency. It is read directly from the variation series without any calculation. It is usually very close to the arithmetic mean and is very convenient in practice.

Median (Me): the variant that divides the ranked variation series (the variants arranged in ascending or descending order) into two equal halves. The median is found using the cumulative series, which is obtained by successively summing the frequencies. If the sum of the frequencies is an even number, the median is conventionally taken as the arithmetic mean of the two middle values.

The mode and median are used for open-ended series, i.e. when the largest or smallest variants have no exact quantitative value (for example, under 15 years, 50 and older, etc.). In this case the arithmetic mean (a parametric characteristic) cannot be calculated.
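A minimal sketch of how the mode and median can be read from a discrete weighted series, assuming hypothetical variants and frequencies. The helper name median_from_series is introduced only for this illustration.

```python
from itertools import accumulate

def mode_from_series(variants, freqs):
    """Variant with the highest frequency (the first one if several tie)."""
    return variants[freqs.index(max(freqs))]

def median_from_series(variants, freqs):
    """Median of a ranked discrete series, located through cumulative frequencies."""
    cumulative = list(accumulate(freqs))
    n = cumulative[-1]

    def value_at(position):
        # Variant occupying the given 1-based position in the ranked series.
        for v, c in zip(variants, cumulative):
            if position <= c:
                return v

    if n % 2 == 1:
        return value_at((n + 1) // 2)
    # Even number of observations: mean of the two middle values.
    return (value_at(n // 2) + value_at(n // 2 + 1)) / 2

# Hypothetical series (illustrative only): variants in ascending order.
variants = [1, 2, 3, 4, 5]
freqs = [2, 5, 8, 4, 2]

print(mode_from_series(variants, freqs), median_from_series(variants, freqs))
```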

Arithmetic mean: the most commonly used average. It is usually denoted M.

A distinction is made between the simple arithmetic mean and the weighted arithmetic mean.

The simple arithmetic mean is calculated:

- when the population is represented by a simple list of the attribute values for each unit;

- if the number of repetitions of each variant cannot be determined;

- if the numbers of repetitions of each variant are close to each other.

The simple arithmetic mean is calculated by the formula:

$$M = \frac{\sum V}{n},$$

where V are the individual values of the attribute, n is the number of individual values, and $\sum$ is the summation sign.

Thus, the simple arithmetic mean is the ratio of the sum of the variants to the number of observations.

Example: determine the average length of stay in bed for 10 patients with pneumonia:

16 days - 1 patient; 17–1; 18–1; 19–1; 20–1; 21–1; 22–1; 23–1; 26–1; 31–1.

$$M = \frac{16 + 17 + 18 + 19 + 20 + 21 + 22 + 23 + 26 + 31}{10} = \frac{213}{10} = 21.3$$ bed-days.
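The same calculation can be checked in Python using the ten values listed in the example; the variable names are illustrative.

```python
# Days in bed for the 10 pneumonia patients from the example above.
days = [16, 17, 18, 19, 20, 21, 22, 23, 26, 31]

# Simple arithmetic mean: sum of the variants divided by the number of observations.
total = sum(days)
mean_days = total / len(days)

print(total, mean_days)
```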

The weighted arithmetic mean is calculated when the individual values of the attribute are repeated. It can be calculated in two ways:

1. Directly (the arithmetic or direct method), by the formula:

$$M = \frac{\sum VP}{n},$$

where P is the frequency (number of observations) of each variant.

Thus, the weighted arithmetic mean is the ratio of the sum of the products of the variants and their frequencies to the number of observations.

2. By calculating deviations from a conditional average (the method of moments).

The weighted arithmetic mean is calculated from:

- material grouped by the variants of a quantitative attribute;

- variants arranged in ascending or descending order of the attribute value (a ranked series).

The method of moments additionally requires that all intervals be of the same size.

By the method of moments the arithmetic mean is calculated by the formula:

$$M = M_o + i \cdot \frac{\sum aP}{n},$$

where $M_o$ is the conditional average, usually taken as the value of the attribute with the highest frequency, i.e. the one repeated most often (the mode);

i is the size of the interval;

a is the conditional deviation from the conditional average: a consecutive series of numbers with a plus sign (+1, +2, etc.) for variants above the conditional average and with a minus sign (-1, -2, etc.) for variants below it. The conditional deviation of the variant taken as the conditional average is 0;

P are the frequencies;

$\sum P$ is the total number of observations, or n.
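Both ways of computing the weighted arithmetic mean can be compared on a small grouped series. The data below are hypothetical (they are not the values of any table in this text), and the moments formula is written in its standard form: conditional average plus interval size times the first conditional moment.

```python
# Hypothetical grouped series with equal intervals (illustrative only).
centers = [5, 10, 15, 20, 25]    # central variants V
freqs = [4, 10, 20, 12, 4]       # frequencies P

n = sum(freqs)                   # total number of observations

# 1. Direct method: sum of the products V * P divided by n.
mean_direct = sum(v * p for v, p in zip(centers, freqs)) / n

# 2. Method of moments.
i = centers[1] - centers[0]                       # interval size
conditional = centers[freqs.index(max(freqs))]    # conditional average: most frequent variant
a = [(v - conditional) // i for v in centers]     # conditional deviations ..., -1, 0, +1, ...
first_moment = sum(ai * p for ai, p in zip(a, freqs)) / n
mean_moments = conditional + i * first_moment

print(a)
print(mean_direct, mean_moments)
```

The two results agree, which is the point of the method: the deviations a are small integers, so the arithmetic is easier to do by hand.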

Example: determine the average height of 8-year-old boys by the direct method (Table 1).

Table 1

Columns: height in cm; number of boys (P); central variant (V).

The central variant, the midpoint of the interval, is defined as the half-sum of the lower bounds of two adjacent groups, and so on for each group.

The product VP is obtained by multiplying each central variant by its frequency. The resulting products are then added to give $\sum VP$, which is divided by the number of observations (100) to obtain the weighted arithmetic mean in cm.

We will solve the same problem by the method of moments, for which Table 2 is compiled:

Table 2

Columns: height in cm (V); number of boys (P), n = 100.

We take 122 as $M_o$, because 33 of the 100 observations had a height of 122 cm. We find the conditional deviations (a) from the conditional average as described above. We then multiply the conditional deviations by the frequencies (aP) and sum the results ($\sum aP$). The result is 17. Finally, we substitute the data into the formula:

$$M = M_o + i \cdot \frac{\sum aP}{n} = 122 + i \cdot \frac{17}{100},$$ where i is the interval size used in Table 2.

When studying a variable trait, one should not be limited only to the calculation of average values. It is also necessary to calculate indicators characterizing the degree of diversity of the studied features. The value of one or another quantitative attribute is not the same for all units of the statistical population.

A characteristic of the variation series is the standard deviation ($\sigma$), which shows the scatter of the studied attribute around the arithmetic mean, i.e. it characterizes the variability of the variation series. It can be determined directly by the formula:

$$\sigma = \sqrt{\frac{\sum (V - M)^2 P}{\sum P}}$$

The standard deviation equals the square root of the sum of the products of the squared deviations of each variant from the arithmetic mean, $(V - M)^2$, and its frequency, divided by the sum of the frequencies ($\sum P$).
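A minimal sketch of the direct calculation described above, assuming a hypothetical weighted series; the symbols V, P and M follow the text.

```python
import math

# Hypothetical weighted series (illustrative only): variants V and frequencies P.
variants = [5, 10, 15, 20, 25]
freqs = [4, 10, 20, 12, 4]

n = sum(freqs)
mean = sum(v * p for v, p in zip(variants, freqs)) / n   # weighted arithmetic mean M

# Sum of squared deviations from the mean, each weighted by its frequency,
# divided by the sum of the frequencies; the square root gives sigma.
variance = sum((v - mean) ** 2 * p for v, p in zip(variants, freqs)) / n
sigma = math.sqrt(variance)

print(round(mean, 2), round(variance, 2), round(sigma, 2))
```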

Calculation example: determine the average number of sick-leave certificates issued in the clinic per day (Table 3).

Table 3

Columns: number of sick-leave certificates issued by a doctor per day (V); number of doctors (P).

When the number of observations is less than 30, one must be subtracted from $\sum P$ in the denominator.

If the series is grouped into equal intervals, the standard deviation can be determined by the method of moments:

$$\sigma = i \sqrt{\frac{\sum a^2 P}{\sum P} - \left(\frac{\sum aP}{\sum P}\right)^2},$$

where i is the size of the interval;

a is the conditional deviation from the conditional average;

P is the frequency of the variants in the corresponding interval;

$\sum P$ is the total number of observations.
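The method of moments for the standard deviation can be checked against the direct formula. The sketch assumes the standard form of the moments formula (interval size times the square root of the second conditional moment minus the squared first conditional moment) and uses hypothetical data.

```python
import math

# Hypothetical grouped series with equal intervals (illustrative only).
centers = [5, 10, 15, 20, 25]    # central variants V
freqs = [4, 10, 20, 12, 4]       # frequencies P

n = sum(freqs)
i = centers[1] - centers[0]                       # interval size
conditional = centers[freqs.index(max(freqs))]    # conditional average
a = [(v - conditional) // i for v in centers]     # conditional deviations

m1 = sum(ai * p for ai, p in zip(a, freqs)) / n        # first conditional moment
m2 = sum(ai ** 2 * p for ai, p in zip(a, freqs)) / n   # second conditional moment
sigma_moments = i * math.sqrt(m2 - m1 ** 2)

# Direct calculation for comparison.
mean = sum(v * p for v, p in zip(centers, freqs)) / n
sigma_direct = math.sqrt(sum((v - mean) ** 2 * p for v, p in zip(centers, freqs)) / n)

print(round(sigma_moments, 4), round(sigma_direct, 4))
```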

Calculation example: determine the average length of stay of patients in a therapeutic bed by the method of moments (Table 4):

Table 4

Columns: number of days of bed stay (V); number of patients (P).

The Belgian statistician A. Quetelet found that the variation of mass phenomena obeys the law of error distribution, discovered almost simultaneously by C. F. Gauss and P. Laplace. The curve representing this distribution is bell-shaped. Under the normal distribution law, the individual values of the attribute vary within $M \pm 3\sigma$, which covers 99.73% of all units in the population.

It has been calculated that the range $M \pm 2\sigma$ contains 95.45% of all members of the variation series, and the range $M \pm 1\sigma$ contains 68.27%. In medicine, the range $M \pm 1\sigma$ is associated with the concept of the norm. A deviation from the arithmetic mean greater than $1\sigma$ but less than $2\sigma$ is subnormal, and a deviation greater than $2\sigma$ is abnormal (above or below the norm).
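The percentages quoted above follow from the normal distribution and can be reproduced with the error function. The classification helper is a sketch of the medical rule as stated; how to treat a deviation of exactly one or two sigma is not specified in the text, so the boundary handling below is an assumption.

```python
import math

def share_within(k):
    """Share of a normal population lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

def classify(value, mean, sigma):
    """Label a value as norm, subnormal or abnormal by its deviation from the mean."""
    deviation = abs(value - mean)
    if deviation <= sigma:          # boundary handling is an assumption
        return "norm"
    if deviation <= 2 * sigma:
        return "subnormal"
    return "abnormal"

for k in (1, 2, 3):
    print(k, f"{share_within(k):.2%}")

# Hypothetical example: mean height 122 cm, sigma 5 cm.
print(classify(125, 122, 5), classify(130, 122, 5), classify(135, 122, 5))
```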

In sanitary statistics the three-sigma rule is used in studying physical development, assessing the work of health care institutions and assessing public health. The same rule is widely applied in the national economy when setting standards.

Thus, the standard deviation serves to:

- measure the dispersion of a variation series;

- characterize the degree of diversity of attributes, which is determined by the coefficient of variation:

$$C_v = \frac{\sigma}{M} \cdot 100\%$$

A coefficient of variation above 20% indicates strong diversity, from 10 to 20% medium diversity, and below 10% weak diversity of the attribute. The coefficient of variation is, to a certain extent, a criterion of the reliability of the arithmetic mean.
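A short sketch of the coefficient of variation and the grading scale above, assuming the usual definition (standard deviation as a percentage of the arithmetic mean). The function names and sample numbers are illustrative.

```python
def coefficient_of_variation(sigma, mean):
    """Standard deviation expressed as a percentage of the arithmetic mean."""
    return sigma / mean * 100

def diversity(cv_percent):
    """Grade the diversity of the attribute using the thresholds from the text."""
    if cv_percent > 20:
        return "strong"
    if cv_percent >= 10:
        return "medium"
    return "weak"

# Hypothetical pairs of (sigma, mean), illustrative only.
for sigma, mean in [(3.0, 50.0), (7.5, 50.0), (15.0, 50.0)]:
    cv = coefficient_of_variation(sigma, mean)
    print(round(cv, 1), diversity(cv))
```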

Series built on a quantitative attribute are called variational.

Distribution series consist of variants (values of the attribute) and frequencies (numbers of units in the groups). Frequencies expressed as relative values (shares, percentages) are called relative frequencies. The sum of all frequencies is called the volume of the distribution series.

By type, distribution series are divided into discrete (built on discontinuous values of the attribute) and interval (built on continuous values of the attribute).

A variation series consists of two columns (or rows). One gives the individual values of the varying attribute, called variants and denoted X; the other gives absolute numbers showing how many times each variant occurs. The entries of the second column are called frequencies and are conventionally denoted f. Relative indicators, which characterize the share of each variant's frequency in the total sum of frequencies, can also be used in the second column. These are called relative frequencies and are conventionally denoted ω. Their sum is equal to one. Relative frequencies can also be expressed as percentages, in which case their sum is 100%.

If the variants of the variational series are expressed as discrete quantities, then such a variational series is called discrete.

For continuous features, variation series are constructed as interval, that is, the values ​​of the attribute in them are expressed “from ... to ...”. In this case, the minimum values ​​of the attribute in such an interval are called the lower limit of the interval, and the maximum - the upper limit.

Interval variational series are also built for discrete attributes that vary over a wide range. Interval series can have equal or unequal intervals.

Consider how the size of equal intervals is determined. We introduce the following notation:

i is the size of the interval;

$x_{\max}$ is the maximum value of the attribute among the units of the population;

$x_{\min}$ is the minimum value of the attribute among the units of the population;

n is the number of groups.

$$i = \frac{x_{\max} - x_{\min}}{n},$$ if n is known.

If the number of groups is difficult to determine in advance, then for a sufficiently large population the formula proposed by Sturges in 1926 can be recommended for calculating the optimal interval size:

$$n = 1 + 3.322 \lg N,$$ where N is the number of units in the population.
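The two formulas combine naturally: Sturges' rule suggests the number of groups, and the equal-interval formula then gives the interval size. In the sketch below, rounding the number of groups to the nearest integer is an assumption (the text does not say how to round), and the sample figures are hypothetical.

```python
import math

def sturges_groups(N):
    """Number of groups by Sturges' formula, rounded to the nearest integer (assumption)."""
    return round(1 + 3.322 * math.log10(N))

def interval_size(x_min, x_max, n_groups):
    """Size of equal intervals: range of the attribute divided by the number of groups."""
    return (x_max - x_min) / n_groups

# Hypothetical population: 100 units with attribute values from 110 to 134.
N, x_min, x_max = 100, 110, 134
n_groups = sturges_groups(N)
print(n_groups, interval_size(x_min, x_max, n_groups))
```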

The value of unequal intervals is determined in each individual case, taking into account the characteristics of the object of study.

The statistical distribution of a sample is the list of variants and their corresponding frequencies (or relative frequencies).

The statistical distribution of a sample can be given as a table whose first column contains the variants and whose second column contains the corresponding frequencies $n_i$ or relative frequencies $P_i$.

Statistical distribution of the sample

Interval series are called variation series in which the values ​​of the features underlying their formation are expressed within certain limits (intervals). Frequencies in this case do not refer to individual values ​​of the attribute, but to the entire interval.

Interval distribution series are constructed according to continuous quantitative characteristics, as well as according to discrete characteristics, varying within a significant range.

An interval series can be represented as the statistical distribution of a sample by listing the intervals and their corresponding frequencies. The frequency of an interval is the sum of the frequencies of the variants that fall into it.
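Building an interval series from raw values can be sketched as follows. The function name interval_series is hypothetical, the intervals are equal and closed on the left, and placing the maximum value in the last interval is a convention chosen for this sketch. The function assumes the values are not all equal.

```python
def interval_series(values, n_groups):
    """Return interval bounds and frequencies for equal intervals over the data range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_groups
    counts = [0] * n_groups
    for x in values:
        # Index of the interval containing x; the maximum goes into the last interval.
        k = min(int((x - lo) / width), n_groups - 1)
        counts[k] += 1
    bounds = [(lo + k * width, lo + (k + 1) * width) for k in range(n_groups)]
    return bounds, counts

# Hypothetical sample (illustrative only).
values = [12, 15, 18, 21, 22, 25, 27, 30, 33, 36]
bounds, counts = interval_series(values, 3)

for (left, right), f in zip(bounds, counts):
    print(f"{left:g} - {right:g}: {f}")
```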

When grouping by quantitative continuous features, it is important to determine the size of the interval.

In addition to the sample mean and sample variance, other characteristics of the variation series are also used.

The mode is the variant that has the highest frequency.