Chi-squared distribution. Distributions of mathematical statistics in MS EXCEL

The chi-square distribution is one of the most widely used distributions in statistics for testing statistical hypotheses. One of the most powerful goodness-of-fit tests, Pearson's chi-square test, is built on it.

A goodness-of-fit test is a test of the hypothesis that an unknown distribution follows an assumed law.

The χ² (chi-square) test is used to test hypotheses about distributions of various forms; this is its advantage.

The test statistic is calculated by the formula

χ² = Σ (m - m′)² / m′,

where m and m′ are, respectively, the empirical and theoretical frequencies of the distribution under consideration, and n is the number of degrees of freedom.

For verification, we need to compare empirical (observed) and theoretical (calculated under the assumption of a normal distribution) frequencies.

If the empirical frequencies coincide exactly with the calculated (expected) frequencies, Σ(E - T) = 0 and the χ² statistic is also zero. If Σ(E - T) is not zero, this indicates a discrepancy between the calculated and empirical frequencies of the series. In such cases it is necessary to assess the significance of the χ² statistic, which in theory can vary from zero to infinity. This is done by comparing the actually obtained value χ²_obs with its critical value χ²_crit. The null hypothesis, i.e. the assumption that the discrepancy between the empirical and theoretical (expected) frequencies is random, is rejected if χ²_obs is greater than or equal to χ²_crit for the accepted significance level (α) and the number of degrees of freedom (n).

The distribution of possible values of the random variable χ² is continuous and asymmetric. It depends on the number of degrees of freedom (n) and approaches the normal distribution as the number of observations increases. Therefore, applying the χ² test to discrete distributions involves some error, which is especially noticeable in small samples. To obtain more accurate estimates, the sample arranged as a variation series should contain at least 50 observations. Correct application of the χ² test also requires that the frequencies in the extreme classes be no less than 5; if there are fewer than 5, they are combined with the frequencies of neighboring classes so that each sum is at least 5. Combining frequencies accordingly reduces the number of classes (N). The number of degrees of freedom is then set according to the reduced number of classes, taking into account the number of restrictions on the freedom of variation.



Since the accuracy with which the χ² criterion is determined depends largely on the accuracy of the theoretical frequencies (T), unrounded theoretical frequencies should be used when forming the differences between the empirical and calculated frequencies.

As an example, take a study published on a website that focuses on the application of statistical methods in the humanities.

The Chi-square test allows comparison of frequency distributions, regardless of whether they are normally distributed or not.

Frequency refers to the number of occurrences of an event. Frequencies are usually dealt with when variables are measured on a nominal scale and their other characteristics, apart from frequency, are impossible or problematic to use; in other words, when a variable has qualitative characteristics. Also, many researchers tend to convert test scores into levels (high, medium, low) and build tables of score distributions to find out the number of people at each level. To prove that in one of the levels (one of the categories) the number of people really is greater (or smaller), the chi-square test is also used.

Let's look at the simplest example.

A self-esteem test was conducted among younger adolescents. The test scores were translated into three levels: high, medium, low. The frequencies were distributed as follows:

High (H): 27 people

Medium (M): 12 people

Low (L): 11 people

Obviously, children with high self-esteem are in the majority; however, this needs to be demonstrated statistically. For this we use the chi-square test.

Our task is to check whether the obtained empirical data differ from the theoretically equiprobable ones. To do this we need to find the theoretical frequencies. In our case the theoretical frequencies are equiprobable frequencies, found by adding all the frequencies and dividing by the number of categories.

In our case:

(H + M + L) / 3 = (27 + 12 + 11) / 3 ≈ 16.67

Chi-square test formula:

χ² = Σ (E - T)² / T

We build the table:

Level    E    T       (E - T)²/T
High     27   16.67   6.40
Medium   12   16.67   1.31
Low      11   16.67   1.93

We find the sum of the last column: 6.40 + 1.31 + 1.93 ≈ 9.64.

Now we need to find the critical value of the criterion from the table of critical values (Table 1 in the appendix). For this we need the number of degrees of freedom (n).

n = (R - 1) * (C - 1)

where R is the number of rows in the table, C is the number of columns.

In our case there is only one column (the original empirical frequencies) and three rows (categories), so the formula simplifies: the column factor drops out.

n = (R - 1) = 3 - 1 = 2

For the probability of error p≤0.05 and n = 2, the critical value is χ2 = 5.99.

The obtained empirical value is greater than the critical one, so the frequency differences are statistically significant (χ² = 9.64; p ≤ 0.05).

As you can see, the calculation of the criterion is very simple and does not take much time. The practical value of the chi-square test is enormous. This method turns out to be most valuable when analyzing the answers to the questionnaire questions.
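
The same calculation is easy to cross-check in R, which is used later in this text (a minimal sketch; the counts and the equiprobable expected frequencies are those of the example above):

    observed <- c(27, 12, 11)              # high, medium, low self-esteem
    chisq.test(observed, p = rep(1/3, 3))  # goodness-of-fit against equal probabilities
    # X-squared = 9.64, df = 2, p-value ≈ 0.008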


Let's look at a more complex example.

For example, a psychologist wants to know whether it is true that teachers are more biased against boys than girls, i.e. more likely to praise girls. To check this, the psychologist analyzed student evaluations written by teachers for the frequency of occurrence of three words: "active", "diligent", "disciplined" (synonyms of these words were also counted). Data on the frequency of occurrence of the words were entered in a table:

To process the obtained data, we use the chi-square test.

To do this, we construct a table of distribution of empirical frequencies, i.e. those frequencies that we observe:

In theory, we would expect the frequencies to be distributed equally, i.e. each word's frequency should be distributed proportionally between boys and girls. Let's build a table of theoretical frequencies. To do this, multiply each row sum by each column sum and divide the result by the grand total.

The final table for calculations will look like this:

χ² = Σ (E - T)² / T

n = (R - 1)(C - 1), where R is the number of rows and C the number of columns in the table.

In our case, chi-square = 4.21; n = 2.

According to the table of critical values ​​of the criterion, we find: for n = 2 and an error level of 0.05, the critical value is χ2 = 5.99.

The resulting value is less than the critical one, which means that the null hypothesis is accepted.

Conclusion: teachers do not attach importance to the sex of the child when writing evaluations of them.
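
Since the original frequency table is not reproduced in the text, here is a hedged R sketch of the same procedure with hypothetical counts (the numbers are invented purely for illustration; only the workflow matters):

    # Hypothetical counts: rows = words, columns = boys / girls
    words <- matrix(c(10,  5,
                       7,  9,
                       8, 12),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("active", "diligent", "disciplined"),
                                    c("boys", "girls")))
    chisq.test(words)  # expected frequencies and X-squared are computed automatically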


Conclusion.

K. Pearson made a significant contribution to the development of mathematical statistics, introducing a large number of its fundamental concepts. Pearson's main philosophical position can be formulated as follows: the concepts of science are artificial constructions, means of describing and ordering sensory experience; the rules for combining them into scientific statements are isolated by the grammar of science, which is the philosophy of science. The universal discipline of applied statistics makes it possible to connect dissimilar concepts and phenomena, although, according to Pearson, it too is subjective.

Many of K. Pearson's constructions are directly related to, or were developed using, anthropological materials. He developed numerous methods of numerical classification and statistical tests applied in all fields of science.



The chi-square test is a universal method for checking the agreement between experimental results and the statistical model used.

Pearson distance X²

A.M. Pyatnitsky

Russian State Medical University

In 1900, Karl Pearson proposed a simple, universal and effective method for checking the agreement between model predictions and experimental data. His "chi-square test" is the most important and most frequently used statistical test. Most problems associated with estimating unknown model parameters and checking the agreement between a model and experimental data can be solved with its help.

Let there be an a priori ("pre-experimental") model of the object or process under study (in statistics one speaks of the "null hypothesis" H0), and results of an experiment with this object. We must decide whether the model is adequate (does it correspond to reality)? Do the results of the experiment contradict our ideas of how reality works; in other words, should H0 be rejected? Often this problem can be reduced to comparing the observed (O_i = Observed) and model-expected (E_i = Expected) average frequencies of occurrence of certain events. It is assumed that the observed frequencies were obtained in a series of N independent (!) observations made under constant (!) conditions. As a result of each observation, one of M events is recorded. These events cannot occur simultaneously (they are pairwise incompatible) and one of them necessarily occurs (their union forms a certain event). The totality of all observations reduces to a table (vector) of frequencies (O_i) = (O_1, ..., O_M), which fully describes the result of the experiment. The value O_2 = 4 means that event number 2 occurred 4 times. The sum of the frequencies O_1 + ... + O_M = N. It is important to distinguish two cases: N fixed in advance, and N itself a random value. For a fixed total number of experiments N, the frequencies have a multinomial distribution. Let us explain this general scheme with a simple example.

Application of the chi-square test to test simple hypotheses.

Let the model (null hypothesis H0) be that the die is fair: all faces come up equally often, with probability p_i = 1/6, i = 1, ..., M, M = 6. An experiment was carried out in which the die was thrown 60 times (N = 60 independent trials were performed). According to the model, we expect all observed frequencies O_i of the appearance of 1, 2, ..., 6 points to be close to their average values E_i = N·p_i = 60·(1/6) = 10. According to H0, the vector of average frequencies is (E_i) = (N·p_i) = (10, 10, 10, 10, 10, 10). (Hypotheses in which the average frequencies are fully known before the experiment begins are called simple.) If the observed vector (O_i) were equal to (34, 0, 0, 0, 0, 26), it would be immediately clear that the model is wrong: the die cannot be fair, since only 1 and 6 came up in 60 throws. The probability of such an event for a fair die is negligible: P = (2/6)^60 ≈ 2.4·10^(-29). However, the appearance of such obvious discrepancies between model and experiment is an exception. Let the vector of observed frequencies (O_i) be (5, 15, 6, 14, 4, 16). Is this consistent with H0? So, we need to compare two frequency vectors, (E_i) and (O_i). Here the vector of expected frequencies (E_i) is not random, but the vector of observed ones (O_i) is: in the next experiment (a new series of 60 throws) it will turn out differently. It is useful to introduce a geometric interpretation of the problem and assume that in frequency space (here 6-dimensional) two points are given, with coordinates (5, 15, 6, 14, 4, 16) and (10, 10, 10, 10, 10, 10). Are they far enough apart to be considered inconsistent with H0? In other words, we need to:

  1. learn to measure the distance between frequencies (points in the frequency space),
  2. have a criterion for deciding when a distance should be considered too ("implausibly") large, that is, incompatible with H0.

The square of the usual Euclidean distance would be:

X²_Euclid = Σ(O_i - E_i)² = (5-10)² + (15-10)² + (6-10)² + (14-10)² + (4-10)² + (16-10)² = 154

Moreover, the surfaces X²_Euclid = const are always spheres if we fix the values of E_i and vary O_i. Karl Pearson noted that one should not use the Euclidean distance in frequency space. Thus, it is wrong to assume that the points (O = 1030, E = 1000) and (O = 40, E = 10) are equally far from each other, although in both cases the difference is O - E = 30. After all, the greater the expected frequency, the greater the deviations from it that should be considered possible. Therefore the points (O = 1030, E = 1000) should be considered "close", and the points (O = 40, E = 10) "far" from each other. It can be shown that if the hypothesis H0 is true, then the fluctuations of the frequency O_i about E_i are of the order of the square root (!) of E_i. Therefore Pearson suggested squaring, when calculating the distance, not the differences (O_i - E_i) but the normalized differences (O_i - E_i)/√E_i. So here is the formula by which the Pearson distance is calculated (in fact, it is the square of a distance):

X²_Pearson = Σ((O_i - E_i)/√E_i)² = Σ(O_i - E_i)²/E_i

In our example:

X²_Pearson = (5-10)²/10 + (15-10)²/10 + (6-10)²/10 + (14-10)²/10 + (4-10)²/10 + (16-10)²/10 = 15.4

For a fair die all the expected frequencies E_i are the same, but usually they differ, so the surfaces on which the Pearson distance is constant (X²_Pearson = const) are ellipsoids rather than spheres.

Now that a formula for the distance has been chosen, we need to figure out which distances should be considered "not too large" (consistent with H0). What, for example, can be said about the distance of 15.4 that we calculated? In what percentage of cases (or with what probability) would we obtain a distance greater than 15.4 when experimenting with a fair die? If this percentage is small (< 0.05), then H0 must be rejected. In other words, we need to find the distribution of the Pearson distance. If all expected frequencies E_i are not too small (≥ 5), and H0 is true, then the normalized differences (O_i - E_i)/√E_i are approximately equivalent to standard Gaussian random variables: (O_i - E_i)/√E_i ≈ N(0,1). This means, for example, that in 95% of cases |(O_i - E_i)/√E_i| < 1.96 ≈ 2 (the "two sigma" rule).

Explanation. The number of observations O_i falling into cell number i of the table has a binomial distribution with parameters m = N·p_i = E_i, σ = (N·p_i·(1 - p_i))^(1/2), where N is the number of observations (N ≫ 1) and p_i is the probability for a single observation to fall into this cell (recall that the observations are independent and performed under constant conditions). If p_i is small, then σ ≈ (N·p_i)^(1/2) = √E_i and the binomial distribution is close to the Poisson distribution, for which the average number of observations is E_i = λ and the standard deviation is σ = λ^(1/2) = √E_i. For λ ≥ 5 the Poisson distribution is close to the normal N(m = E_i = λ, σ = √E_i = λ^(1/2)), and the normalized quantity (O_i - E_i)/√E_i ≈ N(0,1).

Pearson defined the random variable χ²_n, "chi-square with n degrees of freedom", as the sum of the squares of n independent standard normal random variables:

χ²_n = T_1² + T_2² + ... + T_n², where all T_i ~ N(0,1) are independent identically distributed random variables.

Let us try to grasp clearly the meaning of this most important random variable in statistics. To do this, on a plane (for n = 2) or in space (for n = 3), imagine a cloud of points whose coordinates are independent and have the standard normal distribution f_T(x) ~ exp(-x²/2). On the plane, by the "two sigma" rule applied independently to each coordinate, 90% (0.95·0.95 ≈ 0.90) of the points are enclosed within the square -2 < x < 2, -2 < y < 2. For n = 2 degrees of freedom the chi-square density is

f_{χ²_2}(a) = C·exp(-a/2) = 0.5·exp(-a/2).

For a sufficiently large number of degrees of freedom n (n > 30), the chi-square distribution approaches the normal distribution N(m = n, σ = (2n)^(1/2)). This is a consequence of the central limit theorem: a sum of identically distributed quantities with finite variance approaches the normal law as the number of terms grows.

In practice it is worth remembering that the mean of the squared distance is m(χ²_n) = n and its variance is σ²(χ²_n) = 2n. From this it is easy to conclude which chi-square values should be considered too small or too large: most of the distribution lies in the range from n - 2·(2n)^(1/2) to n + 2·(2n)^(1/2).

So Pearson distances substantially exceeding n + 2·(2n)^(1/2) should be considered implausibly large (inconsistent with H0). If the result is close to n + 2·(2n)^(1/2), the tables should be used, from which one can find out exactly in what proportion of cases such or larger chi-square values can appear.
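
As an illustration (a sketch, not part of the original text), the rule-of-thumb bound can be compared with an exact table value in R; note that the rule of thumb underestimates the upper quantile because the chi-square distribution is skewed:

    n <- 5
    n + 2 * sqrt(2 * n)    # rule-of-thumb upper bound: 11.32
    qchisq(0.975, df = n)  # exact 97.5% point of chi-square with 5 d.f.: 12.83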

It is important to know how to choose the correct number of degrees of freedom (d.f.). It would seem natural to assume that n is simply equal to the number of bins, n = M; this is what Pearson suggested in his article. For the die example that would mean n = 6. However, several years later it was shown that Pearson was wrong. The number of degrees of freedom is always less than the number of bins if there are constraints linking the random variables O_i. For the die example, the sum of the O_i is 60, so only 5 frequencies can be varied independently, and the correct value is n = 6 - 1 = 5. For this value of n we get n + 2·(2n)^(1/2) = 5 + 2·(10)^(1/2) = 11.3. Since 15.4 > 11.3, the hypothesis H0 (the die is fair) must be rejected.
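
The same conclusion can be reached with a one-line check in R (assuming the observed frequencies from the example above):

    throws <- c(5, 15, 6, 14, 4, 16)     # observed frequencies of faces 1..6
    chisq.test(throws, p = rep(1/6, 6))
    # X-squared = 15.4, df = 5, p-value ≈ 0.009: H0 (the die is fair) is rejected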

After the error was clarified, the existing χ² tables had to be supplemented, since initially they contained no case n = 1 (the smallest number of bins being 2). Now it turned out that there can be cases when the Pearson distance has the distribution χ²_{n=1}.

Example. In 100 coin tosses, the number of heads is O_1 = 65 and of tails O_2 = 35. The number of bins is M = 2. If the coin is symmetric, the expected frequencies are E_1 = 50, E_2 = 50.

X²_Pearson = Σ(O_i - E_i)²/E_i = (65-50)²/50 + (35-50)²/50 = 2·225/50 = 9.

The resulting value should be compared with the values that the random variable χ²_{n=1}, defined as the square of a standard normal variable, can take: χ²_{n=1} = T_1² ≥ 9 ⇔ T_1 ≥ 3 or T_1 ≤ -3. The probability of such an event is very small: P(χ²_{n=1} ≥ 9) ≈ 0.0027. Therefore the coin cannot be considered symmetric: H0 must be rejected. That the number of degrees of freedom cannot equal the number of bins is evident from the fact that the sum of the observed frequencies always equals the sum of the expected ones; for example, O_1 + O_2 = 65 + 35 = E_1 + E_2 = 50 + 50 = 100. Therefore the random points with coordinates O_1 and O_2 lie on the straight line O_1 + O_2 = E_1 + E_2 = 100, and the distance to the center turns out to be smaller than if this restriction did not exist and the points were located over the whole plane. Indeed, for two independent random variables with expectations E_1 = 50, E_2 = 50, the sum of their realizations would not always equal 100; for example, the values O_1 = 60, O_2 = 55 would be admissible.
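
A quick cross-check in R (goodness-of-fit form; no continuity correction is applied by chisq.test() in this form):

    coin <- c(65, 35)                  # heads, tails
    chisq.test(coin, p = c(0.5, 0.5))
    # X-squared = 9, df = 1, p-value ≈ 0.0027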

Explanation. Let us compare the result of the Pearson criterion for M = 2 with what the de Moivre-Laplace formula gives for estimating the random fluctuations of the frequency of occurrence ν = K/N of an event having probability p in a series of N independent Bernoulli trials (K is the number of successes):

χ²_{n=1} = Σ(O_i - E_i)²/E_i = (O_1 - E_1)²/E_1 + (O_2 - E_2)²/E_2 = (Nν - Np)²/(Np) + (N(1-ν) - N(1-p))²/(N(1-p)) =

= (Nν - Np)²·(1/p + 1/(1-p))/N = (Nν - Np)²/(Np(1-p)) = ((K - Np)/(Npq)^(1/2))² = T²

The quantity T = (K - Np)/(Npq)^(1/2) = (K - m(K))/σ(K) ≈ N(0,1) for σ(K) = (Npq)^(1/2) ≥ 3. We see that in this case Pearson's result coincides exactly with what the normal approximation to the binomial distribution gives.

So far we have considered simple hypotheses, for which the expected average frequencies E_i are fully known in advance. How to choose the correct number of degrees of freedom for complex hypotheses is discussed below.

Chi-square test for complex hypotheses

In the examples with the fair die and the coin, the expected frequencies could be determined before (!) the experiment. Such hypotheses are called "simple". In practice, "complex hypotheses" are more common: in order to find the expected frequencies E_i, one must first estimate one or several quantities (model parameters), and this can be done only from the experimental data. As a result, for "complex hypotheses" the expected frequencies E_i turn out to depend on the observed frequencies O_i and therefore themselves become random variables, varying with the results of the experiment. In the process of fitting the parameters, the Pearson distance decreases: the parameters are chosen so as to improve the agreement between model and experiment. Therefore the number of degrees of freedom must decrease.

How are the model parameters to be estimated? There are many different estimation methods: the "maximum likelihood method", the "method of moments", the "substitution method". However, one can do without any additional tools and find parameter estimates by minimizing the Pearson distance. In the pre-computer era this approach was rarely used: it is inconvenient for manual calculations and, as a rule, does not admit an analytical solution. In computer calculations, numerical minimization is usually easy to perform, and the advantage of this method is its universality. So, according to the "chi-square minimization method", we choose the values of the unknown parameters so that the Pearson distance becomes smallest. (Incidentally, by studying how this distance changes under small displacements about the minimum found, one can estimate the accuracy of the estimates, i.e. construct confidence intervals.) After the parameters and this minimal distance itself have been found, one must again answer the question of whether it is small enough.

The general sequence of actions is as follows:

  1. Model selection (hypothesis H0).
  2. Choice of bins and determination of the vector of observed frequencies O_i.
  3. Estimation of the unknown parameters of the model and construction of confidence intervals for them (for example, by searching for the minimum of the Pearson distance).
  4. Calculation of the expected frequencies E_i.
  5. Comparison of the found value of the Pearson distance X² with the critical chi-square value χ²_crit, the largest value still considered plausible (compatible with H0). We find χ²_crit from the tables by solving the equation

P(χ²_n > χ²_crit) = α,

where α is the "significance level", or "size of the criterion", or "probability of a type I error" (typical value α = 0.05).

Usually the number of degrees of freedom n is calculated by the formula

n = (number of bins) - 1 - (number of estimated parameters)

If X² > χ²_crit, the hypothesis H0 is rejected; otherwise it is accepted. In α·100% of cases (that is, rather rarely), when H0 is in fact true, this method of testing will lead to a "type I error": the hypothesis H0 will be rejected erroneously.

Example. In a study, 10 series of 100 seeds each were examined and the number infected by the green-eyed fly was counted. The data obtained: O_i = (16, 18, 11, 18, 21, 10, 20, 18, 17, 21).

Here the vector of expected frequencies is not known in advance. If the data are homogeneous and follow a binomial distribution, then one parameter is unknown: the proportion p of infected seeds. Note that the original table actually contains not 10 but 20 frequencies, which satisfy 10 constraints: 16 + 84 = 100, ..., 21 + 79 = 100.

X² = (16 - 100p)²/(100p) + (84 - 100(1-p))²/(100(1-p)) + ... + (21 - 100p)²/(100p) + (79 - 100(1-p))²/(100(1-p))

Combining the terms in pairs (as in the coin example), we obtain the form of Pearson's criterion that is usually written down straight away:

X² = (16 - 100p)²/(100p(1-p)) + ... + (21 - 100p)²/(100p(1-p)).

Now, if we use the minimum of the Pearson distance as the method for estimating p, we must find the p for which X² = min. (The model tries to "fit" the experimental data as closely as possible.)
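
A minimal numerical sketch of this "chi-square minimization method" in R (the counts are the infected-seed data above; the search interval is an assumption):

    O  <- c(16, 18, 11, 18, 21, 10, 20, 18, 17, 21)  # infected per series of 100
    X2 <- function(p) sum((O - 100 * p)^2 / (100 * p * (1 - p)))
    fit <- optimize(X2, interval = c(0.01, 0.99))
    fit$minimum    # estimate of p, close to mean(O)/100 = 0.17
    fit$objective  # minimal Pearson distance, to be judged against chi-square tables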

Pearson's criterion is the most universal of all used in statistics. It can be applied to one-dimensional and multidimensional data, to quantitative and qualitative features. However, precisely because of this universality, one must take care not to make mistakes.

Important points

1. Selection of the bins.

  • If the distribution is discrete, there is usually no arbitrariness in the choice of bins.
  • If the distribution is continuous, arbitrariness is inevitable. Statistically equivalent blocks can be used (all O_i the same, e.g. = 10); the interval lengths are then different. In manual calculations one tried to make the intervals equal. Must the intervals be equal when studying the distribution of a one-dimensional feature? No.
  • The bins must be combined so that the expected (not the observed!) frequencies are not too small (≥ 5). Recall that it is they (the E_i) that stand in the denominators when computing X²! When analyzing one-dimensional features, this rule may be violated in the two extreme bins, down to E_1 = E_M = 1. If the number of bins is large and the expected frequencies are close, X² approximates χ² well even for E_i = 2.

Parameter estimation. The use of "home-made", inefficient estimation methods can lead to inflated values of the Pearson distance.

Choosing the correct number of degrees of freedom. If the parameters are estimated not from the frequencies but directly from the data (for example, the arithmetic mean is taken as an estimate of the mean), then the exact number of degrees of freedom n is unknown. It is only known that it satisfies the inequality:

(number of bins - 1 - number of estimated parameters) < n < (number of bins - 1)

Therefore X² must be compared with the critical values χ²_crit computed over this whole range of n.

How should implausibly small chi-square values be interpreted? Should a coin be considered symmetric if in 10,000 tosses it came up heads 5,000 times? Formerly many statisticians believed that H0 should be rejected in this case as well. Now another approach is proposed: accept H0, but subject the data and the method of analysis to an additional check. There are two possibilities: either a too-small Pearson distance means that the increase in the number of model parameters was not accompanied by a proper decrease in the number of degrees of freedom, or the data themselves were falsified (perhaps unintentionally adjusted toward the expected result).

Example. Two researchers, A and B, counted the proportion of recessive homozygotes aa in the second generation of a monohybrid AA × aa cross. According to Mendel's laws this proportion is 0.25. Each researcher performed 5 experiments, and in each experiment 100 organisms were studied.

Results A: 25, 24, 26, 25, 24. The conclusion of the researcher: Mendel's law is true (?).

Results B: 29, 21, 23, 30, 19. Researcher's conclusion: Mendel's law is not true (?).

However, Mendel's law is statistical in nature, and a quantitative analysis of the results reverses the conclusions! Combining the five experiments into one, we arrive at a chi-square distribution with 5 degrees of freedom (a simple hypothesis is being tested):

X²_A = ((25-25)² + (24-25)² + (26-25)² + (25-25)² + (24-25)²)/(100·0.25·0.75) = 0.16

X²_B = ((29-25)² + (21-25)² + (23-25)² + (30-25)² + (19-25)²)/(100·0.25·0.75) = 5.17

The mean value is m[χ²_{n=5}] = 5, the standard deviation σ[χ²_{n=5}] = (2·5)^(1/2) = 3.2.

Therefore, even without consulting tables, it is clear that the value X²_B is typical, while the value X²_A is implausibly small. According to the tables, P(χ²_{n=5} < 0.16) < 0.001.
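
Both tail probabilities are easy to confirm in R:

    pchisq(0.16, df = 5)                      # ≈ 0.0005: implausibly small distance
    pchisq(5.17, df = 5, lower.tail = FALSE)  # ≈ 0.40: a perfectly typical distance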

This example is an adapted version of a real case that occurred in the 1930s (see Kolmogorov's work "On Another Proof of Mendel's Laws"). Curiously, researcher A was a supporter of genetics, and researcher B was its opponent.

Confusion in notation. One must distinguish the Pearson distance, whose calculation requires additional conventions, from the mathematical concept of a chi-square random variable. Under certain conditions the Pearson distance has a distribution close to chi-square with n degrees of freedom. It is therefore advisable NOT to denote the Pearson distance by the symbol χ²_n, but to use the similar yet distinct designation X².

Pearson's criterion is not omnipotent. There are infinitely many alternatives to H0 that it cannot take into account. Suppose we test the hypothesis that a feature had a uniform distribution, we have 10 bins, and the vector of observed frequencies is (130, 125, 121, 118, 116, 115, 114, 113, 111, 110). Pearson's criterion cannot "notice" that the frequencies decrease monotonically, and H0 will not be rejected. If we supplement it with the runs test, it will be!

The chi-square test of independence is used to determine whether there is a relationship between two categorical variables. Examples of pairs of categorical variables: marital status vs. the respondent's level of employment; dog breed vs. the owner's profession; salary level vs. an engineer's specialization, etc. When calculating the test of independence, the hypothesis of no relationship between the variables is tested. The calculations will be performed using the MS EXCEL 2010 function CHISQ.TEST() and ordinary formulas.

Suppose we have sample data representing the results of a survey of 500 people. The people were asked 2 questions: about their marital status (married, civil marriage, not in a relationship) and about their level of employment (full-time, part-time, temporarily not working, homemaker, retired, studying). All the answers were placed in a table:

This table is called a contingency table. The elements at the intersections of the rows and columns of the table are usually denoted O_ij (from the English Observed, i.e. the observed, actual frequencies).

We are interested in the question "Does marital status affect employment?", i.e. is there a relationship between the two ways of classifying the sample?

In hypothesis testing of this kind it is usually assumed that the null hypothesis asserts the absence of any dependence between the classifications.

Consider limiting cases. An example of the complete dependence of two categorical variables is the following survey result:

In this case marital status unambiguously determines employment (see the example file, sheet Explanation). Conversely, another survey result is an example of complete independence:

Note that the percentage employed in this case does not depend on marital status (it is the same for the married and the unmarried). This corresponds exactly to the wording of the null hypothesis. If the null hypothesis is true, the survey results should be distributed in the table so that the percentage employed is the same regardless of marital status. Using this, let us calculate survey results that correspond to the null hypothesis (see the example file, sheet Example).

First, we calculate an estimate of the probability that an element of the sample has a given employment level (see column u_i):

u_i = (Σ_{j=1..c} O_ij) / n,

where c is the number of columns, equal to the number of levels of the variable "Marital status".

Then we calculate an estimate of the probability that an element of the sample has a given marital status (see row v_j):

v_j = (Σ_{i=1..r} O_ij) / n,

where r is the number of rows, equal to the number of levels of the variable "Employment".

The theoretical frequency for each cell E ij (from the English Expected, i.e. the expected frequency) in the case of independence of the variables is calculated by the formula:
E_ij = n * u_i * v_j

The test statistic is X²_0 = Σ_i Σ_j (O_ij - E_ij)²/E_ij. It is known that for large n this statistic has approximately the χ² distribution with (r-1)(c-1) degrees of freedom (df):

If the value of this statistic calculated from the sample is "too large" (greater than a threshold), the null hypothesis is rejected. The threshold can be computed, for example, with the formula =CHISQ.INV.RT(0.05, df).

Note: the significance level is usually taken equal to 0.1, 0.05 or 0.01.

In hypothesis testing it is also convenient to calculate the p-value, which we compare with the significance level. The p-value is calculated using the χ² distribution with (r-1)·(c-1) = df degrees of freedom.

If the probability that a random variable with the χ² distribution with (r-1)(c-1) degrees of freedom takes a value greater than the computed statistic X²_0, i.e. P(χ²_{(r-1)(c-1)} > X²_0), is less than the significance level, then the null hypothesis is rejected.

In MS EXCEL the p-value can be calculated with the formula =CHISQ.DIST.RT(X2_0, df), having of course first computed the value of the statistic X²_0 (this is done in the example file). However, it is most convenient to use the CHISQ.TEST() function. References to the ranges containing the actual (Observed) and the computed theoretical (Expected) frequencies are passed as arguments to this function.
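
For comparison, the same computation outside MS EXCEL, as an R sketch with hypothetical counts (the survey table itself is not reproduced in this text):

    O  <- matrix(c(20, 30, 10,
                   40, 45, 25), nrow = 2, byrow = TRUE)  # hypothetical 2 x 3 table
    E  <- outer(rowSums(O), colSums(O)) / sum(O)         # E_ij = n * u_i * v_j
    X2 <- sum((O - E)^2 / E)
    df <- (nrow(O) - 1) * (ncol(O) - 1)
    pchisq(X2, df, lower.tail = FALSE)  # p-value, analogous to CHISQ.DIST.RT(X2, df)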

If the significance level > p-value, this means that the actual frequencies and the theoretical frequencies computed under the assumption that the null hypothesis is true differ substantially. Hence the null hypothesis must be rejected.

Using the CHISQ.TEST() function speeds up the hypothesis-testing procedure, since there is no need to compute the value of the statistic. It is enough to compare the result of CHISQ.TEST() with the chosen significance level.

Note: the CHISQ.TEST() function appeared in MS EXCEL 2010. Its earlier version, CHITEST(), available in MS EXCEL 2007, has the same functionality. As with CHISQ.TEST(), the theoretical frequencies must be computed separately.

The specific formulation of the hypothesis being tested varies from case to case.

In this post I will describe how the \(\chi^2\) criterion works, using a (hypothetical) example from immunology. Imagine that we have performed an experiment to establish how effectively the development of a microbial disease is suppressed when the corresponding antibodies are introduced into the body. In total, 111 mice took part in the experiment; we divided them into two groups of 57 and 54 animals, respectively. The first group of mice received injections of pathogenic bacteria followed by injections of blood serum containing antibodies against these bacteria. The animals of the second group served as controls: they received only the bacterial injections. After some incubation time it turned out that 38 mice had died and 73 had survived. Of the dead, 13 belonged to the first group and 25 to the second (control) group. The null hypothesis tested in this experiment can be formulated as follows: administration of serum with antibodies has no effect on the survival of the mice. In other words, we claim that the observed differences in the survival rate of the mice (77.2% in the first group versus 53.7% in the second) are entirely random and are not connected with the action of the antibodies.

The data obtained in the experiment can be presented in the form of a table:

                   Died   Survived   Total
Bacteria + serum    13       44        57
Only bacteria       25       29        54
Total               38       73       111

Tables like this are called contingency tables. In this example the table has dimension 2×2: there are two classes of objects ("bacteria + serum" and "bacteria only") examined according to two criteria ("died" and "survived"). This is the simplest case of a contingency table: of course, both the number of classes studied and the number of features can be larger.

To test the null hypothesis formulated above, we need to know what the situation would be if the antibodies did not really have any effect on the survival of the mice. In other words, you need to calculate expected frequencies for the corresponding cells of the contingency table. How to do it? In the experiment, a total of 38 mice died, which is 34.2% of the total number of animals involved. If the administration of antibodies does not affect the survival of the mice, the same percentage of mortality should be observed in both experimental groups, namely 34.2%. Calculating how much is 34.2% of 57 and 54, we get 19.5 and 18.5. These are the expected mortality rates in our experimental groups. The expected survival rates are calculated in a similar way: since 73 mice survived in total, or 65.8% of their total number, the expected survival rates will be 37.5 and 35.5. Let's compose a new contingency table, now with the expected frequencies:

                   Died   Survived   Total
Bacteria + serum   19.5     37.5       57
Only bacteria      18.5     35.5       54
Total              38       73        111

As you can see, the expected frequencies differ considerably from the observed ones, i.e. the administration of antibodies does seem to affect the survival of mice infected with the pathogenic microorganism. We can quantify this impression using Pearson's goodness-of-fit test \(\chi^2\):

\[\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e},\]


where \(f_o\) and \(f_e\) are the observed and expected frequencies, respectively. The summation is over all cells of the table. So, for the example under consideration, we have

\[\chi^2 = (13 - 19.5)^2/19.5 + (44 - 37.5)^2/37.5 + (25 - 18.5)^2/18.5 + (29 - 35.5)^2/35.5 \approx 6.79\]

Is the obtained value of \(\chi^2\) large enough to reject the null hypothesis? To answer this question we must find the corresponding critical value of the criterion. The number of degrees of freedom for \(\chi^2\) is calculated as \(df = (R - 1)(C - 1)\), where \(R\) and \(C\) are the numbers of rows and columns of the contingency table. In our case \(df = (2 - 1)(2 - 1) = 1\). Knowing the number of degrees of freedom, we can easily find the critical value of \(\chi^2\) using the standard R function qchisq():
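
(The call itself was omitted from the text; a minimal reconstruction, where 0.95 corresponds to the 5% significance level:)

    qchisq(p = 0.95, df = 1)
    # [1] 3.841459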


Thus, with one degree of freedom, the value of the criterion \(\chi^2\) exceeds 3.841 in only 5% of cases. The value we obtained, 6.79, substantially exceeds this critical value, which gives us the right to reject the null hypothesis of no relationship between the administration of antibodies and the survival of the infected mice. In rejecting this hypothesis, we risk being wrong with probability less than 5%.

It should be noted that the above formula for the \(\chi^2\) criterion gives somewhat inflated values when working with 2×2 contingency tables. The reason is that the distribution of the \(\chi^2\) criterion itself is continuous, while the frequencies of binary features ("died"/"survived") are discrete by definition. Therefore, when calculating the criterion, it is customary to introduce the so-called continuity correction, or Yates' correction:

\[\chi^2_Y = \sum \frac{(|f_o - f_e| - 0.5)^2}{f_e}.\]
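
In R this analysis is a single function call. The matrix below is a reconstruction of the omitted input (its counts come from the observed contingency table above); passing it to chisq.test() yields the output that follows:

    mice <- matrix(c(13, 44,
                     25, 29),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("bacteria + serum", "bacteria only"),
                                   c("died", "survived")))
    chisq.test(mice)  # for 2x2 tables R applies Yates' correction by default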

"s Chi-squared test with Yates" continuity correction data: mice X-squared = 5.7923, df = 1, p-value = 0.0161


As you can see, R automatically applies Yates' continuity correction (Pearson's Chi-squared test with Yates' continuity correction). The value of \(\chi^2\) calculated by the program was 5.7923. We can reject the null hypothesis of no antibody effect at the risk of being wrong with a probability of just over 1% (p-value = 0.0161).

The Pearson χ² test is a nonparametric method that allows one to assess the significance of the differences between the actual number of outcomes or qualitative characteristics of the sample falling into each category (as found by the study) and the theoretical number that would be expected in the groups under study if the null hypothesis were true. Simply put, the method allows one to assess the statistical significance of differences between two or more relative indicators (frequencies, proportions).

1. History of the development of the χ² criterion

The chi-square test for the analysis of contingency tables was developed and proposed in 1900 by Karl Pearson (1857-1936), an English mathematician, statistician, biologist and philosopher, the founder of mathematical statistics and one of the founders of biometrics.

2. What is the Pearson χ² test used for?

The chi-square test can be applied to the analysis of contingency tables containing information on the frequency of outcomes depending on the presence of a risk factor. For instance, a four-field contingency table looks as follows:

                          Outcome present (1)   No outcome (0)   Total
Risk factor present (1)           A                   B           A + B
No risk factor (0)                C                   D           C + D
Total                           A + C               B + D     A + B + C + D

How to fill in such a contingency table? Let's look at a small example.

A study of the effect of smoking on the risk of developing arterial hypertension is being carried out. Two groups of subjects were selected: the first included 70 people who smoke at least one pack of cigarettes a day, the second 80 nonsmokers of the same age. In the first group, 40 people had high blood pressure; in the second, arterial hypertension was observed in 32 people. Accordingly, normal blood pressure was found in 30 people in the group of smokers (70 - 40 = 30) and in 48 people in the group of nonsmokers (80 - 32 = 48).

We enter the initial data into a four-field contingency table:

             Hypertension (1)   Normal BP (0)   Total
Smokers            40                30           70
Nonsmokers         32                48           80
Total              72                78          150

In the resulting contingency table, each row corresponds to a particular group of subjects, and the columns show the numbers of persons with arterial hypertension or with normal arterial pressure.

The researcher's question is: are there statistically significant differences in the frequency of persons with elevated blood pressure between smokers and nonsmokers? The question can be answered by calculating Pearson's chi-square test and comparing the resulting value with the critical one.

3. Conditions and limitations of the Pearson chi-square test

  1. The indicators compared should be measured on a nominal scale (for example, the patient's sex: male or female) or on an ordinal scale (for example, the degree of arterial hypertension, taking values from 0 to 3).
  2. The method allows the analysis not only of four-field tables, where both the factor and the outcome are binary variables, i.e. have only two possible values (for example, male or female sex, presence or absence of a certain disease in the history). The Pearson chi-square test can also be applied to the analysis of multi-field tables, where the factor and (or) the outcome take three or more values.
  3. The groups compared should be independent; that is, the chi-square test should not be used for comparing "before and after" observations. In those cases the McNemar test (for comparing two related samples) or Cochran's Q test (for comparing three or more groups) is calculated instead.
  4. When analyzing four-field tables, the expected value in each cell must be at least 10 for the test with the Yates correction. If the expected count in at least one cell is less than 5, the analysis should use Fisher's exact test.
  5. When analyzing multi-field tables, the expected number of observations should not be less than 5 in more than 20% of the cells.

4. How to calculate the Pearson chi-square test?

To calculate the chi-square test, you must:

  1. Calculate the expected value for each cell of the table (row total × column total / grand total).
  2. Find the value of the statistic χ² = Σ(O - E)²/E, summing over all cells of the table.
  3. Compare the value obtained with the critical value for the chosen significance level and the number of degrees of freedom f = (r - 1)(c - 1).

This algorithm is applicable to both four-field and multi-field tables.

5. How to interpret the value of the Pearson chi-square test?

If the obtained value of the χ² criterion is greater than the critical one, we conclude that there is a statistical relationship between the risk factor under study and the outcome, at the appropriate significance level.

6. An example of calculating the Pearson chi-square test

Let us determine the statistical significance of the influence of the smoking factor on the incidence of arterial hypertension according to the table above:

  1. We calculate the expected values for each cell: E11 = 70·72/150 = 33.6; E12 = 70·78/150 = 36.4; E21 = 80·72/150 = 38.4; E22 = 80·78/150 = 41.6.
  2. Find the value of the Pearson chi-square test:

    χ² = (40 - 33.6)²/33.6 + (30 - 36.4)²/36.4 + (32 - 38.4)²/38.4 + (48 - 41.6)²/41.6 = 4.396.

  3. The number of degrees of freedom f = (2-1) * (2-1) = 1. From the table, we find the critical value of the Pearson chi-square test, which, at a significance level of p = 0.05 and a number of degrees of freedom of 1, is 3.841.
  4. We compare the obtained value of the chi-square test with the critical one: 4.396> 3.841, therefore, the dependence of the incidence of arterial hypertension on the presence of smoking is statistically significant. The significance level of this relationship corresponds to p<0.05.
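
The same result can be reproduced in R (a sketch; correct = FALSE matches the manual calculation above, which did not use the Yates correction):

    smoking <- matrix(c(40, 30,
                        32, 48),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(c("smokers", "nonsmokers"),
                                      c("hypertension", "normal BP")))
    chisq.test(smoking, correct = FALSE)
    # X-squared = 4.3962, df = 1, p-value ≈ 0.036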