Testing the hypothesis about the normal distribution of the general population according to the Pearson test. Pearson's goodness-of-fit test (chi-square test)

Consider the application in MS EXCEL of the Pearson chi-square test for testing simple hypotheses.

After experimental data have been obtained (i.e. when there is some sample), a distribution law is usually chosen that best describes the random variable represented by the sample. Checking how well the experimental data are described by the chosen theoretical distribution law is carried out using goodness-of-fit tests. The null hypothesis is usually the hypothesis that the distribution of the random variable coincides with some theoretical law.

Let us first consider the application of Pearson's goodness-of-fit test X2 (chi-square) to simple hypotheses (the parameters of the theoretical distribution are assumed known). Then we consider the case when only the shape of the distribution is specified, and the parameters of this distribution and the value of the X2 statistic are estimated / calculated from the same sample.

Note: In the English-language literature this procedure is called the chi-square goodness of fit test.

Recall the hypothesis testing procedure:

  • based on the sample, the value of a statistic is calculated that corresponds to the type of hypothesis being tested. For example, when testing a hypothesis about the mean, the t-statistic is used (if the variance is not known);
  • if the null hypothesis is true, the distribution of this statistic is known and can be used to calculate probabilities (for the t-statistic this is the Student distribution);
  • the value of the statistic calculated from the sample is compared with the critical value for the given significance level;
  • the null hypothesis is rejected if the value of the statistic is greater than the critical value (or, equivalently, if the probability of obtaining this value of the statistic or a more extreme one (the p-value) is smaller than the significance level).
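The steps above can be sketched in code. Below is a minimal Python sketch using a one-sample t-test (a hypothesis about the mean with unknown variance) as the example; the sample values and the hypothesized mean are hypothetical:

```python
# Sketch of the hypothesis-testing procedure above, using a one-sample
# t-test (hypothesis about the mean with unknown variance) as the example.
from scipy import stats

sample = [5.1, 4.9, 5.0, 5.2, 4.8]   # hypothetical measurements
mu0 = 5.0                            # hypothesized population mean
alpha = 0.05                         # significance level

# step 1: compute the test statistic from the sample
# step 2: under H0 it follows Student's t-distribution with n - 1 df,
#         which scipy uses to compute the p-value
t_stat, p_value = stats.ttest_1samp(sample, mu0)

# steps 3-4: reject H0 if the p-value is smaller than the significance level
reject = p_value < alpha
print(t_stat, p_value, reject)
```

For this sample the mean equals the hypothesized value, so the statistic is near zero and the null hypothesis is not rejected.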

We will carry out hypothesis testing for different distributions.

Discrete case

Suppose two people are playing dice, each with a different set of dice. The players take turns rolling 3 dice at once; each round is won by the one who rolls more sixes. The results are recorded. After 100 rounds, one of the players suspected that his opponent's dice were asymmetrical, because the opponent often wins (often throws sixes). He decided to analyze how likely such a number of the opponent's outcomes is.

Note: Because 3 dice are rolled, 0, 1, 2 or 3 sixes can appear in one round, i.e. the random variable can take 4 values.

From probability theory we know that if the dice are symmetric, the number of sixes obeys the binomial distribution B(3; 1/6). Therefore, after 100 rounds the expected frequencies of sixes can be calculated using the formula
= BINOM.DIST (A7; 3; 1/6; FALSE) * 100

The formula assumes that cell A7 contains the corresponding number of sixes rolled in one round.
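The same theoretical frequencies can be computed outside Excel; a minimal Python equivalent of the formula above, using scipy's binomial distribution, is:

```python
# Expected frequencies of 0, 1, 2, 3 sixes per round over 100 rounds:
# the equivalent of =BINOM.DIST(k; 3; 1/6; FALSE) * 100.
from scipy.stats import binom

n_rounds = 100
expected = [binom.pmf(k, 3, 1/6) * n_rounds for k in range(4)]
print([round(e, 2) for e in expected])   # → [57.87, 34.72, 6.94, 0.46]
```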

Note: Calculations are given in the example file on the Discrete sheet.

To compare the observed (Observed) and theoretical (Expected) frequencies it is convenient to use a histogram.

With a significant deviation of the observed frequencies from the theoretical ones, the null hypothesis about the distribution of the random variable according to the theoretical law should be rejected. That is, if the opponent's dice are asymmetrical, the observed frequencies will differ "significantly" from the binomial distribution.

In our case, at first glance, the frequencies are quite close and it is difficult to draw an unambiguous conclusion without calculations. We apply Pearson's goodness-of-fit test X2, so that instead of the subjective statement "differ significantly", which could be made from a comparison of the histograms, we use a mathematically correct statement.

We use the fact that, by the law of large numbers, the observed relative frequency (Observed) tends, as the sample size n grows, to the probability corresponding to the theoretical law (in our case, the binomial law). In our case the sample size n is 100.

We introduce the test statistic, which we denote X2:

X2 = Σ (O l − E l)2 / E l, l = 1, …, L,

where O l is the observed frequency of the event that the random variable took the l-th of its admissible values, E l is the corresponding theoretical frequency (Expected), and L is the number of values the random variable can take (in our case, 4).

As can be seen from the formula, this statistic is a measure of the closeness of the observed frequencies to the theoretical ones, i.e. it can be used to estimate the "distances" between these frequencies. If the sum of these "distances" is "too large", the frequencies are "significantly different". Clearly, if our dice are symmetric (i.e. the binomial law applies), the probability that the sum of the "distances" is "too large" will be small. To calculate this probability we need to know the distribution of the statistic X2 (the statistic X2 is calculated from a random sample, so it is itself a random variable and therefore has its own probability distribution).
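The statistic is a simple sum, as a short Python sketch shows; the observed frequencies here are hypothetical, and the expected ones are the binomial frequencies for 100 rounds:

```python
# X2 = sum over all values of (observed - expected)^2 / expected
def chi_square_stat(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [50, 35, 12, 3]               # hypothetical counts of 0..3 sixes
expected = [57.87, 34.72, 6.94, 0.46]    # binomial frequencies for 100 rounds
print(round(chi_square_stat(observed, expected), 3))   # → 18.787
```

Note how the last term dominates: a small expected frequency in the denominator makes the statistic very sensitive to rare outcomes.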

From a multidimensional analogue of the de Moivre-Laplace integral theorem it is known that, as n → ∞, our random variable X2 asymptotically follows the chi-square distribution with L − 1 degrees of freedom.

So, if the calculated value of the statistic X2 (the sum of the "distances" between the frequencies) is greater than a certain limit value, we will have reason to reject the null hypothesis. As with testing parametric hypotheses, the limit value is set via the significance level. If the probability that the statistic X2 takes a value greater than or equal to the calculated one (the p-value) is less than the significance level, the null hypothesis can be rejected.

In our case the statistic equals 22.757. The probability that the X2 statistic takes a value greater than or equal to 22.757 is very small (0.000045) and can be calculated with the formulas
= CHISQ.DIST.RT(22.757; 4-1) or
= CHISQ.TEST(Observed; Expected)
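The same p-value can be computed with scipy's chi-square distribution; the right-tail probability is the analogue of the first Excel formula:

```python
# P(X2 >= 22.757) for the chi-square distribution with 4 - 1 = 3 df
from scipy.stats import chi2

p_value = chi2.sf(22.757, df=4 - 1)
print(p_value)   # ≈ 0.000045
```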

Note: The CHISQ.TEST() function is specially designed to test the association between two categorical variables.

The probability 0.000045 is significantly less than the usual significance level of 0.05, so the player has every reason to suspect his opponent of dishonesty (the null hypothesis of his honesty is rejected).

When applying the X2 criterion it is necessary to make sure that the sample size n is large enough, otherwise the chi-square approximation of the distribution of the X2 statistic will be inaccurate. It is usually assumed that it is sufficient for the observed frequencies (Observed) to be greater than 5. If this is not the case, small frequencies are combined with each other or joined to other frequencies; the combined value is assigned the total probability, and the number of degrees of freedom of the X2 distribution is reduced accordingly.

To improve the quality of the X2 criterion, the partition intervals should be made smaller (increasing L and, accordingly, the number of degrees of freedom); however, this is hindered by the limitation on the number of observations falling into each interval (> 5).

Continuous case

Pearson's goodness-of-fit test X2 can be applied in the same way in the continuous case.

Consider a sample consisting of 200 values. The null hypothesis states that the sample is drawn from the standard normal distribution N(0; 1).

Note: The random values in the example file on the Continuous worksheet are generated with the formula = NORM.S.INV(RAND()). Therefore, new sample values are generated each time the sheet is recalculated.

Whether the available dataset is consistent with the assumed distribution can first be assessed visually, for example with a normal probability plot.

As can be seen from the diagram, the sample values fit fairly well along the straight line. Nevertheless, for formal hypothesis testing we apply Pearson's goodness-of-fit criterion X2.

To do this, we divide the range of variation of the random variable into intervals with a step of 0.5 and calculate the observed and theoretical frequencies: the observed frequencies with the FREQUENCY() function, the theoretical ones with the NORM.S.DIST() function.

Note: As in the discrete case, it is necessary to ensure that the sample is large enough and that more than 5 values fall into each interval.

We calculate the statistic X2 and compare it with the critical value for a given significance level (0.05). Because we have divided the range of variation of the random variable into 10 intervals, the number of degrees of freedom is 9. The critical value can be calculated by the formula
= CHISQ.INV.RT(0.05; 9) or
= CHISQ.INV(1-0.05; 9)
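The same critical value in Python, via scipy's inverse of the chi-square CDF:

```python
# critical value for significance level 0.05 and 9 degrees of freedom,
# equivalent to =CHISQ.INV.RT(0.05; 9)
from scipy.stats import chi2

critical = chi2.ppf(1 - 0.05, df=9)
print(round(critical, 3))   # → 16.919
```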

The diagram above shows that the statistic equals 8.19, which is noticeably lower than the critical value, so the null hypothesis is not rejected.

Below is shown a case where the sample took an unlikely value and, based on Pearson's goodness-of-fit criterion X2, the null hypothesis was rejected (although the random values were generated with the formula = NORM.S.INV(RAND()), which produces a sample from the standard normal distribution).

The null hypothesis is rejected, although visually the data lie quite close to a straight line.

As another example, take a sample from the uniform distribution U(-3; 3). In this case it is obvious even from the graph that the null hypothesis must be rejected.

Pearson's goodness-of-fit criterion X2 also confirms that the null hypothesis must be rejected.

A goodness-of-fit criterion is used for testing a hypothesis about the distribution law of the random variable under study. In many practical problems the exact distribution law is unknown, so a hypothesis is put forward that the empirical law, built from observations, corresponds to some theoretical one. This hypothesis requires statistical testing, as a result of which it will be either confirmed or refuted.

Let X be the random variable under study. It is required to test the hypothesis H 0 that this random variable obeys the distribution law F(x). To do this, a sample of n independent observations is made and an empirical distribution law F*(x) is constructed from it. To compare the empirical and hypothetical laws, a rule called a goodness-of-fit criterion is used. One of the most popular is Pearson's chi-square goodness-of-fit test.

It calculates the chi-square statistic:

χ2 = n · Σ (p ei − p ti)2 / p ti, i = 1, …, N,

where N is the number of intervals over which the empirical distribution law was constructed (the number of columns of the corresponding histogram), i is the interval number, p ti is the probability of the random variable falling into the i-th interval under the theoretical distribution law, and p ei is the observed proportion of values falling into the i-th interval under the empirical distribution. Under H 0 this statistic asymptotically obeys the chi-square distribution.

If the calculated value of the statistic exceeds the quantile of the chi-square distribution with k − p − 1 degrees of freedom for a given significance level, the hypothesis H 0 is rejected; otherwise it is accepted at that significance level. Here k is the number of intervals and p is the number of estimated parameters of the distribution law.

Pearson's criterion allows one to compare the empirical and theoretical (or two empirical) distributions of one feature. It is mainly applied in two cases:

To compare the empirical distribution of a trait with a theoretical distribution (normal, exponential, uniform, or some other law);

To compare two empirical distributions of the same feature.

The idea of the method is to measure the degree of divergence of the corresponding frequencies n i and n' i: the greater the discrepancy, the greater the value of the statistic.

The sample sizes must be at least 50, and equality of the sums of frequencies is required.

Null hypothesis H 0: the two distributions practically do not differ from each other; alternative hypothesis H 1: the discrepancy between the distributions is significant.

Let us give a scheme for applying the criterion for comparing two empirical distributions:

The criterion is a statistical test of the hypothesis that the observed random variable obeys a certain theoretical distribution law.


Depending on the value of the criterion, the hypothesis can be accepted or rejected:

§ if the value of the statistic falls into the central part of the distribution, the hypothesis is accepted;

§ if the value falls into the left "tail" of the distribution, the theoretical and observed frequencies are suspiciously close. If, for example, a random number generator that produced n numbers from a segment is checked against the hypothesis that the sample is uniformly distributed on it, then the generator cannot be called random (the randomness hypothesis is not fulfilled), since the sample is too evenly distributed; formally, however, the hypothesis holds;

§ if the value falls into the right "tail" of the distribution, the hypothesis is rejected.

Definition: let a random variable X be given.

Hypothesis: the r.v. X obeys the distribution law F(x).

To test the hypothesis, consider a sample of n independent observations of the r.v. X. From the sample we construct the empirical distribution of X. Comparison of the empirical and theoretical distributions (the one assumed in the hypothesis) is performed using a specially selected goodness-of-fit statistic. Consider Pearson's goodness-of-fit test (criterion):

Hypothesis: the sample X 1, …, X n is generated by the distribution function F(x).

Divide the range of values of X into k disjoint intervals;

Let n j be the number of observations in the j-th interval;

Let p j be the probability of an observation falling into the j-th interval when the hypothesis holds;

Then n·p j is the expected number of hits in the j-th interval;

Statistic: χ2 = Σ (n j − n·p j)2 / (n·p j), j = 1, …, k, which asymptotically follows the chi-square distribution with k − 1 degrees of freedom.

The criterion is inaccurate on samples with low-frequency (rare) events. This problem can be solved by discarding such events or by combining them with other events; for small counts (e.g. in 2×2 tables) Yates' continuity correction is also applied.

Pearson's goodness-of-fit test (χ 2) is used to test the hypothesis that the empirical distribution corresponds to the assumed theoretical distribution F(x) for a large sample size (n ≥ 100). The criterion is applicable for any kind of function F(x), even for unknown values of its parameters, which is usually the situation when analyzing the results of mechanical tests. This is its versatility.

The use of the χ 2 criterion requires dividing the range of variation of the sample into e intervals and determining the number of observations (the frequency) n j in each of them. For convenience of estimating the distribution parameters, the intervals are chosen of equal length.

The number of intervals depends on the sample size. Usually one takes: for n = 100, e = 10-15; for n = 200, e = 15-20; for n = 400, e = 25-30; for n = 1000, e = 35-40.

Intervals containing fewer than five observations are combined with adjacent ones. However, if the number of such intervals is less than 20% of their total number, intervals with a frequency of n j ≥ 2 are allowed.

The statistic of Pearson's criterion is the value

χ2 = Σ (n j − n·p j)2 / (n·p j), j = 1, …, e, (3.91)

where p j is the probability of the studied random variable falling into the j-th interval, calculated in accordance with the hypothetical distribution law F(x). When calculating the probabilities p j, it should be borne in mind that the left boundary of the first interval and the right boundary of the last one must coincide with the boundaries of the range of possible values of the random variable. For example, for the normal distribution the first interval extends to -∞ and the last to +∞.

The null hypothesis that the sample distribution corresponds to the theoretical law F(x) is checked by comparing the value calculated by formula (3.91) with the critical value χ 2 α found from Table VI of the appendices for the significance level α and the number of degrees of freedom k = e 1 - m - 1. Here e 1 is the number of intervals after merging, and m is the number of parameters estimated from the considered sample. If the inequality
χ 2 ≤ χ 2 α (3.92)
is satisfied, the null hypothesis is not rejected. Otherwise, the alternative hypothesis is accepted that the sample belongs to an unknown distribution.

The disadvantage of Pearson's goodness-of-fit criterion is the loss of part of the initial information due to the need to group the observations into intervals and to merge individual intervals with a small number of observations. It is therefore recommended to supplement the χ 2 check of the correspondence of distributions with other criteria, especially for relatively small samples (n ≈ 100).

The table shows the critical values of the chi-square distribution for a given number of degrees of freedom. The desired value is located at the intersection of the column with the corresponding probability and the row with the number of degrees of freedom. For example, the critical value of the chi-square distribution with 4 degrees of freedom for a probability of 0.25 is 5.38527. This means that the area under the density curve of the chi-square distribution with 4 degrees of freedom to the right of 5.38527 equals 0.25.

Pearson's correlation criterion is a method of parametric statistics that allows one to determine the presence or absence of a linear relationship between two quantitative indicators, and to assess its closeness and statistical significance. In other words, the Pearson correlation test determines whether there is a linear relationship between changes in the values of two variables. In statistical calculations and conclusions the correlation coefficient is usually denoted r xy or R xy.

1. History of the development of the correlation criterion

Pearson's correlation criterion was developed by a team of British scientists led by Karl Pearson (1857-1936) in the 1890s to simplify the analysis of the covariance of two random variables. Besides Karl Pearson, Francis Edgeworth and Raphael Weldon also worked on the criterion.

2. What is the Pearson correlation test used for?

The Pearson correlation criterion makes it possible to determine the tightness (or strength) of the correlation between two indicators measured on a quantitative scale. With additional calculations, one can also determine how statistically significant the identified relationship is.

For example, using the Pearson correlation criterion one can answer whether there is a relationship between body temperature and the leukocyte count in the blood in acute respiratory infections, between a patient's height and weight, or between the fluoride content of drinking water and the incidence of caries in the population.

3. Conditions and limitations of the Pearson correlation test

  1. The compared indicators should be measured on a quantitative scale (for example, heart rate, body temperature, white blood cell count in 1 ml of blood, systolic blood pressure).
  2. The Pearson correlation criterion determines only the presence and strength of a linear relationship between quantities. Other characteristics of the relationship, including its direction (direct or inverse), the nature of change (linear or curvilinear), and the dependence of one variable on another, are determined using regression analysis.
  3. The number of compared variables should be two. For analyzing the relationships of three or more parameters, factor analysis should be used.
  4. Pearson's correlation criterion is parametric, so a condition of its application is a normal distribution of the compared variables. If a correlation analysis is needed for indicators whose distribution differs from normal, including those measured on an ordinal scale, Spearman's rank correlation coefficient should be used.
  5. The concepts of dependence and correlation must be clearly distinguished: dependence of quantities implies a correlation between them, but not vice versa.

For example, a child's height depends on his age: the older the child, the taller he is. If we take two children of different ages, then with a high probability the older child will be taller than the younger one. This phenomenon is called dependence, implying a causal relationship between the indicators. Of course, there is also a correlation between them, meaning that changes in one indicator are accompanied by changes in the other.

In another situation, consider the relationship between a child's height and heart rate (HR). Both of these values depend directly on age, so in most cases taller (and therefore older) children will have lower HR values. That is, a correlation will be observed, and it may be quite strong. However, if we take children of the same age but of different heights, their HR will most likely differ insignificantly, from which we can conclude that HR is independent of height.

This example shows how important it is to distinguish between the fundamental statistical concepts of association and dependence in order to draw correct conclusions.

4. How to calculate the Pearson correlation coefficient?

The Pearson correlation coefficient is calculated using the following formula:

r xy = Σ (x i − x̄)(y i − ȳ) / √( Σ (x i − x̄)2 · Σ (y i − ȳ)2 )

5. How to interpret the value of the Pearson correlation coefficient?

The values of the Pearson correlation coefficient are interpreted on the basis of its absolute value. The possible values of the correlation coefficient range from 0 to ±1. The greater the absolute value of r xy, the closer the relationship between the two values. r xy = 0 indicates a complete absence of a relationship; r xy = 1 indicates an absolute (functional) relationship. If the calculated value of the Pearson correlation criterion is greater than 1 or less than -1, an error has been made in the calculations.

To assess the tightness, or strength, of the correlation, generally accepted criteria are usually used: absolute values r xy < 0.3 indicate a weak relationship, values of r xy from 0.3 to 0.7 a relationship of medium tightness, and values r xy > 0.7 a strong relationship.

A more accurate estimate of the strength of the correlation can be obtained using the Chaddock table:

The statistical significance of the correlation coefficient r xy is assessed using the t-criterion, calculated by the following formula:

t r = r xy · √(n − 2) / √(1 − r xy 2)

The obtained value t r is compared with the critical value of the t-distribution at a chosen significance level and n - 2 degrees of freedom. If t r exceeds t crit, a conclusion is drawn that the identified correlation is statistically significant.
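The whole work-flow (coefficient plus significance test) can be sketched in Python; the data below are hypothetical:

```python
# Pearson's r and the t-criterion for its significance (df = n - 2)
import math
from scipy.stats import pearsonr

x = [1, 2, 3, 4, 5]   # hypothetical indicator X
y = [1, 2, 3, 4, 6]   # hypothetical indicator Y

r, p_value = pearsonr(x, y)
n = len(x)
t_r = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(round(r, 3), round(t_r, 2))   # → 0.986 10.39
```

Here t_r far exceeds the critical t-value for 3 degrees of freedom at the 0.05 level (3.182), so the correlation would be judged statistically significant.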

6. An example of calculating the Pearson correlation coefficient

The aim of the study was to identify and assess the tightness and statistical significance of the correlation between two quantitative indicators: the level of testosterone in the blood (X) and the percentage of muscle mass in the body (Y). The initial data for a sample of 5 subjects (n = 5) are summarized in the table.

Pearson's criterion for testing the hypothesis about the form of the distribution law of a random variable. Testing hypotheses about normal, exponential and uniform distributions using Pearson's criterion. The Kolmogorov criterion. An approximate method for checking normality of a distribution based on estimates of the skewness and kurtosis coefficients.

In the previous lecture we considered hypotheses in which the distribution law of the general population was assumed known. Now let us turn to testing hypotheses about the assumed law of an unknown distribution, that is, we will test the null hypothesis that the general population is distributed according to some known law. Statistical criteria for testing such hypotheses are usually called goodness-of-fit tests.

The advantage of the Pearson criterion is its versatility: it can be used to test hypotheses about various distribution laws.

1. Testing the hypothesis of normal distribution.

Let a sample of sufficiently large size n be obtained with a large number of different variate values. For convenience of processing, divide the interval from the smallest to the largest value into s equal parts and assume that the variate values falling into each interval are approximately equal to the number specifying the middle of the interval. Having counted the number of values falling into each interval, we compose the so-called grouped sample:

values: x 1, x 2, …, x s

frequencies: n 1, n 2, …, n s,

where x i are the midpoints of the intervals and n i is the number of variate values falling into the i-th interval (the empirical frequencies).

From these data one can calculate the sample mean x̄ and the sample standard deviation σ B. Let us check the assumption that the general population is distributed according to the normal law with parameters M(X) = x̄, D(X) = σ B 2. Then we can find the number of values out of the sample size n that should fall into each interval under this assumption (the theoretical frequencies). To do this, using the table of values of the Laplace function, we find the probability of falling into the i-th interval:

p i = Φ((b i − x̄)/σ B) − Φ((a i − x̄)/σ B),

where a i and b i are the boundaries of the i-th interval. Multiplying the obtained probabilities by the sample size n, we find the theoretical frequencies: n' i = n·p i. Our goal is to compare the empirical and theoretical frequencies, which of course differ from each other, and to find out whether these differences are insignificant and do not refute the hypothesis of the normal distribution of the random variable under study, or whether they are so large that they contradict this hypothesis. For this, a criterion in the form of a random variable is used:

χ2 = Σ (n i − n' i)2 / n' i, i = 1, …, s. (20.1)

Its meaning is obvious: the summands are the squared deviations of the empirical frequencies from the theoretical ones, divided by the corresponding theoretical frequencies. It can be proved that, regardless of the actual distribution law of the general population, the distribution law of the random variable (20.1) tends, as n → ∞, to the chi-square distribution (see lecture 12) with the number of degrees of freedom k = s - 1 - r, where r is the number of parameters of the assumed distribution estimated from the sample data. The normal distribution is characterized by two parameters, therefore k = s - 3. For the chosen criterion, a right-sided critical region is constructed, determined by the condition


P(χ2 > χ2 cr(α, k)) = α, (20.2)

where α is the significance level. Thus the critical region is given by the inequality χ2 > χ2 cr(α, k), and the region of acceptance of the hypothesis by χ2 ≤ χ2 cr(α, k).

So, to test the null hypothesis H 0 (the general population is normally distributed), one needs to calculate the observed value of the criterion from the sample:

χ2 obs = Σ (n i − n' i)2 / n' i, (20.1`)

and from the table of critical points of the χ 2 distribution find the critical point using the known values of α and k = s - 3. If χ2 obs < χ2 cr, the null hypothesis is accepted; if χ2 obs > χ2 cr, it is rejected.
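The whole scheme of this section (grouped sample, theoretical frequencies from the fitted normal law, comparison of the observed statistic with the critical point at s − 3 degrees of freedom) can be sketched in Python; the bin count and random seed below are arbitrary choices for illustration:

```python
# Chi-square normality check: bin the sample, compute theoretical
# frequencies n'_i = n * p_i from the fitted normal law, and compare the
# observed statistic with the critical point at s - 3 degrees of freedom.
import numpy as np
from scipy.stats import norm, chi2

def normality_chi2(sample, s=6, alpha=0.05):
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    mean, sigma = sample.mean(), sample.std()        # estimated parameters
    edges = np.linspace(sample.min(), sample.max(), s + 1)
    observed, _ = np.histogram(sample, bins=edges)
    z = (edges - mean) / sigma
    z[0], z[-1] = -np.inf, np.inf                    # outer intervals to ±inf
    p = np.diff(norm.cdf(z))                         # p_i for each interval
    chi2_obs = np.sum((observed - n * p) ** 2 / (n * p))
    chi2_crit = chi2.ppf(1 - alpha, df=s - 3)        # two estimated parameters
    return chi2_obs, chi2_crit

rng = np.random.default_rng(0)
obs, crit = normality_chi2(rng.normal(size=500))
print(obs, crit)
```

Note that the standard normal CDF is used in place of the tabulated Laplace function; the interval probabilities are the same.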

2. Testing the hypothesis of uniform distribution.

When using the Pearson test to check the hypothesis that the general population is uniformly distributed with the assumed probability density

f(x) = 1/(b − a) for x in [a; b], and 0 otherwise,

it is necessary, having calculated x̄ and σ B from the available sample, to estimate the parameters a and b by the formulas:

a* = x̄ − √3·σ B, b* = x̄ + √3·σ B, (20.3)

where a* and b* are the estimates of a and b. Indeed, for the uniform distribution M(X) = (a + b)/2 and σ = (b − a)/(2√3), from which one obtains a system for determining a* and b*, whose solution is given by expressions (20.3).
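A minimal numerical check of the moment-based estimates for the uniform law (a* = x̄ − √3·σ B, b* = x̄ + √3·σ B, which follow from M(X) = (a + b)/2 and σ = (b − a)/(2√3)):

```python
# Moment estimates of the uniform-law parameters from the sample mean
# and (population) standard deviation.
import math

def uniform_estimates(mean, sigma):
    a = mean - math.sqrt(3) * sigma
    b = mean + math.sqrt(3) * sigma
    return a, b

# for U(-3; 3): mean = 0, sigma = (3 - (-3)) / (2*sqrt(3)) = sqrt(3)
a, b = uniform_estimates(0.0, math.sqrt(3))
print(round(a, 6), round(b, 6))   # → -3.0 3.0
```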

Then, assuming the density f(x) = 1/(b* − a*), the theoretical frequencies can be found as n' i = n·p i, where p i is the probability of falling into the i-th interval under the uniform law (the interval length divided by b* − a*). Here s is the number of intervals into which the sample is split.

The observed value of the Pearson criterion is calculated by formula (20.1`), and the critical value is found from the table, taking into account that the number of degrees of freedom is k = s - 3. After that, the boundaries of the critical region are determined in the same way as for testing the hypothesis of a normal distribution.

3. Testing the hypothesis about exponential distribution.

In this case, having divided the available sample into intervals of equal length, we consider a sequence of equally spaced variate values (assuming that all values falling into the i-th interval take a value coinciding with its middle) and the corresponding frequencies n i (the number of sample values in the i-th interval). From these data we calculate x̄ and take as the estimate of the parameter λ the value λ* = 1/x̄. Then the theoretical frequencies are calculated by the formula

n' i = n·(e^(−λ*·x i) − e^(−λ*·x i+1)),

where x i and x i+1 are the boundaries of the i-th interval. Then the observed and critical values of the Pearson criterion are compared, taking into account that the number of degrees of freedom is k = s - 2.
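A minimal sketch of the exponential case, with λ estimated as 1/x̄; the interval boundaries and the sample mean below are hypothetical:

```python
# Theoretical frequencies for the exponential law:
# n'_j = n * (exp(-lam * a_j) - exp(-lam * b_j)) for interval [a_j, b_j)
import math

def exponential_expected(n, lam, edges):
    return [n * (math.exp(-lam * a) - math.exp(-lam * b))
            for a, b in zip(edges[:-1], edges[1:])]

lam = 1 / 2.0    # estimate 1 / x_mean for a hypothetical sample mean of 2.0
expected = exponential_expected(100, lam, [0, 1, 2, 3, float("inf")])
print([round(e, 1) for e in expected])
```

The last interval extends to +∞ so that the theoretical frequencies sum to the sample size.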

The width of the interval is:

h = (Xmax − Xmin) / (number of groups) = (60 − 43) / 6 = 2.83,

where Xmax is the maximum value of the grouping attribute in the population and Xmin is the minimum value.
Let us define the boundaries of the groups.

Group number | Lower bound | Upper bound
1 43 45.83
2 45.83 48.66
3 48.66 51.49
4 51.49 54.32
5 54.32 57.15
6 57.15 60

The same value of the feature serves as the upper and lower boundaries of two adjacent (previous and next) groups.
For each value of the series, we calculate how many times it falls into one or another interval. To do this, we sort the row in ascending order.
43 43 - 45.83 1
48.5 45.83 - 48.66 1
49 48.66 - 51.49 1
49 48.66 - 51.49 2
49.5 48.66 - 51.49 3
50 48.66 - 51.49 4
50 48.66 - 51.49 5
50.5 48.66 - 51.49 6
51.5 51.49 - 54.32 1
51.5 51.49 - 54.32 2
52 51.49 - 54.32 3
52 51.49 - 54.32 4
52 51.49 - 54.32 5
52 51.49 - 54.32 6
52 51.49 - 54.32 7
52 51.49 - 54.32 8
52 51.49 - 54.32 9
52.5 51.49 - 54.32 10
52.5 51.49 - 54.32 11
53 51.49 - 54.32 12
53 51.49 - 54.32 13
53 51.49 - 54.32 14
53.5 51.49 - 54.32 15
54 51.49 - 54.32 16
54 51.49 - 54.32 17
54 51.49 - 54.32 18
54.5 54.32 - 57.15 1
54.5 54.32 - 57.15 2
55.5 54.32 - 57.15 3
57 54.32 - 57.15 4
57.5 57.15 - 59.98 1
57.5 57.15 - 59.98 2
58 57.15 - 59.98 3
58 57.15 - 59.98 4
58.5 57.15 - 59.98 5
60 57.15 - 59.98 6

The grouping results are presented in the form of a table:
Groups | Population numbers | Frequency f i
43 - 45.83 1 1
45.83 - 48.66 2 1
48.66 - 51.49 3,4,5,6,7,8 6
51.49 - 54.32 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26 18
54.32 - 57.15 27,28,29,30 4
57.15 - 59.98 31,32,33,34,35,36 6

Table for calculating indicators.
Groups | x i | Quantity, f i | x i * f i | Accumulated frequency, S | |x - x avg| * f | (x - x avg)2 * f | Relative frequency, f i / n
43 - 45.83 44.42 1 44.42 1 8.88 78.91 0.0278
45.83 - 48.66 47.25 1 47.25 2 6.05 36.64 0.0278
48.66 - 51.49 50.08 6 300.45 8 19.34 62.33 0.17
51.49 - 54.32 52.91 18 952.29 26 7.07 2.78 0.5
54.32 - 57.15 55.74 4 222.94 30 9.75 23.75 0.11
57.15 - 59.98 58.57 6 351.39 36 31.6 166.44 0.17
36 1918.73 82.7 370.86 1

To estimate the distribution series, we find the following indicators:
Distribution center indicators.
Weighted average

x̄ = Σ x i·f i / Σ f i = 1918.73 / 36 = 53.3
Mode
The mode is the most frequently occurring value of the feature in the given population. For an interval series:

Mo = x 0 + h·(f 2 − f 1) / ((f 2 − f 1) + (f 2 − f 3)),

where x 0 is the beginning of the modal interval, h is the interval width, f 2 is the frequency of the modal interval, f 1 is the pre-modal frequency, and f 3 is the post-modal frequency.
We choose 51.49 as the beginning of the modal interval, since this interval contains the largest frequency.

Mo = 51.49 + 2.83·(18 − 6) / ((18 − 6) + (18 − 4)) = 52.8

The most frequent value of the series is 52.8.
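The modal-interval formula can be checked on the grouped data of this example (x0 = 51.49, h = 2.83, f1 = 6, f2 = 18, f3 = 4):

```python
# Mode of an interval series: Mo = x0 + h*(f2 - f1)/((f2 - f1) + (f2 - f3))
def mode_grouped(x0, h, f1, f2, f3):
    return x0 + h * (f2 - f1) / ((f2 - f1) + (f2 - f3))

print(round(mode_grouped(51.49, 2.83, 6, 18, 4), 1))   # → 52.8
```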
Median
The median divides the sample into two halves: half of the values are less than the median, half are greater.
In an interval distribution series one can immediately indicate only the interval in which the mode or the median lies. The median corresponds to the value in the middle of the ranked series. The median interval is 51.49 - 54.32, because the accumulated frequency S in this interval is the first to exceed half of the total sum of frequencies.

Me = x 0 + h·(n/2 − S Me−1) / f Me = 51.49 + 2.83·(18 − 8)/18 = 53.06

Thus, 50% of the population units are smaller than 53.06.
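The interval-series median formula, checked on the same grouped data (median interval 51.49 - 54.32, h = 2.83, n = 36, accumulated frequency before the interval 8, interval frequency 18):

```python
# Median of an interval series: Me = x0 + h*(n/2 - S_prev)/f_med
def median_grouped(x0, h, n, s_prev, f_med):
    return x0 + h * (n / 2 - s_prev) / f_med

print(round(median_grouped(51.49, 2.83, 36, 8, 18), 2))   # → 53.06
```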
Variation indicators.
Absolute indicators of variation.
The range of variation is the difference between the maximum and minimum values ​​of the primary series feature.
R = X max - X min
R = 60 - 43 = 17
Average linear deviation - calculated in order to take into account the differences of all units of the studied population:

d = Σ |x i − x̄|·f i / Σ f i = 82.7 / 36 = 2.3

On average, each value of the series differs from the mean by 2.3.
Dispersion - characterizes the measure of scatter of the values around their mean:

σ2 = Σ (x i − x̄)2·f i / Σ f i = 370.86 / 36 = 10.3
Unbiased variance estimate - a consistent estimate of the variance:

s2 = Σ (x i − x̄)2·f i / (n − 1) = 370.86 / 35 = 10.6
Standard deviation:

σ = √10.3 = 3.21

On average, each value of the series differs from the mean value 53.3 by 3.21.
Estimate of the standard deviation:

s = √10.6 ≈ 3.26
Relative rates of variation.
The relative indicators of variation include the oscillation coefficient, the linear coefficient of variation, and the relative linear deviation.
The coefficient of variation is a measure of the relative spread of the population values: it shows what fraction of the mean is accounted for by the average scatter:

v = σ / x̄ · 100% = 3.21 / 53.3 · 100% ≈ 6%

Since v ≤ 30%, the population is homogeneous and the variation is weak. The results obtained can be trusted.
The linear coefficient of variation, or relative linear deviation, characterizes the share of the average absolute deviation in the mean value of the feature.

Testing hypotheses about the type of distribution.
1. Let us check the hypothesis that X is distributed according to the normal law using Pearson's goodness-of-fit test:

K = Σ (n i − n·p i)2 / (n·p i),

where p i is the probability of a random variable distributed according to the hypothetical law falling into the i-th interval.
To calculate the probabilities p i we use the table of the Laplace function Ф and the formula

p i = Ф((x i+1 − x avg)/s) − Ф((x i − x avg)/s),

where s = 3.21, x avg = 53.3.
The theoretical (expected) frequency is n' i = n·p i, where n = 36.
Grouping intervals | Observed frequency n i | x 1 = (x i - x avg)/s | x 2 = (x i+1 - x avg)/s | Ф(x 1) | Ф(x 2) | Probability p i = Ф(x 2) - Ф(x 1) | Expected frequency 36p i | Terms of Pearson statistic, K i
43 - 45.83 1 -3.16 -2.29 -0.5 -0.49 0.01 0.36 1.14
45.83 - 48.66 1 -2.29 -1.42 -0.49 -0.42 0.0657 2.37 0.79
48.66 - 51.49 6 -1.42 -0.56 -0.42 -0.21 0.21 7.61 0.34
51.49 - 54.32 18 -0.56 0.31 -0.21 0.13 0.34 12.16 2.8
54.32 - 57.15 4 0.31 1.18 0.13 0.38 0.26 9.27 3
57.15 - 59.98 6 1.18 2.06 0.38 0.48 0.0973 3.5 1.78
36 9.84
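The statistic from the table can be reproduced and compared with the critical point; with two parameters of the normal law estimated from the sample, the number of degrees of freedom is 6 − 2 − 1 = 3 (small differences from the tabulated total come from rounding of the intermediate terms):

```python
# K_obs = sum (n_i - n*p_i)^2 / (n*p_i) and the 0.05 critical point
from scipy.stats import chi2

observed = [1, 1, 6, 18, 4, 6]
expected = [0.36, 2.37, 7.61, 12.16, 9.27, 3.5]

k_obs = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
k_crit = chi2.ppf(1 - 0.05, df=3)
print(round(k_obs, 2), round(k_crit, 2), k_obs > k_crit)   # → 9.86 7.81 True
```

At the 0.05 level the statistic exceeds the critical value, so on these data the hypothesis of normality would be rejected (note also that several expected frequencies here are below 5, so merging of intervals would be advisable).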

Let us define the boundary of the critical region. Since Pearson's statistic measures the difference between the empirical and theoretical distributions, the larger its observed value K obs, the stronger the argument against the main hypothesis. Therefore, the critical region for this statistic is always right-sided.