What is a confidence interval in statistics. Confidence intervals for frequencies and proportions

Confidence intervals are one of the types of interval estimates used in statistics; they are calculated for a given significance level. They allow us to assert that the true value of an unknown population parameter lies within the obtained range of values with a probability determined by the chosen significance level.

Normal distribution

When the variance (σ²) of the population is known, a z-score can be used to calculate the confidence limits (the boundary points of the confidence interval). Compared with the t-distribution, the z-score gives a narrower confidence interval, because σ is known exactly rather than estimated from the sample and the critical value is taken from the standard normal distribution.

Formula

To determine the boundary points of the confidence interval, provided that the standard deviation of the population of data is known, the following formula is used

$$L_{\text{lower, upper}} = \bar{X} \mp Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$

Example

Assume that the sample size is 25 observations, the sample mean is 15, and the population standard deviation is 8. For a significance level of α=5%, the Z-score is Z α/2 =1.96. In this case, the lower and upper limits of the confidence interval will be

$$L_{\text{lower}} = 15 - 1.96 \cdot \frac{8}{\sqrt{25}} = 11.864$$

$$L_{\text{upper}} = 15 + 1.96 \cdot \frac{8}{\sqrt{25}} = 18.136$$

Thus, we can state with 95% confidence that the mathematical expectation (mean) of the general population lies in the range from 11.864 to 18.136.
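The calculation above can be sketched in a few lines of Python. This is only an illustration: the function name `z_confidence_interval` is our own, and the critical value is taken from the standard library's `statistics.NormalDist` instead of a printed table.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(mean, sigma, n, alpha=0.05):
    """Two-sided interval for the mean when the population sigma is known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    margin = z * sigma / sqrt(n)
    return mean - margin, mean + margin

# Values from the example: n = 25, sample mean 15, sigma = 8, alpha = 5%
lower, upper = z_confidence_interval(mean=15, sigma=8, n=25, alpha=0.05)
print(f"{lower:.3f} {upper:.3f}")
```

With the inputs of the example, the limits agree with 11.864 and 18.136 to three decimal places.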

Methods for narrowing the confidence interval

Suppose this range is too wide for the purposes of our study. There are two ways to narrow the confidence interval.

  1. Reduce the confidence level, that is, increase the significance level α.
  2. Increase the sample size.

Reducing the confidence level to 90% (α=10%), we get a Z-score of $Z_{\alpha/2} = 1.64$ (more precisely, 1.645). In this case, the lower and upper limits of the interval will be

$$L_{\text{lower}} = 15 - 1.64 \cdot \frac{8}{\sqrt{25}} = 12.376$$

$$L_{\text{upper}} = 15 + 1.64 \cdot \frac{8}{\sqrt{25}} = 17.624$$

The confidence interval itself can be written as (12.376; 17.624).

In this case, we can say with 90% confidence that the mathematical expectation of the general population lies in this range.

If we want to keep the significance level α unchanged, the only alternative is to increase the sample size. Increasing it to 144 observations, we obtain the following values of the confidence limits

$$L_{\text{lower}} = 15 - 1.96 \cdot \frac{8}{\sqrt{144}} = 13.693$$

$$L_{\text{upper}} = 15 + 1.96 \cdot \frac{8}{\sqrt{144}} = 16.307$$

The confidence interval itself is (13.693; 16.307).

Thus, narrowing the confidence interval without reducing the confidence level is only possible by increasing the sample size. If the sample size cannot be increased, the interval can be narrowed only by reducing the confidence level (that is, by accepting a larger α).
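To see both effects side by side, the half-width of the interval can be computed for the three cases discussed in this section. The sketch below is illustrative only; `half_width` is a name introduced here. It uses the exact normal quantile, so the 90% case comes out as about 2.632 instead of the 2.624 obtained with the rounded table value 1.64.

```python
from math import sqrt
from statistics import NormalDist

def half_width(sigma, n, confidence):
    """Half-width z * sigma / sqrt(n) of a two-sided z-interval."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sigma / sqrt(n)

baseline = half_width(sigma=8, n=25, confidence=0.95)          # about 3.136
lower_confidence = half_width(sigma=8, n=25, confidence=0.90)  # about 2.632
larger_sample = half_width(sigma=8, n=144, confidence=0.95)    # about 1.307

for label, value in [("n=25, 95%", baseline),
                     ("n=25, 90%", lower_confidence),
                     ("n=144, 95%", larger_sample)]:
    print(f"{label}: +/- {value:.3f}")
```

Because the half-width is proportional to 1/√n, halving the interval at a fixed confidence level requires roughly four times as many observations.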

Building a confidence interval when the population standard deviation is unknown

If the standard deviation of the population is not known and has to be estimated from the sample, the t-distribution is used to construct the confidence interval. This technique is more conservative, which shows up as wider confidence intervals compared with the technique based on the Z-score. Strictly speaking, the t-based interval still assumes that the data are approximately normal or that the sample is reasonably large.

Formula

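The following formula is used to calculate the lower and upper limits of the confidence interval based on the t-distribution, where $s$ is the sample standard deviation and $t_{\alpha}$ is the two-sided critical value of Student's distribution with $n-1$ degrees of freedom:

$$L_{\text{lower, upper}} = \bar{X} \mp t_{\alpha} \frac{s}{\sqrt{n}}$$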

Student's distribution, or the t-distribution, depends on only one parameter, the number of degrees of freedom, which is equal to the number of observations in the sample minus one (n − 1). The critical value of Student's t for a given number of degrees of freedom and significance level α can be found in lookup tables.

Example

Assume that the sample size is 25 individual values, the mean of the sample is 50, and the standard deviation of the sample is 28. You need to construct a confidence interval for the level of statistical significance α=5%.

In our case, the number of degrees of freedom is 24 (25-1), therefore, the corresponding tabular value of Student's t-test for the level of statistical significance α=5% is 2.064. Therefore, the lower and upper bounds of the confidence interval will be

$$L_{\text{lower}} = 50 - 2.064 \cdot \frac{28}{\sqrt{25}} = 38.442$$

$$L_{\text{upper}} = 50 + 2.064 \cdot \frac{28}{\sqrt{25}} = 61.558$$

The interval itself can be written as (38.442; 61.558).

Thus, we can state with 95% confidence that the mathematical expectation of the general population lies in this range.
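The same computation can be written as a small helper. In this sketch the critical value is passed in by hand, exactly as one would read it from a Student table (2.064 for 24 degrees of freedom and α = 5%); the function name `t_confidence_interval` is our own.

```python
from math import sqrt

def t_confidence_interval(mean, s, n, t_crit):
    """Two-sided interval for the mean; t_crit is read from a Student table
    for n - 1 degrees of freedom and the chosen significance level."""
    margin = t_crit * s / sqrt(n)
    return mean - margin, mean + margin

# Values from the example: n = 25, sample mean 50, sample std 28, t = 2.064
lower, upper = t_confidence_interval(mean=50, s=28, n=25, t_crit=2.064)
print(f"{lower:.3f} {upper:.3f}")
```

If SciPy happens to be available, the table lookup could be replaced by `scipy.stats.t.ppf(1 - alpha / 2, n - 1)`; the sketch avoids that dependency on purpose.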

With the t-distribution, too, the confidence interval can be narrowed either by reducing the confidence level or by increasing the sample size.

Reducing the confidence level from 95% to 90% in the conditions of our example, we get the corresponding tabular value of Student's t, 1.711.

$$L_{\text{lower}} = 50 - 1.711 \cdot \frac{28}{\sqrt{25}} = 40.418$$

$$L_{\text{upper}} = 50 + 1.711 \cdot \frac{28}{\sqrt{25}} = 59.582$$

In this case, we can say with 90% confidence that the mathematical expectation of the general population lies in the range (40.418; 59.582).

If we do not want to reduce the confidence level, the only alternative is to increase the sample size. Suppose it is 64 individual observations rather than 25 as in the initial condition of the example. The tabular value of Student's t for 63 degrees of freedom (64−1) and the significance level α=5% is 1.998.

$$L_{\text{lower}} = 50 - 1.998 \cdot \frac{28}{\sqrt{64}} = 43.007$$

$$L_{\text{upper}} = 50 + 1.998 \cdot \frac{28}{\sqrt{64}} = 56.993$$

This allows us to assert with 95% confidence that the mathematical expectation of the general population lies in the range (43.007; 56.993).

Large Samples

Large samples are samples with more than 100 individual observations. By the central limit theorem, the mean of a large sample is approximately normally distributed even if the distribution of the population is not normal. In addition, for such samples the z-score and the t-distribution give approximately the same results when constructing confidence intervals. Thus, for large samples it is acceptable to use the z-score of the normal distribution instead of the t-distribution.

Summing up

In the previous subsections, we considered the problem of estimating an unknown parameter $a$ by a single number. Such an estimate is called a "point" estimate. In many problems it is required not only to find a suitable numerical value for the parameter $a$, but also to evaluate its accuracy and reliability. We need to know what errors can result from replacing the parameter $a$ by its point estimate $\tilde{a}$, and with what degree of confidence we can expect that these errors will not go beyond known limits.

Problems of this kind are especially relevant for a small number of observations, when the point estimate $\tilde{a}$ is largely random and an approximate replacement of $a$ by $\tilde{a}$ can lead to serious errors.

To give an idea of the accuracy and reliability of the estimate $\tilde{a}$, mathematical statistics uses the so-called confidence intervals and confidence probabilities.

Suppose an unbiased estimate $\tilde{a}$ of the parameter $a$ has been obtained from experiment. We want to estimate the possible error in this case. Let us assign some sufficiently large probability $p$ (for example, $p$ = 0.9, 0.95, or 0.99) such that an event with probability $p$ can be considered practically certain, and find a value $\varepsilon$ for which

$$P(|\tilde{a} - a| < \varepsilon) = p. \qquad (14.3.1)$$

Then the range of practically possible values of the error that occurs when replacing $a$ by $\tilde{a}$ will be $\pm\varepsilon$; errors larger in absolute value will appear only with the small probability $\alpha = 1 - p$. Let's rewrite (14.3.1) as:

$$P(\tilde{a} - \varepsilon < a < \tilde{a} + \varepsilon) = p. \qquad (14.3.2)$$

Equality (14.3.2) means that with probability $p$ the unknown value of the parameter $a$ falls within the interval

$$I_p = (\tilde{a} - \varepsilon;\ \tilde{a} + \varepsilon).$$

In this case, one circumstance should be noted. Previously, we repeatedly considered the probability of a random variable falling into a given non-random interval. Here the situation is different: the quantity $a$ is not random, whereas the interval $I_p$ is random. Its position on the x-axis, determined by its center $\tilde{a}$, is random; in general, the length of the interval $2\varepsilon$ is also random, since the value of $\varepsilon$ is calculated, as a rule, from experimental data. Therefore, in this case it would be better to interpret the value of $p$ not as the probability of the point $a$ "hitting" the interval $I_p$, but as the probability that the random interval $I_p$ will cover the point $a$ (Fig. 14.3.1).

Fig. 14.3.1
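The "random interval covers a fixed point" reading can be checked with a small simulation. The sketch below is only an illustration under assumptions of our own: the data are normal with a known standard deviation, so the half-width of the interval is fixed and only its center (the sample mean) is random; the parameter values and the name `coverage_rate` are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

def coverage_rate(a=15.0, sigma=8.0, n=25, p=0.95, trials=5000, seed=1):
    """Share of simulated intervals (estimate - eps, estimate + eps) covering a."""
    rng = random.Random(seed)
    eps = NormalDist().inv_cdf((1 + p) / 2) * sigma / sqrt(n)
    hits = 0
    for _ in range(trials):
        # A fresh sample gives a fresh (random) center of the interval
        estimate = sum(rng.gauss(a, sigma) for _ in range(n)) / n
        if estimate - eps < a < estimate + eps:
            hits += 1
    return hits / trials

rate = coverage_rate()
print(f"share of intervals covering a: {rate:.3f}")
```

Over many repetitions the share of intervals that cover the true value should be close to the chosen confidence level p, while any single interval either covers a or does not.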

The probability $p$ is called the confidence level (confidence probability), and the interval $I_p$ the confidence interval. The boundaries of the interval $I_p$, $a_1 = \tilde{a} - \varepsilon$ and $a_2 = \tilde{a} + \varepsilon$, are called confidence limits.

Let's give one more interpretation of the concept of a confidence interval: it can be considered as an interval of values of the parameter $a$ that are compatible with the experimental data and do not contradict them. Indeed, if we agree to consider an event with probability $\alpha = 1 - p$ practically impossible, then those values of the parameter $a$ for which $|\tilde{a} - a| > \varepsilon$ must be recognized as contradicting the experimental data, and those for which $|\tilde{a} - a| < \varepsilon$ as compatible with them.

Suppose there is an unbiased estimate $\tilde{a}$ for the parameter $a$. If we knew the distribution law of the quantity $\tilde{a}$, the problem of finding the confidence interval would be quite simple: it would be enough to find a value of $\varepsilon$ for which

$$P(|\tilde{a} - a| < \varepsilon) = p.$$

The difficulty lies in the fact that the distribution law of the estimate $\tilde{a}$ depends on the distribution law of the quantity $X$ and, consequently, on its unknown parameters (in particular, on the parameter $a$ itself).

To get around this difficulty, one can apply the following roughly approximate technique: replace the unknown parameters in the expression for $\varepsilon$ with their point estimates. With a comparatively large number of experiments $n$ (about 20 to 30), this technique usually gives results of satisfactory accuracy.

As an example, consider the problem of the confidence interval for the mathematical expectation.

Suppose $n$ independent experiments have been performed on a random variable $X$ whose characteristics, the mathematical expectation $m$ and the variance $D$, are unknown. For these parameters, the following estimates were obtained:

$$\tilde{m} = \frac{1}{n}\sum_{i=1}^{n} X_i; \qquad \tilde{D} = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \tilde{m})^2. \qquad (14.3.4)$$

It is required to build a confidence interval $I_p$, corresponding to the confidence probability $p$, for the mathematical expectation $m$ of the quantity $X$.

In solving this problem, we use the fact that the quantity $\tilde{m}$ is a sum of $n$ independent identically distributed random variables $X_i$ (divided by $n$), and according to the central limit theorem, for sufficiently large $n$ its distribution law is close to normal. In practice, even with a relatively small number of terms (of the order of 10 to 20), the distribution law of the sum can be approximately considered normal. We will assume that the quantity $\tilde{m}$ is distributed according to the normal law. The characteristics of this law, the mathematical expectation and the variance, are equal to $m$ and $D/n$, respectively (see chapter 13 subsection 13.3). Let's assume that the value $D$ is known to us and find a value $\varepsilon_p$ for which

$$P(|\tilde{m} - m| < \varepsilon_p) = p. \qquad (14.3.5)$$

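Applying formula (6.3.5) of Chapter 6, we express the probability on the left side of (14.3.5) in terms of the normal distribution function $\Phi^*$:

$$P(|\tilde{m} - m| < \varepsilon_p) = 2\Phi^*\!\left(\frac{\varepsilon_p}{\sigma_{\tilde{m}}}\right) - 1,$$

where $\sigma_{\tilde{m}} = \sqrt{D/n}$ is the standard deviation of the estimate $\tilde{m}$.

From the equation

$$2\Phi^*\!\left(\frac{\varepsilon_p}{\sigma_{\tilde{m}}}\right) - 1 = p$$

we find the value $\varepsilon_p$:

$$\varepsilon_p = \sigma_{\tilde{m}} \arg\Phi^*\!\left(\frac{1+p}{2}\right), \qquad (14.3.7)$$

where $\arg\Phi^*(x)$ is the inverse function of $\Phi^*(x)$, i.e. the value of the argument at which the normal distribution function equals $x$.

The variance $D$, through which the value $\sigma_{\tilde{m}}$ is expressed, is not known to us exactly; as its approximate value we can use the estimate $\tilde{D}$ (14.3.4) and put approximately:

$$\sigma_{\tilde{m}} = \sqrt{\frac{\tilde{D}}{n}}.$$

Thus, the problem of constructing a confidence interval is approximately solved; the interval is

$$I_p = (\tilde{m} - \varepsilon_p;\ \tilde{m} + \varepsilon_p),$$

where $\varepsilon_p$ is defined by formula (14.3.7).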

In order to avoid reverse interpolation in the tables of the function $\Phi^*(x)$ when calculating $\varepsilon_p$, it is convenient to compile a special table (Table 14.3.1), which lists the values of the quantity

$$t_p = \arg\Phi^*\!\left(\frac{1+p}{2}\right)$$

depending on $p$. The value $t_p$ determines, for the normal law, the number of standard deviations that must be laid off to the right and left of the center of the distribution so that the probability of falling into the resulting region is equal to $p$.

In terms of $t_p$, the confidence interval is expressed as:

$$I_p = (\tilde{m} - t_p\,\sigma_{\tilde{m}};\ \tilde{m} + t_p\,\sigma_{\tilde{m}}).$$
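The approximate procedure of this subsection can be sketched as follows. The inverse normal distribution function plays the role of the table lookup, and the data list is hypothetical (it is not Table 14.3.2, whose values are not reproduced here); the function names are our own.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def t_p(p):
    """Number of standard deviations t_p such that P(|Z| < t_p) = p."""
    return NormalDist().inv_cdf((1 + p) / 2)

def approx_mean_interval(xs, p):
    """Approximate interval for the mean: m~ -/+ t_p * sqrt(D~ / n)."""
    n = len(xs)
    m_hat = mean(xs)
    d_hat = variance(xs)        # unbiased estimate, divides by n - 1
    sigma_m = sqrt(d_hat / n)   # estimated standard deviation of m~
    eps = t_p(p) * sigma_m
    return m_hat - eps, m_hat + eps

# Hypothetical measurements, for illustration only
xs = [10.5, 10.8, 11.2, 10.9, 10.4, 10.6, 10.9, 11.0, 10.3, 10.8]
low, high = approx_mean_interval(xs, p=0.8)
print(f"t_p = {t_p(0.8):.3f}, interval = ({low:.3f}; {high:.3f})")
```

For p = 0.8 the quantile is about 1.282, so a table like the one described above would list 1.282 against that confidence level.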

Table 14.3.1

Example 1. 20 experiments were carried out on the quantity $X$; the results are shown in Table 14.3.2.

Table 14.3.2

It is required to find an estimate $\tilde{m}$ for the mathematical expectation of the quantity $X$ and construct a confidence interval corresponding to a confidence level $p$ = 0.8.

Solution. We have:

Choosing the origin at 10, according to the third formula (14.2.14) we find the unbiased estimate $\tilde{D}$:

According to Table 14.3.1 we find $t_p = 1.282$.

Confidence limits:

Confidence interval:

Values of the parameter $m$ lying in this interval are compatible with the experimental data given in Table 14.3.2.

In a similar way, a confidence interval can be constructed for the variance.

Suppose $n$ independent experiments have been performed on a random variable $X$ with unknown parameters $m$ and $D$, and the unbiased estimate of the variance $D$ has been obtained:

$$\tilde{D} = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \tilde{m})^2, \qquad (14.3.11)$$

where $\tilde{m} = \frac{1}{n}\sum_{i=1}^{n} X_i$.

It is required to approximately build a confidence interval for the variance.

From formula (14.3.11) it can be seen that the value $\tilde{D}$ is a sum of $n$ random variables of the form $\frac{(X_i - \tilde{m})^2}{n-1}$. These values are not independent, since each of them includes the quantity $\tilde{m}$, which depends on all the others. However, it can be shown that as $n$ increases, the distribution law of their sum also approaches normal. In practice, at $n$ = 20...30 it can already be considered normal.

Let's assume that this is so, and find the characteristics of this law: the mathematical expectation and the variance. Since the estimate $\tilde{D}$ is unbiased, $M[\tilde{D}] = D$.

Calculating the variance $D[\tilde{D}]$ involves relatively complex computations, so we give its expression without derivation:

$$D[\tilde{D}] = \frac{\mu_4}{n} - \frac{n-3}{n(n-1)}D^2, \qquad (14.3.12)$$

where $\mu_4$ is the fourth central moment of the quantity $X$.

To use this expression, you need to substitute in it the values of $\mu_4$ and $D$ (at least approximate ones). Instead of $D$ you can use its estimate $\tilde{D}$. In principle, the fourth central moment can also be replaced by its estimate, for example, by a value of the form:

$$\tilde{\mu}_4 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \tilde{m})^4,$$

but such a replacement will give extremely low accuracy, since in general, with a limited number of experiments, high-order moments are determined with large errors. However, in practice it often happens that the form of the distribution law of the quantity $X$ is known in advance and only its parameters are unknown. Then we can try to express $\mu_4$ in terms of $D$.

Let us take the most common case, when the quantity $X$ is distributed according to the normal law. Then its fourth central moment is expressed in terms of the variance (see Chapter 6 Subsection 6.2):

$$\mu_4 = 3D^2,$$

and formula (14.3.12) gives

$$D[\tilde{D}] = \frac{3D^2}{n} - \frac{n-3}{n(n-1)}D^2,$$

or

$$D[\tilde{D}] = \frac{2}{n-1}D^2. \qquad (14.3.14)$$

Replacing in (14.3.14) the unknown $D$ by its estimate $\tilde{D}$, we get

$$D[\tilde{D}] = \frac{2}{n-1}\tilde{D}^2,$$

whence

$$\sigma_{\tilde{D}} = \sqrt{\frac{2}{n-1}}\,\tilde{D}. \qquad (14.3.16)$$

The moment $\mu_4$ can be expressed in terms of $D$ in some other cases as well, when the distribution of the quantity $X$ is not normal but its form is known. For example, for the law of uniform density (see Chapter 5) we have:

$$\mu_4 = \frac{(\beta - \alpha)^4}{80}; \qquad D = \frac{(\beta - \alpha)^2}{12},$$

where $(\alpha, \beta)$ is the interval on which the law is given.

Hence,

$$\mu_4 = 1.8\,D^2.$$

According to formula (14.3.12) we get

$$D[\tilde{D}] = \frac{0.8n + 1.2}{n(n-1)}D^2,$$

from which we find approximately

$$\sigma_{\tilde{D}} = \sqrt{\frac{0.8n + 1.2}{n(n-1)}}\,\tilde{D}.$$

In cases where the form of the distribution law of the quantity $X$ is unknown, it is still recommended to use formula (14.3.16) when estimating the value of $\sigma_{\tilde{D}}$, unless there are special grounds for believing that this law is very different from the normal one (has a noticeable positive or negative kurtosis).

If the approximate value of $\sigma_{\tilde{D}}$ has been obtained in one way or another, then a confidence interval for the variance can be constructed in the same way as we built it for the mathematical expectation:

$$I_p = (\tilde{D} - t_p\,\sigma_{\tilde{D}};\ \tilde{D} + t_p\,\sigma_{\tilde{D}}), \qquad (14.3.18)$$

where the value $t_p$, depending on the given probability $p$, is found in Table 14.3.1.
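A minimal sketch of the approximate interval for the variance under the normality assumption, combining formulas (14.3.16) and (14.3.18). The input values below are hypothetical and chosen only to show the mechanics; `approx_variance_interval` is a name introduced here.

```python
from math import sqrt
from statistics import NormalDist

def approx_variance_interval(d_hat, n, p):
    """Approximate interval D~ -/+ t_p * sigma_D, with
    sigma_D = sqrt(2 / (n - 1)) * D~ (X assumed close to normal)."""
    t = NormalDist().inv_cdf((1 + p) / 2)   # t_p from the normal law
    sigma_d = sqrt(2 / (n - 1)) * d_hat     # formula (14.3.16)
    return d_hat - t * sigma_d, d_hat + t * sigma_d

# Hypothetical inputs: variance estimate 4.0 from n = 20 experiments, p = 0.8
low, high = approx_variance_interval(d_hat=4.0, n=20, p=0.8)
print(f"variance: ({low:.3f}; {high:.3f})")
print(f"std dev:  ({sqrt(low):.3f}; {sqrt(high):.3f})")
```

Note that this interval is symmetric about the estimate by construction, which is one reason it is only a rough approximation for small n: for very small samples the lower limit can even become negative.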

Example 2. Find an approximate 80% confidence interval for the variance of the random variable $X$ under the conditions of Example 1, if it is known that $X$ is distributed according to a law close to normal.

Solution. The value $t_p$ remains the same as in Example 1 (Table 14.3.1):

According to the formula (14.3.16)

According to the formula (14.3.18) we find the confidence interval:

The corresponding interval for the standard deviation: (0.21; 0.29).

14.4. Exact methods for constructing confidence intervals for the parameters of a random variable distributed according to the normal law

In the previous subsection, we considered roughly approximate methods for constructing confidence intervals for the mean and variance. Here we give an idea of the exact methods for solving the same problem. We emphasize that in order to find the confidence intervals exactly, it is absolutely necessary to know in advance the form of the distribution law of the quantity $X$, whereas this is not necessary for applying the approximate methods.

The idea of the exact methods of constructing confidence intervals is as follows. Any confidence interval is found from a condition expressing the probability that certain inequalities hold, and these inequalities include the estimate $\tilde{a}$ of interest to us. In the general case, the distribution law of the estimate $\tilde{a}$ depends on the unknown parameters of the quantity $X$. However, it is sometimes possible to pass in the inequalities from the random variable $\tilde{a}$ to some other function of the observed values $X_1, X_2, \ldots, X_n$ whose distribution law does not depend on the unknown parameters, but only on the number of experiments $n$ and on the form of the distribution law of the quantity $X$. Random variables of this kind play a large role in mathematical statistics; they have been studied in most detail for the case of a normal distribution of the quantity $X$.

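For example, it has been proved that under a normal distribution of the quantity $X$ the random variable

$$T = \sqrt{n}\,\frac{\tilde{m} - m}{\sqrt{\tilde{D}}}, \qquad (14.4.1)$$

where $\tilde{m} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $\tilde{D} = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \tilde{m})^2$, obeys the so-called Student's distribution law with $n - 1$ degrees of freedom; the density of this law has the form

$$S_{n-1}(t) = \frac{\Gamma\!\left(\frac{n}{2}\right)}{\sqrt{(n-1)\pi}\;\Gamma\!\left(\frac{n-1}{2}\right)}\left(1 + \frac{t^2}{n-1}\right)^{-\frac{n}{2}}, \qquad (14.4.2)$$

where $\Gamma(x)$ is the well-known gamma function:

$$\Gamma(x) = \int_0^{\infty} u^{x-1} e^{-u}\,du.$$

It has also been proved that the random variable

$$V = \frac{(n-1)\tilde{D}}{D} \qquad (14.4.3)$$

has the "$\chi^2$ distribution" with $n - 1$ degrees of freedom (see chapter 7), the density of which is expressed by the formula

$$k_{n-1}(v) = \begin{cases} \dfrac{1}{2^{\frac{n-1}{2}}\,\Gamma\!\left(\frac{n-1}{2}\right)}\, v^{\frac{n-1}{2}-1} e^{-\frac{v}{2}} & \text{for } v > 0, \\[2ex] 0 & \text{for } v < 0. \end{cases} \qquad (14.4.4)$$

Without dwelling on the derivations of distributions (14.4.2) and (14.4.4), we will show how they can be applied when constructing confidence intervals for the parameters $m$ and $D$.

Suppose $n$ independent experiments have been performed on a random variable $X$ distributed according to the normal law with unknown parameters $m$ and $\sigma$. For these parameters, the estimates

$$\tilde{m} = \frac{1}{n}\sum_{i=1}^{n} X_i; \qquad \tilde{D} = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \tilde{m})^2$$

have been obtained. It is required to construct confidence intervals for both parameters corresponding to the confidence probability $p$.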

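Let us first construct a confidence interval for the mathematical expectation. It is natural to take this interval symmetric with respect to $\tilde{m}$; denote by $\varepsilon_p$ half the length of the interval. The value of $\varepsilon_p$ must be chosen so that the condition

$$P(|\tilde{m} - m| < \varepsilon_p) = p \qquad (14.4.5)$$

holds. Let's try to pass on the left side of equality (14.4.5) from the random variable $\tilde{m}$ to the random variable $T$, distributed according to Student's law. To do this, we multiply both sides of the inequality $|\tilde{m} - m| < \varepsilon_p$ by the positive value $\sqrt{n}/\sqrt{\tilde{D}}$:

$$P\!\left(\frac{\sqrt{n}\,|\tilde{m} - m|}{\sqrt{\tilde{D}}} < \frac{\varepsilon_p \sqrt{n}}{\sqrt{\tilde{D}}}\right) = p,$$

or, using the notation (14.4.1),

$$P\!\left(|T| < \frac{\varepsilon_p}{\sqrt{\tilde{D}/n}}\right) = p.$$

Let us find a number $t_p$ such that $P(|T| < t_p) = p$. The value $t_p$ can be found from the condition

$$\int_{-t_p}^{t_p} S_{n-1}(t)\,dt = p. \qquad (14.4.8)$$

From formula (14.4.2) it can be seen that $S_{n-1}(t)$ is an even function, so (14.4.8) gives

$$2\int_0^{t_p} S_{n-1}(t)\,dt = p. \qquad (14.4.9)$$

Equality (14.4.9) determines the value $t_p$ depending on $p$. If you have at your disposal a table of values of the integral

$$\Psi(x) = 2\int_0^{x} S_{n-1}(t)\,dt,$$

then the value $t_p$ can be found by reverse interpolation in the table. However, it is more convenient to compile a table of values of $t_p$ in advance. Such a table is given in the Appendix (Table 5). This table shows the values of $t_p$ depending on the confidence probability $p$ and the number of degrees of freedom $n - 1$. Having determined $t_p$ from Table 5 and setting

$$\varepsilon_p = t_p\sqrt{\frac{\tilde{D}}{n}},$$

we find half the width of the confidence interval $I_p$ and the interval itself:

$$I_p = \left(\tilde{m} - t_p\sqrt{\frac{\tilde{D}}{n}};\ \tilde{m} + t_p\sqrt{\frac{\tilde{D}}{n}}\right).$$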
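To show how much the exact method differs from the approximate one on a small sample, the sketch below builds both intervals from the same data. The five observations are hypothetical (they are not the contents of Table 14.4.1), the Student value 2.132 is the standard table entry for 4 degrees of freedom and p = 0.9, and the helper name `interval` is our own.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def interval(xs, quantile):
    """Interval m~ -/+ quantile * sqrt(D~ / n) for a given table quantile."""
    n = len(xs)
    m_hat = mean(xs)
    eps = quantile * sqrt(variance(xs) / n)  # variance() divides by n - 1
    return m_hat - eps, m_hat + eps

data = [9.8, 10.4, 10.1, 9.6, 10.6]   # hypothetical sample, n = 5
t_student = 2.132                     # Student table: 4 d.f., p = 0.9
t_normal = NormalDist().inv_cdf((1 + 0.9) / 2)  # about 1.645

exact = interval(data, t_student)
approx = interval(data, t_normal)
print("exact (Student):", tuple(round(v, 3) for v in exact))
print("approx (normal):", tuple(round(v, 3) for v in approx))
```

With only five observations the Student interval is noticeably wider than the normal approximation; the gap shrinks as n grows, which matches the close agreement reported for n = 20 later in this subsection.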

Example 1. Five independent experiments were performed on a random variable $X$, normally distributed with unknown parameters $m$ and $\sigma$. The results of the experiments are given in Table 14.4.1.

Table 14.4.1

Find an estimate $\tilde{m}$ for the mathematical expectation and construct a 90% confidence interval $I_p$ for it (i.e., the interval corresponding to the confidence probability $p$ = 0.9).

Solution. We have:

According to Table 5 of the Appendix, for $n - 1 = 4$ and $p = 0.9$ we find $t_p = 2.13$, whence

The confidence interval will be

Example 2. For the conditions of Example 1 of subsection 14.3, assuming the quantity $X$ to be normally distributed, find the exact confidence interval.

Solution. According to Table 5 of the Appendix, for $n - 1 = 19$ and $p = 0.8$ we find $t_p = 1.328$; hence

Comparing with the solution of Example 1 of subsection 14.3 ($\varepsilon_p = 0.072$), we see that the discrepancy is very small. If we keep the accuracy to the second decimal place, then the confidence intervals found by the exact and approximate methods are the same:

Let's move on to constructing a confidence interval for the variance. Consider the unbiased variance estimate

$$\tilde{D} = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \tilde{m})^2$$

and express the random variable $\tilde{D}$ through the value $V$ (14.4.3), which has the $\chi^2$ distribution (14.4.4):

$$\tilde{D} = V\,\frac{D}{n-1}.$$

Knowing the distribution law of the quantity $V$, we can find the interval $l_p$ into which it falls with the given probability $p$.

The distribution law $k_{n-1}(v)$ of the quantity $V$ has the form shown in Fig. 14.4.1.

Fig. 14.4.1

The question arises: how to choose the interval $l_p$? If the distribution law of the quantity $V$ were symmetric (like the normal law or Student's distribution), it would be natural to take the interval $l_p$ symmetric with respect to the mathematical expectation. In this case the law $k_{n-1}(v)$ is asymmetric. Let us agree to choose the interval $l_p$ so that the probabilities of the quantity $V$ falling outside the interval to the right and to the left (shaded areas in Fig. 14.4.1) are the same and equal to

$$\frac{\alpha}{2} = \frac{1 - p}{2}.$$

To construct an interval $l_p$ with this property, we use Table 4 of the Appendix: it contains numbers $\chi^2$ such that

$$P(V > \chi^2) = q$$

for a quantity $V$ having the $\chi^2$ distribution with $r$ degrees of freedom. In our case $r = n - 1$. Fix $r = n - 1$ and find in the corresponding row of Table 4 two values of $\chi^2$: one corresponding to the probability $q_1 = \alpha/2$, the other to the probability $q_2 = 1 - \alpha/2$. Let us denote these values $\chi_1^2$ and $\chi_2^2$. The interval $l_p$ has $\chi_2^2$ as its left end and $\chi_1^2$ as its right end.

Now we find from the interval $l_p$ the required confidence interval $I_p$ for the variance, with boundaries $D_1$ and $D_2$, which covers the point $D$ with probability $p$:

$$P(D_1 < D < D_2) = p.$$

Let us construct an interval $I_p = (D_1, D_2)$ that covers the point $D$ if and only if the value $V$ falls into the interval $l_p$. Let us show that the interval

$$I_p = \left(\frac{(n-1)\tilde{D}}{\chi_1^2};\ \frac{(n-1)\tilde{D}}{\chi_2^2}\right) \qquad (14.4.13)$$

satisfies this condition. Indeed, the inequalities

$$\frac{(n-1)\tilde{D}}{\chi_1^2} < D; \qquad \frac{(n-1)\tilde{D}}{\chi_2^2} > D$$

are equivalent to the inequalities

$$V < \chi_1^2; \qquad V > \chi_2^2,$$

and these inequalities hold with probability $p$. Thus, the confidence interval for the variance is found and is expressed by the formula (14.4.13).
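Formula (14.4.13) translates directly into code. In the sketch below the two chi-square values are passed in as read from a table (27.2 and 11.65 are the standard entries for 19 degrees of freedom and tail probabilities 0.1 and 0.9), while the variance estimate 4.0 is hypothetical; `variance_interval` is a name introduced here.

```python
from math import sqrt

def variance_interval(d_hat, n, chi2_upper, chi2_lower):
    """Exact interval ((n-1)D~/chi2_upper, (n-1)D~/chi2_lower) for the variance.

    chi2_upper is the table value exceeded with probability alpha/2,
    chi2_lower the value exceeded with probability 1 - alpha/2.
    """
    return (n - 1) * d_hat / chi2_upper, (n - 1) * d_hat / chi2_lower

# Hypothetical estimate D~ = 4.0 from n = 20 experiments, p = 0.8
d1, d2 = variance_interval(d_hat=4.0, n=20, chi2_upper=27.2, chi2_lower=11.65)
print(f"variance: ({d1:.3f}; {d2:.3f})")
print(f"std dev:  ({sqrt(d1):.3f}; {sqrt(d2):.3f})")
```

Unlike the approximate interval of subsection 14.3, this interval is not symmetric about the estimate: the upper limit lies farther from it than the lower one, reflecting the asymmetry of the chi-square law.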

Example 3. Find the confidence interval for the variance under the conditions of Example 2 of subsection 14.3, if it is known that the quantity $X$ is normally distributed.

Solution. We have $p = 0.8$; $\alpha = 0.2$; $\alpha/2 = 0.1$. According to Table 4 of the Appendix, at $r = n - 1 = 19$ we find $\chi^2 = 27.2$ for the probability 0.1 and $\chi^2 = 11.65$ for the probability 0.9.

According to the formula (14.4.13) we find the confidence interval for the variance

The corresponding interval for the standard deviation: (0.21; 0.32). This interval only slightly exceeds the interval (0.21; 0.29) obtained in Example 2 of Subsection 14.3 by the approximate method.

  • Figure 14.3.1 shows a confidence interval that is symmetric about $\tilde{a}$. In general, as we will see later, this is not necessary.

Confidence intervals.

The calculation of the confidence interval is based on the mean (standard) error of the corresponding parameter. The confidence interval shows within what limits the true value of the estimated parameter lies with probability (1−α). Here α is the significance level; (1−α) is also called the confidence level.

In the first chapter, we showed that, for example, for the arithmetic mean, the true population mean lies within about 2 mean errors of the sample mean about 95% of the time. Thus, the boundaries of the 95% confidence interval for the mean lie about two mean errors from the sample mean; that is, we multiply the mean error of the mean by a coefficient that depends on the confidence level. For the mean and the difference of means, Student's coefficient (the critical value of Student's criterion) is taken; for a share and a difference of shares, the critical value of the z criterion. The product of the coefficient and the mean error can be called the marginal error of the given parameter, i.e. the largest error we can expect when estimating it at the chosen confidence level.

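Confidence interval for the arithmetic mean: $\bar{x} \pm t_{\alpha,f} \cdot m_{\bar{x}}$.

Here $\bar{x}$ is the sample mean;

$m_{\bar{x}} = s/\sqrt{n}$ is the mean error of the arithmetic mean;

$s$ is the sample standard deviation;

$n$ is the sample size;

$t_{\alpha,f}$ is the critical value of Student's criterion for a given significance level α and the number of degrees of freedom f = n−1 (Student's coefficient).

Confidence interval for the difference of arithmetic means: $(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha,f} \cdot m_{\bar{x}_1 - \bar{x}_2}$.

Here $\bar{x}_1 - \bar{x}_2$ is the difference between the sample means;

$m_{\bar{x}_1 - \bar{x}_2} = \sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$, with the pooled variance $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$, is the mean error of the difference of arithmetic means;

$s_1, s_2$ are the sample standard deviations;

$n_1, n_2$ are the sample sizes;

$t_{\alpha,f}$ is the critical value of Student's criterion for a given significance level α and the number of degrees of freedom f = n1 + n2 − 2 (Student's coefficient).

Confidence interval for a share: $d \pm z_{\alpha} \cdot m_d$.

Here $d$ is the sample share;

$m_d = \sqrt{d(1-d)/n}$ is the mean error of the share;

$n$ is the sample size (group size);

Confidence interval for the difference of shares: $(d_1 - d_2) \pm z_{\alpha} \cdot m_{d_1 - d_2}$.

Here $d_1 - d_2$ is the difference between the sample shares;

$m_{d_1 - d_2}$ is the mean error of the difference between the shares;

$n_1, n_2$ are the sample sizes (group sizes);

$z_{\alpha}$ is the critical value of the z criterion at a given significance level α (for example, z = 1.96 for α = 0.05).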

By calculating the confidence intervals for the difference in indicators, we, firstly, directly see the possible values of the effect, and not just its point estimate. Secondly, we can draw a conclusion about accepting or rejecting the null hypothesis, and, thirdly, we can draw a conclusion about the power of the criterion.

When testing hypotheses using confidence intervals, the following rule should be followed:

If the 100(1-a)-percent confidence interval of the mean difference does not contain zero, then the differences are statistically significant at the a significance level; on the contrary, if this interval contains zero, then the differences are not statistically significant.

Indeed, if this interval contains zero, then, it means that the compared indicator can be either more or less in one of the groups compared to the other, i.e. the observed differences are random.

The position of zero within the confidence interval gives an indication of the power of the criterion. If zero is close to the lower or upper limit of the interval, then perhaps with larger groups the differences would reach statistical significance. If zero is close to the middle of the interval, then an increase and a decrease of the indicator in the experimental group are about equally plausible, and there are probably no real differences.

Examples:

A comparison of operative mortality with two different types of anesthesia: 61 people were operated on using the first type of anesthesia, of whom 8 died; 67 people were operated on using the second type, of whom 10 died.

d1 = 8/61 = 0.131; d2 = 10/67 = 0.149; d1 − d2 = −0.018.

The difference in mortality between the compared methods will be in the range (−0.018 − 0.122; −0.018 + 0.122), or (−0.14; 0.104), with a probability of 100(1−α) = 95%. The interval contains zero, i.e. the hypothesis of equal mortality under the two types of anesthesia cannot be rejected.

Thus, with a probability of 95%, mortality with the first type of anesthesia may be up to 14 percentage points lower or up to 10.4 percentage points higher than with the second. Zero is approximately in the middle of the interval, so it can be argued that, most likely, these two methods do not really differ in mortality.
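The anesthesia example can be reproduced approximately with the sketch below. The exact formula for the mean error of the difference is not given in the text, so the sketch assumes the common unpooled form sqrt(d1(1−d1)/n1 + d2(1−d2)/n2) and the normal quantile for z; with these assumptions the margin comes out as about 0.120 instead of the 0.122 quoted above. The function name is our own.

```python
from math import sqrt
from statistics import NormalDist

def share_difference_interval(k1, n1, k2, n2, alpha=0.05):
    """Interval for d1 - d2, assuming an unpooled standard error (an assumption)."""
    d1, d2 = k1 / n1, k2 / n2
    se = sqrt(d1 * (1 - d1) / n1 + d2 * (1 - d2) / n2)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    diff = d1 - d2
    return diff - z * se, diff + z * se

# 8 deaths out of 61 versus 10 deaths out of 67
low, high = share_difference_interval(8, 61, 10, 67)
significant = not (low <= 0 <= high)   # rule: interval must exclude zero
print(f"({low:.3f}; {high:.3f}), significant: {significant}")
```

The interval contains zero, so under the rule stated earlier the difference is not statistically significant at the 5% level.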

In the example discussed earlier, the average pressing time in the tapping test was compared across four groups of students differing in examination grade. Let's calculate the confidence intervals of the average pressing time for students who passed the exam with grades of 2 and 5, and the confidence interval for the difference between these averages.

Student's coefficients are found from the tables of Student's distribution (see Appendix): for the first group, t(0.05; 48) = 2.011; for the second group, t(0.05; 61) = 2.000. Thus, the confidence interval for the first group is (162.19 − 2.011 × 2.18; 162.19 + 2.011 × 2.18) = (157.8; 166.6), and for the second group (156.55 − 2.000 × 1.88; 156.55 + 2.000 × 1.88) = (152.8; 160.3). So, for those who passed the exam with a 2, the average pressing time lies between 157.8 ms and 166.6 ms with a probability of 95%; for those who passed with a 5, between 152.8 ms and 160.3 ms with a probability of 95%.

You can also test the null hypothesis using the confidence intervals for the means themselves, and not just for the difference in the means. If the confidence intervals for the means do not overlap, the null hypothesis can be rejected at the chosen significance level. If they overlap, as in our case, this check alone does not allow the hypothesis to be rejected; it is more conservative than the interval for the difference, so the latter should be examined as well.

Let's find the confidence interval for the difference in the average pressing time between the groups that passed the exam with grades of 2 and 5. The difference in the averages: 162.19 − 156.55 = 5.64. Student's coefficient: t(0.05; 49 + 62 − 2) = t(0.05; 109) = 1.982. The group standard deviations follow from the mean errors as s = m√n: s1 ≈ 2.18 × √49 ≈ 15.26; s2 ≈ 1.88 × √62 ≈ 14.80. We calculate the mean error of the difference between the means: 2.87. Confidence interval: (5.64 − 1.982 × 2.87; 5.64 + 1.982 × 2.87) = (−0.044; 11.33).

So, the difference in the average pressing time between the groups that passed the exam with a 2 and with a 5 lies in the range from −0.044 ms to 11.33 ms. This interval includes zero, i.e. the average pressing time of those who passed the exam with excellent results may be either higher or lower than that of those who passed unsatisfactorily, so the null hypothesis cannot be rejected. But zero is very close to the lower limit, so a shorter pressing time for the excellent students is much more plausible. Thus, we can suppose that there may still be a difference in the average pressing time between those who passed with a 2 and with a 5; we simply could not detect it given the size of the change in the average time, the spread of the average time and the sample sizes.
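The interval for the difference of means can be checked numerically. The sketch below assumes the pooled-variance form of the mean error (consistent with the f = n1 + n2 − 2 degrees of freedom used above) and recovers the group standard deviations from the mean errors 2.18 and 1.88; the function name is our own, and small differences in the last digit are due to rounding of the inputs.

```python
from math import sqrt

def pooled_difference_interval(mean1, s1, n1, mean2, s2, n2, t_crit):
    """Interval for mean1 - mean2 using a pooled variance estimate."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = mean1 - mean2
    return diff - t_crit * se, diff + t_crit * se

# Standard deviations recovered from the mean errors: s = m * sqrt(n)
s1 = 2.18 * sqrt(49)
s2 = 1.88 * sqrt(62)

low, high = pooled_difference_interval(162.19, s1, 49, 156.55, s2, 62,
                                       t_crit=1.982)
print(f"({low:.3f}; {high:.3f})")
```

The lower limit falls just below zero, which is why the null hypothesis is not rejected even though most of the interval lies on the positive side.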



The power of a test is the probability of rejecting a false null hypothesis, i.e. of finding differences where they really exist.

The power of the test is determined by the significance level, the size of the differences between the groups, the spread of values in the groups, and the sample size.

For Student's t-test and analysis of variance, you can use power (sensitivity) charts.

The power of the criterion can be used for a preliminary determination of the required group sizes.

The confidence interval shows within what limits the true value of the estimated parameter lies with a given probability.

With the help of confidence intervals, you can test statistical hypotheses and draw conclusions about the sensitivity of the criteria.


Questions for self-examination of students.

1. What is the power of the criterion?

2. In what cases is it necessary to evaluate the power of criteria?

3. Methods for calculating power.

6. How to test a statistical hypothesis using a confidence interval?

7. What can be said about the power of the criterion when calculating the confidence interval?

Tasks.

Suppose we have a large number of items with a normally distributed characteristic (for example, a warehouse full of vegetables of the same type, whose size and weight vary). You want to know the average characteristics of the entire batch of goods, but you have neither the time nor the inclination to measure and weigh each vegetable. You understand that this is not necessary. But how many pieces would you need to take for a random inspection?

Before giving some formulas useful for this situation, we recall some notation.

First, if we did measure the entire warehouse of vegetables (this set of elements is called the general population), we would know, with all the accuracy available to us, the average weight of the entire batch. Let's call this average $\bar{X}_{\text{pop}}$, the general (population) mean. We already know that a normal distribution is completely determined if its mean value and standard deviation σ are known. True, so far we know neither $\bar{X}_{\text{pop}}$ nor σ of the general population. We can only take some sample, measure the values we need and calculate for this sample both the sample mean $\bar{X}$ and the sample standard deviation $S$.

It is known that if our sample contains a large number of elements (usually n is greater than 30) and they are taken truly at random, then σ of the general population will hardly differ from $S$.

In addition, for the case of a normal distribution, we can use the following formulas:

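With a probability of 95%:

$$\bar{X} - 1.96\,\frac{\sigma}{\sqrt{n}} < \bar{X}_{gen} < \bar{X} + 1.96\,\frac{\sigma}{\sqrt{n}}$$

With a probability of 99%:

$$\bar{X} - 2.58\,\frac{\sigma}{\sqrt{n}} < \bar{X}_{gen} < \bar{X} + 2.58\,\frac{\sigma}{\sqrt{n}}$$

In general form, with probability P(t):

$$\bar{X} - t\,\frac{\sigma}{\sqrt{n}} < \bar{X}_{gen} < \bar{X} + t\,\frac{\sigma}{\sqrt{n}} \qquad (1)$$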

The relationship between the value of t and the value of the probability P (t), with which we want to know the confidence interval, can be taken from the following table:


Thus, we have determined in what range the average value for the general population is (with a given probability).
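The large-sample calculation can be sketched in Python as follows. This is only an illustration: the function name `large_sample_interval` and the list of weights are invented here, the sample standard deviation is used in place of the unknown population value (reasonable only for large n), and the multiplier t is computed from the standard normal distribution rather than read from a table.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def large_sample_interval(sample, prob=0.95):
    """Approximate confidence interval for the population mean (large n)."""
    n = len(sample)
    x_bar = mean(sample)   # sample mean
    s = stdev(sample)      # sample standard deviation, used in place of sigma
    # Two-sided multiplier t for the chosen probability P(t)
    t = NormalDist().inv_cdf((1 + prob) / 2)
    margin = t * s / sqrt(n)
    return x_bar - margin, x_bar + margin


# Hypothetical weights (grams) of 40 vegetables, invented for illustration only
weights = [200 + (i * 37) % 50 for i in range(40)]
low, high = large_sample_interval(weights, prob=0.95)
print(f"95% interval for the mean weight: {low:.1f} .. {high:.1f}")
```

Raising `prob` to 0.99 increases the multiplier and therefore widens the interval.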

If the sample is not large enough, we cannot claim that $\sigma = S$ for the population. In addition, in this case it is doubtful how close the sample is to a normal distribution. Here we still use $S$ in place of $\sigma$ in the formula:

$$\bar{X} - t\,\frac{S}{\sqrt{n}} < \bar{X}_{gen} < \bar{X} + t\,\frac{S}{\sqrt{n}}$$

but the value of t for a fixed probability P(t) now depends on the number of elements n in the sample. The larger n is, the closer the resulting confidence interval will be to the one given by formula (1). The values of t in this case are taken from another table (Student's t-distribution), which we present below:

Student's t values for probabilities 0.95 and 0.99


Example 3. Thirty employees of a company were selected at random. In the sample, the average monthly salary turned out to be 30 thousand rubles, with a standard deviation of 5 thousand rubles. Determine the average salary in the company with a probability of 0.99.

Solution: By the conditions of the problem we have n = 30, $\bar{X}$ = 30000, S = 5000, P = 0.99. To find the confidence interval, we use the formula corresponding to Student's t-distribution. From the table, for n = 30 and P = 0.99 we find t = 2.756; therefore,

$$30000 - 2.756 \cdot \frac{5000}{\sqrt{30}} < \bar{X}_{gen} < 30000 + 2.756 \cdot \frac{5000}{\sqrt{30}},$$

i.e., the desired confidence interval is 27484 < $\bar{X}_{gen}$ < 32516.

So, with a probability of 0.99, it can be argued that the interval (27484; 32516) contains the average salary in the company.
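The arithmetic of this example is easy to check with a few lines of Python. The variable names are chosen here for illustration; the value of t is simply the one quoted in the solution, not computed.

```python
from math import sqrt

n = 30          # sample size
x_bar = 30000   # sample mean salary, rubles
s = 5000        # sample standard deviation, rubles
t = 2.756       # Student's t value quoted in the text for P = 0.99

margin = t * s / sqrt(n)                  # half-width of the interval
lower, upper = x_bar - margin, x_bar + margin
print(round(lower), round(upper))         # 27484 32516
```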

You do not need to have a table at hand every time you use this method. The calculation can be done automatically in Excel. In an Excel file, click the fx button on the top menu. Then select the "statistical" category of functions and, from the list offered, STEUDRASP (this appears to be the Russian-locale name of the inverse Student's t function, which is TINV in English-language Excel). At the prompt, place the cursor in the "probability" field and type the complementary probability (that is, for a probability of 0.95 you need to type 0.05). Apparently, the spreadsheet is designed so that this input answers the question of how likely we are to be wrong. Similarly, in the "degrees of freedom" field, enter the value (n-1) for your sample.
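Outside Excel, the same lookup can be sketched with the Python standard library alone. This is a rough numerical illustration, not a library-grade routine: the helper names (`t_pdf`, `t_cdf`, `t_inv_two_tailed`) are invented here, the density is integrated with Simpson's rule, and the search bracket [0, 100] is an assumption that holds for the usual probabilities and degrees of freedom.

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """P(T <= x), integrating the density from 0 to |x| with Simpson's rule."""
    a = abs(x)
    if a == 0:
        return 0.5
    h = a / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area


def t_inv_two_tailed(p, df):
    """Value t with P(|T| > t) = p, found by bisection on [0, 100]."""
    lo, hi = 0.0, 100.0                # assumed wide enough for typical p and df
    for _ in range(60):
        mid = (lo + hi) / 2
        if 2 * (1 - t_cdf(mid, df)) > p:
            lo = mid                   # tails still too heavy: move right
        else:
            hi = mid
    return (lo + hi) / 2


n = 30
print(round(t_inv_two_tailed(0.01, n - 1), 3))  # probability 0.99
print(round(t_inv_two_tailed(0.05, n - 1), 3))  # probability 0.95
```

As in the Excel dialog, the first argument is the complementary probability and the second is n-1.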

Let us construct in MS EXCEL a confidence interval for estimating the mean of a distribution when the variance is known.

Of course, the choice of confidence level depends entirely on the task at hand. For example, an air passenger's confidence in the reliability of an aircraft should certainly be higher than a buyer's confidence in the reliability of a light bulb.

Problem statement

Suppose that a sample of size n has been drawn from a population. The standard deviation of this distribution is assumed to be known. Based on this sample, we need to estimate the unknown mean of the distribution (μ) and construct the corresponding two-sided confidence interval.

Point Estimation

As is known, the sample mean (let us call it $\bar{X}$) is an unbiased estimate of the mean of this population and has the distribution N(μ; σ²/n).

Note: What if you need to build a confidence interval for a distribution that is not normal? In this case the central limit theorem (CLT) comes to the rescue: it says that for a sufficiently large sample of size n drawn from a non-normal distribution, the sampling distribution of the statistic $\bar{X}$ is approximately normal with parameters N(μ; σ²/n).

So, the point estimate of the mean of the distribution is the sample mean, i.e. $\bar{X}$. Now let us turn to the confidence interval.

Building a confidence interval

Usually, knowing a distribution and its parameters, we can calculate the probability that a random variable takes a value within a given interval. Now let us do the opposite: find the interval into which the random variable falls with a given probability. For example, from the properties of the normal distribution it is known that a normally distributed random variable falls within approximately ±2 standard deviations of the mean with a probability of 95%. This interval will serve as the prototype for our confidence interval.

Now let us see whether we know the distribution well enough to calculate this interval. To answer this question, we must specify the form of the distribution and its parameters.

We know the form of the distribution: it is normal (recall that we are talking about the sampling distribution of the statistic $\bar{X}$).

The parameter μ is unknown to us (it is exactly what we need to estimate with the confidence interval), but we have its estimate $\bar{X}$, calculated from the sample, which we can use.

The second parameter, the standard deviation of the sample mean, we take as known: it equals σ/√n.

Since we do not know μ, we will build the interval of ±2 standard deviations not around the mean but around its known estimate $\bar{X}$. That is, when calculating a confidence interval we will NOT assume that $\bar{X}$ falls within ±2 standard deviations of μ with a probability of 95%; instead, we will assume that the interval of ±2 standard deviations around $\bar{X}$ covers μ, the mean of the population from which the sample was drawn, with a probability of 95%. These two statements are equivalent, but the second one allows us to construct a confidence interval.
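The equivalence of the two statements can be checked in one line. Writing $\bar{X}$ for the sample mean and keeping the rounded multiplier 2 from the text, both statements describe the same event, namely that $\bar{X}$ and μ differ by no more than 2σ/√n:

$$P\left(\mu - 2\frac{\sigma}{\sqrt{n}} \le \bar{X} \le \mu + 2\frac{\sigma}{\sqrt{n}}\right) = P\left(\left|\bar{X} - \mu\right| \le 2\frac{\sigma}{\sqrt{n}}\right) = P\left(\bar{X} - 2\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 2\frac{\sigma}{\sqrt{n}}\right) \approx 0.95$$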

In addition, let us refine the interval: a normally distributed random variable falls with 95% probability within ±1.960 standard deviations, not ±2 standard deviations. This value can be calculated with the formula =NORM.S.INV((1+0.95)/2); see the sheet Spacing in the example file.

Now we can formulate the probabilistic statement that will serve as the basis for the confidence interval:
"The probability that the population mean lies no farther from the sample mean than 1.960 standard deviations of the sample mean is 95%."

The probability value mentioned in the statement has a special name, the confidence level, which is related to the significance level α (alpha) by the simple expression confidence level = 1 - α. In our case the significance level is α = 1 - 0.95 = 0.05.
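What a 95% confidence level means in practice can be illustrated with a small simulation. The sketch below uses assumed, arbitrary values for the population (μ = 78, σ = 8, samples of n = 25) and a fixed random seed; it draws many samples from a normal population, builds the interval of ±1.960 standard deviations of the sample mean around each sample mean, and counts how often that interval covers μ.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def coverage(mu=78.0, sigma=8.0, n=25, level=0.95, trials=10_000, seed=0):
    """Share of simulated intervals that cover the true mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf((1 + level) / 2)   # about 1.960 for level 0.95
    margin = z * sigma / sqrt(n)                # half-width, sigma is known
    hits = 0
    for _ in range(trials):
        x_bar = fmean([rng.gauss(mu, sigma) for _ in range(n)])
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / trials


share = coverage()
print(f"share of intervals covering mu: {share:.3f}")
```

The printed share should be close to the chosen confidence level, with small deviations due to the finite number of trials.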

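Now, based on this probabilistic statement, we write the expression for calculating the confidence interval:

$$\bar{X} - Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$

where $Z_{\alpha/2}$ is the upper α/2-quantile of the standard normal distribution (the value of the random variable z such that P(z ≥ $Z_{\alpha/2}$) = α/2).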

Note: The upper α/2-quantile determines the half-width of the confidence interval, measured in standard deviations of the sample mean. For α < 1, the upper α/2-quantile of the standard normal distribution is always greater than 0, which is very convenient.

In our case, with α = 0.05, the upper α/2-quantile equals 1.960. For other significance levels α (10%; 1%), the upper α/2-quantile $Z_{\alpha/2}$ can be calculated with the formula =NORM.S.INV(1-α/2) or, if the confidence level is known, =NORM.S.INV((1+confidence level)/2).

When building confidence intervals for the mean, one usually uses only the upper α/2-quantile and not the lower α/2-quantile. This is possible because the standard normal distribution is symmetric about zero (its density is symmetric about its mean, i.e. 0). Therefore there is no need to calculate the lower α/2-quantile (it is simply called the α/2-quantile), because it equals the upper α/2-quantile with a minus sign.
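For readers working outside Excel, the quantiles can be obtained with Python's standard `statistics.NormalDist`. The sketch below (the helper name `upper_quantile` is invented here) prints the upper and lower α/2-quantiles for the three significance levels mentioned above, which also shows the symmetry.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1


def upper_quantile(alpha):
    """Upper alpha/2-quantile of the standard normal distribution."""
    return std_normal.inv_cdf(1 - alpha / 2)


for alpha in (0.10, 0.05, 0.01):
    upper = upper_quantile(alpha)
    lower = std_normal.inv_cdf(alpha / 2)   # lower alpha/2-quantile
    print(f"alpha={alpha:.2f}: upper={upper:.3f}, lower={lower:.3f}")
```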

Recall that, regardless of the shape of the distribution of x, the corresponding random variable $\bar{X}$ is distributed approximately normally, N(μ; σ²/n), for sufficiently large n. Therefore, in general, the above expression for the confidence interval is only approximate. If x itself is normally distributed, N(μ; σ²), then the expression for the confidence interval is exact.

Calculation of confidence interval in MS EXCEL

Let's solve the problem.
The response time of an electronic component to an input signal is an important characteristic of a device. An engineer wants to construct a confidence interval for the average response time at a confidence level of 95%. From previous experience, the engineer knows that the standard deviation of the response time is 8 ms. To estimate the response time, the engineer made 25 measurements; the average value was 78 ms.

Solution: The engineer wants to know the response time of the electronic device, but he understands that the response time is not a fixed value but a random variable with its own distribution. So the best he can hope for is to determine the parameters and shape of this distribution.

Unfortunately, the problem statement does not tell us the form of the response-time distribution (it does not have to be normal). The mean of this distribution is also unknown. Only its standard deviation is known: σ = 8. Therefore, for now we cannot calculate probabilities or construct a confidence interval.

However, although we do not know the distribution of an individual response time, we know that, by the central limit theorem (CLT), the sampling distribution of the average response time is approximately normal (we will assume that the conditions of the CLT are satisfied because the sample size is large enough, n = 25).

Furthermore, the mean of this distribution equals the mean of the distribution of an individual response, i.e. μ, and the standard deviation of this distribution (σ/√n) can be calculated with the formula =8/SQRT(25).

It is also known that the engineer obtained a point estimate of the parameter μ equal to 78 ms ($\bar{X}$). Therefore we can now calculate probabilities, because we know the form of the distribution (normal) and its parameters ($\bar{X}$ and σ/√n).

The engineer wants to know the expected value μ of the response-time distribution. As stated above, this μ equals the expected value of the sampling distribution of the average response time. If we use a normal distribution with mean $\bar{X}$ and standard deviation σ/√n, then the desired μ will lie within ±2·σ/√n of $\bar{X}$ with a probability of approximately 95%.

Significance level equals 1-0.95=0.05.

Finally, we find the left and right boundaries of the confidence interval.
Left boundary: =78-NORM.S.INV(1-0.05/2)*8/SQRT(25) = 74.864
Right boundary: =78+NORM.S.INV(1-0.05/2)*8/SQRT(25) = 81.136

Equivalently, using NORM.INV:
Left boundary: =NORM.INV(0.05/2, 78, 8/SQRT(25))
Right boundary: =NORM.INV(1-0.05/2, 78, 8/SQRT(25))
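The same two calculations can be reproduced in Python as a cross-check. This is a sketch using the standard `statistics.NormalDist` class; the variable names are chosen here for illustration.

```python
from math import sqrt
from statistics import NormalDist

x_bar, sigma, n, alpha = 78.0, 8.0, 25, 0.05
std_err = sigma / sqrt(n)   # standard deviation of the sample mean, 8/5

# Way 1: sample mean minus/plus the upper alpha/2-quantile times sigma/sqrt(n)
z = NormalDist().inv_cdf(1 - alpha / 2)
left_1, right_1 = x_bar - z * std_err, x_bar + z * std_err

# Way 2: quantiles of a normal distribution centred at the sample mean
sampling = NormalDist(mu=x_bar, sigma=std_err)
left_2, right_2 = sampling.inv_cdf(alpha / 2), sampling.inv_cdf(1 - alpha / 2)

print(f"{left_1:.3f} {right_1:.3f}")
print(f"{left_2:.3f} {right_2:.3f}")
```

Both ways give the same boundaries, which is why either pair of Excel formulas can be used.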

Answer: the confidence interval at the 95% confidence level with σ = 8 ms is 78 ± 3.136 ms.

In the example file, the sheet Sigma known contains a form for calculating and constructing a two-sided confidence interval for an arbitrary sample with a given σ and significance level.

CONFIDENCE.NORM() function

If the sample values are in the range B20:B79 and the significance level is 0.05, then the MS EXCEL formula
=AVERAGE(B20:B79)-CONFIDENCE.NORM(0.05,σ,COUNT(B20:B79))
returns the left boundary of the confidence interval.

The same boundary can be calculated using the formula:
=AVERAGE(B20:B79)-NORM.S.INV(1-0.05/2)*σ/SQRT(COUNT(B20:B79))
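A Python counterpart of this calculation might look as follows. It is a sketch only: `confidence_norm` is a helper written here to return the same half-width as the second formula above, and the sample values are hypothetical stand-ins for the worksheet range.

```python
from math import sqrt
from statistics import NormalDist, fmean


def confidence_norm(alpha, sigma, size):
    """Half-width of the interval: upper alpha/2-quantile times sigma/sqrt(size)."""
    return NormalDist().inv_cdf(1 - alpha / 2) * sigma / sqrt(size)


# Hypothetical measurements standing in for the worksheet range; sigma assumed known
sample = [76.0, 81.5, 79.2, 74.8, 80.1, 77.4, 78.9, 75.6]
sigma = 8.0

half_width = confidence_norm(0.05, sigma, len(sample))
left = fmean(sample) - half_width
right = fmean(sample) + half_width
print(f"{left:.3f} .. {right:.3f}")
```

With σ = 8 and 25 observations, the helper returns about 3.136, the half-width found in the worked problem.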

Note: The CONFIDENCE.NORM() function appeared in MS EXCEL 2010. Earlier versions of MS EXCEL used the CONFIDENCE() function.