How to find a 95% confidence interval

An appraiser often has to analyze the real estate market segment in which the appraised object is located. If the market is developed, it can be difficult to analyze the entire set of listed objects, so a sample of objects is used for the analysis. This sample is not always homogeneous; sometimes it has to be cleared of extremes, that is, of market offers that are too high or too low. A confidence interval is used for this purpose. The purpose of this study is to compare two methods of calculating the confidence interval and to choose the best calculation option when working with different samples in the estimatica.pro system.

Confidence interval - an interval of values of a characteristic, calculated from the sample, that contains the estimated parameter of the general population with a known probability.

The point of calculating a confidence interval is to build, from the sample data, an interval for which it can be asserted with a given probability that the value of the estimated parameter lies within it. In other words, the confidence interval contains the unknown value of the estimated quantity with a certain probability. The wider the interval, the less precise the estimate.

There are different methods for determining the confidence interval. In this article, we will consider two:

  • through the median and standard deviation;
  • through the critical value of the t-statistic (Student's coefficient).

Stages of the comparative analysis of the two ways of calculating the CI:

1. form a data sample;

2. process it with statistical methods: calculate the mean, median, variance, etc.;

3. calculate the confidence interval in two ways;

4. analyze the cleaned samples and the obtained confidence intervals.

Stage 1. Data sampling

The sample was formed using the estimatica.pro system. It included 91 offers for the sale of 1-room apartments in the 3rd price zone with the "Khrushchev" type of layout.

Table 1. Initial sample

The price of 1 sq.m., c.u.

Fig.1. Initial sample



Stage 2. Processing of the initial sample

Processing the sample by statistical methods requires calculating the following values:

1. Arithmetic mean

2. Median - a number that characterizes the sample: exactly half of the sample elements are greater than the median, and the other half are less than it (for a sample with an odd number of values, it is the middle element of the ordered sample)

3. Range - the difference between the maximum and minimum values in the sample

4. Variance - used to estimate the variation in the data more accurately

5. The standard deviation for the sample (hereinafter referred to as SD) - the most common indicator of the dispersion of values around the arithmetic mean

6. Coefficient of variation - reflects the degree of dispersion of the values

7. Coefficient of oscillation - reflects the relative fluctuation of the extreme price values in the sample around the average

Table 2. Statistical indicators of the original sample

The coefficient of variation, which characterizes the homogeneity of the data, is 12.29%, but the coefficient of oscillation is too large. Thus, we can state that the original sample is not homogeneous, so let's move on to calculating the confidence interval.
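The indicators listed for Stage 2 can be sketched in a few lines of Python. This is only an illustration: the `prices` list is hypothetical (the original 91 offers are not reproduced here), the variance uses the sample denominator n - 1, and the coefficient of oscillation is assumed to be the range divided by the mean, which is its usual definition.

```python
import statistics


def describe(prices):
    """Return the descriptive statistics of Stage 2 for a list of prices."""
    mean = statistics.mean(prices)
    value_range = max(prices) - min(prices)
    sd = statistics.stdev(prices)  # sample SD, denominator n - 1
    return {
        "mean": mean,
        "median": statistics.median(prices),
        "range": value_range,
        "variance": statistics.variance(prices),  # sample variance
        "sd": sd,
        "cv_percent": 100 * sd / mean,  # coefficient of variation
        # Assumption: oscillation = range / mean, in percent.
        "oscillation_percent": 100 * value_range / mean,
    }


# Hypothetical prices per sq.m., in c.u.
prices = [39000, 48000, 51000, 53500, 54000, 55200, 57000, 61000, 72000]
stats = describe(prices)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

A large oscillation coefficient next to a moderate coefficient of variation points to a few extreme offers, which is the situation described above.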

Stage 3. Calculation of the confidence interval

Method 1. Calculation through the median and standard deviation.

The confidence interval is determined as follows: the lower bound is the median minus the standard deviation; the upper bound is the median plus the standard deviation.

Thus, the confidence interval is (47179 CU; 60689 CU).

Fig. 2. Values within confidence interval 1.
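Method 1 can be sketched as follows. The function names and the `prices` list are hypothetical, and the sample standard deviation (denominator n - 1) is assumed; the cleaned sample simply keeps the offers that fall inside the interval.

```python
import statistics


def median_sd_interval(values):
    """Interval of Method 1: median minus/plus one standard deviation."""
    median = statistics.median(values)
    sd = statistics.stdev(values)  # assumption: sample SD (n - 1)
    return median - sd, median + sd


def clean_sample(values, lower, upper):
    """Keep only the values that lie within [lower, upper]."""
    return [v for v in values if lower <= v <= upper]


# Hypothetical prices per sq.m., in c.u.
prices = [39000, 48000, 51000, 53500, 54000, 55200, 57000, 61000, 72000]
lower, upper = median_sd_interval(prices)
cleaned = clean_sample(prices, lower, upper)
print(f"interval: ({lower:.0f}; {upper:.0f}), kept {len(cleaned)} of {len(prices)}")
```

In this made-up sample the two extreme offers fall outside the interval and are removed.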



Method 2. Building a confidence interval through the critical value of the t-statistic (Student's coefficient)

S.V. Gribovsky, in the book "Mathematical methods for assessing the value of property", describes how to calculate the confidence interval through Student's coefficient. With this method, the appraiser must set the significance level α, which determines the probability with which the confidence interval will be built. Significance levels of 0.1, 0.05 and 0.01 are commonly used. They correspond to confidence probabilities of 0.9, 0.95 and 0.99. The method assumes that the true values of the mathematical expectation and variance are unknown (which is almost always the case in practical appraisal problems).

Confidence interval formula:

n - sample size;

t - the critical value of the t-statistic (Student's distribution) with significance level α and n-1 degrees of freedom, which is determined from special statistical tables or using MS Excel (→"Statistical"→ STUDRASPOBR);

α - significance level; we take α=0.01.

Fig. 3. Values within confidence interval 2.

Stage 4. Analysis of the different ways of calculating the confidence interval

The two ways of calculating the confidence interval - through the median and through Student's coefficient - led to different interval values. Accordingly, two different cleaned samples were obtained.

Table 3. Statistical indicators for the three samples.

Columns: Indicator; Initial sample; Option 1; Option 2.

Rows: Mean; Variance; Coefficient of variation; Coefficient of oscillation; Number of removed objects, pcs.

Based on the calculations performed, we can say that the confidence intervals obtained by the different methods overlap, so either calculation method can be used at the discretion of the appraiser.

However, we believe that when working in the estimatica.pro system, it is advisable to choose the method of calculating the confidence interval depending on the degree of market development:

  • if the market is not developed, apply the calculation through the median and standard deviation, since the number of removed objects in this case is small;
  • if the market is developed, apply the calculation through the critical value of the t-statistic (Student's coefficient), since it is possible to form a large initial sample.

The following were used in preparing the article:

1. Gribovsky S.V., Sivets S.A., Levykina I.A. Mathematical methods for assessing the value of property. Moscow, 2014

2. Data from the estimatica.pro system

Suppose we have a large number of items with a normal distribution of some characteristics (for example, a full warehouse of vegetables of the same type, the size and weight of which varies). You want to know the average characteristics of the entire batch of goods, but you have neither the time nor the inclination to measure and weigh each vegetable. You understand that this is not necessary. But how many pieces would you need to take for random inspection?

Before giving some formulas useful for this situation, we recall some notation.

First, if we did measure the entire warehouse of vegetables (this set of elements is called the general population), we would know the average weight of the entire batch with all the accuracy available to us. Let's call this average X_gen, the general average. We already know that a normal distribution is completely determined if its mean and standard deviation s are known. True, so far we know neither X_gen nor s for the general population. We can only take a sample, measure the values we need, and calculate for this sample both the sample mean X_sample and the sample standard deviation S_sample.

It is known that if our sample contains a large number of elements (usually n is greater than 30) and they are chosen truly at random, then s of the general population will hardly differ from S_sample.

In addition, for the case of a normal distribution, we can use the following formulas:

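With a probability of 95%:

$$X_{\text{sample}} - 1.96\,\frac{s}{\sqrt{n}} \le X_{\text{gen}} \le X_{\text{sample}} + 1.96\,\frac{s}{\sqrt{n}}$$

With a probability of 99%:

$$X_{\text{sample}} - 2.58\,\frac{s}{\sqrt{n}} \le X_{\text{gen}} \le X_{\text{sample}} + 2.58\,\frac{s}{\sqrt{n}}$$

In general, with probability P(t):

$$X_{\text{sample}} - t\,\frac{s}{\sqrt{n}} \le X_{\text{gen}} \le X_{\text{sample}} + t\,\frac{s}{\sqrt{n}} \qquad (1)$$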
The relationship between the value of t and the value of the probability P (t), with which we want to know the confidence interval, can be taken from the following table:


Thus, we have determined in what range the average value for the general population is (with a given probability).

If the sample is not large enough, we cannot claim that the general population has s = S_sample. In addition, in this case it is doubtful that the sample is close to a normal distribution. We still use S_sample instead of s in the formula:

$$X_{\text{sample}} - t\,\frac{S_{\text{sample}}}{\sqrt{n}} \le X_{\text{gen}} \le X_{\text{sample}} + t\,\frac{S_{\text{sample}}}{\sqrt{n}}$$

but the value of t for a fixed probability P(t) will depend on the number of elements in the sample n. The larger n, the closer the resulting confidence interval will be to the one given by formula (1). The values of t in this case are taken from another table (Student's t-test), which we present below:

Student's t-test values for probabilities 0.95 and 0.99


Example 3. 30 people were randomly selected from the employees of a company. According to the sample, the average salary (per month) is 30 thousand rubles, with a standard deviation of 5 thousand rubles. Determine the average salary in the company with a probability of 0.99.

Solution: By condition, we have n = 30, X_sample = 30000, S = 5000, P = 0.99. To find the confidence interval, we use the formula corresponding to Student's criterion. From the table, for n = 30 and P = 0.99 we find t = 2.756, therefore

$$30000 - 2.756 \cdot \frac{5000}{\sqrt{30}} \le X_{\text{gen}} \le 30000 + 2.756 \cdot \frac{5000}{\sqrt{30}}$$

i.e. the desired confidence interval is 27484 < X_gen < 32516.

So, with a probability of 0.99, it can be argued that the interval (27484; 32516) contains the average salary in the company.
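The arithmetic of this example is easy to check with a short Python sketch. The function name is hypothetical, and the value t = 2.756 is taken from the table quoted in the solution rather than computed.

```python
import math


def mean_confidence_interval(sample_mean, sample_sd, n, t):
    """Interval sample_mean -/+ t * sample_sd / sqrt(n)."""
    half_width = t * sample_sd / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


# Values from the salary example; t comes from the Student table.
lower, upper = mean_confidence_interval(30000, 5000, 30, t=2.756)
print(round(lower), round(upper))  # 27484 32516
```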

We hope you will be able to use this method even without a table at hand. The calculations can be carried out automatically in Excel. In an Excel file, click the fx button in the top menu. Then select the "statistical" category among the functions and choose STEUDRASP from the proposed list. Next, with the cursor in the "probability" field, type the complementary probability (that is, in our case, instead of the probability 0.95 you need to type 0.05). Apparently, the function is designed so that this input answers the question of how likely we are to be wrong. Similarly, in the "degrees of freedom" field, enter the value (n-1) for your sample.

The confidence interval came to us from the field of statistics. It is a range that serves to estimate an unknown parameter with a high degree of reliability. The easiest way to explain this is with an example.

Suppose you need to investigate some random variable, for example, the speed of a server's response to a client request. Each time a user types in the address of a particular site, the server responds at a different speed. Thus, the response time under investigation is random. The confidence interval allows you to determine the boundaries of this parameter, and then it will be possible to assert that with a probability of 95% the server response time will be in the range we calculated.

Or you need to find out how many people know a company's trademark. Once the confidence interval is calculated, it will be possible, for example, to say that with a 95% probability the share of consumers who know the trademark is in the range from 27% to 34%.

Closely related to this term is the confidence level. It represents the probability that the desired parameter is included in the confidence interval. This value determines how large our range will be. The greater the confidence level, the wider the confidence interval becomes, and vice versa. Usually it is set to 90%, 95% or 99%. The value of 95% is the most popular.

This indicator is also affected by the variance of the observations, and its definition rests on the assumption that the feature under study obeys the normal distribution law, also known as Gauss's law. Under this law, the distribution of all probabilities of a continuous random variable can be described by a probability density. If the assumption of a normal distribution turns out to be wrong, the estimate may be incorrect.

First, let's figure out how to calculate the confidence interval for the mathematical expectation. Here, two cases are possible. The variance (the degree of spread of a random variable) may or may not be known. If it is known, the confidence interval is calculated using the following formula:

xsr - t*σ / (sqrt(n)) <= α <= xsr + t*σ / (sqrt(n)), where

α is the estimated parameter (the true mean of the feature),

t is a parameter taken from the table of the Laplace function,

σ is the square root of the variance.

If the variance is unknown, it can be calculated when all the values of the feature under study are known. The following formula is used:

σ^2 = (x^2)sr - (xsr)^2, where

(x^2)sr - the mean of the squares of the feature under study,

(xsr)^2 - the square of the mean of this feature.

The formula by which the confidence interval is calculated in this case changes slightly:

xsr - t*s / (sqrt(n)) <= α <= xsr + t*s / (sqrt(n)), where

xsr - sample mean,

α - the estimated parameter,

t is a parameter found from the Student's distribution table, t = t(ɣ; n-1),

sqrt(n) is the square root of the sample size,

s is the square root of the variance.

Consider this example. Assume that, based on the results of 7 measurements, the mean of the feature under study was found to be 30 and the sample variance 36. It is necessary to find, with a probability of 99%, a confidence interval that contains the true value of the measured parameter.

First, let's determine what t is equal to: t = t(0.99; 7-1) = 3.71. The standard deviation is the square root of the variance: s = sqrt(36) = 6. Using the above formula, we get:

xsr - t*s / (sqrt(n)) <= α <= xsr + t*s / (sqrt(n))

30 - 3.71*6 / (sqrt(7)) <= α <= 30 + 3.71*6 / (sqrt(7))

21.587 <= α <= 38.413
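A small Python sketch can confirm both steps of this calculation. It is only an illustration: the function names and the `values` list are hypothetical, and t = 3.71 is the table value used in the example. The key point is that the interval formula needs the standard deviation, i.e. the square root of the variance, not the variance itself.

```python
import math


def variance_from_moments(values):
    """Variance as mean of squares minus squared mean (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    mean_of_squares = sum(v * v for v in values) / n
    return mean_of_squares - mean ** 2


def interval_from_variance(mean, variance, n, t):
    """Interval mean -/+ t * s / sqrt(n), where s = sqrt(variance)."""
    s = math.sqrt(variance)
    half_width = t * s / math.sqrt(n)
    return mean - half_width, mean + half_width


# Hypothetical data for the variance formula.
values = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance_from_moments(values))  # 4.0

# Numbers from the example: 7 measurements, mean 30, variance 36.
lower, upper = interval_from_variance(30, 36, 7, t=3.71)
print(round(lower, 3), round(upper, 3))  # 21.587 38.413
```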

The confidence interval for the variance can be calculated both when the mean is known and when there is no data on the mathematical expectation and only the unbiased point estimate of the variance is known. We will not give the formulas here, since they are quite complex and can always be found online if desired.

We only note that it is convenient to determine the confidence interval using Excel or an online service.

Sample indicators are all estimates of their theoretical counterparts, which could be obtained if we had not a sample but the whole general population. But alas, studying the general population is very expensive and often impossible.

The concept of interval estimation

Any sample estimate has some scatter, because it is a random variable that depends on the values in a particular sample. Therefore, for more reliable statistical inferences, one should know not only the point estimate but also the interval that covers the estimated indicator θ (theta) with a high probability γ (gamma).

Formally, these are two values (statistics) T1(X) and T2(X), with T1 < T2, for which the following condition is met at a given probability level γ:

$$P\big(T_1(X) \le \theta \le T_2(X)\big) \ge \gamma$$

In short, with probability γ or more the true value lies between the points T1(X) and T2(X), which are called the lower and upper bounds of the confidence interval.

One of the requirements in constructing a confidence interval is that it be as narrow as possible, i.e. as short as possible. This wish is quite natural, because the researcher tries to localize the desired parameter as precisely as possible.

It follows that the confidence interval should cover the region of highest probability of the distribution, with the estimate itself at the center.

That is, the probability of deviation (of the true indicator from the estimate) upwards is equal to the probability of deviation downwards. It should also be noted that for skewed distributions, the interval on the right is not equal to the interval on the left.

The figure above clearly shows that the greater the confidence level, the wider the interval - a direct relationship.

This was a small introduction to the theory of interval estimation of unknown parameters. Let's move on to finding confidence limits for the mathematical expectation.

Confidence interval for mathematical expectation

If the original data are normally distributed, then the mean will also be a normally distributed value. This follows from the rule that a linear combination of normal values also has a normal distribution. Therefore, to calculate the probabilities, we could use the mathematical apparatus of the normal distribution law.

However, this requires knowing two parameters, the expected value and the variance, which are usually not known. You can, of course, use estimates instead of the parameters (the arithmetic mean and the sample variance), but then the distribution of the mean will not be quite normal; it will be slightly flattened. William Gosset, who was working in Ireland, adroitly noted this fact when he published his discovery in the March 1908 issue of Biometrika. For secrecy purposes, Gosset signed as Student. This is how the Student's t-distribution appeared.

However, the normal distribution of data, used by K. Gauss in the analysis of errors in astronomical observations, is extremely rare in everyday life, and it is quite difficult to establish (for high accuracy, about 2 thousand observations are needed). Therefore, it is best to drop the normality assumption and use methods that do not depend on the distribution of the original data.

The question arises: what is the distribution of the arithmetic mean if it is calculated from data with an unknown distribution? The answer is given by the central limit theorem (CLT), well known in probability theory. In mathematics, there are several versions of it (the formulations have been refined over the years), but all of them, roughly speaking, come down to the statement that the sum of a large number of independent random variables obeys the normal distribution law.

When calculating the arithmetic mean, the sum of random variables is used. It follows that the arithmetic mean has a normal distribution in which the expected value is the expected value of the initial data, and the variance is the variance of the initial data divided by n.

Smart people know how to prove the CLT, but we will verify this with the help of an experiment conducted in Excel. Let's simulate a sample of 50 uniformly distributed random variables (using the Excel function RANDOMBETWEEN). Then we will make 1000 such samples and calculate the arithmetic mean for each. Let's look at their distribution.

It can be seen that the distribution of the average is close to the normal law. If the volume of samples and their number are made even larger, then the similarity will be even better.
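The same experiment can be repeated outside Excel. The sketch below is an assumption-laden stand-in for the spreadsheet: it draws continuous uniform values on [0, 1] with `random.uniform` instead of RANDOMBETWEEN integers, fixes the seed for reproducibility, and compares the spread of the 1000 sample means with the value σ/√n predicted by the CLT.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so that the run is reproducible

SAMPLE_SIZE = 50    # observations per sample
NUM_SAMPLES = 1000  # number of samples

# Arithmetic mean of each simulated sample of uniform values.
sample_means = [
    statistics.mean([random.uniform(0, 1) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)

# For a uniform distribution on [0, 1]: mean 0.5, variance 1/12.
expected_sd = math.sqrt(1 / 12) / math.sqrt(SAMPLE_SIZE)

# Share of sample means within 1.96 standard errors of the true mean.
within = sum(
    abs(m - 0.5) <= 1.96 * expected_sd for m in sample_means
) / NUM_SAMPLES

print(f"mean of means: {mean_of_means:.4f} (theory 0.5)")
print(f"SD of means:   {sd_of_means:.4f} (theory {expected_sd:.4f})")
print(f"share within 1.96 standard errors: {within:.3f} (theory about 0.95)")
```

If the means are close to normally distributed, roughly 95% of them should fall within 1.96 standard errors of 0.5.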

Now that we have seen for ourselves the validity of the CLT, we can use the normal distribution to calculate confidence intervals for the arithmetic mean that cover the true mean, or mathematical expectation, with a given probability.

To establish the upper and lower bounds, the parameters of the normal distribution must be known. As a rule, they are not, so estimates are used: the arithmetic mean and the sample variance. Again, this method gives a good approximation only for large samples. When samples are small, it is often recommended to use Student's distribution. Don't believe it! Student's distribution for the mean arises only when the original data has a normal distribution, that is, almost never. Therefore, it is better to set a minimum bar for the amount of required data right away and use asymptotically correct methods. They say 30 observations are enough. Take 50 and you can't go wrong.

$$T_{1,2} = \bar{x} \pm c_\gamma \frac{s_0}{\sqrt{n}}$$

T1,2 – the lower and upper bounds of the confidence interval

x̄ – sample arithmetic mean

s0 – sample standard deviation (unbiased)

n – sample size

γ – confidence level (usually equal to 0.9, 0.95 or 0.99)

c_γ = Φ^(-1)((1+γ)/2) is the value of the inverse of the standard normal distribution function. In simple terms, this is the number of standard errors from the arithmetic mean to the lower or upper bound (the three probabilities above correspond to the values 1.64, 1.96 and 2.58).

The essence of the formula is that the arithmetic mean is taken and then a certain number (c_γ) of standard errors (s0/√n) is set aside from it on each side. Everything is known; just take it and compute.
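The values of c_γ quoted above can be reproduced with Python's standard library. This is a minimal sketch; the helper name `c_gamma` is hypothetical, while `statistics.NormalDist().inv_cdf` is the standard-library inverse of the normal distribution function.

```python
from statistics import NormalDist


def c_gamma(gamma):
    """Number of standard errors for a two-sided confidence level gamma."""
    return NormalDist().inv_cdf((1 + gamma) / 2)


quantiles = {gamma: round(c_gamma(gamma), 2) for gamma in (0.90, 0.95, 0.99)}
print(quantiles)  # {0.9: 1.64, 0.95: 1.96, 0.99: 2.58}
```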

Before the mass use of PCs, statistical tables were used to obtain the values of the normal distribution function and its inverse. They are still used, but it is more efficient to turn to ready-made Excel formulas. All elements of the formula above (the mean, the standard deviation and c_γ) can be easily calculated in Excel. But there is also a ready-made function for calculating the confidence interval - CONFIDENCE.NORM. Its syntax is as follows.

CONFIDENCE.NORM(alpha, standard_dev, size)

alpha – the significance level, which in the above notation is equal to 1-γ, i.e. the probability that the mathematical expectation will be outside the confidence interval. With a confidence level of 0.95, alpha is 0.05, and so on.

standard_dev – the standard deviation of the sample data. You don't need to calculate the standard error; Excel will divide by the root of n itself.

size – sample size (n).

The result of the CONFIDENCE.NORM function is the second term from the formula for calculating the confidence interval, i.e. half-interval. Accordingly, the lower and upper points are the average ± the obtained value.
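For readers without Excel, the half-interval can be sketched in Python. The function below is a hypothetical stand-in written from the description above, not Excel's own implementation, and the input values are made up for illustration.

```python
import math
from statistics import NormalDist


def confidence_norm(alpha, standard_dev, size):
    """Half-width of the normal-based confidence interval for the mean."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # same value as c_gamma
    return z * standard_dev / math.sqrt(size)


# Hypothetical inputs: alpha = 0.05, standard deviation 2.5, n = 50.
half_width = confidence_norm(0.05, 2.5, 50)
print(f"{half_width:.6f}")

# The interval is the sample mean minus/plus the half-width.
sample_mean = 30.0  # hypothetical
print(sample_mean - half_width, sample_mean + half_width)
```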

Thus, it is possible to build a universal algorithm for calculating confidence intervals for the arithmetic mean, which does not depend on the distribution of the initial data. The price for universality is its asymptotic nature, i.e. the need to use relatively large samples. However, in the age of modern technology, collecting the right amount of data is usually not difficult.

Testing Statistical Hypotheses Using a Confidence Interval


One of the main problems solved in statistics is hypothesis testing. In a nutshell, its essence is this. An assumption is made, for example, that the expectation of the general population is equal to some value. Then the distribution of sample means that can be observed under this expectation is constructed. Next, we look at where the actual average is located in this conditional distribution. If it goes beyond the allowable limits, then the appearance of such an average is very unlikely, and with a single repetition of the experiment almost impossible. This contradicts the hypothesis put forward, which is therefore rejected. If the average does not go beyond the critical level, the hypothesis is not rejected (but it is not proved either!).

So, with the help of confidence intervals, in our case for the expectation, you can also test some hypotheses. It's very easy to do. Suppose the arithmetic mean for some sample is 100. The hypothesis being tested is that the expected value is, say, 90. Put crudely, the question is: can it be that, with a true mean of 90, the observed average was 100?

To answer this question, additional information on the standard deviation and sample size is required. Let's say the standard deviation is 30 and the number of observations is 64 (so that the root is easy to extract). Then the standard error of the mean is 30/8, or 3.75. To calculate the 95% confidence interval, you need to set aside two standard errors on both sides of the mean (more precisely, 1.96). The confidence interval will be approximately 100 ± 7.5, or from 92.5 to 107.5.

Further reasoning is as follows. If the tested value falls within the confidence interval, it does not contradict the hypothesis, since it fits within the limits of random fluctuations (with a probability of 95%). If the tested point is outside the confidence interval, the probability of such an event is very small, in any case below the acceptable level. Hence, the hypothesis is rejected as contradicting the observed data. In our case, the hypothesized expectation is outside the confidence interval (the tested value of 90 is not included in the interval 100 ± 7.5), so the hypothesis should be rejected. Answering the crude question above, one should say: no, it cannot; at any rate, this happens extremely rarely. Often the specific probability of erroneously rejecting the hypothesis (the p-level) is reported rather than the preset level at which the confidence interval was built, but more on that another time.

As you can see, it is not difficult to build a confidence interval for the mean (or mathematical expectation). The main thing is to catch the essence, and then things will go. In practice, most use the 95% confidence interval, which is about two standard errors wide on either side of the mean.

That's all for now. All the best!

In statistics, there are two types of estimates: point and interval. A point estimate is a single sample statistic that is used to estimate a population parameter. For example, the sample mean is a point estimate of the population mean, and the sample variance S² is a point estimate of the population variance σ². It has been shown that the sample mean is an unbiased estimate of the population expectation. The sample mean is called unbiased because the mean of all sample means (with the same sample size n) is equal to the mathematical expectation of the general population.

For the sample variance S² to be an unbiased estimator of the population variance σ², the denominator of the sample variance should be set equal to n – 1 rather than n. In other words, the population variance is the average of all possible sample variances.

When estimating population parameters, keep in mind that sample statistics such as the sample mean depend on the specific sample. To take this into account, an interval estimate of the mathematical expectation of the general population is obtained by analyzing the distribution of sample means. The constructed interval is characterized by a certain confidence level, which is the probability that the interval contains the true parameter of the general population. Similar confidence intervals can be used to estimate the proportion of a feature p and the main distributed mass of the general population.


Construction of a confidence interval for the mathematical expectation of the general population with a known standard deviation

Building a confidence interval for the proportion of a trait in the general population

In this section, the concept of a confidence interval is extended to categorical data. This allows you to estimate the share of the trait in the general population p using the sample share p_S = X/n. As mentioned, if the values np and n(1 - p) exceed 5, the binomial distribution can be approximated by the normal one. Therefore, to estimate the share of a trait in the general population p, it is possible to construct an interval whose confidence level is equal to (1 - α)x100%:

$$p_S - Z\sqrt{\frac{p_S(1 - p_S)}{n}} \le p \le p_S + Z\sqrt{\frac{p_S(1 - p_S)}{n}}$$

where p_S is the sample share of the feature, equal to X/n, i.e. the number of successes divided by the sample size, p is the share of the trait in the general population, Z is the critical value of the standardized normal distribution, and n is the sample size.

Example 3. Let's assume that a sample of 100 invoices completed during the last month is extracted from the information system, and that 10 of these invoices are incorrect. Thus, p_S = 10/100 = 0.1. The 95% confidence level corresponds to the critical value Z = 1.96.

$$0.1 - 1.96\sqrt{\frac{0.1 \cdot 0.9}{100}} \le p \le 0.1 + 1.96\sqrt{\frac{0.1 \cdot 0.9}{100}}, \qquad 0.0412 \le p \le 0.1588$$

Thus, there is a 95% chance that between 4.12% and 15.88% of invoices contain errors.
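The invoice example can be verified with a short Python sketch. The function name is hypothetical, and Z = 1.96 is passed in as given in the example rather than computed.

```python
import math


def proportion_interval(successes, n, z):
    """Normal-approximation confidence interval for a proportion."""
    p_s = successes / n  # sample share
    half_width = z * math.sqrt(p_s * (1 - p_s) / n)
    return p_s - half_width, p_s + half_width


# Invoice example: 10 incorrect invoices out of 100, Z = 1.96.
lower, upper = proportion_interval(10, 100, z=1.96)
print(f"{lower:.4f} {upper:.4f}")  # 0.0412 0.1588
```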

For a given sample size, the confidence interval containing the proportion of the trait in the general population appears wider than for a continuous random variable. This is because measurements of a continuous random variable contain more information than measurements of categorical data. In other words, categorical data that take only two values carry little information for estimating the parameters of their distribution.

Calculating estimates for samples drawn from a finite population

Estimation of the mathematical expectation. The finite population correction factor (fpc) is used to reduce the standard error by a factor of $\sqrt{(N - n)/(N - 1)}$. When calculating confidence intervals for population parameter estimates, the correction factor is applied in situations where samples are drawn without replacement. Thus, the confidence interval for the mathematical expectation with a confidence level equal to (1 - α)x100% is calculated by the formula:

$$\bar{X} \pm t_{n-1}\,\frac{S}{\sqrt{n}}\sqrt{\frac{N - n}{N - 1}} \qquad (6)$$

Example 4. To illustrate the application of the correction factor for a finite population, let us return to the problem of calculating the confidence interval for the average invoice amount discussed in Example 3 above. Suppose that a company issues 5,000 invoices per month, and X̄ = $110.27, S = $28.95, N = 5000, n = 100, α = 0.05, t99 = 1.9842. According to formula (6) we get:

$$110.27 \pm 1.9842 \cdot \frac{28.95}{\sqrt{100}}\sqrt{\frac{5000 - 100}{5000 - 1}} = 110.27 \pm 5.69, \qquad 104.58 \le \mu \le 115.96$$
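The effect of the correction factor in this example can be sketched in Python. The function name is hypothetical, and t = 1.9842 is the value given in the example for 99 degrees of freedom.

```python
import math


def fpc_mean_interval(sample_mean, sample_sd, n, population_size, t):
    """Confidence interval for the mean with finite population correction."""
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    half_width = t * sample_sd / math.sqrt(n) * fpc
    return sample_mean - half_width, sample_mean + half_width


# Numbers from Example 4.
lower, upper = fpc_mean_interval(110.27, 28.95, 100, 5000, t=1.9842)
print(f"{lower:.2f} {upper:.2f}")

# Without the correction the interval would be slightly wider.
uncorrected_half_width = 1.9842 * 28.95 / math.sqrt(100)
print(f"uncorrected half-width: {uncorrected_half_width:.2f}")
```

Because only 2% of the population is sampled here, the correction narrows the interval only slightly.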

Estimation of the share of the feature. When sampling without replacement, the confidence interval for the proportion of the feature with a confidence level equal to (1 - α)x100% is calculated by the formula:

$$p_S \pm Z\sqrt{\frac{p_S(1 - p_S)}{n}}\sqrt{\frac{N - n}{N - 1}}$$

Confidence intervals and ethical issues

When sampling a population and formulating statistical inferences, ethical problems often arise. The main one is how the confidence intervals and point estimates of sample statistics agree. Publishing point estimates without specifying the appropriate confidence intervals (usually at 95% confidence levels) and the sample size from which they are derived can be misleading. This may give the user the impression that a point estimate is exactly what he needs to predict the properties of the entire population. Thus, it is necessary to understand that in any research, not point, but interval estimates should be put at the forefront. In addition, special attention should be paid to the correct choice of sample sizes.

Most often, the objects of statistical manipulations are the results of sociological surveys of the population on various political issues. At the same time, the results of the survey are placed on the front pages of newspapers, and the sampling error and the methodology of statistical analysis are printed somewhere in the middle. To prove the validity of the obtained point estimates, it is necessary to indicate the sample size on the basis of which they were obtained, the boundaries of the confidence interval and its significance level.


Materials used: Levine et al. Statistics for Managers. - M.: Williams, 2004. - pp. 448–462

Central limit theorem states that, given a sufficiently large sample size, the sample distribution of means can be approximated by a normal distribution. This property does not depend on the type of population distribution.