Finding the 95% confidence interval. Confidence interval

An appraiser often has to analyze the real estate market segment in which the appraised property is located. If the market is well developed, it can be difficult to analyze the entire set of listed properties, so a sample of properties is used for the analysis. This sample is not always homogeneous; sometimes it has to be cleared of extremes, that is, of market offers that are too high or too low. A confidence interval is used for this purpose. The aim of this study is to compare two methods of calculating the confidence interval and to choose the better calculation option for working with different samples in the estimatica.pro system.

A confidence interval is an interval of values of a characteristic, calculated from a sample, that contains the estimated parameter of the general population with a known probability.

The point of calculating a confidence interval is to construct, from the sample data, an interval for which it can be asserted with a given probability that the value of the estimated parameter lies within it. In other words, the confidence interval contains the unknown value of the estimated quantity with a certain probability. The wider the interval, the lower the precision of the estimate.

There are different methods for determining the confidence interval. In this article we consider two of them:

  • through the median and the standard deviation;
  • through the critical value of the t-statistic (Student's coefficient).

Stages of the comparative analysis of the two CI calculation methods:

1. we form a sample of data;

2. we process it by statistical methods: calculate the mean, median, variance, etc.;

3. we calculate the confidence interval in two ways;

4. we analyze the cleaned samples and the obtained confidence intervals.

Stage 1. Data sampling

The sample was formed using the estimatica.pro system. It included 91 offers for the sale of one-room apartments in the 3rd price zone with the "Khrushchev" layout type.

Table 1. Initial sample (price per 1 sq. m, CU)

Fig. 1. Initial sample



Stage 2. Processing of the original sample

Processing a sample by statistical methods requires calculating the following values:

1. Arithmetic mean

2. Median: a number characterizing the sample such that exactly half of the sample values are greater than the median and the other half are less (for a sample with an odd number of values, it is the middle value of the ordered sample)

3. Range: the difference between the maximum and minimum values in the sample

4. Variance: used for a more accurate estimate of the variation in the data

5. Sample standard deviation (hereinafter SD): the most common indicator of the dispersion of values around the arithmetic mean

6. Coefficient of variation: reflects the degree of dispersion of the values

7. Oscillation coefficient: reflects the relative fluctuation of the extreme price values in the sample around the mean
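The formulas for these indicators are not reproduced in the text, so the following is only a minimal sketch under stated assumptions: the variance uses n - 1 in the denominator, the coefficient of variation is taken as SD / mean, and the oscillation coefficient as range / mean, both in percent. The function name `describe` and the prices are hypothetical and are not the article's sample.

```python
import statistics


def describe(prices):
    """Return the Stage 2 indicators for a list of prices per sq. m."""
    mean = statistics.mean(prices)
    median = statistics.median(prices)
    value_range = max(prices) - min(prices)
    variance = statistics.variance(prices)  # sample variance, n - 1 in the denominator
    sd = statistics.stdev(prices)           # square root of the sample variance
    return {
        "mean": mean,
        "median": median,
        "range": value_range,
        "variance": variance,
        "sd": sd,
        "cv_percent": 100 * sd / mean,                    # assumed definition: SD / mean
        "oscillation_percent": 100 * value_range / mean,  # assumed definition: range / mean
    }


# Hypothetical prices, not taken from the article
sample = [48000, 51000, 52500, 54000, 55500, 57000, 71000]
stats = describe(sample)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

A single extreme offer (71000 here) inflates the oscillation coefficient much more than the coefficient of variation, which is the kind of pattern the article uses as a sign of inhomogeneity.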

Table 2. Statistical indicators of the original sample

The coefficient of variation, which characterizes the homogeneity of the data, is 12.29%, but the oscillation coefficient is too large. We can therefore conclude that the original sample is not homogeneous, so let us move on to calculating the confidence interval.

Stage 3. Calculation of the confidence interval

Method 1. Calculation through the median and standard deviation.

The confidence interval is determined as follows: the lower bound is the median minus the standard deviation; the upper bound is the median plus the standard deviation.

Thus, the confidence interval is (CU 47179; CU 60689).

Fig. 2. Values that fall within confidence interval 1.
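Method 1 can be sketched in a few lines. This is an illustration only: the function name `clean_by_median_sd` and the prices are hypothetical, and the sample standard deviation (n - 1 denominator) is assumed.

```python
import statistics


def clean_by_median_sd(prices):
    """Keep only the prices inside the interval median ± standard deviation."""
    median = statistics.median(prices)
    sd = statistics.stdev(prices)  # sample standard deviation (assumption)
    low, high = median - sd, median + sd
    kept = [p for p in prices if low <= p <= high]
    return (low, high), kept


# Hypothetical prices, not taken from the article
sample = [48000, 51000, 52500, 54000, 55500, 57000, 71000]
(low, high), kept = clean_by_median_sd(sample)
print(f"interval: ({low:.0f}; {high:.0f})")
print(f"kept {len(kept)} of {len(sample)} offers")
```

In this made-up sample only the offer at 71000 falls outside the interval and is dropped.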



Method 2. Construction of the confidence interval through the critical value of the t-statistic (Student's coefficient)

S.V. Gribovsky, in the book "Mathematical Methods for Assessing the Value of Property", describes a method for calculating the confidence interval through Student's coefficient. When using this method, the appraiser must set the significance level α, which determines the probability with which the confidence interval is constructed. Significance levels of 0.1, 0.05 and 0.01 are commonly used; they correspond to confidence probabilities of 0.9, 0.95 and 0.99. With this method, the true values of the mathematical expectation and the variance are assumed to be unknown (which is almost always the case in practical appraisal problems).

Confidence interval formula:

where n is the sample size;

t is the critical value of the t-statistic (Student's distribution) with significance level α and n-1 degrees of freedom, determined from special statistical tables or in MS Excel (→ "Statistical" → TINV, СТЬЮДРАСПОБР in the Russian-language version);

α is the significance level; we take α = 0.01.
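The formula itself is not reproduced in the text. A standard form that is consistent with the symbols defined here (n, the critical value t with n-1 degrees of freedom, and α) is the Student interval for the mean. It is given below as an assumption, not as a quotation from Gribovsky's book; \bar{x} and S (the sample mean and sample standard deviation) and μ (the estimated mean) are notation introduced for this sketch.

```latex
\bar{x} - t_{\alpha,\,n-1}\,\frac{S}{\sqrt{n}} \;\le\; \mu \;\le\; \bar{x} + t_{\alpha,\,n-1}\,\frac{S}{\sqrt{n}}
```

Because the half-width shrinks as 1/\sqrt{n}, an interval of this form becomes narrow for large samples, so many objects fall outside it.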

Fig. 2. Values that fall within confidence interval 2.

Stage 4. Analysis of different methods of calculating the confidence interval

The two methods of calculating the confidence interval, through the median and through Student's coefficient, led to different interval values. Accordingly, we obtained two different cleaned samples.

Table 3. Statistical indicators for three samples.

Indicator

Initial sample

Option 1

Option 2

Mean

Variance

Coefficient of variation

Coefficient of oscillation

Number of removed objects, pcs.

Based on the calculations performed, we can say that the confidence intervals obtained by the two methods overlap, so either calculation method can be used at the appraiser's discretion.

However, we believe that when working in the estimatica.pro system it is advisable to choose the method of calculating the confidence interval according to the degree of market development:

  • if the market is undeveloped, use the calculation through the median and the standard deviation, since the number of removed objects is small in this case;
  • if the market is developed, use the calculation through the critical value of the t-statistic (Student's coefficient), since a large initial sample can be formed.

In preparing the article, the following were used:

1. Gribovsky S.V., Sivets S.A., Levykina I.A. Mathematical methods for assessing the value of property. Moscow, 2014

2. Data of the estimatica.pro system

Suppose we have a large number of items whose characteristics are normally distributed (for example, a warehouse full of vegetables of the same type that vary in size and weight). You want to know the average characteristics of the entire batch, but you have neither the time nor the desire to measure and weigh every vegetable. You understand that this is not necessary. But how many items would have to be taken for a spot check?

Before giving some useful formulas for this situation, let us recall some notation.

First, if we did measure the entire warehouse of vegetables (this set of elements is called the general population), we would know the average weight of the entire batch with all the accuracy available to us. Let us call this average X_gen, the general (population) mean. We already know that a normal distribution is completely determined if we know its mean and its standard deviation σ. True, so far we know neither X_gen nor σ of the general population. We can only take a sample, measure the values we need, and calculate for this sample both the mean X_avg and the standard deviation S.

It is known that if our sample contains a large number of elements (usually n greater than 30) and they are taken truly at random, then σ of the general population will hardly differ from the sample value S.

In addition, for the case of a normal distribution, we can use the following formulas:

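With a probability of 95%:

X_avg - 1.96 * σ / sqrt(n) < X_gen < X_avg + 1.96 * σ / sqrt(n)

With a probability of 99%:

X_avg - 2.58 * σ / sqrt(n) < X_gen < X_avg + 2.58 * σ / sqrt(n)

In general form, with probability P(t):

X_avg - t * σ / sqrt(n) < X_gen < X_avg + t * σ / sqrt(n)    (1)

where X_avg is the sample mean, X_gen is the general mean, σ is the standard deviation and n is the sample size.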

The relationship between the value of t and the value of the probability P (t), with which we want to know the confidence interval, can be taken from the following table:


Thus, we determined in what range the average value for the general population (with a given probability) is located.
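The table linking t and P(t) is not reproduced here. For the large-sample (normal) case the multiplier can be computed instead, as a sketch, with the standard library's `statistics.NormalDist`. The function names and the vegetable weights below are hypothetical.

```python
from statistics import NormalDist


def t_for_probability(p):
    """Two-sided standard normal multiplier t for confidence probability p."""
    return NormalDist().inv_cdf((1 + p) / 2)


def large_sample_interval(mean, sd, n, p):
    """Interval mean ± t * sd / sqrt(n) for a large sample."""
    half_width = t_for_probability(p) * sd / n ** 0.5
    return mean - half_width, mean + half_width


for p in (0.90, 0.95, 0.99):
    print(p, round(t_for_probability(p), 3))

# Hypothetical check of 100 vegetables: mean weight 250 g, SD 40 g
lo, hi = large_sample_interval(250, 40, 100, 0.95)
print(round(lo, 2), round(hi, 2))
```

The multipliers for 0.95 and 0.99 come out close to the 1.96 and 2.58 usually quoted for these probabilities.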

If the sample is not large enough, we cannot claim that the standard deviation σ of the general population equals the sample value S. In addition, in this case it is hard to verify that the sample is close to a normal distribution. We still use S instead of σ in the formula:

X_avg - t * S / sqrt(n) < X_gen < X_avg + t * S / sqrt(n)

but the value of t for a fixed probability P(t) now depends on the number of elements n in the sample. The larger n is, the closer the resulting confidence interval is to the one given by formula (1). The values of t in this case are taken from another table (Student's t-distribution), which we provide below:

Values of Student's t for probabilities 0.95 and 0.99


Example 3. Thirty people were randomly selected from a firm's employees. In the sample, the average monthly salary turned out to be 30 thousand rubles, with a standard deviation of 5 thousand rubles. Determine the average salary in the firm with a probability of 0.99.

Solution: By the conditions of the problem, n = 30, X_avg = 30,000, S = 5000, P = 0.99. To find the confidence interval, we use the formula corresponding to Student's distribution. From the table, for n = 30 and P = 0.99 we find t = 2.756, therefore

X_gen = 30,000 ± 2.756 * 5000 / sqrt(30) ≈ 30,000 ± 2516,

i.e. the required confidence interval is 27484 < X_gen < 32516.

So, with a probability of 0.99, it can be argued that the interval (27484; 32516) contains the average salary in the firm.
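The arithmetic of this example can be checked with a short sketch. The function name `student_interval` is hypothetical, and the critical value t is passed in as read from the table, since the Python standard library has no Student's distribution.

```python
from math import sqrt


def student_interval(mean, s, n, t):
    """Interval mean ± t * s / sqrt(n), with t taken from a Student's table."""
    half_width = t * s / sqrt(n)
    return mean - half_width, mean + half_width


# Values from the example: n = 30, mean 30,000, S = 5000, t = 2.756
low, high = student_interval(30000, 5000, 30, 2.756)
print(round(low), round(high))  # prints: 27484 32516
```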

We hope that you will use this method, and you do not need to carry a table with you every time. The calculations can be done automatically in Excel. In an Excel file, click the fx button on the top menu. Then select the "Statistical" category among the functions and, from the list offered in the window, TINV (СТЬЮДРАСПОБР in the Russian-language version). Then, following the prompt, place the cursor in the "probability" field and type the complementary probability (i.e., in our case, instead of the probability 0.95 you must type 0.05). Apparently, the spreadsheet is designed so that the result answers the question of how likely we are to be wrong. Likewise, in the "degrees of freedom" field, enter the value (n-1) for your sample.

The concept of a confidence interval comes from statistics. It is a range that serves to estimate an unknown parameter with a high degree of reliability. The easiest way to explain it is with an example.

Suppose you want to investigate some random variable, for example, the time a server takes to respond to a client request. Each time a user types the address of a particular site, the server responds with a different speed, so the response time under study is random. The confidence interval allows us to determine the boundaries of this parameter, and we can then state that with a probability of 95% the server's mean response time lies in the range we calculated.

Or suppose you need to find out how many people know a company's brand. Once the confidence interval is calculated, it will be possible to say, for example, that with 95% probability the share of consumers who know the brand is between 27% and 34%.

Closely related to this term is the confidence level. It is the probability that the desired parameter falls within the confidence interval. The size of our range depends on this value: the greater the value it takes, the wider the confidence interval becomes, and vice versa. It is usually set at 90%, 95% or 99%. The 95% value is the most popular.

This indicator is also affected by the variance of the observations, and its definition rests on the assumption that the investigated feature obeys the normal distribution law, also known as Gauss's law: a distribution of the probabilities of a continuous random variable that is described by the Gaussian probability density. If the assumption of a normal distribution turns out to be wrong, the estimate may be incorrect.

First, let us figure out how to calculate the confidence interval for the mathematical expectation. Two cases are possible here: the variance (the degree of dispersion of the random variable) may be known or unknown. If it is known, the confidence interval is calculated by the following formula:

x_avg - t * σ / sqrt(n) <= α <= x_avg + t * σ / sqrt(n), where

x_avg is the sample mean,

α is the estimated feature,

t is a parameter taken from the table of the Laplace function,

σ is the square root of the variance.

If the variance is unknown, it can be calculated provided we know all the values of the feature. The following formula is used:

σ^2 = mean(x^2) - (x_avg)^2, where

mean(x^2) is the mean of the squares of the feature values,

(x_avg)^2 is the square of the mean value of the feature.
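The variance formula above (mean of the squares minus the square of the mean) can be sketched as follows. The function name and the data are hypothetical. Note that this form divides by n, so it gives the variance of the observed values themselves, not the unbiased n - 1 estimate.

```python
def variance_from_moments(values):
    """Variance as the mean of squares minus the square of the mean."""
    n = len(values)
    mean = sum(values) / n
    mean_of_squares = sum(v * v for v in values) / n
    return mean_of_squares - mean ** 2


# Hypothetical measurements
values = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance_from_moments(values))  # prints: 4.0
```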

The formula by which the confidence interval is calculated changes slightly in this case:

x_avg - t * s / sqrt(n) <= α <= x_avg + t * s / sqrt(n), where

x_avg is the sample mean,

α is the estimated feature,

t is a parameter found from the table of Student's distribution, t = t(γ; n-1),

sqrt(n) is the square root of the sample size,

s is the square root of the sample variance.

Consider an example. Suppose that from 7 measurements the mean of the investigated characteristic was found to be 30 and the sample variance 36. We need to find, with a probability of 99%, the confidence interval that contains the true value of the measured parameter.

First, let us determine t: t = t(0.99; 7-1) = 3.71. The sample standard deviation is s = sqrt(36) = 6. Using the formula above, we get:

x_avg - t * s / sqrt(n) <= α <= x_avg + t * s / sqrt(n)

30 - 3.71 * 6 / sqrt(7) <= α <= 30 + 3.71 * 6 / sqrt(7)

21.587 <= α <= 38.413

The confidence interval for the variance is calculated both when the mean is known and when there are no data on the mathematical expectation and only the unbiased point estimate of the variance is known. We do not give the formulas here, since they are rather complex and can always be found online if desired.

We only note that it is convenient to determine the confidence interval using Excel or an online service of the same name.

All sample indicators are estimates of their theoretical analogs, which could have been obtained if the general population, rather than a sample, were available. But alas, surveying the general population is very expensive and often impossible.

The concept of interval estimation

Any sample estimate has some scatter, since it is a random variable that depends on the values in a particular sample. Therefore, for more reliable statistical conclusions, one should know not only the point estimate but also an interval that covers the estimated indicator θ (theta) with a high probability γ (gamma).

Formally, these are two values (statistics) T1(X) and T2(X), with T1 < T2, for which at a given probability level γ the following condition is met:

P(T1(X) < θ < T2(X)) >= γ

In short, with probability γ or more, the true value lies between the points T1(X) and T2(X), which are called the lower and upper bounds of the confidence interval.

One of the conditions for constructing a confidence interval is that it be as narrow as possible, i.e. as short as possible. This wish is quite natural, because the researcher tries to localize the desired parameter as precisely as possible.

It follows that the confidence interval should cover the region where the distribution's probability is highest, with the estimate itself at the center.

That is, the probability that the true indicator deviates upward from the estimate equals the probability that it deviates downward. It should also be noted that for asymmetric distributions the part of the interval to the right of the estimate is not equal to the part to the left.

The figure above clearly shows that the greater the confidence level, the wider the interval - a direct relationship.

This was a small introduction to the theory of interval estimation of unknown parameters. Let's move on to finding the confidence bounds for the mathematical expectation.

Confidence interval for expected value

If the original data are normally distributed, then the mean is also a normally distributed variable. This follows from the rule that a linear combination of normal variables is also normally distributed. Therefore, to calculate the probabilities, we could use the mathematical apparatus of the normal distribution law.

However, this requires knowing two parameters, the expectation and the variance, which are usually unknown. You can, of course, use estimates instead of the parameters (the arithmetic mean and the sample variance), but then the distribution of the mean will not be exactly normal; it will be slightly flattened. This fact was cleverly noted by William Gosset, then working in Ireland, who published his discovery in the March 1908 issue of Biometrika. For reasons of secrecy, Gosset signed himself as Student. This is how Student's t-distribution appeared.

However, the normal distribution of data, which K. Gauss used in analyzing errors in astronomical observations, is extremely rare in earthly life, and it is rather difficult to establish (high accuracy requires about 2 thousand observations). Therefore, it is best to drop the normality assumption and use methods that do not depend on the distribution of the original data.

The question arises: what is the distribution of the arithmetic mean if it is calculated from data with an unknown distribution? The answer is given by the central limit theorem (CLT), well known in probability theory. There are several versions of it in mathematics (the formulations have been refined over the years), but all of them, roughly speaking, boil down to the statement that the sum of a large number of independent random variables obeys the normal distribution law.

Calculating the arithmetic mean involves a sum of random variables. Hence the arithmetic mean has an approximately normal distribution whose expectation is the expectation of the original data and whose variance is σ^2/n, where σ^2 is the variance of the original data and n is the sample size.

Smart people know how to prove the CLT, but we will convince ourselves of it by an experiment in Excel. Let us simulate a sample of 50 uniformly distributed random values (using the Excel function RANDBETWEEN). Then we make 1000 such samples and calculate the arithmetic mean of each. Let us look at their distribution.

It can be seen that the distribution of the mean is close to the normal law. If the sample size and the number of samples are made even larger, the similarity becomes even closer.
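The same experiment can be sketched outside Excel. The code below is an illustration, not the author's worksheet: the range 1 to 100 for the uniform integers, the seed, and the function name are assumptions, and `random.Random.randint` plays the role of RANDBETWEEN.

```python
import random
import statistics


def simulate_sample_means(n_samples=1000, sample_size=50, low=1, high=100, seed=0):
    """Draw n_samples samples of uniform integers and return their means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.mean([rng.randint(low, high) for _ in range(sample_size)])
        for _ in range(n_samples)
    ]


means = simulate_sample_means()
center = statistics.mean(means)
spread = statistics.stdev(means)
# For a normal law about 95% of the values lie within two standard deviations
share_within_2_sd = sum(abs(m - center) <= 2 * spread for m in means) / len(means)
print(round(center, 2), round(spread, 2), round(share_within_2_sd, 3))
```

For uniform integers from 1 to 100 the theoretical mean is 50.5 and the standard deviation of a mean of 50 values is about 4.08, so the printed center and spread should be close to those figures.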

Now that we have seen the validity of the CLT for ourselves, we can use the normal distribution to calculate confidence intervals for the arithmetic mean that cover the true mean, or mathematical expectation, with a given probability.

To establish the upper and lower bounds, you need to know the parameters of the normal distribution. As a rule, they are not available, so estimates are used: the arithmetic mean and the sample variance. Again, this method gives a good approximation only for large samples. When samples are small, it is often recommended to use Student's t-distribution. Do not believe it! Student's distribution for the mean arises only when the original data are normally distributed, that is, almost never. Therefore, it is better to set a minimum bar for the amount of data required right away and to use asymptotically correct methods. They say that 30 observations are enough. Take 50 and you can't go wrong.

T1,2 = x_avg ± c_γ * s0 / sqrt(n), where

T1,2 are the lower and upper bounds of the confidence interval,

x_avg is the sample arithmetic mean,

s0 is the sample standard deviation (unbiased),

n is the sample size,

γ is the confidence level (usually 0.9, 0.95 or 0.99),

c_γ = Φ^-1((1 + γ) / 2) is the value of the inverse standard normal distribution function. In simple terms, it is the number of standard errors from the arithmetic mean to the lower or upper bound (the three probabilities indicated correspond to the values 1.64, 1.96 and 2.58).

The essence of the formula is that the arithmetic mean is taken and a certain number (c_γ) of standard errors (s0 / sqrt(n)) is laid off from it in each direction. Everything is known; just take it and calculate.

Before personal computers came into mass use, special tables were used to obtain the values of the normal distribution function and its inverse. They are still used today, but it is more efficient to turn to ready-made Excel formulas. All elements of the formula above can easily be calculated in Excel. But there is also a ready-made function for calculating the confidence interval, CONFIDENCE.NORM. Its syntax is as follows.

CONFIDENCE.NORM(alpha; standard_dev; size)

alpha is the significance level, which in the notation above equals 1-γ, i.e. the probability that the mathematical expectation lies outside the confidence interval. At a confidence level of 0.95, alpha is 0.05, and so on.

standard_dev is the standard deviation of the sample data. You do not need to calculate the standard error; Excel divides by the square root of n itself.

size is the sample size (n).

The result of the CONFIDENCE.NORM function is the second term of the formula for the confidence interval, i.e. the half-width of the interval. Accordingly, the lower and upper bounds are the mean ± the value obtained.
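For readers working outside Excel, the same half-width can be sketched in Python. The function `confidence_norm` below is a hypothetical stand-in that mirrors the three arguments described above; it is not an Excel API, and the input numbers are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def confidence_norm(alpha, standard_dev, size):
    """Half-width of the normal-based interval: z * standard_dev / sqrt(size)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    return z * standard_dev / sqrt(size)


# Hypothetical inputs: alpha = 0.05, standard deviation 2.5, sample of 50
half_width = confidence_norm(0.05, 2.5, 50)
sample_mean = 30.0  # hypothetical sample mean
lower, upper = sample_mean - half_width, sample_mean + half_width
print(round(half_width, 4), round(lower, 4), round(upper, 4))
```

As in the spreadsheet function, the caller passes the standard deviation itself, and the division by the square root of n happens inside.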

Thus, it is possible to build a universal algorithm for calculating the confidence intervals for the arithmetic mean, which does not depend on the distribution of the initial data. The price for universality is its asymptoticity, i.e. the need to use relatively large samples. However, in the age of modern technology, collecting the right amount of data is usually not difficult.

Testing Statistical Hypotheses Using Confidence Intervals


One of the main tasks solved in statistics is hypothesis testing. Its essence, briefly, is as follows. An assumption is made, for example, that the expectation of the general population equals some value. Then the distribution of sample means that could be observed under this expectation is constructed. Next, one looks at where the actual mean falls in this conditional distribution. If it goes beyond the permissible limits, then such a mean is very unlikely to occur, and in a single run of the experiment almost impossible, which contradicts the hypothesis put forward, and the hypothesis is rejected. If the mean does not go beyond the critical level, the hypothesis is not rejected (but not proven either!).

So, confidence intervals, in our case for the expectation, can also be used to test some hypotheses. It is very easy to do. Suppose the arithmetic mean of a certain sample is 100, and we test the hypothesis that the expectation is, say, 90. Put primitively, the question is: could it be that, with a true expectation of 90, the observed mean turned out to be 100?

To answer this question, you additionally need the standard deviation and the sample size. Let us say the standard deviation is 30 and the number of observations is 64 (so that the root is easy to extract). Then the standard error of the mean is 30/8, or 3.75. To calculate the 95% confidence interval, you lay off two standard errors (more precisely, 1.96) on each side of the mean. The confidence interval is approximately 100 ± 7.5, or from 92.5 to 107.5.

The reasoning then goes as follows. If the tested value falls within the confidence interval, it does not contradict the observed data, since it fits within the limits of random fluctuations (with a probability of 95%). If the tested point lies outside the confidence interval, the probability of such an event is very small, at any rate below the acceptable level. Hence the hypothesis is rejected as contradicting the observed data. In our case, the hypothesized expectation lies outside the confidence interval (the tested value of 90 is not included in the interval 100 ± 7.5), so the hypothesis should be rejected. Answering the primitive question above, one should say: no, it cannot; at any rate, this happens extremely rarely. In practice, the specific probability of erroneously rejecting the hypothesis (the p-value) is often reported rather than the preset level at which the confidence interval was built, but more on that another time.
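The decision rule can be sketched in code. The function name `rejects_by_interval` is hypothetical, and the exact multiplier 1.96 is used instead of the rounded "two standard errors", so the bounds come out slightly narrower than 100 ± 7.5.

```python
from math import sqrt
from statistics import NormalDist


def rejects_by_interval(sample_mean, sd, n, hypothesized, gamma=0.95):
    """Reject the hypothesized expectation if it lies outside the interval."""
    z = NormalDist().inv_cdf((1 + gamma) / 2)
    half_width = z * sd / sqrt(n)
    low, high = sample_mean - half_width, sample_mean + half_width
    return not (low <= hypothesized <= high), (low, high)


# Numbers from the text: mean 100, standard deviation 30, n = 64, hypothesis 90
rejected, (low, high) = rejects_by_interval(100, 30, 64, 90)
print(rejected, round(low, 2), round(high, 2))
```

With these inputs the value 90 lies below the lower bound of about 92.65, so the hypothesis is rejected; a hypothesized value of 95 would not be.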

As you can see, constructing a confidence interval for the mean (or mathematical expectation) is not difficult. The main thing is to grasp the essence, and the rest will follow. In practice, a 95% confidence interval is used in most cases; it extends approximately two standard errors on either side of the mean.

That's all for now. All the best!

There are two types of estimates in statistics: point and interval. A point estimate is a single sample statistic used to estimate a parameter of the general population. For example, the sample mean is a point estimate of the mathematical expectation of the general population, and the sample variance S^2 is a point estimate of the variance σ^2 of the general population. The sample mean is an unbiased estimate of the mathematical expectation of the general population: it is called unbiased because the mean of all sample means (for the same sample size n) equals the mathematical expectation of the general population.

For the sample variance S^2 to be an unbiased estimate of the population variance σ^2, the denominator of the sample variance must be set to n - 1 rather than n. In other words, the variance of the general population is the mean of all possible sample variances.

When estimating the parameters of the general population, keep in mind that sample statistics such as the sample mean depend on the specific sample. To take this into account, the distribution of sample means is analyzed in order to obtain an interval estimate of the mathematical expectation of the general population. The constructed interval is characterized by a certain confidence level, which is the probability that the interval contains the true parameter of the general population. Similar confidence intervals can be used to estimate the proportion p of a feature and the main distributed mass of the general population.


Constructing the confidence interval for the mathematical expectation of the general population with a known standard deviation

Construction of a confidence interval for the share of a feature in the general population

In this section, the concept of a confidence interval is extended to categorical data. This makes it possible to estimate the proportion p of a feature in the general population from the sample proportion p_S = X/n. If the quantities np and n(1 - p) exceed 5, the binomial distribution can be approximated by the normal one. Therefore, to estimate the proportion p of a feature in the general population, an interval can be constructed whose confidence level is (1 - α) x 100%:

p_S - Z * sqrt(p_S * (1 - p_S) / n) <= p <= p_S + Z * sqrt(p_S * (1 - p_S) / n)

where p_S is the sample proportion of the feature, equal to X/n, i.e. the number of successes divided by the sample size; p is the proportion of the feature in the general population; Z is the critical value of the standardized normal distribution; n is the sample size.

Example 3. Suppose a sample of 100 invoices completed during the last month is retrieved from the information system, and 10 of these invoices contain errors. Thus, p_S = 10/100 = 0.1. The 95% confidence level corresponds to the critical value Z = 1.96:

0.1 ± 1.96 * sqrt(0.1 * 0.9 / 100) = 0.1 ± 0.0588

Thus, with a probability of 95%, between 4.12% and 15.88% of the invoices contain errors.
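The invoice example can be reproduced with a short sketch based on the normal approximation described in this section. The function name `proportion_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(successes, n, confidence=0.95):
    """Normal-approximation interval for a population proportion."""
    p_s = successes / n  # sample proportion
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    half_width = z * sqrt(p_s * (1 - p_s) / n)
    return p_s - half_width, p_s + half_width


# Values from the example: 10 erroneous invoices out of 100
low, high = proportion_interval(10, 100)
print(f"{100 * low:.2f}% to {100 * high:.2f}%")  # prints: 4.12% to 15.88%
```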

For a given sample size, the confidence interval for the proportion of a feature in the general population appears wider than for a continuous random variable. This is because measurements of a continuous random variable carry more information than measurements of categorical data. In other words, categorical data that take only two values carry too little information for estimating the parameters of their distribution.

Calculating estimates derived from a finite population

Estimation of the mathematical expectation. The finite population correction factor (fpc) is used to reduce the standard error by a factor of sqrt((N - n) / (N - 1)), where N is the population size and n is the sample size. When calculating confidence intervals for estimates of population parameters, the correction factor is applied when samples are drawn without replacement. Thus, the confidence interval for the mathematical expectation μ with a confidence level of (1 - α) x 100% is calculated by the formula:

X_avg - t_(n-1) * S / sqrt(n) * sqrt((N - n) / (N - 1)) <= μ <= X_avg + t_(n-1) * S / sqrt(n) * sqrt((N - n) / (N - 1))    (6)

Example 4. To illustrate the use of the finite population correction factor, let us return to the problem of calculating the confidence interval for the average invoice amount discussed above in Example 3. Suppose a company issues 5,000 invoices per month, and X_avg = $110.27, S = $28.95, N = 5000, n = 100, α = 0.05, t_99 = 1.9842. By formula (6) we get:

110.27 ± 1.9842 * 28.95 / sqrt(100) * sqrt((5000 - 100) / (5000 - 1)) = 110.27 ± 5.69, i.e. 104.58 <= μ <= 115.96
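A minimal sketch of this calculation follows, assuming the usual form of the correction factor, sqrt((N - n) / (N - 1)). The function name `fpc_mean_interval` is hypothetical, and the critical value t is passed in as given in the example.

```python
from math import sqrt


def fpc_mean_interval(mean, s, n, population_size, t):
    """Interval for the mean with the finite population correction applied."""
    fpc = sqrt((population_size - n) / (population_size - 1))
    half_width = t * s / sqrt(n) * fpc
    return mean - half_width, mean + half_width


# Values from Example 4
low, high = fpc_mean_interval(110.27, 28.95, 100, 5000, 1.9842)
print(round(low, 2), round(high, 2))  # prints: 104.58 115.96
```

With n = 100 out of N = 5000 the correction factor is about 0.99, so it narrows the interval only slightly; the effect grows as the sample becomes a larger share of the population.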

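Estimation of the proportion of a feature. When sampling without replacement, the confidence interval for the proportion of a feature with a confidence level of (1 - α) x 100% is calculated by the formula:

p_S - Z * sqrt(p_S * (1 - p_S) / n) * sqrt((N - n) / (N - 1)) <= p <= p_S + Z * sqrt(p_S * (1 - p_S) / n) * sqrt((N - n) / (N - 1))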

Confidence Intervals and Ethical Issues

Ethical problems often arise when sampling the population and formulating statistical conclusions. The main one is how confidence intervals and point estimates of sample statistics agree. Publication of point estimates without appropriate confidence intervals (usually 95% confidence levels) and sample sizes from which they are derived can be misleading. This can give the user the impression that the point estimate is exactly what he needs to predict the properties of the entire population. Thus, it is necessary to understand that in any research, interval estimates should be placed at the forefront. In addition, special attention should be paid to the correct selection of sample sizes.

The objects of statistical manipulation are most often the results of opinion polls on various political issues. The poll results are put on the front pages of newspapers, while the sampling error and the methodology of the statistical analysis are printed somewhere in the middle. To demonstrate the validity of the point estimates obtained, it is necessary to state the sample size on which they are based, the bounds of the confidence interval and its confidence level.


Based on materials from the book: Levine et al. Statistics for Managers. Moscow: Williams, 2004, pp. 448-462.

The central limit theorem states that, for a sufficiently large sample size, the sampling distribution of the mean can be approximated by a normal distribution. This property does not depend on the type of distribution of the general population.