Assessment of the statistical reliability of regression modeling results using Fisher's F-criterion. Average approximation error

The average approximation error is the average deviation of the calculated values from the actual ones:

Ā = (1/n) · Σ |(y - ŷ_x) / y| · 100%,

where ŷ_x is the value calculated from the regression equation.

An average approximation error of up to 15% indicates a well-chosen regression model.

For seven territories of the Ural region in 199X, the values of two features are known.

Required:
1. To characterize the dependence of y on x, calculate the parameters of the following functions:
a) linear;
b) power;
c) exponential;
d) equilateral hyperbola (you also need to figure out how to pre-linearize this model).
2. Evaluate each model using the mean approximation error Ā and Fisher's F-test.

We carry out the solution using an online calculator for the linear regression equation.
a) linear regression equation;
Using the graphical method.
This method is used to visualize the form of the relationship between the economic indicators under study. To do this, a graph is plotted in a rectangular coordinate system: the individual values of the resultant feature Y are plotted along the ordinate axis, and the individual values of the factor feature X along the abscissa axis.
The set of points of the resultant and factor features is called the correlation field.


Based on the correlation field, it is possible to put forward a hypothesis (for the general population) that the relationship between all possible values ​​of X and Y is linear.
The linear regression equation is y = bx + a + ε
Here ε is a random error (deviation, perturbation).
Reasons for the existence of a random error:
1. Not including significant explanatory variables in the regression model;
2. Aggregation of variables. For example, the total consumption function is an attempt to express in general form the aggregate of individual consumers' spending decisions; it is only an approximation of individual relationships that have different parameters.
3. Incorrect description of the structure of the model;
4. Wrong functional specification;
5. Measurement errors.
Since the deviations ε_i for each specific observation i are random and their values in the sample are unknown:
1) from the observations x_i and y_i only estimates of the parameters α and β can be obtained;
2) these estimates, the values a and b respectively, are themselves random in nature, since they correspond to a random sample.
The estimated regression equation (constructed from the sample data) then has the form y = bx + a + e, where e_i are the observed values (estimates) of the errors ε_i, and a and b are, respectively, the estimates of the parameters α and β of the regression model that are to be found.
To estimate the parameters α and β, the method of least squares (OLS) is used.




We get b = -0.35, a = 76.88
Regression equation:
y = -0.35 x + 76.88

x        y        x²          y²          x·y         ŷ(x)      (yᵢ-ȳ)²    (y-ŷ(x))²   |y-ŷ(x)| : y
45.1     68.8     2034.01     4733.44     3102.88     61.28     119.12     56.61       0.1094
59       61.2     3481        3745.44     3610.8      56.47     10.98      22.4        0.0773
57.2     59.9     3271.84     3588.01     3426.28     57.09     4.06       7.9         0.0469
61.8     56.7     3819.24     3214.89     3504.06     55.5      1.41       1.44        0.0212
58.8     55       3457.44     3025        3234        56.54     8.33       2.36        0.0279
47.2     54.3     2227.84     2948.49     2562.96     60.55     12.86      39.05       0.1151
55.2     49.3     3047.04     2430.49     2721.36     57.78     73.71      71.94       0.172
384.3    405.2    21338.41    23685.76    22162.34    405.2     230.47     201.71      0.5699

Note: the ŷ(x) values are found from the resulting regression equation:
ŷ(45.1) = -0.35 · 45.1 + 76.88 = 61.28
ŷ(59) = -0.35 · 59 + 76.88 = 56.47
and so on.
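As a cross-check of the hand calculation, here is a minimal sketch (assuming NumPy is available) that reproduces the least-squares estimates and the fitted values directly from the table data; the expected values in the comments are the ones obtained above.

```python
# Least-squares fit of y = b*x + a for the seven territories (data from the table).
import numpy as np

x = np.array([45.1, 59.0, 57.2, 61.8, 58.8, 47.2, 55.2])
y = np.array([68.8, 61.2, 59.9, 56.7, 55.0, 54.3, 49.3])

b = ((x * y).mean() - x.mean() * y.mean()) / ((x ** 2).mean() - x.mean() ** 2)
a = y.mean() - b * x.mean()
print(f"b = {b:.2f}, a = {a:.2f}")      # expected: b ≈ -0.35, a ≈ 76.88

y_hat = b * x + a
print(np.round(y_hat, 2))               # expected: 61.28, 56.47, 57.09, ...
```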

Approximation error
Let us estimate the quality of the regression equation using the average approximation error, the average deviation of the calculated values from the actual ones:

Ā = (0.5699 / 7) · 100% ≈ 8.1%

Since the error is less than 15%, this equation can be used as a regression model.

F-statistics. Fisher's criterion.
The significance of the regression equation as a whole is assessed as follows:
1. A null hypothesis is put forward that the equation as a whole is statistically insignificant: H0: R² = 0 at the significance level α.
2. The actual value of the F-criterion is determined:

F = R² / (1 - R²) · (n - 2)

For our data R² = 1 - 201.71 / 230.47 ≈ 0.125, so F ≈ 0.125 / 0.875 · 5 ≈ 0.71.
3. The table value is determined from the Fisher distribution tables for a given significance level, taking into account that the number of degrees of freedom for the total sum of squares (the larger variance) is 1 and the number of degrees of freedom for the residual sum of squares (the smaller variance) in linear regression is n - 2.
4. If the actual value of the F-criterion is less than the table value, there is no reason to reject the null hypothesis.
Otherwise, the null hypothesis is rejected and the alternative hypothesis about the statistical significance of the equation as a whole is accepted.

The table value of the criterion with degrees of freedom k1 = 1 and k2 = 5 is F_crit = 6.61.
Since the actual value F < F_crit, the coefficient of determination is statistically insignificant (the estimated regression equation is statistically unreliable).
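Continuing the sketch above, the average approximation error and Fisher's F-test can be reproduced as follows (SciPy is assumed to be available for the critical value of the F-distribution).

```python
# Average approximation error and F-test for the linear model (uses x, y, y_hat
# from the previous sketch).
from scipy.stats import f

A_bar = np.mean(np.abs((y - y_hat) / y)) * 100
print(f"Average approximation error: {A_bar:.1f}%")   # ≈ 8.1%, below the 15% threshold

n = len(y)
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
F_fact = r2 / (1 - r2) * (n - 2)
F_crit = f.ppf(0.95, 1, n - 2)                         # ≈ 6.61
print(f"R^2 = {r2:.3f}, F = {F_fact:.2f}, F_crit = {F_crit:.2f}")
# F < F_crit, so the equation as a whole is not statistically significant.
```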

b) power regression;
The solution is carried out using the Nonlinear Regression service. When choosing, specify Power: y = a·x^b.
c) exponential regression;
d) model of an equilateral hyperbola.
System of normal equations for the hyperbola y = a + b/x (with the substitution z = 1/x):

n·a + b·Σ(1/x) = Σy
a·Σ(1/x) + b·Σ(1/x²) = Σ(y/x)

For our data, the system of equations has the form
7a + 0.1291b = 405.2
0.1291a + 0.0024b = 7.51
From the first equation we express a and substitute it into the second equation
We get b = 1054.67, a = 38.44
Regression equation:
y = 1054.67 / x + 38.44
Approximation error.
Let us estimate the quality of the regression equation using the average approximation error.

Since the error is less than 15%, this equation can be used as a regression model.
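The pre-linearization mentioned in the task can be sketched as follows: the substitution z = 1/x turns the equilateral hyperbola y = a + b/x into a straight line in z, which is then fitted by ordinary least squares (same x, y arrays as in the first sketch; small differences from the figures above are due to rounding).

```python
# Fitting the equilateral hyperbola y = a + b/x via the substitution z = 1/x.
z = 1.0 / x
b_h = ((z * y).mean() - z.mean() * y.mean()) / ((z ** 2).mean() - z.mean() ** 2)
a_h = y.mean() - b_h * z.mean()
print(f"y = {b_h:.2f}/x + {a_h:.2f}")     # roughly y ≈ 1054/x + 38.4

y_hyp = a_h + b_h / x
A_bar_h = np.mean(np.abs((y - y_hyp) / y)) * 100
print(f"Average approximation error: {A_bar_h:.1f}%")
```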

Fisher's criterion.
The significance of the regression model is checked using Fisher's F-test, whose calculated value is found as the ratio of the variance of the original series of observations of the studied indicator to the unbiased estimate of the variance of the residual sequence for this model.
If the calculated value with k1 = m and k2 = n - m - 1 degrees of freedom is greater than the table value for a given significance level, the model is considered significant.

where m is the number of factors in the model.
The statistical significance of paired linear regression is assessed using the following algorithm:
1. A null hypothesis is put forward that the equation as a whole is statistically insignificant: H0: R² = 0 at the significance level α.
2. Next, the actual value of the F-criterion is determined:

F = R² / (1 - R²) · (n - m - 1) / m,

where m = 1 for paired regression.
The table value of the criterion with degrees of freedom k1 = 1 and k2 = 5 is F_crit = 6.61.
Since the actual value F < F_crit, the coefficient of determination is statistically insignificant (the estimated regression equation is statistically unreliable).

Among the various forecasting methods, approximation deserves special mention. With its help, you can make approximate calculations and compute planned indicators by replacing the original objects with simpler ones. Excel also provides tools for applying this method to forecasting and analysis. Let's see how this can be done with the program's built-in tools.

The name of the method comes from the Latin word proxima, "closest". Approximation is the simplification and smoothing of known indicators, arranging them into a trend, and it serves as the basis of forecasting. But this method can be used not only for forecasting, but also for studying existing results, since approximation is essentially a simplification of the initial data, and a simplified version is easier to study.

The main tool used for smoothing in Excel is the trend line. The idea is that, on the basis of already available indicators, the graph of the function is extended into future periods. The main purpose of the trend line, as you might guess, is to make forecasts or to identify a general tendency.

A trend line can be constructed using one of five types of approximation:

  • Linear;
  • Exponential;
  • Logarithmic;
  • Polynomial;
  • Power.

Let's consider each of the options in more detail separately.

Method 1: linear smoothing

First of all, let's look at the simplest type of approximation, the linear function. We will dwell on it in more detail, since here we will cover the general points characteristic of the other methods as well, namely plotting the graph and some other nuances, which we will not repeat when considering the subsequent options.

First, let's build the graph on which the smoothing procedure will be carried out. To build it, let's take a table listing, month by month, the unit cost of production at the enterprise and the corresponding profit for the period. The graphical function we will build shows the dependence of the increase in profit on the decrease in production costs.


The smoothing used in this case is described by the following formula:

y = ax + b

In our particular case, the formula takes the following form:

y = -0.1156x + 72.255

The approximation confidence value (R²) equals 0.9418, which is a fairly acceptable result and characterizes the smoothing as reliable.

Method 2: exponential approximation

Now let's look at the exponential type of approximation in Excel.


The general form of the smoothing function is:

y = a·e^(bx),

where e is the base of the natural logarithm.

In our particular case, the formula took the following form:

y = 6282.7·e^(-0.012x)

Method 3: logarithmic smoothing

Now it's time to consider the method of logarithmic approximation.


In general form, the smoothing formula looks like this:

y = a·ln(x) + b,

where ln is the natural logarithm; hence the name of the method.

In our case, the formula takes the following form:

y = -62.81·ln(x) + 404.96

Method 4: polynomial smoothing

Now it's time to consider the polynomial smoothing method.


The formula describing this type of smoothing took the following form:

y = 8E-08x^6 - 0.0003x^5 + 0.3725x^4 - 269.33x^3 + 109525x^2 - 2E+07x + 2E+09

Method 5: power-law smoothing

Finally, consider the power approximation method in Excel.


This method is used effectively when the function's data change intensively. It is important to note that this option is applicable only if the function and the argument do not take negative or zero values.

The general formula describing this method is as follows:

y = a·x^b

In our particular case, it looks like this:

y = 6E+18·x^(-6.512)

As you can see, for the specific data used in this example, the highest confidence level was shown by polynomial approximation with a sixth-degree polynomial (0.9844), and the lowest by the linear method (0.9418). But this does not at all mean that the same will hold for other examples: the effectiveness of the above methods can differ significantly depending on the specific type of function for which the trend line is drawn. So even if the chosen method is the most effective for this function, it does not follow that it will also be optimal in another situation.

If you still cannot immediately determine, based on the above recommendations, which type of approximation suits your particular case, it makes sense to try all the methods. After building a trend line and viewing its confidence level, you can choose the best option.
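For readers who want to reproduce such a comparison outside Excel, here is a sketch that fits the same five trendline types and compares their R² values; NumPy is assumed to be available, and the data series is made up purely for illustration, since the original cost/profit table is not reproduced here.

```python
# Fitting the five trendline types (linear, exponential, logarithmic, polynomial,
# power) and comparing R^2. The data below is illustrative only.
import numpy as np

x = np.arange(1.0, 13.0)                                   # e.g. months 1..12
y = np.array([72, 70, 69, 66, 65, 63, 62, 60, 59, 57, 56, 55], dtype=float)

def r2(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

fits = {}
b1, b0 = np.polyfit(x, y, 1);           fits["linear"]      = b1 * x + b0
c1, c0 = np.polyfit(x, np.log(y), 1);   fits["exponential"] = np.exp(c0) * np.exp(c1 * x)
d1, d0 = np.polyfit(np.log(x), y, 1);   fits["logarithmic"] = d1 * np.log(x) + d0
p      = np.polyfit(x, y, 6);           fits["polynomial"]  = np.polyval(p, x)
e1, e0 = np.polyfit(np.log(x), np.log(y), 1)
fits["power"] = np.exp(e0) * x ** e1

for name, y_hat in fits.items():
    print(f"{name:12s} R^2 = {r2(y, y_hat):.4f}")
```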

For the territories of the region, data are given for the year 200X.

Region number; x - average per capita subsistence minimum per day of one able-bodied person, rubles; y - average daily wages, rubles:

No.   x     y
1     78    133
2     82    148
3     87    134
4     79    154
5     89    162
6     106   195
7     67    139
8     88    158
9     73    152
10    87    162
11    76    159
12    115   173

Exercise:

1. Build a correlation field and formulate a hypothesis about the form of the relationship.

2. Calculate the parameters of the linear regression equation

4. Using the average (general) coefficient of elasticity, give a comparative assessment of the strength of the relationship between the factor and the result.

7. Calculate the predicted value of the result if the predicted value of the factor increases by 10% from its average level. Determine the confidence interval of the forecast for a given significance level.

Solution:

Let's solve this problem using Excel.

1. Comparing the available data on x and y, for example by ranking them in ascending order of the factor x, one can observe a direct relationship between the features: as the average per capita subsistence minimum increases, so does the average daily wage. Based on this, we can assume that the relationship between the features is direct and can be described by the equation of a straight line. The same conclusion is confirmed by graphical analysis.

To build a correlation field, you can use Excel. Enter the initial data in sequence: first x, then y.

Select the area of ​​cells containing data.

Then choose: Insert / Scatter chart / Scatter with markers as shown in Figure 1.

Figure 1 Plotting the correlation field

Analysis of the correlation field shows the presence of a dependence close to a straight line, since the points are located almost in a straight line.

2. To calculate the parameters of the linear regression equation
let's use the built-in statistical function LINEST.

For this:

1) Open an existing file containing the analyzed data;
2) Select a blank 5 × 2 cell area (5 rows, 2 columns) to display the regression statistics results.
3) Activate Function wizard: in the main menu select Formulas / Insert Function.
4) In the Category window select Statistical, and in the Function window select LINEST. Click OK as shown in Figure 2;

Figure 2 Function Wizard Dialog Box

5) Fill in the function arguments:

Known_y's - the range containing the values of the resultant attribute y;

Known_x's - the range containing the values of the factor attribute x;

Const - a boolean value indicating the presence or absence of an intercept (free term) in the equation; if Const = 1, the free term is calculated in the usual way; if Const = 0, the free term is taken equal to 0;

Stats - a boolean value indicating whether to display additional regression statistics; if Stats = 1, the additional information is displayed; if Stats = 0, only the estimates of the equation parameters are displayed.

Click OK;

Figure 3 LINEST function arguments dialog box

6) The first element of the final table will appear in the upper left cell of the selected area. To expand the entire table, press F2 and then the key combination CTRL + SHIFT + ENTER (LINEST is entered as an array formula).

Additional regression statistics will be displayed in the order shown in the following scheme:

Value of the coefficient b           Value of the coefficient a
Standard error of b                  Standard error of a
Coefficient of determination R²      Standard error of y
F-statistic                          Number of degrees of freedom
Regression sum of squares            Residual sum of squares

Figure 4 The result of calculating the LINEST function

We got the regression equation:

ŷ = 0.92x + 76.98

We conclude: with an increase in the average per capita subsistence minimum by 1 ruble, the average daily wage increases by 0.92 rubles on average.

The coefficient of determination R² = 0.52. This means that 52% of the variation in wages (y) is explained by the variation of the factor x, the average per capita subsistence minimum, and 48% by the action of other factors not included in the model.

The calculated coefficient of determination can be used to find the correlation coefficient: r_xy = √R² = √0.52 ≈ 0.72.

The connection is assessed as close.

4. Using the average (general) coefficient of elasticity, we determine the strength of the factor's influence on the result.

For the equation of a straight line, the average (general) coefficient of elasticity is determined by the formula:

Ē = b · x̄ / ȳ
Find the average values: select the area of cells with the x values and choose Formulas / AutoSum / Average; do the same for the y values.

Figure 5 Calculation of the mean values ​​of the function and the argument

Thus, if the average per capita subsistence minimum changes by 1% from its average value, the average daily wage will change on average by 0.51%.
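A short sketch (NumPy assumed) reproduces the regression coefficients, the coefficient of determination and the elasticity directly from the 12-region table above.

```python
# Paired linear regression and elasticity for the 12 regions.
import numpy as np

x = np.array([78, 82, 87, 79, 89, 106, 67, 88, 73, 87, 76, 115], dtype=float)
y = np.array([133, 148, 134, 154, 162, 195, 139, 158, 152, 162, 159, 173], dtype=float)

b = np.cov(x, y, bias=True)[0, 1] / np.var(x)
a = y.mean() - b * x.mean()
r2 = np.corrcoef(x, y)[0, 1] ** 2
E = b * x.mean() / y.mean()
print(f"y = {b:.2f}x + {a:.2f}, R^2 = {r2:.2f}, elasticity = {E:.2f}")
# expected: b ≈ 0.92, a ≈ 77, R^2 ≈ 0.52, elasticity ≈ 0.51
```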

Using the Regression data analysis tool you can obtain:
- regression statistics,
- analysis of variance results,
- confidence intervals,
- residuals and regression line fit plots,
- residuals and normal probability plots.

The procedure is as follows:

1) Check access to the Analysis ToolPak (analysis package). In the main menu, select in sequence: File / Options / Add-ins.

2) In the Manage drop-down list, select Excel Add-ins and press Go.

3) In the Add-ins window, check the Analysis ToolPak box and press OK.

If Analysis ToolPak is not in the Available add-ins list, press Browse to locate it.

If a message appears stating that the analysis package is not installed on your computer, click Yes to install it.

4) In the main menu, sequentially select: Data / Data Analysis / Analysis Tools / Regression and then press OK.

5) Fill in the data entry and output parameters dialog box:

Input Y Range - the range containing the data of the resultant attribute;

Input X Range - the range containing the data of the factor attribute;

Labels - a flag indicating whether the first row contains column names;

Constant is Zero - a flag indicating the presence or absence of an intercept in the equation;

Output Range - it is enough to indicate the upper left cell of the future range;

6) New Worksheet Ply - you can set an arbitrary name for the new sheet.

Then press the button OK.

Figure 6 Dialog box for entering parameters of the Regression tool

The results of the regression analysis for the task data are presented in Figure 7.

Figure 7 Result of applying the regression tool

5. Let us estimate the quality of the equation using the average approximation error, using the results of the regression analysis presented in Figure 8.

Figure 8 Result of using the "Residual output" regression tool

Let's compose a new table as shown in Figure 9. In column C we calculate the relative approximation error by the formula:

A_i = |y_i - ŷ_i| / y_i · 100%
Figure 9 Calculation of the average approximation error

The average approximation error is calculated by the formula:

Ā = (1/n) · Σ A_i

The quality of the constructed model is assessed as good, since the error does not exceed 8-10%.

6. From the table with regression statistics (Figure 4), we write out the actual value of Fisher's F-test: F_fact = R² / (1 - R²) · (n - 2) = 0.52 / 0.48 · 10 ≈ 10.8.

Since F_fact > F_table = 4.96 at the 5% significance level, it can be concluded that the regression equation is significant (the relationship is proven).

8. The statistical significance of the regression parameters is assessed using Student's t-statistics and by calculating the confidence interval for each of the indicators.

We put forward the hypothesis H0 that the indicators differ from zero only insignificantly: a = b = r_xy = 0.

The table value of the t-criterion for the number of degrees of freedom df = n - 2 = 10 and significance level α = 0.05 is t_table ≈ 2.23.

Figure 7 shows the actual values ​​of the t-statistic:

The t-test for the correlation coefficient can be calculated in two ways:

Method I:

t_r = r_xy / m_r,

where m_r = √((1 - r²_xy) / (n - 2)) is the random error of the correlation coefficient.

We take the data for the calculation from the table in Figure 7.

Method II: t_r = √F, where F is Fisher's F-statistic of the equation.

The actual values of the t-statistics exceed the table value:

Therefore, the hypothesis H 0 is rejected, that is, the regression parameters and the correlation coefficient are not randomly different from zero, but statistically significant.

The confidence interval for the parameter a is defined as a ± t_table · m_a.

For the parameter a, the 95% bounds, as shown in Figure 7, were:

The confidence interval for the regression coefficient b is defined as b ± t_table · m_b.

For the regression coefficient b, the 95% bounds, as shown in Figure 7, were:

Analysis of the upper and lower bounds of the confidence intervals leads to the conclusion that, with probability 0.95, the parameters a and b, remaining within the indicated bounds, do not take zero values, i.e. they are not statistically insignificant and differ materially from zero.

7. The obtained estimates of the regression equation allow it to be used for forecasting. Let the forecast value of the subsistence minimum be 10% above its average level: x_p = 1.1 · x̄ ≈ 94.1 rubles.

Then the predicted value of the average daily wage will be ŷ_p = 0.92 · 94.1 + 76.98 ≈ 163.6 rubles.

We calculate the forecast error using the formula

m_ŷp = s_res · √(1 + 1/n + (x_p - x̄)² / Σ(x - x̄)²),

where s_res is the standard error of the residuals per degree of freedom.

We also calculate the variance using Excel. For this:

1) Activate Function wizard: in the main menu select Formulas / Insert Function.

3) Fill in the range containing the numerical data of the factor attribute. Click on OK.

Figure 10 Calculation of variance

We obtained the variance value

To calculate the residual variance per degree of freedom, we use the ANOVA results as shown in Figure 7.

Confidence intervals for predicting individual values of y at x_p with a probability of 0.95 are determined by the expression ŷ_p ± t_table · m_ŷp.

The interval is fairly wide, primarily because of the small number of observations. On the whole, the forecast of the average daily wage proved to be reliable.
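Continuing the previous sketch, the point forecast and an approximate 95% prediction interval can be computed as follows (SciPy assumed for the Student t quantile; the formula is the standard one for forecasting an individual value).

```python
# Forecast for x_p = 1.1 * mean(x) with a 95% prediction interval (uses x, y, a, b
# from the previous sketch).
from scipy.stats import t

n = len(x)
y_fit = a + b * x
s_res = np.sqrt(np.sum((y - y_fit) ** 2) / (n - 2))        # residual standard error

x_p = 1.1 * x.mean()
y_p = a + b * x_p                                          # ≈ 163.6 rubles
m_yp = s_res * np.sqrt(1 + 1/n + (x_p - x.mean())**2 / np.sum((x - x.mean())**2))
t_crit = t.ppf(0.975, n - 2)                               # ≈ 2.23
print(f"x_p = {x_p:.1f}, forecast y_p = {y_p:.1f} ± {t_crit * m_yp:.1f}")
```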

The problem statement is taken from: Workshop on Econometrics: A textbook / I.I. Eliseeva, S.V. Kurysheva, N.M. Gordeenko et al.; ed. by I.I. Eliseeva. Moscow: Finance and Statistics, 2003. 192 p.: ill.

5. Using the F-criterion, it was found that the resulting paired regression equation as a whole is statistically insignificant and does not adequately describe the studied relationship between the monthly pension y and the subsistence minimum x.

6. An econometric model of multiple linear regression was formed, linking the net income of a conditional firm y with capital turnover x1 and capital employed x2.

7. By calculating the elasticity coefficients, it was shown that when capital turnover changes by 1%, the firm's net income changes by 0.0008%, and when the capital employed changes by 1%, the net income changes by 0.56%.

8. Using the t-test, the statistical significance of the regression coefficients was assessed. It was found that the explanatory variable x1 is statistically insignificant and can be excluded from the regression equation, while the explanatory variable x2 is statistically significant.

9. Using the F-criterion, it was found that the resulting regression equation as a whole is statistically significant and adequately describes the studied relationship between the net income of the conditional firm y and capital turnover x1 and capital employed x2.

10. The average error of approximation of the statistical data by the linear multiple regression equation was calculated; it amounted to 29.8%. It is shown which observation in the statistical base causes this error to exceed the permissible value.

14. Building a paired regression model without using EXCEL.

Using the statistical material given in Table 3.5, it is necessary to:

2. Assess the closeness of the relationship using the indicators of correlation and determination.

3. Using the coefficient of elasticity, determine the degree of connection between the factor attribute and the resultant one.

4. Determine the average approximation error.

5. Evaluate the statistical reliability of modeling using Fisher's F-test.

Table 3.5. Initial data.

Region (Kaluga, Kostroma, Orlovskaya, Ryazan, Smolensk); for each region two indicators are given: the share of cash income directed to increasing savings in deposits, loans, certificates and the purchase of foreign currency, in the total average per capita cash income, % (y), and the average monthly accrued wages, c.u. (x).

To determine the unknown parameters b0, b1 of the paired linear regression equation, we use the standard system of normal equations, which has the form

Σy = n·b0 + b1·Σx,
Σxy = b0·Σx + b1·Σx².   (3.7)

To solve this system, it is first necessary to determine the values of Σx² and Σxy. These values are determined from the table of initial data, supplementing it with the appropriate columns (Table 3.6).

Table 3.6. To the calculation of regression coefficients.

Then system (3.7) takes the form

Expressing b 0 from the first equation and substituting the resulting expression into the second equation, we get:

Performing term-by-term multiplication and expanding the parentheses, we get:

Finally, the paired linear regression equation linking the share of the population's cash income directed to increasing savings, y, with the average monthly accrued wages, x, has the form:

Now that the paired linear regression equation has been constructed, we determine the linear correlation coefficient from the dependence

r_xy = b · (σ_x / σ_y),   (3.9)

where σ_x and σ_y are the standard deviations of the corresponding variables.

To calculate the linear correlation coefficient according to dependence (3.9), we will perform intermediate calculations.

Substituting the values of the found parameters into expression (3.9), we obtain the value of the linear correlation coefficient.

The obtained value of the linear correlation coefficient indicates the presence of a weak inverse statistical relationship between the value of the share of monetary incomes of the population aimed at increasing savings y and the value of the average monthly accrued wages x.

The coefficient of determination is r² ≈ 0.096, which means that only 9.6% of the variation of y is explained by the regression on the explanatory variable. Accordingly, the value 1 - r², equal to 90.4%, characterizes the proportion of the variance of y caused by the influence of all the other explanatory variables not taken into account in the econometric model.

The coefficient of elasticity is

Consequently, when the average monthly accrued wages change by 1%, the share of the population's cash income directed to increasing savings changes by about 1% in the opposite direction: as wages grow, a decrease in this share is observed. This conclusion contradicts common sense and can be explained only by the incorrectness of the formed mathematical model.

Let's calculate the average approximation error.

Table 3.7. To the calculation of the average approximation error.

The resulting value exceeds (12...15)%, which indicates a significant average deviation of the calculated data from the actual data used to build the econometric model.

The reliability of the statistical modeling is assessed on the basis of Fisher's F-criterion. The theoretical (calculated) value of the criterion, F_calc, is determined from the ratio of the factor and residual variances calculated per one degree of freedom:

F_calc = [r² / (1 - r²)] · [(n - m - 1) / m],

where n is the number of observations;

m is the number of explanatory variables (for the considered example, m = 1).

The critical value F_crit is determined from statistical tables and for the significance level α = 0.05 equals 10.13. Since F_calc < F_crit, the null hypothesis cannot be rejected, and the obtained regression equation is deemed statistically insignificant.

15. Building a multiple regression model without using EXCEL.

Using the statistical material given in Table 3.8, it is necessary to:

1. Construct a linear multiple regression equation and explain the economic meaning of its parameters.

2. Give a comparative assessment of the closeness of the relationship between the factors and the resultant indicator using the average (general) coefficients of elasticity.

3. Assess the statistical significance of the regression coefficients using the t-test, and test the null hypothesis that the equation is not significant using the F-test.

4. Assess the quality of the equation by determining the average approximation error.

Table 3.8. Initial data.

For each firm, three indicators are given: net income, mln USD (y); capital turnover, mln USD (x1); capital employed, mln USD (x2).

To determine the unknown parameters b0, b1, b2 of the multiple linear regression equation, we use the standard system of normal equations, which has the form

Σy = n·b0 + b1·Σx1 + b2·Σx2,
Σx1y = b0·Σx1 + b1·Σx1² + b2·Σx1x2,
Σx2y = b0·Σx2 + b1·Σx1x2 + b2·Σx2².   (3.11)

To solve this system, it is first necessary to determine the values of Σx1², Σx2², Σx1y, Σx2y, Σx1x2. These values are determined from the source data table, supplementing it with the appropriate columns (Table 3.9).

Table 3.9. To the calculation of regression coefficients.

Then system (3.11) takes the form

To solve this system, we will use the Gauss method, which consists in the sequential elimination of unknowns: divide the first equation of the system by 10, then multiply the resulting equation by 370.6 and subtract it from the second equation of the system, then multiply the resulting equation by 158.20 and subtract it from the third equation of the system. Repeating this algorithm for the transformed second and third equations of the system, we get:



After the transformation, we have:

Then the final dependence of net income on capital turnover and capital employed, in the form of a linear multiple regression equation, looks as follows:

From the resulting econometric equation it can be seen that with an increase in capital employed, net income increases, while with an increase in capital turnover, net income decreases. In addition, the larger the absolute value of a regression coefficient, the more significant the influence of the corresponding explanatory variable on the dependent variable. In the example under consideration, the value of the regression coefficient b2 is greater than that of the coefficient b1; therefore, the capital employed has a much greater effect on net income than capital turnover. For a quantitative assessment of this conclusion, let us determine the partial coefficients of elasticity.

The analysis of the results obtained also shows that the capital used has a greater effect on net income. So, in particular, with an increase in the used capital by 1%, the net income increases by 1.17%. At the same time, with an increase in capital turnover by 1%, net income decreases by 0.5%.
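Since the numeric data of Table 3.8 are not reproduced here, the following sketch uses made-up illustrative values of x1, x2 and y just to show how the system of normal equations (3.11) is assembled and solved; a linear-algebra routine gives the same result as the manual Gauss elimination described above (NumPy assumed).

```python
# Solving the normal equations for y = b0 + b1*x1 + b2*x2. The data is made up
# purely for illustration; the document's own sums would replace these.
import numpy as np

x1 = np.array([35.0, 37.2, 40.1, 36.8, 38.5, 33.9, 36.0, 39.4, 37.7, 36.0])
x2 = np.array([15.1, 16.4, 17.0, 15.8, 16.2, 14.9, 15.5, 16.8, 15.3, 15.2])
y  = np.array([ 8.2,  9.1, 10.4,  8.8,  9.5,  7.9,  8.5, 10.0,  8.9,  8.6])

n = len(y)
A = np.array([
    [n,        x1.sum(),       x2.sum()],
    [x1.sum(), (x1**2).sum(),  (x1*x2).sum()],
    [x2.sum(), (x1*x2).sum(),  (x2**2).sum()],
])
rhs = np.array([y.sum(), (x1*y).sum(), (x2*y).sum()])

b0, b1, b2 = np.linalg.solve(A, rhs)       # same result as Gauss elimination
print(f"y = {b0:.3f} + {b1:.3f}*x1 + {b2:.3f}*x2")
```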

The theoretical (calculated) value of Fisher's F-criterion is found as F_calc = [R² / (1 - R²)] · [(n - m - 1) / m].

The critical value F_crit is determined from statistical tables and for the significance level α = 0.05 equals 4.74. Since F_calc > F_crit, the null hypothesis is rejected, and the resulting regression equation is accepted as statistically significant.

The assessment of the statistical significance of the regression coefficients b1 and b2 by the t-criterion reduces to comparing the numerical value of each coefficient with the magnitude of its random error m_b1 and m_b2.

The working formula for calculating the theoretical value of the t-statistic is formula (3.13), in which the paired correlation coefficients and the multiple correlation coefficient are calculated according to the corresponding dependencies.

Then the theoretical (calculated) values ​​of t-statistics are respectively equal:

Since the critical value of the t-statistic, determined from statistical tables for the significance level α = 0.05 and equal to t_crit = 2.36, is greater in absolute value than t_b1 = -1.798, the null hypothesis is not rejected; the explanatory variable x1 is statistically insignificant and can be excluded from the regression equation. Conversely, for the second regression coefficient t_b2 > t_crit (3.3 > 2.36), so the explanatory variable x2 is statistically significant.

Let's calculate the average approximation error.

Table 3.10. To the calculation of the average approximation error.

Then the average approximation error is

The resulting value does not exceed the permissible limit equal to (12 ... 15)%.

16. The history of the development of the theory of measurements

At first, the theory of measurements (TI) developed as a theory of psychophysical measurements. In his post-war publications, the American psychologist S.S. Stevens focused on measurement scales. In the second half of the 20th century, the scope of TI expanded rapidly. One of the volumes of the "Encyclopedia of Psychological Sciences" published in the USA in the 1950s was called "Psychological Measurements"; its authors expanded the scope of TI from psychophysics to psychology in general. The article "Foundations of the theory of measurements" in that collection presented the material at an abstract mathematical level, without reference to any specific field of application; the emphasis was placed on "homomorphisms of empirical systems with relations into numbers" (there is no need to go into these mathematical terms here), and the mathematical complexity of the presentation increased in comparison with the works of S.S. Stevens.

In one of the first domestic articles on TI (late 1960s) it was established that the scores assigned by experts when assessing objects of expertise are, as a rule, measured on an ordinal scale. Works that appeared in the early 1970s led to a significant expansion of the field of application of TI: it was applied to pedagogical qualimetry (measuring the quality of students' knowledge), in systems research, in various problems of the theory of expert assessments, for aggregating product quality indicators, in sociological research, etc.

Along with establishing the type of scale used to measure specific data, the second main problem of TI is the search for data analysis algorithms whose result does not change under any admissible transformation of the scale (i.e., is invariant with respect to such transformations). Ordinal scales in geography are the Beaufort wind scale ("calm", "light breeze", "moderate breeze", etc.) and the earthquake intensity scale. Obviously, one cannot say that an earthquake of 2 points (a lamp swings from the ceiling) is exactly 5 times weaker than an earthquake of 10 points (complete destruction of everything on the surface of the earth).

In medicine, ordinal scales include the scale of stages of hypertension (according to Myasnikov), the scale of degrees of heart failure (according to Strazhesko-Vasilenko-Lang), the scale of severity of coronary insufficiency (according to Vogelson), etc. All these scales are built according to the scheme: the disease is not detected; the first stage of the disease; the second stage; the third stage... Sometimes stages 1a, 1b, etc. are distinguished. Each stage has its own characteristic medical description. When describing disability groups, the numbers are used in the reverse order: the most severe is the first disability group, then the second, and the mildest is the third.

House numbers are also measured on an ordinal scale - they show the order in which the houses are located along the street. Volume numbers in a writer's collected works or case numbers in an enterprise archive are usually associated with the chronological order of their creation.

When assessing the quality of products and services, in the so-called qualimetry (literal translation - measurement of quality), ordinal scales are popular. Namely, the unit of production is judged to be good or bad. A more thorough analysis uses a scale with three gradations: there are significant defects - there are only minor defects - no defects. Sometimes four grades are used: there are critical defects (making it impossible to use) - there are significant defects - there are only minor defects - there are no defects. The grade of products has a similar meaning - top grade, first grade, second grade, ...

When assessing environmental impacts, the first, most generalized assessment is usually ordinal, for example: the natural environment is stable - the natural environment is oppressed (degrades). The environmental-medical scale is similar: there is no pronounced impact on human health - there is a negative impact on health.

The ordinal scale is used in other areas as well. In econometrics, these are primarily various methods of expert assessments.

All measurement scales are divided into two groups: scales of qualitative attributes and scales of quantitative attributes. The ordinal scale and the naming (nominal) scale are the main scales of qualitative attributes, therefore in many specific areas the results of qualitative analysis can be regarded as measurements on these scales. Scales of quantitative attributes are the interval, ratio, difference and absolute scales. The interval scale is used, for example, to measure potential energy or the coordinate of a point on a line: in these cases neither a natural origin nor a natural unit of measurement can be indicated on the scale, so the researcher must set the origin and choose the unit of measurement himself. Admissible transformations on the interval scale are increasing linear transformations, i.e. linear functions. The Celsius and Fahrenheit temperature scales are related by precisely such a relationship: °С = 5/9 (°F - 32), where °С is the temperature (in degrees) on the Celsius scale and °F is the temperature on the Fahrenheit scale.

Of the quantitative scales, the ratio scales are the most common in science and practice. They have a natural reference point, zero (the absence of the quantity), but no natural unit of measurement. Most physical quantities are measured on a ratio scale: body mass, length, charge, as well as prices in economics. Admissible transformations on a ratio scale are similarity transformations (changing only the scale), in other words increasing linear transformations without an intercept, for example converting prices from one currency to another at a fixed rate. Suppose we compare the cost-effectiveness of two investment projects using prices in rubles, and the first project turns out to be better than the second. Now let's switch to the Chinese currency, the yuan, using a fixed conversion rate. Obviously, the first project should again prove to be more profitable than the second. However, calculation algorithms do not ensure the fulfillment of this condition automatically, and it is necessary to check that it is fulfilled. The results of such a check for average values are described below.
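A tiny sketch of the check described above, with made-up cost figures: the conclusion "the first project is cheaper on average" must survive an admissible ratio-scale transformation, i.e. multiplication of all prices by a fixed exchange rate.

```python
# Invariance of a mean comparison under an admissible ratio-scale transformation
# (multiplication by a fixed exchange rate). The figures are made up.
costs_project_1 = [120.0, 95.0, 130.0, 110.0]    # in rubles
costs_project_2 = [125.0, 118.0, 140.0, 105.0]   # in rubles
rate = 0.082                                     # illustrative rubles-to-yuan rate

def mean(xs):
    return sum(xs) / len(xs)

before = mean(costs_project_1) < mean(costs_project_2)
after = mean([c * rate for c in costs_project_1]) < mean([c * rate for c in costs_project_2])
print(before, after)    # both True: the comparison is invariant under scaling
```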

The difference scale has a natural unit of measurement but no natural reference point. Time is measured on a difference scale if the year (or the day, from noon to noon) is taken as the natural unit of measurement, and on an interval scale in the general case. At the current level of knowledge a natural reference point cannot be indicated: different authors calculate the date of the creation of the world in different ways, as well as the moment of the Nativity of Christ.

Only for the absolute scale are the measurement results numbers in the usual sense of the word, for example the number of people in a room. For the absolute scale, only the identity transformation is admissible.

In the process of development of the corresponding field of knowledge, the type of a scale may change. So, at first temperature was measured on an ordinal scale (colder - warmer), then on an interval scale (the Celsius, Fahrenheit and Réaumur scales), and finally, after the discovery of absolute zero, temperature can be considered measured on a ratio scale (the Kelvin scale). It should be noted that specialists sometimes disagree about which scales certain real quantities should be considered measured on. In other words, the measurement process also includes determining the type of scale (together with the rationale for choosing a particular type). In addition to the six main types of scales listed, other scales are sometimes used.

17. Invariant algorithms and mean values.

Let us formulate the main requirement for data analysis algorithms in TI: conclusions drawn on the basis of data measured on a scale of a certain type must not change under any admissible transformation of the measurement scale of these data. In other words, the conclusions must be invariant with respect to admissible scale transformations.

Thus, one of the main goals of the theory of measurements is to combat the subjectivity of the researcher when assigning numerical values to real objects. Distances can be measured in arshins, meters, microns, miles, parsecs and other units; mass (weight) in poods, kilograms, pounds, etc.; prices for goods and services can be quoted in yuan, rubles, tenge, hryvnia, lats, kroons, marks, US dollars and other currencies (given the specified conversion rates). Let us emphasize a very important, albeit quite obvious, circumstance: the choice of units of measurement depends on the researcher, i.e. it is subjective. Statistical conclusions can be adequate to reality only when they do not depend on which unit of measurement the researcher prefers, that is, when they are invariant with respect to admissible scale transformations. Of the many algorithms for econometric data analysis, only a few satisfy this condition. Let us show this by comparing average values.

Let X1, X2, ..., Xn be a sample of size n. The arithmetic mean is used most often. Its use is so common that the second word of the term is often omitted, and people speak of the average salary, average income and other averages for specific economic data, meaning the arithmetic mean. This tradition can lead to erroneous conclusions. Let us show this using the example of calculating the average wage (average income) of the employees of a conventional enterprise. Out of 100 employees, only 5 have wages exceeding the arithmetic mean, while the wages of the remaining 95 are significantly below it. The reason is obvious: the salary of one person, the general director, exceeds the salaries of the other 95 employees - low-skilled and highly skilled workers, engineers and office staff. The situation is reminiscent of the well-known story about a hospital with 10 patients, of whom 9 have a temperature of 40°C and one has already passed away and lies in the morgue with a temperature of 0°C; meanwhile, the average temperature in the hospital is 36°C - couldn't be better!

Thus, the arithmetic mean can be used only for sufficiently homogeneous populations (without large outliers in one direction or another). What averages should be used to describe wages? It is quite natural to use the median: the arithmetic mean of the 50th and 51st wages, if they are arranged in non-decreasing order. First come the wages of 40 low-skilled workers, and then, from the 41st to the 70th positions, the wages of highly skilled workers. Consequently, the median falls precisely on them and equals 200. For 50 employees the wage does not exceed 200, and for 50 it is at least 200, so the median shows the "center" around which the bulk of the studied values are grouped. Another average is the mode, the most frequently occurring value; in this case it is the wage of low-skilled workers, i.e. 100. Thus, to describe the wages we have three averages: the mode (100 units), the median (200 units) and the arithmetic mean (400 units).

For the distributions of income and wages observed in real life, the same pattern holds: the mode is less than the median, and the median is less than the arithmetic mean.

Why are averages used in economics? Usually in order to replace a collection of numbers with a single number and to compare populations by means of averages. Let, for example, Y1, Y2, ..., Yn be the set of expert assessments given to one object of expertise (for example, one of the options for the strategic development of a firm), and Z1, Z2, ..., Zn the assessments given to a second object (another variant of this development). How can these populations be compared? Obviously, the easiest way is by average values.

How are the averages calculated? Various kinds of mean values are known: the arithmetic mean, the median, the mode, the geometric mean, the harmonic mean, the root mean square. Recall that the general concept of the mean value was introduced by the French mathematician of the first half of the 19th century, Augustin Cauchy. It is as follows: a mean value is any function Ф(X1, X2, ..., Xn) such that, for all possible values of the arguments, the value of this function is not less than the minimum of the numbers X1, X2, ..., Xn and not more than their maximum. All of the above types of averages are Cauchy means.

Under an admissible scale transformation, the value of the mean obviously changes. But the conclusion about for which population the average is greater and for which it is less should not change (in accordance with the requirement of invariance of conclusions, adopted as the main requirement of TI). Let us formulate the corresponding mathematical problem: to find the form of mean values whose comparison result is stable with respect to admissible scale transformations.

Let Ф(X1, X2, ..., Xn) be a Cauchy mean, and let the mean for the first population be less than the mean for the second population. Then, according to TI, for the result of comparing the means to be stable it is necessary that, for any admissible transformation g from the group of admissible transformations of the corresponding scale, the mean of the transformed values of the first population is also less than the mean of the transformed values of the second population. Moreover, the formulated condition must hold for any two sets Y1, Y2, ..., Yn and Z1, Z2, ..., Zn and, recall, for any admissible transformation. Mean values satisfying this condition are called admissible (in the corresponding scale). According to TI, only such means can be used when analyzing expert opinions and other data measured on the scale under consideration.

Using the mathematical theory developed in the 1970s, one can describe the form of admissible means in the main scales. It is clear that for data measured in the naming (nominal) scale only the mode is suitable as an average.

18. Average values ​​in ordinal scale

Consider the processing of expert opinions measured on an ordinal scale. The following statement is true.

Theorem 1. Of all Cauchy means, only the terms of the variation series (order statistics) are admissible means for data measured on an ordinal scale.

Theorem 1 is valid provided that the mean Ф(X1, X2, ..., Xn) is a continuous (in the set of variables) and symmetric function. The latter means that when the arguments are permuted, the value of the function Ф(X1, X2, ..., Xn) does not change. This condition is quite natural, because we seek the average for a set, not for a sequence, and a set does not change depending on the order in which we list its elements.

According to Theorem 1, in particular, the median can be used as the mean for data measured on an ordinal scale (for an odd sample size). For an even size, one of the two central members of the variation series should be used - the left median or the right median, as they are sometimes called. The mode can also be used: it is always a member of the variation series. But one can never calculate the arithmetic mean, the geometric mean, etc.
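The following sketch, with made-up expert scores, illustrates the point: under an admissible ordinal-scale transformation (any strictly increasing function g) the comparison of medians is preserved, while the comparison of arithmetic means may be reversed.

```python
# Comparing two sets of expert scores before and after a strictly increasing
# transformation g(t) = t**3. The scores are made up for illustration.
import statistics

g = lambda t: t ** 3

Y = [1, 1, 11]                       # scores given to object 1
Z = [5, 5, 5]                        # scores given to object 2
gY, gZ = [g(v) for v in Y], [g(v) for v in Z]

# The median comparison is stable: object 2 is rated higher both times.
print(statistics.median(Y) < statistics.median(Z))    # True  (1 < 5)
print(statistics.median(gY) < statistics.median(gZ))  # True  (1 < 125)

# The arithmetic-mean comparison flips, so it is not admissible on this scale.
print(statistics.mean(Y) < statistics.mean(Z))        # True  (4.33 < 5)
print(statistics.mean(gY) < statistics.mean(gZ))      # False (444.33 > 125)
```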

The following theorem is true.

Theorem 2. Let Y1, Y2, ..., Ym be independent identically distributed random variables with distribution function F(x), and Z1, Z2, ..., Zn independent identically distributed random variables with distribution function H(x), the samples Y1, Y2, ..., Ym and Z1, Z2, ..., Zn being independent of each other. In order for the probability that the arithmetic mean of the transformed values g(Y1), ..., g(Ym) exceeds the arithmetic mean of g(Z1), ..., g(Zn) to tend to 1 as min(m, n) increases, for any strictly increasing continuous function g satisfying a certain purely technical boundedness condition, it is necessary and sufficient that the inequality F(x) ≤ H(x) hold for all x and that there exist a number x0 for which F(x0) < H(x0).

Note. The condition involving an upper bound on g is of a purely intra-mathematical nature. In fact, the function g is an arbitrary admissible transformation on an ordinal scale.

According to Theorem 2, the arithmetic mean can also be used on an ordinal scale if samples from two distributions satisfying the inequality given in the theorem are compared. Simply put, one of the distribution functions must always lie above the other; the distribution functions cannot intersect, they are only allowed to touch. This condition is satisfied, for example, if the distribution functions differ only by a shift:

F (x) = H (x + ∆)

for some ∆.

The last condition is fulfilled if two values ​​of a certain quantity are measured using the same measuring instrument, in which the distribution of errors does not change when passing from measuring one value of the quantity in question to measuring another.

Average according to Kolmogorov

A generalization of several of the means listed above is the Kolmogorov mean. For numbers X1, X2, ..., Xn it is calculated by the formula

G((F(X1) + F(X2) + ... + F(Xn)) / n),

where F is a strictly monotone function (i.e. strictly increasing or strictly decreasing),

and G is the function inverse to F.

Among the Kolmogorov means there are many well-known ones. Thus, if F(x) = x, the Kolmogorov mean is the arithmetic mean; if F(x) = ln x, the geometric mean; if F(x) = 1/x, the harmonic mean; if F(x) = x², the root mean square, etc. The Kolmogorov mean is a special case of the Cauchy mean. On the other hand, such popular averages as the median and the mode cannot be represented as Kolmogorov means. The following statements are proved in the monograph.
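A minimal sketch of the Kolmogorov mean G((F(X1) + ... + F(Xn)) / n), showing how the usual means arise as special cases of the function F (only the standard library's math module is used; the names are illustrative).

```python
# Kolmogorov mean with different strictly monotone functions F (G is the inverse of F).
import math

def kolmogorov_mean(xs, F, G):
    return G(sum(F(x) for x in xs) / len(xs))

xs = [2.0, 4.0, 8.0]

arithmetic = kolmogorov_mean(xs, lambda t: t,      lambda s: s)        # 4.67
geometric  = kolmogorov_mean(xs, math.log,         math.exp)           # 4.0
harmonic   = kolmogorov_mean(xs, lambda t: 1 / t,  lambda s: 1 / s)    # 3.43
quadratic  = kolmogorov_mean(xs, lambda t: t ** 2, math.sqrt)          # 5.29
print(arithmetic, geometric, harmonic, quadratic)
```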

Theorem 3. Under certain intra-mathematical regularity conditions, of all the Kolmogorov means only the arithmetic mean is admissible on the interval scale. Thus, the geometric mean or the root mean square of temperatures (in degrees Celsius) or of distances is meaningless. The arithmetic mean should be used as the average; the median or the mode can also be used.

Theorem 4. Under certain intra-mathematical regularity conditions, of all the Kolmogorov means only the power means, with F(x) = x^c, and the geometric mean are admissible on the ratio scale.

Comment. The geometric mean is the limit of the power means as c → 0.

Are there Kolmogorov means that cannot be used on the ratio scale? Of course there are. For example, F(x) = e^x.

Similarly to mean values, other statistical characteristics can be studied: indicators of spread, association, distance, etc. It is easy to show, for example, that the correlation coefficient does not change under any admissible transformation on the interval scale, as does the ratio of variances; the variance does not change on the difference scale, the coefficient of variation on the ratio scale, etc.

The above results on mean values are widely used, not only in economics, management, the theory of expert assessments or sociology, but also in engineering, for example to analyze methods of aggregating sensor readings in the automated process control systems of blast furnaces. The applied significance of TI is great in problems of standardization and quality management, in particular in qualimetry, where interesting theoretical results have been obtained. For example, any change in the weighting coefficients of individual product quality indicators leads to a change in the ordering of products by the weighted average indicator (this theorem was proved by Prof. V.V. Podinovsky). Consequently, the brief information given above about TI and its methods unites, in a certain sense, economics, sociology and the engineering sciences, and provides an adequate apparatus for solving complex problems that previously did not lend themselves to effective analysis; moreover, it opens the way to the construction of realistic models and the solution of forecasting problems.

22. Paired Linear Regression

Let us now turn to a more detailed study of the simplest case, paired linear regression. Linear regression is described by the simplest functional dependence, the equation of a straight line, and is characterized by a transparent interpretation of the model parameters (the equation coefficients). The right-hand side of the equation allows one, for given values of the regressor (explanatory variable), to obtain theoretical (calculated) values of the resultant (explained) variable. These values are sometimes also called predicted (in the same sense), i.e. obtained from theoretical formulas. However, when a hypothesis about the nature of the dependence is put forward, the coefficients of the equation are still unknown. Generally speaking, approximate values of these coefficients can be obtained by various methods.

But the most important and widespread of them is the method of least squares (OLS). It is based (as already explained) on the requirement of minimizing the sum of squared deviations of the actual values of the resultant indicator from the calculated (theoretical) ones. To obtain the theoretical values, the right-hand side of the regression equation is substituted into the sum of squared deviations, and then the partial derivatives of this function (the sum of squared deviations of the actual values of the resultant indicator from the theoretical ones) are found. These partial derivatives are taken not with respect to the variables x and y, but with respect to the parameters a and b. The partial derivatives are set equal to zero, and after simple but cumbersome transformations a system of normal equations for determining the parameters is obtained. The coefficient of the variable x, i.e. b, is called the regression coefficient; it shows the average change in the result when the factor changes by one unit. The parameter a may have no economic interpretation, especially if the sign of this coefficient is negative.

Paired linear regression is used, for example, to study the consumption function. The regression coefficient in the consumption function is used to calculate the multiplier. The regression equation is almost always supplemented with an indicator of the closeness of the relationship. For the simplest case of linear regression, this indicator is the linear correlation coefficient. But since the linear correlation coefficient characterizes the closeness of the relationship between the features in linear form, the closeness of its absolute value to zero does not yet indicate the absence of a relationship between the features.

With a different choice of the model specification, and hence of the type of dependence, the actual relationship may turn out to be quite close to unity. The quality of the fit of the linear function is determined using the square of the linear correlation coefficient, the coefficient of determination. It characterizes the proportion of the variance of the resultant feature y explained by the regression in the total variance of the resultant feature. The value that complements the coefficient of determination to 1 characterizes the proportion of variance caused by the influence of other factors not taken into account in the model (the residual variance).

Paired regression is represented by an equation relating two variables y and x of the following form:

y = f(x),

where y is the dependent variable (resultant indicator) and x is the independent variable (explanatory variable, or factor indicator). There are linear regression and nonlinear regression. Linear regression is described by an equation of the form:

y = a + bx + ε.

Nonlinear regression, in turn, may be nonlinear with respect to the explanatory variables included in the analysis but linear in the estimated parameters, or it may be nonlinear in the estimated parameters. Examples of regressions that are nonlinear in the explanatory variables but linear in the estimated parameters are polynomial dependences of various degrees (polynomials) and the equilateral hyperbola.

Regressions that are nonlinear in the estimated parameters include the power dependence with respect to a parameter (the parameter is in the exponent), the exponential dependence, where the parameter is the base of the power, and the exponential dependence where the entire linear combination is in the exponent. Note that in all three cases the random component (random residual) ε enters the right-hand side of the equation as a factor, not as a summand, i.e. multiplicatively. The average deviation of the calculated values of the resultant indicator from the actual ones is characterized by the average approximation error. It is expressed as a percentage and should not exceed 7-8%. This average approximation error is simply the percentage average of the relative magnitudes of the differences between the actual and calculated values.

Of great importance is the average coefficient of elasticity, which serves as an important characteristic of many economic phenomena and processes. It is calculated as the product of the derivative of the given functional dependence and the ratio of the average value of x to the average value of y. The elasticity coefficient shows by how many percent, on average over the population, the result y will change from its average value when the factor x changes by 1% from its (the factor's) average value.

Problems of analysis of variance are closely related to paired regression, to multiple regression (when there are many factors), and to the residual variance. Analysis of variance examines the variance of the dependent variable: the total sum of squared deviations is divided into two parts. The first term is the sum of squared deviations due to regression, the explained (factor) sum. The second term is the residual sum of squared deviations, not explained by the factor regression.

The proportion of variance explained by regression in the total variance of the effective trait y is characterized by the coefficient (index) of determination, which is nothing more than the ratio of the sum of squares of deviations due to regression to the total sum of squares of deviations (the first term to the entire sum).

When the parameters of the model (coefficients at unknowns) are determined using the least squares method, then, in essence, some random variables are found (in the process of obtaining estimates). Especially important is the estimation of the regression coefficient, which is some special form of a random variable. The properties of this random variable depend on the properties of the remainder in the equation (in the model). Consider the explanatory variable x as a nonrandom exogenous variable for a paired linear regression model. This just means that the values ​​of the variable x in all observations can be considered predetermined and in no way connected with the investigated dependence. Thus, the actual value of the variable being explained consists of two components: a non-random component and a random component (residual term).

On the other hand, the regression coefficient determined by the method of least squares (OLS) is equal to the covariance of the variables x and y divided by the variance of the variable x. It therefore also contains a random component, since the covariance depends on the values of the variable y, and the values of the variable y depend on the values of the random remainder ε. Further, it is easy to show that the covariance of the variables x and y is equal to the product of the estimated regression coefficient β and the variance of the variable x, plus the covariance of the variables x and ε. Thus, the estimate of the regression coefficient is equal to the unknown regression coefficient itself plus the covariance of the variables x and ε divided by the variance of the variable x. That is, the estimate of the regression coefficient b obtained from any sample is presented as the sum of two terms: a constant value equal to the true value of the coefficient β (beta), and a random component that depends on the covariance of the variables x and ε.
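A quick numerical check of the first statement, on hypothetical simulated data and assuming NumPy: the OLS slope coincides with cov(x, y) / var(x).

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(40, 70, size=50)
eps = rng.normal(0, 2, size=50)                       # random remainder
y = 77.0 - 0.35 * x + eps                             # hypothetical "true" relationship

b_ols = np.polyfit(x, y, 1)[0]                        # slope from least squares
b_cov = np.cov(x, y, bias=True)[0, 1] / np.var(x)     # cov(x, y) / var(x)
print(b_ols, b_cov)                                   # the two values coincide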

23. Mathematical conditions of Gauss-Markov and their application.

For regression analysis based on OLS to give the best results, the random term must satisfy four Gauss-Markov conditions.

The mathematical expectation of the random term is zero in every observation, i.e. it introduces no systematic bias. If the regression equation includes a constant term, it is natural to consider this requirement fulfilled, since it is the constant term that should absorb any systematic trend in the values of the variable y which, on the contrary, should not be contained in the explanatory variables of the regression equation.

The variance of the random term is constant for all observations.

The covariance of the values of the random term in any two different observations of the sample must be zero, i.e. there is no systematic relationship between the values of the random term in any two specific observations; the random terms must be independent of each other.

The distribution law of a random term should be independent of the explanatory variables.

Moreover, in many applications the explanatory variables are not stochastic, i.e. they do not have a random component. The value of any independent variable in each observation should be considered exogenous, completely determined by external causes not taken into account in the regression equation.

Together with the indicated Gauss-Markov conditions, it is also assumed that the random term has a normal distribution. It is valid under very broad conditions and is based on the so-called central limit theorem (CLT). The essence of this theorem is that if a random variable is the general result of the interaction of a large number of other random variables, none of which has a predominant influence on the behavior of this general result, then such a resulting random variable will be described by an approximately normal distribution. This closeness to normal distribution makes it possible to use the normal distribution to obtain estimates and, in a certain sense, its generalization, the Student's distribution, which differs markedly from the normal mainly on the so-called "tails", i.e. at small values ​​of the sample size. It is also important that if the random term is distributed normally, then the regression coefficients will also be distributed according to the normal law.

The established regression curve (regression equation) allows you to solve the problem of the so-called point forecast. In such calculations, a certain value of x outside the studied observation interval is taken and substituted into the right side of the regression equation (extrapolation procedure). Because estimates for the regression coefficients are already known, then it is possible to calculate the value of the explained variable y corresponding to the taken value of x. Naturally, in accordance with the meaning of the prediction (forecast), calculations are carried out forward (into the area of ​​future values).

However, since the coefficients were determined with a certain error, the point of interest is not the point estimate (point forecast) for the effective indicator, but the knowledge of the range within which the values ​​of the effective indicator corresponding to the taken value of the factor x will lie with a certain probability.

For this, the value of the standard error (standard deviation) is calculated. It can be obtained, in the spirit of what has just been said, as follows. The expression for the intercept a in terms of the means is substituted into the linear regression equation. It then turns out that the standard error depends on the error of the mean of the effective factor y and, additively, on the error of the regression coefficient b. The square of this standard error is equal to the sum of the squared error of the mean of y and the product of the squared error of the regression coefficient by the squared deviation of the factor x from its mean. The first term, by the laws of statistics, is equal to the variance of the general population divided by the sample size.

The sample variance is used as an estimate instead of the unknown variance. Accordingly, the error of the regression coefficient is defined as the quotient of dividing the sample variance by the variance of the factor x. You can get the value of the standard error (standard deviation) and for reasons that are more independent of the linear regression model. For this, the concept of average error and marginal error and the relationship between them are used.

But even after obtaining the standard error, the question remains about the boundaries in which the predicted value will lie. In other words, about the measurement error interval, in the natural assumption in many cases that the middle of this interval is given by the calculated (average) value of the effective factor y. Here the central limit theorem comes to the rescue, which just indicates with what probability the unknown quantity is within this confidence interval.

In essence, the standard error formula, regardless of how and in what form it is obtained, characterizes the error in the position of the regression line. The value of the standard error reaches a minimum when the value of the factor x coincides with the mean value of the factor.
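A sketch of this behaviour on hypothetical data, assuming NumPy and the usual formula se(x0) = s·sqrt(1/n + (x0 − x̄)²/∑(x − x̄)²) for the error of the position of the regression line:

import numpy as np

x = np.array([45.0, 47.0, 55.0, 57.0, 59.0, 61.0, 62.0])   # hypothetical data
y = np.array([69.0, 64.0, 60.0, 58.0, 56.0, 55.0, 50.0])
n = len(x)

b, a = np.polyfit(x, y, 1)
resid = y - (a + b * x)
s2 = resid @ resid / (n - 2)          # residual variance per degree of freedom

def se_line(x0):
    return np.sqrt(s2 * (1.0 / n + (x0 - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)))

print(se_line(x.mean()), se_line(x.max()))   # the error grows away from the mean of x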

24. Statistical testing of hypotheses and assessment of the significance of linear regression by Fisher's test.

After the linear regression equation has been found, the significance of both the equation as a whole and of its individual parameters is assessed. The significance of the regression equation as a whole can be assessed using various criteria; the use of Fisher's F-criterion is quite widespread and effective. The null hypothesis H0 is put forward that the regression coefficient is zero, i.e. b = 0, and hence that the factor x has no effect on the result y. The direct calculation of the F-criterion is preceded by analysis of variance. Central to it is the decomposition of the total sum of squared deviations of the variable y from its mean value into two parts, "explained" and "unexplained":

∑(y − y avg)² = ∑(y x − y avg)² + ∑(y − y x)²

The total sum of the squares of the deviations of the individual values ​​of the effective trait y from the mean value y is caused by the influence of many factors.

Let us conditionally divide the whole set of causes into two groups: the studied factor x and other factors. If the factor has no effect on the result, then the regression line on the graph is parallel to the OX axis and y x = y avg. Then the entire variance of the effective trait is due to the influence of other factors, and the total sum of squared deviations coincides with the residual sum. If other factors do not affect the result, then y is functionally related to x and the residual sum of squares is zero; in this case the sum of squared deviations explained by the regression coincides with the total sum of squares. Since not all points of the correlation field lie on the regression line, their scatter is always present, both due to the influence of the factor x, i.e. the regression of y on x, and due to other causes (unexplained variation). The suitability of the regression line for forecasting depends on what part of the total variation of the trait y falls on the explained variation.

Obviously, if the sum of squared deviations due to the regression is greater than the residual sum of squares, then the regression equation is statistically significant and the factor x has a significant impact on the result; this is equivalent to the coefficient of determination approaching one. Any sum of squared deviations is related to its number of degrees of freedom, i.e. the number of independent variations of the characteristic, which is associated with the number of units of the population and with the number of constants determined from it. In relation to the problem under study, the number of degrees of freedom should show how many independent deviations out of the n possible [(y1 − y avg), (y2 − y avg), ..., (yn − y avg)] are required to form a given sum of squares. Thus, the total sum of squares ∑(y − y avg)² requires (n − 1) independent deviations, since for a set of n units, after calculating the average level, only (n − 1) deviations vary freely. When calculating the explained, or factorial, sum of squares ∑(y x − y avg)², the theoretical (calculated) values of the effective indicator y x, found along the regression line y(x) = a + bx, are used.

Let us now return to the decomposition of the total sum of the squares of the deviations of the effective factor from the mean of this value. This sum contains two parts already defined above: the sum of the squares of the deviations, explained by the regression, and another sum, which is called the residual sum of squares of the deviations. An analysis of variance is associated with such a decomposition, which directly answers the fundamental question: how to evaluate the significance of the regression equation as a whole and its individual parameters? It also largely determines the meaning of this question. To assess the significance of the regression equation as a whole, Fisher's test (F-test) is used. According to the approach proposed by Fisher, a null hypothesis is put forward: the regression coefficient is zero, i.e. value b = 0. This means that factor X does not affect the outcome of Y.

Recall that almost always the points obtained as a result of statistical research do not fall exactly on the regression line. They are scattered, being more or less far removed from the regression line. This dispersion is due to the influence of factors other than the explanatory factor X, which are not taken into account in the regression equation. When calculating the explained, or factorial sum of squared deviations, the theoretical values ​​of the effective indicator, found along the regression line, are used.

For a given set of values of the variables Y and X, the calculated value of the mean of Y in linear regression is a function of only one parameter, the regression coefficient. Accordingly, the factorial sum of squared deviations has one degree of freedom, while the residual sum of squared deviations in linear regression has n − 2 degrees of freedom.

Therefore, dividing each sum of squared deviations in the original decomposition by its number of degrees of freedom, we obtain the mean square of deviations (the variance per one degree of freedom). Dividing the factorial variance per degree of freedom by the residual variance per degree of freedom, we obtain a criterion for testing the null hypothesis, the so-called F-ratio, or F-criterion. If the null hypothesis is valid, the factorial and residual variances per degree of freedom do not differ significantly from each other.
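A sketch of this ratio on hypothetical data, assuming NumPy; it also illustrates the equivalent expression of F through the coefficient of determination for paired regression:

import numpy as np

x = np.array([45.0, 47.0, 55.0, 57.0, 59.0, 61.0, 62.0])   # hypothetical data
y = np.array([69.0, 64.0, 60.0, 58.0, 56.0, 55.0, 50.0])
n = len(x)

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

ss_factor = np.sum((y_hat - y.mean()) ** 2)   # explained (factorial) sum of squares, 1 df
ss_resid  = np.sum((y - y_hat) ** 2)          # residual sum of squares, n - 2 df

F = (ss_factor / 1) / (ss_resid / (n - 2))    # ratio of variances per degree of freedom
R2 = ss_factor / (ss_factor + ss_resid)
print(F, R2 / (1 - R2) * (n - 2))             # the two expressions agree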

To reject the null hypothesis, i.e. to accept the opposite hypothesis, which expresses the fact of significance (presence) of the investigated dependence rather than a mere random coincidence of factors imitating a dependence that does not actually exist, it is necessary to use tables of critical values of the specified ratio. From the tables the critical (threshold) value of Fisher's criterion, also called the theoretical value, is found. It is then checked, by comparison, whether the empirical (actual) value of the criterion calculated from the observation data exceeds the critical value from the tables.

In more detail, this is done as follows. A significance level is chosen and the tables are used to find the critical value of the F-criterion, i.e. the maximum value at which the divergence of the variances per degree of freedom can still be random. The calculated value of the F-ratio is then recognized as reliable (i.e., as expressing a real difference between the factorial and residual variances) if this ratio is greater than the tabular one. In that case the null hypothesis is rejected (it is not true that there are no signs of a connection) and, on the contrary, we conclude that a connection exists and is significant (it is not random).

If the value of the ratio turns out to be less than the tabular value, then the probability of the null hypothesis turns out to be higher than the given level (which was chosen initially) and the null hypothesis cannot be rejected without a noticeable danger of getting an incorrect conclusion about the presence of a connection. Accordingly, the regression equation is considered insignificant in this case.

The very value of the F-criterion is associated with the coefficient of determination. In addition to assessing the significance of the regression equation as a whole, the significance of individual parameters of the regression equation is also assessed. In this case, the standard error of the regression coefficient is determined using the empirical actual standard deviation and the empirical variance per degree of freedom. After that, the Student's distribution is used to check the significance of the regression coefficient for calculating its confidence intervals.

The assessment of the significance of the regression and correlation coefficients using Student's t-test is carried out by comparing the values of these quantities with the magnitudes of their standard errors. The errors of the linear regression parameters and of the correlation coefficient are determined by the standard formulas

m b = S / (σ x · √n),  m a = S · √(∑x²) / (n · σ x),  m r = √((1 − r² xy) / (n − 2)),

where S is the root-mean-square residual sample deviation (per degree of freedom),

r xy is the correlation coefficient.

Accordingly, the standard error of the value predicted by the regression line is constructed in a similar way from the residual variance and the deviation of x from its mean (see the discussion of the point forecast above).

The ratios of the values of the regression and correlation coefficients to their standard errors form the so-called t-statistics, and comparing the corresponding tabular (critical) value with the actual value allows one to accept or reject the null hypothesis. Further, to construct the confidence interval, the marginal error for each indicator is found as the product of the tabular value of the t-statistic and the average random error of the corresponding indicator; in fact, this was written down in a slightly different way just above. The boundaries of the confidence intervals are then obtained: the lower bound by subtracting the corresponding marginal error from the corresponding coefficient, and the upper bound by adding it.
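A sketch of the t-statistic and confidence interval for the regression coefficient b on hypothetical data, assuming NumPy and SciPy:

import numpy as np
from scipy import stats

x = np.array([45.0, 47.0, 55.0, 57.0, 59.0, 61.0, 62.0])   # hypothetical data
y = np.array([69.0, 64.0, 60.0, 58.0, 56.0, 55.0, 50.0])
n = len(x)

b, a = np.polyfit(x, y, 1)
resid = y - (a + b * x)
s2 = resid @ resid / (n - 2)                     # residual variance per degree of freedom

m_b = np.sqrt(s2 / np.sum((x - x.mean()) ** 2))  # standard error of b
t_b = b / m_b                                    # observed t-statistic
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)     # two-sided critical value, alpha = 0.05

print(t_b, t_crit)
print("95% interval for b:", (b - t_crit * m_b, b + t_crit * m_b))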

In linear regression, ∑(y x − y avg)² = b² ∑(x − x avg)². It is easy to verify this by referring to the formula for the linear correlation coefficient: r² xy = b² · σ² x / σ² y,

where σ² y is the total variance of the attribute y,

σ² x is the variance of the attribute x. Accordingly, the sum of squared deviations due to linear regression is:

∑(y x − y avg)² = b² ∑(x − x avg)².

Since, for a given volume of observations on x and y, the factorial sum of squares in linear regression depends on only one constant, the regression coefficient b, this sum of squares has one degree of freedom. Consider the substantive side of the calculated value of the attribute y, i.e. y x. The value of y x is determined by the linear regression equation: y x = a + bx.

The parameter a can be expressed as a = y avg − b·x avg. Substituting this expression into the linear model, we get: y x = y avg − b·x avg + bx = y avg + b(x − x avg).

For a given set of variables y and x, the calculated value of y x in linear regression is a function of only one parameter - the regression coefficient. Accordingly, the factorial sum of the squares of the deviations has the number of degrees of freedom equal to 1.

The number of degrees of freedom of the total sum of squares equals the sum of the numbers of degrees of freedom of the factorial and residual sums. The number of degrees of freedom of the residual sum of squares in linear regression is (n − 2). The number of degrees of freedom for the total sum of squares is determined by the number of units, and since we use the mean calculated from the sample data, we lose one degree of freedom, i.e. it equals (n − 1). So we have two equalities: one for the sums and one for the numbers of degrees of freedom. This, in turn, brings us back to the comparable variances per degree of freedom, whose ratio gives Fisher's criterion.

25. Assessment of the significance of individual parameters of the regression equation and coefficients according to the Student's criterion.

27. Linear and nonlinear regression and methods of their research.

Linear regression and the methods of its study and estimation would not be so important if, in addition to this very important but still simplest case, they did not give us a tool for analysing more complex nonlinear dependences. Nonlinear regressions can be divided into two substantially different classes. The first and simpler one is the class of nonlinear dependences in which there is nonlinearity with respect to the explanatory variables, but which remain linear in the parameters included in them and subject to estimation. This class includes polynomials of various degrees and the equilateral hyperbola.

Such regressions, nonlinear in the explanatory variables, can easily be reduced to ordinary linear regression in new variables by a simple transformation (replacement) of variables. The parameters in this case are therefore estimated simply by the least squares method, since the dependences are linear in the parameters. Thus, an important role in economics is played by the nonlinear dependence described by the equilateral hyperbola:

y = a + b/x + ε.

Its parameters are well estimated by the least squares method, and the dependence itself characterizes, for example, the relationship of unit costs of raw materials, fuel and materials to the volume of output, or of the time of circulation of goods to the value of turnover. The Phillips curve, for instance, characterizes the nonlinear relationship between the unemployment rate and the percentage growth of wages.

The situation is completely different for regressions that are nonlinear in the estimated parameters: for example, a power function, in which the exponent itself is a parameter (or depends on a parameter); an exponential function, in which the base of the power is a parameter; or an exponential function in which, again, the exponent contains a parameter or a combination of parameters. This class, in turn, is divided into two subclasses: one includes regressions that are outwardly nonlinear but essentially internally linear; in this case the model can be brought to a linear form by transformations. If, however, the model is internally nonlinear, it cannot be reduced to a linear function.

Thus, only internally nonlinear models are considered truly nonlinear in regression analysis. All others, reducible to linear form by transformations, are not regarded as such, and it is precisely these that are considered most often in econometric studies. At the same time, this does not mean that essentially nonlinear dependences cannot be studied in econometrics. If the model is internally nonlinear in its parameters, then iterative procedures are used to estimate the parameters, whose success depends on the form of the equation and on the features of the iterative method applied.

Let us return to dependences that can be reduced to linear form. If they are nonlinear both in the parameters and in the variables, for example of the form where y equals a multiplied by x raised to the power β (beta):

y = a · x^β · ε,

Obviously, such a ratio can be easily transformed into a linear equation by simple logarithm.

After introducing new variables denoting the logarithms, a linear equation is obtained. The regression estimation procedure then consists in computing the new variables for each observation by taking the logarithms of the original values, and then estimating the regression dependence for the new variables. To return to the original variables, one takes the antilogarithm, i.e. returns to the quantities themselves instead of their logarithms (after all, the logarithm is an exponent). The case of exponential functions can be treated similarly.
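A sketch of this linearization on hypothetical data with a multiplicative error, assuming NumPy:

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1.0, 10.0, size=40)
y = 2.5 * x ** 0.8 * np.exp(rng.normal(0, 0.05, size=40))   # y = a * x**beta * eps

beta, ln_a = np.polyfit(np.log(x), np.log(y), 1)            # OLS on the logarithms
a = np.exp(ln_a)                                            # antilogarithm: back to the original scale
print(a, beta)                                              # close to the assumed 2.5 and 0.8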

For substantially nonlinear regression, it is impossible to apply the usual regression estimation procedure, since the corresponding dependence cannot be transformed into a linear one. The general scheme of actions is as follows:

1. Some plausible initial parameter values ​​are accepted;

2. The predicted values ​​of Y are calculated from the actual values ​​of X using these parameter values;

3. Calculate the residuals for all observations in the sample and then the sum of the squares of the residuals;

4. Small changes are made to one or several of the parameter estimates;

5. Calculate the new predicted Y values, residuals and the sum of squares of residuals;

6. If the sum of the squares of the residuals is less than before, then the new parameter estimates are better than the previous ones and should be used as a new starting point;

7. Steps 4, 5 and 6 are repeated until it becomes impossible to make changes in the parameter estimates that would lead to a reduction in the sum of squares of the residuals;

8. It is concluded that the value of the sum of squares of the residuals is minimized and the final estimates of the parameters are estimates by the method of least squares.
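A minimal sketch of this iterative scheme on hypothetical data (a model y = a · x^β with an additive error, which cannot be linearized by logarithms), assuming NumPy; the step sizes and stopping rule are illustrative choices:

import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1.0, 10.0, size=60)
y = 3.0 * x ** 0.5 + rng.normal(0, 0.2, size=60)   # additive error

def ssr(a, beta):
    resid = y - a * x ** beta                      # steps 2-3: predictions, residuals, SSR
    return resid @ resid

a, beta, step = 1.0, 1.0, 0.1                      # step 1: plausible starting values
best = ssr(a, beta)
while step > 1e-6:
    improved = False
    for da, db in [(step, 0), (-step, 0), (0, step), (0, -step)]:   # step 4: small changes
        cand = ssr(a + da, beta + db)              # step 5: new SSR
        if cand < best:                            # step 6: keep the better estimates
            a, beta, best = a + da, beta + db, cand
            improved = True
    if not improved:                               # step 7: shrink the step when no change helps
        step /= 2
print(a, beta, best)                               # step 8: approximate least-squares estimates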

Among the nonlinear functions that can be reduced to a linear form, a power function is widely used in econometrics. The parameter b in it has a clear interpretation, being the coefficient of elasticity. In models that are nonlinear in the parameters being estimated, but brought to a linear form, OLS is applied to the transformed equations. The practical application of the logarithm and, accordingly, the exponent is possible when the effective indicator does not have negative values. When studying the relationships among the functions using the logarithm of the effective indicator, power-law dependences prevail in econometrics (supply and demand curves, production functions, development curves to characterize the relationship between the labor intensity of production, the scale of production, the dependence of GNI on the level of employment, Engel curves).

28. Reverse model and its use

Sometimes the so-called inverse model is used, in which, unlike the equilateral hyperbola, it is not the explanatory variable but the resultant feature Y that is transformed. The inverse model therefore turns out to be internally nonlinear, and the OLS requirement is fulfilled not for the actual values of the resultant feature Y but for their reciprocals. The study of correlation for nonlinear regression deserves special attention. In the general case, a second-degree parabola, as well as polynomials of higher order, takes the form of a multiple regression equation upon linearization. If a regression equation that is nonlinear in the explanatory variable takes, upon linearization, the form of a linear pairwise regression equation, then the linear correlation coefficient can be used to assess the closeness of the relationship.

If the transformations of the regression equation into linear form are associated with the dependent variable (resultant feature), then the linear correlation coefficient for the transformed feature values ​​gives only an approximate estimate of the relationship and does not numerically coincide with the correlation index. It should be borne in mind that when calculating the correlation index, the sums of the squares of the deviations of the effective indicator Y are used, and not their logarithms. The assessment of the significance of the correlation index is carried out in the same way as the assessment of the reliability (significance) of the correlation coefficient. The correlation index itself, as well as the determination index, is used to test the significance in general of the nonlinear regression equation by Fisher's F-criterion.

Note that the ability to build nonlinear models, both by reducing them to a linear form, and by using nonlinear regression, on the one hand, increases the versatility of regression analysis. On the other hand, it significantly complicates the tasks of the researcher. If you limit yourself to paired regression analysis, then you can plot the observations Y and X as a scatter plot. Often several different nonlinear functions roughly correspond to observations if they lie on some curve. But in the case of multiple regression analysis, such a graph cannot be built.

When considering alternative models with the same definition of the dependent variable, the selection procedure is relatively straightforward. You can estimate the regression based on all the likely functions you can imagine and choose the function that best explains the change in the dependent variable. It is clear that when a linear function explains about 64% of the variance of y, and a hyperbolic one explains 99.9%, the latter should obviously be chosen. But when different models use different functional forms, the problem of choosing a model is significantly complicated.

29. Using the Box-Cox test.

More generally, when considering alternative models with the same dependent variable definition, the choice is straightforward. It is most reasonable to evaluate the regression on the basis of all probable functions, stopping at the function that most explains the change in the dependent variable. If the coefficient of determination measures in one case the proportion of variance explained by the regression, and in the other - the proportion of variance explained by the regression of the logarithm of this dependent variable, then the choice is made without difficulty. It is another matter when these values ​​for the two models are very close and the problem of choice becomes much more complicated.

Then the standard procedure, the Box-Cox test, should be applied. If it is only necessary to compare models that use the effective factor and its logarithm as variants of the dependent variable, the Zarembka variant of the test is used. It proposes a transformation of the scale of the observations of Y that makes it possible to compare the residual sums of squares (RMSE) of the linear and logarithmic models directly. The corresponding procedure includes the following steps:

    The geometric mean of the Y values ​​in the sample is calculated, which coincides with the exponent of the arithmetic mean of the logarithm of Y;

    Observations Y are recalculated in such a way that they are divided by the value obtained at the first step;

    Regression is estimated for the linear model using the recalculated Y values ​​instead of the original Y values ​​and for the logarithmic model using the log of the recalculated Y values. The RMS values ​​for the two regressions are now comparable and therefore the model with the lower sum of squares of deviations provides a better fit with the true relationship of the observed values;

    To check whether one of the models provides a significantly better fit, one can use the product of half the number of observations and the logarithm of the ratio of the RMSE values in the recalculated regressions, and then take the absolute value of this quantity.
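A sketch of this procedure on hypothetical data, assuming NumPy; the regressor is kept the same in both models, and only the dependent variable is rescaled and, in the second model, logged:

import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(1.0, 10.0, size=50)
y = 2.0 * x ** 0.9 * np.exp(rng.normal(0, 0.1, size=50))   # hypothetical data

g = np.exp(np.mean(np.log(y)))        # step 1: geometric mean of Y
y_star = y / g                        # step 2: rescaled observations

def rss(target, regressor):
    b, a = np.polyfit(regressor, target, 1)
    resid = target - (a + b * regressor)
    return resid @ resid

rss_linear = rss(y_star, x)           # step 3: linear model on the rescaled Y ...
rss_log = rss(np.log(y_star), x)      # ... and log model on the log of the rescaled Y
print(rss_linear, rss_log)            # the smaller sum indicates the better fit

n = len(y)
stat = abs(0.5 * n * np.log(rss_linear / rss_log))   # step 4: comparison statistic
print(stat)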

30. The concept of intercorrelation and multicollinearity of factors.

34. The basics of OLS (the least squares method) and the validity of its application.

Let us now turn to the basics of OLS, the validity of its application (including in multiple regression problems) and the most important properties of the estimates obtained by OLS. Let us start with the fact that, along with the analytical dependence, the random term on the right-hand side of the regression equation also plays an important role. This random component is an unobservable quantity. The statistical tests of regression parameters and of correlation indicators themselves rest on unverifiable assumptions about the distribution of this random component of the multiple regression. These assumptions are only preliminary: only after the regression equation has been constructed is it checked whether the estimates of the random residuals (the empirical analogues of the random component) have the properties assumed a priori. In essence, when the parameters of the model are estimated, the differences between the theoretical and actual values of the effective indicator are calculated in order to estimate the random component itself in this way. It is important to keep in mind that this is only a sample realization of the unknown residual of the given equation.

The regression coefficients obtained from the system of normal equations are sample estimates of the strength of the relationship. It is clear that they are of practical importance only when they are unbiased. Recall that in this case the mean of the residuals is zero, or, what is the same, the mean of the estimate equals the estimated parameter itself. Then the residuals do not accumulate over a large number of sample estimates, and the found regression parameter itself can be regarded as the average of a large number of unbiased estimates.

In addition, the estimates should have the smallest variance, i.e. to be effective and then it becomes possible to move from practically unsuitable point estimates to interval estimation. Finally, confidence intervals are applicable with a high degree of efficiency when the probability of obtaining an estimate at a given distance from the true (unknown) parameter value is close to unity. Such estimates are called consistent and the consistency property is characterized by an increase in their accuracy with an increase in the sample size.

However, consistency is not satisfied automatically; it essentially depends on the fulfilment of the following two important requirements. First, the residuals themselves must be stochastic, with pronounced randomness, i.e. all clearly functional dependences must be included precisely in the analytical component of the multiple regression, and in addition the residual values must be distributed independently of each other for different samples (no autocorrelation of residuals). The second, no less important requirement is that the variance of each deviation (residual) be the same for all values of the variables X (homoscedasticity). That is, homoscedasticity is expressed by the constancy of the variance for all observations: σ²(εᵢ) = σ² = const.

On the contrary, heteroscedasticity is the violation of such constancy of variance for different observations. In this case, the a priori (before observations) probability of obtaining strongly deviated values ​​with different theoretical distributions of the random term for different observations in the sample will be relatively high.

Autocorrelation of residuals, or the presence of a correlation between the residuals of current and previous (subsequent) observations, is seen by the value of the usual linear correlation coefficient. If it differs significantly from zero, then the residuals are autocorrelated and, therefore, the probability density function (distribution of residuals) depends on the observation point and on the distribution of the residual values ​​at other observation points. It is convenient to determine the autocorrelation of the residuals from the available statistical information in the presence of the ordering of observations by factor X. The absence of autocorrelation of the residuals ensures the consistency and efficiency of the estimates of the regression coefficients.

35. Homoscedasticity and heteroscedasticity, autocorrelation of residuals, generalized least squares method (GLS).

The uniformity of the variances of the residuals for all values of the variables X, or homoscedasticity, is also absolutely necessary for obtaining consistent estimates of the regression parameters by the least squares method. Violation of the homoscedasticity condition leads to so-called heteroscedasticity. It can lead to bias in the estimates of the regression coefficients, but its main effect is a decrease in the efficiency of the estimates of the regression coefficients. It also becomes especially difficult to use the formula for the standard error of the regression coefficient, whose use assumes a single variance of the residuals for any values of the factor. As for the unbiasedness of the estimates of the regression coefficients, it depends first of all on the independence of the residuals and the values of the factors themselves.

A rather illustrative, though not rigorous, way of checking homoscedasticity is a graphical study of how the residuals depend on the calculated (theoretical) values of the effective trait, or of the corresponding correlation fields. Analytical methods of investigating and assessing heteroscedasticity are more rigorous. If heteroscedasticity is significantly present, it is advisable to use generalized least squares (GLS) instead of OLS.

In addition to the requirements for multiple regression arising from the use of OLS, it is also necessary to comply with the conditions on the variables included in the model. These, first of all, include the requirements regarding the number of model factors for a given volume of observations (1 to 7). Otherwise, the regression parameters will be statistically insignificant. From the point of view of the effectiveness of the application of the corresponding numerical methods in the implementation of the LSM, it is necessary that the number of observations exceeds the number of estimated parameters (in the system of equations, the number of equations is more than the number of sought variables).

The most significant achievement of econometrics is the substantial development of methods for estimating unknown parameters and the refinement of criteria for detecting the statistical significance of the effects under consideration. In this connection, the impossibility or inexpediency of using ordinary OLS because of heteroscedasticity, manifested to one degree or another, led to the development of generalized least squares (GLS). In essence, this corrects the model, changes its specification, and transforms the initial data in order to ensure the unbiasedness, efficiency and consistency of the estimates of the regression coefficients.

It is assumed that the mean of the residuals is zero, but their variance is no longer constant: it is proportional to the values K i, where these values are proportionality coefficients that differ for different values of the factor x. Thus, it is precisely these coefficients (the values K i) that characterize the non-uniformity of the variance. Naturally, the value of the variance itself, which is the common factor at these proportionality coefficients, is considered unknown.

The original model, after introducing these coefficients into the multiple regression equation, continues to remain heteroscedastic (more precisely, these are the residual values ​​of the model). Let these residuals (residuals) be not autocorrelated. We introduce new variables obtained by dividing the initial variables of the model, fixed as a result of the i-th observation, by the square root of the proportionality coefficients K i. Then we get a new equation in transformed variables, in which the remainders are already homoscedastic. The new variables themselves are weighted old (original) variables.

Therefore, estimating the parameters of the new equation obtained in this way, with homoscedastic residuals, reduces to the weighted least squares method (which is in fact a form of GLS). When deviations from the means are used instead of the regression variables themselves, the expressions for the regression coefficients take a simple and standardized (uniform) form, differing for OLS and GLS only by the correction factor 1/K in the numerator and denominator of the fraction that gives the regression coefficient.

It should be kept in mind that the parameters of the transformed (corrected) model depend significantly on the assumption underlying the proportionality coefficients K i. It is often assumed that the residuals are simply proportional to the values of the factor. The model takes the simplest form when the hypothesis is accepted that the errors are proportional to the values of the last factor in order. GLS then increases, in comparison with ordinary OLS applied to the original variables, the weight of observations with smaller values of the transformed variables when determining the regression parameters. These new variables, however, acquire a different economic content.

The hypothesis that the residuals are proportional to the size of the factor may well have a real justification. Suppose some insufficiently homogeneous set of data is processed, for example one including both large and small enterprises. Large values of the factor may then correspond both to a large variance of the effective indicator and to a large variance of the residual values. Further, the use of GLS and the corresponding transition to relative values not only reduces the variation of the factor but also decreases the variance of the error. The simplest case of taking heteroscedasticity into account and correcting for it in regression models is thus realized through GLS.
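A sketch of this weighted least squares correction on hypothetical simulated data (assuming NumPy, and assuming for illustration that the error variance is proportional to x itself, i.e. K i = x i):

import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(1.0, 20.0, size=80)
K = x                                            # assumed proportionality coefficients
eps = rng.normal(0, 1.0, size=80) * np.sqrt(K)   # heteroscedastic error
y = 5.0 + 2.0 * x + eps

w = 1.0 / np.sqrt(K)                             # weights 1 / sqrt(K_i)
# transformed regression: y/sqrt(K) = a*(1/sqrt(K)) + b*(x/sqrt(K)) + homoscedastic error
X = np.column_stack([w, x * w])
coef, *_ = np.linalg.lstsq(X, y * w, rcond=None)
a_wls, b_wls = coef
print(a_wls, b_wls)                              # close to the assumed 5.0 and 2.0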

The above approach to implementing GLS in the form of weighted least squares is quite practical: it is simple to implement and has a transparent economic interpretation. Of course, this is not the most general approach; in the framework of mathematical statistics, which serves as the theoretical basis of econometrics, a much more rigorous method implementing GLS in its most general form is available. In it one needs to know the covariance matrix of the error vector (the column of residuals). In practical situations, as a rule, this matrix is unknown and cannot be found as such. Therefore, generally speaking, the required matrix has to be estimated somehow, and such an estimate is then used in the corresponding formulas instead of the matrix itself. The described variant of implementing GLS represents one such estimate. It is sometimes called feasible (accessible) generalized least squares.

It should also be taken into account that the coefficient of determination cannot serve as a satisfactory measure of the quality of fit when GLS is used. Returning to the use of OLS, we also note that the method of standard deviations (standard errors) in White's form (the so-called heteroscedasticity-consistent standard errors) is quite general. This method is applicable provided that the covariance matrix of the error vector is diagonal. If there is autocorrelation of the residuals (errors), i.e. there are nonzero elements (coefficients) in the covariance matrix outside the main diagonal, then the more general method of standard errors in the Newey-West form should be used. It carries a significant restriction: the nonzero elements off the main diagonal lie only on adjacent diagonals, no more than a certain distance away from the main diagonal.

It is clear from what has been said that one must be able to check the data for heteroscedasticity. The tests below serve this purpose. They test the main hypothesis of equal variances of the residuals against the alternative hypothesis (of inequality of these variances). In addition, a priori structural assumptions are made about the nature of the heteroscedasticity. In the Goldfeld-Quandt test, as a rule, it is assumed that the variance of the error (residual) depends directly on the value of some independent variable. The scheme for applying this test is as follows. First, the data are ordered by the independent variable for which heteroscedasticity is suspected. Then, in this ordered data set, a number of central observations are excluded, where this number is about a quarter (25%) of the total number of cases. Next, two independent regressions are run: one for the first group of remaining observations and one for the last group of remaining observations. After that, the two corresponding residual sums of squares are computed. Finally, Fisher's F-statistic is formed; if the hypothesis under study is true, this ratio indeed follows the Fisher distribution with the corresponding degrees of freedom. A large value of this statistic means that the tested hypothesis must be rejected. Without the step of excluding observations, the power of this test decreases.
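A rough sketch of this scheme on hypothetical data, assuming NumPy and SciPy; the ordering direction and the exact share of excluded observations are illustrative choices:

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 60
x = np.sort(rng.uniform(1.0, 20.0, size=n))             # ordered by the suspect variable
y = 5.0 + 2.0 * x + rng.normal(0, 1.0, size=n) * x      # variance grows with x

def rss(xs, ys):
    b, a = np.polyfit(xs, ys, 1)
    resid = ys - (a + b * xs)
    return resid @ resid

drop = n // 4                                           # exclude roughly 25% central observations
k = (n - drop) // 2                                     # size of each outer group
rss_low, rss_high = rss(x[:k], y[:k]), rss(x[-k:], y[-k:])

F = rss_high / rss_low                                  # larger variance group in the numerator
p = 1 - stats.f.cdf(F, dfn=k - 2, dfd=k - 2)            # k - 2 degrees of freedom in each group
print(F, p)                                             # a small p-value rejects homoscedasticity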

The Breusch-Pagan test is used when it is assumed a priori that the variances depend on some additional variables. First, an ordinary (standard) regression is run and the vector of residuals is obtained. Then the variance is estimated. Next, a regression is run for the vector of squared residuals divided by the empirical variance (the variance estimate). For this auxiliary regression the explained part of the variation is found, and the test statistic is built as half of this explained part of the variation. If the null hypothesis is true (there is no heteroscedasticity), this quantity has a chi-square distribution. If, on the contrary, the test reveals heteroscedasticity, the original model is transformed by dividing the components of the residual vector by the corresponding components of the vector of observed independent variables.
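A sketch of this construction for a single suspected variable, on hypothetical data and assuming NumPy and SciPy:

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 80
x = rng.uniform(1.0, 20.0, size=n)
y = 5.0 + 2.0 * x + rng.normal(0, 1.0, size=n) * x      # heteroscedastic errors

b, a = np.polyfit(x, y, 1)                              # ordinary regression
resid = y - (a + b * x)
sigma2 = resid @ resid / n                              # variance estimate

u = resid ** 2 / sigma2                                 # squared residuals / variance estimate
b2, a2 = np.polyfit(x, u, 1)                            # auxiliary regression of u on x
u_hat = a2 + b2 * x
explained = np.sum((u_hat - u.mean()) ** 2)             # explained part of the variation

bp = explained / 2                                      # Breusch-Pagan statistic
p = 1 - stats.chi2.cdf(bp, df=1)                        # one suspected variable: 1 degree of freedom
print(bp, p)                                            # a small p-value indicates heteroscedasticity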

36. Method of standard deviations in the form of White.

The following conclusions can be drawn. The use of GLS in the presence of heteroscedasticity reduces to minimizing the sum of weighted squared deviations. The use of feasible GLS is associated with the need for a large number of observations, exceeding the number of estimated parameters. The most favourable case for the use of GLS is when the error (the residuals) is proportional to one of the independent variables, and the resulting estimates are then consistent. If, nevertheless, in a model with heteroscedasticity it is necessary to use not GLS but the standard OLS, then to obtain consistent estimates of the errors one can use the error estimates in the White or Newey-West form.

When analysing time series, it is often necessary to take into account the statistical dependence of observations at different points in time. In this case the assumption of uncorrelated errors is not satisfied. Consider a simple model in which the errors form a first-order autoregressive process. The errors then satisfy a simple recurrence relation, on the right-hand side of which one term is a sequence of independent, normally distributed random variables with zero mean and constant variance, and the second term is the product of the parameter (the autoregressive coefficient) and the value of the residual at the previous moment in time. The sequence of error values (residuals) itself forms a stationary random process. A stationary random process is characterized by the constancy of its characteristics over time, in particular of the mean and the variance. In this case the covariance matrix of interest (its terms) can easily be written out using powers of the parameter.

An autoregressive model with a known parameter is estimated using GLS. In this case it suffices to reduce the original model, by a simple transformation, to a model whose errors satisfy the conditions of the standard regression model. The situation in which the autoregressive parameter is known is very rare, so estimation generally has to be carried out with an unknown autoregressive parameter. There are three most commonly used procedures for such estimation: the Cochrane-Orcutt method, the Hildreth-Lu procedure and the Durbin method.

In general, the following conclusions hold. Time-series analysis requires correction of ordinary OLS, since the errors in this case are usually correlated. Often these errors form a first-order stationary autoregressive process. OLS estimates under first-order autoregression are unbiased and consistent, but not efficient. With a known autoregression coefficient, GLS reduces to simple transformations (corrections) of the original system followed by the use of standard OLS. If, as is more often the case, the autoregression coefficient is unknown, then there are several feasible GLS procedures, which consist in estimating the unknown parameter (coefficient), after which the same transformations are applied as in the previous case of a known parameter.
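A sketch of one such step in the spirit of the Cochrane-Orcutt procedure, on hypothetical simulated data and assuming NumPy: estimate the autoregression coefficient from the OLS residuals, then apply OLS to the quasi-differenced (transformed) variables.

import numpy as np

rng = np.random.default_rng(7)
n = 200
x = rng.uniform(0.0, 10.0, size=n)
eps = np.zeros(n)
for t in range(1, n):                                   # errors follow an AR(1) process
    eps[t] = 0.7 * eps[t - 1] + rng.normal(0, 1.0)
y = 3.0 + 1.5 * x + eps

b, a = np.polyfit(x, y, 1)                              # ordinary OLS
e = y - (a + b * x)
rho = np.sum(e[1:] * e[:-1]) / np.sum(e[:-1] ** 2)      # estimate of the AR(1) coefficient

y_star = y[1:] - rho * y[:-1]                           # quasi-differenced variables
x_star = x[1:] - rho * x[:-1]
b_co, a_co = np.polyfit(x_star, y_star, 1)
a_co = a_co / (1 - rho)                                 # recover the intercept of the original model
print(rho, a_co, b_co)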

37. The concept of the Breusch-Pagan test and the Goldfeld-Quandt test

Let us test the hypothesis H 0 that the individual regression coefficients are equal to zero (against the alternative H 1 that they are not) at the significance level α = 0.05.

If the main hypothesis turns out to be wrong, we accept the alternative. To test this hypothesis, the Student's t-test is used.

The value of the t-test found from the observational data (it is also called the observed or actual) is compared with the tabular (critical) value determined from the Student distribution tables (which are usually given at the end of textbooks and workshops on statistics or econometrics).

The tabular value is determined depending on the significance level (α) and the number of degrees of freedom, which in the case of linear pairwise regression equals (n − 2), where n is the number of observations.

If the actual value of the t-criterion is greater than the tabular one (in absolute value), then the main hypothesis is rejected and it is considered that, with probability (1 − α), the parameter or statistical characteristic in the general population differs significantly from zero.

If the actual value of the t-criterion is less than the tabular one (in absolute value), then there is no reason to reject the main hypothesis, i.e. the parameter or statistical characteristic in the general population does not differ significantly from zero at the significance level α.

t crit (n − m − 1; α/2) = t(30; 0.025) = 2.042

Since 1.7 < 2.042, the statistical significance of the regression coefficient b is not confirmed (we accept the hypothesis that this coefficient equals zero). This means that in this case the coefficient b can be neglected.

Since 0.56 < 2.042, the statistical significance of the regression coefficient a is not confirmed (we accept the hypothesis that this coefficient equals zero). This means that in this case the coefficient a can be neglected.

Confidence interval for the coefficients of the regression equation.

Let us determine the confidence intervals of the regression coefficients, which with a reliability of 95% will be as follows:

  • (b - t crit S b; b + t crit S b)
  • (0.64 - 2.042 * 0.38; 0.64 + 2.042 * 0.38)
  • (-0.13;1.41)

Since the point 0 (zero) lies within the confidence interval, the interval estimate of the coefficient b is statistically insignificant.

  • (a - t crit S a; a + t crit S a)
  • (24.56 - 2.042 * 44.25; 24.56 + 2.042 * 44.25)
  • (-65.79;114.91)

With a probability of 95%, it can be stated that the value of this parameter lies in the found interval.

Since the point 0 (zero) lies within the confidence interval, the interval estimate of the coefficient a is statistically insignificant.

2) F-statistics. Fisher's criterion.

The coefficient of determination R 2 is used to test the significance of the linear regression equation as a whole.

Checking the significance of the regression model is carried out using the Fisher's F-test, the calculated value of which is found as the ratio of the variance of the initial series of observations of the studied indicator and the unbiased estimate of the variance of the residual sequence for this model.

If the calculated value with k 1 = (m) and k 2 = (n-m-1) degrees of freedom is greater than the tabular value for a given level of significance, then the model is considered significant.

where m is the number of factors in the model.

The statistical significance of paired linear regression is estimated using the following algorithm:

  • 1. A null hypothesis is put forward that the equation as a whole is statistically insignificant: H 0: R 2 = 0 at the significance level α.
  • 2. Next, the actual value of the F-criterion is determined:

F = R² / (1 − R²) · (n − m − 1) / m,

where m = 1 for paired regression.

3. The tabular value is determined from the Fisher distribution tables for the given significance level, taking into account that the number of degrees of freedom for the factorial sum of squares (the larger variance) is 1 and the number of degrees of freedom for the residual sum of squares (the smaller variance) in linear regression is n − 2.

F table is the maximum possible value of the criterion under the influence of random factors for the given degrees of freedom and significance level α. The significance level α is the probability of rejecting the null hypothesis when it is in fact true. Usually α is taken to be 0.05 or 0.01.

4. If the actual value of the F-criterion is less than the tabular one, then they say that there is no reason to reject the null hypothesis.

Otherwise, the null hypothesis is rejected and the alternative hypothesis about the statistical significance of the equation as a whole is accepted with probability (1 − α).

Tabular value of the criterion with degrees of freedom k 1 = 1 and k 2 = 30, F tab = 4.17

Since the actual value F < F table, the coefficient of determination is not statistically significant (the found estimate of the regression equation is statistically unreliable).

The relationship between Fisher's F-test and Student's t-statistic for paired regression is expressed by the equality F = t b².
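A quick numerical check of this equality on hypothetical data, assuming NumPy:

import numpy as np

x = np.array([45.0, 47.0, 55.0, 57.0, 59.0, 61.0, 62.0])   # hypothetical data
y = np.array([69.0, 64.0, 60.0, 58.0, 56.0, 55.0, 50.0])
n = len(x)

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x
resid = y - y_hat

F = np.sum((y_hat - y.mean()) ** 2) / (resid @ resid / (n - 2))
m_b = np.sqrt((resid @ resid / (n - 2)) / np.sum((x - x.mean()) ** 2))
t_b = b / m_b
print(F, t_b ** 2)                                          # the two values coincide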

Regression equation quality indicators.

Check for autocorrelation of residuals.

An important prerequisite for constructing a high-quality regression model by OLS is the independence of the values of the random deviations from the values of the deviations in all other observations. This ensures the absence of correlation between any deviations and, in particular, between adjacent deviations.

Autocorrelation (serial correlation) is defined as the correlation between observed indicators ordered in time (time series) or in space (cross-sectional series). Autocorrelation of residuals (deviations) is common in regression analysis when time-series data are used and very rare when cross-sectional data are used.

In economic problems, positive autocorrelation is much more common than negative autocorrelation. In most cases, positive autocorrelation is caused by the directional constant influence of some factors that were not taken into account in the model.

Negative autocorrelation effectively means that a positive deviation is followed by a negative one, and vice versa. This situation may occur if the same relationship between the demand for soft drinks and income is considered according to seasonal data (winter-summer).

Among the main reasons for autocorrelation are the following:

  • 1. Specification errors. Failure to include an important explanatory variable in the model, or a wrong choice of the form of dependence, usually leads to systematic deviations of the observation points from the regression line, which can produce autocorrelation.
  • 2. Inertia. Many economic indicators (inflation, unemployment, GNP, etc.) have a certain cyclical character associated with the wave-like form of business activity. Therefore, indicators do not change instantly but with a certain inertia.
  • 3. Cobweb effect. In many industrial and other areas, economic indicators respond to changes in economic conditions with a delay (time lag).
  • 4. Data smoothing. Often, data for a fairly long period are obtained by averaging data over its constituent intervals. This can smooth out fluctuations that were present within the period under consideration, which in turn can cause autocorrelation.

The consequences of autocorrelation are similar to those of heteroscedasticity: conclusions from t- and F-statistics that determine the significance of the regression coefficient and the coefficient of determination may be incorrect.
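A sketch of the simplest check mentioned above: the ordinary correlation coefficient between adjacent residuals, computed on hypothetical time-ordered data and assuming NumPy.

import numpy as np

rng = np.random.default_rng(8)
n = 120
x = np.linspace(1.0, 12.0, n)
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = 0.6 * eps[t - 1] + rng.normal(0, 1.0)       # positively autocorrelated errors
y = 10.0 + 0.8 * x + eps

b, a = np.polyfit(x, y, 1)
e = y - (a + b * x)

r1 = np.corrcoef(e[1:], e[:-1])[0, 1]                    # correlation of adjacent residuals
print(r1)                                                # noticeably different from zero here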