Linear Regression and Correlation: Meaning and Parameter Estimation


Fig. 2.1. Regression line plot

The first expression allows us, for given values of the factor x, to calculate the theoretical values of the effective indicator by substituting the actual values of the factor x into it. On the graph, the theoretical values lie on a straight line, which represents the regression line (Fig. 2.1).

The construction of linear regression reduces to estimating its parameters a and b. The classical approach to estimating linear regression parameters is based on the method of least squares (OLS).

OLS yields estimates of the parameters a and b for which the sum of squared deviations of the actual values of y from the theoretical values is minimal:

S(a, b) = Σ(y − ŷ)² → min. (4)

To find the minimum, it is necessary to calculate the partial derivatives of the sum (4) with respect to each of the parameters, a and b, and to equate them to zero:

∂S/∂a = −2·Σ(y − a − b·x) = 0,
∂S/∂b = −2·Σ(y − a − b·x)·x = 0. (5)

Transforming, we obtain the system of normal equations:

n·a + b·Σx = Σy,
a·Σx + b·Σx² = Σxy. (6)

In this system, n is the sample size, and the sums are easily calculated from the source data. Solving the system for a and b, we get:

b = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²), (7)

a = (Σy − b·Σx) / n. (8)

Expression (7) can be written in another form:

b = cov(x, y) / σ²x, (9)

where cov(x, y) is the covariance of the features and σ²x is the variance of the factor x.
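As an illustration, here is a minimal sketch in Python (assuming NumPy is available) of estimating a and b by formulas (8) and (9); the function name and the toy data are ours, chosen for illustration only:

```python
import numpy as np

def pairwise_ols(x, y):
    """Estimate a and b in y = a + b*x via b = cov(x, y) / var(x), formula (9)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov_xy = np.mean(x * y) - x.mean() * y.mean()   # cov(x, y)
    var_x = np.mean(x ** 2) - x.mean() ** 2         # sigma_x^2
    b = cov_xy / var_x                              # formula (9)
    a = y.mean() - b * x.mean()                     # formula (8)
    return a, b

# toy usage with made-up numbers
a, b = pairwise_ols([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.1, 9.8])
print(f"a = {a:.3f}, b = {b:.3f}")
```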

The parameter b is called the regression coefficient. Its value shows the average change in the result per unit change in the factor. The possibility of a clear economic interpretation of the regression coefficient has made the linear pairwise regression equation quite common in econometric research.

Formally, a is the value of y at x = 0. If x does not and cannot have a zero value, then such an interpretation of the free term a makes no sense. The parameter a may have no economic content, and attempts to interpret it economically can lead to absurdity, especially when a < 0. Only the sign of the parameter a can be interpreted: if a > 0, the relative change in the result is slower than the relative change in the factor. Let us compare these relative changes:

Δy / y < Δx / x  for  y > 0, x > 0, a > 0.

Sometimes the linear pairwise regression equation is written in deviations from the means:

y′ = b·x′ + ε, (10)

where y′ = y − ȳ and x′ = x − x̄. In this case the free term is zero, as reflected in expression (10). This fact follows from geometric considerations: the same straight line (3) corresponds to the regression equation, but when the regression is estimated in deviations, the origin of coordinates moves to the point with coordinates (x̄, ȳ). Then both sums in expression (8) are equal to zero, which makes the free term zero.

Let us consider, as an example, the regression dependence of production costs y on output x for a group of enterprises producing a single type of product.

Table 2.1

Production output, thousand units (x)    Production costs, million rubles (y)
…                                        31.1
…                                        67.9
…                                        141.6
…                                        104.7
…                                        178.4
…                                        104.7
…                                        141.6
Total: 22                                770.0

The system of normal equations will have the form:

Solving it, we get a = −5.79, b = 36.84.

The regression equation is: ŷ = −5.79 + 36.84·x.

Substituting the values of x into the equation, we find the theoretical values of y (last column of the table).

The magnitude a has no economic meaning here. If the variables x and y are expressed in deviations from their mean levels, the regression line on the graph passes through the origin. In this case, the estimate of the regression coefficient will not change:

b = Σ(x′·y′) / Σ(x′)², where x′ = x − x̄, y′ = y − ȳ.

In linear regression, the linear correlation coefficient r acts as an indicator of the tightness of the relationship: r = b·σx/σy.

The quantity 1 − r² characterizes the proportion of the variance of y caused by the influence of other factors not taken into account in the model.

2.3. OLS prerequisites (Gauss-Markov conditions)

The connection between y and x in paired regression is not functional but correlational. Therefore, the parameter estimates a and b are random variables whose properties essentially depend on the properties of the random component ε. To obtain the best results from OLS, the following prerequisites concerning the random deviation (the Gauss-Markov conditions) must be met:

1. The expected value of the random deviation is zero for all observations: E(εi) = 0.

2. The variance of the random deviations is constant: D(εi) = σ² for all i.

Fulfillment of this prerequisite is called homoscedasticity (constancy of the variance of the deviations); its violation is called heteroscedasticity (non-constancy of the variance of the deviations).

3. The random deviations εi and εj are independent of each other for i ≠ j.

Satisfaction of this condition is called lack of autocorrelation.

4. The random deviation must be independent of the explanatory variables. This condition is usually fulfilled automatically if the explanatory variables in the model are not random. Moreover, fulfillment of this prerequisite is less critical for econometric models than the first three.

If the above prerequisites are met, then the Gauss-Markov theorem holds: the estimates (7) and (8) obtained by least squares have the smallest variance in the class of all linear unbiased estimates.

Thus, under the Gauss-Markov conditions, estimates (7) and (8) are not only unbiased estimates of the regression coefficients but also the most efficient, i.e., they have the smallest variance of any estimates of these parameters that are linear in the values yi.

It is precisely an understanding of the importance of the Gauss-Markov conditions that distinguishes a competent researcher using regression analysis from an incompetent one. If these conditions are not met, the researcher must be aware of it. If corrective action is possible, the analyst should be able to take it. If the situation cannot be remedied, the researcher should be able to assess how severely it could affect the results.

2.4. Assessing the significance of the parameters of linear regression and correlation

After the linear regression equation (3) has been found, the significance of both the equation as a whole and its individual parameters is assessed.

The significance of the regression equation as a whole is assessed using Fisher's F-test. A null hypothesis H0 is put forward that the regression coefficient is zero, b = 0, and therefore the factor x does not affect the result y.

An analysis of variance is performed before calculating the test statistic. It can be shown that the total sum of squared deviations (SS) of y from its mean value decomposes into two parts, explained and unexplained:


Σ(y − ȳ)² = Σ(ŷ − ȳ)² + Σ(y − ŷ)². (13)

(total SS) = (factor SS) + (residual SS)

Two extreme cases are possible here: the total SS is exactly equal to the residual SS, or the total SS is equal to the factor SS.

In the first case, the factor x has no effect on the result: all the variance of y is due to the influence of other factors, the regression line is parallel to the Ox axis, and ŷ = ȳ.

In the second case, other factors do not affect the result: y is functionally related to x, and the residual SS is zero.

In practice, both terms on the right-hand side of (13) are present. The suitability of the regression line for forecasting depends on what part of the total variation of y falls on the explained variation. If the explained SS is greater than the residual SS, the regression equation is statistically significant and the factor x has a significant impact on the result y. This is equivalent to the coefficient of determination approaching one.

The number of degrees of freedom (df) is the number of independently varying values of a characteristic.

For the total SS, (n − 1) independent deviations are required: since the sum of the deviations (y − ȳ) over all n observations equals zero, only (n − 1) of them can vary freely, and the last, n-th, deviation is determined from the others. Therefore df_total = n − 1.

The factor SS can be expressed as

Σ(ŷ − ȳ)² = b²·Σ(x − x̄)². (14)

This SS depends on only one parameter, b, since the expression under the summation sign does not involve the values of the effective characteristic. Consequently, the factor SS has one degree of freedom: df_fact = 1.

To determine df_resid, we use the analogy with the balance equality (13). Just as in (13), we can write an equality between the numbers of degrees of freedom: df_total = df_fact + df_resid. Thus, n − 1 = 1 + df_resid, and from this balance we determine that df_resid = n − 2.

Dividing each SS by its number of degrees of freedom, we obtain the mean square of deviations, or the variance per one degree of freedom:

D_total = Σ(y − ȳ)² / (n − 1); (15)

D_fact = Σ(ŷ − ȳ)² / 1; (16)

D_resid = Σ(y − ŷ)² / (n − 2). (17)

Comparing the factor and residual variances per one degree of freedom, we obtain the F-test for checking the null hypothesis, which in this case is written as H0: D_fact = D_resid:

F = D_fact / D_resid. (18)

If H0 is true, the variances do not differ from each other. To reject H0, the factor variance must exceed the residual variance several times over.

The statistician Snedecor developed tables of critical values of F for different significance levels and different numbers of degrees of freedom. The tabulated value of the F-test is the maximum value of the ratio of the variances that can occur by chance, at a given probability level, when the null hypothesis is true.

When finding the tabulated value of the F-test, the significance level (usually 0.05 or 0.01) and two numbers of degrees of freedom are set: that of the numerator (equal to one) and that of the denominator (equal to n − 2).

The calculated value of F is recognized as reliable (significantly different from one) if it is greater than the tabulated value, i.e. F > F_table(α; 1; n − 2). In this case H0 is rejected, and a conclusion is drawn about the significance of the excess of D_fact over D_resid, i.e., about the significance of the statistical link between y and x.

If F < F_table(α; 1; n − 2), then the probability of the null hypothesis is higher than the given level (for example, 0.05), and it cannot be rejected without serious risk of drawing an incorrect conclusion about the presence of a connection between y and x. The regression equation is then considered statistically insignificant, and H0 is not rejected.

The F-test value is related to the coefficient of determination r²:

F = r² / (1 − r²) · (n − 2). (19)
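The decomposition (13), the variances (15)-(17), and the equivalence of (18) and (19) can be checked numerically. A minimal NumPy sketch (the function name and toy data are ours, for illustration):

```python
import numpy as np

def anova_f(x, y):
    """Check the SS decomposition (13) and compute F by (18); verify it against (19)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(y)
    b = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x**2) - x.mean()**2)
    a = y.mean() - b * x.mean()
    y_hat = a + b * x
    ss_total = np.sum((y - y.mean()) ** 2)      # df = n - 1
    ss_fact = np.sum((y_hat - y.mean()) ** 2)   # df = 1
    ss_resid = np.sum((y - y_hat) ** 2)         # df = n - 2
    assert np.isclose(ss_total, ss_fact + ss_resid)   # decomposition (13)
    F = (ss_fact / 1) / (ss_resid / (n - 2))          # (16)/(17) -> (18)
    r2 = ss_fact / ss_total
    assert np.isclose(F, r2 / (1 - r2) * (n - 2))     # identity (19)
    return F

print(anova_f([1, 2, 3, 4, 5, 6], [2.2, 3.8, 6.1, 8.3, 9.7, 12.2]))
```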

In linear regression, it is usual to assess the significance not only of the equation as a whole but also of its individual parameters.

The standard error of the regression coefficient is determined by the formula:

m_b = √( S²_resid / Σ(x − x̄)² ), (20)

where S²_resid = Σ(y − ŷ)² / (n − 2) is the residual variance per one degree of freedom (the same as D_resid in (17)).

The value of the standard error, together with Student's t-distribution with (n − 2) degrees of freedom, is used to test the significance of the regression coefficient and to calculate its confidence intervals.

The magnitude of the regression coefficient is compared with its standard error; the actual value of Student's t-test

t_b = b / m_b (21)

is then compared with the tabulated value at a certain significance level α and (n − 2) degrees of freedom. Here the null hypothesis, in the form H0: b = 0, again assumes the absence of a statistical relationship between y and x, but it concerns only the value of b rather than the ratio between the factor and residual variances in the overall balance of the variance of the effective trait. The general meaning of the hypotheses is the same: testing for the presence or absence of a statistical relationship between y and x.

If t_b > t_table(α; n − 2), the hypothesis should be rejected, and the statistical relationship of y with x is considered established. If t_b < t_table(α; n − 2), the null hypothesis cannot be rejected, and the influence of x on y is recognized as insignificant.

There is a connection between t_b and F:

t_b² = F. (22)

Hence it follows that t_b = √F.

The confidence interval for b is defined as

b ± t_table · m_b, (23)

where b is the value of the regression coefficient calculated (estimated) by OLS.

The standard error of the parameter a is determined by the formula:

m_a = S_resid · √( Σx² / (n · Σ(x − x̄)²) ). (24)

The procedure for assessing the significance of a does not differ from that for the parameter b. The actual value of the t-test is calculated by the formula:

t_a = a / m_a. (25)
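A sketch of computing (20)-(25) in Python, assuming NumPy and SciPy are available; scipy.stats.t.ppf supplies the tabulated Student value (the function name is ours):

```python
import numpy as np
from scipy import stats

def parameter_tests(x, y, alpha=0.05):
    """Standard errors (20) and (24), t-statistics (21) and (25), CI for b by (23)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(y)
    b = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x**2) - x.mean()**2)
    a = y.mean() - b * x.mean()
    s2_resid = np.sum((y - a - b * x) ** 2) / (n - 2)   # residual variance per df
    sxx = np.sum((x - x.mean()) ** 2)
    m_b = np.sqrt(s2_resid / sxx)                       # (20)
    m_a = np.sqrt(s2_resid * np.sum(x**2) / (n * sxx))  # (24)
    t_b, t_a = b / m_b, a / m_a                         # (21), (25)
    t_tab = stats.t.ppf(1 - alpha / 2, n - 2)           # tabulated Student value
    ci_b = (b - t_tab * m_b, b + t_tab * m_b)           # (23)
    return t_a, t_b, ci_b
```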

The procedure for testing the significance of the linear correlation coefficient differs from the procedures above. This is because r, as a random variable, is distributed according to the normal law only for a large number of observations and small values of |r|. In this case, the hypothesis of the absence of correlation between y and x is tested on the basis of the statistic

t_r = r · √(n − 2) / √(1 − r²), (26)

which is approximately distributed according to Student's law with (n − 2) degrees of freedom. If t_r > t_table, the hypothesis is rejected with a probability of error not exceeding α. From (19) it can be seen that in paired linear regression t_r² = F; in addition, t_b² = F, and therefore |t_r| = |t_b|. Thus, testing the hypotheses about the significance of the regression and correlation coefficients is equivalent to testing the hypothesis about the significance of the linear regression equation.

For small samples and values of r close to ±1, however, it should be borne in mind that the distribution of r as a random variable differs from the normal one, and confidence intervals for r cannot be constructed in the standard way. In this case it is easy to arrive at a contradiction: the confidence interval would contain values greater than one in absolute value.

To get around this difficulty, the so-called Fisher z-transform is used:

z = (1/2) · ln( (1 + r) / (1 − r) ), (27)

which gives an approximately normally distributed value z whose values, as r changes from −1 to +1, range from −∞ to +∞. The standard error of this value is:

m_z = 1 / √(n − 3). (28)

For the value z there are tables that give its values for the corresponding values of r.

For z, the null hypothesis of the absence of correlation is put forward. The hypothesis is not rejected if the value of the statistic

t = z / m_z = z · √(n − 3), (29)

which is distributed according to Student's law with (n − 3) degrees of freedom, does not exceed the tabulated value at the appropriate significance level.

For each value of z, the critical values of r can be calculated. Tables of critical values of r are designed for significance levels of 0.05 and 0.01 and the corresponding numbers of degrees of freedom. If the calculated value of r exceeds the tabulated one in absolute value, this value of r is considered significant; otherwise, the actual value is insignificant.
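A minimal sketch of a confidence interval for r built via (27)-(28); note that it uses the normal quantile on the z scale (a common variant, close for moderate n to the Student quantile mentioned above), and by construction stays inside (−1, 1):

```python
import numpy as np
from scipy import stats

def r_conf_interval(r, n, alpha=0.05):
    """CI for r: transform by (27), step q*m_z by (28), map back with tanh."""
    z = 0.5 * np.log((1 + r) / (1 - r))   # (27), identical to np.arctanh(r)
    m_z = 1 / np.sqrt(n - 3)              # (28)
    q = stats.norm.ppf(1 - alpha / 2)     # normal quantile on the z scale
    return np.tanh(z - q * m_z), np.tanh(z + q * m_z)

print(r_conf_interval(0.72, 12))  # e.g. r = 0.72 from n = 12 observations
```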

2.5. Nonlinear Regression Models and Their Linearization

So far, we have considered only the linear regression model of y on x (3). At the same time, many important relationships in economics are nonlinear. Examples of such regression models are production functions (the relationship between the volume of output and the main factors of production: labor, capital, etc.) and demand functions (the relationship between the demand for a certain kind of good or service, on the one hand, and income and the prices of this and other goods, on the other).

When analyzing nonlinear regression dependences, the most important issue in applying classical OLS is how to linearize them. Once a nonlinear relationship has been linearized, we obtain a linear regression equation of type (3) whose parameters are estimated by ordinary least squares, after which the original nonlinear relationship can be written out.

The polynomial model of arbitrary degree stands somewhat apart in this sense:

y = a + b1·x + b2·x² + … + bm·x^m + ε, (30)

since ordinary OLS can be applied to it without any preliminary linearization.

Consider this procedure for a parabola of the second degree:

y = a + b·x + c·x² + ε. (31)

Such a dependence is advisable when, over a certain interval of factor values, an increasing dependence changes to a decreasing one or vice versa. In that case it is possible to determine the value of the factor at which the maximum or minimum value of the effective attribute is reached. If the original data do not show a change in the direction of the relationship, the parameters of the parabola become difficult to interpret, and it is better to replace this form of relationship with other nonlinear models.

Using OLS to estimate the parameters of a second-degree parabola reduces to differentiating the sum of squared regression residuals with respect to each of the estimated parameters and equating the resulting expressions to zero. This yields a system of normal equations whose number equals the number of estimated parameters, i.e., three:

Σy = n·a + b·Σx + c·Σx²,
Σxy = a·Σx + b·Σx² + c·Σx³, (32)
Σx²y = a·Σx² + b·Σx³ + c·Σx⁴.

This system can be solved in any way, in particular, by the method of determinants.

The extreme value of the function is reached at the factor value

x = −b / (2c).

If c < 0, this is a maximum: the dependence first rises and then falls. This kind of relationship is observed in labor economics when studying the wages of manual workers, with age playing the role of the factor. If c > 0, the parabola has a minimum, which usually manifests itself in unit production costs as a function of output volume.
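A sketch of fitting the parabola (31) by solving the normal equations (32) in matrix form and locating the extremum; the data below are made up purely for illustration:

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5., 6.])
y = np.array([4.8, 7.9, 9.1, 9.0, 7.8, 5.2])   # toy data with a visible maximum

X = np.column_stack([np.ones_like(x), x, x**2])
# (X^T X) coef = X^T y -- written out element-wise, this is exactly system (32)
a, b, c = np.linalg.solve(X.T @ X, X.T @ y)
x_extremum = -b / (2 * c)   # maximum if c < 0, minimum if c > 0
print(a, b, c, x_extremum)
```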

In nonlinear dependences that are not classical polynomials, preliminary linearization is always carried out; it consists in transforming either the variables or the model parameters, or in a combination of these transformations. Let us consider some classes of such dependences.

Dependences of the hyperbolic type have the form:

y = a + b/x + ε. (33)

An example of such a dependence is the Phillips curve, which states an inverse dependence of the percentage growth of wages on the unemployment rate. In this case, the value of the parameter b is greater than zero.

Another example of dependence (33) is the Engel curve, which formulates the following regularity: as income grows, the share of income spent on food decreases, while the share of income spent on non-food products increases. In this case, the resultant indicator in (33) is the share of expenditures on non-food products.

The linearization of equation (33) reduces to replacing the factor z = 1/x, after which the regression equation takes the form (3), with the factor z used instead of x: y = a + b·z + ε. (34)

The semilogarithmic curve reduces to the same linear equation:

y = a + b·ln x + ε, (35)

which can also be used to describe Engel curves. Here ln(x) is replaced by z, and equation (34) is obtained.

A fairly wide class of economic indicators is characterized by an approximately constant rate of relative growth over time. This corresponds to dependences of the exponential type, which are written in the form:

y = e^(a + b·x) · E, (36)

or in the form

y = a · e^(b·x) · E. (37)

The following dependence is also possible:

y = a · b^x · E. (38)

Regressions of types (36)-(38) use the same linearization method: taking logarithms. Equation (36) is reduced to the form:

ln y = a + b·x + ln E. (39)

The change of variable Y = ln y reduces it to the linear form:

Y = a + b·x + E′, (40)

where E′ = ln E. If E′ satisfies the Gauss-Markov conditions, the parameters of equation (36) are estimated by OLS from equation (40). Equation (37) is reduced to the form:

ln y = ln a + b·x + ln E, (41)

which differs from (39) only in the form of the free term, and the linear equation looks like this:

Y = A + b·x + E′, (42)

where A = ln a. The parameters A and b are obtained by ordinary least squares, and the parameter a of dependence (37) is then found as the antilogarithm of A. Taking the logarithm of (38), we obtain a linear dependence:

Y = A + B·x + E′, (43)

where A = ln a, B = ln b, and the rest of the designations are the same as above. OLS is again applied to the transformed data, and the parameter b for (38) is found as the antilogarithm of the coefficient B.

Power dependences are widespread in the practice of socio-economic research. They are used to build and analyze production functions. In functions of the type:

y = a · x^b · E, (44)

it is especially valuable that the parameter b equals the coefficient of elasticity of the effective trait with respect to the factor x. Transforming (44) by taking logarithms, we obtain the linear regression:

ln y = ln a + b·ln x + ln E. (45)

Another type of nonlinearity that can be reduced to linear form is the inverse dependence:

y = 1 / (a + b·x + ε). (46)

By the replacement z = 1/y we obtain the linear equation z = a + b·x + ε.
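The linearizations of (33), (35), (37) and (44) can be summarized in one NumPy sketch: each model is fitted by ordinary pairwise OLS on transformed variables (the helper ols and the toy data are ours):

```python
import numpy as np

def ols(z, w):
    """Plain pairwise OLS of w on z; returns (intercept, slope)."""
    slope = (np.mean(z * w) - z.mean() * w.mean()) / (np.mean(z**2) - z.mean()**2)
    return w.mean() - slope * z.mean(), slope

x = np.array([1., 2., 4., 8., 16.])
y = np.array([9.1, 5.2, 3.1, 2.0, 1.4])   # made-up, roughly hyperbolic data

a33, b33 = ols(1 / x, y)            # hyperbola (33): regress y on z = 1/x
a35, b35 = ols(np.log(x), y)        # semilog (35): regress y on ln x
A37, b37 = ols(x, np.log(y))        # exponential (37): regress ln y on x
a37 = np.exp(A37)                   # back-transform the free term, as for (42)
A44, b44 = ols(np.log(x), np.log(y))  # power (44): slope b44 is the elasticity
a44 = np.exp(A44)
```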

For the territories of a region, the following data are given for year 200X:

Region number   Average per capita subsistence minimum per day of one able-bodied person, rubles (x)   Average daily wage, rubles (y)
1 78 133
2 82 148
3 87 134
4 79 154
5 89 162
6 106 195
7 67 139
8 88 158
9 73 152
10 87 162
11 76 159
12 115 173

Exercise:

1. Build a correlation field and formulate a hypothesis about the form of the relationship.

2. Calculate the parameters of the linear regression equation

4. Using the average (general) coefficient of elasticity, give a comparative assessment of the strength of the relationship between the factor and the result.

7. Calculate the predicted value of the result if the predicted value of the factor increases by 10% from its average level. Determine the confidence interval of the forecast for the significance level α = 0.05.

Solution:

Let's solve this problem using Excel.

1. Comparing the available data on x and y, for example by ranking them in ascending order of the factor x, one can observe a direct relationship between the features: an increase in the average per capita subsistence minimum is accompanied by an increase in the average daily wage. Based on this, we can assume that the connection between the features is direct and can be described by the equation of a straight line. The same conclusion is confirmed by graphical analysis.

To build the correlation field, you can use Excel. Enter the initial data in sequence: first x, then y.

Select the area of ​​cells containing data.

Then choose: Insert / Scatter chart / Scatter with markers as shown in Figure 1.

Figure 1 Plotting the correlation field

Analysis of the correlation field shows a dependence close to linear, since the points lie practically on a straight line.

2. To calculate the parameters of the linear regression equation ŷ = a + b·x, let us use the built-in statistical function LINEST.

For this:

1) Open an existing file containing the analyzed data;
2) Select a blank 5 × 2 cell area (5 rows, 2 columns) to display the regression statistics results.
3) Activate Function wizard: in the main menu select Formulas / Insert Function.
4) In the Category window select Statistical, and in the Function window select LINEST. Click the OK button, as shown in Figure 2;

Figure 2 Function Wizard Dialog Box

5) Fill in the function arguments:

Known_y's — the range containing the data of the effective attribute y;

Known_x's — the range containing the data of the factor attribute x;

Const — a logical value indicating the presence or absence of a free term in the equation: if Const = 1, the free term is calculated in the usual way; if Const = 0, the free term is set to 0;

Stats — a logical value indicating whether additional regression statistics should be displayed: if Stats = 1, the additional information is displayed; if Stats = 0, only the estimates of the equation parameters are displayed.

Click the button OK;

Figure 3 LINEST function arguments dialog box

6) The first element of the output table will appear in the upper left cell of the selected area. To display the entire table, press F2 and then the key combination CTRL+SHIFT+ENTER.

Additional regression statistics will be displayed in the order shown in the following scheme:

Value of the coefficient b          Value of the coefficient a
Standard error of b                 Standard error of a
Coefficient of determination r²     Standard error of y
F-statistic                         Number of degrees of freedom
Regression sum of squares           Residual sum of squares

Figure 4 The result of calculating the LINEST function

We got the regression equation: ŷ = 76.98 + 0.92·x.

We conclude: with an increase in the average per capita subsistence minimum by 1 ruble, the average daily wage increases on average by 0.92 rubles.

The coefficient of determination is r² = 0.52. This means that 52% of the variation in wages (y) is explained by the variation of the factor x (the average per capita subsistence minimum), and 48% by the action of other factors not included in the model.

From the calculated coefficient of determination, the correlation coefficient can be found: r = √r² = √0.52 ≈ 0.72.

The connection is assessed as close.
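As a cross-check of the LINEST output, here is a minimal NumPy sketch; the arrays are typed in from the data table above, and the printed values should agree with the figures reported here up to rounding:

```python
import numpy as np

# Data for the 12 regions from the table above.
x = np.array([78, 82, 87, 79, 89, 106, 67, 88, 73, 87, 76, 115], float)
y = np.array([133, 148, 134, 154, 162, 195, 139, 158, 152, 162, 159, 173], float)

b = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x**2) - x.mean()**2)
a = y.mean() - b * x.mean()
r = b * x.std() / y.std()                # r = b * sigma_x / sigma_y
elasticity = b * x.mean() / y.mean()     # average coefficient of elasticity
print(f"a = {a:.2f}, b = {b:.2f}, r2 = {r**2:.2f}, E = {elasticity:.2f}")
# expected: b = 0.92, r2 = 0.52, E = 0.51, in line with the results reported here
```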

4. Using the average (general) coefficient of elasticity, we determine the strength of the factor's influence on the result.

For the equation of a straight line, the average (general) coefficient of elasticity is determined by the formula E = b · x̄ / ȳ.

Find the mean values: select the area of cells with the x values and choose Formulas / AutoSum / Average; then do the same with the values of y.

Figure 5 Calculation of the mean values ​​of the function and the argument

Thus, if the average per capita subsistence minimum changes by 1% of its average value, the average daily wage will change on average by 0.51%.

Using the data analysis tool Regression, you can obtain:
- the regression statistics results,
- the analysis-of-variance results,
- the confidence intervals,
- the residuals and the regression line fit plots,
- the residuals and the normal probability plot.

The procedure is as follows:

1) Check access to the Analysis ToolPak. In the main menu, select in sequence: File / Options / Add-ins.

2) In the Manage dropdown, select Excel Add-ins and press Go.

3) In the Add-ins window, check the Analysis ToolPak box and then click OK.

If the Analysis ToolPak is not in the Available add-ins list, press Browse to locate it.

If a message appears stating that the Analysis ToolPak is not installed on your computer, click Yes to install it.

4) In the main menu, sequentially select: Data / Data Analysis / Analysis Tools / Regression and then click OK.

5) Complete the data entry and output parameters in the dialog box:

Input Y Range — the range containing the data of the effective attribute;

Input X Range — the range containing the data of the factor attribute;

Labels — a checkbox indicating whether or not the first row contains column names;

Constant is Zero — a checkbox indicating the presence or absence of a free term in the equation;

Output Range — it is enough to indicate the upper left cell of the future results range;

6) New Worksheet Ply — you can set an arbitrary name for the new sheet.

Then press the button OK.

Figure 6 Dialog box for entering parameters of the Regression tool

The results of the regression analysis for the task data are presented in Figure 7.

Figure 7 Result of applying the regression tool

5. Let us estimate the quality of the equation using the average approximation error. We use the results of the regression analysis presented in Figure 8.

Figure 8 Result of using the "Residual output" regression tool

Let's compose a new table as shown in Figure 9. In column C we calculate the relative approximation error by the formula Ai = |(yi − ŷi) / yi| · 100%.

Figure 9 Calculation of the average approximation error

The average approximation error is calculated by the formula Ā = (1/n) · ΣAi.

The quality of the constructed model is assessed as good, since Ā does not exceed 8-10%.
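The same calculation in a self-contained NumPy sketch (data repeated from the table above):

```python
import numpy as np

x = np.array([78, 82, 87, 79, 89, 106, 67, 88, 73, 87, 76, 115], float)
y = np.array([133, 148, 134, 154, 162, 195, 139, 158, 152, 162, 159, 173], float)
b = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x**2) - x.mean()**2)
a = y.mean() - b * x.mean()

A_i = np.abs((y - (a + b * x)) / y) * 100   # relative error of each observation, %
print(f"mean approximation error = {A_i.mean():.1f}%")  # model is deemed good if under 8-10%
```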

6. From the table with the regression statistics (Figure 4) we write out the actual value of Fisher's F-test.

Since the actual value exceeds the tabulated one at the 5% significance level, we conclude that the regression equation is significant (the relationship is proven).

8. We assess the statistical significance of the regression parameters using Student's t-statistics and by calculating the confidence interval for each of the indicators.

We put forward the hypothesis H0 that the indicators are statistically insignificant, i.e., do not differ from zero:

a = b = r = 0.

The tabulated value is t_table = 2.23 for α = 0.05 and the number of degrees of freedom df = n − 2 = 12 − 2 = 10.

Figure 7 shows the actual values ​​of the t-statistic:

The t-test for the correlation coefficient can be calculated in two ways:

Method I:

t_r = r / m_r,

where m_r = √( (1 − r²) / (n − 2) ) is the random error of the correlation coefficient.

We take the data for the calculation from the table in Figure 7.

Method II, by formula (26): t_r = r · √(n − 2) / √(1 − r²).

The actual t-statistic values exceed the tabulated one:

t_a, t_b, t_r > t_table.

Therefore, the hypothesis H0 is rejected: the regression parameters and the correlation coefficient do not differ from zero by chance but are statistically significant.

The confidence interval for the parameter a is defined as a ± t_table · m_a.

For the parameter a, the 95% bounds, as shown in Figure 7, were:

The confidence interval for the regression coefficient is defined as b ± t_table · m_b.

For the regression coefficient b, the 95% bounds, as shown in Figure 7, were:

Analysis of the upper and lower bounds of the confidence intervals leads to the conclusion that, with probability p = 1 − α = 0.95, the parameters a and b, lying within the indicated bounds, do not take zero values, i.e., they are statistically significant and differ materially from zero.

7. The obtained estimates of the regression equation allow us to use it for forecasting. If the forecast value of the subsistence minimum is x_p = 1.1 · x̄ ≈ 94.1 rubles,

then the predicted value of the average daily wage will be ŷ_p = 76.98 + 0.92 · 94.1 ≈ 163.6 rubles.

We calculate the forecast error using the formula:

m_ŷ = S_resid · √( 1 + 1/n + (x_p − x̄)² / Σ(x − x̄)² ),

where S_resid = √D_resid is the standard error of the regression.

We also calculate the variance using Excel. For this:

1) Activate Function wizard: in the main menu select Formulas / Insert Function.

2) In the Statistical category, select the variance function.

3) Fill in the range containing the numerical data of the factor attribute. Click OK.

Figure 10 Calculation of variance

We obtained the value of the variance.

To calculate the residual variance per degree of freedom, we use the ANOVA results as shown in Figure 7.

Confidence intervals for predicting individual values of y at x = x_p with a probability of 0.95 are determined by the expression ŷ_p ± t_table · m_ŷ.

The interval is fairly wide, primarily because of the small number of observations. On the whole, the forecast of the average daily wage has turned out to be reliable.
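The whole forecasting step in one self-contained sketch (NumPy and SciPy assumed; data repeated from the table above):

```python
import numpy as np
from scipy import stats

x = np.array([78, 82, 87, 79, 89, 106, 67, 88, 73, 87, 76, 115], float)
y = np.array([133, 148, 134, 154, 162, 195, 139, 158, 152, 162, 159, 173], float)
n = len(x)
b = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x**2) - x.mean()**2)
a = y.mean() - b * x.mean()

x_p = 1.10 * x.mean()                      # factor raised 10% above its mean
y_p = a + b * x_p                          # point forecast
s_resid = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))
m_pred = s_resid * np.sqrt(1 + 1 / n + (x_p - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2))
t_tab = stats.t.ppf(0.975, n - 2)          # alpha = 0.05
print(f"forecast {y_p:.1f}, interval ({y_p - t_tab * m_pred:.1f}, {y_p + t_tab * m_pred:.1f})")
```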

The problem statement is taken from: Workshop on Econometrics: A Textbook / I.I. Eliseeva, S.V. Kurysheva, N.M. Gordeenko et al.; ed. I.I. Eliseeva. Moscow: Finance and Statistics, 2003. 192 p.

When estimating the parameters of the regression equation, the method of least squares (OLS) is used, under certain prerequisites concerning the random component ε. In the model, the random component ε is an unobservable quantity. After the model parameters have been estimated, calculating the differences between the actual and theoretical values of the effective indicator y gives estimates of the random component, the residuals ei. Since they are not the true random deviations, they can be considered a sample realization of the unknown residual of the given equation.

When the model specification is changed or new observations are added, the sample estimates of the residuals ei may change. Therefore, the task of regression analysis includes not only building the model itself but also investigating the random deviations ei, i.e., the residual values.

When Fisher's and Student's tests are used, assumptions are made about the behavior of the residuals ei: the residuals are independent random variables, their mean value is 0, they have the same (constant) variance, and they are normally distributed.

Statistical tests of the regression parameters and of the correlation indicators rest on unverifiable assumptions about the distribution of the random component ei, so they are only preliminary. After the regression equation is constructed, it is checked whether the estimates ei (the random residuals) possess the assumed properties. This is because the estimates of the regression parameters must meet certain criteria: they must be unbiased, consistent, and efficient. These properties of the estimates obtained by OLS are extremely important for the practical use of the results of regression and correlation analysis.

Unbiasedness of the estimates means that the mathematical expectation of the residuals is zero. If the estimates possess the property of unbiasedness, they can be compared across different studies.

Estimates are considered efficient if they have the smallest variance. In practical research, this means the possibility of passing from point estimation to interval estimation.

Consistency of the estimates is characterized by an increase in their accuracy as the sample size grows. Of great practical interest are those regression results for which the confidence interval of the expected value of the regression parameter bi has a probability limit equal to one; in other words, the probability of obtaining an estimate at a given distance from the true value of the parameter is close to one.

The specified evaluation criteria (unbiasedness, consistency, and efficiency) must be taken into account when different estimation methods are used. The method of least squares constructs regression estimates by minimizing the sum of squared residuals, so it is very important to investigate the behavior of the regression residuals ei. The conditions necessary for obtaining unbiased, consistent, and efficient estimates are the OLS prerequisites, whose fulfillment is desirable for obtaining reliable regression results.

Investigation of the residuals ei involves testing the following five OLS prerequisites:

1. the residuals are random in nature;

2. the residuals have a zero mean value that does not depend on xi;

3. homoscedasticity: the variance of each deviation ei is the same for all values of x;

4. absence of autocorrelation of the residuals: the values of the residuals ei are distributed independently of each other;

5. the residuals follow a normal distribution.

If the distribution of random residuals ei does not correspond to some of the assumptions of the OLS, then the model should be adjusted.

First of all, the random nature of the residuals ei is checked (the first OLS prerequisite). For this purpose, a plot of the residuals ei against the theoretical values of the effective trait ŷ is constructed.



As another example of pairwise linear regression, consider a consumption function of the form:

C = K·y + L,

where C is consumption, y is income, and K and L are parameters. This linear regression equation is usually used in conjunction with the balance equality:

y = C + I + r,

where I is the amount of investment and r is saving.

For simplicity, assume that income is spent on consumption and investment. Thus, the following system of equations is considered:

C = K·y + L,  y = C + I.

The presence of the balance equality imposes a restriction on the value of the regression coefficient, which cannot be greater than one, i.e. K ≤ 1.

Suppose the estimated consumption function is:

Ĉ = 0.65·y + L.

The regression coefficient characterizes the propensity to consume. It shows that of every thousand rubles of income, an average of 650 rubles is spent on consumption and 350 rubles is invested. If we calculate the regression of the amount of investment on income, the regression coefficient there will be 0.35. That equation need not be estimated separately, since it is derived from the consumption function: the regression coefficients of the two equations are related by the equality 0.65 + 0.35 = 1.

If the regression coefficient turns out to be greater than one, then not only income but also savings are spent on consumption.

The regression coefficient in the consumption function is used to calculate the multiplier:

m = 1 / (1 − K).

Here m = 1/(1 − 0.65) ≈ 2.86; therefore, additional investment of 1 thousand rubles over a long period will, other things being equal, lead to additional income of 2.86 thousand rubles.

In linear regression, the linear correlation coefficient r acts as an indicator of the tightness of the relationship:

r = b · σx / σy. (11)

Its values lie in the range −1 ≤ r ≤ 1. If b > 0, then 0 ≤ r ≤ 1; if b < 0, then −1 ≤ r ≤ 0. In the example, r = √0.982 ≈ 0.99, which means a very close dependence of production costs on the volume of output.

To assess the quality of the fit of the linear function, the coefficient of determination is calculated as the square of the linear correlation coefficient, r². It characterizes the proportion of the variance of the effective trait y explained by the regression in the total variance of the effective trait:

r² = Σ(ŷ − ȳ)² / Σ(y − ȳ)². (12)

The quantity 1 − r² characterizes the proportion of the variance of y caused by the influence of other factors not taken into account in the model.

In the example, r² = 0.982: the regression equation explains 98.2% of the variance of y, while other factors account for the remaining 1.8% (the residual variance).



Economic phenomena are usually determined by a large number of simultaneously and cumulatively acting factors. In this regard, the problem often arises of investigating the dependence of a variable y on several explanatory variables (x1, x2, …, xk). This problem can be solved by multiple correlation-regression analysis.

When investigating such a dependence by multiple regression methods, the problem is formulated in the same way as with paired regression: it is required to determine an analytical expression for the form of the relationship between the effective attribute y and the factor attributes x1, x2, …, xk, that is, to find the function y = f(x1, x2, …, xk), where k is the number of factor attributes.

Multiple regression is widely used in solving problems of demand and stock returns, in studying production cost functions, in macroeconomic calculations, and in a number of other econometric problems. It is currently one of the most common methods in econometrics. The main goal of multiple regression is to build a model with a large number of factors and to determine the influence of each of them separately, as well as their combined effect on the modeled indicator.

Because of the peculiarities of the least squares method, in multiple regression, as in paired regression, only linear equations and equations reducible to linear form by transformation of variables are used. The linear equation is used most often and can be written as follows:

y = a0 + a1·x1 + a2·x2 + … + ak·xk + ε,

where a0, a1, …, ak are the model parameters (regression coefficients) and ε is a random variable (the residual component).

The regression coefficient aj shows by how much, on average, the effective indicator y will change if the variable xj is increased by one unit of measurement with the other factors in the regression equation held fixed (constant). The parameters attached to the x's are called coefficients of "pure" regression.

Example.

Suppose that the dependence of spending on food for a population of families is characterized by the following equation:

y = a0 + 0.35·x1 + 0.73·x2,

where y is the family's monthly spending on food, thousand rubles; x1 is the monthly income per family member, thousand rubles; x2 is the family size, persons.

Analysis of this equation allows us to draw conclusions: with an increase in income per family member by 1 thousand rubles, spending on food will increase by an average of 350 rubles with the same average family size. In other words, 35% of additional family spending goes on food. An increase in family size, with the same income, implies an additional increase in food spending of 730 rubles. The first parameter is not subject to economic interpretation.

The reliability of each model parameter is assessed using Student's t-test. For any parameter aj, the t-value is calculated by the formula t = aj / S_aj, where S_aj is the standard error of the coefficient aj.

S_ε, the standard (root-mean-square) error of the regression equation, is determined by the formula:

S_ε = √( Σ(y − ŷ)² / (n − k − 1) ).

The regression coefficient aj is considered sufficiently reliable if the calculated t-value with (n − k − 1) degrees of freedom exceeds the tabulated one, i.e., t_calc > t(α; n − k − 1). If the reliability of the regression coefficient is not confirmed, a conclusion is drawn about the insignificance of the factor attribute xj in the model and about the need to remove it from the model or replace it with another factor attribute.
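A minimal sketch of multiple OLS with t-tests for each coefficient, assuming NumPy and SciPy; the function name and the synthetic data are ours:

```python
import numpy as np
from scipy import stats

def multiple_ols(X, y, alpha=0.05):
    """OLS for y = a0 + a1*x1 + ... + ak*xk with a t-test for each coefficient."""
    X = np.column_stack([np.ones(len(y)), X])   # prepend the intercept column
    n, p = X.shape                              # p = k + 1
    coef = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ coef
    s2 = resid @ resid / (n - p)                # S_eps^2 with n - k - 1 df
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))  # standard errors S_aj
    t = coef / se
    t_tab = stats.t.ppf(1 - alpha / 2, n - p)
    return coef, t, np.abs(t) > t_tab           # reliable where |t| > t_tab

# usage with made-up data
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = 1.0 + 0.35 * X[:, 0] + 0.73 * X[:, 1] + rng.normal(scale=0.1, size=30)
print(multiple_ols(X, y))
```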

The coefficients of the regression model play an important role in assessing the influence of the factors. However, they cannot be used directly to compare factor attributes by their degree of influence on the dependent variable, because of differences in units of measurement and in degrees of variability. To eliminate such differences, partial coefficients of elasticity Ej and beta coefficients βj are used.

The formula for calculating the coefficient of elasticity is:

Ej = aj · x̄j / ȳ,

where aj is the regression coefficient of factor j, ȳ is the mean value of the effective trait, and x̄j is the mean value of factor j.

The coefficient of elasticity shows by what percentage the dependent variable y changes when factor j changes by 1%.

The formula for determining the beta coefficient is:

βj = aj · S_xj / S_y,

where S_xj is the standard deviation of factor j and S_y is the standard deviation of the result y.

The beta coefficient shows by what part of the standard deviation S_y the dependent variable y will change when the corresponding independent variable xj changes by the value of its standard deviation, the remaining independent variables being fixed.

The share of the influence of a particular factor in the total influence of all factors can be estimated by the delta coefficient Δj. The coefficients listed above make it possible to rank the factors by the degree of their influence on the dependent variable.

The formula for determining the delta coefficient is:

Δj = r_yj · βj / R²,

where r_yj is the pairwise correlation coefficient between factor j and the dependent variable, and R² is the multiple coefficient of determination.

The multiple coefficient of determination is used to assess the quality of multiple regression models. The formula for determining it is:

R² = 1 − Σ(y − ŷ)² / Σ(y − ȳ)².

The coefficient of determination shows the share of the variation of the effective trait that is under the influence of the factor traits, i.e., it determines what part of the variation of the trait y is taken into account in the model and is due to the factors included in it. The closer R² is to one, the higher the quality of the model.

When explanatory variables are added, the value of R² increases, so R² should be adjusted for the number of independent variables by the formula:

R̄² = 1 − (1 − R²) · (n − 1) / (n − k − 1).

Fisher's F-test is used to check the significance of the regression model. It is determined by the formula:

F = R² / (1 − R²) · (n − k − 1) / k.

If the calculated value of the test with ν1 = k and ν2 = (n − k − 1) degrees of freedom is greater than the tabulated one at a given significance level, the model is considered significant.

As a measure of the accuracy of the model, the standard error is used, the square root of the ratio of the sum of squares of the residual component to (n − k − 1):

S_ε = √( Σe² / (n − k − 1) ).

The classical approach to estimating the parameters of a linear model is based on the method of least squares (OLS). The system of normal equations in this case can be written in matrix form as (XᵀX)·a = Xᵀ·y.

The solution of the system can be carried out by one of the known methods: the Gauss method, Cramer's method, etc.
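A sketch that solves the normal equations numerically and reports the quality measures just introduced (R², adjusted R², F, and S_ε); the function name is ours:

```python
import numpy as np

def fit_and_assess(X, y, k):
    """Solve (X^T X) a = X^T y and compute R^2, adjusted R^2, F and S_eps."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])       # design matrix with intercept
    a = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)    # Gaussian elimination under the hood
    e = y - Xd @ a
    ss_res, ss_tot = e @ e, np.sum((y - y.mean()) ** 2)
    R2 = 1 - ss_res / ss_tot
    R2_adj = 1 - (1 - R2) * (n - 1) / (n - k - 1)
    F = R2 / (1 - R2) * (n - k - 1) / k
    S_e = np.sqrt(ss_res / (n - k - 1))         # standard error of the model
    return a, R2, R2_adj, F, S_e
```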

Example 15.

For four enterprises in the region (Table 41), the dependence of output per employee y (thousand rubles) on the commissioning of new fixed assets x1 (% of the value of assets at the end of the year) and on the share of highly qualified workers in the total number of workers x2 (%) is studied. It is required to write the multiple regression equation.

Table 41 - Dependence of production output per employee