Evaluation of the significance of the parameters of a linear regression and of the equation as a whole. Checking the significance of the regression equation as a whole

After the linear regression equation has been found, the significance of both the equation as a whole and of its individual parameters is assessed.

To check the significance of the regression equation means to establish whether the mathematical model expressing the relationship between the variables fits the experimental data, and whether the explanatory variables included in the equation (one or more) are sufficient to describe the dependent variable.

Significance testing is performed based on analysis of variance.

According to the idea of analysis of variance, the total sum of squares of deviations of y from the mean ȳ is decomposed into two parts, explained and unexplained:

Σ(y - ȳ)² = Σ(ŷ - ȳ)² + Σ(y - ŷ)²

or, respectively:

TSS = ESS + RSS,

where TSS is the total, ESS the explained (factor) and RSS the residual sum of squares.

Two extreme cases are possible here: when the total sum of squares is exactly equal to the residual one, and when the total sum of squares is equal to the factor (explained) one.

In the first case, the factor x does not affect the result: the entire variance of y is due to the influence of other factors, the regression line is parallel to the Ox axis, and the equation takes the form ŷ = ȳ.

In the second case, other factors do not affect the result, y is functionally related to x, and the residual sum of squares is zero.

In practice, however, both terms are present on the right-hand side. The suitability of the regression line for forecasting depends on what part of the total variation of y is accounted for by the explained variation. If the explained sum of squares is much greater than the residual one, the regression equation is statistically significant and the factor x has a substantial effect on the result y; equivalently, the coefficient of determination approaches one.

The number of degrees of freedom (df) is the number of independently varying values of a feature.

The total sum of squares requires (n - 1) independent deviations.

The factor sum of squares has one degree of freedom (in paired regression).

Thus, we can write the balance of degrees of freedom:

n - 1 = 1 + df_resid.

From this balance we determine that df_resid = n - 2.

Dividing each sum of squares by its number of degrees of freedom, we obtain the mean square of deviations, or variance per one degree of freedom: S²_total = Σ(y - ȳ)²/(n - 1) is the total variance, S²_fact = Σ(ŷ - ȳ)²/1 the factor variance, and S²_resid = Σ(y - ŷ)²/(n - 2) the residual variance.
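These relationships are easy to check numerically. The sketch below (plain Python, with a small hypothetical data set invented purely for illustration) estimates a paired regression by OLS and verifies the decomposition of the sum of squares and the variances per degree of freedom:

```python
import math

# Hypothetical sample (x, y); any data would do, the identity is exact under OLS
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 13.2, 14.8, 17.1, 19.0, 21.2]
n = len(x)

# OLS estimates of the paired regression y^ = a + b*x
x_mean, y_mean = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_mean) ** 2 for xi in x)
b = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / s_xx
a = y_mean - b * x_mean
y_hat = [a + b * xi for xi in x]

# Sums of squares: total = factor (explained) + residual
tss = sum((yi - y_mean) ** 2 for yi in y)
ess = sum((yh - y_mean) ** 2 for yh in y_hat)
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
assert math.isclose(tss, ess + rss)  # the decomposition holds exactly

# Variances per one degree of freedom
s2_total = tss / (n - 1)
s2_factor = ess / 1
s2_resid = rss / (n - 2)
```

For these made-up points the factor variance vastly exceeds the residual one, which is exactly the situation in which the regression is useful for forecasting.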

Analysis of the statistical significance of the linear regression coefficients

Although the theoretical coefficients of the linear equation are assumed to be constants, the estimates a and b of these coefficients, obtained when the equation is built from a random sample, are random variables. If the regression errors are normally distributed, the coefficient estimates are also normally distributed and can be characterized by their means and variances. The analysis of the coefficients therefore begins with calculating these characteristics.

The variances of the coefficients are calculated by the following formulas.

Variance of the regression coefficient b:

S_b² = S² / Σ(x - x̄)²,

where S² = Σ(y - ŷ)²/(n - 2) is the residual variance per one degree of freedom.

Variance of the parameter a:

S_a² = S² · Σx² / (n · Σ(x - x̄)²).

Hence, the standard error of the regression coefficient is determined by the formula m_b = √(S² / Σ(x - x̄)²), and the standard error of the parameter a by the formula m_a = √(S² · Σx² / (n · Σ(x - x̄)²)).

These errors serve to test the null hypotheses that the true value of the regression coefficient b or of the intercept a is zero: H0: b = 0 (respectively, H0: a = 0).

The alternative hypothesis is H1: b ≠ 0 (respectively, H1: a ≠ 0). The test statistics are t_b = b/m_b and t_a = a/m_a.

The t-statistics follow Student's t distribution with (n - 2) degrees of freedom. From tables of the Student distribution, for a chosen significance level α and (n - 2) degrees of freedom, the critical value t_table is found.

If |t| > t_table, the null hypothesis should be rejected and the coefficient is considered statistically significant.

If |t| ≤ t_table, the null hypothesis cannot be rejected. (If the coefficient b is statistically insignificant, the equation takes the form ŷ = ȳ, which means there is no relationship between the features. If the coefficient a is statistically insignificant, it is recommended to estimate a new equation of the form ŷ = bx.)
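A minimal sketch of this test on hypothetical data (the table value 2.306 is Student's t for α = 0.05 and 8 degrees of freedom):

```python
import math

# Hypothetical data invented for illustration
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 13.2, 14.8, 17.1, 19.0, 21.2]
n = len(x)
x_mean, y_mean = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_mean) ** 2 for xi in x)
b = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / s_xx
a = y_mean - b * x_mean
rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

s2_resid = rss / (n - 2)                 # residual variance per degree of freedom
m_b = math.sqrt(s2_resid / s_xx)         # standard error of b
m_a = math.sqrt(s2_resid * sum(xi ** 2 for xi in x) / (n * s_xx))  # standard error of a

t_b, t_a = b / m_b, a / m_a              # t-statistics for H0: b = 0 and H0: a = 0
t_table = 2.306  # Student's t, alpha = 0.05, n - 2 = 8 degrees of freedom (table value)
b_significant = abs(t_b) > t_table
a_significant = abs(t_a) > t_table
```

On this toy data both coefficients come out significant; with real data either conclusion is possible.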

Interval estimates of the coefficients of the linear regression equation:

Confidence interval for a: a ± t_table · m_a.

Confidence interval for b: b ± t_table · m_b.

This means that with a given reliability (1 - α), where α is the significance level, the true values of a and b lie within the indicated intervals.

The regression coefficient has a clear economic interpretation, so the confidence bounds of the interval should not contain contradictory results; for example, they should not include zero.
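The intervals can be computed the same way; the data and the table value below are illustrative assumptions:

```python
import math

# Hypothetical data invented for illustration
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 13.2, 14.8, 17.1, 19.0, 21.2]
n = len(x)
x_mean, y_mean = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_mean) ** 2 for xi in x)
b = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / s_xx
a = y_mean - b * x_mean
rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s2_resid = rss / (n - 2)
m_b = math.sqrt(s2_resid / s_xx)
m_a = math.sqrt(s2_resid * sum(xi ** 2 for xi in x) / (n * s_xx))

t_table = 2.306  # alpha = 0.05, n - 2 = 8 degrees of freedom (table value)
ci_b = (b - t_table * m_b, b + t_table * m_b)   # confidence interval for b
ci_a = (a - t_table * m_a, a + t_table * m_a)   # confidence interval for a

# A significant coefficient's interval should not include zero
assert not (ci_b[0] <= 0 <= ci_b[1])
```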

Analysis of the statistical significance of the equation as a whole.

Fisher distribution in regression analysis

The significance of the regression equation as a whole is assessed using Fisher's F-test. The null hypothesis is put forward that all regression coefficients, with the exception of the intercept a, are equal to zero and, consequently, the factor x does not affect the result y (H0: b = 0).

The value of the F-criterion is related to the coefficient of determination R². For multiple regression:

F = (R² / (1 - R²)) · ((n - m - 1)/m),

where m is the number of independent variables.

For paired regression the F-statistic takes the form: F = (R² / (1 - R²)) · (n - 2).

When finding the tabular value of the F-criterion, a significance level (usually 0.05 or 0.01) and two degrees of freedom are set: k₁ = m, k₂ = n - m - 1 in the case of multiple regression; k₁ = 1, k₂ = n - 2 for paired regression.

If F > F_table, then H0 is rejected and it is concluded that the statistical relationship between y and x is significant.

If F ≤ F_table, the regression equation is considered statistically insignificant and H0 is not rejected.

Comment. In paired linear regression t_b² = F. Moreover, t_r² = F, so testing the hypotheses about the significance of the regression coefficient and of the correlation coefficient is equivalent to testing the hypothesis about the significance of the linear regression equation.
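The equivalence t_b² = F is easy to confirm numerically (hypothetical data, invented for illustration):

```python
import math

# Hypothetical data; verifies that F = t_b^2 in paired linear regression
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 13.2, 14.8, 17.1, 19.0, 21.2]
n = len(x)
x_mean, y_mean = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_mean) ** 2 for xi in x)
b = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / s_xx
a = y_mean - b * x_mean
y_hat = [a + b * xi for xi in x]
tss = sum((yi - y_mean) ** 2 for yi in y)
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

r2 = 1 - rss / tss                  # coefficient of determination
F = r2 / (1 - r2) * (n - 2)         # F-statistic for paired regression
t_b = b / math.sqrt((rss / (n - 2)) / s_xx)
assert math.isclose(F, t_b ** 2)    # t_b^2 = F: the two tests agree
```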

The Fisher distribution can be used not only to test the hypothesis that all linear regression coefficients are simultaneously zero, but also the hypothesis that some of these coefficients are zero. This is important in the development of a linear regression model, since it allows one to assess the validity of excluding individual variables or their groups from the number of explanatory variables, or, conversely, including them in this number.

Suppose, for example, that a multiple linear regression with m explanatory variables was first estimated on n observations, with coefficient of determination R₁²; then the last k variables were excluded from the explanatory variables, and a new equation was estimated with coefficient of determination R₂² (R₁² ≥ R₂², since each additional variable explains some part, however small, of the variation of the dependent variable).

In order to test the hypothesis that the coefficients of the excluded variables are simultaneously equal to zero, the value

F = ((R₁² - R₂²)/k) / ((1 - R₁²)/(n - m - 1))

is calculated, having a Fisher distribution with k and (n - m - 1) degrees of freedom.

From the Fisher distribution tables, at the given significance level, the value F_table is found. If F > F_table, the null hypothesis is rejected; in this case it is incorrect to exclude all k variables from the equation.
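A sketch of this calculation with invented R² values (the table value F_table ≈ 3.20 for (2, 45) degrees of freedom at α = 0.05 is approximate; consult real tables for precise work):

```python
# Hypothetical values, purely illustrative: n observations, m regressors,
# then the last k regressors are dropped from the equation
n, m, k = 50, 4, 2
r2_full = 0.76       # R^2 of the full equation (assumed)
r2_reduced = 0.74    # R^2 after excluding the k variables (assumed)

F = ((r2_full - r2_reduced) / k) / ((1 - r2_full) / (n - m - 1))

# Approximate table value F_0.05(2, 45); an assumption, check real tables
F_table = 3.20
exclusion_valid = F <= F_table   # small F: the k variables may be excluded
```

Here F ≈ 1.88 is below the critical value, so dropping the two variables loses no significant explanatory power.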

Similar reasoning can be carried out about the validity of including one or several (k) new explanatory variables in the regression equation.

In this case the F-statistic is calculated by the same formula, with R₁² now the coefficient of determination after the inclusion, and it has a Fisher distribution with k and (n - m - 1) degrees of freedom. If it exceeds the critical value, then the inclusion of the new variables explains a significant part of the previously unexplained variance of the dependent variable (i.e., the inclusion of the new explanatory variables is justified).

Remarks. 1. It is advisable to include new variables one at a time.

2. When considering the inclusion of explanatory variables in the equation, it is desirable to calculate the F-statistic using the coefficient of determination adjusted for the number of degrees of freedom.

Fisher's F-statistic is also used to test the hypothesis that the regression equations for individual groups of observations coincide.

Let there be two samples containing n₁ and n₂ observations, respectively. For each of these samples, a regression equation of the form ŷ = a + bx was estimated. Let the residual sums of squares of deviations from the regression line be equal for them, respectively, to S₁ and S₂.

The null hypothesis is tested that all the corresponding coefficients of these equations are equal to each other, i.e., the regression equation is the same for both samples.

Let a regression equation of the same type be estimated at once on all n₁ + n₂ observations, with residual sum of squares S.

Then the F-statistic is calculated by the formula:

F = ((S - S₁ - S₂)/(m + 1)) / ((S₁ + S₂)/(n₁ + n₂ - 2(m + 1))).

It has a Fisher distribution with (m + 1) and (n₁ + n₂ - 2(m + 1)) degrees of freedom. The F-statistic will be close to zero if the equation is the same for both samples, since in that case S ≈ S₁ + S₂. So if F ≤ F_table, the null hypothesis is accepted.

If F > F_table, the null hypothesis is rejected, and a single regression equation cannot be constructed for both samples.
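A sketch of the test with assumed residual sums of squares (all numbers hypothetical):

```python
# Hypothetical inputs, purely illustrative
n1, n2, m = 30, 25, 1        # sample sizes and number of explanatory variables
s1, s2 = 12.0, 10.5          # residual SS of the two separate regressions (assumed)
s_pooled = 26.0              # residual SS of the single pooled regression (assumed)

k = m + 1                    # number of estimated coefficients per equation
df2 = n1 + n2 - 2 * k
F = ((s_pooled - s1 - s2) / k) / ((s1 + s2) / df2)
# F has a Fisher distribution with (k, n1 + n2 - 2k) degrees of freedom;
# F close to zero means s_pooled is close to s1 + s2, i.e. one common
# equation fits both samples about as well as two separate ones
```

With these assumed numbers F ≈ 3.97, which for (2, 51) degrees of freedom would typically exceed the 5% critical value, so a single pooled equation would be rejected.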

Final tests in econometrics

1. The estimation of the significance of the parameters of the regression equation is carried out on the basis of:

A) t - Student's criterion;

b) F-Fisher - Snedecor test;

c) root mean square error;

d) average approximation error.

2. The regression coefficient in the equation characterizing the relationship between the volume of products sold (million rubles) and the profit of automotive-industry enterprises for the year (million rubles) means that when the volume of products sold increases by 1 million rubles, profit increases by:

a) 0.5 mln. rub.;

b) 500 thousand rub.;

c) 1.5 million rub.

3. The correlation ratio (correlation index) measures the degree of closeness of the relationship between X and Y:

a) only with a nonlinear form of dependence;

B) for any form of dependence;

c) only with a linear relationship.

4. By direction, relationships are classified as:

a) moderate;

B) direct;

c) rectilinear.

5. Based on 17 observations, a regression equation was constructed. To check the significance of the equation, the observed value of the t-statistic was calculated: 3.9. Conclusion:

A) The equation is significant for a = 0,05;

b) The equation is insignificant at a = 0.01;

c) The equation is insignificant at a = 0.05.

6. What are the consequences of violating the OLS assumption "the mathematical expectation of the regression residuals is zero"?

A) Biased estimates of regression coefficients;

b) Effective but inconsistent estimates of regression coefficients;

c) Ineffective estimates of the regression coefficients;

d) Inconsistent estimates of the regression coefficients.

7. Which of the following statements is true in the case of heteroscedasticity of the residuals?

A) Conclusions on t and F-statistics are unreliable;

d) Estimates of the parameters of the regression equation are biased.

8. What is Spearman's rank correlation test based on?

A) On the use of t - statistics;

c) On use ;

9. What is White's test based on?

b) On the use of F - statistics;

C) On use ;

d) On the graphical analysis of the residues.

10. What method can be used to eliminate autocorrelation?

11. What is the violation of the assumption of the constant variance of the residuals called?

a) Multicollinearity;

b) Autocorrelation;

C) Heteroscedasticity;

d) Homoscedasticity.

12. Dummy variables are introduced into:

a) only in linear models;

b) only in multiple nonlinear regression;

c) only in nonlinear models;

D) both in linear and non-linear models, reduced to linear form.

13. If the matrix of paired correlation coefficients contains
, then this indicates:

A) On the presence of multicollinearity;

b) About the absence of multicollinearity;

c) On the presence of autocorrelation;

d) About the absence of heteroscedasticity.

14. With what measure is it impossible to get rid of multicollinearity?

a) Increasing the sample size;

D) Conversion of the random component.

15. If
and the rank of matrix A is less than (K - 1), then the equation is:

a) overidentified;

B) not identified;

c) exactly identified.

16. The regression equation is:

A)
;

b)
;

c)
.

17. What is the problem of identifying the model?

A) obtaining uniquely defined parameters of the model given by a system of simultaneous equations;

b) selection and implementation of methods for statistical estimation of unknown parameters of the model based on the initial statistical data;

c) checking the adequacy of the model.

18. What method is used to estimate the parameters of the overidentified equation?

C) DMNK, KMNK;

19. If the qualitative variable has k alternative values, then the modeling uses:

A) (k-1) dummy variable;

b) k dummy variables;

c) (k + 1) dummy variable.

20. The analysis of the tightness and direction of the relationship of the two signs is carried out on the basis of:

A) pair correlation coefficient;

b) the coefficient of determination;

c) multiple correlation coefficient.

21. In the linear equation y = a₀ + a₁x, the regression coefficient shows:

a) the tightness of communication;

b) the share of the variance of Y that depends on X;

B) how much, on average, "Y" will change when "X" changes by one unit;

d) error of the correlation coefficient.

22. What indicator is used to determine the part of the variation due to a change in the value of the factor under study?

a) coefficient of variation;

b) correlation coefficient;

B) coefficient of determination;

d) coefficient of elasticity.

23. The coefficient of elasticity shows:

A) by how many% the value of y will change when x changes by 1%;

b) by how many units of its measurement the value of y will change when x changes by 1%;

c) by how many % the value of y will change when x changes by one unit of its measurement.

24. What methods can be used to detect heteroscedasticity?

A) Goldfeld-Quandt test;

B) Spearman's rank correlation test;

c) Durbin-Watson test.

25. What is the Goldfeld-Quandt test based on?

a) On the use of t - statistics;

B) On the use of F - statistics;

c) On use ;

d) On the graphical analysis of the residues.

26. What methods cannot be used to eliminate the autocorrelation of residuals?

a) The generalized least squares method;

B) Weighted least squares method;

C) Maximum likelihood method;

D) Two-step least squares method.

27. What is the violation of the assumption of the independence of the residuals called?

a) Multicollinearity;

B) Autocorrelation;

c) Heteroscedasticity;

d) Homoscedasticity.

28. What method can be used to eliminate heteroscedasticity?

A) Generalized least squares method;

b) Weighted least squares method;

c) maximum likelihood method;

d) Two-step least squares method.

30. If, according to the t-criterion, most of the regression coefficients are statistically significant, while the model as a whole is insignificant according to the F-criterion, this may indicate:

a) Multicollinearity;

B) autocorrelation of the residuals;

c) heteroscedasticity of the residuals;

d) This option is not possible.

31. Is it possible to get rid of multicollinearity by transforming variables?

a) This measure is effective only when the sample size is increased;

32. What method can be used to find the parameter estimates of the linear regression equation:

A) the least squares method;

b) correlation regression analysis;

c) analysis of variance.

33. A multiple linear regression equation with dummy variables is constructed. To check the significance of individual coefficients, use distribution:

a) Normal;

b) Student's t;

c) Pearson;

d) Fisher-Snedecor.

34. If
and the rank of matrix A is greater than (K - 1), then the equation is:

A) overidentified;

b) not identified;

c) exactly identified.

35. To estimate the parameters of an accurately identifiable system of equations, the following is applied:

a) DMNK, KMNK;

b) DMNC, MNC, KMNK;

36. Chow's criterion is based on the application of:

A) F - statistics;

b) t - statistics;

c) the Durbin-Watson criterion.

37. Dummy variables can take on the following values:

d) any values.

39. For 20 observations, a regression equation was constructed. To check the significance of the equation, the statistic calculated is 4.2. Conclusions:

a) The equation is significant at a = 0.05;

b) The equation is insignificant at a = 0.05;

c) The equation is insignificant at a = 0.01.

40. Which of the following statements is not true in the case of heteroscedasticity of the residuals?

a) Conclusions on t and F statistics are unreliable;

b) Heteroscedasticity is manifested through the low value of the Durbin-Watson statistics;

c) In case of heteroscedasticity, the estimates remain effective;

d) Estimates are biased.

41. Chow's test is based on comparison:

A) dispersions;

b) the coefficients of determination;

c) mathematical expectations;

d) medium.

42. If in the Chow test
then it is considered:

A) that the division into subintervals is advisable from the point of view of improving the quality of the model;

b) the model is statistically insignificant;

c) the model is statistically significant;

d) that it makes no sense to split the sample into parts.

43. Dummy variables are variables:

a) qualitative;

b) random;

B) quantitative;

d) logical.

44. Which of the following methods cannot be applied to detect autocorrelation?

a) The method of series;

b) the Durbin-Watson criterion;

c) Spearman's rank correlation test;

D) White's test.

45. The simplest structural form of the model is:

A)

b)

c)

d)
.

46. What measures can be used to get rid of multicollinearity?

a) Increasing the sample size;

b) Exclusion of variables highly correlated with the rest;

c) Modification of the model specification;

d) Transformation of the random component.

47. If
and the rank of the matrix A is (K - 1), then the equation is:

a) overidentified;

b) not identified;

B) exactly identified;

48. A model is considered identified if:

a) among the equations of the model there is at least one normal;

B) each equation of the system is identifiable;

c) among the equations of the model there is at least one unidentified one;

d) among the equations of the model there is at least one overidentified one.

49. What method is used to estimate the parameters of an unidentified equation?

a) DMNK, KMNK;

b) DMNK, MNK;

C) the parameters of such an equation cannot be estimated.

50. At the junction of which areas of knowledge did econometrics emerge:

A) economic theory, economic and mathematical statistics;

b) economic theory, mathematical statistics and probability theory;

c) economic and mathematical statistics, probability theory.

51. In the multiple linear regression equation, confidence intervals for the regression coefficients are constructed using the distribution:

a) Normal;

B) Student's t;

c) Pearson;

d) Fisher-Snedecor.

52. Based on 16 observations, a paired linear regression equation was constructed. To check the significance of the regression coefficient, the observed t-statistic was calculated: 2.5.

a) The coefficient is insignificant at a = 0.05;

b) The coefficient is significant at a = 0.05;

c) The coefficient is significant at a = 0.01.

53. It is known that there is a positive relationship between the quantities X and Y. Within what limits does the pair correlation coefficient lie?

a) from -1 to 0;

b) from 0 to 1;

B) from –1 to 1.

54. The multiple correlation coefficient is 0.9. What percentage of the variance of the effective trait is explained by the influence of all the factor traits?

55. Which of the following methods cannot be applied to detect heteroscedasticity?

A) Goldfeld-Quandt test;

b) Spearman's rank correlation test;

c) the method of series.

56. The reduced form of the model is:

a) a system of nonlinear functions of exogenous variables from endogenous ones;

B) a system of linear functions of endogenous variables from exogenous ones;

c) a system of linear functions of exogenous variables from endogenous ones;

d) a system of normal equations.

57. In what limits does the partial correlation coefficient, calculated by recurrent formulas, change?

a) from -∞ to +∞;

b) from 0 to 1;

c) from 0 to +∞;

D) from –1 to +1.

58. In what limits does the partial correlation coefficient, calculated through the coefficient of determination, change?

a) from -∞ to +∞;

B) from 0 to 1;

c) from 0 to +∞;

d) from –1 to +1.

59. Exogenous variables:

a) dependent variables;

B) independent variables;

61. When adding another explanatory factor to the regression equation, the multiple correlation coefficient:

a) will decrease;

b) will increase;

c) will retain its value.

62. A hyperbolic regression equation is constructed: Y = a + b/X. To test the significance of the equation, the distribution used is:

a) Normal;

B) Student's t;

c) Pearson;

d) Fisher-Snedecor.

63. For what types of systems can the parameters of individual econometric equations be found using the traditional least squares method?

a) a system of normal equations;

B) a system of independent equations;

C) a system of recursive equations;

D) a system of interdependent equations.

64. Endogenous variables:

A) dependent variables;

b) independent variables;

c) dated by previous points in time.

65. Within what limits does the coefficient of determination change?

a) from 0 to +∞;

b) from -∞ to +∞;

B) from 0 to +1;

d) from -1 to +1.

66. Constructed multiple linear regression equation. To check the significance of individual coefficients, use distribution:

a) Normal;

b) Student's t;

c) Pearson;

D) Fisher-Snedecor.

67. When adding one more explanatory factor to the regression equation, the coefficient of determination:

a) decrease;

B) will increase;

c) will retain its value;

d) will not decrease.

68. The essence of the least squares method is that:

A) the estimate is determined from the condition of minimizing the sum of the squares of the deviations of the sample data from the estimated estimate;

b) the estimate is determined from the condition of minimizing the sum of deviations of the sample data from the estimated estimate;

c) the estimate is determined from the condition of minimizing the sum of the squares of the deviations of the sample mean from the sample variance.

69. To what class of nonlinear regressions does the parabola belong:

73. To what class of nonlinear regressions does the exponential curve belong:

74. To what class of nonlinear regressions does the function of the form ŷ
:

A) regressions, nonlinear relative to the variables included in the analysis, but linear in the estimated parameters;

b) nonlinear regressions for the estimated parameters.

78. To what class of nonlinear regressions does the function of the form ŷ
:

a) regressions that are nonlinear with respect to the variables included in the analysis, but linear in the estimated parameters;

B) nonlinear regressions for the estimated parameters.

79. In the hyperbolic regression equation ŷ = a + b/x, if the value b > 0, then:

A) as the factor attribute x increases, the values of the effective attribute y decrease slowly, and as x → ∞ the average value of y approaches a;

b) the effective attribute y grows with slowing growth as the factor attribute x increases, and as x → ∞

81. The coefficient of elasticity is determined by the formula for a regression model in the form:

A) Linear function;

b) Parabolas;

c) Hyperboles;

d) exponential curve;

e) Power-law.

82. The coefficient of elasticity is determined by the formula
for a regression model in the form:

a) Linear function;

B) Parabolas;

c) Hyperboles;

d) exponential curve;

e) Power-law.

86. Equation
called:

A) linear trend;

b) parabolic trend;

c) hyperbolic trend;

d) an exponential trend.

89. Equation
called:

a) a linear trend;

b) parabolic trend;

c) hyperbolic trend;

D) an exponential trend.

90. A system of equations of the form is called:

A) a system of independent equations;

b) a system of recursive equations;

c) a system of interdependent (joint, simultaneous) equations.

93. Econometrics can be defined as:

A) an independent scientific discipline combining a set of theoretical results, techniques, methods and models designed to give concrete quantitative expression to the general (qualitative) regularities established by economic theory, on the basis of economic theory, economic statistics, and mathematical-statistical tools;

B) the science of economic measurements;

C) statistical analysis of economic data.

94. The tasks of econometrics include:

A) forecast of economic and socio-economic indicators characterizing the state and development of the analyzed system;

B) imitation of possible scenarios for the socio-economic development of the system to identify how the planned changes of certain controllable parameters will affect the output characteristics;

c) testing hypotheses using statistical data.

95. By their nature, relationships are:

A) functional and correlational;

b) functional, curvilinear and rectilinear;

c) correlational and inverse;

d) statistical and direct.

96. With a direct relationship, as the factor attribute increases:

a) the effective attribute decreases;

b) the effective attribute does not change;

C) the effective attribute increases.

97. What methods are used to identify the presence, nature and direction of communication in statistics?

a) average values;

B) comparison of parallel rows;

C) the method of analytical grouping;

d) relative values;

E) graphical method.

98. What method is used to identify the form of influence of some factors on others?

a) correlation analysis;

B) regression analysis;

c) index analysis;

d) analysis of variance.

99. What method is used to quantify the strength of the impact of some factors on others:

A) correlation analysis;

b) regression analysis;

c) method of average values;

d) analysis of variance.

100. Which indicators take values in the range from minus one to plus one:

a) coefficient of determination;

b) correlation ratio;

B) linear correlation coefficient.

101. The regression coefficient for a one-factor model shows:

A) by how many units does the function change when the argument changes by one unit;

b) by what percentage the function changes when the argument changes by one unit.

102. The coefficient of elasticity shows:

a) by what percentage the function changes when the argument changes by one unit of its measurement;

B) by what percentage the function changes when the argument changes by 1%;

c) by how many units of its measurement the function changes with a change in the argument by 1%.

105. The value of the correlation index, equal to 0.087, indicates:

A) a weak relationship;

b) a strong relationship;

c) errors in the calculations.

107. The value of the pair correlation coefficient, equal to 1.12, indicates:

a) a weak relationship;

b) a strong relationship;

C) errors in the calculations.

109. Which of the given numbers can be the values ​​of the pair correlation coefficient:

111. Which of the given numbers can be the values ​​of the multiple correlation coefficient:

115. Note the correct form for the linear regression equation:

a) ŷ
;

b) ŷ
;

c) ŷ
;

D) ŷ
.

After the regression equation has been constructed and its accuracy estimated using the coefficient of determination, the question remains of how this accuracy was achieved and, accordingly, whether this equation can be trusted. The point is that the regression equation was built not on the general population, which is unknown, but on a sample from it. Points from the general population fall into the sample randomly; therefore, in accordance with probability theory, it is possible, among other cases, that a sample from a "wide" general population turns out to be "narrow" (Fig. 15).

Fig. 15. A possible way in which points from the general population can fall into the sample.

In this case:

a) the regression equation constructed from the sample may differ significantly from the regression equation for the general population, which will lead to forecast errors;

b) the coefficient of determination and other characteristics of accuracy will turn out to be unjustifiably high and will mislead about the predictive qualities of the equation.

In the limiting case, it cannot be ruled out that from a general population forming a cloud whose principal axis is parallel to the horizontal axis (no relationship between the variables), random selection will yield a sample whose principal axis is inclined to that axis. Thus, attempts to predict further values of the general population from sample data are fraught not only with errors in assessing the strength and direction of the relationship between the dependent and independent variables, but also with the danger of finding a relationship between variables where none actually exists.

In the absence of information about all points of the general population, the only way to reduce errors in the first case is to use, when estimating the coefficients of the regression equation, a method that ensures their unbiasedness and efficiency. And the probability of the second case can be significantly reduced because one property of a general population with two mutually independent variables is known a priori: it is precisely this relationship that is absent in it. This reduction is achieved by testing the statistical significance of the obtained regression equation.

One of the most commonly used checks is the following. For the obtained regression equation, the F-statistic is determined: a characteristic of the accuracy of the regression equation, equal to the ratio of the part of the variance of the dependent variable that is explained by the regression equation to the unexplained (residual) part. The equation for the F-statistic in the case of multivariate regression is:

F = S²_expl / S²_resid = (Σ(ŷ - ȳ)²/m) / (Σ(y - ŷ)²/(n - m - 1)),

where:

S²_expl is the explained variance: the part of the variance of the dependent variable Y that is explained by the regression equation;

S²_resid is the residual variance: the part of the variance of the dependent variable Y that is not explained by the regression equation; its presence is a consequence of the action of the random component;

n is the number of points in the sample;

m is the number of independent variables in the regression equation.

As can be seen from the above formula, the variances are obtained by dividing the corresponding sums of squares by their numbers of degrees of freedom. The number of degrees of freedom is the minimum number of values of the dependent variable that is sufficient to obtain the desired sample characteristic and that can vary freely, given that all the other quantities used to calculate this characteristic are known for the sample.

To obtain the residual variance, the coefficients of the regression equation are required. In paired linear regression there are two coefficients, so, taking m = 1 in the formula above, the number of degrees of freedom is n - 2. This means that to determine the residual variance it is sufficient to know the coefficients of the regression equation and only (n - 2) values of the dependent variable from the sample; the remaining two values can be calculated from these data and are therefore not freely variable.

To calculate the explained variance, values of the dependent variable are not required at all, since it can be computed knowing the regression coefficients of the independent variables and the variance of the independent variable: it suffices to recall the relation given earlier, Σ(ŷ - ȳ)² = b²·Σ(x - x̄)². Therefore, the number of degrees of freedom of the explained variance is equal to the number of independent variables in the regression equation (for paired linear regression, m = 1).

As a result, the F-criterion for the paired linear regression equation is determined by the formula:

F = (Σ(ŷ - ȳ)²/1) / (Σ(y - ŷ)²/(n - 2)).

It is proved in probability theory that the F-criterion of a regression equation obtained for a sample from a general population in which there is no relationship between the dependent and independent variables follows the Fisher distribution, which is well studied. Thanks to this, for any value of the F-criterion one can calculate the probability of its occurrence and, conversely, determine the value of the F-criterion that it cannot exceed with a given probability.
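As an illustration, the F-criterion for a paired linear regression can be computed directly from the sums of squares. This is a minimal sketch: the data are made up for the example, and the tabular value F_tab (for α = 0.05 and degrees of freedom 1 and n - 2 = 3) is taken from a standard Fisher table.

```python
# Made-up sample data for illustration
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)

# Fit paired linear regression y = a + b*x by least squares
mx = sum(x) / n
my = sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
b = sxy / sxx
a = my - b * mx

# Decompose the total sum of squares into explained and residual parts
y_hat = [a + b * xi for xi in x]
ss_explained = sum((yh - my) ** 2 for yh in y_hat)
ss_residual = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

# F-criterion: explained variance over residual variance,
# with 1 and (n - 2) degrees of freedom for paired regression
F = ss_explained / (ss_residual / (n - 2))

F_tab = 10.13  # tabular value for alpha = 0.05, df = (1, 3), from a Fisher table
print(f"F = {F:.1f}, significant: {F > F_tab}")
```

Here F comes out far above F_tab, so the equation would be judged statistically significant.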

To carry out a statistical test of the significance of the regression equation, a null hypothesis of no relationship between the variables is formulated (all coefficients of the variables are equal to zero) and a significance level α is selected.

The significance level α is the admissible probability of making an error of the first kind: rejecting a true null hypothesis as a result of the test. In this case, making an error of the first kind means concluding from the sample that a relationship between the variables exists in the general population when in fact there is none.

Typically, the significance level is taken to be 5% or 1%. The lower the significance level (the smaller α), the higher the reliability level of the test, equal to 1 - α, i.e. the greater the chance of avoiding the error of recognizing, from the sample, a relationship in a general population of actually unrelated variables. But as the significance level decreases, the danger of an error of the second kind grows: accepting a false null hypothesis, i.e. failing to notice in the sample a relationship between the variables that actually exists in the general population. Therefore, depending on which error has the larger negative consequences, one or another significance level is chosen.

For the selected significance level, the tabular value F_tab is determined from the Fisher distribution: the probability of exceeding it in a sample obtained from a general population with no relationship between the variables does not exceed the significance level. F_tab is then compared with the actual value F of the criterion for the regression equation.

If the condition F > F_tab is met, then the erroneous detection of a relationship, with an F-criterion value this large or larger, in a sample from a general population with unrelated variables would occur with a probability less than the significance level. In accordance with the rule that "very rare events do not happen", we conclude that the relationship between the variables established from the sample is also present in the general population from which it was obtained.

If it turns out that F <= F_tab, then the regression equation is not statistically significant. In other words, there is a real likelihood that a relationship between the variables that does not exist in reality has been inferred from the sample. An equation that does not pass the test for statistical significance is treated like a drug past its expiration date: such drugs are not necessarily spoiled, but since there is no certainty about their quality, they are preferably not used. This rule does not protect against all errors, but it allows the grossest ones to be avoided, which is also quite important.

The second option for checking, more convenient when using spreadsheets, is to compare the probability of occurrence of the obtained value of the F-criterion with the significance level. If this probability is below the significance level α, then the equation is statistically significant; otherwise it is not.

After checking the statistical significance of the regression equation as a whole, it is useful, especially for multivariate dependencies, to check the statistical significance of the obtained regression coefficients. The ideology of the test is the same as for the equation as a whole, but the criterion used is the Student's t-test, defined by the formulas:

t_a = a / S_a and t_b = b / S_b,

where t_a and t_b are the values of the Student's criterion for the coefficients a and b respectively; S_a and S_b are the standard errors of these coefficients, computed from S2_resid, the residual variance of the regression equation; n is the number of points in the sample; m is the number of variables in the equation (for paired linear regression m = 1).

The obtained actual values of the Student's criterion are compared with the tabular value t_tab obtained from the Student distribution. If it turns out that |t| > t_tab, then the corresponding coefficient is statistically significant; otherwise it is not. The second option for checking the statistical significance of the coefficients is to determine the probability of occurrence of the obtained value of the Student's criterion and compare it with the significance level α.
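The coefficient test can be sketched as follows. The data are made up, the standard-error formulas are the usual paired-OLS ones, and t_tab = 3.182 is the tabular Student value for α = 0.05 and n - 2 = 3 degrees of freedom.

```python
import math

# Made-up sample for illustration
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)

mx = sum(x) / n
my = sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
a = my - b * mx

# Residual variance with n - 2 degrees of freedom
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
s2 = sum(e ** 2 for e in residuals) / (n - 2)

# Standard errors of the coefficients (standard paired-OLS formulas)
se_b = math.sqrt(s2 / sxx)
se_a = math.sqrt(s2 * sum(xi ** 2 for xi in x) / (n * sxx))

t_b = b / se_b
t_a = a / se_a

t_tab = 3.182  # Student table, alpha = 0.05, df = n - 2 = 3
print(f"t_a = {t_a:.2f} (significant: {abs(t_a) > t_tab})")
print(f"t_b = {t_b:.2f} (significant: {abs(t_b) > t_tab})")
```

On this sample the slope passes the test while the intercept does not, which is the typical situation when the true line nearly passes through the origin.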

For variables whose coefficients turn out to be statistically insignificant, there is a high probability that their effect on the dependent variable in the general population is completely absent. Therefore, either the number of points in the sample must be increased (then the coefficient may become statistically significant and its value will be refined at the same time), or other independent variables more closely related to the dependent variable must be found instead. In either case, the forecasting accuracy will increase.

As an express method for assessing the significance of the coefficients of the regression equation, the following rule can be applied: if the Student's criterion is greater than 3, then such a coefficient, as a rule, turns out to be statistically significant. In general, it is considered that to obtain statistically significant regression equations the number of points in the sample must substantially exceed the number of variables in the equation.

The standard error of forecasting the unknown value y_p at a known x_p from the obtained regression equation is estimated by the formula:

S_forecast = S_resid * sqrt(1 + 1/n + (x_p - x_mean)^2 / sum((x_i - x_mean)^2)).

Thus, the forecast with a confidence level of 68% can be presented as:

y_p = a + b * x_p +/- S_forecast.

If a different confidence level P is required, then for the significance level α = 1 - P it is necessary to find the Student's criterion t_α, and the confidence interval for forecasting with reliability level P will be equal to y_p +/- t_α * S_forecast.
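A sketch of the forecast interval on made-up data, assuming the usual individual-forecast error formula for paired regression; t_95 = 3.182 is the tabular Student value for α = 0.05 and n - 2 = 3 degrees of freedom.

```python
import math

# Made-up sample and forecast point
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)
x_p = 6  # point at which the forecast is made

mx = sum(x) / n
my = sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
a = my - b * mx

# Residual standard deviation
s = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

# Standard error of an individual forecast at x_p: it grows as x_p
# moves away from the mean of the observed x values
s_pred = s * math.sqrt(1 + 1 / n + (x_p - mx) ** 2 / sxx)

y_p = a + b * x_p
t_95 = 3.182  # Student table, alpha = 0.05, df = n - 2 = 3
print(f"forecast: {y_p:.2f} +/- {s_pred:.2f} (68%)")
print(f"95% interval: [{y_p - t_95 * s_pred:.2f}, {y_p + t_95 * s_pred:.2f}]")
```

Note that s_pred is always larger than the residual standard deviation s, since it also accounts for the uncertainty of the fitted line itself.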

Predicting multivariate and nonlinear dependencies

If the predicted value depends on several independent variables, then a multivariate regression of the following kind takes place:

y = b_0 + b_1*x_1 + b_2*x_2 + ... + b_m*x_m,

where b_0, b_1, ..., b_m are regression coefficients describing the influence of the variables x_1, ..., x_m on the predicted quantity.

The methodology for determining the regression coefficients is the same as for paired linear regression, especially when using a spreadsheet, since the same function serves both paired and multivariate linear regression. It is desirable that there be no relationships between the independent variables, i.e. that changes in one variable not affect the values of the others. This requirement is not mandatory, but it is important that there be no functional linear dependencies among the variables. The procedures described above for checking the statistical significance of the regression equation and its individual coefficients, and for estimating the forecasting accuracy, remain the same as in the case of paired linear regression. At the same time, using multivariate regressions instead of paired ones, with an appropriate choice of variables, usually makes it possible to significantly increase the accuracy of describing the behavior of the dependent variable, and hence the accuracy of prediction.

In addition, multivariate linear regression equations make it possible to describe a nonlinear dependence of the predicted value on the independent variables. The procedure for reducing a nonlinear equation to a linear form is called linearization. In particular, if the dependence is described by a polynomial of degree other than 1, then replacing the variables raised to powers other than one with new first-degree variables turns the nonlinear problem into a multivariate linear regression problem. For example, if the influence of the independent variable is described by a parabola of the form

y = b_0 + b_1*x + b_2*x^2,

then the replacement z_1 = x, z_2 = x^2 transforms the nonlinear problem into the multidimensional linear form

y = b_0 + b_1*z_1 + b_2*z_2.
Nonlinear problems in which nonlinearity arises due to the fact that the predicted value depends on the product of independent variables can be just as easily transformed. To take into account this influence, it is necessary to introduce a new variable equal to this product.
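The polynomial substitution can be sketched as follows. The points are generated from a parabola with known coefficients (made up for the example) so the fit can be checked; the small Gaussian-elimination helper exists only to solve the 3x3 normal equations of the resulting multivariate linear regression.

```python
# Generate points from a known parabola: y = 1 + 2*x + 0.5*x**2
xs = [0, 1, 2, 3, 4, 5]
ys = [1 + 2 * x + 0.5 * x ** 2 for x in xs]

# Substitution z1 = x, z2 = x**2 turns the nonlinear problem into a
# multivariate linear regression with design rows (1, z1, z2)
rows = [[1.0, x, x ** 2] for x in xs]

# Normal equations: (X^T X) beta = X^T y
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * yv for r, yv in zip(rows, ys)) for i in range(3)]

def solve(A, v):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    A = [row[:] + [w] for row, w in zip(A, v)]
    m = len(A)
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m + 1):
                A[r][c] -= f * A[col][c]
    beta = [0.0] * m
    for r in range(m - 1, -1, -1):
        beta[r] = (A[r][m] - sum(A[r][c] * beta[c] for c in range(r + 1, m))) / A[r][r]
    return beta

b0, b1, b2 = solve(xtx, xty)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")  # recovers 1, 2, 0.5
```

Since the data contain no noise, the least-squares solution recovers the generating coefficients exactly (up to rounding), confirming that the substitution preserved the problem.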

In cases where the nonlinearity is described by more complex dependencies, linearization is possible through a coordinate transformation. For this, transformed values of the variables are calculated and the initial points are plotted in various combinations of the transformed variables. The combination of transformed coordinates, or of transformed and untransformed coordinates, in which the dependence is closest to a straight line suggests the change of variables that will reduce the nonlinear dependence to a linear form. For example, a nonlinear power-law dependence of the form

y = c * x^b

becomes linear after taking logarithms:

Y = B_0 + b * X,

where Y = ln y, X = ln x and B_0 = ln c.

The regression coefficients obtained for the transformed equation remain unbiased and efficient, but checking the statistical significance of the equation and coefficients then applies to the transformed equation rather than to the original nonlinear one.

Checking the validity of using the least squares method

The use of the least squares method ensures efficient and unbiased estimates of the coefficients of the regression equation under the following conditions (the Gauss-Markov conditions) on the errors e_i:

1. The expected value of the errors is zero: E(e_i) = 0.

2. The variance of the errors is the same for all observations: Var(e_i) = const.

3. The values of e_i do not depend on each other.

4. The values of e_i do not depend on the independent variables.

The easiest way to check whether these conditions are met is to plot graphs of the residuals, first against the calculated values of the dependent variable and then against the independent variables. If the points on these graphs lie in a corridor symmetric about the abscissa axis and there are no regularities in their arrangement, then the Gauss-Markov conditions are met and there is no way to further improve the accuracy of the regression equation. If this is not the case, then there is a possibility of significantly increasing the accuracy of the equation, and for this one should refer to the specialized literature.

To check significance, the ratio of a regression coefficient to its standard deviation is analyzed. This ratio follows the Student distribution, so the t-criterion is used to determine significance:

t_calc = b_i / S_bi, where S_bi = sqrt(S2_resid / sum((x_i - x_mean)^2)),

S2_resid is the residual variance, and sum((x_i - x_mean)^2) is the sum of squared deviations of x from the mean.

If t_calc > t_tab, then the coefficient b_i is significant.

The confidence interval is determined by the formula:

b_i - t_tab * S_bi <= beta_i <= b_i + t_tab * S_bi.

ORDER OF PERFORMANCE OF WORK

    Take the initial data according to the variant of work (according to the student's number in the journal). A static control object with two inputs X 1 , X 2 and one output Y is specified. A passive experiment was carried out on the object and a sample of 30 points was obtained, containing the values of X 1 , X 2 and Y for each experiment.

    Open a new file in Excel 2007. Enter the initial information into the columns of the original table - the values ​​of the input variables X 1 , X 2 and the output variable Y.

    Prepare two additional columns for entering the calculated values of Y and the residuals.

    Call the program "Regression": Data / Data Analysis / Regression.

Fig. 1. The "Data Analysis" dialog box.

    Enter the addresses of the source data into the "Regression" dialog box:

    input range Y, input range X (2 columns),

    set the confidence level to 95%,

    in the "Output range" option, specify the upper left cell of the area where the regression analysis output is placed (the first cell on the second sheet of the workbook),

    enable the "Residuals" and "Residual plots" options,

    press the OK button to start the regression analysis.

Fig. 2. The "Regression" dialog box.

    Excel will display 4 tables and 2 graphs of the dependence of residuals on variables X1 and X2.

    Format the "Summary Output" table: widen the column with the names of the output data and keep 3 significant digits after the decimal point in the second column.

    Format the ANOVA table: choose a number of digits after the decimal point that is easy to read and understand, shorten the variable names and adjust the column widths.

    Format the table of equation coefficients: shorten the variable names and adjust the column widths if necessary, choose a readable number of significant digits, delete the last 2 columns (values and table markup).

    Transfer the data from the "Residual Output" table to the prepared columns of the source table, then delete the "Residual Output" table (use "Paste Special").

    Enter the obtained estimates of the coefficients into the original table.

    Move the result tables to the top of the page.

    Below the tables, build charts of Y_experimental, Y_calculated, and the forecast errors (residuals).

    Format the residual charts. Using the obtained graphs, evaluate the adequacy of the model with respect to the inputs X1, X2.

    Print the results of the regression analysis.

    Analyze the results of the regression analysis.

    Prepare a report on the work.

    Prepare a report on the work.

EXAMPLE OF PERFORMANCE OF WORK

The technique of performing regression analysis in the EXCEL package is shown in Figures 3-5.

Fig. 3. An example of regression analysis in the EXCEL package.

Fig. 4. Residual plots for the variables X1, X2.

Fig. 5. Charts of Y_experimental, Y_calculated, and the forecast errors (residuals).

According to the regression analysis, we can say:

1. The regression equation obtained using Excel is as follows:

    Determination coefficient:

46.5% of the variation of the result is explained by the variation of the factors.

    The general F-test checks the hypothesis that the regression equation is statistically significant. The analysis is performed by comparing the actual and tabular values of Fisher's F-criterion.

Since the actual value exceeds the tabular one, we conclude that the resulting regression equation is statistically significant.

    Multiple correlation coefficient:

    b 0 :

t tab. (29, 0.975) = 2.05

b 0 :

Confidence interval:

    Determine the confidence interval for the coefficient b 1 :

Checking the significance of the coefficient b 1 :

t calc. > t tab. , so the coefficient b 1 is significant

Confidence interval:

    Determine the confidence interval for the coefficient b 2 :

Significance check for coefficient b 2 :

Determine the confidence interval:

JOB OPTIONS

Table 2. Options for tasks (columns: option number; effective feature Y i ; factor numbers X i ). Options 1-10 use the effective feature Y 1 , options 11-20 use Y 2 , and options 21-30 use Y 3 .

Table 3. Initial data (columns: Y 1 , Y 2 , Y 3 , X 1 , X 2 , X 3 , X 4 , X 5 ).

QUESTIONS FOR SELF-CONTROL

    Regression analysis tasks.

    Prerequisites for regression analysis.

    Basic equation of analysis of variance.

    What does Fisher's F-criterion show?

    How is the tabular value of the Fisher criterion determined?

    What does the coefficient of determination show?

    How to determine the significance of the regression coefficients?

    How to determine the confidence interval of the regression coefficients?

    How to determine the calculated value of the t-test?

    How to determine the tabular value of the t-test?

    Formulate the main idea of ​​the analysis of variance, for which tasks is it most effective?

    What are the main theoretical premises of analysis of variance?

    Decompose the total sum of squared deviations into components in ANOVA.

    How to get variance estimates from the sum of squared deviations?

    How are the required degrees of freedom obtained?

    How is the standard error determined?

    Explain the scheme of two-way analysis of variance.

    How is cross-classification different from hierarchical classification?

    How is balanced data different?

The report is prepared in the Word text editor on A4 paper, GOST 6656-76 (210x297 mm), and contains:

    The name of the laboratory work.

    Objective.

  1. Calculation results.

TIME ALLOWED FOR PERFORMING THE LABORATORY WORK

Preparation for work - 0.5 acad. hours.

Performance of work - 0.5 acad. hours.

Computer calculations - 0.5 acad. hours.

Registration of work - 0.5 acad. hours.

Literature

    Semenov A.D., Artamonov D.V., Bryukhachev A.V. Identification of Control Objects: a tutorial. Penza: PSU, 2003. 211 p.

    Vukolov E.A. Fundamentals of Statistical Analysis. A workshop on statistical methods and operations research using the STATISTICA and EXCEL packages: a tutorial. Moscow: FORUM, 2008. 464 p.

    Ignatiev A.A., Ignatiev S.A. Foundations of the Theory of Identification of Control Objects: a tutorial. Saratov: SSTU, 2008. 44 p.

    Gorelova G.V., Katsko I.A. Probability Theory and Mathematical Statistics in Examples and Problems Using EXCEL. Rostov-on-Don: Phoenix, 2006. 475 p.

    Purpose of work

    Basic concepts

    Work order

    Example of performing work

    Self-check questions

    Time allotted for work

    After estimating the parameters a and b, we obtain a regression equation by which we can estimate the values of y for given values of x. It is natural to expect that the calculated values of the dependent variable will not coincide with the actual values, since the regression line describes the relationship only on average; individual values are scattered around it. Thus, the reliability of the calculated values obtained from the regression equation is largely determined by the scatter of the observed values around the regression line. In practice, as a rule, the variance of the errors is unknown and is estimated from the observations simultaneously with the regression parameters a and b. It is logical to assume that this estimate is related to the sum of squares of the regression residuals. It can be shown that for the paired regression model a sample estimate of the variance of the disturbances contained in the theoretical model is

    S2 = sum(e_i^2) / (n - 2),

    where e_i = y_i - y_hat_i is the deviation of the actual value of the dependent variable from its calculated value.

    If e_i = 0 for all i, then for all observations the actual values of the dependent variable coincide with the calculated (theoretical) values y_hat_i. Graphically, this means that the theoretical regression line (the line plotted from the fitted function) passes through all the points of the correlation field, which is possible only with a strictly functional relationship. In that case the effective feature y is entirely determined by the influence of the factor x.

    Usually, in practice, there is some scatter of the points of the correlation field relative to the theoretical regression line, i.e. deviations of the empirical data from the theoretical ones. This scatter is due both to the influence of the factor x, i.e. the regression of y on x (this variance is called explained, since it is explained by the regression equation), and to the action of other causes (unexplained, random variation). The magnitude of these deviations underlies the calculation of the quality indicators of the equation.

    According to the main proposition of the analysis of variance, the total sum of squared deviations of the dependent variable y from the mean can be decomposed into two components: explained by the regression equation and unexplained:

    sum((y_i - y_bar)^2) = sum((y_hat_i - y_bar)^2) + sum((y_i - y_hat_i)^2),

    where y_hat_i are the values of y calculated by the equation.

    Let us find the ratio of the sum of squared deviations explained by the regression equation to the total sum of squares:

    R^2 = sum((y_hat_i - y_bar)^2) / sum((y_i - y_bar)^2). (7.6)

    The ratio of the part of the variance explained by the regression equation to the total variance of the effective feature is called the coefficient of determination. The value of R^2 cannot exceed one, and this maximum value is achieved only when every deviation y_i - y_hat_i equals zero, i.e. when all points of the scatter diagram lie exactly on the straight line.

    The coefficient of determination characterizes the share of the variance explained by the regression in the total variance of the dependent variable. Accordingly, the value 1 - R^2 characterizes the proportion of the variation (variance) of y unexplained by the regression equation and hence caused by the influence of factors unaccounted for in the model. The closer R^2 is to one, the higher the quality of the model.



    With paired linear regression, the coefficient of determination is equal to the square of the paired linear correlation coefficient: R^2 = r_xy^2.

    The square root of the coefficient of determination is the multiple correlation coefficient (index), or the theoretical correlation ratio.

    In order to find out whether the value of the coefficient of determination obtained when estimating the regression reflects a true relationship between y and x, the significance of the constructed equation as a whole and of its individual parameters is checked. Checking the significance of a regression equation shows whether the equation is suitable for practical use, for example for forecasting, or not.

    At the same time, the main (null) hypothesis of the insignificance of the equation as a whole is put forward, which formally reduces to the hypothesis that the regression parameters equal zero or, equivalently, that the coefficient of determination equals zero: H0: R^2 = 0. The alternative hypothesis of the significance of the equation is that the regression parameters, and hence the coefficient of determination, are not equal to zero: H1: R^2 > 0.

    To test the significance of the regression model, Fisher's F-criterion is used, calculated as the ratio of the explained sum of squares (per independent variable) to the residual sum of squares (per degree of freedom):

    F = (sum((y_hat_i - y_bar)^2) / k) / (sum((y_i - y_hat_i)^2) / (n - k - 1)), (7.7)

    where k is the number of independent variables.

    After dividing the numerator and denominator of relation (7.7) by the total sum of squared deviations of the dependent variable, the F-criterion can be equivalently expressed in terms of the coefficient of determination:

    F = (R^2 / k) / ((1 - R^2) / (n - k - 1)).

    If the null hypothesis is correct, then the explained by the regression equation and the unexplained (residual) variance do not differ from each other.

    The calculated value of the F-criterion is compared with the critical value, which depends on the number of independent variables k and on the number of degrees of freedom (n - k - 1). The tabular (critical) value of the F-criterion is the maximum value the variance ratio can reach through random divergence, for a given probability level, if the null hypothesis is true. If the calculated value of the F-criterion exceeds the tabular one for the given significance level, the null hypothesis of no relationship is rejected and the relationship is concluded to be significant, i.e. the model is considered significant.

    For the paired regression model:

    F = r^2 (n - 2) / (1 - r^2).

    In linear regression, the significance is usually assessed not only of the equation as a whole but also of its individual coefficients. For this, the standard error of each parameter is determined. The standard errors of the regression parameters are determined by the formulas:

    S_b = sqrt(S2 / sum((x_i - x_bar)^2)), (7.8)

    S_a = sqrt(S2 * sum(x_i^2) / (n * sum((x_i - x_bar)^2))), (7.9)

    where S2 is the residual variance. The standard errors of the regression coefficients (standard deviations) calculated by formulas (7.8, 7.9) are, as a rule, given in the output of the regression model calculation in statistical packages.

    Based on the standard errors of the regression coefficients, the significance of these coefficients is checked using the usual statistical hypothesis testing scheme.

    As the main hypothesis, the hypothesis is put forward that the "true" regression coefficient differs insignificantly from zero. The alternative hypothesis is then the reverse, i.e. that the "true" regression parameter is not equal to zero. This hypothesis is tested using the t-statistic, which has the Student t-distribution:

    t = b / S_b.

    The calculated values of the t-statistic are then compared with the critical value determined from Student distribution tables. The critical value is determined depending on the significance level α and the number of degrees of freedom, which equals (n - k - 1), where n is the number of observations and k is the number of independent variables. In the case of linear pairwise regression, the number of degrees of freedom is (n - 2). The critical value can also be calculated on a computer using the built-in Student distribution function of the Excel package (TINV).

    If the calculated value of the t-statistic exceeds the critical value, then the main hypothesis is rejected and it is considered that, with probability (1 - α), the "true" regression coefficient differs significantly from zero, which statistically confirms the existence of a linear dependence between the corresponding variables.

    If the calculated value of the t-statistic is less than the critical value, then there is no reason to reject the main hypothesis, i.e. the "true" regression coefficient differs insignificantly from zero at the significance level α. In this case, the factor corresponding to this coefficient should be excluded from the model.

    The significance of a regression coefficient can also be established by constructing a confidence interval. The confidence intervals for the regression parameters a and b are defined as follows:

    a - t_cr * S_a <= a_true <= a + t_cr * S_a,

    b - t_cr * S_b <= b_true <= b + t_cr * S_b,

    where t_cr is determined from the Student distribution table for the significance level α and, in the case of paired regression, (n - 2) degrees of freedom.

    Since the regression coefficients in econometric studies have a clear economic interpretation, the confidence intervals should not contain zero. The true value of a regression coefficient cannot simultaneously take positive and negative values, including zero; otherwise contradictory results arise in the economic interpretation of the coefficients, which cannot be. Thus, a coefficient is significant if the obtained confidence interval does not cover zero.
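The interval check can be sketched as follows on made-up data; the standard errors use formulas (7.8, 7.9), and t_cr = 3.182 is the tabular Student value for α = 0.05 and n - 2 = 3 degrees of freedom.

```python
import math

# Made-up sample
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)

mx = sum(x) / n
my = sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
a = my - b * mx

# Residual variance and standard errors per (7.8), (7.9)
s2 = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
se_b = math.sqrt(s2 / sxx)
se_a = math.sqrt(s2 * sum(xi ** 2 for xi in x) / (n * sxx))

t_cr = 3.182  # Student table, alpha = 0.05, df = 3
ci_b = (b - t_cr * se_b, b + t_cr * se_b)
ci_a = (a - t_cr * se_a, a + t_cr * se_a)

# A coefficient is significant when its confidence interval does not cover zero
print(f"b in [{ci_b[0]:.3f}, {ci_b[1]:.3f}] -> significant: {not ci_b[0] <= 0 <= ci_b[1]}")
print(f"a in [{ci_a[0]:.3f}, {ci_a[1]:.3f}] -> significant: {not ci_a[0] <= 0 <= ci_a[1]}")
```

On this sample the interval for the slope b lies entirely above zero, while the interval for the intercept a straddles zero, matching the t-test conclusions for the same data.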

    Example 7.4. According to example 7.1:

    a) Construct a paired linear regression model of the dependence of profit from sales on the selling price using software for data processing.

    b) Estimate the significance of the regression equation as a whole using F- Fisher's criterion for α = 0.05.

    c) Estimate the significance of the coefficients of the regression model using t-Student's test at α = 0.05 and α = 0.1.

    For the regression analysis we use the standard office program EXCEL. We will build the regression model using the REGRESSION tool of the Analysis ToolPak add-in (Fig. 7.5), which is launched as follows:

    Tools → Data Analysis → REGRESSION → OK.

    Figure 7.5. Using the REGRESSION tool

    In the REGRESSION dialog box, in the "Input Y Range" field, enter the address of the range of cells containing the dependent variable. In the "Input X Range" field, enter the addresses of one or more ranges containing the values of the independent variables. The "Labels in first row" checkbox is activated if the column headers are also selected. Fig. 7.6 shows the screen form of the calculation of the regression model using the REGRESSION tool.

    Fig. 7.6. Building a pairwise regression model using the REGRESSION tool

    As a result of the operation of the REGRESSION tool, the following regression analysis protocol is formed (Fig. 7.7).

    Fig. 7.7. Regression analysis protocol

    The equation for the dependence of profit from sales on the selling price is as follows:

    We estimate the significance of the regression equation using Fisher's F-criterion. The value of the F-criterion is taken from the "Analysis of Variance" table of the EXCEL protocol (Fig. 7.7). The calculated value of the F-criterion is 53.372. The tabular value of the F-criterion at the significance level α = 0.05 and the corresponding numbers of degrees of freedom is 4.964. Since 53.372 > 4.964, the equation is considered significant.

    The calculated values of Student's t-criterion for the coefficients of the regression equation are shown in the resulting table (Fig. 7.7). The tabular value of Student's t-criterion at the significance level α = 0.05 and 10 degrees of freedom is 2.228. For the regression coefficient a the calculated value is below the tabular one; hence the coefficient a is not significant. For the regression coefficient b the calculated value exceeds the tabular one; therefore the coefficient b is significant.