Assessing the significance of the linear regression parameters and of the regression equation as a whole

After the linear regression equation has been found, the significance of both the equation as a whole and of its individual parameters is assessed.

To test the significance of the regression equation means to establish whether the mathematical model expressing the relationship between the variables is consistent with the experimental data, and whether the explanatory variables included in the equation (one or several) are sufficient to describe the dependent variable.

Significance is tested by means of analysis of variance.

Following the idea of analysis of variance, the total sum of squared deviations of y from its mean value is decomposed into two parts, explained and unexplained:

Σ(y − ȳ)² = Σ(ŷ − ȳ)² + Σ(y − ŷ)²

or, in shorthand:

SS_total = SS_factor + SS_residual.

Consider two extreme cases: when the total sum of squares equals the residual sum, and when the total sum equals the factor (explained) sum.

In the first case the factor x has no effect on the result: the entire variance of y is due to the effect of other factors, the regression line is parallel to the Ox axis, and the equation should be re-specified.

In the second case, other factors do not affect the result: y is related to x functionally, and the residual sum of squares is zero.

In practice, however, both terms are present on the right-hand side. Whether the regression line is suitable for forecasting depends on what share of the total variation of y is accounted for by the explained variation. If the explained sum of squares exceeds the residual sum of squares, the regression equation is statistically significant and the factor x has a substantial effect on the result y; equivalently, the coefficient of determination approaches one.

The number of degrees of freedom (df, degrees of freedom) is the number of independently varying values.

The total sum of squares requires (n − 1) independent deviations.

The factor sum of squares has one degree of freedom, and the residual sum of squares has df_resid degrees of freedom.

Thus, we can write the balance of degrees of freedom:

(n − 1) = 1 + df_resid.

From this balance we determine that df_resid = n − 2.

Dividing each sum of squares by its number of degrees of freedom, we obtain the mean square of deviations, i.e., the variance per one degree of freedom: the total variance, the factor variance, and the residual variance.
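The decomposition and the degrees-of-freedom balance can be checked numerically. The following is a minimal sketch in Python; the sample data are hypothetical (invented for illustration), and the variable names sst, ssr, sse are ours, not the text's:

```python
import numpy as np

# Hypothetical paired sample, n = 10 observations
x = np.array([1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.2])
n = len(y)

# OLS estimates for the paired regression y = a + b*x
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
y_hat = a + b * x

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares,    df = n - 1
ssr = np.sum((y_hat - y.mean()) ** 2)  # factor (explained) sum,  df = 1
sse = np.sum((y - y_hat) ** 2)         # residual sum,            df = n - 2

# The decomposition SS_total = SS_factor + SS_residual holds exactly
assert np.isclose(sst, ssr + sse)

ms_total = sst / (n - 1)   # total variance per degree of freedom
ms_factor = ssr / 1        # factor variance
ms_resid = sse / (n - 2)   # residual variance
```

With these (nearly linear) data the explained sum dominates the residual sum, so the share ssr/sst, i.e., the coefficient of determination, is close to one.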

Analysis of the statistical significance of the linear regression coefficients

Although the theoretical coefficients of the linear equation are assumed to be constants, the estimates a and b of these coefficients, obtained when the equation is built from random sample data, are random variables. If the regression errors are normally distributed, the coefficient estimates are also normally distributed and can be characterized by their means and variances. The analysis of the coefficients therefore begins with calculating these characteristics.

The variances of the coefficients are calculated by the following formulas.

Variance of the regression coefficient b:

S_b² = S² / Σ(x − x̄)²,

where S² is the residual variance per one degree of freedom.

Variance of the parameter a:

S_a² = S² · Σx² / (n · Σ(x − x̄)²).

Hence the standard error of the regression coefficient is S_b = √(S_b²),

and the standard error of the parameter a is S_a = √(S_a²).

They serve to test the null hypotheses that the true value of the regression coefficient b or of the intercept a equals zero: H0: b = 0 (respectively, H0: a = 0).

The alternative hypothesis has the form H1: b ≠ 0 (respectively, H1: a ≠ 0).

The t-statistics t = b/S_b and t = a/S_a follow Student's t-distribution with (n − 2) degrees of freedom. From the tables of Student's distribution, at a chosen significance level α and (n − 2) degrees of freedom, the critical value t_crit is found.

If |t| > t_crit, the null hypothesis is rejected and the coefficient is considered statistically significant.

If |t| ≤ t_crit, the null hypothesis cannot be rejected. (If the coefficient b is statistically insignificant, the equation must be re-specified, since this means there is no relationship between the variables. If the coefficient a is statistically insignificant, it is recommended to estimate a new equation without the intercept.)
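The standard errors and the t-test of the coefficients can be sketched as follows (hypothetical data; the decision rule is the two-sided comparison of |t| with the Student critical value described above):

```python
import numpy as np
from scipy import stats

# Hypothetical paired sample, n = 10
x = np.array([1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.2])
n = len(y)

# OLS estimates and residuals
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
resid = y - (a + b * x)

s2 = np.sum(resid ** 2) / (n - 2)            # residual variance per df
var_b = s2 / np.sum((x - x.mean()) ** 2)     # variance of slope b
var_a = s2 * np.sum(x ** 2) / (n * np.sum((x - x.mean()) ** 2))  # variance of a

se_b, se_a = np.sqrt(var_b), np.sqrt(var_a)  # standard errors
t_b, t_a = b / se_b, a / se_a                # t-statistics for H0: b = 0, a = 0

t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2) # two-sided critical value, alpha = 0.05
slope_significant = abs(t_b) > t_crit        # reject H0: b = 0 if True
```

For this sample the slope is sharply significant while the intercept is not, which matches the recommendation above to re-estimate without an insignificant intercept.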

Interval estimates of the coefficients of the linear regression equation:

Confidence interval for a: a ± t_crit · S_a.

Confidence interval for b: b ± t_crit · S_b.

This means that with the given confidence level (1 − α), where α is the significance level, the true values of a and b lie within the indicated intervals.

The regression coefficient has a clear economic interpretation, so the confidence limits of the interval should not contain contradictory results; for example, they should not include zero.
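A sketch of the interval estimates (hypothetical data; ci_b and ci_a are our names for the confidence intervals built from the standard errors as above):

```python
import numpy as np
from scipy import stats

# Hypothetical paired sample
x = np.array([1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.2])
n = len(y)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
resid = y - (a + b * x)
s2 = np.sum(resid ** 2) / (n - 2)

se_b = np.sqrt(s2 / np.sum((x - x.mean()) ** 2))
se_a = np.sqrt(s2 * np.sum(x ** 2) / (n * np.sum((x - x.mean()) ** 2)))

t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)   # 95% confidence

ci_b = (b - t_crit * se_b, b + t_crit * se_b)  # interval for the slope
ci_a = (a - t_crit * se_a, a + t_crit * se_a)  # interval for the intercept
```

Here the interval for b does not include zero, so the slope passes the check described in the text; the interval for a does include zero, illustrating an insignificant intercept.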

Analysis of the statistical significance of the equation as a whole.

The Fisher distribution in regression analysis

The significance of the regression equation as a whole is assessed using Fisher's F-test. The null hypothesis is that all regression coefficients, with the exception of the intercept a, are zero and, consequently, that the factor x has no effect on the result y (H0: b = 0).

The value of the F-statistic is related to the coefficient of determination. For multiple regression:

F = (R² / m) / ((1 − R²) / (n − m − 1)),

where m is the number of independent variables.

For paired regression, the F-statistic takes the form:

F = R² / (1 − R²) · (n − 2).

The table value of the F-criterion, F_crit, is found at a chosen significance level (usually 0.05 or 0.01) and two numbers of degrees of freedom: df1 = m and df2 = n − m − 1 in the case of multiple regression; df1 = 1 and df2 = n − 2 for paired regression.

If F > F_crit, H0 is rejected and the statistical relationship between y and x is concluded to be significant.

If F ≤ F_crit, the regression equation is considered statistically insignificant, and H0 is not rejected.

Remark. In paired linear regression t_b² = F; in addition, r² = R². Thus, testing the hypotheses about the significance of the regression and correlation coefficients is equivalent to testing the hypothesis about the significance of the linear regression equation.
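The identity t_b² = F for paired regression can be verified numerically (hypothetical data; F is computed through R² exactly as in the formula above):

```python
import numpy as np
from scipy import stats

# Hypothetical paired sample, n = 10
x = np.array([1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.2])
n = len(y)

# Slope estimate and its t-statistic
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
resid = y - (a + b * x)
s2 = np.sum(resid ** 2) / (n - 2)
t_b = b / np.sqrt(s2 / np.sum((x - x.mean()) ** 2))

# F-statistic through the coefficient of determination
r = np.corrcoef(x, y)[0, 1]     # pairwise correlation coefficient
R2 = r ** 2                     # for paired regression r^2 = R^2
F = R2 / (1 - R2) * (n - 2)

F_crit = stats.f.ppf(0.95, 1, n - 2)  # alpha = 0.05, df = (1, n - 2)
```

The assertion F == t_b² (up to floating point) holds exactly, which is the content of the remark.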

Fisher's distribution can be used not only to test the hypothesis that all linear regression coefficients are simultaneously zero, but also the hypothesis that some subset of these coefficients is zero. This is important in developing a linear regression model, since it makes it possible to assess the validity of excluding individual explanatory variables or groups of them from the equation or, conversely, of including them.

Suppose, for example, that a multiple linear regression with m explanatory variables was first estimated on n observations, with coefficient of determination R1²; then the last k explanatory variables were excluded, and on the same data an equation was estimated for which the coefficient of determination is R2² (R2² ≤ R1², because each additional variable explains a part, however small, of the variation of the dependent variable).

To test the hypothesis that all coefficients of the k excluded variables are simultaneously zero, the value

F = ((R1² − R2²) / k) / ((1 − R1²) / (n − m − 1))

is calculated, which has a Fisher distribution with k and (n − m − 1) degrees of freedom.

From the tables of the Fisher distribution, at a given significance level, F_crit is found. If F > F_crit, the null hypothesis is rejected; in this case, excluding all k variables from the equation was incorrect.
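A sketch of this nested-model F-test on simulated data (entirely hypothetical: the data-generating coefficients, the seed, and the helper r_squared are ours; the dropped variables are deliberately made to matter, so the test should reject the exclusion):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, m, k = 50, 4, 2                     # n obs, m regressors, drop the last k

X = rng.normal(size=(n, m))
# All four regressors genuinely affect y, so excluding the last two is wrong
y = 1.0 + 2.0 * X[:, 0] - X[:, 1] + 3.0 * X[:, 2] + 3.0 * X[:, 3] \
    + rng.normal(size=n)

def r_squared(X, y):
    """R^2 of an OLS fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

R2_full = r_squared(X, y)              # R1^2: all m regressors
R2_restr = r_squared(X[:, :m - k], y)  # R2^2: last k regressors excluded

F = ((R2_full - R2_restr) / k) / ((1 - R2_full) / (n - m - 1))
F_crit = stats.f.ppf(0.95, k, n - m - 1)
exclusion_rejected = F > F_crit        # True: dropping the k variables is invalid
```

Note that R2_full ≥ R2_restr always holds for nested OLS models, as the text states.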

Similar reasoning applies to the justification of including one or several (k) new explanatory variables in the regression equation.

In this case the F-statistic

F = ((R1² − R2²) / k) / ((1 − R1²) / (n − m − 1))

is calculated, where R1² now refers to the extended equation; it has a Fisher distribution with k and (n − m − 1) degrees of freedom. If F exceeds the critical level, then the inclusion of the new variables explains a substantial part of the previously unexplained variance of the dependent variable (i.e., the inclusion of the new explanatory variables is justified).

Remarks. 1. It is advisable to include new variables one at a time.

2. When considering the inclusion of explanatory variables in the equation, it is desirable to use the coefficient of determination adjusted for the number of degrees of freedom when calculating the F-statistic.

Fisher's F-statistic is also used to test the hypothesis that the regression equations for separate groups of observations coincide.

Let there be two samples containing n1 and n2 observations, respectively. On each of these samples a regression equation of the same form is estimated; let their residual sums of squares be S1 and S2, respectively.

The null hypothesis is that all corresponding coefficients of these equations are equal to each other, i.e., that the regression equation for these samples is one and the same.

Now estimate the regression equation of the same form at once on all (n1 + n2) observations, and let its residual sum of squares be S.

Then the F-statistic is calculated by the formula:

F = ((S − S1 − S2) / (m + 1)) / ((S1 + S2) / (n1 + n2 − 2m − 2)).

It has a Fisher distribution with (m + 1) and (n1 + n2 − 2m − 2) degrees of freedom. The F-statistic will be close to zero if the equation is the same for both samples, since in that case S ≈ S1 + S2. Thus, if F ≤ F_crit, the null hypothesis is accepted.

If F > F_crit, the null hypothesis is rejected, and a single regression equation cannot be built.
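This coincidence test (the Chow test, referred to in the test questions below) can be sketched as follows; the two samples, the seed, and the helper rss are hypothetical, and both samples are generated by the same relationship, so the pooled equation should be adequate:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n1, n2 = 30, 35
x1 = rng.uniform(0, 10, n1)
x2 = rng.uniform(0, 10, n2)
# Both samples follow the same relationship, so H0 is true by construction
y1 = 2.0 + 1.5 * x1 + rng.normal(0, 1, n1)
y2 = 2.0 + 1.5 * x2 + rng.normal(0, 1, n2)

def rss(x, y):
    """Residual sum of squares of a paired OLS fit with intercept."""
    A = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return r @ r

m = 1                                    # one explanatory variable
S1, S2 = rss(x1, y1), rss(x2, y2)        # separate fits
S = rss(np.concatenate([x1, x2]),        # pooled fit on all n1 + n2 points
        np.concatenate([y1, y2]))

F = ((S - S1 - S2) / (m + 1)) / ((S1 + S2) / (n1 + n2 - 2 * m - 2))
F_crit = stats.f.ppf(0.95, m + 1, n1 + n2 - 2 * m - 2)
same_equation = F <= F_crit              # fail to reject H0: one pooled equation
```

The pooled residual sum S can never be smaller than S1 + S2, so the F-statistic is always nonnegative, as the derivation above implies.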

Final tests in econometrics

1. The significance of the parameters of the regression equation is assessed on the basis of:

A) Student's t-test;

b) the Fisher-Snedecor F-test;

c) the mean square error;

d) the average approximation error.

2. The regression coefficient in an equation relating sales volume (million rubles) to the annual profit of automotive industry enterprises (million rubles) means that when sales increase by 1 million rubles, profit increases by:

a) 0.5 million rubles;

b) 500 thousand rubles;

C) 1.5 million rubles.

3. The correlation ratio (correlation index) measures the degree of closeness of the relationship between x and y:

a) only for a nonlinear form of dependence;

B) for any form of dependence;

c) only for a linear dependence.

4. By direction, relationships are:

a) moderate;

B) direct;

c) inverse.

5. From 17 observations a regression equation was constructed. To test the significance of the equation, the observed value of the t-statistic was calculated: 3.9. Conclusion:

A) the equation is significant at α = 0.05;

b) the equation is insignificant at α = 0.01;

c) the equation is insignificant at α = 0.05.

6. What are the consequences of violating the OLS assumption that the mathematical expectation of the regression residuals equals zero?

A) biased estimates of the regression coefficients;

b) efficient but inconsistent estimates of the regression coefficients;

c) inefficient estimates of the regression coefficients;

d) inconsistent estimates of the regression coefficients.

7. Which of the following statements is true in the case of heteroscedasticity of the residuals?

A) the conclusions based on the t- and F-statistics are unreliable;

d) the estimates of the parameters of the regression equation are biased.

8. What is Spearman's rank correlation test based on?

A) the use of t-statistics;

c) the use of χ²-statistics;

9. What is White's test based on?

b) the use of F-statistics;

C) the use of χ²-statistics;

d) graphical analysis of the residuals.

10. What method can be used to eliminate autocorrelation?

11. What is the name of the violation of the assumption of constant variance of the residuals?

a) multicollinearity;

b) autocorrelation;

C) heteroscedasticity;

d) homoscedasticity.

12. Dummy variables are introduced into:

a) only linear models;

b) only multiple nonlinear regression;

c) only nonlinear models;

D) both linear and nonlinear models reducible to linear form.

13. If high coefficients are found in the matrix of pairwise correlation coefficients, this indicates:

A) the presence of multicollinearity;

b) the absence of multicollinearity;

c) the presence of autocorrelation;

d) the absence of heteroscedasticity.

14. Which measure cannot eliminate multicollinearity?

a) increasing the sample size;

D) transforming the random component.

15. If the rank of the matrix A is less than (K − 1), then the equation is:

a) overidentified;

B) unidentified;

c) exactly identified.

16. The reduced form of the model has the form:

a) ;

b) ;

c) .

17. What is the problem of model identification?

A) obtaining uniquely determined parameters of a model given by a system of simultaneous equations;

b) choosing and implementing methods for the statistical estimation of the unknown parameters of the model from the source statistical data;

c) checking the adequacy of the model.

18. What method is used to estimate the parameters of an overidentified equation?

C) two-stage least squares (2SLS), indirect least squares (ILS);

19. If a qualitative variable has k alternative values, then the modeling uses:

A) (k − 1) dummy variables;

b) k dummy variables;

c) (k + 1) dummy variables.

20. The analysis of the closeness and direction of the relationship between two variables is carried out on the basis of:

A) the pairwise correlation coefficient;

b) the coefficient of determination;

c) the multiple correlation coefficient.

21. In the linear equation y = a0 + a1x the regression coefficient shows:

a) the closeness of the relationship;

b) the share of the variance of "y" depending on "x";

C) by how much "y" changes on average when "x" changes by one unit;

d) the error of the correlation coefficient.

22. What indicator is used to determine the share of variation due to a change in the value of the factor under study?

a) the coefficient of variation;

b) the correlation coefficient;

C) the coefficient of determination;

d) the coefficient of elasticity.

23. The elasticity coefficient shows:

A) by how many percent the value of y changes when x changes by 1%;

b) by how many units of its measurement the value of y changes when x changes by 1%;

c) by how many percent the value of y changes when x changes by one unit of its measurement.

24. What methods can be applied to detect heteroscedasticity?

A) the Goldfeld-Quandt test;

B) the Spearman rank correlation test;

c) the Durbin-Watson test.

25. What is the Goldfeld-Quandt test based on?

a) the use of t-statistics;

B) the use of F-statistics;

c) the use of χ²-statistics;

d) graphical analysis of the residuals.

26. Which methods cannot eliminate the autocorrelation of residuals?

a) generalized least squares;

B) weighted least squares;

C) the maximum likelihood method;

D) two-stage least squares.

27. What is the name of the violation of the assumption of independence of the residuals?

a) multicollinearity;

B) autocorrelation;

c) heteroscedasticity;

d) homoscedasticity.

28. What method can be used to eliminate heteroscedasticity?

A) generalized least squares;

b) weighted least squares;

c) the maximum likelihood method;

d) two-stage least squares.

30. If by the t-test most of the regression coefficients are statistically significant, while the model as a whole is insignificant by the F-test, this may indicate:

a) multicollinearity;

B) autocorrelation of the residuals;

c) heteroscedasticity of the residuals;

d) this situation is impossible.

31. Is it possible to get rid of multicollinearity by transforming the variables?

a) this measure is effective only with an increase in the sample size;

32. By what method can one find estimates of the parameters of the linear regression equation:

A) the method of least squares;

b) correlation-regression analysis;

c) analysis of variance.

33. A multiple linear regression equation with dummy variables has been constructed. To test the significance of individual coefficients, the distribution used is:

a) normal;

b) Student;

c) Pearson;

d) Fisher-Snedecor.

34. If the rank of the matrix A is greater than (K − 1), then the equation is:

A) overidentified;

b) unidentified;

c) exactly identified.

35. To estimate the parameters of an exactly identified system of equations, the following is applied:

a) 2SLS, ILS;

b) 2SLS, OLS, ILS;

36. The Chow test is based on the application of:

A) F-statistics;

b) t-statistics;

c) the Durbin-Watson criterion.

37. Dummy variables can take values:

d) any values.

39. From 20 observations a regression equation was constructed. To test the significance of the equation, the value of the statistic was calculated: 4.2. Conclusions:

a) the equation is significant at α = 0.05;

b) the equation is insignificant at α = 0.05;

c) the equation is insignificant at α = 0.01.

40. Which of the following statements is not true in the case of heteroscedasticity of the residuals?

a) the conclusions based on the t- and F-statistics are unreliable;

b) heteroscedasticity manifests itself through a low value of the Durbin-Watson statistic;

c) under heteroscedasticity the estimates remain efficient;

d) the estimates are biased.

41. The Chow test is based on a comparison of:

A) variances;

b) coefficients of determination;

c) mathematical expectations;

d) means.

42. If in the Chow test , it is considered:

A) that splitting the sample into parts is advisable from the point of view of improving the quality of the model;

b) that the model is statistically insignificant;

c) that the model is statistically significant;

d) that it makes no sense to split the sample into parts.

43. Dummy variables are variables:

a) qualitative;

b) random;

C) quantitative;

d) logical.

44. Which of the listed methods cannot be applied to detect autocorrelation?

a) the method of series;

b) the Durbin-Watson criterion;

c) the Spearman rank correlation test;

D) White's test.

45. The simplest structural form of the model has the form:

a)

b)

c)

d)
.

46. By which measures can one get rid of multicollinearity?

a) increasing the sample size;

b) eliminating variables that are highly correlated with the rest;

c) changing the model specification;

d) transforming the random component.

47. If the rank of the matrix A is equal to (K − 1), then the equation is:

a) overidentified;

b) unidentified;

C) exactly identified;

48. The model is considered identified if:

a) among the equations of the model there is at least one normal one;

B) each equation of the system is identifiable;

c) among the equations of the model there is at least one unidentified one;

d) among the equations of the model there is at least one overidentified one.

49. What method is used to estimate the parameters of an unidentified equation?

a) 2SLS, ILS;

b) 2SLS, OLS;

C) the parameters of such an equation cannot be estimated.

50. At the junction of what fields of knowledge did econometrics arise:

A) economic theory, economic and mathematical statistics;

b) economic theory, mathematical statistics, and probability theory;

c) economic and mathematical statistics, probability theory.

51. In a multiple linear regression equation, confidence intervals for the regression coefficients are constructed using the distribution:

a) normal;

B) Student;

c) Pearson;

d) Fisher-Snedecor.

52. For 16 observations a paired linear regression equation was constructed. To test the significance of the regression coefficient, the observed value t = 2.5 was calculated.

a) the coefficient is insignificant at α = 0.05;

b) the coefficient is significant at α = 0.05;

c) the coefficient is significant at α = 0.01.

53. It is known that a positive relationship exists between the values x and y. Within what limits does the pairwise correlation coefficient lie?

a) from −1 to 0;

b) from 0 to 1;

C) from −1 to 1.

54. The multiple correlation coefficient is 0.9. What percentage of the variance of the dependent variable is explained by the influence of all the factor variables?

55. Which of the listed methods cannot be applied to detect heteroscedasticity?

A) the Goldfeld-Quandt test;

b) the Spearman rank correlation test;

c) the method of series.

56. The reduced form of the model is:

a) a system of nonlinear functions of exogenous variables in terms of endogenous ones;

B) a system of linear functions of endogenous variables in terms of exogenous ones;

c) a system of linear functions of exogenous variables in terms of endogenous ones;

d) a system of normal equations.

57. Within what limits does the partial correlation coefficient, calculated by recurrence formulas, change?

a) from −∞ to +∞;

b) from 0 to 1;

c) from 0 to +∞;

D) from −1 to +1.

58. Within what limits does the partial correlation coefficient, calculated through the coefficient of determination, change?

a) from −∞ to +∞;

B) from 0 to 1;

c) from 0 to +∞;

d) from −1 to +1.

59. Exogenous variables are:

a) dependent variables;

B) independent variables;

61. When another explanatory factor is added to the regression equation, the multiple correlation coefficient:

a) will decrease;

b) will increase;

c) will retain its value.

62. A hyperbolic regression equation has been constructed: y = a + b/x. To test the significance of the equation, the distribution used is:

a) normal;

B) Student;

c) Pearson;

d) Fisher-Snedecor.

63. For which types of systems can the parameters of individual econometric equations be found using the traditional least squares method?

a) a system of normal equations;

B) a system of independent equations;

C) a system of recursive equations;

D) a system of interdependent equations.

64. Endogenous variables are:

A) dependent variables;

b) independent variables;

c) variables dated to previous moments of time.

65. Within what limits does the coefficient of determination change?

a) from 0 to +∞;

b) from −∞ to +∞;

C) from 0 to +1;

d) from −1 to +1.

66. A multiple linear regression equation has been constructed. To test the significance of individual coefficients, the distribution used is:

a) normal;

b) Student;

c) Pearson;

D) Fisher-Snedecor.

67. When another explanatory factor is added to the regression equation, the coefficient of determination:

a) will decrease;

B) will increase;

c) will retain its value;

d) will not decrease.

68. The essence of the method of least squares is that:

A) the estimate is determined from the condition of minimizing the sum of squared deviations of the sample data from the estimate being determined;

b) the estimate is determined from the condition of minimizing the sum of deviations of the sample data from the estimate being determined;

c) the estimate is determined from the condition of minimizing the sum of squared deviations of the sample mean from the sample variance.

69. To what class of nonlinear regressions does the parabola belong:

73. To what class of nonlinear regressions does the exponential curve belong:

74. To what class of nonlinear regressions does a function of the form ŷ belong:

A) regressions that are nonlinear in the variables included in the analysis, but linear in the estimated parameters;

b) regressions that are nonlinear in the estimated parameters.

78. To what class of nonlinear regressions does a function of the form ŷ belong:

a) regressions that are nonlinear in the variables included in the analysis, but linear in the estimated parameters;

B) regressions that are nonlinear in the estimated parameters.

79. In a regression equation in the form of a hyperbola ŷ = a + b/x, if b > 0, then:

A) as the factor x increases, the values of the dependent variable y decrease slowly, and as x → ∞ the average value of y will be equal to a;

b) the value of the dependent variable y increases with slow growth as the factor x increases, and as x → ∞.

81. The elasticity coefficient is determined by the formula for a regression model in the form of:

A) a linear function;

b) a parabola;

c) a hyperbola;

d) an exponential curve;

e) a power function.

82. The elasticity coefficient is determined by the formula for a regression model in the form of:

a) a linear function;

B) a parabola;

c) a hyperbola;

d) an exponential curve;

e) a power function.

86. The equation is called:

A) a linear trend;

b) a parabolic trend;

c) a hyperbolic trend;

d) an exponential trend.

89. The equation is called:

a) a linear trend;

b) a parabolic trend;

c) a hyperbolic trend;

D) an exponential trend.

90. A system of equations of the form is called:

A) a system of independent equations;

b) a system of recursive equations;

c) a system of interdependent (joint, simultaneous) equations.

93. Econometrics can be defined as:

A) an independent scientific discipline combining a set of theoretical results, techniques, methods, and models intended to give a specific quantitative expression, on the basis of economic theory, economic statistics, and mathematical-statistical tools, to the general regularities conditioned by economic theory;

B) the science of economic measurements;

C) the statistical analysis of economic data.

94. The tasks of econometrics include:

A) forecasting economic and socio-economic indicators characterizing the state and development of the analyzed system;

B) simulating possible scenarios of the socio-economic development of the system in order to identify how planned changes in certain control parameters will affect the output characteristics;

c) testing hypotheses on statistical data.

95. By their nature, relationships are:

A) functional and correlational;

b) functional, curvilinear, and rectilinear;

c) correlational and inverse;

d) statistical and direct.

96. With a direct relationship, as the factor variable increases:

a) the dependent variable decreases;

b) the dependent variable does not change;

C) the dependent variable increases.

97. What methods are used to identify the presence, nature, and direction of a relationship in statistics?

a) averages;

B) comparison of parallel series;

C) the method of analytical grouping;

d) relative values;

E) the graphical method.

98. What method is used to identify the form of the influence of some factors on others?

a) correlation analysis;

B) regression analysis;

c) index analysis;

d) analysis of variance.

99. What method is used to quantify the strength of the influence of some factors on others:

A) correlation analysis;

b) regression analysis;

c) averages;

d) analysis of variance.

100. Which of the indicators lies within the limits from minus one to plus one:

a) the coefficient of determination;

b) the correlation ratio;

C) the linear correlation coefficient.

101. The regression coefficient in a one-factor model shows:

A) by how many units the function changes when the argument changes by one unit;

b) by how many percent the function changes when the argument changes by one unit.

102. The elasticity coefficient shows:

a) by how many percent the function changes when the argument changes by one unit of its measurement;

B) by how many percent the function changes when the argument changes by 1%;

c) by how many units of its measurement the function changes when the argument changes by 1%.

105. A correlation index equal to 0.087 indicates:

A) a weak dependence;

b) a strong relationship;

c) errors in the calculations.

107. A pairwise correlation coefficient equal to 1.12 indicates:

a) a weak dependence;

b) a strong relationship;

C) errors in the calculations.

109. Which of the listed numbers can be values of the pairwise correlation coefficient:

111. Which of the listed numbers can be values of the multiple correlation coefficient:

115. Indicate the correct form of the linear regression equation:

a) ŷ
;

b) ŷ
;

c) ŷ
;

D) ŷ
.

After the regression equation has been built and its accuracy has been estimated with the help of the coefficient of determination, the question remains open as to how this accuracy was achieved and, accordingly, whether this equation can be trusted. The point is that the regression equation was built not on the general population, which is unknown, but on a sample from it. Points from the general population enter the sample randomly; therefore, in accordance with probability theory, among other cases, it is possible that a sample drawn from a "wide" general population turns out to be "narrow" (Fig. 15).

Fig. 15. A possible variant of sample selection from the general population.

In this case:

a) the regression equation built on the sample may differ substantially from the regression equation for the general population, which will lead to forecast errors;

b) the coefficient of determination and other accuracy characteristics will be unjustifiably high and will be misleading about the predictive qualities of the equation.

In the extreme case, it cannot be ruled out that from a general population forming a cloud with its principal axis parallel to the horizontal axis (i.e., with no relationship between the variables), random selection will yield a sample whose principal axis is inclined to that axis. Thus, attempts to predict further values of the general population on the basis of sample data from it are fraught not only with errors in assessing the strength and direction of the relationship between the dependent and independent variables, but also with the danger of finding a relationship between variables where none actually exists.

In the absence of information about all the points of the general population, the only way to reduce errors in the first case is to use, when estimating the coefficients of the regression equation, a method that ensures their unbiasedness and efficiency. And the probability of the second case can be substantially reduced, because one property of a general population with two mutually independent variables is known a priori: the relationship is absent in it. This reduction is achieved by testing the statistical significance of the resulting regression equation.

One of the most frequently used variants of such a test is as follows. For the obtained regression equation the F-statistic is determined: a characteristic of the accuracy of the regression equation, representing the ratio of the part of the variance of the dependent variable explained by the regression equation to the unexplained (residual) part of the variance. In the case of multivariate regression, the F-statistic is determined by the equation:

F = S²_explained / S²_residual = (Σ(ŷ − ȳ)² / m) / (Σ(y − ŷ)² / (n − m − 1)),

where:

S²_explained is the explained variance, the part of the variance of the dependent variable that is explained by the regression equation;

S²_residual is the residual variance, the part of the variance of the dependent variable that is not explained by the regression equation; its presence is a consequence of the action of the random component;

n is the number of points in the sample;

m is the number of variables in the regression equation.

As can be seen from the above formula, each variance is defined as the quotient of the corresponding sum of squares divided by its number of degrees of freedom. The number of degrees of freedom is the minimum necessary number of values of the dependent variable that is sufficient to obtain the desired sample characteristic and that can vary freely, given that all the other values used to calculate this characteristic are known for the sample.

To obtain the residual variance, the coefficients of the regression equation are needed. In the case of paired linear regression there are two coefficients, so, in accordance with the formula (taking m = 1), the number of degrees of freedom is n − 2. This means that to determine the residual variance it is sufficient to know the coefficients of the regression equation and only n − 2 values of the dependent variable from the sample. The remaining two values can be calculated from these data and are therefore not freely varying.

To calculate the explained variance, the values of the dependent variable are not needed at all, since it can be computed knowing the regression coefficients of the independent variables and the variance of the independent variable. To see this, it is enough to recall the expression given earlier. Therefore, the number of degrees of freedom of the explained variance is equal to the number of independent variables in the regression equation (for paired linear regression, m = 1).

As a result, the F-statistic for the paired linear regression equation is determined by the formula:

F = (Σ(ŷ − ȳ)² / 1) / (Σ(y − ŷ)² / (n − 2)).

In probability theory it is proved that the F-statistic of a regression equation obtained for a sample from a general population in which there is no relationship between the dependent and independent variables has the Fisher distribution, which has been studied quite thoroughly. Thanks to this, for any value of the F-statistic one can calculate the probability of its occurrence and, conversely, determine the value of the F-statistic that it cannot exceed with a given probability.

To implement the statistical verification of the significance of the regression equation is formulated zero hypothesison the absence of communication between variables (all coefficients with variable are zero) and the level of significance is selected .

The significance level is the admissible probability of committing a Type I error: rejecting a true null hypothesis as a result of the check. In the case under consideration, a Type I error means concluding from the sample that a relationship between the variables exists in the general population when in fact there is none.

Typically the significance level is taken to be 5% or 1%. The smaller α, the higher the confidence level of the test, equal to 1 - α, and the better the chance of avoiding the error of recognizing, from the sample, a relationship between variables that are actually unrelated in the general population. But as the significance level decreases, the danger of a Type II error grows: failing to reject a false null hypothesis, i.e., failing to notice in the sample a relationship between the variables that actually exists in the general population. Accordingly, depending on which error has the larger negative consequences, one or another significance level is chosen.

For the chosen significance level, a table value F_table is determined from the Fisher distribution: the value whose probability of being exceeded, in a sample drawn from a general population with no relationship between the variables, does not exceed the significance level. F_table is then compared with the actual value of the criterion F for the regression equation.

If the condition F > F_table is satisfied, then an erroneous detection of a relationship with an F-criterion this large or larger, in a sample from a general population with unrelated variables, would occur with probability less than the significance level. In accordance with the rule that "very rare events do not happen," we conclude that the relationship established from the sample between the variables is also present in the general population from which the sample was drawn.

If it turns out that F < F_table, the regression equation is statistically insignificant. In other words, there is a real chance that the sample exhibits a relationship between the variables that does not exist in reality. An equation that fails the check of statistical significance is treated like a medicine with an expired shelf life: such medicines are not necessarily spoiled, but since there is no confidence in their quality, one prefers not to use them. This rule does not guard against all errors, but it avoids the grossest ones, which is also quite important.

The second variant of the check, more convenient when using spreadsheets, is to compare the probability of obtaining the observed value of the F-criterion with the significance level. If this probability is below the significance level α, the equation is statistically significant; otherwise it is not.

After checking the statistical significance of the regression equation as a whole, it is useful, especially for multidimensional dependences, to check the statistical significance of the obtained regression coefficients. The ideology of the check is the same as for the equation as a whole, but the criterion used is Student's t-criterion, defined by the formulas:

t_a = a / s_a and t_b = b / s_b,

where: t_a, t_b are the Student criterion values for the coefficients a and b respectively; s_a, s_b are the standard errors of the coefficients, computed from the residual variance of the regression equation S²_resid; n is the number of points in the sample; k is the number of independent variables (for paired linear regression k = 1).
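A numerical sketch of the t-criterion for the coefficients. The standard-error formulas in the code are the textbook-standard ones for paired regression, assumed here because the manual's own formulas did not survive conversion; the data are toy values:

```python
import math

# Student t-statistics for the coefficients of a paired regression.
# Assumed standard-error formulas:
#   s_b = S / sqrt(sum (x_i - mean_x)^2)
#   s_a = S * sqrt(sum x_i^2 / (n * Sxx))
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
a = my - b * mx
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s2 = sse / (n - 2)                                           # residual variance
s_b = math.sqrt(s2 / sxx)                                    # std. error of slope
s_a = math.sqrt(s2 * sum(xi ** 2 for xi in x) / (n * sxx))   # std. error of intercept
t_b, t_a = b / s_b, a / s_a
# for paired regression, t_b squared equals the F-criterion of the whole equation
print(round(t_b, 3), round(t_a, 3), round(t_b ** 2, 3))
```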

The resulting actual values of the Student criterion are compared with the table value t_table obtained from Student's distribution. If it turns out that t > t_table, the corresponding coefficient is statistically significant; otherwise it is not. The second variant of checking the statistical significance of the coefficients is to determine the probability of occurrence of the obtained value of the Student criterion and compare it with the significance level α.

For variables whose coefficients turned out to be statistically insignificant, there is a high likelihood that their influence on the dependent variable is absent in the general population altogether. Therefore one should either increase the number of points in the sample (the coefficient may then become statistically significant, and its value will be refined at the same time), or find other independent variables more closely related to the dependent variable. In both cases the accuracy of prediction will increase.

As an express method for assessing the significance of the coefficients of the regression equation, the following rule can be applied: if the Student criterion is greater than 3, the coefficient is usually statistically significant. In general, however, it is considered that to obtain statistically significant regression equations the condition t ≥ t_table must be satisfied.

The standard error of predicting an unknown value y₀ from the obtained regression equation with a known x₀ is estimated by the corresponding standard-error formula, which depends on x₀, the sample size, and the residual variance.

Thus the forecast with a confidence probability of 68% can be represented as y₀ ± S_pred.

If another confidence probability p is required, then for the significance level α = 1 - p the Student criterion t_α must be found, and the confidence interval for the prediction with reliability p will be y₀ ± t_α · S_pred.
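A sketch of the forecast interval. The expression for S_pred below is the usual standard error of an individual forecast, stated here as an assumption since the manual's own formula was not preserved:

```python
import math

# Standard error of an individual forecast at a new point x0 (assumed formula):
#   S_pred = S * sqrt(1 + 1/n + (x0 - mean_x)^2 / Sxx)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
a = my - b * mx
s = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))
x0 = 6.0
y0 = a + b * x0
s_pred = s * math.sqrt(1 + 1 / n + (x0 - mx) ** 2 / sxx)
# ~68% interval corresponds to +/- one standard error; for confidence p,
# multiply s_pred by the Student quantile t(1 - p, n - 2) instead
low, high = y0 - s_pred, y0 + s_pred
print(round(y0, 3), round(s_pred, 3))
```

Note how the error grows as x0 moves away from the sample mean: extrapolation is inherently less reliable than interpolation.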

Prediction of multidimensional and nonlinear dependencies

If the predicted value depends on several independent variables, we have multidimensional regression of the form:

y = b₀ + b₁x₁ + b₂x₂ + … + b_k·x_k,

where b₁, …, b_k are the regression coefficients describing the influence of the variables x₁, …, x_k on the predicted value.

The method of determining the regression coefficients does not differ from paired linear regression, especially when using a spreadsheet, since the same function serves both paired and multidimensional linear regression. It is desirable that there be no interrelations between the independent variables, i.e., that a change in one variable not affect the values of the others. This requirement is not mandatory, however; what matters is that there be no functional linear dependences between the variables. The procedures described above for checking the statistical significance of the obtained regression equation and of its individual coefficients, and for estimating prediction accuracy, remain the same as in the case of paired linear regression. At the same time, using multidimensional regression instead of paired regression usually allows the behavior of the dependent variable to be described considerably more accurately, and hence the forecasting accuracy to be improved.
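For readers curious about what the spreadsheet function does internally, here is a minimal sketch of multiple linear regression via the normal equations, on synthetic data generated exactly as y = 1 + 2x₁ + 3x₂ (so the fit must recover these coefficients):

```python
# Multiple linear regression via the normal equations (X'X)b = X'y, solved
# with plain Gaussian elimination. Data are synthetic: y = 1 + 2*x1 + 3*x2.
def solve(A, v):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(v)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):
        beta[r] = (M[r][n] - sum(M[r][c] * beta[c] for c in range(r + 1, n))) / M[r][r]
    return beta

x1 = [0, 1, 0, 1, 2, 1]
x2 = [0, 0, 1, 1, 1, 2]
y  = [1, 3, 4, 6, 8, 9]
rows = [[1.0, u, w] for u, w in zip(x1, x2)]    # design matrix with intercept
XtX = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
coef = solve(XtX, Xty)
print([round(c, 6) for c in coef])   # -> [1.0, 2.0, 3.0]
```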

In addition, the multidimensional linear regression equation makes it possible to describe a nonlinear dependence of the predicted value on the independent variables. The procedure of bringing a nonlinear equation to linear form is called linearization. In particular, if the dependence is described by a polynomial of degree other than 1, then replacing the variables whose degrees differ from one by new first-degree variables turns the nonlinear problem into a multidimensional linear regression problem. For example, if the influence of an independent variable is described by the parabola

y = b₀ + b₁x + b₂x²,

then the replacement z₁ = x, z₂ = x² converts the nonlinear problem to the multidimensional linear form

y = b₀ + b₁z₁ + b₂z₂.
Nonlinear problems in which the predicted value depends on a product of independent variables can be converted just as easily: to account for such an influence, a new variable equal to this product is introduced.

In cases where the nonlinearity is described by more complex dependences, linearization is possible through a transformation of coordinates. For this, transformed values of the variables are calculated and graphs of the source points are built in various combinations of the transformed variables. The combination of transformed coordinates, or of transformed and untransformed coordinates, in which the dependence is closest to a straight line suggests the replacement of variables that will convert the nonlinear dependence to linear form. For example, a nonlinear dependence of the form

y = a·xᵇ

turns into the linear form

Y = A + b·X,

where Y = ln y, A = ln a and X = ln x.
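This log transformation can be sketched as follows, assuming the power-law case above; the data are generated exactly from a power law, so the back-transformed coefficients must match:

```python
import math

# Linearization by taking logs for an assumed power law y = a * x**b:
# ln y = ln a + b * ln x, so a straight line is fitted to (ln x, ln y).
xs = [1, 2, 4, 8]
ys = [2 * v ** 1.5 for v in xs]          # exact power law: a = 2, b = 1.5
X = [math.log(v) for v in xs]
Y = [math.log(v) for v in ys]
n = len(X)
mX, mY = sum(X) / n, sum(Y) / n
b = sum((u - mX) * (w - mY) for u, w in zip(X, Y)) / sum((u - mX) ** 2 for u in X)
A = mY - b * mX
a = math.exp(A)                          # back-transform the intercept
print(round(a, 6), round(b, 6))
```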

The regression coefficients obtained for the transformed equation remain unbiased and efficient, but the check of the statistical significance of the equation and of the coefficients then refers to the transformed equation rather than to the original nonlinear one.

Verification of the validity of the application of the least squares method

The least squares method ensures efficient and unbiased estimates of the coefficients of the regression equation provided the following conditions (the Gauss-Markov conditions) are met:

1. The mathematical expectation of the residuals is zero: M(εᵢ) = 0.

2. The variance of the residuals is constant for all observations: D(εᵢ) = σ².

3. The values εᵢ do not depend on each other.

4. The values εᵢ do not depend on the independent variables.

Compliance with these conditions is checked most easily by building graphs of the residuals as a function of the calculated values and of the independent variables. If the points on these graphs lie in a corridor symmetric about the abscissa axis and no regularities are visible in their location, the Gauss-Markov conditions are fulfilled and there are no remaining ways to improve the accuracy of the regression equation. If this is not the case, the accuracy of the equation can be improved significantly, and one should turn to the special literature.
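Two of the properties underlying this residual check can be verified numerically: for a least-squares fit, the residuals sum to zero and are uncorrelated with the regressor, so any visible pattern on a residual plot points at a model defect rather than at the fitting procedure. A toy-data sketch:

```python
# OLS residuals sum to zero and are orthogonal to the regressor by
# construction; both sums below should vanish up to rounding error.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
a = my - b * mx
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
s_e = sum(resid)                                  # sum of residuals
s_ex = sum(e * xi for e, xi in zip(resid, x))     # residual-regressor cross sum
print(round(s_e, 10), round(s_ex, 10))
```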

To check significance, the ratio of a regression coefficient to its standard deviation is analyzed. This ratio follows Student's distribution, so the t-criterion is used to determine significance:

t = bᵢ / s_{bᵢ},

where s_{bᵢ} is computed from the root-mean-square value of the residual variance and the sum of squared deviations of the factor from its mean.

If t_calc > t_table, the coefficient bᵢ is significant.

The confidence interval is determined by the formula bᵢ ± t_table · s_{bᵢ}.
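The confidence-interval check can be sketched as follows; the Student quantile is hard-coded from a table, since the Python standard library does not provide it (toy data, for illustration only):

```python
import math

# Confidence interval for the slope, b +/- t_table * s_b. The two-sided 5%
# Student quantile for df = n - 2 = 3 is 3.182 (table value, hard-coded).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
a = my - b * mx
s2 = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
s_b = math.sqrt(s2 / sxx)
t_tab = 3.182
low, high = b - t_tab * s_b, b + t_tab * s_b
significant = not (low <= 0 <= high)   # interval covering zero -> not significant
print(round(low, 3), round(high, 3), significant)
```

On this tiny sample the interval covers zero, so the slope is not significant at the 5% level, which agrees with t = 2.12 < 3.182.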

Procedure for performing work

    Take the source data according to the work variant (by the student's number in the journal). A static control object is given with two inputs X1, X2 and one output Y. A passive experiment was carried out on the object and a sample of 30 points was obtained, containing the values of X1, X2 and Y for each experiment.

    Open a new file in Excel 2007. Enter the source information into the columns of the source table: the values of the input variables X1, X2 and of the output variable Y.

    Additionally prepare two columns for entering the calculated values of Y and the residuals.

    Call the "Regression" program: Data / Data Analysis / Regression.

Fig. 1. Dialog Box "Data Analysis".

    Enter the source data address in the Regression dialog box:

    input interval Y, input interval X (2 columns),

    set the level of reliability of 95%,

    in the "Output options" section, indicate the upper-left cell where the regression analysis output is to be placed (the first cell on page 2 of the worksheet),

    enable the "Residuals" and "Residual plots" options,

    press the OK button to start regression analysis.

Fig. 2. Regression dialog box.

    Excel will display 4 tables and 2 graphs of the dependence of the residuals on the variables X1 and X2.

    Format the "Summary output" table: widen the column with the names of the output data and set 3 significant digits after the decimal point in the second column.

    Format the analysis-of-variance table: set a convenient number of digits after the decimal point, shorten the variable names and adjust the column widths.

    Format the table of equation coefficients: shorten the variable names and adjust the column widths, set a number of significant digits convenient for reading and understanding, delete the last 2 columns (values and table markup).

    Transfer the data from the "Residual output" table to the prepared columns of the source table, then delete the "Residual output" table (use the "Paste Special" option).

    Enter the obtained coefficient estimates into the source table.

    Move the results tables up to the top of the page.

    Below the tables, build charts of Y_exp, Y_calc and the forecast errors (residuals).

    Format the residual diagrams. Using the obtained graphs, assess the correctness of the model's inputs X1, X2.

    Print the results of regression analysis.

    Clear the results of regression analysis.

    Prepare a report on work.

Example of performance

The results of regression analysis in the Excel package are presented in Figures 3-5.

Fig. 3. An example of regression analysis in the Excel package.


Fig. 4. Plots of the residuals against the variables X1, X2.

Fig. 5. Charts of Y_exp, Y_calc and the forecast errors (residuals).

According to regression analysis, you can say:

1. The regression equation obtained using Excel has the form:

    Determination coefficient:

The variation of the result is explained to the extent of 46.5% by the variation of the factors.

    The overall F-criterion checks the hypothesis of the statistical significance of the regression equation. The analysis is performed by comparing the actual and table values of Fisher's F-criterion.

Since the actual value exceeds the table value, we conclude that the obtained regression equation is statistically significant.

    Multiple correlation coefficient:

    b0:

t_table(29, 0.975) = 2.05

b0:

Confidence interval:

    Determine the confidence interval for the coefficient b1:

Checking the significance of the coefficient b1:

Since t_calc > t_table, the coefficient b1 is significant.

Confidence interval:

    Determine the confidence interval for the coefficient b2:

Checking the significance of the coefficient b2:

Determine the confidence interval:

Options for tasks

Table 2. Task options

Option number | Result sign Y_i | Factor numbers X_i

1-10 | Y1 | …

11-20 | Y2 | …

21-30 | Y3 | …

Table 3. Initial data

Y1 | Y2 | Y3 | X1 | X2 | X3 | X4 | X5

…

Questions for self-control

    Problems of regression analysis.

    Regression analysis prerequisites.

    The main equation of dispersion analysis.

    What does Fisher's ratio show?

    How is the table value of Fisher's criterion determined?

    What does the determination coefficient show?

    How to determine the significance of regression coefficients?

    How to determine the confidence interval of regression coefficients?

    How to determine the calculated value of the T-criterion?

    How to determine the table value of the T-criterion?

    Formulate the basic idea of analysis of variance; for what tasks is it most effective?

    What are the main theoretical prerequisites of analysis of variance?

    Decomposition of the total sum of squared deviations into components in analysis of variance.

    How to obtain estimates of dispersions from the sums of squares of deviations?

    How are the necessary numbers of degrees of freedom obtained?

    How is the standard error determined?

    Explain the diagram of two-factor dispersion analysis.

    What is the difference between cross classification and hierarchical classification?

    What is the difference between balanced and unbalanced data?

The report is prepared in the Word text editor on A4 paper (GOST 6656-76, 210x297 mm) and contains:

    Name of laboratory work.

    Purpose of work.

    The results of the calculations.

Time allotted for the laboratory work

Preparation for work - 0.5 Acad. hour.

Performing work - 0.5 Acad. hour.

Computer calculations - 0.5 acad. hour.

Preparing the report - 0.5 acad. hour.

Literature

    Semenov A. D., Artamonov D. V., Blyukhev A. V. Identification of control objects: tutorial. Penza: PSU, 2003. 211 p.

    Vukolov E. A. Basics of statistical analysis. Workshop on statistical methods and operations research using the Statistica and Excel packages: tutorial. M.: Forum, 2008. 464 p.

    Ignatiev A. A., Ignatiev S. A. Basics of the theory of identification of control objects: tutorial. Saratov: SSTU, 2008. 44 p.

    Gorleova G. V., Katsko I. A. Theory of probability and mathematical statistics in examples and tasks using Excel. Rostov n/D: Phoenix, 2006. 475 p.

    Goal of work

    Basic concepts

    Procedure for performing the work

    Example of performance

    Questions for self-control

    Time allotted for performance

    Having estimated the parameters a and b, we obtained the regression equation, by which the values of y can be estimated for given values of x. It is natural to expect that the calculated values of the dependent variable will not coincide with the actual values, since the regression line describes the relationship only on average, in general; individual values are scattered around it. Thus the reliability of the values calculated from the regression equation is largely determined by the scatter of the observed values around the regression line. In practice the variance of the errors is, as a rule, unknown and is estimated from the observations simultaneously with the regression parameters a and b. It is quite logical to assume that this estimate is related to the sum of squared regression residuals. The quantity S² is a sample estimate of the variance of the disturbances contained in the theoretical model. It can be shown that for the paired regression model

    S² = Σeᵢ² / (n - 2),

    where eᵢ = yᵢ - ŷᵢ is the deviation of the actual value of the dependent variable from its calculated value.

    If Σeᵢ² = 0, the actual values of the dependent variable coincide with the calculated (theoretical) values for all observations. Graphically this means that the theoretical regression line passes through all points of the correlation field, which is possible only with a strictly functional relationship. In that case the result y is fully determined by the influence of the factor x.

    Usually in practice the points of the correlation field are somewhat dispersed relative to the theoretical regression line, i.e., the empirical data deviate from the theoretical ones. This scatter is due both to the influence of the factor x, i.e., the regression of y on x (this part of the variation is called explained, since it is explained by the regression equation), and to the action of other causes (unexplained, random variation). The magnitude of these deviations underlies the calculation of the quality indicators of the equation.

    According to the basic proposition of analysis of variance, the total sum of squared deviations of the dependent variable y from its mean can be decomposed into two components: the part explained by the regression equation and the unexplained part:

    Σ(yᵢ - ȳ)² = Σ(ŷᵢ - ȳ)² + Σ(yᵢ - ŷᵢ)²,

    where ŷᵢ are the values of y calculated from the equation.

    Let us take the ratio of the sum of squared deviations explained by the regression equation to the total sum of squares:

    R² = Σ(ŷᵢ - ȳ)² / Σ(yᵢ - ȳ)² = 1 - Σ(yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)². (7.6)

    The ratio of the part of the variance explained by the regression equation to the total variance of the resulting feature is called the determination coefficient. The value of R² cannot exceed one, and this maximum value is achieved only when Σ(yᵢ - ŷᵢ)² = 0, i.e., when every deviation is zero and all points of the scatter chart lie exactly on the line.

    The determination coefficient characterizes the share of the variance explained by the regression in the total variance of the dependent variable. Accordingly, the value 1 - R² characterizes the proportion of the variation (variance) of y unexplained by the regression equation, and hence caused by the influence of other factors not accounted for in the model. The closer R² is to one, the higher the quality of the model.



    With paired linear regression the determination coefficient is equal to the square of the paired linear correlation coefficient: R² = r²ₓᵧ.

    The square root of the determination coefficient is the multiple correlation coefficient (index), or the theoretical correlation ratio.

    To find out whether the value of the determination coefficient obtained when estimating the regression reflects a true relationship between y and x, the significance of the constructed equation as a whole and of its individual parameters is checked. Checking the significance of the regression equation allows one to find out whether the equation is suitable for practical use, for example for forecasting, or not.

    The main hypothesis put forward is that the equation as a whole is insignificant, which formally reduces to the hypothesis that the regression parameters equal zero or, what is the same, that the determination coefficient equals zero: R² = 0. The alternative hypothesis of the significance of the equation is the hypothesis that the regression parameters are nonzero, or that the determination coefficient is nonzero: R² ≠ 0.

    To check the significance of the regression model, Fisher's F-criterion is used, calculated as the ratio of the sum of squares explained by the regression (per one independent variable) to the residual sum of squares (per one degree of freedom):

    F = [Σ(ŷᵢ - ȳ)² / k] / [Σ(yᵢ - ŷᵢ)² / (n - k - 1)], (7.7)

    where k is the number of independent variables.

    After dividing the numerator and denominator of relation (7.7) by the total sum of squared deviations of the dependent variable, the F-criterion can be equivalently expressed through the determination coefficient:

    F = (R² / k) / [(1 - R²) / (n - k - 1)].
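    The equivalence of formula (7.7) and its R²-based form can be confirmed numerically with hypothetical sums of squares (the numbers below are invented for illustration):

```python
# Both forms of the F-criterion (7.7) must agree: the sum-of-squares ratio
# and the R^2-based expression. Hypothetical decomposition: SSR = 3.6, SSE = 2.4.
ssr, sse, n, k = 3.6, 2.4, 5, 1
sst = ssr + sse
r2 = ssr / sst
F_ss = (ssr / k) / (sse / (n - k - 1))
F_r2 = (r2 / k) / ((1 - r2) / (n - k - 1))
print(round(F_ss, 6), round(F_r2, 6))
```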

    If the null hypothesis is valid, the variance explained by the regression equation and the unexplained (residual) variance do not differ from each other.

    The calculated value of the F-criterion is compared with a critical value, which depends on the number of independent variables k and on the number of degrees of freedom (n - k - 1). The table (critical) value of the F-criterion is the maximum value of the variance ratio that can occur through random divergence at a given probability level under the null hypothesis. If the calculated value of the F-criterion is greater than the table value at a given significance level, the null hypothesis of no relationship is rejected and the relationship is concluded to be material, i.e., the model is considered significant.

    For the paired regression model

    F = r² / (1 - r²) · (n - 2).

    In linear regression one usually assesses the significance not only of the equation as a whole but also of its individual coefficients. To do this, the standard error of each parameter is determined. The standard errors of the regression coefficients are determined by the formulas:

    s_b = S / √Σ(xᵢ - x̄)², (7.8)

    s_a = S · √(Σxᵢ² / (n · Σ(xᵢ - x̄)²)). (7.9)

    The standard errors of the regression coefficients, or root-mean-square deviations calculated by formulas (7.8), (7.9), are usually given in the output of the regression model in statistical packages.

    Relying on the standard errors of the regression coefficients, the significance of these coefficients is checked using the usual scheme for testing statistical hypotheses.

    As the main hypothesis, one puts forward the hypothesis that the "true" regression coefficient differs insignificantly from zero. The alternative hypothesis is the converse, i.e., that the "true" regression parameter is nonzero. This hypothesis is checked using t-statistics that have Student's t-distribution:

    t_a = a / s_a, t_b = b / s_b.

    The calculated values of the t-statistics are then compared with critical values determined from Student distribution tables. The critical value depends on the significance level α and the number of degrees of freedom, which equals (n - k - 1), where n is the number of observations and k the number of independent variables. In the case of linear paired regression the number of degrees of freedom is (n - 2). The critical value can also be calculated on a computer using Excel's built-in TINV (СТЬЮДРАСПОБР) function.

    If the calculated value of the t-statistic exceeds the critical one, the main hypothesis is rejected and it is believed that with probability (1 - α) the "true" regression coefficient differs significantly from zero, which is a statistical confirmation of the existence of a linear dependence between the corresponding variables.

    If the calculated value of the t-statistic is less than the critical one, there is no reason to reject the main hypothesis, i.e., the "true" regression coefficient differs insignificantly from zero at significance level α. In this case, the factor corresponding to this coefficient should be excluded from the model.

    The significance of a regression coefficient can also be established by building a confidence interval. The confidence intervals for the regression parameters a and b are determined as follows:

    a ± t_table · s_a,

    b ± t_table · s_b,

    where t_table is determined from the Student distribution table for significance level α and, for paired regression, (n - 2) degrees of freedom.

    Since regression coefficients in econometric studies have a clear economic interpretation, the confidence intervals should not contain zero. The true value of a regression coefficient cannot simultaneously take positive and negative values, including zero; otherwise we would obtain results contradicting the economic interpretation of the coefficients, which cannot be. Thus a coefficient is significant if the resulting confidence interval does not cover zero.

    Example 7.4. Using the data of Example 7.1:

    a) build a paired linear regression model of the dependence of sales profit on the selling price using data processing software;

    b) assess the significance of the regression equation as a whole using Fisher's F-criterion at α = 0.05;

    c) assess the significance of the coefficients of the regression model using Student's t-criterion at α = 0.05 and α = 0.1.

    For the regression analysis we use the standard Excel office program. We build the regression model with the Regression tool of the Analysis ToolPak (Fig. 7.5), which is launched as follows:

    Service → Data Analysis → Regression.

    Fig. 7.5. Using the Regression tool

    In the Regression dialog box, in the Input Y Range field you must enter the range of cells containing the dependent variable; in the Input X Range field, the addresses of one or more ranges containing the values of the independent variables. The Labels in First Row checkbox is set to the active state if the column headers are also selected. Fig. 7.6 shows the on-screen form of calculating the regression model using the Regression tool.

    Fig. 7.6. Building a paired regression model with the Regression tool

    As a result of the operation of the Regression tool, the following regression analysis protocol is formed (Fig. 7.7).

    Fig. 7.7. Regression analysis protocol

    The equation of the dependence of sales profit on the selling price has the form:

    Assessment of the significance of the regression equation by Fisher's F-criterion. The value of the F-criterion is taken from the analysis-of-variance table of the Excel protocol (Fig. 7.7). The calculated value of the F-criterion is 53.372. The table value of the F-criterion at significance level α = 0.05 and the corresponding numbers of degrees of freedom is 4.964. Since the calculated value exceeds the table value, the equation is considered significant.

    The calculated values of Student's t-criterion for the coefficients of the regression equation are given in the output table (Fig. 7.7). The table value of Student's t-criterion at significance level α = 0.05 and 10 degrees of freedom is 2.228. For the regression coefficient a, t_calc < t_table; therefore the coefficient a is not significant. For the regression coefficient b, t_calc > t_table; therefore the coefficient b is significant.