Assessment of the statistical significance of the regression equation and its parameters

With the help of LSM one can only obtain estimates of the parameters of the regression equation. To test whether the parameters are significant (i.e., whether they differ significantly from zero in the true regression equation), statistical methods of hypothesis testing are used. The main (null) hypothesis states that the regression parameter or the correlation coefficient does not differ significantly from zero; the alternative hypothesis is the opposite one, i.e. that the parameter or the correlation coefficient is not equal to zero. The hypothesis is tested with Student's t-test.

The value of the t-statistic found from observations (also called the observed or actual value) is compared with the tabular (critical) value taken from Student's distribution tables (usually given at the end of textbooks and workbooks on statistics or econometrics). The tabular value is determined by the significance level α and the number of degrees of freedom, which in the case of linear pair regression is equal to n − 2, where n is the number of observations.

If the actual value of the t-statistic is greater in absolute value than the tabular one, then with probability 1 − α the regression parameter (correlation coefficient) is considered significantly different from zero.

If the actual value of the t-statistic is less in absolute value than the tabular one, there is no reason to reject the main hypothesis, i.e. the regression parameter (correlation coefficient) differs insignificantly from zero at the significance level α.

The actual values of the t-statistics are determined by the formulas:

t_a1 = a1 / m_a1,

t_a0 = a0 / m_a0,

where m_a1 = S_e / √(Σ(x − x̄)²) and m_a0 = S_e·√(Σx² / (n·Σ(x − x̄)²)) are the standard errors of the parameters, and S_e = √(Σ(y − ŷ)² / (n − 2)) is the residual standard error.

To test the hypothesis of an insignificant difference from zero of the linear pair correlation coefficient, the following criterion is used:

t = r·√(n − 2) / √(1 − r²),

where r is the estimate of the correlation coefficient obtained from the observed data.
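As an illustration, the following sketch (in Python, with invented sample data; the variable names are ours and not part of the text) computes the observed t-statistics for the parameters and for the correlation coefficient and compares them with the critical Student value:

```python
import numpy as np
from scipy import stats

# hypothetical sample data (not from the text)
x = np.array([12.0, 14.5, 13.1, 15.8, 16.2, 11.4, 14.9, 17.0])
y = np.array([1.1, 1.4, 1.2, 1.6, 1.7, 1.0, 1.5, 1.8])
n = len(x)

# OLS estimates of the pair regression y = a0 + a1*x
a1 = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x**2) - x.mean()**2)
a0 = y.mean() - a1 * x.mean()
y_hat = a0 + a1 * x

# residual standard error with n - 2 degrees of freedom
s_e = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))

# standard errors of the parameters
m_a1 = s_e / np.sqrt(np.sum((x - x.mean()) ** 2))
m_a0 = s_e * np.sqrt(np.sum(x**2) / (n * np.sum((x - x.mean()) ** 2)))

# observed t-statistics for the parameters
t_a1 = a1 / m_a1
t_a0 = a0 / m_a0

# t-statistic for the correlation coefficient
r = np.corrcoef(x, y)[0, 1]
t_r = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

# critical (tabular) value at significance level alpha, df = n - 2
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
print(t_a1, t_a0, t_r, t_crit)  # a parameter is significant if |t| > t_crit
```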

Forecasting the expected value of the resultative feature Y from the linear pair regression equation.

Suppose it is required to estimate the predicted value of the resultative feature for a given value of the factor feature x_p. The predicted value of the resultative feature, with confidence probability 1 − α, belongs to the forecast interval:

ŷ_p − t·m_ŷ ≤ y_p ≤ ŷ_p + t·m_ŷ,

where ŷ_p is the point forecast;

t is the confidence coefficient determined from Student's distribution tables depending on the significance level α and the number of degrees of freedom;

m_ŷ is the average forecast error.

The point forecast is calculated from the linear regression equation as:

ŷ_p = a0 + a1·x_p.

The average forecast error is determined by the formula:

m_ŷ = S_e·√(1 + 1/n + (x_p − x̄)² / Σ(x − x̄)²).
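A minimal sketch of the point forecast and its confidence interval, again on invented data, might look as follows (x_p is taken as 105% of the mean factor level, as in the example below):

```python
import numpy as np
from scipy import stats

# hypothetical data and a fitted pair regression (names and values are ours)
x = np.array([12.0, 14.5, 13.1, 15.8, 16.2, 11.4, 14.9, 17.0])
y = np.array([1.1, 1.4, 1.2, 1.6, 1.7, 1.0, 1.5, 1.8])
n = len(x)
a1 = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x**2) - x.mean()**2)
a0 = y.mean() - a1 * x.mean()
s_e = np.sqrt(np.sum((y - (a0 + a1 * x)) ** 2) / (n - 2))

x_p = 1.05 * x.mean()              # forecast at 105% of the average factor level
y_p = a0 + a1 * x_p                # point forecast

# average forecast error for an individual value of y
m_y = s_e * np.sqrt(1 + 1 / n + (x_p - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2))

t = stats.t.ppf(1 - 0.05 / 2, df=n - 2)   # confidence coefficient for probability 0.95
lower, upper = y_p - t * m_y, y_p + t * m_y
print(y_p, (lower, upper))
```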

Example 1

Based on the data given in the Annex and corresponding to option 100, it is required:



1. Build a linear pair regression equation of one feature on the other. One of the features corresponding to your option will play the role of the factor (X), the other of the result (Y). Establish the cause-and-effect relationship between the features on the basis of economic analysis. Explain the meaning of the parameters of the equation.

2. Calculate the linear pair correlation coefficient and the coefficient of determination.

3. Evaluate the statistical significance of the regression parameters and the correlation coefficient at a significance level of 0.05.

4. Predict the expected value of the resultative feature Y for a predicted value of the factor feature x equal to 105% of the average level of X. Assess the accuracy of the forecast by calculating the forecast error and its confidence interval with a probability of 0.95.

Solution:

In this case we choose the market price of the shares as the factor feature, since the amount of accrued dividends depends on the performance of the shares. Thus, the resultative feature will be the accrued dividends.

To facilitate the calculations, we construct a calculation table, which is filled in as the problem is solved (Table 1).

For clarity, the dependence of Y on X is also shown graphically (Figure 2).

Table 1 - Calculation table


1. Let us build a regression equation of the form ŷ = a0 + a1·x.

To do this it is necessary to determine the parameters of the equation, a0 and a1.

Let us define a1:

a1 = (mean(x·y) − mean(x)·mean(y)) / (mean(x²) − (mean(x))²),

where mean(x²) is the average of the squared values of x;

(mean(x))² is the square of the average of x.

Let us define the parameter a0:

a0 = mean(y) − a1·mean(x).

We obtain the regression equation of the following form:

The parameter a0 shows how large the accrued dividends would be in the absence of any influence from the share price. The parameter a1 = 0.01 shows that when the share price changes by 1 rub., the dividends change in the same direction by 0.01 million rubles.



2. Let us calculate the linear pair correlation coefficient and the coefficient of determination.

The linear pair correlation coefficient is determined by the formula:

r = a1·σx / σy,

where σx and σy are the standard deviations of x and y. We define σx and σy.

The correlation coefficient, equal to 0.708, indicates a close relationship between the resultative and factor features.

The coefficient of determination is equal to the square of the linear correlation coefficient: r² = 0.708² ≈ 0.50.

The coefficient of determination shows that about 50% of the variation of the accrued dividends depends on the variation of the share price, and about 50% on other factors not taken into account in the model.
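A short sketch of these calculations in Python (the share-price and dividend figures are invented for illustration; the text's calculation table is not reproduced here):

```python
import numpy as np

# hypothetical share-price (x) and dividend (y) data; values are ours, not the text's
x = np.array([102.0, 98.0, 110.0, 105.0, 95.0, 120.0, 108.0])
y = np.array([1.05, 0.90, 1.20, 1.10, 0.85, 1.35, 1.15])

# slope via the "means" formula used in the text, then the intercept
a1 = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x**2) - x.mean()**2)
a0 = y.mean() - a1 * x.mean()

# linear pair correlation coefficient: r = a1 * sigma_x / sigma_y
sigma_x, sigma_y = x.std(), y.std()          # population form, as in the means formulas
r = a1 * sigma_x / sigma_y
r2 = r**2

print(f"y_hat = {a0:.3f} + {a1:.4f} x")
print(f"r = {r:.3f}, r^2 = {r2:.3f}")
print(f"{r2:.0%} of the variation of y is explained by x, {1 - r2:.0%} by other factors")
```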

3. Let us assess the significance of the parameters of the regression equation and of the linear correlation coefficient using Student's t-test. It is necessary to calculate the value of the t-statistic for each parameter and compare it with the tabular value.

To calculate the actual values of the t-statistics we define:

For the coefficients of the regression equation, their level of significance is checked using Student's t-test and Fisher's F-test. Below we consider the assessment of the reliability of the regression indicators only for the linear equations (12.1) and (12.2):

Y = a0 + a1·X, (12.1)

X = b0 + b1·Y. (12.2)

For equations of this type, only the values of the coefficients a1 and b1 are tested by Student's t-test, using the calculated value t_f obtained from the following formulas:

where r_yx is the correlation coefficient, and the value a1 can be calculated using formulas (12.5) or (12.7).

Formula (12.27) is used to calculate the quantity t_f, which allows estimating the level of significance of the coefficient a1 of the regression equation of Y on X.

The value b1 can be calculated using formulas (12.6) or (12.8).

Formula (12.29) is used to calculate the quantity t_f, which allows estimating the level of significance of the coefficient b1 of the regression equation of X on Y.

Example. Let us estimate the level of significance of the regression coefficients a1 and b1 of equations (12.17) and (12.19) obtained in solving problem 12.1. We use formulas (12.27), (12.28), (12.29) and (12.30) for this.

Recall the form of the obtained regression equations:

Y_x = 3 + 0.06·X, (12.17)

X_y = 9 + 1·Y. (12.19)

The value a1 in equation (12.17) is equal to 0.06. Therefore, to apply formula (12.27), we need to calculate the value S_b(yx). According to the conditions of the problem, the number of observations is n = 8. The correlation coefficient has already been calculated by formula (12.9): r_xy = √(0.06 · 0.997) = 0.244.

It remains to calculate the quantities Σ(y_i − ȳ)² and Σ(x_i − x̄)², which we have not yet computed. It is most convenient to do these calculations in Table 12.2:

Table 12.2

No.    x_i − x̄    (x_i − x̄)²    y_i − ȳ    (y_i − ȳ)²
1      −4.75       22.56          −1.75       3.06
2      −4.75       22.56          −0.75       0.56
3      −2.75        7.56           0.25       0.06
4      −2.75        7.56           1.25      15.62
5       1.25        1.56           1.25      15.62
6       3.25       10.56           0.25       0.06
7       5.25       27.56          −0.75       0.56
8       5.25       27.56           0.25       0.06
Sums               127.48                     35.6
Means  x̄ = 12.75                  ȳ = 3.75

We substitute the obtained values into formula (12.28) and get S_b(yx).

Now let us calculate the value t_f according to formula (12.27).

The value t_f is checked for significance against Table 16 of Appendix 1 for Student's t-test. The number of degrees of freedom in this case is 8 − 2 = 6, so the critical values are, respectively, t_cr = 2.45 for P ≤ 0.05 and t_cr = 3.71 for P ≤ 0.01. In the accepted form it looks like this:

We build the "axis of significance":

The obtained value t_f fell into the zone of insignificance, therefore we must accept the hypothesis H0 that the value of the regression coefficient of equation (12.17) is indistinguishable from zero. In other words, the resulting regression equation is inadequate to the original experimental data.
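For reference, the same check can be reproduced numerically from the quantities in Table 12.2. The formulas (12.27)-(12.28) themselves are not shown in the text, so the sketch below uses the standard expression for the standard error of the slope, which is an assumption on our part:

```python
import numpy as np
from scipy import stats

# quantities taken from the example (Table 12.2)
n = 8
a1 = 0.06                 # slope of equation (12.17)
r_xy = 0.244              # correlation coefficient
sum_sq_x = 127.48         # sum of (x_i - x_mean)^2
sum_sq_y = 35.6           # sum of (y_i - y_mean)^2

# assumed standard error of the slope (the textbook's formulas 12.27-12.28 are not shown)
s_b = np.sqrt(sum_sq_y * (1 - r_xy**2) / ((n - 2) * sum_sq_x))
t_f = a1 / s_b

t_crit_05 = stats.t.ppf(1 - 0.05 / 2, df=n - 2)   # about 2.45
print(t_f, t_crit_05)     # t_f falls well below t_crit, so a1 is not significant
```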



Let us now calculate the significance level of the coefficient b1. For this it is necessary to calculate the value S_b(xy) according to formula (12.30); all the necessary quantities have already been computed above:

Now let us calculate the value t_f according to formula (12.29):

We can immediately build the "axis of significance", since all the preliminary work has been done above:

The obtained value t_f fell into the zone of insignificance, therefore we must accept the hypothesis H0 that the value of the regression coefficient of equation (12.19) is indistinguishable from zero. In other words, the resulting regression equation is inadequate to the original experimental data.

Nonlinear Regression

The result obtained in the previous section is somewhat discouraging: we have found that both regression equations, (12.17) and (12.19), are inadequate to the experimental data. This happened because both of these equations describe a linear relationship between the features, whereas in Section 11.9 we showed that between the variables X and Y there is a significant curvilinear dependence. In other words, for the variables X and Y of this problem one must look not for a linear but for a curvilinear relationship. We will do this using the "Stage 6.0" package (developed by A. P. Kulaichev, registration number 1205).

Task 12.2. The psychologist wants to choose a regression model that is adequate to the experimental data obtained in problem 11.9.

Solution. This problem is solved by simply enumerating the curvilinear regression models offered in the Stage statistical package. The package is organized so that the experimental data are entered into the spreadsheet that serves as the source for further work, with the first column for the variable X and the second column for the variable Y. Then, in the main menu, the Statistics section is selected, in it the subsection Regression Analysis, and in that subsection the item Curvilinear Regression. The last menu offers formulas (models) of various kinds of curvilinear regression, for which the corresponding regression coefficients can be calculated and immediately checked for significance. Below we consider only a few examples of working with ready-made models (formulas) of curvilinear regression.



1. The first model is the exponential one. Its formula is:

When calculating with the statistical package we obtain a0 = 1 and a1 = 0.022.

The calculation of the significance level for a1 gave the value P = 0.535. Obviously, the obtained value is insignificant. Therefore, this regression model is inadequate to the experimental data.

2. The second model is the power one. Its formula is:

When calculating we obtain a0 = −5.29, a1 = 7.02 and a2 = 0.0987.

The significance level obtained for a2 is P = 0.991; none of the coefficients turns out to be significant.

3. The third model is the polynomial one. Its formula is:

Y = a0 + a1·X + a2·X² + a3·X³.

When calculating we obtain a0 = −29.8, a1 = 7.28, a2 = −0.488 and a3 = 0.0103. The significance level for a1 is P = 0.143, for a2 P = 0.2 and for a3 P = 0.272.

Conclusion - this model is inadequate to the experimental data.

4. The fourth model is the parabola.

Its formula is: Y = a0 + a1·X + a2·X².

When calculating we obtain a0 = −9.88, a1 = 2.24 and a2 = −0.0839. The significance level for a1 is P = 0.0186 and for a2 P = 0.0201. Both regression coefficients turned out to be significant. Therefore, the problem is solved: we have revealed the form of the curvilinear relationship between success in solving the third Wechsler subtest and the level of knowledge in algebra, namely a dependence of parabolic type. This result confirms the conclusion obtained in solving problem 11.9 about the presence of a curvilinear relationship between the variables. We emphasize that it was with the help of curvilinear regression that the exact form of the relationship between the studied variables was obtained.
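A comparable model search can be sketched outside the Stage package, for example with statsmodels in Python; the data below are invented, and only the parabolic model Y = a0 + a1·X + a2·X² is shown:

```python
import numpy as np
import statsmodels.api as sm

# hypothetical (x, y) data with a curvilinear (parabolic) relationship; not the textbook data
rng = np.random.default_rng(0)
x = np.linspace(5, 20, 30)
y = -10 + 2.2 * x - 0.08 * x**2 + rng.normal(scale=0.5, size=x.size)

# parabola Y = a0 + a1*X + a2*X^2 fitted by OLS
X = sm.add_constant(np.column_stack([x, x**2]))
fit = sm.OLS(y, X).fit()

print(fit.params)    # estimates of a0, a1, a2
print(fit.pvalues)   # significance levels (P) of each coefficient
# the model is accepted if every coefficient's P is below the chosen level (e.g. 0.05)
```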


Chapter 13 FACTOR ANALYSIS

Basic concepts of factor analysis

Factor analysis is a statistical method that is used when processing large amounts of experimental data. The tasks of factor analysis are: reducing the number of variables (data reduction) and determining the structure of relationships between variables, i.e. classification of variables, so factor analysis is used as a data reduction method or as a structural classification method.

An important difference between factor analysis and all the methods described above is that it cannot be used to process primary, or, as they say, "raw" experimental data, i.e. data obtained directly from the examination of the subjects. The material for factor analysis is correlations, more precisely Pearson's correlation coefficients, calculated between the variables (i.e. the psychological features) included in the survey. In other words, factor analysis is applied to correlation matrices, also called intercorrelation matrices. The names of the columns and rows in these matrices are the same, as they represent the list of variables included in the analysis. For this reason, intercorrelation matrices are always square, i.e. the number of rows equals the number of columns, and symmetric, i.e. the cells symmetric with respect to the main diagonal contain the same correlation coefficients.

It must be emphasized that the original data table from which the correlation matrix is ​​obtained does not have to be square. For example, a psychologist measured three indicators of intelligence (verbal, non-verbal and general) and school grades in three academic subjects (literature, mathematics, physics) in 100 subjects - ninth grade students. The original data matrix will be 100 x 6 and the intercorrelation matrix will be 6 x 6 because it has only 6 variables. With so many variables, the intercorrelation matrix will include 15 coefficients and it will not be difficult to analyze it.

However, imagine what happens if the psychologist obtains not 6 but 100 indicators from each subject. In this case he will have to analyze 4950 correlation coefficients. The number of distinct coefficients in the matrix is calculated by the formula n(n − 1)/2 and in our case equals (100 × 99)/2 = 4950.

Obviously, to conduct a visual analysis of such a matrix is ​​a difficult task. Instead, a psychologist can perform a mathematical procedure of factor analysis of a 100 × 100 correlation matrix (100 subjects and 100 variables) and in this way get easier material for interpreting experimental results.

The main concept of factor analysis is the factor. It is an artificial statistical indicator resulting from special transformations of the table of correlation coefficients between the studied psychological characteristics, i.e. of the intercorrelation matrix. The procedure for extracting factors from an intercorrelation matrix is called matrix factorization. As a result of factorization, a different number of factors can be extracted from the correlation matrix, up to a number equal to the number of original variables. However, the factors identified as a result of factorization are, as a rule, unequal in importance.

The elements of the factor matrix are called "factor loadings" or "weights"; they are the correlation coefficients of a given factor with all the indicators used in the study. The factor matrix is very important because it shows how the studied indicators are related to each selected factor. At the same time, the factor weight demonstrates the measure, or closeness, of this connection.

Since each column of the factor matrix (factor) is a kind of variable, the factors themselves can also correlate with each other. Two cases are possible here: the correlation between the factors is equal to zero, in which case the factors are independent (orthogonal); if the correlation between the factors is greater than zero, the factors are considered dependent (oblique). We emphasize that orthogonal factors, in contrast to oblique ones, give simpler patterns of interaction within the factor matrix.

As an illustration of orthogonal factors, the problem of L. Thurstone is often cited: taking a series of boxes of different sizes and shapes, he measured more than 20 different indicators on each of them and calculated the correlations between the indicators. Having factorized the resulting intercorrelation matrix, he obtained three factors whose mutual correlations were equal to zero. These factors were "length", "width" and "height".

In order to better grasp the essence of factor analysis, we will analyze the following example in more detail.

Suppose that a psychologist receives the following data from a random sample of students:

V1 - body weight (in kg);

V2 - the number of lectures and seminars attended on the subject;

V3 - leg length (in cm);

V4 - the number of books read on the subject;

V5 - arm length (in cm);

V6 - examination grade in the subject (V comes from the English word "variable").

When analyzing these features, it is reasonable to assume that the variables V1, V3 and V5 will be interrelated, because the larger the person, the more he weighs and the longer his limbs. This means that there should be statistically significant correlation coefficients between these variables, since all three measure some fundamental property of the individuals in the sample, namely their size. Similarly, it is likely that calculating the correlations between V2, V4 and V6 will also yield sufficiently high correlation coefficients, since attending lectures and independent study contribute to higher grades in the subject being studied.

Thus, from the entire possible array of coefficients obtained by enumerating pairs of correlated features (V1 and V2, V1 and V3, etc.), two blocks of statistically significant correlations will presumably stand out. The remaining correlations, between features belonging to different blocks, are unlikely to yield statistically significant coefficients, since relationships between features such as limb size and academic performance are most likely random. So a meaningful analysis of our 6 variables shows that they in fact measure only two generalized characteristics: body size and degree of preparedness in the subject.

To the resulting intercorrelation matrix, i.e. the pairwise correlation coefficients calculated between all six variables V1-V6, it is permissible to apply factor analysis. It can be carried out manually, with a calculator, but such statistical processing is very laborious. For this reason, factor analysis is currently carried out on computers, usually with standard statistical packages. All modern statistical packages include programs for correlation and factor analysis. A factor-analysis computer program essentially attempts to "explain" the correlations between variables in terms of a small number of factors (two in our example).
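As a rough sketch of what such a program does, the fragment below factorizes a small invented intercorrelation matrix by a principal-component-style eigendecomposition (real packages use more refined extraction methods and rotations):

```python
import numpy as np

# a small illustrative 6x6 intercorrelation matrix (values are invented, not survey data)
R = np.array([
    [1.00, 0.15, 0.80, 0.10, 0.75, 0.05],
    [0.15, 1.00, 0.10, 0.70, 0.12, 0.72],
    [0.80, 0.10, 1.00, 0.08, 0.78, 0.03],
    [0.10, 0.70, 0.08, 1.00, 0.09, 0.68],
    [0.75, 0.12, 0.78, 0.09, 1.00, 0.06],
    [0.05, 0.72, 0.03, 0.68, 0.06, 1.00],
])

# eigendecomposition of the correlation matrix (principal-component style factoring)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]          # largest eigenvalues first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                      # keep two factors, as in the example
loadings = eigvecs[:, :k] * np.sqrt(eigvals[:k])   # loadings = correlations with factors

print(np.round(loadings, 2))               # rows = variables V1..V6, columns = factors
print(np.round(eigvals[:k] / R.shape[0], 2))  # share of variance explained by each factor
```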

Suppose that, using a computer program, we have obtained the matrix of intercorrelations of all six variables and subjected it to factor analysis. As a result of factor analysis, Table 13.1 was obtained, which is called the “factor matrix”, or “factorial structural matrix”.

Table 13.1

Variable   Factor 1   Factor 2
V 1         0.91       0.01
V 2         0.20       0.96
V 3         0.94      -0.15
V 4         0.11       0.85
V 5         0.89       0.07
V 6        -0.13       0.93

Traditionally, factors are represented in the table as columns and variables as rows. The column headings of Table 13.1 correspond to the numbers of the selected factors, but it would be more accurate to call their entries "factor loadings", or "weights", on factor 1 and factor 2, respectively. As mentioned above, factor loadings, or weights, are the correlations between the corresponding variable and the given factor. For example, the first number 0.91 in the first factor means that the correlation between the first factor and the variable V1 equals 0.91. The higher the factor loading in absolute value, the closer the relationship of the variable with the factor.

Table 13.1 shows that the variables V1, V3 and V5 have large correlations with factor 1 (indeed, variable V3 has a correlation close to 1 with factor 1) and correlations close to 0 with factor 2. Similarly, factor 2 is highly correlated with the variables V2, V4 and V6 and practically uncorrelated with the variables V1, V3 and V5.

In this example it is obvious that there are two structures of correlations, and therefore all the information in Table 13.1 is determined by two factors. Now the final stage of the work begins: the interpretation of the data obtained. When analyzing the factor matrix, it is very important to take into account the signs of the factor loadings in each factor. If loadings with opposite signs occur in the same factor, this means that there is an inverse relationship between the variables with opposite signs.

Note that when interpreting the factor, for convenience, it is possible to reverse the signs of all loads for this factor.

The factor matrix also shows which variables form each factor. This is determined primarily by the significance level of the factor weight. Traditionally, the minimum significant level of correlation coefficients in factor analysis is taken to be 0.4 or even 0.3 (in absolute value), since there are no special tables from which critical values for the significance level in the factor matrix could be determined. Therefore, the simplest way to see which variables "belong" to a factor is to mark those with loadings greater than 0.4 (or less than −0.4). Note that in computer packages the significance level of the factor weight is sometimes determined by the program itself and set to a higher level, for example 0.7.

So, from Table 13.1 it follows that factor 1 is a combination of the variables V1, V3 and V5 (but not V2, V4 and V6, since their factor loadings are less than 0.4 in absolute value). Likewise, factor 2 is a combination of the variables V2, V4 and V6.

The factor selected as a result of factorization is a set of those variables included in the analysis that have significant loads. It often happens, however, that a factor includes only one variable with a significant factor weight, while the rest have an insignificant factor load. In this case, the factor will be determined by the name of the only significant variable.

In essence, the factor can be considered as an artificial "unit" of grouping variables (features) based on the links between them. This unit is conditional, because by changing certain conditions of the factorization procedure for the intercorrelation matrix, you can get a different factor matrix (structure). In the new matrix, the distribution of variables by factors and their factor loadings may turn out to be different.

In this regard, factor analysis uses the concept of a "simple structure". A simple structure of the factor matrix is one in which each variable has significant loadings on only one of the factors, and the factors themselves are orthogonal, i.e. do not depend on each other. In our example the two common factors are independent. A factor matrix with a simple structure makes it possible to interpret the result and give a name to each factor. In our case the first factor is "body size" and the second is "level of preparedness in the subject".

The foregoing does not exhaust the interpretive possibilities of the factor matrix. Additional characteristics can be extracted from it that allow a more detailed study of the relationships between variables and factors. These characteristics are called the "communality" of a variable and the "eigenvalue" of a factor.

However, before describing them, let us point out one fundamentally important property of the correlation coefficient from which these characteristics are obtained. The correlation coefficient squared (i.e., multiplied by itself) shows how much of the variance (variability) of a feature is shared by two variables, or, more simply, how strongly these variables overlap. For example, two variables with a correlation of 0.9 overlap to the extent of 0.9 × 0.9 = 0.81. This means that 81% of the variance of the two variables is common, i.e. coincides. Recall that the factor loadings in the factor matrix are the correlation coefficients between factors and variables; therefore the squared factor loading characterizes the degree of commonality (overlap) of the variances of a given variable and a given factor.

If the obtained factors do not depend on each other (an "orthogonal" solution), the weights of the factor matrix can be used to determine what part of the variance is common to a variable and a factor. To calculate how much of the variance of each variable coincides with the variance of the factors, one simply sums the squares of the factor loadings over all factors. From Table 13.1, for example, it follows that 0.91 × 0.91 + 0.01 × 0.01 = 0.8282, i.e. about 82% of the variability of the first variable is "explained" by the first two factors. The resulting value is called the communality of the variable, in this case the variable V1.

Variables can have different degrees of communality with the factors. A variable with a high communality has a significant degree of overlap (a large proportion of shared variance) with one or more factors. Low communality implies that all correlations between the variable and the factors are small, i.e. none of the factors shares an appreciable proportion of variance with this variable. Low communality may indicate that the variable measures something qualitatively different from the other variables included in the analysis. For example, a variable associated with the assessment of motivation among tasks that assess ability will have a communality with the ability factors close to zero.

Low communality can also mean that a particular task is heavily affected by measurement error or is extremely difficult for the subjects. It is also possible, on the contrary, that the task is so simple that every subject answers it correctly, or that the task is so vague in content that the subject does not understand the essence of the question. Thus, low communality implies that the variable does not combine with the factors for one of the following reasons: either the variable measures a different concept, or it has a large measurement error, or there are differences between subjects in the response variants for this item that distort the variance of the feature.

Finally, with the help of such a characteristic as the eigenvalue of a factor, one can determine the relative importance of each of the selected factors. To do this, you need to calculate how much of the variance (variance) each factor explains. The factor that explains 45% of the variance (overlap) between variables in the original correlation matrix is ​​obviously more significant than the one that explains only 25% of the variance. These arguments, however, are admissible if the factors are orthogonal, in other words, do not depend on each other.

In order to calculate the eigenvalue of a factor, you need to square its factor loadings and add them up over the column. Using the data of Table 13.1, we can verify that the eigenvalue of factor 1 is (0.91 × 0.91 + 0.20 × 0.20 + 0.94 × 0.94 + 0.11 × 0.11 + 0.84 × 0.84 + (−0.13) × (−0.13)) = 2.4863. If the eigenvalue of a factor is divided by the number of variables (6 in our example), the resulting number shows what proportion of the variance is explained by this factor. In our case we get 2.4863 · 100% / 6 = 41.4%. In other words, factor 1 explains about 41% of the information (variance) in the original correlation matrix. A similar calculation for the second factor gives 41.5%. In total this is 82.9%.
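These calculations are easy to reproduce from the loadings of Table 13.1; note that the text's arithmetic for factor 1 uses 0.84 for V5, so small discrepancies with the table value 0.89 are possible:

```python
import numpy as np

# factor loadings from Table 13.1 (rows: V1..V6, columns: factor 1, factor 2)
A = np.array([
    [ 0.91,  0.01],
    [ 0.20,  0.96],
    [ 0.94, -0.15],
    [ 0.11,  0.85],
    [ 0.89,  0.07],
    [-0.13,  0.93],
])

communalities = (A**2).sum(axis=1)     # share of each variable's variance explained by the factors
eigenvalues = (A**2).sum(axis=0)       # column sums of squared loadings
explained_share = eigenvalues / A.shape[0]

print(np.round(communalities, 4))      # e.g. V1: 0.91^2 + 0.01^2 = 0.8282
print(np.round(eigenvalues, 4), np.round(explained_share * 100, 1))
```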

Thus the two common factors, taken together, explain only 82.9% of the variance of the indicators of the original correlation matrix. What happened to the "remaining" 17.1%? The point is that, while considering the correlations between the 6 variables, we noted that they fall into two separate blocks, and therefore decided that it was logical to analyze the material in terms of two factors rather than 6, the number of original variables. In other words, the number of constructs needed to describe the data decreased from 6 (the number of variables) to 2 (the number of common factors). As a result of factorization, part of the information in the original correlation matrix was sacrificed to build a two-factor model. The only condition under which no information would be lost is to consider a six-factor model.

After assessing the individual statistical significance of each of the regression coefficients, their cumulative significance is usually analyzed, i.e. the significance of the entire equation as a whole. This analysis is based on testing the hypothesis of the overall insignificance of the equation, i.e. the hypothesis that all regression coefficients of the explanatory variables are simultaneously equal to zero:

H 0: b 1 = b 2 = ... = b m = 0.

If this hypothesis is not rejected, then it is concluded that the cumulative effect of all m explanatory variables X 1 , X 2 , ..., X m of the model on the dependent variable Y can be considered statistically insignificant, and the overall quality of the regression equation is low.

This hypothesis is tested on the basis of analysis of variance comparing the explained and residual variance.

H 0: (explained variance) = (residual variance),

H 1: (explained variance) > (residual variance).

The F-statistic is constructed:

F = [Σ(ŷ_i − ȳ)² / m] / [Σ(y_i − ŷ_i)² / (n − m − 1)],   (8.19)

where the numerator is the variance explained by the regression (the explained sum of squared deviations divided by its number of degrees of freedom m);

the denominator is the residual variance (the residual sum of squared deviations divided by its number of degrees of freedom n − m − 1). When the LSM prerequisites are met, the constructed F-statistic has a Fisher distribution with degrees of freedom n1 = m, n2 = n − m − 1. Therefore, if at the required significance level α we have F_obs > F_α; m; n−m−1 (where F_α; m; n−m−1 is the critical point of the Fisher distribution), then H0 is rejected in favour of H1. This means that the variance explained by the regression is significantly greater than the residual variance and, consequently, the regression equation reflects the dynamics of the dependent variable Y quite well. If F_obs < F_α; m; n−m−1 = F_cr, there is no reason to reject H0. Then the explained variance is comparable with the variance caused by random factors, which gives grounds to consider the cumulative influence of the explanatory variables of the model insignificant, and hence the overall quality of the model low.

However, in practice, instead of this hypothesis a closely related hypothesis about the statistical significance of the coefficient of determination R² is tested:

H0: R² = 0, H1: R² > 0.

To test this hypothesis, the following F-statistic is used:

F = (R² / (1 − R²)) · ((n − m − 1) / m).   (8.20)

The value of F, provided that the LSM prerequisites are met and H0 is valid, has a Fisher distribution identical to the distribution of the F-statistic (8.19). Indeed, dividing the numerator and denominator of the fraction in (8.19) by the total sum of squared deviations and using the fact that it decomposes into the sum of squared deviations explained by the regression and the residual sum of squared deviations (this is a consequence, as will be shown later, of the system of normal equations),

Σ(y_i − ȳ)² = Σ(ŷ_i − ȳ)² + Σ(y_i − ŷ_i)²,

we obtain formula (8.20).

From (8.20) it is obvious that the indicators F and R² equal zero, or differ from zero, simultaneously. If F = 0, then R² = 0 and the line ŷ = ȳ is the best OLS regression line, so the value of Y does not depend linearly on X1, X2, ..., Xm. To test the null hypothesis H0: F = 0 at a given significance level α, the critical value F_cr = F_α; m; n−m−1 is found from the tables of critical points of Fisher's distribution. The null hypothesis is rejected if F > F_cr. This is equivalent to R² > 0, i.e. R² is statistically significant.

Analysis of the F-statistic shows that rejecting the hypothesis of simultaneous equality to zero of all linear regression coefficients requires the coefficient of determination R² to differ significantly from zero. The corresponding critical value of R² decreases as the number of observations grows and can become arbitrarily small.

Suppose, for example, that when estimating a regression with two explanatory variables X1i, X2i on 30 observations, R² = 0.65. Then

F_obs = (0.65 / (1 − 0.65)) · ((30 − 2 − 1) / 2) = 25.07.

From the tables of critical points of the Fisher distribution we find F_0.05; 2; 27 = 3.36 and F_0.01; 2; 27 = 5.49. Since F_obs = 25.07 > F_cr both at the 5% and at the 1% significance level, the null hypothesis is rejected in both cases.

If in the same situation R² = 0.4, then

F_obs = (0.4 / (1 − 0.4)) · ((30 − 2 − 1) / 2) = 9.

The assumption of the insignificance of the connection is rejected here as well.
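The two worked examples can be checked with a short sketch; scipy's distribution functions replace the printed tables of critical points:

```python
from scipy import stats

def f_from_r2(r2: float, n: int, m: int) -> float:
    """F-statistic (8.20) computed from the coefficient of determination."""
    return (r2 / (1 - r2)) * (n - m - 1) / m

n, m = 30, 2
for r2 in (0.65, 0.40):
    f_obs = f_from_r2(r2, n, m)
    f_crit_05 = stats.f.ppf(0.95, dfn=m, dfd=n - m - 1)   # about 3.35
    f_crit_01 = stats.f.ppf(0.99, dfn=m, dfd=n - m - 1)   # about 5.49
    print(r2, round(f_obs, 2), round(f_crit_05, 2), round(f_crit_01, 2))
```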

Note that in the case of pair regression, testing the null hypothesis with the F-statistic is equivalent to testing the null hypothesis with the t-statistic of the correlation coefficient. In this case the F-statistic is equal to the square of the t-statistic. The coefficient R² acquires independent significance in the case of multiple linear regression.

8.6. Analysis of variance to decompose the total sum of squared deviations. Degrees of freedom for the corresponding sums of squared deviations

Let's apply the above theory for pairwise linear regression.

After the linear regression equation is found, the significance of both the equation as a whole and its individual parameters is assessed.

The significance of the regression equation as a whole is assessed using Fisher's F-test. The null hypothesis put forward is that the regression coefficient is equal to zero, i.e. b = 0, and hence the factor x has no effect on the result y.

The direct calculation of the F-test is preceded by an analysis of variance. The central place in it is occupied by the decomposition of the total sum of squared deviations of the variable y from its mean value into two parts, "explained" and "unexplained":

Σ(y − ȳ)² = Σ(ŷ − ȳ)² + Σ(y − ŷ)².   (8.21)

Equation (8.21) is a consequence of the system of normal equations derived in one of the previous topics.

Proof of expression (8.21).

Expanding the left-hand side gives

Σ(y − ȳ)² = Σ(ŷ − ȳ)² + Σ(y − ŷ)² + 2·Σ(ŷ − ȳ)(y − ŷ).

It remains to prove that the last term is equal to zero.

If we add up all the equations from 1 to n,

y_i = a + b·x_i + e_i,   (8.22)

then we get Σy_i = a·Σ1 + b·Σx_i + Σe_i. Since Σe_i = 0 and Σ1 = n, dividing by n we get

ȳ = a + b·x̄.   (8.23)

If we subtract equation (8.23) from expression (8.22), then we get

y_i − ȳ = b·(x_i − x̄) + e_i.

As a result, we get

Σ(ŷ_i − ȳ)(y_i − ŷ_i) = Σ b·(x_i − x̄)·e_i = b·(Σx_i·e_i − x̄·Σe_i).

The last sums are equal to zero due to the system of two normal equations.

The total sum of the squared deviations of the individual values of the resultative feature y from the mean value is caused by the influence of many causes. We conditionally divide the entire set of causes into two groups: the studied factor x and all other factors. If the factor x has no effect on the result, the regression line is parallel to the OX axis and ŷ = ȳ. Then the entire dispersion of the resultative feature is due to the influence of other factors, and the total sum of squared deviations coincides with the residual sum. If other factors do not affect the result, then y is functionally related to x and the residual sum of squares is zero. In this case the sum of squared deviations explained by the regression coincides with the total sum of squares.

Since not all points of the correlation field lie on the regression line, their scatter is always due partly to the influence of the factor x (the regression of y on x) and partly to the action of other causes (unexplained variation). The suitability of the regression line for prediction depends on what part of the total variation of the feature y is accounted for by the explained variation. Obviously, if the sum of squared deviations due to the regression is greater than the residual sum of squares, then the regression equation is statistically significant and the factor x has a significant impact on the feature y. This is equivalent to the coefficient of determination approaching unity.

Any sum of squares is associated with its number of degrees of freedom (df), i.e. the number of independent deviations of the feature that can vary freely. The number of degrees of freedom is related to the number of units of the population n and to the number of constants determined from it. In relation to the problem under study, the number of degrees of freedom should show how many independent deviations out of n possible are required to form a given sum of squares. For the total sum of squares, (n − 1) independent deviations are required, because in a set of n units, after calculating the average, only (n − 1) deviations vary freely. For example, we have a series of y values: 1, 2, 3, 4, 5. Their average is 3, and the n deviations from the average are: −2, −1, 0, 1, 2. Since the deviations must sum to zero, only four of them vary freely, and the fifth can be determined once the other four are known.

When calculating the explained, or factor, sum of squares, the theoretical (calculated) values of the resultative feature ŷ = a + b·x are used.

Then the sum of squared deviations due to linear regression is equal to

Σ(ŷ − ȳ)² = b²·Σ(x − x̄)².

Since, for a given set of observations of x and y, the factor sum of squares in linear regression depends only on the regression coefficient b, this sum of squares has only one degree of freedom.

The number of degrees of freedom of the total sum of squared deviations equals the sum of the degrees of freedom of the factor and residual sums. The number of degrees of freedom of the residual sum of squares in linear regression is n − 2. The number of degrees of freedom of the total sum of squares is determined by the number of units of the varying feature, and since we use the average calculated from the sample data, we lose one degree of freedom, i.e. df_total = n − 1.

So we have two equalities:

Σ(y − ȳ)² = Σ(ŷ − ȳ)² + Σ(y − ŷ)²,

(n − 1) = 1 + (n − 2).

Dividing each sum of squares by its number of degrees of freedom, we obtain the mean square of the deviations, or, equivalently, the variance per one degree of freedom D:

D_total = Σ(y − ȳ)² / (n − 1);

D_fact = Σ(ŷ − ȳ)² / 1;

D_rest = Σ(y − ŷ)² / (n − 2).

Determining the variance per one degree of freedom brings the variances to a comparable form. Comparing the factor and residual variances per one degree of freedom, we obtain the value of Fisher's F-criterion:

F = D_fact / D_rest,

where F is the criterion for testing the null hypothesis H0: D_fact = D_rest.

If the null hypothesis is true, the factor and residual variances do not differ from each other. For H0 to be rejected, the factor variance must exceed the residual variance several times. The American statistician Snedecor developed tables of critical values of F-ratios for different significance levels of the null hypothesis and various numbers of degrees of freedom. The tabular value of the F-criterion is the maximum value of the ratio of the variances that can occur by chance, for a given probability level, under the null hypothesis. The calculated value of the F-ratio is considered reliable if it is greater than the tabular one. If F_fact > F_table, then the null hypothesis H0: D_fact = D_rest about the absence of a relationship between the features is rejected and a conclusion is made about the significance of this relationship.

If F_fact < F_table, then the probability of the null hypothesis H0: D_fact = D_rest is higher than the specified level (for example, 0.05), and it cannot be rejected without a serious risk of drawing a wrong conclusion about the presence of a relationship. In this case the regression equation is considered statistically insignificant; the hypothesis H0 is not rejected.

In this example from Chapter 3:

Σ(y − ȳ)² = 131200 − 7·14400 = 30400 - the total sum of squares;

Σ(ŷ − ȳ)² = 1057.878·(135.43 − 7·(3.92571)²) = 28979.8 - the factor sum of squares;

Σ(y − ŷ)² = 30400 − 28979.8 = 1420.197 - the residual sum of squares;

D_fact = 28979.8;

D_rest = 1420.197 / (n − 2) = 284.0394;

F_fact = 28979.8 / 284.0394 = 102.0274;

F(α = 0.05; 1; 5) = 6.61; F(α = 0.01; 1; 5) = 16.26.

Since F_fact > F_table both at the 1% and at the 5% significance level, we can conclude that the regression equation is significant (the relationship is proven).
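The same analysis of variance can be reproduced from the quoted sums of squares (n = 7 is inferred from D_rest = 1420.197/(n − 2) = 284.0394):

```python
from scipy import stats

# sums of squares quoted in the example (n = 7 observations)
n = 7
ss_total = 30400.0
ss_factor = 28979.8
ss_residual = ss_total - ss_factor          # about 1420.2

d_fact = ss_factor / 1                      # factor variance, 1 degree of freedom
d_rest = ss_residual / (n - 2)              # residual variance, n - 2 degrees of freedom
f_fact = d_fact / d_rest                    # about 102

f_crit_05 = stats.f.ppf(0.95, dfn=1, dfd=n - 2)   # about 6.61
f_crit_01 = stats.f.ppf(0.99, dfn=1, dfd=n - 2)   # about 16.26
print(round(f_fact, 2), round(f_crit_05, 2), round(f_crit_01, 2))
```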

The value of the F-criterion is related to the coefficient of determination. The factor sum of squared deviations can be represented as

Σ(ŷ − ȳ)² = r²·Σ(y − ȳ)²,

and the residual sum of squares as

Σ(y − ŷ)² = (1 − r²)·Σ(y − ȳ)².

Then the value of the F-criterion can be expressed as

F = (r² / (1 − r²))·(n − 2).

An assessment of the significance of a regression is usually given in the form of an analysis of variance table; the F value is compared with the tabular value at a certain significance level α and the number of degrees of freedom (n − 2).

Sources of variation   Degrees of freedom   Sum of squared deviations   Dispersion per degree of freedom   F-ratio, actual   F-ratio, tabular at α = 0.05
Total                  6                    30400
Explained              1                    28979.8                     28979.8                            102.0274          6.61
Residual               5                    1420.197                    284.0394

To assess the significance of the correlation coefficient, Student's t-test is used.

The average error of the correlation coefficient is found by the formula:

m_r = √((1 − r²) / (n − 2)),

and based on this error the t-test is calculated:

t = r / m_r.

The calculated value of the t-test is compared with the tabular value found in the Student's distribution table at a significance level of 0.05 or 0.01 and the number of degrees of freedom n-1. If the calculated value of the t-test is greater than the tabulated one, then the correlation coefficient is recognized as significant.

With a curvilinear relationship, the F-criterion is used to assess the significance of the correlation relation and of the regression equation. It is calculated by the formula:

F = (η² / (m − 1)) / ((1 − η²) / (n − m)),

where η is the correlation ratio; n is the number of observations; m is the number of parameters in the regression equation.

The calculated value of F is compared with the table value for the accepted level of significance α (0.05 or 0.01) and the number of degrees of freedom k 1 =m-1 and k 2 =n-m. If the calculated value of F exceeds the tabulated value, the relationship is recognized as significant.
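A small sketch of this check, with an invented correlation ratio and sample size:

```python
from scipy import stats

def f_from_eta(eta: float, n: int, m: int) -> float:
    """F-criterion for a curvilinear relationship via the correlation ratio eta."""
    return (eta**2 / (m - 1)) / ((1 - eta**2) / (n - m))

# illustrative values (not from the text): eta = 0.8, 30 observations, 3 parameters
eta, n, m = 0.8, 30, 3
f_obs = f_from_eta(eta, n, m)
f_crit = stats.f.ppf(0.95, dfn=m - 1, dfd=n - m)
print(round(f_obs, 2), round(f_crit, 2))   # the relationship is significant if f_obs > f_crit
```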

The significance of the regression coefficient is established using Student's t-test, which is calculated by the formula:

t_ai = a_i / σ_ai,

where σ²_ai is the variance of the regression coefficient.

It is calculated by the formula:

where k is the number of factor features in the regression equation.

The regression coefficient is recognized as significant if t_a1 ≥ t_cr. The value t_cr is found from the table of critical points of Student's distribution at the accepted significance level and the number of degrees of freedom k = n − 1.

4.3 Correlation-regression analysis in Excel

Let us carry out a correlation-regression analysis of the relationship between yield and labor costs per 1 quintal of grain. To do this, open an Excel sheet, enter the values of the factor feature (grain yield) in cells A1:A30 and the values of the resultative feature (labor costs per 1 quintal of grain) in cells B1:B30. From the Tools menu, select Data Analysis, and in the list of analysis tools select Regression. Click OK, and the Regression dialog box appears. In the Input Y Range field, enter the values of the resultative feature (selecting cells B1:B30); in the Input X Range field, enter the values of the factor feature (selecting cells A1:A30). Set the confidence level to 95% and select New Worksheet. Click OK. The "RESULTS" table appears on the worksheet, giving the calculated parameters of the regression equation, the correlation coefficient and other indicators needed to determine the significance of the correlation coefficient and of the parameters of the regression equation.

RESULTS

Regression statistics:
Multiple R
R-square
Normalized R-square
Standard error
Observations

Analysis of variance:
Regression
Significance F

Coefficients table (Coefficients, Standard error, t-statistic, P-value, Lower 95%, Upper 95%, Lower 95.0%, Upper 95.0%):
Y-intercept
Variable X 1

In this table, "Multiple R" is the correlation coefficient, "R-squared" is the coefficient of determination. "Coefficients: Y-intersection" - a free term of the regression equation 2.836242; "Variable X1" - regression coefficient -0.06654. There are also the values ​​of Fisher's F-test 74.9876, Student's t-test 14.18042, "Standard error 0.112121", which are necessary to assess the significance of the correlation coefficient, parameters of the regression equation and the entire equation.

Based on the data of the table, we construct the regression equation: ŷ_x = 2.836 − 0.067x. The regression coefficient a1 = −0.067 means that with an increase in grain yield by 1 quintal/ha, labor costs per 1 quintal of grain decrease by 0.067 man-hours.

Correlation coefficient r=0.85>0.7, therefore, the relationship between the studied features in this population is close. The coefficient of determination r 2 =0.73 shows that 73% of the variation of the effective trait (labor costs per 1 centner of grain) is caused by the action of the factor trait (grain yield).

In the table of critical points of the Fisher - Snedecor distribution, we find the critical value of the F-criterion at a significance level of 0.05 and the number of degrees of freedom k 1 =m-1=2-1=1 and k 2 =n-m=30-2=28, it is equal to 4.21. Since the calculated value of the criterion is greater than the tabular value (F=74.9896>4.21), the regression equation is recognized as significant.

To assess the significance of the correlation coefficient, we calculate Student's t-test.

In the table of critical points of Student's distribution, we find the critical value of the t-test at a significance level of 0.05 and the number of degrees of freedom n − 1 = 30 − 1 = 29; it is equal to 2.0452. Since the calculated value is greater than the tabular one, the correlation coefficient is significant.
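The whole Excel calculation can be reproduced, for example, with statsmodels in Python; since the 30 observations themselves are not given in the text, the sketch below generates hypothetical data with similar characteristics:

```python
import numpy as np
import statsmodels.api as sm

# hypothetical yield (x, quintal/ha) and labor-cost (y, man-hours per quintal) data;
# the 30 actual observations from the worksheet are not reproduced in the text
rng = np.random.default_rng(1)
x = rng.uniform(10, 40, size=30)
y = 2.836 - 0.067 * x + rng.normal(scale=0.11, size=30)

model = sm.OLS(y, sm.add_constant(x)).fit()

print(model.params)               # "Y-intercept" and "Variable X 1" coefficients
print(model.rsquared)             # "R-square"
print(model.fvalue)               # F-statistic of the analysis-of-variance block
print(model.tvalues)              # t-statistics of the coefficients
print(model.conf_int(alpha=0.05)) # lower / upper 95% bounds
print(model.summary())            # full analogue of the Excel "RESULTS" table
```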

Estimating the significance of a multiple regression equation

The construction of an empirical regression equation is the initial stage of econometric analysis. The first regression equation built on the basis of a sample is very rarely satisfactory in terms of one or another characteristic. Therefore, the next most important task of econometric analysis is to check the quality of the regression equation. In econometrics, a well-established scheme for such verification is adopted.

So, the verification of the statistical quality of the estimated regression equation is carried out in the following areas:

Checking the significance of the regression equation;

Checking the statistical significance of the coefficients of the regression equation;

Checking the properties of the data, the feasibility of which was assumed when evaluating the equation (checking the feasibility of the LSM prerequisites).

Checking the significance of a multiple regression equation, as for pair regression, is carried out using Fisher's criterion. In this case (unlike pair regression), the null hypothesis H0 is put forward that all regression coefficients are equal to zero (b1 = 0, b2 = 0, ..., bm = 0). The Fisher criterion is determined by the following formula:

F = D_fact / D_rest = (R² / (1 − R²)) · ((n − m − 1) / m),

where D_fact is the factor variance, explained by the regression, per one degree of freedom; D_rest is the residual variance per one degree of freedom; R² is the coefficient of multiple determination; m is the number of parameters at the factors x in the regression equation (in pair linear regression m = 1); n is the number of observations.

The obtained value of the F-criterion is compared with the tabular value at a certain significance level. If the actual value is greater than the tabular one, the hypothesis H0 about the insignificance of the regression equation is rejected and the alternative hypothesis about its statistical significance is accepted.

Using the Fisher criterion, one can evaluate the significance of not only the regression equation as a whole, but also the significance of the additional inclusion of each factor in the model. Such an assessment is necessary in order not to load the model with factors that do not significantly affect the result. In addition, since the model consists of several factors, they can be introduced into it in a different sequence, and since there is a correlation between the factors, the significance of including the same factor in the model may differ depending on the sequence of factors introduced into it.

To assess the significance of including an additional factor in the model, the partial Fisher criterion F_xi is calculated. It is based on comparing the increase in the factor variance due to the inclusion of the additional factor in the model with the residual variance per one degree of freedom for the regression as a whole. Therefore, the formula for the partial F-criterion for the factor x_i looks as follows:

F_xi = ((R²_yx1x2...xi...xp − R²_yx1x2...x(i−1)x(i+1)...xp) / (1 − R²_yx1x2...xi...xp)) · ((n − m − 1) / 1),

where R²_yx1x2...xi...xp is the coefficient of multiple determination for the model with the full set of p factors; R²_yx1x2...x(i−1)x(i+1)...xp is the coefficient of multiple determination for the model that does not include the factor x_i; n is the number of observations; m is the number of parameters at the factors x in the regression equation.

The actual value of Fisher's partial criterion is compared with the tabular one at a significance level of 0.05 or 0.1 and the corresponding numbers of degrees of freedom. If the actual value F_xi exceeds F_table, then the additional inclusion of the factor x_i in the model is statistically justified and the "pure" regression coefficient b_i at the factor x_i is statistically significant. If F_xi is smaller than F_table, then the additional inclusion of the factor in the model does not significantly increase the share of the explained variation of the result y, and therefore its inclusion in the model does not make sense; the regression coefficient for this factor in this case is statistically insignificant.

Using Fisher's partial criterion, one can test the significance of all regression coefficients under the assumption that each corresponding factor x_i is entered last into the multiple regression equation, with all other factors already included in the model.

Estimation of the significance of the "pure" regression coefficients b i on Student's criterion t can be carried out without calculating private F-criteria. In this case, as in paired regression, the formula is applied for each factor

t bi = b i / m bi ,

where b i- coefficient of "pure" regression with a factor x i ; m bi- standard error of the regression coefficient b i .
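A sketch of both checks on invented data with two correlated factors; for the factor entered last, the square of its t-statistic coincides with its partial F-criterion:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# hypothetical data with two explanatory factors (names and values are ours)
rng = np.random.default_rng(2)
n = 40
x1 = rng.normal(10, 2, n)
x2 = 0.5 * x1 + rng.normal(0, 1, n)          # the factors are correlated
y = 3 + 1.2 * x1 + 0.8 * x2 + rng.normal(0, 1, n)

full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
reduced = sm.OLS(y, sm.add_constant(x1)).fit()   # model without x2

# partial F-criterion for the additional inclusion of x2
m = 2                                            # number of factors in the full model
f_x2 = (full.rsquared - reduced.rsquared) / (1 - full.rsquared) * (n - m - 1) / 1
f_crit = stats.f.ppf(0.95, dfn=1, dfd=n - m - 1)

# t-test for the "pure" regression coefficient of x2: t = b_i / m_bi
t_x2 = full.params[2] / full.bse[2]

print(round(f_x2, 2), round(f_crit, 2), round(t_x2, 2))
print(round(t_x2**2, 2))   # for the last-entered factor, t^2 equals the partial F
```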