Assessment of the significance of the regression equation and its coefficients. Evaluation of the statistical significance of the regression equation and its parameters

Ordinary least squares (OLS) yields only estimates of the parameters of the regression equation. To check whether these parameters are meaningful (i.e., whether they differ significantly from zero in the true regression equation), statistical hypothesis testing is used. The null hypothesis states that the regression parameter or the correlation coefficient does not differ significantly from zero; the alternative hypothesis is the opposite, i.e., that the parameter or the correlation coefficient is not equal to zero. Student's t-criterion is used to test the hypothesis.

The t-criterion computed from the observation data (also called the observed or actual value) is compared with the tabulated (critical) value determined from Student's distribution tables (usually given at the end of textbooks and workbooks on statistics or econometrics). The tabulated value depends on the significance level and the number of degrees of freedom, which for paired linear regression equals n − 2, where n is the number of observations.

If the actual value of the t-criterion exceeds the tabulated value in absolute value, then with probability 1 − α the regression parameter (correlation coefficient) is considered significantly different from zero.

If the actual value of the t-criterion is less than the tabulated value in absolute value, there is no reason to reject the null hypothesis, i.e., the regression parameter (correlation coefficient) does not differ significantly from zero at the given significance level.

The actual values of the t-criterion are determined by the formulas:

t_a = a / m_a,

t_b = b / m_b,

where m_a and m_b are the standard errors of the estimates of the parameters a and b.

To test the hypothesis that the paired linear correlation coefficient does not differ significantly from zero, the following criterion is used:

t_r = r · √(n − 2) / √(1 − r²),

where r is the estimate of the correlation coefficient obtained from the observation data.
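As a minimal illustrative sketch (the values r = 0.708 and n = 10 are assumed for demonstration only, they are not taken from the worked example below), the t-criterion for a correlation coefficient can be computed as follows:

```python
from scipy import stats

r, n = 0.708, 10            # assumed sample correlation and sample size
t_actual = r * (n - 2) ** 0.5 / (1 - r ** 2) ** 0.5
t_table = stats.t.ppf(1 - 0.05 / 2, df=n - 2)   # two-sided critical value, alpha = 0.05

print(f"t_actual = {t_actual:.3f}, t_table = {t_table:.3f}")
# the coefficient is significant if |t_actual| > t_table
```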

Forecasting the expected value of the dependent variable Y from the paired linear regression equation.

Suppose it is required to estimate the forecast value of the result attribute for a specified value of the factor attribute. With confidence probability 1 − α the forecast value of the result attribute lies within the forecast interval:

ŷ_p ± t · m_ŷ,

where ŷ_p is the point forecast;

t is the confidence coefficient determined from Student's distribution tables depending on the significance level α and the number of degrees of freedom;

m_ŷ is the average forecast error.

The point forecast is calculated from the linear regression equation as:

ŷ_p = a + b · x_p.

The average forecast error is determined by the formula:

m_ŷ = σ_resid · √(1 + 1/n + (x_p − x̄)² / Σ(x_i − x̄)²),

where σ_resid is the standard deviation of the residuals.
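A minimal Python sketch of the whole procedure, assuming illustrative data (the arrays x and y below are placeholders, not the data of the worked example that follows):

```python
import numpy as np
from scipy import stats

# assumed illustrative data
x = np.array([12.0, 14.5, 15.0, 17.2, 18.0, 20.1, 22.3, 25.0])
y = np.array([0.10, 0.14, 0.13, 0.18, 0.17, 0.21, 0.24, 0.27])
n = len(x)

# OLS estimates of the paired regression y = a + b*x
b = ((x * y).mean() - x.mean() * y.mean()) / ((x ** 2).mean() - x.mean() ** 2)
a = y.mean() - b * x.mean()

resid = y - (a + b * x)
s_res = np.sqrt((resid ** 2).sum() / (n - 2))                  # residual standard error
m_b = s_res / np.sqrt(((x - x.mean()) ** 2).sum())             # standard error of b
m_a = s_res * np.sqrt((x ** 2).sum() / (n * ((x - x.mean()) ** 2).sum()))  # standard error of a

t_a, t_b = a / m_a, b / m_b
t_table = stats.t.ppf(1 - 0.05 / 2, df=n - 2)

# point forecast and its error at x_p = 105% of the mean level of x
x_p = 1.05 * x.mean()
y_p = a + b * x_p
m_y = s_res * np.sqrt(1 + 1 / n + (x_p - x.mean()) ** 2 / ((x - x.mean()) ** 2).sum())
low, high = y_p - t_table * m_y, y_p + t_table * m_y

print(round(t_a, 2), round(t_b, 2), round(t_table, 2))
print(round(y_p, 3), (round(low, 3), round(high, 3)))
```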

Example 1.

Based on the data provided in the appendix and corresponding to option 100, it is required:

1. Construct the paired linear regression equation of one attribute on the other. One of the attributes corresponding to your option will play the role of the factor (x), the other that of the result (y). Establish the cause-and-effect relationship between the attributes yourself by means of economic analysis. Calculate the values of the parameters of the equation.

2. Calculate the linear pair correlation coefficient and the coefficient of determination.

3. Assess the statistical significance of the regression parameters and of the correlation coefficient at a significance level of 0.05.

4. Forecast the expected value of the result attribute y for a forecast value of the factor attribute x equal to 105% of its mean level. Assess the accuracy of the forecast by calculating the forecast error and its confidence interval with probability 0.95.

Solution:

In this case we choose the share price as the factor attribute, since the amount of accrued dividends depends on it. Accordingly, the result attribute is the dividends accrued based on the results of activity.

To simplify the calculations, we construct a worksheet that is filled in as the problem is solved (Table 1).

For clarity, the dependence of Y on X is represented graphically (Figure 2).

Table 1 - Calculation table


1. Construct the regression equation of the form ŷ = a₀ + a₁·x.

To do this it is necessary to determine the parameters of the equation a₀ and a₁.

Determine a₁:

a₁ = ( mean(x·y) − x̄ · ȳ ) / ( mean(x²) − x̄² ),

where mean(x²) is the mean of the squared values of x and x̄² is the square of the mean of x.

Determine the parameter a₀:

a₀ = ȳ − a₁ · x̄.

We obtain a regression equation of the following form:

The parameter a₀ shows how many dividends would be accrued based on the results of activity in the absence of influence from the share price. Based on the parameter a₁ we can conclude that when the share price changes by 1 ruble, dividends change in the same direction by 0.01 million rubles.



2. Calculate the linear pair correlation coefficient and the coefficient of determination.

The linear pair correlation coefficient is determined by the formula:

r = a₁ · σ_x / σ_y.

We determine σ_x and σ_y:

The correlation coefficient, equal to 0.708, indicates a close relationship between the result attribute and the factor attribute.

The coefficient of determination is equal to the square of the linear correlation coefficient: r² = 0.708² ≈ 0.50.

The coefficient of determination shows that about 50% of the variation of the accrued dividends depends on the variation of the share price, while the remaining part depends on other factors not accounted for in the model.
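A brief sketch of these two calculations in Python (the same illustrative arrays as in the sketch above stand in for the Table 1 data, which are not reproduced here):

```python
import numpy as np

# illustrative data, not the Table 1 data
x = np.array([12.0, 14.5, 15.0, 17.2, 18.0, 20.1, 22.3, 25.0])
y = np.array([0.10, 0.14, 0.13, 0.18, 0.17, 0.21, 0.24, 0.27])

a1 = ((x * y).mean() - x.mean() * y.mean()) / ((x ** 2).mean() - x.mean() ** 2)
r = a1 * x.std() / y.std()          # linear pair correlation via the regression slope
r_check = np.corrcoef(x, y)[0, 1]   # the same value computed directly
r2 = r ** 2                         # coefficient of determination

print(round(r, 3), round(r_check, 3), round(r2, 3))
```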

3. We assess the significance of the parameters of the regression equation and of the linear correlation coefficient using Student's t-criterion. It is necessary to calculate the actual value of the t-criterion for each parameter and compare it with the tabulated value.

To calculate the actual values of the t-criterion we determine:

For the coefficients of a regression equation, their significance level is checked using Student's t-criterion and Fisher's F-criterion. Below we consider the assessment of the reliability of the regression indicators only for the linear equations (12.1) and (12.2):

Y = a₀ + a₁·X (12.1)

X = b₀ + b₁·Y (12.2)

For equations of this type, only the values of the coefficients a₁ and b₁ are assessed by Student's t-criterion, using the calculated value t_F obtained from the following formulas:

where r_yx is the correlation coefficient, and the value a₁ can be calculated by formula (12.5) or (12.7).

Formula (12.27) is used to calculate the value t_F, which allows one to assess the significance level of the coefficient a₁ of the regression of Y on X.

The value b₁ can be calculated by formula (12.6) or (12.8).

Formula (12.29) is used to calculate the value t_F, which allows one to assess the significance level of the coefficient b₁ of the regression of X on Y.

Example. Let us assess the significance level of the regression coefficients a₁ and b₁ of equations (12.17) and (12.19) obtained in solving problem 12.1. We use formulas (12.27), (12.28), (12.29) and (12.30).

Recall the form of the obtained regression equations:

Y_x = 3 + 0.06·X (12.17)

X_y = 9 + 1·Y (12.19)

The value a₁ in equation (12.17) equals 0.06. Therefore, to apply formula (12.27) it is necessary to calculate the value S_b(yx). According to the conditions of the problem, n = 8. The correlation coefficient was also calculated earlier by formula (12.9): r_xy = √(0.06 · 0.997) = 0.244.

It remains to calculate the values Σ(yᵢ − ȳ)² and Σ(xᵢ − x̄)², which we have not yet calculated. It is most convenient to perform these calculations in Table 12.2:

Table 12.2

No.    xᵢ − x̄    (xᵢ − x̄)²    yᵢ − ȳ    (yᵢ − ȳ)²
1      -4.75     22.56        -1.75     3.06
2      -4.75     22.56        -0.75     0.56
3      -2.75     7.56          0.25     0.06
4      -2.75     7.56          1.25     1.56
5       1.25     1.56          1.25     1.56
6       3.25     10.56         0.25     0.06
7       5.25     27.56        -0.75     0.56
8       5.25     27.56         0.25     0.06
Sum               127.48                 7.48
Mean: x̄ = 12.75, ȳ = 3.75

Substituting the obtained values into formula (12.28), we obtain:

Now we calculate the value t_F by formula (12.27):

The value t_F is checked against the significance level using Table 16 of Appendix 1 for Student's t-criterion. The number of degrees of freedom in this case is 8 − 2 = 6, so the critical values are t_cr = 2.45 for P ≤ 0.05 and t_cr = 3.71 for P ≤ 0.01. In the adopted notation this looks as follows:

We build the "axis of significance":

The obtained value t_F fell into the zone of insignificance; therefore we must accept hypothesis H₀ that the regression coefficient of equation (12.17) is indistinguishable from zero. In other words, the obtained regression equation is inadequate to the original experimental data.



Let us now calculate the significance level of the coefficient b₁. To do this it is necessary to calculate the value S_b(xy) by formula (12.30), for which all the necessary quantities have already been computed:

Now we calculate the value t_F by formula (12.29):

We can immediately build the "axis of significance", since all the preliminary operations were performed above:

The obtained value t_F fell into the zone of insignificance; therefore we must accept hypothesis H₀ that the regression coefficient of equation (12.19) is indistinguishable from zero. In other words, the obtained regression equation is inadequate to the original experimental data.

Nonlinear regression

The result obtained in the previous section is discouraging: both regression equations, (12.17) and (12.19), turned out to be inadequate to the experimental data. This happened because both equations describe a linear relationship between the attributes, whereas in section 11.9 it was shown that there is a significant curvilinear dependence between the variables X and Y. In other words, in this problem one should look not for a linear but for a curvilinear relationship between the variables X and Y. We will do this using the STADIA 6.0 package (developed by A. P. Kulaichev, registration number 1205).

Task 12.2. A psychologist wants to choose a regression model adequate to the experimental data obtained in problem 11.9.

Solution. This task is solved by simple enumeration of the curvilinear regression models offered in the STADIA statistical package. The package is organized so that the experimental data are entered into the spreadsheet that serves as the source for further work: the first column holds the variable X and the second column the variable Y. Then the Statistics section is selected in the main menu, in it the subsection Regression analysis, and in that subsection the item Curvilinear regression. The last menu offers formulas (models) of various types of curvilinear regression, from which the corresponding regression coefficients can be calculated and immediately checked for significance. Below we consider only a few examples of working with ready-made models (formulas) of curvilinear regression.



1. The first model is the exponential one. Its formula is as follows:

The calculation in STADIA gave a₀ = 1 and a₁ = 0.022.

The calculation of the significance level for a₁ gave the value P = 0.535. Obviously, the obtained value is insignificant. Consequently, this regression model is inadequate to the experimental data.

2. The second model is the power model. Its formula is as follows:

The calculation gave a₀ = −5.29, a₁ = 7.02 and a₂ = 0.0987.

The significance level for a₂ is P = 0.991, and the value for a₁ is likewise insignificant. Obviously, none of the coefficients is significant.

3. The third model is the polynomial one. Its formula is as follows:

Y = a₀ + a₁·X + a₂·X² + a₃·X³

The calculation gave a₀ = −29.8, a₁ = 7.28, a₂ = −0.488 and a₃ = 0.0103. The significance level for a₁ is P = 0.143, for a₂ P = 0.2, and for a₃ P = 0.272.

Conclusion: this model is inadequate to the experimental data.

4. The fourth model is the parabola.

Its formula is as follows: Y = a₀ + a₁·x + a₂·x²

The calculation gave a₀ = −9.88, a₁ = 2.24 and a₂ = −0.0839. The significance level for a₁ is P = 0.0186 and for a₂ P = 0.0201. Both regression coefficients turned out to be significant. Consequently, the problem is solved: we have revealed the form of the curvilinear relationship between the success of solving the third subtest of the test and the level of knowledge of algebra, namely a dependence of parabolic type. This result confirms the conclusion obtained in solving problem 11.9 about the presence of a curvilinear relationship between the variables. We emphasize that it is precisely with the help of curvilinear regression that the exact form of the dependence between the studied variables was obtained.
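A comparable model check can be sketched outside the STADIA package. The fragment below fits the parabola by ordinary least squares and computes p-values for its coefficients; the data arrays are placeholders, not the actual data of problem 11.9:

```python
import numpy as np
from scipy import stats

# placeholder data: x - level of knowledge of algebra, y - success on the subtest
x = np.array([8.0, 8.0, 10.0, 10.0, 14.0, 16.0, 18.0, 18.0])
y = np.array([2.0, 3.0, 4.0, 5.0, 5.0, 4.0, 3.0, 4.0])

# fit the parabola y = a0 + a1*x + a2*x^2 by least squares
A = np.column_stack([np.ones_like(x), x, x ** 2])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

resid = y - A @ coef
s2 = (resid ** 2).sum() / (len(x) - A.shape[1])        # residual variance
se = np.sqrt(np.diag(s2 * np.linalg.inv(A.T @ A)))     # standard errors of a0, a1, a2

t_vals = coef / se
p_vals = 2 * stats.t.sf(np.abs(t_vals), df=len(x) - A.shape[1])
print(np.round(coef, 3), np.round(p_vals, 4))
```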


Chapter 13 Factor Analysis

Basic concepts of factor analysis

Factor analysis is a statistical method used in the processing of large arrays of experimental data. The tasks of factor analysis are reducing the number of variables (data reduction) and determining the structure of the relationships between variables, i.e., classifying the variables. Accordingly, factor analysis is used either as a data-reduction method or as a method of structural classification.

An important difference between factor analysis and all the methods described above is that it cannot be applied to primary, or "raw", experimental data, i.e., data obtained directly from examining the subjects. The material for factor analysis is the correlations, or more precisely the Pearson correlation coefficients, computed between the variables (i.e., psychological attributes) included in the survey. In other words, factor analysis is applied to correlation matrices, otherwise called intercorrelation matrices. The names of the columns and rows of these matrices coincide, since they list the variables included in the analysis. For this reason intercorrelation matrices are always square (the number of rows equals the number of columns) and symmetric (the same correlation coefficients stand in positions symmetric about the main diagonal).

It must be emphasized that the original data table from which the correlation matrix is obtained does not have to be square. For example, a psychologist measured three indicators of intelligence (verbal, non-verbal and general) and school marks in three subjects (literature, mathematics, physics) for 100 subjects, ninth-grade students. The original data matrix has size 100 × 6, while the intercorrelation matrix has size 6 × 6, since it involves only 6 variables. With such a number of variables the intercorrelation matrix includes 15 coefficients, and analyzing it is not difficult.

However, imagine what happens if the psychologist obtains not 6 but 100 indicators from each subject. In this case he would have to analyze 4950 correlation coefficients. The number of coefficients in the matrix is calculated by the formula n(n − 1)/2 and in our case equals (100 × 99)/2 = 4950.

Obviously, a visual analysis of such a matrix is practically impossible. Instead, the psychologist can perform the mathematical procedure of factor analysis of the 100 × 100 correlation matrix (100 subjects and 100 variables) and in this way obtain simpler material for interpreting the experimental results.

The main concept of factor analysis is the factor. It is an artificial statistical indicator arising as a result of special transformations of the table of correlation coefficients between the studied psychological attributes, i.e., of the intercorrelation matrix. The procedure of extracting factors from the intercorrelation matrix is called factorization of the matrix. As a result of factorization, a varying number of factors can be extracted from the correlation matrix, up to a number equal to the number of original variables. However, the factors identified as a result of factorization are, as a rule, unequal in importance.

The elements of the factor matrix are called "factor loadings", or weights; they are the correlation coefficients of the given factor with all the indicators used in the study. The factor matrix is very important because it shows how the studied indicators are related to each extracted factor. At the same time, the factor weight demonstrates the measure, or closeness, of this relationship.

Since each column of the factor matrix (each factor) is a kind of variable, the factors themselves may also correlate with one another. Two cases are possible here: the correlation between factors is zero, in which case the factors are independent (orthogonal); if the correlation between factors is greater than zero, the factors are considered dependent (oblique). We emphasize that orthogonal factors, in contrast to oblique ones, give simpler variants of interactions within the factor matrix.

As an illustration of orthogonal factors, the example of L. Thurstone is often cited: having taken a number of boxes of different sizes and shapes, he measured more than 20 different indicators in each of them and calculated the correlations between these indicators. Having factorized the resulting intercorrelation matrix, he obtained three factors, the correlation between which was equal to zero. These factors were "length", "width" and "height".

In order to better grasp the essence of factor analysis, let us analyze the following example in more detail.

Suppose that a psychologist obtains the following data from a random sample of students:

V1 - body weight (in kg);

V2 - the number of visits to lectures and seminars on the subject;

V3 - leg length (in cm);

V4 - the number of books read on the subject;

V5 - arm length (in cm);

V6 - examination grade on the subject (V is from the English word variable).

When analyzing these attributes it is not unreasonable to assume that the variables V1, V3 and V5 will be interrelated, since the bigger a person is, the more he weighs and the longer his limbs are. This means that statistically significant correlation coefficients should be obtained between these variables, since all three measure some fundamental property of the individuals in the sample, namely their size. It is equally likely that quite high correlation coefficients will also be obtained when computing the correlations between V2, V4 and V6, since attending lectures and independent study contribute to higher grades in the subject studied.

Thus, out of the entire possible array of coefficients obtained by enumerating pairs of correlated attributes (V1 and V2, V1 and V3, etc.), two blocks of statistically significant correlations will presumably stand out. The remaining correlations, those between attributes belonging to different blocks, are unlikely to be statistically significant, since relationships between such attributes as limb size and academic performance are most likely random. So a meaningful analysis of our 6 variables shows that they in fact measure only two generalized characteristics: body size and the degree of preparedness in the subject.

To the resulting intercorrelation matrix, i.e., to the correlation coefficients calculated pairwise between all six variables V1-V6, it is permissible to apply factor analysis. It can be carried out manually with a calculator, but the procedure of such statistical processing is very laborious. For this reason factor analysis is nowadays performed on computers, as a rule with standard statistical packages. All modern statistical packages have programs for correlation and factor analysis. A computer program for factor analysis essentially tries to "explain" the correlations between variables in terms of a small number of factors (two in our example).
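As an illustration of what such a computation might look like (a hypothetical Python sketch, not the software used by the author), one can simulate six variables driven by the two latent properties described above and extract two factors:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 100

# simulate two latent traits: body size and preparedness in the subject
size = rng.normal(size=n)
prep = rng.normal(size=n)

data = np.column_stack([
    70 + 10 * size + rng.normal(scale=3, size=n),    # V1 body weight
    20 + 5 * prep + rng.normal(scale=2, size=n),     # V2 lecture visits
    95 + 7 * size + rng.normal(scale=2, size=n),     # V3 leg length
    5 + 3 * prep + rng.normal(scale=1, size=n),      # V4 books read
    75 + 6 * size + rng.normal(scale=2, size=n),     # V5 arm length
    3.5 + 1 * prep + rng.normal(scale=0.4, size=n),  # V6 exam grade
])

fa = FactorAnalysis(n_components=2)
fa.fit((data - data.mean(0)) / data.std(0))   # standardize so loadings behave like correlations
print(np.round(fa.components_.T, 2))          # rows: V1..V6, columns: factor loadings
```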

Suppose that, using a computer program, we obtained the intercorrelation matrix of all six variables and subjected it to factor analysis. As a result of the factor analysis, Table 13.1 was obtained, which is called the "factor matrix", or "factor structure matrix".

Table 13.1.

Variable   Factor 1   Factor 2
V1          0.91       0.01
V2          0.20       0.96
V3          0.94      -0.15
V4          0.11       0.85
V5          0.89       0.07
V6         -0.13       0.93

Traditionally, factors are presented in the table as columns and variables as rows. The column headings of Table 13.1 correspond to the numbers of the extracted factors, but it would be more accurate to call them "factor loadings", or weights, on factor 1 and on factor 2. As stated above, factor loadings, or weights, are the correlations between the corresponding variable and the given factor. For example, the first number 0.91 in the first factor means that the correlation between the first factor and the variable V1 equals 0.91. The higher the factor loading in absolute value, the stronger the connection of the variable with the factor.

From Table 13.1 it can be seen that the variables V1, V3 and V5 have large correlations with factor 1 (variable V3 in fact has a correlation close to 1 with factor 1). At the same time these variables have correlations close to 0 with factor 2. Similarly, factor 2 correlates highly with the variables V2, V4 and V6 and in fact does not correlate with the variables V1, V3 and V5.

In this example it is obvious that there are two structures of correlations and, consequently, all the information in Table 13.1 is determined by two factors. Now the final stage of the work begins: the interpretation of the data obtained. When analyzing the factor matrix it is very important to take into account the signs of the factor loadings in each factor. If loadings with opposite signs occur within the same factor, this means that the variables with opposite signs are inversely related to each other.

Note that, when interpreting a factor, for convenience one can reverse the signs of all the loadings on that factor.

The factor matrix also shows which variables form each factor. This is connected first of all with the significance level of the factor weight. Traditionally, the minimum significance level of the correlation coefficients in factor analysis is taken equal to 0.4 or even 0.3 (in absolute value), since there are no special tables from which critical values of the significance level for a factor matrix could be determined. Therefore the simplest way to see which variables "belong" to a factor is to mark those with loadings higher than 0.4 (or lower than −0.4). Note that in computer packages the significance level of the factor weight is sometimes determined by the program itself and is set at a higher level, for example 0.7.

So, from Table 13.1 it follows that factor 1 is a combination of the variables V1, V3 and V5 (but not V2, V4 and V6, since their factor loadings are less than 0.4 in absolute value). Similarly, factor 2 is a combination of the variables V2, V4 and V6.

A factor identified as a result of factorization is a set of those variables among the ones included in the analysis that have significant loadings on it. It often happens, however, that a factor includes only one variable with a significant factor weight, while the rest have insignificant loadings. In this case the factor is named after the only significant variable.

In essence, a factor can be viewed as an artificial "unit" for grouping variables (attributes) on the basis of the connections between them. This unit is conditional, because by changing certain conditions of the procedure of factorizing the intercorrelation matrix one can obtain a different factor matrix (structure). In the new matrix the distribution of variables over factors and their factor loadings may turn out differently.

In this connection, factor analysis has the concept of a "simple structure". A simple structure of a factor matrix is one in which each variable has significant loadings on only one of the factors, and the factors themselves are orthogonal, i.e., do not depend on each other. In our example the two common factors are independent. A factor matrix with a simple structure allows the obtained result to be interpreted and a name to be given to each factor. In our case the first factor is "body size", the second factor is "level of preparedness".

The above does not exhaust the meaningful possibilities of the factor matrix. Additional characteristics can be extracted from it that allow the relationships of variables and factors to be studied in more detail. These characteristics are called the "communality" of a variable and the "eigenvalue" of a factor.

However, before describing them, let us point out one fundamentally important property of the correlation coefficient thanks to which these characteristics are obtained. The correlation coefficient raised to the square (i.e., multiplied by itself) shows what part of the dispersion (variability) of an attribute is common to two variables, or, more simply, how strongly these variables overlap. For example, two variables with a correlation of 0.9 overlap to a degree of 0.9 × 0.9 = 0.81. This means that 81% of the dispersion of the two variables is common, i.e., coincides. Recall that the factor loadings in a factor matrix are the correlation coefficients between factors and variables; therefore the squared factor loading characterizes the degree of commonality (or overlap) of the dispersions of the given variable and the given factor.

If the obtained factors do not depend on each other (an "orthogonal" solution), then from the weights of the factor matrix one can determine what part of the dispersion is common to a variable and a factor. To calculate what part of the variability of a variable coincides with the variability of the factors, one can simply sum the squares of its factor loadings over all factors. From Table 13.1, for example, it follows that 0.91 × 0.91 + 0.01 × 0.01 = 0.8282, i.e., about 82% of the variability of the first variable is "explained" by the first two factors. The obtained value is called the communality of the variable, in this case of the variable V1.

Variables can have different degrees of communality with the factors. A variable with a large communality has a considerable degree of overlap (a large share of dispersion) with one or several factors. A low communality means that all the correlations between the variable and the factors are small. This means that none of the factors has a share of variability that coincides with the variability of the given variable. A low communality may indicate that the variable measures something qualitatively different from the other variables included in the analysis. For example, a variable related to the assessment of motivation, included among tasks assessing abilities, will have a communality with the ability factors close to zero.

A small communality may also mean that a particular task is strongly affected by measurement errors or is extremely difficult for the subjects. It is also possible, on the contrary, that the task is so simple that every subject gives the correct answer to it, or that the task is so unclear in content that the subject does not understand the essence of the question. Thus, a low communality implies that the given variable does not combine with the factors for one of the following reasons: either the variable measures a different concept, or the variable has a large measurement error, or there are differences between the subjects in the response options to this task that distort the dispersion of the attribute.

Finally, with the help of such a characteristic as the eigenvalue of a factor, one can determine the relative importance of each of the extracted factors. To do this, one has to calculate what part of the dispersion (variability) each factor explains. A factor that explains 45% of the dispersion (overlap) between the variables of the original correlation matrix is obviously more significant than another that explains only 25% of the dispersion. These arguments, however, are valid only if the factors are orthogonal, in other words, do not depend on each other.

In order to calculate the eigenvalue of a factor, the factor loadings must be squared and summed down the column. Using the data of Table 13.1 one can verify that the eigenvalue of factor 1 is (0.91 × 0.91 + 0.20 × 0.20 + 0.94 × 0.94 + 0.11 × 0.11 + 0.84 × 0.84 + (−0.13) × (−0.13)) = 2.4863. If the eigenvalue of a factor is divided by the number of variables (6 in our example), the resulting number shows what proportion of the dispersion is explained by this factor. In our case this gives 2.4863 · 100% / 6 = 41.4%. In other words, factor 1 explains about 41% of the information (dispersion) in the original correlation matrix. A similar calculation for the second factor gives 41.5%. In total this amounts to 82.9%.
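A small sketch reproducing the communality and eigenvalue calculations directly from the loadings of Table 13.1 (because of rounding, the results may differ slightly from the figures quoted in the text):

```python
import numpy as np

# factor loadings from Table 13.1 (rows: V1..V6, columns: factor 1, factor 2)
loadings = np.array([
    [ 0.91,  0.01],
    [ 0.20,  0.96],
    [ 0.94, -0.15],
    [ 0.11,  0.85],
    [ 0.89,  0.07],
    [-0.13,  0.93],
])

communality = (loadings ** 2).sum(axis=1)      # per-variable share of explained variance
eigenvalues = (loadings ** 2).sum(axis=0)      # per-factor sum of squared loadings
explained = eigenvalues / loadings.shape[0]    # proportion of total dispersion per factor

print(np.round(communality, 3))
print(np.round(eigenvalues, 3), np.round(explained * 100, 1))
```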

Thus, the two common factors taken together explain only 82.9% of the dispersion of the indicators of the original correlation matrix. What happened to the "remaining" 17.1%? The point is that, when considering the correlations between the 6 variables, we noticed that the correlations fall into two separate blocks, and therefore decided that it was logical to analyze the material in terms of two factors rather than 6, the number of original variables. In other words, the number of constructs needed to describe the data decreased from 6 (the number of variables) to 2 (the number of common factors). As a result of factorization, part of the information in the original correlation matrix was sacrificed to the construction of a two-factor model. The only condition under which information would not be lost would be to consider a six-factor model.

After assessing the individual statistical significance of each regression coefficient, the aggregate significance of the coefficients is usually analyzed, i.e., the significance of the equation as a whole. Such an analysis is based on testing the hypothesis of the overall significance of the equation, i.e., the hypothesis of the simultaneous equality to zero of all regression coefficients of the explanatory variables:

H₀: b₁ = b₂ = … = b_m = 0.

If this hypothesis is not rejected, it is concluded that the combined influence of all m explanatory variables x₁, x₂, …, x_m of the model on the dependent variable Y can be considered statistically insignificant, and the overall quality of the regression equation is low.

This hypothesis is tested on the basis of analysis of variance, by comparing the explained and residual variances:

H₀: (explained variance) = (residual variance),

H₁: (explained variance) > (residual variance).

The F-statistic is constructed:

F = [ Σ(ŷᵢ − ȳ)² / m ] / [ Σ(yᵢ − ŷᵢ)² / (n − m − 1) ],   (8.19)

where the numerator is the variance explained by the regression per one degree of freedom, and the denominator is the residual variance (the residual sum of squared deviations divided by the number of degrees of freedom n − m − 1). When the OLS prerequisites are fulfilled, the constructed F-statistic has a Fisher distribution with numbers of degrees of freedom ν₁ = m and ν₂ = n − m − 1. Therefore, if at the required significance level α we have F_obs > F_α; m; n−m−1 = F_cr (where F_α; m; n−m−1 is the critical point of the Fisher distribution), then H₀ is rejected in favor of H₁. This means that the variance explained by the regression is significantly greater than the residual variance, and consequently the regression equation reflects the dynamics of change of the dependent variable Y well. If F_obs < F_α; m; n−m−1 = F_cr, there is no ground for rejecting H₀. This means that the explained variance is comparable with the variance caused by random factors, which gives reason to believe that the combined influence of the explanatory variables of the model is insignificant, and consequently the overall quality of the model is low.
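A compact sketch of this F-test for a multiple regression fitted by OLS (synthetic data; numpy's least-squares routine stands in for whatever software the reader uses):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, m = 30, 2                                   # observations, explanatory variables
X = rng.normal(size=(n, m))
y = 1.0 + X @ np.array([0.8, -0.5]) + rng.normal(scale=1.0, size=n)

A = np.column_stack([np.ones(n), X])           # design matrix with intercept
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ coef

ss_explained = ((y_hat - y.mean()) ** 2).sum()
ss_residual = ((y - y_hat) ** 2).sum()
F = (ss_explained / m) / (ss_residual / (n - m - 1))
F_crit = stats.f.ppf(0.95, dfn=m, dfd=n - m - 1)
print(round(F, 2), round(F_crit, 2))           # reject H0 if F > F_crit
```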

However, in practice, instead of this hypothesis one more often tests the hypothesis about the statistical significance of the coefficient of determination R²:

H₀: R² = 0, H₁: R² > 0.

To test this hypothesis the following F-statistic is used:

F = ( R² / m ) / ( (1 − R²) / (n − m − 1) ).   (8.20)

The value F, when the OLS prerequisites are fulfilled and H₀ is true, has a Fisher distribution similar to the distribution of the F-statistic (8.19). Indeed, dividing the numerator and denominator of the fraction in (8.19) by the total sum of squared deviations Σ(yᵢ − ȳ)², and knowing that it decomposes into the sum of squared deviations explained by the regression and the residual sum of squared deviations (this is, as will be shown later, a consequence of the system of normal equations),

Σ(yᵢ − ȳ)² = Σ(ŷᵢ − ȳ)² + Σ(yᵢ − ŷᵢ)²,

we obtain formula (8.20).

From (8.20) it is obvious that the indicators F and R² are equal to zero or differ from zero simultaneously. If F = 0, then R² = 0, the regression line ŷ = ȳ is the best by OLS, and consequently the value of Y does not depend linearly on x₁, x₂, …, x_m. To test the null hypothesis H₀: F = 0 at a given significance level α, the critical value F_cr = F_α; m; n−m−1 is found from the tables of critical points of the Fisher distribution. The null hypothesis is rejected if F > F_cr. This is equivalent to R² > 0, i.e., R² is statistically significant.

The analysis of the F-statistic allows us to conclude that, for accepting the hypothesis of the simultaneous equality to zero of all the coefficients of the linear regression, the coefficient of determination R² must not differ significantly from zero. Its critical value decreases as the number of observations grows and can become arbitrarily small.

Suppose, for example, that when estimating a regression with two explanatory variables x_i1, x_i2 over 30 observations we obtained R² = 0.65. Then

F_obs = (0.65 / 2) / ((1 − 0.65) / 27) = 25.07.

From the tables of critical points of the Fisher distribution we find F_0.05; 2; 27 = 3.36 and F_0.01; 2; 27 = 5.49. Since F_obs = 25.07 > F_cr both at the 5% and at the 1% significance level, the null hypothesis is rejected in both cases.

If in the same situation R² = 0.4, then

F_obs = (0.4 / 2) / ((1 − 0.4) / 27) = 9.

The assumption of the insignificance of the relationship is rejected here as well.

Note that in the case of paired regression, testing the null hypothesis with the F-statistic is equivalent to testing the null hypothesis with the t-statistic of the correlation coefficient. In this case the F-statistic equals the square of the t-statistic. The coefficient R² acquires independent significance in the case of multiple linear regression.

8.6. Analysis of variance: decomposition of the total sum of squared deviations. Degrees of freedom for the corresponding sums of squares of deviations

Let us apply the theory outlined above to paired linear regression.

After the linear regression equation has been found, the significance of both the equation as a whole and its individual parameters is assessed.

The significance of the regression equation as a whole is assessed with Fisher's F-criterion. In this case a null hypothesis is put forward that the regression coefficient equals zero, b = 0, and hence the factor x has no effect on the result y.

The direct calculation of the F-criterion is preceded by analysis of variance. The central place in it is occupied by the decomposition of the total sum of squared deviations of the variable y from its mean value ȳ into two parts, the "explained" and the "unexplained":

Σ(yᵢ − ȳ)² = Σ(ŷᵢ − ȳ)² + Σ(yᵢ − ŷᵢ)².   (8.21)

Equation (8.21) is a consequence of the system of normal equations derived in one of the previous topics.

Proof of expression (8.21).

It remains to prove that the last term equals zero.

If we sum all the equations

yᵢ = a + b·xᵢ + eᵢ, (8.22)

over i from 1 to n, we get Σyᵢ = a·Σ1 + b·Σxᵢ + Σeᵢ. Since Σeᵢ = 0 and Σ1 = n, we obtain Σyᵢ = a·n + b·Σxᵢ.

Then, dividing by n,

ȳ = a + b·x̄. (8.23)

If we subtract equation (8.23) from expression (8.22), we get

As a result we obtain

The last sums are equal to zero by virtue of the system of two normal equations.

The total sum of squared deviations of the individual values of the result attribute y from the mean value ȳ is caused by the influence of many reasons. Let us conditionally divide the whole set of reasons into two groups: the studied factor x and other factors. If the factor has no effect on the result, the regression line is parallel to the OX axis and ŷ = ȳ. Then the entire dispersion of the result attribute is due to the influence of other factors, and the total sum of squared deviations coincides with the residual sum. If other factors do not affect the result, then y is related to x functionally and the residual sum of squares is zero. In this case the sum of squared deviations explained by the regression coincides with the total sum of squares.

Since not all points of the correlation field lie on the regression line, their scatter always occurs, partly caused by the influence of the factor x (i.e., by the regression of y on x) and partly caused by the action of other reasons (unexplained variation). The suitability of the regression line for forecasting depends on what part of the total variation of the attribute y falls on the explained variation. Obviously, if the sum of squared deviations caused by the regression is greater than the residual sum of squares, the regression equation is statistically significant and the factor x has a significant effect on the attribute y. This is equivalent to the coefficient of determination approaching one.

Any sum of squares is associated with a number of degrees of freedom (df, degrees of freedom), i.e., with the number of independently varying values of the attribute. The number of degrees of freedom is related to the number of units of the population n and to the number of constants determined from it. In relation to the problem under study, the number of degrees of freedom should show how many independent deviations out of n possible are required to form a given sum of squares. Thus, for the total sum of squares, (n − 1) independent deviations are required, since in a population of n units, after calculating the mean, only (n − 1) deviations vary freely. For example, take a series of values of y: 1, 2, 3, 4, 5. Their mean is 3, and the n deviations from the mean are: −2, −1, 0, 1, 2. Since their sum is zero, only four deviations vary freely, and the fifth deviation can be determined if the previous four are known.

When calculating the explained, or factor, sum of squares, the theoretical (calculated) values of the result attribute are used.

The sum of squared deviations caused by the linear regression then equals

Σ(ŷᵢ − ȳ)² = b² · Σ(xᵢ − x̄)².

Since, for a given volume of observations on x and y, the factor sum of squares in linear regression depends only on the single regression constant b, this sum of squares has only one degree of freedom.

There is equality between the numbers of degrees of freedom of the total, factor and residual sums of squared deviations. The number of degrees of freedom of the residual sum of squares in linear regression is n − 2. The number of degrees of freedom of the total sum of squares is determined by the number of units of the varying attribute, and since we use the mean calculated from the sample data, we lose one degree of freedom, i.e., df_total = n − 1.

So we have two equalities:

Σ(yᵢ − ȳ)² = Σ(ŷᵢ − ȳ)² + Σ(yᵢ − ŷᵢ)² and (n − 1) = 1 + (n − 2).

Dividing each sum of squares by the corresponding number of degrees of freedom, we obtain the mean square of deviations, or, which is the same, the dispersion per one degree of freedom D:

D_total = Σ(yᵢ − ȳ)² / (n − 1);

D_fact = Σ(ŷᵢ − ȳ)² / 1;

D_resid = Σ(yᵢ − ŷᵢ)² / (n − 2).

Determining the dispersion per one degree of freedom brings the dispersions to a comparable form. Comparing the factor and residual dispersions per one degree of freedom, we obtain Fisher's F-criterion:

F = D_fact / D_resid,

where F is the criterion for testing the null hypothesis H₀: D_fact = D_resid.

If the null hypothesis is true, the factor and residual dispersions do not differ from one another. For H₀ to be rejected, the factor dispersion must exceed the residual one several times. The statistician Snedecor developed tables of critical values of the F-ratio for various significance levels of the null hypothesis and various numbers of degrees of freedom. The tabulated value of the F-criterion is the maximum value of the ratio of dispersions that can occur through random divergence at the given probability level of the null hypothesis. The calculated value of the F-ratio is recognized as reliable if it is greater than the tabulated one. If F_fact > F_table, the null hypothesis H₀: D_fact = D_resid about the absence of a relationship between the attributes is rejected and a conclusion is drawn about the significance of this relationship.

If F_fact < F_table, then the probability of the null hypothesis H₀: D_fact = D_resid is higher than the specified level (for example, 0.05), and it cannot be rejected without a serious risk of drawing a wrong conclusion about the presence of a relationship. In this case the regression equation is considered statistically insignificant; H₀ is not rejected.

In the example from Chapter 3:

131200 − 7 · 14400 = 30400 is the total sum of squares;

1057.878 · (135.43 − 7 · 3.92571²) = 28979.8 is the factor sum of squares;

30400 − 28979.8 = 1420.197 is the residual sum of squares;

D_fact = 28979.8;

D_resid = 1420.197 / (n − 2) = 284.0394;

F_fact = 28979.8 / 284.0394 = 102.0274;

F_0.05; 1; 5 = 6.61; F_0.01; 1; 5 = 16.26.

Since F_fact > F_table both at the 1% and at the 5% significance level, we can conclude that the regression equation is significant (the relationship is proved).
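The arithmetic of this example can be checked with a few lines of Python (the sums of squares are taken as given above; scipy supplies the critical points of the Fisher distribution):

```python
from scipy import stats

n = 7
ss_total = 30400.0
ss_factor = 28979.8
ss_resid = ss_total - ss_factor              # about 1420.2

d_fact = ss_factor / 1                       # one degree of freedom for paired regression
d_resid = ss_resid / (n - 2)                 # about 284.04
F_fact = d_fact / d_resid                    # about 102
F_05 = stats.f.ppf(0.95, dfn=1, dfd=n - 2)   # about 6.61
F_01 = stats.f.ppf(0.99, dfn=1, dfd=n - 2)   # about 16.26

print(round(F_fact, 2), round(F_05, 2), round(F_01, 2))
```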

The value of the F-criterion is related to the coefficient of determination. The factor sum of squared deviations can be represented as

Σ(ŷᵢ − ȳ)² = r² · Σ(yᵢ − ȳ)²,

and the residual sum of squares as

Σ(yᵢ − ŷᵢ)² = (1 − r²) · Σ(yᵢ − ȳ)².

Then the value of the F-criterion can be expressed as

F = r² · (n − 2) / (1 − r²).

The assessment of the significance of the regression is usually presented in the form of an analysis-of-variance table.

The actual value of the F-ratio is compared with the tabulated value at a given significance level α and the numbers of degrees of freedom 1 and (n − 2).

Sources of variation   Degrees of freedom   Sum of squared deviations   Dispersion per degree of freedom   F-ratio, actual   F-ratio, table (α = 0.05)
Total                  n − 1 = 6            30400
Explained              1                    28979.8                     28979.8                            102.0274          6.61
Residual               n − 2 = 5            1420.197                    284.0394

To assess the significance of the correlation coefficient, Student's t-criterion is used.

The mean error of the correlation coefficient is found by the formula:

m_r = √((1 − r²) / (n − 2)).

On the basis of this error, the t-criterion is calculated:

t = r / m_r.

The calculated value of the t-criterion is compared with the tabulated value found in Student's distribution table at a significance level of 0.05 or 0.01 and with n − 1 degrees of freedom. If the calculated value of the t-criterion is greater than the tabulated one, the correlation coefficient is recognized as significant.

For a curvilinear relationship, the F-criterion is used to assess the significance of the correlation ratio and of the regression equation. It is calculated by the formula:

F = ( η² / (m − 1) ) / ( (1 − η²) / (n − m) ),

where η is the correlation ratio; n is the number of observations; m is the number of parameters in the regression equation.

The calculated value of F is compared with the tabulated one for the adopted significance level α (0.05 or 0.01) and the numbers of degrees of freedom k₁ = m − 1 and k₂ = n − m. If the calculated value of F exceeds the tabulated one, the relationship is recognized as significant.

The significance of a regression coefficient is established using Student's t-criterion, which is calculated by the formula:

t_aᵢ = aᵢ / σ_aᵢ,

where σ²_aᵢ is the variance of the regression coefficient.

It is calculated by the formula:

where k is the number of factor attributes in the regression equation.

A regression coefficient is recognized as significant if t_aᵢ ≥ t_cr. The value t_cr is found in the table of critical points of Student's distribution at the adopted significance level and with k = n − 1 degrees of freedom.

4.3. Correlation-regression analysis in Excel

Let us carry out a correlation-regression analysis of the relationship between yield and labor costs per 1 centner of grain. To do this, open an Excel worksheet, enter the values of the factor attribute (grain yield) in cells A1:A30 and the values of the result attribute (labor costs per 1 centner of grain) in cells B1:B30. In the Tools menu select the Data Analysis option. Clicking this item with the left mouse button opens the Regression tool. Click OK, and the Regression dialog box appears on the screen. In the Input Y Range field enter the values of the result attribute (selecting cells B1:B30); in the Input X Range field enter the values of the factor attribute (selecting cells A1:A30). Set the confidence level to 95% and choose a new worksheet. Click OK. The worksheet displays the summary output table, which contains the results of the calculation of the parameters of the regression equation, the correlation coefficient and other indicators needed to determine the significance of the correlation coefficient and of the parameters of the regression equation.

SUMMARY OUTPUT

Regression statistics: Multiple R; R-square; Adjusted R-square; Standard error; Observations.

Analysis of variance: Regression; Significance F.

Coefficients table (rows: Y-intercept, Variable X1): Coefficients; Standard error; t-statistic; P-value; Lower 95%; Upper 95%; Lower 95.0%; Upper 95.0%.

This table "Multiple R" is the correlation coefficient, "R-square" - the coefficient of determination. "The coefficients: Y-intersection" is a free member of the regression equation 2.836242; "Variable x1" - regression coefficient -0.06654. Here there are also values \u200b\u200bof Fisher's Fischer 74,9876, T-criterion of Student 14,18042, "Standard error 0.112121", which are necessary to assess the significance of the correlation coefficient, parameters of the regression equation and the entire equation.

Based on the data of the table, we construct the regression equation: at x \u003d 2,836-0.067x. The regression coefficient A 1 \u003d -0.067 means that with an increase in grain yield on 1 c / ha, labor costs per 1 grain decrease by 0.067 people-

The correlation coefficient r \u003d 0.85\u003e 0.7, therefore, the relationship between the studied signs in this aggregate is close. The determination coefficient R 2 \u003d 0.73 shows that 73% of the variation of the productive feature (labor costs per 1 grade) is caused by the action of a factor (grain yield).

In the table of critical points of the distribution of Fisher - Snedel, we will find the critical value of the F-criterion at the level of significance of 0.05 and the number of freedom of freedom to 1 \u003d M-1 \u003d 2-1 \u003d 1 and k 2 \u003d nm \u003d 30-2 \u003d 28, it is equal 4.21. Since the calculated value of the criterion is greater than the table (F \u003d 74.9896\u003e 4.21), the regression equation is recognized as significant.

To assess the significance of the correlation coefficient, calculate the T-criterion of Student:

IN
table of critical stitch distribution points will find the critical value of the Critrium at the level of significance of 0.05 and the number of freedom of freedom N-1 \u003d 30-1 \u003d 29, it is 2.0452. Since the estimated value is more tabular, the correlation coefficient is significant.
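The same analysis can be reproduced outside Excel; a sketch with statsmodels is shown below (the arrays are placeholders standing in for the yield and labor-cost columns A1:A30 and B1:B30):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(15, 40, size=30)                       # grain yield, c/ha (placeholder data)
y = 2.8 - 0.07 * x + rng.normal(scale=0.15, size=30)   # labor costs per centner (placeholder)

model = sm.OLS(y, sm.add_constant(x)).fit()
print(model.params)       # intercept and slope (cf. 2.836 and -0.067 in the worked example)
print(model.rsquared)     # coefficient of determination
print(model.fvalue)       # Fisher's F-criterion
print(model.tvalues)      # Student's t-criteria for the coefficients
```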

Evaluation of the significance of the multiple regression equation

The construction of the empirical regression equation is the initial stage of econometric analysis. The first regression equation built from a sample is very rarely satisfactory in all its characteristics. Therefore the next most important task of econometric analysis is checking the quality of the regression equation. In econometrics a well-established scheme of such checking has been adopted.

Checking the statistical quality of the estimated regression equation is thus carried out in the following directions:

· checking the significance of the regression equation;

· checking the statistical significance of the coefficients of the regression equation;

· checking the properties of the data whose fulfilment was assumed when estimating the equation (checking the fulfilment of the OLS prerequisites).

The significance of the multiple regression equation, as for paired regression, is checked with Fisher's criterion. In this case (in contrast to paired regression) the null hypothesis H₀ is put forward that all the regression coefficients are equal to zero (b₁ = 0, b₂ = 0, …, b_m = 0). Fisher's criterion is determined by the following formula:

F = D_fact / D_resid = ( R² / (1 − R²) ) · ( (n − m − 1) / m ),

where D_fact is the factor dispersion explained by the regression, per one degree of freedom; D_resid is the residual dispersion per one degree of freedom; R² is the multiple coefficient of determination; m is the number of parameters of the variables x in the regression equation (in paired linear regression m = 1); n is the number of observations.

The obtained value of the F-criterion is compared with the tabulated one at a certain significance level. If its actual value is greater than the tabulated one, then the hypothesis H₀ of the insignificance of the regression equation is rejected and the alternative hypothesis of its statistical significance is accepted.

With the help of Fisher's criterion one can assess the significance not only of the regression equation as a whole, but also of the additional inclusion of each factor in the model. Such an assessment is necessary so as not to load the model with factors that have no significant effect on the result. In addition, since the model consists of several factors, they can be introduced into it in different sequences, and, because of the correlation between the factors, the significance of including the same factor in the model may differ depending on the order in which the factors are introduced.

To assess the significance of including an additional factor in the model, the partial Fisher criterion F_xi is calculated. It is based on comparing the increase of the factor dispersion, due to the inclusion of the additional factor in the model, with the residual dispersion per one degree of freedom for the regression as a whole. Hence the calculation formula of the partial F-criterion for the factor xᵢ has the following form:

F_xi = ( (R²_yx1x2…xi…xp − R²_yx1x2…x(i−1)x(i+1)…xp) / (1 − R²_yx1x2…xi…xp) ) · (n − p − 1),

where R²_yx1x2…xi…xp is the multiple coefficient of determination for the model with the full set of p factors; R²_yx1x2…x(i−1)x(i+1)…xp is the multiple coefficient of determination for the model that does not include the factor xᵢ; n is the number of observations; m is the number of parameters of the factors x in the regression equation.

The actual value of the partial Fisher criterion is compared with the tabulated one at a significance level of 0.05 or 0.1 and with the corresponding numbers of degrees of freedom. If the actual value of F_xi exceeds F_table, then the additional inclusion of the factor xᵢ in the model is statistically justified and the coefficient of "pure" regression bᵢ for the factor xᵢ is statistically significant. If F_xi is less than F_table, then the additional inclusion of the factor in the model does not significantly increase the share of the explained variation of y, and consequently its inclusion in the model makes no sense; the regression coefficient for this factor is in that case statistically insignificant.
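A sketch of the partial F-criterion computed from two fitted models, one with and one without the factor being tested (synthetic data; the R² values come from ordinary least-squares fits):

```python
import numpy as np
from scipy import stats

def r_squared(X, y):
    """R^2 of an OLS fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

rng = np.random.default_rng(3)
n, p = 40, 3
X = rng.normal(size=(n, p))
y = 2 + X @ np.array([1.0, 0.0, -0.7]) + rng.normal(size=n)

r2_full = r_squared(X, y)                         # model with all p factors
r2_wo_x2 = r_squared(np.delete(X, 1, axis=1), y)  # model without factor x2

F_x2 = (r2_full - r2_wo_x2) / (1 - r2_full) * (n - p - 1)
F_crit = stats.f.ppf(0.95, dfn=1, dfd=n - p - 1)
print(round(F_x2, 3), round(F_crit, 3))           # x2 adds little, so F_x2 should be small
```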

With the help of the partial F-criterion one can test the significance of all regression coefficients under the assumption that each corresponding factor xᵢ is introduced into the multiple regression equation last, all other factors having already been included in the model earlier.

The significance of the coefficients of "pure" regression bᵢ by Student's t-criterion can also be assessed without calculating partial F-criteria. In this case, as in paired regression, the following formula is applied for each factor:

t_bi = bᵢ / m_bi,

where bᵢ is the coefficient of "pure" regression for the factor xᵢ; m_bi is the standard error of the regression coefficient bᵢ.
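Finally, a short sketch of the t-criterion for the coefficients of a multiple regression, using the classical covariance formula s²(XᵀX)⁻¹ for the standard errors (synthetic data again):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, m = 40, 3
X = rng.normal(size=(n, m))
y = 1.5 + X @ np.array([0.9, -0.4, 0.0]) + rng.normal(size=n)

A = np.column_stack([np.ones(n), X])
coef = np.linalg.solve(A.T @ A, A.T @ y)
resid = y - A @ coef
s2 = (resid ** 2).sum() / (n - m - 1)               # residual variance
se = np.sqrt(np.diag(s2 * np.linalg.inv(A.T @ A)))  # standard errors m_bi

t = coef / se
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - m - 1)
print(np.round(t, 2), round(t_crit, 2))             # |t| > t_crit means the coefficient is significant
```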