Assessment of the statistical significance of the regression equation and its parameters. Checking the significance of the regression equation

Estimation of the significance of the parameters of the regression equation

The significance of the parameters of a linear regression equation is estimated using Student's t-test:

if t calc > t cr, the null hypothesis H0 (that the parameter is zero) is rejected, indicating that the regression parameters are statistically significant;

if t calc < t cr, the null hypothesis H0 is accepted, indicating that the regression parameters are statistically insignificant,

$t_a = \dfrac{a}{m_a}, \qquad t_b = \dfrac{b}{m_b},$

where $m_a$, $m_b$ are the standard errors of the parameters a and b:

$m_a = \sqrt{\dfrac{\sum (y - \hat y)^2}{n - 2} \cdot \dfrac{\sum x^2}{n \sum (x - \bar x)^2}},$ (2.19)

$m_b = \sqrt{\dfrac{\sum (y - \hat y)^2 / (n - 2)}{\sum (x - \bar x)^2}}.$ (2.20)

The critical (tabular) value of the criterion is found from statistical tables of Student's distribution (Appendix B) or from Excel (the "Statistical" section of the Function Wizard):

t cr = TINV(α = 1 − P; k = n − 2), (2.21)

where k = n − 2 is the number of degrees of freedom.

The same significance test can also be applied to the linear correlation coefficient:

$t_r = \dfrac{r_{yx}}{m_r},$ (2.22)

where $m_r$ is the standard error of the correlation coefficient $r_{yx}$:

$m_r = \sqrt{\dfrac{1 - r_{yx}^2}{n - 2}}.$ (2.23)
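As an illustration, the t-statistics above can be computed directly. A minimal Python sketch with made-up data (not from the text), assuming the standard-error formulas (2.19), (2.20) and (2.23) in the form given above:

```python
import math

# illustrative sample (invented data)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 8.2, 8.8]
n = len(x)
xm, ym = sum(x) / n, sum(y) / n

# OLS estimates of the paired regression y = a + b*x
sxx = sum((xi - xm) ** 2 for xi in x)
b = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y)) / sxx
a = ym - b * xm
y_hat = [a + b * xi for xi in x]

# residual variance with n - 2 degrees of freedom
s2 = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat)) / (n - 2)

m_b = math.sqrt(s2 / sxx)                                  # eq. (2.20)
m_a = math.sqrt(s2 * sum(xi ** 2 for xi in x) / (n * sxx))  # eq. (2.19)

t_a = a / m_a
t_b = b / m_b

# t-test of the correlation coefficient via eq. (2.23)
r = b * math.sqrt(sxx / sum((yi - ym) ** 2 for yi in y))
m_r = math.sqrt((1 - r ** 2) / (n - 2))
t_r = r / m_r
```

With k = n − 2 = 6 degrees of freedom and α = 0.05 the tabular value is about 2.447, so here both parameters and the correlation coefficient would be recognized as significant.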

Variants of tasks for practical and laboratory work on the topic of this section are given below.

Questions for self-examination in section 2

1. Specify the main components of the econometric model and their essence.

2. The main content of the stages of econometric research.

3. Essence of approaches to determine the parameters of linear regression.

4. The essence and features of applying the least squares method when determining the parameters of the regression equation.

5. What indicators are used to assess the closeness of the relationship of the studied factors?

6. The essence of the linear correlation coefficient.

7. The essence of the coefficient of determination.

8. The essence and main features of the procedures for assessing the adequacy (statistical significance) of regression models.

9. Assessment of the adequacy of linear regression models by the coefficient of approximation.

10. The essence of the approach for assessing the adequacy of regression models by the Fisher criterion. Determination of the empirical and critical values of the criterion.

11. The essence of the concept of "dispersion analysis" in relation to econometric studies.

12. The essence and main features of the procedure for assessing the significance of the parameters of the linear regression equation.

13. Features of the application of the Student's distribution in assessing the significance of the parameters of the linear regression equation.

14. What is the task of forecasting single values ​​of the studied socio-economic phenomenon?

1. Build a correlation field and formulate an assumption about the form of the relationship equation of the studied factors;

2. Write down the basic equations of the least squares method, make the necessary transformations, compile a table for intermediate calculations and determine the parameters of the linear regression equation;

3. Verify the correctness of the calculations performed using standard procedures and functions of Excel spreadsheets.

4. Analyze the results, formulate conclusions and recommendations.

1. Calculation of the value of the linear correlation coefficient;

2. Construction of a dispersion analysis table;

3. Assessment of the coefficient of determination;

4. Verify the correctness of the calculations performed using standard procedures and functions of Excel spreadsheets.

5. Analyze the results, formulate conclusions and recommendations.

4. Conduct a general assessment of the adequacy of the selected regression equation;

1. Assessment of the adequacy of the equation by the values ​​of the approximation coefficient;

2. Assessment of the adequacy of the equation by the values ​​of the coefficient of determination;

3. Assessment of the adequacy of the equation by the Fisher criterion;

4. Conduct a general assessment of the adequacy of the parameters of the regression equation;

5. Verify the correctness of the calculations performed using standard procedures and functions of Excel spreadsheets.

6. Analyze the results, formulate conclusions and recommendations.

1. Using the standard procedures of the Excel Spreadsheet Function Wizard (from the "Mathematical" and "Statistical" sections);

2. Data preparation and features of using the "LINEST" function;

3. Data preparation and features of using the "FORECAST" function.

1. Using the standard procedures of the Excel spreadsheet data analysis package;

2. Preparation of data and features of the application of the "REGRESSION" procedure;

3. Interpretation and generalization of the regression analysis table data;

4. Interpretation and generalization of the data of the dispersion analysis table;

5. Interpretation and generalization of the data of the table for assessing the significance of the parameters of the regression equation;

When performing laboratory work according to one of the options, it is necessary to perform the following particular tasks:

1. Make a choice of the form of the equation of the relationship of the studied factors;

2. Determine the parameters of the regression equation;

3. Assess the closeness of the relationship between the studied factors;

4. Assess the adequacy of the selected regression equation;

5. Evaluate the statistical significance of the parameters of the regression equation.

6. Verify the correctness of the calculations performed using standard procedures and functions of Excel spreadsheets.

7. Analyze the results, formulate conclusions and recommendations.

Tasks for practical and laboratory work on the topic "Paired linear regression and correlation in econometric studies."

Options 1-10: tables of initial data with columns x and y.

After the regression equation is constructed and its accuracy estimated using the coefficient of determination, an open question remains: how was this accuracy achieved and, accordingly, can this equation be trusted? The point is that the regression equation was built not from the general population, which is unknown, but from a sample drawn from it. Points from the general population fall into the sample randomly; therefore, in accordance with probability theory, it is possible, among other cases, that a sample from a "broad" general population turns out to be "narrow" (Fig. 15).

Fig. 15. A possible way sample points may fall from the general population.

In this case:

a) the regression equation built on the sample may differ significantly from the regression equation for the general population, which will lead to forecast errors;

b) the coefficient of determination and other accuracy characteristics will turn out to be unreasonably high and will mislead about the predictive qualities of the equation.

In the limiting case, it cannot be ruled out that from a general population that forms a cloud with its main axis parallel to the horizontal axis (i.e., with no relationship between the variables), random selection yields a sample whose main axis is inclined to that axis. Thus, attempts to predict the next values of the general population based on sample data from it are fraught not only with errors in assessing the strength and direction of the relationship between the dependent and independent variables, but also with the danger of finding a relationship between variables where there is actually none.

In the absence of information about all points of the general population, the only way to reduce errors in the first case is to use a method in estimating the coefficients of the regression equation that ensures their unbiasedness and efficiency. And the probability of the occurrence of the second case can be significantly reduced due to the fact that one property of the general population with two variables independent of each other is known a priori - it is this connection that is absent in it. This reduction is achieved by checking the statistical significance of the resulting regression equation.

One of the most commonly used verification options is as follows. For the resulting regression equation, the F-statistic, a characteristic of the accuracy of the regression equation, is determined; it is the ratio of the part of the variance of the dependent variable that is explained by the regression equation to the unexplained (residual) part of the variance. In the case of multivariate regression, the F-statistic is determined by the equation:

$F = \dfrac{S^2_{expl}}{S^2_{res}} = \dfrac{\sum (\hat y - \bar y)^2 / m}{\sum (y - \hat y)^2 / (n - m - 1)},$

where: $S^2_{expl}$ is the explained variance, the part of the variance of the dependent variable Y that is explained by the regression equation;

$S^2_{res}$ is the residual variance, the part of the variance of the dependent variable Y that is not explained by the regression equation; its presence is a consequence of the action of the random component;

n is the number of points in the sample;

m is the number of independent variables in the regression equation.

As can be seen from the above formula, the variances are defined as the quotient of dividing the corresponding sum of squares by the number of degrees of freedom. The number of degrees of freedom is the minimum required number of values ​​of the dependent variable, which are sufficient to obtain the desired sample characteristic and which can freely vary, given that all other quantities used to calculate the desired characteristic are known for this sample.
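The decomposition described above can be sketched numerically. An illustrative Python fragment with invented data; the degrees of freedom follow the text (m for the explained part, n − m − 1 for the residual, with m = 1 here):

```python
# invented sample for a paired regression
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 8.2, 8.8]
n, m = len(x), 1
xm, ym = sum(x) / n, sum(y) / n

# OLS fit y_hat = a + b*x
b = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y)) / sum((xi - xm) ** 2 for xi in x)
a = ym - b * xm
y_hat = [a + b * xi for xi in x]

ss_total = sum((yi - ym) ** 2 for yi in y)                 # total sum of squares
ss_expl = sum((yh - ym) ** 2 for yh in y_hat)              # explained part
ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # residual part

# each variance = sum of squares / its degrees of freedom
s2_expl = ss_expl / m
s2_res = ss_res / (n - m - 1)
F = s2_expl / s2_res
```

The identity ss_total = ss_expl + ss_res holds for any OLS fit with an intercept, which is why only two of the three sums of squares need to be computed independently.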

To obtain the residual variance, the coefficients of the regression equation are needed. In paired linear regression there are two coefficients, so in accordance with the formula (taking m = 1) the number of degrees of freedom is n − 2. This means that to determine the residual variance it is sufficient to know the coefficients of the regression equation and only n − 2 values of the dependent variable from the sample. The remaining two values can be calculated from these data and are therefore not freely variable.

To calculate the explained variance, the values of the dependent variable are not needed at all, since it can be computed from the regression coefficients for the independent variables and the variance of the independent variable. To see this, it suffices to recall the expression given earlier. Therefore, the number of degrees of freedom for the explained variance is equal to the number of independent variables in the regression equation (one for paired linear regression).

As a result, the F-criterion for the paired linear regression equation is determined by the formula:

$F = \dfrac{\sum (\hat y - \bar y)^2}{\sum (y - \hat y)^2 / (n - 2)} = \dfrac{r^2}{1 - r^2}\,(n - 2).$
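A quick numeric check of this formula; the values r² = 0.73 and n = 30 are borrowed from the grain example later in the text, everything else is a sketch:

```python
import math

# F-criterion of a paired regression computed from the coefficient of
# determination: F = r^2 / (1 - r^2) * (n - 2)
r2, n = 0.73, 30
F = r2 / (1 - r2) * (n - 2)

# for paired regression, F equals the square of the slope's t-statistic
t_slope = math.sqrt(F)
```

With F ≈ 75.7 against the tabular value 4.21 used in the text, such an equation would be recognized as significant.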

In probability theory it has been proven that the F-criterion of a regression equation obtained for a sample from a general population in which there is no relationship between the dependent and independent variables follows the Fisher distribution, which is quite well studied. Thanks to this, for any value of the F-criterion it is possible to calculate the probability of its occurrence and, conversely, to determine the value of the F-criterion that cannot be exceeded with a given probability.

To carry out a statistical test of the significance of the regression equation, a null hypothesis is formulated about the absence of a relationship between the variables (all coefficients of the variables are equal to zero), and a significance level α is selected.

The significance level α is the acceptable probability of making a Type I error: rejecting a correct null hypothesis as a result of the test. In this case, making a Type I error means concluding from the sample that a relationship between the variables exists in the general population when in fact it does not.

The significance level is usually taken to be 5% or 1%. The smaller the significance level α, the higher the reliability level of the test, equal to 1 − α, i.e., the greater the chance of avoiding the sampling error of finding a relationship between variables that are actually unrelated in the population. But as the significance level decreases, the risk of a Type II error increases: accepting an incorrect null hypothesis, i.e., failing to notice in the sample an actual relationship between the variables in the general population. Therefore, depending on which error has the larger negative consequences, one or the other significance level is chosen.

For the selected significance level, a tabular value F cr is determined from the Fisher distribution: the probability of exceeding it in a sample of size n obtained from a general population with no relationship between the variables does not exceed the significance level. F cr is then compared with the actual value of the criterion F for the regression equation.

If the condition F > F cr is met, then the erroneous detection of a relationship with a value of the F-criterion equal to or greater than this in a sample from a general population with unrelated variables occurs with probability less than the significance level. In accordance with the rule "very rare events do not happen", we conclude that the relationship between the variables established from the sample is also present in the general population from which it was drawn.

If it turns out that F ≤ F cr, then the regression equation is not statistically significant. In other words, there is a real probability that a relationship has been established in the sample between variables that are not actually related. An equation that fails the test for statistical significance is treated like a drug with an expired shelf life: such medicines are not necessarily spoiled, but since there is no confidence in their quality, they are preferably not used. This rule does not protect against all errors, but it allows the grossest ones to be avoided, which is also quite important.

The second verification option, more convenient when using spreadsheets, is to compare the probability of occurrence of the obtained value of the F-criterion (the p-value) with the significance level α. If this probability is below α, the equation is statistically significant; otherwise it is not.

After checking the statistical significance of the regression equation, it is generally useful, especially for multivariate dependencies, to check the statistical significance of the obtained regression coefficients. The idea of the check is the same as for the equation as a whole, but the criterion used is Student's t-test, determined by the formulas:

$t_a = \dfrac{a}{m_a} \quad \text{and} \quad t_b = \dfrac{b}{m_b},$

where: $t_a$, $t_b$ are the values of Student's criterion for the coefficients a and b, respectively;

$m_a$, $m_b$ are their standard errors, computed from the residual variance of the regression equation;

n is the number of points in the sample;

m is the number of independent variables in the equation (m = 1 for paired linear regression).

The obtained actual values of Student's criterion are compared with the tabular values obtained from Student's distribution. If it turns out that t > t cr, the corresponding coefficient is statistically significant; otherwise it is not. The second option for checking the statistical significance of the coefficients is to determine the probability of occurrence of Student's t-test (the p-value) and compare it with the significance level α.

Variables whose coefficients are not statistically significant most likely have no effect on the dependent variable in the general population at all. Therefore, either the number of points in the sample should be increased, in which case the coefficient may become statistically significant and its value will at the same time be refined, or other independent variables more closely related to the dependent variable should be found. In either case, the forecasting accuracy will increase.

As an express method for assessing the significance of the coefficients of the regression equation, the following rule can be used: if Student's criterion is greater than 3, such a coefficient, as a rule, turns out to be statistically significant. In general, it is considered that to obtain statistically significant regression equations the condition t > t cr must be satisfied.

The standard error of forecasting an unknown value $\hat y_p$ at a known $x_p$ with the obtained regression equation is estimated by the formula:

$m_{\hat y} = S_{res} \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_p - \bar x)^2}{\sum (x - \bar x)^2}}, \quad \text{where } S_{res} = \sqrt{\dfrac{\sum (y - \hat y)^2}{n - 2}}.$

Thus, a forecast with a confidence level of 68% can be represented as $\hat y_p \pm m_{\hat y}$.

If a different confidence probability is required, then for the significance level α it is necessary to find Student's criterion $t_\alpha$, and the confidence interval for a forecast with reliability level $1 - \alpha$ will be $\hat y_p \pm t_\alpha m_{\hat y}$.
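A sketch of such a forecast interval, with invented data, assuming the standard error in the form S_res·sqrt(1 + 1/n + (x_p − x̄)²/Σ(x − x̄)²):

```python
import math

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 8.2, 8.8]
n = len(x)
xm, ym = sum(x) / n, sum(y) / n
sxx = sum((xi - xm) ** 2 for xi in x)
b = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y)) / sxx
a = ym - b * xm

# residual standard deviation with n - 2 degrees of freedom
s_res = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

x_p = 9                                     # point to forecast (one step beyond the data)
y_p = a + b * x_p                           # point forecast
m_p = s_res * math.sqrt(1 + 1 / n + (x_p - xm) ** 2 / sxx)

low_68, high_68 = y_p - m_p, y_p + m_p      # ~68% interval (plus/minus one error)
t_crit = 2.447                              # table value for k = n - 2 = 6, alpha = 0.05
low_95, high_95 = y_p - t_crit * m_p, y_p + t_crit * m_p
```

Note how the interval widens as x_p moves away from the sample mean: the (x_p − x̄)² term penalizes extrapolation.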

Prediction of multidimensional and non-linear dependencies

If the predicted value depends on several independent variables, then there is a multivariate regression of the form:

$y = a + b_1 x_1 + b_2 x_2 + \dots + b_m x_m,$

where $b_1, \dots, b_m$ are the regression coefficients describing the influence of the variables $x_1, \dots, x_m$ on the predicted value.

The methodology for determining the regression coefficients is no different from paired linear regression, especially when using a spreadsheet, since the same function is used there for both paired and multivariate linear regression. It is desirable that there be no relationships between the independent variables, i.e., that changing one variable not affect the values of the others. But this requirement is not mandatory; what matters is that there be no functional linear dependencies between the variables. The procedures described above for checking the statistical significance of the regression equation and its individual coefficients, and for assessing forecasting accuracy, remain the same as for paired linear regression. At the same time, using multivariate regression instead of paired regression, with an appropriate choice of variables, usually makes it possible to significantly improve the accuracy of describing the behavior of the dependent variable, and hence the forecasting accuracy.

In addition, multivariate linear regression equations make it possible to describe a nonlinear dependence of the predicted value on the independent variables. The procedure for reducing a nonlinear equation to linear form is called linearization. In particular, if this dependence is described by a polynomial of degree different from 1, then replacing the variables raised to powers different from unity with new first-degree variables turns the nonlinear problem into a multivariate linear regression one. For example, if the influence of the independent variable is described by a parabola of the form

$y = a + b x + c x^2,$

then the replacement $x_1 = x$, $x_2 = x^2$ allows us to transform the nonlinear problem into a multivariate linear one of the form

$y = a + b x_1 + c x_2.$
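A sketch of this substitution in code: the powers of x become separate columns of a multivariate linear problem, solved here by plain normal equations (illustrative, noise-free data generated from y = 1 + 2x + 3x²):

```python
def solve(A, rhs):
    """Solve a small linear system A w = rhs by Gaussian elimination."""
    n = len(A)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))  # partial pivoting
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w

xs = [0, 1, 2, 3, 4, 5]
ys = [1 + 2 * x + 3 * x * x for x in xs]            # exact parabola, no noise

# linearization: treat x and x^2 as two separate first-degree variables
X = [[1.0, float(x), float(x * x)] for x in xs]

# normal equations (X^T X) w = X^T y
AtA = [[sum(X[i][p] * X[i][q] for i in range(len(xs))) for q in range(3)] for p in range(3)]
Aty = [sum(X[i][p] * ys[i] for i in range(len(xs))) for p in range(3)]
a, b, c = solve(AtA, Aty)
```

Since the data lie exactly on the parabola, the recovered coefficients reproduce a = 1, b = 2, c = 3 up to floating-point error.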

Nonlinear problems in which the nonlinearity arises because the predicted value depends on the product of independent variables can be transformed just as easily: to account for this effect, it is enough to introduce a new variable equal to that product.

In cases where the nonlinearity is described by more complex dependencies, linearization is possible through coordinate transformations. For this, transformed values of the variables are calculated and graphs of the initial points are built in various combinations of the transformed variables. The combination of transformed coordinates, or of transformed and untransformed coordinates, in which the dependence is closest to a straight line suggests the change of variables that will reduce the nonlinear dependence to linear form. For example, an exponential dependence becomes linear after taking the logarithm of the dependent variable.

The regression coefficients obtained for the transformed equation remain unbiased and efficient, but the equation and its coefficients cannot be tested for statistical significance.
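An illustrative sketch of such a coordinate transformation; the exponential form y = a·e^(bx) is an assumed example, not taken from the text. Taking logarithms gives ln y = ln a + b·x, after which ordinary paired OLS applies:

```python
import math

# assumed true dependence: y = 2 * exp(0.5 * x), noise-free for clarity
xs = [0, 1, 2, 3, 4, 5]
ys = [2 * math.exp(0.5 * x) for x in xs]

# coordinate transformation: z = ln(y) turns the curve into a straight line
zs = [math.log(y) for y in ys]
n = len(xs)
xm, zm = sum(xs) / n, sum(zs) / n

# paired OLS on the transformed coordinates
b = sum((x - xm) * (z - zm) for x, z in zip(xs, zs)) / sum((x - xm) ** 2 for x in xs)
ln_a = zm - b * xm
a = math.exp(ln_a)          # back-transform the intercept
```

The fit recovers b = 0.5 and a = 2 exactly (up to floating-point error), confirming that the transformed problem is an ordinary linear one.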

Checking the validity of the application of the least squares method

The use of the least squares method ensures efficient and unbiased estimates of the coefficients of the regression equation provided the following conditions (the Gauss-Markov conditions) hold for the residuals:

1. the expected value of the residuals is zero;

2. the variance of the residuals is constant (homoscedasticity);

3. the residual values do not depend on each other;

4. the residual values do not depend on the independent variables.

The easiest way to check whether these conditions are met is to plot the residuals against ŷ and then against the independent variable(s). If the points on these graphs lie in a corridor symmetric about the x-axis and there are no regularities in their location, the Gauss-Markov conditions are met and there is no room to improve the accuracy of the regression equation. Otherwise the accuracy of the equation can be significantly improved, for which one should refer to the specialized literature.
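A numeric stand-in for those residual plots (a sketch with invented data): for an OLS fit with an intercept, the residuals must average to zero and be uncorrelated with x; a clear deviation from this pattern in a plot signals a violated condition:

```python
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 8.2, 8.8]
n = len(x)
xm, ym = sum(x) / n, sum(y) / n

# OLS fit with intercept
b = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y)) / sum((xi - xm) ** 2 for xi in x)
a = ym - b * xm

# residuals: what the plots described in the text would display
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]

mean_resid = sum(resid) / n
# sample covariance of residuals with x (zero by construction for OLS)
cov_rx = sum((xi - xm) * e for xi, e in zip(x, resid)) / n
```

Note that zero mean and zero covariance with x hold for any OLS fit by construction; the plots are needed to spot what the numbers cannot show, such as funnel shapes (heteroscedasticity) or curvature.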

To assess the significance of the correlation coefficient, Student's t-test is used.

The average error of the correlation coefficient is found by the formula:

$m_r = \sqrt{\dfrac{1 - r^2}{n - 2}},$

and based on this error the t-test is calculated:

$t = \dfrac{r}{m_r}.$

The calculated value of the t-test is compared with the tabular value found in the Student's distribution table at a significance level of 0.05 or 0.01 and the number of degrees of freedom n - 1. If the calculated value of the t-test is greater than the tabular one, the correlation coefficient is recognized as significant.

With a curvilinear relationship, the F-criterion is used to assess the significance of the correlation relationship and the regression equation. It is calculated by the formula:

$F = \dfrac{\eta^2 / (m - 1)}{(1 - \eta^2) / (n - m)}$ or $F = \dfrac{\eta^2}{1 - \eta^2} \cdot \dfrac{n - m}{m - 1},$

where η is the correlation ratio; n is the number of observations; m is the number of parameters in the regression equation.

The calculated value of F is compared with the table value for the accepted level of significance α (0.05 or 0.01) and the number of degrees of freedom k 1 =m-1 and k 2 =n-m. If the calculated value of F exceeds the tabulated value, the relationship is recognized as significant.

The significance of a regression coefficient is established using Student's t-test, which is calculated by the formula:

$t_{a_i} = \dfrac{a_i}{\sigma_{a_i}},$

where $\sigma^2_{a_i}$ is the variance of the regression coefficient.

It is calculated by the formula:

$\sigma^2_{a_i} = \dfrac{\sum (y - \hat y)^2 / (n - k - 1)}{\sum (x_i - \bar x_i)^2},$

where k is the number of factor features in the regression equation.

The regression coefficient is recognized as significant if $t_{a_i} \ge t_{cr}$, where $t_{cr}$ is found in the table of critical points of Student's distribution at the accepted significance level and the number of degrees of freedom k = n - 1.

4.3 Correlation-regression analysis in Excel

Let us carry out a correlation-regression analysis of the relationship between yield and labor costs per 1 quintal of grain. To do this, open an Excel sheet and enter the values of the factor feature (grain crop yield) in cells A1:A30 and the values of the resulting feature (labor costs per 1 quintal of grain) in cells B1:B30. In the Tools menu, select the Data Analysis option; clicking it opens the Regression tool. Click OK, and the Regression dialog box appears on the screen. In the Input Y Range field, enter the values of the resulting feature (selecting cells B1:B30); in the Input X Range field, enter the values of the factor feature (selecting cells A1:A30). Set the confidence level to 95% and select New Worksheet. Click OK. The "RESULTS" table appears on the worksheet, giving the calculated parameters of the regression equation, the correlation coefficient and other indicators that allow determining the significance of the correlation coefficient and of the parameters of the regression equation.

RESULTS

Regression statistics: Multiple R; R-square; Adjusted R-square; Standard error; Observations.

Analysis of variance: the Regression row, with the F value and Significance F.

Coefficients table (rows "Y-intercept" and "Variable X 1"): Coefficients; Standard error; t-statistic; P-value; Lower 95%; Upper 95%; Lower 95.0%; Upper 95.0%.

In this table, "Multiple R" is the correlation coefficient and "R-square" is the coefficient of determination. "Coefficients: Y-intercept" is the free term of the regression equation, 2.836242; "Variable X 1" is the regression coefficient, -0.06654. The table also contains the values of Fisher's F-test, 74.9876, of Student's t-test, 14.18042, and the standard error, 0.112121, which are needed to assess the significance of the correlation coefficient, the parameters of the regression equation and the equation as a whole.

Based on the data in the table, we construct the regression equation: y_x = 2.836 - 0.067x. The regression coefficient a1 = -0.067 means that with an increase in grain yield by 1 quintal/ha, labor costs per 1 quintal of grain decrease by 0.067 man-hours.

The correlation coefficient r = 0.85 > 0.7; therefore, the relationship between the studied features in this population is close. The coefficient of determination r² = 0.73 shows that 73% of the variation of the resulting feature (labor costs per 1 quintal of grain) is caused by the action of the factor feature (grain yield).

In the table of critical points of the Fisher - Snedecor distribution, we find the critical value of the F-criterion at a significance level of 0.05 and the number of degrees of freedom k 1 =m-1=2-1=1 and k 2 =n-m=30-2=28, it is equal to 4.21. Since the calculated value of the criterion is greater than the tabular value (F=74.9896>4.21), the regression equation is recognized as significant.

To assess the significance of the correlation coefficient, we calculate Student's t-test, t = r / m_r. In the table of critical points of Student's distribution, we find the critical value of the t-test at a significance level of 0.05 and n - 1 = 30 - 1 = 29 degrees of freedom; it is equal to 2.0452. Since the calculated value is greater than the tabular one, the correlation coefficient is significant.
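The computation behind that comparison can be sketched using the reported r = 0.85 and n = 30, assuming the standard error in the form sqrt((1 − r²)/(n − 2)) from (2.23):

```python
import math

r, n = 0.85, 30
m_r = math.sqrt((1 - r ** 2) / (n - 2))   # standard error of r
t = r / m_r                               # equivalently r * sqrt(n - 2) / sqrt(1 - r^2)

t_cr = 2.0452                             # tabular value cited in the text (alpha = 0.05)
significant = t > t_cr
```

Here t ≈ 8.54 exceeds 2.0452, matching the text's conclusion that the correlation coefficient is significant.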

Regression analysis is a statistical research method that allows you to show the dependence of a parameter on one or more independent variables. In the pre-computer era, its use was quite difficult, especially when it came to large amounts of data. Today, having learned how to build a regression in Excel, you can solve complex statistical problems in just a couple of minutes. Below are concrete examples from the field of economics.

Types of regression

The concept itself was introduced into mathematics in 1886. Regression can be:

  • linear;
  • parabolic;
  • power;
  • exponential (of the form y = a·e^(bx));
  • hyperbolic;
  • exponential-type (of the form y = a·b^x);
  • logarithmic.

Example 1

Consider the problem of determining how the number of employees who quit depends on the average salary at six industrial enterprises.

Task. For six enterprises, the average monthly wage and the number of employees who quit of their own accord were analyzed. In tabular form we have:

Salary (rubles per month): 30000, 35000, 40000, 45000, 50000, 55000, 60000, together with the corresponding number of people who left in each case.

For this problem of determining the dependence of the number of employees who quit on the average salary at the six enterprises, the regression model has the form of the equation Y = a0 + a1x1 + … + akxk, where xi are the influencing variables, ai are the regression coefficients, and k is the number of factors.

For this task, Y is the indicator of employees who left, and the influencing factor is the salary, which we denote by X.

Using the capabilities of the spreadsheet "Excel"

Regression analysis in Excel can be performed by applying built-in functions to the available tabular data. However, for these purposes it is better to use the very useful "Analysis ToolPak" add-in. To activate it you need:

  • from the "File" tab, go to the "Options" section;
  • in the window that opens, select the line "Add-ons";
  • click on the "Go" button located at the bottom, to the right of the "Management" line;
  • check the box next to the name "Analysis ToolPak" and confirm your actions by clicking "OK".

If everything is done correctly, the desired button will appear on the right side of the Data tab, located above the Excel worksheet.

Regression analysis in Excel

Now that we have at hand all the necessary virtual tools for performing econometric calculations, we can begin to solve our problem. For this:

  • click on the "Data Analysis" button;
  • in the window that opens, click on the "Regression" button;
  • in the tab that appears, enter the range of values ​​for Y (the number of employees who quit) and for X (their salaries);
  • We confirm our actions by pressing the "Ok" button.

As a result, the program will automatically populate a new sheet of the spreadsheet with the regression analysis data. Note: Excel also lets you manually specify a preferred location for this output; for example, it could be the same sheet where the Y and X values are, or even a new workbook specifically intended for storing such data.

Analysis of regression results for R-square

The output Excel produces for the example under consideration looks like this:

First of all, pay attention to the value of R-square: it is the coefficient of determination. In this example, R-square = 0.755 (75.5%), i.e., the calculated parameters of the model explain 75.5% of the relationship between the considered parameters. The higher the value of the coefficient of determination, the more applicable the chosen model is to the particular task. A model is considered to describe the real situation correctly when the R-square value is above 0.8. If R-square < 0.5, such a regression analysis in Excel cannot be considered reasonable.

Ratio Analysis

The number 64.1428 shows what the value of Y will be if all the variables xi in the model under consideration are set to zero. In other words, it can be argued that the value of the analyzed parameter is also influenced by factors not described in this specific model.

The next coefficient, -0.16285, located in cell B18, shows the weight of the influence of variable X on Y. This means that, within the model under consideration, the average monthly salary of employees affects the number of quitters with a weight of -0.16285, i.e., the degree of its influence is quite small. The minus sign indicates that the coefficient is negative. This is to be expected, since the higher the salary at an enterprise, the fewer people express a desire to terminate their employment contract and quit.

Multiple Regression

This term refers to a relationship equation with several independent variables of the form:

y = f(x1, x2, … xm) + ε,

where y is the resulting feature (dependent variable) and x1, x2, … xm are the factor features (independent variables).

Parameter Estimation

For multiple regression (MR) it is carried out using the method of least squares (OLS). For linear equations of the form Y = a + b 1 x 1 +…+b m x m + ε, we construct a system of normal equations (see below)

To understand the principle of the method, consider the two-factor case. Then we have a situation described by the system of normal equations:

$\sum y = n a + b_1 \sum x_1 + b_2 \sum x_2,$
$\sum y x_1 = a \sum x_1 + b_1 \sum x_1^2 + b_2 \sum x_1 x_2,$
$\sum y x_2 = a \sum x_2 + b_1 \sum x_1 x_2 + b_2 \sum x_2^2.$

From here we get:

$b_1 = \dfrac{r_{yx_1} - r_{yx_2} r_{x_1 x_2}}{1 - r_{x_1 x_2}^2} \cdot \dfrac{\sigma_y}{\sigma_{x_1}}, \qquad b_2 = \dfrac{r_{yx_2} - r_{yx_1} r_{x_1 x_2}}{1 - r_{x_1 x_2}^2} \cdot \dfrac{\sigma_y}{\sigma_{x_2}},$

where σ is the standard deviation of the corresponding feature indicated by the index.

OLS is also applicable to the MR equation on a standardized scale. In this case we get the equation:

$t_y = \beta_1 t_{x_1} + \beta_2 t_{x_2} + \dots + \beta_m t_{x_m},$

where $t_y, t_{x_1}, \dots, t_{x_m}$ are standardized variables, with mean 0 and standard deviation 1, and $\beta_i$ are the standardized regression coefficients.

Note that all βi in this case are normalized and centered, so comparing them with one another is correct and admissible. In addition, it is customary to screen out factors by discarding those with the smallest values of βi.
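A sketch of such a comparison (all numbers invented): the standardized coefficient is the raw coefficient rescaled by the ratio of standard deviations, so factors measured in different units become directly comparable:

```python
# raw regression coefficients and standard deviations (assumed values)
b = {"x1": 0.8, "x2": 150.0}
sd_x = {"x1": 12.0, "x2": 0.05}
sd_y = 20.0

# beta_i = b_i * sd(x_i) / sd(y): each coefficient in units of sd_y per sd_x
beta = {name: b[name] * sd_x[name] / sd_y for name in b}
```

Here beta for x1 (0.48) exceeds beta for x2 (0.375), so x1 would be kept in preference to x2 despite its much smaller raw coefficient, which is exactly the screening rule described above.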

A problem using the linear regression equation

Suppose there is a table of the price dynamics of a certain product N over the last 8 months. It is necessary to decide on the advisability of purchasing a batch of it at a price of 1850 rubles/t.

Month number (1-8), month name, and price of item N (rubles per ton): 1750, 1755, 1767, 1760, 1770, 1790, 1810, 1840.

To solve this problem in an Excel spreadsheet, use the Data Analysis tool already familiar from the example above. Next, select the "Regression" section and set the parameters. Remember that in the "Input Y Range" field you must enter the range of values of the dependent variable (in this case, the price of the product in specific months of the year), and in the "Input X Range" field, that of the independent variable (the month number). Confirm the action by clicking "Ok". On a new sheet (if it was specified so), we get the regression data.

Based on it, we build a linear equation of the form y = ax + b, where the parameter a is the coefficient in the row labeled with the name of the independent variable (month number), and b is the coefficient in the "Y-intercept" row of the sheet with the regression results. Thus, the linear regression equation for this problem is written as:

Product price N = 11.714* month number + 1727.54.

or in algebraic notation

y = 11.714 x + 1727.54
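As a cross-check of the Excel output (a sketch in Python rather than Excel, using the eight observations from the table above), the OLS estimates and the coefficient of determination can be recomputed directly:

```python
x = list(range(1, 9))                                  # month numbers 1..8
y = [1750, 1755, 1767, 1760, 1770, 1790, 1810, 1840]   # price, rub./t

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

a = sxy / sxx                 # slope (coefficient of the month number)
b = my - a * mx               # intercept ("Y-intercept" row in Excel)

# Coefficient of determination: explained share of the total scatter.
sst = sum((yi - my) ** 2 for yi in y)
ssr = a * sxy                 # regression (explained) sum of squares
r2 = ssr / sst

print(round(a, 3), round(b, 2), round(r2, 3))   # 11.714 1727.54 0.848
```

For reference, the trend value for month 9 is a·9 + b ≈ 1833 rub./t, which is one way to weigh the offered price of 1850 rub./t.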

Analysis of results

To decide whether the resulting linear regression equation is adequate, the multiple correlation coefficient (MCC) and the coefficient of determination are used, as well as Fisher's F-test and Student's t-test. In the Excel table with the regression results they appear under the names Multiple R, R Square, F-statistic, and t-statistic, respectively.

The MCC R makes it possible to assess the closeness of the probabilistic relationship between the independent and dependent variables. Its high value here indicates a fairly strong relationship between the variables "month number" and "price of product N in rubles per 1 ton". However, the nature of this relationship remains unknown.

The coefficient of determination R² is a numerical characteristic of the share of the total scatter of the experimental data (i.e., of the values of the dependent variable) that is accounted for by the linear regression equation. In the problem under consideration this value equals 84.8%, i.e., the statistical data are described by the obtained regression equation with a high degree of accuracy.

The F-statistic, also called Fisher's test, is used to assess the significance of the linear relationship, refuting or confirming the hypothesis of its existence.

The t-statistic (Student's criterion) helps to evaluate the significance of the coefficient of the unknown and of the free term of the linear relationship. If the calculated value of the t-criterion > t cr, then the hypothesis of the insignificance of the corresponding term of the linear equation is rejected.

In the problem under consideration, Excel gives t = 169.20903 and p = 2.89E-12 for the free term, i.e., a practically zero probability that a true hypothesis of the insignificance of the free term would be rejected. For the coefficient of the unknown, t = 5.79405 and p = 0.001158; in other words, the probability that a true hypothesis of the insignificance of this coefficient would be rejected is about 0.12%.

Thus, it can be argued that the resulting linear regression equation is adequate.

The problem of the expediency of buying a block of shares

Multiple regression in Excel is performed using the same Data Analysis tool. Consider a specific applied problem.

The management of NNN must make a decision on the advisability of purchasing a 20% stake in JSC MMM. The cost of the package (SP) is 70 million US dollars. NNN's specialists collected data on similar transactions. It was decided to estimate the value of the block of shares from such parameters, expressed in millions of US dollars, as:

  • accounts payable (VK);
  • annual turnover (VO);
  • accounts receivable (VD);
  • cost of fixed assets (SOF).

In addition, the parameter "payroll arrears of the enterprise" (VZP), in thousands of US dollars, is used.

Solution using Excel spreadsheet

First of all, you need to create a table of the initial data. Then:

  • call the "Data Analysis" window;
  • select the "Regression" section;
  • in the box "Input interval Y" enter the range of values ​​of dependent variables from column G;
  • click on the icon with a red arrow to the right of the "Input interval X" window and select the range of all values ​​​​from columns B, C, D, F on the sheet.

Select "New Worksheet" and click "Ok".

Get the regression analysis for the given problem.

Examination of the results and conclusions

From the rounded data presented on the Excel results sheet, we "assemble" the regression equation:

SP = 0.103*SOF + 0.541*VO - 0.031*VK + 0.405*VD + 0.691*VZP - 265.844.

In a more familiar mathematical form, it can be written as:

y = 0.103*x1 + 0.541*x2 - 0.031*x3 +0.405*x4 +0.691*x5 - 265.844
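For reuse, the equation can be wrapped in a small function (a sketch; the function name and the zero-input check are mine, and the actual input values for JSC MMM come from a table not reproduced here):

```python
def share_block_value(sof, vo, vk, vd, vzp):
    """Estimated block value (SP), million USD, from the regression above.

    sof: cost of fixed assets; vo: annual turnover; vk: accounts payable;
    vd: accounts receivable (all in million USD); vzp: payroll arrears
    (thousand USD), as in the original problem statement.
    """
    return (0.103 * sof + 0.541 * vo - 0.031 * vk
            + 0.405 * vd + 0.691 * vzp - 265.844)

# Sanity check: with all inputs zero the estimate equals the free term.
print(share_block_value(0, 0, 0, 0, 0))   # -265.844
```

Substituting the actual MMM figures into such a function reproduces the 64.72 million USD estimate discussed below.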

Data for JSC "MMM" are presented in the table:

Substituting them into the regression equation yields a figure of 64.72 million US dollars. This means that the shares of JSC MMM should not be purchased, since their asking price of 70 million US dollars is rather overstated.

As you can see, the use of the Excel spreadsheet and the regression equation made it possible to make an informed decision regarding the feasibility of a very specific transaction.

Now you know what regression is. The examples in Excel discussed above will help you solve practical problems from the field of econometrics.

In socio-economic research one often has to work with a limited population or with sample data. Therefore, after finding the mathematical parameters of the regression equation, it is necessary to evaluate them, and the equation as a whole, for statistical significance, i.e., to make sure that the resulting equation and its parameters were formed under the influence of non-random factors.

First of all, the statistical significance of the equation as a whole is evaluated, usually with Fisher's F-test. The calculation of the F-criterion is based on the rule of adding variances: the total variance of the result attribute = the factor variance + the residual variance.

(Figure: actual vs. theoretical price.)

Having built the regression equation, one can calculate the theoretical values of the result attribute, i.e., the values computed from the regression equation with its parameters taken into account.

These values characterize the result attribute as formed under the influence of the factors included in the analysis.

There are always discrepancies between the actual values of the result attribute and those calculated on the basis of the regression equation, due to the influence of other factors not included in the analysis. These differences between the theoretical and actual values of the result attribute are called residuals. The total variation of the result attribute is

S_total = Σ(y − ȳ)².

The variation of the result attribute due to the variation of the factor attributes included in the analysis (the factor variation) is estimated by comparing the theoretical values of the result attribute with its mean value: S_factor = Σ(ŷ − ȳ)². The residual variation is estimated by comparing the theoretical and actual values: S_residual = Σ(y − ŷ)². The total, factor, and residual variations have different numbers of degrees of freedom:

total: n − 1, where n is the number of units in the studied population;

factor: k, the number of factors included in the analysis;

residual: n − k − 1.

Fisher's F-test is calculated as the ratio of the factor variation to the residual variation, each taken per one degree of freedom:

F = (S_factor / k) / (S_residual / (n − k − 1)).

The use of Fisher's F-test as an estimate of the statistical significance of a regression equation is quite logical. The factor variation is the portion of the variation of the result attribute due to the factors included in the analysis, i.e., the explained portion; the residual variation is the portion due to factors whose influence is not taken into account, i.e., not included in the analysis.

Thus, the F-criterion is designed to assess whether the explained variation significantly exceeds the unexplained one. If the factor variation does not significantly exceed the residual variation, and all the more so if it falls below it, then the analysis does not include the factors that really affect the result attribute.

Fisher's F-test is tabulated: the actual value F calc is compared with the tabular one F table. If F calc > F table, the regression equation is considered statistically significant; otherwise the equation is not statistically significant and cannot be used in practice. The significance of the equation as a whole also indicates the statistical significance of the correlation indicators.
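A numeric sketch of this F-test for the earlier one-factor price example (k = 1, n = 8); the critical value F table(α = 0.05; 1; 6) ≈ 5.99 is a standard table value, not computed here:

```python
x = list(range(1, 9))
y = [1750, 1755, 1767, 1760, 1770, 1790, 1810, 1840]

n, k = len(x), 1
mx, my = sum(x) / n, sum(y) / n
slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
intercept = my - slope * mx

yhat = [intercept + slope * xi for xi in x]                # theoretical values
s_factor = sum((yh - my) ** 2 for yh in yhat)              # explained variation
s_resid = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))   # residual variation
s_total = sum((yi - my) ** 2 for yi in y)                  # = factor + residual

# F-criterion: explained vs. unexplained variation per degree of freedom.
F = (s_factor / k) / (s_resid / (n - k - 1))
print(round(F, 2))          # 33.57, well above F table = 5.99
```

Since F calc ≈ 33.57 > 5.99, the price equation is statistically significant, in agreement with the conclusion reached above.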

After evaluating the equation as a whole, it is necessary to evaluate the statistical significance of the parameters of the equation. This estimate is made using Student's t-statistic. The t-statistic is calculated as the ratio of an equation parameter (in absolute value) to its standard error. If a one-factor model is evaluated, two statistics are calculated, one for each parameter.

In all statistical computer programs, the standard errors and t-statistics for the parameters are computed along with the parameters themselves. The t-statistics are compared with tabulated values: if |t calc| > t table, the parameter is considered statistically significant, i.e., formed under the influence of non-random factors.

Calculating the t-statistic essentially means testing the null hypothesis that a parameter is insignificant, i.e., equal to zero. For a one-factor model two pairs of hypotheses are evaluated: H0: a = 0 vs. H1: a ≠ 0, and H0: b = 0 vs. H1: b ≠ 0.

The decision on the null hypothesis depends on the accepted confidence probability. If the researcher specifies a probability level of 95%, the significance level is 1 − 0.95 = 0.05. Therefore, if the attained significance level (p-value) is ≥ 0.05, H0 is accepted and the parameter is considered statistically insignificant; if it is < 0.05, H0 is rejected and the alternatives are accepted: a ≠ 0 and b ≠ 0.
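For the same one-factor price example, the t-statistics reported by Excel (169.20903 for the intercept, 5.79405 for the slope) can be recomputed with the textbook standard-error formulas for a one-factor model; t table(α = 0.05; k = 6) ≈ 2.447 is a standard two-sided table value, not computed here:

```python
from math import sqrt

x = list(range(1, 9))
y = [1750, 1755, 1767, 1760, 1770, 1790, 1810, 1840]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx   # slope
a = my - b * mx                                                # intercept

# Residual variance with n - 2 degrees of freedom (one-factor model).
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
s2 = sum(e ** 2 for e in resid) / (n - 2)

m_b = sqrt(s2 / sxx)                                    # std. error of slope
m_a = sqrt(s2 * sum(xi ** 2 for xi in x) / (n * sxx))   # std. error of intercept

# t-statistic: parameter (in absolute value) over its standard error.
t_b = abs(b) / m_b
t_a = abs(a) / m_a
print(round(t_a, 2), round(t_b, 3))   # both exceed t table = 2.447
```

Both t-statistics exceed the critical value, so H0 is rejected for each parameter, matching the Excel p-values quoted earlier.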

Statistical application packages also report the significance level (p-value) for accepting the null hypotheses. An assessment of the significance of the regression equation and its parameters can give the following results:

First, the equation as a whole is significant (according to the F-test) and all parameters of the equation are also statistically significant. This means that the resulting equation can be used both for making management decisions as well as for forecasting.

Secondly, according to the F-criterion, the equation is statistically significant, but at least one of the parameters of the equation is not significant. The equation can be used to make management decisions regarding the analyzed factors, but cannot be used for forecasting.

Thirdly, the equation is not statistically significant, or the equation is significant according to the F-criterion, but all parameters of the resulting equation are not significant. The equation cannot be used for any purpose.

In order for the regression equation to be recognized as a model of the relationship between the result attribute and the factor attributes, it must include all the most important factors determining the result, so that a meaningful interpretation of the equation's parameters corresponds to theoretically substantiated relationships in the phenomenon under study. The coefficient of determination R² must be > 0.5.

When building a multiple regression equation, it is advisable to carry out the assessment using the so-called adjusted coefficient of determination. The value of R² (as well as of the correlation coefficients) increases as the number of factors included in the analysis grows, and the coefficients are especially overstated for small populations. To dampen this negative influence, R² and the correlation coefficients are corrected for the number of degrees of freedom, i.e., for the number of freely varying elements once certain factors are included.

The adjusted coefficient of determination is

R̄² = 1 − (1 − R²)·(n − 1)/(n − k − 1),

where n is the population size (the number of observations), k is the number of factors included in the analysis, n − 1 is the number of degrees of freedom of the total variation, and (1 − R²) is the share of the residual (unexplained) variance of the result attribute.

R̄² is always less than R². On its basis it is possible to compare estimates of equations with different numbers of analyzed factors.
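A quick numeric check of the adjustment (a sketch; R² ≈ 0.848 is taken from the price example above, with n = 8 observations and k = 1 factor):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared: 1 - (1 - R^2)(n - 1)/(n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# One-factor price example: R^2 ~ 0.848, n = 8 observations, k = 1 factor.
print(round(adjusted_r2(0.848374, 8, 1), 3))   # 0.823
```

As expected, the adjusted value (≈ 0.823) is below the raw R², and the gap widens as k grows relative to n.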

34. Problems of studying time series.

Dynamic series are also called time series. A dynamic series is a time-ordered sequence of indicators characterizing a particular phenomenon (e.g., the volume of GDP from 1990 to 1998). The purpose of studying a dynamic series is to identify patterns in the development of the phenomenon under study (the main trend) and to forecast on this basis. It follows from the definition that any series consists of two elements: time t and the levels of the series (the specific values of the indicator from which the series is built).

Dynamic series can be: (1) moment series, whose indicators are fixed at a point in time, on a specific date; (2) interval series, whose indicators are obtained for a certain period of time (e.g., (1) the population of St. Petersburg; (2) GDP for a period). The division into moment and interval series is necessary because it determines the specifics of calculating some indicators of the series. Summing the levels of an interval series gives a meaningfully interpretable result, which cannot be said of summing the levels of a moment series, since the latter contains repeated counting.

The most important problem in the analysis of time series is the comparability of the levels of the series. This concept is many-sided. The levels should be comparable in terms of calculation methods, territory, and coverage of population units. If the series is built in value terms, all levels should be expressed or recalculated in comparable prices. When constructing interval series, the levels should characterize periods of the same length; when constructing moment series, the levels must be fixed on the same date of each period. Series can be complete or incomplete; incomplete series are used in official publications (1980, 1985, 1990, 1995, 1996, 1997, 1998, 1999, …).

A comprehensive analysis of a dynamic series includes the study of the following points:

1. calculation of indicators of changes in RD levels

2. calculation of average indicators of RD

3. identifying the main trend of the series, building trend models

4. Estimation of autocorrelation in RD, construction of autoregressive models

5. correlation of RD

6. RD forecasting.

35. Indicators of change in the levels of time series.

In general form, a dynamic series can be represented as a set of pairs (t, y_t), t = 1, …, n, where y is the level of the series, t is the moment or period of time to which the level (indicator) refers, and n is the length of the series (the number of periods).

When studying a dynamic series, the following indicators are calculated: 1. absolute growth; 2. growth coefficient (growth rate); 3. acceleration; 4. rate of increment; 5. absolute value of a 1% increase. The calculated indicators can be: chain, obtained by comparing each level of the series with the immediately preceding one; or base, obtained by comparing with the level chosen as the base of comparison (unless otherwise specified, the first level of the series is taken as the base).

1. Chain absolute growth: Δy_t = y_t − y_{t−1}. It shows by how much the level is larger or smaller. Chain absolute growths are called indicators of the rate of change in the levels of the dynamic series. Base absolute growth: Δy_t = y_t − y_1. If the levels of the series are relative indicators expressed in %, the absolute growth is expressed in points of change.

2. Growth coefficient (growth rate): calculated as the ratio of a level of the series to the immediately preceding one (chain growth coefficients), K_t = y_t / y_{t−1}, or to the level taken as the base of comparison (base growth coefficients), K_t = y_t / y_1. It characterizes how many times each level of the series is greater or smaller than the preceding or base level. Growth rates are calculated from the growth coefficients: these are the growth coefficients expressed in %: T_t = K_t · 100%.

3. On the basis of absolute growth, the acceleration of absolute growth is calculated: Δ′_t = Δy_t − Δy_{t−1}. Acceleration is the absolute growth of the absolute growths. It evaluates how the increments themselves change, whether they are stable or accelerating (increasing).

4. The rate of increment is the ratio of the increment to the base of comparison, expressed in %: T_incr,t = (Δy_t / y_{t−1}) · 100% for chain indicators, or relative to the base level for base ones. The rate of increment equals the growth rate minus 100%. It shows by how many % a given level of the series is greater or less than the preceding or base level.

5. The absolute value of a 1% increase is calculated as the ratio of the absolute growth to the rate of increment: A_t = Δy_t / T_incr,t = 0.01·y_{t−1}, i.e., one hundredth of the previous level.

All these indicators are calculated to assess the degree of change in the levels of the series. Chain growth coefficients and growth rates are called indicators of the intensity of change in the levels of dynamic series.
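The chain and base indicators above can be sketched in a few lines; the series reused here is the eight price levels from the earlier regression example:

```python
y = [1750, 1755, 1767, 1760, 1770, 1790, 1810, 1840]   # levels of the series

# Chain absolute growths: y_t - y_{t-1}.
abs_growth = [y[t] - y[t - 1] for t in range(1, len(y))]
# Base absolute growths against the first level: y_t - y_1.
base_growth = [yt - y[0] for yt in y[1:]]
# Chain growth coefficients and growth rates (%).
k_chain = [y[t] / y[t - 1] for t in range(1, len(y))]
rates = [100 * k for k in k_chain]
# Accelerations: absolute growths of the absolute growths.
accel = [abs_growth[t] - abs_growth[t - 1] for t in range(1, len(abs_growth))]
# Absolute value of a 1% increase: one hundredth of the previous level.
one_pct = [0.01 * y[t - 1] for t in range(1, len(y))]

print(abs_growth)        # [5, 12, -7, 10, 20, 20, 30]
```

Note that dividing a chain absolute growth by the corresponding rate of increment reproduces the 1% value, confirming the identity A_t = 0.01·y_{t−1}.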

2. Calculation of average indicators of a dynamic series. The average level of the series, the average absolute growth, the average growth coefficient, and the average growth rate are calculated in order to generalize information and to be able to compare the levels, and the indicators of their change, across different series.

1. Average level of the series: (a) for interval series it is calculated by the simple arithmetic mean, ȳ = Σy_t / n, where n is the number of levels in the series; (b) for moment series, the average level is calculated by a special formula called the chronological mean: ȳ = (y_1/2 + y_2 + … + y_{n−1} + y_n/2) / (n − 1).

2. The average absolute growth is calculated from the chain absolute increments by the simple arithmetic mean: Δ̄ = (y_n − y_1) / (n − 1).

3. The average growth coefficient is calculated from the chain growth coefficients using the geometric mean formula: K̄ = (y_n / y_1)^(1/(n−1)).

4. Average growth rate: T̄ = K̄ · 100%.

5. Average rate of increment: T̄_incr = T̄ − 100%.

When commenting on the average indicators of a dynamic series, two points must be indicated: the period that the analyzed indicator characterizes and the time interval for which the series is built.
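These averages, sketched for the same eight-level series (treated as an interval series for the simple mean; the chronological mean is shown for comparison, as it would apply to a moment series):

```python
y = [1750, 1755, 1767, 1760, 1770, 1790, 1810, 1840]
n = len(y)

mean_level = sum(y) / n                                         # interval series
chrono_mean = (y[0] / 2 + sum(y[1:-1]) + y[-1] / 2) / (n - 1)   # moment series
avg_abs_growth = (y[-1] - y[0]) / (n - 1)         # mean of chain increments
avg_growth_coef = (y[-1] / y[0]) ** (1 / (n - 1)) # geometric mean of chain K
avg_growth_rate = 100 * avg_growth_coef           # in percent

print(mean_level, round(avg_abs_growth, 2))   # 1780.25 12.86
```

The geometric mean of the chain coefficients telescopes to (y_n/y_1)^(1/(n−1)), which is why only the first and last levels appear in the formula.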