Matrix of pairwise correlation coefficients. The pairwise correlation coefficient in Excel

Initially, all the principal components are included in the model (the calculated values of the t-statistics are indicated in brackets):

The quality of the model is characterized by: the multiple coefficient of determination R² = 0.517, the average relative approximation error of 10.4%, the residual variance s² = 1.79, and F_obs = 121. Since F_obs > F_cr = 2.85 at α = 0.05, ν1 = 6, ν2 = 14, the regression equation is significant, and at least one of the regression coefficients β1, β2, β3, β4 is not equal to zero.

If the significance of the regression equation (the hypothesis H0: β1 = β2 = β3 = β4 = 0) was checked at α = 0.05, then the significance of the individual regression coefficients, i.e. the hypotheses H0: βj = 0 (j = 1, 2, 3, 4), should be checked at a higher significance level, for example α = 0.1. For α = 0.1 and ν = 14 the critical value is t_cr = 1.76, and, as follows from equation (53.41), the coefficients β1, β2, β3 are significant.
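These critical values can be reproduced programmatically; a minimal sketch with scipy.stats, using the degrees of freedom quoted above:

```python
from scipy import stats

# F critical value at alpha = 0.05 with nu1 = 6 and nu2 = 14 degrees of freedom
f_cr = stats.f.ppf(1 - 0.05, dfn=6, dfd=14)    # ~2.85

# two-sided t critical value at alpha = 0.1 with nu = 14 degrees of freedom
t_cr = stats.t.ppf(1 - 0.1 / 2, df=14)         # ~1.76

print(f"F_cr = {f_cr:.2f}, t_cr = {t_cr:.2f}")
```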

Since the principal components are not correlated with each other, we can exclude all the insignificant coefficients from the equation at once, and the equation takes the form

(53.42)

Comparing equations (53.41) and (53.42), we see that excluding the insignificant principal components f4 and f5 did not affect the values of the coefficients b0 = 9.52, b1 = 0.93, b2 = 0.66, or the corresponding t_j (j = 0, 1, 2, 3).

This is a consequence of the principal components being uncorrelated. The parallel between the regression equations for the initial indicators, (53.22) and (53.23), and for the principal components, (53.41) and (53.42), is interesting here.

Equation (53.42) is significant, because F_obs = 194 > F_cr = 3.01, found at α = 0.05, ν1 = 4, ν2 = 16. The coefficients of the equation are also significant, since t_j > t_cr = 1.746, corresponding to α = 0.1, ν = 16, for j = 0, 1, 2, 3. The coefficient of determination R² = 0.486 indicates that 48.6% of the variation in y is due to the influence of the first three principal components.

Equation (53.42) is characterized by an average relative approximation error of 9.99% and residual variance s² = 1.91.

The regression equation on the principal components (53.42) has slightly better approximating properties than the regression model (53.23) on the initial indicators: R²(f) = 0.486 > R²(x) = 0.469; Ē(f) = 9.99% < Ē(x) = 10.5%; and s²(f) = 1.91 < s²(x) = 1.97. In addition, in equation (53.42) the principal components are linear functions of all the input indicators, while equation (53.23) includes only two variables (x1 and x4). In some cases one must take into account that model (53.42) is difficult to interpret, since it includes the third principal component f3, which we have not interpreted and whose contribution to the total variance of the initial indicators (x1, ..., x5) is only 8.6%. However, excluding f3 from equation (53.42) significantly worsens the approximating properties of the model: R² = 0.349, Ē = 12.4% and s²(f) = 2.41. In that case it is advisable to choose equation (53.23) as the regression model of productivity.
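For illustration, a regression on principal components of the kind compared above can be sketched in Python. The data below are synthetic stand-ins, not the productivity data of this example; only the workflow (standardize the indicators, extract uncorrelated components, regress on them) is the point:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the data: 21 objects, 5 initial indicators
rng = np.random.default_rng(0)
X = rng.normal(size=(21, 5))
y = 9.5 + X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=21)

# Principal components of the standardized indicators
f = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(X))

# Because the components are mutually uncorrelated, dropping an
# insignificant component leaves the remaining coefficients unchanged
model = LinearRegression().fit(f, y)
print("b0 =", model.intercept_, "b1..b3 =", model.coef_)
```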

Cluster Analysis

In statistical research, the grouping of primary data is the main means of solving classification problems, and therefore the foundation of all further work with the collected information.

Traditionally, this problem is solved as follows. From the set of features describing the object, one feature is selected that the researcher considers the most informative, and the data are grouped according to the values of this feature. If classification according to several features, ranked in order of importance, is required, the classification is first carried out according to the first feature, then each of the resulting classes is divided into subclasses according to the second feature, and so on. Most combinational statistical groupings are built in a similar way.

In cases where it is not possible to rank the classification features, the simplest method of multidimensional grouping is used: the creation of an integral indicator (index) that is functionally dependent on the original features, followed by classification according to this indicator.

The development of this approach is a variant of classification according to several generalizing indicators (principal components) obtained using the methods of factor or component analysis.

If there are several features (initial or generalized), the classification problem can be solved by the methods of cluster analysis, which differ from other multivariate classification methods in the absence of training samples, i.e. of a priori information about the distribution of the general population.

The differences between the schemes for solving the problem of classification are largely determined by what is meant by the concepts of "similarity" and "degree of similarity".

After the goal of the work has been formulated, it is natural to try to define quality criteria: an objective function whose values allow various classification schemes to be compared.

In economic studies, the objective function, as a rule, should minimize some parameter defined on a set of objects (for example, the purpose of classifying equipment may be a grouping that minimizes the total cost of time and money for repair work).

In cases where it is not possible to formalize the goal of the problem, the criterion for the quality of classification can be the possibility of a meaningful interpretation of the groups found.

Consider the following problem. Let a collection of n objects be given, each of which is characterized by k measured features. It is required to partition this collection into groups (classes) that are homogeneous in a certain sense. At the same time, there is practically no a priori information about the nature of the distribution of the k-dimensional vector X within the classes.

The groups obtained as a result of the partitioning are usually called clusters* (taxons**, images); the methods for finding them are called cluster analysis (respectively, numerical taxonomy or pattern recognition with self-learning).

* Cluster (English): a group of elements characterized by some common property.

** Taxon (English): a systematized group of any category.

It is necessary from the very beginning to understand clearly which of the two classification problems is to be solved. If the usual typing problem is being solved, the set of observations is divided into a relatively small number of grouping areas (for example, an interval variation series in the case of one-dimensional observations) so that the elements of each area are as close to each other as possible.

The other problem is to determine the natural stratification of the observations into well-defined clusters lying at some distance from each other.

While the first (typing) problem always has a solution, in the second case it may turn out that the set of observations shows no natural stratification into clusters, i.e. it forms a single cluster.
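Both formulations can be illustrated with a standard clustering routine. A minimal sketch with k-means on hypothetical data; note that k-means always returns a partition (the typing problem), whether or not a natural stratification into clusters actually exists:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical collection of objects described by k = 3 measured features:
# two well-separated clouds of 50 objects each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])

# Typing problem: force a partition into a chosen number of groups
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))   # sizes of the two groups
```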

Although many methods of cluster analysis are quite elementary, most of the works in which they were proposed date from the last decade. This is explained by the fact that the effective solution of cluster-search problems, which requires a large number of arithmetic and logical operations, became possible only with the advent and development of computer technology.

The usual form of representing the initial data in cluster analysis problems is the matrix

X = (x_i^(j)), i = 1, ..., n; j = 1, ..., k,

each row of which contains the results of measuring the k features considered on one of the examined objects. In specific situations, both the grouping of objects and the grouping of features may be of interest. In cases where the difference between these two problems is not significant, for example when describing some algorithms, we will use only the term "object", subsuming "feature" under this concept.

The matrix X is not the only way to represent data in cluster analysis problems. Sometimes the initial information is given in the form of a square matrix R = (r_ij), whose element r_ij determines the degree of closeness of the i-th object to the j-th.

Most cluster analysis algorithms proceed entirely from the distance (or proximity) matrix, or require the calculation of its individual elements; therefore, if the data are presented in the form of the matrix X, the first step in solving the problem of finding clusters is the choice of a method for calculating the distances, or proximity, between objects or features.

The question of determining the proximity between features is somewhat easier to solve. As a rule, cluster analysis of features pursues the same goals as factor analysis: the selection of groups of interconnected features that reflect a certain aspect of the objects under study. In this case, various statistical measures of association serve as the measure of closeness.
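A sketch of both representations in Python on hypothetical data: the distance matrix between objects (rows of X) and a correlation-based proximity matrix between features (columns of X):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 4))            # 10 objects, 4 features

# Square matrix of pairwise distances between objects
D = squareform(pdist(X, metric="euclidean"))

# Proximity between features: correlation matrix of the columns of X
R = np.corrcoef(X, rowvar=False)
print(D.shape, R.shape)                 # (10, 10) (4, 4)
```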




To determine the degree of dependence between several indicators, pairwise correlation coefficients are used. They are summarized in a separate table called the correlation matrix. The names of the rows and columns of such a matrix are the names of the parameters whose mutual dependence is being established, and the corresponding correlation coefficients are located at the intersections of the rows and columns. Let's find out how such a calculation can be performed using Excel tools.

It is customary to characterize the level of relationship between indicators as follows, depending on the correlation coefficient:

  • 0 - 0.3 - no connection;
  • 0.3 - 0.5 - weak connection;
  • 0.5 - 0.7 - average connection;
  • 0.7 - 0.9 - high;
  • 0.9 - 1 - very strong.

If the correlation coefficient is negative, the relationship between the parameters is inverse.

To compile a correlation matrix in Excel, a single tool from the "Data Analysis" package is used, called "Correlation". Let's see how it can be used to calculate pairwise correlation coefficients.

Stage 1: activating the analysis package

It must be said right away that the "Data Analysis" package is disabled by default. Therefore, before proceeding to the calculation of the correlation coefficients, it must be activated. Unfortunately, not every user knows how to do this, so we will dwell on this issue.


In current Excel versions, the package is enabled via File → Options → Add-ins → Manage: Excel Add-ins → Go, and checking the "Analysis ToolPak" box. After this action, the "Data Analysis" tool package will be activated.

Stage 2: coefficient calculation

Now we can proceed directly to the calculation. Using the example table of indicators of labor productivity, capital-labor ratio, and power-to-weight ratio at various enterprises, let us calculate the correlation between these factors, as shown below.


Stage 3: analysis of the result

Now let's figure out how to interpret the result obtained from processing the data with the "Correlation" tool in Excel.

As we can see from the table, the correlation coefficient between the capital-labor ratio (column 2) and the power-to-weight ratio (column 1) is 0.92, which corresponds to a very strong relationship. Between labor productivity (column 3) and the power-to-weight ratio (column 1) this indicator is 0.72, which is a high degree of dependence. The correlation coefficient between labor productivity (column 3) and the capital-labor ratio (column 2) is 0.88, which also corresponds to a high degree of dependence. Thus, we can say that the relationship between all the studied factors is quite strong.

As you can see, the "Data Analysis" package in Excel is a very convenient and fairly easy-to-use tool for building a correlation matrix. It can also be used to calculate the correlation between just two factors.
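Outside Excel, the same pairwise correlation matrix can be obtained, for example, with pandas; the column names and numbers below are illustrative, not the article's data:

```python
import pandas as pd

# Hypothetical enterprise data with the three indicators from the example
df = pd.DataFrame({
    "power_to_weight": [1.2, 1.5, 1.9, 2.2, 2.8, 3.1],
    "capital_labor":   [2.1, 2.4, 3.0, 3.3, 4.1, 4.4],
    "productivity":    [10.0, 11.2, 13.1, 13.9, 16.0, 17.2],
})

# Pearson pairwise correlation matrix, analogous to the Excel output
print(df.corr().round(2))
```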

Multiple regression is not the result of a transformation of the equation:

- [equation not reproduced];

- [equation not reproduced].

Linearization implies a procedure...

- reduction of the multiple regression equation to a pair regression equation;

+ reduction of a nonlinear equation to a linear form;

- reduction of a linear equation to a non-linear form;

- reduction of a nonlinear equation with respect to parameters to an equation that is linear with respect to the result.


In a standardized multiple regression equation, the variables are:

Initial variables;

Standardized parameters;

Mean values ​​of initial variables;

standardized variables.

One method for assigning numeric values ​​to dummy variables is. . .

+ ranking;

Arranging numerical values in ascending order;

Arranging numerical values in descending order;

Finding the mean.

The matrix of paired correlation coefficients displays the values of the paired linear correlation coefficients between . . . .

Variables;

parameters;

Parameters and variables;

Variable and random factors.

The method for estimating the parameters of models with heteroscedastic residuals is called the ____________ least squares method:

Ordinary;

Indirect;

generalized;

Minimum.

The regression equation is given. Define the model specification.

Polynomial Pair Regression Equation;

Linear simple regression equation;

Polynomial equation of multiple regression;

Linear multiple regression equation.

In a standardized equation, the free term is ….

Equals 1;

Equal to the coefficient of multiple determination;

Equal to the multiple correlation coefficient;

Absent.

Factors included in the multiple regression model as dummy variables are those ...

Having probabilistic values;

Having quantitative values;

Not having qualitative values;

Not having quantitative values.

The factors of the econometric model are collinear if the coefficient ...

Correlation between them exceeds 0.7 in absolute value;

The determinations between them are greater than 0.7 in absolute value;

The determinations between them are less than 0.7 in absolute value;

The generalized least squares method differs from the usual least squares method in that, when using GLS ...

The original levels of the variables are converted;

The residuals do not change;

The residuals are equal to zero;

The number of observations decreases.

The sample size is determined ...

The number of values of the variables selected in the sample;

The volume of the general population;

The number of parameters for independent variables;

The number of result variables.

11. Multiple regression is not the result of a transformation of the equation:

+ [equation not reproduced];

- [equation not reproduced];

- [equation not reproduced].

The initial values of the dummy variables are assumed to be ...

qualitative;

Quantitatively measurable;

The same;

Values.

The generalized least squares method implies ...

Transformation of the variables;

Transition from multiple regression to pair regression;

Linearization of the regression equation;

Two-stage application of the least squares method.

The linear multiple regression equation has the form ... . Determine which factor, ... or ..., has a stronger effect on ... :

+ ..., since 3.7 > 2.5;

They have the same effect;

- ..., since 2.5 > -3.7;

According to this equation, it is impossible to answer the question posed, since the regression coefficients are incomparable among themselves.

The inclusion of a factor in the model is advisable if the regression coefficient for this factor is ...

Zero;

insignificant;

significant;

Insignificant.

What is transformed when applying the generalized least squares method?

Standardized regression coefficients;

Variance of the resulting variable;

Initial levels of variables;

Variance of the factor variable.

The dependence of the output of an enterprise's employees on a number of factors is being studied. An example of a dummy variable in this model would be the ______ of an employee.

Age;

The level of education;

Wage.

The transition from point estimation to interval estimation is possible if the estimates are:

Efficient and inconsistent;

Inefficient and consistent;

Efficient and unbiased;

Consistent and biased.

A matrix of pairwise correlation coefficients is built to identify collinear and multicollinear …

parameters;

Random factors;

significant factors;

results.

Based on the transformation of variables using the generalized least squares method, we obtain a new regression equation, which is:

Weighted regression in which the variables are taken with weights [not reproduced];

[option not reproduced];

Nonlinear regression in which the variables are taken with weights [not reproduced];

Weighted regression in which the variables are taken with weights [not reproduced].

If the calculated value of the Fisher criterion is less than the table value, then the hypothesis of the statistical insignificance of the equation is ...

Rejected;

insignificant;

accepted;

Not essential.

If the factors are included in the model as a product, then the model is called:

total;

derivative;

Additive;

Multiplicative.

The regression equation that relates the resulting feature to one of the factors with the value of other variables fixed at the average level is called:

Multiple;

essential;

Partial;

Insignificant.

According to the number of factors included in the regression equation, one distinguishes ...

Linear and non-linear regression;

Direct and indirect regression;

Simple and multiple regression;

Multiple and multivariate regression.

The requirement for regression equations, the parameters of which can be found using the least squares method, is:

Equality to zero of the values of the factor variable;

Non-linearity of parameters;

Equality to zero of the average values ​​of the resulting variable;

Linearity of parameters.

The least squares method is not applicable for ...

Linear equations of pair regression;

Polynomial multiple regression equations;

Equations that are non-linear in terms of the estimated parameters;

Linear equations of multiple regression.

When dummy variables are included in the model, they are assigned ...

Null values;

Numeric labels;

Same values;

Quality labels.

If there is a non-linear relationship between economic indicators, then ...

It is not practical to use the specification of a non-linear regression equation;

It is advisable to use the specification of a non-linear regression equation;

It is advisable to use the specification of a linear paired regression equation;

It is necessary to include other factors in the model and use a linear multiple regression equation.

The result of the linearization of polynomial equations is ...

Nonlinear Pair Regression Equations;

Linear equations of pair regression;

Nonlinear multiple regression equations;

Linear equations of multiple regression.

In the standardized multiple regression equation the coefficients are 0.3 and -2.1. Determine which factor, ... or ..., has a stronger effect on ... :

+ ..., since 2.1 > 0.3;

According to this equation, it is impossible to answer the question posed, since the values ​​of the “pure” regression coefficients are unknown;

- ..., since 0.3 > -2.1;

According to this equation, it is impossible to answer the question posed, since the standardized coefficients are not comparable with each other.

Factor variables of a multiple regression equation that are converted from a qualitative to a quantitative form are called ...

anomalous;

Multiple;

Paired;

Fictitious.

Estimates of the parameters of the linear equation of multiple regression can be found using the method:

Medium squares;

The largest squares;

Normal squares;

Least squares.

The main requirement for the factors included in the multiple regression model is:

Lack of relationship between result and factor;

Lack of relationship between factors;

Lack of linear relationship between factors;

The presence of a close relationship between factors.

Dummy variables are included in the multiple regression equation to take into account the effect of features on the result ...

qualitative character;

quantitative nature;

of a non-essential nature;

Random character.

From a pair of collinear factors, the econometric model includes the factor

Which, with a fairly close connection with the result, has the greatest connection with other factors;

Which, in the absence of connection with the result, has the maximum connection with other factors;

Which, in the absence of a connection with the result, has the least connection with other factors;

Which, with a fairly close relationship with the result, has a smaller relationship with other factors.

Heteroskedasticity refers to...

The constancy of the variance of the residuals, regardless of the value of the factor;

Dependence of the mathematical expectation of the residuals on the value of the factor;

Dependence of the variance of residuals on the value of the factor;

Independence of the mathematical expectation of the residuals from the value of the factor.

The value of the residual variance when a significant factor is included in the model:

Will not change;

will increase;

will be zero;

Will decrease.

If the specification of the model reflects a nonlinear form of dependence between economic indicators, then one uses a nonlinear equation of ...

regressions;

determinations;

Correlations;

Approximations.

A dependence characterized by a linear multiple regression equation is investigated. For the equation, the closeness of the relationship between the resulting variable and the set of factors was calculated. The indicator used was the multiple coefficient of ...

Correlations;

elasticity;

regressions;

Determinations.

A model of the dependence of demand on a number of factors is being built. A dummy variable in this multiple regression equation is not the _________ of the consumer.

Family status;

The level of education;

For a significant parameter, the calculated value of the Student's criterion is ...

More than the table value of the criterion;

Equal to zero;

Not more than the tabular value of the Student's criterion;

Less than the table value of the criterion.

The system of least-squares normal equations built to estimate the parameters of a linear multiple regression equation can be solved by ...

Moving average method;

The method of determinants;

Method of first differences;

Simplex method.

An indicator characterizing by how many sigmas the result will change, on average, when the corresponding factor changes by one sigma while the other factors remain unchanged, is called a ____________ regression coefficient:

standardized;

Normalized;

Aligned;

Centered.

The multicollinearity of the factors of the econometric model implies…

The presence of a non-linear relationship between the two factors;

The presence of a linear relationship between more than two factors;

Lack of dependence between factors;

The presence of a linear relationship between the two factors.

Generalized least squares is not used for models with _______ residuals.

Autocorrelated and heteroscedastic;

homoscedastic;

heteroskedastic;

Autocorrelated.

The method for assigning numeric values ​​to dummy variables is not:

Ranking;

Assignment of digital labels;

Finding the average value;

Assignment of quantitative values.

Normally distributed residuals;

Homoscedastic residuals;

Autocorrelated residuals;

Autocorrelation of the resulting variable.

The selection of factors in a multiple regression model using the inclusion method is based on a comparison of values ​​...

total variance before and after including the factor in the model;

Residual variance before and after including random factors in the model;

Variances before and after inclusion of the result in the model;

Residual variance before and after inclusion of the factor in the model.

The generalized least squares method is used to correct...

Parameters of the nonlinear regression equation;

The accuracy of determining the coefficient of multiple correlation;

Autocorrelations between independent variables;

Heteroskedasticity of residuals in the regression equation.

After applying the generalized least squares method, it is possible to avoid _________ residuals

heteroskedasticity;

Normal distribution;

Equal to zero sums;

Random character.

Dummy variables are included in the ____________regression equations

Random;

pair;

Indirect;

Multiple.

The interaction of the factors of the econometric model means that…

The influence of a factor on the resulting feature depends on the values of another, non-collinear factor;

The influence of factors on the resulting attribute increases, starting from a certain level of factor values;

Factors duplicate each other's influence on the result;

The influence of one of the factors on the resulting attribute does not depend on the values ​​of the other factor.

Topic Multiple Regression (Problems)

The regression equation, built on 15 observations, has the form:

The missing values, as well as the confidence interval for ... with a probability of 0.99, are:

The regression equation, built on 20 observations, has the form:

The missing values, as well as the confidence interval for ... with a probability of 0.9, are:

The regression equation, built on 16 observations, has the form:

The missing values, as well as the confidence interval for ... with a probability of 0.99, are:

The regression equation in a standardized form is:

The partial elasticity coefficients are equal to:

The standardized regression equation is:

The partial elasticity coefficients are equal to:

The standardized regression equation is:

The partial elasticity coefficients are equal to:

The standardized regression equation is:

The partial elasticity coefficients are equal to:

The standardized regression equation is:

The partial elasticity coefficients are equal to:

Based on 18 observations, the following data were obtained:

;
;
;
;

The values of the adjusted coefficient of determination, partial coefficients of elasticity, and the parameter are equal:

Based on 17 observations, the following data were obtained:

;
;
;
;

Values ​​of the adjusted coefficient of determination, partial coefficients of elasticity and parameter are equal:

Based on 22 observations, the following data were obtained:

;
;
;
;

Values ​​of the adjusted coefficient of determination, partial coefficients of elasticity and parameter are equal:

Based on 25 observations, the following data were obtained:

;
;
;
;

Values ​​of the adjusted coefficient of determination, partial coefficients of elasticity and parameter are equal:

Based on 24 observations, the following data were obtained:

;
;
;
;

Values ​​of the adjusted coefficient of determination, partial coefficients of elasticity and parameter are equal:

Based on 28 observations, the following data were obtained:

;
;
;
;

Values ​​of the adjusted coefficient of determination, partial coefficients of elasticity and parameter are equal:

Based on 26 observations, the following data were obtained:

;
;
;
;

Values ​​of the adjusted coefficient of determination, partial coefficients of elasticity and parameter are equal:

In the regression equation:

Restore the missing characteristics; construct a confidence interval for ... with a probability of 0.95, if n = 12.

OPTION 5

The dependence of average life expectancy on several factors is studied using data for 1995, presented in Table 5.

Table 5

Mozambique

……………………………………………………………………………………..

Switzerland

Designations adopted in the table:

· Y-- average life expectancy at birth, years;

· X 1 -- GDP in purchasing power parities;

· X 2 -- chain population growth rate, %;

· X 3 -- chain labor force growth rate, %;

· X 4 -- infant mortality rate, %.

Required:

1. Make a matrix of paired correlation coefficients between all the variables under study and identify collinear factors.

2. Construct a regression equation that does not contain collinear factors. Check the statistical significance of the equation and its coefficients.

3. Build a regression equation containing only statistically significant and informative factors. Check the statistical significance of the equation and its coefficients.

Items 4 - 6 refer to the regression equation built in item 3.

4. Assess the quality and accuracy of the regression equation.

5. Give an economic interpretation of the coefficients of the regression equation and a comparative assessment of the strength of the influence of factors on the resulting variable Y.

6. Calculate the predicted value of the resulting variable Y if the predicted values ​​of the factors amount to 75% of their maximum values. Plot the confidence interval of the prediction of the actual value Y with 80% reliability.

Solution. An Excel spreadsheet is used to solve the problem.

1. Using the add-in "Data Analysis… Correlation", we build the matrix of paired correlation coefficients between all the variables under study (menu "Tools" → "Data Analysis…" → "Correlation"). Fig. 1 shows the correlation analysis panel with the fields filled in. (To copy a window snapshot to the Windows clipboard, the key combination Alt + Print Screen is used; on some keyboards, Alt + PrtSc.) The results of the correlation analysis are given in Appendix 2 and transferred to Table 1.

Fig. 1. Correlation analysis panel

Table 1

Matrix of pairwise correlation coefficients

Analysis of the interfactor correlation coefficients shows that the correlation coefficient between the pair of factors X2 and X3 exceeds 0.8 in absolute value (highlighted in bold). The factors X2 and X3 are thus recognized as collinear.

2. As shown in item 1, the factors X2 and X3 are collinear, which means that they effectively duplicate each other, and including both in the model would lead to an incorrect interpretation of the corresponding regression coefficients. The factor X2 has a higher absolute correlation coefficient with the result Y than the factor X3: r(y,x2) = 0.72516; r(y,x3) = 0.53397; |r(y,x2)| > |r(y,x3)| (see Table 1). This indicates a stronger influence of X2 on the change in Y. The factor X3 is therefore excluded from consideration.

To construct the regression equation, the values of the variables used (Y, X1, X2, X4) are copied to a blank worksheet (Appendix 3). We build the regression equation using the add-in "Data Analysis… Regression" (menu "Tools" → "Data Analysis…" → "Regression"). The regression analysis panel with the fields filled in is shown in Fig. 2.

The results of the regression analysis are given in Appendix 4 and transferred to Table 2. The regression equation has the form (see "Coefficients" in Table 2):

y = 75.44 + 0.0447·x1 - 0.0453·x2 - 0.24·x4

The regression equation is recognized as statistically significant, since the probability of its random formation in the form in which it was obtained is 1.04571×10^-45 (see "Significance F" in Table 2), which is far below the accepted significance level α = 0.05.

The probability of random formation of the coefficient at the factor X1 is below the accepted significance level α = 0.05 (see "P-Value" in Table 2), which indicates the statistical significance of this coefficient and the significant impact of the factor on the change in life expectancy Y.

The probability of random formation of the coefficients at the factors X2 and X4 exceeds the accepted significance level α = 0.05 (see "P-Value" in Table 2), so these coefficients are not considered statistically significant.
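The estimation and significance checks performed here with the Excel "Regression" tool can be reproduced programmatically. A sketch with statsmodels on synthetic stand-in data (the column names Y, X1, X2, X4 follow the text; the generated numbers are not the 1995 country data):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical stand-in for the country data
rng = np.random.default_rng(3)
df = pd.DataFrame({"X1": rng.uniform(1, 25, 40),
                   "X2": rng.uniform(-1, 3, 40),
                   "X4": rng.uniform(5, 150, 40)})
df["Y"] = 75 + 0.045 * df["X1"] - 0.24 * df["X4"] + rng.normal(0, 2, 40)

fit = sm.OLS(df["Y"], sm.add_constant(df[["X1", "X2", "X4"]])).fit()
print(fit.params)      # coefficients ("Coefficients" in the Excel output)
print(fit.f_pvalue)    # probability of random formation ("Significance F")
print(fit.pvalues)     # per-coefficient P-values; compare with alpha = 0.05
```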

Fig. 2. Regression analysis panel for the model Y(X1, X2, X4)

Table 2. Regression analysis results for the model Y(X1, X2, X4): analysis of variance (including "Significance F") and the regression coefficients with their standard errors, t-statistics, P-values, and lower/upper 95% confidence bounds for the Y-intercept and the factors (table contents not reproduced).

3. Based on the results of checking the statistical significance of the coefficients of the regression equation, carried out in the previous item, we build a new regression model containing only informative factors, which include:

factors whose coefficients are statistically significant;

factors whose coefficients' t-statistics exceed one in absolute value (in other words, the absolute value of the coefficient is greater than its standard error).

The first group includes the factor X1, the second the factor X4. The factor X2 is excluded from consideration as uninformative, and the final regression model will contain the factors X1 and X4.

To build the regression equation, we copy the values of the variables used to a blank worksheet (Appendix 5) and perform regression analysis (Fig. 3). Its results are given in Appendix 6 and transferred to Table 3. The regression equation is (see "Coefficients" in Table 3):

y = 75.38278 + 0.044918·x1 - 0.24031·x4

Fig. 3. Regression analysis panel for the model Y(X1, X4)

Table 3. Regression analysis results for the model Y(X1, X4): regression statistics (Multiple R, R-square, adjusted R-square, standard error, number of observations), analysis of variance (including "Significance F"), and the regression coefficients with their standard errors, t-statistics, and P-values for the Y-intercept and the factors (table contents not reproduced).

The regression equation is statistically significant: the probability of its random formation is below the accepted significance level α = 0.05 (see "Significance F" in Table 3).

The coefficient at the factor X1 is also recognized as statistically significant: the probability of its random formation is below the accepted significance level α = 0.05 (see "P-Value" in Table 3). This indicates a significant impact of GDP in purchasing power parities (X1) on the change in life expectancy Y.

The coefficient at the factor X4 (infant mortality rate) is not statistically significant. However, this factor can still be considered informative, since the t-statistic of its coefficient exceeds one in absolute value, although further conclusions regarding the factor X4 should be treated with some caution.

4. Let us evaluate the quality and accuracy of the last regression equation using some statistical characteristics obtained in the regression analysis (see "Regression statistics" in Table 3):

the multiple coefficient of determination

R² = Σ(ŷ_i - ȳ)² / Σ(y_i - ȳ)² = 0.946576

shows that the regression model explains 94.7% of the variation in average life expectancy at birth Y, this variation being due to changes in the factors included in the regression model, X1 and X4;

the regression standard error S = 2.252208

shows that the values of average life expectancy at birth Y predicted by the regression equation differ from the actual values by 2.252208 years on average.

The average relative approximation error is determined by the approximate formula:

E_rel ≈ 0.8 · (S / ȳ) · 100% = 0.8 · 2.252208 / 66.9 · 100% ≈ 2.7%,

where ȳ = 66.9 years is the average life expectancy (determined using the built-in function "AVERAGE"; Appendix 1).

E_rel shows that the values of average life expectancy Y predicted by the regression equation differ from the actual values by 2.7% on average. The accuracy of the model is high (the accuracy is considered high when E_rel is below 10%, good at 10-20%, satisfactory at 20-50%, and unsatisfactory above 50%).
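The quality measures used in this item can be computed directly from the actual and fitted values; a minimal sketch (the 0.8 factor in E_rel follows the approximate formula quoted above):

```python
import numpy as np

def model_quality(y, y_hat, p):
    """R^2, residual standard error, and the approximate E_rel (in %)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    n = len(y)
    r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    s = np.sqrt(np.sum((y - y_hat) ** 2) / (n - p - 1))  # p factors + intercept
    e_rel = 0.8 * s / y.mean() * 100
    return r2, s, e_rel
```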

5. For the economic interpretation of the coefficients of the regression equation, we tabulate the average values and standard deviations of the variables in the initial data (Table 4). The average values were determined using the built-in function "AVERAGE", the standard deviations using the built-in function "STDEV" (see Appendix 1).

y x (1) x (2) x (3) x (4) x (5)
y 1.00 0.43 0.37 0.40 0.58 0.33
x (1) 0.43 1.00 0.85 0.98 0.11 0.34
x (2) 0.37 0.85 1.00 0.88 0.03 0.46
x (3) 0.40 0.98 0.88 1.00 0.03 0.28
x (4) 0.58 0.11 0.03 0.03 1.00 0.57
x (5) 0.33 0.34 0.46 0.28 0.57 1.00

An analysis of the matrix of paired correlation coefficients shows that the performance indicator is most closely related to the indicator x(4), the amount of fertilizers used per 1 ha (r(y, x(4)) = 0.58).

At the same time, the relationship between the feature-arguments is quite close. Thus, there is a practically functional connection between the number of wheeled tractors (x(1)) and the number of surface tillage implements (x(3)): r(x(1), x(3)) = 0.98.

The presence of multicollinearity is also evidenced by the correlation coefficients r(x(1), x(2)) = 0.85 and r(x(2), x(3)) = 0.88. Given the close relationship of the indicators x(1), x(2) and x(3), only one of them can enter the yield regression model.

To demonstrate the negative impact of multicollinearity, consider a yield regression model that includes all the input indicators:

F_obs = 121.

In parentheses are the values of the corrected estimates of the standard deviations of the estimates of the coefficients of the equation.

Under the regression equation, the following adequacy parameters are presented: the multiple coefficient of determination; the corrected estimate of the residual variance; the average relative approximation error; and the calculated value of the F-criterion, F_obs = 121.

The regression equation is significant, because F_obs = 121 > F_cr = 2.85, found from the F-distribution table at α = 0.05, ν1 = 6 and ν2 = 14.

It follows that θ ≠ 0, i.e. at least one of the coefficients of the equation θ_j (j = 0, 1, 2, ..., 5) is not equal to zero.

To test the hypothesis of the significance of the individual regression coefficients, H0: θ_j = 0, where j = 1, 2, 3, 4, 5, we compare the critical value t_cr = 2.14, found from the t-distribution table at significance level α = 0.05 and ν = 14 degrees of freedom, with the calculated values. It follows that only the regression coefficient at x(4) is statistically significant, since |t4| = 2.90 > t_cr = 2.14.



The negative signs of the regression coefficients at x(1) and x(5) defy economic interpretation: they would imply that increasing the saturation of agriculture with wheeled tractors (x(1)) and plant protection products (x(5)) lowers the yield. Thus, the resulting regression equation is unacceptable.

To obtain a regression equation with significant coefficients, we use a stepwise regression analysis algorithm. First, we apply the stepwise algorithm with elimination of variables.

We exclude from the model the variable x(1), which corresponds to the minimum absolute value |t1| = 0.01. For the remaining variables, we again construct the regression equation:

The resulting equation is significant, because F_obs = 155 > F_cr = 2.90, found at significance level α = 0.05 and degrees of freedom ν1 = 5 and ν2 = 15 from the F-distribution table; i.e. the vector θ ≠ 0. However, only the regression coefficient at x(4) is significant in the equation. The calculated values |t_j| for the other coefficients are less than t_cr = 2.131, found from the t-distribution table at α = 0.05 and ν = 15.

We exclude from the model the variable x(3), which corresponds to the minimum value t3 = 0.35, and obtain the regression equation:

(2.9)

In the resulting equation, the coefficient at x(5) is not statistically significant and cannot be interpreted economically. Excluding x(5), we obtain the regression equation:

(2.10)

We have obtained a significant regression equation with significant and interpretable coefficients.

However, the resulting equation is not the only “good” or “best” yield model in our example.

Let us show that under multicollinearity the stepwise algorithm with inclusion of variables is more efficient. At the first step, the variable x(4) is included in the yield model y, since it has the highest correlation coefficient with the explained variable: r(y, x(4)) = 0.58. At the second step, including in the equation, along with x(4), the variable x(1) or x(3), we obtain models that, for economic reasons and by statistical characteristics, surpass (2.10):

(2.11)

(2.12)

The inclusion of any of the three remaining variables in the equation worsens its properties; see, for example, equation (2.9).

Thus, we have three "good" yield models, from which one must be chosen on economic and statistical grounds.
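The stepwise inclusion algorithm used above can be sketched as a greedy loop; this is an illustrative outline (textbook procedures may also track the adjusted coefficient of determination or the F-criterion at each step):

```python
import statsmodels.api as sm

def forward_select(y, X, alpha=0.05):
    """Greedy inclusion: repeatedly add the factor with the smallest P-value."""
    selected, remaining = [], list(X.columns)
    while remaining:
        pvals = {}
        for c in remaining:
            fit = sm.OLS(y, sm.add_constant(X[selected + [c]])).fit()
            pvals[c] = fit.pvalues[c]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:      # no remaining factor is significant
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```

Here y would be a pandas Series with the yield and X a DataFrame whose columns stand for x(1), ..., x(5).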

By statistical criteria, model (2.11) is the most adequate: it corresponds to the minimum values of the residual variance s² = 2.26 and the average relative approximation error, and the largest values of R² and F_obs = 273.

Model (2.12) has somewhat worse adequacy indicators, followed by model (2.10).

We will now choose the best of models (2.11) and (2.12). These models differ from each other in the variables x(1) and x(3). However, in yield models the variable x(1) (the number of wheeled tractors per 100 ha) is preferable to the variable x(3) (the number of surface tillage implements per 100 ha), which is somewhat secondary (derived from x(1)).

In this connection, for economic reasons, preference should be given to model (2.12). Thus, after implementing the stepwise regression analysis algorithm with inclusion of variables, and taking into account that only one of the three related variables (x(1), x(2) or x(3)) should enter the equation, we choose the final regression equation:

The equation is significant at α = 0.05, because F_obs = 266 > F_cr = 3.20, found from the F-distribution table at α = 0.05, ν1 = 3 and ν2 = 17. All the regression coefficients in the equation are significant: |t_j| > t_cr(α = 0.05; ν = 17) = 2.11. The regression coefficient θ1 should be recognized as significant (θ1 ≠ 0) for economic reasons, even though t1 = 2.09 is only slightly less than t_cr = 2.11.

It follows from the regression equation that an increase of one in the number of tractors per 100 ha of arable land (with the value of x(4) fixed) leads to an increase in grain yield of 0.345 c/ha on average.

An approximate calculation of the elasticity coefficients, e1 ≈ 0.068 and e2 ≈ 0.161, shows that when the indicators x(1) and x(4) increase by 1%, the grain yield increases by an average of 0.068% and 0.161%, respectively.
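These values follow from the standard formula for average elasticity coefficients (a reconstruction; the bars denote the sample means of the factor and of the yield):

```latex
e_j \approx b_j \,\frac{\bar{x}^{(j)}}{\bar{y}}, \qquad
e_1 \approx 0.068, \quad e_2 \approx 0.161 .
```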

The multiple coefficient of determination indicates that only 46.9% of the yield variation is explained by the indicators included in the model (x(1) and x(4)), that is, by the saturation of crop production with tractors and fertilizers. The rest of the variation is due to unaccounted-for factors (x(2), x(3), x(5), weather conditions, etc.). The average relative approximation error, together with the residual variance, characterizes the adequacy of the model. When interpreting the regression equation, the relative approximation errors δ_i are of interest. Recall that the model value ŷ_i of the effective indicator characterizes the average yield for the totality of the areas considered, provided that the explanatory variables x(1) and x(4) are fixed at the levels x(1) = x_i(1) and x(4) = x_i(4). Then the values δ_i allow yields to be compared: areas with δ_i > 0 have an above-average yield, and those with δ_i < 0, a below-average yield.

In our example, crop production is most efficient in the area corresponding to δ7 = 28%, where the yield is 28% higher than the average for the region, and least efficient in the area with δ20 = -27.3%.


Tasks and exercises

2.1. From the general population (y, x(1), ..., x(p)), where y has a normal distribution with conditional mathematical expectation and variance σ², a random sample of size n is taken; let (y_i, x_i(1), ..., x_i(p)) be the result of the i-th observation (i = 1, 2, ..., n). Determine: a) the mathematical expectation of the least squares estimate of the vector θ; b) the covariance matrix of the least squares estimate of the vector θ; c) the mathematical expectation of the estimate.

2.2. Under the conditions of problem 2.1, find the mathematical expectation of the sum of squared deviations due to regression, i.e. E(Q_R), where

Q_R = Σ_{i=1}^{n} (ŷ_i - ȳ)².

2.3. Under the conditions of problem 2.1, determine the mathematical expectation of the sum of squared deviations due to residual variation about the regression line, i.e. E(Q_ost), where

Q_ost = Σ_{i=1}^{n} (y_i - ŷ_i)².

2.4. Prove that under the hypothesis H0: θ = 0 the statistic

F = [Q_R / (p + 1)] / [Q_ost / (n - p - 1)]

has an F-distribution with ν1 = p + 1 and ν2 = n - p - 1 degrees of freedom.

2.5. Prove that when the hypothesis H0: θ_j = 0 holds, the statistic t = θ̂_j / s(θ̂_j) has a t-distribution with ν = n - p - 1 degrees of freedom.

2.6. Based on the data (Table 2.3) on the dependence of fodder bread shrinkage (y) on the duration of storage (x), find a point estimate of the conditional mathematical expectation under the assumption that the general regression equation is linear.

Table 2.3.

It is required: a) to find the estimates and the residual variance s² under the assumption that the general regression equation is linear; b) to check at α = 0.05 the significance of the regression equation, i.e. the hypothesis H0: θ = 0; c) with reliability γ = 0.9 to determine interval estimates of the parameters θ0, θ1; d) with reliability γ = 0.95 to determine the interval estimate of the conditional mathematical expectation at x0 = 6; e) to determine at γ = 0.95 the confidence interval of prediction at the point x = 12.
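The calculations requested in items a)-e) can be carried out with standard formulas; a compact sketch for simple linear regression (x and y stand for the data of Table 2.3, which are not reproduced here):

```python
import numpy as np
from scipy import stats

def linreg_inference(x, y, gamma=0.9, x0=6.0):
    """OLS estimates, residual variance, and a confidence interval for E(y|x0)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # slope estimate
    b0 = y.mean() - b1 * x.mean()                    # intercept estimate
    resid = y - (b0 + b1 * x)
    s2 = resid @ resid / (n - 2)                     # residual variance
    t = stats.t.ppf((1 + gamma) / 2, df=n - 2)
    se = np.sqrt(s2 * (1 / n + (x0 - x.mean()) ** 2 / ((x - x.mean()) ** 2).sum()))
    y0 = b0 + b1 * x0
    return b0, b1, s2, (y0 - t * se, y0 + t * se)
```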

2.7. Based on the data on the dynamics of the growth rate of a share price over 5 months, given in Table 2.4,

Table 2.4.

months ( x)
y (%)

and the assumption that the general regression equation is linear, it is required: a) to determine the estimates of the parameters of the regression equation and the residual variance s²; b) to check at α = 0.01 the significance of the regression coefficient, i.e. the hypothesis H0: θ1 = 0; c) with reliability γ = 0.95 to find interval estimates of the parameters θ0 and θ1; d) with reliability γ = 0.9 to establish an interval estimate of the conditional mathematical expectation at x0 = 4; e) to determine at γ = 0.9 the confidence interval of prediction at the point x = 5.

2.8. The results of the study of the dynamics of weight gain in young animals are given in Table 2.5.

Table 2.5.

Assuming that the general regression equation is linear, it is required: a) to determine the estimates of the parameters of the regression equation and the residual variance s²; b) to check at α = 0.05 the significance of the regression equation, i.e. the hypothesis H0: θ = 0; c) with reliability γ = 0.8 to find interval estimates of the parameters θ0 and θ1; d) with reliability γ = 0.98 to determine and compare the interval estimates of the conditional mathematical expectation at x0 = 3 and x1 = 6; e) to determine at γ = 0.98 the confidence interval of prediction at the point x = 8.

2.9. The cost price (y) of one copy of a book, depending on the circulation (x) (thousand copies), is characterized by data collected by a publishing house (Table 2.6). Determine the least squares estimates of the parameters of the hyperbolic regression equation; with reliability γ = 0.9 build confidence intervals for the parameters θ0 and θ1, as well as for the conditional mathematical expectation at x = 10.

Table 2.6.

2.10. Determine the estimates of the parameters of a regression equation of the given type and the conditional mathematical expectation at x = 20.

2.11. Table 2.8 reports the growth rates (%) of the following macroeconomic indicators of n = 10 developed countries of the world for 1992: GNP, x(1); industrial production, x(2); price index, x(3).

Table 2.8.

It is required: a) to determine the estimates of the parameters of the regression equation and an estimate of the residual variance; b) to check at α = 0.05 the significance of the regression coefficient, i.e. H0: θ1 = 0; c) with reliability γ = 0.9 to find interval estimates of θ0 and θ1; d) to find at γ = 0.95 the confidence interval for the conditional mathematical expectation at the point x0 = x_i, where i = 5; e) to compare the statistical characteristics of regression equations 1, 2 and 3.

2.12. Solve problem 2.11, taking the index x(1) as the variable to be explained (y) and the variable x(3) as the explanatory variable (x).



APPENDICES


Appendix 1. Options for tasks for independent computer research.