Regression Equation and Multiple Regression

What is regression?

Consider two continuous variables x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn).

Let's place the points on a 2D scatter plot and say that we have a linear relationship if the data are well fitted by a straight line.

If we believe that y depends on x, and that changes in y are caused precisely by changes in x, we can determine the regression line (the regression of y on x), which best describes the linear relationship between the two variables.

The statistical use of the word "regression" comes from a phenomenon known as regression to the mean, attributed to Sir Francis Galton (1889).

He showed that although tall fathers tend to have tall sons, the average height of the sons is smaller than that of their tall fathers. The average height of sons "regressed" toward the average height of all fathers in the population. Thus, on average, tall fathers have shorter (but still tall) sons, and short fathers have taller (but still rather short) sons.

Regression line

The mathematical equation that estimates a simple (paired) linear regression line is

Y = a + bx

x is called the independent variable or predictor.

Y is the dependent or response variable; it is the value we expect for y (on average) if we know the value of x, i.e. it is the "predicted value of y".

  • a - the intercept of the estimated line; this is the value of Y when x = 0 (Fig. 1).
  • b - the slope or gradient of the estimated line; it represents the amount by which Y increases on average if we increase x by one unit.
  • a and b are called the regression coefficients of the estimated line, although this term is often used only for b.

Paired linear regression can be extended to include more than one independent variable; in this case it is known as multiple regression.

Fig. 1. Linear regression line showing the intercept a and the slope b (the amount by which Y increases when x increases by one unit)

Least squares method

We carry out regression analysis using a sample of observations, where a and b are sample estimates of the true (population) parameters α and β, which determine the linear regression line in the population.

The simplest method for determining the coefficients a and b is the method of least squares (OLS).

The fit is assessed by considering the residuals (the vertical distance of each point from the line, i.e. residual = observed y - predicted y; Fig. 2).

The best fit line is chosen so that the sum of the squares of the residuals is minimal.

Fig. 2. Linear regression line with the residuals shown (vertical dashed lines) for each point.
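For readers who prefer to see the arithmetic, here is a minimal sketch in Python of the least-squares calculation just described; the x and y arrays are illustrative values, not data from this article:

```python
import numpy as np

# Illustrative data (not from the article).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Closed-form OLS estimates for y = a + b*x.
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()          # the fitted line passes through (x-bar, y-bar)

residuals = y - (a + b * x)          # vertical distances from each point to the line
print(a, b, np.sum(residuals ** 2))  # OLS minimizes this sum of squared residuals
```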

Linear Regression Assumptions

So, for each observed value, the residual equals the difference between the observed y and the corresponding predicted value. Each residual can be positive or negative.

You can use residuals to test the following assumptions underlying linear regression:

  • the relationship between x and y is linear;
  • the residuals are normally distributed with zero mean;
  • the variability of the residuals is constant (does not depend on x);

If the assumptions of linearity, normality and/or constant variance are questionable, we can transform x or y and calculate a new regression line for which these assumptions are satisfied (for example, use a log transformation).
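A rough sketch of how these residual assumptions can be checked numerically (illustrative data again; the Shapiro-Wilk test from SciPy is one common choice for the normality check, not the only one):

```python
import numpy as np
from scipy import stats

# Illustrative data and a quick OLS fit.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
residuals = y - (a + b * x)

print(residuals.mean())          # ~0 by construction for OLS with an intercept
print(stats.shapiro(residuals))  # Shapiro-Wilk normality test of the residuals
# For constant variance, plot residuals against fitted values; a funnel shape
# suggests transforming y (e.g. a log transform) before refitting.
```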

Outliers and influential points

An "influential" observation, if omitted, changes one or more estimates of the model parameters (ie, slope or intercept).

An outlier (an observation that contradicts most of the values in the dataset) can be an "influential" observation and is often easy to detect visually on a 2D scatter plot or a residual plot.

For both outliers and "influential" observations (points), the model is fitted both with and without them, and attention is paid to the change in the estimates (the regression coefficients).

When performing the analysis, do not automatically discard outliers or influential points, since simply ignoring them can affect the results. Always investigate and analyze the causes of these outliers.

Linear regression hypothesis testing

When constructing a linear regression, we test the null hypothesis that the population slope of the regression line, β, is equal to zero.

If the slope of the line is zero, there is no linear relationship between x and y: changes in x do not affect y.

To test the null hypothesis that the true slope is zero, you can use the following algorithm:

Calculate the test statistic T = b / SE(b), which follows a t distribution with n - 2 degrees of freedom, where the standard error of the coefficient b is

SE(b) = s_res / sqrt( Σ(x - x̄)² ),

and s²_res is the estimate of the variance of the residuals.

Usually, if the achieved significance level is p < 0.05, the null hypothesis is rejected.


The 95% confidence interval for the population slope β is

b ± t(0.975) * SE(b),

where t(0.975) is the percentage point of the t distribution with n - 2 degrees of freedom that gives a two-sided probability of 0.05. This is the interval that contains the population slope with 95% probability.

For large samples, t(0.975) can be approximated by 1.96 (that is, the test statistic tends to the normal distribution).
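The algorithm above can be sketched in a few lines of Python (illustrative data; scipy.stats supplies the t distribution):

```python
import numpy as np
from scipy import stats

# Illustrative data and OLS fit.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
n = len(x)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

s2_res = np.sum((y - a - b * x) ** 2) / (n - 2)        # residual variance estimate
se_b = np.sqrt(s2_res / np.sum((x - x.mean()) ** 2))   # standard error of the slope

t_stat = b / se_b                                      # H0: beta = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
t975 = stats.t.ppf(0.975, df=n - 2)                    # 97.5th percentile, n-2 df
print(t_stat, p_value)
print(b - t975 * se_b, b + t975 * se_b)                # 95% CI for the slope
```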

Evaluation of the quality of linear regression: the coefficient of determination R²

Because of the linear relationship between x and y, we expect y to change as x changes, and we call this the variation that is caused or explained by the regression. The residual variation should be as small as possible.

If this is the case, then most of the variation will be explained by the regression, and the points will lie close to the regression line, i.e. the line fits the data well.

The proportion of the total variance that is explained by the regression is called the coefficient of determination; it is usually expressed as a percentage and denoted R² (in paired linear regression it is r², the square of the correlation coefficient), and it allows a subjective assessment of the quality of the regression equation.

The difference (100% - R²) is the percentage of variance that cannot be explained by the regression.

There is no formal test for R²; we have to rely on subjective judgment to assess the quality of the fit of the regression line.
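A small sketch showing the R² computation, and that in paired regression it coincides with the squared correlation coefficient (illustrative data):

```python
import numpy as np

# Illustrative data and OLS fit.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
y_hat = a + b * x

# R^2 = 1 - (unexplained variation / total variation).
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
r = np.corrcoef(x, y)[0, 1]
print(r2, r ** 2)   # identical in paired regression: R^2 == r^2
```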

Applying a regression line to forecasting

You can use a regression line to predict a value of y from a value of x within the observed range (never extrapolate beyond these limits).

We predict the mean of y for observations that have a particular value of x by substituting that value of x into the equation of the regression line.

So, when we predict y for a particular value of x, we use this predicted value and its standard error to estimate a confidence interval for the true population mean.

Repeating this procedure for different values of x allows you to construct confidence limits for the entire line. This is the band or region that contains the true line, for example with 95% confidence.
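A sketch of the confidence interval for the mean of y at a chosen x0, using the standard textbook formula SE = s_res * sqrt(1/n + (x0 - x̄)² / Σ(x - x̄)²); the data are illustrative:

```python
import numpy as np
from scipy import stats

# Illustrative data and OLS fit.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
n = len(x)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
s2_res = np.sum((y - a - b * x) ** 2) / (n - 2)
t975 = stats.t.ppf(0.975, df=n - 2)

x0 = 3.5   # keep x0 inside the observed range - no extrapolation
y0 = a + b * x0
se_mean = np.sqrt(s2_res * (1 / n + (x0 - x.mean()) ** 2
                            / np.sum((x - x.mean()) ** 2)))
print(y0 - t975 * se_mean, y0 + t975 * se_mean)
# Repeating this over a grid of x0 values traces out the confidence band.
```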

Simple regression designs

Simple regression designs contain one continuous predictor. If there are 3 cases with predictor values P of, for example, 7, 4, and 9, and the design includes a first-order effect of P, then the design matrix X will have the form

1 7
1 4
1 9

and the regression equation using P for X1 looks like

Y = b0 + b1 P

If a simple regression design contains a higher-order effect of P, such as a quadratic effect, then the values in column X1 of the design matrix will be raised to the second power:

1 49
1 16
1 81

and the equation takes the form

Y = b0 + b1 P²

Sigma-restricted and overparameterized coding methods do not apply to simple regression designs and other designs containing only continuous predictors (since categorical predictors simply do not exist). Regardless of the coding method chosen, the values of the continuous variables are raised to the appropriate power and used as the values of the X variables. No recoding is performed. In addition, when describing regression designs, you can omit consideration of the design matrix X and work only with the regression equation.
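Since no recoding is performed, the design matrices above are easy to build directly; a sketch using the same values 7, 4, and 9:

```python
import numpy as np

P = np.array([7.0, 4.0, 9.0])

# First-order design: a column of ones (intercept) plus P itself.
X_linear = np.column_stack([np.ones_like(P), P])      # [[1,7],[1,4],[1,9]]

# Quadratic design: the X1 column holds P squared.
X_quad = np.column_stack([np.ones_like(P), P ** 2])   # [[1,49],[1,16],[1,81]]
print(X_linear)
print(X_quad)
```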

Example: Simple Regression Analysis

This example uses the data presented in the table:

Fig. 3. Table of initial data.

The data are compiled from a comparison of the 1960 and 1970 censuses in 30 randomly selected counties. The county names are used as observation names. Information on each variable is presented below:

Fig. 4. Table of variable specifications.

Research task

In this example, the relationship between the poverty rate and the variables that predict the percentage of families below the poverty line will be analyzed. Therefore, we will treat variable 3 (Pt_Poor) as the dependent variable.

It can be hypothesized that population change and the percentage of families below the poverty line are related. It seems reasonable to expect that poverty leads to an outflow of population, so there should be a negative correlation between the percentage of people below the poverty line and population change. Therefore, we will treat variable 1 (Pop_Chng) as the predictor variable.

Viewing Results

Regression coefficients

Fig. 5. Regression coefficients of Pt_Poor on Pop_Chng.

At the intersection of the Pop_Chng row and the Param. column, the non-standardized coefficient for the regression of Pt_Poor on Pop_Chng is -0.40374. This means that for every unit decrease in population there is a 0.40374 unit increase in the poverty rate. The upper and lower (default) 95% confidence limits for this non-standardized coefficient do not include zero, so the regression coefficient is significant at the p < .05 level. Note the standardized coefficient, which is also the Pearson correlation coefficient for simple regression designs: it equals -.65, which means that for every standard deviation decrease in population there is a .65 standard deviation increase in the poverty rate.

Distribution of variables

Correlation coefficients can become significantly overestimated or underestimated if large outliers are present in the data. Let us examine the distribution of the dependent variable Pt_Poor by county. To do this, let's build a histogram of the Pt_Poor variable.

Fig. 6. Histogram of the Pt_Poor variable.

As you can see, the distribution of this variable differs markedly from a normal distribution. However, although even the two counties (the two right-hand bars) have a higher percentage of families below the poverty line than would be expected under a normal distribution, they appear to be "within range".

Fig. 7. Histogram of the Pt_Poor variable.

This judgment is somewhat subjective. As a rule of thumb, outliers need attention if an observation (or observations) does not fall within the interval (mean ± 3 standard deviations). In that case, it is worth repeating the analysis both with and without the outliers to make sure that they do not seriously affect the correlation between members of the population.
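A sketch of this rule of thumb; the pt_poor array here is a hypothetical stand-in, not the actual county data from the example:

```python
import numpy as np

# Rule-of-thumb sketch: flag observations outside mean ± 3 standard deviations.
def flag_outliers(values: np.ndarray) -> np.ndarray:
    m, s = values.mean(), values.std(ddof=1)
    return (values < m - 3 * s) | (values > m + 3 * s)

# Hypothetical stand-in values, not the county data from the example.
pt_poor = np.array([12.0, 15.5, 9.8, 22.1, 14.3, 31.0, 11.2])
print(flag_outliers(pt_poor))   # rerun the regression with and without flagged rows
```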

Scatter plot

If there is an a priori hypothesis about the relationship between the given variables, it is useful to check it on the corresponding scatter plot.

Fig. 8. Scatter plot.

The scatter plot shows a clear negative correlation (-.65) between the two variables. It also shows the 95% confidence interval for the regression line, i.e. with 95% probability the regression line lies between the two dashed curves.

Significance tests

Fig. 9. Table of significance tests.

The test for the Pop_Chng regression coefficient confirms that Pop_Chng is strongly related to Pt_Poor, p < .001.

Outcome

This example showed how to analyze a simple regression design. Interpretations of the non-standardized and standardized regression coefficients were presented, the importance of studying the distribution of the dependent variable was discussed, and a technique for determining the direction and strength of the relationship between the predictor and the dependent variable was demonstrated.

The regression coefficient is an absolute value by which the value of one trait changes, on average, when another related trait changes by a specified unit of measurement. The relationship between y and x determines the sign of the regression coefficient b (if b > 0, the relationship is direct; otherwise, inverse). The linear regression model is the most commonly used and the most studied in econometrics.

Approximation error. Let us estimate the quality of the regression equation using the mean absolute approximation error. When the predicted values of the factors are substituted into the model, point forecasts of the indicator under study are obtained. The regression coefficients then characterize the degree of influence of the individual factors on the level of the resulting indicator.
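A sketch of the mean absolute approximation error, assuming the usual definition A = (100%/n) · Σ|(y - ŷ)/y|; the y and ŷ values below are illustrative:

```python
import numpy as np

# Mean absolute approximation error, in percent (assumed standard definition).
def approximation_error(y: np.ndarray, y_hat: np.ndarray) -> float:
    return 100.0 / len(y) * float(np.sum(np.abs((y - y_hat) / y)))

# Illustrative observed and fitted values.
y = np.array([10.0, 12.0, 15.0, 18.0])
y_hat = np.array([10.5, 11.6, 15.8, 17.1])
print(approximation_error(y, y_hat))  # compare with the ~15% rule of thumb used later
```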

Regression coefficient

One of the mathematical results of linear regression theory (the Gauss-Markov theorem) says that the least squares estimate is an unbiased estimate with the minimum variance in the class of all linear unbiased estimates. For example, one can predict the average number of colds at particular values of the average monthly air temperature in the autumn-winter period.

Regression line and regression equation

The regression sigma is used to construct a regression scale, which reflects the deviation of the values of the resulting trait from its mean value when plotted along the regression line. Using the values x1, x2, x3 and the corresponding mean values y1, y2, y3, as well as the smallest (y - σy/x) and largest (y + σy/x) values of y, the regression scale is constructed. Conclusion: within the calculated range of body-weight values, the regression scale allows one to determine body weight for any other value of height, or to assess the individual development of the child.

In matrix form, the regression equation (RE) is written as Y = BX + U, where U is the matrix of errors.

The correlation field (scatter plot) is used to visualize the form of the relationship between the economic indicators under study. Based on the correlation field, one can hypothesize (for the general population) that the relationship between all possible values of X and Y is linear.

Reasons for the existence of a random error include: (1) omission of significant explanatory variables from the regression model; (2) aggregation of variables. The coefficients themselves are found by solving the system of normal equations. In our example, the relationship is direct. To predict the dependent variable (the resulting indicator), you need to know the predicted values of all the factors included in the model.

Comparison of correlation and regression coefficients

With 95% probability, it can be guaranteed that the values of Y for an unlimited number of observations will not go beyond the found intervals. If the calculated F value with (n - m - 1) degrees of freedom is greater than the tabulated value at the given significance level, the model is considered significant. One should also check that there is no correlation between the deviations and, in particular, between adjacent deviations.

Regression coefficients and their interpretation

In most cases, positive autocorrelation is caused by the constant, directional influence of some factors that were not taken into account in the model. Negative autocorrelation effectively means that a positive deviation is followed by a negative one, and vice versa.

2. Inertia. Many economic indicators (inflation, unemployment, GNP, etc.) have a certain cyclical character associated with the wave-like nature of business activity. In many industrial and other areas, economic indicators respond to changes in economic conditions with a delay (a time lag).

If the factor indicators have been standardized beforehand, then b0 equals the average value of the resulting indicator in the aggregate. Specific values of the regression coefficients are determined from empirical data by the least squares method (as a result of solving the system of normal equations).

The linear regression equation has the form y = bx + a + ε, where ε is a random error (deviation, perturbation). If the approximation error exceeds 15%, it is not desirable to use the equation as a regression. Substituting the corresponding values of x into the regression equation, one can determine the fitted (predicted) values of the resulting indicator y(x) for each observation.

Regression analysis is a statistical research method that shows the dependence of a parameter on one or more independent variables. In the pre-computer era, applying it was rather difficult, especially with large amounts of data. Today, having learned how to build a regression in Excel, you can solve complex statistical problems in just a couple of minutes. Below are specific examples from the field of economics.

Regression types

The concept itself was introduced into mathematics in 1886. A regression can be:

  • linear;
  • parabolic;
  • power-law;
  • exponential;
  • hyperbolic;
  • exponential (of the form y = a·bˣ);
  • logarithmic.

Example 1

Let us consider the problem of determining how the number of employees who quit their jobs depends on the average salary at 6 industrial enterprises.

Task. At six enterprises, the average monthly salary and the number of employees who quit voluntarily were analyzed. In tabular form we have:

Number who quit

Salary

30,000 rubles
35,000 rubles
40,000 rubles
45,000 rubles
50,000 rubles
55,000 rubles
60,000 rubles

For the problem of determining the dependence of the number of employees who quit on the average salary at the 6 enterprises, the regression model has the form of the equation Y = a0 + a1x1 + ... + akxk, where the xi are the influencing variables, the ai are the regression coefficients, and k is the number of factors.

For this task, Y is the number of employees who quit, and the influencing factor, the salary, is denoted by X.

Using the capabilities of the Excel spreadsheet processor

Regression analysis in Excel should be preceded by applying the built-in functions to the existing tabular data. However, for these purposes it is better to use the very useful Analysis ToolPak add-in. To activate it you need to:

  • from the "File" tab go to the "Parameters" section;
  • in the window that opens, select the line "Add-ins";
  • click on the "Go" button located below, to the right of the "Control" line;
  • put a tick next to the name "Analysis package" and confirm your actions by clicking "OK".

If everything is done correctly, the required button will appear on the right side of the "Data" tab, located above the Excel worksheet.

Regression analysis in Excel

Now that we have at hand all the necessary virtual tools for carrying out econometric calculations, we can start solving our problem. For this:

  • click on the "Data Analysis" button;
  • in the window that opens, click on the "Regression" button;
  • in the tab that appears, enter the range of values ​​for Y (the number of employees who quit) and for X (their salaries);
  • we confirm our actions by pressing the "Ok" button.

As a result, the program will automatically fill a new sheet of the spreadsheet with the regression analysis data. Note: Excel can also place the output in a location of your choice, for example the same sheet that contains the Y and X values, or even a new workbook specifically designed to store such data.

Analyzing Regression Results for R-Square

In Excel, the output obtained when processing the data of this example looks as follows:

First of all, you should pay attention to the R-square value. It is the coefficient of determination. In this example, R-square = 0.755 (75.5%), i.e. the calculated parameters of the model explain 75.5% of the relationship between the parameters under consideration. The higher the value of the coefficient of determination, the more applicable the chosen model is to the specific task. It is considered to describe the real situation correctly when R-square is above 0.8. If R-square < 0.5, the regression analysis cannot be considered reasonable.

Analysis of the coefficients

The number 64.1428 shows what the value of Y will be if all the variables xi in the model under consideration are set to zero. In other words, it can be argued that the value of the analyzed parameter is also influenced by other factors not described in the specific model.

The next coefficient, -0.16285, located in cell B18, shows the weight of the influence of variable X on Y. It means that, within the considered model, the average monthly salary affects the number of people who quit with a weight of -0.16285, i.e. the degree of its influence is quite small. The "-" sign indicates that the coefficient is negative. This is expected, since everyone knows that the higher the salary at an enterprise, the fewer people express a desire to terminate their employment contract or quit.

Multiple regression

This term refers to an equation relating several independent variables, of the form:

y = f(x1, x2, ... xm) + ε, where y is the resulting indicator (dependent variable) and x1, x2, ... xm are the factor indicators (independent variables).

Parameter estimation

For multiple regression (MR), it is carried out using the method of least squares (OLS). For linear equations of the form Y = a + b1x1 + ... + bmxm + ε we construct the system of normal equations, which in matrix form reads (X'X)b = X'Y (see the sketch below).
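A sketch of solving the normal equations directly; the X1, X2 and y arrays are placeholder data, not figures from the text:

```python
import numpy as np

# Placeholder data for a two-factor example.
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = np.array([3.1, 3.9, 7.4, 7.8, 10.2])

X = np.column_stack([np.ones_like(X1), X1, X2])   # design matrix with intercept
beta = np.linalg.solve(X.T @ X, X.T @ y)          # solves (X'X) b = X'y
print(beta)                                       # [a, b1, b2]
```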

To understand the principle of the method, consider the two-factor case, y = a + b1x1 + b2x2 + ε. From here we get:

b1 = ((r_yx1 - r_yx2 · r_x1x2) / (1 - r²_x1x2)) · (σy / σx1)
b2 = ((r_yx2 - r_yx1 · r_x1x2) / (1 - r²_x1x2)) · (σy / σx2)

where σ is the standard deviation of the feature indicated in the subscript, and r is the corresponding pairwise correlation coefficient.

OLS is applied to the MR equation on a standardized scale. In this case we obtain the equation

t_y = β1·t_x1 + ... + βm·t_xm,

where t_y, t_x1, ... t_xm are the standardized variables, which have mean 0 and standard deviation 1, and the βi are the standardized regression coefficients.

Note that all the βi are normalized and centered in this case, so comparing them with one another is correct and valid. In addition, it is customary to screen factors by discarding those with the smallest values of |βi|.
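A sketch of how the standardized betas can be obtained by converting all variables to z-scores before fitting (placeholder data again):

```python
import numpy as np

# Placeholder data for a two-factor example.
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = np.array([3.1, 3.9, 7.4, 7.8, 10.2])

def z(v):
    return (v - v.mean()) / v.std(ddof=1)   # mean 0, standard deviation 1

Xz = np.column_stack([z(X1), z(X2)])        # no intercept: standardized vars have mean 0
betas, *_ = np.linalg.lstsq(Xz, z(y), rcond=None)
print(betas)   # directly comparable; drop factors with the smallest |beta_i|
```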

Problem Using a Linear Regression Equation

Suppose you have a table of the price dynamics of a specific product N over the last 8 months. It is necessary to decide whether it is advisable to purchase a batch of it at a price of 1850 rubles/t.

Month number, month name, price of product N:

1. 1750 rubles per ton
2. 1755 rubles per ton
3. 1767 rubles per ton
4. 1760 rubles per ton
5. 1770 rubles per ton
6. 1790 rubles per ton
7. 1810 rubles per ton
8. 1840 rubles per ton

To solve this problem in Excel, use the Data Analysis tool already known from the example presented above. Then select the "Regression" section and set the parameters. Remember that the "Input Y Range" field must receive the range of values of the dependent variable (in this case, the prices of the product in specific months), and the "Input X Range" field, the independent variable (the month number). Confirm by clicking "OK". On a new sheet (if so specified) we get the regression output.

We use it to construct a linear equation of the form y = ax + b, where the coefficient a is taken from the output row for the month-number variable and b from the "Y-intercept" row of the regression-results sheet. Thus, the linear regression equation for this problem is written as:

Product price N = 11.714 * month number + 1727.54.

or in algebraic notation

y = 11.714 x + 1727.54
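As a cross-check outside Excel, the same line can be fitted from the eight prices in the table above; the coefficients reproduce the equation, and the fitted value for month 9 gives the trend price against which the 1850 rubles/t offer can be compared:

```python
import numpy as np

# The 8 months of prices from the table above.
months = np.arange(1, 9)
price = np.array([1750, 1755, 1767, 1760, 1770, 1790, 1810, 1840], dtype=float)

b1, b0 = np.polyfit(months, price, deg=1)   # slope, intercept
print(b1, b0)                               # ~11.714 and ~1727.54, as in the equation
print(b0 + b1 * 9)                          # ~1833: the trend value for month 9
```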

Analysis of the results

To decide whether the obtained linear regression equation is adequate, the multiple correlation coefficient and the coefficient of determination are used, as well as Fisher's F test and Student's t test. In the Excel table with the regression results, they appear as Multiple R, R-square, F-statistic and t-statistic, respectively.

The multiple correlation coefficient R makes it possible to assess the closeness of the probabilistic relationship between the independent and dependent variables. Its high value here indicates a fairly strong relationship between the variables "Month number" and "Price of product N in rubles per ton". However, the nature of this relationship remains unknown.

The coefficient of determination R² is a numerical characteristic of the proportion of the total spread of the experimental data, i.e. of the values of the dependent variable, that is accounted for by the linear regression equation. In the problem under consideration, this value is 84.8%, i.e. the statistical data are described with a high degree of accuracy by the obtained regression equation.

The F-statistic, also called the Fisher test, is used to assess the significance of a linear relationship, refuting or confirming the hypothesis of its existence.

The t-statistic (Student's test) helps to assess the significance of the coefficient of the unknown and of the free term of the linear relationship. If the value of the t-statistic exceeds the critical value t_cr, the hypothesis that the free term of the linear equation is insignificant is rejected.

In the problem under consideration, Excel gives t = 169.20903 with p = 2.89E-12 for the free term, i.e. the probability of wrongly rejecting the hypothesis that the free term is insignificant is essentially zero. For the coefficient of the unknown, t = 5.79405 and p = 0.001158; in other words, the probability that the correct hypothesis about the insignificance of this coefficient will be rejected is 0.12%.

Thus, it can be argued that the resulting linear regression equation is adequate.
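For reference, the same adequacy statistics that Excel reports can be obtained with the statsmodels library; a sketch, assuming statsmodels is installed, using the months/price data from the table above:

```python
import numpy as np
import statsmodels.api as sm

months = np.arange(1, 9)
price = np.array([1750, 1755, 1767, 1760, 1770, 1790, 1810, 1840], dtype=float)

# OLS with an explicit intercept column, as in the Excel output.
model = sm.OLS(price, sm.add_constant(months)).fit()
print(model.rsquared)   # R-square
print(model.fvalue)     # F-statistic (Fisher's test)
print(model.tvalues)    # t-statistics for the intercept and the slope
print(model.pvalues)    # their p-values
```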

The problem of whether it is expedient to buy a block of shares

Multiple regression in Excel is performed using the same Data Analysis tool. Let's consider a specific applied problem.

The management of the company "NNN" must decide whether it is advisable to purchase a 20% stake in JSC "MMM". The cost of the package (SP) is US$70 million. NNN specialists have collected data on similar transactions. It was decided to estimate the value of the block of shares from parameters, expressed in millions of US dollars, such as:

  • accounts payable (VK);
  • the volume of the annual turnover (VO);
  • accounts receivable (VD);
  • the cost of fixed assets (SOF).

In addition, the wage arrears of the enterprise (VZP), in thousands of US dollars, are used as a parameter.

Excel spreadsheet solution

First of all, you need to create a table of the initial data. It looks like this. Then:

  • call the "Data Analysis" window;
  • select the section "Regression";
  • enter the range of values of the dependent variable from column G into the "Input Y Range" box;
  • click on the icon with the red arrow to the right of the "Input X Range" window and select on the sheet the range of all the values from columns B, C, D, F.

Check the "New Worksheet" item and click "Ok".

We obtain the regression analysis results for the given task.

Study of the results and conclusions

We "collect" the regression equation from the rounded data presented above on the Excel spreadsheet sheet:

SP = 0.103 * SOF + 0.541 * VO - 0.031 * VK + 0.405 * VD + 0.691 * VZP - 265.844.

In a more familiar mathematical form, it can be written as:

y = 0.103 * x1 + 0.541 * x2 - 0.031 * x3 + 0.405 * x4 + 0.691 * x5 - 265.844
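The equation is easy to encode as a function for "what if" calculations; the argument values in the example call below are hypothetical, not the actual JSC "MMM" figures:

```python
# A sketch encoding the fitted equation from the rounded coefficients above.
def share_price(sof: float, vo: float, vk: float, vd: float, vzp: float) -> float:
    """Estimated SP in millions of USD."""
    return 0.103 * sof + 0.541 * vo - 0.031 * vk + 0.405 * vd + 0.691 * vzp - 265.844

# Hypothetical inputs for illustration only.
print(share_price(sof=300.0, vo=450.0, vk=120.0, vd=80.0, vzp=10.0))
```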

Data for JSC "MMM" are presented in the table:

Substituting these values into the regression equation gives a figure of 64.72 million US dollars. This means that the shares of JSC "MMM" should not be purchased, since their price of 70 million US dollars is rather inflated.

As you can see, the use of the Excel spreadsheet processor and the regression equation made it possible to make an informed decision regarding the advisability of a very specific transaction.

Now you know what regression is. The examples in Excel discussed above will help you solve practical problems in the field of econometrics.