Regression equation. Multiple regression equation

What is regression?

Consider two continuous variables x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn).

Let's place the points on a 2D scatter plot and say that we have a linear relationship if the data is approximated by a straight line.

If we assume that y depends on x, and that changes in y are caused by changes in x, we can define a regression line (the regression of y on x), which best describes the straight-line relationship between these two variables.

The statistical use of the word "regression" comes from a phenomenon known as regression to the mean, attributed to Sir Francis Galton (1889).

He showed that while tall fathers tend to have tall sons, the average height of sons is smaller than that of their tall fathers. The average height of sons "regressed" and "moved back" to the average height of all fathers in the population. Thus, on average, tall fathers have shorter (but still tall) sons, and short fathers have taller (but still rather short) sons.

Regression line

The simple (pairwise) linear regression line is estimated by the mathematical equation:

Y = a + bx

x is called the independent variable or predictor.

Y is the dependent or response variable. This is the value we expect for y (on average) if we know the value of x, i.e. the predicted value of y.

  • a is the free term (intercept) of the estimated line; this is the value of Y when x = 0 (Fig. 1).
  • b is the slope or gradient of the estimated line; it is the amount by which Y increases on average if we increase x by one unit (see the sketch below).
  • a and b are called the regression coefficients of the estimated line, although this term is often used only for b.
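As a quick illustration, here is a minimal Python sketch that estimates a and b for such a line; numpy is assumed and the sample data are made up:

```python
import numpy as np

# Hypothetical sample data: x is the predictor, y the response
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.polyfit with degree 1 returns the slope b and the intercept a
b, a = np.polyfit(x, y, 1)
print(f"estimated line: Y = {a:.3f} + {b:.3f}*x")
```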

Pairwise linear regression can be extended to include more than one independent variable; in this case it is known as multiple regression.

Fig. 1. Linear regression line showing the intercept a and the slope b (the amount by which Y increases when x increases by one unit)

Least squares method

We perform regression analysis using a sample of observations, where a and b are sample estimates of the true (population) parameters α and β, which define the linear regression line in the population.

The simplest method for determining the coefficients a and b is the method of least squares (OLS).

The fit is evaluated by considering the residuals (the vertical distance of each point from the line, i.e. residual = observed y − predicted y, Fig. 2).

The line of best fit is chosen so that the sum of the squares of the residuals is minimal.

Fig. 2. Linear regression line with the residuals shown (vertical dotted lines) for each point.
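As a sketch of what minimizing the sum of squared residuals looks like computationally, here are the closed-form OLS estimates in Python; numpy is assumed and the data are hypothetical:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form least squares estimates of slope b and intercept a
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

# The fitted line minimizes this quantity over all possible (a, b)
residuals = y - (a + b * x)
print("sum of squared residuals:", np.sum(residuals ** 2))
```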

Linear Regression Assumptions

So, for each observed value, the residual is equal to the difference between the observed value and the corresponding predicted value. Each residual can be positive or negative.

You can use the residuals to test the following assumptions behind linear regression:

  • the relationship between x and y is linear;
  • the residuals are normally distributed with zero mean;
  • the variance of the residuals is constant across the values of x;
  • the observations (and hence the residuals) are independent of one another.

If the assumptions of linearity, normality, and/or constant variance are questionable, we can transform x or y and calculate a new regression line for which these assumptions are satisfied (e.g., use a logarithmic transformation, etc.).
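A minimal sketch of such residual checks in Python, assuming scipy is available; the data are hypothetical:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 3.8, 6.1, 8.2, 9.7, 12.4])

res = stats.linregress(x, y)
residuals = y - (res.intercept + res.slope * x)

print("mean of residuals (should be ~0):", residuals.mean())
# Shapiro-Wilk test for normality of the residuals
print("normality p-value:", stats.shapiro(residuals).pvalue)
```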

Abnormal values (outliers) and influential points

An "influential" observation, if omitted, changes one or more model parameter estimates (ie slope or intercept).

An outlier (an observation that conflicts with most of the values in the data set) can be an "influential" observation, and it is often easy to detect visually on a 2D scatterplot or a plot of the residuals.

For both outliers and "influential" observations (points), the model is fitted with and without them, and attention is paid to the change in the estimates (the regression coefficients).

When doing an analysis, do not automatically discard outliers or influential points, since simply ignoring them can affect the results. Always investigate the causes of these outliers and analyze them.

Hypothesis testing in linear regression

When constructing a linear regression, we test the null hypothesis that the population slope of the regression line, β, is zero.

If the slope of the line is zero, there is no linear relationship between x and y: changes in x do not affect y.

To test the null hypothesis that the true slope β is zero, you can use the following algorithm:

Calculate the test statistic equal to the ratio t = b / SE(b), which follows a t-distribution with n − 2 degrees of freedom, where the standard error of the coefficient b is

SE(b) = s_res / sqrt( Σ(xi − x̄)² ),

and s²_res is the estimate of the variance of the residuals.

Usually, if the significance level reached is p < 0.05, the null hypothesis is rejected.

The 95% confidence interval for the slope is

b ± t0.025 × SE(b),

where t0.025 is the percentage point of the t-distribution with n − 2 degrees of freedom which gives a two-tailed probability of 0.05.

This is the interval that contains the population slope with a probability of 95%.

For large samples, t0.025 can be approximated by the value 1.96 (that is, the test statistic tends to the normal distribution).
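A small Python sketch of this slope test; scipy is assumed and the data are hypothetical:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.2, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1, 8.9])

res = stats.linregress(x, y)
t_stat = res.slope / res.stderr          # t = b / SE(b)
df = len(x) - 2
t_crit = stats.t.ppf(0.975, df)          # two-tailed 5% point

print("t =", t_stat, "p =", res.pvalue)
print("95% CI for slope:",
      (res.slope - t_crit * res.stderr, res.slope + t_crit * res.stderr))
```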

Evaluation of the quality of linear regression: the coefficient of determination R²

Because of the linear relationship between x and y, we expect y to change as x changes, and we call this the variation that is due to, or explained by, the regression. The residual variation should be as small as possible.

If so, then most of the variation will be explained by the regression, and the points will lie close to the regression line, i.e. the line fits the data well.

The share of the total variance that is explained by the regression is called the coefficient of determination, usually expressed as a percentage and denoted R² (in paired linear regression this is the value r², the square of the correlation coefficient). It allows you to subjectively assess the quality of the regression equation.

The difference 100% − R² is the percentage of variance that cannot be explained by the regression.

With no formal test to evaluate R², we are forced to rely on subjective judgment to determine the quality of the fit of the regression line.
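Computing R² directly from its definition, as a hedged sketch; numpy is assumed and the data are hypothetical:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

ss_res = np.sum((y - y_hat) ** 2)        # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)     # total variation
r2 = 1 - ss_res / ss_tot
print(f"R^2 = {r2:.3f} ({100 * r2:.1f}% of variance explained)")
```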

Applying a Regression Line to a Forecast

You can use a regression line to predict a value of y from a value of x within the observed range (never extrapolate beyond these limits).

We predict the mean of y for observations that have a particular value of x by substituting that value of x into the equation of the regression line.

So, if we predict y0 at x = x0, we use this predicted value and its standard error to estimate a confidence interval for the true population mean of y at x0.

Repeating this procedure for different values of x allows you to build confidence limits for the whole line. This is a band (region) that contains the true line, for example, with a 95% confidence level.
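A hedged sketch of a point prediction with a confidence interval for the mean response; numpy and scipy are assumed, the data are hypothetical:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 3.8, 6.1, 8.2, 9.7, 12.4])

res = stats.linregress(x, y)
n = len(x)
s = np.sqrt(np.sum((y - (res.intercept + res.slope * x)) ** 2) / (n - 2))

x0 = 3.5                                   # inside the observed range only
y0 = res.intercept + res.slope * x0
# Standard error of the predicted mean of y at x0
se_mean = s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2))
t_crit = stats.t.ppf(0.975, n - 2)
print(f"predicted mean at x0={x0}: {y0:.2f} +/- {t_crit * se_mean:.2f}")
```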

Simple regression designs

Simple regression designs contain one continuous predictor. If there are 3 cases with predictor values P, such as 7, 4 and 9, and the design includes a first-order effect of P, then the design matrix X will be

X0  X1
 1   7
 1   4
 1   9

and the regression equation using P for X1 looks like

Y = b0 + b1 P

If a simple regression design contains a higher-order effect of P, such as a quadratic effect, then the values in column X1 of the design matrix will be raised to the second power:

X0  X1
 1  49
 1  16
 1  81

and the equation will take the form

Y = b0 + b1 P²

Sigma-restricted and overparameterized coding methods do not apply to simple regression designs or to other designs containing only continuous predictors (since there are no categorical predictors to encode). Regardless of the coding method chosen, the values of the continuous variables are raised to the appropriate power and used as the values for the X variables. In this case, no conversion is performed. In addition, when describing regression designs, you can omit consideration of the design matrix X and work only with the regression equation.
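A small sketch of building these design matrices in Python; numpy is assumed, and the values 7, 4, 9 come from the example above:

```python
import numpy as np

P = np.array([7.0, 4.0, 9.0])

# First-order design: a column of ones (intercept) plus P itself
X_linear = np.column_stack([np.ones_like(P), P])

# Quadratic design: the predictor column is raised to the second power
X_quadratic = np.column_stack([np.ones_like(P), P ** 2])

print(X_linear)
print(X_quadratic)
```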

Example: Simple Regression Analysis

This example uses the data provided in the table:

Fig. 3. Table of initial data.

The data is based on a comparison of the 1960 and 1970 censuses in 30 randomly selected counties. County names are represented as observation names. Information regarding each variable is presented below:

Fig. 4. Variable specification table.

Research objective

For this example, we will analyze the predictors of poverty, i.e. the variables that best predict the percentage of families below the poverty line. Therefore, we will treat variable 3 (Pt_Poor) as the dependent variable.

One can put forward a hypothesis: the change in the population and the percentage of families that are below the poverty line are related. It seems reasonable to expect that poverty leads to an outflow of population, hence there would be a negative correlation between the percentage of people below the poverty line and population change. Therefore, we will treat variable 1 (Pop_Chng ) as a predictor variable.

View results

Regression coefficients

Fig. 5. Regression coefficients for Pt_Poor on Pop_Chng.

At the intersection of the Pop_Chng row and the Param. column, the non-standardized coefficient for the regression of Pt_Poor on Pop_Chng is -0.40374. This means that for every unit decrease in population, there is a 0.40374 unit increase in the poverty rate. The upper and lower (default) 95% confidence limits for this non-standardized coefficient do not include zero, so the regression coefficient is significant at the p < .05 level. Note that the standardized coefficient, which is also the Pearson correlation coefficient for simple regression designs, is -.65, which means that for every standard deviation decrease in population there is a .65 standard deviation increase in the poverty rate.

Distribution of variables

Correlation coefficients can become significantly overestimated or underestimated if there are large outliers in the data. Let us examine the distribution of the dependent variable Pt_Poor by county. To do this, we will build a histogram of the Pt_Poor variable.

Fig. 6. Histogram of the Pt_Poor variable.

As you can see, the distribution of this variable differs markedly from the normal distribution. However, although two counties (the two right-hand columns) have a higher percentage of families below the poverty line than would be expected under a normal distribution, they appear to be "within the range."

Fig. 7. Histogram of the Pt_Poor variable.

This judgment is somewhat subjective. The rule of thumb is that outliers should be taken into account if an observation (or observations) does not fall within the interval (mean ± 3 standard deviations). In this case, it is worth repeating the analysis with and without the outliers to make sure that they do not have a serious effect on the correlation between the variables.

Scatterplot

If there is an a priori hypothesis about the relationship between the given variables, it is useful to check it on the corresponding scatterplot.

Fig. 8. Scatterplot.

The scatterplot shows a clear negative correlation (-.65) between the two variables. It also shows the 95% confidence interval for the regression line, i.e., with 95% probability the regression line passes between the two dashed curves.

Significance criteria

Fig. 9. Table containing the significance criteria.

The test for the Pop_Chng regression coefficient confirms that Pop_Chng is strongly related to Pt_Poor, p < .001.

Outcome

This example showed how to analyze a simple regression design. Interpretations of the non-standardized and standardized regression coefficients were also presented. The importance of studying the distribution of the dependent variable was discussed, and a technique for determining the direction and strength of the relationship between the predictor and the dependent variable was demonstrated.

REGRESSION COEFFICIENT

(English: regression coefficient; German: Regressionskoeffizient.) One of the characteristics of the relationship between the dependent variable y and the independent variable x. The regression coefficient shows by how many units the value of y increases if the variable x changes by one unit. Geometrically, the regression coefficient is the slope of the regression line of y on x.

Antinazi. Encyclopedia of Sociology, 2009

See what "REGRESSION COEFFICIENT" is in other dictionaries:

    regression coefficient- - [L.G. Sumenko. English Russian Dictionary of Information Technologies. M .: GP TsNIIS, 2003.] Topics information technology in general EN regression coefficient ... Technical Translator's Handbook

    Regression coefficient- 35. Regression coefficient Parameter of the regression analysis model Source: GOST 24026 80: Research tests. Experiment planning. Terms and Definitions …

    regression coefficient- The coefficient of the independent variable in the regression equation ... Dictionary of Sociological Statistics

    REGRESSION COEFFICIENT- English. coefficient, regression; German Regressionskoeffizient. One of the characteristics of the relationship between dependent y and independent variable x. K. r. shows by how many units the value accepted by y increases if the variable x changes to ... ... Explanatory Dictionary of Sociology

    sample regression coefficient- 2.44. sample regression coefficient Coefficient of a variable in a regression curve or surface equation Source: GOST R 50779.10 2000: Statistical methods. Probability and bases of statistics. Terms and Definitions … Dictionary-reference book of terms of normative and technical documentation

    Partial regression coefficient- a statistical measure that indicates the degree of influence of the independent variable on the dependent in a situation where the mutual influence of all other variables in the model is under the control of the researcher ... Sociological Dictionary Socium

    REGRESSIONS, WEIGHT- A synonym for the concept of regression coefficient ... Explanatory Dictionary of Psychology

    HERITABILITY COEFFICIENT- An indicator of the relative share of genetic variability in the overall phenotypic variation of a trait. The most common methods for assessing the heritability of economically useful traits are: where h2 is the heritability coefficient; r intraclass… … Terms and definitions used in breeding, genetics and reproduction of farm animals

    - (R squared) is the proportion of the variance of the dependent variable that is explained by the dependence model in question, that is, the explanatory variables. More precisely, this is one minus the proportion of unexplained variance (the variance of the random error of the model, or conditional ... ... Wikipedia

    The coefficient of the independent variable in the regression equation. So, for example, in a linear regression equation linking random variables Y and X, R. k. b0 and b1 are equal: where r is the correlation coefficient of X and Y, . Calculation of estimates R. k. Mathematical Encyclopedia

Books

  • Introduction to econometrics (CDpc), Yanovsky Leonid Petrovich, Bukhovets Alexey Georgievich. The foundations of econometrics and statistical analysis of one-dimensional time series are given. Much attention is paid to classical pair and multiple regression, classical and generalized methods…
  • Speed ​​reading. Effective Simulator (CDpc) , . The program is addressed to users who wish to master the technique of speed reading in the shortest possible time. The course is built on the principle of "theory - practice". Theoretical material and practical ...

The regression coefficient is the absolute value by which the value of one attribute changes on average when another attribute associated with it changes by a specified unit of measurement. The relationship between y and x determines the sign of the regression coefficient b (if b > 0, the relationship is direct; otherwise it is inverse). The linear regression model is the most commonly used and most studied model in econometrics.

Approximation error. Let us evaluate the quality of the regression equation using the absolute approximation error. The predicted values of the factors are substituted into the model, and point predictive estimates of the indicator under study are obtained. Thus, the regression coefficients characterize the degree of significance of individual factors for increasing the level of the effective indicator.

Regression coefficient

Consider now problem 1 of the regression analysis tasks given on pp. 300-301. One of the mathematical results of the theory of linear regression says that the least squares estimate is the unbiased estimate with minimum variance in the class of all linear unbiased estimates. For example, one can calculate the average number of colds at certain values of the average monthly air temperature in the autumn-winter period.

Regression line and regression equation

The regression sigma is used in constructing a regression scale, which reflects the deviation of the values of the effective attribute from its average value plotted on the regression line. The values x1, x2, x3 and their corresponding average values y1, y2, y3, as well as the smallest (y − σy/x) and largest (y + σy/x) values of y, are used to build the regression scale. Conclusion: within the calculated values of body weight, the regression scale allows you to determine body weight for any other value of height, or to assess the individual development of the child.

In matrix form, the regression equation (RE) is written as Y = BX + U, where U is the error matrix.



This method is used to visualize the form of the relationship between the economic indicators under study. Based on the correlation field, one can hypothesize (for the general population) that the relationship between all possible values of X and Y is linear.

Reasons for the existence of a random error: 1. Omission of significant explanatory variables from the regression model; 2. Aggregation of variables. In our example the relationship is direct. To predict the dependent variable (the effective attribute), it is necessary to know the predicted values of all the factors included in the model.

Comparison of correlation and regression coefficients

With a probability of 95%, it can be guaranteed that the value of Y with an unlimited number of observations will not go beyond the limits of the found intervals. If the calculated value with (n − m − 1) degrees of freedom is greater than the tabulated value at a given significance level, then the model is considered significant. This ensures that there is no correlation between any deviations and, in particular, between adjacent deviations.

Regression coefficients and their interpretation

In most cases, positive autocorrelation is caused by the constant directional influence of some factors not taken into account in the model. Negative autocorrelation actually means that a positive deviation is followed by a negative one, and vice versa.


2. Inertia. Many economic indicators (inflation, unemployment, GNP, etc.) have a certain cyclical character associated with the wave-like nature of business activity. In many industrial and other areas, economic indicators respond to changes in economic conditions with a delay (a time lag).

If preliminary standardization of the factor indicators has been carried out, then b0 equals the average value of the effective indicator in the aggregate. The specific values of the regression coefficients are determined from empirical data by the least squares method (by solving a system of normal equations).

The linear regression equation has the form y = bx + a + ε, where ε is a random error (deviation, perturbation). Since the error here is greater than 15%, it is not desirable to use this equation as a regression. By substituting the appropriate values of x into the regression equation, one can determine the aligned (predicted) values of the effective indicator y(x) for each observation.

Regression analysis is a statistical research method that allows you to show the dependence of a parameter on one or more independent variables. In the pre-computer era its use was quite difficult, especially with large amounts of data. Today, having learned how to build a regression in Excel, you can solve complex statistical problems in just a couple of minutes. Below are concrete examples from the field of economics.

Types of regression

The concept itself was introduced into mathematics in 1886. Regression can be:

  • linear;
  • parabolic;
  • power;
  • exponential;
  • hyperbolic;
  • logarithmic.

Example 1

Consider the problem of determining the dependence of the number of employees who quit on the average salary at 6 industrial enterprises.

Task: six enterprises analyzed the average monthly salary and the number of employees who quit of their own free will. In tabular form we have:

Salary          | The number of people who left
30000 rubles    | …
35000 rubles    | …
40000 rubles    | …
45000 rubles    | …
50000 rubles    | …
55000 rubles    | …
60000 rubles    | …

For the problem of determining the dependence of the number of employees who quit on the average salary at 6 enterprises, the regression model has the form of the equation Y = a0 + a1x1 + … + akxk, where xi are the influencing variables, ai are the regression coefficients, and k is the number of factors.

For this task, Y is the number of employees who quit, and the influencing factor is the salary, which we denote by X.

Using the capabilities of the Excel spreadsheet

Regression analysis in Excel must be preceded by applying the built-in functions to the available tabular data. However, for these purposes it is better to use the very useful "Analysis ToolPak" add-in. To activate it you need to:

  • from the "File" tab, go to the "Options" section;
  • in the window that opens, select the line "Add-ins";
  • click on the "Go" button located at the bottom, to the right of the "Manage" line;
  • check the box next to the name "Analysis ToolPak" and confirm your actions by clicking "OK".

If everything is done correctly, the desired button will appear on the right side of the "Data" tab, located above the Excel worksheet.

Regression analysis in Excel

Now that we have at hand all the necessary virtual tools for performing econometric calculations, we can begin to solve our problem. For this:

  • click on the "Data Analysis" button;
  • in the window that opens, click on the "Regression" button;
  • in the tab that appears, enter the range of values ​​for Y (the number of employees who quit) and for X (their salaries);
  • We confirm our actions by pressing the "Ok" button.

As a result, the program will automatically populate a new sheet of the spreadsheet with the regression analysis data. Note: Excel lets you manually choose a preferred location for this output. For example, it could be the same sheet where the Y and X values are, or even a new workbook specifically designed to store such data.

Analysis of regression results for R-square

In Excel, the output obtained by processing the data of the considered example looks like this:

First of all, you should pay attention to the R-square value. It is the coefficient of determination. In this example, R-square = 0.755 (75.5%), i.e. the calculated parameters of the model explain 75.5% of the relationship between the considered parameters. The higher the value of the coefficient of determination, the more applicable the chosen model is for the particular task. It is believed that a model correctly describes the real situation when the R-square value is above 0.8. If R-square < 0.5, such a regression analysis in Excel cannot be considered reasonable.

Analysis of the coefficients

The number 64.1428 shows what the value of Y will be if all the variables xi in the model under consideration are set to zero. In other words, it can be argued that the value of the analyzed parameter is also influenced by other factors not described in the particular model.

The next coefficient, -0.16285, located in cell B18, shows the weight of the influence of variable X on Y. This means that the average monthly salary of employees within the model under consideration affects the number of quitters with a weight of -0.16285, i.e. the degree of its influence is quite small. The "-" sign indicates that the coefficient is negative. This is to be expected, since everyone knows that the higher the salary at an enterprise, the fewer people express a desire to terminate their employment contract or quit.

Multiple Regression

This term refers to a relationship equation with several independent variables of the form:

y = f(x1, x2, …, xm) + ε, where y is the effective feature (dependent variable), and x1, x2, …, xm are the factors (independent variables).

Parameter Estimation

For multiple regression (MR), parameter estimation is carried out using the method of least squares (OLS). For linear equations of the form Y = a + b1x1 + … + bmxm + ε, we construct a system of normal equations; in matrix form it is (XᵀX)b = XᵀY.
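A hedged sketch of solving the normal equations directly in Python; numpy is assumed and the two-factor data are hypothetical:

```python
import numpy as np

# Hypothetical data: two factors x1, x2 and a response y
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = np.array([3.1, 3.9, 7.3, 7.8, 10.9])

X = np.column_stack([np.ones_like(x1), x1, x2])   # design matrix with intercept
# Solve the normal equations (X'X) b = X'y
b = np.linalg.solve(X.T @ X, X.T @ y)
print("a, b1, b2 =", b)
```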

To understand the principle of the method, consider the two-factor case. Then we have a situation described by the formula

y = a + b1x1 + b2x2 + ε.

From here we get the relation between the natural-scale and standardized coefficients:

bi = βi (σy / σxi),

where σ is the standard deviation of the corresponding feature indicated in the index.

OLS is also applicable to the MR equation on a standardized scale. In this case we get the equation

ty = β1 tx1 + … + βm txm,

where ty, tx1, …, txm are standardized variables with mean 0 and standard deviation 1, and βi are the standardized regression coefficients.

Please note that all βi in this case are centered and normalized, so comparing them with each other is considered correct and admissible. In addition, it is customary to screen out factors, discarding those with the smallest values of βi.
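A small sketch of computing standardized coefficients; numpy is assumed, reusing the same hypothetical data as above:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = np.array([3.1, 3.9, 7.3, 7.8, 10.9])

# Standardize every variable: mean 0, standard deviation 1
def standardize(v):
    return (v - v.mean()) / v.std()

# No intercept is needed: standardized variables have zero mean
T = np.column_stack([standardize(x1), standardize(x2)])
beta = np.linalg.lstsq(T, standardize(y), rcond=None)[0]
print("standardized coefficients beta1, beta2 =", beta)
```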

A problem using the linear regression equation

Suppose there is a table of the price dynamics of a particular product N over the last 8 months. It is necessary to decide on the advisability of purchasing a batch of it at a price of 1850 rubles/t.

month number | month name | price of item N
1            | …          | 1750 rubles per ton
2            | …          | 1755 rubles per ton
3            | …          | 1767 rubles per ton
4            | …          | 1760 rubles per ton
5            | …          | 1770 rubles per ton
6            | …          | 1790 rubles per ton
7            | …          | 1810 rubles per ton
8            | …          | 1840 rubles per ton

To solve this problem in the Excel spreadsheet, use the "Data Analysis" tool already known from the example above. Next, select the "Regression" section and set the parameters. Remember that in the "Input Y Range" field you must enter the range of values of the dependent variable (in this case, the price of the product in specific months of the year), and in the "Input X Range" field, the independent variable (the month number). Confirm the action by clicking "OK". On a new sheet (if that was specified), we get the data for the regression.

Based on this output, we build a linear equation of the form y = ax + b, where the parameter a is the coefficient from the row named for the month number, and b is the coefficient from the "Y-intercept" row of the sheet with the regression results. Thus, the linear regression equation (LRE) for problem 3 is written as:

Product price N = 11.714* month number + 1727.54.

or in algebraic notation

y = 11.714 x + 1727.54
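As a cross-check of these numbers in Python; numpy is assumed, with the prices from the table above and month numbers 1 through 8:

```python
import numpy as np

months = np.arange(1, 9)
prices = np.array([1750, 1755, 1767, 1760, 1770, 1790, 1810, 1840])

slope, intercept = np.polyfit(months, prices, 1)
print(f"y = {slope:.3f} x + {intercept:.2f}")   # matches y = 11.714 x + 1727.54

# Forecast for month 9, to compare with the offered 1850 rubles/t
print("forecast, month 9:", slope * 9 + intercept)
```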

Analysis of results

To decide whether the resulting linear regression equation is adequate, the multiple correlation coefficient (MCC) and the coefficient of determination are used, as well as the Fisher F-test and Student's t-test. In the Excel table with the regression results, they appear under the names "Multiple R", "R Square", "F" and "t Stat", respectively.

The MCC R makes it possible to assess the closeness of the probabilistic relationship between the independent and dependent variables. Its high value indicates a fairly strong relationship between the variables "month number" and "price of product N in rubles per 1 ton". However, the nature of this relationship remains unknown.

The coefficient of determination R² is a numerical characteristic of the share of the total scatter of the experimental data, i.e. of the values of the dependent variable, that is described by the linear regression equation. In the problem under consideration this value equals 84.8%, i.e. the statistical data are described with a high degree of accuracy by the obtained LRE.

The F-statistic, also called the Fisher test, is used to assess the significance of the linear relationship, refuting or confirming the hypothesis of its existence.

The t-statistic (Student's criterion) helps to evaluate the significance of the coefficient of the unknown and of the free term of the linear relationship. If the value of the t-statistic is greater than t_cr, the hypothesis that the free term of the linear equation is insignificant is rejected.

In the problem under consideration, for the free term the Excel tools give t = 169.20903 and p = 2.89E-12, i.e. there is effectively zero probability that the correct hypothesis about the insignificance of the free term will be rejected. For the coefficient of the unknown, t = 5.79405 and p = 0.001158. In other words, the probability that the correct hypothesis about the insignificance of this coefficient will be rejected is 0.12%.

Thus, it can be argued that the resulting linear regression equation is adequate.

The problem of the expediency of buying a block of shares

Multiple regression in Excel is performed using the same Data Analysis tool. Consider a specific applied problem.

The management of NNN must decide on the advisability of purchasing a 20% stake in MMM JSC. The cost of the stake (SP) is 70 million US dollars. NNN specialists have collected data on similar transactions. It was decided to estimate the value of the block of shares from parameters, expressed in millions of US dollars, such as:

  • accounts payable (VK);
  • annual turnover (VO);
  • accounts receivable (VD);
  • cost of fixed assets (SOF).

In addition, the parameter "payroll arrears of the enterprise" (VZP), in thousands of US dollars, is used.

Solution using Excel spreadsheet

First of all, you need to create a table of the initial data. It looks like this:

Then:

  • call the "Data Analysis" window;
  • select the "Regression" section;
  • in the box "Input interval Y" enter the range of values ​​of dependent variables from column G;
  • click on the icon with a red arrow to the right of the "Input interval X" box and select on the sheet a range of all values ​​from columns B,C, D, F.

Select "New Worksheet" and click "Ok".

We obtain the regression analysis results for the given problem.

Examination of the results and conclusions

We "assemble" the regression equation from the rounded data presented above on the Excel worksheet:

SP = 0.103*SOF + 0.541*VO - 0.031*VK + 0.405*VD + 0.691*VZP - 265.844

In a more familiar mathematical form, it can be written as:

y = 0.103*x1 + 0.541*x2 - 0.031*x3 +0.405*x4 +0.691*x5 - 265.844
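A tiny sketch of applying this equation in Python; the coefficients come from the text, while the helper name and the input figures are placeholders, since the MMM data table is not reproduced here:

```python
def estimate_stake_value(sof, vo, vk, vd, vzp):
    """Value of the stake (SP) from the fitted multiple regression."""
    return (0.103 * sof + 0.541 * vo - 0.031 * vk
            + 0.405 * vd + 0.691 * vzp - 265.844)

# Placeholder inputs (millions USD; VZP in thousands USD) - not the real MMM data
print(estimate_stake_value(sof=100.0, vo=400.0, vk=50.0, vd=80.0, vzp=10.0))
```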

Data for JSC "MMM" are presented in the table:

Substituting them into the regression equation gives a figure of 64.72 million US dollars. This means that the shares of MMM JSC should not be purchased, since their asking price of 70 million US dollars is rather overstated.

As you can see, the use of the Excel spreadsheet and the regression equation made it possible to make an informed decision regarding the feasibility of a very specific transaction.

Now you know what regression is. The examples in Excel discussed above will help you solve practical problems from the field of econometrics.