Data analysis: regression. Regression analysis in Excel

Statistical processing of data can also be carried out using the ANALYSIS PACKAGE add-in (Fig. 62).

From the proposed items, choose "REGRESSION" and click it with the left mouse button. Then click OK.

The window shown in Fig. 63 will appear.

The "REGRESSION" analysis tool is used to fit a line to a set of observations using the method of least squares. Regression is used to analyze the effect of one or more explanatory variables on a single dependent variable. For example, several factors affect the athletic performance of an athlete, including age, height, and weight. You can calculate the impact of each of these three factors on an athlete's performance, and then use that data to predict the performance of another athlete.
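As a sketch of that idea in code (not part of the original workbook): the least-squares fit behind the REGRESSION tool can be reproduced by solving the normal equations. The athlete numbers below are invented, and y is generated from known coefficients so the recovered fit can be checked:

```python
# Sketch: multiple linear regression by solving the normal equations,
# mirroring what the REGRESSION tool / LINEST do with least squares.
# The athlete data below are invented purely for illustration.

def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[piv] = M[piv], M[col]
        for i in range(col + 1, n):
            f = M[i][col] / M[col][col]
            for j in range(col, n + 1):
                M[i][j] -= f * M[col][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit(rows, ys):
    """Least-squares fit y = m1*x1 + ... + mk*xk + b via X'X a = X'y."""
    X = [list(r) + [1.0] for r in rows]          # append a column of 1s for b
    k = len(X[0])
    XtX = [[sum(row[p] * row[q] for row in X) for q in range(k)] for p in range(k)]
    Xty = [sum(row[p] * y for row, y in zip(X, ys)) for p in range(k)]
    return solve(XtX, Xty)                        # [m1, ..., mk, b]

# age, height, weight -> performance (synthetic: 2*a + 0.1*h - 0.5*w + 10)
rows = [(20, 180, 70), (24, 175, 80), (28, 182, 75), (22, 170, 68), (30, 178, 85)]
ys = [2 * a + 0.1 * h - 0.5 * w + 10 for a, h, w in rows]
m1, m2, m3, b = fit(rows, ys)
print(round(m1, 3), round(m2, 3), round(m3, 3), round(b, 3))
```

Because the synthetic y-values lie exactly on the model, the fit recovers the generating coefficients.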

The Regression tool uses the LINEST function.

REGRESSION Dialog Box

Labels. Select the check box if the first row or first column of the input range contains headers. Clear the check box if there are no headers; in that case, headers for the output table will be generated automatically.

Confidence level. Select the check box to include an additional confidence level in the output summary table. In the corresponding field, enter the confidence level to apply in addition to the default 95% level.

Constant is zero. Select the check box to force the regression line through the origin.

Output range. Enter a reference to the top-left cell of the output range. Allocate at least seven columns for the output summary table, which will include: ANOVA results, coefficients, the standard error of the Y estimate, standard deviations, the number of observations, and standard errors for the coefficients.

New worksheet. Select this option to open a new worksheet in the workbook and insert the analysis results starting at cell A1. If necessary, enter a name for the new sheet in the field next to this option.

New workbook. Select this option to create a new workbook in which the results will be added to a new sheet.

Residuals Select the checkbox to include residuals in the output table.

Standardized residuals Select the checkbox to include standardized residuals in the output table.

Plot Residuals Select the checkbox to plot the residuals for each independent variable.

Fitting plot Select the checkbox to plot the predicted values against the observed values.

Normal Probability Plot Check the box to plot the normal probability graph.

Function LINEST

To carry out calculations, select the cell in which you want to display the average value and press the = key on the keyboard. Then, in the Name box, choose the desired function, for example AVERAGE (Fig. 22).

The LINEST function calculates statistics for a series using the least squares method to compute the straight line that best approximates the available data, and then returns an array that describes the resulting line. You can also combine LINEST with other functions to compute other kinds of models that are linear in their unknown parameters, including polynomial, logarithmic, exponential, and power models. Since an array of values is returned, the function must be entered as an array formula.

The equation for a straight line is as follows:

y = m1x1 + m2x2 + … + b (in the case of several ranges of x-values),

where the dependent value y is a function of the independent x-values, the m-values are the coefficients corresponding to each independent variable x, and b is a constant. Note that y, x, and m can be vectors. LINEST returns the array (mn, mn-1, …, m1, b). LINEST can also return additional regression statistics.

LINEST(known_y's, known_x's, const, statistics)

Known_y's is the set of y-values ​​that are already known for the relationship y = mx + b.

If known_y's has one column, then each column in known_x's is interpreted as a separate variable.

If known_y's has a single row, then each row in known_x's is interpreted as a separate variable.

Known_x's is an optional set of x-values that are already known for the relationship y = mx + b.

Known_x's can contain one or more sets of variables. If only one variable is used, then known_y's and known_x's can be of any shape, as long as they have the same dimension. If more than one variable is used, known_y's must be a vector (that is, one row high or one column wide).

If known_x's is omitted, it is assumed to be the array {1; 2; 3; …} of the same size as known_y's.

Const is a logical value that specifies whether to force the constant b to equal 0.

If const is TRUE or omitted, the constant b is evaluated in the usual way.

If const is FALSE, b is set equal to 0 and the m-values are adjusted so that the relationship y = mx is satisfied.

Statistics is a Boolean value that indicates whether you want to return additional statistics for the regression.

If statistics is TRUE, LINEST returns additional regression statistics. The returned array has the form {mn, mn-1, …, m1, b; sen, sen-1, …, se1, seb; r2, sey; F, df; ssreg, ssresid}.

If statistics is FALSE or omitted, LINEST returns only the coefficients m and the constant b.

Additional regression statistics (Table 17):

Value - Description

se1, se2, …, sen - Standard error values for the coefficients m1, m2, …, mn.

seb - Standard error value for the constant b (seb = #N/A if const is FALSE).

r2 - Coefficient of determination. The actual y-values are compared with the values obtained from the regression equation, and from this comparison the coefficient of determination, normalized from 0 to 1, is calculated. If it equals 1, there is a perfect correlation with the model, that is, no difference between the actual and estimated y-values. At the other extreme, if it equals 0, the regression equation is of no use in predicting y-values. For more information on how r2 is calculated, see the Notes at the end of this section.

sey - Standard error for the estimate of y.

F - The F-statistic, or the F-observed value. Use the F-statistic to determine whether the observed relationship between the dependent and independent variables occurs by chance.

df - Degrees of freedom. Degrees of freedom help you find F-critical values in a statistical table. To determine the confidence level of the model, compare the values in the table with the F-statistic returned by LINEST. For more information on calculating df, see the Notes at the end of this section. Example 4 below shows the use of the F and df values.

ssreg - Regression sum of squares.

ssresid - Residual sum of squares. For more information on calculating ssreg and ssresid, see the Notes at the end of this section.
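A sketch of how these quantities relate for a one-variable fit, using invented data (this mirrors the definitions above, not Excel's internal code):

```python
# Computing LINEST's additional statistics for a one-variable line fit.
# Data are invented for illustration.
from math import sqrt

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
m = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
b = my - m * mx

pred = [m * x + b for x in xs]
ssresid = sum((y - p) ** 2 for y, p in zip(ys, pred))   # residual sum of squares
sstotal = sum((y - my) ** 2 for y in ys)                # total sum of squares
ssreg = sstotal - ssresid                               # regression sum of squares
r2 = ssreg / sstotal                                    # coefficient of determination
df = n - 2                                              # one x-variable, b estimated
sey = sqrt(ssresid / df)                                # standard error of the y estimate
F = ssreg / (ssresid / df)                              # F-statistic
se_m = sey / sqrt(sxx)                                  # standard error of m
se_b = sey * sqrt(1 / n + mx ** 2 / sxx)                # standard error of b
print(r2, F, df)
```

For this data set the fit is y = 0.6x + 2.2 with r2 = 0.6, F = 4.5, and df = 3.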

The figure below shows the order in which the additional regression statistics are returned (Figure 64).

Notes:

Any straight line can be described by its slope and its y-intercept:

Slope (m): to determine the slope of a straight line, usually denoted by m, take two points on the line, (x1, y1) and (x2, y2); the slope is (y2 - y1) / (x2 - x1).

Y-intercept (b): the y-intercept of a line, usually denoted by b, is the y-value of the point at which the line crosses the y-axis.

The equation of a straight line is y = mx + b. Once you know the values of m and b, you can calculate any point on the line by substituting the y- or x-value into the equation. You can also use the TREND function.

If there is only one independent variable x, you can get the slope and y-intercept directly using the following formulas:

Slope: =INDEX(LINEST(known_y's, known_x's), 1)

Y-intercept: =INDEX(LINEST(known_y's, known_x's), 2)

The accuracy of the line computed by LINEST depends on the degree of scatter in the data: the closer the data are to a straight line, the more accurate the LINEST model. LINEST uses the method of least squares to determine the best fit to the data. When there is only one independent variable x, m and b are calculated from the following formulas:

m = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²

b = ȳ - m·x̄

where x̄ and ȳ are the sample means, for example x̄ = AVERAGE(known_x's) and ȳ = AVERAGE(known_y's).
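A quick numeric check of these sample-mean formulas, with made-up points lying exactly on y = 2x + 1:

```python
# Slope and intercept from the sample-mean formulas used by LINEST.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]                      # exactly y = 2x + 1, invented
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n      # sample means (AVERAGE)
m = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - m * mx
print(m, b)                            # -> 2.0 1.0; LINEST would return {m, b}
```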

The line- and curve-fitting functions LINEST and LOGEST can calculate the straight line or exponential curve that best fits the data. However, they do not answer the question of which of the two results better suits the task at hand. You can also calculate TREND(known_y's, known_x's) for a straight line, or GROWTH(known_y's, known_x's) for an exponential curve. These functions, if new_x's is omitted, return an array of predicted y-values for the actual x-values along the line or curve. You can then compare the predicted values with the actual values, or chart both for a visual comparison.

When performing regression analysis, Microsoft Excel calculates, for each point, the squared difference between the predicted y-value and the actual y-value. The sum of these squared differences is called the residual sum of squares, ssresid. Microsoft Excel then calculates the total sum of squares, sstotal. If const = TRUE or omitted, the total sum of squares is the sum of the squared differences between the actual y-values and the mean of the y-values. When const = FALSE, the total sum of squares is the sum of the squares of the actual y-values (without subtracting the mean y-value from each individual y-value). The regression sum of squares is then ssreg = sstotal - ssresid. The smaller the residual sum of squares, the larger the coefficient of determination r2, which shows how well the equation obtained by the regression analysis explains the relationships among the variables. r2 equals ssreg/sstotal.

In some cases, one or more of the X columns (assume the Y and X values are in columns) have no additional predictive value in the presence of the other X columns. In other words, removing one or more X columns might lead to predicted Y values that are equally accurate. In that case, the redundant X columns are excluded from the regression model. This phenomenon is called "collinearity", because any redundant X column can be expressed as a sum of multiples of the non-redundant X columns. LINEST checks for collinearity and removes any redundant X columns it finds from the regression model. Removed X columns can be recognized in the LINEST output by a coefficient of 0 and an se value of 0.

Removing one or more columns as redundant changes df, because df depends on the number of X columns actually used for prediction. For more information on calculating df, see Example 4 below. When df changes because of the removal of redundant columns, sey and F also change. In practice, collinearity should be relatively rare. However, it is more likely to arise when some X columns contain only 0 and 1 values as indicators of whether a subject in an experiment belongs to a particular group. If const = TRUE or omitted, LINEST effectively inserts an additional X column of 1s to model the intercept. If there is a column of 1s for males and 0s for females, and also a column of 1s for females and 0s for males, the latter column is removed, because its values can be derived from the "male indicator" column.

df is calculated as follows when no X columns are removed from the model due to collinearity: if there are k columns of known_x's and const = TRUE or omitted, then df = n - k - 1; if const = FALSE, then df = n - k. In both cases, each X column removed due to collinearity increases df by 1.

Formulas that return arrays must be entered as array formulas.

When entering an array constant such as known_x's, use commas to separate values in the same row and semicolons to separate rows (in some locales, such as Russian, semicolons and colons are used instead). The separator characters depend on the regional settings in the Control Panel.

Note that the y-values predicted by the regression equation may not be valid if they lie outside the range of y-values used to fit the equation.

The underlying algorithm used in LINEST differs from the underlying algorithm of the SLOPE and INTERCEPT functions. The difference between the algorithms can lead to different results with undetermined and collinear data. For example, if the data points of known_y's are all 0 and the data points of known_x's are all 1, then:

LINEST returns a value of 0. The LINEST algorithm is designed to return reasonable results for collinear data, and in this case at least one answer can be found.

The SLOPE and INTERCEPT functions return the #DIV/0! error. Their algorithm is designed to find only one answer, and in this case there can be more than one.

In addition to calculating statistics, LINEST can be used for other types of regression by entering functions of the x and y variables as the x- and y-series for LINEST. For example, the following formula:

=LINEST(y_values, x_values^COLUMN($A:$C))

works when you have a single column of y-values and a single column of x-values, and computes the cubic (3rd-degree polynomial) approximation of the form:

y = m1x + m2x² + m3x³ + b

The formula can be changed to calculate other types of regression, but in some cases, adjustments to the output values ​​and other statistics are required.
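The same cubic model can be sketched outside Excel by regressing y on x, x², and x³; the data below are synthetic, generated from a known cubic so the recovered coefficients can be checked:

```python
# Least-squares cubic fit y = m1*x + m2*x^2 + m3*x^3 + b,
# the same model the x^COLUMN($A:$C) trick builds for LINEST.

def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[piv] = M[piv], M[col]
        for i in range(col + 1, n):
            f = M[i][col] / M[col][col]
            for j in range(col, n + 1):
                M[i][j] -= f * M[col][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

xs = [1, 2, 3, 4, 5, 6]
ys = [x**3 - 2 * x**2 + 3 * x + 5 for x in xs]   # exact cubic, synthetic
X = [[x, x**2, x**3, 1.0] for x in xs]           # columns: x, x^2, x^3, const
k = 4
XtX = [[sum(r[p] * r[q] for r in X) for q in range(k)] for p in range(k)]
Xty = [sum(r[p] * y for r, y in zip(X, ys)) for p in range(k)]
m1, m2, m3, b = solve(XtX, Xty)
print(round(m1, 4), round(m2, 4), round(m3, 4), round(b, 4))
```

Since the data lie exactly on y = x³ - 2x² + 3x + 5, the fit recovers m1 = 3, m2 = -2, m3 = 1, b = 5 up to rounding.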

28 Oct

Good afternoon, dear blog readers! Today we are going to talk about nonlinear regressions. A solution for linear regression can be found at the LINK.

This method is used mainly in economic modeling and forecasting. Its purpose is to observe and identify the relationship between two indicators.

The main types of nonlinear regressions are:

  • polynomial (quadratic, cubic);
  • hyperbolic;
  • power-law;
  • exponential;
  • logarithmic.

Various combinations can also be used. For example, for time series analysis in banking, insurance, and demographic studies, the Gompertz curve, a kind of logarithmic regression, is used.

In forecasting with nonlinear regressions, the main thing is to find the correlation coefficient, which shows whether there is a close relationship between the two parameters. As a rule, if the correlation coefficient is close to 1, there is a relationship and the forecast will be fairly accurate. Another important element of nonlinear regressions is the mean relative error (A): if it is within 8-10%, the model is sufficiently accurate.

That concludes the theoretical block; let's move on to practical calculations.

We have a table of car sales over a 15-year interval (denote it X); the number of measurement steps is the argument n; we also have revenue for these periods (denote it Y). We need to predict future revenue. Let's build the following table:

For the study, we need to solve the equation (the dependence of Y on X): y = ax² + bx + c + e. This is a paired quadratic regression. We apply the method of least squares to find the unknown coefficients a, b, and c. It leads to a system of algebraic equations of the form:

a·Σx⁴ + b·Σx³ + c·Σx² = Σx²y
a·Σx³ + b·Σx² + c·Σx = Σxy
a·Σx² + b·Σx + c·n = Σy

To solve this system we use, for example, Cramer's method. The sums included in the system are the coefficients of the unknowns. To calculate them, add several columns (D, E, F, G, H) to the table and label them according to the calculation they hold: in column D square x, in E cube it, in F raise it to the 4th power, in G multiply x and y, and in H multiply x² by y.

You will get a table of the form filled with the necessary for solving the equation.

Let's form matrix A of the system, consisting of the coefficients of the unknowns on the left-hand sides of the equations. Place it at cell A22 and label it "A =". We follow the system of equations we chose for the regression.

That is, in cell B21 we place the sum of the column where x was raised to the fourth power, F17; we simply reference the cell: "=F17". Next we need the sum of the column where x was cubed, E17, and so on, strictly following the system. In this way we fill in the entire matrix.

In accordance with Cramer's algorithm, we build a matrix A1, similar to A, in which the elements of the first column are replaced by the elements of the right-hand sides of the system's equations: the sum of the column x² times Y, the sum of the column XY, and the sum of the column Y.

We also need two more matrices, A2 and A3, in which the second and third columns, respectively, are replaced by the right-hand sides of the equations. The picture will look like this.

Following the chosen algorithm, we calculate the determinants (D) of the resulting matrices using the MDETERM function. Place the results in cells J21:K24.

The coefficients of the equation are then computed, per Cramer, in the cells opposite the corresponding determinants: a (in cell M22): "=K22/K21"; b (in cell M23): "=K23/K21"; c (in cell M24): "=K24/K21".

We get our required pairwise quadratic regression equation:

y = -0.074x² + 2.151x + 6.523
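The Cramer pipeline above can also be sketched in Python; since the sales table itself is not reproduced here, the data below are synthetic, generated from a known quadratic so the recovered coefficients can be verified:

```python
# Quadratic regression y = a*x^2 + b*x + c solved by Cramer's rule,
# mirroring the determinant-based worksheet steps. Data are synthetic.

def det3(m):
    """Determinant of a 3x3 matrix (expansion along the first row)."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

xs = [1, 2, 3, 4, 5]
ys = [2 * x**2 - 3 * x + 4 for x in xs]        # exact y = 2x^2 - 3x + 4
n = len(xs)
sx, sx2 = sum(xs), sum(x**2 for x in xs)
sx3, sx4 = sum(x**3 for x in xs), sum(x**4 for x in xs)
sy = sum(ys)
sxy = sum(x * y for x, y in zip(xs, ys))
sx2y = sum(x * x * y for x, y in zip(xs, ys))

A = [[sx4, sx3, sx2], [sx3, sx2, sx], [sx2, sx, n]]   # system matrix
rhs = [sx2y, sxy, sy]

def replaced(col):
    """Matrix A with the given column replaced by the right-hand sides."""
    return [[rhs[i] if j == col else A[i][j] for j in range(3)] for i in range(3)]

d = det3(A)
a, b, c = (det3(replaced(j)) / d for j in range(3))   # Cramer's rule
print(round(a, 4), round(b, 4), round(c, 4))
```

With exact quadratic data the method recovers a = 2, b = -3, c = 4.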

Let us estimate the closeness of the relationship using the correlation index.

To calculate it, add an extra column J to the table (call it y*). The calculation (from the regression equation we obtained) is "=$M$22*B2*B2+$M$23*B2+$M$24". Place it in cell J2, then drag the fill handle down to cell J16.

To calculate the sums (Y - Ȳ)², add columns K and L to the table with the corresponding formulas. The mean of the Y column is calculated using the AVERAGE function.

In cell K25, place the formula for the correlation index: "=SQRT(1-(K17/L17))".

We see that the value of 0.959 is very close to 1, which means there is a close non-linear relationship between sales and years.

It remains to evaluate the quality of fit of the quadratic regression equation (the determination index), which equals the square of the correlation index. So the formula in cell K26 is simply "=K25*K25".

The 0.920 factor is close to 1, which indicates a high quality fit.

The last step is to calculate the relative error. Add a column with the formula "=ABS((C2-J2)/C2)" (ABS returns the absolute value). Drag the fill handle down, display the average (AVERAGE) in cell M18, and format the cells as percentages. The result, 7.79%, is within the acceptable error range of 8-10%, so the calculations are sufficiently accurate.
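The three checks just performed (correlation index, determination index, mean relative error) take only a few lines of Python; the actual and predicted values below are invented for illustration:

```python
# Correlation index, determination index, and mean relative error
# for a fitted nonlinear model. Values are invented.
from math import sqrt

actual    = [3.0, 6.0, 13.0, 24.0, 39.0]
predicted = [3.2, 5.8, 13.5, 23.5, 39.5]

mean_y = sum(actual) / len(actual)
ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
ss_tot = sum((y - mean_y) ** 2 for y in actual)

R = sqrt(1 - ss_res / ss_tot)          # correlation index, like =SQRT(1-(K17/L17))
R2 = R * R                             # determination index, like =K25*K25
A = sum(abs((y - p) / y) for y, p in zip(actual, predicted)) / len(actual)
print(round(R, 3), round(R2, 3), round(A * 100, 2), "%")
```

Here R is close to 1 and A is well under 8%, so by the criteria above this hypothetical model would count as accurate.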

If the need arises, we can build a graph based on the obtained values.

An example file is attached - LINK!


CORRELATION-REGRESSION ANALYSIS IN MS EXCEL

1. Create a source data file in MS Excel (for example, Table 2).

2. Construction of the correlation field

To construct the correlation field, select Insert / Chart from the menu. In the dialog box that appears, select the chart type Point (Scatter), a scatter chart that allows you to compare pairs of values (Fig. 22).

Figure 22 - Choosing a chart type


Figure 23 - Window view when selecting a range and rows
Figure 25 - Window view, step 4

2. In the context menu, select the Add Trendline command.

3. In the dialog box that appears, select the type of graph (in our example, linear) and the parameters of the equation, as shown in Figure 26.


Click OK. The result is shown in Figure 27.

Figure 27 - Correlation field of the dependence of labor productivity on capital-labor ratio

Similarly, we construct the correlation field of the dependence of labor productivity on the equipment replacement ratio (Figure 28).


Figure 28 - Correlation field of the dependence of labor productivity on the equipment replacement ratio

3. Construction of the correlation matrix.

To build a correlation matrix, choose Data Analysis from the Service menu.

Using the Regression data analysis tool, in addition to regression statistics, analysis of variance, and confidence intervals, you can obtain residuals and charts of the fitted regression line, the residuals, and normal probability. To do this, check that the analysis package is available: from the main menu, select Service / Add-ins and check the Analysis package box (Figure 29).


Figure 30 - Dialog box Data analysis

After clicking OK, in the dialog box that appears, specify the input range (in our example, A2:D26), the grouping (in our case, by columns), and the output options, as shown in Figure 31.


Figure 31 - Dialog box Correlation

The calculation result is presented in Table 4.

Table 4 - Correlation matrix

          Column 1   Column 2   Column 3
Column 1
Column 2
Column 3

SINGLE-FACTOR REGRESSION ANALYSIS

USING THE REGRESSION TOOL

To conduct a regression analysis of the dependence of labor productivity on the capital-labor ratio, choose Data Analysis from the Service menu and select the Regression analysis tool (Figure 32).


Figure 33 - Dialog box Regression

Regression shows the effect of some values (the independent variables) on a dependent variable. For example, how the economically active population depends on the number of enterprises, wage levels, and other parameters; or how foreign investment, energy prices, and so on affect the level of GDP.

The result of the analysis lets you set priorities and, based on the main factors, forecast and plan the development of priority areas and make management decisions.

Regression comes in several forms:

Linear (y = a + bx);

Parabolic (y = a + bx + cx²);

Exponential (y = a·exp(bx));

Power (y = a·x^b);

Hyperbolic (y = b/x + a);

Logarithmic (y = b·ln(x) + a);

Exponential (y = a·b^x).
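Several of these nonlinear forms can be reduced to the linear case by a change of variables. For example, taking logarithms of y = a·exp(bx) gives ln(y) = ln(a) + bx, a straight line. A sketch with synthetic data:

```python
# Fitting y = a*exp(b*x) by linearizing it: ln(y) = ln(a) + b*x,
# then an ordinary least-squares line fit. Data are synthetic.
from math import exp, log

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * exp(0.5 * x) for x in xs]      # exact y = 2*e^(0.5x), invented

lys = [log(y) for y in ys]                 # linearized y-values
n = len(xs)
mx, ml = sum(xs) / n, sum(lys) / n
b = sum((x - mx) * (l - ml) for x, l in zip(xs, lys)) / sum((x - mx) ** 2 for x in xs)
a = exp(ml - b * mx)                       # undo the log on the intercept
print(round(a, 6), round(b, 6))
```

Because the data are generated from y = 2·e^(0.5x), the fit recovers a = 2 and b = 0.5.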

Let's look at an example of building a regression model in Excel and interpreting the results. Let's take a linear regression type.

Task. At 6 enterprises, the average monthly salary and the number of employees who quit were analyzed. It is necessary to determine how the number of employees who quit depends on the average salary.

The linear regression model is as follows:

Y = a0 + a1x1 + … + akxk,

where a are the regression coefficients, x are the influencing variables, and k is the number of factors.

In our example, Y is the indicator of employees who quit. The influencing factor is wages (x).

Excel has built-in functions that you can use to calculate the parameters of a linear regression model. But the Analysis Package add-in will do it faster.

We activate a powerful analytical tool:

1. Click the "Office" button and go to "Excel Options", then "Add-ins".

2. At the bottom, under the drop-down list, the "Manage" field will show "Excel Add-ins" (if it does not, select it from the list on the right). Click "Go".

3. A list of available add-ins opens. Select "Analysis Package" and click OK.

Upon activation, the add-in will be available on the "Data" tab.

Now let's go directly to the regression analysis.

1. Open the "Data Analysis" tool menu. We select "Regression".



2. A menu opens for selecting the input values and the output options (where to display the result). In the input fields, specify the range of the dependent parameter (Y) and the factor influencing it (X). The rest can be left blank.

3. After clicking OK, the program displays the calculations on a new sheet (you can instead select a range on the current sheet or direct the output to a new workbook).

First of all, pay attention to the R-square and the coefficients.

R-squared is the coefficient of determination. In our example it is 0.755, or 75.5%, meaning the calculated parameters of the model explain 75.5% of the relationship between the studied parameters. The higher the coefficient of determination, the better the model: above 0.8 is good, below 0.5 is poor (such an analysis can hardly be considered reasonable). In our example it is "not bad".

The coefficient 64.1428 shows what Y will be if all the variables in the model are 0. That is, the value of the analyzed parameter is also affected by factors not described in the model.

The coefficient -0.16285 shows the weight of variable X on Y. That is, within this model, the average monthly salary affects the number of quitters with a weight of -0.16285 (a small degree of influence). The "-" sign indicates a negative impact: the higher the salary, the fewer people quit. Which is fair.
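A small sketch of using the two reported coefficients for prediction (the salary values below are invented; units follow the source data):

```python
# Sketch: plugging the fitted coefficients reported above into the model
# to make predictions. The salary values are invented for illustration.
INTERCEPT = 64.1428    # Y when all factors in the model are 0
SLOPE = -0.16285       # weight of the salary X on the quits Y

def predict_quits(salary):
    """Predicted number of quitters for a given average monthly salary."""
    return INTERCEPT + SLOPE * salary

for salary in (100, 200, 300):
    print(salary, round(predict_quits(salary), 2))
```

As expected from the negative slope, the predicted number of quitters falls as salary rises.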

Excel is known to be useful in various fields of activity, including disciplines such as econometrics, where it is part of the standard toolkit. Most practical and laboratory exercises are performed in Excel, which greatly simplifies the work by giving detailed explanations of the individual steps. Thus, the "Regression" analysis tool is used to fit a curve to a set of observations using the least squares method. Let's consider what this tool is and how it benefits users. Below is also a short, easy-to-follow guide to building a regression model.

Main tasks and types of regression

Regression is a relationship between given variables that makes it possible to predict the future behavior of those variables. Variables here are various periodic phenomena, including human behavior. This kind of Excel analysis examines the impact of one or more variables on a particular dependent variable. For example, sales in a store are influenced by several factors, including assortment, prices, and store location. Thanks to regression in Excel, you can determine the degree of influence of each of these factors from existing sales results, and then apply the findings to forecast sales for another month or for a nearby store.

Regression is usually presented as a simple equation that reveals the relationship and strength of the relationship between two groups of variables, where one group is dependent or endogenous, and the other is independent or exogenous. In the presence of a group of interrelated indicators, the dependent variable Y is determined based on the logic of reasoning, and the rest act as independent X-variables.

The main tasks of building a regression model are as follows:

  1. Selection of significant independent variables (X1, X2,…, Xk).
  2. Selecting the type of function.
  3. Construction of estimates for the coefficients.
  4. Construction of confidence intervals and regression functions.
  5. Checking the significance of the calculated estimates and the constructed regression equation.

There are several types of regression analysis:

  • paired (1 dependent and 1 independent variable);
  • multiple (several independent variables).

There are two types of regression equations:

  1. Linear, illustrating a strict linear relationship between variables.
  2. Nonlinear - Equations that can include powers, fractions, and trigonometric functions.

Model building instruction

To build the model in Excel, follow these steps:


For further calculation, use the LINEST function, specifying the Y-values, X-values, const, and statistics arguments. Then define the set of points on the regression line using the TREND function, with the Y-values, X-values, new X-values, and const arguments. Using these parameters, calculate the unknown coefficients based on the given conditions of the problem.