The correlation function in Microsoft Excel. An example of calculating correlation, constructing a linear regression, and testing the hypothesis that two random variables are dependent, using our service

Correlation analysis is a popular statistical research method used to identify the degree to which one indicator depends on another. Microsoft Excel has a special tool designed to perform this type of analysis. Let's find out how to use it.

The essence of correlation analysis

The purpose of correlation analysis is to identify whether a relationship exists between various factors. That is, it determines whether a decrease or increase in one indicator is accompanied by a change in another.

If a dependence is established, the correlation coefficient is determined. Unlike regression analysis, this is the only indicator that this method of statistical research calculates. The correlation coefficient ranges from -1 to +1. With a positive correlation, an increase in one indicator goes along with an increase in the second. With a negative correlation, an increase in one indicator goes along with a decrease in the other. The larger the absolute value of the correlation coefficient, the more strongly a change in one indicator is reflected in the other. When the coefficient is 0, there is no linear dependence between them at all.

Calculation of the correlation coefficient

Now let's calculate the correlation coefficient for a specific example. We have a table in which advertising costs and sales volumes are shown monthly in separate columns. We need to find out the degree to which the number of sales depends on the amount of money spent on advertising.

Method 1: Defining Correlation Using the Function Wizard

One way to perform correlation analysis is to use the CORREL function. The function has the general form CORREL(array1, array2).

  1. Select the cell in which the calculation result should be displayed. Click on the “Insert Function” button, which is located to the left of the formula bar.
  2. In the list presented in the Function Wizard window, look for and select the CORREL function. Click on the “OK” button.
  3. The function arguments window opens. In the “Array1” field, enter the range of cells for one of the two variables whose dependence is to be determined. In our case, these are the values in the “Sales value” column. To enter the range into the field, simply select all the cells with data in that column.

    In the “Array2” field, enter the range of the second column. For us this is advertising costs. Enter it in exactly the same way as in the previous case.

    Click on the “OK” button.

As you can see, the correlation coefficient appears as a number in the cell we selected earlier. In this case it equals 0.97, which indicates a very strong dependence of one value on the other.
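For readers who want to check a CORREL result outside Excel, here is a minimal Python sketch of the same Pearson formula. The function name `correl` and the monthly figures are assumptions made for illustration; they are not the article's table, which is not reproduced in the text.

```python
import math

def correl(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products and squares of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical monthly figures (illustrative only)
advertising = [10, 12, 15, 17, 20, 24]
sales = [110, 118, 131, 140, 152, 170]

r = correl(advertising, sales)
print(round(r, 3))
```

The two lists play the role of the “Array1” and “Array2” ranges; swapping them does not change the result.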

Method 2: Calculate correlation using the Analysis ToolPak

Alternatively, correlation can be calculated using one of the tools provided in the Analysis ToolPak add-in. But first we need to activate it.

  1. Go to the “File” tab.
  2. In the window that opens, go to the “Options” section.
  3. Next, go to the “Add-ins” item.
  4. At the bottom of the next window, in the “Manage” box, select “Excel Add-ins” if something else is selected. Click the “Go” button.
  5. In the add-ins window, check the box next to “Analysis ToolPak”. Click the “OK” button.
  6. After this, the Analysis ToolPak is activated. Go to the “Data” tab. As you can see, a new group of tools, “Analysis”, has appeared on the ribbon. Click the “Data Analysis” button located in it.
  7. A list of data analysis options opens. Select “Correlation” and click the “OK” button.
  8. A window with the correlation analysis parameters opens. Unlike the previous method, in the “Input Range” field we enter not each column separately but all the columns that take part in the analysis. In our case, this is the data in the “Advertising costs” and “Sales value” columns.

    We leave the “Grouped By” parameter unchanged at “Columns”, since our data is arranged in two columns. If it were arranged by rows, the switch would have to be moved to “Rows”.

    In the output options, “New Worksheet Ply” is set by default, that is, the results are written to another sheet. You can change the location by moving the switch: to the current sheet (then you must specify the cell where the output starts) or to a new workbook (file).

    When all the settings are set, click on the “OK” button.

Since the output location was left at the default, we switch to the new sheet. As you can see, the correlation coefficient is shown here. Naturally, it is the same as with the first method: 0.97. Both options perform the same calculation; they just go about it in different ways.

As you can see, Excel offers two methods of correlation analysis. If you do everything correctly, the results are identical, so each user can choose whichever option is more convenient.

Regression and correlation analysis are statistical research methods. They are the most common ways to show how a parameter depends on one or more independent variables.

Below, using specific practical examples, we look at these two analyses, which are very popular among economists. We also give an example of combining them.

Regression Analysis in Excel

Regression analysis shows the influence of some values (the independent variables) on the dependent variable. For example, how the size of the economically active population depends on the number of enterprises, wage levels, and other parameters. Or: how foreign investment, energy prices, and so on affect the level of GDP.

The result of the analysis allows you to set priorities and, based on the main factors, to forecast, plan the development of priority areas, and make management decisions.

Regression can be:

  • linear (y = a + bx);
  • parabolic (y = a + bx + cx^2);
  • exponential (y = a * exp(bx));
  • power (y = a * x^b);
  • hyperbolic (y = b/x + a);
  • logarithmic (y = b * ln(x) + a);
  • exponential with base b (y = a * b^x).
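To make the formulas in the list concrete, the sketch below writes each model family as a small Python function. The dictionary name `REGRESSION_MODELS` and the parameter values used in the loop are illustrative assumptions, not part of the original article.

```python
import math

# Each model family from the list above as a function of x.
# Parameter names a, b, c follow the formulas in the text.
REGRESSION_MODELS = {
    "linear": lambda x, a, b: a + b * x,
    "parabolic": lambda x, a, b, c: a + b * x + c * x ** 2,
    "exponential": lambda x, a, b: a * math.exp(b * x),
    "power": lambda x, a, b: a * x ** b,
    "hyperbolic": lambda x, a, b: b / x + a,
    "logarithmic": lambda x, a, b: b * math.log(x) + a,
    "exponential_base_b": lambda x, a, b: a * b ** x,
}

# Evaluate every two-parameter model at x = 2 with illustrative a = 1, b = 3
for name, model in REGRESSION_MODELS.items():
    if name != "parabolic":
        print(name, round(model(2.0, 1.0, 3.0), 4))

# The parabolic model takes a third parameter c (here an illustrative 0.5)
print("parabolic", REGRESSION_MODELS["parabolic"](2.0, 1.0, 3.0, 0.5))
```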

Let's look at an example of building a regression model in Excel and interpreting the results, using the linear type of regression.

Task. At 6 enterprises, the average monthly salary and the number of employees who quit were analyzed. Determine how the number of employees who quit depends on the average salary.

The linear regression model has the following form:

Y = a0 + a1*x1 + … + ak*xk,

where a are the regression coefficients, x are the influencing variables, and k is the number of factors.

In our example, Y is the number of employees who quit. The influencing factor is the salary (x).

Excel has built-in functions that can calculate the parameters of a linear regression model. But the Analysis ToolPak add-in does this faster.

We activate this powerful analytical tool:

  1. Click the “Office” button and open “Excel Options”, then the “Add-ins” section.
  2. At the bottom, in the “Manage” drop-down list, “Excel Add-ins” should be selected (if it is not, open the list and select it). Click the “Go” button.
  3. A list of available add-ins opens. Select “Analysis ToolPak” and click OK.

Once activated, the add-in is available on the “Data” tab.

Now let's do the regression analysis itself.

  1. Open the “Data Analysis” tool menu. Select “Regression”.
  2. A dialog opens for selecting the input ranges and output options (where to display the result). In the input fields, specify the range of the parameter being described (Y) and of the factor influencing it (X). The other fields can be left empty.
  3. After you click OK, the program displays the calculations on a new sheet (you can instead select a range on the current sheet or send the output to a new workbook).

First of all, we look at R-squared and the coefficients.

R-squared is the coefficient of determination. In our example it is 0.755, or 75.5%. This means that the model explains 75.5% of the variation in the dependent variable. The higher the coefficient of determination, the better the model. Above 0.8 is good; below 0.5 is poor (such an analysis can hardly be considered meaningful). In our example the fit is “not bad”.

The coefficient 64.1428 shows what Y will be if all variables in the model are equal to 0. That is, the value of the analyzed parameter is also influenced by other factors not described in the model.

The coefficient -0.16285 shows the weight of variable X on Y. That is, within this model each one-unit increase in the average monthly salary changes the number of employees who quit by -0.16285 (a small degree of influence). The “-” sign indicates a negative effect: the higher the salary, the fewer people quit. Which is reasonable.
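As a rough illustration of how the two coefficients are used together, the sketch below plugs them into the linear model Y = a0 + a1*x. The function name `predicted_quits` and the salary values are hypothetical; the source table and its units are not given in the text.

```python
INTERCEPT = 64.1428   # a0 from the regression output quoted in the text
SLOPE = -0.16285      # a1 from the regression output quoted in the text

def predicted_quits(avg_salary):
    """Number of quitting employees predicted by the fitted line."""
    return INTERCEPT + SLOPE * avg_salary

# Hypothetical salary levels, in the same (unstated) units as the source table
for salary in (100, 200, 300):
    print(salary, round(predicted_quits(salary), 2))
```

Each extra unit of salary lowers the prediction by 0.16285, which is exactly what the negative sign of the coefficient expresses.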

Correlation Analysis in Excel

Correlation analysis helps determine whether there is a relationship between indicators in one or two samples. For example, between the operating time of a machine and the cost of its repairs, the price of equipment and its service life, the height and weight of children, and so on.

If there is a relationship, does an increase in one parameter lead to an increase (positive correlation) or a decrease (negative correlation) in the other? Correlation analysis helps the analyst determine whether the value of one indicator can be used to predict the possible value of another.

The correlation coefficient is denoted by r and varies from -1 to +1. The classification of correlation strength differs from field to field. When the coefficient is 0, there is no linear relationship between the samples.

Let's look at how to find the correlation coefficient using Excel.

To find pairwise coefficients, use the CORREL function.

Objective: determine whether there is a relationship between the operating time of a lathe and the cost of its maintenance.

Place the cursor in any cell and press the fx button.

  1. In the “Statistical” category, select the CORREL function.
  2. Argument “Array 1”: the first range of values, the machine operating time: A2:A14.
  3. Argument “Array 2”: the second range of values, the repair cost: B2:B14. Click OK.

To determine the type of relationship, look at the absolute value of the coefficient (each field of activity has its own scale).

For correlation analysis of several parameters (more than 2), it is more convenient to use “Data Analysis” (the Analysis ToolPak add-in). Select “Correlation” from the list and specify the input range. That is all.

The resulting coefficients will be displayed in the correlation matrix. Like this:
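The kind of matrix that the Correlation tool produces can be sketched in Python as follows. This is a minimal sketch, not Excel's implementation: the helper names and the three data columns are hypothetical and chosen only for illustration.

```python
import math

def correl(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Pairwise correlations for a dict of named, equal-length columns."""
    names = list(columns)
    return {a: {b: correl(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical columns (illustrative only)
data = {
    "hours": [100, 150, 210, 260, 300, 380],
    "repair_cost": [12, 15, 22, 24, 31, 37],
    "price": [50, 48, 52, 47, 51, 49],
}

matrix = correlation_matrix(data)
for row_name, row in matrix.items():
    print(row_name, [round(value, 2) for value in row.values()])
```

The diagonal is always 1 (each column correlates perfectly with itself) and the matrix is symmetric, which is why Excel fills in only the lower triangle.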

Correlation and regression analysis

In practice, these two techniques are often used together.

  1. Build a correlation field: “Insert” - “Chart” - “Scatter” (it allows you to compare pairs of values). The data range is all the numeric data in the table.
  2. Left-click any point on the chart, then right-click. In the menu that opens, select “Add Trendline”.
  3. Set the parameters for the line. Type: “Linear”. At the bottom, check “Display Equation on chart”.
  4. Click “Close”.

Now the regression analysis data is visible on the chart.

1. Open Excel.

2. Create the data columns. In our example, we consider the relationship, or correlation, between aggressiveness and self-doubt in first-graders. 30 children took part in the experiment; the data is presented in an Excel table:

Column 1 - subject number

Column 2 - aggressiveness in points

Column 3 - self-doubt in points

3. Select an empty cell next to the table and click the f(x) icon in the Excel toolbar.

4. The function menu opens. Select the Statistical category, then find CORREL in the alphabetical list of functions and click OK.

5. The function arguments window opens, which lets you select the data columns you need. To select the first column, Aggressiveness, click the blue button next to the Array1 field.

6. Select the data for Array1 from the Aggressiveness column and click the blue button in the dialog box.

7. Then, as with Array1, click the blue button next to the Array2 field.

8. Select the data for Array2, the Self-doubt column, press the blue button again, and then click OK.

9. The Pearson correlation coefficient r has now been calculated and written to the selected cell. In our case it is positive and approximately equal to 0.225. This indicates a moderate positive relationship between aggressiveness and self-doubt in first-graders.

Thus, the statistical conclusion of the experiment is: r = 0.225; a moderate positive relationship was found between the variables aggressiveness and self-doubt.

Some studies require the p-level of significance of the correlation coefficient to be reported; however, Excel, unlike SPSS, does not provide this directly. That is not a problem: there are tables of critical values of the correlation coefficient (A.D. Nasledov).

You can also build a regression line in Excel and attach it to the research results.

Have you ever needed to calculate the degree of connection between two statistical quantities and determine the formula that relates them? A normal person may ask why this would be necessary at all. Oddly enough, it really is. Knowing reliable correlations can help you make crazy money if you are, say, a stock trader. The problem is that for some reason no one reveals these correlations (surprising, isn't it?).

Let's calculate them ourselves! For example, I decided to try to calculate the correlation of the ruble to the dollar through the euro. Let's look at how this is done in detail.

This article assumes an advanced level of Microsoft Excel proficiency. If you don't have time to read the entire article, you can download the file and figure it out yourself.

If you often find yourself needing to do something like this, I highly recommend that you consider purchasing the book Statistical Calculations in Excel.

What is important to know about correlations

To calculate a reliable correlation, you need a reliable sample; the larger it is, the more reliable the result. For this example I took a daily sample of exchange rates over 10 years. The data is freely available; I took it from http://oanda.com.

What I actually did

(1) Once I had the raw data, I started by checking the degree of correlation between the two data sets. To do this, I used the CORREL function, which returns the degree of correlation between two data ranges. The result, frankly speaking, was not particularly impressive (only about 0.7, or 70%). Generally speaking, the share of variation explained by a relationship is taken to be the square of this coefficient (the coefficient of determination), that is, the correlation turned out to be reliable by only about 49%. This is very little!

(2) This seemed very strange to me. What errors could have crept into my calculations? So I decided to plot a graph and see what was going on. The graph was deliberately broken down by year so that you could see where the correlation breaks down. The graph turned out like this:

(3) It is obvious from the graph that at around 35 rubles per euro the correlation splits into two parts. This is what made it unreliable. I needed to determine why this was happening.

(4) The color shows that these data refer to 2007, 2008, and 2009. Of course! Periods of economic peaks and recessions are usually statistically unreliable, which is what happened here. So I tried excluding these periods from the data (and, as a check, computed the correlation within this period alone). The correlation of these data alone is 0.01%, that is, it is absent altogether. Without them, however, the data correlate at approximately 81%. This is already a fairly reliable correlation. Here is a graph with the function.
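The procedure in steps (1) to (4), computing the coefficient, squaring it, and then recomputing it with an unreliable period excluded, can be sketched in Python. The rows below are made-up numbers, not the exchange-rate data from oanda.com, and the names `ROWS` and `correl_for` are assumptions for illustration.

```python
import math

def correl(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# (year, x, y) rows; all values are made up for illustration
ROWS = [
    (2005, 1.0, 2.1), (2005, 2.0, 3.9), (2006, 3.0, 6.2), (2006, 4.0, 7.8),
    (2008, 5.0, 1.0), (2008, 6.0, 12.0), (2008, 7.0, 2.0),
    (2010, 8.0, 16.1), (2010, 9.0, 17.9),
]

def correl_for(rows, excluded_years=()):
    """Correlation of x and y over the rows whose year is not excluded."""
    kept = [(x, y) for year, x, y in rows if year not in excluded_years]
    xs, ys = zip(*kept)
    return correl(xs, ys)

r_all = correl_for(ROWS)
r_filtered = correl_for(ROWS, excluded_years={2008})

print(f"all years:    r = {r_all:.3f}, r^2 = {r_all ** 2:.3f}")
print(f"without 2008: r = {r_filtered:.3f}, r^2 = {r_filtered ** 2:.3f}")
```

Excluding data changes what the result describes, so the excluded period should always be reported alongside the coefficient.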

Next steps

Theoretically, the correlation function can be refined by converting it from linear to exponential or logarithmic. In that case the statistical reliability of the correlation increases by approximately one percent, but the formula becomes far harder to apply. So I ask myself: is this really necessary? It is up to you to decide in each specific case.

For the territories of a region, data for 200X are given. Here x is the average per capita subsistence level per day of one able-bodied person, rub., and y is the average daily wage, rub.

Region number    x      y
1                78     133
2                82     148
3                87     134
4                79     154
5                89     162
6                106    195
7                67     139
8                88     158
9                73     152
10               87     162
11               76     159
12               115    173

Exercise:

1. Construct a correlation field and formulate a hypothesis about the form of the connection.

2. Calculate the parameters of the linear regression equation

4. Using the average (general) elasticity coefficient, give a comparative assessment of the strength of the relationship between the factor and the result.

7. Calculate the predicted value of the result if the predicted value of the factor increases by 10% from its average level. Determine the forecast confidence interval for the significance level.

Solution:

Let's solve this problem using Excel.

1. By comparing the available data x and y, for example by ranking them in increasing order of factor x, one can observe a direct relationship between the characteristics: as the average per capita subsistence level rises, the average daily wage rises too. Based on this, we can assume that the relationship between the characteristics is direct and can be described by a straight-line equation. Graphical analysis confirms the same conclusion.

To build a correlation field, you can use Excel. Enter the initial data in sequence: first x, then y.

Select the range of cells that contains the data.

Then choose: Insert / Scatter Plot / Scatter with Markers as shown in Figure 1.

Figure 1 Construction of the correlation field

Analysis of the correlation field shows a dependence that is close to linear, since the points lie almost along a straight line.

2. To calculate the parameters of the linear regression equation, let's use the built-in statistical function LINEST.

To do this:

1) Open an existing file containing the data to be analyzed;
2) Select a 5x2 area of empty cells (5 rows, 2 columns) to display the regression statistics;
3) Activate the Function Wizard: in the main menu, select Formulas / Insert Function;
4) In the Category box select Statistical, and in the function box select LINEST. Click OK, as shown in Figure 2;

Figure 2 Function Wizard Dialog Box

5) Fill in the function arguments:

Known_y's - the range containing the data of the resultant attribute (y);

Known_x's - the range containing the data of the factor attribute (x);

Const - a logical value that indicates the presence or absence of an intercept in the equation; if Const = 1, the intercept is calculated in the usual way; if Const = 0, the intercept is 0;

Stats - a logical value that indicates whether to output additional regression statistics. If Stats = 1, the additional information is output; if Stats = 0, only the estimates of the equation parameters are output.

Click OK;

Figure 3 LINEST Function Arguments Dialog Box

6) The first element of the resulting table appears in the upper-left cell of the selected area. To display the entire table, press F2 and then the key combination Ctrl+Shift+Enter.

The additional regression statistics are output in the order shown in the following layout:

Value of coefficient b        Value of coefficient a
Standard error of b           Standard error of a
R-squared                     Standard error of y
F-statistic                   Number of degrees of freedom
Regression sum of squares     Residual sum of squares

Figure 4 Result of calculating the LINEST function

We obtained the regression equation: $\hat{y} = 76.98 + 0.92x$.

We conclude: when the average per capita subsistence level rises by 1 rub., the average daily wage rises by an average of 0.92 rub.

The coefficient of determination is 0.52. This means that 52% of the variation in wages (y) is explained by the variation in factor x, the average per capita subsistence level, and 48% by other factors not included in the model.

From the coefficient of determination, the correlation coefficient can be calculated: $r = \sqrt{R^2} = \sqrt{0.52} \approx 0.72$.

The relationship is assessed as close.
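The LINEST figures can be cross-checked with a short least-squares calculation in Python on the table from the problem statement. This is a sketch of the textbook formulas, not of LINEST's internals; the variable names are my own.

```python
import math

# Data from the problem table: x = subsistence level, y = average daily wage
X = [78, 82, 87, 79, 89, 106, 67, 88, 73, 87, 76, 115]
Y = [133, 148, 134, 154, 162, 195, 139, 158, 152, 162, 159, 173]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

# Sums of squares and cross-products of deviations from the means
sxx = sum((x - mean_x) ** 2 for x in X)
syy = sum((y - mean_y) ** 2 for y in Y)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))

b = sxy / sxx                      # slope (regression coefficient)
a = mean_y - b * mean_x            # intercept
r_squared = sxy ** 2 / (sxx * syy) # coefficient of determination
r = math.sqrt(r_squared)           # correlation coefficient (b > 0, so r > 0)

print(f"y = {a:.2f} + {b:.2f}x")
print(f"R^2 = {r_squared:.2f}, r = {r:.2f}")
```

The slope and R-squared agree with the values quoted in the text (0.92 and 52%).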

4. Using the average (overall) elasticity coefficient, we determine the strength of the factor's influence on the result.

For a straight-line equation, the average (overall) elasticity coefficient is given by the formula:

$\bar{E} = b \cdot \bar{x} / \bar{y}$

We find the averages by selecting the cells with the x values and choosing Formulas / AutoSum / Average, and do the same for the y values.

Figure 5 Calculation of the average values of the function and the argument

Thus, if the average per capita subsistence level changes by 1% from its average value, the average daily wage changes by an average of 0.51%.
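The 0.51% figure can be reproduced with a few lines of Python, assuming the usual definition of the average elasticity for a straight line (slope times the ratio of the means). Variable names are illustrative.

```python
# Data from the problem table
X = [78, 82, 87, 79, 89, 106, 67, 88, 73, 87, 76, 115]
Y = [133, 148, 134, 154, 162, 195, 139, 158, 152, 162, 159, 173]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

# Least-squares slope of y on x
sxx = sum((x - mean_x) ** 2 for x in X)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
b = sxy / sxx

# Average elasticity: percentage change in y per 1% change in x, at the means
elasticity = b * mean_x / mean_y

print(f"mean x = {mean_x:.2f}, mean y = {mean_y:.2f}")
print(f"elasticity = {elasticity:.2f}")
```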

Using the Regression data analysis tool, you can obtain:
- regression statistics,
- analysis of variance results,
- confidence intervals,
- residual plots and line fit plots,
- residuals and normal probability.

The procedure is as follows:

1) Check that the Analysis ToolPak is available. In the main menu, select File / Options / Add-ins.

2) In the Manage drop-down list, select Excel Add-ins and click Go.

3) In the Add-ins window, check the Analysis ToolPak box and then click OK.

If Analysis ToolPak is not in the Add-ins available list, click Browse to locate it.

If you receive a message that the Analysis ToolPak is not installed on your computer, click Yes to install it.

4) In the main menu, select Data / Data Analysis / Analysis Tools / Regression and then click OK.

5) Fill in the data input and output options in the dialog box:

Input Y Range - the range containing the data of the resultant attribute;

Input X Range - the range containing the data of the factor attribute;

Labels - a flag that indicates whether the first row contains column names;

Constant is Zero - a flag that indicates the presence or absence of an intercept in the equation;

Output Range - it is enough to specify the upper-left cell of the future range;

6) New Worksheet Ply - you can specify an arbitrary name for the new sheet.

Then click OK.

Figure 6 Dialog box for entering parameters for the Regression tool

The results of the regression analysis for the problem data are presented in Figure 7.

Figure 7 Result of using the regression tool

5. Let's assess the quality of the equation using the average approximation error. We use the regression results presented in Figure 8.

Figure 8 Result of using the Regression tool: “Residual Output”

Let's create a new table as shown in Figure 9. In column C, we calculate the relative approximation error using the formula:

$A_i = \left| \frac{y_i - \hat{y}_i}{y_i} \right| \cdot 100\%$

Figure 9 Calculation of average approximation error

The average approximation error is calculated using the formula:

$\bar{A} = \frac{1}{n} \sum_{i=1}^{n} A_i$

The quality of the constructed model is assessed as good, since the average approximation error does not exceed 8 - 10%.
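The column C calculation can be sketched in Python as follows, assuming the standard definition of the relative approximation error (absolute residual divided by the observed value, in percent). Variable names are illustrative.

```python
# Data from the problem table
X = [78, 82, 87, 79, 89, 106, 67, 88, 73, 87, 76, 115]
Y = [133, 148, 134, 154, 162, 195, 139, 158, 152, 162, 159, 173]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

# Least-squares line y = a + b*x
sxx = sum((x - mean_x) ** 2 for x in X)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
b = sxy / sxx
a = mean_y - b * mean_x

# Relative approximation error of each observation, in percent
errors = [abs((y - (a + b * x)) / y) * 100 for x, y in zip(X, Y)]

# Average approximation error over all observations
avg_error = sum(errors) / n

for region, err in enumerate(errors, start=1):
    print(region, round(err, 2))
print("average approximation error, %:", round(avg_error, 2))
```

The average lands below the 8 - 10% threshold mentioned in the text.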

6. From the regression statistics table (Figure 4), we write down the actual value of Fisher's F-test:

Because the actual F value exceeds the tabulated value at the 5% significance level, we can conclude that the regression equation is significant (the relationship has been proven).
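For a one-factor regression, the actual F value can be recovered from R-squared and the number of observations, which gives a quick cross-check of the LINEST output. In the sketch below the constant `F_TABLE = 4.96` is the tabulated critical value for a 5% significance level with 1 and 10 degrees of freedom; it is entered by hand as an assumption, since the source's own number is not reproduced in the text.

```python
# Data from the problem table
X = [78, 82, 87, 79, 89, 106, 67, 88, 73, 87, 76, 115]
Y = [133, 148, 134, 154, 162, 195, 139, 158, 152, 162, 159, 173]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

sxx = sum((x - mean_x) ** 2 for x in X)
syy = sum((y - mean_y) ** 2 for y in Y)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))

r_squared = sxy ** 2 / (sxx * syy)

# F-statistic for simple linear regression: 1 and n - 2 degrees of freedom
f_actual = r_squared / (1 - r_squared) * (n - 2)

F_TABLE = 4.96  # assumed tabulated value: alpha = 0.05, df1 = 1, df2 = 10

print(f"F actual = {f_actual:.2f}, F table = {F_TABLE}")
print("equation is significant:", f_actual > F_TABLE)
```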

8. We assess the statistical significance of the regression parameters using Student's t-statistic and by calculating the confidence interval of each indicator.

We put forward the hypothesis H0 that the indicators do not differ from zero in a statistically significant way:

$a = b = r_{xy} = 0$.

The tabulated value of t for the number of degrees of freedom $df = n - 2 = 12 - 2 = 10$ and $\alpha = 0.05$ is 2.23.

Figure 7 contains the actual t-statistic values:

The t-test for the correlation coefficient can be calculated in two ways:

Method I:

$t_r = r / m_r$, where $m_r$ is the random error of the correlation coefficient.

We take the data for the calculation from the table in Figure 7.

Method II:

The actual t-statistic values exceed the tabulated values:

Therefore, the hypothesis H0 is rejected, that is, the regression parameters and the correlation coefficient do not differ from zero by chance, but are statistically significant.
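The t-statistics read from the Regression output can be recomputed from the raw data. The sketch below uses the standard textbook standard-error formulas for a one-factor regression; the symbol names (`se_a`, `se_b`, `m_r`) and the hand-entered constant `T_TABLE = 2.228` (two-sided 5% level, 10 degrees of freedom) are assumptions, not values copied from the source figures.

```python
import math

# Data from the problem table
X = [78, 82, 87, 79, 89, 106, 67, 88, 73, 87, 76, 115]
Y = [133, 148, 134, 154, 162, 195, 139, 158, 152, 162, 159, 173]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

sxx = sum((x - mean_x) ** 2 for x in X)
syy = sum((y - mean_y) ** 2 for y in Y)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))

b = sxy / sxx
a = mean_y - b * mean_x
r = sxy / math.sqrt(sxx * syy)

# Residual variance per degree of freedom
ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))
s2 = ss_res / (n - 2)

# Standard errors of the slope, the intercept, and the correlation coefficient
se_b = math.sqrt(s2 / sxx)
se_a = math.sqrt(s2 * sum(x * x for x in X) / (n * sxx))
m_r = math.sqrt((1 - r ** 2) / (n - 2))

t_b = b / se_b
t_a = a / se_a
t_r = r / m_r

T_TABLE = 2.228  # assumed tabulated value: alpha = 0.05 (two-sided), df = 10

print(f"t_a = {t_a:.2f}, t_b = {t_b:.2f}, t_r = {t_r:.2f}")
print("all significant:", all(t > T_TABLE for t in (t_a, t_b, t_r)))
```

In a one-factor regression t_b and t_r coincide, and both equal the square root of the F-statistic.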

The confidence interval for parameter a is defined as

For parameter a, the 95% limits as shown in Figure 7 were:

The confidence interval for the regression coefficient is defined as

For the regression coefficient b, the 95% limits as shown in Figure 7 were:

Analysis of the upper and lower limits of the confidence intervals leads to the conclusion that, with probability 0.95, the parameters a and b, lying within the specified limits, do not take zero values, that is, they are statistically significant and differ significantly from zero.

7. The obtained estimates of the regression equation allow it to be used for forecasting. If the predicted value of the subsistence level is:

$x_p = 1.1 \cdot \bar{x}$

Then the predicted value of the average daily wage will be:

$\hat{y}_p = a + b \cdot x_p$

We calculate the forecast error using the formula:

where

We also calculate the variance using Excel. To do this:

1) Activate Function Wizard: in the main menu select Formulas / Insert Function.

3) Fill in the range containing the numerical data of the factor characteristic. Click OK.

Figure 10 Calculation of variance

We got the variance value

To calculate the residual variance per degree of freedom, we use the results of the analysis of variance shown in Figure 7.

Confidence intervals for predicting individual values of y with a probability of 0.95 are determined by the expression:

The interval is quite wide, primarily because of the small number of observations. Overall, the forecast of the average daily wage turned out to be reliable.
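The forecast step can be sketched in Python under stated assumptions: the forecast error for an individual value uses the standard textbook formula for a one-factor regression, and `T_TABLE = 2.228` is the hand-entered tabulated t value for a probability of 0.95 and 10 degrees of freedom. Neither the formula image nor the numeric interval survives in the source, so treat the output as a cross-check rather than the author's figures.

```python
import math

# Data from the problem table
X = [78, 82, 87, 79, 89, 106, 67, 88, 73, 87, 76, 115]
Y = [133, 148, 134, 154, 162, 195, 139, 158, 152, 162, 159, 173]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

sxx = sum((x - mean_x) ** 2 for x in X)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
b = sxy / sxx
a = mean_y - b * mean_x

# Residual standard deviation per degree of freedom
ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))
s = math.sqrt(ss_res / (n - 2))

# Factor value 10% above its mean, and the point forecast
x_pred = 1.1 * mean_x
y_pred = a + b * x_pred

# Standard error of an individual forecast (assumed textbook formula)
m_pred = s * math.sqrt(1 + 1 / n + (x_pred - mean_x) ** 2 / sxx)

T_TABLE = 2.228  # assumed tabulated value: probability 0.95, df = 10
half_width = T_TABLE * m_pred
lower, upper = y_pred - half_width, y_pred + half_width

print(f"x_pred = {x_pred:.2f}, y_pred = {y_pred:.2f}")
print(f"95% interval: {lower:.2f} .. {upper:.2f}")
```

Note that `sxx` equals n times the population variance of x, which is why the source computes the variance of the factor at this step.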

The problem statement is taken from: Practicum on Econometrics: study guide / I.I. Eliseeva, S.V. Kurysheva, N.M. Gordeenko et al.; ed. I.I. Eliseeva. - M.: Finance and Statistics, 2003. - 192 p.: ill.

Let's calculate the correlation coefficient and covariance for different types of relationships between random variables.

The correlation coefficient (Pearson's correlation criterion, English: Pearson product-moment correlation coefficient) measures the degree of linear relationship between random variables.

As follows from the definition, calculating the correlation coefficient requires knowing the distribution of the random variables X and Y. If the distributions are unknown, the correlation coefficient is estimated by the sample correlation coefficient r (also denoted R_xy or r_xy):

$r_{xy} = \frac{\mathrm{Cov}(x, y)}{s_x \cdot s_y}$

where $s_x$ is the sample standard deviation of the random variable x, calculated by the formula:

$s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

As the formula for the correlation shows, the denominator (the product of the standard deviations) simply normalizes the numerator so that the correlation is a dimensionless number from -1 to 1. Correlation and covariance carry the same information (if the standard deviations are known), but correlation is more convenient to use because it is dimensionless.
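The point about dimensionless values can be seen in a small Python sketch: changing the units of one variable rescales the covariance but leaves the correlation unchanged. The function names and the height and weight figures are hypothetical, and the covariance here divides by n - 1 to match the sample standard deviation (an assumption about which convention is meant).

```python
import math

def sample_cov(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def sample_std(xs):
    """Sample standard deviation, the square root of cov(x, x)."""
    return math.sqrt(sample_cov(xs, xs))

def correlation(xs, ys):
    """Covariance normalized by the product of the standard deviations."""
    return sample_cov(xs, ys) / (sample_std(xs) * sample_std(ys))

# Hypothetical heights and weights (illustrative only)
heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]
weights_kg = [52, 58, 69, 75, 88]
heights_cm = [h * 100 for h in heights_m]

cov_m = sample_cov(heights_m, weights_kg)
cov_cm = sample_cov(heights_cm, weights_kg)
r_m = correlation(heights_m, weights_kg)
r_cm = correlation(heights_cm, weights_kg)

print(f"covariance: {cov_m:.3f} (m) vs {cov_cm:.1f} (cm)")
print(f"correlation: {r_m:.4f} (m) vs {r_cm:.4f} (cm)")
```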

Calculating the correlation coefficient and the sample covariance in MS EXCEL is not difficult, since there are special functions for this: CORREL() and COVAR(). It is much harder to work out how to interpret the values obtained; most of the article is devoted to this.

Theoretical digression

Recall that a correlation relationship is a statistical relationship in which different values of one variable correspond to different average values of the other (as the value of X changes, the average value of Y changes in a regular way). It is assumed that both variables X and Y are random variables and have some random scatter around their average values.

Note. If only one variable, for example Y, is random, and the values of the other are deterministic (set by the researcher), then we can only speak of regression.

Thus, for example, when studying how the average annual temperature changes over the years, one cannot speak of a correlation between temperature and the year of observation, nor apply correlation indicators with their corresponding interpretation.

Correlation between variables can arise in several ways:

  1. A causal relationship between the variables. For example, the amount of investment in scientific research (variable X) and the number of patents received (Y). The first variable acts as the independent variable (factor), the second as the dependent variable (outcome). Remember that dependence between quantities implies correlation between them, but not vice versa.
  2. Conjugation (a common cause). For example, as an organization grows, both the payroll and the cost of renting premises increase. Obviously, it would be wrong to assume that the rent depends on the payroll. In many cases, both variables depend linearly on the number of personnel.
  3. Mutual influence of the variables (when one changes, the other changes, and vice versa). With this approach two formulations of the problem are possible: either variable can act as the independent variable or as the dependent one.

Thus, the correlation coefficient shows how strong the linear relationship between two factors is (if there is one), while regression lets you predict one factor from the other.

Correlation, like any other statistical indicator, can be useful when used correctly, but it also has limitations. If the data show a clearly defined linear relationship or a complete lack of relationship, correlation reflects this very well. But if the data show a non-linear relationship (for example, quadratic), separate groups of values, or outliers, the calculated correlation coefficient may be misleading (see the example file).

A correlation close to 1 or -1 (i.e. close to 1 in absolute value) indicates a strong linear relationship between the variables; a value close to 0 indicates no linear relationship. A positive correlation means that as one indicator increases, the other increases on average; with a negative correlation, it decreases on average.

To calculate the correlation coefficient, it is required that the compared variables satisfy the following conditions:

  • the number of variables must be two;
  • the variables must be quantitative (e.g. frequency, weight, price). The calculated average of such variables has a clear meaning: the average price or the average patient weight. Unlike quantitative variables, qualitative (nominal) variables take values only from a finite set of categories (for example, gender or blood type). Such values are only conventionally assigned numbers (for example, female is 1 and male is 2). Clearly, in this case calculating the average, which is required to find the correlation, is meaningless, and so is the correlation itself;
  • variables must be random variables and have .

Bivariate data can have different structures, and some of them call for special handling:

  • For data with a non-linear relationship, correlation must be used with caution. For some problems, it may be useful to transform one or both variables to obtain a linear relationship (this requires an assumption about the type of non-linear relationship in order to suggest a suitable transformation).
  • Scatter plots may show that some data exhibit unequal variation (scatter). The problem with unequal variation is that the regions with high variation not only provide the least accurate information but also have the greatest influence on the calculated statistics. This problem is also often solved by transforming the data, for example by taking logarithms.
  • Some data may turn out to be divided into groups (clustering), which may indicate the need to split the population into parts.
  • An outlier (a sharply deviating value) can distort the calculated correlation coefficient. An outlier may be due to chance or an error in data collection, or it may actually reflect some feature of the relationship. Since an outlier deviates greatly from the average value, it contributes heavily to the calculated indicator. Statistical indicators are often calculated both with and without outliers.
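
The effect of a non-linear relationship and of a single outlier can be illustrated with a minimal Python sketch. The data and the helper name `pearson` are made up for illustration and are not taken from the example file.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Exact quadratic dependence on a symmetric range: y is fully determined
# by x, yet the linear correlation is zero.
x_sym = [-3, -2, -1, 0, 1, 2, 3]
y_quad = [x * x for x in x_sym]
r_quadratic = pearson(x_sym, y_quad)

# Perfectly linear data, then the same data with one outlier appended.
x_lin = [1, 2, 3, 4, 5, 6, 7, 8]
y_lin = [2 * x + 1 for x in x_lin]
r_clean = pearson(x_lin, y_lin)
r_outlier = pearson(x_lin + [9], y_lin + [-40])

print(r_quadratic, r_clean, r_outlier)
```

In this made-up data set the single outlier is enough to turn a perfect positive correlation into a negative one, which is why it is worth computing the coefficient both with and without suspicious points.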

Using MS EXCEL to calculate correlation

Let's take as an example 2 variables X and Y and, correspondingly, a sample consisting of several pairs of values (X_i; Y_i). For clarity, let's build a scatter plot.

Note: For more information about constructing diagrams, see the corresponding article. In the example file, the scatter plots are built with a chart type other than Scatter, because here we have deviated from the requirement that the variable X be random (this makes it easier to generate various types of relationships: a trend plus a given spread). For real data, you must use a Scatter chart (see below).

We will calculate the correlation for several kinds of relationship between the variables: linear, quadratic, and no relationship.

Note: In the example file, you can set the parameters of the linear trend (slope, Y-intercept) and the degree of scatter relative to this trend line. You can also adjust the quadratic parameters.

In the example file, the case with no dependence between the variables is plotted with a Scatter chart. In this case, the points on the chart form a cloud.

Note: By changing the scale of the chart along the vertical or horizontal axis, the cloud of points can be made to look like a vertical or horizontal line. The variables, of course, remain independent.

As mentioned above, MS EXCEL has the CORREL() function for calculating the correlation coefficient. You can also use the similar PEARSON() function, which returns the same result.

To verify that CORREL() calculates the correlation by the formulas given above, the example file also calculates it with more detailed formulas:

=COVARIANCE.P(B28:B88;D28:D88)/STDEV.P(B28:B88)/STDEV.P(D28:D88)

=COVARIANCE.S(B28:B88;D28:D88)/STDEV.S(B28:B88)/STDEV.S(D28:D88)
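
Both versions of the formula (population and sample) give the same coefficient, because the divisors n and n-1 cancel in the ratio. A minimal Python sketch of this check follows; the data and the helper names `covariance` and `stdev` are assumptions made for illustration, not part of the example file.

```python
import math

def covariance(xs, ys, sample=False):
    """Covariance with divisor n (population) or n - 1 (sample)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)

def stdev(xs, sample=False):
    """Standard deviation as the square root of cov(x, x)."""
    return math.sqrt(covariance(xs, xs, sample))

# Made-up illustrative data (not from the example file)
advertising = [10, 12, 15, 17, 20, 22, 25, 28]
sales = [100, 118, 126, 150, 161, 170, 195, 210]

r_population = covariance(advertising, sales) / stdev(advertising) / stdev(sales)
r_sample = (covariance(advertising, sales, sample=True)
            / stdev(advertising, sample=True)
            / stdev(sales, sample=True))

print(r_population, r_sample)
```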

Note: The square of the correlation coefficient r equals the coefficient of determination R², which is calculated when constructing a regression line and is returned by the RSQ() function. The value of R² can also be displayed on the scatter plot by adding a linear trend line with the standard MS EXCEL functionality (select the chart, open the Layout tab, then in the Analysis group click the Trendline button and choose Linear Trendline).
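
The equality between the squared correlation coefficient and the coefficient of determination of a simple linear regression can be checked numerically. The sketch below fits the line by ordinary least squares; the data and function names are illustrative assumptions.

```python
import math

def mean(values):
    return sum(values) / len(values)

def pearson(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def linear_fit(xs, ys):
    """Least-squares intercept a and slope b of the line y = a + b*x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b

def determination(xs, ys):
    """R^2 = 1 - SS_residual / SS_total for the fitted line."""
    a, b = linear_fit(xs, ys)
    my = mean(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Made-up illustrative data
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.7]

r = pearson(x, y)
r2 = determination(x, y)
print(r ** 2, r2)
```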

Using MS EXCEL to Calculate Covariance

Covariance is close in meaning to variance (it is also a measure of dispersion), with the difference that it is defined for 2 variables, while variance is defined for one. Therefore, cov(x;x)=VAR(x).

To calculate covariance in MS EXCEL (starting from version 2010), the functions COVARIANCE.P() and COVARIANCE.S() are used. In the first case, the calculation formula is similar to the one above (the ending .P stands for Population); in the second, the multiplier 1/(n-1) is used instead of 1/n (the ending .S stands for Sample).

Note: The COVAR() function, available in earlier versions of MS EXCEL, is equivalent to COVARIANCE.P().

Note: In the Russian version of MS EXCEL, CORREL() and COVAR() are named КОРРЕЛ() and КОВАР(), and COVARIANCE.P() and COVARIANCE.S() are named КОВАРИАЦИЯ.Г() and КОВАРИАЦИЯ.В() (the endings .Г and .В correspond to .P and .S).

Additional formulas for calculating covariance:

=SUMPRODUCT(B28:B88-AVERAGE(B28:B88);(D28:D88-AVERAGE(D28:D88)))/COUNT(D28:D88)

=SUMPRODUCT(B28:B88-AVERAGE(B28:B88);(D28:D88))/COUNT(D28:D88)

=SUMPRODUCT(B28:B88;D28:D88)/COUNT(D28:D88)-AVERAGE(B28:B88)*AVERAGE(D28:D88)
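
The three spreadsheet formulas above are algebraically equivalent ways of computing the population covariance. A short Python sketch can confirm this on sample data; the function names and the data are hypothetical and chosen only for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def cov_both_centered(xs, ys):
    """Average of (x - mean_x) * (y - mean_y)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def cov_x_centered(xs, ys):
    """Average of (x - mean_x) * y; centering y is unnecessary because
    the deviations of x from its mean sum to zero."""
    mx = mean(xs)
    return sum((x - mx) * y for x, y in zip(xs, ys)) / len(xs)

def cov_mean_of_products(xs, ys):
    """Mean of the products minus the product of the means."""
    return sum(x * y for x, y in zip(xs, ys)) / len(xs) - mean(xs) * mean(ys)

# Made-up illustrative data
x = [3.0, 5.0, 6.5, 8.0, 11.0, 12.5]
y = [1.2, 2.9, 3.1, 4.8, 6.0, 7.4]

c1 = cov_both_centered(x, y)
c2 = cov_x_centered(x, y)
c3 = cov_mean_of_products(x, y)
print(c1, c2, c3)
```

The third form is the most compact but subtracts two numbers of similar size, so with large values it can lose more floating-point precision than the centered forms.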

These formulas use the following property of covariance:

COV(x;y)=E[(x-E[x])·(y-E[y])]=E[x·y]-E[x]·E[y], where E[·] denotes the mean value.

If the variables x and y are independent, their covariance is 0. If the variables are not independent, the variance of their sum is equal to:

VAR(x+y)= VAR(x)+ VAR(y)+2COV(x;y)

And the variance of their difference is equal to:

VAR(x-y)= VAR(x)+ VAR(y)-2COV(x;y)
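
Both identities hold exactly for sample statistics as well, which makes them easy to verify. The following Python sketch compares the directly computed variance of the sums and differences with the right-hand sides of the formulas; the data and helper names are assumptions for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def cov(xs, ys):
    """Population covariance (divisor n)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def var(xs):
    """Population variance, i.e. cov(x, x)."""
    return cov(xs, xs)

# Made-up illustrative data
x = [2.0, 4.0, 4.5, 7.0, 9.0]
y = [1.0, 3.5, 3.0, 6.0, 5.5]

sums = [a + b for a, b in zip(x, y)]
diffs = [a - b for a, b in zip(x, y)]

var_sum_direct = var(sums)
var_sum_formula = var(x) + var(y) + 2 * cov(x, y)
var_diff_direct = var(diffs)
var_diff_formula = var(x) + var(y) - 2 * cov(x, y)

print(var_sum_direct, var_sum_formula)
print(var_diff_direct, var_diff_formula)
```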

Estimation of statistical significance of the correlation coefficient

To test the hypothesis, we must know the distribution of the random variable, i.e. of the correlation coefficient r. Usually, the hypothesis is tested not for r itself, but for the random variable t_r:

t_r = r·√(n-2) / √(1-r²)

which follows Student's t-distribution with n-2 degrees of freedom.

If the absolute value |t_r| of the calculated statistic exceeds the critical value t_α,n-2 (for a specified significance level α), the null hypothesis of zero correlation is rejected, i.e. the relationship between the variables is statistically significant.
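
The test can be sketched in Python using the standard statistic t = r·√(n-2)/√(1-r²). The function names are hypothetical, and the critical value 2.306 is an assumption taken from a standard Student's t table for a two-sided test with α = 0.05 and 8 degrees of freedom (in Excel it can be obtained with T.INV.2T).

```python
import math

def t_statistic(r, n):
    """t statistic for testing that the correlation is zero.

    r is the sample correlation coefficient, n is the number of pairs.
    """
    if n < 3:
        raise ValueError("at least 3 pairs of values are required")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

def is_significant(r, n, t_critical):
    """True if |t| exceeds the critical value, i.e. H0 (no correlation) is rejected."""
    return abs(t_statistic(r, n)) > t_critical

# Two-sided critical value for alpha = 0.05 and n - 2 = 8 degrees of freedom,
# taken from a standard Student's t table (an assumption of this sketch).
T_CRITICAL = 2.306

t_strong = t_statistic(0.7, 10)
t_weak = t_statistic(0.5, 10)
print(t_strong, is_significant(0.7, 10, T_CRITICAL))
print(t_weak, is_significant(0.5, 10, T_CRITICAL))
```

With only 10 pairs of values, a coefficient of 0.5 is not enough to reject the null hypothesis at this level, while 0.7 is.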

Analysis ToolPak add-in

The Analysis ToolPak add-in has analysis tools of the same names, Covariance and Correlation, for calculating covariance and correlation.

After calling the tool, a dialog box appears containing the following fields:

  • Input Range: enter a reference to the range containing the source data for the 2 variables
  • Grouped By: as a rule, the source data is entered in 2 columns
  • Labels in First Row: if the checkbox is selected, the Input Range must include the column headers. It is recommended to select it so that the add-in's output has informative column labels
  • Output Range: the range of cells where the calculation results will be placed. It is enough to specify the upper-left cell of this range.

The add-in returns the calculated correlation and covariance values (for covariance, the variances of both random variables are also calculated).

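The linear pair correlation coefficient is calculated by the formula:

r_xy = (mean(x·y) - mean(x)·mean(y)) / (σ(x)·σ(y))

where mean(x·y), mean(x), mean(y) are the average values of the samples; σ(x), σ(y) are the standard deviations.
In addition, the linear pair correlation coefficient can be determined through the regression coefficient b: r_xy = b·σ(x)/σ(y), where σ(x)=S(x), σ(y)=S(y) are the standard deviations and b is the coefficient of x in the regression equation y = a+bx.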
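
The link between the correlation coefficient and the regression slope, r = b·σ(x)/σ(y), can be checked against the formula based on averages. The Python sketch below does this on made-up data; all names are illustrative assumptions.

```python
import math

def mean(values):
    return sum(values) / len(values)

def std(values):
    """Population standard deviation (divisor n)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def regression_slope(xs, ys):
    """Least-squares slope b of the line y = a + b*x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def r_from_means(xs, ys):
    """r = (mean(x*y) - mean(x)*mean(y)) / (sigma(x) * sigma(y))."""
    mean_xy = mean([x * y for x, y in zip(xs, ys)])
    return (mean_xy - mean(xs) * mean(ys)) / (std(xs) * std(ys))

def r_from_slope(xs, ys):
    """r = b * sigma(x) / sigma(y)."""
    return regression_slope(xs, ys) * std(xs) / std(ys)

# Made-up illustrative data
x = [10, 12, 15, 17, 20, 22, 25, 28]
y = [100, 118, 126, 150, 161, 170, 195, 210]

r1 = r_from_means(x, y)
r2 = r_from_slope(x, y)
print(r1, r2)
```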

Another form of the formula:

r_xy = K_xy / (σ(x)·σ(y))

where K_xy is the correlation moment (covariance).

The linear correlation coefficient takes values from –1 to +1 (see Chaddock scale). For example, suppose that an analysis of the closeness of the linear relationship between two variables yields a linear pair correlation coefficient equal to –1. This means that there is an exact inverse linear relationship between the variables.

Geometric meaning of the correlation coefficient: r_xy shows how much the slopes of the two regression lines, y(x) and x(y), differ, i.e. how much the results of minimizing the deviations in x and in y differ. The smaller the angle between the lines, the greater |r_xy|; when |r_xy| = 1, the two lines coincide.
The sign of the correlation coefficient coincides with the sign of the regression coefficient and determines the slope of the regression line, i.e. the general direction of the dependence (increasing or decreasing). The absolute value of the correlation coefficient is determined by how close the points lie to the regression line.

Properties of the correlation coefficient

  1. |r_xy| ≤ 1;
  2. if X and Y are independent, then r_xy = 0; the converse is not always true;
  3. if |r_xy| = 1, then Y = aX+b, and |r_xy(X, aX+b)| = 1, where a and b are constants, a ≠ 0;
  4. |r_xy(X, Y)| = |r_xy(a_1 X+b_1, a_2 Y+b_2)|, where a_1, a_2, b_1, b_2 are constants, a_1 ≠ 0, a_2 ≠ 0.
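
The invariance of the coefficient under linear transformations of the variables, and the case of an exact linear dependence, can be illustrated with a minimal Python sketch. The data, the constants, and the helper name `pearson` are assumptions chosen for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative data
x = [1.0, 2.5, 3.0, 4.5, 6.0, 7.5]
y = [2.0, 2.8, 4.1, 4.0, 6.3, 6.9]

r = pearson(x, y)

# Positive multipliers: the coefficient does not change.
r_scaled = pearson([2 * v + 5 for v in x], [0.5 * v - 1 for v in y])

# Multipliers of opposite sign: only the sign changes, |r| is preserved.
r_flipped = pearson([-3 * v + 1 for v in x], [4 * v for v in y])

# Exact linear dependence Y = aX + b with a < 0 gives r = -1.
r_exact = pearson(x, [7 - 2 * v for v in x])

print(r, r_scaled, r_flipped, r_exact)
```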
