The correlation function in Microsoft Excel. An example of calculating the correlation, constructing a linear regression, and testing the hypothesis of dependence between two random variables

Correlation analysis is a popular statistical research method used to determine the degree of dependence of one indicator on another. Microsoft Excel has a special tool designed to perform this type of analysis. Let's find out how to use this feature.

The essence of correlation analysis

The purpose of correlation analysis comes down to identifying whether a relationship exists between various factors. That is, it determines whether a decrease or increase in one indicator is associated with a change in another.

If a relationship is established, the correlation coefficient is determined. Unlike regression analysis, this is the only indicator that this method of statistical research calculates. The correlation coefficient ranges from -1 to +1. With a positive correlation, an increase in one indicator is accompanied by an increase in the other. With a negative correlation, an increase in one indicator is accompanied by a decrease in the other. The greater the absolute value of the correlation coefficient, the more closely a change in one indicator is reflected in a change in the other. When the coefficient equals 0, there is no linear relationship between them.

Calculation of the correlation coefficient

Now let's calculate the correlation coefficient using a specific example. We have a table in which advertising costs and sales amounts are listed in separate columns, month by month. We need to find out how strongly the sales amount depends on the amount of money spent on advertising.

Method 1: Determine Correlation Through the Function Wizard

One way to perform correlation analysis is to use the CORREL function. The function has the general form CORREL(array1, array2).

  1. Select the cell in which the calculation result should be displayed. Click on the "Insert Function" button, which is located to the left of the formula bar.
  2. In the list that is presented in the Function Wizard window, look for and select the CORREL function. Click on the "OK" button.
  3. The function arguments window opens. In the "Array1" field, enter the cell range of one of the variables whose dependence is to be determined. In our case, these are the values in the "Amount of Sales" column. To enter the address of the array into the field, simply select all the cells with data in that column.

    In the field "Array2" you need to enter the coordinates of the second column. For us, these are advertising costs. In the same way as in the previous case, we enter the data in the field.

    Click on the "OK" button.

As you can see, the correlation coefficient appears in the form of a number in the cell we have previously selected. In this case, it is equal to 0.97, which is a very high indicator of the dependence of one quantity on another.
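For readers who want to check the result outside Excel, here is a minimal sketch of the calculation that CORREL performs (the Pearson correlation coefficient). The function name `correl` and the monthly figures are assumptions made for illustration; they are not the article's table.

```python
import math

def correl(xs, ys):
    """Pearson correlation coefficient, the quantity Excel's CORREL returns."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("both ranges must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical monthly figures (not the article's data)
advertising = [10, 12, 15, 18, 20, 24]
sales = [110, 118, 135, 150, 158, 181]

r = correl(sales, advertising)
print(round(r, 4))
```

The order of the two arguments does not matter: swapping Array1 and Array2 gives the same coefficient.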

Method 2: Calculate Correlation Using the Analysis ToolPak

The correlation can also be calculated with one of the tools provided in the Analysis ToolPak add-in. But first we need to activate this add-in.

  1. Go to the "File" tab.
  2. In the window that opens, go to the "Options" section.
  3. Next, go to the "Add-ins" item.
  4. At the bottom of the next window, in the "Manage" box, select "Excel Add-ins" if something else is selected. Click the "Go" button.
  5. In the add-ins window, check the box next to "Analysis ToolPak". Click the "OK" button.
  6. The Analysis ToolPak is now activated. Go to the "Data" tab. As you can see, a new group of tools, "Analysis", appears on the ribbon. Click the "Data Analysis" button located in it.
  7. A list of data analysis options opens. Select "Correlation". Click the "OK" button.
  8. A window with the correlation analysis parameters opens. Unlike the previous method, in the "Input Range" field we enter not each column separately but the range of all columns involved in the analysis. In our case, this is the data in the columns "Advertising costs" and "Amount of sales".

    Leave the "Grouped By" parameter unchanged at "Columns", since our data groups are arranged in two columns. If they were arranged in rows, the switch would have to be moved to the "Rows" position.

    In the output options, "New Worksheet Ply" is selected by default, that is, the data will be displayed on another sheet. You can change the location by moving the switch. It can be the current sheet (then you need to specify the cells for the output) or a new workbook (file).

    When all the settings are set, click on the "OK" button.

Since the output location was left at its default, we move to a new sheet. As you can see, the correlation coefficient is shown here. Naturally, it is the same as with the first method: 0.97. This is because both options perform the same calculation; they simply reach it in different ways.

As you can see, Excel offers two ways to perform correlation analysis. If you do everything correctly, the results are identical, so each user can choose the more convenient option.


Regression and correlation analysis are statistical research methods. They are the most common ways to show how a parameter depends on one or more independent variables.

Below, using specific practical examples, we consider these two types of analysis, which are very popular among economists. We also give an example of obtaining results when they are combined.

Regression analysis in Excel

Regression analysis shows the effect of some values (the independent variables) on the dependent variable. For example, how the size of the economically active population depends on the number of enterprises, the level of wages, and other parameters. Or how foreign investment, energy prices, and so on affect the level of GDP.

The result of the analysis allows you to prioritize. And based on the main factors, predict, plan the development of priority areas, make management decisions.

Regression can be:

  • linear (y = a + bx);
  • parabolic (y = a + bx + cx^2);
  • exponential (y = a * exp(bx));
  • power (y = a * x^b);
  • hyperbolic (y = b / x + a);
  • logarithmic (y = b * ln(x) + a);
  • exponential with an arbitrary base (y = a * b^x).

Let's look at an example of building a regression model in Excel and interpreting the results. Let's take a linear regression type.

Task. At 6 enterprises, the average monthly salary and the number of employees who quit were analyzed. It is necessary to determine the dependence of the number of employees who quit on the average salary.

The linear regression model looks like this:

Y = a0 + a1x1 + ... + akxk,

where a are the regression coefficients, x are the influencing variables, and k is the number of factors.

In our example, Y is the indicator of employees who quit. The influencing factor is wages (x).

Excel has built-in functions that you can use to calculate the parameters of a linear regression model. But the Analysis ToolPak add-in will do it faster.

We activate this powerful analytical tool:

  1. Click the "Office" button, open "Excel Options", and go to "Add-Ins".
  2. At the bottom, in the "Manage" drop-down list, "Excel Add-ins" should be selected (if it is not, open the list and select it). Click the "Go" button.
  3. A list of available add-ins opens. Select "Analysis ToolPak" and click OK.

Upon activation, the add-in will be available on the "Data" tab.

Now let's go directly to the regression analysis.

  1. Open the Data Analysis tool menu. Select "Regression".
  2. A dialog opens for selecting input values and output options (where to display the result). In the input fields, specify the range of the dependent variable (Y) and the factor that influences it (X). The rest can be left blank.
  3. After you click OK, the program displays the calculations on a new sheet (you can instead select a range on the current sheet or send the output to a new workbook).

First of all, pay attention to the R-square and the coefficients.

R-square is the coefficient of determination. In our example it is 0.755, or 75.5%. This means that the fitted model explains 75.5% of the variation in the dependent variable. The higher the coefficient of determination, the better the model. Good is above 0.8. Bad is below 0.5 (such an analysis can hardly be considered reasonable). In our example the result is "not bad".

The coefficient 64.1428 shows what Y will be if all the variables in the model under consideration are equal to 0. That is, other factors that are not described in the model also affect the value of the analyzed parameter.

The coefficient -0.16285 shows the weight of the variable X in Y. That is, within this model, each one-unit increase in the average monthly salary changes the predicted number of people leaving by -0.16285 (a small degree of influence). The "-" sign indicates a negative effect: the higher the salary, the fewer employees quit. Which is reasonable.
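To make the interpretation of the two coefficients concrete, here is a small sketch that plugs them into the fitted line Y = a0 + a1x. The coefficients are the ones quoted above; the salary values are hypothetical, because the source table of the six enterprises is not reproduced in the text.

```python
# Coefficients reported in the article's regression output
INTERCEPT = 64.1428   # predicted Y when the salary X equals 0
SLOPE = -0.16285      # change in predicted Y per one-unit increase in X

def predicted_quits(salary):
    """Predicted number of employees who quit for a given average salary."""
    return INTERCEPT + SLOPE * salary

# Hypothetical salary values, in the same (unstated) units as the source data
for salary in (100, 200, 300):
    print(salary, round(predicted_quits(salary), 2))
```

Each additional 100 units of salary lowers the prediction by 16.285, which is what the negative sign of the slope expresses.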

Correlation analysis in Excel

Correlation analysis helps to establish whether there is a relationship between indicators in one or two samples. For example, between the operating time of the machine and the cost of repairs, the price of equipment and the duration of operation, the height and weight of children, etc.

If there is a relationship, the analysis shows whether an increase in one parameter leads to an increase (positive correlation) or a decrease (negative correlation) in the other. Correlation analysis helps the analyst determine whether the value of one indicator can be used to predict the possible value of another.

The correlation coefficient is denoted by r. It varies from -1 to +1. The classification of correlation strength differs from field to field. When the coefficient is 0, there is no linear relationship between the samples.

Let's look at how to use Excel tools to find the correlation coefficient.

To find paired coefficients, the CORREL function is used.

Objective: determine whether there is a relationship between the operating time of a lathe and the cost of its maintenance.

Place the cursor in any cell and press the fx button.

  1. In the "Statistical" category, select the CORREL function.
  2. The Array1 argument is the first range of values, the machine operating time: A2:A14.
  3. The Array2 argument is the second range of values, the cost of repair: B2:B14. Click OK.

To determine the type of relationship, look at the absolute value of the coefficient (each field of activity has its own scale).

For correlation analysis of several parameters (more than 2), it is more convenient to use Data Analysis (the Analysis ToolPak add-in). In the list, select Correlation and specify the input range. That is all.

The resulting coefficients are displayed in a correlation matrix.
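As a rough illustration of what a correlation matrix contains, the sketch below computes every pairwise coefficient for three columns. The column names and numbers are invented for the example, and `correlation_matrix` is a hypothetical helper, not an Excel function.

```python
import math

def correl(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Map each pair of column names to its correlation coefficient."""
    names = list(columns)
    return {a: {b: correl(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical data: three parameters observed for six machines
data = {
    "hours": [120, 150, 180, 210, 260, 300],
    "repair_cost": [14, 17, 19, 25, 27, 34],
    "age_years": [2, 5, 3, 6, 4, 8],
}

matrix = correlation_matrix(data)
for row in data:
    print(row, [round(matrix[row][col], 3) for col in data])
```

The diagonal is always 1 (each column correlates perfectly with itself), and the matrix is symmetric, so only one triangle carries new information.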

Correlation-regression analysis

In practice, these two techniques are often used together.

  1. We build a correlation field: "Insert" - "Chart" - "Scatter" (it lets you compare pairs of values). The range of values is all the numeric data in the table.
  2. Left-click any point on the chart, then right-click. In the menu that opens, select "Add Trendline".
  3. Set the trendline parameters. Type: "Linear". At the bottom, check "Display Equation on chart".
  4. Click "Close".

Now the regression data is also visible.

1. Open Excel.

2. Create columns with data. In our example, we consider the relationship, or correlation, between aggressiveness and self-doubt in first-graders. The experiment involved 30 children; the data are presented in an Excel table:

Column 1 - subject number

Column 2 - aggressiveness, in points

Column 3 - self-doubt, in points

3. Select an empty cell next to the table and click the fx icon on the Excel formula bar.

4. A list of functions opens. Select the Statistical category, then find CORREL in the alphabetical list of functions and click OK.

5. The function arguments window opens, where you select the columns with the data you need. To select the first column, Aggressiveness, click the blue button next to the Array1 field.

6. Select the data for Array1 from the Aggressiveness column and click the blue button in the dialog box.

7. Then, as with Array1, click the blue button next to the Array2 field.

8. Select the data for Array2, the Self-doubt column, press the blue button again, and then click OK.

9. The Pearson correlation coefficient r is calculated and written to the selected cell. In our case, it is positive and approximately equal to 0.225. This indicates a moderate positive relationship between aggressiveness and self-doubt in first-graders.

Thus, the statistical conclusion of the experiment is: r = 0.225; a moderate positive relationship was found between the variables aggressiveness and self-doubt.

Some studies require the p-level of significance of the correlation coefficient; however, the CORREL function in Excel, unlike SPSS, does not report it. That is not a problem: there are tables of critical correlation values (A.D. Nasledov).
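The table lookup mentioned above can be approximated with the usual t-statistic for a correlation coefficient, t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below applies it to the article's values r = 0.225 and n = 30. The critical value 2.048 (two-sided, 5% level, 28 degrees of freedom) is taken from a standard Student's t-table and is an assumption about which significance level the study would use.

```python
import math

def t_statistic(r, n):
    """t-statistic for testing H0: the true correlation is zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

r, n = 0.225, 30          # values reported in the example above
t = t_statistic(r, n)

# Two-sided critical value for alpha = 0.05 and df = n - 2 = 28 (standard t-table)
T_CRITICAL = 2.048

print(round(t, 3), abs(t) > T_CRITICAL)
```

With these inputs t comes out near 1.22, below the critical value, so at the 5% level this check would not confirm that the coefficient differs from zero. This is worth stating alongside r when reporting the result.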

You can also build a regression line in Excel and attach it to the research results.

Have you ever needed to calculate the degree of relationship between two statistical quantities and determine the formula by which they correlate? A normal person may ask why this might be necessary at all. Oddly enough, it really is necessary. Knowing reliable correlations can help you make big money if you are, say, a stock trader. The problem is that for some reason no one discloses these correlations (surprising, isn't it?).

Let's count them ourselves! For example, I decided to try to calculate the correlation between the ruble and the dollar through the euro. Let's take a look at how this is done in detail.

This article is intended for advanced users of Microsoft Excel. If you don't have time to read the entire article, you can download the file and work through it yourself.

If you often need to do something like this, I highly recommend considering the book Statistical Calculations in Excel.

What is important to know about correlations

To calculate a reliable correlation, you need a reliable sample: the larger it is, the more reliable the result. For the purposes of this example, I took a daily sample of currency rates over 10 years. The data is freely available; I took it from the site http://oanda.com.

What I actually did

(1) When I had the original data, I started by checking the correlation between the two datasets. To do this, I used the CORREL function, about which there is little information. It returns the degree of correlation between two ranges of data. The result, frankly, was not particularly impressive (only about 0.70, or 70%). The share of variation that one quantity explains in the other is usually taken to be the square of this value (the coefficient of determination), so the relationship turned out to be reliable only by about 49%. This is very little!

(2) This seemed very strange to me. What mistakes could have crept into my calculations? So I decided to plot a graph and see what was going on. The graph was a simple one, broken down by year, so that you could see where the correlation breaks.

(3) The graph makes it obvious that around 35 rubles per euro the correlation begins to split into two parts. Because of this, it turned out to be unreliable. It was necessary to determine why this happens.

(4) The color shows that these data belong to 2007, 2008, and 2009. Of course! Periods of economic peaks and downturns are usually statistically unreliable, which is what happened here. So I tried excluding these periods from the data (and, as a check, measured the correlation within this period alone). The degree of correlation within these data alone is 0.01%, that is, it is essentially absent. Without them, however, the data correlate at about 81%. This is already a fairly reliable correlation.
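The comparison described in step (4), correlation over the full sample versus correlation with certain years excluded, can be sketched as follows. The rows are synthetic numbers invented for the illustration (they are not the oanda.com rates), and the year filter is simply the set of years named above.

```python
import math

def correl(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Synthetic (year, x, y) rows, invented for illustration only
rows = [
    (2005, 1.0, 10.0), (2005, 2.0, 12.0), (2006, 3.0, 14.0), (2006, 4.0, 16.0),
    (2008, 5.0, 9.0), (2008, 6.0, 25.0), (2009, 7.0, 11.0),
    (2010, 8.0, 24.0), (2011, 9.0, 26.0),
]
EXCLUDED_YEARS = {2007, 2008, 2009}

def correl_for(subset):
    """Correlation between the x and y columns of the given rows."""
    return correl([x for _, x, _ in subset], [y for _, _, y in subset])

calm_rows = [row for row in rows if row[0] not in EXCLUDED_YEARS]

r_all = correl_for(rows)
r_calm = correl_for(calm_rows)
print(round(r_all, 3), round(r_calm, 3))
```

In this toy data the calm years lie exactly on a line, so excluding the turbulent years raises the coefficient. Real data will show a less clean picture, and dropping periods should be justified by something other than the improved coefficient.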

Further steps

In theory, the fitted function can be refined by switching from a linear form to an exponential or logarithmic one. In this case the statistical reliability of the correlation increased by approximately one percent, but the formula became far more complicated to apply. So I ask myself: is it really necessary? That is up to you to decide in each specific case.
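One common way to try an exponential form without leaving ordinary least squares is to regress ln(y) on x. The sketch below shows that idea on synthetic data generated from y = 2 * exp(0.5x); the helper names `fit_line` and `fit_exponential` are hypothetical, and the approach assumes all y values are positive.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - b * mean_x, b

def fit_exponential(xs, ys):
    """Fit y = a * exp(b*x) by fitting a line to (x, ln y). Requires y > 0."""
    ln_a, b = fit_line(xs, [math.log(y) for y in ys])
    return math.exp(ln_a), b

# Synthetic data generated from a known curve, for illustration only
xs = [0, 1, 2, 3, 4]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

a, b = fit_exponential(xs, ys)
print(round(a, 6), round(b, 6))
```

Note that this minimizes errors in ln(y), not in y itself, so on noisy data the result can differ from a direct nonlinear fit.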

For the territories of the region, data are given for the year 200X.

Region number | Average per capita subsistence minimum per day for one able-bodied worker, rubles (x) | Average daily wage, rubles (y)
1 | 78 | 133
2 | 82 | 148
3 | 87 | 134
4 | 79 | 154
5 | 89 | 162
6 | 106 | 195
7 | 67 | 139
8 | 88 | 158
9 | 73 | 152
10 | 87 | 162
11 | 76 | 159
12 | 115 | 173

Exercise:

1. Build a correlation field and formulate a hypothesis about the form of the relationship.

2. Calculate the parameters of the linear regression equation

4. Using the average (general) coefficient of elasticity, give a comparative assessment of the strength of the relationship between the factor and the result.

7. Calculate the predicted value of the result if the predicted value of the factor increases by 10% from its average level. Determine the predictive confidence interval for the significance level.

Solution:

Let's solve this problem using Excel.

1. Comparing the available data x and y, for example by ranking them in ascending order of the factor x, one can observe a direct relationship between the variables: as the average per capita subsistence minimum increases, the average daily wage increases. Based on this, we can assume that the relationship between the variables is direct and can be described by the equation of a straight line. Graphical analysis confirms the same conclusion.

To build a correlation field, you can use Excel. Enter the initial data in sequence: first x, then y.

Select the range of cells containing the data.

Then choose: Insert / Scatter chart / Scatter with markers as shown in Figure 1.

Figure 1 Plotting the correlation field

Analysis of the correlation field shows the presence of a dependence close to a straight line, since the points are located practically in a straight line.

2. To calculate the parameters of the linear regression equation, let's use the built-in statistical function LINEST.

To do this:

1) Open an existing file containing the analyzed data;
2) Select a blank 5 × 2 cell area (5 rows, 2 columns) to display the results of the regression statistics;
3) Activate the Function Wizard: in the main menu select Formulas / Insert Function;
4) In the Category box select Statistical, and in the function box select LINEST. Click OK as shown in Figure 2;

Figure 2 Function Wizard Dialog Box

5) Fill in the function arguments:

Known_y's - the range containing the values of y;

Known_x's - the range containing the values of x;

Const - a Boolean value that indicates the presence or absence of an intercept in the equation; if Const = 1, the intercept is calculated in the usual way; if Const = 0, the intercept is 0;

Stats - a Boolean value that indicates whether to display additional regression statistics. If Stats = 1, additional information is displayed; if Stats = 0, only the estimates of the equation parameters are displayed.

Click OK;

Figure 3 LINEST function arguments dialog box

6) The first element of the final table will appear in the upper left cell of the selected area. To expand the entire table, press F2 and then the key combination Ctrl+Shift+Enter.

Additional regression statistics are displayed in the following order:

Coefficient b | Coefficient a
Standard error of b | Standard error of a
Coefficient of determination R^2 | Standard error of y
F-statistic | Degrees of freedom
Regression sum of squares | Residual sum of squares

Figure 4 The result of calculating the LINEST function

We got the regression equation:

$\hat{y} = 76.98 + 0.92x$

We conclude: with an increase in the average per capita subsistence minimum by 1 ruble, the average daily wage increases on average by 0.92 rubles.

The coefficient of determination $R^2 = 0.52$ means that 52% of the variation in wages (y) is explained by the variation of factor x, the average per capita subsistence minimum, and 48% by other factors not included in the model.

From the coefficient of determination, the correlation coefficient can be calculated: $r_{xy} = \sqrt{R^2} = \sqrt{0.52} \approx 0.72$.

The relationship is assessed as close.
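The figures in this step can be cross-checked without LINEST. The sketch below applies the ordinary least squares formulas to the twelve (x, y) pairs from the table in the problem statement; it is an independent check under that assumption, not a reproduction of Excel's internal algorithm.

```python
import math

# Data from the table in the text: x = subsistence minimum, y = average daily wage
x = [78, 82, 87, 79, 89, 106, 67, 88, 73, 87, 76, 115]
y = [133, 148, 134, 154, 162, 195, 139, 158, 152, 162, 159, 173]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squared deviations and cross-products
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

b = sxy / sxx                         # slope
a = mean_y - b * mean_x               # intercept
r2 = sxy ** 2 / (sxx * syy)           # coefficient of determination
r = math.copysign(math.sqrt(r2), b)   # correlation takes the sign of the slope

print(f"y = {a:.2f} + {b:.2f}x, R^2 = {r2:.2f}, r = {r:.2f}")
# prints: y = 76.98 + 0.92x, R^2 = 0.52, r = 0.72
```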

4. Using the average (general) coefficient of elasticity, we determine the strength of the factor's influence on the result.

For the equation of a straight line, the average (general) coefficient of elasticity is determined by the formula:

$\bar{E} = b \cdot \frac{\bar{x}}{\bar{y}}$

Find the average values by selecting the range of cells with the x values and choosing Formulas / AutoSum / Average, then do the same with the y values.

Figure 5 Calculation of the mean values of the function and the argument

Thus, if the average per capita subsistence minimum changes by 1% of its average value, the average daily wage will change on average by 0.51%.
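The 0.51% figure follows from multiplying the slope by the ratio of the means. A short sketch of that arithmetic, using the data from the problem table (the variable names are illustrative):

```python
# Data from the table in the text
x = [78, 82, 87, 79, 89, 106, 67, 88, 73, 87, 76, 115]
y = [133, 148, 134, 154, 162, 195, 139, 158, 152, 162, 159, 173]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least squares slope of y on x
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))

# Average elasticity for a straight line: slope times mean(x) / mean(y)
elasticity = b * mean_x / mean_y

print(round(mean_x, 2), round(mean_y, 2), round(elasticity, 2))
# prints: 85.58 155.75 0.51
```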

Using the Regression data analysis tool, you can obtain:
- regression statistics,
- analysis of variance results,
- confidence intervals,
- residuals and line fit plots,
- residual and normal probability plots.

The procedure is as follows:

1) Check access to the Analysis ToolPak. In the main menu, select in sequence: File / Options / Add-Ins.

2) In the Manage drop-down list, select Excel Add-ins and click the Go button.

3) In the Add-Ins window, check the Analysis ToolPak box and then click OK.

If Analysis ToolPak is not in the Add-Ins available list, click the Browse button to find it.

If a message appears stating that the Analysis ToolPak is not installed on your computer, click Yes to install it.

4) In the main menu, select in sequence: Data / Data Analysis / Analysis Tools / Regression, and then click OK.

5) Fill in the input and output options dialog box:

Input Y Range - the range containing the data of the dependent variable;

Input X Range - the range containing the data of the factor variable;

Labels - a checkbox that indicates whether the first row contains column names;

Constant is Zero - a checkbox that indicates the presence or absence of an intercept in the equation;

Output Range - it is enough to specify the upper left cell of the future range;

6) New Worksheet Ply - you can set an arbitrary name for the new sheet.

Then click OK.

Figure 6 Dialog box for entering parameters of the Regression tool

The results of the regression analysis for the task data are presented in Figure 7.

Figure 7 Result of applying the regression tool

5. Let us assess the quality of the equation using the average approximation error. We use the results of the regression analysis presented in Figure 8.

Figure 8 Result of using the "Residual output" regression tool

Let's compose a new table as shown in Figure 9. In column C we calculate the relative approximation error by the formula:

$A_i = \left| \frac{y_i - \hat{y}_i}{y_i} \right| \cdot 100\%$

Figure 9 Calculation of the average approximation error

The average approximation error is calculated by the formula:

$\bar{A} = \frac{1}{n} \sum_{i=1}^{n} A_i$

The quality of the constructed model is assessed as good, since $\bar{A}$ does not exceed 8-10%.
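A sketch of the same calculation in code, assuming the usual definition of the average approximation error (the mean of |y - ŷ| / y, expressed in percent) and the data from the problem table:

```python
# Data from the table in the text
x = [78, 82, 87, 79, 89, 106, 67, 88, 73, 87, 76, 115]
y = [133, 148, 134, 154, 162, 195, 139, 158, 152, 162, 159, 173]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
a = mean_y - b * mean_x

# Fitted values and relative errors, in percent, for each observation
fitted = [a + b * xi for xi in x]
relative_errors = [abs((yi - fi) / yi) * 100 for yi, fi in zip(y, fitted)]

mean_error = sum(relative_errors) / n
print(round(mean_error, 2))
```

For this dataset the average comes out a little under 6%, which is inside the 8-10% threshold quoted above.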

6. From the table with regression statistics (Figure 4), we write out the actual value of Fisher's F-criterion.

Since the actual value exceeds the tabulated value at the 5% significance level, it can be concluded that the regression equation is significant (the relationship is proven).
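For simple linear regression, the F-statistic can be recovered from the coefficient of determination as F = R^2 / (1 - R^2) * (n - 2). The sketch below does this for the problem data. The critical value 4.96 for (1, 10) degrees of freedom at the 5% level comes from a standard F-table and is supplied here as an assumption, since the source figure is not reproduced.

```python
# Data from the table in the text
x = [78, 82, 87, 79, 89, 106, 67, 88, 73, 87, 76, 115]
y = [133, 148, 134, 154, 162, 195, 139, 158, 152, 162, 159, 173]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

r2 = sxy ** 2 / (sxx * syy)

# F-statistic for one regressor: explained vs. residual variance per degree of freedom
f_actual = r2 / (1 - r2) * (n - 2)

# Tabulated F for alpha = 0.05 with (1, n - 2) = (1, 10) degrees of freedom
F_CRITICAL = 4.96

print(round(f_actual, 2), f_actual > F_CRITICAL)
```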

8. We assess the statistical significance of the regression parameters using Student's t-statistic and by calculating the confidence interval for each of the indicators.

We put forward the hypothesis H0 that the indicators do not differ significantly from zero:

$a = b = r_{xy} = 0$

The tabulated value of the t-statistic is taken for $n - 2 = 10$ degrees of freedom.

Figure 7 shows the actual values of the t-statistic for the parameters a and b.

The t-test for the correlation coefficient can be calculated in two ways:

Method I:

$t_r = \frac{r_{xy}}{m_r}$

where $m_r = \sqrt{\frac{1 - r_{xy}^2}{n - 2}}$ is the random error of the correlation coefficient.

We take the data for the calculation from the table in Figure 7.

Method II:

$t_r = \sqrt{F}$

The actual t-statistic values exceed the tabulated value.

Therefore, the hypothesis H0 is rejected: the regression parameters and the correlation coefficient do not differ from zero by chance but are statistically significant.

The confidence interval for the parameter a is defined as

$a \pm t_{table} \cdot m_a$

where $m_a$ is the standard error of a. For parameter a, the 95% bounds are read from Figure 7.

The confidence interval for the regression coefficient is defined as

$b \pm t_{table} \cdot m_b$

where $m_b$ is the standard error of b. For the regression coefficient b, the 95% bounds are likewise read from Figure 7.

Analysis of the upper and lower boundaries of the confidence intervals leads to the conclusion that, with probability 0.95, the parameters a and b, lying within these boundaries, do not take zero values, that is, they are statistically significant and differ materially from zero.
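Since the figure with the regression output is not reproduced here, the t-statistics and 95% bounds can be recomputed from the problem data. This is a sketch using the textbook standard-error formulas for simple regression; the tabulated value 2.2281 (two-sided, 5% level, 10 degrees of freedom) is taken from a standard t-table and is an assumption about the table used.

```python
import math

# Data from the table in the text
x = [78, 82, 87, 79, 89, 106, 67, 88, 73, 87, 76, 115]
y = [133, 148, 134, 154, 162, 195, 139, 158, 152, 162, 159, 173]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b = sxy / sxx
a = mean_y - b * mean_x

# Residual standard error with n - 2 degrees of freedom
ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s_res = math.sqrt(ss_res / (n - 2))

# Standard errors of the slope and the intercept
se_b = s_res / math.sqrt(sxx)
se_a = s_res * math.sqrt(sum(xi ** 2 for xi in x) / (n * sxx))

t_b = b / se_b
t_a = a / se_a

T_TABLE = 2.2281  # two-sided, alpha = 0.05, df = 10 (standard t-table)
ci_a = (a - T_TABLE * se_a, a + T_TABLE * se_a)
ci_b = (b - T_TABLE * se_b, b + T_TABLE * se_b)

print(f"t_a = {t_a:.2f}, t_b = {t_b:.2f}")
print(f"a in [{ci_a[0]:.2f}, {ci_a[1]:.2f}], b in [{ci_b[0]:.2f}, {ci_b[1]:.2f}]")
```

Both t-statistics exceed the tabulated value and neither interval contains zero, which agrees with the conclusion drawn in the text.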

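7. The obtained estimates of the regression equation allow us to use it for forecasting. If the forecast value of the subsistence minimum is 10% above its average level:

$x_p = 1.1 \cdot \bar{x}$

then the predicted value of the average daily wage is:

$\hat{y}_p = a + b \cdot x_p$

We calculate the forecast error using the formula:

$m_{\hat{y}_p} = s_{res} \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$

where $s_{res}^2 = \frac{\sum (y_i - \hat{y}_i)^2}{n - 2}$ is the residual variance per degree of freedom.

We calculate the variance of x in Excel. To do this: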

1) Activate Function wizard: in the main menu select Formulas / Insert Function.

3) Fill in the range containing the numerical data of the factor attribute. Click on OK.

Figure 10 Calculation of variance

This gives the variance of x.

To calculate the residual variance per degree of freedom, we use the ANOVA results as shown in Figure 7.

Confidence intervals for predicting individual values of y at $x_p$ with a probability of 0.95 are determined by the expression $\hat{y}_p \pm t_{table} \cdot m_{\hat{y}_p}$, where $m_{\hat{y}_p}$ is the forecast error.

The interval is fairly wide, primarily because of the small number of observations. On the whole, the forecast of the average daily wage turned out to be reliable.
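The forecast step can be sketched numerically as follows. The code assumes the standard prediction-interval formula for an individual value in simple regression and a tabulated t of 2.2281 (two-sided, 5% level, 10 degrees of freedom); both are assumptions to the extent that the source formulas are not reproduced in the text.

```python
import math

# Data from the table in the text
x = [78, 82, 87, 79, 89, 106, 67, 88, 73, 87, 76, 115]
y = [133, 148, 134, 154, 162, 195, 139, 158, 152, 162, 159, 173]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b = sxy / sxx
a = mean_y - b * mean_x

# Residual standard error with n - 2 degrees of freedom
s_res = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

x_p = 1.1 * mean_x    # factor 10% above its average level
y_p = a + b * x_p     # point forecast of the average daily wage

# Standard error of an individual forecast at x_p
m_forecast = s_res * math.sqrt(1 + 1 / n + (x_p - mean_x) ** 2 / sxx)

T_TABLE = 2.2281      # two-sided, alpha = 0.05, df = 10 (standard t-table)
lower = y_p - T_TABLE * m_forecast
upper = y_p + T_TABLE * m_forecast

print(f"x_p = {x_p:.2f}, y_p = {y_p:.2f}, interval = [{lower:.2f}, {upper:.2f}]")
```

The interval spans roughly 59 rubles around a forecast of about 164, which illustrates the remark above about its width with only 12 observations.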

The problem statement is taken from: Workshop on Econometrics: Textbook / I.I. Eliseeva, S.V. Kurysheva, N.M. Gordeenko et al.; ed. I.I. Eliseeva. - M.: Finance and Statistics, 2003. - 192 p.: ill.

We calculate the correlation coefficient and covariance for different types of relationships between random variables.

The correlation coefficient (Pearson's correlation criterion, Pearson product-moment correlation coefficient) determines the degree of linear relationship between random variables.

As follows from the definition, calculating the correlation coefficient requires knowing the distribution of the random variables X and Y. If the distributions are unknown, the correlation coefficient is estimated by the sample correlation coefficient r (also denoted R_xy or r_xy):

$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1) \, S_x S_y}$

where S_x is the sample standard deviation of the random variable x, calculated by the formula:

$S_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$

As you can see from the formula for the correlation, the denominator (the product of the standard deviations) simply normalizes the numerator so that the correlation is a dimensionless number from -1 to 1. Correlation and covariance carry the same information (if the standard deviations are known), but correlation is more convenient to use because it is dimensionless.

Calculating the correlation coefficient and the sample covariance in MS EXCEL is not difficult, since there are special functions for this: CORREL() and COVAR(). It is much harder to work out how to interpret the values obtained; most of the article is devoted to this.
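The relationship between the two functions can be sketched as follows. The code assumes that COVAR divides by n (population covariance); the correlation is the covariance divided by the product of the standard deviations computed with the same divisor, so the choice of n or n - 1 cancels out. The sample values are invented for illustration.

```python
import math

def covar(xs, ys):
    """Population covariance: the mean product of deviations from the means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)

def stdev_p(xs):
    """Population standard deviation (same divisor n as covar)."""
    return math.sqrt(covar(xs, xs))

def correl(xs, ys):
    """Correlation as normalized covariance; the result is dimensionless."""
    return covar(xs, ys) / (stdev_p(xs) * stdev_p(ys))

# Hypothetical sample
x = [1, 2, 3, 4]
y = [2, 4, 6, 9]
y_rescaled = [10 * v for v in y]   # the same data in different units

print(covar(x, y), covar(x, y_rescaled))
print(round(correl(x, y), 4), round(correl(x, y_rescaled), 4))
```

Changing the units of y multiplies the covariance by 10 but leaves the correlation unchanged, which is the practical meaning of "dimensionless" above.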

Theoretical digression

Recall that a correlation is a statistical relationship in which different values of one variable correspond to different mean values of the other (as the value of X changes, the mean of Y changes in a regular way). It is assumed that both X and Y are random variables with some random spread around their mean values.

Note. If only one variable is random, for example Y, and the values of the other are deterministic (set by the researcher), then we can only speak of regression.

Thus, for example, when studying how the average annual temperature changes over time, one cannot speak of a correlation between temperature and the year of observation and, accordingly, cannot apply correlation measures with their usual interpretation.

A correlation between variables can arise in several ways:

  1. A causal relationship between the variables. For example, the amount of investment in scientific research (variable X) and the number of patents obtained (Y). The first variable acts as the independent variable (factor), the second as the dependent variable (result). It must be remembered that dependence between quantities implies a correlation between them, but not vice versa.
  2. Contingency (a common cause). For example, as an organization grows, both the wage fund (payroll) and the cost of renting premises grow. Obviously, it is wrong to assume that the cost of renting premises depends on the payroll. In many cases both variables depend linearly on the number of staff.
  3. Interaction of the variables (when one changes, the other changes, and vice versa). With this approach, two statements of the problem are admissible: either variable can act as the independent variable or as the dependent one.

Thus, the correlation coefficient shows how strong the linear relationship between two factors is (if there is one), while regression predicts one factor from the other.

Correlation, like any other statistical indicator, can be useful when applied correctly, but it also has limitations. If the data show a clearly pronounced linear relationship or a complete lack of relationship, the correlation reflects this very well. But if the data show a nonlinear relationship (for example, quadratic), separate groups of values, or outliers, the calculated correlation coefficient can be misleading (see the example file).

A correlation close to 1 or -1 (that is, close to 1 in absolute value) indicates a strong linear relationship between the variables; a value close to 0 indicates no linear relationship. A positive correlation means that as one indicator increases, the other increases on average; with a negative correlation, it decreases.
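The caveat about nonlinear relationships is easy to demonstrate. In the sketch below, y is completely determined by x (y = x^2), yet the correlation coefficient is zero because the relationship is symmetric rather than linear. The data are synthetic.

```python
import math

def correl(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Synthetic data: an exact quadratic dependence on a symmetric range of x
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r = correl(x, y)
print(r)
```

A coefficient near 0 therefore rules out only a linear relationship; a scatter plot is still needed to see whether some other dependence is present.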

To calculate the correlation coefficient, it is required that the compared variables satisfy the following conditions:

  • the number of variables must be equal to two;
  • variables must be quantitative (for example, frequency, weight, price). The calculated mean of such variables has a clear meaning: the average price or the average weight of a patient. Unlike quantitative variables, qualitative (nominal) variables take values only from a finite set of categories (for example, gender or blood type). These values are only conditionally associated with numbers (for example, female - 1 and male - 2). Clearly, calculating the mean, which is required to find the correlation, is meaningless in this case, and so is calculating the correlation itself;
  • variables must be random variables and have .

Two-dimensional data can have different structures. Certain approaches are required to work with some of them:

  • For nonlinear data, correlation must be used with care. For some tasks it can be useful to transform one or both variables so as to obtain a linear relationship (this requires making an assumption about the type of nonlinear relationship in order to choose a suitable transformation).
  • On scatterplots, some data exhibit unequal variation (scatter). The problem with unequal variation is that regions with high variation not only provide the least accurate information, but also have the greatest impact on the calculated statistics. This problem is also often solved by transforming the data, for example by taking the logarithm.
  • For some data, one can observe division into groups (clustering), which may indicate the need to divide the population into parts.
  • An outlier can distort the calculated value of the correlation coefficient. An outlier may be the result of chance or of errors in data collection, or it may actually reflect some peculiarity in the relationship. Since the outlier deviates greatly from the average value, it makes a large contribution to the calculated indicator. Statistical indicators are often calculated both with and without the outliers.
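The effect of a single outlier can be sketched in Python by calculating the coefficient with and without it. The data below are hypothetical and chosen only to make the effect visible; `pearson` is an illustrative helper.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical sample with a clear increasing trend
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 5, 4, 5, 7, 8, 9]

# The same sample plus one made-up outlier (9, -20)
xs_out = xs + [9]
ys_out = ys + [-20]

r_without = pearson(xs, ys)
r_with = pearson(xs_out, ys_out)
print(f"without outlier: {r_without:.3f}")
print(f"with outlier:    {r_with:.3f}")
```

In this sample one extreme point is enough to turn a strong positive correlation into a negative one, which is why it is worth reporting both values.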

Using MS EXCEL to calculate correlation

As an example, take 2 variables X and Y and, correspondingly, a sample consisting of several pairs of values (X_i; Y_i). For clarity, let's build a scatter plot.

Note: For more information on plotting diagrams, see the article. In the example file, the scatterplots are built with a chart type other than Scatter, because here we have departed from the requirement that the variable X be random (this simplifies generating different types of relationships: a trend with a given spread). For real data, you must use a Scatter chart (see below).

We will calculate the correlation for different kinds of relationship between the variables: linear, quadratic, and no relationship.

Note: In the example file, you can set the parameters of the linear trend (slope, Y-axis intercept) and the degree of scatter around this trend line. You can also adjust the parameters of the quadratic dependence.

In the example file, a Scatter chart is used to build the scatterplot for the case where the variables are not dependent. In this case, the points on the chart form a cloud.

Note: By rescaling the chart along the vertical or horizontal axis, the point cloud can be made to look like a vertical or horizontal line. The variables, of course, remain independent.

As mentioned above, MS EXCEL has the CORREL() function for calculating the correlation coefficient. Alternatively, you can use the similar function PEARSON(), which returns the same result.

To verify that CORREL() calculates the correlation according to the above formulas, the example file also calculates the correlation using more detailed formulas:

=COVARIANCE.P(B28:B88,D28:D88)/STDEV.P(B28:B88)/STDEV.P(D28:D88)

=COVARIANCE.S(B28:B88,D28:D88)/STDEV.S(B28:B88)/STDEV.S(D28:D88)
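The two detailed formulas give the same result because the factor 1/n (or 1/(n-1)) cancels between the covariance and the standard deviations. A minimal Python sketch of this check, using made-up data and an illustrative `covariance` helper:

```python
from statistics import mean, pstdev, stdev

def covariance(xs, ys, sample=False):
    """Population covariance (divide by n) or sample covariance (divide by n-1)."""
    n = len(xs)
    mean_x, mean_y = mean(xs), mean(ys)
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)

# Hypothetical data, roughly following y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

# Population version: covariance and standard deviations all use 1/n
r_population = covariance(xs, ys) / pstdev(xs) / pstdev(ys)
# Sample version: covariance and standard deviations all use 1/(n-1)
r_sample = covariance(xs, ys, sample=True) / stdev(xs) / stdev(ys)

print(r_population, r_sample)
```

What matters is that both parts of the ratio use the same convention; mixing a population covariance with sample standard deviations would give a slightly different number.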

Note: The square of the correlation coefficient r equals the coefficient of determination R2, which is calculated when constructing the regression line, for example with the RSQ() function. The R2 value can also be displayed on a scatter plot by adding a linear trend line using the standard MS EXCEL functionality (select the chart, open the Layout tab, then in the Analysis group click Trendline and choose Linear).
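The equality between the squared correlation coefficient and R2 of a simple linear regression can be checked with a short Python sketch. The data are hypothetical, and the least-squares fit is written out by hand rather than taken from any library.

```python
from math import sqrt

# Hypothetical sample
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.5, 3.1, 2.9, 5.2, 5.8, 8.1, 7.9]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

# Correlation coefficient
r = sxy / sqrt(sxx * syy)

# Least-squares regression line y = a + b*x
b = sxy / sxx
a = mean_y - b * mean_x

# Coefficient of determination: 1 - (residual sum of squares / total sum of squares)
ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - ss_res / syy

print(r ** 2, r_squared)
```

This equality holds for a regression with a single predictor and an intercept; it does not carry over to other kinds of trend lines.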

Using MS EXCEL to calculate covariance

Covariance is close in meaning to variance (it is also a measure of scatter), with the difference that covariance is defined for 2 variables and variance for one. Therefore, COV(x; x) = VAR(x).

To calculate the covariance in MS EXCEL (starting from the 2010 version), the COVARIANCE.P() and COVARIANCE.S() functions are used. In the first case, the calculation formula is similar to the one above (the ending .P denotes Population); in the second, the factor 1/(n-1) is used instead of 1/n, i.e. the ending .S denotes Sample.

Note: The COVAR() function, which is present in earlier versions of MS EXCEL, is equivalent to the COVARIANCE.P() function.

Note: In the Russian version of MS EXCEL these functions are named КОРРЕЛ(), КОВАР(), КОВАРИАЦИЯ.Г() and КОВАРИАЦИЯ.В(); in the English version they are CORREL, COVAR, COVARIANCE.P and COVARIANCE.S.

Additional formulas for calculating the covariance:

=SUMPRODUCT(B28:B88-AVERAGE(B28:B88),(D28:D88-AVERAGE(D28:D88)))/COUNT(D28:D88)

=SUMPRODUCT(B28:B88-AVERAGE(B28:B88),(D28:D88))/COUNT(D28:D88)

=SUMPRODUCT(B28:B88,D28:D88)/COUNT(D28:D88)-AVERAGE(B28:B88)*AVERAGE(D28:D88)

These formulas use the following property of covariance: COV(x; y) = E[(x - E[x])(y - E[y])] = E[(x - E[x])·y] = E[x·y] - E[x]·E[y].
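The three additional covariance formulas are algebraically equivalent, which can be confirmed with a minimal Python sketch. The two lists are made-up stand-ins for the ranges B28:B88 and D28:D88.

```python
# Hypothetical stand-ins for the two worksheet ranges
xs = [1.0, 2.0, 4.0, 7.0, 11.0]
ys = [3.0, 1.0, 6.0, 8.0, 12.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Form 1: mean of products of deviations of both variables
cov_1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Form 2: only x is centered (the extra term sums to zero)
cov_2 = sum((x - mean_x) * y for x, y in zip(xs, ys)) / n

# Form 3: mean of products minus product of means
cov_3 = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y

print(cov_1, cov_2, cov_3)
```

All three are population covariances (division by n), matching COVARIANCE.P rather than COVARIANCE.S.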

If the variables x and y are independent, then their covariance is 0. If the variables are not independent, then the variance of their sum is:

VAR (x + y) = VAR (x) + VAR (y) + 2COV (x; y)

And the variance of their difference is:

VAR (x-y) = VAR (x) + VAR (y) -2COV (x; y)
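Both identities can be verified numerically. The sketch below uses population variance and covariance throughout (division by n); the data and the `pcov` helper are illustrative assumptions.

```python
from statistics import mean, pvariance

def pcov(xs, ys):
    """Population covariance (division by n)."""
    mean_x, mean_y = mean(xs), mean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical, clearly dependent variables
xs = [2.0, 4.0, 5.0, 7.0, 9.0, 12.0]
ys = [1.0, 3.0, 2.0, 6.0, 7.0, 9.0]

sums = [x + y for x, y in zip(xs, ys)]
diffs = [x - y for x, y in zip(xs, ys)]

var_sum_direct = pvariance(sums)
var_sum_formula = pvariance(xs) + pvariance(ys) + 2 * pcov(xs, ys)

var_diff_direct = pvariance(diffs)
var_diff_formula = pvariance(xs) + pvariance(ys) - 2 * pcov(xs, ys)

print(var_sum_direct, var_sum_formula)
print(var_diff_direct, var_diff_formula)
```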

Assessment of the statistical significance of the correlation coefficient

In order to test a hypothesis, we need to know the distribution of the random variable, i.e. of the correlation coefficient r. Usually, the hypothesis is tested not for r itself, but for the random variable t_r:

t_r = r·√(n-2) / √(1-r²)

which follows Student's t-distribution with n-2 degrees of freedom.

If the calculated value |t_r| is greater than the critical value t_(α, n-2) (for a given significance level α), then the null hypothesis of no correlation is rejected (the relationship between the values is statistically significant).
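A minimal Python sketch of this test, assuming the usual statistic t_r = r·√(n-2)/√(1-r²). The data are hypothetical, and the critical value 2.447 is the standard two-tailed table value for α = 0.05 and 6 degrees of freedom (in MS EXCEL it can be obtained with T.INV.2T(0.05, 6)); it is hard-coded here because the Python standard library has no t-distribution.

```python
from math import sqrt

# Hypothetical sample of n = 8 pairs
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 5, 4, 5, 7, 8, 9]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
r = sxy / sqrt(sxx * syy)

# Test statistic with n - 2 degrees of freedom
t_r = r * sqrt(n - 2) / sqrt(1 - r ** 2)

# Two-tailed critical value for alpha = 0.05, df = 6 (from a standard t-table)
T_CRITICAL = 2.447

significant = abs(t_r) > T_CRITICAL
print(f"r = {r:.3f}, t = {t_r:.3f}, significant: {significant}")
```

For a different sample size or significance level the critical value must be looked up again, since it depends on both α and n-2.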

Analysis ToolPak add-in

The Analysis ToolPak add-in contains tools of the same names, Covariance and Correlation, for calculating covariance and correlation.

After calling the tool, a dialog box appears that contains the following fields:

  • Input Range: enter a reference to the range with the source data for the 2 variables
  • Grouped By: as a rule, the source data are entered in 2 columns
  • Labels in First Row: if checked, the Input Range must contain column headings. It is recommended to check this box so that the output of the add-in has informative column names
  • Output Range: the range of cells where the calculation results will be placed. It is enough to specify the top-left cell of this range.

The add-in returns the calculated correlation and covariance values (for the covariance, the variances of both random variables are also calculated).

The linear pairwise correlation coefficient is calculated by the formula r_xy = (mean(x·y) - mean(x)·mean(y)) / (σ(x)·σ(y)), where mean(x·y), mean(x), mean(y) are the sample averages and σ(x), σ(y) are the standard deviations.
In addition, the linear pairwise correlation coefficient can be determined through the regression coefficient b: r_xy = b·σ(x)/σ(y), where σ(x) = S(x), σ(y) = S(y) are the standard deviations and b is the coefficient of x in the regression equation y = a + bx.
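The two ways of obtaining the coefficient, from the averages and from the regression slope b, can be compared in a short Python sketch. The data are made up, and population standard deviations (division by n) are assumed for σ(x) and σ(y).

```python
from math import sqrt

# Hypothetical sample
xs = [3.0, 5.0, 6.0, 8.0, 10.0, 13.0]
ys = [10.0, 14.0, 13.0, 19.0, 22.0, 27.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
mean_xy = sum(x * y for x, y in zip(xs, ys)) / n

# Population standard deviations
sigma_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
sigma_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)

# Coefficient from the averages
r_direct = (mean_xy - mean_x * mean_y) / (sigma_x * sigma_y)

# Least-squares slope b of y = a + b*x, then r = b * sigma_x / sigma_y
b = (mean_xy - mean_x * mean_y) / sigma_x ** 2
r_from_slope = b * sigma_x / sigma_y

print(r_direct, r_from_slope)
```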

Another formula:
r_xy = K_xy / (σ(x)·σ(y))

where K_xy is the correlation moment (covariance).

The linear correlation coefficient takes values from -1 to +1 (see Chaddock scale). For example, suppose that when analyzing the strength of a linear correlation between two variables, a pairwise linear correlation coefficient of -1 was obtained. This means that there is an exact inverse linear relationship between the variables.

The geometric meaning of the correlation coefficient: r_xy shows how much the slopes of the two regression lines, y(x) and x(y), differ, i.e. how much the results of minimizing the deviations in x and in y differ. The smaller the angle between the lines, the larger |r_xy|; when |r_xy| = 1, the two lines coincide.
The sign of the correlation coefficient coincides with the sign of the regression coefficient and determines the slope of the regression line, i.e. the general direction of the dependence (increase or decrease). The absolute value of the correlation coefficient is determined by how close the points lie to the regression line.

Correlation Coefficient Properties

  1. |r_xy| ≤ 1;
  2. if X and Y are independent, then r_xy = 0; the converse is not always true;
  3. if |r_xy| = 1, then Y = aX + b; conversely, |r_xy(X, aX + b)| = 1, where a and b are constants and a ≠ 0;
  4. |r_xy(X, Y)| = |r_xy(a_1 X + b_1, a_2 Y + b_2)|, where a_1, a_2, b_1, b_2 are constants and a_1, a_2 ≠ 0.
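Property 4 (invariance of the absolute value under linear transformations of each variable) can be checked with a small Python sketch. The data and the constants are arbitrary choices made for this illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical sample
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 5, 4, 5, 7, 8, 9]

# Arbitrary non-zero scale factors and shifts; a2 is negative on purpose
a1, b1 = 3.0, 2.0
a2, b2 = -0.5, 7.0

xs_t = [a1 * x + b1 for x in xs]
ys_t = [a2 * y + b2 for y in ys]

r_original = pearson(xs, ys)
r_transformed = pearson(xs_t, ys_t)
print(r_original, r_transformed)
```

The absolute value is unchanged; the sign flips here only because a_1 and a_2 have opposite signs.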

Instruction. Indicate the amount of source data. The resulting solution is saved in a Word file (see Example of finding the regression equation). A solution template in Excel is also automatically generated. ...
