Correlation coefficient in Excel. Correlation and regression analysis in Excel: step-by-step instructions

Note: the solution to your specific problem will look similar to this example, including all the tables and explanatory text presented below, but it will use your own original data.

Task:
A paired sample of 26 pairs of values (x_k, y_k) is given:

k    1        2        3        4        5        6        7        8        9        10
x_k  25.20000 26.40000 26.00000 25.80000 24.90000 25.70000 25.70000 25.70000 26.10000 25.80000
y_k  30.80000 29.40000 30.20000 30.50000 31.40000 30.30000 30.40000 30.50000 29.90000 30.40000

k    11       12       13       14       15       16       17       18       19       20
x_k  25.90000 26.20000 25.60000 25.40000 26.60000 26.20000 26.00000 22.10000 25.90000 25.80000
y_k  30.30000 30.50000 30.60000 31.00000 29.60000 30.40000 30.70000 31.60000 30.50000 30.60000

k    21       22       23       24       25       26
x_k  25.90000 26.30000 26.10000 26.00000 26.40000 25.80000
y_k  30.70000 30.10000 30.60000 30.50000 30.70000 30.80000

It is required to calculate or construct:
- the correlation coefficient;
- a test of the hypothesis that the random variables X and Y are dependent, at the significance level α = 0.05;
- the coefficients of the linear regression equation;
- a scatter plot (correlation field) and a plot of the regression line.

SOLUTION:

1. Calculate the correlation coefficient.

The correlation coefficient is a measure of the mutual probabilistic influence of two random variables. The correlation coefficient R can take values from -1 to +1. An absolute value close to 1 is evidence of a strong relationship between the quantities; an absolute value close to 0 indicates a weak linear relationship or none at all. If the absolute value of R equals one, there is a functional (linear) relationship between the quantities, that is, one quantity can be expressed through the other by a mathematical function.


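You can calculate the correlation coefficient using the following formulas:

$$R_{x,y} = \frac{\operatorname{cov}(X, Y)}{\sigma_x \sigma_y} \quad (1.1)$$

where

$$\operatorname{cov}(X, Y) = \frac{1}{n}\sum_{k=1}^{n}(x_k - M_x)(y_k - M_y),$$

$$\sigma_x^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k - M_x)^2, \qquad \sigma_y^2 = \frac{1}{n}\sum_{k=1}^{n}(y_k - M_y)^2,$$

$$M_x = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad M_y = \frac{1}{n}\sum_{k=1}^{n} y_k$$
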
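or by the formula

$$R_{x,y} = \frac{M_{xy} - M_x M_y}{S_x S_y} \quad (1.4)$$

where:

$$M_x = \frac{1}{n}\sum_{k=1}^{n} x_k, \qquad M_y = \frac{1}{n}\sum_{k=1}^{n} y_k, \qquad M_{xy} = \frac{1}{n}\sum_{k=1}^{n} x_k y_k \quad (1.5)$$

$$S_x^2 = \frac{1}{n}\sum_{k=1}^{n} x_k^2 - M_x^2, \qquad S_y^2 = \frac{1}{n}\sum_{k=1}^{n} y_k^2 - M_y^2 \quad (1.6)$$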

In practice, formula (1.4) is often used to calculate the correlation coefficient because it requires less computation. However, if the covariance cov(X, Y) has already been calculated, it is more advantageous to use formula (1.1), since in addition to the covariance value itself you can also reuse the results of the intermediate calculations.

1.1. Let us calculate the correlation coefficient by formula (1.4). To do this, we calculate the values x_k^2, y_k^2 and x_k·y_k and enter them in Table 1.

Table 1


k x_k y_k x_k^2 y_k^2 x_k·y_k
1 2 3 4 5 6 (column numbers)
1 25.2 30.8 635.04000 948.64000 776.16000
2 26.4 29.4 696.96000 864.36000 776.16000
3 26.0 30.2 676.00000 912.04000 785.20000
4 25.8 30.5 665.64000 930.25000 786.90000
5 24.9 31.4 620.01000 985.96000 781.86000
6 25.7 30.3 660.49000 918.09000 778.71000
7 25.7 30.4 660.49000 924.16000 781.28000
8 25.7 30.5 660.49000 930.25000 783.85000
9 26.1 29.9 681.21000 894.01000 780.39000
10 25.8 30.4 665.64000 924.16000 784.32000
11 25.9 30.3 670.81000 918.09000 784.77000
12 26.2 30.5 686.44000 930.25000 799.10000
13 25.6 30.6 655.36000 936.36000 783.36000
14 25.4 31 645.16000 961.00000 787.40000
15 26.6 29.6 707.56000 876.16000 787.36000
16 26.2 30.4 686.44000 924.16000 796.48000
17 26 30.7 676.00000 942.49000 798.20000
18 22.1 31.6 488.41000 998.56000 698.36000
19 25.9 30.5 670.81000 930.25000 789.95000
20 25.8 30.6 665.64000 936.36000 789.48000
21 25.9 30.7 670.81000 942.49000 795.13000
22 26.3 30.1 691.69000 906.01000 791.63000
23 26.1 30.6 681.21000 936.36000 798.66000
24 26 30.5 676.00000 930.25000 793.00000
25 26.4 30.7 696.96000 942.49000 810.48000
26 25.8 30.8 665.64000 948.64000 794.64000


1.2. Calculate M_x by formula (1.5).

1.2.1. Add up all the elements x_k in sequence

x_1 + x_2 + … + x_26 = 25.20000 + 26.40000 + ... + 25.80000 = 669.500000

1.2.2. Divide the resulting sum by the number of sample elements

669.50000 / 26 = 25.75000

M_x = 25.750000

1.3. In a similar way, calculate M_y.

1.3.1. Add up all the elements y_k in sequence

y 1 + y 2 +… + y 26 = 30.80000 + 29.40000 + ... + 30.80000 = 793.000000

1.3.2. Divide the resulting sum by the number of sample elements

793.00000 / 26 = 30.50000

M_y = 30.500000

1.4. Calculate M_xy.

1.4.1. Add up all the elements of the 6th column of Table 1 in succession

776.16000 + 776.16000 + ... + 794.64000 = 20412.830000

1.4.2. Divide the resulting sum by the number of elements

20412.83000 / 26 = 785.10885

M_xy = 785.108846

1.5. Calculate the value of S_x^2 by formula (1.6).

1.5.1. Add up all the elements of the 4th column of Table 1 in succession

635.04000 + 696.96000 + ... + 665.64000 = 17256.910000

1.5.2. Divide the resulting sum by the number of elements

17256.91000 / 26 = 663.72731

1.5.3. Subtract the square of M_x from the last number to get the value of S_x^2

S_x^2 = 663.72731 - 25.75000^2 = 663.72731 - 663.06250 = 0.66481

1.6. Calculate the value of S_y^2 by formula (1.6).

1.6.1. Add up all the elements of the 5th column of Table 1 sequentially

948.64000 + 864.36000 + ... + 948.64000 = 24191.840000

1.6.2. Divide the resulting sum by the number of elements

24191.84000 / 26 = 930.45538

1.6.3. Subtract the square of M_y from the last number to get the value of S_y^2

S_y^2 = 930.45538 - 30.50000^2 = 930.45538 - 930.25000 = 0.20538

1.7. Calculate the product of S_x^2 and S_y^2.

S_x^2 · S_y^2 = 0.66481 · 0.20538 = 0.136541

1.8. Take the square root of the last number to get the value of S_x·S_y.

S_x · S_y = 0.36951

1.9. Calculate the value of the correlation coefficient by formula (1.4).

R = (785.10885 - 25.75000 · 30.50000) / 0.36951 = (785.10885 - 785.37500) / 0.36951 = -0.72028

ANSWER: R_{x,y} = -0.720279
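
The steps 1.1 to 1.9 above can be checked with a short script. This is only a minimal sketch in Python (the article itself works by hand and in Excel); the function name `pearson_by_moments` is an illustrative choice, not something from the source. It applies formula (1.4) directly to the 26 pairs from the task.

```python
import math

# The 26 pairs (x_k, y_k) from the task statement.
x = [25.2, 26.4, 26.0, 25.8, 24.9, 25.7, 25.7, 25.7, 26.1, 25.8,
     25.9, 26.2, 25.6, 25.4, 26.6, 26.2, 26.0, 22.1, 25.9, 25.8,
     25.9, 26.3, 26.1, 26.0, 26.4, 25.8]
y = [30.8, 29.4, 30.2, 30.5, 31.4, 30.3, 30.4, 30.5, 29.9, 30.4,
     30.3, 30.5, 30.6, 31.0, 29.6, 30.4, 30.7, 31.6, 30.5, 30.6,
     30.7, 30.1, 30.6, 30.5, 30.7, 30.8]


def pearson_by_moments(xs, ys):
    """Correlation coefficient via formula (1.4): (M_xy - M_x*M_y) / (S_x*S_y)."""
    n = len(xs)
    m_x = sum(xs) / n                                 # formula (1.5)
    m_y = sum(ys) / n
    m_xy = sum(a * b for a, b in zip(xs, ys)) / n
    s_x2 = sum(a * a for a in xs) / n - m_x ** 2      # formula (1.6)
    s_y2 = sum(b * b for b in ys) / n - m_y ** 2
    return (m_xy - m_x * m_y) / math.sqrt(s_x2 * s_y2)


r = pearson_by_moments(x, y)
print(f"R = {r:.6f}")
```

The printed value can be compared with the answer in step 1.9; small differences in the last digits would only reflect rounding of the intermediate results in the hand calculation.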

2. Check the significance of the correlation coefficient (check the hypothesis of dependence).

Since the estimate of the correlation coefficient is calculated from a finite sample and may therefore deviate from its population value, the significance of the correlation coefficient must be checked. The check is performed using the t-test:

$$t = \frac{R_{x,y}\sqrt{n - 2}}{\sqrt{1 - R_{x,y}^2}} \quad (2.1)$$

The random variable t follows Student's t-distribution, and the critical value of the test (t_cr,α) at the given significance level α is found from the t-distribution table. If the absolute value of t calculated by formula (2.1) is less than t_cr,α, the sample gives no evidence of dependence between the random variables X and Y. Otherwise, the experimental data do not contradict the hypothesis that the random variables are dependent.


2.1. Calculating the value of the t-statistic by formula (2.1), we get:

$$t = \frac{-0.72028\,\sqrt{26 - 2}}{\sqrt{1 - (-0.72028)^2}} = -5.08680$$

2.2. Determine the critical value t_cr,α from the t-distribution table.

The desired value of t_cr,α is located at the intersection of the row corresponding to the number of degrees of freedom and the column corresponding to the given significance level α.
In our case, the number of degrees of freedom is n - 2 = 26 - 2 = 24 and α = 0.05, which corresponds to the critical value t_cr,α = 2.064 (see Table 2).

Table 2. t-distribution

Number of degrees of freedom (n - 2)
α = 0.1 α = 0.05 α = 0.02 α = 0.01 α = 0.002 α = 0.001
1 6.314 12.706 31.821 63.657 318.31 636.62
2 2.920 4.303 6.965 9.925 22.327 31.598
3 2.353 3.182 4.541 5.841 10.214 12.924
4 2.132 2.776 3.747 4.604 7.173 8.610
5 2.015 2.571 3.365 4.032 5.893 6.869
6 1.943 2.447 3.143 3.707 5.208 5.959
7 1.895 2.365 2.998 3.499 4.785 5.408
8 1.860 2.306 2.896 3.355 4.501 5.041
9 1.833 2.262 2.821 3.250 4.297 4.781
10 1.812 2.228 2.764 3.169 4.144 4.587
11 1.796 2.201 2.718 3.106 4.025 4.437
12 1.782 2.179 2.681 3.055 3.930 4.318
13 1.771 2.160 2.650 3.012 3.852 4.221
14 1.761 2.145 2.624 2.977 3.787 4.140
15 1.753 2.131 2.602 2.947 3.733 4.073
16 1.746 2.120 2.583 2.921 3.686 4.015
17 1.740 2.110 2.567 2.898 3.646 3.965
18 1.734 2.101 2.552 2.878 3.610 3.922
19 1.729 2.093 2.539 2.861 3.579 3.883
20 1.725 2.086 2.528 2.845 3.552 3.850
21 1.721 2.080 2.518 2.831 3.527 3.819
22 1.717 2.074 2.508 2.819 3.505 3.792
23 1.714 2.069 2.500 2.807 3.485 3.767
24 1.711 2.064 2.492 2.797 3.467 3.745
25 1.708 2.060 2.485 2.787 3.450 3.725
26 1.706 2.056 2.479 2.779 3.435 3.707
27 1.703 2.052 2.473 2.771 3.421 3.690
28 1.701 2.048 2.467 2.763 3.408 3.674
29 1.699 2.045 2.462 2.756 3.396 3.659
30 1.697 2.042 2.457 2.750 3.385 3.646
40 1.684 2.021 2.423 2.704 3.307 3.551
60 1.671 2.000 2.390 2.660 3.232 3.460
120 1.658 1.980 2.358 2.617 3.160 3.373
∞ 1.645 1.960 2.326 2.576 3.090 3.291


2.3. Compare the absolute value of the t-statistic with t_cr,α.

The absolute value of the t-statistic is not less than the critical value (|t| = 5.08680, t_cr,α = 2.064); therefore, with probability 0.95 (1 - α), the experimental data do not contradict the hypothesis that the random variables X and Y are dependent.
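
The significance check of section 2 can be sketched in a few lines. This sketch takes R and n from section 1 and the critical value 2.064 from Table 2 as given inputs rather than computing them; the variable names are illustrative.

```python
import math

r = -0.72028      # correlation coefficient from step 1.9
n = 26            # number of pairs in the sample
t_crit = 2.064    # Table 2: 24 degrees of freedom, alpha = 0.05

# Formula (2.1): t = R * sqrt(n - 2) / sqrt(1 - R^2)
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# The hypothesis of dependence is not contradicted when |t| >= t_crit.
dependence_supported = abs(t) >= t_crit

print(f"t = {t:.4f}, |t| >= t_crit: {dependence_supported}")
```

Looking the critical value up in the table, as the article does, keeps the sketch free of external libraries.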

3. Calculate the coefficients of the linear regression equation.

The linear regression equation is the equation of a straight line that approximates (approximately describes) the relationship between the random variables X and Y. If we assume that X is the independent variable and Y depends on X, the regression equation is written as follows:

$$Y = a + bX \quad (3.1)$$

where:

$$b = R_{x,y}\frac{\sigma_y}{\sigma_x} = R_{x,y}\frac{S_y}{S_x} \quad (3.2)$$

$$a = M_y - bM_x \quad (3.3)$$

The coefficient b calculated by formula (3.2) is called the linear regression coefficient. In some sources a is called the constant regression coefficient and b, correspondingly, the variable coefficient.

The prediction errors of Y for a given value of X are calculated by the formulas:

$$\sigma_{y/x} = \sigma_y\sqrt{1 - R_{x,y}^2} \quad (3.4)$$

$$\delta_{y/x} = \frac{\sigma_{y/x}}{M_y}\cdot 100\% \quad (3.5)$$

The quantity σ_{y/x} (formula 3.4) is also called the residual standard deviation; it characterizes the deviation of the value Y from the regression line described by equation (3.1) at a fixed (given) value of X.

First compute the ratio S_y^2 / S_x^2 = 0.20538 / 0.66481 = 0.30894. Taking the square root of the last number, we get:
S_y / S_x = 0.55582

3.3. Calculate the coefficient b by formula (3.2)

b = -0.72028 · 0.55582 = -0.40035

3.4. Calculate the coefficient a by formula (3.3)

a = 30.50000 - (-0.40035 · 25.75000) = 40.80894

3.5. Estimate the errors of the regression equation.

3.5.1. Take the square root of S_y^2 and substitute it into formula (3.4); we get:

σ_{y/x} = 0.31437

3.5.4. Calculate the relative error by formula (3.5)

δ_{y/x} = (0.31437 / 30.50000) · 100% = 1.03073%
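
The calculations of section 3 can be reproduced from the intermediate results of section 1. In this sketch the residual standard deviation is taken as S_y·sqrt(1 - R^2); that expression is an assumption, chosen because it reproduces the value 0.31437 reported in step 3.5.1. The inputs are the rounded values printed in the text, so the last digits may differ slightly from the article's.

```python
import math

# Rounded intermediate results taken from sections 1.2 to 1.9.
r = -0.72028
m_x, m_y = 25.75, 30.5
s_x2, s_y2 = 0.66481, 0.20538

# Formula (3.2): slope; formula (3.3): intercept.
b = r * math.sqrt(s_y2 / s_x2)
a = m_y - b * m_x

# Residual standard deviation (assumed form, see the note above)
# and the relative error as computed in step 3.5.4.
sigma_yx = math.sqrt(s_y2) * math.sqrt(1 - r ** 2)
delta_yx = sigma_yx / m_y * 100

print(f"b = {b:.5f}, a = {a:.5f}")
print(f"sigma_y/x = {sigma_yx:.5f}, delta_y/x = {delta_yx:.5f}%")
```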

4. Build a scatterplot (correlation field) and a regression line plot.

A scatter plot is a graphical representation of the pairs (x_k, y_k) as points on a plane in rectangular coordinates with the axes X and Y. The correlation field is one of the graphical representations of a paired sample. The regression line is plotted in the same coordinate system. The scales and origins of the axes should be chosen carefully to make the chart as clear as possible.

4.1. Find the minimum and maximum elements of the sample X: these are the 18th and 15th elements, respectively, x_min = 22.10000 and x_max = 26.60000.

4.2. Find the minimum and maximum elements of the sample Y: these are the 2nd and 18th elements, respectively, y_min = 29.40000 and y_max = 31.60000.

4.3. On the abscissa axis, choose the origin slightly to the left of the point x_18 = 22.10000, and a scale such that the point x_15 = 26.60000 fits on the axis and the remaining points are clearly distinguishable.

4.4. On the ordinate axis, choose the origin slightly below the point y_2 = 29.40000, and a scale such that the point y_18 = 31.60000 fits on the axis and the remaining points are clearly distinguishable.

4.5. Place the x_k values on the abscissa and the y_k values on the ordinate.

4.6. Plot the points (x_1, y_1), (x_2, y_2), …, (x_26, y_26) on the coordinate plane. We get the scatter plot (correlation field) shown in the figure below.

4.7. Draw the regression line.

To do this, find two different points with coordinates (x_r1, y_r1) and (x_r2, y_r2) that satisfy equation (3.6), plot them on the coordinate plane and draw a straight line through them. Take x_min = 22.10000 as the abscissa of the first point. Substituting x_min into equation (3.6) gives the ordinate of the first point, so the first point has coordinates (22.10000, 31.96127). In the same way, taking x_max = 26.60000 as the abscissa gives the second point: (26.60000, 30.15970).

The regression line is shown in the figure below in red.

Note that the regression line always passes through the point of the mean values of X and Y, that is, the point with coordinates (M_x, M_y).
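
The two end points of the regression line in step 4.7 can be computed as follows. This is a sketch that assumes equation (3.6) is the regression equation Y = a + bX with the coefficients a and b found in steps 3.3 and 3.4; the function name `predict` is illustrative. Because the rounded coefficients are used, the last digits can differ slightly from the values quoted above.

```python
# Coefficients from steps 3.3 and 3.4 (rounded as printed in the text).
a, b = 40.80894, -0.40035


def predict(x_value):
    """Value of the regression line Y = a + b*X at a given X."""
    return a + b * x_value


x_min, x_max = 22.1, 26.6
point_1 = (x_min, predict(x_min))   # left end of the line
point_2 = (x_max, predict(x_max))   # right end of the line

# The line should also pass through the point of means (M_x, M_y).
y_at_mean = predict(25.75)

print(point_1, point_2, y_at_mean)
```

Drawing a straight line through `point_1` and `point_2` on top of the scatter plot gives the regression line.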

To determine the degree of dependence between several indicators, multiple (pairwise) correlation coefficients are used. They are then summarized in a separate table called the correlation matrix. The row and column names of this matrix are the names of the parameters whose dependence on one another is being established. The corresponding correlation coefficients are located at the intersections of the rows and columns. Let's find out how to do this calculation using Excel tools.

The level of relationship between indicators is conventionally graded by the absolute value of the correlation coefficient as follows:

  • 0 - 0.3 - no connection;
  • 0.3 - 0.5 - weak connection;
  • 0.5 - 0.7 - medium connection;
  • 0.7 - 0.9 - high;
  • 0.9 - 1 - very strong.

If the correlation coefficient is negative, the relationship between the parameters is inverse.
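
The scale above can be expressed as a small helper. This is a sketch: the function name `describe_correlation` is hypothetical, and assigning each boundary value (0.3, 0.5, 0.7, 0.9) to the stronger category is an assumption, since the listed ranges overlap at their end points.

```python
def describe_correlation(r):
    """Return (strength, direction) for a correlation coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    strength = abs(r)
    if strength < 0.3:
        label = "no connection"
    elif strength < 0.5:
        label = "weak"
    elif strength < 0.7:
        label = "medium"
    elif strength < 0.9:
        label = "high"
    else:
        label = "very strong"
    # A negative coefficient means an inverse relationship.
    direction = "inverse" if r < 0 else "direct"
    return label, direction


print(describe_correlation(-0.72))
```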

To build a correlation matrix in Excel, one of the tools included in the "Data Analysis" package is used. It is called just that: "Correlation". Let's find out how to use it to calculate multiple correlation coefficients.

Stage 1: activation of the analysis package

By default, the "Data Analysis" package is disabled. Therefore, before calculating the correlation coefficients, you need to activate it. Not every user knows how to do this, so we will dwell on this point.


After this action, the "Data Analysis" tool package will be activated.

Stage 2: calculating the coefficient

Now you can proceed directly to calculating the multiple correlation coefficient. Using the table of labor productivity, capital-labor ratio and power-to-labor ratio at various enterprises, presented below, let us calculate the multiple correlation coefficient of these factors.


Stage 3: analysis of the obtained result

Now let's see how to interpret the result produced by the "Correlation" tool in Excel.

As the table shows, the correlation coefficient between the capital-labor ratio (Column 2) and the power-to-labor ratio (Column 1) is 0.92, which corresponds to a very strong relationship. Between labor productivity (Column 3) and the power-to-labor ratio (Column 1) it is 0.72, which is a high degree of dependence. The correlation coefficient between labor productivity (Column 3) and the capital-labor ratio (Column 2) is 0.88, which also corresponds to a high degree of dependence. Thus, the relationship between all the studied factors is quite strong.

As you can see, the "Data Analysis" package in Excel is a convenient and fairly easy-to-use tool for determining multiple correlation coefficients. It can also be used to calculate the ordinary correlation between two factors.

LABORATORY WORK

CORRELATION ANALYSIS IN EXCEL

1.1 Correlation analysis in MS Excel

Correlation analysis consists in determining the degree of relationship between two random variables X and Y. The correlation coefficient is used as a measure of this relationship. It is estimated from a sample of n paired observations (x_i, y_i) drawn from the joint population of X and Y. The linear correlation coefficient (Pearson's coefficient) is used, on the assumption that the samples X and Y are normally distributed.

The correlation coefficient ranges from -1 (strict inverse linear relationship) to 1 (strict direct linear relationship). At a value of 0, there is no linear relationship between the two samples.

General classification of correlations (according to Ivanter E.V., Korosov A.V., 1992):

There are several types of correlation coefficients, depending on the variables X and Y, which can be measured on different scales. It is this fact that determines the choice of the corresponding correlation coefficient (see Table 13):

In MS Excel, pairwise linear correlation coefficients are calculated with the special function CORREL(array1, array2),

where array1 is a reference to the range of cells of the first sample (X), and array2 is a reference to the range of cells of the second sample (Y).

Example 1: 10 schoolchildren were given tests for visual-figurative and verbal thinking. The average time for solving test tasks was measured in seconds. The researcher is interested in the question: is there a relationship between the time for solving these problems? Variable X - denotes the average time for solving visual-figurative, and variable Y - the average time for solving verbal tasks of tests.

Solution: To determine the degree of relationship, first enter the data into the MS Excel table (see the table, Fig. 1). Then calculate the value of the correlation coefficient. To do this, place the cursor in cell C1 and click the Insert Function (fx) button on the toolbar.

In the Function Wizard dialog box that appears, select the Statistical category and the CORREL function, and then click OK. Use the mouse pointer to enter the data range of sample X in the Array1 field (A1:A10). In the Array2 field, enter the data range of sample Y (B1:B10). Click OK. The value of the correlation coefficient will appear in cell C1 - 0.54119. Next, look at the absolute value of the correlation coefficient and determine the type of relationship (strong, weak, medium, etc.).

Fig. 1. Results of calculating the correlation coefficient

Thus, the relationship between the time taken to solve the visual-figurative and the verbal test tasks has not been proven.
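
One way to see why the relationship is "not proven" is to apply the t-test from formula (2.1) to this result. The sketch below assumes the coefficient is 0.54119 in absolute value with n = 10 pupils, and takes the critical value 2.306 for 8 degrees of freedom at α = 0.05 from Table 2; the raw data of the example are not reproduced in the text, so only the reported coefficient is used.

```python
import math

r = 0.54119       # absolute value of the coefficient reported in cell C1
n = 10            # number of schoolchildren in Example 1
t_crit = 2.306    # Table 2: n - 2 = 8 degrees of freedom, alpha = 0.05

# Formula (2.1); the sign of r does not affect |t|.
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
significant = abs(t) >= t_crit

print(f"|t| = {abs(t):.3f}, significant at alpha = 0.05: {significant}")
```

With |t| below the critical value, a coefficient of this size on ten observations is not statistically significant, which is consistent with the conclusion above.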

Exercise 1. Data are available for 20 agricultural holdings. Find the correlation coefficient between the yield of grain crops and the quality of the land, and assess its significance. The data are shown in the table.

Table 2. Dependence of the yield of grain crops on the quality of the land

Farm number | Land quality, score | Yield, c/ha


Task 2. Determine whether there is a relationship between the operating time of a fitness machine (thousand hours) and the cost of repairing it (thousand rubles):

Machine operating time (thousand hours) | Repair cost (thousand rubles)

1.2 Multiple correlation in MS Excel

With a large number of observations, when correlation coefficients have to be calculated in turn for several samples, the resulting coefficients are for convenience summarized in tables called correlation matrices.

A correlation matrix is a square table in which the correlation coefficient between two parameters is found at the intersection of the corresponding row and column.

In MS Excel, correlation matrices are calculated with the Correlation procedure from the Data Analysis package. The procedure produces a correlation matrix containing the correlation coefficients between the various parameters.

To implement the procedure, you must:

1. execute the command Tools - Data Analysis;

2. in the Analysis Tools list that appears, select the Correlation line and click OK;

3. in the dialog box that appears, specify the Input Range, that is, enter a reference to the cells containing the data to be analyzed. The input range must contain at least two columns;

4. in the Grouped By section, set the switch according to how the data were entered (by columns or by rows);

5. specify the Output Range, that is, enter a reference to the cell from which the analysis results will be displayed. The size of the output range is determined automatically, and a message is displayed if the output range would overlap the source data. Click OK.

A correlation matrix will be displayed in the output range, in which the correlation coefficient between two parameters is found at the intersection of the corresponding row and column. Output range cells whose row and column coordinates coincide contain the value 1, since each column in the input range is fully correlated with itself.
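
What the Correlation procedure computes can be illustrated outside Excel. The sketch below builds a correlation matrix in plain Python; the three data columns are made-up numbers used only for illustration, and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    s_xx = sum((a - mean_x) ** 2 for a in xs)
    s_yy = sum((b - mean_y) ** 2 for b in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


def correlation_matrix(columns):
    """Square table of pairwise coefficients, one row and column per input column."""
    return [[pearson(a, b) for b in columns] for a in columns]


# Hypothetical data: three parameters observed five times each.
columns = [
    [1, 2, 3, 4, 5],
    [2, 4, 5, 4, 5],
    [5, 3, 4, 2, 1],
]
matrix = correlation_matrix(columns)
for row in matrix:
    print(["{:.2f}".format(value) for value in row])
```

As in the Excel output, the diagonal cells equal 1 and the table is symmetric about the diagonal.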

Example 2. Monthly observations of weather conditions and attendance at museums and parks are available (see Table 3). It is necessary to determine whether there is a relationship between the state of the weather and the attendance of museums and parks.

Table 3. Observation results

Number of clear days

The number of museum visitors

Number of park visitors

Solution. To perform the correlation analysis, enter the initial data into the range A1:G3 (Fig. 2). Then, on the Tools menu, select Data Analysis and choose the Correlation line. In the dialog box that appears, specify the Input Range (A2:C7). Specify that the data are grouped by columns. Specify the output range (E1) and click OK.

Fig. 33 shows that the correlation between weather conditions and museum attendance is -0.92, between weather conditions and park attendance is 0.97, and between park and museum attendance is -0.92.

Thus, the analysis revealed dependencies: a strong inverse linear relationship between museum attendance and the number of sunny days and an almost linear (very strong direct) relationship between park attendance and weather conditions. There is a strong inverse relationship between museum and park attendance.

Fig. 2. Results of calculating the correlation matrix from Example 2

Assignment 3. Ten managers were assessed by the method of expert evaluation of the psychological characteristics of a leader's personality. Fifteen experts rated each psychological characteristic on a five-point scale (see Table 4). The psychologist wants to know how these characteristics of a leader are related to one another.

Table 4. Research results

Subject No. | Tact | Exactingness | Criticality

The correlation coefficient (or linear correlation coefficient) is denoted "r" (in rare cases "ρ") and characterizes the linear correlation (that is, a relationship described by some magnitude and direction) of two or more variables. The value of the coefficient lies between -1 and +1, that is, the correlation can be either positive or negative. If the correlation coefficient is -1, there is a perfect negative correlation; if the correlation coefficient is +1, there is a perfect positive correlation. Otherwise, there is a positive correlation between the two variables, a negative correlation, or no correlation. The correlation coefficient can be calculated manually, with free online calculators, or with a good graphing calculator.

Steps

Calculating the correlation coefficient manually

    Collect data. Before you start calculating the correlation coefficient, study the given pairs of numbers. It is best to write them down in a table, which can be arranged vertically or horizontally. Label each row or column "x" and "y".

    • For example, suppose four pairs of values (numbers) of the variables "x" and "y" are given. You can create the following table:
      • x || y
      • 1 || 1
      • 2 || 3
      • 4 || 5
      • 5 || 7
  1. Calculate the arithmetic mean "x". To do this, add up all the x values, and then divide the result by the number of values.

    • In our example, there are four values ​​for the variable "x". To calculate the arithmetic mean "x", add these values, and then divide the sum by 4. The calculations are written as follows:
    • $\mu_x = (1 + 2 + 4 + 5) / 4$
    • $\mu_x = 12 / 4$
    • $\mu_x = 3$
  2. Find the arithmetic mean "y". To do this, follow the same steps, that is, add up all the y values, and then divide the sum by the number of values.

    • In our example, four values ​​of the variable "y" are given. Add these values, and then divide the sum by 4. The calculations will be written as follows:
    • $\mu_y = (1 + 3 + 5 + 7) / 4$
    • $\mu_y = 16 / 4$
    • $\mu_y = 4$
  3. Calculate the standard deviation "x". After calculating the mean values of "x" and "y", find the standard deviations of these variables. The standard deviation is calculated using the following formula:

    • $\sigma_x = \sqrt{\frac{1}{n-1}\sum (x - \mu_x)^2}$
    • $\sigma_x = \sqrt{\frac{1}{4-1}\left((1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2\right)}$
    • $\sigma_x = \sqrt{\frac{1}{3}(4 + 1 + 1 + 4)}$
    • $\sigma_x = \sqrt{\frac{1}{3}\cdot 10}$
    • $\sigma_x = \sqrt{\frac{10}{3}}$
    • $\sigma_x = 1.83$
  4. Calculate the standard deviation "y". Follow the steps outlined in the previous step. Use the same formula, but plug in the y values.

    • In our example, the calculations will be written like this:
    • $\sigma_y = \sqrt{\frac{1}{4-1}\left((1-4)^2 + (3-4)^2 + (5-4)^2 + (7-4)^2\right)}$
    • $\sigma_y = \sqrt{\frac{1}{3}(9 + 1 + 1 + 9)}$
    • $\sigma_y = \sqrt{\frac{1}{3}\cdot 20}$
    • $\sigma_y = \sqrt{\frac{20}{3}}$
    • $\sigma_y = 2.58$
  5. Write down the basic formula for calculating the correlation coefficient. This formula includes the means, the standard deviations, and the number (n) of pairs of numbers for both variables. The correlation coefficient is denoted "r" (in rare cases "ρ"). This article uses the formula for the Pearson correlation coefficient.

    • Here and in other sources, the quantities can be denoted in different ways. For example, some formulas contain "ρ" and "σ", while others contain "r" and "s". Some textbooks give different formulas, but they are mathematically equivalent to the formula above.
  6. You have calculated the means and standard deviations of both variables, so you can use the formula to calculate the correlation coefficient. Recall that "n" is the number of pairs of values ​​for both variables. Other values ​​have been calculated earlier.

    • In our example, the calculations will be written like this:
    • $\rho = \frac{1}{n-1}\sum \left(\frac{x - \mu_x}{\sigma_x}\right)\left(\frac{y - \mu_y}{\sigma_y}\right)$
    • $\rho = \frac{1}{3}\left[\frac{1-3}{1.83}\cdot\frac{1-4}{2.58} + \frac{2-3}{1.83}\cdot\frac{3-4}{2.58} + \frac{4-3}{1.83}\cdot\frac{5-4}{2.58} + \frac{5-3}{1.83}\cdot\frac{7-4}{2.58}\right]$
    • $\rho = \frac{1}{3}\cdot\frac{6 + 1 + 1 + 6}{4.721}$
    • $\rho = \frac{1}{3}\cdot 2.965$
    • $\rho = \frac{2.965}{3}$
    • $\rho = 0.988$
  7. Analyze the result. In our example, the correlation coefficient is 0.988. This value in some way characterizes a given set of pairs of numbers. Pay attention to the sign and magnitude of the value.

    • Since the value of the correlation coefficient is positive, there is a positive correlation between the variables "x" and "y". That is, as the value of "x" increases, the value of "y" also increases.
    • Since the value of the correlation coefficient is very close to +1, the values ​​of the variables "x" and "y" are highly correlated. If you put points on the coordinate plane, they will be located close to some straight line.
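
The manual calculation above can be checked with Python's standard `statistics` module. This is a minimal sketch; `statistics.stdev` is used because it computes the sample standard deviation with the n - 1 divisor, as in step 3. Without rounding the standard deviations to two decimals, the coefficient comes out at about 0.990, so the 0.988 obtained by hand reflects rounding of the intermediate values.

```python
import statistics

x = [1, 2, 4, 5]
y = [1, 3, 5, 7]
n = len(x)

mu_x = statistics.mean(x)    # arithmetic mean of x
mu_y = statistics.mean(y)    # arithmetic mean of y
s_x = statistics.stdev(x)    # sample standard deviation (divisor n - 1)
s_y = statistics.stdev(y)

# Sum of products of the standardized deviations, divided by n - 1.
r = sum((xi - mu_x) / s_x * (yi - mu_y) / s_y
        for xi, yi in zip(x, y)) / (n - 1)

print(round(s_x, 2), round(s_y, 2), round(r, 3))
```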

    Using Online Calculators to Calculate the Correlation Coefficient

    1. Find a calculator on the Internet to calculate the correlation coefficient. This coefficient is often calculated in statistics. If there are many pairs of numbers, calculating the correlation coefficient manually is impractical. That is why there are online calculators for the correlation coefficient. In a search engine, enter "correlation coefficient calculator" (without the quotes).

    2. Enter data. Check the instructions on the website to enter the correct data (pairs of numbers). It is imperative to enter the appropriate pairs of numbers; otherwise, you will get the wrong result. Please be aware that different websites have different data entry formats.

      • For example, at http://ncalculators.com/statistics/correlation-coefficient-calculator.htm, the values ​​of the variables x and y are entered in two horizontal lines. The values ​​are separated by commas. That is, in our example, the values ​​"x" are entered like this: 1,2,4,5, and the values ​​"y" like this: 1,3,5,7.
      • On another site, http://www.alcula.com/calculators/statistics/correlation-coefficient/, data is entered vertically; in this case, do not confuse the corresponding pairs of numbers.
    3. Calculate the correlation coefficient. After entering the data, simply click the "Calculate" or similarly named button to get the result.

      Using a graphing calculator

      1. Enter data. Take a graphing calculator, go into statistical calculation mode and select the "Edit" command.

        • Different calculators require different keys to be pressed. This article discusses the Texas Instruments TI-86 calculator.
        • To enter the statistical calculation mode, press - Stat (above the "+" key). Then press F2 - Edit.
      2. Delete the previous saved data. Most calculators keep the statistics you enter until you erase them. To avoid confusing old data with new ones, first delete any stored information.

        • Use the arrow keys to move the cursor and highlight the 'xStat' heading. Then press Clear and Enter to clear all values ​​entered in the xStat column.
        • Use the arrow keys to highlight the 'yStat' heading. Then press Clear and Enter to clear all values ​​entered in the yStat column.
      3. Enter the initial data. Use the arrow keys to move the cursor to the first cell under the heading "xStat". Enter the first value and press Enter. At the bottom of the screen, “xStat (1) = __” will be displayed, with the value entered instead of a space. After you press Enter, the entered value will appear in the table, and the cursor will move to the next line; this will display "xStat (2) = __" at the bottom of the screen.

        • Enter all the values ​​for the variable "x".
        • After entering all the values ​​for x, use the arrow keys to navigate to the yStat column and enter the values ​​for y.
        • After entering all pairs of numbers, press Exit to clear the screen and exit the statistical calculation mode.
      4. Calculate the correlation coefficient. It characterizes how close the data is to a certain straight line. The graphing calculator can quickly determine the suitable straight line and calculate the correlation coefficient.

        • Click Stat - Calc. On the TI-86, press - -.
        • Select Linear Regression. On the TI-86, press the key labeled "LinR". The screen will display the line "LinR _" with a blinking cursor.
        • Now enter the names of two variables: xStat and yStat.
          • On TI-86, open the list of names; to do this, press - -.
          • The available variables are displayed on the bottom line of the screen. Select xStat (you probably need to press F1 or F2 to do this), enter a comma, and then select yStat.
          • Press Enter to process the entered data.
      5. Analyze your results. After you press Enter, the screen displays the following information:

        • y = a + bx: this is the function that describes the line. Note that the function is not written in the standard form (y = kx + b).
        • a = ... This is the y-coordinate of the point where the line crosses the y-axis.
        • b = ... This is the slope of the line.
        • corr = ... This is the correlation coefficient.
        • n = ... This is the number of pairs of numbers used in the calculations.
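The values the calculator reports can also be reproduced by hand. The sketch below is a minimal illustration in Python, not the calculator's own algorithm. It uses the sample pairs x = 1, 2, 4, 5 and y = 1, 3, 5, 7 from the online calculator example above; the function name `linear_regression` is made up for this illustration.

```python
from math import sqrt

def linear_regression(xs, ys):
    """Return (a, b, corr, n) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross products
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                 # slope
    a = mean_y - b * mean_x       # intercept on the y-axis
    corr = sxy / sqrt(sxx * syy)  # Pearson correlation coefficient
    return a, b, corr, n

a, b, corr, n = linear_regression([1, 2, 4, 5], [1, 3, 5, 7])
print(f"a = {a:.3f}, b = {b:.3f}, corr = {corr:.4f}, n = {n}")
# prints: a = -0.200, b = 1.400, corr = 0.9899, n = 4
```

The four returned values correspond to the a, b, corr and n lines on the calculator screen.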

In scientific research, there is often a need to find a relationship between effective (outcome) and factor variables: the yield of a crop and the amount of precipitation, the height and weight of people in groups homogeneous by sex and age, heart rate and body temperature, and so on.

Factor variables are the characteristics that contribute to changes in the effective (outcome) variables associated with them.

Correlation Analysis

There are many definitions of the term. Based on the above, correlation analysis is a method used to test a hypothesis about the statistical significance of the relationship between two or more variables when the researcher can measure them but not change them.

There are other definitions of the concept in question. Correlation analysis is a data processing technique that examines the correlation coefficients between variables. It compares the correlation coefficients for one pair or for multiple pairs of features to establish statistical relationships between them. Correlation analysis is also defined as a method for studying the statistical dependence between random variables, not necessarily strictly functional in nature, in which a change in one random variable leads to a change in the mathematical expectation of another.

The concept of spurious correlation

When conducting a correlation analysis, keep in mind that it can be applied to any set of features, including features that are absurd to relate to each other. Sometimes they have no causal connection at all.

In this case, one speaks of a spurious (false) correlation.

Correlation analysis tasks

Based on the above definitions, the following tasks of the described method can be formulated: to obtain information about one of the variables of interest using the other, and to determine how close the relationship between the studied variables is.

Correlation analysis involves determining the relationship between the studied characteristics, so its tasks can be supplemented with the following:

  • identifying the factors that have the greatest impact on the effective (outcome) characteristic;
  • identifying previously unexplored causes of relationships;
  • building a correlation model and analyzing its parameters;
  • studying the significance of the relationship parameters and estimating their intervals.

Relationship between correlation analysis and regression analysis

Correlation analysis is often not limited to finding how close the relationship between the studied values is. Sometimes it is supplemented by regression equations, which are obtained by regression analysis and describe the correlation dependence between the effective characteristic and the factor characteristic (or characteristics). Together with the analysis under consideration, this constitutes the method of correlation-regression analysis.

Conditions for using the method

Effective indicators depend on one or several factors. Correlation analysis can be used when there is a large number of observations of the effective and factor indicators, and the factors under study are quantitative and recorded in specific sources. If the data follow the normal distribution law, the result of the correlation analysis is the Pearson correlation coefficient; if they do not, the Spearman rank correlation coefficient is used.
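To illustrate the difference between the two coefficients, here is a minimal Python sketch. It computes the Spearman coefficient as the Pearson coefficient of the ranks, with tied values receiving the average of their ranks. The data are invented for the illustration, and the function names are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank values starting from 1; tied values get the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Invented data: y grows monotonically with x, but not linearly
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(f"Pearson: {r_pearson:.4f}, Spearman: {r_spearman:.4f}")
```

For this monotonic but nonlinear sample the rank coefficient equals 1, while the Pearson coefficient stays slightly below 1.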

Selection rules for correlation analysis factors

When applying this method, you need to determine the factors that influence the effective indicators. They are selected on the basis that there must be causal relationships between the indicators. When building a multivariate correlation model, select the factors that have a significant impact on the effective indicator. It is preferable not to include interdependent factors with a pair correlation coefficient above 0.85, or factors whose relationship with the effective indicator is nonlinear or functional in nature.
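The 0.85 rule can be checked mechanically before the model is built. The following Python sketch flags pairs of factors whose pair correlation coefficient exceeds the threshold in absolute value. The factor names and values are invented, and `interdependent_pairs` is a hypothetical helper, not part of any library.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def interdependent_pairs(factors, threshold=0.85):
    """Return (name_a, name_b, r) for factor pairs with |r| above threshold."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(factors.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged

# Invented factor columns: f2 is almost proportional to f1
factors = {
    "f1": [1, 2, 3, 4, 5, 6],
    "f2": [2, 4, 5, 8, 10, 13],
    "f3": [5, 1, 4, 2, 6, 3],
}
flagged = interdependent_pairs(factors)
print(flagged)
```

Of each flagged pair, only one factor would normally be kept in the model.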

Displaying results

The results of the correlation analysis can be presented in text and graphical forms. In the first case, they are presented as a correlation coefficient, in the second - in the form of a scatter diagram.

When there is no correlation between the parameters, the points on the diagram are scattered chaotically. A medium degree of relationship shows more order: the points lie at a more or less uniform distance from the central line. With a strong relationship the points tend toward a straight line, and for r = ±1 the scatter plot is a straight line. An inverse correlation runs from the upper left to the lower right of the graph, and a direct correlation runs from the lower left to the upper right.

3D representation of a scatter plot

In addition to the traditional 2D scatter plot, a 3D graphical representation of the correlation analysis is now used.

A scatterplot matrix is ​​also used, which displays all paired plots in one figure in a matrix format. For n variables, the matrix contains n rows and n columns. The diagram located at the intersection of the i-th row and j-th column is a graph of the variables Xi versus Xj. Thus, each row and column is one dimension, a single cell displays a scatterplot of two dimensions.

Assessing the strength of the relationship

The strength of the correlation is determined by the absolute value of the correlation coefficient (r): strong for |r| from 0.7 to 1, medium for |r| from 0.3 to 0.699, weak for |r| from 0 to 0.299. This classification is not strict. The figure shows a slightly different scheme.
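The scale above can be written as a small function. This is only a sketch of the thresholds quoted in the text, which are themselves not strict; the function name is hypothetical.

```python
def relationship_strength(r):
    """Classify a correlation coefficient using the 0.3 / 0.7 thresholds."""
    if not -1 <= r <= 1:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    magnitude = abs(r)  # the sign gives direction, not strength
    if magnitude >= 0.7:
        return "strong"
    if magnitude >= 0.3:
        return "medium"
    return "weak"

for value in (0.85, -0.5, 0.1):
    print(value, relationship_strength(value))
```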

An example of the application of the method of correlation analysis

An interesting study was undertaken in the UK. It examined the relationship between smoking and lung cancer by means of correlation analysis. This observation is presented below.

Initial data for correlation analysis

Professional groups for which mortality was recorded:

  • Farmers, foresters and fishermen
  • Miners and quarry workers
  • Producers of gas, coke and chemicals
  • Glass and ceramics manufacturers
  • Workers in furnaces, forges, foundries and rolling mills
  • Electrical and electronic workers
  • Engineering and related professions
  • Woodworking production
  • Tanners
  • Textile workers
  • Workwear manufacturers
  • Workers in the food, beverage and tobacco industries
  • Manufacturers of paper and printing
  • Manufacturers of other products
  • Builders
  • Painters and decorators
  • Stationary engine drivers, crane drivers, etc.
  • Workers not included elsewhere
  • Transport and communication workers
  • Warehouse workers, storekeepers, packers and filling machine workers
  • Clerical workers
  • Sellers
  • Sports and recreation service workers
  • Administrators and managers
  • Professionals, technicians and artists

Let's start the correlation analysis. For clarity, it is best to begin with the graphical method and build a scatter diagram.

The diagram shows a direct relationship. However, it is difficult to draw an unambiguous conclusion from the graphical method alone, so we continue the correlation analysis. An example of calculating the correlation coefficient is presented below.

Using software (MS Excel is described below as an example), we determine the correlation coefficient, which is 0.716 and indicates a strong relationship between the parameters under study. To check the statistical reliability of this value against the corresponding table, subtract 2 from the 25 pairs of values to get 23 degrees of freedom, and in that row of the table find the critical r for p = 0.01 (since this is medical data, a stricter level is used; in other cases p = 0.05 is sufficient). For this analysis the critical value is 0.51. Because the calculated r is greater than the critical r, the correlation coefficient is considered statistically significant.
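The check described in this paragraph comes down to a few lines of arithmetic. The Python sketch below uses only the numbers quoted in the text (25 pairs, r = 0.716, critical r = 0.51); the variable names are illustrative.

```python
n_pairs = 25        # number of professional groups in the example
r_observed = 0.716  # correlation coefficient reported in the text
r_critical = 0.51   # table value quoted in the text for p = 0.01

# Degrees of freedom for a pair correlation: number of pairs minus 2
degrees_of_freedom = n_pairs - 2

# The coefficient is significant if its absolute value exceeds the table value
is_significant = abs(r_observed) > r_critical

print(f"df = {degrees_of_freedom}, significant: {is_significant}")
# prints: df = 23, significant: True
```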

Using software for correlation analysis

The described type of statistical data processing can be carried out using software, in particular MS Excel. Correlation analysis involves calculating the following parameters using functions:

1. The correlation coefficient is determined using the CORREL(array1; array2) function, where array1 and array2 are the cell ranges holding the values of the effective and factor variables.

The linear correlation coefficient is also called the Pearson correlation coefficient, so starting with Excel 2007 you can also use the PEARSON function with the same arrays.

The correlation is displayed graphically in Excel using the "Charts" panel, where "Scatter Chart" is selected.

After specifying the initial data, we get a graph.

2. Assessing the significance of the pair correlation coefficient using Student's t-test. The calculated value of the t-statistic is compared with the tabular (critical) value taken from the corresponding table, given the chosen significance level and the number of degrees of freedom. The critical value is obtained using the TINV(probability; degrees_freedom) function.
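As a sketch of this step, the t-statistic for a pair correlation coefficient can be computed as t = r·√(n − 2) / √(1 − r²) and compared with the critical value. The Python code below applies it to the smoking example (r = 0.716, n = 25). The critical value 2.807 is an assumption taken from a standard two-sided Student's table for 23 degrees of freedom at the 0.01 level; it does not come from the article. The function names are hypothetical.

```python
from math import sqrt

def t_statistic(r, n):
    """Student's t-statistic for a pair correlation coefficient r on n pairs."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)

def critical_r(t_critical, df):
    """Convert a critical t value into the equivalent critical r."""
    return t_critical / sqrt(t_critical ** 2 + df)

r, n = 0.716, 25
t_calc = t_statistic(r, n)

# Assumption: two-sided table value for alpha = 0.01 and n - 2 = 23 degrees of freedom
t_crit = 2.807

print(f"t calculated = {t_calc:.3f}, t critical = {t_crit}")
print(f"equivalent critical r = {critical_r(t_crit, n - 2):.3f}")
print("significant" if abs(t_calc) > t_crit else "not significant")
```

The equivalent critical r comes out close to the 0.51 read from the table in the example, so the two ways of checking significance agree.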

3. Matrix of pair correlation coefficients. The analysis is carried out using the Data Analysis tool, in which Correlation is selected. Each pair correlation coefficient is assessed statistically by comparing its absolute value with the tabular (critical) value. If the calculated coefficient exceeds the critical one, we can say, at the given probability level, that the linear relationship is statistically significant, that is, the null hypothesis of no linear relationship is rejected.
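Outside Excel, the same matrix can be built in a few lines. The Python sketch below is an illustration on invented data and does not reproduce the Data Analysis tool. The critical value 0.811 is an assumption taken from a standard table for 4 degrees of freedom (6 observations) at the two-sided 0.05 level.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Invented data: one effective variable y and two factors x1, x2
data = {
    "y": [2, 4, 5, 8, 10, 13],
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [5, 1, 4, 2, 6, 3],
}
matrix = correlation_matrix(data)

# Assumption: table value for 6 - 2 = 4 degrees of freedom, two-sided alpha = 0.05
r_critical = 0.811

for row_name, row in matrix.items():
    cells = []
    for col_name, r in row.items():
        # Mark off-diagonal coefficients that exceed the critical value
        mark = "*" if row_name != col_name and abs(r) > r_critical else " "
        cells.append(f"{r:6.3f}{mark}")
    print(f"{row_name:>3}", " ".join(cells))
```

Coefficients marked with an asterisk exceed the critical value in absolute terms and would be treated as statistically significant.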

Conclusion

Using correlation analysis in scientific research makes it possible to determine the relationship between various factors and effective indicators. Keep in mind that a high correlation coefficient can also be obtained from an absurd pair or set of data, so this type of analysis must be carried out on a sufficiently large data set.

After obtaining the calculated value of r, it is advisable to compare it with the critical r to confirm that the value is statistically reliable. Correlation analysis can be carried out manually using formulas or with software tools, in particular MS Excel. There you can also build a scatter diagram to visualize the relationship between the studied factors and the effective indicator.