Correlation coefficient error in Excel. Correlation and regression analysis in Excel: step-by-step instructions

Note: the solution to your specific problem will look like this example, including all the tables and explanatory text presented below, but using your own source data.

Problem:
Given a paired sample of 26 pairs of values (x_k, y_k):

k    1    2    3    4    5    6    7    8    9    10
x_k  25.2 26.4 26.0 25.8 24.9 25.7 25.7 25.7 26.1 25.8
y_k  30.8 29.4 30.2 30.5 31.4 30.3 30.4 30.5 29.9 30.4

k    11   12   13   14   15   16   17   18   19   20
x_k  25.9 26.2 25.6 25.4 26.6 26.2 26.0 22.1 25.9 25.8
y_k  30.3 30.5 30.6 31.0 29.6 30.4 30.7 31.6 30.5 30.6

k    21   22   23   24   25   26
x_k  25.9 26.3 26.1 26.0 26.4 25.8
y_k  30.7 30.1 30.6 30.5 30.7 30.8

It is required to calculate or build:
- the correlation coefficient;
- a test of the hypothesis that the random variables X and Y are dependent, at the significance level α = 0.05;
- the coefficients of the linear regression equation;
- the scatter plot (correlation field) and the graph of the regression line.

SOLUTION:

1. Calculate the correlation coefficient.

The correlation coefficient measures the mutual probabilistic influence of two random variables. The correlation coefficient R can take values from -1 to +1. If its absolute value is close to 1, this indicates a strong relationship between the variables; if it is close to 0, the relationship is weak or absent. If the absolute value of R equals one, we can speak of a functional relationship between the variables, that is, one variable can be expressed through the other by a mathematical function.


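The correlation coefficient can be calculated by the following formulas:

$$R_{x,y} = \frac{\mathrm{cov}(X, Y)}{\sigma_x \sigma_y} \quad (1.1)$$

where

$$\mathrm{cov}(X, Y) = \frac{1}{N}\sum_{k=1}^{N}(x_k - M_x)(y_k - M_y)$$

$$\sigma_x^2 = \frac{1}{N}\sum_{k=1}^{N}(x_k - M_x)^2, \quad \sigma_y^2 = \frac{1}{N}\sum_{k=1}^{N}(y_k - M_y)^2$$

$$M_x = \frac{1}{N}\sum_{k=1}^{N} x_k, \quad M_y = \frac{1}{N}\sum_{k=1}^{N} y_k$$

or by the formula

$$R_{x,y} = \frac{M_{xy} - M_x M_y}{S_x S_y} \quad (1.4)$$

where:

$$M_x = \frac{1}{N}\sum_{k=1}^{N} x_k, \quad M_y = \frac{1}{N}\sum_{k=1}^{N} y_k, \quad M_{xy} = \frac{1}{N}\sum_{k=1}^{N} x_k y_k \quad (1.5)$$

$$S_x^2 = \frac{1}{N}\sum_{k=1}^{N} x_k^2 - M_x^2, \quad S_y^2 = \frac{1}{N}\sum_{k=1}^{N} y_k^2 - M_y^2 \quad (1.6)$$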

In practice, formula (1.4) is most often used to calculate the correlation coefficient, because it requires less computation. However, if the covariance cov(X, Y) has already been calculated, it is more convenient to use formula (1.1), because in addition to the covariance itself, the results of the intermediate calculations can be reused.

1.1. Calculate the correlation coefficient by formula (1.4). To do this, calculate the values of $x_k^2$, $y_k^2$ and $x_k y_k$ and enter them in Table 1.

Table 1


k x_k y_k x_k^2 y_k^2 x_k*y_k
1 2 3 4 5 6
1 25.2 30.8 635.04000 948.64000 776.16000
2 26.4 29.4 696.96000 864.36000 776.16000
3 26.0 30.2 676.00000 912.04000 785.20000
4 25.8 30.5 665.64000 930.25000 786.90000
5 24.9 31.4 620.01000 985.96000 781.86000
6 25.7 30.3 660.49000 918.09000 778.71000
7 25.7 30.4 660.49000 924.16000 781.28000
8 25.7 30.5 660.49000 930.25000 783.85000
9 26.1 29.9 681.21000 894.01000 780.39000
10 25.8 30.4 665.64000 924.16000 784.32000
11 25.9 30.3 670.81000 918.09000 784.77000
12 26.2 30.5 686.44000 930.25000 799.10000
13 25.6 30.6 655.36000 936.36000 783.36000
14 25.4 31 645.16000 961.00000 787.40000
15 26.6 29.6 707.56000 876.16000 787.36000
16 26.2 30.4 686.44000 924.16000 796.48000
17 26 30.7 676.00000 942.49000 798.20000
18 22.1 31.6 488.41000 998.56000 698.36000
19 25.9 30.5 670.81000 930.25000 789.95000
20 25.8 30.6 665.64000 936.36000 789.48000
21 25.9 30.7 670.81000 942.49000 795.13000
22 26.3 30.1 691.69000 906.01000 791.63000
23 26.1 30.6 681.21000 936.36000 798.66000
24 26 30.5 676.00000 930.25000 793.00000
25 26.4 30.7 696.96000 942.49000 810.48000
26 25.8 30.8 665.64000 948.64000 794.64000


1.2. Calculate $M_x$ according to formula (1.5).

1.2.1. Sum all the elements $x_k$

x_1 + x_2 + ... + x_26 = 25.20000 + 26.40000 + ... + 25.80000 = 669.500000

1.2.2. Divide the resulting sum by the number of sample elements

669.50000 / 26 = 25.75000

M_x = 25.750000

1.3. Similarly, calculate $M_y$.

1.3.1. Sum all the elements $y_k$

y_1 + y_2 + ... + y_26 = 30.80000 + 29.40000 + ... + 30.80000 = 793.000000

1.3.2. Divide the resulting sum by the number of sample elements

793.00000 / 26 = 30.50000

M_y = 30.500000

1.4. Similarly, calculate $M_{xy}$.

1.4.1. Sum all the elements of the 6th column of Table 1

776.16000 + 776.16000 + ... + 794.64000 = 20412.830000

1.4.2. Divide the resulting sum by the number of elements

20412.83000 / 26 = 785.10885

M_xy = 785.108846

1.5. Calculate the value of $S_x^2$ by formula (1.6).

1.5.1. Sum all the elements of the 4th column of Table 1

635.04000 + 696.96000 + ... + 665.64000 = 17256.910000

1.5.2. Divide the resulting sum by the number of elements

17256.91000 / 26 = 663.72731

1.5.3. Subtract the square of $M_x$ from the last number to obtain $S_x^2$

$S_x^2 = 663.72731 - 25.75000^2 = 663.72731 - 663.06250 = 0.66481$

1.6. Calculate the value of $S_y^2$ by formula (1.6).

1.6.1. Sum all the elements of the 5th column of Table 1

948.64000 + 864.36000 + ... + 948.64000 = 24191.840000

1.6.2. Divide the resulting sum by the number of elements

24191.84000 / 26 = 930.45538

1.6.3. Subtract the square of $M_y$ from the last number to obtain $S_y^2$

$S_y^2 = 930.45538 - 30.50000^2 = 930.45538 - 930.25000 = 0.20538$

1.7. Calculate the product of $S_x^2$ and $S_y^2$.

$S_x^2 \cdot S_y^2 = 0.66481 \cdot 0.20538 = 0.136541$

1.8. Taking the square root of the last number, we get the value of $S_x S_y$.

$S_x S_y = 0.36951$

1.9. Calculate the value of the correlation coefficient by formula (1.4).

$R = (785.10885 - 25.75000 \cdot 30.50000) / 0.36951 = (785.10885 - 785.37500) / 0.36951 = -0.72028$

Answer: $R_{x,y} = -0.720279$
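The steps of section 1 can be cross-checked with a short script. The following is a minimal sketch in Python (the article itself works in Excel and by hand); the function name `correlation_by_moments` is an illustrative choice, not something from the original text. It applies formulas (1.4)-(1.6) directly to the 26 pairs from the problem statement.

```python
import math

# The 26 pairs (x_k, y_k) from the problem statement.
x = [25.2, 26.4, 26.0, 25.8, 24.9, 25.7, 25.7, 25.7, 26.1, 25.8,
     25.9, 26.2, 25.6, 25.4, 26.6, 26.2, 26.0, 22.1, 25.9, 25.8,
     25.9, 26.3, 26.1, 26.0, 26.4, 25.8]
y = [30.8, 29.4, 30.2, 30.5, 31.4, 30.3, 30.4, 30.5, 29.9, 30.4,
     30.3, 30.5, 30.6, 31.0, 29.6, 30.4, 30.7, 31.6, 30.5, 30.6,
     30.7, 30.1, 30.6, 30.5, 30.7, 30.8]


def correlation_by_moments(xs, ys):
    """Correlation coefficient by formula (1.4), using (1.5) and (1.6)."""
    n = len(xs)
    m_x = sum(xs) / n                                # M_x, formula (1.5)
    m_y = sum(ys) / n                                # M_y, formula (1.5)
    m_xy = sum(a * b for a, b in zip(xs, ys)) / n    # M_xy, formula (1.5)
    s_x2 = sum(a * a for a in xs) / n - m_x ** 2     # S_x^2, formula (1.6)
    s_y2 = sum(b * b for b in ys) / n - m_y ** 2     # S_y^2, formula (1.6)
    return (m_xy - m_x * m_y) / math.sqrt(s_x2 * s_y2)


r = correlation_by_moments(x, y)
print(round(r, 5))
```

The printed value should agree with the hand calculation above up to rounding of the intermediate results.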

2. Check the significance of the correlation coefficient (test the hypothesis of dependence).

Since the estimate of the correlation coefficient is calculated from a finite sample and may therefore deviate from its population value, the significance of the correlation coefficient must be tested. The test is performed using the t-test:

$$t = \frac{R_{x,y}\sqrt{N - 2}}{\sqrt{1 - R_{x,y}^2}} \quad (2.1)$$

The random variable t follows Student's t distribution, and the critical value of the criterion ($t_{cr,\alpha}$) at a given significance level α must be found from the t distribution table. If the absolute value of t calculated by formula (2.1) is less than $t_{cr,\alpha}$, the data do not support a dependence between the random variables X and Y. Otherwise, the experimental data do not contradict the hypothesis that the random variables are dependent.


2.1. Calculate the value of the t-statistic by formula (2.1):

$$t = \frac{-0.72028\sqrt{26 - 2}}{\sqrt{1 - (-0.72028)^2}} = -5.08680$$

2.2. Find the critical value $t_{cr,\alpha}$ in the t distribution table.

The required value of $t_{cr,\alpha}$ is located at the intersection of the row corresponding to the number of degrees of freedom and the column corresponding to the given significance level α.
In our case, the number of degrees of freedom is N - 2 = 26 - 2 = 24 and α = 0.05, which corresponds to the critical value $t_{cr,\alpha}$ = 2.064 (see Table 2).

Table 2. Critical values of the t distribution

Rows: number of degrees of freedom (N - 2). Columns: significance level α.
df 0.1 0.05 0.02 0.01 0.002 0.001
1 6.314 12.706 31.821 63.657 318.31 636.62
2 2.920 4.303 6.965 9.925 22.327 31.598
3 2.353 3.182 4.541 5.841 10.214 12.924
4 2.132 2.776 3.747 4.604 7.173 8.610
5 2.015 2.571 3.365 4.032 5.893 6.869
6 1.943 2.447 3.143 3.707 5.208 5.959
7 1.895 2.365 2.998 3.499 4.785 5.408
8 1.860 2.306 2.896 3.355 4.501 5.041
9 1.833 2.262 2.821 3.250 4.297 4.781
10 1.812 2.228 2.764 3.169 4.144 4.587
11 1.796 2.201 2.718 3.106 4.025 4.437
12 1.782 2.179 2.681 3.055 3.930 4.318
13 1.771 2.160 2.650 3.012 3.852 4.221
14 1.761 2.145 2.624 2.977 3.787 4.140
15 1.753 2.131 2.602 2.947 3.733 4.073
16 1.746 2.120 2.583 2.921 3.686 4.015
17 1.740 2.110 2.567 2.898 3.646 3.965
18 1.734 2.101 2.552 2.878 3.610 3.922
19 1.729 2.093 2.539 2.861 3.579 3.883
20 1.725 2.086 2.528 2.845 3.552 3.850
21 1.721 2.080 2.518 2.831 3.527 3.819
22 1.717 2.074 2.508 2.819 3.505 3.792
23 1.714 2.069 2.500 2.807 3.485 3.767
24 1.711 2.064 2.492 2.797 3.467 3.745
25 1.708 2.060 2.485 2.787 3.450 3.725
26 1.706 2.056 2.479 2.779 3.435 3.707
27 1.703 2.052 2.473 2.771 3.421 3.690
28 1.701 2.048 2.467 2.763 3.408 3.674
29 1.699 2.045 2.462 2.756 3.396 3.659
30 1.697 2.042 2.457 2.750 3.385 3.646
40 1.684 2.021 2.423 2.704 3.307 3.551
60 1.671 2.000 2.390 2.660 3.232 3.460
120 1.658 1.980 2.358 2.617 3.160 3.373
∞ 1.645 1.960 2.326 2.576 3.090 3.291


2.3. Compare the absolute value of the t-statistic with $t_{cr,\alpha}$

The absolute value of the t-statistic is not less than the critical value: |t| = 5.08680, $t_{cr,\alpha}$ = 2.064. Therefore the experimental data, with probability 0.95 (1 - α), do not contradict the hypothesis that the random variables X and Y are dependent.
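The significance check of section 2 can be sketched in a few lines of Python. The function name `t_statistic` is illustrative; the critical value 2.064 is copied from Table 2 (24 degrees of freedom, α = 0.05) rather than computed, so that the sketch needs only the standard library.

```python
import math


def t_statistic(r, n):
    """t-statistic of formula (2.1) for correlation r and sample size n."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


r = -0.72028          # correlation coefficient from section 1
n = 26                # sample size
t = t_statistic(r, n)

# Critical value taken from Table 2: df = n - 2 = 24, alpha = 0.05.
t_critical = 2.064

# If |t| >= t_critical, the data do not contradict the dependence hypothesis.
dependence_not_rejected = abs(t) >= t_critical
print(round(t, 4), dependence_not_rejected)
```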

3. Calculate the coefficients of the linear regression equation.

The linear regression equation is the equation of a straight line that approximates (approximately describes) the dependence between the random variables X and Y. If we assume that X is the independent variable and Y depends on X, the regression equation is written as follows:

$$Y = a + b X \quad (3.1)$$

where:

$$b = R_{x,y}\frac{\sigma_y}{\sigma_x} = R_{x,y}\frac{S_y}{S_x} \quad (3.2)$$

$$a = M_y - b M_x \quad (3.3)$$

The coefficient b calculated by formula (3.2) is called the linear regression coefficient. In some sources a is called the constant regression coefficient and b, accordingly, the variable one.

The errors of predicting Y at a given value of X are calculated by the formulas:

$$\sigma_{y/x} = S_y\sqrt{1 - R_{x,y}^2} \quad (3.4)$$

$$\delta_{y/x} = \frac{\sigma_{y/x}}{M_y} \cdot 100\% \quad (3.5)$$

The value $\sigma_{y/x}$ (formula 3.4) is also called the residual standard deviation. It characterizes the deviation of Y from the regression line described by equation (3.1) at a fixed (given) value of X.

First calculate the ratio $S_y^2 / S_x^2 = 0.20538 / 0.66481 = 0.30894$. Taking the square root of this number, we get:
$S_y / S_x = 0.55582$

3.3 Calculate the coefficient b by formula (3.2)

$b = -0.72028 \cdot 0.55582 = -0.40035$

3.4 Calculate the coefficient a by formula (3.3)

$a = 30.50000 - (-0.40035 \cdot 25.75000) = 40.80894$

3.5 Estimate the error of the regression equation.

3.5.1 Taking the square root of $S_y^2$ and multiplying it by $\sqrt{1 - R_{x,y}^2}$, we get:

$\sigma_{y/x} = 0.31437$
3.5.4 Calculate the relative error by formula (3.5)

$\delta_{y/x} = (0.31437 / 30.50000) \cdot 100\% = 1.03073\%$
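The calculations of section 3 can be reproduced with the sketch below. It starts from the rounded intermediate values printed above, so the results agree with the text only to about four decimal places. The expression used for the residual standard deviation, $S_y\sqrt{1 - R^2}$, is an assumption that reproduces the value 0.31437 given in the text.

```python
import math

# Intermediate results from sections 1 and 2 (rounded as printed in the text).
m_x, m_y = 25.75, 30.5
s_x2, s_y2 = 0.66481, 0.20538
r = -0.72028

# Regression coefficients, formulas (3.2) and (3.3).
b = r * math.sqrt(s_y2 / s_x2)
a = m_y - b * m_x

# Residual standard deviation; assumed form S_y * sqrt(1 - R^2),
# which matches the value 0.31437 in the text.
sigma_yx = math.sqrt(s_y2) * math.sqrt(1 - r * r)

# Relative error in percent, as in step 3.5.4.
delta_yx = sigma_yx / m_y * 100

print(round(b, 5), round(a, 4), round(sigma_yx, 5), round(delta_yx, 4))
```

Because b is rounded before a is computed in the hand calculation, small differences in the last digits of a are expected.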

4. Build the scatter plot (correlation field) and the graph of the regression line.

A scatter plot is a graphical representation of the paired values (x_k, y_k) as points on a plane in rectangular coordinates with X and Y axes. The correlation field is one of the graphical representations of a paired sample. The graph of the regression line is drawn in the same coordinate system. The scale and the starting points on the axes should be chosen carefully so that the diagram is as clear as possible.

4.1. Find the minimum and maximum elements of sample X. These are the 18th and 15th elements respectively: x_min = 22.10000 and x_max = 26.60000.

4.2. Find the minimum and maximum elements of sample Y. These are the 2nd and 18th elements respectively: y_min = 29.40000 and y_max = 31.60000.

4.3. On the abscissa axis, choose a starting point slightly to the left of the point x_18 = 22.10000, and a scale such that the point x_15 = 26.60000 fits on the axis and the remaining points are clearly distinguishable.

4.4. On the ordinate axis, choose a starting point slightly below the point y_2 = 29.40000, and a scale such that the point y_18 = 31.60000 fits on the axis and the remaining points are clearly distinguishable.

4.5. Place the values x_k on the abscissa axis and the values y_k on the ordinate axis.

4.6. Plot the points (x_1, y_1), (x_2, y_2), ..., (x_26, y_26) on the coordinate plane. The result is the scatter plot (correlation field) shown in the figure below.

4.7. Draw the regression line.

To do this, find two different points with coordinates (x_r1, y_r1) and (x_r2, y_r2) that satisfy equation (3.6), plot them on the coordinate plane and draw a straight line through them. As the abscissa of the first point take x_min = 22.10000. Substituting x_min into equation (3.6), we get the ordinate of the first point. Thus we have a point with coordinates (22.10000, 31.96127). Similarly, we obtain the coordinates of the second point by taking x_max = 26.60000 as the abscissa. The second point is (26.60000, 30.15970).

The regression line is shown in red in the figure below.

Note that the regression line always passes through the point of the mean values of X and Y, i.e. the point with coordinates (M_x, M_y).
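The two endpoints of the regression line and the remark about the mean point can be checked numerically. This is a small sketch that assumes the fitted line is Y = a + bX with the rounded coefficients from section 3; the helper name `predict` is illustrative. Plotting itself is left out so that the sketch needs no third-party library.

```python
# Rounded regression coefficients from section 3.
a, b = 40.80894, -0.40035


def predict(x_value):
    """Ordinate of the regression line Y = a + b * X at x_value."""
    return a + b * x_value


x_min, x_max = 22.1, 26.6
point_1 = (x_min, predict(x_min))   # left end of the line
point_2 = (x_max, predict(x_max))   # right end of the line

# The line should pass (up to rounding) through the mean point (M_x, M_y).
m_x, m_y = 25.75, 30.5
y_at_mean = predict(m_x)

print(point_1, point_2, round(y_at_mean, 3))
```

With the rounded coefficients the ordinates differ from the values in the text only in the fourth decimal place.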

To determine the degree of dependence between several indicators, multiple correlation coefficients are used. They are then collected in a separate table called the correlation matrix. The row and column names of such a matrix are the names of the parameters whose mutual dependence is being established. The corresponding correlation coefficients are located at the intersections of rows and columns. Let's find out how to perform this calculation using Excel tools.

Depending on the absolute value of the correlation coefficient, the strength of the relationship between indicators is usually classified as follows:

  • 0 - 0.3: no relationship;
  • 0.3 - 0.5: weak relationship;
  • 0.5 - 0.7: medium relationship;
  • 0.7 - 0.9: high;
  • 0.9 - 1: very strong.

If the correlation coefficient is negative, the relationship between the parameters is inverse.
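The scale above translates directly into a small helper. This is a sketch only: the function name `describe_correlation` is illustrative, and because the listed ranges share their endpoints, treating each boundary value (0.3, 0.5, 0.7, 0.9) as belonging to the stronger category is an assumption.

```python
def describe_correlation(r):
    """Return (strength, direction) for a correlation coefficient r."""
    strength = abs(r)
    # Assumption: a boundary value falls into the stronger category.
    if strength < 0.3:
        label = "absent"
    elif strength < 0.5:
        label = "weak"
    elif strength < 0.7:
        label = "medium"
    elif strength < 0.9:
        label = "high"
    else:
        label = "very strong"
    direction = "inverse" if r < 0 else "direct"
    return label, direction


print(describe_correlation(-0.72028))
print(describe_correlation(0.92))
```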

To build a correlation matrix in Excel, a tool from the "Data Analysis" package is used. It is called "Correlation". Let's find out how to use it to calculate multiple correlation indicators.

Stage 1: Activating the analysis package

It should be said right away that the "Data Analysis" package is disabled by default. Therefore, before calculating the correlation coefficients, it must be activated. Unfortunately, not every user knows how to do this, so we will dwell on this point.


After this, the "Data Analysis" tool package will be activated.

Stage 2: Calculating the coefficient

Now we can proceed directly to calculating the multiple correlation coefficient. Using as an example a table of labor productivity, capital-labor ratio and energy-labor ratio at various enterprises, let's calculate the multiple correlation coefficient of these factors.


Stage 3: Analyzing the result

Now let's figure out how to interpret the result produced by the "Correlation" tool in Excel.

As we see from the table, the correlation coefficient between the capital-labor ratio (Column 2) and the energy-labor ratio (Column 1) is 0.92, which corresponds to a very strong relationship. Between labor productivity (Column 3) and the energy-labor ratio (Column 1) it is 0.72, which is a high degree of dependence. The correlation coefficient between labor productivity (Column 3) and the capital-labor ratio (Column 2) is 0.88, which also corresponds to a high degree of dependence. Thus, the relationship between all the factors studied is fairly strong.

As you can see, the "Data Analysis" package in Excel is a convenient and fairly easy-to-use tool for determining multiple correlation coefficients. It can also be used to calculate the ordinary correlation between two factors.

LABORATORY WORK

Correlation analysis in Excel

1.1 Correlation analysis in MS Excel

Correlation analysis consists in determining the degree of relationship between two random variables X and Y. The correlation coefficient is used as a measure of this relationship. It is estimated from a sample of n paired observations (x_i, y_i) from the joint population of X and Y. To assess the degree of relationship between X and Y measured on quantitative scales, the linear correlation coefficient (Pearson coefficient) is used, which assumes that the samples X and Y are normally distributed.

The correlation coefficient varies from -1 (strict inverse linear dependence) to 1 (strict direct linear dependence). At a value of 0 there is no linear relationship between the two samples.

General classification of correlation relationships (by Ivanter E.V., Korowov A.V., 1992):

There are several types of correlation coefficients, depending on the variables X and Y, which can be measured on different scales. It is this fact that determines the choice of the appropriate correlation coefficient (see Table 13):

In MS Excel, the pairwise linear correlation coefficient is calculated with the special function CORREL (array1; array2),

where array1 is a reference to the cell range of the first sample (X);

array2 is a reference to the cell range of the second sample (Y).

Example 1: 10 schoolchildren were given tests of visual-figurative and verbal thinking. The average time taken to solve the test tasks was measured in seconds. The researcher is interested in whether there is a relationship between the times taken to solve these tasks. The variable X denotes the average time to solve the visual-figurative tasks, and the variable Y the average time to solve the verbal tasks.

Solution: To identify the degree of relationship, first enter the data into an MS Excel table (see Table, Fig. 1). Then calculate the correlation coefficient. To do this, place the cursor in cell C1. On the toolbar, click the Insert Function (fx) button.

In the Function Wizard dialog box that appears, select the Statistical category and the CORREL function, then click OK. Using the mouse pointer, enter the data range of sample X in the array1 field (A1:A10). In the array2 field, enter the data range of sample Y (B1:B10). Click OK. Cell C1 will show the correlation coefficient - 0.54119. Next, look at the absolute value of the correlation coefficient and determine the type of relationship (close, weak, medium, etc.).

Fig. 1. Results of the calculation of the correlation coefficient

Thus, a relationship between the times taken to solve the visual-figurative and verbal test tasks is not proven.

Task 1. Data are available for 20 farms. Find the correlation coefficient between grain crop yield and land quality and assess its significance. The data are shown in the table.

Table 2. Dependence of grain crop yield on land quality

Farm number

Land quality, score

Yield, c/ha


Task 2. Determine whether there is a relationship between the operating time of a fitness exercise machine (thousand hours) and the cost of its repair (thousand rubles):

Operating time of the machine (thousand hours)

Repair cost (thousand rubles)

1.2 Multiple Correlation in MS Excel

When there are many observations and correlation coefficients must be calculated sequentially for several samples, the resulting coefficients are, for convenience, collected in tables called correlation matrices.

A correlation matrix is a square table in which the correlation coefficient between the corresponding parameters is located at the intersection of the corresponding row and column.

In MS Excel, correlation matrices are calculated with the Correlation procedure from the Data Analysis package. The procedure produces a correlation matrix containing the correlation coefficients between the various parameters.

To run the procedure:

1. Run the command Tools - Data Analysis;

2. In the Analysis Tools list that appears, select the Correlation item and click OK;

3. In the dialog that appears, specify the Input Range, that is, enter a reference to the cells containing the data to be analyzed. The input range must contain at least two columns.

4. In the Grouped By section, set the switch according to how the data were entered (by columns or by rows);

5. Specify the Output Range, that is, enter a reference to the cell from which the results of the analysis will be displayed. The size of the output range is determined automatically, and a message is displayed if the output range might overlap the source data. Click OK.

A correlation matrix will be displayed in the output range, with the correlation coefficient between the corresponding parameters at the intersection of each row and column. The cells of the output range whose row and column coordinates coincide contain the value 1, since each column of the input range correlates perfectly with itself.
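What the Correlation procedure computes can be illustrated outside Excel. The sketch below builds a correlation matrix in plain Python; the three data columns are made-up numbers chosen only for illustration, and the function names are hypothetical. It is not a description of how Excel implements the tool.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [v - mean_x for v in xs]
    dy = [v - mean_y for v in ys]
    numerator = sum(p * q for p, q in zip(dx, dy))
    denominator = math.sqrt(sum(p * p for p in dx) * sum(q * q for q in dy))
    return numerator / denominator


def correlation_matrix(columns):
    """Square matrix of pairwise correlations between the given columns."""
    return [[pearson(a, b) for b in columns] for a in columns]


# Hypothetical data: three parameters observed five times each.
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.1, 5.9, 8.2, 9.8],
    [5.0, 3.9, 3.1, 2.0, 1.2],
]

matrix = correlation_matrix(columns)
for row in matrix:
    print([round(value, 3) for value in row])
```

As in the Excel output, the diagonal holds 1 and the matrix is symmetric, so only the part below the diagonal carries new information.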

Example 2. Monthly observation data are available on the weather and on attendance at museums and parks (see Table 3). Determine whether there is a relationship between the weather and attendance at museums and parks.

Table 3. Observation results

Number of clear days

Number of museum visitors

Number of park visitors

Solution. To perform the correlation analysis, enter the source data in the range A1:G3 (Fig. 2). Then in the Tools menu select Data Analysis and choose the Correlation item. In the dialog box that appears, specify the Input Range (A2:C7). Indicate that the data are grouped by columns. Specify the output range (E1) and click OK.

In Fig. 33 it can be seen that the correlation between the weather and museum attendance is -0.92, between the weather and park attendance it is 0.97, and between park and museum attendance it is -0.92.

Thus, the analysis revealed the following dependencies: a strong inverse linear relationship between museum attendance and the number of sunny days, and an almost linear (very strong direct) relationship between park attendance and the weather. There is a strong inverse relationship between museum attendance and park attendance.

Fig. 2. The results of calculating the correlation matrix from Example 2

Task 3. Ten managers were assessed by the method of expert evaluation of the psychological characteristics of a manager's personality. Fifteen experts rated each psychological characteristic on a five-point scale (see Table 4). The psychologist is interested in how these characteristics of a manager are related to one another.

Table 4. Research results

Subject No.

Tactfulness

Demandingness

Criticality

The correlation coefficient (or linear correlation coefficient) is denoted "r" (in rare cases "ρ") and characterizes the linear correlation (that is, a relationship defined by some magnitude and direction) of two or more variables. The value of the coefficient lies between -1 and +1, that is, the correlation can be either positive or negative. If the correlation coefficient is -1, there is a perfect negative correlation; if it is +1, there is a perfect positive correlation. In other cases there is a positive correlation, a negative correlation or no correlation between the two variables. The correlation coefficient can be calculated manually, with free online calculators or with a good graphing calculator.

Steps

Calculating the correlation coefficient manually

    Collect the data. Before calculating the correlation coefficient, examine the given pairs of numbers. It is best to write them in a table, which can be arranged vertically or horizontally. Label each row or column as "x" and "y".

    • For example, four pairs of values (numbers) of the variables "x" and "y" are given. You can create the following table:
      • x || y
      • 1 || 1
      • 2 || 3
      • 4 || 5
      • 5 || 7
  1. Calculate the arithmetic mean of "x". To do this, add up all the values of "x" and divide the result by the number of values.

    • In our example, four values of the variable "x" are given. To calculate the arithmetic mean of "x", add these values and divide the sum by 4. The calculation is written as follows:
    • $\mu_x = (1 + 2 + 4 + 5) / 4$
    • $\mu_x = 12 / 4$
    • $\mu_x = 3$
  2. Find the arithmetic mean of "y". To do this, perform the same steps, that is, add up all the values of "y" and divide the sum by the number of values.

    • In our example, four values of the variable "y" are given. Add these values and divide the sum by 4. The calculation is written as follows:
    • $\mu_y = (1 + 3 + 5 + 7) / 4$
    • $\mu_y = 16 / 4$
    • $\mu_y = 4$
  3. Calculate the standard deviation of "x". Having calculated the means of "x" and "y", find the standard deviations of these variables. The standard deviation is calculated by the following formula:

    • $\sigma_x = \sqrt{\frac{1}{n-1}\sum (x - \mu_x)^2}$
    • $\sigma_x = \sqrt{\frac{1}{4-1}\left((1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2\right)}$
    • $\sigma_x = \sqrt{\frac{1}{3}(4 + 1 + 1 + 4)}$
    • $\sigma_x = \sqrt{\frac{1}{3} \cdot 10}$
    • $\sigma_x = \sqrt{\frac{10}{3}}$
    • $\sigma_x = 1.83$
  4. Calculate the standard deviation of "y". Follow the steps described in the previous step. Use the same formula, but substitute the values of "y".

    • In our example, the calculation is written as follows:
    • $\sigma_y = \sqrt{\frac{1}{4-1}\left((1-4)^2 + (3-4)^2 + (5-4)^2 + (7-4)^2\right)}$
    • $\sigma_y = \sqrt{\frac{1}{3}(9 + 1 + 1 + 9)}$
    • $\sigma_y = \sqrt{\frac{1}{3} \cdot 20}$
    • $\sigma_y = \sqrt{\frac{20}{3}}$
    • $\sigma_y = 2.58$
  5. Write down the basic formula for calculating the correlation coefficient. This formula includes the means, the standard deviations and the number (n) of pairs of values of both variables. The correlation coefficient is denoted "r" (in rare cases "ρ"). This article uses the formula for the Pearson correlation coefficient.

    • Here and in other sources, the quantities may be denoted differently. For example, some formulas use "ρ" and "σ", while others use "r" and "s". Some textbooks give other formulas, but they are mathematically equivalent to the formula above.
  6. You have calculated the means and standard deviations of both variables, so you can now use the formula to calculate the correlation coefficient. Recall that "n" is the number of pairs of values of both variables. The other quantities were calculated earlier.

    • In our example, the calculation is written as follows:
    • $\rho = \frac{1}{n-1}\sum \left(\frac{x - \mu_x}{\sigma_x}\right)\left(\frac{y - \mu_y}{\sigma_y}\right)$
    • $\rho = \frac{1}{3}\left[\frac{1-3}{1.83}\cdot\frac{1-4}{2.58} + \frac{2-3}{1.83}\cdot\frac{3-4}{2.58} + \frac{4-3}{1.83}\cdot\frac{5-4}{2.58} + \frac{5-3}{1.83}\cdot\frac{7-4}{2.58}\right]$
    • $\rho = \frac{1}{3}\cdot\frac{6 + 1 + 1 + 6}{4.721}$
    • $\rho = \frac{1}{3}\cdot 2.965$
    • $\rho = \frac{2.965}{3}$
    • $\rho = 0.988$
  7. Analyze the result. In our example, the correlation coefficient is 0.988. Pay attention to both the sign and the magnitude of this value. Since the correlation coefficient is positive, there is a positive correlation between the variables "x" and "y". That is, as the value of "x" increases, the value of "y" also increases.

    • Since the value of the correlation coefficient is very close to +1, the values of the variables "x" and "y" are strongly related. If you plot the points on the coordinate plane, they will lie close to a straight line.
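The manual calculation above can be repeated without intermediate rounding. The sketch below follows the same steps in Python (means, sample standard deviations with n - 1, then the sum of products of standardized deviations); the variable names are illustrative.

```python
import math

x = [1, 2, 4, 5]
y = [1, 3, 5, 7]
n = len(x)

# Arithmetic means.
mu_x = sum(x) / n
mu_y = sum(y) / n

# Sample standard deviations (divisor n - 1).
sigma_x = math.sqrt(sum((v - mu_x) ** 2 for v in x) / (n - 1))
sigma_y = math.sqrt(sum((v - mu_y) ** 2 for v in y) / (n - 1))

# Pearson correlation coefficient as the averaged product of z-scores.
rho = sum(((a - mu_x) / sigma_x) * ((b - mu_y) / sigma_y)
          for a, b in zip(x, y)) / (n - 1)

print(round(mu_x, 2), round(mu_y, 2))
print(round(sigma_x, 3), round(sigma_y, 3))
print(round(rho, 5))
```

Carried through at full precision, the coefficient comes out slightly higher than 0.988, at about 0.990; the difference arises because the hand calculation rounds the standard deviations to 1.83 and 2.58 before dividing.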
    Using online calculators to calculate the correlation coefficient

    Find a correlation coefficient calculator on the Internet.

    1. This coefficient is calculated very often in statistics. If there are many pairs of numbers, calculating the correlation coefficient by hand is impractical, which is why online calculators for it exist. In a search engine, enter "correlation coefficient calculator" (without quotes). Enter the data.

    2. Read the instructions on the site to enter the data (pairs of numbers) correctly. It is essential to enter matching pairs of numbers; otherwise you will get an incorrect result. Remember that different websites use different data entry formats.

      • For example, on the site http://ncalculators.com/statistics/correlation-coefficient-calculator.htm, the values of the variables "x" and "y" are entered in two horizontal lines. The values are separated by commas. That is, in our example, the values of "x" are entered as 1,2,4,5 and the values of "y" as 1,3,5,7.
      • On another site, http://www.alcula.com/calculators/statistics/correlation-cooefficient/, the data are entered vertically; in this case, take care not to mix up the matching pairs of numbers.
    3. Calculate the correlation coefficient. After entering the data, simply click the "Calculate" button or a similar one to get the result.

      Using a graphing calculator

      1. Take a graphing calculator, switch to statistics mode and select the Edit command.

        • The keys to press differ between calculators. This article uses the Texas Instruments TI-86.
        • To enter statistics mode, press Stat (above the + key). Then press F2 (Edit).
      2. Delete previously saved data. Most calculators keep the statistical data you entered until you erase it. To avoid mixing old data with new, first delete any saved values.

        • Use the arrow keys to move the cursor to the xStat header. Then press Clear to remove all values entered in the xStat column.
        • Use the arrow keys to highlight the yStat header. Then press Clear to remove all values entered in the yStat column.
      3. Enter the source data. Use the arrow keys to move the cursor to the first cell under the xStat header. Enter the first value and press ENTER. While you type, the bottom of the screen shows xStat(1)=__, with the entered value in place of the blank. After you press ENTER, the value appears in the table and the cursor moves to the next row; at the same time, xStat(2)=__ appears at the bottom of the screen.

        • Enter all the values of the variable "x".
        • After entering all the values of "x", use the arrow keys to go to the yStat column and enter the values of the variable "y".
        • After entering all pairs of numbers, press EXIT to clear the screen and leave statistics mode.
      4. Calculate the correlation coefficient. It characterizes how closely the data lie to a straight line. A graphing calculator can quickly fit the line and calculate the correlation coefficient.

        • Press Stat, then Calc (calculations). On the TI-86 you need to press - -.
        • Select the Linear Regression function. On the TI-86, press the key labeled "LINR". The line "Linr _" with a flashing cursor appears on the screen.
        • Now enter the names of the two variables: xStat and yStat.
          • On the TI-86, open the list of names; to do this, press - -.
          • The available variables are displayed in the bottom line of the screen. Select the first variable (most likely by pressing F1 or F2), enter a comma, and then select the second.
          • Press ENTER to process the entered data.
      5. Analyze the results. After you press ENTER, the following information is displayed on the screen:

        • y = a + bx: the function that describes the fitted line. Note that the function is not written in the standard form (y = kx + b).
        • a =: the y-coordinate of the point where the line intersects the Y axis.
        • b =: the slope of the line.
        • corr =: the correlation coefficient.
        • n =: the number of pairs of numbers used in the calculation.
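The quantities reported by the calculator (a, b, corr, n) can be reproduced with an ordinary least-squares fit. The sketch below is an illustration in Python rather than the calculator's internal routine; the function name `linear_regression` is hypothetical, and the data are the pairs x = 1, 2, 4, 5 and y = 1, 3, 5, 7 used earlier in this article.

```python
import math

def linear_regression(xs, ys):
    """Least-squares fit y = a + b*x; returns (a, b, corr, n)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    b = sxy / sxx                      # slope of the line
    a = mean_y - b * mean_x            # intercept with the Y axis
    corr = sxy / math.sqrt(sxx * syy)  # correlation coefficient
    return a, b, corr, n

a, b, corr, n = linear_regression([1, 2, 4, 5], [1, 3, 5, 7])
print(f"y = {a:.2f} + {b:.2f}x, corr = {corr:.4f}, n = {n}")
```

For these four pairs the fit gives a slope of 1.4 and an intercept of about -0.2.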

In scientific research, it is often necessary to find a relationship between outcome and factor variables (the yield of a crop and the amount of precipitation, a person's height and weight in groups homogeneous by sex and age, pulse rate and body temperature, etc.).

The latter (factor variables) are characteristics that contribute to changes in the former (outcome variables) associated with them.

Concept of correlation analysis

Based on the above, correlation analysis can be described as a method used to test a hypothesis about the statistical significance of the relationship between two or more variables that the researcher can measure but not change.

There are other definitions of the concept. Correlation analysis is a data-processing method that consists in studying the correlation coefficients between variables. The correlation coefficients between one pair or many pairs of characteristics are compared in order to establish statistical relationships between them. Correlation analysis is also defined as a method for studying the statistical dependence between random variables, not necessarily of a strictly functional nature, in which a change in one random variable leads to a change in the mathematical expectation of the other.

Concept of false correlation

When conducting correlation analysis, keep in mind that it can be carried out on any set of characteristics, including ones that are absurd in relation to each other. Sometimes they have no causal connection with each other.

In this case, one speaks of a false correlation.

Tasks of correlation analysis

Based on the above definitions, the following tasks of the method can be formulated: to obtain information about one of the variables of interest from the other, and to determine the closeness of the relationship between the variables under study.

Correlation analysis involves determining the dependence between the characteristics under study, so its tasks can be supplemented with the following:

  • identifying the factors that have the greatest influence on the outcome characteristic;
  • identifying previously unexplored causes of relationships;
  • constructing a correlation model and analyzing its parameters;
  • studying the significance of the relationship parameters and their interval estimates.

Relationship between correlation analysis and regression

Correlation analysis is often not limited to finding the closeness of the relationship between the values under study. Sometimes it is supplemented by regression equations, which are obtained using regression analysis and describe the correlation dependence between the outcome characteristic and the factor characteristic (or characteristics). Together with correlation analysis, this forms the method of correlation and regression analysis.

Conditions for using the method

Outcome indicators depend on one or several factors. Correlation analysis can be applied if there is a large number of observations of the values of the outcome and factor indicators, and the factors studied must be quantitative and recorded in specific sources. If the indicators follow the normal law, the result of the correlation analysis is the Pearson correlation coefficient; if the characteristics do not follow this law, the Spearman rank correlation coefficient is used.
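The difference between the two coefficients can be illustrated with a short sketch. The Python below computes the Spearman coefficient as the Pearson coefficient of the ranks, with tied values given their average rank. The function names and the sample data are invented for illustration and do not come from the article.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Hypothetical data: y grows monotonically with x, but not linearly
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 100]
print(spearman_rho(x, y), pearson_r(x, y))
```

Because the relationship is monotonic, the rank coefficient equals 1, while the Pearson coefficient is lower since the points do not lie on a straight line.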

Rules for selecting correlation factors

When applying this method, it is necessary to determine the factors that affect the outcome indicators. They are selected on the basis that there should be causal relationships between the indicators. When building a multifactor correlation model, the factors chosen are those that have a significant effect on the outcome indicator. Interdependent factors with a pair correlation coefficient above 0.85 should preferably not be included in the model, nor should factors whose relationship with the outcome parameter is non-linear or functional.
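The 0.85 rule can be checked mechanically before a multifactor model is built. The sketch below is an assumption-laden Python illustration: the function `collinear_pairs`, the factor names and all values are hypothetical, and only the threshold comes from the text.

```python
import math
from itertools import combinations

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def collinear_pairs(factors, threshold=0.85):
    """Return (name_a, name_b, r) for factor pairs whose |r| exceeds threshold."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(factors.items(), 2):
        r = pearson_r(col_a, col_b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, r))
    return flagged

# Hypothetical factor columns, invented purely for illustration
factors = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.1, 3.9, 6.2, 8.0, 9.8],  # roughly 2 * f1, so strongly interdependent
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
flagged = collinear_pairs(factors)
for name_a, name_b, r in flagged:
    print(f"{name_a} and {name_b}: r = {r:.3f}")
```

Only the pair f1 and f2 is flagged here, so one of the two would be a candidate for exclusion from the model.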

Presenting the results

The results of correlation analysis can be presented in text or graphic form. In the first case they are given as a correlation coefficient, in the second as a scatter diagram.

When there is no correlation between the parameters, the points on the diagram are scattered chaotically. A medium degree of relationship shows more order: the points lie at a more or less uniform distance from the median line. With a strong relationship the points tend toward a straight line, and at r = ±1 the scatter plot is a straight line. An inverse correlation runs from the upper left to the lower right, a direct one from the lower left to the upper right.

Three-dimensional representation of the scatter diagram

In addition to the traditional 2D scatter diagram, a 3D graphical representation of correlation analysis is now also used.

A scatter plot matrix is also used, which displays all pairwise plots in a single figure in matrix format. For n variables, the matrix contains n rows and n columns. The plot at the intersection of the i-th row and the j-th column shows the variable Xi against Xj. Thus each row and each column corresponds to one dimension, and each cell displays the scatter plot of two dimensions.

Assessing the strength of the relationship

The strength of the correlation is determined by the correlation coefficient (r): strong for r = ±0.7 to ±1, medium for r = ±0.3 to ±0.699, weak for r = 0 to ±0.299. This classification is not strict. The figure shows a slightly different scheme.
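The thresholds above translate directly into a small helper. This Python sketch uses the boundaries given in the text (0.3 and 0.7); the function name `correlation_strength` is illustrative, and since the classification is not strict, the cut-offs are a convention rather than a rule.

```python
def correlation_strength(r):
    """Classify |r| as weak, medium or strong using the 0.3 / 0.7 boundaries."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    magnitude = abs(r)  # the sign gives only the direction of the relationship
    if magnitude >= 0.7:
        return "strong"
    if magnitude >= 0.3:
        return "medium"
    return "weak"

for value in (0.716, -0.45, 0.1):
    print(value, correlation_strength(value))
```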

An example of the application of the correlation analysis method

A curious study was undertaken in the UK. It examined the link between smoking and lung cancer and was carried out by correlation analysis. The observations are presented below.

Source data for correlation analysis

Table columns: professional group, mortality. The professional groups are:

  • Farmers, foresters and fishermen
  • Miners and quarrymen
  • Gas, coke and chemical makers
  • Glass and ceramics makers
  • Furnace, forge, foundry and rolling mill workers
  • Electrical and electronics workers
  • Engineering and related professions
  • Woodworkers
  • Leather workers
  • Textile workers
  • Clothing workers
  • Food, drink and tobacco industry workers
  • Paper and printing workers
  • Makers of other products
  • Builders
  • Painters and decorators
  • Operators of stationary engines, cranes, etc.
  • Laborers not classified elsewhere
  • Transport and communication workers
  • Warehouse workers, storekeepers, packers and bottlers
  • Clerical workers
  • Sales workers
  • Sport and recreation workers
  • Administrators and managers
  • Professionals, technical workers and artists

We begin the correlation analysis. For clarity, it is best to start with the graphical method, so we construct a scatter diagram.

It shows a direct relationship. However, it is difficult to draw an unambiguous conclusion from the graphical method alone, so we continue the correlation analysis. An example of calculating the correlation coefficient is given below.

Using software (MS Excel is taken as the example and described below), we determine the correlation coefficient, which is 0.716, indicating a strong relationship between the parameters under study. We then check the statistical significance of this value against the corresponding table. From the 25 pairs of values we subtract 2, which gives 23 degrees of freedom, and in that row of the table we find the critical r for p = 0.01 (a stricter level is used because these are medical data; in other cases p = 0.05 is sufficient), which is 0.51 for this correlation analysis. Since the calculated r is greater than the critical r, the correlation coefficient is considered statistically significant.
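The tabulated critical r is related to the critical value of Student's t by r = t / sqrt(t² + df), so the table lookup can be reproduced numerically. The Python sketch below rests on one assumption that is not in the article: the two-sided critical t for p = 0.01 and 23 degrees of freedom is taken as 2.807 from a standard Student table. The function name `critical_r` is illustrative.

```python
import math

def critical_r(t_critical, df):
    """Convert a critical Student t value into the equivalent critical r."""
    return t_critical / math.sqrt(t_critical ** 2 + df)

n = 25           # number of pairs in the example
df = n - 2       # degrees of freedom: 23
t_crit = 2.807   # assumption: standard table value, two-sided p = 0.01, df = 23
r_crit = critical_r(t_crit, df)
r = 0.716        # correlation coefficient from the example

print(df, round(r_crit, 3), abs(r) > r_crit)
```

The result, about 0.505, agrees with the 0.51 quoted from the table after rounding, and the calculated r = 0.716 exceeds it.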

Use of software when conducting correlation analysis

The described type of statistical data processing can be carried out using software, in particular MS Excel. Correlation analysis involves calculating the following parameters using functions:

1. The correlation coefficient is determined with the CORREL function, CORREL(array1; array2), where array1 and array2 are the cell ranges containing the values of the outcome and factor variables.

The linear correlation coefficient is also called the Pearson correlation coefficient, and for this reason, starting with Excel 2007, a corresponding function can be used with the same arrays.

The graphical display of correlation analysis in Excel is produced using the "Charts" panel by choosing "Scatter chart".

After specifying the source data, we get the chart.

2. Assessing the significance of the pair correlation coefficient using Student's t-test. The calculated value of the t-statistic is compared with the tabulated (critical) value, taken from the corresponding table for the chosen significance level and number of degrees of freedom. The critical value is obtained with the function TINV(probability; degrees of freedom).
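The calculated t-statistic for a pair correlation coefficient is commonly obtained as t = r·sqrt(n − 2) / sqrt(1 − r²). The Python sketch below applies this formula to the smoking example (r = 0.716, 25 pairs). The critical value 2.807 (two-sided, p = 0.01, 23 degrees of freedom) is an assumption taken from a standard Student table, standing in for the value the spreadsheet function would return; the function name `t_statistic` is illustrative.

```python
import math

def t_statistic(r, n):
    """Student t-statistic for a Pearson r computed from n pairs."""
    if abs(r) >= 1.0:
        raise ValueError("t is undefined for |r| = 1 (exact functional relationship)")
    df = n - 2  # degrees of freedom
    return r * math.sqrt(df) / math.sqrt(1.0 - r * r)

t_calc = t_statistic(0.716, 25)
t_crit = 2.807  # assumption: table value for two-sided p = 0.01, df = 23
significant = abs(t_calc) > t_crit

print(round(t_calc, 2), significant)
```

Here the calculated t is about 4.92, well above the critical value, which matches the conclusion reached by comparing r with the critical r.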

3. Matrix of pair correlation coefficients. The analysis is carried out using the "Data Analysis" tool, in which "Correlation" is selected. The statistical significance of each pair correlation coefficient is assessed by comparing its absolute value with the tabulated (critical) value. When the calculated pair correlation coefficient exceeds the critical value, one can say, at the chosen probability level, that the null hypothesis of no linear relationship is rejected, that is, the linear relationship is significant.
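A matrix of pair correlation coefficients can also be sketched outside the spreadsheet. The Python below builds the full symmetric matrix for a few named columns. It is an illustration only, not a reproduction of the Excel tool's output layout, and the column names and values are invented.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Nested dict m[a][b] holding the pair correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical data: two factor columns and one outcome column
columns = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 5, 4, 5],
    "y": [5, 3, 4, 1, 2],
}
m = correlation_matrix(columns)

# Print the matrix row by row
for a in columns:
    print(a, " ".join(f"{m[a][b]:6.3f}" for b in columns))
```

Each off-diagonal entry would then be compared in absolute value with the critical r for the given number of observations.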

Conclusion

The use of correlation analysis in scientific research makes it possible to determine the relationship between various factors and outcome indicators. Keep in mind that a high correlation coefficient can also be obtained from an absurd pair or set of data, which is why this type of analysis should be performed on a sufficiently large data set.

After obtaining the calculated value of r, it is advisable to compare it with the critical r to confirm the statistical significance of the value. Correlation analysis can be carried out manually using formulas or with software, in particular MS Excel. There you can also build a scatter diagram to illustrate the relationship between the factors studied and the outcome characteristic.