Spearman's rank correlation coefficient. Correlation analysis by the Spearman method (Spearman ranks)

In practice, the Spearman rank correlation coefficient (ρ) is often used to determine the closeness of the relationship between two features. The values of each feature are ranked in ascending order (from 1 to n), and then the difference (d) between the ranks corresponding to each observation is determined.

Example #1. The relationship between the volume of industrial production and investment in fixed capital in 10 regions of one of the federal districts of the Russian Federation in 2003 is characterized by the following data.
Calculate Spearman's and Kendall's rank correlation coefficients. Check their significance at α = 0.05. Formulate a conclusion about the relationship between the volume of industrial production and investment in fixed assets in the regions of the Russian Federation under consideration.

Assign ranks to the feature Y and the factor X. Find the sum of the squared rank differences d².
Using the calculator, we calculate Spearman's rank correlation coefficient:

X      Y      rank X, dx   rank Y, dy   (dx - dy)²
1.3    300    1     2     1
1.8    1335   2     12    100
2.4    250    3     1     4
3.4    946    4     8     16
4.8    670    5     7     4
5.1    400    6     4     4
6.3    380    7     3     16
7.5    450    8     5     9
7.8    500    9     6     9
17.5   1582   10    16    36
18.3   1216   11    9     4
22.5   1435   12    14    4
24.9   1445   13    15    4
25.8   1820   14    19    25
28.5   1246   15    10    25
33.4   1435   16    14    4
42.4   1800   17    18    1
45     1360   18    13    25
50.4   1256   19    11    64
54.8   1700   20    17    9
                          Σd² = 364

ρ = 1 - 6·Σd² / (n·(n² - 1)) = 1 - 6·364 / (20·(20² - 1)) ≈ 0.726.

The relationship between feature Y and factor X is strong and direct.
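For readers who want to reproduce Example 1 programmatically, here is a minimal sketch in Python (assuming SciPy is installed). Note that scipy.stats.spearmanr assigns average ranks to the tied Y values (1435 occurs twice), so its result differs slightly from the hand calculation above, which used a denser ranking of the ties:

    # Spearman's rho for the Example 1 data via SciPy
    from scipy import stats

    x = [1.3, 1.8, 2.4, 3.4, 4.8, 5.1, 6.3, 7.5, 7.8, 17.5,
         18.3, 22.5, 24.9, 25.8, 28.5, 33.4, 42.4, 45, 50.4, 54.8]
    y = [300, 1335, 250, 946, 670, 400, 380, 450, 500, 1582,
         1216, 1435, 1445, 1820, 1246, 1435, 1800, 1360, 1256, 1700]

    rho, p_value = stats.spearmanr(x, y)
    print(rho, p_value)   # rho comes out near 0.71 with average-rank ties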

Estimation of the significance of Spearman's rank correlation coefficient

The observed value of the test statistic is

Tobs = ρ·√((n - 2)/(1 - ρ²)) = 0.726·√(18/(1 - 0.726²)) ≈ 4.48.

From the Student's t-table we find the critical value:

Ttable = t(18; 0.05) = 1.734.

Since Tobs > Ttable, we reject the hypothesis that the rank correlation coefficient equals zero. In other words, Spearman's rank correlation coefficient is statistically significant.

Interval estimate for the rank correlation coefficient (confidence interval)

Confidence interval for Spearman's rank correlation coefficient: ρ ∈ (0.5431; 0.9095).
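A sketch of the same significance test in Python, plus an approximate interval. The t-statistic follows the formula above; the Fisher z-transform interval is one common approximation and does not coincide with the interval quoted by the calculator, which evidently uses a different construction:

    import math

    rho, n = 0.726, 20
    t_obs = rho * math.sqrt((n - 2) / (1 - rho ** 2))
    print(round(t_obs, 2))     # about 4.5, well above t(18; 0.05) = 1.734

    # Fisher z-transform confidence interval (approximate, 95%)
    z = math.atanh(rho)
    half = 1.96 / math.sqrt(n - 3)
    print(math.tanh(z - half), math.tanh(z + half))   # roughly (0.42, 0.88)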

Example #2. Initial data.

X   Y
5   4
3   4
1   3
3   1
6   6
2   2
Since the matrix contains tied ranks (equal rank numbers) in the first series, we re-rank them. The ranks are re-formed without changing their relative importance, that is, the corresponding relations (greater than, less than, equal) between the rank numbers must be preserved. It is also not allowed to set any rank below 1 or above the value equal to the number of parameters (in this case n = 6). The re-ranking is shown in the table.
New ranks
Seat number in the ordered series   Rank by the expert's assessment   New rank
1   1   1
2   2   2
3   3   3.5
4   3   3.5
5   5   5
6   6   6
Since the matrix contains tied ranks in the second series as well, we re-rank them too. The re-ranking is shown in the table.
Seat number in the ordered series   Rank by the expert's assessment   New rank
1   1   1
2   2   2
3   3   3
4   4   4.5
5   4   4.5
6   6   6
Rank matrix.
rank X, dx   rank Y, dy   (dx - dy)²
5     4.5   0.25
3.5   4.5   1
1     3     4
3.5   1     6.25
6     6     0
2     2     0
Σ: 21    21    11.5
Since there are several identical values among the features x and y, i.e. tied ranks are formed, the Spearman coefficient in this case is calculated as:

ρ = 1 - 6·(Σd² + A + B) / (n³ - n),

where

A = Σj (Aj³ - Aj)/12, B = Σk (Bk³ - Bk)/12;

j numbers the tie groups, in order, for the feature x;
Aj is the number of identical ranks in the j-th tie group for x;
k numbers the tie groups, in order, for the feature y;
Bk is the number of identical ranks in the k-th tie group for y.

A = (2³ - 2)/12 = 0.5
B = (2³ - 2)/12 = 0.5
D = A + B = 0.5 + 0.5 = 1

ρ = 1 - 6·(11.5 + 1)/(6³ - 6) ≈ 0.643.
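The same computation as a Python sketch, using the tie-corrected formula in the form given above:

    # Tie-corrected Spearman coefficient for Example 2 (n = 6)
    n, sum_d2 = 6, 11.5
    A = (2 ** 3 - 2) / 12      # one group of 2 tied ranks in x
    B = (2 ** 3 - 2) / 12      # one group of 2 tied ranks in y
    rho = 1 - 6 * (sum_d2 + A + B) / (n ** 3 - n)
    print(round(rho, 3))       # 0.643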

The relationship between feature Y and factor X is moderate and direct.

Spearman's rank correlation coefficient is a quantitative assessment of the statistical relationship between phenomena, used in non-parametric methods.

The indicator shows how the observed sum of squared differences between the ranks differs from the case of no connection.

Purpose of the service. This online calculator performs:

  • calculation of Spearman's rank correlation coefficient;
  • calculation of the confidence interval for the coefficient and assessment of its significance.

Spearman's rank correlation coefficient is one of the measures for assessing the closeness of a relationship. The qualitative characteristic of the closeness of the relationship for the rank correlation coefficient, as for the other correlation coefficients, can be assessed using the Chaddock scale.

Coefficient calculation consists of the following steps: ranking the values of both features, computing the rank difference d for each observation, computing the sum of squared differences Σd², and substituting it into the formula.

Properties of Spearman's rank correlation coefficient

Application area. The rank correlation coefficient is used to assess the quality of the relationship between two data sets. In addition, its statistical significance is used when analyzing data for heteroscedasticity.

Example. For a data sample of observed variables X and Y:

  1. compile a ranking table;
  2. find Spearman's rank correlation coefficient and test its significance at level α;
  3. assess the nature of the dependence.

Solution. Assign ranks to the feature Y and the factor X.
X    Y    rank X, dx   rank Y, dy
28   21   1    1
30   25   2    2
36   29   4    3
40   31   5    4
30   32   3    5
46   34   6    6
56   35   8    7
54   38   7    8
60   39   10   9
56   41   9    10
60   42   11   11
68   44   12   12
70   46   13   13
76   50   14   14

Rank matrix.
rank X, dx   rank Y, dy   (dx - dy)²
1    1    0
2    2    0
4    3    1
5    4    1
3    5    4
6    6    0
8    7    1
7    8    1
10   9    1
9    10   1
11   11   0
12   12   0
13   13   0
14   14   0
Σ: 105   105   10

Checking the correctness of the matrix is based on the checksum:

Σ ranks = n·(n + 1)/2 = 14·15/2 = 105.

The sums over the columns of the matrix are equal to each other and to the checksum, which means that the matrix is composed correctly.

Using the formula, we calculate Spearman's rank correlation coefficient:

ρ = 1 - 6·Σd² / (n·(n² - 1)) = 1 - 6·10 / (14·(14² - 1)) ≈ 0.978.
The relationship between feature Y and factor X is strong and direct.
Significance of Spearman's rank correlation coefficient
To test, at significance level α, the null hypothesis that the general Spearman rank correlation coefficient equals zero against the competing hypothesis H1: ρ ≠ 0, one calculates the critical point:

Tkp = t(α, k)·√((1 - ρ²)/(n - 2)),

where n is the sample size; ρ is Spearman's sample rank correlation coefficient; t(α, k) is the critical point of the two-sided critical region, found from the table of critical points of the Student distribution for significance level α and number of degrees of freedom k = n - 2.

If |ρ| < Tkp, there is no reason to reject the null hypothesis: the rank correlation between the qualitative features is not significant. If |ρ| > Tkp, the null hypothesis is rejected: there is a significant rank correlation between the qualitative features.

From the Student's table we find t(α/2, k) = t(0.1/2; 12) = 1.782.

Tkp = 1.782·√((1 - 0.978²)/12) ≈ 0.107.

Since Tkp < ρ, we reject the hypothesis that Spearman's rank correlation coefficient equals zero. In other words, the rank correlation coefficient is statistically significant, and the rank correlation between the scores on the two tests is significant.

The rank correlation coefficient proposed by K. Spearman belongs to the non-parametric measures of the relationship between variables measured on a rank scale. Calculating this coefficient requires no assumptions about the nature of the distribution of the features in the general population. This coefficient determines the degree of closeness of the relationship between ordinal features, which in this case represent the ranks of the compared values.

The value of Spearman's correlation coefficient also lies in the range from -1 to +1. Like the Pearson coefficient, it can be positive or negative, characterizing the direction of the relationship between two features measured on a rank scale.

In principle, the number of ranked features (qualities, traits, etc.) can be any, but ranking more than 20 features is difficult. It is possible that this is why the table of critical values of the rank correlation coefficient is calculated only for up to forty ranked features (n ≤ 40, Table 20 of Appendix 6).

Spearman's rank correlation coefficient is calculated by the formula:

rs = 1 - 6·Σd² / (n·(n² - 1)),

where n is the number of ranked features (indicators, subjects);

d is the difference between the ranks on the two variables for each subject;

Σd² is the sum of the squared rank differences.

Using the rank correlation coefficient, consider the following example.

Example: A psychologist investigates how individual indicators of school readiness, obtained before the start of schooling for 11 first-graders, are related to their average performance at the end of the school year.

To solve this problem we ranked, first, the values of the school-readiness indicators obtained on entering school and, second, the final indicators of average year-end performance for the same students. The results are presented in Table 13.

Table 13. Columns: student No.; ranks of the school-readiness indicators; ranks of average annual performance.

We substitute the obtained data into the formula and perform the calculation. We get rs = 0.76.

To find the significance level, we turn to Table 20 of Appendix 6, which contains the critical values for the rank correlation coefficients.

We emphasize that in Table 20 of Appendix 6, as in the table for Pearson's linear correlation, all values of the correlation coefficients are given in absolute value. Therefore, the sign of the correlation coefficient is taken into account only in its interpretation.

Significance levels in this table are found by the number n, i.e., by the number of subjects. In our case n = 11. For this number we find:

0.61 for P ≤ 0.05

0.76 for P ≤ 0.01

We build the corresponding "significance axis":

The obtained correlation coefficient coincides with the critical value for the 1% significance level. Therefore, it can be argued that the indicators of school readiness and the final grades of the first-graders are positively correlated; in other words, the higher the indicator of school readiness, the better the first-grader learns. In terms of statistical hypotheses, the psychologist must reject the null hypothesis (no relationship) and accept the alternative hypothesis that the relationship between school readiness and average performance differs from zero.

Case of identical (equal) ranks

In the presence of identical (tied) ranks, the formula for calculating Spearman's rank correlation coefficient is somewhat different. In this case, two new terms that take the tied ranks into account are added to the calculation formula. They are called corrections for tied ranks and are added to Σd² in the numerator of the formula:

D1 = (n³ - n)/12, D2 = (k³ - k)/12,

where n is the number of identical ranks in the first column,

k is the number of identical ranks in the second column.

If there are two groups of identical ranks in some column, the correction formula becomes somewhat more complicated:

D = [(n³ - n) + (k³ - k)]/12,

where n is the number of identical ranks in the first group of the ranked column,

k is the number of identical ranks in the second group of the ranked column. In the general case, the modified formula is:

rs = 1 - 6·(Σd² + D1 + D2) / (n·(n² - 1)).

Example: A psychologist, using a school test of mental development (SHTUR), studies intelligence in 12 ninth-grade students. At the same time, he asks the literature and mathematics teachers to rank these same students by indicators of mental development. The task is to determine how the objective indicators of mental development (the SHTUR data) and the teachers' expert assessments are related.

The experimental data of this problem and the additional columns needed to calculate the Spearman correlation coefficient are presented in Table 14.

Table 14. Columns: student No.; ranks of SHTUR testing; expert assessments of the mathematics teachers; expert assessments of the literature teachers; D (columns 2 and 3); D (columns 2 and 4); D² (columns 2 and 3); D² (columns 2 and 4).

Since tied ranks were used in the ranking, it is necessary to check the correctness of the ranking in the second, third and fourth columns of the table. Summation in each of these columns gives the same total, 78.

We check against the calculation formula: Σ = n·(n + 1)/2 = 12·13/2 = 78. The check is satisfied.

The fifth and sixth columns of the table show the rank differences D between the psychologist's SHTUR ranks and the teachers' expert assessments in mathematics and in literature, respectively, for each student. The sum of the rank differences must equal zero. Summation of the D values in the fifth and sixth columns gives the required result, so the subtraction of the ranks was carried out correctly. A similar check must be made every time complex types of ranking are performed.

Before starting the calculation by the formula, it is necessary to calculate the corrections for tied ranks for the second, third and fourth columns of the table.

In our case, the second column of the table contains two identical ranks; therefore, by the formula, the correction D1 is:

D1 = (2³ - 2)/12 = 0.5.

The third column contains three identical ranks; therefore the correction D2 is:

D2 = (3³ - 3)/12 = 2.

The fourth column of the table contains two groups of three identical ranks each; therefore the correction D3 is:

D3 = [(3³ - 3) + (3³ - 3)]/12 = 4.

Before solving the problem, let us recall that the psychologist answers two questions: how the SHTUR ranks are related to the expert assessments in mathematics, and how they are related to the expert assessments in literature. That is why the calculation is carried out twice.

We calculate the first rank coefficient, taking the corrections into account, by the formula. We get:

Now we calculate it without the corrections:

As you can see, the difference between the values of the correlation coefficients turned out to be very small.

We calculate the second rank coefficient, taking the corrections into account, by the formula. We get:

Now we calculate it without the corrections:

Again, the differences are very small. Since the number of students is the same in both cases, we find the critical values from Table 20 of Appendix 6 at n = 12 for both correlation coefficients at once:

0.58 for P ≤ 0.05

0.73 for P ≤ 0.01

We plot the first value on the "significance axis":

In the first case, the obtained rank correlation coefficient lies in the zone of significance. Therefore, the psychologist must reject the null hypothesis that the correlation coefficient does not differ from zero and accept the alternative hypothesis that it differs significantly from zero. In other words, the obtained result suggests that the higher the students' expert ranks on the SHTUR test, the higher their expert assessments in mathematics.

We plot the second value on the "significance axis":

In the second case, the rank correlation coefficient lies in the zone of uncertainty. Therefore, the psychologist can accept the null hypothesis that the correlation coefficient does not differ from zero and reject the alternative hypothesis. In this case, the obtained result indicates that the students' expert ranks on the SHTUR test are not related to the expert assessments in literature.

To apply the Spearman correlation coefficient, the following conditions must be met:

1. The variables being compared must be obtained on an ordinal (rank) scale, but they may also be measured on interval and ratio scales.

2. The nature of the distribution of the correlated values does not matter.

3. The number of varying features in the compared variables X and Y must be the same.

Tables for determining the critical values of the Spearman correlation coefficient (Table 20, Appendix 6) are calculated for a number of features from n = 5 to n = 40; with a larger number of compared variables, the table for the Pearson correlation coefficient should be used (Table 19, Appendix 6). The critical values are found at k = n.

37. Spearman's rank correlation coefficient.


http://psystat.at.ua/publ/1-1-0-33

Spearman's rank correlation coefficient is used when:
- the variables have a rank (ordinal) scale of measurement;
- the data distribution differs too much from normal, or is not known at all;
- the samples are small (N < 30).

The interpretation of Spearman's rank correlation coefficient does not differ from that of Pearson's coefficient, but its meaning is somewhat different. To understand the difference between these methods and to justify logically their areas of application, let us compare their formulas.

Pearson correlation coefficient:

r = Σ(xi - x̄)·(yi - ȳ) / √(Σ(xi - x̄)²·Σ(yi - ȳ)²).

Spearman's correlation coefficient:

ρ = 1 - 6·Σd² / (n·(n² - 1)).

As you can see, the formulas differ significantly.

The Pearson correlation formula involves the arithmetic mean and the standard deviation of the correlated series, while the Spearman formula does not. Thus, to obtain an adequate result with the Pearson formula, the correlated series must be close to the normal distribution (the mean and the standard deviation are the parameters of the normal distribution). This is not relevant for the Spearman formula.

An element of Pearson's formula is the standardization of each series to the z-scale. As you can see, the conversion of the variables to the z-scale is present in the formula for the Pearson correlation coefficient. Accordingly, for the Pearson coefficient the scale of the data is completely irrelevant: for example, we can correlate two variables one of which has min = 0 and max = 1 and the second min = 100 and max = 1000. However different the ranges of values are, they will all be converted to standard z-values of the same scale.

There is no such normalization in the Spearman coefficient, so

A MANDATORY CONDITION FOR USING THE SPEARMAN COEFFICIENT IS THE EQUALITY OF THE RANGES OF THE TWO VARIABLES.

Before applying the Spearman coefficient to data series with different ranges, they must be ranked. Ranking causes the values of these series to acquire the same minimum of 1 (the minimum rank) and a maximum equal to the number of values (the maximum, last rank = N, i.e. the number of cases in the sample).
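A small illustration of this point in Python (scipy assumed): ranking maps two series with very different ranges onto the same 1..N scale.

    from scipy.stats import rankdata

    a = [0.1, 0.5, 0.2, 0.9]       # range about 0..1
    b = [120, 980, 340, 415]       # range about 100..1000
    print(rankdata(a))             # [1. 3. 2. 4.]
    print(rankdata(b))             # [1. 4. 2. 3.]
    # both ranked series now run from 1 to N = 4; ties would get average ranks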

In what cases can ranking be omitted?

These are cases where the data are originally on a rank scale, for example the Rokeach value orientations test.

They are also cases where the number of value options is small and the sample has a fixed minimum and maximum. For example, in the semantic differential, minimum = 1, maximum = 7.

An example of calculating the Spearman rank correlation coefficient

The Rokeach value orientations test was administered to two samples, X and Y. Task: find out how close the value hierarchies of these samples are (literally, how similar they are).

The resulting value r = 0.747 is checked against the table of critical values. According to the table, at N = 18 the obtained value is reliable at the level p ≤ 0.005.

Rank correlation coefficients according to Spearman and Kendall

For variables belonging to an ordinal scale, or for variables that do not follow a normal distribution, as well as for variables belonging to an interval scale, the Spearman rank correlation is calculated instead of the Pearson coefficient. To do this, the individual values of the variables are assigned rank places, which are then processed using the appropriate formulas. To obtain a rank correlation, uncheck the Pearson correlation check box, enabled by default, in the Bivariate Correlations... dialog box, and activate the Spearman correlation calculation instead. This calculation gives the following result: the rank correlation coefficients are very close to the corresponding values of the Pearson coefficients (the original variables have a normal distribution).

titkova-matmetody.pdf p. 45

The Spearman rank correlation method allows one to determine the closeness (strength) and direction of the correlation between two features or two profiles (hierarchies) of features.

To calculate a rank correlation, two series of values that can be ranked are required. Such series of values can be:

1) two features measured in the same group of subjects;

2) two individual hierarchies of features identified in two subjects for the same set of features;

3) two group hierarchies of features;

4) an individual and a group hierarchy of features.

First, the indicators are ranked separately for each of the features. As a rule, a lower feature value is assigned a lower rank.

In the first case (two features), the individual values of the first feature obtained by the different subjects are ranked, and then the individual values of the second feature.

If two features are positively related, then subjects with low ranks on one of them will have low ranks on the other, and subjects with high ranks on one feature will also have high ranks on the other feature. To calculate rs, the differences (d) between the ranks obtained by each subject on the two features are determined. These indicators d are then transformed in a certain way and subtracted from 1. The smaller the differences between the ranks, the larger rs will be, the closer it will be to +1.

If there is no correlation, then all the ranks will be mixed and there will be no correspondence between them. The formula is designed so that in this case rs will be close to 0.

In the case of a negative correlation, subjects' low ranks on one feature will correspond to high ranks on the other feature, and vice versa. The greater the mismatch between the subjects' ranks on the two variables, the closer rs is to -1.

In the second case (two individual profiles), the individual values obtained by each of the 2 subjects for a certain set of features (the same for both of them) are ranked. The first rank goes to the feature with the lowest value, the second rank to the feature with the next higher value, and so on. Obviously, all features must be measured in the same units, otherwise ranking is impossible. For example, it is impossible to rank the indicators of the Cattell Personality Questionnaire (16PF) if they are expressed in "raw" scores, since the ranges of values differ for the different factors: from 0 to 13, from 0 to 20, and from 0 to 26. We cannot say which of the factors takes first place in severity until all the values are brought to a single scale (most often the sten scale).

If the individual hierarchies of two subjects are positively related, then features having low ranks in one of them will have low ranks in the other, and vice versa. For example, if for one subject factor E (dominance) has the lowest rank, then for the other subject it should also have a low rank; if for one subject factor C (emotional stability) has the highest rank, then the other subject should also have a high rank on this factor, and so on.

In the third case (two group profiles), the group mean values obtained in 2 groups of subjects for a certain set of features, identical for the two groups, are ranked. The line of reasoning is then the same as in the previous two cases.

In the fourth case (individual and group profiles), the individual values of the subject and the group mean values are ranked separately for the same set of features, the group means being obtained, as a rule, with the exclusion of this individual subject: he does not participate in the group mean profile with which his individual profile will be compared. Rank correlation allows one to check how consistent the individual and group profiles are.

In all four cases, the significance of the obtained correlation coefficient is determined by the number of ranked values N. In the first case, this number coincides with the sample size n. In the second case, the number of observations is the number of features constituting the hierarchy. In the third and fourth cases, N is likewise the number of compared features, not the number of subjects in the groups. Detailed explanations are given in the examples. If the absolute value of rs reaches the critical value or exceeds it, the correlation is reliable.

Hypotheses.

Two pairs of hypotheses are possible. The first refers to case 1, the second to the other three.

First variant of the hypotheses:

H0: The correlation between variables A and B does not differ from zero.

H1: The correlation between variables A and B differs significantly from zero.

Second variant of the hypotheses:

H0: The correlation between hierarchies A and B does not differ from zero.

H1: The correlation between hierarchies A and B differs significantly from zero.

Limitations of the rank correlation coefficient

1. At least 5 observations must be supplied for each variable. The upper limit of the sample is determined by the available tables of critical values.

2. With a large number of identical ranks on one or both of the compared variables, Spearman's rank correlation coefficient rs gives coarsened values. Ideally, both correlated series should be sequences of non-coinciding values. If this condition is not met, a correction for tied ranks must be made.

In the absence of tied ranks, Spearman's rank correlation coefficient is calculated by the formula:

rs = 1 - 6·Σd² / (N·(N² - 1)).

If both compared rank series contain groups of identical ranks, then before calculating the rank correlation coefficient it is necessary to introduce the corrections for tied ranks Ta and Tb:

Ta = Σ(a³ - a)/12,

Tb = Σ(b³ - b)/12,

where a is the size of each group of identical ranks in rank series A, and b is the size of each group of identical ranks in rank series B.

To calculate the empirical value of rs in this case, the formula is:

rs = 1 - 6·(Σd² + Ta + Tb) / (N·(N² - 1)).

38. Point-biserial correlation coefficient.

For correlation in general, see question No. 36, p. 56 (64).

harchenko-korranaliz.pdf

Let the variable X be measured on a strong (metric) scale, and the variable Y on a dichotomous scale. The point-biserial correlation coefficient rpb is calculated by the formula (one common form):

rpb = ((x1 - x0) / sx) · √(n1·n0 / (n·(n - 1))).

Here x1 is the average value of X over objects with the value "one" on Y;

x0 is the average value of X over objects with the value "zero" on Y;

sx is the standard deviation of all values of X;

n1 is the number of objects with "one" on Y, n0 is the number of objects with "zero" on Y;

n = n1 + n0 is the sample size.

The point-biserial correlation coefficient can also be calculated using other, equivalent expressions; in them, x̄ denotes the overall mean value of the variable X.

The point-biserial correlation coefficient rpb varies from -1 to +1. Its value equals zero when the objects with a one on Y have the same mean of X as the objects with a zero on Y.

Testing the significance hypothesis for the point-biserial correlation coefficient consists in testing the null hypothesis H0 that the general correlation coefficient equals zero (ρ = 0), which is carried out using Student's criterion. The empirical value

t = rpb·√(n - 2) / √(1 - rpb²)

is compared with the critical value tα(df) for the number of degrees of freedom df = n - 2.

If the condition |t| ≤ tα(df) holds, the null hypothesis ρ = 0 is not rejected. The point-biserial correlation coefficient differs significantly from zero if the empirical value |t| falls into the critical region, that is, if |t| > tα(n - 2). The reliability of a relationship calculated using the point-biserial correlation coefficient rpb can also be determined using the χ² criterion for the number of degrees of freedom df = 2.
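A minimal sketch of the point-biserial coefficient and its t-test in Python; the data here are made up for illustration, and scipy.stats.pointbiserialr is assumed available:

    import math
    from scipy import stats

    y = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]                            # dichotomous variable
    x = [14.2, 15.1, 9.8, 13.4, 10.2, 8.9, 16.0, 11.1, 14.8, 9.5]  # metric variable

    r_pb, p = stats.pointbiserialr(y, x)
    print(r_pb, p)

    # the same t-test by hand: t = r * sqrt(n - 2) / sqrt(1 - r^2), df = n - 2
    n = len(x)
    t = r_pb * math.sqrt(n - 2) / math.sqrt(1 - r_pb ** 2)
    print(t)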

Point-biserial correlation

A subsequent modification of the product-moment correlation coefficient is the point-biserial r. This statistic shows the relationship between two variables, one of which is presumably continuous and normally distributed, while the other is discrete in the exact sense of the word. The point-biserial correlation coefficient is denoted rpbis. Since in rpbis the dichotomy reflects the true nature of the discrete variable, and is not artificial as in the case of rbis, its sign is determined arbitrarily. Therefore, for all practical purposes, rpbis is considered in the range from 0.00 to +1.00.

There is also the case where two variables are considered continuous and normally distributed, but both are artificially dichotomized, as in biserial correlation. To assess the relationship between such variables, the tetrachoric correlation coefficient rtet is used, which was also derived by Pearson. The basic (exact) formulas and procedures for calculating rtet are quite complex; therefore, in practice, approximations of rtet obtained on the basis of shortened procedures and tables are used.

/online/dictionary/dictionary.php?term=511

POINT-BISERIAL CORRELATION COEFFICIENT: the correlation coefficient between two variables, one of which is measured on a dichotomous scale and the other on an interval scale. It is used in classical and modern testology as an indicator of the quality of a test item, namely its reliability-consistency with the overall test score.

To correlate variables measured on dichotomous and interval scales, the point-biserial correlation coefficient is used.
The point-biserial correlation coefficient is a method of correlation analysis for variables one of which is measured on a nominal scale and takes only 2 values (for example, men/women, answer correct/incorrect, feature present/absent), and the other on a ratio or interval scale. The formula for calculating the point-biserial correlation coefficient (one common form):

rpb = ((m1 - m0) / σx) · √(n1·n0 / (n·(n - 1)))

Where:
m1 and m0 are the average values of X for objects with the value 1 or 0 on Y;
σx is the standard deviation of all values of X;
n1, n0 are the numbers of X values with 1 or 0 on Y;
n is the total number of pairs of values.

Most often, this type of correlation coefficient is used to calculate the relationship between test items and the total scale score. This is one type of validity check.

39. Rank-biserial correlation coefficient.

For correlation in general, see question No. 36, p. 56 (64).

harchenko-korranaliz.pdf p. 28

The rank-biserial correlation coefficient is used when one of the variables (X) is presented on an ordinal scale and the other (Y) on a dichotomous scale. It is calculated by the formula

rrb = 2·(x1 - x0) / n,

where x1 is the average rank of objects having a one on Y; x0 is the average rank of objects having a zero on Y; n is the sample size.

The significance hypothesis for the rank-biserial correlation coefficient is tested in the same way as for the point-biserial correlation coefficient, using Student's t-test with rrb substituted for rpb in the formulas.

When one variable is measured on a dichotomous scale (variable X) and the other on a rank scale (variable Y), the rank-biserial correlation coefficient is used. Recall that the variable X, measured on a dichotomous scale, takes only two values (codes), 0 and 1. Let us emphasize in particular: although this coefficient varies in the range from -1 to +1, its sign does not matter for interpreting the results. This is another exception to the general rule.

The coefficient is calculated by the formula:

rrb = 2·(X1 - X0) / N,

where X1 is the average rank over those elements of the variable Y that correspond to the code (feature) 1 in the variable X;

X0 is the average rank over those elements of the variable Y that correspond to the code (feature) 0 in the variable X;

N is the total number of elements in the variable X.

To apply the rank-biserial correlation coefficient, the following conditions must be met:

1. The variables being compared must be measured on different scales: one, X, on a dichotomous scale; the other, Y, on a rank scale.

2. The number of varying features in the compared variables X and Y must be the same.

3. To assess the reliability of the rank-biserial correlation coefficient, formula (11.9) and the table of critical values for the Student's test at k = n - 2 should be used.

http://psystat.at.ua/publ/drugie_vidy_koehfficienta_korreljacii/1-1-0-38

Cases where one of the variables is presented on a dichotomous scale and the other on a rank (ordinal) scale require the use of the rank-biserial correlation coefficient:

rrb = 2/n · (m1 - m0),

where:
n is the number of measurement objects;
m1 and m0 are the average ranks of objects with 1 or 0 on the second variable.

This coefficient is also used when checking the validity of tests.
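A sketch of the rank-biserial coefficient rrb = 2·(m1 - m0)/n in Python; the data are made up, and the ordinal variable is ranked first:

    import numpy as np
    from scipy.stats import rankdata

    x = np.array([1, 0, 1, 1, 0, 0, 1, 0])          # dichotomous codes
    y = np.array([23, 11, 30, 14, 27, 9, 25, 12])   # rankable variable

    ranks = rankdata(y)
    m1 = ranks[x == 1].mean()                        # average rank of objects coded 1
    m0 = ranks[x == 0].mean()                        # average rank of objects coded 0
    r_rb = 2 * (m1 - m0) / len(y)
    print(round(r_rb, 3))                            # 0.625 for these data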

40. Linear correlation coefficient.

About correlation in general (and linear correlation in particular), see question No. 36, p. 56 (64).

PEARSON'S CORRELATION COEFFICIENT

r-Pearson (Pearson r) is used to study the relationship between two metric variables measured on the same sample. There are many situations in which its use is appropriate. Does intelligence affect performance in the senior university years? Is the size of an employee's salary related to his goodwill towards colleagues? Does a student's mood affect the success of solving a complex arithmetic problem? To answer such questions, the researcher must measure the two indicators of interest for each member of the sample. The data for studying the relationship are then tabulated, as in the example below.

EXAMPLE 6.1

The table shows an example of initial measurement data for two indicators of intelligence (verbal and non-verbal) in 20 students of the 8th grade.

The relationship between these variables can be depicted using a scatter diagram (see Fig. 6.3). The diagram shows that there is some relationship between the measured indicators: the greater the values of verbal intelligence, the (mostly) greater the values of non-verbal intelligence.

Before giving the formula for the correlation coefficient, let us try to trace the logic of its derivation using the data of Example 6.1. The position of each i-th point (the subject with number i) on the scatter diagram relative to the other points (Fig. 6.3) can be given by the magnitudes and the signs of the deviations of the corresponding variable values from their mean values: (xi - Mx) and (yi - My). If the signs of these deviations coincide, this speaks in favor of a positive relationship (larger values of x correspond to larger values of y, or smaller values of x correspond to smaller values of y).

For subject No. 1, the deviations from the mean on x and on y are both positive, and for subject No. 3 both deviations are negative. Consequently, the data of both indicate a positive relationship between the studied features. On the contrary, if the signs of the deviations from the mean on x and on y differ, this indicates a negative relationship between the features. Thus, for subject No. 4 the deviation from the mean on x is negative and on y positive, and for subject No. 9 it is vice versa.

Thus, if the product of the deviations (xi - Mx)·(yi - My) is positive, then the data of the i-th subject indicate a direct (positive) relationship, and if it is negative, an inverse (negative) relationship. Accordingly, if x and y are mostly directly proportional, most of the products of the deviations will be positive, and if they are related inversely, most of the products will be negative. Therefore, the sum of all products of deviations for a given sample can serve as a general indicator of the strength and direction of the relationship:

Σ(xi - Mx)·(yi - My).

With a directly proportional relationship between the variables, this value is large and positive: for most of the subjects the deviations coincide in sign (large values of one variable correspond to large values of the other variable, and vice versa). If x and y are inversely related, then for most subjects larger values of one variable will correspond to smaller values of the other variable, i.e. the signs of the products will be negative, and the sum of the products as a whole will also be large in absolute value but negative in sign. If there is no systematic relationship between the variables, then the positive terms (products of deviations) will be balanced by negative terms, and the sum of all products of deviations will be close to zero.

So that the sum of the products does not depend on the sample size, it is enough to average it. But we are interested in the measure of the relationship not as a general parameter but as its computed estimate, a statistic. Therefore, as in the variance formula, in this case we divide the sum of the products of deviations not by N but by N - 1. The result is a measure of association, widely used in physics and the technical sciences, which is called the covariance:

cov(x, y) = Σ(xi - Mx)·(yi - My) / (N - 1).

In psychology, unlike physics, most variables are measured on arbitrary scales, since psychologists are interested not in the absolute value of a feature but in the relative position of the subjects in the group. In addition, the covariance is very sensitive to the scale (variance) on which the features are measured. To make the measure of association independent of the units of measurement of either feature, it is enough to divide the covariance by the corresponding standard deviations. In this way the formula for K. Pearson's correlation coefficient was obtained:

r = cov(x, y) / (σx·σy),

or, after substituting the expressions for σx and σy:

r = Σ(xi - Mx)·(yi - My) / ((N - 1)·σx·σy).

If the values of both variables are converted to z-values by the formula

z = (x - Mx) / σx,

then the formula for the r-Pearson correlation coefficient looks simpler:

r = Σ(zx·zy) / (N - 1).
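The chain of reasoning above can be condensed into a few lines of Python (NumPy assumed; the data are made up): the covariance route, the z-score route, and NumPy's built-in corrcoef all give the same r:

    import numpy as np

    x = np.array([2.0, 4.0, 4.5, 6.0, 7.5])
    y = np.array([10.0, 14.0, 13.0, 19.0, 22.0])
    n = len(x)

    cov = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)   # covariance
    r1 = cov / (x.std(ddof=1) * y.std(ddof=1))

    zx = (x - x.mean()) / x.std(ddof=1)                       # z-scores
    zy = (y - y.mean()) / y.std(ddof=1)
    r2 = (zx * zy).sum() / (n - 1)

    r3 = np.corrcoef(x, y)[0, 1]
    print(r1, r2, r3)    # all three agree (about 0.97 for these data)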

/dict/sociology/article/soc/soc-0525.htm

LINEAR CORRELATION: a statistical, non-causal linear relationship between two quantitative variables x and y. It is measured using Pearson's linear correlation coefficient, which is the result of dividing the covariance by the standard deviations of both variables:

r = sxy / (sx·sy),

where sxy is the covariance between the variables x and y;

sx, sy are the standard deviations of the variables x and y;

xi, yi are the values of the variables x and y for the object number i;

x̄, ȳ are the arithmetic means of the variables x and y.

Pearson's coefficient r can take values from the interval [-1; +1]. The value r = 0 means there is no linear relationship between x and y (but does not rule out a non-linear statistical relationship). Positive values (r > 0) indicate a direct linear relationship: the closer the value is to +1, the stronger the statistical direct relationship. Negative values (r < 0) indicate an inverse linear relationship: the closer the value is to -1, the stronger the inverse relationship. The values r = ±1 mean a perfect linear relationship, direct or inverse. In the case of a perfect relationship, all points with coordinates (xi, yi) lie on the straight line y = a + bx.

Pearson's linear correlation coefficient is also used to measure the closeness of the relationship in the paired linear regression model.

41. Correlation matrix and correlation graph.

For correlation in general, see question No. 36, p. 56 (64).

Correlation matrix. Often, correlation analysis involves studying the relationships not between two, but between many variables measured on a quantitative scale on a single sample. In this case, correlations are computed for each pair of this set of variables. The computations are usually carried out on a computer, and the result is a correlation matrix.

A correlation matrix is the result of computing correlations of one type for each pair from a set of P variables measured on a quantitative scale on one sample.

EXAMPLE

Suppose we are studying the relationships among 5 variables (v1, v2, ..., v5; P = 5) measured on a sample of N = 30 people. Below are the table of initial data and the correlation matrix.

Initial data:

Correlation matrix:

It is easy to see that the correlation matrix is square, symmetric about the main diagonal (since rij = rji), with ones on the main diagonal (since rii = 1).

The correlation matrix is square: the number of rows and columns equals the number of variables. It is symmetric about the main diagonal, since the correlation of x with y equals the correlation of y with x. Ones lie on its main diagonal, since the correlation of a feature with itself equals one. Consequently, not all elements of the correlation matrix are subject to analysis, but only those lying above or below the main diagonal.

The number of correlation coefficients to be analyzed when studying the relationships of P features is determined by the formula P(P - 1)/2. In the example above, the number of such correlation coefficients is 5(5 - 1)/2 = 10.
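A quick way to obtain such a matrix in practice is pandas (a sketch; the random data here merely stand in for the v1...v5 measurements):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(30, 5)),
                      columns=["v1", "v2", "v3", "v4", "v5"])

    corr = df.corr()            # Pearson by default; method="spearman" also works
    print(corr.round(2))        # 5 x 5, symmetric, ones on the diagonal

    P = corr.shape[0]
    print(P * (P - 1) // 2)     # 10 distinct coefficients to analyze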

The main task of analyzing a correlation matrix is to reveal the structure of the interrelations of a set of features. This can be done by visual analysis of correlation pleiades, graphic representations of the structure of statistically significant connections, if there are not very many such connections (up to 10-15). Another way is to use multivariate methods: multiple regression, factor, or cluster analysis (see the section "Multivariate methods..."). Using factor or cluster analysis, one can identify groupings of variables that are more closely related to each other than to other variables. A combination of these methods is also very effective, for example when there are many features and they are not homogeneous.

Comparing correlations is an additional task of analyzing a correlation matrix, and it has two variants. If it is necessary to compare correlations in one of the rows of the correlation matrix (for one of the variables), the comparison method for dependent samples is applied (pp. 148-149). When comparing correlations of the same name computed for different samples, the comparison method for independent samples is used (pp. 147-148).

Methods for comparing correlations on the diagonals of a correlation matrix (to assess the stationarity of a random process) and for comparing several correlation matrices obtained for different samples (for their homogeneity) are laborious and beyond the scope of this book. These methods can be studied in the book by G. V. Sukhodolsky.

The problem of the statistical significance of correlations. The problem is that the statistical hypothesis-testing procedure presumes a single test carried out on a single sample. If the same method is applied many times, even with respect to different variables, the probability of obtaining a result purely by chance increases. In general, if we repeat the same hypothesis-testing method k times with respect to different variables or samples, then, with the chosen value of α, we are guaranteed to obtain confirmation of the hypothesis in about α·k of the cases.

Suppose the correlation matrix for 15 variables is analyzed, that is, 15(15 - 1)/2 = 105 correlation coefficients are computed. For testing the hypotheses, the level α = 0.05 is set. Testing the hypothesis 105 times, we will obtain its confirmation about five times (!) regardless of whether the connection actually exists. Knowing this and having obtained, say, 15 "statistically significant" correlation coefficients, can we tell which of them were obtained by chance and which reflect a real relationship?

Strictly speaking, to make a statistical decision it would be necessary to reduce the level α as many times as the number of hypotheses being tested. But this is hardly advisable, since the probability of missing a really existing connection (committing a Type II error) then grows in an unpredictable way.

The correlation matrix alone is not a sufficient basis for statistical conclusions about the individual correlation coefficients it contains!

There is only one really convincing way to solve this problem: divide the sample randomly into two parts and take into account only those correlations that are statistically significant in both parts of the sample. An alternative is the use of multivariate methods (factor, cluster, or multiple regression analysis) for selecting and subsequently interpreting groups of statistically significantly related variables.

The problem of missing values. If there are missing values in the data, two options for computing the correlation matrix are possible: a) listwise deletion of values (exclude cases listwise); b) pairwise deletion of values (exclude cases pairwise). With listwise deletion, the entire row is removed for any object (subject) that has at least one missing value on one of the variables. This method leads to a "correct" correlation matrix in the sense that all coefficients are computed on the same set of objects. However, if the missing values are distributed randomly across the variables, this method can lead to a situation where not a single object remains in the data set (each row contains at least one missing value). To avoid this situation, the other method, pairwise deletion, is used. It considers only the gaps in each selected pair of variable columns and ignores gaps in the other variables; the correlation for a pair of variables is computed on those objects that have no gaps in that pair. In many situations, especially when the number of gaps is relatively small, say 10%, and the gaps are distributed fairly randomly, this method does not lead to serious errors. However, sometimes it does. For example, a systematic placement of the gaps can hide a systematic bias (shift) of the estimates, which causes the correlation coefficients built on different subsets (for example, for different subgroups of objects) to differ. Another problem with a correlation matrix computed with pairwise deletion of gaps arises when this matrix is used in other types of analysis (for example, multiple regression or factor analysis). These assume that a "correct" correlation matrix is used, with a certain level of consistency and "correspondence" of the various coefficients. Using a matrix with "bad" (biased) estimates leads either to the program being unable to analyze such a matrix or to erroneous results. Therefore, if the pairwise method of eliminating missing data is used, it is necessary to check whether there are systematic patterns in the distribution of the gaps.

If pairwise deletion of missing data does not lead to any systematic shift in the means and variances (standard deviations), then these statistics will be similar to those computed with the listwise method of removing gaps. If there is a significant difference, there is reason to suspect a shift in the estimates. For example, if the mean (or standard deviation) of the values of variable A that were used in computing its correlation with variable B is much smaller than the mean (or standard deviation) of the values of variable A that were used in computing its correlation with variable C, then there is every reason to expect that these two correlations (A-B and A-C) are based on different subsets of the data. There will be a shift in the correlations caused by the non-random placement of the gaps in the values of the variables.
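The two deletion strategies are easy to compare in pandas (a sketch; the NaN pattern below is made up). DataFrame.corr performs pairwise deletion on its own; dropping incomplete rows first gives the listwise variant:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0],
        "b": [2.1, np.nan, 2.9, 4.2, 4.8, 5.5, 7.1, 7.8],
        "c": [0.5, 1.0, 1.4, np.nan, 2.6, 2.8, 3.5, 4.1],
    })

    pairwise = df.corr()           # NaNs dropped pair by pair
    listwise = df.dropna().corr()  # whole rows with any NaN dropped first
    print(pairwise.round(3))
    print(listwise.round(3))       # the two matrices generally differ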

Analysis of correlation pleiades. After the problem of the statistical significance of the elements of the correlation matrix has been solved, the statistically significant correlations can be represented graphically in the form of a correlation pleiad or pleiades. A correlation pleiad is a figure consisting of vertices and the lines connecting them. The vertices correspond to the features and are usually denoted by numbers, the numbers of the variables. The lines correspond to statistically significant connections and graphically express the sign, and sometimes the p-level of significance, of the connection.

A correlation pleiad can reflect all the statistically significant connections of the correlation matrix (it is then sometimes called a correlation graph) or only a meaningfully selected part of them (for example, the part corresponding to one factor according to the results of factor analysis).

EXAMPLE OF CONSTRUCTING A CORRELATION PLEIAD



  • In cases where the measurements of the studied features are carried out on an ordinal scale, or the form of the relationship differs from linear, the study of the relationship between two random variables is carried out using rank correlation coefficients. Let us consider Spearman's rank correlation coefficient. Calculating it requires ranking (ordering) the sample values. Ranking is the grouping of experimental data in a certain order, either ascending or descending.

    The ranking operation is carried out according to the following algorithm:

    1. A lower value is assigned a lower rank. The lowest value is assigned the rank 1; the highest value is assigned a rank equal to the number of ranked values. For example, if n = 7, the highest value receives the rank 7, except in the cases provided for by the second rule.

    2. If several values are equal, they are assigned a rank equal to the mean of the ranks they would have received had they not been equal. As an example, consider an ascending sample of 7 elements: 22, 23, 25, 25, 25, 28, 30. The values 22 and 23 occur once each, so their ranks are R22 = 1 and R23 = 2. The value 25 occurs 3 times. If these values did not repeat, their ranks would be 3, 4 and 5; therefore their rank R25 equals the arithmetic mean of 3, 4 and 5: (3 + 4 + 5)/3 = 4. The values 28 and 30 do not repeat, so their ranks are R28 = 6 and R30 = 7. Finally, we have the correspondence: 22 → 1, 23 → 2, 25 → 4, 25 → 4, 25 → 4, 28 → 6, 30 → 7.

    3. The total sum of the ranks must coincide with the calculated one, which is determined by the formula:

    ΣRi = n·(n + 1)/2,

    where n is the total number of ranked values.

    A discrepancy between the actual and the calculated sums of ranks indicates an error made when computing the ranks or when summing them. In this case the error must be found and corrected.
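    The ranking rules above are exactly what scipy.stats.rankdata implements (average ranks for ties), so the worked example can be verified directly in Python:

        from scipy.stats import rankdata

        sample = [22, 23, 25, 25, 25, 28, 30]
        ranks = rankdata(sample)
        print(ranks)                         # [1. 2. 4. 4. 4. 6. 7.]

        n = len(sample)
        print(sum(ranks), n * (n + 1) / 2)   # both 28.0: the checksum matches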

    Spearman's rank correlation coefficient is a method that allows one to determine the strength and direction of the relationship between two features or two hierarchies of features. The use of the rank correlation coefficient has a number of limitations:

    • a) The expected correlation should be monotonic.
    • b) The size of each sample must be greater than or equal to 5. To determine the upper limit of the sample size, tables of critical values are used (Table 3 of the Appendix). The maximum value of n in the table is 40.
    • c) During the analysis, a large number of identical (tied) ranks may occur. In this case a correction must be applied. The most favorable case is when both studied samples represent two sequences of non-coinciding values.

    To conduct a correlation analysis, the researcher must have two samples that can be ranked, for example:

    • - two signs measured in the same group of subjects;
    • - two individual trait hierarchies identified in two subjects for the same set of traits;
    • - two group hierarchies of features;
    • - individual and group hierarchies of features.

    We begin the calculation by ranking the studied indicators separately for each of the features.

    Let us analyze the case of two features measured in the same group of subjects. First, the individual values obtained by the different subjects are ranked for the first feature, and then the individual values for the second feature. If lower ranks of one indicator correspond to lower ranks of the other, and higher ranks of one indicator correspond to higher ranks of the other, the two features are positively related. If higher ranks of one indicator correspond to lower ranks of the other, the two features are negatively related. To find rs, we determine the differences between the ranks (d) for each subject. The smaller the differences between the ranks, the closer the rank correlation coefficient rs will be to "+1". If there is no relationship, there will be no correspondence between the ranks, and rs will be close to zero. The greater the differences between the subjects' ranks on the two variables, the closer the coefficient rs will be to "-1". Thus, the Spearman rank correlation coefficient is a measure of any monotonic relationship between the two studied features.

    Let us consider the case of two individual hierarchies of features identified in two subjects for the same set of features. In this situation, the individual values obtained by each of the two subjects for a certain set of features are ranked. The feature with the lowest value is assigned the first rank; the feature with the next higher value, the second rank, and so on. Special attention should be paid to ensuring that all features are measured in the same units. For example, it is impossible to rank indicators expressed in points of different "price", since it cannot be determined which factor takes first place in severity until all values are brought to a single scale. If features that have low ranks in one of the subjects also have low ranks in the other, and vice versa, the individual hierarchies are positively related.

    In the case of two group hierarchies of features, the average group values ​​obtained in two groups of subjects are ranked according to the same set of features for the studied groups. Next, we follow the algorithm given in the previous cases.

    Let us analyze the case of an individual and a group hierarchy of features. One starts by ranking separately the individual values of the subject and the group mean values for the same set of features; the group means are obtained, as a rule, excluding the subject himself, since his individual hierarchy will be compared with the group one and he does not participate in the group mean profile. Rank correlation makes it possible to assess the degree of consistency between the individual and the group hierarchy of features.

    Let us consider how the significance of the correlation coefficient is determined in the cases listed above. In the case of two features, it will be determined by the sample size. In the case of two individual feature hierarchies, the significance depends on the number of features included in the hierarchy. In the last two cases, the significance is determined by the number of traits studied, and not by the size of the groups. Thus, the significance of rs in all cases is determined by the number of ranked values ​​n.

    When checking the statistical significance of rs, tables of critical values of the rank correlation coefficient are used, compiled for various numbers of ranked values and various significance levels. If the absolute value of rs reaches the critical value or exceeds it, the correlation is significant.

    When considering the first variant (the case of two features measured in the same group of subjects), the following hypotheses are possible.

    H0: The correlation between the variables x and y does not differ from zero.

    H1: The correlation between the variables x and y differs significantly from zero.

    If we work with any of the three remaining cases, we must put forward another pair of hypotheses:

    H0: The correlation between the hierarchies x and y does not differ from zero.

    H1: The correlation between the hierarchies x and y differs significantly from zero.

    The sequence of actions in calculating the Spearman rank correlation coefficient rs is as follows.

    • - Determine which two features or two feature hierarchies will participate in the comparison as the variables x and y.
    • - Rank the values of the variable x, assigning rank 1 to the smallest value, according to the ranking rules. Place the ranks in the first column of the table in order of the numbers of the subjects or features.
    • - Rank the values of the variable y. Place the ranks in the second column of the table in order of the numbers of the subjects or features.
    • - Calculate the differences d between the ranks x and y for each row of the table. Place the results in the next column of the table.
    • - Calculate the squared differences (d²). Place the obtained values in the fourth column of the table.
    • - Calculate the sum of the squared differences Σd².
    • - If identical (tied) ranks occur, calculate the corrections:

    Tx = Σ(tx³ - tx)/12,

    Ty = Σ(ty³ - ty)/12,

    where tx is the size of each group of equal ranks in sample x;

    ty is the size of each group of equal ranks in sample y.

    Calculate the rank correlation coefficient depending on the presence or absence of tied ranks. In the absence of tied ranks, the rank correlation coefficient rs is calculated by the formula:

    rs = 1 - 6·Σd² / (n·(n² - 1)).

    In the presence of tied ranks, the rank correlation coefficient rs is calculated by the formula:

    rs = 1 - 6·(Σd² + Tx + Ty) / (n·(n² - 1)),

    where Σd² is the sum of the squared differences between the ranks; Tx and Ty are the corrections for tied ranks; n is the number of subjects or features that participated in the ranking.

    Determine the critical value of rs from Table 3 of the Appendix for the given number of subjects n. The correlation coefficient differs significantly from zero provided that rs is not less than the critical value.
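    Finally, the whole procedure can be collected into one small Python function (a sketch following the formulas of this section; note that this additive tie correction is the convention used in this text and can differ slightly from the Pearson-on-ranks value that scipy.stats.spearmanr returns):

        from collections import Counter
        from scipy.stats import rankdata

        def spearman_rs(x, y):
            # rank both variables, ties receiving average ranks
            rx, ry = rankdata(x), rankdata(y)
            d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))

            def T(values):
                # correction T = sum over tie groups of (t^3 - t) / 12
                return sum(t ** 3 - t for t in Counter(values).values()) / 12

            n = len(x)
            return 1 - 6 * (d2 + T(x) + T(y)) / (n ** 3 - n)

        print(round(spearman_rs([1, 2, 2, 4, 5], [2, 1, 3, 3, 5]), 3))   # 0.725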