Differential aptitude test sample task. Standardization of the test and interpretation of test results

Discriminativeness (differentiating ability) is the capacity of a test item to differentiate students as more or less prepared. Since the main goal of a norm-referenced test is to achieve a differentiating effect, a high discrimination index is very important for an item.

To assess the discriminativeness of a task, we will use the calculation according to the formula:

Dj = (P1)j − (P0)j

where Dj is the discrimination index for the j-th test item; (P1)j is the percentage of students who answered the j-th item correctly in the subgroup of the 27% strongest students by total test score; (P0)j is the percentage of students who answered the j-th item correctly in the subgroup of the 27% weakest students by total test score.

The discrimination index varies within [−1; 1]. It reaches its maximum value when all students in the strong subgroup answer the item correctly and no one in the weak subgroup does; in this case, the item has the maximum differentiating effect. The index equals zero when the proportions of students who answered correctly are the same in both subgroups; accordingly, there is no differentiating effect at all. The value is below zero when weak students answer the item more successfully than strong students. Naturally, items whose discrimination index is zero or below must be removed from the test.
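As a sketch, the 27%-group computation can be expressed in Python. The score and response data below are purely illustrative, not taken from Appendix4.xls:

```python
# Discrimination index D_j = (P1)_j - (P0)_j from two contrasting 27% groups.
# The total scores and item responses below are illustrative only.

def discrimination_index(total_scores, item_correct, fraction=0.27):
    """total_scores: total test score per student;
    item_correct: 1/0 flags for the j-th item, in the same order."""
    n = len(total_scores)
    k = max(1, round(n * fraction))                    # size of each contrasting group
    order = sorted(range(n), key=lambda i: total_scores[i])
    weakest, strongest = order[:k], order[-k:]
    p1 = sum(item_correct[i] for i in strongest) / k   # proportion (P1)_j
    p0 = sum(item_correct[i] for i in weakest) / k     # proportion (P0)_j
    return p1 - p0

totals = [3, 5, 8, 10, 12, 14, 15, 17, 18, 20]
item_j = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
print(discrimination_index(totals, item_j))   # 1.0
```

Here every student in the strong subgroup answered the item correctly and none in the weak subgroup did, so D_j reaches its maximum of 1.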

Using the data from the file Appendix4.xls, calculate the discrimination index for each item. Draw conclusions.

TEST QUALITY INDICATORS

Topics for self-study:

Reliability of norm-referenced and criterion-referenced tests

Test validity

Tasks are completed in Microsoft Excel. Students can be given printouts of the work instructions (see the attached file Laboratory work02.doc).

Reliability of norm-referenced and criterion-referenced tests

A norm-referenced test allows the educational achievements of individual subjects to be compared with one another. The scores obtained by the subjects are widely scattered along the scale. (Tests on which grades can be given: the Unified State Examination, placement tests.)

Criterion-referenced tests are used to certify subjects in some field of knowledge. The scores obtained by the subjects are concentrated around a single point, the criterion: for example, in a test of 50 questions with a criterion of 25 correct answers, a subject who scores 25 points is certified and one who does not is not; no grade is assigned. (Examples: professional aptitude tests, pass/fail tests.)

Correlation is the degree of agreement between the results of two measurements.



RELIABILITY

Reliability reflects the accuracy of pedagogical measurement: the extent to which the results obtained for each student correspond to his true score. Reliability is a characteristic of a test that reflects the accuracy of test measurements and the stability of the results under the action of random factors.
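Since reliability is commonly estimated through the agreement (correlation) of two measurements, a minimal test-retest sketch can be written in Python. The two score lists below are purely illustrative:

```python
# Test-retest reliability as the Pearson correlation between two
# administrations of the same test (illustrative scores).
from statistics import mean, pstdev

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

first  = [12, 15, 9, 20, 17, 11, 14]   # scores on the first administration
second = [13, 14, 10, 19, 18, 12, 15]  # scores on the retest
print(round(pearson(first, second), 3))
```

A value close to 1 indicates that the results are stable against random factors; a low value signals poor reliability.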


Item difficulty is characterized by an index that corresponds to the proportion of individuals who solve the item correctly (Bortz & Döring, 2005). This indicator was previously called the popularity index. The purpose of the difficulty index is to separate highly difficult items from easier ones. Items that all subjects answer correctly, and items to which no one finds the answer, are considered unsuitable; the difficulty index must fall between these extreme cases. In a test, the difficulty levels of the items should cover the entire possible range of the characteristic measured by the test.

The difficulty of test items with a two-step (dichotomous) answer (for example, true/false) is calculated as follows:

p = Nr / N

where Nr is the number of subjects who gave the correct answer, N is the total number of subjects, and p is the difficulty of the item (for two-step answers only). This covers the simplest case. If some subjects did not attempt the item, or there is a suspicion that some items were answered by guessing, other, alternative solutions must be used (cf. Fisseni, 1997, pp. 41-42).
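For a dichotomous item this is a one-line computation; the response vector below is hypothetical (1 = correct, 0 = incorrect):

```python
# Difficulty index p = Nr / N for a dichotomous (true/false) item.
responses = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # hypothetical answers of 10 subjects
p = sum(responses) / len(responses)           # Nr / N
print(p)   # 0.7
```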

Calculating the difficulty of items with multi-level (polytomous) answers: the case when p is not directly defined. Possible solutions: dichotomize the score values (for example, into 0 and 1), in which case the difficulty is calculated as for a two-step item; or calculate the mean and variance of the scores (the mean is equivalent to p, but the variance must also be taken into account).
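Both options can be sketched in Python; the raw item scores and the dichotomization threshold below are assumptions for illustration only:

```python
# Two ways of handling a multi-level item, following the text:
# (a) dichotomize the scores and reuse p = Nr / N;
# (b) report mean and variance of the raw scores (the mean plays the role of p).
from statistics import mean, pvariance

raw = [0, 2, 3, 1, 3, 2, 3, 0]   # hypothetical scores on a 0..3 item

cut = 2                          # dichotomization threshold (assumed)
dichotomized = [1 if x >= cut else 0 for x in raw]
p = sum(dichotomized) / len(dichotomized)

print(p, mean(raw), pvariance(raw))
```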

Index for items with multi-level answers (simplified formula): p is taken as the mean achieved score divided by the maximum attainable score. For a more accurate calculation, different authors offer different methods (cf. Fisseni, 2004, pp. 43-45). The difference in difficulty between two items can be checked using a four-field (contingency) table. These formulas are applicable only at the level of a power test, i.e. when there is no time limit and/or all subjects were able to attempt all the items.

Differentiating ability of a task.

Indicators of the differentiating ability of tasks:

Discrimination coefficient,

Point-biserial correlation coefficient,

Biserial correlation coefficient,

Phi correlation coefficient.

An important indicator of the quality of a test item is its differentiating ability, which determines how well the item distinguishes between "strong" and "weak" subjects.

The concept of discriminative ability is based on the fundamental assumption that examinees who demonstrate a high level of proficiency in a given subject are expected to be more likely to answer any item about that subject correctly than those who have a low level of proficiency.

On the contrary, tasks to which either all examinees answered correctly or all answered incorrectly do not have differentiating ability, i.e. do not distinguish between strong and weak subjects.

Items that lack discriminative power provide no information about differences between individuals. Several statistical procedures exist to quantify the discriminativeness of an item. These indicators are extremely useful in item-quality analysis because they point item writers to the specific items that need improvement.

Discrimination coefficient

In classical test theory, the discrimination coefficient Dj is widely used to assess the quality of test items. This coefficient is calculated from the test results by identifying two "contrasting" groups of subjects, in most cases the 27% weakest and the 27% strongest students in the entire sample.

The coefficient is found from the formula Dj = Pu − Pl, where Pu and Pl are the proportions of students in the strongest and weakest groups, respectively, who answered the given (j-th) item correctly.

The value of the coefficient Dj can vary from -1 to +1.

If the value of Dj is close to +1, the item has high discriminative ability: the "strongest" group of students answers it correctly much more often than the "weakest" group.

The interpretation of the discriminative power coefficient Dj according to the classical test theory is presented in the table

Point biserial correlation coefficient.

The point-biserial correlation coefficient is a statistical indicator that can be used to analyze the differentiating ability of tasks.

This indicator assesses the degree of statistical relationship between two variables: the response profile for a specific task and the resulting test score.



For the j-th item, the point-biserial correlation coefficient is calculated using the formula:

r_pbis = ((x̄1 − x̄0) / s_x) · √(n1 · n0 / (n · (n − 1)))

Here x̄1 is the mean value of X over the subjects with the value "one" on Y;

x̄0 is the mean value of X over the subjects with the value "zero" on Y;

s_x is the standard deviation of all values of X;

n1 is the number of subjects with "one" on Y, n0 is the number of subjects with "zero" on Y;

n = n1 + n0 is the sample size.

According to classical test theory, a point-biserial correlation coefficient rpbis equal to or greater than 0.3 is an acceptable indicator of item quality.

Using this statistical indicator, the task author can evaluate its differentiating ability. Generally speaking, tasks with a higher value of this indicator better distinguish between trained and unprepared subjects. In practice, tasks with a negative point-biserial correlation coefficient are either removed from the task bank or completely revised.
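A minimal sketch of this computation in Python (the item responses and total scores below are illustrative). It uses the population-standard-deviation form r = ((x̄1 − x̄0) / σ_x) · √(n1·n0) / n, which is algebraically equivalent to the sample-s_x formula given above:

```python
# Point-biserial correlation between a 0/1 item response and the total score.
from math import sqrt
from statistics import mean, pstdev

def point_biserial(item, totals):
    ones  = [t for t, y in zip(totals, item) if y == 1]   # X where Y = 1
    zeros = [t for t, y in zip(totals, item) if y == 0]   # X where Y = 0
    n1, n0 = len(ones), len(zeros)
    n = n1 + n0
    return (mean(ones) - mean(zeros)) / pstdev(totals) * sqrt(n1 * n0) / n

item   = [1, 1, 1, 0, 0, 1, 0, 1]         # illustrative responses to item j
totals = [18, 15, 16, 8, 10, 14, 9, 17]   # illustrative total test scores
print(round(point_biserial(item, totals), 2))
```

For this toy data the coefficient is well above the 0.3 threshold, so the item differentiates strong and weak subjects; a negative value would flag the item for removal or revision.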

Testology is an interdisciplinary science concerned with the creation of high-quality, scientifically grounded diagnostic measurement techniques. In psychology, the content of testology largely coincides with that of differential psychometrics, but its principles and methods go beyond psychology: they are widely used in other branches of science and practice, such as pedagogy, medicine, technology and management (professional selection). In each of these fields, testing methods share common features related to ensuring such properties of test methods as validity, reliability and efficiency, but there are also specifics stemming from the subject of testing (professional and general educational knowledge, sets of medical symptoms, etc.) and the conditions for collecting empirical information. Since the test method does not exhaust the variety of methods of modern psychodiagnostics, it is incorrect to identify testology with psychodiagnostics.

Modern testology is a fully mature applied science that poses a wide range of theoretical problems to researchers and offers numerous mathematical approaches, models and methods. The wide dissemination, development and improvement of tests have been facilitated by a number of advantages of this method. Tests make it possible to evaluate subjects in accordance with the stated purpose of the study; they are a relatively quick way to assess a large number of unfamiliar individuals; they contribute to the objectivity of assessments, which do not depend on the subjective attitudes of the person conducting the research; and they ensure the comparability of information obtained by different researchers on different subjects.

Analysis and evaluation of test items begins after the test has been tried out on the target group. The data obtained are summarized in a table with a matrix structure, in which the items are then evaluated according to the following criteria.

Measure of task difficulty

The measure of task difficulty provides information about the degree of involvement of the parameter of the property being studied that it is intended to measure. It is sometimes said that the measure of difficulty determines whether an item is appropriate for the target group of the test. In general, we can say that this criterion allows us to judge this.

Whether a task is difficult or easy is determined by calculating the proportion of incorrect answers to each of them. However, today a not entirely classical method of determining the difficulty of a task is used - speculatively, based on the estimated number and nature of those elements that are involved in completing the task (and are included in the parameter of the property being measured). Let’s say that in a test for memory capacity there is a task related to voluntary memorization, which may involve speech (speaking a list of numbers out loud or “to oneself”), thinking (building associative connections), etc. In this case, a task to memorize a number series with the subject distracted may increase the difficulty of its completion.

Differentiating ability

The differentiating ability of a task is how much it can distinguish a strong subject from a weak one in terms of the property being measured. If all subjects have the same value for one of the tasks, it is inappropriate to include this task in the test. It is very difficult not to make the mistake of mistaking the differentiating ability of a task for its difficulty/ease. The fact is that in tests that measure the quality of activities performed, knowledge, etc., achievement tests, a number of identical answers to a task will mean two options: correct / incorrect. Accordingly, from this series one can draw an incorrect conclusion about the difficulty (in the case of all incorrect answers) or ease (in the case of all correct answers) of the task.

It should be noted that this criterion is often neglected by the compilers of modern tests. This brings great inconvenience to both subjects, who have to answer unnecessary questions, and psychologists, who are forced to process unnecessary information.

Differentiation ability is empirically determined through data variation.

Variation and dispersion

Variation is, literally, the degree of diversity in the data obtained while performing a task; it reflects differentiating ability. If the differentiating ability is high, the data are said to be variable, and vice versa; if the data are not variable, the item is removed from the test. Variation is determined by calculating the variance: the mean of the squared deviations of the score values from the arithmetic-mean score. Simply put, the arithmetic mean of the sample is calculated, and every obtained score is compared with it; this yields information about the variation of the test item. A common measure of the variation of an item's scores is the standard deviation, which is the square root of the variance.
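A minimal sketch of this computation with Python's statistics module (the item scores are illustrative; zero variance would mean the item has no differentiating power):

```python
# Variance and standard deviation of item scores as measures of variation.
from statistics import pvariance, pstdev

scores = [2, 4, 4, 4, 5, 5, 7, 9]          # illustrative item scores
print(pvariance(scores), pstdev(scores))   # variance 4, standard deviation 2
```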

The variation is sometimes called the observed variable: the property the test aims to measure is treated as a latent (unobservable) variable, while the test yields an observed variable that reveals only approximate values of the subject's unobserved true scores.

Primary analysis of test results

So, the test has been standardized, tested, and approved by an expert commission. It can now be used to obtain the necessary information about a person's psychological property or ability. To do this, a primary analysis of the results is carried out after testing; it is usually discussed in the context of group testing.

The data obtained are first reduced to the arithmetic mean, which shows the group result most clearly. However, the mean is not very informative about the shape of the score distribution and the frequency of each value. The mode (Mo) is the most frequently occurring score value; there can be several modes if several values occur the greatest number of times. Next, the sample is divided in half, and the score of the borderline subject is taken as the median (Me).
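These summary statistics can be obtained directly with Python's statistics module (the group scores are illustrative; statistics.multimode returns every mode at once):

```python
# Mean, mode(s) and median of group test scores, as described above.
from statistics import mean, median, multimode

scores = [3, 5, 5, 6, 6, 6, 7, 7, 8, 10]           # illustrative group scores
print(mean(scores), multimode(scores), median(scores))
```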

The graph of test results usually takes the form of a bell (the "Gaussian bell") corresponding to the normal distribution: the extreme values represent rare scores, and the frequency of scores increases toward the middle of the curve. The mode(s), median and arithmetic mean are also plotted on the graph. In some cases they coincide, and the distribution is then called symmetric. The greater the distance between the mode, median and mean, the more the test results deviate from the normal distribution.

Advantages of computer testing programs

The above procedure for processing test results with a large number of subjects takes a lot of time and effort. Computer testing programs allow you to see the above sample characteristics in a matter of seconds, presented in graphs and tables for greater clarity. This saves time, money and effort for the psychologist, who, having immediately received the results of the initial analysis, can begin to develop recommendations or test scientific hypotheses.

TESTING (from the English test: trial, examination) is a method of psychological diagnostics that uses standardized questions and tasks (tests) with a defined scale of values. It is used for the standardized measurement of individual differences.

There are three main areas of testing:

a) education - due to the increase in the duration of education and the complication of educational programs;

b) professional training and selection - due to the increasing growth rate and increasing complexity of production;

c) psychological counseling - in connection with the acceleration of sociodynamic processes. Testing makes it possible, with a known probability, to determine the individual’s current level of development of the necessary skills, knowledge, personal characteristics, etc.

The testing process can be divided into three stages:

1) choice of test (determined by the purpose of testing and the degree of validity and reliability of the test);

2) conducting the test (determined by the instructions for the test);

3) interpretation of results (determined by a system of theoretical assumptions regarding the subject of testing).

At all three stages, the participation of a qualified psychologist (teacher) is necessary. Processing the results for a large number of subjects takes much time and effort; computer testing programs display the sample characteristics within seconds, presented in graphs and tables for clarity, and create an atmosphere of independence by removing the teacher-student relationship from the procedure. This saves the time, money and effort of the educational psychologist. Modern computer programs make it possible to process the data quickly and efficiently.

Analysis and evaluation of test items begins after the test has been tried out on the target group. The data obtained are summarized in a table with a matrix structure, in which the items are then evaluated according to the following criteria:

1) measure of task difficulty;

2) differentiating ability of the task;

3) primary analysis of test results

Measure of task difficulty

The measure of task difficulty provides information about the degree of involvement of the parameter of the property being studied that it is intended to measure and determines the correspondence of the task to the target group of the test.

Whether a task is difficult or easy is determined by calculating the proportion of incorrect answers to each of them. The difficulty of a task can also be determined speculatively, based on the expected number and nature of those elements that are involved in the execution.

Differentiating ability

Differentiating ability is the extent to which a task can distinguish a strong subject from a weak one in terms of knowledge level. If all subjects have the same value for one of the tasks, it is inappropriate to include this task in the test. Differentiation ability is empirically determined through data variation.

Variation is the degree of diversity of the data obtained during the execution of a task; it reflects differentiating ability. If the differentiating ability is high, the data are said to be variable, and vice versa; if the data are not variable, the item is removed from the test. Variation is determined by calculating the variance: the mean of the squared deviations of the score values from the arithmetic-mean score, i.e. the arithmetic mean of the sample is calculated and every obtained score is compared with it. This yields information about the variation of the test item. A common measure of the variation of an item's scores is the standard deviation, which is the square root of the variance.

Primary analysis of test results

After the test is standardized, tested, and approved by an expert commission, the necessary information about a person’s ability can be obtained. To do this, after testing, a primary analysis of the results is carried out; it is better to use the results of group testing.
