Calculating the median of a set of numbers. The median function in Excel for statistical analysis

PRACTICE #4

Calculation of the structural characteristics of a variation distribution series.

The student must:

know:

- scope and methodology for calculating structural averages;

be able to:

- calculate structural averages;

- formulate a conclusion based on the results obtained.

Guidelines

In statistics, the mode and the median are classed as structural averages, because their values depend on the structure of the statistical population.

Mode calculation

The mode is the value of the attribute (variant) that occurs most often in the studied population. In a discrete distribution series, the mode is the variant with the highest frequency.

For example: the distribution of women's shoes sold by size is characterized as follows:

Shoe size

Number of pairs sold

In this distribution series, the mode is size 37, i.e. Mo = 37.

For an interval distribution series, the mode is determined by the formula:

$$Mo = x_{Mo} + h_{Mo} \cdot \frac{f_{Mo} - f_{Mo-1}}{(f_{Mo} - f_{Mo-1}) + (f_{Mo} - f_{Mo+1})}$$

where $x_{Mo}$ is the lower limit of the modal interval;

$h_{Mo}$ is the width of the modal interval;

$f_{Mo}$ is the frequency of the modal interval;

$f_{Mo-1}$ and $f_{Mo+1}$ are the frequencies of the intervals preceding and following the modal interval, respectively.

For example: the distribution of workers by length of service is characterized by the following data.

Work experience, years

up to 2

8-10

10 or more

Number of workers, pers.

Determine the mode of the interval series of the distribution.

The mode of the interval series is

The mode is always somewhat indeterminate: it depends on the width of the groups and the exact position of the group boundaries. The mode is widely used in commercial practice, for example when studying consumer demand or recording prices.

Median calculation

The median in statistics is the variant located in the middle of an ordered data series. It divides the statistical population into two equal parts, so that half of the values are less than the median and the other half are greater. To determine the median, a ranked series must be built, i.e. a series of individual attribute values in ascending or descending order.

In a discrete ordered series with an odd number of members, the median is the variant located in the center of the series.

For example: the length of service of five workers was 2, 4, 7, 9 and 10 years. In this series the median is 7 years, i.e. Me = 7 years.

If a discrete ordered series consists of an even number of members, the median is the arithmetic mean of the two adjacent variants in the center of the series.

For example: the length of service of six workers was 1, 3, 4, 5, 10 and 11 years. Two variants stand in the center of this series: 4 and 5. Their arithmetic mean is the median of the series: Me = (4 + 5) / 2 = 4.5 years.
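The two cases above (odd and even number of members) can be sketched in a few lines of Python. This is only an illustrative sketch; the function name `median_of_ranked` is introduced here and is not part of the original guidelines.

```python
import statistics

def median_of_ranked(values):
    """Median of ungrouped data: middle variant, or mean of the two middle variants."""
    ranked = sorted(values)          # build the ranked series
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:                   # odd number of members: the central variant
        return ranked[mid]
    # even number of members: arithmetic mean of the two central variants
    return (ranked[mid - 1] + ranked[mid]) / 2

odd_series = [2, 4, 7, 9, 10]        # length of service of five workers
even_series = [1, 3, 4, 5, 10, 11]   # length of service of six workers

print(median_of_ranked(odd_series))   # 7
print(median_of_ranked(even_series))  # 4.5
```

The standard-library function `statistics.median` follows the same rule and can be used as a cross-check.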

To determine the median for grouped data, the cumulative frequencies must be calculated.

For example: based on the available data, we determine the median shoe size.

Shoe size

Number of pairs sold

Sum of cumulative frequencies

8+19=27

27+34=61

61+108=169

Total

To determine the median, the accumulated frequencies of the series are summed. Accumulation continues until the accumulated sum of frequencies exceeds half the sum of the frequencies of the series. In our example the sum of frequencies is 300 and half of it is 150. The first accumulated sum to exceed 150 is 169. The variant corresponding to this sum, i.e. 37, is the median of the series.

If the accumulated sum of frequencies for one of the variants is exactly half the sum of the frequencies of the series, the median is defined as the arithmetic mean of this variant and the next one.
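The cumulative-frequency procedure, including the "exactly half" rule, can be sketched as follows. The function name `grouped_median` is introduced here for illustration. In the sample data, the first four frequencies (8, 19, 34, 108) and the total of 300 follow from the running sums quoted in the example; the shoe sizes themselves and the last three frequencies are assumptions made only so that the sketch runs.

```python
from itertools import accumulate

def grouped_median(variants, frequencies):
    """Median of a discrete series given as variants (ascending) and their frequencies."""
    half = sum(frequencies) / 2
    for i, cumulative in enumerate(accumulate(frequencies)):
        if cumulative > half:
            # first accumulated sum exceeding half the total: this variant is the median
            return variants[i]
        if cumulative == half:
            # accumulated sum is exactly half: mean of this variant and the next one
            return (variants[i] + variants[i + 1]) / 2
    raise ValueError("the series is empty")

# Sizes and the last three frequencies are hypothetical (see the note above).
sizes = [34, 35, 36, 37, 38, 39, 40]
pairs_sold = [8, 19, 34, 108, 72, 40, 19]   # total 300

print(grouped_median(sizes, pairs_sold))    # 37
print(grouped_median([1, 2], [5, 5]))       # 1.5 (accumulated sum equals half the total)
```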

For example: based on the available data, determine the median wage of workers.

Monthly salary, thousand rubles

Number of workers, pers.

Sum of cumulative frequencies

14,0

14,2

2+6=8

16,0

8+12=20

16,8

18,0

Total:

The median will be:

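The median of an interval variation series is determined by the formula:

$$Me = x_{Me} + h_{Me} \cdot \frac{\frac{\sum f}{2} - S_{Me-1}}{f_{Me}}$$

where $x_{Me}$ is the lower limit of the median interval;

$h_{Me}$ is the width of the median interval;

$\sum f$ is the sum of the frequencies of the series;

$S_{Me-1}$ is the cumulative frequency of the interval preceding the median interval;

$f_{Me}$ is the frequency of the median interval.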
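A minimal Python sketch of the usual linear-interpolation estimate of the median inside the median interval, assuming the standard form Me = x_Me + h_Me * (Σf/2 - S) / f_Me, where S (a name introduced here) is the frequency accumulated before the median interval. The function name `interval_median` is hypothetical. In the sample data, the first four frequencies (1, 3, 7, 30) match the running sums of the enterprise example; the remaining three frequencies are assumptions.

```python
def interval_median(intervals, frequencies):
    """Estimate the median of an interval series.

    intervals   -- list of (lower, upper) bounds in ascending order
    frequencies -- frequency of each interval
    """
    half = sum(frequencies) / 2
    accumulated_before = 0            # S: frequency accumulated before the current interval
    for (lower, upper), f in zip(intervals, frequencies):
        if f > 0 and accumulated_before + f >= half:
            # this is the median interval: interpolate linearly inside it
            width = upper - lower
            return lower + width * (half - accumulated_before) / f
        accumulated_before += f
    raise ValueError("the series is empty")

intervals = [(100, 200), (200, 300), (300, 400), (400, 500),
             (500, 600), (600, 700), (700, 800)]
enterprises = [1, 3, 7, 30, 19, 15, 5]   # last three values are hypothetical

me = interval_median(intervals, enterprises)
print(round(me, 1))   # median lies in the interval 400-500
```

With these figures half the total is 40, the frequency accumulated before the interval 400-500 is 11, and the estimate is 400 + 100 * (40 - 11) / 30.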

For example: based on the available data on the distribution of enterprises by the number of industrial and production personnel, calculate the median of the interval variation series.

Number of enterprises

Sum of cumulative frequencies

100-200

200-300

1+3=4

300-400

4+7=11

400-500

11+30=41

500-600

600-700

700-800

Total:

Let us first determine the median interval. In this example, the first accumulated sum of frequencies that exceeds half the sum of all frequencies of the series corresponds to the interval 400-500. This is the median interval, i.e. the interval containing the median of the series. Let us calculate its value.

If the accumulated sum of frequencies for one of the intervals is exactly half the sum of the frequencies of the series, the median is determined by the formula:

where n is the number of units in the population.

For example: based on the available data on the distribution of enterprises by the number of industrial and production personnel, calculate the median of the interval variation series.

Groups of enterprises by the number of PPPs, pers.

Number of enterprises

Sum of cumulative frequencies

100-200

200-300

1+3=4

300-400

4+6=10

400-500

10+30=40

500-600

40+20=60

600-700

700-800

Total:

people

The mode and median can also be determined graphically:

the mode of a discrete series from the distribution polygon, the mode of an interval series from the distribution histogram, and the median from the cumulate.

The mode of an interval distribution series is determined from the histogram as follows. Select the tallest rectangle, which is the modal one. Connect the upper right vertex of the modal rectangle with the upper right corner of the preceding rectangle, and the upper left vertex of the modal rectangle with the upper left corner of the following rectangle. From the point where these two lines intersect, drop a perpendicular to the abscissa axis. The abscissa of the intersection point is the mode of the distribution.

The median is found from the cumulate. From the point on the scale of accumulated frequencies (or relative frequencies) corresponding to 50%, draw a straight line parallel to the abscissa axis until it intersects the cumulate. From this intersection point, drop a perpendicular to the abscissa axis. The abscissa of the intersection point is the median.

In addition to the mode and median, other structural characteristics, the quantiles, can be determined in a variation series. Quantiles are intended for a deeper study of the structure of the distribution series.

A quantile is the value of an attribute that occupies a certain place in the population ordered by this attribute. There are the following types of quantiles:

- quartiles: attribute values dividing the ordered set into four equal parts;

- deciles: attribute values dividing the ordered set into ten equal parts;

- percentiles: attribute values dividing the ordered set into one hundred equal parts.
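For ungrouped data, the cut points listed above can be sketched with Python's standard library. This is only an illustration on made-up data; `statistics.quantiles` uses its own interpolation convention (the default "exclusive" method), which may differ slightly from the convention of a particular textbook.

```python
import statistics

data = list(range(1, 12))   # hypothetical ordered sample: 1, 2, ..., 11

quartiles = statistics.quantiles(data, n=4)      # 3 cut points -> 4 equal parts
deciles = statistics.quantiles(data, n=10)       # 9 cut points -> 10 equal parts
percentiles = statistics.quantiles(data, n=100)  # 99 cut points -> 100 equal parts

print(quartiles)          # [3.0, 6.0, 9.0]
print(len(deciles))       # 9
print(len(percentiles))   # 99
```

The second quartile coincides with the median of the sample.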

Thus, three indicators can be used to characterize the position of the center of a distribution series: the arithmetic mean, the mode and the median. When choosing the type and form of a specific indicator of the distribution center, the following recommendations apply:

- for stable socio-economic processes, the arithmetic mean is used as the indicator of the center. Such processes are characterized by symmetrical distributions, in which $\bar{x} = Me = Mo$;

- for unstable processes, the position of the distribution center is characterized by Mo or Me. For asymmetric distributions, the preferred characteristic of the center is the median, since it lies between the arithmetic mean and the mode.

TEST

On the topic: "Mode. Median. Methods for calculating them"


Introduction

Mean values and the related indicators of variation play a very important role in statistics, which follows from the subject it studies. This topic is therefore one of the central topics of the course.

The average is a very common generalizing indicator in statistics, because only an average can characterize a population with respect to a quantitatively varying attribute. In statistics, an average value is a generalizing characteristic of a set of similar phenomena with respect to some quantitatively varying attribute. The average shows the level of this attribute per unit of the population.

When studying social phenomena and seeking to identify their characteristic, typical features under specific conditions of place and time, statisticians make extensive use of average values. With the help of averages, different populations can be compared with each other with respect to varying attributes.

Averages used in statistics belong to the class of power averages. Of these, the arithmetic mean is used most often and the harmonic mean less often; the geometric mean is used only when calculating average rates of dynamics, and the quadratic mean only when calculating indicators of variation.

The arithmetic mean is the quotient of the sum of the variants divided by their number. It is used when the total volume of a varying attribute for the entire population is formed as the sum of the attribute values of its individual units. The arithmetic mean is the most common type of average, because it corresponds to the nature of social phenomena, where the volume of a varying attribute in the population is most often formed precisely as the sum of the attribute values of the individual units.

According to its defining property, the harmonic mean should be used when the total volume of the attribute is formed as the sum of the reciprocal values of the variants. It is used when, given the material available, the weights have to be divided by the variants rather than multiplied by them or, which is the same thing, multiplied by their reciprocals. In these cases the harmonic mean is the reciprocal of the arithmetic mean of the reciprocal values of the attribute.

The harmonic mean should be used in those cases when the weights are not the units of the population - the carriers of the feature, but the products of these units and the value of the feature.
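A small Python sketch of this case, on made-up figures: the weights are revenues (price times quantity) rather than quantities, so the average price is the total revenue divided by the sum of revenue-to-price ratios. The function name `harmonic_mean_from_totals` and the data are hypothetical.

```python
def harmonic_mean_from_totals(values, totals):
    """Weighted harmonic mean, where each weight is w_i = x_i * f_i.

    values -- attribute values x_i (here: prices)
    totals -- products x_i * f_i (here: revenue at each price)
    """
    # w_i / x_i recovers the hidden frequency f_i (here: quantity sold)
    return sum(totals) / sum(w / x for x, w in zip(values, totals))

prices = [20.0, 25.0]      # hypothetical prices per unit
revenue = [400.0, 500.0]   # hypothetical revenue at each price

print(harmonic_mean_from_totals(prices, revenue))   # 22.5
```

Here 20 units were sold at each price, so the average price is 900 / 40 = 22.5; averaging the prices with the revenues as ordinary arithmetic weights would give a different and incorrect figure.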


1. Definition of mode and median in statistics

The arithmetic and harmonic means are the generalizing characteristics of the population according to one or another varying attribute. Auxiliary descriptive characteristics of the distribution of a variable attribute are the mode and the median.

In statistics, the mode is the value of an attribute (variant) that occurs most often in a given population. In a variation series, this is the variant with the highest frequency.

The median in statistics is the variant located in the middle of the variation series. The median divides the series in half: on either side of it (above and below) there is the same number of population units.

The mode and median, in contrast to the power averages, are concrete characteristics: their value is a particular variant of the variation series.

The mode is used when it is necessary to characterize the most frequently occurring value of an attribute. If you need to find, for example, the most common wage rate in an enterprise, the market price at which the largest number of goods was sold, or the shoe size most in demand among consumers, the mode is the appropriate measure.

The median is of interest because it shows the quantitative limit of the varying attribute that half of the members of the population have reached. Suppose the average salary of bank employees is 650,000 rubles per month. This characteristic can be supplemented by the median: for example, by stating that half of the employees received a salary of 700,000 rubles or more. The mode and median are typical characteristics when populations are homogeneous and large.


2. Finding the Mode and Median in a Discrete Variation Series

Finding the mode and median in a variation series where the attribute values are given by specific numbers is not difficult. Consider Table 1, which shows the distribution of families by the number of children.

Table 1. Distribution of families by number of children

Obviously, in this example the mode is a family with two children, since this variant corresponds to the largest number of families. There may be distributions in which all variants are equally frequent; in that case there is no mode or, in other words, all variants can be said to be equally modal. In other cases, not one but two variants may have the highest frequency. Then there are two modes and the distribution is bimodal. Bimodal distributions may indicate that the population is qualitatively heterogeneous with respect to the attribute under study.
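The three situations described above (one mode, two modes, no distinct mode) can be sketched in Python. The function name `modes` is introduced here. Because the figures of Table 1 are not reproduced in the text, the family table below is hypothetical; it is only constrained to have 185 families in total, 40 in the first two groups and 75 with two children, as mentioned in the discussion of the median.

```python
def modes(frequency_table):
    """Return every variant that has the highest frequency."""
    top = max(frequency_table.values())
    return [variant for variant, f in frequency_table.items() if f == top]

# Hypothetical table: number of children -> number of families (total 185)
families = {0: 10, 1: 30, 2: 75, 3: 45, 4: 20, 5: 5}
bimodal = {1: 4, 2: 9, 3: 5, 4: 9, 5: 3}     # two variants share the top frequency
uniform = {1: 6, 2: 6, 3: 6}                 # all variants equally frequent

print(modes(families))   # [2]
print(modes(bimodal))    # [2, 4]
print(modes(uniform))    # [1, 2, 3]
```

For raw (ungrouped) data, `statistics.multimode` from the standard library returns the same kind of list.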

To find the median in a discrete variation series, divide the sum of frequencies in half and add ½ to the result. So, in the distribution of 185 families by the number of children, the position of the median is 185/2 + ½ = 93, i.e. the 93rd variant divides the ordered series in half. What is the value of the 93rd variant? To find out, accumulate the frequencies starting from the smallest variants. The sum of the frequencies of the 1st and 2nd variants is 40, so the 93rd variant is not among them. Adding the frequency of the 3rd variant gives 40 + 75 = 115. Therefore the 93rd variant corresponds to the third value of the varying attribute, and the median is a family with two children.

The mode and median in this example coincide. If the sum of frequencies were even (for example, 184), the same formula would give the position of the median as 184/2 + ½ = 92.5. Since there are no fractional positions, this result indicates that the median lies midway between the 92nd and 93rd variants.
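The position rule can be checked by expanding a frequency table into a ranked list. The table below is hypothetical (Table 1 is not reproduced in the text); it only respects the totals quoted above: 185 families, 40 in the first two groups and 75 in the third. The helper names `expand` and `median_position` are introduced for this sketch.

```python
import statistics

# Hypothetical table: number of children -> number of families (total 185)
families = {0: 10, 1: 30, 2: 75, 3: 45, 4: 20, 5: 5}

def expand(table):
    """Turn a frequency table into a ranked list of individual values."""
    return [value for value, count in sorted(table.items()) for _ in range(count)]

def median_position(n):
    """Ordinal number of the median: n/2 + 1/2."""
    return n / 2 + 0.5

ranked = expand(families)
position = median_position(len(ranked))       # 93.0 for 185 families
median_odd = ranked[int(position) - 1]        # value of the 93rd variant

# Even case: one family fewer gives 184 units and position 92.5
smaller = dict(families)
smaller[5] -= 1
ranked_even = expand(smaller)
position_even = median_position(len(ranked_even))
lower = ranked_even[int(position_even) - 1]   # 92nd variant
upper = ranked_even[int(position_even)]       # 93rd variant
median_even = (lower + upper) / 2

print(position, median_odd)         # 93.0 2
print(position_even, median_even)   # 92.5 2.0
```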

3. Calculation of the mode and median in the interval variation series

The descriptive nature of the mode and median is due to the fact that they do not compensate for individual deviations. They always correspond to a particular variant. Therefore no calculation is needed to find the mode and median if all the values of the attribute are known. In an interval variation series, however, calculations are used to find the approximate value of the mode and median within a certain interval.

To calculate the value of the mode lying within an interval, the following formula is used:

$$Mo = X_{Mo} + i_{Mo} \cdot \frac{f_{Mo} - f_{Mo-1}}{(f_{Mo} - f_{Mo-1}) + (f_{Mo} - f_{Mo+1})},$$

where $X_{Mo}$ is the lower limit of the modal interval;

$i_{Mo}$ is the width of the modal interval;

$f_{Mo}$ is the frequency of the modal interval;

$f_{Mo-1}$ is the frequency of the interval preceding the modal interval;

$f_{Mo+1}$ is the frequency of the interval following the modal interval.

We will show the calculation of the mode using the example given in Table 2.


Table 2. Distribution of workers of the enterprise according to the implementation of production standards

To find the mode, we first determine the modal interval of this series. The example shows that the highest frequency corresponds to the interval in which the variant lies between 100 and 105. This is the modal interval. Its width is 5.

Substituting the numerical values from Table 2 into the above formula, we get:

$$Mo = 100 + 5 \cdot \frac{104 - 12}{(104 - 12) + (104 - 98)} \approx 104.7$$

The meaning of this formula is as follows: the part of the modal interval that must be added to its lower boundary is determined by the frequencies of the preceding and following intervals. In this case we add about 4.7 to 100, i.e. more than half of the interval, because the frequency of the preceding interval is lower than the frequency of the following interval.
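The substitution can be checked with a short Python sketch. The function name `interval_mode` is introduced here; the arguments are the figures quoted in the worked example (lower limit 100, width 5, frequencies 104, 12 and 98). Evaluated with these figures, the formula gives roughly 104.7, a value inside the modal interval 100 to 105.

```python
def interval_mode(lower, width, f_modal, f_prev, f_next):
    """Mode of an interval series from the modal interval and its neighbours."""
    left = f_modal - f_prev    # excess over the preceding interval
    right = f_modal - f_next   # excess over the following interval
    # Note: left + right is zero only if all three frequencies are equal,
    # in which case the series has no distinct modal interval.
    return lower + width * left / (left + right)

mo = interval_mode(lower=100, width=5, f_modal=104, f_prev=12, f_next=98)
print(round(mo, 1))   # 104.7
```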

Let us now calculate the median. To find the median in an interval variation series, we first determine the interval in which it lies (the median interval). This is the first interval whose cumulative frequency is equal to or greater than half the sum of the frequencies. Cumulative frequencies are formed by successively summing the frequencies, starting from the interval with the smallest attribute value. Half the sum of the frequencies here is 250 (500 : 2). Therefore, according to Table 3, the median interval is the interval with wages from 350,000 to 400,000 rubles.

Table 3. Calculation of the median in the interval variation series

Before this interval, the sum of the accumulated frequencies was 160. Therefore, in order to obtain the value of the median, it is necessary to add another 90 units (250 - 160).

The median is the attribute value that divides the ranked distribution series into two equal parts: one with attribute values less than the median and one with attribute values greater than the median. To find the median, you need to find the attribute value located in the middle of the ordered series.


For ungrouped data in a ranked series, finding the median reduces to finding its ordinal number. For an interval series, the median can be calculated using the following formula:

$$Me = X_{m} + i_{m} \cdot \frac{\frac{\sum f}{2} - S_{me}}{f_{me}}$$

where $X_{m}$ is the lower limit of the median interval;
$i_{m}$ is the width of the median interval;
$\sum f$ is the total number of observations;
$S_{me}$ is the sum of observations accumulated before the beginning of the median interval;
$f_{me}$ is the number of observations in the median interval.

Median properties

  1. The median does not depend on the attribute values located on either side of it.
  2. Analytical operations with the median are very limited: when two distributions with known medians are combined, the median of the new distribution cannot be predicted in advance.
  3. The median has the minimum property: the sum of the absolute deviations of the values of x from the median is smaller than the sum of their absolute deviations from any other value.
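The minimum property (property 3) can be illustrated numerically. This is a sketch on a small made-up sample, not a proof; the helper name `total_abs_deviation` is introduced here. The sum of absolute deviations is evaluated at the median and at a grid of other candidate centers.

```python
import statistics

def total_abs_deviation(values, center):
    """Sum of absolute deviations of the values from a given center."""
    return sum(abs(x - center) for x in values)

data = [1, 3, 4, 5, 10, 11]                # hypothetical sample
me = statistics.median(data)               # 4.5
mean = statistics.mean(data)

candidates = [c / 2 for c in range(0, 25)]  # 0.0, 0.5, ..., 12.0
at_median = total_abs_deviation(data, me)

print(at_median)                           # 18.0
print(total_abs_deviation(data, mean) > at_median)                          # True
print(all(total_abs_deviation(data, c) >= at_median for c in candidates))   # True
```

For an even number of values the minimum is reached on the whole segment between the two central values, so the median is a minimizer but not the only one.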

Graphical definition of the median

To determine the median graphically, the accumulated frequencies are used to build the cumulative curve. The tops of the ordinates corresponding to the accumulated frequencies are connected by straight line segments. Divide in half the last ordinate, which corresponds to the total sum of frequencies, and from that midpoint draw a perpendicular to it until it intersects the cumulative curve; the abscissa of the intersection point is the desired value of the median.

Definition of the mode in statistics

The mode is the attribute value that has the highest frequency in a statistical distribution series.

The mode is determined in different ways, depending on whether the variable is presented as a discrete or an interval series.

In a discrete series, the mode is found simply by looking through the frequency column and finding the largest number, which is the highest frequency. The attribute value it corresponds to is the mode. In an interval variation series, the central variant of the interval with the highest frequency is taken as an approximate mode. More precisely, in such a series the mode is calculated by the formula:

$$Mo = X_{Mo} + i_{Mo} \cdot \frac{f_{Mo} - f_{Mo-1}}{(f_{Mo} - f_{Mo-1}) + (f_{Mo} - f_{Mo+1})}$$

where $X_{Mo}$ is the lower limit of the modal interval;
$i_{Mo}$ is the width of the modal interval;
$f_{Mo}$, $f_{Mo-1}$, $f_{Mo+1}$ are the frequencies of the modal interval and of the intervals preceding and following it.

The modal interval is determined by the highest frequency.

The mode is widely used in statistical practice in the analysis of consumer demand, the recording of prices, etc.

Relationships between the arithmetic mean, median and mode

For a unimodal symmetrical distribution series, the median and mode are the same. For asymmetric distributions, they do not coincide.

By fitting curves of various types, K. Pearson determined that for moderately asymmetric distributions the following approximate relationship between the arithmetic mean, median and mode holds:

$$\bar{x} - Mo \approx 3(\bar{x} - Me)$$

In 1906, the great scientist and renowned eugenicist Francis Galton visited the annual Animal and Poultry Exhibition in western England, where, by chance, he performed an interesting experiment.

According to James Surowiecki, author of The Wisdom of Crowds, the fair Galton attended held a competition in which people had to guess the weight of a slaughtered bull. Whoever came closest to the true figure was declared the winner.

Galton was known for his contempt for the intellectual abilities of ordinary people. He believed that only real experts would be able to make accurate statements about the bull's weight. And 787 participants of the competition were not experts.

The scientist intended to prove the incompetence of the crowd by calculating the average of the participants' answers. Imagine his surprise when the result he obtained corresponded almost exactly to the real weight of the bull!

The average is a late invention

Of course, the accuracy of the answer amazed the researcher. But even more remarkable is the fact that Galton thought of using the average at all.

In today's world, averages and medians are found at every turn: the average temperature in New York in April is 52 degrees Fahrenheit; Stephen Curry averages 30 points per game; the median household income in the US is $51,939 per year.

However, the idea that many different outcomes can be represented by a single number is quite new. Until the 17th century, averages were not generally used.

How did the concept of averages and medians come about and develop? And how did it manage to become the main measuring technique in our time?

The predominance of means over medians had far-reaching consequences for our understanding of information. And often it led people astray.

Mean and median values

Imagine that you are telling a story about four people who dined with you at a restaurant last night. You would guess one of them to be 20 years old, another 30, the third 40 and the fourth 50. What would you say about their ages in your story?

Most likely, you would give their average age.

The mean is often used to convey information about something, as well as to describe a set of measurements. Technically, the average is what mathematicians call the "arithmetic mean": the sum of all measurements divided by the number of measurements.

Although the word "mean" is often used as a synonym for "average", it is also commonly defined as the middle of something. The word derives from the Latin "medianus", which means "middle".

The mean in Ancient Greece

The history of the mean goes back to the teachings of the ancient Greek mathematician Pythagoras. For Pythagoras and his school, the mean had a clear definition that was very different from how we understand the average today. It was used only in mathematics, not in data analysis.

In the Pythagorean school, the mean was the middle number in a three-term sequence, standing in an "equal" relation to its neighboring terms. An "equal" relation could mean the same distance, as with the number 4 in the sequence 2, 4, 6. It could also express a geometric progression, as with 10 in the sequence 1, 10, 100.

The statistician Churchill Eisenhart explains that in ancient Greece the mean was not used as a representative of, or substitute for, any set of numbers. It simply denoted the middle and was often used in mathematical proofs.

Eisenhart spent ten years studying the mean and median. Initially, he tried to find the representative function of the median in early scientific constructions. Instead, however, he found that most of the early physicists and astronomers relied on single, skillfully made measurements, and they did not have a methodology to choose the best result among many observations.

Modern researchers base their conclusions on the collection of large amounts of data, as, for example, biologists studying the human genome. Ancient scientists, on the other hand, could take several measurements, but chose only the best for building their theories.

As the historian of astronomy Otto Neugebauer wrote, "this is consistent with the conscious desire of ancient people to minimize the amount of empirical data in science, because they did not believe in the accuracy of direct observations."

For example, the Greek mathematician and astronomer Ptolemy calculated the angular diameter of the Moon using the method of observation and the theory of the motion of the earth. His estimate was 31'20. Today we know that the diameter of the Moon ranges from 29'20 to 34'6, depending on its distance from the Earth. Ptolemy used little data in his calculations, but he had every reason to believe that they were accurate.

Eisenhart writes: “It must be borne in mind that the relationship between observation and theory in antiquity was different than it is today. The results of observations were understood not as facts to which the theory should be adjusted, but as concrete cases that can be useful only as illustrative examples of the truth of the theory.”

Eventually scientists did turn to representative measures of data, but at first neither means nor medians were used in this role. From antiquity onward, a different mathematical concept served as the representative value: the half-sum of the extreme values.

Half sum of extreme values

New scientific tools almost always arise from the need to solve a certain problem in some discipline. The need to find the best value among many measurements arose from the need to accurately determine the geographic location.

The 11th-century intellectual giant Al-Biruni is known as one of the first people to use a methodology of representative values. Al-Biruni wrote that when he had many measurements at his disposal and wanted to find the best among them, he used the following "rule": find the number that lies midway between the two extreme values. When the half-sum of the extreme values is calculated, all the numbers between the maximum and the minimum are ignored, and only the average of these two numbers is found.

Al-Biruni applied this method in various fields, including to calculate the longitude of the city of Ghazni, which is located on the territory of modern Afghanistan, as well as in his studies of the properties of metals.

However, in the last few centuries, the half-sum of the extremes has been used less and less. In fact, in modern science it is not relevant at all. The median value replaced the half-sum.

Transition to Averages

By the early 19th century, the use of the mean had become a common method for finding the most representative value of a group of data. Carl Friedrich Gauss, an outstanding mathematician of his time, wrote in 1809: “It was believed that if a certain number was determined by several direct observations made under the same conditions, then the arithmetic mean is the most true value. If it is not quite strict, then at least it is close to reality, and therefore one can always rely on it.”

Why has there been such a shift in methodology?

This question is rather difficult to answer. In his research, Churchill Eisenhart suggests that the method of finding the arithmetic mean may have originated in the measurement of magnetic declination, that is, the difference between the direction in which the compass needle points and true north. This measurement was extremely important during the Age of Discovery.

Eisenhart found that until the end of the 16th century, most scientists who measured magnetic declination chose the most accurate measurement by an ad hoc method (from the Latin for "to this", i.e. for this particular occasion).

But in 1580 the scientist William Borough approached the problem differently. He took eight different measurements of the declination and, comparing them, concluded that the most exact value lay between 11 ⅓ and 11 ¼ degrees. He probably calculated the arithmetic mean, which fell within this range. However, Borough himself did not openly call his approach a new method.

Before 1635 there were no unequivocal cases of the average being used as a representative number. In that year, however, the English astronomer Henry Gellibrand took two different measurements of the magnetic declination. One was made in the morning (11 degrees) and the other in the afternoon (11 degrees 32 minutes). Calculating the most likely true value, he wrote:

“If we find the arithmetic mean, we can say with high probability that the result of an accurate measurement should be about 11 degrees 16 minutes.”

It is likely that this was the first time the average was used as the best estimate of the true value!

The word "average" entered the English language at the beginning of the 16th century to denote financial losses from damage that a ship or its cargo suffered during a voyage. For the next hundred years it denoted precisely these losses, which were calculated as an arithmetic mean. For example, if a ship was damaged during a voyage and the crew had to throw some goods overboard to lighten it, the investors suffered a financial loss in proportion to their investment, and these losses were calculated in the same way as the arithmetic mean. So gradually the meanings of "average" and "arithmetic mean" converged.

Median value

Today, the average or arithmetic mean is used as the main way to select a representative value of a set of measurements. How did it happen? Why was this role not assigned to the median value?

Francis Galton was a champion of the median

The term "median", the middle term in a series of numbers that divides the series in half, appeared at about the same time as the arithmetic mean. In 1599 the mathematician Edward Wright, who was working on the problem of compass declination, first suggested using the median.

“... Let's say a lot of archers shoot at some target. The target is subsequently removed. How can you find out where the target was? You need to find the middle place between all the arrows. Likewise, among the set of results of observations, the closest to the truth will be the one in the middle.”

The median was widely used in the nineteenth century, becoming an indispensable part of any data analysis at that time. It was also used by Francis Galton, the eminent nineteenth-century analyst. In the bull weighing story at the beginning of this article, Galton originally used the median as representing the opinion of the crowd.

Many analysts, including Galton, preferred the median because it is easier to calculate for smaller datasets.

However, the median has never been more popular than the mean. Most likely, this happened due to the special statistical properties inherent in the mean value, as well as its relationship to the normal distribution.

Relation between mean and normal distribution

When we take many measurements, the results are often, as statisticians say, "normally distributed." This means that if the data are plotted on a graph, the points form something resembling a bell. If you connect them, you get a "bell-shaped" curve. Many statistics follow the normal distribution, for example people's height, measures of intelligence and the highest annual temperature.

When the data is normally distributed, the mean will be very close to the highest point on the bell curve, and a very large number of measurements will be close to the mean. There is even a formula that predicts how many measurements will be some distance from the average.
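For an ideal normal distribution, the formula mentioned above can be expressed through the error function: the share of measurements lying within k standard deviations of the mean is erf(k / sqrt(2)). The sketch below is only an illustration; the function name `share_within` is made up for this example, and it assumes the data are exactly normal.

```python
import math


def share_within(k: float) -> float:
    """Share of a normal population lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


# Print the shares for 1, 2 and 3 standard deviations.
for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
```

These shares are roughly 68%, 95% and 99.7%, which is why the rule is often quoted as "68-95-99.7".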

Thus, calculating the mean gives researchers a lot of additional information.

The relationship between the mean and the standard deviation gives the mean a great advantage, because the median has no such relationship. This connection is an important part of analyzing experimental data and of statistical processing of information. That is why the mean has become the core of statistics and of all sciences that draw their conclusions from large amounts of data.

The mean also has the advantage of being easy for computers to calculate. Although the median of a small group of data is fairly easy to find by hand, it is much easier to write a computer program that finds the mean. If you use Microsoft Excel, you probably know that the median function is not as simple to compute as the mean function.

As a result, due to its great scientific value and ease of use, the average value has become the main representative value. However, this option is not always the best.

Advantages of the median value

In many cases when we want to calculate the center value of a distribution, the median value is the best indicator. This is because the average value is largely determined by the extreme measurements.

Many analysts believe that the thoughtless use of the average harms our understanding of quantitative information. People look at the average and think it is "normal". But in fact it can be determined by a single value that stands out sharply from an otherwise homogeneous series.

Imagine an analyst who wants a representative value for the price of five houses. Four houses are worth $100,000 and the fifth is worth $900,000. The mean would then be $260,000 and the median would be $100,000. In this, as in many other cases, the median gives a better idea of what can be called "standard".
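The house example is easy to check with Python's standard `statistics` module. This is just a quick sketch; the variable names are chosen for illustration.

```python
from statistics import mean, median

# Four houses at $100,000 and one at $900,000, as in the example above.
prices = [100_000, 100_000, 100_000, 100_000, 900_000]

house_mean = mean(prices)      # sum of the prices divided by their count
house_median = median(prices)  # middle value of the sorted prices

print(house_mean, house_median)  # 260000 100000
```

A single expensive house pulls the mean far above what four of the five houses actually cost, while the median does not move.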

Because extreme values can distort the average, the median is used to report changes in US household income.

The median is also less sensitive to the "dirty" data that analysts deal with today. Many statisticians and analysts collect information by interviewing people on the Internet. If the user accidentally adds an extra zero to the answer, which turns 100 into 1000, then this error will affect the mean much more than the median.

Mean or median?

Choosing between the median and the mean has far-reaching implications, from our understanding of the effects of medicines on health to our knowledge of what a family's standard budget is.

As the collection and analysis of data increasingly shapes how we understand the world, the choice of the quantities we use matters more and more. In an ideal world, analysts would use both the mean and the median to describe the data.

But we live in conditions of limited time and attention. Because of these limitations, we often need to choose just one. And in many cases, the median value is preferable.

4. Mode. Median. Population and sample mean

The mode is on the screen, the median is in the triangle, and the averages are the temperature across the hospital and in the ward. We continue our practical course in entertaining statistics (Lesson 1) by studying the central characteristics of a statistical population, whose names you see in the title. We will start from the end of the title, because averages have come up almost from the very first paragraphs of the topic. A table of contents for advanced readers:

  • Population and sample mean: calculation from primary data and from a compiled discrete variation series;
  • Mode: definition and how to find it in the discrete case;
  • Median: general definition and how to find the median;
  • Mean, mode and median of an interval variation series: calculation from primary data and from a ready-made series. Mode and median formulas;
  • Quartiles, deciles, percentiles: the essentials in brief.

Beginners are better off reading the material in order:

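So, suppose we are studying a population of size $N$, namely some numerical characteristic $X$ of it, no matter whether discrete or continuous (Lessons 2, 3).

The population mean is the arithmetic mean of all values of this population:

$$\bar{x}_G = \frac{x_1 + x_2 + \dots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i$$

If some of the numbers $x_1, x_2, \dots, x_N$ coincide (which is typical for a discrete series), the formula can be written more compactly:

$$\bar{x}_G = \frac{x_1 N_1 + x_2 N_2 + \dots + x_k N_k}{N} = \frac{1}{N}\sum_{i=1}^{k} x_i N_i$$

where
value $x_1$ occurs $N_1$ times;
value $x_2$ occurs $N_2$ times;
value $x_3$ occurs $N_3$ times;
...
value $x_k$ occurs $N_k$ times.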

A worked example of calculating the population mean appeared in Example 2, but so as not to be boring I will not even recap its contents.

Next. As we remember, processing the entire population is often difficult or impossible, so a representative sample of size $n$ is drawn, and conclusions about the whole population are made from the study of this sample.

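The sample mean is the arithmetic mean of all values in a sample of size $n$:

$$\bar{x}_s = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

and when some values repeat, the formula is written more compactly:

$$\bar{x}_s = \frac{1}{n}\sum_{i=1}^{k} x_i n_i$$

that is, as the sum of the products of each value $x_i$ and its frequency $n_i$, divided by the sample size.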

The sample mean lets us estimate the true value of the population mean fairly accurately, which is quite enough for many studies. The larger the sample, the more accurate this estimate will be.

Let's start the practice, or rather continue it, with a discrete variation series and a familiar problem statement:

Example 8

Based on the results of a selective study of the workshop workers, their qualification categories were established: 4, 5, 6, 4, 4, 2, 3, 5, 4, 4, 5, 2, 3, 3, 4, 5, 5, 2, 3, 6, 5, 4, 6, 4, 3.

How do we solve the problem? If we are given primary data (the original raw values), they can simply be summed and divided by the sample size:

$$\bar{x}_s = \frac{4 + 5 + 6 + \dots + 3}{25} = \frac{101}{25} = 4.04$$

which is the average qualification category of the workers in the shop.

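But many problems require composing a variation series (see Example 4); here the values 2, 3, 4, 5, 6 occur with frequencies 3, 5, 8, 6, 3 respectively.

Or the series is given from the start (which happens more often). Then, of course, we use the "civilized" formula:

$$\bar{x}_s = \frac{2 \cdot 3 + 3 \cdot 5 + 4 \cdot 8 + 5 \cdot 6 + 6 \cdot 3}{25} = \frac{101}{25} = 4.04$$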
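Both ways of computing the sample mean for Example 8 can be checked with a few lines of Python. This is a minimal sketch; the variable names are illustrative, and `collections.Counter` is used here only as a convenient way to build the variation series.

```python
from collections import Counter

# Qualification categories from Example 8 (primary data).
grades = [4, 5, 6, 4, 4, 2, 3, 5, 4, 4, 5, 2, 3, 3, 4, 5, 5, 2, 3, 6, 5, 4, 6, 4, 3]
n = len(grades)

# 1) Straight from the primary data: sum everything and divide by the sample size.
mean_raw = sum(grades) / n

# 2) From the discrete variation series: value -> frequency, sorted by value.
series = dict(sorted(Counter(grades).items()))
mean_series = sum(x * freq for x, freq in series.items()) / n

print(series)                 # {2: 3, 3: 5, 4: 8, 5: 6, 6: 3}
print(mean_raw, mean_series)  # 4.04 4.04
```

The two results coincide because the compact formula only groups equal terms of the same sum.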

Mode. The mode of a discrete variation series is the value with the maximum frequency. In this case $M_o = 4$. The mode is easy to find in the table, and even easier on the frequency polygon, where it is the abscissa of the highest point:


Sometimes there are several such values (with the same maximum frequency), and then each of them is considered a mode.

If all or almost all values are different (which is typical for an interval series), the modal value is defined in a slightly different way, which is covered in the 2nd part of the lesson.

Median. The median of a variation series* is the value that divides it into two equal parts (by the number of values).
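For the data of Example 8, the mode and the median can be found with the standard `statistics` module. This is a hedged illustration: `multimode` (available since Python 3.8) returns every value that shares the maximum frequency, which matches the remark above that there may be several modes.

```python
from statistics import median, multimode

# Qualification categories from Example 8 (primary data).
grades = [4, 5, 6, 4, 4, 2, 3, 5, 4, 4, 5, 2, 3, 3, 4, 5, 5, 2, 3, 6, 5, 4, 6, 4, 3]

modes = multimode(grades)  # all values with the maximum frequency
me = median(grades)        # 13th value of the 25 sorted values

print(modes, me)  # [4] 4
```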

But now we need to find the mean, mode and median.

Solution: to find the mean from the primary data, it is best to sum all the values and divide the result by the size of the population; the result is expressed in monetary units.

These calculations, by the way, will not take long even with an ordinary offline calculator. But if you have Excel, then of course type =SUM( into any free cell, select all the numbers with the mouse, close the bracket ), add a division sign /, enter the number 30 and press Enter. Done.

As for the mode, estimating it from the initial data is impractical. Although some of the numbers repeat, there may easily be five, six or seven values with the same maximum frequency, for example a frequency of 2. In addition, prices may be rounded. Therefore, the modal value is calculated from the compiled interval series (more on that later).

As for the median: type =MEDIAN( into Excel, select all the numbers with the mouse, close the bracket ) and press Enter. Moreover, here you do not even need to sort anything.

But in Example 6 we sorted the data in ascending order (recall and sort: link above), and this is a good opportunity to repeat the formal algorithm for finding the median. We divide the sample in half:

Since it consists of an even number of values, the median equals the arithmetic mean of the 15th and 16th values of the ordered(!) variation series, expressed in monetary units.
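The formal algorithm (rank the series, then take the middle value or the mean of the two middle values) can be sketched as follows. The function name `median_of_ranked` is made up for this illustration, and the demo uses the numbers 1 to 30 rather than the prices from the example.

```python
def median_of_ranked(values):
    """Median by the formal algorithm: rank the series, then pick the middle."""
    ranked = sorted(values)
    n = len(ranked)
    if n == 0:
        raise ValueError("the median of an empty series is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd size: the single middle value.
        return ranked[mid]
    # Even size: mean of the two middle values
    # (for n = 30 these are the 15th and 16th values, indices 14 and 15).
    return (ranked[mid - 1] + ranked[mid]) / 2


demo = list(range(1, 31))      # 30 made-up values: 1, 2, ..., 30
print(median_of_ranked(demo))  # 15.5
```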

Situation two. When a ready-made interval series is given (a typical learning task).

We continue with the same example about boots, where an interval variation series (IVS) was compiled from the initial data. To calculate the mean, we need the midpoints of the intervals:

– to use the familiar discrete case formula:

- an excellent result! The discrepancy with the more accurate value calculated from the primary data is only 0.04.

In fact, here we approximated the interval series by a discrete one, and this approximation turned out to be very effective. However, there is no particular benefit in it, because with modern software it is not difficult to calculate the exact value even for a very large array of primary data. But that is on condition that the data are known to us :)
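The midpoint approximation can be sketched in Python as below. The interval boundaries and frequencies are hypothetical, chosen only to illustrate the technique; they are not the data of the boots example.

```python
# Hypothetical interval series (NOT the article's data): price intervals and frequencies.
intervals = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
freqs = [2, 7, 11, 6, 4]

# Replace every interval by its midpoint, then apply the discrete-case formula.
midpoints = [(lo + hi) / 2 for lo, hi in intervals]
n = sum(freqs)
approx_mean = sum(m * f for m, f in zip(midpoints, freqs)) / n

print(n, approx_mean)  # 30 36.0
```

The result is only an approximation of the mean of the underlying raw data, since each value is treated as if it sat at the middle of its interval.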

With other central indicators, everything is more interesting.

To find the mode, we need to locate the modal interval (the one with the maximum frequency), which in this problem is the interval with a frequency of 11, and use the following ugly formula:

$$M_o = x_0 + h \cdot \frac{n_{Mo} - n_{Mo-1}}{(n_{Mo} - n_{Mo-1}) + (n_{Mo} - n_{Mo+1})}$$

where:

$x_0$ is the lower limit of the modal interval;
$h$ is the length of the modal interval;
$n_{Mo}$ is the frequency of the modal interval;
$n_{Mo-1}$ is the frequency of the previous interval;
$n_{Mo+1}$ is the frequency of the next interval.

Thus, the mode is obtained in monetary units, and as you can see, the "fashionable" price of the shoes differs noticeably from the arithmetic mean.

Without going into the geometry of the formula, I will simply give the histogram of relative frequencies and note:


from which it is clearly seen that the mode is shifted from the center of the modal interval toward the left neighbor, which has the higher frequency. Logical.
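The mode formula for an interval series can be sketched as a small function. The name `interval_mode` and the data are hypothetical (the frequencies are made up, not taken from the boots example), and the sketch assumes a single modal interval whose frequency is strictly greater than at least one neighbor.

```python
def interval_mode(intervals, freqs):
    """Mode of an interval series; assumes exactly one modal interval."""
    # Index of the modal interval (the first one if several share the maximum).
    m = max(range(len(freqs)), key=freqs.__getitem__)
    lo, hi = intervals[m]
    # A missing neighbor (modal interval at the edge) counts as frequency 0.
    prev_f = freqs[m - 1] if m > 0 else 0
    next_f = freqs[m + 1] if m < len(freqs) - 1 else 0
    left_gap = freqs[m] - prev_f
    right_gap = freqs[m] - next_f
    return lo + (hi - lo) * left_gap / (left_gap + right_gap)


# Hypothetical interval series (NOT the article's data).
intervals = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
freqs = [2, 7, 11, 6, 4]

mode = interval_mode(intervals, freqs)
print(round(mode, 2))  # 34.44
```

Here the left neighbor (frequency 7) is larger than the right one (frequency 6), so the mode lands slightly to the left of the interval center 35.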

For reference, I will cover the rare cases:

- if the modal interval is the first or the last one, then the frequency of the missing neighbor (previous or next) is taken to be zero;

- if two modal intervals are adjacent, we merge them into a single modal interval, and the neighboring intervals (left and right) are, if possible, also doubled in width;

- if the modal intervals are separated by other intervals, we apply the formula to each of them, obtaining two or more modes.

That's Depeche Mode for you :)

And the median. If a ready-made interval series is given, the median is calculated using a slightly less terrible formula, but first we tediously need (a Freudian slip :)) to find the median interval: the interval containing the value (or two values) that divides the variation series into two equal parts.

Above, I described how to determine the median using relative cumulative frequencies; here it is more convenient to calculate the "ordinary" cumulative frequencies. The computational algorithm is exactly the same: the first value is carried over from the left column (red arrow), and each subsequent value is obtained as the sum of the previous cumulative value and the current frequency from the left column (green marks as an example):

Does everyone understand the meaning of the numbers in the right column? It is the number of values that have "accumulated" over all the intervals passed so far, including the current one.
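The accumulation step can be sketched with `itertools.accumulate`. The frequencies below are hypothetical (not the boots data), and the search for the median interval assumes that the middle values do not straddle a boundary between two intervals.

```python
from itertools import accumulate

# Hypothetical interval frequencies (NOT the article's data).
freqs = [2, 7, 11, 6, 4]

# Each cumulative frequency is the previous cumulative value plus the current frequency.
cumulative = list(accumulate(freqs))
n = cumulative[-1]

# The median interval is the first one whose cumulative frequency reaches n / 2.
median_index = next(i for i, c in enumerate(cumulative) if c >= n / 2)

print(cumulative, median_index)  # [2, 9, 20, 26, 30] 2
```

With 30 values, the 15th and 16th both fall into the third interval (index 2), because 9 values have accumulated before it and 20 by its end.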

Because we have an even number of values (30), the median interval is the one that contains the 30/2 = 15th and the 16th values. Looking at the cumulative frequencies, it is easy to identify the interval that contains these values.

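Median formula:

$$M_e = x_0 + h \cdot \frac{\frac{n}{2} - S_{Me-1}}{n_{Me}}$$

where:
$n$ is the size of the statistical population;
$x_0$ is the lower limit of the median interval;
$h$ is the length of the median interval;
$n_{Me}$ is the frequency of the median interval;
$S_{Me-1}$ is the cumulative frequency of the previous interval.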
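The median formula for an interval series can be sketched as follows. The function name `interval_median` and the data are hypothetical (not the boots example), and the sketch assumes a non-empty series with a single median interval.

```python
from itertools import accumulate


def interval_median(intervals, freqs):
    """Median of an interval series by linear interpolation inside the median interval."""
    n = sum(freqs)
    cumulative = list(accumulate(freqs))
    # Median interval: the first one whose cumulative frequency reaches n / 2.
    m = next(i for i, c in enumerate(cumulative) if c >= n / 2)
    lo, hi = intervals[m]
    # Cumulative frequency of the previous interval (0 if there is none).
    prev_cum = cumulative[m - 1] if m > 0 else 0
    return lo + (hi - lo) * (n / 2 - prev_cum) / freqs[m]


# Hypothetical interval series (NOT the article's data).
intervals = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
freqs = [2, 7, 11, 6, 4]

me = interval_median(intervals, freqs)
print(round(me, 2))  # 35.45
```

In this made-up series, 9 values lie to the left of the median interval, so 6 of its 11 values are needed to reach the middle, which places the median slightly to the right of the interval center 35.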

Thus, the median is obtained in monetary units. Note that the median, on the contrary, turned out to be shifted to the right, because there is a significant number of values on the right-hand side:


And, for reference, the special cases.