Calculating the median of a set of numbers. The median function in Excel for statistical analysis

PRACTICAL LESSON No. 4

Calculation of the structural characteristics of the variation series of the distribution.

The student must:

know:

- scope and methodology for calculating structural averages;

be able to:

- calculate structural averages;

- formulate a conclusion based on the results obtained.

Methodical instructions

In statistics, the mode and the median are classed as structural averages, because their values depend on the structure of the statistical population.

Calculating the mode

The mode is the value of a feature (variant) that occurs most often in the studied population. In a discrete distribution series, the mode is the variant with the highest frequency.

For example: the distribution of women's shoes sold, by size, is as follows:

Shoe size

Number of pairs sold

In this distribution, the mode is size 37, i.e. Mo = 37.
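The rule for a discrete series can be sketched in a few lines of Python. The sales figures below are hypothetical (the lesson's own table did not survive), and the function name `discrete_mode` is an assumption made for illustration. The function returns a list, so a bimodal series is reported with both modes.

```python
# Hypothetical sales data: shoe size -> number of pairs sold.
# These frequencies are illustrative only, not the lesson's table.
pairs_sold = {35: 10, 36: 25, 37: 40, 38: 15, 39: 5}


def discrete_mode(freq):
    """Return the variant(s) with the highest frequency."""
    top = max(freq.values())
    return [variant for variant, f in freq.items() if f == top]


print(discrete_mode(pairs_sold))  # [37]
```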

For an interval distribution series, the mode is determined by the formula:

$$Mo = X_{Mo} + h_{Mo} \cdot \frac{f_{Mo} - f_{Mo-1}}{(f_{Mo} - f_{Mo-1}) + (f_{Mo} - f_{Mo+1})}$$

where $X_{Mo}$ is the lower boundary of the modal interval;

$h_{Mo}$ is the width of the modal interval;

$f_{Mo}$ is the frequency of the modal interval;

$f_{Mo-1}$ and $f_{Mo+1}$ are the frequencies of the intervals preceding and following the modal interval, respectively.

For example: the distribution of workers by length of service is characterized by the following data.

Work experience, years

up to 2

8-10

10 and more

Number of workers, people

Determine the mode of the interval distribution series.

The interval series mode is

The mode is always somewhat indeterminate, because it depends on the size of the groups and the exact position of the group boundaries. The mode is widely used in commercial practice when studying consumer demand, recording prices, and so on.

Calculating the median

In statistics, the median is the variant located in the middle of an ordered series of data. It divides the statistical population into two equal parts, so that half of the values are less than the median and half are greater. To determine the median, it is necessary to construct a ranked series, i.e. a series of the individual values of the characteristic in ascending or descending order.

In a discrete ordered series with an odd number of members, the median is the variant located at the center of the series.

For example: the length of service of five workers was 2, 4, 7, 9 and 10 years. In this series, the median is 7 years, i.e. Me = 7 years.

If a discrete ordered series consists of an even number of members, the median is the arithmetic mean of the two adjacent variants at the center of the series.

For example: the length of service of six workers was 1, 3, 4, 5, 10 and 11 years. This series has two variants at its center, 4 and 5. The arithmetic mean of these values is the median of the series: Me = (4 + 5) / 2 = 4.5 years.
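Both cases (odd and even number of members) can be checked with a short Python sketch that uses the two series from the examples. The helper name `median_by_rank` is an assumption for illustration; the standard library's `statistics.median` applies the same rule.

```python
from statistics import median

odd_series = [2, 4, 7, 9, 10]        # five workers
even_series = [1, 3, 4, 5, 10, 11]   # six workers


def median_by_rank(values):
    """Median of ungrouped data: rank the values, then take the middle."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single central variant.
        return ranked[mid]
    # Even count: arithmetic mean of the two central variants.
    return (ranked[mid - 1] + ranked[mid]) / 2


print(median_by_rank(odd_series))   # 7
print(median_by_rank(even_series))  # 4.5
```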

To determine the median for grouped data, it is necessary to compute the cumulative frequencies.

For example: from the available data, we determine the median shoe size.

Shoe size

Number of pairs sold

Sum of accumulated frequencies

8+19=27

27+34=61

61+108=169

Total

To determine the median, you need to calculate the cumulative frequencies of the series. Accumulation continues until the cumulative frequency exceeds half the sum of the frequencies of the series. In our example, the sum of frequencies is 300 and half of it is 150. The first cumulative frequency to exceed this value is 169. The variant corresponding to it, i.e. size 37, is the median of the series.

If the cumulative frequency for one of the variants is exactly half the sum of the frequencies of the series, the median is the arithmetic mean of this variant and the next one.
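The cumulative-frequency procedure, including the "exactly half" case, can be sketched as follows. The function name `grouped_median` is an assumption. The wage values and the first three frequencies (2, 6, 12) follow the cumulative sums quoted in the wage example; the last two frequencies (16 and 4) are assumed for illustration, because the original table is incomplete.

```python
from itertools import accumulate


def grouped_median(variants, freqs):
    """Median of a discrete series given as variants (ascending) and frequencies."""
    half = sum(freqs) / 2
    for i, cum in enumerate(accumulate(freqs)):
        if cum == half:
            # Cumulative frequency is exactly half the total:
            # average this variant and the next one.
            return (variants[i] + variants[i + 1]) / 2
        if cum > half:
            # First cumulative frequency exceeding half the total.
            return variants[i]


wages = [14.0, 14.2, 16.0, 16.8, 18.0]   # thousand rubles
workers = [2, 6, 12, 16, 4]              # last two values are assumed

me = grouped_median(wages, workers)
print(f"{me:.1f}")  # 16.4
```

With these assumed frequencies the total is 40, the cumulative frequency reaches exactly 20 at 16.0, and the median is the mean of 16.0 and 16.8.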

For example: from the available data, we determine the median wage of workers.

Monthly salary, thousand rubles

Number of workers, people

Sum of accumulated frequencies

14,0

14,2

2+6=8

16,0

8+12=20

16,8

18,0

Total:

The median will be:

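The median of an interval variation series is determined by the formula:

$$Me = X_{Me} + h_{Me} \cdot \frac{\frac{\sum f}{2} - S_{Me-1}}{f_{Me}}$$

where $X_{Me}$ is the lower boundary of the median interval;

$h_{Me}$ is the width of the median interval;

$\sum f$ is the sum of the frequencies of the series;

$S_{Me-1}$ is the cumulative frequency of the interval preceding the median interval;

$f_{Me}$ is the frequency of the median interval.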
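A minimal Python sketch of this calculation is given below. The function name `interval_median` and the tuple layout are assumptions. The first four frequencies (1, 3, 7, 30) follow the cumulative sums quoted in the enterprise example; the last three (19, 15, 5) are assumed for illustration, since the original table is incomplete.

```python
def interval_median(intervals):
    """Median of an interval series.

    intervals: list of (lower, upper, frequency) in ascending order,
    with a positive total frequency.
    """
    half = sum(f for _, _, f in intervals) / 2
    cum_before = 0  # cumulative frequency before the current interval
    for lower, upper, f in intervals:
        if cum_before + f >= half:
            # This is the median interval: interpolate inside it.
            return lower + (upper - lower) * (half - cum_before) / f
        cum_before += f


enterprises = [
    (100, 200, 1), (200, 300, 3), (300, 400, 7), (400, 500, 30),
    (500, 600, 19), (600, 700, 15), (700, 800, 5),  # last three assumed
]

me = interval_median(enterprises)
print(f"{me:.1f}")  # 496.7
```

With these figures the total is 80, half of it is 40, and the median interval is 400-500, since the cumulative frequency 41 is the first to reach 40.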

For example: from the available data on the distribution of enterprises by the number of industrial and production personnel, calculate the median of the interval variation series.

Number of enterprises

Sum of accumulated frequencies

100-200

200-300

1+3=4

300-400

4+7=11

400-500

11+30=41

500-600

600-700

700-800

Total:

First of all, we determine the median interval. In this example, the first cumulative frequency exceeding half of the sum of all frequencies of the series corresponds to the interval 400-500, which is therefore the median interval, i.e. the interval in which the median of the series lies. We then determine its value.

If the sum of the accumulated frequencies against one of the intervals is exactly half the sum of the frequencies of the series, then the median is determined by the formula:

where n is the number of units in the population.

For example: from the available data on the distribution of enterprises by the number of industrial and production personnel, calculate the median of the interval variation series.

Groups of enterprises by the number of PPP, people

Number of enterprises

Sum of accumulated frequencies

100-200

200-300

1+3=4

300-400

4+6=10

400-500

10+30=40

500-600

40+20=60

600-700

700-800

Total:

people

The mode and the median of an interval series can also be determined graphically:

the mode of a discrete series from the distribution polygon, the mode of an interval series from the distribution histogram, and the median from the cumulative curve (ogive).

The mode of an interval series is determined from the histogram as follows. Select the tallest rectangle, which is the modal one. Connect the upper right vertex of the modal rectangle to the upper right corner of the preceding rectangle, and the upper left vertex of the modal rectangle to the upper left corner of the following rectangle. From the point where these two lines intersect, drop a perpendicular to the abscissa axis. The abscissa of the intersection point is the mode of the distribution.

The median is determined from the cumulative curve. From the point on the scale of cumulative frequencies (or relative frequencies) corresponding to 50%, draw a straight line parallel to the abscissa axis until it intersects the cumulative curve. From the point of intersection, drop a perpendicular to the abscissa axis. The abscissa of the intersection point is the median.

In addition to the mode and the median, other structural characteristics, quantiles, can be determined in a variation series. Quantiles are intended for a deeper study of the structure of a distribution series.

A quantile is the value of a feature that occupies a certain place in the population ordered by this feature. There are the following types of quantiles:

- quartiles - feature values dividing the ordered population into four equal parts;

- deciles - feature values dividing the ordered population into ten equal parts;

- percentiles - feature values dividing the ordered population into one hundred equal parts.
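As an illustration, the three kinds of quantiles can be computed with `statistics.quantiles` from the Python standard library (Python 3.8 or later). The data set is hypothetical. The function returns the cut points, so dividing into n parts yields n - 1 values, and its default interpolation method is only one of several conventions in use.

```python
from statistics import median, quantiles

# Hypothetical ordered population: 20 observed values of a feature.
data = list(range(1, 21))

quartiles = quantiles(data, n=4)      # 3 cut points -> 4 equal parts
deciles = quantiles(data, n=10)       # 9 cut points -> 10 equal parts
percentiles = quantiles(data, n=100)  # 99 cut points -> 100 equal parts

print(len(quartiles), len(deciles), len(percentiles))  # 3 9 99

# The second quartile is the median of the series.
print(quartiles[1] == median(data))  # True
```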

Thus, three indicators can be used to characterize the position of the center of a distribution series: the mean, the mode and the median. When choosing the type and form of a specific indicator of the center of the distribution, the following recommendations apply:

- for stable socio-economic processes, the arithmetic mean is used as the indicator of the center. Such processes are characterized by symmetric distributions, in which the mean, the mode and the median coincide;

- for unstable processes, the position of the distribution center is characterized by Mo or Me. For asymmetric distributions, the median is the preferred characteristic of the center, since it lies between the arithmetic mean and the mode.

TEST

On the topic: "Mode. Median. Methods of their calculation"


Introduction

Averages and the associated indicators of variation play a very important role in statistics, owing to the subject of its study. This topic is therefore one of the central ones in the course.

The average is a very common summary indicator in statistics, because only an average can characterize a population with respect to a quantitatively varying characteristic. In statistics, an average is a generalizing characteristic of a set of phenomena of the same type with respect to some quantitatively varying attribute. The average shows the level of this attribute per unit of the population.

In studying social phenomena and seeking to identify their characteristic, typical features under specific conditions of place and time, statisticians make wide use of averages. With the help of averages, different populations can be compared with each other with respect to varying characteristics.

The averages used in statistics belong to the class of power means. Of the power means, the arithmetic mean is used most often and the harmonic mean less often; the geometric mean is used only when calculating average rates of change over time, and the quadratic mean (root mean square) only when calculating indicators of variation.

The arithmetic mean is the quotient obtained by dividing the sum of the variants by their number. It is used when the total volume of a varying attribute for the whole population is formed as the sum of its values for the individual units. The arithmetic mean is the most common type of average, since it corresponds to the nature of social phenomena, where the total volume of a varying attribute is most often formed precisely as the sum of its values in the individual units of the population.

By its defining property, the harmonic mean should be used when the total volume of the attribute is formed as the sum of the reciprocals of the variants. It is used when, given the available data, the weights have to be divided by the variants rather than multiplied by them or, equivalently, multiplied by their reciprocals. In these cases the harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the attribute values.

The harmonic mean should be used when the weights are not the units of the population (the carriers of the attribute) but the products of these units and the attribute values.


1. Definition of the mode and median in statistics

The arithmetic and harmonic means are generalizing characteristics of a population with respect to a varying attribute. The mode and the median are auxiliary descriptive characteristics of the distribution of a varying attribute.

In statistics, the mode is the value of a feature (variant) that occurs most often in a given population. In a variation series, this is the variant with the highest frequency.

In statistics, the median is the variant in the middle of the variation series. The median divides the series in half: on either side of it (above and below) lies the same number of population units.

The mode and the median, unlike power means, are concrete characteristics: each takes the value of a particular variant in the variation series.

The mode is used when it is necessary to characterize the most common value of a feature. If one needs to find out, for example, the most common wage at an enterprise, the market price at which the largest quantity of goods was sold, or the shoe size in greatest demand among consumers, one resorts to the mode.

The median is of interest because it shows the quantitative boundary of the varying attribute that has been reached by half of the members of the population. Suppose the average salary of bank employees is 650,000 rubles per month. This characteristic can be supplemented by stating that half of the employees received a salary of 700,000 rubles or more, i.e. by giving the median. The mode and the median are typical characteristics when the populations are homogeneous and large.


2. Finding the Mode and Median in a Discrete Variation Series

It is not difficult to find the mode and median in a variation series where the values of the feature are given by specific numbers. Consider Table 1, which shows the distribution of families by number of children.

Table 1. Distribution of families by number of children

Obviously, in this example the mode is a family with two children, since this variant corresponds to the greatest number of families. There may be distributions in which all variants occur equally often; in that case there is no mode or, put differently, all variants are equally modal. In other cases, not one but two variants may have the highest frequency. Then there are two modes and the distribution is bimodal. Bimodal distributions can indicate qualitative heterogeneity of the population with respect to the studied attribute.

To find the median in a discrete variation series, divide the sum of frequencies in half and add ½ to the result. Thus, in the distribution of 185 families by number of children, the position of the median is 185/2 + ½ = 93, i.e. the 93rd variant bisects the ordered series. What is the value of the 93rd variant? To find out, accumulate the frequencies, starting from the smallest variants. The sum of the frequencies of the 1st and 2nd variants is 40, so the 93rd variant is not among them. Adding the frequency of the 3rd variant to 40 gives 40 + 75 = 115. Therefore, the 93rd variant corresponds to the third value of the varying attribute, and the median is a family with two children.

The mode and the median coincide in this example. If the sum of frequencies were even (for example, 184), then, applying the above formula, the position of the median would be 184/2 + ½ = 92.5. Since there are no variants with fractional positions, the result indicates that the median lies halfway between the 92nd and 93rd variants.

3. Calculation of the mode and median in the interval variation series

The descriptive nature of the mode and the median stems from the fact that they do not cancel out individual deviations: they always correspond to a specific variant. Therefore, the mode and the median require no calculation if all the values of the feature are known. In an interval variation series, however, calculation is needed to find the approximate value of the mode and the median within a particular interval.

To calculate the modal value of a feature enclosed in an interval, use the formula:

$$Mo = X_{Mo} + i_{Mo} \cdot \frac{f_{Mo} - f_{Mo-1}}{(f_{Mo} - f_{Mo-1}) + (f_{Mo} - f_{Mo+1})},$$

where $X_{Mo}$ is the lower boundary of the modal interval;

$i_{Mo}$ is the width of the modal interval;

$f_{Mo}$ is the frequency of the modal interval;

$f_{Mo-1}$ is the frequency of the interval preceding the modal interval;

$f_{Mo+1}$ is the frequency of the interval following the modal interval.

Let us show the calculation of the mode using the example given in Table 2.


Table 2. Distribution of workers of the enterprise according to the fulfillment of production standards

To find the mode, we first determine the modal interval of this series. The example shows that the highest frequency corresponds to the interval in which the variant lies between 100 and 105. This is the modal interval. Its width is 5.

Substituting the numerical values from Table 2 into the above formula, we get:

Mo = 100 + 5 * (104 - 12) / ((104 - 12) + (104 - 98)) ≈ 104.7

The meaning of this formula is as follows: the part of the modal interval that must be added to its lower boundary is determined by the frequencies of the preceding and following intervals. In this case, we add about 4.7 to 100, i.e. more than half of the interval, because the frequency of the preceding interval is less than the frequency of the following interval.
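The substitution can be checked with a short Python sketch. The function and parameter names are assumptions for illustration; the numbers are the ones quoted in the text (lower boundary 100, width 5, frequencies 104, 12 and 98). Whatever the frequencies, the result must fall inside the modal interval, here between 100 and 105.

```python
def interval_mode(lower, width, f_mode, f_prev, f_next):
    """Mode of an interval series from the modal interval and its neighbours."""
    d_prev = f_mode - f_prev  # excess over the preceding interval
    d_next = f_mode - f_next  # excess over the following interval
    return lower + width * d_prev / (d_prev + d_next)


mo = interval_mode(lower=100, width=5, f_mode=104, f_prev=12, f_next=98)
print(round(mo, 1))  # 104.7
```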

Let us now calculate the median. To find the median in an interval variation series, we first determine the interval in which it lies (the median interval). This is the first interval whose cumulative frequency is equal to or greater than half the sum of the frequencies. The cumulative frequencies are formed by successively summing the frequencies, starting from the interval with the smallest value of the feature. In our case, half of the sum of frequencies is 250 (500 / 2). Consequently, according to Table 3, the median interval is the interval with wages from 350,000 rubles to 400,000 rubles.

Table 3. Calculation of the median in the interval variation series

Before this interval, the cumulative frequency was 160. Therefore, to reach the median, another 90 units (250 - 160) must be added.

The median is the feature value that divides the ranked distribution series into two equal parts: one with feature values less than the median and one with feature values greater than the median. To find the median, you need to find the value of the feature that lies in the middle of the ordered series.


In a ranked series of ungrouped data, finding the median reduces to finding the ordinal position of the median. For an interval series, the median can be calculated using the following formula:

$$Me = X_{m} + i_{m} \cdot \frac{\frac{\sum f}{2} - S_{me}}{f_{me}}$$

where $X_{m}$ is the lower boundary of the median interval;
$i_{m}$ is the width of the median interval;
$\sum f$ is the total number of observations;
$S_{me}$ is the number of observations accumulated before the beginning of the median interval;
$f_{me}$ is the number of observations in the median interval.

Median properties

  1. The median does not depend on the values of the characteristic located on either side of it.
  2. Analytical operations with the median are very limited; in particular, when two distributions with known medians are combined, the median of the new distribution cannot be predicted in advance.
  3. The median has the minimality property: the sum of the absolute deviations of the values of x from the median is no greater than the sum of their absolute deviations from any other value.
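The minimality property can be illustrated numerically. The sketch below reuses the five-worker series 2, 4, 7, 9, 10 and compares the sum of absolute deviations from the median with the sums for a grid of other candidate values; the helper name `total_abs_deviation` and the grid are assumptions made for illustration, and a numerical check is not a proof.

```python
from statistics import median


def total_abs_deviation(values, c):
    """Sum of absolute deviations of the values from the point c."""
    return sum(abs(x - c) for x in values)


values = [2, 4, 7, 9, 10]
me = median(values)  # 7

print(total_abs_deviation(values, me))  # 13

# Candidate centres from 0 to 12 in steps of 0.5: none does better.
candidates = [k / 2 for k in range(0, 25)]
best = min(total_abs_deviation(values, c) for c in candidates)
print(best == total_abs_deviation(values, me))  # True
```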

Graphical determination of the median

To determine the median graphically, use the cumulative frequencies, from which the cumulative curve is plotted. The tops of the ordinates corresponding to the cumulative frequencies are connected by straight line segments. Divide in half the last ordinate, which corresponds to the total sum of frequencies, and from its midpoint draw a perpendicular to it until it intersects the cumulative curve; the abscissa of the intersection point is the desired median value.

Definition of the mode in statistics

The mode is the value of a feature that has the highest frequency in a statistical distribution series.

The mode is determined in different ways, depending on whether the varying feature is presented as a discrete or an interval series.

In a discrete series, the mode is found by simply scanning the frequency column: find the largest number, which is the highest frequency. The value of the attribute corresponding to it is the mode. In an interval variation series, the mode is taken approximately to be the central variant of the interval with the highest frequency. In such a distribution series, the mode is calculated by the formula:

$$Mo = X_{Mo} + i_{Mo} \cdot \frac{f_{Mo} - f_{Mo-1}}{(f_{Mo} - f_{Mo-1}) + (f_{Mo} - f_{Mo+1})}$$

where $X_{Mo}$ is the lower boundary of the modal interval;
$i_{Mo}$ is the width of the modal interval;
$f_{Mo}$, $f_{Mo-1}$, $f_{Mo+1}$ are the frequencies of the modal interval and of the intervals preceding and following it.

The modal interval is determined by the highest frequency.

The mode is widely used in statistical practice in the analysis of consumer demand, price recording, etc.

Relationship between the arithmetic mean, median and mode

For a unimodal symmetric distribution series, the arithmetic mean, the median and the mode coincide. For skewed distributions they differ.

K. Pearson, on the basis of fitting various types of curves, determined that for moderately asymmetric distributions the following approximate relation between the arithmetic mean, median and mode holds:

$$\bar{x} - Mo \approx 3(\bar{x} - Me)$$

In 1906, the great scientist and renowned eugenicist Francis Galton visited an annual livestock and poultry show in the west of England, where he inadvertently conducted an interesting experiment.

As James Surowiecki, author of The Wisdom of Crowds, notes, Galton became interested in a competition at the fair in which people had to guess the weight of a slaughtered bull. Whoever named the number closest to the true one was declared the winner.

Galton was known for his disdain for the intellectual abilities of ordinary people. He believed that only real experts would be able to make accurate statements about the weight of a bull. And 787 competitors were not experts.

The scientist was going to prove the incompetence of the crowd by calculating the average of the participants' responses. Imagine his surprise when it turned out that the result he received almost exactly corresponded to the real weight of the bull!

The average - a late invention

Of course, the researcher was struck by the accuracy of the answer. But even more remarkable is the fact that Galton thought to use the average at all.

In today's world, averages and medians are found at every turn: the average temperature in New York in April is 52 degrees Fahrenheit; Stephen Curry averages 30 points per game; the median household income in the United States is $51,939 per year.

However, the idea that many different outcomes can be represented by a single number is fairly new. Until the 17th century, averages were not used at all.

How did the concept of mean and median values ​​come about and develop? And how did it manage to become the main measuring technique in our time?

The predominance of averages over medians has had far-reaching implications for our understanding of information. And often it led people astray.

Mean and median values

Imagine that you are telling a story about four people who had dinner with you in a restaurant last night. One of them looked about 20, another 30, the third 40 and the fourth 50. What can you say about their age in your story?

Most likely, you would give their average age.

The average is often used to convey information about something, as well as to describe a set of measurements. Technically, the average is what mathematicians call the "arithmetic mean": the sum of all measurements divided by the number of measurements.

Although the word "average" is often treated as a synonym of "mean", the word "mean" more generally denotes the middle of something. It comes from the Latin "medianus", which means "middle."

The mean in Ancient Greece

The history of the mean dates back to the teachings of the ancient Greek mathematician Pythagoras. For Pythagoras and his school, the mean had a clear definition that was very different from how we understand it today. It was used only in mathematics, not in data analysis.

In the Pythagorean school, the mean was the middle term of a three-term sequence of numbers, standing in an "equal" relation to its neighbors. An "equal" relation could mean equal distance, as with the number 4 in the sequence 2, 4, 6. However, it could also express a geometric progression, as with 10 in the sequence 1, 10, 100.

Statistician Churchill Eisenhart explains that in ancient Greece the mean was not used to represent or replace any set of numbers. It simply indicated the middle and was often used in mathematical proofs.

Eisenhart spent a decade studying the mean and the median. Initially, he looked for the mean used as a representative value in early scientific work. Instead, he found that most of the early physicists and astronomers relied on single, skillfully made measurements and lacked a methodology for selecting the best result among multiple observations.

Modern researchers base their findings on the collection of large amounts of data, such as biologists who study the human genome. Ancient scientists, however, could take several measurements, but chose only the best to build their theories.

As the historian of astronomy Otto Neugebauer wrote, "this is consistent with the conscious desire of ancient people to minimize the amount of empirical data in science, because they did not believe in the accuracy of direct observations."

For example, the Greek mathematician and astronomer Ptolemy calculated the angular diameter of the Moon using observation and the theory of the Earth's motion. His result was 31′20″. Today we know that the angular diameter of the Moon ranges from 29′20″ to 34′6″, depending on its distance from the Earth. Ptolemy used little data in his calculations, but he had every reason to believe that they were accurate.

Eisenhart writes: “It must be borne in mind that the relationship between observation and theory was different in antiquity than it is today. The results of observations were understood not as facts to which the theory should be adjusted, but as specific cases that can be useful only as illustrative examples of the truth of the theory.”

Eventually, scientists would turn to representative measures of their data, but initially neither means nor medians were used in this role. Instead, another mathematical concept long served as such a representative measure: the half-sum of the extreme values.

Half-sum of extreme values

New scientific tools almost always arise from the need to solve a specific problem in a discipline. The need to find the best value among multiple measurements arose from the need to pinpoint geographic location.

The 11th-century intellectual giant Al-Biruni is known as one of the first people to use a method for choosing a representative value. Al-Biruni wrote that when he had many measurements at his disposal and wanted to find the best among them, he used the following "rule": find the number midway between the two extreme values. In calculating the half-sum of the extreme values, the numbers between the maximum and the minimum are ignored; the average is taken of these two numbers only.

Al-Biruni applied this method in various fields, including calculating the longitude of the city of Ghazni, which is located on the territory of modern Afghanistan, as well as in his studies of the properties of metals.

However, in the last few centuries, the half-sum of extreme values ​​has been used less and less. In fact, in modern science it is not relevant at all. The half-sum was replaced by the median value.

Moving to averages

By the early 19th century, using the arithmetic mean had become a common method of finding the most representative value in a group of data. Carl Friedrich Gauss, an outstanding mathematician of his time, wrote in 1809: “It was believed that if a certain number was determined by several direct observations made under the same conditions, then the arithmetic mean is the most true value. If it is not entirely strict, then at least it is close to reality, and therefore you can always rely on it.”

Why did this shift in methodology take place?

This question is rather difficult to answer. In his research, Churchill Eisenhart suggests that the method of taking the arithmetic mean may have originated in the measurement of magnetic declination, that is, the difference between the direction of the compass needle pointing north and true north. This measurement was extremely important in the Age of Discovery.

Eisenhart found that until the end of the 16th century, most scientists who measured magnetic declination chose the most accurate measurement ad hoc (from the Latin for "to this", i.e. for this particular case).

But in 1580 the scientist William Borough approached the problem differently. He took eight different measurements of the declination and, comparing them, concluded that the most exact value lay between 11 ⅓ and 11 ¼ degrees. He probably calculated the arithmetic mean, which fell in this range. However, Borough himself did not openly call his approach a new method.

Until 1635, there were no unambiguous cases of the mean being used as a representative number. In that year, however, the English astronomer Henry Gellibrand took two different measurements of the magnetic declination, one in the morning (11 degrees) and the other in the afternoon (11 degrees 32 minutes). Calculating the most true value, he wrote:

"If we find the arithmetic mean, we can most likely say that the result of an accurate measurement should be about 11 degrees 16 minutes."

It is very likely that this was the first time the mean was used as the closest approximation to the true value!

The word "average" came into use in the English language at the beginning of the 16th century to denote financial losses from damage sustained by a ship or its cargo during a voyage. Over the next hundred years, it denoted precisely these losses, which were calculated as an arithmetic average. For example, if a ship was damaged during a voyage and the crew had to throw some goods overboard to lighten the ship, the investors bore financial losses in proportion to their investment, and these losses were calculated in the same way as the arithmetic mean. So the meanings of "average" and "arithmetic mean" gradually converged.

Median value

Today, the mean, or arithmetic mean, is the primary method for choosing a representative value for multiple measurements. How did this happen? Why was this role not assigned to the median?

Francis Galton was a champion of the median

The term "median", the middle term in a series of numbers that divides the series in half, appeared at about the same time as the arithmetic mean. In 1599, the mathematician Edward Wright, who was working on the problem of compass variation, first proposed the use of the median value.

“... Let's say a lot of archers shoot at some target. The target is subsequently removed. How do you know where the target was? You need to find the middle place between all the arrows. Likewise, among the multitude of observations, the one in the middle will be the closest to the truth. "

The median value was widely used in the nineteenth century, becoming a mandatory part of any data analysis at the time. It was also used by Francis Galton, an eminent analyst of the nineteenth century. In the story of weighing a bull at the beginning of this article, Galton originally used the median value as representing the opinion of the crowd.

Many analysts, including Galton, preferred the median because it is easier to calculate for small datasets.

However, the median has never been more popular than the mean. Most likely, this was due to the special statistical properties inherent in the mean, as well as its relationship to the normal distribution.

Relationship between mean and normal distribution

When we take multiple measurements, the results are often, as statisticians say, "normally distributed." This means that if the data are plotted on a graph, the points form something that looks like a bell; connecting them gives a "bell-shaped" curve. Many quantities follow a normal distribution, for example people's height, IQ scores, and the highest annual temperature.

When the data is normally distributed, the mean will be very close to the highest point on the bell curve, and a very large number of measurements will be close to the mean. There is even a formula that predicts how many measurements will be at some distance from the mean.
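The formula referred to here is presumably the normal distribution function. As a sketch, the share of measurements expected within one, two and three standard deviations of the mean can be computed with `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `share_within` is an assumption for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def share_within(k):
    """Share of a normal population lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
# within 1 standard deviation(s): 68.3%
# within 2 standard deviation(s): 95.4%
# within 3 standard deviation(s): 99.7%
```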

Thus, calculating the mean gives researchers a lot of additional information.

The relationship between the mean and the standard deviation gives the mean a great advantage, because the median has no such relationship. This connection is an important part of the analysis of experimental data and of statistical processing in general. That is why the mean has become the core of statistics and of all sciences that draw their conclusions from many data points.

The mean is also advantageous because it is easy for computers to calculate. While the median of a small dataset is fairly easy to find by hand, it is much easier to write a program that finds the mean: a running sum and a count are enough, whereas the median requires the data to be put in order first. Microsoft Excel hides this difference behind its built-in AVERAGE and MEDIAN functions, but the median remains the costlier of the two to compute.

As a result, due to its great scientific value and ease of use, the mean has become the main representative value. However, this is not always the best option.

Benefits of the median value

In many cases, when we want to find the central value of a distribution, the median is the better indicator. This is because the mean is strongly influenced by extreme measurements.

Many analysts believe that the thoughtless use of the mean damages our understanding of quantitative information. People look at the average and think that it is the "norm." But in fact it can be determined by a single outlier in an otherwise homogeneous series.

Imagine an analyst who wants a representative value for the price of five houses. Four houses are worth $100,000 each and the fifth is worth $900,000. The mean is therefore $260,000 (the total of $1,300,000 divided by 5), while the median is $100,000. In this case, as in many others, the median gives a better idea of what might be called the "standard".
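The arithmetic for four houses at $100,000 and one at $900,000 can be checked with Python's standard `statistics` module. This is only a quick sketch; the variable name `house_prices` is chosen for the illustration.

```python
from statistics import mean, median

# Four houses at $100,000 and one at $900,000
house_prices = [100_000, 100_000, 100_000, 100_000, 900_000]

print(mean(house_prices))    # 260000: pulled upward by the single expensive house
print(median(house_prices))  # 100000: the middle value of the sorted list
```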

Because extremes can affect the average so strongly, the median is used to report changes in US household income.

The median is also less sensitive to the "dirty" data that analysts are dealing with today. Many statisticians and analysts collect information by polling people on the Internet. If the user accidentally adds an extra zero to the answer, which turns 100 into 1000, then this error will have a much stronger effect on the mean than on the median.

Average or median?

The choice between median and mean has far-reaching implications, from our understanding of the effects of drugs on health to knowing what a typical household budget is.

As data collection and analysis increasingly determine how we understand the world, the choice of summary values matters more and more. In an ideal world, analysts would report both the mean and the median when describing data.

But we live with limited time and attention. Because of these limitations, we often have to choose just one. And in many cases the median is the preferable choice.

4. Mode. Median. General and sample mean

Fashion on the screen, a median in a triangle, and the average temperature across the hospital and in the ward. We continue our practical course of entertaining statistics (Lesson 1) by studying the central characteristics of a statistical population, whose names you see in the title: the mode (the same word as "fashion" in the original pun), the median and the mean. And we will start from the end, because averages have come up almost from the very first paragraphs of the topic. For prepared readers, here is the table of contents:

  • General and sample mean - calculation from primary data and from a ready-made discrete variation series;
  • Mode - definition and how to find it in the discrete case;
  • Median - general definition and how to find the median;
  • Mean, mode and median of an interval variation series - calculation from primary data and from a ready-made series, with the mode and median formulas;
  • Quartiles, deciles, percentiles - briefly about the main points.

Well, for "dummies" it is better to familiarize yourself with the material in order:

So, suppose we study some general population of size $N$, namely one of its numerical characteristics $X$, no matter whether discrete or continuous (Lessons 2, 3).

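The general mean is the arithmetic mean of all values of this population (here $N$ is the size of the population):

$$\bar{x}_G = \frac{x_1 + x_2 + \dots + x_N}{N}$$

If some of the numbers are equal (which is typical for a discrete series), the formula can be written more compactly:

$$\bar{x}_G = \frac{x_1 f_1 + x_2 f_2 + \dots + x_k f_k}{N},$$ where
the variant $x_1$ occurs $f_1$ times;
the variant $x_2$ occurs $f_2$ times;
...
the variant $x_k$ occurs $f_k$ times.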

A live example of calculating the general mean appeared in Example 2, but so as not to be boring, I will not even recall its contents.

Further. As we remember, processing the entire general population is often difficult or impossible, so a representative sample of size $n$ is drawn, and a conclusion about the entire population is made from the study of this sample.

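The sample mean is the arithmetic mean of all sample values (here $n$ is the sample size):

$$\bar{x}_s = \frac{x_1 + x_2 + \dots + x_n}{n}$$

and if some variants coincide, the formula is written more compactly:

$$\bar{x}_s = \frac{\sum x_i f_i}{n}$$

- as the sum of the products of the variants $x_i$ and their frequencies $f_i$, divided by the sample size.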

The sample mean allows a fairly accurate estimate of the true value, which is quite sufficient for many studies. Moreover, the larger the sample, the more accurate this estimate will be.

Let's start the practice, or rather continue it, with a discrete variation series and a familiar problem statement:

Example 8

A sample study of the shop workers established their qualification categories: 4, 5, 6, 4, 4, 2, 3, 5, 4, 4, 5, 2, 3, 3, 4, 5, 5, 2, 3, 6, 5, 4, 6, 4, 3.

How do we solve the problem? If we are given primary data (the original raw values), they can simply be summed and divided by the sample size:

$$\bar{x}_s = \frac{4 + 5 + 6 + \dots + 3}{25} = \frac{101}{25} = 4.04$$

- the average qualification category of the shop workers.

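But in many problems it is required to compose a variation series (see Example 4):

Category, x_i            2   3   4   5   6
Number of workers, f_i   3   5   8   6   3

- or this series is given from the start (which happens more often). And then, of course, we use the "civilized" formula:

$$\bar{x}_s = \frac{\sum x_i f_i}{n} = \frac{2 \cdot 3 + 3 \cdot 5 + 4 \cdot 8 + 5 \cdot 6 + 6 \cdot 3}{25} = \frac{101}{25} = 4.04$$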

Mode. The mode of a discrete variation series is the variant with the maximum frequency. In this case $M_o = 4$ (category 4 occurs 8 times). The mode is easy to find from the table, and even easier on the frequency polygon: it is the abscissa of the highest point:


Sometimes there are several such values ​​(with the same maximum frequency), and then each of them is considered a mode.

If all or nearly all variants are different (which is typical for an interval series), the modal value is defined in a slightly different way, which is covered in the 2nd part of the lesson.

Median. The median of a variation series* is a value that divides it into two equal parts (by the number of variants).
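The three central characteristics of Example 8 can be checked with a few lines of Python. This is only a sketch built on the standard library; the variable names are chosen for the illustration.

```python
from collections import Counter
from statistics import median, multimode

# Qualification categories from Example 8 (25 workers)
categories = [4, 5, 6, 4, 4, 2, 3, 5, 4, 4, 5, 2, 3,
              3, 4, 5, 5, 2, 3, 6, 5, 4, 6, 4, 3]

n = len(categories)

# Mean from primary data: sum everything and divide by the sample size
raw_mean = sum(categories) / n

# Discrete variation series: variant -> frequency
series = dict(sorted(Counter(categories).items()))

# The same mean from the variation series (sum of variant * frequency)
weighted_mean = sum(x * f for x, f in series.items()) / n

modes = multimode(categories)  # every variant with the maximum frequency
med = median(categories)       # middle value of the sorted sample

print(series)                   # {2: 3, 3: 5, 4: 8, 5: 6, 6: 3}
print(raw_mean, weighted_mean)  # both are 101 / 25
print(modes, med)               # [4] 4
```

`multimode` returns a list because, as noted above, several variants may share the maximum frequency.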

But now we need to find the mean, the mode and the median.

Solution: to find the mean from primary data, it is best to sum all the variants and divide the result by the size of the population:
$\bar{x} = \frac{\sum x_i}{n}$, in monetary units

These calculations, by the way, will not take much time even on an ordinary pocket calculator. But if Excel is available, then, of course, we type =SUM( into any free cell, select all the numbers with the mouse, close the parenthesis ), add the division sign /, enter the number 30 and press Enter. Done.

As for the mode, estimating it from the initial data is of little use. Although we can see repeated values among the numbers, there can easily be five, six or seven variants with the same maximum frequency, for example a frequency of 2. In addition, prices may be rounded. Therefore the modal value is calculated from the constructed interval series (more on that later).

The same cannot be said about the median: we type =MEDIAN( into Excel, select all the numbers with the mouse, close the parenthesis ) and press Enter. Moreover, here you do not even need to sort anything.

But in Example 6 we sorted the data in ascending order (recall how to sort via the link above), and this is a good opportunity to repeat the formal algorithm for finding the median. Divide the sample size in half: $\frac{n}{2} = \frac{30}{2} = 15$.

And since the sample consists of an even number of variants, the median is equal to the arithmetic mean of the 15th and 16th variants of the ordered(!) variation series:

$M_e = \frac{x_{15} + x_{16}}{2}$, in monetary units
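The formal algorithm (sort, halve the sample size, take the middle variant or the mean of the two middle variants) can be sketched in Python as follows. The function name `median_formal` and the sample values are made up for this illustration and are not the prices from the example.

```python
def median_formal(values):
    """Median by the formal algorithm: sort, then pick the middle variant(s)."""
    ordered = sorted(values)  # the variation series must be ordered
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    half = n // 2
    if n % 2 == 0:
        # Even n: mean of the variants at 1-based positions n/2 and n/2 + 1
        return (ordered[half - 1] + ordered[half]) / 2
    # Odd n: the single middle variant
    return ordered[half]

sample = [7, 3, 9, 5, 1, 8]   # hypothetical data, n = 6
print(median_formal(sample))  # 6.0, the mean of the 3rd and 4th sorted values (5 and 7)
```

For a sample of 30 values the same code averages the 15th and 16th variants, exactly as described above.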

Situation two: a ready-made interval series is given (a typical textbook task).

We continue with the same boots example, where an interval variation series (IVS) was built from the initial data. To calculate the mean, we need the midpoints $x_i$ of the intervals:

- so that we can use the familiar formula for the discrete case:

$$\bar{x} = \frac{\sum x_i f_i}{n}$$

- an excellent result! The discrepancy with the more accurate value calculated from the primary data is only 0.04.

In fact, here we approximated the interval series with a discrete one, and this approximation turned out to be very effective. However, there is no particular benefit here, since with modern software it is not difficult to calculate the exact value even for a very large array of primary data. But this is provided that they are known to us :)
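The approximation of an interval series by a discrete one can be sketched in Python. The intervals and frequencies below are hypothetical (the table of the boots example is not reproduced here), and `interval_mean` is a name chosen for the illustration.

```python
def interval_mean(intervals, frequencies):
    """Approximate mean of an interval series: each interval is replaced by its midpoint."""
    n = sum(frequencies)
    midpoints = [(low + high) / 2 for low, high in intervals]
    return sum(m * f for m, f in zip(midpoints, frequencies)) / n

# Hypothetical interval series with 30 observations
intervals = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
frequencies = [1, 6, 11, 5, 7]

print(round(interval_mean(intervals, frequencies), 2))  # 28.67
```

The result is only an estimate: the true mean depends on how the values are spread inside each interval.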

With other central indicators, everything is more interesting.

To find the mode, one has to find the modal interval (the one with the maximum frequency), which in this problem is the interval with a frequency of 11, and use the following ugly formula:

$$M_o = x_{Mo} + h_{Mo} \cdot \frac{f_{Mo} - f_{Mo-1}}{(f_{Mo} - f_{Mo-1}) + (f_{Mo} - f_{Mo+1})},$$ where:

$x_{Mo}$ - the lower border of the modal interval;
$h_{Mo}$ - the length of the modal interval;
$f_{Mo}$ - the frequency of the modal interval;
$f_{Mo-1}$ - the frequency of the previous interval;
$f_{Mo+1}$ - the frequency of the next interval.

Thus we obtain $M_o$ in monetary units; as you can see, the "fashionable" (modal) price of the shoes differs markedly from the arithmetic mean.
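The mode formula for an interval series can be sketched as a small Python function. The numbers in the usage example are hypothetical: only the modal frequency of 11 comes from the problem, while the border, the length and the neighbouring frequencies are assumed for illustration. The name `interval_mode` is likewise made up.

```python
def interval_mode(lower, width, f_modal, f_prev, f_next):
    """Mode of an interval series from the modal interval and its two neighbours."""
    d_prev = f_modal - f_prev  # excess over the previous interval
    d_next = f_modal - f_next  # excess over the next interval
    if d_prev + d_next == 0:
        raise ValueError("all three frequencies are equal, the formula does not apply")
    return lower + width * d_prev / (d_prev + d_next)

# Hypothetical modal interval [20; 30) with frequency 11 and neighbours 6 and 5
mode = interval_mode(lower=20, width=10, f_modal=11, f_prev=6, f_next=5)
print(round(mode, 2))  # 24.55, slightly left of the interval centre 25
```

With equal neighbouring frequencies the function returns the exact centre of the modal interval. A larger frequency on one side pulls the mode toward that side.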

Without going into the geometry of the formula, I will just show the histogram of relative frequencies and note:


from which it is clear that the mode is shifted from the centre of the modal interval toward the left neighbouring interval, which has the higher frequency. This is logical.

For reference, here are the rare cases:

- if the modal interval is the first or the last one, then either $f_{Mo-1} = 0$ or $f_{Mo+1} = 0$;

- if there are 2 modal intervals next to each other, for example $[a; b)$ and $[b; c)$, then we treat the merged interval $[a; c)$ as modal, and the adjacent intervals (on the left and on the right) are, if possible, also enlarged by a factor of 2;

- if the modal intervals are separated by some distance, then we apply the formula to each of them, thereby obtaining 2 or more modes.

So much for Depeche Mode :)

And the median. If a ready-made interval series is given, the median is calculated using a slightly less terrible formula, but first it is tedious (a Freudian slip :)), that is, necessary, to find the median interval: the interval containing the variant (or 2 variants) that divides the variation series into two equal parts.

Above, I described how to determine the median from the relative accumulated frequencies; here it is more convenient to calculate the "ordinary" accumulated frequencies. The computational algorithm is exactly the same: we copy the first value from the left column (red arrow), and each subsequent value is obtained as the sum of the previous accumulated value and the current frequency from the left column (green symbols as an example):

Does everyone understand the meaning of the numbers in the right column? It is the number of variants that have "accumulated" over all the intervals passed so far, including the current one.
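The accumulation of frequencies, and the search for the median interval based on it, can be sketched in Python. The frequencies are hypothetical (only the total of 30 and the modal frequency of 11 follow the problem).

```python
from bisect import bisect_left
from itertools import accumulate

# Hypothetical frequencies of five consecutive intervals, 30 observations in total
frequencies = [1, 6, 11, 5, 7]

# Each accumulated frequency = previous accumulated value + current frequency
cumulative = list(accumulate(frequencies))
print(cumulative)  # [1, 7, 18, 23, 30]

# The median interval is the first one whose accumulated frequency reaches n / 2
n = cumulative[-1]
median_index = bisect_left(cumulative, n / 2)
print(median_index)  # 2: the third interval holds the 8th to 18th variants
```

This sketch does not treat the boundary case where an accumulated frequency equals n / 2 exactly, in which the two middle variants fall into neighbouring intervals.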

Since we have an even number of variants (30), the median interval is the one that contains the 30/2 = 15th and the 16th variants. Looking at the accumulated frequencies, it is easy to identify the interval that contains them.

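Median formula:

$$M_e = x_{Me} + h_{Me} \cdot \frac{\frac{n}{2} - S_{Me-1}}{f_{Me}},$$ where:
$n$ - the size of the statistical population;
$x_{Me}$ - the lower border of the median interval;
$h_{Me}$ - the length of the median interval;
$f_{Me}$ - the frequency of the median interval;
$S_{Me-1}$ - the accumulated frequency of the previous interval.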
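The median formula can be sketched in Python in the same spirit. The input numbers are hypothetical and do not reproduce the boots data; `interval_median` is a name chosen for the illustration.

```python
def interval_median(lower, width, n, f_median, cum_prev):
    """Median of an interval series, interpolated inside the median interval."""
    if f_median <= 0:
        raise ValueError("the median interval must have a positive frequency")
    return lower + width * (n / 2 - cum_prev) / f_median

# Hypothetical median interval [20; 30): frequency 11, with 7 variants accumulated before it
me = interval_median(lower=20, width=10, n=30, f_median=11, cum_prev=7)
print(round(me, 2))  # 27.27
```

The fraction (n/2 - cum_prev) / f_median shows how far into the median interval one must go to reach the middle of the sample. With these hypothetical numbers the result lies to the right of the interval centre 25.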

Thus we obtain $M_e$ in monetary units. Note that the median, on the contrary, turned out to be shifted to the right, since there is a significant number of variants on the right-hand side:


And for reference, special cases.