Least squares method. Paired linear regression analysis

The least squares method finds the widest application in various fields of science and practice: physics, chemistry, biology, economics, sociology, psychology, and so on. By the will of fate, I often have to deal with economics, so today I will arrange for you a ticket to an amazing country called Econometrics =) …How can you not want that?! It's very nice there, you just have to make up your mind! …But what you almost certainly do want is to learn how to solve problems by the least squares method. And especially diligent readers will learn to solve them not only accurately, but also VERY FAST ;-) But first, the general statement of the problem plus a related example:

Suppose that in some subject area we study indicators $x$ and $y$ that have a quantitative expression. At the same time, there is every reason to believe that the indicator $y$ depends on the indicator $x$. This assumption can be a scientific hypothesis or rest on elementary common sense. Let's leave science aside, however, and explore more appetizing areas - namely, grocery stores. Denote by:

$x$ - the retail space of a grocery store, sq. m,
$y$ - the annual turnover of a grocery store, million rubles.

It is quite clear that the larger the store's area, the greater its turnover in most cases.

Suppose that after conducting observations / experiments / calculations / dancing with a tambourine, we have at our disposal numerical data:

With grocery stores, I think everything is clear: $x_1$ is the area of the 1st store, $y_1$ is its annual turnover, $x_2$ is the area of the 2nd store, $y_2$ is its annual turnover, and so on. By the way, it is not at all necessary to have access to classified materials - a fairly accurate estimate of the turnover can be obtained using mathematical statistics. However, let's not get distracted, the course in commercial espionage is a paid one =)

Tabular data can also be written as points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ and plotted in the familiar Cartesian coordinate system.

Let us answer an important question: how many points are needed for a quality study?

The more, the better. The minimum admissible set consists of 5-6 points. In addition, with a small amount of data, "abnormal" results should not be included in the sample. For example, a small elite store can earn orders of magnitude more than its "colleagues", thereby distorting the general pattern that we are trying to find!

Put simply, we need to choose a function $y = f(x)$ whose graph passes as close as possible to the points $(x_i, y_i)$. Such a function is called an approximating (approximation means "bringing close") or theoretical function. Generally speaking, an obvious "candidate" appears here at once: a polynomial of high degree whose graph passes through ALL the points. But this option is complicated and often simply incorrect (because the graph will "wiggle" all the time and poorly reflect the main trend).

Thus, the desired function must be sufficiently simple and at the same time reflect the dependence adequately. As you might guess, one of the methods for finding such functions is called the least squares method. First, let's analyze its essence in general form. Let some function $y = f(x)$ approximate the experimental data:

How do we evaluate the accuracy of this approximation? Let us calculate the differences (deviations) $e_i = y_i - f(x_i)$ between the experimental and functional values (study the drawing). The first thought that comes to mind is to estimate how big the sum $e_1 + e_2 + \ldots + e_n$ is, but the problem is that the differences can be negative (for example, when a point lies below the graph), and the deviations will cancel each other out in such a summation. Therefore, as an estimate of the accuracy of the approximation, it is natural to take the sum of the absolute values of the deviations:

$|e_1| + |e_2| + \ldots + |e_n|$,

or in compact form: $\sum_{i=1}^{n} |e_i|$ (in case anyone doesn't know: $\sum$ is the summation sign, and $i$ is an auxiliary "counter" variable that takes values from 1 to $n$).

Approximating the experimental points with various functions, we will obtain different values of this sum, and obviously, the function for which this sum is smaller is the more accurate one.

Such a method exists and is called the least absolute deviations method. However, in practice the least squares method has become much more widespread. In it, possible negative values are eliminated not by the absolute value but by squaring the deviations:

$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - f(x_i))^2$, after which efforts are directed at selecting a function for which the sum of the squared deviations is as small as possible. Hence, actually, the name of the method.
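The two accuracy measures described above can be sketched in a few lines of Python. The data and both candidate lines here are hypothetical, invented only to show how the sums are computed and compared.

```python
# Hypothetical data: store areas (sq. m) and annual turnovers (million rubles).
xs = [50, 80, 120, 150, 200, 250]
ys = [2.1, 3.0, 4.6, 5.4, 7.5, 9.0]


def sum_abs_dev(f, xs, ys):
    """Sum of |y_i - f(x_i)|: the least absolute deviations criterion."""
    return sum(abs(y - f(x)) for x, y in zip(xs, ys))


def sum_sq_dev(f, xs, ys):
    """Sum of (y_i - f(x_i))^2: the least squares criterion."""
    return sum((y - f(x)) ** 2 for x, y in zip(xs, ys))


# Two hypothetical candidate functions to compare.
def f1(x):
    return 0.036 * x + 0.2


def f2(x):
    return 0.030 * x + 1.0


for name, f in (("f1", f1), ("f2", f2)):
    print(name, sum_abs_dev(f, xs, ys), sum_sq_dev(f, xs, ys))
```

Whichever candidate gives the smaller sum of squares is the better one in the least squares sense. The two criteria do not always rank candidates the same way.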

And now we return to another important point: as noted above, the selected function should be quite simple, but there are many such functions too: linear, hyperbolic, exponential, logarithmic, quadratic, etc. And, of course, here one would immediately like to "narrow the field of activity". Which class of functions should be chosen for study? A primitive but effective technique:

- The easiest way is to plot the points on a drawing and analyze their location. If they tend to lie along a straight line, then you should look for the equation of a straight line $y = ax + b$ with optimal values of $a$ and $b$. In other words, the task is to find SUCH coefficients $a$ and $b$ that the sum of the squared deviations $\sum_{i=1}^{n} (y_i - (a x_i + b))^2$ is the smallest.

If the points are located, for example, along a hyperbola, then it is clear that a linear function will give a poor approximation. In this case, we look for the most "favorable" coefficients of the hyperbola equation $y = \frac{a}{x} + b$, those that give the minimum sum of squares $\sum_{i=1}^{n} \left(y_i - \left(\frac{a}{x_i} + b\right)\right)^2$.

Now notice that in both cases we are talking about functions of two variables whose arguments are the parameters of the sought dependence:

$F(a, b) = \sum_{i=1}^{n} (y_i - (a x_i + b))^2$ and $F(a, b) = \sum_{i=1}^{n} \left(y_i - \left(\frac{a}{x_i} + b\right)\right)^2$.

And in essence, we need to solve a standard problem: to find the minimum of a function of two variables.

Recall our example: suppose that the "shop" points tend to lie along a straight line, and there is every reason to believe that the turnover depends linearly on the retail space. Let's find SUCH coefficients $a$ and $b$ that the sum of squared deviations $F(a, b) = \sum_{i=1}^{n} (a x_i + b - y_i)^2$ is the smallest. Everything as usual: first the partial derivatives of the 1st order. By the linearity rule, you can differentiate right under the summation sign:

$F'_a = \sum_{i=1}^{n} 2 (a x_i + b - y_i) \cdot x_i = 2 \sum_{i=1}^{n} (a x_i + b - y_i) x_i$.

If you want to use this information for an essay or a term paper, I will be very grateful for a link in the list of sources; you will not find such detailed calculations in many places:

$F'_b = \sum_{i=1}^{n} 2 (a x_i + b - y_i) \cdot 1 = 2 \sum_{i=1}^{n} (a x_i + b - y_i)$.

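Let's compose the standard system:

$2 \sum_{i=1}^{n} (a x_i + b - y_i) x_i = 0$,
$2 \sum_{i=1}^{n} (a x_i + b - y_i) = 0$.

We divide each equation by 2 and, in addition, "break apart" the sums:

$a \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} x_i y_i = 0$,
$a \sum_{i=1}^{n} x_i + \sum_{i=1}^{n} b - \sum_{i=1}^{n} y_i = 0$.

Note: work out for yourself why $a$ and $b$ can be taken outside the summation sign. By the way, formally this can also be done with the sum $\sum_{i=1}^{n} b = b \sum_{i=1}^{n} 1 = bn$.

Let's rewrite the system in an "applied" form:

$a \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} x_i y_i$,
$a \sum_{i=1}^{n} x_i + bn = \sum_{i=1}^{n} y_i$,

after which the algorithm for solving our problem begins to take shape: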

Do we know the coordinates of the points $(x_i, y_i)$? We do. Can we find the sums $\sum x_i$, $\sum y_i$, $\sum x_i^2$, $\sum x_i y_i$? Easily. We set up the simplest system of two linear equations in two unknowns ($a$ and $b$). We solve the system, for example, by Cramer's method, and obtain a stationary point $(a^*, b^*)$. Checking the sufficient condition for an extremum, one can verify that at this point the function $F(a, b)$ attains precisely a minimum. The verification involves additional calculations, so we will leave it behind the scenes. We draw the final conclusion:
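The algorithm just outlined (compute four sums, then solve a 2×2 linear system by Cramer's method) fits into a short Python sketch. The function name `fit_line_cramer` and the sample data are illustrative assumptions, not part of the original text.

```python
def fit_line_cramer(xs, ys):
    """Return (a, b) of the least squares line y = a*x + b.

    Solves the system
        a*sum(x^2) + b*sum(x) = sum(x*y)
        a*sum(x)   + b*n      = sum(y)
    by Cramer's method.
    """
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))

    det = sxx * n - sx * sx  # main determinant of the system
    if det == 0:
        raise ValueError("all x values coincide: the line is not determined")
    det_a = sxy * n - sx * sy    # first column replaced by the right-hand sides
    det_b = sxx * sy - sx * sxy  # second column replaced by the right-hand sides
    return det_a / det, det_b / det


# Hypothetical points lying exactly on y = 2x + 1.
a, b = fit_line_cramer([0, 1, 2], [1, 3, 5])
print(a, b)
```

For points that lie exactly on a line, the sketch recovers that line. For scattered points, it returns the coefficients that minimize the sum of squared deviations.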

The function $y = a^* x + b^*$ approximates the experimental points in the best way (at least compared to any other linear function). Roughly speaking, its graph passes as close as possible to these points. In the tradition of econometrics, the resulting approximating function is also called the paired linear regression equation.

The problem under consideration is of great practical importance. In the situation of our example, the equation allows you to predict what turnover ($y$) a store will have at one or another value of the retail space (one or another value of $x$). Yes, the resulting forecast will be only a forecast, but in many cases it will turn out to be quite accurate.

I will analyze just one problem with "real" numbers, since there are no difficulties in it: all calculations are at the level of the 7th-8th grade school curriculum. In 95 percent of cases you will be asked to find just a linear function, but at the very end of the article I will show that it is no more difficult to find the equations of the optimal hyperbola, exponential, and some other functions.

In fact, it remains to hand out the promised goodies, so that you learn to solve such examples not only accurately, but also quickly. We carefully study a typical example:

Task

As a result of studying the relationship between two indicators, the following pairs of numbers were obtained:

Using the least squares method, find the linear function that best approximates the empirical (experimental) data. Make a drawing on which, in a Cartesian rectangular coordinate system, the experimental points and the graph of the approximating function are plotted. Find the sum of squared deviations between the empirical and theoretical values. Find out whether the proposed exponential function approximates the experimental points better (in terms of the least squares method).

Note that the "x" values are natural numbers, and this has a characteristic meaning, which I will talk about a little later; but they can, of course, be fractional. In addition, depending on the content of a particular task, both the "x" and the "y" values can be fully or partially negative. Well, we have been given a "faceless" task, and we begin its solution:

We find the coefficients of the optimal function $y = ax + b$ as a solution to the system:

$a \sum x_i^2 + b \sum x_i = \sum x_i y_i$,
$a \sum x_i + bn = \sum y_i$.

For a more compact notation, the "counter" variable can be omitted, since it is already clear that the summation runs from 1 to $n$.

It is more convenient to calculate the required amounts in a tabular form:

Calculations can be carried out on a pocket calculator, but it is much better to use Excel: it is both faster and less error-prone.

Thus, we get the following system:

Here you can multiply the second equation by 3 and subtract the 2nd equation from the 1st term by term. But this is luck: in practice, systems are often less generous, and in such cases Cramer's method comes to the rescue:
the main determinant of the system is nonzero, so the system has a unique solution.

Let's do a check. I understand that you don't feel like it, but why let mistakes through where they can be ruled out completely? Substitute the found solution into the left side of each equation of the system:

The right-hand sides of the corresponding equations are obtained, which means that the system is solved correctly.

Thus, the desired approximating function is found: of all linear functions, it is the one that approximates the experimental data best.

Unlike the direct dependence of the store's turnover on its area, the dependence found is inverse (the principle "the more, the less"), and this fact is immediately revealed by the negative slope. The function tells us that when a certain indicator increases by 1 unit, the value of the dependent indicator decreases on average by 0.65 units. As they say, the higher the price of buckwheat, the less of it is sold.

To plot the approximating function, we find two of its values:

and execute the drawing:

The constructed line is called a trend line (namely, a linear trend line, since in general a trend is not necessarily a straight line). Everyone is familiar with the expression "to be in trend", and I think that this term needs no additional comment.

Calculate the sum of squared deviations between empirical and theoretical values. Geometrically, this is the sum of the squares of the lengths of the "crimson" segments (two of which are so small you can't even see them).

Let's summarize the calculations in a table:

They can again be carried out manually; just in case, I will give an example for the 1st point:

but it is much more efficient to do it in the already familiar way:

Let's repeat: what is the meaning of the result? Of all linear functions, the function we found has the smallest sum of squared deviations, that is, it is the best approximation in its family. And here, by the way, the final question of the problem is not accidental: what if the proposed exponential function approximates the experimental points better?

Let's find the corresponding sum of squared deviations. To distinguish them, I will denote these deviations by the letter "epsilon". The technique is exactly the same:

And again, just in case, the calculation for the 1st point:

In Excel we use the standard function EXP (its syntax can be found in Excel Help).

Conclusion: the sum of squared deviations for the exponential function is larger, so the exponential function approximates the experimental points worse than the straight line.

But it should be noted here that "worse" does not yet mean "bad". I have just plotted this exponential function, and it also passes close to the points, so close that without an analytical study it is difficult to say which function is more accurate.

This completes the solution, and I return to the question of the natural values of the argument. In various studies, usually economic or sociological ones, months, years or other equal time intervals are numbered with natural "x" values. Consider, for example, the following problem.

If some physical quantity $y$ depends on another quantity $x$, then this dependence can be studied by measuring $y$ at different values of $x$. As a result of the measurements, a series of values is obtained:

$x_1, x_2, \ldots, x_i, \ldots, x_n$;

$y_1, y_2, \ldots, y_i, \ldots, y_n$.

Based on the data of such an experiment, it is possible to plot the dependence $y = f(x)$. The resulting curve makes it possible to judge the form of the function $f(x)$. However, the constant coefficients that enter this function remain unknown. They can be determined using the least squares method. The experimental points, as a rule, do not lie exactly on the curve. The method of least squares requires that the sum of the squared deviations of the experimental points from the curve, i.e. $\sum_{i=1}^{n} (y_i - f(x_i))^2$, be the smallest.

In practice, this method is most often (and most simply) used in the case of a linear relationship, i.e. when

$y = kx$ or $y = a + bx$.

Linear dependence is very widespread in physics. And even when the dependence is non-linear, one usually tries to plot the graph in such a way as to get a straight line. For example, if it is assumed that the refractive index of glass $n$ is related to the wavelength $\lambda$ of the light wave by the relation $n = a + b/\lambda^2$, then the dependence of $n$ on $\lambda^{-2}$ is plotted on the graph.
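A minimal Python sketch of this linearization trick follows. The "measurements" are hypothetical: they are generated from assumed coefficients `A_TRUE` and `B_TRUE`, so the fit should simply recover them. The helper `fit_a_plus_bx` is an illustrative name.

```python
def fit_a_plus_bx(xs, ys):
    """Least squares estimates (a, b) for the line y = a + b*x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * y for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b


# Assumed coefficients of n = a + b / lambda^2 (lambda in micrometres).
A_TRUE, B_TRUE = 1.5, 0.004
wavelengths = [0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70]
indices = [A_TRUE + B_TRUE / lam ** 2 for lam in wavelengths]

# The relation is non-linear in lambda but linear in x = lambda^(-2).
inv_sq = [lam ** -2 for lam in wavelengths]
a_fit, b_fit = fit_a_plus_bx(inv_sq, indices)
print(a_fit, b_fit)
```

The only change compared with an ordinary straight-line fit is the substitution of the variable before fitting.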

Consider the dependence $y = kx$ (a straight line passing through the origin). Let us form the quantity $\varphi$, the sum of the squared deviations of our points from the straight line:

$\varphi = \sum_{i=1}^{n} (y_i - k x_i)^2$.

The value of $\varphi$ is never negative, and the closer our points lie to the straight line, the smaller it is. The method of least squares states that for $k$ one should choose the value at which $\varphi$ has a minimum:

$\frac{d\varphi}{dk} = -2 \sum_{i=1}^{n} x_i (y_i - k x_i) = 0$,

or

$k = \frac{\sum x_i y_i}{\sum x_i^2}$. (19)

The calculation shows that the root-mean-square error in determining the value of $k$ is equal to

$S_k = \sqrt{\frac{1}{n-1} \cdot \frac{\sum (y_i - k x_i)^2}{\sum x_i^2}}$, (20)
where $n$ is the number of measurements.

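Let us now consider a somewhat more difficult case, when the points must satisfy the formula $y = a + bx$ (a straight line not passing through the origin).

The task is to find the best values of $a$ and $b$ from the given set of values $x_i$, $y_i$.

Again we form the quadratic form $\varphi$ equal to the sum of the squared deviations of the points $x_i$, $y_i$ from the straight line,

$\varphi = \sum_{i=1}^{n} (y_i - a - b x_i)^2$,

and find the values of $a$ and $b$ for which $\varphi$ has a minimum:

$\frac{\partial \varphi}{\partial a} = -2 \sum_{i=1}^{n} (y_i - a - b x_i) = 0$;

$\frac{\partial \varphi}{\partial b} = -2 \sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0$.

The joint solution of these equations gives

$b = \frac{\sum (x_i - \bar{x}) y_i}{\sum (x_i - \bar{x})^2}$, (21)

$a = \bar{y} - b \bar{x}$. (22)

The root-mean-square errors in determining $a$ and $b$ are equal to

$S_b = \sqrt{\frac{1}{n-2} \cdot \frac{\sum (y_i - a - b x_i)^2}{\sum (x_i - \bar{x})^2}}$, (23)

$S_a = \sqrt{\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}\right) \frac{\sum (y_i - a - b x_i)^2}{n-2}}$. (24)

When processing the measurement results by this method, it is convenient to summarize all the data in a table in which all the sums entering formulas (19) to (24) are calculated in advance. The forms of these tables are shown in the examples below.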

Example 1. The basic equation of the dynamics of rotational motion ε = M/J (a straight line passing through the origin) was studied. For various values of the moment M, the angular acceleration ε of a certain body was measured. It is required to determine the moment of inertia of this body. The results of the measurements of the moment of force and the angular acceleration are listed in the second and third columns of Table 5.

Table 5

n	M, N·m	ε, s⁻²	M²	M·ε	ε − kM	(ε − kM)²
1	1.44	0.52	2.0736	0.7488	0.039432	0.001555
2	3.12	1.06	9.7344	3.3072	0.018768	0.000352
3	4.59	1.45	21.0681	6.6555	-0.08181	0.006693
4	5.90	1.92	34.81	11.328	-0.049	0.002401
5	7.45	2.56	55.5025	19.072	0.073725	0.005435
∑			123.1886	41.1115		0.016436

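By formula (19) we determine:

$k = \frac{\sum M_i \varepsilon_i}{\sum M_i^2} = \frac{41.1115}{123.1886} = 0.3337$ kg⁻¹·m⁻².

To determine the root-mean-square error, we use formula (20):

$S_k = \sqrt{\frac{1}{n-1} \cdot \frac{\sum (\varepsilon_i - k M_i)^2}{\sum M_i^2}} = \sqrt{\frac{0.016436}{4 \cdot 123.1886}} = 0.005775$ kg⁻¹·m⁻².

By formula (18) we have

$J = 1/k = 2.996$ kg·m²; $S_J / J = S_k / k$;

$S_J = (2.996 \cdot 0.005775)/0.3337 = 0.05185$ kg·m².

Given the reliability P = 0.95, from the table of Student's coefficients for n = 5 we find t = 2.78 and determine the absolute error ΔJ = 2.78 · 0.05185 = 0.1441 ≈ 0.2 kg·m².

We write the result in the form:

J = (3.0 ± 0.2) kg·m².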
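The arithmetic of Example 1 can be re-traced with a short Python sketch. The measurements are the M and ε columns of Table 5. The error formula is the one the tabulated sums reproduce, with the divisor n − 1, and the Student coefficient 2.78 is taken from the text. Variable names are illustrative.

```python
import math

# Columns 2 and 3 of Table 5: moment of force (N*m) and angular acceleration.
M = [1.44, 3.12, 4.59, 5.90, 7.45]
eps = [0.52, 1.06, 1.45, 1.92, 2.56]
n = len(M)

sum_MM = sum(m * m for m in M)
sum_Me = sum(m * e for m, e in zip(M, eps))

# Formula (19): slope of the line eps = k*M through the origin.
k = sum_Me / sum_MM

# Formula (20): root-mean-square error of k.
sum_res2 = sum((e - k * m) ** 2 for m, e in zip(M, eps))
S_k = math.sqrt(sum_res2 / ((n - 1) * sum_MM))

# Moment of inertia J = 1/k; its relative error equals that of k.
J = 1 / k
S_J = J * S_k / k

t_student = 2.78  # Student coefficient for n = 5, P = 0.95 (from the text)
delta_J = t_student * S_J

print(k, S_k, J, S_J, delta_J)
```

Small differences in the last digits compared with the text come from rounding of intermediate values in the hand calculation.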

Example 2. We calculate the temperature coefficient of resistance of a metal using the least squares method. Resistance depends on temperature according to the linear law

$R_t = R_0 (1 + \alpha t) = R_0 + R_0 \alpha t$.

The free term gives the resistance $R_0$ at a temperature of 0 °C, and the slope is the product of the temperature coefficient $\alpha$ and the resistance $R_0$.

The results of the measurements and calculations are given in Table 6.

Table 6

n	t, °C	r, Ohm	t − t̄	(t − t̄)²	(t − t̄)r	r − bt − a	(r − bt − a)², 10⁻⁶
1	23	1.242	-62.8333	3948.028	-78.039	0.007673	58.8722
2	59	1.326	-26.8333	720.0278	-35.581	-0.00353	12.4959
3	84	1.386	-1.83333	3.361111	-2.541	-0.00965	93.1506
4	96	1.417	10.16667	103.3611	14.40617	-0.01039	107.898
5	120	1.512	34.16667	1167.361	51.66	0.021141	446.932
6	133	1.520	47.16667	2224.694	71.69333	-0.00524	27.4556
∑	515	8.403		8166.833	21.5985		746.804
∑/n	85.83333	1.4005

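By formulas (21), (22) we determine

$\alpha R_0 = b = \frac{\sum (t_i - \bar{t}) r_i}{\sum (t_i - \bar{t})^2} = \frac{21.5985}{8166.833} = 0.002645$ Ohm/deg;

$R_0 = \bar{R} - \alpha R_0 \bar{t} = 1.4005 - 0.002645 \cdot 85.83333 = 1.1735$ Ohm.

Let us find the error in the determination of α. Since $\alpha = \frac{\alpha R_0}{R_0} = \frac{b}{a}$, by formula (18) we have:

$S_\alpha = \alpha \sqrt{\left(\frac{S_b}{b}\right)^2 + \left(\frac{S_a}{a}\right)^2}$.

Using formulas (23), (24) we have

$S_b = \sqrt{\frac{746.804 \cdot 10^{-6}}{4 \cdot 8166.833}} \approx 0.000151$ Ohm/deg;

$S_a = 0.014126$ Ohm.

Substituting these values gives $S_\alpha = 0.000132$ deg⁻¹.

Given the reliability P = 0.95, from the table of Student's coefficients for n = 6 we find t = 2.57 and determine the absolute error Δα = 2.57 · 0.000132 = 0.000338 deg⁻¹.

α = (23 ± 4)·10⁻⁴ deg⁻¹ at P = 0.95.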
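Example 2 can be re-traced in Python from the raw t and r columns of Table 6. The error-propagation step for α = b/a is written here in the usual relative-error form, which is an assumption about what formula (18) says, but it does reproduce the value 0.000132 used in the text. Variable names are illustrative.

```python
import math

# Columns 2 and 3 of Table 6: temperature (deg C) and resistance (Ohm).
t = [23, 59, 84, 96, 120, 133]
r = [1.242, 1.326, 1.386, 1.417, 1.512, 1.520]
n = len(t)

t_mean = sum(t) / n
r_mean = sum(r) / n
sxx = sum((ti - t_mean) ** 2 for ti in t)
sxy = sum((ti - t_mean) * ri for ti, ri in zip(t, r))

# Formulas (21), (22): slope b = alpha*R0 and intercept a = R0.
b = sxy / sxx
a = r_mean - b * t_mean

# Formulas (23), (24): root-mean-square errors of b and a.
sum_res2 = sum((ri - a - b * ti) ** 2 for ti, ri in zip(t, r))
S_b = math.sqrt(sum_res2 / ((n - 2) * sxx))
S_a = math.sqrt((1 / n + t_mean ** 2 / sxx) * sum_res2 / (n - 2))

# Temperature coefficient and its error (relative errors add in quadrature;
# assumed form of formula (18)).
alpha = b / a
S_alpha = alpha * math.sqrt((S_b / b) ** 2 + (S_a / a) ** 2)

t_student = 2.57  # Student coefficient for n = 6, P = 0.95 (from the text)
delta_alpha = t_student * S_alpha

print(b, a, S_b, S_a, alpha, S_alpha, delta_alpha)
```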

Example 3. It is required to determine the radius of curvature of a lens from Newton's rings. The radii of Newton's rings $r_m$ were measured, and the numbers of these rings $m$ were determined. The radii of Newton's rings are related to the radius of curvature of the lens $R$ and the ring number by the equation

$r_m^2 = m \lambda R - 2 d_0 R$,

where $d_0$ is the thickness of the gap between the lens and the plane-parallel plate (or the lens deformation),

$\lambda$ is the wavelength of the incident light, $\lambda = (600 \pm 6)$ nm.

Let
$r_m^2 = y$;
$m = x$;
$\lambda R = b$;
$-2 d_0 R = a$,

then the equation takes the form $y = a + bx$.

The results of the measurements and calculations are entered in Table 7.

Table 7

n	x = m	y = r², 10⁻² mm²	m − m̄	(m − m̄)²	(m − m̄)y	y − bx − a, 10⁻⁴	(y − bx − a)², 10⁻⁶
1	1	6.101	-2.5	6.25	-0.152525	12.01	1.44229
2	2	11.834	-1.5	2.25	-0.17751	-9.6	0.930766
3	3	17.808	-0.5	0.25	-0.08904	-7.2	0.519086
4	4	23.814	0.5	0.25	0.11907	-1.6	0.0243955
5	5	29.812	1.5	2.25	0.44718	3.28	0.107646
6	6	35.760	2.5	6.25	0.894	3.12	0.0975819
∑	21	125.129		17.5	1.041175		3.12176
∑/n	3.5	20.8548333
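The text breaks off after Table 7, so the following Python sketch only shows how the tabulated sums lead to the coefficients via formulas (21) and (22) and then to the radius through λR = b. It assumes y is converted to mm² (the table lists it in units of 10⁻² mm²) and takes λ = 600 nm from the statement of the example. Variable names are illustrative.

```python
# Columns 2 and 3 of Table 7; y is converted from 10^-2 mm^2 to mm^2.
x = [1, 2, 3, 4, 5, 6]
y = [v * 1e-2 for v in (6.101, 11.834, 17.808, 23.814, 29.812, 35.760)]
n = len(x)

x_mean = sum(x) / n
y_mean = sum(y) / n
sxx = sum((xi - x_mean) ** 2 for xi in x)
sxy = sum((xi - x_mean) * yi for xi, yi in zip(x, y))

b = sxy / sxx            # formula (21): b = lambda * R, in mm^2
a = y_mean - b * x_mean  # formula (22): a = -2 * d0 * R, in mm^2

wavelength_mm = 600e-6   # 600 nm expressed in mm
R_mm = b / wavelength_mm  # radius of curvature, in mm

print(b, a, R_mm)
```

An error estimate for R would additionally combine the error of b from formula (23) with the stated uncertainty of λ.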

The least squares method is used to estimate the parameters of a regression equation.

One of the methods for studying stochastic relationships between features is regression analysis.
Regression analysis is the derivation of a regression equation, which is used to find the average value of a random variable (the result feature) when the value of another variable or variables (the factor features) is known. It includes the following steps:

choice of the form of the relationship (the type of the analytical regression equation);
estimation of the equation parameters;
evaluation of the quality of the analytical regression equation.

Most often, a linear form is used to describe the statistical relationship between features. The attention paid to the linear relationship is explained by the clear economic interpretation of its parameters, by the limited variation of the variables, and by the fact that in most cases non-linear forms of relationship are converted (by taking logarithms or changing variables) into a linear form to perform the calculations.
In the case of a linear paired relationship, the regression equation takes the form $y_i = a + b \cdot x_i + u_i$. The parameters $a$ and $b$ of this equation are estimated from statistical observations of $x$ and $y$. The result of such estimation is the equation $\hat{y}_i = \hat{a} + \hat{b} x_i$, where $\hat{a}$, $\hat{b}$ are the estimates of the parameters $a$ and $b$, and $\hat{y}_i$ is the value of the result feature (variable) given by the regression equation (the calculated value).

The method most commonly used for parameter estimation is the least squares method (LSM).
The least squares method gives the best (consistent, efficient and unbiased) estimates of the parameters of the regression equation, but only if certain assumptions about the random term ($u$) and the independent variable ($x$) are met (see the OLS assumptions).

The problem of estimating the parameters of a linear paired equation by the least squares method is as follows: to obtain such estimates of the parameters $\hat{a}$, $\hat{b}$ for which the sum of the squared deviations of the actual values of the result feature $y_i$ from the calculated values $\hat{y}_i$ is minimal.
Formally, the OLS criterion can be written as: $S = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \to \min$.

Classification of least squares methods

Ordinary least squares method.
Maximum likelihood method (for the normal classical linear regression model, normality of the regression residuals is postulated).
Generalized least squares method (GLS), used in the case of autocorrelated errors and in the case of heteroscedasticity.
Weighted least squares method (a special case of GLS for heteroscedastic residuals).

Let us illustrate the essence of the classical least squares method graphically. To do this, we build a scatter plot from the observational data ($x_i$, $y_i$, $i = 1, \ldots, n$) in a rectangular coordinate system (such a scatter plot is called a correlation field). Let's try to find the straight line that is closest to the points of the correlation field. According to the least squares method, the line is chosen so that the sum of the squared vertical distances between the points of the correlation field and this line is minimal.

The mathematical notation of this problem: $S(\hat{a}, \hat{b}) = \sum_{i=1}^{n} (y_i - \hat{a} - \hat{b} x_i)^2 \to \min$.
The values $y_i$ and $x_i$, $i = 1, \ldots, n$, are known to us; these are the observational data. In the function $S$ they are constants. The variables in this function are the required parameter estimates $\hat{a}$, $\hat{b}$. To find the minimum of a function of 2 variables, it is necessary to calculate the partial derivatives of this function with respect to each of the parameters and equate them to zero, i.e. $\frac{\partial S}{\partial \hat{a}} = 0$, $\frac{\partial S}{\partial \hat{b}} = 0$.
As a result, we obtain a system of 2 normal linear equations: $n \hat{a} + \hat{b} \sum x_i = \sum y_i$, $\hat{a} \sum x_i + \hat{b} \sum x_i^2 = \sum x_i y_i$.
Solving this system, we find the required parameter estimates: $\hat{b} = \frac{\overline{xy} - \bar{x} \cdot \bar{y}}{\overline{x^2} - \bar{x}^2}$, $\hat{a} = \bar{y} - \hat{b} \bar{x}$.

The correctness of the calculation of the parameters of the regression equation can be checked by comparing the sums $\sum y_i$ and $\sum \hat{y}_i$ (some discrepancy is possible due to rounding in the calculations).
To calculate the parameter estimates $\hat{a}$, $\hat{b}$, you can build Table 1.
The sign of the regression coefficient $b$ indicates the direction of the relationship (if $b > 0$, the relationship is direct; if $b < 0$, the relationship is inverse). The value of $b$ shows by how many units the result feature $y$ changes on average when the factor feature $x$ changes by 1 unit of its measurement.
Formally, the value of the parameter $a$ is the average value of $y$ for $x$ equal to zero. If the factor feature does not have and cannot have a zero value, then the above interpretation of the parameter $a$ does not make sense.

The closeness of the relationship between the features is assessed using the coefficient of linear paired correlation $r_{x,y}$. It can be calculated using the formula $r_{x,y} = \frac{\overline{xy} - \bar{x} \cdot \bar{y}}{\sigma_x \sigma_y}$. In addition, the coefficient of linear paired correlation can be expressed in terms of the regression coefficient $b$: $r_{x,y} = b \frac{\sigma_x}{\sigma_y}$.
The range of admissible values of the linear coefficient of paired correlation is from –1 to +1. The sign of the correlation coefficient indicates the direction of the relationship. If $r_{x,y} > 0$, the relationship is direct; if $r_{x,y} < 0$, the relationship is inverse.
If this coefficient is close to one in absolute value, then the relationship between the features can be interpreted as a fairly close linear one. If its absolute value is equal to one, $|r_{x,y}| = 1$, then the relationship between the features is a functional linear one. If the features $x$ and $y$ are linearly independent, then $r_{x,y}$ is close to 0.
Table 1 can also be used to calculate $r_{x,y}$.

Table 1

N observations	x i	y i	x i ∙ y i
1	x 1	y 1	x 1 y 1
2	x2	y2	x 2 y 2
...
n	x n	y n	x n y n
Column Sum	∑x	∑y	∑x y
Average value

To assess the quality of the obtained regression equation, the theoretical coefficient of determination $R^2_{yx}$ is calculated:

$R^2_{yx} = \frac{d^2}{s_y^2} = 1 - \frac{e^2}{s_y^2}$,
where $d^2$ is the variance of $y$ explained by the regression equation;
$e^2$ is the residual variance of $y$ (not explained by the regression equation);
$s_y^2$ is the total variance of $y$.
The coefficient of determination characterizes the share of the variation (variance) of the result feature $y$ explained by the regression (and, consequently, by the factor $x$) in the total variation (variance) of $y$. The coefficient of determination $R^2_{yx}$ takes values from 0 to 1. Accordingly, the value $1 - R^2_{yx}$ characterizes the share of the variance of $y$ caused by the influence of other factors not taken into account in the model and by specification errors.
For paired linear regression, $R^2_{yx} = r^2_{yx}$.
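The quantities discussed in this section (the parameter estimates, the correlation coefficient, the coefficient of determination) can be computed together. The sketch below uses hypothetical data and the textbook moment formulas. It also checks numerically that R² coincides with r² for paired linear regression and that the sums of actual and calculated y values agree. The function name is illustrative.

```python
import math


def paired_regression_stats(xs, ys):
    """Return (a, b, r, r_squared) for the paired regression y = a + b*x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    xy_mean = sum(x * y for x, y in zip(xs, ys)) / n
    var_x = sum(x * x for x in xs) / n - x_mean ** 2
    var_y = sum(y * y for y in ys) / n - y_mean ** 2

    b = (xy_mean - x_mean * y_mean) / var_x
    a = y_mean - b * x_mean

    # Correlation coefficient expressed through the regression coefficient.
    r = b * math.sqrt(var_x) / math.sqrt(var_y)

    # Coefficient of determination: 1 - residual variance / total variance.
    resid_var = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / n
    r_squared = 1 - resid_var / var_y
    return a, b, r, r_squared


# Hypothetical observations.
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]
a, b, r, r_squared = paired_regression_stats(xs, ys)
y_hat = [a + b * x for x in xs]
print(a, b, r, r_squared, sum(ys), sum(y_hat))
```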

Example.

Experimental data on the values of the variables x and y are given in the table.

As a result of their alignment, the function

Using the least squares method, approximate these data with a linear dependence y=ax+b (find the parameters a and b). Find out which of the two lines better (in the sense of the least squares method) fits the experimental data. Make a drawing.

The essence of the method of least squares (LSM).

The problem is to find the coefficients of the linear dependence for which the function of two variables a and b, $F(a, b) = \sum_{i=1}^{n} (y_i - (a x_i + b))^2$, takes the smallest value. That is, for these a and b the sum of the squared deviations of the experimental data from the found straight line will be the smallest. This is the whole point of the least squares method.

Thus, the solution of the example reduces to finding the extremum of a function of two variables.

Derivation of formulas for finding coefficients.

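A system of two equations in two unknowns is set up and solved. We find the partial derivatives of the function $F(a, b)$ with respect to the variables a and b and equate these derivatives to zero:

$\frac{\partial F}{\partial a} = -2 \sum_{i=1}^{n} (y_i - (a x_i + b)) x_i = 0$, $\frac{\partial F}{\partial b} = -2 \sum_{i=1}^{n} (y_i - (a x_i + b)) = 0$.

We solve the resulting system of equations by any method (for example, by substitution) and obtain the formulas for finding the coefficients by the least squares method (LSM):

$a = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}$, $b = \frac{\sum y_i - a \sum x_i}{n}$.

For these a and b the function $F(a, b)$ takes the smallest value. The proof of this fact is given below.

That's the whole method of least squares. The formula for the parameter a contains the sums $\sum x_i$, $\sum y_i$, $\sum x_i y_i$, $\sum x_i^2$ and the parameter n, the number of experimental data points. It is recommended to calculate the values of these sums separately. The coefficient b is found after a has been calculated.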

It's time to remember the original example.

Solution.

In our example n=5. We fill in the table for the convenience of calculating the amounts that are included in the formulas of the required coefficients.

The values in the fourth row of the table are obtained by multiplying the values of the 2nd row by the values of the 3rd row for each number i.

The values in the fifth row of the table are obtained by squaring the values of the 2nd row for each number i.

The values of the last column of the table are the sums of the values across the rows.

We use the formulas of the least squares method to find the coefficients a and b. We substitute into them the corresponding values from the last column of the table:

Hence, y=0.165x+2.184 is the desired approximating straight line.
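The data table for this example is missing from this copy of the text. As a rough check of the formulas, the Python sketch below uses an assumed data set with n = 5. The values are an assumption, chosen because they reproduce the stated line y=0.165x+2.184, and are not taken from the text above.

```python
# Assumed data (n = 5); not present in this copy of the text.
xs = [0, 1, 2, 4, 5]
ys = [2.1, 2.4, 2.6, 2.8, 3.0]
n = len(xs)

# The four sums that enter the least squares formulas.
sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_xx = sum(x * x for x in xs)

# Coefficients of the line y = a*x + b; b is found after a.
a = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
b = (sum_y - a * sum_x) / n

# Sum of squared deviations of the data from the fitted line.
sse_line = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

print(round(a, 3), round(b, 3), sse_line)
```

The same `sse_line` expression, with the fitted line replaced by any other candidate function, gives the number to compare when deciding which function fits the data better.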

It remains to find out which of the two lines, y=0.165x+2.184 or the function given in the problem statement, approximates the original data better, i.e. to make an estimate using the least squares method.

Estimation of the error of the method of least squares.

To do this, you need to calculate the sums of squared deviations of the original data from each of these lines; the smaller value corresponds to the line that approximates the original data better in the sense of the least squares method.

Since the sum for the straight line is smaller, the line y=0.165x+2.184 approximates the original data better.

Graphic illustration of the least squares method (LSM).

Everything is clearly visible on the graph. The red line is the found line y=0.165x+2.184, the blue line is the function given in the problem statement, and the pink dots are the original data.

What is it for, what are all these approximations for?

I personally use it to solve data smoothing problems and interpolation and extrapolation problems (in the original example, you could be asked to find the value of the observed quantity y at x=3 or at x=6 using the least squares method). But we will talk more about this later in another section of the site.

Proof.

For the function to take its smallest value at the found a and b, it is necessary that at this point the matrix of the quadratic form of the second-order differential of the function be positive definite. Let's show it.
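The calculation itself is missing from this copy. A sketch of the standard argument, assuming the function is $F(a, b) = \sum_{i=1}^{n} (y_i - (a x_i + b))^2$ and that not all $x_i$ coincide, might look like this:

```latex
% Second-order partial derivatives of F(a, b):
\frac{\partial^2 F}{\partial a^2} = 2 \sum_{i=1}^{n} x_i^2, \qquad
\frac{\partial^2 F}{\partial a \, \partial b} = 2 \sum_{i=1}^{n} x_i, \qquad
\frac{\partial^2 F}{\partial b^2} = 2n.

% Matrix of the quadratic form of the second-order differential:
M = \begin{pmatrix}
      2 \sum x_i^2 & 2 \sum x_i \\
      2 \sum x_i   & 2n
    \end{pmatrix}.

% Sylvester's criterion: both leading principal minors are positive.
M_1 = 2 \sum_{i=1}^{n} x_i^2 > 0, \qquad
M_2 = \det M
    = 4 \left( n \sum_{i=1}^{n} x_i^2 - \Bigl( \sum_{i=1}^{n} x_i \Bigr)^2 \right)
    = 4n \sum_{i=1}^{n} (x_i - \bar{x})^2 > 0.
```

The matrix does not depend on a and b, so it is positive definite everywhere, and the stationary point is a minimum.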

Least squares method (OLS, Ordinary Least Squares) - a mathematical method used to solve various problems, based on minimizing the sum of squared deviations of some functions from the desired variables. It can be used to "solve" overdetermined systems of equations (when the number of equations exceeds the number of unknowns), to find a solution in the case of ordinary (not overdetermined) nonlinear systems of equations, and to approximate the point values of a certain function. OLS is one of the basic methods of regression analysis for estimating unknown parameters of regression models from sample data.


History

Until the beginning of the 19th century, scientists had no definite rules for solving a system of equations in which the number of unknowns is less than the number of equations. Until that time, ad hoc methods were used, depending on the type of equations and on the ingenuity of the calculators, and therefore different calculators, starting from the same observational data, came to different conclusions. Gauss (1795) is credited with the first application of the method, and Legendre (1805) independently discovered and published it under its modern name (fr. Méthode des moindres carrés). Laplace connected the method with the theory of probability, and the American mathematician Adrain (1808) considered its probabilistic applications. The method was disseminated and improved by the further research of Encke, Bessel, Hansen and others.

The essence of the method of least squares

Let $x$ be a set of $n$ unknown variables (parameters), and $f_i(x)$, $i = 1, \ldots, m$, $m > n$, a set of functions of these variables. The problem is to choose such values of $x$ that the values of these functions are as close as possible to some values $y_i$. Essentially, we are talking about the "solution" of an overdetermined system of equations $f_i(x) = y_i$, $i = 1, \ldots, m$, in the indicated sense of the maximum proximity of the left and right sides of the system. The essence of LSM is to choose as a "measure of proximity" the sum of the squared deviations of the left and right sides $|f_i(x) - y_i|$. Thus, the essence of LSM can be expressed as follows:

$\sum_i e_i^2 = \sum_i (y_i - f_i(x))^2 \rightarrow \min_x$.

If the system of equations has a solution, then the minimum of the sum of squares will be zero, and exact solutions of the system of equations can be found analytically or, for example, by various numerical optimization methods. If the system is overdetermined, that is, loosely speaking, the number of independent equations is greater than the number of unknown variables, then the system has no exact solution, and the least squares method allows us to find some "optimal" vector $x$ in the sense of the maximum proximity of the vectors $y$ and $f(x)$, or the maximum proximity of the deviation vector $e$ to zero (proximity is understood in the sense of Euclidean distance).

Example - system of linear equations

In particular, the least squares method can be used to "solve" the system of linear equations

$$Ax = b,$$

where $A$ is a rectangular matrix of size $m \times n$, $m > n$ (that is, the number of rows of $A$ is greater than the number of unknowns).

Such a system of equations generally has no solution. It can therefore be "solved" only in the sense of choosing a vector $x$ that minimizes the "distance" between the vectors $Ax$ and $b$. To do this, one can minimize the sum of squared differences between the left and right sides of the equations, that is, $(Ax - b)^T (Ax - b) \rightarrow \min$. It is easy to show that solving this minimization problem leads to the following system of equations (the explicit inverse exists when the columns of $A$ are linearly independent):

$$A^T A x = A^T b \Rightarrow x = (A^T A)^{-1} A^T b.$$
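The normal equations above can be tried on a tiny inconsistent system. The following is a minimal sketch in plain Python; the helper names `solve_linear` and `least_squares` and the sample numbers are illustrative assumptions, not part of the original text.

```python
def solve_linear(M, v):
    """Solve the square system M z = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    # Augmented matrix [M | v] stored as floats.
    aug = [[float(c) for c in row] + [float(v[i])] for i, row in enumerate(M)]
    for col in range(n):
        # Choose the row with the largest pivot in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z


def least_squares(A, b):
    """Return x minimizing ||Ax - b||^2 via the normal equations A^T A x = A^T b."""
    m, n = len(A), len(A[0])
    AtA = [[sum(A[t][i] * A[t][j] for t in range(m)) for j in range(n)] for i in range(n)]
    Atb = [sum(A[t][i] * b[t] for t in range(m)) for i in range(n)]
    return solve_linear(AtA, Atb)


# Three equations, two unknowns: x1 = 1, x2 = 2, x1 + x2 = 4 (inconsistent).
A = [[1, 0], [0, 1], [1, 1]]
b = [1, 2, 4]
x = least_squares(A, b)  # approximately [4/3, 7/3]
# Residual vector Ax - b; at the solution it is orthogonal to the columns of A.
residual = [sum(A[t][j] * x[j] for j in range(2)) - b[t] for t in range(3)]
```

Forming $A^T A$ explicitly is fine for small, well-conditioned problems like this one; for ill-conditioned systems, numerical libraries usually rely on QR or SVD factorizations instead.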

OLS in regression analysis (data approximation)

Let there be $n$ values of some variable $y$ (these may be the results of observations, experiments, and so on) and the corresponding values of the variables $x$. The task is to approximate the relationship between $y$ and $x$ by some function $f(x, b)$ known up to some unknown parameters $b$, that is, to find the values of the parameters $b$ that bring the values $f(x, b)$ as close as possible to the actual values $y$. In fact, this reduces to "solving" an overdetermined system of equations with respect to $b$:

$$f(x_t, b) = y_t, \quad t = 1, \ldots, n.$$

In regression analysis, and in particular in econometrics, probabilistic models of the relationship between variables are used:

$$y_t = f(x_t, b) + \varepsilon_t,$$

where $\varepsilon_t$ are the so-called random errors of the model.

Accordingly, deviations of the observed values $y$ from the model values $f(x, b)$ are already assumed in the model itself. The essence of (ordinary, classical) least squares, OLS, is to find the parameters $b$ for which the sum of squared deviations $e_t$ (errors; for regression models they are often called regression residuals) is minimal:

$$\hat{b}_{OLS} = \arg\min_b RSS(b),$$

where $RSS$ (Residual Sum of Squares) is defined as

$$RSS(b) = e^T e = \sum_{t=1}^n e_t^2 = \sum_{t=1}^n (y_t - f(x_t, b))^2.$$

In the general case, this problem can be solved by numerical optimization (minimization) methods. In this case, one speaks of nonlinear least squares (NLS or NLLS, Non-Linear Least Squares). In many cases, an analytical solution can be obtained. To solve the minimization problem, one finds the stationary points of the function $RSS(b)$ by differentiating it with respect to the unknown parameters $b$, setting the derivatives to zero, and solving the resulting system of equations:

$$\sum_{t=1}^n (y_t - f(x_t, b)) \frac{\partial f(x_t, b)}{\partial b} = 0.$$
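When the stationarity condition has no closed-form solution, it is usually solved iteratively. As one possible illustration (the text does not prescribe a particular algorithm), here is a sketch of Gauss-Newton iterations for an assumed one-parameter model $f(x, b) = e^{bx}$; the function name, the data, and the starting value are all hypothetical.

```python
import math


def gauss_newton_exp(xs, ys, b0, iterations=20):
    """Fit y = exp(b * x) for a single parameter b by Gauss-Newton iterations."""
    b = b0
    for _ in range(iterations):
        # Residuals e_t = y_t - f(x_t, b) and derivatives df/db = x_t * exp(b * x_t).
        resid = [y - math.exp(b * x) for x, y in zip(xs, ys)]
        deriv = [x * math.exp(b * x) for x in xs]
        denom = sum(d * d for d in deriv)
        if denom == 0.0:
            break
        # Linearize f around the current b and take the linear least squares step.
        b += sum(r * d for r, d in zip(resid, deriv)) / denom
    return b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [math.exp(0.5 * x) for x in xs]  # noise-free data generated with b = 0.5
b_hat = gauss_newton_exp(xs, ys, b0=0.3)

# Left-hand side of the stationarity condition, evaluated at the estimate.
gradient = sum((y - math.exp(b_hat * x)) * x * math.exp(b_hat * x)
               for x, y in zip(xs, ys))
```

With noisy data or a poor starting point, plain Gauss-Newton may fail to converge; damped variants such as Levenberg-Marquardt are commonly used in that situation.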

OLS in the case of linear regression

Let the regression dependence be linear:

$$y_t = \sum_{j=1}^k b_j x_{tj} + \varepsilon_t = x_t^T b + \varepsilon_t.$$

Let $y$ be the column vector of observations of the explained variable, and let $X$ be the $(n \times k)$ matrix of factor observations (the rows of the matrix are the vectors of factor values in a given observation; the columns are the vectors of values of a given factor across all observations). The matrix representation of the linear model has the form:

$$y = Xb + \varepsilon.$$

Then the vector of estimates of the explained variable and the vector of regression residuals are

$$\hat{y} = Xb, \quad e = y - \hat{y} = y - Xb.$$

Accordingly, the sum of squares of the regression residuals is

$$RSS = e^T e = (y - Xb)^T (y - Xb).$$

Differentiating this function with respect to the parameter vector $b$ and setting the derivatives to zero, we obtain a system of equations (in matrix form):

$$(X^T X) b = X^T y.$$
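The differentiation step can be written out explicitly. A short sketch, using only the symbols defined above:

```latex
\begin{aligned}
RSS(b) &= (y - Xb)^T (y - Xb) = y^T y - 2\, b^T X^T y + b^T X^T X b, \\
\frac{\partial RSS}{\partial b} &= -2\, X^T y + 2\, X^T X b = 0
\;\Longrightarrow\; (X^T X)\, b = X^T y .
\end{aligned}
```

The second derivative, $2 X^T X$, is positive semidefinite, so the stationary point is indeed a minimum.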

In expanded matrix form, this system of equations looks like this:

$$\begin{pmatrix}
\sum x_{t1}^2 & \sum x_{t1}x_{t2} & \sum x_{t1}x_{t3} & \ldots & \sum x_{t1}x_{tk} \\
\sum x_{t2}x_{t1} & \sum x_{t2}^2 & \sum x_{t2}x_{t3} & \ldots & \sum x_{t2}x_{tk} \\
\sum x_{t3}x_{t1} & \sum x_{t3}x_{t2} & \sum x_{t3}^2 & \ldots & \sum x_{t3}x_{tk} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\sum x_{tk}x_{t1} & \sum x_{tk}x_{t2} & \sum x_{tk}x_{t3} & \ldots & \sum x_{tk}^2
\end{pmatrix}
\begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_k \end{pmatrix}
=
\begin{pmatrix} \sum x_{t1}y_t \\ \sum x_{t2}y_t \\ \sum x_{t3}y_t \\ \vdots \\ \sum x_{tk}y_t \end{pmatrix},$$

where all sums are taken over all admissible values of $t$.

If a constant is included in the model (as is usual), then $x_{t1} = 1$ for all $t$. The upper left corner of the matrix of the system then contains the number of observations $n$, the remaining elements of the first row and first column are simply the sums of the values of the variables, $\sum x_{tj}$, and the first element of the right-hand side is $\sum y_t$.

The solution of this system of equations gives the general formula for the OLS estimates for the linear model:

$$\hat{b}_{OLS} = (X^T X)^{-1} X^T y = \left(\frac{1}{n} X^T X\right)^{-1} \frac{1}{n} X^T y = V_x^{-1} C_{xy}.$$

For analytical purposes, the last representation of this formula turns out to be useful (when the system of equations is divided by $n$, arithmetic means appear instead of sums). If the data in the regression model are centered, then in this representation the first matrix is the sample covariance matrix of the factors, and the second is the vector of covariances of the factors with the dependent variable. If, in addition, the data are normalized by the standard deviation (that is, ultimately standardized), then the first matrix is the sample correlation matrix of the factors, and the second vector is the vector of sample correlations of the factors with the dependent variable.

An important property of OLS estimates for models with a constant is that the fitted regression line passes through the center of gravity of the sample data, that is, the following equality holds:

$$\bar{y} = \hat{b}_1 + \sum_{j=2}^k \hat{b}_j \bar{x}_j.$$

In particular, in the extreme case when the only regressor is a constant, the OLS estimate of the single parameter (the constant itself) equals the mean value of the explained variable. That is, the arithmetic mean, known for its good properties from the laws of large numbers, is also a least squares estimate: it satisfies the criterion of the minimum sum of squared deviations from it.
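Both statements follow from the first normal equation, the one that corresponds to the constant regressor $x_{t1} = 1$. A brief derivation sketch:

```latex
\begin{aligned}
\sum_{t=1}^{n} 1 \cdot \Bigl( y_t - \hat{b}_1 - \sum_{j=2}^{k} \hat{b}_j x_{tj} \Bigr) = 0
&\;\Longrightarrow\;
\sum_{t=1}^{n} y_t = n\,\hat{b}_1 + \sum_{j=2}^{k} \hat{b}_j \sum_{t=1}^{n} x_{tj} \\
&\;\Longrightarrow\;
\bar{y} = \hat{b}_1 + \sum_{j=2}^{k} \hat{b}_j \bar{x}_j .
\end{aligned}
```

With $k = 1$ the sums over $j$ are empty, and the same equation reduces to $\hat{b}_1 = \bar{y}$.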

The simplest special cases

In the case of pairwise linear regression $y_t = a + b x_t + \varepsilon_t$, when the linear dependence of one variable on another is estimated, the calculation formulas simplify (matrix algebra is not needed). The system of equations has the form:

$$\begin{pmatrix} 1 & \bar{x} \\ \bar{x} & \overline{x^2} \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} \bar{y} \\ \overline{xy} \end{pmatrix}.$$

From here it is easy to find estimates for the coefficients:

$$\begin{cases} \hat{b} = \dfrac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)} = \dfrac{\overline{xy} - \bar{x}\bar{y}}{\overline{x^2} - \bar{x}^2}, \\ \hat{a} = \bar{y} - \hat{b}\bar{x}. \end{cases}$$

Although models with a constant are preferable in general, in some cases it is known from theoretical considerations that the constant $a$ should equal zero. For example, in physics, the relationship between voltage and current has the form $U = I \cdot R$; given measurements of voltage and current, the resistance must be estimated. Here the model is $y = bx$, and instead of a system of equations we have a single equation

$$\left(\sum x_t^2\right) b = \sum x_t y_t.$$

Therefore, the formula for estimating the single coefficient has the form

$$\hat{b} = \frac{\sum_{t=1}^n x_t y_t}{\sum_{t=1}^n x_t^2} = \frac{\overline{xy}}{\overline{x^2}}.$$
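The two formulas for the pairwise case translate almost directly into code. A minimal sketch; the function name `simple_ols` and the five data points are made up for illustration and do not come from the article.

```python
def simple_ols(xs, ys):
    """Return (a_hat, b_hat) for the model y = a + b*x using sample means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    mean_x2 = sum(x * x for x in xs) / n
    var_x = mean_x2 - mean_x ** 2  # sample variance of x (divisor n)
    if var_x == 0:
        raise ValueError("all x values are identical; the slope is not identified")
    b_hat = (mean_xy - mean_x * mean_y) / var_x  # Cov(x, y) / Var(x)
    a_hat = mean_y - b_hat * mean_x
    return a_hat, b_hat


# Made-up illustrative data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
a_hat, b_hat = simple_ols(xs, ys)  # a_hat is about 2.2, b_hat is about 0.6
```

Note that the fitted line passes through the point of means $(\bar{x}, \bar{y}) = (3, 4)$, since $2.2 + 0.6 \cdot 3 = 4$.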
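For the Ohm's law example, the single-coefficient formula can be sketched as follows. The measurement values below are hypothetical, chosen so that the resistance is close to 10 ohms, and `fit_through_origin` is an illustrative name.

```python
def fit_through_origin(xs, ys):
    """Return b_hat for the model y = b*x (no constant term)."""
    sxx = sum(x * x for x in xs)
    if sxx == 0:
        raise ValueError("all x values are zero; b is not identified")
    return sum(x * y for x, y in zip(xs, ys)) / sxx


# Hypothetical measurements: current in amperes, voltage in volts.
current = [0.5, 1.0, 1.5, 2.0]
voltage = [5.1, 9.8, 15.2, 19.9]
resistance = fit_through_origin(current, voltage)  # estimate of R in ohms
```

Unlike the model with a constant, this fitted line is forced through the origin and generally does not pass through the point of sample means.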

The case of a polynomial model

If the data are fitted by a polynomial regression function of one variable, $f(x) = b_0 + \sum_{i=1}^k b_i x^i$, then, treating the powers $x^i$ as independent factors for each $i$, the parameters of the model can be estimated from the general formula for the parameters of the linear model. To do this, it suffices to take into account in the general formula that, under this interpretation, $x_{ti} x_{tj} = x_t^i x_t^j = x_t^{i+j}$ and $x_{tj} y_t = x_t^j y_t$. Hence, the matrix equations in this case take the form (all sums run over $t = 1, \ldots, n$):

$$\begin{pmatrix}
n & \sum x_t & \ldots & \sum x_t^k \\
\sum x_t & \sum x_t^2 & \ldots & \sum x_t^{k+1} \\
\vdots & \vdots & \ddots & \vdots \\
\sum x_t^k & \sum x_t^{k+1} & \ldots & \sum x_t^{2k}
\end{pmatrix}
\begin{pmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{pmatrix}
=
\begin{pmatrix} \sum y_t \\ \sum x_t y_t \\ \vdots \\ \sum x_t^k y_t \end{pmatrix}.$$
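The matrix of this system depends on the data only through the power sums $\sum x_t^p$, $p = 0, \ldots, 2k$, which makes it easy to assemble. A sketch under the assumption of a small degree and well-scaled data; the names `solve_linear` and `polyfit_normal` are illustrative, and the test data are an exact quadratic chosen so the result can be checked by eye.

```python
def solve_linear(M, v):
    """Solve the square system M z = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [[float(c) for c in row] + [float(v[i])] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z


def polyfit_normal(xs, ys, k):
    """Return [b_0, ..., b_k] of the degree-k least squares polynomial."""
    # Power sums S_p = sum_t x_t^p; matrix entry (i, j) equals S_{i+j}, and S_0 = n.
    power_sums = [sum(x ** p for x in xs) for p in range(2 * k + 1)]
    M = [[power_sums[i + j] for j in range(k + 1)] for i in range(k + 1)]
    rhs = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in range(k + 1)]
    return solve_linear(M, rhs)


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [1.0 - 2.0 * x + 3.0 * x ** 2 for x in xs]  # exact quadratic, no noise
coeffs = polyfit_normal(xs, ys, k=2)  # close to [1.0, -2.0, 3.0]
```

For higher degrees this matrix becomes badly conditioned, so practical implementations tend to use orthogonal polynomials or a QR-based solver rather than the raw normal equations.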

Statistical Properties of OLS Estimates

First of all, we note that for linear models the OLS estimates are linear estimators, as follows from the formula above. For the OLS estimates to be unbiased, it is necessary and sufficient that the most important condition of regression analysis holds: conditional on the factors, the mathematical expectation of the random error must equal zero. This condition is satisfied, in particular, if

the mathematical expectation of the random errors is zero, and
the factors and the random errors are independent random variables.

The second condition, the exogeneity of the factors, is fundamental. If it does not hold, then almost any estimates can be expected to be extremely unsatisfactory: they will not even be consistent (that is, even a very large amount of data does not yield good estimates in this case). In the classical case, a stronger assumption is made that the factors are deterministic, in contrast to the random error, which automatically means that the exogeneity condition is satisfied. In the general case, for the estimates to be consistent it suffices that the exogeneity condition holds together with convergence of the matrix $V_x$ to some nondegenerate matrix as the sample size increases to infinity.

For the (ordinary) OLS estimates to be, in addition to consistent and unbiased, also efficient (the best in the class of linear unbiased estimators), additional properties of the random error must hold:

constant (identical) variance of the random errors in all observations (no heteroscedasticity), and
no correlation (autocorrelation) between the random errors in different observations.

These assumptions can be formulated for the covariance matrix of the vector of random errors: $V(\varepsilon) = \sigma^2 I$.

A linear model that satisfies these conditions is called classical. The OLS estimators for classical linear regression are unbiased, consistent, and the most efficient estimators in the class of all linear unbiased estimators (in the English-language literature the abbreviation BLUE, Best Linear Unbiased Estimator, is sometimes used; in the Russian-language literature the Gauss-Markov theorem is more often cited). As is easy to show, the covariance matrix of the vector of coefficient estimates equals:

$$V(\hat{b}_{OLS}) = \sigma^2 (X^T X)^{-1}.$$

Efficiency means that this covariance matrix is "minimal" (any linear combination of the coefficients, and in particular the coefficients themselves, has minimum variance), that is, in the class of linear unbiased estimators the OLS estimates are the best. The diagonal elements of this matrix, the variances of the coefficient estimates, are important indicators of the quality of the estimates obtained. However, the covariance matrix cannot be calculated directly, because the variance of the random errors is unknown. It can be proved that an unbiased and consistent (for the classical linear model) estimate of the variance of the random errors is:

$$s^2 = RSS / (n - k).$$

Substituting this value into the formula for the covariance matrix, we obtain an estimate of the covariance matrix. The resulting estimates are also unbiased and consistent. It is also important that the estimate of the error variance (and hence of the variances of the coefficients) and the estimates of the model parameters are independent random variables, which makes it possible to obtain test statistics for testing hypotheses about the coefficients of the model.
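For pairwise regression $y = a + bx$, the matrix $(X^T X)^{-1}$ is a 2×2 matrix that can be written out by hand, so the estimated variances of the coefficients take a simple form. The sketch below assumes the classical model with $k = 2$ parameters; the function name and the data are illustrative only.

```python
import math


def ols_with_standard_errors(xs, ys):
    """Fit y = a + b*x and return (a_hat, b_hat, s2, se_a, se_b)."""
    n, k = len(xs), 2  # k = number of estimated parameters (a and b)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b_hat = sxy / sxx
    a_hat = mean_y - b_hat * mean_x
    rss = sum((y - a_hat - b_hat * x) ** 2 for x, y in zip(xs, ys))
    s2 = rss / (n - k)  # unbiased estimate of the error variance
    # Diagonal of s2 * (X^T X)^{-1} for X = [1, x], written out explicitly.
    var_a = s2 * sum(x * x for x in xs) / (n * sxx)
    var_b = s2 / sxx
    return a_hat, b_hat, s2, math.sqrt(var_a), math.sqrt(var_b)


# Made-up illustrative data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
a_hat, b_hat, s2, se_a, se_b = ols_with_standard_errors(xs, ys)
```

The ratio of a coefficient estimate to its standard error is the usual starting point for the hypothesis tests mentioned above.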

It should be noted that if the classical assumptions are not met, the OLS parameter estimates are not the most efficient.

Generalized least squares

The least squares method admits a broad generalization: instead of the sum of squared residuals, one can minimize some positive definite quadratic form of the residual vector, $e^T W e$, where $W$ is some symmetric positive definite weight matrix. Ordinary least squares is a special case of this approach, in which the weight matrix is proportional to the identity matrix. As is known, for symmetric positive definite matrices (or operators) there is a decomposition $W = P^T P$. Therefore, this functional can be represented as $e^T P^T P e = (Pe)^T Pe = e_*^T e_*$, that is, as the sum of squares of some transformed "residuals". Thus, we can distinguish a class of least squares methods, the LS methods (Least Squares).

It has been proved (Aitken's theorem) that for a generalized linear regression model (in which no restrictions are imposed on the covariance matrix of the random errors), the most efficient estimates (in the class of linear unbiased estimators) are those of so-called generalized least squares (GLS, Generalized Least Squares): the LS method with a weight matrix equal to the inverse covariance matrix of the random errors, $W = V_\varepsilon^{-1}$.

It can be shown that the formula for the GLS estimates of the parameters of the linear model has the form

$$\hat{b}_{GLS} = (X^T V^{-1} X)^{-1} X^T V^{-1} y.$$

The covariance matrix of these estimates is, respectively,

$$V(\hat{b}_{GLS}) = (X^T V^{-1} X)^{-1}.$$

In fact, the essence of GLS lies in a certain (linear) transformation ($P$) of the original data and the application of ordinary least squares to the transformed data. The purpose of this transformation is that for the transformed data the random errors already satisfy the classical assumptions.
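The transformation argument can be spelled out in a few lines. A sketch, assuming $P$ is a square invertible matrix with $V^{-1} = P^T P$, and writing the transformed quantities with an asterisk:

```latex
\begin{aligned}
& y_* = P y, \qquad X_* = P X, \qquad \varepsilon_* = P \varepsilon, \\
& V(\varepsilon_*) = P V P^T = P\, (P^T P)^{-1} P^T = I, \\
& \hat{b}_{GLS} = (X_*^T X_*)^{-1} X_*^T y_*
  = (X^T P^T P X)^{-1} X^T P^T P y
  = (X^T V^{-1} X)^{-1} X^T V^{-1} y .
\end{aligned}
```

So ordinary least squares applied to $(y_*, X_*)$ reproduces the GLS formula, and the transformed errors have the covariance matrix required by the classical model.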

Weighted least squares

In the case of a diagonal weight matrix (and hence a diagonal covariance matrix of the random errors), we have so-called weighted least squares (WLS, Weighted Least Squares). In this case, the weighted sum of squared residuals of the model is minimized, that is, each observation receives a “weight” that is inversely proportional to the variance of the random error in that observation: $e^T W e = \sum_{t=1}^n \frac{e_t^2}{\sigma_t^2}$. In effect, the data are transformed by weighting the observations (dividing by a quantity proportional to the assumed standard deviation of the random errors), and ordinary least squares is applied to the weighted data.
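For a straight-line model, the weighted criterion leads to the same 2×2 system as before, with every sum weighted by $w_t = 1/\sigma_t^2$. A minimal sketch, assuming the error standard deviations $\sigma_t$ are known (in practice they usually have to be estimated or modelled); the function name and the data are hypothetical.

```python
def weighted_ols(xs, ys, sigmas):
    """Fit y = a + b*x with weights w_t = 1 / sigma_t^2; return (a_hat, b_hat)."""
    ws = [1.0 / (s * s) for s in sigmas]
    sw = sum(ws)
    swx = sum(w * x for w, x in zip(ws, xs))
    swy = sum(w * y for w, y in zip(ws, ys))
    swxx = sum(w * x * x for w, x in zip(ws, xs))
    swxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    # Determinant of the weighted normal-equation matrix [[sw, swx], [swx, swxx]].
    det = sw * swxx - swx * swx
    if det == 0:
        raise ValueError("weighted design is degenerate")
    b_hat = (sw * swxy - swx * swy) / det
    a_hat = (swy - b_hat * swx) / sw
    return a_hat, b_hat


# Made-up illustrative data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

# Equal standard deviations: the result coincides with ordinary least squares.
a_equal, b_equal = weighted_ols(xs, ys, [1.0] * 5)

# A noisier second observation gets a smaller weight and pulls the line less.
a_weighted, b_weighted = weighted_ols(xs, ys, [1.0, 5.0, 1.0, 1.0, 1.0])
```

Only the relative sizes of the weights matter: multiplying all $\sigma_t$ by the same constant leaves the estimates unchanged.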
