Regression coefficients. Regression lines

During their studies, students often come across a variety of equations. One of them, the regression equation, is discussed in this article. This type of equation is used to describe the characteristics of the relationship between mathematical parameters. Equations of this kind are used in statistics and econometrics.

Defining Regression

In mathematics, regression means a quantity that describes the dependence of the average value of one quantity on the values of another quantity. The regression equation expresses the mean of one feature as a function of another feature. The regression function is an equation of the form y = f(x), in which y is the dependent variable and x is the independent variable (the factor attribute).

What are the types of relationships between variables

In general, there are two opposite types of relationship: correlation and regression.

The first is characterized by the equal standing of the variables: it is not known for certain which variable depends on which.

If there is no such equality between the variables, and the conditions of the problem specify which variable is explanatory and which is dependent, then we can speak of a relationship of the second type. To build a linear regression equation, it is necessary to find out which type of relationship is observed.

Regression types

Today, seven different types of regression are distinguished: hyperbolic, linear, multiple, nonlinear, paired, inverse, and log-linear.

Hyperbolic, linear and logarithmic

A linear regression equation is used in statistics to explain the parameters of a relationship in the clearest way. It has the form y = c + m * x + E. The hyperbolic equation has the form of a regular hyperbola: y = c + m / x + E. The log-linear equation expresses the relationship using a logarithmic function: ln y = ln c + m * ln x + ln E.

Multiple and nonlinear

Two more complex types of regression are multiple and nonlinear. The multiple regression equation is expressed by the function y = f(x1, x2, ..., xm) + E. In this situation, y is the dependent variable and the x's are the explanatory ones. The variable E is stochastic and includes the influence of all other factors on the equation. The nonlinear regression equation is somewhat contradictory: on the one hand, it is not linear with respect to the indicators included in it, but on the other hand, it is linear in the estimated parameters.

Inverse and Paired Regressions

The inverse is a kind of function that needs to be converted to a linear form. In the most traditional applications it takes the form y = 1 / (c + m * x + E). The paired regression equation demonstrates the relationship between the data as a function y = f(x) + E. Just as in the other equations, y depends on x, and E is a stochastic parameter.

Correlation concept

This is an indicator that demonstrates the existence of a relationship between two phenomena or processes. The strength of the relationship is expressed by the correlation coefficient, whose value lies in the interval [-1; +1]. A negative value indicates an inverse relationship, a positive one a direct relationship. If the coefficient equals 0, there is no relationship. The closer the value is to 1, the stronger the relationship between the parameters; the closer to 0, the weaker.
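For illustration, here is a minimal Python sketch, with made-up data, that computes the correlation coefficient exactly as described: the covariance divided by the product of the standard deviations.

```python
import numpy as np

# Hypothetical sample data: x and y are paired observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 9.9, 12.2])

# Pearson correlation coefficient: covariance over the product
# of the standard deviations.
r = np.mean((x - x.mean()) * (y - y.mean())) / (x.std() * y.std())
print(f"r = {r:.3f}")  # close to +1: strong direct relationship

# Conventional strength bands mentioned in the text:
# |r| < 0.3 -> almost no relationship, 0.3..0.7 -> medium, ~1 -> functional.
```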

Methods

Parametric methods of correlation can assess the closeness of the relationship. They are based on distribution estimates and are used to study parameters that obey the normal distribution law.

The parameters of the linear regression equation are needed to identify the type of dependence and the regression function, and to evaluate the indicators of the selected relationship formula. The correlation field is used as a method for identifying a relationship: all available data are plotted in a rectangular 2D coordinate system, and this is how the correlation field is formed. The values of the explanatory factor are marked along the abscissa, while the values of the dependent factor are marked along the ordinate. If there is a functional relationship between the parameters, the points line up in the form of a line.

If the correlation coefficient of such data is less than 30%, we can speak of an almost complete absence of a relationship. If it is between 30% and 70%, this indicates a relationship of medium closeness. A 100% indicator is evidence of a functional relationship.

A non-linear regression equation, like a linear one, must be supplemented with a correlation index (R).

Correlation for multiple regression

The coefficient of determination is the square of the multiple correlation index. It indicates the closeness of the relationship between the presented set of indicators and the studied feature, and it also describes the nature of the influence of the parameters on the result. The multiple regression equation is evaluated using this indicator.

In order to calculate the multiple correlation index, the coefficient of determination is calculated first; the index is its square root.

Least squares method

This method is a way of estimating regression coefficients. Its essence lies in minimizing the sum of the squared deviations of the observed values of the dependent variable from the values given by the function.

A paired linear regression equation can be estimated using this method. This type of equation is used when a paired linear relationship is detected between the indicators.

Equation parameters

Each parameter of the linear regression function has a specific meaning. The paired linear regression equation contains two parameters: c and m. The parameter m shows the average change in the final indicator y when the variable x increases (decreases) by one conventional unit. If the variable x is zero, the function equals the parameter c. If the variable x is not zero, the factor c has no economic meaning; the only effect on the function is its sign. A minus suggests a slower change in the result compared to the factor, while a plus indicates an accelerated change.

Each parameter of the regression equation can be expressed through an equation. For example, the parameter c has the form c = ȳ − m·x̄, where ȳ and x̄ are the sample means of y and x.
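A minimal sketch, assuming hypothetical data, of computing both parameters of the paired equation from these relations: m as the moment ratio and c = ȳ − m·x̄.

```python
import numpy as np

# Hypothetical paired observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Least squares estimates for y = c + m*x:
# m = cov(x, y) / var(x), and the intercept follows the relation
# given in the text: c = mean(y) - m * mean(x).
m = np.mean((x - x.mean()) * (y - y.mean())) / np.var(x)
c = y.mean() - m * x.mean()
print(f"m = {m:.3f}, c = {c:.3f}")
```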

Grouped data

There are problems in which all information is grouped by the attribute x, and for each group the corresponding average values of the dependent indicator are given. In this case, the average values characterize how the indicator changes depending on x. Thus, the grouped information helps to find the regression equation, and it is used for relationship analysis. However, this method has its drawbacks: unfortunately, averages are often subject to external fluctuations. These fluctuations do not reflect the regularity of the relationship; they only mask its "noise". Averages reveal relationship patterns much worse than a linear regression equation does, but they can be used as a basis for finding the equation. By multiplying the size of an individual group by the corresponding average, one obtains the sum of y within the group. Next, all the sums obtained are added up to find the total indicator y. It is a little more difficult to calculate the sum xy. If the intervals are small, the value of x can conventionally be taken to be the same for all units within a group; multiplying it by the group sum of y gives the sum of the products of x and y within the group. Then all such sums are added together and the total sum xy is obtained.
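The arithmetic just described can be sketched in a few lines; the group values below are made up for illustration.

```python
import numpy as np

# Hypothetical grouped data: for each group, a representative x value,
# the number of units in the group, and the group mean of y.
x_group = np.array([10.0, 20.0, 30.0])  # attribute x (taken the same within a group)
n_group = np.array([5, 8, 7])           # group sizes
y_mean = np.array([3.2, 5.1, 6.8])      # group means of the dependent indicator

sum_y_per_group = n_group * y_mean      # sum of y within each group
total_y = sum_y_per_group.sum()         # total sum of y

# With narrow intervals, x is conventionally the same for all units in a
# group, so the group sum of x*y is x_group times the group sum of y.
total_xy = (x_group * sum_y_per_group).sum()
print(total_y, total_xy)
```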

Multiple Regression Equation: Assessing the Significance of the Relationship

As discussed earlier, multiple regression has the form y = f(x1, x2, ..., xm) + E. Most often, such an equation is used to solve problems of supply and demand for a product, of interest income on repurchased shares, and to study the causes and form of the production cost function. It is also actively used in a wide variety of macroeconomic studies and calculations; at the level of microeconomics, such an equation is used somewhat less often.

The main task of multiple regression is to build a model of data containing a large amount of information in order to determine what influence each of the factors, individually and in their totality, has on the indicator being modeled and on its coefficients. The regression equation can take a wide variety of forms; two types of functions are usually used to assess the relationship: linear and nonlinear.

The linear function is depicted as the relationship y = a0 + a1 x1 + a2 x2 + ... + am xm. Here a1, a2, ..., am are the coefficients of "pure" regression. They characterize the average change in the parameter y with a change (decrease or increase) in each corresponding parameter x by one unit, provided the values of the other indicators remain stable.

Nonlinear equations include, for example, the power function y = a x1^b1 x2^b2 ... xm^bm. Here the exponents b1, b2, ..., bm are called elasticity coefficients; they show by how many percent the result will change when the corresponding indicator x increases (decreases) by 1%, with the other factors held stable.
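Since taking logarithms makes the power model linear in its parameters, the elasticity coefficients can be estimated by ordinary least squares. Below is a sketch with synthetic data; the "true" elasticities 0.7 and 0.3 are assumptions of the example.

```python
import numpy as np

# Synthetic data for a power model y = a * x1^b1 * x2^b2.
rng = np.random.default_rng(0)
x1 = rng.uniform(1.0, 10.0, 100)
x2 = rng.uniform(1.0, 10.0, 100)
y = 2.0 * x1**0.7 * x2**0.3 * rng.lognormal(0.0, 0.05, 100)

# Taking logarithms makes the model linear in the parameters:
# ln y = ln a + b1*ln x1 + b2*ln x2, so ordinary least squares applies.
X = np.column_stack([np.ones_like(x1), np.log(x1), np.log(x2)])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
ln_a, b1, b2 = coef
print(f"a = {np.exp(ln_a):.2f}, elasticities b1 = {b1:.2f}, b2 = {b2:.2f}")
```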

What factors need to be considered when constructing multiple regression

In order to construct a multiple regression correctly, it is necessary to find out which factors deserve special attention.

It is necessary to have a certain understanding of the nature of the relationship between the economic factors and the modeled indicator. The factors to be included must meet the following criteria:

  • They must be quantifiable. In order to use a factor describing the quality of an object, it should in any case be given a quantitative form.
  • There should be no intercorrelation of the factors and no functional relationship between them. Otherwise the system of normal equations becomes ill-conditioned, which entails unreliable and unclear estimates.
  • If the correlation between factors is very high, there is no way to isolate their individual influence on the final result, and the coefficients become uninterpretable.

Construction methods

There are countless methods and techniques for selecting factors for an equation. However, all of them are based on selecting factors using the correlation index. Among them are:

  • The exclusion method.
  • The inclusion method.
  • Stepwise regression analysis.

The first method involves screening factors out of the complete set. The second involves introducing additional factors one by one. The third eliminates factors that were previously included in the equation. Each of these methods has a right to exist; they have their pros and cons, but each can in its own way solve the problem of filtering out unnecessary indicators. As a rule, the results obtained by the individual methods are fairly close.
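A rough sketch of the inclusion method: factors are added greedily while each addition still raises R² appreciably. The function names and the min_gain threshold are illustrative choices, not a fixed prescription.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of a least squares fit of y on the columns of X plus a constant."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

def forward_inclusion(factors, y, min_gain=0.01):
    """Greedy inclusion: add the factor that raises R^2 most, while it helps."""
    selected, remaining = [], list(range(factors.shape[1]))
    best = 0.0
    while remaining:
        gains = [(r_squared(factors[:, selected + [j]], y), j) for j in remaining]
        score, j = max(gains)
        if score - best < min_gain:
            break
        best = score
        selected.append(j)
        remaining.remove(j)
    return selected, best

# Hypothetical usage with random data: columns 0 and 2 actually matter.
rng = np.random.default_rng(2)
F = rng.normal(size=(100, 4))
y = 2.0 + 1.5 * F[:, 0] - 0.8 * F[:, 2] + rng.normal(0, 0.5, 100)
print(forward_inclusion(F, y))
```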

Multivariate analysis methods

Such methods of factor determination are based on considering individual combinations of interrelated features. These include discriminant analysis, pattern recognition, principal component analysis, and cluster analysis. Factor analysis also belongs here; it appeared as a development of the method of principal components. All of them are applied in certain circumstances, under certain conditions and factors.

With a linear type of connection between the two studied characteristics, in addition to correlations, the regression coefficient is calculated.

In the case of a straight-line correlation, each change in one feature corresponds to a well-defined change in the other. However, the correlation coefficient shows this relationship only in relative terms, in fractions of a unit, whereas regression analysis gives this value in named units. The amount by which the first feature changes on average when the second changes by one unit of measurement is called the regression coefficient.

In contrast to correlation, regression analysis provides broader information, since by calculating the two regression coefficients R_x/y and R_y/x one can determine both the dependence of the first feature on the second and of the second on the first. Expressing the regression relationship by an equation makes it possible, for a given value of one feature, to establish the value of the other.

The regression coefficient R is the product of the correlation coefficient and the ratio of the standard deviations calculated for each feature. It is calculated by the formula

R_x/y = r * (S_x / S_y),

where R is the regression coefficient; S_x is the standard deviation of the first feature, which changes due to the change in the second; S_y is the standard deviation of the second feature, due to whose change the first feature changes; r is the correlation coefficient between these features; x is the function and y the argument.

This formula determines the value by which x changes when y changes by one unit of measurement. If the reverse calculation is necessary, the value by which y changes when x changes by one unit is found by the formula:

R_y/x = r * (S_y / S_x).

In this case, the active role in changing one attribute relative to the other is reversed: compared with the previous formula, the argument becomes the function and vice versa. S_x and S_y are taken in named units.

There is a clear relationship between the values of r and R: the product of the regression coefficient of x on y and the regression coefficient of y on x equals the square of the correlation coefficient, i.e.

R_x/y * R_y/x = r²

This indicates that the correlation coefficient is the geometric mean of the two regression coefficients of a given sample. This formula can be used to check the correctness of calculations.
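This check is easy to carry out numerically; a small sketch with made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.8, 4.2, 5.7, 8.1, 9.6, 12.3])

r = np.corrcoef(x, y)[0, 1]
R_yx = r * y.std() / x.std()  # regression coefficient of y on x
R_xy = r * x.std() / y.std()  # regression coefficient of x on y

# The product of the two regression coefficients equals r squared.
print(np.isclose(R_yx * R_xy, r**2))  # True
```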

When processing numerical material on calculating machines, expanded formulas for the regression coefficient can be used, for example:

R_y/x = (Σxy − n x̄ ȳ) / (Σx² − n x̄²)  or  R_y/x = (n Σxy − Σx Σy) / (n Σx² − (Σx)²).

For the regression coefficient, its representativeness error can be calculated. The error of the regression coefficient equals the error of the correlation coefficient multiplied by the ratio of the standard deviations:

m_R = m_r * (S_y / S_x).

The criterion for the reliability of the regression coefficient is calculated by the usual formula:

t_R = R / m_R,

and as a result it is equal to the criterion of reliability of the correlation coefficient:

t_R = t_r = r / m_r.

The reliability of the t_R value is established according to Student's table at ν = n − 2, where n is the number of observation pairs.
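A sketch of this reliability check, assuming the standard error formula m_r = sqrt((1 − r²)/(n − 2)) and made-up data:

```python
import numpy as np
from scipy import stats

# Made-up paired observations.
x = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 10.0, 11.0, 13.0])
y = np.array([3.1, 5.2, 6.0, 8.3, 8.9, 11.2, 12.1, 14.0])
n = len(x)

r = np.corrcoef(x, y)[0, 1]
m_r = np.sqrt((1 - r**2) / (n - 2))  # error of the correlation coefficient (assumed form)
R_yx = r * y.std() / x.std()         # regression coefficient of y on x
m_R = m_r * y.std() / x.std()        # its error: m_r times the sigma ratio
t_R = R_yx / m_R                     # reliability criterion; equals r / m_r

# Compare with Student's table value at nu = n - 2 (two-sided, alpha = 0.05).
t_table = stats.t.ppf(0.975, df=n - 2)
print(f"t_R = {t_R:.2f}, t_table = {t_table:.2f}, significant: {abs(t_R) > t_table}")
```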

Curvilinear regression.

Curvilinear regression is any nonlinear regression in which the equation for the change of one variable (y) as a function of changes in another (x) is a quadratic, cubic, or higher-order equation. While it is always mathematically possible to obtain a regression equation that fits every squiggle of the curve, most of these perturbations result from sampling or measurement errors, and such a "perfect" fit is of no value. It is not always easy to determine whether a curvilinear regression fits a dataset, although statistical tests exist for determining whether each higher power of the equation significantly increases the goodness of fit.

Curve fitting is performed by the same least squares method as fitting a straight line: the regression line must satisfy the condition of the minimum of the sum of squared distances to each point of the correlation field. Here, as in equation (1), y is the calculated value of the function, determined from the equation of the selected curvilinear relationship at the actual values of x_j. For example, if a second-order parabola is chosen to approximate the relationship, then

y = a + bx + cx²,   (14)

and the difference between a point lying on the curve and the corresponding point of the correlation field can be written, similarly to equation (3), in the form

Δy_j = y_j − (a + bx_j + cx_j²).   (15)

In this case, the sum of the squared distances from each point of the correlation field to the new regression line for the second-order parabola has the form:

S² = Σ [y_j − (a + bx_j + cx_j²)]².   (16)

Based on the minimum condition of this sum, the partial derivatives of S² with respect to a, b and c are set equal to zero. After performing the necessary transformations, we obtain a system of three equations with three unknowns for determining a, b and c:

Σy = na + b Σx + c Σx²,
Σyx = a Σx + b Σx² + c Σx³,   (17)
Σyx² = a Σx² + b Σx³ + c Σx⁴.

Solving the system for a, b and c, we find the numerical values of the regression coefficients. The quantities Σy, Σx, Σx², Σyx, Σyx², Σx³, Σx⁴ are found directly from the production measurements.

The estimate of the closeness of the relationship for a curvilinear dependence is the theoretical correlation ratio η_xy, which is the square root of the ratio of two variances: the mean square σ_p² of the deviations of the calculated values y′_j of the function, found from the regression equation, from the arithmetic mean Y of the value y, to the mean square σ_y² of the deviations of the actual values y_j of the function from its arithmetic mean:

η_xy = (σ_p² / σ_y²)^(1/2) = (Σ(y′_j − Y)² / Σ(y_j − Y)²)^(1/2).   (18)

The squared correlation ratio η_xy² shows the share of the total variability of the dependent variable y due to the variability of the argument x; this indicator is called the coefficient of determination. In contrast to the correlation coefficient, the correlation ratio can take only positive values from 0 to 1. In the absence of a connection the correlation ratio is zero, in the presence of a functional connection it equals one, and for a regression connection of varying closeness it takes values between zero and one.

The choice of the type of curve is of great importance in regression analysis, since the accuracy of the approximation and the statistical estimates of the closeness of the relationship depend on the chosen type of relationship. The simplest method for choosing the type of curve is to construct correlation fields and select the appropriate type of regression equation from the location of the points on these fields. Regression analysis methods make it possible to find numerical values of the regression coefficients for complex types of interrelation of parameters described, for example, by polynomials of high degrees. Often the shape of the curve can be determined from the physical nature of the process or phenomenon under consideration. It makes sense to use polynomials of high degrees to describe rapidly changing processes when the ranges of fluctuation of the parameters of these processes are significant.
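A short sketch that builds the system (17) directly from the sums and evaluates the correlation ratio (18); the data are made up:

```python
import numpy as np

# Hypothetical observations to be fitted with y = a + b*x + c*x^2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 2.9, 6.8, 13.1, 21.0, 31.2])

n = len(x)
# Normal equations (17): the sums are taken directly from the data.
A = np.array([
    [n,            x.sum(),       (x**2).sum()],
    [x.sum(),      (x**2).sum(),  (x**3).sum()],
    [(x**2).sum(), (x**3).sum(),  (x**4).sum()],
])
rhs = np.array([y.sum(), (y * x).sum(), (y * x**2).sum()])
a, b, c = np.linalg.solve(A, rhs)
print(f"a = {a:.3f}, b = {b:.3f}, c = {c:.3f}")

# Theoretical correlation ratio (18): share of variance explained by the curve.
y_hat = a + b * x + c * x**2
eta = np.sqrt(((y_hat - y.mean())**2).sum() / ((y - y.mean())**2).sum())
print(f"eta = {eta:.3f}")
```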
As applied to the study of the metallurgical process, it is sufficient to use curves of lower orders, for example a second-order parabola. This curve can have one extremum, which, as practice has shown, is quite enough to describe various characteristics of the metallurgical process. The results of calculating the parameters of a pairwise correlation relationship are reliable and of practical value only if the information used was obtained under conditions of wide ranges of fluctuation of the argument with all other parameters of the process constant. Consequently, methods for studying the pairwise correlation of parameters can be used to solve practical problems only when there is confidence that there are no other serious influences on the function besides the analyzed argument. Under production conditions it is impossible to run the process this way for a long time. However, if we have information about the main parameters of the process that affect its results, then the influence of these parameters can be eliminated mathematically and the relationship between the function and the argument of interest can be isolated in a "pure form". Such a relationship is called partial, or individual. To determine it, the multiple regression method is used.

Correlation ratio.

The correlation ratio and the correlation index are numerical characteristics closely related to the concept of a random variable, or rather to a system of random variables. Therefore, to introduce and define their meaning and role, it is necessary to clarify the concept of a system of random variables and some properties inherent in such systems.

Two or more random variables describing a certain phenomenon are called a system or a complex of random variables.

The system of several random variables X, Y, Z,…, W is usually denoted by (X, Y, Z,…, W).

For example, a point on a plane is described not by one coordinate but by two, and a point in space by three.

The properties of a system of several random variables are not limited to the properties of the individual random variables in the system; they also include the mutual connections (dependencies) between the random variables. Therefore, when studying a system of random variables, one should pay attention to the nature and degree of dependence. This dependence can be more or less pronounced, more or less close. In other cases, the random variables turn out to be practically independent.

A random variable Y is called independent of a random variable X if the distribution law of a random variable Y does not depend on what value X has taken.

It should be noted that the dependence and independence of random variables is always a mutual phenomenon: if Y does not depend on X, then the value of X does not depend on Y. Taking this into account, we can give the following definition of the independence of random variables.

Random variables X and Y are called independent if the distribution law of each of them does not depend on what value the other has taken. Otherwise, the quantities X and Y are called dependent.

The distribution law of a random variable is any relation that establishes a connection between the possible values ​​of a random variable and the corresponding probabilities.

The concept of "dependence" of random variables, which is used in the theory of probability, is somewhat different from the usual concept of "dependence" of quantities, which is used in mathematics. So, the mathematician under "dependence" means only one type of dependence - complete, rigid, so-called functional dependence. Two quantities X and Y are called functionally dependent if, knowing the value of one of them, it is possible to accurately determine the value of the other.

In the theory of probability, there is a slightly different type of dependence - probabilistic dependence. If the value of Y is related to the value of X by a probabilistic dependence, then, knowing the value of X, it is impossible to accurately indicate the value of Y, but you can indicate its distribution law, depending on what value the value of X has taken.

The probabilistic dependence can be more or less close; as the closeness of the probabilistic dependence increases, it approaches the functional one more and more. Thus, functional dependence can be considered as an extreme, limiting case of the closest probabilistic dependence. Another extreme case is complete independence of random variables. Between these two extreme cases lie all the gradations of probabilistic dependence - from the strongest to the weakest.

The probabilistic relationship between random variables is often encountered in practice. If the random variables X and Y are in a probabilistic relationship, this does not mean that with a change in the value of X, the value of Y changes in a quite definite way; it only means that with a change in the value of X, the value of Y also tends to change (increase or decrease with an increase in X). This trend is observed only in general outline, and in each individual case deviations from it are possible.

The regression coefficient is the absolute amount by which the value of one feature changes on average when another, related feature changes by a specified unit of measurement. The relationship between y and x determines the sign of the regression coefficient b (if b > 0, the relationship is direct; otherwise it is inverse). The linear regression model is the most commonly used and most studied in econometrics.

1.4. Approximation error

Let us estimate the quality of the regression equation using the mean absolute approximation error. The predicted values of the factors are substituted into the model, and point predictive estimates of the studied indicator are obtained. Thus, the regression coefficients characterize the degree of significance of individual factors for increasing the level of the effective indicator.
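A minimal sketch of the mean absolute approximation error (in percent), with hypothetical actual and predicted values; the ~15% threshold mentioned later in the text is a common rule of thumb:

```python
import numpy as np

# Hypothetical actual values and model predictions.
y = np.array([10.0, 12.0, 15.0, 18.0, 22.0])
y_hat = np.array([10.8, 11.5, 15.9, 17.2, 23.0])

# Mean absolute approximation error, in percent.
A = np.mean(np.abs((y - y_hat) / y)) * 100
print(f"A = {A:.1f}%")
```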

Regression coefficient

Consider now task 1 of the regression analysis tasks on p. 300-301. One of the mathematical results of linear regression theory states that the least squares estimate is an unbiased estimate with minimum variance in the class of all linear unbiased estimates. For example, one can calculate the average number of colds at given values of the average monthly air temperature in the autumn-winter period.

Regression line and regression equation

The regression sigma is used to construct the regression scale, which reflects the deviation of the values of the effective attribute from its mean value plotted on the regression line. Using the values x1, x2, x3 and the corresponding mean values y1, y2, y3, as well as the smallest (y − σ_y/x) and largest (y + σ_y/x) values of y, the regression scale is constructed. Conclusion: the regression scale, within the calculated values of body weight, allows one to determine it at any other value of height, or to assess the individual development of the child.

In matrix form, the regression equation (RE) is written as Y = BX + U, where U is the error matrix. The statistical use of the word "regression" comes from the phenomenon known as regression to the mean, attributed to Sir Francis Galton (1889).

Paired linear regression can be extended to include more than one independent variable; in this case it is known as multiple regression. For both outliers and "influential" observations (points), models are fitted with and without them, and attention is paid to the change in the estimates (regression coefficients).

Because of the linear relationship, we expect y to change as x changes, and we call this the variation that is caused or explained by the regression. If this is the case, most of the variation will be due to the regression, and the points will lie close to the regression line, i.e. the line fits the data well. The remainder is the percentage of variance that cannot be explained by the regression.

This method is used to visualize the form of the connection between the studied economic indicators. Based on the correlation field, one can put forward a hypothesis (for the general population) that the relationship between all possible values of X and Y is linear.

Reasons for the existence of a random error: (1) omission of significant explanatory variables from the regression model; (2) aggregation of variables. In our example the connection is direct. To predict the dependent (effective) indicator, the predicted values of all factors included in the model must be known.

Comparison of correlation and regression coefficients

It can be guaranteed with a probability of 95% that the values of Y for an unlimited number of observations will not go beyond the found intervals. If the calculated criterion value with (n − m − 1) degrees of freedom is greater than the tabular value at a given significance level, the model is considered significant. This ensures that there is no correlation between any deviations, in particular between adjacent deviations.

Regression coefficients and their interpretation

In most cases, positive autocorrelation is caused by the constant directional influence of some factors not taken into account in the model. Negative autocorrelation effectively means that a positive deviation is followed by a negative one, and vice versa.

What is regression?

2. Inertia. Many economic indicators (inflation, unemployment, GNP, etc.) have a certain cyclical character associated with the wave-like nature of business activity. In many industrial and other areas, economic performance responds to changes in economic conditions with a delay (time lag).


The linear regression equation has the form y = bx + a + ε, where ε is a random error (deviation, perturbation). Since the error exceeds 15%, it is not desirable to use this equation as a regression. By substituting the corresponding x values into the regression equation, one can determine the aligned (predicted) values of the effective indicator y(x) for each observation.

Regression coefficients show the intensity of the influence of the factors on the effective indicator. If preliminary standardization of the factor indicators is carried out, then b0 equals the average value of the effective indicator in the aggregate. The coefficients b1, b2, ..., bn show by how many units the level of the effective indicator deviates from its mean when the value of a factor indicator deviates from its own mean by one standard deviation. Thus, the regression coefficients characterize the degree of significance of individual factors for increasing the level of the effective indicator. The specific values of the regression coefficients are determined from empirical data by the least squares method (as a result of solving a system of normal equations).
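A sketch of this effect of preliminary standardization, with synthetic data: after the factor indicators are standardized, the estimated b0 coincides with the mean of the effective indicator.

```python
import numpy as np

# Synthetic data: two factor indicators x1, x2 and an effective indicator y.
rng = np.random.default_rng(1)
x1 = rng.normal(50.0, 10.0, 200)
x2 = rng.normal(30.0, 5.0, 200)
y = 3.0 + 0.4 * x1 + 1.2 * x2 + rng.normal(0.0, 2.0, 200)

def standardize(v):
    """Zero mean, unit standard deviation."""
    return (v - v.mean()) / v.std()

# Regress y on the standardized factors (plus a constant).
Z = np.column_stack([np.ones_like(y), standardize(x1), standardize(x2)])
b, *_ = np.linalg.lstsq(Z, y, rcond=None)

# With standardized factors the intercept b0 equals the mean of y, and
# b1, b2 give the change in y per one standard deviation of each factor.
print(b[0], y.mean())  # approximately equal
print(b[1:])
```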

Regression line: the line that most accurately reflects the distribution of the experimental points on the scatter diagram, and whose slope characterizes the relationship between two interval variables.

The regression line is most often sought in the form of a linear function (linear regression) that best approximates the desired curve. This is done by the least squares method, in which the sum of the squared deviations of the actually observed values from their estimates is minimized (meaning estimates obtained with a straight line claiming to represent the desired regression dependence):

S = Σ (y_i − ŷ_i)² → min, i = 1, ..., M

(M is the sample size). This approach is based on the well-known fact that the sum appearing in the above expression attains its minimum value precisely when the coefficients of the line are the least squares estimates.
57. The main tasks of the theory of correlation.

Correlation theory is an apparatus that evaluates the closeness of connections between phenomena that are not necessarily in cause-and-effect relationships. With the help of correlation theory, stochastic, but not causal, relationships are estimated. The author, together with M. L. Lukatskaya, made an attempt to obtain estimates for causal relationships. However, the question of the causal relationship of phenomena, of how to identify cause and effect, remains open, and it seems that at the formal level it is fundamentally impossible to resolve.

Correlation theory and its application to the analysis of production.

Correlation theory, one of the branches of mathematical statistics, makes it possible to make reasonable assumptions about the possible limits within which the investigated parameter will lie, with a certain degree of reliability, if other statistically related parameters take certain values.

In the theory of correlation, it is customary to distinguish two main tasks.

The first task of correlation theory is to establish the form of the correlation, i.e. the kind of regression function (linear, quadratic, etc.).

The second task of correlation theory is to assess the closeness (strength) of the correlation.

The closeness of the correlation (dependence) of Y on X is estimated by the amount of scatter of the Y values around the conditional mean. Large scatter indicates a weak dependence of Y on X; small scatter indicates a strong dependence.
58. Correlation table and its numerical characteristics.

In practice, as a result of independent observations of the values of X and Y, one deals, as a rule, not with the entire set of all possible pairs of these values, but only with a limited sample from the general population; the size n of the sample population is defined as the number of pairs in the sample.

Let the value X take in the sample the values x1, x2, ..., xm, where m is the number of distinct values, and in general each of them may be repeated in the sample. Let the value Y take in the sample the values y1, y2, ..., yk, where k is the number of distinct values, and in general each of them may also be repeated. The data are entered into a table taking into account the frequencies of occurrence. Such a table of grouped data is called a correlation table.

The first stage of statistical processing of the results is the compilation of a correlation table.

Y \ X | x1    x2    ...  xm   | n_y
y1    | n11   n21   ...  nm1  | n_y1
y2    | n12   n22   ...  nm2  | n_y2
...   |                       |
yk    | n1k   n2k   ...  nmk  | n_yk
n_x   | n_x1  n_x2  ...  n_xm | n

The first row of the main part of the table lists, in ascending order, all values of X occurring in the sample. The first column likewise lists, in ascending order, all values of Y in the sample. At the intersection of the corresponding row and column stands the frequency n_ij (i = 1, 2, ..., m; j = 1, 2, ..., k), equal to the number of occurrences of the pair (x_i; y_j) in the sample. For example, the frequency n_12 represents the number of occurrences in the sample of the pair (x_1; y_2).

Here n_xi = Σ_j n_ij (1 ≤ i ≤ m) is the sum of the elements of the i-th column, n_yj = Σ_i n_ij (1 ≤ j ≤ k) is the sum of the elements of the j-th row, and Σ n_xi = Σ n_yj = n.

The analogues of the usual formulas, computed from the data of the correlation table, are as follows:

x̄ = (1/n) Σ n_xi x_i,   ȳ = (1/n) Σ n_yj y_j,

s_x² = (1/n) Σ n_xi (x_i − x̄)²,   s_y² = (1/n) Σ n_yj (y_j − ȳ)²,

r = ((1/n) Σ n_ij x_i y_j − x̄ ȳ) / (s_x s_y).
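A sketch that builds a correlation table from a hypothetical paired sample and applies these grouped-data analogues:

```python
import numpy as np

# Hypothetical paired sample; repeated pairs are counted as frequencies.
pairs = [(1, 2), (1, 2), (2, 3), (2, 3), (2, 4), (3, 4), (3, 4), (3, 4)]
xs = sorted({p[0] for p in pairs})
ys = sorted({p[1] for p in pairs})

# Correlation table: n_ij[i, j] = number of occurrences of the pair (x_i, y_j).
n_ij = np.zeros((len(xs), len(ys)), dtype=int)
for x, y in pairs:
    n_ij[xs.index(x), ys.index(y)] += 1

n = n_ij.sum()
n_x = n_ij.sum(axis=1)  # column totals n_xi
n_y = n_ij.sum(axis=0)  # row totals n_yj
x_arr, y_arr = np.array(xs, float), np.array(ys, float)

# Grouped analogues of the usual formulas.
x_bar = (n_x * x_arr).sum() / n
y_bar = (n_y * y_arr).sum() / n
s_x = np.sqrt((n_x * (x_arr - x_bar) ** 2).sum() / n)
s_y = np.sqrt((n_y * (y_arr - y_bar) ** 2).sum() / n)
r = ((n_ij * np.outer(x_arr, y_arr)).sum() / n - x_bar * y_bar) / (s_x * s_y)
print(f"r = {r:.3f}")
```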


59. Empirical and theoretical regression lines.

The theoretical regression line can be calculated in this case from the results of individual observations. To solve the system of normal equations we need the same data: Σx, Σy, Σxy and Σx². We have data on the volume of cement production and the volume of fixed assets in 1958. The task is to investigate the relationship between the volume of cement production (in physical terms) and the volume of fixed assets. [1]

The less the theoretical regression line (calculated by the equation) deviates from the actual (empirical) one, the smaller the mean approximation error.

The process of finding the theoretical regression line consists in smoothing the empirical regression line on the basis of the least squares method.

The process of finding the theoretical regression line is called alignment of the empirical regression line; it consists in choosing and justifying the type of curve and calculating the parameters of its equation.

Empirical regression is based on the data of analytical or combination groupings and represents the dependence of the group mean values of the result attribute on the group mean values of the factor attribute. The graphical representation of empirical regression is a broken line made up of points whose abscissas are the group mean values of the factor attribute and whose ordinates are the group mean values of the result attribute. The number of points equals the number of groups in the grouping.

The empirical regression line reflects the main trend of the dependence under consideration. If the empirical regression line approaches a straight line in appearance, we can assume a straight-line correlation between the features. If the line of connection approaches a curve, this may be due to a curvilinear correlation.
60. Sample coefficients of correlation and regression.

If the relationship between the features on the graph indicates a linear correlation, the correlation coefficient r is calculated. It makes it possible to assess the closeness of the relationship between the variables, and to find out what proportion of the changes in a feature is due to the influence of the main feature and what proportion to other factors. The coefficient varies from -1 to +1. If r = 0, there is no connection between the features. The equality r = 0 indicates only the absence of a linear correlation dependence, not the absence of correlation in general, much less the absence of statistical dependence. If r = ±1, there is a complete (functional) connection. In this case, all the observed values lie on the regression line, which is a straight line.
The practical significance of the correlation coefficient is determined by its squared value, which is called the coefficient of determination.
Regression is approximated (roughly described) by a linear function y = kX + b. For the regression of Y on X, the regression equation is ŷ_x = r_yx X + b (1). The slope r_yx of the regression line of Y on X is called the regression coefficient of Y on X.

If equation (1) is found from sample data, it is called the sample regression equation. Accordingly, r_yx is the sample regression coefficient of Y on X, and b is its sample intercept. The regression coefficient measures the variation in Y per unit of variation in X. The parameters of the regression equation (the coefficients r_yx and b) are found by the least squares method.
61. Assessment of the significance of the correlation coefficient and the tightness of the correlation in the general population

The significance of correlation coefficients is checked by Student's criterion:

t = r / m_r,

where m_r is the root-mean-square error of the correlation coefficient, determined by the formula:

m_r = ((1 − r²) / (n − 2))^(1/2).
If the calculated value of t is higher than the tabular value, it can be concluded that the correlation coefficient is significant. Tabular values of t are found from the table of Student's criterion values, taking into account the number of degrees of freedom (V = n − 1) and the confidence level (in economic calculations, usually 0.05 or 0.01). In our example, the number of degrees of freedom is n − 1 = 40 − 1 = 39; at the confidence level P = 0.05, t = 2.02. Since the actual value is in all cases higher than the tabular t, the relationship between the effective and factor indicators is reliable, and the correlation coefficients are significant.

The estimate of the correlation coefficient calculated from a limited sample practically always differs from zero. But it does not follow from this that the correlation coefficient of the general population is also nonzero. It is required to evaluate the significance of the sample value of the coefficient or, in the language of testing statistical hypotheses, to test the hypothesis that the correlation coefficient equals zero. If the hypothesis H0 that the correlation coefficient equals zero is rejected, the sample coefficient is significant and the corresponding quantities are linearly related. If H0 is accepted, the coefficient estimate is not significant and the quantities are not linearly related to each other (if, for physical reasons, the factors may still be related, it is better to say that the relationship has not been established from the available experimental data). Testing the hypothesis about the significance of the correlation coefficient estimate requires knowledge of the distribution of this random variable. The distribution of the quantity ρ_ik has been studied only for the special case when the random variables U_j and U_k are distributed according to the normal law.

As a criterion for testing the null hypothesis H0, the random quantity t = r (n − 2)^(1/2) / (1 − r²)^(1/2) is applied. If the modulus of the correlation coefficient is relatively far from unity, then, when the null hypothesis is true, t is distributed according to Student's law with n − 2 degrees of freedom. The competing hypothesis H1 corresponds to the statement that ρ_ik is nonzero (greater than or less than zero). Therefore, the critical region is two-sided.
62. Calculation of the sample correlation coefficient and construction of the sample equation of the straight line of regression.

The sample correlation coefficient is found by the formula

r = Σ (x_i − x̄)(y_i − ȳ) / (n s_x s_y),

where s_x and s_y are the sample standard deviations of the quantities X and Y.

The sample correlation coefficient shows the closeness of the linear relationship between X and Y: the closer |r| is to one, the stronger the linear relationship between X and Y.

Simple linear regression finds the linear relationship between one input variable and one output variable. For this, a regression equation is determined; it is a model reflecting the dependence of the dependent variable Y on the independent variable x, and for the general population it is described by the equation

Y = A0 + A1 x,

where A0 is the free term (intercept) of the regression equation,

and A1 is the coefficient of the regression equation.

Then the corresponding straight line, called the regression line, is constructed. The coefficients A0 and A1, also called model parameters, are chosen so that the sum of the squared deviations of the points corresponding to the real observations from the regression line is minimal. The coefficients are selected using the least squares method. In other words, simple linear regression describes the linear model that best approximates the relationship between one input variable and one output variable.

The study of correlation dependences is based on studying relationships between variables in which the values of one variable, taken as the dependent variable, change "on average" depending on the values taken by the other variable, considered as the cause in relation to the dependent one. The action of this cause occurs in a complex interaction of various factors, as a result of which the manifestation of the pattern is obscured by the influence of chance. By calculating the average values of the effective attribute for each group of values of the factor attribute, the influence of chance is partly eliminated. By calculating the parameters of the theoretical line of connection, it is eliminated further, and an unambiguous (in form) change of y with a change of the factor x is obtained.

To study stochastic relationships, the method of comparing two parallel series, the method of analytical groupings, correlation analysis, regression analysis, and some nonparametric methods are used. In general, the task of statistics in the field of studying relationships is not only to quantify their presence, direction, and strength, but also to determine the form (analytical expression) of the influence of the factor attributes on the effective one. To solve it, the methods of correlation and regression analysis are used.

CHAPTER 1. THE REGRESSION EQUATION: THEORETICAL BASIS

1.1. Regression equation: essence and types of functions

Regression (Latin regressio: backward movement, transition from more complex forms of development to less complex ones) is one of the basic concepts of probability theory and mathematical statistics; it expresses the dependence of the mean value of a random variable on the values of another random variable or of several random variables. The concept was introduced by Francis Galton in 1886.

The theoretical regression line is the line around which the points of the correlation field are grouped and which indicates the main direction, the main trend of the relationship.

The theoretical regression line should reflect the change in the average values of the effective attribute y as the values of the factor attribute x change, provided that all other causes, random with respect to the factor x, are completely canceled out. Therefore, this line should be drawn so that the sum of the deviations of the points of the correlation field from the corresponding points of the theoretical regression line equals zero, and the sum of the squares of these deviations is minimal.

y = f(x), the regression equation, is a formula for the statistical relationship between the variables.

A straight line in the plane (in two-dimensional space) is given by the equation y = a + b * x. In more detail: the variable y can be expressed in terms of a constant (a) and a slope (b) multiplied by the variable x. The constant is sometimes also called the intercept, and the slope the regression coefficient or B-coefficient.

An important step in regression analysis is determining the type of function that characterizes the relationship between the features. The main basis should be a meaningful analysis of the nature of the dependence under study and its mechanism. At the same time, it is far from always possible to justify theoretically the form of the connection of each factor with the effective indicator, since the socio-economic phenomena under study are very complex, and the factors forming their level are closely intertwined and interact with one another. Therefore, on the basis of theoretical analysis one can often draw only the most general conclusions about the direction of the connection, the possibility of its change in the studied population, the legitimacy of using a linear relationship, the possible presence of extreme values, etc. A necessary complement to such assumptions is the analysis of specific evidence.

An approximate idea of the line of connection can be obtained from the empirical regression line. The empirical regression line is usually a broken line with more or less significant kinks. This is explained by the fact that the influence of other, unaccounted-for factors affecting the variation of the effective indicator is not fully extinguished in the averages because of the insufficient number of observations; therefore, the empirical line can be used to select and justify the type of theoretical curve only when the number of observations is sufficiently large.

One of the elements of applied studies is the comparison of different equations of dependence, based on the use of quality criteria for the approximation of the empirical data by competing models; a sketch comparing several of these forms follows the list:

1. Linear: y = a + bx.

2. Hyperbolic: y = a + b/x.

3. Exponential: y = a b^x.

4. Parabolic: y = a + bx + cx².

5. Power: y = a x^b.

6. Logarithmic: y = a + b ln x.

7. Logistic: y = a / (1 + b e^(−cx)).
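As an illustration of such a comparison, the sketch below fits several of the linearizable forms to made-up data and ranks them by the least squares criterion. The listed functional forms are the standard textbook ones; the power form is fitted in logarithms, which only approximates the criterion on the original scale.

```python
import numpy as np

# Made-up data; each candidate form is fitted and compared by the sum
# of squared deviations on the original scale.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 3.1, 3.6, 4.0, 4.2, 4.5, 4.6, 4.8])

def sse(y_hat):
    return ((y - y_hat) ** 2).sum()

fits = {}
a, b = np.polyfit(x, y, 1)[::-1]                    # linear: y = a + b*x
fits["linear"] = sse(a + b * x)
a, b = np.polyfit(1 / x, y, 1)[::-1]                # hyperbolic: y = a + b/x
fits["hyperbolic"] = sse(a + b / x)
a, b = np.polyfit(np.log(x), y, 1)[::-1]            # logarithmic: y = a + b*ln x
fits["logarithmic"] = sse(a + b * np.log(x))
la, lb = np.polyfit(np.log(x), np.log(y), 1)[::-1]  # power: y = a*x^b (via logs)
fits["power"] = sse(np.exp(la) * x ** lb)
c, b2, a2 = np.polyfit(x, y, 2)                     # parabolic: y = a + b*x + c*x^2
fits["parabolic"] = sse(a2 + b2 * x + c * x ** 2)

for name, s in sorted(fits.items(), key=lambda kv: kv[1]):
    print(f"{name:12s} SSE = {s:.4f}")
```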

A model with one explanatory variable and one explained variable is a paired regression model. If two or more explanatory (factor) variables are used, we speak of a multiple regression model. Linear, exponential, hyperbolic, power, and other types of functions connecting these variables can be chosen as options.

To find the parameters a and b of the regression equation, the least squares method is used. When the least squares method is applied to find the function that best fits the empirical data, it is required that the sum of the squared deviations of the empirical points from the theoretical regression line be minimal.

The criterion of the least squares method can be written as follows:

S = Σ (y_i − ŷ_i)² → min.

Consequently, applying the least squares method to determine the parameters a and b of the straight line most consistent with the empirical data reduces to an extremum problem.

The following conclusions can be drawn regarding the ratings:

1. The least squares estimates are functions of the sample data, which makes them easy to compute.

2. The least squares estimates are point estimates of the theoretical regression coefficients.

3. The empirical regression line necessarily passes through the point (x̄, ȳ).

4. The empirical regression equation is constructed in such a way that the sum of the deviations equals zero:

Σ (y_i − ŷ_i) = 0.

A graphic representation of the empirical and theoretical link is shown in Figure 1.


The parameter b in the equation is the regression coefficient. In the presence of a direct correlation dependence, the regression coefficient is positive; in the case of an inverse relationship, it is negative. The regression coefficient shows by how much, on average, the value of the effective attribute y changes when the factor attribute x changes by one. Geometrically, the regression coefficient is the slope of the straight line depicting the correlation equation relative to the x-axis (for the equation y = a + bx).

The section of multivariate statistical analysis devoted to recovering dependencies is called regression analysis. The term "linear regression analysis" is used when the function under consideration depends linearly on the estimated parameters (the dependence on the independent variables can be arbitrary). The theory of estimating the unknown parameters is well developed precisely in the case of linear regression analysis. If there is no linearity and it is impossible to pass to a linear problem, then, as a rule, good properties cannot be expected from the estimates. Let us demonstrate the approaches for dependencies of various kinds, for example when the dependence has the form of a polynomial. If the calculation of the correlation characterizes the strength of the relationship between two variables, then regression analysis serves to determine the type of this relationship and makes it possible to predict the value of one (dependent) variable from the value of the other (independent) variable. For linear regression analysis, the dependent variable must have an interval (or ordinal) scale. Binary logistic regression, in turn, reveals the dependence of a dichotomous variable on some other variable measured on any scale. The same conditions of application hold for probit analysis. If the dependent variable is categorical but has more than two categories, multinomial logistic regression is the appropriate method here. Nonlinear relationships between variables measured on an interval scale can also be analyzed; the nonlinear regression method is intended for this.