Simple linear regression. Linear Regression Basics

Basics of data analysis.

A typical challenge in practice is defining dependencies or relationships between variables. In real life, variables are linked to one another. For example, in marketing, the amount of money invested in advertising affects sales; in medical research, the dose of a drug affects the effect; in textile production, the quality of fabric dyeing depends on temperature, humidity, and other parameters; in metallurgy, the quality of steel depends on special additives, and so on. Finding dependencies in data and putting them to use is the task of data analysis.

Suppose you are observing the values of a pair of variables X and Y and you want to find the relationship between them. For instance:

X is the number of visitors to the online store, Y is the volume of sales;

X is the plasma panel diagonal, Y is the price;

X is the purchase price of a stock, Y is the sale price;

X is the price of aluminum on the London exchange, Y is the sales volume;

X is the number of breaks in oil pipelines, Y is the amount of losses;

X is the "age" of an aircraft, Y is the cost of its repair;

X is the sales area, Y is the store's turnover;

X is income, Y is consumption, etc.

The X variable is usually called the independent variable, the Y variable is called the dependent variable. Sometimes the variable X is called the predictor, and the variable Y is called the response.



We want to determine exactly how Y depends on X, or to predict what the values of Y will be for given values of X. Here we observe values of X and the corresponding values of Y. The task is to build a model that allows us to determine Y for values of X that differ from the observed ones. In statistics, such tasks are solved within the framework of regression analysis.

There are various regression models, determined by the choice of the function f(x1, x2, ..., xm):

1) Simple Linear Regression

2) Multiple regression

3) Polynomial regression

The coefficients are called the regression parameters.

The main feature of regression analysis: with its help, you can get specific information about the form and nature of the relationship between the studied variables.

Sequence of Regression Analysis Steps

1. Statement of the problem. At this stage, preliminary hypotheses about the dependence of the investigated phenomena are formed.

2. Determination of dependent and independent (explanatory) variables.

3. Collection of statistical data. Data must be collected for each of the variables included in the regression model.

4. Formulation of a hypothesis about the form of the relationship (simple or multiple, linear or nonlinear).

5. Determination of the regression function (this consists in calculating the numerical values of the parameters of the regression equation).

6. Assessment of the accuracy of the regression analysis.

7. Interpretation of the results obtained. The obtained results of the regression analysis are compared with preliminary hypotheses. The correctness and likelihood of the results obtained are evaluated.

8. Prediction of unknown values ​​of the dependent variable.

With the help of regression analysis it is possible to solve forecasting and classification problems. Predicted values are calculated by substituting values of the explanatory variables into the regression equation. The classification problem is solved as follows: the regression line divides the entire set of objects into two classes; the part of the set where the value of the function is greater than zero belongs to one class, and the part where it is less than zero belongs to the other class.

The main tasks of regression analysis: establishing the form of dependence, determining the regression function, evaluating the unknown values ​​of the dependent variable.

Linear regression

Linear regression reduces to finding an equation of the form

Y = a + bx + e, or ŷ = a + bx. (1.1)

x is called the independent variable or predictor.

ŷ is the dependent variable or response variable; it is the value we expect for y (on average) if we know the value of x, i.e. the "predicted value of y".

· a is the free term (intercept) of the estimated line; it is the value of Y when x = 0 (Fig. 1).

· b is the slope or gradient of the estimated line; it represents the amount by which Y increases on average if we increase x by one unit.

· a and b are called the regression coefficients of the estimated line, although this term is often used only for b.

· e represents unobservable random variables with mean 0, also called observation errors; the errors are assumed to be uncorrelated with each other.

Fig. 1. Linear regression line showing the intercept a and the slope b (the amount by which Y increases as x increases by one unit)

This equation allows us to obtain theoretical values of the response for given values of the factor X by substituting the actual values of X into it. On the graph, the theoretical values form the regression line.

In most cases (if not always), there is a certain scatter of observations relative to the regression line.

The theoretical regression line is the line around which the points of the correlation field are grouped and which indicates the main direction, the main trend, of the relationship.

An important stage of regression analysis is determining the type of function that characterizes the relationship between the features. The main basis for choosing the type of equation should be a meaningful analysis of the nature of the dependence under study and its mechanism.

To find the parameters a and b of the regression equation we use the method of least squares (OLS). When applying OLS to find the function that best fits the empirical data, the criterion is that the sum of squared deviations (residuals) of the empirical points from the theoretical regression line should be minimal.

The fit is estimated by considering the residuals (the vertical distance of each point from the line, i.e. residual = observed y − predicted y, Fig. 2).

The best fit line is chosen so that the sum of the squares of the residuals is minimal.

Fig. 2. Linear regression line with the residuals shown (vertical dashed lines) for each point.

After simple transformations, we obtain the system of normal equations of the least squares method for determining the values of the parameters a and b of the straight-line regression equation from empirical data:

Σy = n·a + b·Σx,
Σxy = a·Σx + b·Σx². (1.2)

Solving this system of equations for b, we get the following formula for determining this parameter:

b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², (1.3)

where x̄ and ȳ are the mean values of x and y.

The value of the parameter a is obtained by dividing both sides of the first equation of the system by n:

a = ȳ − b·x̄.
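As an illustration, the closed-form OLS estimates above can be computed directly with NumPy. This is a minimal sketch with invented sample data (the arrays x and y are placeholders, not values from the text):

```python
import numpy as np

# Hypothetical sample data (placeholders for empirical observations)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

x_mean, y_mean = x.mean(), y.mean()

# Slope b from formula (1.3): sum of cross-deviations over sum of squared deviations
b = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# Intercept a from the first normal equation divided by n
a = y_mean - b * x_mean

print(f"a = {a:.4f}, b = {b:.4f}")

# Predicted (theoretical) values that lie on the regression line
y_hat = a + b * x
```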

The parameter b in the equation is called the regression coefficient. With a direct relationship the regression coefficient is positive, and with an inverse relationship it is negative.

If the sign of the regression coefficient is positive, the relationship between the dependent variable and the independent variable will be positive.

If the sign of the regression coefficient is negative, the relationship between the dependent variable and the independent variable is negative (inverse).

The regression coefficient shows by how much, on average, the value of the response y changes when the factor X changes by one unit. Geometrically, the regression coefficient is the slope of the straight line representing the regression equation relative to the X axis.

Because of the linear relationship, we expect y to change as x changes, and we call this the variation that is due to, or explained by, the regression. The residual variation should be as small as possible.

If this is the case, then most of the variation will be due to regression, and the points will lie close to the regression line, i.e. the line matches the data well.

A quantitative characteristic of the degree of linear dependence between the random variables X and Y is the correlation coefficient r (an indicator of the closeness of the relationship between two features).

Correlation coefficient:

r = (n·Σxy − Σx·Σy) / √( (n·Σx² − (Σx)²) · (n·Σy² − (Σy)²) ),

where x is the value of the factor attribute;

y is the value of the resulting feature;

n is the number of data pairs.
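A quick numerical check of this formula, assuming the same invented x and y arrays as in the earlier sketch; np.corrcoef gives the same value:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
n = len(x)

# Computational form of the correlation coefficient
r = (n * np.sum(x * y) - x.sum() * y.sum()) / np.sqrt(
    (n * np.sum(x**2) - x.sum()**2) * (n * np.sum(y**2) - y.sum()**2)
)

print(r)                        # value from the formula
print(np.corrcoef(x, y)[0, 1])  # same value via NumPy
print(r**2)                     # coefficient of determination for paired regression
```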


Fig. 3 - Variants of the location of the "cloud" of points

If the correlation coefficient r = 1, then there is a functional linear dependence between X and Y: all points (xi, yi) lie on a straight line.

If the correlation coefficient r = 0 (r ≈ 0), then X and Y are said to be uncorrelated, i.e. there is no linear relationship between them.

The relationship between features (on the Chaddock scale) can be strong, medium, or weak. The closeness of the relationship is determined by the value of the correlation coefficient, which can take values from -1 to +1 inclusive. The criteria for assessing the closeness of the relationship are shown in Fig. 4.

Fig. 4. Quantitative criteria for assessing the closeness of the relationship

Any relationship between variables has two important properties: magnitude and reliability. The stronger the relationship between two variables, the greater the magnitude of the relationship and the easier it is to predict the value of one variable from the value of the other variable. The magnitude of the relationship is easier to measure than the reliability.

The reliability of the dependence is no less important than its magnitude. This property is associated with the representativeness of the sample under study. The reliability of a dependence characterizes how likely it is that this dependence will be found again on other data.

As the value of the dependence of the variables grows, its reliability usually increases.

The share of the total variance that is explained by the regression is called the coefficient of determination; it is usually expressed as a percentage and denoted R² (in paired linear regression this is r², the square of the correlation coefficient). It allows a subjective assessment of the quality of the regression equation.

The coefficient of determination measures the proportion of the spread around the mean that is "explained" by the constructed regression. It lies in the range from 0 to 1. The closer the coefficient of determination is to 1, the better the regression "explains" the dependence in the data; a value close to zero means the constructed model is of poor quality. In multiple regression, the coefficient of determination can be pushed arbitrarily close to 1 simply by adding more distinct predictors.

The difference 100% − R² is the percentage of variance that cannot be explained by the regression.

Multiple regression

Multiple regression is used in situations where one dominant factor cannot be distinguished from the multitude of factors influencing the effective trait and the influence of several factors must be taken into account. For example, the volume of production is determined by the value of the main and working capital, the number of personnel, the level of management, etc., the level of demand depends not only on the price, but also on the funds available to the population.

The main goal of multiple regression is to build a model with several factors and to determine the influence of each factor separately, as well as their combined effect on the studied indicator.

Multiple regression is an equation of a relationship with several independent variables:

y = f(x1, x2, ..., xm) + E,

where y is the dependent variable and x1, ..., xm are the independent (explanatory) variables.
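A minimal sketch of fitting a linear multiple regression with two explanatory variables by ordinary least squares in NumPy (the data and variable names here are invented for illustration):

```python
import numpy as np

# Hypothetical observations: two factors and a response
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([0.5, 1.0, 0.8, 1.6, 1.4, 2.0])
y  = np.array([3.2, 5.1, 6.0, 8.4, 9.0, 11.3])

# Design matrix with a column of ones for the intercept a0
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares estimates of a0, a1, a2
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
a0, a1, a2 = coeffs
print(a0, a1, a2)
```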

During their studies, students very often come across a variety of equations. One of them, the regression equation, is discussed in this article. This type of equation is used specifically to describe the characteristics of the relationship between mathematical parameters. This kind of equation is used in statistics and econometrics.

Defining Regression

In mathematics, regression means a quantity that describes the dependence of the mean value of one quantity on the values of another quantity. The regression equation shows the mean of one feature as a function of another feature. The regression function is the simple equation y = f(x), in which y is the dependent variable and x is the independent variable (the factor attribute). In fact, the regression is expressed as y = f(x).

What are the types of relationships between variables

In general, there are two opposite types of relationship: correlation and regression.

The first is characterized by the equal standing of the variables; in this case it is not known for certain which variable depends on the other.

If there is no such equality between the variables and the conditions of the problem specify which variable is explanatory and which is dependent, then we can speak of a relationship of the second type. In order to build a linear regression equation, it is necessary to find out which type of relationship is observed.

Regression types

Today, there are 7 different types of regression: hyperbolic, linear, multiple, nonlinear, paired, inverse, logarithmically linear.

Hyperbolic, linear and logarithmic

A linear regression equation is used in statistics to clearly describe the parameters of a relationship. It looks like y = c + m·x + E. The hyperbolic equation has the form of a regular hyperbola: y = c + m/x + E. The log-linear equation expresses the relationship using a logarithmic function: ln y = ln c + m·ln x + ln E.

Multiple and nonlinear

Two more complex types of regression are multiple and nonlinear. The multiple regression equation is expressed by the function y = f(x1, x2, ..., xm) + E. In this situation, y is the dependent variable and the x's are the explanatory ones. The variable E is stochastic and includes the influence of other factors on the equation. The nonlinear regression equation is a bit contradictory: on the one hand, it is not linear with respect to the indicators included in it, but on the other hand, it is linear with respect to the estimated parameters.

Inverse and Paired Regressions

The inverse is the kind of function that must be converted to a linear form. In the most traditional applications it takes the form of the function y = 1 / (c + m·x + E). The paired regression equation describes the relationship between the data as a function y = f(x) + E. As in the other equations, y depends on x, and E is a stochastic parameter.

Correlation concept

This is an indicator that demonstrates the existence of a relationship between two phenomena or processes. The strength of the relationship is expressed as a correlation coefficient. Its value lies within the interval [-1; +1]. A negative value indicates an inverse relationship, a positive value a direct one. If the coefficient equals 0, there is no relationship. The closer the value is to 1, the stronger the connection between the parameters; the closer to 0, the weaker.

Methods

Parametric correlation methods can assess the closeness of the relationship. They are based on distribution estimates and are used to study parameters obeying the normal distribution law.

The parameters of the linear regression equation are needed to identify the type of dependence and the regression function and to evaluate the indicators of the chosen relationship formula. The correlation field is used as a method for identifying a relationship: all available data are displayed graphically in a rectangular 2D coordinate system. This is how the correlation field is formed. The values of the explanatory factor are plotted along the abscissa, and the values of the dependent factor along the ordinate. If there is a functional relationship between the parameters, the points line up in the form of a line.

If the correlation coefficient of such data is less than 30%, we can speak of an almost complete absence of a relationship. If it is between 30% and 70%, this indicates a relationship of medium strength. A value of 100% is evidence of a functional relationship.

A non-linear regression equation, like a linear one, must be supplemented with a correlation index (R).

Correlation for multiple regression

The coefficient of determination is a measure of the square of the multiple correlation. It indicates the closeness of the relationship between the set of indicators considered and the studied feature. It can also indicate the nature of the influence of the parameters on the result. The multiple regression equation is evaluated using this indicator.

To calculate the multiple correlation index, the coefficient of determination is computed first and its square root is taken.

Least squares method

This method is a way of estimating regression coefficients. Its essence lies in minimizing the sum of squared deviations of the observed values from the values given by the function.

A paired linear regression equation can be estimated using this method. This type of equation is used when a paired linear relationship is detected between the indicators.

Equation parameters

Each parameter of the linear regression function has a specific meaning. The paired linear regression equation contains two parameters: c and m. The parameter m shows the average change in the final indicator of the function y when the variable x decreases (increases) by one conventional unit. If the variable x is zero, then the function equals the parameter c. If the variable x is not zero, then the term c has no economic meaning. The only thing that matters for the function is the sign in front of c: a minus indicates a slowed change in the result compared to the factor, a plus indicates an accelerated change in the result.

Each parameter of the regression equation can itself be expressed through an equation. For example, the parameter c has the form c = ȳ − m·x̄.

Grouped data

There are problem settings in which all information is grouped by the attribute x, and for each group the corresponding average values of the dependent indicator are given. In this case, the average values characterize how the indicator changes depending on x. Thus, the grouped information helps to find the regression equation and is used to analyze the relationship. However, this method has its drawbacks. Unfortunately, averages are often subject to external fluctuations. These fluctuations do not reflect the regularity of the relationship; they merely mask its "noise". Averages show the patterns of the relationship much less clearly than a linear regression equation. However, they can be used as a basis for finding the equation. By multiplying the size of each group by the corresponding average, you obtain the sum of y within the group. Next, you need to add up all the sums obtained and find the overall total of y. It is a little more difficult to compute the sum xy. If the intervals are small, the value of x can conventionally be taken as the same for all units within a group. It should be multiplied by the sum of y to obtain the sum of the products of x and y. Then all these sums are added together and the overall total Σxy is obtained.

Multiple Regression Equation: Assessing the Significance of the Relationship

As discussed earlier, multiple regression uses a function of the form y = f(x1, x2, …, xm) + E. Most often, such an equation is used to solve problems of supply and demand for a product or of interest income on repurchased shares, and to study the causes and form of the production cost function. It is also actively used in a wide variety of macroeconomic studies and calculations; at the microeconomic level such an equation is used somewhat less often.

The main task of multiple regression is to build a model for a large amount of data and then to determine what influence each factor, individually and jointly, has on the indicator being modeled and on its coefficients. The regression equation can take a wide variety of forms. Two types of functions are usually used to assess the relationship: linear and nonlinear.

A linear function is depicted by a relationship of the form y = a0 + a1·x1 + a2·x2 + ... + am·xm. Here a1, a2, ..., am are considered the coefficients of "pure" regression. They characterize the average change in the parameter y with a change (decrease or increase) in each corresponding parameter x by one unit, provided that the other indicators remain constant.

Nonlinear equations are, for example, of the power-function form y = a·x1^b1·x2^b2·...·xm^bm. In this case, the indicators b1, b2, ..., bm are called elasticity coefficients; they show by how many percent the result will change when the corresponding indicator x increases (decreases) by 1%, with the other factors held constant.
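If the power-function form above is linearized by taking logarithms (ln y = ln a + b1·ln x1 + b2·ln x2), the elasticity coefficients can be estimated by ordinary least squares. A sketch under that assumption, with made-up positive-valued data:

```python
import numpy as np

# Hypothetical positive-valued observations
x1 = np.array([1.2, 2.5, 3.1, 4.8, 5.5, 7.0])
x2 = np.array([0.8, 1.1, 1.9, 2.4, 3.0, 3.6])
y  = np.array([2.0, 4.1, 6.3, 9.5, 12.0, 16.2])

# Linearize the power function y = a * x1**b1 * x2**b2 with logarithms
X = np.column_stack([np.ones_like(x1), np.log(x1), np.log(x2)])
coeffs, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)

a = np.exp(coeffs[0])           # scale factor
b1, b2 = coeffs[1], coeffs[2]   # elasticity coefficients (% change in y per 1% change in x)
print(a, b1, b2)
```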

What factors need to be considered when constructing multiple regression

In order to correctly construct multiple regression, it is necessary to find out which factors should be paid special attention to.

It is necessary to have a certain understanding of the nature of the relationship between the economic factors and the indicator being modeled. The factors to be included must meet the following criteria:

  • They must be quantifiable. If a factor describes the quality of an object, it should in any case be given a quantitative form.
  • There should be no intercorrelation between factors, and no functional relationship between them. Such relationships most often lead to irreversible consequences: the system of normal equations becomes ill-conditioned, which entails unreliable and unstable estimates.
  • If the correlation between factors is very high, it is impossible to isolate the individual influence of each factor on the final result, and the coefficients become uninterpretable.

Construction methods

There are a myriad of methods and techniques for selecting factors for an equation. However, all of them are based on selecting factors using the correlation indicator. Among them are:

  • Exclusion method.
  • Method of inclusion.
  • Regression analysis step by step.

The first method involves removing factors one at a time from the full set. The second involves introducing additional factors one at a time. The third, stepwise regression, is the elimination of factors that were previously included in the equation. Each of these methods has a right to exist. They have their pros and cons, but each of them in its own way can solve the problem of screening out unnecessary indicators. As a rule, the results obtained by the different methods are fairly close.

Multivariate analysis methods

Such methods of determining factors are based on considering individual combinations of interrelated features. They include discriminant analysis, pattern recognition, principal component analysis, and cluster analysis. In addition, there is also factor analysis, which arose from the development of the method of principal components. All of them are applied in certain circumstances, under certain conditions and factors.

With a linear relationship between the two studied characteristics, in addition to calculating correlations, the regression coefficient is calculated.

In the case of a straight-line correlation, each change in one feature corresponds to a well-defined change in another feature. However, the correlation coefficient shows this relationship only in relative terms, in fractions of a unit. With the help of regression analysis, the magnitude of this relationship is obtained in named units. The amount by which the first feature changes on average when the second changes by one unit of measurement is called the regression coefficient.

In contrast to correlation analysis, regression analysis gives broader information, since by calculating the two regression coefficients Rx/y and Ry/x it is possible to determine both the dependence of the first feature on the second and of the second on the first. Expressing the regression relationship by an equation makes it possible, for a given value of one feature, to establish the value of the other.

The regression coefficient R is the product of the correlation coefficient and the ratio of the standard deviations calculated for each feature. It is calculated by the formula

Rx/y = r · (SX / SY),

where R is the regression coefficient; SX is the standard deviation of the first feature, which changes in response to the change in the second; SY is the standard deviation of the second feature, the change in which causes the first feature to change; r is the correlation coefficient between these features; x is the function; y is the argument.

This formula determines the amount by which x changes when y changes by one unit of measurement. If the reverse calculation is needed, you can find the change in y when x changes by one unit of measurement using the formula:

Ry/x = r · (SY / SX).

In this case, the active role in changing one attribute relative to the other is reversed; compared with the previous formula, the argument becomes the function and vice versa. SX and SY are taken in named units.

There is a clear relationship between the values of r and R, expressed by the fact that the product of the regression of x on y and the regression of y on x equals the square of the correlation coefficient, i.e.

Rx/y · Ry/x = r²

This indicates that the correlation coefficient is the geometric mean of both values ​​of the regression coefficients of a given sample. This formula can be used to check the correctness of calculations.
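A quick numerical check of this identity, assuming the same kind of invented paired sample used in the earlier sketches:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

r = np.corrcoef(x, y)[0, 1]
sx, sy = x.std(ddof=1), y.std(ddof=1)

R_xy = r * sx / sy   # regression coefficient of x on y
R_yx = r * sy / sx   # regression coefficient of y on x

print(R_xy * R_yx, r**2)  # the two values coincide
```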

When processing numerical material on calculating machines, expanded formulas for the regression coefficient can be used:

Ry/x = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²), or Ry/x = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)².
For the regression coefficient, its representativeness error can be calculated. The error of the regression coefficient equals the error of the correlation coefficient multiplied by the ratio of the standard deviations:

mR = mr · (SY / SX).

The reliability criterion of the regression coefficient is calculated using the usual formula:

tR = R / mR,

and as a result it equals the reliability criterion of the correlation coefficient:

tR = r / mr = tr.

The reliability of the tR value is established from Student's table with ν = n − 2 degrees of freedom, where n is the number of observation pairs.

Curvilinear regression.

CURVILINEAR REGRESSION: any nonlinear regression in which the equation for changes in one variable (y) as a function of changes in another (x) is quadratic, cubic, or of higher order. While it is always mathematically possible to obtain a regression equation that fits every squiggle of the curve, most of these perturbations result from sampling or measurement errors, and such a "perfect" fit achieves nothing. It is not always easy to determine whether a curvilinear regression fits a dataset, although statistical tests exist to determine whether each higher power of the equation significantly increases the goodness of fit.

Curve fitting is performed by the same least squares method as straight-line fitting: the regression line must satisfy the condition of a minimum sum of squared distances to each point of the correlation field. In this case, in equation (1), y is the value of the function calculated from the equation of the chosen curvilinear relationship for the actual values xj. For example, if a second-order parabola is chosen to approximate the relationship, then

y = a + bx + cx², (14)

and the difference between a point lying on the curve and a given point of the correlation field for the corresponding argument can be written, similarly to Eq. (3), in the form

Δyj = yj − (a + bxj + cxj²). (15)

In this case, the sum of squared distances from each point of the correlation field to the new regression line for a second-order parabola has the form:

S² = Σ Δyj² = Σ (yj − (a + bxj + cxj²))². (16)

From the minimum condition for this sum, the partial derivatives of S² with respect to a, b and c are set equal to zero. After performing the necessary transformations, we obtain a system of three equations with three unknowns for determining a, b and c (m is the number of measurements):

Σy = m·a + b·Σx + c·Σx²,
Σyx = a·Σx + b·Σx² + c·Σx³,
Σyx² = a·Σx² + b·Σx³ + c·Σx⁴. (17)

Solving this system of equations for a, b and c, we find the numerical values of the regression coefficients. The quantities Σy, Σx, Σx², Σyx, Σyx², Σx³, Σx⁴ are found directly from the production measurements.

The measure of the closeness of the relationship for a curvilinear dependence is the theoretical correlation ratio ηxy, which is the square root of the ratio of two variances: the mean square σp² of the deviations of the calculated values y'j of the function (found from the regression equation) from the arithmetic mean Ȳ of the value y, to the mean square σy² of the deviations of the actual values of the function yj from its arithmetic mean:

ηxy = (σp² / σy²)^(1/2) = ( Σ(y'j − Ȳ)² / Σ(yj − Ȳ)² )^(1/2). (18)

The squared correlation ratio ηxy² shows the share of the total variability of the dependent variable y that is due to the variability of the argument x. This indicator is called the coefficient of determination. In contrast to the correlation coefficient, the correlation ratio can take only positive values from 0 to 1. In the complete absence of a relationship the correlation ratio equals zero, in the presence of a functional relationship it equals one, and in the presence of a regression relationship of varying closeness it takes values between zero and one. The choice of the type of curve is of great importance in regression analysis, since the accuracy of the approximation and the statistical estimates of the closeness of the relationship depend on the type of relationship chosen. The simplest method for choosing the type of curve is to construct correlation fields and select the appropriate types of regression equations based on the location of the points in these fields. Regression analysis methods make it possible to find the numerical values of the regression coefficients for complex types of interrelation between parameters described, for example, by polynomials of high degree. Often the shape of the curve can be determined from the physical nature of the process or phenomenon under consideration. It makes sense to use polynomials of high degree to describe rapidly changing processes when the ranges of fluctuation of the parameters of these processes are significant.
As applied to the study of the metallurgical process, it is sufficient to use curves of lower orders, for example a second-order parabola. This curve can have one extremum, which, as practice has shown, is quite enough to describe various characteristics of the metallurgical process. The results of calculating the parameters of a pairwise correlation relationship are reliable and of practical value only if the information used was obtained under conditions of wide ranges of fluctuation of the argument with all other process parameters held constant. Consequently, methods for studying the pairwise correlation of parameters can be used to solve practical problems only when there is confidence that there are no other serious influences on the function besides the analyzed argument. Under production conditions it is impossible to run the process this way for a long time. However, if we have information about the main parameters of the process that affect its results, then the influence of these parameters can be eliminated mathematically and the relationship between the function and the argument of interest can be isolated in a "pure form". Such a relationship is called partial, or individual. To determine it, the multiple regression method is used.
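A sketch of fitting a second-order parabola and computing the correlation ratio, assuming invented measurement data (np.polyfit solves the same least-squares problem as the normal equations (17)):

```python
import numpy as np

# Hypothetical measurements with a curvilinear trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([2.3, 3.1, 4.9, 7.8, 11.6, 16.9, 23.1])

# Fit y = a + b*x + c*x**2 (np.polyfit returns the highest power first)
c, b, a = np.polyfit(x, y, deg=2)
y_fit = a + b * x + c * x**2

# Theoretical correlation ratio (18) and coefficient of determination
eta = np.sqrt(np.sum((y_fit - y.mean())**2) / np.sum((y - y.mean())**2))
print(a, b, c)
print(eta, eta**2)
```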

Correlation ratio.

The correlation ratio and the correlation index are numerical characteristics, closely related to the concept of a random variable, or rather to a system of random variables. Therefore, to introduce and define their meaning and role, it is necessary to clarify the concept of a system of random variables and some properties inherent in them.

Two or more random variables describing a certain phenomenon are called a system or a complex of random variables.

The system of several random variables X, Y, Z,…, W is usually denoted by (X, Y, Z,…, W).

For example, a point on a plane is described not by one coordinate but by two, and a point in space by three.

The properties of a system of several random variables are not limited to the properties of the individual random variables included in the system; they also include the mutual relations (dependencies) between the random variables. Therefore, when studying a system of random variables, one should pay attention to the nature and degree of their dependence. This dependence can be more or less pronounced, more or less close. In other cases, the random variables turn out to be practically independent.

A random variable Y is called independent of a random variable X if the distribution law of a random variable Y does not depend on what value X has taken.

It should be noted that the dependence and independence of random variables is always a mutual phenomenon: if Y does not depend on X, then the value of X does not depend on Y. Taking this into account, we can give the following definition of the independence of random variables.

Random variables X and Y are called independent if the distribution law of each of them does not depend on what value the other has taken. Otherwise, the quantities X and Y are called dependent.

The distribution law of a random variable is any relation that establishes a connection between the possible values ​​of a random variable and the corresponding probabilities.

The concept of "dependence" of random variables used in probability theory is somewhat different from the usual concept of "dependence" of quantities used in mathematics. A mathematician means by "dependence" only one type of dependence: complete, rigid, so-called functional dependence. Two quantities X and Y are called functionally dependent if, knowing the value of one of them, it is possible to determine the value of the other exactly.

In the theory of probability, there is a slightly different type of dependence - probabilistic dependence. If the value of Y is related to the value of X by a probabilistic dependence, then, knowing the value of X, it is impossible to accurately indicate the value of Y, but you can indicate its distribution law, depending on what value the value of X has taken.

The probabilistic dependence can be more or less close; as the closeness of the probabilistic dependence increases, it approaches the functional one more and more. Thus, functional dependence can be considered as an extreme, limiting case of the closest probabilistic dependence. Another extreme case is complete independence of random variables. Between these two extreme cases lie all the gradations of probabilistic dependence - from the strongest to the weakest.

The probabilistic relationship between random variables is often encountered in practice. If the random variables X and Y are in a probabilistic relationship, this does not mean that with a change in the value of X, the value of Y changes in a quite definite way; it only means that with a change in the value of X, the value of Y also tends to change (increase or decrease with an increase in X). This trend is observed only in general outline, and in each individual case deviations from it are possible.

What is regression?

Consider two continuous variables x = (x1, x2, ..., xn), y = (y1, y2, ..., yn).

Let's plot the points on a 2D scatter plot and say we have a linear relationship if the data can be fitted with a straight line.

If we believe that y depends on x, and that changes in y are caused precisely by changes in x, we can define the regression line (the regression of y on x), which best describes the straight-line relationship between the two variables.

The statistical use of the word "regression" comes from a phenomenon known as regression to the mean, attributed to Sir Francis Galton (1889).

He showed that although tall fathers tend to have tall sons, the average height of the sons is smaller than that of their tall fathers. The average height of the sons "regressed", or moved back, toward the average height of all fathers in the population. Thus, on average, tall fathers have shorter (but still tall) sons, and short fathers have taller (but still rather short) sons.

Regression line

The mathematical equation that estimates the simple (paired) linear regression line is:

Y = a + bx

x is called the independent variable or predictor.

Y is the dependent variable or response variable; it is the value we expect for y (on average) if we know the value of x, i.e. the "predicted value of y".

  • a is the free term (intercept) of the estimated line; it is the value of Y when x = 0 (Fig. 1).
  • b is the slope or gradient of the estimated line; it represents the amount by which Y increases on average if we increase x by one unit.
  • a and b are called the regression coefficients of the estimated line, although this term is often used only for b.

Paired linear regression can be extended to include more than one independent variable; in this case it is known as multiple regression.

Fig. 1. Linear regression line showing the intercept a and the slope b (the amount by which Y increases as x increases by one unit)

Least squares method

We perform regression analysis using a sample of observations, where a and b are sample estimates of the true (population) parameters α and β, which determine the linear regression line in the population.

The simplest method for determining the coefficients a and b is the method of least squares (OLS).

The fit is estimated by considering the residuals (the vertical distance of each point from the line, i.e. residual = observed y − predicted y, Fig. 2).

The best fit line is chosen so that the sum of the squares of the residuals is minimal.

Fig. 2. Linear regression line with the residuals shown (vertical dashed lines) for each point.

Linear Regression Assumptions

So, for each observed value, the residual is equal to the difference between the observed value and the corresponding predicted value. Each residual can be positive or negative.

You can use residuals to test the following assumptions underlying linear regression:

  • there is a linear relationship between x and y;
  • the residuals are normally distributed with zero mean;
  • the variance of the residuals is constant;
  • the residuals are not correlated with each other.

If the assumptions of linearity, normality and/or constant variance are questionable, we can transform x or y and calculate a new regression line for which these assumptions are satisfied (for example, use a logarithmic transformation, etc.).
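A small sketch of how these residual checks might look in practice, continuing with the invented x and y arrays from the earlier sketches:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

b, a = np.polyfit(x, y, deg=1)   # slope, intercept
residuals = y - (a + b * x)

print(residuals.mean())          # should be close to zero

# Residual plot: a random horizontal band suggests linearity and constant variance
plt.scatter(x, residuals)
plt.axhline(0.0, linestyle="--")
plt.xlabel("x")
plt.ylabel("residual")
plt.show()
```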

Anomalous values (outliers) and influential points

An "influential" observation, if omitted, changes one or more estimates of the model parameters (ie, slope or intercept).

An outlier (an observation that contradicts most of the values in a dataset) can be an "influential" observation and can often be detected visually on a 2D scatter plot or a residual plot.

For both outliers and "influential" observations (points), the model is fitted both with and without them, and attention is paid to the change in the estimates (regression coefficients).

When performing analysis, do not automatically discard outliers or influence points, as simple ignoring can affect the results obtained. Always investigate and analyze the causes of these outliers.

Linear regression hypothesis

When constructing a linear regression, the null hypothesis is tested that the general slope of the regression line β is zero.

If the slope of the line is zero, there is no linear relationship between x and Y: changes in x do not affect Y.

To test the null hypothesis that the true slope is zero, you can use the following algorithm:

Calculate the test statistic equal to the ratio

T = b / SE(b),

which follows a t distribution with n − 2 degrees of freedom, where the standard error of the coefficient is

SE(b) = s_res / √( Σ(x − x̄)² ),

and s²_res is the estimate of the variance of the residuals.

Usually, if the achieved significance level is less than 0.05, the null hypothesis is rejected.

The 95% confidence interval for the population slope is

b ± t(0.05, n − 2) · SE(b),

where t(0.05, n − 2) is the percentage point of the t distribution with n − 2 degrees of freedom that gives a two-sided probability of 0.05.

This is the interval that contains the population slope with 95% probability.

For large samples the percentage point can be approximated by 1.96 (that is, the test statistic tends to the normal distribution).
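A sketch of this test and the confidence interval, assuming the same invented data as before; SciPy's Student-t distribution supplies the percentage point and the p-value:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
n = len(x)

b, a = np.polyfit(x, y, deg=1)
residuals = y - (a + b * x)

# Residual variance estimate and standard error of the slope
s2_res = np.sum(residuals**2) / (n - 2)
se_b = np.sqrt(s2_res / np.sum((x - x.mean())**2))

# Test statistic and two-sided p-value for H0: beta = 0
T = b / se_b
p_value = 2 * stats.t.sf(abs(T), df=n - 2)

# 95% confidence interval for the slope
t_crit = stats.t.ppf(0.975, df=n - 2)
ci = (b - t_crit * se_b, b + t_crit * se_b)
print(T, p_value, ci)
```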

Evaluation of the quality of linear regression: the coefficient of determination R²

Because of the linear relationship between y and x, we expect y to change as x changes, and we call this the variation that is due to, or explained by, the regression. The residual variation should be as small as possible.

If this is the case, then most of the variation will be due to regression, and the points will lie close to the regression line, i.e. the line matches the data well.

The proportion of the total variance that is explained by the regression is called the coefficient of determination; it is usually expressed as a percentage and denoted R² (in paired linear regression this is r², the square of the correlation coefficient). It allows a subjective assessment of the quality of the regression equation.

The difference 100% − R² is the percentage of variance that cannot be explained by the regression.

There is no formal test for evaluating R²; we have to rely on subjective judgment to determine the quality of the regression line fit.

Applying a regression line to forecast

You can use a regression line to predict a value of y from a value of x within the observed range (never extrapolate beyond these limits).

We predict the mean of y for observations that have a particular value of x by substituting that value of x into the regression line equation.

So, if we predict y at a given x, we use this predicted value and its standard error to estimate a confidence interval for the true population mean of y.

Repeating this procedure for different values of x allows you to construct confidence limits for the whole line. This is the band, or area, that contains the true line, for example with 95% confidence.
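A sketch of computing the 95% confidence interval for the predicted mean at a new x value, under the same invented-data assumption (using the usual standard-error formula for the mean prediction in simple linear regression):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])
n = len(x)

b, a = np.polyfit(x, y, deg=1)
residuals = y - (a + b * x)
s2_res = np.sum(residuals**2) / (n - 2)

x_new = 3.5                      # must lie within the observed range
y_pred = a + b * x_new

# Standard error of the estimated mean of y at x_new
se_mean = np.sqrt(s2_res * (1.0 / n + (x_new - x.mean())**2 / np.sum((x - x.mean())**2)))

t_crit = stats.t.ppf(0.975, df=n - 2)
ci = (y_pred - t_crit * se_mean, y_pred + t_crit * se_mean)
print(y_pred, ci)
```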

Simple regression designs

Simple regression designs contain one continuous predictor. If there are 3 cases with values of the predictor P of, say, 7, 4, and 9, and the design includes the first-order effect of P, then the design matrix X has the form

X =
[ 1  7 ]
[ 1  4 ]
[ 1  9 ]

and the regression equation using P for X1 looks like

Y = b0 + b1·P

If a simple regression design contains a higher-order effect of P, such as a quadratic effect, then the values in column X1 of the design matrix are raised to the second power:

X =
[ 1  49 ]
[ 1  16 ]
[ 1  81 ]

and the equation takes the form

Y = b0 + b1·P²

Sigma-restricted and overparameterized coding methods do not apply to simple regression designs or to other designs containing only continuous predictors (since there simply are no categorical predictors). Regardless of the coding method chosen, the values of the continuous variables are raised to the appropriate power and used as the values of the X variables; no recoding is performed. In addition, when describing regression designs, you can omit consideration of the design matrix X and work only with the regression equation.
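A small sketch of building these two design matrices and fitting the corresponding equations by least squares (the response values here are invented for illustration):

```python
import numpy as np

P = np.array([7.0, 4.0, 9.0])
Y = np.array([21.5, 13.2, 27.8])   # hypothetical responses

# First-order design: columns for the intercept and P
X1 = np.column_stack([np.ones_like(P), P])
b_lin, *_ = np.linalg.lstsq(X1, Y, rcond=None)    # b0, b1 in Y = b0 + b1*P

# Quadratic design: the predictor column is raised to the second power
X2 = np.column_stack([np.ones_like(P), P**2])
b_quad, *_ = np.linalg.lstsq(X2, Y, rcond=None)   # b0, b1 in Y = b0 + b1*P**2

print(b_lin, b_quad)
```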

Example: Simple Regression Analysis

This example uses the data presented in the table:

Fig. 3. Table of initial data.

The data were compiled by comparing the 1960 and 1970 censuses for 30 randomly selected districts. The district names are used as observation names. Information on each variable is presented below:

Fig. 4. Table of variable specifications.

Research task

For this example, we will analyze the correlates of poverty, that is, the variables that predict the percentage of families below the poverty line. Therefore, we will treat variable 3 (Pt_Poor) as the dependent variable.

It can be hypothesized that population change and the percentage of families below the poverty line are related. It seems reasonable to expect that poverty leads to population outflow, hence there will be a negative correlation between the percentage of people below the poverty line and population change. Therefore, we will treat variable 1 (Pop_Chng) as a predictor variable.

Viewing Results

Regression coefficients

Fig. 5. Regression coefficients of Pt_Poor on Pop_Chng.

At the intersection of the Pop_Chng row and the Param. column, the non-standardized coefficient for the regression of Pt_Poor on Pop_Chng is -0.40374. This means that for every unit decrease in population there is a 0.40374 increase in the poverty rate. The upper and lower (default) 95% confidence limits for this non-standardized coefficient do not include zero, so the regression coefficient is significant at the p < .05 level. Note the standardized coefficient, which for simple regression designs is also the Pearson correlation coefficient; it equals -.65, which means that for every standard-deviation decrease in population there is a .65 standard-deviation increase in the poverty rate.

Distribution of variables

Correlation coefficients can become significantly overestimated or underestimated if there are large outliers in the data. Let us examine the distribution of the dependent variable Pt_Poor by district. To do this, let's build a histogram of the Pt_Poor variable.

Fig. 6. Histogram of the Pt_Poor variable.

As you can see, the distribution of this variable differs markedly from the normal distribution. However, although two districts (the two right-hand bars) have a higher percentage of families below the poverty line than expected under the normal distribution, they appear to be "within range".

Fig. 7. Histogram of the Pt_Poor variable.

This judgment is somewhat subjective. As a rule of thumb, outliers should be taken into account if an observation (or observations) does not fall within the interval mean ± 3 standard deviations. In this case, it is worth repeating the analysis with and without the outliers to make sure they do not have a significant effect on the correlation between the variables.

Scatter plot

If there is an a priori hypothesis about the relationship between the given variables, it is useful to check it on the corresponding scatter plot.

Fig. 8. Scatter plot.

The scatter plot shows a clear negative correlation (-.65) between the two variables. It also shows the 95% confidence interval for the regression line, i.e. with 95% probability the regression line lies between the two dashed curves.

Significance criteria

Fig. 9. Table with significance tests.

The significance test for the Pop_Chng regression coefficient confirms that Pop_Chng is strongly related to Pt_Poor, p < .001.

Outcome

This example showed how to analyze a simple regression design. An interpretation of non-standardized and standardized regression coefficients was also presented. The importance of studying the distribution of responses of the dependent variable is discussed, and a technique for determining the direction and strength of the relationship between the predictor and the dependent variable is demonstrated.