Regression analysis: basics of data analysis

The concepts of correlation and regression are directly related, and correlation and regression analysis have much in common. Both are used to identify relationships between phenomena and processes. However, while correlation analysis evaluates the strength and direction of a stochastic relationship, regression analysis also establishes the form of the dependence.

Regression can be:

a) depending on the number of phenomena (variables):

Simple (regression between two variables);

Multiple (regression between the dependent variable (y) and several explanatory variables (x1, x2, ..., xn));

b) depending on the form:

Linear (described by a linear function; a linear relationship exists between the studied variables);

Nonlinear (described by a nonlinear function; the relationship between the studied variables is nonlinear);

c) by the nature of the relationship between the variables under consideration:

Positive (an increase in the value of the explanatory variable leads to an increase in the value of the dependent variable and vice versa);

Negative (as the value of the explanatory variable increases, the value of the dependent variable decreases);

d) by type:

Direct (the cause has an immediate effect on the outcome, i.e. the dependent and explanatory variables are directly related to each other);

Indirect (the explanatory variable affects the dependent variable indirectly, through a third variable or a number of other variables);

Spurious (nonsense regression) - may arise from a superficial and formal approach to the processes and phenomena under study. An example of a meaningless regression is one that establishes a relationship between a decrease in the amount of alcohol consumed in our country and a decrease in sales of washing powder.

When conducting regression analysis, the following main tasks are solved:

1. Determining the form of the dependence.

2. Determining the regression function. For this, a mathematical equation of one type or another is used, which allows one, first, to establish the general tendency of change in the dependent variable and, second, to calculate the effect of the explanatory variable (or several variables) on the dependent variable.

3. Estimating unknown values of the dependent variable. The resulting mathematical relationship (the regression equation) makes it possible to determine the value of the dependent variable both within the interval of the given values of the explanatory variables and beyond it. In the latter case, regression analysis serves as a useful tool for predicting changes in socio-economic processes and phenomena (provided that existing trends and relationships persist). Typically, the length of the time segment over which forecasting is carried out is no more than half of the time interval over which the initial indicators were observed. Forecasting can be passive, solving an extrapolation problem, or active, reasoning according to the well-known "if ..., then" scheme and substituting various values into one or more explanatory variables of the regression.



To construct regressions, a special method is used: the method of least squares. This method has advantages over other smoothing methods: a relatively simple mathematical determination of the required parameters and a good theoretical justification from a probabilistic point of view.

When choosing a regression model, one essential requirement is to ensure the greatest possible simplicity that still allows a solution of sufficient accuracy. Therefore, to establish statistical relationships, one first considers, as a rule, a model from the class of linear functions (the simplest of all possible classes of functions):

y = a + b1·x1 + b2·x2 + ... + bn·xn + E,

where b1, b2, ..., bn are coefficients that determine the influence of the independent variables xj on y; a is the free term (intercept); E is a random deviation reflecting the effect of unaccounted factors on the dependent variable; n is the number of independent variables; N is the number of observations, and the condition N > n + 1 must hold.
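A minimal sketch of estimating such a linear model by least squares; the data and variable names here are invented for illustration, not taken from the text:

```python
import numpy as np

# Hypothetical data: N = 8 observations, n = 2 explanatory variables,
# so the condition N > n + 1 holds. Model: y = a + b1*x1 + b2*x2 + e.
X = np.array([
    [1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0],
    [5.0, 6.0], [6.0, 5.0], [7.0, 8.0], [8.0, 7.0],
])
y = np.array([5.1, 5.9, 9.2, 10.1, 13.0, 14.2, 17.1, 18.0])

# Prepend a column of ones so the free term a is estimated with the b's.
A = np.column_stack([np.ones(len(X)), X])

# Least squares minimizes the sum of squared deviations ||A @ coef - y||^2.
coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
a, b1, b2 = coef
print(f"a = {a:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")
```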

A linear model can describe a very wide class of problems. In practice, however, particularly in socio-economic systems, linear models are sometimes difficult to apply because of large approximation errors. Therefore, nonlinear multiple regression functions that admit linearization are often used. Among them is, for example, the production function (the Cobb-Douglas power function), which has been used in a variety of socio-economic studies. It has the form:

y = b0 · x1^b1 · x2^b2 · ... · xj^bj · exp(E),

where b0 is a normalizing factor, b1, ..., bj are unknown coefficients, and E is a random deviation.

Taking natural logarithms, this equation can be converted to linear form:

ln y = ln b0 + b1·ln x1 + ... + bj·ln xj + E.

The resulting model allows the use of the standard linear regression procedures described above. By building models of both types (additive and multiplicative), one can choose the better one and continue the research with smaller approximation errors.
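A sketch of this linearization on hypothetical two-factor data (the true coefficients b0 = 2.0, b1 = 0.6, b2 = 0.3 are invented for the simulation):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(1, 10, 50)
x2 = rng.uniform(1, 10, 50)
# Cobb-Douglas with a multiplicative random deviation:
y = 2.0 * x1**0.6 * x2**0.3 * np.exp(rng.normal(0.0, 0.05, 50))

# Taking logs gives ln y = ln b0 + b1*ln x1 + b2*ln x2 + e: a linear model.
A = np.column_stack([np.ones(50), np.log(x1), np.log(x2)])
coef, _, _, _ = np.linalg.lstsq(A, np.log(y), rcond=None)
b0 = np.exp(coef[0])  # undo the logarithm on the intercept
print(f"b0 ~ {b0:.2f}, b1 ~ {coef[1]:.2f}, b2 ~ {coef[2]:.2f}")
```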

There is also a well-developed system for selecting approximating functions: the group method of data handling (GMDH, also known as the method of group accounting of arguments).

The correctness of the chosen model can be judged by examining the residuals, the differences between the observed values y and the values ŷ predicted by the regression equation. To verify the adequacy of the model, the mean approximation error is calculated:

E = (1/N) · Σ(|y - ŷ| / y) · 100%.

The model is considered adequate if E does not exceed 15%.
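A small sketch of this adequacy check (the observed and predicted numbers are illustrative):

```python
import numpy as np

def mean_approximation_error(y_obs, y_pred):
    """Mean relative deviation of predictions from observations, in %."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_obs - y_pred) / y_obs))

E = mean_approximation_error([10.0, 12.0, 15.0], [11.0, 11.5, 14.0])
print(f"E = {E:.1f}%  ->  adequate: {E <= 15.0}")  # ~6.9%, adequate
```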

We emphasize that, as applied to socio-economic systems, the basic conditions for the adequacy of the classical regression model are by no means always satisfied.

Without dwelling on all the causes of inadequacy, let us name only multicollinearity, the most difficult problem for the effective application of regression analysis procedures in the study of statistical dependencies. Multicollinearity means the presence of a linear relationship between the explanatory variables.

This phenomenon:

a) distorts the meaning of the regression coefficients and their substantive interpretation;

b) reduces the accuracy of estimation (the variance of the estimates increases);

c) increases the sensitivity of the coefficient estimates to the sample data (an increase in the sample size can strongly affect the values of the estimates).

There are various techniques for reducing multicollinearity. The most accessible is to eliminate one of two variables if the correlation coefficient between them exceeds 0.8. Which variable to keep is decided on substantive grounds. The regression coefficients are then calculated anew.
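A sketch of this screening rule (the 0.8 threshold is from the text; the data are synthetic, with x3 deliberately built as a near-copy of x1):

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + rng.normal(scale=0.1, size=100)  # almost collinear with x1
X = np.column_stack([x1, x2, x3])

R = np.corrcoef(X, rowvar=False)  # pairwise correlations of the predictors
rows, cols = np.triu_indices_from(R, k=1)
for i, j in zip(rows, cols):
    if abs(R[i, j]) > 0.8:
        print(f"x{i+1} and x{j+1}: r = {R[i, j]:.2f} -> drop one of them")
```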

Using a stepwise regression algorithm, one can include independent variables in the model sequentially, one at a time, and analyze the significance of the regression coefficients and the multicollinearity of the variables. In the end, only those variables remain in the dependence under study that ensure the necessary significance of the regression coefficients and a minimal effect of multicollinearity.

Regression analysis is one of the most sought-after methods of statistical research. With it, one can establish the degree of influence of independent variables on a dependent variable. Microsoft Excel includes tools intended for this type of analysis. Let us examine what they are and how to use them.

However, to use the function that performs regression analysis, you first need to activate the Analysis ToolPak. Only then will the tools needed for this procedure appear on the Excel ribbon.


Now, when we go to the "Data" tab on the ribbon, a new button, "Data Analysis", appears in the "Analysis" tool block.

Types of regression analysis

There are several types of regressions:

  • parabolic;
  • power;
  • logarithmic;
  • exponential (base e);
  • exponential (an arbitrary base);
  • hyperbolic;
  • linear regression.

Below we will discuss in more detail the implementation of the last type of regression analysis, linear regression, in Excel.

Linear regression in the Excel program

Below, as an example, a table is presented showing the average daily outdoor air temperature and the number of store customers for the corresponding working day. Let us find out, with the help of regression analysis, exactly how weather conditions in the form of air temperature can affect the attendance of a retail establishment.

The general equation of linear regression looks as follows: y = a0 + a1·x1 + ... + ak·xk. In this formula, y is the variable whose response to the factors we are trying to study; in our case, it is the number of customers. The values x are the various factors affecting the variable. The parameters a are the regression coefficients; they determine the weight of a particular factor. The index k denotes the total number of these factors.
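Before turning to the Excel tools, here is the same paired model fitted in code, with made-up temperature/customer numbers standing in for the table mentioned above (the real table is not reproduced in the text):

```python
import numpy as np

# Hypothetical stand-in data: average daily temperature (x) and the
# number of store customers (y) on the corresponding working days.
temp = np.array([14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1])
buyers = np.array([75.0, 80.0, 74.0, 79.0, 85.0, 88.0, 83.0, 92.0])

# Simple linear regression y = a0 + a1*x by least squares.
a1, a0 = np.polyfit(temp, buyers, deg=1)  # returns [slope, intercept]
print(f"buyers ~ {a0:.2f} + {a1:.2f} * temperature")
```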


Analysis of the results

The results of the regression analysis are displayed in the form of a table in the place indicated in the settings.

One of the main indicators is R-squared, which indicates the quality of the model. In our case, this coefficient is 0.705, or about 70.5%. This is an acceptable level of quality; a value below 0.5 indicates a poor fit.

Another important indicator is located in the cell at the intersection of the "Y-intercept" row and the "Coefficients" column. It indicates what value y will take when all other factors are zero; in our case, this is the number of customers. In this table, the value is 58.04.

The value at the intersection of the "Variable X1" row and the "Coefficients" column shows how strongly y depends on x; in our case, how strongly the number of store customers depends on temperature. A coefficient of 1.31 is considered a rather high indicator of influence.

As you can see, it is quite easy to build a regression analysis table in Microsoft Excel. But only a trained person can work with the output data and understand its essence.

In statistical modeling, regression analysis is a set of methods used to estimate relationships between variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when one of the independent variables changes while the other independent variables remain fixed.

In all cases, the target of estimation is a function of the independent variables, called the regression function. In regression analysis, the variation of the dependent variable around the regression function is also of interest; it can be described by a probability distribution.

Problems of regression analysis

This statistical research method is widely used for prediction, where its use has significant advantages, but it can sometimes lead to illusory or false relationships. It is therefore recommended to use it carefully; for example, correlation does not imply causation.

A large number of methods have been developed for regression analysis, such as linear regression and ordinary least squares regression, which are parametric. Their essence is that the regression function is defined in terms of a finite number of unknown parameters estimated from the data. Non-parametric regression allows the regression function to lie in a specified set of functions, which may be infinite-dimensional.

As a statistical research method, regression analysis in practice depends on the form of the data-generating process and on how it relates to the regression approach used. Since the true form of the data-generating process is generally unknown, regression analysis often depends to some extent on assumptions about this process. These assumptions are sometimes testable if a sufficient amount of data is available. Regression models are often useful even when the assumptions are moderately violated, although they may not perform with maximum efficiency.

In a narrower sense, regression may refer specifically to the estimation of continuous response variables, as opposed to the discrete response variables used in classification. The case of a continuous output variable is also called metric regression to distinguish it from related problems.

History

The earliest form of regression is the well-known method of least squares. It was published by Legendre in 1805 and by Gauss in 1809. Legendre and Gauss applied the method to the problem of determining, from astronomical observations, the orbits of bodies around the Sun (mostly comets, but later also the newly discovered minor planets). Gauss published a further development of the theory of least squares in 1821, including a version of the Gauss-Markov theorem.

The term "regress" came up with Francis Galton in the XIX century to describe a biological phenomenon. The essence was that the growth of descendants from the growth of ancestors, as a rule, regresses down to normal average. For Galton, regression had only this biological meaning, but later his work was continued by Joli and Karl Pearson and brought to a more general statistical context. In the work of Yol and Pearson, the joint distribution of responses and explanatory variables is considered to be Gaussian. This assumption was rejected by Fisher in the works of 1922 and 1925. Fisher suggested that the conditional distribution of the response variable is Gaussian, but the joint distribution should not be such. In this regard, the suggestion of Fisher is closer to the statement of Gauss 1821. Until 1970, sometimes left until 24 hours to obtain the result of regression analysis.

Regression analysis methods continue to be an area of active research. In recent decades, new methods have been developed for robust regression; regression involving correlated responses; regression methods accommodating various types of missing data; non-parametric regression; Bayesian methods of regression; regression in which the predictor variables are measured with error; regression with more predictors than observations; and causal inference with regression.

Regression models

Regression analysis models include the following variables:

  • Unknown parameters, denoted β, which may be a scalar or a vector.
  • Independent variables, X.
  • Dependent variables, Y.

In the various fields of science where regression analysis is applied, different terms are used instead of dependent and independent variables, but in all cases the regression model relates y to a function of X and β.

The approximation is usually formalized as E(y | X) = f(X, β). To carry out regression analysis, the form of the function f must be specified. Sometimes the form of this function is based on knowledge about the relationship between y and X that does not rely on the data. If no such knowledge is available, a flexible or convenient form of f is chosen.

Dependent variable y.

Suppose now that the vector of unknown parameters β has length k. To perform regression analysis, the user must provide information about the dependent variable y:

  • If N data points of the form (y, X) are observed, where N < k, most classical approaches to regression analysis cannot be carried out: the system of equations defining the regression model is underdetermined, and there is not enough data to recover β.
  • If exactly N = k points are observed and the function f is linear, the equation y = f(X, β) can be solved exactly rather than approximately. This reduces to solving a set of k equations with k unknowns (the elements of β), which has a unique solution as long as the X values are linearly independent. If f is nonlinear, a solution may not exist, or there may be many solutions.
  • The most common situation is when N > k data points are observed. In this case, the data contain enough information to estimate a unique value of β that fits the data best in some sense, and the regression model applied to the data can be viewed as an overdetermined system in β.

In the latter case, regression analysis provides tools for:

  • Finding a solution for the unknown parameters β that will, for example, minimize the distance between the measured and predicted values of y.
  • Under certain statistical assumptions, using the surplus information to provide statistical information about the unknown parameters β and the predicted values of the dependent variable y.

Required amount of independent measurements

Consider a regression model with three unknown parameters: β0, β1, and β2. Suppose the experimenter performs 10 measurements at the same value of the independent variable vector X. In this case, regression analysis does not yield a unique set of values: the best that can be done is to estimate the mean value and the standard deviation of the dependent variable y. Similarly, by measuring at two different values of X, one can obtain enough data for a regression with two unknowns, but not for three or more.

Only if the experimenter's measurements are carried out at three different values of the independent variable vector X will regression analysis provide a unique set of estimates for the three unknown parameters in β.

In the case of general linear regression, the above statement is equivalent to the requirement that the matrix XᵀX be invertible.
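A sketch of this identifiability requirement (the helper function and design matrices are invented for illustration):

```python
import numpy as np

def identifiable(X):
    """True if X^T X is invertible, i.e. the parameters are estimable."""
    X = np.asarray(X, dtype=float)
    return np.linalg.matrix_rank(X) == X.shape[1]

# Three parameters (intercept + 2 slopes), but every measurement is taken
# at the same value of the independent variable vector:
X_bad = np.column_stack([np.ones(10), np.full(10, 2.0), np.full(10, 3.0)])
# The same model measured at several distinct design points:
X_ok = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1]], dtype=float)

print(identifiable(X_bad))  # False: X^T X is singular
print(identifiable(X_ok))   # True
```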

Statistical assumptions

When the number of measurements N is greater than the number of unknown parameters k, and the measurement errors ε_i are random, the surplus information contained in the measurements is used for statistical predictions about the unknown parameters. This surplus of information is called the degrees of freedom of the regression.

Fundamental assumptions

Classic assumptions for regression analysis include:

  • The sample is representative of the population for which prediction is being made.
  • The error is a random variable with mean zero conditional on the explanatory variables.
  • The independent variables are measured without error.
  • The independent variables (predictors) are linearly independent, i.e. no predictor can be expressed as a linear combination of the others.
  • The errors are uncorrelated, i.e. the error covariance matrix is diagonal and each nonzero element is the variance of the error.
  • The error variance is constant across observations (homoscedasticity). If not, weighted least squares or other methods can be used.

These are sufficient conditions for the least-squares estimator to possess the required properties; in particular, these assumptions imply that the parameter estimates will be unbiased, consistent, and efficient within the class of linear estimators. It is important to note that real data rarely satisfy all of these conditions: the method is used even if the assumptions do not hold exactly. Deviation from the assumptions can sometimes serve as a measure of how useful the model is. Many of these assumptions can be relaxed in more advanced methods. Reports of statistical analyses typically include tests against the sample data and an assessment of the usefulness of the model.

In addition, variables in some cases refer to values measured at point locations. There may be spatial trends and spatial autocorrelation in the variables that violate the statistical assumptions. Geographically weighted regression is one method that deals with such data.

The defining feature of linear regression is that the dependent variable, y_i, is a linear combination of the parameters. For example, simple linear regression uses one independent variable, x_i, and two parameters, β0 and β1, to model N data points.

In multiple linear regression, there are several independent variables or functions of independent variables.

Given a random sample from the population, its parameters allow one to estimate a sample linear regression model.

In this setting, the least squares method is the most popular. It yields estimates of the parameters that minimize the sum of squared residuals. This kind of minimization (characteristic of linear regression) leads to a set of normal equations, a set of linear equations in the parameters, which are solved to obtain the parameter estimates.

Under the further assumption that the population error is normally distributed, the researcher can use the resulting estimates of standard errors to construct confidence intervals and test hypotheses about the parameters.
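A sketch of that procedure on simulated data (normal errors assumed; all numbers invented): point estimates, standard errors, and 95% confidence intervals from the usual OLS formulas.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one x
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

b, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
s2 = resid @ resid / (n - k)  # unbiased residual variance estimate
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))  # standard errors

t_crit = stats.t.ppf(0.975, df=n - k)  # two-sided 95% critical value
for name, est, err in zip(["const", "x1"], b, se):
    print(f"{name}: {est:.3f} +/- {t_crit * err:.3f}")
```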

Nonlinear regression analysis

When the function is not linear in the parameters, the sum of squares must be minimized by an iterative procedure. This introduces many complications that define the differences between the linear and nonlinear least squares methods. Consequently, the results of regression analysis using a nonlinear method are sometimes unpredictable.

Calculation of power and sampling

Here, as a rule, there are no agreed methods relating the number of observations to the number of independent variables in the model. One rule of thumb was proposed by Good and Hardin and looks like N = t^m, where N is the sample size, m is the number of independent variables, and t is the number of observations needed to achieve the desired accuracy if the model had only one independent variable. For example, suppose a researcher builds a linear regression model using a data set containing 1000 patients (N). If the researcher decides that five observations are needed to accurately determine a line (t), then the maximum number of independent variables the model can support is 4, since 5^4 = 625 ≤ 1000 < 5^5.

Other methods

Although the parameters of a regression model are usually estimated by the least squares method, other methods are also used, though far less often. For example, the following:

  • Bayesian methods (for example, the Bayesian linear regression method).
  • Percentage regression, used in situations where reducing percentage errors is considered more appropriate.
  • Least absolute deviations, which is more robust in the presence of outliers and leads to quantile regression.
  • Non-parametric regression, which requires a large number of observations and calculations.
  • Distance metric learning, in which a meaningful distance metric is learned in a given input space.

Software

All major statistical software packages perform least squares regression analysis. Simple linear regression and multiple regression analysis can be carried out in some spreadsheet applications as well as on some calculators. Although many statistical software packages can perform various types of non-parametric and robust regression, these methods are less standardized; different software packages implement different methods. Specialized regression software has been developed for use in fields such as survey analysis and neuroimaging.

During their studies, students very often encounter a variety of equations. One of them, the regression equation, is considered in this article. This type of equation is used specifically to describe the relationship between mathematical parameters. Equations of this kind are used in statistics and econometrics.

Definition of the concept of regression

In mathematics, regression means a quantity describing the dependence of the average value of one quantity on the values of another quantity. The regression equation expresses the mean value of one feature as a function of another feature. It has the form of a simple equation in which y acts as the dependent variable and x as the independent variable (the factor feature); that is, regression is expressed as y = f(x).

What are the types of links between variables

In general, there are two opposite types of interconnection: correlation and regression.

The first is characterized by equality between the variables: it is not known for certain which variable depends on which.

If there is no equality between the variables and the conditions state which variable is explanatory and which is dependent, then we can speak of the second type. To build a linear regression equation, it is necessary to find out which type of relationship is observed.

Types of regression

To date, seven different types of regression are distinguished: hyperbolic, linear, multiple, nonlinear, paired, inverse, and log-linear.

Hyperbolic, linear and logarithmic

The linear regression equation is used in statistics to express the parameters of the equation plainly. It looks like y = c + t·x + E. The hyperbolic equation has the form of a rectangular hyperbola: y = c + t/x + E. The log-linear equation expresses the relationship using a logarithmic function: ln y = ln c + t·ln x + ln E.

Multiple and nonlinear

Two more complex types of regression are multiple and nonlinear. The multiple regression equation is expressed by the function y = f(x1, x2, ..., xc) + E. In this situation, y acts as the dependent variable and the x's as explanatory ones. The variable E is stochastic; it captures the influence of other factors on the equation. The nonlinear regression equation is somewhat contradictory: on the one hand, it is not linear with respect to the recorded indicators, but on the other hand, in the role of estimating indicators it is linear.

Inverse and paired types of regression

Inverse regression is a type of function that needs to be transformed into a linear form. In the most traditional application programs, it has the form y = 1 / (c + t·x + E). The paired regression equation expresses the relationship between the data as a function y = f(x) + E. Just as in the other equations, y depends on x, and E is a stochastic parameter.

Concept of correlation

This is an indicator demonstrating the existence of a relationship between two phenomena or processes. The strength of the relationship is expressed by the correlation coefficient, whose value varies within the interval [-1; +1]. A negative value indicates an inverse relationship, a positive one a direct relationship. A coefficient equal to 0 means there is no relationship. The closer the value is to 1, the stronger the relationship between the parameters; the closer to 0, the weaker.
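A one-line check of this indicator on toy numbers (the data are invented):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # grows with x: a direct link

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation, always in [-1; +1]
print(f"r = {r:.3f}")  # close to +1, i.e. a strong direct relationship
```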

Methods

Parametric methods of correlation can evaluate the closeness of the relationship. They are used on the basis of distribution estimates to study parameters that obey the normal distribution law.

The parameters of the linear regression equation are needed to identify the type of dependence and the regression equation function, and to evaluate the indicators of the chosen relationship formula. The correlation field is used as a method of identifying the relationship. To do this, all the available data must be depicted graphically: all known data are plotted in a rectangular two-dimensional coordinate system. This forms the correlation field. The values of the explanatory factor are marked along the abscissa, while the values of the dependent variable are marked along the ordinate. If there is a functional dependence between the parameters, they line up in the form of a line.

If the correlation coefficient of such data is less than 30%, we can speak of an almost complete absence of relationship. If it is between 30% and 70%, this indicates a relationship of medium closeness. A 100% indicator is evidence of a functional relationship.

The nonlinear regression equation, like the linear one, should be supplemented with a correlation index (R).

Correlation for multiple regression

The coefficient of determination is the square of the multiple correlation indicator. It speaks to the closeness of the relationship between the presented set of indicators and the feature under study. It can also speak to the nature of the influence of the parameters on the result. The multiple regression equation is evaluated using this indicator.

To calculate the multiple correlation indicator, it is necessary to compute its index.

Least square method

This method is a way of estimating regression factors. Its essence is to minimize the sum of squared deviations obtained due to the dependence of the factor on the function.

The paired linear regression equation can be estimated using this method. This type of equation is used when a paired linear dependence is detected between indicators.

Parameters of equations

Each parameter of the linear regression function carries a specific meaning. The paired linear regression equation contains two parameters: c and t. The parameter t shows the average change in the final indicator of the function y when the variable x decreases (or increases) by one unit. If the variable x is zero, the function equals the parameter c. If the variable x is not zero, the factor c carries no economic meaning. The only influence on the function is the sign in front of the factor c: a minus suggests a slowed change in the result compared to the factor, while a plus indicates an accelerated change.

Each parameter of the regression equation can be expressed through an equation. For example, the factor c has the form c = ȳ - t·x̄.
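A sketch of recovering both parameters from data this way (toy numbers chosen so the fit is exact, y = 1 + 2x):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([5.0, 9.0, 13.0, 17.0])

# Paired linear regression y = c + t*x from the normal equations:
t = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # slope
c = y.mean() - t * x.mean()  # the factor c = y-bar - t * x-bar
print(f"y = {c:.2f} + {t:.2f} * x")  # prints y = 1.00 + 2.00 * x
```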

Grouped data

There are task conditions in which all information is grouped by the attribute x, and the corresponding average values of the dependent indicator are given for each group. In this case, the average values characterize how the indicator changes depending on x. Thus, grouped information helps to find the regression equation and is used in the analysis of relationships. However, this method has its drawbacks: unfortunately, the average indicators are often subject to external fluctuations. These fluctuations do not reflect the pattern of the relationship; they merely mask it with "noise". Average indicators show the patterns of a relationship much worse than a linear regression equation does. However, they can be used as a basis for finding the equation. By multiplying the size of an individual group by the corresponding average, one obtains the sum of y within the group. Next, all the sums obtained are added up to find the total indicator y. It is somewhat harder to compute the sum indicator xy. If the intervals are small, one can conditionally take the value of x as the same for all units (within a group); it is multiplied by the sum of y to find the sum of the products of x and y. Then all the sums are added together and the total sum xy is obtained.

Multiple pair of regression equation: Evaluation of the importance of communication

As previously discussed, multiple regression is a function of the form y = f(x1, x2, ..., xm) + E. Most often, such an equation is used to solve problems of supply and demand for goods, of interest income on repurchased shares, and to study the causes and shape of production cost functions. It is also actively used in a wide variety of macroeconomic studies and calculations; at the microeconomic level, such an equation is applied somewhat less often.

The main task of multiple regression is to build a model containing a large amount of information in order to determine what influence each of the factors has, individually and in aggregate, on the indicator to be modeled and on its coefficients. The regression equation can take a wide variety of forms. Two types of functions are used to assess the relationship: linear and nonlinear.

The linear function is depicted as the relationship y = a0 + a1·x1 + a2·x2 + ... + am·xm. Here a1, a2, ..., am are considered the coefficients of "pure" regression. They characterize the average change in the parameter y when the corresponding parameter x changes (decreases or increases) by one unit, on the condition that the other indicators remain stable.

Nonlinear equations include, for example, the power function y = a·x1^b1 · x2^b2 · ... · xm^bm. In this case, the exponents b1, b2, ..., bm are called elasticity coefficients: they show how the result will change (in %) with a 1% increase (or decrease) in the corresponding indicator x, with the other factors held constant.

What factors need to be considered when building multiple regression

In order to properly build multiple regression, it is necessary to find out which factors should pay special attention to.

It is necessary to have a certain understanding of the nature of the relationship between the economic factors and the quantity being modeled. The factors to be included must meet the following criteria:

  • They must be quantitatively measurable. In order to use a factor that describes the quality of an object, it must in any case be given a quantitative form.
  • There must be no intercorrelation of factors, or functional relationships between them. Such conditions most often lead to irreversible consequences: the system of normal equations becomes ill-conditioned, which entails unreliable and imprecise estimates.
  • If the correlation indicator is very large, it is impossible to determine the isolated influence of the factors on the final result, so the coefficients become uninterpretable.

Construction methods

There are a great many methods and techniques explaining how factors can be selected for the equation. However, all of them are based on selecting coefficients using the correlation indicator. Among them are:

  • Method of exception.
  • Method of inclusion.
  • Step-by-step regression analysis.

The first method implies screening out all coefficients from the total set. The second involves introducing many additional factors. The third eliminates factors that were previously included in the equation. Each of these methods has a right to exist. They have their pros and cons, but they can all, in their own way, solve the problem of discarding unnecessary indicators. As a rule, the results obtained by each individual method are quite close.

Methods of multidimensional analysis

Such methods of determining factors are based on considering individual combinations of interrelated features. They include discriminant analysis, pattern recognition, principal component analysis, and cluster analysis. In addition, there is also factor analysis, which appeared as a development of the component method. All of them are applied in certain circumstances, under certain conditions and factors.

The purpose of regression analysis is to measure the relationship between a dependent variable and one (paired regression analysis) or several (multiple regression) independent variables. Independent variables are also called factor, explanatory, or determining variables, regressors, and predictors.

The dependent variable is sometimes called the determined, explained, or "response" variable. The extremely wide use of regression analysis in empirical research is not only due to it being a convenient tool for testing hypotheses. Regression, especially multiple regression, is an effective method of modeling and forecasting.

We will begin the explanation of the principles of working with regression analysis with the simpler, paired method.

Paired regression analysis

The first steps in regression analysis are almost identical to those we took in computing the correlation coefficient. The three main conditions for the effectiveness of correlation analysis by the Pearson method, normal distribution of the variables, interval measurement of the variables, and a linear relationship between the variables, are also relevant for multiple regression. Accordingly, at the first stage scatterplots are built, a descriptive statistical analysis of the variables is carried out, and the regression line is computed. As in correlation analysis, regression lines are built by the least squares method.

To illustrate more clearly the differences between the two methods of data analysis, let us turn to the example already considered, with the variables "Support for the SPS" and "Share of the rural population". The source data are identical. The difference in the scatterplots is that in regression analysis it matters how the dependent variable is plotted, in our case "Support for the SPS" along the Y axis, whereas in correlation analysis this does not matter. After cleaning outliers, the scatterplot looks as follows:

The fundamental idea of regression analysis is that, having a general trend for the variables in the form of a regression line, one can predict the value of the dependent variable given a value of the independent one.

Imagine an ordinary mathematical linear function. Any straight line in Euclidean space can be described by the formula:

y = a + b·x,

where a is a constant that sets the offset along the ordinate axis, and b is a coefficient that determines the slope of the line.

Knowing the slope coefficient and the constant, one can calculate (predict) the value of y for any x.

This simplest function forms the basis of the regression analysis model, with the proviso that we will predict the value of y not exactly, but within a certain confidence interval, i.e. approximately.

The constant is the point of intersection of the regression line with the ordinate axis (the Y-intercept; in statistical packages it is usually denoted "intercept"). In our example with voting for the SPS, its rounded value is 10.55. The slope coefficient b is approximately -0.1 (as in correlation analysis, the sign shows the type of relationship, direct or inverse). Thus, the resulting model has the form SPS = -0.1 x rural pop. + 10.55.

For example, for the Republic of Adygea, where the share of the rural population is 47%:

SPS = -0.10 x 47 + 10.55 = 5.63

(the predicted value 5.63 is obtained with the unrounded coefficients; with the rounded values shown here, the arithmetic gives 5.85).

The difference between the initial and predicted values is called the residual (we have already encountered this term, fundamental for statistics, when analyzing contingency tables). So, for the case of the Republic of Adygea, the residual equals 3.92 - 5.63 = -1.71. The larger the absolute value of the residual, the less successfully the value is predicted.

Let us calculate the predicted values and residuals for all cases:
Case | Rural pop., % | SPS (initial) | SPS (predicted) | Residual
Republic of Adygea | 47 | 3.92 | 5.63 | -1.71
Altai Republic | 76 | 5.40 | 2.59 | 2.81
Republic of Bashkortostan | 36 | 6.04 | 6.78 | -0.74
Republic of Buryatia | 41 | 8.36 | 6.25 | 2.11
Republic of Dagestan | 59 | 1.22 | 4.37 | -3.15
Republic of Ingushetia | 59 | 0.38 | 4.37 | -3.99
Etc.

Analysis of the ratio of the initial and predicted values serves to assess the quality of the model and its predictive power. One of the main indicators of regression statistics is the multiple correlation coefficient R: the correlation coefficient between the initial and predicted values of the dependent variable. In paired regression analysis it equals the usual Pearson correlation coefficient between the dependent and independent variable, in our case 0.63. To interpret the multiple R substantively, it must be converted into the coefficient of determination. This is done in the same way as in correlation analysis: by squaring it. The coefficient of determination R-squared (R²) shows the proportion of variation in the dependent variable explained by the independent variable(s).

In our case, R² = 0.39 (0.63²); this means that the variable "share of the rural population" explains about 40% of the variation in the variable "Support for the SPS". The greater the coefficient of determination, the higher the quality of the model.

Another indicator of model quality is the standard error of estimate. This is a measure of how widely the points are "scattered" around the regression line. The measure of spread for interval variables is the standard deviation; accordingly, the standard error of estimate is the standard deviation of the distribution of residuals. The higher its value, the greater the scatter and the worse the model. In our case, the standard error is 2.18: by this magnitude our model "errs on average" when predicting the value of the "Support for the SPS" variable.
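These three quality indicators can be recomputed directly from observed and predicted values. A sketch using only the six cases listed in the table above (so the numbers will not match the full 85-case statistics exactly):

```python
import numpy as np

y_obs = np.array([3.92, 5.40, 6.04, 8.36, 1.22, 0.38])   # SPS (initial)
y_pred = np.array([5.63, 2.59, 6.78, 6.25, 4.37, 4.37])  # SPS (predicted)

R = np.corrcoef(y_obs, y_pred)[0, 1]  # multiple R
R2 = R ** 2  # share of explained variation
see = np.std(y_obs - y_pred, ddof=2)  # standard error of estimate (n - 2)
print(f"R = {R:.2f}, R^2 = {R2:.2f}, standard error = {see:.2f}")
```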

Regression statistics also include analysis of variance. With its help we find out: 1) what proportion of the variation (variance) of the dependent variable is explained by the independent variable; 2) what proportion of the variance of the dependent variable falls on the residuals (the unexplained part); 3) what the ratio of these two quantities is (the F-ratio). Variance statistics are especially important for sample studies: they show how likely it is that a relationship between the independent and dependent variables exists in the general population. However, for full-coverage studies (as in our example), the results of the analysis of variance are also instructive. In this case, one checks whether the identified statistical pattern is caused by a coincidence of circumstances and how characteristic it is of the set of conditions in which the surveyed population finds itself; that is, what is established is not the truth of the result for some larger general population, but the degree of its regularity and freedom from random influences.

In our case, the analysis of variance statistics are as follows:

           | SS     | df | MS     | F     | p
Regression | 258.77 | 1  | 258.77 | 54.29 | 0.000000001
Residual   | 395.59 | 83 | 4.77   |       |
Total      | 654.36 | 84 |        |       |

The F-ratio of 54.29 is significant at the 0.0000000001 level. Accordingly, we can confidently reject the null hypothesis (that the relationship we found is due to chance).

A similar function is performed by the t-criterion, but with respect to the regression coefficients (the slope and the Y-intercept). With the help of the t-criterion, we test the hypothesis that in the general population the regression coefficients equal zero. In our case, we can again confidently reject the null hypothesis.

Multiple regression analysis

The multiple regression model is almost identical to the paired regression model; the only difference is that several independent variables are sequentially included in the linear function:

Y = b1·x1 + b2·x2 + ... + bp·xp + a.

If there are more than two independent variables, we cannot form a visual picture of their relationship; in this respect, multiple regression is less "visual" than paired regression. When there are two independent variables, it is useful to display the data on a three-dimensional scatterplot. In professional statistical software packages (for example, Statistica), there is an option to rotate the three-dimensional diagram, which allows one to visualize the structure of the data.

When working with multiple regression, in contrast to the paired case, it is necessary to choose an analysis algorithm. The standard algorithm includes all available predictors in the final regression model. A stepwise algorithm implies sequential inclusion (or exclusion) of independent variables based on their explanatory "weight". The stepwise method is good when there are many independent variables: it "cleans" the model of frankly weak predictors, making it more compact and concise.

An additional condition for the correctness of multiple regression (along with interval measurement, normality, and linearity) is the absence of multicollinearity: the presence of strong correlations between the independent variables.

The interpretation of multiple regression statistics includes all the elements considered for the case of paired regression. In addition, the statistics of multiple regression analysis have other important components.

We will illustrate work with multiple regression using the example of testing hypotheses that explain differences in electoral turnout across the regions of Russia. Specific empirical studies have suggested that voter turnout levels are affected by:

the national factor (the variable "Russian population", measured as the share of the Russian population in the subjects of the Russian Federation). It is assumed that an increase in the share of the Russian population leads to a decrease in voter turnout;

the urbanization factor (the variable "urban population", measured as the share of urban population in the subjects of the Russian Federation; we have already worked with this factor within the framework of correlation analysis). It is assumed that an increase in the share of urban population also leads to a decrease in voter turnout.

The dependent variable, "intensity of electoral activity" ("activity"), is measured through averaged turnout data for the regions in the federal elections from 1995 to 2003. The source data table for the two independent variables and one dependent variable has the following form:

Case | Activity | Urban pop. | Russian pop.
Republic of Adygea | 64.92 | 53 | 68
Altai Republic | 68.60 | 24 | 60
Republic of Buryatia | 60.75 | 59 | 70
Republic of Dagestan | 79.92 | 41 | 9
Republic of Ingushetia | 75.05 | 41 | 23
Republic of Kalmykia | 68.52 | 39 | 37
Karachay-Cherkess Republic | 66.68 | 44 | 42
Republic of Karelia | 61.70 | 73 | 73
Komi Republic | 59.60 | 74 | 57
Mari El Republic | 65.19 | 62 | 47

Etc. (after cleaning outliers: 83 cases out of 88)

Statistics describing the quality of the model:

1. Multiple R = 0.62; R-squared = 0.38. Consequently, the national factor and the urbanization factor together explain about 38% of the variation in the "electoral activity" variable.

2. The standard error of estimate is 3.38. By this much the constructed model "errs on average" when predicting the turnout level.

3. The F-ratio of explained to unexplained variation is 25.2, significant at the 0.000000003 level. The null hypothesis about the randomness of the identified relationships is rejected.

4. The t-criterion for the constant and for the regression coefficients of the variables "urban population" and "Russian population" is significant at the levels 0.0000001, 0.00005, and 0.007, respectively. The null hypothesis that the coefficients are random is rejected.

Additional useful statistics for analyzing the ratio of the initial and predicted values of the dependent variable are the Mahalanobis distance and Cook's distance. The first is a measure of the uniqueness of a case (it shows how much the combination of values of all the independent variables for a given case deviates from the mean values of all the independent variables simultaneously). The second is a measure of the influence of a case. Different observations affect the slope of the regression line differently, and Cook's distance allows them to be compared on this indicator. This is useful when cleaning outliers (an outlier can be thought of as an overly influential case).
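A sketch of both distances computed from scratch via the hat matrix (the design matrix and response below are simulated around the model's reported coefficients; in a real package these statistics are reported per case):

```python
import numpy as np

def influence_measures(X, y):
    """Mahalanobis distance of the predictors and Cook's distance for
    each case, for an OLS model whose X already includes an intercept."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix
    h = np.diag(H)  # leverage of each case
    resid = y - H @ y
    s2 = resid @ resid / (n - p)
    # Cook's distance: how much a case shifts the fitted regression
    cooks = resid**2 * h / (p * s2 * (1 - h) ** 2)
    # With an intercept, leverage relates to the squared Mahalanobis
    # distance of the predictors: D^2 = (n - 1) * (h - 1/n)
    maha = (n - 1) * (h - 1.0 / n)
    return maha, cooks

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(30), rng.normal(size=30), rng.normal(size=30)])
y = X @ np.array([75.99, -0.1, -0.06]) + rng.normal(scale=3.0, size=30)
maha, cooks = influence_measures(X, y)
print(maha.max(), cooks.max())  # the most unique / most influential case
```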

In our example, Dagestan is among the unique and influential cases.

Case | Initial value | Predicted value | Residual | Mahalanobis distance | Cook's distance
Adygea | 64.92 | 66.33 | -1.40 | 0.69 | 0.00
Altai Republic | 68.60 | 69.91 | -1.31 | 6.80 | 0.01
Republic of Buryatia | 60.75 | 65.56 | -4.81 | 0.23 | 0.01
Republic of Dagestan | 79.92 | 71.01 | 8.91 | 10.57 | 0.44
Republic of Ingushetia | 75.05 | 70.21 | 4.84 | 6.73 | 0.08
Republic of Kalmykia | 68.52 | 69.59 | -1.07 | 4.20 | 0.00

The resulting regression model has the following parameters: Y-intercept (constant) = 75.99; b (urban pop.) = -0.1; b (Russian pop.) = -0.06. Formula:

Activity = 75.99 - 0.1 x urban pop. - 0.06 x Russian pop.