What Is Regression Analysis in Statistics? Linear Regression Basics

The purpose of regression analysis is to measure the relationship between a dependent variable and one (paired regression analysis) or several (multiple regression analysis) independent variables. The explanatory variables are also called factor variables, determinants, regressors, or predictors.

The dependent variable is sometimes called the outcome, the explained variable, or the "response." The extremely wide use of regression analysis in empirical research is not only due to the fact that it is a convenient tool for testing hypotheses. Regression, especially multiple regression, is also an effective method of modeling and forecasting.

To explain the principles of working with regression analysis, we will start with the simpler one - the paired method.

Paired Regression Analysis

The first steps in regression analysis are almost identical to those we took when calculating the correlation coefficient. The three basic conditions for the validity of Pearson correlation analysis - normal distribution of the variables, interval measurement of the variables, and a linear relationship between the variables - are also relevant for multiple regression. Accordingly, at the first stage, scatterplots are built, descriptive statistics for the variables are computed, and the regression line is calculated. As in correlation analysis, regression lines are fitted by the method of least squares.

To illustrate the differences between the two methods of data analysis more clearly, let us turn to the example already considered, with the variables "SPS support" and "rural population share." The original data are identical. The difference in the scatterplots is that in regression analysis it is correct to plot the dependent variable - in our case, "SPS support" - along the Y axis, while in correlation analysis this does not matter. After removing the outliers, the scatterplot looks as follows:

The basic idea of regression analysis is that, having the general tendency of the variables - in the form of a regression line - one can predict the value of the dependent variable from the values of the independent one.

Let us recall the ordinary mathematical linear function. Any straight line in Euclidean space can be described by the formula:

y = bx + a,

where a is a constant specifying the displacement along the ordinate axis, and b is a coefficient that determines the slope of the line.

Knowing the slope and the constant, you can calculate (predict) the value of y for any x.

This simplest function formed the basis of the regression analysis model, with the proviso that we will not predict the value of y exactly, but within a certain confidence interval, i.e., approximately.

The constant is the point where the regression line crosses the ordinate axis (the Y-intercept; in statistical packages it is usually labeled "intercept"). In our example with the vote for the SPS, its rounded value is 10.55. The slope b is approximately -0.1 (as in correlation analysis, the sign indicates the type of relationship - direct or inverse). Thus, the resulting model has the form SPS = -0.1 x Rural pop. + 10.55.

For example, for the Republic of Adygea, where the rural population share is 47%, the predicted value is SPS = -0.10 x 47 + 10.55 ≈ 5.63 (the exact figure comes from the unrounded coefficients).

The difference between the observed and predicted values is called the residual (we have already encountered this term, fundamental for statistics, when analyzing contingency tables). So, for the case "Republic of Adygea" the residual is 3.92 - 5.63 = -1.71. The larger the absolute value of the residual, the worse the value is predicted.

We calculate the predicted values and residuals for all cases:

Case                        Rural pop., %   SPS (observed)   SPS (predicted)   Residual
Republic of Adygea               47              3.92             5.63          -1.71
Altai Republic                   76              5.40             2.59           2.81
Republic of Bashkortostan        36              6.04             6.78          -0.74
Republic of Buryatia             41              8.36             6.25           2.11
Republic of Dagestan             59              1.22             4.37          -3.15
Republic of Ingushetia           59              0.38             4.37          -3.99
Etc.
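
As a minimal sketch, these predictions and residuals can be reproduced in Python with numpy, using the rounded coefficients above (the table values come from the unrounded fit, so small discrepancies are expected):

    import numpy as np

    # rural population share (%) and observed SPS support (%) for the first cases
    rural = np.array([47.0, 76.0, 36.0, 41.0, 59.0, 59.0])
    sps   = np.array([3.92, 5.40, 6.04, 8.36, 1.22, 0.38])

    # paired regression model reported in the text: SPS = -0.1 * rural + 10.55
    b, a = -0.1, 10.55

    predicted = b * rural + a      # predicted SPS support
    residuals = sps - predicted    # observed minus predicted

    for s, p, e in zip(sps, predicted, residuals):
        print(f"observed={s:5.2f}  predicted={p:5.2f}  residual={e:+5.2f}")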

Analysis of the relationship between the observed and predicted values serves to assess the quality of the resulting model and its predictive ability. One of the main indicators of regression statistics is the multiple correlation coefficient R - the correlation coefficient between the observed and predicted values of the dependent variable. In paired regression analysis it is equal to the usual Pearson correlation coefficient between the dependent and independent variables, in our case 0.63. To interpret the multiple R meaningfully, it must be converted into the coefficient of determination. This is done in the same way as in correlation analysis - by squaring. The coefficient of determination R-square (R²) shows the proportion of variation in the dependent variable explained by the independent variable(s).

In our case, R² = 0.39 (0.63²); this means that the variable "rural population share" explains about 40% of the variation in the variable "SPS support." The greater the coefficient of determination, the higher the quality of the model.

Another measure of model quality is the standard error of the estimate. It measures how widely the points are "scattered" around the regression line. The measure of spread for interval variables is the standard deviation; accordingly, the standard error of the estimate is the standard deviation of the distribution of the residuals. The higher its value, the greater the spread and the worse the model. In our case, the standard error is 2.18: by this amount, on average, our model "errs" when predicting the value of the variable "SPS support."
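
These quality statistics are easy to reproduce; a sketch with scipy, reusing the observed and predicted arrays from the snippet above (on this six-case subsample the numbers will differ from the full-sample values 0.63 and 2.18):

    import numpy as np
    from scipy import stats

    observed  = np.array([3.92, 5.40, 6.04, 8.36, 1.22, 0.38])
    predicted = np.array([5.63, 2.59, 6.78, 6.25, 4.37, 4.37])

    # multiple R: correlation between observed and predicted values
    R = abs(stats.pearsonr(observed, predicted)[0])
    r_squared = R ** 2                 # coefficient of determination

    # standard error of the estimate: spread of the residuals,
    # with n - 2 degrees of freedom (two estimated parameters)
    residuals = observed - predicted
    see = np.sqrt((residuals ** 2).sum() / (len(observed) - 2))
    print(R, r_squared, see)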

Regression statistics also include analysis of variance. With its help we find out: 1) what proportion of the variation (variance) of the dependent variable is explained by the independent variable; 2) what proportion of the variance of the dependent variable falls on the residuals (the unexplained part); 3) what is the ratio of these two quantities (the F-ratio). The ANOVA statistics are especially important for sample studies: they show how likely it is that the relationship between the independent and dependent variables also exists in the general population. However, the results of analysis of variance are also useful for complete-population studies (as in our example). In this case, one checks whether the revealed statistical regularity is caused by a coincidence of random circumstances and how characteristic it is of the set of conditions in which the surveyed population finds itself; that is, what is established is not the truth of the result for some wider general population, but the degree of its regularity and freedom from random influences.

In our case, the analysis of variance statistics are as follows:

             SS        df     MS       F       p-level
Regression   258.77      1    258.77   54.29   0.000000001
Residual     395.59     83      4.77
Total        654.36     84

The F-ratio of 54.29 is significant at the 0.000000001 level. Accordingly, we can confidently reject the null hypothesis (that the relationship we discovered is due to chance).

A similar function is performed by the t-test, but with respect to the regression coefficients (the slope and the Y-intercept). Using the t-test, we check the hypothesis that in the general population the regression coefficients are equal to zero. In our case, we can again confidently reject the null hypothesis.

Multiple Regression Analysis

The multiple regression model is almost identical to the paired regression model; the only difference is that several independent variables are included in the linear function:

Y = b1X1 + b2X2 +… + bpXp + a.

If there are more than two independent variables, we cannot get a visual picture of their relationship; in this respect, multiple regression is less "transparent" than paired regression. When there are two independent variables, it can be useful to display the data in a 3D scatterplot. Professional statistical packages (for example, Statistica) offer an option for rotating the three-dimensional diagram, which gives a good visual representation of the data structure.

When working with multiple regression, unlike paired regression, it is necessary to choose an analysis algorithm. The standard algorithm includes all available predictors in the final regression model. The stepwise algorithm sequentially includes (or excludes) independent variables based on their explanatory "weight." The stepwise method is good when there are many independent variables: it "cleans" the model of frankly weak predictors, making it more compact and concise.
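
A stepwise-style (forward) selection can be sketched with scikit-learn; note that SequentialFeatureSelector uses cross-validated fit rather than the classical significance tests used in statistics packages, and the data here are synthetic placeholders:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.feature_selection import SequentialFeatureSelector

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 6))   # six candidate predictors (placeholder data)
    y = 2 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=100)

    # forward selection: add predictors one at a time while the fit improves
    selector = SequentialFeatureSelector(
        LinearRegression(), n_features_to_select=2, direction="forward"
    )
    selector.fit(X, y)
    print(selector.get_support())   # mask of the retained predictors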

An additional condition for the correctness of multiple regression (along with interval measurement, normality, and linearity) is the absence of multicollinearity - strong correlations among the independent variables.

The interpretation of multiple regression statistics includes all the elements that we considered for the case of paired regression. In addition, multiple regression statistics have other important components.

We will illustrate work with multiple regression using the example of testing hypotheses that explain differences in the level of electoral activity across the regions of Russia. Specific empirical studies have suggested that voter turnout is influenced by:

The national factor (the variable "Russian population," operationalized as the share of the Russian population in the constituent entities of the Russian Federation). It is assumed that an increase in the share of the Russian population leads to a decrease in voter turnout;

The urbanization factor (the variable "urban population," operationalized as the share of the urban population in the constituent entities of the Russian Federation; we have already worked with this factor in the correlation analysis). It is assumed that an increase in the share of the urban population also leads to a decrease in voter turnout.

The dependent variable, "intensity of electoral activity" ("turnout"), is operationalized through region-level turnout averaged over the federal elections from 1995 to 2003. The initial data table for the two independent variables and one dependent variable has the following form:

Case                         Turnout   Urban pop., %   Russian pop., %
Republic of Adygea            64.92          53               68
Altai Republic                68.60          24               60
Republic of Buryatia          60.75          59               70
Republic of Dagestan          79.92          41                9
Republic of Ingushetia        75.05          41               23
Republic of Kalmykia          68.52          39               37
Karachay-Cherkess Republic    66.68          44               42
Republic of Karelia           61.70          73               73
Komi Republic                 59.60          74               57
Mari El Republic              65.19          62               47

Etc. (after removing outliers, 83 cases out of 88 remain)

Statistics describing the quality of the model:

1. Multiple R = 0.62; R-square = 0.38. Consequently, the national factor and the urbanization factor together explain about 38% of the variation in the variable "electoral activity."

2. The standard error of the estimate is 3.38. By this amount, on average, the constructed model "errs" when predicting the turnout level.

3. The F-ratio of explained to unexplained variation is 25.2, significant at the 0.000000003 level. The null hypothesis that the identified relationships are random is rejected.

4. The t-test for the constant and for the regression coefficients of the variables "urban population" and "Russian population" is significant at the levels 0.0000001, 0.00005, and 0.007, respectively. The null hypothesis that the coefficients are random is rejected.

Additional useful statistics for analyzing the relationship between the observed and predicted values of the dependent variable are the Mahalanobis distance and Cook's distance. The first is a measure of the case's uniqueness: it shows how much the combination of values of all independent variables for a given case deviates from the mean of all independent variables simultaneously. The second is a measure of the case's influence. Different observations affect the slope of the regression line differently, and Cook's distance allows us to compare them on this indicator. This is useful when screening for outliers (an outlier can be thought of as an overly influential case).

In our example, Dagestan is one of the unique and influential cases.

Case                     Observed value   Predicted value   Residual   Mahalanobis dist.   Cook's dist.
Adygea                        64.92            66.33          -1.40           0.69             0.00
Altai Republic                68.60            69.91          -1.31           6.80             0.01
Republic of Buryatia          60.75            65.56          -4.81           0.23             0.01
Republic of Dagestan          79.92            71.01           8.91          10.57             0.44
Republic of Ingushetia        75.05            70.21           4.84           6.73             0.08
Republic of Kalmykia          68.52            69.59          -1.07           4.20             0.00

The regression model itself has the following parameters: Y-intercept (constant) = 75.99; B (urban pop.) = -0.1; B (Russian pop.) = -0.06. The final formula is:

Turnout = -0.1 x Urban pop. - 0.06 x Russian pop. + 75.99.
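
A sketch of how such a model and the accompanying influence statistics can be computed in Python with statsmodels, using the ten table rows above as a stand-in for the full 83-case dataset (so the estimates will differ somewhat from the reported ones):

    import numpy as np
    import statsmodels.api as sm

    turnout = np.array([64.92, 68.60, 60.75, 79.92, 75.05,
                        68.52, 66.68, 61.70, 59.60, 65.19])
    urban   = np.array([53, 24, 59, 41, 41, 39, 44, 73, 74, 62])
    russian = np.array([68, 60, 70,  9, 23, 37, 42, 73, 57, 47])

    # intercept plus the two predictors
    X = sm.add_constant(np.column_stack([urban, russian]))
    model = sm.OLS(turnout, X).fit()
    print(model.params)      # constant, B(urban pop.), B(Russian pop.)
    print(model.rsquared)    # coefficient of determination

    # Cook's distance for each case: its influence on the fitted plane
    cooks = model.get_influence().cooks_distance[0]
    print(cooks.round(2))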

The concepts of correlation and regression are closely related, and the two kinds of analysis share many computational techniques. They are used to identify cause-and-effect relationships between phenomena and processes. However, while correlation analysis allows one to estimate the strength and direction of a stochastic relationship, regression analysis also gives the form of that dependence.

Regression can be:

a) depending on the number of phenomena (variables):

Simple (regression between two variables);

Multiple (regression between the dependent variable (y) and several explanatory variables (x1, x2, ..., xn));

b) depending on the form:

Linear (represented by a linear function; the relations between the studied variables are linear);

Non-linear (represented by a non-linear function; the relationship between the studied variables is non-linear);

c) by the nature of the relationship between the variables included in the consideration:

Positive (an increase in the value of the explanatory variable leads to an increase in the value of the dependent variable and vice versa);

Negative (with an increase in the value of the explanatory variable, the value of the explained variable decreases);

d) by type:

Direct (the cause has a direct impact on the effect, i.e., the dependent and explanatory variables are directly related to each other);

Indirect (the explanatory variable affects the dependent variable indirectly, through a third variable or a number of other variables);

Spurious (nonsense regression) - can arise from a superficial and formal approach to the processes and phenomena under study. An example of a nonsense regression is one that establishes a link between a decrease in the amount of alcohol consumed in our country and a decrease in the sales of washing powder.

When conducting regression analysis, the following main tasks are solved:

1. Determination of the form of dependence.

2. Determination of the regression function. For this, a mathematical equation of one type or another is used, which allows, first, establishing the general tendency of the dependent variable and, second, calculating the influence of the explanatory variable (or several variables) on the dependent variable.

3. Estimation of unknown values of the dependent variable. The obtained mathematical relationship (regression equation) allows one to determine the value of the dependent variable both within the interval of the given values of the explanatory variables and beyond it. In the latter case, regression analysis acts as a useful tool for predicting changes in socio-economic processes and phenomena (provided that the existing trends and relationships persist). Usually, the length of the time interval for which forecasting is carried out is chosen to be no more than half of the time interval over which the observations of the initial indicators were made. One can carry out both a passive forecast, solving the extrapolation problem, and an active one, reasoning according to the well-known "if ..., then" scheme and substituting various values into one or several explanatory variables of the regression.



Regression models are fitted by a special method called the least squares method. This method has advantages over other smoothing methods: a relatively simple mathematical determination of the sought parameters and a good theoretical justification from the probabilistic point of view.

When choosing a regression model, one of the essential requirements is the greatest possible simplicity that still allows a solution of sufficient accuracy. Therefore, to establish statistical relationships, one first considers, as a rule, a model from the class of linear functions (as the simplest of all possible classes of functions):

y i = a + b 1 x i1 + b 2 x i2 + ... + b n x in + e i (i = 1, 2, ..., N),

where b 1, b 2, ..., b n are coefficients that determine the influence of the independent variables x ij on the value of y i; a is the free term (intercept); e i is a random deviation reflecting the influence of unaccounted factors on the dependent variable; n is the number of independent variables; N is the number of observations, and the condition N > n + 1 must be met.

A linear model can describe a very broad class of different problems. In practice, however, particularly in socio-economic systems, it is sometimes difficult to use linear models because of large approximation errors. Therefore, nonlinear multiple regression functions that can be linearized are often used. These include, for example, the production function (the Cobb-Douglas power function), which has found application in various socio-economic studies. It has the form:

y = b0 (x1 ^ b1) (x2 ^ b2) ... (xj ^ bj) e,

where b0 is a normalization factor, b1, ..., bj are unknown coefficients, and e is a random (multiplicative) deviation.

Taking natural logarithms, you can convert this equation to linear form:

ln y = ln b0 + b1 ln x1 + b2 ln x2 + ... + bj ln xj + ln e.

The resulting model allows the use of the standard linear regression procedures described above. Having built models of the two types (additive and multiplicative), one can choose the better one and carry out further research with smaller approximation errors.
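
A sketch of this linearization in Python on synthetic data (numpy's least squares stands in for the regression routine):

    import numpy as np

    rng = np.random.default_rng(1)
    x1 = rng.uniform(1, 10, 200)
    x2 = rng.uniform(1, 10, 200)
    # synthetic Cobb-Douglas data: y = 2 * x1^0.6 * x2^0.3 * noise
    y = 2.0 * x1 ** 0.6 * x2 ** 0.3 * np.exp(rng.normal(0, 0.05, 200))

    # linearized form: ln y = ln b0 + b1 ln x1 + b2 ln x2
    A = np.column_stack([np.ones_like(x1), np.log(x1), np.log(x2)])
    coef, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)

    b0 = np.exp(coef[0])
    print(b0, coef[1], coef[2])   # should be close to 2, 0.6 and 0.3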

There is also a well-developed system for selecting approximating functions: the group method of data handling (GMDH).

The correctness of the fitted model can be judged by examining the residuals, i.e., the differences between the observed values y i and the corresponding values ŷ i predicted by the regression equation. To check the adequacy of the model, the mean approximation error is calculated:

ē = (1/n) ∑ |(y i - ŷ i) / y i| × 100%.

The model is considered adequate if ē is within 15% or less.
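
A one-function sketch of this adequacy check (the arrays are placeholder values):

    import numpy as np

    def mean_approximation_error(y, y_hat):
        """Mean absolute percentage deviation of predictions from observations."""
        return np.mean(np.abs((y - y_hat) / y)) * 100

    y     = np.array([64.92, 68.60, 60.75])    # observed values (placeholders)
    y_hat = np.array([66.33, 69.91, 65.56])    # values predicted by the model
    print(mean_approximation_error(y, y_hat))  # adequate if it does not exceed 15 (%)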

We emphasize that in relation to socio-economic systems, the basic conditions for the adequacy of the classical regression model are far from always fulfilled.

Without dwelling on all the causes of such inadequacy, let us name only multicollinearity, the most difficult problem in the effective application of regression analysis procedures to the study of statistical dependencies. Multicollinearity means the presence of a linear relationship between the explanatory variables.

This phenomenon:

a) distorts the substantive interpretation of the regression coefficients;

b) decreases the estimation accuracy (the variance of estimates increases);

c) increases the sensitivity of the estimates of the coefficients to the sample data (an increase in the sample size can greatly affect the values ​​of the estimates).

There are various techniques for reducing multicollinearity. The most accessible is the elimination of one of two variables when the correlation coefficient between them exceeds 0.8 in absolute value. Which of the variables to keep is decided on substantive grounds. The regression coefficients are then calculated anew.
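
This screening step is easy to sketch in Python (the predictor matrix is a synthetic placeholder):

    import numpy as np

    rng = np.random.default_rng(2)
    x1 = rng.normal(size=100)
    x2 = x1 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
    x3 = rng.normal(size=100)
    X = np.column_stack([x1, x2, x3])

    corr = np.corrcoef(X, rowvar=False)         # pairwise correlation matrix
    k = corr.shape[0]
    for i in range(k):
        for j in range(i + 1, k):
            if abs(corr[i, j]) > 0.8:           # the 0.8 rule of thumb from the text
                print(f"x{i+1} and x{j+1} are collinear (r = {corr[i, j]:.2f}); drop one")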

Using a stepwise regression algorithm allows one to include independent variables in the model one at a time, analyzing the significance of the regression coefficients and the multicollinearity of the variables at each step. In the end, only those variables remain in the dependence under study that provide the necessary significance of the regression coefficients and a minimal effect of multicollinearity.

Characterization of causal dependencies

Causal relationships are connections between phenomena and processes in which a change in one of them, the cause, leads to a change in the other, the effect.

According to their significance for studying the relationship, attributes are divided into two classes.

Attributes that cause changes in other, related attributes are called factor attributes (or factors).

Attributes that change under the influence of factor attributes are called effective attributes.

Two forms of relationship are distinguished: functional and stochastic. A relationship is called functional if a given value of the factor attribute corresponds to one and only one value of the effective attribute. A functional relationship manifests itself in all cases of observation and for each specific unit of the studied population.

The functional relationship can be represented by the following equation:
y i = f (x i), where: y i is the effective attribute; f (x i) is the known function of the relationship between the effective and factor attributes; x i is the factor attribute.
In real life there are no functional relationships. They are only abstractions, useful in the analysis of phenomena but simplifying reality.

A stochastic (statistical, or random) relationship is a relationship between quantities in which one of them reacts to a change in another quantity (or quantities) by a change in its distribution law. In other words, under such a relationship, different values of one variable correspond to different distributions of the other variable. This is because the dependent variable, in addition to the independent variables under consideration, is subject to the influence of a number of unaccounted or uncontrolled random factors, as well as to inevitable measurement errors. Since the values of the dependent variable are subject to random scatter, they cannot be predicted with complete accuracy but only indicated with a certain probability.

Because of the ambiguity of the stochastic dependence between Y and X, a scheme of dependence averaged over x is of particular interest, i.e., the regularity in the change of the conditional mathematical expectation M x (Y) (the mathematical expectation of the random variable Y found under the condition that the variable X takes the value x) as a function of x.

Correlation is a special case of a stochastic relationship. The term correlation (from the Latin correlatio, "ratio, relationship") currently denotes a stochastic, probable, possible relationship between two (paired correlation) or several (multiple correlation) random variables.

A correlation dependence between two variables is a statistical relationship between them in which each value of one variable corresponds to a certain average value of the other, i.e., to a different conditional mathematical expectation. Correlation dependence is a special case of stochastic dependence, in which a change in the values of the factor attributes (x 1, x 2, ..., x n) entails a change in the average value of the effective attribute.



It is customary to distinguish between the following types of correlation:

1. Pairwise correlation: a relationship between two attributes (the effective attribute and a factor attribute, or two factor attributes).

2. Partial correlation: the relationship between the effective attribute and one factor attribute, with the values of the other factor attributes included in the study held fixed.

3. Multiple correlation: the dependence of the effective attribute on two or more factor attributes included in the study.

Purpose of regression analysis

Regression models are the analytical form of representing causal relationships. The scientific validity and popularity of regression analysis make it one of the main mathematical tools for modeling the phenomenon under study. This method is used to smooth experimental data and to obtain quantitative estimates of the comparative influence of various factors on the result variable.

Regression analysis consists in determining the analytical expression of a relationship in which a change in one quantity (the dependent variable, or effective indicator) is due to the influence of one or more independent quantities (factors, or predictors), while the set of all other factors that also affect the dependent quantity is held at constant, average values.

Regression Analysis Objectives:

Estimation of the functional dependence of the conditional average value of the effective attribute y on the factor attributes (x 1, x 2, ..., x n);

Prediction of the value of the dependent variable using the independent variable(s);

Determination of the contribution of individual independent variables to the variation of the dependent variable.

Regression analysis cannot be used to determine the existence of a relationship between variables, since the presence of such a relationship is a prerequisite for applying the analysis.

In regression analysis, it is assumed in advance that there are cause-and-effect relationships between the effective attribute (Y) and the factor attributes x 1, x 2, ..., x n.

The function f describing the dependence of the indicator on the parameters is called the regression equation (regression function). The regression equation shows the expected value of the dependent variable at specific values of the explanatory variables.
Depending on the number of factors X included in the model, models are divided into one-factor (paired regression models) and multifactor (multiple regression models). Depending on the type of the function f, models are divided into linear and nonlinear.

Paired Regression Model

Owing to the influence of unaccounted random factors and causes, individual observations y will deviate to a greater or lesser extent from the regression function f (X). In this case, the equation of the relationship between two variables (the paired regression model) can be represented as:

Y = f (X) + ɛ,

where ɛ is a random variable characterizing the deviation from the regression function. This variable is called the disturbance (also the residual, or error). Thus, in the regression model the dependent variable Y is some function f (X) up to a random disturbance ɛ.

Consider the classical linear model of paired regression (CLMPR). It has the form

y i = β 0 + β 1 x i + ɛ i (i = 1,2, ..., n),(1)

where y i is the explained (resultant, dependent, endogenous) variable; x i is the explanatory (predictor, factor, exogenous) variable; β 0, β 1 are numerical coefficients; ɛ i is the random (stochastic) component, or error.

Basic conditions (prerequisites, hypotheses) of the CLMPR:

1) x i is a deterministic (non-random) quantity; it is also assumed that not all values of x i are the same.

2) The expected value (mean) of the disturbance ɛ i is equal to zero:

M [ɛ i] = 0 (i = 1,2, ..., n).

3) The variance of the disturbance is constant for all i (the homoscedasticity condition):

D [ɛ i] = σ 2 (i = 1,2, ..., n).

4) Perturbations for different observations are uncorrelated:

cov [ɛ i, ɛ j] = M [ɛ i ɛ j] = 0 for i ≠ j,

where cov [ɛ i, ɛ j] is the covariance coefficient (correlation moment).

5) Perturbations are normally distributed random variables with zero mean and variance σ 2:

ɛ i ~ N (0, σ 2).

To obtain the regression equation, the first four prerequisites are sufficient. Fulfillment of the fifth prerequisite is necessary in order to assess the accuracy of the regression equation and of its parameters.
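
These prerequisites can be checked empirically on the residuals; a sketch using scipy and statsmodels on synthetic data:

    import numpy as np
    from scipy import stats
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_breuschpagan

    rng = np.random.default_rng(3)
    x = rng.uniform(0, 10, 100)
    y = 1.5 * x + 4 + rng.normal(size=100)
    X = sm.add_constant(x)
    resid = sm.OLS(y, X).fit().resid

    # prerequisite 5: normality of the disturbances (Shapiro-Wilk test)
    print(stats.shapiro(resid).pvalue)     # large p-value: normality not rejected

    # prerequisite 3: homoscedasticity (Breusch-Pagan test)
    print(het_breuschpagan(resid, X)[1])   # large p-value: constant variance plausible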

Comment: the attention paid to linear relationships is explained by the limited variation of the variables and by the fact that, in most cases, nonlinear forms of relationship are transformed (by taking logarithms or changing variables) into linear form in order to perform the calculations.

The traditional method of least squares (OLS)

The estimate of the model from the sample is the equation

ŷ i = a 0 + a 1 x i(i = 1,2, ..., n), (2)

where ŷ i are the theoretical (approximating) values of the dependent variable obtained from the regression equation; a 0, a 1 are the coefficients (parameters) of the regression equation (sample estimates of the coefficients β 0, β 1, respectively).

According to OLS, the unknown parameters a 0, a 1 are chosen so that the sum of the squared deviations of the values ŷ i from the empirical values y i (the residual sum of squares) is minimal:

Q e = ∑ e i 2 = ∑ (y i - ŷ i) 2 = ∑ (y i - (a 0 + a 1 x i)) 2 → min, (3)

where e i = y i - ŷ i is the sample estimate of the disturbance ɛ i, or the regression residual.

The problem reduces to finding the values of the parameters a 0 and a 1 at which the function Q e takes its smallest value. Note that Q e = Q e (a 0, a 1) is a function of the two variables a 0 and a 1 until we have found and then fixed their "best" (in the sense of the least squares method) values, while x i, y i are constant numbers found experimentally.

The necessary conditions for an extremum of (3) are found by setting the partial derivatives of this function of two variables equal to zero. As a result, we obtain a system of two linear equations, called the system of normal equations:

n a 0 + a 1 ∑ x i = ∑ y i,
a 0 ∑ x i + a 1 ∑ x i 2 = ∑ x i y i. (4)

The coefficient a 1 is the sample regression coefficient of y on x; it shows by how many units, on average, the variable y changes when the variable x changes by one unit of its measurement, that is, the variation in y per unit of variation in x. The sign of a 1 indicates the direction of this change. The coefficient a 0 is the displacement: by (2) it equals the value of ŷ i at x = 0 and may have no meaningful interpretation. For this reason, the dependent variable is sometimes called the response.
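
As a minimal sketch, the system (4) can be solved directly with numpy (the data arrays are placeholders):

    import numpy as np

    x = np.array([47.0, 76.0, 36.0, 41.0, 59.0, 59.0])
    y = np.array([3.92, 5.40, 6.04, 8.36, 1.22, 0.38])

    n = len(x)
    # system of normal equations (4) in matrix form
    A = np.array([[n,       x.sum()],
                  [x.sum(), (x ** 2).sum()]])
    b = np.array([y.sum(), (x * y).sum()])

    a0, a1 = np.linalg.solve(A, b)  # intercept a0 and slope a1
    print(a0, a1)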

Statistical properties of the estimates of the regression coefficients:

The estimates of the coefficients a 0, a 1 are unbiased;

The variances of estimates a 0, a 1 decrease (the accuracy of estimates increases) with an increase in the sample size n;

The variance of the estimate of the slope a 1 decreases as ∑ (x i - x̄) 2 increases, and therefore it is desirable to choose the x i so that their spread around the mean value is large;

For x̄ > 0 (the case of greatest interest), there is a negative statistical relationship between the estimates a 0 and a 1 (an increase in a 1 leads to a decrease in a 0).

Regression analysis is a method of establishing the analytical expression of a stochastic relationship between the studied attributes. The regression equation shows how, on average, y changes when any of the x i changes, and has the form:

y = f (x 1, x 2, ..., x n),

where y is the dependent variable (it is always one);

x i are the independent variables (factors) (there may be several of them).

If there is only one explanatory variable, this is simple regression analysis. If there are several of them (n ≥ 2), such an analysis is called multivariate.

In the course of regression analysis, two main tasks are solved:

    construction of the regression equation, i.e., finding the type of relationship between the final indicator and the independent factors x 1, x 2, ..., x n;

    estimation of the significance of the resulting equation, i.e., determining to what extent the selected factor attributes explain the variation of the attribute y.

Regression analysis is used mainly for planning, as well as for the development of a regulatory framework.

Unlike correlation analysis, which only answers the question of whether there is a relationship between the analyzed attributes, regression analysis also gives a formalized expression of that relationship. In addition, while correlation analysis studies any interrelation of factors, regression analysis studies one-sided dependence, i.e., a relationship showing how a change in the factor attributes affects the effective attribute.

Regression analysis is one of the most developed methods of mathematical statistics. Strictly speaking, implementing regression analysis requires fulfilling a number of special requirements (in particular, x 1, x 2, ..., x n; y must be independent, normally distributed random variables with constant variances). In real life, strict compliance with the requirements of regression and correlation analysis is very rare, but both of these methods are quite common in economic research. Dependencies in the economy can be not only direct but also inverse and nonlinear. A regression model can be built in the presence of any dependence; however, in multivariate analysis only linear models of the form are used:

y = a + b 1 x 1 + b 2 x 2 + ... + b n x n.

The construction of the regression equation is carried out, as a rule, by the least squares method, the essence of which is to minimize the sum of the squared deviations of the actual values of the resultant attribute from its calculated values, i.e.:

S = ∑ (y j - ŷ j) 2 → min, j = 1, ..., m,

where m is the number of observations;

ŷ j = a + b 1 x 1j + b 2 x 2j + ... + b n x nj is the calculated value of the resultant factor.

It is recommended to determine the regression coefficients using analytical software packages for a personal computer or a special financial calculator. In the simplest case, the regression coefficients of a one-factor linear regression equation of the form y = a + bx can be found by the formulas:

b = (m ∑ x j y j - ∑ x j ∑ y j) / (m ∑ x j 2 - (∑ x j) 2),   a = ȳ - b x̄.

Cluster Analysis

Cluster analysis is one of the methods of multivariate analysis, designed for grouping (clustering) a population whose elements are characterized by many attributes. The values of each attribute serve as the coordinates of each unit of the studied population in the multidimensional space of attributes, so each observation can be represented as a point in that space. The distance between points p and q with k coordinates is defined as:

d (p, q) = √ (∑ (p i - q i) 2), i = 1, ..., k.

The main criterion for clustering is that the differences between clusters should be more significant than the differences between observations assigned to the same cluster; i.e., in the multidimensional space the following inequality must hold:

where r 1,2 is the distance between clusters 1 and 2.

Just like the regression analysis procedures, the clustering procedure is quite laborious; it is advisable to perform it on a computer.
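
For illustration, a minimal clustering sketch with scikit-learn (the two-attribute data matrix is a synthetic placeholder):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(4)
    # two well-separated groups of points in a 2-D attribute space
    X = np.vstack([rng.normal(0, 1, (50, 2)),
                   rng.normal(6, 1, (50, 2))])

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_[:10])       # cluster assignments of the first observations
    print(km.cluster_centers_)   # coordinates of the cluster centers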

The method of regression analysis is used to determine the technical and economic parameters of products belonging to a particular parametric series, in order to construct and align price relationships. It is used to analyze and substantiate the levels and ratios of prices of products characterized by one or more technical and economic parameters reflecting their main consumer properties. Regression analysis allows one to find an empirical formula describing the dependence of price on the technical and economic parameters of products:

P = f (X1, X2, ..., Xn),

where P is the unit price, rubles; X1, X2, ..., Xn are the technical and economic parameters of the products.

The method of regression analysis - the most advanced of the normative-parametric methods used - is effective for calculations based on modern information technologies and systems. Its application includes the following main stages:

  • determination of the classification parametric product groups;
  • selection of the parameters that most strongly affect the price of the product;
  • selection and justification of the form in which the price changes as the parameters change;
  • construction of the system of normal equations and calculation of the regression coefficients.

The main classification group of products whose prices are subject to equalization is the parametric series, within which products can be grouped by design depending on their application, operating conditions and requirements, etc. When forming parametric series, automatic classification methods can be applied, which make it possible to single out homogeneous groups from the total mass of products. The selection of technical and economic parameters is based on the following basic requirements:

  • the selected parameters include those fixed in standards and technical specifications; in addition to technical parameters (power, carrying capacity, speed, etc.), indicators of serial production, coefficients of complexity, unification, etc. are used;
  • the set of selected parameters should characterize sufficiently fully the design, technological and operational properties of the products included in the series, and have a fairly close correlation with the price;
  • the parameters should not be interdependent.

To select the technical and economic parameters that significantly affect the price, a matrix of pairwise correlation coefficients is calculated. The magnitude of the correlation coefficients between the parameters indicates the closeness of their relationship; a correlation close to zero shows an insignificant effect of the parameter on the price. The final selection of technical and economic parameters is carried out in the course of stepwise regression analysis using computer technology and the corresponding standard programs.

In pricing practice, the following set of functions is applied:

linear:

P = a0 + a1X1 + ... + anXn;

linear-power (linear with squared terms):

P = a0 + a1X1 + ... + anXn + an+1X1^2 + ... + a2nXn^2;

inverse logarithmic:

P = a0 + a1 / ln X1 + ... + an / ln Xn;

power:

P = a0 (X1 ^ a1) (X2 ^ a2) ... (Xn ^ an);

exponential:

P = e ^ (a0 + a1X1 + ... + anXn);

hyperbolic:

P = a0 + a1 / X1 + a2 / X2 + ... + an / Xn,

where P is the equalized price; X1, X2, ..., Xn are the values of the technical and economic parameters of the products of the series; a0, a1, ..., an are the calculated coefficients of the regression equation.

In practical pricing work, depending on the form of the relationship between prices and technical and economic parameters, other regression equations can also be used. The type of the link function between the price and the set of technical and economic parameters can be specified in advance or selected automatically during computer processing. The closeness of the correlation between the price and the set of parameters is estimated by the multiple correlation coefficient; its closeness to one indicates a close relationship. From the regression equation, the equalized (calculated) prices of the products of the given parametric series are obtained. To evaluate the results of the equalization, the relative deviations of the calculated prices from the actual ones are computed:

Cr = (Pf - Pr) / Pf x 100,

where Pf and Pr are the actual and calculated prices, respectively.
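
A short sketch of this check in Python (the price arrays are placeholder values):

    import numpy as np

    actual     = np.array([120.0,  95.0, 143.0, 110.0])   # Pf: actual prices
    calculated = np.array([116.5, 101.2, 139.8, 125.0])   # Pr: prices from the regression

    deviation = (actual - calculated) / actual * 100      # Cr, in percent
    print(deviation.round(1))
    print(np.abs(deviation) > 10)                         # cases worth investigating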

The value of Cr should not exceed 8-10%. In the case of significant deviations of the calculated values from the actual ones, it is necessary to investigate:

  • whether the parametric series was formed correctly: it may include products that differ sharply in their parameters from the other products of the series; such products must be excluded;
  • whether the technical and economic parameters were selected correctly: a set of parameters weakly correlated with the price is possible, in which case the search for and selection of parameters must be continued.

The procedure and technique of regression analysis, the determination of the unknown parameters of the equation, and the economic assessment of the results obtained are carried out in accordance with the requirements of mathematical statistics.