Driven Factors Analysis of China's Irrigation Water Use Efficiency by Stepwise Regression and Principal Component Analysis

This paper introduces an integrated approach to identify the major factors influencing the efficiency of irrigation water use in China. It combines multiple stepwise regression (MSR) and principal component analysis (PCA) to obtain more realistic results. In real-world case studies, a classical linear regression model often involves too many explanatory variables, and the linear correlation among variables cannot be eliminated. Linearly correlated variables invalidate the results of the factor analysis. To overcome this issue and reduce the number of variables, PCA is used in combination with MSR. The irrigation water use status in China was then analyzed to find the five major factors that have significant impacts on irrigation water use efficiency. To illustrate the performance of the proposed approach, a calculation based on real data was conducted, and the results are presented in this paper.


Introduction
Agriculture is the basis for human survival, consuming up to 75% of current human water use to feed about 7.2 billion people in the world [1]. At present, intensifying water scarcity and water pollution are reducing the freshwater resources available for irrigation, hindering agricultural development [2]. The demand for food and living space from the additional 1.8 billion people projected by 2050 will put further enormous pressure on irrigation water use and limit the arable land [3,4]. In other words, future increases in agricultural production will have to come mainly from growing more food on existing land and water [5]. The decrease of water resources and severe environmental pollution are also changing the way agriculture develops, with emphasis on sustainable and clean development [6]. As such, improving the efficiency of irrigation water use is attracting more and more attention, which is consistent with sustainable development.
In China, about 80% of food production comes from irrigated farmland. The irrigated area reaches $5 \times 10^7$ hectares, accounting for about 95% of arable land [7]. Irrigation has always been the biggest consumer of water. However, the proportion of water used for irrigation has narrowed from 80% in the 1970s to less than 60% at present. Over the past three years, total water use in China has averaged about 614 billion m$^3$/year, of which irrigation water use is about 346.9 billion m$^3$/year, or about 56.5% [8][9][10]. The proportion for irrigation continues to decline, which will widen the gap between supply and demand of irrigation water.
In addition, water waste persists in agriculture. Compared with developed countries' irrigation water use efficiency of about 70%-80%, China's water utilization efficiency is very low. China's irrigation water use efficiency is about 45% at present, with more than 50% of water wasted in the process of water delivery and irrigation. Consequently, grain output per unit of water is also very low in China; only 1 kg of grain can be produced with 1 m$^3$ of water [11]. Given all of the above, the government of China put forward the most stringent water resource management policy to enhance water use efficiency. Under this situation, the government first needs to find the major factors that significantly influence water use efficiency; these major factors are called driven factors.
In the real world, the correlations among the factors influencing the irrigation water use efficiency (IWUE), namely, multicollinearity, increase the complexity of factor analysis. As such, PCA is introduced to the driven factor analysis of IWUE [12]. PCA can compress data by reducing the dimensionality of a data set that includes a large number of interrelated variables without losing information. After compression, the original data are transformed into a new set of variables, namely, the principal components, which are uncorrelated with each other [13].
In this paper, to overcome the mentioned weakness of stepwise regression, the PCA approach is introduced and combined with stepwise regression for the driven factors analysis of irrigation water use efficiency in China. First, stepwise regression is used to find the explanatory variables that have no multicollinearity with other variables. Second, the PCA approach is used to reduce the number of the remaining variables by transforming them into a new set of variables, called principal components, which are uncorrelated with each other. Finally, a linear regression is established to model the relationship between a scalar dependent variable (IWUE) and one or more explanatory variables (driven factors).
The rest of the paper is organized as follows: "Review of the Research on Factor Analysis" provides a detailed review of the state of the art on the methods that can be used to solve similar problems. The paper then justifies the interest of the proposed approach with respect to others. "Methodology" introduces the methodology of the proposed model. In "Driven Factors Screening," the process for selecting driven factors is explained. In "Calculation," the case study is presented to show the computation process. Finally, the results and conclusions are summarized in "Results" and "Conclusions," respectively.

Review of the Research on Factor Analysis
2.1. Subjective Weighting Methods.
Subjective weighting methods evaluate the factors using qualitative measures according to empirical expertise, such as AHP (Analytic Hierarchy Process), the Delphi method, the fuzzy comprehensive method, the DARE method, and the order relation analysis method. Among the subjective weighting methods, AHP is a common means of deciding the indicators' weights and is frequently used in various fields. For example, Wang et al. [14] used the AHP-GRA (analytic hierarchy process-gray relational analysis) method to make decisions for ergonomic evaluation.
The biggest advantage of the subjective weighting method is that it can reasonably determine the order of all the indicators. That is to say, although the subjective weighting method cannot determine the weighting coefficient of each indicator precisely, it can effectively establish the sequence of the given indicators according to their degrees of influence. The biggest disadvantage of this method is its subjectiveness and randomness, because different experts produce different weighting coefficients. As such, in some cases, applying subjective weighting methods may lead to results that differ significantly from the actual situation.

Objective Weighting Methods.
The determinations or evaluation results of objective weighting methods rest on mathematical theory, which reduces the burden on decision-makers. Besides popular objective methods such as least squares and the eigenvector method [15,16], there are other advanced methods such as the mean-squared deviation weight method, the simple correlation function method, principal component analysis, and factor analysis [17,18]. To date, the research history of objective weighting methods is still short and the field is immature. These weighting methods are also confined to specific problem domains and therefore lack generality. In addition, the computational methods are relatively complex and do not consider the subjective intent of the policy makers, which leads to poor participation.
Among the objective weighting methods mentioned above, principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance (i.e., accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors are an uncorrelated orthogonal basis set. The principal components are orthogonal because they are the eigenvectors of the covariance matrix, which is symmetric. PCA is sensitive to the relative scaling of the original variables.
The limitations of PCA are as follows: (1) it yields the weights of the principal components rather than the objective weight of each independent indicator and (2) its results are invalid when the correlations among indicators are too low.

Advantages of Proposed Methods.
To sum up, the determination of indicators' weights in multiobjective decisions is significantly important. Weighting is a complicated metric combining subjective assessment and objective reality. It reflects an indicator's level of importance in the evaluation process. The rationality of the weight values affects the judging results significantly. As such, we need to determine the indicators' weights effectively and scientifically and to choose appropriate weighting methods. In this paper, the proposed method combines subjective and objective weighting methods. Its advantages compared with other methods are described in the following.
Too many original indicators were collected from the literature in the beginning, and we need only some of them. We therefore first used a subjective weighting method to screen the indicators, because it can reasonably determine the order of all the indicators. The indicators we deleted were meaningless for the assessment, so their weighting values did not need to be estimated (which is why we did not use objective methods at this stage). We chose the Delphi method to obtain the effective indicators so that the efficiency of the assessment operation was relatively high.
Then, the driven factors remaining after screening need to be handled on a sound mathematical basis, because we need to know how they influence water use efficiency. We used the correlation coefficient method and a modified PCA method to analyze the indicators. In view of the limitations of PCA, we introduced MSR to reduce invalid results. We also extracted all the principal components to retain the original data, so that we could obtain the objective weight of each independent indicator.

Methodology
3.1. PCA Model.
According to [13,19,20], PCA is applied in this paper to eliminate the multicollinearity among the variables.
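After literature analysis, many factors that affect irrigation water use efficiency were found. Then 22 driven factors were introduced into the irrigation water use efficiency analysis and taken as the variables in the PCA model. For the period 1997-2010, 14 samples were collected for the analysis of IWUE in China. Consider a data matrix, $\mathbf{X}$, with column-wise zero empirical means (the sample mean of each column has been shifted to zero), where each of the $n$ rows represents a different repetition of the experiment, and each of the $p$ columns gives a particular kind of datum. Driven factors are exhibited in the $p$ columns, and the data matrix $\mathbf{X}$ is shown in the following:

$$\mathbf{X} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$$

In matrix $\mathbf{X}$, $n$ represents the number of samples and $p$ represents the number of driven factors.
Mathematically, the transformation is defined by a set of $p$-dimensional vectors of weights or loadings $\mathbf{L}_{(k)} = (l_1, \ldots, l_p)_{(k)}$ that map each row vector $\mathbf{x}_{(i)}$ of $\mathbf{X}$ to a new vector of principal component scores $\mathbf{z}_{(i)} = (z_1, \ldots, z_p)_{(i)}$, given by

$$z_{k(i)} = \mathbf{x}_{(i)} \cdot \mathbf{L}_{(k)}.$$

As such, the individual variables of $\mathbf{Z}$ considered over the data set successively inherit the maximum possible variance from $\mathbf{X}$, with each loading vector $\mathbf{L}$ constrained to be a unit vector. The new variables in $\mathbf{Z}$ are uncorrelated with each other. New variables are presented as follows:

$$\mathbf{Z}_k = \mathbf{X} \mathbf{L}_{(k)}, \quad k = 1, \ldots, p.$$

For the first principal component $\mathbf{Z}_1$ in the above, the corresponding loading vector $\mathbf{L}_{(1)}$ has to satisfy

$$\mathbf{L}_{(1)} = \arg\max_{\|\mathbf{L}\| = 1} \left\{ \sum_{i} \left( z_{1(i)} \right)^2 \right\} = \arg\max_{\|\mathbf{L}\| = 1} \left\{ \|\mathbf{X}\mathbf{L}\|^2 \right\} = \arg\max \left\{ \frac{\mathbf{L}^T \mathbf{X}^T \mathbf{X} \mathbf{L}}{\mathbf{L}^T \mathbf{L}} \right\}.$$

A standard result for a symmetric matrix such as $\mathbf{X}^T \mathbf{X}$ is that the quotient's maximum possible value is the largest eigenvalue of the matrix, which occurs when $\mathbf{L}_{(1)}$ is the corresponding eigenvector. With $\mathbf{L}_{(1)}$ found, the first component of a data vector $\mathbf{x}_{(i)}$ can then be given as a score $z_{1(i)} = \mathbf{x}_{(i)} \cdot \mathbf{L}_{(1)}$ in the transformed coordinates, or as the corresponding vector in the original variables, $\left( \mathbf{x}_{(i)} \cdot \mathbf{L}_{(1)} \right) \mathbf{L}_{(1)}$.
The $k$th component can be found by subtracting the first $k - 1$ principal components from $\mathbf{X}$:

$$\hat{\mathbf{X}}_k = \mathbf{X} - \sum_{s=1}^{k-1} \mathbf{X} \mathbf{L}_{(s)} \mathbf{L}_{(s)}^T.$$

Then we find the loading vector which extracts the maximum variance from this new data matrix in the following:

$$\mathbf{L}_{(k)} = \arg\max_{\|\mathbf{L}\| = 1} \left\{ \|\hat{\mathbf{X}}_k \mathbf{L}\|^2 \right\} = \arg\max \left\{ \frac{\mathbf{L}^T \hat{\mathbf{X}}_k^T \hat{\mathbf{X}}_k \mathbf{L}}{\mathbf{L}^T \mathbf{L}} \right\}.$$

It turns out that this gives the remaining eigenvectors of $\mathbf{X}^T \mathbf{X}$, with the maximum values for the quantity in brackets given by their corresponding eigenvalues. The $k$th principal component of a data vector $\mathbf{x}_{(i)}$ can therefore be given as a score $z_{k(i)} = \mathbf{x}_{(i)} \cdot \mathbf{L}_{(k)}$ in the transformed coordinates or as the corresponding vector in the space of the original variables, where $\mathbf{L}_{(k)}$ is the $k$th eigenvector of $\mathbf{X}^T \mathbf{X}$.
The full principal components decomposition of $\mathbf{X}$ can therefore be given as $\mathbf{T} = \mathbf{X} \cdot \mathbf{L}$, where the columns of $\mathbf{L}$ are the eigenvectors of $\mathbf{X}^T \mathbf{X}$.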
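The decomposition above can be illustrated with a minimal NumPy sketch. It is not the computation used in the paper (which relied on SPSS); the function name `pca_loadings_and_scores` and the random demonstration data are assumptions made purely for illustration.

```python
import numpy as np

def pca_loadings_and_scores(X):
    """Return eigenvalues, loadings L and scores T = Xc L for a data matrix X."""
    # Centre each column so that the empirical mean is zero, as the text assumes.
    Xc = X - X.mean(axis=0)
    # X^T X is symmetric, so eigh applies; it returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)
    order = np.argsort(eigvals)[::-1]      # reorder: largest variance first
    eigvals, L = eigvals[order], eigvecs[:, order]
    T = Xc @ L                             # principal component scores
    return eigvals, L, T

# Hypothetical data: 14 samples of 5 correlated variables (not the paper's data).
rng = np.random.default_rng(0)
base = rng.normal(size=(14, 2))
X_demo = np.hstack([base, base @ rng.normal(size=(2, 3)) + 0.1 * rng.normal(size=(14, 3))])

eigvals, L, T = pca_loadings_and_scores(X_demo)
# The scores are uncorrelated: T^T T is diagonal, with the eigenvalues on the diagonal.
gram = T.T @ T
```

Because the loading vectors are orthonormal eigenvectors, `gram` is diagonal up to rounding error, which is the sense in which the new variables are uncorrelated with each other.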

Stepwise Regression.
In statistics, stepwise regression includes regression models in which the choice of predictive variables is carried out by an automatic procedure [21,22]. Usually, this takes the form of a sequence of $F$-tests or $t$-tests, but other techniques are possible, such as adjusted $R^2$, Akaike information criterion, Bayesian information criterion, Mallows' Cp, PRESS, or false discovery rate. The frequent practice of fitting the final selected model followed by reporting estimates and confidence intervals without adjusting them to take the model building process into account has led to calls to stop using stepwise model building altogether [23,24] or to at least make sure model uncertainty is correctly reflected [25,26]. Stepwise regression includes three main approaches: forward selection, backward elimination, and bidirectional elimination. A widely used algorithm was first proposed by [27]. This is an automatic procedure for statistical model selection in cases where there are a large number of potential explanatory variables and no underlying theory on which to base the model selection. The procedure is used primarily in regression analysis, though the basic approach is applicable in many forms of model selection. This is a variation on forward selection. At each stage in the process, after a new variable is added, a test is made to check if some variables can be deleted without appreciably increasing the residual sum of squares (RSS). The procedure terminates when the measure is (locally) maximized or when the available improvement falls below some critical value.
One of the main issues with stepwise regression is that it searches a large space of possible models. Hence it is prone to overfitting the data. In other words, stepwise regression will often fit much better in sample than it does on new out-of-sample data. This problem can be mitigated if the criterion for adding (or deleting) a variable is stiff enough. The key line in the sand is at what can be thought of as the Bonferroni point, namely, how significant the best spurious variable should be based on chance alone. On a $t$-statistic scale, this occurs at about $\sqrt{2 \log p}$, where $p$ is the number of predictors. This fence turns out to be the right trade-off between overfitting and missing signal. If we look at the risk of different cutoffs, then using this bound will be within a $2 \log p$ factor of the best possible risk. Any other cutoff will end up having a larger such risk inflation [28,29].
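As a rough illustration of the selection idea, the following sketch implements plain forward selection with a partial $F$ statistic as the entry criterion. It deliberately omits the deletion check described above, so it is a simplification rather than the algorithm of [27]; the names `rss` and `forward_select`, the threshold `f_enter = 4.0`, and the synthetic data are all hypothetical.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of an OLS fit of y on an intercept plus X[:, cols]."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def forward_select(X, y, f_enter=4.0):
    """Greedy forward selection: repeatedly add the variable with the largest
    partial F statistic until no candidate exceeds f_enter (assumed threshold)."""
    n, p = X.shape
    selected = []
    current = rss(X, y, selected)
    while len(selected) < min(p, n - 2):
        best = None
        for c in range(p):
            if c in selected:
                continue
            new = rss(X, y, selected + [c])
            dof = n - len(selected) - 2    # residual d.o.f. once c is added
            f_stat = (current - new) / (max(new, 1e-12) / dof)
            if best is None or f_stat > best[0]:
                best = (f_stat, c, new)
        if best is None or best[0] < f_enter:
            break
        selected.append(best[1])
        current = best[2]
    return selected

# Synthetic example: only columns 0 and 3 carry signal.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 6))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + 0.1 * rng.normal(size=40)
selected = forward_select(X_demo, y_demo)
```

With a loose entry threshold such as the one assumed here, spurious columns can occasionally enter alongside the true ones, which is the overfitting risk that a stricter criterion is meant to control.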
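For a sense of scale, the Bonferroni point can be evaluated for the 22 candidate factors considered in this paper. The snippet assumes that "log" denotes the natural logarithm; the helper name `bonferroni_t_threshold` is hypothetical.

```python
import math

def bonferroni_t_threshold(p):
    """Approximate |t| cutoff sqrt(2 log p) for p candidate predictors."""
    # Assumption: natural logarithm.
    return math.sqrt(2.0 * math.log(p))

threshold_22 = bonferroni_t_threshold(22)   # roughly 2.49 for p = 22
```

Under that assumption, a candidate variable would need a $t$-statistic of roughly 2.5 in absolute value before it is distinguishable from the best spurious predictor among 22.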

Stepwise Regression-PCA Model.
The purpose of the irrigation water use efficiency study is to find the five major driven factors among all the driven factors in China. In other words, the final results must be presented in terms of the original data, and the number of chosen factors cannot be less than five. The major challenges in selecting the mathematical model are as follows.
(1) While eliminating the multicollinearity, the factors that are uncorrelated with the other factors must be retained.
(2) As many as possible of the variables that actually carry information must be included.
(3) It must be possible to transform the data back to the original variables.
As such, the stepwise regression model combined with the PCA model is applied in this paper. Mathematically, one or more of these driven factors may be uncorrelated with the other factors. We therefore use the variance inflation factor (VIF), which quantifies the severity of multicollinearity in an ordinary least squares regression analysis, to measure how much the variance (the square of the estimate's standard deviation) of an estimated regression coefficient is increased because of collinearity. The factors uncorrelated with the others can then be found. For the remaining factors, we use the PCA model to eliminate their multicollinearity by transforming them into a new set of variables while carrying all the original information. Finally, stepwise regression is used to establish an optimal regression model to fit the data. The flowchart of the irrigation water use efficiency analysis is shown in Figure 1.

Driven Factors Screening
(1) Comprehensiveness. The driven factors must be closely related to irrigation water use efficiency, and the factors should reflect China's agricultural development status comprehensively.
(2) Objectivity. The driven factors must be objective, eliminating subjective factors as far as possible. As such, the experts can judge the importance of factors fairly.
(3) Independence. Although interconnections among variables are inevitable, we can choose factors judged to be highly independent in accordance with the experts' empirical conclusions.
(4) Convenience. In addition to the three rules above, we must remove the factors whose data cannot be collected. This helps to decrease the difficulty of this study.
According to the four rules above, the flowchart of driven factors screening is shown in Figure 2. After the screening, the final driven factors are shown in Table 1.

Calculation
Taking the data of 22 independent variables and IWUC of China in the period of 1997-2010 into consideration, we have conducted the stepwise regression and principal component analysis.The data used here come from National Bureau of Statistics of China, 2011-2014.

Stepwise Regression for Collinearity Diagnostics.
Collinearity can be diagnosed by Tolerance. Tolerance is the ratio of the residual sum of squares to the total sum of squares obtained when one independent variable is regressed (as the dependent variable) on the other independent variables. The smaller this index is, the more serious the collinearity is; that is to say, this independent variable can be predicted precisely by the remaining independent variables. The criterion for Tolerance in this paper is 0.1: namely, if an independent variable's Tolerance is less than 0.1, collinearity exists. The variance inflation factor (VIF) is the reciprocal of Tolerance.
The outcome of collinearity diagnosis by SPSS is shown in Table 2.
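The Tolerance and VIF definitions above translate directly into a few lines of NumPy. The sketch below is not the SPSS procedure used for Table 2; the function name `tolerance_and_vif` and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def tolerance_and_vif(X):
    """Tolerance (RSS / TSS) and VIF (1 / Tolerance) for every column of X."""
    n, p = X.shape
    tol = np.empty(p)
    for j in range(p):
        y = X[:, j]                                   # variable j as the response
        others = np.delete(X, j, axis=1)              # all other variables
        A = np.column_stack([np.ones(n), others])     # add an intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        tol[j] = (resid @ resid) / ((y - y.mean()) ** 2).sum()
    return tol, 1.0 / tol

# Synthetic example: column 2 is almost the sum of columns 0 and 1,
# while column 3 is independent of the rest.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(50, 4))
X_demo[:, 2] = X_demo[:, 0] + X_demo[:, 1] + 0.01 * rng.normal(size=50)

tol, vif = tolerance_and_vif(X_demo)
collinear = tol < 0.1     # the criterion adopted in this paper
```

In this synthetic case the three linearly related columns fall below the 0.1 criterion, while the independent column does not.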
The first industrial output (FIO) is the only indicator that is not correlated with the other 21 indicators. However, the coefficient of determination $R^2 = 0.286$ indicates that the data do not fit the statistical model very well; there remains $\sqrt{1 - R^2} = 0.845$, which can be improved. We therefore need to introduce other variables without collinearity.

Principal Component Analysis for Remaining Variables.
As mentioned above, only one factor has been picked out; the remaining 21 variables are correlated with each other. As such, principal component analysis was conducted to eliminate the collinearity among the 21 variables and to determine the principal factors under the condition that the original data information must be retained. First, the standardized data of the 21 variables are shown in Table 3. These standardized data were then used in SPSS to conduct the principal component analysis. The total variance explained for the 21 variables is shown in Table 4, and the resulting principal component score coefficient matrix is shown in Table 5.
According to the principal components score coefficient matrix, we applied the coefficients to the original data to form the new data of the principal components. For example, the data of the first component (FAC 1) were obtained according to the following equation: where $F_1$ is the new data of FAC 1 and $X_t$ is the original data of the 21 variables (indexed 1, 2, ..., 21); $t$ = 1997, 1998, ..., 2010.
As such, the numerical matrix of the 13 principal components is listed in Table 6. Because the first industrial output (FIO) has been picked out, its standardized data are also listed in Table 6.
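The construction of component scores from standardized data and a score coefficient matrix can be sketched as follows. The sketch assumes that the score coefficients are the eigenvectors of the correlation matrix divided by the square roots of their eigenvalues, which yields unit-variance scores; whether this matches the exact SPSS settings used in the paper is an assumption, and the random data stand in for Table 3.

```python
import numpy as np

def component_scores(X, tol=1e-8):
    """Standardise X, then return (Z, B, F): standardised data, score
    coefficient matrix and component scores F = Z B."""
    n = X.shape[0]
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # zero mean, unit variance
    R = Z.T @ Z / (n - 1)                              # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > tol                               # drop numerically zero components
    B = eigvecs[:, keep] / np.sqrt(eigvals[keep])      # assumed score coefficients
    return Z, B, Z @ B

# Hypothetical stand-in for the data: 14 yearly samples of 21 variables.
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(14, 21))
Z, B, F = component_scores(X_demo)
n_components = F.shape[1]
```

With 14 samples, the centred data matrix has rank at most 13, so at most 13 components have nonzero variance, which is consistent with the 13 principal components listed in Table 6.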

Multivariate Linear Regression Analysis.
Based on the stepwise regression and principal component analysis, 14 independent variables that are not correlated with each other have been found. After principal component analysis, the first three components contribute about 88.42% of the total variance. Combined with the variable FIO, FAC 4 is also considered for inclusion in the linear regression. Namely, the first four components and FIO are considered in the multivariate linear regression analysis, taking the IWUC as the dependent variable.
Then the results after multivariate linear regression analysis are shown in Table 7, and the coefficients of regression are shown in Table 8.
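The stepwise introduction of regressors and the resulting coefficient of determination can be sketched as below. The data are synthetic and the helper `r_squared` is hypothetical; the sketch only illustrates that $R^2$ of nested least squares fits cannot decrease as regressors are added, not the values reported in Table 7.

```python
import numpy as np

def r_squared(A, y):
    """Coefficient of determination of an OLS fit of y on an intercept plus A."""
    A1 = np.column_stack([np.ones(len(y)), A])
    coef, *_ = np.linalg.lstsq(A1, y, rcond=None)
    resid = y - A1 @ coef
    return 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

# Synthetic stand-in: 14 samples, 5 regressors (e.g. FIO plus four components).
rng = np.random.default_rng(5)
A_demo = rng.normal(size=(14, 5))
y_demo = A_demo @ np.array([0.6, 0.4, -0.3, 0.2, 0.1]) + 0.3 * rng.normal(size=14)

# R^2 after introducing the first k regressors, k = 1..5.
r2_path = [r_squared(A_demo[:, :k], y_demo) for k in range(1, 6)]
```

Since the in-sample $R^2$ never falls when a regressor is added, a threshold on $R^2$ should be read together with the significance of each added term.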
Table 7 indicates that the introduction of principal components affected the dependent variable positively, because the determination coefficient $R^2$ increased step by step as the principal components were introduced. Moreover, $R^2 = 0.81 > 0.8$, which is large enough according to the criterion in Figure 1. So we got the regression equation:

Results
Table 6 represents the projections of original factors on the components and we get four principal components and one original driven factor according to the regression equation.
As such, we need to recover the other original driven factors by formula conversion using Table 9, based on (7) and (8). Five major driven factors were then selected in Table 10, because they have relatively large regression coefficients and therefore affect the regression equation significantly.
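One way to read this conversion is as a matrix product: if the component scores are $F = ZB$ (standardized data times score coefficients) and the regression on the components has coefficients $\gamma$, then the implied coefficients on the standardized original variables are $\beta = B\gamma$. The sketch below illustrates this under those assumptions; the arrays `Z`, `B` and `gamma` are random or made-up placeholders, not the values in Tables 5, 8 or 9, and the paper's equations (7) and (8) may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_vars, n_comp = 14, 21, 4

Z = rng.normal(size=(n_samples, n_vars))   # hypothetical standardised data
B = rng.normal(size=(n_vars, n_comp))      # hypothetical score coefficient matrix
gamma = np.array([0.5, -0.3, 0.2, 0.1])    # hypothetical coefficients on FAC 1..4

F = Z @ B                                  # component scores
beta = B @ gamma                           # implied coefficients on original variables

# Rank the original variables by the magnitude of their implied coefficient.
top5 = np.argsort(-np.abs(beta))[:5]
```

Both routes give the same fitted contribution, since `F @ gamma` equals `Z @ beta`; ranking by the magnitude of `beta` then mirrors the selection of factors with relatively large regression coefficients.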

Conclusions
The enforcement of the "Efficient Red Line" is the core of "the most strict water management institution." This paper studied irrigation water use efficiency in support of the dynamic management of water resources in China. The main aim of this paper is to improve sustainable water resources use so that China's economy and society can develop smoothly.
Through literature research and survey, 22 indicators were selected as the driven factors. Through expert discussion and correlation analysis, the index system of irrigation water use efficiency was established. This paper analyzed a variety of mathematical models in order to pick the most appropriate methods. After analysis, MSR and PCA were selected and modified to fulfill our research requirements. Finally, the major driven factors that most influence China's irrigation water use efficiency were found. The lessons learned from the application of the methods, in view of the results and the process, are given in the following, together with considerations of how the proposed approach can be reused in similar problems and how it can be improved.
(1) We improved the agricultural irrigation water use efficiency index system. The Delphi method can reasonably determine the order of all the indicators, so we used it to obtain the effective indicators. This method is feasible because, to simplify the index system, we need to know the order of the indicators rather than their weighting values. After applying the Delphi method, the useful indicators were obtained, and we then collected the indicators and their data as fully as possible. Members of the investigation team divided the work of collecting the indicators and data and cooperated with one another. Through a variety of channels, means, and mathematical prediction models, each indicator has its own data source for the construction of the agricultural irrigation water use efficiency control system.
(2) Quantitative analysis provides a sound mathematical basis for scientific research. We used the correlation coefficient method and a modified PCA method to analyze the indicators. To address the limitations of PCA, we introduced MSR to reduce invalid results, and we extracted all the principal components to retain the original data. We combined multiple linear regression with principal component analysis to eliminate the multicollinearity among indicators. The indicators without multicollinearity were found by stepwise regression; principal component analysis was then conducted to determine the principal components of the factors with multicollinearity on the basis of national data in China. As such, the regression equation of irrigation water use efficiency was constructed, and the major driven factors were found because the principal components preserved the original data information.
(3) In this paper, according to different districts and base years, the authors gave an objective assessment of the national irrigation development model, water-saving irrigation style, and water-saving input and output. Meanwhile, this paper summarized the development process of irrigation water use efficiency. Then, principal component analysis (PCA) was applied to eliminate the interfering factors of irrigation water use efficiency; the information covered 31 provinces over the period from 1997 to 2010. The study provides a reference for the zoning of national irrigation water use efficiency, eliminating the linear interference between indicators and summing up five principal factors.
(4) The outlook for the proposed approach's use in similar problems is as follows. The utilization efficiency of irrigation water is in a process of continuous dynamic change, limited or influenced by national policies and resource endowment. It is therefore very hard to collect the data of indicators, which inevitably makes index selection subjective. The approach proposed in this paper addresses this issue effectively, and it can be reused in similar problems because it combines subjective and objective weighting methods. For example, ecological problems increasingly draw public attention nowadays, so it is inevitable that ecological indicators, such as groundwater depth and fertilization frequency in irrigation districts, will be introduced into the assessment of irrigation water use efficiency. However, the popular study methods for ecological indicators are limited to qualitative methods; if the approach proposed in this paper is adopted, the research results on ecological indicators of irrigation water use will be more convincing.
(5) The future improvement of the proposed approach is as follows. Path analysis, a statistical method of testing cause/effect relationships, can effectively explain the paths through which independent variables affect the dependent variable. By analyzing the direct and indirect effects, the mode of impact of the factors can be determined. As such, we can apply path analysis to study the mechanism by which driven factors affect irrigation water use efficiency. The development of an irrigation water use efficiency control system is our ultimate goal. Combined with software development technology, we can further research and develop the irrigation water use efficiency control system; the research results of the computation information system can also be applied in agricultural water-saving management. Through the computer network, agricultural water-saving information can be collected, stored, and processed. Agricultural water use monitoring and scheduling can also continually improve the level of agricultural water-saving management, so that modern agricultural water-saving management will finally be realized.

Figure 1: Flowchart of irrigation water use efficiency analysis.

Figure 2: Screening and classification process chart of driven factors.

Table 1: Explanation of the selected driven factors.
Note: the economic indices in this table are calculated at comparable prices.
a First industrial output.

Table 3: Data of remaining 21 variables and IWUC after standardization.

Table 4: Total variance of interpretation.

Table 5: Principal components score coefficient matrix.