A Novel Clustering Model Based on Set Pair Analysis for the Energy Consumption Forecast in China

The energy consumption forecast is important for the decision-making of national economic and energy policies. However, it is a complex and uncertain system problem affected by the external environment and various uncertain factors. Herein, a novel clustering model based on set pair analysis (SPA) was introduced to analyze and predict energy consumption. The annual dynamic relative indicator (DRI) of historical energy consumption was adopted to conduct a cluster analysis with Fisher's optimal partition method. Combined with indicator weights, group centroids of DRIs for influence factors were transformed into aggregated connection numbers in order to interpret uncertainty by identity-discrepancy-contrary (IDC) analysis. Moreover, a forecasting model based on similarity to the group centroids was discussed to forecast the energy consumption of a given year from measured values of the influence factors. Finally, a case study predicting China's future energy consumption, together with a comparison with the grey method, was conducted to confirm the reliability and validity of the model. The results indicate that the method presented here is more feasible and easier to use and can interpret the certainty and uncertainty of the development speed of energy consumption and its influence factors as a whole.


Introduction
China is currently in the middle stage of industrialization and urbanization and is the world's second largest energy consumer. Energy is an essential material base for economic development. Energy consumption has risen sharply along with rapid and steady economic growth, industrialization, and urbanization in China, which has resulted in a serious imbalance between the supply and demand of energy [1,2]. In China, total energy consumption increased from 0.987 billion tons of coal equivalent (tce) in 1990 to 3.25 billion tce in 2010 [3]. This has a significant influence on the current global energy profile, and China also faces a major challenge in reducing carbon emissions in the future. Forecasting energy consumption is a significant precondition and basis for making energy policies in a developing country like China. Consequently, to maintain sustainable and stable development, accurate forecasts of energy consumption are essential, and the development of a rational forecast model is especially urgent and necessary.
Many researchers have studied the relation between energy consumption and economic growth at the national or regional level and proposed forecast models for countries such as Turkey, India, Iran, UK, Finland, New Zealand, and China [4][5][6][7]. Nonetheless, energy consumption is affected by the external environment and various other elements. Thus, to obtain rational forecast results, various prediction models based on regression analysis [8,9], artificial neural network (ANN) theory [10][11][12][13][14][15][16][17], grey theory [18][19][20], multivariate statistical analysis theory, and time series theory [21,22] have been proposed. However, no general model exists so far, because the previous models do not reveal general characteristics and each has its own drawbacks. Besides, traditional techniques, such as regression analysis, econometric models, and the autoregressive integrated moving average (ARIMA) model, are poor in precision when data are few or exhibit nonlinear characteristics [20,[23][24][25]. Although grey models (GMs) can overcome the problem induced by small data sets or data with limited information, a weakness of grey theory is that prediction accuracy is hard to control, since the effectiveness of the residual series of GM(1, 1) depends on the number of data points with the same sign, which is low when the observations are limited [26].
ANN techniques can avoid this disadvantage, but they cannot present an explicit relationship between energy consumption and impact factors, and their application is limited by knowledge acquisition. Other effective approaches such as support vector regression (SVR), adaptive particle swarm optimization (PSO), and the genetic algorithm (GA) were also introduced to forecast electricity consumption [27][28][29], but their procedures are complex and rather inconvenient for decision-makers. Meanwhile, some integrated models [30,31] were also proposed, and the results exhibited superiority compared with single optimization. To study energy savings in buildings, Gaitani et al. [32] defined five energy classes for 1100 schools from all the prefectures of Greece through clustering techniques and introduced principal components analysis to identify the typical characteristics of school buildings belonging to a particular energy class. As mentioned above, much research has been performed to advance the techniques for energy consumption forecast, yet the problem is still not well resolved. Few studies focus on energy consumption forecast by uncertainty analysis methods, and the previous studies cannot quantitatively consider and depict the certainty and uncertainty relationship between energy consumption and influence factors as a whole. The newly proposed method of set pair analysis (SPA) can deal with the uncertainty problem from the three aspects of identity, discrepancy, and contrary features and comprehensively depict the essential characteristics of things [33][34][35]. This method therefore provides a fresh idea for energy consumption forecast.
This paper introduces a novel clustering model based on SPA for energy consumption prediction to deal with the uncertain relation between energy consumption and its influence factors. The proposed model is used to forecast China's energy consumption, and its feasibility and effectiveness are further discussed.

Theory
2.1. Brief of Set Pair Analysis Theory.
The set pair analysis theory put forward by Zhao [33] is an uncertainty analysis method. Certainty and uncertainty can be taken into consideration as a whole and be treated dialectically during the process of SPA. The connection number, a basic idea of SPA theory, is used to depict the relation of a set pair made up of two sets [33][34][35][36][37][38][39][40]. The corresponding function of the connection number can be written as
$$\mu = a + bi + cj, \quad (1)$$
where $i$ is the discrepancy coefficient, $i \in [-1, 1]$, which sometimes serves only as a marker of discrepancy; $j$ is the contrary coefficient, $j \equiv -1$, or a marker of the contrary; $a$, $b$, and $c$ are the identical degree, discrepancy degree, and contrary degree, respectively, and $a + b + c = 1$.
In applications, SPA theory can describe certainty and uncertainty by connection numbers in one system. Meanwhile, the IDC analysis of a set pair is a dynamic process, in which identity can transform into discrepancy or contrary as conditions change. Therefore, SPA can be utilized to clearly interpret the uncertainty of energy consumption forecast.
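As a minimal illustration of how a connection number of the standard form $\mu = a + bi + cj$ can be handled numerically, the Python sketch below stores the three degrees and evaluates $\mu$ for a chosen discrepancy coefficient. The class name `ConnectionNumber`, the validation tolerance, and the example degrees are illustrative assumptions rather than anything specified in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectionNumber:
    """Connection number mu = a + b*i + c*j with a + b + c = 1 (hypothetical helper)."""

    a: float  # identical degree
    b: float  # discrepancy degree
    c: float  # contrary degree

    def __post_init__(self) -> None:
        # Assumed validity check: non-negative degrees that sum to 1.
        if min(self.a, self.b, self.c) < 0 or abs(self.a + self.b + self.c - 1.0) > 1e-9:
            raise ValueError("degrees must be non-negative and sum to 1")

    def value(self, i: float = 0.0, j: float = -1.0) -> float:
        """Evaluate mu for a discrepancy coefficient i in [-1, 1]; j defaults to -1."""
        if not -1.0 <= i <= 1.0:
            raise ValueError("i must lie in [-1, 1]")
        return self.a + self.b * i + self.c * j


mu = ConnectionNumber(a=0.5, b=0.3, c=0.2)  # illustrative degrees only
print(mu.value(i=0.0))   # discrepancy treated as neutral
print(mu.value(i=1.0))   # discrepancy counted entirely towards identity
print(mu.value(i=-1.0))  # discrepancy counted entirely towards contrary
```

Varying `i` over $[-1, 1]$ shows how the same set pair can lean towards identity or contrary, which is the dynamic behaviour described above.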

Optimal Partition Method.
Cluster analysis is widely used in pattern recognition, image analysis, information retrieval, and other fields. Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals, or particular statistical distributions [41,42]. The literature shows that most clustering methods cannot deal with ordered samples, since ordered samples must not be scattered across the clusters. However, the optimal partition method presented by Fisher in 1958 can effectively overcome this problem [43,44]. The method has a concise calculation process, finds reasonable groups efficiently, and saves computing cost. Suppose that $n$ samples are partitioned into $k$ clusters, denoted $P(n, k)$. The core of the Fisher algorithm can be expressed as the following recursion formulas:
$$L[P(n, 2)] = \min_{2 \le j \le n} \{ D(1, j-1) + D(j, n) \},$$
$$L[P(n, k)] = \min_{k \le j \le n} \{ L[P(j-1, k-1)] + D(j, n) \},$$
$$D(i, j) = \sum_{t=i}^{j} (x_t - \bar{x}_G)^2, \quad (2)$$
where $L[P(n, k)]$ denotes an objective function, $D(i, j)$ is the diameter of a cluster $G(i, j) = \{x_i, x_{i+1}, \ldots, x_j\}$ ($i < j$), and $x_t$ and $\bar{x}_G$ are the indicator value and mean value of the ordinal samples in $G(i, j)$.

Development of Cluster Forecasting Model Based on Set Pair Analysis
Step 1. Conduct cluster analysis with Fisher's optimal partition method for the ascending DRIs of historical energy consumption, and calculate the group centroid values of energy consumption and influence factors for each cluster.
Step 2. Construct a criterion for set pair analysis, and use the corresponding formulas to transform the mean values of DRIs into connection numbers, giving the identical degree, discrepancy degree, and contrary degree between the influence factors and the reference set. Then, combine these with the weights of the influence factors to obtain integrated connection numbers.
Step 3. Express the influence factors of the energy consumption to be forecasted in a certain year with connection numbers obtained from the IDC analysis, and calculate their similarities to each cluster.
Step 4. Construct a forecasting model and predict the energy consumption in a specified year.
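Step 1 relies on Fisher's optimal partition of an ordered sequence. The following Python sketch shows one way to implement the recursion by dynamic programming, using the within-cluster sum of squared deviations as the cluster diameter. The function name `fisher_optimal_partition` and the sample values are illustrative assumptions; the paper's own implementation is not given.

```python
from itertools import accumulate
from typing import List, Tuple


def fisher_optimal_partition(x: List[float], k: int) -> Tuple[float, List[List[float]]]:
    """Split the ordered sequence x into k contiguous clusters that minimise
    the total within-cluster sum of squared deviations."""
    n = len(x)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(x)")

    # Prefix sums allow each cluster diameter to be evaluated in O(1).
    s1 = [0.0] + list(accumulate(x))
    s2 = [0.0] + list(accumulate(v * v for v in x))

    def diameter(i: int, j: int) -> float:
        """Sum of squared deviations of x[i..j] (0-based, inclusive)."""
        m = j - i + 1
        total = s1[j + 1] - s1[i]
        return max(0.0, (s2[j + 1] - s2[i]) - total * total / m)

    inf = float("inf")
    # loss[c][p]: minimal objective for the first p samples in c clusters.
    loss = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    loss[0][0] = 0.0
    for c in range(1, k + 1):
        for p in range(c, n + 1):
            # The last cluster covers 0-based indices j .. p - 1.
            for j in range(c - 1, p):
                cand = loss[c - 1][j] + diameter(j, p - 1)
                if cand < loss[c][p]:
                    loss[c][p] = cand
                    cut[c][p] = j

    # Backtrack the stored cut points to recover the clusters.
    clusters: List[List[float]] = []
    p = n
    for c in range(k, 0, -1):
        j = cut[c][p]
        clusters.append(x[j:p])
        p = j
    clusters.reverse()
    return loss[k][n], clusters


speeds = [1.00, 1.01, 1.02, 1.10, 1.11, 1.16]  # made-up ascending DRIs
objective, groups = fisher_optimal_partition(speeds, 3)
print(objective, groups)
```

Calling the function for several values of `k` and plotting the returned objective against `k` gives the kind of curve from which a knee point can be read off.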

Definition of IDC Criteria.
To combine the concept of IDC with the development speed, suppose that the reference set is $x^{0}_{tl} = 1$ ($t = 1, 2, \ldots, n$; $l = 1, 2, \ldots, p$), because the original state of the annual DRI is a relatively certain variable. Then, the IDC analysis between the annual DRI and the reference set can be conducted for the discussed year on the basis of the IDC criteria, which are defined as follows. If the annual DRI is greater than 1, it is defined as identity according to SPA theory, and the larger the DRI, the higher the possibility of growth. If the annual DRI is less than 1, it is defined as contrary, and the smaller the DRI, the higher the possibility of attenuation. If the annual DRI equals 1, it is called discrepancy, which means that the discussed indicator shows an uncertain characteristic relative to the initial state of the influence factor.
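The criteria above can be sketched in a few lines of Python. The sketch assumes that the annual DRI is the chained development speed, that is, each year's value divided by the previous year's value. The function names, the small tolerance used to test equality with 1, and the indicator levels are illustrative assumptions.

```python
from typing import List


def chained_dri(series: List[float]) -> List[float]:
    """Chained development speed: each value divided by the previous one."""
    return [cur / prev for prev, cur in zip(series, series[1:])]


def idc_label(dri: float, tol: float = 1e-9) -> str:
    """Label a DRI against the reference value 1 (tolerance is an assumption)."""
    if dri > 1.0 + tol:
        return "identity"     # growth relative to the previous year
    if dri < 1.0 - tol:
        return "contrary"     # attenuation relative to the previous year
    return "discrepancy"      # unchanged, treated as uncertain


levels = [100.0, 108.0, 108.0, 102.6]  # made-up indicator levels
for dri in chained_dri(levels):
    print(round(dri, 3), idc_label(dri))
```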

Expressions of Group Centroids.
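Let $g$ ($g = 1, 2, \ldots, k$) index the clusters or categories. The corresponding group centroid of the influence factors for each category is defined as
$$\bar{x}_l^{(g)} = \frac{1}{n_g} \sum_{t=1}^{n_g} x_{tl}^{(g)}, \quad (3)$$
where $\bar{x}_l^{(g)}$ is the mean value of the $l$th influence factor on category $g$ and $x_{tl}^{(g)}$ is the corresponding DRI value of the $l$th influence factor on sample $t$ ($t = 1, 2, \ldots, n_g$). Then, based on the set pair analysis theory [40], the integrated connection number $\mu_g$, used to depict growth and decline features, can be written for $p$ influence factors as
$$\mu_g = a_g + b_g i + c_g j = \sum_{l=1}^{p} w_l a_{gl} + \sum_{l=1}^{p} w_l b_{gl}\, i + \sum_{l=1}^{p} w_l c_{gl}\, j, \quad (4)$$
where the component degrees $a_{gl}$, $b_{gl}$, and $c_{gl}$ follow from (5), (6), and (7), and $x_L$ and $x_U$ are the lower and upper values of DRI for historical indicators. The value ranges of the two coefficients are presented in Figure 2.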
The intersection point of the two linear functions (9) and (10) is used to specify the two coefficients, as given in (11). In a similar way, the connection degree between the measured indicator values of a forecasting year and the reference set (all elements equal to 1) can also be obtained.
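The averaging in (3) and the weighted aggregation in (4) can be sketched as follows. The per-indicator degrees from (5) to (7) are taken as given inputs here, because those expressions are not reproduced in this text. The function names, the requirement that weights sum to 1, and all numbers are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Degrees = Tuple[float, float, float]  # (a, b, c) for one indicator


def group_centroid(dri_rows: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise mean of the DRI rows belonging to one category."""
    n = len(dri_rows)
    return [sum(col) / n for col in zip(*dri_rows)]


def integrate(degrees: Sequence[Degrees], weights: Sequence[float]) -> Degrees:
    """Weighted sum of per-indicator degrees (weights assumed to sum to 1)."""
    if len(degrees) != len(weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("need one weight per indicator, summing to 1")
    a = sum(w * d[0] for w, d in zip(weights, degrees))
    b = sum(w * d[1] for w, d in zip(weights, degrees))
    c = sum(w * d[2] for w, d in zip(weights, degrees))
    return a, b, c


# Made-up DRIs of four influence factors for two samples in one category.
rows = [[1.10, 1.02, 1.03, 1.05], [1.12, 1.00, 1.03, 1.01]]
print(group_centroid(rows))

# Made-up per-indicator degrees, combined with equal weights.
per_indicator = [(0.6, 0.3, 0.1), (0.4, 0.4, 0.2), (0.5, 0.3, 0.2), (0.5, 0.4, 0.1)]
print(integrate(per_indicator, [0.25, 0.25, 0.25, 0.25]))
```

Because each indicator's degrees sum to 1 and the weights sum to 1, the integrated degrees also sum to 1.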

The Forecast Model
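The connection numbers of two set pairs $H_1$ and $H_2$ formed with the reference set take the form
$$\mu_{H_1} = a_1 + b_1 i + c_1 j, \qquad \mu_{H_2} = a_2 + b_2 i + c_2 j, \quad (12)$$
where $a_r$, $b_r$, and $c_r$ ($r = 1, 2$) are the identical degree, discrepancy degree, and contrary degree, respectively. The similarity between set pairs $H_1$ and $H_2$ is defined by (13), in which $\rho_g$ is the similarity to category $g$. The forecast model for the annual development speed is given by (14), in which $y_g$ is the centre of energy consumption on category $g$ and $y$ is the forecasted value.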
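The overall logic of a similarity-based forecast can be sketched as below. Since expressions (13) and (14) are not reproduced in this text, the sketch substitutes two stand-ins that are purely assumptions: a similarity defined as 1 minus a scaled Euclidean distance between the degree triples, and a forecast computed as the similarity-weighted mean of the category centres. It should be read as an illustration of the idea, not as the model used in the paper.

```python
import math
from typing import Sequence, Tuple

Degrees = Tuple[float, float, float]  # (a, b, c) of a connection number


def similarity(p: Degrees, q: Degrees) -> float:
    """ASSUMED similarity in [0, 1]: 1 minus a scaled Euclidean distance."""
    dist2 = sum((u - v) ** 2 for u, v in zip(p, q))
    # Triples with non-negative entries summing to 1 differ by at most sqrt(2).
    return 1.0 - math.sqrt(dist2 / 2.0)


def forecast(target: Degrees, categories: Sequence[Degrees],
             centres: Sequence[float]) -> Tuple[int, float]:
    """Return (index of the most similar category, similarity-weighted centre)."""
    sims = [similarity(target, cat) for cat in categories]
    nearest = max(range(len(sims)), key=sims.__getitem__)
    total = sum(sims)
    if total == 0:
        raise ValueError("all similarities are zero")
    weighted = sum(s * y for s, y in zip(sims, centres)) / total  # ASSUMED rule
    return nearest, weighted


# Made-up integrated degrees and development-speed centres for three categories.
cats = [(0.2, 0.5, 0.3), (0.5, 0.3, 0.2), (0.7, 0.2, 0.1)]
centres = [1.02, 1.06, 1.12]
idx, speed = forecast((0.5, 0.3, 0.2), cats, centres)
print(idx, round(speed, 4))
```

The first return value corresponds to assigning the forecast year to its nearest category, and the second blends the category centres according to how similar each category is.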

Case Study
The model discussed is applied to analyze and predict China's energy consumption. Data from the China Statistical Yearbook were used to confirm its validity and effectiveness [3]. Many reports have shown that energy consumption is closely associated with factors such as GDP, proportion of secondary industry, urbanization level, and price index. Thus, in this model, GDP, proportion of secondary industry, urbanization level, and price index were taken as the major influence factors for energy consumption. The sample data and annual DRI values are listed in Tables 1 and 2.
According to Table 2, from 1990 to 2010, energy consumption increased significantly from 0.987 to 3.25 billion tce with an average annual development speed of 1.07. The chained development speed of Chinese energy consumption varied from 1.002 to 1.162. Based on Fisher's optimal partition method, the objective function varied with the number of clusters as shown in Figure 3. The knee point of the objective function was observed when the number of clusters equals 4, so the chained development speed of historical energy consumption was divided into 4 categories. Then, the group centroids of the influence factors listed in Table 3 were obtained according to (3). We then substituted the mean values of the influence factors into (5), (6), and (7) to obtain the corresponding connection numbers for each category. Herein, the two coefficients were 0.5 and 0.2, respectively, obtained on the basis of (11). Combined with equal weights for the influence factors, the integrated connection degree was obtained through (4). The results are presented in Table 4.
China's actual energy consumption data for 2010 were used to test and verify this model (see Table 5). Firstly, let set $C$ consist of the statistical values of GDP, proportion of secondary industry, urbanization level, and price index for the year 2010. Based on SPA theory, the IDC analysis between the influence factor values and the reference set $B = \{1, 1, 1, 1\}$ was conducted, and the corresponding connection numbers were calculated. Then, according to the definition of similarity in (13) and a nearby rule, the chained development speed for 2010 was assigned to category 3. Finally, by (14), the forecasted chained development speed of energy consumption was 1.085 with reference to 2009, a relative error of 2.37% in comparison to the real value of 1.06. The prediction was also conducted with the GM(1, 1) model, whose forecasted value was 1.024, with a relative error of 3.40%. This indicates that, for this test year, the proposed model gives a smaller relative error than GM(1, 1) and is feasible and effective.
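The comparison above can be checked with a short Python calculation using the rounded values quoted in the text. With these rounded inputs the errors come out at roughly 2.4% and 3.4%; a small difference from the reported 2.37% is to be expected if the unrounded actual value was used. The function name `relative_error` is an illustrative assumption.

```python
def relative_error(forecast: float, actual: float) -> float:
    """Absolute relative error of a forecast against the actual value."""
    return abs(forecast - actual) / abs(actual)


actual_speed = 1.06  # real chained development speed for 2010 (rounded in the text)
results = {
    "SPA-based clustering model": 1.085,
    "GM(1, 1)": 1.024,
}
for name, value in results.items():
    print(f"{name}: {relative_error(value, actual_speed):.2%}")
```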
As noted above, the method proposed here overcomes drawbacks of conventional methods that rely on a single type of information and a static perspective. It provides a more comprehensive background for characterizing energy consumption and for making appropriate energy policies in a developing country. However, the forecast of energy consumption involves factors that are incompatible, complex, diverse, combined, and dynamically uncertain. Consequently, it is important to clarify the effects of these factors on the energy forecast over various time frames. The equal weights of influence factors used to calculate the integrated connection degree in the case study may neglect the relative importance of the indicators and their effect on the prediction. To provide more information about the most sensitive parameters and to improve forecast accuracy, a considerable amount of further work is required, both on sensitivity analysis and on comparison with other methods, using actual indicator weights.
Based on the mean values of the influence factors over the 5 years from 2006 to 2010, the energy consumption in 2015 can be predicted with the discussed method, as shown in Table 5. If the economy continues as usual, energy demand in China will continue to increase rapidly to 5.35 billion tce by 2015. However, there is a huge potential for reducing this projected level by adjusting the energy and industrial structure and strengthening technology innovation. These results may be especially helpful in framing suitable energy policy.

Conclusions
To provide reliable data for the decision-making of macroeconomic policy, a rational forecast model for energy consumption is of significance, since well-targeted policies and reasonable measures depend on a rational energy consumption forecast. However, energy consumption forecast is a complex and uncertain problem due to interacting factors. In this study, based on historical data of China's energy consumption and influence factors, a novel clustering forecast model based on SPA was presented to analyze energy consumption. The following conclusions can be drawn.
(1) The results indicate that this novel method for forecasting energy consumption is feasible, effective, and convenient for practical applications. The cluster forecast model also provides a potential method for other uncertainty problems.
(2) The expressions in terms of connection number for group centroids of influence factors can depict the certainty and uncertainty of development speed as a whole.
(3) Based on the similarities of the DRIs of influence factors, the interaction among the influence factors and the similar information between historical samples and the prediction object can be taken into account in the proposed model.
Although our work has provided a useful clustering tool that makes full use of similar information from historical samples for the energy consumption forecast and analyzes the certainty and uncertainty of evaluation indicators from the three aspects of identity, discrepancy, and contrary, further investigations with sensitivity analysis are still needed to clarify the effects of indicators on the prediction over various time frames.

3.1. Basic Principle and Procedures.
The basic principle of the SPA-based clustering model is as follows. Based on the ordered DRIs of historical energy consumption, first conduct cluster analysis with Fisher's partition method to obtain categories of development speed. Then analyze the uncertainty of the influence factors through identity-discrepancy-contrary analysis. Finally, forecast the energy consumption in a certain year according to the similarities of the measured influence factors to each category. The corresponding flow chart is shown in Figure 1, and the detailed procedure consists of Steps 1 to 4.

Figure 1: Flow chart of SPA-based cluster forecast model.
The forecast is based on similarity. Let set $A$ be the DRIs of the influence factors of energy consumption, set $B$ the reference set in which all DRI values are 1, and set $C$ the measured DRIs of the influence factors of a forecast object. According to the definition of a set pair, two set pairs, $H_1(A, B)$ and $H_2(C, B)$, can be established, and the corresponding connection numbers are given by (12).

Figure 2: Value scope of the two coefficients.

Figure 3: Relationship between the number of clusters and the objective function.
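In (4), $a_g$, $b_g$, and $c_g$ are the integrated identical degree, discrepancy degree, and contrary degree, respectively; $w_l$ is an indicator weight; $a_{gl}$, $b_{gl}$, and $c_{gl}$ are the identical degree, discrepancy degree, and contrary degree of the $l$th indicator relative to the reference set on category $g$, respectively; the remaining two symbols are coefficients which, if $\bar{x}_l^{(g)} \in [x_L, x_U]$, satisfy range constraints of the form $0 < \cdot \le 0.25$ (see Figure 2).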

Table 1: Energy consumption and its influence factors from 1990 to 2010 in China.

Table 2: Ordered relative numbers of energy consumption and corresponding DRIs of influence factors.

Table 3: Interval of energy consumption and mean values of DRIs of influence factors for each cluster.

Table 4: Connection numbers between influence factors and the reference set.

Table 5: Forecasted results and comparison with the grey method.