Simulation Study on Clustering Approaches for Short-Term Electricity Forecasting

Advanced metering infrastructures such as smart metering have begun to attract increasing attention; a considerable body of research is currently focusing on load profiling and forecasting at different scales on the grid. Electricity time series clustering is an effective tool for identifying useful information in various practical applications, including the forecasting of electricity usage, which is important for providing more data to smart meters. This paper presents a comprehensive study of clustering methods for residential electricity demand profiles and further applications focused on the creation of more accurate electricity forecasts for residential customers. The contributions of this paper are threefold: (1) using data from 46 homes in Austin, Texas, the similarity measures from different time series are analyzed; (2) the optimal number of clusters for representing residential electricity use profiles is determined; and (3) an extensive load forecasting study using different segmentation-enhanced forecasting algorithms is undertaken. Finally, from the operator’s perspective, the implications of the results are discussed in terms of the use of clustering methods for grouping electrical load patterns.


Introduction
Throughout the EU, there is considerable interest in smart electricity networks, where increased control over electricity supply and consumption is achieved by investments and improvements in new technologies such as advanced metering infrastructure.Smart metering is part of this movement and is perceived as a necessary step in achieving the EU's energy policy goals by the year 2020 (i.e., cut greenhouse gas emissions by 20%, improve energy efficiency by 20%, and ensure that 20% of the EU's energy demand is supplied by renewable sources) [1].
Clustering analysis is an unsupervised learning technique that has been widely used to identify different energy consumption patterns, particularly among commercial customers, individual industries, or large aggregations of residential customers [2].Recently, the fast-growing stream of meter data has motivated further research on the application of clustering techniques to individual residential customers [3].By clustering time series of hourly load data, each customer can be represented by a number of load patterns, thus allowing variability information to be derived.Clustering can therefore serve as a valuable preprocessing step, providing fine-grained information on customer attributes and sources of variation for subsequent modeling and customer segmentation.
Many dissimilarity measures between time series have been proposed in the literature.They can be grouped into four categories [4,5]: (i) Shape: Minkowski distance, short time series distance, and dynamic time warping distance.
(ii) Editing: edit distance for real sequences and longest common subsequence distance.
(iv) Structure: Maharaja distance and Piccolo distance.
This study examines the first three categories, as they can be used in the majority of cases.The fourth category requires 2 Complexity some a priori assumptions that affect the obtained results.For instance, they assume that the observed time series is the result of a certain parametric base model, mainly the autoregressive integrated moving-average (ARIMA) model.This implies the need to prioritize the relevant parameters of the model in advance.
Dissimilarity between time series can be also computed using information theory.For instance, Kullback-Leibler (KL) divergence is a measure of how one time series probability distribution diverges from a second (expected) probability distribution [6].This measure is distribution-wise and asymmetric and thus does not qualify as a statistical metric of spread.In the simple case, a KL divergence of 0 indicates that we can expect similar, if not the same, behavior of two time series distributions, while a KL divergence of 1 indicates that the two time series distributions behave in such a different manner that the expectation given the first distribution approaches zero.The second example is mutual information (MI) of two time series which is a measure of the mutual dependence between two investigated time series [7].Intuitively, mutual information measures the information that two time series share: it gives the idea of how one of these time series reduces uncertainty about the other.For example, if two time series are independent, then knowing the first time series does not give any information about the second time series and vice versa, so their mutual information is zero.At the other extreme, if the first time series is a deterministic function of the second one, then the mutual information is maximal.
As their input, most clustering algorithms take parameters such as the number of clusters, density of clusters, or, at least, the number of points in a cluster.Nonhierarchical procedures usually require the user to specify the number of clusters before any clustering is accomplished, whereas hierarchical methods routinely produce a series of solutions ranging from  clusters to only a single cluster [8].As such, the problems of determining a suitable number of clusters for a dataset and evaluating the clustering results have been considered by several researchers.The procedure of evaluating the results of a clustering algorithm is known as cluster validity analysis [9].
In general, there are three approaches for investigating the cluster validity.The first, based on external criteria, compares the cluster analysis results to some known results, such as externally provided class labels.The second approach, based on internal criteria, uses information obtained from within the clustering process to evaluate how well the results fit the data without reference to external information.The third approach is based on relative criteria and consists of evaluating a clustering structure by comparing it with other structures given by the same algorithm using different parameter values (e.g., the number of clusters).In this paper, we consider this third class of measures [10].
Based on the selected time series similarity measures and hierarchical clustering algorithms, this study developed forecasting models for the aggregate electricity demand of individual groups of households.The predictions given by these models were compared with the results of a base model built for all households (aggregate over 46 consumers for WikiEnergy data [11]).In particular, using smart metering data, we aim to answer the following research questions: (1) To what extent is it possible to provide accurate 24-hour load forecasting for the group of households?
(2) Which of the proposed time series similarity measures and the measures determining the relevant number of clusters gives the greatest increase in forecast accuracy?
(3) What kind of forecasting methods and algorithms is appropriate to address highly volatile data?
The remainder of this paper is organized as follows.In Section 2, various time series similarity measures, grouped into three categories, are introduced.As different clustering algorithms usually lead to different numbers of clusters, Section 3 discusses several measures for determining the relevant number of clusters.In Section 4, based on WikiEnergy data gathered from 46 households, various numerical experiments regarding the clustering of these households are presented.Section 5 describes the methods used in our forecasting experiments, and Section 6 presents the results from a number of numerical experiments that provide 24-hour forecasts at different data aggregation levels.Finally, Section 7 concludes the paper.

Time Series Similarity Measures
The following subsections briefly describe three categories of time series similarity measures.In the remainder of this section, unless otherwise specified,   = ( 1 ,  2 , . . .,   )  and   = ( 1 ,  2 , . . .,   )  denote partial realizations from two real-valued processes {  ,  ∈ N} and {  ,  ∈ N}, respectively.Note that serial realizations of the same length  are initially assumed, although this limitation can be omitted in some cases.

Measures Based on the Shape of Time Series.
A simple approach for measuring the similarity between   and   is to consider conventional metrics based on the closeness of their values at specific points in time.Some commonly used rawvalues-based dissimilarity measures are introduced below.

Minkowski Distance. The Minkowski distance of order
, where  is a positive integer, is also known as the  norm distance [12].This measure is typically used with  = 2, giving the Euclidean distance, and is very sensitive to signal transformations such as shifting or time scaling (stretching or shrinking of the time axis).The proximity notion relies on the closeness of the values observed at corresponding points in time, so that the observations are treated as if they were independent.In particular,   is invariant to permutations over time [4,5].

Short Time Series Distance.
The short time series (STS) distance was introduced by Möller-Levet et al. [13] as a metric that adapts to the characteristics of irregularly sampled series [4,5].Complexity 3 2.1.3.Dynamic Time Warping Distance.The goal of dynamic time warping (DTW) is to find patterns in time series [14].The DTW distance determines a mapping between the series which minimizes a specific distance measure between the coupled observations.This allows similar shapes to be recognized, even in the presence of signal transformations such as shifting and/or scaling, and ignores the temporal structure of the values because the proximity is based on the differences, regardless of the behavior around these values [4,5].

Measures Based on
Editing the Time Series.The edit distance, which was initially developed to calculate the similarity between two sequences of strings, is based on the idea of counting the minimum number of edit operations (delete, insert, and replace) that are necessary to transform one sequence into the other.The problem of working with real numbers is that it is difficult to find exact matching points in two different sequences and, therefore, the edit distance is not directly applicable.

Edit Distance for Real Sequences.
The distance between points in the time series is reduced to 0 or 1 [15].If two points   and   are closer to each other in the absolute sense than some user-specified threshold , they are considered to be equal.On the contrary, if they are further apart, they are considered to be distinct and the distance between them is set to 1.As an additional property, the edit distance for real sequences (EDR) permits gaps or unmatched regions in the time series but penalizes them with a value equal to their length values [4,5].

Longest Common Subsequence Distance.
In this metric, the similarity between two time series is quantified in terms of the longest common subsequence (LCSS), with gaps or unmatched regions permitted [16].As with EDR, the initial mapping between the series uses the Euclidean distance between two points, which is reduced to 0 or 1 depending on some threshold  values [4,5].

Measures Based on the Time Series Features.
Instead of using the raw data values in the series, this category of distance measures aims to extract a set of features from the time series and calculate the similarity between these features.

Distance Based on the Cross-Correlation.
This distance is based on the cross-correlation between two time series.The maximum lag considered in the calculation should not exceed the length of the series values [4,5].

Autocorrelation-Based Distances.
Several researchers have considered measures based on the estimated autocorrelation functions [17].These rely on the autocorrelation between two time series with a maximum lag of  values [4,5].

Periodogram-Based Distances.
Most of the measures discussed so far operate in the same domain, that is, the time domain.However, the signal representation in the frequency domain provides a good alternative for measuring the similarity between time series.The key idea is to assess the similarity between the corresponding spectral representations of the time series values [4,5,18].

Fourier Coefficients-Based
Distances.This measure is based on comparing the discrete Fourier transform coefficients of the series [19].The value of each coefficient measures the contribution of its associated frequency to the series.Based on this, the inverse Fourier transform provides a means of representing the sequences as a combination of sinusoidal forms.Note that the Fourier coefficients are complex numbers that can be expressed as    =   +   .In the case of real sequences such as time series, the discrete Fourier transform is symmetric, and therefore it is sufficient to study the first /2 + 1 coefficients.Furthermore, it is commonly considered that most of the information is found within the first  Fourier coefficients, where  < /2+1.Based on this information, the distance between two time series is given by the Euclidean distance between the first  coefficients [4,5].

TQuest Distance.
The fundamental idea of the TQuest distance [20] is to define a set of intervals in a time series in which the stochastic processes exceed a predetermined threshold .The final distance between two time series is defined in terms of the similarity between sets of intervals based on this threshold value.Intuitively, two time intervals are said to be similar if they have similar start and end points.TQuest is independent of the size of the individual time series; this is important because time series exhibiting similar properties can be converted into intervals of different lengths.Another advantage is that this measure only takes into account the local similarity, with the remaining (continued) intervals not affecting the final result [4,5].

Measures for Determining the Relevant Number of Clusters
Different clustering algorithms usually lead to different clusters of data; even for the same algorithm, the selection of different parameters or the order in which data objects are presented can greatly affect the final clustering partitions.Thus, effective evaluation standards and criteria are critically important to ensure confidence in the clustering results.At the same time, these assessments provide meaningful insights into how many clusters are hidden in the data.In most reallife clustering situations, users face the dilemma of selecting the number of clusters or partitions in the underlying data.As such, numerous indices for determining the number of clusters in a dataset have been proposed.

CH Index.
The value of  that maximizes CH() specifies the number of clusters [21].

C Index.
The minimum value of  [21] is considered to be the relevant number of clusters in a given dataset.Complexity 3.3.Duda Index.Reference [21] proposed the ratio Je(2)/ Je(1) as a criterion, where Je(2) is the sum of squared errors within clusters when the data are partitioned into two clusters and Je(1) is the squared error when only one cluster is present.The optimal number of clusters is that which gives the smallest value of this ratio.
3.4.Ptbiserial Index.This index [21] is simply a point-biserial correlation between the raw input dissimilarity matrix and a corresponding matrix consisting of 0s or 1s.A value of 0 is assigned if the two corresponding points are clustered together by the algorithm; a value of 1 is assigned otherwise.
Given that larger positive values reflect a better fit between the data and the obtained partition, the maximum value of the index is used to select the optimal number of clusters in the dataset.
3.5.DB Index.This index [21] is a function of the ratio of within-cluster scatter to between-cluster separation.The  value that minimizes the index is considered to be the optimal number of clusters in a given dataset.
3.6.Frey Index.The Frey index [21] can only be applied to hierarchical methods; it is the ratio of difference scores from two successive levels in the hierarchy.The numerator is the difference between the mean between-cluster distances from the two hierarchy levels (level  and level  + 1), whereas the denominator is the difference between the mean within-cluster distances of levels  and  + 1.

Hartigan Index.
The maximum difference between hierarchy levels is taken to indicate the correct number of clusters in the data [21].
3.8.Ratkowsky Index.Charrad et al. [21] proposed a criterion for determining the optimal number of clusters based on / 0.5 .The value of  is the average of the ratios of (BGSS  /TSS  ), where BGSS is the sum of squares between the clusters (groups) for each variable and TSS is the total sum of squares for each variable.The optimal number of clusters is the value of  that maximizes /  0.5 .
3.9.Ball Index.Ball and Hall [21] proposed an index based on the average distance between data items and their respective cluster centroids.The largest difference between levels is used to indicate the optimal solution.[21] is the ratio of the average within-cluster distance divided by the number of within-cluster distances to the average betweencluster distance divided by the number of cluster distances.The minimum value of this index indicates the optimal number of clusters.

McClain Index. The McClain and Rao index
3.11.KL Index.The value of  that maximizes KL() [21] specifies the optimal number of clusters.
3.12.Silhouette Index.The maximum value of this index is used to determine the optimal number of clusters in the data [21].
3.13.Dunn Index.The Dunn index [21] is the ratio between the minimal intercluster distance and the maximal intracluster distance.If the dataset contains compact and wellseparated clusters, the diameter of the clusters is expected to be small and the distance between the clusters is expected to be large.Thus, the Dunn index should be maximized.
3.14.SD Index.The SD validity index is based on the concepts of average scattering for clusters and total separation between clusters [21].The number of clusters  that minimizes this index gives the optimal value for the number of clusters in the dataset.To analyze the volatility of the load, we prepared the box and whisker plots, for each of the 24 hours over the whole year, for two customers: one with quite stable load profile and the second with highly volatile characteristics (see Figures 2 and 3 for details).The whiskers show the minimum and maximum value in a given hour and the box encloses 50% of the total data (top edge represents the 75th quartile and bottom edge the 25th quartile, and the line in the middle is the median).For instance, the household, as shown in Figure 2, on average consumes 1.5 kWh in each hour while the volatility is rather low.The other household, as shown in Figure 3, can be characterized as the one using, on average, only 0.2 kWh in each hour; however, the volatility of the load is very high, regardless of the hour.

Splitting Households into Clusters
It should be noted that the source dataset was of relatively high quality.The time series in the dataset had no missing values.In some time series, few outliers were observed; however, they were left in the dataset.Please refer to Table 1 for the descriptive statistics of the electric load observed at each of the households.In the preprocessing step, only normalization of the data was applied.Additionally, only few features describing particular households were available (i.e., structural and building specific data).Those were building type (apartment, single-family home, and town home),

Determining the Similarity between Time Series.
The starting point for the process of partitioning households was the selection of training samples for the period considered in determining the similarity between the electricity consumption time series.
In the second step, using the R-CRAN software, matrixes of similarity measures were calculated between all 46 time series.All measures were implemented in the following libraries: (i) TSdist [5]: the tsDatabaseDistances function was used with arguments implementing the following similarity measures:  The households were divided into clusters using the Ward method, contained in the hclust function.Each of the 10 The size of each cluster according to the type of partition and the similarity of hourly time series is listed in Table 2.For the same type of division, the number of households included in the individual clusters varies depending on the similarity measure.The quality of the proposed similarity measures when splitting the data into two clusters is shown in Figure 4.
The most unbalanced group sizes are given by the distance based on cross-correlation (one very large cluster, the others very small), STS distance (two large clusters, the others very small), and TQuest distance (one very large cluster).The most balanced group sizes are given by autocorrelation-based distances (the smallest cluster contains three households), EDR (only in the seventh division do groups with fewer than three households appear), and periodogram-based distances (only in the last split do groups with fewer than three households appear).Other similarity measures generate clusters with a similar number of customers, but small clusters are present at some levels of division.
To determine the similarity between the results of time series clustering (Table 3), Baker's correlation coefficient [23] (function cor bakers gamma) was implemented in the dendextend library.This coefficient measures the similarity between two dendrograms, that is, the results of a hierarchical clustering, by calculating the highest possible value of q (number of groups when cutting a tree) for which two time series still belong to the same subtree.It is known that there are exactly ( 462 ) = 1035 combinations of such pairs of time series, based on hierarchical clustering.The same operation is then performed on the basis of the second dendrogram.In the next step, these two sets of values (for each dendrogram) are paired according to the order of comparing elements.Eventually, the similarity of the clustering results is assessed on the basis of Spearman's correlation coefficient.
Based on the results presented in Tables 2 and 3, it can be observed that similar clustering results are related to the DTW distance (dtw),   distance (euclidean), Fourier coefficients-based distance (fourier), and periodogram-based distances (int.per)(denoted by italic font in Table 3).Distances carrying the most dissimilar dendrograms are the autocorrelation-based distances and the longest common subsequence distance.2).The input parameters for the NbClust function were as follows: (i) ch: Calinski and Harabasz index.
The aggregated results for the relevant number of clusters for hourly electricity demand are presented in Table 4. Using a simple majority vote for the relevant number of clusters (partitions in italic font denote the greatest number of measures), it was observed that most measures of time series similarity would split the households into two groups.The LCSS distance gives a very different result, and the EDR and TQuest distances give two equally applicable cluster numbers.
Additionally, it can be realized that Ball's, Frey's, and McClain's measures tend to indicate relatively few clusters.For the remaining measures, some subgroups indicate a similar number of clusters for particular time series similarity measures [24].For example, Silhouette's, Ptbiserial's, Davies and Bouldin's, Ratkowski's, Dunn's, and SD's measures indicate almost the maximum number of clusters for the LCSS and STS distances, whereas they indicate the minimum number of clusters for DTW, EDR, and the periodogrambased distances.

An Approach to Forecasting
5.1.Short-Term Load Forecasting.Recently, with advances in smart metering, there has been a lot of interest in residential power load forecasting and application of analytical techniques to load forecasting on the individual household level [1, [25][26][27].However, such a level of granularity is challenging due to the extreme load volatility which is Complexity the result of many different dynamic processes observed at individual households, including behavioral and sociodemographic components.Aggregation of individual and to some extent stochastic components reduces the inherent variability of electricity usage resulting in smoother load shapes, and, as a result, the forecasting applied to the higher aggregation levels (power stations or regions) can be achieved with quite low errors.Different methods have been proposed for forecasting the electric load demand in the last decades.Several modeling techniques are typically used for energy load forecasting.These techniques can be classified into nine categories [28]: (1) multiple regression, (2) exponential smoothing, (3) iterative reweighted least squares, (4) adaptive load forecasting, (5) stochastic time series, (6) ARMAX models based on genetic algorithms, (7) fuzzy logic, (8) artificial neural networks, and (9) expert systems.However, the list is not closed and new techniques are still being adopted and tested for the purpose of accurate electricity load forecasting, including support vector machines or random forests, just to name a few.
Based on literature review, one can conclude that time series analyses are not effective in highly volatile data [29], and therefore time series methods such as regression models, ARIMA models, GARCH, and hybrid models such as the combination of ARIMA and GARCH using wavelet transform are not considered for short-term forecasting when dealing with volatile data [25,27].Rather than these, the machine learning techniques like artificial neural networks, support vector machines, or random forests are recently applied in this area with positive outcome [30][31][32][33].
Therefore, for the forecasting experiments, we focused on application of machine learning techniques including artificial neural networks, support vector machines, random regression forests, regression trees, and -NN regression, and as benchmark models, we have selected ARIMA models and some simple techniques like stepwise regression, naive forecast, and random forecasts.

Accuracy Measures.
To assess the forecasting performance of the model, three measures were used: precision, resistant mean absolute percentage error (MAPE), and accuracy [34].Precision is defined as how well the model is able to forecast the actual load.To measure the precision, the mean squared error (MSE) was used: where   and   denote the observed and forecasted load in hour , respectively [1].
The MAPE satisfies the criteria of reliability, ease of interpretation, and clarity of presentation.However, it does not meet the validity criterion, because the distribution of the absolute percentage errors is usually skewed to the right because of the presence of outliers.In these cases, MAPE can be overinfluenced by some very bad instances, which disrupt otherwise good forecasts [34].Therefore, we propose the alternative measure of resistant MAPE (r-MAPE) based on the calculation of Huber's M-estimator, which helps to overcome the aforementioned limitation [35].
The M-estimator for the location parameter  using the maximum likelihood (ML) estimator is defined as a solution  to min or where  =   and  is the scale parameter.For a given positive constant , Huber's estimator is defined by the following function  in (3): where  is a tuning constant that determines the degree of robustness; in this study, we set  = 1.5 [36].The application Finally, an accuracy measure was used to identify how many "correct" forecasts the model makes.Correctness is defined by the user, for example, forecasts within a percentage range of the actual load.However, for low loads, a percentage range may become insignificant [1].For a load of 0.1 kWh, a 15% range would be 0.085-0.115,and a forecast of 0.2 kWh will be considered wrong, whereas such a forecast would be acceptable in practice.To address this false loss of accuracy, we set a 15% range of error to define the general accuracy, but if the load is smaller than 1 kWh, then a range of ±0.15 kWh is considered acceptable [36].Therefore, the accuracy for hour  is given by

Predictors.
In this study, we focus on forecasting the 24-hour-ahead electricity usage.To forecast the load, we constructed a feature vector with the attributes presented in Table 5.These attributes were constructed based on time series of the hourly electricity demand.Additional variables such as temperature, humidity, and date were also collected.Electricity demand varies depending on the time of day (daily cycles), day of the week (weekly cycles), day of the month (monthly cycles), season (seasonal cycles), and occurrence of holidays.Therefore, we enriched the analysis by considering 76 dummy variables describing the hour (1-24), day of the month (31 variables), day of the week (seven variables), month (12 variables), occurrence of holidays (one variable), and sunset in a particular hour (one variable) [1].
The main variables in the forecasting process are those derived directly from the time series, and they include the maximum, minimum, and actual demand at certain intervals.The features were created by decomposing the time series.

Implementation.
Building predictive models involves huge volumes of data and complex algorithms.Therefore, an efficient computing environment with high-performance computers is necessary.In our case, all the numerical calculations were performed on computing clusters located at the Interdisciplinary Center for Mathematical and Computational Modelling at the University of Warsaw.The HYDRA engine with the Scientific Linux 6 operating system was used with the following nodes and parameters [1]: (i) Istanbul: AMD Opteron6 2435 2.6 GHz, 2 CPU × 6 cores, 32 GB RAM.
R-CRAN was used as the computing environment.This is an advanced statistical package and an interpreted programming language; it is licensed under the GNU GPL and based on the S language [36].
The starting point for the numerical experiments was the division of the WikiEnergy dataset into training (330 days, instead of 365 days, due to up to 35-day time window for attributes construction), validation (28 days), and testing (14 days) samples as described earlier.
The main criterion for training the models is the efficient generalization of knowledge with the least error.The most Complexity commonly used measure to assess the quality of forecasts in the electric power system is MAPE.Therefore, to find the best parameters for all models and ensure their generalization, the following function was minimized: where MAPE  and MAPE  denote the training and validation errors, respectively [36].
The experiments tested a broad set of machine learning algorithms, including artificial neural networks, regression trees, random forest regression, -nearest neighbors regression, and support vector regression.In the following subsections, these algorithms are briefly introduced along with their settings.The proposed machine learning algorithms were challenged against some typical approaches used for forecasting (benchmarks), namely, naive forecast, random forecast, the ARIMA model, and stepwise regression [1].

Artificial Neural Networks.
Artificial neural networks are mathematical objects in the form of equations or systems of equations, usually nonlinear, for analysis and data processing.The purpose of neural networks is to convert input data into output data with a specific characteristic or to modify such systems of equations to read useful information from their structure and parameters.On the statistical basis, selected types of neural networks can be interpreted in general nonlinear regression categories.
In studies related to forecasting in power engineering, multilayer, one-way artificial neural networks with no feedback are most commonly used.Multilayer perceptron (MLP) networks are one of the most popular types of supervised neural networks.For example, the MLP network (3, 4, 1) means a neural network with three inputs, four neurons in the hidden layer, and one neuron in the output layer.In general, the three-layer MLP neural network (, , ) is described by the expression where x  = ( 1 , . . .,   )  represents the input data, W 1 is the matrix of the first layer weights with dimensions  × , W 2 is the matrix of the second layer weights with dimensions  × , and ℎ  (u) and b  are nonlinearities (functions of neuron activation, e.g., logistic function) and constant values in subsequent layers, respectively.The goal of supervised learning of the neural network is to search for such network parameters that minimize the error between the desired values   and those received at the output of the network   .The most frequently minimized error function is the sum of the squares of differences between the actual value of the explained variable and its theoretical value determined by the model, with the values of the synaptic weight vector set: where  is the number of training samples,  ()  and  ()  are the predicted and reference values, and  is the number of training epochs of the neural network.
The neural network learning process involves the iterative modification of the values of the synaptic weight vector w (all weights are set in one vector), in iteration  + 1: where p  is the direction of the minimization of the function (w) and  is the magnitude of the learning error.The most popular optimization methods are undoubtedly gradient methods, which are based on the knowledge of the function gradient: where g and H denote the gradient and the Hessian of the last known solution w  , respectively.
In the practical implementations of the algorithm, the exact determination of Hessian H(w  ) is abandoned, and its approximation G(w  ) is used instead.One of the most popular methods of learning neural networks is the algorithm of variable metrics.In this method, the Hessian (or its reversal) in each step is modified from the previous step by some correction.If by c  and r  the increments of the vector w and the gradient g in two successive iterative steps are marked, c  = w  − w −1 , r  = g  − g −1 , and by V  the inverse matrix of the approximate Hessian according to the most effective formula of Broyden-Fletcher-Goldfarb-Shanno (BFGS) [37], the process of updating the value of the V  matrix is described by the recursive relationship As a starting value, V 0 = 1 is usually assumed, and the first iteration is carried out in accordance with the algorithm of the largest slope.Artificial neural networks are often used to estimate or approximate functions that can depend on a large number of inputs.In contrast to the other machine learning algorithms considered in these experiments, the ANN required the input data to be specially prepared.The vector of continuous variables was standardized, whereas the binary variables were converted such that 0s were transformed into values of −1.Finally, the dependent variable was normalized by zero unitarization.To generate a forecast, the reverse transformation of the dependent variable was applied [1].
To train the neural networks, we used the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, which is a quasi-Newton optimization method (available in the nnet library).The network had an input layer with 146 neurons and a hidden layer with 10, 15, 20, . . ., 50 neurons (i.e., nine sets of experiments).
A logistic function was used to activate all of the neurons in the network, and a regularization factor was introduced to penalize weights that were too high in the network (to control overfitting).Factor values of 0.01, 0.1, and 0.5 were considered [1].
In each experiment, 27 neural networks were learned with various parameters (the number of neurons in the hidden layer multiplied by the number of penalties).To avoid overfitting, after each learning iteration had finished (with a maximum of 50 iterations), the models were checked using the error measure defined in (7).Finally, out of the 27 learned networks, that with the smallest error was chosen as the best for delivering forecasts [36].

k-Nearest Neighbors Regression.
The -nearest neighbors regression [38] is a nonparametric method, which means that no assumptions are made regarding the model that generates the data.Its main advantage is the simplicity of the design and low computational complexity.The forecast of the value of the explained variable   on the basis of the vector of explanatory variables x  is determined as where whereas x  is one of the -nearest neighbors x  , in the case where the distance (x  , x  ) belongs to  smallest distance between the observations from the set X and x  .The most commonly used distance is the Euclid distance.
To improve the algorithm, we normalized the explanatory variables (standardization for quantitative variables and replacement of 0 by −1 for binary variables).The normalization ensures that all dimensions for which the Euclidean distance is calculated have the same importance.Otherwise, a single dimension could dominate the other dimensions [1].

Regression Trees.
Regression trees are popular machine learning induction tools.They approximate target functions in a discreet way and represent them in a tree structure or alternatively in a set of decision rules.Characteristic for decision trees is the division of a multidimensional space of features into disjoint   segments, within which a simple model approximating locally the target function within the segment is under consideration [39].
The regression tree in the CART version was used in this work.This tree consists of intermediate nodes in which tests are performed on explanatory variables, end nodes (leaves) storing values of the dependent variable   (expressed as the average value of objects belonging to a given segment), and branches connecting nodes.The discussed type of model is a binary tree, which means that each branch has two branches (descendants).In the tree construction process, the intermediate sets divide the set of examples into two subsets: positive examples that meet the test assigned to the node and negative examples that do not meet this test.The aim is to ensure that the values of the target function of examples in these subsets have the smallest possible variance: where   is the number of observations in the   segment and    is the average value of the target function of the examples in the set   .The variance ( 15) is a measure of the inhomogeneity of the node (impurity).The division of the set of examples in a node into two subsets as a result of the test is justified if it leads to the reduction of heterogeneity; that is, Regression trees are usually built according to the descending scheme.Starting from the start node (root) in which the entire learning set is considered, it moves to the next nodes as a result of dividing the examples into subsets   and   .The size of the tree depends on the complexity of the problem under consideration.The size of the tree, which implies the degree of its fit to the learning data, can be controlled, for example, using the parameter cp.This parameter indicates to the algorithm how much the general model performance should be increased in each step, so that a given decision node can be divided.
To train regression trees, the rpart package was used to implement the CART algorithm.In the process of dividing a multidimensional space, the dispersion (variance) around the mean value of the dependent variable for observations belonging to the same node (leaf) was minimized.At each stage of node splitting, the variable value that minimized the sum of squares was chosen.The minimum number of observations in the node was set to 20, and the leaf was set to at least six observations; otherwise, the node was split [1].
Instead of pruning the tree at the end of the algorithm, we used pruning during the growth stage.Generally, this approach prevents new splits from being created when the previous splits provided only a slight increase in predictive accuracy.The complexity parameter (cp) varied from 0 to 0.1 in increments of 0.001, meaning that if any split did not increase the model's overall coefficient of determination by at least cp, then the split was decreed to be not worth pursuing.The tree was built up to a depth of 30 levels.Out of 1000 regression trees that were tested, the final structure was chosen based on the error measure defined in (7) [1].

Random Regression Forests.
Random forests are an ensemble learning method for regression that operate by constructing a multitude of decision trees at training time and outputting the prediction of the individual trees.Random forests correct for decision trees' habit of overfitting to their training set [40].
The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners.Given a training set X with responses L, bagging repeatedly ( times) selects a random sample with replacement of the training set and fits trees   to these samples.After training, predictions for unseen samples X  can be made by averaging the predictions from all the individual regression trees on X  : This bootstrapping procedure leads to better model performance because it decreases the variance of the model, without increasing the bias.This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated.
The above procedure describes the original bagging algorithm for trees.Random forests differ in only one way from this general scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features.This process is sometimes called feature bagging.The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the  trees, causing them to become correlated.Typically, for a regression problem with  features, /3 (rounded down) features are used in each split.
To train the random regression forest, an algorithm from the randomForest library was used.Prior to the training, element samples were selected with replacement, and these accounted for approximately 63% of the population.The samples were used to construct the CART tree.Each tree was built to its maximum size (without pruning), preventing the occurrence of five or fewer observations in a leaf [1].
A randomized subset of variables was used to construct each tree.The total number of trees in the forest was 500.The final forecast was defined based on Huber's robust estimator [35].Finally, as in previous cases, the best forest structure was chosen based on the error measure defined in (7).5.4.5.Support Vector Regression.Support vector learning is based on simple ideas which originated in statistical learning theory [41].The simplicity comes from the fact that support vector machines (SVMs) apply a simple linear method to the data but in a high-dimensional feature space nonlinearly related to the input space.Moreover, even though we can think of SVMs as a linear algorithm in a high-dimensional space, in practice, it does not involve any computations in that high-dimensional space.
SVMs use an implicit mapping Φ of the input data into a high-dimensional feature space defined by a kernel function, that is, a function returning the inner product ⟨Φ(x  ), Φ(x  i )⟩ between the images of two data points x  , x  i in the feature space.The learning then takes place in the feature space, and the data points only appear inside dot products with other points [42].More precisely, if a projection Φ : X → H is used, the dot product ⟨Φ(x  ), Φ(x  i )⟩ can be represented by a kernel function : which is computationally simpler than explicitly projecting x  and x  i into the feature space H. Training an SVM regression involves solving a quadratic optimization problem.Using a standard quadratic problem solver for training an SVM would involve solving a big QP problem even for a moderate sized dataset, including the computation of an × matrix in memory ( training points).In general, predictions correspond to the decision function where where  is the cost parameter that controls the penalty paid by the SVM for misclassifying a training point and thus the complexity of the prediction function.A high cost value  will force the SVM to create a complex enough prediction function to misclassify as few training points as possible, while a lower cost parameter will lead to a simpler prediction function.
To construct the support vector regression model, -SVR from the kernlab library with sequential minimal optimization (SMO) was used to solve the quadratic programming problem.A linear kernel function was used, and  (which defines the margin width for which the error function is zero) was arbitrarily taken from the following set: {0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1} [1].
The regularized parameter  that controls overfitting was arbitrarily set to one of the following values {0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1}.Finally, as in all previous cases, the model that minimized the error function (7) was chosen [1].5.4.6.Naive Forecast.The naive forecast was constructed in the following manner.For the forecasting horizon of 24 h, the value recorded on the previous day at that hour was taken as the forecast [1].
5.4.7.Random Forecast.The random forecast was constructed in the following way.Given the electricity consumption in a given hour on a particular day of the week, empirical distribution functions were computed.Using a runif function, a value between 0 and 1 was then drawn from a uniform distribution ( probability).This value was used to estimate the quantile of the empirical distribution (the final value of the forecast) according to a weighted averaging of the order statistic   (quantile function): where  =  +  − ,  is the number of observations,  = floor( + ), and  = 1 −  [1].

ARIMA Model.
Autoregressive integrated movingaverage model provides a description of a stationary stochastic process in terms of two polynomials, one for the autoregression and the other for the moving average that is applied in some cases where data show evidence of nonstationarity, where an initial differencing step (corresponding to the integrated part of the model) can be applied one or more times to eliminate the nonstationarity [43].For a given time series of data   , an ARIMA(, , ) model is given by where  is the order (number of time lags) of the autoregressive model,  is the degree of differencing (the number of times the data have had past values subtracted),  is the order of the moving-average model,   is the lag operator,   are the parameters of the autoregressive part of the model,   are the parameters of the moving-average part,   are error terms, and finally  is a drift factor  = (1 −  1 − ⋅ ⋅ ⋅ −   ), where  is the mean of (1 − )    .The error terms   are generally assumed to be independent, identically distributed variables sampled from a normal distribution with zero mean.
To determine the order of an ARIMA model, a useful criterion is the Akaike information criterion (AIC) written as AIC = −2 log () + 2 ( +  +  + 1) , (23) where  is the likelihood of the data; the parameter  in this criterion is defined as the number of parameters in the model being fitted to the data.Therefore, this part uses the auto.arimafunction implemented in the forecast library.The function identifies and estimates the model by minimizing Akaike's information criterion.In this model, the maximum values for the AR and MA orders were arbitrarily set to  = 14 and  = 14.The degree of differencing was tested with the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test, which examines the null hypothesis that an observable time series is stationary around a deterministic trend [1].

Stepwise Regression. Stepwise linear regression of the form
was used as an automated tool to identify a useful subset of predictors (where x  denotes vector of independent variables,  is the vector of the models' parameters, and   is the error variable).Two procedures were tested: the first adds the most significant variable (forward selection) and the second removes the least significant variable during each step (backward elimination).The forward approach stops when all variables not in the model have  values that are greater than the specified alpha-to-enter value (5% significance level was used).The backward elimination stops when all variables in the model have p values that are less than or equal to the specified alpha-to-remove value (5% significance level was used) [1].

Forecasting Aggregated Electricity Demand
6.1.Forecasting Electricity Demand for the Entire Population.
In the first step of the forecasting stage, the hourly () values of electricity demand were aggregated by summing all  values of  ()  in the entire population: For the clarity of the presented results, the following notation is introduced:  24 , naive forecast;   , random forecast; ARIMA, ARIMA model;  , stepwise regression with forward selection;  , stepwise regression with backward elimination; k-NN, -nearest neighbors regression; RPART, regression trees; RF, random regression forests; NNET, artificial neural network; and SVR, support vector regression [1].
Forecasting results for hourly models with a forecast horizon of 24 h are presented in Table 6.The best results for the test set were obtained by SVR, NNET, and the two comparative methods (  and  ).After examining each of the separate subsets (training, validation, and test) and the error metrics, it is clear that the machine learning models (except for random forests) exhibit good stability.There is also no difference between the MAPE and r-MAPE metrics, which indicates that the forecasts are largely inaccurate.

Forecasting Electricity Demand within Clusters.
The hourly () values of electricity demand were then aggregated by summing all   values of  ()   in the th cluster: Similar to the previous experiment, the aggregate explanatory variables were determined based on Table 5.For each of the 10 similarity measures and eight possible partitions of household aggregates (total 80 combinations), forecasting models based on machine learning algorithms were developed, as described in Section 5. A total of 440 different "definitive" forecasting models have been developed.Taking into account the number of indirect models developed during the optimization process, the number of models increases to 328680 (32(RPNN) + 102(RPART) + 12(RF) + 27(NNET) + 77(SVR) = 249).
In the next step, the forecasted demand for the whole population was determined as the sum of all  predicted values  ()   for the th cluster: For example, let the actual aggregate value of the hourly electricity demand for the entire population at time  be  population  = 46 kWh.Taking into account the second split of households into clusters in terms of the autocorrelationbased distances, the predicted aggregates for these clusters are  (1)  = 25 kWh,  (2)  = 12 kWh, and  (3)  = 10 kWh.By aggregating the above values in accordance with (11), the final forecast for the whole population is  population  = 47 kWh.To identify situations in which partitioning households into clusters improved the final forecast, we introduced a percentage point sensitivity range and denoted different cases using the following fonts [1]: (i) Italic font: forecast improves in terms of Acc and MAPE/r-MAPE when determining the final forecast according to (11); for example, for a model with 20% error, the improvement should be at least 0.5 percentage points, so the model error should be less than 19.5%.(ii) Bold font: forecast worsens in terms of Acc and MAPE/r-MAPE when determining the final forecast according to (11); for example, for a model with 20% error, an increase of at least 0.5 percentage points should be expected, so the model error is greater than 20.5%.(iii) Roman font: neutral cases in which Acc and MAPE/r-MAPE remain at similar levels; for example, for a model with 20% error, we define a 1 percentage point range (19.5-20.5%)over which no improvement is observed when determining the final forecast according to (11).
Forecasts for the hourly demand models developed in this study, with a 24-hour forecast horizon, are presented in Table 7.This table compares the results of the forecasts for a particular partition against the results of the base model developed for the aggregate of all households (see Table 6).
In the majority of cases, application of k-NN and RPART leads to a deterioration with respect to the baseline models developed for the aggregate population and the same forecasting methods (for the second partition of the TQuest distance and the RPART model, there is some improvement).The RF approach gives improved predictions using EDR, STS, and TQuest.However, the results with RF and SVR are often neutral.The NNET forecasting method generally improves the results for each time series similarity measure (worsening results in only 5% of cases), as presented in Figure 5.Using the distance based on cross-correlation, EDR,   , LCSS, and TQuest, the forecast accuracy increased by nearly 1.5 percentage points.In general, the following pattern is observed: the MAPE error decreases when the number of clusters increases (please refer to Average and BASE NNET labels in Figure 5).This observation suggests that the underlying population was very heterogeneous and splitting it into a number of clusters leads to improved forecast accuracy.
To provide a quantitative summary of the experiments, we tested whether there are statistically significant differences  Finally, note that this research was carried out on a sample of 46 households.The conclusions serve as an indication of how the specific measures of time series similarity and optimal clustering, as well as forecasting methods, could behave on a much larger scale.
When load shape clustering is used as a preprocessing step for subsequent household-level segmentation, the requirement for a low number of clusters, which is practical for tariff purposes, should be relaxed to capture the diverse time-of-use behavior in residential hourly consumption.In addition, regarding the size distribution and visual examination conducted in the postcluster checking, future work should evaluate the broader validation of the robustness of the identified clusters using additional data sources, such as household demographic and socioeconomic information and rate structures.In particular, the sources of variability in the identified hourly patterns need to be established and better understood to support more effective demand-side management and behavior-based programs [44].

Figure 2 :Figure 3 :
Figure 2: Customer with the least volatile consumption (in kWh) in the analyzed period (1 March 2013 to 28 February 2014).
), (  ), (  ) are sets of indexes of examples in the parent node (parent) and in child nodes with positive or negative examples.In addition   and   are numbers of positive and negative examples in set   :   = |  |/|  |,   = |  |/|  |.

Table 8 :
Results for the relevant number of clusters based on the partition type and the time series similarity measure.there are groups of measures that select the optimal number of clusters more often than others, with partitions leading to improved forecast quality.

Table 1 :
Descriptive statistics for the electric load observed at each of the households.

Table 2 :
Size of each cluster by the type of partition and time series similarity measure.

Table 3 :
Baker's correlation coefficient between clustering results on the basis of different measures of time series similarity.
[21]Determining the Relevant Number of Clusters.Using theNbClust library[21], the relevant number of clusters for all 46 time series was determined.The input arguments of this package are the real-time values of the time series and a similarity array.The optimum number of clusters was determined on the basis of the eight different partitions defined earlier (Table

Table 4 :
Voting results for the relevant number of clusters based on the partition type and the time series similarity measure.

Table 5 :
Feature vector used in forecasting.

Table 7 :
Forecasting results in terms of MAPE error for grouped households using the test set.