Development of a Representative EV Urban Driving Cycle Based on a k-Means and SVM Hybrid Clustering Algorithm

,


Introduction
Fuel depletion, environmental disruption, and air pollution have contributed to the development of electric vehicles (EVs) [1].The driving range calculation and state of charge estimation for EVs are generally performed based on the international standard driving cycles [2,3].In the study of energy consumption prediction and energy management optimization strategies for EVs, some scholars have also adopted the international standard driving cycles [4][5][6].However, the data of these international standard driving cycles were collected from internal combustion engine vehicles (ICEVs).Because of the differences in the torque characteristics, power characteristics [7][8][9][10][11], and braking characteristics between EVs and ICEVs, the driving cycles of EVs are significantly different from those of ICEVs [12][13][14].The research on driving range calculation, state of charge estimation, energy consumption prediction, and energy management optimization strategies for EV under the international standard driving cycles can produce large errors.Therefore, it is important to conduct research on the driving cycles of EVs [15,16].
The driving cycle is a speed-time profile that represents a typical real-world driving pattern in a certain city or region [3,17].Driving cycles have a wide range of applications in the vehicle and transportation fields [18], and they are often used to simulate traffic conditions during laboratory chassis dynamometer bench tests and in automotive simulation research [3].Moreover, these cycles serve as a standardized measurement procedure for the evaluation and certification of new vehicle models, the monitoring of exhaust emissions and fuel consumption for ICEVs, and the estimation of the driving range and energy consumption per unit mileage for EVs [3,19].Because of differences in city size, road topology, road types, traffic conditions, vehicle ownership, economics, cultures, geographical features, and vehicle types, the characteristics of driving cycles in different cities and regions are different [20,21].Hence, it is necessary to develop a unique driving cycle for each typical city and region.Moreover, many researchers have found that the energy consumption of an EV for a real-world driving cycle greatly differs from that based on the international standard driving cycles [19,[22][23][24][25].
Many researchers have also successfully developed realworld driving cycles, such as the Dublin driving cycle [3],

Test Route Selection and Data Collection Method
The test routes must cover various urban road types, business and nonbusiness districts, densely populated areas and nonpopulated areas and consider the urban topological structure, driving speed, traffic flow, travel time, and origin-destination (O-D) pattern [30,[33][34][35].First, the overall topological structure of urban roads in Xi'an was analyzed using ArcGIS software.The lengths and proportions of urban roads in Xi'an are shown in Figure 1.The total length of Xi'an urban roads is 2562 km, including 141 km of expressways, 597 km of main roads, 1001 km of secondary roads, and 823 km of branch roads.mainly caused by the viaducts, and the influence of the road gradient on data acquisition is not considered in the test route selection process.
In driving cycle construction, the common data acquisition methods include the on-board measurement method and chase car method.On-board measurement method uses on-board diagnostics (OBD) installed in a test vehicle to collect trip activity information.The advantage of this approach is that the collected data can accurately represent the actual driving cycle based on large-scale studies.However, the test costs to obtain a reasonable sample size can be extremely high, and the data processing workload is large [32,34,36].Chase car method is widely used to collect speed-time data from real-world driving cycles.Such method involves randomly following a target vehicle and imitating its driving pattern in the traffic stream.When the previous target vehicle stops or is lost, another tracking target is selected [37].A global position system (GPS) is installed in the test car to collect data, and a large number of driving characteristics are simulated and sampled through the car-following process.Chase car method has been commonly adopted due to its low cost requirements [17,27,33,35].Considering on-board measurement method and chase car method has distinct merits and demerits, and this paper used a hybrid method of on-board measurement method and chase car method to collect test data.When an available EV was driving on the preset test route, the test vehicle chased it and imitated its driving pattern; when no target EV was available, the driving patterns of the test vehicle were collected.The GPS signal is easily influenced by the urban buildings, which can cause signal loss and data spikes; therefore, both GPS and OBD were used to collect driving pattern data.Speed-time data obtained by OBD were mainly used to supplement and improve abnormal GPS data.Moreover, 14 trained professional drivers were selected to minimize the influence of the driver on the collected data.In addition, a BYD E6 pure EV, which is the most commonly owned EV in the market, was selected as the test vehicle.The technical characteristics of the test EV are shown in Table 2.The experimental equipment installed in the test vehicle mainly included a GPS, OBD, a gyroscope, a driving recorder, a 12-V lead-acid battery, and a power inverter.The GPS recorded the speed, time, and distance of each trip.The OBD recorded the speed, time, distance, and battery state of charge.Additionally, the gyroscope recorded the acceleration signal and road gradient, and the driving recorder was used to collect traffic condition and chased vehicle information.Finally, 12-V lead-acid battery and power inverter were used to power the test equipment.The test equipment is shown in Figure 4.All data were recorded at a constant sampling frequency of 1 Hz [38][39][40].Driving data was collected by test vehicles moving along preset routes under real-world traffic conditions.Moreover, to consider the effect of the traffic flow and travel time on the test results, a 14-day test was conducted.The daily test periods were the morning peak from 7:30 to 9:30, the midday off-peak from 12:00 to 13:30, the afternoon peak from 17:30 to 19:30, and the evening off-peak from 19:30 to 21:00.The tests were conducted under clear weather conditions, on dry roads, and at wind speeds less than 1 m/s.Data collection ceased in cases of abnormal traffic conditions [30].Test data were collected for a total of 56 trips totaling approximately 428 hours.

Data Processing
In the data processing procedure, the raw data were first denoised and smoothed by using a wavelet decomposition and reconstruction method.Then, the preprocessed data were partitioned into 36388 kinematic segments according to (1).In addition, a PCA algorithm was adopted to reduce the dimensionality of the characteristic parameters.Finally, the k-means and SVM hybrid clustering algorithm was used to classify the kinematic segments that were used to construct the driving cycle. acceleration where  is acceleration and V is speed.[41,42].PCA is a multivariate statistical method that transforms a large number of correlated variables into a few uncorrelated principal components through orthogonal transformation and can maintain the original variable information as much as possible.Generally, we use characteristic parameters to describe the vehicle motion state.If all the characteristic parameters are used for classification, it will increase both the computational complexity and the difficulty of analysis [41,42].Although each parameter characterizes different motion information, the parameters are not independent, and some parameters can be expressed by a combination of several other parameters.Therefore, we use the PCA algorithm to reduce the dimensionality of the characteristic parameters [41,42].
According to the existing literature, eight characteristic parameters are selected: the maximum speed, minimum speed, average speed, standard deviation of speed, maximum acceleration, maximum deceleration, average acceleration, and standard deviation of acceleration [30,33,36].
We assume that the sample size is , the number of observed parameters is , and the sample observation matrix is  = [  ] × .
where   is the -th characteristic parameter of the -th kinematic segment.The principal component expressions obtained after orthogonal transformation using the PCA are as follows: where   1 ,   2 , ⋅ ⋅ ⋅ ,    are independent of each other and called the first, second, . . . . .., p-th principal components, respectively;  1 ,  2 , ⋅ ⋅ ⋅ ,   are eigenvectors corresponding to eigenvalues arranged in descending order, namely, The principal component variance contribution ratio  and cumulative variance contribution ratio  are calculated as follows: The principal components corresponding to the eigenvalues whose cumulative variance contributions ratio are greater than 90% are selected to form the comprehensive evaluation index that encompasses 90% of the entire data information.
The number of principal components with cumulative variance contribution ratio exceeding 90% is obtained, where   =    ,  = 1, 2, ...,  ( ≤ ).Additionally, the principal component expression and the principal component score are calculated.
The calculated principal component expression is shown in (6).
The PCA algorithm is used to reduce the dimensionality of the characteristic parameters.The principal component variance, variance contribution ratio, and the cumulative variance contribution ratio are shown in Table 3. Notably, from the first principal component to the eighth principal component, as the principal component variance decreases, the amount of information contained in the principal component gradually decreases.The first principal component  1 contains the most information, in this case, 36.15% of the original indicator information.The second principal component  2 contains 32.94% of the original indicator information, and the third principal component  3 contains 28.50% of the original indicator information.The cumulative contribution ratio of the first three principal components reaches 97.59%; therefore, they essentially encompass all the original indicator information.Thus, to simplify the problem and reduce the complexity of the analysis, the first three principal components are selected as the characteristic parameters for clustering analysis.The principal component score matrix is shown in Table 4.These scores are calculated for the first three principal components of 36388 kinematic segments.The component matrix is shown in Table 5, which  displays the correlation coefficient between the principal component and the original characteristic parameter.

Clustering Based on the k-Means and SVM
Hybrid Algorithm

Clustering Based on k-Means.
The k-means algorithm has been widely used in cluster computing in the existing literature [3,4,19,24,43].The k-means clustering algorithm uses the Euclidean distance between a sample and cluster center as a criterion for similarity determination, and it has the advantages of efficiency and simplicity [19,43].The Euclidean distance is calculated as shown in (7).
where  is the number of characteristic parameters and   and   are the -th characteristic parameters of the -th sample and the -th sample, respectively.The number k of clusters in the k-means algorithm has an important influence on the clustering results, but generally, an accurate value of k is initially unknown [44,45].If the number of clusters is greater than the true value, the algorithm wrongly divides the same class of data into multiple classes, which will cause the clustering result boundaries to be blurred [46].Conversely, merging different classes of data into one class will result in a decrease in the compactness of the cluster [47].Therefore, clustering stability is generally used to determine the k value [48].The concept of the approach is that if repeated clustering is performed for the same sample data, then a suitable k value should produce the same clustering results.In other words, stability is viewed as an indication of whether the k value fits the data.We repeated the clustering stability simulation test and finally determined that the number of clusters was seven.The classification results of the k-means clustering algorithm are shown in Figure 5.
The analysis of the k-means clustering effect shows that when the distances between the segment and other cluster centers are large enough, and the characteristic parameters of the segment are close to those of a certain class, the clustering effect is good.In contrast, when the distance between the segment and all cluster centers is similar and the characteristic parameters of the segment are significantly different from those of most of the segments in the class, the clustering effect is poor.There are two main reasons for these results.First, k-means is a hard clustering algorithm.When clustering involves multiple classes or the distance between cluster centers is small, the clustering effect is poor, and it is easy to reach local optima that cannot be incrementally clustered.Second, the convergence condition of the k-means clustering algorithm is iterated until the cluster center no longer changes.This approach may cause the data within the class to be very similar but does not fully consider the distance between classes; therefore, only local optima are guaranteed, and the global optimization is not achieved.

Clustering Based on the k-Means and SVM Hybrid
Algorithm.The SVM method was first proposed by Cortes and Vapnik in 1995 and has significant advantages in solving nonlinear and high-dimensional pattern recognition problems [49,50].The SVM method is based on the Vapnik-Chervonenkis dimensional statistical learning theory and is an approximate implementation of minimizing structural risk.The basic principle of SVM is to establish an optimal classification hyperplane that can not only separate samples without errors, but also maximize the classification interval [50].A schematic diagram of the optimal hyperplane is shown in Figure 6.SVM has been widely used for classification identification due to its many advantages, such as a high level of robustness and strong generalization ability.
We assume that the size of the sample set is  and that the sample set consists of two classes:  = {(  ,   ),  = 1, 2, ⋅ ⋅ ⋅ , ,   ∈   ,   ∈ {+1, −1}}.If   belongs to the first class,   = 1; otherwise,   = −1.If the sample can be correctly divided into two classes using a classification hyperplane, then the sample is linearly separable and satisfies (8).  +  ≤ −1, where   is the sample data,   is a decision attribute,  is the normal vector of the hyperplane, and  is the offset of the classification straight line.
If the interval from the sample point to the classification hyperplane is  = |  + | = 1, the distance between two types of samples is 2/‖‖.To select an optimal classification hyperplane from a large number of classification hyperplanes and separate the samples of different classes as much as possible, the distance from the sample set to the classification hyperplane is maximized.Therefore, the optimization goal is to find the optimal classification hyperplane under the constraint of (9).
The above problem can be transformed into a dual problem according to the Lagrange theory and then solved with a quadratic programming method.In the linear inseparability problem, the nonlinear mapping of Φ :   →  is used to map the original input space samples to the feature space .Then, the optimal classification hyperplane is constructed in the high-dimensional feature space .The dual problem after mapping to the high-dimensional feature space becomes (10).
where   > 0,  = 1, 2, ⋅ ⋅ ⋅ ,  is the Lagrange coefficient and (  ,   ) is the kernel function that satisfies the Mercer condition. is a penalty factor used to control penalty degree for misclassified samples.
This paper builds a multiclassification model based on SVM to classify the driving segments.The specific classification processes include training set screening, kernel function parameter optimization, classification prediction, and clustering result evaluation.
The first step is training set screening.Because a SVM is a supervised learning algorithm, selecting an appropriate training set before classification is extremely important.To improve the classification accuracy and efficiency, the optimal segments were selected from the k-means clustering results as the SVM training set, and the remaining segments were used as the testing set.The selection principles of the optimal segments are as follows.We selected an appropriate number of training sets to avoid under-learning and over-learning issues.Moreover, we selected representative segments from the k-means clustering results, and these segments were as close to a cluster center as possible and as far from the other cluster centers as possible.According to the above principles, 2353 representative segments were selected as the training set of the SVM, and the remaining segments were used as the testing set.Additionally, all segments were normalized to eliminate the influence of the dimensions on the classification results.
The second step is kernel function parameter optimization.Kernel functions can map the sample data in the original low-dimensional space to a high-dimensional feature space and transform them into linearly separable data.Meanwhile, the optimal classification hyperplane in the high-dimensional space can be obtained.Because the radial basis kernel function (RBF) can precisely express the features of the training set, accurately reflect the structure feature of the high-dimensional space, effectively control the dimension of the model solution set, and obtain the global optimal solution, this paper uses a RBF as the basic kernel function of the SVM.

𝐾 (𝑥, 𝑥
where ‖ −   ‖ is the two-norm distance,   is the support vector,  is the predicted sample, and gamma is the kernel function parameter .
Based on K-fold cross validation method, this paper adopts a grid search algorithm to obtain the optimal parameters  and .The optimization results are shown in Figure 7.When c=0.25 and =4.0,classification recognition accuracy reaches a maximum of 100%.
The third step in the proposed method involves classification prediction and clustering result evaluation.The SVM model was trained using the optimal parameters.The clustering results of driving segments using the k-means and SVM hybrid model are shown in Figure 8.
To evaluate the clustering results, this paper introduces two indicators: compactness and separation [47,51].
Compactness (CP) is an internal cluster evaluation criterion that uses the norm distance between all data sets in each cluster and the cluster center to evaluate the compactness of the cluster.A small CP value indicates that the data within the cluster are similar or close to one another.
where Ω  is the th cluster,   is the cluster center in Ω  ,   represents the data in Ω  , and |Ω  | represents the quantity of data in Ω  .As a global measure of compactness, the average of all clusters is calculated as follows.
where  is the number of clusters.Separation (SP) is an external evaluation criterion between clusters that uses the average Euclidean distance between each two cluster centers to evaluate the degree of separation of the clusters.A high  value indicates that the distance between clusters is large.
Table 6 gives the calculation results for CP and SP.A comparison of the values indicates that the k-means and SVM hybrid clustering results have small CP and large SP values; therefore, the clustering results with the k-means and SVM hybrid model are significantly better than those of the kmeans alone.
The categorized driving segment characteristics are shown in Table 7 and include low constant speed driving segments, medium constant speed driving segments, high constant speed driving segments, weak acceleration driving segments, strong acceleration driving segments, weak deceleration driving segments, and strong deceleration driving segments.

Driving Cycle Construction
The driving cycle is constructed by connecting the most representative driving segments according to predetermined rules until the driving cycle duration is reached.Generally, the duration of the general international standard driving cycles and real-world representative driving cycles is between 600 s and 1800 s.Therefore, according to the Hong Kong, Colombo, and Sydney driving cycles, the duration of the Xi'an EV urban driving cycle in this paper is set to 1200 s [17,33,36,43].Then, the duration of each driving segment class in the constructed Xi'an EV urban driving cycle is determined based on experimental data.First, this paper calculates the cumulative time of the seven driving segment classes in the collected experimental data and determines the time proportions of the seven driving segment classes.As shown in Table 8, the time proportions of the seven driving segment classes are 30.53%,13.98%, 12.51%, 14.28%, 7.84%, 14.72%, and 6.14%, respectively.Then, according to the time proportions of the seven driving segment classes in the experimental data, the time length of the seven driving segment classes in the constructed Xi'an representative EV urban driving cycle is determined.Their time lengths are 366 s, 168 s, 150 s, 171 s, 94 s, 177 s, and 74 s.In the process of constructing the driving cycle, we first randomly select a driving segment of no more than 5 s as the initial segment.Then, according to the following principles, the most representative driving segments are selected.The selection principle of representative driving segments is as follows: (1) the selected driving segment should be as close as possible to the cluster center; (2) the difference between the final speed of the previous driving segment and the initial speed of the current driving segment is less than 1km/h; (3) the selection of the driving segment is without replacement.Finally, the selected current driving segment is linked to the previous driving segment until the time length of each driving segment class is reached.This approach yields a constructed candidate driving cycle.
According to the test data and classification results, the time proportion and length of the seven driving segments are shown in Table 8.
Considering the randomness in the construction process of the driving cycle, one construction result is difficult to fully represent the Xi'an EV urban driving cycle.Therefore, the construction process is repeated to generate a large number of candidate driving cycles.Moreover, the scientific assessment criteria are set, and the most representative driving cycle is selected from candidate driving cycles to form the Xi'an EV urban driving cycle.
The assessment criteria flow diagram is shown in Figure 9. First, the average speed, maximum speed, average acceleration, average deceleration, standard deviation of acceleration, proportion of acceleration, proportion of deceleration, proportion of uniform speed, and proportion of idling are selected as assessment parameters [17,37].Then, we calculate the relative error (RE) of the assessment parameters of the candidate driving cycles and the overall test data and select the driving cycles for which the RE of all the assessment parameters is less than 10%.Finally, the mean relative error (MRE) is calculated, and the candidate driving cycle with the minimum MRE is selected.If there is more than one candidate driving cycle with the same minimum MRE, the root mean square error (RMSE) of the speed-acceleration probability distribution (SAPD) is calculated.The driving cycle with the minimum RMSE is then selected as the most representative Xi'an EV urban driving cycle.

𝑅𝐸 = ( 𝛼
where   and   are the assessment parameters of the candidate driving cycles and test data, respectively.
where  is the number of all assessment parameters.where  and  are the numbers of speed bins and acceleration bins, respectively, and   and   are the frequency values of the th bin of the candidate driving cycle and the test data, respectively.
The construction of Xi'an EV urban driving cycle is shown in Figure 10.The SAPDs of Xi'an EV urban driving cycle and test data are shown in Figures 11 and 12, respectively.A comparison of the Xi'an EV urban driving cycle and test data is shown in Table 9.According to the comparison results, the REs of all assessment parameters are less than 8.81%, the MRE is 4.03%, and the RMSE of the SAPD is 1.32%.The small MRE and RMSE values indicate that the constructed Xi'an EV urban driving cycle is very close to that based on experimental data from the real-world driving cycle.Therefore, the Xi'an EV urban driving cycle constructed in this paper can effectively represent the speed-time driving pattern of the real-world cycle in Xi'an.

Comparison of the Xi' an and International Standard
Driving Cycles.To study the differences between the Xi'an EV urban driving cycle constructed in this paper and the international standard driving cycles, eight characteristic parameters of seven general international standard driving cycles are compared as shown in Figure 13 and Table 10.The seven general international standard driving cycles are the European Economic Commission 15-mode test cycle (ECE 15), New European Driving Cycle (NEDC), Japanese Industrial Standards Committee 08 test cycle (JC08), Japan 10/15 mode test cycle (J10/15), Federal Test Procedures 72 and 75 (FTP72, FTP75), and Worldwide harmonized light duty test cycle (WLTC) [52], respectively.The comparison of results shows that the Xi'an EV urban driving cycle has greater average acceleration and deceleration values, larger proportions of acceleration and deceleration, and smaller proportions of uniform speed and idling compared to the other driving cycles.Therefore, compared with various general international standard driving cycles, the Xi'an EV urban driving cycle is characterized by more aggressive driving features.

Verification of the Construction Method.
To verify the advanced and scientific nature of the construction method of the driving cycle proposed in the paper, the driving cycles

Figure 1 :Figure 2 :
Figure 1: Lengths and proportions of urban road types.

Figure 3 :
Figure 3: Test route.Note: the red data in the A-B-C format represents road information, where A represents the serial number of the roads, B represents the length of the roads, and C represents the number of traffic lights.

Figure 6 :
Figure 6: Schematic diagram of the optimal hyperplane.

Table 1 :
Details of the test route.

Table 2 :
Technical characteristics of the test EV.

Table 4 :
Principal component score matrix.

Table 6 :
The calculation results of CP and SP.

Table 8 :
Time proportion and length of seven driving segments.