Clustering Vehicle Temporal and Spatial Travel Behavior Using License Plate Recognition Data

Understanding vehicle travel patterns can support the planning and design of better services. In addition, vehicle clustering can improve management efficiency through more targeted access to groups of interest and can facilitate planning through more specific survey design. This paper clusters 854,712 vehicles observed over one week using the K-means clustering algorithm, based on license plate recognition (LPR) data obtained in Shenzhen, China. First, several travel characteristics related to temporal and spatial variability and activity patterns are used to identify homogeneous clusters. Then, the Davies-Bouldin index (DBI) and the Silhouette Coefficient (SC) are applied to determine the optimal number of groups; consequently, six groups are identified on weekdays and three on weekends, including commuting vehicles and other occasional leisure travel vehicles. Moreover, a detailed analysis of the characteristics of each group in terms of spatial travel patterns and temporal changes is presented. This study highlights the possibility of applying LPR data to discover the underlying factors in vehicle travel patterns and to examine the characteristics of specific groups.


Introduction
Trip starting and ending time, travel distance, travel frequency, activity duration, and analogous features are the typical descriptors of vehicle travel behavior. All of these aspects affect traffic conditions, directly or indirectly [1,2]. For example, the distribution of the trip starting and ending times of all vehicles determines the peak hours. A better understanding of these characteristics helps in analyzing the travel patterns and travel modes of vehicles. Identifying homogeneous travel behavior groups has been the subject of several prior studies, and travel behavior analysis has always attracted great interest from transport authorities, since vehicle travel behavior has a vital impact on strategic and operational decisions [3][4][5].
Clustering is one of the most important methods for summarizing and mining meaningful information from large amounts of data, since understanding the main differences between groups contributes to a better understanding of their travel behaviors, which in turn provides valuable information for transportation planning [6]. Clustering vehicles based on their travel characteristics is also a vital method for studying how representative specific groups are of the whole vehicle population, and the travel profile of each group provides an aggregated characterization of the vehicles in that group [7]. It can also provide transportation planners with richer travel demand information for improving system performance or better assessing network investments.
In the field of transportation, clustering has been widely accepted for dealing with big data and traffic problems [8,9]. Reference [10] investigated the determination of historical traffic patterns by means of Ward's hierarchical clustering procedure. It classifies highway traffic patterns, using data collected by an automatic vehicle identification (AVI) system, into four groups; the resulting weekday traffic patterns can be used as input for macroscopic traffic models and as a basis for traffic management. Moreover, when predicting traffic flows based on historical data, a preclassification (e.g., holidays, Mondays, core weekdays, and Fridays) can be made to guide the authorities, and these patterns can be used to detect and replace erroneous data and to impute missing data.
In addition, [11,12] used density-based clustering algorithms to classify trajectories from GPS data. The study of trajectory data can reveal individual trajectory patterns and the characteristics of human dynamics, and thus support trajectory prediction, urban planning, traffic monitoring, and so forth. Characteristics are similar within each group and significantly different between groups. Based on this similarity, individual similar trajectories can be recognized, and clustering also allows abnormal trajectory modes to be detected. Similarly, [13,14] combined DBSCAN and SVM (support vector machines) to sort GPS trajectories and identify activity stop locations, which is significant for analyzing human urban mobility.
Clustering is also popular in the analysis of the time series characteristics of traffic flow data [15]. According to the similarity of traffic flow characteristics, traffic sections are divided into different groups; in [16], the performance of the proposed approach and the stability of the clustering technique are evaluated using extensive simulation for different traffic densities.
Numerous studies concerning traffic and travel have been conducted, but they share some drawbacks. Large amounts of data are difficult to obtain. Acquisition is mostly manual, at the expense of considerable manpower and resources. Worse still, the information usually contains many errors and anomalies; thus, the research results tend to show lower reliability and higher deviation.
Recently, a number of well-established technologies for collecting vehicle-related data have emerged, including loop detectors, GPS data, and probe car data [17,18]. Loop detectors have the merit that, once installed, they continuously record every vehicle passing the monitored road section. However, the share of network segments equipped with these sensors is typically low and cannot represent the urban network as a whole, which leaves traffic conditions in most of the network unknown. Dedicated probe vehicles, meanwhile, are used to collect travel time and other data for designated routes in the network. Nevertheless, due to cost considerations, the number of traffic studies with probe vehicles is typically small and the number of vehicles involved is very low. Hence, they can only cover a limited number of routes for a limited duration of time.
A number of limitations mean that new sophisticated methods are needed to process the data and generate useful information, compared to traditional sensors [19]. Most recently, with emerging technologies and advanced devices, image recognition technology has greatly improved. License plate recognition (LPR) systems provide the opportunity to study vehicle travel patterns in detail. Compared to manual data collection techniques, LPR provides lower marginal costs, more detailed and disaggregated information, large sample sizes, and real-time data availability [20,21]. LPR data is mainly applied to three kinds of problems in the field of transportation, that is, (1) road network state discrimination, (2) vehicle microscopic characteristics mining, and (3) vehicle travel time/path estimation [22,23]. Zhan et al. [24] proposed a lane-based real-time queue length estimation model using LPR data. The model is validated using ground truth information on the maximum queue length from the city of Langfang in China. In addition, a novel trip route estimation method was proposed to estimate the vehicle travel path [25]. Similarly, an LPR-based approach for forecasting the urban short-term OD matrix was proposed: the original OD information is obtained first, then the OD amount between the detection points is inferred, and finally the OD information between fast track ramps is obtained [26,27]. All of the above shows that massive amounts of LPR data have been created, that they provide rich information, and that they can thus be an effective analytical data source.
Methods for clustering are usually divided into two categories, supervised and unsupervised. Supervised methods use past data as training samples, or previously known outputs, to learn a clustering rule that allows the clustering of future or new observations [28]. Because the form of the data in this study does not fit this setting, unsupervised methods are more applicable. Unsupervised clustering algorithms include hierarchical algorithms and partition algorithms. Hierarchical clustering algorithms have high computational complexity and cost, which limits their application to large-scale data sets; the advantages and shortcomings of these algorithms are discussed later in the paper.
The K-means clustering algorithm, which belongs to the distance-based clustering algorithms, is not only the most classic but also the most widely used. It offers rapid computation, an easily explained principle, and high efficiency. The K-means clustering algorithm is tested using load profiles of 100 residential smart meters collected over the interval extending from July 20th until August 9th, 2009. The method has shown high accuracy in dealing with traffic problems, which proved its great applicability [29].
In this paper, data from the LPR system in Shenzhen, China, covering the seven days (one week) from November 4th to 10th, 2013, are analyzed. Variables chosen for clustering include the proportion of different starting/ending points, maximum/minimum/average travel distance for one trip, days of travel within a week, the number of trips per day, the average start time of the first trip, the average end time of the last trip, and activity duration [30]. First, data cleaning is conducted to remove wrong and repeated records. Then, deviation standardization is used to normalize each value, eliminating the error caused by differing dimensions and considerable differences of magnitude. After this preliminary treatment, the data is divided into two groups, namely, weekdays and weekends. Finally, to determine the optimal number of clusters, the Davies-Bouldin index (DBI) [31] and the Silhouette Coefficient (SC) [32] are employed.
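As an illustration of the normalization step, the sketch below assumes that "deviation standardization" means min-max scaling of each variable to the range [0, 1], which is the usual reading of the term; the function name and the sample values are hypothetical and not taken from the paper.

```python
def min_max_scale(column):
    """Rescale a list of numbers to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # No spread in this variable, so every value gets the same score.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical per-vehicle values for two clustering variables.
distances_km = [2.0, 17.5, 78.1]   # average travel distance
days_of_travel = [1, 3, 5]         # days of travel in the week

scaled_distance = min_max_scale(distances_km)
scaled_days = min_max_scale(days_of_travel)
```

After scaling, variables measured in kilometres, hours, or counts contribute on a comparable scale to the Euclidean distances used by a distance-based clustering algorithm.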
In general, the purpose of this study is to classify vehicles into several categories based on selected variables and to determine the consistency of travel behavior over time and space by analyzing the vehicles' temporal and spatial variability. It can support the study of how specific groups represent the total population and help establish the predictability of vehicle trips. The rest of the paper is organized as follows. Section 2 offers a brief description of the data source. The methodology is introduced in Section 3. Section 4 presents the variables chosen for clustering. Section 5 shows the results of the clustering, and Section 6 presents the conclusions and findings.

Data Description
2.1. Data Overview. The potential of LPR systems has been explored for planning, managing, and assessing the performance of traffic systems. Further, data collected by these systems allows a more comprehensive view of vehicle travel patterns and travel behaviors.
2.1.1. Data Source. The LPR system in Shenzhen, China, covers the majority of parking lots and expressways in the city. Over 0.9 million vehicles are detected in a week; according to the Shenzhen Statistical Yearbook in 2013, the total number of vehicles in Shenzhen is about 2.1 million, implying that 42.86% of vehicles are detected by the LPR system. After data cleaning, there are still almost 128,000 recorded vehicles each day. Figure 1 is the sketched network of Shenzhen, where the red points represent the detectors installed on roads and the black lines show the roads.
LPR detectors are mainly installed on the expressways of the city and are unevenly distributed; most of them are at intersections or near pedestrian bridges. They are denser in the city center and more dispersed in the rest of the region. A sample of the raw data is given in Table 1.
It is worth noting that the detector ID has two types, for example, 10100610 and 101A0753. If "A" is contained in the ID, the detector is in a parking lot. Otherwise, it is a detector on a road. Table 2 shows the number of detectors for each day from November 4th to November 10th, 2013; more than 83% of the detectors are in parking lots. There are three main types of parking lots: (1) residential parking lots (including residential and office buildings, commercial places, and shared parking lots), (2) temporary parking lots, and (3) public parking lots. The parking lots with detection data account for about 20% of all the parking lots in Shenzhen.
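The detector ID rule above can be sketched as a one-line classifier. The function name and return labels are hypothetical; only the two example IDs and the "contains A" rule come from the text.

```python
def detector_type(detector_id: str) -> str:
    """Classify a detector by the rule stated in the text:
    an ID containing "A" is a parking-lot detector, any other ID is on a road."""
    return "parking_lot" if "A" in detector_id else "road"


# The two example IDs quoted in the text.
example_ids = ["10100610", "101A0753"]
example_types = [detector_type(d) for d in example_ids]

# Share of parking-lot detectors in this tiny example list.
parking_share = example_types.count("parking_lot") / len(example_types)
```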

Data Cleaning.
The data cleaning is conducted before vehicle clustering and there are two main steps.
(1) Extract the Data by Day. The whole dataset covers seven days (one week) and has been separated into seven files by date, so that each file contains the data of a single day.
(2) Verify the Original LPR Data

Identification of Taxi.
The purpose of this paper is to cluster all the vehicles in the dataset according to temporal and spatial variables. Each group will have characteristics different from the other groups, which helps to explore vehicle travel patterns that may not have been known before. Traffic researchers have always paid much attention to taxis because of their special travel mode. Taxis have the following characteristics [33]: (1) There is no fixed route or running time. (2) Operation runs 24 hours a day and can take place anywhere in the city. (3) The origins and destinations of taxis are completely determined by passengers. (4) The operating routes are up to the driver, for example, his experience and preferences. On account of these features, taxis are removed from the dataset so that the analysis in this paper focuses on noncommercial vehicles; future research will focus on the travel behavior of taxis.
Figure 2 shows the distribution of the number of taxi trips in Shenzhen. Over 70% of these vehicles make 20-35 trips per day, and only 7% make fewer than 10 trips. Meanwhile, according to the clustering result, nontaxis make no more than 10 trips per day. As a result, we removed vehicles whose travel frequency exceeded 10 trips per day. Under such a definition, two kinds of inaccuracy may arise: (1) nontaxis traveling more than 10 times a day are removed and (2) taxis traveling fewer than 10 times a day are retained.
However, according to the Shenzhen Statistical Yearbook in 2013, the number of taxis is around 17,000 in total, of which less than 50% were detected by the LPR system. Thus, the number of these two kinds of vehicles will be no more than a thousand, which is insignificant compared with the tens of thousands of ordinary vehicles.
On the basis of the rule proposed above, almost 6,000 taxis are removed from the dataset for each day; after their removal, around 122,000 vehicles remain for each day and 854,000 vehicles for the week.
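The taxi removal rule can be sketched as a simple frequency filter. This is a minimal illustration assuming that trips per day have already been counted for each plate; the plate strings and counts are made up, and the function name is hypothetical.

```python
# Threshold stated in the text: vehicles exceeding 10 trips per day are removed.
TAXI_TRIP_THRESHOLD = 10


def remove_suspected_taxis(trips_per_day):
    """Keep only vehicles whose daily trip count does not exceed the threshold.

    trips_per_day maps a plate (hypothetical identifiers here) to the number
    of trips counted for that vehicle on one day.
    """
    return {plate: n for plate, n in trips_per_day.items()
            if n <= TAXI_TRIP_THRESHOLD}


# Made-up counts for one day.
day_counts = {"PLATE-001": 2, "PLATE-002": 27, "PLATE-003": 10, "PLATE-004": 11}
kept = remove_suspected_taxis(day_counts)
removed = sorted(set(day_counts) - set(kept))
```

As the text notes, such a threshold misclassifies both frequent nontaxi travelers and low-activity taxis, so the filter is a heuristic rather than an exact identification.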

Clustering Methods.
Clustering methods encompass several techniques and algorithms used to group observations based on similar qualitative or quantitative characteristics. They are usually divided into supervised and unsupervised clustering. Supervised methods require a training sample that contains previously known information on each group membership [34]. Given the form of the data in this study, no training sample is available and there are no previously known classes, so unsupervised clustering is the best option. Unsupervised clustering methods aim at categorizing the data objects without a training sample; the goal is to find clusters based on similarities in the input data. There are two main types of unsupervised clustering, hierarchical algorithms and partition algorithms. Table 3 discusses the advantages and disadvantages of some unsupervised algorithms [35].

K-Means Algorithm.
As shown in Table 3, hierarchical algorithms have been criticized for low robustness and high sensitivity to noise and outliers. Since the assignment of an object to a cluster is not iterative, hierarchical algorithms are not able to correct potential misclassifications. In contrast, partition algorithms optimize either a locally or a globally defined objective function to generate groups of observations, so they are preferred in studies involving large-scale datasets.
K-means is chosen for this study as a computationally efficient method that is suitable for situations where all variables are quantitative. It is easy to understand and apply and is thus popular for clustering problems. The time complexity of the K-means algorithm is close to linear, so it scales well and is suitable for mining large-scale data sets. In this study, the variables used for clustering are all quantitative and the amount of data is large, which makes K-means a natural choice. Its main disadvantages are the difficulty of choosing the number of clusters and its dependency on the initialization. The dependency on initialization can be mitigated by repeated runs to find the best result. For the number of clusters, we have tried several values and applied the Davies-Bouldin index (DBI) and the Silhouette Coefficient (SC) to find the optimal one.
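To make the two drawbacks concrete, the following is a minimal pure-Python sketch of K-means (Lloyd's iteration) with random restarts, keeping the run with the smallest within-cluster sum of squares. It is an illustration under stated assumptions, not the implementation used in the paper: the function names, the restart count, and the toy points are all hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, seed=0, max_iter=100):
    """One K-means run from a random initialization chosen by `seed`."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct data points as initial centroids
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members)
                                     for col in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:  # converged
            break
        centroids = updated
    return labels, centroids


def inertia(points, labels, centroids):
    """Within-cluster sum of squared distances."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))


def best_of_restarts(points, k, n_restarts=20):
    """Repeat K-means with different seeds and keep the lowest-inertia run."""
    runs = [k_means(points, k, seed=s) for s in range(n_restarts)]
    return min(runs, key=lambda run: inertia(points, *run))
```

A single run can stop in a poor local optimum when both initial centroids fall in the same natural group, which is why the restart wrapper compares runs by inertia. Choosing `k` itself is a separate question that the restart wrapper does not answer.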

Criteria for One Trip.
To turn the raw data into vehicle trips and the values of the corresponding variables, the criteria for one trip must be defined first. Due to the inherent limitation of the LPR data, only some of the trajectory points of a vehicle can be obtained. As a result, the real starting and ending points of a trip cannot be inferred.
Figure 3 shows the travel trajectory of a vehicle in a simplified network, where the yellow curve represents the first trip of the vehicle and the green one its second trip. The short blue lines show the detection points. The true trip starting time at origin 1 is necessarily earlier than the time of the first trip record, and the true trip ending time at destination 1 is later than the time of the last record; the same holds for the second trip and any other trip of the vehicle.
Hence, some deviation in the values of certain variables is inevitable. The average starting time of the first trip will be a little later, and the average ending time of the last trip will be a little earlier. The whole activity duration will be longer and the travel distance will be shorter. However, the main purpose of our study is to extract the travel characteristics of vehicles rather than to estimate the OD matrix; these errors are offset in the same direction for all vehicles and thus may not have a critical impact on the clustering result. From this point of view, the definition of one trip is applicable. When applying these variable values in real transportation planning, the deviation should be taken into account.
As mentioned, the "segmentation" refers to the interval between two trips, that is, the interval between the last record of the first trip and the first record of the second trip, which is different from the vehicle's accumulated travel time. In order to find the optimal value of the segmentation, we have tested different thresholds.
Let AR be the true threshold, BR the true number of trips, A the threshold that we apply, and B the number of trips that we calculate. If A ≤ AR, then B ≥ BR; if A ≥ AR, then B ≤ BR; only when A = AR does B = BR hold. Different thresholds ranging from 20 min to 80 min have been tested, and the average number of trips under each setting is calculated. The results are shown in Table 4.
When the threshold ranges from 50 min to 80 min, the number of trips stabilizes. This implies that the probability of a trip going undetected in this interval is relatively small. Also, the interval between two trips derived from LPR data is larger than the actual interval. Hence, it is reasonable to choose one hour as the threshold.
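The segmentation rule can be sketched as follows: consecutive detections of one vehicle belong to the same trip unless the gap between them exceeds the threshold. This is an illustrative sketch, not the authors' code; the function name, the treatment of a gap exactly equal to the threshold, and the sample timestamps are assumptions.

```python
from datetime import datetime, timedelta


def split_into_trips(timestamps, threshold_minutes=60):
    """Group detection times of one vehicle into trips.

    A gap larger than the threshold between consecutive detections starts a
    new trip (a gap exactly equal to the threshold is kept in the same trip,
    which is an assumption of this sketch).
    """
    limit = timedelta(minutes=threshold_minutes)
    trips = []
    for t in sorted(timestamps):
        if trips and t - trips[-1][-1] <= limit:
            trips[-1].append(t)   # continue the current trip
        else:
            trips.append([t])     # start a new trip
    return trips


# Hypothetical detections of one vehicle on November 4th, 2013.
day = datetime(2013, 11, 4)
records = [day.replace(hour=h, minute=m)
           for h, m in [(8, 0), (8, 20), (8, 50), (18, 5), (18, 30)]]

# Number of trips obtained for each tested threshold from 20 to 80 minutes.
trip_counts = {m: len(split_into_trips(records, m)) for m in range(20, 90, 10)}
```

A smaller threshold can only split the records into more trips, never fewer, which matches the monotonic relationship between the applied threshold and the calculated number of trips described above.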

Spatial and Temporal Variables.
To estimate homogeneous vehicle groups based on their travel patterns using any clustering method, it is necessary to have input information on travel behaviors. Travel patterns can be described by looking at specific variables that together characterize each vehicle's travel routines [36]. The selected variables must include those vehicle characteristics that make travel patterns distinct [37,38]. A set of descriptive variables is presented, and vehicles are analyzed separately for weekdays and weekends.
(1) The Proportion of Different Origins/Destinations. The percentage of different origins/destinations has the potential to be a useful indicator of mobility patterns. To illustrate, vehicles with the same starting point for the first trip in a day, or the same ending point for the last trip in a day, over a week are more likely to be commuters with work or study purposes. This variable is an indicator of spatial travel variability, which could help to infer the predictability of vehicle travel. For vehicles that traveled 3 days in weekdays, the percentage of different origins for the first trip in a day is defined as follows:
0: The origins of the first trip in a day over the three days are all the same.
1/3: There is one difference for the origins of the first trip in a day over the three days.
2/3: There are two differences for the origins of the first trip in a day over the three days.
1: The origins of the first trip in a day over the three days are all different.
When the value is 0, the origins for one trip are all the same in the days of travel, suggesting that the behavior of this kind of vehicle is highly regular. In contrast, if the value is 1, the origins for one trip are all different in the days of travel, indicating the irregularity of the travel behavior.
The calculation for percentage of different destinations for the last trip in a day is defined in the same way, and for vehicles in weekends the dealing method is comparable.
(2) Travel Distance. The geometric distance between the origin and destination of one trip can show how accessible activity locations are to a vehicle. Travel distance variability among the trips of a vehicle can also demonstrate travel flexibility and vehicle mobility around the city. The travel distance variables adopted in this study are the maximum/minimum/average travel distance for one trip in the whole week. Because track points are missing, the complete travel trajectory of one trip cannot be obtained. As a result, in this study, the distance of one trip is defined as the straight-line distance between the start and end points of the trip, calculated from the latitude and longitude of the two points.
(3) Travel Frequency. The travel frequency of vehicles, that is, the number of trips made over a day/a week (or any other period), reflects the uncertainty of vehicle travel. There are two descriptive variables: the number of trips per day, which is the number of complete trips performed on each day of the week, and days of travel, which is the number of days within the period of analysis on which a vehicle has at least one trip. For vehicles in weekdays and weekends, the value of their travel days in a week ranges from zero to seven.
(4) The Trip Start/Finish Time. The trip start/finish time can reflect the trip purpose and the consistency of trips. The volatility of the start time of the first trip and the finish time of the last trip are crucial aspects when analyzing vehicle travel patterns.
(5) Total Activity Duration. Activity refers to all the actions vehicles perform when they are not traveling; in this paper the time interval between two adjacent trips is defined as the activity duration. There are many activity purposes, such as business, work, study, and entertainment, among others. The characteristics of the activity performed at a destination may determine the vehicle's travel decision, and the average activity duration of a vehicle in each day varies between weekdays and weekends.
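One common way to compute a point-to-point distance from latitude and longitude is the haversine (great-circle) formula. The paper does not say which formula was used, so the sketch below is an assumption; the function name and the Earth radius constant are not from the source.

```python
import math

# Mean Earth radius in kilometres (a standard value, assumed here).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_summary(trip_endpoints):
    """Maximum, minimum, and average trip distance for one vehicle.

    trip_endpoints is a list of (lat_o, lon_o, lat_d, lon_d) tuples,
    one per trip (hypothetical input layout).
    """
    d = [haversine_km(*trip) for trip in trip_endpoints]
    return max(d), min(d), sum(d) / len(d)
```

Because only the first and last detection points are used, this distance underestimates the length actually driven, as the text points out.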
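The temporal variables for one vehicle on one day can be derived from its list of trips, as in the sketch below. It assumes each trip is represented as a (start, end) pair of timestamps and that activity duration is the sum of the gaps between adjacent trips, following the definition in item (5); the function name, dictionary keys, and sample times are hypothetical.

```python
from datetime import datetime


def daily_features(trips):
    """Temporal variables for one vehicle on one day.

    trips is a non-empty list of (start, end) datetime pairs.
    """
    trips = sorted(trips)
    # Activity duration: time between the end of one trip and the start of the next.
    gaps_h = [(nxt[0] - cur[1]).total_seconds() / 3600
              for cur, nxt in zip(trips, trips[1:])]
    return {
        "n_trips": len(trips),
        "first_start": trips[0][0],
        "last_end": trips[-1][1],
        "activity_hours": sum(gaps_h),
    }


# Hypothetical commuter-like day: a morning trip and an evening trip.
day = datetime(2013, 11, 4)
example = daily_features([
    (day.replace(hour=8, minute=42), day.replace(hour=9, minute=10)),
    (day.replace(hour=18, minute=0), day.replace(hour=18, minute=18)),
])
```

Averaging these per-day values over the days a vehicle travels gives week-level variables such as the average start time of the first trip and the number of trips per day.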

The Distribution of the Variables for All Vehicles
(1) Weekdays. Figure 4 illustrates the distribution of all the temporal and spatial variables in weekdays, as a statistical summary of the whole vehicle population.
In Figure 4(a), there is an obvious peak between 8:30 am and 9:00 am, showing that the average trip start time of vehicles is mostly concentrated in this interval, which corresponds to the morning peak hours. Figure 4(b) shows the distribution of the average trip finish time; the majority of vehicles finish their trips at around 18:00-19:30, which corresponds to the evening peak hours. Additionally, a large number of vehicles start their trips at 12:30-13:30.
For the number of trips per day in Figure 4(c), vehicles traveling 1.5 trips/day account for a high proportion, followed by vehicles traveling 3.5 trips/day, 2 trips/day, and 4 trips/day. It may seem confusing that vehicles traveling 1.5 trips/day (fewer than 2 trips/day) account for such a high share. This is probably due to the definition of one trip in this study and the incomplete vehicle detection data.
Figure 4(d) shows the days of travel. Vehicles that travel only one day in a week account for a high share. The activity duration of most vehicles is within 11 h in Figure 4(e). Figures 4(f), 4(g), and 4(h) reflect the travel distance of vehicles. The maximum travel distance for one trip is almost always within 60 km, the minimum travel distance is less than 30 km, and the average travel distance is within 40 km. At the same time, for the average travel distance for one trip, over 68% of trips are within 10 km.
According to Figures 4(i) and 4(j), for the percentage of different starting or ending points, the values 0 and 1 account for a high proportion. Value 0 means that the starting/ending points of each trip are identical, and the regularity is high. Analogously, value 1 means that the starting/ending points of each trip are all different, and the irregularity is high.
(2) Weekends. For vehicles traveling in weekends, the distribution of their temporal indicators is basically similar to the weekdays. For the value of both of the percentages for different starting and ending points in weekends, value 0 takes up the highest ratio; in other words, these vehicles travel with less regularity. Compared with the weekday vehicles, they travel a relatively short distance; whether it is the maximum travel distance, minimum travel distance, or average travel distance, almost all are within 10 km and relatively concentrated within 5 km.

Results and Discussions
The values of within-cluster variation and the DBI/SC are shown as functions of the number of clusters in Figures 5(a) and 5(b). A smaller value of DBI and a larger value of SC are better. In Figure 5(a), the value of DBI is smallest when the cluster number is six, and the value of SC is largest when the cluster number is seven. The SC of seven groups is only slightly better than that of six groups, whereas the DBI of six groups is much better than that of seven groups. As a result, six is the better choice. In Figure 5(b), when the cluster number is three, both SC and DBI are optimal: DBI reaches its lowest point and SC its highest. So, the cluster numbers for weekdays and weekends are selected as six and three, respectively. The K-means clustering method provides not only information about each cluster's core characteristics but also information about the average characteristics of each cluster. Tables 5 and 6 display the average values of each index for each category in weekdays and weekends.
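For reference, both indices can be computed directly from the points and their cluster labels. The sketch below follows the standard definitions of the Davies-Bouldin index and the mean silhouette coefficient with Euclidean distance; it is not the authors' code, the function names are hypothetical, it requires Python 3.8+ for `math.dist`, and it assumes at least two clusters with distinct centroids.

```python
import math


def _group(points, labels):
    """Map each cluster label to the list of its member points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    return clusters


def davies_bouldin(points, labels):
    """Davies-Bouldin index: lower means more compact, better separated clusters."""
    clusters = _group(points, labels)
    cents = {lab: tuple(sum(col) / len(ms) for col in zip(*ms))
             for lab, ms in clusters.items()}
    # Scatter: mean distance of the members to their own centroid.
    scatter = {lab: sum(math.dist(p, cents[lab]) for p in ms) / len(ms)
               for lab, ms in clusters.items()}
    worst = [max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                 for j in clusters if j != i)
             for i in clusters]
    return sum(worst) / len(worst)


def silhouette(points, labels):
    """Mean silhouette coefficient: higher (closer to 1) is better."""
    clusters = _group(points, labels)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in ms) / len(ms)
                for other, ms in clusters.items() if other != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Evaluating both functions on the clustering obtained for each candidate number of clusters, and then looking for a low DBI together with a high SC, mirrors the selection procedure described above. The silhouette computation is quadratic in the number of points, so for data of this size it would normally be run on a sample.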
For Vehicles in Weekdays, Six Groups Are Clustered. The last column of Table 5 gives the share of each category in the total number of vehicles. The smallest cluster contains 4.1% of the vehicles in the sample, and the largest one accounts for 33.7%. Groups 1 to 6 are identified, respectively, as long travel distance vehicles, commuting vehicles, noon travel vehicles with short travel distance, off-peak hour travel vehicles, midnight travel vehicles, and peak-hour travel vehicles with short activity duration.
Group 1 is inferred to be long travel distance vehicles, which travel 1.82 days in a week and make 2.13 daily trips. On average, the first trip starting time of Group 1 is 10:14 am and the last trip ending time is 19:02. Additionally, the travel behavior of this group is irregular because the trip origins and destinations are all different. Besides, the total activity duration of this group is about 7.41 hours, and the travel distance of this group of vehicles is relatively long. The maximum travel distance for one trip is 78.1 km on average.
Group 2 may be commuting vehicles, which travel 5.94 days of the week on average and make 2.18 trips per day. The first trip of the day starts at approximately 8:42 am and the last trip of the day ends at 18:18. The activity duration lasts 8.67 hours on average. Furthermore, the distance between the origin and destination of their trips varies from 6.9 km to 59.2 km, and their average travel distance is about 17.5 km for one trip. The proportions of different starting and ending points for Group 2 are 0.12 and 0.09, representing a high regularity in the daily origins and destinations. All of these features support the inference that Group 2 consists of commuting vehicles.
Group 3 is defined as noon travel vehicles with short travel distance. The first trip starts at 10:07 am and the last trip ends at 15:14; these vehicles only travel around noon. Moreover, Group 3 travels only 1.08 days in a week and 1.82 trips in a day, and the activity duration is also short, only 4.02 hours on average. The travel distance varies between 1.9 km and 3.5 km, a narrow range, and the travel origins and destinations are almost all different.
Group 4 is identified as off-peak hour travel vehicles; the first trip of the day starts at 10:20 am and the last trip of the day ends at 19:48, which avoids the peak hours. There are 1.82 days of travel in a week and 1.63 trips in a day, and the travel distance of Group 4 is similar to that of Group 3. In particular, the maximum travel distance is only 2.9 km and, according to the percentage of different starting and ending points, the travel of Group 4 is not very regular either.
Unlike the other groups, Group 5 may be midnight travel vehicles, which have the most distinguishing features. Vehicles start their travel at 0:40 am and the activity duration is around 17.14 hours. Besides, the number of trips per day is 2.99, which is also higher than for the other groups, and the travel distance varies between 4.8 km and 32.5 km. The origins and destinations also have a certain degree of randomness.
Group 6, the peak-hour travel vehicles, has a short activity duration. Its travel origins and destinations are not regular, and its vehicles travel 28.1 km on average.
In general, the start time of the first trip and the end time of the last trip for Group 2 are similar to those of Group 6, both falling in the peak hours. Even so, Group 6 travels on fewer days in a week and its travel distance is much longer. Comparing the characteristics of Group 2 with those of Group 6, we can conjecture that Group 2 consists of commuting vehicles traveling twice every day, whereas Group 6 may consist of vehicles that commute on only part of the days in a week and are used for leisure, recreational, or sporadic work activities on the remaining days. Moreover, Groups 2, 3, and 4 are the main components of traffic flow, taking up 79.1% of the whole vehicle population. Group 4 consists of off-peak hour travel vehicles, and no clear travel purpose can be inferred using only these travel behavior characteristics. These clusters could be composed of leisure travelers, visitors, or sporadic vehicles. They may be vehicles used to pick up children or to go shopping nearby. Group 5 has features that distinguish it from the others: its vehicles travel only around midnight, similarly to taxis or online ride-hailing vehicles (e.g., Uber), and their travel times and travel purposes are random and uncertain.
For Vehicles in Weekends, Three Groups Are Clustered.The characteristics of each group are shown in Table 6.
Group 1 is deduced to be off-peak-hour travel vehicles: the first trip starts at 10:11 and the last trip ends at 19:30. These vehicles travel 1.87 days in a week and make 2.02 trips in a day. In addition, the average travel distance is about 30.8 km and the similarity of the travel origins and destinations is high. Considering the travel frequency, travel time, and travel distance together, the owners of these vehicles may live in the city center for work on weekdays and, on weekends, visit their parents or relatives in the suburbs or go on picnics to relax.
Group 2 is defined as afternoon travel vehicles with short activity duration, which make 1.27 trips per day and travel 1.66 days in a week. They travel in off-peak hours, between 12:30 and 16:18; the travel distance is not long and the activity duration is about 3 hours. Additionally, the origins and destinations are relatively stable. Taking all of these features together, Group 2 tends to be vehicles used for shopping or leisure on weekends.
Group 3 may be peak-hour travel vehicles: the average start time of the first trip is 7:42, the average end time of the last trip is 18:50, and the vehicles travel only 2.11 days in a week. The travel distance is as short as that of Groups 3 and 4 on weekdays. Vehicles in this group resemble weekday commuting vehicles. Their owners may work only on weekends, for example, people working for cram schools and the like.
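The group descriptions above read off per-cluster averages of the kind reported in Tables 5 and 6. As a minimal sketch, and not the authors' implementation, such a profile could be computed from per-vehicle features and cluster labels as follows. The feature names (`first_start_h`, `trips_per_day`) and the function `group_profiles` are hypothetical, and the sample values are made up for illustration only.

```python
from collections import defaultdict
from statistics import mean


def group_profiles(vehicles, labels):
    """Average every feature within each cluster and report the cluster share.

    vehicles: list of dicts mapping a (hypothetical) feature name to a number.
    labels:   one cluster id per vehicle, e.g. as returned by K-means.
    """
    if len(vehicles) != len(labels):
        raise ValueError("vehicles and labels must have the same length")

    # Collect the feature dicts of all vehicles assigned to each cluster.
    by_group = defaultdict(list)
    for features, label in zip(vehicles, labels):
        by_group[label].append(features)

    profiles = {}
    for label, members in by_group.items():
        # Mean of each feature over the members of this cluster.
        profile = {name: mean(m[name] for m in members) for name in members[0]}
        # Fraction of the whole vehicle population that falls in this cluster.
        profile["share"] = len(members) / len(labels)
        profiles[label] = profile
    return profiles


# Made-up example: times are expressed in decimal hours (assumption).
vehicles = [
    {"first_start_h": 7.0, "trips_per_day": 2.0},
    {"first_start_h": 8.0, "trips_per_day": 2.0},
    {"first_start_h": 10.0, "trips_per_day": 1.0},
    {"first_start_h": 11.0, "trips_per_day": 2.0},
]
labels = [0, 0, 1, 1]
profiles = group_profiles(vehicles, labels)
```

Expressing clock times as decimal hours keeps the averaging simple, but it is only one possible encoding; times close to midnight would need extra care before being averaged.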

Conclusions
This paper shows that it is possible to analyze the travel characteristics of vehicles and identify vehicle groups with similar travel behavior using LPR data. The main contributions of this paper are summarized as follows: (i) Six vehicle groups with similar travel characteristics on weekdays and three groups on weekends are identified, and the detailed behavior of each cluster is presented. (ii) Travel characteristics are studied by analyzing the distribution of these variables and the values of each variable for each category. In addition, we defined a vehicle type for each group of vehicles, to identify the commuting vehicles and other ordinary leisure travel vehicles, and the clustering result can be used in several aspects. For example, with the clustering, we can effectively extract the commuting vehicles; analyzing the spatial and temporal distribution of their travel behavior provides better decision information for managing urban traffic demand and making policy. In addition, summarizing the clustering result, almost 46% of vehicles (Groups 3 and 4) are off-peak-hour travel vehicles traveling short distances (less than 3.5 km) on weekdays. Considering that the detectors are mainly installed on expressways, we can guide these vehicles to take arterial roads instead of expressways by implementing traffic management schemes during off-peak hours, to improve the level of service of arterial roads and ultimately relieve the off-peak traffic pressure on expressways.
In general, firstly, this study has shown that it is possible to analyze the travel characteristics of vehicles and identify vehicle groups with similar travel behavior using LPR data. Besides, vehicle travel patterns can be studied further on the basis of these results, and this information can be used to better understand how the behavior of the different groups affects the road system, the travel patterns, and the travel modes.
Secondly, from the standpoint of transportation planning, clustering vehicle travel patterns allows the analysis of possible differences in the level of service experienced by different vehicle segments and the identification of potential biases. It can also provide a better understanding of how changes in the level of service affect different vehicles and how they respond to those changes. Knowing the main differences between groups can contribute to a better understanding of the effect of disruptions on travel behavior.
Finally, the method presented in this study is practical and can be applied to several similar problems and studies. It highlights the potential of using LPR data to mine underlying information about vehicles, and the study also reveals the importance of clustering vehicles based on their characteristics.

Figure 1: Road network and distribution of LPR system in Shenzhen.

(1) Delete erroneous LPR data: there are two kinds of erroneous data in our study: (a) the detected time of the record is beyond the range of [0:00-24:00] and (b) the latitude and longitude of the detection site of the record are beyond the scope of Shenzhen.
(2) Remove duplicated LPR data records: if there are two identical records, only one needs to be kept.
(3) Extract the trip chain in accordance with the definition of one trip: that is, the data has been processed into the following form.
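Steps (1) and (2) above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the `Record` fields, the helper names, and the sample records are hypothetical, and the latitude/longitude bounds are only a rough rectangle around Shenzhen chosen for illustration. Step (3) is omitted because it depends on the definition of one trip.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    """One LPR detection (hypothetical field layout)."""
    plate: str
    day: str    # calendar day of the detection
    time: str   # time of day, "HH:MM:SS"
    lat: float  # latitude of the detection site
    lon: float  # longitude of the detection site


# Rough bounding box around Shenzhen (assumption, for illustration only).
LAT_RANGE = (22.4, 22.9)
LON_RANGE = (113.7, 114.7)


def time_in_range(t: str) -> bool:
    """True if t parses as HH:MM:SS and lies within [0:00-24:00]."""
    try:
        h, m, s = (int(part) for part in t.split(":"))
    except ValueError:  # wrong number of fields or non-numeric field
        return False
    if not (0 <= m < 60 and 0 <= s < 60):
        return False
    return 0 <= h < 24 or (h == 24 and m == 0 and s == 0)


def in_study_area(lat: float, lon: float) -> bool:
    """True if the detection site lies inside the assumed bounding box."""
    return (LAT_RANGE[0] <= lat <= LAT_RANGE[1]
            and LON_RANGE[0] <= lon <= LON_RANGE[1])


def clean(records: List[Record]) -> List[Record]:
    """Apply steps (1) and (2), keeping the original record order."""
    seen = set()
    kept = []
    for r in records:
        # Step (1): drop records with an invalid time or an out-of-area site.
        if not time_in_range(r.time) or not in_study_area(r.lat, r.lon):
            continue
        # Step (2): of several identical records, keep only the first.
        if r in seen:
            continue
        seen.add(r)
        kept.append(r)
    return kept


# Made-up sample records.
sample = [
    Record("A1", "day1", "08:00:00", 22.54, 114.06),
    Record("A1", "day1", "08:00:00", 22.54, 114.06),  # exact duplicate
    Record("A1", "day1", "25:10:00", 22.54, 114.06),  # invalid time
    Record("B2", "day1", "09:00:00", 39.90, 116.40),  # outside the box
]
cleaned = clean(sample)
```

A rectangle is a coarse stand-in for the city boundary; a real pipeline would more likely check detector sites against the known list of LPR locations.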

Figure 2: The distribution of number of trips for taxi in Shenzhen.

Figure 3: The relationship between the travel trajectory and the detection points.

Table 1: Raw data sample.

Table 2: Amount of detection ID for each day.

Table 3: Advantages and disadvantages of some unsupervised algorithms.

Table 4: The average number of trips for different thresholds.

Table 5: Average values of variables for each category in weekdays.

Table 6: Average values of variables for each category in weekends.