With advances in smartphones and global positioning system (GPS) devices, travelers’ long-term travel behavior can now be observed directly. This study investigates patterns of individual travel behavior and their correlation with social-demographic features. For different social-demographic groups (e.g., full-time employees and students), individual travel behavior may be subject to specific temporal, spatial, and mobile constraints. The study first extracts home-based tours, including Home-to-Home and Home-to-Non-Home tours, from long-term raw GPS data. The travel behavior pattern is then delineated by home-based tour features, such as departure time, destination location entropy, travel time, and driving time ratio. Travel behavior variability describes the variance of travelers’ activity behavior features over an extended period. The variability pattern of an individual’s travel behavior is then used to estimate the individual’s social-demographic information, such as social-demographic role, with a supervised learning approach, the support vector machine. A long-term (18-month) GPS data set from the Puget Sound Regional Council is used. In the experiment, employment status is predicted with an overall accuracy of 94.95%. The sensitivity analysis shows that as the number of tours threshold increases, the variability of most travel behavior features converges, while the prediction performance changes little for a fixed test data set.
An activity-travel behavior pattern analysis includes the identification of activity patterns, such as types, duration, sequence, and locations, and the recognition of travel behavior patterns regarding departure time, travel time, and travel types, such as commuting and noncommuting. It is one of the most fundamental research topics for many real-world applications, including Active Transportation and Demand Management (ATDM), Mobility-as-a-Service, and transportation demand management. The activity-travel behavior pattern is derived from either manually collected traveler activity diaries in travel surveys or passively obtained data, like global positioning system (GPS) trajectory data [
Travel demand management, such as ATDM, aims to reduce traffic demand or to redistribute the traffic demand temporally or spatially [
For incentive strategies, the challenge is that the travel patterns and social-demographic features of the target users are not entirely understood. Some ATDM programs use incentives to influence travelers in specific groups [
However, collecting travelers’ social-demographic information is not trivial. The most used method is collecting an activity diary in a traffic survey, including paper-based questionnaires and telephone interviews [
Travel behavior variability describes the variance of travel behavior for an extended period, which was recognized and studied [
This study proposes a social-demographic role prediction framework based on individuals’ travel behavior variability. It first extracts travel behavior variability from a long-term GPS data set. The variability is decomposed into three dimensions: temporal, spatial, and mobile. The temporal dimension represents departure time variability, and the spatial dimension represents destination location variability. The fluctuations of trip travel time and driving time ratio form the mobile dimension. The travelers’ home sites are detected from the raw GPS data, and the home-based tours and their variability are then derived. Next, the travel behavior variability is fed into a supervised machine learning model (support vector machine) to predict travelers’ social-demographic roles. The study builds upon the Puget Sound Regional Council household 2004–2006 survey data, which are provided by the National Renewable Energy Laboratory’s Transportation Secure Data Center [ The contributions are threefold: an individual social-demographic role prediction model based on travel behavior variability is proposed; the travel behavior variability and its correlation to the social-demographic role are explored; and a sensitivity analysis of the sampling threshold for a long-term data set reveals how the travel behavior variability and the social-demographic role prediction change with the data sampling threshold.
This research is expected to provide a practical framework for taking full advantage of emerging data (i.e., continuous GPS tracking data) and integrating them into existing modeling and behavior-related research and applications. The remainder of the paper is organized as follows. Methodology introduces the details of travel behavior variability extraction and the social role prediction method. Case Study and Discussion describes the experimental details and the results on the testing data set, together with a sensitivity study of the impact of data collection on travel behavior variability and social-demographic role prediction. Finally, the Conclusion summarizes the principal findings.
The framework of the proposed social-demographic role prediction method is shown in Figure
Methodology framework.
The GPS trajectory preprocessing aims to remove outliers from vehicle-instrumented GPS data. Initially, a data cleaning and smoothing process derived from Schüssler’s raw GPS data-processing procedure [
For continuous GPS trajectory data, the home location can be detected and home-based tours can then be generated. Candidate sites are the centroids of the three most visited place clusters (places the user departs from or arrives at) that are at least 1 mile away from each other, as these are the most likely to be home or other anchor locations, such as the workplace. The home location is then identified among the candidates from the characteristics of the trips related to each site, such as the departure location of the first trip of the day, the arrival location of the last trip of the day, and the duration of the stay at the site (e.g., more than 8 hours).
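The home-detection heuristic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: trips are assumed to be already mapped to place clusters, and the `Trip` record, the `detect_home` and `miles_between` functions, the parameter names, and the equal weighting of the three cues (first departure of the day, last arrival of the day, stays of 8 hours or more) are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt


@dataclass
class Trip:
    origin: str        # hypothetical cluster ID of the departure place
    dest: str          # hypothetical cluster ID of the arrival place
    depart: datetime
    arrive: datetime


def miles_between(a, b):
    """Great-circle (haversine) distance in miles between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3958.8 * asin(sqrt(h))


def detect_home(trips, centroids, min_sep_miles=1.0, long_stay=timedelta(hours=8)):
    """Return the cluster ID most likely to be home, or None if there are no trips."""
    trips = sorted(trips, key=lambda t: t.depart)
    visits = Counter()
    for t in trips:
        visits[t.origin] += 1
        visits[t.dest] += 1

    # Candidates: most visited places first, keeping up to three that are
    # at least min_sep_miles away from every candidate already kept.
    candidates = []
    for place, _ in visits.most_common():
        if all(miles_between(centroids[place], centroids[c]) >= min_sep_miles
               for c in candidates):
            candidates.append(place)
        if len(candidates) == 3:
            break
    if not candidates:
        return None

    # Assumed scoring: one point per supporting cue, all cues weighted equally.
    score = {c: 0 for c in candidates}
    by_day = {}
    for t in trips:
        by_day.setdefault(t.depart.date(), []).append(t)
    for day_trips in by_day.values():
        first, last = day_trips[0], day_trips[-1]
        if first.origin in score:
            score[first.origin] += 1   # first departure of the day
        if last.dest in score:
            score[last.dest] += 1      # last arrival of the day
    for prev, nxt in zip(trips, trips[1:]):
        stayed = prev.dest == nxt.origin and nxt.depart - prev.arrive >= long_stay
        if stayed and prev.dest in score:
            score[prev.dest] += 1      # long stay at the place
    return max(candidates, key=lambda c: score[c])
```

A workplace also accumulates long-stay points on working days, which is why the first-departure and last-arrival cues are counted alongside the stay duration in this sketch.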
After determining the home location, the individual home-based tours, such as Home-to-Home (HH) and Home-to-Non-Home (HN), can be produced. An HH tour is defined as the traveler departing from and returning home within a reasonable travel time during the day (such as 3 hours). An HN tour is travel during the day that departs from home and arrives at any other location, such as the workplace. A traveler may make more than one HH or HN tour in a day, HH tours in particular.
In this study, a home-based tour, either HH or HN, may comprise one or more consecutive trips, which is described by departure time, destination location, driving time, and travel time. Similar to the trip, a tour has the tour features encompassing departure time, destination location, driving time ratio, and travel time. The
For a traveler
Tour feature and variability variables description.
Tour type | Features | Variability |
---|---|---|
HH | Departure time | Departure time SEM |
HH | Destination locations | Destination locations entropy |
HH | Travel time | Travel time SEM |
HH | Driving time ratio | Driving time ratio SEM |
HN | Departure time | Departure time SEM |
HN | Destination locations | Destination locations entropy |
HN | Travel time | Travel time SEM |
HN | Driving time ratio | Driving time ratio SEM |
The tour temporal feature is the tour departure time, which is converted into a 15-minute time slot index counted from the beginning of the day (00:00) so that the departure time within a day can be described numerically. Thus, any time of day can be expressed as an integer slot index ranging from 0 to 95. The tour temporal variability
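The slot conversion and a variability measure over it can be sketched as below. This is a minimal sketch, assuming that "SEM" in the variability table denotes the standard error of the mean (sample standard deviation divided by the square root of the number of tours); the function names `slot_index` and `sem` are hypothetical and not taken from the source.

```python
from datetime import time
from math import sqrt
from statistics import stdev


def slot_index(t: time, slot_minutes: int = 15) -> int:
    """Map a time of day to its slot index; 15-minute slots give 0..95."""
    return (t.hour * 60 + t.minute) // slot_minutes


def sem(values):
    """Standard error of the mean: sample standard deviation / sqrt(n).

    Assumed interpretation of the SEM variables; needs at least two tours.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two tours")
    return stdev(values) / sqrt(n)


# Hypothetical departure times of one traveler's tours of the same type.
departures = [time(8, 5), time(8, 20), time(8, 40), time(8, 25)]
slots = [slot_index(t) for t in departures]
departure_time_sem = sem(slots)
```

Under the same assumption, the `sem` helper would apply unchanged to the per-tour travel times and driving time ratios that make up the mobile dimension.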
The spatial feature is represented by the destination locations, which are the destinations of all trips in the tour. For example, although HH tours have a fixed origin and destination (i.e., home), an HH tour may include multiple trips with different purposes, such as grocery shopping trips, children-pickup trips, or social trips, and these trips may have different destination locations. For HN tours, in addition to variation in the tour destination, the destinations of trips within the tour may vary as much as in HH tours. To numerically describe the variability of the destination locations, Shannon’s entropy [
First, for individual
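A minimal sketch of a destination location entropy follows. The exact definition used in the study is not recoverable from the text, so this assumes that destinations have already been grouped into discrete location IDs and that the entropy is Shannon's entropy of the visit frequencies, with a base-2 logarithm; the function name `location_entropy` and the log base are assumptions.

```python
from collections import Counter
from math import log2


def location_entropy(destinations):
    """Shannon entropy (in bits) of the visit shares of destination IDs.

    destinations: iterable of hashable location IDs, one per trip destination.
    """
    counts = Counter(destinations)
    total = sum(counts.values())
    # H = -sum_k p_k * log2(p_k), where p_k is the share of visits to location k.
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Hypothetical example: a traveler who always goes to one place has zero
# entropy, while one who splits visits evenly over two places has one bit.
single_place = location_entropy(["grocery"] * 5)
two_places = location_entropy(["grocery", "school", "grocery", "school"])
```

Higher values indicate visits spread over more locations more evenly, which is the sense in which entropy acts as a spatial variability measure here.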
The mobile features reflect the vehicle movement behavior and travel property. They are delineated by travel time and driving time ratio. The variability of tour travel time
After collecting individuals’ variability variables, with the individuals’ social-demographic role labels as the ground truth data, a supervised machine learning model describing the correlation between travel behavior variability and social-demographic role can be developed. The eight variability variables are the independent features for defining an individual’s travel behavior variability pattern, and the ground truth social-demographic role is used as the dependent variable. The support vector machine (SVM) [
The Puget Sound Regional Council Traffic Choices Study was an 18-month study (conducted between 2004 and 2006) of travel behavior in response to road use. With 450 vehicles from over 275 households, the raw GPS trajectory data indicated that more than 4.5 million vehicle miles were traveled. Travelers’ social-demographic features were collected as well. The National Renewable Energy Laboratory’s Transportation Secure Data Center [
After the raw data were preprocessed and incomplete records were removed, a total of 218 individuals had complete variability variables for at least five HH or HN tours together with social-demographic information. The distributions of the number of HH tours (green) and HN tours (red) for those 218 individuals are illustrated in Figure
Histogram of number of HH and HN tours.
The individuals’ social-demographic roles (employment status) include six types: (1) full-time employee; (2) part-time employee; (3) student; (4) homemaker; (5) retired; and (6) other. Type 1 (full-time employee) travelers far outnumber the other types. Because the class sizes are unbalanced, the original data set is converted into a binary class data set with type 1 and type 0. The type 1 class is the original type 1 class, while the type 0 class combines types 2 through 6. The type 1 class has 165 travelers; the type 0 class has 53 travelers. The tours’ variability variables of the binary class data set are discussed. Table
Variability variables statistical details for binary class.
Variability variables | Type 1 mean | Type 1 std. | Type 0 mean | Type 0 std. | | |
---|---|---|---|---|---|---|
HH | | | | | | |
Departure time | | | | | | |
Destination locations entropy | | | | | | |
Travel time SEM | | | | | | |
Driving time ratio SEM | | | | | | |
HN | | | | | | |
Departure time | | | | | | |
Destination locations entropy | | | | | | |
Travel time SEM | | | | | | |
Driving time ratio | | | | | | |
The statistically significant variables are For HN tours, type 1 travelers have significantly lower mean values of For HN tours, the HN tour’s mean For HH tours, the departure time situation is reversed. The mean
In the prediction model, the SVM classification is implemented with the Python library scikit-learn (sklearn) using its default configuration, with a radial basis function kernel. The multiclass and binary class prediction accuracy results are illustrated in Table
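A minimal sketch of such a classifier setup is given below. It uses scikit-learn's `SVC` with the radial basis function kernel and otherwise default settings, as the text describes; the data are synthetic stand-ins generated for illustration only (not the study's data), and the helper name `fit_role_classifier`, the class sizes, and the absence of any feature scaling are assumptions.

```python
import numpy as np
from sklearn.svm import SVC


def fit_role_classifier(features, labels):
    """Fit an RBF-kernel SVM, leaving C, gamma, etc. at the library defaults."""
    model = SVC(kernel="rbf")
    model.fit(features, labels)
    return model


# Synthetic stand-in data: eight variability variables per traveler,
# drawn from two well-separated distributions purely for illustration.
rng = np.random.default_rng(0)
type1 = rng.normal(loc=0.0, scale=1.0, size=(60, 8))
type0 = rng.normal(loc=3.0, scale=1.0, size=(20, 8))
X = np.vstack([type1, type0])
y = np.array([1] * 60 + [0] * 20)

model = fit_role_classifier(X, y)
predicted = model.predict(X)
```

In practice the eight columns would be the HH and HN variability variables of each traveler, and the labels would be the surveyed employment status (or its binary type 1 / type 0 recoding).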
Multiclass and binary class employment status SVM prediction results.
Employment status-multiclass (actual type) | Estimated 1 | Estimated 2 | Estimated 3 | Estimated 4 | Estimated 5 | Estimated 6 | Total | Recall accuracy |
---|---|---|---|---|---|---|---|---|
1 | 165 | 0 | 0 | 0 | 0 | 0 | 165 | 100.00% |
2 | 3 | 20 | 0 | 0 | 0 | 0 | 23 | 86.96% |
3 | 3 | 0 | 3 | 0 | 0 | 0 | 6 | 50.00% |
4 | 2 | 0 | 0 | 14 | 0 | 0 | 16 | 87.50% |
5 | 1 | 0 | 0 | 0 | 2 | 0 | 3 | 66.67% |
6 | 2 | 0 | 0 | 0 | 0 | 3 | 5 | 60.00% |
Total | 176 | 20 | 3 | 14 | 2 | 3 | 218 | — |
Precision accuracy | 93.75% | 100% | 100% | 100% | 100% | 100% | — | 94.95% |

Employment status-binary class (actual type) | Estimated 1 | Estimated 0 | Total | Recall accuracy |
---|---|---|---|---|
1 | 165 | 0 | 165 | 100.00% |
0 | 11 | 42 | 53 | 79.25% |
Total | 176 | 42 | 218 | — |
Precision accuracy | 93.75% | 100% | — | 94.95% |
From Table
One observation is the weaker prediction performance for the type 2 to type 6 classes in the multiclass case and for type 0 in the binary class case (e.g., 50.00% recall for type 3 and 79.25% for type 0, against 100.00% for type 1). This may be caused by the unbalanced data set and the limited sample size.
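The recall, precision, and overall accuracy figures in these tables follow directly from the confusion matrix counts, as the short sketch below illustrates with the binary class counts reported above. The function name `recall_precision_accuracy` is hypothetical; rows are taken as actual classes and columns as estimated classes, matching the table layout.

```python
def recall_precision_accuracy(matrix):
    """Per-class recall and precision plus overall accuracy.

    matrix[i][j] = number of travelers of actual class i estimated as class j.
    A class with no actual (or no estimated) members gets None.
    """
    n = len(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    recall = [matrix[i][i] / row_totals[i] if row_totals[i] else None
              for i in range(n)]
    precision = [matrix[j][j] / col_totals[j] if col_totals[j] else None
                 for j in range(n)]
    accuracy = sum(matrix[i][i] for i in range(n)) / sum(row_totals)
    return recall, precision, accuracy


# Binary employment status counts from the table: rows are actual type 1
# and type 0, columns are estimated type 1 and type 0.
binary = [[165, 0],
          [11, 42]]
recall, precision, accuracy = recall_precision_accuracy(binary)
```

Here the type 0 recall is 42/53, the type 1 precision is 165/176, and the overall accuracy is 207/218, which correspond to the 79.25%, 93.75%, and 94.95% entries in the table.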
In addition to the employment status, an individual’s other social-demographic variables, including income, age, and gender, are discussed in this study. Similar to the experiment results of employment status shown previously, the prediction results of those three variables (income, age, and gender) are shown in Tables
Income level multiclass SVM prediction results.
Income level-multiclass (actual) | Estimated <$25K | Estimated $25K–50K | Estimated $50K–75K | Estimated $75K–150K | Estimated >$150K | Total | Recall accuracy |
---|---|---|---|---|---|---|---|
<$25K | 16 | 0 | 1 | 3 | 0 | 20 | 80.00% |
$25K–50K | 0 | 26 | 0 | 10 | 0 | 36 | 72.22% |
$50K–75K | 0 | 0 | 33 | 8 | 0 | 41 | 80.49% |
$75K–150K | 0 | 0 | 1 | 101 | 0 | 102 | 99.02% |
>$150K | 0 | 0 | 1 | 5 | 13 | 19 | 68.42% |
Total | 16 | 26 | 36 | 127 | 13 | 218 | — |
Precision accuracy | 100.00% | 100.00% | 91.67% | 79.53% | 100.00% | — | 86.70% |
Age level multiclass SVM prediction results.
Age level-multiclass (actual) | Estimated <21 | Estimated 22–34 | Estimated 35–44 | Estimated 45–54 | Estimated 55–65 | Estimated >65 | Total | Recall accuracy |
---|---|---|---|---|---|---|---|---|
<21 | 0 | 1 | 0 | 2 | 0 | 0 | 3 | 0.00% |
22–34 | 0 | 37 | 0 | 7 | 0 | 0 | 44 | 84.09% |
35–44 | 0 | 1 | 51 | 11 | 0 | 0 | 63 | 80.95% |
45–54 | 0 | 3 | 0 | 71 | 0 | 0 | 74 | 95.95% |
55–65 | 0 | 1 | 1 | 4 | 19 | 0 | 25 | 76.00% |
>65 | 0 | 0 | 1 | 5 | 0 | 3 | 9 | 33.33% |
Total | 0 | 43 | 53 | 100 | 19 | 3 | 218 | — |
Precision accuracy | — | 86.05% | 96.23% | 71.00% | 100.00% | 100.00% | — | 83.03% |
Gender binary class SVM prediction results.
Gender status-binary class (actual) | Estimated female | Estimated male | Total | Recall accuracy |
---|---|---|---|---|
Female | 130 | 5 | 135 | 96.30% |
Male | 15 | 68 | 83 | 81.93% |
Total | 145 | 73 | 218 | — |
Precision accuracy | 89.66% | 93.15% | — | 90.83% |
The overall prediction accuracy values of the three variables (income level = 86.70%, age level = 83.03%, and gender = 90.83%) are still acceptable, although they are lower than the prediction accuracy of employment status (94.95%). This indicates that an individual’s employment status is easier to predict than the other variables. A likely reason is that employment status is more directly and closely correlated with travel behavior variability than the other social-demographic variables are.
The test data were collected over nearly 18 months, and for a data set collected over a long time, it is feasible to carry out a sensitivity analysis for the sampling threshold, that is, the number of tours. The sensitivity analysis investigates how the threshold impacts the tour variability and even social-demographic role prediction, aiming to answer the questions about the data collection sufficiency for travel behavior variability convergence and estimating the individuals’ social-demographic roles. As a comparison to the SVM model used in the study, another machine learning classification model, logistic regression (LR), is implemented in the analysis.
The number of tours threshold is defined as the required minimum number of tours for both HH and HN for a successful data collection. The number of tours threshold ranges as
The tour variability variables plotted against the number of tours thresholds for type 1 and type 0 travelers are illustrated in Figure
Variability versus number of tours thresholds.
The HH and HN tours’
The
Generally, for a large sample size, the variances of travel behavior features change little, and the variability values are low. According to the diagrams, one rule of thumb is that once the number of tours reaches about 40, the variances of travel behavior features remain stable at low values (except for destination location entropy), and the travel behavior variability becomes more reliable and predictable.
Statistical analyses of the two types of travelers are conducted for all eight travel behavior variability variables to compare the variances of travel behavior features below and above the 40-tour threshold. The statistical results are listed in Table
Statistical analysis of variances of travel behavior features for threshold-40.
Features | Statistical measures | Type 1, ≤40 (1272) | Type 1, >40 (847) | Type 0, ≤40 (379) | Type 0, >40 (174) |
---|---|---|---|---|---|
HH departure time SEM | Mean | 4.17 | 1.99 | 3.66 | 1.94 |
HH departure time SEM | Std. | 3.28 | 0.46 | 2.90 | 0.47 |
HH departure time SEM | t-statistic | 23.26 | — | 11.22 | — |
HH departure time SEM | p-value | 0.00 | — | 0.00 | — |
HH location entropy | Mean | | | | |
HH location entropy | Std. | 0.92 | 0.57 | 0.95 | 0.51 |
HH location entropy | t-statistic | −27.52 | — | −15.08 | — |
HH location entropy | p-value | 0.00 | — | 0.00 | — |
HH travel time SEM | Mean | 129.84 | 103.76 | 393.90 | 164.91 |
HH travel time SEM | Std. | 388.96 | 164.87 | 409.97 | 264.37 |
HH travel time SEM | t-statistic | 2.12 | — | 1.82 | — |
HH travel time SEM | p-value | 0.03 | — | | — |
HH driving time ratio SEM | Mean | 0.06 | 0.03 | 0.06 | 0.03 |
HH driving time ratio SEM | Std. | 0.05 | 0.01 | 0.06 | 0.01 |
HH driving time ratio SEM | t-statistic | 21.29 | — | 10.22 | — |
HH driving time ratio SEM | p-value | 0.00 | — | 0.00 | — |
HN departure time SEM | Mean | 1.46 | 0.83 | 2.42 | 1.11 |
HN departure time SEM | Std. | 2.23 | 0.59 | 2.95 | 0.60 |
HN departure time SEM | t-statistic | 9.53 | — | 8.22 | — |
HN departure time SEM | p-value | 0.00 | — | 0.00 | — |
HN location entropy | Mean | | | | |
HN location entropy | Std. | 0.77 | 0.75 | 0.93 | 0.98 |
HN location entropy | t-statistic | −12.89 | — | −4.44 | — |
HN location entropy | p-value | 0.00 | — | 0.00 | — |
HN travel time SEM | Mean | | | 36.16 | 23.12 |
HN travel time SEM | Std. | 276.03 | 192.73 | 103.15 | 37.95 |
HN travel time SEM | t-statistic | −0.57 | — | 2.16 | — |
HN travel time SEM | p-value | | — | 0.03 | — |
HN driving time ratio SEM | Mean | 0.04 | 0.02 | 0.05 | 0.03 |
HN driving time ratio SEM | Std. | 0.05 | 0.01 | 0.06 | 0.01 |
HN driving time ratio SEM | t-statistic | 11.24 | — | 7.34 | — |
HN driving time ratio SEM | p-value | 0.00 | — | 0.00 | — |
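The comparison in this table can be approximately retraced from the tabulated summary statistics. The source does not name the test that was used; the sketch below assumes an unequal-variance (Welch) two-sample t statistic, because for several rows that form gives values close to those reported beneath each Std. row (for example, about 23.4 against the reported 23.26 for type 1 HH departure time SEM). The function name `welch_t` is hypothetical, and the small differences are attributed here to rounding of the tabulated inputs.

```python
from math import sqrt


def welch_t(mean1, std1, n1, mean2, std2, n2):
    """Welch's t statistic from group means, sample standard deviations, and sizes.

    Assumed form of the test; the source does not state which test was applied.
    """
    return (mean1 - mean2) / sqrt(std1 ** 2 / n1 + std2 ** 2 / n2)


# Type 1, HH departure time SEM: <=40 tours (n = 1272) versus >40 tours (n = 847).
t_type1_hh_departure = welch_t(4.17, 3.28, 1272, 1.99, 0.46, 847)

# Type 0, HH departure time SEM: <=40 tours (n = 379) versus >40 tours (n = 174).
t_type0_hh_departure = welch_t(3.66, 2.90, 379, 1.94, 0.47, 174)
```

A large positive value means the variability measure is markedly higher in the group with 40 or fewer tours, which matches the reading that variability settles at lower values once enough tours are observed.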
The
The statistical analysis results are consistent with the observations from Figure
The sensitivity study includes logistic regression (LR) as a prediction approach for comparison with the SVM model used in this study. The comparison focuses on the overall recall accuracy on the data set. Since the number of qualified individuals decreases as the number of tours threshold goes up, the varying sample set sizes at different thresholds may affect the prediction results. Figure
Number of qualified individuals versus number of tours thresholds.
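The relationship in this figure can be computed with a few lines, sketched below under the definition given earlier that a traveler qualifies at a threshold only if both the HH and the HN tour counts reach it. The function name `qualified_counts` and the example tour counts are hypothetical and do not come from the study's data.

```python
def qualified_counts(tour_counts, thresholds):
    """Number of travelers meeting each minimum number of tours threshold.

    tour_counts maps a traveler ID to (number of HH tours, number of HN tours);
    a traveler qualifies at threshold k when both counts are at least k.
    """
    return {
        k: sum(1 for hh, hn in tour_counts.values() if hh >= k and hn >= k)
        for k in thresholds
    }


# Hypothetical tour counts for three travelers.
example_counts = {"a": (50, 12), "b": (8, 7), "c": (95, 91)}
curve = qualified_counts(example_counts, [5, 10, 90])
```

Because the qualifying set can only shrink as the threshold rises, comparing prediction accuracy across thresholds mixes the effect of more tours per traveler with the effect of fewer travelers, which is the motivation for also evaluating on a fixed sample set.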
From Figure
The sensitivity result for the fixed sample set, with the number of tours threshold ranging from 5 to 90, is illustrated in Figure
Sensitivity study of prediction accuracy and number of tours threshold for a fixed sample set.
The prediction results illustrate that a larger number of tours required for data collection does not significantly improve the prediction accuracy. Since the traveler type detection result heavily depends on the travel behavior variability differences between both types of travelers, the same or similar travel behavior variability patterns of both types of individuals (which are observed from the diagrams in Figure
Variability mean difference ratio versus number of tours threshold.
This paper proposes a social-demographic role prediction method based on the travel behavior variability pattern. It rests on the principle that different social groups have specific travel behavior patterns. The paper provides a way to formalize a traveler’s travel behavior variability pattern by analyzing long-term raw GPS data and to predict individuals’ social-demographic roles from that variability with a support vector machine model.
The method is applied to the Puget Sound Regional Council data set, which includes a long-term (18-month) GPS trajectory data set and an accompanying individual social-demographic data set. The variability derived from the data set indicates that (1) for HN tours, full-time employees have tighter departure time restrictions on tours from home to other places, for example, the morning home-to-work commute; (2) they are more dedicated to their trips and do not stop frequently; and (3) for HH tours, full-time employees have more departure time flexibility. Based on the travel behavior variability properties, the prediction accuracy rates for social-demographic features, including employment status, income, age, and gender, are reported. Among these features, an individual’s employment status is the most closely related to travel behavior variability and can be predicted accurately. The impacts of the sampling size (number of tours threshold) on the tour variability and the prediction accuracy are also studied in a sensitivity analysis. The tour variability converges as the number of tours threshold increases. However, for the fixed sample set, the social-demographic role predictions do not change much as the number of tours threshold increases.
This study is a preliminary exploration of using travel behavior variability to predict an individual’s social-demographic information. The prediction method helps to obtain social-demographic data for people with long-term collected activity data without any traditional travel surveys. The sensitivity analysis can guide data collection and experiment design in future studies. However, the study has several limitations. First, the test data set contains only a small number of individuals (218); a larger traveler sample may improve the model’s performance. Second, the model considers only home-based tours and a limited set of travel behavior variability attributes. More measures of travel behavior features and their variability, such as travel mode, trip purpose, and other types of tours (e.g., work-based tours), should be considered in future work.
The publisher, by accepting the article for publication, acknowledges that the U.S. Government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this work or allow others to do so, for U.S. Government purposes.
The authors declare that there are no conflicts of interest regarding the publication of this manuscript.
This work was supported by the U.S. Department of Energy under Contract no. DE-AC36-08GO28308 with Alliance for Sustainable Energy, LLC, the Manager and Operator of the National Renewable Energy Laboratory. Funding was provided by the Federal Highway Administration.