Transferability of a Machine Learning-Based Model of Hourly Traffic Volume Estimation—Florida and New Hampshire Case Study

This paper focuses on the problem of model transferability for machine learning models used to estimate hourly traffic volumes. The presented findings enable not only an increase in the accuracy of existing models but also, simultaneously, a reduction in the cost of the data needed to train them, making statewide traffic volume estimation more economically feasible. Previous research indicates that machine learning volume estimation models that leverage GPS probe data can provide transportation agencies with accurate estimates of hourly traffic volumes, which are fundamental for both operational and planning purposes, and do so with a higher level of accuracy than the prevailing profiling method. However, this approach requires a large dataset for model calibration (i.e., input and continuous count station data), which involves significant monetary investment and data-processing effort. This paper proposes solutions that allow the model to be prepared using a much smaller dataset, provided that a previously collected dataset, which may have been gathered in a different place and time period, exists. Based on a broad selection of experiments, the results indicate that the proposed approach is capable of achieving similar model performance while collecting data for a five times shorter time period and utilizing one quarter of the number of continuous count stations. These findings will help reduce the cost of preparing and maintaining traffic volume models and render traffic volume estimation more financially appealing.


Introduction
This research investigates the transferability of artificial neural network (ANN) models applied in hourly traffic volume estimation. The introduced methodology explores the feasibility of training an ANN model to estimate traffic volume by transferring data from another region when sufficient data are unavailable for the target network.
This methodology requires ground truth observations to train the models, but previously trained models can be applied across the entire road network, regardless of the existence of ground truth data. The estimation of traffic flow parameters is one of the fundamental needs in many intelligent transportation system applications, as these parameters are used for a wide variety of purposes from the planning to the design and operational stages of highway networks [1][2][3]. Two of the most crucial inputs required by transportation agencies in order to calculate statewide performance measures are traffic volumes and speeds. There are many well-known methods and solutions for obtaining network-wide speed data, which are already implemented for commercial use. However, network-wide traffic volumes are much more challenging to obtain and thus remain a key missing dimension for quantifying traffic conditions, assessing transportation system performance, and cost-effectively managing mobility projects and programs. The current estimation methods require continuous count station (CCS) data and knowledge regarding aggregate volume estimates such as annual average daily traffic (AADT), making them both expensive and inaccurate in locations where CCS data are unavailable. Moreover, the accuracy of these models tends to decrease when the objective is to estimate traffic volumes with higher temporal granularity (e.g., for 15-minute intervals). These difficulties stem from a lack of network-wide data reflecting spatiotemporal traffic volume changes. The I-95 Corridor Coalition launched the "Volume and Turning Movement" project in 2015, seeking to provide representative volume and turning movement data products and assess their accuracy and feasibility. During the first phase of the project, it was shown that an ANN-based volume estimation model leveraging probe data as a key input was able to improve hourly volume prediction accuracy on functional road class (FRC) 1 roads by 26% compared to the currently often-used profiling method [4]. In the second phase, this solution was employed statewide in Florida [5] and extended to principal, major, and minor arterials. While implementing the model at the statewide level helped meet the project's main goals, it required a large calibration dataset. To enable this, three months of data from 173 CCSs were collected as the ground truth for training the ANN model to learn the relations between the input variables and the actual, observed traffic volumes. However, many states do not have such a robust network of CCSs and thus may not have sufficient calibration data to train a volume estimation model. Nonetheless, some patterns learned from such a large dataset may not be limited to a specific time or location and may be successfully applied elsewhere. For example, the impact of weather or traffic jams on the same types of roads should be similar regardless of the location. A natural way to circumvent the data limitations is to transfer knowledge from areas where enough data are available to areas where the amount of collected data is insufficient to apply machine learning techniques. Due to the significant spatiotemporal variation in traffic volumes, previously trained models cannot provide reliable results in a new study area. However, a small dataset from the study area can be combined with large datasets obtained from other regions to train an ANN model with acceptable accuracy.
This paper explores whether target geographies with a small calibration network can leverage a larger, previously collected dataset to enhance volume estimation performance. To illustrate this, we focused on estimating traffic volumes in New Hampshire, a small state with a limited CCS network, by utilizing the previously described Florida dataset, which is four times the size. The contribution of this article is twofold. We proposed a solution that enables traffic volume estimation even with small datasets, provided that other available large datasets or previously trained models are leveraged. We also designed, explored, analyzed, and assessed different techniques that allow leveraging available large external datasets or previously trained models to improve traffic volume estimations when only a small dataset for a particular area and time is available. As the cost of the data is the main factor that impedes wide usage of traffic volume estimations, these contributions are of significant practical importance. The remainder of the paper is organized into five sections. First, the literature review is presented to show how the proposed approach fits into the existing traffic volume estimation research. Afterwards, the data used for the analysis are presented, followed by a description of the model, including hyperparameter selection and details of the training procedure. Subsequently, the experiments and results are discussed and, finally, conclusions are drawn and several extensions of this work are proposed.

Literature Review
The existing literature contains many approaches for estimating traffic volumes [2,4]. One possible way to categorize these studies is to divide them into two groups: parametric and nonparametric methods [6,7]. Parametric methods use linear and nonlinear regression or the autoregressive integrated moving average (ARIMA) and their modifications to analyze historical data. The main limitation of parametric methods is that they yield the best results when analyzing stable phenomena with linear relationships between parameters.
As traffic shows features of chaotic systems with nonlinear relationships [8], researchers' attention turned more to nonparametric methods, which can approximate any nonlinear characteristics of the traffic flow data. This group includes many different methods that are mostly concerned with AADT estimation. Reference [9] used support vector machine (SVM) models for AADT prediction, [10] employed artificial neural networks (ANN) and SVM to predict AADT volumes, [11] also employed SVM-based models, and [12] employed classification and regression trees to predict short-term AADT volumes. Decision trees were also used by [13] for intersection traffic prediction. Reference [14] proposed a solution based on genetic algorithms, and a number of publications ([15][16][17][18]) used Bayesian networks for predicting traffic flow. Comparative analyses ([19][20][21][22]) show that while parametric methods can be used under certain conditions, nonparametric methods are better in others. In addition, there are many hybrid approaches that employ various methods simultaneously, e.g., [23][24][25][26].
Among nonparametric solutions, ANNs have been growing in popularity ever since being proposed for estimating motion parameters [27]. The recent advancement in massive data availability has made it possible for ANN models to achieve better results than models based on statistical methods [28]; thus, they are widely used to predict traffic flow and AADT. Reference [29] used fuzzy networks to deal with the uncertainty of spatiotemporal data features in traffic flow prediction. Reference [30] handled the problem of missing data in forecasting AADT using recurrent neural networks. Reference [31] examined the possibility of estimating AADT from a one-week dataset using an ANN-fuzzy approach. An extensive literature review related to the use of ANNs for traffic flow prediction is included in [32]. Moreover, multiple recent studies have leveraged ANNs to address hourly traffic volume estimation. Reference [33] used an ANN to estimate hourly traffic volumes using continuous count station features as direct inputs to their model.
Reference [34] estimated hourly traffic volumes at freeway off-ramps using an ANN. That paper illustrated that off-ramp volume estimation can achieve acceptable accuracy when proper data are fed into the model. An exhaustive review and classification of the deep learning methods and models used in the estimation of road traffic parameters is also included in [2].
Although ANNs achieve good results, a lot of data are necessary, and meeting this requirement is usually expensive and time-consuming. According to [35], one of the fundamental problems of using predictive models is the limited possibility of using them outside the area or period of the study in which they were developed. A potential solution to the problem of insufficient data is to obtain additional data from other available sources, such as GSM stations [36], GPS data [1], social media, or applications installed voluntarily on mobile devices [37]. Although Big Data technologies and open data sources have made the data problem easier to solve than ever before, in many cases it still remains an issue [38]. The second approach assumes that it is possible to transfer data and models developed in one spatiotemporal context to another location. Transferability can be investigated in the temporal and spatial dimensions. Temporal transferability is the possibility of using observations or models trained in a given time window to estimate parameters in other periods. Spatial transferability means the possibility of using observations or models trained for one area in another location [39]. The spatiotemporal transferability of forecasting models is of considerable practical interest. It can save the time and money spent on gathering data or preparing the model itself and help overcome the lack of trained personnel [40]. This issue has been of interest in the context of traffic volume estimation for decades (see, for instance, [3, 40, 41, 42, 43]). For machine learning models, this problem is defined as transfer learning. The transfer of data or knowledge from one domain to another is a way to solve the problem of insufficient training data, which is particularly important when using ANNs [44]. Reference [45] demonstrated that the size of the dataset is crucial for the accuracy of traffic forecasts in ANN-based models and, in terms of results, it is more important than other factors. Hence, the problem of using data or knowledge of traffic patterns acquired in one place at different locations still needs a solution [7,46]. The current study aims to address this problem for hourly traffic volume estimation. This paper explores the viability of using existing CCS data from Florida to help calibrate an hourly traffic volume estimation model in New Hampshire, where a much smaller calibration dataset is available. In particular, the following two specific issues were explored concerning model performance: (1) using significantly fewer CCSs in the target state (New Hampshire) to calibrate the model and (2) collecting calibration data from the target state for a shorter time period. The proposed solution can be classified as transfer learning based on mapping [44], which consists of combining data sources from the source and target domains, thus creating a new (larger) dataset. This method is based on the assumption that although the data from both domains differ, this new set of data contains sufficient features of the target domain for accurate forecasting. Namely, it applies the strategy named fine-tuning without freezing transferred layers described in [3].

Data
Two datasets were used to test the impact of model transferability: Florida and New Hampshire. The Florida dataset covers October-December 2016, while the New Hampshire data are from June-September 2017. The Florida dataset is over four times larger than the corresponding New Hampshire dataset (that is, it contains data from four times as many continuous count stations), but apart from the location and size, the structure of both datasets is the same. Each dataset is organized spatially by traffic message channel (TMC) segments and temporally at the hour level. Table 1 summarizes the ground truth values and input variables used for model training and development, which are further described below.

Ground Truth Data.
Hourly traffic volumes from continuous count stations were used as the ground truth (expected output) for the neural network volume estimation models. 173 traffic sensors were used in Florida and 42 in New Hampshire; these stations were located on all types of major roads (motorways, highways, and major and minor arterials). To obtain counts on the TMC network (i.e., the road network used for analysis), each unique count station and traffic direction was mapped to a corresponding TMC segment via GIS analysis.

Input Features.
Vehicle Probe Counts. The hourly aggregated vehicle probe volumes were obtained from the raw global positioning system (GPS) data provided by a probe data vendor. The raw waypoints were initially snapped to the XD road segments by the provider. To remain consistent with other data sources, we used a bridge between the XD and TMC segment definitions to match the waypoints with the TMC segments and then aggregated the data at an hourly level. Additionally, each vehicle was associated with one of three weight classes (class 1: below 14,000 lbs, class 2: between 14,000 and 26,000 lbs, and class 3: above 26,000 lbs), and the probe volumes for each weight class were provided separately to the model. The median of all penetration rates (based on a comparison of the vehicle probe counts with the corresponding continuous count stations) was 2.19% in Florida and 2.3% in New Hampshire.

Vehicle Probe Speeds. The average hourly speed estimates based on GPS data were obtained from RITIS (the Regional Integrated Transportation Information System) [47]. RITIS was created and is maintained by the Center for Advanced Transportation Technology at the University of Maryland, College Park and provides visual analytics and data query capabilities for industry-sourced probe data.

Weather Data. Weather features were extracted from all permanent weather stations using data archived by the Iowa Environmental Mesonet [48] and assigned to each TMC segment based on spatial proximity. Initial tests suggested that the most important weather features are precipitation, temperature, visibility, and humidity.

Road Characteristics. Infrastructural characteristics were extracted from both the National Performance Management Research Data Set (NPMRDS) TMC shapefile and the OpenStreetMap (OSM) road network. As the OSM maps use a different network topology, the OSM road characteristics (road classification and number of lanes) first had to be conflated to the TMC map, which was done using an automated conflation algorithm developed for this purpose. The final road characteristic features used for each TMC segment include information regarding road classification, number of lanes, segment length, and reference speed, as well as historical annual average daily traffic values associated with each TMC segment (obtained from the NPMRDS TMC shapefile).

Temporal Data. Information concerning the hour of the day and the day of the week (working day/Saturday/Sunday) was also included for each data point in order to account for temporal traffic patterns.

Other. Hourly volume profiles were derived by applying the widely used profiling method [49]. This method transforms AADT estimates derived from the Highway Performance Monitoring System [50] into hourly volume profiles based on historic speeds available in RITIS.
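To make the probe-count preparation concrete, the following is a minimal sketch of how map-matched GPS waypoints could be aggregated into hourly probe counts per TMC segment and weight class. The column names (tmc, timestamp, vehicle_id, weight_class) and the use of pandas are illustrative assumptions; the paper does not specify the vendor's schema or tooling.

```python
# Hedged sketch: aggregate map-matched GPS waypoints into hourly probe counts
# per TMC segment and weight class. Column names are illustrative assumptions,
# and "timestamp" is assumed to already be a datetime column.
import pandas as pd

def hourly_probe_counts(waypoints: pd.DataFrame) -> pd.DataFrame:
    """Count distinct probe vehicles per TMC segment, hour, and weight class."""
    df = waypoints.copy()
    df["hour"] = df["timestamp"].dt.floor("h")          # truncate to the hour
    counts = (
        df.groupby(["tmc", "hour", "weight_class"])["vehicle_id"]
          .nunique()                                     # one count per distinct vehicle
          .unstack("weight_class", fill_value=0)         # one column per weight class
          .add_prefix("probe_count_class_")
          .reset_index()
    )
    return counts
```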

Model
In previous research [4], it was shown that a fully connected (dense) neural network with three hidden layers yields the best volume estimation model performance. e structure of this network is presented in Figure 1 and is used for all subsequent experiments.

Training Procedure.
The mean absolute error (MAE) was selected as the loss function (i.e., the function that is minimized by the learning algorithm). Although a less popular choice than the mean squared error (MSE), which tends to place greater emphasis on higher volumes, MAE was selected because it provides a good trade-off between the MAPE and R² performance metrics, which were used to assess the tested models and approaches. Additionally, MAE was used in previous research; therefore, selecting MAE makes it easier to compare the results. The model was trained with the Adam algorithm [51] proposed by Diederik P. Kingma and Jimmy Ba in 2014. Among its many advantages, such as quick convergence, computational efficiency, and intuitive hyperparameters, the Adam algorithm is robust in terms of hyperparameter settings and usually does not require much hyperparameter tuning, a feature that was particularly important due to the number of experiments required. Overall, we trained 546 models, which made it impossible to tune the hyperparameters manually for each model separately. Thus, during initial tests, we discovered that the default hyperparameters (α = 0.01, β1 = 0.9, β2 = 0.999) work reasonably well and achieve strong model performance. Additionally, experiments showed that tuning the hyperparameters around the default values did not significantly change the results, only the speed of convergence. Furthermore, due to the implementation of Dropout [52] after each hidden layer, the models turned out to be resistant to overfitting. Based on these initial findings, we decided to use the default hyperparameters for all training procedures and to train the networks longer than required (i.e., to avoid tuning at the expense of some efficiency). Sample loss plots for the models with the smallest and largest datasets are presented in Figure 2. Both the training and validation losses do not significantly decrease in the final few epochs, demonstrating that the models were trained long enough and that further training would not have improved the accuracies. On the other hand, the validation losses do not increase with time, which shows that the models do not overfit. The smallest datasets contained only 1 week of New Hampshire data, while the largest one contained all 3 months of both New Hampshire and Florida data. The charts come from the initial (tuning) experiments, where the New Hampshire data were split into training and validation datasets.
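The following is a minimal sketch of the network and training setup described above and in Appendix A.1 (three dense hidden layers of 256 ELU units, Dropout after each hidden layer, MAE loss, Adam with the hyperparameter values quoted in the text). Keras, the 0.2 dropout rate, the input width of 79 features, and the epoch and batch settings are assumptions for illustration, not details taken from the paper.

```python
# Hedged sketch of the volume estimation network: 3 dense hidden layers
# (256 ELU units each), Dropout after each hidden layer, MAE loss, Adam.
# Dropout rate, epochs, and batch size below are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

def build_volume_model(n_features: int = 79) -> tf.keras.Model:
    model = models.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(256, activation="elu"),
        layers.Dropout(0.2),
        layers.Dense(256, activation="elu"),
        layers.Dropout(0.2),
        layers.Dense(256, activation="elu"),
        layers.Dropout(0.2),
        layers.Dense(1),                      # hourly volume estimate
    ])
    model.compile(
        optimizer=optimizers.Adam(learning_rate=0.01, beta_1=0.9, beta_2=0.999),
        loss="mean_absolute_error",           # MAE, as selected in the paper
    )
    return model

# model = build_volume_model()
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=200, batch_size=256)       # settings illustrative only
```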

Partitioning Training and Validation Data.
Initial tests were first performed to verify the structure of the network and tune the network hyperparameters. These tests were performed for each experiment with a fixed split of the New Hampshire data into training and validation parts. To avoid data leakage, different continuous count stations were used for the training and validation sets. Additionally, the split was made taking into consideration the functional road class (each FRC was represented in the same proportions in the training and validation sets). Next, after determining all the hyperparameters, the full cross-validation procedure was employed. Each model was trained using data for 41 New Hampshire continuous count stations and tested on the 42nd station. This procedure was repeated 42 times to ensure that all NH data were included in the test dataset. While time consuming, this approach allowed us to avoid data leakage and take full advantage of the given datasets.
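A minimal sketch of the leave-one-station-out cross-validation described above is given below, using scikit-learn's LeaveOneGroupOut with the CCS identifier as the group key. The array-based inputs and the build_model callable are assumptions for illustration; the paper does not describe its implementation.

```python
# Hedged sketch of leave-one-station-out cross-validation: train on all-but-one
# CCS and test on the held-out station, repeated once per station.
# X and y are assumed to be NumPy arrays; station_ids holds the CCS id per row.
from sklearn.model_selection import LeaveOneGroupOut

def cross_validate_by_station(X, y, station_ids, build_model, epochs=50):
    """Return (test_indices, predictions) for each held-out count station."""
    logo = LeaveOneGroupOut()
    fold_predictions = []
    for train_idx, test_idx in logo.split(X, y, groups=station_ids):
        model = build_model()                              # fresh model per fold
        model.fit(X[train_idx], y[train_idx], epochs=epochs, verbose=0)
        y_hat = model.predict(X[test_idx]).ravel()
        fold_predictions.append((test_idx, y_hat))
    return fold_predictions
```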

Evaluating Model Accuracy.
During each iteration of cross-validation, model performance is quantified at the test location via the following error metrics: R², MAPE (mean absolute percentage error), and EMFR (error to maximum flow ratio); this process is repeated during each experiment. These metrics were also used in previous research [4,5], which renders further comparisons easier. The R², MAPE, and EMFR metrics are presented in the following equations, where y_i denotes an actual volume, ŷ_i is a volume estimate, ȳ stands for the sample average, y_max represents the maximum observed traffic volume, and n is the number of data points used to compute the metric:

R² = 1 − ∑_{i=1}^{n} (y_i − ŷ_i)² / ∑_{i=1}^{n} (y_i − ȳ)²

MAPE = (100% / n) ∑_{i=1}^{n} |y_i − ŷ_i| / y_i

EMFR = (100% / n) ∑_{i=1}^{n} |y_i − ŷ_i| / y_max

R² represents the fraction by which the variance of the errors is less than the variance of the dependent variable; in other words, it shows the percentage of variance explained. MAPE expresses accuracy as the relation between the absolute error and the real observed value. MAPE is widely used for traffic volume estimations, mainly due to its ease of interpretation. However, MAPE has a few flaws. Namely, in the case of small volumes, the MAPE can be very high even if the absolute error is relatively small. Because of this, MAPE is strongly affected by the time periods when traffic volumes are small, whereas for planning and operational purposes, the time periods with high traffic volumes are usually much more important. EMFR is used to deal with this problem. EMFR is defined as the relation between the absolute error and the maximum observed value. Thus, if the absolute error is much smaller than the maximum observed traffic, the value of EMFR is also small, regardless of the current traffic. On the other hand, if the observed traffic is close to the maximum, the values of EMFR are close to the values of MAPE. EMFR also has a practical meaning: in the above-mentioned VTM project, this metric was used to define the model quality thresholds (less than 10% EMFR was considered "satisfactory," while less than 5% EMFR was considered "very good").
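For reference, the three metrics can be computed directly from their definitions above; a minimal NumPy sketch follows (the function names and array-based interface are illustrative assumptions).

```python
# Hedged sketch of the three evaluation metrics as defined above.
# y_true and y_pred are NumPy arrays of observed and estimated hourly volumes.
import numpy as np

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mape(y_true, y_pred):
    # percentage error relative to each observed volume
    return 100.0 * np.mean(np.abs(y_true - y_pred) / y_true)

def emfr(y_true, y_pred):
    # percentage error relative to the maximum observed volume
    return 100.0 * np.mean(np.abs(y_true - y_pred) / y_true.max())
```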

Experiment 1: Model Comparison.
The goal of the first group of experiments was to explore the possibility of using the Florida dataset for New Hampshire predictions. During the experiments, four different approaches were employed and compared. The outline of the experiments is shown in Figure 3.
The "base" model was trained with the New Hampshire data only; no Florida data were used. The detailed results are presented in Table 2 in the Base columns. The mean and median R²s for this approach were 0.72 and 0.82, the mean and median MAPEs were 43.3% and 26.9%, and the mean and median EMFRs were 8.12% and 7.04%, respectively. The typical (i.e., median) model performance is good and is comparable with the results obtained for the state of Florida [5] in the previous stage of the project (median R²: 0.83, median MAPE: 25%). It is also much better than the currently used profiling method (median R²: 0.60, median MAPE: 55.6%). However, the problem with the "base" model is the outliers, which can be seen in Figure 4, which shows the R² metrics with and without outliers, and in Table 2. High standard deviations, a large difference between the mean and median R²s and MAPEs, and unacceptable results for the worst continuous count station (R²: −2.74 and MAPE: 326%) emphasize the problem.
Base stands for the model trained on New Hampshire data only, Florida is the model trained on Florida data and tested with NH data, TL stands for transfer learning (the model was trained on the Florida dataset and then fine-tuned with the NH dataset), and Extended stands for the extended dataset (both the Florida and New Hampshire datasets were included in the training set).
Time series charts (a, b, and c) present the real (blue) and estimated (red) traffic volumes. For an ideal model, these two lines should overlap. Hexbin charts (d, e, and f) present the relations between the real and the estimated traffic volumes. The color of the hexagons corresponds to the number of points within each hexagon. As long as the estimates are close to the real values, the points are located on the diagonal line.
Additionally, we selected three locations with EMFR equal to 5%, 10%, and 25% and plotted both time series and hexbin charts for each location (Figure 5). The main purpose of this figure is to facilitate the interpretation of the presented metrics. The second approach involved training the model with only Florida data and then testing on the New Hampshire dataset. The detailed results are shown in Table 2 in the Fl columns. Although the results are still better than for the profiling method, they are the worst of all the tested models in terms of both mean and median R²s, MAPEs, and EMFRs.
This suggests that it is difficult to simply transfer a model from one area (state) and time period, use it in another state and time period, and maintain a high level of estimation accuracy. The third approach was based on transfer learning and the fine-tuning procedure. It was similar to the second one, but this time the model trained with Florida data was fine-tuned with New Hampshire data. The results for this scenario are presented in both Figure 4 and Table 2 (column TL). During the training process, consecutively one, two, or three (all) hidden layers of the model were unfrozen, i.e., the weights of these layers were changed in the fine-tuning procedure. Because the model did not overfit, the best results were obtained with all three layers unfrozen. The results presented in the paper are from the network with all the layers unfrozen. The transfer learning-based models perform better than the "base" New Hampshire model for all the metrics except the median MAPEs, which are equal for both approaches. The transfer learning approach also results in a significant reduction in outliers, which may be noticed in both Figure 4 and Table 2.
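A minimal sketch of this fine-tuning step is given below: a model pretrained on the Florida data is loaded and training is continued on the New Hampshire data, with some hidden layers optionally frozen. Keras, the file path, the choice of which hidden layers to freeze, and the training settings are assumptions for illustration; the paper reports only that the best results were obtained with all three hidden layers left trainable.

```python
# Hedged sketch of fine-tuning a Florida-pretrained model on New Hampshire data.
# Assumes a Keras model like the earlier sketch; path and settings illustrative.
import tensorflow as tf

def fine_tune(pretrained_path, X_nh, y_nh, unfreeze_hidden_layers=3, epochs=50):
    model = tf.keras.models.load_model(pretrained_path)   # Florida-trained model
    dense_layers = [l for l in model.layers
                    if isinstance(l, tf.keras.layers.Dense)]
    hidden = dense_layers[:-1]                             # exclude output layer
    for layer in hidden:
        layer.trainable = True
    # Freeze the earliest hidden layers if fewer than all are to be fine-tuned
    # (the paper's best configuration keeps all three trainable).
    n_frozen = max(len(hidden) - unfreeze_hidden_layers, 0)
    for layer in hidden[:n_frozen]:
        layer.trainable = False
    model.compile(optimizer="adam", loss="mean_absolute_error")  # recompile after changes
    model.fit(X_nh, y_nh, epochs=epochs, verbose=0)
    return model
```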
Finally, we added the Florida dataset to the training data. For the cross-validation procedure, each training set for this scenario contained all the Florida data and data for 41 of the 42 New Hampshire continuous count stations. The detailed results are shown in Table 2 in the Extended columns and in Figure 4. The results indicate that not only is the typical model performance superior to the other approaches (in terms of mean and median R²s and MAPEs, this approach is the best for three out of four metrics) but it also deals best with outliers.

Experiment 2: Impact of Dataset Size.
The goal of the second set of experiments was to check whether it is possible to use both the transfer learning and the extended dataset approaches, as explained in the previous subsection, with a dataset that covers a relatively short time period. These experiments have important practical implications, as the cost of the data depends on the size of the dataset. For these experiments, we compared three approaches: the "base" approach, based on the model trained on New Hampshire data only; the transfer learning approach, based on the model pretrained on the Florida dataset and fine-tuned on the New Hampshire dataset; and the "extended" approach, based on using the merged Florida and New Hampshire datasets for training.
This time, instead of using all the New Hampshire data in the training procedure, we reduced the time scope of the data: we trained the models using the NH dataset reduced to twelve, eleven, ten, and so on down to one week and (for the extended approach) the entire dataset from Florida. During the training, we repeated the full cross-validation procedure, and each of the models was tested on the full three-month New Hampshire dataset that corresponded to the tested continuous count station. The detailed results are presented in Table 3, while a comparison of the models' behavior is illustrated in Figure 6.
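A minimal sketch of the data-reduction step just described is shown below: the New Hampshire training data are restricted to their first k weeks before the cross-validation is repeated. The column name and pandas usage are illustrative assumptions.

```python
# Hedged sketch: restrict the New Hampshire training data to its first k weeks,
# as in Experiment 2. "timestamp" is an assumed datetime column.
import pandas as pd

def first_k_weeks(nh_train: pd.DataFrame, k: int) -> pd.DataFrame:
    start = nh_train["timestamp"].min()
    end = start + pd.Timedelta(weeks=k)
    return nh_train[nh_train["timestamp"] < end]

# for k in range(12, 0, -1):            # twelve weeks down to one week
#     reduced = first_k_weeks(nh_train, k)
#     # ...repeat the full cross-validation with `reduced`
#     # (plus the Florida data for the "extended" approach)
```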
The size of the ribbons in Figure 6 is proportional to the standard deviation (0.2 × std); due to the large standard deviation values for the "base" model, plotting the entire standard deviation would make the figures difficult to read. The accuracy of the "base" model depends primarily on the size of the New Hampshire training dataset: all the metrics show that the performance of the model diminishes as the size of the dataset is reduced. On the other hand, the accuracy of the "transfer learning" and "extended" models does not significantly change with the size of the New Hampshire training dataset. This shows that by leveraging a larger dataset collected elsewhere, it is possible to achieve good results with as little as one week of data.
Moreover, the results for both the "transfer learning" and the "extended" approaches with 1 week of New Hampshire training data are better than the results of the "base" model trained on three months of data, emphasizing the practical usefulness of the proposed solutions. The "extended" approach turned out to be slightly more accurate than the one based on transfer learning. However, the differences are not that significant, and the transfer learning-based approach has two advantages. First, due to the smaller dataset, the training process is much faster for the transfer learning-based approach than for the "extended" one. Additionally, it is possible to employ transfer learning even without direct access to a larger dataset; for transfer learning, only the previously trained model is necessary. This is important because of data licensing. Some vendors offer data for "limited usage" only, which means that the data purchased by a larger state cannot be directly employed for training the model in other states. Such a situation makes the "extended" approach impossible, while using the pretrained model and the transfer learning approach is still feasible.

Discussion.
The first set of experiments suggests that utilizing information from a previous dataset can improve traffic volume estimates in a target location, even if the locations and time periods differ. However, it is very difficult to maintain a model's accuracy when it is applied directly to a location and time period different from those it was trained on. Directly applying a model trained in one location (e.g., Florida) to a new target location (e.g., New Hampshire) yields reasonable "typical" performance, implying that, in general, the model captures the relation between input variables and traffic volumes, but it suffers from severe outliers. Instead, a more promising approach appears to be fine-tuning the model originally trained in one location using data from the target location, an approach that incorporates patterns from the target location and thus helps to avoid many outliers. Finally, if available, incorporating all data for training purposes (e.g., adding all Florida data when training a model to estimate New Hampshire volumes) yields the best results.
Given the finding from the first set of experiments that a previous dataset can be used to improve estimates in a new location, the second set of experiments seeks to understand how much data are needed at the target location to achieve acceptable accuracy. Focusing, in particular, on the "transfer learning" with fine-tuning and "extended" approaches, it finds that given a sufficiently large alternate dataset to leverage (i.e., Florida), both approaches can achieve reasonable results with as little as one week of data in the target (i.e., New Hampshire) location.
Overall, it appears that the machine learning-based method of volume estimation requires large datasets to learn the complex spatial and temporal patterns between the input variables and traffic volumes. Given financial limitations and the fact that some states and jurisdictions are small and have limited count stations on various road classes, it can be challenging to collect sufficient data to train a reliable estimation model. However, the approaches highlighted in this paper show that input features and corresponding traffic counts can be leveraged from other locations and used in conjunction with small amounts of data from the target location to develop better overall models. The results indicate that, if possible, the best approach is the "extended" one, which uses all previous data in the training process and does the best job of eliminating outliers. However, in cases where this is not possible (perhaps due to the licensing of previous data), the "transfer learning" with fine-tuning approach can be used. This approach only requires access to a previously trained model (not the raw data), whose weights are subsequently updated while training on data from the target location.

Conclusion
This paper explores the transferability of volume estimation machine learning models, seeking to understand the extent to which a model trained to predict traffic volumes in one location can be used directly, or modified slightly, to do so elsewhere. The implications of this research question are significant, as it is expensive to acquire GPS trajectory data, a key model input, and time consuming to preprocess the data sources. If existing large datasets can be utilized for training purposes, smaller, less expensive datasets can be purchased in new locations and used to build accurate models, potentially saving transportation agencies time and money. The experimental results suggest that a key component of model performance is having sufficient data for training. If access to a larger, existing dataset is available, the optimal approach is to train the volume estimation model on all available data, using both data collected in the target geography and the previously collected data. However, even if the raw dataset from a previous location is not available (perhaps due to licensing restrictions), an existing pretrained model can be fine-tuned using a small amount of data from the target location. Interestingly, the larger dataset used to improve performance appears to be useful even if it comes from a different place and time period, and even when the target dataset is as small as one week.
Note that the results provided in this study are limited to the hourly volume estimation model's transferability at the state level. The states whose data were available to test the proposed approach encompass both urban and rural areas with different land-use characteristics. Additionally, all roads used for the analyses are FRC 1 or 2 due to data accessibility limitations. Therefore, overgeneralizing the results to other geographical and road levels should be avoided. However, given the availability of the required data for different locations, it is possible to test the proposed approach in other areas to investigate the extent to which it can be generalized.
A key future direction for research includes investigating how much data need to be collected from continuous count stations in order to constitute a satisfactorily large dataset for transfer learning, and whether there are certain temporal or spatial characteristics that are necessary for the data to be transferable. Additionally, it would be beneficial to understand how often the models should be fine-tuned or retrained in order to optimize performance. Based on the promising results presented in this paper, these research questions will be explored in more detail in future modeling efforts.

A. Model Selection
This appendix describes the experiments performed to verify whether the solution selected in [4] (a fully connected artificial neural network) is still the most suitable for the presented task. Similar to Section 5, two experiments were carried out.
The first experiment was carried out on the entire dataset. Its primary purpose was to analyze alternative models and compare them with artificial neural networks. This experiment corresponds to the experiment described in Section 5.1, Experiment 1: Model Comparison. The second experiment was carried out on reduced datasets. It was conducted to check whether the conclusions based on the results from the first experiment are valid for much smaller datasets. This experiment corresponds to the experiment described in Section 5.2, Experiment 2: Impact of Dataset Size.
A.1. Analyzed Models. In the first step of the presented analysis, the authors analyzed different models and selected some of them for further tests. The following models were selected for further analysis:

Fully connected neural network (dense): The model used in [4]. This model is a neural network built of three fully connected (dense) hidden layers with 256 neurons in each layer. The hidden layer neurons use the ELU activation function.

Random forest (RF): In our experiments, this model performs only slightly worse than the dense neural network. RF models tend to overfit, so we applied regularization; namely, we limited the minimum number of samples per leaf (values 5, 10, 15, …, 95, and 100 were tested). The model with the smallest median EMFR was selected. This model was employed for both experiments.
Linear regression (LR): A basic linear regression model was used as a baseline for our experiments.

Bayesian ridge regression (BR): An approach to linear regression that tries to solve the problem of poorly distributed data by using probability distributions rather than point estimates.
Polynomial regression (PR): A variation of linear regression that uses polynomial features. Technically, the model is linear, but additional features are provided to produce the polynomial output. For example, if the original input features were X1 and X2, and we are interested in a second-degree polynomial, the following set of features should be provided as the model input: X1, X2, X1·X2, X1², and X2² (a short sketch of this feature expansion follows below).

The following models were not considered in this comparison:

Support vector machines (SVM): SVM is a very powerful solution, but it does not scale well to large datasets. Our training dataset contained over 830,000 samples; thus, it was impossible to train an SVM model with a radial basis function (RBF) kernel, which ensures the best performance of this approach. In fact, it turned out that with the scikit-learn library, it was also impossible to train an SVM model using polynomial or linear kernels, which are expected to generate much worse results than an RBF kernel. Therefore, we did not include this class of models in our analysis.

Long short-term memory neural networks (LSTM): LSTM networks seem to be the most suitable architecture for the presented task, as they can generate estimates using also the previously observed features as inputs while not excessively increasing the complexity of the networks. During previous (unpublished) research, we used LSTM networks and achieved slightly better results than with dense neural networks. However, there are two main reasons that prevented us from using these models in the presented research. First, depending on the size of the training dataset, it takes 6-48 hours to train such a model. Overall, we trained more than 500 models; thus, it would not be possible to repeat the entire procedure with LSTM models in a reasonable timespan. Second, although LSTM models are slightly better than dense neural networks, we also discovered that for the given type of data they are very susceptible to small changes in the hyperparameters. The gist of our research was to discover how the models behave under different data scenarios. Had we used an LSTM model and obtained worse results for some scenario, we could not be sure whether the problem lay with the scenario or with the model hyperparameters. One could argue that we could fine-tune the hyperparameters for each scenario, but given the aforementioned training time, it would take far too long.

Both the train and test datasets contained the features described in Section 3.2.
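The sketch below illustrates the polynomial feature expansion used for the PR baseline, assuming scikit-learn's PolynomialFeatures. With the 79 original inputs, the degree-2 expansion yields 3,240 columns (including the bias term), which matches the feature count quoted in the results below; the toy data and variable names are illustrative.

```python
# Hedged sketch of degree-2 polynomial feature expansion, assuming scikit-learn.
# With 79 inputs, the expansion produces 3,240 features (bias term included).
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.rand(10, 79)                   # 79 input features, toy data
poly = PolynomialFeatures(degree=2)          # adds x_i, x_i*x_j, and x_i^2 terms
X_poly = poly.fit_transform(X)
print(X_poly.shape[1])                       # 3240
```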
In the second experiment, only the New Hampshire dataset (25 continuous count stations, 175,404 data points) was used. To perform this experiment, the authors followed the procedure described in Section 5.2, Experiment 2: Impact of Dataset Size; namely, the size of the dataset was gradually reduced from twelve weeks to one week. Each time, the entire cross-validation procedure was carried out to determine the models' accuracy.

A.3. Results.
The final results of the first experiment are presented in Table 4. For the random forest, the authors trained models for different minimum-samples-per-leaf values and chose the best model (the model with the smallest median EMFR value).
A fully connected (dense) neural network turned out to be the best approach, although the results generated by the random forest model were very similar. This is consistent with the results presented in [4]. The results generated by the polynomial regression model are surprising: the medians of the metrics are as expected, although the means are very poor. Overall, the model behaved well, but it completely misestimated the traffic for one CCS in one direction. The authors believe that this behavior is an overfitting problem caused by fitting the model with too many features. To generate the second-degree polynomial model presented in the table, the authors used 3,240 features instead of the 79 used in all other models. Due to the observed overfitting, the authors did not test higher-degree polynomial models (for example, a third-degree polynomial model would require 88,560 features, and a fourth-degree polynomial model would require 1,837,620 features). An overview of the second experiment's results is presented in Figure 7. Additionally, Table 5 contains the detailed results for the smallest (one-week-long) dataset. First of all, both linear and polynomial regression could not deal with outliers. Similar to polynomial regression in the first experiment, they completely misestimated the traffic for one CCS in one direction, which resulted in terrible mean results. The median results presented in Figure 7 show that these models cannot handle datasets smaller than 9 weeks of data. Bayesian ridge regression results were stable and did not deteriorate heavily as the training dataset was reduced. Moreover, Bayesian ridge regression managed to deal with outliers, regardless of the size of the dataset. However, these results were significantly worse than the ANN-based results. Random forest turned out to be the best of the alternative models. For a two-week-long training set, RF results were similar to the results obtained with the "base" artificial neural network, and for a one-week-long dataset, they were even better. However, regardless of the size of the dataset, RF results were worse than the results of both approaches that leveraged the Florida dataset, namely, the extended and transfer learning-based models. The first three rows of the table present results obtained with dense neural networks. Base stands for the model trained on New Hampshire data only, TL stands for transfer learning (the model was trained on the Florida dataset and then fine-tuned with the NH dataset), and Extended stands for the extended dataset (both the Florida and New Hampshire datasets were included in the training set).

Data Availability
The data used to support the findings of this study are not available due to third-party rights.

Disclosure
An earlier version of this manuscript was presented at the 99th Annual Meeting of the Transportation Research Board.

Conflicts of Interest
The authors have no conflicts of interest to declare.