Exploring Transit Use during COVID-19 Based on XGB and SHAP Using Smart Card Data



Introduction
The global coronavirus (COVID-19) pandemic has profoundly impacted all areas of people's lives around the world. Unlike conventional viruses, the spread of COVID-19 has been difficult to contain, and it is expected to change our society permanently rather than temporarily. The first COVID-19 case in Seoul, Korea was reported on January 20, 2020. A year and three months later, in April 2021, 106,898 confirmed cases and 1,756 deaths had been reported in Seoul. To decelerate the spread of COVID-19, the government of Seoul implemented protective measures such as interpersonal distancing. The interpersonal distance policy consists of levels from 1 to 2.5, tied to the severity of the spread of COVID-19. Currently, the government of Seoul has announced level 2.5, which is the strictest lockdown measure.
This policy significantly changed the lifestyle and travel behavior of local residents in Seoul. The protective measures implemented by the government included closing all facilities, for example, restaurants and gyms, at 10 P.M. and prohibiting gatherings of more than four people. Furthermore, public transport reduced fleet operations by 30% after 9 P.M. According to statistical analysis using smart card data in Seoul, the number of transit trips in 2020 decreased by about 27% compared with the previous year. However, descriptive statistics from smart card data do not provide information on how or why people change their travel behavior, such as giving up transit use. Due to the specificity of the current pandemic situation, little is known about the changes in transit user travel behavior. It is important to understand the reasons for changing travel behavior and the potential impact on transit use during the COVID-19 pandemic.
Early studies addressed the impacts of the COVID-19 pandemic on travel behavior, mode choice, and other activities in different countries worldwide [1][2][3][4][5][6][7][8][9]. Many studies have focused on travel pattern changes considering work and shopping behaviors. Due to the COVID-19 pandemic, the proportions of telecommuting and online shopping have increased, resulting in a decrease in overall transit and auto trips [10]. To understand this phenomenon, survey and mobile application data were used. However, little work has been published about the change in travel behavior during the interpersonal distance policy period with large-scale data such as smart card data. Many researchers have sought to estimate users' mode choice behavior using smart card data due to its quality and quantity. For example, Kim et al. [11] proposed an express train choice model based on smart card data, and the results of the model showed notable performance in exploring user choice behavior. Lee et al. [12] identified user preferences for urban transit modes with smart card data. Similarly, Jánošíková et al. [13] developed a transit route choice model based on the multinomial logit model (MNL) using smart card data. These previous studies implied that smart card data is very useful for analyzing mode choice behavior since it accurately provides all transit trip information.
As various data on transit systems are being collected, some studies have explored transit user travel behavior based on a data-driven approach and machine learning techniques [14]. A choice model based on machine learning techniques has the advantage of higher accuracy compared to conventional choice models such as the MNL model [15]. One of the major drawbacks of machine learning techniques is the difficulty in interpreting the impact of the inputs on the outputs. However, it has become possible to accurately estimate and analyze various individual travel behaviors with the advent of interpretable machine learning (IML) techniques. For example, Lee et al. [14] used the IML approach to analyze the choice between local and express trains and to interpret user preferences. Similarly, Wang and Ross [15] developed an IML-based transit mode choice model and compared it to the MNL model. These studies reported that IML provided a more accurate estimation and a better understanding of user preference than conventional models.
To shed light on these matters, which are essential for analysis and transit policy, this study explores transit users' travel behavior, specifically whether or not transit use is given up, during the COVID-19 pandemic. Two days of smart card data, one before and one during the COVID-19 pandemic, are used to estimate the trips that gave up transit use during the COVID-19 pandemic. The choice set of the dataset includes two alternatives: given-up transit use due to the COVID-19 pandemic and transit use. The extreme gradient boosting (XGB) model is used to estimate the transit users' travel behavior. Shapley additive explanations (SHAP) are used to explain the estimation results of the transit use choice model. Feature importance and relationships between features are investigated with a SHAP summary plot and dependence plot, respectively. In addition, the O-D pairs with a high potential for given-up transit use are identified to support policy implementation.

Description of Smart Card Data.
The government of Seoul has operated an integrated automatic fare collection (AFC) system since 2004. The transit fare from the origin to the destination station is charged based on the total distance traveled by transit modes, that is, bus, subway, or both modes (transfer between bus and subway). With a smart card, users can transfer between any combination of transit modes free of charge up to four times. The transit network in Seoul operates entirely on the smart card system, without other payment methods such as cash or tickets, and the smart card data in Seoul provides 99% of transit users' trip information. Thus, it is widely used for microscopic user behavior analysis [16][17][18].
One of the biggest advantages of the smart card system in Seoul is that users must tap their smart card both when they board and when they alight from a transit mode. If users do not tap in or out, a double fee is charged on the next trip as a fine. Thus, the smart card data in Seoul is considered complete and reliable data that records complete transit user information. However, behind these advantages lies a privacy disadvantage. If someone knows when and where an individual has used a transit mode, even roughly, that individual's trip information can be tracked in the smart card data. Thus, the government of Seoul implemented a privacy protection policy for smart card data in 2020. The identifier of the individual user was deleted to prevent identification of the user's trip sequence and chain. Also, travel time information is recorded in 5-minute units, and locations are encrypted with codes that are not identifiable by the general public.
Through this privacy protection policy, smart card data has been improved so that it records transit users' information while protecting personal information.
Although the AFC system provides high-quality trip information, limitations of smart card data remain. For example, smart card data typically underestimate ridership owing to possible fare evaders [19,20]. There can also be anomalies in smart card data due to software problems with the AFC system. These limitations are common to transit systems around the world. The smart card data used in Seoul also faces this problem, and the government of Seoul estimates anomalies in smart card data to be about 1%. Thus, this study assumed that the smart card data in Seoul contained 99% of transit trips in Seoul without anomalies.
The smart card data from November 14, 2019 and December 10, 2020 were used to analyze the impact of the COVID-19 pandemic on transit mode choice. The data of November 14, 2019 represent the period before the COVID-19 pandemic, and the data of December 10, 2020 represent the period during the pandemic. According to the smart card data, the numbers of trips before and during the COVID-19 pandemic were 8,196,311 and 4,780,953, respectively. These figures indicate that the spread of COVID-19 in Seoul decreased the number of transit trips by about 42% per day. The information for each trip is classified into 16 categories in the smart card data. The description of the smart card data and the transit network in Seoul are shown in Table 1 and Figure 1, respectively.
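The daily reduction follows directly from the two trip counts. A minimal check of the arithmetic, using only the figures reported above (the variable names are illustrative):

```python
# Trip counts reported for the two study days.
trips_before = 8_196_311   # November 14, 2019
trips_during = 4_780_953   # December 10, 2020

# Absolute and relative drop in daily transit trips.
lost_trips = trips_before - trips_during
reduction_share = lost_trips / trips_before

print(f"trips lost per day: {lost_trips:,}")
print(f"relative reduction: {reduction_share:.1%}")
```

The absolute difference, 3,415,358 trips, is the quantity that the preprocessing step later treats as given-up trips.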

Data Preprocessing.
The smart card data of November 14, 2019 and December 10, 2020 were used to estimate the impact of the COVID-19 pandemic on given-up transit use.
There are two choice alternatives for transit use: given-up and transit use. However, it is necessary to add
The average travel time and the number of transfers were the same in both years, at 30 minutes and 0.24 transfers, respectively. Overall, the difference in travel behavior between November 15, 2018 and November 14, 2019 was less than 0.9%. Thus, the data preprocessing was performed under the assumption that the travel patterns in 2019 and 2020 would have been the same in the absence of COVID-19. Since the travel time information of the 2020 smart card data is recorded in 5-minute units due to the privacy policy, the travel time information of the 2019 smart card data was also converted from seconds to 5-minute units.
The data preprocessing was performed in two stages. The first stage selected the O-D pairs containing given-up trips in 2020. The second stage filtered the given-up trips from the 2019 data and filled them into the 2020 data. Each trip of 2020 was compared to the trips of 2019 based on departure time, arrival time, mode, number of transfers, and travel time. Departure time, arrival time, and travel time were aggregated to hourly units for this comparison. As a result of the comparison, only trips that existed in the 2019 data alone were selected as given-up trips. Figure 2 shows an example of the data preprocessing performed in this study. First, the number of trips of each O-D pair by travel mode was calculated in stage 1. Five O-D pairs were selected as the O-D pairs whose trips were reduced during the COVID-19 pandemic. For the O-D pair from station 1 to station 2, the number of subway trips decreased from 100 to 70. In stage 2, the 100 trips from the 2019 data and the 70 trips from the 2020 data were compared based on departure time, arrival time, mode, number of transfers, and travel time. Among the trips that existed only in the 2019 data, 30 trips were selected as given-up trips. The mode code of a filled trip was set to 0, which denotes a trip for which transit use was given up in 2020.
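The two stages can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Trip` record, the mode codes 1 to 3, the function name `fill_given_up`, and the rule of taking unmatched 2019 trips in input order until the O-D deficit is filled are all hypothetical. Only the comparison attributes, the hourly aggregation, and the mode code 0 for filled trips come from the text.

```python
from collections import Counter
from typing import List, NamedTuple


class Trip(NamedTuple):
    origin: int
    destination: int
    mode: int            # hypothetical codes: 1 = subway, 2 = bus, 3 = both
    transfers: int
    departure_hour: int  # times aggregated to the hour, as in the text
    arrival_hour: int
    travel_hours: int


GIVEN_UP = 0  # mode code assigned to filled (given-up) trips


def fill_given_up(trips_2019: List[Trip], trips_2020: List[Trip]) -> List[Trip]:
    """Return the 2020 trips plus the 2019 trips treated as given up."""
    # Stage 1: count trips per O-D pair and mode; keep pairs that lost trips.
    od_2019 = Counter((t.origin, t.destination, t.mode) for t in trips_2019)
    od_2020 = Counter((t.origin, t.destination, t.mode) for t in trips_2020)
    deficit = {k: n - od_2020[k] for k, n in od_2019.items() if od_2020[k] < n}

    # Stage 2: match trips on all comparison attributes. A 2019 trip with no
    # 2020 counterpart is filled in as given up, up to the O-D deficit.
    remaining = Counter(trips_2020)
    filled = list(trips_2020)
    for trip in trips_2019:
        key = (trip.origin, trip.destination, trip.mode)
        if deficit.get(key, 0) == 0:
            continue
        if remaining[trip] > 0:
            remaining[trip] -= 1  # matched by an identical 2020 trip
        else:
            filled.append(trip._replace(mode=GIVEN_UP))
            deficit[key] -= 1
    return filled


# The station 1 -> station 2 example: 100 subway trips in 2019, 70 in 2020.
morning = Trip(1, 2, 1, 0, 8, 8, 0)
evening = morning._replace(departure_hour=18, arrival_hour=18)
sample = fill_given_up([morning] * 70 + [evening] * 30, [morning] * 70)
given_up = [t for t in sample if t.mode == GIVEN_UP]
print(len(sample), len(given_up))
```

In this toy case the 70 matched trips stay as transit use and the 30 unmatched 2019 trips are appended with mode code 0, so the sample keeps the 2019 total of 100 trips.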
The number of given-up trips was estimated to be 3,415,358. By adding the trips for which transit use was given up to the 2020 data, 8,196,311 trips were obtained as the sample. With the preprocessed data, seven features were calculated for each trip to explore changes in transit use choice. The number of O-D trips and the difference between the number of O-D trips before and during COVID-19 were 63.6 and 44.7 trips, respectively, on average. The number of transfers was 0.35 on average. The travel time and fare were 27.5 minutes and 1,111 KRW, respectively, on average. The average departure and arrival times were 13:46 and 14:14, respectively. A description and descriptive statistics of the preprocessed data are shown in Table 2.
Journal of Advanced Transportation

Methodology
3.1. Extreme Gradient Boosting.
Extreme gradient boosting (XGB), proposed by Tianqi Chen and Carlos Guestrin, is an ensemble machine learning algorithm used for classification or regression predictive modeling problems [21]. XGB is regarded as the most efficient decision tree-based algorithm for data analysis competitions due to its speed and scalability [22]. XGB constructs a sequence of low-depth decision trees, and each tree is trained to correct the errors of the previous trees. Also, XGB provides parallel tree boosting to solve large-scale problems in a fast and accurate way. The dataset with 8,196,311 samples includes independent variables $x_i$ and dependent variables $y_i$ (0 for a given-up transit use trip and 1 for a transit use trip), $i = 1, \ldots, 8{,}196{,}311$. Each $x_i$ has $m$ features, so $x_i \in \mathbb{R}^m$ (1: number of O-D trips, 2: difference between the number of trips before and during COVID-19, 3: number of transfers, 4: travel time, 5: fare, 6: arrival time, 7: departure time).
These features have corresponding dependent variables, that is, transit use or given-up ($x_i \in \mathbb{R}^m$, $y_i \in \mathbb{R}$). The tree ensemble model estimates the target value $\hat{y}_i$ as the sum of $K$ tree functions $f_k$ (equation (1)), where each $f_k$ is an independent tree structure with leaf scores and $F$ represents the space of trees. The objective of the model is to minimize $L(\phi)$, defined with the loss function $l$ (equation (2)).
Here, $\Omega$ is the term that penalizes the complexity of the model (equation (3)). In equation (3), $w_i$ is the score of leaf $i$ and $T$ is the number of leaves. By solving equations (1)-(3), the optimal weight $w_i^*$ and the corresponding value $\tilde{L}^{(t)}(q)$ of the objective are obtained. It is generally difficult to enumerate all possible tree structures $q$. Thus, a greedy algorithm, which starts from a single leaf and iteratively adds branches, is used to estimate the optimal solution. The greedy algorithm evaluates split candidates: with $I = I_L \cup I_R$, $I_L$ is the instance set of the left node after the split and $I_R$ is the instance set of the right node after the split.
An additional advantage of XGB is that it is not affected by multicollinearity. Thus, several variables can be kept, even if these variables capture the same phenomenon in the same system. This is even desirable since feature analysis using SHAP is conducted in this study.
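For orientation, the formulation described in this subsection can be written out as in the original XGBoost paper [21]. The display equations below are taken from that standard formulation, not recovered from this article, so the article's own notation and equation numbering may differ. The regularization weights $\gamma$ and $\lambda$, the gradient statistics $g_i$ and $h_i$, and the leaf instance sets $I_j$ are symbols from [21]; the leaf index is written $j$ here to keep it distinct from the sample index $i$.

```latex
% Tree ensemble prediction and regularized objective
\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in F

L(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k)

\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2}

% Optimal leaf weight and objective value for a fixed tree structure q,
% with g_i and h_i the first and second derivatives of l with respect to
% the previous prediction, and I_j the instance set of leaf j
w_j^{*} = - \frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}

\tilde{L}^{(t)}(q) = - \frac{1}{2} \sum_{j=1}^{T}
  \frac{\left( \sum_{i \in I_j} g_i \right)^{2}}{\sum_{i \in I_j} h_i + \lambda}
  + \gamma T

% Loss reduction used by the greedy algorithm to score a split I = I_L \cup I_R
L_{\mathrm{split}} = \frac{1}{2} \left[
  \frac{\left( \sum_{i \in I_L} g_i \right)^{2}}{\sum_{i \in I_L} h_i + \lambda}
  + \frac{\left( \sum_{i \in I_R} g_i \right)^{2}}{\sum_{i \in I_R} h_i + \lambda}
  - \frac{\left( \sum_{i \in I} g_i \right)^{2}}{\sum_{i \in I} h_i + \lambda}
\right] - \gamma
```

A split is kept when $L_{\mathrm{split}}$ is positive, that is, when the gain from separating $I_L$ and $I_R$ outweighs the penalty $\gamma$ for the extra leaf.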

Hyperparameter Tuning for XGB.
There are several hyperparameters related to the XGB model. Hyperparameter tuning of XGB is necessary to avoid overfitting and excessive model complexity. A grid search based on cross-validation was performed to set six hyperparameters, namely the number of iterations, learning rate, subsample, colsample_bytree, alpha, and lambda. The learning rate scales the weights of each tree, and it changes the impact of each tree to make the model robust.
Two hyperparameters are related to preventing overfitting. The first is the subsample, which is the fraction of training instances randomly sampled for each tree. The other is the colsample_bytree parameter, which is the fraction of columns sampled when constructing each tree. The alpha parameter is the L1 regularization term on weights, and lambda is the L2 regularization term on weights. As a result of the grid search based on cross-validation, the hyperparameters of XGB in this study were set to 622 for the number of iterations, 0.3 for the learning rate, 0.9 for subsample, 0.9 for colsample_bytree, 0.4 for alpha, and 0.3 for lambda.
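The selected values can be written as a parameter dictionary, and a grid search enumerates candidate combinations of the same keys. The sketch below uses the parameter names of the XGBoost scikit-learn wrapper (`n_estimators`, `learning_rate`, `subsample`, `colsample_bytree`, `reg_alpha`, `reg_lambda`) as an assumed mapping of the names in the text. The candidate grid is hypothetical, since the searched ranges are not reported.

```python
from itertools import product

# Values selected by the grid search, as reported in the text.
selected = {
    "n_estimators": 622,      # number of boosting iterations
    "learning_rate": 0.3,
    "subsample": 0.9,         # fraction of rows sampled per tree
    "colsample_bytree": 0.9,  # fraction of columns sampled per tree
    "reg_alpha": 0.4,         # L1 regularization on weights ("alpha")
    "reg_lambda": 0.3,        # L2 regularization on weights ("lambda")
}

# Hypothetical candidate grid; the article does not report the ranges.
grid = {
    "learning_rate": [0.1, 0.3],
    "subsample": [0.8, 0.9],
    "colsample_bytree": [0.8, 0.9],
    "reg_alpha": [0.0, 0.4],
    "reg_lambda": [0.3, 1.0],
}


def candidates(param_grid):
    """Yield every combination of the grid as a parameter dictionary."""
    keys = list(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        yield dict(zip(keys, values))


all_candidates = list(candidates(grid))
print(len(all_candidates), "candidate settings")
```

In a full pipeline, each candidate would be scored by cross-validation and the best one kept. Under the naming assumption above, a dictionary such as `selected` could then be passed to a classifier as keyword arguments.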

Performance Measures for XGB.
Three performance measures, namely specificity, sensitivity, and balanced accuracy, were selected to evaluate the model performance.
These measures are well-known composite classification metrics for evaluating a multiclass classification model.
Specificity is the share of true-negatives among the true-negatives and false-positives. Sensitivity is the share of true-positives among the true-positives and false-negatives. Balanced accuracy is the average of sensitivity and specificity. Balanced accuracy is well suited to classification problems in which the difference between the numbers of negative and positive samples is large. In this study, true-positive and false-positive mean that the model estimated a transit user as transit use (correct) and as given-up (incorrect), respectively. True-negative and false-negative mean that the model estimated a given-up user as given-up (correct) and as transit use (incorrect), respectively. The mathematical expressions of the three performance measures are shown in the following equations, where TP is true-positive, FP is false-positive, TN is true-negative, and FN is false-negative:

\[ \text{specificity} = \frac{TN}{TN + FP}, \tag{9} \]

\[ \text{sensitivity} = \frac{TP}{TP + FN}, \tag{10} \]

\[ \text{balanced accuracy} = \frac{\text{specificity} + \text{sensitivity}}{2}. \tag{11} \]

Shapley Additive Explanations for Model Interpretation.
SHAP was used to interpret the results of the transit use choice model proposed in this study. The objective of SHAP is to interpret the contribution of each feature to the output [23,24]. The Shapley values are estimated based on cooperative game theory. The feature values of each sample act as players in a coalition. The Shapley value distributes the payoff fairly among all features when each feature might have contributed more or less than the others. The algorithm repeatedly evaluates the impact of a feature on each output, and the result is computed as the Shapley value. With the Shapley value, it is possible to interpret the contribution of each feature [25]. To develop an interpretable model, SHAP uses an additive feature attribution method, that is, the output model is defined as a linear addition of input features. Assuming a model with input features $x = (x_1, x_2, \ldots, x_i, \ldots, x_m)$, where $m$ is the number of input features, equation (12) shows the linear function $g$ defined by the additive feature attribution, and equation (13) shows the mathematical expression of the SHAP value.
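The two expressions referred to as equations (12) and (13) are not reproduced above. In the standard SHAP formulation, which this paragraph appears to follow, they take the form below. The simplified binary inputs $z'$, the base value $\phi_0$, the full feature set $N$, and the notation $f_x(S)$ for the model output when only the features in $S$ are known are taken from that formulation and are assumptions with respect to this article.

```latex
% Additive feature attribution: explanation model g over simplified inputs z'
g(z') = \phi_0 + \sum_{i=1}^{m} \phi_i z'_i, \qquad z' \in \{0, 1\}^{m}

% Shapley value of feature i: weighted average of its marginal contribution
% over all subsets S of the remaining features
\phi_i = \sum_{S \subseteq N \setminus \{i\}}
  \frac{|S|! \, (m - |S| - 1)!}{m!}
  \left[ f_x(S \cup \{i\}) - f_x(S) \right]
```

With all $z'_i = 1$, the attributions sum to the difference between the model output for the sample and the base value $\phi_0$, which is what allows each $\phi_i$ to be read as the contribution of feature $i$.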
4. Application
With these datasets, a naïve XGB model was first developed to select meaningful features for interpreting transit users' behavior. The feature selection process consisted of three steps. First, the features were ranked by importance and frequency scores computed from the naïve XGB model. Then, the importance and frequency scores were clustered by the k-means clustering method. Finally, the features were selected based on the significance of the cluster at 99%. As a result of the feature selection analysis, seven features included in four clusters were selected as significant to the model. Specifically, the number of O-D trips was estimated to be the most significant feature, with the highest importance and frequency scores of 0.68 and 0.28, respectively.
The difference between the number of O-D trips, number of transfers, travel time, arrival time, and departure time features were found to have a significant impact on the output. However, the six sociodemographic-related features, for example, population, density, number of households and companies, land-use, and average land price, were estimated to have little impact on the output, with both importance and frequency scores less than 0.5. The results of the feature selection analysis are shown in Figure 3.
The proposed model was designed as a binary problem to classify given-up or transit use trips. However, the three measures were computed separately by transit mode to examine model performance in detail. For subway users, specificity, sensitivity, and balanced accuracy were estimated to be 0.902, 0.903, and 0.903, respectively. The number of true-positives was 179,611 and the number of true-negatives was 310,191. The number of false-positives was 33,610 and the number of false-negatives was 19,209. For bus users, specificity, sensitivity, and balanced accuracy were estimated to be 0.907, 0.987, and 0.947, respectively. The number of true-positives was 193,715 and the number of true-negatives was 230,070. The number of false-positives was 23,501 and the number of false-negatives was 2,484. For users of both modes (subway + bus), specificity, sensitivity, and balanced accuracy were estimated to be 0.932, 0.980, and 0.956, respectively. The number of true-positives was 114,677 and the number of true-negatives was 111,895. The number of false-positives was 8,185 and the number of false-negatives was 2,299.

Performance of the
Overall, specificity, sensitivity, and balanced accuracy were 0.909, 0.953, and 0.931, respectively.
These results indicate that the proposed model showed notable performance, with a balanced accuracy of about 93.1%. Moreover, the proposed model was found to be suitable for exploring the impact of COVID-19 on transit mode choice. The performance of the proposed model is shown in Table 3.
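The reported measures can be reproduced from the confusion counts given in the text. The short check below recomputes them per mode and overall, taking the overall counts as the sums over the three modes; the function and variable names are illustrative.

```python
def specificity(tn, fp):
    """Share of true-negatives among true-negatives and false-positives."""
    return tn / (tn + fp)


def sensitivity(tp, fn):
    """Share of true-positives among true-positives and false-negatives."""
    return tp / (tp + fn)


def balanced_accuracy(tp, tn, fp, fn):
    """Average of specificity and sensitivity."""
    return (specificity(tn, fp) + sensitivity(tp, fn)) / 2


# Confusion counts reported in the text, keyed by transit mode.
counts = {
    "subway": dict(tp=179_611, tn=310_191, fp=33_610, fn=19_209),
    "bus": dict(tp=193_715, tn=230_070, fp=23_501, fn=2_484),
    "both": dict(tp=114_677, tn=111_895, fp=8_185, fn=2_299),
}

# Overall counts are the sums over the three modes.
overall = {k: sum(c[k] for c in counts.values()) for k in ("tp", "tn", "fp", "fn")}
counts["overall"] = overall

for name, c in counts.items():
    print(
        name,
        round(specificity(c["tn"], c["fp"]), 3),
        round(sensitivity(c["tp"], c["fn"]), 3),
        round(balanced_accuracy(**c), 3),
    )
```

The summed counts give 1,229,447 evaluated trips, and the rounded values agree with the per-mode and overall figures quoted above.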
To compare the results of XGB with a parametric model, the transit use choice model was also developed with the MNL model, the method most widely used for modeling choice behavior [14,15]. The parameters were estimated with 85% of the dataset, and validation was performed with the remaining 15%. Because of multicollinearity between variables, only three variables, namely the number of O-D trips, the number of transfers, and arrival time, were used to develop the MNL model. The result of the MNL model is shown in Table 4. In the estimated MNL model, the constants of given-up, subway, bus, and both modes were estimated to be 1.095, 0.434, and 0.673, respectively. These results indicated that many people preferred to use transit even during the COVID-19 pandemic. The parameters of the number of O-D trips, the number of transfers, and arrival time were estimated to be 0.0019, −0.1518, and −0.0299, respectively.
These parameters indicated that people preferred to give up transit use as the number of trips, the number of transfers, and arrival time increased. The F1 score of MNL was estimated to be about 0.706, which is relatively low compared with that of XGB, 0.931. Specifically, the F1 score of MNL tends to be lower in the order of given-up, subway, bus, and both modes. The MNL model could not estimate users' transit use preferences accurately, with a low F1 score of 0.706. This result implied that the MNL model was suitable only for simple problems due to the low flexibility of its data distribution assumptions. Also, MNL had a limitation in not being able to interpret the relationship between features. In contrast, the XGB model had high flexibility without distribution assumptions and the ability to interpret the relationship between features. Thus, the proposed XGB model accurately estimated the transit use behavior, with a high F1 score of 0.931.
Among the features related to transit service, namely the number of transfers, travel time, and fare, the number of transfers was found to have the largest impact on transit use behavior. The Shapley value of the number of transfers indicated that the probability of transit use increased as the number of transfers decreased. These results indicated that the users who used both modes gave up transit use since contact with other people could increase as the number of transfers increased. In particular, contact with people was related to concerns about COVID-19 infection during the pandemic [27]. In the case of the travel time feature, the Shapley value showed that the probability of transit use increased as travel time decreased. This result indicated that users avoided long transit times due to concerns about COVID-19 infection [27]. The Shapley value for the fare feature showed that users were less likely to use transit as the fare decreased. This result is explained by the fact that users who use the transit service at a discounted rate, that is, the elderly, disabled, and students, tended to give up transit use during COVID-19. Users who use the transit service at a discounted rate tended to be more health conscious than general users [28]. Thus, elderly, disabled, and student users tended to give up transit use more than general users during the COVID-19 pandemic. The Shapley values for arrival time and departure time were found to have the least impact on the transit mode choice. These results indicated that the user's departure or arrival time did not significantly affect transit use.

Feature Analysis of the Transit Mode Choice. Shapley values of seven features of the XGB model are illustrated in
Overall, transit mode preferences were analyzed using eight features, and the impact of each feature was explained. In particular, the presence of COVID-19 had the greatest impact on users that gave up transit use. The impacts of the number of transfers, travel time, and fare on transit use were also derived from the results and were consistent with common sense.

Feature Dependency Analysis of Transit Mode Choice.
Travel time was selected as the feature for interpreting the impact on transit use since it is one of the most important features for analyzing user behavior [1]. Travel time was also the most suitable feature for comparing Shapley values across transit modes, since the other variables, that is, number of transfers and fare, could vary depending on the mode. The results of the feature dependency analysis between travel time and the transit use choices (given-up and transit use) are shown in Figure 5.
In Figure 5(a), subway travel time was selected as a feature to interpret its impact on users who gave up or used the subway during the COVID-19 pandemic. The red points within Circle (1) represent users that gave up transit during the COVID-19 pandemic, and the blue points within Circle (2) represent transit users. Circle (1) shows that the Shapley values decreased as subway travel time increased. The red points within Circle (1) show the relationship between subway travel time and the Shapley value of subway travel time during the COVID-19 pandemic.
The trend of the impact of travel time on subway users is illustrated by a red line. During the COVID-19 pandemic, users tended to give up the subway trip as the travel time increased. The sensitivity of travel time for subway use was estimated to be the highest among the transit modes (subway, bus, and both modes).
The difference between the sensitivities of given-up users and transit users was estimated to be the lowest among the modes.
In Figure 5(b), bus travel time was selected as a feature to interpret its impact on users who gave up or used the bus during the COVID-19 pandemic. Circle (3) illustrates the relationship between the travel time of given-up users and the Shapley value of travel time. Circle (4) illustrates the relationship between the travel time of transit users and the Shapley value of travel time. The Shapley value of bus travel time decreased as travel time increased. The trend of the impact of travel time on bus users is illustrated by a red line. During the COVID-19 pandemic, users also tended to give up bus trips as travel time increased. The sensitivity of travel time for bus use was estimated to be the second-highest among the transit modes (subway, bus, and both modes).
In Figure 5(c), the travel time of both modes (bus + subway) was selected as a feature to interpret its impact on users who gave up or used both modes during the COVID-19 pandemic. Circle (5) illustrates the relationship between the travel time of given-up users and the Shapley value of travel time. Circle (6) illustrates the relationship between the travel time of transit users and the Shapley value of travel time. Circles (5) and (6) illustrate that the Shapley values decreased as the travel time of both modes increased.
The trend of the impact of travel time on users of both modes is illustrated by a red line. This result indicated that users did not prefer both modes during the COVID-19 pandemic. However, the sensitivity of travel time for the use of both modes was estimated to be very low compared to that of subway and bus.
The overall impact of travel time on transit users is shown in Figure 5(d). Circle (7) illustrates the relationship between the travel time of given-up users and the Shapley value of travel time. Circle (8) illustrates the relationship between the travel time of transit users and the Shapley value of travel time. The result of the dependency analysis with the travel time feature indicated that the probability of transit use decreased as travel time increased. In particular, these tendencies in travel time were more evident for the users that gave up transit use. Travel time sensitivity was estimated to be high in the order of subway, bus, and both modes. The sensitivity of given-up users was estimated, using linear regression, to be 1.34, 1.69, and 1.88 times that of transit users for subway, bus, and both modes, respectively. The difference between the sensitivity of given-up and transit users was thus high in the order of both modes, bus, and subway. The slopes of the trend lines of bus and both modes decreased sharply. These results reflect the behavior of users avoiding long travel times in a transit mode due to concerns about infection during the COVID-19 pandemic [5]. These results also implied that users more easily give up the use of a bus or both modes than subway use as travel time increases.
4.5. Discussion.
Many countries around the world have implemented policies for the public transportation system as the demand for transit decreased during the COVID-19 pandemic. Specifically, the transit demand in Seoul has been reduced by about 30% during the COVID-19 pandemic. Thus, the government of Seoul considered shortening the hours of service and reducing the dispatching of transit services. In view of these practical issues in Seoul, the O-D pairs with a high potential for given-up transit use were explored using the proposed XGB model. The demand during COVID-19 was estimated, and the given-up ratio was calculated for each administrative unit, such as the Dong unit. Here, the given-up ratio means the reduction ratio of the estimated number of O-D trips during the COVID-19 pandemic compared to the O-D trips in 2019.
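Read literally, this definition gives the given-up ratio of an O-D pair as the share of 2019 trips that is no longer estimated to be made. A minimal sketch of that reading follows; the function name and the guard against empty O-D pairs are illustrative assumptions.

```python
def given_up_ratio(trips_2019: int, estimated_trips_2020: float) -> float:
    """Share of 2019 O-D trips estimated to be given up during COVID-19."""
    if trips_2019 <= 0:
        raise ValueError("trips_2019 must be positive")
    return (trips_2019 - estimated_trips_2020) / trips_2019


# Station 1 -> station 2 example from the preprocessing description:
# 100 subway trips in 2019 and 70 during the pandemic.
ratio = given_up_ratio(100, 70)
print(f"given-up ratio: {ratio:.2f}")
```

Under this reading, a ratio near 1.0 marks an O-D pair where almost all 2019 trips disappeared, and a ratio near 0.0 marks a pair whose demand was largely retained.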
Figure 6 shows the O-D pairs with a high potential for given-up transit use. The results showed that the Jongro and Gangnam areas had the highest potential for given-up transit use. Specifically, the number of O-D trips between Jongro and Gangnam decreased significantly, with a given-up ratio of 0.7∼1.0. However, the number of trips from suburban areas to Jongro or Gangnam did not decrease much, with a given-up ratio of 0.0∼0.2. These results implied that transit use was mostly given up in the O-D pairs connecting central areas, for example, Jongro and Gangnam, but users still used transit from residential areas to the central areas. From this implication, it could be inferred that users maintain single-purpose trips, such as work trips, but do not make additional business or leisure trips. In terms of transit operation, it is reasonable to reduce the hours of service and dispatches of transit services in central areas. Specifically, transit policies such as reducing the hours of service and dispatches could be applied to feeder buses in Gangnam and Jongro to improve operational efficiency. Conversely, the main routes connecting the central areas and the suburban areas, for example, subway and trunk bus routes, need not be reduced, since the number of trips on them did not decrease as much as that of inner trips in the central area.
Overall, users tended to give up using transit services when they traveled within the central areas.
Thus, it is reasonable to implement transit policies targeting feeder bus routes in central areas to improve operational efficiency.

Conclusion
This study aimed to understand the impact of COVID-19 on transit use. The analysis used two days of smart card data, that is, one day before and one day during the COVID-19 pandemic. With data preprocessing, two alternatives, that is, given-up transit use during the COVID-19 pandemic and transit use, were considered in the choice set. The XGB model was trained to estimate transit preference. Feature analysis based on SHAP was performed to interpret the estimation results from the proposed model. XGB was trained on 6,966,864 of 8,196,311 trips from smart card data and tested on the remaining 1,229,447 trips. The specificity, sensitivity, and balanced accuracy of the proposed model were 0.909, 0.953, and 0.931, respectively. The proposed model was found to be suitable for exploring the impact of COVID-19 on transit use. Feature analysis was performed to explore the impacts of the features on transit use with Shapley values. The number of O-D trips feature was found to substantially influence users who gave up transit. Feature dependency analysis was also performed, and the impacts of travel time in the model were identified and interpreted by transit mode. The dependency analysis showed that users gave up transit use as travel time increased.
These tendencies in travel time were more evident during the COVID-19 pandemic.
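The reported figures can be cross-checked with a short sketch. It assumes the standard definitions of sensitivity, specificity, and balanced accuracy (the mean of the two rates); the helper names are illustrative, and only the trip counts and metric values come from the text.

```python
def rates_from_confusion(tp, fn, tn, fp):
    """Standard definitions: sensitivity = TP/(TP+FN), specificity = TN/(TN+FP)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

def balanced_accuracy(sensitivity, specificity):
    """Balanced accuracy is the arithmetic mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2

# Trip counts reported in the text.
train_trips, test_trips, total_trips = 6_966_864, 1_229_447, 8_196_311
train_share = train_trips / total_trips

# Metric values reported in the text.
reported = balanced_accuracy(sensitivity=0.953, specificity=0.909)

print(train_trips + test_trips == total_trips)  # the split covers all trips
print(round(train_share, 2))                    # share of trips used for training
print(round(reported, 3))                       # mean of the two reported rates
```

The training and test counts sum to the total number of trips, the training share is about 85%, and the mean of the reported sensitivity and specificity reproduces the reported balanced accuracy of 0.931.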
The remarkable performance of XGB supported its ability to estimate the impact of COVID-19 on transit use. The hyperparameters obtained by cross-validation maintained steadily low learning error rates during model training. The model also produced robust results in estimating transit use. Feature analysis with SHAP provided insights into the proposed model. The Shapley values estimated feature importance and the direction of the impacts. They also identified the nonlinear joint impacts of the features in the proposed model.
There were several interesting findings, such as the impact of the COVID-19 pandemic on transit use, that could not be identified by other machine learning techniques. The findings of this study could potentially be helpful and provide implications for policymakers both in mitigating the spread of the disease and in establishing appropriate policy that considers travel behavior during the pandemic. With the proposed XGB model, O-D pairs with a high potential for given-up transit use were identified to support policy implementation. As a result, transit use was found to be mostly given up in the O-D pairs connecting central areas, for example, Jongro and Gangnam.
This result implied that it is desirable to implement transit policies targeting feeder bus routes in central areas to improve operational efficiency.
Although the proposed model achieved notable performance in estimating transit use while considering users who gave up transit during the COVID-19 pandemic, it would be desirable to consider other external attributes or variables, for example, land-use and sociodemographic features. Incorporating such additional features would provide a variety of perspectives regarding the impact of the COVID-19 pandemic.

Journal of Advanced Transportation

alternatives such as given-up transit use to take into account the 30% change representing reduced transit trips due to the emergence of the COVID-19 pandemic. Since smart card data only contains the revealed trip information, there is no information about given-up trips due to COVID-19. To obtain information regarding given-up trips, data preprocessing was performed to combine two smart card data sets from before and during the COVID-19 pandemic. Before preprocessing the data to obtain the given-up trips for 2020, the data for 2018 and 2019 were compared to identify the yearly change in travel patterns. The results of the comparison showed that the numbers of trips on November 15, 2018 and November 14, 2019 were 8,268,438 and 8,196,311, respectively.
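The size of the pre-pandemic yearly change can be worked out directly from the two trip counts above. The sketch below is only an arithmetic illustration; the helper name `relative_change` is an assumption and not part of the original preprocessing.

```python
def relative_change(old, new):
    """Relative change from old to new, as a fraction of old."""
    return (new - old) / old

trips_2018 = 8_268_438  # November 15, 2018
trips_2019 = 8_196_311  # November 14, 2019

change = relative_change(trips_2018, trips_2019)
print(f"{change:.2%}")  # prints -0.87%
```

A year-on-year difference of under 1% before the pandemic is small relative to the reduction in transit trips attributed to COVID-19.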
, 1: number of O-D trips, 2: difference between number of trips before and during COVID-19, 3: number of transfers, 4: travel time, 5: fare, 6: arrival time, 7: departure time) and the explanation model g(z′) with simplified input z′. For transit use subset S ⊆ N (where N stands for the set of all samples), two models are trained to estimate the effect of feature i. The first model v(S ∪ {i}) is trained with feature i, while the other model v(S) is trained without feature i, where S ∪ {i} and S are the values of input transit use features. The difference in model outputs v(S ∪ {i}) − v(S) is estimated for each possible subset S ⊆ N \ {i}, equation (
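The marginal-contribution idea described above can be sketched as an exact Shapley computation over all subsets. This is a minimal illustration under stated assumptions: the classical Shapley weighting |S|!(n − |S| − 1)!/n! is used, and a hypothetical toy value function `toy_value` stands in for the retrained models v(S); the feature names and numbers are invented for the example and are not results from the study.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values: weighted average of v(S ∪ {i}) − v(S) over S ⊆ N \\ {i}."""
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(n):  # subset sizes 0 .. n-1
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi

def toy_value(subset):
    """Hypothetical v(S): additive effects plus one joint (interaction) effect."""
    base = {"travel_time": 0.3, "fare": 0.1, "transfers": 0.2}
    v = sum(base[f] for f in subset)
    if "travel_time" in subset and "transfers" in subset:
        v += 0.2  # joint effect shared equally between the two features
    return v

phi = shapley_values(["travel_time", "fare", "transfers"], toy_value)
for name, contribution in phi.items():
    print(name, round(contribution, 3))
```

The contributions sum to v(N) − v(∅), and the interaction term is split evenly between the two interacting features. Enumerating every subset is exponential in the number of features, which is why practical SHAP implementations rely on approximations or tree-specific algorithms.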

Figure 4: Results of the Shapley values of nine features.

Figure 5: Results of feature dependency analysis. (a) Impact of travel time on subway users during the COVID-19 pandemic. (b) Impact of travel time on bus users during the COVID-19 pandemic. (c) Impact of travel time on users of both modes during the COVID-19 pandemic. (d) Overall impact of travel time on transit users during the COVID-19 pandemic.

Table 1: Description of the smart card data.
1: subway, 2: bus, 3: both modes (bus + subway)
3 | Line ID | Unique ID for each line
4 | Vehicle ID | Unique ID for each bus vehicle (not for subway)
5 | Boarding station ID | Unique ID for each station (max five stations are recorded for a trip)
6 | Alighting station | Unique ID for each station (max five stations are recorded for a trip)
7 | Boarding date/time | Year/month/day/hour/minute/seconds
8 | Alighting date/time | Year/month/day/hour/minute/

Table 2: Description and descriptive statistics of the preprocessed data.

Table 3: Performance of proposed XGB model.

Table 4: Results of multinomial logit model.