Mode Choice Model for Public Transport with Categorized Latent Variables

This paper presents a mode choice model for public transport that integrates a structural equation model (SEM) and a discrete choice model (DCM) with categorized latent variables. Apart from identifying the important latent variables that affect mode choice for public transport, the objective of this study was to develop an improved disaggregate model that better explains the travel behavior of decision-makers in choosing public transport. After extensive observation, selected latent variable sets, each consisting of latent variable components, were chosen together with explicit variables to formulate the utility functions. Data collected in Chengdu, China, were used to calibrate and validate the model. Results showed that the impact of fare on mode choice for public transport increased in the SEM-DCM integrated model compared with the traditional logit model. The goodness of fit of the integrated model with latent variable sets is 0.201 higher than that of the traditional logit model, which indicates that latent variables have a clear impact on mode choice behavior and that the SEM-DCM integrated model has higher accuracy and stronger explanatory ability. The results are especially helpful for public transport operators seeking a higher mode share by improving the service quality of public transport, in terms of providing more convenience and a better service environment for public transport users.


Introduction
Mode choice behavior for public transport is a key element in public transport planning, as it has a direct impact on the design of the urban transport system structure and is also the basis for urban public transport planning and management policymaking. Discrete choice models (DCMs) are a typical method of research on consumer choice behavior, originally applied in economics. Because they pay to use public transport, public transport users are in effect consumers, and their mode choice behavior is therefore a consumer choice behavior. DCMs explain individual choice behavior as the consequence of an individual's preferences, under the assumption that the consumer chooses the most preferred option. Under certain assumptions, consumer preferences can be represented by a utility function in which the choice process is treated as a sort of "black box." Its inputs are the attributes of the available options and the individual's characteristics, and its output is the utility value of each alternative. Koppelman and Pas [1] carried out a preliminary study of DCMs to analyze travel characteristics of mode choice and travel purposes. Ben-Akiva et al. [2] borrowed the concept of the utility function from economics and applied it to transport mode choice behavior in developing DCMs. Later, Koppelman and Wen [3] selected the variables of frequency, travel cost, in-vehicle time, and out-of-vehicle time in mode choices among air, train, car, and bus. Hensher and Reyes [4] and Schwanen [5] considered spatial variation in building DCMs of mode choice behavior. Mitra and Buliung [6] applied Multinomial Logit (MNL) models to analyze the differences in mode choice to school between children aged 11 and 14-15 years in Toronto, Canada; the results showed that distance is the biggest barrier to walking to school for younger children, while boys are more willing to walk than girls. Liu et al. [7] investigated Swedish mode choice behavior under different climate conditions. Palma and Rochat [8] analyzed mode choice for commuters in Geneva using a Nested Logit model and found that car ownership is the key factor that influences mode choice behavior. These DCMs considered only objective, measurable attributes of the alternatives and socioeconomic characteristics of the individuals as explanatory variables.
Based on consumer choice behavior theory, consumers' preferences are influenced by their satisfaction with the available transport modes and by the importance of those modes to them. Individual perceptions of and attitudes to different transport modes differ because of differences in socioeconomic status and in the intrinsic characteristics of the transport modes, which cannot be directly observed or measured. In other words, transport mode choice behavior is influenced not only by measurable factors (travel time, fare, gender, age, occupation, income, etc.) but also by unmeasurable factors (service quality, convenience, safety, etc.). Therefore, latent variables should be added to the model to capture the influence of subjective factors on transport mode choice. The initial research on latent variables can be traced back to Spearman [9], who conducted factor analysis for human intelligence tests. Latent variables have been widely applied in many fields, such as the social sciences [10, 11], psychology [12, 13], and market and economics research [14, 15].
Unmeasured variables, factors, unobserved variables, constructs, and true scores are among the terms that researchers in psychology and the social sciences have used to refer to latent variables [13]. In transport modeling, latent variables include service reliability, environment perception, and potential preferences for transport modes, which bring about new approaches to modeling transport mode behavior. Extended frameworks that incorporate latent variables into DCMs were developed by Train et al. [16] and Ben-Akiva and Boccara [17]. Traditional mode choice models have been enriched by introducing latent variables in modeling transport mode choice behavior [14, 18-20]. Morikawa et al. [21] and Morikawa et al. [20] included modal comfort and convenience in their analyses of transport mode choice. In their studies, latent variables were measured and modeled through attitudes (attitudinal indicator variables) towards the chosen transport mode and its alternatives. Golob [22] used a series of models to explain how mode choice and attitudes regarding high-occupancy vehicle (HOV) lanes that are free of charge in San Diego differed across the population; attitudes towards the fairness of the high-occupancy toll (HOT) lanes to carpoolers exhibited the greatest number of significant explanatory variables. Mitra et al. [23] and Lee et al. [24] analyzed the relationship between potential safety measures and the severity of road accidents in transport planning. Gopinath [25], Morikawa et al. [21], Walker and Ben-Akiva [26], Johansson et al. [27], Yáñez et al. [28], and Politis et al. [29] added the influence of travelers' attitudes and perceptions to their models and integrated it into demand forecasting models. Shiftan et al. [30] explored the causal relationship between latent variables (time sensitivity, determined travel plan, aspiration to use a transport vehicle, etc.) and their corresponding measured variables and categorized the transport market based on these latent variables. Deutsch et al. [31] analyzed the relationship between the travel behavior of residents and their perception of the surrounding environment, using a structural equation model based on 719 sample data points from a survey in Santa Barbara, California, USA. Nurlaela and Curtis [32] established a downtown mode choice model of a kind that had not been developed and analyzed empirically before.
To capture the impact of latent variables on the decision process, a hybrid choice model (HCM) has been developed over the last few decades [33], using psychometric data to explicitly model attitudes and perceptions and their influences on mode choices [28, 34, 35]. Kim et al. [36] expanded the HCM by including social influence as well as attitudes in choice probabilities to investigate the effects of these factors on the intention to purchase electric cars. Among the numerous HCMs that explicitly model latent psychological factors such as attitudes and perceptions (latent variables), the Integrated Choice and Latent Variable (ICLV) model is easy to understand and exhibits better predictive power [37-42]. However, serious problems arise when arbitrary values are used for normalization and when data variability is low, especially regarding the generation of the latent variables. Hence, Raveau et al. [43] suggested normalizing the variances associated with the hybrid discrete choice model's structural equations instead of the parameters of its measurement equations.
With the high rate of economic development and fast urbanization in China in the last few decades, more passengers have been observed to be willing to pay for transport modes with higher service quality, more convenience, and higher personal security. This paper aims to extend Bhat and Dubey's [42] model formulation to develop an ICLV model for public transport that incorporates five latent variable sets associated with convenience, personal safety, modal comfort, service environment, and waiting feelings, to which public transit users were more sensitive according to data collected in Chengdu, China. The five fundamental latent variable sets represent a combination of underlying affective value norms/beliefs, lifestyle orientations, and personality traits rather than a more generic single "willingness to walk/cycle/drive" attitude as the latent construct [44]. In other words, five latent variable sets were employed to explain public transport mode choice behavior and to achieve a more precise interpretation of the individual factors affecting it. In summary, this paper improves the model formulations proposed by Bhat and Dubey [42] and Yáñez et al. [28] and compares three models that incorporate different numbers of explicit variables with latent variable sets. A new approach to estimating goodness of fit when comparing the ICLV model with a model without latent variable sets, which has not received the attention it deserves, was applied as well. The latent variables were identified based on data collected in Chengdu, China, a typical large city; therefore, the results should be beneficial to public transit policymaking and operation, especially for cities in China.
The rest of the paper is organized as follows. Section 2 presents the model formulation and estimation. Section 3 presents the data and sample characteristics. The estimation results of the models are presented and discussed in Section 4. Section 5 concludes the paper by summarizing the key findings and providing directions for further research.

Table 1 (excerpt): Measured variables for the latent variable sets.
Convenience
  3. Information on bus stops along the route is clear.
Personal safety
  4. A safety hammer, fire extinguisher, and so forth in the vehicle make passengers feel safe.
  5. Guidance on how to respond to an emergency situation has been well publicized.
  6. Overall satisfaction with feeling safe in a bus.
Modal comfort
  7. Bus seats feel comfortable.
  8. Entertainment facilities, such as TV and radio broadcasts, are pleasant.
  9. Bus stop information broadcast in the bus is clear and correct.
Service environment
  Newspapers or other information provided at the bus stop is interesting.

Assumptions
(i) Travelers are rational in making mode choices, as they will choose the travel scheme with the highest utility value.
(ii) Options of mode choice are categorized into public transport and nonpublic transport modes.
(iii) The error term of each utility function independently follows a Gumbel distribution with a mean of 0, while the error terms of the remaining stochastic functions follow a normal distribution.
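Assumption (iii) is what gives the choice model its logit form: the difference of two independent, identically distributed Gumbel errors follows a logistic distribution. The following minimal sketch checks this numerically for the two-alternative case. The utility values and function names are hypothetical and are not taken from the paper.

```python
import math
import random


def binary_logit_prob(v_pt: float, v_other: float) -> float:
    """Closed-form P(public transport) under i.i.d. Gumbel error terms."""
    return 1.0 / (1.0 + math.exp(-(v_pt - v_other)))


def gumbel_draw(rng: random.Random) -> float:
    """Inverse-CDF draw from a standard Gumbel distribution.

    The standard Gumbel has a nonzero mean, but the same shift applies to
    both alternatives and cancels in the utility difference.
    """
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def simulated_share(v_pt: float, v_other: float,
                    draws: int = 100_000, seed: int = 0) -> float:
    """Share of draws in which public transport has the higher random utility."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        if v_pt + gumbel_draw(rng) > v_other + gumbel_draw(rng):
            wins += 1
    return wins / draws


# Hypothetical systematic utilities, chosen only for illustration.
V_PT, V_OTHER = 0.4, 0.0
closed_form = binary_logit_prob(V_PT, V_OTHER)
simulated = simulated_share(V_PT, V_OTHER)
print(f"closed form: {closed_form:.3f}, simulated: {simulated:.3f}")
```

The simulated share should agree with the closed-form probability up to sampling noise, which is why the binary logit can be estimated without simulating the error terms.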

Model Variables.
Apart from social status and travel characteristics, which can be directly observed and measured, the other factors included in the model are latent variables, which cannot be directly observed or measured. Corresponding measured variables are therefore required to describe the latent variables. In this paper, 7-point Likert scales were adopted to measure the latent variables. Five latent variable sets are used to model public transport mode choice behavior: convenience, personal safety, modal comfort, service environment, and waiting feelings, as shown in Table 1.
(1) Latent Variable Structural Equation Model. It is assumed that the latent variable $z_l^*$ ($l = 1, 2, \ldots, L$) is a linear function of covariates as follows:

$$z_l^* = \alpha_l' x + \eta_l, \qquad (1)$$

where $x$ denotes a ($D \times 1$) vector of observed covariates without including a constant, $\alpha_l$ is a corresponding ($D \times 1$) vector of coefficients, and $\eta_l$ is a random error term assumed to be normally distributed. In our notation, the same exogenous vector $x$ is used for all latent variables; however, this is in no way restrictive, since one may place the value of zero in the corresponding element of $\alpha_l$ when a covariate does not affect $z_l^*$. Following the normalization discussed by Stapleton (1978) and used by Bolduc et al. [37], $\eta_l$ is assumed to be standard normally distributed. Next, define the ($L \times D$) matrix $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_L)'$ and the ($L \times 1$) vectors $z^* = (z_1^*, z_2^*, \ldots, z_L^*)'$ and $\eta = (\eta_1, \eta_2, \ldots, \eta_L)'$. To allow correlation among the latent variables, $\eta$ is assumed to be standard multivariate normally distributed: $\eta \sim N_L(0_L, \Sigma_\eta)$, where $\Sigma_\eta$ is a correlation matrix. Equation (1) can be written in matrix form as

$$z^* = \alpha x + \eta. \qquad (2)$$

[Figure: modeling framework relating observed variables and latent variables (Step 1: SEM) to the logit utility.]

(2) Latent Variable Measurement Equation Model. Here, we define

$$y_h = \delta_h + d_h' z^* + \xi_h, \qquad (3)$$

where $y$ refers to continuous variables ($y_1, y_2, \ldots, y_H$) with an associated index $h$ ($h = 1, 2, \ldots, H$), $\delta_h$ to a scalar constant, $d_h$ to an ($L \times 1$) vector of latent variable loadings on the $h$th continuous indicator variable, and $\xi_h$ to a normally distributed measurement error term. Stack the $H$ continuous variables into an ($H \times 1$) vector $y$, the $H$ constants $\delta_h$ into an ($H \times 1$) vector $\delta$, and the $H$ error terms into another ($H \times 1$) vector $\xi = (\xi_1, \xi_2, \ldots, \xi_H)'$. Similarly, let $\Sigma_\xi$ be the covariance matrix of $\xi$ and define the ($H \times L$) matrix of latent variable loadings $d = (d_1, d_2, \ldots, d_H)'$. Equation (3) can be rewritten in matrix form as follows:

$$y = \delta + d z^* + \xi. \qquad (4)$$

(3) Choice Model. According to random utility theory, it is assumed that individual $q$ chooses the alternative $i$ with maximum utility $U_{iq}$. It is also assumed that the analyst, who is an observer without perfect information, is only able to define/observe a representative utility $V_{iq}$. Therefore, it is necessary to associate an error term $\varepsilon_{iq}$ with each alternative [45], as follows:

$$U_{iq} = V_{iq} + \varepsilon_{iq}. \qquad (5)$$

Latent variables can be included in the utility function as shown in (6), where $\theta_{ik}$ and $\beta_{il}$ are parameters to be estimated, associated with the tangible attributes $X_{ikq}$ and the latent variables $z_{ilq}^*$, respectively:

$$V_{iq} = \sum_k \theta_{ik} X_{ikq} + \sum_l \beta_{il} z_{ilq}^*. \qquad (6)$$

A binary variable $c_{iq}$ is used to define the individual's decision, as shown in

$$c_{iq} = \begin{cases} 1, & \text{if } U_{iq} \ge U_{jq} \text{ for all alternatives } j, \\ 0, & \text{otherwise.} \end{cases} \qquad (7)$$

As the latent variable $z^*$ is actually unknown, the discrete choice model should be estimated jointly with the structural equation model (2) and the measurement equation model (4).
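As a worked step, substituting the structural equation into the measurement equation gives the reduced form of the indicators in terms of the observed covariates. The symbol names used here ($\alpha$, $\eta$, $\delta$, $d$, $\xi$, $\Sigma_\eta$, $\Sigma_\xi$) are assumed labels for the quantities defined in the text, and independence between $\eta$ and $\xi$ is an additional assumption made only for this sketch.

```latex
\begin{aligned}
y &= \delta + d\,z^{*} + \xi
   = \delta + d\,(\alpha x + \eta) + \xi, \\
\mathrm{E}[\,y \mid x\,] &= \delta + d\,\alpha\,x, \\
\mathrm{Var}[\,y \mid x\,] &= d\,\Sigma_{\eta}\,d' + \Sigma_{\xi}.
\end{aligned}
```

Under these assumptions the indicators are jointly normal given $x$, which is what makes maximum likelihood estimation of the structural and measurement parameters tractable in the first stage.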

Estimation.
As Yáñez et al. [28] mentioned, there are currently two approaches to estimating hybrid choice models; they differ mainly in how the available information is used. Although there is experimental software for simultaneous estimation [46], it can only estimate multinomial logit (MNL) models, without accommodating heterogeneity or correlation among individuals and/or observations through random parameters or error components. Therefore, we followed Yáñez et al. [28] in adopting the sequential estimation approach.
According to Yáñez et al. [28], the sequential estimation approach consists of two separate stages: latent variable modeling and discrete choice modeling. In the first stage, a structural equation model is solved to obtain estimators (and their standard errors) for the parameters in the equations relating the latent variables to the explanatory variables and the perception indicators. Using these parameters, it is possible to calculate expected values of the latent variables for each alternative and individual through (2) and to include them directly in the discrete choice model in the second stage. In this stage, a proper procedure requires integrating over the variation of the latent variable [34]. Through the interactions between the structural equation model and the choice model, the latent variables can be combined with any explanatory variable. Because the expected values of the latent variables contain measurement error, we estimate their parameters together with those of the traditional variables in the second stage, in order to guarantee unbiased estimators for the parameters involved [13]. Estimating a Mixed Logit (ML) model with random parameters on the latent variables can solve this issue.
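The plug-in step of the sequential approach can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the coefficient values are hypothetical, the utility of the nonpublic alternative is fixed at zero, and the sketch uses only the expected latent values, so it does not integrate over their variation as a full procedure would.

```python
import math


def expected_latent(alpha, x):
    """Stage 1 output: E[z*] = alpha x, since the structural error has zero mean."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in alpha]


def systematic_utility(theta, attributes, beta, latent):
    """Utility with tangible attributes plus latent variables, as in the form of (6)."""
    tangible = sum(t * a for t, a in zip(theta, attributes))
    intangible = sum(b * z for b, z in zip(beta, latent))
    return tangible + intangible


def prob_public_transport(v_pt, v_other=0.0):
    """Binary logit probability of choosing public transport."""
    return 1.0 / (1.0 + math.exp(-(v_pt - v_other)))


# Hypothetical values, not estimates from the paper.
alpha = [[0.3, 0.1],   # coefficients for latent variable 1 (e.g., convenience)
         [0.0, 0.4]]   # coefficients for latent variable 2 (e.g., personal safety)
x = [1.0, 2.0]         # observed covariates of one traveler

z_hat = expected_latent(alpha, x)          # stage 1: expected latent values
v_pt = systematic_utility(theta=[-0.2], attributes=[2.0],   # fare coefficient and fare
                          beta=[0.5, 0.25], latent=z_hat)   # stage 2: plug into utility
p_pt = prob_public_transport(v_pt)
print(z_hat, round(v_pt, 3), round(p_pt, 3))
```

In a real application the `theta` and `beta` values would come from maximizing the choice likelihood in the second stage, with `z_hat` held fixed at its first-stage value.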
In this paper, we adopted the sequential estimation approach proposed by Yáñez et al. [28]. Latent variable sets were carefully selected based on survey data on individual preferences in southwest China. The aim is to explore the distinctive characteristics of latent variables in China's era of fast economic development and to help public transit policymakers and operators increase the share of public transport in daily trips.

Data Collection and Analysis
Data were collected with predesigned questionnaires from commuters in Chengdu, China. The survey collected 570 questionnaires, of which 497 (about 87%) were valid. The following samples were excluded: (1) samples in which respondents did not answer the questions seriously; (2) samples with 3 or more unanswered questions; (3) samples with 5 consecutive extreme values. The valid questionnaires are summarized in Table 2.
Structural equation modeling requires the data to follow a multivariate normal distribution, because the maximum likelihood method is adopted to estimate the parameters of the structural equation model. The multivariate normality test covers two main indices, skewness and kurtosis, and makes judgements based on their absolute values: (1) if the absolute value of skewness is greater than 3.0, the variable is in the range of extreme skewness; (2) if the absolute value of kurtosis is greater than 10.0, there is a problem with kurtosis; and (3) if the absolute value of kurtosis is greater than 20.0, the variable is in the range of extreme kurtosis.
Table 3 shows that the skewness values of the 19 measured variables range from −0.762 to 0.576, all below 3 in absolute value, while the kurtosis values range from −0.993 to 0.702, well below 10 in absolute value. All measured variables can therefore be treated as approximately normally distributed.
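The screening rule above can be sketched in a few lines. The paper does not state whether kurtosis is reported as raw or excess kurtosis; this sketch assumes excess kurtosis and population (biased) moments, and the Likert responses are hypothetical.

```python
def _central_moment(values, order):
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Moment-based skewness: m3 / m2^1.5."""
    return _central_moment(values, 3) / _central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2^2 - 3 (assumed convention)."""
    return _central_moment(values, 4) / _central_moment(values, 2) ** 2 - 3.0


def passes_screening(values, skew_limit=3.0, kurtosis_limit=10.0):
    """True when |skewness| <= 3.0 and |kurtosis| <= 10.0, per the thresholds in the text."""
    return (abs(skewness(values)) <= skew_limit
            and abs(excess_kurtosis(values)) <= kurtosis_limit)


# Hypothetical 7-point Likert responses for one measured variable.
responses = [4, 5, 3, 6, 2, 5, 4, 7, 1, 4]
print(skewness(responses), excess_kurtosis(responses), passes_screening(responses))
```

Applying `passes_screening` to each of the measured variables reproduces the kind of check summarized in Table 3.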

Estimation Results
The questionnaire data were used to conduct the following analyses. (1) Exploratory factor analysis: verify whether the 5 latent variable sets of public transport service quality and their measurement indices qualify as effective factors that influence passengers' mode choice behavior. (2) Model comparison: compare the calibration results of the mode choice model that considers both measured and latent variables with those of the mode choice model that considers measured variables only, and analyze the differences.

Exploratory Factor Analysis.
There is no general standard for measurement indices of public transport service quality; therefore, exploratory factor analysis is used to verify whether the latent variables and their corresponding measurement indices are suitable. Factors with eigenvalues greater than 1 are selected based on principal component analysis, giving 5 latent variable sets in total: convenience, personal safety, modal comfort, service environment, and waiting feelings. The Cronbach's α coefficients of all 5 latent variable sets are above 0.69, which indicates acceptable reliability.
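The reliability check can be illustrated with a minimal sketch of Cronbach's α for one latent variable set, computed as k/(k−1) · (1 − Σ item variances / variance of the summed score). The item scores below are hypothetical 7-point Likert responses, not survey data from the paper.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of item-score lists.

    Each inner list holds one measured variable's scores, with respondents
    in the same order across lists.
    """
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # summed score per respondent
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1.0 - item_variance_sum / variance(totals))


# Hypothetical responses of six travelers to three items of one latent variable set.
convenience_items = [
    [5, 6, 4, 7, 3, 5],
    [4, 6, 5, 7, 2, 5],
    [5, 7, 4, 6, 3, 4],
]
alpha_example = cronbach_alpha(convenience_items)
print(round(alpha_example, 3))
```

A set whose items move together across respondents yields an α close to 1, while unrelated items push it towards 0.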

Analysis of Calibration Results of Two Models.
First, the variables of travelers' personal characteristics and mode choice options are integrated into the mode choice behavior model, the Binary Logit (BL) in Stage II, to obtain the basic transport mode choice model, which does not consider latent variables (LVs). NLOGIT (a suite of software for the estimation of discrete choice models by Econometric Software Inc.) is used to estimate the parameters of this model and verify the results; variables with |t| < 1.96 are removed, and the parameter estimates and test results are shown in Table 5. The retained variables include gender, occupation, transport vehicle ownership, and fare. If the estimated coefficients of gender, occupation, and transport vehicle ownership are positive, then the larger these variables are, the more likely the traveler is to choose the public transport mode; if the coefficient of fare is negative, then the higher the price is, the less likely the traveler is to choose the public transport mode. The goodness of fit of this model is 0.245, which indicates that the model has reasonable explanatory ability.
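The calibration and the |t| ≥ 1.96 screening can be illustrated with a small stand-alone sketch. It is not the NLOGIT procedure used in the paper: it fits a binary logit with a constant and a single explanatory variable by Newton-Raphson, and the fare and choice data are hypothetical.

```python
import math


def fit_binary_logit(x, y, iterations=25):
    """Newton-Raphson MLE for P(y = 1) = 1 / (1 + exp(-(b0 + b1 * x))).

    Returns the coefficients and their t-values (coefficient / standard error).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Gradient of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))
        # Information matrix (negative Hessian), 2 x 2.
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: inverse information matrix times gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Standard errors from the inverse information matrix at the last evaluated
    # iterate, which coincides with the estimate once the iteration has converged.
    se0, se1 = math.sqrt(h11 / det), math.sqrt(h00 / det)
    return (b0, b1), (b0 / se0, b1 / se1)


# Hypothetical data: fare level and whether public transport was chosen (1) or not (0).
fare = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
chose_pt = [1, 1, 0, 1, 0, 0]

(b0, b1), (t0, t1) = fit_binary_logit(fare, chose_pt)
kept = [name for name, t in (("constant", t0), ("fare", t1)) if abs(t) >= 1.96]
print(b0, b1, t0, t1, kept)
```

A negative `b1` corresponds to the interpretation in the text: the higher the fare, the lower the probability of choosing public transport. With so few hypothetical observations the t-values are small, so the screening is shown only to illustrate the rule.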
Adding the 5 latent variable sets to the choice model gives the public transport mode choice model with latent variable sets; NLOGIT is again used to calibrate and verify the parameters of the integrated model. As latent variables cannot be directly measured, they are described by measured variables. Different integrated models can be obtained depending on how the relationship between latent variable sets and measured variables is described: (1) SEM-BL1: the relationships among latent variable sets and between latent variable sets and measured variables are described through the structural and measurement equations, based on (2) and (4); (2) SEM-BL2: the relationship between latent variable sets and measured variables is described through factor analysis, and all measured variables are put into the integrated model based on the results in Table 4; (3) SEM-BL3: only the measured variables with high factor loadings are put into the integrated model after factor analysis. The parameter estimates of each model are given in Table 6. The SEM of public transport service quality, calculated with AMOS (Analysis of Moment Structures, software for solving structural equation models developed by SPSS, Inc.), includes two parts: (1) the causal relationships between latent variable sets and the measured variables of passengers' personal characteristics, shown in the structural pattern in Table 7, and (2) the relationships between latent variables and their measured variables, shown in the measurement pattern in Table 7. Table 7 shows that the older the passenger is, the more satisfied he/she is with the public transport service. The influence of occupation on service environment is also positive and is more pronounced than that of age, as government officers and employees in the commerce and service industries demand higher service quality than people in other occupations. Educational background and modal comfort are positively correlated: well-educated people demand a high level of comfort. Monthly income is also positively correlated with convenience and personal safety, which means that people with higher incomes pay more attention to convenience and personal safety. Women care more about waiting feelings, which suggests that women are more sensitive to waiting conditions than men.
Table 6 shows that the estimated parameters differ among the three estimation methods in the ICLV model. In SEM-BL1, the variable set with the greatest impact on mode choice is convenience, while the least impact comes from modal comfort; in SEM-BL2 and SEM-BL3, the variable set with the greatest impact on mode choice is waiting feelings, while the least [...]. The goodness of fit of SEM-BL2, which includes all measured variables, is 0.007 higher than that of SEM-BL3, which includes only the measured variables with large factor loadings. Therefore, it can be concluded that models that include only measured variables with large factor loadings may miss certain information for estimation. Of all the parameters in the calibration results shown in Table 7, measured variable 2 has the highest impact on convenience, while measured variable 3 has the lowest; measured variable 4 has the highest impact on personal safety; measured variable 7 has the highest impact on modal comfort, while measured variable 9 has the lowest; measured variables 10 and 13 have the lowest impact on service environment; and measured variable 17 has the highest impact on waiting feelings, while measured variable 18 has the lowest. These results provide a basis for public transport companies to improve service quality from the perspective of passengers.
Based on the final estimation and verification of the model parameters, the ICLV model with 5 latent variable sets is more accurate than the BL model without latent variable sets: the goodness of fit improves from 0.245 to 0.446, with the accompanying fit statistic increasing by 20.607. This means that the explanatory ability of the ICLV model is greatly improved and that latent variables play an important role in mode choice decision-making. In the ICLV model, all estimated parameters meet the verification requirements. After the latent variables are added, the constant term in the utility function declines markedly, which indicates that the BL model is less accurate if latent variable sets are not added to the model.
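The comparison of goodness of fit can be sketched as follows, assuming that the reported measure is McFadden's likelihood ratio index, ρ² = 1 − LL(model)/LL(null). The paper does not state the exact definition or the null model, so both the index and the equal-shares null model used here are assumptions, and the log-likelihood value in the example is hypothetical.

```python
import math


def null_loglik(n_obs, n_alternatives=2):
    """Log-likelihood of an equal-shares null model (assumed reference model)."""
    return n_obs * math.log(1.0 / n_alternatives)


def rho_squared(ll_model, ll_null):
    """McFadden's likelihood ratio index."""
    return 1.0 - ll_model / ll_null


# 497 valid questionnaires and two alternatives, as described in the text.
ll_null = null_loglik(497)

# Hypothetical fitted log-likelihood, for illustration only.
ll_example = -250.0
print(round(rho_squared(ll_example, ll_null), 3))

# Difference between the two goodness-of-fit values reported in the text.
improvement = round(0.446 - 0.245, 3)
print(improvement)
```

The difference of the two reported values matches the improvement of 0.201 quoted in the abstract.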
Of all the latent variable sets, convenience and service environment have a major impact on mode choice for public transport, while modal comfort has a minor impact. In the integrated model that includes latent variables, fare clearly influences the travelers' choice, which means that the significance of fare will be underestimated if latent variables are not added to the model.

Conclusions
Borrowing the concepts of latent variables and measured variables from the theory of consumer behavior, this research introduced them into an ICLV model to analyze the variables that affect mode choice for public transport. Measured and unmeasured factors influencing mode choice for public transport were analyzed to build the structural equation model. Quantitative causal relationships between the various factors and their degree of influence on mode choice were estimated.
Different estimation methods were applied to compare the final results. The ICLV model was found to be stronger in explanatory ability and accuracy than the model without latent variables. Therefore, it is more appropriate to apply a structural equation model to build the public transport mode choice model and to calculate the goodness of fit for latent variables, which can be described by variables measured from public transport users. The results also revealed that latent variable sets (e.g., service environment and waiting feelings) have a major impact on mode choice behavior. Therefore, to achieve a higher mode share, public transport operators should improve service quality by improving those latent factors (e.g., convenience, personal safety, and service environment).
Future research should focus on the universality of the types and number of latent variables and their degree of influence on mode choice behavior. Meanwhile, more effort should be made to collect questionnaire survey data through social networking services to reduce bias in the data.

Table 7: Results of structural equation model of public transport service quality.
Headings: Pattern; Structural equation model; Convenience; Personal safety; Modal comfort; Service environment; Waiting feelings; Standard deviation of error term.
Note. The t values of the parameters are larger than 1.96, which means that p < 0.05 and the verification requirement is met.

Table 1: Model variables description.
Convenience
  1. Travel time from the departure place to the bus stop is short.
  2. Transferring to other buses or transport modes is convenient.
  3.
Service environment
  Clean inside the bus.
  11. Fresh air in the bus.
  12. Passengers are in good order when they are boarding or alighting.
  13. The bus driver and conductor are polite and willing to serve passengers.
  14. There is little interference among passengers.
  15. Overall satisfaction with the in-vehicle environment.
Waiting feelings
  16. The estimated arrival time of the incoming bus shown on a variable message sign (VMS) board at bus stops is accurate and timely.
  17. Actual waiting time is very close to the estimate on the VMS board.
  18.

Table 2: Summary of collected data.

Table 3: Measured variables normality test results.

Table 5: Reliability test results.

Table 6: Calibration results of SEM-DCM model of mode choice for bus.
Note. All t values meet the verification requirements, except the t value of personal safety in SEM-BL3, which is less than 1.96.