Factors Identification and Prediction for Mind Wandering Driving Using Machine Learning

Traffic safety is affected by many complex factors. Mind wandering (MW) is a potentially fatal threat to driving safety and is hard to detect and prevent because its occurrence mechanism is uncertain and complex. The aim of this study was to propose a framework for analyzing and predicting MW based on readily available driving status data. The data used in this study are single-trip records collected by questionnaire, which include drivers' personal characteristics, contextual information about the trip in which MW occurs, and in-vehicle environmental factors. After investigating the extent to which these factors influence MW, we select the factors that can be reliably obtained in real life and use them to forecast MW. To verify that the newly explored factors improve forecast accuracy, we conduct a comparative analysis between the results of our approach and those of existing approaches. We compare the results of four machine-learning-based forecasting approaches on a real-life data set. The results show that the factors found in this paper can significantly improve forecast accuracy. The confusion matrix, ROC curves, and AUC are reported, and the gradient boosting decision tree algorithm performs better than the other forecasting approaches. The importance rankings of most factors obtained by the gradient boosting decision tree and by the questionnaire are the same.


Introduction
The U.S. National Highway Traffic Safety Administration reports that distracted driving caused 3,142 deaths in 2019 and expects a 10.59% increase per year for fatal traffic accidents [1]. Using a smartphone, tuning a radio, chatting with passengers, and many other behaviors unrelated to the driving task can impair driving performance [2-4]. In addition, there is a kind of distracted driving in which the driver thinks about matters unrelated to the driving task at hand and pays no attention to driving [5]. This is called mind wandering (MW) in driving; MW is a situation in which executive control shifts away from a primary task to the processing of personal goals [6]. An episode of MW is linked to risky driving behavior, which is a potential safety hazard [7].
MW occurs with high frequency and has many negative impacts on driving safety. It frequently occurs during long-distance driving and under optimal driving conditions and is not easily detected by the driver [8]. Typically, MW can occur five times in a 15 min commuting trip [9]. When it occurs, the driver is less attentive to the primary task of driving [10]. MW reduces drivers' cognitive and visual performance and lowers their perception of risk. In addition, drivers' average speed during MW tends to be higher than under normal driving. Thus, drivers may not react quickly to dangerous emergency situations [11]. Moreover, drivers must still monitor the driving scene when operating autonomous vehicles, so MW would also be a noticeable safety hazard for automated long-distance trips due to its occurrence mechanism [12,13].
Thus, the effective analysis and forecast of MW are significant for driving safety.
To respond to the impact of MW on driving behavior, existing research has explored the differences in self-reported mind wandering according to drivers' demographic characteristics and the context in which MW occurs [7,14-16]. Some scholars have identified real-time MW based on eye movements, steering wheel reversal rates, and other driving performance data collected in a driving simulator [17]. These methods detect MW with high accuracy. However, such parameters may not be easily available in daily driving. To the best of our knowledge, no existing research predicts MW based on readily available data without a driving simulator.
Thus, this paper aims to propose a framework for analyzing and predicting MW based on readily available driving status data. The impact of readily available factors on MW should be explored. Based on such factors and their specific impacts on MW, a framework to predict this dangerous driving behavior should be established. To ensure the validity of the forecasting framework, it is also necessary to verify the forecasting results against drivers' real-life MW. To achieve these objectives, we analyze the relationships between factors and MW and propose a forecasting framework based on the analysis results. First, a questionnaire is used to collect data on MW episodes when driving. The collected data contain drivers' personal characteristics (age, gender, driving experience, and educational background), context information when MW occurs (time, distance, trip purpose, etc.), and the number of MW occurrences for the reported trip. Second, chi-square tests of independence are performed between the number of MW occurrences and the extrinsic and intrinsic factors triggering MW, which reveals the extent to which each factor influences MW. Third, we set up evaluations of MW prediction using the factors proposed in this paper. Then, we compare the forecasting results of four approaches based on machine-learning algorithms on a real-life data set and provide the relative importance of the factors.
This paper has three main contributions that distinguish our research on factor identification and forecast for MW from existing studies. First, we identify the extrinsic and intrinsic factors for MW and indicate the extent of their influence. These results could provide suggestions for developing MW prevention strategies. Second, to the best of our knowledge, this study is one of the first to apply machine learning to predict MW based on readily available data. Third, we compare four forecasting approaches based on machine-learning algorithms for the occurrence of MW to cope with the randomness and uncertainty of MW. The performance advantages of these algorithms are also analyzed, which could provide suggestions for real-life applications.
The rest of this paper is organized as follows. In Section 2, we review existing studies on the analysis and forecast of MW. The data collection and analysis of MW are presented in Section 3. Section 4 describes four machine-learning algorithms for predicting MW and evaluates the performance of the MW forecast. Section 5 discusses our contributions and perspectives on future work. Finally, Section 6 concludes the paper.

Literature Review
To distinguish our research from existing studies, we review previous research on how different factors affect MW and on the detection of MW based on driving behavior. We also review the literature on applications of machine learning in driver behavior analysis and forecasting.
To respond to the uncertain and unconscious nature of MW, some scholars have used simulation-based methods to identify the scenarios in which MW is prone to occur. Studies showed that the more times participants drive, the less attention they pay to driving [8,18]; the more familiar participants are with the scene in the driving simulator, the more frequently MW occurs [19]; and the lower the visual complexity of the oncoming traffic scene, the less attentive participants are and the more frequently MW occurs [20]. It is also notable that when drivers pay attention to MW, they become more aware of its occurrence as an unsafe behavior [21]. Although simulation-based studies can obtain comprehensive data on driving behavior characteristics, some of the identified factors related to MW are hard to obtain and quantify in real life. Besides, the data may not reflect reality, as drivers in an experimental situation often feel nervous or feel like they are playing a game [19].
Previous research has also used questionnaires that asked drivers to recall their past experiences of MW and analyzed the interacting factors related to MW. Male drivers and young drivers report aggressive driving behavior more frequently, and they report more MW when they are tired [14,22]. As for environmental factors, traffic environments such as monotonous motorways and bypasses, as well as normal weather conditions, can strongly affect MW. Besides, MW is common in daily life and is closely correlated with the driver's situation, such as stress, being alone in the car, and driving time [7,14,23]. These studies revealed factors linked to MW from the perspectives of drivers and the environment. However, some relationships between extrinsic and intrinsic factors and MW, such as those involving traffic flow conditions and in-vehicle environments, were not examined. These factors need to be further explored.
Highly distracting MW thoughts are related to a higher risk of being involved in an accident [5]. As for the characteristics of off-task thoughts while driving, most off-task thoughts reported by drivers were present- and future-oriented, of neutral emotional valence, mainly concerned with solving personal and professional problems, and occurred without any contextual cue [7]. Over half of drivers' MW reports were related to things they saw or heard [9]. Although inner thoughts during MW have been investigated, the factors underlying the differences in these thoughts have not been discussed further.
To the best of our knowledge, while some scholars have presented methods for real-time MW detection, only a few studies focus on predicting MW. Most scholars used driving performance and eye movement information collected in a driver-in-the-loop simulator to infer drivers' cognitive status for MW detection. After feature extraction, important features of MW were set as classification features, and a support vector machine classifier was trained and cross-validated within subjects [24]. Osman et al. [25] presented a bi-level hierarchical classification methodology with the decision tree algorithm to distinguish types of secondary tasks. The inputs to the model were five driving behavior parameters and their standard deviations. Besides, research showed that steering wheel reversal rate could also be an effective indicator of MW [17]. Some of these methods detected MW with high accuracy and immediacy and provided knowledge on interrupting risky driving behavior. However, these parameters were collected in simulator studies and are not readily available in daily driving. Further research is needed to predict MW during actual driving using easily available data.
Machine learning has recently received attention due to the emergence of big data generated by multiple sources and the availability of computational power [26]. It performs well on many complex and nonlinear problems [27,28] and is widely applied in the prediction of drivers' behavior and in traffic-safety-related research [29]. Chung et al. [30] proposed an ensemble machine-learning-based algorithm for predicting electric vehicle user behavior, including stay duration and energy consumption, based on historical charging records. Deng et al. [31] predicted the performance of lane-changing behaviors by combining environmental and eye-tracking data with four different machine-learning approaches. Saeed et al. [32] carried out an empirical assessment of uncorrelated and correlated random parameter count models on multilane highways, considering two crash severities, for predicting road crash frequencies.
These studies provide a reference for constructing a framework of MW predictive models that analyzes the relationship between the forecast objective and related variables and uses these variables as inputs for forecasting MW with different machine-learning algorithms. While various studies have applied machine learning, no existing research has applied such methods to predict MW.

The Questionnaire.
The questionnaire was published on an open-source questionnaire platform (https://www.wenjuan.com), and people could fill it out online by clicking on the link on the website. The questionnaire was distributed on social media such as WeChat groups, QQ groups, and personal media platforms such as WeChat Moments and microblogging to broaden the range of occupations and ages covered. Participants were likely a mix of employed and unemployed people, students, and retired people from cities, smaller towns, and rural areas [23]. A snapshot of the questionnaire is shown in Figure 1. In total, we collected 201 questionnaires over about two months of distribution. The mean response time for the questionnaire was 6.3 minutes (SD = 5.02), and the questionnaires were collected between December 10, 2019, and February 24, 2020.
To verify the validity of the questionnaires, we set a time threshold to judge whether a questionnaire is acceptable. The threshold is set to 2 min according to the number of questions, reading speed, and comprehension, based on the experiences of contactable participants. If the time taken to fill in the questionnaire was less than 2 min, we considered that the participant did not take the questionnaire seriously and classified it as invalid. Only 11 questionnaires were invalid in this survey. The detailed content of the questionnaire on http://wjw.com (Shanghai Zhongyan Network Technology Co., Ltd, Shanghai, China) can be viewed at https://www.wenjuan.com/s/UnauInE. The questionnaire contained three sections with 25 questions and could be answered based on previous driving experiences. The first part collected personal information about the driver's age, gender, educational background, and years of driving experience. Besides, to investigate the influence of MW on driving safety, participants were asked about traffic accidents caused by their MW and their opinion about interventions against MW. The second part asked the participants to report specific information about the trip in which MW occurred, including the MW occurrence frequency, time, duration, distance, trip purpose, traffic flow condition, road pavement condition, location, weather, and mood during the trip. The third part explored the influence of in-vehicle environments, such as temperature, multimedia, and passengers, on the occurrence of MW. The content of the mind wandering was also asked about in the questionnaire.
Journal of Advanced Transportation

The Participants.
Only automobile drivers could participate in the survey, and this was stated before volunteers were asked to fill in the questionnaire. Participants were required to hold a driving license and have driving experience. In the questionnaire, we stated the objective of the data collection and promised to keep personal information confidential.
The participants included 129 males and 61 females. Based on the valid questionnaires, they were aged between 19 and 60 years. The participants' minimum, average, and maximum years of driving were 1, 5.65, and 20, respectively. 78.9% of the participants had been involved in risky traffic behavior due to MW. Although missing road signs and markings and panic braking occurred most often, 21.05% of the drivers reported having had a traffic accident due to MW. Table 1 summarizes the participants' information in detail.

Forecast Variables.
A significance analysis between the potential influence factors and MW was performed. The potential influence factors were divided into two groups: personal characteristics and contextual information about the trip in which MW occurs. The personal characteristics include gender, age, years of driving experience, and educational background. The contextual information covers the time, duration, and distance of driving, trip purpose, traffic flow condition, road pavement condition, location, weather, and mood during the trip. The results of the significance analysis with the chi-square test are shown in Table 2.
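As an illustration of the chi-square test of independence used here, the following minimal sketch computes the test statistic and degrees of freedom for a contingency table of factor level against MW frequency class. The counts are hypothetical and are not the study's data; the function name `chi_square_independence` is illustrative, and the critical value 9.488 is the standard table value for a 0.05 significance level with 4 degrees of freedom.

```python
from typing import List, Tuple


def chi_square_independence(table: List[List[float]]) -> Tuple[float, int]:
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the row and column variables.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical counts (NOT the study's data):
# rows = age group, columns = low / middle / high MW frequency.
counts = [
    [30, 20, 5],
    [20, 25, 15],
    [25, 15, 5],
]
stat, dof = chi_square_independence(counts)
CRITICAL_005_DOF4 = 9.488  # standard critical value, alpha = 0.05, dof = 4
print(f"chi2 = {stat:.2f}, dof = {dof}, significant = {stat > CRITICAL_005_DOF4}")
```

A factor would be treated as related to MW when the statistic exceeds the critical value for its degrees of freedom, which is equivalent to the p-value falling below the chosen significance level.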
It can be seen from Table 2 that there is no significant evidence that gender, time, trip purpose, weather, or road pavement condition is related to the occurrence and frequency of MW. In contrast, driving experience, educational background, age, duration, mood, distance, traffic flow condition, and location have a relatively significant effect on the occurrence of MW. According to the questionnaire, drivers with five to ten years of driving experience, drivers with higher education, and drivers under 40 years old are prone to MW. Among the trip-related factors, driving for a long duration or distance, driving in an emotionally unstable state (excited or frustrated), driving in free-flow conditions, and driving in the city are associated with MW.
Age is the driver attribute that has the most significant impact on MW. Thus, it is necessary to clarify the relationship between age and MW. To explore how drivers' age influences the frequency of MW, we divided the participants into three groups based on the frequency of MW in a single trip: less than one, between two and three, and more than three. These were defined as low frequency, middle frequency, and high frequency of MW, respectively. The distribution of drivers' age across the MW frequency groups is shown in Table 3.
As shown in Table 3, the age distributions in the low-frequency and middle-frequency groups are consistent. Within these two groups, drivers under 30 years old make up the largest proportion (44.07%), and middle-aged drivers between 40 and 50 years old make up the second largest proportion (34.75%). In contrast, in the high-frequency group, participants aged between 30 and 40 years make up the largest proportion, about 38.89%.
Furthermore, MW occurrence frequency varies across drivers. Figure 2 shows the MW occurrence frequency for drivers of different ages. Drivers in most age groups seldom report high-frequency MW, except drivers aged 30-40 years: 20% of drivers aged 30-40 have a high frequency of MW, which is significantly higher than for drivers of other ages. Older drivers also tend to have high-frequency MW and injuries [33]. Other research shows that drivers of about 35 years old are likely to be distracted [34] and that teens report more distracting behavior while driving than their parents [35], which is consistent with our findings.
Driving experience has shown a relatively significant relationship with MW. Table 4 shows the distribution of drivers' driving experience across the MW frequency groups. Drivers with less than 5 years of driving experience are the main component of the low-frequency and middle-frequency groups. However, drivers with 5-10 years of driving experience constitute the major part of the high-frequency group. Figure 3 shows the occurrence of MW among drivers with different driving experience. The group of drivers with 5-10 years of driving experience has the highest rate of high-frequency MW. It can also be concluded that drivers with many years of driving experience seldom have MW.
Previous studies have not examined the relationship between educational background and MW. To find out how educational background influences the frequency of MW, we analyzed the distribution of drivers' educational backgrounds across the MW frequency groups in a single trip, as shown in Table 5.
As shown in Table 5, drivers with bachelor's and specialist degrees constitute the majority of the low-frequency and middle-frequency groups. In the high-frequency group, participants with a master's degree or Ph.D. make up the major component (56%). Previous research showed that people with high working memory capacity are more prone to MW than those with low working memory capacity [36]. This study, to some extent, validates that finding. Figure 4 shows the occurrence of MW among drivers of different educational backgrounds. Drivers with lower education are less likely to experience a high frequency of MW, and the well-educated drivers are the most prone to high-frequency MW (19.61%).

In-Vehicle Influence Factors.
The in-vehicle environment also affects driving behavior. However, how it does so is still unclear, as field-based in-vehicle information was difficult to collect in previous works [7]. Research has found that in-vehicle environments are highly related to traffic crashes [37]. Therefore, this section tries to find the potential relationship between the in-vehicle environment and MW. In the questionnaire, three possible influence factors were surveyed: passengers, temperature, and multimedia. Table 6 shows the survey results for these three factors by listing the percentage of drivers who chose each situation as the one most prone to MW.
It can be seen from Table 6 that drivers who are chatting with passengers, driving in a warm temperature, or driving without multimedia playing in the vehicle are prone to MW (P < 0.001). Furthermore, drivers of different genders make significantly different choices about the situations most prone to MW (P < 0.05). Male drivers are more prone to MW when they drive alone. According to our investigation, driving with passengers and chatting is highly related to MW, especially for female drivers.
We added the following question: "What kind of thoughts are you most likely to have when MW occurs?" Before distributing the questionnaire, we classified the off-task thoughts into 7 classes. When answering the questionnaire, participants chose the thought that appeared during the trip. The frequency of the daily routines that caused MW is shown in Figure 5. It can be seen from Figure 5 that 54 drivers (about 28.4%) confirmed that they are most likely to think about work-related routines when they have MW; "personal emotional problems" and "family-related affairs" account for 23.2% and 18.9%, respectively. Therefore, it can be deduced that "work-related problems," "personal emotional problems," and "family-related affairs" are the main intrinsic factors inducing MW.
To further explore the character of daily-routine-related wandering, a significance analysis between drivers' main attributes and daily-routine-related wandering was conducted using the chi-square test. The results are shown in Table 7. Only age has a significant relationship with daily-routine-related wandering.
The data also show that daily-routine-related wandering during MW episodes differs significantly across drivers of different ages, with $\chi^2 = 18$, $p = 0.003$. The questionnaire results on the daily-routine-related wandering of drivers of different ages are shown in Table 8.
It can be seen from Table 8 that drivers under 30 years old are prone to dwell on personal issues when experiencing MW. Among drivers between 30 and 40 years old, about 40% report that they are easily caught up in work-related problems. For drivers more than 40 years old, the main wandering issue becomes family-related affairs. It can be concluded that the main daily-routine-related wandering differs across driver ages. Younger drivers focus on personal emotional problems. With increasing age, drivers' concerns shift to career and family. Furthermore, when the issues a driver is concerned about arise in daily life, the driver is more easily caught in MW.
In summary, we explored new factors related to MW. We also analyzed the internal reasons for the differences in the thoughts of MW, as these reasons have not been studied. Previous studies have focused on how drivers' gender, age, years of driving experience, and characteristics of driving tasks influence MW. In this section, building on the existing conclusions, drivers' educational background, trip purpose, traffic flow condition, location, and mood, and their relationships with MW, were taken into consideration. These factors are easy to collect and can be used to predict MW occurrence. Besides, whether passenger presence, in-vehicle temperature, and in-vehicle multimedia lead to MW was also investigated.

Forecasting Approaches Based on Machine Learning
Effective forecasting of MW can greatly improve driving safety and is an urgent need for drivers. According to our survey, 93.16% of the drivers report that they are willing to follow an approach to prevent MW. To the best of our knowledge, this study is the first to predict the occurrence of MW based on readily available driving status data. The factors associated with MW were identified in the previous section. First, we choose the factors selected earlier as input variables for forecasting MW, as the other impact factors found by our approach could not be reliably obtained by the driving simulator. Second, since early warning of unsafe driving behavior does not require predicting the exact number of MW occurrences, we further processed the data: trips with fewer than two MW occurrences were defined as low-risk trips and the remaining trips as high-risk trips, which transforms the forecast of MW into a classification problem [38,39]. Third, we compare four forecasting approaches based on machine-learning algorithms for the occurrence of MW on a real-life data set and provide the relative importance of the factors.
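The conversion of MW counts into a binary target can be sketched as follows. The threshold of two occurrences comes from the text; the function name `label_trip`, the 0/1 encoding, and the example counts are assumptions made only for illustration.

```python
from typing import List

# Trips with fewer than two MW occurrences are low-risk, as stated in the text.
HIGH_RISK_THRESHOLD = 2


def label_trip(mw_count: int) -> int:
    """Map a single-trip MW count to a class label: 0 = low-risk, 1 = high-risk."""
    return 0 if mw_count < HIGH_RISK_THRESHOLD else 1


# Hypothetical single-trip MW counts (NOT the study's data).
mw_counts: List[int] = [0, 1, 2, 3, 5]
labels = [label_trip(count) for count in mw_counts]
print(labels)
```

With this encoding, each of the four approaches only needs to output one of two classes per trip, which is what makes accuracy, confusion matrices, and ROC curves applicable later.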

Random Forest Classifier.
Random forest is an ensemble learning method. Like other ensemble learning techniques, it improves performance through a voting scheme [40]. The method combines Breiman's bagging idea and Ho's "random subspace method" to construct a collection of decision trees with controlled variation [41]. The main idea of random forest is to grow a large population of unpruned decision trees, each on a bootstrap sample of the training data, randomly selecting a subset of features at each split and then choosing the best split among these features [42]. Compared with other algorithms, random forest needs less training time, can handle high-dimensional data, and does not require feature selection [43]. After training, it can report which features are more important.
Based on these advantages, the random forest algorithm was chosen in this study to predict MW in a trip. The variables significantly associated with MW were chosen as input, and the model determined whether the current trip was a high-risk or low-risk trip. The importance ranking of these features could also be given.
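The bagging, random feature selection, and voting described above can be sketched in a few lines. This is a deliberately simplified illustration and not the implementation used in the study: each "tree" is a one-split decision stump rather than an unpruned tree, the data are hypothetical numeric encodings, and all names (`fit_stump`, `fit_forest`, `predict`) and hyperparameters are assumptions.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple

# (feature index, threshold, label if value <= threshold, label otherwise)
Stump = Tuple[int, float, int, int]


def majority(labels: Sequence[int]) -> int:
    """Most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(X: List[List[float]], y: List[int], features: Sequence[int]) -> Stump:
    """Pick the single split with the fewest training errors among candidate features."""
    best: Stump = (features[0], X[0][features[0]], majority(y), majority(y))
    best_err = len(y) + 1
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            l_lab = majority(left)
            r_lab = majority(right) if right else l_lab
            err = sum(lab != l_lab for lab in left) + sum(lab != r_lab for lab in right)
            if err < best_err:
                best, best_err = (f, t, l_lab, r_lab), err
    return best


def fit_forest(X: List[List[float]], y: List[int], n_trees: int = 25,
               n_features: int = 2, seed: int = 0) -> List[Stump]:
    """Train stumps on bootstrap samples, each restricted to a random feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]             # sample rows with replacement
        features = rng.sample(range(len(X[0])), n_features)  # random feature subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], features))
    return forest


def predict(forest: List[Stump], row: Sequence[float]) -> int:
    """Majority vote over all stumps."""
    return majority([l if row[f] <= t else r for f, t, l, r in forest])


# Hypothetical encoded trips: [duration code, distance code, unrelated code]; 1 = high-risk.
X = [[1, 10, 0], [2, 12, 1], [3, 11, 0], [8, 30, 1], [9, 28, 0], [10, 31, 1]]
y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(X, y)
print(predict(forest, [1, 10, 0]), predict(forest, [10, 31, 1]))
```

Because a stump has a single split, drawing the feature subset once per stump is equivalent to drawing it at each split; a full random forest would repeat that draw at every node of a deep tree.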

Gradient Boosting Decision Tree.
Gradient Boosting Decision Tree (GBDT) is a supervised learning algorithm. GBDT repeatedly fits a simple model and adjusts it based on the performance of the previous models, using gradient boosting to improve model performance [44]. GBDT combines regression trees with a gradient boosting technique and has been widely applied in various disciplines, such as credit risk assessment, transport crash prediction, and fault prognosis in electronic circuits [45]. The method has high prediction accuracy and can flexibly handle various data types, including continuous and discrete values.
As the factors associated with MW collected in this experiment were discrete and continuous variables, the GBDT method was also chosen to predict the number of single-trip MW occurrences. As in the random forest approach, variables significantly associated with MW were used as input variables for the forecast, and the output was the class to which the trip belongs.
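To make the boosting idea concrete, the sketch below fits regression stumps one after another to the residuals of the current model, which are the negative gradient of the squared loss, and thresholds the final score at 0.5. It is a simplified illustration under stated assumptions and not the study's implementation: squared loss on 0/1 labels stands in for a classification loss, stumps stand in for full regression trees, and the data, names, number of rounds, and learning rate are all hypothetical.

```python
from typing import List, Sequence, Tuple

# (feature index, threshold, value if feature <= threshold, value otherwise)
Stump = Tuple[int, float, float, float]
# (initial score, learning rate, fitted stumps)
Model = Tuple[float, float, List[Stump]]


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def fit_regression_stump(X: List[List[float]], residuals: List[float]) -> Stump:
    """Least-squares stump: one split, mean residual on each side."""
    best = None
    best_sse = float("inf")
    for f in range(len(X[0])):
        # Skip the largest value so the right-hand side is never empty.
        for t in sorted({row[f] for row in X})[:-1]:
            left = [r for row, r in zip(X, residuals) if row[f] <= t]
            right = [r for row, r in zip(X, residuals) if row[f] > t]
            lv, rv = mean(left), mean(right)
            sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if sse < best_sse:
                best, best_sse = (f, t, lv, rv), sse
    if best is None:
        raise ValueError("every feature is constant; no split is possible")
    return best


def stump_value(stump: Stump, row: Sequence[float]) -> float:
    f, t, lv, rv = stump
    return lv if row[f] <= t else rv


def fit_gbdt(X: List[List[float]], y: List[int], n_rounds: int = 20,
             learning_rate: float = 0.3) -> Model:
    """Each round fits a stump to the residuals y - score of the current model."""
    base = mean(y)
    scores = [base] * len(y)
    stumps: List[Stump] = []
    for _ in range(n_rounds):
        residuals = [yi - si for yi, si in zip(y, scores)]
        stump = fit_regression_stump(X, residuals)
        stumps.append(stump)
        scores = [s + learning_rate * stump_value(stump, row) for s, row in zip(scores, X)]
    return base, learning_rate, stumps


def score(model: Model, row: Sequence[float]) -> float:
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_value(s, row) for s in stumps)


def classify(model: Model, row: Sequence[float]) -> int:
    """1 = high-risk trip when the boosted score reaches 0.5, else 0 = low-risk."""
    return 1 if score(model, row) >= 0.5 else 0


# Hypothetical encoded trips: [duration code, distance code, unrelated code]; 1 = high-risk.
X = [[1, 10, 0], [2, 12, 1], [3, 11, 0], [8, 30, 1], [9, 28, 0], [10, 31, 1]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gbdt(X, y)
print([classify(model, row) for row in X])
```

The learning rate shrinks each stump's contribution, so later rounds correct progressively smaller errors instead of overshooting.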

Naive Bayes.
A Naïve Bayes classifier estimates the posterior probability of $Y$ given $X$, $P(Y \mid X)$. Suppose there are $m$ samples, each sample has $n$ features, and the output has $K$ categories $C_1, C_2, \ldots, C_K$. Applying Bayes' theorem and the assumption of conditional independence, the posterior probability can be represented as [46]:

$$P(Y = C_k \mid X = x) = \frac{P(Y = C_k)\,\prod_{j=1}^{n} P(X_j = x_j \mid Y = C_k)}{\sum_{i=1}^{K} P(Y = C_i)\,\prod_{j=1}^{n} P(X_j = x_j \mid Y = C_i)} \qquad (1)$$

To determine the classification, the posterior probability is maximized: all $K$ conditional probabilities $P(Y = C_k \mid X = X^{(\text{test})})$ are calculated, and the category corresponding to the maximum conditional probability is the Naïve Bayes prediction. Using the independence assumption, one obtains the Naïve Bayes inference formula in the usual sense:

$$C_{\text{result}} = \arg\max_{C_k} \; P(Y = C_k)\,\prod_{j=1}^{n} P\big(X_j = X_j^{(\text{test})} \mid Y = C_k\big)$$

Although the Naïve Bayes classifier is based on the "naïve" conditional independence assumption, it still performs well compared with other classification algorithms on many real data sets that do not strictly satisfy that assumption [47]. Therefore, as a classical classification method, it was chosen here to predict the occurrence of MW. The event code (class) with the largest estimated probability was then chosen as the forecast for MW.
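A minimal sketch of this decision rule for categorical questionnaire answers is given below. It works in log space to avoid underflow and uses add-one (Laplace) smoothing, which is an assumption added here so that unseen feature values do not produce zero probabilities; the feature values, labels, and function names are hypothetical and are not taken from the study.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Sequence, Set, Tuple

Model = Tuple[Counter, Dict[Tuple[int, Hashable], Counter], List[Set[Hashable]], float]


def fit_naive_bayes(X: List[Sequence[Hashable]], y: List[Hashable],
                    alpha: float = 1.0) -> Model:
    """Count class frequencies and per-class value frequencies for each feature."""
    class_counts = Counter(y)
    cond_counts: Dict[Tuple[int, Hashable], Counter] = defaultdict(Counter)
    values_per_feature: List[Set[Hashable]] = [set() for _ in X[0]]
    for row, label in zip(X, y):
        for j, value in enumerate(row):
            cond_counts[(j, label)][value] += 1
            values_per_feature[j].add(value)
    return class_counts, cond_counts, values_per_feature, alpha


def predict(model: Model, row: Sequence[Hashable]) -> Hashable:
    """Return the class maximizing log prior + sum of log conditional probabilities."""
    class_counts, cond_counts, values_per_feature, alpha = model
    n = sum(class_counts.values())
    best_label, best_logp = None, -math.inf
    for label, count in class_counts.items():
        logp = math.log(count / n)  # log P(Y = C_k)
        for j, value in enumerate(row):
            # Smoothed estimate of P(X_j = value | Y = C_k); alpha is an assumption.
            num = cond_counts[(j, label)][value] + alpha
            den = count + alpha * len(values_per_feature[j])
            logp += math.log(num / den)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Hypothetical trips described by (mood, traffic flow); NOT the study's data.
X = [("calm", "congested"), ("calm", "congested"), ("excited", "free"),
     ("frustrated", "free"), ("calm", "free"), ("excited", "free")]
y = ["low", "low", "high", "high", "low", "high"]
model = fit_naive_bayes(X, y)
print(predict(model, ("excited", "free")), predict(model, ("calm", "congested")))
```

The denominator of the posterior is the same for every class, so the sketch compares only the numerators, exactly as in the argmax form of the rule.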

Multiple Linear Regression.
Multiple linear regression models the relationship between a dependent variable and one or more independent variables. Regression analysis generally includes the following steps: first, determine the independent and dependent variables in the data; second, calculate the coefficients of the multiple linear regression model and standardize the regression coefficients; third, use the fitted multiple linear regression model to make predictions. The general representation of the model is as follows:

$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon$$

where $\beta_0, \beta_1, \ldots, \beta_p$ are fixed and unknown coefficients, and the random variables $X_j$, $j = 1, \ldots, p$, denote the $p$ regressor variables. The random variable $\varepsilon$ denotes an error term; it is uncorrelated with the regressors and has expectation 0 and variance $\sigma^2 > 0$.
Although multiple linear regression is weaker than the previous machine-learning methods for classification problems, it is easier to interpret and can quantitatively express the relationship between the probability of occurrence of MW and each factor, so it was chosen to predict the occurrence of MW. A linear equation was fitted to predict MW using historical data on the occurrence of MW from the questionnaire.

Evaluation.
To verify whether the new factors we explored improve the accuracy of MW forecasting, we conducted three evaluation procedures. Each evaluation uses a different input set: the factors associated with MW found by our approach, the associated factors found by existing approaches, or only the factors that have a significant relationship with MW. The training data were the MW driving records obtained from the questionnaire, together with driver demographics and the relevant factors for each trip. 70% of the data were randomly chosen for training the models, and the remaining 30% were used for evaluating performance. To score the algorithms' performance, we report the accuracy and confusion matrix of each algorithm's forecast on the test data. Moreover, ROC curves, the area under the ROC curve (AUC), and the ranking of risk factors are shown to analyze the results.
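The evaluation protocol described above can be sketched as follows: a random 70/30 split, followed by accuracy and a binary confusion matrix on the held-out part. The sample size of 190 corresponds to the 129 male and 61 female participants reported earlier; the function names, the fixed seed, and the example label vectors are assumptions for illustration only.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_test_split(rows: Sequence[T], train_fraction: float = 0.7,
                     seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle the rows and cut them into a training part and a test part."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def confusion_matrix(actual: Sequence[int], predicted: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN), treating label 1 (high-risk trip) as positive."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    return tp, fp, fn, tn


def accuracy(actual: Sequence[int], predicted: Sequence[int]) -> float:
    """Share of test trips whose predicted class equals the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# 190 valid questionnaires, represented here only by their indices.
train_ids, test_ids = train_test_split(list(range(190)))
print(len(train_ids), len(test_ids))

# Hypothetical actual and predicted classes for five test trips.
actual = [1, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1]
print(confusion_matrix(actual, predicted), accuracy(actual, predicted))
```

Repeating the split with different seeds and averaging the resulting accuracies gives a more stable estimate than a single split on a data set of this size.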

Comparative Analysis.
The first evaluation procedure used the factors found in previous studies to be related to MW driving (driving experience, gender, age, time, duration, distance, and road pavement condition) as input variables for the forecast. The second evaluation procedure used the factors we investigated (age, gender, driving experience, educational background, duration, mood, distance, traffic flow conditions, location, time, trip purpose, weather, and road pavement condition) as input. The third evaluation procedure forecasts MW using only the factors that have a relatively significant relationship with MW (age, driving experience, duration, and mood).
For these three evaluations, the comparative results of random forest, Gradient Boosting Decision Tree, Naïve Bayes, and multiple linear regression are shown in the Table 9. We conducted forecasting 10 times for these experiments and calculated the average forecasting accuracy.
From Table 9, it can be seen that the variables we added significantly improve forecasting accuracy. Although some of these factors did not show a statistically significant relation with MW driving, they are convenient to collect and improve forecasting accuracy. Therefore, we analyzed the results of the MW forecast with the factors found by our approach in Section 3.
Comparing the four machine-learning algorithms, the GBDT algorithm achieves an accuracy of 73.69%, while the random forest, Naïve Bayes, and multiple linear regression algorithms achieve 70.28%, 68.42%, and 65.42%, respectively. The GBDT algorithm therefore performs best in the forecast of MW. Achieving acceptable forecasting results using only the driver's personal information and the contextual information of the trip is a meaningful result.
Furthermore, we analyze the forecasting results of these four algorithms in detail. The forecast confusion matrix [48] of the algorithms is shown in Figure 6.
The results show that the forecast values of the random forest, GBDT, and Naïve Bayes algorithms are lower than the actual values of MW, with Naïve Bayes showing the poorest performance. In contrast, the forecast value of the linear regression algorithm is higher than the actual value of MW.

Receiver Operating Characteristic.
The receiver operating characteristic (ROC) is widely used for evaluating the performance of binary classifiers. The ROC curve shows how the number of correctly classified positive cases varies with the number of incorrectly classified negative cases [46]. Its x-axis is the false positive (FP) rate and its y-axis is the true positive (TP) rate. The FP rate is the proportion of actual negative instances that are incorrectly classified as positive, and the TP rate is the proportion of actual positive instances that are correctly predicted as positive. The random forest, GBDT, and Naïve Bayes models were built from the training data set, and their ROC performance was evaluated on the test data set. The ROC curves are shown in Figure 7.
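To make the construction of the curve concrete, the following minimal sketch computes the (FP rate, TP rate) points by sweeping the decision threshold from the highest predicted score downward. The function name and the toy scores are hypothetical and only serve the illustration.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by lowering the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Visit samples from the highest to the lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for rank, i in enumerate(order):
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            fpr.append(fp / negatives)
            tpr.append(tp / positives)
    return fpr, tpr

# Toy example: 1 = MW occurred, scores = hypothetical predicted probabilities.
y_true_demo = [1, 1, 0, 1, 0, 0]
scores_demo = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
fpr_demo, tpr_demo = roc_points(y_true_demo, scores_demo)
```

Each point corresponds to one threshold: every sample scored at or above it is forecast as MW, so the curve always starts at (0, 0) and ends at (1, 1).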
If a classifier predicts all cases correctly, its ROC curve passes through the upper-left corner. Because our sample size is not very large, the raw ROC curves are not smooth; to show the differences between them, we smoothed the curves with a polynomial fit. Figure 7 shows that the ROC curve of the GBDT algorithm lies above those of the random forest and Naïve Bayes (GNB) algorithms, indicating that the GBDT algorithm performs better in the MW forecast. The MW forecasting results of the random forest and GNB algorithms show no significant difference.
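The polynomial smoothing mentioned above could look roughly like the sketch below, which fits a low-degree polynomial to the ROC points with NumPy. The degree of 3, the grid size, the clipping step, and the sample points are assumptions; the text does not report which degree was used.

```python
import numpy as np

def smooth_roc(fpr, tpr, degree=3, n_points=50):
    """Least-squares polynomial fit of TP rate as a function of FP rate."""
    coeffs = np.polyfit(fpr, tpr, deg=degree)
    grid = np.linspace(0.0, 1.0, n_points)
    # Clip because a fitted polynomial can leave the valid [0, 1] range.
    smooth = np.clip(np.polyval(coeffs, grid), 0.0, 1.0)
    return grid, smooth

# Hypothetical raw ROC points from a small test set.
fpr_raw = [0.0, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0]
tpr_raw = [0.0, 0.3, 0.5, 0.7, 0.8, 0.95, 1.0]
grid, tpr_smooth = smooth_roc(fpr_raw, tpr_raw)
```

A fitted polynomial is not guaranteed to be monotonic or to pass exactly through (0, 0) and (1, 1), so a smoothed curve of this kind is best read as a visual aid rather than as the basis for computing the AUC.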
The area under the ROC curve (AUC) is used to evaluate the overall performance of a classifier quantitatively. The AUC is the area enclosed by the ROC curve, the horizontal axis, and the right boundary of the ROC space; its maximum value of 1 indicates a perfect classification result, and a value of 0.5 indicates that the classifier performs no better than random guessing [47]. As shown in Figure 7, the GBDT algorithm achieves the best performance in the MW forecast, with an AUC of 0.71. The random forest and GNB algorithms both achieve an AUC of 0.62.
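Because the AUC is defined as an area, it can be approximated from the ROC points with the trapezoidal rule. The sketch below is illustrative only; the function name and the example points are hypothetical.

```python
def auc_trapezoid(fpr, tpr):
    """Area under a piecewise-linear ROC curve via the trapezoidal rule."""
    area = 0.0
    for k in range(1, len(fpr)):
        width = fpr[k] - fpr[k - 1]
        mean_height = (tpr[k] + tpr[k - 1]) / 2.0
        area += width * mean_height
    return area

# Reference cases: a perfect classifier and a chance-level diagonal.
auc_perfect = auc_trapezoid([0.0, 0.0, 1.0], [0.0, 1.0, 1.0])  # 1.0
auc_chance = auc_trapezoid([0.0, 1.0], [0.0, 1.0])             # 0.5

# Hypothetical ROC points from a six-sample test set.
fpr_pts = [0.0, 0.0, 0.0, 1 / 3, 1 / 3, 2 / 3, 1.0]
tpr_pts = [0.0, 1 / 3, 2 / 3, 2 / 3, 1.0, 1.0, 1.0]
auc_demo = auc_trapezoid(fpr_pts, tpr_pts)
```

The two reference cases reproduce the interpretation given in the text: an area of 1 for a perfect classifier and 0.5 for a classifier equivalent to random guessing.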

In summary, the comparison indicates that the GBDT algorithm has an advantage in the MW forecast, with higher forecasting accuracy and acceptable performance. Thus, the GBDT algorithm has the potential to be applied in real-life MW forecasts.

Ranking of Risk Factors.
To evaluate the contribution of different factors to the forecast of the frequency of MW, we extracted the variables' weight distribution. The variables' ranks from the GBDT algorithm and from the data analysis are compared in Table 10 and Figure 8; Table 10 gives the ranking of the variables' weight distribution. In Table 10 [...] gender have a relatively significant impact on the occurrence of MW. Trip purpose and road pavement conditions do not significantly affect the occurrence of MW. Figure 8 also shows a consistent trend in the ranking of the influencing factors across the two analyses, although some individual differences remain: the importance ranks of weather and traffic flow conditions differ between the two analyses. Although these two factors are ranked differently, neither has a significant impact on MW.
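One common way to obtain such a ranking from a GBDT model is through impurity-based feature importances, sketched here with scikit-learn. Whether the authors extracted the weights this way is an assumption; the feature names and the synthetic data are hypothetical and do not reproduce Table 10.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical feature names: two informative factors and two pure-noise columns.
feature_names = ["duration", "age", "noise_a", "noise_b"]

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
latent = 1.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400)
y = (latent > 0).astype(int)  # 1 = MW occurred (synthetic label)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# feature_importances_ is normalized so that the values sum to 1.
importances = model.feature_importances_
ranking = sorted(
    zip(feature_names, importances), key=lambda pair: pair[1], reverse=True
)
for rank, (name, weight) in enumerate(ranking, start=1):
    print(f"{rank}. {name}: {weight:.3f}")
```

Comparing a ranking of this kind with the ranking from the questionnaire analysis is what allows the consistency check described in the text.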

Discussion and Conclusion
As a high-risk driving behavior, MW can give rise to traffic crashes and is a potential safety hazard. This study identified previously unexplored extrinsic and intrinsic factors that trigger MW and indicated the extent of each factor's influence on MW. We proposed a framework for predicting MW to cope with its randomness and uncertainty. Then, we compared four machine-learning-based approaches for forecasting the occurrence of MW. Our research can provide suggestions for developing MW driving prevention strategies and can be applied to in-vehicle information systems or mapping software to give drivers early warning regarding MW.
MW is widespread, but its occurrence is not completely untraceable: it is linked to certain external factors. It can be concluded from the data analyses and from the variable importance of the machine-learning algorithms that duration, age, and driving experience have significant impacts on the occurrence of MW. Educational background, mood, distance, traffic flow conditions, and location also have relatively obvious influences on MW. In contrast, gender, time, trip purpose, weather, and road pavement conditions have negligible influence on MW. Among in-vehicle environmental factors, chatting with passengers while driving, a comfortable temperature, and playing multimedia in the vehicle are associated with MW driving. In the presence of passengers, the mean proportion of time with elevated gravitational-force events in curves is significantly higher than when there is no passenger [4]. Driving with passengers, or with people the driver knows, affects driving behavior. As to playing multimedia in the vehicle, however, a previous study showed that radio tuning tasks might seem reasonably distracting as drivers get inattentive in no-task driving [48]. Thus, it is common for drivers' minds not to stay focused on the driving task.
For off-task thoughts, 28.4% of drivers reported that their minds are most likely to wander to work-related routines during MW; "personal emotional problems" and "family-related affairs" account for 23.2% and 18.9%, respectively. The main topics of daily-routine-related wandering also differ across age groups: adolescent drivers focus on personal emotional problems, and with increasing age, drivers' focus shifts to career and family. Furthermore, when the issues a driver is preoccupied with actually arise in life, the driver is more easily caught in MW driving.
Once the factors associated with MW were identified, they were used to predict MW. The random forest algorithm, the Gradient Boosting Decision Tree (GBDT) algorithm, the Naïve Bayes algorithm, and the multiple linear regression model were chosen to predict MW. Comparing the four machine-learning algorithms, the accuracy of the GBDT algorithm is 73.69%, and the random forest, Naïve Bayes, and multiple linear regression algorithms have accuracies of 70.28%, 68.42%, and 65.42%, respectively. The GBDT algorithm achieves the best performance in the MW forecast, with an AUC of 0.71, while the random forest algorithm and the Naïve Bayes algorithm both achieve an AUC of 0.62.
Among the factors used to predict MW, some are considered here for the first time: drivers' educational background, trip purpose, traffic flow condition, location, and mood. The results of the comparison experiments illustrate that adding these factors effectively improves the forecasting accuracy of MW.
In the future, it is advised that under the premise of ensuring the authenticity of survey samples, more drivers should be surveyed to enlarge the data set. In addition, when more dimensions of data related to MW are collected, some factors that are more detailed such as the type of vehicle can be considered to improve the forecasting accuracy of MW.
MW is a high-risk driving behavior that is hard to detect. However, its occurrence is not untraceable. This paper reported some factors related to MW (e.g., educational background and mood). Based on these findings, machine-learning algorithms can be used to forecast the occurrence of MW, and the factors we explored can improve forecasting accuracy. The evaluation shows that the GBDT algorithm performs better than the other algorithms in the MW forecast.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.