Exploring the Behavior-Driven Crash Risk Prediction Model: The Role of Onboard Navigation Data in Road Safety



Introduction
Traffic is a complex system that consists of the driver, vehicle, road, and environment, in which an abnormality in one or more components heightens crash risk. Researchers have extensively studied the relationship of vehicles, roads, and the environment with crash risk and have built real-time risk prediction models. However, few studies evaluate the impact of driver behavior on crash risk. Intuitively, there should be a close connection between driver behavior and crashes. Most existing research on driving behavior has been based on perception surveys [1] or vehicle trajectories [2] captured by video cameras. On the one hand, the response rate of questionnaire surveys tends to be low, and the research results are strongly affected by the quality of respondents' answers. On the other hand, extracting vehicle trajectory data is cumbersome, and often only a short stretch of trajectory is filmed. In earlier studies, given the difficulty of obtaining real-time driving behavior data, behavior was often treated as a source of heterogeneity in the analysis [3] and handled via a random effect model to clarify the mechanism of the crash [4]. Distraction [5], fatigue [6], and drunk driving [7] were the heterogeneities commonly considered. Later, by using eye trackers or simulators, different driver attributes were also fully considered [3]. An ideal way to obtain driving behavior is to use onboard sensors and GPS to transmit real-time vehicle position and operating condition information, including longitude and latitude coordinates, vehicle speed, acceleration, and steering angle, and to use such data to assess driving behavior and crash risk. An easily accessible source of this type of data is online navigation software operating in conjunction with high-precision maps.
To fill the gap regarding the role of driving behavior in crash risk theory, and hence allow timely implementation of safety remediation measures, this study deals with extreme driving behavior (termed herein abnormal driving behavior) observed on the China G25 expressway. The behavior data were provided by AutoNavi and cover a span of 7 days and 120 km, together with the crashes that occurred during the same period. This facilitated the development of a crash risk prediction model based on real-time driving behavior. AutoNavi first screened out abnormal driving behavior on the expressway, including sharp-left-lane-change, sharp-right-lane-change, sharp-acceleration, and sudden-braking behaviors. For each crash, the features of risky driving behavior occurring before the crash were attached, and a case-control method was applied to generate datasets with different proportions of positive and negative cases. The crash risk prediction models are established via various machine learning methods, and the role of driving behavior is explored via logit regression and partial dependence plots.
A key contribution of this study is to verify the feasibility of predicting crash risk using abnormal driving behavior collected by the onboard navigation system. Regression using feature variables generated from the original data is presented, and these generated variables are found to be interpretable and to improve the performance of the prediction model. Another contribution is that, through the behavior-driven risk prediction model, the relationship between driving behaviors and crash risk is studied in depth, which can provide drivers with risk avoidance advice.
The paper is organized as follows. This section is the introduction, followed by Section 2, which covers the literature review. In Section 3, the processing of the dataset and the methodology are described. Section 4 gives the regression model results and an in-depth analysis of the variables. Limitations of the study and future work are presented in Section 5, along with a concluding summary of the research.

Literature Review
Many scholars have explored the contributory factors of traffic crashes from various perspectives. Corresponding risk prediction models have also been established for real-time risk prediction [8]. Macroscopic traffic-flow information such as traffic volume, speed, and density is collected to assess its impact on crash risk [9,10]. Microscopic traffic data, such as headway [11] collected via onboard cameras and radar sensors, are used to reconstruct the process of crash formation. Later, scholars introduced more factors relating to road alignment and the environment, including horizontal and vertical curves [12], pavement surface [13], illumination conditions [14], and the weather [15].
When it comes to the effects of driving behavior on crash risk, many data collection methods have been proposed. The most economical way to collect driving behavior data is to conduct questionnaire surveys, which have identified significant driver-related crash factors such as drinking [16], fatigue, and distracted driving [17]. However, the response rate of questionnaire surveys tends to be low, and real-time tracking is not possible. Another widely used method to analyze driving behavior is to examine vehicle trajectories. Researchers have used onboard cameras or drones to record vehicle trajectories and explore crash risk factors in combination with microscopic traffic-flow indicators [18]. Meanwhile, more publicly available datasets from naturalistic driving studies are being used to assess driving risk [19]. Furthermore, researchers found that combining driving behavior and traffic-flow parameters to predict risk can greatly improve prediction performance. Ma et al. [20] used the frequency of risky driving behaviors and traffic-flow data to establish a short-term crash risk prediction model. However, given the difficulty and cost of data acquisition, large-scale and long-term collection of driving behavior and traffic-flow parameters is still difficult to apply in practice. The information silo effect of roadside and in-vehicle devices also makes the collaborative evaluation of multisource data challenging.
Nowadays, modern onboard navigation software can effectively capture vehicle movement information [21]. In China, Amap and Baidu Map together hold more than 50% of the market share. In the U.S., Google Maps has a wide user base. A number of vehicle manufacturers, e.g., Tesla®, Cadillac®, and BYD®, have also partnered with navigation companies, so that the navigation system provides drivers with lane-level positioning as well as a variety of speed and traffic limit information for the vehicle. The use of data from navigation software with high market penetration to assess safety risks on the roads is therefore very promising and commercially viable. There is a diversity of vehicle brands on the roads, and relying on high-performance in-vehicle sensors installed across many brands to collect data would inevitably increase the cost. Collecting data through navigation software is thus the most economical approach and also the most suitable way for the relevant companies to develop this function.
For research that explores the influencing factors of crash risk, machine learning methods can readily improve the accuracy of the prediction model, but their black-box nature complicates interpretation [22]. Therefore, researchers often use surrogate models (e.g., logistic regression) [23] and visualization methods (e.g., SHAP and PDP) [24] to interpret machine learning models.
To sum up, as illustrated by Figure 1, driving behavior plays an important role along with other factors in the formation of a crash, while current behavior data collection methods incur substantial costs and are also constrained by the information silo effect. In this research, abnormal driving behavior data provided by AutoNavi® are utilized to generate datasets based on crash data via the case-control method. Multiple machine learning methods are used to establish the behavior-driven risk prediction model and explore the impact of risky driving behavior on traffic crashes, which will help in the development of appropriate risk prediction functions to be applied on a large scale.

Data Preparation and Methodology
3.1. Data Preparation. The data on incidents of abnormal driving behavior on a 120-kilometer section of the G25 expressway in China from September 26 to October 3, 2020, were provided by AutoNavi Software Co. Ltd. The study area is the section of the G25 expressway from Liyang to Changxing, which has 3 lanes in both directions and an average annual daily traffic volume of 56,772 vehicles. The alignment is gentle, and the fluctuation is small. There are no large- or medium-sized cities or large entrances along the road, and the traffic volume does not change significantly from section to section. Crash data were provided by the authority responsible for managing the G25 expressway. On the abovementioned dates and expressway sections, over 140,000 incidents of abnormal driving behavior and 284 crashes were recorded. The original data collected by AutoNavi were sampled via onboard GPS once per second and include acceleration, speed, course angle, and latitude and longitude changes during vehicle movement. AutoNavi classifies a vehicle motion state with these parameters at the 90th percentile as an "abnormal" behavior and defines four incident types based on specific weighting parameters. Incidents of abnormal driving behavior were accordingly divided into four categories, namely, sharp-left-lane-change, sharp-right-lane-change, sharp-acceleration, and sudden-braking. The specific technical details, including the thresholds and weights, cannot be disclosed due to a confidentiality agreement. However, the threshold is not the focus of this research. This study aims to propose a comprehensive methodology for studying the feasibility of establishing a risk prediction model based on extreme driving behavior. The kinematic parameters of driving behaviors provided in the processed dataset are sufficient for the research. The features of the 4 types of abnormal behavior data are listed in Table 1.
Crash data and behavior data were linked using coordinates and time as key values. Previous studies have shown that behaviors around the crash site show some high-risk characteristics before the crash occurs. Ma et al. [25] pointed out that there is a generalized linear relationship between the length of the road section and the number of crashes and that, within a certain length range, the heterogeneity caused by road alignment and section location can be eliminated to a certain extent. Hence, the mean value 120 km/284 ≈ 422 m is taken as the theoretical section length. Because the recorded crash coordinates are not perfectly precise, a margin is added and a spatial range of 500 m is chosen as the practical target section. That is, for each crash coordinate, abnormal driving behaviors within 250 meters before and after the crash position were extracted via GIS software, as shown in Figure 2. The red cross represents a crash, and each solid pink point is a driving behavior.
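The section-length arithmetic and the 250 m extraction window can be sketched in a few lines. This is only an illustration under stated assumptions: the study performed the extraction in GIS software on coordinates, whereas the sketch assumes each record already carries a hypothetical `pos_m` field (distance along the road in metres, increasing in the direction of travel), so that "upstream" means a smaller `pos_m` than the crash.

```python
SECTION_LENGTH_M = 120_000  # 120 km study section (from the text)
N_CRASHES = 284             # crashes recorded in the study period (from the text)
HALF_WINDOW_M = 250         # 250 m before and after the crash position


def mean_crash_spacing(length_m, n_crashes):
    """Average road length per crash, used as the theoretical section length."""
    return length_m / n_crashes


def behaviors_near_crash(crash_pos_m, behaviors, half_window_m=HALF_WINDOW_M):
    """Split behaviors inside the window into upstream and downstream lists.

    `pos_m` is a hypothetical along-road position field, not a field
    named in the source dataset.
    """
    upstream, downstream = [], []
    for b in behaviors:
        offset = b["pos_m"] - crash_pos_m
        if -half_window_m <= offset < 0:
            upstream.append(b)
        elif 0 <= offset <= half_window_m:
            downstream.append(b)
    return upstream, downstream
```

The mean spacing evaluates to roughly 422.5 m, which the text rounds to 422 m before widening the window to 500 m.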
[Figure 1 text: Collecting driver characteristics. Low recovery rates and uneven quality. Behavioral and psychological info. Samples are small and costly.]

Compared to normal traffic operation status, a crash is actually a rare event. The case-control method used for constructing datasets for regression can help analyze risk factors that deviate from normal operating status. The dependent variable of crash data (positive case) is set to 1, and the noncrash data (negative case) with dependent variable 0 can be obtained by the following two steps (also shown in Figure 2):

Step 1. The extraction interval T (e.g., 5 minutes) of additional features is determined, and the crash data then include the abnormal driving behaviors within the time window ranging from 0 to T before the crash.

Step 2. The ratio R of noncrash to crash data (e.g., 3) is determined, and the location of the crash in Step 1 is kept unchanged. The abnormal driving behaviors within the multiple time windows ranging from n × T to (n + 1) × T before the crash are taken as the noncrash data, where n runs from 1 to R.
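The two steps can be sketched as a small function that returns the labelled time windows for one crash. The function name and the use of seconds as the time unit are assumptions made for illustration; the source does not specify how the windows were implemented.

```python
def case_control_windows(crash_time_s, interval_s, ratio):
    """Return (label, start, end) time windows for one crash.

    Label 1 is the crash window [crash - T, crash]; label 0 windows
    cover [crash - (n + 1) * T, crash - n * T] for n = 1..R.
    Times are in seconds (an assumption of this sketch).
    """
    windows = [(1, crash_time_s - interval_s, crash_time_s)]
    for n in range(1, ratio + 1):
        start = crash_time_s - (n + 1) * interval_s
        end = crash_time_s - n * interval_s
        windows.append((0, start, end))
    return windows


# Example: T = 5 minutes, R = 3 noncrash windows per crash
for label, start, end in case_control_windows(10_000, 300, 3):
    print(label, start, end)
```

Each crash therefore contributes one positive record and up to R negative records at the same location, which is why the realized ratios can fall short when no behaviors were recorded in a window.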
Thus, multiple case-control datasets with different extraction intervals and different data ratios are obtained. In past research on crash prediction driven by traffic-flow data, multiple sets of time intervals were used to collect data on traffic-flow characteristics and the corresponding prediction models were established [26]. Guo et al. [27] collected the frequency of abnormal driving behavior in a 1-hour interval for modeling, which achieved good performance. Following the literature, five intervals of 5, 15, 30, 45, and 60 minutes were used in this paper to construct datasets, and the optimal one was selected according to model performance. Four ratios, 1 : 1, 1 : 3, 1 : 10, and 1 : 20, were selected to investigate the influence of the ratio of positive to negative cases on the regression results [28]. Because behavior data are missing near some crash points, the ratios in the constructed datasets are not exact, but they approximately meet the proportion requirements. Thus, 20 datasets were obtained, and the actual number of positive and negative cases in each dataset is shown in Table 2. For each crash or noncrash record, three derived variables are calculated for each behavior category c: the number of behaviors n_c, the average of the maximum speeds of those behaviors \bar{v}_c, and the average of their maximum accelerations \bar{a}_c, using the following equation:

$$n_c = N,\qquad \bar{v}_c = \frac{1}{N}\sum_{i=1}^{N} \mathrm{max\_v}_i,\qquad \bar{a}_c = \frac{1}{N}\sum_{i=1}^{N} \mathrm{max\_a}_i \tag{1}$$

where max_a and max_v are the features illustrated in Table 1, subscript i is the behavior data id, and N is the total number of behaviors of category c. The naming principles and meanings of the variables in the dataset are given in Table 3. In summary, each record includes a dichotomous dependent variable and 24 behavioral characteristics as independent variables.
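A minimal sketch of how the 24 derived variables might be assembled for one record follows. The input field names (`category`, `side`, `max_v`, `max_a`) are hypothetical, and filling the averages with 0.0 when a category has no behaviors is an assumption of this sketch; the source does not state how empty groups were handled.

```python
def derive_features(behaviors):
    """Build the 24 variables named <category>_<feature>_<side>.

    Categories: L, R, A, B; features: n (count), v (mean max speed),
    a (mean max acceleration); sides: up, dn.
    """
    features = {}
    for cat in "LRAB":
        for side in ("up", "dn"):
            group = [b for b in behaviors
                     if b["category"] == cat and b["side"] == side]
            n = len(group)
            features[f"{cat}_n_{side}"] = n
            # Assumption: averages default to 0.0 for an empty group.
            features[f"{cat}_v_{side}"] = (
                sum(b["max_v"] for b in group) / n if n else 0.0)
            features[f"{cat}_a_{side}"] = (
                sum(b["max_a"] for b in group) / n if n else 0.0)
    return features


sample = [
    {"category": "B", "side": "dn", "max_v": 20.0, "max_a": 4.0},
    {"category": "B", "side": "dn", "max_v": 30.0, "max_a": 6.0},
    {"category": "A", "side": "up", "max_v": 10.0, "max_a": 2.0},
]
record = derive_features(sample)
print(record["B_n_dn"], record["B_v_dn"], record["B_a_dn"])
```

Four categories, three features, and two sides give the 4 × 3 × 2 = 24 independent variables mentioned in the text.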

Logistic Regression (LR).
Currently, there are many methods to assess the contributory factors of a crash [29]. Logistic regression is a mature statistical model and a simple but effective binary classifier that belongs to the family of generalized linear models. Consider a dataset of pairs {(x, y)}, where x is the N-dimensional feature variable and y is the binary dependent variable. The mathematical form of LR is as follows:

$$\ln\frac{r}{1-r} = \alpha + \beta^{\top} x \tag{2}$$

where r is the possibility index, which in this study denotes the risk of a crash, α is the constant term, x is the independent variable, and β is the estimated coefficient. The analysis of variance (ANOVA) test is applied to judge the significance level of variables in the model. During the variance test, all highly significant variables with a p value below 0.05 should be retained as far as possible.

Machine Learning
The artificial neural network (ANN) is a machine learning algorithm based on the perceptron model. An ANN contains an input layer, an output layer, and several hidden layers, and the connections between neurons in all layers have learnable weight parameters. Through model learning, the connection weights between neurons are continuously adjusted so that the model output gradually approaches the real situation.
Naive Bayes is a well-established classifier whose core idea is to compute the probability of each class under the given conditions and use the class with the highest probability as the output. The parameters of Naive Bayes are estimated by maximum likelihood, and the model shows robust performance on noisy datasets.
RUSBoost (random undersampling boost) is a combination of undersampling and the AdaBoost algorithm designed to handle imbalanced data. The weak learner is trained on a balanced dataset constructed through random sampling, and the ensemble algorithm is then used to obtain a classifier with higher accuracy. Although the SMOTE algorithm also addresses the imbalance problem [30], it enlarges the dataset and increases the training time, and studies have shown no significant difference in performance between oversampling and random sampling [31].
Each dataset in Table 2 is divided into a training set and a validation set to conduct 5-fold cross-validation in the regression process. Then, a test set with 15% of the samples of the original dataset is applied to verify the model performance.
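The undersampling step alone can be illustrated as follows. This sketch is not RUSBoost itself: it shows only how a balanced subset might be drawn before one weak learner is trained, assumes a 1 : 1 target ratio, and omits the AdaBoost weighting and ensemble steps entirely.

```python
import random


def random_undersample(samples, labels, seed=0):
    """Keep all minority-class samples and an equal-sized random draw
    from the majority class (1 : 1 target ratio assumed)."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [samples[i] for i in kept], [labels[i] for i in kept]


# Example with a 1 : 20 imbalance, as in the most imbalanced datasets
X = list(range(105))
y = [1] * 5 + [0] * 100
X_bal, y_bal = random_undersample(X, y)
print(len(X_bal), sum(y_bal))
```

In a boosting setting, a fresh balanced subset of this kind would be drawn at each iteration, so different majority-class samples are seen by different weak learners.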
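One way to realize this split is sketched below. The source does not state the order of operations or whether the folds were stratified; the sketch assumes the 15% test set is held out first and the remaining samples are then divided into five unstratified folds.

```python
import random


def split_indices(n_samples, test_fraction=0.15, n_folds=5, seed=0):
    """Hold out a test set, then build (train, validation) index pairs
    for k-fold cross-validation on the remaining samples."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    n_test = round(n_samples * test_fraction)
    test, rest = idx[:n_test], idx[n_test:]
    folds = [rest[k::n_folds] for k in range(n_folds)]
    splits = []
    for k in range(n_folds):
        val = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        splits.append((train, val))
    return test, splits


test_idx, cv_splits = split_indices(200)
print(len(test_idx), [len(val) for _, val in cv_splits])
```

With strongly imbalanced data, a stratified variant that preserves the crash share in every fold would usually be preferable, since an unstratified fold may contain very few positive cases.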

Performance Evaluation Metrics.
The datasets obtained in Section 3.1 are typical imbalanced classification datasets, so comprehensive indicators should be selected to evaluate the performance of the model. The confusion matrix of the predicted results is shown in Table 4. The true positive rate (TPR) and false positive rate (FPR) can then be calculated by using the following equations:

$$\mathrm{TPR} = \frac{TP}{TP + FN} \tag{3}$$

$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{4}$$

where TP, FN, FP, and TN are the numbers of true positives, false negatives, false positives, and true negatives in the confusion matrix. Multiple (FPR, TPR) coordinates can be obtained by adjusting the classification threshold; connecting them in sequence draws the receiver operating characteristic (ROC) curve, from which the area under the curve (AUC) is calculated. The AUC is a comprehensive evaluation index that lies between 0.5 and 1 for a classifier that performs better than random guessing. The closer it is to 1, the better the classifier.
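The threshold sweep described above can be written out directly. The sketch below uses the standard definitions of TPR and FPR and integrates the ROC curve with the trapezoidal rule; the function names are illustrative and are not taken from the source.

```python
def tpr_fpr(labels, scores, threshold):
    """True and false positive rates when score >= threshold is 'crash'."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    return tp / (tp + fn), fp / (fp + tn)


def roc_auc(labels, scores):
    """Area under the ROC curve via the trapezoidal rule."""
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tpr, fpr = tpr_fpr(labels, scores, t)
        points.append((fpr, tpr))
    points.append((1.0, 1.0))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(roc_auc(labels, scores))  # 0.75: 3 of 4 crash/noncrash pairs ranked correctly
```

The AUC equals the share of (crash, noncrash) pairs in which the crash record receives the higher score, which is why it is insensitive to the class ratio.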
3.2.4. Model Interpretation Method. The partial dependence plot (PDP) is a visualization method widely used to explain the joint effect of two independent variables on the dependent variable in a model. For the variables x_E to be explained, the following equation is generally used to calculate their partial dependence effect:

$$\hat{f}(x_E) = \frac{1}{N}\sum_{i=1}^{N} f\left(x_E, x_S^{(i)}\right) \tag{5}$$

where f is the fitted model, N is the number of samples in the dataset, and x_S^{(i)} denotes the values in the i-th sample of the variables other than x_E.
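The partial dependence of two variables can be computed by brute force, as in the sketch below. The `predict` argument stands for any fitted model that maps a feature list to a risk value; it, the index arguments, and the grids are placeholders for illustration rather than part of the study's code.

```python
def partial_dependence_2d(predict, data, idx_a, idx_b, grid_a, grid_b):
    """Average prediction over the dataset with features idx_a and idx_b
    forced to each grid pair; all other features keep their sample values."""
    surface = []
    for va in grid_a:
        row = []
        for vb in grid_b:
            total = 0.0
            for x in data:
                x_mod = list(x)
                x_mod[idx_a] = va
                x_mod[idx_b] = vb
                total += predict(x_mod)
            row.append(total / len(data))
        surface.append(row)
    return surface


# Toy model (hypothetical): additive in three features
def toy_predict(x):
    return x[0] + 2 * x[1] + x[2]


toy_data = [[0, 0, 1], [0, 0, 3]]
pdp = partial_dependence_2d(toy_predict, toy_data, 0, 1, [1, 2], [10])
print(pdp)
```

Plotting the returned surface as a heatmap over the two grids gives the kind of interaction plot discussed in the results, with one axis per explained variable.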

Results and Discussion
4.1. Model Accuracy Performance. LR and the machine learning methods were applied to the case-control datasets obtained in Section 3, and ANOVA was conducted to ensure that the significance (p value) of every variable remaining in the model was below 0.05. All the models have been verified by cross-validation, and their AUCs on the various datasets are shown in Figure 3.
From Figure 3, on the one hand, for the same sampling time interval, the regression performance of the models gradually improves as the imbalance ratio of the dataset increases, except for ANN. Too little data can lead to very poor model performance, as can be seen for each model with data ratios of 1 : 1 and 1 : 3 and a time interval of 5 minutes. On the other hand, for the same sampling ratio, model performance fluctuates noticeably, except for Naive Bayesian, which indicates that blindly expanding the data scale and time interval is not conducive to improving the regression model. Note that the AUC reaches its best value of 0.782 with the RUSBoost model, a ratio of 1 : 20, and a time interval of 15 minutes; therefore, this model is finally selected and interpreted.
If only the frequency of occurrence of abnormal driving behavior is used, the prediction results are as shown in Figure 4. Models using only behavioral frequency variables performed worse on all datasets than models with speed and acceleration variables attached. The two additional variables do not directly characterize the traffic-flow parameters but still reflect the traffic operation state to some extent, which enhances the behavior-driven crash prediction model.
Compared with other data sources, the abnormal driving behavior data collected by the navigation system also have advantages in accuracy and usability. Table 5 enumerates the performance of risk prediction models using other data sources. The performance of the abnormal-behavior-driven RUSBoost model, as measured by the AUC value, is better than that of the machine learning models using traditional traffic-flow data [20,32]. Previous studies have demonstrated that the simultaneous use of driving behavior and other traffic information can substantially improve the prediction results [33,34]. Indeed, the more detailed the data, the better the performance of the model, but the cost and acquisition difficulty also increase. The abnormal driving behavior data used in this study are easy for navigation enterprises to obtain, and the performance of the resulting model is greatly improved compared with models driven by traditional traffic-flow data, which achieves a balance between accuracy and the difficulty of data acquisition. In summary, it is technically and economically feasible for navigation companies to use abnormal driving behavior data to predict crash risk.

Discussion of Factor Impact.
Variable format and subscript meaning (Table 3)
Capital letter: the category of abnormal behavior
    L: sharp-left-lane-change behavior
    R: sharp-right-lane-change behavior
    A: sharp-acceleration behavior
    B: sudden-braking behavior
Letter after the 1st underscore: the derivative feature
    n: number of abnormal behaviors
    a: the average max_accelerate during all abnormal behaviors
    v: the average max_velocity during all abnormal behaviors
Letter after the 2nd underscore: the position relative to the accident site
    up: within 250 m upstream
    dn: within 250 m downstream
e.g., B_v_dn means "average max_v of all sudden-braking within 250 m downstream of the crash coordinate"

Journal of Advanced Transportation

LR is a statistical analysis method with good interpretability. According to the definition of equation (2), the estimated parameter β is the contribution of the behavior feature to the crash risk, and its exponent e^β means that when the variable increases by 1, the odds of a crash, r/(1 − r), become e^β times what they would otherwise be. The regression results for each LR experiment are shown in Tables 6-9. Here, the focus is on findings common to multiple sets of experiments. In each of the 20 LR experiments, the three variables with the largest absolute coefficient estimates were retained, and their frequency of occurrence was counted. The frequency histogram is shown in Figure 5(a). Clearly, the acceleration features of sharp-acceleration and sudden-braking behaviors occur by far the most often, while features of sharp-lane-change behavior appear very rarely, in fact only once. Consequently, subsequent PDP analysis is also dominated by the interaction of the characteristics of these two behaviors. Figure 5(b) describes in more detail the distribution of the estimates of these variables. The vertical axis lists the variable names, sorted in the same frequency order as Figure 5(a), and the horizontal axis is the estimate. Each dot in Figure 5(b) records one occurrence of the corresponding variable and the exponent value of its coefficient estimate. The blue line perpendicular to the X-axis demarcates the boundary of influence, with dots to the left of the line representing a negative impact on risk and dots to the right representing a positive effect. It can be clearly seen that the variables with a strong impact on crash risk, whose estimates are much greater than 0, are the four acceleration variables of sharp-acceleration and sudden-braking behaviors (A_a_up, A_a_dn, B_a_up, and B_a_dn). Even if other features appear in the model, they have little influence on crash risk. Hence, it can be considered that the acceleration features, which reflect the intensity of driving behavior, play a dominant role in the formation of expressway crashes. In addition, four speed-related variables (B_v_up, L_v_dn, B_v_dn, and A_v_up) showed negative correlations. The number of sharp-acceleration behaviors was negatively correlated, and the number of sudden-braking behaviors was positively correlated (more detailed numerical results are given in Tables 6-9 and are not repeated here).
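The interpretation of e^β can be checked numerically with a small sketch. The coefficient values below are hypothetical and chosen only for illustration; they are not estimates from Tables 6-9. The sketch assumes the standard logistic form, in which the crash risk is the sigmoid of α plus the weighted sum of the variables.

```python
import math


def crash_risk(alpha, betas, x):
    """Logistic regression risk for feature values x."""
    z = alpha + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    """Odds corresponding to probability p."""
    return p / (1.0 - p)


# Hypothetical values: alpha = -2.0, a single coefficient beta = 0.7
alpha, beta = -2.0, 0.7
risk_before = crash_risk(alpha, [beta], [1.0])
risk_after = crash_risk(alpha, [beta], [2.0])  # variable increased by 1
odds_ratio = odds(risk_after) / odds(risk_before)
print(odds_ratio, math.exp(beta))
```

The two printed values agree up to floating-point error: a unit increase multiplies the odds by e^β, whereas the change in the risk itself depends on where on the sigmoid curve the record sits. This is one reason a large coefficient can coexist with a small change in risk in a partial dependence plot.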
In order to analyze the risk factors in driving behavior in depth and derive quantitative risk avoidance advice, PDP analysis was conducted for the two behaviors on acceleration versus number of occurrences and on speed versus number of occurrences, respectively. The results are shown in Figures 6 to 8. Not all variables for sharp-acceleration and sudden-braking behavior were retained in the LR regressions, resulting in a total of three PDP results.
Figure 6 illustrates the interaction between the acceleration and the number of sharp-acceleration behaviors on risk. It can be seen that although the estimate for acceleration in the LR results is large, the magnitude of the change in risk is very small, at only 0.02. The interaction of speed and number of occurrences is shown in Figure 7. In general, the speed and the number of sharp-acceleration behaviors are both negatively correlated with risk. When the speed exceeds 12 m/s (43.2 km/h), the risk depends mainly on the speed of the sharp acceleration, while the number has little effect.

When it comes to the interaction of the acceleration and the number of sudden-braking behaviors, the PDP heatmap in Figure 8 shows a typical convex curve when the acceleration reaches 0.5 g and the number is from 9 to 21. Meanwhile, the risk index, ranging from 0.24 to 0.36, is greater than that of sharp-acceleration behavior. Therefore, compared with sharp-acceleration behavior, sudden-braking behavior is a more critical factor leading to a surge in crash risk.
By comparing the interaction heatmaps of sharp-acceleration and sudden-braking behaviors, it is evident that sudden braking is a typical risk factor, as the upper limit of risk in Figure 8 is much higher than that in Figures 6 and 7. The phenomenon that the risk effect of sudden-braking behavior is greater than that of sharp-acceleration behavior can be explained by recreating the scenarios in which such risky driving behavior occurs. There are usually three kinds of sharp-acceleration behavior on the expressway. In the first, the driver decides to increase the cruising speed to a higher value; in the second, the driver changes lanes and overtakes; and in the third, the vehicle starts off on a congested road. The first two behaviors occur only when the current traffic flow is stable and the driving conditions are good. Generally, the driving speed in this scenario is high, reaching or approaching free flow. In contrast, the third situation occurs on congested road sections, where vehicles frequently start and accelerate rapidly, with low intensity but high frequency of occurrence, and crashes are more likely to occur. This finding is consistent with the pattern in Figure 7 and the findings of some previous research [35]. In other words, the features at the time when sharp-acceleration behavior occurs also reflect the current level of service of the road. When sharp acceleration occurs, most of the surrounding vehicles are in a similar traffic state and generally follow the driving rules, signaling in advance (such as by turning on the turn signal) to inform the surrounding vehicles of the upcoming action and avoid risk. Sudden-braking behavior is different. It may occur when an obstacle suddenly appears in front of the car or when drivers on a ramp find that they have taken the wrong route; it is random, sudden, and uncertain, which makes it difficult for other drivers to predict and can eventually lead to an accident. Although both behaviors are main contributors to crash risk, they differ greatly in their formation mechanisms. Therefore, when using behavior-driven risk prediction models, navigation companies should focus on congested environments and driving conditions with frequent hard braking, specifically represented in this dataset by speeds less than 12 m/s during

Conclusion
In order to predict crash risk on a large scale and at a lower cost, real-time driving behavior data provided by AutoNavi from onboard GPS were utilized to establish a behavior-driven risk prediction model. The generated datasets contained sharp-left-lane-change, sharp-right-lane-change, sharp-acceleration, and sudden-braking behaviors within 250 meters upstream and downstream of the crash site within a certain time interval. The frequency, speed, and acceleration of these behaviors were calculated as supplementary features. Multiple classification learners were applied to the datasets, and PDP was applied to determine the main contributory factors of expressway crash risk.
First, the behavior-driven risk prediction model is established through RUSBoost, with the AUC reaching 0.782, which outperforms various machine learning models that use traditional traffic-flow data. This demonstrates that the behavior-driven model has advantages over the traffic-flow-driven model in risk prediction. Navigation systems can therefore provide corresponding safety monitoring services by using real-time behavior data.
Furthermore, the results of LR and PDP show that sharp-acceleration and sudden-braking behaviors are the main factors of expressway crash risk. Further study of the interaction of the features of these two behaviors demonstrates that sudden braking is the most critical source of risk. The risk of sharp-acceleration behavior arises on congested roads, where it occurs with high frequency. When sudden-braking behaviors exceed 0.5 g in acceleration, it is imperative to inform drivers of upcoming risks.
In addition, when constructing a dataset, the interval before a crash over which real-time driving behavior is collected has a great influence on the regression performance, and so does the chosen ratio of crash to noncrash data. Blindly expanding the sampling interval and data ratio does not achieve the best regression performance. After comparing multiple groups of experiments, the best result in this case was obtained when real-time driving behavior features were collected within 15 minutes before the crashes and the imbalance ratio of the dataset was about 1 : 20. The speed and acceleration characteristics of the driving behavior also effectively influence the prediction accuracy, so data on vehicle motion conditions and traffic operation patterns should be retained as much as possible.
To sum up, this research verifies the feasibility of the behavior-driven risk prediction model using onboard navigation data and provides a rational basis for applying active countermeasures. Future research will focus on supplementing other traffic-flow data to obtain a highly accurate and interpretable model and on developing risk avoidance measures. Last but not least, the legal risk to personal privacy brought about by the use of navigation software to capture driving behavior should also be taken seriously, so as to safeguard traffic safety while maintaining the privacy of users.

Figure 2: Schematic diagram of driving feature aggregation and sampling process.

Figure 3: Models' AUC performance on different datasets (with speed and acceleration features).

Figure 5: Distribution characteristics and the coefficient estimation of high-impact variables. (a) Frequency of occurrence of high-impact variables in 20 LR results. (b) Coefficient estimation of high-impact variables.

Figure 6: Interaction for acceleration and number of sharp-acceleration behavior.

Figure 7: Interaction for speed and number of sharp-acceleration behavior.

Figure 8: Interaction for acceleration and number of sudden-braking behavior.

Table 1: Elements of each category of abnormal driving behavior.

Methods. Multiple machine learning methods are used to obtain a more accurate classification model, including the artificial neural network, the Naive Bayesian model, and RUSBoost. Feature selection is performed before machine learning regression on the dataset. LR itself is an effective feature extraction method, and hence the high-significance variables (p value <0.05) retained in LR are used as input variables.

Table 2: Real ratio of positive to negative cases in each dataset.

Table 3: Independent variable description of each data at the given time interval.

Table 4: Confusion matrix of the prediction model.

Table 5: Comparison of prediction performance using different data sources.

Table 6: Logit regression and ANOVA results in datasets with ratios of 1 : 1.

Table 7: Logit regression and ANOVA results in datasets with ratios of 1 : 3.

Table 8: Logit regression and ANOVA results in datasets with ratios of 1 : 10.

Table 9: Logit regression and ANOVA results in datasets with ratios of 1 : 20.