Analysis of Temporal Relationships between Eye Gaze and Peripheral Vehicle Behavior for Detecting Driver Distraction

A car driver’s cognitive distraction is a main factor behind car accidents. A person’s state of mind is subconsciously revealed in how he or she reacts to external stimuli. We regard the visual event that occurs in front of the driver when a peripheral vehicle overtakes the driver’s vehicle as such an external stimulus, and we focus on the temporal relationships between the driver’s eye gaze and the peripheral vehicle behavior. The analysis showed that the temporal relationships depend on the driver’s state. In particular, we confirmed that the timing of the gaze toward the stimulus under the distracted state induced by a music retrieval task using an automatic speech recognition system is later than that under a neutral state, in which the driver only drives without the secondary cognitive task. This temporal feature can contribute to detecting cognitive distraction automatically. A detector based on a Bayesian framework using this feature achieves better accuracy than one based on the percentage road center method.


Introduction
Driver distraction is a diversion of attention away from activities critical for safe driving toward a competing activity [1] and is a large risk factor that causes accidents [2]. Note that distraction differs from fatigue [3] which is defined as a state that disables one from continuing the activity [4]. Many researchers have developed driver distraction monitoring systems to maintain safety while driving by considering different types and levels of distraction [3]. The National Highway Traffic Safety Administration (NHTSA) classifies distractions into (1) cognitive distraction, (2) visual distraction, (3) auditory distraction, and (4) biomechanical distraction from the viewpoint of the driver's functionality [2]. Cognitive distraction can be considered as an internal state of the driver. It is difficult to sense this from outside. The other distractions are external factors that disturb the activity and can be observed more easily. We focus on cognitive distraction and seek novel findings to automatically detect it.
In the past few decades, a number of methods for detecting distraction have been proposed [3]. The methods fall into the following five categories based on the types of measures: (1) subjective report measures, (2) driver biological measures, (3) driving performance measures, (4) driver physical measures, and (5) hybrid measures. Among these measures, subjective report measures and driver biological measures are not suitable under real driving conditions. Driving performance measures, as indicated by steering, braking behavior, and so forth, are suitable for detecting visual distraction [5]. However, even if a system can detect such overt behaviors, which are more directly linked with risk, the detection may come too late to support the driver in maintaining safety.
Eye-gaze measure, which is one of the driver's physical measures, is a useful measurement of visual distraction especially for In-Vehicle Information System (IVIS) and Advanced Driver Assistance System (ADAS) assessment as specified in the existing standards ISO 15007-1 [6] and ISO/TS 15007-2 [7], and it has the potential for capturing symptoms of cognitive distraction [8]. An eye-gaze pattern could be used to discriminate driving while performing a secondary cognitive task from driving only [9]. Drivers under cognitive distraction had fewer saccades per unit time, which was consistent with less exploration of the driving environment [10]. Saccades may be a valuable index of mental workload [11]. Miyaji et al. reported that the standard deviations of eye movement and head movement could be suitable for detecting cognitive distraction that caused gaze concentration and slow saccades when drivers looked at the roadway [12]. Kircher et al. indicated the percentage of time that the driver spent observing the road ahead, which is called the percentage road center (PRC) of gaze direction, was more than 92% under cognitive distraction in a field study [13]. Johansson et al. have reviewed the existing gaze-based techniques and metrics for analyzing visual and cognitive distractions [8].
These approaches based on driver physical measures mainly measured only the driver's gaze toward the road ahead or in-vehicle static objects without any regard for the peripheral traffic environment, which includes many scattered visual stimuli, or measured only a rough correlation between spatial features of the gaze and the environment. They also need a long-term evaluation. To support the driver more flexibly, the time resolution of the detection needs to be improved. We therefore take account of short-term cross-media dynamics to detect cognitive distraction. In the field of human-computer interaction, some researchers have investigated a state of mind by analyzing the temporal relationships between eye movements and visual changes in the user interface [14][15][16]. A latent state is subconsciously revealed in the reaction to external stimuli [17]. In controlled settings such as a driving simulator, the detection response task (DRT) is an emerging method of measuring visual and cognitive distractions [18][19][20][21]; it asks the subject to respond via a device such as a button to visual, tactile, or acoustic stimuli. The response time relates to the distraction. However, it is difficult to impose the task on actual drivers on the road without compromising safe driving. In this work, our target is the temporal relationships between the driver's gaze and peripheral vehicle behavior in a real driving situation, in particular the timing of the gaze toward the visual stimuli caused by the peripheral vehicle.

Timing of Gaze Reaction to Overtaking Event.
To analyze the temporal relationships, we focus on peripheral vehicle behaviors with a high level of visibility for the driver. When a peripheral vehicle (called the overtaking vehicle) overtakes the host vehicle driven by the driver, a visual change that occurs in front of the field of view can attract the driver's attention. We define this overtaking event as our target of analysis. The event has a base-point time $t_0$ ($= 0$), a beginning time $t_b$ ($= t_0 - T/2$), and an ending time $t_e$ ($= t_0 + T/2$). $t_0$ is the time when the front position $x_o$ of the overtaking vehicle in the direction of forward movement becomes equal to the front position $x_h$ of the host vehicle. $T$ is the duration of the overtaking event, which is a configuration parameter of the analysis and set in Section 4.2. Figure 1 shows the overtaking event.
We define saccade timing $t_s$ and gaze timing $t_g$. The former is the time when the driver turns the gaze toward the overtaking vehicle, whereas the latter is the time during which the driver fixates on the overtaking vehicle. The temporal relationships characterizing the gaze reaction to the overtaking vehicle are the time differences between the saccade timing $t_s$ or the gaze timing $t_g$ and the base-point time $t_0$ of the event. Figure 2 shows the timing structure.
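The timing structure above can be illustrated with a minimal Python sketch. The class `OvertakingEvent`, the function `reaction_offsets`, and the numeric values are hypothetical and chosen only for illustration; they are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class OvertakingEvent:
    """One overtaking event: t0 is the base-point time, T the duration (s)."""
    t0: float
    T: float

    @property
    def t_begin(self) -> float:
        # Beginning time: half the duration before the base point.
        return self.t0 - self.T / 2

    @property
    def t_end(self) -> float:
        # Ending time: half the duration after the base point.
        return self.t0 + self.T / 2


def reaction_offsets(event, saccade_time, gaze_time):
    """Return the saccade and gaze timings relative to the base point t0."""
    return saccade_time - event.t0, gaze_time - event.t0


# Hypothetical event with its base point 120 s into the recording.
event = OvertakingEvent(t0=120.0, T=10.0)
offsets = reaction_offsets(event, saccade_time=120.75, gaze_time=121.25)
```

Positive offsets mean that the reaction occurred after the overtaking vehicle drew level with the host vehicle.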

2.2. Hypothesis.
Some researchers have investigated the correlations between gaze directions and traffic object positions, for example, road curvature, oncoming traffic, road signs, and pedestrians [22][23][24], and have shown that the correlations are high under the neutral state while driving only. Most of the researchers, however, dealt mainly with spatial correlation. On the temporal relationship between visual attention and external stimuli, Posner revealed that covert attention to them decreases the reaction time, and, conversely, distraction increases it [25]. We therefore propose the following hypothesis: the timing of when a driver gazes toward the overtaking event under a state of cognitive distraction is later than that under a neutral state.

Real-World Driving Database
We analyze a part of a database collected using the "NUDrive Vehicle" in Nagoya, Japan [26].

Data-Collection Vehicle.
The "NUDrive Vehicle" was designed to synchronously record multimedia signals of driver performance (gas pedal, brake pedal, steering angle, velocity, acceleration, and position of the car), intervehicular distance, biological signals, videos, and audio signals. Various external sensors were mounted on a Toyota Hybrid Estima with a 2360 cc engine and automatic transmission and steering wheel on the right side. Figure 3 shows the data-collection vehicle. All the sensors used for recording were commercially available.

Participants.
A total of 30 participants (10 males and 20 females) took part in the experiment. They were, on average, 39.0 years old (range of 29-52 years) and had held a driver's license for a mean period of 18.6 years (range of 8-32 years). They received 5000 Japanese yen as compensation for their participation. The data recorded during the initial period were not used in this work. The experimenter monitored the experiment from the rear seat and indicated the route to the driver. During a particular period of driving, the participant performed a secondary hands-free task of retrieving and playing songs from a list of 635 titles from 248 artists using an automatic speech recognition system [27]. The secondary task artificially induced the cognitive distraction state. The experimenter instructed the participant to retrieve as many songs as possible; accordingly, within around 30 s of successfully retrieving each song, the participant had to retrieve another song. All experiments were performed on two- or three-lane highways. The experimental route was the same for all participants.

Measures.
In this work, we analyzed the intervehicular distance measured by laser scanners and the driver's gaze direction extracted from the recorded video. The details of the analyzed data are given below; descriptions of the other recorded data are omitted.

Intervehicular Distance Scanning.
Two laser scanners (front: RIEGL LMS-140i-80; rear: RIEGL LMS-Q120i), mounted on the front and back of the host vehicle shown in Figure 3, provided geometric information about the peripheral environment of the vehicle. The laser scanners covered 80-degree arcs at both the front and back of the vehicle, to an effective range of about 100 m to the front and 55 m to the rear, but had blind areas at the left and right sides of the vehicle. The data were acquired at a sampling frequency of 10 Hz. To track peripheral vehicles in the blind areas, we applied a Kalman filter to the data [28], so that the dynamics of their position and velocity relative to the host vehicle could be estimated even when they were outside the laser range. The position $(x, y)$ was on a horizontal plane whose coordinate system consisted of an axis $x$ along the moving direction and an orthogonal axis $y$, with origin $(x_0, y_0)$ at the center of the frontal laser scanner. We did not take velocity into account in this work. The practical area to analyze is limited to a rectangular area with a length of 80 m, $-40 \le x \le 40$, and a width of 9.9 m, $-4.95 \le y \le 4.95$.
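The text does not give the state model or noise settings of the Kalman filter in [28]. The sketch below is therefore only a generic one-dimensional constant-velocity filter, written in plain Python, that keeps predicting a vehicle's relative position while measurements are missing (for example, in a blind area). The function names and the noise values `q` and `r` are assumptions made for illustration.

```python
def predict(state, cov, dt, q):
    """Constant-velocity prediction for state = (position, velocity)."""
    p, v = state
    (a, b), (c, d) = cov
    # cov' = F cov F^T + q * I, with F = [[1, dt], [0, 1]]
    new_cov = ((a + dt * (b + c) + dt * dt * d + q, b + dt * d),
               (c + dt * d, d + q))
    return (p + v * dt, v), new_cov


def update(state, cov, z, r):
    """Correct the prediction with a position measurement z (variance r)."""
    p, v = state
    (a, b), (c, d) = cov
    s = a + r                  # innovation variance
    k0, k1 = a / s, c / s      # Kalman gain for H = [1, 0]
    y = z - p                  # innovation
    new_cov = (((1 - k0) * a, (1 - k0) * b),
               (c - k1 * a, d - k1 * b))
    return (p + k0 * y, v + k1 * y), new_cov


def track(measurements, dt=0.1, q=0.01, r=0.25):
    """Filter a position sequence; None marks a sample with no measurement."""
    state, cov = (measurements[0], 0.0), ((1.0, 0.0), (0.0, 10.0))
    estimates = []
    for z in measurements[1:]:
        state, cov = predict(state, cov, dt, q)
        if z is not None:
            state, cov = update(state, cov, z, r)
        estimates.append(state)
    return estimates


# Hypothetical vehicle approaching at 5 m/s (10 Hz samples), then 1 s unobserved.
observed = [-10.0 + 0.5 * k for k in range(30)]
estimates = track(observed + [None] * 10)
```

During the unobserved samples only the prediction step runs, so the position estimate keeps advancing at the last estimated velocity.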

Video Recording.
The driver's face was captured by a camera (Sony 1/2 inch CCD video camera DXC-200A) mounted on the dashboard. The data were acquired at a resolution of 692 pixels in width and 480 pixels in height at a sample frequency of 29.4 fps.

Driver Gaze Tracking.
The driver's gaze direction was manually labeled by an annotator using ELAN (http://www.lat-mpi.eu/tools/elan), which is a tool for the creation of complex annotations on video and audio resources. We prepared five gaze labels according to ISO 15007-1 [6] and three additional labels to detect gaze toward overtaking vehicles and toward upper traffic signs that could be extracted from the low-resolution video as follows: ($g_0$) right side (gaze toward right mirror and right window by head turning); ($g_1$) right front (gaze rightward from front, including gaze toward overtaking vehicles in the right lane); ($g_2$) rear (gaze toward rear-view mirror); ($g_3$) front (gaze to road scene ahead, the reference direction); The annotator detected the beginning of the saccade of each gaze behavior as the beginning of the interval with the gaze label. We consider that the gaze direction could be labeled more stably and accurately by using commercially supplied eye-tracking systems (e.g., FaceLab). Figure 4 shows a sample set of face images that were given each label.

Extracting Overtaking Events.
We extracted 274 overtaking events from the intervehicular distance data of the neutral task (task condition $C_N$) and 81 events from that of the music-retrieval task (task condition $C_M$).

Clustering of Peripheral Environment Data.
We do not analyze traffic simulation data but focus on real traffic data. In a real traffic scene, there are peripheral vehicles in addition to the overtaking vehicle that act as triggers of random events. The peripheral vehicles are the visual stimuli. From the standpoint of the visual target search task, the overtaking vehicle and the other peripheral vehicles are regarded as the target and visual distractors, respectively. The performance of the target search task depends on the traits of the visual distractors [29]. In this work, a trait of the peripheral environment therefore needs to be defined. We classify the traffic scenes spatiotemporally according to the time during which peripheral vehicles are present in each subperipheral area within the interval of the overtaking event, $[t_b, t_e]$. The peripheral area is divided into six subareas $a_i$ corresponding to the six gaze labels $g_i$ ($i = 0, \ldots, 5$). Figure 5 shows the six subperipheral areas. Each of the six areas is assigned a Boolean variable $e_i$. A vector, $\mathbf{e}$, comprises the six variables: $\mathbf{e} = (e_5, e_4, e_3, e_2, e_1, e_0)$, which represents the trait of the peripheral environment. In this work, if the cumulative duration during which peripheral vehicles are present in subperipheral area $a_i$ exceeds 50% of the targeted time interval, $e_i = 1$; otherwise, $e_i = 0$. We focus on the dynamics of the driver's gaze around the base point of the overtaking event, $t_0$, as a reference point of analysis. Therefore, the interval of the overtaking event, $[t_b, t_e]$, is divided into the first half, from $t_b$ to $t_0$, with vector $\mathbf{e}_1$, and the second half, from $t_0$ to $t_e$, with vector $\mathbf{e}_2$. Practically, we set $T$ to 10 s ($t_0 = 0$, $t_b = -5$, $t_e = 5$ s) as a sufficient time duration to analyze in consideration of the analysis area with the length of 80 m and the maximum velocity difference (= 40.9 km/h) between the host vehicle and the overtaking vehicles in the area.
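The 50% occupancy rule for one half of an event can be sketched as follows. The function name `occupancy_vector` and the data layout (one list of per-sample Boolean flags for each subarea, sampled at 10 Hz) are assumptions made for illustration.

```python
def occupancy_vector(occupancy, threshold=0.5):
    """Build the environment vector (e5, e4, e3, e2, e1, e0).

    occupancy[i] holds one boolean per sample, True when any peripheral
    vehicle is inside subarea i during that sample.
    """
    bits = []
    for i in range(5, -1, -1):  # the vector is ordered from area 5 down to 0
        samples = occupancy[i]
        ratio = sum(samples) / len(samples)
        # The bit is set only when occupancy strictly exceeds the threshold.
        bits.append(1 if ratio > threshold else 0)
    return tuple(bits)


# Hypothetical first half of an event: 5 s at 10 Hz = 50 samples per subarea.
first_half = {i: [False] * 50 for i in range(6)}
first_half[0] = [True] * 40 + [False] * 10   # right side occupied 80% of the time
first_half[1] = [True] * 25 + [False] * 25   # right front occupied exactly 50%
e_first = occupancy_vector(first_half)
```

In this example only the right-side bit is set, because an occupancy of exactly 50% does not exceed the threshold.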
From all data including the overtaking events (274 events for $C_N$ and 81 events for $C_M$), we could not extract uniformly distributed vectors $\mathbf{e}_1$ and $\mathbf{e}_2$. In addition, we eliminated some traffic scenes along curves and including lane changes because they would induce a specific gaze behavior. Adequate samples were extracted for two types of state transitions of the peripheral environment: environmental state transition $E_1$, from $\mathbf{e}_1 = (0, 0, 0, 0, 0, 1)$ to $\mathbf{e}_2 = (0, *, 0, *, 1, *)$, with 75 events for task $C_N$ and 23 events for task $C_M$; and environmental state transition $E_2$, from $\mathbf{e}_1 = (0, 0, 0, 0, 1, 1)$ to $\mathbf{e}_2 = (0, *, 0, *, 1, *)$, with 43 events for task $C_N$ and 13 events for task $C_M$. Here, "*" denotes any binary value. Of all overtaking events, 43.0% for $C_N$ and 44.4% for $C_M$ fell into either $E_1$ or $E_2$. $E_1$ represents the case in which the overtaking vehicle runs in the right lane and no other peripheral vehicle is present for more than 2.5 s (= 50% of $T/2$) before it overtakes the host vehicle, whereas $E_2$ represents the case in which the overtaking vehicle also runs in the right lane and a peripheral vehicle is present in the right-front area for more than 2.5 s before it overtakes the host vehicle. We analyze the relationships between the extracted events for the two environmental conditions $E_1$ and $E_2$ and the gaze data below.
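Selecting events by their state transition amounts to matching a pair of vectors against patterns with wildcards. The sketch below shows one way to do this; the names `matches`, `classify_transition`, and the keys `"E1"` and `"E2"` are placeholder labels for the two transitions described above, not identifiers from the study.

```python
WILDCARD = "*"

# (pattern for the first half, pattern for the second half) per transition.
TRANSITIONS = {
    "E1": ((0, 0, 0, 0, 0, 1), (0, WILDCARD, 0, WILDCARD, 1, WILDCARD)),
    "E2": ((0, 0, 0, 0, 1, 1), (0, WILDCARD, 0, WILDCARD, 1, WILDCARD)),
}


def matches(vector, pattern):
    """True when every non-wildcard position of pattern equals the vector."""
    return all(p == WILDCARD or p == v for v, p in zip(vector, pattern))


def classify_transition(e_first, e_second):
    """Return the transition label of an event, or None if it fits neither."""
    for name, (first, second) in TRANSITIONS.items():
        if matches(e_first, first) and matches(e_second, second):
            return name
    return None


label_a = classify_transition((0, 0, 0, 0, 0, 1), (0, 1, 0, 0, 1, 0))
label_b = classify_transition((0, 0, 0, 0, 1, 1), (0, 0, 0, 1, 1, 1))
label_c = classify_transition((1, 0, 0, 0, 0, 1), (0, 0, 0, 0, 1, 0))
```

Events that return `None` correspond to the scenes left out of the analysis.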

Testing of Hypothesis Based on Temporal Gaze Distribution.
The relative frequency distributions of the gaze directions (temporal gaze distributions) that were measured in the peripheral environment $E_1$ varied over time, as shown in Figure 6. Since the relative frequency of the right-front gazes increased quickly after the base point of the overtaking event, $t_0$, regardless of the task, the participants frequently gazed toward the overtaking vehicle in the right lane when the number of visual distractors was small. We verified the hypothesis (see Section 2.2) of the temporal difference between the relative frequency distributions for the music-retrieval task $C_M$ and the neutral task $C_N$ by performing a statistical test. Let us limit the interval for testing to the time from $t = 0$ ($= t_0$) to $t = 2$ s following the work of Merat et al. [19]. The average saccade timings for the two conditions $C_N$ and $C_M$ were 0.82 s and 1.07 s (SD = 0.52 s and 0.42 s), respectively. The saccade timing for the $C_N$ condition was shorter than that for the $C_M$ condition, but we did not obtain enough samples to perform a statistical test on the saccade timings. The average gaze timings for the two conditions $C_N$ and $C_M$ were 1.09 s and 1.46 s (SD = 0.51 s and 0.43 s), respectively. The two relative frequency distributions within the limited interval satisfied normality and homoscedasticity. The $t$-test revealed that the gaze timing for the $C_N$ condition was significantly shorter than that for the $C_M$ condition, $t(183) = -3.76$, $p < .001$.
International Journal of Vehicular Technology
In contrast, Figure 7 shows that the participants rarely looked at the overtaking vehicle in the peripheral environment $E_2$ while performing the music-retrieval task $C_M$, whereas they behaved in the same way as in $E_1$ for the neutral task $C_N$. As in $E_1$, the average saccade timings for the two conditions $C_N$ and $C_M$ were 0.88 s and 1.25 s (SD = 0.67 s and 0.60 s), respectively. The saccade timing for the $C_N$ condition was shorter than that for the $C_M$ condition, but we did not obtain enough samples to perform a statistical test on the saccade timings. The average gaze timings for the two conditions $C_N$ and $C_M$ were 0.93 s and 1.34 s (SD = 0.51 s and 0.45 s), respectively. The $t$-test revealed that the gaze timing for the $C_N$ condition was significantly shorter than that for the $C_M$ condition, $t(83) = -3.1$, $p < .005$. These results support our hypothesis but are limited to the evaluation of two conditions, $E_1$ and $E_2$, of the peripheral environment.
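For readers who want to reproduce this kind of comparison, the sketch below computes a two-sample Student's t statistic with pooled variance, which is the usual choice when normality and equal variances hold. Whether the authors used exactly this variant is an assumption, and the sample values are hypothetical, not data from the study.

```python
import math
from statistics import mean, variance


def pooled_t(sample_a, sample_b):
    """Two-sample Student's t statistic with pooled variance.

    Returns (t, degrees_of_freedom). A negative t means that sample_a
    has the smaller mean.
    """
    na, nb = len(sample_a), len(sample_b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(sample_a)
                  + (nb - 1) * variance(sample_b)) / df
    std_err = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(sample_a) - mean(sample_b)) / std_err, df


# Hypothetical gaze timings in seconds (not data from the study).
neutral = [0.8, 1.0, 1.2, 1.1, 0.9]
distracted = [1.3, 1.5, 1.4, 1.6, 1.2]
t_value, dof = pooled_t(neutral, distracted)
```

The resulting statistic would then be compared against the t distribution with `dof` degrees of freedom to obtain a p value.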

Discrimination between Distraction and Neutral State in a Bayesian Framework.
To identify the class label (distraction, i.e., $C_M$, or neutral, i.e., $C_N$) of the driver state using the temporal gaze distribution, we use a naive Bayesian framework as follows:
$$P(c_t \mid g_t, E) \propto P(g_t \mid c_t, E)\, P(c_t \mid E)\, P(E),$$
where $t$ represents the time from when the host vehicle was overtaken by the other vehicle, $c_t$ represents whether a gaze belongs to the distraction class at time $t$, that is, the binary class label, $g_t$ represents whether the direction of the gaze is right-front, that is, the gaze label, and $E$ represents the condition of the peripheral environment. One of the important characteristics of the Bayesian framework is the capability to infer the state of an unobserved variable, given the state of the observed variables. In our case, we want to infer the driver's internal state, that is, cognitive distraction, given the gaze data and the peripheral environment.
To decide which class is assigned to the gaze data, the equation can be iterated over a time interval related to an overtaking event. We can then accumulate the computed posteriors and choose the class of driver state with the greater score based on maximum a posteriori (MAP) estimation as follows:
$$\hat{c} = \arg\max_{c} P(c \mid g, E).$$
In this work, the priors $P(c_t \mid E)$ and $P(E)$ are assumed to be uniform, that is, noninformative prior distributions, and $c$ and $E$ are regarded as binary variables not depending on time $t$: $P(g_t \mid c_t, E) \simeq P(g_t \mid c, E)$, $E = (\mathbf{e}_1, \mathbf{e}_2)$. Therefore, we can apply the temporal gaze distribution extracted in Section 4.3 to $P(g_t \mid c, E)$.
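The decision rule described above can be sketched in Python as follows, assuming a uniform class prior, a binary right-front gaze label per time bin, and posteriors accumulated by summation over the interval. The likelihood table `LIKELIHOOD`, the bin width, and the function names are hypothetical; in the study the likelihoods would come from the temporal gaze distributions for a given environment.

```python
# Hypothetical P(right-front gaze at time bin t | class) for one environment,
# using four 0.5 s bins that cover 0 to 2 s after the base point.
LIKELIHOOD = {
    "neutral":    [0.20, 0.50, 0.60, 0.60],
    "distracted": [0.05, 0.10, 0.30, 0.50],
}


def posterior_distracted(g, t, likelihood, prior=0.5):
    """P(distracted | g_t) by Bayes' rule for a binary gaze label g."""
    def lik(c):
        p = likelihood[c][t]
        return p if g == 1 else 1.0 - p

    num = lik("distracted") * prior
    den = num + lik("neutral") * (1.0 - prior)
    return num / den if den > 0 else prior


def classify(gaze_sequence, likelihood):
    """Accumulate per-bin posteriors and return the class with the larger sum."""
    score_d = sum(posterior_distracted(g, t, likelihood)
                  for t, g in enumerate(gaze_sequence))
    score_n = len(gaze_sequence) - score_d  # binary posteriors sum to 1 per bin
    return "distracted" if score_d > score_n else "neutral"


early_gaze = classify([0, 1, 1, 1], LIKELIHOOD)   # gaze arrives in the second bin
late_gaze = classify([0, 0, 0, 1], LIKELIHOOD)    # gaze arrives in the last bin
```

With these illustrative likelihoods, a gaze that reaches the right-front area early is labeled neutral and one that arrives late is labeled distracted, which mirrors the hypothesis of the paper.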

Results.
We matched the class label computed from equation (3) with the true label to discriminate between distraction state $C_M$ and neutral state $C_N$. The experimental data were the same as the data for the analysis of gaze timing (see Section 4): 75 events for $C_N$ and 23 for $C_M$ in $E_1$, and 43 events for $C_N$ and 13 for $C_M$ in $E_2$. We applied leave-one-out cross validation to obtain the discrimination accuracy. Table 1 shows the accuracies of the two-class discrimination. As this is for testing our hypothesis (see Section 4.3), we limited the time interval for discrimination to the time from $t = 0$ ($= t_0$) to $t = 2$ s. As a baseline, we employed a method based on the percentage road center (PRC) [13], that is, the proportion of the total gaze duration toward the road scene ahead to the total time of the sample events, which discriminated between $C_N$ and $C_M$ by thresholding PRC at $\theta$; that is, if PRC was larger than $\theta$, the method classified the data as distraction state $C_M$; otherwise, neutral state $C_N$ was assigned. The thresholds for $E_1$ and $E_2$ were set to 79.0% by searching for an equal rate between detection of $C_M$ and $C_N$. We can confirm that the proposed method performed more accurately than the baseline one.
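The PRC baseline reduces to a single ratio and a threshold. The sketch below assumes that gaze labels are available as one label per video frame and that `"front"` denotes the road scene ahead; the function names and the label strings are hypothetical, while the 79.0% default follows the threshold reported above.

```python
def prc(gaze_labels, front_label="front"):
    """Percentage of samples in which the gaze is on the road scene ahead."""
    front = sum(1 for g in gaze_labels if g == front_label)
    return 100.0 * front / len(gaze_labels)


def prc_classify(gaze_labels, threshold=79.0):
    """Label an event as distracted when its PRC exceeds the threshold."""
    return "distracted" if prc(gaze_labels) > threshold else "neutral"


# Hypothetical per-frame gaze labels for two events.
concentrated = ["front"] * 9 + ["right_front"]
exploring = ["front"] * 7 + ["right_front"] * 2 + ["down"]
result_a = prc_classify(concentrated)
result_b = prc_classify(exploring)
```

A high PRC indicates gaze concentration on the road ahead, which is why values above the threshold are mapped to the distraction state.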

Discussion
We obtained test results supporting our hypothesis and confirmed that the proposed discriminator performed more accurately than the baseline one. This approach has the advantage of needing a shorter time to detect cognitive distraction because it focuses on only the scene that is important for the detection, but it requires an overtaking event to trigger the gaze reaction. The Bayes-rule-based method is also well suited to practical application: it can be naturally integrated into the state-of-the-art method based on Bayesian networks using hybrid measures [30].
Figures 6 and 7 also suggest another difference between the two tasks $C_N$ and $C_M$. Note the frequency of gazing in a downward direction, that is, down. We consider that the participants frequently looked at the speedometer or navigation system while performing the neutral task. This behavior agrees with the prior findings on PRC [13]. In the time intervals without an overtaking event, PRC should be considered for detecting cognitive distraction.
Here, let us compare the average timing of the gaze to right front in condition $E_1$ with that in condition $E_2$ (see their values in Section 4.3). We can confirm that the latter was slightly shorter than the former. In condition $E_2$, there was a vehicle that preceded the overtaking vehicle. The participants might still have been focusing their attention on the preceding vehicle or the right-front area and could then react effectively to the next overtaking event. The event did not cause inhibition of return [31], which would have retarded their reaction.
These results were verified under only two limited environmental conditions of peripheral vehicles, that is, $E_1$ and $E_2$, because we could not analyze enough experimental data. We need to increase the number of clusters of peripheral environment data for a wide-ranging analysis and to model the dynamics of peripheral vehicles based on, for example, time to collision (TTC), velocity, acceleration, and interaction among vehicles, in order to analyze the temporal relationships more deeply and increase the discrimination accuracy. We also have to analyze other secondary tasks and differences among individuals.

Conclusions
The dynamics of the external environment can elicit reactions reflecting the human internal state, that is, make the latent state explicit. We showed that the temporal factor, that is, timing, of a reaction is important for understanding the state by focusing on cognitive distraction in a car-driving situation.
The concrete contribution of this paper is twofold. First, we obtained test results supporting our hypothesis that the timing of when a driver gazes toward the overtaking event under cognitive distraction is later than that under the neutral state. Second, we confirmed that a Bayesian-based detection of distraction using the temporal gaze distribution performed more accurately than the PRC-based one. The findings of this work should be generalized through additional analysis in future work. We have built a large database of 500 drivers [26].
The generalized findings will suggest, for the Alliance of Automobile Manufacturers (AAM) guidelines [32], a risk of voice-interactive navigation (hands-free navigation) using automatic speech recognition and a novel testing scenario for driver information systems in real driving situations that imposes no extra workload on the driver. To put our approach into practical use, we will refer to peripheral vehicle-tracking systems based on computer vision techniques [24,33] and replace the laser scanner with a camera-based system.