Driver Fatigue Detection Based on Facial Key Points and LSTM

In recent years, fatigue driving has posed a serious threat to traffic safety, which has made fatigue detection a research hotspot. Research on fatigue recognition is therefore of great significance for improving traffic safety. However, existing fatigue detection methods still have room for improvement in detection accuracy and efficiency. To detect whether a driver is fatigued, this paper proposes a fatigue state recognition algorithm. The method first uses MTCNN (multitask convolutional neural network) to detect the face and then uses DLIB (an open-source software library) to locate facial key points and extract a fatigue feature vector for each frame. The fatigue feature vectors of multiple frames are spliced into a temporal feature sequence and sent to an LSTM (long short-term memory) network to obtain a final fatigue feature value. Experiments show that the proposed fatigue state recognition algorithm achieves better accuracy than other methods. The average accuracy of the proposed method in detecting facial key points is as high as 93%, and its running time is less than half that of the ordinary DLIB method.


Introduction
Automobiles have become the most popular tools of transportation. As the frequency of automobile use continues to increase, traffic accidents are also increasing. In many traffic accidents, fatigue driving is one of the main reasons. Fatigue driving has caused many major traffic accidents, which caused huge losses to people's lives and properties.
Relevant Chinese traffic laws stipulate that driving for 4 hours without a break constitutes fatigue driving. In a survey in the United States, more than half of the drivers admitted that they had driven while fatigued [1]. When a driver is fatigued, his or her concentration, judgment, and reaction speed are reduced [2]. These factors make traffic accidents more likely. Long-distance driving is especially prone to fatigue and often causes accidents. Therefore, fatigue driving detection technology has become a research hotspot in the field of traffic safety.
At present, fatigue detection methods are divided into the following categories: methods based on the physiological information, methods based on the vehicle status, methods based on the computer vision, and methods based on the information fusion models [3].
Physiological information mainly refers to the driver's breathing rate, pulse, blood pressure, and heart rate. These parameters can quickly and accurately reflect a person's physical and mental state. Detection methods based on physiological information offer both strong real-time performance and high accuracy [4]. However, the driver needs to wear the related equipment during detection, which interferes with normal driving and limits practical application. The status of the vehicle refers to the vehicle's trajectory, steering wheel manipulation, and lane deviation. These detection methods infer the driver's fatigue state indirectly by analyzing vehicle information [5].
The main disadvantage of these methods is low accuracy. Detection methods based on computer vision can quickly and accurately detect the driver's fatigue state by capturing and analyzing video of the driver's face in real time.
These methods do not require the driver to wear any equipment and perform well in terms of detection rate and reliability. Their main difficulty is face image processing. Information fusion methods combine physiological information, vehicle information, and computer vision algorithms to detect the driver's fatigue state. Their advantage is improved detection accuracy; their disadvantage is that it is difficult to build an information fusion model and to obtain all the required information.
The main contribution of this paper is a new, high-precision, real-time fatigue detection method based on computer vision. We combine MTCNN and DLIB, which allows us to extract facial features quickly and accurately, and then combine the facial features of multiple frames to make the fatigue judgment more accurate.
The method first divides the video into image frames and crops the facial area with MTCNN, and then uses the DLIB library to extract the fatigue features of the eyes and mouth for each image frame. Finally, the fatigue features of multiple frames are input into an LSTM-based recognition network to obtain the fatigue judgment.

Related Work
In recent years, many scholars and institutions have conducted extensive research on fatigue driving detection based on computer vision.
D'Orazio et al. proposed an eye detection algorithm that used iris geometric information to search for the eyes across the entire image [6]. Sun et al. studied the relationship between closed eyes and fatigue; they used PERCLOS to detect driver fatigue and obtained better test results [7]. Ma et al. designed a system to detect the fatigue driving state at night. They used a deep framework based on ConNN and verified it on their own dataset [8]. Zhang et al. created a model to address the influence of sunglasses on fatigue detection, which used the IRF dataset [9]. Gupta et al. observed the facial features of the driver through a camera and classified the fatigue levels through principal component analysis and a support vector machine (SVM) classifier [10]. Junaedi and Akbar calculated PERCLOS by detecting the eyes and used it to judge fatigue. They used the YawDD dataset [11]. Savaş and Becerikli used the SVM algorithm to detect driver fatigue. In their study, they used the number of yawns, the internal area of the mouth, and the number of blinks to determine the driver fatigue level on the dataset [12]. Amodio et al. designed a driver state detection system based on the pupillary light reflex. They used the pupil size profile and an SVM classifier to judge the driver's state [13]. Li et al. designed a human behavior recognition and classification system based on ConNN. They proposed a face recognition algorithm based on LBP-EHMM [14]. Liu et al. proposed a driver fatigue detection algorithm using a two-stream network model with multiple facial features. They applied gamma correction to enhance the image contrast and obtain better results [15]. Savaş and Becerikli proposed a multitask convolutional neural network model to detect driver drowsiness and fatigue. The features of the eyes and mouth were used to model the behavior of the driver, and the changes in these features were used to monitor the driver's fatigue [16]. Liu et al. proposed a fatigue detection algorithm based on deep learning facial expression analysis. They trained a facial key point detection model through multiple local binary patterns and AdaBoost classifiers [17]. Ed-Doughmi et al. proposed a method to analyze and predict driver drowsiness by applying a recurrent neural network to sequences of frames of the driver's face. They used a 3D convolutional network based on a multilayer recurrent neural network architecture to detect the driver's drowsiness [18].
Yawning and frequent blinking are the most obvious signs of driver fatigue. Therefore, the first task is to determine the state of the eyes and mouth. There are generally two ways to detect the eyes and mouth. One is to detect their positions directly. The other is to first find the facial area and then detect the positions of the eyes and mouth within it. The face carries more information than the eyes, and its features are more stable. Cropping the face area narrows the search range for the eye position and avoids interference from the background. Existing face detection algorithms can be divided into two categories: multistage detection algorithms based on region proposals and detection algorithms based on anchor boxes [19]. Representative algorithms of the former are Faster-RCNN [20] and MTCNN [21]. Representative algorithms of the latter are S3FD [22] and SSH [23]. Compared with traditional learning methods, detection methods based on deep learning do not require manual feature extraction. With the support of a large amount of training data, their detection performance improves greatly.
Fatigue driving is a continuous behavior. Therefore, a fatigue detection method based on multiple consecutive frames is expected to outperform a single-frame method. Donahue et al. proposed the LRCN framework [24], which can process the information of multiple consecutive frames to perform behavior recognition and classification.

Methodology
The framework proposed in this paper is shown in Figure 1. We introduce the implementation details of each part in turn.

Face Detection.
In this task, we use MTCNN for face detection. MTCNN is based on deep learning and can complete face detection and face alignment quickly and efficiently [25]. It can detect five key points of the face: the left and right corners of the mouth, the nose, and the left and right eyes. However, five key points are not enough to extract facial fatigue information, so we use MTCNN only for face detection. MTCNN is a cascade of three subnets: the proposal network (P-Net), the refine network (R-Net), and the output network (O-Net) [26].

P-Net.
The main task of this network is to obtain the bounding boxes and regression vectors of the candidate windows. After the candidate windows are calibrated, nonmaximum suppression is used to eliminate highly overlapping windows. P-Net is a region proposal network for face regions. It uses a face classifier to determine whether an area contains a face and uses bounding-box regression and a facial key point locator to make a preliminary proposal of the face area. This stage outputs many candidate windows, which serve as the input of R-Net.

R-Net.
The main task of this network is to eliminate false samples and continue to refine the bounding boxes and regression vectors. Unlike P-Net, R-Net includes a fully connected layer. A test sample that passes through P-Net yields many candidate windows, and R-Net filters out a large number of wrong ones. Finally, bounding-box regression and nonmaximum suppression (NMS) are performed on the remaining candidate boxes to further optimize the prediction results.
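The nonmaximum suppression step mentioned for P-Net and R-Net can be illustrated with a minimal greedy sketch. This is not the MTCNN implementation: the box format `[x1, y1, x2, y2]`, the function names, and the overlap threshold of 0.5 are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes from highest to lowest score and keep a box
    only if it does not overlap an already kept box by more than the
    threshold (0.5 is an illustrative value, not taken from the paper)."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping candidates and one separate candidate.
candidates = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]]
confidences = [0.9, 0.8, 0.7]
kept = nms(candidates, confidences)
```

In this toy input the second window overlaps the first with an IoU of about 0.68, so it is suppressed and only the first and third windows survive.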

O-Net.
This network is more complex than the first two. O-Net has a fully connected layer with 256 units. After further filtering the candidate windows from R-Net, the network also calculates the positions of the facial feature points. In addition, this operation can eliminate the influence of some occlusions, such as sunglasses, hats, and ordinary glasses.

Facial Key Point Detection.
In this phase, we use DLIB to label the key points of the face. DLIB can be regarded as a machine learning toolbox, which we use here to extract the key points of human faces. DLIB has received widespread attention since its release, and it can be applied on mobile devices as well as in large-scale high-performance computing environments. Like many open-source libraries, DLIB can be used by researchers for free. We chose the DLIB library because it provides training and extraction tools for 68 facial key points. We use it to obtain the 68 facial key points and then use these key points to extract fatigue features [27].

Closed-Eye Detection.
When the eyes are open, the distance between the upper and lower feature points of the eye is relatively large; when the eyes are closed, this distance becomes smaller. The EYE value is therefore calculated from the distances between the eye feature points. Among the 68 feature points on the face, the eye points correspond to 37-42 and 43-48, respectively (Figures 2 and 3). The EYE value is a ratio: the numerator is the Euclidean distance between the vertical feature points of the eye, and the denominator is the Euclidean distance between the horizontal feature points of the eye. The Euclidean distance between two points is calculated as follows:

$$\mathrm{Dis}(P_a, P_b) = \sqrt{(P_a.x - P_b.x)^2 + (P_a.y - P_b.y)^2},$$

where $P_a.x$ and $P_a.y$ represent the x and y coordinates of point $a$, respectively. The vertical and horizontal Euclidean distances of the eye can be expressed as follows:

$$\mathrm{Eye}_v = \mathrm{Mean}\big(\mathrm{Dis}(P_{38}, P_{42}),\ \mathrm{Dis}(P_{39}, P_{41})\big),$$

$$\mathrm{Eye}_h = \mathrm{Dis}(P_{37}, P_{40}),$$

where Mean(A, B) means the average of A and B. Then the aspect ratio of the eye can be expressed as follows:

$$\mathrm{EYE}_{left} = \frac{\mathrm{Eye}_v}{\mathrm{Eye}_h}.$$

Since the calculation is the same for the left and right eyes, the calculation for the right eye is not repeated. The eye feature vector (EFV) is composed of $\mathrm{EYE}_{left}$ and $\mathrm{EYE}_{right}$.

The vertical and horizontal distances of the mouth are computed in the same way from the mouth feature points:

$$\mathrm{Mouth}_v = \mathrm{Mean}\big(\mathrm{Dis}(P_{62}, P_{68}),\ \mathrm{Dis}(P_{63}, P_{67}),\ \mathrm{Dis}(P_{64}, P_{66})\big),$$

$$\mathrm{Mouth}_h = \mathrm{Dis}(P_{61}, P_{65}),$$

where Mean(A, B, C) means the average of A, B, and C. Then the aspect ratio of the mouth can be expressed as follows:

$$\mathrm{MOUTH} = \frac{\mathrm{Mouth}_v}{\mathrm{Mouth}_h}.$$
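The eye and mouth ratios described in this section can be sketched in a few lines of Python. The sketch assumes that the landmarks are supplied as a dictionary keyed by the 1-based point numbers used in the text, that the horizontal eye distance is measured between the corner points 37 and 40, and that the second eye (points 43-48) uses the same layout shifted by six indices. The function names are hypothetical.

```python
import math


def dis(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def mean(*values):
    """Arithmetic mean of the given values."""
    return sum(values) / len(values)


def eye_ratio(points, first=37):
    """Vertical-to-horizontal ratio of one eye.

    points maps 1-based landmark numbers to (x, y). first=37 selects the
    eye on points 37-42; first=43 selects points 43-48, which is assumed
    to follow the same layout. With dlib's 0-based shape.part(i), point k
    corresponds to i = k - 1.
    """
    eye_v = mean(dis(points[first + 1], points[first + 5]),
                 dis(points[first + 2], points[first + 4]))
    eye_h = dis(points[first], points[first + 3])
    return eye_v / eye_h


def mouth_ratio(points):
    """Vertical-to-horizontal ratio of the mouth from points 61-68."""
    mouth_v = mean(dis(points[62], points[68]),
                   dis(points[63], points[67]),
                   dis(points[64], points[66]))
    mouth_h = dis(points[61], points[65])
    return mouth_v / mouth_h


# Synthetic landmarks (not real detections) for a quick check.
landmarks = {
    37: (0, 0), 38: (1, 1), 39: (2, 1), 40: (3, 0), 41: (2, -1), 42: (1, -1),
    61: (0, 0), 62: (1, 1), 63: (2, 1), 64: (3, 1),
    65: (4, 0), 66: (3, -1), 67: (2, -1), 68: (1, -1),
}
eye_value = eye_ratio(landmarks, first=37)
mouth_value = mouth_ratio(landmarks)
```

Because both values are ratios of distances inside the same face, they do not change when the face appears larger or smaller in the frame, which is why a ratio rather than a raw distance is used as the state value.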

Fatigue Recognition Network.
Many existing fatigue identification methods use only a single fatigue feature, which leads to many misjudgments. For example, if only the mouth information is used to determine whether the driver is tired, speech is likely to be misjudged as fatigue [28]. Likewise, the fatigue detection results obtained by analyzing a single frame are not accurate. Inspired by LRCN [24], this paper designs a two-stage fatigue identification method. In the first stage, the input video is split into frames, the fatigue vector of each frame is extracted through MTCNN and DLIB, and the information of multiple consecutive frames is combined to form a temporal feature vector. In the second stage, these fatigue feature sequences are input into the LSTM-based network to identify the fatigue state.

Temporal Fatigue Characteristic Sequence.
The feature extraction task needs to extract the state values of the eyes and mouth for each frame. Therefore, we set the length of the single-frame feature vector to 3. The fatigue feature vector of a single frame image is as follows:

$$[\mathrm{Value}_{leye},\ \mathrm{Value}_{reye},\ \mathrm{Value}_{mouth}],$$

where $\mathrm{Value}_{leye}$ and $\mathrm{Value}_{reye}$ represent the state of the left eye and the right eye and $\mathrm{Value}_{mouth}$ represents the state of the mouth. The feature vector of each frame is 1 × 3. Splicing the feature vectors of n frames therefore forms a temporal feature sequence of size n × 3, where 3 is the vector length and n is the number of spliced frames. The splicing process is shown in Figure 6. As shown in Figure 6, the length of the time window is a key parameter in constructing the temporal fatigue characteristic sequence. If the window is too short, the sequence may not completely cover the fatigue state; if it is too long, the sequence contains too much redundant information.
Another key parameter is the number of skipped frames. Because adjacent frames carry almost the same information, extracting features from every frame wastes computation and greatly reduces efficiency. We sample each video at a rate of two frames per second. Since a fatigue event usually lasts no more than three seconds, which corresponds to six sampled frames, we chose a time window length of 6 and a skipped-frame number of 2.
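The construction of the temporal sequence can be sketched as follows. The sampling rate of two frames per second and the window length of 6 come from the text; the helper names, the rounding of the sampling step, and the window stride of 1 are assumptions, since the text does not state whether consecutive windows overlap.

```python
def sample_indices(n_frames, video_fps, target_fps=2.0):
    """Indices of the frames kept when sampling at target_fps.

    The step is rounded to the nearest whole frame (an assumption).
    """
    step = max(1, round(video_fps / target_fps))
    return list(range(0, n_frames, step))


def build_sequences(frame_features, window=6, stride=1):
    """Splice consecutive 1 x 3 frame vectors into window x 3 sequences.

    frame_features holds one [left_eye, right_eye, mouth] vector per
    sampled frame. stride=1 (overlapping windows) is an assumption.
    """
    for vector in frame_features:
        if len(vector) != 3:
            raise ValueError("each frame vector must have length 3")
    last_start = len(frame_features) - window
    return [frame_features[s:s + window]
            for s in range(0, last_start + 1, stride)]


# A 3-second clip at 30 fps yields six sampled frames, i.e. one window.
kept_frames = sample_indices(n_frames=90, video_fps=30.0)

# Eight sampled frames with placeholder feature values give three windows.
features = [[0.30, 0.31, 0.10 + 0.05 * k] for k in range(8)]
sequences = build_sequences(features)
```

Each element of `sequences` is a 6 × 3 list that can be passed to the recognition network as one time window.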

Fatigue Recognition Network Based on LSTM.
LSTM is designed to avoid the long-term dependency problem; remembering information over long periods is its default behavior. LSTM works well on a wide variety of problems and is now widely used in pattern recognition. For this reason, this paper applies an LSTM-based fatigue identification network. Its structure is shown in Figure 7.
As shown in Figure 7, the input of the LSTM network is a temporal feature sequence composed of six single-frame feature vectors. Therefore, the sequence length of the LSTM is also 6. The LSTM returns a probability value, which represents the probability that the driver is fatigued in the current time window. When the probability is greater than or equal to 0.5, we set the value to 1, indicating that the driver is fatigued during the current period. When the probability is less than 0.5, we set the value to 0, indicating that the driver is awake during the current period. If any period of a video is judged as fatigued, we treat the whole video as a fatigue sample.
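To make the data flow concrete, the following sketch runs a 6 × 3 sequence through a single LSTM cell and applies the decision rule described above. It is only an illustration: the hidden size of 4, the random untrained weights, the single-layer layout, and all function names are assumptions, and the network in Figure 7 may differ. Only the 0.5 threshold and the rule that one fatigued window marks the whole video come from the text.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_params(input_size=3, hidden_size=4, seed=0):
    """Random, untrained weights; hidden_size=4 is an arbitrary choice."""
    rng = random.Random(seed)
    width = input_size + hidden_size
    params = {}
    for gate in ("i", "f", "o", "g"):
        params["W_" + gate] = [[rng.uniform(-0.5, 0.5) for _ in range(width)]
                               for _ in range(hidden_size)]
        params["b_" + gate] = [0.0] * hidden_size
    params["w_out"] = [rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
    params["b_out"] = 0.0
    return params


def lstm_probability(sequence, params):
    """Forward pass of one LSTM cell over the sequence, followed by a
    sigmoid output unit on the final hidden state."""
    hidden_size = len(params["b_i"])
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in sequence:
        z = list(x) + h  # current input concatenated with previous hidden state

        def gate(name, activation):
            return [activation(sum(w * v for w, v in zip(row, z)) + b)
                    for row, b in zip(params["W_" + name], params["b_" + name])]

        i, f, o = gate("i", sigmoid), gate("f", sigmoid), gate("o", sigmoid)
        g = gate("g", math.tanh)
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    logit = sum(w * v for w, v in zip(params["w_out"], h)) + params["b_out"]
    return sigmoid(logit)


def window_label(probability, threshold=0.5):
    """1 (fatigued) if the probability is at least the threshold, else 0."""
    return 1 if probability >= threshold else 0


def video_is_fatigued(window_probabilities, threshold=0.5):
    """A video is a fatigue sample if any window is labeled fatigued."""
    return any(window_label(p, threshold) == 1 for p in window_probabilities)


params = init_params()
window = [[0.30, 0.31, 0.10]] * 6  # placeholder 6 x 3 sequence
probability = lstm_probability(window, params)
```

In practice the weights would be learned from labeled sequences, and a deep learning framework would replace this hand-written cell; the sketch only shows how six 3-dimensional vectors reduce to one probability and one label.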

Dataset.
In the experiments, we used the YawDD dataset and a self-built dataset to verify the performance of the method.

YawDD Video Dataset.
The dataset was collected by Abtahi et al. [29] and was captured in a static environment. The collectors recruited a large number of volunteers, a group composed of drivers of different skin colors, sexes, and ages. Following instructions, the volunteers performed different actions such as normal driving, talking, and yawning. Multiple videos were recorded for each volunteer. When a driver wears pure black sunglasses, even a human observer cannot recognize the state of the driver's eyes. Therefore, we selected for testing 100 videos in which the volunteers were not wearing pure black sunglasses, 50 of men and 50 of women. A part of the dataset is shown in Figure 8.

Self-Built Dataset.
We built our own dataset for two reasons. First, some drivers in the YawDD videos do not yawn naturally but simply open their mouths to imitate a yawn. To capture the fatigue state as naturally as possible, all of our fatigue video samples were recorded after the volunteers finished work, since most people are more prone to fatigue after working for a long time. We cannot guarantee that every sample captures a natural yawn, but we filmed the behavior that most closely matches natural fatigue, so our algorithm is tested on behaviors commonly associated with fatigue rather than on verified fatigue. Second, the proportion of East Asian drivers in the YawDD dataset is low; most of the drivers are white or Indian. Adding a self-built dataset helps reduce the differences in experimental results caused by different skin colors.
The self-built dataset was collected by our experimental team. We recruited 10 volunteers and recorded two videos of each: a normal video and a fatigue video. The videos include closing the eyes, talking, laughing, and yawning. They vary slightly in face orientation, mouth shape, and whether the volunteer wears glasses, and they were collected under different lighting conditions. Part of the dataset is shown in Figure 9.

Experimental Results and Analysis.
The experimental platform is Windows 10 with an Intel(R) Core(TM) i7-9700K processor, a CPU main frequency of 3.6 GHz, and 8 GB of memory.
The programming language is Python. In the experiment, we split the video dataset into images and use MTCNN to detect and crop the face images. After cropping the face image, we use the DLIB library to mark the key points of the face and calculate the state values of the eyes and mouth. By calculating the aspect ratios of the eyes and the mouth, we can perform closed-eye detection and yawn detection. To verify the performance of the proposed algorithm, we compare it with key point detection algorithms proposed in recent years. Tables 1 and 2 show the detection accuracy of our model and other methods on the YawDD dataset and the self-built dataset, respectively. Our model is significantly better than the other algorithms, and it has a higher eye-mouth marking rate. Compared with the Viola-Jones algorithm, our method has significantly better results in the detection of faces, eyes, and mouths. The detection results on the YawDD dataset are slightly lower than those on the self-built dataset. This may be due to the small number of videos in our self-built dataset.
There is not much difference in the actual detection results. Next, we compare the detection time of the different methods. Tables 3 and 4 show the detection time of our model and other methods on the YawDD dataset and the self-built dataset, respectively. The Viola-Jones algorithm uses integral images to calculate its Haar-like features, which greatly reduces the amount of calculation. However, this algorithm was originally designed to detect frontal face images and is not very robust on side-face images. Therefore, its detection accuracy is low. The head pose estimation algorithm mainly uses the DLIB library to detect facial key points. The method proposed in this paper first uses MTCNN to extract the face and then uses the DLIB library to detect the key points of the face. When detecting the key points of the eyes and mouth, the head pose estimation algorithm runs DLIB on the entire picture, which increases the amount of calculation and lowers the detection rate. The data in the two tables show that our method has a longer detection time than the Viola-Jones algorithm, but our average detection accuracy is 11%-15% higher. Compared with the head pose estimation algorithm, the detection time is reduced by half, and the accuracy is increased by 8%-10%. Finally, we compared the accuracy of fatigue detection. Tables 5 and 6 show the fatigue detection accuracy of our model and other methods on the YawDD dataset and the self-built dataset, respectively. This study selected videos of drivers driving normally, talking, laughing, and yawning from the datasets and analyzed driver fatigue through the state of the eyes and mouth. We compare the method in this paper with MTCNN + DLIB, DLIB + LSTM, the head pose estimation method, and the Viola-Jones method.
When DLIB + LSTM is used to detect the fatigue state, DLIB runs directly on the entire picture, which not only takes a long time but also has lower accuracy. The detection accuracy of the facial key points directly affects the judgment of the fatigue state. When MTCNN + DLIB is used to detect the fatigue state, the judgment relies only on the fatigue feature value of a single frame, but fatigue is a continuous behavior over time. So, the accuracy of this detection method is significantly lower than that of our method. In addition to these two methods, we also selected two methods with superior performance for comparison. The results in Tables 5 and 6 show that the accuracy of our method reaches 88%-90%.
Figure 9: Self-built dataset.

Conclusion
We proposed a fatigue detection algorithm based on facial key points and long short-term memory. Since the face contains more features than the eyes and mouth, it is easier to detect. So, we first obtained the face image and then marked the key points of the eyes and mouth within it. This narrows the search range for the eyes and mouth and avoids interference from the background area of the image. Fatigue is a continuous behavior, and it is easy to make misjudgments if the result relies only on the eye and mouth features of a single frame, so we spliced the fatigue feature values of single frames into a temporal fatigue feature sequence and sent it to the LSTM network. Although our method is superior to other methods in the extraction accuracy of facial key points and in the final fatigue determination accuracy, its detection performance under insufficient light still needs to be improved. Our next step is to study fatigue driving detection in complex lighting environments, focusing on the challenge of fatigue detection under poor lighting conditions such as strong light and weak light. These application scenarios are more practical and more difficult. When an automobile enters a tunnel or runs at night, how can we recognize the driver's fatigue driving behavior in time? This direction is also one of the current research focuses in the field of fatigue driving detection.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under grant nos. 61872134 and
Table 3: Detection time of different areas in the YawDD dataset.