Model Lightweighting for Real-time Distraction Detection on Resource-Limited Devices

Detecting distracted driving accurately and quickly with limited resources is an essential yet underexplored problem, and most existing works ignore this resource-limited reality. In this work, we aim to achieve accurate and fast distracted-driver detection on embedded devices where only limited memory and computing resources are available. Specifically, we propose a novel convolutional neural network (CNN) light-weighting method that adjusts block layers and shrinks network channels without compromising the model's accuracy. Finally, the model is deployed on multiple devices for real-time detection of driving behaviour. Experimental results on the American University in Cairo (AUC) and StateFarm datasets demonstrate the effectiveness of the proposed method. For instance, on the AUC dataset, the proposed MobileNetV2-tiny model achieves 1.63% higher accuracy with just 78% of the parameters of the original MobileNetV2 model. The inference speed of MobileNetV2-tiny on resource-limited devices is on average 1.5 times that of the original MobileNetV2, which meets real-time requirements.


Introduction
According to the World Health Organization (WHO), more than 1.35 million people die in road traffic accidents every year [1], and distracted driving has been one of the leading causes [2]. The National Highway Traffic Safety Administration (NHTSA) reports that 3,142 people were killed in crashes involving distracted drivers on US roads in 2019 [3]. It is therefore necessary to design a system that detects distraction and helps alleviate this serious situation.
A visual feature-based approach to capturing distraction behaviour has been widely used in intelligent transportation systems with the help of deep neural networks. Edge-based advanced driver assistance (ADAS) [4] and driver status monitoring (DSM) [5] systems are now an important module for collaborative driving, yet the edge central processing units (CPUs) and graphics processing units (GPUs) of such systems are generally underpowered [6]. Ziran et al. [7] proposed a vehicle-to-cloud method for driver assistance systems. Abdu et al. [6] deployed the training and validation modules in the cloud environment; that is, the edge-device data are uploaded directly to the cloud and the distracted-driver detection task is performed on cloud servers, where abundant computing and storage resources are available to realize real-time inference [8].
However, several flaws in cloud computing make it less favourable for applications enabled by edge devices. First, the speed of data transmission depends strongly on the Internet connection; in high-speed moving scenes such as vehicles, data transmission may fail under poor network signals, and as the number of edge devices grows exponentially, transmission speed is becoming the bottleneck of cloud computing. Second, high-end GPUs are expensive, bulky, and power-hungry, which makes them hard to deploy on resource-limited edge platforms. Third, transmitting data to a centralized cloud server also raises security and privacy issues [9]. Hence, it is necessary to process data at the site where it is collected rather than in the cloud [10].
One current line of research on detecting distracted driving with resource-limited edge devices is pruning, which removes filters and weights from large CNNs such as ResNet50 [11] and VGG [12]. However, a large pruning ratio easily leads to a loss of accuracy. Another suitable approach is to adopt lightweight models such as MobileNetV2 [13]; for example, MobileNetV2 is seven times faster than ResNet50 but 3.6% less accurate. Moreover, owing to the low hardware utilization of the compact operators these models commonly adopt, existing lightweight models are still limited in the practical hardware efficiency they can achieve.
This paper focuses on detecting distracted driving with resource-limited edge devices. We design a new strategy for developing efficient CNNs that maintains model accuracy while increasing inference speed. Specifically, the blocks and channels of the network are optimized by analyzing the sensitivity of the model's performance. The innovation of our method lies in combining layer pruning with filter pruning to obtain a novel model light-weighting scheme. Experiments show that our approach can be applied to in-vehicle terminals to provide real-time reminders.
The contributions are summarized as follows: (1) Block-level architecture redesign is proposed to make the model more suitable for driving-behaviour analysis scenarios. Unlike past work that directly compresses block layers, the distribution of block layers is automatically adjusted according to the task, which helps the network learn more abstract and detailed features and thus improves accuracy; sensitivity information is used to adjust the number of residual bottleneck block layers. (2) Channel-level architecture redesign is proposed by pruning filters at each layer of the network with dynamic pruning ratios. Filters of relatively little importance are pruned to compress the CNN model, and fine-tuning is applied to reduce the loss of accuracy.

Related Work

The authors of [14] created the Southeast University Driving Posture (SEU-DP) dataset in 2011, which includes four types of behaviour: safe driving, operating the shift lever, calling and eating, and talking on a phone. K-nearest neighbour (KNN), random forest (RF) [15], Geronimo-Hardin-Massopust (GHM) multiwavelet transform [16], and pyramid histogram of oriented gradients (PHOG) [17] methods were used for driving-posture feature extraction and distraction detection. However, the SEU-DP dataset was not made publicly available.
The StateFarm dataset, published in 2016 for the distracted-driving detection competition on Kaggle, was the first public dataset of its kind; it contains ten types of distracted driving behaviour: driving safely, texting with the right hand, calling with the right hand, texting with the left hand, calling with the left hand, operating the radio, drinking, reaching back, hair and makeup, and talking with passengers [18]. Abouelnaga et al. [19, 20] created the AUC distracted-driver dataset, which contains the same ten classes as the StateFarm dataset, in 2019. Compared with traditional machine learning methods such as RF and KNN, CNNs effectively improve accuracy and handle more complex classification problems, so more and more researchers use deep learning for this task. The visual geometry group network (VGG) [21], InceptionV3 [22], ResNet [23], and video-based methods [24] have been used to improve distraction-detection accuracy. Improved ReSVM [25] combines deep features from ResNet with a support vector machine (SVM) classifier for driver distraction detection.
To improve recognition accuracy, facial expressions, hand gestures [26], and human body keypoints [27] have also been used for driving-behaviour feature extraction. Chiou et al. [28] used a cascaded face detector to locate the face region and obtain the coordinates of the face, eyes, and mouth, judging from these coordinates whether the driver is driving normally; when abnormal driving behaviour is detected, it is further classified as drowsy or distracted driving. Facial landmark detection [29] was used to detect distraction caused by the driver's head panning; that method increases feature-extraction capability through novel geometric and spatial scale-invariant features and outperforms existing state-of-the-art approaches in detection accuracy on multiple datasets.
However, early distraction-detection research focused on improving recognition accuracy. Most existing studies are based on traditional deep CNN models [30], such as ResNet and VGG; although these achieve high accuracy, they are not friendly to embedded systems with limited memory and computing resources. Recently, some researchers have begun to study lightweight distraction-detection methods. Binbin et al. [31] proposed a new neural network model based on decreasing the filter size; the model had only 0.76 M parameters and achieved 95.59% and 99.87% accuracy on the AUC and StateFarm datasets, respectively. Dropout, L2 regularization, and batch normalization [32] were applied to VGG-16 to reduce the number of parameters from 140 M to only 15 M. Bhakti et al. [33] proposed a new network structure, mobileVGG, with only 2.2 M parameters. Zuopeng et al. [34] introduced a lightweight microscopic detection network (LMS-DN) for lightweight distraction detection.
At present, research on lightweight distracted-driving detection is relatively scarce and still developing. We therefore set out to find a novel light-weighting method that lets resource-limited devices detect distracted driving in real time. This paper focuses on reducing the number of parameters and increasing speed while maintaining high accuracy.

CNN Compression Techniques.
To deploy CNNs on resource-limited devices, pruning [35] has often been used to reduce the model's complexity while trying to maintain decent accuracy.
Many pruning methods for filters and weights have been proposed in recent years [36]. Hao et al. [35] proposed reducing the number of channels in convolutional layers to shrink model size and computational complexity, using the L1-norm statistic to select less significant filters and sensitivity information to evaluate the influence of each network layer. The FPGM strategy [37] assesses the importance of filters within a single convolution by computing the geometric distance between filters. Liu et al. [38] suggested treating model pruning as a model-design process, using different pruning rates on different layers according to the task. Learning filter pruning criteria [39] were proposed to learn and select the most appropriate pruning criterion for each functional layer.
Compared with filter and weight pruning, pruning an entire layer or block is more effective at reducing model complexity and hardware latency [40]. Block-level pruning [41] adopts a multiobjective optimization paradigm to reduce the number of blocks while avoiding accuracy degradation. Xu et al. [42] proposed a fusible residual convolutional block for layer pruning, converting the convolutional layers of the network into residual convolutional blocks with a layer scaling factor. The DepthShrinker framework [43] shrinks the basic building blocks of CNNs with irregular computation patterns into dense networks.
However, previous work has mainly pruned large models. Compared with pruning large models, pruning lightweight networks such as MobileNetV2 is more difficult. In this paper, we propose a new lightweight model-compression method for MobileNetV2 that is compatible with both block-level and channel-level pruning. Unlike previous block-level pruning methods, we mainly prune the multigroup residual modules of the network.

Resource-Limited Device.
Resource-limited devices, such as embedded devices, mobile devices, and other Internet of things devices [44], have limited memory and processing resources. Deep learning algorithms are often computationally and memory-intensive and, therefore, unsuitable for resource-limited devices [45]. Computational processing units on resource-limited devices typically include integrated CPUs [46] and GPUs [47]. Extensive research is underway to develop suitable hardware acceleration units, such as FPGA [48], ASIC [49], TPU [50], and NPU [51], to create distributed systems to meet the high computational demands of deep learning models.
Another solution is to use lightweight networks, such as MobileNetV2, SqueezeNet [52], MobileNetV3 [53], and EfficientNet [54], which enable feasible embedded deployment. However, these lightweight models still struggle to balance speed and accuracy; for example, MobileNetV2 is 3.6% less accurate than ResNet50.
In this work, we design an improved method based on a lightweight model that considers both inference speed and accuracy. We chose two devices, the Xiaoyi Smart Rearview Mirror and the HUAWEI MediaPad c5, which can serve as intelligent in-car driving-recorder systems. With their resource-limited CPUs and GPUs, these devices verify the effectiveness of our method in edge deployment.

Method
The overall framework of the proposed model is shown in Figure 1. First, L1-norm regularization and sensitivity methods are used for block-level architecture redesign to improve accuracy. Then, filter pruning and fine-tuning are introduced for channel-level architecture redesign to reduce the number of parameters. The model is optimized and trained on the server side, and the improved model is deployed on embedded systems for inference. We use MobileNetV2 as the base model for the experiments and also apply our improved method to multiple lightweight models, such as SqueezeNet [52], MobileNetV3 [53], and EfficientNet [54].
3.1. L1-Norm and Sensitivity. L1-norm regularization is used to select the filters to prune. Let X_i ∈ ℝ^{n_i × w_i × h_i} denote the input feature maps of the i-th convolutional layer and X_{i+1} ∈ ℝ^{n_{i+1} × w_{i+1} × h_{i+1}} the feature maps of the next convolutional layer. n_{i+1} 3D filters F_{i,j} ∈ ℝ^{n_i × k × k} are applied to the i-th convolutional layer, where k × k is the size of the convolution filter.
The relative importance of a filter F_{i,j} can be measured by the sum of its absolute weights, i.e., its L1-norm:

s_j = Σ |F_{i,j}|.

This sum reflects the filter's contribution to the magnitude of the output feature maps: compared with the other filters in the same layer, filters with smaller absolute weights tend to generate feature maps with weak activations, so the smaller the sum, the less significant the filter. By pruning each layer independently, each layer's sensitivity can be understood. The sensitivity of each convolutional layer represents its importance and visually displays the impact of each layer and filter on accuracy. Sensitivity is obtained by pruning the filters of each layer independently and evaluating the accuracy on the validation set. After calculating the sensitivity, the less essential layers/filters are pruned first.
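As an illustration, the L1-norm ranking and the per-layer sensitivity sweep described above can be sketched in NumPy; the function names and the `evaluate` callback are our own placeholders, not the paper's code:

```python
import numpy as np

def l1_filter_importance(weights):
    """Rank the filters of one conv layer by L1-norm.

    weights: array of shape (n_out, n_in, k, k). The importance of
    filter j is s_j = sum(|F_{i,j}|); smaller scores mark filters that
    tend to produce weakly activated feature maps.
    """
    scores = np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1)
    return np.argsort(scores)  # filter indices, least important first

def layer_sensitivity(evaluate, baseline_acc, layer,
                      ratios=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Accuracy loss when `layer` is pruned independently at each ratio.

    `evaluate(layer, ratio)` is a hypothetical callback that prunes a
    copy of the model at the given ratio and returns validation accuracy.
    """
    return {r: baseline_acc - evaluate(layer, r) for r in ratios}
```

Layers whose loss stays near zero across all ratios are the insensitive ones that the block- and channel-level redesign targets first.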

Block-Level Architecture Redesign
3.2.1. Motivation. Sensitivity was analyzed to understand the applicability of MobileNetV2 to the distraction-detection task. MobileNetV2 contains 19 residual bottleneck blocks, many of which are generated by repeating a block in a cycle. Taking each residual unit of MobileNetV2 as a whole bottleneck block, the method counts only the sensitivity of the first layer of the residual unit. The sensitivity on the AUC dataset is obtained by applying different pruning ratios to each convolutional layer.
Sensitivity analysis can be used to adapt existing open-source models to driving-behaviour analysis scenarios. We observe that the sensitivity of layers within the same cycle stage (with the same feature-map size) differs: the more cycles there are, the smaller the contribution of the later layers to accuracy. Layers with relatively flat sensitivity slopes matter less for accuracy, so we can remove model sublayers that have a relatively small impact on accuracy while, for accuracy-sensitive layers, increasing the number of cycles of the module sublayer to improve accuracy. This idea improves MobileNetV2's accuracy on driving-behaviour tasks, given that the distribution of MobileNetV2's block layers was designed for benchmark datasets.

Block-Level Optimization.
Our method optimizes the layout of the network layers starting from a well-trained model, improving accuracy and increasing the model's applicability to specific task scenarios. For the MobileNetV2 structure, this mainly means optimizing the number of cycles of the blocks within each stage.
For residual blocks repeated over multiple cycles, the number of cycles is adjusted according to the influence of the sensitivity results on accuracy. For residual blocks with only a slight change in sensitivity, the number of cycles is reduced, removing cycles whose impact on accuracy stays within k%. For residual blocks with a significant change in sensitivity, the number of cycles is increased until the added cycle modules change accuracy by less than k%.
Nevertheless, layer-by-layer optimization and retraining can be very time-consuming. We therefore design an automatic adjustment strategy that computes an importance score for each layer using the same sensitivity-analysis criteria and adaptively adjusts the distribution of block layers for the task according to the value of k. Our strategy achieves accuracy comparable to the original model.
At the same time, a fine-tuning process is introduced to speed up the training of the adjusted network: with the same parameters as the original training process, fine-tuning retrains from the original network weights. By adopting block-level optimization, the network increases its adaptability to specific tasks. Figure 2 shows the optimized layout of the MobileNetV2 network layers. The layers con4_3, con5_2, con5_3, and con5_4, which have a relatively small impact on accuracy, were reduced, while the residual units after con7_3 and con8, which significantly affect accuracy, were increased. The block-layer adjustment we designed automatically adapts the distribution of block layers to the task; in particular, increasing the number of deep block-layer loops helps the network learn more abstract and detailed features and thus improves accuracy.
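As a simplified illustration of the block-repeat adjustment, assuming per-stage sensitivity scores are available from the sweep above, the rule "shrink insensitive stages, grow sensitive ones within tolerance k" could be sketched as follows (the function and dictionary names are ours, and a real run would re-measure sensitivity and fine-tune after each adjustment):

```python
def adjust_block_repeats(repeats, sensitivity, k=0.01):
    """Adjust the cycle count of each residual-bottleneck stage.

    repeats:     {stage: current number of repeated blocks}
    sensitivity: {stage: accuracy drop when one repeat of that stage
                  is removed}, from an independent pruning sweep
    k:           tolerated accuracy change (e.g. 0.01 for 1%)
    """
    new_repeats = {}
    for stage, n in repeats.items():
        drop = sensitivity[stage]
        if drop < k and n > 1:
            new_repeats[stage] = n - 1   # insensitive stage: shrink
        elif drop > k:
            new_repeats[stage] = n + 1   # sensitive stage: grow, then re-check
        else:
            new_repeats[stage] = n
    return new_repeats
```

In the paper's setting this corresponds to reducing stages like con4_3 and con5_2 to con5_4 while growing the units after con7_3 and con8.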

Channel-Level Architecture Redesign.
The sensitivity analysis helps explain the principle of network-layer optimization and improves the model's accuracy. However, even after network-layer optimization, many filters contribute little to the final result and are redundant. Filter pruning is therefore used to compress the network.
In our algorithm, the pruning ratio is dynamically calculated from the sensitivity information, so the pruning ratio of each layer's filters differs. The dynamic pruning rate keeps the accuracy loss after pruning as small as possible: according to the sensitivity information, the impact of each layer's pruning ratio on accuracy is controlled within k%. Fine-tuning is applied after pruning to further limit the network's accuracy loss.
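To make the dynamic ratio selection concrete, the following sketch (function name and data layout are our own) picks, for each layer, the largest pruning ratio whose measured accuracy drop stays within k:

```python
def dynamic_pruning_ratios(sensitivity_curves, k=0.01):
    """Choose a per-layer pruning ratio from sensitivity curves.

    sensitivity_curves: {layer: {ratio: accuracy_drop}} obtained by
    pruning each layer independently and evaluating on validation data.
    Returns the largest ratio per layer with drop <= k (0.0 if none).
    """
    ratios = {}
    for layer, curve in sensitivity_curves.items():
        ok = [r for r, drop in sorted(curve.items()) if drop <= k]
        ratios[layer] = max(ok) if ok else 0.0
    return ratios
```

Insensitive layers thus receive aggressive ratios while sensitive layers are pruned little or not at all.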
The pruning procedure is as follows:
Step 1: Calculate the pruning ratio of each convolutional layer's filters based on the sensitivity information.
Step 2: Set a different pruning ratio for each convolutional layer. Prune a small number of filters layer by layer and count floating point operations (FLOPs) and accuracy. If FLOPs and accuracy meet the requirements, go to Step 4; otherwise, go to Step 3.
Step 3: Perform one epoch of fine-tuning on the network and return to Step 2.
Step 4: Fine-tune the network until convergence.
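The four steps above can be sketched as a loop over hypothetical callbacks; `prune`, `fine_tune`, and `measure` stand in for framework-specific operations and are not part of the paper's code:

```python
def iterative_prune(model, target_ratios, step, flops_target, acc_target,
                    prune, fine_tune, measure):
    """Iterative prune-and-fine-tune loop (Steps 1-4).

    target_ratios: per-layer ratios from the sensitivity analysis (Step 1)
    prune(model, layer, ratio):  remove the lowest-L1 filters of `layer`
    fine_tune(model, epochs):    recovery training (Steps 3 and 4)
    measure(model):              returns (flops, accuracy)
    """
    applied = {layer: 0.0 for layer in target_ratios}
    while any(applied[l] < target_ratios[l] for l in target_ratios):
        for layer, target in target_ratios.items():   # Step 2: prune a few
            if applied[layer] < target:               # filters per layer
                applied[layer] = min(target, applied[layer] + step)
                prune(model, layer, applied[layer])
        flops, acc = measure(model)
        if flops <= flops_target and acc >= acc_target:
            break                                     # requirements met
        fine_tune(model, epochs=1)                    # Step 3: brief fine-tune
    fine_tune(model, epochs=100)                      # Step 4: to convergence
    return model
```

Pruning in small increments with interleaved fine-tuning is what keeps the accuracy loss within the k% tolerance at each stage.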

Overall Optimization.
With block-level and channel-level optimization, we obtain the final distribution for all layers. The pruned network is then fine-tuned to obtain the final accuracy. The whole process of the proposed CNN light-weighting method is given in Algorithm 1.

Results and Discussion
The experiments use the AUC and StateFarm datasets to verify the effectiveness of the improved method. Finally, the model is deployed on embedded devices for validity and real-time testing. The improved design achieves real-time operation with lower computational complexity and memory requirements while maintaining good driver-posture classification accuracy. Top-1 accuracy, the approximate number of parameters, FLOPs, and frames per second (FPS) were used to evaluate the performance of the proposed design.

Datasets.
The AUC dataset contains ten types of distracted driving behaviours. Figure 3 shows sample images of the ten behaviour types from the AUC dataset. The camera was placed on the upper right of the front passenger seat to record the simulated behaviour of the driver. The videos were cut into single-frame pictures, each frame being 1080 × 1920 pixels. The dataset has 31 participants from 7 different countries and 17,308 images in total, of which 12,977 are in the training set and 4,331 in the test set.

The StateFarm distracted-driving detection dataset was used to verify the method's applicability. It has 26 participants (13 males and 13 females), with 22,424 training images and 79,726 test images; each image is 640 × 480 pixels. Figure 4 shows sample images of the ten types of distracted driving behaviours from the StateFarm dataset. Because the labels of the test set were withheld after the competition, the training set was processed in three groups of experiments. In the first, the StateFarm training set was randomly divided into a 90% training set (17,934 images) and a 10% validation set (4,490 images) for performance evaluation and verification. In the second, the training set was randomly divided 70%/30%, following other methods in the literature. In the third, it was randomly divided into a 60% training set, 20% test set, and 20% validation set for the cross-validation experiment.

Results for the AUC and StateFarm Datasets.
The method was developed in PaddlePaddle [55] with Python 3.7. A single NVIDIA GeForce GTX 1080Ti GPU with 12 GB of RAM was used to train the network. The cosine function was adopted for learning-rate decay during training. The learning rate was 0.05, the number of epochs was 100, and the input image shape was [3, 224, 224]. The batch size was set to 32. The optimizer was SGD with momentum = 0.9 and weight decay = 5 × 10^−3. The ImageNet pretrained model was used as initialization to speed up convergence.

Table 1 shows the change in the residual bottleneck blocks of MobileNetV2 from n to n′ in the global block-layout optimization step, with channels changing from c to c′ in the pruning step. Each line describes a sequence of one or more residual bottleneck layers, repeated n′ times instead of n times, where n′ is the number of cycles adjusted according to the sensitivity results; this optimization improves the network's accuracy in detecting distracted driving. c′ is the channel count of each block after the pruning step. The feature maps of each convolutional layer can drop at least 10% of their channels without affecting accuracy, and reducing the number of filters effectively reduces the number of parameters.

Figure 5 shows the sensitivity of MobileNetV2 and MobileNetV2-tiny on the AUC dataset; the abscissa is the ratio of filters pruned and the ordinate is the loss of accuracy, with each coloured dotted line representing one convolutional layer of the network. Where accuracy decreases slowly as the pruning ratio grows from 0 to 0.9, the corresponding convolutional layer is relatively insensitive and contributes comparatively little to network accuracy. The inverted residual modules con4_3, con5_2, con5_3, and con5_4 have a relatively small impact on accuracy, while the modules con7_3 and con8 have a more significant effect. Compared with Figure 5(a), the sensitivity of MobileNetV2-tiny shown in Figure 5(b) reflects the removal of the residual bottleneck layers whose impact on accuracy is below 1%. We can thus reduce the model sublayers that have a relatively small impact on accuracy while, for accuracy-sensitive layers, increasing the number of cycles of the module sublayer to improve accuracy.

Input: training data X; validation data Y; the i-th convolutional layer with filters F_{i,j}; the bottleneck residual block number n of the i-th convolutional layer; the channel number c of the i-th convolutional layer.
Output: the light-weighted model and the updated n, c, and F_{i,j}.
(1) for i in convolution layers do
(2)   optimize n and c based on the sensitivity information (2);
(3)   fine-tune the module;
(4)   update n and c based on the accuracy on Y;
(5)   for pruning ratio ← 0.1 to 0.9 do
(6)     fine-tune the module;
(7)     update the pruning ratio, keeping the accuracy decrease on Y within k%;
(8)   end for
(9) end for
(10) obtain the final parameters n, c, and F_{i,j}, and fine-tune the pruned model with X.
ALGORITHM 1: The proposed CNN light-weighting method.

The experimental results of the improved MobileNetV2-tiny model for the AUC dataset are shown in Table 2. Compared with the original model, MobileNetV2-tiny has fewer parameters and 1.63% higher accuracy, with only 78.06% of the original MobileNetV2 parameters. The new design reduces computational complexity while maintaining driver-posture classification accuracy, which is necessary for embedded applications; FLOPs are reduced by 71.6%, making the model very suitable for deployment on resource-limited devices. Table 3 compares our results with the latest methods in the literature: the number of parameters of our model is relatively small while its accuracy is relatively high. Our method optimizes an existing mature algorithm, which is a different route to improvement.

Table 4 shows the verification results for the StateFarm dataset, randomly split into training and test sets at a 9 : 1 ratio. A total of 100 epochs were trained, with the same input parameters as for the AUC dataset. The improved MobileNetV2-tiny's accuracy is 0.19% higher than that of the original model. Other studies, such as Dhakate and Dash [22], randomly split the training data at a 7 : 3 ratio; the results under this split are shown in Table 5.

Table 5: Comparisons with the state-of-the-art methods in the literature for the StateFarm dataset with the dataset randomly split 7 : 3.

Model                                             Top-1 acc (%)   Params (M)
InceptionV3 [22]                                  92.90           25.6
Xception [22]                                     82.50           22.9
InceptionV3 + Xception [22]                       90.00           46.7
InceptionV3 + Xception + ResNet50 + VGG-19 [22]   97.00           214.3
D-HCNN [31]                                       99.87           0.76

As can be seen from Figure 6, the model trained on the server platform is converted into an inference model that can be applied in an embedded environment, including data-structure conversion and file-parameter conversion. Figure 7 shows the FPS for inference on the embedded platforms. On the Xiaoyi platform, the inference speed of the optimized model is significantly improved; the processing time for one frame is shorter than that of the original model, which meets real-time processing requirements. The inference speed on the HUAWEI MediaPad c5 is higher still and reaches real-time processing of video frames. In general, the proposed approach achieves real-time processing speed and can be applied to actual distracted-driving scenarios, providing a practical reference for research on end-to-end deployment of driving-behaviour detection.
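For reference, the cosine learning-rate decay used throughout these experiments (base learning rate 0.05 over 100 epochs) follows the standard cosine-annealing formula; the sketch below uses our own function name rather than the exact PaddlePaddle API:

```python
import math

def cosine_lr(epoch, base_lr=0.05, total_epochs=100):
    """Cosine learning-rate decay: starts at base_lr and follows
    0.5 * base_lr * (1 + cos(pi * epoch / total_epochs)), reaching
    zero at the final epoch."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))
```

The schedule decays slowly at first and fastest mid-training, which suits the short fine-tuning rounds interleaved with pruning.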

Conclusion
Although some studies have considered distracted-driving detection, current works focus less on real-time detection for embedded devices. We develop a lightweight model for real-time monitoring of driving behaviour that can be applied to vehicle-mounted terminals to provide real-time reminders. We adopted pruning, but unlike past work that directly compresses layers and filters, our method automatically adjusts the distribution of layers according to the task, increasing the cycles of critical layers while pruning the least essential ones. The proposed MobileNetV2-tiny model requires far fewer FLOPs than the original MobileNetV2 and obtains 1.63% higher accuracy on the AUC dataset. Compared with the advanced methods in the existing literature, the results show that our method has advantages in speed and model size while maintaining high accuracy, and the lightweight method meets real-time processing requirements on embedded devices.
However, some problems remain. First, for the SqueezeNet model, the improvement from the proposed method is relatively small (only 0.51%). Second, the training time of the proposed method becomes longer. In future work, (1) we will continue to improve the method to reduce the time consumed by pruning, and (2) we will investigate other ways to improve the model's actual hardware efficiency. We also wish to comprehensively analyze the degree of danger of driving behaviours by integrating information such as vehicle speed, to meet the needs of practical applications.

Data Availability
The AUC public dataset and the StateFarm open training dataset were used for this study.

Conflicts of Interest
The authors declare no conflicts of interest.