A Lightweight Object Detection Network for Real-Time Detection of Driver Handheld Call on Embedded Devices

It is necessary to improve the performance of the object detection algorithm in resource-constrained embedded devices by lightweight improvement. In order to further improve the recognition accuracy of the algorithm for small target objects, this paper integrates 5 × 5 deep detachable convolution kernel on the basis of MobileNetV2-SSDLite model, extracts features of two special convolutional layers in addition to detecting the target, and designs a new lightweight object detection network—Lightweight Microscopic Detection Network (LMS-DN). The network can be implemented on embedded devices such as NVIDIA Jetson TX2. The experimental results show that LMS-DN only needs fewer parameters and calculation costs to obtain higher identification accuracy and stronger anti-interference than other popular object detection models.


Introduction
Traffic congestion is a huge problem [1] facing the world in the process of rapid urbanization. In order to avoid traffic accidents, algorithms and embedded environment applications of drivers' dangerous behavior detection [2], such as handheld receiving and calling, have been studied extensively [3][4][5][6][7]. Most of the research on object detection networks aims at improving accuracy [8] but ignores the problems of model, calculation amount, and the number of parameters. In 2014, Girshick et al. proposed a region-based convolutional neural network (R-CNN) [9], which uses a region-based recognition method to detect objects. In 2015, two improved versions of R-CNN-Fast R-CNN [10] and Faster R-CNN [11]-were proposed to realize end-to-end detection of targets. Both models are two-stage algorithms with higher accuracy than traditional algorithms, but the detection speed is slow and cannot meet the real-time requirements. In 2016, Redmon et al. proposed YOLO [12] and YOLO9000 [13] and then proposed YOLOv3 [14]; the latter greatly improved the detection effects of small objects when compared to YOLO9000. e YOLO series algorithm greatly improves the detection speed because it combines the twostage task of sorting and identifying candidate box in Faster R-CNN. However, the detection speed is relatively slow due to the large number of parameters of YOLOv3. e network should be both accurate and fast, in order to realize the operation of the object detection network on mobile devices. Liu et al. then proposed an SSD [15] algorithm to realize the regression detection of the whole image that improved the speed but greatly reduced the accuracy of the detection of small targets. YOLO, SSD, and the network derived from them, as the representatives of the phase network, realize end-to-end training. ey only use a convolutional neural network to directly predict the categories and positions of different targets and improve the detection speed [16] by sacrificing accuracy.
Tiny-Yolo, introduced in 2017, is widely used due to its high speed and low memory consumption, but for devices that do not have GPU, it is still difficult to use Tiny-Yolo for real-time applications. In the same year, Andrew G. Howard et al. proposed MobileNet [17] for mobile and embedded vision applications. In 2018, the network MobileNet-SSD, derived from VGG-SSD, was proposed to dramatically reduce the parameters while greatly increasing the detection speed, but the risk of missing and false detection of small objects was extremely high in this. Several efficient mobile neural networks for common object detection have been proposed in the year recently, such as SqueezeNet [18], ShuffleNet [19], and so on. In order to further enhance the accuracy of small object detection and study the possibility of the application of lightweight object detection network in embedded devices, a new lightweight object detection network is proposed in this paper, which especially improves the detection ability of small target objects. e rest of this paper is organized as follows. Section 2 mainly introduces the related research on the application of small target detection and deep learning network on embedded devices. Section 3 introduces the overall structure of MobileNetV2 and SSD. Section 4 presents the structure of the new object detection network Lightweight Microscopic Detection Network (LMS-DN) and explains the innovation of the network. Section 5 provides and compares the experimental results with previous object detection networks on the KITTI, VOC, and Safe_Imgs datasets. e experimental results demonstrate the efficiency improvement of LMS-DN implemented on NVIDIA Jetson TX2. Finally, we draw conclusions on this work in Section 6.
e main contributions of this study are as follows: (

Related Work
Driver's handheld call detection can be regarded as the detection of small target mobile phone, which has become a challenging problem in the field of target detection. If an object is only a small part of an image (less than 0.1% of the image area), it is considered a small target. By analyzing the previous target detection networks (such as R-CNN, R-FCN [20], etc.), it can be said that a large number of convolution layers are used in the network structure to extract image features [21] dynamically. At the same time, the pool layer is designed to gradually reduce the size of the feature map that further reduces the calculation cost and prevents the model from overfitting. In this process, the resolution of the image is obviously reduced, and the information loss for small objects is very high. In view of the above problems, many researchers are focusing on the study of small target detection. Chen et al. proposed TSSD in 2017, which reused the pyramidal feature hierarchy calculated by convolution networks and used each layer of feature maps for detection, but it requires high computational capacity and memory of the computers. Krishna and Jawahar add deconvolution layers to the RPN to restore low-resolution images to high resolution, thus improving the detection effect for small targets. Lin et al. proposed FPN [22], which can be predicted on different feature maps by merging high-level semantic information and low-level location information. is method is very important in small target detection. In the same year, Guimei et al. proposed the feature-fused SSD [23] based on SSD, which fused different levels of features into context information, which is helpful in detecting small objects. IoU threshold is very important for sample selection. If the IoU threshold is too high, the quality of a positive sample will be very high, but the quantity will be very small, and the imbalance in the sample proportion will have a large influence on the final outcome. If the IoU threshold is low, the number of samples will increase, but the quality of samples will decrease. How to balance this relationship, that is, how to select a good IoU, is very important for the detection results. e cascade R-CNN, proposed by Cai et al. in 2018, continuously raised the threshold of IoU through a multistage structure, while ensuring the number and quality of samples, and finally trained a high-quality detector. Singh et al. proposed SNIP, which pretrained the image and then fine-tuned it using the original sampled image, and used the fine-tuned model to predict the original sampled image, thus making the input distribution as close to the model's pretrained distribution as possible. Zhishuai Zhang proposed DES on the basis of SSD, added semantic segmentation branch and global activation module, and carried out highlevel target detection features by learning the semantic relationship between feature channels and target categories.
Artificial intelligence and machine learning need a powerful computing platform for its implementation. e computing module is mostly placed at back end. An embedded system is developed for special equipment; one of its characteristics is high cost, so hardware resources are limited. In the face of some special scenes such as automobile driving, the real-time requirement of embedded equipment is high. NVIDIA launched a high-performance Jetson TX2 embedded board to speed up the layout of deep learning on terminal devices. e chip will be used in urban security, smart drones, ocean observations, intelligent agriculture, and industrial and commercial robots. Jetson TX2 really solves the real-time positioning and map building (Simultaneous Localization and Mapping, SLAM) computing requirements. ere are several case studies about CNN-based object detection algorithms for embedded platforms. Mao et al. implemented the detection algorithm Fast R-CNN on Jetson TK1. ey applied SVD to the FC layer, and by adopting a more effective approach to regional recommendations, simplified the Fast R-CNN model. Both the CPU and GPU were employed to achieve the parallel computing of region proposal and CNN classification. Wang et al. customized the design parameters of the object detection framework, to maximize the algorithm energy e aforementioned works mainly focused on the design of the algorithm to reduce the computational complexity of the CNN model, which is often at the expense of accuracy loss.
is paper proposes a lightweight model LMS-DN that needs only a few parameters. e calculation costs to obtain higher identification accuracy are lower than other popular object detection models. LMS-DN can capture the location information of small objects to improve the accuracy of mobile phone detection while running on NVIDIA Jetson TX2 embedded devices in real time.

MobileNetV2 Lightweight Network.
MobileNetV2 is an improvement upon MobileNetV1. MobileNetV2 can reduce the number of parameters and improve the operational speed more effectively, with the use of depth separable convolution and the incorporation of residual connection, providing an effective model for mobile and embedded visual applications. Depth separable convolution is a decomposable convolution operation that can be divided into two smaller operations: Depthwise Convolution and Pointwise Convolution. e first step uses single-channel convolution to check the convolution of each channel of the input data. In the second step, N Point convolution with the same depth as the input data is used to check the results separated in the previous step and combine them to generate new results, as shown in Figure 1. Table 1 shows the main structure of MobileNetV2, where t represents the "expansion" multiple, C represents the number of output channels, N represents the number of repeats, and S represents the stride length. e core of MobileNetV2 is the bottleneck residual block (inverted residuals), as shown in Figure 2, which forms the "expansion" to "convolutional layer extracts features" to "compression" structure. Expansion layer uses 1 × 1 convolution kernel in order to map the low-dimensional space to the high-dimensional space. Among them, one super parameter is the dimension expansion multiple, which defaults to 6. After the depth separable convolution layer, it is projection layer, and 1 × 1 convolution kernel is also used in order to remap the high-dimensional features to the low-dimensional space. e function of the inverted residual module is to transform the input data into high dimensions and then extract features through deep decomposition and convolution.

Single Shot MultiBox Detector. Single Shot MultiBox
Detector (SSD) is an object detection network [6] that can directly predict the target category and position. In the previous model, the standard architecture network used for image classification was called the basic network, and then some additional layers were added, and the feature maps output by six different convolutional layers were fused to achieve comprehensive detection. Figure 3 shows the SSD. VGG16 [4] forms the basic network of the algorithm. e two full connection layers are changed into the convolutional layer, and another four convolutional layers are added to construct the network structure. e six convolutional layers involved in feature map fusion will produce a certain number of borders, which are together called the default box. e calculation formula of default box size is as follows: where S min is 0.2, S max is 0.95, through aspect ratio A to adjust the default box. In the calculation of loss function, SSD uses the sum of two terms, including positioning loss function and regression loss function. Total loss function: N is the number of matched default boxes. x represents whether the matched boxes belong to category p, and the value is {0, 1}. g is the ground truth box; c refers to the confidence of the checked target belonging to category P.

Proposed LMS-DN
LMS-DN is proposed in order to meet the requirements of real-time detection [24] and to further enhance the extraction of features of small objects. e first part of the twopart LMS-DN is the improved basic classification network, MobileNet-I (the network structure is shown in Figure 4), and the second part is the improved SSDLite network. MobileNet-I borrows from inception [25] structure expansion features to extract sensing field, integrates 5 × 5 depth separable convolution, and adjusts the overall structure of MobileNetV2. e MobileNet-I model contains 5 × 5 depth convolution compared to the previous studies that typically used only a 3 × 3 kernel. In fact, a 5 × 5 kernel does save resources for deeply separable convolutions rather than two 3 × 3 kernels. Formally, given the input shape (H, W, M) and output shape (H, W, N), let C 5×5 and C 3×3 , respectively, represent the calculation cost of depth separable convolution of 5 × 5 and 3 × 3: For the same effective receptive field, when the input depth is N >7, a convolution kernel with 5 × 5 has less computation than two cores with 3 × 3. Remove the last pooling layer of MobileNet-I and add an auxiliary convolutional layer to connect the base network and SSDLite networks. SSDLite replaces some standard convolution in the SSD prediction layer with Deep separable Convolution. According to the feature extraction mechanism of small target and the characteristics of different convolutional layers, LMS-DN extracted two additional features of special Computational Intelligence and Neuroscience convolutional layers to detect the target, and that is very effective for detecting small target objects such as mobile phones. us, Figure 5 shows the overall structure of LMS-DN that is formed.
As can be seen in Figure 5, in addition to the six feature maps extracted previously, LMS-DN also selects feature maps of the "expansion" layer in two convolution units, BnConv5_4 and BnConv6_2 for convolution, permutation, and flattening, and then merge the output for detection through the splicing layer. e reasons for the additional extraction of two feature maps of convolutional layers are as follows: (1) From the above definition, when the input image resolution is 300 × 300, the resolution of small objects should not be greater than 30 × 30. In the deep learning model, the deeper the network layer is, the stronger is the ability to extract image features. In order that the features of small objects could be optimally learned without losing information about them, we decide to extract the feature map of the "expansion" layer of the convolution unit BnConv5_4 with an image no greater than 30 × 30. (2) e resolution of the convolution unit BnConv6_2 is 19 × 19. As the transition layer of convolution units BnConv5_4 and BnConv7_1, BnConv6_2 has information that is more global, and with better feature extraction capability, it is more conducive to detecting small targets in images.     Computational Intelligence and Neuroscience other popular object detection networks have been described. Finally, the Safe_Imgs collected in the driver video monitoring platform have been used to detect the handheld call of the driver through LMS-DN. VOC2007 and VOC2012 datasets are divided into 20 target object categories and 21 background categories. In this paper, both VOC2007 and VOC2012 are taken as training datasets while VOC2007 test sets alone are used for testing.

Experimental Results and Analysis
KITTI datasets contain real image data that have been collected from urban, rural, and highway scenes and are annotated in eight categories. In this paper, in KITTI dataset, person sitting is merged into a pedestrian. e dataset consists of seven categories: car, van, truck, tram, pedestrian, background, and so on. Of the 7,400 tagged images, 6,660 were used as training sets and 740 as test sets.
All images of the Safe_Imgs dataset are from the driver video monitoring platform [29], as shown in Figure 6. e safe driving and talking video of driving of 30 drivers were selected in the data, out of which the driving videos of 15 drivers were selected as the verification set. Using OpenCV to capture the collected videos at 10 frames per second with a resolution of 300 × 300, 5,500 images of the training dataset were obtained, and 550 images of the dataset were verified. LabelImg [30] was used to generate an XML file corresponding to the sample after annotation.

Performance Evaluation Indicators.
ere are many evaluation criteria for the object detection algorithm. According to the emphasis of different standards, this paper uses detection accuracy, detection efficiency, and model size to evaluate the object detection model. e accuracy, precision rate, recall rate, mean average precision (mAP) were used to evaluate the performance of the target detection model in all categories on the dataset. e calculation methods of these indexes are as follows: where a is the number of misclassified samples, m is the total number of samples, TP (true positive) refers to a positive sample that is predicted to be positive by the model, FP (false positive) refers to a negative sample that is predicted to be positive, and FN (false negative) refers to a positive sample that is predicted to be negative.    was used to evaluate the size of the model. By weighing these performance indexes, the algorithm which is more suitable for embedded transplantation is discussed.

Results and Analysis.
We conducted comparative experiments from the five aspects given below, in order to prove that the proposed model and algorithm have better detection effect, especially in the detection of small objects: (1) is section makes a comparison of the effects of different combination models by controlling variables and using different basic networks and SSD structures. e base network experiment was conducted on KITTI datasets using a combination of MobileNet, MobileNetV2 and MobileNet-I, and SSD structures. e results are shown in Table 2. Comparing the results of lines 1 and 2, it is evident that when SSD is used in all detection networks and MobileNetV2 is used as the basic network, the network model size can be effectively reduced by 3.3 MB, and the detection accuracy is unaffected. Compared to lines 2 and 3, the results show MobileNet-I as the basic network, compared to MobileNetV2. By comparing the results of lines 2 and 3, it can be seen that MobileNet-I, as the basic network, has more 5x5 depth separable convolution with a slightly reduced model size compared with MobileNetV2, while imitating the inception structure to introduce Sin-Conv convolution unit, and the model size is slightly reduced. is suggests that the depth of the network reaches a certain degree-a 5 × 5 convolution kernel than two 3 × 3 convolution kernels with less amount of calculation. In addition, MobileNet-I as a base network has a higher detection accuracy than MobileNetV2, with MAP improving by 1.3%, indicating that MobileNet-I is also better than Mobile-NetV2 in feature extraction.
(2) In this section, experiments were carried out on KITTI, VOC, and Safe_Imgs datasets using several popular lightweight object detection networks-MobileNet-SSD, MobilenetV2-SSDLite, and LMS-DN-as proposed in this paper.
e experimental results on the VOC and KITTI data sets are shown in Tables 3 and 4. It can be seen from the tables that the basic network, MobilenetV2, can reduce the size of the model while improving the accuracy. By comparing the results of the first and second lines, it can be seen that MobilenetV2-SSdLite reduces the size of the model while attaining higher accuracy than MobileNet-SSD. By comparing the results of lines 2 and 3, it can be seen that when compared with MobilenetV2-SSDLite, MAP of LMS-DN proposed in this paper improved by 1.4% and 2.6%, respectively, under the condition that FPS of model size was slightly enlarged while actual FPS slightly decreased. Higher recognition accuracy was obtained with fewer parameters and calculation costs in comparison to MobilenetV2-SSDLite. e effect diagram of MobilenetV2-SSDLite and LMS-DN on KITTI dataset is shown in Figure 7. As can be seen from the diagram, under MobilenetV2-SSDLite, many cars with small proportions in the figure have not been detected, whereas when LMS-DN improved after SSDLite, there has been a great improvement on the detection of small target objects, and target vehicles at a long distance can be detected. e experimental results on the Safe_Imgs data set are shown in Table 5. LMS-DN achieves 86.2% Computational Intelligence and Neuroscience accuracy, 4.9% higher than MobileNet-SSD and 3.5% higher than MobileNetV2-SSDLite at the cost of less model size. At the same time, there are varying degrees of improvement in the accuracy rate and recall rate. According to the above experimental results, improvements to SSDLite cause LMS-DN to detect smaller target objects than MobilenetV2-SSDLite.
(3) To analyze the performance of MobileNetV2-SSDLite and LMS-DN models, the Safe_Imgs dataset was used to compare the accuracy of the two models under different thresholds under the same experimental conditions in this section. As shown in Figure 8, it can be seen that the detection performance of MobileNetv2-SSDLite decreases significantly when the threshold increases gradually from 0.25 to 0.55. However, for LMS-DN, when the threshold value is 0.55, the accuracy rate still reaches 82.60%, indicating that LMS-DN has a good anti-interference capability. Figure 9 shows the detection results of a threshold value of 0.4.         Figure 10: Illumination test.   Table 7 shows the statistical results of three different lighting types. In addition, obstacle-blocking test was carried out on the development board, and some picture effects were obtained, as shown in Figure 11, and the statistical results are shown in Table 8. LMS-DN can also accurately detect the mobile phone with the target object when the palm blocks most of it. To sum up, according to the statistical data, the LMS-DN model proposed in this paper can not only overcome the influence of strong and weak light, but it can also complete the real-time detection of targets with high accuracy in case of certain interference of obstacles.

Conclusions
is paper proposes a lightweight network LMS-DN for the driver's active safety prevention and control system, which can detect effectively and in real time on embedded devices. e network is improved on the basis of MobileNetV2-SSDLite and tested using KITTI, VOC, and Safe_Imgs datasets. e experimental results show that considering the factors such as accuracy, speed, and model size, the lightweight driver handheld call detection network proposed in this paper is more accurate and has stronger anti-interference in the detection of small target objects than the previous lightweight object detection networks. In future work, more real-time tests will be conducted on the embedded platform for different driving situations.

Data Availability
e data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
e authors declare that there are no conflicts of interest regarding the publication of this paper.