Face Detection and Segmentation Based on Improved Mask R-CNN

Deep convolutional neural networks have recently been applied successfully to face detection. Despite remarkable progress, most existing detection methods only localize each face with a bounding box and cannot simultaneously segment the face from the background. To overcome this drawback, we present a face detection and segmentation method based on an improved Mask R-CNN, named G-Mask, which incorporates face detection and segmentation into one framework in order to obtain more fine-grained face information. Specifically, in the proposed method, ResNet-101 is used to extract features, RPN is used to generate RoIs, and RoIAlign faithfully preserves the exact spatial locations so that binary masks can be generated through a Fully Convolutional Network (FCN). Furthermore, Generalized Intersection over Union (GIoU) is used as the bounding box loss function to improve detection accuracy. Compared with Faster R-CNN, Mask R-CNN, and Multitask Cascade CNN, the proposed G-Mask method achieves promising results on the FDDB, AFW, and WIDER FACE benchmarks.


Introduction
Face detection is a key step for subsequent face-related applications, such as face recognition [1], facial expression recognition [2], and face hallucination [3], because its quality directly affects the performance of those applications. Therefore, face detection has become a research hotspot in the fields of pattern recognition and computer vision and has been widely studied over the past two decades.
A large number of approaches have been proposed for face detection. The early research on face detection [4][5][6][7][8][9] mainly focused on the design of handcraft features and used traditional machine learning algorithms to train effective classifiers for detection and recognition. Such approaches are limited in that designing efficient features is complex and the detection accuracy is relatively low. In recent years, face detection methods based on deep convolutional neural networks [10][11][12][13] have been widely studied; they are more robust and efficient than handcraft feature methods. In addition, a series of efficient object detection frameworks, including R-CNN [19], Fast R-CNN [20], and Faster R-CNN [21], have been used for face detection to improve detection performance [14][15][16][17][18]. These methods mainly detect faces and locate their bounding boxes, which has some drawbacks: the extracted face features contain background noise, and the spatial quantization is coarse, so faces cannot be positioned accurately. These drawbacks directly affect subsequent face-related applications, such as face recognition, facial expression recognition, and face alignment [22]. Therefore, it is necessary to study a method that performs both face detection and segmentation.
Mask R-CNN [23], an improved object detection model based on Faster R-CNN, has shown impressive performance on various object detection and segmentation benchmarks such as the COCO challenges [24] and the Cityscapes dataset [25]. Unlike traditional R-CNN series methods, Mask R-CNN adds a mask branch for predicting a segmentation mask on each Region of Interest (RoI), so it can fulfil both detection and segmentation tasks. In order to perform both face detection and segmentation on an image and to overcome the drawbacks of the existing methods, a face detection and segmentation method based on an improved Mask R-CNN (G-Mask) is proposed in this paper. In particular, our scheme introduces Generalized Intersection over Union (GIoU) [26] as the loss function for bounding box regression to improve the accuracy of face detection. The main contributions of this paper are as follows: (1) A new dataset was created (more details are described in Section 4.1) by annotating 5115 images randomly selected from the FDDB [27] and ChokePoint [28] datasets. (2) A face detection and segmentation method based on an improved Mask R-CNN was proposed, which can detect faces correctly while also precisely segmenting each face in an image. Furthermore, the proposed method improves the detection performance by introducing GIoU as the bounding box loss function. The experimental results verify that our proposed G-Mask method achieves promising performance on several mainstream benchmarks, including FDDB, AFW [29], and WIDER FACE [30]. The remainder of this paper is organized as follows. Section 2 briefly reviews the related work. The G-Mask framework for face detection and segmentation is described in detail in Section 3. Section 4 presents the experiments and a discussion of the proposed method. In the last section, the work is summarized and directions for future work are proposed.

Related Work
Face detection as one of the important research directions of computer vision has been extensively studied in recent years. From the development process of face detection, we can simply classify previous work as handcraft feature based and neural networks based methods.

Handcraft Feature Based Methods.
With the appearance of the first real-time face detection method, Viola-Jones [4], in 2004, face detection began to be applied in practice. The well-known Viola-Jones detector can perform real-time detection using Haar features and a cascaded structure, but it also has some drawbacks, such as a large feature size and a low recognition rate in complex situations. To address these concerns, many new handcraft features were proposed, such as HOG [5], SIFT [6], SURF [7], and LBP [8], which achieved outstanding results. Apart from the above methods, one of the significant advances was the Deformable Part Model (DPM), proposed by Felzenszwalb et al. [9]. In the DPM model, the face is represented as a set of deformable parts, and the improved HOG feature and SVM are used for detection, achieving remarkable performance. In general, the advantage of handcraft features is that the model is intuitive and extensible, and the disadvantage is that the detection accuracy is limited in the face of multiobjective tasks.

Neural Networks Based Methods.
As early as 1994, Vaillant et al. [10] first proposed using a neural network to detect faces. In that work, a Convolutional Neural Network (CNN) is used to classify whether each pixel is part of a face, and the location of the face is then determined through another CNN. Much subsequent research built on this work. In recent years, deep learning approaches have significantly promoted the development of computer vision technology, including face detection. Li et al. [11] proposed a cascade CNN architecture for rapid face detection, a multiresolution network structure that quickly eliminates background regions in the low-resolution stage and carefully evaluates challenging candidates in the last, high-resolution stage. Ranjan et al. [12] proposed a deformable part model based on normalized features extracted by a deep convolutional neural network. Yang et al. [13] proposed a method called Convolutional Channel Feature (CCF), which combines the advantages of filtered channel features and CNNs and has lower computational and storage costs than the general end-to-end CNN method.
Recently, following the significant advancement of object detection using region-based methods, researchers have gradually applied the R-CNN series of methods to face detection. Qin et al. [14] proposed a joint training scheme for a CNN cascade, a Region Proposal Network (RPN), and Fast R-CNN. In [15], Jiang et al. trained the Faster R-CNN model on the WIDER dataset and verified its performance on the FDDB and IJB-A benchmarks. Sun et al. [16] improved the Faster R-CNN framework through a series of strategies such as multiscale training, hard negative mining, and feature concatenation. Wu et al. [17] proposed a multiscale face detection method based on Faster R-CNN to address the challenge of small-scale face detection. Liu et al. [18] proposed a cascaded backbone-branches fully convolutional neural network (BB-FCN) and used facial landmark localization results to guide R-CNN-based face detection. Neural network based methods are already the mainstream of face detection because of their high efficiency and stability. In this work, we propose the G-Mask scheme, which achieves considerable progress in the face detection task compared with the original architecture.

Network Architecture.
The proposed method is extended from the Mask R-CNN [23] framework, a state-of-the-art object detection scheme that has demonstrated impressive performance on various object detection benchmarks. As shown in Figure 1, the proposed G-Mask method consists of two branches, one for face detection and the other for segmenting the face from the background image. In this work, the ResNet-101 backbone is used to extract the facial features of the input image, and Regions of Interest (RoIs) are rapidly generated on the feature map through the Region Proposal Network (RPN). We also use Region of Interest Align (RoIAlign) to faithfully preserve exact spatial locations and to output a feature map of fixed size. At the end of the network, the bounding box is located and classified in the detection branch, and the corresponding face mask is generated on the image in the segmentation branch through the Fully Convolutional Network (FCN) [31]. In the following, we introduce the key steps of our network in detail.

Region Proposal Network.
Images of human faces from daily life generally contain faces of different scales and aspect ratios. Therefore, in our approach, the Region Proposal Network (RPN) generates RoIs by sliding windows over the feature map using anchors with different scales and different aspect ratios. Details are shown in Figure 2. The largest rectangle in the figure represents the feature map extracted by the convolutional neural network, and the dotted line marks the standard anchor. Assume that the standard anchor size is 64 pixels; it comprises three anchors with aspect ratios of 1 : 1, 1 : 2, and 2 : 1. The dot-dash line and the solid line represent the anchors of 32 and 128 pixels, respectively, and each of them likewise has three aspect ratios. In a traditional RPN, these three scales and three aspect ratios are slid over the feature map to generate RoIs. In this paper, we use 5 scales ($16^2$, $32^2$, $64^2$, $128^2$, and $256^2$) and 3 aspect ratios (1 : 1, 1 : 2, and 2 : 1), leading to 15 anchors at each location, which is more effective in detecting objects of different scales.
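The anchor layout described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the paper: the function names are hypothetical, each anchor is assumed to keep the area `scale ** 2`, and the ratio is assumed to mean width divided by height (the text does not say which side comes first in "1 : 2").

```python
from itertools import product
from math import sqrt


def make_anchor_shapes(scales=(16, 32, 64, 128, 256), ratios=(1.0, 0.5, 2.0)):
    """Return one (width, height) pair per scale/aspect-ratio combination.

    Assumption: every anchor keeps the area scale**2 and ratio = width / height.
    """
    shapes = []
    for scale, ratio in product(scales, ratios):
        width = scale * sqrt(ratio)
        height = scale / sqrt(ratio)
        shapes.append((width, height))
    return shapes


def anchors_at(cx, cy, shapes):
    """Centre every anchor shape on (cx, cy); boxes are (x1, y1, x2, y2)."""
    return [(cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2) for w, h in shapes]


shapes = make_anchor_shapes()
boxes = anchors_at(100.0, 100.0, shapes)
print(len(boxes))  # 5 scales x 3 ratios = 15 anchors per location
```

Sliding `anchors_at` over every feature-map location (mapped back to image coordinates) would give the full candidate set that the RPN scores.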

RoIAlign Layer.
G-Mask, unlike general face detection methods, includes a segmentation operation, which requires more refined spatial quantization for feature extraction. In the traditional region-based approaches, RoIPool is the standard operation for extracting a small feature map from each RoI; it involves two quantization operations that result in misalignments between the RoI and the extracted features. For traditional detection methods, this may not affect classification and localization, while for our approach, it has a great impact on the prediction of pixel-accurate masks, as well as on small object detection.
In response to the above problem, we introduced the RoIAlign layer, following the scheme of [23]. As shown in Figure 3, suppose the feature map is divided into 2 × 2 bins.
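It can be seen that the RoIAlign layer removes the harsh quantization operations on the feature map and uses bilinear interpolation to preserve the floating-point coordinates, thereby avoiding misalignments between the RoI and the extracted features. The bilinear interpolation has two steps. In the standard notation, let $P = (x, y)$ be a sampling point and let $Q_{11} = (x_1, y_1)$, $Q_{21} = (x_2, y_1)$, $Q_{12} = (x_1, y_2)$, and $Q_{22} = (x_2, y_2)$ be its four adjacent grid points. Interpolate in the x-axis direction as follows:

$$f(R_1) = \frac{x_2 - x}{x_2 - x_1} f(Q_{11}) + \frac{x - x_1}{x_2 - x_1} f(Q_{21}), \quad R_1 = (x, y_1),$$

$$f(R_2) = \frac{x_2 - x}{x_2 - x_1} f(Q_{12}) + \frac{x - x_1}{x_2 - x_1} f(Q_{22}), \quad R_2 = (x, y_2).$$

Interpolate in the y-axis direction as follows:

$$f(P) = \frac{y_2 - y}{y_2 - y_1} f(R_1) + \frac{y - y_1}{y_2 - y_1} f(R_2),$$

where $f(P)$ is the value at the sampling point, $f(Q_{11})$, $f(Q_{21})$, $f(Q_{12})$, and $f(Q_{22})$ are the values at the four adjacent points, and $f(R_1)$ and $f(R_2)$ are the values obtained by interpolating in the x-axis direction.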
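As a rough illustration of the two-step interpolation, the sketch below samples a small 2-D grid at a non-integer position: first along x on the two neighbouring rows, then along y between the two intermediate values. It covers only the sampling step, not a full RoIAlign layer; the function name and the list-of-rows grid are assumptions made for the example, and the grid spacing is taken to be 1.

```python
import math


def bilinear_sample(feature, x, y):
    """Sample a 2-D grid (list of rows) at continuous coordinates (x, y).

    x indexes columns and y indexes rows; both must lie inside the grid.
    """
    height, width = len(feature), len(feature[0])
    x1, y1 = int(math.floor(x)), int(math.floor(y))
    # Clamp the far neighbours so points on the last row/column stay valid.
    x2, y2 = min(x1 + 1, width - 1), min(y1 + 1, height - 1)
    tx, ty = x - x1, y - y1  # fractional offsets in [0, 1)
    # Step 1: interpolate along x on the two neighbouring rows.
    r1 = (1 - tx) * feature[y1][x1] + tx * feature[y1][x2]
    r2 = (1 - tx) * feature[y2][x1] + tx * feature[y2][x2]
    # Step 2: interpolate along y between the two intermediate values.
    return (1 - ty) * r1 + ty * r2


grid = [[0.0, 1.0],
        [2.0, 3.0]]
value = bilinear_sample(grid, 0.5, 0.5)
print(value)  # 1.5, the mean of the four corner values
```

Because no coordinate is rounded, a sampling point that moves by a fraction of a pixel changes the sampled value smoothly, which is the property the mask branch relies on.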

Mask Branch.
The mask branch separates the face from the background image in the G-Mask model. It predicts the segmentation mask in a pixel-to-pixel manner by applying a Fully Convolutional Network (FCN) [31] to each RoI. The FCN scheme is one of the solutions for instance segmentation; it originates from the CNN but differs from a general CNN. In the traditional CNN architecture, in order to obtain a feature vector of fixed dimensions, the convolutional layers are generally followed by several fully connected layers, and the final output is a numerical description of the input, which is generally suited to tasks such as image recognition and classification, object detection, and positioning. The FCN framework is similar to the traditional CNN in that it also includes convolutional layers and pooling layers. However, the FCN uses deconvolution to up-sample the feature map of the last convolutional layer so that the output can be restored to the size of the original image, and finally uses a Softmax classifier to predict the category of each pixel.

Generalized Intersection over Union.
Bounding box regression, as one of the fundamental components of many computer vision tasks, deserves further study [32]. However, unlike research on architectures and feature extraction strategies, which has made great progress in recent years [33], research on bounding box regression has lagged behind somewhat.
The Generalized Intersection over Union (GIoU) [26], a recent metric and bounding box regression method, demonstrates state-of-the-art results on various object detection benchmarks when incorporated into general object detection frameworks. Traditional IoU has two weaknesses when it is used as a metric or as a bounding box regression loss: (a) the IoU value is zero when two objects do not overlap, making it difficult to optimize nonoverlapping bounding boxes; (b) the IoU value may be the same when two objects intersect in different orientations, so the IoU function does not reflect how the two objects overlap. To overcome these drawbacks, GIoU not only focuses on the situation where two objects overlap but also considers the nonoverlapping situation. The details of the GIoU metric are shown in Figure 4.
Suppose $B^p = (x_1^p, y_1^p, x_2^p, y_2^p)$ and $B^g = (x_1^g, y_1^g, x_2^g, y_2^g)$ are the coordinates of an object's predicted bounding box and the ground-truth bounding box, where $x_2 > x_1$ and $y_2 > y_1$ in $B^p$ and $B^g$; then, their areas are

$$A^p = (x_2^p - x_1^p)(y_2^p - y_1^p), \quad A^g = (x_2^g - x_1^g)(y_2^g - y_1^g).$$

The coordinates and area of the intersection $I$ of $B^p$ and $B^g$ can be calculated as

$$x_1^I = \max(x_1^p, x_1^g), \quad x_2^I = \min(x_2^p, x_2^g), \quad y_1^I = \max(y_1^p, y_1^g), \quad y_2^I = \min(y_2^p, y_2^g),$$

$$I = \begin{cases} (x_2^I - x_1^I)(y_2^I - y_1^I), & x_2^I > x_1^I \text{ and } y_2^I > y_1^I, \\ 0, & \text{otherwise.} \end{cases}$$

Similarly, the smallest enclosing box $B^c$ can be found through

$$x_1^c = \min(x_1^p, x_1^g), \quad x_2^c = \max(x_2^p, x_2^g), \quad y_1^c = \min(y_1^p, y_1^g), \quad y_2^c = \max(y_2^p, y_2^g),$$

and the area of $B^c$ can be computed as

$$A^c = (x_2^c - x_1^c)(y_2^c - y_1^c).$$

The IoU between $B^p$ and $B^g$ is defined as

$$IoU = \frac{I}{U}, \quad U = A^p + A^g - I.$$

Therefore, GIoU can be calculated as

$$GIoU = IoU - \frac{A^c - U}{A^c}. \tag{13}$$

Loss Function.
The proposed G-Mask model consists of two stages, the same as the general region-based model. In the first stage, RPN proposes the candidate bounding boxes of face objects. The second stage, following the Fast R-CNN architecture, extracts features from each candidate box and then performs classification and bounding box localization. In addition, like Mask R-CNN, we add a mask branch parallel to the classification branch and the bounding box branch. Therefore, we define a multitask objective function, which includes the classification loss $L_{cls}$, the bounding box loss $L_{box}$, and the segmentation loss $L_{mask}$. Our loss function for each image is defined as

$$L = L_{cls} + L_{box} + L_{mask}. \tag{14}$$

In (14), the classification loss $L_{cls}$ and the segmentation loss $L_{mask}$ are defined in the same way as in Mask R-CNN. For the bounding box loss, several experiments showed that GIoU suits the face detection task better than the traditional bounding box regression method. Therefore, in this paper, we introduce GIoU as the bounding box loss function. In more detail, the classification loss is defined as

$$L_{cls} = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*),$$

where $N_{cls}$ is the minibatch size, $i$ is the index of an anchor in a minibatch, and $p_i$ is the predicted probability that anchor $i$ is a face target. The ground-truth label $p_i^* = 1$ if the anchor is positive, and $p_i^* = 0$ if the anchor is negative.
The classification loss $L_{cls}(p_i, p_i^*)$ of each anchor is the log loss of whether an object is a face, which is defined as

$$L_{cls}(p_i, p_i^*) = -\log\left[p_i^* p_i + (1 - p_i^*)(1 - p_i)\right].$$

For the bounding box loss, we introduce GIoU as the loss function. The GIoU metric is defined in (13), so the bounding box loss function is defined as follows:

$$L_{box} = 1 - GIoU.$$

For the segmentation loss, we adopt the average binary cross-entropy loss, which is defined in (18):

$$L_{mask} = -\frac{1}{m^2} \sum_{1 \le i, j \le m} \left[ y_{ij} \log \hat{y}_{ij}^k + (1 - y_{ij}) \log\left(1 - \hat{y}_{ij}^k\right) \right], \tag{18}$$

where $y_{ij}$ is the label value of cell $(i, j)$ in the region of size $m \times m$ and $\hat{y}_{ij}^k$ is the predicted value of the k-th class for this cell. $L_{mask}$ is only defined on the mask associated with the ground-truth class $k$; the other mask outputs do not affect the loss.
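To make the bounding box loss concrete, the following sketch computes IoU and GIoU for two axis-aligned boxes, using the definition of GIoU from [26] (IoU minus the fraction of the smallest enclosing box that is not covered by the union), and then forms the loss as one minus GIoU. It is a plain-Python illustration under stated assumptions rather than the training code of the paper: boxes are assumed to be `(x1, y1, x2, y2)` tuples with positive area, and the function names are hypothetical.

```python
def box_area(box):
    """Area of a box given as (x1, y1, x2, y2) with x2 > x1 and y2 > y1."""
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)


def giou(pred, gt):
    """Return (IoU, GIoU) for a predicted and a ground-truth box."""
    # Intersection rectangle; its area is zero when the boxes do not overlap.
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = (ix2 - ix1) * (iy2 - iy1) if ix2 > ix1 and iy2 > iy1 else 0.0
    union = box_area(pred) + box_area(gt) - inter
    # Smallest axis-aligned box that encloses both inputs.
    cx1, cy1 = min(pred[0], gt[0]), min(pred[1], gt[1])
    cx2, cy2 = max(pred[2], gt[2]), max(pred[3], gt[3])
    enclose = (cx2 - cx1) * (cy2 - cy1)
    iou = inter / union
    return iou, iou - (enclose - union) / enclose


def giou_loss(pred, gt):
    """Bounding box loss: 1 - GIoU, so identical boxes give a loss of 0."""
    return 1.0 - giou(pred, gt)[1]


gt = (0.0, 0.0, 2.0, 2.0)
near = (3.0, 0.0, 5.0, 2.0)   # does not overlap gt, one unit away
far = (8.0, 0.0, 10.0, 2.0)   # does not overlap gt, six units away
print(giou(near, gt)[0], giou(far, gt)[0])     # both IoU values are 0.0
print(giou_loss(near, gt), giou_loss(far, gt)) # the farther box has the larger loss
```

The example shows weakness (a) of plain IoU: both nonoverlapping predictions have an IoU of zero, while the GIoU-based loss still distinguishes the closer box from the farther one.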

Experimental Setup.
Unlike object detection and generic face detection, there are no off-the-shelf face datasets with mask annotations that can be employed to train our model [34]. Therefore, the first step of our work is to create a new dataset with mask annotations. In order to enhance the reliability of the samples, we selected 5115 samples from the FDDB and ChokePoint datasets and annotated them with mask labels. After the annotation work, we trained the G-Mask model on this dataset.
For implementation, we adopt the Keras [35] framework to train the G-Mask model on Ubuntu 16.04. ResNet-101 [36] is used as the backbone network architecture in our work. In the training phase, the G-Mask model is trained on the aforementioned dataset for 150,000 iterations (50 epochs of 3000 steps each) with the learning rate set to 0.001 and the weight decay rate set to 0.0001. We randomly sample one image per batch for training [37], in which the short side of each image is resized to 800 and the long side is resized to 1024. In the RPN part, RoIs are generated by sliding a window over the feature map using anchors of different scales and different aspect ratios. After nonmaximum suppression, 2000 RoIs are kept, and an RoI is considered foreground only if its IoU with the ground truth is greater than 0.5. The testing phase settings are the same as those of the training phase, and a region proposal is considered to be a face only if its confidence score is greater than 0.7. Training and testing are carried out on the same server, which has a Xeon E5 CPU, 128 GB flash memory, and an NVIDIA GeForce GTX 1080Ti GPU.
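The proposal filtering described above (keep at most 2000 RoIs after nonmaximum suppression, then label an RoI as foreground when its IoU with a ground-truth box exceeds 0.5) can be sketched as follows. This is a simple greedy variant written for illustration only; the function names are hypothetical, and the suppression threshold of 0.7 is an assumption, since the text does not state the value used.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.7, max_keep=2000):
    """Greedy nonmaximum suppression; returns indices of the kept boxes.

    iou_threshold is an assumed value; max_keep=2000 follows the text.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a higher-scoring kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
            if len(keep) == max_keep:
                break
    return keep


def is_foreground(roi, gt_boxes, fg_iou=0.5):
    """An RoI counts as foreground if its IoU with any ground-truth box exceeds fg_iou."""
    return any(iou(roi, g) > fg_iou for g in gt_boxes)


boxes = [(0.0, 0.0, 10.0, 10.0), (0.5, 0.5, 10.5, 10.5), (20.0, 20.0, 30.0, 30.0)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # the second box overlaps the first too strongly and is dropped
```

At test time, the same pattern applies with a score filter in front: only proposals whose confidence exceeds 0.7 are reported as faces.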

Experimental Results.
In this work, the G-Mask model not only localizes the bounding box of each face but also separates the face from the background image with a binary mask, so that more detailed face information can be obtained. The comparison experiments were carried out on three popular face benchmark datasets: FDDB, AFW, and WIDER FACE.
Discrete Dynamics in Nature and Society
The FDDB [27] dataset is a well-known face detection evaluation dataset and benchmark, which contains 2845 images with 5171 human faces. In this dataset, the faces in each image come from different scenes, which makes it quite challenging. We compared several methods on the FDDB dataset, including Faster R-CNN [15], Mask R-CNN [23], Pico [38], Viola-Jones [39], and Koestinger [40]. For a fair comparison, the G-Mask, Mask R-CNN, and Faster R-CNN models were trained on the same data, namely the dataset constructed in this work. We compared the true positive rates at 1500 false positives, and the results are shown in Figure 5. It can be seen from Figure 5 that G-Mask performs better than Faster R-CNN when there are more than 160 false positives. When there are more than 280 false positives, the performance of G-Mask is better than that of Mask R-CNN. Furthermore, our method achieves an 88.80% true positive rate at 1500 false positives, which exceeds all the compared methods. The comparison results on the FDDB dataset show that our proposed G-Mask method achieves promising results, demonstrating that our method can segment face information while detecting effectively. Some detection results of the Mask R-CNN and G-Mask models in complex scenarios of the FDDB dataset are shown in Figure 6. It is obvious that [...] on the AFW benchmark is shown in Figure 7, and it can be seen that the G-Mask method achieved 95.97% average precision (AP).
Although our dataset has a different label format from the AFW benchmark and the training dataset is only moderately sized, this result still demonstrates the generalization ability of our method.
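The FDDB comparison above reports the true positive rate at a fixed number of false positives. The sketch below shows one way such a number can be computed from scored detections. It is an illustration, not the official evaluation toolkit: the function name is hypothetical, and each detection is assumed to have already been matched against the ground truth and flagged as a true or false positive.

```python
def tpr_at_false_positives(detections, num_faces, max_fp):
    """True positive rate reached while allowing at most max_fp false positives.

    detections: iterable of (score, is_true_positive) pairs over the whole test set.
    num_faces: total number of annotated faces in the test set.
    """
    tp = fp = 0
    # Lower the score threshold one detection at a time, best score first.
    for _, is_tp in sorted(detections, key=lambda d: d[0], reverse=True):
        if is_tp:
            tp += 1
        else:
            fp += 1
            if fp > max_fp:
                break  # budget exceeded; tp still holds the last valid count
    return tp / num_faces


# Toy example with 4 annotated faces, one of which is never detected.
detections = [(0.9, True), (0.8, False), (0.7, True), (0.6, False), (0.5, True)]
print(tpr_at_false_positives(detections, num_faces=4, max_fp=1))  # 0.5
```

Evaluating this function for every budget from 0 up to 1500 would trace a curve of the kind shown in Figure 5.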
WIDER FACE [30], one of the largest and most challenging publicly available face detection datasets, has 32,203 images and 393,703 labeled faces. In this dataset, large variations in face size, pose, and occlusion pose great challenges to face detection, and the dataset is divided into easy, medium, and hard subsets according to the difficulty level. To further demonstrate the detection performance of our proposed method, we trained the G-Mask model on the WIDER FACE dataset and evaluated it on the validation set. The proposed method is compared with several major methods, including MSCNN [42], CMS-RCNN [43], ScaleFace [44], Multitask Cascade CNN [45], and Faceness-WIDER [46]. The precision-recall curves of the G-Mask method on the WIDER FACE benchmark are shown in Figure 8. It can be seen that our method obtained 0.902 AP on the easy subset, 0.854 AP on the medium subset, and 0.662 AP on the hard subset, which exceeds most of the compared methods. Compared with the state-of-the-art MSCNN method, the AP of the proposed method is only 0.014 lower on the easy subset and 0.049 lower on the medium subset. There is a larger gap between the G-Mask and MSCNN methods on the hard subset. The reason may be that the MSCNN method uses a series of strategies for small-scale face detection and can thus deal with more challenging cases. Nevertheless, the G-Mask method still achieves promising performance, which demonstrates its effectiveness.
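For readers unfamiliar with the AP figures quoted above, the following sketch computes average precision as the area under an interpolated precision-recall curve. It is an assumption-laden illustration: the function name is hypothetical, detections are assumed to be pre-matched to the ground truth, and the official benchmark toolkits may use a different interpolation or sampling scheme.

```python
def average_precision(detections, num_faces):
    """Area under the interpolated precision-recall curve.

    detections: iterable of (score, is_true_positive) pairs.
    num_faces: total number of annotated faces.
    """
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in sorted(detections, key=lambda d: d[0], reverse=True):
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_faces)
        precisions.append(tp / (tp + fp))
    # Replace each precision by the best precision at any higher recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum rectangle areas between consecutive recall values.
    ap, prev_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


detections = [(0.9, True), (0.8, False), (0.7, True), (0.6, False), (0.5, True)]
ap = average_precision(detections, num_faces=4)
print(round(ap, 4))
```

A detector that ranks every true face above every false alarm and misses none would score an AP of 1.0 under this definition.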
We further demonstrate more qualitative results of G-Mask method in Figure 9. It can be observed that the proposed method can detect faces correctly while also precisely segmenting each face in an image.
We also compared the running time of different region-based methods on a series of datasets, namely FDDB, AFW, and ChokePoint. The WIDER FACE dataset was not used for this test because the running times on its hard and easy subsets were quite different. We randomly selected 100 images from each of the above datasets, measured the running time, and calculated the average; the results are reported in Table 1. We can clearly see that Faster R-CNN has the shortest running time because of its relatively simple structure, while the proposed method has a running time similar to that of Mask R-CNN. Compared with the Faster R-CNN method, G-Mask adds a segmentation branch, which increases the computational complexity. However, the G-Mask method can achieve higher accuracy with less time consumption compared with other region-based methods and can also obtain more detailed face information through the segmentation branch while locating faces accurately.

Conclusions
In this paper, a G-Mask method was proposed for face detection and segmentation.
The approach extracts features with ResNet-101, generates RoIs with RPN, preserves the precise spatial position with RoIAlign, and generates binary masks through the Fully Convolutional Network (FCN). In doing so, the proposed framework is able to detect faces correctly while also precisely segmenting each face in an image. Experimental results on the self-built face dataset as well as publicly available datasets have verified that our proposed G-Mask method achieves promising performance. In future work, we will consider improving the speed of the proposed method.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.