CenterFace: Joint Face Detection and Alignment Using Face as Point

Face detection and alignment in unconstrained environments are often deployed on edge devices, which have limited memory storage and low computing power. This paper proposes a one-stage method named CenterFace to simultaneously predict the facial box and landmark locations with real-time speed and high accuracy. The proposed method belongs to the anchor-free category. This is achieved by (a) learning the probability that a face exists through semantic maps and (b) learning the bounding box, offsets, and five landmarks for each position that potentially contains a face. Specifically, the method can run in real time on a single CPU core and at 200 FPS on an NVIDIA 2080TI for VGA-resolution images, while achieving superior accuracy (WIDER FACE Val/Test-Easy: 0.935/0.932, Medium: 0.924/0.921, Hard: 0.875/0.873 and FDDB discontinuous: 0.980, continuous: 0.732). A demo of CenterFace is available at https://github.com/Star-Clouds/CenterFace.


Introduction
Face detection and alignment is one of the fundamental issues in computer vision and pattern recognition and is often deployed in mobile and embedded devices. These devices typically have limited memory storage and low computing power. Therefore, it is necessary to predict the position of the face box and the landmarks at the same time, with both high speed and high precision.
With the great breakthrough of convolutional neural networks (CNN), face detection has achieved remarkable progress in recent years. Previous face detection methods have inherited the paradigm of anchor-based generic object detection frameworks, which can be divided into two categories: two-stage methods (Faster-RCNN [1]) and one-stage methods (SSD [2]). Compared with the two-stage method, the one-stage method is more efficient and has a higher recall rate, but it tends to have a higher false positive rate and compromised localization accuracy.
Then, Hu and Ramanan [3] used a two-stage approach with Region Proposal Networks (RPN) [1] to detect faces directly, while SSH [4] and S3FD [5] developed scale-invariant designs that detect multiscale faces from different layers of a single network. The previous anchor-based methods have some drawbacks. On the one hand, in order to improve the overlap between anchor boxes and ground truth, a face detector usually requires a large number of dense anchors to achieve a good recall rate. For example, more than 100 k anchor boxes are designed in RetinaFace [6] for a 640 × 640 input image. On the other hand, the anchor is a hyperparameter design that is statistically calculated from a particular dataset, so it does not always transfer to other applications, which goes against generality.
In addition, the current state-of-the-art face detectors have achieved considerable accuracy on the benchmark WIDER FACE [7] by using heavy pretrained backbones such as VGG16 [8] and ResNet50/152 [9]. First, these detectors are difficult to use in practice because the network consumes too much time and the model size is too large. Second, without facial landmark prediction, they are not convenient for face recognition applications. Therefore, joint detection and alignment, as well as a better balance of accuracy and latency, are essential for practical applications.
Inspired by the anchor-free universal object detection framework [1][10][11][12][13][14][15], this paper proposes a simpler and more effective face detection and alignment method named CenterFace, which is not only lightweight but also powerful.
The network structure of CenterFace, which can be trained end-to-end, is shown in Figure 1. We use the center point of the face's bounding box to represent the face; the facial box size and landmarks are then regressed directly from image features at the center location. Face detection and alignment are thus transformed into a standard key point estimation problem [16][17][18]. The peak in the heat map corresponds to the center of the face. The image features at each peak predict the size of the face and the face key points. This approach is fully evaluated, and its detection performance is reported on a number of benchmark datasets for face detection, including FDDB [19] and WIDER FACE.
In summary, the main contributions of this work are four-fold: (i) by introducing the anchor-free design, face detection is transformed into a standard key point estimation problem, using only a larger output resolution (output stride of 4) compared to previous detectors; (ii) based on the multitask learning strategy, the face-as-point design is proposed to predict the face boxes and five key points at the same time; (iii) this paper proposes a feature pyramid network using a common layer for accurate and fast face detection; and (iv) comprehensive experimental results on the popular benchmarks FDDB and WIDER FACE, as well as on CPU and GPU hardware platforms, demonstrate the superiority of the proposed method in terms of speed and accuracy.

Cascaded CNN Methods.
The cascaded convolutional neural network (CNN) approach [20][21][22] uses a cascaded CNN framework to learn features in order to improve performance while maintaining efficiency. However, cascaded CNN-based detectors have some problems. (1) The runtime of these detectors grows with the number of faces in the input image, so the speed degrades dramatically as the number of faces increases. (2) Because these methods optimize each module separately, the training process becomes extremely complicated.

Anchor-Free Methods.
In our view, cascaded CNN methods are also a kind of anchor-free method. However, these methods use a sliding window to detect human faces and rely on image pyramids, which leads to shortcomings such as slow speed and a complex training process. LFFD [31] regards the receptive fields (RFs) as natural anchors that can cover continuous face scales, which is just another way to define anchors, but its training time is about 5 days with two NVIDIA GTX1080TI GPUs. Our CenterFace simply represents faces by a single point at their bounding box center; then, facial box size and landmarks are regressed directly from image features at the center location. Thus, face detection is transformed into a standard key point estimation problem, and training takes only one day on an NVIDIA GTX2080TI.

Multitask Learning.
Multitask learning uses multiple supervisory labels to improve the accuracy of each task by utilizing the correlation between tasks. Joint face detection and alignment [17,20] is widely used because the alignment task, running in parallel on the shared backbone, provides better features for the face classification task through face point information. Similarly, Mask R-CNN [32] significantly improves the detection performance by adding a branch for predicting an object mask.

CenterFace

Mobile Feature Pyramid Network.
We adopt MobileNetV2 [33] as the backbone and Feature Pyramid Network (FPN) [14] as the neck for the subsequent detection. In general, FPN uses a top-down architecture with lateral connections to build a feature pyramid from a single-scale input. CenterFace represents the face through the center point of the face box, and face size and facial landmarks are then regressed directly from image features at the center location. Therefore, only one layer in the pyramid is used for face detection and alignment. We construct a pyramid with levels $\{P_L\}$, L = 3, 4, 5, where L indicates the pyramid level. $P_L$ has $1/2^L$ the resolution of the input. All pyramid levels have C = 24 channels, and we define the classification loss, box regression loss, and landmark regression loss only on $P_2$.

Face as Point.
Let $[x_1, y_1, x_2, y_2]$ be the bounding box of a face. The facial center point lies at $c = [(x_1 + x_2)/2, (y_1 + y_2)/2]$. Let $I \in \mathbb{R}^{W \times H \times 3}$ be an input image of width W and height H. Our aim is to produce the heat map $\hat{Y} \in [0, 1]^{(W/R) \times (H/R)}$, where R is the output stride, and we use the default output stride of R = 4. During training, a prediction $\hat{Y}_{x,y} = 1$ corresponds to a face center, while $\hat{Y}_{x,y} = 0$ is background. For each ground-truth face center, we calculate the equivalent heat map $Y_{x,y}$ by using an unnormalized 2D Gaussian to represent the ground truth. The training loss is a variant of focal loss [15] with hyperparameters α and β, which are set to α = 2 and β = 4 in all our experiments following Law and Deng [34].
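The heat map target and its loss can be illustrated with a minimal pure-Python sketch. It assumes the penalty-reduced focal loss form of Law and Deng [34], a fixed Gaussian width `sigma`, integer center coordinates, and an element-wise maximum where Gaussians overlap; the function names are hypothetical and none of these details are confirmed by the text above.

```python
import math

def gaussian_heatmap(width, height, centers, sigma=1.0):
    """Ground-truth heat map: one unnormalized 2D Gaussian per face center.

    `centers` holds integer (cx, cy) positions already divided by the stride.
    Overlapping Gaussians are merged with a maximum (assumption).
    """
    heat = [[0.0] * width for _ in range(height)]
    for cx, cy in centers:
        for y in range(height):
            for x in range(width):
                g = math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
                heat[y][x] = max(heat[y][x], g)
    return heat

def focal_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced focal loss in the style of Law and Deng [34] (assumed form)."""
    loss, num_pos = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # keep log() finite
            if t == 1.0:
                # Face center: penalize low confidence.
                loss -= (1.0 - p) ** alpha * math.log(p)
                num_pos += 1
            else:
                # Background: penalty is reduced near a center via (1 - t)^beta.
                loss -= (1.0 - t) ** beta * p ** alpha * math.log(1.0 - p)
    return loss / max(num_pos, 1)
```

With α = 2 and β = 4 as in the text, positions close to a face center contribute little to the background term, which is what lets a single peak per face dominate training.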
To gather global information and to reduce memory usage, downsampling is applied to an image convolutionally, and the size of the output is usually smaller than the image. Hence, a location (x, y) in the image is mapped to the location (x/n, y/n) in the heat maps, where n is the downsampling factor. Because heat map locations are discrete, some pixels may be misaligned when we remap the locations from the heat maps to the input image, which can greatly affect the accuracy of facial boxes. To address this issue, we predict a position offset $o_k$ for each face center k, with image coordinates $(x_k, y_k)$, to adjust the center position slightly before remapping it to the input resolution. We apply the L1 loss at the ground-truth center positions.
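A minimal sketch of the offset idea follows, assuming the offset is the fractional part discarded when the downsampled center is rounded down to a heat map cell (the usual choice in CornerNet-style detectors). The function names are hypothetical.

```python
import math

def center_and_offset(x, y, stride=4):
    """Map an image-space center to its heat map cell and sub-cell offset."""
    fx, fy = x / stride, y / stride
    cell = (math.floor(fx), math.floor(fy))
    # Assumed offset target: what flooring throws away.
    offset = (fx - cell[0], fy - cell[1])
    return cell, offset

def remap(cell, offset, stride=4):
    """Add the offset back before scaling to the input resolution."""
    return ((cell[0] + offset[0]) * stride, (cell[1] + offset[1]) * stride)
```

For example, a center at (101, 58) with stride 4 falls in cell (25, 14) with offset (0.25, 0.5); without the offset it would be remapped to (100, 56), an error of up to stride - 1 pixels per axis.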

Box and Landmark Prediction.
To reduce the computational burden, we use a single size prediction $S \in \mathbb{R}^{(W/4) \times (H/4)}$ for the facial box and landmarks. Each ground-truth bounding box is specified as $G = (x_1, y_1, x_2, y_2)$. During training, our goal is to learn a transformation that maps the network's position outputs (h, w) to the center position in the feature maps, where R is the stride of the network, which is set to R = 4.
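One plausible form of this transformation is sketched below. It assumes log-encoded box height and width in units of the stride R; that encoding is an assumption made here for illustration, and the function names are hypothetical.

```python
import math

def encode_box_size(box, stride=4):
    """Regression targets (h, w) for a ground-truth box, assuming log encoding."""
    x1, y1, x2, y2 = box
    h = math.log((y2 - y1) / stride)
    w = math.log((x2 - x1) / stride)
    return h, w

def decode_box(center, size, stride=4):
    """Recover (x1, y1, x2, y2) from a feature-map center and encoded (h, w).

    `center` is in feature-map units and already includes the offset.
    """
    cx, cy = center
    h, w = size
    box_w, box_h = math.exp(w) * stride, math.exp(h) * stride
    px, py = cx * stride, cy * stride
    return (px - box_w / 2, py - box_h / 2, px + box_w / 2, py + box_h / 2)
```

Encoding the size in log space keeps the targets in a narrow range across small and large faces, which is one common reason for this choice.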
Different from box regression, the regression of the five facial landmarks adopts a target normalization method based on the center position, where $lm_x$ and $lm_y$ are the x and y coordinates of a face landmark, $c_x$ and $c_y$ are the x and y coordinates of the face center, and $box_w$ and $box_h$ are the width and height of the face. We also apply the smooth L1 loss to facial box and landmark prediction at the center location. For any training face center, we minimise a multitask loss in which $\lambda_{off}$, $\lambda_{box}$, and $\lambda_{lm}$ scale the offset, box, and landmark loss terms, and we use 1, 0.1, and 0.1, respectively, in all our experiments.
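The following sketch shows one way these pieces could fit together. It assumes each landmark is expressed relative to the face center and divided by the box width or height, and that the total loss is the classification loss plus the weighted offset, box, and landmark terms; the function names are hypothetical.

```python
def encode_landmarks(landmarks, box):
    """Normalize landmarks by the face center and box size (assumed form)."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    box_w, box_h = x2 - x1, y2 - y1
    return [((lx - cx) / box_w, (ly - cy) / box_h) for lx, ly in landmarks]

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 on a scalar: quadratic near zero, linear elsewhere."""
    d = abs(pred - target)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta

def multitask_loss(l_cls, l_off, l_box, l_lm,
                   lam_off=1.0, lam_box=0.1, lam_lm=0.1):
    """Weighted sum of the four loss terms, using the weights from the text."""
    return l_cls + lam_off * l_off + lam_box * l_box + lam_lm * l_lm
```

With this normalization, a landmark at the face center maps to (0, 0) and one at the bottom-right box corner maps to (0.5, 0.5), so the targets are independent of face scale.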

Dataset.
The proposed method is trained on the training set of the WIDER FACE benchmark, which includes 12,880 images with more than 150,000 valid faces varying in scale, pose, expression, occlusion, and illumination. RetinaFace [6] introduces five levels of face image quality and annotates five landmarks on faces.

Data Augmentation.
Data augmentation is important for improving generalization. We use random flip, random scaling [35], and color jittering, and we randomly crop square patches from the original images and resize these patches to 800 × 800 to generate larger training faces. Faces smaller than 8 pixels are discarded directly.
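The box bookkeeping behind the crop-and-resize step can be sketched as follows. Two details are assumptions made for illustration: a face is kept only if its center lies inside the crop, and the 8-pixel rule is applied to the shorter box side after resizing. The crop itself is passed in by the caller, so how it is sampled is left open; the function name is hypothetical.

```python
def crop_and_resize_boxes(boxes, crop, out_size=800, min_face=8):
    """Map face boxes into a square crop resized to out_size x out_size.

    boxes: list of (x1, y1, x2, y2) in original image pixels.
    crop:  (left, top, side) of the square patch in original image pixels.
    """
    left, top, side = crop
    scale = out_size / side
    kept = []
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        # Assumption: drop faces whose center is outside the patch.
        if not (left <= cx < left + side and top <= cy < top + side):
            continue
        # Shift into crop coordinates, rescale, and clip to the patch.
        nx1 = min(max((x1 - left) * scale, 0.0), out_size)
        ny1 = min(max((y1 - top) * scale, 0.0), out_size)
        nx2 = min(max((x2 - left) * scale, 0.0), out_size)
        ny2 = min(max((y2 - top) * scale, 0.0), out_size)
        # Discard faces that end up smaller than min_face pixels.
        if min(nx2 - nx1, ny2 - ny1) < min_face:
            continue
        kept.append((nx1, ny1, nx2, ny2))
    return kept
```

Cropping a 400-pixel patch and resizing it to 800 × 800 doubles every surviving face, which is how the augmentation produces larger training faces.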

Training Parameters.
We train CenterFace using the Adam optimiser with a batch size of 8 and a learning rate of 5e-4 for 140 epochs, with the learning rate dropped 10x at 90 and 120 epochs, respectively. The downsampling layers of MobileNetV2 are initialized with ImageNet pretrained weights and the

Experiments
In this section, we firstly introduce the runtime efficiency of CenterFace and then evaluate it on the common face detection benchmarks.

Running Efficiency.
The existing CNN face detectors can be accelerated by GPUs, but they are not fast enough in most practical applications, especially CPU-based applications. As described below, our CenterFace is efficient enough to meet practical requirements, and its model size is only 7.2 MB. As shown in Table 1, compared with other detectors, our method exceeds real-time running speed (>100 FPS) at different resolutions on a single NVIDIA GTX2080TI. Because DSFD, PyramidBox, S3FD, and SSH are too slow when running on CPU platforms, we only evaluate the proposed CenterFace, FaceBoxes, MTCNN, and CasCNN on VGA-resolution images on the CPU; here, mAP means the true positive rate at 1000 false positives on FDDB. As listed in Table 2, our CenterFace can run at 30 FPS on the CPU with state-of-the-art accuracy.

FDDB Dataset.
FDDB contains 2845 images with 5171 unconstrained faces collected from the Yahoo news website. We evaluate our face detector on FDDB against other state-of-the-art methods, and the results are shown in Table 3 and Figure 2. We also include the DSFD, PyramidBox, and S3FD detectors, although these detectors are much slower due to their larger backbones and denser anchors. Our CenterFace achieves good performance on both the discontinuous and continuous ROC curves, i.e., 98.0% and 72.9% when the number of false positives equals 1,000, and it clearly outperforms LFFD, FaceBoxes, and MTCNN.

WIDER FACE Dataset.
WIDER FACE is currently the most widely used benchmark for face detection. The WIDER FACE dataset is split into training (40%), validation (10%), and testing (50%) subsets by randomly sampling from 61 scene categories. All the compared methods are trained on the training set. For testing on WIDER FACE, we follow the standard practices of [6] and employ flip as well as multiscale strategies. Box voting [36] is applied on the union set of predicted face boxes using an IoU threshold of 0.4. We report the results on the testing set in Table 4. The proposed CenterFace achieves 0.932 (Easy), 0.921 (Medium), and 0.873 (Hard) on the testing set. Although there is still a gap to the state-of-the-art methods, it consistently outperforms SSH (using VGG16 as the backbone), LFFD, FaceBoxes, and MTCNN. Additionally, CenterFace is better on the Hard subset than S3FD, which uses VGG16 as the backbone and dense anchors.
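As an illustration of the box voting step, the sketch below merges detections pooled from flipped and multiscale passes. It is a simplified variant in the spirit of [36], not a confirmed reproduction of the evaluation code: each highest-scoring box absorbs all remaining boxes whose IoU with it is at least the threshold, and the merged coordinates are the score-weighted average. Scores are assumed to be positive, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2, ...) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def box_voting(dets, iou_thresh=0.4):
    """Merge overlapping detections given as (x1, y1, x2, y2, score)."""
    remaining = sorted(dets, key=lambda d: d[4], reverse=True)
    merged = []
    while remaining:
        top, rest = remaining[0], remaining[1:]
        group = [top] + [d for d in rest if iou(top, d) >= iou_thresh]
        remaining = [d for d in rest if iou(top, d) < iou_thresh]
        total = sum(d[4] for d in group)  # assumes positive scores
        # Score-weighted average of the voters' coordinates.
        box = tuple(sum(d[i] * d[4] for d in group) / total for i in range(4))
        merged.append(box + (top[4],))
    return merged
```

Compared with plain non-maximum suppression, the overlapping boxes are not simply discarded; they vote on the final coordinates, which tends to stabilize localization when several test-time passes detect the same face.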
Furthermore, we also test on WIDER FACE using only the original image with a single inference, and CenterFace still produces good average precision (AP) in all subsets, i.e., 92.2% (Easy), 91.1% (Medium), and 78.2% (Hard) for the validation set. Figure 3 shows some qualitative results on the WIDER FACE dataset.

AFLW Dataset.
To evaluate the accuracy of face alignment, we compare CenterFace with MTCNN on the AFLW dataset. The mean error is measured by the distances between the estimated landmarks and the ground truths and is normalized with respect to the interocular distance. As shown in Figure 4, we give the mean error of each facial landmark on the AFLW dataset [37]. CenterFace significantly decreases the normalized mean errors (NME) from 6.2% to 6.9% when compared to MTCNN.
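The metric described above can be sketched for a single face as follows. The landmark ordering (eyes at indices 0 and 1) and the function name are assumptions for illustration.

```python
import math

def nme(pred, gt, left_eye=0, right_eye=1):
    """Mean landmark error normalized by the interocular distance.

    pred, gt: lists of (x, y) landmarks for one face, in the same order.
    """
    interocular = math.dist(gt[left_eye], gt[right_eye])
    errors = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(errors) / len(errors) / interocular
```

If every predicted landmark is 5 pixels from its ground truth and the eyes are 40 pixels apart, the NME is 5 / 40 = 12.5%; averaging this value over all test faces gives the dataset-level figure.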

Parameter, FLOPs, and Model Size.
In this section, the compared methods are studied from the perspective of parameters, computation, and model size. Edge devices always have limited storage. We use FLOPs to measure the computation at resolution 640 × 480. The number of parameters is closely related to the size of the model. However, the model size may vary slightly with different libraries, and fewer parameters do not mean less computation. All the information is presented in Table 5.

[29] 0.984 0.754
PyramidBox [28] 0.982 0.757
S3FD [5] 0.981 0.754
MTCNN [20] 0.944 0.708
Faceboxes3.2 [36] 0.

Scientific Programming
The most advanced methods, DSFD and PyramidBox, have a large number of parameters, high FLOPs, and large model sizes. Evidently, the proposed method has much more efficient computation and a lighter network, which demonstrates the superiority of the concise network design.

Conclusion
This paper introduces CenterFace, which performs well on both speed and accuracy and simultaneously predicts facial box and landmark locations. Our proposed method overcomes the drawbacks of the previous anchor-based methods by translating face detection and alignment into a standard key point estimation problem. CenterFace represents the face through the center point of the face box, and face size and facial landmarks are then regressed directly from image features at the center location. Comprehensive and extensive experiments are conducted to fully analyze the proposed method. The final results demonstrate that our method can achieve real-time speed and high accuracy with a smaller model size, making it an ideal alternative for most face detection and alignment applications.

Data Availability
The data used to support the findings of this study have been deposited in the http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/WiderFace_Results.html repository.

Conflicts of Interest
The authors declare that they have no conflicts of interest.