3D Large-Pose Face Alignment Method Based on the Truncated Alexnet Cascade Network

To address the low accuracy of large-pose face alignment, a cascade network based on a truncated Alexnet is designed and implemented in this paper. Parallel convolution and pooling layers are added to the original deep convolutional neural network and their results are concatenated, which improves the accuracy of the output. The intermediate parameter produced by each iteration is fed back into the CNN and iterated repeatedly to optimize the pose parameter, yielding more accurate face alignment results. To verify the effectiveness of this method, this paper tests on the AFLW and AFLW2000-3D datasets. The experiments show that the normalized mean error of this method is 5.00% and 5.27% on these datasets. Compared with 3DDFA, a currently popular algorithm, the accuracy is improved by 0.60% and 0.15%, respectively.


Introduction
As an important research topic in the field of artificial intelligence and face recognition, face alignment has received wide attention from academia and industry. The core task is to use computing equipment to extract the semantics of pixels in face images, which has great theoretical research significance and practical application value. In recent years, the success of deep learning has greatly improved the accuracy of face alignment. However, there are still many challenges and bottlenecks in recognition under the unrestricted conditions of real scenes, among which pose change is a factor that cannot be ignored and greatly affects the accuracy of face alignment.
At present, mainstream face alignment methods can be divided into two categories: 2D face alignment and 3D face alignment. Among widely used 2D methods, Zhang et al. [1] proposed facial landmark detection based on deep multitask learning in 2014, and Lee et al. [2] improved it with a Gaussian-guided regression network in 2019. Then, a coarse-to-fine shape searching method was proposed by Zhu et al. [3] in 2015. These works laid the foundation for face alignment in small and medium poses, where the yaw angle is less than 45° and all the landmarks are visible. The steps of 2D face alignment can be roughly divided into face preprocessing, shape initialization, shape prediction, and output.
Compared with traditional 2D face alignment, 3D face alignment mainly uses a subspace to model the 3D face and realizes fitting by minimizing the difference between the image and the model appearance, which makes the model more robust and accurate in unconstrained scenes. Of course, the 3D approach has inherent defects: the alignment results are similar to the average model and lack personalized features. To address this, Yin et al. [4] proposed a 3D deformable model for face recognition; however, processing each image takes about one minute, which is too slow. Liu and Jourabloo [5] fitted a 3D deformable model to the 2D image: with the aid of a sparse 3D point distribution model, the model parameters and projection matrix are estimated by cascaded linear or nonlinear regressors, which realizes alignment of faces in arbitrary poses. However, the recovery of facial detail features is still poor. Then, Liu and Jourabloo [6] used 3D face modeling to improve landmark localization for large-pose faces, but the accuracy of the alignment results is still limited by linear parameterized 3D models, so large-pose alignment methods still need improvement. Zhu et al. [7] improved face alignment performance across large poses and addressed the three challenges of traditional models: they require visible landmarks and are therefore not applicable to profile views; large poses cause significant appearance changes from frontal to profile; and invisible landmarks must still be located in large poses. The first challenge has been properly solved by the 3D dense face model [8], whereas the others still depend on the accuracy of the model, not only on the method. Therefore, a more accurate and reliable model is needed. As a solution, we propose a cascaded convolutional neural network- (CNN-) based regression method.
CNNs have been proven to have an excellent capability to extract useful information from images with large variations in object detection and image classification. On this basis, we design a new cascade network structure based on a truncated Alexnet to improve accuracy.

Feature Selection.
Good features can make training efficient and improve the accuracy of the model. In order to get better features, we designed a new cascade network structure based on truncated Alexnet.

Alexnet.
Alexnet deepens the network structure based on Lenet [9]. The structure of Lenet is shown in Figure 1, and the structure of Alexnet is shown in Figure 2. The network contains five convolution layers and three fully connected layers. Compared with Lenet, Alexnet has a deeper network structure and uses several parallel convolution and pooling layers to extract image features. It also uses dropout and data augmentation to suppress overfitting.

Cascade Network Structure Based on Truncated Alexnet.
Based on the structure of Alexnet, this paper constructs a new kind of truncated Alexnet. The structure is shown in Figure 3. An additional parallel convolution-pooling layer is added to the original structure to form a truncated Alexnet cascade network. The input image is stacked with the iterated PNCC as input and then fed into the network's parallel convolution branches. The parallel results are concatenated to form a fully connected layer.

Network Structure.
The purpose of 3D face alignment is to estimate the target parameters from a single face image. Different from existing networks, based on the cascaded network structure of 3DDFA, we add a parallel pooling layer and a concatenation step before the fully connected layer. In general, at iteration k (k = 0, 1, . . ., K), given an initial parameter p^k, we construct the specially designed feature PNCC from p^k and train a convolutional neural network Net^k to predict the parameter update Δp^k. Afterwards, a better intermediate parameter p^{k+1} = p^k + Δp^k becomes the input of the next network Net^{k+1}, which has the same structure as Net^k. The input is the 100 × 100 × 3 color image stacked with PNCC. The network contains eight convolution layers, seven pooling layers, and two fully connected layers. The first two convolution layers share weights to extract low-level features. The last three convolution layers do not share weights, extracting location-sensitive features, which are further regressed to a 256-dimensional feature vector. The output is a 234-dimensional parameter update, including 6-dimensional pose parameters (f, pitch, yaw, roll, t_2dx, and t_2dy), 199-dimensional shape parameters α_id, and 29-dimensional expression parameters α_exp.
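The iterative scheme above can be sketched as follows. This is a minimal illustration of the cascade loop only: `build_pncc` and `predict_update` are hypothetical stand-ins for the real PNCC renderer and the trained network Net^k (here the placeholder network simply moves halfway toward a known target parameter).

```python
# Sketch of the cascaded regression loop: at iteration k, the current
# parameter p_k is turned into a PNCC feature, the network predicts an
# update dp_k, and p_{k+1} = p_k + dp_k seeds the next stage.

def build_pncc(image, p):
    # Placeholder: the real step renders the 3D face posed by p,
    # colored by NCC, and stacks it with the input image.
    return (image, tuple(p))

def predict_update(feature, target):
    # Placeholder for Net_k: move 50% of the way toward the target.
    _, p = feature
    return [0.5 * (t - v) for v, t in zip(p, target)]

def cascade_align(image, p0, target, num_iters=4):
    p = list(p0)
    for _ in range(num_iters):
        feature = build_pncc(image, p)
        dp = predict_update(feature, target)   # delta-p from Net_k
        p = [v + d for v, d in zip(p, dp)]     # p_{k+1} = p_k + dp_k
    return p

p_final = cascade_align(image=None, p0=[0.0, 0.0], target=[1.0, 2.0])
# Each iteration halves the residual, so p_final approaches the target.
```

With each stage reducing the residual, a small fixed number of iterations (K iterations in the paper, 40 training epochs overall) suffices for convergence.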

PNCC.
The special structure of the cascaded CNN places three requirements on its input feature. First, the feedback property requires that the input feature depend on the CNN output to enable the cascade manner. Second, the convergence property requires that the input feature reflect the fitting accuracy so that the cascade converges after some iterations. Finally, the convolvable property requires that convolution on the input feature be meaningful. Based on these three properties, we design our features as follows: first, the 3D mean face is normalized to [0, 1] in the x, y, and z axes, as given in the following equation:

NCC_d = (S̄_d − min(S̄_d)) / (max(S̄_d) − min(S̄_d)), d ∈ {x, y, z},

The unique 3D coordinate of each vertex is called its normalized coordinate code (NCC).
where S̄ is the mean shape of the 3DMM in equation (4). Since NCC has three channels like RGB, we can also show the mean face with NCC as its texture. Second, with a model parameter p, we adopt the Z-buffer to render the projected 3D face colored by NCC, as in the following equation:

PNCC = Z-Buffer(V_3d(p), NCC),

where Z-Buffer(v, t) renders an image from the 3D mesh v colored by t, and V_3d(p) is the current 3D face. Afterwards, PNCC is stacked with the input image and fed to the CNN. The projected normalized coordinate code (PNCC) is shown in Figure 4.
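The per-axis min-max normalization that produces the NCC can be sketched as below; the Z-buffer rendering step is omitted, and the four-vertex "mean face" is a toy stand-in for the real dense mesh.

```python
import numpy as np

# NCC: normalize each axis of the mean 3D face to [0, 1]; the normalized
# coordinate of a vertex then doubles as its RGB color for rendering.

def ncc(mean_shape):
    """mean_shape: (N, 3) array of vertex coordinates (x, y, z)."""
    mn = mean_shape.min(axis=0)
    mx = mean_shape.max(axis=0)
    return (mean_shape - mn) / (mx - mn)  # per-axis min-max to [0, 1]

# Toy "mean face" of four vertices.
shape = np.array([[0.0, -2.0, 1.0],
                  [4.0,  2.0, 3.0],
                  [2.0,  0.0, 2.0],
                  [1.0,  1.0, 1.5]])
codes = ncc(shape)
```

Because every channel lies in [0, 1], the codes satisfy the convolvable property: they form a smooth, image-like signal once rendered by the Z-buffer.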

3DMM.
Blanz and Basso [10] proposed the 3D morphable model (3DMM), which describes the 3D face space with PCA and is widely used in the face alignment field [11-13]. The 3DMM is shown in the following equation:

S = S̄ + A_id α_id + A_exp α_exp,

where S is a 3D face, S̄ is the mean shape, A_id is the principal axes trained on 3D face scans with neutral expression and α_id is the shape parameter, and A_exp is the principal axes trained on the offsets between expression scans and neutral scans and α_exp is the expression parameter. In this work, A_id and A_exp come from the Basel Face Model (BFM) and FaceWarehouse [14], respectively. The 3D face is then projected onto the image plane with weak perspective projection:
V(p) = f · Pr · R · (S̄ + A_id α_id + A_exp α_exp) + t_2d,

where V(p) is the model construction and projection function, leading to the 2D positions of the model vertexes; f is the scale factor; Pr is the orthographic projection matrix [1 0 0; 0 1 0]; R is the rotation matrix constructed from the rotation angles pitch, yaw, and roll; and t_2d is the translation vector. The collection of all the model parameters is p = [f, pitch, yaw, roll, t_2d, α_id, α_exp]^T.
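The 3DMM construction and weak perspective projection can be sketched numerically as below. The dimensions are toy-sized (four vertices, two identity and one expression basis vectors, random bases) purely for illustration; the paper's model uses 199 identity and 29 expression parameters.

```python
import numpy as np

# Sketch of 3DMM construction and weak perspective projection:
#   S    = S_bar + A_id @ alpha_id + A_exp @ alpha_exp
#   V(p) = f * Pr @ R @ S + t_2d

n_vertices = 4
S_bar = np.random.randn(3, n_vertices)        # mean shape (3 x N)
A_id  = np.random.randn(3 * n_vertices, 2)    # identity basis (toy size)
A_exp = np.random.randn(3 * n_vertices, 1)    # expression basis (toy size)

def construct(alpha_id, alpha_exp):
    # Add the identity and expression offsets to the mean shape.
    offset = A_id @ alpha_id + A_exp @ alpha_exp
    return S_bar + offset.reshape(3, n_vertices)

def project(S, f, R, t_2d):
    # Weak perspective projection onto the image plane.
    Pr = np.array([[1.0, 0.0, 0.0],           # orthographic projection
                   [0.0, 1.0, 0.0]])
    return f * (Pr @ R @ S) + t_2d.reshape(2, 1)

S = construct(np.array([0.1, -0.2]), np.array([0.05]))
R = np.eye(3)                                  # identity rotation for the demo
V = project(S, f=2.0, R=R, t_2d=np.array([10.0, 20.0]))
```

With R set to the identity, the projection simply scales the x and y coordinates by f and shifts them by t_2d, which makes the role of each parameter easy to verify.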

Loss Function.
In this paper, the loss function is shown in the following equation:

min_ω (1/N) Σ_{i=1}^{N} L(y_i, f(x_i; ω)) + λΩ(ω),

where L(y_i, f(x_i; ω)) measures the error between the predicted value f(x_i; ω) of the model for the i-th sample and the real label y_i. As mentioned above, this value should be minimized as much as possible to improve the fit between the model and the training set. However, the fit to the training set is not the final evaluation index; the test error is. Therefore, the regularization term Ω(ω) of the parameter ω is introduced to constrain the model and avoid overfitting. The initial learning rate was 10^−4, and the batch size was 8. After 15 complete epochs, the learning rate was reduced to 10^−5. Then, after another 15 epochs, it was reduced to 10^−6. In total, 40 epochs were carried out for the whole training.
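The step schedule described above (10^−4 for the first 15 epochs, 10^−5 for the next 15, 10^−6 for the remaining 10 of the 40 total) can be written as a simple epoch-to-rate mapping; how it is wired into a specific training framework is left open here.

```python
# Learning-rate schedule from the training setup described in the text:
# epochs  0-14 -> 1e-4, epochs 15-29 -> 1e-5, epochs 30-39 -> 1e-6.

def learning_rate(epoch):
    if epoch < 15:
        return 1e-4
    if epoch < 30:
        return 1e-5
    return 1e-6

schedule = [learning_rate(e) for e in range(40)]
```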

Evaluation Index.
In this paper, the normalized mean error (NME) [15] is applied to measure the accuracy of face alignment rather than the interocular Euclidean distance; the reason is that normalizing by the small eye distance is inaccurate for profile faces, where landmarks lie on the facial contour. NME is shown in the following equation:

NME = (1/N) Σ_{k=1}^{N} (||x_k − y_k||_2 / d),

where x denotes the ground truth landmarks for a given face, y is the corresponding prediction, and d is the square root of the ground truth bounding box area, computed as d = √(w × h).
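The metric can be computed directly from the definition above; the two-landmark example below uses made-up coordinates purely to show the arithmetic.

```python
import math

# Normalized mean error (NME): mean Euclidean distance between predicted
# and ground-truth landmarks, divided by the square root of the
# ground-truth bounding-box area, d = sqrt(w * h).

def nme(gt, pred, bbox_w, bbox_h):
    """gt, pred: lists of (x, y) landmarks; bbox_w, bbox_h: box size."""
    d = math.sqrt(bbox_w * bbox_h)
    errors = [math.dist(g, p) for g, p in zip(gt, pred)]
    return sum(errors) / (len(errors) * d)

gt   = [(0.0, 0.0), (10.0, 0.0)]
pred = [(3.0, 4.0), (10.0, 5.0)]                   # distances 5 and 5
score = nme(gt, pred, bbox_w=100.0, bbox_h=100.0)  # d = 100
# Mean landmark error is 5 pixels, so score = 0.05 (i.e., 5%).
```

Because d depends on the whole face box rather than the eye distance, the normalization stays stable even when the face is near-profile and the eyes are close together in the image.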

Experimental Analysis.
The input is a single picture, and the outputs are the face detection image, the PNCC, and the pose estimation results. The experiments were conducted on a 2.30 GHz CPU and a GTX 1060 GPU. Table 1 shows the most popular image datasets and their main features.
In order to verify the effect of the proposed face alignment method in large poses, the experimental results are based on Annotated Facial Landmarks in the Wild (AFLW). The AFLW face database is composed of face pictures taken in various natural situations, with accurately marked landmarks. The database is suitable for face recognition, face detection, face alignment, and other research. Table 2 and Figure 5 show the comparison with mainstream algorithms. Among them, ESR [16] (explicit shape regression), SDM [17] (supervised descent method), LBF [18] (local binary features), CFSS [3] (coarse-to-fine shape searching), RCPR [19] (robust cascaded pose regression), RMFA [20] (restrictive mean field approximation), and 3DDFA [21] are popular methods based on cascade regression.
The experimental results in Table 2 and Figures 5 and 6 show the accuracy of the method. Compared with the 3DDFA algorithm as the main reference, the NME on AFLW and AFLW2000-3D is reduced to 5.00% and 5.27%, respectively, which is better than several popular face alignment algorithms and demonstrates the effectiveness and accuracy of this method. Sample output results are shown in the figures.

Conclusion
In this paper, a cascade network method is proposed for large-pose face alignment. By iterating a deep convolutional neural network repeatedly and using the iterative results to regress the face feature points, face alignment in large-pose environments is realized, and the alignment accuracy is evaluated with the normalized mean error. The experimental results show that this method has obvious advantages in accuracy over existing face alignment methods. However, the efficiency of the algorithm still needs to be improved. At the same time, it is difficult to achieve accurate face alignment in the presence of external occlusion. These problems need further study and will be the focus of subsequent research work.

Data Availability
No data were used to support this study.

Conflicts of Interest
The authors declare that they have no conflicts of interest.