Facial Recognition of Cattle Based on SK-ResNet



Introduction
With the development of animal husbandry toward large-scale, information-based, and refined production, intelligent cattle farms will gradually replace traditional small-scale farming modes such as retail farming. In large cattle farms, automatic and information-based fine management of individual cattle, health status tracking of each cow, and traceability of dairy and meat products all require the construction and improvement of a quality traceability system, and the key to such a system is the identification of individual cattle [1]. The traditional methods of individual identification of cattle include ear notching, external marking with an ear tag, and external equipment marking with RFID [2]. Ear notching is a painful and time-consuming incision in the animal's ear [3]; ear tags are often lost or damaged during breeding, so they cannot be worn for long periods of time [4]. If the RFID tagging method is used for a long time, it raises security problems such as ear tags falling off, tag content tampering, system crashes, and server intrusion attacks, and its cost is high [5,6].
In recent years, driven by deep learning, the use of machine vision technology to supervise the identification of dairy cows has become a trend. As a popular technology for intelligent and precise breeding, machine vision has the advantages of low cost, noncontact operation that avoids animal stress, long continuous monitoring time, and so on. Noncontact identification is a new trend in livestock and poultry identification. It is based on biological characteristics that are unique and invariable, and it is low in cost, easy to operate, and good for animal welfare. Noncontact identification methods use computer vision to extract biometrics for individual identification. Among biometrics, facial recognition has strong anti-interference ability and scalability. In 2015, Moreira et al. [7] applied deep convolutional neural networks to recognize lost dogs by their faces, but recognition accuracy was low for highly similar dogs of the same breed. In 2018, Hansen et al. [8] proposed a pig face recognition algorithm based on convolutional neural networks. Using information collected from the pigs' facial features, the method recognized pigs with black spots better, but creating artificial features with paint, as was done in that study, is difficult to achieve in practice. In 2019, Yao et al. [9] proposed a cattle face recognition framework that combines Faster R-CNN detection and the PANSnet-5 recognition model. First, an image was input to the cattle face detection model, and the cattle face region in the image was detected and cropped. Then, the cropped image of the cattle face region was sent to the recognition model to confirm its specific number. However, the facial patterns of dairy cows are distinctive, so the task has less research value. In 2020, Marsot et al. [10] proposed a new framework consisting of computer vision algorithms and machine learning and deep learning techniques.
First, two cascaded classifiers based on Haar features and a shallow convolutional neural network automatically detect high-quality images of a pig's face and eyes. Second, deep convolutional neural networks are used for facial recognition. However, because of the black spots on pig faces, the output images of the Haar cascade eye detector need to be manually classified before they are input into the neural network for training. This manual workload is heavy and often difficult to sustain in practical applications. In 2021, Bello et al. [11] proposed a deep belief network that learns the texture features of cow nose images and uses cow nose image patterns for recognition. This method has certain limitations when nose texture has to be extracted on free-range cattle farms.
Discrete orthogonal polynomials have attracted a lot of attention from researchers in many scientific fields, especially in speech and image analysis, due to their robustness to noise. The basic principle is to use orthogonal polynomials (OPs) to form matrices and to use the basis functions of the orthogonal polynomials as approximate solutions of differential equations. In recent years, orthogonal polynomials have been widely used in face recognition, edge detection, and other related fields. In 2020, Abdul-Hadi et al. [12] proposed a new recursive algorithm to generate Charlier polynomials (CHPs) of higher order, which is 44 times faster than existing recursive algorithms but still has room for improvement in speed. In 2021, Abdulhussain et al. [13] proposed a new recursive algorithm for computing the coefficients of higher-order Meixner polynomials (MNPs). As a feature extraction tool it is computationally expensive, and it was not applied to boundary detection. In 2022, Mahmmod et al. [14] proposed a method for calculating the Hahn orthonormal basis and applied it to the calculation of high-order orthonormal bases. This method uses two adaptive threshold recursion algorithms to stabilize the generation of discrete Hahn polynomial (DHP) coefficients. The algorithm performs better over wider ranges of the parameters α and β and of the polynomial size.
Although the traditional recognition methods have achieved good results, the recognition process is complicated and often requires manual intervention. Hand-designed image features are usually shallow features of the image, with limited expressive ability and insufficient effective feature information. In addition, hand-designed methods have poor robustness and are greatly affected by external conditions. With the development of deep learning technology and discrete orthogonal polynomials and the improvement of the hardware environment, image recognition based on deep learning has become a research hotspot. In this paper, a cattle face dataset is generated from cattle face images obtained from different angles, and an improved recognition model based on SK-ResNet is proposed to extract cattle face features. The model uses ResNet-50 as the basic model, uses different numbers of SK-Bottlenecks, and fuses the information of multiple receptive fields to extract facial features at multiple scales; a max pooling layer is added to the shortcut connection to reduce information loss. The ELU activation function is used in the network. Its linear right-hand part makes the ELU more robust to input changes or noise, and its soft-saturating left-hand part keeps the mean of the ELU output close to 0, which makes the model converge faster and addresses the problem of neuron death. The recognition model based on SK-ResNet was trained with the cattle face dataset, and the results show that individual cattle can be accurately identified. The method is tested on public datasets and self-built datasets. Compared with existing recognition methods, the experimental results verify the advantages of the method.
The contributions of this paper include the following: (1) We built two large datasets for cattle identification and model robustness testing. The first one consists of cattle facial images. The second one consists of Long White Pig facial images. These images were captured from different angles and against different backgrounds.

Data Collection and Processing.
The data were collected at Dongfeng Dairy Cattle Farm, Liaoyuan City, Jilin Province, China, in July 2021. Using the camera equipment deployed on the farm, images of eight solid-color cattle were captured from different angles. The solid-color cattle samples are shown in Figure 1. The resulting cattle face dataset has a resolution of 1298 × 1196 pixels and is stored in Joint Photographic Experts Group (JPG) format. We aimed to prevent image saturation, avoid direct sunlight on the face in the images, and remove complex backgrounds [15] by extracting the face region of each image. A total of 5677 facial images of cattle were collected, which were randomly divided into training and verification sets at a ratio of 7 : 3 (3974 training images and 1703 verification images). When the dimension of the sample feature space is larger than the number of training samples, the model is prone to overfitting. To enhance the robustness and generalization ability of the network, the limited set of training samples is therefore expanded by data augmentation. Translation, rotation, and cropping are used to increase the training sample size to four times the original, which mitigates overfitting. After data augmentation, 15,896 training images were obtained. The size of the training dataset has a significant impact on the performance of the trained network.
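The split and augmentation counts reported above can be checked with a short sketch. It assumes a plain shuffled split with a fixed seed; the function name `split_dataset`, the seed, and the use of integer ids in place of image files are illustrative choices, not details taken from the paper.

```python
import random


def split_dataset(items, train_ratio=0.7, seed=0):
    """Shuffle the items and split them into training and verification subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seed is an arbitrary illustrative value
    n_train = round(len(shuffled) * train_ratio)
    return shuffled[:n_train], shuffled[n_train:]


# Integer ids stand in for the 5677 collected cattle face images.
image_ids = list(range(5677))
train_ids, val_ids = split_dataset(image_ids)

# Translation, rotation, and cropping bring the training set to four times its size.
augmented_train_size = len(train_ids) * 4

print(len(train_ids), len(val_ids), augmented_train_size)  # 3974 1703 15896
```

The sketch only reproduces the counts; which images fall into each subset depends on the shuffle actually used by the authors, which the paper does not specify.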

Cattle Individual Identification Process.
The identification process used in this paper is shown in Figure 2. The acquired images are preprocessed with the method in Section "Data Collection and Processing." Then, the dataset is divided into a training set and a verification set at a ratio of 7 : 3. In this paper, ResNet is used as the backbone network to construct the recognition model for individual cattle faces. The training set is used to train the model, and the verification set is used to verify the accuracy and robustness of the model, so as to realize rapid and effective identification of cattle.

Convolutional Neural Network.
As a convolutional neural network becomes deeper, problems such as vanishing and exploding gradients occur, which makes training difficult and degrades model performance. To alleviate this effect, ResNet [16] constructs residual blocks that add skip connections between different network layers in order to improve network performance. Because of its superior performance, the residual network has been widely used in plant disease spot classification [17], pathological image classification [18], remote sensing image classification [19], and face recognition [20]. The residual module structure is shown in Figure 3. For the multilayer stacked network structure, when the input data is X, the learned feature is denoted as H(X). Given H(X), the residual obtained by linear transformation and activation function is as follows:

$$F(X) = H(X) - X. \tag{1}$$

The actual learned feature is as follows:

$$H(X) = F(X) + X. \tag{2}$$

In the extreme case where $F(X) = 0$, the convolutional layers implement the identity mapping, and the performance and characteristic parameters of the network remain unchanged. In general, $F(X)$ is nonzero, so the network can always learn new features, which ensures gradient transmission in backpropagation and alleviates the problems of network degradation and vanishing gradients.

SKNet Network.
SKNet [21] is an upgraded version of SENet [22] and belongs to the family of visual attention mechanisms. Convolution kernels of different sizes have different effects on targets of different scales. SKNet proposed a mechanism that takes into account not only the relationship between channels but also the importance of the convolution kernels; that is, different images can be assigned convolution kernels of different importance, so that the network can obtain information from different receptive fields.
The SKNet network is formed by stacking multiple SK convolution units. The SK convolution operation consists of three modules, Split, Fuse, and Select, and contains multiple branches. Take the two-branch SKNet network in Figure 4 as an example. First, in the Split operation, the feature map X of size c × w × h is passed through group convolution and atrous convolution with the 3 × 3 and 5 × 5 SK convolution kernels, respectively, giving the outputs $\tilde{U}$ and $\hat{U}$. The Fuse operation fuses the two feature maps by element-wise summation and then generates a c × 1 × 1 feature vector S (c is the number of channels) through global average pooling. The vector S is turned into a vector Z by two fully connected layers, one for dimensionality reduction and one for dimensionality increase. The Select module maps the vector Z to the channel-wise weight matrices a and b through two Softmax functions, uses a and b to weight the two feature maps $\tilde{U}$ and $\hat{U}$, and sums the results to get the final output V. The SKNet mechanism thus lets the network automatically learn the weight of each channel while also taking into account the weight and importance of the two convolution kernels. The SK convolution unit uses not only the attention mechanism but also multibranch convolution, group convolution, and atrous convolution.
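The Fuse and Select steps described above can be illustrated with a minimal pure-Python sketch on tiny feature maps stored as nested lists of shape [c][h][w]. The two fully connected layers are learned in the real network; here they are replaced by a fixed placeholder (`logits_a`, `logits_b`), and all function names are illustrative rather than taken from the paper or from any library.

```python
import math


def add_maps(u1, u2):
    """Fuse, step 1: element-wise sum of two feature maps shaped [c][h][w]."""
    return [[[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(c1, c2)]
            for c1, c2 in zip(u1, u2)]


def global_average_pool(u):
    """Fuse, step 2: one mean per channel, i.e. the c x 1 x 1 vector S."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in u]


def branch_softmax(logits_a, logits_b):
    """Select, step 1: per-channel softmax across the two branches (a[c] + b[c] = 1)."""
    a, b = [], []
    for la, lb in zip(logits_a, logits_b):
        m = max(la, lb)  # subtract the maximum for numerical stability
        ea, eb = math.exp(la - m), math.exp(lb - m)
        a.append(ea / (ea + eb))
        b.append(eb / (ea + eb))
    return a, b


def select(u1, u2, a, b):
    """Select, step 2: channel-wise weighted sum V = a * U_tilde + b * U_hat."""
    return [[[a[c] * x + b[c] * y for x, y in zip(r1, r2)]
             for r1, r2 in zip(u1[c], u2[c])] for c in range(len(u1))]


# Two toy branch outputs with c = 2 channels and a 2 x 2 spatial size.
u_tilde = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
u_hat = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]

s = global_average_pool(add_maps(u_tilde, u_hat))

# Placeholder for the two learned fully connected layers (an assumption for this demo).
logits_a = [0.5 * value for value in s]
logits_b = [-0.5 * value for value in s]

a, b = branch_softmax(logits_a, logits_b)
v = select(u_tilde, u_hat, a, b)
```

Because the softmax is taken across branches for each channel, the two weights of a channel always sum to 1, so V is a per-channel convex combination of the two branch outputs.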

Model Building
(1) Model Improvements. The input image first passes through a 7 × 7 convolution layer, whose large convolution kernel retains the original image features; a 3 × 3 maxpool with a stride of 2 then extracts the feature map and compresses the image. The result then passes through four layers in turn, each containing a different number of SK-Bottlenecks. Because the facial features of cattle are limited, the model in this paper mainly extracts facial features through a shallow network, so the numbers of SK-Bottlenecks for the four layers are set to 3, 4, 1, and 1, and the number of branches of the SK module is set to 3, as shown in Figure 5(a). Facial features are extracted at multiple scales by integrating the information of multiple receptive fields. Maxpool is used in the shortcut connection of the SK-Bottleneck, as shown in Figure 6(b). Atrous convolution is connected after the first layer to expand the receptive field without introducing parameters and to locate the target features accurately. Each convolutional layer is followed by a BN layer. In the training phase, the cattle face dataset is small. To prevent overfitting, a dropout layer is added before the fully connected layer, and global average pooling is used to optimize the network structure and increase the generalization ability and resistance to overfitting of the model. Finally, the Softmax classification layer is used for classification. The ReLU activation function is replaced throughout the network with the ELU activation function, which is more robust to the vanishing gradient problem. These changes improve the recognition accuracy of the trained network; the improved network structure is shown in Figure 6.
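As a rough check of how the stem compresses the image, the spatial sizes can be computed with the usual convolution output-size formula. The paper states only the kernel sizes and the stride of the maxpool; the stride of 2 and padding of 3 for the 7 × 7 convolution and the padding of 1 for the maxpool are assumptions borrowed from the standard ResNet-50 stem.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


input_size = 224  # input images are normalized to 224 x 224

# 7 x 7 convolution; stride 2 and padding 3 are assumed (standard ResNet-50 values).
after_conv = conv_output_size(input_size, kernel=7, stride=2, padding=3)

# 3 x 3 maxpool with stride 2 (from the text); padding 1 is assumed.
after_pool = conv_output_size(after_conv, kernel=3, stride=2, padding=1)

# Numbers of SK-Bottlenecks in the four layers, as stated in the text.
bottlenecks_per_layer = (3, 4, 1, 1)
total_bottlenecks = sum(bottlenecks_per_layer)

print(after_conv, after_pool, total_bottlenecks)  # 112 56 9
```

Under these assumptions the four layers start from a 56 × 56 feature map and contain 9 SK-Bottlenecks in total, compared with 16 bottlenecks (3, 4, 6, 3) in a standard ResNet-50.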
(2) Activation Function. Traditional ResNet networks use the rectified linear unit (ReLU) activation function [23]. ReLU is simple, linear, and unsaturated. It can effectively alleviate vanishing gradients and provides sparse representations. The ReLU activation function is shown in the following equation:

$$\mathrm{ReLU}(x) = \max(0, x). \tag{3}$$

It can be seen from formula (3) that when x is greater than 0, the gradient is always 1 and therefore does not vanish. When x is less than or equal to 0, the gradient is 0, so as training progresses, neurons can die, and their weights are no longer updated. The ELU activation function [24] combines Sigmoid and ReLU, with soft saturation on the left and no saturation on the right. The linear part on the right side makes the ELU more robust to input variations or noise. The output mean of the ELU is close to 0, so the network converges faster and the neuron death problem is addressed. The ELU activation function is shown in the following equation:

$$\mathrm{ELU}(x) = \begin{cases} x, & x > 0, \\ \alpha\,(e^{x} - 1), & x \le 0. \end{cases} \tag{4}$$

(3) Improved Shortcut Connection. In the original ResNet structure, when the dimension of x does not match the output dimension of F(x), a projection is applied to x in the shortcut connection [25], and then x is added to F(x). Figure 6(a) shows the default shortcut connection used in the original ResNet. The original shortcut connection uses a 1 × 1 convolutional layer; when the spatial size is reduced by a factor of two, a 1 × 1 convolutional layer with stride 2 skips 75% of the feature map activations, resulting in a significant loss of information. In addition, the 25% of the feature map activations that the 1 × 1 convolutional layer passes to the next ResBlock are not selected by any meaningful criterion, which introduces noise and information loss and negatively interferes with the main information flow of the network. The improved shortcut connection is shown in Figure 6(b). It uses a spatial projection and a channel projection: the spatial projection is a 3 × 3 max pooling layer with stride 2, and the channel projection is a 1 × 1 convolutional layer with stride 1.
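The 75% figure follows from simple counting, which the sketch below reproduces by enumerating the input positions that each shortcut variant actually reads. The 56 × 56 map size and the padding of 1 for the max pooling layer are illustrative assumptions; the function name is hypothetical.

```python
def positions_read(size, kernel, stride, padding):
    """Return the set of (row, col) input positions covered by at least one window."""
    out = (size + 2 * padding - kernel) // stride + 1
    seen = set()
    for i in range(out):
        for j in range(out):
            for di in range(kernel):
                for dj in range(kernel):
                    r = i * stride - padding + di
                    c = j * stride - padding + dj
                    if 0 <= r < size and 0 <= c < size:  # ignore padded positions
                        seen.add((r, c))
    return seen


size = 56  # assumed feature map size, for illustration only
total = size * size

# Original shortcut: 1 x 1 convolution with stride 2.
conv_fraction = len(positions_read(size, kernel=1, stride=2, padding=0)) / total

# Improved shortcut: 3 x 3 max pooling with stride 2 (padding 1 is an assumption).
pool_fraction = len(positions_read(size, kernel=3, stride=2, padding=1)) / total

print(conv_fraction, pool_fraction)  # 0.25 1.0
```

The strided 1 × 1 convolution reads one position out of every 2 × 2 block, so three quarters of the activations never reach the shortcut, whereas the overlapping 3 × 3 pooling windows cover every position before the stride-1 channel projection is applied.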
The max pooling layer introduces a criterion for selecting which activations are passed to the 1 × 1 convolutional layer. The spatial projection therefore takes all the information in the feature maps into account while extracting the main features. The kernel of the max pooling layer has the same size as the intermediate convolution kernel of the ResBlock, which ensures that element-wise addition is performed between elements computed over the same spatial window, and the improved shortcut connection reduces information loss. A ResNet requires four such shortcut connections. The shortcut connections used in this paper do not add any parameters to the model. The structure is shown in Figure 6.
In the experiment, to better evaluate the differences between real and predicted values, the batch training method was adopted. The other settings were as follows: loss function = cross-entropy loss, weight initialization method = Xavier, initial bias = 0, initial learning rate = 0.001, batch size = 16, and momentum = 0.9. The model was optimized with the stochastic gradient descent (SGD) optimizer and used a Softmax classifier, and the learning rate was reduced by a factor of 0.1 every 10 iterations. During training and testing, the input image size was normalized to 224 × 224. A total of 51 epochs were trained, and the converged model was saved as the final model.
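The training settings above imply a step learning-rate schedule, sketched here in plain Python. Two readings are assumed and should be treated as such: "every 10 iterations" is taken to mean every 10 epochs, and "reduced by 0.1" is taken to mean multiplication by 0.1. The function name `step_lr` is illustrative.

```python
import math


def step_lr(epoch, base_lr=0.001, step_size=10, gamma=0.1):
    """Learning rate for a zero-based epoch index under a step decay schedule."""
    return base_lr * gamma ** (epoch // step_size)


# Mini-batches per epoch for 15,896 augmented training images and batch size 16.
batches_per_epoch = math.ceil(15896 / 16)

# Learning rate for each of the 51 training epochs (epochs 0..50).
schedule = [step_lr(epoch) for epoch in range(51)]

print(batches_per_epoch)  # 994
```

Under this reading the learning rate stays at 0.001 for the first 10 epochs and drops by one order of magnitude at epochs 10, 20, 30, 40, and 50.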

Experimental Results and Analysis.
To build the SK-ResNet cattle facial recognition model, the contributions of the SK module, the max pooling layer (maxpool) in the shortcut connection, and the ELU activation function were examined in turn. The influence of the different modules on the model is shown in Table 1. Base denotes the basic ResNet model with (3, 4, 1, 1) Bottlenecks. Table 1 shows that using the SK module in the base model improves the recognition accuracy by 1.69% and significantly reduces the model size and the number of model parameters. Adding maxpool to the shortcut of this model leaves the number of model parameters unchanged while improving the accuracy. The final model uses ELU; the results show that the accuracy improves further and that the growth in the number of model parameters stays within an acceptable range. The final recognition accuracy of this model reaches 98.42%, which is 3% higher than the recognition accuracy of the base model. The training curve of the model in this paper is shown in Figure 7.
To verify the effectiveness of the model, the model in this paper was compared with the classic ResNet-50, SKNet, DenseNet, and GoogleNet on the constructed cattle face dataset. The experimental results are shown in Table 2. The size, number of parameters, and FLOPs of the GoogleNet model are slightly higher than those of the model in this paper; in addition, its training is unstable and its loss value is higher, whereas the model in this paper has higher recognition accuracy and stability. The model size, number of parameters, FLOPs, and loss values of the ResNet-50, SKNet, and DenseNet models are all higher than those of the model in this paper, and their recognition accuracy is much lower. The final average accuracy of the model in this paper is 98.42%, which shows that the SK-ResNet cattle facial recognition model can guarantee recognition accuracy while reducing the number of model parameters and can identify individual cattle faster. The result curves of the model in this paper and the comparison models are shown in Figure 8.
To check whether the network reaches correct classifications by focusing on the relevant region, we adopted class activation mapping (CAM) to determine whether the high-response area falls on the region of interest. The principle is that, for a CNN model, global average pooling (GAP) is performed on the last feature map to calculate the mean value of each channel, which is then mapped to the class scores through the fully connected (FC) layer. The class with the highest score (argmax) is found, the gradient of the output of that class with respect to the last feature map is calculated, and the gradient is visualized on the original image. Intuitively, this shows which part of the high-level features extracted by the network has the greatest impact on the final classifier [26]. Figure 9 shows heat maps of some cattle faces. The dark color is concentrated in the facial area of the cattle, which indicates that the model in this study has learned to recognize the facial features of cattle and supports the accuracy of the facial recognition model proposed in this study.
The proposed model and the ResNet-50, SKNet, GoogleNet, and DenseNet network models were trained on the Long White Pig facial dataset and their results compared. The experimental results (Table 3) show that the accuracy of the proposed model is 98.57% and the loss is only 9.6 on the self-built Long White Pig facial dataset. The accuracy of the proposed model is higher than that of the other four models, and its loss is much lower than that of the other four models.
The model in this paper was also compared with classic models on a sheep face dataset. Table 4 shows that the accuracy of this model on this dataset is 97.02%. Compared with the model proposed by the original author, the recognition effect of this model is the best, which is 2% higher than that of the model proposed by the original author and 6% lower than that of the model proposed by the original author.
Table 4 shows that the recognition performance of this model is better than that of the other models and that its loss is the smallest.
To verify the effectiveness of the model proposed in this study, the pig face recognition model proposed by Yan et al. [28] and the sheep face recognition model proposed by Abu Jwade et al. [27] were trained on the solid-color cattle face dataset. The experimental results are shown in Table 5. The recognition accuracy of the model proposed in this study is much higher than that of the other two models, and its loss value and model size are smaller. The proposed model therefore recognizes cattle better than the other models and is suitable for animal face recognition.

Conclusions
In recent years, with the increase in cattle breeding, the monitoring of individual information and health management has become extremely important. Therefore, there is an urgent need for an intelligent, intensive, and standardized method of managing individual cattle on farms. This paper aims to improve the accuracy, stability, and speed of cattle identification to promote the development of intelligent cattle breeding. A cattle face dataset is generated from images of cattle faces obtained from different angles, and an improved ResNet-based recognition model is proposed to extract cattle facial features. First, the model uses ResNet-50 as the base model, in which the SK-Bottleneck module is used to fuse information from multiple receptive fields to extract facial features at multiple scales. Second, a max pooling layer is used in the shortcut connection to reduce information loss. Finally, the ELU activation function is used in the network to reduce vanishing gradients, prevent overfitting, speed up convergence, and improve the generalization ability of the model. The SK-ResNet-based recognition model is trained with the cattle face dataset, and the results show that individual cattle can be accurately identified. The improved method is compared with the existing models ResNet-50, SKNet, DenseNet, and GoogleNet and is found to be more accurate in recognition while having fewer parameters and faster computation. The results show that the method achieves an average recognition accuracy of 98.42% on a dataset of 5677 images. To verify the generalizability of the proposed model, a Long White Pig facial dataset (with fewer facial features) and a sheep face dataset were used; accuracies of 98.57% and 97.02% were obtained, respectively. Hence, the recognition accuracy of the proposed model is higher than that of the other models.
The experimental results show that the proposed model can accurately identify individual cattle, giving it great potential for application to cattle breeding and the facial recognition of other animals with high facial similarity.

Data Availability
The cow face image data used to support the results of this study were created by the authors from video captures and can be obtained from the authors upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest.