Application of Animation Character Intelligent Analysis Algorithm Based on Deep Learning

The appearance of animation characters is unstable, and their images easily change with the scene. Therefore, this paper proposes an intelligent analysis algorithm for animation characters based on SSD target detection, trained with a data enhancement strategy in which the weight factors of positive and negative samples are adjusted by an optimization method. Finally, the overall process of deploying the intelligent analysis capability of animation characters as an application service is analyzed. The test results verify the effectiveness of the improved algorithm: the proposed method optimizes the training process of the model, helps users understand all kinds of animation characters more conveniently, effectively enhances the efficiency of information dissemination, and ultimately promotes the development of the animation industry.


Introduction
With the vigorous development of the animation industry, a large number of excellent works have emerged. These animation works receive ever wider attention and popularity, and animation characters are deeply loved by people of different ages. The market value of the animation industry has increased significantly, and its competition is becoming increasingly fierce. In order to occupy a place in the market, animation-related manufacturers constantly introduce new animation works and applications into their own products in order to retain and attract users. Integrating artificial intelligence technology into products will be a very novel and interesting application [1, 2]. As an important part of AI and a new growth point, computer vision is bound to develop explosively and to be widely used in all walks of life.
As an important direction of computer vision, image target detection will have more room for application and development with the support of artificial intelligence technology. Analyzing the animation characters in an image in the form of target detection can help users easily understand unfamiliar animation characters and arouse their interest in the characters and related works, which makes animation products more entertaining and attractive [3]. The target detection task includes the prediction of target location coordinates and the classification of target categories [4]. The former mainly locates the target, while the latter completes category prediction for all possible targets and assigns each target to the most credible of the preset categories. Common target detection tasks can be divided into general detection tasks and special detection tasks. A general detection task does not specifically define the categories of detection targets and often needs to detect a large number of objects in daily life, while special detection tasks often focus on specific scenes, such as pedestrian detection in security monitoring, face detection in identity authentication, and vehicle detection in intelligent traffic scenarios. Girshick et al. proposed the R-CNN model [5]. Compared with traditional methods, R-CNN uses a CNN to extract image features of candidate regions, which effectively improves detection accuracy. However, because candidate-region selection and convolutional feature extraction are carried out separately, R-CNN must repeat convolution operations, which directly leads to the slow running speed of the algorithm. Although Fast R-CNN can complete feature extraction, classification, and candidate-box regression within target detection, it still uses the traditional region search method, which greatly affects efficiency.
To solve this problem, Ren [6] added the generation of candidate regions to the convolutional network, realizing a complete end-to-end framework from candidate-region generation to target detection.
In recent years, some researchers have proposed anchor-free target detection algorithms. The main idea is to transform the detection of boxes into the detection of key points: locate the target by detecting the positions of the corners of the target box, and determine which points belong to the same target through embedded features [7-9]. Examples of such algorithms are CornerNet and CenterNet.
This kind of algorithm can truly realize end-to-end training. However, detecting key points is more difficult. To achieve good detection accuracy, the input image needs to be processed twice by the Hourglass network; in addition, due to the large number of upsampling operations, the network parameters and computation are large, which demands high memory and computing power from the training equipment, so both training and inference are slow.
Wang et al. [10] reviewed current video detection algorithms from three technical challenges (improvement and optimization, maintaining spatiotemporal sequence consistency, and model lightweighting). They fall into four types: motion-based information, combination of detection and tracking, lightweight video detection, and the use of crossover models (such as combining the transformer from natural language processing with video detection). Zhou [11] studied a detection method based on time-sequence characteristics, combining feature fusion and a dual model to detect video frame by frame, and corrected the detection result of the current frame through feedback from the previous frame, so as to improve continuity between frames as well as detection accuracy and video continuity.
Based on the above analysis, defining each animation character as a category makes it possible to detect animation characters in an animation scene. Therefore, through partial selection of the SSD detection module, this paper proposes an intelligent analysis algorithm for animation characters based on target detection, in order to improve detection accuracy. At the same time, it analyzes the application scenarios of the algorithm, so as to help users understand all kinds of animation characters more conveniently, effectively enhance the efficiency of information dissemination, and ultimately promote the development of the animation industry. On this basis, the SSD network adds Conv8_2, Conv9_2, Conv10_2, and Conv11_2, which together with the earlier Conv6 and Conv7 form the auxiliary network structure. As shown in Figure 1, the width and height of each cube represent the size of the feature map, and the thickness represents the number of channels.

Selection of Default Box.
In the design of a network structure for end-to-end deep learning target detection, the generation of default boxes largely determines which tasks the network can target and its detection performance. As shown in Figure 2, assume there are 8 × 8 and 4 × 4 feature maps. A feature map grid refers to each small cell of the feature map, so there are 64 and 16 cells in the 8 × 8 and 4 × 4 feature maps, respectively. Default boxes are preset boxes of fixed sizes corresponding to each cell of the feature map, shown in Figure 2 as four dotted boxes corresponding to one small cell.
Assuming that each feature map cell corresponds to k default boxes, then n_label category confidences and 4 positional offsets relative to the default box need to be predicted for each default box. In addition, assuming that the size of a feature map is m × n, applying a small 3 × 3 convolution kernel to the feature map outputs (n_label + 4) × k × m × n predicted values.
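The output size of one prediction head can be checked with a small sketch (the example counts below, such as 36 characters plus a background class, are illustrative assumptions):

```python
def ssd_head_output_size(n_label, k, m, n):
    """Number of values predicted by one 3x3 conv head on an m x n
    feature map with k default boxes per cell: each box needs
    n_label class confidences plus 4 location offsets."""
    return (n_label + 4) * k * m * n

# e.g. an 8 x 8 feature map, 4 default boxes per cell, 36 character
# classes plus background -> n_label = 37
print(ssd_head_output_size(37, 4, 8, 8))  # (37 + 4) * 4 * 64 = 10496
```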
Feature maps at different network depths have different receptive fields with respect to the original image. Suppose feature maps from m network layers at different depths are used for prediction; the default box scale for the k-th feature map is

s_k = s_min + ((s_max − s_min) / (m − 1)) (k − 1),  k ∈ [1, m].

Each default box has a different shape. In the base SSD algorithm, for an aspect ratio a_r, the width of each default box is calculated as

w_k = s_k √a_r,

and the height of each default box is calculated as

h_k = s_k / √a_r.

Although the pose of an animation character at different moments of a story is uncertain, the character's head appears with the highest frequency and contains the hairstyle and the entire facial features of the character. Therefore, this paper labels the animation character's head to represent the character. However, the shape of the head is relatively uniform, so a new default box shape is set based on a large number of labeled head shapes.
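The scale and shape formulas above can be sketched directly in Python (s_min = 0.2 and s_max = 0.9 are the values used in the base SSD algorithm, assumed here since this paper does not restate them):

```python
import math

def default_box_scales(m, s_min=0.2, s_max=0.9):
    # s_k = s_min + (s_max - s_min) / (m - 1) * (k - 1), for k = 1..m
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1)
            for k in range(1, m + 1)]

def box_shape(s_k, aspect_ratio):
    # width grows and height shrinks with the aspect ratio a_r
    return s_k * math.sqrt(aspect_ratio), s_k / math.sqrt(aspect_ratio)
```

For m = 6 prediction layers this yields scales evenly spaced from 0.2 to 0.9; an aspect ratio of 1 gives a square box of side s_k.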

Algorithm Flow.
During the training of the SSD framework, the real label boxes and default boxes are matched in the following way: (1) First, find the default box with the largest Jaccard overlap with each real annotation box, so as to ensure that each real annotation box corresponds to a unique default box. (2) Then try to pair the remaining default boxes that have not been paired with any real annotation box; as long as the Jaccard overlap between the two is greater than the threshold, it is considered a match. In this way, default boxes paired with a real box are positive samples, and default boxes not paired with any real box are negative samples.
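The two-step matching strategy can be sketched in plain Python (boxes as (x1, y1, x2, y2) tuples; a simplified illustration, not the Caffe-SSD implementation):

```python
def iou(a, b):
    # Jaccard overlap of two axis-aligned boxes = intersection / union
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def match(gt_boxes, default_boxes, threshold=0.5):
    """Step 1: each ground-truth box claims its best-overlap default box.
    Step 2: any remaining default box with IoU > threshold to some
    ground truth is also marked positive; all others are negatives."""
    matches = {}  # default box index -> ground truth index
    for g, gt in enumerate(gt_boxes):
        best = max(range(len(default_boxes)),
                   key=lambda d: iou(gt, default_boxes[d]))
        matches[best] = g
    for d, db in enumerate(default_boxes):
        if d in matches:
            continue
        for g, gt in enumerate(gt_boxes):
            if iou(gt, db) > threshold:
                matches[d] = g
                break
    return matches
```

Default boxes that appear in the returned mapping are the positive samples; all others are negatives.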
The training of the SSD algorithm follows reference [10]. The overall objective loss function is

L(x, c, l, g) = (1/N) (L_conf(x, c) + α L_loc(x, l, g)),

where c represents category confidence, l represents the target prediction box, g represents the target's real label box, N is the number of matched default boxes, and by default α = 1. The localization loss L_loc is a smooth L1 loss between the predicted box and the encoded ground-truth offsets:

L_loc(x, l, g) = Σ_{i∈Pos} Σ_{m∈{cx,cy,w,h}} x_ij^k smooth_L1(l_i^m − ĝ_j^m),

where (cx, cy) represents the coordinates of the box center, w and h represent the box's width and height, and (ĝ_j^cx, ĝ_j^cy, ĝ_j^w, ĝ_j^h) are the offsets of real label box j encoded relative to its matched default box.

Algorithm Optimization.
For the SSD algorithm, a serious problem is class imbalance in the labels. The number of negative examples (background class) is far larger than the number of positive examples. Each training picture can produce up to 8732 candidate boxes of various scales in the feedforward pass of the network. Because the number of labeled boxes in a training image is usually only a few or even one, even with an IOU threshold of 0.5 for matching, the candidate boxes that can match a labeled box are still only a small fraction of all candidate boxes.
In training a deep learning model for animation character recognition, the number of background samples far exceeds the number of positive samples containing animation characters [12]. Therefore, the model tends to learn "what is the background" rather than "what is the target," which reduces the accuracy of target detection in application. Since the main application scenario of this paper concerns the main animation characters in a picture, it is not necessary to identify small animation characters, and characters with resolution below 150 × 150 are also excluded from the annotation data. We can selectively retain some contents and items of the original network while using a detection module with a high matching degree, so that the efficiency of model training and inference can be improved without reducing accuracy. Therefore, a weight parameter w is added to the confidence loss function of SSD. The new confidence loss function is

L_conf(x, c) = −Σ_{i∈Pos} w · x_ij^p log(ĉ_i^p) − Σ_{i∈Neg} log(ĉ_i^0),

where ĉ_i^p is the softmax-normalized confidence for class p, so that positive samples contribute w times their original loss.
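A minimal sketch of this weighted confidence loss follows, interpreting w as a multiplicative factor on the positive-sample cross-entropy (the paper does not fully spell out the formulation, so this reading is an assumption):

```python
import math

def weighted_conf_loss(pos_probs, neg_probs, w=3.0):
    """Cross-entropy confidence loss with an extra weight w on the
    positive (character) samples. pos_probs: predicted probabilities of
    the true class for positive boxes; neg_probs: predicted background
    probabilities for negative boxes."""
    pos = -sum(math.log(p) for p in pos_probs)
    neg = -sum(math.log(p) for p in neg_probs)
    return w * pos + neg
```

With w > 1, a misclassified positive box is penalized more heavily than a misclassified background box, counteracting the class imbalance.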

Data Sources.
First of all, a large number of cartoon character images are needed, and the characters in the pictures must be annotated. 36 animation characters in 11 classic animation works are to be identified, with 500 pictures. In addition, no less than 30% of the training set sample size should be prepared as the test set. The format of the animation character pictures is JPG, the resolution should be no less than 640 × 480, and the size of characters in a picture should be no less than 150 × 150 pixels. The process of obtaining the training data source is shown in Figure 3. As far as whole animation works are concerned, the number of animation character images collected in this paper is relatively small and far from enough to train an actual animation character detection model. Therefore, this paper enhances the original animation image data. To some extent, data enhancement can prevent overfitting during training, and it plays an important role in the final recognition ability and generalization ability of the trained model. Its methods are shown in Figure 4.
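Two typical enhancement operations, horizontal flipping and brightness adjustment, can be sketched as below; these are hypothetical stand-ins for the (unspecified) methods of Figure 4, operating on images as nested lists of RGB pixels:

```python
import random

def hflip(img):
    # img: H x W list of pixel lists; mirror each row horizontally
    return [row[::-1] for row in img]

def adjust_brightness(img, factor):
    # scale every channel value, clamped to the 8-bit range
    return [[[min(255, int(c * factor)) for c in px] for px in row]
            for row in img]

def augment(img, rng=random):
    """One random augmentation pass: maybe flip, then jitter brightness."""
    if rng.random() < 0.5:
        img = hflip(img)
    return adjust_brightness(img, rng.uniform(0.8, 1.2))
```

In practice each source image would be expanded into many augmented variants to grow the training set.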

Data Label.
For the prepared source data images containing animation characters, the annotation terminal downloads the source data images from the annotation subsystem through the FTP protocol and then labels them. The specific labeling method is to mark the position of the cartoon characters in the picture with a rectangular box, that is, frame the position of the characters and record the horizontal and vertical coordinates of the upper left corner and the lower right corner of the box. The requirements are as follows: (1) Use a rectangular box to label the target cartoon characters in the pictures. (2) The head and body of a character are framed separately, and the rectangular box is kept close to the edge of the character. (3) If too few body parts of an animation character are visible to be framed, the body is not frame-selected. (4) The front, side, and back views of characters should all be frame-selected. (5) When the animation character is smaller than 150 × 150 pixels, it is not labeled. (6) When the annotator cannot judge and prior data are missing, the animation character is not labeled. (7) Try to avoid including subtitles and other irrelevant information in the box for an animation character. (8) When the rectangular box is at the edge of the image, it should be kept close to the edge.
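Rules (5) and (8) above lend themselves to an automatic sanity check on candidate boxes; a small sketch (the helper name and box convention are assumptions for illustration):

```python
MIN_SIZE = 150  # rule (5): characters below 150 x 150 px are not labeled

def keep_annotation(box, img_w, img_h, min_size=MIN_SIZE):
    """Validate a candidate box (x1, y1, x2, y2): clamp it to the image
    edge (rule 8) and discard it if either side is below min_size
    (rule 5). Returns the cleaned box, or None if rejected."""
    x1, y1, x2, y2 = box
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(img_w, x2), min(img_h, y2)
    if (x2 - x1) < min_size or (y2 - y1) < min_size:
        return None
    return (x1, y1, x2, y2)
```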
For large-scale training data generated by data enhancement, the size and position of the animation characters are known during the enhancement process, so annotation information can be generated directly without manual annotation, which saves a great deal of labor cost and avoids the errors introduced by manual annotation.
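For example, when an image is flipped horizontally, the annotation box can be regenerated by mirroring its coordinates rather than re-annotating by hand (a sketch under the (x1, y1, x2, y2) box convention used above):

```python
def hflip_box(box, img_width):
    """Mirror an axis-aligned annotation box (x1, y1, x2, y2) to follow
    a horizontal flip of an image of width img_width, so that labels
    for augmented images are derived automatically."""
    x1, y1, x2, y2 = box
    return (img_width - x2, y1, img_width - x1, y2)
```

Applying the transform twice recovers the original box, which is a quick correctness check.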

Test Code.
All experiments in this paper are implemented on the Caffe open-source framework [13]. The improved SSD with a positive-loss weighting factor of 3 was used, and the training effect was compared with the original SSD. To implement the positive-loss weighting method, we need to modify the loss-calculation code in the Caffe-SSD source.

Evaluation Index.
The training set includes about 15000 images: the original animation image data and the enhanced versions of these data. The verification set is randomly extracted from the original animation images, 5000 pieces in total. The number of training iterations is set to 30000, with verification once every 1000 iterations; the verification results use the mean average precision (mAP) as the evaluation standard. For the target detection task, precision and recall can be calculated for each detected target. After many experiments, a P-R curve can be obtained for each target category. The AP value is the area under this curve, and the mAP value is the average of the AP values over the target categories; mAP ranges from 0 to 1. Figure 5 is a graph of mAP for animation image data enhancement. The red curve represents the change of the verification-set mAP with increasing iterations when the original data are used as the training set, and the green curve shows the change of the verification-set mAP when the enhanced data are used as the training set.
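The AP-as-area-under-the-curve computation described above can be sketched as follows (a simple step-wise integration; practical benchmarks often use interpolated precision, which is omitted here for clarity):

```python
def average_precision(precisions, recalls):
    """Area under the P-R curve by a step rule; the (precision, recall)
    points are assumed sorted by increasing recall."""
    ap = 0.0
    prev_r = 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_r)  # precision weighted by recall gained
        prev_r = r
    return ap

def mean_average_precision(ap_values):
    # mAP: mean of per-category AP values, in [0, 1]
    return sum(ap_values) / len(ap_values)
```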

Data Enhancement.
It can be clearly seen from Figure 5 that, although the mAP value of the red curve is slightly higher than that of the green curve before about 12500 iterations, after that the growth of the red curve's mAP flattens, while the green curve's mAP keeps growing and soon overtakes it. Finally, the mAP of the model trained on the original data stabilizes at 0.65, while that of the model trained on the enhanced data stabilizes at about 0.72, an increase of 10.7%. It can be seen that data enhancement improves model performance.

Model Optimization.
As shown in Figure 6, the ordinate is the mAP measured on the verification set during training. The red curve and the yellow curve represent the change of the verification-set mAP when w is set to 0 and 3, respectively, for training. As can be clearly seen, for the improved SSD, mAP shows the fastest growth trend and stabilizes at a high value of 0.77, while the original SSD stabilizes at 0.69, 10.4% lower in comparison. It can be seen that properly increasing the loss weight of positive samples can improve the performance of the animation character detection model. Figure 7 shows the comparison of loss functions between the original SSD and the improved SSDs.
It can be clearly seen from the figure that the convergence curves vary with different values of w, and that as w increases, the final converged loss value increases correspondingly. This indicates that the loss weight of positive samples does affect the training of the model.

Application Scenarios
After training of the intelligent animation character recognition model is complete, it needs to be deployed as a back-end service for the server to call. This service should be deployed on a server that supports fast forward inference of the deep learning model, as shown in Figure 8.

Mobile Terminal.
When watching animation, mobile client users may see unfamiliar animation characters and want to learn more about them. They can take animation pictures with the camera in the client system or read local animation pictures directly, and then upload the pictures using the image search function in global search.

Client.
The product-side server stores the image data in the image download server, which returns the image URL to the product-side server; the client then sends a request through the product-side server to the intelligent analysis capability service for animation characters. The latter obtains the image source file from the image download server by parsing the URL of the picture, analyzes the animation characters in the picture, and sends the processing result back to the product-side server.

Product Terminal.
On the product side, the server selects the animation character with the highest correlation according to the processing results, uses the character's name as the keyword to search for related works through the search engine, and recommends the relevant information to the client user.

Conclusion
This paper proposes an intelligent analysis algorithm for animation characters based on SSD target detection.
Through an online crawler, a database containing 36 kinds of animation characters is collected, and large-scale, high-precision training data are produced by data enhancement. The SSD is improved by modifying the classification weights in the loss function. The results show that, without increasing the model parameters or test time, the accuracy of animation character recognition is improved from 69% to 77%, which provides strong algorithmic support for practical applications of animation character recognition. By deploying the intelligent analysis capability of animation characters as an application service, users can easily understand all kinds of cartoon characters and the efficiency of information dissemination is effectively enhanced, promoting the development of the animation industry.
Data Availability
The dataset can be accessed upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.