Research Article
Video Style Transfer Based on Convolutional Neural Networks

Video style transfer using convolutional neural networks (CNNs), a method from the deep learning (DL) field, is described. The CNN model, the style transfer algorithm, and the video transfer process are presented first; then, the feasibility and validity of the proposed CNN-based video transfer method are evaluated in a video style transfer experiment on The Eyes of Van Gogh. The experimental results show that the proposed approach not only achieves video style transfer but also effectively eliminates flickering and other secondary problems in video style transfer.


Introduction
In the deep learning (DL) field, image style transfer is an important research topic [1]. Traditional methods for style transfer include texture synthesis, support vector machines, histogram matching, and automatic sample collection [2][3][4]. Although these methods can produce special effects, they can also cause image distortion and other prominent problems, such as loss of detail, bending and deformation of straight lines, and color changes over large areas. In addition, special algorithms are usually needed to further correct such mistakes, resulting in low style transfer efficiency and poor image quality. Recently, convolutional neural network (CNN) DL models have been successfully applied to image style transfer problems, reigniting interest in this research field [5][6][7][8]. In the present study, a style transfer algorithm was developed and tested on The Eyes of Van Gogh, an American biographical feature film directed by Alexander Barnett, with the main roles played by Dane Agostini and John Alexander. The film narrates the story of Van Gogh's 12 months in the asylum at Saint-Rémy and portrays the legend of the talented artist who created, loved, and changed the world through hallucinations, nightmares, and painful memories. The purpose of this paper is to apply a CNN-based style transfer method in a video style transfer experiment; such CNN-based style transfer methods have seldom been used in the video field. The proposed style transfer algorithm uses techniques from the DL field and is based on a CNN; the feasibility and validity of the proposed CNN-based video style transfer algorithm are evaluated in an experiment transferring the painting style of Van Gogh's The Starry Night to the film as a special effect.

CNN.
A CNN is a recently developed DL method that has attracted considerable attention. In general, a CNN is a multilayered network [9][10][11]; a typical CNN is shown schematically in Figure 1. A CNN consists of a series of convolution (C) and subsampling (S) layers. Each layer is composed of multiple 2D planes, each serving as a feature map; the network also includes some fully connected (FC) hidden layers. There is only one input layer in a CNN. This input layer receives two-dimensional objects directly, and feature extraction from the samples is performed by the convolution and subsampling layers. The fully connected hidden layers are mostly used to accomplish specific tasks [12].
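As an illustration of the C and S layers described above, the following minimal NumPy sketch (a toy example, not the network used in the experiments) applies one convolution layer and one 2×2 mean-subsampling layer to a single-channel input:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Valid 2D convolution of a single-channel image with one kernel (C layer)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def subsample2x(feature_map):
    """2x2 mean-pooling subsampling (S layer)."""
    h, w = feature_map.shape
    fm = feature_map[:h - h % 2, :w - w % 2]
    return fm.reshape(fm.shape[0] // 2, 2, fm.shape[1] // 2, 2).mean(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 input plane
kernel = np.ones((3, 3)) / 9.0                     # averaging filter
c1 = conv2d_valid(image, kernel)                   # C layer -> 4x4 feature map
s1 = subsample2x(c1)                               # S layer -> 2x2 feature map
```

In a full CNN, many such C/S pairs are stacked, and the final feature maps are flattened and passed to the FC layers.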

Style Transfer Algorithm.
The methodology of Gatys et al. [13, 14] is reviewed first. On this basis, the feature extraction and storage of style images and content images (single frames of video) are performed. A style image $\vec{a}$ is passed through the network, and the style representations of all layers are computed and stored as Gram matrices $A^l \in \mathbb{R}^{N_l \times N_l}$. A content image $\vec{p}$ is passed through the network, and its feature responses in layer $l$ are stored as $P^l \in \mathbb{R}^{N_l \times D_l}$. Here, $N_l$ is the number of filters in layer $l$, and $D_l$ is the spatial dimension of the feature map, namely the product of its width and height. Then, a random white-noise image $\vec{x}$ is passed through the network, and both its content features $F^l$ and style features $G^l$ are computed: $F^l \in \mathbb{R}^{N_l \times D_l}$, where $F^l_{ij}$ is the activation of the $i$-th filter at position $j$ in layer $l$, and $G^l \in \mathbb{R}^{N_l \times N_l}$ is the Gram matrix of the vectorized feature maps in layer $l$, obtained from the feature correlations

$$G^l_{ij} = \sum_{k=1}^{D_l} F^l_{ik} F^l_{jk}.$$

For each layer of the style image, the mean squared deviation between the elements of $G^l$ and $A^l$ is computed, and the style loss $L_{\mathrm{style}}$ is obtained from equation (1):

$$L_{\mathrm{style}} = \sum_l \frac{w_l}{4 N_l^2 D_l^2} \sum_{i,j} \left( G^l_{ij} - A^l_{ij} \right)^2, \qquad (1)$$

where $w_l$ is the weight of layer $l$. The mean squared deviation between $F^l$ and $P^l$ in the chosen content layer is computed, and the content loss $L_{\mathrm{content}}$ is obtained from equation (2):

$$L_{\mathrm{content}} = \frac{1}{2} \sum_{i,j} \left( F^l_{ij} - P^l_{ij} \right)^2. \qquad (2)$$

The total loss $L_{\mathrm{singleframe}}$ is a linear combination of the style and content losses. Its error is backpropagated with respect to the pixel values, and gradient descent is used to iteratively update the image $\vec{x}$ until it simultaneously matches the style features of the style image $\vec{a}$ and the content features of the content image $\vec{p}$. The weight factors $\alpha$ and $\beta$ determine the relative importance of the content and style components, and the total loss is calculated from equation (3):

$$L_{\mathrm{singleframe}} = \alpha L_{\mathrm{content}} + \beta L_{\mathrm{style}}. \qquad (3)$$
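The style, content, and total losses can be sketched directly in NumPy. This is a simplified illustration in which the feature maps are given as plain arrays; the layer weights, α, and β below are placeholder values, not the settings used in the experiments:

```python
import numpy as np

def gram(F):
    """Gram matrix G_ij = sum_k F_ik F_jk for a feature map F of shape (N_l, D_l)."""
    return F @ F.T

def style_layer_loss(F, A_gram):
    """One layer's style term: (1 / (4 N^2 D^2)) * sum_ij (G_ij - A_ij)^2."""
    N, D = F.shape
    G = gram(F)
    return np.sum((G - A_gram) ** 2) / (4.0 * N ** 2 * D ** 2)

def content_loss(F, P):
    """Content loss: (1/2) * sum_ij (F_ij - P_ij)^2."""
    return 0.5 * np.sum((F - P) ** 2)

def total_loss(F, P, style_maps, style_grams, weights, alpha=1.0, beta=1e4):
    """Total single-frame loss: alpha * L_content + beta * L_style."""
    L_style = sum(w * style_layer_loss(Fl, Al)
                  for w, Fl, Al in zip(weights, style_maps, style_grams))
    return alpha * content_loss(F, P) + beta * L_style
```

In practice, the feature maps come from the CNN's convolution layers, and the gradient of this total loss with respect to the pixels of the white-noise image drives the iterative update.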

Elimination of Flickering in Video Style Transfer.
At present, most restoration methods require modeling to account for flickering in the image sequence; the flicker parameters of the model are estimated first, and then color correction and restoration are performed. However, the existing methods cannot treat flicker problems arising from video style transfer. In this paper, the color transfer algorithm proposed by Reinhard et al. [15] is adopted, on the basis of which the steps of video frame color correction and sequence restoration are simplified; the steps are specific to flickering after video style transfer, and interframe color transfer is applied directly. Thus, the data processing load and computational complexity are reduced, while the restoration efficiency is increased. Video color transfer is an algorithm that changes the frame color [16]. By defining a reference frame that provides the color layout for the original frame, a synthetic frame combining the form of the original frame with the color of the reference frame can be obtained, which is especially suitable for continuous video processing. The specific algorithm is as follows:

(1) Both the original frame and the reference frame of the video are converted from the RGB color space to the $l$, $\alpha$, $\beta$ color space, which removes the correlation between the three channels.

(2) The mean value and standard deviation of each channel are computed for both the original frame and the reference frame.

(3) The mean value of each channel is subtracted from the original frame (weakening):

$$l_S' = l_S - \langle l_S \rangle, \quad \alpha_S' = \alpha_S - \langle \alpha_S \rangle, \quad \beta_S' = \beta_S - \langle \beta_S \rangle. \qquad (4)$$

Here, $l_S$, $\alpha_S$, and $\beta_S$ are the pixel values of the three channels of the original frame, and $l_S'$, $\alpha_S'$, and $\beta_S'$ are the corresponding pixel values after weakening.

(4) The ratio of the standard deviations of the reference frame and the original frame is taken as the coefficient of the channel value offset; the detail information of the reference frame is mapped to the original frame in accordance with the following equation:

$$l' = \frac{\sigma_T^l}{\sigma_S^l}\, l_S', \quad \alpha' = \frac{\sigma_T^\alpha}{\sigma_S^\alpha}\, \alpha_S', \quad \beta' = \frac{\sigma_T^\beta}{\sigma_S^\beta}\, \beta_S'. \qquad (5)$$

Here, $l'$, $\alpha'$, and $\beta'$ are the pixel values of the synthetic frame in the three channels of the $l$, $\alpha$, $\beta$ color space, and $\sigma_S$ and $\sigma_T$ denote the standard deviations of the original and reference frames, respectively.

(5) The overall information of the reference frame is added to the synthetic frame; that is, the mean value of each channel of the reference frame is added, as shown in the following equation, and the final synthetic frame is thereby obtained:

$$l = l' + \langle l_T \rangle, \quad \alpha = \alpha' + \langle \alpha_T \rangle, \quad \beta = \beta' + \langle \beta_T \rangle. \qquad (6)$$

Here, $l$, $\alpha$, and $\beta$ are the pixel values of the three channels of the final synthetic frame, and $\langle l_T \rangle$, $\langle \alpha_T \rangle$, and $\langle \beta_T \rangle$ are the channel means of the reference frame.

(6) After the color transfer, the synthetic frame is converted from the $l$, $\alpha$, $\beta$ color space back to the RGB color space.
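The per-channel statistics matching at the core of this color transfer can be sketched in NumPy as follows. The RGB ↔ lαβ conversions are omitted here, so the function is assumed to receive frames that are already in a decorrelated color space:

```python
import numpy as np

def reinhard_color_transfer(source, reference):
    """Per-channel mean/std matching in the style of Reinhard et al.;
    both inputs are (H, W, 3) float arrays in a decorrelated color space."""
    out = np.empty_like(source, dtype=float)
    for c in range(3):
        s = source[..., c].astype(float)
        r = reference[..., c].astype(float)
        # Subtract the source mean (weakening), scale by the std-deviation
        # ratio, then add the reference mean.
        out[..., c] = (s - s.mean()) * (r.std() / (s.std() + 1e-8)) + r.mean()
    return out
```

After this step, each channel of the synthetic frame shares the mean and standard deviation of the reference frame while keeping the spatial structure of the source frame.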

Video Style Transfer Process
(1) The video is converted into single frames. Preprocessing should be performed on the video to be transferred; continuous video is split into single frames, and the preprocessed frames are saved as JPG files. MATLAB software is used for this processing, and the frames are stored in groups according to the shot sequence.
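The shot-wise organization of the saved frames can be illustrated with a small helper. Both `frame_paths` and its naming scheme are hypothetical stand-ins for the MATLAB preprocessing described above, assuming the video has already been segmented into shots:

```python
import os

def frame_paths(shot_lengths, out_dir="frames", ext="jpg"):
    """Generate classified JPG paths grouped by shot sequence.
    shot_lengths[i] is the number of frames in shot i (hypothetical helper)."""
    paths = []
    for shot_idx, n_frames in enumerate(shot_lengths, start=1):
        shot_dir = os.path.join(out_dir, f"shot_{shot_idx:03d}")
        for frame_idx in range(1, n_frames + 1):
            paths.append(os.path.join(shot_dir, f"frame_{frame_idx:04d}.{ext}"))
    return paths
```

Grouping frames by shot in this way is what later allows each shot to be deflickered against its own reference frame.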

Model Parameters.
Model selection and parameter optimization are key steps in video style transfer; a proper model and parameters can significantly enhance the transfer of high-quality artistic videos. First, four models, namely CaffeNet, GoogLeNet, VGG16, and VGG19, were selected, each with its own unique advantages [17][18][19][20]. CaffeNet is a classical DL model; its advantages include easy network expansion and the ability to mitigate fitting problems. It is also the simplest network among the four models; since these models were proposed, several deeper network structures have appeared. GoogLeNet introduces the concept of an inception module, aiming to strengthen the basic feature extraction modules. It considerably enhances the feature extraction ability of a single layer without significantly increasing the amount of computation.
Although VGG-Net inherited some network frameworks from LeNet and AlexNet, it is not identical to them; VGG-Net uses more layers, usually 16 to 19. VGG-Net mainly increases the network depth while reducing the parameter configuration. The model suitable for the style transfer of the video The Eyes of Van Gogh must be determined in advance. Second, the style/content conversion rates (10^-1, 10^-2, 10^-3, 10^-4, and 10^-5) are key transfer parameters, and a preliminary experiment is also required for parameter selection. Therefore, a video clip of The Eyes of Van Gogh was selected for the style transfer experimental analysis. Figure 2 shows the experimental results obtained when the style/content conversion rates of the four models were set to 10^-1, 10^-2, and 10^-3; the video frame style transfer was insufficient because the content of the original video frames remained dominant, owing to the excessively low conversion rate. When the conversion rate was 10^-5, the style transfer was excessive and important information, such as the form and structure of the original video frame, was lost. The CaffeNet-based style transfer exhibited several serious errors, such as distortion, making it unsuitable for this video transfer; although GoogLeNet performed slightly better than CaffeNet, many errors remained; VGG16 and VGG19 performed better and achieved optimal results, especially at a conversion rate of 10^-4. The results for the VGG16 and VGG19 models were further compared, and the VGG19-based transfer was considered to have higher fidelity, richer layering, and higher transfer efficiency (marked by a red frame in Figure 2). Figure 3 shows the computation times for the different models and parameters.
As a result, the VGG19 model and a style/content conversion rate of 10^-4 were selected for the style transfer in the video transfer experiment on The Eyes of Van Gogh. Following [21][22][23][24], the style transfer algorithm based on the Caffe platform was used, and the video frames to be transferred were given as input. The VGG19 network trained in advance was used for computing the loss; conv4_2 in this network represents the content, while conv1_1, conv2_1, conv3_1, conv4_1, and conv5_1 represent the style; the loss of the parameter weight used was not more than 0.02%; and the number of iterations was 512. The hardware was a high-performance workstation (HP Z840 workstation; configuration: eight-core Intel Xeon E5, two Nvidia TITAN Xp GPUs, 64 GB of memory). The following experimental steps were performed. First, 24 frames (800 × 480 pixels per frame) were imported at a time for continuous style transfer; the model was VGG19, and the conversion rate was 10^-4. Second, Van Gogh's representative work, The Starry Night, was selected as the source style image; its short yet thick brushwork and fiery colors are filled with personality and artistic charm. Finally, CUDA was used for parallel computing and for outputting the JPG video frames (the maximal side length was 1024). Figure 4 shows the results of the style transfer experiment for a continuous video. Figures 4(a) and 4(c) were selected from the original frames of the experimental video The Eyes of Van Gogh; we chose two groups of shots, with low indoor brightness and high outdoor brightness, respectively, for the video style transfer experiment. This group of video shots was relatively static, and the characters exhibited no large displacements. Figures 4(b) and 4(d) show the video style transfer realized using equations (1)-(3) presented in Section 2 of this paper.
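The iterative update of the white-noise image over 512 iterations can be illustrated with a toy gradient descent loop. The true CNN losses are replaced here by simple quadratic stand-ins (and the targets, α, β, and learning rate are placeholder values), so the sketch runs without a network:

```python
import numpy as np

# Toy sketch of the iterative update of the white-noise image x:
# gradient descent on L = alpha*L_content + beta*L_style, with the CNN
# losses replaced by quadratic stand-ins so the loop is runnable here.
rng = np.random.default_rng(0)
target_content = rng.normal(size=(8, 8))   # stand-in for content features
target_style = rng.normal(size=(8, 8))     # stand-in for style features
alpha, beta, lr = 1.0, 0.5, 0.1            # placeholder weights and step size

x = rng.normal(size=(8, 8))                # random white-noise start
for _ in range(512):                        # 512 iterations, as in the experiment
    grad = alpha * (x - target_content) + beta * (x - target_style)
    x -= lr * grad                          # gradient step toward both targets

# With quadratic losses, x converges to the alpha/beta-weighted average
optimum = (alpha * target_content + beta * target_style) / (alpha + beta)
```

The real experiment differs only in that the gradient is obtained by backpropagating the CNN losses of equations (1)-(3) through VGG19 to the pixels.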
The experimental results show that, on the one hand, the style transfer retains the form, structure, and other information of the original video frame, and when mixed with the brushwork texture and color elements of The Starry Night, the combination produces a unique visual effect; on the other hand, the video frame details obtained in the style transfer are excellent, with rich colors and without obvious distortion. To further verify the reliability and validity of the style transfer algorithm in video style transfer applications, we then selected from the video a set of continuous frames in which the characters exhibited large displacements. The experimental steps and model parameters were the same as those mentioned previously. Figure 5 shows the results of this video style transfer. Although the form and structure information of the target video frame, its mixture with the textures and colors of the source style image, and other elements were preserved to produce the intended visual effects, we also noted some mistakes in the details of the video style transfer. A prominent problem was interframe flickering, as shown in the part marked with a red frame in Figure 5(b). There, some parts of several single frames exhibited hue and brightness deviations, which caused flickering as secondary damage during continuous playback. Thus, we used the color transfer algorithm for further processing to eliminate the flickering and thereby attain optimal video transfer. The color transfer process was applied to the imported video frames in accordance with equations (4)-(6) presented in Section 2 of this paper.
A proper frame in the same shot was selected as the reference frame (here, we chose the middle frame), and the color features of the reference frame were transferred successively to each frame in the same group of shots; after one group was processed, the above steps were repeated until all of the imported video frames were processed. Figure 6(a) shows the video frames after the elimination of flickering using the color transfer algorithm; in the areas marked with red frames (the character's chest, forehead, and other body parts), the style transfer errors were effectively eliminated. Figure 7 shows the mean statistics of the videos before and after the elimination of flickering; Figure 7(a) shows the statistics before the flickering elimination process, and Figure 7(b) shows the statistics after it.
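The shot-by-shot flicker removal described above (middle frame as color reference) can be sketched as follows. `match_channel_stats` is a simplified stand-in for the full color transfer, operating on frames assumed to be already converted to a decorrelated color space:

```python
import numpy as np

def match_channel_stats(frame, reference):
    """Transfer per-channel mean/std from reference to frame (Reinhard-style);
    both arrays are (H, W, 3) floats in a decorrelated color space."""
    out = np.empty_like(frame, dtype=float)
    for c in range(3):
        f, r = frame[..., c], reference[..., c]
        out[..., c] = (f - f.mean()) * (r.std() / (f.std() + 1e-8)) + r.mean()
    return out

def deflicker_shot(frames):
    """Use the middle frame of a shot as the color reference and map its
    channel statistics onto every frame in the shot."""
    reference = frames[len(frames) // 2]
    return [match_channel_stats(f, reference) for f in frames]
```

Running `deflicker_shot` over each group of frames in turn equalizes interframe hue and brightness, which is what suppresses the flickering during continuous playback.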

Conclusion
The CNN-based style transfer algorithm quickly and effectively generates diverse stylized videos with unique visual effects. The experiment proved that the video style transfer method proposed herein is feasible and effective. Regarding parameter optimization of the video style transfer model, we found that the style transfer results are strongly determined by the style/content conversion rate and the model selection. The experiment also showed that, for the film The Eyes of Van Gogh, the optimal model was VGG19 and the optimal conversion rate was 10^-4. It should be noted that the model parameters should be selected in accordance with the specific video; a sample analysis experiment should be conducted in advance to obtain the best results. In addition, as flickering and other secondary problems often occur in video style transfer, the video obtained after style transfer requires further processing using the color transfer algorithm to obtain high-quality results.
In future work, we hope to explore the use of the proposed CNN-based style transfer algorithm for other video transformation tasks, such as the production of stable and visually appealing stylized videos even in the presence of fast motion and strong occlusion. Video style transfer commonly suffers from problems such as loss of detail, bending and deformation, and large-scale color changes, which can cause secondary video damage such as flickering. Owing to the subjectivity of video quality evaluation, we also plan to establish a subjective evaluation index system for better evaluation of style-transferred video quality. Subjective evaluation is the most commonly used method in video quality evaluation; however, it is time-consuming. Therefore, we plan to employ a forced-choice evaluation on Amazon Mechanical Turk (AMT) with 200 different users to evaluate our experimental results; this is part of our further research. In addition, we plan to extend the dataset to include more videos, which would make our approach more generalizable.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
All authors declare that there are no conflicts of interest regarding this study.