Transformer-Based Data-Driven Video Coding Acceleration for Industrial Applications

Information Security Center, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
National Engineering Laboratory for Disaster Backup and Recovery, Beijing University of Posts and Telecommunications, Beijing 100876, China
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
Department of Computer Science and Engineering, Santa Clara University, Santa Clara 95053, USA


Introduction
With the development of the smart industry, image and video play a key role in many industrial scenarios [1][2][3]. As a result, video coding standards have been used more widely than ever [4]. Although advanced video coding (AVC) was introduced in 2003, it is still used by many applications today due to its fast coding speed [5]. However, with the increasing demand for high-resolution and ultra-high-resolution video and the improvement of hardware computing power, high efficiency video coding (HEVC) is gradually replacing AVC because of its superior coding efficiency. Although HEVC was developed about one decade ago, its computational complexity is still high at present due to the introduction of many advanced coding tools [6]. Certain industrial scenarios have an urgent demand for real-time high-definition display, and the high encoding complexity hinders the application of HEVC in these scenarios. Fortunately, the use of big data and deep learning techniques in industry is becoming increasingly mature [7][8][9][10][11][12]. Besides, mobile video applications have exploded in recent years and produced massive amounts of encoding data, which makes it possible to employ big data and deep learning techniques to reduce encoding time and speed up the encoding process in industrial applications [13]. The main computing burden of HEVC encoding comes from a complicated partition rule for intracoding called the quadtree structure. To lower the computational complexity, many scholars have developed fast algorithms for HEVC encoding using various techniques, such as learning-based techniques and type-3 fuzzy logic systems (FLSs) [14]. Learning-based techniques are good at finding patterns in data. Dong et al.
[15] proposed a learning-based fast algorithm for versatile video coding (VVC) that reduces coding complexity from two aspects, mode selection and prediction termination. Type-3 fuzzy logic systems, in turn, are good at solving equations or finding a mathematical model to represent the relationship between output and associated input variables [16]. With these advances in problem modeling, type-3 FLSs are suitable for mode decision or rate control tasks in video coding and may bring potential improvements in encoding performance [14]. There are quite a number of algorithms focusing on fast coding unit (CU) partitioning. Furthermore, with the popularity of deep learning technology [17], researchers use neural networks (NNs) to boost CU partitioning and achieve satisfactory results [18]. Works using deep learning methods can be roughly classified into two main categories: the multistage partition approach and the end-to-end structure decision approach.
In the multistage partition category, approaches regard the structure determination of CTU partitioning as a combination of several binary classification problems. Xu et al. [19] proposed a three-level CTU splitting algorithm based on a convolutional neural network (CNN). They trained three deep CNN models, each of which predicted the split flag for a CU at a certain depth. Kim and Ro [20] also proposed a CTU partition algorithm using three CNN-based classifiers; they used different networks to predict the splitting decisions of CUs of different sizes. Shi et al. [21] proposed an AK-CNN for fast CTU partition prediction. Their AK-CNN classifiers are well designed and can detect the texture complexity of a CU quickly. Shen et al. [22] proposed early determination and a bypass strategy for CU size decisions using the texture property of the current CU and coding information from neighboring CUs. Moreover, Shen et al. [23] proposed a fast inter-mode decision algorithm for HEVC by jointly using the inter-level correlation of the quadtree structure and the spatiotemporal correlation. Based on visual perception and machine learning, Chen et al. [24] proposed a fast algorithm that uses random forest models to quickly select the partition for VVC intracoding. However, methods in this category usually need as many as three NN models, which require much training work and cause implementation difficulties. Besides, the splitting of a CU is usually related to the partition statuses of neighboring CUs, and methods in this category ignore the splitting information from neighboring CUs by considering the splitting problem hierarchically.
In the end-to-end decision category, one partition structure or several possible candidate results for the CTU can be generated through a single prediction. Liu et al. [18] proposed a VLSI-friendly approach that partitions a CTU using a shallow CNN. Feng et al. [25] proposed a fast block partitioning algorithm using CNN-based depth map prediction for HEVC intracoding. They used a depth map to represent the partition structure of the CTU so that the quadtree structure can be predicted end to end. Tissier et al. [26] proposed an edge probability prediction approach using a CNN. In their approach, the most probable CTU partition results were generated, and the final partition was determined through postprocessing and rate-distortion optimization (RDO). Although methods in this category consider the influence of the entire CTU's spatial information on the splitting of the current CU, they do not take into account the influence of the splitting of parent CUs. Furthermore, for some end-to-end methods that reduce the number of CTU partition candidates for RDO, the RDO process is not fully skipped, so the complexity reduction is limited.
To make full use of splitting information from both neighboring and parent CUs, we first propose a new representation for the CTU partitioning structure. Specifically, we use an array called the split vector (SV) to represent a CTU partitioning structure. Then, we design and train a transformer model to predict the SV of the CTU. With the introduction of the transformer, CTU partitioning is treated as a sequence problem. Finally, the CTU partition structure is decided from the SV through postprocessing, and the RDO process is no longer needed for CTU partition searching. The main contributions of this paper are as follows: (1) An array called the SV is proposed to represent the partition structure of the CTU. With the use of the SV, the RDO process searching for the optimal CTU partitioning structure can be fully skipped. (2) We introduce transformer models into the fast determination task for the CTU partition structure and model the partition problem as a sequence of mutually dependent decisions. (3) We not only design effective transformer models that predict the SV of the target CTU with high accuracy but also build several datasets for transformer model training.
This paper is organized as follows: Section 1 introduces the background of the proposed approach. Section 2 introduces the fundamental knowledge used in this paper. Section 3 describes the proposed fast algorithm for CTU partitioning. Section 4 shows the experimental results, and Section 5 concludes this paper.

Fundamental Knowledge
In this section, we introduce the quadtree structure of HEVC intracoding, and a brief description of the transformer is also given.

Quadtree Structure of HEVC Intrapartitioning.
First, each intraframe is divided into nonoverlapping square blocks called coding tree units (CTUs), which are usually 64 × 64 pixels. To cope with the texture characteristics of the CTU, the CTU can be split into four equal-sized square blocks, and each block can be iteratively split into four square sub-blocks according to the quadtree partition structure. In the quadtree of the final CTU partitioning, the CTU serves as the root, and each leaf node represents a coding unit (CU). The CU size in HEVC intracoding varies from 64 × 64 pixels to 8 × 8 pixels since the maximum depth of the quadtree is 3. Figure 1 shows a partitioning result of a CTU and the corresponding quadtree structure. The deeper the quadtree, the smaller the CU.
A CTU can be adaptively partitioned into CUs of different sizes to achieve optimal coding efficiency, and the final partition structure is decided by a brute-force method called rate-distortion optimization (RDO) [27]. RDO first evaluates the cost of every possible partition structure in terms of bit rate and visual quality. The partition structure with minimal cost is then selected as the final partition result for the CTU. Obviously, the encoding time spent on the RDO process increases exponentially with the quadtree depth. Thus, it is essential to reduce the computational burden by replacing the RDO process with an end-to-end approach to CTU partition structure determination.
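The exponential growth mentioned above can be made concrete with a short recurrence: a block that may still be split d more times admits either no split or four independently partitioned sub-blocks, so the number of distinct structures is P(d) = 1 + P(d − 1)^4 with P(0) = 1. A minimal sketch (the function name is ours, for illustration only):

```python
def count_partitions(depth_budget: int) -> int:
    """Number of distinct quadtree partition structures for a block
    that may still be split `depth_budget` more times."""
    if depth_budget == 0:
        return 1  # leaf only: the block cannot be split further
    # either keep the block whole, or split it into 4 sub-blocks,
    # each of which is then partitioned independently
    return 1 + count_partitions(depth_budget - 1) ** 4

# A 64x64 CTU with max depth 3 (CUs from 64x64 down to 8x8)
print(count_partitions(3))  # -> 83522
```

So a full RDO search over one CTU must, in principle, compare tens of thousands of candidate structures, which is what a one-shot predictor avoids.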

Transformer Network.
The attention mechanism has been widely used in neural network models such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs) [28]. Since the attention mechanism was proposed, sequence-to-sequence models with attention have shown performance improvements in various tasks. In 2017, Vaswani et al. [29] proposed the first fully attention-based model, named the transformer. To integrate the advantages of CNNs and RNNs, they creatively built the network entirely on the attention mechanism. They applied the transformer to machine translation tasks and achieved state-of-the-art results at the time.
Like most sequence-to-sequence models, the transformer can be divided into two main parts, the encoder and the decoder. The encoder is responsible for mapping the input sequence into a hidden representation, which is the mathematical expression of the input sequence. The decoder then maps the hidden representation to the target sequence. Using a transformer, we can solve various problems, such as image classification, summary generation, and machine translation. Figure 2 shows the model structure of the transformer. In Figure 2, the encoder consists of N sub-encoders, and each sub-encoder includes two layers: a multihead attention mechanism and a fully connected feed-forward network. Besides, a residual connection and normalization are added to each layer. As we can see from Figure 2, the transformer decoder is also composed of N sub-decoders, and each sub-decoder has one more masked multihead attention layer than a sub-encoder. The transformer is a network architecture designed to replace RNNs and CNNs. It can be stacked very deep, so it can fully exploit the characteristics of deep neural networks and achieve high accuracy. Unlike a CNN, which only obtains local information, it can directly obtain global information and capture sequence dependence. Compared with an RNN, the transformer can be parallelized using the attention mechanism, and the training time is greatly reduced.
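As an illustration of the encoder stack described above, PyTorch's built-in modules can assemble N sub-encoders directly. The sizes below are arbitrary choices for the sketch, not the configuration used in this paper:

```python
import torch
import torch.nn as nn

# One sub-encoder: multi-head self-attention + feed-forward network,
# each wrapped with a residual connection and layer normalization.
layer = nn.TransformerEncoderLayer(
    d_model=128,          # width of each sequence element
    nhead=8,              # number of attention heads
    dim_feedforward=512,  # hidden width of the feed-forward sublayer
    batch_first=True,
)
encoder = nn.TransformerEncoder(layer, num_layers=6)  # N = 6 sub-encoders

x = torch.randn(2, 16, 128)  # (batch, sequence length, d_model)
h = encoder(x)               # hidden representation of the sequence
assert h.shape == (2, 16, 128)
```

Every position attends to every other position in a single layer, which is how the transformer obtains global information that a CNN would need many stacked layers to reach.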

Proposed Method
This section is divided into three parts. Firstly, we introduce how the CTU partition structure is represented in the proposed approach.
Then, we introduce the training process of the transformer models, including loss function design and training dataset preparation. Finally, we present the postprocessing procedure for the outputs of the transformer models.

Representation of the CTU Partition Structure.
In this paper, we propose a novel way to represent the CTU partition structure. Specifically, each CTU partition structure can be represented by a vector called the split vector (denoted as SV), which contains 21 Boolean split flags. Figure 3 illustrates how the CTU partition structure is reflected in an SV.
According to the quadtree partition rule used in HEVC intracoding, the CTU is split recursively until the maximum CU depth is reached. Thus, a CTU partition structure can be represented depth-wise. At depth 0, f1 is the split flag of the CTU. At depth 1, four Boolean values (f2, f3, f4, and f5 in Figure 3) represent the split statuses of the four sub-CUs of the current CTU. Furthermore, f6–f21 represent the split flags of the 16 × 16 CUs at depth 2, respectively.
Finally, these 21 CU split flags form the SV of the current CTU.
It is convenient and effective to gather these flags into the SV. The SV provides a new way of viewing the CTU partitioning jointly, and it avoids treating the CTU partition problem as a level-wise binary classification task solved by several cascaded prediction models. In other words, the SV makes it possible for our method to predict the split flags of blocks in the same CTU jointly, exploiting the high correlation among them.
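The flag layout above implies a fixed parent-child index mapping inside the SV, which the later postprocessing step relies on. A small helper (ours, for illustration) makes the mapping explicit:

```python
def sv_children(i: int):
    """Return the SV indices (1-based) of the four sub-CUs governed by
    split flag i, following the layout described above: f1 is the CTU,
    f2..f5 its depth-1 sub-CUs, and f6..f21 the depth-2 sub-CUs."""
    if i == 1:
        return [2, 3, 4, 5]
    if 2 <= i <= 5:
        start = 6 + 4 * (i - 2)   # f2 -> f6..f9, f3 -> f10..f13, ...
        return [start, start + 1, start + 2, start + 3]
    return []  # depth-2 flags have no children (max depth reached)

print(sv_children(1))  # [2, 3, 4, 5]
print(sv_children(3))  # [10, 11, 12, 13]
```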

Overview of the Proposed Approach.
Most existing methods, whether statistics-based or learning-based, partition a CTU by deciding whether to split CUs at particular depths. These methods usually employ as many as three models to complete the binary predictions for CUs at different depths. In contrast, with the proposed end-to-end method, called the transformer-based fast algorithm (TBFA), the partition structure of a CTU is decided through a single prediction using the SV.
Besides, the splitting of a CU is not only highly related to the entire CTU in which it is located but is also affected by other CUs located in the same CTU. However, many previous methods only focus on the current block without considering information from neighboring or parent CUs. TBFA is designed to consider them jointly by using the SV, which consists of the 21 split flags of neighboring and parent CUs in the current CTU.
Also, we specifically design a transformer to predict the SV of the CTU. In TBFA, the transformer is used to explore the relations among those 21 split flags. The transformer takes the luminance values of the CTU as input and outputs a prediction vector (denoted as PV) for the CTU. Through training, the PV matches the target SV of the current CTU with high accuracy. Thus, the SV can be obtained from the PV after the postprocessing procedure. Once the SV is obtained, the entire CTU partition is determined. The encoder can then encode each CTU directly, completely avoiding the RDO process for partitioning. Figure 4 shows the whole flow of the proposed TBFA.
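A rough sketch of such a predictor is shown below. The layer sizes, the 8 × 8 patch tokenization, and the mean-pooled head are our assumptions for illustration, not the paper's configuration; what the sketch preserves is the described interface: luminance values of a 64 × 64 CTU in, a 21-element PV of split probabilities out.

```python
import torch
import torch.nn as nn

class SVPredictor(nn.Module):
    """Illustrative sketch: a 64x64 luma CTU is cut into 8x8 patches,
    each patch becomes one sequence element, a transformer encoder
    relates them, and a head emits the 21 split probabilities (PV)."""
    def __init__(self, d_model=128, nhead=8, num_layers=4):
        super().__init__()
        self.embed = nn.Linear(8 * 8, d_model)  # one 8x8 patch -> token
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, 21)

    def forward(self, luma):                     # luma: (B, 64, 64)
        B = luma.shape[0]
        # split into the 64 non-overlapping 8x8 patches
        patches = luma.unfold(1, 8, 8).unfold(2, 8, 8)  # (B, 8, 8, 8, 8)
        tokens = patches.reshape(B, 64, 64)             # 64 tokens of 64 px
        h = self.encoder(self.embed(tokens))            # (B, 64, d_model)
        return torch.sigmoid(self.head(h.mean(dim=1)))  # PV in (0, 1)^21

pv = SVPredictor()(torch.randn(2, 64, 64))
assert pv.shape == (2, 21)
```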

Transformer Training.
The transformer predictor plays a key role in TBFA, and a classic transformer structure is employed in our approach. In this part, the design of the loss function is described, and the way data are prepared for training is also given.

Loss Function Design.
The loss function measures the distance or bias between the ground truth and the result predicted by a model for the current sample. In the case of this paper, the loss function is employed to measure the error between the PV and the SV.
The PV output by the transformer contains the partition probability of each part of the current CTU. Conversion from the PV to the SV is required for encoding, and the conversion process is quite simple: only a comparison between each element and 0.5 is needed. Specifically, if an element of the PV is smaller than 0.5, the corresponding element at the same position in the SV is set to 0. Conversely, if an element of the PV is greater than 0.5, the colocated element of the SV is set to 1. Contradictory values may exist in the SV converted from the PV, since each element of the PV is judged independently during the conversion process. When a CTU is not split, the related sub-CUs at depths 1 and 2 are not split either. Thus, if f1 in Figure 3 equals 0, the values of the remaining elements of the SV are supposed to be 0. To address this inconsistency among the values of the SV, we carefully designed a loss function for training.
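The 0.5-threshold conversion described above can be written in a few lines (a sketch; the element order follows the SV layout of Figure 3):

```python
def pv_to_sv(pv):
    """Element-wise threshold at 0.5: probabilities become Boolean split
    flags. Contradictions between parent and child flags are resolved
    later by the postprocessing step."""
    return [1 if p > 0.5 else 0 for p in pv]

print(pv_to_sv([0.9, 0.2, 0.7]))  # [1, 0, 1]
```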
When optimizing the weight parameters of the model iteratively, there is no point in considering the prediction error of the four sub-CUs if the real label of the current CU is non-split. Based on this, we designed the loss function, and the final loss of a sample can be calculated as

loss = Σ_{i=1}^{21} m_i · g_i(l_i, p_i),

where l_i denotes the ith element of the SV, p_i denotes the ith element of the PV, i is an integer from 1 to 21, g_i is the cross-entropy function for each of the 21 element pairs, and m_i is a mask with m_1 = 1 and, for i > 1, m_i equal to the ground-truth split flag of the parent CU of flag i, so that a flag contributes to the loss only when its parent CU is actually split. The loss function used in our model training is designed and improved from cross entropy. It removes the effect of contradictory values during the training procedure, improves prediction accuracy, and reduces the computational burden as well.
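A possible PyTorch rendering of this masked cross-entropy loss is given below. The paper does not give the implementation, so the names and details are ours; the only assumption is the stated rule that a flag's error is dropped when its parent CU's ground-truth label is non-split.

```python
import torch
import torch.nn.functional as F

def masked_sv_loss(pv, sv):
    """Per-flag binary cross entropy, masked so that a flag's error is
    ignored when its parent CU is labeled non-split.
    pv: (B, 21) predicted probabilities; sv: (B, 21) ground-truth flags."""
    mask = torch.ones_like(sv)
    mask[:, 1:] = sv[:, 0:1]          # f2..f21 count only if f1 = 1
    for parent, start in zip(range(1, 5), (5, 9, 13, 17)):
        # 0-based indices: f6..f9 depend on f2, f10..f13 on f3, ...
        mask[:, start:start + 4] *= sv[:, parent:parent + 1]
    bce = F.binary_cross_entropy(pv, sv, reduction="none")
    return (mask * bce).sum() / mask.sum().clamp(min=1)
```

With this mask, a CTU labeled non-split contributes only the f1 term, exactly as motivated in the text.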

Training Data Preparation.
The training dataset is so important that it can greatly influence model prediction performance. To make sure the transformer models achieve their best performance, we construct a dedicated dataset for each model. In this paper, the transformer model makes a one-shot prediction of the partition structure of the CTU; thus, only one transformer model is needed while a video sequence is being encoded. However, the quantization parameter (QP) is one of the key factors affecting CU splitting results, and the partitioning structure of a CTU differs under different QP values. To make our algorithm more specific and achieve higher prediction accuracy and better encoding performance, we train one dedicated transformer model for each QP. To validate the proposed method, we take four classic QP values in this paper, and a total of four datasets are constructed for QPs 22, 27, 32, and 37, respectively.
Once encoding is completed on the training sequences, we obtain the encoding results of each training sequence under each QP. In particular, the partitioning structures of all CTUs in the training sequences are generated and further converted into ground-truth SVs that serve as target labels in the training dataset for the corresponding QP. As a result, four training datasets are assembled, and each sample of a dataset consists of the luminance pixels and the SV of one CTU. It is worth noting that the only difference among these four datasets is the SVs. Samples of different datasets share the same luminance pixel values since they come from the same training sequences.
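The per-QP assembly can be sketched as follows; `encode_and_dump` is a hypothetical callable standing in for the unmodified HM encoder plus the CTU-dump step described above (given a sequence and a QP, it yields (luma CTU, SV) pairs):

```python
def build_datasets(sequences, encode_and_dump, qps=(22, 27, 32, 37)):
    """Assemble one training dataset per QP. Each sample pairs the
    luminance pixels of a CTU with the ground-truth SV recovered from
    the RDO-chosen partition at that QP."""
    datasets = {qp: [] for qp in qps}
    for qp in qps:
        for seq in sequences:
            datasets[qp].extend(encode_and_dump(seq, qp))
    return datasets
```

Because every QP pass encodes the same sequences, the luma side of the samples is shared across the four datasets and only the SV labels differ, as noted above.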

Postprocessing of the Split Vector.
The PV is first output by our transformer model once inference is completed. Then, the predicted SV of the current CTU is generated from the PV through the conversion process. The SV consists of 21 binary elements, each corresponding to a potential sub-CU of the current CTU. Although the contradictions among SV elements have been mitigated by the well-designed loss function, the loss function alone is not enough to remove all contradictions in the SV, so the SV cannot be used directly as the split flags of the CUs in the CTU. It is necessary to carry out a postprocessing procedure on the SV converted from the PV, and the detailed postprocessing procedure is described in Figure 5.
As we can see from Figure 5, the postprocessing procedure is simple enough to be carried out quickly. To be specific, if SV1 is 0, we set SV2 through SV21 to 0. If SV2 is 0, we set SV6, SV7, SV8, and SV9 to 0. Similarly, if SV3 is 0, we set SV10, SV11, SV12, and SV13 to 0; if SV4 is 0, we set SV14, SV15, SV16, and SV17 to 0; and if SV5 is 0, we set SV18, SV19, SV20, and SV21 to 0. After the adjustment described above, the SV can finally be used in video encoding.
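The procedure of Figure 5 reduces to the root check plus four parent checks; a direct sketch (0-based list indices holding the 1-based flags SV1..SV21):

```python
def postprocess(sv):
    """Enforce quadtree consistency on a 21-flag split vector: if a CU
    is non-split, the flags of its four would-be sub-CUs are forced
    to 0. sv[0] is SV1, sv[1] is SV2, and so on."""
    sv = list(sv)
    if sv[0] == 0:                      # SV1 = 0: the whole CTU is one CU
        return [0] * 21
    for parent, start in ((1, 5), (2, 9), (3, 13), (4, 17)):
        if sv[parent] == 0:             # SV2..SV5 each govern four flags
            sv[start:start + 4] = [0, 0, 0, 0]
    return sv

raw = [1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
print(postprocess(raw))
```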

Experimental Results
In this section, we first analyze the prediction accuracy of the transformer models employed by the proposed approach. The deep learning framework PyTorch is used for training and prediction. Simulations were executed on a Windows 7 64-bit workstation with an NVIDIA RTX 2080s GPU, two Intel(R) Xeon(R) E5-2623 v3 CPUs @ 3.00 GHz, and 64.0 GB of RAM. Then, to verify the effectiveness of the proposed one-shot approach for CTU partitioning, we implemented it on the HEVC reference platform HM16.7. Coding parameters, such as additional coding tools, were set to their defaults. Besides, the all-intra main configuration was adopted to encode all frames of the video sequences in the standard test set. The BD-rate was employed to evaluate the coding performance of the proposed method, and the time saving ratio, denoted by TS, was used to measure the complexity reduction of the encoding algorithms. It is defined as

TS = (time_o − time_p) / time_o × 100%,

where time_o denotes the time spent by the original HM16.7 encoder and time_p is the time spent by the encoder on which the proposed algorithm is implemented. At last, the partition results of original HM16.7 and the proposed approach are compared on one randomly selected frame. The comparison results visually demonstrate the partitioning effect of the proposed one-shot algorithm for CTU partition structure determination.

Splitting Accuracy.
The accuracy of the splitting decisions predicted by the transformer models for CUs of each depth is shown in Table 1. Depth levels in Table 1 represent different CU sizes: the accuracy at depth level 0 is the percentage of correct splitting decisions predicted by the transformer model for CTUs, and the accuracies at depth levels 1 and 2 are the percentages of correct splitting decisions for CUs of size 32 × 32 and 16 × 16, respectively.
As we can see from Table 1, every transformer model achieves its highest accuracy on the splitting prediction at depth level 0, while the accuracies at levels 1 and 2 differ little from each other and are lower than that at level 0 by around 10%. The reason is that splitting prediction for large CUs is easier than for small CUs, since a large CU provides more information for prediction and its texture is more obvious. Another phenomenon is that the prediction accuracy of the transformer models at depth level 0 decreases as the QP increases, while the accuracies at depth levels 1 and 2 increase as the QP increases. This means that the transformer employed by the proposed approach is good at predicting the splitting results of large CUs when the QP is small, while it predicts the splitting results of small CUs more easily under large QP values. These properties provide guidance for the industrial application of our method.

Inference Time Overhead.
The proposed approach aims to reduce the encoding time of HEVC by employing transformer models. However, the inference time of the transformers must be considered due to their parameter scale. During the encoding process, the inference time is included for a fair performance evaluation. Besides, we calculate the percentage of inference time relative to encoding time under different QPs, and the results are shown in Table 2. As we can see from Table 2, the inference overhead of the transformers takes a quite low percentage of the encoding time when the proposed method is used, because we use GPUs to perform the transformer inference.

Encoding Performance.
The last two columns in Table 3 show the encoding performance of the proposed method. Compared to original HM16.7, 58.77% of the encoding time is saved on average. The sequences with the best results both contain fast-moving objects, which indicates that our fast approach is well suited to industrial applications working in scenes involving a lot of movement.
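The time saving values reported in this section follow the TS definition given in the experiment setup; as a trivial helper:

```python
def time_saving(time_o: float, time_p: float) -> float:
    """TS (%): relative encoding-time reduction of the proposed encoder
    (time_p) versus the original HM16.7 encoder (time_o)."""
    return (time_o - time_p) / time_o * 100.0

print(round(time_saving(100.0, 41.23), 2))  # -> 58.77
```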
To further analyze the coding performance of the proposed algorithm, we compare it with three recent studies: the convolutional neural network-based fast algorithm (CNNFA) proposed by Liu et al. [18], the fast intra-CU splitting algorithm (FICUSA) proposed by Zhong et al. [30], and the bagged tree-based fast algorithm (BTFA) proposed by Li et al. [31]. CNNFA, FICUSA, and BTFA are all effective schemes for intra-CU size decision in HEVC, and their performance on the standard test sequences is listed in Table 3. As we can see, FICUSA incurs 0.97% less BD-rate loss than ours, but the time saving of our method is higher by about 18%. As for CNNFA, though its time saving is slightly higher than ours by 2.04%, its BD-rate loss is also higher by 2.11%. Moreover, compared with BTFA, our approach performs better in terms of both BD-rate and time saving.
Moreover, we also compared the proposed approach with two other fast partition approaches: the effective CU size decision approach (ESDA) [22] and the bagged tree and ResNet joint fast approach (BTRNFA) [32]. Table 4 shows the encoding performance of ESDA and BTRNFA on different sequence resolution classes. As we can see from Table 4, the proposed approach outperforms ESDA in terms of time saving, while it causes more BD-rate loss. Compared with BTRNFA, the proposed approach incurs 2.85% more BD-rate loss with almost the same time saving. Though the proposed approach does not beat BTRNFA, it does not require feature extraction and finishes partition prediction with as few as one transformer model. Tables 3 and 4 show that the performance of the proposed transformer-based fast approach for CTU partitioning is satisfactory and competitive and that it has good capacity for practical industrial applications according to its comprehensive performance across various coding scenarios.

Partition Comparison.
To visualize the encoding performance of the proposed fast approach for HEVC intracoding, we give the partition results predicted by our algorithm on the 200th frame of the sequence BasketballPass (416 × 240) under QP 22. Black, red, and green lines in Figure 8 represent the borders of CUs used for final encoding. To verify the correctness of CU splitting, we compare the partitioning results of our approach and original HM16.7, and the differences are shown with red and green lines in Figure 8.
The black lines indicate where our algorithm and original HM16.7 share the same partition result. The green lines represent the boundaries of CUs that are split by original HM16.7 but not by our approach. The boundaries of CUs decided non-split by original HM16.7 but split by our approach are shown with red lines in Figure 8. As we can see from Figure 8, the partition results of the proposed algorithm are consistent with those of original HM16.7 for most CUs. The splitting results of CUs located on the main subjects of the frame are almost the same, while differences mainly exist in the background region. Compared with original HM16.7, the proposed algorithm is more likely to split a CU: red lines outnumber green lines, which indicates that over-splitting errors are far more frequent than non-splitting errors in the proposed approach.

Conclusion
In this paper, we focus on speeding up the video encoding process for industrial applications. To this end, we propose a transformer-based fast CTU partitioning algorithm for HEVC intracoding. We convert the CTU partitioning structure into a split vector and employ transformer models to predict it for the target CTU during encoding. Compared with the original HM16.7 encoder, our approach reduces encoding time by 58.77% on average while sacrificing negligible rate-distortion performance on the selected video sequences. Intensive analysis and experiments show that the proposed solution has great capacity for industrial applications, especially for scenes involving a lot of movement.

Data Availability
Data used and analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.