Rendered Image Superresolution Reconstruction with Multichannel Feature Network

In the process of lm and television production, clear images can give the audience a real sensory experience, but high-resolution images require a massive amount of production time and highly specialized imaging equipment, which is not a cost-eective solution at the moment. To achieve a better cost eciency during video production, we propose a multichannel featured superresolution network model that utilizes rendered low-resolution images according to their characteristics. is model includes a feature extraction layer, a series of subnetworks, and a reconstruction module. Inside the network model, a series of subnetworks are cascaded to improve the information ow from coarse to ne, which helps to fully extract the depth, normal vector, edge, and texture features from low-resolution rendered images to reconstruct the high-resolution image. Additionally, residual learning is introduced at each stage to further improve the reconstruction performance. We experiment with the model on the classic Disney Monte Carlo datasets and compare it with several related algorithms. e results show that our algorithm is able to reconstruct the image with clearer details and texture. us, our research not only helps to maintain the audience’s sensory experience but also increases the eciency of lm and television production, which also brings considerable economic benets.


Introduction
Image resolution is a set of performance parameters used to evaluate the richness of detailed information contained in an image. It includes temporal resolution, spatial resolution, and color-scale resolution, which together indicate the ability of an imaging system to reflect the detailed information of an object. Compared with low-resolution (LR) images, high-resolution (HR) images usually contain higher pixel density, richer texture details, and higher fidelity. As technologies advance, modern film and television productions place high demands on visual effect and picture quality. However, achieving high-definition picture quality inevitably requires a long image rendering time, and the production cost increases accordingly. How to reduce rendering time and production cost without affecting the visual experience of the audience is therefore a vital research direction. More practically, it is important to maintain an ideal balance between rendering time and the visual quality of an image so that the production cost can be kept to a minimum; thus, how to improve image resolution efficiently and quickly is the key to solving the above problems.
Physically, solving the low-resolution problem of images is often too costly. Constrained by many factors such as the processing power of the hardware, the format type of the image, the network bandwidth, and the image degradation model itself, it is hard to obtain an ideal high-resolution image without applying edge sharpening and block deblurring. In many cases, animations and effect maps are drawn with cartographic software to restore details. Similarly, it is entirely possible to recover the details of high-resolution (HR) images from low-resolution (LR) images by using various types of algorithms. Early methods include interpolation, such as bilinear interpolation and Lanczos resampling [1]. More powerful methods using statistical image priors [2,3], internal patch recurrence [4], probabilistic graphical models [5], neighborhood embedding [6,7], sparse coding [8], and linear or nonlinear regression have also been used to convert LR images back into HR images [9][10][11]. Methods that use deep learning to improve image resolution have developed rapidly in recent years, and various superresolution network algorithms continue to be invented to restore images more efficiently. In 2014, Dong et al. [12] first applied a convolutional neural network to the image superresolution task and proposed a superresolution convolutional neural network model (SRCNN) to predict the nonlinear relationship between interpolated images and high-resolution images. The model was trained end to end on image pairs, which greatly improved the reconstruction effect at that time and thus proved the excellent performance of deep learning in the field of image superresolution. Building on that, Dong et al. [13] improved the SRCNN model by enlarging the image with a deconvolution layer to reconstruct the high-resolution image (FSRCNN). In 2016, Kim et al. proposed two very deep convolutional neural networks, the VDSR [14] and DRCN [15] models, to extract deep image features and to accelerate the network's convergence by introducing the residual idea; this proved that a deep network can extract more features and achieve a better reconstruction effect. Tai et al. [16] constructed a persistent memory network for image restoration (MemNet), built from densely connected memory blocks. In 2021, Hu et al. [17] designed a network that extracts more detailed image features and cross-fuses them into HR images (MSICF).
Although these models have made significant progress, they are mainly aimed at the restoration of natural images. There is a lack of research on the superresolution reconstruction of rendered images with multichannel features in the field of video production, and this has created several bottlenecks. For example, a single-channel network cannot extract the main features, such as the depth and the normal vector, of rendered images. Furthermore, the information obtained from different channels is not used sufficiently, which can result in insufficient detail in the reconstructed images. Given the above situation, we propose a superresolution reconstruction method for rendered images that produces more realistic results. The main contributions of this paper are summarized as follows: (1) We propose a multichannel featured superresolution reconstruction algorithm for rendered images, which is able to obtain multidimensional information from different channels, such as depth, normal vector, edges, and texture, in the rendered image. After the multichannel feature information passes through the network model, it is fused into three RGB color channels. At the same time, global and local residual learning are introduced to prevent vanishing or exploding gradients, accelerate the convergence speed, and obtain a considerable reconstruction effect.
(2) The network model includes a feature extraction layer, a series of subnetworks, and a reconstruction module. The extraction layer adopts double filters, which can retain high-frequency information such as image edges and texture details. The subnetworks are cascaded to improve the information flow from coarse to fine, and the reconstruction layer adopts weighting to improve the reconstruction accuracy. Compared with several other advanced algorithms, experiments show that the algorithm in this paper is more effective in restoring image detail.
(3) In the process of video production, our method first renders the low-resolution image and then applies the proposed superresolution network algorithm to reconstruct the high-resolution image.
In this way, we not only maintain an excellent level of sensory experience for the audience but also reduce film and television production time and cost, achieving increased economic benefits. As a result, this research brings distinct economic and practical value to the video industry.

The Proposed Method
We propose a multichannel featured superresolution reconstruction algorithm for rendered images, as shown in Figure 1. It includes a feature extraction layer, a series of subnetworks, and a reconstruction module. The input image enters the subnetworks after convolution in the first layer, and the information is passed to the reconstruction layer after flowing through each subnetwork. At the same time, global residual learning is introduced over the whole network. Finally, the reconstruction layer carries out a weighted fusion to reconstruct the three-channel high-resolution image.

The Proposed Network.
The extraction layer extracts the features of the original low-resolution image. The extraction layer of our design is composed of 64 filters with a convolution kernel size of 3 × 3. We define the convolution layer as S_0 = f_0(x), where x is the input low-resolution image, f_0 is the feature extraction function, and S_0 is the output after feature extraction. In the subnetwork shown in Figure 2, according to the multichannel features of the rendered image, we design a multifeature cross-module to extract the multiple channels and fuse the feature information, and we then stack several submodules to learn the residual information between the input and output features. More specifically, the feature information of the 10 channels enters the network, and the output of the features extracted by convolution is used as the input of the next subnetwork. The convolution operations are carried out in sequence. Once the information has passed through all subnetworks, all types of features, such as texture, depth, and normal vector, can be fully extracted. At the same time, local residual learning is introduced to provide supplemental image feature data, which improves the accuracy and efficiency of the feature learning process. The first convolution layer adopts 64 filters, which are superimposed and fused after entering the subnetwork; it is followed by layers that adopt 128, 192, and 256 filters, all with a size of 3 × 3.
In Figure 2, the subnetwork can be represented by equation (1). We define f_q(·) as the feature extraction function of the q-th subnetwork and S_{q-1} as its input, where M is the number of subnetworks. The modules of the q-th subnetwork are denoted G_q^1(·), ..., G_q^n(·), with n being the number of modules in a subnetwork, and local residual learning is introduced between G(·) and G_Last(·) so that the subnetwork can obtain different image features. With S_0 denoting the output of the feature extraction layer (the input of the first subnetwork) and ξ_q (q = 1, 2, 3, ..., M) denoting the expression of the q-th subnetwork, this gives

S_q = ξ_q(S_{q-1}) = G_Last(G_q^n(... G_q^1(S_{q-1}) ...)) + S_{q-1}, q = 1, 2, ..., M, (1)

where S_q is the output of the q-th subnetwork.
In the training process, to make the prediction closer to the real data, all predictions are cascaded from the subnetworks to the final output of the whole network. We define R_q(·) as the reconstruction layer, S_q as the output of the q-th subnetwork, and y as the output of the whole network. Concurrently, we use global residual learning to improve the reconstruction accuracy. In this way, all intermediate predictions are further convolved, and the final output is obtained by weighting them at the reconstruction layer, as expressed in equation (2):

y = Σ_{q=1}^{M} w_q · R_q(S_q) + x, (2)

where w_q is the fusion weight assigned to the q-th intermediate prediction and x is the interpolated low-resolution input added through the global residual connection.
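To make the described data flow concrete, the following is a minimal PyTorch sketch of the pipeline, not the authors' exact implementation. Assumptions: a uniform width of 64 filters inside each subnetwork (the paper's subnetworks internally use 64, 128, 192, and 256 filters), LeakyReLU activations with slope 0.2 as stated in the training section, and that the global residual reuses the interpolated RGB channels of the 10-channel input.

```python
# Minimal sketch of the multichannel featured SR network (assumptions noted above).
import torch
import torch.nn as nn

class SubNetwork(nn.Module):
    """A cascaded subnetwork: several 3x3 convolutions plus a local residual."""
    def __init__(self, channels=64, num_convs=10, negative_slope=0.2):
        super().__init__()
        layers = []
        for _ in range(num_convs):
            layers += [nn.Conv2d(channels, channels, 3, padding=1),
                       nn.LeakyReLU(negative_slope, inplace=True)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x) + x  # local residual learning

class MultiChannelSR(nn.Module):
    def __init__(self, in_channels=10, channels=64, num_subnets=3):
        super().__init__()
        self.extract = nn.Conv2d(in_channels, channels, 3, padding=1)  # feature extraction layer
        self.subnets = nn.ModuleList([SubNetwork(channels) for _ in range(num_subnets)])
        self.reconstruct = nn.ModuleList([nn.Conv2d(channels, 3, 3, padding=1)
                                          for _ in range(num_subnets)])
        self.weights = nn.Parameter(torch.full((num_subnets,), 1.0 / num_subnets))

    def forward(self, x):
        rgb = x[:, :3]                   # interpolated LR RGB, reused as global residual (assumption)
        s = self.extract(x)
        outputs = []
        for subnet, rec in zip(self.subnets, self.reconstruct):
            s = subnet(s)
            outputs.append(rec(s))       # intermediate prediction from each subnetwork
        y = sum(w * o for w, o in zip(self.weights, outputs))  # weighted fusion, equation (2)
        return y + rgb                   # global residual learning
```

The learnable fusion weights correspond to the weighted reconstruction in equation (2); each subnetwork output contributes one intermediate prediction that is convolved to three RGB channels before fusion.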

Preliminary Work.
For the preliminary data preparation, we select 200 HR images from the classic Disney Monte Carlo rendered image datasets, each of which contains ten channels of information: "R," "G," "B," "normal.R," "normal.G," "normal.B," "albedo.R," "albedo.G," "albedo.B," and "depth.Z." The letters "R," "G," and "B" represent red, green, and blue; "normal.R," "normal.G," and "normal.B" represent the three components of the normal vector perpendicular to the surface; "albedo.R," "albedo.G," and "albedo.B" represent the albedo textures; and "depth.Z" represents the depth of the image. We expand the training data by performing horizontal and vertical flipping operations on these images, downsample them by factors of 2, 3, and 4, and then restore them to their original size with a bicubic interpolation algorithm. Finally, the interpolated LR images and the original HR images are divided into 64 × 64 fragments, which are stored in pairs in a training data file.
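The following is a hedged data-preparation sketch of the steps just described. Assumptions: Pillow and NumPy are used, channels are stored as uint8 arrays for the bicubic round-trip, and the file layout is hypothetical; the paper does not specify its tooling.

```python
# Data-preparation sketch: augment, downsample/upsample by 2/3/4 with bicubic
# interpolation, and cut aligned 64x64 LR/HR patch pairs (assumptions noted above).
import numpy as np
from PIL import Image

PATCH = 64
SCALES = (2, 3, 4)

def augment(hr_array):
    """Original image plus horizontal and vertical flips, as in the paper."""
    yield hr_array
    yield hr_array[:, ::-1]   # horizontal flip
    yield hr_array[::-1, :]   # vertical flip

def make_pairs(hr_array):
    """Build interpolated-LR / HR patch pairs from one HR channel stack (H, W, C)."""
    h, w, num_channels = hr_array.shape
    pairs = []
    for s in SCALES:
        lr_channels = []
        for c in range(num_channels):
            img = Image.fromarray(hr_array[:, :, c])
            small = img.resize((w // s, h // s), Image.BICUBIC)                 # downsample by s
            lr_channels.append(np.asarray(small.resize((w, h), Image.BICUBIC))) # restore size
        lr = np.stack(lr_channels, axis=-1)
        for i in range(0, h - PATCH + 1, PATCH):
            for j in range(0, w - PATCH + 1, PATCH):
                pairs.append((lr[i:i+PATCH, j:j+PATCH], hr_array[i:i+PATCH, j:j+PATCH]))
    return pairs
```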

Residual Learning.
In previous neural network models, each layer learns a mapping y = F(x), where x is the input, y is the output, and F(x) is the mapping function. The error in y tends to accumulate as the layers deepen, and the gradient becomes more divergent during backpropagation as the error grows. If, instead, we add the original input at the end of each layer, the output becomes F(x) + x, and the mapping of each layer becomes y = F(x) + x. The value of F(x) then tends to be small and negligible compared to the size of x. In this way, even if the number of layers continues to increase, the error F(x) is still kept very small, as shown in Figure 3. Generally speaking, residual learning includes global residual learning and local residual learning, and it has been widely used in deep learning [18][19][20]. More particularly, in this paper our network model introduces global residual learning into the overall network and local residual learning into the subnetworks to ensure that training remains stable. Moreover, we also verify the effectiveness of residual learning in the subsequent training. The depth of the whole model can be calculated by equation (3):

D = (3k + 1) · M + 1, (3)

where k is the number of feature extraction operations in each subnetwork and M is the number of subnetworks. Inside the brackets on the right-hand side, the multiplication by 3 means that there are three convolution layers in one extraction operation, and the addition of 1 means that each subnetwork contains one further convolution layer; the 1 added at the end of the right-hand side represents the last convolution layer in the reconstruction network.
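As a worked check of the depth formula, the short snippet below evaluates the reconstructed expression D = (3k + 1) · M + 1 for the configuration used later in training (k = 3 extraction operations per subnetwork, M = 3 subnetworks); the variable names are introduced here for illustration and are not from the original text.

```python
# Worked arithmetic check of the reconstructed depth formula (3).
def network_depth(k: int, m: int) -> int:
    """3 conv layers per extraction op, 1 fusion conv per subnetwork, 1 final conv."""
    return (3 * k + 1) * m + 1

print(network_depth(k=3, m=3))  # -> 31, matching the depth chosen in the training section
```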

Model Training
The stored training data are divided into batches of 64 groups, and each batch contains 64 × 64 segmented images organized into 10 different channels. Once these images are prepared, they are fed into the network as the blurred (interpolated LR) input for training. The deep convolution network extracts the features of each channel and generates a three-channel target image. By calculating the loss function between the target image and the real image and then updating the weight parameters through backpropagation, we obtain an optimized network model. Through training, we found that deepening the convolution layers of a subnetwork leads to vanishing gradients, so we set 10 convolution layers for each subnetwork. When the number of subnetworks is 3, the loss function decreases to an ideal level, so the depth of the whole network is set to 31.
We compared the training results with residual learning against those without it and found that the network with residual learning converges faster and achieves higher accuracy. As shown in Table 1, over 100 iterations the loss function converges to a clearly observable degree by the 40th epoch.
Our network is very deep and uses 64 × 64 training patches, which are larger than the receptive fields of general algorithms. Therefore, we use a stochastic gradient descent (SGD) solver and pad zeros before each convolution to keep all feature maps the same size for pixels near the image boundary [21]. We also train the network model at different scales, using 2x, 3x, and 4x images, and combine the LR and HR pairs of all scales into our training datasets. We set the momentum to 0.9 and the weight decay to 10^-4 [22,23]. The initial learning rate is set to 0.1 and is then reduced by a factor of 10 every 25 epochs. To deal with vanishing and exploding gradients, we clip the gradients to a designed range [−θ, θ], where θ is set to 0.4 [24][25][26]. Each convolution layer is followed by standard regularization and activation operations, and the negative slope of the activation layer is 0.2 [27,28]. The training code of our network is shown in Table 2. After training for 100 generations, we select the 83rd-generation model for the follow-up experiments and evaluation according to the behavior of the loss function. We then select 15, 25, 50, and 100 rendered images from the classic Disney Monte Carlo datasets as our test data. The test data are downsampled in Python by factors of 2, 3, and 4 and then magnified back to the original size by bicubic interpolation. Finally, the images are classified and stored in four folders named car, house, classroom, and bathroom, which completes the preparation of the experimental data.
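The training code itself is referenced as Table 2 and is not reproduced here; the following is a hedged PyTorch sketch wired to the hyperparameters stated above. Assumptions: an L2 loss between the predicted and real images (the paper does not name the loss), the MultiChannelSR model sketched earlier, and a placeholder `loader` standing in for the real 64 × 64 patch loader.

```python
# Training-loop sketch matching the stated hyperparameters (assumptions noted above).
import torch
import torch.nn as nn

model = MultiChannelSR()                       # from the earlier architecture sketch
criterion = nn.MSELoss()                       # assumed loss between target and real image
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=25, gamma=0.1)

CLIP = 0.4                                     # gradient clipping range [-0.4, 0.4]

# Placeholder batch; replace with the real loader of (interpolated LR, HR) patch pairs.
loader = [(torch.rand(64, 10, 64, 64), torch.rand(64, 3, 64, 64))]

for epoch in range(100):
    for lr_patches, hr_patches in loader:
        optimizer.zero_grad()
        pred = model(lr_patches)               # 10-channel input -> 3-channel RGB output
        loss = criterion(pred, hr_patches)
        loss.backward()
        torch.nn.utils.clip_grad_value_(model.parameters(), CLIP)
        optimizer.step()
    scheduler.step()                           # learning rate /10 every 25 epochs
```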

Evaluation and Comparison.
We take the peak signal-to-noise ratio (PSNR) [29], the structural similarity (SSIM) [30], and the execution time (ET) as the evaluation indexes of this experiment. The shorter the execution time (ET) [31], the faster the algorithm runs. The mathematical expression of the PSNR is shown in (5) and (6):

PSNR = 10 · log_10(255^2 / MSE), (5)

MSE = (1 / (m · n)) Σ_{i=1}^{m} Σ_{j=1}^{n} [X_ori(i, j) − X_res(i, j)]^2, (6)

where X_ori and X_res represent the original image and the reconstructed image, respectively, MSE is the mean squared error between X_ori and X_res, and m and n represent the numbers of rows and columns of the image. The unit of the PSNR is dB. The larger the PSNR, the closer the reconstructed image is to the ground-truth image.
The mathematical expression of the SSIM is shown in (7):

SSIM(X_ori, X_res) = ((2 μ_ori μ_res + c_1)(2 σ_{ori,res} + c_2)) / ((μ_ori^2 + μ_res^2 + c_1)(σ_ori^2 + σ_res^2 + c_2)), (7)

where μ and σ^2 denote the mean and variance of the corresponding image, σ_{ori,res} is their covariance, and c_1 and c_2 are small constants that stabilize the division. The SSIM evaluates image quality by comparing the similarity of the structural information between images. A value of the SSIM close to 1 means that the reconstructed image is close to the original image.
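For reproducibility, the two metrics can be computed as in the following sketch. Assumptions: NumPy and scikit-image are available and the inputs are uint8 luminance arrays, matching the evaluation procedure described next; this is an illustrative helper, not the authors' evaluation script.

```python
# Evaluation sketch for PSNR (dB) and SSIM between ground truth and reconstruction.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(original: np.ndarray, reconstructed: np.ndarray):
    """Return (PSNR in dB, SSIM) for two single-channel uint8 images."""
    psnr = peak_signal_noise_ratio(original, reconstructed, data_range=255)
    ssim = structural_similarity(original, reconstructed, data_range=255)
    return psnr, ssim

def psnr_from_mse(original, reconstructed):
    """Equivalent PSNR computed directly from the definition in (5) and (6)."""
    mse = np.mean((original.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    return 10 * np.log10(255.0 ** 2 / mse)
```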
We select this training model for experiments on the car, house, classroom, and bathroom test sets. After inputting the experimental images into our network model and obtaining the three-channel RGB superresolution outputs, a color conversion is carried out, and the brightness (luminance) values of the images are extracted as the reference for calculating the evaluation indexes. On the car dataset amplified by 2, 3, and 4 times, the average PSNR values are 37.99 dB, 34.97 dB, and 32.83 dB, and the average SSIM values are 0.9762, 0.9421, and 0.9179. To verify the effectiveness of the proposed method, the same experimental data are also input into six other advanced superresolution algorithms, the SRCNN, the FSRCNN, the VDSR, the DRCN, the MemNet, and the MSICF, for direct comparison of results.
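The color-conversion step can be made concrete with a small helper. This is a hedged sketch assuming the commonly used ITU-R BT.601 luminance weights; the paper only states that brightness values are extracted as the reference and does not name the conversion.

```python
# Hypothetical luminance extraction for evaluation (assumption: BT.601 weights).
import numpy as np

def rgb_to_luminance(rgb: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3) RGB image to its luminance (Y) channel."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb.astype(np.float64) @ weights
```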
As shown in Table 3, the PSNR values of the algorithm proposed in this paper exceed those of the above reconstruction algorithms across the different combinations of magnification and test set.
The experimental results show that our algorithm outperforms the six other advanced superresolution algorithms. On the house test set amplified twice, the PSNR value produced by our algorithm is 1.21 dB higher than that of the MSICF and 1.87 dB higher than that of the MemNet. On the bathroom test set, the PSNR value produced by our algorithm is 2.11 dB higher than that of the VDSR and 3.28 dB higher than that of the SRCNN. Table 4 shows that the SSIM values produced by our algorithm are also higher than those of the six other advanced superresolution algorithms across the different magnifications and test sets. With the classroom test set magnified by 4 times, the SSIM value produced by our algorithm is 0.0318 higher than that of the MSICF and 0.0830 higher than that of the MemNet. On the car test set, the SSIM value produced by our algorithm is 0.0321 higher than that of the VDSR and 0.0505 higher than that of the SRCNN. Figure 4 plots the PSNR and SSIM values of the different algorithms on the house and classroom test sets, directly demonstrating the superiority of our algorithm over the others.

In the visual comparisons, our method shows a better visual experience with improved clarity of the background. The gloss and smoothness of the two cylinders are also significantly improved, while the other algorithms still show various degrees of blur or distortion. The experimental results on the "16532963-08192spp" picture in the car test set enlarged by 3 times are shown in Figure 7. The image recovered by our method is closer to the real image: the outline of the car rearview mirror is relatively complete and smooth, and the lines, edges, and corners of the car frame are clearly visible. The recovery effect of the VDSR algorithm is obviously blurred, and the images recovered by the MemNet and the MSICF algorithms are too stiff in terms of contours and lines. The effect of the experiment on the "98036695-08192spp" picture in the 4x enlarged classroom test set is shown in Figure 8. The effect of the experiment on the "32481698-08192spp" picture in the 4x enlarged house test set is shown in Figure 9. Most algorithms, such as the FSRCNN and the DRCN, do not restore enough clarity to the lines and gaps of the fence, resulting in distortion and deformation. Our method gives a clear view of the number and spacing of the fence bars, and thus our restoration of edges, corners, and cylinders is obviously closer to the real image.

Conclusion
In this paper, the main contributions are as follows: (1) Through the above experiments, it is shown that our method can extract more features of the image and reconstruct the image with clearer details and texture than other related algorithms. (2) Film and television producers can save rendering time by rendering at low resolution and then using our proposed model to super-resolve the images afterwards. This provides a cost-effective solution that may increase the profitability of the current video industry. (3) Our method does not degrade the audience's sensory experience when LR images are used, which is a major advantage when the data size is limited. The research content of this paper provides an initial solution to the stated problem with promising results. More specifically, we have achieved the expected research objectives, have reached the goal of creating a cost-effective solution for film and television production without affecting the audience's sensory experience, and have thus brought distinct economic value to the industry. In the future, we will continue to optimize the parameters of the network model and introduce an attention mechanism into the model to improve the efficiency and recovery effect of the algorithm. At the same time, we will also apply our research results to medical CT imaging and to infrared imaging with remote sensing.
Ethical Approval

The study complies with ethical standards.

Conflicts of Interest
The authors declare that they have no conflicts of interest.