Multiprojector Interreflection Compensation Using a Deep Convolution Network

The aim of multiprojector interreflection compensation is to modify input images to remove complex physical stray-light effects (interreflection) from a multiprojector immersive system. This is an important but often ignored problem, which can lead to degradation of a projection image. Traditional methods usually address this problem by computing a matrix inversion, and they often ignore the clarity of the generated images. In this paper, we describe a method for learning the inversion using a deep convolutional neural network (CNN), named Superresolution Compensation Net (SRCN). SRCN consists of four convolution layers to learn interactions of global light, and six convolution layers and two transposed convolution layers to extract multilevel features and generate compensation images. We also used a subpixel convolution layer to increase the resolution. To make compensation images more consistent with human visual perception, we used a perceptual loss, which compares the differences between feature maps on the VGG16 network. We implemented an immersive projector-camera display prototype (Pro-Cam) and calculated the quality index of the compensation images and the projection results. Our method achieved better results than previous methods in both objective evaluations and subjective visual perception.


Introduction
Multiprojector systems are used in virtual reality (VR) systems, exhibitions, and tower simulators. For these applications, a good projection effect is crucial, since it gives viewers an immersive and highly realistic visual experience. However, the imaging effect of multiprojector systems is usually not ideal due to multiple factors, such as noisy points and interreflection. Approaches to the multiprojector noise problem are relatively mature, and some researchers [1,2] have proposed techniques to solve it, but the interreflection problem is very common and easily ignored, and there is still considerable scope for the development of effective solutions. When interreflection is serious, such as when there are too many projectors, folded projection surfaces, or curved projection screens, the displayed image is a mixture of the light from the projector and the interfering light of superimposed reflections, which leads to poor image quality. The contrast of the projected images is low, which disturbs user immersion and becomes an important factor hampering the popularization, application, and development of these systems. For example, this phenomenon has already prevented an aerospace industry tower simulator from being put into practical teaching use at a university. Interreflection is a serious problem that prevents multiprojection systems from meeting application requirements and therefore needs to be solved. Methods for multiprojector interreflection compensation can help multiprojector systems produce an optimal immersive visual experience. Conventional methods require considerable optical knowledge, and they are often ineffective in dealing with the interreflection problem in large, complex systems. Therefore, we aimed to minimize the heavy reliance on optical knowledge and explore the use of deep learning methods to solve the interreflection problem in sophisticated environments.
Generally, a light transport matrix (LTM) is used to describe the multiplication of the projected incident light and the reflection from the immersive scene [3][4][5][6][7][8], and the interreflection compensation is regarded as a matrix inversion problem. By inverting the LTM, we can calculate the compensated images, so that when they are reprojected, the interreflection is eliminated. The acquisition of an LTM, however, is a laborious process. The available methods require the devices to be radiometrically calibrated and carefully set up. The LTM, whose size is determined by the resolution of the projector-camera system (Pro-Cam), is very large, so it is very difficult to calculate the inversion. To obtain an LTM suitable for matrix inversion, methods are used to downsize or simplify the matrix. The performance of these methods is limited, so they are not practical for use in a huge immersive environment.
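To make the matrix-inversion view concrete, the following is a minimal NumPy sketch on a toy system. The 4 × 4 matrix `T`, its values, and the function name `compensate` are illustrative assumptions, not measurements from a real Pro-Cam; a real LTM has one row per camera pixel and one column per projector pixel, which is what makes the inversion so expensive.

```python
import numpy as np

# Toy light transport matrix (hypothetical values): the diagonal models
# direct light, the off-diagonal entries model interreflection.
T = np.array([
    [1.00, 0.05, 0.02, 0.01],
    [0.05, 1.00, 0.05, 0.02],
    [0.02, 0.05, 1.00, 0.05],
    [0.01, 0.02, 0.05, 1.00],
])

def compensate(T, desired):
    """Solve T @ p = desired for the projector input p."""
    return np.linalg.solve(T, desired)

desired = np.array([0.2, 0.4, 0.6, 0.8])  # what the camera should observe
uncompensated = T @ desired               # observed when projecting `desired` directly
p = compensate(T, desired)                # compensated projector input
captured = T @ p                          # observed when projecting `p`
```

In this toy case `uncompensated` is brighter than `desired` at every pixel, while `captured` matches it. In practice the solution must also stay within the displayable range of the projector, so exact compensation is not always reachable.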
In the image processing field, the inverse problem is common. Assuming that an observed image y represents the output of model T, and x is the input of T, then given output y, calculating the input x is the inverse problem [9]. In recent years, convolutional neural networks (CNNs) have become a popular way to solve the inverse problem [10][11][12][13][14][15] in tasks such as dehazing, style transfer, and image superresolution reconstruction. They have shown outstanding performance on large databases of images.
Interreflection compensation is also an inverse problem. However, in the last decade, few researchers have solved the problem using CNNs. We realized that this was a possible approach to reduce the undesired interreflection without LTM inversion. Compared with traditional methods, a CNN does not require a large amount of knowledge about optics, such as the radiometric precalibration of the Pro-Cam. The method generates a visually optimal compensation image even with slight misalignment.
In this study, we used a deep learning-based image processing method to solve the problem of interreflection in complex immersive projection systems. We built a projector-camera display prototype (Figure 1) and developed a novel neural network, named Superresolution Compensation Net (SRCN), for multiprojector interreflection compensation, to improve the projection performance, as shown in Figure 2. First, we used a geometric correction subnet to autocorrect the sampling images captured by the camera. Second, by connecting several convolution layers, SRCN could be trained to perform matrix inversion to modify the input images. We used a superresolution layer (hereinafter referred to as SR layer) as proposed by Habe et al. [3] to double the resolution of the output images and used a loss calculated from the differences in the outputs of the max-pooling layer of the VGG network to improve the perceptual quality [16]. Finally, we evaluated the proposed network model and compared its performance with that of conventional methods.
Our main contributions are as follows:
(1) We removed multiprojector interreflection using a learning process, greatly improving multiprojector system imaging and simplifying the process of obtaining an LTM and calculating its inverse.
(2) We utilized SR compensation to further improve the definition of compensated images.
(3) We used a perceptual loss with coefficients in addition to pixelwise loss [17], so that the compensated images are more invariant to changes in pixel space [18,19].
(4) We created a dataset in our Pro-Cam environment and made the dataset public.
The rest of this paper is organized as follows. Section 2 reviews and discusses the relevant research. Section 3 describes the deep learning-based multiprojector interreflection model. Section 4 compares the experimental results with those of other methods and presents a self-comparison of variants of our method, and Section 5 presents the conclusions.

Related Work
Two research fields, interreflection compensation and convolutional neural networks, are closely related to our proposed method. We introduce the related fields and discuss the development of our approach in this section.

Interreflection Compensation.
Some techniques have been proposed to remove interreflection by modifying the uncompensated projection images. The method was initially presented by Bimber et al. [4], who divided the uncompensated projection image into small patches; based on the Jacobi iteration, each patch was compensated offline to eliminate the scattered light. Similarly, Bai et al. [7] compute the compensation iteratively. These iterative methods do not require direct LTM inversion, so they are effective when only a single image needs to be compensated. However, for multiple images, the compensated projector images need to be separately and iteratively computed for each input.
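As an illustration of iterative compensation without explicit inversion, the sketch below applies a plain Jacobi iteration to a toy light transport matrix. It is a generic textbook Jacobi solver under assumed values, not the patch-based scheme of [4] or the method of [7]; the matrix `T`, the target `desired`, and the function name `jacobi_compensate` are hypothetical.

```python
import numpy as np

def jacobi_compensate(T, desired, iterations=50):
    """Iteratively solve T @ p = desired without forming the inverse of T."""
    d = np.diag(T)                  # direct-light terms
    R = T - np.diag(d)              # interreflection (off-diagonal) terms
    p = desired / d                 # initial guess ignores interreflection
    for _ in range(iterations):
        p = (desired - R @ p) / d   # Jacobi update
    return p

# Toy, diagonally dominant matrix (hypothetical values), so Jacobi converges.
T = np.array([[1.00, 0.10, 0.05],
              [0.10, 1.00, 0.10],
              [0.05, 0.10, 1.00]])
desired = np.array([0.3, 0.5, 0.7])
p = jacobi_compensate(T, desired)
```

The loop has to be rerun for every new `desired` image, which is the per-input cost that the paragraph above points out.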
Other methods have been proposed to remove interreflection by precorrecting input images. This is a popular approach for reducing undesired light on various types of surfaces. However, such approaches usually need to find the inverse of the LTM. Compared with iterative computation, this approach precomputes the matrix inversion only once, and then any desired image can be compensated using matrix-vector multiplication. This approach was taken by Habe et al. [3] [...] They traverse all the pixels in the projection image, find the pixel with the highest luminance contribution as the center of each patch in turn, and add some adjacent pixels to the patch. These clusters were computed for many small patches. Another approach is to simplify the matrix without changing the size of the LTM. Ding et al. [5] took pairs of white images in their Pro-Cam to acquire the LTM. Thereafter, they used the LTM to construct a matrix that approximates the inversion. Recently, Grundhöfer and Iwai [8] proposed an interpolation based on thin plate splines (TPS) to calculate an accurate color transformation.

Convolutional Neural Networks.
Recently, CNNs have attracted considerable attention. They imitate the biological visual perception mechanism and are widely used to solve the inverse problem in image processing. They can reach stable performance, and there are no additional feature engineering requirements for the data [20,21]. We review the network architecture and loss function in this part.
Among these inverse problems, dehazing is the focus of most papers. One important perspective on these dehazing results is that the CNN learns a mapping between a hazy image and a clear image [12,13]. To improve the image quality and increase the resolution, Ledig et al. [22] proposed a 4× superresolution (SR) architecture constructed by connecting two trained 2× subpixel convolution layers. In [11], the authors propose an encoder-decoder structure named UNet, which keeps the size of feature maps unchanged. In the encoder, it can extract features faster by increasing the number of feature channels. The decoder gradually recovers the edge information of the images. Then, Zhang et al. [18] designed a UNet-like backbone network named CompenNet to remove the projected texture background. Both input and output images were 256 × 256 × 3. In CompenNet, the researchers added a second encoder, identical to the encoder of the backbone network, to learn the global light. Each layer of the two encoders is connected by elementwise addition. Later, they proposed CompenNet++, which concatenates a geometric correction subnet with CompenNet, to realize both geometric correction and projection texture removal.
The choice of the loss function generally defaults to pixelwise approaches such as the l1 loss and the l2 loss (MSE). However, some limitations are apparent when using these pixelwise loss functions in image processing. For example, although the test metric may be high, the image quality is not necessarily good, because human visual perception is not taken into account [18], so a structural similarity index (SSIM) was proposed in [17]. SSIM uses luminance, contrast, and structure to evaluate images, to more closely match the human visual system (HVS). Generally, the results using SSIM are more detailed than those using pixelwise loss. In [19] the authors investigated different loss functions and found that using different losses in combination can obtain better results than using only one loss. Recently, a perceptual loss was proposed in [16], achieved by comparing the loss in a feature map. This approach was extended in SRGAN [22] to enhance the visual quality.
Our method addresses the problem of removing interreflection by precorrecting input images. Inspired by the research into the inverse problem using CNNs, we propose a novel neural network for multiprojector interreflection compensation, instead of computing the LTM inversion. The network can learn the mapping between input images and compensated images, which means that it can learn complex spectral interactions and generate a modified input image. To improve the visual quality, we used SSIM and perceptual loss as well as pixelwise loss.

Problem Formulation.
The purpose of interreflection compensation is to map the input image onto a compensated image. When the image is projected again, the interreflection is reduced or even eliminated. Our research focused on finding an LTM inversion to realize the mapping between the input image and compensated image.
Assuming $x$ is an input image, $f_p$ is the optical transfer function of the two projectors, $s$ and $f_s$ are the surface reflectance property and the surface bidirectional reflectance distribution function (BRDF), respectively, $E$ is the global illumination, and $f_c$ is the camera's composite capturing function, then we can formulate the camera-captured image $y$ as

$$y = f_c\big(f_s(f_p(x), E, s)\big). \tag{1}$$

For simplicity, we can regard the composition of $f_p$, $f_s$, and $f_c$ as $T$, which is actually the interreflection light transport mapping between the projected image and the camera-captured image. Thus, equation (1) can be reformulated as

$$y = T(x, E, s). \tag{2}$$

However, the global illumination $E$ and surface reflectance $s$ are hard to measure without additional spectral devices. Because the multiprojection display prototype is fixed, we can use a camera-captured surface image, $\hat{s}$, to approximate global interactions:

$$\hat{s} = T(x_0, E, s), \tag{3}$$

where $x_0$ is a pure white image whose grayscale is 255. Thus, we can substitute $E$ and $s$ with $\hat{s}$ in equation (2):

$$y = T(x, \hat{s}). \tag{4}$$

Interreflection compensation aims to find a compensated image $x^*$ so that the camera-captured result is the same as the original input image $x$ (ground truth):

$$T(x^*, \hat{s}) = x. \tag{5}$$

Multiprojector interreflection compensation can be formulated as follows:

$$x^* = T^{-1}(x, \hat{s}), \tag{6}$$

where $T^{-1}$ is the inverse of $T$. We used a deep neural network to model it.

Deep Learning-Based Formulation.
From equation (4) we can get

$$x = T^{-1}(y, \hat{s}). \tag{7}$$

Because $\hat{s}$ is known, we can use sampled image pairs $(x, y)$ to learn $T^{-1}$. We model $T^{-1}$ using a deep convolutional neural network named SRCN and denote it as $T_\theta^{-1}$, where $\theta$ is the learning parameter:

$$y^* = T_\theta^{-1}(y, \hat{s}), \tag{8}$$
where $y^*$ is the compensated image of the camera-captured image $y$. Because the projection surface is not planar, $y$ is geometrically distorted, as shown in Figure 1. Generally, $y$ requires manual geometric correction. In this paper, we add a geometric correction subnet $G$ to perform this step automatically. We designed $G$ following [23], which uses a cascaded coarse-to-fine structure to generate a sampling grid Ω, so that camera-captured images $y$ can be corrected geometrically using a single bilinear interpolation ⊕. We train $T_\theta^{-1}$ with $N$ image pairs $\{(x_i, y_i)\}_{i=1}^{N}$. We want $y^*$ to be as close to the ground truth $x$ as possible. So, using a loss function $L$, SRCN can converge by learning as follows:

$$\theta = \arg\min_{\theta} \sum_{i=1}^{N} L(y_i^*, x_i). \tag{9}$$

Network Architecture.
The architecture of SRCN is shown in Figure 3. We used a UNet-like [9] backbone network with several convolution layers to extract features. $\hat{s}$ and $y$ are fed separately to the same four convolutional layers. Then, the multilevel feature maps generated by the convolution of each layer are combined using elementwise addition, which allows learning the complex interreflection information of the immersive environment. To keep the size of the feature maps consistent, the strides of the first two convolutions are set to 2 and those of the last two are set to 1, and the numbers of convolution kernels are {32, 64, 128, 256}. Each convolution is followed by a rectified linear unit (ReLU). In addition, we use three skip convolution connections [24] to enrich the representation ability of the network. Then we use two convolutional layers with stride 1 and padding 1, and two transposed convolutional layers with stride 2 and no padding, to gradually reduce the number of channels of the feature maps. We ultimately use the SR layer [25] to increase the resolution of the input images, which is then followed by a Parametric ReLU [26] as the activation function.
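The SR layer is described in this paper as a subpixel convolution layer. In that kind of layer, a convolution typically produces r² times as many channels as the output needs, and a periodic rearrangement then turns those channels into an image that is r times larger in each dimension. The NumPy sketch below shows only that rearrangement step, under the assumption that the channel ordering follows the common "pixel shuffle" convention; the function name and the toy tensor are illustrative and are not taken from the SRCN implementation.

```python
import numpy as np

def pixel_shuffle(features, r):
    """Rearrange (C*r*r, H, W) feature maps into (C, H*r, W*r)."""
    c_r2, h, w = features.shape
    c = c_r2 // (r * r)
    x = features.reshape(c, r, r, h, w)   # split channels into (c, i, j)
    x = x.transpose(0, 3, 1, 4, 2)        # reorder to (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)     # interleave i with rows, j with columns

# 12 channels at 4 x 4 become 3 channels at 8 x 8 for r = 2.
features = np.arange(12 * 4 * 4, dtype=np.float32).reshape(12, 4, 4)
upscaled = pixel_shuffle(features, r=2)
```

With r = 2, a 256 × 256 feature map would yield a 512 × 512 output, which matches the doubling of resolution described in the text.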
Our overall multiprojector interreflection compensation pipeline consists of three main steps. (1) We first split a plain white image $x_0$ and $N$ sampling images $x_1, x_2, \ldots, x_N$ into two parts and project them using two projectors. With a camera, we capture $y_1, y_2, \ldots, y_N$. Then, we resize the images to 256 × 256 and preprocess them by gamma correction. (2) All of the camera-captured images are input to the geometric correction subnet $G$ and then enter the deep convolution layers to output the compensation images $y_1^*, y_2^*, \ldots, y_N^*$. Because of the superresolution mechanism in SRCN, the resolution of the output is doubled. Finally, using our four loss functions, we can train SRCN to converge. (3) With the converged SRCN, we can input the desired image $x$ and obtain a compensated image $x^*$. If $x^*$ is projected, we find that the result is the same as the ground truth $x$, as Figure 2 shows.
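Step (1) can be pictured with a short NumPy sketch. The text does not say how the image is divided, so the split into left and right halves, the image size, and the function name `split_for_two_projectors` are all assumptions made for illustration.

```python
import numpy as np

def split_for_two_projectors(image):
    """Split an (H, W, 3) image into left and right halves (assumed layout)."""
    _, w, _ = image.shape
    mid = w // 2
    return image[:, :mid], image[:, mid:]

# Plain white image x0 with grayscale 255 (the size is a placeholder).
x0 = np.full((1080, 1920, 3), 255, dtype=np.uint8)
left, right = split_for_two_projectors(x0)
rejoined = np.concatenate([left, right], axis=1)
```

Each half would be sent to one projector; joining the halves again recovers the original image.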

Loss Function.
The loss function in a neural network measures the difference between the predicted value and the true value. In SRCN, we use the loss function below to jointly optimize the color fidelity (pixelwise $l_1$ and $l_2$), structural similarity (SSIM), and perceived similarity (perceptual loss):

$$L = l_1 + l_2 + l_{ssim} + \lambda\, l_{perceptual}, \tag{10}$$

where $\lambda$ is a coefficient that balances the perceptual loss against the other loss functions. In our experiments, we set $\lambda$ to 0.02. All of these loss [...]

Pixelwise Loss.
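The output of the network is compared with the ground truth pixel by pixel. The $l_1$ loss is the simplest:

$$l_1 = \frac{1}{N} \sum_{i=1}^{N} \lVert y_i^* - x_i \rVert_1. \tag{11}$$

The $l_2$ loss, also called the MSE loss, can be computed by

$$l_2 = \frac{1}{N} \sum_{i=1}^{N} \lVert y_i^* - x_i \rVert_2^2. \tag{12}$$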
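A small worked example may help fix the two pixelwise losses. The sketch below averages over all pixels, which is one common normalization and an assumption here; the function names and the toy arrays are illustrative only.

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute difference per pixel."""
    return float(np.mean(np.abs(pred - target)))

def l2_loss(pred, target):
    """Mean squared difference per pixel (MSE)."""
    return float(np.mean((pred - target) ** 2))

pred = np.array([[0.0, 0.5], [1.0, 0.25]])     # toy network output
target = np.array([[0.0, 1.0], [0.0, 0.25]])   # toy ground truth
l1_value = l1_loss(pred, target)  # (0 + 0.5 + 1 + 0) / 4 = 0.375
l2_value = l2_loss(pred, target)  # (0 + 0.25 + 1 + 0) / 4 = 0.3125
```

The single large error of 1.0 contributes two thirds of the l1 value but four fifths of the l2 value, which shows how l2 penalizes large deviations more heavily.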

SSIM Loss.
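This approach compares the structural similarity of $y_i^*$ and $x_i$ in three dimensions: luminance $\ell(y_i^*, x_i)$, contrast $c(y_i^*, x_i)$, and structure $s(y_i^*, x_i)$:

$$\mathrm{SSIM}(y_i^*, x_i) = \frac{(2\mu_{y_i^*}\mu_{x_i} + C_1)(2\sigma_{y_i^* x_i} + C_2)}{(\mu_{y_i^*}^2 + \mu_{x_i}^2 + C_1)(\sigma_{y_i^*}^2 + \sigma_{x_i}^2 + C_2)}, \tag{13}$$

where $\mu$ and $\sigma$ are the means and standard deviations, $\sigma_{y_i^* x_i}$ is the covariance, and $C_1$ and $C_2$ are both constants. The SSIM loss can be computed by

$$l_{ssim} = \frac{1}{N} \sum_{i=1}^{N} \big(1 - \mathrm{SSIM}(y_i^*, x_i)\big). \tag{14}$$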
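The sketch below computes a simplified SSIM from global image statistics and turns it into a loss as one minus the index. Practical SSIM implementations usually average the index over local windows, so this single-window version is a deliberate simplification; the constants `k1 = 0.01` and `k2 = 0.03` are the values commonly used in the SSIM literature and are assumptions here, since the text does not state them.

```python
import numpy as np

def ssim_global(a, b, data_range=1.0, k1=0.01, k2=0.03):
    """SSIM computed once over the whole image (no sliding window)."""
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return float(numerator / denominator)

def ssim_loss(a, b):
    """Loss that is 0 for identical images and grows as structure diverges."""
    return 1.0 - ssim_global(a, b)

rng = np.random.default_rng(0)
img = rng.random((16, 16))          # toy image with values in [0, 1)
same = ssim_loss(img, img)          # identical images
inverted = ssim_loss(img, 1 - img)  # structure is reversed
```

Identical images give a loss of essentially zero, while the inverted image, whose structure is anticorrelated with the original, gives a much larger loss.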

Perceptual Loss.
Perceptual loss can bring the high-level information, content, and global structure closer by comparing the features of a generated image with those of the real image. It uses a pretrained 16-layer VGG network $V_k$ [27] to obtain the feature maps of $y_i^*$ and $x_i$, where $V_k$ indicates the feature map obtained by the $k$-th max-pooling layer. Then, it compares the difference between $V_k(y_i^*)$ and $V_k(x_i)$:

$$l_{perceptual} = \sum_{k} \frac{1}{N} \sum_{i=1}^{N} \lVert V_k(y_i^*) - V_k(x_i) \rVert_2^2. \tag{15}$$

Rather than encouraging the pixels of the generated image $y_i^*$ to exactly match the pixels of the input image $x_i$, we instead encourage them to have similar feature representations as computed by the VGG network. Using the total loss above, we achieved comparable performance with better reconstructed fine details and edges.
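The structure of a perceptual loss can be sketched without a deep learning framework by swapping in a stand-in feature extractor. In the code below, `toy_features` simply applies k rounds of 2 × 2 max pooling; it is NOT VGG16 and carries no learned filters, and it is used only so the example runs with NumPy alone. Summing the mean squared feature differences over the selected layers is also an assumption about how the per-layer terms are combined.

```python
import numpy as np

def max_pool2x2(x):
    """2 x 2 max pooling on a 2D array (odd trailing row/column is dropped)."""
    h, w = x.shape
    x = x[: h - h % 2, : w - w % 2]
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def toy_features(image, k):
    """Stand-in for V_k: k rounds of max pooling. Hypothetical, not VGG16."""
    out = image
    for _ in range(k):
        out = max_pool2x2(out)
    return out

def perceptual_loss(pred, target, layers=(3, 5), features=toy_features):
    """Sum over layers of the mean squared difference between feature maps."""
    return float(sum(
        np.mean((features(pred, k) - features(target, k)) ** 2)
        for k in layers
    ))

rng = np.random.default_rng(0)
target = rng.random((64, 64))   # toy ground-truth image
pred = target * 0.5             # toy prediction that is too dark
loss_same = perceptual_loss(target, target)
loss_dark = perceptual_loss(pred, target)
```

In a real setup, `features` would be replaced by a function that returns the activations of a pretrained VGG16 at the chosen max-pooling layers.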

Training Details.
We trained our deep convolutional architecture SRCN for 8 epochs on NVIDIA GEFORCE 1060TI GPUs with a batch size of 8, using 4,200 images for the training set and 72 images for the validation set. By backpropagating the derivative of the loss throughout the network, the network's parameters $\theta$, such as weights and biases, were updated via the Adam optimizer [28] with the following specifications: we set the fixed l2 penalty factor to $10^{-4}$ and started with a learning rate of 0.001. The size of input images was 256 × 256. As mentioned in equation (10), we set $\lambda$ to 0.02 to balance perceptual loss and other loss functions. We used the third and fifth max-pooling layers within the VGG16 network to compute the perceptual loss, so $k = 3, 5$ in equation (15). We also provided a pretrained model to make the method more practical, using 5,000 pairs of sampling images. $\theta$ was initialized by loading the saved weights. At test time, we used SRCN without the geometric correction subnet $G$ to compensate 52 colorful images of 1920 × 1080.
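The update rule behind these settings can be sketched in a few lines of NumPy. The learning rate of 0.001 and the l2 penalty of 10^-4 come from the text; the values `beta1 = 0.9`, `beta2 = 0.999`, and `eps = 1e-8` are the usual Adam defaults and are assumptions, as is folding the penalty into the gradient. The quadratic toy loss stands in for the SRCN loss.

```python
import numpy as np

def adam_step(theta, grad, state, lr=1e-3, weight_decay=1e-4,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with an l2 penalty added to the gradient."""
    grad = grad + weight_decay * theta
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])   # bias-corrected mean
    v_hat = state["v"] / (1 - beta2 ** state["t"])   # bias-corrected variance
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)

theta = np.array([1.0, -2.0])    # toy parameters
target = np.array([0.5, 0.5])    # minimizer of the toy loss
state = {"t": 0, "m": np.zeros_like(theta), "v": np.zeros_like(theta)}
initial_error = float(np.sum((theta - target) ** 2))
for _ in range(3000):
    grad = 2 * (theta - target)  # gradient of the toy quadratic loss
    theta = adam_step(theta, grad, state)
final_error = float(np.sum((theta - target) ** 2))
```

Because Adam normalizes the step by the running gradient magnitude, each parameter moves by roughly the learning rate per iteration early in training, regardless of the raw gradient scale.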

Experiments
In this work, we created a dataset using our Pro-Cam. We conducted the experiment under a multiprojector immersive environment and compared the results with different values of gamma correction in the data preprocessing phase. We compared the experimental results from SRCN with those from other methods, from objective and subjective aspects.

Projection Display Prototype and Dataset.
Our immersive multiprojector-camera system consists of a Nikon DX VR camera with a resolution of 2992 × 2000 and two JMGO G7 projectors with a projection resolution of 1920 × 1080. We set the distance between the camera and the two projectors to 300 mm.
The L-shaped screen was located approximately 550 mm in front of the projectors. The camera's white balance mode, shutter speed, ISO, and focus were set to Auto, 1/90, 200, and f = 5.6, respectively. To simulate a real immersive projection system, we captured the pictures in the dark to exclude the influence of global lighting.
To ensure the dataset was as diverse as possible, we projected 5,000 colorful images of 1920 × 1080 crawled from several free picture websites and obtained N = 5000 camera-captured images automatically by setting the camera mode to interval shooting. Then, we resized the images to 256 × 256 and preprocessed them using gamma correction (Figure 4).

Objective Evaluations.
We used objective evaluations to compare the quality of the compensation images. We compared our SRCN model with other methods, including CompenNet [13], CompenNet++ [29], and TPS [8].
To compare the compensation effect on different kinds of images, we divided the test images into four groups of 13 images each. In the first group, the images were bright, and some areas were even overexposed. In the second group, the images were dark. In the third group, the images were multicolored. In the last group, the images were solid color. We named them Group_Bright, Group_Dark, Group_Multicolor, and Group_Solidcolor, respectively. The projection results are shown in Figure 5. We compared the four groups of projection results on PSNR, SSIM, and RMSE. Table 1 shows the specific evaluation values for the four groups of images. We can conclude that SRCN generally produced better compensation image quality.
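For reference, PSNR and RMSE can be computed as in the following NumPy sketch. It assumes images stored as floating-point arrays scaled to [0, 1]; the function names and the toy arrays are illustrative and do not reproduce any value from Table 1.

```python
import numpy as np

def rmse(a, b):
    """Root-mean-square error between two images."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

def psnr(a, b, data_range=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / mse))

reference = np.zeros((4, 4))       # toy ground truth
projected = np.full((4, 4), 0.1)   # toy result with a constant error of 0.1
rmse_value = rmse(reference, projected)   # 0.1
psnr_value = psnr(reference, projected)   # 10 * log10(1 / 0.01) = 20 dB
```

Lower RMSE and higher PSNR both indicate a result closer to the reference. For 8-bit images, cast to float first and pass `data_range=255.0`, otherwise the subtraction can wrap around.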

Subjective Evaluations
(1) Compensation Image. We used subjective evaluations to compare the clarity of the compensation images. We utilized the SR mechanism to increase the resolution of the input images. We found that the compensation image was clearer than those of other methods. As shown in Figure 6, we observed that using SRCN yielded better texture detail. The cat's eyes are clearer in our images than in those produced by CompenNet and TPS. CompenNet++ takes geometric correction into consideration, so the compensation result is a little distorted. When it is projected, the image will be distorted back. In this work, we were solely concerned with reducing interreflection. The resolution of the other three methods depends on the training dataset. If we want to obtain a 1920 × 1080 compensation image, we must use thousands of 1920 × 1080 images to train the model. This approach requires more computer memory and a longer training time than ours.
(2) Projection Results. Our multiprojector display prototype was designed for an immersive viewing experience, so subjective assessment with respect to the human visual system is very important. We invited 25 raters to evaluate the quality of the compensation images in a mean opinion score (MOS) test and asked them to choose a score (1 to 5) for each projection image. A score of 1 indicated that the quality of the projection image was poor, and 5 indicated excellent quality. All raters had normal visual acuity and color vision. We used the ground truth as the reference image. The MOS of each compensation method was used as the final subjective evaluation index, as shown in Table 2.
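The MOS itself is just the arithmetic mean of the raters' scores for a method. The sketch below uses made-up ratings for two placeholder methods purely to show the computation; the names `method_a` and `method_b` and all scores are hypothetical and are not the data behind Table 2.

```python
from statistics import mean

def mean_opinion_score(ratings):
    """Average a list of 1-to-5 ratings, rejecting out-of-range scores."""
    if not all(1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be between 1 and 5")
    return mean(ratings)

# Hypothetical ratings from five raters (illustration only).
ratings = {
    "method_a": [4, 5, 4, 3, 5],
    "method_b": [2, 3, 3, 2, 4],
}
mos = {name: mean_opinion_score(scores) for name, scores in ratings.items()}
best = max(mos, key=mos.get)
```

Here `method_a` averages 4.2 and `method_b` averages 2.8, so `method_a` would be ranked higher.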
We computed the heat maps [30] of four projection images. These images represent Group_Bright, Group_Dark, Group_Multicolor, and Group_Solidcolor, respectively. The heat maps represent the luminance of the projection images. The redder the color, the brighter the projection image and the stronger the interreflection. As shown in Figure 5(b), when the image was projected directly onto the L-shaped screen without any processing, the projection image contained some scattered light and became lighter than the original image. CompenNet (Figure 5(e)) reduced the interreflection but also reduced the color. TPS (Figure 5(f)) exhibited color deviation problems when the images were projected. In CompenNet++, the authors reasoned that the surface outside the projector FOV, unlike the surface patches illuminated by the projector, did not provide useful information for compensation, so they cropped the images to achieve better geometric correction. When the size of the compensation image is 256 × 256, the result can approximate the original image. However, in our situation, in which we compensate 1920 × 1080 images, the cropping seriously affects the results (Figure 5(d)). Our method (Figure 5(c)) can reduce interreflection most effectively, while maintaining the color information.

Effectiveness of Gamma Correction.
In the immersive projection system, observation by the human eye is the most important index of image quality. However, the human eye's response to radiation is not a linear function but a curve that is similar to the gamma curve. Generally, our eyes have a greater dynamic range in the shadow than under high illumination, and we are more sensitive to low illumination and less sensitive to bright light. In our research, we took this situation into full consideration and performed a gamma correction on the camera-captured images.
As shown in Figure 7, we compared the different training data sets, with all conditions being the same, other than gamma correction. The values of the gamma correction were 1 (no gamma), 1.5, 1.8, and 2.2. The larger the value of the gamma correction, the darker the compensation image. If there was no gamma correction in the data preprocessing, the interreflection of the compensation image was only slightly reduced. Therefore, we set the value of the gamma correction to 1.5, which produced the closest results to the ground truth.
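The gamma preprocessing step can be sketched as follows. The text does not state which convention was applied, so the mapping `out = in ** (1 / gamma)` on values normalized to [0, 1] is an assumption; the opposite convention, `in ** gamma`, is equally common and would darken rather than brighten the captured images. The function name and the toy pixel values are illustrative.

```python
import numpy as np

def gamma_correct(image_u8, gamma):
    """Apply out = in ** (1 / gamma) to an 8-bit image (assumed convention)."""
    normalized = image_u8.astype(np.float64) / 255.0
    corrected = normalized ** (1.0 / gamma)
    return np.round(corrected * 255.0).astype(np.uint8)

captured = np.array([[0, 64, 255]], dtype=np.uint8)   # toy camera pixels
results = {g: gamma_correct(captured, g) for g in (1.0, 1.5, 1.8, 2.2)}
```

With this convention, black and white stay fixed, a gamma of 1 leaves the image unchanged, and larger gamma values lift the midtones of the captured image progressively more.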

Effectiveness of the SR Layer.
In order to investigate whether our learning-based formulation and the SR layer (subpixel convolutional layer) were necessary, we compared the results of the proposed SRCN with and without the SR layer (SRCN w/o SR). The results are shown in Table 3, and visual comparisons are shown in Figure 8. SRCN with an SR layer clearly yielded a better result.

Comparison of Different Loss Functions.
We compared four different loss functions: $l_1$, $l_2$, $l_1 + l_2 + l_{ssim}$, and $l_1 + l_2 + l_{ssim} + l_{perceptual}$. The objective and subjective comparisons are shown in Table 4 and Figure 9, respectively. We found that the quality of the compensation image and that of the projection image using $l_1$ and $l_2$ were almost the same. The use of $l_1 + l_2 + l_{ssim}$ produced the worst compensation and projection results, while $l_1 + l_2 + l_{ssim} + l_{perceptual}$ produced the best results. Finally, we used $l_1 + l_2 + l_{ssim} + l_{perceptual}$ as our SRCN loss function.

Conclusions
In this paper, in order to solve the problem of serious interreflection in multiprojection system imaging, we developed an SRCN model that reduces interreflection from multiprojector immersive systems. We performed experiments using our own data set to establish the validity of the approach. The technique achieved consistently better perceptual quality than previous methods. We first used a deep convolution network specialized for multiprojector interreflection compensation. We formulated a novel architecture by adding a geometric correction subnet. We used SR layers to improve the resolution of the compensated images. Other useful techniques, including gamma correction and perceptual loss, were employed to improve the image quality and restore more accurate realistic textures.

Data Availability
The authors have created a dataset in their Pro-Cam. The artificial dataset is publicly available and its download link is https://drive.google.com/drive/folders/1bdT6tqDW9blAfiKSPDWeqnrs_tG3d6hD?usp=sharing.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this study.