Conditional Deep 3D-Convolutional Generative Adversarial Nets for RGB-D Generation

Generation of synthetic data is a challenging task. There are only a few significant works on RGB video generation and no pertinent works on RGB-D data generation. In the present work, we focus on synthesizing RGB-D data that can be used as a dataset for applications such as object tracking, gesture recognition, and action recognition. This paper proposes a novel architecture that uses conditional deep 3D-convolutional generative adversarial networks to synthesize RGB-D data by exploiting a 3D spatio-temporal convolutional framework. The proposed architecture can be used to generate virtually unlimited data. We present the architecture for generating RGB-D data conditioned on class labels. The architecture uses two parallel paths, one to generate RGB data and the other to synthesize the depth map. The outputs of the two parallel paths are combined to generate RGB-D data. The proposed model is used for video generation at 30 fps (frames per second). Each frame is an RGB-D frame with a spatial resolution of 512 × 512.


Introduction
Deep learning requires a huge volume of data to train networks. Collecting data by physical capture is a daunting task. While capturing images or videos physically, issues arise such as a foreground object's shadow, background clutter, changes in illumination, the effect of moving background objects, and the viewpoint of the scene. These issues evoke the need for depth information in the data. Adding depth as an extra dimension provides useful information about the scene that is insensitive to variations in illumination. Additionally, combining the depth map with RGB gives a rich 3D scene that is close to real-life experience and is very useful in various applications. Despite this requirement and the availability of a vast variety of sensors, RGB-D data acquisition remains a challenge.
For particular applications such as gesture recognition and activity recognition, the two largest datasets available so far are the ChaLearn gesture challenge [1,2] and NTU RGB + D [3]. This gives rise to the need for generating synthetic data with little or no intervention of an RGB-D sensor (for example, only to acquire a reference frame). There are a few networks that can generate RGB images and videos from random numbers.
Synthetic data are used for training in varied computer vision and machine learning applications, including scene reconstruction, camera and object tracking, pose identification, action/gesture recognition, and many more. Using these networks, we can generate scenes that are difficult to capture in real life. The literature shows that the quality of synthetic data generated through generative adversarial networks (GANs) is much better than that of previously used methods. GAN and its variants are among the potentially important breakthroughs in deep learning. Though GAN has been used successfully to generate RGB videos, very little attention has been devoted to RGB + D generation.
Motivated by the importance of depth data in many applications, we propose a new framework for RGB + D data generation that uses a conditional deep 3D-convolutional generative adversarial network. The proposed framework has two parallel paths, each containing a conditional deep 3D-convolutional GAN: one generates the RGB video and the other synthesizes the depth map, and the two outputs are combined to produce the RGB-D video. The two generators are fed with a noise vector sampled from a normal distribution. The discriminator learns to differentiate between generated videos and real videos (taken from the NTU RGB + D [3] dataset) for both RGB and depth videos.
The remainder of the paper is organized as follows. Section 2 discusses the related work. The proposed methodology is discussed in Section 3. Section 4 presents the experimental results. Section 5 concludes the paper.

Related Work
Ian Goodfellow introduced the architecture of generative adversarial networks (GANs) [4] for the purpose of generating data. Since then, several adaptations have been proposed for various applications [5]. Early research on data synthesis was dominated by the synthesis of image data using GAN and its variants. Durugkar et al. [6] developed an architecture known as the generative multiadversarial network, in which multiple discriminators and a single generator were used to model images. A similar system called generative adversarial parallelization was created by Daniel et al. [7,8], in which multiple GAN pairs were used that were interchangeable during training. Initially, it produced unstable results, but it was later modified and improved, and various variants of the architecture have been used since then. Radford et al. [9] developed an architecture called "Deep Convolutional Generative Adversarial Networks" (DCGANs) for unsupervised representation learning of images by combining GAN and CNN architectures; it was later modified, and its variants have been used by many researchers. The outcome of these architectures is better than that of the conventional GAN. Much of the research on GAN was confined to image generation [10], and very little has been done for video sequences.
Vondrick et al. [11] developed an architecture combining GAN [12] with 3D convolution, which served as a milestone for video generation. Our work is motivated by this work, in which a two-stream network was used: one stream generates the foreground and the other generates the background, and the two are then combined to produce the video. The video is then fed to the discriminator, which distinguishes generated output from real videos. Arjovsky et al. [13,14] created a variant of GAN known as WGAN (Wasserstein GAN), which uses the Wasserstein distance instead of the Jensen-Shannon divergence, resulting in a more stable system. Mathieu et al. [15] created a deep multiscale system to predict future frames, but the accuracy was measured for only a few frames. Similarly, Zhu et al. [16] developed SeqGAN, and Yu et al. [17] developed CycleGAN, with the focus on resolving the generator differentiation issue. Xue et al. [18] used a single image instead of a sequence of images to generate future frames. Walker et al. [19] used optical flow to generate future frames. Other GAN-related works have addressed different applications [20-27]. The present paper addresses the following points: (1) generation of RGB-D data using two-stream conditional deep 3D-convolutional generative adversarial networks; (2) exploitation of a spatio-temporal convolutional architecture for generating both depth and RGB videos; (3) generation of a 2-second RGB-D video at a rate of 30 frames per second, where each RGB-D frame has a spatial resolution of 512 × 512; (4) use of a super-resolution (SR) process to increase the resolution of the generated videos with good perceptual quality; (5) an RGB prediction architecture for the future [...]
[...] quality of both types of videos, both videos are concatenated using a merge block to obtain the RGB-D video. Each stream is implemented by a conditional GAN.
In a conditional GAN, the label of each type of video is provided along with the noise sample to the generator, and the same label is provided with the training data to the discriminator. The block diagram of the proposed framework is shown in Figure 2. The same noise sample with the same label is used for both generators. The same label is assigned to the discriminator along with the real videos to obtain the expected video. The structure of the generator and discriminator is discussed in the remaining sections.

Generator. As shown in Figures 2 and 3, we use a noise vector of dimension 100 sampled from a normal distribution. The noise vector is concatenated with the label, and the resulting vector is reshaped to [1, 1, 1, 106].
Since we have only 6 activity classes, we use one-hot encoding to obtain the label vector. To utilize both spatial and temporal information, a 3D convolution-transpose operation is performed over this reshaped vector with a kernel size of [2,4,4], where 2 is the depth of the filter and 4 and 4 are its height and width, respectively. The first convolution-transpose layer uses 512 filters with a stride of [1, 1, 1]. The next five layers use the same 3D convolution-transpose operation with a kernel size of [4,4,4] and 256, 128, 64, 32, and 3 filters, respectively. To increase the size of the output, these layers use a stride of 2 in each dimension. The output shape of the last layer is [64, 128, 128], which represents 64 frames in depth with a height and width of 128 × 128. The rectified linear unit (ReLU) is used as the activation function in all layers. We use only 60 of the 64 frames of the last layer, because a 2-second video at 30 fps requires only 60 frames. The final output of the generator is 60 frames of dimension 128 × 128.
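The layer sizes given for the generator can be checked with a short shape trace. The following Python sketch is illustrative only: it assumes "valid" padding for the first transposed convolution and "same" padding for the five stride-2 layers, neither of which is stated in the text, and the helper names are hypothetical.

```python
# Shape trace for the generator (illustrative sketch, not the trained model).
# Assumptions not stated in the text: the first transposed convolution uses
# "valid" padding and the five stride-2 layers use "same" padding.

NOISE_DIM = 100    # length of the noise vector
NUM_CLASSES = 6    # activity classes, one-hot encoded


def one_hot(label, num_classes=NUM_CLASSES):
    """Return a one-hot list for an integer class label."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def conv_transpose_out(size, kernel, stride, padding):
    """Output length of a transposed convolution along one axis."""
    if padding == "valid":
        return (size - 1) * stride + kernel
    if padding == "same":
        return size * stride
    raise ValueError("unknown padding mode")


def generator_shapes():
    """Return the (depth, height, width, channels) shape after each layer."""
    shape = (1, 1, 1)  # reshaped input: [1, 1, 1, 106]
    shapes = []
    # Layer 1: kernel [2, 4, 4], stride [1, 1, 1], 512 filters.
    shape = tuple(
        conv_transpose_out(s, k, 1, "valid") for s, k in zip(shape, (2, 4, 4))
    )
    shapes.append(shape + (512,))
    # Layers 2-6: kernel [4, 4, 4], stride 2 in each dimension.
    for filters in (256, 128, 64, 32, 3):
        shape = tuple(conv_transpose_out(s, 4, 2, "same") for s in shape)
        shapes.append(shape + (filters,))
    return shapes


input_length = NOISE_DIM + len(one_hot(2))   # noise concatenated with label
shapes = generator_shapes()
frames_kept = min(shapes[-1][0], 2 * 30)     # 2 s at 30 fps

print(input_length)   # 106
print(shapes[-1])     # (64, 128, 128, 3)
print(frames_kept)    # 60
```

Under these padding assumptions, the depth axis grows from 2 to 64 and the spatial axes from 4 to 128 over the five stride-2 layers, which matches the output shape reported above.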
This generator network is replicated into two streams to generate the RGB and depth videos. The same label is fed into the depth-video generating stream. After generating both videos, we use a super-resolution network to improve the perceptual quality of the generated videos. Both videos are then merged to obtain the RGB-D video. The output is an RGB-D video of 2 seconds in length saved at 30 frames per second. The dimension of the output video is the same as that of the final output layer of each generator network, which is 128 × 128.
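One plausible reading of the merge step is a per-pixel channel concatenation of the two streams. The sketch below assumes that reading, a single-channel depth map, and nested Python lists indexed as [frame][row][column]; `merge_rgbd` is a hypothetical helper and not necessarily the merge block used in the proposed framework.

```python
def merge_rgbd(rgb_video, depth_video):
    """Concatenate RGB and depth per pixel: (r, g, b) + d -> (r, g, b, d)."""
    if len(rgb_video) != len(depth_video):
        raise ValueError("streams must have the same number of frames")
    merged = []
    for rgb_frame, depth_frame in zip(rgb_video, depth_video):
        frame = []
        for rgb_row, depth_row in zip(rgb_frame, depth_frame):
            # Append the depth value as a fourth channel of every pixel.
            frame.append([tuple(rgb) + (d,) for rgb, d in zip(rgb_row, depth_row)])
        merged.append(frame)
    return merged


# Tiny made-up example: 2 frames of 1 x 2 pixels.
rgb = [[[(10, 20, 30), (40, 50, 60)]], [[(1, 2, 3), (4, 5, 6)]]]
depth = [[[0.5, 0.7]], [[0.1, 0.2]]]
rgbd = merge_rgbd(rgb, depth)
print(rgbd[0][0][1])  # (40, 50, 60, 0.7)
```

Because both streams share the same noise sample and label, the frames are paired by index; the sketch therefore only checks that the two videos have the same number of frames.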

Discriminator.
The job of the discriminator is to act as a classifier: it must distinguish between real and fake videos. As we are using a conditional GAN, the classification is also based on the label. The discriminator comprises five 3D convolution layers with a kernel dimension of [4,4,4] at each layer except the last, which uses [2,4,4]. The 3D convolution operation captures the spatio-temporal information of both videos. The numbers of filters in the five layers are 64, 128, 256, 512, and 1, respectively. Before the real video is fed into the first layer of the network, the one-hot encoded label is reshaped to the size of the input video frame, which is 128 × 128. After reshaping, it is concatenated with the input video frame and passed to the first layer of the discriminator network. The leaky-ReLU activation function is employed in the first three layers, and the final layer uses the sigmoid activation function. The training data are augmented using image rotation, image cropping, and image filters to prevent overfitting. Thereafter, all images are resized to the original resolution. To build the model, we use two streams, one for generating the RGB video and the other for the depth video. The generator and discriminator described above are replicated in both streams. Each generator and discriminator is trained with the cross-entropy loss. The optimizer uses a learning rate of 0.002. The model is trained for 1000 epochs. In every epoch, we train the model on all the videos in the training folder, and after each complete iteration, we generate a sample video and save the model file.
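The label conditioning and the cross-entropy objective described above can be sketched in a few lines of Python. Two assumptions are made here that the text does not confirm: the one-hot label is tiled into one constant 128 × 128 plane per class before concatenation, and the generator uses the common non-saturating form of the loss. All function names are hypothetical.

```python
import math


def label_planes(one_hot_label, height, width):
    """Tile each one-hot entry into a height x width plane (one plane per class)."""
    return [[[value] * width for _ in range(height)] for value in one_hot_label]


def binary_cross_entropy(prediction, target, eps=1e-7):
    """Cross entropy between a sigmoid output in (0, 1) and a 0/1 target."""
    p = min(max(prediction, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def discriminator_loss(d_real, d_fake):
    """Real videos are pushed toward 1, generated videos toward 0."""
    return binary_cross_entropy(d_real, 1.0) + binary_cross_entropy(d_fake, 0.0)


def generator_loss(d_fake):
    """Non-saturating form: the generator wants D(fake) close to 1."""
    return binary_cross_entropy(d_fake, 1.0)


planes = label_planes([0, 0, 1, 0, 0, 0], height=128, width=128)
print(len(planes), len(planes[0]), len(planes[0][0]))  # 6 128 128
print(round(discriminator_loss(0.9, 0.1), 4))          # 0.2107
print(round(generator_loss(0.1), 4))                   # 2.3026
```

In this sketch a confident and correct discriminator (0.9 on real, 0.1 on fake) yields a small discriminator loss and a large generator loss, which is the adversarial signal that drives both networks during training.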
This 4X super-resolution increases the spatial resolution of the RGB frames from 128 × 128 to 512 × 512, and the temporal 4X-SR increases the number of frames fourfold. For depth SR, we simply interpolate the space-time video frames by tri-cubic interpolation.
This SR block helps in improving the spatio-temporal resolution of the generated RGB and depth videos.
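Tri-cubic interpolation can be applied separably, as a 1D cubic interpolation along the time, height, and width axes in turn. The sketch below shows only that 1D building block for a 4X upsampling factor. It assumes a Catmull-Rom cubic kernel and border clamping, neither of which is specified in the text, and the function names are hypothetical.

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Cubic interpolation between p1 and p2 for t in [0, 1]."""
    return 0.5 * (
        2.0 * p1
        + (p2 - p0) * t
        + (2.0 * p0 - 5.0 * p1 + 4.0 * p2 - p3) * t * t
        + (3.0 * p1 - p0 - 3.0 * p2 + p3) * t * t * t
    )


def upsample_cubic(samples, factor):
    """Upsample a 1D sequence by an integer factor, clamping at the borders."""
    n = len(samples)

    def at(i):
        # Repeat the first and last samples outside the valid range.
        return samples[min(max(i, 0), n - 1)]

    out = []
    for j in range(n * factor):
        k, r = divmod(j, factor)
        t = r / factor
        out.append(catmull_rom(at(k - 1), at(k), at(k + 1), at(k + 2), t))
    return out


depth_over_time = [0.0, 1.0, 2.0, 3.0, 4.0]  # one pixel across 5 frames
up = upsample_cubic(depth_over_time, 4)
print(len(up))          # 20
print(up[4:9])          # [1.0, 1.25, 1.5, 1.75, 2.0]
print(128 * 4, 60 * 4)  # 512 240
```

Applying the same routine along the height and width axes of every frame, and then along the frame axis, gives a separable tri-cubic upsampling from 128 × 128 × 60 to 512 × 512 × 240.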

Future Generation: Prediction of Next Frames.
We have adapted our proposed framework for future frame prediction, as shown in Figure 4. Here, the input is a static or reference frame, and the output is the predicted future frames. The working methodology is as follows. The reference frame is fed into a convolutional encoder (CE). The CE learns the features of the reference frame and reduces its size to that of the first layer of the generator network. Then, our proposed framework generates the video by creating the next future frames.
[...] pointing to something. A total of 11 action classes are categorized as mutual actions, such as kicking, hugging, and punching, and 9 action classes are considered medical conditions, such as sneezing, vomiting, and falling. The original video frames are shown in Figure 5.

Result Analysis
Using this dataset, we generated videos as shown in Figures 6 and 7. Figure 7 shows videos of subjects hugging each other, and Figure 8 shows videos of a fight. Figures 6 and 7 are of the classes kicking and sitting on a chair, respectively. All of the videos are 2 seconds long and contain 60 frames (30 frames per second). We show only 12 frames per video.
It is very difficult to measure the quality of generated video. Since there is no one-to-one correspondence between real data and generated data, the mean squared error is not suitable for this kind of quality measurement. Additionally, the mean squared error does not reflect the human perception of reality. A commonly adopted tool is Amazon Mechanical Turk, commonly known as MTurk. We conducted a survey on Amazon Mechanical Turk, motivated by [29], to obtain a subjective mean opinion score (MOS) based on the perceptual quality of the generated videos. Our aim is to measure the perceptual quality of the motion in the generated video. We divide the MOS into five categories (bad, poor, fair, good, and excellent), with a score of 1 for bad (least perceivable) and 5 for excellent (most perceivable and realistic). For each HIT (Human Intelligence Task), subjects are asked to rate videos of two seconds. A total of 1200 ratings were collected from more than 60 unique subjects, and the survey is still ongoing. Some subjects were rejected on account of reliability, and no subject was allowed to take the survey more than once. By averaging the ratings for each individual video, we obtain the MOS for our generated videos. 76 percent of the ratings lie in the range of 3 to 5. These scores suggest that our generated videos look realistic, with perceivable motion. Figures 8 and 9 show the output of future frame prediction using a reference frame of a particular action. In these figures, the predicted frames are shown with a gap of 10 frames. The quantitative analysis on the testing dataset indicates that the proposed model obtains false-positive and false-negative rates of 1.37% and 1.87%, respectively. These values also indicate that the proposed model does not suffer from overfitting.
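The MOS aggregation described above amounts to a per-video average of the 1 to 5 ratings, together with the share of ratings that fall between 3 and 5. The Python sketch below illustrates this computation on made-up ratings; the video identifiers and scores are hypothetical and are not the survey data.

```python
from collections import defaultdict


def mean_opinion_scores(ratings):
    """Average the 1-5 ratings of each video; ratings is a list of (video_id, score)."""
    by_video = defaultdict(list)
    for video_id, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError("scores must lie in 1..5")
        by_video[video_id].append(score)
    return {vid: sum(scores) / len(scores) for vid, scores in by_video.items()}


def share_in_range(ratings, low=3, high=5):
    """Fraction of individual ratings with low <= score <= high."""
    hits = sum(1 for _, score in ratings if low <= score <= high)
    return hits / len(ratings)


# Hypothetical ratings for illustration only.
ratings = [("kick_01", 4), ("kick_01", 5), ("kick_01", 2),
           ("hug_01", 3), ("hug_01", 4), ("hug_01", 1),
           ("sit_01", 5), ("sit_01", 3)]
mos = mean_opinion_scores(ratings)
print(round(mos["kick_01"], 2))            # 3.67
print(round(share_in_range(ratings), 2))   # 0.75
```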

Conclusions
The proposed conditional deep 3D-convolutional generative adversarial network can generate realistic videos of 2 seconds in length, as shown in the experimental results. This framework has also proven promising for predicting future frames. In the proposed framework, the super-resolution network enables us to produce video of high spatio-temporal resolution. In addition, we have generated RGB-D video for each action class. This RGB-D data generation will help in many computer vision applications and in understanding the features responsible for recognizing actions of different classes. In the future, we will explore a CNN-LSTM-based generative architecture along the lines of the proposed architecture for RGB-D generation with more realistic quality, and we will develop an end-to-end pipeline that combines data generation with its application to improving action recognition accuracy for many classes.
Data Availability
The data that support the findings of this study are available upon request.

Figure 8: Future prediction. The first frame is the static or reference frame, and the next six are future predicted frames with a gap of 10 frames.

Figure 9: Future prediction. The first frame is the static or reference frame, and the next six are future predicted frames with a gap of 10 frames.