Deep CNN and Deep GAN in Computational Visual Perception-Driven Image Analysis

Computational visual perception, also known as computer vision, is a field of artificial intelligence that enables computers to process digital images and videos in a way similar to biological vision. It involves developing methods that replicate the capabilities of biological vision. The goal of computer vision is to surpass the capabilities of biological vision in extracting useful information from visual data. The massive volume of data generated today is one of the driving factors behind the tremendous growth of computer vision. This survey gives an overview of existing applications of deep learning in computational visual perception. It explores various deep learning techniques adapted to solve computer vision problems using deep convolutional neural networks and deep generative adversarial networks. The pitfalls of deep learning and their solutions, namely dropout and data augmentation, are briefly discussed. The results show a significant improvement in accuracy when dropout and data augmentation are used. Applications of deep convolutional neural networks, namely, image classification, localization and detection, document analysis, and speech recognition, are discussed in detail. An in-depth analysis of deep generative adversarial network applications, namely, image-to-image translation, image denoising, face aging, and facial attribute editing, is presented. The deep generative adversarial network is an unsupervised learning method, but adding a certain number of labels in practical applications can improve its generating ability. Acquiring many data labels is challenging, whereas a small number of labels can usually be obtained. Therefore, combining semisupervised learning with generative adversarial networks is one of the future directions.
This article surveys the recent developments in this direction, provides a critical review of the related significant aspects, and investigates the current opportunities and future challenges in emerging domains such as handwriting recognition, semantic mapping, webcam-based eye trackers, lumen center detection, query-by-string word, intermittently closed and open lakes and lagoons, and landslides.


Introduction
Computer vision (CV), the core component of machine intelligence, is an interdisciplinary field that enables computers to achieve a visual understanding of digital images. For a machine to see as animals or people do, it relies on computer vision. Table 1 lists the abbreviations used in the manuscript and their expansions. CV is a booming field applied in many everyday activities, including face detection, object detection, biometrics, medical diagnosis from faces, self-checkout kiosks, autonomous vehicles, image recognition, image enhancement, image deblurring, motion tracking, video surveillance, control of robots, and analysis of mammograms and X-rays [22][23][24][25][26]. The fundamental goal of all these applications is to create a replica of the human observer that interprets the scene in a broad sense and makes decisions for the task at hand [27]. CV and image processing are easily confused and are often used interchangeably. Image processing receives images as the input, processes them, and outputs images, while CV receives images as the input, processes them, and interprets them. The output generated by CV is an abstract representation of the entire image or of its constituents. D-GAN was proposed as an unsupervised learning method, but adding a certain number of labels in practical applications can improve its generating ability. Acquiring many data labels is challenging, whereas a small number of labels can usually be obtained. Therefore, combining semisupervised learning and GAN is one of the future directions. Figure 1(a) represents the general architecture of the deep convolutional neural network (D-CNN). D-CNN is similar to a standard neural network in that it is built from neurons with learnable weights and biases [30]. In recent times, however, D-CNN has been widely preferred over a standard neural network because it is faster and computationally less expensive.
An image, which is a matrix of pixels, is flattened and fed into the neural network. For example, a flattened image of size 40 × 30 × 1 requires 1200 neurons at the input layer. Here, the complexity is manageable using a neural network. Colored images have three channels, one each for red, green, and blue (RGB), which makes the number of neurons required at the input layer very high. When an image of size 1024 × 1024 × 3 has to be fed into the neural network, 3,145,728 neurons are required at the input layer, which is computationally expensive. The number of neurons needed at the input layer grows in proportion to the number of pixels, that is, quadratically with the side length of the image. Figure 1(b) represents the architecture of deep generative adversarial networks (D-GAN), where the generator G captures the data distribution and generates fake data G(z) from noise z drawn from the prior p_z(z). The generative model improves until it generates a distribution similar to p_data(x), the real data distribution. The discriminator D is fed either an actual data sample or a generated data sample G(z). The discriminator outputs a probability D(x) in (0, 1), indicating the data source. The generative model is trained to capture the data distribution from the original data. The discriminative model predicts the probability that a sample originated from the original training data rather than from the generator G. The generator's goal is to generate data close to the distribution of real data and deceive the discriminator. The purpose of the discriminator is to identify the fake data generated by the generator. Model distributions are generated by feeding random noise as the input through a multilayer perceptron. The discriminator has a multilayer perceptron with a classifier at the end [31]. The generator and discriminator compete until the counterfeits generated by the generator are indistinguishable from the real data. Through adversarial training, the quality of data generated by the generator gradually improves [32].
The quality of the data samples generated by the generator and the discriminator's ability to identify them improve together with each iteration. The generator can be any neural network, such as an artificial neural network, convolutional neural network, recurrent neural network, or long short-term memory network, whose task is to learn the data distribution. The discriminator is essentially a binary classifier that classifies the input as real or fake. The entire network is trained using backpropagation. The error is estimated from the sample label and the discriminator output, and the parameters of G are updated using the error backpropagation algorithm. D-GAN is inspired by the two-player minimax game, in which one player benefits at the loss of the other, and is represented by the following equation [7]:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right],$$

where p_data(x) is the real data distribution and p_z(z) is the noise distribution from which the generated data G(z) are produced. The output of the discriminator is a probability indicating the origin of the data sample. A probability of 1 or very close to 1 indicates that the data sample is real. A probability of 0 or close to 0 indicates fake data. A probability close to 0.5 indicates that the discriminator finds it hard to identify counterfeit samples. G is trained repeatedly to make D's output approach 1 for the data samples generated by G. The model is trained until Nash equilibrium is achieved, where a change in strategy no longer changes the outcome of the game. The Nash equilibrium is achieved when the generator has gained the capability to generate data close to the real data. At that point, the discriminator can no longer distinguish real data from generated data, and the generator is considered to have learned the real-data distribution.
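As a rough numerical illustration of the minimax objective and of the 0.5 equilibrium output described above, the following sketch estimates the value function from lists of discriminator outputs. The function name `value_function` and the sample probabilities are hypothetical and chosen only for illustration; they are not taken from the surveyed work.

```python
import math

def value_function(d_real, d_fake):
    """Sample estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Hypothetical outputs of a discriminator that still separates real from fake.
confident = value_function([0.9, 0.95, 0.85], [0.1, 0.05, 0.2])

# Hypothetical outputs at equilibrium: every sample is scored 0.5.
confused = value_function([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
```

When every output is 0.5, the estimate equals log 0.5 + log 0.5 = −log 4, which is lower than the value a confident discriminator achieves; the generator's training pushes the value down toward this point, while the discriminator's training pushes it up.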

Contributions of This Survey.
This paper presents the general architecture of D-CNN, its applications, the various methodologies adopted, and its application-based performance. An overview of D-GANs is also given, together with their existing variants and their applications in different domains. Furthermore, this paper identifies the advantages, disadvantages, and recent advancements of GANs in the field of computer vision. It also aims to present a comprehensive survey of the essential applications of GANs, covering crucial areas of research with their architectures. Table 2 shows a comparison between the current survey and existing surveys on D-CNN and D-GAN.

Biological Vision vs. Computer Vision
Biological vision has tremendous capabilities in retrieving vital information from visual data and analyzing it for functional needs. The perceptual mechanisms used by people and animals to interpret the visual world are diverse. Research on biological vision is an excellent source of inspiration for CV and focuses on computationally understanding the brain mechanisms of visual interpretation. Understanding the perceptual mechanism of biological vision is the initial step towards interpreting visual data. In current research, the computational understanding of biological vision is based on the framework defined by David Marr [40]. Biological vision can perform tasks with high reliability, even if the visual data are noisy, cluttered, and ambiguous. It can efficiently solve computationally complex problems that are still challenging for CV. The fundamental goal of CV, the science of image analysis, is to automate computational methods that extract visual information and understand the image's content for decision-making [41,42]. From CV's perspective, an image is a sequence of square pixels arranged as an array or matrix. At a high level, biological and computer vision have the same structure and the same objective: to extract visual data and represent it as useful information for taking action.

Deep Learning
Deep learning, or hierarchical learning, has emerged as a subfield of machine learning, which, in turn, is a subfield of artificial intelligence (AI) [43]. AI is an effort to make machines think and automatically perform intellectual tasks otherwise performed by humans. Classical AI follows a programming paradigm in which humans craft rules for the data, and the machine outputs the answers. The question then arose whether a machine could learn the data processing rules automatically by looking at the data. Machine learning, a new programming paradigm, came into existence as an answer to this question. With machine learning, the data and the answers are fed to the machine, which crafts the rules. A machine learning model is trained rather than explicitly programmed. Machine learning and deep learning came into existence when a need arose to solve fuzzy and more complex problems such as language translation, speech recognition, and image classification [44,45]. At their core, deep learning and machine learning are about learning a representation of the data at hand to get the expected output. Deep learning models are capable of learning complex relationships between the inputs and the outputs. Deep learning uses multiple processing layers to discover the data's intricate structure at multiple levels of abstraction [46]. The "deep" in deep learning refers to successive layers of representation. The multiple nonlinear hidden layers in a deep learning model are parameterized by weights. For a network to map the inputs correctly to their targets, proper values must be set for the weights of all layers in the network.
Any deep learning problem has an actual target value and a predicted value. The difference between the actual and the predicted value is measured by the loss function. A distance score, which represents the network performance, is computed from the actual and predicted values. Initially, the weights are assigned randomly, the predicted output is far from the actual output, and accordingly the distance score is very high. The weights are then adjusted, the training loop is repeated, and the distance score decreases. Tens or hundreds of iterations are performed over thousands of examples, and a minimal loss value indicates that the outputs are close to the targets. Deep learning has dramatically improved the state of the art in object detection, speech recognition, and many other domains [47]. Figure 2 shows the deep learning models. D-CNN has excelled in processing images, speech, audio, and video, while RNN has brought a breakthrough in processing sequential data.
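The training loop described above (random initial weights, a loss computed from actual and predicted values, repeated weight adjustments) can be sketched on a deliberately tiny model. The example below is a minimal sketch, assuming a single weight and bias, a mean squared error loss, and plain gradient descent; the function `train`, the learning rate, and the toy data are illustrative assumptions rather than the setup used in the survey.

```python
import random

def train(xs, ys, learning_rate=0.05, epochs=200, seed=0):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    rng = random.Random(seed)
    # Weights start at random values, so the first predictions are poor.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    n = len(xs)
    history = []
    for _ in range(epochs):
        preds = [w * x + b for x in xs]
        errors = [p - y for p, y in zip(preds, ys)]
        # Loss: mean squared distance between predicted and actual values.
        loss = sum(e * e for e in errors) / n
        history.append(loss)
        # Gradients of the loss with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Adjust the weights in the direction that lowers the loss.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, history

# Toy data generated from y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b, history = train(xs, ys)
```

The `history` list records the loss at every iteration; it starts high and shrinks as the weights approach the values that generated the data, which is the behavior the paragraph describes for deeper networks as well.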

Deep Convolutional Neural Network.
The deep convolutional neural networks, popularly known as D-CNN or D-ConvNets, are a robust class of ANN and the most established deep learning algorithm; they have become dominant in computer vision and many other applications [48]. Convolutional layers, pooling layers, and fully connected layers are the building blocks of the D-CNN [49,50]. D-CNN is designed to process data that arrive as multiple arrays or grids through its numerous building blocks. The role of the convolutional and pooling layers is to extract the features, while the fully connected layers map the extracted features to the output. A deep CNN has multiple convolutional and pooling layers, followed by one or more fully connected layers. The input passes through these layers and is transformed into the output through forward propagation. The convolution operation and the activation function are the backbone of the D-CNN [51].
Table 2: Comparison between the current survey and existing surveys on D-CNN and D-GAN.
No. | Existing survey | Contribution of the existing survey | Contribution of the current survey
1 | [title missing] | State-of-the-art GAN architectures are surveyed, and their application domains in natural language processing and computer vision are discussed. The loss functions of the GAN variants are discussed. | State-of-the-art GAN is discussed along with its performance on the MNIST dataset. Generator and discriminator losses are visually represented for the GAN variants.
2 | Recent progress on generative adversarial networks (GANs): A survey [34] | Basic theory and different GAN models are summarized. The models derived from the GAN are classified, and evaluation metrics are discussed. | Variants of the GAN, their application, architecture, methodology, advantages, and disadvantages are analyzed and summarized. The evolution of the GAN with conditions, encoders, loss functions, and the processing of discrete data is discussed separately.
3 | A survey of the recent architectures of deep convolutional neural networks [35] | An overview of the different layers of D-CNN, namely, the convolutional layer and pooling layer, is given. The pitfalls of deep learning are outlined. | The different layers of D-CNN, namely, the convolutional layer and pooling layer, and the operations performed in them are discussed in detail. Deep learning pitfalls, namely, overfitting, underfitting, and data insufficiency, are reviewed in detail along with their possible solutions.
4 | Deep learning for generic object detection: A survey [36] | Recent achievements in the field of object detection are discussed. | Recent advancements of the D-CNN in computer vision are tabulated and discussed with their methodology and performance. Activation functions used for computer vision problems are tabulated.
5 | A survey on image data augmentation for deep learning [37] | The survey presents the existing methods for data augmentation. | The advantages of data augmentation are discussed, and results comparing the model's performance with and without data augmentation are presented.
6 | Adversarial-learning-based image-to-image transformation: A survey [38] | The survey presents an overview of adversarial learning-based methods, focusing on the image-to-image transformation scenario. | The existing survey mainly focused on image-to-image translation; the current survey discusses several applications based on adversarial learning.
7 | Survey of convolutional neural networks for image captioning [39] | The survey presents a shallow overview of image captioning performed using D-CNN. | The current survey discusses various applications of the D-CNN in detail.

Tensor, kernel, and feature map are the three essential terms for the convolution operation. The tensor is the input, which is a multidimensional array; the kernel is a small array of numbers; and the feature map is the output tensor, as shown in Figure 3. The convolution operation is a linear process in which a dot product is performed between the kernel and the input tensor. Each element of the kernel is multiplied by the corresponding tensor element, and the products are summed to give the output value placed in the corresponding position of the feature map [52]. The convolution operation is defined by the kernel size and the number of kernels. The kernel size may be 3 × 3, 5 × 5, or 7 × 7, depending on the size of the input tensor. The number of kernels is arbitrary, with each kernel representing a different characteristic of the input, and the convolution operation is repeated for each kernel. Here, the kernel size is 3 × 3, and the kernel is applied across the input tensor to perform a dot product and return the corresponding value of the output tensor. A drawback of the convolution operation is that the feature map shrinks in size compared to the input tensor, because the kernel center cannot be placed on the bordering elements of the input tensor. With a 5 × 5 input tensor, the feature map shrinks to 3 × 3. For a stride of 1 and no padding, the size of the feature map f_s for a t × t tensor and a k × k kernel is determined using the following formula:

$$f_s = (t - k + 1) \times (t - k + 1).$$

Applying this formula with a 3 × 3 kernel, a 49-pixel (7 × 7) input shrinks to a 25-pixel (5 × 5) feature map, which shrinks further each time the process is repeated, resulting in the loss of essential features of the input. To address this issue in deep CNN models with more layers, padding is used, where rows and columns of zeroes are added on all sides of the input tensor. This lets the kernel center fit over the bordering elements of the input tensor, so that the feature map keeps the same size as the input tensor. Figure 4 shows zero padding, where rows and columns are added to all sides of the input tensor. As a result of zero padding, the feature map's size is 5 × 5, the same as the input tensor. The number of pixels by which the kernel shifts over the input tensor is called the stride.
When the value of stride is 1, the kernel shifts 1 pixel at a time. When the value of stride is 2, the kernel shifts 2 pixels at a time, and so on. Activation functions that are frequently used are logistic sigmoid, hyperbolic tangent, and ReLU. Table 3 presents the recent advancements of the D-CNN in computer vision.
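The effect of kernel size, zero padding, and stride on the feature-map size can be checked with a small, unoptimized sketch. The function `conv2d` below is a hypothetical helper written for illustration; like most CNN implementations, it slides the kernel without flipping it, and it assumes square inputs and kernels stored as nested lists.

```python
def conv2d(tensor, kernel, stride=1, padding=0):
    """Slide a k x k kernel over a t x t tensor and return the feature map."""
    k = len(kernel)
    # Zero padding: add `padding` rows and columns of zeros on every side.
    size = len(tensor) + 2 * padding
    padded = [[0] * size for _ in range(size)]
    for r, row in enumerate(tensor):
        for c, value in enumerate(row):
            padded[r + padding][c + padding] = value
    # Number of kernel positions along each axis.
    out_size = (size - k) // stride + 1
    feature_map = []
    for i in range(out_size):
        out_row = []
        for j in range(out_size):
            # Dot product between the kernel and the patch it covers.
            total = 0
            for a in range(k):
                for b in range(k):
                    total += kernel[a][b] * padded[i * stride + a][j * stride + b]
            out_row.append(total)
        feature_map.append(out_row)
    return feature_map

tensor = [[1] * 5 for _ in range(5)]   # 5 x 5 input of ones
kernel = [[1] * 3 for _ in range(3)]   # 3 x 3 kernel of ones

valid = conv2d(tensor, kernel)               # no padding: 3 x 3 feature map
same = conv2d(tensor, kernel, padding=1)     # zero padding: 5 x 5 feature map
strided = conv2d(tensor, kernel, stride=2)   # stride 2: 2 x 2 feature map
```

With the all-ones input, interior outputs equal 9 (nine products of 1), while the corner of the padded result equals 4 because only a 2 × 2 part of the kernel overlaps real pixels there.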

Activation Functions.
In a deep learning model, the input is fed into the network, and a bias is added to the product of the input and the weights. An activation function is then applied to the result, and the same process is repeated until the last layer is reached. Activation functions play a significant role in a neural network by defining a neuron's output for a given set of inputs. The activation function takes the weighted sum of the inputs and applies a transformation that compresses the output between a lower and an upper limit. Activation functions are of two types: linear and nonlinear. Deep learning uses nonlinear activation functions for its classification problems, where the output lies between 0 and 1. Without nonlinearity, each layer of the network would perform a linear transformation, in which case the hidden layers could be replaced by an equivalent single layer. For backpropagation to be executed, the activation function must be differentiable. For deep learning, an activation function therefore has to be both nonlinear and differentiable. Some of the standard activation functions in deep learning are sigmoid, tanh, ReLU, and leaky ReLU. Table 4 shows the most frequently used activation functions. Sigmoid transforms the output to lie between 0 and 1. In recent times, sigmoid has become one of the least used activation functions because of its drawbacks. First, it causes gradients to vanish: when the neuron's activation saturates close to 0 or 1, the gradients in this region are close to zero. The second drawback is that the output is not zero-centered. tanh is another activation function and performs better than sigmoid; its output lies in the range between −1 and 1 [69]. ReLU is the most popular and frequently used activation function in deep learning. ReLU overcomes two problems of the S-shaped activation functions: slow training and vanishing gradients [70,71].
The mathematics behind ReLU is $f(x) = \max(0, x)$: when $x < 0$, the output is 0; conversely, when $x \geq 0$, the output is the linear function $f(x) = x$ [72]. The range of the output is between 0 and infinity.
Since ReLU outputs zero for negative inputs, the gradient is zero in this region, so the affected neurons do not respond to variations in the input or the error. This can make part of the network passive because of dead neurons. This problem, called dying ReLU, can be overcome by leaky ReLU. Leaky ReLU is similar to ReLU, except that it does not set negative inputs to zero. Instead, it applies a small nonzero slope of 0.01 in the negative regime of the input. The range is between −∞ and ∞. The purpose of leaky ReLU is to minimize the dying neuron problem. In multiclass classification problems, the output layer employs softmax as the classification function [73]. Softmax produces outputs that sum to 1, so the output of softmax specifies a probability distribution over the n different classes of the target. This is why softmax is used specifically in multiclass classification problems. Because of the vanishing gradient problem, tanh and sigmoid activation functions are generally not employed in the D-CNN.
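The activation functions discussed above are short enough to write out directly. The definitions below are a minimal sketch using only the standard library; the leaky ReLU slope of 0.01 follows the text, while the function names and the example scores passed to softmax are illustrative assumptions.

```python
import math

def sigmoid(x):
    # Squashes any real input into (0, 1); saturates for large |x|.
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Zero-centered counterpart of sigmoid with outputs in (-1, 1).
    return math.tanh(x)

def relu(x):
    # Zero for negative inputs, identity otherwise.
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # Keeps a small nonzero response for negative inputs.
    return x if x >= 0 else slope * x

def softmax(scores):
    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical output-layer scores for a three-class problem.
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))
```

The softmax outputs sum to 1 and can be read as class probabilities, and the predicted class is the index of the largest one.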

Pitfalls of a Deep Learning Model.
The model's performance indicates how well the model is trained and how well it generalizes to new or unseen data. Evaluating the performance of a model is a crucial step in data science. The common barriers to creating high-performance models are overfitting, underfitting, and too little training data. Figure 5 shows overfitting, which is said to occur when a model performs very well on the training set but its performance deteriorates on the validation set: the loss during the training phase decreases, but the loss during the validation phase increases. This happens because the model learns even the unnecessary information in the training set; hence, its performance is too good on the training set, yet it fails to perform on the validation set. Overfitting can be addressed by improving the model and by obtaining more training data. The model can be improved by randomly omitting feature detectors from the model's architecture.
This technique is called dropout, developed by Geoff Hinton [74]. A vast number of different networks can be trained in a reasonable time using random dropouts. Thus, a different network is presented for each training case. In a nutshell, the dropout technique mutes a randomly selected portion of the network for each training case [75]. This is a useful technique as it prevents any single neuron within the network from becoming excessively influential. Thus, the model does not rely too much on any specific feature of the data. Dropout is used when there is overfitting, but it can also improve the validation accuracy in later epochs even if there is no overfitting. Dropout is added based on experimentation, and the usual dropout range is between 20 and 50 percent of the neurons. Figures 5(a) and 5(b) show the accuracy and loss curves after two dropout layers with dropout rates of 0.25 and 0.5 have been added. It can be seen that accuracy has improved and loss has decreased after adding dropout.

Table 3 (fragment): Recent advancements of the D-CNN in computer vision.
Visual aesthetic quality assessment [56] | To present a biological model for three tasks: aesthetic score regression, aesthetic quality classification, and aesthetic score distribution prediction | A double-subnet gated peripheral-foveal convolutional neural network with a foveal and a peripheral subnet. The peripheral subnet mimics peripheral vision, while the foveal subnet extracts fine-grained features. | Standard aesthetic visual assessment datasets and Photo.net datasets are used for unified aesthetic prediction tasks. | Nine-layer CNN
3D object recognition [57] | To take multiview images captured from partial angles as the input and perform 3D object detection using the 3D CNN | 3D object information is encoded from the 3D spatial dimension. With a 3D kernel, the view images are used to perform 3D convolution.

In many cases, the data available may not be sufficient to train a model. With too little training data, the model may not learn the patterns in the data, which inhibits its capability to generalize to unseen data [76]. It is challenging, expensive, and time-consuming to collect the new data required to train the model [77]. Under such circumstances, data augmentation is warranted. It is a powerful and computationally inexpensive technique for mitigating overfitting by artificially inflating the size of the training data using the data in hand, without collecting new data [71,78]. One or more deformations are applied to the data, while the semantic meaning of the labels is preserved during the transformation [79]. With more data provided to the model, it generalizes better on the validation data. Some popular data augmentation techniques are rotation, flip, skew, crop, contrast and brightness adjustment, and zoom in/out [80,81]. Figure 6 shows different augmentations applied to a single image. Here, flip, rescale, zoom, and height and width shift augmentations are applied to a cat image. Training the model with the additional deformed data makes it generalize better to unseen data. Figure 7 shows an improvement in accuracy after adding dropout and applying different augmentations to the data. The next problem usually faced by AI practitioners is underfitting.
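The dropout mechanism described in this section, in which a random portion of the units is muted for each training case, can be sketched in a few lines. The version below is the common "inverted dropout" variant, which rescales the surviving activations so that their expected value is unchanged; that rescaling, the helper name `dropout`, and the toy layer output are assumptions made for illustration and are not details given in the text.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Randomly zero a fraction `rate` of the activations during training."""
    if not training or rate == 0.0:
        # At inference time every unit is kept unchanged.
        return list(activations)
    keep = 1.0 - rate
    # Each unit survives with probability `keep`; survivors are scaled by
    # 1 / keep so the expected activation stays the same (inverted dropout).
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
layer_output = [1.0] * 10000          # hypothetical activations of one layer
dropped = dropout(layer_output, rate=0.25, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
```

With a rate of 0.25, roughly one unit in four is zeroed on each call, and a different random subset is muted every time the function is invoked, so each training case effectively sees a different thinned network.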
The model is said to underfit when it cannot learn the patterns even from the training set and exhibits poor performance on it. This can be addressed by increasing the training data, increasing the model complexity, and increasing the number of training epochs. Deeper models with more neurons per layer can avoid underfitting [82]. Imbalanced datasets are another crucial problem, as they are widespread in real-world situations and have proven to be one of the greatest challenges in classification problems. Data access has become easier with the advancement of technology; however, data imbalance has become ubiquitous in most collected datasets. For example, in medical data, most people are healthy, and unhealthy people form a much smaller proportion, which significantly affects classification accuracy. Here, the classes with adequate samples are called the majority class, and the classes with inadequate samples are called the minority class. Prediction of the minority class becomes problematic because it has too few samples.
Several techniques are adopted to handle data imbalance in the dataset. Some of them are weight balancing, over- and undersampling (resampling), and penalizing algorithms [83]. Weight balancing is performed by modifying the weights carried by the training samples when computing the loss. Resampling is one of the most frequently adopted techniques, where undersampling removes samples from the majority class and oversampling adds more samples to the minority class. Oversampling is done by duplicating random records from the minority class. The penalizing algorithm is another technique, where the cost of classification mistakes on the minority class is increased. Label noise is another problem in deep learning, and some of its sources are nonexpert labeling, automatic labeling, and data poisoning by adversaries. Hendrycks et al. [84] recommended a loss correction technique that utilizes trusted data with clean labels. The authors effectively used trusted data to overcome the effects of label noise on classification.

The D-GAN was introduced by Goodfellow et al. [7]. Significant improvements were achieved in computer vision applications such as image super-resolution [85], image classification [86], image steganography [87], image transformation [88], video generation, image synthesis [89], video super-resolution [90], and image style transformation. Variants of the D-GAN model have also been proposed in recent times. Figure 8 shows the D-GAN architecture, where the generator generates fake images of human faces and the discriminator's job is to distinguish the real faces from the fake faces. In general, the generator produces plausible data, and these data become negative examples for training the discriminator. The discriminator, a binary classification neural network, takes in the real samples and the samples from the generator and learns to distinguish the real samples from the generator's fake samples.
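Two loss functions, the generator loss and the discriminator loss, are backpropagated to the generator and the discriminator, respectively. The discriminator ignores the generator loss. The generator and the discriminator update their weights based on the loss, where, for a minibatch of m samples indexed by i ranging from 1 to m, the loss is computed from

$$\frac{1}{m} \sum_{i=1}^{m} \left[ \log D\left(x^{(i)}\right) + \log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right) \right],$$

with $x^{(i)}$ a real sample and $z^{(i)}$ a noise sample.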

Evolution of the Deep GAN.
With the advent of the deep GAN by Goodfellow, several variants of the deep GAN were proposed for various CV applications. These deep GAN variants have their own architecture, methodology, advantages, and disadvantages but share the same two-player minimax game as their base. Figure 9 shows D-GAN's evolution in terms of conditions, encoders, loss functions, and the processing of discrete data. Table 5 shows D-GAN's evolution with its application, architecture, methodology, advantages, and disadvantages. D-GAN is successfully used in many computer vision applications, and image generation is at the forefront of these. D-GAN generates images while gradually enhancing the resolution and quality of the images generated. Variants of D-GAN are used for various applications such as image transformation, image deraining [88], increasing image resolution, facial attribute transformation, and image fusion. Table 6 shows some of the steadily growing applications of the D-GAN in computer vision.

Applications of the D-CNN in Computer Vision
Most of the D-CNN applications are related to images, while applications of the D-GAN are related to data generation. This section will progress through the essential applications of the D-CNN.

Image Classification Using the D-CNN.
There are several image classification tasks performed using D-CNN [11,[104][105][106][107][108][109]. One of the vital image classification tasks is handwritten digit recognition, which recognizes digits between 0 and 9; data from the MNIST database are used to predict the correct label for each handwritten digit. MNIST is a database of handwritten digits widely used as a testbed for various deep learning applications. It has 70,000 images, of which 60,000 are training images and 10,000 are testing images [110]. Figure 10 shows sample images from the MNIST dataset. The images are greyscale with 28 × 28 pixels, as represented in Figure 10. The 28 × 28 pixels are flattened into a 1D vector of 784 pixels, and each of these pixels has a value between 0 and 255. A black pixel takes the value 0, a white pixel takes the value 255, and the various shades of grey take values in between. Handwritten digits are recognized using the D-CNN, which is considered the most suitable model for performing this task. Data are downloaded from the MNIST database; the download takes some time during the first run, and subsequent runs fetch the cached data. The data obtained from the MNIST database have features and labels. The features range from 0 to 255, corresponding to the pixels of the 28 × 28 images representing digits 0 through 9. The labels represent the digit 0-9 of the respective image. The data are normalized to lie between 0 and 1. The actual image from the MNIST dataset and the normalized image are represented in Figures 11(a) and 11(b).
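The two preprocessing steps mentioned above, scaling pixel values to the range 0 to 1 and flattening a 28 × 28 image into a 784-element vector, can be sketched without any dataset download. The image below is synthetic (a repeating ramp of grey values) and stands in for an MNIST digit; the helper names `normalize` and `flatten` are assumptions for illustration.

```python
def normalize(image, max_value=255):
    """Scale integer pixel values in [0, max_value] to floats in [0, 1]."""
    return [[pixel / max_value for pixel in row] for row in image]

def flatten(image):
    """Turn a 2D image (list of rows) into a single 1D vector, row by row."""
    return [pixel for row in image for pixel in row]

# Synthetic 28 x 28 greyscale image, not real MNIST data:
# pixel values cycle through 0..255 in row-major order.
image = [[(r * 28 + c) % 256 for c in range(28)] for r in range(28)]

vector = flatten(normalize(image))
```

After these two steps, every entry of `vector` lies between 0 and 1 and the vector has 28 × 28 = 784 entries, matching the input size described in the text.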
Handwritten digit recognition is implemented using eight hidden layers, where the first layer is the convolutional layer used for feature extraction. ReLU is used as the activation function with 32 filters and a kernel size of 3 × 3 pixels. Another convolutional layer is used with ReLU as the activation function, 64 filters, and a kernel size of 3 × 3 pixels. The next hidden layer is a pooling layer where max pooling with a pool size of 2 × 2 pixels is used. Next to the pooling layer is the dropout, a technique adopted to prevent overfitting in the neural network [111]. The dropout technique's key idea is to randomly drop a few units and their connections from the network during training, which significantly reduces overfitting. A dropout rate of 0.25 randomly drops out 1 in 4 units from the network. Between the convolutional layers and the fully connected layers is the flatten layer. The flatten layer transforms the 2D feature maps into a 1D vector that is fed into the fully connected layer. The flattened 1D vector is then passed on to the fully connected layer with ReLU as the activation function. Another dropout with a dropout rate of 0.50 is used. Finally, the output layer with the softmax activation function is used. The softmax activation function is chosen because it produces a probability distribution over the ten classes. Since the task at hand is a multiclass classification, the output layer has ten nodes, or perceptrons, one for each of the ten categories, and each outputs the predicted probability of its class. The perceptron with the highest probability is picked, and the label associated with it is returned as the output. The model is fit over 12 epochs. The test accuracy achieved is 98.56%, and the test loss is 0.0513, as represented in Figures 12(a) and 12(b).
It can be seen that the accuracy increases with the number of epochs, and the loss decreases as the accuracy increases.
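To make the layer stack described above concrete, the following sketch traces the tensor shape and the number of trainable parameters through the eight layers. It is an illustration under stated assumptions, not the authors' implementation: stride-1 convolutions without padding are assumed, and the width of the fully connected layer (`DENSE_UNITS = 128`) is a hypothetical value, since the text does not give it.

```python
def conv2d(shape, kernel, filters):
    """Stride-1, unpadded convolution: returns (output shape, parameter count)."""
    h, w, c = shape
    out = (h - kernel + 1, w - kernel + 1, filters)
    params = (kernel * kernel * c + 1) * filters  # weights plus one bias per filter
    return out, params


def max_pool(shape, pool):
    """Non-overlapping max pooling has no trainable parameters."""
    h, w, c = shape
    return (h // pool, w // pool, c), 0


def flatten(shape):
    h, w, c = shape
    return (h * w * c,), 0


def dense(shape, units):
    (inputs,) = shape
    return (units,), (inputs + 1) * units  # weights plus one bias per unit


DENSE_UNITS = 128  # assumption: the text does not state the dense layer width

layers = [
    ("conv 3x3, 32 filters, ReLU", lambda s: conv2d(s, 3, 32)),
    ("conv 3x3, 64 filters, ReLU", lambda s: conv2d(s, 3, 64)),
    ("max pool 2x2", lambda s: max_pool(s, 2)),
    ("dropout 0.25", lambda s: (s, 0)),  # dropout changes neither shape nor parameters
    ("flatten", flatten),
    ("dense, ReLU", lambda s: dense(s, DENSE_UNITS)),
    ("dropout 0.50", lambda s: (s, 0)),
    ("dense, softmax (10 classes)", lambda s: dense(s, 10)),
]

shape = (28, 28, 1)  # one greyscale MNIST-sized image
trace = []
total_params = 0
for name, layer in layers:
    shape, params = layer(shape)
    total_params += params
    trace.append((name, shape, params))
    print(f"{name:30s} {str(shape):14s} {params:>9d}")
print("total parameters:", total_params)
```

Under these assumptions the flatten layer produces a vector of 9,216 values, and almost all trainable parameters sit in the first fully connected layer, which is one reason the heavier dropout rate of 0.50 is applied there.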
Image classification is a classical problem of computer vision and deep learning. It is challenging because of variations in the image caused by lighting effects and misalignments [112]. Image classification in a computer sense is a course of action for grouping and categorizing images and labeling them based on their features and attributes [80]. It trains computers on a well-defined dataset to interpret and classify images, narrowing the gap between human vision and computer vision [113]. Some of the existing use cases of image classification are gender classification, social media applications such as Facebook and Snapchat, which use image classification to enhance the user experience, and self-driving cars, which recognize the objects on their path, namely, vehicles, people, and other moving objects [114].
(Table continued)

10. Gesture recognition [100]
Objective: To propose a new gesture recognition algorithm based on D-CNN and DCGAN.
Methodology: For a particular gesture, the model recognizes the meaning of the gesture. DCGAN is used to solve overfitting in case of data insufficiency. Preprocessing is done to improve illumination conditions.
Results: An accuracy of 90.45% is achieved.
Dataset: Data collected using a computer, containing 1200 images for each gesture.

11. Face depth estimation [101]
Objective: To develop a D-GAN-based method to estimate the depth map for a given facial image.
Methodology: D-GAN architecture is used to estimate the depth of a 2D image for 3D reconstruction. Data augmentation is done to improve the robustness of the models. Transformations such as slight clockwise rotation, Gaussian blur, and histogram equalization were applied to the image.
Results: Several variants of the D-GAN were evaluated for depth estimation. Wasserstein GAN was found to be the most robust model for depth estimation.
Dataset: The Texas 3D face recognition database and Bosphorus database for 3D face analysis.

12. Image enhancement [102]
Objective: To propose an image enhancement model using the conditional D-GAN based on the nonsaturating game.
Methodology: The super-resolution method is combined with the D-GAN to generate a clearer image. The architecture has 23 layers composed of convolution layers with the ReLU activation function.
Results: The model is compared with existing methods, which showed an improvement in peak signal-to-noise ratio by 2.38 dB.
Dataset: Images from Flickr and ImageNet datasets were used without augmentation.

13. Retinal image synthesis [103]
Objective: To propose multiple-channels-multiple-landmarks, a preprocessing pipeline to synthesize retinal images from optic cup images.
Methodology: Residual neural network and U-Net were integrated to form the residual U-Net architecture. Residual U-Net is capable of capturing finer-scale details. Multiple-landmark maps comprise a batch normalization layer, a convolution layer, and ReLU activation. The final layer has a sigmoid activation function.
Results: The proposed multiple-channels-multiple-landmarks model outperformed the existing single vessel-based methods. Pix2Pix, using the proposed method, generated realistic images.
Dataset: Public fundus image datasets DRIVE and DRISHTI-GS were used.

Amerini et al. [115] proposed a novel framework called FusionNet by combining two D-CNN architectures to identify the source social network based on the images. 1D-CNN learns discriminative features, while 2D-CNN architecture infers unique attributes from the image. The learned features are fused using FusionNet, and then the classification is performed. Distinctive traces of social networks embedded in the images are exploited to identify the source. The full-frame images are broken into fixed-dimension patches, and the patches are then classified independently. Each of the image patches is processed with D-CNN, and the predictions are obtained. Furthermore, to get the prediction at the image level, a voting strategy is applied across the patches. The label with the majority vote is assigned as the final prediction label. The average accuracy of 94.77% is achieved at the patch level.
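The patch-level voting step can be illustrated with a short sketch. It assumes each patch has already been classified and only shows how the image-level label could be derived by majority vote; the function name `image_level_label` and the example labels are hypothetical, and the tie-breaking rule is an assumption, since the text does not describe one.

```python
from collections import Counter


def image_level_label(patch_labels):
    """Return the label predicted for the largest number of patches.

    Ties are broken in favour of the label that appears first in
    `patch_labels` (an assumption; the source does not specify tie handling).
    """
    if not patch_labels:
        raise ValueError("at least one patch prediction is required")
    label, _count = Counter(patch_labels).most_common(1)[0]
    return label


# Hypothetical per-patch predictions for one image.
patch_predictions = ["facebook", "flickr", "facebook", "twitter", "facebook"]
print(image_level_label(patch_predictions))
```

Voting makes the image-level decision tolerant of a few misclassified patches, which is why image-level accuracy can exceed patch-level accuracy.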

Image Localization and Detection Using the D-CNN.
At a glance, human vision is capable of detecting an object, its size, location, and various other features. Object detection using deep learning allows computers to play a crucial role in many real-world applications such as robots, smart vehicles, and self-driving cars [116]. Object detection is one of the most challenging problems and the most important goal of computer vision. Object detection involves identifying different objects in an image using a bounding box. The identified objects can be further analyzed at a granular level to dig deeper into the image. Earlier, object identification was performed by splitting the image into multiple pieces and then passing them into a classifier for object detection. Splitting the images into multiple pieces is performed using a sliding window algorithm. In this approach, the detection window is slid over the actual image at multiple positions, and each window position yields a smaller piece of the image. Robust visual descriptors needed for object detection are extracted from the image using image processing. The convolutional neural network used visual descriptors to make object or nonobject decisions [117]. Since the process had to be repeated multiple times, it was computationally expensive.
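The sliding window procedure can be sketched as follows. This is a minimal illustration that only enumerates window positions and crops the corresponding image pieces; the classifier applied to each piece is omitted, and the window size and stride are arbitrary example values.

```python
def sliding_windows(height, width, window, stride):
    """Yield (top, left, bottom, right) boxes of a square window slid over an image."""
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            yield top, left, top + window, left + window


def crop(image, box):
    """Cut the piece of a 2D list-of-rows image that lies inside `box`."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


# Toy 6 x 8 image whose pixel value encodes its position (row * 8 + column).
image = [[r * 8 + c for c in range(8)] for r in range(6)]

boxes = list(sliding_windows(height=6, width=8, window=4, stride=2))
pieces = [crop(image, box) for box in boxes]
print(len(boxes), boxes[0], boxes[-1])
```

Every window position requires one classifier call, and repeating the scan for several window sizes multiplies that count, which is the source of the computational cost noted above.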
To overcome the sliding window algorithm's shortcomings, object detection was performed using image segmentation. Segmentation is categorized into boundary-based, thresholding, and region-based approaches. When a digital image is passed, the neurons are synchronized based on pixels with similar intensities to form a connected region [118]. Segmentation can be contextual if spatial relationships in an image are considered or noncontextual if they are not. The goal of image segmentation is to alter the representation of an image into a form that is meaningful and easier to analyze. The accuracy of object detection depends on the quality of image segmentation. As in image classification, deeper networks exhibited better performance in object detection. Object localization is the next level of object classification, where the position of the object is also determined with a bounding box and the object is labeled. The difference between localization and detection is that classification with localization handles only one object, whereas detection finds multiple objects in an image and labels them.
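As a simple illustration of how pixels with similar intensities can be grouped into connected regions and then localized with bounding boxes, the sketch below uses plain thresholding followed by 4-connected region labeling. This is a deliberately basic stand-in, not the neuron-synchronization method of [118]; the threshold value and the toy image are assumptions.

```python
from collections import deque


def threshold(image, level):
    """Binary mask: True where the pixel intensity is at least `level`."""
    return [[pixel >= level for pixel in row] for row in image]


def connected_regions(mask):
    """Group 4-connected foreground pixels; returns a list of (row, col) lists."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    regions = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            region = []
            while queue:  # breadth-first flood fill from the seed pixel
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(region)
    return regions


def bounding_box(region):
    """Smallest (top, left, bottom, right) box containing the region, inclusive."""
    ys = [y for y, _ in region]
    xs = [x for _, x in region]
    return min(ys), min(xs), max(ys), max(xs)


# Toy greyscale image with two bright objects on a dark background.
image = [
    [0,   0,   0, 0,   0, 0],
    [0, 200, 210, 0,   0, 0],
    [0, 220, 205, 0, 180, 0],
    [0,   0,   0, 0, 190, 0],
]

regions = connected_regions(threshold(image, level=128))
boxes = [bounding_box(region) for region in regions]
print(boxes)
```

Finding one box corresponds to localization of a single object, while returning a box for every region corresponds to detection of multiple objects.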
Tu et al. [119] proposed a method to detect passion fruits based on multiple-scale faster region-based CNN. The detection phase involves multiple-scale feature extractors that extract low- and high-level features. Data augmentation is done to enlarge the training data size. Pretrained residual neural network-101 architecture is used for object detection.

Document Analysis Using the D-CNN.
Documents are the source of information for several cognitive processes, namely, graphic understanding, document retrieval, and OCR. Document analysis plays a crucial role in cognitive computing to extract information from document images. Document analysis is performed by identifying and categorizing images based on regions of interest.
There are several existing methods for document analysis, such as pixel-based classification methods, region-based classification methods, and connected component classification methods. The region-based classification method segments document images into zones and classifies them into semantic classes. Pixel-based classification methods perform document analysis by taking each pixel and generating labeled images using the classifier. The connected component method creates object hypotheses using local information, which are further inspected, refined, and classified [120].
D-CNN is widely adopted for document analysis to reduce computational complexity, cost, and data without compromising accuracy. With D-CNN, it is possible to classify images directly from segmented objects without extracting handcrafted features. Maryem Rhanoui et al. [121] performed document-level sentiment analysis using a combination of D-CNN and bidirectional long short-term memory and achieved an accuracy of 90.66%. The features are extracted by the D-CNN, and the extracted features are passed as the input to the long short-term memory. Vectors built by word embedding are passed as the input to the CNN. Four filters are applied, and a max pooling layer is applied after each filter. The results of max pooling are concatenated and passed as the input to the bidirectional long short-term memory. The output of the bidirectional long short-term memory is passed as the input to a fully connected layer. The fully connected layer connects each piece of information from the input with the output information. Finally, the softmax function is applied as an activation function to produce the desired output by assigning classes to articles.
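The convolution, max pooling, and concatenation stage of such a text model can be sketched in plain Python. This is a simplified illustration, not the implementation of [121]: the embeddings and filter weights are made-up numbers, max pooling is assumed to be taken over the whole sequence, and the bidirectional long short-term memory and the classifier that follow are omitted.

```python
def conv1d_relu(sequence, kernel, bias=0.0):
    """Slide a filter over a sequence of embedding vectors and apply ReLU.

    `kernel` holds one weight vector per covered token, so its length is the
    filter width.
    """
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        total = bias
        for offset in range(width):
            token = sequence[start + offset]
            total += sum(t * w for t, w in zip(token, kernel[offset]))
        outputs.append(max(0.0, total))
    return outputs


def cnn_features(sequence, kernels):
    """One max-pooled value per filter, concatenated into a feature vector."""
    return [max(conv1d_relu(sequence, kernel)) for kernel in kernels]


# Hypothetical 2-dimensional word embeddings for a 5-token document.
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, -1.0]]

# Four hypothetical filters of widths 1, 2, 2, and 3.
kernels = [
    [[1.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, -1.0], [1.0, -1.0]],
    [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],
]

features = cnn_features(sequence, kernels)
print(features)
```

Each filter responds to a different local word pattern, and the concatenated vector summarizes which patterns occur in the document before the recurrent layer models longer-range context.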

Speech Recognition Using the D-CNN.
Human-machine interaction for intelligent devices, namely, domestic robots, smartphones, and autonomous cars, is becoming increasingly common in daily life. Hence, noise-robust automatic speech recognition has become crucial for the human-machine interface. The basic idea behind audiovisual speech recognition is to use visual information from the speaker's lip movements to complement corrupted audio speech inputs. Automatic speech recognition models the relationship between phones and acoustic speech signals by extracting features and classifying speech signals. This is usually performed in two steps. In the first step, the raw speech signal is transformed into features using dimensionality reduction and information selection. The second step estimates phonemes using generative or discriminative models. Phoneme class conditional probabilities can be estimated using the D-CNN with the raw speech signal as the input. The features are learned from the raw speech signal in both continuous speech recognition and phoneme recognition tasks.
Kuniaki Noda et al. [122] proposed a CNN-based approach for audiovisual speech recognition. Here, the authors first used a denoising autoencoder to acquire noise-robust audio features. Then, the authors used the CNN to extract features from mouth area images. The training data for the CNN were raw images and their corresponding phoneme outputs. Lastly, the authors applied the multistream hidden Markov model to integrate audio and visual hidden Markov models trained with the corresponding features. The model achieved a 65% word recognition rate with denoised mel-frequency cepstral coefficients when the signal-to-noise ratio of the audio signal input was under 10 dB.
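For orientation, a multistream hidden Markov model is commonly written as a weighted combination of the per-stream observation log-likelihoods. The expression below is that generic textbook form, not a formula quoted from [122]; the symbols $b_j^{A}$, $b_j^{V}$ (audio and visual observation likelihoods of state $j$), $o_t^{A}$, $o_t^{V}$ (audio and visual features at time $t$), and the stream weights $\lambda_A$, $\lambda_V$ are notation introduced here.

```latex
\log b_j\!\left(o_t^{A}, o_t^{V}\right)
  = \lambda_A \log b_j^{A}\!\left(o_t^{A}\right)
  + \lambda_V \log b_j^{V}\!\left(o_t^{V}\right),
\qquad \lambda_A + \lambda_V = 1,\quad \lambda_A, \lambda_V \ge 0 .
```

In this form, lowering $\lambda_A$ when the audio is noisy shifts the decision towards the visual stream, which matches the idea of complementing corrupted audio with lip-movement information.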

Image-to-Image Translation Using the Deep GAN.
Remarkable progress has been achieved in image-to-image translation with the advent of the deep GAN. Image-to-image translation aims to learn a mapping that translates an image between two different domains, from the source to the target domain, without losing the original image's identity and while reducing the reconstruction loss. Some of the essential image-to-image translations are converting real-world images into cartoon images, coloring greyscale images, and changing a nighttime picture to a daylight picture.
The role of the generator in the D-GAN is to confuse the discriminator by generating images that are close to the real images. D-GAN is incredibly successful in super-resolution, representation learning [123], image generation [124,125], and image-to-image translation. Kim et al. proposed a novel method for image-to-image translation by incorporating a learnable normalization function and a new attention module. Existing attention-based models lagged behind in handling geometric changes. This model is incredibly successful in translating images with massive shape changes. The auxiliary classifier is used to obtain an attention map that distinguishes between the source and target domains. This is done to focus on the region of interest, ignoring other minor regions. Attention maps are inserted both in the generator and in the discriminator to focus on the region of interest. The attention map embedded in the generator focuses on the essential areas that distinguish the two domains. In contrast, the attention map embedded in the discriminator focuses on distinguishing the target domain's real and fake images. The choice of the normalization mechanism dramatically improves the quality of the transformed images. Adaptive Layer-Instance Normalization is used to adaptively select a ratio between layer normalization and instance normalization, and the parameters are learned during the training process. The class activation map gives discriminative image regions to determine the class. The model's performance is superior to the existing state-of-the-art methods on both style transfer and object transfiguration.
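The adaptive ratio between the two normalizations can be made explicit with the usual formulation of Adaptive Layer-Instance Normalization. The notation is introduced here for illustration: $a$ is an activation, $\mu_I, \sigma_I^2$ are the per-channel (instance) statistics, $\mu_L, \sigma_L^2$ are the per-layer statistics, $\epsilon$ is a small constant for numerical stability, $\gamma$ and $\beta$ are the learned scale and shift, and $\rho \in [0, 1]$ is the learned mixing ratio.

```latex
\hat{a}_I = \frac{a - \mu_I}{\sqrt{\sigma_I^{2} + \epsilon}}, \qquad
\hat{a}_L = \frac{a - \mu_L}{\sqrt{\sigma_L^{2} + \epsilon}}, \qquad
\mathrm{AdaLIN}(a, \gamma, \beta)
  = \gamma \left( \rho\, \hat{a}_I + (1 - \rho)\, \hat{a}_L \right) + \beta .
```

A value of $\rho$ close to 1 makes the layer behave like instance normalization, and a value close to 0 makes it behave like layer normalization, so learning $\rho$ lets the network choose the mix per layer.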

Image Denoising.
Image denoising removes noise from images while retaining their details. Image denoising has improved significantly with the advancement of the D-CNN [126]. However, D-CNN models focus mainly on reducing the mean squared error, resulting in images that lack high-frequency details. To overcome this issue, D-GAN is applied to remove noise from images [127,128]. Zhong et al. proposed a method to remove noise from images using the D-GAN. The architecture of the generator in the D-GAN has a convolutional block and eight dense blocks. Each block comprises a convolutional layer, batch normalization, and ReLU activation. Each layer, except the last layer in the network, receives the outputs of all previous layers through skip connections. This method effectively reduces the vanishing gradient problem. The convolutional layer extracts low-level features, while the dense blocks extract the high-level features. The generator network is capable of learning the residual difference between the ground truth and the noisy image. The final 3 × 3 convolutional layer generates the output images. The discriminator network differentiates the fake image from the ground truth image, which makes the final denoised image visually appealing. The model can handle different types of noise, but it cannot handle unknown real noises.
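The idea of learning the residual, together with the mean squared error that pure D-CNN denoisers minimize, can be illustrated on a toy one-dimensional signal. This sketch assumes the residual is defined as the noisy signal minus the ground truth, and it uses a hypothetical perfect residual prediction in place of a trained generator.

```python
def mse(a, b):
    """Mean squared error between two equally long signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def denoise(noisy, predicted_residual):
    """Residual learning: subtract the predicted noise from the noisy input."""
    return [n - r for n, r in zip(noisy, predicted_residual)]


# Toy signal and noise (values chosen to be exactly representable as floats).
clean = [0.25, 0.5, 0.75, 1.0]
noise = [0.125, -0.25, 0.0, 0.125]
noisy = [c + n for c, n in zip(clean, noise)]

# Stand-in for the generator output: a hypothetical perfect residual estimate.
restored = denoise(noisy, noise)

print(mse(noisy, clean), mse(restored, clean))
```

Predicting the residual rather than the clean image means the network only has to model the noise, which is usually a simpler target than the full image content.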

Face Aging and Facial Attribute Editing.
Deep GAN-based methods have been proposed to alter facial attributes to anticipate a person's future look. Conditional GAN has been widely adopted to perform face aging [129]. D-GANs are also incorporated in facial attribute editing, which manipulates the attributes of a facial image to generate a face with the desired attribute while retaining the other details of the image. The latent representation of the facial images is decoded to edit the facial attributes. GAN-based methods have been proposed for facial attribute editing that change only the desired attributes while preserving the identity of the facial image [130]. The work uses reconstruction learning to preserve the attribute details and "only change what you want." The authors applied attribute classification constraints to the generated image rather than imposing constraints on the latent representation to guarantee that the desired attributes change correctly. The facial attributes are manipulated to change the facial image between with and without a beard, black hair and brown hair, mouth open and mouth closed, brown hair and blond hair, and young and old. Yujun Shen et al. [131] interpreted the latent codes of trained models such as StyleGAN and progressive GAN and encoded various semantics in the interpreted latent space. Given a synthesized face, different face attributes such as pose, age, and expression are edited without having to retrain the D-GAN model. Table 7 shows the comparison of handwritten digits generated by D-GAN variants.

Open Problems and Future Opportunities for Computational Visual Perception-Driven Image Analysis
This paper discussed the development of computational visual perception with D-GAN and D-CNN. The advantages and disadvantages of various architectures of D-GAN models are discussed. Future research with D-GAN can be performed on mode collapse, nonconvergence, and training difficulties. Also, various other shortcomings of the CNN and their solutions are reviewed. Table 8 lists the challenges and open problems for computational visual perception-driven image analysis.
(i) Handwriting recognition mainly relies on the language model furnished to the system and on the quality of character modeling. Further research can be performed to obtain a handwriting recognition system that is faster and more accurate.
(ii) Semantic mapping is promising in autonomous vehicles, but state-of-the-art methods still need improvement to produce reliable tools. To do this, better 3D geometry must be included in mapping to achieve more accurate results in the semantic segmentation process. Moreover, map updating has to be done to ensure that maps are always coherent with reality.
(iii) The biggest problem with calibration in webcam-based eye trackers is variation in the head pose. This has to be handled without modifying the base components of the system. Future work can be directed towards 3D model-based head tracking, geometrically calculated gaze estimation, and an accurate iris segmentation method.
(iv) Lumen center detection is based on the geometry and appearance of the lumen. Future work can be directed towards lumen segmentation using the previously computed center point of the lumen as the seed and filling tracheal ring discontinuities to improve segmentation accuracy.
(v) In query-by-string word spotting, the latent semantic model's performance improves when more samples are used to build the model. However, acquiring the transcription of handwritten documents can be tricky, so synthetic information can be used to train the whole framework. Another problem that requires a solution is the vast possible parameter combinations of the [...]
[...] interval. Image registration techniques can be used to register these to each other, and the movement can be recorded, which can help predict future landslides. This technique will be beneficial when satellite images of areas that are prone to landslides are available. Deviations in the positions of hills are estimated to predict the number of future landslides.
(x) Research related to underwater imaging is evolving to expose undiscovered underwater species. It is not possible to identify all the underwater species by continuously watching recorded underwater videos. Therefore, an automated system to classify or detect underwater species is required.
This article surveyed the current opportunities and future challenges in many emerging domains such as handwriting recognition, semantic mapping, webcam-based eye trackers, lumen center detection, query-by-string word spotting, intermittently closed and open lakes and lagoons, and landslides. Future research with the D-GAN has to be directed towards mode collapse, nonconvergence, and training difficulties. Though there are vast improvements such as weight regularization, weight pruning, and Nash equilibrium, future research in this area is still mandatory. D-GAN in the security domain has more research scope, as adversarial attacks on neural networks have become very common. A slight perturbation in the samples may lead to wrong classification by neural networks. It is therefore necessary to make D-GANs more robust to adversarial attacks. Though D-GAN is put forward as unsupervised learning, adding labels to the data will significantly improve the quality of the data the D-GAN generates. Modifying the D-GAN in this way is one of the future research directions.

Conflicts of Interest
C. investigated the data and performed the methodology. C-Y. C. and K. S. carried out the project administration and validated the data. N. A. R., P. D. R. V., K. S., and U. T. wrote, reviewed, and edited the manuscript. All authors read and agreed to the published version of the manuscript.