Deep Learning in Visual Computing and Signal Processing

Deep learning is a subfield of machine learning, which aims to learn a hierarchy of features from input data. Nowadays, researchers have intensively investigated deep learning algorithms for solving challenging problems in many areas such as image classification, speech recognition, signal processing, and natural language processing. In this study, we not only review typical deep learning algorithms in computer vision and signal processing but also provide detailed information on how to apply deep learning to specific areas such as road crack detection, fault diagnosis, and human activity detection. Besides, this study also discusses the challenges of designing and training deep neural networks.


Introduction
Deep learning methods are a group of machine learning methods that learn features hierarchically, from lower level to higher level, by building a deep architecture. Because they learn features automatically at multiple levels, a system can learn a complex mapping function f: X → Y directly from data, without the help of human-crafted features. This ability is crucial for high-level feature abstraction, since high-level features are difficult to describe directly from raw training data. Moreover, with the sharp growth of data, the ability to learn high-level features automatically will become even more important.
The defining characteristic of deep learning methods is that their models all have deep architectures, that is, networks with multiple hidden layers. In contrast, a shallow architecture has only a few hidden layers (1 to 2 layers). Deep architectures are loosely inspired by the mammalian brain. When given an input percept, the mammalian brain processes it using different areas of the cortex, which abstract different levels of features. Researchers usually describe such concepts in hierarchical ways, with many levels of abstraction. Furthermore, mammalian brains also seem to process information through many stages of transformation and representation. A very clear example is the primate visual system, in which information is processed in a sequence of stages: edge detection, primitive shapes, and more complex visual shapes.
Inspired by the deep architecture of the mammalian brain, researchers investigated deep neural networks for two decades but did not find effective training methods before 2006: good experimental results were obtained only for neural networks with one or two hidden layers, not for networks with more hidden layers. In 2006, Hinton et al. proposed deep belief networks (DBNs) [1], with a learning algorithm that uses unsupervised learning to greedily train a deep neural network layer by layer. This training method, which is called deep learning, turned out to be very effective and efficient in training deep neural networks.

Deep Learning Algorithms
Deep learning algorithms have been extensively studied in recent years. As a consequence, there are a large number of related approaches. Generally speaking, these algorithms can be grouped into two categories based on their architectures: those built on restricted Boltzmann machines (RBMs) and convolutional neural networks (CNNs). In the following sections, we briefly review these deep learning methods and their developments.

Deep Neural Network. This section introduces how to build and train RBM-based deep neural networks (DNNs).
The building and training procedures of a DNN contain two steps. First, build a deep belief network (DBN) by stacking restricted Boltzmann machines (RBMs) and feed unlabeled data to pretrain the DBN. The pretrained DBN provides initial parameters for the deep neural network. In the second step, labeled data is fed to train the DNN using backpropagation. After two steps of training, a trained DNN is obtained. This section is organized as follows. Section 2.1.1 introduces the RBM, which is the basic component of the DBN. In Section 2.1.2, the RBM-based DNN is introduced.

Restricted Boltzmann Machines.
RBM is an energy-based probabilistic generative model [26][27][28][29]. It is composed of one layer of visible units and one layer of hidden units. The visible units represent the input vector of a data sample and the hidden units represent features that are abstracted from the visible units. Every visible unit is connected to every hidden unit, whereas no connection exists within the visible layer or hidden layer. Figure 1 illustrates the graphical model of the restricted Boltzmann machine.
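As a result of the lack of hidden-hidden and input-input interactions, the energy function of an RBM is

$$E(v, h; \theta) = -b^{T} v - c^{T} h - h^{T} W v, \tag{1}$$

where $\theta = \{W, b, c\}$ are the parameters of the RBM, which need to be learned during the training procedure; $W$ denotes the weights between the visible layer and the hidden layer; and $b$ and $c$ are the biases of the visible layer and the hidden layer, respectively. This model is called a binary RBM because the vectors $v$ and $h$ contain only binary values (0 or 1).

We can obtain a tractable expression for the conditional probability $P(h \mid v)$ [30]:

$$P(h \mid v) = \prod_{j} P(h_j \mid v). \tag{2}$$

For a binary RBM, where $h_j \in \{0, 1\}$, the equation for a hidden unit's output given its input is

$$P(h_j = 1 \mid v) = \mathrm{sigm}(c_j + W_{j\cdot}\, v), \tag{3}$$

where $\mathrm{sigm}(x) = 1/(1 + e^{-x})$ and $W_{j\cdot}$ is the $j$th row of $W$. Because $v$ and $h$ play a symmetric role in the energy function, the following equation can be derived:

$$P(v \mid h) = \prod_{i} P(v_i \mid h), \tag{4}$$

and for the visible unit $v_i \in \{0, 1\}$, we have

$$P(v_i = 1 \mid h) = \mathrm{sigm}(b_i + W_{\cdot i}^{T} h), \tag{5}$$

where $W_{\cdot i}$ is the $i$th column of $W$.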
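The conditional probabilities of a binary RBM can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the cited works: it assumes that the weight matrix is stored with one row per hidden unit, and the function names and the toy parameter values are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Logistic function used by both conditionals of a binary RBM."""
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, c):
    """P(h_j = 1 | v) for every hidden unit j; W[j][i] links hidden j to visible i."""
    return [sigmoid(c[j] + sum(W[j][i] * v[i] for i in range(len(v))))
            for j in range(len(c))]


def visible_probs(h, W, b):
    """P(v_i = 1 | h) for every visible unit i, using column i of W."""
    return [sigmoid(b[i] + sum(W[j][i] * h[j] for j in range(len(h))))
            for i in range(len(b))]


def sample_binary(probs, rng):
    """Draw one independent Bernoulli sample per unit."""
    return [1 if rng.random() < p else 0 for p in probs]


def gibbs_step(v, W, b, c, rng):
    """One alternating Gibbs step: sample h given v, then v given h."""
    h = sample_binary(hidden_probs(v, W, c), rng)
    v_new = sample_binary(visible_probs(h, W, b), rng)
    return h, v_new


# Toy RBM with 3 visible and 2 hidden units (illustrative values only).
rng = random.Random(0)
W = [[0.5, -0.5, 1.0],
     [-1.0, 0.0, 0.5]]
b = [0.0, 0.0, 0.0]
c = [0.0, 0.0]
v0 = [1, 0, 1]

p_h = hidden_probs(v0, W, c)
h0, v1 = gibbs_step(v0, W, b, c, rng)
```

Because no connections exist within a layer, every unit of a layer can be sampled independently once the other layer is fixed, which is what makes the alternating step above cheap.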
Although binary RBMs can achieve good performance when dealing with discrete inputs, their structure limits their ability to handle continuous-valued inputs. Thus, in order to achieve better performance on continuous-valued inputs, Gaussian RBMs are utilized for the visible layer [4,31]. The energy function of a Gaussian RBM is

$$E(v, h; \theta) = \sum_{i} \frac{(v_i - \mu_i)^2}{2\sigma_i^2} - c^{T} h - \sum_{i,j} h_j W_{ji} \frac{v_i}{\sigma_i}, \tag{6}$$

where $\mu_i$ and $\sigma_i$ are the mean and the standard deviation of visible unit $i$. Note that only the visible layer $v$ is continuous-valued; the hidden layer $h$ is still binary. In practice, the input data is normalized, which makes $\mu_i = 0$ and $\sigma_i = 1$. Therefore, (6) becomes

$$E(v, h; \theta) = \frac{1}{2} v^{T} v - c^{T} h - h^{T} W v. \tag{7}$$

2.1.2. Deep Neural Network.
Hinton et al. [1] showed that RBMs can be stacked and trained in a greedy manner to form so-called deep belief networks (DBNs) [32]. DBNs are graphical models which learn to extract a deep hierarchical representation of the training data. A DBN with $\ell$ layers models the joint distribution between the observed vector $v$ and the $\ell$ hidden layers $h^{k}$ as follows [30]:

$$P(v, h^{1}, \ldots, h^{\ell}) = \left( \prod_{k=1}^{\ell-1} P(h^{k-1} \mid h^{k}) \right) P(h^{\ell-1}, h^{\ell}), \tag{8}$$

where $v = h^{0}$, $P(h^{k-1} \mid h^{k})$ is a conditional distribution for the visible units conditioned on the hidden units of the RBM at level $k$, and $P(h^{\ell-1}, h^{\ell})$ is the visible-hidden joint distribution in the top-level RBM. This is illustrated in Figure 2.
As Figure 2 shows, the hidden layer of a low-level RBM is the visible layer of the next higher-level RBM, which means that the output of the low-level RBM is the input of the high-level RBM. By using this structure, the high-level RBM is able to learn high-level features from the low-level features generated by the low-level RBM. Thus, a DBN allows latent variable space in its hidden layers. In order to train a DBN effectively, we need to train its RBMs successively from low level to high level.
After the unsupervised pretraining step for the DBN, the next step is to use the parameters of the DBN to initialize the DNN and then train the DNN in a supervised way using back-propagation. The parameters of the DNN are initialized as follows: the weights and biases of every layer except the top layer are set to the values of the corresponding DBN layer, and the top-layer weights and biases are initialized stochastically. After that, the whole network can be fine-tuned by back-propagation in a supervised way using labeled data.
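The initialization step described above can be sketched as follows. This is only an illustration under stated assumptions: each pretrained DBN layer is represented as a pair of a weight matrix (one row per hidden unit) and a hidden bias vector, the top layer is drawn from a small zero-mean Gaussian, and the name `init_dnn_from_dbn` and the scale of 0.01 are hypothetical choices that do not come from the cited works.

```python
import copy
import random


def init_dnn_from_dbn(dbn_params, n_outputs, rng, scale=0.01):
    """Build initial DNN parameters from a pretrained DBN.

    dbn_params: list of (W, c) pairs, one per stacked RBM, where W has one
    row per hidden unit and c is the hidden bias vector.
    The returned list has one extra (W, b) pair for the randomly
    initialized top (output) layer.
    """
    # Copy the pretrained layers so fine-tuning does not modify the DBN.
    dnn = [(copy.deepcopy(W), list(c)) for W, c in dbn_params]

    # The top layer takes the last hidden layer as input.
    n_top_inputs = len(dbn_params[-1][1])
    W_top = [[rng.gauss(0.0, scale) for _ in range(n_top_inputs)]
             for _ in range(n_outputs)]
    b_top = [rng.gauss(0.0, scale) for _ in range(n_outputs)]
    dnn.append((W_top, b_top))
    return dnn


# Toy DBN: 3 visible units -> 2 hidden units -> 2 hidden units.
dbn = [
    ([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], [0.0, 0.1]),
    ([[0.7, 0.8], [0.9, 1.0]], [0.2, 0.3]),
]
dnn = init_dnn_from_dbn(dbn, n_outputs=3, rng=random.Random(0))
```

The resulting parameter list would then be passed to a supervised training loop for fine-tuning with back-propagation, which is outside the scope of this sketch.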

Convolutional Neural Network.
The convolutional neural network is one of the most powerful classes of deep neural networks in image processing tasks. It is highly effective and commonly used in computer vision applications [33]. A convolutional neural network contains three types of layers: convolution layers, subsampling layers, and full connection layers. The whole architecture of a convolutional neural network is shown in Figure 3. A brief introduction to each type of layer is provided in the following paragraphs.

Convolution Layer.
As Figure 4 shows, in a convolution layer, the left matrix is the input, which is a digital image, and the right matrix is a convolution matrix. The convolution layer takes the convolution of the input image with the convolution matrix and generates the output image. Usually the convolution matrix is called a filter and the output image is called a filter response or filter map. An example of the convolution calculation is demonstrated in Figure 5. Each time, a block of pixels is convolved with a filter and generates one pixel in a new image.
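The sliding-window computation of a convolution layer can be sketched as follows. The sketch assumes a single-channel image, a stride of 1, and no padding, and, as is common in CNN implementations, it does not flip the filter (strictly speaking a cross-correlation). The function name and the toy values are illustrative only.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image and return the filter response.

    image and kernel are lists of rows; each output pixel is the sum of the
    element-wise products between the kernel and one block of pixels.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    response = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        response.append(row)
    return response


image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2],
         [0, 1, 2, 3]]
kernel = [[1, 0],
          [0, -1]]
response = conv2d_valid(image, kernel)
# response == [[-4, -4, 2], [-4, -4, 4], [6, 6, 6]]
```

A 4 × 4 input and a 2 × 2 filter give a 3 × 3 filter response, since the filter fits at three positions along each axis.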

Subsampling Layer.
The subsampling layer is an important layer of a convolutional neural network. Its main purpose is to reduce the size of the input image in order to give the neural network more invariance and robustness. The most commonly used subsampling method in image processing tasks is max pooling, so the subsampling layer is frequently called the max pooling layer. The max pooling method is shown in Figure 6. The image is divided into blocks and the maximum value of each block becomes the corresponding pixel value of the output image. There are two reasons to use a subsampling layer. First, it reduces the size of the feature maps, so the subsequent layers have fewer parameters and the network is faster to train. Second, it makes the convolution layers tolerant to small translations and rotations of the input pattern.
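Max pooling over non-overlapping blocks can be sketched as follows. The block size of 2 and the function name are assumptions made for illustration; rows and columns that do not fill a complete block are simply dropped in this sketch.

```python
def max_pool(image, size=2):
    """Divide the image into size x size blocks and keep each block's maximum."""
    rows = len(image) // size
    cols = len(image[0]) // size
    return [[max(image[r * size + i][c * size + j]
                 for i in range(size) for j in range(size))
             for c in range(cols)]
            for r in range(rows)]


feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 5],
               [0, 1, 9, 8],
               [2, 6, 7, 3]]
pooled = max_pool(feature_map)
# pooled == [[4, 5], [6, 9]]
```

The 4 × 4 input shrinks to 2 × 2, and moving a maximum value to another position inside its block leaves the output unchanged, which illustrates the tolerance to small translations.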

Full Connection Layer.
Full connection layers are similar to the layers of a traditional feed-forward neural network. They map the output of the preceding layers to a vector of predefined length. This vector can be used to classify the input into certain categories or be taken as a representation vector for further processing.

Training Strategy
Compared to conventional machine learning methods, the advantage of deep learning is that it can build deep architectures to learn more abstract, multiscale features. Unfortunately, the large number of parameters in deep architectures may lead to overfitting.

Data Augmentation.
The key idea of data augmentation is to generate additional data without introducing extra labeling costs. In general, data augmentation is achieved by deforming the existing samples. Mirroring, scaling, and rotation are the most common methods for data augmentation [34][35][36]. Wu et al. [37] extended the deforming idea to color space and provided color casting, vignetting, and lens distortion techniques.

For visual tasks, when it is hard to get sufficient data, a recommended approach is to pretrain the CNN on natural images (e.g., ImageNet) and then fine-tune it on the specific data set [36,38,39]. Tajbakhsh et al. showed that, for medical applications, the use of a pretrained CNN with adequate fine-tuning outperformed or, in the worst case, performed as well as a CNN trained from scratch [38].
On the other hand, a deep learning architecture contains hundreds of thousands of parameters to be initialized, even when sufficient data is available. Erhan et al. provided evidence that the pretraining step helps train deep architectures such as deep belief networks and stacked autoencoders [40]. Their experiments supported a regularization explanation for the effect of pretraining, which helps the deep-learned model generalize better from the training data set.
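The mirroring and rotation deformations used for data augmentation can be sketched as follows. This is a minimal example that treats an image as a list of pixel rows and restricts rotation to multiples of 90 degrees; the function names are hypothetical, and practical pipelines typically also apply scaling and arbitrary rotation angles.

```python
def mirror(image):
    """Flip the image horizontally."""
    return [list(reversed(row)) for row in image]


def rotate90(image):
    """Rotate the image clockwise by 90 degrees."""
    return [list(row) for row in zip(*image[::-1])]


def augment(image, label):
    """Return the original sample plus deformed copies that share its label."""
    rot90 = rotate90(image)
    rot180 = rotate90(rot90)
    rot270 = rotate90(rot180)
    variants = [image, mirror(image), rot90, rot180, rot270]
    return [(variant, label) for variant in variants]


sample = [[1, 2],
          [3, 4]]
augmented = augment(sample, label="crack")
```

One labeled image yields five training samples here without any additional labeling effort, since every deformed copy inherits the label of the original.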

Applications
Deep learning has been widely applied in various fields, such as computer vision [25], signal processing [24], and speech recognition [41]. In this section, we briefly review several recently developed applications of deep learning (all results are taken from the original papers).

CNN-Based Applications in Visual Computing.
As we know, convolutional neural networks are very powerful tools for image recognition and classification. Different types of CNNs are often tested on the well-known ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) data set and have achieved state-of-the-art performance in recent years [42][43][44]. After winning the ImageNet competition in 2012 [42], CNN-based methods have brought about a revolution in computer vision. CNNs have been applied with great success to object detection [35,45,46], object segmentation [47,48], and recognition of objects and regions in images [49][50][51][52][53][54]. Compared with hand-crafted features, for example, Local Binary Patterns (LBP) [55] and Scale Invariant Feature Transform (SIFT) [56], which need additional classifiers to solve vision problems [57][58][59], CNNs can learn the features and the classifiers jointly and provide superior performance. In the next subsections, we review how deep-learned CNNs are applied to recent face recognition and road crack detection problems in order to provide an overview of applying CNNs to specific problems.

CNN for Face Recognition.
Face recognition has been one of the most important computer vision tasks since the 1970s [60]. Face recognition systems typically consist of four steps. First, given an input image with one or more faces, a face detector locates and isolates faces. Then, each face is preprocessed and aligned using either 2D or 3D modeling methods. Next, a feature extractor extracts features from an aligned face to obtain a low-dimensional representation (or embedding). Finally, a classifier makes predictions based on the low-dimensional representation. The key to good performance in face recognition systems is obtaining an effective low-dimensional representation. Face recognition systems using hand-crafted features include [61][62][63][64]. Lawrence et al. [65] first proposed using CNNs for face recognition. Currently, the state-of-the-art face recognition systems, namely, Facebook's DeepFace [66] and Google's FaceNet [67], are based on CNNs. Other notable CNN-based face recognition systems are lightened convolutional neural networks [68] and the Visual Geometry Group (VGG) Face Descriptor [69].
Figure 7 shows the logic flow of CNN-based face recognition systems. Instead of using hand-crafted features, CNNs are directly applied to RGB pixel values and used as a feature extractor to provide a low-dimensional representation characterizing a person's face. In order to normalize the input image and make the system robust to different view angles, DeepFace [66] models a face in 3D and aligns it to appear as a frontal face. Then, the normalized input is fed to a single convolution-pooling-convolution filter. Next, 3 locally connected layers and 2 fully connected layers are used to make final predictions. The architecture of DeepFace is shown in Figure 8. Though DeepFace achieves strong performance on face recognition, its representation is difficult to interpret and use because the faces of the same person are not necessarily clustered during the training process. In contrast, FaceNet defines a triplet loss function directly on the representation, which makes the training procedure learn to cluster face representations of the same person [70]. It should also be noted that OpenFace uses a simple 2D affine transformation to align the face input.

Nowadays, face recognition in mobile computing is a very attractive topic [71,72]. While DeepFace and FaceNet remain private and are of large size, OpenFace [70] offers a lightweight, real-time, and open-source face recognition system with competitive accuracy, which is suitable for mobile computing. OpenFace implements FaceNet's architecture but is one order of magnitude smaller than DeepFace and two orders of magnitude smaller than FaceNet. Their performances are compared on the Labeled Faces in the Wild (LFW) data set [73], which is a standard benchmark in face recognition. The experiment results are shown in Table 1. Though the accuracy of OpenFace is slightly lower than the state of the art, its smaller size and fast execution time show great potential in mobile face recognition scenarios.

Table 1: Experiment results on LFW benchmark [70].

Technique                     Accuracy
Human-level (cropped) [74]    0.9753
FaceNet [67]                  0.9964 ± 0.009
DeepFace-ensemble [66]        0.9735 ± 0.0025
OpenFace [70]                 0.9292 ± 0.0134

CNN for Road Crack Detection.
Automatic detection of pavement cracks is an important task in transportation maintenance for driving safety assurance. Inspired by recent success in applying deep learning to computer vision and medical problems, a deep learning based method for crack detection is proposed [23].
Data Preparation. A data set with more than 500 pavement pictures of size 3264 × 2448 is collected at the Temple University campus by using a smartphone as the data sensor. Each image is annotated by multiple annotators. Patches of size 99 × 99 are used for training and testing the proposed method. In total, 640,000 patches, 160,000 patches, and 200,000 patches are selected as the training set, validation set, and testing set, respectively.

Design and Train the CNN.
A deep learning architecture is designed, which is illustrated in Figure 9, where conv, mp, and fc represent convolutional, max pooling, and fully connected layers, respectively. The CNNs are trained using the stochastic gradient descent (SGD) method on a GPU with a batch size of 48 examples, momentum of 0.9, and weight decay of 0.0005. Fewer than 20 epochs are needed to reach a minimum on the validation set. The dropout method is used between the two fully connected layers with a probability of 0.5, and rectified linear units (ReLU) are used as the activation function.
Evaluate the Performance of the CNN. The proposed method is compared against the support vector machine (SVM) and the Boosting methods. The features for training the SVM and the Boosting method are based on the color and texture of each patch, and each patch is associated with a binary label indicating the presence or absence of cracked pavement. The feature vector is 93-dimensional and is composed of color elements, histograms of textons, and the LBP descriptor within the patch.
The Receiver Operating Characteristic (ROC) curves of the proposed method, the SVM, and the Boosting method are shown in Figure 10. Both the ROC curve and the Area under the Curve (AUC) of the proposed method indicate that the proposed deep learning based method can outperform the shallow structures learned from hand-crafted features. In addition, more comprehensive experiments are conducted on 300 × 300 scenes, as shown in Figure 11.
For each scene, each row shows the original image with a crack, the ground truth, the probability maps generated by the SVM and the Boosting methods, and the probability map generated by the ConvNet. The pixels in green and in blue denote the crack and the noncrack, respectively, and higher brightness means higher confidence. The SVM cannot distinguish the crack from the background, and some of the cracks have been misclassified. Compared to the SVM, the Boosting method can detect the cracks with a higher accuracy. However, some of the background patches are classified as cracks, resulting in isolated green parts in Figure 11. In contrast to these two methods, the proposed method provides superior performance in correctly classifying crack patches from background ones.
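One update step of SGD with momentum and weight decay can be sketched as follows, using the momentum of 0.9 and the weight decay of 0.0005 reported above. The learning rate of 0.01, the function name, and the toy objective are assumptions, since the text does not state them, and the update rule shown is one common formulation rather than necessarily the exact one used in [23].

```python
def sgd_momentum_step(weights, grads, velocity,
                      lr=0.01, momentum=0.9, weight_decay=0.0005):
    """Apply one SGD update with momentum and L2 weight decay.

    v <- momentum * v - lr * (grad + weight_decay * w)
    w <- w + v
    """
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        v_next = momentum * v - lr * (g + weight_decay * w)
        new_velocity.append(v_next)
        new_weights.append(w + v_next)
    return new_weights, new_velocity


# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3).
weights, velocity = [0.0], [0.0]
for _ in range(300):
    grads = [2.0 * (weights[0] - 3.0)]
    weights, velocity = sgd_momentum_step(weights, grads, velocity)
```

After the loop the weight settles close to 3; the weight decay term pulls it very slightly toward zero, which is the regularizing effect the term is meant to have.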
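The AUC used to compare the classifiers can be computed directly from patch scores. The sketch below uses the rank-based definition, in which the AUC is the probability that a randomly chosen crack patch receives a higher score than a randomly chosen noncrack patch, with ties counted as one half. The labels and scores are made-up values for illustration, not results from [23].

```python
def roc_auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = crack, 0 = noncrack)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4]
auc = roc_auc(labels, scores)  # 8 of the 9 crack/noncrack pairs are ordered correctly
```

An AUC of 1.0 means that every crack patch is scored above every noncrack patch, whereas 0.5 corresponds to random ordering. The pairwise loop is quadratic in the number of patches, so a sort-based variant would be preferable for data sets of the size described above.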

DNN for Fault Diagnosis.
Plant faults may cause abnormal operations, emergency shutdowns, equipment damage, or even casualties. With the increasing complexity of modern plants, it is difficult even for experienced operators to diagnose faults quickly and accurately. Thus, designing an intelligent fault detection and diagnosis system to aid human operators is a critical task in process engineering. Data-driven methods for fault diagnosis have become very popular in recent years, since they utilize powerful machine learning algorithms. Conventional supervised learning algorithms used for fault diagnosis are Artificial Neural Networks [76][77][78][79][80][81] and support vector machines [82][83][84]. As an emerging machine learning technique, deep learning has been investigated for fault diagnosis in a few recent studies [22][85][86][87][88]. This subsection reviews a study which uses a Hierarchical Deep Neural Network (HDNN) [22] to diagnose faults in a well-known data set called the Tennessee Eastman Process (TEP).
TEP is a simulation model that simulates a real industrial process. The model was first created by Eastman Chemical Company [75]. It consists of five units: a condenser, a compressor, a reactor, a separator, and a stripper. Two liquid products G and H are produced from the process with the gaseous inputs A, C, D, and E and the inert component B. The flowsheet of TEP is shown in Figure 12.
Data Preparation. The TEP is monitored by a network of $m$ sensors that collect measurements at the same sampling time. At the $i$th sample, the state of the $j$th sensor is represented by a scalar $x_i^j$. By combining all $m$ sensors, the state of the whole process in the $i$th sampling interval is represented as a row vector

$$x_i = [x_i^1, x_i^2, \ldots, x_i^m].$$

The fault occurring at the $i$th sampling interval is indicated with a class label $y_i \in \{1, 2, \ldots, C\}$, where a value from 1 to $C$ represents one of $C$ fault types. A total of $N$ historical observations are collected from all $m$ sensors to form a data set $D = \{(x_i, y_i),\ i = 1, 2, \ldots, N,\ y_i \in \{1, 2, \ldots, C\}\}$. The objective of fault diagnosis is to train a classifier $h: x_i \rightarrow y_i$ given the data set $D$.
For each simulation run, the simulation starts without faults and the faults are introduced at sample 1. Each run collects a total of 1000 samples. Each fault type has 5 independent simulation runs. The Tennessee Eastman Process has 20 different predefined faults, but not all of them are used in this study; the results reported below cover 17 faults.

Design and Train the HDNN. The general diagnosis scheme of the HDNN [22] is as follows. The symptom data generated by simulation is transmitted to a supervisory DNN. The supervisory DNN then classifies the symptom data into different groups and triggers the DNN that is specially trained for that group to do further fault diagnosis. Figure 13 illustrates the fault diagnosis scheme of the HDNN, where each agent represents a DNN.
Evaluate the Performance of the DNN. The experiment result of the HDNN is compared to those of a single neural network (SNN) and a Duty-Oriented Hierarchical Artificial Neural Network (DOHANN) [76] and is shown in Figure 14. Of the 17 faults, 7 have been diagnosed with 90% accuracy. The highest Correct Classification Rate (CCR) is 99.6% for fault 4, while the lowest CCR is 50.4% for fault 13. The average CCR of the HDNN is 80.5%, while the average CCRs of the SNN and the DOHANN are 49.7% and 70.7%, respectively. This demonstrates that the DNN-based algorithm outperforms the other conventional NN-based algorithms.
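The two-stage scheme can be sketched as a simple dispatcher. In this illustration the supervisory model and the group-specific models are hypothetical threshold rules standing in for trained DNNs; the class name, the group identifiers, and the fault numbers are invented for the example and do not reproduce the grouping used in [22].

```python
class HierarchicalDiagnoser:
    """Route each sample through a supervisory model to a group-specific model."""

    def __init__(self, supervisor, specialists):
        self.supervisor = supervisor    # callable: sample -> group id
        self.specialists = specialists  # dict: group id -> callable: sample -> fault label

    def diagnose(self, sample):
        group = self.supervisor(sample)
        fault = self.specialists[group](sample)
        return group, fault


# Hypothetical stand-ins for the trained supervisory DNN and per-group DNNs.
def supervisor(sample):
    return "group_a" if sample[0] < 0.5 else "group_b"


specialists = {
    "group_a": lambda sample: 1 if sample[1] < 0.5 else 2,
    "group_b": lambda sample: 4 if sample[1] < 0.5 else 5,
}

hdnn = HierarchicalDiagnoser(supervisor, specialists)
first = hdnn.diagnose([0.2, 0.9])   # ("group_a", 2)
second = hdnn.diagnose([0.8, 0.1])  # ("group_b", 4)
```

Each specialist only has to separate the faults within its own group, which is the motivation for splitting the diagnosis into two stages.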
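The Correct Classification Rate can be computed per fault type as the fraction of samples of that fault which are classified correctly. The sketch below assumes that the average CCR is the unweighted mean over fault types, which the text does not state explicitly; the labels are made-up values, not TEP results.

```python
from collections import defaultdict


def ccr_per_class(true_labels, predicted_labels):
    """Return {fault type: fraction of its samples classified correctly}."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, prediction in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == prediction:
            correct[truth] += 1
    return {fault: correct[fault] / total[fault] for fault in total}


def average_ccr(per_class):
    """Unweighted mean of the per-fault CCRs (an assumption, see above)."""
    return sum(per_class.values()) / len(per_class)


y_true = [1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [1, 1, 1, 2, 2, 2, 1, 1]
per_class = ccr_per_class(y_true, y_pred)  # {1: 0.75, 2: 0.5}
mean_ccr = average_ccr(per_class)          # 0.625
```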

DNN for Human Activity Detection.
Human activity detection has drawn much attention from researchers due to high demands for security, law enforcement, and health care [90][91][92][93]. In contrast to using cameras to detect human activity, sensors such as worn accelerometers or in-home radar, which use signals to detect human activities, are robust to environmental conditions such as weather and light variations [94][95][96][97][98][99]. Nowadays, there are a few emerging research works that focus on using deep learning technologies to detect human activities based on signals [89,92,100].
Fall detection is one of the most important human activity detection scenarios for researchers, since falls are a main cause of both fatal and nonfatal injuries for the elderly. Khan and Taati [100] proposed a deep learning method for fall detection based on signals collected from wearable devices. They proposed an ensemble of autoencoders to extract features from each channel of sensing data. Unlike wearable devices, which are intrusive, easily broken, and must be carried, in-home radars are safe, nonintrusive, and robust to lighting conditions, which gives them advantages for fall detection. Jokanovic et al. [89] proposed a method that uses deep learning to detect fall motion through in-home radar. The procedure is demonstrated in Figure 15. They first denoise and normalize the spectrogram as input. Then, stacked autoencoders are used as a feature extractor. On top of the stacked autoencoders, a softmax regression classifier is used to make predictions. The whole model is compared with an SVM model. Experiment results show that the overall correct classification rate for the deep learning approach is 87%, whereas the overall correct classification rate for the SVM is 78%.
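The final softmax regression step on top of the extracted features can be sketched as follows. The feature vector, weights, biases, and the two class indices are hypothetical values chosen for illustration; in the pipeline of [89] the features would come from the stacked autoencoders and the parameters would be learned from labeled spectrograms.

```python
import math


def softmax(logits):
    """Convert logits to probabilities; subtracting the max avoids overflow."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(features, W, b):
    """Return (predicted class index, class probabilities) for one feature vector."""
    logits = [b[k] + sum(W[k][i] * features[i] for i in range(len(features)))
              for k in range(len(b))]
    probs = softmax(logits)
    return probs.index(max(probs)), probs


# Hypothetical two-class setup: index 0 = non-fall, index 1 = fall.
features = [1.0, 2.0]
W = [[1.0, 0.0],
     [0.0, 1.0]]
b = [0.0, 0.0]
predicted_class, probabilities = predict(features, W, b)
```

With these toy values the logits are 1 and 2, so the second class receives the larger probability and is selected.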

Challenges
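Though deep learning techniques achieve promising performance in multiple fields, there are still several big challenges, as research articles indicate. These challenges are described as follows.

Limited Data.
Training a deep neural network usually needs large amounts of data, as a larger training data set can prevent the deep learning model from overfitting. Limited training data may severely affect the learning ability of a deep neural network. Unfortunately, there are many applications that lack sufficient labeled data to train a DNN. Thus, how to train a DNN with limited data effectively and efficiently has become a hot topic. Recently, two possible solutions have drawn attention from researchers. One solution is to generate new training data from the original training data using multiple data augmentation methods. Traditional ones include rotation, scaling, and cropping. In addition to these, Wu et al. [37] adopted vignetting, color casting, and lens distortion techniques, which can produce further, more varied training examples. Another solution is to obtain more training data using weakly supervised learning algorithms. Song et al. [101] proposed a weakly supervised method that can label image-level object presence. This method helps to reduce laborious bounding box annotation costs while generating training data.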

Time Complexity.
Training deep neural networks was very time-consuming in the early years. It needs a large amount of computational resources and is not suitable for real-time applications. GPUs are now commonly used to accelerate the training of large DNNs with the help of parallel computing techniques. Thus, it is important to make the most of GPU computing ability when training DNNs. He and Sun [102] investigated training CNNs under time cost constraints and proposed fast training methods for real-world applications that achieve performance similar to existing CNN models. Li et al. [103] removed all the redundant computations during the training of CNNs for pixelwise classification, which led to a speedup of 1500 times.

Theoretical Understanding.
Though deep learning algorithms achieve promising results on many tasks, the underlying theory is still not very clear. There are many questions that need to be answered. For instance, which architecture is better than other architectures for a certain task? How many layers, and how many nodes in each layer, should be chosen in a DNN? Besides, there are a few hyperparameters, such as the learning rate, the dropout rate, and the strength of the regularizer, which need to be tuned with specific knowledge.
Several approaches have been developed to help researchers better understand DNNs. Zeiler and Fergus [43] proposed a visualization method that illustrates features in intermediate layers. It displays intermediate features as interpretable patterns, which may help design better architectures for future DNNs. In addition to visualizing features, Girshick et al. [49] tried to discover the learning pattern of CNNs by testing the performance layer by layer during the training process. Their study demonstrates that convolutional layers can learn more generalized features.
Although there is progress in understanding the theory of deep learning, there is still much room for improvement in this respect.

Conclusion
This paper gives an overview of deep learning algorithms and their applications. Several classic deep learning algorithms such as restricted Boltzmann machines, deep belief networks, and convolutional neural networks are introduced. In addition to deep learning algorithms, their applications are reviewed and compared with other machine learning methods. Though deep neural networks achieve good performance on many tasks, they still have many properties that need to be investigated and justified. We discussed these challenges and pointed out several new trends in understanding and developing deep neural networks.

Figure 4: Digital image representation and convolution matrix.

Figure 9: Illustration of the architecture of the proposed ConvNet [23].

Figure 15: Block diagram of the deep learning based fall detector [89].