Image Recognition Based on Multiscale Pooling Deep Convolution Neural Networks

Depth neural network (DNN) has become a research hotspot in the field of image recognition. Developing a suitable solution to introduce effective operations and layers into DNN model is of great significance to improve the performance of image and video recognition. To achieve this, through making full use of block information of different sizes and scales in the image, a multiscale pooling deep convolution neural network model is designed in this paper. No matter how large the feature map is, multiscale sampling layer will output three fixed-size character matrices. Experimental results demonstrate that this method greatly improves the performance of the current single training image, which is suitable for solving the image generation, style migration, image editing, and other issues. It provides an effective solution for further industrial practice in the fields of medical image, remote sensing, and satellite imaging.


Introduction
Edge computing extends computing, network, storage and other capabilities to the edge side of the network near the Internet of ings devices, while the artificial intelligence technology represented by deep learning enables each edge computing node to have the ability of computing and decision making, allowing some complex intelligent applications to be processed at the local edge, meeting the needs of agile connection, real-time data optimization business, application intelligence, security, and privacy protection. Intelligent edge computing collects data and analyses computing by means of the edge device of the Internet of ings so as to realize the flow of intelligence between the cloud and the edge. Now that new demands on artificial intelligence algorithms, terminals, and chips have to be met, it appeals more and more artificial intelligence (AI) enterprises [1][2][3]. However, most current edge devices today cannot support deep learning applications in a low-latency, low-power, high-precision manner due to resource constraints; for example, deep learning model inference requires a large amount of computational resources.
Deep neural networks, like many other machine learning models, can be divided into two stages of training and reasoning. e training phase learns the parameters in the model according to data (for the neural network, it is mainly the weight in the network). e inference phase enters the new data into the model and calculates the results. Overparameterization refers to the training phase. e network needs a large number of parameters to capture the small information in the data. And once the training has reached the reasoning stage, it does not need so many parameters. Based on this assumption, the model can be simplified before deployment. However, deep neural network model that exists in the process of removing specific fuzzy kernel and noise usually needs to set up different training models according to different noise levels, which does not only lack flexibility, but also cannot cope with more general image recognition tasks. erefore, how to use DNN model flexibly in image recognition task has more important research value [4].
Although the purpose of neural network is to solve the general machine learning problem, domain knowledge also plays an important role in the design of depth model.
Among the applications related to image and video, the most successful one is the deep convolution network, whose design is only used in the special structure of image. Two of the most important operations, convolution and pooling, come from domain knowledge related to images. How to introduce new effective operation and layer in the depth model through the knowledge of research field is of great significance to improve the performance of image and video recognition. For example, the pool layer brings about local translation invariance, and the deformation pool layer proposed in [5][6][7] can better describe the geometric deformation of various parts of the object. At present, it can be further extended to achieve scale invariance, rotation invariance, and robustness to occlusion.
In this paper, we put forward Multiscale Pooling Deep Convolution Neural Networks model that aims to train a set of fast and effective schemes. Rather than learning MAP inference-guided discriminative models, we instead adopt plain DNN to learn the denoisers, taking advantage of recent progress in DNN as well as the merit of GPU computation. Meanwhile providing good performance for image recognition, the learned set of multiscale pooling is plugged in a modelbased optimization method to tackle various problems. e contribution of this work is summarized as follows:

Feature
Learning. e biggest difference between traditional pattern recognition method and deep learning lies in its characteristics; the feature is automatically learned from big data, rather than using manual design. Good features can improve the performance of pattern recognition system [8][9][10]. In the past few decades, in various applications of pattern recognition, the characteristics of manual design have been in domination level. Manual design mainly depends on the prior knowledge of designers, so it is difficult to take advantages of the advantages of big data. Due to the dependence on manual parameter adjustment, the number of parameters allowed in the design of the feature is very limited. Deep learning can automatically learn representations of features from big data and can contain thousands of parameters [11][12][13][14][15][16]. It takes five to ten years to design effective features by hand, and deep learning can quickly learn new and effective feature representations from training data for new applications. A pattern recognition system includes two parts: feature and classifier. In traditional methods, the optimization of features and classifiers is separated. Under the framework of neural network, feature representation and classifier are jointly optimized, which can give full play to the performance of joint cooperation [17][18][19][20][21].
In 2012, Hinton participated in the ImageNet competition and adopted the convolution network model, which contains 60 million parameters learned from millions of samples [22]. e feature representation learned from ImageNet has very strong generalization ability and can be successfully applied to other datasets and tasks, such as object detection, tracking, and inspection. Another famous competition in the field of computer vision is PSACAL VOC. However, its training set is small and is not suitable for training deep learning models. Some scholars use the feature representation learned from ImageNet for physical examination on PSACAL VOC; the detection rate increased by 20% [23].
Since feature learning is so important, what are good features? In an image, various complex factors are often combined in a nonlinear way. For example, the face image contains a variety of information such as identity, posture, age, expression, light, and so on. e key to deep learning is to successfully separate these factors through multilevel nonlinear mapping; for example, in the last hidden layer of the depth model, different neurons represent different factors. If you take this hidden layer as a feature representation, face recognition, pose estimation, expression recognition, and age estimation will become very simple, because the various factors become a single linear relationship, no longer interfere with each other.

ImageNet Image Classification.
e most important development of deep learning in object recognition is image classification task in ImageNet ILSVRC 3 challenge. e lowest error rate of traditional computer vision method in this test set is 26.172%. In 2012, Hinton's team used convolutional networks to reduce the error rate to 15.315%. is network structure is called AlexNet [24]. Compared with the traditional convolution network, it has three differences. First of all, AlexNet adopts dropout training strategy. During the training process, some neurons in the input layer and the middle layer are randomly set to zero. is simulates the situation that noise interferes with input data so that some neurons fail to detect some visual patterns. Dropout makes the training process converge more slowly, but the resulting network model is more robust. Secondly, AlexNet uses the rectifier linear element as the nonlinear excitation function.
is not only greatly reduces the computational complexity, but also makes the output of neurons sparse and more robust to various interferences. irdly, AlexNet generates more training samples and reduces overfitting by mapping the image of training samples and adding random translation disturbance.
In the ImageNet ILSVRC 2013 competition, the top 20 groups use deep learning techniques. e winner is Rob Fergus's research group of New York University. e depth model adopted is convolutional network, and the network structure has been further optimized. e error rate is 2 Complexity 11.197%. Its model is called clarifai [25]. In the ILSVRC 2014 competition, the winner GoogLeNet [25] reduced the error rate to 6.656%. e outstanding feature of GoogLeNet is that it greatly increases the depth of convolution network, which is more than 20 layers, which was unimaginable before. e deep network structure makes the backpropagation of prediction error difficult, because the prediction error is transmitted from the top layer to the bottom layer, and the error transmitted to the bottom layer is very small, so it is difficult to drive the updating of the bottom-layer parameters. GoogLeNet's strategy is to add the monitoring signal directly to multiple middle layers, which means that the feature representation of middle layer and bottom layer should also be able to accurately classify the training data. ese new features need to remove as many intraclass changes as possible, while retaining interclass changes.
Face recognition includes two tasks: face confirmation and face recognition. Face confirmation is to judge whether two face photos belong to the same person, which is a binary classification problem. e accuracy of random guessing is 50%. Face recognition is to divide a face image into one of N categories, which are defined by the identity of the face. is is a multiclassification problem, which is more challenging. e difficulty increases with the number of categories. e accuracy of random guess is 1/n. Both tasks can learn the facial expressions through deep models.
Using convolution network to learn face features, literature [26][27][28] used face recognition task as a supervisory signal; the recognition rate is 92.52% on LFW. Although this result is lower than the follow-up deep learning method, it also exceeds most of the nondeep learning algorithms. Because face recognition has two classified problems, the efficiency of using it to learn face features is relatively low, and it is easy to have fitting on the training set. Face recognition is a more challenging multiclassification problem, which is not easy to be fitted, and it is more suitable to learn face features through depth model [29][30][31]. On the other hand, in face recognition, each pair of training samples is manually labelled into one of two categories, which contains less information. In face recognition, each training sample is labelled as one of N classes, which has a large amount of information.
Some people think that the success of deep learning is due to the use of complex modules with a large number of parameters [32]. In fact, it is far from so simple to fit the dataset. For example, the success of DeepID2+ lies in its many important and interesting characteristic [33]: its top neurons respond moderately sparsely; it has strong selectivity to face identity and various face attributes and strong robustness to local occlusion. In previous studies, in order to get these attributes, we often need to add various display constraints to the model. DeepID2+ has these attributes automatically through large-scale learning, and the theoretical analysis behind it is worth further study in the future.

Deep Learning for Video Analysis.
e application of deep learning in video classification is still in its infancy; there is still a lot of work to be done in the future. e depth model that can be learned from ImageNet can be used to describe the static image features of video. e difficulty is how to describe the dynamic features. In the past, the description of dynamic features in visual research methods often depends on optical flow estimation, key point tracking, and dynamic texture. How to embody this information in depth model is a difficulty. e most direct way is to treat video as a three-dimensional image and apply convolution network directly [34][35][36]; three-dimensional filters are learned at each level. But this idea obviously does not take into account the differences of time and space dimensions. Another simple but more effective way is to calculate the spatial field distribution of optical flow field or other dynamic characteristics by preprocessing, as an input channel of convolution network [37][38][39].
ere are also research works using deep autoencoder to extract dynamic texture in a nonlinear way [38]. In the latest research work [40], long short-term memory (LSTM) has been widely concerned, which can capture long-term dependence and complex dynamic modelling in video.
In 2018, Ulyanov et al. proposed a depth image prior (Deep Image Prior (DIP)) model [1], which considers that the neural network structure is a priori that does not require learning and pretraining to a large number of datasets; only learning network superparameters adaptively can realize image conversion and it is verified in the tasks of denoising, superresolution, and filling. After dip was proposed, it was successively used in image decomposition-related tasks [41] and blind deconvolution [42]. Double-Dip [43] is an unsupervised layer decomposition of a single image based on a coupled dip network, which is used for image segmentation (foreground layer and background layer), image defogging, watermarking, and other issues. Ren et al. [42] proposed a blind deconvolution framework based on depth Complexity 3 prior-SelfDeblur, using dip and full convolution network, respectively, and fuzzy kernel. Compared with the image processing, the sound signal processing is another direction of successful application in deep learning, especially speech recognition, which many large companies are going for. Unlike images, sound signals are one-dimensional sequence data, and although they can be converted to two-dimensional spectrum by frequency domain conversion algorithms such as FFT, their two dimensions also have specific meanings (vertical axis represents frequency; horizontal axis represents timeframe) and cannot be processed directly in the form of an image, requiring a specific processing method in a specific area. Deep neural network has the ability to extract features automatically, which can jointly optimize the separation of feature representation and classification model, especially after 2016. Deep learning has made great progress in the field of speech recognition; for example, RNN, CRN, and other structures have made a breakthrough in speech recognition performance [44][45][46]. With the research developing and the hardware updating, some end-to-end models have achieved the performance in the state of the art, for example, as the previously mentioned, it is sequence-to-sequence model (CTC). As speech recognition performance continues to improve, industry has produced many applications, such as virtual assistants: Google Home, Amazon Alexa, and Microsoft Cortana.
Although deep learning has achieved great success in practice, moreover, the characteristics (such as sparsity, selectivity, and robustness to occlusion [29]) of the depth model obtained by big data training are striking, but the theoretical analysis behind it still needs to be completed.

Method
In the field of target detection, the manual extraction of characteristics has been replaced by the conventional multiple layer NN, which has saved a great deal of costs of background research. But as the number of network layers increases, the corresponding training methods become lacking. Moreover, the conventional NNs are sensitive to the translation and scale change of the target, which leads to the relatively low detection rate of the target. At present, CNN can realize the simulation of the large-scale neural network by the following three techniques: local receptive field, weight sharing, and pooled sampling. erefore, we put forward a Multiscale Pooling Deep Convolution Neural Networks method, to overcome the restriction during the image detection that the input image must be of a fixed size.

e Basic Structure of Convolutional Neural Network.
A multilayer NN has too many parameters and is difficult to train. To overcome this limitation, the CNN emerges, which is based on ANN and includes convolution and sampling operations. e characteristics extracted by CNN are spatial invariant to certain degree, which makes CNN more suitable for the target detection of an image. Although there are many kinds of CNN, the basic structure is largely the same ( Figure 1): the input layer I, the convolution layer C, the sampling layer S, the output layer O, and sometimes the fullconnection layer F. e main purpose of convolution layer is to realize local feature perception through convolution operation and then realize the same kind of feature extraction through weight sharing and realize different types of feature extraction by using different convolution kernels. ey should be referred to as follows: where X l j represents the convolutional output of the jth characteristic map of the lth layer; f( ) represents the activation function; l represents the number of the layer, K is the kernel; M i represents one choice of the input characteristic map; and b is an offset. e sampling layer is an important layer of convolution neural network, which performs convolution to get the average or maximum value of the local space of the feature map. ey should be referred to as follows: where down(·) is a sampling function obtaining the average or maximum of a rectangle area. e full-connection layer is the penultimate layer of convolution neural network, which means that the input feature map and the output feature map adopt the fullconnection method. Its function is to reduce the dimension effectively and to facilitate the classification processing of the output layer. ey should be referred to as follows: where the constraint i α ij � 1 and 0 ≤ α ij ≤ 1 must be satisfied. e constraint to the variable a ij can be strengthened by turning the variable a ij into an unconstrained Softmax function of the implicit weight c ij , ey should be referred to as follows: For a given j, every group of weight values c ij is independent of those in the other groups. We therefore can omit the subscript j, for the sake of convenience. In other words, one needs only to consider updating a map, because the updating of the other maps is the same process. e only difference is the index j of the map. e details of the regression process about Softmax are not explained here. It should be noted that the input to the FCL needs to have a fixed dimensionality for an easier training of the BP algorithm, which necessitates the consistency between the size of the input image and the size of the training network. is requirement represents one shortcoming of the network 4 Complexity structure. erefore, to solve the problem that the size of the input image must be fixed, the multiscale sampling method is to be used.

e Multiscale Sampling
3.2.1. Shortcomings of the Conventional CNN. e conventional CNN has a fixed window size of the input image, which restricts the size of the input image and length-towidth ratio. When the input image is of arbitrary size, one often has to first fix the image to a size suitable for the training network by clipping the image or deformation along the horizontal and vertical directions. is manual aligning method cannot preserve the size and length-to-width ratio of the original image and cannot guarantee that the clipped image subblocks encompass the whole image, which leads to the loss of some important information and geometrical distortion due to rotation and stretching of the image. When the image size changes to a large extent, the whole network has to be trained again to adapt to the new image, even if the network was trained by images of a given input size. Otherwise the accuracy would reduce of detecting the whole image or the subblocks of the other arbitrary sizes. e fixed input dimensionality greatly restricts the development of CNN in multiscale image detection. e convolutional and sampling layers extract image characteristics by using sliding windows. e FCL, as the hidden intermediate between the sampling and output layers, plays a pivotal role in the whole network structure. In fact, the convolutional layer can generate characteristic maps of different sizes and ratios without inputting images of fixed size. On the contrary, according to the definition of CNN, the input to FCL needs to be images of fixed size. e reason is that only the FCL has a fixed input dimensionality, which further fixes the dimensionality of the parameters used in the training and thus allows for the use of the BP algorithm for the training.
To solve the problem, one can use multiscale sampling to integrate the deep characteristics generated by the previous convolutional-sampling layers to a characteristic expression of fixed length and then send it to the FCL. In this way, one needs not to clip images of different sizes and length-towidth ratio before sending them to the network. At the previous layer of the FCL, namely, the last sampling layer, the multiscale sampling is used to fix the images' sizes so that they can be input to the FCL. Figure 2 shows how the CNN structure is changed by introducing the MSL. Now, one can see that the MSL integrates the image characteristics between the convolutional-pooled layers and the FCL. e integration at deep layers is reasonable because it conforms well with bionic topology: human learning starts from the overall classification but not from the postclipping parts; thus to detect a target, the brain needs not to first process the target and adapt to the fixed size. Some authors proposed the solution of inputting images of different sizes to CNN, but used images of different sizes for both training and testing. ey found no significant improvement of the detection rate. For MSDNN, the training and testing are performed on the same network. e idea of multiscale sampling was primarily inspired by the method of pyramid, which preserves the spatial information of the local parts during the integration. e pioneers applied the pyramid method to all the levels of CNN and extracted multiscale image characteristics by sliding windows. ese image characteristics were then used to train or test the networks of the respective sizes. is actually enriched the scale of the input image and extended the training set, but the training became more difficult with more network parameters that can easily lead to the overfitting problem. e MSL design makes it unnecessary to use different networks to train and test images of different size. e MSL does not simply use the pyramid method to extract the image's multiscale characteristics, but applies the core idea of the pyramid method, namely, multiscale expression, to the deep layers of the network. ese enable not only the Complexity training of NN with images of different sizes, which enriches the characteristic expressions, but also the output of characteristic vectors of the fixed size.

Methods of Multiscale Sampling.
Multiscale expression is applied to the deep layers of NN. We propose to replace the normal sampling with the multiscale sampling at the final sampling layer, where "multiscale" lies in the use of multiple sampling sizes and strides. Figure 3 shows no matter how large the previous characteristic map, the MSL outputs three characteristic matrices with sizes 1 × m, 4 × m, and 9 × m, where m is the number of the previous convolutional layer's characteristic maps. e three characteristic matrices are concatenated in the columnmajor order to form a column vector with fixed dimensionality 14 × m × 1. e column vector is the input to the FCL. e sizes of these sampling windows are matched with those of the input images. at is, the sampling size and sampling stride change as the size of the input changes. In this way, the number of sampling windows, namely, the size of the output characteristic column vector, becomes fixed, no matter how large the input image is. is makes it easy to send the vector to FCL. For example, if the size of the input characteristic map is r × s and we use the maximum sampling method with three sampling sizes and strides, then the computation formulas are as follows: size � r × s, ⌈(r/2) ×(s/2)⌉, ⌈(r/3) ×(s/3)⌉, (5) stride � r × s, ⌈(r/2) ×(s/2)⌉, ⌈(r/3) ×(s/3)⌉, where y 1 j ,y 2 j , and y 3 j represent the output of every characteristic map after the multiscale maximum sampling. ese operations finally obtain three output characteristic matrices with fixed sizes: 1 × 1 × m, 2 × 2 × m, and 3 × 3 × m, which can be converted into characteristic matrices 1 × m, 4 × m, and 9 × m by unfolding in the column-major order. Finally, these characteristic matrices are concatenated in sequence, becoming a characteristic column vector of a fixed size 14 × m × 1. Take two different size input images as an example (Figure 4). e sizes of the characteristic maps are 16 × 16 and 13 × 9, respectively. e target is a 3 × 3 output matrix. In Figure 4(a), the sampling size is 6 × 6 and sampling stride is 5 × 5. By (3) and (4), one obtains the maximum value of the 6 × 6 area and uses it as the output, which finally obtains a 3 × 3 output characteristic matrix. In Figure 4(b), the sampling size is 5 × 3 and sampling stride is 4 × 3. By (3) and (4), one obtains the maximum value of the 5 × 3 area and uses it as the output, which finally obtains a 3 × 3 output characteristic matrix.

e Fully Connected Layer with Multilayer Characteristic Expression.
e FCL, as the hidden intermediate between the sampling and output layers, plays a pivotal role in the whole network structure. Here the neurons of the FCL simultaneously, indirectly, and fully connect to the characteristic maps next to the convolutional layers so that the FCL can receive multiscale and multilevel feature representation input.

3.3.
e Topological Structure of the Overall Network. Figure 5 shows the topological structure of the MSDNN put forward in this paper. Due to the page limit, taking a 64 × 64 input image as an example, we briefly introduce the main parameters and implementation methods of every network layer. e input layer: the preprocessed grayscale image X in . e convolutional layer Cl：the layer consists of 20 different characteristic maps X C1 j . Every characteristic map is convoluted with twenty 5 × 5 convolution kernels, respectively. e result is offset by b C1 j and then used as the independent variable of the activation function ReLU to obtain the characteristic map X C1 j . ey should be referred to as follows: Unlimited image size The convolutional layer

Multiscale sampling
The characteristic column vector of fixed size The fully connected layer where ⊗ represents the convolution operation with stride 1; the activation function is defined to be ReLU(x) � max(0, x); and the size of the obtained X C j is 60 × 60. e sampling layer S1：perform statistical computation on the results obtained by the convolutional later C1, by using the maximum value sampling. After the sampling, the horizontal and vertical spatial resolutions become half of the original resolutions. e size is 30 × 30.

(9)
e sampling layer S2：perform the maximum value sampling with a sampling size 2 and without overlapping of the sampling area. After the sampling, the horizontal and vertical spatial resolutions become half of the original resolutions. e size is 14 × 14.
e convolutional layer C3：by convoluting with convolutional kernels of size 3 × 3, the 40 characteristic maps X S2 j are extended to 60 characteristic map X C3 j of size 12 × 12: e sampling layer S3：perform the maximum value sampling with a sampling size 2 and without overlapping of the sampling area. After the sampling, the horizontal and vertical spatial resolutions become half of the original resolutions. e size is 6 × 6.
e convolutional layer C4：by convoluting with convolutional kernels of size 3 × 3, the 60 characteristic maps X S3 j are extended to 80 characteristic map X C4 j of size 4 × 4: e multiscale sampling layer MS4：perform the maximum value sampling with three different scales on the 80 characteristic maps X C4 j . e three sampling sizes and strides are as follows: is obtains three output characteristic matrices of the fixed sizes 1 × 1 × 80, 2 × 2 × 80, and 3 × 3 × 80. By extending the matrices in the column-major order, one obtains characteristic column vectors of sizes 1 × 80, 4 × 80, and 9 × 80. Finally, the vectors are concatenated in sequence to form a characteristic column vector x MP4 with a fixed size 14 × 80 � 1120 × 1. For the input images of the other sizes, the sampling size and stride change adaptively, but the dimensionality of the final characteristic column vector does not change.
e multiscale sampling layer MS3：perform the maximum value sampling with three different scales on the 60 characteristic mapX C3 j . A characteristic column vector x MS3 with a fixed size 14 × 60 � 840 × 1 is formed. e three sampling sizes and strides are as follows: e multiscale sampling layer MS2： perform the maximum value sampling with three different scales on the 40 characteristic maps X C2 j . A characteristic column vector x MS2 with a fixed size 14 × 40 � 560 × 1 is formed. e three sampling sizes and strides are as follows: size � 28 × 28, 14 × 14, 10 × 10, stride � 28 × 28, 14 × 14, 9 × 9.
e fully connected layer FC5 : the output expression column vector x FC5 , with the size 400 × 1, is obtained in the FCL from the three characteristic column vector obtained at the respective MSLs: Complexity 7 where W 2 , W 3 , and W 4 are the respective weight matrices that connect the three column vectors to the FCL; b FC5 is the offset matrix used in the FCL. e output layer：a labor vector y output is obtained from the output expression column vector x FC5 obtained in FC5.  8 Complexity the database, etc. In general, the larger the database and the more complex the classification are, the deeper the structure required. One then needs to do some preprocession on the database so that the training would be smoother. For small databases, one can expand the data capacity to reduce overfitting. e number of training samples can be increased by operations such as translation and random clipping.

Methods of Training and
Here the method of capacity expansion is used. e dataset is from the Internet, containing 1980 single human face images of different sizes, skin colors, directions, and positions. e same person may have different expressions; some are shielded; some with mustache; some with glasses, etc. To improve robustness of the network and increase multiplicity of the samples, we perform some transformation to a part of images in the initial human face image sample set, such as horizontal flipping, ±20°displacement along the horizontal and vertical directions, ±20°rotation, and the reduction of contrast. Of course, one can combine different transformations and then clip the corresponding image subblocks. In this way, tens of sample images can be obtained from a single image. A part of the human face image samples is illustrated in Figure 6. e non-face image dataset can be a set of non-face images. In fact, any randomly clipped image can be used as a non-face sample. But it is impossible to train a network that can recognize all kinds of non-face samples. One can use the iterative bootstrapping program to collect non-face images and use methods including the iterative retraining system to update the training set. e methods for training MSDNNs can be divided into the forward propagation stage and the backward propagation stage. ey are not introduced here because they are similar to the BP algorithm of the conventional NN.

Validity Analysis of the Multiscale Sampling.
e innovation point of this paper is multiscale sampling, which improves the structure of the conventional CNN for the detection of human face. In the following, we perform experiments to compare their performance. To accelerate the speed of training and testing, we use the popular Caffe deep learning framework to realize GPU computation. e advantage of Caffe is that programming is not needed. Once the network structure is designed, Caffe can automatically help the user to train the NN. However, currently, there are no frameworks that accept input images of variable sizes. erefore, the experiments focus on training the NN with input images of three fixed sizes.
To verify effectiveness of multiscale sampling, we use a single-layer cascade network structure and connect it to the final multiscale layer MS4 ( Figure 5). Two conditions are analyzed in the following.

4.1.1.
e Training of MSDNN with a Single Input Size. Take a 64 × 64 input image as example. e three structures are as follows: (1) Without multiscale sampling, i.e., the ordinary CNN structure (2) One sampling size (1 × 80 characteristic matrix) (3) Two sampling sizes (4 × 80, 9 × 80 characteristic matrices) According to the final testing results obtained by averaging 100 times of experiments (Table 1), MSDNN has a higher detection rate than the ordinary CNN. As far as the number of characteristics is concerned, the dimensionalities of the ordinary CNN and the MSDNN are 1280 and 1120, respectively, which demonstrates that the improvement of the detection rate by MSDNN does not require the increase of the dimensionality. It is the change of the spatial deployment that brings about the improvement of target detection.

e Training of MSDNN with Multiple Input Sizes.
Because the Caffe framework can only accept input of fixed size during one period of training, the trained network is applied to process images of arbitrary sizes in the testing stage. Taking into account the quality of the human face images in the library and the quality of hardware, we train the network with three input image sizes 64 × 64, 80 × 80, and 100 × 100. e images only differ in the resolution, sampling size, and sampling stride, but the other properties (the content, layout, and output characteristic column vector) are the same. In this way, one can first randomly choose any one of the three network sizes (64 × 64) to  complete one training period, at the end of which all the parameters are saved. en, one switches to the other two network sizes to continue the training, until the termination condition is satisfied.
In the experiments, the detection rate of the multiscale sampling training with multiple input sizes is higher than that with a single input size. is is because the network is trained by the characteristics of images of different sizes and is thus more suitable for the real detection. e final detection rates of the networks trained by one, two, and three input sizes are 98.25%, 98.46%, and 98.59%, respectively.

Comparative Experiments.
e experiments follow the scheme of first performing segmentation of skin color from the background color in the YCbCr space to remove the wide background area so that the computation speed can be improved and then using MSDNN classifier to screen the extracted skin-color area in order to determine whether or not a human face exists in the image, and if yes, labelling the human face.
We used MATLAB 2014 as the platform to validate the detection algorithm. e data were from two kinds of sources. e first kind of data was from Internet download or life photos, including human faces of different illumination, different complex background, and different sizes. e other data were some human face images from the CMU database. We performed the detection and analyzed the results both qualitatively and quantitatively.
We performed the simulations by using MATLAB 2014 installed in an ordinary PC with Intel Core i5 CPU, 3.20 GHZ, 8G RAM. Figure 7 is an image of multiple frontview human faces obtained from the Internet. e human faces are different in the size and direction. Figures 8 and 9 present the result of skin-color segmentation by the Gaussian mixture model. erefore, it was necessary to   e characteristic matrices after the three scales sampling were: 1 × 80, 4 × 80, and 9 × 80. e concatenated characteristic column vector, which was sent to the FCL, was 1120 × 1. e results are presented in Figure 10.
To test the validity of the new method in detecting human face images with certain degree of rotation and with different resolutions, we performed human face detection on the segmented original image and the segmented fussy fixated image, respectively, by the conventional CNN and MSDNN methods. We used two sampling scales and obtained 1 × 80 and 4 × 80 characteristic matrices. e circled areas are the detected human faces (Figures 11 and 12).
From Figures 11 and 12, one sees that the conventional CNN can only detect two human faces, while the MSDNN can detect three faces: not only the two front-view faces, but also the accurate positioning of the side-view face. Even if the image is fuzzy and of low resolution, the MSDNN method can still locate the human face. ese results further demonstrate the correctness of the methods put forward in this paper and that multiscale information plays an important role in image detection. To demonstrate the merits of the new method, we compared the new method with the conventional CNN, PCA, and ANN methods by performing experiments on eight face images (50 human face windows) taken from the CMU database. e detection rate and time by these methods are presented in Table 2.
From Table 2, one sees that the detection rate of the new method was higher than that of the CNN, PCA, and ANN methods. Moreover, the computation time was relatively smaller (hardware configuration：Intel Core i5-6500 CPU, 3.20 GHz × 4, 16G RAM, with GUP acceleration). erefore, the performance of the new method is better than the previous methods.

Conclusions
In this paper, a set of fast and effective MSDNN image recognition methods is to be designed and trained. In particular, instead of establishing a multiscale sampling layer to replace the ordinary sampling layer in the network structure, the network structure is improved, and the generalization ability of sample collection is improved to a certain extent. In the future, the depth model is not a black box; it is closely related to the traditional computer vision system, the various layers of neural network through joint learning, and overall optimization, so that the performance has been greatly improved. Applications related to image recognition are also driving the rapid development of deep learning in all aspects of network structure, layer design, and training methods. It can be expected that, in the next few years, deep learning will enter a period of rapid development in theory, algorithms, and applications.

Data Availability
All the data have been included in the paper; therefore, there are no other data available.

Conflicts of Interest
e authors declare that they have no conflicts of interest.