Deep Neural Learning Adaptive Sequential Monte Carlo for Automatic Image and Speech Recognition

Introduction
Soft computing is used in many applications because of its usefulness in modeling and optimization. Numerous studies have focused on image and video processing with objectives such as detection and tracking. Various models have been proposed, including neural networks, deep learning, fuzzy logic, and hybrid methods [1]. However, their practical use remains problematic because many applications require higher accuracy than the available models can supply. Hybrid methods that combine two or more soft computing techniques can often enhance the efficiency of image and video retrieval processes [2]. In an image context, a 3D geographical information system (GIS) data plan for a WiMax network was integrated to optimize both the network performance and the investment costs, both of which depend on the required number of base stations and sectors [3]. In addition, soft computing plays an important role in GIS research [4][5][6][7]. One important aspect of implementing soft computing is the quality of the dataset. Soft computing can also be used to generate meaningful and human-interpretable big datasets by defining an interface between the numerical and categorical spaces, i.e., the data definition and the linguistic space of human reasoning [8]. Furthermore, studies of soft computing methods should use benchmark datasets intended for validating various methods [1]. One example of applying soft computing to decision making is the neurofuzzy analytical network process [9], a method that works based on both fuzzy logic and an artificial neural network. Another implementation of soft computing was proposed for tunneling optimization [10].
This model analyzes the relationship between the target tunneling responses and the impact of input parameters, including both geometrical and geological factors. The proposed implementation is useful in reaching robust and low-cost soft computing solutions in the mining industry [11]. Soft computing can be applied in environmental management to predict vehicular traffic noise, using data such as the volume per hour, the percentage of heavy vehicles, and the average vehicle speed as inputs to neural networks or random forests [12]. Six methods have been used to model soil water capacity parameters that are important in the environmental management of targeted areas [13]. In the aviation industry, a multilayer perceptron neural network has been employed to diagnose aerospace structure defects, whereas the classical method uses signal processing and data interpretation [14]. Soft computing has also been implemented in path categorization of airplanes [15] and can be applied to estimating the position and orientation of spacecraft, which is useful for space technology development [16].
Image classification and speech recognition remain demanding research topics because they apply to a wide variety of applications [17]. One example of an image classification method is a graph-based multiple rank regression model [18], which reduces the losses in matrix data correlations that occur when an image is transformed into a vector for image classification. An integrated recurrent neural network and convolutional neural network (CNN), named the multipath x-D recurrent neural network (MxDRNN), has been proposed for image classification [19]. In addition, semisupervised deep neural networks implement a robust loss function to enhance image classification performance [20], and hyperspectral image classification has been widely used in many earth observation tasks, including object detection, object recognition, and surveillance. A new joint spatial-spectral hyperspectral image classification method based on differently scaled two-stream convolutional networks and spatial enhancement achieved improved classification performance [21]. Image classification for very high-resolution imagery (VHRI) is another challenging task because of the rich detail captured in the images. Many studies have focused on object-based convolutional neural networks (OCNNs) and proposed various innovations, such as integrating a multilevel context-guided classification method with an OCNN to achieve higher VHRI classification accuracy [22]. Image classification techniques have also been applied to medical applications such as breast cancer screening through histopathological imaging [23]. In addition, speech recognition research is useful for native language tasks, such as the implementation of deep neural networks for the Algerian dialect [24] and for codeswitching among Frisian languages [25]. Other speech recognition research has concentrated on recognizing emotion from speech with regard to age and sex using hierarchical models [26]. A new approach to speech recognition based on the specific coding of the time and frequency characteristics of speech using CNNs has been presented [27]. Visual object tracking using an exponential quantum particle filter and mean shift optimization has been presented as another challenge for object tracking [28].
The applied method employs the particle filter, a state estimation technique, to optimize the gradient descent optimizer. State estimation is often used in navigation and guidance applications and has sometimes been applied to other optimization methods. For example, for real-time traffic estimation, state estimation has been implemented using an extended Kalman filter instead of Gaussian process regression models over historical data [29]. A particle filter has also been implemented to adjust various parameters to improve image classification [30][31][32] and for applications such as crack propagation filtering [33]. The gradient descent algorithm is mainly used to optimize an objective [34]. For instance, it was used in a demonstration of a morphing wing-tip for an aircraft to reduce low-speed drag [35].
Thermal power plants use state estimation to optimize various parameters [36]. The adaptive technique presented in this paper, which combines a particle filter with the gradient descent optimizer to improve performance on image classification and speech recognition tasks, is evaluated on the PlanesNet [37] and TensorFlow Speech Recognition Challenge [38] datasets.

Materials
2.1.1. PlanesNet Dataset. Future airport designs should provide improved passenger convenience, such as reducing airplane delays or requiring less check-in time. Air traffic management, as the backbone of the aviation industry, is one factor leading airports to become more intelligent [17]. Airplane detection is a fundamental task in tracking, positioning, and predicting the positions of airplanes. PlanesNet is a medium-resolution, labeled, remote sensing image dataset that can serve as training data for machine learning algorithms [37]. The dataset consists of 20 × 20 RGB images labeled as "plane" or "no-plane," as shown in Figures 1 and 2, respectively. The "plane" images mainly show the wings, tail, and nose of an airplane. The images labeled "no-plane" may include land cover features such as water, vegetation, bare earth, or buildings and do not show any part of an airplane.
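To illustrate the sample layout described above, the following minimal sketch decodes one 20 × 20 RGB sample from a flat pixel buffer. The channel-first storage order and the variable names are illustrative assumptions, not part of the dataset specification.

```python
import numpy as np

# Hypothetical flat buffer standing in for one PlanesNet sample:
# 20 x 20 RGB pixels (1200 values), assumed stored channel by channel.
flat = np.arange(1200, dtype=np.uint8)

# Reshape to (channels, height, width), then reorder to (height, width, channels)
# so the array can be displayed or fed to an image classifier.
img = flat.reshape(3, 20, 20).transpose(1, 2, 0)
```

If the actual files store pixels interleaved per pixel instead of per channel, the reshape would be `flat.reshape(20, 20, 3)` with no transpose.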

2.1.2. Speech Commands Dataset.
Another dataset adopted in this study for testing the applied method is a public dataset for single-word speech recognition, which was initially compiled for the TensorFlow Speech Recognition Challenge [38]. The dataset consists of audio files in which a single speaker says one word. The objective is to assign each audio file in the testing dataset to one of twelve categories: "silence," "unknown," "yes," "no," "up," "down," "left," "right," "on," "off," "stop," and "go." It should be noted that the applied method is based on a CNN, which is normally applied to 2D spatial problems, whereas audio is inherently a one-dimensional signal continuous across time. The dataset was therefore preprocessed into images by defining a time window into which the spoken words fit; the captured audio signal is converted into an image by grouping the incoming audio samples into short segments, just a few milliseconds long, and calculating the strength of the frequencies across a set of bands. Each set of frequency strengths from a segment is treated as a vector of numbers, and those vectors are arranged in time sequence to form a two-dimensional array. This array of values can then be treated as a single-channel image called a spectrogram.
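The preprocessing described above can be sketched as follows. The window length, stride, and band count are illustrative assumptions, and the simple FFT-magnitude banding stands in for whatever frequency analysis the actual pipeline uses.

```python
import numpy as np

def spectrogram(audio, sample_rate=16000, window_ms=30, stride_ms=10, n_bands=40):
    """Convert a 1-D audio signal into a 2-D time-frequency array."""
    win = int(sample_rate * window_ms / 1000)    # samples per segment
    hop = int(sample_rate * stride_ms / 1000)    # samples between segment starts
    frames = []
    for start in range(0, len(audio) - win + 1, hop):
        segment = audio[start:start + win] * np.hanning(win)
        mag = np.abs(np.fft.rfft(segment))       # frequency strengths of the segment
        bands = np.array_split(mag, n_bands)     # group FFT bins into fixed bands
        frames.append([b.mean() for b in bands])
    # rows = time steps, columns = frequency bands: a single-channel "image"
    return np.array(frames)

one_second = np.random.default_rng(0).standard_normal(16000)  # stand-in for a 1 s clip
spec = spectrogram(one_second)
```

The resulting array can then be fed to a 2D CNN exactly as a grayscale image would be.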

Methods.
The applied method is implemented by combining a particle filter with a minibatch gradient descent optimizer, whose conventional update is expressed in equation (1), with the goal of obtaining a suitable optimizer for the target dataset:

θ = θ − η ∇θJ(θ),    (1)

where θ is the weight, η is the learning rate, and ∇θJ(θ) is the gradient of the cost function J(θ) with respect to the weights. Stochastic gradient descent (SGD) performs a parameter update after processing each training example x(i) and label y(i), which means that the batch size is 1. The cost function in minibatch gradient descent is instead the average over a small data batch, whose size usually ranges between 50 and 256 but can vary depending on the application. The applied method uses a generated particle process in combination with the variables of the minibatch gradient descent optimizer. Consequently, the applied optimizer performs updates by using the computed variables instead of the conventional variables of the minibatch gradient descent optimizer. The applied method can be expressed as shown in the following equation:

θ = θ − K η ∇θJ(θ),    (2)

where K is an adjustment value obtained from the particle filter process. K multiplies the learning rate in the second term of the conventional minibatch gradient descent update in equation (1). Figure 3 illustrates the working process of a particle filter, which works based on historical information from the prior stage. The particle filter works iteratively by generating particles, propagating them to the next time step t, and then performing an update to obtain an accurate value at that time step. A workflow of the applied method to obtain the K value is depicted in Figure 4 and is described as follows [32]: (1) Initialization: at t = 0, generate n particles and set their weights to 1/n.
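The update in equation (2) can be sketched on a toy problem as follows. The quadratic cost, the particle weighting by trial-update cost, and the resampling scheme are all illustrative assumptions, since the exact weighting used in the paper is not specified here; the sketch only shows how a particle filter can supply the adjustment value K for the gradient descent update.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad(theta):
    # Toy cost J(theta) = (theta - 3)^2, so the gradient is 2 * (theta - 3).
    return 2.0 * (theta - 3.0)

def estimate_K(theta, eta, n_particles=50, n_iters=50):
    """Particle-filter sketch: sample candidate K values, weight each by the
    cost reached after a trial update, resample, and return the mean."""
    particles = rng.uniform(0.5, 2.0, n_particles)       # candidate K values
    for _ in range(n_iters):
        trial = theta - particles * eta * grad(theta)    # trial updates per particle
        cost = (trial - 3.0) ** 2                        # J evaluated at trial points
        w = np.exp(-cost)                                # lower cost -> higher weight
        w /= w.sum()
        idx = rng.choice(n_particles, n_particles, p=w)  # resample by weight
        particles = particles[idx] + rng.normal(0.0, 0.01, n_particles)
    return particles.mean()

theta, eta = 0.0, 0.1
for _ in range(100):
    K = estimate_K(theta, eta)
    theta -= K * eta * grad(theta)   # equation (2): PF-adjusted update
```

With K fixed at 1 the loop reduces to the conventional update in equation (1); the particle filter simply rescales the step adaptively.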

Image Classification Result.
This experiment uses the inception_v3 model, a pretrained model intended for image classification applications. The PlanesNet dataset deployed in this experiment contains a total of 18,085 images in two classes (7,995 "plane" images and 10,090 "no-plane" images). The data are divided into a training set of 14,377 images and a testing set of 3,708 images. The training batch size is set to 100, the learning rate is 0.001, and the deep learning computation runs for 10,000 epochs.
The results of the applied method are compared with those of the conventional gradient descent optimizer. The applied method is tested in three cases (with the numbers of particles and particle filter iterations given in parentheses). The results in Table 1 reveal that the applied method (180, 300) achieves the best performance as measured by the mean cross entropy over all iterations (0.3193) and by the final test accuracy (89.860%). The applied method (50, 50) achieves the best mean accuracy (87.4291%), which is calculated after every iteration.
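The two reported metrics, mean accuracy and mean cross entropy, can be sketched as follows; the toy probabilities and labels are illustrative only.

```python
import numpy as np

def mean_cross_entropy(probs, labels):
    """Mean cross entropy of predicted class probabilities vs. true labels."""
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels])))

def accuracy(probs, labels):
    """Fraction of samples whose highest-probability class is correct."""
    return float(np.mean(probs.argmax(axis=1) == labels))

# Toy two-class predictions (0 = "no-plane", 1 = "plane")
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4]])
labels = np.array([0, 1, 1])   # the last sample is misclassified
```

Averaging these per-batch values over all training iterations yields the "mean accuracy" and "mean cross entropy" columns reported in Tables 1 and 2.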
The accuracy and cross entropy after each deep learning iteration are shown in Figure 5. The graphs do not clearly distinguish the models' efficiencies because the performance improves only slightly, as shown in Table 1. However, both accuracy and cross entropy (Figures 5(a) and 5(b), respectively) show consistent trends for the applied method and the conventional method.
The confusion matrices for all cases are shown in Figure 6, clearly revealing that the applied method with 180 particles and 300 particle filter iterations achieves the best prediction result for the "no-plane" category; however, it shows poor prediction results for the "plane" category. The confusion matrices for the other three results, in Figures 6(a), 6(b), and 6(d), show no large differences in either the "plane" or the "no-plane" category. These results imply that the number of particles and the number of particle filter iterations affect the overall performance of the applied method. Thus, each application should select the most appropriate model based on user requirements and acceptable model accuracy.
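A binary confusion matrix like those in Figure 6 can be computed with the following minimal sketch; the toy labels are illustrative, with 0 standing for "no-plane" and 1 for "plane."

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=2):
    """Rows index the true class, columns the predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy example: 0 = "no-plane", 1 = "plane"
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]
cm = confusion_matrix(y_true, y_pred)
```

Off-diagonal entries count the failure cases per class, which is what makes the confusion matrix a useful model-selection aid when the overall accuracies are close.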

Speech Recognition Result.
Similar to the image classification experiment, this experiment compares the results of the applied method under different numbers of particles and particle filter iterations with those of the conventional minibatch gradient descent optimizer. The results are presented in Table 2, which shows that the applied method (50, 50) achieves exceptional performance compared to the other models, obtaining the best mean accuracy (77.8163%), mean cross entropy (0.6772), and final test accuracy (89.693%). The conventional minibatch gradient descent optimizer is the second best. From these results, we can conclude that the applied method, configured with an appropriate number of particles and particle filter iterations, can achieve better performance than the conventional method. The accuracy and cross entropy after each iteration are illustrated in Figure 7, which does not reveal obvious overall differences; the improvements are therefore listed in Table 2. Confusion matrices are presented in Figure 8. The applied method (50, 50) shows exceptional performance on the "no," "right," and "off" classes. However, the conventional method achieves the best performance on the "yes," "down," and "go" classes. The other two versions of the applied method achieve good performance on the "unknown" class. Finally, the applied method (150, 100) achieves the best results on the "left" and "on" classes.
The overall results of the speech recognition experiment show that the applied method performs better than the conventional method in terms of both accuracy and cross entropy. However, the confusion matrix results should be examined in detail before selecting the most suitable model for a given application. Overall, the applied method provides better accuracy for both image classification and speech recognition; nevertheless, the confusion matrices for both tasks reveal some failure cases that remain a challenge for further research.
This is a very important consideration for applications that require high-precision image classification, such as in the health care industry, or high-precision speech recognition, such as in rescue operations. Therefore, the applied method in this experiment, based on state estimation and a well-known optimizer, helps to slightly improve performance in both applications. To apply this method in practice, acceptable cases and failure cases should be examined through the confusion matrices to reach optimal performance.

Conclusions
The goal of this study was to use the particle filter technique to optimize a variable in a gradient descent optimizer. The applied method was validated on two public datasets: the PlanesNet dataset (for image classification) and the Speech Commands dataset (for speech recognition). Moreover, three variations of the applied method using different numbers of particles and particle filter iterations were tested on those two datasets: 50 particles with 50 iterations, 150 particles with 100 iterations, and 180 particles with 300 iterations.
The overall results show that the applied method achieves exceptional performance on both datasets, obtaining higher accuracy and lower cross entropy than the conventional method. The experiments also showed that the number of particles and the number of iterations used in the particle filter process affect the model's overall performance. Therefore, to build a high-accuracy model, appropriate particle filter parameter values should be selected according to each application. A confusion matrix can be used as an assistive tool to select the most suitable model for a given application.

Figure 1: Example of images in the PlanesNet dataset labeled as the "plane" category.

Figure 2: Example of images in the PlanesNet dataset labeled as the "no-plane" category.

Figure 3: Working process of the particle filter.

Figure 4: Working processes of the applied method.

Figure 5: Image classification performance: (a) accuracy after each learning step; (b) cross entropy after each learning step.
A simple deep CNN is used in this experiment to generate a model for the audio files. The models are trained for 25,000 epochs with a batch size of 100 and a learning rate of 0.001. The audio data comprise 105,829 individual files: 100,939 in the training dataset and 4,890 in the testing dataset.

Figure 7: Speech recognition performance: (a) accuracy after each learning step; (b) cross entropy after each learning step.

Table 1: Results of the applied method and the gradient descent optimizer for image classification.

Table 2: Results of the applied method and the gradient descent optimizer for speech recognition.