A Fast Learning Method for Multilayer Perceptrons in Automatic Speech Recognition Systems

We propose a fast learning method for multilayer perceptrons (MLPs) on large vocabulary continuous speech recognition (LVCSR) tasks. A preadjusting strategy, based on a cosine-function separation of the training data and a dynamic learning-rate, is used to increase the accuracy of a stochastically initialized MLP. Weight matrices of the preadjusted MLP are restructured by a method based on singular value decomposition (SVD), reducing the dimensionality of the MLP. A back propagation (BP) algorithm that fits the unfolded weight matrices is used to train the restructured MLP, reducing the time complexity of the learning process. Experimental results indicate that on LVCSR tasks, in comparison with the conventional learning method, this fast learning method achieves a speedup of around 2.0 times while improving both the cross entropy loss and the frame accuracy. Moreover, it achieves a speedup of approximately 3.5 times with only a slight degradation of the cross entropy loss and the frame accuracy. Since this method consumes less time and space than the conventional method, it is more suitable for robots with limited hardware.


Introduction
Pattern recognition is one of the most important topics in humanoid robotics. To give robots the capability of communicating with and learning from the real world, they need to recognize information such as speech and images. There is much relevant prior work. For instance, speech recognition methods have been used to facilitate interaction between humans and humanoid robots for more than ten years [1]. An automated speech recognizer with better performance at separating sentences and reducing noise than its predecessors has since been applied to robots [2]. Image recognition methods have also been widely applied to such humanoid robots. A classic example is robotic vision, such as gesture recognition, which lets humans command robots directly [3,4].
However, several problems restrict the application of such methods to robots, chief among which is that the recognition results are not satisfactory. Fortunately, deep neural networks (DNNs) can resolve this problem to a great degree. DNNs were first successfully applied to image recognition, bringing evident improvements in recognition performance [5]. Over the past few years they have also been used in speech recognition, especially in LVCSR tasks. Earlier work reveals that automatic speech recognition (ASR) systems based on context dependent Gaussian mixture models (CD-GMMs) and hidden Markov models (HMMs) are improved by replacing GMMs with DNNs [6][7][8][9][10]. Moreover, new usages of DNNs have been proposed in recent work [11][12][13][14][15][16][17][18].
An MLP trained with a supervised BP learning algorithm is one of the most widely used DNNs in ASR systems. However, learning is difficult for the MLP because of the heavy computational burden of densely connected structures, multiple layers, and several epochs of iterations, and thus a considerably long time is required to reach the necessary recognition accuracy.
Another drawback of DNNs is that decoding with them is also time-consuming.
Some methods have been proposed to ameliorate these disadvantages. Since graphics processing units (GPUs) are powerful at parallel computation, they have been used to speed up the matrix multiplications involving the dense weight matrices of MLPs [19]. Meanwhile, asynchronous training algorithms have been applied to the training processes, making several computers or processing units work asynchronously so that the training tasks are allocated to parallel simultaneous jobs [20][21][22]. Moreover, Hessian-free (HF) optimisation focuses on reducing the number of iterations, which makes parameters converge faster than conventional stochastic gradient descent (SGD) [23][24][25]. Nevertheless, the heavy computational burden of learning MLPs still exists, especially on realistic tasks that demand extensive learning to improve the recognition accuracy. To speed up the decoding processes, SVD is used to restructure the models, but it requires extra time for retraining and thus once again increases the time consumption [26,27].
In this paper, we propose a fast learning method that reduces the computational burden of both learning MLPs and decoding with them. The basic concept of this method is to roughly preadjust the initial MLP and then, after restructuring the weight matrices via SVD, train the MLP using an unconventional BP algorithm. The preadjusting process alters the distributions of singular values before the MLP is accurately trained. Since SVD reduces the dimensionality of the weight matrices, the burden of computing matrix multiplications is lessened.
The rest of this paper is organized as follows. Section 2 describes the fast learning method. Section 3 presents experimental results and discussions, and Section 4 draws conclusions.

A Fast Learning Method
In the learning-rate schedule (1), $\eta_0$ denotes the initial learning rate and the decay factor $\alpha$ satisfies $0 < \alpha < 1$.
The proportions of these bunches are different, following a rule based on the cosine function. The proportion of the $i$th bunch is given by (2). In particular, to ensure that the rest of the data are contained in the last bunch, the proportion of the $N$th bunch is given by (3). In fact, the sum of the proportions of the first $N-1$ bunches converges to 1 when $N$ tends to positive infinity, which ensures that all bunches follow the rule of the cosine function and all data are used in that limit. Nonetheless, it is impossible to let $N$ tend to positive infinity in reality, so $N$ is set to a large positive integer in practice. We name this strategy preadjusting (PA), as the learning-rates and data arrangement differ from those of conventional training methods.
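The bunch-splitting rule can be illustrated with a small sketch. The exact proportion formulas are not recoverable from this text, so the function below uses one hypothetical cosine-based rule chosen only because it has the properties stated above: the first bunches are the largest, the last bunch takes whatever data remain, and the first $N-1$ proportions sum to a value that approaches 1 as the number of bunches grows. The name `bunch_proportions` and the specific formula are assumptions, not the authors' definition.

```python
import math

def bunch_proportions(n_bunches):
    """Hypothetical cosine-based split; the last bunch takes the remainder."""
    if n_bunches < 2:
        raise ValueError("need at least two bunches")
    step = math.pi / (2 * n_bunches)
    # Assumed rule for bunches 1..N-1: a Riemann-sum slice of cos on [0, pi/2].
    head = [step * math.cos(i * step) for i in range(1, n_bunches)]
    # Last bunch: whatever is left, so that all data are used.
    return head + [1.0 - sum(head)]

props = bunch_proportions(20)
print([round(p, 4) for p in props])
print("share of the first 8 bunches:", round(sum(props[:8]), 4))
```

Under this assumed rule the early bunches hold more than an even share of the data, which is the behaviour the text attributes to the cosine-based separation.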
The dynamically declining learning-rate is used because the PA process goes through the training data only once and must raise the accuracy as far as possible. Relatively high learning-rates train models quickly but with low precision, whereas relatively low learning-rates train MLPs slowly but achieve high recognition accuracies. In (1), the initial learning-rate is high, which speeds up learning at the beginning, and the factor $\alpha^{i-1}$ then decays this rate exponentially, ensuring the precision of the intermediate and final learning.
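As a minimal sketch of this schedule, assuming the rate for the $i$th bunch is the initial rate multiplied by the decay factor raised to the power $i-1$, the per-bunch rates can be listed as follows. The function name `bunch_learning_rates` is hypothetical; the values 0.032, 0.975, and 20 bunches are those reported in the experimental settings.

```python
def bunch_learning_rates(eta0, alpha, n_bunches):
    """Exponentially decayed learning rate for bunches 1..n_bunches."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    # Bunch i (1-based) uses eta0 * alpha**(i - 1).
    return [eta0 * alpha ** (i - 1) for i in range(1, n_bunches + 1)]

rates = bunch_learning_rates(eta0=0.032, alpha=0.975, n_bunches=20)
for i, r in enumerate(rates, start=1):
    print(f"bunch {i:2d}: learning rate {r:.5f}")
```

With these settings the rate falls from 0.032 for the first bunch to just under 0.02 for the last one.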

Weight Matrix Restructuring and Training.
An MLP consists of an input layer, several hidden layers, and an output layer. Except for the input layer, which obtains its states directly from the input vectors, each layer uses a weight matrix, a set of biases, and an activation function to compute its states. The computational burden is mainly due to the weight matrices. Concretely, both forward and backward computations require the products of weight matrices and various vectors; thus the time complexity of the MLP is determined by the dimensionality of the weight matrices.
SVD is one of the basic and important analysis methods in linear algebra [28]. It can be used to reduce the dimensionality of matrices via the following equation [26,27]:
$$W_{(m \times n)} = U_{(m \times m)} \Sigma_{(m \times n)} V^{T}_{(n \times n)} \approx W_{1(m \times l)} W_{2(l \times n)},$$
where the numbers in "( )" stand for dimensions, $W_{(m \times n)}$ stands for an $m \times n$ weight matrix, $U_{(m \times m)}$, $\Sigma_{(m \times n)}$, and $V^{T}_{(n \times n)}$ stand for the three matrices generated by SVD, $W_{1(m \times l)}$ and $W_{2(l \times n)}$ stand for the two newly obtained weight matrices, and $l < \max(m, n)$ stands for the number of kept singular values. The time complexity of computing the product of $W_{(m \times n)}$ and a vector $v_{(n)}$ is originally $O(m \times n)$. By replacing $W_{(m \times n)}$ with $W_{1(m \times l)} W_{2(l \times n)}$, the time complexity is reduced to $O((m + n) \times l)$ when $l < m \times n/(m + n)$. Since the effectiveness of SVD depends, to some extent, on the weight matrices holding meaningful parameters, the SVD-based method is applied after preadjusting. In other words, SVD is meaningless for stochastic weight matrices that have not learned anything.
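The restructuring step can be sketched with NumPy. This is an illustration rather than the authors' implementation: the text does not say how the kept singular values are distributed between the two factors, so folding them into the first factor is an assumption, and `restructure` is a hypothetical helper name.

```python
import numpy as np

def restructure(W, l):
    """Split W (m x n) into W1 (m x l) and W2 (l x n) from the l largest singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Assumed split: singular values are folded into W1; other splits are equally valid.
    W1 = U[:, :l] * s[:l]
    W2 = Vt[:l, :]
    return W1, W2

rng = np.random.default_rng(0)
# A rank-8 matrix, so keeping 8 singular values loses nothing beyond rounding.
W = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 48))
W1, W2 = restructure(W, 8)
v = rng.standard_normal(48)
print("max deviation:", np.max(np.abs(W @ v - W1 @ (W2 @ v))))
```

For a trained weight matrix the truncation is generally lossy, which is why the restructured model is trained further afterwards.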
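To simplify the discussion, consider a single layer. Let $b_{(m)}$ denote a set of $m$ biases and $f(\cdot)$ an activation function. The forward computation transforms an $n$-dimensional input vector $i_{(n)}$ into an $m$-dimensional output vector $o_{(m)}$ by
$$o_{(m)} = f\left(W_{1(m \times l)} \cdot \left(W_{2(l \times n)} \cdot i_{(n)}\right) + b_{(m)}\right).$$
Since the weight matrices are unfolded, the backward computation must fit the doubled matrix structure. Let $e_{(m)}$ stand for the received error signal, $f'(\cdot)$ for the derivative of the activation function, $\delta_{(m)}$ for the gradient, $e_{(n)}$ for the error signal that will be transmitted to the layer beneath, $\Delta b_{(m)}$, $\Delta W_{1(m \times l)}$, and $\Delta W_{2(l \times n)}$ for the deltas, and $\eta$ for the learning-rate. According to BP theory, the gradient is
$$\delta_{(m)} = e_{(m)} \odot f'\left(W_{1(m \times l)} \cdot \left(W_{2(l \times n)} \cdot i_{(n)}\right) + b_{(m)}\right).$$
The update rule of $b_{(m)}$ is
$$\Delta b_{(m)} = -\eta \cdot \delta_{(m)}.$$
The update rule of $W_{1(m \times l)}$ is
$$\Delta W_{1(m \times l)} = -\eta \cdot \delta_{(m)} \cdot \left(W_{2(l \times n)} \cdot i_{(n)}\right)^{T}.$$
The error signal becomes $W^{T}_{1(m \times l)} \cdot \delta_{(m)}$ after passing through $W_{1(m \times l)}$; thus, the update rule of $W_{2(l \times n)}$ is
$$\Delta W_{2(l \times n)} = -\eta \cdot \left(W^{T}_{1(m \times l)} \cdot \delta_{(m)}\right) \cdot i^{T}_{(n)}.$$
The error $e_{(n)}$ is
$$e_{(n)} = W^{T}_{2(l \times n)} \cdot \left(W^{T}_{1(m \times l)} \cdot \delta_{(m)}\right).$$
Algorithm 1 (the weight-matrix-restructuring-based BP algorithm).
After being trained by this algorithm, the final weight matrices can be converted back to the original structure via
$$W_{(m \times n)} = W_{1(m \times l)} \cdot W_{2(l \times n)}.$$
Nonetheless, it is not necessary to convert them back unless this is strictly required, because the inverse conversion does not improve the recognition accuracy but increases the computational burden of recognition.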
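The forward and backward computations for one restructured layer can be sketched as follows. This is a sketch under stated assumptions, not the authors' Algorithm 1: the activation is taken to be sigmoid, the incoming error `e_out` is taken to be the derivative of the loss with respect to the layer output (so the deltas carry a minus sign), and the squared-error loss in the demonstration is only a stand-in. The names `forward` and `backward` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W1, W2, b, x):
    h = W2 @ x               # l-dimensional bottleneck activation (linear)
    o = sigmoid(W1 @ h + b)  # m-dimensional layer output
    return h, o

def backward(W1, W2, x, h, o, e_out, eta):
    """e_out is dLoss/do; returns the three deltas and the error for the layer beneath."""
    delta = e_out * o * (1.0 - o)     # gradient at the pre-activation (sigmoid derivative)
    back = W1.T @ delta               # error signal after passing through W1
    d_b = -eta * delta
    d_W1 = -eta * np.outer(delta, h)  # uses the bottleneck activation W2 @ x
    d_W2 = -eta * np.outer(back, x)
    e_in = W2.T @ back                # error transmitted to the layer beneath
    return d_b, d_W1, d_W2, e_in

rng = np.random.default_rng(0)
m, l, n = 6, 3, 5
W1, W2 = rng.standard_normal((m, l)), rng.standard_normal((l, n))
b, x, target = np.zeros(m), rng.standard_normal(n), rng.random(m)

h, o = forward(W1, W2, b, x)
d_b, d_W1, d_W2, e_in = backward(W1, W2, x, h, o, o - target, eta=0.1)
W1, W2, b = W1 + d_W1, W2 + d_W2, b + d_b   # one update step
print("loss before:", 0.5 * np.sum((o - target) ** 2))
print("loss after: ", 0.5 * np.sum((forward(W1, W2, b, x)[1] - target) ** 2))
```

Every product here involves either an $m \times l$ or an $l \times n$ matrix, so the full $m \times n$ matrix is never formed during training.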

The Complexity Reduction Theorem.
As previously mentioned, the SVD-based method reduces the time complexities of matrix multiplications, which is summarized by the following theorem.
Theorem 2. Assume that $W$ is an $m \times n$ weight matrix and $i$ is an $n$-dimensional vector. By applying the SVD-based method to $W$ and keeping the $l$ largest singular values, the time complexity of computing $W \cdot i$ is reduced from $O(m \times n)$ to $O((m + n) \times l)$ when $l < m \times n/(m + n)$.
Proof. Computing $W \cdot i$ requires $m \times n$ real number multiplications, so the time complexity of computing $W \cdot i$ is $O(m \times n)$. Apply the SVD method to $W$ and obtain $W_1$ and $W_2$. After replacing $W$ by $W_1 \cdot W_2$, the product to compute becomes $(W_1 \cdot W_2) \cdot i$. According to the associative law, we obtain
$$(W_1 \cdot W_2) \cdot i = W_1 \cdot (W_2 \cdot i).$$
Computing $W_2 \cdot i$ requires $l \times n$ real number multiplications and yields an $l$-dimensional vector. Computing the product of $W_1$ and this $l$-dimensional vector requires $m \times l$ real number multiplications, so $W_1 \cdot (W_2 \cdot i)$ requires $(m + n) \times l$ real number multiplications. The number of real number multiplications is reduced when
$$(m + n) \times l < m \times n,$$
and we obtain
$$l < \frac{m \times n}{m + n}.$$
Therefore, the time complexity is reduced from $O(m \times n)$ to $O((m + n) \times l)$ when $l < m \times n/(m + n)$.
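The counting argument in the proof can be checked numerically. The sketch below only counts real number multiplications for a dense product and for the factored product; the function names are hypothetical, and the sizes are those of the 1024 × 1024 hidden-layer matrices used later in the experiments.

```python
def mults_dense(m, n):
    """Multiplications needed for W (m x n) times an n-dimensional vector."""
    return m * n

def mults_factored(m, n, l):
    """Multiplications for W1 (m x l) times (W2 (l x n) times the vector)."""
    return l * n + m * l

def rank_threshold(m, n):
    """Keeping fewer singular values than this reduces the multiplication count."""
    return m * n / (m + n)

m = n = 1024
print("threshold:", rank_threshold(m, n))
for l in (384, 256, 128):
    ratio = mults_dense(m, n) / mults_factored(m, n, l)
    print(f"l = {l}: {mults_factored(m, n, l)} multiplications, reduction factor {ratio:.2f}")
```

At exactly the threshold, 512 in this case, the two counts coincide, so only strictly smaller values give a saving.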

Experiments
Then, MLPs are trained on the basis of GMMs. Features transformed by feature-space maximum likelihood linear regression (FMLLR) are used as the speech features for training MLPs. Alignments from GMMs are used as the labels for supervised learning. Each MLP has an input layer, five hidden layers, and an output layer. The input layer has 440 units, corresponding to 440-dimensional input vectors. More specifically, each vector contains 40 real numbers that are the features of the corresponding speech frame and 40 × (5 + 5) real numbers that are the features of the 5 frames before this frame and the 5 frames after it. Each hidden layer has 1024 units. Sigmoid is chosen as the activation function of the hidden layers. The output layer has 1952 units. To deal with the multiclass classification problems in ASR systems, softmax is chosen as the activation function of the output layer. All parameters of these layers, including weight matrices and biases, are stochastically initialized. The conventional method and the PA strategy are each used to train this initial stochastic MLP. The number of bunches ($N$) is set to 20. For the conventional task, the data are separated evenly into bunches, and the learning-rate is set to 0.032. For the PA task, the data are separated by (2) and (3). The initial learning-rate $\eta_0$ is set to 0.032 and $\alpha$ in (1) is set to 0.975.
Next, the SVD-based matrix restructuring method is applied to the basic model, keeping 384, 256, and 128 of the largest singular values, respectively. Since the input layer has only 440 units, applying the SVD-based method to the first weight matrix would not evidently decrease the time complexity. Therefore, the SVD-based method is not applied to the first weight matrix but to all of the other matrices, including that of the output layer. The structure of the model that keeps 256 singular values is shown in Figure 1 as an example, where the bottleneck means the linear transform. The numbers of kept singular values are set to 384 (1024 × 3/8), 256 (1024 × 1/4), and 128 (1024 × 1/8) because the time complexity is reduced when $l < m \times n/(m + n)$, and therefore $l < 512$ if $m = n = 1024$. After that, the BP algorithm described in Section 2.2 is used to train the restructured models. The learning-rates of the iterations are decayed from an initial value: when the increment of the frame accuracy on cross validation is not smaller than 0.5, the learning-rate does not change, but when the increment is smaller than 0.5, the learning-rate is halved. (The frame accuracy is equal to $(N_{\text{correct}}/N_{\text{total}}) \times 100$, where $N_{\text{correct}}$ denotes the number of correctly recognized states on softmax and $N_{\text{total}}$ denotes the total number of states.) The initial learning-rate is set to $1 \times 10^{-5}$.
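The halving rule for the learning-rate can be written down directly. In the sketch below the function names are hypothetical and the sequence of cross-validation accuracies is made up purely for illustration; only the 0.5 threshold, the halving, and the initial rate come from the text.

```python
def frame_accuracy(n_correct, n_total):
    """Frame accuracy in percent."""
    return n_correct / n_total * 100.0

def next_learning_rate(rate, prev_acc, curr_acc, min_gain=0.5):
    """Keep the rate if accuracy rose by at least min_gain points, else halve it."""
    return rate if (curr_acc - prev_acc) >= min_gain else rate / 2.0

# Hypothetical cross-validation frame accuracies after successive epochs.
accs = [40.0, 42.0, 42.3, 42.9, 43.0]
rate = 1e-5
for prev, curr in zip(accs, accs[1:]):
    rate = next_learning_rate(rate, prev, curr)
    print(f"{prev:.1f} -> {curr:.1f}: learning rate {rate:.2e}")
```

In this made-up run the rate is halved twice, after the gains of 0.3 and 0.1 points, and left unchanged after the gains of 2.0 and 0.6 points.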
In these experiments, the cross entropy losses and the frame accuracies on cross validation are used to appraise the performance of MLPs. The word error rate (WER), which is equal to the number of misrecognized words divided by the total number of words, is used to assess the performance of the final CD-MLP-HMMs.

Results and Discussions.
Figure 2 shows the changes of the cross entropy loss during the first epoch. The curves of the PA task and the conventional task are provided. Both first drop sharply and then decrease only slightly after training on 8 bunches. However, the PA task drops more significantly over the first 8 bunches, after which it remains stable. By contrast, the cross entropy loss of the conventional task keeps decreasing during training, but in the end it is still higher than that of the PA task. This is because the first 8 bunches of the PA task contain more data, as they are based on the cosine function. Another contributing factor is that the dynamic learning-rate facilitates the training, which is also why the PA task shows a considerable drop when training on the 3rd-7th bunches.
Figure 3 shows the changes of the frame accuracies on cross validation during the first epoch. Combined with Figure 2, it shows that the frame accuracy increases when the cross entropy loss decreases. However, the changes of the frame accuracies are more evident. After training on the first 5 bunches, the frame accuracy of the PA task reaches a very high point, whereas the low point of the cross entropy loss occurs after 8 bunches. A similar phenomenon also occurs in the conventional task. More importantly, the final frame accuracy of the PA task is higher than that of the conventional one. Such a high accuracy facilitates the subsequent training, and it is the reason why we use the PA strategy.
Figure 4 shows some differences in cross entropy loss between the PA-SVD training method and the conventional method. The initial cross entropy loss of the conventional task is significantly higher than those of the PA-SVD tasks because the PA strategy is better at reducing the cross entropy loss during the first epoch. With regard to the PA-SVD tasks, the initial cross entropy loss is low, and more bottleneck units mean a lower value. However, the cross entropy losses of the PA-SVD tasks increase during the second epoch, reaching peaks that are dramatically higher than before. This is attributed to the fact that the structures of these models have been altered by the SVD method and that the training algorithm differs from the conventional BP method. After the peaks, the cross entropy losses of these tasks decline markedly and then keep decreasing. Finally, all of these cross entropy losses become increasingly similar to each other. More importantly, the final cross entropy losses of the PA tasks (PA-SVD-384 and PA-SVD-256) are still slightly lower than that of the conventional task, indicating that the former models perform better than the latter one.
In fact, on LVCSR tasks, the frame accuracy is more practical, because it directly indicates the proportion of correct recognition results of MLPs. Figure 5 provides the changes of the frame accuracies on cross validation. The initial frame accuracies of the PA-SVD tasks are evidently higher than that of the conventional one, which means that the PA strategy improves not only the cross entropy loss (see Figure 4) but also the frame accuracy. Meanwhile, small gaps occur among the three PA-SVD tasks. This phenomenon is attributed to the fact that the SVD method causes a loss of information in the models, particularly when the number of bottleneck units is small. The frame accuracies of the PA-SVD tasks then reach minima after the second epoch, for the same reason that the cross entropy loss increases. After that, the frame accuracies keep increasing until the end of training. With regard to the conventional task, the frame accuracy decreases slightly during the third epoch because the learning-rate is high during this epoch; from this point it is halved. Finally, the frame accuracies of the PA-SVD-384 task and the PA-SVD-256 task are slightly higher than that of the conventional task, whereas the frame accuracy of the PA-SVD-128 task is a little lower. These results again indicate that the PA-SVD-384 model and the PA-SVD-256 model perform better than the conventional model.
Table 1 provides the final results of the overall LVCSR tasks, including the WERs and the numbers of parameters. A larger number of parameters means a lower WER, but the gaps among them are very small. In comparison with the previous results, although the PA-SVD-256 task and the PA-SVD-384 task have higher WERs than the conventional task, they have better cross entropy losses and frame accuracies. This is because WERs depend not only on the performance of MLPs but also on the ARPA models above them. For the same MLP, using different ARPA models will bring different results.
With regard to the complexities, the number of computations (including real number multiplications and additions) for a forward pass, and likewise for a backward pass, is approximately equal to the number of parameters. During training, the computing is done on GPUs, and both a forward pass and a backward pass are required. Thus, the time complexity for training is given by (16), where $N_{\text{GPU}}$ denotes the number of GPU cores that can realistically run in parallel. (In reality, for some tasks, not all of the GPU cores can work simultaneously, but this is difficult to discuss in this work, as parallel computing is very complicated.) During decoding, only a forward pass is required. The time complexity for decoding is given by (17). In our experiments, the NVIDIA GeForce GTX TITAN Black graphics card, which includes 2880 GPU cores, is used. Since $N_{\text{GPU}}$ is large, the experiments run relatively fast and are finished in a few days. However, this graphics card is bulky (26.67 cm × 11.12 cm × 7.44 cm), so it is hard to embed it into a humanoid robot for decoding. If smaller graphics cards or CPUs are used in the robot, training and decoding will take considerably longer. Thus, it is important to reduce the time complexities.
Equations (16) and (17) reveal that the time cost depends on the number of parameters. Revisiting Table 1, we notice that the PA-SVD tasks have significantly lower time costs than the conventional task, whereas the WERs are almost the same. In particular, the PA-SVD-256 task achieves a 2.0 times speedup and the PA-SVD-128 task achieves a 3.5 times speedup, which provides a way for humanoid robots to learn and recognize speech much more efficiently and effectively. Besides, the memories of robots are much smaller than those of servers, as robots are restricted in size, weight, and power. The final models of the PA-SVD tasks have markedly fewer parameters than the conventional model, which consequently also provides a way for robots to reduce their size, weight, and energy consumption.
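The link between the parameter counts and the reported speedups can be checked with simple arithmetic. The sketch below counts weights only (biases are ignored, which is an assumption) from the layer sizes given in the experimental settings, with every matrix except the first one factored. The numbers are computed from those sizes and are not copied from Table 1; `weight_count` is a hypothetical helper.

```python
def weight_count(layer_sizes, rank=None):
    """Count weights (biases excluded). With rank set, all matrices but the first are factored."""
    total = 0
    for k, (n_in, n_out) in enumerate(zip(layer_sizes, layer_sizes[1:])):
        if rank is None or k == 0:
            total += n_in * n_out            # dense matrix
        else:
            total += (n_in + n_out) * rank   # two factors with a bottleneck of size rank
    return total

# Input layer, five hidden layers, output layer.
sizes = [440, 1024, 1024, 1024, 1024, 1024, 1952]
dense = weight_count(sizes)
print("conventional:", dense)
for rank in (384, 256, 128):
    count = weight_count(sizes, rank)
    print(f"rank {rank}: {count} weights, ratio {dense / count:.2f}")
```

Under these assumptions the ratios for ranks 256 and 128 round to 2.0 and 3.5, which is consistent with the speedups quoted above.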

Conclusions
In this paper, we propose a fast learning method for MLPs in ASR systems. It is suitable for humanoid robots whose CPUs/GPUs and memories are limited, as its time complexities are low and the final models are small. First, the PA strategy improves the frame accuracies and the cross entropy losses of the MLP during the first training epoch, based on the cosine-function separation of the training data and the dynamic learning-rate. The SVD-based method then restructures the weight matrices of the preadjusted MLPs and reduces their dimensionality. After that, the BP algorithm that fits the unfolded weight matrices is used to train the MLP obtained by the SVD restructuring. In the experiments, this method makes the training processes around 2.0 times faster than before while improving the cross entropy loss and the frame accuracy; moreover, it makes the training processes around 3.5 times faster than before with just a negligible increase of the cross entropy loss and a tiny loss of frame accuracy.

3.1. Experimental Settings. We conduct experiments on LVCSR tasks on a server with 4 Intel Xeon E5-2620 CPUs and 512 GB of memory. The training of MLPs is accelerated by an NVIDIA GeForce GTX TITAN Black graphics card. We use speech databases, measured in hours (h), and their transcriptions to train and test acoustic models. The training data contain a 120 h speech database and the testing data contain a 3 h speech database. The texts of the testing data contain 17,221 words. The language model used is a 5-gram ARPA model.

Figure 1: A model restructured by the SVD method.

Figure 2: Cross entropy losses during the first epoch.
2.1. A Learning Strategy for the First Epoch. The basic concept of this strategy is to roughly train the MLP before accurate learning. Concretely, it goes through all of the training data only once during the first epoch, using the conventional BP algorithm. During this epoch, the frame accuracy of the MLP is raised as far as possible. This strategy first separates the training data into $N$ bunches. When training with the $i$th bunch of data, a dynamically declining learning rate is used, which is
$$\eta_i = \eta_0 \cdot \alpha^{i-1}. \tag{1}$$

Table 1: WERs and model scales.