This paper presents a new method of identifying Vietnamese voice commands using Google speech recognition (GSR) service results. The problem is that the rate at which the Google system correctly identifies Vietnamese voice commands is not high. We propose a supervised machine-learning approach to address cases in which Google identifies voice commands incorrectly. First, we build a voice command dataset that includes the GSR hypotheses for each corresponding voice command. Next, we propose a correction system using support vector machine (SVM) and convolutional neural network (CNN) models. The results show that the correction system reduces the error rate in recognizing Vietnamese voice commands from 35.06% to 7.08% using the SVM model and to 5.15% using the CNN model.
Users’ demand for devices that support voice communication is increasing rapidly. Voice is a natural approach to human-machine communication, so voice control systems must meet the needs of a wide variety of users. According to Jeffs’ survey [
In this study, we developed a method of Vietnamese voice command recognition using GSR services, as in the above studies [
To overcome errors in GSR results, we propose a new method based on a supervised machine-learning approach. The purpose of the proposed method is to transform incorrect results into correct ones. First, we build a voice command dataset that includes the hypotheses of GSR for each corresponding voice command. Based on this dataset, we construct a command classifier, which receives a voice command (returned from GSR) and outputs the corresponding label of the command. Here, the label belongs to the set of voice commands supported by the system. This method helps the system correctly classify the commands that GSR identified incorrectly.
Our approach is similar to classifying text by category, and many text classification approaches are based on supervised learning methods. Kumaresan et al. [
The rest of the paper is organized as follows. Section
The method we propose is described in Figure
Model of voice command identification system for Vietnamese.
In the data preprocessing step, we first use the bag-of-words (BOW) algorithm [
Next, we find that some words in the dataset appear in many voice commands. For example, in the two voice commands “hãy gọi điện thoại” (“please call”) and “hãy gửi tin nhắn” (“please send a message”), the word “hãy” (“please”) appears in both but provides no information that helps distinguish between them. The term frequency-inverse document frequency (TF-IDF) algorithm enhances the value of distinctive words and reduces the value of words that appear often (and are therefore less discriminative) [
TF-IDF is the weight of a word in a text, obtained through a statistical process; it reflects the importance of the word within the text. Term frequency (TF) is used to estimate the frequency of a word in the text (Equation (
In Equation (
Inverse document frequency (IDF) is a quantity used to estimate the importance of a word. In calculating a TF value, the words are considered equally important. However, some words are commonly used but not important to the meaning of the text; for example, linking words or prepositions. We need to reduce the importance of those words using IDF. This value is calculated by
Thus, we use the TF-IDF algorithm to represent the value of each word in the BOW. Each voice command is then represented by a set of values of the form `<index1>:<value1> <index2>:<value2> <index3>:<value3> …`.
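The preprocessing step above can be sketched as follows. This is a minimal illustration, not the authors' exact pipeline: the command strings and label values are placeholders (shown without diacritics for readability), and real vocabularies would be built from the full dataset. Note how the shared word receives a TF-IDF weight of zero, matching the point made about “hãy”.

```python
# Minimal sketch of the preprocessing step: build a bag-of-words
# vocabulary, weight each word with TF-IDF, and emit the sparse
# "<index>:<value>" line format that LIBSVM expects.
# Command strings and labels here are illustrative placeholders.
import math
from collections import Counter

commands = [
    "hay goi dien thoai",   # "please call" (diacritics omitted)
    "hay gui tin nhan",     # "please send a message"
]
labels = [0, 1]

# 1. Bag of words: one index per distinct word.
vocab = {w: i + 1 for i, w in
         enumerate(sorted({w for c in commands for w in c.split()}))}

# 2. IDF over the command collection.
n_docs = len(commands)
df = Counter(w for c in commands for w in set(c.split()))
idf = {w: math.log(n_docs / df[w]) for w in vocab}

def to_libsvm(label, text):
    """Encode one command as a LIBSVM line: label index:value ..."""
    tf = Counter(text.split())
    total = sum(tf.values())
    feats = {vocab[w]: (tf[w] / total) * idf[w] for w in tf}
    body = " ".join(f"{i}:{feats[i]:.4f}" for i in sorted(feats))
    return f"{label} {body}"

for lab, cmd in zip(labels, commands):
    print(to_libsvm(lab, cmd))
```

Because “hay” occurs in every command, its IDF is log(2/2) = 0, so it contributes nothing to the feature vector, exactly the behavior the TF-IDF weighting is meant to achieve.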
The number of attributes (values) of each voice command is the size of that bag, so all voice commands have the same number of attributes. We use the LIBSVM library [
Among popular transformation functions (kernel functions), radial basis function (RBF) is used the most. The RBF kernel of two samples (
Therefore, we use the RBF kernel for the SVM model. We need to adjust two parameters:
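The two parameters to tune for an RBF-kernel SVM are the penalty C (used at training time, trading margin width against training error) and gamma (controlling how fast similarity decays with distance). A small numeric sketch of the kernel itself, with placeholder feature vectors standing in for the TF-IDF vectors:

```python
# Numeric sketch of the RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)
# between two feature vectors. The vectors below are placeholders for
# the TF-IDF representations of two voice commands.
import numpy as np

def rbf_kernel(x, z, gamma):
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([0.0, 0.7, 0.2])
z = np.array([0.1, 0.6, 0.0])

print(rbf_kernel(x, x, gamma=0.5))  # identical vectors -> 1.0
print(rbf_kernel(x, z, gamma=0.5))  # similarity decays with distance
```

Identical inputs always give a kernel value of 1, and the value shrinks toward 0 as the vectors move apart; larger gamma makes this decay sharper, which is why C and gamma are typically tuned together (e.g., by grid search).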
The architecture of the CNN network is described in Figure
The CNN network architecture used for identifying Vietnamese voice commands.
The first layer we use is the embedding layer, which turns positive integers (indexes of words) into dense vectors of fixed size. This layer is represented by the matrix embedding
The convolution layer uses a set of filters (or kernels) to calculate and extract new features from the input matrix. Whether a classification model is effective depends on whether it is able to extract the new features, so this convolution layer is very important in the CNN network. The convolution layer is represented by filter size (usually 3, 4, and 5) and number of filters (usually equal to embedding size, so we use 128 here).
We can picture the convolution layer as a small filter sliding over the input matrix. At each position, the filter multiplies the values in the input matrix by its weights and sums the results. The matrix obtained after this process is called a convolution feature.
The value at each point of the convolution feature is computed by applying the filter at the corresponding coordinates. In our CNN model, the number of feature maps
After the filters perform convolution on the input matrix, the data are processed by a nonlinear activation function. Without an activation function, a neural network layer can only perform a linear transformation of the input data. There are several activation functions, such as sigmoid, tanh, and the Rectified Linear Unit (ReLU). In this model, we use the ReLU activation function, defined as f(x) = max(0, x).
The pooling layer is one of the computational components in the CNN structure. Pooling is a computation on a matrix whose goal is to reduce the size of the input matrix while highlighting its characteristics. There are many pooling operators, such as sum pooling and max pooling, but max pooling is the most commonly used. In terms of meaning, max pooling highlights the strongest of the input features, so we use max pooling for our CNN model.
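A tiny numeric illustration of max pooling (the matrix values and 2×2 window size are illustrative, not taken from the model):

```python
# Each pooling window keeps only its largest value, shrinking the
# matrix while preserving the strongest feature responses.
import numpy as np

feature = np.array([
    [1.0, 3.0, 0.0, 2.0],
    [5.0, 2.0, 1.0, 0.0],
    [0.0, 1.0, 4.0, 1.0],
    [2.0, 0.0, 1.0, 3.0],
])

# 2x2 max pooling with stride 2: group the 4x4 matrix into four
# 2x2 blocks, then take the maximum within each block.
pooled = feature.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[5. 2.] [2. 4.]]
```

The 4×4 convolution feature shrinks to 2×2, and each surviving value is the strongest response in its region.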
A CNN always has a fully connected layer, and this layer is often designed as the last layer of the network. Neurons in the fully connected layer will fully connect to all neurons in the previous layer. This layer plays the role of classifying data as required by each problem. In this layer, output
This layer takes as input a vector of
We designed the CNN model to return a probability distribution over the voice commands in the vocabulary set. For voice commands that are not in the vocabulary set, these probabilities are usually very small, and the system uses this feature to exclude such commands. The goal of the system is to correctly identify the voice commands in the list selected in advance for each specific application. To add new voice commands to the system, the following tasks should be done: collect data for these commands using the GSR service and retrain the model, carefully checking the performance of the system on the validation set.
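The layer stack described above (embedding, convolution, ReLU, max pooling over time, fully connected softmax) can be sketched as a plain NumPy forward pass. This is a hand-rolled illustration of the data flow only: all sizes and weights below are placeholders, and a real model would learn the weights by backpropagation rather than drawing them at random.

```python
# Pure-NumPy forward pass of the described CNN pipeline: embedding
# lookup, 1D convolution over word positions, ReLU, max pooling over
# time, and a softmax fully connected layer over the command labels.
import numpy as np

rng = np.random.default_rng(0)

VOCAB, EMB, FILTERS, FSIZE, NUM_CMDS, SEQ = 50, 8, 16, 3, 22, 6

embedding = rng.normal(size=(VOCAB, EMB))         # word index -> dense vector
conv_w = rng.normal(size=(FILTERS, FSIZE, EMB))   # 16 filters of size 3
dense_w = rng.normal(size=(FILTERS, NUM_CMDS))    # fully connected weights

def forward(word_indexes):
    x = embedding[word_indexes]                       # (SEQ, EMB)
    # Convolution: one output per window position per filter.
    windows = np.stack([x[i:i + FSIZE] for i in range(SEQ - FSIZE + 1)])
    conv = np.einsum("twe,fwe->tf", windows, conv_w)  # (positions, FILTERS)
    conv = np.maximum(conv, 0.0)                      # ReLU: max(0, x)
    pooled = conv.max(axis=0)                         # max pooling over time
    logits = pooled @ dense_w                         # fully connected layer
    p = np.exp(logits - logits.max())
    return p / p.sum()                                # softmax distribution

probs = forward(np.array([3, 17, 5, 9, 0, 42]))       # placeholder word indexes
print(probs.argmax(), probs.sum())
```

The output is a probability distribution over the 22 command labels; with random weights it is meaningless, but after training, the largest probability identifies the command, and uniformly small probabilities signal an out-of-vocabulary input, as described above.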
We trained our CNN model with Adam optimization (
In this section, we assess the effectiveness of machine-learning methods in improving GSR’s voice command recognition results. We built a dataset based on GSR service results and evaluated the rate of incorrect results both from GSR alone and when applying the machine-learning algorithms.
The dataset was collected using the GSR service. For each audio input, GSR returns up to 5 hypotheses for the content of the audio, arranged in order from high to low probability.
To collect data, we made a total of 2350 recordings corresponding to 22 types of voice commands using a Samsung Galaxy A8+ smartphone in a quiet environment (Figure
The process of collecting voice command data on an Android mobile device platform using GSR.
We use two measures of GSR accuracy. The first, the command error, compares the output hypothesis (the hypothesis with the highest probability among the returned results) of GSR with the voice command content. If the two strings differ, we consider the command as having an accuracy of 0; if they are the same, the accuracy is 1. The command error is 1 minus the average accuracy over all the collected statements.
The second measure is the word error rate (WER), which measures the deviation of the output hypothesis (the hypothesis with the highest probability of all results returned) from the contents of the voice command. The WER is the minimum number of word insertions, substitutions, or deletions needed to convert the hypothesis text into the reference text [
In Equation (
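The two measures can be sketched as follows; the example strings are illustrative placeholders, and the WER here is the standard word-level Levenshtein distance divided by the reference length.

```python
# Command accuracy: exact string match between hypothesis and reference.
def command_accuracy(hypothesis, reference):
    return 1 if hypothesis == reference else 0

# WER: minimum word insertions/substitutions/deletions to turn the
# hypothesis into the reference, divided by the reference word count.
def wer(hypothesis, reference):
    h, r = hypothesis.split(), reference.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)

print(wer("hay goi dien", "hay goi dien thoai"))  # 1 edit / 4 words = 0.25
```

The command error is then 1 minus the average of `command_accuracy` over all test commands, while the WER averages the per-command word-level deviation.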
When testing GSR on our test set, we obtained a command error of 35.06% and WER of 22.25%. So, the results returned by GSR had a high error rate. In the following section, we describe the experimental results using SVM and CNN machine-learning models to improve the accuracy of the voice command recognition system.
To evaluate the performance of the voice command recognition system, we use the following measures:
P1: voice command error rate (%), which is calculated as
P2: WER (%) of the voice commands (Equation (
R1: number of commands that Google recognized incorrectly and our model recognized correctly
R2: number of commands that Google recognized correctly and our model recognized incorrectly
R3: number of commands that both Google and our model recognized incorrectly
After the construction process, we obtained the dataset (2350 samples of 22 voice commands, each consisting of up to 5 hypotheses from GSR arranged in order of probability from high to low). We used 20% of the dataset for testing. Because the test set is used to check the accuracy of the machine-learning models, it was kept fixed. For the test data, of the 5 hypotheses returned by Google, only the hypothesis with the highest probability for each command was used.
We used 70% of the remaining dataset for training and 10% for the validation set. The training set can be changed to improve the machine-learning model. Therefore, we constructed three types of training datasets as follows:
1-BEST: use the best hypothesis returned by Google (with the highest probability)
TYPE-1: combine N-best (
TYPE-2: keep N-best (
Thus, in the TYPE-2 training set, using more N-best hypotheses increases the number of training commands, while the command length does not increase. In the TYPE-1 training set, the length of each command increases as the best hypotheses are combined, but the number of commands does not increase.
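The three training-set variants can be sketched as below; `hypotheses` is a placeholder list of GSR hypotheses for one command, ordered from most to least probable.

```python
# Placeholder N-best list for a single voice command (diacritics omitted).
hypotheses = [
    "hay goi dien thoai",
    "hay goi dien thoai di",
    "hay goi dien",
]

def one_best(hyps):
    # 1-BEST: keep only the most probable hypothesis.
    return [hyps[0]]

def type_1(hyps, n):
    # TYPE-1: concatenate the N best hypotheses into one longer sample.
    return [" ".join(hyps[:n])]

def type_2(hyps, n):
    # TYPE-2: keep the N best hypotheses as separate samples.
    return hyps[:n]

print(one_best(hypotheses))
print(type_1(hypotheses, 3))   # one sample, longer text
print(type_2(hypotheses, 3))   # three samples of original length
```

TYPE-1 grows the length of each training sample without adding samples, while TYPE-2 multiplies the number of samples without lengthening them, matching the trade-off described above.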
In the voice command identification test using the SVM model, we used only the 1-BEST and TYPE-1 datasets. The TYPE-2 set has a relatively large amount of data, so it could not be used effectively with the SVM model [
Accuracy of recognition of voice commands using the SVM model.
Training set | P1 (%) | P2 (%) | R1 | R2 | R3 |
---|---|---|---|---|---|
Train 5-best, TYPE-1 | 9.01 | 7.70 | 96 | 0 | 42 |
Train 4-best, TYPE-1 | 9.03 | 8.19 | 95 | 0 | 43 |
Train 3-best, TYPE-1 | 9.87 | 8.34 | 92 | 0 | 46 |
Train 2-best, TYPE-1 | 8.58 | 6.83 | 98 | 0 | 40 |
Train 1-best | 7.08 | 6.46 | 105 | 0 | 33 |
Test results of identification of voice commands using the CNN model are described in Table
Accuracy of recognition of voice commands using the CNN model.
Training set | P1 (%) | P2 (%) | R1 | R2 | R3 |
---|---|---|---|---|---|
Train 5-best, TYPE-1 | 9.3 | 5.93 | 104 | 0 | 34 |
Train 4-best, TYPE-1 | 9.66 | 7.60 | 93 | 0 | 45 |
Train 3-best, TYPE-1 | 10.73 | 8.34 | 91 | 3 | 47 |
Train 2-best, TYPE-1 | 9.66 | 7.05 | 96 | 3 | 42 |
Train 1-best | 7.51 | 6.43 | 103 | 0 | 35 |
Train 5-best, TYPE-2 | 5.15 | 3.93 | 115 | 3 | 23 |
Train 4-best, TYPE-2 | 5.58 | 3.96 | 114 | 0 | 24 |
Train 3-best, TYPE-2 | 6.22 | 4.61 | 112 | 3 | 26 |
Train 2-best, TYPE-2 | 7.08 | 6.51 | 108 | 3 | 30 |
Experimental results show that, with the SVM model, 105 of 138 incorrectly identified commands were fixed, and with the CNN model, 115 of 138 were fixed. This demonstrates that applying machine-learning algorithms significantly improved GSR’s recognition capability for voice commands.
As we described in
The architecture of the application is depicted in Figure
Voice recognition service: responsible for recording audio, sending it to the GSR service for speech recognition, and receiving textual hypotheses from GSR
Voice command classification: responsible for assigning command labels to the text hypotheses obtained from the voice recognition service. With the supervised machine-learning approach, this component correctly reclassifies the voice commands that GSR misidentifies. After classification, the command label is sent to the filter and execute function
Filter and execute functions: responsible for receiving the command label from the voice command classifier and analyzing it to filter out the parameters needed to execute the corresponding function. For dual voice commands (comprising multiple steps), the function filter can be connected directly to the voice recognition unit
Architecture of application for mobile device control by Vietnamese voice command.
This application has been successfully built and deployed on mobile devices running the Android operating system. Because the CNN model performed better than the SVM model in our experiments, we deployed only the CNN model in this system. Table
The results listed in Table
Execution time (in milliseconds) to identify smartphone control commands (totals for 10 commands, each repeated 10 times). T1 is GSR’s speech recognition processing time. T2 is the postprocessing time for voice command analysis by the CNN model.
Command index | T1 (in milliseconds) | T2 (in milliseconds) |
---|---|---|
1 | 11934 | 1676 |
2 | 11394 | 1029 |
3 | 13352 | 216 |
4 | 4971 | 178 |
5 | 4639 | 160 |
6 | 10680 | 217 |
7 | 3912 | 167 |
8 | 5795 | 694 |
9 | 3630 | 182 |
10 | 7742 | 214 |
In this study, we have proposed a novel method to identify Vietnamese voice commands using GSR service results. First, we built a dataset for voice commands based on the GSR service. Then, we applied SVM and CNN models to increase the recognition accuracy of voice commands. The results show that using SVM and CNN machine-learning algorithms significantly improved the ability to recognize voice commands compared to results returned directly by GSR. The CNN model gave better identification results than the SVM model; the command error rate decreased from 35.06% with GSR to 7.08% with the SVM model and 5.15% with the CNN model. Indeed, the CNN model promises good results with increasingly diverse training data.
We also offer a method for deploying device control applications via voice command and implementing this application on devices running the Android operating system. Experimental results show that the processing time for analyzing the execution of voice commands compares favourably to the waiting time for results from GSR.
In the future, we plan to conduct training with datasets in many environments and contexts, for example, in outdoor environments and on datasets consisting of various Vietnamese dialects.
The “Vietnamese Voice Commands” dataset used to support the findings of this study is included within the supplementary information files.
The authors declare that there is no conflict of interest regarding the publication of this paper.
Vietnamese voice commands.