Acoustic Model with Multiple Lexicon Types for Indonesian Speech Recognition



Introduction
The machine's ability to transform voice into text has made work easier in various fields, including simplifying the recording process and displaying video transcripts or subtitles. One advantage is that each person's conversation can be recorded directly and quickly converted into text [1,2]. These records can be further indexed for searching and retrieval using keywords [3]. Another benefit of speech-to-text technology is that it allows users to command computers and smartphones by voice [4].
Many research works on speech-to-text in English have been carried out. In [5], Hinton et al. built an acoustic model using the deep neural network (DNN) algorithm to replace the Gaussian mixture model (GMM) technique. Hinton et al. mentioned that a DNN with many hidden layers and nodes could improve accuracy. The work in [6] conducted similar research by changing the DNN structure to long short-term memory (LSTM) to train the acoustic model. In that study, the LSTM, which is a type of recurrent neural network (RNN), obtained better accuracy than the DNN, although the training process took more time. In addition, Williams et al. [7] studied language models and built them using the RNN to optimize a speech-to-text system. The Indonesian speech-to-text studies were conducted by [8][9][10][11][12]. The systems were built using the GMM and hidden Markov model (HMM) techniques and converted the speech spontaneously through dictation. However, the accuracy is still below 85%, and the speech datasets are not publicly available. Hence, it is not easy to compare the methods.
Our contributions are as follows. First, this research builds a speech dataset to support automatic speech recognition (ASR) voice data processing in Indonesian, the target language. The data sources are videos from YouTube with subtitles [13]. The YouTube channel manages many videos and provides a feature to download them. Videos on YouTube channels generally have transcripts, i.e., text placed at the bottom of the video that follows the conversation. Voice and video transcripts are downloaded to build a speech-to-text dataset. Second, this research provides an open Indonesian speech-to-text dataset. Third, this research builds acoustic models by applying the alignment data from the Gaussian mixture model-hidden Markov model (GMM-HMM), TDNN factorization (TDNNF), and CNN-TDNNF-augmented models. Fourth, data augmentation is utilized to increase the number of validated datasets and improve the performance of the acoustic model.

Related Works.
Research related to speech-to-text has been widely carried out in various languages such as English, Mandarin [14,15], African [16], Pakistani [17], Italian [18], and Indian [19]. Most works built acoustic models using the Mel frequency cepstral coefficient (MFCC) feature, a stable and accurate cepstral coefficient representing sound and music [20,21].
In addition, many acoustic speech-to-text models were built using a DNN that utilizes the alignment model of the GMM-HMM as the target class to train the model [22,23]. Various DNNs have been developed to build acoustic models. The work in [23] used a DNN with four hidden layers and 1,500 nodes per layer to build the acoustic model from heterogeneous datasets. They found a phoneme error rate of 12.76% for children, 10.91% for adult women, and 8.62% for adult males. In [24], Sak used the LSTM-RNN architecture to train an acoustic model from a Google voice search dataset of three million utterances, or approximately 1,900 hours. The word error rate (WER) was 10.7% for test data containing 22,500 utterances. The research by Zia and Zahid [25] also used the same architecture to train an acoustic model for an Urdu language dataset of 20 speakers consisting of hundreds of words. The acoustic model was trained using several types of LSTM-RNN architecture.
Other architectures were also used to build acoustic models to study long-term temporal relationships from speech data, including the time delay neural network (TDNN) [31,32]. Meanwhile, convolutional neural networks (CNNs) [22,33,34] were used to improve the performance compared to the previous architecture when using English datasets (Table 1).
Various studies were recently conducted to reduce word error rates using different techniques. One of the standard techniques is data augmentation (DA), such as the work done by [26] using 34 hours of speech from a diverse English dataset. In [26], the word error rate was reduced to 16.85% by implementing data augmentation using the DNN. Similarly, the DNN combined with pitch enhancement was also used by [28] on a Punjabi adult and children's speech dataset to build acoustic models under different conditions. Bhardwaj and Kukreja [28] enhanced the pitch using cepstral analysis in the feature extraction process and achieved a WER of 10.98% to 12.24% under different acoustic conditions.
Speech-to-text for the Indonesian language has been studied using various Indonesian speech recognition datasets [8-12, 35, 36]. Winursito [9] used principal component analysis (PCA) to reduce the dimension of the MFCC features from 26 to 10. The accuracy of the speech-to-text system increased from 86.43% to 89.29%, trained with several speakers with 28 and 140 utterances. Another study conducted by Teduh [11] built a corpus of speech in Indonesian with 100,000 utterances recorded by 400 speakers from various regions in Indonesia. The corpus was used to build a speech-to-text model with a 20% WER.
Our work differs from the previous studies mentioned above in terms of the spontaneous speech dataset which was collected from YouTube, different lexicon constructions, and acoustic model construction using validated and unvalidated

Materials and Methods
The construction of the speech-to-text system consists of several stages. It begins with collecting audio data from a YouTube channel with a transcript so that the tokenization process of the audio that represents the existing transcript becomes easier. At this stage, checking for fake subtitles is also carried out. If the audio is detected as having a transcript, but the transcript's length does not meet the specified threshold, then the transcript is considered inaccurate and is not downloaded.
The second stage is the audio tokenization process, which is based on the transcript at a specific duration. Each sentence in the transcript is extracted based on its start time and duration. At this stage, it is essential to detect transcripts that are empty but still have a duration and start time so that they can be removed from the dataset. Third, the audio extracted into utterances and the transcripts were cleaned by removing punctuation marks, changing uppercase to lowercase letters, and removing meaningless symbols.
Next, the fourth stage is to build a speech-to-text system through learning and testing, as illustrated in Figure 1. Before training and testing, the unique words in the transcript are extracted for the training. Furthermore, a pronunciation is built for each unique word. The lexicon is used as a label in the acoustic model training process using the DNN.
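The second and third stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `clean_text` and `segment_transcript`, the `(start, duration, text)` tuple layout, and the 16 kHz sample rate are all hypothetical.

```python
import re


def clean_text(text: str) -> str:
    """Lowercase the text, replace punctuation and symbols with spaces,
    and collapse repeated whitespace."""
    text = text.lower()
    # Anything that is not a letter, digit, or whitespace becomes a space;
    # the underscore is handled separately because \w includes it.
    text = re.sub(r"[^\w\s]|_", " ", text)
    return " ".join(text.split())


def segment_transcript(entries, sample_rate=16000):
    """Map (start_sec, duration_sec, text) entries to sample ranges.

    Entries whose text is empty after cleaning are dropped, even when
    they carry a start time and a duration.
    """
    segments = []
    for start, duration, text in entries:
        cleaned = clean_text(text)
        if not cleaned or duration <= 0:
            continue
        first = int(round(start * sample_rate))
        last = int(round((start + duration) * sample_rate))
        segments.append((first, last, cleaned))
    return segments
```

For example, an entry `(0.0, 1.5, "Halo, Apa Kabar?")` yields the sample range 0 to 24000 with the text `halo apa kabar`, while an entry whose text is only whitespace is discarded.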
In learning the acoustic model, the characteristics of the voice data are extracted using the MFCC method. The data is a matrix representing the speech information in the audio data at each frame. After the extraction process, the features are augmented to increase the training data size. The last stage is acoustic model testing (decoding). Before evaluating the acoustic model, the language model for constructing the graph must be built. The graph contains the lexicon used to train the acoustic model and its weighted equivalents in the language model. The graph is used to find the right word equivalent based on the acoustic model's probability value of the syllables. In the end, the output is in the form of text based on the speech data used.
Applied Computational Intelligence and Soft Computing
The performance of the model is measured using WER as follows:
$$\mathrm{WER} = \frac{S + D + I}{N} \times 100\%, \qquad N = S + D + C,$$
where S is the number of substituted words, D is the number of deleted words, I is the number of inserted words, C is the number of correct words, and N is the total number of words in the reference transcript.
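The WER definition can be made concrete with a small word-level edit-distance routine. The sketch below is a generic dynamic-programming alignment, not the scoring tool used in the experiments; the function name `wer` and the example sentences are illustrative assumptions.

```python
def wer(reference: str, hypothesis: str):
    """Return (WER in percent, S, D, I) for a non-empty reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # table[i][j] holds (edits, S, D, I) for ref[:i] aligned to hyp[:j].
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, i, 0)  # delete every reference word
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, 0, j)  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            e, s, d, ins = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (e, s, d, ins)          # correct word
            else:
                diagonal = (e + 1, s + 1, d, ins)  # substitution
            e, s, d, ins = table[i - 1][j]
            deletion = (e + 1, s, d + 1, ins)
            e, s, d, ins = table[i][j - 1]
            insertion = (e + 1, s, d, ins + 1)
            # Tuples compare on the edit count first, so min() keeps
            # an alignment with the fewest total edits.
            table[i][j] = min(diagonal, deletion, insertion)
    edits, s, d, ins = table[-1][-1]
    return 100.0 * edits / len(ref), s, d, ins
```

With the reference `a b c d` and the hypothesis `a x c`, the alignment has one substitution and one deletion over four reference words, so the function returns a WER of 50%. When several alignments share the same minimum edit count, the split between S, D, and I may differ, but the WER itself does not.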

Data Collection and Preparation.
We downloaded the audio data in the Indonesian language from the YouTube channel with a specific duration based on the transcript. All audio collected was transcribed and used as a candidate dataset for the Indonesian speech-to-text system. The steps are shown in Figure 2. During data acquisition, YouTube IDs containing speech and transcripts were stored in a database. The audio and its transcript were downloaded using the YouTube API. Each downloaded file was saved in WAV and XML formats.
The subsequent process is data preparation, i.e., cleaning each transcript and matching the audio and the transcript, as shown in Figure 3. Transcripts containing numbers were eliminated because it is complicated to produce their pronunciation.
In this study, the Kaldi ASR toolkit [37] was used, which reads a specific set of files to train the acoustic model, as shown in Table 2. The text file contains the ID and transcript for each audio, while the wav.scp file contains the ID and the location where the audio was stored. Meanwhile, the utt2spk and spk2utt files contain the ID of each utterance and its speaker.
There were 1,980 audio data with transcripts collected from the YouTube channels, grouped into 15 categories, as summarized in Table 3. The total duration of all transcripts before cleaning is 310.09 hours. After cleaning, it is shortened to 181.25 hours. The validation process was carried out for each utterance to get the best speech quality from the audio. The utterances were validated by several validators using a simple validation interface created for the process. The total number of validated utterances in the first round was 10,333, with a duration of 7.953 hours, as summarized in Table 4. The validated dataset contains all utterances that have been validated by the validators, whereas the unvalidated dataset contains the original utterances without the validation process.
Before training the acoustic model, several types of vocabulary (lexicon) were prepared. The lexicon was extracted from the unique words in the transcripts. The process of generating the lexicon is shown in Figure 4. About 41,351 unique words were successfully extracted from the transcripts into the dictionary, including 3,330 English and 6,334 Indonesian words. The rest are informal slang words.
There are four lexicon types used in this work. Each type is described in detail in [38]. The lexicon maps words to syllables or their pronunciations. The pronunciations are shared across different words so that the number of pronunciations is not greater than the number of words in the dictionary, as summarized in Table 5. The lexicons vocab_1_char and norm_1_vocab_full have 26 pronunciations, each extracted as a single character. In contrast, enmap_vocab_1_char and enmap_norm_1_vocab_full have 49 different pronunciations because each word is mapped to the CMUDict English dictionary. Furthermore, kv_vocab_full has 135 pronunciations, made up of 26 single characters, 21 consonants multiplied by 5 vowels, and the 4 consonant combinations /kh/, /ng/, /ny/, and /sy/.
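As a rough illustration of the four files named above, the following sketch writes them from a list of utterances. The function name `write_kaldi_data_dir`, the tuple layout, and the example IDs are hypothetical; the one-entry-per-line format with the ID first follows the usual Kaldi data directory convention, and prefixing utterance IDs with the speaker ID is a common way to keep the sort order consistent.

```python
from collections import defaultdict
from pathlib import Path


def write_kaldi_data_dir(utterances, out_dir):
    """Write text, wav.scp, utt2spk, and spk2utt.

    utterances: iterable of (utt_id, speaker_id, wav_path, transcript).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    rows = sorted(utterances)  # sorted by utterance ID
    spk2utt = defaultdict(list)
    with open(out / "text", "w", encoding="utf-8") as text, \
            open(out / "wav.scp", "w", encoding="utf-8") as scp, \
            open(out / "utt2spk", "w", encoding="utf-8") as u2s:
        for utt_id, speaker, wav_path, transcript in rows:
            text.write(f"{utt_id} {transcript}\n")
            scp.write(f"{utt_id} {wav_path}\n")
            u2s.write(f"{utt_id} {speaker}\n")
            spk2utt[speaker].append(utt_id)
    # spk2utt is the inverse mapping: one line per speaker.
    with open(out / "spk2utt", "w", encoding="utf-8") as s2u:
        for speaker in sorted(spk2utt):
            s2u.write(f"{speaker} {' '.join(spk2utt[speaker])}\n")
```

Calling it with, say, `("spk1-utt1", "spk1", "/data/a.wav", "apa kabar")` produces the line `spk1-utt1 apa kabar` in `text` and `spk1-utt1 spk1` in `utt2spk`.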
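The pronunciation counts quoted above can be checked with a short sketch. It only reproduces the arithmetic (26 single characters, 21 × 5 consonant-vowel pairs, and 4 consonant combinations) and a character-level lexicon entry; the helper names are hypothetical, and the actual lexicon construction rules are those described in [38], not this code.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
VOWELS = "aiueo"
CONSONANTS = [c for c in ALPHABET if c not in VOWELS]  # 21 letters
COMBINATIONS = ["kh", "ng", "ny", "sy"]


def char_lexicon_line(word: str) -> str:
    """One lexicon entry in 'word unit unit ...' form, with each letter
    as its own pronunciation unit (assumes the word contains only a-z)."""
    return f"{word} {' '.join(word)}"


def kv_inventory():
    """Single characters, consonant-vowel pairs, and the four
    consonant combinations."""
    pairs = [c + v for c in CONSONANTS for v in VOWELS]  # 21 * 5 = 105
    return list(ALPHABET) + pairs + COMBINATIONS
```

Here `char_lexicon_line("makan")` gives `makan m a k a n`, and `len(kv_inventory())` is 26 + 105 + 4 = 135, matching the count reported for kv_vocab_full.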

Experimental Setup.
In this study, we built several acoustic and language models. We trained the acoustic model using the Kaldi ASR toolkit. The MFCC feature was extracted with 40 dimensions, a window size of 25 ms for each frame, and a frame shift of 10 ms. The feature was extracted for each audio track. In addition, the acoustic models were also trained using the GMM-HMM technique for monophone and triphone with different techniques such as delta, delta-delta, and speaker adaptation training (SAT) features. The GMM-HMM acoustic model aligns the training data to get the appropriate label (phoneme) for each frame. The aligned training data are later used to train the model using the DNN.
The time delay neural network factorization (TDNNF) [39] is used in this study. The idea is to factorize each weight matrix of the existing TDNN structure into a product of two smaller matrices. For example, a TDNN structure has a hidden layer with 700 dimensions. The structure's weights (parameters) are a matrix of size 700 × 2,100, where 2,100 is obtained from 3 frames (the current frame and its left and right offsets) multiplied by the dimension of the hidden layer. The 700 × 2,100 matrix M is factorized into two matrices, M = AB, with an inner dimension of 250; the A matrix size then becomes 700 × 250, and the B matrix size becomes 250 × 2,100, with the B matrix being semiorthogonal. In this illustration, the value 250 is the linear bottleneck dimension, and 700 is the hidden layer dimension. The number of hidden layers used in the TDNN structure is 13, with 512 dimensions for each layer and 80 for the linear bottlenecks, while the steps (stride) for each frame are 6. The TDNN structure [39] is shown in Figure 5.
The Lattice-Free Maximum Mutual Information (LF-MMI) was adopted as the objective function, together with the chain model in the Kaldi ASR toolkit, to train the acoustic model using the TDNN structure. In Figure 5, the TDNN structure receives as input 40-dimensional features and an i-vector with 100 dimensions. Both are then combined. Furthermore, dimension reduction is carried out using the linear discriminant analysis (LDA) technique before the features are sent to the hidden layer. The acoustic model is trained for the different lexicons using five epochs, with initial and final learning rates of 0.0015 and 0.00015, respectively. Furthermore, the language model, which was used to evaluate the acoustic model, was trained as a 3-gram model using the SRILM toolkit. The language model was trained using the transcripts of the training data.
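The effect of the factorization on the number of parameters can be worked through with the dimensions from the example. The function below is a hypothetical helper that only does the arithmetic; it does not build or train a network.

```python
def factorized_param_counts(hidden_dim, context_frames, bottleneck_dim):
    """Compare the size of a full weight matrix M with its factors A and B.

    M is hidden_dim x (context_frames * hidden_dim);
    A is hidden_dim x bottleneck_dim;
    B is bottleneck_dim x (context_frames * hidden_dim).
    """
    input_dim = context_frames * hidden_dim
    full = hidden_dim * input_dim
    a_params = hidden_dim * bottleneck_dim
    b_params = bottleneck_dim * input_dim
    return full, a_params, b_params


full, a_params, b_params = factorized_param_counts(700, 3, 250)
print(full, a_params + b_params)
```

For the 700-dimensional example with a bottleneck of 250, the full matrix has 700 × 2,100 = 1,470,000 weights, whereas A and B together have 175,000 + 525,000 = 700,000, i.e., fewer than half as many.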
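To illustrate what a 3-gram language model estimates from the training transcripts, the sketch below counts trigrams and computes an unsmoothed maximum-likelihood probability. This is a simplification for illustration only: SRILM applies smoothing and backoff, which are not reproduced here, and the function names and example sentences are assumptions.

```python
from collections import Counter


def trigram_counts(sentences):
    """Count trigrams and their two-word histories, padding each
    sentence with start and end markers."""
    trigrams, histories = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(2, len(tokens)):
            histories[(tokens[i - 2], tokens[i - 1])] += 1
            trigrams[(tokens[i - 2], tokens[i - 1], tokens[i])] += 1
    return trigrams, histories


def trigram_prob(trigrams, histories, w1, w2, w3):
    """Unsmoothed estimate of P(w3 | w1, w2); 0.0 for unseen histories."""
    seen = histories[(w1, w2)]
    return trigrams[(w1, w2, w3)] / seen if seen else 0.0
```

Given the two transcripts `saya makan nasi` and `saya makan roti`, the estimate of P(nasi | saya, makan) is 1/2, because the history `saya makan` occurs twice and is followed by `nasi` once.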

Results and Discussion
As described previously, the acoustic model was built using several lexicon types. The acoustic model was trained using unvalidated and validated datasets. First, the acoustic models were built using unvalidated utterances. The number of unvalidated utterances used to train and test the models was 206,206 and 5,287, respectively. The best acoustic model utilized the enmap_kv_vocab_full lexicon type, with a WER of 29.41%, trained with TDNNF, as summarized in Table 6. The subsequent evaluations were carried out using three different sizes of validated utterances. The number of validated utterances is 49,000, which will later be evaluated using several lexicon types. The training utterances for evaluation are summarized in Table 7.
The lexicons used are 1c-vocab-full, enmap-1c-vocab-full, enmap-kv-vocab-full, and kv-vocab-full. It was found that the two lexicon types norm-1c-vocab-full and enmap-norm-1c-vocab-full used in the previous test were very similar to the lexicon types 1c-vocab-full and enmap-1c-vocab-full, so these two lexicon types were not used in this evaluation. The tests were carried out for both validated and unvalidated utterances. The following evaluations were conducted by training the GMM-HMM models using 10,000 validated utterances. The model's performance was evaluated using 2,450 validated testing utterances, as summarized in Table 8. The results show that the smallest WER is 63.20%, generated by the GMM-HMM triphone (SAT) model, trained using the 1c-vocab-full lexicon type and 10,000 validated utterances. Figure 6 compares the %WER of the GMM-HMM models.
Furthermore, the following evaluation was also carried out using 10,000 unvalidated utterances. The results, summarized in Table 9, show that the lowest WER percentage for the GMM-HMM model is 71.24%. The performance was obtained using the 1c-vocab-full lexicon type. However, the WER percentage is greater than that of the acoustic model trained using 10,000 validated utterances. Figure 7 compares the %WER of the GMM-HMM models using 10,000 unvalidated utterances. The GMM-HMM model trained using the validated utterances outperformed the model trained using the unvalidated utterances.
The following evaluations were conducted for the GMM-HMM models trained using 30,000 validated utterances. The model's performance was tested using the same 2,450 validated testing utterances. The best GMM-HMM model was obtained with the 1c-vocab-full lexicon type, with a WER of 54.9%. The percentage is better than that of the GMM-HMM model trained on 10,000 validated utterances, i.e., 63.2%. These results illustrate that increasing the number of validated data can improve the model's performance and reduce %WER. The results are summarized in Table 10. Figure 8 compares the %WER of several GMM-HMM models using 30,000 validated utterances.
The evaluation was also done for the GMM-HMM models trained using 30,000 unvalidated utterances. We found that the %WER of the GMM-HMM models was even worse than that of the model trained using the validated utterances, i.e., 63.28%, obtained using the 1c-vocab-full lexicon type, as summarized in Table 11. The %WER was close to the %WER of the model trained using 10,000 unvalidated utterances. This result further shows that increasing the number of validated utterances can improve the model's performance and reduce %WER. Figure 9 compares the %WER of the GMM-HMM models using 30,000 unvalidated utterances. Again, we discovered that the GMM-HMM model trained using the validated utterances outperformed the one trained using the unvalidated utterances.
We also trained the GMM-HMM models using 46,550 validated utterances and tested the models using 2,450 validated utterances. We found that the lowest %WER of the models was obtained using the enmap-1c-vocab-full lexicon type, i.e., 50.83%, as shown in Table 12. Moreover, Figure 10 compares the %WER of the models using 46,550 validated utterances.
When we trained the GMM-HMM acoustic models using 46,550 unvalidated utterances, the %WER of the models did not improve. The best %WER was 60.45%, worse than the %WER of the models that were trained using the validated utterances, as summarized in Table 13. Figure 11 compares the %WER of the GMM-HMM models using 46,550 unvalidated utterances.
All lexicon types were combined in the subsequent evaluation. We named it the four-combination lexicon. The lexicon combines the different pronunciations of the four lexicons into one lexicon. This combination aims to obtain different pronunciation information that can be trained with a single model. Table 14 shows that the %WER for the four-combination lexicon, i.e., 42.8%, is lower than the %WER of the other GMM-HMM acoustic models (see also Figure 12).
In the following evaluation, we increased the amount of data using the data augmentation approach by changing the tempo of the original audio. The factor values are 0.9 to slow down and 1.1 to speed up. The size of the augmented data is three times larger than the nonaugmented data. The results show that the %WER of the GMM-HMM acoustic model decreased when trained using the augmented utterances, i.e., to 40.85%. In addition, the %WER of the acoustic model trained using TDNNF was also the smallest when the augmented utterances were utilized, as summarized in Table 15. These results confirmed that %WER decreased when more validated utterances were used. Figure 13 compares the %WER of the GMM-HMM and TDNNF models trained with and without data augmentation.
Furthermore, the CNN model was trained using validated augmented utterances and the four-combination lexicon type. The result shows that the acoustic model returns the best %WER, i.e., 19.03%, using the same testing utterances. Table 16 shows the result.
It is not easy to compare the WER of models for other languages with ours because the datasets are different. Moreover, most datasets are not spontaneous speech datasets like ours, which was collected from YouTube, is very noisy, and sometimes contains overlapped voices. If the WER of our best result is compared with the one obtained by [8] on their spontaneous speech dataset, our WER, 19.03%, is considerably lower than theirs, i.e., 43.14%. This is also true for the work in [11]. They used speech corpora from several recording projects, which is not a spontaneous speech dataset, and their WER is 20%, slightly higher than ours.
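The bookkeeping behind the tempo augmentation can be sketched as follows. The code only builds the list of augmented utterance IDs and their expected durations; it does not process audio. The function name, the ID prefix format, and the example durations are hypothetical. One common way to apply such a change in practice is the `tempo` effect of the SoX command-line tool, although the tool used for the experiments is not stated in the text.

```python
def tempo_augment(utterances, factors=(0.9, 1.1)):
    """Return the original utterances plus one tempo-changed copy per factor.

    utterances: list of (utt_id, duration_sec).
    A factor below 1 slows the audio down, so the duration grows;
    a factor above 1 speeds it up, so the duration shrinks.
    """
    augmented = list(utterances)
    for factor in factors:
        for utt_id, duration in utterances:
            augmented.append((f"tempo{factor}-{utt_id}", duration / factor))
    return augmented


original = [("utt1", 9.0), ("utt2", 4.4)]
augmented = tempo_augment(original)
print(len(augmented))  # three times the number of original utterances
```

With two factors, every utterance appears three times (original, slowed, and sped up), which matches the threefold increase in data size described above.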

Conclusions
The 181-hour audio dataset was collected from the YouTube channel, consisting of 215,291 utterances with transcripts. A dictionary of 41,351 unique words was also extracted from the transcripts and used to construct the four lexicon types with different pronunciation patterns. The validated dataset and all lexicon types were used to train the acoustic models with a TDNN approach. The results show that the acoustic model built using the validated dataset is better than the one trained using the unvalidated dataset for all lexicon types. When the acoustic models were trained using the combination of all lexicon types and augmented utterances, the %WER of the GMM-HMM, TDNNF, and CNN-TDNNF-augmented models was reduced to 40.85%, 24.96%, and 19.03%, respectively. The limitation of this work is that the size of the validated utterances was considered small to recognize all phonemes from the existing lexicon, although the experimental results show improvements in the model's performance. In addition, data augmentation with various approaches is an excellent approach for increasing the number of validated datasets. In the future, we will try to increase the size of the validated dataset. An end-to-end approach could also be a potential solution for building a speech recognition model without constructing a lexicon. In addition, transfer learning using a pretrained model is also an interesting study to observe using Indonesian speech recognition datasets in the future.

Data Availability
The datasets used in this research are available from the corresponding author upon request.