Improvement in Automatic Speech Recognition of South Asian Accent Using Transfer Learning of DeepSpeech2

Automatic speech recognition (ASR) provides a convenient and fast mode of communication between humans and computers, and it has become more accurate over time. However, in the majority of ASR systems, the models have been trained on native English accents. While they serve native English speakers well, their accuracy drops drastically for non-native English accents. Our proposed model addresses this limitation. We fine-tuned the DeepSpeech2 model, pretrained on the native English accent dataset LibriSpeech, by retraining it on a subset of the Common Voice dataset containing only South Asian accents, using the proposed novel loss function. We experimented with three different layer configurations of the model to learn the best features for South Asian accents. Three evaluation measures were used: word error rate (WER), match error rate (MER), and word information loss (WIL). The results show that DeepSpeech2 can perform significantly better for South Asian accents if the weights of the initial convolutional layers are retained while the weights of the deeper layers (i.e., the RNN and fully connected layers) are updated. Our model gave a WER of 18.08%, which is the minimum error achieved for non-native English accents in comparison with the original model.


Introduction
Automatic speech recognition (ASR) is a key component in making human-computer interaction (HCI) hassle-free because it is the most interactive and convenient mode of communication between automated systems and humans [1]. The interaction between humans and voice-based systems is mostly accomplished in the English language. Common applications of voice-controlled systems in AI include chatbots, humanoid robots, healthcare systems [2], self-driving cars, surveillance systems, industrial robots, and many more. From these applications, chatbots are the ... According to the British Council, 1.75 billion people in the world speak English [11]. America has the biggest share, with around 268 million native English speakers [12]; the UK is second with 59.6 million [13]; Europe is third, with collectively 70 million people who speak English as a native language [14]; and Canada is fourth with 19 million native speakers [15]. Although these countries have the largest populations of native English speakers, many non-native English speakers live in them as well. Apart from this, there are a number of countries that use English as a second language, and their accent differs from that of native English speakers.
For instance, Pakistan and India hold the biggest share of non-native English speakers: around 88.6 million people in Pakistan and 125 million in India speak English as a secondary language [16]. Due to regional differences, the English accents of South Asian and Gulf countries differ from those of native English-speaking countries [17].
This highlights a usability issue for voice-controlled systems when accents differ, which eventually creates hurdles in the use of the voice-controlled systems discussed above. Although ASR is widely used, it is not flexible enough for non-native English accents; thus, 290 million people are unable to use its applications properly. The motivation behind this research is as follows:
(i) Make English ASR more accurate for non-native English speakers.
(ii) Retain the accuracy of an existing ASR system and make it more robust by adding an additional pipeline for non-native English speakers.
(iii) Make the model learnable from a small amount of data.
For the last two decades, hidden Markov models (HMMs) and Gaussian mixture models (GMMs) were very effective in improving the recognition accuracy of ASR. However, in recent years, deep neural networks (DNNs) [18] have replaced GMMs, although the remaining part of the GMM-based recognition architecture is still kept in several experiments. These systems are called hybrid ASR systems [19] because they use the classic HMM/GMM-based architecture and, after training, replace the GMM with a DNN. Likewise, recurrent neural networks (RNNs) are used in a similar manner for language modeling.
There is a plethora of work on ASR. For most ASR systems, models are trained on native English accents. Some of the described models have achieved above 90.0% accuracy. For example, Google's ASR model [20] can be as accurate as a human in some cases; its authors claimed 95.0% accuracy. However, this model still does not perform well for non-native English accents. Research and development in ASR is continuously improving [21]: large-scale training and deeper network architectures have reduced the word error rate (WER). Research published by Microsoft AI and Research reports a 5.1% WER on the 2000 Switchboard evaluation set, obtained by adding an additional acoustic model architecture to their system [22]. They called it CNN-bidirectional long short-term memory networks (CNN-BLSTM). Their work clearly reflects closeness to human performance. But the problem still occurs when it comes to South Asian accents.
In deep learning, how well a model learns depends strongly on the amount of data. The more data a model has for training, the more it learns and the more general its decision boundaries become. Sometimes, if the class labels of a model need modification, the model requires training from scratch. Transfer learning [23] provides a solution to this problem. Through transfer learning, what a model has learned can be transferred to new, similar problems with some modifications to its last layers. We can reuse the trained weights of the model and find the best weights only for the last layer, which is called the classification layer.
Recently, DeepSpeech2 [24] provided a deep learning-based architecture that gave promising results in English and Mandarin. The DeepSpeech2 architecture consists of 1D convolutional layers, RNN layers, and fully connected layers. It gives strong results for two very different languages, which suggests that it has the capability to learn the features of different languages. So, we decided to evaluate DeepSpeech2 on non-native English speakers and to improve its quality by transfer learning.
The most basic limitation in training these models is the limited available data: datasets of non-native English speakers are small and not widely available. Consequently, most models are unable to recognize non-native English accents. To cover this gap, we propose a system for the recognition of the English language, specifically for speakers with non-native English accents. Our system recognizes human speech and generates a transcription of it using a deep learning model named DeepSpeech2 [25]. Our proposed solution addresses the following points to reduce the word error rate (WER) on South Asian accents for English automatic speech recognition (ASR).
(1) We propose a hybrid model based on DeepSpeech2 with two pipelines that learn both English and non-native English accents.

Literature Review
Automatic speech recognition is not new to this era of the fourth industrial revolution. It all started in the middle of the 20th century. In the early 1950s, researchers from Bell Labs built a system named "Audrey" [27] to recognize digits spoken by a single person [28]. Audrey was a six-foot-high relay rack that consumed considerable power and required streams of cable. It was capable of recognizing digits from speech using phonemes.

Mathematical Problems in Engineering
Computer systems in the 1950s had limited computational speed and memory. Nevertheless, Audrey recognized the digits from 1 to 9 with more than 90% accuracy [29]. It also produced above 70% accuracy for some selected unknown speakers, but it was less accurate for unfamiliar voices. From 1960 to 1970, most of the exploratory and phonetic segmentation work was completed. The major techniques used for ASR in the 1960s were brute-force and template-based approaches, which gave good results but were hard to scale. A big breakthrough of that era was the speech understanding research (SUR) project of DARPA [30].
Later, in the 1980s, some remarkable discoveries were made. The first appearance of hidden Markov models (HMMs) [31] changed the way speech recognition was done. Alongside HMMs, neural networks also played their role. Layered feedforward networks with sigmoid activations were used to train models for speech recognition [32]: a three-layer network was constructed and trained with the back-propagation learning procedure. The results of these neural networks were better than those of HMMs. A time delay neural network (TDNN) achieved a word error rate of 1.5%, compared with 6.3% for the HMM [33]. With the development of HMMs and neural networks, many breakthroughs were achieved. DARPA started new speech projects, and HMMs became popular. Rabiner at Bell Labs obtained good results using HMMs, and AT&T performed the first large-scale deployment of speech recognition, named Voice Recognition Call Processing (VRCP) [34,35]. Automated systems were deployed, and in the late 1990s United Airlines launched an automatic flight information system.
In the last decade, research and development in the areas of ASR, sensor networks [36], computer vision [37], and natural language processing has grown rapidly. Deep learning innovation played a vital role in this growth. Recently published Microsoft research [38] recorded very impressive results. LSTMs were preferred over RNN-LMs to reduce the WER, and the models were built using CNTK. Human and machine errors were analyzed, and the analysis indicated substantial equivalence. This research used the NIST 2000 dataset and achieved a 4.9% word error rate [39]. However, the NIST 2000 dataset was originally recorded from calls with native English accents. That is why the model is not very accurate for South Asian accents.
Kadyrov et al. [40] proposed an ASR system based on spectrogram images of speech signals. They achieved 98.34% accuracy, but they used a self-generated dataset and did not evaluate the model on different accents of English speakers. We consider different accents of English speakers and evaluate them on a standard benchmark.
Another way to gain efficiency in automatic speech recognition is active learning. A gradient-based active learning method was used in [41]. Active learning aims to label only the most informative data, which helps to reduce labeling costs. As a result, the method outperformed the confidence score method used in ASR. Deep learning approaches have achieved significant accuracy in ASR, and CNNs are a key player in achieving it. Mostly, CNN architectures with fewer than 10 layers are used to design models for learning features. Yisen Wang, however, proposed a deep and wide CNN architecture. This architecture, known as RCNN-CTC, consists of residual connections and the connectionist temporal classification loss function [25]. It resulted in a 14.92% WER on WSJ dev93 and a 6.52% WER on Tencent Chat datasets.
One of the core difficulties in automatic speech recognition is noise, because real-world speech data are affected by background noise, varying sampling rates, and codec distortion. Google recently published research to overcome this issue [42]. They trained their model on 162,000 hours of speech. Their goal is to build a generally robust system, whereas most previous models are robust to a single domain, such as noise. Google applied various techniques to ensure the robustness of the system, such as using multiple codecs for encoding inputs in the presence of background noise [43]. More interestingly, their model performs very well in new, unseen conditions. Their multidomain model trained on 10 hours of data outperformed a model trained on 700 hours of speech data from the new domain only [44]. A survey of previous ASR methodologies is given in Table 1.
All of the abovementioned results provide an overview of the history and development of ASR. But all these systems share one common problem: non-native English accents. All the datasets used for training were recorded or collected from native English-speaking sources, whose accents differ from South Asian English accents. Our proposed system recognizes the English language specifically for non-native English accents, since Asian accents differ from native English accents, by reducing the WER on South Asian accents. Our framework recognizes human speech and produces a transcription of it using DeepSpeech2, with a few adjustments suggested in the system layers.

Methodology
This paper makes a contribution toward automatic speech recognition for the English language in native English and non-native English accents.
This research work is inspired by the DeepSpeech2 model for the English and Mandarin languages. The whole system architecture is shown in Figure 1. The first step is the preprocessing of the dataset, the second is feature extraction from the audio signals, and the third is the proposed two-pipeline CNN-RNN model. The last stage of the system is the decoder, which is used for postprocessing of the predicted transcriptions. Each module of the proposed system is explained below.

Data Preprocessing.
The Common Voice (CV) dataset was not recorded in a controlled environment: volunteers used their own devices to record it.
The recording was completed with the help of microphones and Internet browsers. Because the recording took place in an uncontrolled environment, considerable noise was introduced in the background of the recorded audio, for instance, distortion at the beginning of the audio similar to the noise generated by a microphone when it is plugged into a port. Moreover, the recordings contain empty gaps (unnecessary silences) between words and sentences; for example, a speaker starts speaking after a delay of 0.5 to 1.5 seconds and sometimes pauses for a long time between two sentences while reading paragraphs. Thus, the CV dataset was not usable in raw form and required extensive cleaning and preprocessing. We cleaned all of the separated South Asian audio files by removing the noise and deleting the silences between sentences. We employed our own algorithm (Algorithm 1) to scan each audio file and perform the following steps on it to make sure the data are usable for training and prediction:
(1) Deletion of empty audio files
(2) Elimination of unnecessary silences between sentences and words
(3) Removal of loud noisy sounds from the beginning of recordings
(4) Extraction of the audio file in the FLAC format, as required by the network
To remove silence and loud noisy sounds, we used the zero crossing rate (ZCR) method [50]. It is observed that the speech sections of an audio file yield a low zero crossing value, whereas the silent parts yield a higher zero crossing value [51]. This is because the zero crossing count indicates the frequency at which the energy is concentrated in the spectrum of the voice signal. Voiced sounds are produced by the repeated flow of air through the glottis exciting the vocal tract, which usually generates a low zero crossing count. Unvoiced speech, in contrast, is formed by constricting the vocal tract to cause turbulent airflow, which results in a noisy sound and a high zero crossing count.
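The zero crossing rate is computed as

$$\mathrm{ZCR} = \frac{1}{T-1} \sum_{t=1}^{T-1} \mathbb{1}\left(s_t \, s_{t-1} < 0\right),$$

where $s$ is the signal, $T$ is the length of the signal, $t$ is time, and $\mathbb{1}(\cdot)$ equals 1 when its argument is true and 0 otherwise. After filtering the audio files, we saved them in the FLAC format because FLAC is lossless and therefore preserves audio quality better than MP3. The visual representation of an audio signal before and after preprocessing is shown in Figure 2.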
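The ZCR-based filtering described above can be illustrated with a minimal Python sketch. It assumes the signal is a plain list of samples; the frame length and the threshold are hypothetical values chosen for illustration and are not taken from the paper.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose product is negative."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for prev, cur in zip(frame, frame[1:]) if prev * cur < 0)
    return crossings / (len(frame) - 1)


def keep_low_zcr_frames(signal, frame_len=400, zcr_threshold=0.25):
    """Keep only the frames whose ZCR is at or below the threshold.

    frame_len and zcr_threshold are illustrative assumptions, not paper values.
    """
    kept = []
    for start in range(0, len(signal), frame_len):
        frame = signal[start:start + frame_len]
        if zero_crossing_rate(frame) <= zcr_threshold:
            kept.extend(frame)
    return kept


# A slowly varying frame followed by a rapidly alternating (noise-like) frame.
example = [1, 2, 3, 4, 1, -1, 1, -1]
filtered = keep_low_zcr_frames(example, frame_len=4, zcr_threshold=0.25)
```

In this toy input the first frame never changes sign and is kept, while the second frame changes sign at every step and is dropped. A practical implementation would probably combine ZCR with an energy measure, since a perfectly silent (all-zero) frame also has a ZCR of zero.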

Feature Extraction.
The MFCC features are used as input for the model. Over the last few decades, these features have shown excellent results in automatic speech recognition, semantic analysis through speech, gender classification, and emotion recognition through speech. MFCC features are calculated by the following equation:

$$c_n = \sum_{j=1}^{m} \log\bigl(E(j)\bigr) \cos\left[n\left(j - \frac{1}{2}\right)\frac{\pi}{m}\right], \quad \text{for } n = 1, 2, 3, \ldots, k, \qquad (2)$$

where $E(j)$ is the energy of the $j$th mel filter bank channel, $m$ is the number of channels, and $k$ is the number of cepstral coefficients. The proposed model also implements an attention mechanism to weight the extracted features. The attention layer operates on attention scores and feature values, where $a_i$ represents the attention score of the $i$th feature and $f_i$ is the actual value of the $i$th feature.
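As an illustration of the cosine-transform step of MFCC extraction, the sketch below assumes the standard form in which the $n$th coefficient is the sum over mel filter bank channels $j = 1, \ldots, m$ of $\log E(j) \cos[n (j - 1/2) \pi / m]$. The function name and the list-based input are hypothetical, and the mel filter bank energies are assumed to be already computed and strictly positive.

```python
import math


def mfcc_from_filterbank(energies, num_coeffs):
    """Cepstral coefficients c_1..c_k from positive mel filter bank energies."""
    m = len(energies)
    log_e = [math.log(e) for e in energies]
    return [
        sum(
            log_e[j - 1] * math.cos(n * (j - 0.5) * math.pi / m)
            for j in range(1, m + 1)
        )
        for n in range(1, num_coeffs + 1)
    ]


# Four hypothetical filter bank energies, three coefficients.
coeffs = mfcc_from_filterbank([0.5, 2.0, 1.5, 0.8], num_coeffs=3)
```

A spectrally flat input (all energies equal) yields coefficients of zero under this formula, which is a quick sanity check on an implementation.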

Proposed Network.
The proposed network uses the DeepSpeech2 architecture with an extra pipeline for non-native English accents and a novel loss function, described in the loss function section. The DeepSpeech2 architecture contains two 1D convolution layers, three bidirectional RNN layers, and one fully connected layer. The 1D convolution layers extract features from the signal by convolving a 1D kernel over the Mel frequency cepstral coefficient (MFCC) features, whereas the RNN layer is a stateful layer that extracts temporal information from the features extracted by the convolution layers. The fully connected layer then predicts the text using the features extracted by both the convolutional and RNN layers. The proposed model consists of two pipelines of convolution layers, introduced to extract the features of English and non-native English accents, respectively. As the DeepSpeech2 model was trained on an English accent dataset, we used the DeepSpeech2 model with its trained weights to extract English accent features. The second pipeline has the same structure as the DeepSpeech2 model and starts from the initial DeepSpeech2 weights, which are further fine-tuned using the non-native English accent dataset.
MFCC features are used as input for both pipelines of the model. The first convolution layer contains 32 filters of size 11 × 41 × 1 with a stride of 3 × 2. The second convolution layer contains 32 filters of size 11 × 21 with a stride of 1 × 2. Both convolution layers apply padding to avoid downsampling the data. Three bidirectional RNN layers are stacked after the convolution layers. The last bidirectional RNN layer of each pipeline outputs 2,048 features. The features from both pipelines are concatenated and passed to the FC1 layer, which has 4,096 neurons. The proposed network thus extracts twice as many features (2048 + 2048 = 4096) as DeepSpeech2 (2048). This excess of features can lead the network to overfit, so a dropout layer is used after the FC1 layer. An attention layer is also introduced after the dropout layer to weight the features from the English and non-native English accent pipelines. The weighted features are passed to the FC2 layer, whose number of neurons equals the vocabulary size. The softmax layer is used to predict the probability of each character, calculated by equation (5):

$$p(c) = \frac{\exp(e_c)}{\sum_{k=1}^{K} \exp(e_k)}, \qquad (5)$$

where $p(c)$ is the probability of class $c$, $e_c$ is the score of class $c$ produced by the FC2 layer, and $K$ is the size of the vocabulary. The architecture of the proposed network is shown in Figure 3.
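The feature fusion and the softmax output described above can be sketched in plain Python. This is only an illustration under stated assumptions: the attention weighting is assumed to be an element-wise product of each feature with its score, which the text does not spell out, and all function names and values are hypothetical.

```python
import math


def softmax(scores):
    """Turn FC2 scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def attention_weight(features, attention_scores):
    """Scale each feature f_i by its score a_i (assumed element-wise form)."""
    if len(features) != len(attention_scores):
        raise ValueError("features and attention scores must have equal length")
    return [a * f for a, f in zip(attention_scores, features)]


# Two pipelines with 2,048 features each are concatenated into 4,096 features.
english_features = [0.1] * 2048
accent_features = [0.2] * 2048
fused = english_features + accent_features

# Softmax over three hypothetical character scores.
probs = softmax([1.0, 2.0, 3.0])
```

The concatenated vector has 2048 + 2048 = 4096 entries, matching the width of FC1, and the softmax output always sums to 1 regardless of the scale of the scores.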

Decoder.
The transcriptions produced by the model are mostly correct but lack English language constraints such as spacing and sentence boundaries. To handle this problem, Amodei et al. [24] use a language model. We extend the vocabulary of the LibriSpeech dataset [52] with the transcriptions of the Common Voice dataset. A decoder is developed using a language model and this vocabulary; it accepts the predicted transcriptions from the RNN model and produces a transcription that satisfies the English language constraints, as shown in Table 2.

Figure 2: Audio signal before preprocessing and after preprocessing.

Experiments and Results
In this section, the experimental setup of model training, evaluation measure, and results are discussed.

Experimental Setup.
Training is done for three different layer configurations of the network. In configuration A, the convolutional layers are frozen, and the RNN and FC layers learn by updating their weights. In configuration B, the convolution and FC layers are frozen, and only the weights of the RNN layers are updated. In configuration C, the RNN and FC layers are frozen, and the convolution layers are left to learn. These three configurations are shown in Table 3.
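The three configurations, as described in this paragraph, can be written down as data. The sketch below is a hypothetical illustration: it groups the layers into "conv", "rnn", and "fc" and freezes a group by assigning it a learning rate of 0, which is one common way to keep weights fixed. The base learning rate of 1e-3 is the starting value reported for training; the names are assumptions.

```python
LAYER_GROUPS = ("conv", "rnn", "fc")

# Frozen layer groups per configuration, following the paragraph above.
FROZEN = {
    "A": {"conv"},
    "B": {"conv", "fc"},
    "C": {"rnn", "fc"},
}


def learning_rates(config, base_lr=1e-3):
    """Per-group learning rate: 0 for frozen groups, base_lr otherwise."""
    frozen = FROZEN[config]
    return {group: (0.0 if group in frozen else base_lr) for group in LAYER_GROUPS}


rates_a = learning_rates("A")
```

In a deep learning framework, the same mapping would typically be passed to the optimizer as per-layer parameter groups, or implemented by disabling gradient updates for the frozen layers.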
During training, the pipeline for non-native English accents was trained, and the other pipeline was used as is, with the pretrained weights of the original DeepSpeech2 model. The English accent pipeline was frozen by setting the learning rate to 0 for all of its layers. The following implementation details apply to the non-native English accent pipeline. The model is trained using ReLU activation in all layers. The proposed loss function, discussed in Section 4.1.1, is used as the criterion. The stochastic gradient descent (SGD) optimizer is used with a dynamic learning rate, starting from $l = 1 \times 10^{-3}$ with a decay rate of $1 \times 10^{-1}$. The model is trained for 200 epochs. The experiments are performed on a system with an Nvidia 1080 Ti GPU with 3584 CUDA cores and 11 GB of GPU memory. The system contains 16 GB of DDR3 RAM and a 2.8 GHz quad-core processor. The dataset is split in a 70-30 ratio: 70% of the data are used for training the model, and the remaining 30% are used to evaluate it. The model's training details are listed in Table 4.

Algorithm 1:
Read the CSV of the Common Voice dataset
Choose the "Pakistani," "Indian," "Dutch," and "Sri-Lankan" accent rows from the country column
Select the "filename" column from the CSV
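A 70-30 split of the audio files can be sketched as follows. The shuffling and the fixed seed are assumptions made for reproducibility of the example; the text does not state how the files were assigned to the two parts.

```python
import random


def split_70_30(items, seed=0):
    """Shuffle a copy of items and split it into (train, test) at 70/30."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded shuffle, an assumption
    cut = len(shuffled) * 7 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file identifiers standing in for the audio files.
train_files, test_files = split_70_30(range(10))
```

Every file ends up in exactly one of the two parts, so no recording is used for both training and evaluation.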

Loss Function.
Most existing ASR models are optimized using a cost function that takes the predicted output $y'$ and the ground truth $y$; the difference between $y$ and $y'$ is used to optimize the model parameters. The proposed optimization technique instead aims to reduce the feature gap between South Asian and European accents. The proposed loss function therefore takes the features of South Asian and European accent audio from the FC2 layer and uses the mean square difference of these features as the loss to optimize the model parameters, as shown in the following equation:

$$L = \frac{1}{N} \sum_{i=1}^{N} \left(a_i - e_i\right)^2,$$

where $a_i$ is the feature vector of the South Asian accent and $e_i$ is the feature vector of the European accent for the same transcription in sample $i$, and $N$ is the number of samples. The objective of this loss function is to decrease the difference between the feature vectors of South Asian and European accents and thereby reduce the accuracy gap in ASR between them.
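A minimal sketch of such a feature-gap loss is given below, assuming paired feature vectors (one South Asian and one European recording of the same transcription per sample). Averaging over the feature dimension as well as over the samples is an assumption of this sketch; the function and argument names are hypothetical.

```python
def feature_gap_loss(south_asian_feats, european_feats):
    """Mean squared difference between paired feature vectors."""
    if len(south_asian_feats) != len(european_feats):
        raise ValueError("feature batches must be paired one-to-one")
    n = len(south_asian_feats)
    total = 0.0
    for a, e in zip(south_asian_feats, european_feats):
        # Squared difference averaged over the feature dimension (assumption).
        total += sum((ai - ei) ** 2 for ai, ei in zip(a, e)) / len(a)
    return total / n


# Two hypothetical samples with two features each.
loss = feature_gap_loss([[1.0, 1.0], [3.0, 3.0]], [[0.0, 0.0], [1.0, 1.0]])
```

The loss is zero exactly when the two accents produce identical features for every paired sample, which is the stated objective of the optimization.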

Dataset.
We used the Common Voice [53] dataset for fine-tuning the DeepSpeech2 [54] model. The Common Voice dataset was recorded by Mozilla in more than 18 languages for research purposes. It consists of a total of 1087 hours of audio, of which 780 hours were validated with transcriptions. The dataset was recorded with both male and female voices, in proportions of 47% and 11%, respectively, at a sampling frequency of 16 kHz. A detailed description of the dataset accents by region is given in Table 5.
As stated earlier, we focus on non-native English accents, so we used a subset of this dataset obtained by selecting the recordings of Pakistani, Indian, Dutch, and Sri-Lankan speakers with the help of Algorithm 1. Accent variation is affected by the geographical area in which the speaker grows up and lives, as well as by factors such as social class, culture, education, and working environment. All of these factors have an impact on the accuracy of an automatic speech recognition system. After separating the desired recordings, we obtained a total of 10,219 audio files with an average duration of 5 seconds. Table 6 shows the further split into training and testing sets with respect to gender.
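The accent-based selection can be sketched with Python's standard csv module. The column names "country" and "filename" are assumptions made for this illustration; the actual metadata file of a given Common Voice release may use different column names, and the sample rows below are invented.

```python
import csv
import io

TARGET_ACCENTS = {"Pakistani", "Indian", "Dutch", "Sri-Lankan"}


def select_filenames(csv_text, accent_column="country", filename_column="filename"):
    """Return the file names of rows whose accent is one of the targets."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row[filename_column]
        for row in reader
        if row[accent_column] in TARGET_ACCENTS
    ]


# Invented sample metadata, used only to exercise the function.
sample = "filename,country\nclip1.mp3,Indian\nclip2.mp3,US\nclip3.mp3,Pakistani\n"
selected = select_filenames(sample)
```

For a metadata file on disk, the same function can be applied to the text read from that file; only rows with one of the four target accents are retained.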

Evaluation Measures.
The model evaluation measures that we selected are word error rate (WER), match error rate (MER), and word information loss (WIL).

Word Error Rate.
WER is one of the most common evaluation measures for ASR models, and it provides a good comparison between the results of our proposed model and other related work. The WER gives the rate of error in the transcript generated by the ASR system, compared with the original transcript. It can be calculated by the following equation:

$$\mathrm{WER} = \frac{S + D + I}{N}, \qquad N = S + D + H,$$

where $N$ is the total number of words spoken in the original transcript, $S$ is the number of substitutions, $I$ is the number of insertions, $D$ is the number of deletions, and $H$ is the total number of hits, i.e., correctly transcribed words.
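The counts of hits, substitutions, deletions, and insertions are usually obtained from a minimum-edit-distance alignment of the two word sequences. The sketch below is one possible pure-Python implementation, not the evaluation code used in this work; it assumes whitespace tokenization, a non-empty reference, and the usual definition of WER as (S + D + I) divided by the number of reference words.

```python
def align_counts(reference, hypothesis):
    """Return (hits, substitutions, deletions, insertions) for two sentences."""
    ref, hyp = reference.split(), hypothesis.split()
    rows, cols = len(ref) + 1, len(hyp) + 1
    # cost[i][j] is the edit distance between ref[:i] and hyp[:j].
    cost = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        cost[i][0] = i
    for j in range(cols):
        cost[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Walk back through the table and count each type of operation.
    hits = subs = dels = ins = 0
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] == hyp[j - 1]:
                hits += 1
            else:
                subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return hits, subs, dels, ins


def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / (H + S + D); the reference must be non-empty."""
    hits, subs, dels, ins = align_counts(reference, hypothesis)
    return (subs + dels + ins) / (hits + subs + dels)


counts = align_counts("the cat sat on the mat", "the cat sit on mat")
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

In this example, "sat" is substituted by "sit" and the second "the" is deleted, so the alignment has four hits, one substitution, and one deletion, giving a WER of 2/6.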

Match Error Rate.
Match error rate gives the probability that a given input-output word match is incorrect. It can be calculated by the following equation:

$$\mathrm{MER} = \frac{S + D + I}{N}, \qquad N = H + S + D + I.$$

Unlike in WER, here $N$ is the sum of all four terms.

Word Information Loss.
Word information loss is based on the probability that any input word is matched with an equal output word and vice versa. It can be calculated by the following equation:

$$\mathrm{WIL} = 1 - \frac{H^2}{(H + S + D)(H + S + I)}. \qquad (11)$$

All three evaluation metrics represent the errors and loss in the output; hence, the lower the value, the better the model predicts. However, WER is not an actual percentage, as it has no upper bound because of the insertion term $I$. So, WER can only be used to compare different models, while MER and WIL can be interpreted as how well the model performs.
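Given the counts of hits (H), substitutions (S), deletions (D), and insertions (I), MER and WIL can be computed directly. The sketch below assumes the commonly used definitions, MER = (S + D + I) / (H + S + D + I) and WIL = 1 - H^2 / ((H + S + D)(H + S + I)); the function names are hypothetical.

```python
def match_error_rate(hits, subs, dels, ins):
    """Share of aligned word pairs that are errors; always between 0 and 1."""
    return (subs + dels + ins) / (hits + subs + dels + ins)


def word_information_lost(hits, subs, dels, ins):
    """One minus the product of the reference and hypothesis hit rates."""
    ref_words = hits + subs + dels  # number of words in the reference
    hyp_words = hits + subs + ins   # number of words in the hypothesis
    return 1.0 - (hits * hits) / (ref_words * hyp_words)


# Counts for a hypothetical alignment: 4 hits, 1 substitution, 1 deletion.
mer = match_error_rate(4, 1, 1, 0)
wil = word_information_lost(4, 1, 1, 0)

# With many insertions, (S + D + I) / (H + S + D) exceeds 1, but MER does not.
unbounded_wer = (1 + 0 + 3) / (0 + 1 + 0)
bounded_mer = match_error_rate(0, 1, 0, 3)
```

The last two lines illustrate why WER is not a true percentage: with one substitution and three insertions against a one-word reference, the WER ratio is 4, whereas MER stays at its upper bound of 1.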

Results.
Before making any modifications to DeepSpeech2, we loaded the model pretrained on the LibriSpeech dataset. As the majority of the speakers in this dataset are from the US, and the test set also contains American speakers, the model achieved a WER of around 6% on the test set of US English accents. However, the same model, when evaluated on the Common Voice dataset, gave a drastically higher WER of 43% for South Asian accents. The reason for the 43% WER is that the pretrained DeepSpeech2 model was not trained on non-native English speakers. That is why we use a parallel pipeline for processing the speech of non-native English speakers.
This pipeline is called the learning network. When it receives input from non-native English speakers, it learns and updates its weights accordingly. Meanwhile, we set the learning rate of the frozen network to zero, so the performance of our model for native English users is preserved and the weights of that pipeline do not change. If the user is a non-native English speaker, the learning network handles that user and updates its weights. That is why our model works better than other models: it controls where learning takes place. The training of DeepSpeech2 is done in three different layer configurations, as mentioned in Table 3.
The loss comparison graph of configurations A, B, and C is shown in Figure 4. The purpose of experimenting with these configurations is to make DeepSpeech2 perform better for South Asian accents through transfer learning. We experimented with different dropout ratios, and the best results were achieved with a dropout ratio of 0.7. All the results shown here for the different configurations use the same dropout ratio of 0.7.
(1) For configuration A, where all layers were learned using the Common Voice dataset, the model achieved a WER of 35.35% on the validation set. The training and validation cost of this configuration is shown in Figure 5(a).
(2) For configuration B, the weights of the RNN layers were learned while both the CNN and FC layers were frozen. In this configuration, the model retained the low-level features learned from the LibriSpeech dataset and learned only the new high-level features from the Common Voice dataset. The WER achieved on the validation set is 20.419%. The training and validation cost for this configuration is shown in Figure 5(b).
(3) Finally, for configuration C, we learned the weights of the RNN and FC layers while freezing the convolutional layers. The WER achieved on the validation set is 18.0859%. The training and validation cost is shown in Figure 5(c).
The results are summarized in Table 7.
We compared our DeepSpeech2-based system with Apple Dictation, Bing Speech, the Google Speech API, and wit.ai, which are all commercial speech technologies. Our test is designed to measure performance in noisy situations. This situation complicates the evaluation of the web speech APIs: when the SNR is too low or, in some cases, when the utterance is too long, these systems return no result. Therefore, we limit our comparison to the utterances for which all systems returned a non-empty result. Table 8 shows the results of evaluating each system on our test files.

Conclusion
ASR is used extensively to enable natural human-machine interaction, but it is not flexible enough for South Asian accents of English, which leaves almost 200 million people unable to use its applications. Our contribution toward resolving this obstacle is the proposed system, which is inspired by DeepSpeech2. The proposed method provides a two-pipeline deep learning architecture that achieves minimum character error rate (CER) and word error rate (WER) on the Common Voice (CV) benchmark. By setting up different experimental configurations and modifications, we succeeded in reducing the WER from 43% to 18.08% at a lower validation cost. As this work focused on South Asian English accents, there is a slight increase in WER and CER for native English speakers. In the future, the system can be scaled to target other South Asian languages, such as Bengali, Urdu, and Hindi, with more robust datasets, higher accuracy, and parallel training of both pipelines.

Data Availability
Two datasets are used in the experiments for the proposed research. The first is the LibriSpeech dataset, which contains audio signals in the English language. The total length of the dataset is 1000 hours. The annotation is provided along with the dataset in the form of transcriptions of the audio signals. Readers can find the dataset at https://www.openslr.org/12. The second dataset, used for transfer learning of DeepSpeech2, is the Common Voice dataset, version 7.0. This dataset also includes audio signals and their transcriptions in the English language. The total length of the dataset is 2637 hours, with 75879 different voices. The size of the total dataset is 65 GB. Readers can find the dataset at https://commonvoice.mozilla.org/en/datasets.

Conflicts of Interest
The authors declare that they have no conflicts of interest.