Spoken Language Identification Using Deep Learning

Spoken language identification (SLID) is the process of detecting the language of an audio clip from an unknown speaker, regardless of the speaker's gender, manner of speaking, or age. The central task is to find features that distinguish between languages clearly and efficiently. The model converts audio files into spectrogram images and applies a convolutional neural network (CNN) to extract the main features for classification. The main objective is to detect languages among English, French, Spanish, German, Estonian, Tamil, Mandarin, Turkish, Chinese, Arabic, Hindi, Indonesian, Portuguese, Japanese, Latin, Dutch, Pushto, Romanian, Korean, Russian, Swedish, Thai, and Urdu. An experiment was conducted on audio files from the Kaggle dataset named spoken language identification. These audio files consist of utterances, each with a fixed duration of 10 seconds. The whole dataset is split into training and test sets. Preliminary results give an overall accuracy of 98%. Extensive testing shows an overall accuracy of 88%.


Introduction
Spoken language identification (SLID) is the task of recognizing the language being spoken by an anonymous speaker in an audio clip. Humans are the most accurate language identification system [1]. There are various applications of spoken language identification, such as creating front ends for multilanguage speech identification systems, automatic customer routing in call centers, monitoring, and web information retrieval [2].
The SLID system has three main parts: data collection, feature extraction, and language classification, as shown in Figure 1. An essential requirement for developing and evaluating a speech recognition system is the availability of a suitable database [3].
Different methods have been proposed to address the problem of automatic language identification with the acoustic-phonetic approach [1,4]. Advances in the deep learning field have enabled researchers to use GANs for language identification, for robustness on unsupervised and semisupervised tasks [5]. Support vector machine (SVM) classifiers do not work well on short utterances and give lower accuracy [6]. Conventional identification systems for spoken language processing tasks are based on i-vector systems, which are inefficient [7].
To resolve the problems mentioned above, a log-Mel spectrum is used to generate spectrograms of audio snippets, which capture the frequency content of each utterance. This is efficient and fast, and a convolutional neural network (CNN) can then be applied to classify the different languages. This work has been done using the spectrogram technique with deep learning by the authors of [8]. Many researchers work on image creation and image identification using deep learning techniques, which give good results and better accuracy in the 2D [9,10], 2.5D [11], and 3D [12,13] domains.
It is challenging to identify spoken language across different genders, age groups, and accents. Some of the audio clips contain background noise, which makes the language very hard to identify. A deep learning CNN technique is proposed to extract the features. Figure 2 shows the phases of the proposed spoken language identification framework. The model then makes a prediction, which gives the language class in the proposed framework.
The main contributions of this work are as follows:
(1) A novel deep learning-based model is proposed that applies a convolutional neural network (CNN) to extract features from images.
(2) The proposed model is analyzed with different deep learning and machine learning techniques over four datasets.
(3) The proposed approach is compared with other state-of-the-art methods and techniques on various evaluation metrics.
The rest of the paper is organized as follows: Section 2 presents the preliminary concepts of spoken language identification using CNN. Section 3 discusses past studies in the language identification field. Section 4 discusses the model architecture for raw waveforms and log-Mel spectrogram images. Section 5 presents the experimental results. In Section 6, the results and their implications are discussed. Finally, Section 7 concludes the paper.

Background
This section discusses the preliminary concepts of spoken language identification using CNN, spectrograms, and Bernoulli Naïve Bayes.

Spoken Language Identification Using CNN.
Spoken language identification with a CNN uses spectrograms of raw audio signals as input to a convolutional neural network (CNN) [8,14]. A spoken language identification dataset is collected and preprocessed for the training phase. The main focus is on preprocessing, in which audio utterances are converted into spectrogram images. After that, the data is partitioned into training and test sets. A CNN is then applied to extract features from the images. After training is completed, the test dataset is used for validation. The prediction accuracy is calculated from the model's performance in the validation phase.
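The partition-then-evaluate part of this workflow can be sketched as follows. This is a minimal illustration only: the function names, the 80/20 split, the fixed seed, and the file names are assumptions made for the example and are not taken from the paper.

```python
import random

def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of the items and split them into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Hypothetical (file name, language) pairs standing in for spectrogram images.
clips = [("clip_%03d.flac" % i, ["en", "de", "es"][i % 3]) for i in range(30)]
train, test = train_test_split(clips)
```

A fixed seed keeps the split reproducible, so accuracy figures from different runs are computed on the same held-out clips.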

Generation of Spectrogram.
A spectrogram is an image that represents the frequencies present in a signal over time. The frequency content is obtained from a time series of data points using the Fast Fourier Transform (FFT). The FFT can be applied to time-series data to calculate the magnitude of each frequency at a fixed moment in time. The time-series data is first windowed, usually into small chunks, and the FFT outputs of the chunks are stacked together to form the spectrogram image, which lets us see how the frequencies change over time.
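The windowing-and-stacking procedure can be sketched in a few lines. For self-containment, this illustration uses a direct discrete Fourier transform (which yields the same values as an FFT, only more slowly); the frame length, hop size, and Hann window are assumptions for the example, not the settings used in the paper.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitudes of the non-negative-frequency DFT bins of one frame."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags

def spectrogram(signal, frame_len=64, hop=32):
    """Split the signal into overlapping Hann-windowed frames and stack their spectra."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / frame_len) for t in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + t] * window[t] for t in range(frame_len)]
        frames.append(dft_magnitudes(chunk))
    return frames  # frames[i][k]: magnitude of frequency bin k in frame i

# A pure tone completing 8 cycles per 64-sample frame concentrates in bin 8.
tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(256)]
spec = spectrogram(tone)
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in spec]
```

Each row of `spec` is one time slice, so plotting the rows side by side gives the time-frequency image described above.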
Spectrograms were first generated from the audio clips or utterances, and the data was then converted into Mel spectrograms, referred to here as spectrogram images. The conversion of a frequency from f hertz to m mels is shown in Figures 3 and 4:
$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$

Bernoulli Naïve Bayes.
Bernoulli Naïve Bayes uses discrete data and works on the Bernoulli distribution. Its main feature is that it accepts features only as binary values, such as true or false, yes or no, 0 or 1, success or failure, and so on. Since it deals with binary values, let p be the probability of success and q the probability of failure, with $q = 1 - p$. For a random variable X with a Bernoulli distribution:
$$P(X = x) = p^{x} (1 - p)^{1 - x}$$
where x is binary, that is, 0 or 1. The Bernoulli Naïve Bayes machine learning classifier is based on

Related Work
Literature [15] proposed a deep learning-based spoken language identification system.  lid corpus. With these two approaches, they failed to achieve good results. They also compared the performance with Equal Error Rate (EER), where they got an average of 9.58% using the DNN model. Literature [16] proposed a method to improve generalization when identifying the language of short speech segments using triplet entropy loss with a CNN; it combines cross-entropy loss (CEL) and triplet loss to generalize the data and applies the result to Slavic languages. They use a pretrained ResNet50 model with the softmax function at the last layer. It uses an Adam optimizer with a learning rate (LR) of 10e-4, a batch size of 32, and ridge-based regularization to reduce overfitting. The top accuracy they achieved was 78% using triplet loss. They showed that triplet entropy loss is better than cross-entropy loss, but they failed to achieve good performance.
Literature [17] proposed an unsupervised neural-based model that can be used in spoken language identification and can decrease the distribution variance on both attributes and classifiers for the training and testing datasets. It proposed the optimal transport (OT) method to measure the distribution discrepancy.
The Time Delay Neural Network (TDNN) framework is used for adaptation between the training and test sets. Literature [18] proposed a deep neural network-based model that identifies Slavic languages or languages that are similar to one another. They built the model from two components: a segment-level feature extractor and a language classifier. The model uses a CNN with 128, 256, and 512 filters with 5, 10, and 10 for each layer, with stride 1 at each layer. They use two techniques, baseline LID and robust LID. With baseline LID, they got an average accuracy of 53.25, and with robust LID, an average accuracy of 87.32. Literature [19] proposed a framework that combines a CNN and an LSTM and uses the CTC loss function to train the model. Audio clips are converted into spectrograms, a CNN is applied to extract features from them, and an LSTM is then used to store the data from previous layers. The speech signal is sampled at 16 kHz with a window size of 200 ms and a window stride of 100 ms. This method gives an accuracy between 74% and 76%.
Literature [20] proposed a capsule network framework for spoken language identification. In CapsNet, a convolutional layer has a total of 128 kernels of size (9,9,1) and a step size of 1 with the ReLU function. The CapsNet is divided into two parts: encoder and decoder. The first 4 layers represent the encoder, and the last 3 layers represent the decoder. They achieved an accuracy of 91.80% with 5-second audio clips using the CapsNet approach. Literature [21] proposed various feature selection methods, such as top-k selection, forward feature selection, and recursive feature elimination, so that the model can work efficiently. In the first phase, preprocessing removes punctuation, emoticons, links, hashtags, and URLs, then removes the less important words using English stop words, and then removes redundant data from the dataset. The top-k feature method performs well compared to the other methods; it selects 550 features. Literature [22] proposed a Recurrent Neural Network Transducer (RNN-T) for speech recognition and spoken language identification. They use two language pairs: English-Spanish and English-Hindi. The RNN-T framework uses 5 encoder LSTM layers with 1024 units and 2 decoder LSTM layers with 1024 units; a 512-d embedding layer was used as the decoder input. The discrete fracture network (DFN) has 512 hidden neurons, along with tanh and softmax functions.
Literature [23] proposed a technique that associates acoustic-level representations with embeddings on Automatic Speech Recognition, which gives a 50% reduction in error rate. They used 64-d log-Mel feature extractors for training on a 25 ms window with a 10 ms overlap. The first 3 LSTM layers comprise 768 units, and all the data is further passed to FC layers and a softmax function. Finally, they used a semisupervised technique to increase the accuracy of the model. Literature [24] proposed a signal combination approach for language identification. They used a deep learning model to combine the signals from recognizers with the baseline, which uses low-level acoustic signals. It helps to decrease the error rate from 5.50% to 4.30%. They work with 11 different models and use ReLU, dropout, Adam, batch normalization, and various other attributes to get good results. Literature [25]  Computational Intelligence and Neuroscience with a mean and standard deviation, 1 fully connected layer with 1024 units, and at last softmax function is used with unit 1. With this framework, they got 97.0% accuracy. Literature [5] proposed a conditional GAN classifier framework for language identification; GANs are a better option on large datasets and give good results. 2 × 2 is used to perform upsampling, 5 × 5 Conv. 1, tanh, and an output (49) tanh is used. With this framework, they achieved 97% accuracy. Table 1 summarizes the previous studies, features, and results discussed above.

Proposed Spoken Language Identification Framework
This section discusses the motivation and the spoken language identification framework.

Motivation.
State-of-the-art results on various audio classification tasks have been obtained using log-Mel spectrograms of raw audio as features, which convert the audio utterance into images [8]. CNNs give an excellent performance gain in classification on these features [14]. The motivation for this work comes from these studies. Converting audio into spectral images takes considerable computation time, which gives us a new direction: to develop a computationally efficient and more accurate spoken language identification technique.

Proposed Spoken Identification Framework.
In the proposed deep learning-based spoken language identification framework, audio utterances are converted into spectrograms based on their frequency and time. A convolutional neural network (CNN) is then applied to the images to extract their features for classification. Finally, the softmax activation function is applied for multilanguage classification.
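The final softmax step can be illustrated with a short sketch. The three-language label list and the example scores are assumptions for illustration; in the framework the scores would come from the last dense layer of the CNN.

```python
import math

LANGUAGES = ["English", "German", "Spanish"]  # illustrative label order

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_language(logits):
    """Return the language whose softmax probability is highest."""
    probs = softmax(logits)
    return LANGUAGES[probs.index(max(probs))]

probs = softmax([0.1, 2.0, -1.0])
```

Because softmax preserves the ordering of the scores, the predicted language is always the one with the largest raw score; the probabilities are mainly useful for the loss function and for threshold-based analyses such as ROC curves.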

Preprocessing.
In the preprocessing phase, data augmentation is used to address class imbalance. Data augmentation reduces overfitting and acts as a regularizer when training a model. It increases the amount of data by adding modified copies of existing data, using operations such as cropping, rotating, flipping, shearing, and many other effects. Data augmentation is also useful with transfer learning models, which work well on more data and give good predictions.
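A few of these operations can be sketched on a spectrogram stored as a list of rows. The helper names, the noise scale, and the choice of which variants to generate are assumptions for illustration; the paper does not specify its augmentation settings.

```python
import random

def flip_horizontal(image):
    """Mirror each row, which reverses the horizontal axis of the image."""
    return [row[::-1] for row in image]

def crop(image, top, left, height, width):
    """Cut out a height x width patch whose upper-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]

def add_noise(image, scale=0.05, seed=0):
    """Add zero-mean Gaussian noise with the given standard deviation to every pixel."""
    rng = random.Random(seed)
    return [[v + rng.gauss(0.0, scale) for v in row] for row in image]

def augment(image):
    """Return the original image together with two modified copies."""
    return [image, flip_horizontal(image), add_noise(image)]

toy = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
variants = augment(toy)
```

Each call to `augment` triples the number of training images for a class, which is how augmentation can be used to top up under-represented languages.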

Description of Features.
Each audio clip is exactly 10 seconds long, with a sample rate of 22050 Hz, a bit depth of 16 bits, and 1 channel, and each audio file is a Free Lossless Audio Codec (FLAC) sample. The dataset is divided into two directories: train, which contains 73080 samples, and test, which contains 540 samples, covering three languages: English, German, and Spanish. Several audio transformations are applied, such as pitch, speed, and noise. The dataset contains the voices of 90 original speakers, male and female.
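These figures fix the size of each clip and the frequency range a spectrogram can cover, as the short calculation below shows. The mel conversion uses the widely used form m = 2595 log10(1 + f/700); treating this as the formula intended in the Generation of Spectrogram section is an assumption based on the constant 2595 given there.

```python
import math

SAMPLE_RATE_HZ = 22050
CLIP_SECONDS = 10
BIT_DEPTH = 16
CHANNELS = 1

# Number of samples in one clip and its size before FLAC compression.
samples_per_clip = SAMPLE_RATE_HZ * CLIP_SECONDS
raw_bytes_per_clip = samples_per_clip * (BIT_DEPTH // 8) * CHANNELS

# The highest frequency representable at this sample rate (Nyquist limit).
nyquist_hz = SAMPLE_RATE_HZ / 2

def hz_to_mel(f_hz):
    """Convert a frequency in hertz to mels (assumed standard formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

nyquist_mel = hz_to_mel(nyquist_hz)
```

The mel scale compresses high frequencies, so the upper half of the hertz range occupies far less than half of the mel axis in a log-Mel image.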

Model Description.
The model description covers the framework shared by all the models used in the experiments:
a) An appropriate pooling layer always follows every convolutional layer. It helps to contain the explosion of attributes and keeps the model small and efficient.
b) Each convolution layer is followed by a dropout layer, ReLU, and batch normalization. Batch normalization helps the convergence of the learned representations.
c) Finally, a dense layer is used, which acts as the output layer of the model.
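The core of one such block (convolution, ReLU, then pooling) can be sketched without any deep learning library. This is a toy illustration with a single hand-set filter; dropout, batch normalization, learned weights, and the dense output layer are deliberately omitted, and the kernel values are assumptions chosen for the example.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding (cross-correlation, as in most CNN libraries)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def relu(feature_map):
    """Replace negative activations with zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]

def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 patch."""
    return [[max(feature_map[i][j], feature_map[i][j + 1],
                 feature_map[i + 1][j], feature_map[i + 1][j + 1])
             for j in range(0, len(feature_map[0]) - 1, 2)]
            for i in range(0, len(feature_map) - 1, 2)]

# A 5x5 ramp image and a 2x2 diagonal-difference kernel.
image = [[float(r * 5 + c) for c in range(5)] for r in range(5)]
kernel = [[-1.0, 0.0], [0.0, 1.0]]
pooled = max_pool_2x2(relu(conv2d_valid(image, kernel)))
```

Pooling halves each spatial dimension here (4x4 to 2x2), which is the mechanism by which the pooling layers keep the number of attributes from growing.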

Model Details: Bernoulli Naïve Bayes.
This approach uses the Bernoulli Naïve Bayes machine learning technique to identify the language from a given dataset. In a preprocessing step, the data are first split into X and Y and then encoded using a label encoder library. Data cleaning is then performed to convert all the sentences to lower case. Then, the Naïve Bayes approach is applied, which takes 29.7 seconds to fit the model and gives 93.0% accuracy. Table 4 shows the metrics obtained using Naïve Bayes.
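The idea behind this classifier can be sketched from scratch on a toy corpus. Everything below is illustrative: the four sentences, the word-presence features, and the Laplace smoothing constant are assumptions for the example, and the class is a simplified stand-in rather than the library implementation used in the experiments.

```python
import math
from collections import defaultdict

class TinyBernoulliNB:
    """Minimal Bernoulli Naive Bayes over binary feature vectors (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant

    def fit(self, rows, labels):
        n_features = len(rows[0])
        counts = defaultdict(lambda: [0] * n_features)
        totals = defaultdict(int)
        for row, label in zip(rows, labels):
            totals[label] += 1
            for j, v in enumerate(row):
                counts[label][j] += v
        self.log_prior = {c: math.log(totals[c] / len(rows)) for c in totals}
        # Smoothed probability that feature j equals 1 within class c.
        self.p = {c: [(counts[c][j] + self.alpha) / (totals[c] + 2 * self.alpha)
                      for j in range(n_features)] for c in totals}
        return self

    def predict(self, row):
        def score(c):
            s = self.log_prior[c]
            for v, p in zip(row, self.p[c]):
                s += math.log(p) if v else math.log(1.0 - p)
            return s
        return max(self.log_prior, key=score)

# Made-up training sentences; lower-casing mirrors the cleaning step.
corpus = [("The cat sat", "English"), ("The dog ran", "English"),
          ("Der Hund lief", "German"), ("Die Katze sass", "German")]
vocab = sorted({w for text, _ in corpus for w in text.lower().split()})

def binarize(text):
    """1 if the vocabulary word occurs in the text, else 0."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocab]

model = TinyBernoulliNB().fit([binarize(t) for t, _ in corpus],
                              [label for _, label in corpus])
```

Unlike a multinomial model, the Bernoulli variant also penalizes a class for vocabulary words that are absent from the input, which is why the `1.0 - p` term appears in the score.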

Experimental Results and Discussion
This section contains the results and a discussion of the different techniques. The datasets used, hyperparameter settings, evaluation metrics, and computational time analysis of the different proposed approaches are described.

Datasets.
The experiments with the different techniques are implemented using four datasets: spoken language identification [30], language identification dataset [31], common voice Kaggle [32], and Mozilla common voice dataset [33], which are described in Table 5.
Spoken language identification [30]: Log-Mel images were used as features for language identification, coupled with an SGD-based neural network. [2]

Hyperparameter Details.
The hyperparameters of the proposed method are listed in Table 6. A trial-and-error method is used when running the convolutional neural network [8,14], Keras word embedding [34,35], and Naïve Bayes [36,37,38]. The selection of hyperparameters is also characterized as an NP-complete problem [39,40]. Efficient selection of hyperparameters can achieve better results [41,42]. For the CNN, the number of epochs is set to 60 and the batch size to 32, with ReLU as the activation function. Dropout [43,44] is used with the Adam optimizer. At the output layer, the softmax function [45] is used. The word embedding is a pretrained model from Keras; it is trained for 25 epochs with categorical cross-entropy loss and the Adam optimizer. The Bernoulli Naïve Bayes classifier is implemented with the Bernoulli kernel function.
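These settings can be collected in a small configuration structure, from which the number of weight updates follows directly. The dictionary layout and key names are assumptions for illustration; the values are the ones stated in the text, and the training-set size is the 73080 samples reported in the Description of Features section.

```python
import math

# Values stated in the text; key names are illustrative.
HYPERPARAMETERS = {
    "cnn": {
        "epochs": 60,
        "batch_size": 32,
        "activation": "relu",
        "optimizer": "adam",
        "output_activation": "softmax",
    },
    "word_embedding": {
        "epochs": 25,
        "loss": "categorical_crossentropy",
        "optimizer": "adam",
    },
}

TRAIN_SAMPLES = 73080  # size of the train directory

def steps_per_epoch(n_samples, batch_size):
    """Number of mini-batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

cnn = HYPERPARAMETERS["cnn"]
cnn_steps = steps_per_epoch(TRAIN_SAMPLES, cnn["batch_size"])
cnn_total_updates = cnn_steps * cnn["epochs"]
```

Under these assumptions one CNN epoch consists of 2284 mini-batches, the last of which is only partly filled because 73080 is not a multiple of 32.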

Performance Evaluation Metrics.
Evaluation metrics are used in the experiments to check the performance of the model. Those are precision, recall, F1 score, and accuracy, as shown in Table 7.
A receiver operating characteristic (ROC) curve is a graph that shows the performance of a classification model at different classification threshold values. The curve plots two parameters: false positive rate (FPR) and true positive rate (TPR).
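How the points of such a curve arise can be sketched for a single binary decision. In the multiclass setting, one curve per language would be obtained by treating that language as the positive class and all others as negative; the scores and labels below are made up for illustration.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per threshold, starting from (0, 0).

    labels use 1 for the positive class and 0 for the negative class.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under the piecewise-linear ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical classifier scores for "is this clip German?".
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
points = roc_points(scores, labels)
auc = area_under_curve(points)
```

Lowering the threshold moves along the curve from (0, 0) toward (1, 1): more clips are accepted as positive, so both rates can only grow.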
In Figure 5, a multiclass ROC curve for language identification is presented. This spoken language identification Kaggle dataset contains three languages: German, English, and Spanish. A ROC curve is produced in the same way for the language identification Kaggle dataset, which contains 22 languages: English, Arabic, French, Hindi, Urdu, Portuguese, Persian, Pushto, Spanish, Korean, Tamil, Turkish, Estonian, Russian, Romanian, Chinese, Swedish, Latin, German, Dutch, Japanese, and Thai.

Results and Discussion
The presented work discusses various methods that attain state-of-the-art results using four different audio datasets: the first dataset contains three languages, the second 22 languages, and the third 16 languages, all available on Kaggle; the fourth, the Mozilla common voice dataset, contains four languages and is available on the Mozilla website. In the image domain, 2D convolutional neural networks obtained an accuracy of 98%. On another dataset, a CSV file, word embedding using the pretrained model obtained an accuracy of 95%. With the Bernoulli Naïve Bayes approach, we obtained an accuracy of 93% on the 22-language dataset. The SVM and random forest classifier models achieved 82.88% and 72.42% accuracy, respectively, on the 16-language dataset.

Misclassification.
Various languages in the world belong to the Indo Persian and European families. In this group, the languages are separated into three subgroups: Germanic, Romance, and Slavic. Our model confuses languages that share similar words; for example, the word "Cat" in English, "Chatte" in French, "Kat" in Dutch, and "Katze" in German all have a similar sound and pronunciation; hence, they are very difficult for a model to tell apart. Our model also confuses Russian (Ru) and French (Fr) because they have similar accents; many words were adopted from French into Russian, so it is very difficult to give accurate results.

Performance of Classification Model: Confusion Matrix.
In this section, the performance of the model is shown in Figure 6, using a confusion matrix for multiclass classification with three classes: English, Spanish, and German. In this matrix, the diagonal elements are cases where the prediction matches the true value, while the nondiagonal elements are cases the model did not classify properly. The x-axis shows the true label, and the y-axis shows the predicted label.
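Building such a matrix from label pairs takes only a few lines. In this sketch rows index the true label and columns the predicted label, which is an assumed layout (transpose it to match the axes of Figure 6); the label lists are made up for illustration and are not results from the paper.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """matrix[i][j] counts samples of true class i that were predicted as class j."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix

classes = ["en", "de", "es"]
# Hypothetical labels for six clips.
y_true = ["en", "en", "de", "es", "es", "de"]
y_pred = ["en", "de", "de", "es", "en", "de"]

matrix = confusion_matrix(y_true, y_pred, classes)
correct = sum(matrix[i][i] for i in range(len(classes)))  # diagonal total
```

The diagonal total divided by the number of samples gives the overall accuracy, while each off-diagonal cell identifies one specific pair of languages that the model confuses.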
Figure 6 shows the multiclass confusion matrix for language identification on the spoken language identification Kaggle dataset, which contains three languages: German, English, and Spanish. A confusion matrix is produced in the same way for the language identification Kaggle dataset, which includes 22 languages: English, Arabic, French, Hindi, Urdu, Portuguese, Persian, Pushto, Spanish, Korean, Tamil, Turkish, Estonian, Russian, Romanian, Chinese, Swedish, Latin, German, Dutch, Japanese, and Thai.

Table 7: Performance evaluation metrics (No., metric, formula, description, Ref.).
1. Accuracy (acc): (tp + tn)/(tp + fp + tn + fn). It is the ratio of correct outputs to the total number of outputs. [30]
2. Precision (p): tp/(tp + fp). It is the ratio of correct positive predictions to the total number of predictions of the positive class. [30]
3. Recall (r): tp/(tp + fn). The recall measures the fraction of positive patterns that are correctly classified. [30]
4. F1 score (FM): 2 * p * r/(p + r). The F1 score is the harmonic mean of the recall and precision values.
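The four metrics can be computed directly from the counts of true positives (tp), false positives (fp), true negatives (tn), and false negatives (fn). The sketch below uses the standard definitions, in which recall is tp/(tp + fn); the counts in the example are made up for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision, recall, and F1 score from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for one language treated as the positive class.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

For multiclass problems these quantities are usually computed once per language and then averaged, so a language with few test clips can still pull the average down.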

Convergence for Training and Validation.
This section compares training and validation accuracy for our model under various optimizers. In Figure 7(a), the RMSprop optimizer with five epochs gives good results. In Figure 7(b), the Nadam optimizer with five epochs does not perform as well as the other optimizers. In Figure 7(c), the SGD optimizer with five epochs performs slightly better than the Nadam optimizer. In Figure 7(d), the Adam optimizer with five epochs also works well and gives good results.
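To make the difference between two of these optimizers concrete, the sketch below applies the plain SGD and Adam update rules to a one-dimensional quadratic. The objective, learning rates, and step counts are assumptions chosen for illustration, and behavior on this toy problem says nothing about which optimizer is better for the CNN; RMSprop and Nadam are omitted for brevity.

```python
import math

def gradient(x):
    """Gradient of the toy objective f(x) = (x - 3)^2."""
    return 2.0 * (x - 3.0)

def run_sgd(x=0.0, lr=0.1, steps=200):
    """Plain gradient descent: step against the gradient, scaled by lr."""
    for _ in range(steps):
        x -= lr * gradient(x)
    return x

def run_adam(x=0.0, lr=0.1, steps=500, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: scale the step by running estimates of the gradient's mean and square."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = gradient(x)
        m = beta1 * m + (1.0 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1.0 - beta1 ** t)           # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

sgd_x = run_sgd()
adam_x = run_adam()
```

Both runs approach the minimizer x = 3; Adam normalizes each step by the recent gradient magnitude, which is one reason it tends to be less sensitive to the choice of learning rate.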

Conclusion and Future Scope
The paper makes two contributions to the field of spoken language identification. Firstly, we use a deep learning architecture for image classification to identify languages from images generated from audio. Strong performance can be achieved using relatively short files with minimal preprocessing. We believe that this model can be extended to more languages as long as sufficient data are available. This approach achieved an accuracy of 98%. Secondly, we use the Bernoulli Naïve Bayes approach on a language identification dataset with 22 languages. It takes slightly more time than the CNN to fit the model. This approach gives an accuracy of 93%. We further apply another approach to this dataset, a pretrained Keras word embedding model. It is slightly faster and more accurate than Naïve Bayes, achieving an accuracy of 95%. The performance of log-Mel spectrograms can be further improved by removing noise from the audio. There is also room for improvement through data augmentation of the available data using methods such as pitch shifting, cropping, rotating, flipping, adding random noise, and changing audio speed. These help make neural networks more robust to variations that might be present in real-world scenarios. Future work could also review various feature extraction techniques, such as the Constant-Q transform and the Fast Fourier Transform, and their impact on language identification. These are known to have a positive impact on the performance of convolutional neural networks.

Data Availability
The data that support the findings of this study are available upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.