Parallel Big Bang-Big Crunch-LSTM Approach for Developing a Marathi Speech Recognition System

Department of Computer Science and Engineering, University of Jammu, Jammu, Jammu and Kashmir, India
School of Computer Science and Engineering, Lovely Professional University, Phagwara, Punjab, India
Cluster University of Jammu, Jammu, Jammu and Kashmir, India
Computer Science and Engineering, Koneru Lakshmaiah Education Foundation, Vaddeswaram, AP, India
Department of Computer Science and Engineering, SRM University, Amaravati, AP, India
School of Computer Applications, Lovely Professional University, Phagwara, Punjab 144001, India
Department of IT Engineering, Nepal College of Information Technology (NCIT), Pokhara University, Lekhnath, Nepal


Introduction
Owing to the high degree of flexibility offered by speech recognition (SR) software and voice recording devices with multiple microphones, various models of hands-free speech communication are used in different application domains, such as automatic speech recognition (ASR) and multimicrophone portable devices. Room reverberation, background noise, and interfering speakers generally degrade the performance of ASR models. Speech enhancement techniques therefore aim to reduce the noise without affecting the speech signals, which improves the robustness and performance of recognition models [1]. However, ASR remains difficult because of constraints such as freestyle or spontaneous speech and a lack of robustness to speech variations such as speaking rate, gender, sociolinguistic factors, accents, and environmental noise. There is a need to bridge the gap between speech recognition methods and humans to solve the challenges in these models.
ASR is challenging because of the difficulty of classifying languages with a common origin, the intermixing of different languages, and multilingual speech corpora (SC) [2]. Therefore, the limitations of existing recognition systems need to be addressed to obtain optimal results. Consequently, some research works have considered South Asian languages such as Marathi. However, there is no evidence of effective solutions for recognizing the Marathi language [3]. Moreover, Marathi language models suffer from inadequate SC and small-vocabulary systems [4].
Different deep learning approaches perform well in SR. These approaches are used in automatic recognition models for "single-channel speech enhancement," which improves the recognition performance [5]. In existing studies, different speaker adaptation approaches have been developed that target diverse speakers. Although these deep learning algorithms offer many benefits, they also suffer from computational and language complexity [6]. The Marathi dialect is believed to shift roughly every 12 kilometers. Speech signal processing is typically a challenging undertaking because of numerous difficulties, yet effective research can provide solutions to these issues. Owing to India's digitization, ASR for Marathi and other Indian languages is in high demand [7]. Furthermore, the lack of research on Marathi language models motivates the design of a new framework for the Marathi language.
The significant contributions of the suggested framework are as follows: (i) developing a novel framework for the Marathi language with multiple steps, including preprocessing, feature extraction, and classification using a heuristic-based classification approach; (ii) extracting useful attributes from the speech signals with Mel-frequency cepstral coefficients (MFCC) in addition to spectral-based features to increase performance, where the extracted features are reduced to the significant ones using the Principal Component Analysis (PCA) technique; and (iii) optimizing the hidden neurons and weights of the LSTM classification method using the PB3C algorithm to maximize the recognition accuracy.
The remaining sections of this paper present the literature survey, the architectural view of Marathi SR, feature extraction and feature selection for Marathi SR, and the results and discussion. The paper ends with the conclusion.

Literature Survey
In 2021, Smit et al. [8] described a new model for implementing subword language systems using Deep Neural Networks (DNN), weighted finite-state transducers, and Hidden Markov Models (HMM). The paper considered an acoustic system with character models and subword language systems that does not require pronunciation dictionaries. They also proposed approaches to combine the advantages of diverse classes of language model units through the reconstruction and combination of recognition lattices. The developed model constructed Neural Network Language Models (NNLMs), which were practical owing to the smaller input and output layers. Four languages, "Finnish, Swedish, Arabic, and English," were used to evaluate different subword units on SC. The experimental analysis showed more consistent results and a reduced error rate.
In 2019, Tu et al. [9] developed a new Iterative Mask Estimation (IME) framework for boosting the complex Gaussian mixture model (CGMM)-based beamforming method. This model developed a neural network (NN)-based ideal ratio mask estimator trained on multicondition SC to incorporate prior information. Subsequently, voice activity information was obtained from the speech recognition results to exploit the rich context information in the language models and deep acoustic models, which was then employed to reduce the insertion errors and refine the mask estimation.
The developed model was evaluated on the CHiME-4 Challenge ASR task of recognizing 6-channel microphone array speech. The experimental results revealed that the suggested IME method consistently and significantly outperformed the existing CGMM method and reduced the error rate.
In 2017, Kipyatkova and Karpov [10] implemented a Russian automatic speech recognition model using recurrent artificial neural networks. They considered hidden layers with different numbers of elements, and the NN models were linearly interpolated with the baseline trigram language model. The performance of the developed model was analyzed in terms of word error rate (WER).
In 2015, Zhou et al. [11] implemented a new "DNN-based acoustic modeling" structure for the ASR model, in which multiple DNNs (mDNN) were used to compute the posterior probabilities of HMM states. Initially, the HMM states were clustered into disjoint clusters using data-driven approaches. Then, the mDNN was trained on the clustered states. They showed that training with the mDNN model increased the training speed for both frame-level cross-entropy and sequence-level discriminative training. The suggested model increased the capabilities of the ASR system.
In 2014, Xue et al. [12] implemented a DNN-based ASR model that connects the different layers of a pretrained DNN using a novel group of linking weights. Furthermore, the training approach learned a new condition code for every test condition from adaptation data. The model used a fast adaptation strategy for developing an ASR model with supervised speaker adaptation. They also implemented several speaker codes and compared the proposed adaptation scheme experimentally with different approaches. Lastly, they attained superior performance in terms of WER, accuracy, and precision.
Bashir et al. [13] proposed DNN-based emotion detection for the Urdu language. The proposed DNN-based model outperforms other machine learning approaches. Akram et al. [14] proposed a language model for social text based on a deep autoencoder. They implemented this model for the low-resource language Urdu. The key contribution of this work is converting a high-dimensional feature space to a low-dimensional one for the Urdu language.

Problem Statement.
In recent years, different ASR models have been proposed, which are summarized in Table 1. RNN [8] improves performance with a better accuracy rate. However, this model is not suitable for a large amount of training data. DNN and unidirectional LSTM [9] reduce the word error rate (WER). However, they are not suitable for objective functions with joint learning. The artificial neural network [10] reduces the WER. However, this model suffers from demographic influences on the languages. mDNN [11] increases the recognition performance and the training speed. However, its accuracy rate can degrade. DNN [12] improves performance and efficiency when adapting larger DNN models and attains a lower WER. On the other hand, this model cannot optimize the speaker representations. Moreover, ASR for the Marathi language has received little attention in recent research works.

Architectural View of Marathi Speech Recognition (MSR)
For many years, research studies have considered SR models using machine learning approaches. Different speech-related applications rely on deep learning algorithms. Owing to the use of different machine learning (ML) and deep learning (DL) algorithms, speech recognition (SR) is an emerging research area.

Proposed Model and Description.
Speech recognition is a challenging problem owing to the complexity of identifying local languages and the correlation among different languages. Thus, the speech recognition framework for the Marathi language adopts deep learning approaches, as represented in Figure 1. The significant stages of the proposed framework are "preprocessing, feature extraction, feature selection, and classification." The collected audio signals pass through the preprocessing stage, which uses smoothing and median filtering techniques. Features are then extracted, and the extracted features are reduced with PCA to obtain the optimal features and lower the dimensionality of the data. The selected attributes are forwarded to the classification phase, in which the LSTM is combined with the PB3C algorithm. Finally, the recognized speech signals are obtained using the PB3C-based LSTM method. The PB3C optimization mechanism is applied to evolve the optimum number of neurons in the hidden layers of the LSTM. We also use the PB3C algorithm to compute the weights of each link in the LSTM. This model aims to enhance the accuracy of the LSTM model for Marathi speech recognition and to obtain a lightweight machine learning model.

PB3C-LSTM Approach.
The PCA-selected features are given to the PB3C-LSTM model for efficient recognition of speech signals in the Marathi language. Here, the number of hidden neurons in the LSTM is optimized with the PB3C technique. This model aims to improve the accuracy of the recognized speech signals.
In general, LSTM [15] is a variant of the recurrent neural network that uses memory blocks. An LSTM consists of an input layer, hidden layers, and an output layer. It has three gates, namely, the output gate (OG), input gate (IG), and forget gate (FG). The IG and OG regulate the input and output functions of a block of memory cells.
This is followed by the addition of the forget gate. The LSTM network computes the unit activations from the input sequence $Fs^{PCA}_n$, where $n = 1, 2, \ldots, N$ and $N$ stands for the number of features from PCA, and produces the output $o_n = (o_1, o_2, \ldots, o_{N_{optimal}})$ to find the mapping between them. In the LSTM equations, the FG bias vector is $v_f$, the OG bias vector is $v_h$, the input vector at the current time step is $Fs^{PCA}_n$, and the IG bias vector and the IG are termed $v_i$ and $i$, respectively. Moreover, the FG and the weight matrices are symbolized as $fg$ and $wg$, respectively, and the cell activation vector, the output gate, and the previous output of the blocks are denoted as $p$, $h$, and $p_{(n-1)}$, respectively. The cell output function, the cell input function, and the sigmoid function are denoted as $j$, $go$, and $\alpha$, respectively, and $\phi$ is the output activation function. Further, the tanh activation function is employed in the multilayer LSTM. Similarly, the notations $h_n$ and $q_n$ indicate the memory of the current block and the output of the current block, respectively. The diagonal weights of the peephole connections are given by the terms $wg_i Fs^{PCA}_n$, $wg_f Fs^{PCA}_n$, and $wg_p Fs^{PCA}_n$, the highest weight value of the IG to the input is given as $wg_i Fs^{PCA}_n$, and $i_{(n-1)}$ signifies the output of the preceding memory from the input blocks. Here, the LSTM is modified by optimizing the number of hidden neurons using the PB3C algorithm, with the aim of improving the classification accuracy to obtain accurate speech signals in the Marathi language.
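The gate description above can be illustrated with a minimal NumPy sketch of one time step of a peephole LSTM in its commonly used textbook form. This is an illustration under stated assumptions, not the authors' implementation: the parameter names (`W_ix`, `w_ic`, `b_i`, and so on), the helper `init_params`, and the random initialization are all hypothetical, and the mapping to the paper's symbols ($wg$, $v_i$, $v_f$, $v_h$) is only approximate.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid used by the three gates."""
    return 1.0 / (1.0 + np.exp(-x))

def init_params(n_in, n_hidden, seed=0):
    """Hypothetical helper: small random weights, zero biases."""
    rng = np.random.default_rng(seed)
    p = {}
    for g in "ifco":  # input gate, forget gate, cell input, output gate
        p[f"W_{g}x"] = rng.normal(0.0, 0.1, (n_hidden, n_in))
        p[f"W_{g}m"] = rng.normal(0.0, 0.1, (n_hidden, n_hidden))
        p[f"b_{g}"] = np.zeros(n_hidden)
    for g in "ifo":   # diagonal peephole weights, one per gate
        p[f"w_{g}c"] = rng.normal(0.0, 0.1, n_hidden)
    return p

def lstm_step(x, m_prev, c_prev, p):
    """One peephole-LSTM step.

    x      : input vector at the current step (e.g. 20 PCA features)
    m_prev : block output of the previous step
    c_prev : cell state of the previous step
    """
    i = sigmoid(p["W_ix"] @ x + p["W_im"] @ m_prev + p["w_ic"] * c_prev + p["b_i"])
    f = sigmoid(p["W_fx"] @ x + p["W_fm"] @ m_prev + p["w_fc"] * c_prev + p["b_f"])
    g = np.tanh(p["W_cx"] @ x + p["W_cm"] @ m_prev + p["b_c"])  # cell input
    c = f * c_prev + i * g                                      # new cell state
    o = sigmoid(p["W_ox"] @ x + p["W_om"] @ m_prev + p["w_oc"] * c + p["b_o"])
    m = o * np.tanh(c)                                          # block output
    return m, c
```

In this sketch, the number of hidden neurons `n_hidden` is the quantity that the optimizer would search over, and the arrays in `p` are the weights it would tune.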

BB-BC Algorithm.
The BB-BC algorithm is motivated by the "Big Bang Theory" in cosmology, which explains the origin of the cosmos in an explosion. In the BB-BC technique, the initial candidates are uniformly distributed in the search space. In the big crunch stage, the randomly distributed population is contracted toward a center of mass (COM). The big bang phase is immediately followed by the big crunch phase, in which a convergence operator uses the fitness values and current positions of the candidates to produce a weighted average point termed the COM, as formulated in equation (7). In equation (7), the fitness of the $z$th solution at the $b$th iteration is termed $fh^z_b$, the $a$th component of the $z$th solution at the $b$th iteration is symbolized as $c^z_{a,b}$, the $a$th component of the COM at the $b$th iteration is $c^y_{a,b}$, the total number of candidates in the population is $np$, the current iteration is $b$, the current candidate in the population is $z$, and the current dimension is $a$. The most recent COM serves as the starting point of the next iteration and then explodes in the big bang phase. The explosion produces new members that follow a normal distribution around the COM, as formulated in equation (8).
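For reference, the COM and explosion steps are commonly written in the BB-BC literature in the form below, expressed here with the symbols defined in the text. This is the standard minimization form and is only assumed to correspond to equations (7) and (8); in particular, the $1/(b+1)$ shrink factor and the superscript $new$ are assumptions.

```latex
% Center of mass: fitness-weighted average of the candidates (cf. equation (7))
c^{y}_{a,b} = \frac{\sum_{z=1}^{np} \dfrac{1}{fh^{z}_{b}}\, c^{z}_{a,b}}
                   {\sum_{z=1}^{np} \dfrac{1}{fh^{z}_{b}}}

% Big bang: new candidate scattered around the COM (cf. equation (8))
c^{new}_{a,(b+1)} = c^{y}_{a,b} + \frac{rn_{a}\,\delta\,(c_{max} - c_{min})}{b + 1}
```

With this weighting, candidates with a smaller cost pull the COM more strongly, so a maximization target such as accuracy would first be converted into a cost, for example $1 - Ac$.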
In equation (8), the standard normal random number is denoted $rn_a$, the upper and lower limits are termed $c_{max}$ and $c_{min}$, respectively, the parameter $\delta$ limits the size of the search domain, and the new candidate is denoted $c_{a,(b+1)}$. The standard deviation in equation (8) is set to decrease inversely with the current iteration, which yields the optimal results. After the big bang explosion, the big crunch contraction phase recalculates the COM. The explosion and contraction processes are repeated until the termination criterion is met. The BB-BC algorithm aims to attain the optimal results for SR in the Marathi language, with the key goal of maximizing the recognition accuracy. The steps of the BB-BC algorithm are given in Algorithm 1 [17].
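The explosion and contraction loop described above can be sketched in Python as follows. This is a minimal illustration of the generic BB-BC scheme, not the authors' code: the function name `bb_bc_minimize`, its default parameters, the clipping to the bounds, and the use of a cost to be minimized (for example, one minus accuracy) are assumptions.

```python
import numpy as np

def bb_bc_minimize(cost, lower, upper, n_candidates=30, n_iter=50,
                   delta=1.0, seed=0):
    """Generic BB-BC sketch that minimizes a positive cost function."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    # Initial big bang: candidates spread uniformly over the search space.
    pop = rng.uniform(lower, upper, size=(n_candidates, lower.size))
    best_x, best_f = None, np.inf
    for b in range(1, n_iter + 1):
        f = np.array([cost(x) for x in pop])
        k = int(np.argmin(f))
        if f[k] < best_f:                       # remember the best candidate seen
            best_x, best_f = pop[k].copy(), float(f[k])
        # Big crunch: center of mass weighted by inverse cost.
        w = 1.0 / (f + 1e-12)
        com = (w[:, None] * pop).sum(axis=0) / w.sum()
        # Big bang: normal scatter around the COM, shrinking with iteration b.
        rn = rng.standard_normal(pop.shape)
        pop = com + rn * delta * (upper - lower) / (b + 1)
        pop = np.clip(pop, lower, upper)        # assumption: keep within bounds
    return best_x, best_f
```

For example, `bb_bc_minimize(lambda x: float(np.sum(x ** 2)), [-5] * 3, [5] * 3)` searches for the minimum of a simple quadratic over a three-dimensional box.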

PB3C Algorithm
Parallel Big Bang-Big Crunch (PB3C), the extended version of the BB-BC algorithm, is a multipopulation optimization algorithm that shows superior accuracy and convergence rate compared with the BB-BC algorithm. This algorithm works by updating the elite by considering the local best solution in the population. The individual with the best fitness is selected based on the COM. Moreover, the new candidate solutions are generated around the COM by subtracting or adding a normal random number whose magnitude decreases as the iterations elapse, as given in equation (10). To generate the new population, we generate a change matrix with entries between −1 and +1. The size of the change matrix should match the size of the candidate solutions in the population. The new population is obtained by adding the change matrix to the elite solution.
Here, the maximum number of iterations is denoted as $l$, and a random number is denoted as $rn$. The PB3C algorithm is depicted in Algorithm 2.
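A multipopulation loop of this kind can be sketched as below. The sketch follows the outline of local bests, a global best, a probabilistic move of each local best toward the global best, and a new population built from a change matrix in [−1, +1] added to the elite. It is an illustration only: the function name `pb3c_minimize`, the move probability `p_move`, the `1/(it + 1)` shrink factor, keeping the elite in the new population, and clipping to the bounds are assumptions rather than details from the paper.

```python
import numpy as np

def pb3c_minimize(cost, lower, upper, n_pops=4, n_cands=15, n_iter=40,
                  p_move=0.5, seed=0):
    """Multipopulation BB-BC sketch that minimizes a cost function."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    dim = lower.size
    pops = rng.uniform(lower, upper, size=(n_pops, n_cands, dim))
    g_best, g_cost = None, np.inf
    rows = np.arange(n_pops)
    for it in range(1, n_iter + 1):
        # Fitness of every candidate, then the local best of each population.
        costs = np.array([[cost(x) for x in pop] for pop in pops])
        idx = costs.argmin(axis=1)
        local_best = pops[rows, idx].copy()
        local_cost = costs[rows, idx]
        # Global best among the local bests.
        k = int(local_cost.argmin())
        if local_cost[k] < g_cost:
            g_best, g_cost = local_best[k].copy(), float(local_cost[k])
        for b in range(n_pops):
            elite = local_best[b]
            if rng.random() < p_move:
                # Move the local best part of the way toward the global best.
                elite = elite + rng.random() * (g_best - elite)
            # Change matrix in [-1, 1], scaled down as iterations elapse.
            change = rng.uniform(-1.0, 1.0, size=(n_cands, dim))
            new_pop = elite + change * (upper - lower) / (it + 1)
            new_pop[0] = elite                  # assumption: keep the elite
            pops[b] = np.clip(new_pop, lower, upper)
    return g_best, g_cost
```

Because each population is evaluated independently inside the loop, the per-population work is the natural unit to distribute across workers in a parallel setting.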

Mobile Information Systems

Objective Model.
The suggested Marathi SR framework using PB3C-LSTM focuses on maximizing the accuracy to offer precise recognition. In the objective function, $fh$ represents the fitness function of the suggested Marathi SR model, and the hidden neurons are represented by $HN$. The accuracy, represented as $Ac$, is the ratio of correctly classified observations to all observations:

$$Ac = \frac{po^{true} + po^{neg}}{po^{true} + po^{neg} + fa^{true} + fa^{neg}}$$

Here, $po^{true}$ denotes the true positives, $fa^{true}$ denotes the false positives, $po^{neg}$ denotes the true negatives, and $fa^{neg}$ denotes the false negatives.
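As a small worked example of the accuracy ratio, the sketch below computes $Ac$ from the four confusion counts and turns it into a cost that an optimizer can minimize. The function names and the choice of `1 - accuracy` as the cost are illustrative assumptions.

```python
def accuracy(tp, tn, fp, fn):
    """Ac = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one observation is required")
    return (tp + tn) / total

def fitness_cost(tp, tn, fp, fn):
    """Assumed cost for a minimizer: maximizing Ac equals minimizing 1 - Ac."""
    return 1.0 - accuracy(tp, tn, fp, fn)

# 40 true positives, 45 true negatives, 5 false positives, 10 false negatives
print(accuracy(40, 45, 5, 10))  # 85 correct out of 100 observations
```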

Signal Preprocessing.
Signal preprocessing is the initial stage of processing. The collected signals are analog waveforms that cannot be applied directly to any digital model. Smoothing and median filtering techniques are used for preprocessing [17]. Smoothing reduces the noise present in the speech signals: during smoothing, the data points of a signal are adjusted relative to their adjacent points, so that points deviating from their neighbors are reduced [19]. Median filtering works on signal blocks and aims to remove the noise present in the input speech signal. The median filter for the Marathi speech signal is given by the following equation:

Median Filtering
A sorted set of $K$ values is denoted $ys(k)$, where $K$ is taken as odd. The term $ys((K-1)/2)$ indicates the middle value of MF, and the median filter is chosen with an odd length $K$.

ALGORITHM 1: BB-BC algorithm [17].
  Initialization of random number, population, and iteration
  Estimate center of mass by equation (7)
  While
    Create new solutions by equation (8)

ALGORITHM 2: PB3C algorithm [18].
  Initialization of "N" populations, each consisting of "C" candidate solutions
  Compute the fitness of every candidate
  Set i = 1
  While (i < TC)
    for b = 1 : N
      for j = 1 : C
        Compute the fitness of the j-th candidate solution
      end for
      Calculate the local best of the b-th population
    end for
    Amongst the "N" local best solutions, find out the global best
    With the given probability, move the local best candidate solution towards the global best
    for b = 1 : N
      Create new population around local best candidate solution
    end for
    i = i + 1
  End while
  Terminate
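For the smoothing and median filtering used in the preprocessing stage, a minimal NumPy sketch is given below. It assumes a sliding window of odd length `k`, edge padding so that the output keeps the input length, and a plain moving average as the smoothing step; the paper does not specify these details, and the function names are hypothetical.

```python
import numpy as np

def _check_window(k):
    """The window length must be a positive odd integer."""
    if k < 1 or k % 2 == 0:
        raise ValueError("window length k must be a positive odd integer")

def median_filter(signal, k=5):
    """Replace each sample by the median of the k samples centered on it."""
    _check_window(k)
    x = np.asarray(signal, dtype=float)
    padded = np.pad(x, k // 2, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, k)
    return np.median(windows, axis=1)   # middle value of each sorted window

def moving_average(signal, k=5):
    """Assumed smoothing step: mean of the k samples centered on each sample."""
    _check_window(k)
    x = np.asarray(signal, dtype=float)
    padded = np.pad(x, k // 2, mode="edge")
    return np.convolve(padded, np.ones(k) / k, mode="valid")
```

A single-sample spike such as the 9 in `[1, 1, 9, 1, 1]` is removed by `median_filter(..., k=3)`, whereas the moving average only spreads it over its neighbors, which is why the two steps are complementary.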

Feature Extraction.
The preprocessed signals are fed to the feature extraction procedure to obtain the significant features using MFCC and spectral features [16]. MFCC is one of the significant feature extraction methods in the developed Marathi SR system, and its coefficients are computed as given in equation (14).

In equation (14), $\xi_{mf}$ represents the dynamic range coefficients, the amplification factor is represented as $\xi_{CE}$, and the energy in each channel is termed $En_{fp}$. Here, the value of $fp$ lies in $0 \le fp < FP$, $G = 24$, and $\sigma_{fp}$ denotes the number of triangular filters, where $Xs_{gw} = |xs(gw)|^2$ and $0 \le gw < GW$. The MFCC features collected from the preprocessed signals are then passed to the next processing step.

Spectral-based features [20]: the proposed model gathers spectral attributes such as spectral roll-off, flux, and centroid. The Fourier transform converts the time-domain signal into a frequency-domain signal, from which the spectral features are obtained. These features help identify the pitch, notes, melody, and rhythm in the speech signals.

Spectral centroid: it is described as the center of the signal's spectral power distribution, with distinct values for voiced and unvoiced speech. The sign function is derived as given in equation (16). Here, the frame term is denoted $tr_{frame}$.

In the feature selection step, the data comprise $Vr$ variables and $Nu$ observations. The PCA-based dimension reduction yields the values $ps$, which represent the principal components. The term $Q$ is determined from the covariance matrix $CM$. In equation (20), the matrix of eigenvectors of $CM$ and the diagonal matrix of the eigenvalues are termed $Eg$ and $Dt$, respectively. Assume that $AC$ is the $Nu \times Vr$ matrix with $nu$th row $Fs^{MS}_{nu} - \beta$.
Here, the mean vector $\beta$ is derived as $\beta = (1/Nu)(Fs^{MS}_1 + \cdots + Fs^{MS}_{Nu})$, and $CM$ is estimated with size $Vr \times Vr$. Finally, the PCA-reduced features are denoted as $Fs^{PCA}_{No}$, where $No = 1, 2, \ldots, Np$ and $Np$ is the total number of PCA-reduced features, which is taken as 20.
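The PCA reduction to 20 features can be sketched with NumPy as follows. This is a minimal illustration assuming the usual construction: the mean vector is subtracted, the covariance matrix is formed as $(1/Nu)\,AC^{T}AC$, and the data are projected onto the eigenvectors with the largest eigenvalues. The function name `pca_reduce` and the $1/Nu$ normalization are assumptions.

```python
import numpy as np

def pca_reduce(features, n_components=20):
    """Project an (Nu x Vr) feature matrix onto its top principal components."""
    X = np.asarray(features, dtype=float)
    beta = X.mean(axis=0)                  # mean vector over the Nu observations
    A = X - beta                           # centered data, one observation per row
    cm = (A.T @ A) / X.shape[0]            # Vr x Vr covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cm)  # eigh: symmetric input, ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    Q = eigvecs[:, order]                  # eigenvectors of the largest eigenvalues
    return A @ Q, eigvals[order]

# Example: 200 frames with 39 hypothetical features each, reduced to 20.
rng = np.random.default_rng(0)
reduced, variances = pca_reduce(rng.normal(size=(200, 39)), n_components=20)
print(reduced.shape)  # (200, 20)
```

The returned eigenvalues are the variances captured by each retained component, which is one way to check how much information the 20 selected features preserve.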

Results and Discussion
This section explains the experimental setup, the performance evaluation metrics used to compare the models, and the result analysis of the proposed work.

Experimental Process.
The developed framework uses a maximum of 25 iterations and a maximum of 10 populations to evaluate performance. To evaluate the algorithm's performance, it was tested on a Marathi SC obtained from the ILTPDC, Govt. of India, which was divided into six SCs for analysis. The collected SCs consisted of approximately 44500 speech files accompanied by their pronunciation. The proposed PB3C-LSTM model was evaluated against the LSTM model [15] and BB-BC [23] on the 6 SCs.

Performance Measures.
To evaluate the performance of LSTM, BB-BC LSTM, and PB3C-LSTM, the word accuracy rate (WAR), WER, and sentence error rate (SER) are considered, which are described as follows:
(a) WER: it measures the word error rate of the designed framework:

$$WER = \frac{NS + DT + IE}{NT}$$

Here, $NS$ represents the number of substitutions in the test, $DT$ represents the number of deletions in the test, $NT$ represents the number of words used in the test, and $IE$ represents the number of insertion errors in the test.
(b) Word accuracy rate (WAR): it measures the word accuracy rate of the designed framework.
(c) Sentence error rate (SER): it relates the audios predicted correctly to the total number of audios. Here, the terms $PA$ and $TA$ represent the audio signals predicted correctly and the total number of audios, respectively.
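The three measures can be computed as in the sketch below, which counts substitutions, deletions, and insertions with a word-level edit-distance alignment. The WER follows the usual definition (NS + DT + IE) / NT. The WAR is taken here as 1 − WER and the SER as 1 − PA/TA, both of which are assumptions about the paper's exact definitions; all function names are hypothetical.

```python
def error_counts(ref_words, hyp_words):
    """Return (substitutions, deletions, insertions) of a minimal alignment."""
    n, m = len(ref_words), len(hyp_words)
    # dp[i][j] = (total, subs, dels, ins) for ref[:i] versus hyp[:j]
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, i, 0)            # delete every reference word
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, 0, j)            # insert every hypothesis word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, k = dp[i - 1][j - 1]
            if ref_words[i - 1] == hyp_words[j - 1]:
                diag = (t, s, d, k)        # match, no cost
            else:
                diag = (t + 1, s + 1, d, k)
            t, s, d, k = dp[i - 1][j]
            dele = (t + 1, s, d + 1, k)
            t, s, d, k = dp[i][j - 1]
            ins = (t + 1, s, d, k + 1)
            dp[i][j] = min(diag, dele, ins)  # smallest total error first
    _, subs, dels, inss = dp[n][m]
    return subs, dels, inss

def wer(ref, hyp):
    """Word error rate of one hypothesis sentence against its reference."""
    ref_words, hyp_words = ref.split(), hyp.split()
    ns, dt, ie = error_counts(ref_words, hyp_words)
    return (ns + dt + ie) / len(ref_words)

def war(ref, hyp):
    """Assumed definition: word accuracy rate = 1 - WER."""
    return 1.0 - wer(ref, hyp)

def ser(refs, hyps):
    """Assumed definition: share of utterances not recognized exactly."""
    correct = sum(r.split() == h.split() for r, h in zip(refs, hyps))
    return 1.0 - correct / len(refs)
```

For the reference "the cat sat on the mat" and the hypothesis "the cat sit on mat", the alignment contains one substitution and one deletion over six reference words.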

Result Analysis.
The results of the LSTM, BBBC-LSTM, and PB3C-LSTM SR models are analyzed with measures such as WAR, WER, and SER, as listed in Table 2 and presented in Figures 2-4, respectively. These measures are discussed in the following paragraphs.
The WER is examined to assess the efficiency of the designed PB3C-LSTM framework, as given in Table 2 for all six SCs. For SC 1, the PB3C-LSTM is 6% and 4% better than LSTM and BBBC-LSTM, respectively. For SC 2, the PB3C-LSTM and LSTM have the same WER, but they are 1.33% better than BBBC-LSTM. For SC 3, the WER performance of the PB3C-LSTM is 2.67% better and 2.67% worse than LSTM and BBBC-LSTM, respectively. For SC 4, the PB3C-LSTM is 0.67% better than both LSTM and BBBC-LSTM. For SC 5, the WER performance of the PB3C-LSTM is 0.6% worse and 0.67% better than LSTM and BBBC-LSTM, respectively. For SC 6, the PB3C-LSTM is 0.4% and 0.6% better than LSTM and BBBC-LSTM, respectively.
The WAR analysis shows the efficiency of the LSTM, BBBC-LSTM, and the proposed PB3C-LSTM SR model, as given in Table 2 for all six SCs. For SC 1, PB3C-LSTM is enhanced by 3.33% and BBBC-LSTM by 1.33%. For SC 2, the WAR of PB3C-LSTM and BBBC-LSTM is 0.6% lower than that of LSTM. For SC 3, the WAR performance of PB3C-LSTM is 1.33% better than LSTM and 2.6% lower than BBBC-LSTM. For SC 4, the WAR of PB3C-LSTM is similar to that of LSTM and BBBC-LSTM. For SC 5, the WAR performance of the PB3C-LSTM is 3.33% and 4% better than LSTM and BBBC-LSTM, respectively. For SC 6, the WAR of PB3C-LSTM is 1.33% higher than that of both LSTM and BBBC-LSTM.
The SER analysis of the LSTM, BBBC-LSTM, and proposed PB3C-LSTM SR model is given in Table 2. For SC 4, the SER of PB3C-LSTM is improved by 2.53% and 2.99% compared to LSTM and BBBC-LSTM, respectively. For SC 5, the SER performance of PB3C-LSTM is 0.7% worse and 0.8% better than LSTM and BBBC-LSTM, respectively. For SC 6, the PB3C-LSTM is improved by 2.19% and 2.26% compared to LSTM and BBBC-LSTM, respectively.

Conclusion
This article has contributed a new framework for Marathi speech recognition using PB3C-LSTM. It is composed of four significant processes: (1) preprocessing, (2) feature extraction, (3) feature selection, and (4) classification. The gathered speech signals were preprocessed using smoothing and median filtering and then passed to the feature extraction stage, which was carried out using MFCC and spectral-based attributes. The significant attributes were then selected using PCA and forwarded to the classification stage using PB3C-LSTM. The PB3C-LSTM module works in two phases. The first phase automatically evolves the optimal architecture of the LSTM. The second phase computes the optimum weight of each link in the LSTM for Marathi SR. PB3C-LSTM thus optimizes both the number of hidden neurons and the weights via the PB3C algorithm to obtain the recognized speech signals. From the experimental results, the word accuracy rate (WAR) of the proposed SR model using PB3C-LSTM was 1.34% and 3.34% higher than that of LSTM and BB-BC-LSTM, respectively, for SC 1. The proposed PB3C-LSTM model attained 4% and 6% lower WER and 4.45% and 3.25% lower SER than LSTM and BB-BC-LSTM, respectively, for SC 1, and it has comparable performance on the rest of the corpora. From the results, we observe that the proposed system outperforms the two other techniques for Marathi SR.

Conflicts of Interest
The authors declare that there are no conflicts of interest associated with the publication of this paper.