Regional Language Speech Recognition from Bone-Conducted Speech Signals through Different Deep Learning Architectures

A bone-conducted microphone (BCM) senses the vibrations of the skull bones during speech and converts them into an electrical audio signal. Because BCMs capture speech from the vibrations of the speaker's skull, they resist ambient noise better than standard air-conduction microphones (ACMs). BCMs also have a different frequency response than ACMs because they capture only the low-frequency portion of the speech signal. Replacing an ACM with a BCM may therefore give satisfactory noise suppression, but speech quality and intelligibility may suffer because of the nature of solid vibration. Mismatched BCM and ACM characteristics can also degrade automatic speech recognition (ASR) performance, and it is impractical to build a new ASR system using voice data from BCMs. The intelligibility of a bone-conducted speech signal depends on the bone location used to acquire the signal and on how accurately the phonemes of words are modelled. Deep learning techniques such as neural networks have traditionally been used for speech recognition. However, neural networks have a high computational cost and are unable to model phonemes in signals. In this paper, the intelligibility of BCM speech was evaluated for three bone locations, namely the right ramus, larynx, and right mastoid. BCM signals were acquired for Tamil words, and their speech intelligibility was evaluated by listeners and by the deep learning architectures CapsuleNet, UNet, and S-Net. As validated by both the listeners and the deep learning architectures, the larynx bone location gives the highest speech intelligibility.


Introduction
Speech quality and intelligibility degrade because of ambient noise and the implant location of air-conducted and bone-conducted devices. Noise suppression techniques improve the intelligibility of noise-affected speech. Background noise includes musical noise, babble noise, coloured noise, and nonstationary noise. In a speech recognition system, noise is suppressed automatically by filters such as the Wiener and Kalman filters, noise subtraction techniques, and speech enhancement algorithms. However, residual noise remains because of the nonlinear nature of the noise signal, and it seriously affects speech intelligibility and recognition. Traditionally, noise suppression from speech signals has been accomplished by estimating the power spectrum of the noise signal. Because noise signals are nonlinear, power spectrum estimation is inaccurate. Owing to the residual noise, the obtained speech signal has reduced intelligibility and perception. Deep learning methods improve speech intelligibility and perception by suppressing nonlinear noise signals. Early fusion and late fusion ensemble learning strategies, together with a convolutional neural network, enhance the speech signal obtained with a bone-conducted microphone (BCM). The acoustic characteristics of BCM and air-conducted microphone (ACM) signals are learned by the ensemble approach and the convolutional neural network for speech signal enhancement [1]. BCM and ACM speech signals obtained in a noisy environment of 61.7 dBA to 73.9 dBA are transformed to match each other by a deep denoising autoencoder. The speech recognition accuracy improves by adjusting the weight of the speech intelligibility index [2]. The speech perception of the MED-EL Bonebridge device was evaluated with a tone audiogram and tested with Freiburg monosyllables. Speech perception improved upon implantation of the device [3].
Speech perception is enabled with a transcutaneous bone-conduction implant (BCI BB) placed near the sinodural angle of the mastoid bone. The speech perception of the device is evaluated with functional hearing gain. The BCI BB provides better speech perception in a noisy environment [4].
The speech perception of the Radioear B-71 bone vibrator and the TDH-39 earphone was evaluated under quiet, pink noise, white noise, and babble conditions with the Callsign Acquisition Test. The speech intelligibility of the device was tested at the mastoid and condyle locations. Speech intelligibility varied with gender under background noise, as validated by post hoc analysis [5]. The speech intelligibility of bone-conducted ultrasound (BCU) and air-conducted ultrasound (ACU) signals from a ceramic vibrator placed at the mastoid region was evaluated with an ANOVA test. ACU speech intelligibility increased with a higher sound level compared with BCU [6].
The performance of the B-72 device was evaluated with regard to background noise, voice gender, and ear position. The modified rhyme test (MRT) was conducted with a Fonix FA-12 audiometer and Telephonics TDH-39P earphones. The MRT results show that the condyle region increases speech intelligibility compared with the mastoid region [7]. Table 1 explains the characteristics of the existing models.

Related Works
Using noise-resistant recording devices is a simple way to collect less distorted speech signals. As previously stated, a BCM records signals via bone vibrations and is thus less sensitive to air background noise than an ACM. However, BCM-recorded speech signals frequently suffer from a loss of high acoustic-frequency components, which was addressed and partially alleviated by the BCM-to-ACM conversion technique.

Methodology
The study of speech signals and signal processing methods is known as speech processing. Because the signals are typically processed in digital form, speech processing can be thought of as a subset of digital signal processing applied to speech signals.
The BCM speech is acquired with a MEMS acoustic sensor. The transducer converts vibrations induced in the bones of the skull into a spectrally rich electrical signal. The bones conduct vibrations from the vocal tract during speech. The vocal tract causes vibrations in bones such as the right ramus, larynx, and right mastoid, as shown in Figure 1. The speech stimuli in the study were five common words from the Tamil language, as shown in Table 2.
The words are frequently used in conversation and represent Tamil phonetic characteristics. The Tamil words, spoken by a male speaker at 60 dB, were recorded in a quiet environment with a microphone placed three feet from the lips and sampled at 22 kHz. Similarly, the words were recorded with an ADMP401 microphone placed at the right ramus, larynx, and right mastoid, as shown in Figure 2. The ADMP401 was positioned over the bone and held with a headband to prevent it from drifting during speech. The ADMP401 signal was amplified by a class B power amplifier and recorded with an HP laptop and Sigview software.

CapsuleNet.
A capsule network trained to detect objects in this database improved model accuracy by 45 percent when compared to traditional CNN models.

UNet.
A general convolutional neural network focuses on image classification, where the input is an image and the output is a single label, but in biomedical cases, we must not only determine whether disease exists but also localise the area of abnormality. UNet addresses this issue. It can localise and distinguish borders by performing classification on every pixel, so the input and output are the same size.

S-Net.
S-Net was the first parallel neural network implementation. It employs the data division method, and the system employs one server and any number of clients. It was written in the C programming language. Clients connect to the server through TCP/IP sockets, and each client receives its own thread. Each client computes update matrices for its portion of the data (bunch-size/N), sends them to the server, and then waits for a response. Once the server has received all update matrices, the main thread updates the weights. When the update is complete, the server sends the new weights to the clients via threaded client communication.
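The data division scheme described above can be illustrated with a minimal single-process sketch. It is not the original C implementation: the linear model, the squared-error gradient, the learning rate, and the function names `client_update`, `split_bunch`, and `server_step` are assumptions chosen for illustration, and the clients run sequentially here rather than in threads over TCP/IP sockets.

```python
def client_update(weights, samples):
    """Update (gradient) a client computes on its share of the bunch.

    Assumes a linear model y = w . x with squared-error loss (illustrative only).
    """
    grad = [0.0] * len(weights)
    for x, y in samples:
        pred = sum(w * xi for w, xi in zip(weights, x))
        err = pred - y
        for k, xi in enumerate(x):
            grad[k] += 2.0 * err * xi
    return grad


def split_bunch(bunch, n_clients):
    """Data division: each client receives roughly bunch-size / N samples."""
    return [bunch[i::n_clients] for i in range(n_clients)]


def server_step(weights, bunch, n_clients, lr=0.01):
    """Collect every client's update, then update the weights once."""
    updates = [client_update(weights, share)
               for share in split_bunch(bunch, n_clients)]
    total = [sum(u[k] for u in updates) for k in range(len(weights))]
    # The new weights would then be sent back to all clients.
    return [w - lr * g / len(bunch) for w, g in zip(weights, total)]


# Toy bunch of (input, target) pairs; the values are made up.
bunch = [([1.0, 2.0], 5.0), ([2.0, 1.0], 4.0),
         ([0.5, 0.5], 1.5), ([3.0, 1.0], 5.0)]
w_parallel = server_step([0.0, 0.0], bunch, n_clients=2)
w_single = server_step([0.0, 0.0], bunch, n_clients=1)
```

Because the server sums the per-client updates before changing the weights, the two-client step gives the same weights as a single worker that processes the whole bunch.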
When compared to other methods, CapsuleNet, UNet, and S-Net recognised Tamil words accurately for BCM signals obtained from the larynx bone.

Results and Discussion
The Fourier-domain analysis covers BCM voice signals from the right ramus, larynx, and right mastoid.
The Fourier analysis shows the tone and phoneme variation of the speech signal. The low-frequency speech signal fails to conduct through bone compared with the high-frequency speech signal. The low-frequency speech signal and its phonemes are distorted at the right ramus and right mastoid. However, both the low- and high-frequency signals are conducted through the larynx, which gives a clear representation of the phonemes in the speech signal. Each word was recorded five times from the different locations. The words were recorded at one-minute intervals to reduce speaker fatigue. The recorded speech signal was evaluated by listeners for speech intelligibility. The recorded speech signal and the BCM signal were assessed with a slider scale. The listeners correlated the recorded speech signal with the BCM signals from the right ramus, right mastoid, and larynx at 72%, 84%, and 91% speech intelligibility, respectively. The speech signals conducted through the right ramus, right mastoid, and larynx showed mean speech intelligibility of 75%, 87%, and 92%, respectively. The BCM speech signal from the larynx shows higher speech intelligibility than the other regions. The different bone locations are shown in Figure 2. The speech signal acquired from the larynx bone is used to train CapsuleNet, UNet, and S-Net for automatic speech recognition. Figure 3(a) shows the acquired speech signal of the Tamil word "Amam." The spectrogram in Figure 3(b) shows the phonemes of the "Amam" signal. The low-frequency component of the signal shows speech intelligibility similar to that of the speech signal acquired through the microphone. The BCM further reduces the noise in the speech signal and shows the features of the spoken words, since the BCM is in direct contact with the larynx bone. The magnitude response of the speech signal shows the variation in the phonemes of "Amam" in the range of 10 to 55 dB, as in Figure 3(c). The amplitude spectrum shows the Fourier representation of the speech signal, as in Figure 3(d).
The Fourier representation of the speech signal shows the time signature of the word's phonemes. The autocorrelation and probability distribution of the signal are shown in Figures 3(e) and 3(f). Similarly, the acquired speech signal of the second word, "Vena," is shown in Figure 4(a). The spectrogram of the "Vena" signal in Figure 4(b) shows speech intelligibility at 4 Hz. The magnitude of the signal ranges from 50 to 10 dB because of the initial low phoneme variation, "Ve." The amplitude spectrum of the speech signal shows the speech intelligibility of the word's phonemes in the range of −50 to −120 dB. The autocorrelation and probability distribution of the "Vena" speech signal are shown in Figures 4(e) and 4(f). Similarly, the Tamil word "Iruku" has speech intelligibility at 5 Hz, and its magnitude response changes in the range of 50 to 20 dB. The word "Illa" has speech intelligibility at 7 Hz and a magnitude response in the range of 50 to 25 dB. The word "Enna" has speech intelligibility at 4.5 Hz and a magnitude response in the range of 60 to 10 dB. Table 2 shows the speech signal parameters of the different Tamil words used for analysis.
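The amplitude spectrum, magnitude in dB, and autocorrelation used in this analysis can be sketched as follows. This is a minimal illustration on a synthetic 5 Hz tone, not the Sigview processing applied to the recordings. The toy sample rate of 200 Hz, the signal length, and the helper names `amplitude_spectrum`, `to_db`, and `autocorrelation` are assumptions.

```python
import cmath
import math


def amplitude_spectrum(signal, sample_rate):
    """Return (frequencies in Hz, magnitudes) for the non-negative DFT bins."""
    n = len(signal)
    freqs, mags = [], []
    for k in range(n // 2 + 1):
        acc = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        freqs.append(k * sample_rate / n)
        mags.append(abs(acc) / n)
    return freqs, mags


def to_db(magnitude, floor=1e-12):
    """Convert a linear magnitude to decibels (floored to avoid log of zero)."""
    return 20.0 * math.log10(max(magnitude, floor))


def autocorrelation(signal):
    """Biased autocorrelation of the mean-removed signal, equal to 1 at lag 0."""
    n = len(signal)
    mean = sum(signal) / n
    centred = [s - mean for s in signal]
    r0 = sum(c * c for c in centred)
    return [sum(centred[t] * centred[t + lag] for t in range(n - lag)) / r0
            for lag in range(n)]


# Synthetic stand-in for a recorded word: one second of a 5 Hz tone.
sample_rate = 200  # Hz, toy value (the recordings were sampled at 22 kHz)
n = 200
tone = [math.sin(2 * math.pi * 5 * t / sample_rate) for t in range(n)]

freqs, mags = amplitude_spectrum(tone, sample_rate)
peak_hz = freqs[max(range(len(mags)), key=mags.__getitem__)]
acf = autocorrelation(tone)
```

For a real recording, the direct DFT loop would normally be replaced by an FFT routine, and the spectrogram would be obtained by applying the same transform to short overlapping frames.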

CapsuleNet.
Table 1: Characteristics of the existing models.
Device | Objective | Method | Reference
… | … | Evaluation of nonaudible murmur microphone robustness with real and simulated noisy data | [9]
Softband bone conducted hearing device | Analyze auditory, speech development of bilateral microtia-affected children | The speech development of children is assessed with a meaningful auditory integration scale and speech intelligibility rating | [10]
Throat, acoustic microphone | Improve throat acoustic microphone speech recognition | The throat and acoustic microphone signals are correlated to extract an acoustic feature vector for speech recognition | [11]
Baha attract bone hearing system | Speech recognition of wireless bluetooth device in patients using a baha attract bone hearing system and traditional hearing aid | Speech perception and recognition of Korean sentences were performed in quiet and noisy conditions | [12]
Bonebridge™ MED-EL | Speech recognition performance comparison of semi-implanted bonebridge MED-EL and adhesive bone-conduction device | Free-field audiometry test was conducted with speech and noise produced through a loudspeaker | [13]
Air and bone conduction microphone | Evaluate enhanced speech quality signal | The equalised bone conducted speech is produced by maximum likelihood and bone conducted estimators for high and low SNR conditions, respectively. The equalised bone conducted speech quality is evaluated with Wiener gain and a priori SNR estimator | [14]
Bone conducted microphone | Nonstationary noise suppression of speech signal | … | …

The CapsuleNet architecture is shown in Figure 5 and consists of convolutional and fully connected layers. The convolutional layers have 9 × 9 convolutional kernels and ReLU activation, which extract features from the speech signal. The output of a capsule is represented by the following equation:
$$v_j = \frac{\|S_j\|^2}{1 + \|S_j\|^2} \, \frac{S_j}{\|S_j\|}, \quad (1)$$
where $v_j$ represents the output produced by capsule $j$ for input $S_j$. The input $S_j$ is weight adjusted by the capsules to predict the speech outcome.
The prediction $u_{j|i}$, formed by the product of the weight matrix $W_{ij}$ and the output $u_i$, is represented by the following equation:
$$S_j = \sum_i c_{ij} \, u_{j|i}, \qquad u_{j|i} = W_{ij} u_i, \quad (2)$$
where $c_{ij}$ represents the coupling coefficients, given by the following equation:
$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}. \quad (3)$$
The coupling coefficients of the capsules are formed by the routing softmax. The routing softmax takes the logits $b_{ij}$ and determines the coupling of capsules between layers. The capsule determines the features in the input speech signal based on the instantiation vector. The margin loss $L_k$ for multiple features in the input signal, for each capsule $k$, is represented by the following equation:
$$L_k = T_k \max(0, m^+ - \|v_k\|)^2 + \lambda (1 - T_k) \max(0, \|v_k\| - m^-)^2, \quad (4)$$
where $T_k = 1$ when feature $k$ is present and $0$ otherwise, $m^+$ and $m^-$ are the margins, and $\lambda$ down-weights the loss for absent features.

Figure 7 shows the architecture of UNet. The UNet is formed by a convolutional neural network (CNN) in a "U" shape. The UNet has two paths, namely the contraction path (encoder) and the expansion path (decoder). The encoder performs convolution, activation, and pooling to capture the input BCM speech signal. The decoder maps spectral features and spatial information to the feature map of the speech signal through up-convolution and concatenation. The feature map has rich spectral information in the encoder phase, and the intermediate low-level features are combined in the decoder phase to form feature channels.
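Returning to the capsule computations described for CapsuleNet, the capsule output, routing softmax, and margin loss can be sketched in a few lines. The sketch assumes the standard capsule network formulation. The function names, the toy prediction vectors, and the margin values (0.9, 0.1) and weight 0.5 are assumptions and are not reported in this study. The iterative update of the routing logits is omitted.

```python
import math


def squash(s):
    """Capsule nonlinearity: keeps the direction of s, maps its length into [0, 1)."""
    sq = sum(x * x for x in s)
    if sq == 0.0:
        return [0.0] * len(s)
    scale = sq / (1.0 + sq) / math.sqrt(sq)
    return [scale * x for x in s]


def routing_softmax(logits):
    """Coupling coefficients c_ij of one lower capsule i from its logits b_ij."""
    m = max(logits)
    exps = [math.exp(b - m) for b in logits]
    total = sum(exps)
    return [e / total for e in exps]


def capsule_outputs(u_hat, b):
    """u_hat[i][j]: prediction vector of capsule i for capsule j; b[i][j]: logit."""
    c = [routing_softmax(row) for row in b]
    n_out, dim = len(u_hat[0]), len(u_hat[0][0])
    outputs = []
    for j in range(n_out):
        s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(len(u_hat)))
               for d in range(dim)]
        outputs.append(squash(s_j))
    return outputs


def margin_loss(v_norm, present, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Margin loss for one capsule; m_pos, m_neg, and lam are assumed values."""
    if present:
        return max(0.0, m_pos - v_norm) ** 2
    return lam * max(0.0, v_norm - m_neg) ** 2


# Two lower capsules, two higher capsules, two-dimensional vectors (toy values).
u_hat = [[[1.0, 0.0], [0.0, 1.0]],
         [[1.0, 0.0], [0.0, 1.0]]]
b = [[0.0, 0.0], [0.0, 0.0]]  # zero logits give uniform coupling
v = capsule_outputs(u_hat, b)
```

With zero logits, each lower capsule splits its vote equally, so each higher capsule receives a unit-length input that is squashed to length 0.5.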

UNet.
The feature samples propagate speech information to the higher layers of the CNN. The input speech signal is preprocessed to remove background noise and upsampled by a factor of two to form an enhanced feature map. The enhanced feature map formed by the encoder is concatenated. The concatenated feature map is upsampled by a factor of two before it is applied to the convolutional layers. The process continues as long as vocal spectral content is present for speech recognition. The UNet consists of a convolutional network with two 3 × 3 convolutions, pooling, and rectified linear units to perform downsampling. The downsampling process increases the number of feature channels, and upsampling follows in the decoder. The results show the query and recognised speech signal of the Tamil word "Vena." The speech intelligibility of UNet is low for the ramus and mastoid bones. The larynx bone has higher speech intelligibility with respect to the phonemes in the speech signal.
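The encoder and decoder steps described above (convolution, ReLU, pooling by two, upsampling by two, and concatenation) can be illustrated with a one-dimensional toy. It is only a sketch: the fixed 3-tap kernel stands in for learned 3 × 3 convolutions, nearest-neighbour upsampling stands in for up-convolution, and the names `conv1d_same`, `max_pool2`, `upsample2`, and `tiny_unet` are assumptions.

```python
def conv1d_same(x, kernel):
    """Zero-padded 1-D convolution (cross-correlation) that preserves length."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(x) + [0.0] * half
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(x))]


def relu(x):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in x]


def max_pool2(x):
    """Downsample by a factor of two (expects an even-length input)."""
    return [max(x[i], x[i + 1]) for i in range(0, len(x) - 1, 2)]


def upsample2(x):
    """Nearest-neighbour upsampling by a factor of two."""
    return [v for v in x for _ in range(2)]


def tiny_unet(signal, kernel=(0.25, 0.5, 0.25)):
    """One encoder level, one bottleneck, one decoder level."""
    enc = relu(conv1d_same(signal, kernel))                  # encoder feature map
    bottleneck = relu(conv1d_same(max_pool2(enc), kernel))   # after pooling
    up = upsample2(bottleneck)                               # decoder upsampling
    channels = [enc, up]                                     # skip connection: concatenation
    # Stand-in for a learned 1x1 convolution over the two channels.
    return [0.5 * (a + b) for a, b in zip(*channels)]


signal = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]  # toy even-length input
output = tiny_unet(signal)
```

The skip connection is what lets the output keep the resolution of the input: the pooled path carries coarse context, while the concatenated encoder map restores sample-level detail.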

S-Net.
S-Net works with Shufflenet for feature detection, as shown in Figure 9. S-Net provides efficient computing in dense (1 × 1) convolutions. S-Net and Shufflenet use pointwise group convolutions and the channel shuffle operation to adjust the speech input weights in the feature channels. The Shufflenet block consists of a 6 × 6 layer with 6 × 6 convolution to map the speech input into the feature map. Shufflenet performs average pooling and channel concatenation to handle the feature dimension of the input speech. Shufflenet has low complexity because it requires few FLOPs and convolutions. Figures 10(a) and 10(b) show the speech recognition of S-Net for the Tamil word "Illai." S-Net recognises signals acquired from the larynx bone more clearly than those from the other bones.
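The pointwise group convolution and channel shuffle operations mentioned above can be sketched on plain lists. The sketch follows the usual definition of channel shuffle (reshape the channels into groups, transpose, and flatten). The function names and the toy weights are assumptions, and a single time step is processed for brevity.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups: reshape (g, n) -> transpose -> flatten."""
    if len(channels) % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = len(channels) // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


def pointwise_group_conv(x, weights, groups):
    """1 x 1 group convolution at one time step.

    x is a list of channel values; weights[g] is the
    (out_per_group x in_per_group) matrix of group g.
    """
    in_per_group = len(x) // groups
    out = []
    for g in range(groups):
        xs = x[g * in_per_group:(g + 1) * in_per_group]
        for row in weights[g]:
            out.append(sum(w * v for w, v in zip(row, xs)))
    return out


shuffled = channel_shuffle([0, 1, 2, 3, 4, 5], groups=2)

# Each group sums its own two input channels (toy weights).
summed = pointwise_group_conv([1.0, 2.0, 3.0, 4.0],
                              weights=[[[1.0, 1.0]], [[1.0, 1.0]]],
                              groups=2)
```

Group convolution keeps each group's channels separate, which reduces computation. The shuffle that follows lets the next group convolution see channels from every group.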

Support Vector Machine (SVM).
The SVM is a supervised linear classifier. It recognises features and patterns in signals based on supervised learning. The SVM separates the data into different classes with a hyperplane. The hyperplane separates nonlinear data by projecting the data into a higher-dimensional space. The high-dimensional space is formed by a kernel-induced feature space. Kernels, namely the dot product, RBF, and polynomial kernels, are implemented to classify nonlinear data and are represented by equations (5)-(7). Projecting the data into a high-dimensional space causes overfitting, which is overcome by the dot product. The SVM performs well in classifying unknown data, and the likelihood can be calculated.
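$$K(x, x') = x \cdot x', \quad (5)$$
$$K(x, x') = (x \cdot x' + 1)^{d}, \quad (6)$$
where $d$ represents the positive integer degree of the kernel.
$$K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right), \quad (7)$$
where $\sigma$ is a real number.

Computational Intelligence and Neuroscience

Support vector regression (SVR) uses a sparse solution, kernels, and support vectors for function estimation. The SVR is obtained from the SVM through the ε-tube, an ε-insensitive region around the function. The ε-tube is reformulated to determine the best-valued function with minimal prediction error. The ε-tube predicts the function such that the tube contains multiple training instances. The function is represented by the following equation:
$$\min_{w, b} \ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad |y_i - (w \cdot x_i + b)| \le \varepsilon,$$
where $\|w\|$ represents the magnitude of the vector being approximated, $x_i$ is a training sample with target value $y_i$, and $b$ is the bias. Table 3 shows voice and BCM signal correlation with LSSVM, SVM, and SVR.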
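The three kernels and the ε-insensitive region can be sketched as plain functions. The forms below are the common textbook definitions and are assumptions about the exact variants used in this study, as are the function names and the default values of `d`, `sigma`, and `eps`.

```python
import math


def dot_kernel(x, z):
    """Dot product (linear) kernel."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, d=2):
    """Polynomial kernel of positive integer degree d (assumed offset of 1)."""
    return (dot_kernel(x, z) + 1.0) ** d


def rbf_kernel(x, z, sigma=1.0):
    """RBF kernel with real-valued width sigma."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2.0 * sigma ** 2))


def epsilon_insensitive_loss(y_true, y_pred, eps=0.1):
    """Zero inside the epsilon-tube, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)


x, z = [1.0, 2.0], [3.0, 4.0]
k_dot = dot_kernel(x, z)
k_poly = polynomial_kernel(x, z, d=2)
k_rbf = rbf_kernel(x, x)
inside = epsilon_insensitive_loss(1.0, 1.05)
outside = epsilon_insensitive_loss(1.0, 1.5)
```

Predictions that fall inside the tube contribute no loss, which is why only the training instances on or outside the tube boundary end up as support vectors.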

Conclusion
The study describes the identification of the optimal bone location for speech intelligibility with a BCM. The BCM speech signal was acquired from three different bone locations, namely the right ramus, larynx, and right mastoid. The BCM-conducted speech signal from the different bones was rated for speech intelligibility by listeners, by spectral analysis of the signal, and by the deep learning architectures CapsuleNet, UNet, and S-Net. The larynx bone-conducted speech signal showed a mean speech intelligibility of 92%. CapsuleNet, UNet, and S-Net recognised Tamil words more accurately for BCM signals obtained from the larynx bone than for those from the ramus and mastoid. In the future, we will work to improve the model performance of this system and expand its application to more severe environments.
Data Availability
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.