A New Speech Enhancement Technique Based on Stationary Bionic Wavelet Transform and MMSE Estimate of Spectral Amplitude

Speech enhancement has gained considerable attention in applications such as speech transmission over communication channels, speaker identification, speech-based biometric systems, video conferencing, hearing aids, mobile phones, voice conversion, and microphones. Processing the background noise is necessary for designing a successful speech enhancement system. In this work, a new speech enhancement technique based on the Stationary Bionic Wavelet Transform (SBWT) and the Minimum Mean Square Error (MMSE) Estimate of Spectral Amplitude is proposed. In its first step, this technique applies the SBWT to the noisy speech signal in order to obtain eight noisy wavelet coefficients. Each of those coefficients is then denoised with the method based on the MMSE Estimate of Spectral Amplitude. Finally, the inverse SBWT, SBWT^-1, is applied to the denoised stationary wavelet coefficients to obtain the enhanced speech signal. The performance of the proposed technique is evaluated by computing the Signal to Noise Ratio (SNR), the Segmental SNR (SSNR), and the Perceptual Evaluation of Speech Quality (PESQ).


Introduction
In many speech-related applications, the input speech signal is frequently corrupted by environmental noise and must be processed by a speech enhancement technique to improve its quality before it is used [1]. Generally, speech enhancement techniques fall into two groups: supervised and unsupervised. Unsupervised techniques include spectral subtraction (SS) [2][3][4], Wiener filtering [5,6], short-time spectral amplitude (STSA) estimation [7], and short-time log-spectral amplitude estimation (logSTSA) [8]. Supervised speech enhancement techniques employ a training set to learn separate models for noisy and clean speech signals; examples include codebook-based methods [9] and Hidden Markov Model (HMM)-based techniques [10]. Classical speech enhancement techniques usually process a noisy utterance frame by frame, that is, they enhance each short-time segment of the utterance almost independently. Some research works showed that considering the interframe variation over a relatively long span of time can contribute to superior performance in enhancing speech [1]. Well-known approaches in this direction include modulation-domain spectral subtraction [11], Kalman filtering, and modulation-domain Wiener filtering [12,13]. Moreover, whereas the Fourier transform (FT) takes only the frequency content into consideration, the discrete wavelet transform (DWT) [14] takes both the temporal and the frequency characteristics of the analyzed signal into consideration. The DWT has become a well-known method in speech analysis. In Wavelet Thresholding Denoising (WTD) [15], the wavelet transform is applied to split the time-domain signal into sub-bands. After that, the obtained wavelet coefficients (sub-bands) are thresholded.
In [16], the DWT [17,18] was applied to the speech signal and only the resulting approximation portion was retained, which simultaneously achieves data compression and noise robustness in recognition. In [1], the DWT was employed to analyze the spectrogram of a noisy utterance along the temporal axis, and the resulting detail portion was then attenuated with the expectation of reducing the noise effect and thereby improving speech quality. Despite the ease of its implementation, the preliminary evaluation results indicate that the technique proposed in [1] yields signals with better perceptual quality. It was also shown that this technique [1] can be paired with many well-known speech enhancement approaches to achieve even better performance [1]. In this work, a novel speech enhancement technique based on the Stationary Bionic Wavelet Transform (SBWT) [19][20][21] and the Minimum Mean Square Error (MMSE) Estimate of Spectral Amplitude [22] is proposed. In this paper, this approach is evaluated and compared to four other speech enhancement approaches, which are as follows:
(i) Unsupervised speech denoising via perceptually motivated robust principal component analysis [23].
(ii) The speech enhancement technique based on MSS-SMPO [24,25].
(iii) The denoising technique based on MMSE Estimate of Spectral Amplitude [22].
(iv) Our previous speech enhancement technique based on LWT and Artificial Neural Network (ANN) and using MMSE Estimate of Spectral Amplitude [26].
The fourth technique, which is based on LWT and ANN [27][28][29] and uses MMSE Estimate of Spectral Amplitude [26], can be summarized by the following steps:
(i) First step: applying the LWT to the noisy speech signal to obtain two noisy detail coefficients, cD1 and cD2, and one approximation coefficient, cA2.
(ii) Second step: denoising cD1 and cD2 by soft thresholding, which requires suitable thresholds, thr_j, 1 ≤ j ≤ 2. Those thresholds are determined by an Artificial Neural Network (ANN). This soft thresholding yields two denoised coefficients, cDd1 and cDd2.
(iii) Third step: applying the denoising approach based on MMSE Estimate of Spectral Amplitude [22] to cA2 to obtain a denoised coefficient, cAd2.
(iv) Fourth step: applying the inverse of LWT, LWT^-1, to cDd1, cDd2, and cAd2 to finally obtain the enhanced signal.
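The soft thresholding used in the second step can be illustrated with a minimal Python sketch. The function name `soft_threshold`, the sample values, and the threshold `thr1` are hypothetical and chosen only for illustration; in [26] the thresholds come from an ANN, which is not reproduced here.

```python
def soft_threshold(coeffs, thr):
    """Shrink each coefficient toward zero by thr; coefficients with |c| <= thr become 0."""
    out = []
    for c in coeffs:
        mag = abs(c) - thr
        if mag <= 0.0:
            out.append(0.0)
        else:
            # Keep the sign of the original coefficient.
            out.append(mag if c > 0 else -mag)
    return out


# Hypothetical detail coefficients and threshold (an ANN would supply thr in [26]).
cD1 = [0.9, -0.2, 0.05, -1.4]
thr1 = 0.3
cDd1 = soft_threshold(cD1, thr1)
```

Small coefficients, which are assumed to be dominated by noise, are set to zero, while larger ones are kept but reduced in magnitude by the threshold.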
In Section 2 of this paper, materials and methods are presented. Section 2.4 describes the speech enhancement technique proposed in this work. In Section 3, results and discussion are presented. Finally, Section 4 concludes the paper.

The Stationary Bionic Wavelet Transform (SBWT).
In [19], the SBWT was proposed as a novel wavelet transform. This transform was initially introduced to solve the perfect reconstruction problem that exists with the Bionic Wavelet Transform (BWT). It has been applied to speech enhancement [19,20] and also to ECG denoising [21].

The MMSE Estimate of Spectral Amplitude.
In the literature, it was proposed to estimate the noise power spectral density using MMSE (Minimum Mean Square Error) optimal estimation [22]. It was shown that the resulting estimator can be interpreted as a VAD (Voice Activity Detector)-based noise power estimator, in which the noise power is updated only when speech absence is detected and a bias compensation is required [22]. It was also shown that the bias compensation is not needed if the VAD is replaced by a soft SPP (Speech Presence Probability) with fixed priors [22]. Choosing fixed priors has the benefit of decoupling the noise power estimator from subsequent steps in a speech enhancement algorithm, such as the estimation of the speech power and of the clean speech [22]. Gerkmann and Richard [22] showed that the proposed SPP approach maintains the quick noise tracking performance of the bias-compensated MMSE-based technique while exhibiting less overestimation of the spectral noise power and an even lower computational complexity.
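To make the SPP-based update more concrete, the following Python sketch tracks the noise power of a single frequency bin across frames. It is a simplified illustration, not the exact algorithm of [22]: the function names, the fixed prior `p_h1 = 0.5`, the fixed a priori SNR under speech presence of 15 dB, and the smoothing factor `alpha = 0.8` are all assumptions made here, and refinements described in [22] are omitted.

```python
import math


def spp_noise_update(noisy_power, noise_prev, xi_h1_db=15.0, p_h1=0.5, alpha=0.8):
    """One frame of soft-SPP noise power tracking for a single frequency bin.

    noisy_power: |Y|^2 of the current frame; noise_prev: previous noise power estimate (> 0).
    All default constants are illustrative assumptions.
    """
    xi = 10.0 ** (xi_h1_db / 10.0)          # fixed a priori SNR under speech presence
    prior_ratio = (1.0 - p_h1) / p_h1       # P(H0) / P(H1)
    post_snr = noisy_power / noise_prev     # a posteriori SNR using the old noise estimate
    # Posterior speech presence probability given the observation.
    spp = 1.0 / (1.0 + prior_ratio * (1.0 + xi) * math.exp(-post_snr * xi / (1.0 + xi)))
    # Soft decision: trust the observation when speech is unlikely,
    # and the previous estimate when speech is likely.
    noise_periodogram = (1.0 - spp) * noisy_power + spp * noise_prev
    # Recursive smoothing over time.
    noise_new = alpha * noise_prev + (1.0 - alpha) * noise_periodogram
    return noise_new, spp


def track_noise(noisy_powers, noise_init):
    """Run the update over a sequence of |Y|^2 values for one bin."""
    estimates = []
    noise = noise_init
    for power in noisy_powers:
        noise, _ = spp_noise_update(power, noise)
        estimates.append(noise)
    return estimates


# Hypothetical bin: noise-only frames with power near 1, then a loud speech burst.
estimates = track_noise([1.0, 1.2, 0.9, 100.0, 120.0, 1.1], noise_init=1.0)
```

Because the speech presence probability is soft, no hard voice activity decision is needed: a loud frame pushes the probability toward one and leaves the noise estimate almost unchanged, while a quiet frame lets the estimate follow the observation.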

Signal Model.
In [22], Gerkmann and Richard considered frame-by-frame processing of time-domain signals, where the Discrete Fourier Transform (DFT) is applied to each frame. Let the complex spectral noise and speech coefficients be given, respectively, by $N_k(l)$ and $S_k(l)$, where $l$ is the time frame index and $k$ is the frequency bin index [22]. In [22], it was assumed that noise and speech are additive in the short-time Fourier domain. Therefore, the complex spectral noisy observation has the following expression:

$$Y_k(l) = S_k(l) + N_k(l). \qquad (1)$$

In [22], it was supposed that the noise and speech signals have zero mean and are independent, so that

$$E\left(|Y_k(l)|^2\right) = E\left(|S_k(l)|^2\right) + E\left(|N_k(l)|^2\right),$$

where $E(\cdot)$ denotes the statistical expectation operator. The spectral noise and speech powers are expressed as follows:

$$\sigma_N^2 = E\left(|N_k(l)|^2\right), \qquad \sigma_S^2 = E\left(|S_k(l)|^2\right).$$

Then, the a posteriori SNR and the a priori SNR are expressed, respectively, as follows:

$$\gamma = \frac{|Y_k(l)|^2}{\sigma_N^2}, \qquad \xi = \frac{\sigma_S^2}{\sigma_N^2}.$$

All details about MMSE-based noise power estimation are given in [22].
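As a short supporting step, the additivity of the expected powers can be derived from the additive model and the zero-mean, independence assumptions. The notation here is a sketch: $Y$, $S$, and $N$ denote the noisy, speech, and noise coefficients at a given bin and frame (indices dropped for brevity), and $N^{*}$ is the complex conjugate of $N$.

$$E\left(|Y|^2\right) = E\left(|S + N|^2\right) = E\left(|S|^2\right) + E\left(|N|^2\right) + 2\,\mathrm{Re}\left\{E\left(S N^{*}\right)\right\}$$

$$E\left(S N^{*}\right) = E(S)\,E\left(N^{*}\right) = 0 \;\Longrightarrow\; E\left(|Y|^2\right) = E\left(|S|^2\right) + E\left(|N|^2\right)$$

The cross term vanishes because independence lets the expectation factor into a product, and each factor is zero by the zero-mean assumption.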

2.4. The Proposed Speech Enhancement Technique.
The speech enhancement technique introduced in this work is based on the SBWT [19][20][21] and the MMSE Estimate of Spectral Amplitude [22]. The novelty of this approach consists in applying the speech enhancement method based on MMSE Estimate of Spectral Amplitude [1,22] in the SBWT domain. In fact, this technique [22] is applied to each noisy stationary bionic wavelet coefficient in order to denoise it. Those noisy coefficients are obtained by applying the SBWT to the noisy speech signal. Then, the inverse of SBWT (SBWT^-1) is applied to the obtained denoised coefficients in order to finally obtain the enhanced speech signal. Figure 1 illustrates the flowchart of the proposed technique.
According to Figure 1, the first step of the proposed approach is to apply the SBWT to the noisy speech signal to obtain eight noisy stationary bionic wavelet coefficients. Those coefficients are named wb_i, 1 ≤ i ≤ 8, and each of them is denoised by the speech enhancement technique based on MMSE Estimate of Spectral Amplitude [1,22], which yields eight denoised coefficients, wd_i, 1 ≤ i ≤ 8 (Figure 1). Finally, the inverse SBWT (SBWT^-1) is applied to those coefficients, wd_i, 1 ≤ i ≤ 8, in order to obtain the enhanced signal.
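The three-stage structure described above (decompose into eight coefficients wb_i, denoise each one, reconstruct from the wd_i) can be sketched in Python as follows. This is only a structural illustration under a loud assumption: the SBWT itself is not implemented here. A simple undecimated, additive multiscale decomposition is used as a stand-in because it also keeps every sub-band at the full signal length and reconstructs the input exactly. The names `smooth`, `analyze`, `synthesize`, and `enhance` are hypothetical, and the per-band denoiser is passed in as a function rather than being the MMSE estimator of [22].

```python
import math


def smooth(x, step):
    """Three-tap weighted average with taps spaced `step` samples apart (edges clamped)."""
    n = len(x)
    out = []
    for i in range(n):
        left = x[max(i - step, 0)]
        right = x[min(i + step, n - 1)]
        out.append(0.25 * left + 0.5 * x[i] + 0.25 * right)
    return out


def analyze(x, n_bands=8):
    """Stand-in for the SBWT: split x into n_bands full-length sub-bands that sum to x."""
    bands = []
    approx = list(x)
    for j in range(n_bands - 1):
        smoother = smooth(approx, 2 ** j)
        # Detail at scale j: what the smoothing removed.
        bands.append([a - s for a, s in zip(approx, smoother)])
        approx = smoother
    bands.append(approx)  # final coarse approximation
    return bands


def synthesize(bands):
    """Stand-in for SBWT^-1: sample-wise sum of the sub-bands."""
    return [sum(values) for values in zip(*bands)]


def enhance(noisy, denoise_band, n_bands=8):
    """Decompose, denoise every sub-band with the same method, and reconstruct."""
    wb = analyze(noisy, n_bands)          # noisy coefficients wb_i
    wd = [denoise_band(b) for b in wb]    # denoised coefficients wd_i
    return synthesize(wd)


# With an identity "denoiser", the pipeline returns the input signal.
signal = [math.sin(0.1 * i) for i in range(64)]
restored = enhance(signal, lambda band: band)
```

Passing the denoiser as a parameter mirrors the design of the proposed technique, in which the same denoising method is applied independently to each of the eight coefficients before reconstruction.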

Minimum Mean Square Error (MMSE) Estimate of Spectral Amplitude in the SBWT Domain.
In general, classical speech enhancement approaches based on thresholding in the wavelet transform domain can introduce distortions into the original speech signal. This occurs particularly for unvoiced sounds. Consequently, a great number of speech enhancement techniques based on wavelet transforms employ other tools such as spectral subtraction (SS), Wiener filtering, and MMSE-STSA estimation [39,40].
This is the reason why we apply the Minimum Mean Square Error (MMSE) Estimate of Spectral Amplitude in the SBWT domain in our speech enhancement system. Using the SBWT solves the perfect reconstruction problem that arises when the BWT is applied [19].
Furthermore, the SBWT, like other wavelet transforms [41,42], tends to decorrelate the data [43], which facilitates noise suppression. Applying the Minimum Mean Square Error (MMSE) Estimate of Spectral Amplitude [22] to each noisy stationary bionic wavelet coefficient permits a better adaptation of the speech and noise estimations than applying this technique [22] to the whole noisy speech signal.

Unsupervised Speech Denoising via Perceptually Motivated Robust Principal Component Analysis [23].
To overcome the shortcomings of the existing sparse and low-rank speech denoising techniques, namely that the auditory perceptual properties are not fully exploited and that the speech degradation is easily perceived, a perceptually motivated robust principal component analysis (ISNRPCA) technique was presented [23]. In order to reflect the non-linear frequency perception of the basilar membrane, the cochleagram is employed as the input of ISNRPCA. The latter employs the perceptually meaningful Itakura-Saito measure as its optimization objective function. Furthermore, non-negative constraints are imposed to regularize the decomposed terms with respect to their physical meaning [23]. In [23], Min et al. proposed an alternating direction method of multipliers (ADMM) for solving the optimization problem of ISNRPCA. The latter is completely unsupervised, and neither the noise model nor the speech model needs to be trained beforehand. Experimental results under diverse kinds of noise and different SNRs show that ISNRPCA gives promising results for speech denoising [23].

The Speech Enhancement Technique Based on MSS-SMPO [25].
In [25], a two-step enhancement technique based on spectral subtraction and phase spectrum compensation was presented for noisy speech in diverse environments involving non-stationary noise and medium to low SNR levels. In the first step of the technique proposed in [25], the magnitude of the noisy speech spectrum is modified by a spectral subtraction technique, for which a noise estimation approach was introduced. The latter is based on the low-frequency information of the noisy speech. This noise estimation technique is able to estimate the non-stationary noise precisely. In the second step, the phase spectrum of the noisy speech is modified by phase spectrum compensation, where an SNR-dependent technique is incorporated to determine the amount of compensation to be imposed on the phase spectrum [25]. A modified complex spectrum is obtained by combining the magnitude from the spectral subtraction step with the modified phase spectrum from the phase compensation step, which is found to be a better representation of the enhanced speech spectrum.

Results and Discussion
In this work, the evaluation of the proposed technique is performed by its application to ten Arabic speech sentences pronounced by a male speaker and ten others by a female speaker (Table 1).
Those speech signals are artificially degraded by additive noise at different values of SNRi (the SNR before denoising). To corrupt those speech signals, we have chosen four kinds of noise: white Gaussian, car, F16, and tank noises. Those twenty speech signals are sampled at 16 kHz and are listed in Table 1. Also, to evaluate the proposed technique, it is compared with three other speech enhancement approaches, which are as follows:
(i) The denoising approach based on MMSE Estimate of Spectral Amplitude [22].
(ii) The unsupervised speech denoising technique via perceptually motivated robust principal component analysis [23].
(iii) The speech enhancement approach based on MSS-SMPO [24].
This evaluation is performed through the computation of the SNR (Signal to Noise Ratio), the Segmental SNR (SSNR), and the PESQ (Perceptual Evaluation of Speech Quality). The results obtained from these computations are presented in Tables 2-16. According to these tables, the best results are the values in italics, and they are obtained in practically all cases with the proposed technique. Therefore, this technique outperforms the other speech enhancement approaches [22][23][24][25] used in this evaluation. Figure 2 illustrates an example of speech enhancement in which the proposed technique is applied to the clean speech signal (Figure 2(a)) corrupted additively by a car noise (Volvo) with SNR = 0 dB (Figure 2(b)). According to this figure, this technique considerably reduces the noise and yields an enhanced speech signal (Figure 2(c)) with little distortion, despite the low value of the SNR (0 dB). Figure 3 illustrates the spectrograms of the clean, noisy, and enhanced speech signals. The spectrogram in Figure 3(b) shows that the noise corrupting the speech signal is localized in the low-frequency region. The spectrogram in Figure 3(c) shows that the car noise is considerably reduced by the proposed speech enhancement technique. Moreover, this technique yields an enhanced speech signal with low distortion compared to the clean speech signal (Figure 2(a)).
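The artificial corruption at a chosen input SNR and two of the objective measures (SNR and SSNR) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the evaluation code used for the tables: the function names, the frame length of 256 samples, and the clamping of per-frame values to the range from -10 dB to 35 dB are choices made here (clamping is a common convention for segmental SNR), and the test signals are synthetic. PESQ follows a standardized algorithm and is not reproduced.

```python
import math
import random


def mix_at_snr(clean, noise, target_snr_db):
    """Scale `noise` so that clean + scaled noise has the requested input SNR in dB."""
    p_clean = sum(s * s for s in clean) / len(clean)
    p_noise = sum(v * v for v in noise) / len(noise)
    gain = math.sqrt(p_clean / (p_noise * 10.0 ** (target_snr_db / 10.0)))
    return [s + gain * v for s, v in zip(clean, noise)]


def snr_db(clean, estimate):
    """Global SNR: clean signal energy over the energy of the residual error."""
    sig = sum(s * s for s in clean)
    err = sum((s - e) ** 2 for s, e in zip(clean, estimate))
    return 10.0 * math.log10(sig / err)


def segmental_snr_db(clean, estimate, frame_len=256, lo=-10.0, hi=35.0):
    """Mean of per-frame SNRs over non-overlapping frames, each clamped to [lo, hi] dB."""
    eps = 1e-12  # guards against log of zero in silent or perfectly restored frames
    values = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        c = clean[start:start + frame_len]
        e = estimate[start:start + frame_len]
        sig = sum(s * s for s in c)
        err = sum((s - v) ** 2 for s, v in zip(c, e))
        frame_snr = 10.0 * math.log10((sig + eps) / (err + eps))
        values.append(min(max(frame_snr, lo), hi))
    if not values:
        raise ValueError("signal is shorter than one frame")
    return sum(values) / len(values)


# Synthetic stand-ins for a clean utterance and a noise recording.
random.seed(0)
clean = [math.sin(0.05 * i) for i in range(1024)]
noise = [random.gauss(0.0, 1.0) for _ in range(1024)]
noisy = mix_at_snr(clean, noise, target_snr_db=0.0)
input_snr = snr_db(clean, noisy)
input_ssnr = segmental_snr_db(clean, noisy)
```

Computing the same measures on the noisy signal and on the enhanced signal gives the before and after values that a comparison between techniques relies on.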
In the following, we compare the proposed technique with our previous speech enhancement approach, which is based on LWT and ANN and uses MMSE [26]. The first difference between the speech enhancement technique proposed in this work and our previous approach is that they use two completely different wavelet transforms: the SBWT for the technique proposed in this paper and the LWT for our previous approach proposed in [26]. The second difference between these two techniques is that, in the technique proposed in this paper, the denoising approach based on MMSE Estimate of Spectral Amplitude [22] is applied to all stationary bionic wavelet coefficients, whereas in our previous speech enhancement technique proposed in [26], this approach [22] is applied only to the approximation coefficient. The latter also uses an ANN, which determines the thresholds for the soft thresholding of the detail coefficients. The corresponding tables present the results obtained from the computation of SNR, SSNR, and PESQ for the two techniques.

Security and Communication Networks
According to these tables, the best results are the values in italics, and they are obtained with the proposed technique. Therefore, this technique outperforms the other speech enhancement approach proposed in [26].

Conclusion
In this paper, we propose a new speech enhancement technique based on SBWT and MMSE Estimate of Spectral Amplitude. In the first step of this technique, the SBWT is applied to the noisy speech signal to obtain eight noisy stationary bionic wavelet coefficients. Each of those coefficients is then denoised with the approach based on MMSE Estimate of Spectral Amplitude. Finally, the inverse of SBWT (SBWT^-1) is applied to the denoised stationary wavelet coefficients to obtain the enhanced speech signal. This technique is evaluated by comparing it with four other speech enhancement approaches. The first one is the denoising technique based on MMSE Estimate of Spectral Amplitude. The second one is the speech enhancement technique based on MSS-SMPO. The third one is the unsupervised speech denoising approach through perceptually motivated robust principal component analysis. The fourth one is the speech enhancement technique based on LWT and ANN and using MMSE Estimate of Spectral Amplitude.
This evaluation is performed through the computation of the Signal to Noise Ratio (SNR), the Segmental SNR (SSNR), and the Perceptual Evaluation of Speech Quality (PESQ). The results obtained from these computations show that the proposed technique outperforms the other previously mentioned techniques. Furthermore, the technique proposed in this work considerably reduces the noise corrupting the clean speech signal and yields an enhanced speech signal with good perceptual quality.