Direct Recovery of Clean Speech Using a Hybrid Noise Suppression Algorithm for Robust Speech Recognition System

A new log-power domain feature enhancement algorithm named NLPS is developed. It consists of two parts: direct solution of a nonlinear system model and log-power subtraction. In contrast to other methods, the proposed algorithm does not need a prior statistical model of speech or noise. Instead, it works by directly solving the nonlinear function derived from the speech recognition system. Log-power subtraction, the second part of the proposed algorithm, is a separate step that refines the accuracy of the estimated cepstrum. The proposed algorithm solves the discontinuity problem in the speech probability distribution function (PDF) caused by traditional spectral subtraction algorithms. The effectiveness of the proposed algorithm is evaluated extensively on the standard AURORA2 database. The results show that significant improvement can be achieved by incorporating the proposed algorithm. It reaches a recognition rate of over 86% for noisy speech (averaged over SNRs from 0 dB to 20 dB), which corresponds to a 48% error reduction over the baseline Mel-frequency Cepstral Coefficient (MFCC) system.


Introduction
The main objective of speech recognition is to achieve a higher recognition rate. However, many factors tend to degrade the performance of an automatic speech recognition (ASR) system, such as environmental noise, channel distortion, and speaker variability [1,2]. Generally, an ASR system consists of two parts, feature extraction and pattern matching. Therefore, methods which aim to improve the performance of ASR systems can be divided into two main categories, the "model" approach and the "feature" approach. The "model" approach focuses on improving the speech recognizer, where the speech features are classified into different patterns developed from the statistical properties of speech. The "feature" approach puts the emphasis on improving the robustness of the speech features. The method proposed in this paper belongs to the latter category.
Noise reduction, or clean speech estimation, is a straightforward "feature" approach to improving the performance of ASR systems. There are different ways to obtain the estimate, and minimum mean square error (MMSE) estimation is one of the most important. Ephraim and Malah derived the short-time spectral amplitude (STSA) estimator using MMSE in 1984 [3], and it has become a standard approach for clean speech estimation in speech processing. The advantage of the MMSE estimator is clear: it is mathematically optimal in the mean square sense, so it can theoretically give a good estimate of the clean speech, and its solid derivation makes it easy to analyze. Originally, the MMSE-based algorithms were intended for speech enhancement.
For speech recognition, several MMSE-based algorithms have been developed. Yu et al. developed the MMSE estimator in the log-power domain in 2008 [4]. A cepstral domain estimator also appeared in 2008 [5]. Besides, different distortion models have been developed for improving speech recognition systems [5,6]. Recently, some more complicated MMSE-based algorithms which require so-called stereo data input have been proposed [7]. Admittedly, MMSE works well for speech enhancement and speech recognition. The main idea of MMSE is to estimate the clean speech from the noisy speech. The success of MMSE in previous implementations shows that it is one of the means to improve the performance of an ASR system. However, it is not necessarily the only one. Mathematically, the recovery of clean speech from a noisy corpus is a problem of solving a nonlinear function. The above mentioned MMSE approach can be treated as a statistical approach to solving this function. However, there are other ways of finding the root of a nonlinear function. In this paper, an iterative root finding approach is incorporated to recover the clean speech.
Unlike many other algorithms, the proposed algorithm does not need stereo data input, which makes it more robust to different conditions, because stereo data is impossible to obtain in practical situations. The novelty of this paper lies in the fact that the two parts of the MFCCs (c1∼c12 and log-power) are processed separately. Direct solution of the nonlinear system function is much easier than the statistical approach. Besides, compared with earlier MMSE-based algorithms, the proposed method does not need any additional training. The AURORA2 database is used for verification tests. It is a widely used, standard English database, which contains isolated digits as well as digit sequences. Comparison is made against ordinary MFCCs, MMSE-STSA [3], Spectral Subtraction (SS) [8], Cepstral Mean and Variance Normalization (CMVN), the ETSI standard advanced front-end feature extraction algorithm (AFE) [9], and Mean Variance Normalization and ARMA filtering (MVA) [10]. Experimental results show the excellent performance of the proposed method.
The rest of the paper is organized as follows. In Section 2, the system model, nonlinear function, is presented. Section 3 discusses the details of the proposed algorithm, including detailed iteration steps of root finding algorithm, prior estimates for clean speech and noise, and the novel log-power subtraction method. The experimental speech databases, ASR systems and the additional comparison methods are described in Section 4. Conclusions are summarized in Section 5.

System Model
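Following a derivation similar to that in [11], the clean speech waveform is denoted as $x_t$, where $t$ is the time index. It is assumed that $x_t$ is corrupted by the independent additive noise waveform $n_t$ and becomes the noisy speech waveform $y_t$, as shown in

$$y_t = x_t + n_t. \tag{1}$$

The speech signal is cut into frames and transformed into the frequency domain using the DFT. Then (1) becomes

$$Y_{f,t} = X_{f,t} + N_{f,t}, \tag{2}$$

where $f$ is the frequency bin index. By assuming the additivity of the powers of the components in the frequency domain [12], the power spectrum of the noisy speech is given by

$$|Y_{f,t}|^2 = |X_{f,t}|^2 + |N_{f,t}|^2. \tag{3}$$

After applying Mel-filterbanks to the power spectra,

$$\sum_f W_f^l |Y_{f,t}|^2 = \sum_f W_f^l |X_{f,t}|^2 + \sum_f W_f^l |N_{f,t}|^2, \tag{4}$$

where $W_f^l$ stands for the transfer function of the $l$th filter. Define the log channel energy vectors as

$$\mathbf{Y} = \Big[\log \sum_f W_f^1 |Y_{f,t}|^2, \ldots, \log \sum_f W_f^L |Y_{f,t}|^2\Big]^T, \tag{5}$$

with $\mathbf{X}$ and $\mathbf{N}$ defined in the same way from $|X_{f,t}|^2$ and $|N_{f,t}|^2$, where $L$ is the number of filters and $\log(\cdot)$ denotes the natural logarithm. Then (4) becomes

$$e^{\mathbf{Y}} = e^{\mathbf{X}} + e^{\mathbf{N}}. \tag{6}$$

Changing (6) to the log-power domain gives

$$\mathbf{Y} = \log\big(e^{\mathbf{X}} + e^{\mathbf{N}}\big), \tag{7}$$

$$\mathbf{Y} = \mathbf{X} + \log\big(\mathbf{1} + e^{\mathbf{N} - \mathbf{X}}\big), \tag{8}$$

where $\mathbf{1}$ stands for a vector with all elements equal to one. Then, the MFCCs can be calculated by

$$\mathrm{MFCC} = \mathrm{DCT}(\mathbf{Y}). \tag{9}$$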
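The log-power relation used in this section can be checked numerically. The following is a minimal sketch, assuming the model $\mathbf{Y} = \mathbf{X} + \log(\mathbf{1} + e^{\mathbf{N}-\mathbf{X}})$ and exact additivity of speech and noise powers; the toy spectra, the two-filter bank, and the function names are hypothetical and chosen only for illustration.

```python
import math

def log_mel_energies(power_spectrum, filterbank):
    """Log channel energies: log of the filter-weighted sum of bin powers."""
    return [math.log(sum(w * p for w, p in zip(weights, power_spectrum)))
            for weights in filterbank]

def noisy_log_energy(x_log, n_log):
    """Log-power model, element-wise: Y = X + log(1 + exp(N - X))."""
    return [x + math.log1p(math.exp(n - x)) for x, n in zip(x_log, n_log)]

# Hypothetical toy data: 4 frequency bins and 2 overlapping filters.
speech_power = [4.0, 9.0, 1.0, 0.25]
noise_power = [1.0, 1.0, 2.0, 0.5]
filterbank = [[1.0, 0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0, 0.5]]

# Assumed additivity of powers in each bin.
noisy_power = [s + n for s, n in zip(speech_power, noise_power)]

X = log_mel_energies(speech_power, filterbank)
N = log_mel_energies(noise_power, filterbank)
Y_direct = log_mel_energies(noisy_power, filterbank)  # computed from the noisy spectrum
Y_model = noisy_log_energy(X, N)                      # computed from the log-power model
```

Under the additivity assumption, `Y_direct` and `Y_model` agree up to rounding, which is why recovering $\mathbf{X}$ from $\mathbf{Y}$ and $\mathbf{N}$ reduces to solving a nonlinear equation.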

Algorithm Description
According to (3), the clean speech power can be estimated by

$$|\hat{X}_{f,t}|^2 = |Y_{f,t}|^2 - |N_{f,t}|^2, \tag{10}$$

where $|\hat{X}_{f,t}|^2$ is the clean speech estimate. Equation (10) is just the basic idea of Spectral Subtraction (SS) [8]. However, (10) is only valid when $|Y_{f,t}|^2 > |N_{f,t}|^2$. Otherwise, the clean speech estimate becomes zero or negative, which is obviously wrong.
A traditional way to solve the above mentioned problem is to implement a threshold that guarantees the clean speech estimate to be positive:

$$|\hat{X}_{f,t}|^2 = \max\big(|Y_{f,t}|^2 - |N_{f,t}|^2,\ \varepsilon\big), \tag{11}$$

where the parameter $\varepsilon$ is a small positive constant. Equation (11) is a very common way to implement Spectral Subtraction in speech recognition and will be denoted as SS in the later discussion. Admittedly, (11) manages to increase the SNR, which in turn is a straightforward way to improve the performance of speech recognition systems.
However, there is a very serious problem caused by SS. Because of the threshold $\varepsilon$, a certain portion of the recovered speech is forced to the value $\varepsilon$. Figure 1 gives an example of the effect of SS on a speech spectrogram (see the blue area in Figure 1). After processing by SS or other similar methods, the probability distribution of the speech is greatly changed. For example, Figure 2 shows that the probability of the speech power being equal to $\varepsilon$ is greatly increased, which makes the PDF of the processed speech discontinuous.
Most state-of-the-art ASR systems incorporate statistical methods to perform pattern recognition, and the hidden Markov model (HMM) is one of the most popular. These statistical methods are all developed from a certain statistical model of the speech. In other words, a probability distribution is always assumed as the basis of the recognizer derivation. SS-like algorithms greatly change the PDF of the speech, which in turn causes the performance of ASR systems to drop.
The proposed algorithm obtains the clean speech in an iterative manner, which means that the clean speech estimate $|\hat{X}_{f,t}|^2$ gradually converges to a better guess. There is no forced mass assignment of the negative elements to a single value. Thus the discontinuity problem is avoided.
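The pile-up at the floor value can be illustrated with a few bins. This is a minimal sketch of floored spectral subtraction, assuming the common form that takes the maximum of the difference and $\varepsilon$; the power values and the flat noise estimate are hypothetical.

```python
def spectral_subtraction(noisy_power, noise_power, eps=1e-3):
    """Floored power-domain spectral subtraction (assumed max(difference, eps) form)."""
    return [max(y - n, eps) for y, n in zip(noisy_power, noise_power)]

# Hypothetical bin powers of one frame and a flat noise power estimate.
eps = 1e-3
noisy = [0.2, 0.9, 1.5, 0.4, 3.0, 0.7]
noise = [1.0] * len(noisy)

cleaned = spectral_subtraction(noisy, noise, eps)

# Every bin with noisy power at or below the noise estimate lands on eps,
# which creates a spike in the histogram of the processed speech power.
floored = sum(1 for value in cleaned if value == eps)
```

Here four of the six bins end up at exactly `eps`, which is the kind of probability mass concentration that makes the processed-speech PDF discontinuous.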

Iterative Solution.
The novelty of implementing an iterative root finding algorithm is that, unlike Spectral Subtraction-like approaches, it manages to overcome the awkward $|Y_{f,t}|^2 \le |N_{f,t}|^2$ problem without causing discontinuity in the speech PDF. The statistical approach handles this by applying a series of mathematical operations which are not sensitive to the above mentioned problem. In the power domain, the final expression is

$$|\hat{X}_{f,t}|^2 = G \times |Y_{f,t}|^2, \tag{12}$$

where $G$ is the gain of the estimator. This form fundamentally avoids the possibility of $|Y_{f,t}|^2 \le |N_{f,t}|^2$, because the equivalent noise estimate is $(1-G) \times |Y_{f,t}|^2$, which is generated from only the current frame.
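As described before, the iterative root finding algorithm can also handle $|Y_{f,t}|^2 \le |N_{f,t}|^2$ very well. Equation (8) can be reshaped to

$$\mathbf{X} + \log\big(\mathbf{1} + e^{\mathbf{N} - \mathbf{X}}\big) - \mathbf{Y} = \mathbf{0}, \tag{13}$$

where $\mathbf{Y}$ is the noisy speech vector and $\mathbf{X}$ is the parameter that needs to be recovered. If the noise vector $\mathbf{N}$ can be reasonably estimated, (13) becomes a nonlinear function of $\mathbf{X}$, which can be solved by iterative root finding algorithms. Denoting

$$f(\mathbf{X}) = \mathbf{X} + \log\big(\mathbf{1} + e^{\mathbf{N} - \mathbf{X}}\big) - \mathbf{Y}, \tag{14}$$

then

$$f(\mathbf{X}) = \mathbf{0}. \tag{15}$$

According to Newton's method, given a function $f(\mathbf{X})$, its derivative $f'(\mathbf{X})$, and a first guess $\mathbf{X}_0$, the solution to the function can be reached by

$$\mathbf{X}_{i+1} = \mathbf{X}_i - \frac{f(\mathbf{X}_i)}{f'(\mathbf{X}_i)}, \tag{17}$$

where $i$ is the iteration index and all operations are element-wise.
For the iterative step it has to be noted that

$$f'(\mathbf{X}_i) = \frac{\mathbf{1}}{\mathbf{1} + e^{\mathbf{N} - \mathbf{X}_i}} \to \mathbf{0} \quad \text{as } \mathbf{N} - \mathbf{X}_i \to \infty. \tag{18}$$

Therefore, a threshold $\beta$ is adopted to guarantee that the denominator is non-zero. Then (17) is modified to

$$\mathbf{X}_{i+1} = \mathbf{X}_i - \frac{f(\mathbf{X}_i)}{\max\big(f'(\mathbf{X}_i),\ \beta\big)}. \tag{19}$$

With a successful initial guess of the clean speech vector $\mathbf{X}_0$ and the noise vector $\mathbf{N}$, the clean speech estimate $\mathbf{X}$ can be satisfactorily approximated. Equation (19) works very well even if $|Y_{f,t}|^2 \le |N_{f,t}|^2$. Regarding the discontinuity problem, at extreme conditions where the threshold $\beta$ takes effect, the iteration becomes

$$\mathbf{X}_{i+1} = \mathbf{X}_i - \frac{f(\mathbf{X}_i)}{\beta}. \tag{20}$$

Since the update in (20) still depends on $\mathbf{X}_i$, $\mathbf{Y}$, and $\mathbf{N}$, it does not cause mass assignment of the same value, which means the discontinuity problem will not appear.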
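The thresholded Newton iteration can be sketched for a single log channel energy. This is an illustrative sketch, not the authors' implementation: it assumes the residual $f(X) = X + \log(1 + e^{N-X}) - Y$, clamps the derivative from below by $\beta$, and uses the noisy value as the starting point, whereas the paper takes the initial guess from an MMSE-STSA estimate. All function names are hypothetical.

```python
import math

def residual(x, y, n):
    """Assumed residual f(X) = X + log(1 + exp(N - X)) - Y for one channel."""
    return x + math.log1p(math.exp(n - x)) - y

def derivative(x, n):
    """f'(X) = 1 / (1 + exp(N - X)); it tends to zero when N is far above X."""
    return 1.0 / (1.0 + math.exp(n - x))

def newton_step(x, y, n, beta=0.8):
    """One Newton update with the derivative clamped from below by beta."""
    return x - residual(x, y, n) / max(derivative(x, n), beta)

def recover_clean(y, n, x0, iterations=1, beta=0.8):
    """Run a fixed number of thresholded Newton updates from the guess x0."""
    x = x0
    for _ in range(iterations):
        x = newton_step(x, y, n, beta)
    return x

# Case 1: speech above noise. The true clean value 2.0 is recovered.
n_est = 0.0
y_obs = 2.0 + math.log1p(math.exp(n_est - 2.0))
x_hat = recover_clean(y_obs, n_est, x0=y_obs, iterations=10)

# Case 2: noisy energy below the noise estimate. One step still returns
# finite, input-dependent values instead of a shared floor value.
low_a = recover_clean(0.5, 1.0, x0=0.5)
low_b = recover_clean(0.2, 1.0, x0=0.2)
```

In the second case no exact root exists, yet the two estimates differ from each other, which illustrates why the iteration does not produce a mass assignment to one value.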

Prior Estimates.
In statistics, a minimum mean square error (MMSE) estimator is an estimator which minimizes the mean square error (MSE), and it is widely used in many areas of signal processing. In 1984, Ephraim and Malah derived the short-time spectral amplitude (STSA) estimator using MMSE [3]. Since then, MMSE has become a standard approach for enhancing the quality of speech. Therefore, it is chosen to generate the prior estimate of the clean speech.
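The following equation shows the standard cost function for the MMSE approach [3]:

$$\hat{A}_{f,t} = \arg\min_{\hat{A}} E\big\{(A_{f,t} - \hat{A})^2\big\}, \tag{21}$$

where $A_{f,t} = |X_{f,t}|$ is the clean speech amplitude. By following MMSE-STSA [3], the clean speech estimate can be reached by

$$\hat{A}_{f,t} = \Gamma(1.5)\,\frac{\sqrt{v_{f,t}}}{\gamma_{f,t}}\, M(-0.5; 1; -v_{f,t})\,|Y_{f,t}| = \Gamma(1.5)\,\frac{\sqrt{v_{f,t}}}{\gamma_{f,t}}\, e^{-v_{f,t}/2}\Big[(1 + v_{f,t})\, I_0\Big(\frac{v_{f,t}}{2}\Big) + v_{f,t}\, I_1\Big(\frac{v_{f,t}}{2}\Big)\Big]\,|Y_{f,t}|, \qquad v_{f,t} = \frac{\xi_{f,t}}{1 + \xi_{f,t}}\,\gamma_{f,t}, \tag{22}$$

where $\Gamma(\cdot)$ denotes the gamma function; $M(a; c; x)$ is the confluent hypergeometric function; $I_0(\cdot)$ and $I_1(\cdot)$ denote the zero and first order modified Bessel functions; $\xi_{f,t}$ and $\gamma_{f,t}$ are the a priori and a posteriori signal-to-noise ratios (SNR), respectively:

$$\xi_{f,t} = \frac{E\{|X_{f,t}|^2\}}{E\{|N_{f,t}|^2\}}, \qquad \gamma_{f,t} = \frac{|Y_{f,t}|^2}{E\{|N_{f,t}|^2\}}. \tag{23}$$

Then the clean amplitude estimate is transferred to the log-power domain, with one element for each Mel filter $l$:

$$\hat{X}_{0,l} = \log \sum_f W_f^l\, \hat{A}_{f,t}^2. \tag{24}$$

Equation (24) will serve as the initial guess for the iterative approach.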
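The MMSE-STSA gain can be evaluated with the standard library alone. The sketch below assumes the Ephraim and Malah closed form written with modified Bessel functions, and computes $I_0$ and $I_1$ by their power series, which is adequate for moderate arguments; the function names and the SNR values in the usage lines are hypothetical.

```python
import math

def bessel_i(order, x, terms=200):
    """Modified Bessel function of the first kind (order 0 or 1) via its power series."""
    half = x / 2.0
    term = half ** order / math.factorial(order)  # k = 0 term
    total = term
    for k in range(1, terms):
        term *= half * half / (k * (k + order))   # ratio of consecutive series terms
        total += term
    return total

def stsa_gain(xi, gamma):
    """MMSE-STSA gain for a priori SNR xi and a posteriori SNR gamma (linear scale)."""
    v = xi / (1.0 + xi) * gamma
    bracket = (1.0 + v) * bessel_i(0, v / 2.0) + v * bessel_i(1, v / 2.0)
    return (math.sqrt(math.pi) / 2.0) * (math.sqrt(v) / gamma) * math.exp(-v / 2.0) * bracket

# Hypothetical operating points: one high-SNR bin and one low-SNR bin.
high_snr_gain = stsa_gain(100.0, 101.0)
low_snr_gain = stsa_gain(0.1, 0.5)

# The clean amplitude estimate is the gain times the noisy amplitude.
noisy_amplitude = 2.0
clean_amplitude = high_snr_gain * noisy_amplitude
```

At high SNR the gain approaches the Wiener gain $\xi/(1+\xi)$, which gives a quick sanity check. For very large arguments the series would overflow, so a production implementation would more likely rely on exponentially scaled Bessel routines.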

Log-Power Subtraction (LPS)
Algorithm Description.
As is shown in Figure 3, there are mainly four different domains in the MFCC scheme. The proposed algorithm works in the log-power domain.
The clean speech estimate generated by the proposed algorithm in (19) is actually

$$\hat{X}_l = \log \sum_f W_f^l\, |\hat{X}_{f,t}|^2 \tag{25}$$

for each Mel filter $l$. It is the clean speech log-power vector, in the log-power domain as described in Figure 3.
The MFCC static parameters can be divided into two parts, c1∼c12 and c0/log-energy. Strictly speaking, the proposed algorithm mainly focuses on c1∼c12. The log-energy should traditionally be calculated by

$$\log E = \log \sum_f |X_{f,t}|^2. \tag{26}$$

The clean speech power estimate $|\hat{X}_{f,t}|^2$ cannot be perfectly recovered from (25) because of the Mel-filterbanks, so additional distortion would be introduced to the feature vectors. For c0, although it seems to work smoothly, the recognition results are just about "average". Therefore, a separate noise removing scheme is developed. In the iterative root finding part, an estimate of the noise is obtained. The frame clean speech power can be estimated by

$$P_c = \sum_f |Y_{f,t}|^2 - \sum_f |N_{f,t}|^2. \tag{27}$$

Then the log-energy can be calculated by

$$\log E = \log P_c. \tag{28}$$

However, a problem arises when $P_c \le 0$. Therefore, a weighting parameter is incorporated to reduce the chance that the argument of the logarithm becomes non-positive, which would give the log-energy an imaginary part; this turns (28) into (29). Furthermore, another parameter $\varepsilon_0$ is set to guarantee that the log-energy does not become infinite, which gives the final log-power part (30).
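The log-energy computation described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the weighting parameter is assumed to scale the summed noise power, its value 0.9 is borrowed from the parameter list in the implementation details without knowing that this is the parameter meant, and the floor is applied with a maximum against $\varepsilon_0 = 10^{-10}$. The function name and the toy frames are hypothetical.

```python
import math

def lps_log_energy(noisy_power, noise_power, alpha=0.9, eps0=1e-10):
    """Frame log-energy after log-power subtraction.

    Assumptions: alpha weights the summed noise power, and eps0 floors
    the frame power so that the logarithm stays finite and real.
    """
    frame_power = sum(noisy_power) - alpha * sum(noise_power)
    return math.log(max(frame_power, eps0))

# Hypothetical frame where the speech dominates on the whole.
speech_frame = lps_log_energy([1.0, 3.0, 0.5, 2.5], [1.5, 1.5, 1.5, 1.5])

# Hypothetical non-speech frame where the noise estimate exceeds the observation.
silence_frame = lps_log_energy([0.1, 0.1], [1.0, 1.0])
```

In the first frame the subtraction leaves a positive power of about 1.6, while the second frame falls back to the floor and yields $\log \varepsilon_0$.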

Theoretical Analysis.
The basic idea of log-power subtraction is similar to the Spectral Subtraction (SS) algorithm developed by Boll in 1979 [8]. Figure 4 shows the diagram of Spectral Subtraction algorithm. Boll defined the SS in the magnitude domain in [8]. When adopted in speech recognition, SS is normally implemented in the power spectral domain.
Most noise estimation algorithms are developed from statistical models of clean and noisy speech, which makes the estimate at a specific point, $|N_{f,t}|^2$, closer to an expectation or average of the noise over previous frames. Therefore, when such an estimate is used in spectral subtraction, many elements become negative, especially in non-speech periods, which leads to the problem described in Section 3. For the proposed LPS approach, however, the effect of this problem is avoided to a certain extent. The reason is that the log-power is traditionally calculated by (26), which is based on the sum of all the speech power elements in one frame. Mathematically, dividing (29) by $F$, the total number of frequency bins, yields (31), which shows that LPS is equivalent to performing spectral subtraction after averaging all the elements in the current frame. Due to the averaging process, the whole spectral subtraction scheme becomes more robust, since both the speech and the noise estimates are then expectations of the actual signal.
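The effect of averaging over the frame can be seen on a small example. The following sketch compares per-bin subtraction with subtraction of frame averages, without any weighting parameter; the bin powers and the flat noise estimate are hypothetical.

```python
def negative_bins(noisy_power, noise_power):
    """Count bins where per-bin subtraction gives a non-positive result."""
    return sum(1 for y, n in zip(noisy_power, noise_power) if y - n <= 0)

def frame_level_difference(noisy_power, noise_power):
    """Subtract the frame-averaged noise power from the frame-averaged noisy power."""
    num_bins = len(noisy_power)
    return sum(noisy_power) / num_bins - sum(noise_power) / num_bins

# Hypothetical frame: the noise estimate is an average, so individual
# noisy bins fall below it even though the frame as a whole does not.
noisy = [1.0, 3.0, 0.5, 2.5]
noise = [1.5, 1.5, 1.5, 1.5]

per_bin_failures = negative_bins(noisy, noise)
frame_result = frame_level_difference(noisy, noise)
```

Two of the four bins would need flooring in per-bin subtraction, while the frame-level difference stays positive at 0.25, so no flooring is triggered.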

Implementation Details.
The proposed algorithm consists of two parts: iterative solution of the nonlinear function and log-power subtraction. Figure 5 shows the block diagram of the proposed algorithm. The detailed parameter settings are $\alpha = 0.9$, $\alpha_{f,t} = 0.4$, $\beta = 0.8$, and $\varepsilon_0 = 10^{-10}$, and the iteration is performed only once. Minimum Statistics (MS) is used for noise estimation [13].
The AURORA2 data is based on a version of the original TIDigits (as available from LDC) downsampled to 8 kHz [14]. Noise is artificially added at several SNRs (20 dB, 15 dB, 10 dB, 5 dB, 0 dB, −5 dB). Set A and Set B are filtered with a G712 [15] characteristic filter, which simulates the response of filters found in the A/D interface of PCM transmission systems. Set C is filtered with a MIRS filter to simulate a telephone system. There are two training conditions in AURORA2, the clean training set and the multi-condition training set. For the clean training condition, the training set has no noise added, and it consists of 8440 utterances recorded from 55 male and 55 female adults. The test data consists of 4004 utterances from 52 male and 52 female speakers, which are split equally into 4 subsets of 1001 utterances each, with all speakers being present in each subset. In the multi-condition training set, four types of noise have been added at various SNR levels [14,16].

System Description.
The proposed front-end feature extractor is modified from the MFCC model provided by Voicebox Toolkit [17]. The demo scripts from the AURORA2 database are used for training. In the evaluation experiments, log-energy (log E) together with c1 to c12 is used as the static feature vector, and then the delta and delta-delta features are calculated using the frame-differential.
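The construction of dynamic features from frame differences can be sketched as follows. The exact differencing formula is not stated in the text, so this example assumes a central difference between neighbouring frames with the edge frames repeated; the function name and the toy two-dimensional frames are hypothetical.

```python
def deltas(frames):
    """Central frame difference (assumed form), repeating the first and last frames."""
    num_frames = len(frames)
    result = []
    for t in range(num_frames):
        previous = frames[max(t - 1, 0)]
        following = frames[min(t + 1, num_frames - 1)]
        result.append([(b - a) / 2.0 for a, b in zip(previous, following)])
    return result

# Hypothetical static features: 3 frames with 2 coefficients each.
static = [[0.0, 1.0], [2.0, 1.0], [4.0, 3.0]]

delta = deltas(static)        # velocity parameters
delta_delta = deltas(delta)   # acceleration parameters

# Full feature vector per frame: static, delta, and delta-delta concatenated.
features = [s + d + dd for s, d, dd in zip(static, delta, delta_delta)]
```

With 13 static parameters (log E and c1 to c12), the same concatenation gives a 39-dimensional feature vector per frame.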
The same recognizer is used for both the proposed front-end feature extraction algorithm and the baseline system for comparison. Each digit is modeled by a simple left-to-right HMM with 18 states (including two non-emitting states) and 3 Gaussian mixtures per state. Two pause models are defined. One is "sil", which has 3 HMM states and models the pauses before and after each utterance. The other one is "sp", which is a single-state model (tied with the middle state of "sil") and models the pauses between words.

Comparison Targets.
The proposed algorithm does not need stereo data input. Therefore, algorithms such as SPLICE [18] are not selected for comparison, since a comparison between algorithms with and without clean speech input is unfair. Because the proposed method is developed on top of MFCC, MFCC is chosen as the baseline. Its diagram is given in Figure 6. MMSE-STSA is the standard MMSE approach for mathematically recovering the clean speech. In this paper, the STSA algorithm is implemented with Minimum Statistics as the noise estimation part. The log-power subtraction approach is similar to SS, so SS is included to show that log-power subtraction performs much better than SS in speech recognition. MVA is a cepstral domain approach, which is chosen to show the advantage of the proposed algorithm over a related technique. Figure 7 shows the diagram of MVA.
The ETSI standard advanced front-end feature extraction algorithm (AFE) is also implemented for comparison [9].

Experimental Results.
Experiments are conducted to show the speech recognition results of the proposed NLPS algorithm with different iterations. Detailed recognition results are given in Table 1. It can be easily found out that the optimal result comes at the second iteration.
There are two training conditions in the AURORA2 database demo, the clean-training condition and the multi-training condition. In the multi-training condition, noisy speech together with clean speech is used for training the HMMs. Therefore, the recognition results from the multi-training condition are very good: for most of the SNR levels, the recognition results are over 90%. As a result, all of the above mentioned methods yield similar recognition results.
In Table 3, LPS stands for log-power subtraction. LPS + CMVN means that only LPS and CMVN are implemented in the speech recognition system. Newton refers to the system with only Newton's iterative method. Newton + CMVN means that both methods are implemented. NLPS is the final form of the proposed algorithm, which involves the implementation of all three methods: Newton's method, LPS, and CMVN. The experimental results in Table 3 are given to show that the three parts of the proposed algorithm all help to improve the performance of the speech recognition system.
In the following discussion, MFCC denotes the traditional 13 Mel Frequency Cepstral Coefficients together with the corresponding velocity and acceleration parameters. Results are averaged over the noisy test sets with SNRs from 0 to 20 dB, denoted as Avg 0-20. Another point that has to be mentioned is that the clean set results of all the above mentioned algorithms are over 99%, so it is not meaningful to attempt significant improvements at this level. Therefore, the discussion is carried out mainly for Avg 0-20 and SNR −5 dB. Figure 8 shows the experimental results at Avg 0-20 and SNR −5 dB. Table 3 shows that the implementation of LPS and Newton's method greatly improves the recognition results. Besides, the two fundamental parts, LPS and Newton's method, both contribute to the excellent performance of the proposed algorithm. As shown in Table 3, Newton's method alone can reach a recognition rate of 83.83% at Avg 0-20. With the combination of LPS and CMVN, the performance of the speech recognition system is further improved: 84.96% for Newton + CMVN and 85.76% for the proposed NLPS algorithm. Comparisons in Tables 3 and 4 show that the proposed algorithm significantly improves the performance of the speech recognition system. The relative improvement ratios are shown in Table 5. Compared with the baseline MFCC system, the proposed algorithm achieves very large improvements: 19.4% in terms of Avg 0-20 and 108% at SNR −5 dB. Very significant improvements are also reached over CMVN and STSA. At Avg 0-20, the relative improvements are 10.3% over CMVN and 7% over STSA. At SNR −5 dB, the improvements become much more significant: 100.4% over CMVN and 37.1% over STSA.
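Two different measures appear in this paper: relative improvement of the recognition rate and error reduction. The sketch below shows how the two differ, assuming the usual definitions (accuracy gain relative to the reference accuracy, and error decrease relative to the reference error rate). The accuracy values are hypothetical and are not taken from the result tables.

```python
def relative_improvement(new_accuracy, reference_accuracy):
    """Accuracy gain as a percentage of the reference accuracy (assumed definition)."""
    return 100.0 * (new_accuracy - reference_accuracy) / reference_accuracy

def error_rate_reduction(new_accuracy, reference_accuracy):
    """Error decrease as a percentage of the reference error rate (assumed definition)."""
    reference_error = 100.0 - reference_accuracy
    new_error = 100.0 - new_accuracy
    return 100.0 * (reference_error - new_error) / reference_error

# Hypothetical recognition rates in percent, for illustration only.
reference = 70.0
improved = 84.0

gain = relative_improvement(improved, reference)        # about 20 percent
reduction = error_rate_reduction(improved, reference)   # about 46.7 percent
```

The same pair of accuracies therefore gives very different percentages depending on the measure, which is worth keeping in mind when comparing the error reduction quoted in the abstract with the relative improvements quoted here.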

Results Analysis.
In speech processing, an awkward situation sometimes arises in which a speech enhancement algorithm cannot improve the speech recognition results even though it improves the speech quality in human listening tests. SS is one such method. Direct implementation of the SS in [8] yields very poor results. Therefore, in our evaluation test, the noise estimation part of SS is replaced by Minimum Statistics [19]. In terms of Avg 0-20, the relative improvement reaches 18.4%. At SNR −5 dB the relative improvement is 10.3%. The performance of SS supports the novelty of the LPS method, which is an indispensable part of the proposed NLPS algorithm. As for MVA, it is admittedly a very successful algorithm. However, the proposed algorithm still yields better results. At Avg 0-20, a 1.9% improvement is reached. At SNR −5 dB, the relative improvement is 6.1%. For the European Telecommunications Standards Institute (ETSI) standard AFE, a relative improvement of 4.3% is reached at Avg 0-20, and the relative improvement at SNR −5 dB is 12.4%.

Conclusion
In this paper, a novel algorithm for robust speech recognition systems is presented with its detailed derivation, implementation, and evaluation. It is based on the direct solution of a nonlinear system model together with a novel log-power subtraction method. The novelty of the proposed algorithm lies in four aspects. Firstly, the proposed method does not need any additional training process, which makes the computational burden very small. Secondly, the proposed method is a blind approach, which means that it yields good performance at all SNRs and noise types. Thirdly, the proposed algorithm is able to adapt to changing environments, and the adaptation can be made by simply changing the noise estimation part. Finally, the NLPS algorithm can be easily combined with other algorithms, such as the MVA discussed above. The proposed algorithm is implemented and evaluated with the AURORA2 database. Comparison is made against STSA, SS, and MVA. Experimental results demonstrate significant improvement in the recognition accuracy.