A new log-power domain feature enhancement algorithm named NLPS is developed. It consists of two parts: direct solution of the nonlinear system model and log-power subtraction. In contrast to other methods, the proposed algorithm does not need a prior speech/noise statistical model; instead, it works by directly solving the nonlinear function derived from the speech recognition system. A separate step then refines the accuracy of the estimated cepstrum by log-power subtraction, which is the second part of the proposed algorithm. The proposed algorithm manages to solve the speech probability density function (PDF) discontinuity problem caused by traditional spectral-subtraction-style algorithms. Its effectiveness is extensively evaluated on the standard AURORA2 database. The results show that significant improvement can be achieved by incorporating the proposed algorithm: it reaches a recognition rate of over 86% for noisy speech (averaged from SNR 0 dB to 20 dB), which corresponds to a 48% error reduction over the baseline Mel-frequency cepstral coefficient (MFCC) system.
The main objective of speech recognition research is to achieve a higher recognition rate. However, many factors tend to degrade the performance of automatic speech recognition (ASR) systems, such as environmental noise, channel distortion, and speaker variability […]
Noise reduction, or clean speech estimation, is a straightforward "feature" approach to improving the performance of ASR systems. There are different ways to obtain this estimate, and minimum mean square error (MMSE) estimation is one of the most important. Ephraim derived the short-time spectral amplitude (STSA) estimator using the MMSE criterion in 1984 […]
For speech recognition, several MMSE-based algorithms have been developed. Yu et al. in 2008 developed an MMSE estimator in the log-power domain […]
Unlike many other algorithms, the proposed algorithm does not need stereo data input, which makes it more robust across different conditions, since stereo data is often impossible to obtain in practical situations. The novelty of this paper lies in the fact that the two parts of the MFCCs (c1~c12 and log-power) are processed separately. Direct solution of the nonlinear system function is much easier than the statistical approach. Besides, compared with earlier MMSE-based algorithms, the proposed method does not need any additional training. The AURORA2 database is used for verification tests. It is a widely used, standard English database containing isolated digits as well as digit strings. Comparison is made against ordinary MFCCs, MMSE-STSA […]
The rest of the paper is organized as follows. In Section …
Following a similar derivation procedure from […]
Define the log channel energy vectors as:
Then (…)
Then, changing (…)
Then, the MFCCs can be calculated by applying the discrete cosine transform (DCT) to the log channel energy vector.
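As an illustrative sketch (the function name and the channel count of 23 are assumptions for illustration, not taken from this paper), the DCT step that maps log Mel channel energies to cepstral coefficients can be written as:

```python
import numpy as np

def mfcc_from_log_mel(log_mel, num_ceps=13):
    """DCT-II of the log Mel channel energies (standard MFCC step).

    log_mel: array of shape (num_channels,), e.g. 23 Mel filter outputs.
    Returns the first num_ceps cepstral coefficients c0..c12.
    """
    n = len(log_mel)
    k = np.arange(num_ceps)[:, None]           # cepstral index
    j = np.arange(n)[None, :]                  # Mel channel index
    basis = np.cos(np.pi * k * (j + 0.5) / n)  # DCT-II basis
    return basis @ log_mel
```

Note that c0 is simply the sum of the log channel energies under this (unnormalized) DCT convention, which is why c0 and log-energy are often treated interchangeably.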
In fact, with no additional constraints, and if …
Equation (…)
A traditional way to solve the above-mentioned problem is to apply a threshold that guarantees the clean speech estimate remains positive:
Equation (…)
However, SS causes a very serious problem. Because of the threshold, …
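For illustration, conventional magnitude-domain spectral subtraction with such a flooring threshold can be sketched as follows; the over-subtraction factor alpha and floor beta are illustrative assumptions, not values from this paper:

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag, alpha=1.0, beta=0.01):
    """Subtract the noise magnitude estimate and floor the result.

    The max() floor keeps the estimate positive, but it also makes many
    output bins collapse onto the floor value beta * noisy_mag, which is
    exactly the distortion discussed in the text.
    """
    sub = noisy_mag - alpha * noise_mag
    return np.maximum(sub, beta * noisy_mag)
```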
Figure: Spectrogram of digit string “3Z82” (panels: original; processed by SS).
After being processed by SS or other similar methods, the probability distribution of the speech is greatly changed. For example, in Figure …
Figure: Speech PDF of Mel channel log-power for digit string “3Z82” (panels: clean; noisy speech; processed by SS).
Most state-of-the-art ASR systems incorporate statistical methods to perform pattern recognition, and the HMM is one of the most popular. These statistical methods are all built on a statistical model of speech; in other words, a probability distribution is always assumed as the basis of the recognizer derivation. SS-like algorithms greatly change the PDF of the speech, which in turn degrades the performance of ASR systems.
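This effect can be reproduced on synthetic data: after flooring, a large fraction of the samples sit exactly at the floor value, producing a point mass in the histogram that no smooth statistical model fits well. A toy sketch (synthetic Gaussian data, not AURORA2; all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
noisy = rng.normal(loc=0.0, scale=1.0, size=10000)  # toy "noisy" feature values
noise_est = 0.5                                     # toy noise estimate
floor = 0.05

enhanced = np.maximum(noisy - noise_est, floor)

# Many samples now sit exactly at the floor: a spike (point mass) in the PDF.
at_floor = np.mean(enhanced == floor)
print(f"fraction of samples clamped to the floor: {at_floor:.2f}")
```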
The proposed algorithm estimates the clean speech in an iterative manner, which means the clean speech estimate …
The novelty of implementing an iterative root-finding algorithm is that, unlike the spectral-subtraction-like approaches, it manages to overcome the awkward …
As described before, the iterative root-finding algorithm can also handle …
Denoting …
For the iterative step …
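Although the paper's exact update is not reproduced in this excerpt, the idea can be sketched with a standard Newton iteration on the familiar log-Mel-domain model y = x + log(1 + exp(n - x)) (noisy log energy y, clean log energy x, noise log energy n), solving for x given y and a noise estimate. This is a sketch under those assumptions, valid for y > n, not the paper's exact algorithm:

```python
import numpy as np

def solve_clean_log_energy(y, n, iters=10):
    """Newton iteration for x in f(x) = x + log(1 + exp(n - x)) - y = 0.

    Assumes y > n (the noisy observation exceeds the noise estimate),
    so that a finite root exists.
    """
    x = y  # start from the noisy observation
    for _ in range(iters):
        f = x + np.log1p(np.exp(n - x)) - y
        df = 1.0 / (1.0 + np.exp(n - x))  # f'(x), always in (0, 1]
        x = x - f / df
    return x
```

Because y = logaddexp(x, n) under this model, the recovered root can be checked against the closed form x = log(exp(y) - exp(n)).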
In statistics, a minimum mean square error (MMSE) estimator is one that minimizes the mean square error (MSE), and it is widely used in many areas of signal processing. In 1984, Ephraim and Malah derived the short-time spectral amplitude (STSA) estimator using MMSE […]
As shown in Figure …
Different domains in MFCC scheme.
The clean speech estimate generated by the proposed algorithm in (…)
The MFCC static parameters can be divided into two parts: c1~c12 and c0/log-energy. Strictly speaking, the proposed algorithm mainly focuses on c1~c12. For log-energy, traditionally it is calculated by …
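In the standard MFCC front end, that log-energy is typically computed per frame as the logarithm of the sum of squared samples; a minimal sketch (the silence floor value is an assumption to avoid log(0)):

```python
import numpy as np

def frame_log_energy(frame):
    """Log of the frame energy, floored to avoid log(0) on silent frames."""
    energy = np.sum(frame.astype(float) ** 2)
    return np.log(max(energy, 1e-10))
```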
The basic idea of log-power subtraction is similar to the spectral subtraction (SS) algorithm developed by Boll in 1979 […]
Diagram of spectral subtraction.
Boll defined SS in the magnitude domain in […]
Most noise estimation algorithms are developed based on statistical models of clean and noisy speech, which makes the estimate at a specific point, …
Equation (…)
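Under the usual reading that the log-power is the logarithm of the frame energy, a log-power subtraction step can be sketched as follows; the over-subtraction factor and floor are illustrative assumptions, not values from this paper:

```python
import numpy as np

def log_power_subtraction(noisy_logpower, noise_logpower, alpha=1.0, beta=0.01):
    """Subtract the estimated noise energy from the noisy frame energy,
    working through the exp/log pair, with a spectral-floor safeguard."""
    noisy_e = np.exp(noisy_logpower)
    noise_e = np.exp(noise_logpower)
    clean_e = np.maximum(noisy_e - alpha * noise_e, beta * noisy_e)
    return np.log(clean_e)
```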
The proposed algorithm consists of two parts: iterative solution of the nonlinear function and log-power subtraction.
Figure …
Diagram of the proposed algorithm.
The AURORA2 database is adopted to evaluate the performance of the proposed method. The AURORA2 data is based on a version of the original TIDigits corpus (as available from the LDC) downsampled to 8 kHz […]
The proposed front-end feature extractor is modified from the MFCC implementation provided by the Voicebox toolkit […]
The same recognizer is used for both the proposed front-end feature extraction algorithm and the baseline system for comparison. Each digit is modeled by a simple left-to-right HMM with 18 states (including two non-emitting states) and 3 Gaussian mixtures per state. Two pause models are defined: "sil", which has 3 HMM states and models the pauses before and after each utterance, and "sp", a single-state model (tied with the middle state of "sil") that models the pauses between words.
The proposed algorithm does not need stereo data input. Therefore, algorithms such as SPLICE […]
Diagram of MFCC.
MMSE-STSA is the standard MMSE approach for mathematically recovering the clean speech. In this paper, the STSA algorithm is implemented with minimum statistics as the noise estimation part. The log-power subtraction approach is similar to SS, so SS is chosen to show that, for speech recognition, log-power subtraction is much better than SS. MVA is a cepstral-domain approach, chosen to show the superiority of the proposed algorithm in that area. Figure …
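For reference, cepstral mean and variance normalization (CMVN), which also forms the first two stages of MVA, can be sketched per utterance as follows (MVA additionally applies an ARMA low-pass filter along the time axis, omitted here):

```python
import numpy as np

def cmvn(features):
    """Normalize each cepstral dimension to zero mean and unit variance
    over the utterance. features: array of shape (num_frames, num_coeffs)."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / np.maximum(std, 1e-8)
```

Because CMVN removes utterance-level shifts and scalings, it compensates for convolutive channel effects that appear as additive offsets in the cepstral domain.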
Diagram of MVA.
The ETSI standard advanced front-end feature extraction algorithm (AFE) is also implemented for comparison […]
Experiments are conducted to show the speech recognition results of the proposed NLPS algorithm with different numbers of iterations. Detailed recognition results are given in Table …
Detailed recognition rates (%).

Iteration No. | 1     | 2 | 3     | 4     | 5
Clean         | 99.09 | … | 98.37 | 98.36 | 97.23
Avg 0–20      | …     | … | …     | …     | …
−5 dB         | 27.80 | … | 23.24 | 23.24 | 21.31
Comparison is made against MFCC, MMSE-STSA […]
There are two training conditions in the AURORA2 database: clean-training and multi-training. In the multi-training condition, noisy speech together with clean speech is used for training the HMMs. Therefore, the recognition results from the multi-training condition are very good; for most SNR levels the recognition rates are over 90%. This makes all of the above-mentioned methods yield similar recognition results, about 92% on average, and it is hardly meaningful to push results from 92.1% to 92.5%. Besides, in real life, preparing a noisy database for training HMMs is not realistic: there are infinitely many types of noise and SNRs, which makes it difficult to generate an effective training database, and if the noise encountered is very different from that in the database, poor results will be obtained. Therefore, only the clean-training condition results are used for comparison. The experimental results are shown in Tables …
Detailed recognition rates (%).

SNR      | Set A: Subway | Babble | Car   | Exhibition | Set B: Station | Restaurant | Street | Airport | Set C: Restaurant | Street
Clean    | 98.83         | 99.09  | 99.14 | 99.29      | 98.83          | 99.09      | 99.14  | 99.29   | 98.93             | 99.15
Avg 0–20 | …             | …      | …     | …          | …              | …          | …      | …       | …                 | …
−5 dB    | 29.23         | 26.18  | 31.40 | 29.37      | 29.63          | 29.50      | 31.26  | 30.64   | 21.80             | 26.72
Recognition results for different parts of the proposed algorithm (%).

                           | Clean | Avg 0–20 | −5 dB
CMVN                       | 99.32 | …        | 13.90
LPS                        | 99.47 | …        | 12.92
LPS + CMVN                 | 99.07 | …        | 20.78
Newton                     | 99.20 | …        | 25.65
Newton + CMVN              | 99.25 | …        | 27.78
NLPS (Newton + LPS + CMVN) | …     | …        | …
Recognition results for comparison targets (%).

     | Clean | Avg 0–20 | −5 dB
MFCC | 99.42 | …        | 13.39
SS   | 99.32 | …        | 25.26
CMVN | 99.32 | …        | 13.90
STSA | 99.26 | …        | 20.31
AFE  | 99.20 | …        | 24.77
MVA  | 99.20 | …        | 26.24
Relative improvement of the proposed NLPS algorithm over the comparison targets.

     | Relative Imp. (Avg 0–20) | Relative Imp. (−5 dB)
MFCC | 19.4%                    | 108.0%
SS   | 18.4%                    | 10.3%
CMVN | 10.3%                    | 100.4%
STSA | 7.0%                     | 37.1%
AFE  | 4.3%                     | 12.4%
MVA  | 1.9%                     | 6.1%
In Table …
In the following discussion, MFCC denotes the traditional 13 Mel-frequency cepstral coefficients together with the corresponding velocity and acceleration parameters. Results are averaged over the noisy test sets with SNRs from 0 to 20 dB, denoted Avg 0–20. Another point worth mentioning is that the clean-set results of all the above-mentioned algorithms are over 99%, and it is hardly meaningful to chase further improvements at this level. Therefore, the discussion focuses mainly on Avg 0–20 and SNR −5 dB. Figure …
Experimental results.
The experimental results in Table …
Comparisons in Tables …
In speech processing, there is an awkward situation in which a speech enhancement algorithm sometimes cannot improve speech recognition results even though it improves speech quality in terms of human listening tests. SS is one such method. Direct implementation of the SS in […]
In this paper, a novel algorithm for robust speech recognition is presented with its detailed derivation, implementation, and evaluation. It is based on the direct solution of a nonlinear system model together with a novel log-power subtraction method. The novelty of the proposed algorithm lies in four parts. Firstly, the proposed method does not need any additional training process, which keeps the computational burden very small. Secondly, the proposed method is a blind approach, yielding good performance across all SNRs and noise types. Another advantage of the proposed algorithm is its ability to adapt to changing environments; this adaptation can be made by simply changing the noise estimation part. Finally, the NLPS algorithm can easily be combined with other algorithms, such as the MVA discussed above. The proposed algorithm is implemented and evaluated on the AURORA2 database, and comparison is made against STSA, SS, and MVA. Experimental results demonstrate significant improvement in recognition accuracy.