Noise Estimation and Suppression Using Nonlinear Function with A Priori Speech Absence Probability in Speech Enhancement

This paper proposes a bias compensation method for minimum statistics (MS) noise estimation that uses a nonlinear function and the a priori speech absence probability (SAP) for speech enhancement in highly nonstationary noisy environments. The MS method is a well-known technique for noise power estimation in nonstationary noisy environments; however, its noise estimate tends to be biased below the true noise level. The proposed method combines an adaptive parameter based on a sigmoid function with the a priori SAP to reduce residual noise. Additionally, our method uses an automatically controlled parameter to manage the trade-off between speech distortion and residual noise. We evaluate the estimation of noise power in highly nonstationary and varying noise environments. The improvement is confirmed in terms of the signal-to-noise ratio (SNR) and the Itakura-Saito Distortion Measure (ISDM).


Introduction
Noise estimation algorithms are essential components of many modern mobile communication, speech recognition, and human-computer interaction systems that rely on speech enhancement [1,2]. They are generally included as part of the speech enhancement stage to improve the intelligibility or quality of a speech signal corrupted by noise. However, it is difficult to reduce noise without distorting speech, because the performance of any noise estimation algorithm depends on a trade-off between speech distortion and noise reduction.
Current single-microphone speech enhancement methods fall into two groups: time domain methods, such as the subspace method, and frequency domain methods, such as spectral subtraction (SS) [3] and the minimum mean square error (MMSE) estimator [4]. Both groups have their own advantages and drawbacks. Subspace methods provide a mechanism to control the trade-off between speech distortion and residual noise, but at the cost of a heavy computational load [5]. Frequency domain methods, on the other hand, usually consume fewer computational resources but do not have a theoretically established mechanism to control the trade-off between speech distortion and residual noise. Among them, SS is computationally efficient and has a simple mechanism to control this trade-off, but it suffers from a notorious artifact known as musical noise [6]. These spectral noise reduction algorithms require an estimate of the noise spectrum, which can be obtained from speech absence frames indicated by a voice activity detector (VAD) or, alternatively, with the minimum statistics (MS) method [7], that is, by tracking spectral minima in each frequency band.
Several recent studies have proposed noise estimation schemes for unknown noise signals [1-14]. The minimum statistics (MS) noise estimation scheme proposed by Martin [7] is one that works well in nonstationary noisy environments, and its ability to track varying noise levels is a prominent feature. The noise estimate is obtained as the minimum of a smoothed power estimate of the noisy signal, multiplied by a factor that compensates for the bias. However, the MS algorithm still has a tendency to bias the noise estimate below the true noise level, regardless of the number of frames [8]. Therefore, it leaves residual noise in speech absence frames and in frames where the noise characteristics change in highly nonstationary noisy environments.
To solve this problem, we propose a combined adaptive factor based on a sigmoid function and a priori speech absence probability (SAP) estimation [9] for bias compensation. Specifically, the adaptive factor is driven by the a posteriori SNR. When the a posteriori SNR decreases, the adaptive factor increases but is constrained to lie between its minimum and maximum values. Thus, the proposed adaptive bias compensation factor approaches its maximum when the SNR is low. In addition, when the a priori SAP equals unity, the adaptive bias compensation factor also approaches its maximum in each frequency bin, and vice versa. Furthermore, our method uses another adaptive parameter to control the trade-off between speech distortion and residual noise when suppressing the estimated noise in highly nonstationary and varying noisy environments. This automatically controlled parameter follows the a posteriori signal-to-noise ratio (SNR) as the noise level varies.
We evaluate the performance of the proposed algorithm in nonstationary and varying noise environments. The improvement is confirmed in terms of the segmental SNR and the Itakura-Saito Distortion Measure (ISDM) [15]. The results show that our proposed method is superior to the conventional MS approach. The structure of the paper is as follows. Section 2 reviews the minimum statistics and the a priori SAP estimation algorithms. Section 3 addresses noise estimation and suppression using a linear and a nonlinear function. In Section 4, we express the combined sigmoid function using the a posteriori SNR and a priori SAP estimation for robust bias compensation. In Section 5, we discuss the experimental results.

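Minimum Statistics (MS) and Speech Absence Probability (SAP)
The MS algorithm relies on the fact that the noisy power spectrum often becomes equal to the noise power spectrum during speech pauses [7,13,17]. Therefore, an estimate of the noise power spectrum is obtained by separately tracking the minimum of the noisy speech power in each frequency bin. In addition, because the minimum is biased towards lower values, an unbiased estimate may be obtained through multiplication by a bias factor, which is derived from the statistics of the local minimum. To search for the minimum, we first smooth the noisy power spectrum with a first-order recursion:
$$P(\lambda,k) = \alpha\,P(\lambda-1,k) + (1-\alpha)\,|Y(\lambda,k)|^2, \tag{2}$$
where $P(\lambda,k)$ is the smoothed periodogram in frame $\lambda$ and frequency bin $k$, $|Y(\lambda,k)|^2$ is the noisy periodogram, and $\alpha$ is the smoothing factor. The smoothing factor used in (2) must be close to 1 to keep the variance of the minimum tracking as small as possible. However, the appropriate amount of smoothing depends on whether speech is present or absent, so the smoothing factor must depend on time and frequency. The smoothing factor is therefore derived by minimizing the mean square error between $P(\lambda,k)$ and $\sigma_d^2(\lambda,k)$:
$$E\left\{\left(P(\lambda,k) - \sigma_d^2(\lambda,k)\right)^2 \,\middle|\, P(\lambda-1,k)\right\}, \tag{3}$$
where $\sigma_d^2(\lambda,k)$ is the noise variance and
$$P(\lambda,k) = \alpha(\lambda,k)\,P(\lambda-1,k) + \left(1-\alpha(\lambda,k)\right)|Y(\lambda,k)|^2. \tag{4}$$
In (4), the time-frequency dependent smoothing factor $\alpha(\lambda,k)$ is used instead of the fixed $\alpha$ defined in (2). Substituting (4) into (3) and setting the first derivative to 0, we find the optimum value for $\alpha(\lambda,k)$:
$$\alpha_{\mathrm{opt}}(\lambda,k) = \frac{1}{1 + \left(P(\lambda-1,k)/\sigma_d^2(\lambda,k) - 1\right)^2}. \tag{5}$$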
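The step from the mean square error criterion to the optimum smoothing factor can be written out explicitly. The following is a sketch of the standard derivation, assuming that speech is absent and that the noisy periodogram is exponentially distributed with mean equal to the noise variance; the symbols $P$, $Y$, and $\sigma_d^2$ follow the usual MS notation and are assumptions about the notation rather than a quotation of it.

```latex
% Assumption: speech is absent, so |Y(\lambda,k)|^2 is exponentially
% distributed with mean \sigma_d^2 and variance \sigma_d^4.
\begin{align*}
P(\lambda,k) - \sigma_d^2
  &= \alpha\bigl(P(\lambda-1,k) - \sigma_d^2\bigr)
   + (1-\alpha)\bigl(|Y(\lambda,k)|^2 - \sigma_d^2\bigr), \\
E\bigl\{(P(\lambda,k) - \sigma_d^2)^2 \mid P(\lambda-1,k)\bigr\}
  &= \alpha^2\bigl(P(\lambda-1,k) - \sigma_d^2\bigr)^2
   + (1-\alpha)^2\,\sigma_d^4, \\
0 &= 2\alpha\bigl(P(\lambda-1,k) - \sigma_d^2\bigr)^2
   - 2(1-\alpha)\,\sigma_d^4, \\
\alpha_{\mathrm{opt}}
  &= \frac{\sigma_d^4}{\bigl(P(\lambda-1,k) - \sigma_d^2\bigr)^2 + \sigma_d^4}
   = \frac{1}{1 + \bigl(P(\lambda-1,k)/\sigma_d^2 - 1\bigr)^2}.
\end{align*}
```

The cross term vanishes because the periodogram has mean $\sigma_d^2$. As a quick check, a ratio $P(\lambda-1,k)/\sigma_d^2 = 1$ gives $\alpha_{\mathrm{opt}} = 1$, whereas a ratio of 3 gives $\alpha_{\mathrm{opt}} = 1/5$.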
According to (5), the smoothing factor can vary between 0 and 1, but such a smoothing factor is not practical [15]. The value of $\alpha_{\mathrm{opt}}$ becomes progressively smaller for a large a posteriori SNR $\gamma \approx P(\lambda-1,k)/\sigma_d^2(\lambda,k)$ (speech present). However, smoothing is required even during periods of speech because the speech power spectrum also contains a percentage of noise. Hence, the smoothing factor has a floor of 0.3, so that at most 70% of the current periodogram enters the smoothed spectrum in any one frame. Conversely, when the a posteriori SNR $\gamma$ is low (speech is absent), $\alpha_{\mathrm{opt}}$ tends towards 1, which causes the smoothed output to lock onto the previous value. To prevent this, (5) is multiplied by $\alpha_{\max} = 0.96$. From (5), we note that $\alpha_{\mathrm{opt}}(\lambda,k)$ depends on the true noise variance $\sigma_d^2(\lambda,k)$, which is unknown. In practice, we can replace $\sigma_d^2(\lambda,k)$ with the latest estimated value $\hat\sigma_d^2(\lambda-1,k)$. In general, however, this lags the true noise variance, and hence the estimated smoothing factor may be too small or too large. Problems may arise when $\alpha_{\mathrm{opt}}(\lambda,k)$ is close to 1 because $P(\lambda,k)$ will not respond fast enough to changes in the noise. Thus, tracking errors were monitored in [7] by comparing the average short-term smoothed periodogram to the average periodogram, which gives the correction factor [7]
$$\alpha_c(\lambda) = \frac{1}{1 + \left(\sum_k P(\lambda-1,k) \Big/ \sum_k |Y(\lambda,k)|^2 - 1\right)^2}. \tag{6}$$
After including the correction factor, which is itself smoothed over time [7], the final smoothing factor is
$$\alpha_{\mathrm{opt}}(\lambda,k) = \frac{\alpha_{\max}\,\alpha_c(\lambda)}{1 + \left(P(\lambda-1,k)/\hat\sigma_d^2(\lambda-1,k) - 1\right)^2}. \tag{7}$$
The estimated noise power based on the MS algorithm [7] is obtained by searching for a minimum within a finite window of length $D$ of the smoothed power estimates $P(\lambda,k)$:
$$P_{\min}(\lambda,k) = \min\left\{P(\lambda,k), P(\lambda-1,k), \ldots, P(\lambda-D,k)\right\}. \tag{8}$$
Because the minimum power estimate obtained through the time-varying smoothing factor is smaller than the mean value, the MS algorithm requires a bias compensation to obtain the unbiased noise power estimate [7]:
$$\hat\sigma_d^2(\lambda,k) = B_{\min}(\lambda,k)\,P_{\min}(\lambda,k), \tag{9}$$
where $\hat\sigma_d^2(\lambda,k)$ is the unbiased noise power estimate and $B_{\min}(\lambda,k)$ is the bias compensation factor.
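The tracking loop described above (recursive smoothing, a bounded smoothing factor, a sliding-window minimum, and bias compensation) can be sketched for a single frequency bin as follows. This is a simplified illustration rather than the algorithm of [7]: the function names are hypothetical, the correction factor is omitted, and the bias compensation factor is replaced by a constant `bias`, all of which are assumptions made for brevity.

```python
from collections import deque


def optimal_alpha(p_prev, noise_var, alpha_max=0.96, alpha_floor=0.3):
    """Smoothing factor: 1 / (1 + (P/sigma^2 - 1)^2), scaled and floored."""
    ratio = p_prev / max(noise_var, 1e-12)  # guard against division by zero
    alpha = alpha_max / (1.0 + (ratio - 1.0) ** 2)
    return max(alpha, alpha_floor)


def track_noise_ms(periodogram, window=8, bias=1.5):
    """Minimum-statistics style noise estimate for one frequency bin.

    periodogram: sequence of |Y|^2 values, one per frame.
    window:      number of past frames D searched for the minimum.
    bias:        constant stand-in (assumption) for the bias factor B_min.
    """
    p = periodogram[0]                       # smoothed periodogram P
    noise = periodogram[0]                   # initial noise estimate
    recent = deque([p], maxlen=window + 1)   # holds P(l), ..., P(l - D)
    estimates = [noise]
    for y2 in periodogram[1:]:
        alpha = optimal_alpha(p, noise)      # uses the previous noise estimate
        p = alpha * p + (1.0 - alpha) * y2   # first-order recursive smoothing
        recent.append(p)
        noise = bias * min(recent)           # minimum search + compensation
        estimates.append(noise)
    return estimates
```

With `bias=1.0`, a short high-power burst (for example, a speech segment) surrounded by low-power frames leaves the estimate at the low level, because the window minimum still contains pre-burst values. This is the behavior that makes the method robust to speech activity, and also the reason the raw minimum underestimates the mean noise power.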

Review of Speech Absence Probability.
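The two-state model of speech events can be represented as a binary hypothesis model [9,15,17]:
$$H_0(\lambda,k):\; Y(\lambda,k) = D(\lambda,k), \qquad H_1(\lambda,k):\; Y(\lambda,k) = X(\lambda,k) + D(\lambda,k), \tag{10}$$
where $X$ and $D$ denote the clean speech and noise spectra, $H_0(\lambda,k)$ and $H_1(\lambda,k)$ represent the absence and presence of speech in the $k$th frequency bin of the $\lambda$th frame, respectively, and $q(\lambda,k) = P(H_0(\lambda,k))$ is the a priori probability that speech is absent. An efficient estimator is derived for the a priori SAP using a soft-decision approach based on the estimated a priori SNR $\hat\xi(\lambda,k)$ [9]. A recursive average of this estimate can be defined as
$$\zeta(\lambda,k) = \beta\,\zeta(\lambda-1,k) + (1-\beta)\,\hat\xi(\lambda-1,k), \tag{11}$$
where $\beta$ is a time constant. The decision-directed method proposed by Ephraim and Malah [4] provides a useful estimation scheme for the a priori SNR:
$$\hat\xi(\lambda,k) = \alpha_{\mathrm{dd}}\,\frac{|\hat X(\lambda-1,k)|^2}{\hat\sigma_d^2(\lambda-1,k)} + (1-\alpha_{\mathrm{dd}})\max\left\{\gamma(\lambda,k) - 1,\, 0\right\}, \tag{12}$$
where $\alpha_{\mathrm{dd}}$ ($0 < \alpha_{\mathrm{dd}} < 1$) is a smoothing factor, $\hat X(\lambda-1,k)$ is the enhanced spectrum of the previous frame, the max function prevents negative values, and $\gamma(\lambda,k) \approx |Y(\lambda,k)|^2/\hat\sigma_d^2(\lambda,k)$ represents the a posteriori SNR [9]. Local and global averaging windows are then applied in the frequency domain to the recursive average $\zeta$ [9], resulting in
$$\zeta_\eta(\lambda,k) = \sum_{i=-w_\eta}^{w_\eta} h_\eta(i)\,\zeta(\lambda,k-i), \tag{13}$$
where the subscript $\eta$ denotes either the "local" or the "global" window and $h_\eta$ is a normalized window of size $2w_\eta + 1$. We define two parameters, $P_{\mathrm{local}}$ and $P_{\mathrm{global}}$, which represent the relationship between the above averages and the likelihood of speech in the $k$th frequency bin of the $\lambda$th frame. These parameters are given as [9]
$$P_\eta(\lambda,k) = \begin{cases} 0, & \zeta_\eta(\lambda,k) \le \zeta_{\min}, \\ 1, & \zeta_\eta(\lambda,k) \ge \zeta_{\max}, \\ \dfrac{\log\left(\zeta_\eta(\lambda,k)/\zeta_{\min}\right)}{\log\left(\zeta_{\max}/\zeta_{\min}\right)}, & \text{otherwise}, \end{cases} \tag{14}$$
where $\zeta_{\min}$ and $\zeta_{\max}$ are empirical constants, chosen to maximize noise attenuation while leaving weak speech components unaffected. The third parameter $P_{\mathrm{frame}}(\lambda)$, which is required to attenuate more noise in speech-absent frames, is based on the speech energy in neighboring frames [9]:
$$\begin{aligned} P_{\mathrm{frame}}(\lambda) &= \begin{cases} 0, & \zeta_{\mathrm{frame}}(\lambda) \le \zeta_{\min}, \\ 1, & \zeta_{\mathrm{frame}}(\lambda) > \zeta_{\mathrm{frame}}(\lambda-1), \\ \mu(\lambda), & \text{otherwise}, \end{cases} \\ \mu(\lambda) &= \begin{cases} 0, & \zeta_{\mathrm{frame}}(\lambda) \le \zeta_{\mathrm{peak}}(\lambda)\,\zeta_{\min}, \\ 1, & \zeta_{\mathrm{frame}}(\lambda) \ge \zeta_{\mathrm{peak}}(\lambda)\,\zeta_{\max}, \\ \dfrac{\log\left(\zeta_{\mathrm{frame}}(\lambda)/\zeta_{\mathrm{peak}}(\lambda)/\zeta_{\min}\right)}{\log\left(\zeta_{\max}/\zeta_{\min}\right)}, & \text{otherwise}, \end{cases} \end{aligned} \tag{15}$$
where $\zeta_{\mathrm{frame}}(\lambda) = \frac{1}{N/2+1}\sum_{k=1}^{N/2+1} \zeta(\lambda,k)$ is an average in the frequency domain, $\mu(\lambda)$ represents a soft transition from speech to noise, $\zeta_{\mathrm{peak}}$ is a confined peak value of $\zeta_{\mathrm{frame}}$, limited to the range between $\zeta_{p\min}$ and $\zeta_{p\max}$, and $\zeta_{p\min}$ and $\zeta_{p\max}$ are empirical constants that determine the delay of the transition, as defined in [9]. Finally, the a priori SAP can be defined as [9]
$$\hat q(\lambda,k) = 1 - P_{\mathrm{local}}(\lambda,k)\cdot P_{\mathrm{global}}(\lambda,k)\cdot P_{\mathrm{frame}}(\lambda). \tag{16}$$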
Accordingly, $\hat q(\lambda,k)$ is larger if either previous frames or recent neighboring frequency bins do not contain speech. Therefore, when the SAP goes to 1, the speech presence probability goes to 0.
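The soft-decision mapping from averaged a priori SNR values to a speech absence probability can be illustrated with the short sketch below. It is an assumption-laden simplification: the function names are hypothetical, a Hann-shaped window stands in for the normalized window, the frame-level likelihood is passed in as `p_frame` instead of being tracked over time, and the default window half-widths (1 and 15 bins) and thresholds (-10 dB and -5 dB) are illustrative values rather than settings confirmed by the text.

```python
import math


def soft_decision(zeta_avg, zeta_min, zeta_max):
    """Map an averaged a priori SNR to a speech likelihood in [0, 1]."""
    if zeta_avg <= zeta_min:
        return 0.0
    if zeta_avg >= zeta_max:
        return 1.0
    return math.log(zeta_avg / zeta_min) / math.log(zeta_max / zeta_min)


def hann_weights(half_width):
    """Normalized window of size 2 * half_width + 1 (weights sum to 1)."""
    n = 2 * half_width + 1
    raw = [0.5 - 0.5 * math.cos(2.0 * math.pi * (i + 1) / (n + 1))
           for i in range(n)]
    total = sum(raw)
    return [r / total for r in raw]


def smooth_over_bins(zeta, half_width):
    """Window-average zeta across neighboring bins (edges are clamped)."""
    weights = hann_weights(half_width)
    last = len(zeta) - 1
    smoothed = []
    for k in range(len(zeta)):
        acc = 0.0
        for i, w in enumerate(weights):
            idx = min(max(k + i - half_width, 0), last)
            acc += w * zeta[idx]
        smoothed.append(acc)
    return smoothed


def a_priori_sap(zeta, p_frame=1.0, w_local=1, w_global=15,
                 zeta_min=0.1, zeta_max=10 ** -0.5):
    """A priori SAP per bin: 1 - P_local * P_global * P_frame.

    zeta:    recursively averaged a priori SNR, one value per bin.
    p_frame: frame-level speech likelihood (assumed given here).
    """
    local = smooth_over_bins(zeta, w_local)
    wide = smooth_over_bins(zeta, w_global)
    sap = []
    for z_loc, z_glob in zip(local, wide):
        p_local = soft_decision(z_loc, zeta_min, zeta_max)
        p_global = soft_decision(z_glob, zeta_min, zeta_max)
        sap.append(1.0 - p_local * p_global * p_frame)
    return sap
```

Because the three likelihoods are multiplied, a single low value is enough to push the SAP towards 1, which matches the observation that the SAP is large when either the neighboring bins or the neighboring frames contain no speech.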

Combining Adaptive Factor Based on Sigmoid Function and A Priori SAP.
In this section, we propose a method that combines the adaptive factor based on the sigmoid function with the a priori SAP estimate [9] to achieve bias compensation. First, we determine the adaptive factor by requiring the smoothed power spectrum $P(\lambda,k)$ to equal the updated noise power estimator during the speech absence region. In particular, the adaptive factor minimizes the mean squared error (MSE) between $P(\lambda,k)$ and the updated noise power estimator, as expressed in (17), where the updated noise power estimator during the speech absence region is assumed to take the form in (18). Substituting (18) into (17), taking the first derivative of the MSE with respect to the adaptive factor, and setting it equal to zero yields the adaptive factor in (19), which depends on the unbiased noise power estimate $\hat\sigma_d^2$ in (9). We then apply the adaptive factor, shaped by the sigmoid function, to the bias compensation factor of the MS algorithm according to the a posteriori SNR, as in (20), where the adaptive factor is derived from a slope factor of 0.5 and an empirical constant of 0.
Here $\|\cdot\|$ denotes the Euclidean length of a vector. The adaptive factor is controlled by the a posteriori SNR. When the a posteriori SNR decreases, the adaptive factor increases but is constrained to lie between its minimum and maximum values. Thus, the proposed adaptive bias compensation factor approaches its maximum when the SNR is low. In addition, when the a priori SAP equals unity, the adaptive bias compensation factor equals its maximum in each frequency bin, and vice versa. Figure 1 illustrates the adaptive factor acting as a bias compensation. As suggested by (20) and (21), the adaptive factor decreases as the a posteriori SNR increases, while remaining between its maximum (which is much smaller than 0.1) and its minimum. Thus, the adaptive factor approaches its minimum when the SNR is close to 20 dB. Simulation results show that a larger adaptive factor suits noisy signals with a low SNR of less than 5 dB and that a smaller adaptive factor suits noisy signals with a relatively high SNR greater than 10 dB. We can thus control the trade-off between speech distortion and residual noise from frame to frame using the adaptive factor. The updated noise power estimate, which combines the a priori SAP and the adaptive factor, is given in (22), where $\hat q(\lambda,k)$ is the a priori SAP in (16). When $\hat q(\lambda,k)$ becomes 1, the adaptive bias compensation factor equals its maximum. Therefore, the speech absence region is efficiently compensated by combining the a priori SAP and the adaptive factor in the $k$th frequency bin of the $\lambda$th frame. As a result, the optimal smoothing factor is re-derived from (7) using the updated noise power estimator, which gives (23).
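The qualitative behavior described above (a factor that falls with the a posteriori SNR, stays within fixed bounds, and is pushed to its maximum when the a priori SAP is 1) can be illustrated with a small sketch. Everything here is a hypothetical illustration, not the formulation of (20) to (22): the function names, the bounds `f_min` and `f_max`, the sigmoid center `center_db`, and the linear blend with the SAP are all assumptions. Only the slope value 0.5 is taken from the text.

```python
import math


def adaptive_factor(snr_db, f_min=1.0, f_max=2.0, slope=0.5, center_db=5.0):
    """Bounded sigmoid that decreases as the a posteriori SNR (dB) grows.

    f_min, f_max and center_db are illustrative assumptions.
    """
    s = 1.0 / (1.0 + math.exp(slope * (snr_db - center_db)))
    return f_min + (f_max - f_min) * s


def combine_with_sap(factor, q, f_max=2.0):
    """Blend towards f_max as the a priori SAP q approaches 1 (assumed form)."""
    return q * f_max + (1.0 - q) * factor


def updated_noise_power(noise_ms, snr_db, q, f_min=1.0, f_max=2.0):
    """Scale an MS noise estimate by the combined compensation factor."""
    factor = adaptive_factor(snr_db, f_min, f_max)
    return combine_with_sap(factor, q, f_max) * noise_ms
```

In this sketch a bin with `q = 1` always receives the maximum compensation regardless of the SNR, while a bin with `q = 0` is compensated by the sigmoid alone, which mirrors the limiting cases stated in the text.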

Estimated Noise Suppression Using Linear Function.
In this subsection, our method uses another adaptive parameter to control the trade-off between speech distortion and residual noise when suppressing the estimated noise in highly nonstationary and varying noisy environments. This automatically controlled parameter follows the a posteriori signal-to-noise ratio (SNR) as the noise level varies. The estimated clean speech power spectrum is given in (28), in which the oversubtraction factor is a linear function of the a posteriori SNR defined by a slope and an offset. The constants are a minimum factor of 1, a maximum factor of 3, SNR_max = 20 dB, and SNR_min = −5 dB [3]. The adaptive linear factor affects the amount of speech distortion caused by the spectral subtraction in (28), and it offers a large amount of flexibility to the modified spectral subtraction (MSS) scheme. The SNR in (24) is the a posteriori SNR in the frequency bin. The estimated clean speech signal can then be transformed back to the time domain by taking the inverse STFT and synthesizing with the overlap-add method.
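A linear oversubtraction rule with the constants quoted above can be sketched as follows. The function names are hypothetical, and two details are assumptions rather than statements of (24) to (28): the factor is taken to run linearly from 3 at −5 dB down to 1 at 20 dB and to be held constant outside that range, and a spectral floor proportional to the noise power (in the style of [3]) is used to avoid negative power values.

```python
def oversubtraction_factor(snr_db, a_min=1.0, a_max=3.0,
                           snr_min_db=-5.0, snr_max_db=20.0):
    """Linear map from a posteriori SNR (dB) to the oversubtraction factor."""
    slope = (a_min - a_max) / (snr_max_db - snr_min_db)   # -0.08 per dB
    offset = a_max - slope * snr_min_db                   # 2.6
    clipped = min(max(snr_db, snr_min_db), snr_max_db)    # hold outside range
    return slope * clipped + offset


def subtract_noise(noisy_power, noise_power, snr_db, floor=0.01):
    """Power spectral subtraction for one frame.

    noisy_power, noise_power: per-bin power values of equal length.
    floor: fraction of the noise power kept as a spectral floor (assumption).
    """
    a = oversubtraction_factor(snr_db)
    return [max(y - a * d, floor * d)
            for y, d in zip(noisy_power, noise_power)]
```

For example, at 7.5 dB the factor is 2.0, so twice the estimated noise power is subtracted. At low SNR the stronger subtraction removes more residual noise at the cost of more speech distortion, which is the trade-off the adaptive parameter is meant to manage.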

Experimental Results and Discussion
The noisy signals used in our evaluation were taken from the NOIZEUS database [15]. We used 30 test utterances, spoken by three male and three female speakers. The analyzed signal was sampled at 8 kHz and short-time Fourier-transformed using 50% overlapping Hamming windows of 256 samples. Figure 2 compares the noise estimates of the MS method [7] and the proposed method, both of which track the minimum of the noisy speech power to update the noise estimate. The MS estimate is obtained by tracking the minimum of the noisy power spectrum over a specified number of frames. Thus, the MS noise estimate tends to be biased below the true noise level, regardless of the number of frames. Our proposed method efficiently compensates the speech absence region by combining the adaptive bias compensation factor and the a priori SAP. This implies that the proposed method is more accurate than the conventional one and could improve residual noise reduction.
Figure 3 shows the clear superiority of the proposed method in highly nonstationary noisy environments. The conventional method [7] does not work well from the initial frame to frame 20 and from frame 110 to frame 130, both under car noise (15 dB), and it also suffers from residual noise. A different outcome is observed in the region marked by the red circle in Figure 3. In particular, the robustness of the proposed method to variations in the noisy environment is well demonstrated. Thus, compared with the conventional method, we can estimate the noise level more accurately and reduce residual noise in highly nonstationary noisy environments.
The spectrogram of the clean signal is given in Figure 4(a), and the spectrogram of the speech signal enhanced by the MS plus spectral subtraction (MS + SS) method [3,7] is given in Figure 4(b). The result of minimum controlled recursive averaging (MCRA) with SS is shown in Figure 4(c). There is residual noise in Figure 4(c) for 0 s < t < 0.15 s and for t > 1.8 s, partly because the noise estimate is biased below the true noise level. The spectrogram of the proposed method for noise reduction is shown in Figure 4(d).
In contrast, Figure 4(d) shows that the residual noise is reduced more clearly than with the conventional methods. Tables 1 and 2 summarize the averaged results of the segmental SNR and the Itakura-Saito Distortion Measure (ISDM) [15]. The segmental SNR can be evaluated in either the time or the frequency domain. The time domain measure is perhaps one of the simplest objective measures used to evaluate speech enhancement methods. For this measure to be meaningful, it is important that the original and processed signals be aligned in time and that any phase error present be corrected [15]. For various noise types with an input SNR ranging from 0 to 15 dB, the segmental SNR after processing was clearly better for the proposed method than for the conventional ones [7], except for the cases highlighted in bold. We can also confirm that our method works well in controlling the trade-off between speech distortion and residual noise when suppressing the estimated noise in highly nonstationary and varying noisy environments.
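The time domain segmental SNR mentioned above can be sketched as a frame-wise average of SNR values in dB. The function name is hypothetical, and the choices below are assumptions: non-overlapping frames of 256 samples, and clamping of each frame value to the range from −10 dB to 35 dB, a commonly used convention that keeps silent or perfectly reconstructed frames from dominating the average. The two signals are assumed to be time-aligned already.

```python
import math


def segmental_snr(clean, enhanced, frame_len=256, lo_db=-10.0, hi_db=35.0):
    """Average frame-wise SNR (dB) between time-aligned signals."""
    n_frames = min(len(clean), len(enhanced)) // frame_len
    if n_frames == 0:
        raise ValueError("signals are shorter than one frame")
    eps = 1e-12  # avoids log of zero and division by zero
    values = []
    for m in range(n_frames):
        start, stop = m * frame_len, (m + 1) * frame_len
        ref = clean[start:stop]
        out = enhanced[start:stop]
        signal_energy = sum(x * x for x in ref)
        error_energy = sum((x - y) ** 2 for x, y in zip(ref, out))
        snr = 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))
        values.append(min(max(snr, lo_db), hi_db))  # clamp per frame
    return sum(values) / n_frames
```

A processed signal whose error is 10% of the clean amplitude in every sample yields about 20 dB, which gives a feel for the scale of the values reported in Table 1.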
The ISDM has been shown to correlate well with subjective quality measures, specifically the diagnostic acceptability measure (DAM). It therefore provides an objective test that produces meaningful results and that reflects both distortion and noise reduction [15]. Here, we can confirm that the proposed method produces better ISDM results than the conventional methods, except for the case of the MS method with SS for the street noise signal at 10 dB.
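For orientation, the Itakura-Saito distance between two power spectra can be computed as in the sketch below. This is the generic spectral form of the distance, given as an assumption about what is being measured; the evaluation in [15] may instead use a formulation based on linear prediction coefficients, and the function name is hypothetical.

```python
import math


def itakura_saito(ref_power, test_power, eps=1e-12):
    """Mean Itakura-Saito distance between two power spectra.

    ref_power, test_power: per-bin power values of equal length.
    Returns 0 for identical spectra; the measure is not symmetric.
    """
    if not ref_power or len(ref_power) != len(test_power):
        raise ValueError("spectra must be non-empty and of equal length")
    total = 0.0
    for p, q in zip(ref_power, test_power):
        ratio = (p + eps) / (q + eps)
        total += ratio - math.log(ratio) - 1.0
    return total / len(ref_power)
```

Lower values indicate that the processed spectrum is closer to the reference. Because the distance is asymmetric, the order of the arguments must be kept fixed when comparing methods.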

Conclusion
We presented a modified noise estimation and suppression algorithm that combines a nonlinear function and a priori SAP estimation for bias compensation. Moreover, our method uses another adaptive parameter to control the trade-off between speech distortion and residual noise when suppressing the estimated noise in highly nonstationary and varying noisy environments. The performance of the new algorithm was evaluated by measuring the segmental SNR and the ISDM. We showed that the proposed algorithm was generally superior to conventional methods, reducing both residual noise and speech distortion in nonstationary noisy environments. In the future, we plan to evaluate its possible application as a preprocessing stage in other signal processing tasks.

Figure 1: (a) Plot of the adaptive factor over the frame index. (b) Adaptive factor using a sigmoid function based on the a posteriori SNR.

Figure 2: Comparison between the noisy signal, noise estimated by MS, and noise estimated by the proposed method in a restaurant 5 dB noisy environment.

Figure 3: Comparison between the noisy signal, noise estimated by minimum statistics (MS), and noise estimated by the proposed method in highly nonstationary noisy environments.

Figure 4: Frequency domain results of speech enhancement for exhibition noise at 5 dB SNR. (a) Original spectrogram, (b) spectrogram using the MS with SS method, (c) spectrogram using the MCRA with SS method, and (d) spectrogram using the proposed method.

Table 1: Objective evaluation and comparison of the segmental SNR values of the proposed method.

Table 2: Objective evaluation and comparison of the Itakura-Saito Distortion Measure (ISDM).