Single Channel Speech Enhancement Techniques in Spectral Domain

This paper presents single-channel speech enhancement techniques in the spectral domain. One of the best-known single-channel speech enhancement techniques is the spectral subtraction method proposed by S. F. Boll in 1979. In this method, an estimated speech spectrum is obtained by simply subtracting a pre-estimated noise spectrum from the observed one. Hence, the spectral subtraction method does not take speech spectral properties into account. It is well known that the spectral subtraction method produces an annoying artificial noise in the extracted speech signal. On the other hand, recent successful speech enhancement methods actively exploit speech properties and achieve efficient speech enhancement. This paper presents a historical review of several speech estimation techniques and explicitly states the differences between their theoretical backgrounds. Moreover, to evaluate their speech enhancement capabilities, we perform computer simulations. The results show that an adaptive speech enhancement method based on MAP estimation gives the best noise reduction capability among the speech enhancement methods presented in this paper.


Introduction
In recent years, speech enhancement has been required in a wide range of applications including mobile communication and speech recognition systems, where the major example is a cell phone as shown in Figure 1. Many speech enhancement methods have been established over the past decades [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15]. These speech enhancement techniques can be classified into time domain methods and spectral domain methods. Most recent speech enhancement techniques are spectral domain methods, which are preferably used in cell phones. In this paper, we focus on spectral domain speech enhancement techniques that employ a single microphone.
The spectral subtraction method [3] is one of the most popular noise reduction techniques in the spectral domain. This method achieves noise reduction by simply subtracting a pre-estimated noise spectral amplitude from the observed spectral amplitude, where the spectral phase is not processed. The spectral subtraction method is easy to implement and effectively reduces stationary noise. However, it introduces an artificial noise, called musical noise, which is caused by speech estimation errors. Because the spectral subtraction method does not use speech spectral information, it often produces estimation errors. Ephraim and Malah proposed the MMSE-STSA (Minimum Mean Square Error Short-Time Spectral Amplitude) method [4], which utilizes a speech PDF (Probability Density Function) and a noise PDF. In [4], the speech and noise PDFs were modeled by Rayleigh and Gauss density functions, respectively. This method gives an optimal solution of the estimated speech signal in the sense of MMSE-STSA (the solution reduces to the Wiener filter [5] if we assume Gauss distributions for both the speech and noise PDFs). Although the MMSE-STSA method gives an estimated speech signal with less musical noise, it requires more complicated computations; for example, the solution requires evaluating the modified Bessel function. Moreover, as pointed out by some researchers, real speech histograms do not fit the Rayleigh function employed in [4].
A more efficient method based on maximum a posteriori (MAP) estimation was established by Lotter and Vary [11]. They modeled the speech PDF by a parametric super-Gaussian function controlled by two shape parameters. The parametric super-Gaussian function was developed from a histogram made from a large amount of real speech data in a single narrow SNR (Signal to Noise Ratio) interval. The noise suppression capability of this method is superior to that of the Wiener filter. However, residual noise is still persistently perceived. Andrianakis and White noted that the speech PDF may change across SNR intervals [12]. They utilized three histograms made from speech signals in three different narrow SNR intervals and approximated them with Gamma density functions. As reported in [12], switching among these three speech PDFs according to the SNR can improve the noise reduction capability. While Andrianakis and White change the speech PDF discretely, Tsukamoto et al. change the speech PDF continuously according to the SNR [13]. They employed the parametric super-Gaussian function proposed in [11] and adaptively changed its shape parameters according to the SNR. Recently, Thanhikam et al. [16] refined this approach by making and evaluating many real speech histograms from various narrow SNR intervals. As shown in [16], this method has a very strong noise reduction capability in comparison to traditional speech enhancement methods, and hence it is effective especially in low SNR environments.
In the following sections, we present a historical review of useful speech enhancement methods mentioned above and compare their speech enhancement capabilities by computer simulations.

Speech Enhancement in Spectral Domain
This section presents several speech enhancement techniques, including both traditional and recent methods. In particular, we carefully explain the differences between them.

General Speech Enhancement System.
First, we explain a general single-channel speech enhancement system in the spectral domain.
We assume that an observed signal is the sum of a speech signal and a noise signal, given as
$$y(t) = x(t) + d(t), \tag{1}$$
where $y(t)$ is the observed signal at time $t$, and $x(t)$ and $d(t)$ denote the speech signal and the noise signal, respectively. We assume that $x(t)$ is uncorrelated with $d(t)$ throughout the paper. Taking the DFT of (1), we have
$$Y_k(n) = \sum_{t=0}^{N-1} y(nQ + t)\, h(t)\, e^{-j 2\pi k t / N}, \tag{2}$$
where $N$, $n$, and $k$ denote the frame length, the frame index, and the frequency bin index, respectively. The analysis frame is shifted by $Q$ samples, where $Q = N/2$ is used throughout the paper. The function $h(t)$ denotes an analysis window function, where the Hanning window of size $N$ is used as $h(t)$. The DFT spectrum $Y_k(n)$ can be rewritten as
$$Y_k(n) = X_k(n) + D_k(n), \tag{3}$$
where $X_k(n)$ and $D_k(n)$ are the $k$th spectra of $x(t)$ and $d(t)$, respectively. The enhanced speech spectrum $\hat{X}_k(n)$ is given as
$$\hat{X}_k(n) = G_k(n)\, Y_k(n), \tag{4}$$
where $G_k(n)$ is a spectral gain. The enhanced speech spectrum is obtained as the observed spectrum $Y_k(n)$ multiplied by the spectral gain $G_k(n)$. Hence, the speech enhancement capability depends only on the spectral gain.
A general speech enhancement system is illustrated in Figure 2, where the value of the spectral gain $G_k(n)$ depends on the employed speech enhancement algorithm. We see from (3) and (4) that the ideal spectral gain is given as
$$G_k(n) = \frac{X_k(n)}{Y_k(n)} = 1 - \frac{D_k(n)}{Y_k(n)}. \tag{5}$$
This spectral gain perfectly recovers the original speech signal as the enhanced speech. Since the ideal spectral gain above cannot be obtained directly from $Y_k(n)$, we have to approximate it by introducing additional assumptions on the speech or the noise signals.
In the following sections, we give some typical spectral gains, each derived from its own assumptions on the speech or the noise. To avoid redundant expressions, we omit the indices $n$ and $k$ when they do not play an important role.
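
The analysis and gain-application steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the periodic Hanning window formula, the overlap-add synthesis, and the callback name `gain_fn` are not taken from the paper, and a direct DFT is used instead of an FFT for brevity.

```python
import cmath
import math


def hann(N):
    # Periodic Hanning window; the exact window formula is an assumption.
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * t / N) for t in range(N)]


def dft(frame):
    # Direct O(N^2) DFT, kept simple for illustration (an FFT would be used in practice).
    N = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / N) for t in range(N))
            for k in range(N)]


def idft(spec):
    # Inverse DFT; the real part is kept, which is exact when the gain is real
    # and symmetric in k (for example, a constant gain).
    N = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / N) for k in range(N)).real / N
            for t in range(N)]


def enhance(y, N, gain_fn):
    """Apply X_hat_k(n) = G_k(n) * Y_k(n) frame by frame with shift Q = N/2.

    gain_fn(k, n, Yk) is a hypothetical callback returning the spectral gain.
    Overlap-add synthesis is an assumption, not taken from the paper.
    """
    Q = N // 2
    h = hann(N)
    out = [0.0] * len(y)
    for n in range((len(y) - N) // Q + 1):
        Y = dft([y[n * Q + t] * h[t] for t in range(N)])
        X_hat = [gain_fn(k, n, Yk) * Yk for k, Yk in enumerate(Y)]
        for t, v in enumerate(idft(X_hat)):
            out[n * Q + t] += v
    return out
```

With a unity gain, the half-overlapped periodic Hanning windows sum to one, so the interior of the signal is reproduced unchanged; any of the spectral gains discussed in the following sections could be passed as `gain_fn`.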

Spectral Subtraction.
The simplest and best-known speech enhancement technique is the spectral subtraction proposed by Boll in 1979 [3]. This method just subtracts a pre-estimated noise spectral amplitude from the observed one to obtain the estimated speech spectral amplitude. In the spectral subtraction method, the spectral phase is not modified; that is, the estimated speech spectral phase is identical to the observed one. This is based on the fact that the spectral phase is less important than the spectral amplitude in human speech perception [17]. The spectral subtraction method is achieved by using the following spectral gain:
$$G_{SS} = 1 - \frac{|\hat{D}|}{|Y|}, \tag{6}$$
where $|\hat{D}|$ is the pre-estimated noise spectral amplitude. Usually, we choose $|\hat{D}| = E[|D|]$. We note that formula (6) is an absolute-value version of (5). The spectral subtraction does not take speech spectral properties into account. As a result, the estimated speech signal includes many estimation errors. Each estimation error produces an isolated spectral component in the estimated speech signal. The resulting noise is called "musical noise" and is perceived as an annoying sound by human listeners. To obtain an estimated speech signal with less musical noise, we should introduce speech properties into the speech enhancement scheme. In the following sections, we present some speech enhancement methods that take probabilistic properties of speech into account.
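
As a minimal sketch, the spectral subtraction gain, one minus the ratio of the pre-estimated noise amplitude to the observed amplitude, can be written as follows. The clipping of negative gains at `floor` is an assumption added here for illustration; the function and parameter names are hypothetical.

```python
def spectral_subtraction_gain(y_mag, d_mag, floor=0.0):
    """Return 1 - |D_hat| / |Y|, clipped below at `floor`.

    y_mag: observed spectral amplitude |Y|
    d_mag: pre-estimated noise spectral amplitude |D_hat|
    floor: lower bound on the gain (an assumption, not part of the gain formula)
    """
    if y_mag <= 0.0:
        return floor  # avoid division by zero for an empty bin
    return max(1.0 - d_mag / y_mag, floor)
```

For example, an observed amplitude of 2.0 with a noise estimate of 1.0 gives a gain of 0.5, while a bin whose observed amplitude falls below the noise estimate is set to the floor.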

Wiener Filter.
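In this section, we explain the Wiener filter [5], which utilizes the probabilistic properties of both the speech and the noise spectra. It is well known that the Wiener filter provides an estimated speech signal with less musical noise than the spectral subtraction method.
To derive the Wiener filter, we assume that the speech spectrum $X$ is uncorrelated with the noise spectrum $D$ and that their variances are
$$E[|X|^2] = \sigma_x^2, \qquad E[|D|^2] = \sigma_d^2.$$
The Wiener filter is obtained by minimizing the following cost function:
$$J = E\left[|X - GY|^2\right],$$
where $E[\cdot]$ denotes the expected value. Using $Y = X + D$, we can rewrite $J$ as
$$J = \sigma_x^2 - (G + G^*)\,\sigma_x^2 + |G|^2 (\sigma_x^2 + \sigma_d^2).$$
Differentiating $J$ with respect to $G^*$ gives
$$\frac{\partial J}{\partial G^*} = -\sigma_x^2 + G\,(\sigma_x^2 + \sigma_d^2). \tag{9}$$
Setting (9) to zero and solving it with respect to $G$, we have the spectral gain of the Wiener filter given as
$$G_W = \frac{\sigma_x^2}{\sigma_x^2 + \sigma_d^2} = \frac{\xi}{1 + \xi}, \tag{10}$$
where $\xi = \sigma_x^2 / \sigma_d^2$ is the a priori SNR. The Wiener filter requires one parameter $\xi$ or two variances $\sigma_x^2$ and $\sigma_d^2$.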
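
The Wiener gain depends only on the a priori SNR, or equivalently on the two variances. A minimal sketch, with hypothetical function names and all quantities on a linear (not dB) scale:

```python
def wiener_gain(xi):
    """Wiener gain xi / (1 + xi), with xi the a priori SNR on a linear scale."""
    return xi / (1.0 + xi)


def wiener_gain_from_variances(var_x, var_d):
    """Equivalent form var_x / (var_x + var_d) using the speech and noise variances."""
    return var_x / (var_x + var_d)
```

At an a priori SNR of 1 (0 dB) the gain is 0.5, and it approaches 1 as the SNR grows and 0 as the SNR vanishes.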

MMSE-STSA Method.
In this section, we explain a historically important speech enhancement method, the MMSE-STSA method [4] proposed by Ephraim and Malah in 1984. Ephraim and Malah proposed not only an efficient spectral gain but also an efficient technique for estimating the a priori SNR.
The MMSE-STSA method is derived by minimizing a conditional mean square value of the short-time spectral amplitude. The cost function to be minimized is given by
$$J_{MMSE} = E\left[|X - \hat{X}|^2 \mid Y\right] = \int |X - \hat{X}|^2\, p(X \mid Y)\, dX, \tag{11}$$
where $p(X \mid Y)$ denotes the conditional PDF of $X$. The estimated speech spectrum which minimizes $J_{MMSE}$ is given as
$$\hat{X} = E[X \mid Y] = \int X\, p(X \mid Y)\, dX. \tag{12}$$
As shown in [6], when we assume $p(X)$ and $p(D)$ to be Gauss functions, (12) yields the Wiener filter again.
On the other hand, Ephraim and Malah considered the PDFs of the speech spectral amplitude and phase, that is, $p(|X|)$ and $p(\angle X)$. They assumed $p(|X|)$ and $p(\angle X)$ to be the Rayleigh distribution and the uniform distribution, respectively [18]. They assumed $p(D)$ to be the Gauss function, where the noise variance $\sigma_d^2$ is assumed to split equally into real and imaginary parts. These PDFs are expressed as
$$p(|X|) = \frac{2|X|}{\sigma_x^2} \exp\left(-\frac{|X|^2}{\sigma_x^2}\right), \tag{13}$$
$$p(\angle X) = \frac{1}{2\pi}, \tag{14}$$
$$p(D) = \frac{1}{\pi \sigma_d^2} \exp\left(-\frac{|D|^2}{\sigma_d^2}\right), \tag{15}$$
where $\sigma_x^2$ and $\sigma_d^2$ denote the speech and noise variances, respectively. After lengthy computations, the spectral gain is given as [4]
$$G_{MMSE} = \frac{\sqrt{\pi}}{2} \frac{\sqrt{v}}{\gamma} \exp\left(-\frac{v}{2}\right) \left[(1 + v)\, I_0\!\left(\frac{v}{2}\right) + v\, I_1\!\left(\frac{v}{2}\right)\right], \tag{16}$$
where $I_i(\cdot)$ is the modified Bessel function of order $i$ and
$$v = \frac{\xi}{1 + \xi}\,\gamma, \qquad \gamma = \frac{|Y|^2}{\sigma_d^2}.$$
Here, $\gamma$ is called the a posteriori SNR. As shown in [4], the optimal spectral phase in the sense of MMSE-STSA is identical to the observed one. Hence, $G_{MMSE}$ is a real value. The MMSE-STSA solution $G_{MMSE}$ is completely characterized by $\sigma_d^2$, $\xi$, and $\gamma$. When the noise variance $\sigma_d^2$ is known or can be estimated, $\gamma$ is simply obtained from the observed spectrum. On the other hand, estimating the a priori SNR $\xi$ is difficult, although it is also required by many other spectral speech enhancers. One of the valuable contributions of [4] is a useful estimation method for $\xi$, called the decision-directed method. We will show and use it to estimate $\xi$ in Section 3.

MAP Estimation Method.
As confirmed in the literature, the spectral gain $G_{MMSE}$ derived in the previous section is superior to the spectral subtraction method. However, $G_{MMSE}$ is not easy to implement because of its high computational complexity. Indeed, we can obtain a more reasonable spectral gain from the same conditional PDF used in (11). The MMSE-STSA method chooses the conditional mean $E[X \mid Y]$. Here, we note that $E[X \mid Y]$ coincides with the most probable value when the PDF is an even function like a Gauss function. Because the Rayleigh distribution is an asymmetric function, the mean no longer coincides with the most probable value. The MAP estimation method [6] therefore employs the speech spectrum that maximizes $p(X \mid Y)$.
To illustrate the difference between the MMSE-STSA solution and the MAP solution, we show an example with specific PDFs. Figures 3(a) and 3(b) show the Gauss and Rayleigh distributions, respectively. Here, the horizontal axis denotes the value of the argument $x$ and the vertical axis denotes the PDF $p(x)$. The vertical dotted lines mark the arguments giving the mean value and the maximum value of $p(x)$, respectively. The former corresponds to the MMSE-STSA solution and the latter corresponds to the MAP solution. As shown in Figure 3(a), the MMSE-STSA solution is identical to the MAP solution for the Gauss distribution, which is an even function. On the other hand, the two solutions differ for the asymmetric Rayleigh distribution, as shown in Figure 3(b). In this case, the MAP estimation selects the most probable value rather than the mean value.
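
The gap between the mean and the maximum of an asymmetric PDF can be checked numerically. The sketch below assumes a Rayleigh PDF parameterized by its second moment, p(a) = (2a/σ²) exp(−a²/σ²); the grid size and integration range are arbitrary choices made for illustration.

```python
import math


def rayleigh_pdf(a, sigma2):
    # Rayleigh PDF parameterized by the second moment sigma2 = E[a^2] (assumed form).
    return (2.0 * a / sigma2) * math.exp(-a * a / sigma2)


def mean_and_mode(sigma2, steps=20000):
    """Numerically locate the mean and the maximizer of the Rayleigh PDF."""
    a_max = 8.0 * math.sqrt(sigma2)  # the tail beyond this point is negligible
    da = a_max / steps
    grid = [(i + 0.5) * da for i in range(steps)]  # midpoint rule
    mean = sum(a * rayleigh_pdf(a, sigma2) for a in grid) * da
    mode = max(grid, key=lambda a: rayleigh_pdf(a, sigma2))
    return mean, mode
```

For σ² = 1 the mean is about 0.886 (analytically √π/2) while the maximum lies near 0.707 (analytically 1/√2), so the two estimates differ by roughly 20 percent of σ.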
To obtain the MAP solution, we have to maximize the conditional PDF $p(X \mid Y)$. Based on Bayes' rule, we have [6]
$$p(X \mid Y) = \frac{p(Y \mid X)\, p(X)}{p(Y)}.$$
The MAP estimation finds the argument $X$ which maximizes $p(X \mid Y)$, that is,
$$\hat{X} = \arg\max_{X}\, p(X \mid Y) = \arg\max_{X}\, p(Y \mid X)\, p(X),$$
since $p(Y)$ does not depend on $X$. We assume the same PDFs as in (13) to (15), and $p(X) = p(|X|)\,p(\angle X)$. After calculating $\ln\{p(Y \mid X)\,p(X)\}$ and differentiating it with respect to $|X|$ (or $\angle X$), we set the obtained derivative to zero and solve it with respect to $|X|$ (or $\angle X$). Then, we have [6]
$$G_{MAP} = \frac{\xi + \sqrt{\xi^2 + 2(1 + \xi)\,\xi/\gamma}}{2(1 + \xi)}.$$
Since the MAP solution of $\angle X$ is identical to the observed spectral phase, $G_{MAP}$ is also a real value. We see that $G_{MAP}$ consists of $\xi$ and $\gamma$ only; thus its computational complexity is extremely low in comparison to (16).
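
A minimal sketch of the resulting MAP gain under the Rayleigh amplitude and Gauss noise assumptions is given below. It implements the closed form G = (ξ + √(ξ² + 2(1 + ξ)ξ/γ)) / (2(1 + ξ)), which is the positive root of the quadratic G²(1 + ξ)/ξ − G − 1/(2γ) = 0 obtained by setting the derivative of the log posterior to zero. The function name is hypothetical.

```python
import math


def map_gain(xi, gamma):
    """Joint MAP amplitude gain for a priori SNR xi (> 0) and a posteriori SNR gamma (> 0)."""
    return (xi + math.sqrt(xi * xi + 2.0 * (1.0 + xi) * xi / gamma)) / (2.0 * (1.0 + xi))
```

Only one square root per bin is needed, in contrast to the Bessel function evaluations of the MMSE-STSA gain.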

Lotter's Spectral Gain.
In the previous section, we obtained a MAP solution for speech enhancement under the assumption that the PDF of the speech spectral amplitude can be modeled as the Rayleigh distribution. However, some researchers pointed out that there exist more appropriate speech PDFs [8][9][10][11]. In 2005, Lotter and Vary proposed an original speech spectral amplitude PDF. This PDF was derived from a real speech histogram made from a large amount of real speech data. In the same manner as in the previous section, the speech spectral amplitude and phase were separately modeled in [11]. The PDF of the spectral phase was again modeled as the uniform distribution defined in (14). Lotter and Vary modeled the PDF of the speech spectral amplitude as a super-Gaussian function represented by
$$p(|X|) = \frac{\mu^{\nu+1}}{\Gamma(\nu+1)} \frac{|X|^{\nu}}{\sigma_x^{\nu+1}} \exp\left(-\mu \frac{|X|}{\sigma_x}\right), \tag{21}$$
where $\Gamma(\cdot)$ is the Gamma function and $\mu$ and $\nu$ are the shape parameters which determine the shape of the above PDF.
Using (21), (14), and (15), the same procedure as in the previous section gives the MAP solution expressed as
$$G_{L \cdot MAP} = u + \sqrt{u^2 + \frac{\nu}{2\gamma}}, \qquad u = \frac{1}{2} - \frac{\mu}{4\sqrt{\gamma \xi}}. \tag{22}$$
The MAP solution of the speech spectral phase is also identical to the observed one, and thus $G_{L \cdot MAP}$ is a real value.
Lotter and Vary reported in [11] that the most appropriate shape parameters are $\mu = 1.74$ and $\nu = 0.126$. The spectral gain $G_{L \cdot MAP}$ also consists of $\xi$ and $\gamma$ only; hence it is easy to implement.
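
A short sketch of this gain with the reported shape parameters as defaults follows. It assumes the closed form G = u + √(u² + ν/(2γ)) with u = 1/2 − μ/(4√(γξ)), which is the positive root of G² − 2uG − ν/(2γ) = 0. The function name is hypothetical.

```python
import math


def lotter_map_gain(xi, gamma, mu=1.74, nu=0.126):
    """MAP gain for the super-Gaussian amplitude PDF with shape parameters mu and nu.

    xi: a priori SNR (> 0), gamma: a posteriori SNR (> 0), both on a linear scale.
    """
    u = 0.5 - mu / (4.0 * math.sqrt(gamma * xi))
    return u + math.sqrt(u * u + nu / (2.0 * gamma))
```

Because the shape parameters are plain arguments, the same function can be reused when μ and ν are varied over time.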

Adaptive Speech PDF Method.
In [11], the shape parameters of the speech spectral amplitude PDF, $\mu$ and $\nu$, were derived from a large amount of speech data in a single narrow SNR interval. However, in a practical situation, a speech signal includes both active segments and pause segments. Since the speech spectral amplitude is always zero in the pause segments, we expect that its PDF can be modeled as a delta function. On the other hand, in the active speech segments, the PDF of the speech spectral amplitude follows other functions. Tsukamoto et al. noticed this fact and investigated an adaptive method that changes the PDF of the speech spectral amplitude according to the SNR [13]. They chose Lotter's PDF defined in (21) as the adaptive PDF, because its shape is easily controlled by $\nu$ and $\mu$. Examples of Lotter's PDF with different shape parameters are shown in Figure 4. We see from this figure that the PDF can fit the exponential distribution and the Rayleigh distribution by adjusting the shape parameters. Utilizing real speech histograms, Tsukamoto et al. derived adaptive shape parameters and showed their effectiveness through computer simulations [13]. This basic idea is useful for speech enhancement in practical situations. Unfortunately, the reliability of the derived adaptive shape parameters is comparatively low, because they were derived from only two speech histograms.
To refine Tsukamoto's adaptive shape parameters, Thanhikam et al. made and evaluated many real speech histograms in various narrow SNR intervals [16]. They fitted the speech histograms with (21) and revealed an interesting curve of the shape parameters over the narrow SNR intervals. The shape parameters obtained from the fitting and the derived curves are shown in Figures 5(a) and 5(b), where the narrow SNR was calculated as $P = 10 \log_{10} \xi$ [dB]. The lines in the figures denote the curves obtained by the least mean square method. These curves describe the relation between the shape parameters and $P$. Table 1 shows the formulations of the derived shape parameter functions of $P$, where we denote the derived shape parameters by $R^{\mu}_k(n)$ and $R^{\nu}_k(n)$. Their "adaptive" MAP solution, which uses recursively averaged shape parameters, is as follows:
$$\mu_k(n) = \alpha\, \mu_k(n-1) + (1 - \alpha)\, R^{\mu}_k(n), \qquad \nu_k(n) = \alpha\, \nu_k(n-1) + (1 - \alpha)\, R^{\nu}_k(n),$$
$$G_k(n) = u + \sqrt{u^2 + \frac{\nu_k(n)}{2\gamma}}, \qquad u = \frac{1}{2} - \frac{\mu_k(n)}{4\sqrt{\gamma \xi}}, \tag{25}$$
where $\alpha$ is the forgetting factor and $\mu_k(n)$ and $\nu_k(n)$ are the adaptive shape parameters. In [16], they set $\alpha = 0.98$, $\mu_k(0) = 20$, and $\nu_k(0) = 0$. This paper also uses these settings.
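
The adaptive scheme can be sketched as one update step per bin and frame. The sketch assumes that the adaptive shape parameters are first-order recursive averages of the instantaneous values with forgetting factor α, and that the gain has the same closed form as Lotter's MAP gain. The shape parameter functions of Table 1 are not reproduced here, so `r_mu` and `r_nu` are hypothetical callables that map the narrow SNR P in dB to instantaneous shape parameters.

```python
import math


def adaptive_map_gain(xi, gamma, mu_prev, nu_prev, r_mu, r_nu, alpha=0.98):
    """One update of the adaptive shape parameters followed by the MAP gain.

    r_mu, r_nu: hypothetical stand-ins for the Table 1 functions of P = 10*log10(xi).
    Returns (gain, mu, nu) so the caller can carry mu and nu to the next frame.
    """
    p_db = 10.0 * math.log10(xi)  # narrow SNR in dB; requires xi > 0
    mu = alpha * mu_prev + (1.0 - alpha) * r_mu(p_db)
    nu = alpha * nu_prev + (1.0 - alpha) * r_nu(p_db)
    u = 0.5 - mu / (4.0 * math.sqrt(gamma * xi))
    gain = u + math.sqrt(u * u + nu / (2.0 * gamma))
    return gain, mu, nu
```

Starting from the initial values 20 and 0 with α = 0.98, the parameters move slowly toward the instantaneous values, so the gain stays small during the first frames and relaxes as the averages settle.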
In the next section, we compare the speech enhancement capabilities of the spectral gains presented in this paper.

Speech Enhancement Simulation
To compare the speech enhancement capabilities of the spectral gains presented in this paper, we first explain the common conditions for the speech enhancement simulations. After that, we show the simulation results and discuss them.

Common Conditions.
The speech enhancement methods explained in this paper commonly require the noise variance $\sigma_{d,k}^2(n)$, the a priori SNR $\xi_k(n)$, and the a posteriori SNR $\gamma_k(n)$. To obtain these parameters, the following estimation methods were used.
First, the noise variance was calculated by using the weighted noise estimator proposed in [19]. This method can update the estimated noise variance even when a speech signal is present. The weighted noise estimator calculates an instantaneous noise power by using the weight $W_k(n)$ shown in Figure 6. Here, $\theta$ and $\gamma_H$ are constants. Reference [19] recommends $\theta = 7$ and $\gamma_H = 10$. As shown in Figure 6, $W_k(n)$ is a function of $\gamma_k(n)$. The noise variance $\sigma_{d,k}^2(n)$ is updated with a forgetting factor $\beta$, where $\beta = 0.92$ was used.
Next, the a posteriori SNR was directly calculated as
$$\gamma_k(n) = \frac{|Y_k(n)|^2}{\sigma_{d,k}^2(n)}.$$
Lastly, the a priori SNR was calculated by using the decision-directed method proposed in [4]. The decision-directed method is given by
$$\xi_k(n) = \alpha_{snr}\, \gamma_k(n-1)\, G_k^2(n-1) + (1 - \alpha_{snr}) \max[\gamma_k(n) - 1,\, 0],$$
where $\alpha_{snr}$ is a forgetting factor and $\alpha_{snr} = 0.98$ was used according to [4]. The common speech enhancement system is shown in Figure 7, where the numbers denote the order of the estimation procedures. Of course, the spectral gain estimation depends on the employed speech enhancement method.
In the simulations, the observed signal $y(t)$ was a female speech signal $x(t)$ corrupted by a real tunnel noise $d(t)$ at SNR = 0 dB, where the noise was recorded in a tunnel on an expressway in Japan. All the signals used in the simulations were sampled at 8 kHz, and the DFT size was 256 (the FFT was used to compute the DFT). For objective evaluations, we utilized the SNR defined as
$$\mathrm{SNR} = 10 \log_{10} \frac{\sum_{t=0}^{L-1} x^2(t)}{\sum_{t=0}^{L-1} \{x(t) - \hat{x}(t)\}^2},$$
where $\hat{x}(t)$ is the enhanced speech signal and $L$ denotes the number of samples in the time domain.
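
The decision-directed recursion of [4] can be sketched for a single frequency bin as follows. It assumes the usual form in which the previous a posteriori SNR, weighted by the squared previous gain, is mixed with the half-wave rectified value max(γ − 1, 0). The initial state (previous gain and previous a posteriori SNR both set to 1) and the callback `gain_fn` are assumptions made for illustration.

```python
def decision_directed(y_pow, noise_var, gain_fn, alpha_snr=0.98):
    """Track the a priori SNR of one frequency bin over frames.

    y_pow:     list of observed powers |Y_k(n)|^2
    noise_var: list of noise variance estimates for the same frames
    gain_fn:   hypothetical callback gain_fn(xi, gamma) returning the spectral gain
    """
    xis, gains = [], []
    prev_gamma, prev_gain = 1.0, 1.0  # initial state is an assumption
    for p, var in zip(y_pow, noise_var):
        gamma = p / var  # a posteriori SNR
        xi = alpha_snr * prev_gamma * prev_gain ** 2 \
            + (1.0 - alpha_snr) * max(gamma - 1.0, 0.0)
        g = gain_fn(xi, gamma)
        xis.append(xi)
        gains.append(g)
        prev_gamma, prev_gain = gamma, g
    return xis, gains
```

Any of the gains discussed in Section 2 that depend on ξ and γ can be supplied as `gain_fn`, for instance `lambda xi, gamma: xi / (1.0 + xi)` for the Wiener filter.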
We also utilized another evaluation measure, the LR (Likelihood Ratio) [17], computed over $J$ frames, where $J$ is the number of frames. The LR represents a spectral distance between the original speech and the estimated one; hence, a perfect speech estimate gives LR = 0.
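
The time-domain SNR measure can be sketched as below, assuming the common definition of ten times the base-10 logarithm of the clean speech energy divided by the energy of the estimation error. The function name is hypothetical, and the LR measure is not implemented here.

```python
import math


def output_snr_db(x, x_hat):
    """SNR in dB between the clean speech x and the enhanced speech x_hat.

    Both inputs are equal-length sequences of time-domain samples;
    x_hat must differ from x in at least one sample.
    """
    signal_energy = sum(v * v for v in x)
    error_energy = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return 10.0 * math.log10(signal_energy / error_energy)
```

For instance, an estimate that is uniformly 10 percent too small in amplitude yields an SNR of 20 dB.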

Simulation Results.
Speech enhancement simulations were carried out to compare the presented speech enhancement methods. The chosen methods were the spectral subtraction method [3] and the Wiener filter [5] as traditional methods, Lotter's spectral gain [11] as a MAP method using a fixed speech PDF, and the adaptive speech PDF method [16] as a recent method.
Table 2 shows the results of the objective evaluation for each method, where the best SNR and LR results were both obtained with the adaptive speech PDF method proposed by Thanhikam et al. [16]. We see from this table that the Wiener filter and Lotter's method also gave comparatively good SNR and LR results in comparison to the spectral subtraction method. The waveforms of the simulation results are shown in Figures 8(a)-8(e), and the respective spectrograms are shown in Figures 9(a)-9(e). From Figures 8(b) and 9(b), we see that the spectral subtraction method left much residual noise. The main reason may be that the spectral subtraction method does not use any speech spectral information. The residual noise is perceived as an annoying musical noise. From Figures 8(c) and 9(c), we see that the Wiener filter is superior to the spectral subtraction method for speech enhancement. The Wiener filter gave an estimated speech signal with less musical noise, although the amount of residual noise was comparatively large. From the waveform shown in Figure 8(d), we can confirm that Lotter's spectral gain method effectively reduces the noise in some segments. However, its spectrogram in Figure 9(d) shows that Lotter's spectral gain method emphasized isolated spectral components, that is, musical noise. As a result, it also causes a perceptual problem. In Figures 8(e) and 9(e), such estimation errors are not observed. This implies that the adaptive PDF method proposed by Thanhikam et al. is appropriate for reducing the noise in speech pause segments. However, in the speech activity segments, we can confirm that some speech spectral components also vanished. The output speech quality of the adaptive speech PDF method may be improved by adjusting the forgetting factor in the adaptive shape parameters of the speech PDF.

Conclusion
Single-channel speech enhancement methods have been extensively studied for decades. This paper has presented some spectral gain methods from among the numerous studies. Of course, there exist various noisy situations, and hence we cannot single out the best speech enhancement system among them. We simply tried to make explicit the theoretical backgrounds of the chosen speech enhancement methods. The noise reduction capability of the speech enhancement methods was roughly compared for an arbitrarily chosen noisy speech signal, although the simulation results may change slightly when different noise and speech signals are used. From the obtained simulation results, we confirmed that the MAP estimation methods gave good noise reduction performance. In particular, the recently proposed adaptive speech PDF method reduced the noise signal strongly and hence did not produce musical noise in speech pause segments. In the speech activity segments, however, we perceived a low-level musical noise and a degradation of the speech. Such degradation tends to become large as the noise increases. Future work in speech enhancement includes the development of an effective noise reduction method that performs well for noisy speech signals with an SNR below 0 dB.

Figure 3: Maximum and mean values for the specific PDFs.

Figure 4: Shape examples of the PDF in (21) with different parameters.

Figure 5: Shape parameter fitting result for the SNR.
Thanhikam et al. used an averaged value of $R^{\mu}_k(n)$ and $R^{\nu}_k(n)$ to determine the present PDF shape of the speech spectral amplitude.
Figure panel labels: MAP estimation using Lotter's PDF shown in (22); MAP estimation using adaptive PDF shown in (25).

Table 1: Instantaneous shape parameter functions $R^{\mu}_k(n)$ and $R^{\nu}_k(n)$.

Table 2: Objective evaluation results for noisy signal with SNR = 0 dB.