Subband DCT and EMD Based Hybrid Soft Thresholding for Speech Enhancement

1 Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 113-8656, Japan
2 Department of Computer Science and Engineering, The University of Rajshahi, Rajshahi 6205, Bangladesh
3 Graduate School of Engineering, The University of Tokyo, Tokyo 113-8656, Japan
4 Department of Electrical and Electronic Engineering, Bangladesh University of Engineering and Technology, Dhaka 1000, Bangladesh


Introduction
In many speech related systems, the desired signal is not available directly; rather, it is contaminated by interference sources. This background noise degrades the quality and intelligibility of the original speech, resulting in a severe drop in the performance of subsequent applications. Speech enhancement aims at improving the perceptual quality and intelligibility of speech signals degraded in noisy environments, mainly through noise reduction algorithms [1]. Due to its importance in today's information technology, many methods have been developed for this purpose. A major problem in most algorithms is that the enhanced speech signal is distorted compared to the original one, which results in the loss of some speech details. Residual noise is another problem, which affects the performance of the postprocessing systems.
Soft thresholding is a powerful technique for removing noise components by subtracting a constant value from the coefficients of the noisy speech signal obtained by the analyzing transformation. However, such direct subtraction also degrades the speech components. Unlike the conventional constant noise-level subtraction rule [2,3], a new soft thresholding strategy based on frequency frames was proposed in [4]. The latter is able to remove the noise components while causing significantly less damage to the speech signal. This enables even signals with high SNRs to be processed effectively. However, due to the thresholding criteria, a noticeable amount of noise still remains in the enhanced signal. Another disadvantage is the lack of robustness of the algorithm to different noise types.
The empirical mode decomposition (EMD), pioneered by Huang et al. [5] as a new and powerful data analysis method for nonlinear and nonstationary signals, has opened a novel and effective path for speech enhancement studies. Recent studies have shown that, with EMD, it is possible to successfully remove the noise components from the intrinsic mode functions (IMFs) of the noisy speech. Since the extraction of the IMFs relies on frequency characteristics, the IMFs with higher index contain lower frequency components. This property helps the noise and speech components to be roughly separated in terms of frequency and to dominate in different IMFs. Therefore, it becomes possible to identify and remove even the noise parts that are embedded in the speech components.
In this paper, we propose a hybrid algorithm with a two-stage soft thresholding. In the first stage, a subband DCT domain soft thresholding is applied to the noisy speech. The noise remaining in the enhanced speech resembles random tones and results in an irritating sound. Hence further denoising should be applied to get rid of this artifact. However, it is not an easy task to identify and remove these noise components without degrading the speech signal. Owing to the frequency characteristics of the IMFs, further enhancement is achieved in the second stage through an EMD based soft thresholding strategy.

DCT Soft Thresholding
Transform domain speech enhancement methods commonly use amplitude subtraction based soft thresholding defined by [2,3]

$$\hat{X}_k = \begin{cases} \operatorname{sign}(X_k)\,(|X_k| - \sigma_v), & |X_k| > \sigma_v, \\ 0, & \text{otherwise}, \end{cases} \tag{1}$$

where $\sigma_v$ denotes the noise level, $X_k$ is the $k$th coefficient of the noisy signal obtained by the analyzing transformation, and $\hat{X}_k$ represents the corresponding thresholded coefficient. Since all the coefficients are thresholded by $\sigma_v$, the speech components are also degraded during this process. This degradation results in a loss in speech quality.

Unlike the conventional constant noise-level subtraction rule in (1), a frame based soft thresholding strategy was proposed in [4]. The strategy depends on segmenting the signal into short time intervals and applying the discrete cosine transform (DCT) on each frame. The DCT coefficients of each frame are divided into frequency bins, which are categorized as either signal- or noise-dominant depending on their speech and noise energy distribution. Figure 1 shows an illustration of typical noise- and speech-dominant frequency bins. The problems of the conventional constant noise-level subtraction rule given in (1) can be clearly observed in this figure. For instance, it is apparent from Figure 1(a) that subtracting a constant value from the noisy speech coefficients in order to obtain the clean speech coefficients is inadequate. Furthermore, the second part of the rule, which sets coefficients to zero, may discard a significant amount of speech information and become a source of musical noise. Therefore a linear thresholding is followed in noise-dominant frames. On the other hand, Figure 1(b) shows that soft thresholding is very inaccurate for signal-dominant frequency bins and will most probably degrade the speech components, thereby doing more damage than good to the enhanced speech. Therefore, the signal-dominant frames are better kept as they are in order not to degrade the high energy speech components. This enables even signals with high SNRs to be processed effectively.
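The constant noise-level rule (1) can be illustrated with a short Python sketch. It assumes that (1) has the standard amplitude-subtraction form: the magnitude of each coefficient is reduced by the noise level, the sign is kept, and coefficients whose magnitude does not exceed the noise level are set to zero. The function name `soft_threshold` is hypothetical.

```python
def soft_threshold(coeffs, sigma_v):
    """Amplitude-subtraction soft thresholding (assumed standard form of rule (1))."""
    out = []
    for x in coeffs:
        mag = abs(x)
        if mag > sigma_v:
            # Shrink the magnitude by the noise level and keep the sign.
            shrunk = mag - sigma_v
            out.append(shrunk if x > 0 else -shrunk)
        else:
            # Coefficients at or below the noise level are set to zero.
            out.append(0.0)
    return out


# Every coefficient is shrunk by the same amount, including the large
# (speech-dominated) ones, which is the degradation discussed in the text.
print(soft_threshold([0.25, -1.5, 2.0, -0.5], sigma_v=0.5))
```

The example makes the drawback visible: the large coefficients, which are most likely speech, lose exactly as much amplitude as the small ones.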
The noisy speech is first segmented into 32 ms frames and a 512-point DCT is applied on each frame. The DCT coefficients of the frames are further divided into 8 frequency bins, each containing 64 DCT coefficients. As discussed before, for adaptive thresholding, each bin is categorized as either signal- or noise-dominant. The classification pertains to the average noise power associated with that particular bin. If the $i$th bin satisfies the inequality

$$\frac{1}{N}\sum_{k=1}^{N} \left|X_k^i\right|^2 \ge \sigma_n^2, \tag{2}$$

where $\sigma_n^2$ denotes the variance of the noise, $X_k^i$ is the $k$th DCT coefficient of the $i$th frequency bin, and $N$ (= 64) is the number of DCT coefficients of the bin, then the bin is characterized as signal-dominant, otherwise as noise-dominant. The signal-dominant bins are not thresholded, since thresholding is highly likely to degrade the speech signal, especially for high SNRs. In the case of a noise-dominant frequency bin, the absolute values of the DCT coefficients are sorted in ascending order and a linear thresholding is applied as in (3), where the linear threshold function is obtained as in (4) from the index of the sorted $|X_k|$.

It is evident from (2) that, for the noise-dominant frequency bins, the average noise power added would be less than the average noise power estimated over the entire speech signal. Here, the added average noise power over any of these frequency bins is denoted as $\lambda\sigma_n^2$. To find a reasonable value for $\lambda$, three speech signals contaminated with white noise at 10 dB SNR are used. Using the categorization in (2) at each frequency bin, the noise-dominant bins are identified and a value of $\lambda$ is calculated by simply dividing the variance of that frequency bin by the overall noise variance. The sorted variation of $\lambda$ is shown in Figure 2. It can be observed that the value of $\lambda$ varies between 0.2 and 0.8 for all speech signals. Therefore, experimentally, the value of $\lambda$ should be selected in this range.
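The framing and bin categorization can be sketched as follows. This is an illustration only, assuming an orthonormal DCT-II (so that coefficient power is comparable with the time-domain noise variance) and assuming that (2) compares the average coefficient power of a bin with the noise variance. The names `dct2_ortho` and `classify_bins` are hypothetical, and the DCT is written out directly for clarity rather than speed.

```python
import math
import random


def dct2_ortho(frame):
    """Orthonormal DCT-II, written out directly (O(N^2)) for clarity."""
    n = len(frame)
    out = []
    for k in range(n):
        acc = sum(frame[m] * math.cos(math.pi * (m + 0.5) * k / n) for m in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * acc)
    return out


def classify_bins(coeffs, noise_var, n_bins=8):
    """Return one flag per frequency bin: True means signal-dominant."""
    size = len(coeffs) // n_bins
    flags = []
    for i in range(n_bins):
        bin_coeffs = coeffs[i * size:(i + 1) * size]
        avg_power = sum(c * c for c in bin_coeffs) / size
        flags.append(avg_power >= noise_var)
    return flags


# One 32 ms frame at 16 kHz (512 samples): a 300 Hz tone plus white noise.
random.seed(0)
fs = 16000
frame = [math.sin(2 * math.pi * 300 * n / fs) + random.gauss(0.0, 0.1) for n in range(512)]
flags = classify_bins(dct2_ortho(frame), noise_var=0.1 ** 2)
print(flags)  # the first bin covers roughly 0 to 1 kHz and contains the tone
```

With 512 coefficients spanning 0 to 8 kHz, each group of 64 coefficients covers about 1 kHz. Bins that contain only noise have an average power close to the noise variance, so small fluctuations can push them to either side of the limit.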

Basics of EMD
The principle of the EMD technique is to decompose any signal $x(t)$ into a set of band-limited functions $C_i(t)$, which are zero-mean oscillating components, simply called the IMFs. Each IMF satisfies two basic conditions: (i) in the whole data set, the number of extrema and the number of zero crossings must be the same or differ at most by one, and (ii) at any point, the mean value of the envelope defined by the local maxima and the envelope defined by the local minima is zero [5]. The first condition is similar to the narrow-band requirement for a Gaussian process; the second is a local requirement induced from the global one and is necessary to ensure that the instantaneous frequency will not have redundant fluctuations induced by asymmetric waveforms. The name intrinsic mode function is adopted because it represents the oscillation mode in the data. With this definition, the IMF in each cycle, defined by the zero crossings, involves only one mode of oscillation; no complex riding waves are allowed [5]. An IMF is not restricted to a narrow-band signal; it can be both amplitude and frequency modulated and, in fact, it can be nonstationary.

The IMFs are found by subtracting the highest oscillating components from the data in a step-by-step procedure called the sifting process. Although a mathematical model has not been developed yet, different methods for computing EMD have been proposed since its introduction [6,7]. The original algorithm, the sifting process, is simple and elegant. It includes the following steps:
(1) identify the extrema (both maxima and minima) of $x(t)$,
(2) generate the upper and lower envelopes, $u(t)$ and $l(t)$, by connecting the maxima and minima points by cubic spline interpolation,
(3) determine the local mean $m_1(t) = [u(t) + l(t)]/2$,
(4) since an IMF should have zero local mean, subtract $m_1(t)$ from $x(t)$ to get $h_1(t)$,
(5) check whether $h_1(t)$ is an IMF or not,
(6) if not, use $h_1(t)$ as the new data and repeat steps 1 to 6 until ending up with an IMF.
Once the first IMF $h_1(t)$ is derived, it is defined as $C_1(t) = h_1(t)$, which is the smallest temporal scale in $x(t)$. To compute the remaining IMFs, $C_1(t)$ is subtracted from the original data to get the residue signal $r_1(t) = x(t) - C_1(t)$. The residue now contains the information about the components of longer periods. The sifting process is continued until the final residue is a constant, a monotonic function, or a function with only one maximum and one minimum from which no more IMFs can be derived [6]. The subsequent IMFs and the residues are computed as

$$r_1(t) - C_2(t) = r_2(t), \; \ldots, \; r_{n-1}(t) - C_n(t) = r_n(t). \tag{5}$$

At the end of the decomposition, the data $x(t)$ will be represented as a sum of $n$ IMF signals plus a residue signal,

$$x(t) = \sum_{i=1}^{n} C_i(t) + r_n(t). \tag{6}$$

A noisy speech signal and some selected IMF components are shown in Figure 3. It can be observed that higher order IMFs contain lower frequency oscillations than those of lower order IMFs. This is reasonable, since the sifting process is based on the idea of subtracting the component with the longest period from the data until an IMF is obtained. Therefore the first IMF will have the highest oscillating components, that is, the components with the highest frequencies. Consequently, the higher the order of the IMF is, the lower its frequency content will be. The IMFs may have frequency overlaps, but at any time instant the instantaneous frequencies represented by each IMF are different. This phenomenon can be seen in Figure 4, which shows the instantaneous frequencies of the first 6 IMFs. Therefore EMD is not band pass filtering but an effective decomposition of nonlinear and nonstationary signals in terms of their local frequency characteristics.

The recent development of EMD focused on the use of ensemble EMD (EEMD) [8] and noise assisted multivariate EMD (MEMD) [9,10] to implement the traditional univariate EMD (UEMD). The key advantage of the newly developed EMD methods is a more accurate decomposition of the analyzed signal. The EEMD approach consists of sifting an ensemble of white noise-added signals and treats the mean as the final true result. The effect of the added white noise is to provide a uniform reference frame in the time-frequency space; therefore, the added noise collates the portions of the signal of comparable scale in one IMF. A noise-assisted approach in conjunction with MEMD is also used for the computation of EMD, in order to produce localized frequency estimates at the accuracy level of instantaneous frequency [9]. The traditional EMD is prone to mode-mixing and is designed for univariate data. The noise assisted MEMD (NA-MEMD) approach utilizes the dyadic filter bank property of MEMD, providing a solution to this problem of standard EMD.
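The basic univariate sifting procedure and the residue recursion described in this section can be sketched in Python. This is a minimal illustration under simplifying assumptions that are not part of the method described here: the envelopes are piecewise linear rather than cubic splines, they are held constant beyond the first and last extremum, and sifting stops after a fixed number of passes (`n_sift`) instead of testing the IMF conditions. All function names are hypothetical.

```python
import math


def local_extrema(x):
    """Indices of interior local maxima and minima of the sequence x."""
    maxima, minima = [], []
    for i in range(1, len(x) - 1):
        if x[i - 1] < x[i] >= x[i + 1]:
            maxima.append(i)
        elif x[i - 1] > x[i] <= x[i + 1]:
            minima.append(i)
    return maxima, minima


def envelope(x, idx):
    """Piecewise-linear envelope through x[idx], held flat at both ends."""
    env, k = [], 0
    for n in range(len(x)):
        if n <= idx[0]:
            env.append(x[idx[0]])
        elif n >= idx[-1]:
            env.append(x[idx[-1]])
        else:
            while idx[k + 1] < n:
                k += 1
            a, b = idx[k], idx[k + 1]
            t = (n - a) / (b - a)
            env.append((1 - t) * x[a] + t * x[b])
    return env


def sift_imf(x, n_sift=10):
    """Extract one IMF candidate by repeatedly removing the local mean."""
    h = list(x)
    for _ in range(n_sift):
        maxima, minima = local_extrema(h)
        if len(maxima) < 2 or len(minima) < 2:
            break
        upper = envelope(h, maxima)
        lower = envelope(h, minima)
        h = [v - 0.5 * (u + l) for v, u, l in zip(h, upper, lower)]
    return h


def emd(x, max_imfs=8):
    """Return (imfs, residue) such that x equals the sum of the IMFs plus the residue."""
    imfs, residue = [], list(x)
    for _ in range(max_imfs):
        maxima, minima = local_extrema(residue)
        if len(maxima) + len(minima) < 3:
            break  # too few extrema left to derive another IMF
        imf = sift_imf(residue)
        imfs.append(imf)
        residue = [r - c for r, c in zip(residue, imf)]
    return imfs, residue


# A fast oscillation riding on a slow one.
x = [math.sin(2 * math.pi * n / 8 + 0.3) + math.sin(2 * math.pi * n / 100) for n in range(400)]
imfs, residue = emd(x)
print(len(imfs))
```

Because each IMF is subtracted from the running residue, summing all returned IMFs and the final residue reproduces the input, which mirrors the representation of the data as a sum of IMFs plus a residue.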
With these powerful characteristics, recent studies have shown that it is possible to successfully identify and remove a significant amount of the noise components from the IMFs of a noisy speech signal. Although all IMFs contain energy from both the original speech and the noise, the energy distribution differs between them. Since speech signals are mainly concentrated in the low and mid frequency bands, the high frequency noise components dominate the first IMFs. For instance, in the case of white noise, most of the noise components are centered on the first three IMFs, while the speech signals dominate between the 3rd and 6th IMFs, as can be observed in Figure 3. Therefore, EMD makes it possible, to some extent, to separate the high frequency noise from the major speech components.

Proposed Hybrid Algorithm
The proposed hybrid algorithm is based on applying the frame based soft thresholding strategy [4] in two stages. The first stage includes the DCT domain soft thresholding with a subband approach in order to provide robustness to different noise types. The second stage of the algorithm consists of an EMD domain soft thresholding for further enhancement.

Subband DCT Soft Thresholding.
The major problem of the DCT soft thresholding algorithm given in [4] is that it is not robust to different noise types. Since all the frequency bins are processed with a single noise variance estimated in the time domain, the algorithm is mainly applicable to white noise, which has a flat spectrum. The method fails for other noise types that show a different spectral distribution across the frequency bins. Therefore, it is important to have a subband approach where a specific noise variance is calculated for each frequency band. The index of the frequency bins represents the index of the subband. For instance, the first frequency subband consists of the first frequency bins of each frame. The variance of each subband is calculated through a minimum statistics approach from the frequency bins. With this subband approach, each band will have an effective bin categorization. Therefore, the algorithm will be robust to different noise types.
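The subband indexing can be illustrated with a deliberately crude Python sketch. It assumes that the average power of every bin has already been computed per frame, and it takes the plain minimum over frames as the noise estimate of each subband. A practical minimum statistics estimator would add temporal smoothing and bias compensation, which are omitted here; the function name is hypothetical.

```python
def subband_noise_variances(bin_powers):
    """Crude per-subband noise estimate.

    bin_powers[f][i] is the average power of frequency bin i in frame f.
    Subband i collects bin i of every frame; its noise variance is taken
    here as the minimum bin power observed over all frames (assumption:
    no smoothing, no bias compensation).
    """
    n_subbands = len(bin_powers[0])
    return [min(frame[i] for frame in bin_powers) for i in range(n_subbands)]


# Three frames, two subbands: the second frame is speech-free in subband 0.
powers = [[4.0, 1.0],
          [0.5, 1.2],
          [3.0, 0.9]]
print(subband_noise_variances(powers))
```

The point of the sketch is the data layout: each subband receives its own estimate, so a noise type with an uneven spectrum yields a different categorization limit in each band.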
Apart from the subband approach, a novel strategy is introduced here for the bin categorization. The limit given in (2), which is set to the noise variance, is not sufficient to identify all the noise-dominant bins. Since the variance of the noisy bins fluctuates, many noise-dominant bins will be identified as signal-dominant. Therefore, the limit for bin categorization should have a larger value than the noise variance, in order to ensure that the noisy bins are thresholded. The novel limit relies on the idea that a bin can be defined as noise-dominant if the noise power in that bin is higher than the speech power. Therefore, the limit should be set to the case where the noise and speech variances, $\sigma_n^2$ and $\sigma_s^2$, respectively, are equal. The variance $\sigma^2$ of the noise contaminated speech for any frequency bin is represented as

$$\sigma^2 = \sigma_s^2 + \sigma_n^2 + 2\,\mathrm{cov}(s, n), \tag{7}$$

where $\mathrm{cov}(s, n)$ is the covariance term of signal and noise. If the signal and noise are independent, the covariance is zero; thus we have

$$\sigma^2 = \sigma_s^2 + \sigma_n^2. \tag{8}$$

For frame categorization (into signal- and noise-dominant frames), the threshold is taken at equal noise and speech power, $\sigma_s^2 = \sigma_n^2$, and hence $\sigma^2 = 2\sigma_n^2$. The variance of a speech segment directly corresponds to its power, so equal variances of speech and noise mean that speech and noise contribute equally to the power of the noisy speech frame. This power level is therefore used as the threshold for frame categorization and is treated as the minimum power level of a speech-dominant frame. In any frame with power above this threshold the speech power dominates; otherwise, the noise power dominates the analyzed frame. That is why the limit for the categorization of the bins in (2) should be set to this value. With the proposed strategy, if

$$\frac{1}{N}\sum_{k=1}^{N} \left|X_k^i\right|^2 \ge 2\sigma_i^2, \tag{9}$$

where $\sigma_i^2$ denotes the variance of the noise for the $i$th subband and $X_k^i$ is the $k$th sample of the $i$th bin, then this bin is categorized as signal-dominant, otherwise as noise-dominant. Noise-dominant frequency bins are thresholded as in (3). The optimum value for $\lambda$ is defined next.
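The effect of raising the categorization limit can be seen in a small Python example. It assumes that both limits compare the average coefficient power of a bin with a multiple of the noise variance: a factor of 1 for the original limit (2) and a factor of 2 for the limit at equal speech and noise power. The function name and the `factor` parameter are hypothetical.

```python
def is_signal_dominant(bin_coeffs, noise_var, factor=2.0):
    """True if the average power of the bin reaches factor * noise_var."""
    avg_power = sum(c * c for c in bin_coeffs) / len(bin_coeffs)
    return avg_power >= factor * noise_var


# A bin whose average power (2.5) is only slightly above the noise variance (2.0).
bin_coeffs = [1.0, 2.0, 1.0, 2.0]
noise_var = 2.0

print(is_signal_dominant(bin_coeffs, noise_var, factor=1.0))  # original limit
print(is_signal_dominant(bin_coeffs, noise_var, factor=2.0))  # doubled limit
```

A bin like this one, which merely fluctuates above the noise variance, is left unthresholded under the original limit but is treated as noise-dominant once the limit is doubled.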

Optimum Value of 𝜆.
The soft thresholding algorithm can be further improved by defining an optimum value for $\lambda$. As discussed, it is better to have a higher $\lambda$ for low SNRs and a lower value for high SNR input signals. This dependency of $\lambda$ on the input SNR can be observed in Figure 5, which shows the effect of $\lambda$ on the SNR improvement results at different input SNRs. Therefore, the optimum value of $\lambda$ can be related to an estimated value of the input SNR. The input SNR can be estimated as

$$\mathrm{SNR}_{\mathrm{est}} = 10\log_{10}\frac{\sigma_s^2}{\sigma_n^2}, \tag{10}$$

where $\sigma_s^2$ denotes the variance of the speech signal and $\sigma_n^2$ denotes the variance of the noise signal within the whole noisy mixture. From the independence of the speech and noise, $\sigma_s^2$ is determined as $\sigma_s^2 = \sigma^2 - \sigma_n^2$. Extensive computer simulations are performed to determine the values of the parameters $\lambda_0$ ($0.6 < \lambda_0 < 0.8$) and $\lambda_1$ ($0.01 < \lambda_1 < 0.03$); hence the optimum value $\lambda_{\mathrm{opt}}$ is obtained as in (11).

[...] IMFs of the enhanced speech. Similarly, the lower frequency noise signals can be identified from the later IMFs.
The IMFs are in the time domain and may have frequency overlaps. However, at any time instant, the instantaneous frequency represented by each IMF is different. That is why, although the IMFs are in the time domain, they differ spectrally at each time instant. Therefore, the DCT soft thresholding algorithm can be applied to the IMFs as given in [11]. First, the EMD is applied to the enhanced speech. The obtained IMFs are divided into 4 ms frames, each thus having 64 samples at a 16 kHz sampling frequency. Due to the decomposition characteristics, the IMFs differ in terms of noise and speech energy distribution. Therefore the specific noise variance of each IMF is estimated from the speechless parts. As in the DCT bin categorization case, the frames are characterized as either signal- or noise-dominant with the novel categorization limit given in (9). The noise-dominant frames are thresholded using (3), while the signal-dominant frames are not.
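The framing and categorization of a single IMF can be sketched in Python. The sketch assumes that the indices of speech-free frames are supplied by the caller (`silent_frames`, a hypothetical parameter), uses the mean square of a frame as its variance because IMFs are approximately zero mean, and applies the doubled noise variance as the limit. It only marks which frames would be thresholded; the thresholding rule itself is not reproduced.

```python
def frame_signal(x, frame_len=64):
    """Split x into consecutive non-overlapping frames (4 ms at 16 kHz is 64 samples)."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, frame_len)]


def frame_power(frame):
    """Mean square value, used as the variance of a zero-mean frame."""
    return sum(v * v for v in frame) / len(frame)


def noise_dominant_frames(imf, silent_frames, frame_len=64):
    """Return one flag per frame: True means the frame would be thresholded."""
    powers = [frame_power(f) for f in frame_signal(imf, frame_len)]
    # Noise variance of this IMF, estimated from frames assumed to be speech-free.
    noise_var = sum(powers[i] for i in silent_frames) / len(silent_frames)
    return [p < 2.0 * noise_var for p in powers]


# Toy IMF: a quiet frame, a loud frame, and another quiet frame.
imf = [0.1, -0.1] * 32 + [1.0, -1.0] * 32 + [0.1, -0.1] * 32
print(noise_dominant_frames(imf, silent_frames=[0]))
```

Estimating the noise variance separately for every IMF reflects the observation that the IMFs differ in how noise and speech energy are distributed.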

Experimental Results and Discussion
To illustrate the effectiveness of the EMD based hybrid algorithm, extensive computer simulations were conducted with 10 male and 10 female utterances sampled at 16 kHz, randomly selected from the TIMIT database. The clean speech samples were corrupted with weighted noise from the NOISEX database in order to obtain the noisy speech samples. To illustrate the robustness of the univariate EMD (UEMD) scheme to different noise types, white, pink, and high frequency (HF) radio channel noise samples have been used. For evaluating the performance of the method, overall and average segmental SNR improvements as well as objective speech quality results were used. The quality of the enhanced signals has been measured with the perceptual evaluation of speech quality (PESQ).

Figures 6(a) and 6(b) show the spectrograms of the male clean speech "do not ask me to carry an oily rag like that" from the TIMIT database and the corresponding noisy speech corrupted with white noise at 10 dB SNR. The spectrogram of the enhanced speech after the first stage of the algorithm is illustrated in Figure 6(c). It can be observed that the first stage yields a reasonable enhancement of the noisy speech signal. Although the noise components are effectively removed over a wide range of frequencies, some remaining noise can still be observed in the enhanced speech. The second stage efficiently removes this remaining noise. In this way, not only is the SNR significantly improved but the irritating residual noise is also removed. The spectrogram of the overall enhanced signal in Figure 6(d) illustrates the effectiveness of the proposed method. Figure 7 shows the corresponding waveforms.
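Corrupting clean speech with weighted noise at a prescribed SNR can be sketched as follows. This is a generic illustration, not the exact procedure used in the experiments: the noise is scaled by a single gain so that the ratio of clean power to scaled noise power matches the target. The function name `mix_at_snr` is hypothetical.

```python
import math


def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so that the mixture has the requested SNR in dB."""
    p_clean = sum(s * s for s in clean) / len(clean)
    p_noise = sum(v * v for v in noise) / len(noise)
    # Gain g such that p_clean / (g^2 * p_noise) equals 10^(snr_db / 10).
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * v for s, v in zip(clean, noise)]


clean = [1.0, -1.0] * 50
noise = [0.5, 0.5, -0.5, -0.5] * 25
noisy = mix_at_snr(clean, noise, snr_db=10.0)

# Check the SNR that was actually obtained.
added = [y - s for y, s in zip(noisy, clean)]
achieved = 10 * math.log10(sum(s * s for s in clean) / sum(a * a for a in added))
print(round(achieved, 6))
```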
Similar to the DCT soft thresholding, the algorithm can be applied for a wide range of SNRs. Since the signal-dominant frames are never thresholded, there is still significant improvement even in the case of high SNRs, where most of the proposed UEMD based methods fail to preserve the input SNR. The average results of the computer simulations for 10 male and 10 female utterances for a wide range of SNR values, with a comparison of different denoising methods, are listed in Table 1(A) for white noise. The superiority of the UEMD scheme can be clearly observed in this table.
It can be observed that, for all SNR levels, the proposed UEMD method gives significantly better results. Although SNR improvement is a good measure for quantifying performance, it has little perceptual meaning and is therefore not a good measure of speech quality [12]. Instead, the average segmental SNR (AvgSegSNR) is a relatively better measure.

Figure 9: The waveform of (a) clean speech, (b) noisy mixture at 10 dB (pink noise), and enhanced speech with (c) wavelet packets thresholding [3], (d) DCT hard thresholding [11], (e) DCT soft thresholding, and (f) UEMD based hybrid method ($\lambda_{\mathrm{opt}}$).
The average results of computer simulations for 10 male and 10 female utterances for overall SNR, average segmental SNR, and PESQ results are listed in Table 2.
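The overall and average segmental SNR measures can be computed as in the following Python sketch. The definitions are the commonly used ones and are assumptions here, since they are not spelled out in the text: the overall SNR compares clean signal energy with error energy over the whole utterance, and the segmental version averages frame-wise SNRs after limiting each to a range (the limits of -10 dB and 35 dB and the frame length are conventional choices, not values from this work).

```python
import math


def snr_db(clean, estimate):
    """Overall SNR in dB of an estimate with respect to the clean signal."""
    num = sum(s * s for s in clean)
    den = sum((s - e) ** 2 for s, e in zip(clean, estimate))
    return 10 * math.log10(num / den)


def avg_seg_snr_db(clean, estimate, frame_len=512, lo=-10.0, hi=35.0):
    """Average of frame-wise SNRs, each clamped to the range [lo, hi] dB."""
    vals = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        c = clean[start:start + frame_len]
        e = estimate[start:start + frame_len]
        num = sum(s * s for s in c)
        den = sum((s - t) ** 2 for s, t in zip(c, e))
        if num == 0.0 or den == 0.0:
            continue  # skip silent or perfectly reconstructed frames
        vals.append(min(hi, max(lo, 10 * math.log10(num / den))))
    return sum(vals) / len(vals) if vals else float("nan")


clean = [1.0] * 8
estimate = [0.9] * 8
print(round(snr_db(clean, estimate), 3), round(avg_seg_snr_db(clean, estimate, frame_len=4), 3))
```

An improvement figure is then the difference between the measure evaluated on the enhanced signal and the same measure evaluated on the noisy input.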
As discussed before, it can be seen that the DCT soft thresholding algorithm in [4] fails dramatically for noise types that do not have a flat spectral distribution. Due to the subband variance approach adopted in the first stage, our proposed hybrid method is significantly more robust to such noise types and clearly superior to the other methods. Moreover, since the signal-dominant subframes are never thresholded, the algorithm yields an improvement at all SNR values. The EMD based soft thresholding in the second stage not only improves the SNR but also plays a critical role in removing the irritating musical noise, thereby considerably increasing the perceptual speech quality. Figures 8 and 9 show the spectrograms and waveforms of the clean speech, the noisy speech at 10 dB SNR contaminated with pink noise, and the enhanced speech signals for the female speech "they will take a wedding trip later."

The performance of UEMD based speech enhancement is also compared with the methods in which the traditional EMD is computed using EEMD [8] and MEMD [9]. The comparative results obtained by the three EMD methods for a wide range of SNRs are illustrated in Figure 10; only white noise is taken into consideration.
It is found that the EEMD based approach exhibits lower performance than the traditional EMD for white noise, whereas a slight improvement is achieved with the MEMD based implementation of standard EMD. One reason for the improved result of the MEMD based approach is that the noise assisted MEMD fully uses the dyadic filter bank property of MEMD to implement traditional EMD. It does not suffer from the mode-mixing problem, which improves the denoising results. The improvement of the other EMDs (e.g., EEMD and MEMD) is more prominent at lower SNRs, that is, for highly noise contaminated speech signals.

Conclusions
In this paper, we presented a hybrid speech enhancement method based on DCT and EMD. In order to provide robustness to different noise types, a DCT soft thresholding strategy with a subband approach is proposed in the first stage of the algorithm. Furthermore, a novel limit for frame categorization was given in order to better identify the noise components. In the second stage, we proposed an EMD domain soft thresholding strategy in order to remove the noise components remaining in the signal enhanced by the first stage.
One of the main advantages of the method is that it does not require any prior knowledge of the noise signal. Its robustness to different noise types is another strength of the method. The major drawback of the algorithm is its time cost. Since a mathematical representation is not yet given for EMD, the decomposition must be computed iteratively and takes a long time. Therefore, the algorithm is not applicable to real time speech processing.
The algorithm can be further improved by adopting an optimum value calculation for the number of subbands. This can be achieved by analyzing the spectral distribution of the noise signal, which can be obtained from the speechless parts of the noisy speech.

Figure 2: The calculated value of $\lambda$ in noise-dominant frequency bins.

Figure 5: The effect of $\lambda$ on the SNR improvement results at different input SNRs.

Figure 6: Spectrogram of (a) the clean speech, (b) the noisy speech corrupted with white noise at 10 dB SNR, (c) the recovered speech after soft thresholding with subband DCT, and (d) the overall recovered speech of the UEMD based method.
Figure 3: The illustration of EMD. A noisy speech signal at 10 dB SNR and its first 8 IMFs out of 14, plus a residue signal which can be observed to be close to a constant.

Table 1: Comparison of the SNR, AvgSegSNR, and PESQ improvements of different denoising methods for a wide range of SNR values (white noise).

Table 2: Comparison of overall SNR, average segmental SNR (AvgSegSNR), and PESQ improvements of different denoising methods for pink and HF channel noise.

The results for the AvgSegSNR are listed in Table 1(B), which again shows the superiority of the UEMD based algorithm at all SNRs. In order to have a better idea about the perceptual quality of the enhanced speech signals, PESQ has been used. Regarded as one of the best algorithms for estimating the results of a subjective test, PESQ returns a score between −0.5 and 4.5, with higher scores indicating better quality. The PESQ results can be observed in Table 1(C). It can be observed that the UEMD based algorithm is still more effective in terms of perceptual quality than the other methods. In order to demonstrate the robustness of the algorithm to different noise types, extensive computer simulations were conducted with pink and high frequency (HF) channel noise.