Speech Enhancement Using Joint DNN-NMF Model Learned with Multi-Objective Frequency Differential Spectrum Loss Function



Introduction
Speech enhancement is the task of separating the target speech from unwanted noise. Speech enhancement methods generally fall into statistical methods and data-driven, learning-based methods. Statistical approaches such as the minimum mean-square error (MMSE) method [1] and Wiener filtering [2] are based on statistical models of speech and noise. Non-negative matrix factorization (NMF), a well-known method in this category, has recently been widely used in speech separation [3]. With NMF, a speech or noise signal can be decomposed into a non-negative basis matrix and an activation matrix. Then, for speech enhancement, in the testing phase, the concatenated learned basis matrices of speech and noise are applied to an unknown noisy speech signal to estimate the related activation matrices. The estimated activations are multiplied by the related learned basis matrices to approximate the speech and noise sources. In addition, extracting noise-robust features is another approach for reducing the effects of noise on speech signals [4]. Lately, data-driven learning-based methods such as deep learning have also been widely used to model complex mappings, such as the nonlinear mapping from noisy speech to clean speech, in speech enhancement and speech recognition applications [5][6][7][8][9][10]. Training targets in data-driven methods are mostly either the spectral magnitudes of the sources (mapping-based targets) or spectral masks (masking-based targets), which are gain values that represent the time-frequency (T-F) energy ratio of each source to the mixture and are multiplied with the mixture of speech and noise to estimate each source [11,12].
Moreover, in some research works, NMF or its extended versions are combined with deep neural networks (DNNs) to improve performance [13][14][15][16][17][18][19][20]. In Kang et al.'s [13] study, a DNN maps the spectral magnitude of the noisy speech to the NMF activation coefficients of speech and noise. The estimated coefficients are then multiplied by the corresponding learned basis matrices separately, outside the DNN, to approximate the actual signals. In Vu et al.'s [17] and Jia et al.'s [19] studies, instead of the noisy spectrum itself, the noisy activation matrix, i.e., the concatenation of the speech and noise activation matrices, is used as the noisy input feature of the DNN. Furthermore, in Wang and Wang's [20] study, NMF is first applied to an ideal ratio mask (IRM), which is decomposed into a basis matrix and an activation matrix. Then, instead of directly predicting a mask as the DNN target, the DNN estimates the related activation coefficients as an intermediate target.
The estimated activation matrix and the learned basis matrix of the IRM are then linearly combined outside the DNN to reconstruct the IRM, and the estimated IRM separates the desired speech from the noisy mixture. On the other hand, in Williamson et al.'s [21][22][23] and Grais et al.'s [24] studies, DNN and NMF are combined in two consecutive, separate stages: the DNN is applied in the first stage for separation and NMF in the second stage for enhancement, or vice versa. In Williamson et al.'s [21][22][23] studies, NMF reconstruction is used as a postprocessing step to enhance the speech separated by the DNN-estimated mask. In Williamson et al.'s [25,26] studies, in contrast to Williamson et al.'s [21] study, a DNN is used in the second stage as an alternative to NMF to estimate the activation matrix of clean speech from the first-stage masked speech. The estimated coefficients are then multiplied by the pretrained basis matrix separately, outside the DNN, to obtain the enhanced speech.
However, in the approaches mentioned above, the NMF and DNN processes are carried out separately. Also, the DNN does not directly estimate the main targets but only an intermediate target, namely the NMF activation coefficients. Therefore, in Nie et al.'s [14,15] and Li et al.'s [16] studies, the NMF and DNN processes are combined jointly, so that the learned NMF bases are integrated into the DNN as an extra layer and the main objective signals are directly estimated by the DNN. However, in these methods, the activation coefficients have no direct effect on the DNN learning process and are not directly optimized by the DNN. Hence, in this paper, we propose using the activation coefficients in the network as prior knowledge in a multi-objective, multi-loss training approach: the extracted activation coefficients are injected as a direct target at an intermediate output (prior) layer of the DNN and in the loss function, while the original signals are simultaneously estimated by the DNN at the main output layer.
The training loss function is also an important aspect of speech enhancement algorithms. The traditional mean-squared error (MSE) is widely used as the training loss function in spectral speech enhancement. However, the MSE does not take spectral changes into account. Owing to the unique characteristics of each individual's sound source and vocal tract, the pitch frequency, the differences between frequency bins, and the way the amplitude changes along the frequency axis differ from frame to frame. Consequently, incorporating the changes between frequency bins within a time frame into the loss function can improve network learning and the performance of speech and noise separation. In addition, as described in Chen et al.'s [27] study and according to our observations, in the differentiated spectrum representation, the spectral peaks that carry valuable information are kept almost intact, while the smooth parts of the spectrum become zero. Thus, to take into account the peak movements in the frequency domain in addition to the pitch information, we propose a frequency-differentiated loss function, which is added to the multi-objective MSE loss function. In the MSE terms, multi-objective training is applied so that, in addition to one MSE for the estimation of the actual signals, another MSE is included for the related NMF activation coefficients.
Another issue of interest is that most speech enhancement models are trained on a pool of different noise types, and these general models are then used in the testing phase to enhance every observed noisy speech signal. In this paper, to obtain greater improvement, our joint models are trained separately for each training noise type, and in the testing phase, a noise classification and fusion (NCF) approach selects one model, or a suitable combination of the multiple learned models, to enhance each detected noise.
The rest of this paper is organized as follows. Section 2 gives an overview of NMF-based speech enhancement. Section 3 explains the proposed system, including the Jnt-DNN-NMF model, the proposed loss functions, and the noise classification and fusion approach. Section 4 presents the experimental setup and results. Finally, Section 5 concludes the paper.

NMF-Based Speech Enhancement
In the NMF approach, a non-negative data matrix, which in our work is the magnitude spectrum $X \in \mathbb{R}_{\geq 0}^{F \times T}$, is decomposed into a non-negative basis matrix $B_x \in \mathbb{R}_{\geq 0}^{F \times K}$ ($K \leq F$) and an activation matrix $H_x \in \mathbb{R}_{\geq 0}^{K \times T}$ according to Equation (1), where K, T, and F are the number of basis vectors (columns of $B_x$), time frames, and frequency bins, respectively. The basic structures of X are captured in the basis matrix. X can be the clean speech S, the noisy speech Y, or the noise N:

$$X \approx B_x H_x \qquad (1)$$

The matrices $B_x$ and $H_x$ are extracted with multiplicative update rules that iteratively minimize the Kullback-Leibler (KL) divergence between the observed signal X and its reconstruction $B_x H_x$ (Equations (2) and (3)). In the testing phase, the magnitude of a test noisy speech signal is approximated as the product of the fixed matrix $B_y = [B_s \; B_n]$, the concatenation of the trained speech and noise bases, and a new activation matrix $\hat{H}_y$, which is calculated iteratively by Equation (3). Finally, the estimated speech and noise magnitudes are obtained from the learned basis matrices and the corresponding estimated activations.
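The training and testing steps described above can be illustrated with a short NumPy sketch. It uses the standard Lee-Seung multiplicative updates for the KL divergence, which is an assumption about the exact form of Equations (2) and (3); the function names (`kl_nmf`, `kl_div`, `enhance`), the random initialization, and the small constant `EPS` are hypothetical choices made for illustration only.

```python
import numpy as np

EPS = 1e-10  # assumed small constant to avoid division by zero


def kl_nmf(X, K, n_iter=50, B=None, seed=0):
    """KL-divergence NMF with multiplicative updates (standard form, assumed).

    If a basis matrix B is supplied, it is held fixed and only the
    activations H are updated (testing phase); K is then ignored.
    """
    rng = np.random.default_rng(seed)
    F, T = X.shape
    update_B = B is None
    if update_B:
        B = rng.random((F, K)) + EPS
    H = rng.random((B.shape[1], T)) + EPS
    ones = np.ones_like(X)
    for _ in range(n_iter):
        R = X / (B @ H + EPS)                 # element-wise ratio X / (BH)
        H *= (B.T @ R) / (B.T @ ones + EPS)   # activation update
        if update_B:
            R = X / (B @ H + EPS)
            B *= (R @ H.T) / (ones @ H.T + EPS)  # basis update
    return B, H


def kl_div(X, X_hat):
    """Generalized KL divergence between X and its reconstruction."""
    return float(np.sum(X * np.log((X + EPS) / (X_hat + EPS)) - X + X_hat))


def enhance(Y, B_s, B_n, n_iter=50):
    """Testing phase: fixed B_y = [B_s B_n], new activations, reconstruction."""
    B_y = np.hstack([B_s, B_n])
    _, H_y = kl_nmf(Y, B_y.shape[1], n_iter=n_iter, B=B_y)
    K_s = B_s.shape[1]
    S_hat = B_s @ H_y[:K_s]   # speech estimate from the speech activations
    N_hat = B_n @ H_y[K_s:]   # noise estimate from the noise activations
    return S_hat, N_hat
```

In this sketch the speech and noise bases would be obtained by calling `kl_nmf` on clean speech and noise magnitudes separately, discarding the returned activations, and passing the two bases to `enhance` together with the noisy magnitude.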

Proposed System
Since the effect of phase enhancement is not significant in speech improvement, we only use the short-time Fourier transform (STFT) magnitude spectrum of the framed signals for enhancement. As shown in Figure 1, NMF and Jnt-DNN-NMF training are performed for each training noise type to produce the noise-specific Jnt-DNN-NMF models that are used in the testing phase (the dashed boxes in the bottom part of Figure 1). In other words, the repeated dashed boxes in Figure 1 are the learned Jnt-DNN-NMF models for the N training noise types and follow the same approach as dashed box 1 (for noise 1). The classifier DNN is trained with different noisy speech magnitudes as input and N output class labels. In the testing phase, the noise type (matched or mismatched) of each input noisy speech signal is detected based on the classifier results (Section 3.3). Then, according to Figure 1, after the N different Jnt-DNN-NMF models have made their predictions, the fusion block uses only the model corresponding to the detected noise to enhance each matched noise. For mismatched noises, however, a weighted combination of the outputs of the N models is taken as the enhanced speech. Finally, an inverse STFT followed by the overlap-add method is applied to reconstruct the waveform of the desired signal from the estimated magnitude and the noisy phase. It should be noted that in the training phase, each model is trained on noise-specific data, which is a smaller dataset, and in the testing phase, the multiple models are applied to the input noisy speech simultaneously and in parallel, so the computational load is light.
3.1. Jnt-DNN-NMF Model. According to Figure 1, in the NMF training stage, the structures in the magnitude spectra of the speech and noise sources are first captured by applying NMF to speech and to each noise type independently, as a feature and structure extraction process. This yields the corresponding activation coefficients and basis matrices: the bases are trained first, and the coefficients are then extracted with the bases held fixed. As shown in Figure 1, the extracted NMF activation coefficients and basis matrices are employed in the next stage (the Jnt-DNN-NMF training stage). The extracted activation coefficients ($H_s$, $H_n$) directly serve as the primary target features for the DNN (dashed lines), while the spectral magnitude of the noisy speech (Y) is the input. The trained basis matrices are integrated into the DNN as an additional layer named the NMF reconstruction layer. The DNN, together with the integrated NMF reconstruction and Wiener-like filtering layers, forms the multi-objective Jnt-DNN-NMF model, which jointly optimizes the main spectral magnitudes at the main output layer and the related NMF coefficients at the coefficient output layer. Thus, in the Jnt-DNN-NMF training stage, the joint model is trained with the noisy magnitude Y as input and with multi-objective targets, namely the activation coefficients at the coefficient output layer and the speech and noise magnitudes at the main output layer, using the proposed loss functions (Section 3.2). In the mapping function of the DNN (g), $W^{*}_{j}$ and $b^{*}_{j}$ are the weights and biases of the DNN, respectively, and J is the index of the coefficient output layer. $\hat{H}_s$ and $\hat{H}_n$ represent the estimated activation coefficients of speech and noise, respectively. In the NMF reconstruction layer, the speech and noise basis matrices are multiplied by the estimated coefficients, and then, through a Wiener filtering layer, the final estimated speech and noise magnitudes $\tilde{S}$ and $\tilde{N}$ are obtained. The division and multiplication operations in this layer are element-wise. Jnt-DNN-NMF is trained with the proposed loss functions (Section 3.2), and the weight and bias parameters are computed by the backpropagation algorithm.
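For reference, the NMF reconstruction and Wiener-like filtering step can be written compactly. The expressions below are a sketch of the usual form of such a layer, inferred from the description above (basis matrices multiplied by estimated coefficients, followed by element-wise division and multiplication with the noisy magnitude Y); the exact form and numbering of the original equations are an assumption. Here the fraction is taken element-wise and $\odot$ denotes the element-wise product.

```latex
\tilde{S} = \frac{B_s \hat{H}_s}{B_s \hat{H}_s + B_n \hat{H}_n} \odot Y,
\qquad
\tilde{N} = \frac{B_n \hat{H}_n}{B_s \hat{H}_s + B_n \hat{H}_n} \odot Y
```

Under this form the two estimates always sum to the noisy magnitude, $\tilde{S} + \tilde{N} = Y$, which is the property that distinguishes the filtered outputs from the raw reconstructions $B_s \hat{H}_s$ and $B_n \hat{H}_n$.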

3.2. Proposed Loss Functions.
In most traditional DNN-based speech enhancement methods, the learning process is a direct mapping from the noisy signal to the actual separation targets, and structural features are neither used as prior knowledge nor allowed to influence the DNN and its training directly. So, in our Jnt-DNN-NMF model, we first propose a multi-objective combined loss function ($L_{MO}$) that optimizes not only the actual spectral signals of speech and noise but also the intermediate activation coefficients. Then, to consider the spectrum changes in the frequency domain, according to Equation (9), we use a frequency-differentiated loss function ($L_{FD}$), which calculates the amplitude differences between neighboring frequency bins in each frame. Using this function allows the network to gain a better understanding of the frequency characteristics and of the changes in the frequency domain. It calculates the MSE between the target signal and the estimated signal with respect to the frequency changes in each frame.
In Equation (9), M is the number of neighboring frequency bins that are involved in the calculation of the cost function for a given frequency bin in each frame. Then, as shown in Equation (10), we propose the MSE-based multi-objective frequency-differentiated loss function ($L_{MOFD}$), which is a weighted combination of the frequency-differentiated loss function $L_{FD}$ and the multi-objective combined loss function (the last two terms). These two terms are the MSEs related to the objective spectral signals and to the NMF activation coefficients, respectively. This leads to the simultaneous optimization of the encoded output features and the original spectral signals in a single model. Indeed, the encoded features directly affect the learning process through a separate optimization term in the overall loss function $L_{MOFD}$. Therefore, the joint model is trained on two types of targets at the related output layers. In Equation (10), $\alpha_1$ and $\alpha_2$ are the weight parameters for $L_{FD}$ and the first MSE, respectively.
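To make the structure of these loss terms concrete, the following NumPy sketch shows one plausible reading of them. The exact definitions in Equations (9) and (10) are not reproduced here, so the form of `l_fd` (an MSE between the bin-to-bin differences of target and estimate, summed over offsets 1 to M) is an assumption, as are all function names. The default weights 2.3 and 0.1 and M = 2 are the values reported in Section 4.1.

```python
import numpy as np


def mse(target, estimate):
    """Mean squared error over all entries."""
    return float(np.mean((target - estimate) ** 2))


def freq_diff(spec, m):
    """Difference between frequency bins f+m and f in every frame.

    spec has shape (F, T): frequency bins by time frames.
    """
    return spec[m:, :] - spec[:-m, :]


def l_fd(target, estimate, M=2):
    """Assumed frequency-differentiated loss: MSE of spectral differences
    to the M neighboring bins, summed over the offsets m = 1..M."""
    return sum(mse(freq_diff(target, m), freq_diff(estimate, m))
               for m in range(1, M + 1))


def l_mo(S, N, S_est, N_est, H, H_est):
    """Multi-objective loss: spectral MSE plus activation-coefficient MSE."""
    return mse(S, S_est) + mse(N, N_est) + mse(H, H_est)


def l_mofd(S, N, S_est, N_est, H, H_est, alpha1=2.3, alpha2=0.1, M=2):
    """Weighted combination: alpha1 * L_FD + alpha2 * spectral MSE
    + coefficient MSE (weights follow the description of Equation (10))."""
    fd = l_fd(S, S_est, M) + l_fd(N, N_est, M)
    spectral = mse(S, S_est) + mse(N, N_est)
    return alpha1 * fd + alpha2 * spectral + mse(H, H_est)
```

One property visible in this sketch is that a constant offset added to every frequency bin of an estimate leaves `l_fd` at zero while the plain MSE grows, so the two terms penalize different kinds of error: the MSE tracks absolute level, and the differential term tracks the shape of the spectrum along frequency.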

Experimental Setup and Results
The performance of the proposed system is compared with the following methods:
(i) NMF [28]: the NMF-based speech enhancement explained in Section 2.
(ii) DNN-Mag [29]: the traditional DNN-based speech enhancement, where a DNN maps the spectral magnitude of noisy speech to the spectral magnitude of clean speech.
(iii) LSTM-Mask [7,30]: a long short-term memory (LSTM) network maps the noisy speech magnitude to the IRM mask values. The estimated mask values are then multiplied by the noisy speech to estimate the sources.
(iv) CRN-Mag [31]: a convolutional-recurrent network (CRN) is used with the mapping-based magnitude target. The CRN is composed of a CNN encoder-decoder and LSTM layers, and its architecture is set similar to [31].
(v) DNN-NMF-Sep [13]: a separate combinatorial model of DNN and NMF, where the DNN maps the noisy speech to the NMF activation coefficients and the reconstruction of the main objective signals is performed separately, outside the DNN.
(vi) Jnt-DNN-NMF [15]: a joint combinatorial model of DNN and NMF, where the DNN optimizes the objective signals. However, the activation coefficients are not directly incorporated into the DNN structure and learning process.

IET Signal Processing
We denote the proposed Jnt-DNN-NMF model trained with the multi-objective loss function $L_{MO}$ as "Jnt-DNN-NMF-MO" and the model trained with the multi-objective frequency-differentiated loss function $L_{MOFD}$ as "Jnt-DNN-NMF-MOFD."
The proposed and comparison methods are trained and evaluated on the TIMIT dataset [32], which consists of 6,300 utterances. We randomly select 200 clean speech utterances from the TIMIT training set and corrupt them with babble, factory, and machinegun noises from the NOISEX-92 corpus [33] at SNRs from −5 to 20 dB in steps of 5 dB. Our test set includes 60 different utterances from the TIMIT test set, which are corrupted with the training noises as matched noises and with the real-world recorded factorymachine and windshieldrain noises from the Freesound data as mismatched noises, at SNRs from −5 to 10 dB. The baselines are trained and evaluated on the same training and testing datasets as the proposed methods.
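Corrupting a clean utterance with noise at a prescribed SNR amounts to rescaling the noise before adding it. The sketch below shows one common way to do this; the function name `mix_at_snr` and the choice to repeat or truncate the noise to the utterance length are assumptions, not details given in the paper.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals snr_db.

    The noise is repeated or truncated to the speech length (an assumed
    convention). Returns the noisy mixture and the scaled noise.
    """
    noise = np.resize(noise, speech.shape)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Solve 10*log10(p_speech / (gain**2 * p_noise)) = snr_db for gain.
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    scaled = gain * noise
    return speech + scaled, scaled
```

With this helper, the training condition described above would correspond to calling `mix_at_snr` for each utterance and noise type with `snr_db` taken from −5, 0, 5, 10, 15, and 20.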
We use a 512-point STFT for the waveforms sampled at 16 kHz, with a 512-sample (32 ms) frame length, a 512-sample (32 ms) Hamming window, and a 128-sample (8 ms) frame shift. The symmetric part of the STFT coefficients is discarded, so the dimension of our spectral magnitude matrices is 257 × the number of frames.
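The framing parameters above determine the size of the magnitude matrices. A minimal NumPy sketch of the analysis step follows; the function name `stft_magnitude` and the absence of padding at the signal edges are assumptions made for illustration.

```python
import numpy as np


def stft_magnitude(x, n_fft=512, hop=128):
    """Magnitude and phase of a Hamming-windowed STFT.

    Uses 512-sample frames with a 128-sample shift and keeps only the
    non-symmetric half of the spectrum (n_fft // 2 + 1 = 257 bins).
    No edge padding is applied (an assumed convention).
    """
    window = np.hamming(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop: i * hop + n_fft] * window for i in range(n_frames)],
        axis=1,
    )                                     # shape (n_fft, n_frames)
    spec = np.fft.rfft(frames, n=n_fft, axis=0)
    return np.abs(spec), np.angle(spec)   # each of shape (257, n_frames)
```

For a one-second signal at 16 kHz (16,000 samples), this gives 1 + (16000 − 512) / 128 = 122 frames, so the magnitude matrix is 257 × 122. The returned phase is the noisy phase that would be reused at synthesis time.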
4.1. DNN and NMF Parameters. The NMF ranks of the speech and noise basis matrices in all the baseline and proposed NMF-based methods are empirically set to 100 each ($K_s = K_n = 100$). So, the size of each basis matrix is 257 × 100 (frequency bins × number of bases). The maximum number of NMF iterations is set to 50.
The DNN in all baseline and proposed models has four hidden layers of 1,024 units each, for a fair comparison. It should be noted that the main ideas of the baseline DNN-Mag, DNN-NMF-Sep, and Jnt-DNN-NMF methods are taken from Huang et al.'s [29], Kang et al.'s [13], and Nie et al.'s [15] studies, respectively, while the network topology and configurations are set according to our proposed models for a fair comparison. In all methods, the input layer has 257 nodes, matching the size of the noisy magnitude spectrum. The coefficient output layer, for the activation labels, has 100 × 2 = 200 nodes, and the main output layer, for the spectral magnitude labels, has 257 × 2 = 514 nodes. The activation functions of the hidden layers and the main output layer are the leaky rectified linear unit (LReLU) [34] with $\alpha = 0.1$ ($f(x) = \max(\alpha x, x)$) and linear, respectively. The activation function of the coefficient output layer is ReLU ($f(x) = \max(0, x)$) because the activation coefficients are non-negative. The classifier DNN has two hidden layers of 1,024 units with the ReLU function and an output layer of three units with the softmax activation function for the three classes. The softmax output is a probability distribution whose entries lie in the [0, 1] range and sum to 1. Batch normalization is used after each hidden layer for faster training convergence. The baseline LSTM-Mask network has two LSTM layers of 3,072 LReLU units, a fully connected (FC) layer of 1,024 LReLU units, and a fully connected output layer of 257 linear units for mask value prediction. This configuration follows the LSTM part in [7] and the LSTM-IRM method [30], which was used as a comparison method in Strake et al.'s [7] study. However, instead of the 425 nodes used in Strake et al.'s [7] study, the two LSTM layers here are experimentally set to 3,072 nodes to obtain better results.
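The layer sizes above can be checked with a small forward-pass sketch in NumPy. This is only an illustration of the data flow and dimensions under stated assumptions: weights are random, batch normalization is omitted, the Wiener-like filtering is written in its usual ratio form, and all names (`init_params`, `forward`) are hypothetical rather than taken from the paper.

```python
import numpy as np


def lrelu(x, alpha=0.1):
    """Leaky ReLU, f(x) = max(alpha * x, x)."""
    return np.maximum(alpha * x, x)


def relu(x):
    """ReLU, f(x) = max(0, x); keeps the activation coefficients non-negative."""
    return np.maximum(0.0, x)


def init_params(rng, sizes):
    """Random (W, b) pairs for consecutive layer sizes (illustrative init)."""
    return [(rng.normal(0.0, 1.0 / np.sqrt(n_in), (n_out, n_in)),
             np.zeros((n_out, 1)))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def forward(Y, params, B_s, B_n, eps=1e-8):
    """Noisy magnitude Y (257, T) -> coefficients (200, T) and outputs (514, T)."""
    h = Y
    for W, b in params[:-1]:
        h = lrelu(W @ h + b)            # four hidden layers of 1,024 units
    W, b = params[-1]
    H = relu(W @ h + b)                 # coefficient output layer (200 nodes)
    K_s = B_s.shape[1]
    S_rec = B_s @ H[:K_s]               # NMF reconstruction layer (speech)
    N_rec = B_n @ H[K_s:]               # NMF reconstruction layer (noise)
    denom = S_rec + N_rec + eps
    S_est = S_rec / denom * Y           # Wiener-like filtering, element-wise
    N_est = N_rec / denom * Y
    return H, np.vstack([S_est, N_est])  # main output: 257 * 2 = 514 rows
```

A model with the reported topology would use `init_params(rng, [257, 1024, 1024, 1024, 1024, 200])` together with two 257 × 100 basis matrices, so that both outputs needed by the multi-objective loss are produced in a single pass.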
The classifier DNN uses the cross-entropy loss function, and the proposed model uses the $L_{MO}$ and $L_{MOFD}$ loss functions. All networks are trained with the Adam optimizer [35] with an initial learning rate of 0.001 and a maximum of 100 epochs. The weights $\alpha_1$ and $\alpha_2$ in Equation (10) and the parameter M in Equation (9) are experimentally set to 2.3, 0.1, and 2, respectively. N in Figure 1 is equal to 3 because there are three training noise classes. Moreover, to avoid overfitting, early stopping based on the minimum validation loss is used in all models.

4.2. Results and Discussion. This section presents the results of the proposed methods and baselines, evaluated with three metrics: perceptual evaluation of speech quality (PESQ) [36], short-time objective intelligibility (STOI) [37], and frequency-weighted segmental SNR ($\mathrm{SNR}_{fw}$) [38,39].
First, the classification results for the training noise types are displayed in Table 1, which indicates that the training noises are classified appropriately. The classifier DNN assigns the mismatched factorymachine and windshieldrain noises the rates (0.64, 0.22, 0.14) and (0.80, 0.18, 0.02), respectively. So, based on these prediction ratios ($w_1$, $w_2$, $w_3$ in Figure 1), a proportional contribution of the corresponding models is used for enhancement.
The average improvements in PESQ (ΔPESQ), STOI, and $\mathrm{SNR}_{fw}$ obtained by the different methods over the matched noise types are displayed for each input SNR in Figures 2-4, respectively.
As can be seen in Figures 2-4, the proposed Jnt-DNN-NMF-MO and Jnt-DNN-NMF-MOFD methods outperform the baseline methods. The superiority of Jnt-DNN-NMF-MO over Jnt-DNN-NMF [15] in all three metrics is due to the use of the extracted speech and noise NMF activation coefficients as direct intermediate targets of the DNN; these coefficients are structural features and act as prior knowledge for DNN training. In fact, incorporating these coefficients into the loss function in addition to the main signals has led to the improvement. The superior performance of Jnt-DNN-NMF-MO over the other baseline methods is due to the joint learning of the integrated model of the DNN and NMF bases and the use of the structural NMF features for the DNN. According to Figures 2-4, in all three metrics, the Jnt-DNN-NMF-MOFD method outperforms Jnt-DNN-NMF-MO as well as the baselines, which demonstrates the strength of the proposed frequency-differentiated loss term. The results in Figure 2 show that the proposed methods (the last two methods) achieve higher PESQ improvements at each input SNR than the previous methods. The proposed Jnt-DNN-NMF-MOFD method reaches an average PESQ improvement about 0.15 higher than the best baseline (Jnt-DNN-NMF [15]) and 1.10 over the noisy speech. The increases in the STOI and $\mathrm{SNR}_{fw}$ scores are also larger for the proposed methods than for the baselines (Figures 3 and 4). Better extraction of frequency characteristics in the Jnt-DNN-NMF-MOFD method has led to improved performance. Furthermore, to evaluate the generalization ability, the average ΔPESQ results of the proposed and comparison methods over the mismatched noise types are given in Figure 5 for each input SNR. According to this figure, the proposed methods give better results than the others, with an average PESQ improvement 0.11 higher than the best baseline (Jnt-DNN-NMF [15]).
To examine the enhancement performance on more mismatched noise types, the average ΔPESQ results of the proposed Jnt-DNN-NMF-MOFD method and the baseline LSTM-Mask method [7,30] over two other mismatched noises, restaurant and street, are given in Figure 6 for each input SNR. The restaurant and street noises are taken from the Aurora-2 database [40]. These noises have different properties and structures from the previous matched and mismatched noises. The estimated classification rates of the restaurant and street noises are approximately (0.11, 0.71, 0.18) and (0.18, 0.28, 0.54), respectively. According to this figure, the performance trend is similar to that for the previous mismatched noises, indicating the effectiveness and generalization capability of the suggested approach. Moreover, to investigate the effect of noise classification (NC), the average results of the final proposed Jnt-DNN-NMF-MOFD method with and without NC are reported in Table 2 for matched and mismatched noises. The better results with NC than without NC indicate the benefit of using the noise-specific models and the fusion strategy.
Finally, the magnitude spectrograms of the speech estimated by the proposed Jnt-DNN-NMF-MOFD method and by the baseline LSTM-Mask method are given in Figure 7 as examples. As can be seen, Jnt-DNN-NMF-MOFD restores speech components and removes noise parts better than LSTM-Mask. One reason for this result is the joint cooperation of NMF and DNN and the direct effect of the NMF activation coefficients on the DNN as structural intermediate target features. The joint estimation of the actual spectral targets and the activation coefficients by the DNN as multi-objective joint learning, together with the consideration of the frequency-domain spectrum changes in the loss function, are the main reasons for this result.

Conclusion
We proposed a joint multi-objective model of NMF and DNN with new loss functions for speech enhancement. With the proposed multi-objective loss function ($L_{MO}$), the NMF activation coefficients are estimated by the DNN simultaneously with the objective spectral signals. Setting the NMF activation coefficients as a direct target of the DNN and integrating the NMF speech and noise bases and Wiener filters with the DNN layers lead to further improvement. This is due to the extraction of the harmonic structures by NMF and the direct incorporation of the extracted structural characteristics into the DNN structure. Then, to consider and preserve the frequency-domain changes of the speech and noise spectra, we proposed a frequency-differentiated loss function ($L_{FD}$) that considers the spectrum differences between adjacent frequency bins. Finally, to improve the enhancement results, we proposed a multi-objective frequency-differentiated loss function ($L_{MOFD}$) to optimize the Jnt-DNN-NMF model; it is a weighted combination of the frequency-differentiated loss function and two MSEs related to the actual spectral signals and the NMF activation coefficients.
As shown in Figure 1, the proposed system operates in two phases, training and testing. The training phase consists of NMF training, Jnt-DNN-NMF training, and classifier DNN training. Jnt-DNN-NMF is the joint cooperative model of DNN and NMF, which is explained in Section 3.1. The testing phase consists of the classifier DNN prediction and the Jnt-DNN-NMF prediction for the test data. The NMF and Jnt-DNN-NMF training parts are two consecutive stages: NMF training acts as pretraining for Jnt-DNN-NMF training, so that the results of NMF training are used in Jnt-DNN-NMF training. This is also described in Section 3.1.

FIGURE 1: Block diagram of the proposed system including training and testing phases. The NMF and Jnt-DNN-NMF training is repeated for each noise type. The classification and fusion blocks are applied in the testing phase to select one model (for matched noises) or combine the outputs of the N models that have already been trained for the N training noise types (for mismatched noises). Each dashed box in the testing phase is the learned model for one training noise type.

FIGURE 7: The magnitude spectrograms of noisy speech contaminated with factory noise at −5 dB SNR (a) and clean speech (b); the estimated speech by LSTM-Mask (c) and by the proposed Jnt-DNN-NMF-MOFD method (d).
In the NMF update rules of Section 2, the subscripts denote the frequency and time indices. When NMF is used for speech enhancement, in the training phase, $B_x$ and $H_x$ for clean speech and noise are usually randomly initialized and then obtained using the iterative multiplicative update rules. $H_x$ is discarded and $B_x$ is held fixed for the enhancement stage. The noisy basis matrix $B_y$ is formed by concatenating the trained basis matrices of clean speech and noise ($B_y = [B_s \; B_n]$).

3.3. Noise Classification and Fusion Approach. According to Figure 1, in the testing phase, to judge the noise type, a classifier DNN that has already been trained to classify the N training noise types is first used to estimate the similarity rates of each observed noisy speech signal to the training noise classes. The noise type (matched or mismatched) is determined as follows: if one of the estimated rates is greater than a high threshold (set to 0.90), the noisy speech is regarded as one of the training noisy mixtures, i.e., a matched condition; otherwise, it is a mismatched condition. Then, in the fusion block, for matched noises, the enhanced speech is obtained from the output of only the learned model corresponding to the detected noise. For mismatched noises, however, the final result is calculated as a weighted combination of the outputs of the multiple models, where the weights are the corresponding classification rates.
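The matched/mismatched decision and the fusion rule described above can be sketched in a few lines of NumPy. The function name `fuse` and the array layout (one enhanced magnitude per noise-specific model, stacked along the first axis) are assumptions for illustration; the 0.90 threshold and the use of the classification rates as fusion weights follow the text.

```python
import numpy as np


def fuse(rates, model_outputs, threshold=0.90):
    """Select or combine the outputs of the N noise-specific models.

    rates:         N classification rates from the classifier DNN.
    model_outputs: array of shape (N, F, T), one enhanced magnitude per model.
    Returns the fused magnitude (F, T) and the detected condition.
    """
    rates = np.asarray(rates, dtype=float)
    outputs = np.asarray(model_outputs, dtype=float)
    best = int(np.argmax(rates))
    if rates[best] > threshold:
        # Matched noise: use only the model of the detected noise type.
        return outputs[best], "matched"
    # Mismatched noise: weighted sum of all model outputs, weights = rates.
    return np.tensordot(rates, outputs, axes=1), "mismatched"
```

For example, with the rates (0.64, 0.22, 0.14) reported for the factorymachine noise, no rate exceeds 0.90, so the three model outputs would be blended with exactly those weights.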