Deep Learning-Based Amplitude Fusion for Speech Dereverberation

Mapping and masking are two important deep learning-based speech enhancement approaches that aim to recover the original clean speech from corrupted speech. In practice, excessively large recovery errors severely restrict the improvement in speech quality. In our preliminary experiment, we demonstrated that the mapping and masking methods have different conversion mechanisms and therefore hypothesized that their recovery errors are highly likely to be complementary; this complementarity was then validated accordingly. Based on the principle of error minimization, we propose fusing mapping and masking for speech dereverberation. Specifically, we take the weighted mean of the amplitudes recovered by the two methods as the estimated amplitude of the fusion method. Experiments verify that the recovery error of the fusion method is further reduced. Compared with the existing geometric mean method, the proposed weighted mean method achieves better results. Speech dereverberation experiments show that the weighted mean method improves PESQ and SNR by 5.8% and 25.0%, respectively, compared with the traditional masking method.


Introduction
In real-world speech environments, original clean speech is often corrupted by reverberation, which seriously damages speech quality and reduces the performance of automatic speaker recognition [1] and automatic speech recognition (ASR) [2,3]. When the reverberation is too heavy, human hearing also suffers severe interference [4]. Speech enhancement refers to processing corrupted speech to obtain the underlying clean speech, thereby improving speech quality. It mainly includes speech dereverberation and speech denoising; this paper focuses on speech dereverberation. In the early stage, unsupervised speech enhancement methods were usually used to improve corrupted speech. The traditional speech dereverberation method is mainly the weighted prediction error (WPE) method [5,6], but the improvement it provides is rather limited. With the rapid development of deep learning in recent years, supervised speech enhancement methods have emerged. Two types of these methods are especially important. One is called mapping [7-10], which uses a deep neural network (DNN) to directly predict the clean speech feature from the corrupted speech feature (MAP). The feature here usually refers to the logarithmic power spectrum (LPS), which conforms to human auditory rules and is beneficial to the learning of the DNN. The other is called masking [11,12], which predicts an intermediate state from the corrupted speech feature; the clean speech is then obtained from the known corrupted speech and the predicted intermediate state. Because DNNs have strong learning ability, mapping and masking methods can effectively reduce the reverberation components in reverberant speech. The masking methods include the ideal binary mask (IBM) [13,14], the ideal ratio mask (IRM) [15,16], the ideal amplitude mask (IAM) [11,17] (known as FFT-MASK in [11]), the phase-sensitive mask (PSM) [18], and the complex ideal ratio mask (cIRM) [19-22].
IRM takes advantage of the incoherence between clean speech and additive noise to approximate the power of the corrupted speech as the sum of the powers of the original clean speech and the additive noise, thus limiting the IRM training target to [0, 1]. This facilitates the training of the DNN and has made IRM a great success in reducing additive noise. The authors of [23] pointed out that MAP and IRM are complementary in speech denoising under different signal-to-noise ratios (SNRs); that is, MAP is better than IRM when the SNR is low, and MAP is worse than IRM when the SNR is high. Similar complementarity studies have also been reported in [24-28]. Based on this complementarity, the authors of [23] proposed a geometric mean method fusing the amplitudes of MAP and IRM to improve the speech denoising effect. Although this fusion method is exceedingly effective for speech denoising, it is not very favorable for speech dereverberation. Consequently, further effort is required to understand and better exploit the fusion method.
In this paper, we analyze the principle and mechanism of the mapping and masking methods and hypothesize that their different conversion mechanisms lead to complementary speech enhancement effects. Based on this, we conclude that the complementarity between the mapping and masking methods is widespread: it is not limited to the MAP and IRM methods, and complementarity also exists between MAP and other masking methods. Since the IRM method has great limitations in speech dereverberation, we propose the fusion of IAM and MAP for speech dereverberation. To further explore the fusion mode that minimizes the recovered amplitude error, we propose the arithmetic mean and the weighted mean.
Clean speech becomes different corrupted speech signals when interfered with by different speech scenarios. Accordingly, the correspondence between corrupted speech and clean speech is actually "many-to-one," and MAP is therefore a "many-to-one" conversion. In contrast, the training target of masking is typically a time-frequency (T-F) amplitude ratio of clean speech to corrupted speech; hence, masking is a "one-to-one" conversion. The recovery errors produced by different conversion mechanisms are more likely to be complementary. Specifically, one conversion mechanism may overestimate the clean speech amplitude while the other underestimates it, and we observed this phenomenon in our preliminary experiments. Based on the principle of error minimization, the mean of the amplitudes recovered by the mapping and masking methods will further improve the recovery accuracy. However, the IRM method is not satisfactory for speech dereverberation because convolutional reverberation and additive noise corrupt clean speech through different mechanisms: early reverberation is strongly correlated with clean speech, contrary to the independence assumption underlying IRM. The training target of IAM is the T-F amplitude ratio of clean speech to corrupted speech; it imposes no assumption on the formation mechanism of the corrupted speech and has great potential in both speech denoising and speech dereverberation. Therefore, the fusion of IAM and MAP is more conducive to speech dereverberation.
In summary, this paper proposes that the different conversion mechanisms of the mapping and masking methods lead to complementarity, which in turn minimizes the recovery errors of the fusion method, and the scope of fusion is extended from the fusion of MAP and IRM to the fusion of mapping and masking methods in general. For speech dereverberation, we propose the fusion of MAP and IAM, together with a new effective fusion mode: the weighted mean. The speech dereverberation experiments suggest that the proposed methods exceed the existing related methods. The other three contributions of this paper are as follows. (1) This paper proposes the novel training target DCC, namely, the difference of LMS between corrupted speech and clean speech. (2) We propose using the "standard deviation" to measure the error of the predicted amplitude and show that the "standard deviation" is more reasonable than the traditional "ratio." (3) We propose using a DNN to predict the weight coefficients in the weighted mean, with very good effect. These training targets (the weight coefficients) differ from traditional training targets directly labeled through the corpus (such as mapping and masking targets): they are obtained from the outputs of the DNN itself. The remaining parts of this paper are organized as follows. The mapping and masking methods are introduced in Section 2. In Section 3, the amplitude fusion method is described. In Section 4, the analysis of conversion mechanisms and error minimization is provided. In Section 5, the speech dereverberation experiments and discussions are carried out, and conclusions are provided in Section 6.

Mapping and Masking Methods
When speech is corrupted by both reverberation and additive noise simultaneously, its mathematical expression in the time domain is

y(t) = o(t) * r(t) + n(t). (1)

Here, y, o, r, and n refer to the corrupted speech, original clean speech, room impulse response (RIR), and additive noise, respectively; t indexes time, and * represents convolution. The short-time Fourier transform (STFT) of (1) is

Y(m, f) = O(m, f) · R(m, f) + N(m, f). (2)

Here, Y, O, R, and N refer to the STFT of the corrupted speech, original clean speech, RIR, and additive noise, respectively, while m and f index the time frame and frequency bin, respectively; "·" represents multiplication. The first term on the right side of (1) or (2) represents reverberant speech, while the second term represents additive noise. It can be seen that the influence mechanisms of reverberation and additive noise are very different. The task of speech enhancement is to estimate the underlying clean speech o(t) from the corrupted speech y(t).
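As a minimal numeric sketch of the corruption model in (1), the following toy example builds a corrupted signal from a synthetic "clean" signal, a hypothetical two-tap RIR, and white noise; all values are made up for illustration and the DNN itself is not modeled.

```python
import numpy as np

# Toy illustration of y(t) = o(t) * r(t) + n(t).
rng = np.random.default_rng(0)
o = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 160))   # stand-in for clean speech
r = np.array([1.0, 0.6])                              # hypothetical RIR: direct path + one reflection
n = 0.01 * rng.standard_normal(len(o) + len(r) - 1)   # additive noise
y = np.convolve(o, r) + n                             # corrupted speech

# In the STFT domain this model becomes Y = O . R + N per T-F bin, as in (2).
```

The same decomposition carries over bin-by-bin after an STFT, which is what the masking targets below operate on.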
Mapping has one classic form: predicting the LMS of clean speech directly from the LMS of corrupted speech using a DNN (MAP). The regularly used LPS is numerically twice as large as the log magnitude spectrum (LMS) used in this paper. In our experiments, the speech quality obtained with LMS is no worse than that with LPS. The value of LMS can theoretically lie in (−∞, +∞), but within our preliminary experimental dataset it mostly lies in (−10, 4). Accordingly, the output activation function of MAP is linear. The amplitude of the target clean speech is obtained by exponential expansion of the estimated LMS.
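The LMS feature and its exponential expansion can be sketched as follows; `eps` is a hypothetical small constant added only to avoid log(0) on toy magnitudes.

```python
import numpy as np

def lms(A_mag, eps=1e-8):
    # Log magnitude spectrum; numerically half the usual LPS.
    return np.log(A_mag + eps)

def lms_to_amplitude(Z_est):
    # Exponential expansion of an estimated LMS back to an amplitude.
    return np.exp(Z_est)
```

A round trip `lms_to_amplitude(lms(A))` recovers the amplitude up to the `eps` perturbation.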
The masking method has a variety of forms: IRM is the most widely used for reducing additive noise, while IAM has great potential for reducing both additive noise and convolutional reverberation. The training target of IAM is the T-F amplitude ratio of clean speech to corrupted speech [11]:

M_iam(m, f) = |O(m, f)| / |Y(m, f)|. (3)

Here, M_iam denotes the training target of IAM, and |·| represents the modulus. M_iam is a typical masking training target, from which a few variants can be obtained. The theoretical value range of M_iam is [0, +∞); thus, the output activation function of IAM is linear. Excessively large values of M_iam are often clipped to facilitate the training of the DNN; for example, [11] advises that all values greater than 10 be set to 10. The amplitude of the target clean speech is obtained by multiplying the estimated M_iam with the amplitude of the corrupted speech.
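A minimal sketch of the IAM target and the corresponding enhancement step, with the clipping at 10 suggested in [11]; `O_mag` and `Y_mag` stand for the magnitude spectrograms of clean and corrupted speech, and `eps` is a hypothetical guard against division by zero.

```python
import numpy as np

def iam_target(O_mag, Y_mag, clip=10.0, eps=1e-8):
    # T-F amplitude ratio of clean speech to corrupted speech.
    m = O_mag / (Y_mag + eps)
    return np.minimum(m, clip)   # clip large values to ease DNN training

def iam_enhance(Y_mag, m_est):
    # Estimated clean amplitude: predicted mask times corrupted amplitude.
    return m_est * Y_mag
```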
Historically, IRM was extended from the ideal binary mask (IBM) [29], but in this paper we intentionally regard IRM as a variant of IAM. The IRM method actually relies on an approximation as a prerequisite: the power of the noisy speech is set to the sum of the powers of the clean speech and the noise. The T-F expression of noisy speech corrupted by additive noise is

Y(m, f) = O(m, f) + N(m, f), (4)

and the approximate expression of the noisy speech amplitude is

|Y|² = |O|² + |N|² + 2|O||N| cos θ ≈ |O|² + |N|², (5)

where the T-F unit symbols are omitted and θ denotes the phase difference between clean speech and additive noise within the T-F unit. Since clean speech and additive noise are considered completely independent, the θ values are random, so cos θ ranges over [−1, 1]. Hence, we set cos θ to its mean value of zero in (5), which should not introduce too much error. Consequently, the training target of the IRM method is usually expressed as

M_irm = sqrt(|O|² / (|O|² + |N|²)). (6)

Sigmoid is used as the output activation function. The amplitude of the target clean speech is obtained by multiplying the amplitude of the corrupted speech with the estimated M_irm. For reverberant speech enhancement, this paper replaces the additive noise with the difference between reverberant speech and clean speech, which ensures that the target clean speech of the IRM method is the original clean speech:

M_irm = sqrt(|O|² / (|O|² + |Y − O|²)). (7)

Here, Y mainly contains reverberant speech. The speech enhancement methods studied in this paper only enhance the amplitude of the corrupted speech without dealing with the corrupted phase. Clean speech in the time domain is recovered from the corrupted phase and the enhanced amplitude, as shown in Figure 1.
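The dereverberation-oriented IRM of (7) can be sketched directly: the "noise" term is the difference between the reverberant and clean T-F spectra, N = Y − O. Here `O` and `Y` are toy complex spectrograms, and `eps` is a hypothetical numerical guard.

```python
import numpy as np

def irm_dereverb_target(O, Y, eps=1e-8):
    # Treat the reverberant residual Y - O as the "noise" of the IRM formula.
    N = Y - O
    return np.sqrt(np.abs(O) ** 2 / (np.abs(O) ** 2 + np.abs(N) ** 2 + eps))
```

The mask lies in [0, 1], which matches the sigmoid output activation; enhancement multiplies the corrupted amplitude by the estimated mask.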

Amplitude Fusion Methods
Supervised speech enhancement in this paper refers to training a DNN with a training corpus and then converting corrupted speech into the underlying clean speech with the trained DNN. The supervised speech enhancement scheme for the amplitude fusion method is shown in Figure 1. The input features of the DNN can be various extracted speech features [30-32], but the LPS of speech is the most frequently used. The types of DNN here include the feedforward multilayer perceptron (MLP) [12], the convolutional neural network (CNN) [33,34], the recurrent neural network (RNN) [23,35], and hybrid neural networks [36-38], of which the MLP is the most classic. The expected output here refers specifically to the mapping- or masking-based training target. Based on their different conversion mechanisms, fusions between the mapping and masking methods are proposed in this paper for speech enhancement. We use multitarget training to estimate the training targets; as shown in Figure 1, the selected content in the dashed box illustrates the multitarget training. "LMS (mapping)" and "LMS (clean)" refer to the LMS estimated by the mapping method and the LMS of the original clean speech, respectively. M̂_mask and M_mask are the estimated and reference masking training targets. We use an MLP as the neural network, with mapping and masking sharing the weights before the output layer, and the loss function is

L = α‖Ẑ_map − Z_map‖² + (1 − α)‖M̂_mask − M_mask‖², (8)

where Ẑ_map and Z_map are the estimated and original clean LMS, respectively, and α (0 < α < 1) is the weight coefficient of the two error terms. One multitarget training run reduces computational complexity without reducing the performance relative to multiple single-target training runs. For the fusion modes of amplitudes, in addition to the geometric mean (GM) already used in [23], this paper also proposes the arithmetic mean (AM) and the weighted mean (WM), as follows:
Discrete Dynamics in Nature and Society
Geometric mean:

A_f = sqrt(A_map · A_mask). (9)

Arithmetic mean:

A_f = (A_map + A_mask) / 2. (10)

Weighted mean:

A_f = β · A_map + (1 − β) · A_mask. (11)

LMS-based weighted mean (LWM):

A_f = exp(γ · Z_map + (1 − γ) · Z_mask). (12)
Here, A_f, A_map, and A_mask refer to the estimated amplitudes of the fusion, mapping, and masking methods, respectively, and Z_mask is the LMS estimated by the masking method.
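The four fusion modes of (9)-(12) can be written out directly for amplitude arrays `A_map` and `A_mask` and their LMS counterparts (Z = log A); this is a minimal sketch, not the paper's implementation.

```python
import numpy as np

def fuse_gm(A_map, A_mask):        # geometric mean, (9)
    return np.sqrt(A_map * A_mask)

def fuse_am(A_map, A_mask):        # arithmetic mean, (10)
    return 0.5 * (A_map + A_mask)

def fuse_wm(A_map, A_mask, beta):  # weighted mean on amplitudes, (11)
    return beta * A_map + (1.0 - beta) * A_mask

def fuse_lwm(Z_map, Z_mask, gamma):  # weighted mean on LMS, (12)
    return np.exp(gamma * Z_map + (1.0 - gamma) * Z_mask)
```

Note that GM equals LWM with γ = 1/2, i.e., the geometric mean is an arithmetic mean taken in the LMS domain.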
GM, AM, and WM are amplitude-based mean modes, while LWM is an LMS-based weighted mean mode. β and γ refer to the weight coefficients of the mapping method. They are not fixed interpolation weights as in [25] but are determined by the following equations:

β = (A − A_mask) / (A_map − A_mask), (13)

β = min(max(β, 0), 1), (14)

γ = (Z − Z_mask) / (Z_map − Z_mask), (15)

γ = min(max(γ, 0), 1), (16)

where A and Z refer to the original clean speech amplitude and LMS, "min" denotes an elementwise operation that picks the minimum of its two parameters, and "max" denotes the maximum. The weight coefficients β and γ are obtained by multitarget training as shown in Figure 2. Unlike conventional training targets, β and γ are not directly labeled by the training set corpus but are calculated from the mapping and masking amplitudes obtained through the inference process of the neural network. The calculation is performed by (13)-(16). Thereafter, the calculated β or γ is used as the label value of the weight coefficient in the training process. During the DNN training stage based on the weighted mean mode, the inference process and the training process are performed in each batch. The inference process does not change the parameters of the DNN; it only calculates the weight coefficient β or γ. The loss function of the training process is

L = ζ1‖Ẑ_map − Z_map‖² + ζ2‖M̂_mask − M_mask‖² + ζ3‖Ŵ − W‖². (17)

Here, ζ1, ζ2, and ζ3 are the weighting coefficients of the error terms, and Ŵ and W represent the estimated and reference β or γ, respectively. The weighted mean mode ensures that the amplitude obtained by the fusion method is closest to the clean speech amplitude. In fact, when the weight coefficients defined by (13)-(16) are accurately estimated (only in the ideal case), the weighted mean achieves zero error relative to the ideal value. Since the amplitude distributions obtained by the mapping and masking methods are regular, the derived β or γ is also regular. Therefore, the correspondence between them can be learned by the neural network.
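The weight-coefficient label can be sketched as below: solving β·A_map + (1 − β)·A_mask = A for β and clipping to [0, 1]. `eps` is a hypothetical guard for bins where the two estimates coincide.

```python
import numpy as np

def beta_label(A, A_map, A_mask, eps=1e-8):
    # Ideal mapping weight: the beta that makes the weighted mean hit
    # the clean amplitude A exactly, then clipped elementwise to [0, 1].
    beta = (A - A_mask) / (A_map - A_mask + eps)
    return np.minimum(np.maximum(beta, 0.0), 1.0)   # min(max(beta, 0), 1)
```

When the clean amplitude lies between the two estimates, the unclipped β reproduces A with zero error, which is the ideal case described above.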

Conversion Mechanism.
During the training stage of the MAP method, multiple speech signals with varying degrees of corruption are recovered to the same original clean speech. Likewise, during the speech enhancement process, the same clean speech feature is obtained under ideal conditions when enhancing different corrupted speech signals originating from the same clean speech. We call this conversion mechanism of MAP "many-to-one." According to the definition of the masking training target, it is easy to see that the masking method is a "one-to-one" conversion. To manifest the relationship between these two conversion mechanisms more clearly, this paper proposes a new training target: the difference of LMS between corrupted speech and clean speech (DCC). DCC is essentially a masking method; in fact, it can also be considered a variant of IAM. Its expression is

M_dcc(m, f) = log|Y(m, f)| − log|O(m, f)|. (18)

The value of M_dcc lies mostly in (−5, 5) within our preliminary experiments, and the output activation function is linear. The estimated amplitude of the clean speech is obtained by

|Ô(m, f)| = exp(log|Y(m, f)| − M̂_dcc(m, f)). (19)

Here, Ô and M̂_dcc refer to the estimated original clean speech and the estimated M_dcc, respectively, and exp(·) denotes the exponential expansion. From (18), the following relation is easily obtained:

log|Y(m, f)| = log|O(m, f)| + M_dcc(m, f). (20)

The left side of (20) is the corrupted speech's LMS, which is usually used as the input feature for the mapping and masking methods. The first item on the right side is the training target of the mapping method, and the second item is the training target of the masking method. DCC has quite a few interesting properties. (1) It extends the physical meaning of the masking method, so that the masking training target can be understood not only as a ratio but also as a difference. (2) It helps establish the relationship between the mapping and masking methods, so that the mapping and masking training targets can be regarded as a decomposition of the corrupted speech. (3) The DCC training target has a logarithmic function structure similar to the MAP training target, which is beneficial for analyzing the error minimization mechanism of the amplitude fusion method in this paper.
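A minimal sketch of the DCC target and the corresponding reconstruction; `O_mag` and `Y_mag` are toy magnitude spectrograms and `eps` a hypothetical log guard.

```python
import numpy as np

def dcc_target(O_mag, Y_mag, eps=1e-8):
    # M_dcc = log|Y| - log|O|: the LMS difference between corrupted and clean speech.
    return np.log(Y_mag + eps) - np.log(O_mag + eps)

def dcc_enhance(Y_mag, m_dcc_est, eps=1e-8):
    # |O_hat| = exp(log|Y| - M_dcc): exponential expansion of the corrected LMS.
    return np.exp(np.log(Y_mag + eps) - m_dcc_est)
```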

Error Minimization.
As mentioned in [11], let η denote the ratio between the estimated amplitude and the clean speech amplitude, η ∈ [0, +∞):

η = Â / A. (21)

Here, Â refers to the estimated amplitude. When η > 1 or η < 1, the method overestimates or underestimates the amplitude, respectively. Of course, the method achieves the minimum prediction error when η = 1.
The estimated amplitudes of the mapping, masking, and fusion methods can be expressed as

Â_map = η_map · A, (22)

Â_mask = η_mask · A, (23)

Â_f = sqrt(Â_map · Â_mask) = sqrt(η_map · η_mask) · A, (24)

where η_map and η_mask refer to η for the mapping and masking methods, respectively. Due to the different conversion mechanisms of the mapping and masking methods, the resulting η values tend to follow different distributions; for example, when η_map < 1, it is likely that η_mask > 1. We call this phenomenon complementarity. The fusion amplitude in (24) is the geometric mean of the two predicted amplitudes [23]; hence the recovery error is reduced. Here, we only analyze the geometric mean; in fact, the arithmetic mean and the weighted mean can also reduce the amplitude recovery error.
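A numeric illustration of this complementarity argument: if mapping underestimates (η_map < 1) while masking overestimates (η_mask > 1), the geometric-mean η of (24) lands closer to 1 than either error alone. The two η values below are hypothetical.

```python
import numpy as np

# Hypothetical per-bin amplitude ratios for the two branches.
eta_map, eta_mask = 0.8, 1.2
eta_gm = np.sqrt(eta_map * eta_mask)   # sqrt(0.96), roughly 0.98

# The fused ratio is closer to the ideal value 1 than either input.
assert abs(eta_gm - 1) < abs(eta_map - 1)
assert abs(eta_gm - 1) < abs(eta_mask - 1)
```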

Proposed Metric on Recovered Amplitudes.
This paper proposes the standard deviation σ of η:

σ = sqrt((1/n) Σ_i (η_i − η̄)²). (25)

In (25), n refers to the total number of speech amplitudes, i indexes the amplitudes, and η̄ refers to the average of all η. It is widely held that the closer the η value is to 1, the smaller the prediction error [11], but this is not strictly true. The STFT and the inverse STFT of a speech signal are linear, as shown by the following formulas:

F[λ · y(t)] = λ · F[y(t)], (26)

F⁻¹[λ · Y(m, f)] = λ · F⁻¹[Y(m, f)]. (27)

Here, F and F⁻¹ refer to the STFT operation and its inverse, respectively, and λ represents any constant. Since multiplying a speech signal by a constant does not substantially change the speech quality, the ability of η to characterize the prediction accuracy is very limited, while the standard deviation σ reflects the degree of dispersion of η. The smaller the value of σ, the higher the consistency between the recovered amplitudes and the clean speech amplitudes. In the ideal case σ = 0, the quality of the recovered amplitudes reaches the upper limit. As can be seen from (25), σ = 0 when η takes any fixed value. Therefore, σ is more indicative of the quality of the recovered amplitudes than η.
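The proposed metric (25) can be sketched in a few lines; note that σ is zero whenever all η_i share the same fixed value, so a uniform gain on the recovered signal does not hurt the score.

```python
import numpy as np

def sigma_metric(A_est, A_clean, eps=1e-8):
    # Standard deviation of the amplitude ratios eta_i = A_est / A_clean.
    eta = A_est / (A_clean + eps)
    return np.sqrt(np.mean((eta - eta.mean()) ** 2))
```

For example, a recovery that is exactly twice the clean amplitude everywhere gives σ ≈ 0, while a recovery whose ratios scatter gives a large σ.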

Experiment Setup.
This paper used the REVERB (REverberant Voice Enhancement and Recognition Benchmark) challenge corpus [39]. The DNN used in this paper was a feedforward multilayer perceptron (MLP) with three hidden layers of 3072 nodes each. The activation function of the hidden layers was the Rectified Linear Unit (ReLU) [41], while the output activation function was linear, except sigmoid for IRM. We added a batch normalization layer [42] before the three hidden layers to enhance the generalization ability of the MLP to the dynamic features. The mean square error (MSE) was used as the loss function, and α for bitarget training was set to 0.5. For tritarget training based on the weighted mean, ζ1, ζ2, and ζ3 were set to 1. The MLP was trained with the Adam optimizer [43], a learning rate of 0.0002, and a batch size of 200. The maximum number of training epochs was set to 80. Once the network was trained, the model with the smallest recovery error on the development dataset was chosen, and testing on the evaluation dataset was performed. The parameter settings for the DNN are listed in Table 1.
For speech segmentation, the frame length was 32 ms and the frame shift was 16 ms, with a Hanning window. The input feature was the reverberant speech's LMS, normalized to zero mean and unit variance. The parameters of the speech analysis are listed in Table 2.
For a single-target prediction method, each input or output feature of the MLP contains 7 frames of context (3 left and 3 right) in order to utilize contextual information. As the sliding distance of the input or output feature is one frame, each frame in the reverberant speech is enhanced 7 times, yielding 7 predicted amplitudes. Multiple predictions of a frame in an utterance are averaged; that is, the ultimate recovered amplitude is obtained by averaging the 7 predicted amplitude values of the same frame:

Â(m, f) = (1/7) Σ_i Â_i(m, f), (28)

where i indexes the output batches of the same speech frame during the speech enhancement stage. The evaluation of speech quality is based on the perceptual evaluation of speech quality (PESQ) [44] and the frequency-weighted segmental signal-to-noise ratio (SNR) [45].
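The sliding-window averaging can be sketched as follows. `predict` is a stand-in for the trained MLP, mapping a 7-frame window of corrupted amplitudes to a 7-frame estimate; edge frames, which receive fewer than 7 predictions, are averaged over whatever is available (an assumption, since the paper does not spell out edge handling).

```python
import numpy as np

def average_sliding_predictions(Y_frames, predict, context=7):
    # Y_frames: (T, F) corrupted amplitudes; predict: (context, F) -> (context, F).
    T, F = Y_frames.shape
    acc = np.zeros((T, F))
    cnt = np.zeros((T, 1))
    for s in range(T - context + 1):          # slide one frame at a time
        acc[s:s + context] += predict(Y_frames[s:s + context])
        cnt[s:s + context] += 1               # how many windows covered each frame
    return acc / np.maximum(cnt, 1)           # per-frame mean of the predictions
```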

Experiment Results.
In addition to the various original mapping and masking methods, our comparison also includes their important variants and fusion methods. The variants and their descriptions are listed in Table 3. The fusion methods tested in this paper include MAP + I_IRM, MAP + DCC, MAP + IAM, MAP + IRM, IAM + IRM, IAM + DCC, and IRM + DCC, where "A" and "B" in "A + B" represent different single-target prediction methods and "A + B" refers to the fusion of "A" and "B."

Evaluation of Different Speech Enhancement Methods.
The test results of the different speech enhancement methods on the REVERB dataset are shown in Table 4. Without loss of generality, the fusion methods in this table are based on the geometric mean mode. MIX in the table refers to the unprocessed reverberant speech. The scores in Table 4 are averages over the entire evaluation dataset. There is a big gap between the WPE method and the supervised speech enhancement methods. Although IRM and I_IRM perform well in speech denoising, they work poorly on speech dereverberation. The PSM and cIRM methods, which consider both phase and amplitude, did not exceed the IAM method, which only enhances the amplitude; this may be because the phase wrapping problem is inherent and does not disappear with a change of its expression. TDR exceeds IAM in SNR but is significantly lower than IAM in PESQ; we think this is related to the low recognizability of the time-domain signal. The other variants of the masking methods, including I_IRM, IAM_A, and DCC_A, did not exceed their original forms on our corpus. MAP + I_IRM was proposed in [23] for removing additive noise, but it is not good enough at speech dereverberation. We choose the traditional masking methods with high scores for the amplitude fusion research. The fusion methods MAP + DCC and MAP + IAM show significant improvement in both speech metrics compared to their single-target prediction methods. Although MAP + IRM is lower in SNR than MAP, its improvement in PESQ is very significant. The MAP + DCC method achieves the best results among all methods. In summary, the amplitude fusion of the mapping and masking methods brings significant improvement. Table 4 also shows that the scores of IRM + DCC, IAM + IRM, and IAM + DCC are not improved compared to their single-target prediction methods, being only the average of the two; this shows that amplitude fusion between masking training targets cannot further improve the speech quality. The PESQ and SNR scores under the various reverberant conditions are listed in Tables 5 and 6.
AVE at the bottom represents the average scores of the various methods. Table 5 shows that the MAP method performs poorly in small rooms, especially in the near field, while in medium and large rooms it performs very well.
Masking methods such as IAM, IRM, and DCC can effectively improve speech quality under a variety of reverberant conditions, although IRM improves very little. It is not difficult to see that the mapping and masking methods are complementary from the perspective of speech quality under various reverberant conditions. The fusion of MAP and masking methods greatly improves the speech quality. Among them, MAP + DCC greatly improves the speech quality under all degrees of reverberation. Although slightly lower than IAM in small rooms, MAP + IAM improves significantly under the other reverberation conditions. Apart from a small reduction in the medium and large rooms in the far field, MAP + IRM improves greatly under the other reverberant conditions. However, Table 5 shows that fusion between different masking methods does not effectively improve the speech quality under any reverberant condition.
As can be seen from Table 6, MAP + IRM does not improve the SNR score when the reverberation is severe, which may be because the traditional IRM method is not suitable for speech dereverberation. Although the MAP + IAM score improves on average compared with MAP and IAM, it does not improve in the far field or the small room. MAP + DCC not only increases the average score but also improves significantly under all conditions of severe reverberation. This shows a high degree of complementarity between MAP and DCC, which may be caused by MAP and DCC sharing the same compression function, the natural logarithm. The logarithm gives their predicted amplitudes the same margin of error, which helps the errors cancel each other out. As for the phenomenon that the scores of MAP + IAM and MAP + DCC decrease in the small room, we think this is caused by the margin of amplitude error of MAP being too large relative to that of the masking method.
In order to analyze the fusion method more thoroughly, the PESQ and SNR scores are compared again according to different SNRs. The comparison results are listed in Tables 7 and 8, respectively. The leftmost column in each table is the average SNR score of the unprocessed reverberant speech under the different reverberation conditions, arranged by size. Through the comparison of MAP and IAM, it is found that MAP and IAM are complementary at different SNRs, which is reflected in both PESQ and SNR. Specifically, MAP performs better at lower SNRs, while being lower than IAM at higher SNRs. The comparison of MAP with IRM or DCC follows a similar rule. The PESQ and SNR scores of MAP + IRM improve significantly at higher SNRs but not at lower SNRs. MAP + IAM improves PESQ when the SNR is low, and its SNR value improves only when the MIX SNR takes a middle value. The PESQ score of MAP + DCC improves significantly at any SNR, and its SNR score improves significantly when the MIX SNR is less than 6.68. It is proved again that MAP and DCC have a higher degree of fusion, and the fusion effect is better at lower SNR. We conclude that the analytical conclusions obtained under reverberation and noise conditions are consistent.

Table 3: Description of some important methods derived from masking.

Method | Basic principle
TDR | Time-domain signal reconstruction. This paper uses IAM-based TDR, and the clean speech phase is used to recover the time-domain signal [46-48].
I_IRM | Indirect mapping of IRM, proposed in [23] to learn the IRM target via the MSE between the masked and reference clean LMS.
IAM_A | The DNN estimates an IAM mask that is applied to the corrupted speech amplitude, and the loss function is computed between the masked amplitude and the clean speech amplitude [49,50].
DCC_A | Similar to IAM_A, except that the IAM mask is replaced with the DCC mask.

Evaluation of Different Fusion Modes.
Without loss of generality, this paper compares the various fusion modes based on the MAP and DCC methods. The experimental results are shown in Table 9.
The results show that all fusion modes exceed the traditional MAP or DCC method. Among them, GM and LWM score higher. According to the definition of GM in (9), GM can also be regarded as an arithmetic mean based on LMS. Perhaps the LMS is more capable of characterizing the speech signal; we speculate that LWM achieves the highest score because it is based on LMS.

Listening Test.
In addition to the objective evaluation, we also conducted a listening test on the main methods. The corpus used for the test was from the evaluation set of REVERB. Seventeen sentences were randomly selected from the reverberant speech of each condition (far and near fields; small, medium, and large rooms), and a total of 102 sentences were obtained for the listening test. The speech enhancement methods compared include MAP, IAM, MAP + I_IRM based on the GM mode, and MAP + DCC based on the LWM mode. The enhanced speeches are placed in 102 folders, each containing four audio files. They all come from the same sentence, only processed by different algorithms.
The order of the four audio files is completely random, and the subjects do not know the order. The five subjects (two males and three females) are all graduate students or staff at Tianjin University. Their ages are between 24 and 37, and English is not their native language. Each participant received a monetary incentive for the listening test. They were instructed to compare and rank the four audio files in each folder. The criteria for comparison are the degree of distortion and completeness of the speech, as well as the effect of suppressing noise and reverberation. The basis for ranking is the listener's own overall impression of speech quality: the favorite audio ranks first, the second favorite second, the third favorite third, and the least favorite fourth. Listeners used a high-quality headset in a quiet environment and played each audio file at least once. The ranking is scored, with the scoring method borrowed from [19]: the first rank is given a score of 3, the second 2, the third 1, and the fourth 0. After the listening test, the total score corresponding to each method is calculated. The score results are shown in Figure 3. In the figure, "m1-2" mark the scores of the two males, "f1-3" those of the three females, and AVE the average of the scores of the five subjects. The values of AVE corresponding to MAP, IAM, MAP + DCC, and MAP + I_IRM are 133.8, 162.2, 183.0, and 133.0, respectively. Expressed as percentages, they account for 21.9%, 26.5%, 29.9%, and 21.7%. Our proposed MAP + DCC achieves the highest score in the listening test.

Discussion.
In terms of speech denoising, we think that MAP + I_IRM may be the most suitable fusion method [23]. However, MAP + DCC is more conducive to speech dereverberation than MAP + I_IRM, which stems from the excellent dereverberation performance of DCC. Since the DCC and MAP methods use the same compression function, the complementarity between them is most significant, and the dereverberation effect is also the best. The "one-to-one" conversion of the masking method is reflected in the amplitude recovery accuracy shown in Table 10. The reverberation times corresponding to the small, medium, and large rooms in the evaluation set corpus are 0.25 s, 0.5 s, and 0.7 s, respectively. σ of the DCC method shows an obvious regularity: the lighter the speech reverberation, the smaller σ becomes. As can be seen from (18), M_dcc contains the amplitude of the reverberant speech. Therefore, the degree of chaos in the DCC training target is positively correlated with the degree of speech reverberation. For the same type of deep-learning prediction target, there may be a rule that the lower the degree of chaos of the prediction target, the higher the prediction accuracy. This basically agrees with the PESQ and SNR scores of the DCC method shown in Tables 5 and 6. The distribution of σ for the MAP item in Table 10 shows that speech with lighter reverberation does not achieve higher prediction accuracy; the MAP method yields similar σ values under different reverberations. MAP as a "many-to-one" conversion refers to predicting the same corresponding clean speech feature from multiple reverberant speech features of different degrees. The prediction targets of MAP are all clean speech features, which may result in close prediction accuracy under different reverberation conditions. The phase of the reverberant speech is used to recover the target clean speech, as shown in Figure 1.
The degree of phase chaos is positively correlated with the degree of reverberation of the reverberant speech, which affects the quality of the recovered speech. Therefore, the amplitude recovery accuracy of the MAP method shown in Table 10 is highly consistent with the PESQ and SNR scores of MAP in Tables 5 and 6. We also give the distribution of σ values according to different SNRs, listed in Table 11. The table shows that the σ values of MAP change only slightly under different SNRs, while σ of DCC increases as the SNR decreases, except when the SNR is 2.27. MAP + DCC achieves the minimum σ under all SNRs. Therefore, the analyses of σ under the reverberation and noise conditions are basically consistent.

Issue on the Conversion Mechanism.
In summary, we use the different conversion mechanisms of mapping and masking to explain their complementarity under different degrees of speech damage, and this complementarity is also the motivation for the fusion of MAP and I_IRM [23]. From this, we infer that MAP and the other masking methods are also complementary, which is why we propose new fusion methods such as MAP + IAM and MAP + DCC.

Issue on the Error Minimization Mechanism.
In our preliminary experiments, we analyzed the η produced by different methods on the training set corpus. We divided the training set into 24 subsets according to the RIR and compared the η produced by different speech enhancement methods on each subset. We observed significant amplitude error complementarity in 7 of the subsets, as shown in Figure 4.
In the figure, the abscissa represents the frequency bin, and the ordinate represents the average value of η on one subset. The figure shows that the mean of η for the MAP method is generally greater than 1, while that of DCC is generally less than 1; therefore, their geometric mean is closer to 1. The 7 corresponding RIRs were recorded under different reverberation conditions.
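This cancellation effect can be illustrated with a toy numerical sketch. The η values below are hypothetical, chosen only to mimic the pattern in Figure 4 (MAP over-estimates, DCC under-estimates the clean amplitude):

```python
import numpy as np

# Toy illustration of complementary amplitude errors.
# eta = estimated amplitude / clean amplitude, per frequency bin.
eta_map = np.array([1.12, 1.08, 1.15, 1.05])  # hypothetical: MAP, eta > 1
eta_dcc = np.array([0.90, 0.94, 0.88, 0.96])  # hypothetical: DCC, eta < 1

# Geometric mean of the two estimates, applied per frequency bin.
eta_gm = np.sqrt(eta_map * eta_dcc)

def log_dev(eta):
    """Mean deviation from the ideal ratio of 1, measured in the log domain."""
    return np.abs(np.log(eta)).mean()

# The fused ratio sits closer to 1 than either individual method's.
assert log_dev(eta_gm) < min(log_dev(eta_map), log_dev(eta_dcc))
```

Because the two methods err in opposite directions, averaging in the log domain pulls the fused ratio toward 1, which is the intuition behind the GM fusion mode.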
As shown in Tables 10 and 11, σ of the fusion method MAP + DCC is the smallest of the three.
This suggests that the amplitude recovery errors of the MAP and DCC methods are complementary and that their geometric mean reduces the errors. Since both the DCC and MAP training targets use logarithmic compression, their amplitude recovery errors are on the same scale. Moreover, this complementarity makes it easy to minimize σ of the fusion method.
As shown in Tables 4-6, fusion between masking methods, such as IRM + DCC, IAM + IRM, and IAM + DCC, does not improve speech quality. This is likely because recovery errors produced by the same conversion mechanism are less likely to complement each other. In contrast, fusion between mapping and masking methods, such as MAP + IRM, MAP + IAM, and MAP + DCC, improves the quality significantly. In view of the distribution of σ and the speech quality scores, it can be inferred that the different conversion mechanisms of the mapping and masking methods are the source of their complementarity.
In theory, the weighted mean method should minimize the prediction error of the amplitude, and the experimental results indeed show that the LWM method achieves the highest score. It can be inferred that LWM would work even better if the prediction accuracy of c were improved with a more powerful neural network.
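The weighted-mean fusion step itself can be sketched as follows. This is a minimal sketch under our reading of the LWM mode: the function name `lwm_fuse` and the toy amplitudes are ours, and the weight `beta` stands in for the coefficient that the paper predicts with a DNN.

```python
import numpy as np

# Sketch of the linearly weighted mean (LWM) fusion step.
# a_map, a_dcc: amplitude spectrograms recovered by the two methods;
# beta: weight in [0, 1] (per time-frequency bin in practice,
# predicted by a DNN in the paper's full method).
def lwm_fuse(a_map, a_dcc, beta):
    beta = np.clip(beta, 0.0, 1.0)
    return beta * a_map + (1.0 - beta) * a_dcc

# Toy check: if one method over-estimates and the other under-estimates
# the clean amplitude, an intermediate weight lands between the two.
a_clean = np.array([1.0, 2.0])
a_map, a_dcc = a_clean * 1.1, a_clean * 0.9   # +10% / -10% errors
fused = lwm_fuse(a_map, a_dcc, beta=0.5)
assert np.allclose(fused, a_clean)  # symmetric errors cancel at beta = 0.5
```

When the two errors are not symmetric, a well-predicted per-bin weight can still place the fused amplitude closer to the clean one than either input, which is why the accuracy of the weight prediction directly bounds how much LWM can gain.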
To directly compare the speech amplitudes recovered by the various methods, we provide their spectrograms. Figure 5 shows that the fusion method reduces reverberation better than the other three methods. This agrees with Table 4, which shows that MAP + DCC in the GM fusion mode achieves the highest speech quality.

Conclusion
This paper analyzes in detail the different conversion mechanisms of the mapping and masking methods and experimentally verifies that their recovery errors are highly complementary. The amplitude fusion method used in this paper effectively exploits this complementarity and reduces the recovery error of the target clean speech amplitude, thus further improving the speech quality. Furthermore, it is found that not only MAP + IRM but also the fusion of mapping with other masking methods, such as MAP + IAM and MAP + DCC, can greatly improve the speech quality. This is because the different conversion mechanisms of the mapping and masking methods make their recovery errors complementary. Since the masking methods share the same conversion mechanism, amplitude fusion among them does not further improve the speech quality.
This paper proposes a new fusion mode, LWM, together with a new method of predicting the weight coefficient with a DNN. The prediction target β or c differs from traditional targets labeled directly from the corpus (as in mapping or masking); it is instead obtained from the other two outputs of the DNN. LWM takes full advantage of the DNN's predictive power to minimize amplitude prediction errors.
This paper also proposes DCC, a new method that can be seen as a variant of IAM, and the MAP + DCC fusion method achieves the best results for speech dereverberation. Experiments indicate that MAP + DCC improves PESQ and SNR by 5.8% and 25.0%, respectively, compared with the traditional IAM method.

Data Availability
The data used to support this study can be found at https://reverb2014.dereverberation.org/data.html.

Conflicts of Interest
The authors declare that they have no conflicts of interest.