Denoising Speech Based on Deep Learning and Wavelet Decomposition

College of Chinese Literature and Media, Hubei University of Arts and Science, Xiangyang 441000, China
School of Mechanical and Electrical Engineering, Guilin University of Electronic Technology, Guilin 541004, China
Qinghai GLI Technology Limited, Xining 810001, China
School of Informatics (National Demonstrative Software School), Xiamen University, Xiamen 361005, China
Department of Computer Engineering, Changji University, Changji 831100, China


Introduction
In real environments, speech signals are inevitably corrupted by noise from the surroundings, the transmission medium, and the electrical circuits inside communication equipment. These interferences greatly degrade the performance of speech processing systems and reduce the quality of speech. Speech denoising aims to recover clean speech from noise-polluted signals, which is crucial for applications such as automatic speech recognition (ASR) and hearing aids. Several speech-denoising and speech-enhancement methods have been proposed based on the statistical differences between speech and noise characteristics, including spectral subtraction [1], model-based estimation [2], Wiener filtering [3], the subspace method [4], nonnegative matrix factorization (NMF) [5], and minimum mean square error (MMSE) estimation [6].
Most filtering methods are limited to windowing or masking operations in the frequency or time domain because of the strong time-frequency coupling between speech and noise, which makes effective signal-noise separation difficult for them. As a nonlinear filter, the neural network has been applied to this problem before, for example in early speech-denoising studies using shallow neural networks (SNNs). However, constraints on computing power and on the size of training data restricted those implementations to relatively small networks, limiting denoising performance.
By learning a deep nonlinear network structure, deep learning offers the following advantages: it can approximate complex functions, represent distributed representations of input data, and learn the essential characteristics of data from limited sample sets. Meanwhile, it emphasizes the deep structure of the learning model. The current learning frameworks usually adopt multilevel models, so model training relies on large data sets, highlighting the importance of big data for a complete and complex model. Deep learning also focuses on feature learning. Deep neural networks (DNNs) contain multiple nonlinear hidden layers, showing great potential to capture the complex relationship between noise and clean speech. Many training algorithms have been proposed to train deep networks. DNNs have been applied to speech recognition [7], speech denoising [8], and speech separation [9].
Recently, Zhao et al. [10] used both convolutional and recurrent neural network architectures to exploit local structures in both the frequency and temporal domains for speech enhancement. Tan and Wang [11] combined a convolutional encoder-decoder (CED) and long short-term memory (LSTM) into a convolutional recurrent network (CRN) to achieve real-time monaural speech enhancement. Their model is independent of noise and speaker, and the number of trainable parameters of the CRN is much smaller. The fully connected layers in deep neural networks (DNNs) and convolutional neural networks (CNNs) may not accurately describe the local information of the speech signal, especially the high-frequency components. Therefore, Fu et al. [12] proposed an enhancement model based on a fully convolutional network (FCN) operating on the raw waveform. That system performs speech enhancement in an end-to-end manner, unlike most existing denoising methods, which deal only with the amplitude spectrum.
Speech is a time-varying signal whose characteristics typically change at syllabic rates of about 10 times per second, so it is analyzed over fixed time intervals of 10-30 msec. The short-time Fourier transform (STFT) is often used to analyze speech in the time-frequency domain [8, 9]. However, the window length of the STFT is fixed, that is, the time-domain resolution is fixed; by the Heisenberg uncertainty principle, the frequency-domain resolution is then also fixed. For a low-frequency signal, the time interval should be wider to determine the frequency better; for high-frequency signals, the time window should be narrower to locate them better in the time domain. Since the resolution of the STFT is not adjustable across the time and frequency domains, it is not well suited to broadband analysis.
Wavelet analysis, developed in the 1980s, plays an important role in signal processing [13]. The wavelet transform (WT) has multiresolution properties and can adjust the window function adaptively according to the signal frequency. For low-frequency signals, WT provides low time-domain resolution and high frequency-domain resolution; for high-frequency signals, it provides high time-domain resolution and low frequency-domain resolution [14]. The wavelet transform coefficient reaches a maximum value in a certain region, and this point is called the modulus maximum of the wavelet transform in that region. In a multiresolution analysis, the modulus maxima of useful signals increase as the resolution decreases, whereas the modulus maxima of noise decrease as the resolution decreases [15]. Threshold values are set according to the characteristics of the useful signal and the noise, and the wavelet coefficients are compared against this threshold. A wavelet coefficient below the threshold is taken to correspond to noise; in the wavelet domain, the threshold thus separates the useful signal from the noise. Finally, the processed wavelet coefficients are reconstructed to obtain the denoised signal [16].

This work proposes a speech-denoising method based on deep learning. The predictor and target network signals were the amplitude spectra of the wavelet-decomposition vectors of the noisy and clean audio signals, respectively. The output of the network was the amplitude spectrum of the denoised signal. The output spectrum and the phase of the noisy wavelet-decomposition vector were used to transform the denoised wavelet-decomposition vector back to the time domain. Then, the denoised speech was obtained by the inverse wavelet transform. This method overcomes the problem that the frequency and time resolution of the STFT cannot be adjusted.
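The thresholding idea above can be sketched in a few lines. This is a minimal illustration, not the paper's method: it uses a single-level Haar wavelet (chosen for simplicity) and a hypothetical hard threshold of 0.05, zeroing detail coefficients below the threshold and keeping the rest.

```python
import numpy as np

def haar_dwt(x):
    """One-level Haar DWT: returns (approximation, detail) coefficients."""
    x = x[: len(x) // 2 * 2]                    # truncate to even length
    a = (x[0::2] + x[1::2]) / np.sqrt(2)        # low-frequency content
    d = (x[0::2] - x[1::2]) / np.sqrt(2)        # high-frequency content
    return a, d

def haar_idwt(a, d):
    """Inverse one-level Haar DWT (perfect reconstruction)."""
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2)
    x[1::2] = (a - d) / np.sqrt(2)
    return x

def denoise(x, thresh):
    """Hard-threshold the detail coefficients, keep the approximation."""
    a, d = haar_dwt(x)
    d = np.where(np.abs(d) >= thresh, d, 0.0)   # coefficients below the
    return haar_idwt(a, d)                      # threshold are treated as noise

# Smooth low-frequency signal plus small high-frequency noise.
rng = np.random.default_rng(0)
clean = np.linspace(0.0, 1.0, 64)
noisy = clean + 0.01 * rng.standard_normal(64)
out = denoise(noisy, thresh=0.05)
```

In a multilevel decomposition, the same thresholding step is applied to the detail coefficients of each level before reconstruction.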

Short-Time Fourier Transform.
STFT is widely used in speech analysis and processing and is suitable for spectrum analysis of slowly time-varying signals. In this method, the speech signal is first divided into frames, and then the Fourier transform is applied to each frame. Each frame can be regarded as a segment of an approximately stationary waveform, so the short-time spectrum of each frame approximates the spectrum of a stationary signal. Since each frame is short and stationary, the Fourier transform of the framed signal gives the STFT:

STFT_x(t, f) = \int x(\tau) w(\tau - t) e^{-j 2\pi f \tau} d\tau,

where STFT_x(t, f) is the STFT coefficient and w(t) is a window function. The STFT is a function of time t and frequency f, showing how the frequency content of the speech signal changes with time. The corresponding inverse transform can be defined as

x(t) = \iint STFT_x(\tau, f) w(t - \tau) e^{j 2\pi f t} \, df \, d\tau.

A longer window means higher spectral resolution; however, the time resolution of a long window decreases correspondingly. Because of this trade-off between time resolution and frequency resolution, an appropriate window length must be chosen for the STFT analysis at hand.
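A discrete STFT is just windowed framing followed by an FFT. The sketch below uses numpy only, with the Hamming window, a 64-sample window, and 75% overlap (hop of 16) matching the experimental settings later in the paper; the 440 Hz test tone is an illustrative input, not from the paper.

```python
import numpy as np

def stft(x, win, hop):
    """Discrete STFT: Hamming-windowed frames followed by a real FFT."""
    w = np.hamming(win)
    n_frames = 1 + (len(x) - win) // hop
    frames = np.stack([x[i * hop : i * hop + win] * w for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)      # shape: (n_frames, win // 2 + 1)

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)             # 440 Hz test tone
S = stft(x, win=64, hop=16)                 # 75% overlap
peak_bin = np.abs(S[10]).argmax()           # strongest bin in one frame
peak_hz = peak_bin * fs / 64                # bin spacing = fs / win = 250 Hz
```

The coarse 250 Hz bin spacing of the 64-sample window illustrates the fixed-resolution limitation discussed above: the tone is localized only to the nearest bin.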

Wavelet Transform. The STFT is a windowed Fourier transform (FT).
The FT is based on sinusoidal functions of different frequencies, so a signal is decomposed into a superposition of sinusoids of different frequencies.
The wavelet transform replaces the infinitely extended trigonometric basis functions with wavelet bases of finite length and decay, thereby localizing both frequency and time. The continuous wavelet transform (CWT) is the inner product of a wavelet function ϕ(t), which has good local properties in the time-frequency domain, and a square-integrable function x(t):

CWT_x(a, b) = \frac{1}{\sqrt{a}} \int x(t) \, \phi^{*}\!\left(\frac{t - b}{a}\right) dt,

where a > 0 is the scale factor and b the displacement (translation) factor. The scale factor plays an important role in the wavelet transform: when it is very small, the transform reveals the rapidly changing details of the signal; when it is large, the wavelet is stretched to show the coarse features of the signal. When ϕ(t) meets the admissibility condition, the inverse continuous wavelet transform (ICWT) is

x(t) = \frac{1}{C_\phi} \int_{0}^{+\infty} \frac{1}{a^{2}} \int CWT_x(a, b) \, \tilde{\phi}\!\left(\frac{t - b}{a}\right) db \, da,

where \tilde{\phi}(t) is the dual function of ϕ(t) and C_ϕ is the admissible constant. The data from the CWT are highly redundant, which may make them unsuitable for DNN training for speech denoising. The discrete WT (DWT) uses filter banks to implement the Mallat algorithm. Figure 1 shows the three-level DWT, where cA1, cA2, and cA3 are approximation coefficients containing the low-frequency information of the signal, and cD1, cD2, and cD3 are detail coefficients containing the high-frequency information of the signal. c is the wavelet-decomposition vector; l is the bookkeeping vector containing the number of coefficients at each level.
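The Mallat filter-bank scheme of Figure 1 can be sketched as repeated lowpass/highpass filtering with downsampling. This illustration uses the simplest (Haar) filter pair rather than the wavelet used in the paper, and builds the decomposition vector c and bookkeeping vector l for a three-level DWT.

```python
import numpy as np

def haar_step(x):
    """One Mallat analysis step with Haar filters (lowpass/highpass + downsample)."""
    a = (x[0::2] + x[1::2]) / np.sqrt(2)   # approximation (low frequency)
    d = (x[0::2] - x[1::2]) / np.sqrt(2)   # detail (high frequency)
    return a, d

def wavedec(x, levels):
    """Multilevel DWT: returns [cA_n, cD_n, ..., cD_1], as in Figure 1."""
    coeffs = []
    a = np.asarray(x, dtype=float)
    for _ in range(levels):
        a, d = haar_step(a)                # only the approximation is split again
        coeffs.insert(0, d)
    coeffs.insert(0, a)
    return coeffs

x = np.random.default_rng(0).standard_normal(1024)
cA3, cD3, cD2, cD1 = wavedec(x, 3)
c = np.concatenate([cA3, cD3, cD2, cD1])      # wavelet-decomposition vector
l = [len(cA3), len(cD3), len(cD2), len(cD1)]  # bookkeeping vector
```

For a 1024-sample input, l is [128, 128, 256, 512]: each level halves the length of the branch it splits, and the total length of c equals the signal length, i.e., the DWT introduces no redundancy.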

Convolution Neural Networks for Deep Learning.
A convolutional neural network is a deep-learning network built on the theoretical basis of the neural network. An ordinary neural network is fully connected, that is, each neuron in one layer is connected to every neuron in the next layer. For multidimensional inputs such as sound or images, the amount of information is large, so the hidden layers trained with the traditional BP algorithm require many weight parameters. The resulting slow training requires more training samples, and overfitting is more likely with insufficient training. In that case, the learned parameters are not general and cannot represent and reconstruct the input signal.
The ordinary neural network structure does not consider the characteristics of the input data: even for a small change in the original data, it does not exploit the data characteristics to optimize training. Because the network is fully connected and all input data must be considered, it cannot identify and train on local regional features in the data.
Given these problems of the ordinary neural network structure, the convolutional neural network transforms it through local receptive fields, weight sharing, and subsampling, and uses these mechanisms to learn features. Figure 2 shows the convolutional neural network model. The core convolution operation in the convolution layer is

x_j^l = f\left( \sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l \right),

where k is the convolution kernel (filter), l is the layer index, M_j is the set of input feature maps connected to the j-th output feature map, b is the corresponding bias, and f is the activation function. The output of the convolution layer goes to the downsampling layer, and downsampling is performed on each feature map output by the convolution layer.
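The convolution-plus-downsampling pipeline above can be written out directly. This is a minimal sketch for one output feature map: a "valid" 2-D convolution with an assumed ReLU activation and an assumed 2x2 max-pooling subsampling step (the paper does not specify these choices); the 6x6 input and averaging kernel are illustrative.

```python
import numpy as np

def conv_layer(x, k, b):
    """One output feature map: 'valid' 2-D convolution of x with kernel k,
    plus bias b, passed through a ReLU activation f."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k) + b
    return np.maximum(out, 0.0)            # f = ReLU

def downsample(x, p=2):
    """2x2 max pooling: the subsampling step after the convolution layer."""
    h, w = x.shape[0] // p * p, x.shape[1] // p * p
    return x[:h, :w].reshape(h // p, p, w // p, p).max(axis=(1, 3))

x = np.arange(36, dtype=float).reshape(6, 6)   # toy input feature map
k = np.ones((3, 3)) / 9.0                      # 3x3 averaging kernel
y = conv_layer(x, k, b=0.0)                    # shape (4, 4)
z = downsample(y)                              # shape (2, 2)
```

Weight sharing is visible here: the same 9 kernel weights are reused at every spatial position, instead of one weight per input-output pair as in a fully connected layer.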

Proposed Method
The wavelet-decomposition vector c can be denoted as c = [cA_n, cD_n, ..., cD_i, ..., cD_1].
Assume that the length of the signal is L and the sampling frequency is Fs, so the highest frequency of the signal is Fs/2. The frequency range of the lowest band cA_n is (0, Fs/2^(n+1)), with size L/2^n. The frequency range of cD_i is (Fs/2^(i+1), Fs/2^i), with size L/2^i. If we apply the STFT to c with window width n_w, then for cA_n the window corresponds to a width of about 2^(n-1)·n_w on the original signal, and for cD_i it corresponds to a width of 2^(i-1)·n_w on the original signal. In other words, each time the frequency halves, the effective window width doubles. Thus, we obtain the wavelet-transform effect of large time windows at low frequencies and small time windows at high frequencies, with almost no data redundancy. Figure 3 shows the proposed deep-learning training. The predictor and target network signals are the magnitude spectra of the wavelet-decomposition vectors of the noisy and clean audio signals, respectively. The network's output is the magnitude spectrum of the denoised signal. The regression network uses the predictor input to minimize the mean square error between its output and the target. The denoised wavelet-decomposition vector is converted back to the time domain using the output magnitude spectrum and the phase of the noisy wavelet-decomposition vector. Then, the denoised speech is obtained from the inverse wavelet transform.
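The band layout and effective window widths above can be tabulated directly. The values Fs = 16 kHz, L = 16384, n = 3 levels, and n_w = 64 are illustrative assumptions (only Fs = 16 kHz and n_w = 64 correspond to the experimental settings later in the paper).

```python
# Assumed illustrative values: Fs = 16 kHz, signal length L = 16384,
# n = 3 decomposition levels, STFT window width n_w = 64 applied to c.
Fs, L, n, n_w = 16000, 16384, 3, 64

# Each entry: (name, (low Hz, high Hz), number of samples, effective window width
# in original-signal samples).
bands = [("cA%d" % n, (0, Fs / 2 ** (n + 1)), L // 2 ** n, 2 ** (n - 1) * n_w)]
for i in range(n, 0, -1):
    bands.append(("cD%d" % i, (Fs / 2 ** (i + 1), Fs / 2 ** i), L // 2 ** i,
                  2 ** (i - 1) * n_w))

for name, (lo, hi), size, eff_win in bands:
    print(f"{name}: {lo:.0f}-{hi:.0f} Hz, {size} samples, "
          f"effective window ~{eff_win} original samples")
```

With these values, cA3 covers 0-1000 Hz with an effective window of 256 original samples, while cD1 covers 4000-8000 Hz with an effective window of 64 samples: wide windows at low frequencies, narrow windows at high frequencies.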

Experiments and Discussion
The work used the Chinese Common Voice Corpus 6.1 subset of the Mozilla Common Voice dataset [17] to train and test the proposed method. Vehicle noise (Volvo) from the NOISEX-92 database [18] was taken as the noise source. The speech and noise were resampled at 16 kHz. Signal-to-noise ratios (SNRs) of 5, 0, and -5 dB were set to compare the denoising effect.
The Morse wavelet function was used in the DWT. Another DNN method using the STFT and a convolutional neural network was used for comparison [19]. A window length of 64 was adopted for the STFT in our proposed method, and window lengths of 64 and 256 were adopted for the compared method. A Hamming window with 75% overlap was used in all cases. Figure 4 shows the clean speech and the noisy speech at different SNRs in the time domain and as spectrograms.
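Constructing noisy test signals at a prescribed SNR amounts to scaling the noise so that the power ratio matches the target. A minimal sketch (the sine "speech" and white noise stand in for the corpus and Volvo noise, which are not reproduced here):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that 10*log10(P_speech / P_noise) equals snr_db, then add."""
    ps = np.mean(speech ** 2)
    pn = np.mean(noise ** 2)
    g = np.sqrt(ps / (pn * 10 ** (snr_db / 10)))   # required noise gain
    return speech + g * noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 200 * np.arange(16000) / 16000)  # 1 s at 16 kHz
noise = rng.standard_normal(16000)
for snr in (5, 0, -5):                              # the three test conditions
    noisy = mix_at_snr(speech, noise, snr)
```

The same gain formula can be inverted after denoising to report the output SNR, as in Table 1.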
Scientific Programming

The noise pollutes the speech over a broad frequency band. As the SNR decreases, more speech information is drowned out. Figure 5 shows the speech signal enhanced by amplitude spectral subtraction. The noise has been partly reduced, and the spectrogram shows that the rough points of the original noisy speech have been removed to a large extent. However, due to the half-wave rectification of negative values, small, isolated peaks appear at random frequencies in the multiframe spectrum. Transformed to the time domain, these peaks sound like multiple vibratos with random frequency changes between frames, commonly referred to as musical noise.
In Figure 6, after Wiener filtering, the noise-polluted speech signal has been improved to a certain extent. However, some noise remains after Wiener filtering, which is related to the filtering characteristics of the Wiener filter. Figures 7-9 show the denoising results of the proposed method and the compared method, respectively. The results of the proposed method show a better denoising effect from high to low SNRs over the whole frequency range. The compared method with a window length of 256 achieves some noise reduction but performs poorly in the high-frequency band. The compared method with a window length of 64 shows some superiority in the high-frequency band but is still inferior to the proposed method. In the wavelet transform, the signal energy in each frequency band remains the same while the noise energy is reduced, which improves the SNR in that band and hence the denoising effect. Table 1 shows the SNR of the denoised speech, indicating that the proposed method improves on the compared method.

Conclusions
In the proposed method, the predictor and target network signals were the amplitude spectra of the wavelet-decomposition vectors of the noisy and clean audio signals, respectively. The output of the network was the amplitude spectrum of the denoised signal. The regression network used the predictor input to minimize the mean square error between its output and the target. The denoised wavelet-decomposition vector was transformed back to the time domain using the output amplitude spectrum and the phase of the noisy wavelet-decomposition vector. Then, the denoised speech was obtained by the inverse wavelet transform.
The proposed method overcomes the problem that the frequency and time resolution of the STFT cannot be adjusted. Besides, since the noise energy is gradually reduced during wavelet decomposition, the noise-reduction effect in each frequency band is improved. The experimental results showed that the proposed method has a good denoising effect over the whole frequency band.

Data Availability
The datasets and codes used for the simulations in this paper are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.