Underdetermined Separation of Speech Mixture Based on Sparse Bayesian Learning

This paper describes a novel algorithm for the underdetermined speech separation problem based on compressed sensing, an emerging technique for efficient data reconstruction. The proposed algorithm consists of two steps. First, the unknown mixing matrix is estimated from the speech mixtures in the transform domain by a K-means clustering algorithm. Second, the speech sources are recovered with an autocalibrating sparse Bayesian learning algorithm tailored to speech signals. Numerical experiments, including comparisons with other sparse representation approaches, demonstrate the achieved performance improvement.


Introduction
In recent years, compressed sensing (CS) theory [1,2] has attracted a great deal of attention in various applications. It is a novel concept that directly samples signals in a compressed manner and recovers them in a lossless or robust manner, under the assumption that the signals have a sparse or compressible representation in a particular domain [2,3]. In particular, the sensing procedure in CS preserves the useful information embedded in high-dimensional signals, and the CS recovery procedure can robustly reconstruct the original sparse signals from the collected low-dimensional samples [3]. In this manner, both sensing and storage costs can be substantially reduced. CS thus provides a potentially powerful framework for computing sparse representations of signals.
The key factor behind the success of the CS technique is the proper exploitation of sparsity. Fortunately, signal sparsity is widely present in practical applications.
Speech separation refers to the process of separating source signals from their mixtures [4,5]. When the number of mixtures is greater than or equal to the number of sources, independent component analysis (ICA) [6] based methods are widely used. However, in the case of underdetermined separation, where the number of mixtures is less than the number of sources, ICA based methods generally fail to separate the sources. In this context, the sparsity of the signal is often exploited to separate the source signals [5,7,8]. A signal is considered sparse if most of its samples are zero [4]. Since signals such as speech are sparser in the time frequency (TF) domain than in the time domain, several algorithms have been proposed for separating signals in the TF domain [7-10]. In the received mixtures, a single source point is defined as any TF point that is associated with only one source signal. If all the TF points are single source points, the sources are said to be W-disjoint. Assuming W-disjoint sources, the degenerate unmixing estimation technique (DUET) [7] first estimates a feature vector consisting of TF points; the extracted feature vectors are then clustered to separate the sources.
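To make the W-disjoint idea concrete, the following toy sketch (hypothetical data, not the paper's experiment) shows that when two "sources" occupy disjoint TF regions, a single mixture spectrogram can be separated perfectly by a DUET-style binary mask:

```python
import numpy as np

# Toy illustration of W-disjoint orthogonality: two sources occupying
# disjoint frequency bins are separated from one mixture by binary masking.
rng = np.random.default_rng(0)
F, T = 64, 100                      # frequency bins x time frames
S1 = np.zeros((F, T)); S1[:32] = rng.standard_normal((32, T))   # low band
S2 = np.zeros((F, T)); S2[32:] = rng.standard_normal((32, T))   # high band
X = S1 + S2                          # one mixture in the TF domain

# Assign each TF point to the dominant source (oracle mask for this toy case).
mask1 = np.abs(S1) >= np.abs(S2)
S1_hat = np.where(mask1, X, 0.0)
S2_hat = np.where(~mask1, X, 0.0)

print(np.allclose(S1_hat, S1), np.allclose(S2_hat, S2))
```

Real speech is only approximately W-disjoint, which is why the paper later introduces a confidence measure for single source points instead of assuming the property holds everywhere.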
In the underdetermined speech separation problem, the underdetermined mixture is a form of compressed sampling, and therefore CS theory can be utilized to solve the problem. The similarities between CS and source separation are discussed in [11]. Xu and Wang developed a framework for this problem based on CS with a fixed dictionary [12], and later proposed a multistage method for underdetermined speech separation using block-based CS [13]. However, all these methods ignore the error introduced when estimating the mixing matrix. Different from previously reported work, our proposed approach can be considered parametric and is particularly tailored to speech recovery under an inaccurate estimate of the mixing matrix. The problem is formulated in a sparse Bayesian framework and solved by Bayesian inference, owing to the advantages of sparse Bayesian algorithms [14-16]. It operates in a statistically alternating fashion, where both the estimates and their uncertainty information are utilized. Moreover, this framework facilitates parameter learning procedures for calibrating the inaccurate mixing matrix.
The rest of the paper is organized as follows. In Section 2, the underdetermined speech separation problem is formulated in a compressed sensing framework. In Sections 3 and 4, our sparse Bayesian algorithm is proposed and a speech recovery procedure that handles both the mixing matrix error and the source reconstruction is described. Numerical experiments and conclusions are given in Sections 5 and 6, respectively.

The CS Framework of Underdetermined Separation
The task of speech separation is to recover the sources from the observable signals. The noise-free instantaneous mixing model can be described as

x(t) = A s(t), (1)

where the mixing matrix A ∈ ℝ^(M×N) is unknown, x(t) ∈ ℝ^M is the observed data vector at discrete time instant t, s(t) ∈ ℝ^N is the unknown source vector, M is the number of microphones, and N is the number of sources. In this paper, we focus on underdetermined speech separation; that is, M < N.
Let us expand (1) as

x_i(t) = ∑_{j=1}^{N} a_{ij} s_j(t), i = 1, ..., M, (2)

where t = 1, 2, ..., T stands for discrete time instants, x_i(t), 1 ≤ i ≤ M, is the ith mixed signal at time instant t, a_{ij} is the (i, j)th element of the mixing matrix A, and s_j(t), 1 ≤ j ≤ N, is the jth source signal at time instant t. We carry out separation frame by frame with window length L, usually L ≪ T, and adjacent frames are overlapped.
Let us define some notation as follows: Λ_{ij} = diag(a_{ij}, ..., a_{ij}) denotes an L × L matrix, where diag{·} denotes a diagonal matrix, and

M = [Λ_{11} ⋯ Λ_{1N}; ⋮ ⋱ ⋮; Λ_{M1} ⋯ Λ_{MN}]. (3)

We also define every frame of the mixed and source signals as column vectors

b = [b_1^T, ..., b_M^T]^T, f = [f_1^T, ..., f_N^T]^T, (4)

where b_i = [x_i(t), ..., x_i(t + L − 1)]^T, i = 1, ..., M, denotes a frame of the ith mixed signal and f_j = [s_j(t), ..., s_j(t + L − 1)]^T, j = 1, ..., N, denotes a frame of the jth source signal. For every frame, (2) with (4) can be converted into the form

b = Mf. (5)

We assume that each source frame f_j has a sparse representation on some dictionary D_j,

f_j = D_j g_j, (6)

where g_j is the sparse coefficient vector and D_j is the dictionary on which f_j has a sparse representation. Then f can be sparsely represented by

f = Dg, (7)

where

D = diag(D_1, ..., D_N) (8)

is a block-diagonal dictionary composed of D_1, ..., D_N and g = [g_1^T, ..., g_N^T]^T. Then

b = MDg, (9)

where g can be recovered from the measurements b by the optimization process

min_g ‖g‖_0 (10)
subject to b = MDg, (11)

and ‖·‖_0 denotes the ℓ_0-norm. For a general CS problem, obtaining the sparsest solution to the underdetermined system (11) is known to be an NP-hard problem, which requires intractable computations [1]. This has led to considerable efforts in developing tractable approximations for finding sparse solutions. In general, most sparse recovery algorithms fall into one of the following three categories.
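The structure of the frame-level operator M can be checked numerically. The sketch below (hypothetical sizes, plain NumPy) builds M from blocks Λ_{ij} = a_{ij} I_L and verifies that b = Mf reproduces the instantaneous mixing x = As applied to each sample of the frame:

```python
import numpy as np

# Frame-level mixing operator built from diagonal blocks a_ij * I_L.
rng = np.random.default_rng(1)
M_mics, N_srcs, L = 2, 3, 8
A = rng.standard_normal((M_mics, N_srcs))   # instantaneous mixing matrix
S = rng.standard_normal((N_srcs, L))        # one frame of each source

Mix = np.block([[A[i, j] * np.eye(L) for j in range(N_srcs)]
                for i in range(M_mics)])    # (M*L) x (N*L) operator
f = S.reshape(-1)                           # stacked source frames
b = Mix @ f                                 # stacked mixture frames

assert np.allclose(b.reshape(M_mics, L), A @ S)   # framewise x = A s
```

Since ML < NL when M < N, the system b = Mf is underdetermined, which is exactly what makes the CS viewpoint applicable.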
(i) The first category is generally known as greedy algorithms. These algorithms approximate the signal's support and amplitudes iteratively. Orthogonal matching pursuit (OMP) [17] is a classical representative of this category.
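A minimal OMP sketch is given below (hypothetical random instance, not the paper's data): each iteration picks the atom most correlated with the residual and refits all selected coefficients by least squares.

```python
import numpy as np

def omp(Phi, b, k):
    """Greedy OMP: select the atom most correlated with the residual,
    then re-fit all selected coefficients by least squares."""
    residual, support = b.copy(), []
    for _ in range(k):
        support.append(int(np.argmax(np.abs(Phi.T @ residual))))
        coef, *_ = np.linalg.lstsq(Phi[:, support], b, rcond=None)
        residual = b - Phi[:, support] @ coef
    g = np.zeros(Phi.shape[1])
    g[support] = coef
    return g

# Small noiseless instance: a 3-sparse vector observed through 20 samples.
rng = np.random.default_rng(2)
Phi = rng.standard_normal((20, 50))
Phi /= np.linalg.norm(Phi, axis=0)           # unit-norm atoms
g_true = np.zeros(50); g_true[[5, 17, 40]] = [1.5, -2.0, 1.0]
b = Phi @ g_true
g_hat = omp(Phi, b, k=3)
print(np.flatnonzero(g_hat))                 # indices of the recovered support
```

Greedy methods are fast but offer weaker recovery guarantees than the convex and Bayesian approaches below.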
(ii) The second category is associated with ℓ_1 regularized optimization, which can be considered the tightest convex relaxation of the ℓ_0 norm. Basis pursuit (BP) [18] and basis pursuit denoising (BPDN) [19] are the classical ℓ_1 regularized optimization methods for recovering sparse signals in noiseless and noisy environments, respectively.
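As a sketch of this category, the following iterative soft-thresholding (ISTA) loop solves a BPDN-style ℓ_1 problem; ISTA is our illustrative solver choice here, not an algorithm used in the paper, and the instance is hypothetical:

```python
import numpy as np

def ista(Phi, b, lam, n_iter=3000):
    """ISTA for min_g 0.5*||b - Phi g||^2 + lam*||g||_1 (simple l1 sketch)."""
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2          # 1 / Lipschitz constant
    g = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        z = g + step * Phi.T @ (b - Phi @ g)          # gradient step
        g = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # shrinkage
    return g

rng = np.random.default_rng(3)
Phi = rng.standard_normal((30, 80)) / np.sqrt(30)     # near-unit-norm atoms
g_true = np.zeros(80); g_true[[3, 30, 66]] = [2.0, -1.5, 1.0]
b = Phi @ g_true
g_hat = ista(Phi, b, lam=0.01)
print(np.flatnonzero(np.abs(g_hat) > 0.5))            # dominant coefficients
```

With a small regularization weight the ℓ_1 solution closely approximates the true sparse vector in this noiseless setting.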
(iii) The third category is based on the sparse Bayesian methodology. The problem is formulated as learning and inference in a probabilistic model. By properly choosing a hierarchical prior for the signals, sparsity can be imposed statistically [14,20]. Sparse Bayesian learning is a classical method that recovers sparse signals by formulating a scaled Gaussian mixture model [14]. The main advantages of sparse Bayesian methods are their desirable statistical characteristics and their flexibility in imposing prior information.
To solve g from (11), the observation b, the mixing matrix M, and the dictionary D are required. Many methods for dictionary training and for estimating the mixing matrix have been reported [7,21,22]. For convenience and without loss of generality, we utilize K-SVD [23] dictionary learning and the K-means unmixing estimation technique [12], described in Section 3. The method for solving g is described in Section 4.
The detailed procedures are summarized as follows.
Algorithm 1 (procedure for dictionary training and mixing matrix estimation).
(1) Every speaker's speech samples are taken as training data for the K-SVD method; the dictionaries D_1, ..., D_N are obtained.
(2) The mixing matrix is estimated in the frequency domain by the K-means unmixing estimation technique. The estimate of A, that is, Â, is obtained.
(3) Format D_1, ..., D_N and Â into D and M according to the dimensions of A and the selected frame window size.
(4) Take b as a frame of the mixed speech.
(5) The sparse coefficients g are recovered by solving (10) subject to (11) in a frame-by-frame manner; the separated speech frames are then obtained as f = Dg.

Estimation of the Mixing Matrix
In the TF domain, the mixing model in (1) can be written as

X(k, t) = A S(k, t), (12)

where X and S contain the STFT coefficients of x(t) and s(t), respectively. At every TF point (k, t), we have

x(k, t) = ∑_{n=1}^{N} a_n s_n(k, t), (13)

where a_n is the nth column of the mixing matrix A and the coefficients can be complex numbers as well; A = [a_1, ..., a_n, ..., a_N] is noninvertible. The sources are generally estimated under the assumption that the source signals are W-disjoint. Defining Ω_n(k) as the set of TF points in frequency bin k where s_n is the dominant source, that is, |s_n(k, t)| ≫ |s_j(k, t)| for j ≠ n, the mixing model in (13) can then be simplified as

x(k, t) ≈ a_n s_n(k, t), (k, t) ∈ Ω_n(k). (14)

The above equation implies that, given x(k, t), the vector a_n can be estimated up to an amplitude and phase ambiguity. Without loss of generality, this ambiguity is resolved by assuming that a_n has unit norm with a positive and real first element [10]. This can be achieved by normalizing the mixture sample vector as

x̄(k, t) = x(k, t) e^{−jφ_1(k, t)} / ‖x(k, t)‖, (15)

where φ_1(k, t) is the phase of the first entry of x(k, t) and ‖·‖ denotes the ℓ_2-norm. The normalized x(k, t) can now be clustered into N clusters so that the centroid of the nth cluster corresponds to the estimate of a_n [7,10]. Conventional algorithms reported in [8-10] assume that the approximation in (14) holds for all TF points. This, however, may not be true in a real environment. Instead of assuming that (14) applies to all TF points, our proposed algorithm introduces a single source measure to quantify the validity of (14) at each TF point. Only TF points with a high confidence value are used to estimate a_n, based on which a more accurate mask can be computed to separate the sources.
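The mixture-vector normalization described above can be sketched as follows (hypothetical steering vector): removing the phase of the first channel and scaling to unit norm maps all TF points dominated by the same source to a single direction, regardless of the source's complex gain at that point.

```python
import numpy as np

def normalize_tf_point(x):
    """Remove the first entry's phase and scale to unit l2-norm."""
    x = np.asarray(x, dtype=complex)
    x = x * np.exp(-1j * np.angle(x[0]))   # first entry made real and positive
    return x / np.linalg.norm(x)

a = np.array([0.8 + 0.0j, 0.6j])           # hypothetical steering vector
x1 = 3.0 * np.exp(1j * 1.2) * a            # same source, different TF gain
x2 = -0.5 * a                              # same source, negative gain
v1, v2 = normalize_tf_point(x1), normalize_tf_point(x2)
print(np.allclose(v1, v2))                 # both map to the same direction
```

This invariance is what allows a clustering algorithm to recover the columns a_n from the normalized scatter of TF points.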
3.1. The Proposed TF Point Selection. From (14), the corresponding M × M autocorrelation matrix can be expressed as

R_x(k, t) = E{x(k, t) x^H(k, t)} = σ_n^2(k, t) a_n a_n^H, (16)

where E{·} is the expectation operator, σ_n^2(k, t) = E{s_n(k, t) s_n^*(k, t)}, and * is the conjugate operator. Therefore, for single source points we have

rank(R_x(k, t)) = 1. (17)

Considering that speech utterances are locally stationary [25], R_x(k, t) can be estimated as

R_x(k, t) ≈ (1/(2Δt + 1)) ∑_{τ=t−Δt}^{t+Δt} x(k, τ) x^H(k, τ), (18)

where Δt ≥ 1 specifies the number of neighboring TF points used to estimate R_x(k, t) and is adjustable according to the time duration within which the source signals are considered stationary. It may not be proper to directly use rank(R_x(k, t)) = 1 as the single source TF point measure, because the energy of nondominant sources is not always zero. To deal with this issue, a modified TF point selection method is provided. Assume that at a particular TF point (k_0, t_0) the source signals s_1 to s_J, J ≤ N, have nonzero energy and s_1 is the dominant source, which is λ dB higher than the other sources in terms of energy at (k_0, t_0); that is,

10 log_10 (σ_1^2(k_0, t_0) / σ_j^2(k_0, t_0)) ≥ λ, j = 2, ..., J. (19)

Assuming the sources are uncorrelated, the autocorrelation matrix at this TF point can then be expressed as

R_x(k_0, t_0) = A_J(k_0) Λ(k_0, t_0) A_J^H(k_0), (20)

where

Λ(k_0, t_0) = diag(σ_1^2(k_0, t_0), ..., σ_J^2(k_0, t_0)) (21)

and A_J(k_0) = [a_1, ..., a_J] collects the columns of the mixing matrix A associated with the active sources. Note that the diagonal elements of the J × J matrix Λ(k_0, t_0) are nonzero, so it is not proper to directly use (17) to determine single source points. A continuous measure built directly from (20) is also not feasible, since σ_n^2(k_0, t_0) and A_J(k_0) are unknown in the speech separation problem.
Considering that the decomposition in (20) is similar to the singular value decomposition (SVD) of R_x(k_0, t_0), the ratio of the singular values of R_x(k, t) is proposed as the single source TF point measure (SSTFM); that is,

SSTFM(k, t) = δ_max(k, t) / ∑_{i=1}^{M} δ_i(k, t), (23)

where δ_i(k, t) is the ith singular value of R_x(k, t) and δ_max(k, t) is the maximum singular value of R_x(k, t), so that SSTFM(k, t) approaches one at single source points. Applying the SVD to detect single source points is valid since the separation problem at each single source TF point is an overdetermined problem. The TF points after selection are illustrated in Figure 1.
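The measure can be sketched numerically as follows. The local autocorrelation matrix is estimated from 2Δt + 1 neighboring frames as in (18) and scored by how close it is to rank one; the exact scoring formula used here (largest singular value over the sum of singular values) is our reconstruction of the garbled original, and the data are hypothetical:

```python
import numpy as np

def sstfm(X_frames):
    """X_frames: M x (2*dt+1) mixture samples around one TF point."""
    R = X_frames @ X_frames.conj().T / X_frames.shape[1]  # local estimate of R_x
    sv = np.linalg.svd(R, compute_uv=False)               # singular values, descending
    return sv[0] / sv.sum()                               # near 1 iff rank one

rng = np.random.default_rng(4)
a1 = np.array([0.8, 0.6]); a2 = np.array([0.6, -0.8])     # steering vectors
s1 = rng.standard_normal(5); s2 = rng.standard_normal(5)
single = np.outer(a1, s1)                                 # only source 1 active
mixed = np.outer(a1, s1) + np.outer(a2, s2)               # two sources overlap

print(sstfm(single), sstfm(mixed))   # near 1.0 vs. noticeably smaller
```

Points where two sources overlap yield a clearly rank-two local autocorrelation matrix and hence a lower score, which is what the threshold in the next step exploits.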
Since the SSTFM provides a measure of x(k, t) being a single source point, only those x(k, t) with a high SSTFM should be used by the clustering algorithm to estimate a_n. This can be achieved by selecting the x(k, t) whose SSTFM value is above a threshold. The threshold can be predetermined, or one can simply take the median value of SSTFM(k, t), which prevents too few x(k, t) from being identified as single source points, a situation that may degrade the performance of the clustering-based estimation of a_n. As a result, for each frequency bin k, the subset

{x(k, t) : SSTFM(k, t) ≥ threshold} (24)

is used to estimate a_n, that is, M in (11).

3.2. Estimating the Columns a_n of the Mixing Matrix A. After selecting the TF points, the next stage is the estimation of the mixing matrix. Here we use the K-means clustering technique [26]; we note that this may not be the best algorithm for clustering the samples, as other algorithms can also be used. Since only the selected TF points are used in this step, the scatter plot has a clear orientation towards the directions of the column vectors of the mixing matrix. Hence these points are clustered into N groups. After clustering, the column vectors of the mixing matrix are determined by calculating the centroid of each cluster. Note that the points lying on the left-hand side of the vertical axis in the scatter diagram are mapped to the right-hand side (by changing their sign) before the centroids are calculated.
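The clustering step can be sketched in plain NumPy on synthetic normalized scatter points (hypothetical mixing matrix). The sketch includes the sign-flip mapping described above and uses a deterministic farthest-point initialization so the toy example converges reliably; a library K-means would serve equally well:

```python
import numpy as np

rng = np.random.default_rng(5)
A = np.array([[0.9, 0.5, 0.1],
              [np.sqrt(0.19), np.sqrt(0.75), np.sqrt(0.99)]])  # unit columns
labels = rng.integers(0, 3, 300)
pts = A[:, labels].T + 0.01 * rng.standard_normal((300, 2))    # scatter points
pts *= rng.choice([-1.0, 1.0], size=(300, 1))                  # sign ambiguity

pts[pts[:, 0] < 0] *= -1.0                     # map to the right half-plane
cent = [pts[0]]                                # farthest-point initialization
for _ in range(2):
    d = np.min(np.linalg.norm(pts[:, None, :] - np.asarray(cent)[None],
                              axis=2), axis=1)
    cent.append(pts[int(d.argmax())])
centroids = np.asarray(cent)
for _ in range(50):                            # Lloyd iterations
    lab = np.linalg.norm(pts[:, None, :] - centroids[None],
                         axis=2).argmin(axis=1)
    for j in range(3):
        if np.any(lab == j):
            centroids[j] = pts[lab == j].mean(axis=0)
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)  # unit columns
```

Each normalized centroid approximates one column of A, up to the permutation ambiguity noted in the experiments.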

3.3. Obtaining the Dictionary D. We assume that some speech samples are available as training data before separating the sources. In our approach, we directly train a K-SVD dictionary on these speech samples using the source-trained strategy described in [27].

Speech Recovery
The matrix MD in (11) is not precisely known. The matrix obtained from the mixing matrix estimation step (Procedure (2) in Algorithm 1) contains errors due to the TF point selection. Numerical experiments show that recovering speech with the estimated matrix introduces crosstalk residuals [11-13]. In effect, (11) with the estimate M̂ degenerates into

b = M̂Dg, (25)

and the g solved from (25) is not the sparse solution associated with MD and the observation b; instead, it is a sparse solution to a linear combination of the trained dictionaries. Thus, to obtain the correct sparse solution, (10) should alternatively be expressed as

min_g ‖g‖_0 subject to b = (I + E_Δ)M̂Dg, (26)

where E_Δ is an ML × ML diagonal matrix representing the difference between the accurate M and the estimated M̂. In order to solve g from (26), we introduce a sparse Bayesian model. Sparse Bayesian learning [20] originated in the machine learning community and has become a popular method for sparse signal recovery in CS. In this model, the sparse signal recovery problem is formulated from a Bayesian perspective, while the sparsity information is exploited by assuming a sparse prior for the signal of interest. For instance, a Laplace prior [28] corresponds to the ℓ_1 norm, which has been widely studied in existing optimization approaches. Since exact Bayesian inference is typically intractable, approximate inference approaches have been adopted in [28]. One advantage of Bayesian CS over other CS methods is its flexibility in modeling sparse signals: it can promote the sparsity of the solution and exploit additionally known structure of the sparse signal [29], as in our case. Thus a sparse Bayesian model is selected to solve (26).

4.1. Bayesian Model. In our approach, we use the sparse Bayesian model in [28] to recover the sparse signals and obtain sparser solutions. In other words, a sparser solution can be obtained by calibrating E_Δ in (26). The mixing matrix difference E_Δ is assumed to behave as independent and identically distributed Gaussian noise with an empirical variance, and a probabilistic model is used for convenient inference. According to (26), the observation b is assumed to obey the following distribution:

p(b | g; E_Δ) = N(b | (I + E_Δ)M̂Dg, β_0 I), (27)

where β_0 is the noise variance between b and the recovered signal and is assumed to be known a priori [28].

Mathematical Problems in Engineering
A probabilistic model is used for the signal to impose the sparsity-inducing Laplace prior and to enable convenient inference [28]. In the first stage of the signal model, the probability density function of the speech sources g is given as

p(g | α) = ∏_{i=1}^{K} N(g_i | 0, α_i), (28)

where α = (α_1, α_2, ..., α_K)^T and K denotes the length of the combined dictionary D. Each hyperparameter α_i, i ∈ {1, ..., K}, is modeled by an independent Gamma distribution

p(α_i | λ) = Gamma(α_i | 1, λ/2) = (λ/2) exp(−λα_i/2). (29)

Based on (28) and (29), the marginalized distribution p(g | λ) obeys a Laplace distribution [28]. Furthermore, the parameter λ controls the shape of the Laplace distribution and determines the sparsity of the signal g. To conveniently learn λ, a Gamma distribution is assumed:

p(λ | ν) = Gamma(λ | ν/2, ν/2), (30)

where ν is a parameter to be tuned and is often set to a small value, as suggested in [28].
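The claim that the Gaussian-Gamma hierarchy in (28)-(29) marginalizes to a Laplace distribution can be checked numerically. In the sketch below (illustrative values), variances are drawn from Gamma(1, λ/2), i.e., an Exponential with scale 2/λ, and the resulting Gaussian scale mixture should match the Laplace variance 2/λ and excess kurtosis 3:

```python
import numpy as np

# Numerical check: a Gaussian whose variance is Gamma(shape 1, rate lam/2)
# distributed is marginally Laplace with scale 1/sqrt(lam).
rng = np.random.default_rng(6)
lam = 4.0
alpha = rng.exponential(scale=2.0 / lam, size=200_000)  # Gamma(1, rate lam/2)
g = rng.normal(0.0, np.sqrt(alpha))                     # g_i | alpha_i

var = g.var()                        # Laplace variance is 2/lam = 0.5 here
kurt = (g**4).mean() / var**2 - 3    # Laplace excess kurtosis is 3
print(var, kurt)
```

The heavy tails induced by the mixture are precisely what makes the prior sparsity-promoting: most coefficients are pulled toward zero while a few large ones are tolerated.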
According to (27)-(30), the joint probability distribution conditioned on E_Δ can be derived as

p(g, α, λ, b; E_Δ) = p(b | g; E_Δ) p(g | α) p(α | λ) p(λ | ν). (31)

4.2. Proposed Method. An expectation maximization (EM) algorithm is implemented to solve (31). The EM algorithm requires knowledge of the posterior distribution [30]

p(g, α, λ | b; E_Δ) = p(g, α, λ, b; E_Δ) / p(b; E_Δ), (32)

where p(g, α, λ, b; E_Δ) is obtained in (31) and p(b; E_Δ) = ∭ p(g, α, λ, b; E_Δ) dg dα dλ. Since this posterior is intractable, a distribution q(Θ), where Θ = {g, α, λ} denotes the hidden variables, is assumed to approximate the true posterior by minimizing the Kullback-Leibler (KL) divergence between q(Θ) and the true posterior [31]:

q*(Θ) = argmin_{q(Θ)} KL(q(Θ) ‖ p(Θ | b; E_Δ)). (33)

The hidden variables Θ and the parameter E_Δ are then iteratively updated by the following steps.

Expectation Stage.
In this stage, the method assumes that q(Θ) has a factorized form

q(Θ) = q(g) q(α) q(λ). (34)

According to [31], the optimal distribution that minimizes (33) can be expressed as

ln q*(Θ_i) = ⟨ln p(Θ, b; E_Δ)⟩_{q(Θ∖Θ_i)} + const, (35)

where ⟨·⟩_{q(Θ∖Θ_i)} denotes the expectation with respect to q(Θ∖Θ_i) and Θ∖Θ_i represents the set Θ without Θ_i. By substituting g, α, and λ into (35) in turn, the approximations are obtained from the following procedures.
(i) For q*(g), we have

ln q*(g) = ⟨ln p(b | g; E_Δ) + ln p(g | α)⟩_{q(α)} + const. (36)

By substituting (27) and (28) into (36), q*(g) obeys a Gaussian distribution with mean and covariance matrix

μ = β_0^{−1} Σ Φ^T b, Σ = (β_0^{−1} Φ^T Φ + diag(⟨α_i^{−1}⟩))^{−1}, (37)

where Φ = (I + E_Δ)M̂D.

(ii) For q*(α), we obtain

ln q*(α) = ⟨ln p(g | α) + ln p(α | λ)⟩_{q(g)q(λ)} + const. (38)

By substituting (28) and (29) into (38), each α_i obeys a generalized inverse Gaussian distribution whose rth moment is expressed as [32]

⟨α_i^r⟩ = (⟨g_i^2⟩/⟨λ⟩)^{r/2} K_{1/2+r}(√(⟨λ⟩⟨g_i^2⟩)) / K_{1/2}(√(⟨λ⟩⟨g_i^2⟩)), (39)

where K_ν is a modified Bessel function of the second kind.

(iii) For q*(λ), we have

ln q*(λ) = ⟨ln p(α | λ) + ln p(λ | ν)⟩_{q(α)} + const. (40)

By substituting (29) and (30) into (40), it is shown that q*(λ) obeys a Gamma distribution with mean

⟨λ⟩ = (K + ν/2) / (ν/2 + (1/2) ∑_{i=1}^{K} ⟨α_i⟩). (41)

The optimal approximated distribution q*(Θ) is obtained by iterating the above steps until convergence; each hidden variable is updated once before proceeding to the next stage.
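The Gaussian q*(g) update in the expectation stage can be sketched numerically. In the snippet below, Phi stands in for (I + E_Δ)M̂D, and the values of ⟨1/α_i⟩ are hand-picked (hypothetical, rather than learned) to show how the hierarchical prior shrinks unsupported coefficients:

```python
import numpy as np

def update_q_g(Phi, b, beta0, alpha_inv_mean):
    """Variational Gaussian update for q*(g): standard SBL posterior moments."""
    Sigma = np.linalg.inv(Phi.T @ Phi / beta0 + np.diag(alpha_inv_mean))
    mu = Sigma @ Phi.T @ b / beta0
    return mu, Sigma

rng = np.random.default_rng(7)
Phi = rng.standard_normal((20, 40))        # stand-in for the sensing matrix
g_true = np.zeros(40); g_true[[2, 9]] = [1.0, -1.0]
b = Phi @ g_true + 0.01 * rng.standard_normal(20)

# Small <1/alpha_i> on the true support, huge elsewhere: the posterior mean
# then concentrates on the supported coefficients.
a_inv = np.full(40, 1e8); a_inv[[2, 9]] = 1e-2
mu, Sigma = update_q_g(Phi, b, beta0=1e-4, alpha_inv_mean=a_inv)
print(mu[[2, 9]])    # close to the true values 1.0 and -1.0
```

In the full algorithm these precision weights are not fixed by hand but are themselves re-estimated from the moments of q*(α) at every iteration.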

Maximization Stage.
According to [31], E_Δ is estimated by maximizing the expected log-likelihood

Ê_Δ = argmax_{E_Δ} ⟨ln p(g, α, λ, b; E_Δ)⟩_{q*(Θ)}. (42)

There exists a closed-form solution to (42) for updating each diagonal element of E_Δ; setting the corresponding derivative to zero yields

[E_Δ]_{ii} = (b_i φ_i^T μ − φ_i^T(Σ + μμ^T)φ_i) / (φ_i^T(Σ + μμ^T)φ_i), (43)

where φ_i^T denotes the ith row of M̂D and μ, Σ are the posterior moments of q*(g). In summary, the proposed algorithm jointly estimates g and E_Δ to achieve sparsity, and the procedures are listed in Algorithm 2.
Algorithm 2 (proposed method for solving g).

(1) Initialize E_Δ = 0 and the distributions q(α) and q(λ).
(2) Expectation stage: update q*(g), q*(α), and q*(λ) in turn, each once.
(3) Maximization stage: update E_Δ by the closed-form solution to (42).
(4) Repeat steps (2) and (3) until convergence; output the mean of q*(g) as the recovered g.

Numerical Experiments
In this section, we first describe the setting of our experiments; comparisons are then made between the performance obtained by the proposed method and that of other methods reported in the literature.

Estimating the Mixing Matrix.
To compare our proposed modified unmixing method with the algorithm described in [13], we use randomly generated mixing matrices and mix the clean speech signals from [33] to obtain the speech mixtures. The average normalized mean square error (NMSE) over 50 trials is presented in Figure 2(a), where the median of the SSTFM values is used as the threshold rather than a predetermined one, and the value of Δt in (18) is 2. Figure 2(a) shows that the proposed TF selection method has a smaller NMSE for various dimensions of the mixing matrix. However, the exact mixing matrix is not obtained, because the NMSEs of both methods are always larger than zero, which motivates the model reported in (26). Figure 2(b) illustrates two source signals at frequency bin k and the estimated SSTFM using the mixtures in a simulation trial. It can be seen that the proposed method is effective in identifying the single source points even without a fixed SSTFM threshold, as marked by the dashed ovals.

Speech Signal Recovery.
The speech samples are downloaded from the database of the source separation evaluation campaign [33]. For every speaker, separate sentences are used for testing and training. A sparse dictionary is prepared for every speaker using the K-SVD algorithm [23]. Figure 3 illustrates the sparse coefficients obtained from the source signals. The parameters for K-SVD are listed in Table 1.
The mixing matrices are randomly generated. Two examples of random 2 × 3 and 2 × 4 mixing matrices and the corresponding matrices estimated by the modified method are shown for reference, with the permutation ambiguity ignored.
Figures 5 and 6 show the signals recovered in the STFT domain by the sparse Bayesian method [14] and by the proposed method, respectively. Compared with the original sources in Figure 4, the significant differences are marked by the highlighted ovals.
In Figure 5(b), the interference inside the green ovals mainly comes from source 1 (also see Figure 4(a)) while the points in red ovals are mainly from source 3 (also see Figure 4(a)).With our proposed method, these interferences are avoided by calibrating the mixing matrix while solving the sparse solution.
The separation performance is quantified in terms of the source-to-interference ratio (SIR), source-to-distortion ratio (SDR), and source-to-artifacts ratio (SAR). Without loss of generality, we assume the separated output ŝ_j(t) corresponds to the source signal s_j(t) for ease of presentation. The three performance measures first decompose the jth source estimate ŝ_j(t) using orthogonal projections as [34]

ŝ_j(t) = s_target(t) + e_interf(t) + e_noise(t) + e_artif(t),

where s_target(t) is the part of ŝ_j(t) explained by the true source and e_interf(t), e_noise(t), and e_artif(t) are the interference, noise, and artifact error terms, respectively. The measures are then defined as

SDR = 10 log_10 (‖s_target‖^2 / ‖e_interf + e_noise + e_artif‖^2),
SIR = 10 log_10 (‖s_target‖^2 / ‖e_interf‖^2),
SAR = 10 log_10 (‖s_target + e_interf + e_noise‖^2 / ‖e_artif‖^2).

Since SDR considers both interference and artifacts, it is a more comprehensive criterion than SIR and SAR [34]. All the above measures can be computed using the BSS-EVAL Toolbox [35]. Note that in our noiseless simulations e_noise(t) = 0, which does not affect the three criteria defined above.
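A simplified version of this decomposition can be sketched as follows (time-invariant gains only, unlike the full toolbox's filtered projections, and with the noiseless-case noise term folded into the artifacts; the data are hypothetical):

```python
import numpy as np

def bss_measures(s_hat, sources, j):
    """Project the estimate onto the target source and onto the span of all
    sources; remainders give the interference and artifact terms."""
    S = np.asarray(sources, float)             # N x T matrix of true sources
    s_target = (s_hat @ S[j]) / (S[j] @ S[j]) * S[j]
    coef, *_ = np.linalg.lstsq(S.T, s_hat, rcond=None)
    p_all = S.T @ coef                         # projection onto span of sources
    e_interf = p_all - s_target
    e_artif = s_hat - p_all
    sir = 10 * np.log10(np.sum(s_target**2) / np.sum(e_interf**2))
    sdr = 10 * np.log10(np.sum(s_target**2) / np.sum((e_interf + e_artif)**2))
    sar = 10 * np.log10(np.sum((s_target + e_interf)**2) / np.sum(e_artif**2))
    return sdr, sir, sar

rng = np.random.default_rng(8)
srcs = rng.standard_normal((2, 1000))
s_hat = srcs[0] + 0.1 * srcs[1] + 0.01 * rng.standard_normal(1000)
sdr, sir, sar = bss_measures(s_hat, srcs, j=0)
print(sdr, sir, sar)   # roughly 20 dB SIR here; SDR slightly below SIR
```

Here the 0.1-weighted leakage of the second source shows up as interference and the additive noise as artifacts, matching the interpretation of the three criteria above.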
The performance gain of the proposed algorithm compared to other algorithms is illustrated in Table 2.
From Table 2, it is seen that in the underdetermined situations (2 × 3 and 2 × 4 mixing matrices) the proposed method increases both SIR and SAR, which means that, by introducing the autocalibrating E_Δ into (11) to obtain a sparser representation, both interferences and artifacts are effectively suppressed. The proposed algorithm achieves worse performance than that obtained with a known mixing matrix, as shown in the last column of Table 2, because g is solved using the calibrated matrix rather than the true MD. Note that even with a known mixing matrix, the recovered signals still contain interference, distortion, and artifacts.

Figure 1: Three-dimensional view of TF points by plotting the real parts. Different clusters are marked with different colors, and the red circles show the domain of the ideal steering vectors of the N sources.

Figure 3: The sparse coefficients of the original sources solved from the dictionary trained with K-SVD.

Figure 4: Original speech signal in the STFT domain.

Figure 5: Speech signal recovered by the sparse Bayesian method in the STFT domain.

Figure 6: Speech signal recovered by the proposed method in the STFT domain.

Table 1: The parameters for K-SVD training.

Table 2: Performance comparison of different algorithms.