This paper describes a novel algorithm for the underdetermined speech separation problem based on compressed sensing, an emerging technique for efficient data reconstruction. The proposed algorithm consists of two steps. First, the unknown mixing matrix is estimated from the speech mixtures in the transform domain using the K-means clustering algorithm. In the second step, the speech sources are recovered with an autocalibrating sparse Bayesian learning algorithm tailored to speech signals. Numerical experiments, including comparisons with other sparse representation approaches, demonstrate the achieved performance improvement.
1. Introduction
In recent years, compressed sensing (CS) theory [1, 2] has attracted a great deal of attention for various applications. It is a novel concept for directly sampling signals in a compressed manner and recovering them in a lossless or robust manner, under the assumption that the signals have a sparse or compressible representation in a particular domain [2, 3]. In particular, the sensing procedure in CS preserves the useful information embedded in high-dimensional signals, and the CS recovery procedure can robustly reconstruct the original sparse signals from the collected low-dimensional samples [3]. In this manner, both sensing and storage costs can be substantially reduced, providing a potentially powerful framework for computing sparse representations of signals. The key factor behind the success of the CS technique is proper exploitation and utilization of sparsity, which, fortunately, is widely present in practical applications.
Speech separation refers to the process of separating source signals from their mixtures [4, 5]. When the number of mixtures is greater than or equal to the number of sources, methods based on independent component analysis (ICA) [6] are widely used. However, in the underdetermined case, where the number of mixtures is less than the number of sources, ICA based methods generally fail to separate the sources. In this context, the sparsity of the signal is often exploited to separate the source signals [5, 7, 8]. A signal is considered sparse if most of its samples are zero [4]. Since signals such as speech are sparser in the time-frequency (TF) domain than in the time domain, several algorithms have been proposed for separating signals in the TF domain [7–10]. In the received mixtures, a single source point is defined as any TF point associated with only one source signal. If all the TF points are single source points, the sources are said to be W-disjoint. Assuming W-disjoint sources, the degenerate unmixing estimation technique (DUET) [7] first estimates the feature vector consisting of TF points; the extracted feature vectors are then clustered to separate the sources.
In the underdetermined speech separation problem, the underdetermined mixture is a form of compressed sampling, and therefore CS theory can be utilized to solve the problem. The similarities between CS and source separation are shown in [11]. Xu and Wang developed a CS framework for this problem using a fixed dictionary [12] and later proposed a multistage method for underdetermined speech separation using block-based CS [13]. However, all of these methods ignore the error introduced when estimating the mixing matrix. Different from previously reported work, our proposed approach can be considered parametric and is particularly tailored to speech recovery under an inaccurate estimate of the mixing matrix. The problem is formulated in a sparse Bayesian framework and solved by Bayesian inference, owing to the advantages of sparse Bayesian algorithms [14–16]. It operates in a statistical alternating fashion in which both the estimates and their uncertainty information are utilized. Moreover, this framework facilitates the parameter learning procedures needed to calibrate the inaccurate mixing matrix.
The rest of the paper is organized as follows. In Section 2, the underdetermined speech separation problem is formulated in a compressed sensing framework. In Sections 3 and 4, the mixing matrix estimation method is presented and the sparse Bayesian speech recovery algorithm, which deals with both the mixing error and speech recovery, is described. Numerical experiments and conclusions are given in Sections 5 and 6, respectively.
2. The CS Framework of Underdetermined Separation
The task of speech separation is to recover the sources from the observable signals. The noise-free instantaneous mixing model can be described as
$$x(t) = A s(t), \tag{1}$$
where the mixing matrix $A \in \mathbb{R}^{M \times N}$ is unknown, $x(t) \in \mathbb{R}^{M}$ is the observed data vector at discrete time instant $t$, $s(t) \in \mathbb{R}^{N}$ is the unknown source vector, $M$ is the number of microphones, and $N$ is the number of sources. In this paper, we focus on underdetermined speech separation; that is, $M < N$.
Let us expand (1) as
$$\begin{bmatrix} x_1(t) \\ \vdots \\ x_M(t) \end{bmatrix} = \begin{bmatrix} a_{11} & \cdots & a_{1N} \\ \vdots & & \vdots \\ a_{M1} & \cdots & a_{MN} \end{bmatrix} \begin{bmatrix} s_1(t) \\ \vdots \\ s_N(t) \end{bmatrix}, \tag{2}$$
where $t = 1, 2, \ldots, T$ stands for the discrete time instants, $x_j(t)$, $1 \le j \le M$, is the $j$th mixed signal at time instant $t$, $a_{ji}$ is the $(j,i)$th element of the mixing matrix $A$, and $s_i(t)$, $1 \le i \le N$, is the $i$th source signal at time instant $t$. We carry out separation frame by frame with window length $l$, usually $l \ll T$, and adjacent frames are overlapped.
Let us define some notation as follows: $\Lambda_{ji} = \operatorname{diag}(a_{ji}, \ldots, a_{ji})$ denotes an $l \times l$ matrix, where $\operatorname{diag}(\cdot)$ denotes a diagonal matrix, and
$$M = \begin{bmatrix} \Lambda_{11} & \cdots & \Lambda_{1N} \\ \vdots & & \vdots \\ \Lambda_{M1} & \cdots & \Lambda_{MN} \end{bmatrix}. \tag{3}$$
We also define every frame of the mixed and source signals as column vectors:
$$b = \bigl[b_1^T, \ldots, b_M^T\bigr]^T, \qquad f = \bigl[f_1^T, \ldots, f_N^T\bigr]^T, \tag{4}$$
where $b_j = [x_j(t), \ldots, x_j(t+l-1)]^T$, $j = 1, \ldots, M$, denotes a frame of the $j$th mixed signal and $f_i = [s_i(t), \ldots, s_i(t+l-1)]^T$, $i = 1, \ldots, N$, denotes a frame of the $i$th source signal.
With these definitions, for every frame (2) can be written compactly as
$$b = M f. \tag{5}$$
We assume that each source frame $f_i$ has a sparse representation on some dictionary $D_i$:
$$f_i = D_i g_i, \tag{6}$$
where $g_i$ is the sparse coefficient vector and $D_i$ is the dictionary on which $f_i$ has a sparse representation. Then $f$ can be sparsely represented by
$$f = D g, \tag{7}$$
where
$$D = \begin{bmatrix} D_1 & \cdots & 0 \\ & D_i & \\ 0 & \cdots & D_N \end{bmatrix} \tag{8}$$
is the combined dictionary composed of the $D_i$ and
$$g = \bigl[g_1^T, \ldots, g_N^T\bigr]^T. \tag{9}$$
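Concretely, $M$ is the Kronecker product $A \otimes I_l$ and $D$ is block-diagonal. A minimal numpy sketch of this construction (all dimensions and random values are illustrative only, not the paper's experimental settings):

```python
import numpy as np

rng = np.random.default_rng(0)
M_mics, N_srcs, l, k = 2, 3, 4, 6          # mics, sources, frame length, atoms per source

A = rng.standard_normal((M_mics, N_srcs))  # mixing matrix
M = np.kron(A, np.eye(l))                  # (3): each a_ji becomes the block a_ji * I_l

# (8): block-diagonal combined dictionary D built from per-source dictionaries D_i
dicts = [rng.standard_normal((l, k)) for _ in range(N_srcs)]
D = np.zeros((N_srcs * l, N_srcs * k))
for i, Di in enumerate(dicts):
    D[i * l:(i + 1) * l, i * k:(i + 1) * k] = Di

g = np.zeros(N_srcs * k)                   # sparse coefficients, one atom per source
g[[1, k + 2, 2 * k + 4]] = 1.0
f = D @ g                                  # (7): stacked source frames f = [f_1; ...; f_N]
b = M @ f                                  # (5): stacked mixture frames b = [b_1; ...; b_M]

# Mixing frame-wise with M agrees with mixing sample-wise with A:
assert np.allclose(b.reshape(M_mics, l), A @ f.reshape(N_srcs, l))
```

The final assertion checks that the frame-stacked model (5) reproduces the sample-wise model (1) exactly.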
Then
$$b = M f = M D g, \tag{10}$$
where $g$ can be recovered from the measurements $b$ by the optimization problem
$$\min \|g\|_0 \quad \text{s.t.} \quad b = M D g, \tag{11}$$
where $\|\cdot\|_0$ denotes the $\ell_0$-norm. For a general CS problem, obtaining the sparsest solution to the underdetermined system (11) is an NP-hard problem requiring intractable computations [1]. This has led to considerable effort in developing tractable approximations for finding sparse solutions. In general, most sparse recovery algorithms fall into one of the following three categories.
The first one is generally known as greedy algorithms. These algorithms approximate the signals’ support and amplitude iteratively. Orthogonal matching pursuit (OMP) [17] is a classical representative in this category.
The second category is $\ell_1$ regularized optimization, which can be considered the tightest convex relaxation of the $\ell_0$-norm. Basis pursuit (BP) [18] and basis pursuit denoising (BPDN) [19] are the classical $\ell_1$ regularized methods for recovering sparse signals in noiseless and noisy environments, respectively.
The third category is based on the sparse Bayesian methodology. The problem is formulated as learning and inference in a probabilistic model. By properly choosing the hierarchical prior for the signals, sparsity can be imposed statistically [14, 20]. Sparse Bayesian learning is a classical method to recover the sparse signals by formulating a scaled Gaussian mixtures model [14]. The main advantages of the sparse Bayesian methods are their desirable statistical characteristics and flexibility in imposing prior information.
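As an illustration of the greedy category, the following is a bare-bones OMP in numpy (a textbook sketch, not the implementation of [17]; the problem sizes below are arbitrary):

```python
import numpy as np

def omp(Phi, b, n_nonzero, tol=1e-10):
    """Orthogonal matching pursuit: greedily add the column of Phi most
    correlated with the residual, then refit the coefficients by least squares."""
    residual, support = b.copy(), []
    coef = np.zeros(0)
    for _ in range(n_nonzero):
        support.append(int(np.argmax(np.abs(Phi.T @ residual))))
        coef, *_ = np.linalg.lstsq(Phi[:, support], b, rcond=None)
        residual = b - Phi[:, support] @ coef
        if np.linalg.norm(residual) < tol:
            break
    g = np.zeros(Phi.shape[1])
    g[support] = coef
    return g

# Recover a 3-sparse vector from 25 random measurements of a length-50 signal.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((25, 50)) / np.sqrt(25)
g_true = np.zeros(50)
g_true[[3, 17, 41]] = [1.0, -2.0, 1.5]
g_hat = omp(Phi, Phi @ g_true, n_nonzero=3)
assert np.allclose(g_hat, g_true, atol=1e-8)
```

Because the residual is re-orthogonalized against all selected columns at each step, no column is ever picked twice.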
To solve (11) for $g$, the observation $b$, the mixing matrix $M$, and the dictionary $D$ are required. Many methods for dictionary training and for estimating the mixing matrix have been reported [7, 21, 22]. For convenience and without loss of generality, we use K-SVD [23] dictionary training and the K-means unmixing estimation technique [12] described in Section 3. The method for solving $g$ is then described in Section 4.
The detailed procedures are summarized as follows.
Algorithm 1 (procedure for dictionary training and mixing matrix estimation).
(1) Take every speaker's speech samples as training data and obtain the dictionaries $D_1, \ldots, D_N$ with the K-SVD method.
(2) Estimate the mixing matrix in the frequency domain by the K-means unmixing estimation technique, yielding the estimate $\hat{A}$ of $A$.
(3) Assemble $D_1, \ldots, D_N$ and $\hat{A}$ into $D$ and $\hat{M}$ according to the dimension of $A$ and the selected frame window size.
(4) Take $b$ as a frame of the mixed speech.
(5) Recover the separated coefficients $g$ frame by frame by solving (11).
3. Estimation of the Mixing Matrix
In the TF domain, the mixing model in (1) can be written as
$$X = A S, \tag{12}$$
where $X$ and $S$ contain the STFT coefficients of $x(t)$ and $s(t)$, respectively. At every TF point $(\omega, t)$, we have
$$\begin{bmatrix} x_1(\omega,t) \\ \vdots \\ x_M(\omega,t) \end{bmatrix} = \begin{bmatrix} a_{11} & \cdots & a_{1N} \\ \vdots & & \vdots \\ a_{M1} & \cdots & a_{MN} \end{bmatrix} \begin{bmatrix} s_1(\omega,t) \\ \vdots \\ s_N(\omega,t) \end{bmatrix}, \tag{13}$$
where $a_{ji}$ is the $(j,i)$th element of the mixing matrix $A$; the elements can be complex numbers as well. Denoting $n = 1, \ldots, N$, the matrix $A = [a_1, \ldots, a_n, \ldots, a_N]$ is noninvertible. The sources are generally estimated under the assumption that the source signals are W-disjoint. Defining $\Omega_n(\omega)$ as the set of TF points in frequency bin $\omega$ where $s_n$ is the dominant source, that is, $|s_n(\omega,t)| \gg |s_{n'}(\omega,t)|$ for $n' \neq n$, the mixing model in (13) can then be simplified as
$$x(\omega,t) \approx a_n s_n(\omega,t), \quad (\omega,t) \in \Omega_n(\omega). \tag{14}$$
The above equation implies that, given $x(\omega,t)$, the vector $a_n$ can be estimated up to an amplitude and phase ambiguity. Without loss of generality, this ambiguity is resolved by assuming that $a_n$ is of unit norm with the first element positive and real [10]. This can be achieved by normalizing the mixture sample vector as
$$x(\omega,t) \longleftarrow \frac{x(\omega,t)\, e^{-\jmath \phi_{x_1}(\omega,t)}}{\|x(\omega,t)\|}, \tag{15}$$
where $\phi_{x_1}(\omega,t)$ is the phase of the first entry of $x(\omega,t)$ and $\|\cdot\|$ denotes the $\ell_2$-norm. The normalized $x(\omega,t)$ can now be clustered into $N$ clusters so that the centroid of the $n$th cluster corresponds to the estimate of $a_n$ [7, 10].
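The normalization in (15) can be sketched as follows; it maps every mixture sample generated by the same column $a_n$ (up to a complex scale) to a single point on the unit sphere, which is what makes the subsequent clustering meaningful. The synthetic data below are illustrative assumptions:

```python
import numpy as np

def normalize_tf_samples(X):
    """Apply (15): remove the phase of the first mixture channel and scale to
    unit norm. X has shape (M, n_points) of complex STFT samples."""
    phase = np.exp(-1j * np.angle(X[0]))
    Xn = X * phase[None, :]
    return Xn / np.linalg.norm(Xn, axis=0, keepdims=True)

# Samples drawn from one column a_n (up to a complex scale) all map to the same point.
rng = np.random.default_rng(0)
a = rng.standard_normal(2) + 1j * rng.standard_normal(2)        # one steering vector
scales = rng.standard_normal(5) + 1j * rng.standard_normal(5)   # s_n(w,t) values
X = a[:, None] * scales[None, :]
Xn = normalize_tf_samples(X)
assert np.allclose(Xn, np.tile(Xn[:, :1], (1, 5)))              # all columns identical
```

After normalization the first entry of every sample is real and positive, matching the ambiguity resolution described above.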
Conventional algorithms reported in [8–10] assume that the approximation in (14) holds for all the TF points. This, however, may not be true in a real environment. Instead of assuming that (14) applies for all the TF points, our proposed algorithm introduces a single source measure to quantify the validity of (14) for each TF point. Only TF points with a high value of confidence are used to estimate an, based on which a more accurate mask can be computed to separate the sources.
3.1. The Proposed TF Points Selection
From (14), the corresponding $M \times M$ autocorrelation matrix can be expressed as
$$R_x(\omega,t) = E\{x(\omega,t)\, x^H(\omega,t)\} = a_n E\{s_n(\omega,t)\, s_n^*(\omega,t)\}\, a_n^H = a_n a_n^H \sigma_n^2(\omega,t), \tag{16}$$
where $E\{\cdot\}$ is the expectation operator, $\sigma_n^2(\omega,t) = E\{s_n(\omega,t)\, s_n^*(\omega,t)\}$, and $*$ is the conjugate operator. Therefore, for single source points we have
$$\operatorname{rank}\bigl(R_x(\omega,t)\bigr) = 1. \tag{17}$$
Considering that speech utterances are locally stationary [25],
$$R_x(\omega,t) \approx \sum_{\tilde{t}=t-\Delta t}^{t+\Delta t} x(\omega,\tilde{t})\, x^H(\omega,\tilde{t}), \tag{18}$$
where $\Delta t \ge 1$ specifies the number of neighboring TF points used to estimate $R_x(\omega,t)$ and is adjustable according to the time duration within which the source signals are considered stationary. It may not be proper to directly use $\operatorname{rank}(R_x(\omega,t)) = 1$ as the single source TF point measure, because the energy of the nondominant sources is not always zero. To deal with this issue, a modified TF point selection method is provided.
Assume that at a particular TF point $(\omega_o, t_o)$, source signals $s_1$ to $s_n$, $n \le N$, have nonzero energy and the signal $s_i$ is the dominant source, $\gamma$ dB higher than the other sources in terms of energy at $(\omega_o, t_o)$; that is,
$$\sigma_i^2(\omega_o, t_o) \ge 10^{\gamma/10} \sum \sigma_{\text{other}}^2(\omega_o, t_o). \tag{19}$$
Assuming the sources are uncorrelated, the autocorrelation matrix at this TF point can then be expressed as
$$R_x(\omega_o,t_o) = a_1 a_1^H \sigma_1^2(\omega_o,t_o) + \cdots + a_n a_n^H \sigma_n^2(\omega_o,t_o) = A_n \Lambda(\omega_o,t_o) A_n^H, \tag{20}$$
where
$$A_n = \bigl[a_1, \ldots, a_n\bigr], \qquad \Lambda(\omega_o,t_o) = \begin{bmatrix} \sigma_1^2(\omega_o,t_o) & & 0 \\ & \ddots & \\ 0 & & \sigma_n^2(\omega_o,t_o) \end{bmatrix}, \tag{21}$$
and $a_i = [a_{1i}, \ldots, a_{Mi}]^T$, $1 \le i \le N$, is the $i$th column of the mixing matrix $A$. Note that the diagonal elements of the $n \times n$ matrix $\Lambda(\omega_o,t_o)$ are nonzero, so it is not proper to directly use (17) to determine single source points. Alternatively, a continuous measure
$$\frac{\sigma_i^2(\omega_o,t_o)}{\sum \sigma_{\text{other}}^2(\omega_o,t_o)} \tag{22}$$
is also not feasible, since the $\sigma_n^2(\omega_o,t_o)$ and $A_n(\omega_o)$ are unknown in the speech separation problem.
Considering that the decomposition in (20) is similar to the singular value decomposition (SVD) of $R_x(\omega_o,t_o)$, the ratio of the singular values of $R_x(\omega,t)$ is proposed as the single source TF point measure (SSTFM); that is,
$$\mathrm{SSTFM}(\omega,t) = \frac{\lambda_{\max}(\omega,t)}{\sum_{i=1}^{M} \lambda_i(\omega,t) - \lambda_{\max}(\omega,t)} = \frac{\lambda_{\max}(\omega,t)}{\operatorname{trace}\bigl(R_x(\omega,t)\bigr) - \lambda_{\max}(\omega,t)}, \tag{23}$$
where $\lambda_i(\omega,t)$ is the $i$th singular value of $R_x(\omega,t)$ and $\lambda_{\max}(\omega,t)$ is the maximum singular value of $R_x(\omega,t)$. Applying the SVD to detect single source points is valid since the separation problem at each single source TF point is an overdetermined problem. The TF points after selection are illustrated in Figure 1.
Figure 1: Three-dimensional view of the TF points, plotting the real parts. Different clusters are marked with different colors and the red circles show the domain of the ideal steering vectors of the $N$ sources. Panels: all TF points; selected TF points with $\gamma = 10$; selected TF points with $\gamma = 30$.
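A direct numpy sketch of the SSTFM computation of (18) and (23); the array layout and the synthetic rank-1 test case are illustrative assumptions, not the paper's data:

```python
import numpy as np

def sstfm(X, dt=2):
    """SSTFM of (23): for each TF point, form the local autocorrelation (18)
    over 2*dt+1 neighbouring frames and return lambda_max / (trace - lambda_max).
    X has shape (M, n_freq, n_frames)."""
    M, F, T = X.shape
    out = np.zeros((F, T))
    for f in range(F):
        for t in range(T):
            lo, hi = max(0, t - dt), min(T, t + dt + 1)
            Xl = X[:, f, lo:hi]                  # local snapshot matrix
            R = Xl @ Xl.conj().T                 # (18)
            lam = np.linalg.svd(R, compute_uv=False)
            out[f, t] = lam[0] / max(lam.sum() - lam[0], 1e-12)
    return out

# A point dominated by one source scores far higher than a two-source point.
rng = np.random.default_rng(0)
a1, a2 = np.array([1.0, 0.5]), np.array([0.3, 1.0])
single = a1[:, None] * rng.standard_normal(5)[None, :]           # rank-1 snapshots
mixed = single + a2[:, None] * rng.standard_normal(5)[None, :]   # rank-2 snapshots
m_single = sstfm(single[:, None, :])[0, 2]
m_mixed = sstfm(mixed[:, None, :])[0, 2]
assert m_single > m_mixed
```

The `1e-12` floor only guards the division when the local autocorrelation is numerically rank-1.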
Since the SSTFM provides a measure of $x(\omega,t)$ being a single source point, only those $x(\omega,t)$ with a high SSTFM should be used by the clustering algorithm to estimate $a_n$. This can be achieved by selecting the $x(\omega,t)$ whose SSTFM value is above a threshold. The threshold can be predetermined or simply set to the median of $\mathrm{SSTFM}(\omega,t)$, which prevents too few $x(\omega,t)$ from being identified as single source points, a situation that may degrade the performance of the clustering-based estimation of $a_n$. As a result, for each frequency bin $\omega$, the subset
$$\bigl\{x(\omega,t) : \mathrm{SSTFM}(\omega,t) \ge \text{threshold}\bigr\} \tag{24}$$
is used to estimate $a_n$, that is, $M$ in (11).
3.2. Estimating an of Mixing Matrix A
After selecting the TF points, the next stage is the estimation of the mixing matrix. Here we use the K-means clustering technique [26]; this may not be the best algorithm for clustering the samples, and other algorithms can also be used. Since only the selected TF points are used in this step, the scatter plot has a clear orientation towards the directions of the column vectors of the mixing matrix. These points are therefore clustered into $N$ groups, and after clustering, the column vectors of the mixing matrix are determined by calculating the centroid of each cluster. Note that the points lying on the left-hand side of the vertical axis in the scatter diagram are mapped to the right-hand side (by changing their sign) before calculating the centroids.
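A compact sketch of this clustering step (plain K-means with a deterministic farthest-point initialization; the synthetic columns, noise level, and tolerance below are illustrative assumptions):

```python
import numpy as np

def estimate_columns(points, N, n_iter=50):
    """Estimate the columns a_n as unit-norm centroids of N K-means clusters.
    Points left of the vertical axis are first reflected through the origin,
    as described above, so antipodal samples join the same cluster."""
    pts = np.where(points[:, :1] < 0, -points, points)
    # farthest-point initialization: deterministic, one seed per cluster
    centroids = [pts[0]]
    for _ in range(N - 1):
        d = np.min([np.linalg.norm(pts - c, axis=1) for c in centroids], axis=0)
        centroids.append(pts[np.argmax(d)])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        labels = np.argmax(pts @ centroids.T, axis=1)   # assign by correlation
        for n in range(N):
            if np.any(labels == n):
                c = pts[labels == n].mean(axis=0)
                centroids[n] = c / np.linalg.norm(c)
    return centroids

# Synthetic check: samples scattered around three known unit columns, with a
# random sign ambiguity, are recovered up to permutation.
rng = np.random.default_rng(1)
true = np.array([[0.94, 0.34], [0.77, 0.64], [0.38, 0.92]])
true /= np.linalg.norm(true, axis=1, keepdims=True)
samples = np.vstack([c + 0.01 * rng.standard_normal((100, 2)) for c in true])
samples *= rng.choice([-1.0, 1.0], size=(len(samples), 1))
est = estimate_columns(samples, 3)
for c in true:
    assert np.min(np.linalg.norm(est - c, axis=1)) < 0.05
```

The sign reflection before clustering is exactly the mapping of left-half-plane points to the right described in the text.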
3.3. Obtaining the Dictionary D
We assume that before separating sources we have some speech samples as training data. In our approach, we directly train K-SVD dictionary on these speech samples using the source-trained strategy described in [27].
4. Speech Recovery
The matrix $MD$ in (11) is not precisely known: the matrix obtained in the mixing matrix estimation step (step (2) of Algorithm 1) contains errors due to the TF point selection. Numerical experiments show that recovering speech with the estimated matrix introduces crosstalk residuals [11–13]. In effect, (11) with $\hat{M}$ degenerates into
$$\min \|g\|_0 \quad \text{s.t.} \quad b = \hat{M} D g, \tag{25}$$
and the $g$ solved from (25) is not the sparse solution associated with $MD$ and the observation $b$; instead, it is a sparse solution to a linear combination of the trained dictionaries.
Thus, to obtain the correct sparse solution, (10) should alternatively be expressed as
$$b = \bigl(\hat{M} + E_m\bigr) D g, \tag{26}$$
where $E_m$ is an $Ml \times Nl$ diagonal matrix representing the difference between the accurate $M$ and the estimate $\hat{M}$. In order to solve (26) for $g$, we introduce a sparse Bayesian model as follows.
4.1. Bayesian Model
The sparse Bayesian model [20] originated in machine learning research and has become a popular method for sparse signal recovery in CS. In this model, the sparse signal recovery problem is formulated from a Bayesian perspective, and the sparsity information is exploited by assuming a sparse prior for the signal of interest. For instance, a Laplace prior [28] corresponds to the $\ell_1$-norm widely studied in existing optimization approaches. Since exact Bayesian inference is typically intractable, approximate Bayesian inference has been adopted in [28]. One advantage of Bayesian CS over other CS methods is its flexibility in modeling sparse signals: it can promote the sparsity of the solution and exploit additionally known structure of the sparse signal [29], as in our case. Thus a sparse Bayesian model is selected to solve (26).
In our approach, we use the sparse Bayesian model in [28] to recover the sparse signals; in other words, a sparser solution can be obtained by calibrating $E_m$ in (26). The mixing matrix difference $E_m$ is assumed to be independent and identically distributed Gaussian noise with an empirical variance $\alpha_0$. A probabilistic model is used for convenient inference. According to (26), the observation $b$ is assumed to obey the distribution
$$p(b \mid g; E_m) = \mathcal{N}\bigl(b \mid (\hat{M} + E_m) D g,\ \alpha_0 I\bigr), \tag{27}$$
where $\alpha_0$ is the noise variance between $b$ and the recovered signal and is assumed to be known a priori [28].
A probabilistic model is also used for the signal to impose the sparsity-inducing Laplace prior and enable convenient inference [28]. In the first stage of the signal model, the probability density function of the speech sources $g$ is given as
$$p(g \mid \alpha) = \prod_{i=1}^{L} \mathcal{N}(g_i \mid 0, \alpha_i), \tag{28}$$
where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_L)^T$ and $L$ denotes the length of the combined dictionary $D$. The hyperparameters $\alpha_i$, $i \in \{1, \ldots, L\}$, are modeled as independent Gamma distributions:
$$p(\alpha \mid \lambda) = \prod_{i=1}^{L} \Gamma\Bigl(\alpha_i \,\Big|\, 1, \frac{\lambda}{2}\Bigr). \tag{29}$$
Based on (28) and (29), the marginalized distribution $p(g \mid \lambda)$ obeys a Laplace distribution [28]. The parameter $\lambda$ controls the shape of the Laplace distribution and determines the sparsity of the signal $g$. To conveniently learn $\lambda$, a Gamma distribution is assumed:
$$p(\lambda \mid v) = \Gamma\Bigl(\lambda \,\Big|\, \frac{v}{2}, \frac{v}{2}\Bigr), \tag{30}$$
where $v$ is a parameter to be tuned and is often set to a small value, as suggested in [28].
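As a sanity check, the hierarchy of (28)-(29) can be verified numerically: with $\alpha_i$ exponential (a Gamma distribution with shape 1 and rate $\lambda/2$), the marginal of $g_i$ is Laplace with scale $1/\sqrt{\lambda}$, so $E|g_i| = 1/\sqrt{\lambda}$. A small Monte Carlo sketch (sample size and $\lambda$ chosen arbitrarily):

```python
import numpy as np

# Gaussian scale mixture of (28)-(29): alpha_i ~ Gamma(1, rate lam/2),
# g_i | alpha_i ~ N(0, alpha_i). The marginal is Laplace with E|g| = 1/sqrt(lam).
rng = np.random.default_rng(0)
lam, n = 4.0, 200_000
alpha = rng.exponential(scale=2.0 / lam, size=n)   # Gamma(1, rate lam/2)
g = rng.normal(0.0, np.sqrt(alpha))                # one draw per variance sample
assert abs(np.abs(g).mean() - 1.0 / np.sqrt(lam)) < 0.02
```

This is exactly why the hierarchical Gaussian-Gamma prior imposes Laplace-type ($\ell_1$-like) sparsity while keeping every conditional distribution conjugate.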
According to (28)–(30), the joint probability distribution conditioned on $E_m$ can be derived as
$$p(b, g, \alpha, \lambda; E_m) = p(b \mid g; E_m)\, p(g \mid \alpha)\, p(\alpha \mid \lambda)\, p(\lambda). \tag{31}$$
4.2. Proposed Methods
An expectation maximization (EM) algorithm is implemented to solve (31). The EM algorithm requires knowledge of the posterior distribution [30]
$$p(g, \alpha, \lambda \mid b; E_m) = \frac{p(g, \alpha, \lambda, b; E_m)}{p(b)}, \tag{32}$$
where $p(g, \alpha, \lambda, b; E_m)$ is given in (31) and $p(b) = \iiint p(g, \alpha, \lambda, b)\, dg\, d\alpha\, d\lambda$.
Subsequently, a distribution $q(\Theta)$, where $\Theta = \{g, \alpha, \lambda\}$ denotes the hidden variables, is assumed to approximate the true posterior by minimizing the Kullback-Leibler (KL) divergence between $q(\Theta)$ and the true posterior [31]:
$$q^{*}(\Theta) = \arg\min_{q(\Theta)} D_{\mathrm{KL}}\bigl(q(\Theta) \,\|\, p(\Theta \mid b; E_m)\bigr) = \arg\min_{q(\Theta)} \int q(\Theta) \log \frac{q(\Theta)}{p(\Theta \mid b; E_m)}\, d\Theta. \tag{33}$$
Then the hidden variable Θ and parameter Em can be iteratively updated by the following steps.
4.2.1. Expectation Stage
In this stage, the method assumes that $q(\Theta)$ has the factorized form
$$q(\Theta) = q(g, \alpha, \lambda) = q(g)\, q(\alpha)\, q(\lambda). \tag{34}$$
According to [31], the optimal distribution minimizing (33) can be expressed as
$$\ln q^{*}(\Theta_K) = \bigl\langle \ln p(b, \Theta; E_m) \bigr\rangle_{q(\Theta \setminus \Theta_K)}, \tag{35}$$
where $\langle \cdot \rangle_{q(\Theta \setminus \Theta_K)}$ denotes the expectation with respect to $q(\Theta \setminus \Theta_K)$ and $\Theta \setminus \Theta_K$ represents the set $\Theta$ without $\Theta_K$. By substituting $g$, $\alpha$, and $\lambda$ into (35), respectively, the approximation is obtained from the following procedures.
(i) For $q^{*}(g)$, we have
$$\ln q^{*}(g) = \bigl\langle \ln p(b \mid g; E_m)\, p(g \mid \alpha) \bigr\rangle_{q(\alpha) q(\lambda)} + c. \tag{36}$$
By substituting (27) and (28) into (36), $q^{*}(g)$ obeys a Gaussian distribution with mean and covariance matrix
$$\mu = \frac{1}{\alpha_0} \Sigma\, (\hat{M} D)^T \hat{E}_m b, \qquad \Sigma = \Bigl( \frac{1}{\alpha_0} (\hat{M} D)^T \hat{E}_m^2 (\hat{M} D) + \operatorname{diag}\Bigl\langle \frac{1}{\alpha} \Bigr\rangle_{q(\alpha)} \Bigr)^{-1}. \tag{37}$$
(ii) For $q^{*}(\alpha)$, we obtain
$$\ln q^{*}(\alpha) = \bigl\langle \ln p(g \mid \alpha)\, p(\alpha \mid \lambda) \bigr\rangle_{q(g) q(\lambda)} + c. \tag{38}$$
By substituting (28) and (29) into (38), $\alpha_n$ obeys a generalized inverse Gaussian distribution whose $i$th moment is expressed as [32]
$$\bigl\langle \alpha_n^i \bigr\rangle = \Biggl( \frac{\langle g_n^2 \rangle_{q(g)}}{\langle \lambda \rangle_{q(\lambda)}} \Biggr)^{i/2} \frac{\kappa_{0.5+i}\Bigl(\sqrt{\langle \lambda \rangle_{q(\lambda)}\, \langle g_n^2 \rangle_{q(g)}}\Bigr)}{\kappa_{0.5}\Bigl(\sqrt{\langle \lambda \rangle_{q(\lambda)}\, \langle g_n^2 \rangle_{q(g)}}\Bigr)}, \tag{39}$$
where $\kappa_a$ is the modified Bessel function of the second kind.
(iii) For $q^{*}(\lambda)$, we have
$$\ln q^{*}(\lambda) = \bigl\langle \ln p(\alpha \mid \lambda)\, p(\lambda; v) \bigr\rangle_{q(g) q(\alpha)} + c. \tag{40}$$
By substituting (29) and (30) into (40), it is shown that $q^{*}(\lambda)$ obeys a Gamma distribution with mean
$$\langle \lambda \rangle = \frac{2L + v}{\sum_{i=1}^{L} \langle \alpha_i \rangle_{q(\alpha)} + v}. \tag{41}$$
The optimal approximated distribution $q^{*}(\Theta)$ is obtained by iterating the above steps until convergence; each hidden variable is updated once before proceeding to the next stage.
4.2.2. Maximization Stage
According to [31], $E_m$ is estimated by maximizing the expected log-likelihood:
$$\hat{E}_m = \arg\max_{E_m} \bigl\langle \ln p(b, g, \alpha, \lambda; E_m) \bigr\rangle_{q(g) q(\alpha) q(\lambda)}. \tag{42}$$
There exists a closed-form solution to (42) for updating $E_m$; that is,
$$\hat{E}_m^{(i)} = \frac{\bigl(b_i - (\hat{M} D)_i \mu\bigr)\, D_i \mu}{\mu^T D_i^T D_i \mu + \operatorname{trace}\bigl(D_i^T D_i \Sigma\bigr)}, \tag{43}$$
where $D_i$ represents the $i$th row of the matrix $D$.
In summary, this proposed algorithm jointly estimates g and Em to achieve sparsity and the procedures are listed in Algorithm 2.
Algorithm 2 (proposed method for solving g).
Input: $b$, $\hat{M}D$, $v$.
while not converged do
  Update $\mu$ and $\Sigma$ by (37).
  Update $\alpha$ by (39).
  Update $\lambda$ by (41).
  Update $E_m$ by (43).
end while
Output: $g$, $E_m$.
The separated signals are recovered by Dg.
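The expectation stage above can be illustrated with a generic sparse Bayesian learning loop in numpy. This is a deliberately simplified stand-in for Algorithm 2, not a faithful implementation: it keeps only the Gaussian posterior update for $g$ (cf. (37)) with a fixed, uncalibrated sensing matrix, and replaces the Bessel-function $\alpha$ update (39), the $\lambda$ update (41), and the $E_m$ calibration (43) with the standard EM variance update $\alpha_i = \mu_i^2 + \Sigma_{ii}$:

```python
import numpy as np

def sbl_recover(Phi, b, alpha0=1e-4, n_iter=100):
    """Simplified sparse Bayesian learning: alternate the Gaussian posterior
    of g with an EM update of the per-coefficient variances alpha."""
    alpha = np.ones(Phi.shape[1])
    mu = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        Sigma = np.linalg.inv(Phi.T @ Phi / alpha0 + np.diag(1.0 / alpha))
        mu = Sigma @ Phi.T @ b / alpha0                 # posterior mean, cf. (37)
        alpha = np.maximum(mu ** 2 + np.diag(Sigma), 1e-12)
    return mu

# Recover a 3-sparse coefficient vector from 30 noiseless measurements.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((30, 50)) / np.sqrt(30)
g_true = np.zeros(50)
g_true[[3, 17, 41]] = [1.0, -2.0, 1.5]
b = Phi @ g_true
g_hat = sbl_recover(Phi, b)
assert np.argmax(np.abs(g_hat)) == 17                   # dominant coefficient found
```

Coefficients whose variance shrinks toward zero are effectively pruned, which is how the hierarchy enforces sparsity.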
5. Numerical Experiments
In this section we first describe the setting of our experiments; comparisons are then made between the performance of the proposed method and that of other methods reported in the literature.
5.1. Estimating the Mixing Matrix
To compare our proposed modified unmixing method with the algorithm described in [13], we use randomly generated mixing matrices and mix clean speech signals from [33] to obtain the speech mixtures. The average normalized mean square error (NMSE) over 50 trials is presented in Figure 2(a), where the median of the SSTFM values is used as the threshold rather than a predetermined one, and the value of $\Delta t$ in (18) is 2. Figure 2(a) shows that the proposed TF selection method attains a smaller NMSE for various dimensions of the mixing matrix. However, the exact mixing matrix is not obtained, as the NMSEs of both methods are always larger than zero, which motivates the model in (26). Figure 2(b) illustrates two source signals at frequency bin $\omega$ and the estimated SSTFM using the mixtures in one simulation trial. It can be seen that the proposed method is effective in identifying the single source points with unfixed SSTFM, as marked by the dashed ovals.
Figure 2: NMSE comparison between [13] and the proposed method, and the SSTFM obtained by the proposed method. Panels: (a) NMSE of the estimated matrix; (b) selected TF points with unfixed $\gamma$.
5.2. Speech Signal Recovery
The speech samples are downloaded from the database of the source separation evaluation campaign [33]. For every speaker, we have separate sentences for testing and training. A sparse dictionary is prepared for every speaker using the K-SVD algorithm [23]. Figure 3 illustrates the sparse coefficients obtained from the source signals. The parameters for K-SVD are listed in Table 1.
Table 1: The parameters for K-SVD training.

Number of training data: 10 speeches for every speaker
Window length: 1024
Dictionary size: 1024 × 3072
Number of iterations: 200
Window overlap: 50%
Sampling rate: 8000 Hz
Figure 3: The sparse coefficients of the original sources solved from the dictionary trained with K-SVD. Panels: original sources 1–3.
The mixing matrix is randomly generated. Two examples of random $2 \times 3$ and $2 \times 4$ mixing matrices and the corresponding matrices estimated by the modified method are shown for reference, with the permutation ambiguity ignored:
$$A_{2\times3} = \begin{bmatrix} 0.3420 & 0.6428 & 0.9239 \\ 0.9397 & 0.7660 & 0.3826 \end{bmatrix}, \qquad \hat{A}_{2\times3} = \begin{bmatrix} 0.3425 & 0.6422 & 0.9239 \\ 0.9395 & 0.7665 & 0.3826 \end{bmatrix},$$
$$A_{2\times4} = \begin{bmatrix} 0.3620 & 0.6275 & 0.7896 & 0.9184 \\ 0.9332 & 0.7786 & 0.6136 & 0.3957 \end{bmatrix}, \qquad \hat{A}_{2\times4} = \begin{bmatrix} 0.3709 & 0.6274 & 0.7913 & 0.9045 \\ 0.9287 & 0.7787 & 0.6114 & 0.4265 \end{bmatrix}. \tag{44}$$
Figures 5 and 6 show the recovered signals in the STFT domain using the sparse Bayesian method [14] and the proposed method, respectively. Compared with the original sources in Figure 4, the significant differences are marked by the highlighted ovals. In Figure 5(b), the interference inside the green ovals mainly comes from source 1 (also see Figure 4(a)), while the points in the red ovals mainly come from source 3 (also see Figure 4(a)). With our proposed method, these interferences are avoided by calibrating the mixing matrix while solving for the sparse solution.
Figure 4: Original speech signals in the STFT domain. Panels: original sources 1–3.
Figure 5: Speech signals recovered by the sparse Bayesian method in the STFT domain. Panels: recovery results of sources 1–3.
Figure 6: Speech signals recovered by the proposed method in the STFT domain. Panels: recovery results of sources 1–3.
The separation performance is quantified in terms of the source-to-interference ratio (SIR), source-to-distortion ratio (SDR), and source-to-artifacts ratio (SAR). Without loss of generality, we assume the separated output $\hat{s}_q(t)$ corresponds to the source signal $s_q(t)$ for ease of presentation. The three performance measures first decompose the $q$th source estimate $\hat{s}_q(t)$ using orthogonal projections as [34]
$$\hat{s}_q(t) = s_{\mathrm{target}}^{q}(t) + e_{\mathrm{interf}}^{q}(t) + e_{\mathrm{artif}}^{q}(t) + e_{\mathrm{noise}}^{q}(t), \tag{45}$$
where $s_{\mathrm{target}}^{q}(t)$ is the portion attributed to $s_q(t)$, $e_{\mathrm{interf}}^{q}(t)$ is the interference from the other sources, $e_{\mathrm{artif}}^{q}(t)$ is the artifacts introduced by the separation algorithm, and $e_{\mathrm{noise}}^{q}(t)$ is the noise effect. The SIR, SDR, and SAR for source $q$ are defined as
$$\mathrm{SIR}_q = 10 \log_{10} \frac{\sum_t \bigl(s_{\mathrm{target}}^{q}(t)\bigr)^2}{\sum_t \bigl(e_{\mathrm{interf}}^{q}(t)\bigr)^2}, \qquad \mathrm{SDR}_q = 10 \log_{10} \frac{\sum_t \bigl(s_{\mathrm{target}}^{q}(t)\bigr)^2}{\sum_t \bigl(e_{\mathrm{interf}}^{q}(t) + e_{\mathrm{artif}}^{q}(t) + e_{\mathrm{noise}}^{q}(t)\bigr)^2},$$
$$\mathrm{SAR}_q = 10 \log_{10} \frac{\sum_t \bigl(s_{\mathrm{target}}^{q}(t) + e_{\mathrm{interf}}^{q}(t) + e_{\mathrm{noise}}^{q}(t)\bigr)^2}{\sum_t \bigl(e_{\mathrm{artif}}^{q}(t)\bigr)^2}. \tag{46}$$
Since SDR considers both interference and artifacts, it is a more comprehensive criterion than SIR and SAR [34]. All the above measures can be computed using the BSS-EVAL Toolbox [35]. Note that in our noiseless simulations $e_{\mathrm{noise}}^{q}(t) = 0$, which does not affect the three criteria defined above.
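These measures can be sketched in numpy for the simplest case (noiseless, zero-lag projections only; the BSS-EVAL toolbox [35] allows time-invariant distortion filters, so the values below only approximate its output):

```python
import numpy as np

def bss_metrics(est, sources, q):
    """Decompose est per (45) with zero-lag orthogonal projections and return
    (SDR, SIR, SAR) of (46) in dB; sources has shape (N, T), noiseless case."""
    s = sources[q]
    s_target = (est @ s) / (s @ s) * s                      # projection onto s_q
    coef, *_ = np.linalg.lstsq(sources.T, est, rcond=None)
    p_all = sources.T @ coef                                # projection onto all sources
    e_interf = p_all - s_target
    e_artif = est - p_all
    def db(num, den):
        return 10.0 * np.log10(num / den)
    sdr = db(s_target @ s_target, ((e_interf + e_artif) ** 2).sum())
    sir = db(s_target @ s_target, (e_interf ** 2).sum())
    sar = db(((s_target + e_interf) ** 2).sum(), (e_artif ** 2).sum())
    return sdr, sir, sar

# An estimate with ~10% interference from source 1 and a little additive noise:
rng = np.random.default_rng(0)
sources = rng.standard_normal((3, 4000))
est = sources[0] + 0.1 * sources[1] + 0.01 * rng.standard_normal(4000)
sdr, sir, sar = bss_metrics(est, sources, 0)
assert 18.0 < sir < 22.0          # interference at amplitude 0.1 -> roughly 20 dB SIR
```

Since the interference term lies in the span of the sources and the artifact term is orthogonal to it, SDR can never exceed SIR here, matching the remark that SDR is the more comprehensive criterion.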
The performance gain of the proposed algorithm compared to other algorithms is illustrated in Table 2.
Table 2: Performance comparison of different algorithms (in dB).

Mixing matrix 2 × 3:
                      Basis pursuit [24]   Sparse Bayesian   Proposed method   Known mixing matrix
Average SAR of ŝ            9.6                 9.7               9.9               10.8
Average SDR of ŝ            7.2                 7.2               7.3                8.1
Average SIR of ŝ           13.4                13.4              13.8               14.5

Mixing matrix 2 × 4:
                      Basis pursuit [24]   Sparse Bayesian   Proposed method   Known mixing matrix
Average SAR of ŝ            6.9                 6.9               7.0                7.7
Average SDR of ŝ            5.2                 5.2               5.2                6.1
Average SIR of ŝ           11.8                11.8              12.1               13.4
From Table 2 it is seen that, in the underdetermined situations (mixing matrix $2 \times 3$ and $2 \times 4$), the proposed method increases both SIR and SAR, which means that, by introducing the autocalibrating $E_m$ into (11) to obtain a sparser representation, both interferences and artifacts are effectively suppressed. The proposed algorithm achieves worse performance than that obtained with a known mixing matrix, as shown in the last column of Table 2, because $g$ is solved using $(\hat{M} + E_m)D$ rather than $MD$. Note that even with a known mixing matrix, the recovered signals still contain interference, distortion, and artifacts, for two reasons. Firstly, for mixture separation, $g$ is solved using $MD$ rather than the trained dictionary $D$ directly. Secondly, the speech sources used for dictionary training are different from those used for the separation evaluation.
6. Conclusions
We have presented a compressed sensing based algorithm for the problem of instantaneous underdetermined speech separation. Since the exact mixing matrix is unavailable, a sparse Bayesian learning model is used, and the separated speeches are recovered from the approximate posterior distribution derived with the EM method. The proposed algorithm operates in a statistical manner to achieve a sparser estimation. Numerical experiments show that it provides better separation performance than the other methods reported in the literature.
Competing Interests
The authors declare that they have no competing interests.
Acknowledgments
This research work is supported by the National Natural Science Foundation of China (no. 61571174) and Zhejiang Provincial Natural Science Foundation of China (no. LY15F010010).
References
[1] E. J. Candès, J. Romberg, and T. Tao, "Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information," IEEE Transactions on Information Theory, vol. 52, no. 2, pp. 489–509, 2006.
[2] E. J. Candès and M. B. Wakin, "An introduction to compressive sampling," IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 21–30, 2008.
[3] R. G. Baraniuk, "Compressive sensing [lecture notes]," IEEE Signal Processing Magazine, vol. 24, no. 4, pp. 118–121, 2007.
[4] P. Comon and C. Jutten, Handbook of Blind Source Separation: Independent Component Analysis and Applications, Academic Press, 2010.
[5] M. S. Pedersen, J. Larsen, U. Kjems, and L. C. Parra, "A survey of convolutive blind source separation methods," Springer, New York, NY, USA, 2007.
[6] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis, Wiley-Interscience, 2001.
[7] O. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Transactions on Signal Processing, vol. 52, no. 7, pp. 1830–1847, 2004.
[8] S. Araki, H. Sawada, R. Mukai, and S. Makino, "Underdetermined blind sparse source separation for arbitrarily arranged multiple sensors," Signal Processing, vol. 87, no. 8, pp. 1833–1847, 2007.
[9] V. G. Reju, S. N. Koh, and I. Y. Soon, "Underdetermined convolutive blind source separation via time-frequency masking," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 1, pp. 101–116, 2010.
[10] H. Sawada, S. Araki, and S. Makino, "Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 3, pp. 516–527, 2011.
[11] G. Bao, Z. Ye, X. Xu, and Y. Zhou, "A compressed sensing approach to blind separation of speech mixture based on a two-layer sparsity model," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 5, pp. 899–906, 2013.
[12] T. Xu and W. Wang, "A compressed sensing approach for underdetermined blind audio source separation with sparse representation," in Proceedings of the IEEE/SP 15th Workshop on Statistical Signal Processing (SSP '09), Cardiff, UK, September 2009, pp. 493–496.
[13] T. Xu and W. Wang, "A block-based compressed sensing method for underdetermined blind speech separation incorporating binary mask," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '10), Dallas, Tex, USA, March 2010, pp. 2022–2025.
[14] D. P. Wipf and B. D. Rao, "Sparse Bayesian learning for basis selection," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2153–2164, 2004.
[15] L. Zhao, L. Wang, G. Bi, L. Zhang, and H. Zhang, "Robust frequency-hopping spectrum estimation based on sparse Bayesian method," IEEE Transactions on Wireless Communications, vol. 14, no. 2, pp. 781–793, 2015.
[16] L. Zhao, L. Wang, G. Bi, S. Li, L. Yang, and H. Zhang, "Structured sparsity-driven autofocus algorithm for high-resolution radar imagery," Signal Processing, vol. 125, pp. 376–388, 2016.
[17] J. A. Tropp and A. C. Gilbert, "Signal recovery from random measurements via orthogonal matching pursuit," IEEE Transactions on Information Theory, vol. 53, no. 12, pp. 4655–4666, 2007.
[18] S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition by basis pursuit," SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 33–61, 1998.
[19] S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition by basis pursuit," SIAM Review, vol. 43, no. 1, pp. 129–159, 2001.
[20] M. Tipping, "Sparse Bayesian learning and the relevance vector machine," Journal of Machine Learning Research, vol. 1, pp. 211–244, 2001.
[21] R. Rubinstein, M. Zibulevsky, and M. Elad, "Double sparsity: learning sparse dictionaries for sparse signal approximation," IEEE Transactions on Signal Processing, vol. 58, no. 3, pp. 1553–1564, 2010.
[22] M. G. Jafari and M. D. Plumbley, "Fast dictionary learning for sparse representations of speech signals," IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 5, pp. 1025–1031, 2011.
[23] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation," IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311–4322, 2006.
[24] E. van den Berg and M. P. Friedlander, "Probing the Pareto frontier for basis pursuit solutions," Tech. Rep. TR-2008-01, 2008.
[25] E. Vincent, S. Arberet, and R. Gribonval, "Underdetermined instantaneous audio source separation via local Gaussian modeling," in Independent Component Analysis and Signal Separation, vol. 5441 of Lecture Notes in Computer Science, pp. 775–782, Springer, New York, NY, USA, 2009.
[26] R. Xu and D. Wunsch II, "Survey of clustering algorithms," IEEE Transactions on Neural Networks, vol. 16, no. 3, pp. 645–678, 2005.
[27] T. Xu and W. Wang, "Methods for learning adaptive dictionary in underdetermined speech separation," in Proceedings of the 21st IEEE International Workshop on Machine Learning for Signal Processing (MLSP '11), September 2011, pp. 1–6.
[28] S. D. Babacan, R. Molina, and A. K. Katsaggelos, "Bayesian compressive sensing using Laplace priors," IEEE Transactions on Image Processing, vol. 19, no. 1, pp. 53–63, 2010.
[29] L. He and L. Carin, "Exploiting structure in wavelet-based Bayesian compressive sensing," IEEE Transactions on Signal Processing, vol. 57, no. 9, pp. 3488–3497, 2009.
[30] L. Zhao, G. Bi, L. Wang, and H. Zhang, "An improved auto-calibration algorithm based on sparse Bayesian learning framework," IEEE Signal Processing Letters, vol. 20, no. 9, pp. 889–892, 2013.
[31] D. G. Tzikas, A. C. Likas, and N. P. Galatsanos, "The variational approximation for Bayesian inference: life after the EM algorithm," IEEE Signal Processing Magazine, vol. 25, no. 6, pp. 131–146, 2008.
[32] B. Jørgensen, Statistical Properties of the Generalized Inverse Gaussian Distribution, Springer, New York, NY, USA, 1982.
[33] S. Araki, F. Nesta, E. Vincent, Z. Koldovsky, G. Nolte, A. Ziehe, and A. Benichoux, "The 2011 signal separation evaluation campaign (SiSEC2011): audio source separation," in Proceedings of the International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA '12), Tel Aviv, Israel, March 2012, Springer, pp. 414–422.
[34] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
[35] C. Févotte, R. Gribonval, and E. Vincent, "BSS EVAL toolbox user guide," Tech. Rep. 1706, IRISA, Rennes, France, 2005.