Bayesian Sparse Estimation Using Double Lomax Priors

Sparsity-promoting priors combined with Bayesian inference are an effective approach to solving sparse linear models (SLMs). In this paper, we first introduce a new sparsity-promoting prior, coined the Double Lomax prior, which corresponds to a three-level hierarchical model, and we then derive a full variational Bayesian (VB) inference procedure. When a noninformative hyperprior is assumed, we further show that the proposed method has one more latent variable than the canonical automatic relevance determination (ARD). This variable has a smoothing effect on the solution trajectories, thus providing improved convergence performance. The effectiveness of the proposed method is demonstrated by numerical simulations, including autoregressive (AR) model identification and compressive sensing (CS) problems.


Introduction
Consider the following linear model: y = Φx + e, where Φ ∈ R^{m×n} is the design matrix, y = [y_1, ..., y_m]^T is the measurement vector, e is the measurement noise, and x = [x_1, ..., x_n]^T is the unknown coefficient vector. This linear model is called underdetermined if m ≤ n and overdetermined otherwise. In the underdetermined case there are in general numerous solutions, and additional assumptions must be imposed on x to obtain the desired one. In the applications we consider, the sparsity of x is an important assumption: most elements of x should be tiny or equal to zero when they are not required to explain the data, but a few should be allowed to be large if necessary. A linear model with such an assumption on x is referred to as a sparse linear model (SLM). SLMs can represent various types of signals in different transform domains [1-5]. Obtaining such a representation is equivalent to solving a linear inverse problem with sparsity-promoting regularization. It is a fundamental issue in model (basis) selection [5, 6], compressive sensing [7, 8], source localization [9, 10], and data fusion [11, 12]. Under the sparsity assumption, one seeks a representation of y with a minimal number of basis vectors in Φ. This is equivalent to seeking the sparsest solution for x, the one with a maximal number of elements equal to zero. The sparsest solution can be induced by the improper ℓ0-norm of x [13]: ‖x‖_0 = #{i : x_i ≠ 0}, the number of nonzero elements of x.
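As a concrete sketch of this setup (the sizes, support, and values below are illustrative, not taken from the paper), the following generates a small sparse linear model and evaluates the ℓ0 count:

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 12, 10                       # measurements, unknowns (illustrative sizes)
Phi = rng.standard_normal((m, n))   # design matrix
x_true = np.zeros(n)
x_true[[0, 1, 4, 5, 6, 9]] = [1, -1, 1, -1, 1, -1]  # a sparse coefficient vector
e = 0.01 * rng.standard_normal(m)   # measurement noise
y = Phi @ x_true + e                # linear model y = Phi x + e

# The (improper) l0-"norm" simply counts nonzero entries.
l0 = np.count_nonzero(x_true)
print(l0)  # 6
```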
The associated maximum a posteriori (MAP) estimation (assuming that e is white Gaussian noise) is a penalized least squares problem: min_x ‖y − Φx‖² + λ‖x‖_0, where λ is the model parameter that controls the relative importance of the error term and the sparseness term. The challenge in solving (2) is that it requires a combinatorial search, which is NP-hard in general. A natural approach is to replace ‖⋅‖_0 by a tractable approximation. Among all tractable approximations, the ℓ1-norm is the tightest convex relaxation of the ℓ0-norm; it leads to the successful Lasso algorithm [1, 3, 4, 14-16]. However, a nonconvex relaxation is able to recover sparsity more efficiently: it has been proved that a nonconvex relaxation that closely resembles the ℓ0-norm can find the correct solution with fewer measurements [17-20].
One difficulty with nonconvex relaxation is the existence of many local minima. Point-estimation methods such as MAP may easily get stuck in poor local solutions [21]. A posterior-estimation Bayesian approach can nicely alleviate this problem. Given the sparsity-promoting prior p(x), which is the probabilistic representation of a particular ℓ0-norm relaxation, Bayesian inference can be expressed as p(x | y) = p(y | x)p(x)/p(y), where p(x | y) is the posterior. The Bayesian approach computes the posterior mean E[x | y] = ∫ x p(x | y) dx instead of its mode; that is, it integrates rather than maximizes over p(x | y). In the case of many local minima, the mass of the probability density tends to concentrate at a strong minimum, which is more likely to be the global optimum. Although E[x | y] is not necessarily the exact global optimum in general, it has been proved [22] that, by taking a "zero temperature limit," E[x | y] converges to the global optimum. In other words, as −ln p(x) approaches the ℓ0-norm, E[x | y] converges to the ℓ0 solution. In the Bayesian formulation, the choice of prior p(x) matters for both solution sparsity and computational tractability. The Student-t prior is a relatively standard choice that stands out by being non-log-concave (resulting in a nonconvex penalty) and by having a Gaussian-integral representation for efficient processing. With S(⋅) denoting the Student-t prior, the distribution is defined as S(x | a, λ) ∝ (1 + λx²/(2a))^{−(a+1/2)}, where a controls the degree of sparsity enforced and is known as the shape parameter, and λ is the scale parameter. Additionally, denoting the Gaussian prior with precision η by N(x | 0, η⁻¹) and the Gamma prior by G, defined as G(η | a, b) = (b^a/Γ(a)) η^{a−1} exp(−bη), the Student-t prior corresponds to a Gaussian integral as follows: S(x | a, λ) = ∫ N(x | 0, η⁻¹) G(η | a, a/λ) dη, where η is the latent variable modifying the precision of the Gaussian mixture. The representation in (5) can be treated as an integral over the scale parameter of the Gaussian density and is thus referred to as a Gaussian scale mixture (GSM) [23], which naturally
invokes a two-level hierarchical Bayes model. Specifically, letting a → +0 leads to the noninformative-hyperprior case of the Student-t prior and gives the well-known automatic relevance determination (ARD) prior (with the scale parameter λ also assumed to be unity) [24]: p(x) ∝ 1/|x|. As this limiting case, the ARD prior provides the strongest sparsity-promoting potential among all Student-t priors with different values of a (see the remark of Proposition 1), and it has drawn tremendous attention in many problems [5, 8, 25].
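The GSM identity can be checked numerically: drawing a Gaussian whose precision comes from a Gamma prior with shape ν/2 and rate ν/2 reproduces a Student-t with ν degrees of freedom (a standard identity; the paper's exact parametrization may differ by a scale factor):

```python
import numpy as np

rng = np.random.default_rng(1)
nu = 3.0              # shape parameter (degrees of freedom)
N = 200_000

# Precision eta ~ Gamma(shape=nu/2, rate=nu/2); then x | eta ~ N(0, 1/eta).
# NumPy's gamma() takes a *scale* argument, i.e. scale = 1/rate.
eta = rng.gamma(shape=nu / 2, scale=2.0 / nu, size=N)
x = rng.standard_normal(N) / np.sqrt(eta)

# The 95% quantile of a Student-t with 3 degrees of freedom is ~2.353.
q95 = np.quantile(x, 0.95)
print(q95)
```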
In this paper, we formulate the SLM problem from a Bayesian perspective. In place of the two-level Student-t prior, we propose the use of a new non-log-concave sparsity prior, referred to as the Double Lomax prior, which corresponds to a three-level hierarchical Bayes model. Additionally, a full VB inference procedure is derived to solve the SLM using Double Lomax priors. As shown here, when noninformative hyperpriors are assumed (a → +0), the updating procedure of the proposed method includes the canonical ARD updating procedure as a special case, but it has better global convergence behavior that can result in improved sparse estimates.
The organization of this paper is as follows. In Section 2, we present the proposed Double Lomax prior and demonstrate its properties. In Section 3, we formulate the Bayesian model of the SLM using Double Lomax priors based on a generalized noise model. In Section 4, a full VB inference procedure is developed. In Section 5, we discuss the relationship between the proposed method and the ARD approach. Experimental results are presented in Section 6, where two SLM applications with different Φ matrix properties are considered. One application involves order determination and coefficient estimation for autoregressive (AR) models, where the rows of the Φ matrix are strongly correlated. The other is sparse signal recovery in the standard compressive sensing (CS) setup, where the rows of the Φ matrix are uncorrelated. Conclusions are given in Section 7.
Moreover, note the limiting behavior of the Double Lomax density in (7), p(x | a, λ) = (λ/2)(1 + λ|x|/a)^{−(a+1)}: when the shape parameter a goes to infinity, the Double Lomax prior asymptotically becomes the Laplace prior with inverse scale λ; when the shape parameter goes to zero, the Double Lomax prior concentrates its mass strongly on zero, with the same summit but much longer and thinner tails. This behavior can be observed in Figure 1, which shows Double Lomax priors for different values of a.

Proof. Equation (7) can be expressed as a scale mixture of Laplace densities, as in (12). Meanwhile, note that the Laplace prior itself has a hierarchical representation, L(x | λ) = ∫ N(x | 0, τ) E(τ | λ²/2) dτ, where E(τ | c) = c exp(−cτ) is an exponential prior. Substituting (13) into (12) gives (11). This proves Proposition 3.
Remark 4. Equation (11) corresponds to a three-level hierarchical Bayes model with two latent variables τ and υ. On the first level, the random variables x_i are generated from zero-mean Gaussian priors with independent variances τ_i. On the second level, the τ_i are generated from exponential priors with independent inverse scales, formed by (1/2)λ² multiplying a set of independent variables υ_i². On the third level, the positive parameters υ_i are generated from a Gamma prior whose two parameters share the same value a. Moreover, besides this GSM view, the Double Lomax prior can also be considered a Laplacian scale mixture, and the inverse scale υ_i can be thought of as the width controller of the corresponding Laplace prior: if υ_i is large, the corresponding distribution is narrow and the regularizing effect towards zero is strong.
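The three-level hierarchy can be sampled directly. With λ = 1 and shape a = 1, marginalizing the hierarchy gives the closed-form density p(x) = 1/(2(1 + |x|)²), a two-sided (double) Lomax, which the Monte Carlo sketch below checks via a tail probability. The parameter placement follows the reading of (11) above and should be treated as an assumption:

```python
import numpy as np

rng = np.random.default_rng(2)
a = 1.0            # shape hyperparameter (both Gamma parameters set equal)
N = 100_000

# Level 3: upsilon from a Gamma prior with equal shape and rate a.
ups = rng.gamma(shape=a, scale=1.0 / a, size=N)
# Level 2: tau from an exponential prior with inverse scale ups^2 / 2
# (i.e. mean 2 / ups^2), taking lambda = 1.
tau = rng.exponential(scale=2.0 / ups**2)
# Level 1: x from a zero-mean Gaussian with variance tau.
x = rng.standard_normal(N) * np.sqrt(tau)

# For a = 1 the marginal is p(x) = 1/(2(1+|x|)^2), so P(|x| > 5) = 1/6.
frac = np.mean(np.abs(x) > 5)
print(frac)  # ~ 1/6
```

Note the much heavier tails than a Laplace with the same scale, for which P(|x| > 5) = exp(−5) ≈ 0.007.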

Sparse Linear Model Using Double Lomax Priors
We assume independent Double Lomax priors over the coefficients and use the model in (1). Without loss of generality, the scale parameter λ is assumed to be unity. Using (11), the coefficient model of (1) can be written as a hierarchy in which τ_i is the latent variable modifying the variance of the GSM model, υ_i is the latent variable modifying the inverse scale of the exponential prior, and Λ = diag(τ_i⁻¹) denotes the precision matrix.
Traditionally, the noise of an SLM is presumed to be Gaussian, which means that the measurements are also Gaussian distributed. However, in many real applications, measurements have tails that decay more slowly than Gaussian, which cannot be described well by a Gaussian noise model. Here we allow for very slow tail decay by modeling the noise with a Student-t distribution. As a generalized noise model, the Student-t can cover a wide range of noises, from Gaussian to very heavy-tailed. It also connects robust statistics with sparsity priors through their shared capability of modeling extreme events: variables that have small probabilities but a large value range.
Using a hierarchical Bayes model similar to (5), denoting the shape parameter (also called the degrees of freedom), the scale parameter β (also called the precision parameter), and a latent variable for each measurement, and imposing two additional Gamma priors on the shape and scale parameters, the hierarchical form of the Student-t distributed data can be written with φ_i the ith row of Φ and L the diagonal precision matrix formed from the measurement latent variables.

Variational Bayesian Sparse Regression
It is usually difficult to find analytical results for the integrals in Bayesian estimation, so approximation methods are required. There are two major classes of approximation. One is Markov chain Monte Carlo (MCMC), which represents the posterior by samples; unbiased results can be obtained in infinite time, but MCMC is relatively time consuming [29]. The other is variational relaxation [22], which relaxes the Bayesian estimation into a series of feasible optimization problems, providing a good tradeoff between accuracy and efficiency. VB inference [30-32] is one approach of this class.
VB approximates the posterior density p(Θ | y) by introducing a factorable distribution q(Θ). Using Jensen's inequality, we obtain a lower bound on the log probability of y: ln p(y) ≥ ∫ q(Θ) ln[p(y, Θ)/q(Θ)] dΘ. In this work, we factorize the approximating distribution q(Θ) over groups of parameters Θ_i as q(Θ) = ∏_i q(Θ_i). Maximizing the right-hand side of (22) with respect to q(Θ_i), we get a general expression for the optimal solution, q(Θ_i) ∝ exp⟨ln p(y, Θ)⟩_{j≠i}, where ⟨⋅⟩_{j≠i} denotes the expectation operator with respect to the distributions q(Θ_j) for all j ≠ i, and ∫ exp⟨ln p(y, Θ)⟩_{j≠i} dΘ_i is the normalization constant. If a conjugate prior exists and is chosen for each group, the approximate posterior has the same functional form as the prior [30], and the normalization constant need not be computed. We now derive a full VB inference procedure for the above problem, using the general optimal solution in (24) to obtain the factorized approximate posterior densities q(Θ_i). However, unlike most VB updates in the literature, the challenge here is that several random variables in the proposed model have no conjugate priors for their likelihoods; hence, to obtain an iterative algorithm in closed form, extra integration analysis has to be incorporated into our derivation.

Coefficients, x.
To compute q(x), we only need to consider the terms that are functionally dependent on x. Because p(x | τ) is the conjugate prior for x, q(x) is still Gaussian, and it is convenient to ignore the normalization constant. Comparing the quadratic term with the standard Gaussian form, we identify the covariance and mean of this Gaussian as Σ = (⟨β⟩Φ^T⟨L⟩Φ + ⟨Λ⟩)⁻¹ and μ = ⟨β⟩ΣΦ^T⟨L⟩y, where ⟨L⟩ is the expected noise-model precision matrix and ⟨Λ⟩ = diag(⟨τ_i⁻¹⟩).

Coefficient's Latent Variables, τ.

From (14) and (15) we observe that the terms involving τ are composed of elementwise multiplicative terms over x_i, τ_i, and υ_i, so the approximate posterior density factorizes further as q(τ) = ∏_i q(τ_i), where the subscript i denotes the ith element of a vector or the ith diagonal element of a matrix. However, since the exponential distribution is not the conjugate prior for the variance of a Gaussian, the normalization term must be reinstated. Integrating over τ_i, the complete posterior in (30) turns out to be a generalized inverse Gaussian (GIG) distribution, with density [33] GIG(τ | p, a, b) = (a/b)^{p/2} / (2K_p(√(ab))) ⋅ τ^{p−1} exp(−(aτ + b/τ)/2), where K_p(⋅) is the modified Bessel function of the third kind with index p. Matching terms shows that q(τ_i) is a GIG with p = 1/2, a = ⟨υ_i²⟩, and b = ⟨x_i²⟩. The rth-order moment of a GIG distribution is given by [33] ⟨τ^r⟩ = (b/a)^{r/2} K_{p+r}(√(ab)) / K_p(√(ab)). Applying this to the specific GIG above, and using K_{−1/2} = K_{1/2}, gives in particular ⟨τ_i⁻¹⟩ = (⟨υ_i²⟩/⟨x_i²⟩)^{1/2}.
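The GIG moment identity quoted in this subsection can be checked numerically; SciPy's `kv` implements the Bessel K function, and the parameter values below are arbitrary test values:

```python
import numpy as np
from scipy.special import kv  # Bessel K function (modified Bessel, K_p)

def gig_moment(r, p, a, b):
    """r-th moment of GIG(p, a, b) via the Bessel-function identity."""
    s = np.sqrt(a * b)
    return (b / a) ** (r / 2) * kv(p + r, s) / kv(p, s)

# For p = -1/2 the identity simplifies, since K_{1/2} = K_{-1/2}:
# the first moment is exactly sqrt(b/a).
p, a, b = -0.5, 2.0, 3.0
assert abs(gig_moment(1, p, a, b) - np.sqrt(b / a)) < 1e-12

# Cross-check against direct numerical integration of the GIG density.
tau = np.linspace(1e-6, 60.0, 400_000)
dt = tau[1] - tau[0]
dens = tau ** (p - 1) * np.exp(-(a * tau + b / tau) / 2)
dens /= dens.sum() * dt               # normalize on the grid
m1 = (tau * dens).sum() * dt          # numerical first moment
print(m1, gig_moment(1, p, a, b))
```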

Coefficient's Latent Variables, 𝜐.
The approximate posterior density of υ can be factorized as q(υ) = ∏_i q(υ_i). Equation (39) contains logarithmic, square, and linear terms in υ_i and thus does not correspond to any standard distribution. Using a definite integration result in [34], the normalization term of q(υ_i) can be expressed through the parabolic cylinder function D_ν(⋅), whose computation can be implemented as described in [35]. However, when the noninformative hyperprior is assumed, a is set to zero and q(υ_i) becomes a Rayleigh distribution with density q(υ_i) = ⟨τ_i⟩ υ_i exp(−⟨τ_i⟩υ_i²/2), for which ⟨υ_i²⟩ = 2/⟨τ_i⟩. Equation (43) thus states that ⟨υ_i²⟩ is simply inversely proportional to ⟨τ_i⟩ when a → +0.
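The inverse proportionality in (43) is easy to check by sampling: for a Rayleigh q(υ) whose scale is 1/√⟨τ⟩, the second moment is 2/⟨τ⟩. This is a sanity-check sketch with ⟨τ⟩ values chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(3)

# For a Rayleigh q(upsilon) with scale 1/sqrt(<tau>), the second moment
# is 2/<tau>, so <upsilon^2> * <tau> should stay constant at 2.
ratios = []
for mean_tau in (0.5, 2.0, 8.0):
    ups = rng.rayleigh(scale=1.0 / np.sqrt(mean_tau), size=200_000)
    ratios.append(np.mean(ups**2) * mean_tau)
print(ratios)  # each ~ 2.0
```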

Noise Precision Parameter, 𝛽.
Since the Gamma distribution is conjugate to the precision of a Gaussian, we again obtain a Gamma distribution for β, with its parameters given in (45).
We summarize the VB inference for the Double Lomax formulation (noninformative hyperprior) as follows.
For the coefficient model, the updates are given by (54)-(59); for the noise (observation) model, by (61)-(65). Given initial expectations, the iterative updating can be carried out in closed form.
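A minimal numerical sketch of this kind of closed-form alternating iteration is the ARD special case below (Gaussian noise, no ⟨υ²⟩ smoothing variable — a simplification of the full Double Lomax updates; all sizes and values are illustrative):

```python
import numpy as np

def vb_ard_sparse(Phi, y, n_iter=200, eps=1e-10):
    """ARD-style VB sparse regression sketch (a simplification: the
    paper's Double Lomax updates add the latent <upsilon^2> terms)."""
    m, n = Phi.shape
    alpha = np.ones(n)     # <tau_i^{-1}>, coefficient precisions
    beta = 1.0             # <beta>, noise precision
    for _ in range(n_iter):
        # q(x): Gaussian with precision beta * Phi^T Phi + diag(alpha)
        S = np.linalg.inv(beta * Phi.T @ Phi + np.diag(alpha))
        mu = beta * S @ Phi.T @ y
        # ARD fixed-point update: <tau_i^{-1}> = 1 / <x_i^2>
        alpha = 1.0 / (mu**2 + np.diag(S) + eps)
        # noise precision from the expected residual
        resid = y - Phi @ mu
        beta = m / (resid @ resid + np.trace(Phi @ S @ Phi.T) + eps)
    return mu

rng = np.random.default_rng(4)
Phi = rng.standard_normal((40, 20))
x_true = np.zeros(20)
x_true[[2, 7, 11]] = [1.5, -2.0, 1.0]
y = Phi @ x_true + 0.01 * rng.standard_normal(40)
x_hat = vb_ard_sparse(Phi, y)
print(np.round(x_hat, 2))
```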

Relations to ARD Update Procedure
Now let us inspect the relations between the proposed Double Lomax updating procedure and the ARD updating procedure. Considering the updates (57) and (59) induced by the Double Lomax priors, it can be shown that if just (57) and (59) are iterated to convergence, the final result is equivalent to directly setting ⟨τ_i⁻¹⟩ = 1/⟨x_i²⟩, which is the exact ARD update [26]. To prove this, construct an auxiliary function F of ⟨τ_i⁻¹⟩ and ⟨υ_i²⟩. It can be shown that (57) minimizes F partially with respect to ⟨τ_i⁻¹⟩, and (59) minimizes F partially with respect to ⟨υ_i²⟩; performing the complete minimization over both ⟨τ_i⁻¹⟩ and ⟨υ_i²⟩ directly yields (66). Therefore, when the noninformative hyperprior is assumed, the proposed Double Lomax updates have only one more latent variable, ⟨υ_i²⟩, than the canonical ARD updates, and this latent variable can even be eliminated if only the updating equations (57) and (59) are considered.
However, if we consider the full iteration procedure involving all the updating equations, (54), (55), (57), (59), (61), (63), and (65), then a difference emerges: the solution trajectories of Double Lomax are not necessarily consistent with those of ARD. To see this, write ARD's equation (66) and Double Lomax's equations (57) and (59) from an iteration perspective: for ARD, ⟨τ_i⁻¹⟩^(t+1) = 1/⟨x_i²⟩^(t), while for Double Lomax, ⟨τ_i⁻¹⟩^(t+1) = (⟨υ_i²⟩^(t)/⟨x_i²⟩^(t))^{1/2}, with ⟨υ_i²⟩^(t) obtained from (59), where t denotes the tth iteration.
It can be seen from (69) that the latent variable ⟨υ_i²⟩^(t) of Double Lomax is a function of ⟨τ_i⁻¹⟩ that involves all the updates of ⟨τ_i⁻¹⟩ from former iterations. Considering this expression and comparing (67) with (68), the difference between ARD and Double Lomax (a → +0) becomes clear: in contrast to ARD's ⟨τ_i⁻¹⟩^(t+1), which depends only on ⟨x_i²⟩^(t), the Double Lomax ⟨τ_i⁻¹⟩^(t+1) depends not only on ⟨x_i²⟩^(t) but also on all the former updates ⟨τ_i⁻¹⟩^(t−1), ⟨τ_i⁻¹⟩^(t−2), ..., ⟨τ_i⁻¹⟩^(0). Consequently, the Double Lomax method is able to smooth the solution trajectory of ⟨x_i²⟩^(t), and also the trajectories of all other variables, through the smoothness-propagation effect caused by the alternating-update strategy of VB inference. With these smooth trajectories, the algorithm is less likely to deviate far from the right direction when there are disturbances or unreliable updates. As will be shown experimentally, compared with ARD, Double Lomax tends to move more carefully around local minima and provides better convergence performance.
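The smoothing effect can be illustrated with a toy fixed-point iteration (this is an analogy, not the paper's actual update equations): making the next state depend on a running history of past updates damps the trajectory relative to the memoryless version.

```python
import numpy as np

def memoryless(f, x0, T):
    """Plain fixed-point iteration x_{t+1} = f(x_t)."""
    xs = [x0]
    for _ in range(T):
        xs.append(f(xs[-1]))
    return np.array(xs)

def history_smoothed(f, x0, T, gamma=0.5):
    """Next state is a running average of past states and the raw update."""
    xs = [x0]
    for _ in range(T):
        xs.append((1 - gamma) * xs[-1] + gamma * f(xs[-1]))
    return np.array(xs)

f = lambda x: 2.5 - 0.9 * x        # fixed point at 2.5 / 1.9
a = memoryless(f, 10.0, 40)        # oscillates around the fixed point
b = history_smoothed(f, 10.0, 40)  # approaches it monotonically

# Total variation of each trajectory: smaller means smoother.
print(np.sum(np.abs(np.diff(a))), np.sum(np.abs(np.diff(b))))
```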
In the following, we use linear regression as an example. The coefficient vector is x = [1, −1, 0, 0, 1, −1, 1, 0, 0, −1]^T, the basis matrix Φ is a 12 × 10 random matrix whose entries are independently drawn from N(0, 1), and e is zero-mean Gaussian noise with a standard deviation of 0.01. The VB Double Lomax (a → +0) formulation established in the above section (denoted DL) and the equivalent formulation with the ARD prior (denoted ARD) are compared. The maximum number of iterations is set to 50 for both algorithms. Figure 4 shows the estimation results of 50 trials, with the estimation mean (denoted mean) and standard deviation (denoted std) also reported. Neither DL nor ARD can recover the exact coefficients because of the noise. In general, DL converges to the global minimum in more cases, while ARD more often converges to a local minimum with a relatively large deviation from the global one, although this happens with small probability and therefore does not change the mean values much. Figure 5(a) reports the coefficient update traces of a typical case in which ARD converges to a local minimum but DL does not. Figure 5(b) reports the traces of a typical case in which both ARD and DL converge to the global minimum. In both situations, DL shows smoother coefficient update traces, in agreement with the above analysis. Furthermore, the update traces of the latent variables ⟨τ_i⁻¹⟩ and ⟨υ_i²⟩ and the noise parameter β when both ARD and DL converge to the global minimum (the case in Figure 5(b)) are reported in Figure 6. Compared with the traces of ARD's ⟨τ_i⁻¹⟩, DL's are smoother even when plotted on a log scale. The trace shapes of DL's ⟨τ_i⁻¹⟩ are similar to those of DL's ⟨υ_i²⟩, but the differences can be seen in Figures 6(d) and 6(e) (the traces of nonzero coefficients are shown in Figure 6(d) and those of zero coefficients in Figure 6(e) for better visualization). If we force ⟨υ_i²⟩ to equal ⟨τ_i⁻¹⟩, an instance of the ARD update is obtained. It is worth noting that since an extra level of iteration is introduced by ⟨υ_i²⟩, a slower convergence speed of DL can be expected in general.

Computer Simulations
In this section, the proposed VB Double Lomax (a → +0) formulation (denoted DL) is compared with the equivalent formulation using the ARD prior (denoted ARD). Furthermore, a hybrid formulation (denoted hybrid), in which DL is run for a fixed number of initial iterations and ARD is then run until convergence, is also considered as a tradeoff between performance and speed. This hybrid approach is based on the observation that ARD tends to converge to a local minimum in the early iterations. Two kinds of SLM applications with different Φ matrices are considered. One is AR model identification, where the rows of the Φ matrix are strongly correlated; the other is the standard compressive sensing problem, where the rows of the Φ matrix are uncorrelated.
In the experiments reported below, the hybrid switch point is set to 20 iterations, and an empirical criterion, |‖x(t)‖ − ‖x(t − 1)‖| < 10⁻³ holding for more than 50 iterations, is used to terminate the iterative procedure.

The autoregressive model is widely used for time series data, predicting future values from previous values. An AR model of order p is defined as y_t = Σ_{i=1}^{p} x_i y_{t−i} + e_t, where y_t is the tth observation in the series, the x_i are the AR coefficients, and e_t is the excitation noise. In matrix notation, y = Φ_AR x + e, where Φ_AR is the matrix whose tth row contains the lags for element y_t, that is, [y_{t−1}, ..., y_{t−n}], and x is the AR coefficient vector to be estimated. If the model order p is unknown, the strategy used here is to choose an n large enough to ensure n > p, compute the n × 1 coefficient vector, and expect the extra coefficients to be estimated as zero. Figure 7 shows an example of AR identification using DL with 300 observations, true order p = 8, and n = 20. The excitation noise follows a Student-t distribution with shape parameter 3 and scale parameter 1. The left panel of the figure shows that for coefficients i ≤ p the algorithm gives relatively satisfactory estimates, and for the extra coefficients i > p the estimates are set close to zero. The center panel plots the expected values of the one-step-ahead predictions against the observations, and the right panel compares the estimated excitation-noise distribution with the one used to generate the observations.
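The lagged design matrix Φ_AR can be built as sketched below (orders, coefficients, and noise level are illustrative; ordinary least squares stands in for the Bayesian solver just to check that the construction is self-consistent):

```python
import numpy as np

def ar_design(y, n):
    """Build the AR design matrix whose t-th row holds the n lags
    [y_{t-1}, ..., y_{t-n}], paired with the target values y_t."""
    T = len(y)
    Phi = np.empty((T - n, n))
    for i in range(n):
        Phi[:, i] = y[n - 1 - i : T - 1 - i]
    return Phi, y[n:]

# Simulate a stable AR(2) series and check one-step self-consistency.
rng = np.random.default_rng(5)
coef = np.array([0.6, -0.3])
y = np.zeros(500)
for t in range(2, 500):
    y[t] = coef @ y[t - 2 : t][::-1] + 0.1 * rng.standard_normal()

Phi, target = ar_design(y, n=2)
x_ls, *_ = np.linalg.lstsq(Phi, target, rcond=None)
print(np.round(x_ls, 2))  # close to [0.6, -0.3]
```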
To compare DL, ARD, and hybrid, we use three AR models of orders 5, 10, and 15. In each case, n = 20 and the excitation noise is Student-t distributed with shape parameter 3 and scale parameter 1. The coefficient estimation error is calculated as ‖x̂ − x‖₂/‖x‖₂, where x̂ and x are the estimated and true coefficient vectors. The prediction mean square errors (MSEs) are calculated as E[(ŷ − y)²], where ŷ and y are the one-step-ahead predictions and the actual observations. Each experiment is carried out 200 times, and the average results for AR(5), AR(10), and AR(15) are shown in Figures 8(a), 8(b), and 8(c), respectively. The figures on the left report the coefficient estimation error with error bars (one standard deviation). DL yields smaller errors than ARD, and the improvement is more significant for AR(10) and AR(15). Hybrid gives almost the same performance as DL. As the number of observations increases, the coefficient estimation errors of DL, ARD, and hybrid decrease and their differences become less significant. Similar observations hold for the prediction MSEs, reported in the center of Figure 8. However, the prediction MSEs do not decrease monotonically with the number of observations. This may be because the current SLM formulation is based on the overall coefficient estimation error and does not take the partial autocorrelation of the AR model into consideration. Finally, the right of the figure reports the number of iterative steps required for convergence versus the number of observations. For the same number of observations, DL requires more iterations than ARD to converge, whereas hybrid achieves a convergence rate comparable to ARD. In general, as the number of observations increases, the convergence speeds of DL, ARD, and hybrid all increase.
CS is a technique for acquiring and recovering sparse signals through underdetermined linear systems of the form y = Φ_CS x + e, where y is the m × 1 observation vector, Φ_CS is the m × n measurement matrix with m < n, e represents the acquisition noise, and x is the n × 1 unknown sparse signal, that is, most of its coefficients are zero. We use the following default CS setup in our experiments. Φ_CS is formed by sampling i.i.d. columns from the unit sphere in R^m. K coefficients at random locations of the signal are drawn from four different probability distributions, and the remaining n − K coefficients are set to zero. The nonzero coefficients of the sparse signals are realizations of the following four distributions: (1) uniform ±1 random spikes, (2) nonnegative unit spikes, (3) zero-mean unit-variance Gaussian, and (4) zero-mean Student-t with unit scale parameter and shape parameter equal to 3. We fix n = 256 and K = 80 and vary the number of measurements m. Both noiseless and noisy acquisitions are considered, with zero-mean white Gaussian noise of standard deviation 0.005 used in the noisy experiment. The reconstruction error is calculated as ‖x̂ − x‖₂/‖x‖₂, where x̂ and x are the estimated and true signals. Each experiment is carried out 200 times and the average results are reported. Figure 9 reports the number of measurements versus the reconstruction error with error ranges (one standard deviation) for the noise-free case. DL provides improved overall reconstruction performance over ARD for a reasonable number of measurements; the improvement is more significant for signal types (1) and (2) and less significant for types (3) and (4) (though the differences can still be seen from the error ranges). The hybrid provides performance close to DL.
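The described measurement setup can be generated as follows (the signal length and sparsity match the text; the number of measurements m and the random seed are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
n, K, m = 256, 80, 180   # signal length, sparsity, measurements (m is illustrative)

# Columns sampled i.i.d. from the unit sphere in R^m: draw Gaussian
# columns and normalize each to unit l2 norm.
Phi = rng.standard_normal((m, n))
Phi /= np.linalg.norm(Phi, axis=0)

# Sparse signal of type (1): K uniform +/-1 spikes at random locations.
x = np.zeros(n)
support = rng.choice(n, size=K, replace=False)
x[support] = rng.choice([-1.0, 1.0], size=K)

y = Phi @ x              # noise-free acquisition
print(Phi.shape, np.count_nonzero(x))
```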
Figure 10 reports the number of iterative steps required for convergence versus the number of measurements for the noise-free case. DL shows a slower convergence rate than ARD, and hybrid has a rate comparable to ARD. Figure 11 shows the number of estimated nonzero coefficients (values larger than 10⁻³) for the noise-free case. Overall, DL gives the sparsest estimates, followed by hybrid and then ARD. Figure 12 reports the results for noisy acquisitions. The observations are similar to the noise-free case (Figure 9), except that this time none of the algorithms can recover the coefficients exactly because of the noise.

Conclusion
In this paper, we propose a new non-log-concave sparsity prior, referred to as the Double Lomax prior, which corresponds to a three-level hierarchical Bayes model. A VB inference procedure is introduced to solve the SLM using Double Lomax priors. When the noninformative hyperprior is assumed (a → +0), the proposed update procedure includes the canonical ARD update procedure as a special case. However, compared with the canonical ARD update, the proposed procedure has one more latent variable, whose smoothing effect can result in improved performance. A hybrid formulation of Double Lomax and ARD is also considered as a tradeoff between performance and convergence speed.

Figure 1: The Double Lomax prior near zero, with λ = 1, for various values of the shape parameter a.

Figure 2: Examples of the negative logarithm of the Double Lomax prior for different values of a, normalized to a common scale. The ℓ1-norm and ℓ0-norm are included for comparison.

Figure 3: Directed graph representing the Bayesian modeling.

Figure 4: Regression coefficient estimation results using the Double Lomax formulation (DL) and the ARD formulation (ARD), reported with means and standard deviations.

Figure 5: The update traces of coefficients from the regression coefficient estimation experiment shown in Figure 4. (a) The update traces by ARD. (b) The update traces by DL. (c) The update traces of the first three coefficients [1, −1, 0]^T.

Figure 6: The update traces of latent variables and the noise parameter when both ARD and DL converge to the global minimum (shown in Figure 5(b)).

Figure 7: AR model identification using DL. Left: coefficient and order estimation. Center: expected values of one-step-ahead predictions versus the observations. Right: excitation noise estimation.

Figure 9: Number of measurements versus reconstruction error with error ranges (one standard deviation) for the noise-free case.

Figure 10: The number of iterative steps required for convergence for the noise-free case.

Figure 11: The number of estimated nonzero coefficients for the noise-free case.