Weaker Regularity Conditions and Sparse Recovery in High-Dimensional Regression

Regularity conditions play a pivotal role for sparse recovery in high-dimensional regression. In this paper, we present a weaker regularity condition and discuss its relationships with other regularity conditions, such as the restricted eigenvalue condition. We study the behavior of our new condition for design matrices with independent random columns drawn uniformly on the unit sphere. Moreover, we show that, under a sparsity scenario, the Lasso estimator and the Dantzig selector exhibit similar behavior. For both methods, we derive, in parallel, sharper bounds for the estimation loss and the prediction risk in the linear regression model when the number of variables can be much larger than the sample size.


Introduction
In recent years, problems of statistical inference in the high-dimensional setting, in which the dimension p of the data exceeds the sample size n, have attracted a great deal of attention. One concrete instance of a high-dimensional inference problem concerns the standard linear regression model y = Xβ + ε, where X ∈ R^{n×p} is called the design matrix, β ∈ R^p is an unknown target vector, and ε ∈ R^n is a stochastic error term. The goal is to estimate the vector β ∈ R^p based on the response y and the covariates X = (X_1, . . ., X_p).
In the setting p ≫ n, the classical linear regression model is unidentifiable, so it is not meaningful to estimate the parameter vector β ∈ R^p without further structure. However, many high-dimensional regression problems exhibit special structure that can lead to an identifiable model. In particular, sparsity of the regression vector β is an archetypal example of such structure; that is, only a few components of β, say s of them, are different from zero, in which case β is said to be s-sparse. This problem has attracted great interest recently. The use of the ℓ1-norm penalty to enforce sparsity has been very successful, and several methods build on it, such as the Lasso [1], basis pursuit [2], and the Dantzig selector [3]. Sparsity has also been exploited in a number of other questions, for instance, instrumental variable regression in the presence of endogeneity [4].
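As a concrete illustration of the sparsity scenario, the following sketch simulates an s-sparse linear model with p ≫ n; the dimensions, noise level, and variable names are our own illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 50, 200, 5                       # sample size n, dimension p >> n, sparsity s

X = rng.standard_normal((n, p))            # design matrix X in R^{n x p}
beta = np.zeros(p)
beta[:s] = rng.uniform(1.0, 2.0, size=s)   # only s components of beta are nonzero
eps = 0.1 * rng.standard_normal(n)         # stochastic error term
y = X @ beta + eps                         # linear model y = X beta + eps

print(np.count_nonzero(beta))              # 5 -- beta is s-sparse
```

With p = 200 columns and only n = 50 observations, ordinary least squares cannot identify β, but the s = 5 nonzero coefficients make sparse recovery possible.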
Thus, in the setting of high-dimensional linear regression, the interesting question is how to accurately estimate the regression vector β and the response Xβ from few and corrupted observations. In the standard form, under assumptions on the matrix X and with high probability, the estimation bounds are of the form C‖β‖_0 (log(p)/n)^{1/2} (e.g., see [7, 8, 13, 21]), and the prediction errors are bounded by C log(p)‖β‖_0/n (e.g., see [1, 7, 21]), where C is a positive constant.
The main contribution of this paper is the following: we present a restricted eigenvalue assumption that is weaker than the RE conditions of previous papers under certain settings. Using the ℓ1-norm penalty, our results are more precise than the existing ones. It remains an open question to find a weaker assumption that yields better results in all circumstances.
The remainder of this paper is organized as follows. We begin in Section 2 with some notation and definitions. In Section 3, we introduce our assumptions and discuss the relation between them and the existing ones. Section 4 contains our main results, where we also show the approximate equivalence between the Lasso and the Dantzig selector. We give three lemmas and the proofs of the theorems in Section 5.

Preliminaries
In this section, we introduce some notation and definitions.
Let β ∈ R^p be a vector. We denote by ‖β‖_0 = Σ_j 1{β_j ≠ 0} = |{j : β_j ≠ 0}| the number of nonzero coordinates of β, where 1{⋅} denotes the indicator function and |A| the cardinality of a set A. We use the standard notation ‖β‖_q for the ℓ_q-norm of the vector β. Moreover, a vector β is said to be s-sparse if ‖β‖_0 ≤ s; that is, it has at most s nonzero entries. For a vector Δ ∈ R^p and a subset J ⊂ {1, . . ., p}, we denote by Δ_J the vector in R^p that has the same coordinates as Δ on J and zero coordinates on the complement J^c of J.
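The definitions above are straightforward to compute; the following small sketch (toy vector and index set of our own choosing) evaluates ‖Δ‖_0, ‖Δ‖_1, and the restricted vector Δ_J.

```python
import numpy as np

delta = np.array([3.0, -1.0, 0.0, 0.5, 0.0])
J = [0, 3]                              # an index subset J of {0, ..., p-1}

l0 = np.count_nonzero(delta)            # ||delta||_0: number of nonzero coordinates
l1 = np.abs(delta).sum()                # ||delta||_1
l2 = np.sqrt((delta ** 2).sum())        # ||delta||_2

delta_J = np.zeros_like(delta)
delta_J[J] = delta[J]                   # same coordinates as delta on J, zero on J^c

print(l0, l1)                           # 3 4.5
print(delta_J)                          # [3.  0.  0.  0.5 0. ]
```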
For linear regression model (1), regularized estimation with the ℓ1-norm penalty, also known as the Lasso [1] or basis pursuit [2], refers to the convex optimization problem

β̂_L = argmin_{β ∈ R^p} { (1/n)‖y − Xβ‖_2^2 + 2r‖β‖_1 },

where r > 0 is a penalization parameter. The Dantzig selector, introduced by Candes and Tao [3], is

β̂_D = argmin { ‖β‖_1 : ‖(1/n)X^T(y − Xβ)‖_∞ ≤ r },

where r > 0 is a tuning parameter. It is known that the Dantzig selector can be recast as a linear program; hence, it is also computationally tractable.
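A minimal numpy sketch of the Lasso follows, assuming the common normalization (1/n)‖y − Xβ‖_2^2 + 2r‖β‖_1 (as in Bickel et al. [10]); the solver `lasso_ista` and all numeric constants are our own illustrative choices. The Dantzig selector is omitted since it requires a linear-programming solver, but the sketch checks the Dantzig-type constraint ‖X^T(y − Xβ)/n‖_∞ ≤ r, which the Lasso solution also satisfies at optimality.

```python
import numpy as np

def lasso_ista(X, y, r, n_iter=1000):
    """Proximal-gradient (ISTA) sketch of the Lasso:
    minimize (1/n)||y - X b||_2^2 + 2 r ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    step = n / (2.0 * np.linalg.norm(X, 2) ** 2)    # 1/L for the smooth part's gradient
    for _ in range(n_iter):
        grad = (2.0 / n) * X.T @ (X @ b - y)        # gradient of (1/n)||y - Xb||^2
        z = b - step * grad
        b = np.sign(z) * np.maximum(np.abs(z) - 2.0 * r * step, 0.0)  # soft-thresholding
    return b

rng = np.random.default_rng(0)
n, p, s = 50, 100, 3
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = 1.5
y = X @ beta + 0.1 * rng.standard_normal(n)

r = 2.0 * 0.1 * np.sqrt(np.log(p) / n)     # tuning parameter of order sqrt(log p / n)
b_hat = lasso_ista(X, y, r)
# The fitted Lasso residual satisfies the Dantzig-type sup-norm constraint (approximately):
print(float(np.max(np.abs(X.T @ (y - X @ b_hat) / n))))
```

The printed sup-norm is at most about r, illustrating the connection between the Lasso first-order conditions and the Dantzig constraint.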
For an integer s such that 1 ≤ s ≤ p/2 and a vector Δ ∈ R^p, let Δ_{J_0} ∈ R^{|J_0|} denote the subvector of Δ confined to a set J_0. One of the common properties of the Lasso and the Dantzig selector is that, for an appropriately chosen r and the vector Δ = β̂ − β, where β̂ is the solution of either the Lasso or the Dantzig selector, it holds with high probability (cf. Lemmas 11 and 12) that

‖Δ_{J_0^c}‖_1 ≤ c_0 ‖Δ_{J_0}‖_1,

with c_0 = 1 for the Dantzig selector by Candes and Tao [3] and c_0 = 3 for the Lasso by Bickel et al. [9], where c_0 > 0 and J_0 is the set of nonzero coefficients of the true parameter β of the model. Finally, for any n ≥ 1, p ≥ 2, we consider the Gram matrix

Ψ_n = X^T X / n,

where X is the design matrix in model (1) and X^T ∈ R^{p×n} denotes the transpose of X.
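The degeneracy of the Gram matrix Ψ_n when p > n, which motivates the next section, is easy to see numerically; the dimensions below are arbitrary illustrations of our own.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 60                     # p > n
X = rng.standard_normal((n, p))
Psi = X.T @ X / n                 # Gram matrix Psi_n, a p x p matrix

# rank(Psi) <= n < p, so Psi cannot be positive definite
print(Psi.shape, np.linalg.matrix_rank(Psi))
```

Since the rank is at most n = 40 while the matrix is 60 × 60, the smallest eigenvalue is zero and ordinary least squares is not applicable.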

Discussion of the Assumption
Under the sparsity scenario, we are typically interested in the case where p > n, and even p ≫ n. Here, sparsity specifies that the high-dimensional vector β has coefficients that are mostly 0. Clearly, the matrix Ψ_n is then degenerate, and ordinary least squares does not work in this case, since it requires positive definiteness of Ψ_n; that is,

min_{Δ≠0} ‖XΔ‖_2 / (√n ‖Δ‖_2) > 0.

It turns out that the Lasso and the Dantzig selector require much weaker assumptions. The idea of Bickel et al. [10] is that the minimum in (10) be replaced by the minimum over a restricted set of vectors, and that the norm ‖Δ‖_2 in the denominator of the condition be replaced by the ℓ2-norm of only a part of Δ. Note that the role of (7) is to restrict the set of vectors over which the minimum is taken.

Assumption 1 (RE(s, c_0) (Bickel et al. [10])). For some integer s such that 1 ≤ s ≤ p and a positive number c_0, the following condition holds:

κ(s, c_0) = min_{J_0 ⊆ {1,…,p}, |J_0| ≤ s}  min_{Δ≠0: ‖Δ_{J_0^c}‖_1 ≤ c_0 ‖Δ_{J_0}‖_1}  ‖XΔ‖_2 / (√n ‖Δ_{J_0}‖_2) > 0.

Bickel et al. [10] showed that the bounds on the estimation error and the prediction error are C‖β‖_0 (log(p)/n)^{1/2} and C log(p)‖β‖_0/n, respectively, for both the Lasso and the Dantzig selector, where C is a positive constant and ‖β‖_0 is the sparsity level. Next, we describe the RE2(s, c_0) assumption presented by Wang and Su [7], which is obtained by replacing ‖Δ_{J_0}‖_2 by its upper bound ‖Δ‖_1 in (10).
Assumption 2 (RE2(s, c_0) (Wang and Su [7])). For some integer s such that 1 ≤ s ≤ p and a positive number c_0, the following condition holds:

κ_2(s, c_0) = min_{J_0 ⊆ {1,…,p}, |J_0| ≤ s}  min_{Δ≠0: ‖Δ_{J_0^c}‖_1 ≤ c_0 ‖Δ_{J_0}‖_1}  ‖XΔ‖_2 / (√n ‖Δ‖_1) > 0.

The two conditions are very similar. The only difference lies in the denominator: the ℓ1-norm of Δ versus the ℓ2-norm of a part of Δ. The RE2(s, c_0) condition is equivalent to RE(s, c_0); see [7, 13] for a discussion of the equivalence. The results of [7, 13] give more precise bounds for estimation and prediction than those derived in Bickel et al. [10] and do not depend on the sparsity level ‖β‖_0.
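The replacement of the ℓ2-norm of a part of Δ by the ℓ1-norm of Δ rests on the elementary chain ‖Δ_{J_0}‖_2 ≤ ‖Δ_{J_0}‖_1 ≤ ‖Δ‖_1; a quick numerical check (random vectors of our own choosing) confirms it.

```python
import numpy as np

rng = np.random.default_rng(2)
p, s = 30, 4
for _ in range(100):
    delta = rng.standard_normal(p)
    J0 = np.argsort(-np.abs(delta))[:s]    # any index set of size s works here
    dJ0 = delta[J0]
    # l2 <= l1 on the subvector, and the subvector's l1 <= the full vector's l1
    assert np.linalg.norm(dJ0) <= np.abs(dJ0).sum() <= np.abs(delta).sum()
print("norm chain verified")
```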
In order to obtain the regularity condition of this paper, we decompose Δ into a collection of subvectors Δ_{J_0}, Δ_{J_1}, . . . supported on disjoint index sets. Replacing ‖Δ‖_1 by √s ‖Δ_{J_0}‖_2 in (13), we get the following assumption.
Assumption 4 (LR2(s, c_0)). For some integer s such that 1 ≤ s ≤ p and a positive number c_0, the following condition holds:

min_{J_0 ⊆ {1,…,p}, |J_0| ≤ s}  min_{Δ≠0: ‖Δ_{J_0^c}‖_1 ≤ c_0 ‖Δ_{J_0}‖_1}  ‖XΔ‖_2 / (√n · √s ‖Δ_{J_0}‖_2) > 0.

Both conditions above can be used for sparse recovery in high-dimensional regression. For technical reasons, we give results only when LR2(s, c_0) is satisfied.
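The inequality √s ‖Δ_{J_0}‖_2 ≤ ‖Δ‖_1, which makes LR2 weaker, does not hold for arbitrary vectors; it must be justified by the setting discussed in this section. The toy vectors below (our own illustration) show one case where it is tight and one where it fails.

```python
import numpy as np

s = 4
# Flat vector on J0 = {0,...,3}: sqrt(s)*||d_J0||_2 equals ||d||_1 exactly.
d_flat = np.array([1.0] * s + [0.0] * 6)
lhs_flat = np.sqrt(s) * np.linalg.norm(d_flat[:s])
print(lhs_flat, np.abs(d_flat).sum())      # both 4.0 -- inequality holds with equality

# Spiky vector: all mass on one coordinate, and the inequality fails.
d_spiky = np.zeros(10)
d_spiky[0] = 1.0
lhs_spiky = np.sqrt(s) * np.linalg.norm(d_spiky[:s])
print(lhs_spiky, np.abs(d_spiky).sum())    # 2.0 > 1.0 -- inequality violated
```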

Main Results of Sparse Recovery for Regression Model
In order to provide performance guarantees for the ℓ1-norm penalty applied to sparse linear models, it suffices to assume that suitable regularity conditions are satisfied. In this section, we present our main results when LR2(s, c_0) is satisfied.
In particular, for convenience, we assume that all the diagonal elements of the matrix X^T X / n are equal to 1.
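This normalization, diag(X^T X / n) = 1, can always be achieved by rescaling each column of X; a short sketch (our own, with arbitrary dimensions):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 50
X = rng.standard_normal((n, p))
X = X / np.sqrt((X ** 2).sum(axis=0) / n)   # rescale column j so that ||X_j||_2^2 / n = 1
print(np.allclose(np.diag(X.T @ X / n), 1.0))  # True
```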
We first prove a type of approximate equivalence between the Lasso and the Dantzig selector. Similar equivalence results can be found in [7, 10, 13]. The equivalence is expressed as closeness of the prediction losses ‖Xβ̂_L − Xβ‖_2^2 and ‖Xβ̂_D − Xβ‖_2^2 when the number of nonzero components of the Lasso or the Dantzig selector solution is small compared to the sample size.
Remark 8. We impose no conditions on the parameter r. As in [10], we can rewrite r in terms of another parameter A in order to clarify the notation, r = Aσ√(log(p)/n). Then the results of Theorems 5–7 can be restated in terms of A. Comparing the resulting bounds, our results substantially improve those in Bickel et al. [10]. Similar results for the Lasso can be found in Wang and Su [7]. It is clear that our results are more precise than the existing ones, for example, [7, 10].

Remark 9. The assumptions LR1(s, c_0) and LR2(s, c_0) are weaker than the assumptions RE2(s, c_0) and RE(s, c_0), since √s ‖Δ_{J_0}‖_2 ≤ ‖Δ‖_1. Note that this inequality holds under the setting discussed in Section 3. That is, our weaker assumptions hold under certain conditions, but they cannot be considered better than those in previous papers in every circumstance.
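For concreteness, the tuning parameter r = Aσ√(log(p)/n) from [10] is easy to evaluate; the numeric values of A, σ, n, and p below are arbitrary illustrations of our own.

```python
import numpy as np

A, sigma = 2.0, 1.0                        # illustrative constant and noise level
n, p = 100, 1000
r = A * sigma * np.sqrt(np.log(p) / n)
print(r)                                   # roughly 0.53: r shrinks with n, grows slowly in p
```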

Lemmas and the Proofs of the Results
In this section, we give three lemmas and the proofs of the theorems.