An Orthogonal Matching Pursuit Variable Screening Algorithm for High-Dimensional Linear Regression Models

Variable selection plays an important role in data mining. It is crucial to filter useful variables and extract useful information in a high-dimensional setup when the number of predictor variables d tends to be much larger than the sample size n. Statistical inferences can be more precise after irrelevant variables are removed by the screening method. This article proposes an orthogonal matching pursuit (OMP) algorithm for variable screening under the high-dimensional setup. The proposed orthogonal matching pursuit method demonstrates good performance in variable screening. In particular, if the dimension of the true model is finite, OMP might discover all relevant predictors within a finite number of steps. Through theoretical analysis and simulations, it is confirmed that the orthogonal matching pursuit algorithm can identify relevant predictors to ensure screening consistency in variable selection. Given the sure screening property, the BIC criterion can be used to practically select the best candidate from the models generated by the OMP algorithm. Compared with the traditional orthogonal matching pursuit method, the resulting model can improve prediction accuracy and reduce computational cost by retaining only the relevant variables.


Introduction
Variable screening is an important technique in data mining. It captures informative variables by reducing the dimension in a high-dimensional setup when the number of predictor variables d tends to be much larger than the sample size n. Statistical inference is difficult to carry out in ultra-high dimensional linear models before variable screening because of the computational complexity, so it is necessary to remove the irrelevant variables from the model first. The core idea is to identify the informative variables with the aim of building a relevant model for future prediction. By removing most irrelevant and redundant variables from the data, variable selection helps improve the performance of learning models in terms of obtaining higher estimation accuracy [1]. Then the AIC [2] or BIC [3] can be applied to further guarantee the accuracy of the relevant model. The focus of this article is on ultra-high dimensional linear models, in which the number of covariates may increase at an exponential rate in the sample size. Such linear models have gained a lot of attention in practical areas such as sentiment analysis and finance. Existing techniques include forward selection [1], the least absolute shrinkage and selection operator (Lasso) [4], the smoothly clipped absolute deviation penalty (SCAD) [5], etc. These efforts have been devoted to the challenging ultra-high dimensionality problem, which is motivated by contemporary applications such as bioinformatics, genomics, and finance. In other words, it is becoming a major issue to investigate the existence of complex relationships and dependencies in data with the aim of building a relevant model for inference.
A practically attractive approach is to first use a quick screening procedure to reduce the dimensionality of the covariates to a reasonable scale, for example below the sample size, and then apply variable selection techniques such as Lasso and SCAD in the second stage.
Motivated by the current studies on variable screening approaches in ultra-high dimensional linear models, we are interested in showing the screening consistency property of the OMP under certain conditions, obtained by restricting the technical conditions stated in Wang [6], and hence in selecting a subset of predictors that includes all relevant predictors. The rest of this article is organized as follows: Section 2 provides the literature review on current variable screening methods. Section 3 presents a variable screening algorithm based on the OMP and studies the asymptotic results of the estimators. Section 4 examines via simulation whether the proposed technique exhibits the desired finite sample properties and can be useful in practical applications. Finally, Section 5 concludes the article and provides some future research directions. The proofs of the asymptotic theories and lemmas can be found in the Appendix.

Literature Review
In the context of variable selection, screening approaches have gained a lot of attention besides penalty approaches such as Lasso [4] and SCAD [5]. When the predictor dimension is much larger than the sample size, the story changes drastically in the sense that the conditions for most of the Lasso-type algorithms cannot be satisfied. Therefore, to conduct model selection in the high-dimensional setup, variable screening is a reasonable solution.
Sure independence screening (SIS), proposed by Fan and Lv [7], has gained popularity for settings in which the number of predictor variables d tends to be much larger than the sample size n. Sure screening is the property that, with probability tending to 1, all the important variables are retained after applying a variable screening procedure. It is desirable for a dimensionality reduction method to have the sure screening property. Fan and Lv [7] give three reasons why sure screening is of great importance when the dimension d is larger than the sample size n. First, the design matrix X is rectangular, having more columns than rows. In this case, the matrix $X^T X$ is huge and singular. The maximum spurious correlation between a covariate and the response can be large because of the dimensionality and because an unimportant predictor can be highly correlated with the response variable owing to the presence of important predictors associated with it. Second, the population covariance matrix $\Sigma$ may become ill conditioned as n grows, which makes variable selection difficult.
Third, the minimum nonzero absolute coefficient $|\beta_j|$ may decay with n and fall close to the noise level, say, of the order $\{\log(d)/n\}^{1/2}$. Hence, in general, it becomes challenging to estimate the sparse parameter vector $\beta$ accurately when $d \gg n$.
To solve the abovementioned difficulties in variable selection, Fan and Lv [7] proposed a simple sure screening method using componentwise regression or equivalently correlation learning, to reduce dimensionality from high to moderate scale that is below sample size. Below is the description of the SIS method.
Let $\omega = (\omega_1, \ldots, \omega_d)^T$ be a d-vector that is obtained by componentwise regression, that is,

$$\omega = X^T y,$$

where the $n \times d$ data matrix X is first standardized columnwise. For any given $c \in (0, 1)$, we sort the d componentwise magnitudes of the vector $\omega$ in descending order and define a submodel

$$\mathcal{M}_c = \{1 \le i \le d : |\omega_i| \text{ is among the first } [cn] \text{ largest of all}\},$$

where $[cn]$ denotes the integer part of $cn$. This correlation learning ranks the importance of features according to their marginal correlations with the response variable. Moreover, it is called independence screening because each feature is used independently as a predictor to decide its usefulness for predicting the response variable. The computational cost of SIS is of order O(nd).
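As a minimal sketch of the screening step just described, the following Python code standardizes the columns, computes the componentwise regression vector, and keeps the $[cn]$ predictors with the largest magnitudes. NumPy is assumed, and the function name `sis_screen`, the toy data, and the choice c = 0.2 are illustrative assumptions rather than part of Fan and Lv [7].

```python
import numpy as np

def sis_screen(X, y, c=0.5):
    """Keep the [c*n] predictors with the largest |omega_j| (illustrative sketch)."""
    n, d = X.shape
    # Standardize each column to mean 0 and unit sample standard deviation.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    # Componentwise regression: one inner product per predictor.
    omega = Xs.T @ (y - y.mean())
    k = int(c * n)                          # integer part of c*n
    keep = np.argsort(-np.abs(omega))[:k]   # indices of the k largest magnitudes
    return np.sort(keep), omega

# Toy data (hypothetical): two relevant predictors among d = 1000.
rng = np.random.default_rng(0)
n, d = 100, 1000
X = rng.standard_normal((n, d))
y = 3.0 * X[:, 0] - 2.5 * X[:, 5] + rng.standard_normal(n)
selected, omega = sis_screen(X, y, c=0.2)
print(len(selected), selected[:5])
```

The whole procedure needs a single matrix-vector product, which is where the O(nd) cost comes from.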
With the dimension reduced accurately from high to below the sample size, variable selection can be improved in both speed and accuracy, and can then be accomplished by a well-developed method such as SCAD, Lasso, or adaptive Lasso [8,9], denoted by SIS-SCAD, SIS-Lasso, or SIS-AdapLasso, respectively. Moreover, the sure screening property has been proven in Fan and Lv [7]. Intuitively, the core idea of SIS is to select the variables in two stages. In the first stage, an easy-to-implement method is used to remove the least important variables. In the second stage, a more sophisticated and accurate method is applied to reduce the variables further.
Though SIS enjoys the sure screening property and is easy to apply, it has several potential problems. First, if an important predictor is jointly correlated but marginally uncorrelated with the response variable, it is not selected by SIS and thus cannot be included in the estimated model. Second, similar to Lasso, SIS cannot handle the collinearity problem between predictors in terms of variable selection. Third, when some unimportant predictors are highly correlated with the important predictors, these unimportant predictors can have a higher chance of being selected by SIS than other important predictors that are relatively weakly related to the response variable. These three potential issues can be treated by suitable extensions of SIS. In particular, iterative SIS (ISIS) is designed to overcome the weaknesses of SIS.
ISIS works in two steps. In the first step, a subset of $k_1$ variables $A_1 = \{X_{i_1}, \ldots, X_{i_{k_1}}\}$ is selected by using an SIS-based model selection method such as SIS-SCAD or SIS-Lasso. This yields an n-vector of residuals from regressing the response y on $X_{i_1}, \ldots, X_{i_{k_1}}$. In the second step, the residuals are treated as the new response variable and the previous step is repeated on the remaining $d - k_1$ variables. It returns a subset of $k_2$ variables $A_2 = \{X_{j_1}, \ldots, X_{j_{k_2}}\}$. Fitting the residuals from the previous step on $\{X_1, \ldots, X_d\} \setminus A_1$ can significantly weaken the prior selection of those unimportant variables that are highly correlated with the response through their relations with $X_{i_1}, \ldots, X_{i_{k_1}}$. In addition, the second step also makes it possible to select those important variables that are missed in the first step. The second step is iterated until $l$ disjoint subsets $A_1, \ldots, A_l$ are obtained whose union $A = \cup_{j=1}^{l} A_j$ has size $[cn]$. If SIS is used to select only one variable at each iteration, that is, $|A_i| = 1$, ISIS is equivalent to orthogonal matching pursuit (OMP) [10], which is a greedy algorithm for variable selection. This is discussed in the study by Barron and Cohen [11].
Kim [12] proposed a filter ranking method using the elastic net penalty with sure independence screening (SIS) on resampling technique to overcome the overfitting and high-performance computational issues. It is demonstrated via extensive simulation studies that SIS-LASSO, SIS-MCP, and SIS-SCAD with the proposed filtering method achieve superior performance of not only accuracy but also true positive detection compared to those with the marginal maximum likelihood ranking (MMLR) method.
Another very popular yet classical variable screening method is forward regression (FR). As an important type of greedy algorithm, FR has had its theoretical properties investigated by Barron and Cohen [11], Donoho and Stodden [13], and Wang [6]. In particular, Wang [6] investigated FR's screening consistency property under an ultra-high dimensional setup by introducing four technical conditions.
There are a few comments on the four technical conditions introduced by Wang [6]. First, the normality assumption has been popularly used in the past literature for theory development. Second, the smallest and largest eigenvalues of the covariance matrix $\Sigma$ need to be properly bounded. This boundedness condition together with the normality assumption ensures the sparse Riesz condition (SRC) defined by Zhang and Huang [14]. Third, the standard $L_2$ norm of the regression coefficients $\beta$ is bounded above by some proper constant, which guarantees that the signal-to-noise ratio is convergent. Moreover, the minimum absolute value of the nonzero $\beta_j$'s needs to be bounded below. This constraint on the minimal size of the nonzero regression coefficients ensures that relevant predictors can be correctly selected. Otherwise, if some of the nonzero coefficients converge to zero too fast, they cannot be selected consistently. Last but not least, $\log(d)$ is bounded above in the order of $n^{\xi}$ for some small constant $\xi$. This condition allows the predictor dimension d to diverge to infinity at an exponentially fast speed, which implies that the predictor dimension can be substantially larger than the sample size n.
Under the assumption that the true model T exists, Wang [6] introduces the FR algorithm with the aim of discovering all relevant predictors consistently. The main step of the FR algorithm is the iterative forward regression part.
Consider the case where k − 1 relevant predictors have been selected. Then the next step is to construct a candidate model that includes one more predictor from the full set, excluding the k − 1 selected predictors, and to calculate the residual sum of squares of that candidate model. This step is repeated for each predictor in the full set that has not yet been selected, and all the residual sums of squares are recorded. The minimum of the recorded residual sums of squares is found, and the kth relevant predictor is the one whose candidate model attains this minimum. A detailed algorithm can be found in the study by Wang [6].
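The forward regression search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: NumPy is assumed, the function name `forward_regression` and the toy data are hypothetical, and every candidate model is refitted from scratch with `np.linalg.lstsq`, which is simple but not the most efficient way to organize the computation. The detailed algorithm is the one given by Wang [6].

```python
import numpy as np

def forward_regression(X, y, n_steps):
    """Greedy forward regression: at each step add the predictor that
    gives the smallest residual sum of squares (illustrative sketch)."""
    n, d = X.shape
    selected = []
    for _ in range(n_steps):
        best_j, best_rss = None, np.inf
        for j in range(d):
            if j in selected:
                continue
            cols = selected + [j]                       # candidate model
            coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            resid = y - X[:, cols] @ coef
            rss = float(resid @ resid)                  # residual sum of squares
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)                         # k-th selected predictor
    return selected

# Toy data (hypothetical): predictors 3 and 10 are the relevant ones.
rng = np.random.default_rng(2)
n, d = 60, 50
X = rng.standard_normal((n, d))
y = 4.0 * X[:, 3] - 3.0 * X[:, 10] + 0.5 * rng.standard_normal(n)
path = forward_regression(X, y, n_steps=4)
print(path)
```

Each step fits one least-squares model per remaining predictor, which is why forward regression is more expensive per step than a correlation-based rule.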
Wang [6] proved that FR can identify all relevant predictors consistently, even if the predictor dimension is considerably larger than the sample size. In particular, if the dimension of the true model is finite, FR might discover all relevant predictors within a finite number of steps. In other words, the sure screening property can be guaranteed under the four technical conditions. Given the sure screening property, the recently proposed BIC of Chen and Chen [3] can be used to practically select the best candidate from the models generated by the FR algorithm. The resulting model is good in the sense that many existing variable selection methods, such as adaptive Lasso and SCAD, can be applied directly to increase the estimation accuracy. The extended Bayes information criterion (EBIC) proposed by Chen and Chen [3] is suitable for large model spaces. It has the following form:

$$\mathrm{BIC}(M) = \log\big(\hat\sigma^2_{(M)}\big) + n^{-1}|M|\,(\log n + 2\log d),$$

where M is an arbitrary candidate model with $|M| \le n$, $\hat\sigma^2_{(M)} = n^{-1}\mathrm{RSS}(M)$, and $\mathrm{RSS}(M)$ is the residual sum of squares of the least-squares fit of y on the predictors in M. We then select the best model $\hat S = S^{(\hat m)}$, where $\hat m = \arg\min_{1 \le m \le n} \mathrm{BIC}(S^{(m)})$.
EBIC, which includes the original BIC as a special case, accounts for both the number of unknown parameters and the complexity of the model space. The model in Chen and Chen [3] is defined to be identifiable if no model of comparable size other than the true submodel can predict the response almost equally well. It has been shown that EBIC is selection consistent under some mild conditions. It also handles the heavy collinearity problem for the covariates. Furthermore, EBIC is easy to implement because the extended BIC family does not require a data-adaptive tuning parameter procedure.
Other screening approaches include tournament screening (TS) [15], sequential Lasso [16], quantile-adaptive model-free variable screening [17], and conditional screening [18]. When $d \gg n$, tournament screening possesses the sure screening property and reduces spurious correlation. Furthermore, the asymptotic properties of sequential Lasso for feature selection in linear regression models with ultra-high dimensional feature spaces have been investigated. The advantage of sequential Lasso is that it is not restricted by the dimensionality of the feature space.

Scientific Programming
Quantile-adaptive model-free variable screening has two distinctive features: it allows the set of active variables to vary across quantiles, and it overcomes the difficulty of specifying the form of a statistical model in a high-dimensional space. Baranowski [19] proposed a workflow representation for scheduling, provenance, or visualization to resolve variable and method dependencies and evaluated the performance of screening properties. Samudrala [20] proposed a parallel framework for dimensionality reduction of large-scale data, identifying the key components underlying spectral dimensionality reduction techniques and giving an efficient parallel implementation of them; it shows better performance for dimension reduction than the existing methods. Chen [21] proposed a model-free feature screening method for the case in which a censored response and error-prone covariates both exist. An iterative algorithm is developed for this setting, together with an iteration method that improves the accuracy of selecting all important covariates. Choudalakis [22] proposed appropriate numerical methods for parameter estimation under the high-dimensional setup, with a thorough comparison among existing methods for both coefficient estimation and variable selection for supersaturated designs. Xu et al. [23][24][25] proposed several multi-objective robust optimization models for MDVRPLS in refined oil distribution. Ren [26] proposed an asymmetric learning to hash with variable bit encoding algorithm (AVBH) to solve the high-dimensional data problem, and a real data application illustrates the finite sample performance of the proposed AVBH algorithm.

Main Results
Orthogonal matching pursuit (OMP) is an iterative greedy algorithm that selects at each step the column that is most correlated with the current residuals. The selected column is then added to the set of selected columns. Inspired by the idea of the FR algorithm in Wang [6], it is shown that, under some proper conditions, OMP enjoys the sure screening property in the linear model setup.
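The greedy rule just described can be illustrated with a short Python sketch. It assumes NumPy and standardized columns (so that inner products with the residual rank the correlations); the function name `omp_path`, the least-squares refit through `np.linalg.lstsq`, and the toy data are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def omp_path(X, y, n_steps):
    """Return the indices a_1, ..., a_k chosen by OMP, in order of selection."""
    n, d = X.shape
    available = np.ones(d, dtype=bool)   # predictors not yet selected
    selected = []
    residual = y.copy()                  # initial residual r_0 = y
    for _ in range(n_steps):
        # Pick the unselected column most correlated with the current residual.
        score = np.where(available, np.abs(X.T @ residual), -np.inf)
        a_k = int(np.argmax(score))
        selected.append(a_k)
        available[a_k] = False
        # Refit by least squares on all selected columns, then update the residual.
        coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        residual = y - X[:, selected] @ coef
    return selected

# Toy data (hypothetical): predictors 2, 7, and 11 are the relevant ones.
rng = np.random.default_rng(1)
n, d = 100, 500
X = rng.standard_normal((n, d))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize columns
beta = np.zeros(d)
beta[[2, 7, 11]] = [3.0, 1.5, 2.0]
y = X @ beta + rng.standard_normal(n)
y = y - y.mean()
path = omp_path(X, y, n_steps=5)
print(path)
```

The first k entries of the returned list form the kth model on the path, so the models are nested by construction. Unlike forward regression, each step costs one matrix-vector product plus a single refit.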

Model Setup and Technical Conditions.
Let $(X_i, y_i)$ be the observation collected from the ith subject ($1 \le i \le n$), where $y_i \in \mathbb{R}^1$ is the response and $X_i = (X_{i1}, \ldots, X_{id})^T \in \mathbb{R}^d$ is the high-dimensional predictor with $d > n$ and $\mathrm{cov}(X_i) = \Sigma$. Moreover, $\beta = (\beta_1, \ldots, \beta_d)^T$ is the regression coefficient. In matrix representation, the design matrix is $X \in \mathbb{R}^{n \times d}$ and the response vector is $y \in \mathbb{R}^n$. Consider the linear regression model

$$y = X\beta + \epsilon. \tag{4}$$

Without loss of generality, it is assumed that the data are centered and standardized, that is, $E(X_{ij}) = 0$ and $\mathrm{Var}(X_{ij}) = 1$, and that the $y_i$'s are conditionally independent given the design matrix X. Moreover, the error terms $\epsilon$ are independently and identically distributed with mean zero and finite variance $\sigma^2$. A model fitting procedure produces the vector of estimated coefficients $\hat\beta = (\hat\beta_1, \ldots, \hat\beta_d)^T$.
Before the main result for the screening property of OMP is presented, four technical conditions, (C1)-(C4), are needed.
Assumption 1 (Technical Conditions).

OMP Algorithm.
Under the assumption that the true set T exists, our main objective is to discover all relevant predictors consistently. To this end, we consider the OMP algorithm (Algorithm 1).
The proof of Lemma 1 can be found in the Appendix.
Before the theorems are established, we follow the idea of Wang [6] on screening consistency of a solution path and define the solution path $\mathbb{S}$ to be screening consistent if

$$P\big(T \subset S^{(k)} \text{ for some } 1 \le k \le n\big) \to 1.$$

Then OMP's screening consistency can be formally established by the following theorem.

Theorem 1. Under model (4) and conditions (C1)-(C4), we have, as $n \to \infty$,

$$P\big(T \subset S^{([K n^{\xi_0 + 4\xi_{\min}}])}\big) \to 1,$$

where the constant $K = 2\tau_{\max}\tau_{\min}^{-2}C_{\beta}^{2}\nu_{\beta}^{-4}\nu$ is independent of n, the constants $\tau_{\max}$, $\tau_{\min}$, $C_{\beta}$, $\nu_{\beta}$, and $\nu$ are defined in conditions (C2)-(C4), and $[t]$ is the smallest integer no less than t. Theorem 1 shows that, with probability tending to 1, all relevant predictors are identified by the OMP algorithm within $Kn^{\xi_0 + 4\xi_{\min}}$ steps. This number of steps is much smaller than the sample size n under condition (C4). In particular, if the dimension of the true model is finite with $\xi_0 = \xi_{\min} = 0$, only a finite number of steps are needed to discover the entire relevant variable set.
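Furthermore, Theorem 1 provides a theoretical basis for OMP, which enables us to empirically select the best model from the solution path $\mathbb{S}$. On the other hand, the solution path contains a total of n nested models. To further select relevant variables from the solution path, the following BIC [3] is considered:

$$\mathrm{BIC}(M) = \log\big(\hat\sigma^2_{(M)}\big) + n^{-1}|M|\,(\log n + 2\log d),$$

where M is an arbitrary candidate model with $|M| \le n$, $\hat\sigma^2_{(M)} = n^{-1}\mathrm{RSS}(M)$, and $\mathrm{RSS}(M)$ is the residual sum of squares of the least-squares fit of y on the predictors in M. We then select the best model $\hat S = S^{(\hat m)}$, where $\hat m = \arg\min_{1 \le m \le n} \mathrm{BIC}(S^{(m)})$. We typically do not expect $\hat S$ to be selection consistent (i.e., $P(\hat S = T) \to 1$). However, we are able to show that $\hat S$ is indeed screening consistent.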
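A small Python sketch of this selection step is given below. It assumes that the criterion has the form $\log(\hat\sigma^2_{(M)}) + n^{-1}|M|(\log n + 2\log d)$ with $\hat\sigma^2_{(M)} = \mathrm{RSS}(M)/n$; the function names `bic` and `select_by_bic`, the hand-written solution path, and the toy data are hypothetical and serve only to illustrate how the minimizer along a nested path is found.

```python
import numpy as np

def bic(X, y, model):
    """BIC of a candidate model, under the assumed form
    log(RSS/n) + |M| * (log n + 2 log d) / n."""
    n, d = X.shape
    coef, *_ = np.linalg.lstsq(X[:, model], y, rcond=None)
    rss = float(np.sum((y - X[:, model] @ coef) ** 2))
    return np.log(rss / n) + len(model) * (np.log(n) + 2 * np.log(d)) / n

def select_by_bic(X, y, path):
    """Evaluate BIC on the nested models path[:1], path[:2], ... and
    return the model with the smallest value."""
    scores = [bic(X, y, path[:m]) for m in range(1, len(path) + 1)]
    m_hat = int(np.argmin(scores)) + 1
    return path[:m_hat], scores

# Toy data (hypothetical): the true model is {2, 7, 11}.
rng = np.random.default_rng(3)
n, d = 100, 500
X = rng.standard_normal((n, d))
y = 3.0 * X[:, 2] + 1.5 * X[:, 7] + 2.0 * X[:, 11] + rng.standard_normal(n)
# A hand-written solution path: the relevant predictors first, then noise.
path = [2, 7, 11, 0, 1, 3]
chosen, scores = select_by_bic(X, y, path)
print(chosen)
```

In practice the path would come from the screening algorithm and should stay shorter than n, since the residual sum of squares reaches zero once the model has as many predictors as observations.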

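Theorem 2. Under model (4) and conditions (C1)-(C4), as $n \to \infty$,

$$P(T \subset \hat S) \to 1.$$

Define $k_{\min} = \min_{1 \le k \le n}\{k : T \subset S^{(k)}\}$. By Theorem 1, we know that $k_{\min}$ satisfies $k_{\min} \le Kn^{\xi_0 + 4\xi_{\min}}$ with probability tending to 1. Therefore, our aim is to prove that $P(\hat m < k_{\min}) \to 0$ as $n \to \infty$; the theorem conclusion then follows. Equivalently, it suffices to show that

$$P\Big(\min_{1 \le k < k_{\min}} \big\{\mathrm{BIC}(S^{(k)}) - \mathrm{BIC}(S^{(k+1)})\big\} > 0\Big) \to 1.$$

A detailed proof can be found in the Appendix.

Numerical Analysis
In each simulation example, the coverage probability,

$$\text{Coverage Probability}(\%) = 100\% \times 100^{-1} \sum_{k=1}^{100} I\big(T \subset \hat S_{(k)}\big),$$

which measures how likely it is that all relevant variables are discovered by the method, is evaluated, where $\hat S_{(k)}$ denotes the model selected in the kth of the 100 simulation replications. This coverage probability characterizes the screening property of a particular method.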
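To characterize the capability of a method in producing sparse solutions, we define

$$\text{Percentage of Correct Zeros}(\%) = \frac{100\%}{100\,(d - d_0)} \sum_{k=1}^{100} \sum_{j=1}^{d} I\big(\hat\beta_{j,(k)} = 0\big)\, I(\beta_j = 0),$$

where $\hat\beta_{j,(k)}$ is the estimate of $\beta_j$ in the kth of the 100 simulation replications. To characterize the method's underfitting effect, we further define

$$\text{Percentage of Incorrect Zeros}(\%) = \frac{100\%}{100\, d_0} \sum_{k=1}^{100} \sum_{j=1}^{d} I\big(\hat\beta_{j,(k)} = 0\big)\, I(\beta_j \ne 0).$$

If all sparse solutions are correctly identified for all irrelevant predictors and no sparse solution is mistakenly produced for any relevant variable, the true model is perfectly identified, that is, $\hat S_{(k)} = T$, where $\hat S_{(k)}$ is the model selected in the kth replication. To measure such a performance, we define the percentage of correctly fitted (%) $= 100\% \times 100^{-1} \sum_{k=1}^{100} I(\hat S_{(k)} = T)$, which characterizes the selection consistency property of a particular method.

Algorithm 1 (OMP).
Step 1 (Initialization). Set $S^{(0)} = \emptyset$. Set the residual $r_0 = y$.
Step 2 (Greedy selection). At the kth step, given $S^{(k-1)}$ and $r_{k-1}$, find $a_k = \arg\max_{j \notin S^{(k-1)}} |X_j^T r_{k-1}|$, where $X_j$ is the jth column of X. Update $S^{(k)} = S^{(k-1)} \cup \{a_k\}$ and set $r_k$ to the residual from the least-squares regression of y on the predictors in $S^{(k)}$.
Step 3 (Solution path). Iterating Step 2 n times leads to a total of n nested candidate models. We then collect those models in a solution path $\mathbb{S} = \{S^{(k)} : 1 \le k \le n\}$ with $S^{(k)} = \{a_1, \ldots, a_k\}$.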
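The performance measures used in the simulations (coverage probability, percentages of correct and incorrect zeros, percentage of correctly fitted, and average model size) can be computed from the selected models as in the following Python sketch. It assumes that a coefficient counts as zero exactly when its predictor is not in the selected model; the function name `screening_metrics` and the four toy replications are hypothetical.

```python
def screening_metrics(selected_models, true_model, d):
    """Summarize selected models over replications (all values except
    the average model size are percentages)."""
    true_set = set(true_model)
    d0 = len(true_set)
    reps = len(selected_models)
    coverage = correct_fit = 0
    correct_zeros = incorrect_zeros = total_size = 0
    for model in selected_models:
        s = set(model)
        coverage += true_set <= s                      # all relevant variables found
        correct_fit += s == true_set                   # true model exactly recovered
        correct_zeros += (d - d0) - len(s - true_set)  # irrelevant and left out
        incorrect_zeros += len(true_set - s)           # relevant but left out
        total_size += len(s)
    return {
        "coverage_probability": 100.0 * coverage / reps,
        "correct_zeros": 100.0 * correct_zeros / (reps * (d - d0)),
        "incorrect_zeros": 100.0 * incorrect_zeros / (reps * d0),
        "correctly_fitted": 100.0 * correct_fit / reps,
        "average_model_size": total_size / reps,
    }

# Four hypothetical replications with true model {0, 1, 2} and d = 10.
models = [[0, 1, 2], [0, 1], [0, 1, 2, 5], [0, 1, 2]]
summary = screening_metrics(models, true_model=[0, 1, 2], d=10)
print(summary)
```

Working with index sets avoids storing the full coefficient vectors, which matters when d is in the thousands.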
As we need to know which variables are truly relevant or irrelevant, we create sparse regression vectors by setting $\beta_i = 0$ for all $i = 1, \ldots, d$, except for a chosen set T of coefficients, where $\beta_i$ is defined in advance for every $1 \le i \le d_0$. Moreover, the noise vector $(\epsilon_1, \ldots, \epsilon_n)$ is chosen i.i.d. $N(0, 1)$. Note that all the simulation runs are conducted in MATLAB.

Example 1. (independent predictors).
This is an example borrowed from Fan and Lv [7]. $X_i$ is generated independently according to a standard multivariate normal distribution. Thus, different predictors are mutually independent. We set $(n, d, d_0) = (100, 5000, 8)$ with $\beta_j = (-1)^{U_j}(4\log n/\sqrt{n} + |Z_j|)$, where $U_j$ is a binary random variable with $P(U_j = 1) = 0.4$ and $Z_j$ is a standard normal random variable.
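For readers who want to reproduce this design, a possible data generator is sketched below in Python with NumPy (the article's own simulations were run in MATLAB). The function name `example1_data`, the seed, and the placement of the $d_0$ nonzero coefficients in the first $d_0$ positions are assumptions made for illustration.

```python
import numpy as np

def example1_data(n=100, d=5000, d0=8, seed=0):
    """Independent standard normal predictors with d0 nonzero coefficients."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))              # mutually independent predictors
    U = rng.binomial(1, 0.4, size=d0)            # P(U_j = 1) = 0.4
    Z = rng.standard_normal(d0)
    beta = np.zeros(d)
    # beta_j = (-1)^{U_j} * (4 log n / sqrt(n) + |Z_j|) for the relevant predictors
    beta[:d0] = (-1.0) ** U * (4 * np.log(n) / np.sqrt(n) + np.abs(Z))
    y = X @ beta + rng.standard_normal(n)        # N(0, 1) noise
    return X, y, beta

X, y, beta = example1_data()
print(X.shape, np.flatnonzero(beta))
```

Every nonzero coefficient has magnitude at least $4\log n/\sqrt{n}$, about 1.84 for n = 100, so the signals stay well separated from zero.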

Example 2. (autoregressive correlation). $X_i$ is generated from a multivariate normal distribution with mean 0 and $\mathrm{cov}(X_{ij_1}, X_{ij_2}) = 0.5^{|j_1 - j_2|}$.
This is called an autoregressive type correlation structure. Such a correlation structure might be useful if a natural order exists among the predictors. As a consequence, predictors that are far apart in the order are expected to be approximately mutually independent. This is an example from Tibshirani [4] with $(n, d, d_0) = (100, 5000, 3)$. In addition, the first, fourth, and seventh components of $\beta$ are set to be 3, 1.5, and 2, respectively.
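An autoregressive design of this kind can be simulated without forming the full $d \times d$ covariance matrix. The Python sketch below assumes the covariance $\rho^{|j_1 - j_2|}$ with $\rho = 0.5$ (the value is an assumption of the sketch) and builds the columns recursively; the function name `example2_data` and the seed are hypothetical.

```python
import numpy as np

def example2_data(n=100, d=5000, rho=0.5, seed=0):
    """Predictors with cov(X_j1, X_j2) = rho^{|j1 - j2|}, built recursively."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, d))
    X = np.empty((n, d))
    X[:, 0] = Z[:, 0]
    scale = np.sqrt(1.0 - rho ** 2)              # keeps every column at unit variance
    for j in range(1, d):
        X[:, j] = rho * X[:, j - 1] + scale * Z[:, j]
    beta = np.zeros(d)
    beta[[0, 3, 6]] = [3.0, 1.5, 2.0]            # 1st, 4th, and 7th components
    y = X @ beta + rng.standard_normal(n)        # N(0, 1) noise
    return X, y, beta

X, y, beta = example2_data()
print(X.shape, np.flatnonzero(beta))
```

The recursion gives each column unit variance and correlation $\rho^{k}$ between columns k positions apart, at O(nd) cost instead of a Cholesky factorization of a 5000 × 5000 matrix.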

Example 3. (grouped variables). $X_i$ is generated by the following rule, in which the $N(0, 1)$ components and $\epsilon_{x,j} \sim N(0, 1)$ are independent. This creates within-group correlations of $\rho_{ij} = 0.15$ for $i, j \in \{1, \ldots, d_0\}$ and $\rho_{ij} = 0.95$ for $i, j \in \{d_0 + 1, \ldots, d_0 + 5\}$. This example presents an interesting scenario in which a group of significant variables are mildly correlated and, simultaneously, a group of insignificant variables are strongly correlated. The settings are similar to those in Example 2, with $(n, d, d_0) = (100, 5000, 3)$. In addition, the three nonzero components of $\beta$ are set to be 3, 1.5, and 2, respectively.

Simulation Results for OMP Screening Consistent Property
Finite sample performance of the OMP screening consistent property is investigated based on the three examples in Section 4.1. Simulation results are presented in Table 1.
First, the simulation results for the independent predictor example show good performance of OMP in terms of screening consistency. We obtain 100% coverage probability, which means that all relevant variables are discovered by the OMP method. In addition, the 94% of correctly fitted means that BIC selects the true set of variables correctly in 94 out of 100 simulation replications. This result is not surprising since Zhang [14] pointed out that OMP can select features or variables consistently under a certain irrepresentable condition. Furthermore, the percentage of correct zeros and the percentage of incorrect zeros are 99.9% and 1.6%, respectively. Last but not least, the average model size is 7.94, which is slightly below $d_0 = 8$.
Furthermore, the simulation results for the autoregressive correlation example show very good performance of OMP in terms of screening consistency. Both the coverage probability and the percentage of correctly fitted are 100%; that is, BIC selects the true set of variables correctly in all 100 simulation replications. This is good news since the number of nonzero $\beta_j$'s, $d_0$, is 3, which is a very sparse representation given $d = 5000$. On top of that, the percentage of correct zeros and the percentage of incorrect zeros are 100% and 0%, respectively. Last but not least, the average model size is 3. Therefore, the OMP algorithm works well under this autoregressive correlation setup with a sparse representation of $\beta$.
Third, the simulation results for the grouped variables example are the worst among the three examples in terms of screening consistency for OMP. However, the performance itself is not bad. The coverage probability is 96%, meaning that in some of the simulation replications not all the relevant predictors are discovered by the OMP algorithm. In addition, the 34% of correctly fitted means that BIC selects the true set of variables correctly in only 34 out of 100 simulation replications. On top of that, the percentage of correct zeros and the percentage of incorrect zeros are 95.9% and 2%, respectively. Last but not least, the average model size is 3.84.
Besides a summary of the simulation results of the OMP algorithm, three plots are presented in Figure 1. For each of the three examples, one particular plot of BIC against the number of variables included in the final model is extracted for reference. These graphs are not representative as a whole; however, they do show casewise trends of BIC. Take the BIC of Example 1 as an example; please refer to Figure 1(a). The observed trend is not surprising since BIC decreases as the model complexity increases. Similar trends can be observed for Example 2 and Example 3; please refer to Figures 1(b) and 1(c).
One possible suggestion for the OMP algorithm is that, after the screening process produces n candidate models, only (n/2) candidate models are compared when searching for the minimum BIC value. By doing so, computational time can be saved without losing the screening consistency property of OMP.
In conclusion, the finite sample simulation performance of OMP in terms of screening consistency is good under all three examples. This performance supports the theory proposed in Section 3.3.

Conclusion and Future Research
To conclude, this article provides the theoretical proof that OMP can identify all relevant predictors consistently, even if the predictor dimension is considerably larger than the sample size. In particular, if the dimension of the true model is finite, OMP might discover all relevant predictors within a finite number of steps. In other words, the sure screening property can be guaranteed under the four technical conditions. Given the sure screening property, the recently proposed BIC of Chen and Chen [3] can be used to practically select the best candidate from the models generated by the OMP algorithm.
The resulting model is good in the sense that many existing variable selection methods, such as adaptive Lasso and SCAD, can be applied directly to increase the estimation accuracy. Compared with the traditional orthogonal matching pursuit method, the resulting model can improve prediction accuracy and reduce computational cost by retaining only the relevant variables. The variable selection procedure above only considers the fixed effect estimates in linear models. However, in real life, many data sets involve both fixed effects and random effects. For example, in clinical trials, several observations are taken over a period of time for one particular patient. After collecting the data needed for all the patients, it is natural to include random effects for each individual patient in the model, since a common error term for all the observations is not sufficient to capture the individual randomness. Future research may include random effects in the model by imposing a penalized hierarchical likelihood algorithm for accurate variable selection.

Proof of lemmas and theorems
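Proof of Lemma 1. Let $r = (r_1^T, r_2^T)^T \in \mathbb{R}^d$ be an arbitrary d-dimensional vector and $r_1$ be the subvector corresponding to $\Sigma_{11}$. The required bound follows from the condition; the argument is presented in the following. Suppose that the matrix $\Sigma$ is positive definite and has the partition given by

$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$$

Then the inverse of $\Sigma$ has the following form:

$$\Sigma^{-1} = \begin{pmatrix} \Sigma_{11}^{-1} + \Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22,1}^{-1}\Sigma_{21}\Sigma_{11}^{-1} & -\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22,1}^{-1} \\ -\Sigma_{22,1}^{-1}\Sigma_{21}\Sigma_{11}^{-1} & \Sigma_{22,1}^{-1} \end{pmatrix},$$

where $\Sigma_{22,1} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$. In fact, the above formula can be derived from the following identity:

$$\begin{pmatrix} I & 0 \\ -\Sigma_{21}\Sigma_{11}^{-1} & I \end{pmatrix} \Sigma \begin{pmatrix} I & -\Sigma_{11}^{-1}\Sigma_{12} \\ 0 & I \end{pmatrix} = \begin{pmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22,1} \end{pmatrix}$$

and the fact that

$$\begin{pmatrix} I & 0 \\ -\Sigma_{21}\Sigma_{11}^{-1} & I \end{pmatrix}^{-1} = \begin{pmatrix} I & 0 \\ \Sigma_{21}\Sigma_{11}^{-1} & I \end{pmatrix}.$$

Moreover, the largest singular value is the operator norm of the linear operator (matrix) in a Hilbert space. If A is a $p \times n$ matrix of complex entries, then its singular values $s_1 \ge \ldots \ge s_q \ge 0$, $q = \min(p, n)$, are defined as the square roots of the q largest eigenvalues of the nonnegative definite Hermitian matrix $AA^*$. If A ($n \times n$) is Hermitian, then let $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_n$ denote its eigenvalues.
This part has been proven on page 334 of the book Spectral Analysis of Large Dimensional Random Matrices.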
We have that C is positive definite, and we want to prove that $A = \Sigma_{11}^{-1}$ is positive definite. B is also positive definite because its quadratic form $x'Bx$ is nonnegative for any vector x, since $\Sigma_{22,1}^{-1}$ is positive definite. Therefore, the required bound on $\beta' A \beta$ holds for any vector $\beta$. Now, the desired conclusion of Lemma 1 is implied by (A.9), where $\varepsilon > 0$ is an arbitrary positive number. The left-hand side of (A.9) is bounded by a quantity indexed by M, the set that contains m variables, which yields (A.11). By Lemma A3 in Bickel and Levina [28], there exist constants $C_1 > 0$ and $C_2 > 0$ such that $P(|\hat\sigma_{j_1 j_2} - \sigma_{j_1 j_2}| > \varepsilon) \le C_1 \exp(-C_2 n \varepsilon^2)$.

In contrast, under the assumption $\mathrm{Var}(Y_i) = 1$, we have $n^{-1}\|Y\|^2 \to_p 1$. This contradicts the result of (A.29). Hence, it is impossible to have $S^{(k)} \cap T = \emptyset$ for every $1 \le k \le Kn^{\xi_0 + 4\xi_{\min}}$. Consequently, with probability tending to 1, all relevant predictors should be recovered within a total of $Kn^{\xi_0 + 4\xi_{\min}}$ steps. This completes the proof.

Data Availability
The data used to support the findings of the study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.