With the rapid evolution of high-throughput technology, longitudinal gene expression experiments have become affordable and increasingly common in biomedical fields. The generalized estimating equation (GEE) approach is a widely used statistical method for the analysis of longitudinal data. Feature selection is imperative in longitudinal omics data analysis. Among the variety of existing feature selection methods, an embedded method, threshold gradient descent regularization (TGDR), stands out due to its excellent characteristics. Combining GEE with TGDR is a promising way to identify relevant markers that can explain the dynamic changes of outcomes across time. We propose a novel feature selection algorithm for longitudinal outcomes, GEE-TGDR. In the GEE-TGDR method, the quasilikelihood function of a GEE model is the objective function to be optimized, and the optimization and feature selection are accomplished by the TGDR method. Long noncoding RNAs (lncRNAs) are posttranscriptional and epigenetic regulators that have lower expression levels and are more tissue-specific than protein-coding genes. So far, the role of lncRNAs in psoriasis remains largely unexplored and poorly understood, even though some evidence in the literature supports a strong association between lncRNAs and psoriasis. In this study, we applied the GEE-TGDR method to a lncRNA expression dataset that examined the response of psoriasis patients to immune treatments. As a result, a list of 10 relevant lncRNAs was identified with a predictive accuracy of 70%, which is superior to the accuracies achieved by two competing methods, and with a meaningful biological interpretation. A widespread application of the GEE-TGDR method in longitudinal omics data analysis is anticipated.
With the rapid evolution of high-throughput technology, longitudinal omics experiments have become affordable and increasingly common in many biomedical fields for exploring biological systems or processes that change dynamically over time. Usually, the analysis strategies focus on analyzing individual time points separately. As many investigators have pointed out [
The generalized estimating equation (GEE) approach [
As in the cross-sectional setting, feature selection is imperative in the learning process for longitudinal omics data. Feature selection aims at eliminating irrelevant genes, avoiding overfitting, speeding up the learning process, and achieving a final model that is parsimonious (i.e., the number of selected genes is as small as possible). Consequently, modifying GEE to analyze high-dimensional data requires feature selection. In the literature, there are several such algorithms. For example, Wang et al. [
Among the variety of existing feature selection algorithms, we have devoted considerable effort to the threshold gradient descent regularization (TGDR) [
Long noncoding RNAs (lncRNAs) are posttranscriptional and epigenetic regulators that have lower expression levels and are more tissue-specific than protein-coding genes [
In this article, we propose a new feature selection algorithm, referred to as GEE-TGDR, specifically for longitudinal data mining and feature selection. In the GEE-TGDR method, the quasilikelihood function of a GEE model is the objective function to be optimized, while the optimization and feature selection are accomplished by the TGDR method. We applied this method to a longitudinal microarray gene expression dataset that was collected to assess the treatment efficacy of two immune therapies for psoriasis patients. We identified the relevant lncRNAs that can predict the temporal changes of psoriasis area and severity index (PASI) scores, which are used to determine whether a patient with psoriasis responds to the treatments, with the objective of revealing the underlying mechanisms of these two treatments from the perspective of lncRNAs.
Following the structures of a review by [
The microarray dataset [
In this study, the preprocessed data were downloaded directly from the GEO database. No additional preprocessing was carried out. By matching the gene symbols of lncRNAs in the GENCODE (
In this paper, we devise a novel feature selection algorithm called GEE-TGDR specifically for selecting relevant features associated with the temporal changes of longitudinal outcomes, in which GEE is equipped with TGDR, just as its name implies. We briefly describe both the GEE and TGDR methods before proceeding to the proposed integration. To keep the presentation focused, we restrict attention to the case of continuous outcomes.
For continuous outcomes, the TGDR algorithm is based on a linear model, where a response variable
The TGDR algorithm started from that the Upon current estimate Let Update Repeat steps 1-3 for
In the TGDR method, no explicit penalty term is added to the objective function (i.e., response function). The regularization on coefficients (thus the selection of features) is made possible by introducing the threshold function
In the longitudinal notation, the
In the GEE model, the first two marginal moments of
The conventional TGDR method only deals with univariate outcomes. For longitudinal outcomes, which are multivariate, the method needs to be extended.
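The conventional TGDR update for a continuous univariate outcome can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `tgdr_linear`, the parameter names `tau` (threshold in [0, 1]), `step` (learning rate), and `n_iter` (number of iterations) are assumptions made for this example, and the objective is taken to be half the mean squared error of a linear model.

```python
def tgdr_linear(X, y, tau=0.8, step=0.01, n_iter=200):
    """Threshold gradient descent for a linear model (illustrative sketch).

    X is a list of n rows with p features each; y is a list of n responses.
    Only coefficients whose gradient magnitude is at least tau times the
    largest gradient magnitude are updated at each iteration.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p  # start from the null model
    for _ in range(n_iter):
        # Residuals under the current estimate.
        resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p))
                 for i in range(n)]
        # Negative gradient of 0.5 * mean squared error.
        grad = [sum(X[i][j] * resid[i] for i in range(n)) / n
                for j in range(p)]
        gmax = max(abs(g) for g in grad)
        if gmax == 0.0:
            break  # stationary point reached
        # Threshold step: move only along the steepest coordinates.
        for j in range(p):
            if abs(grad[j]) >= tau * gmax:
                beta[j] += step * grad[j]
    return beta
```

With `tau` close to 1 only the steepest coordinates move, which tends to give sparse solutions; with `tau = 0` every coordinate is updated and the procedure reduces to ordinary gradient descent. In practice `tau` and `n_iter` would be tuned, for example by cross-validation.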
In this study, we proposed to replace the likelihood function with the corresponding quasilikelihood function and to extend TGDR as GEE-TGDR. With Upon current estimate Let Update Calculate the residuals, viz, Repeat steps 1-4 for
In this study, we developed the GEE-TGDR algorithm only for continuous outcomes, as in the motivating dataset; PASI scores, which are continuous, were the outcomes of interest, so the corresponding expectations of
Flowchart of the proposed GEE-TGDR algorithm.
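The idea of driving the thresholded update with a quasilikelihood score can be sketched for continuous outcomes as below. This is a simplified illustration under stated assumptions, not the published algorithm: the working correlation matrix `R` is held fixed rather than re-estimated from residuals, the scale parameter is absorbed into the step size, an identity link is used, and all names (`solve`, `gee_tgdr`, `tau`, `step`, `n_iter`) are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]  # augmented matrix
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def gee_tgdr(X, Y, R, tau=0.8, step=0.01, n_iter=200):
    """Thresholded ascent on a Gaussian GEE score (illustrative sketch).

    X[i] is the T x p design matrix of subject i, Y[i] the T outcomes of
    subject i, and R a fixed T x T working correlation matrix.
    """
    p = len(X[0][0])
    beta = [0.0] * p
    for _ in range(n_iter):
        grad = [0.0] * p
        for Xi, yi in zip(X, Y):
            T = len(yi)
            resid = [yi[t] - sum(Xi[t][j] * beta[j] for j in range(p))
                     for t in range(T)]
            w = solve(R, resid)  # R^{-1} (y_i - X_i beta)
            for j in range(p):
                grad[j] += sum(Xi[t][j] * w[t] for t in range(T))
        gmax = max(abs(g) for g in grad)
        if gmax == 0.0:
            break
        # Update only coordinates whose score is close to the largest one.
        for j in range(p):
            if abs(grad[j]) >= tau * gmax:
                beta[j] += step * grad[j] / len(X)
    return beta
```

Passing an identity matrix as `R` corresponds to the independent working structure; other structures only change how residuals within a subject are weighted before the threshold is applied.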
Since the outcomes were continuous, the mean squared error (MSE) statistic was calculated to evaluate the performance of the resulting gene signatures. It is worth pointing out that extending the GEE-TGDR algorithm to outcomes of other types is straightforward, with the corresponding quasilikelihood function serving as the objective function.
Statistical analysis was carried out in the R language version 3.6.1 (
In this study, we propose to extend the feature selection algorithm TGDR to account for the correlation structure of longitudinal data. This is accomplished by defining the objective function of TGDR as the corresponding quasilikelihood function, which, as in GEE, is specified based on the first two moments and a working correlation matrix. GEE-TGDR is described in the Materials and Methods section. In this section, we illustrate the application of the proposed method by looking for biomarkers that predict clinical resolution of psoriasis after treatment with two immune therapies.
Gene expression profiles of baseline lesional skin biopsies were obtained for 30 subjects followed up to 16 weeks after treatment with adalimumab and methotrexate. Clinical resolution at weeks 1, 2, and 4 was measured by PASI. In this example, we would like to identify a signature of genes whose baseline expression values correlate with changes in PASI, our continuous longitudinal outcome. We used 662 lncRNAs as covariates in the proposed GEE-TGDR model, under 4 different working correlation structures. The performance statistics (i.e., MSEs) and identified lncRNA genes are presented in Table
Results of psoriasis lncRNA longitudinal data.
Working correlation | Ave. of MSE (5-fold CV) | SD of MSE (CV) | MSE (all data) | Identified lncRNAs (all data): baseline | Identified lncRNAs (all data): week 1 | Identified lncRNAs (all data): week 2 | Identified lncRNAs (all data): week 4 |
---|---|---|---|---|---|---|---|
AR1 | 14.456 | 3.258 | 2.101 | RAMP2-AS1 | RAMP2-AS1 | RAMP2-AS1 | RAMP2-AS1 |
Unstructured | 3.725 | 0.498 | 0.793 | XIST | LRRC75A-AS1 | LRRC75A-AS1 TMEM99 LINC01018 PAXIP1-AS1 LINC01139 RAMP2-AS1 | TMEM99 |
Exchangeable | 2.758 | 1.649 | 0.767 | XIST | LRRC75A-AS1 XIST LINC01139 SDHAP2 RAMP2-AS1 | TMEM99 LINC01139 RAMP2-AS1 | TMEM99 |
Independent | 2.675 | 1.694 | 0.760 | SNHG5 LINC01139 RAMP2-AS1 MIR205 | SNHG5 RAMP2-AS1 | SNHG5 TMEM99 RAMP2-AS1 | SNHG5 |
Only baseline expression values were used. AR1: autoregressive order 1; MSE: mean squared error; SD: standard deviation; CV: cross-validation.
In this application, the results obtained under the exchangeable, unstructured, and independent working correlation structures barely differ, with similar sets of biomarkers leading to similar performance. This reflects a well-known robustness property of GEE: when the mean model is correctly specified, the GEE estimates remain consistent even if the correlation structure is misspecified. Under the AR1 structure, GEE-TGDR identified only one lncRNA as being related to PASI scores, leading to underfitting and inferior performance compared with the other three correlation structures.
Because of patient burden and budgetary restrictions, longitudinal omics series are usually very short and unevenly spaced. In this case, AR1 is not well suited and the unstructured correlation may be the most suitable structure, even though it involves more nuisance parameters in the working correlation matrix.
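To make the comparison of working correlation structures concrete, the sketch below builds the independent, exchangeable, and AR1 matrices for a given number of time points and counts the nuisance correlation parameters of each structure. The function names and the `rho` argument are assumptions made for illustration; the unstructured matrix is not built because its entries are free parameters estimated from data.

```python
def working_correlation(structure, n_times, rho=0.0):
    """Return an n_times x n_times working correlation matrix (nested lists)."""
    def entry(j, k):
        if j == k:
            return 1.0
        if structure == "independent":
            return 0.0
        if structure == "exchangeable":
            return rho  # same correlation for every pair of time points
        if structure == "ar1":
            return rho ** abs(j - k)  # decays with the index lag
        raise ValueError(f"unsupported structure: {structure}")
    return [[entry(j, k) for k in range(n_times)] for j in range(n_times)]


def n_nuisance_parameters(structure, n_times):
    """Number of correlation parameters implied by each structure."""
    counts = {
        "independent": 0,
        "exchangeable": 1,
        "ar1": 1,
        "unstructured": n_times * (n_times - 1) // 2,
    }
    return counts[structure]
```

Note that this AR1 form depends only on the index lag, so it treats the interval between weeks 2 and 4 the same as the interval between weeks 1 and 2, which is one way to see why it fits unevenly spaced measurements poorly. With four time points the unstructured matrix has six free correlations, against one for the exchangeable and AR1 structures.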
Cross-validation (CV) results give an idea of the variability in model performance. They indicate that all correlation structures except AR1 provided similar results: the exchangeable and independent structures had the smallest MSEs but larger variability, whereas the unstructured structure had a larger MSE but smaller variability.
Even though, at individual time points, the identified features varied substantially for the unstructured, exchangeable, and independent working correlation structures (Figure
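The average and standard deviation of the fold-wise MSE can be computed with a subject-level split, so that all repeated measurements of a patient stay in the same fold. The sketch below is an assumption about how such a summary could be obtained, not a description of the authors' code; `fit` and `predict` are hypothetical callables supplied by the user, and folds are assigned by subject index modulo `k`.

```python
import statistics


def subject_kfold_mse(fit, predict, X, Y, k=5):
    """Mean and SD of per-fold MSE under subject-level k-fold CV.

    X[i] holds the covariates of subject i and Y[i] the list of that
    subject's repeated outcomes. fit(X_train, Y_train) returns a model;
    predict(model, x) returns predicted outcomes for one subject.
    Requires k >= 2 and at least k subjects.
    """
    n = len(X)
    fold_mse = []
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        model = fit([X[i] for i in train], [Y[i] for i in train])
        # Squared errors over every time point of every held-out subject.
        errors = [(obs - pred) ** 2
                  for i in test
                  for obs, pred in zip(Y[i], predict(model, X[i]))]
        fold_mse.append(sum(errors) / len(errors))
    return statistics.mean(fold_mse), statistics.stdev(fold_mse)
```

Splitting by subject rather than by observation avoids leaking a patient's own correlated measurements into the training folds, which would make the MSE look better than it is.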
Venn diagram of identified lncRNAs for baseline, at weeks 1, 2, and 4, respectively, by different working correlation structures. (a) Under the unstructured working correlation structure. (b) The exchangeable working structure. (c) The independent working structure.
Venn diagram of integrated lncRNAs by three working correlation structures.
In order to further characterize the GEE-TGDR method, a comparison with two competing methods was made. One competing method under consideration is the GEE-screening method [
Comparison between the GEE-TGDR method and two competing algorithms.
Method | Size | Predictive error |
---|---|---|
GEE-TGDR | 9 | 30% |
GEE-based screening | 50 | 40% |
Linear mixed model-based screening | 27 | 33.33% |
To gain biological insight into the identified biomarkers, we evaluated the relevance to psoriasis of the 10 identified lncRNAs using disease confidence scores, where a high score represents solid support by the literature according to the GeneCards database. None of the 10 lncRNAs were directly related to psoriasis while 5 lncRNAs, listed in descending order of confidence score and thus of support by the literature according to the GeneCards database,
Little meaningful information could be extracted from currently annotated lncRNA databases, which is not surprising since psoriasis remains largely unexplored from the perspective of lncRNAs. We thus focused on studying the mRNAs correlated with or targeted by these lncRNAs. Specifically, we identified the genes whose baseline lesional expression was strongly correlated with at least one of the 10 lncRNAs (
A gene-set overrepresentation analysis was carried out on the 225 mRNAs identified as targeted by the 10-lncRNA biomarker panel using the STRING software [
Lastly, among the 225 mRNAs, we selected the top 10 in terms of psoriasis relevance (
Resulting interaction network of identified lncRNAs and their correlated mRNAs. Here, only mRNAs with high enough confidence scores for the relevancy to psoriasis were considered. From the network, it is observed that IL10 is a hub gene directly connecting several other mRNAs and three identified lncRNAs. Four lncRNAs were highlighted in yellow, and the other six lncRNAs without correlated mRNAs were omitted from the graph.
At its current stage, the GEE-TGDR method has several limitations. First, no grouping structure is taken into account, and thus, the GEE-TGDR method belongs to the conventional embedded feature selection category. So far, accumulated studies [
Second, the TGDR method is much slower than the coordinate descent (CD) [
Third, the GEE-TGDR method takes only time-invariant covariates in its current version. For longitudinal gene expression profiles, a summary score can be used to condense each gene's expression values over time into one overall value, so that the covariates become time-invariant again. For example, the mean values of lncRNA expression profiles at baseline and week 1 can be used to represent the corresponding lncRNAs and then serve as covariates to investigate whether they are associated with PASI scores at weeks 1, 2, and 4 or with the changes of PASI scores at those time points from the baseline levels. On the other hand, the GEE-TGDR method can certainly be extended to handle time-varying covariates, which would make it possible to examine the impact of dynamic changes in gene expression values on the outcomes of interest and thus facilitate timely adjustment of treatment strategies. Lastly, only continuous outcomes are currently supported; the method can be extended to handle outcomes of other types, with the corresponding quasilikelihood function acting as the objective function.
In this study, we propose a new feature selection algorithm that is capable of analyzing longitudinal outcomes and investigating the associations between gene expression profiles and the temporal changes of outcomes. In the psoriasis application, overfitting is possible, given the large discrepancy in MSE statistics between the whole training set and the cross-validations. Even worse, but more realistically, overfitting and underfitting may coexist in a feature selection process. In real-world applications the truly relevant genes are unknown, so biological relevance is usually used to gain some insight into the appropriateness of the identified gene lists. Nevertheless, for psoriasis and the mechanisms underlying its immune treatments, little has been investigated from the perspective of lncRNAs. To the best of our knowledge, our work is one of the first efforts to unveil the mechanisms of psoriasis and its immune treatments using lncRNA expression profiles and a feature selection method specific to longitudinal data.
After the limitations of the GEE-TGDR method are addressed, we believe that a lncRNA signature could be obtained that distinguishes patients who would respond to a specific treatment from those who would not, thus facilitating personalized regimens or at least complementing other molecular markers for precise treatment strategies.
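The summary-score idea described above, averaging each gene's expression over selected time points to obtain a time-invariant covariate, can be illustrated as follows. The function name, the dictionary layout, and the time-point labels are hypothetical choices for this example only.

```python
def summary_scores(expression, time_points=("baseline", "week1")):
    """Average each gene's expression over the chosen time points.

    expression maps a gene symbol to a dict of {time_point: value} for one
    subject; the result maps each gene symbol to a single summary value.
    """
    return {
        gene: sum(values[t] for t in time_points) / len(time_points)
        for gene, values in expression.items()
    }
```

Other summaries, such as a slope over time or the change from baseline, could be substituted for the mean without altering how the resulting covariates enter the model.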
In this study, we proposed a novel feature selection algorithm—GEE-TGDR—capable of handling longitudinal outcomes and identifying relevant genes associated with the temporal changes of such outcomes.
Our future work will focus on eliminating the limitations of the GEE-TGDR method. In particular, extensions of the current procedure to outcomes other than continuous ones and a faster, more efficient implementation of the coefficient updates are at the top of this list.
It is worth mentioning that, besides dealing with longitudinal clinical outcomes, GEE-TGDR can be used to infer the associations between lncRNAs and mRNAs and thus to construct lncRNA-mRNA interaction networks. For example, using well-known cancer-related mRNAs as outcomes, the lncRNAs that may potentially regulate or target those mRNAs could be found with the aid of the GEE-TGDR method, which is also part of our planned future work. Therefore, we anticipate a widespread application of the GEE-TGDR method in omics data analysis.
Preprocessed gene expression data (accession no.: GSE85034) along with patient’s clinical information were downloaded from the GEO database (
No competing interests have been declared.
ST conceived and designed the study. ST and CW analyzed the data. CW and ST interpreted data analysis and results. ST and CW wrote the paper. All authors reviewed and approved the final manuscript.
We thank Dr. Danna Gilbreath for the English editing. This study was supported by a fund (No. 31401123) from the National Natural Science Foundation of China. Dr. Suarez-Farinas was also supported by the Irma T. Hirschl/Monique Weill-Coulier Research Award.