GEE-TGDR: A Longitudinal Feature Selection Algorithm and Its Application to lncRNA Expression Profiles for Psoriasis Patients Treated with Immune Therapies

With the fast evolution of high-throughput technology, longitudinal gene expression experiments have become affordable and increasingly common in biomedical fields. Generalized estimating equation (GEE) approach is a widely used statistical method for the analysis of longitudinal data. Feature selection is imperative in longitudinal omics data analysis. Among a variety of existing feature selection methods, an embedded method—threshold gradient descent regularization (TGDR)—stands out due to its excellent characteristics. An alignment of GEE with TGDR is a promising area for the purpose of identifying relevant markers that can explain the dynamic changes of outcomes across time. We proposed a new novel feature selection algorithm for longitudinal outcomes—GEE-TGDR. In the GEE-TGDR method, the corresponding quasilikelihood function of a GEE model is the objective function to be optimized, and the optimization and feature selection are accomplished by the TGDR method. Long noncoding RNAs (lncRNAs) are posttranscriptional and epigenetic regulators and have lower expression levels and are more tissue-specific compared with protein-coding genes. So far, the implication of lncRNAs in psoriasis remains largely unexplored and poorly understood even though some evidence in the literature supports that lncRNAs and psoriasis are highly associated. In this study, we applied the GEE-TGDR method to a lncRNA expression dataset that examined the response of psoriasis patients to immune treatments. As a result, a list including 10 relevant lncRNAs was identified with a predictive accuracy of 70% that is superior to the accuracies achieved by two competitive methods and meaningful biological interpretation. A widespread application of the GEE-TGDR method in omics longitudinal data analysis is anticipated.


Introduction
With fast evolution of high-throughput technology, longitudinal omics experiments have become affordable and increasingly common in many biomedical fields for exploring dynamically or temporally changed biological systems or processes. Usually, the analysis strategies focus on analyzing individual time points separately. As many investigators have pointed out [1][2][3][4], a failure to incorporate information contained in the dependent structure of time course data results in inefficient estimation of the standard errors, leading to an inadequate statistical power. Especially in big omics studies, this problem stands out since the sample size of such data is usually small. Furthermore, an oversimplified consideration by combining the results from marginal analysis at individual time points tends to fail to detect a meaningful pattern of changes over time.
The generalized estimating equation (GEE) approach [5] is a well-established and widely used statistical method to analyze longitudinal data. GEE considers the first two marginal moments (i.e., mean and variance) of data and a working correlation matrix to model correlated responses, artfully avoiding the specification of full joint likelihood function. The appeal of GEE lies in that it yields consistent estimators for the parameters of interest, even if the working correlation structure is incorrectly specified. Naturally, GEE has been modified or extended to identify differentially expressed genes over time for high-throughput data. Such modifications and/or extensions are not simple due to the high dimensionality of omics data, although some efforts have been made [1,2,6].
Like its crosssectional counterpart, feature selection is imperative in the learning process for longitudinal omics data. Feature selection is aimed at eliminating irrelevant genes, avoiding overfitting, speeding up the learning process, and achieving a final model that is parsimonious (i.e., the number of selected genes is as least as possible). Consequently, a modification to GEE to analyze high-dimensional data necessitates the involvement of feature selection. In the literature, there are several such algorithms. For example, Wang et al. [2] used a smoothly clipped absolute deviation (SCAD) penalty term [7] which is a novel extension to the L 1 penalty to equip the GEE models with feature selection capacity. The L 1 penalty, also known as LASSO [8], forces genes with small estimated coefficients out of the final model, rendering a sparse solution by the means of which feature selection occurs. However, two subsequent works on this topic [1,6] showed that this algorithm usually fails to converge when the number of covariates is much larger than the number of samples. This drawback is more apparent and fatal in longitudinal omics data, where the sample size is typically smaller than that of a crosssectional study.
Among a variety of existent feature selection algorithms, we have devoted dramatic efforts on the threshold gradient descent regularization (TGDR) [9] method (see the Methods section for its description). Previously, we had extended TGDR for classification task of multiple groups (>2) and for identification of subgroup-specific prognostic genes with a survival outcome [10][11][12][13][14]. By applying these TGDR extensions to different types of omics data including microarray, RNA sequencing, and mass spectrometry (MS) data, we have shown that TGDR and its respective extensions have many merits including easy-to-moderate programming intensity, good predictive performance, and biologically meaningful implications of the resulting signatures. In a recent work [4], we show that the TGDR algorithm can be regarded as an optimization strategy and that the final models given by TGDR have superior predictive performance and more meaningful biological interpretation than the LASSO models optimized by the coordinate descent method [15]. Therefore, an integration of GEE with TGDR may overcome the drawbacks of existing approaches for the purpose of longitudinal feature selection.
Long noncoding RNAs (lncRNAs) are posttranscriptional and epigenetic regulators and have the characteristics of lower expression levels and more tissue-specific compared with protein-coding genes [16]. Once being regarded as evolutionary junks, lncRNAs have been demonstrated to play essential roles in many complex diseases, especially in cancer [16]. As pointed out by our previous study [17], psoriasis is an ideal model for examining the effects of targeted immune treatments given that it is well characterized by molecular profiles, displays low placebo effects, and possesses easily accessible diseased tissues. So far, the implication of lncRNAs in psoriasis remains largely unexplored and poorly understood. Among the limited research carried out to explore the roles of lncRNAs play in psoriasis; however, some encouraging results have turned up. For example, a very recent study [18] has shown that LOC285194 can serve as a sponger for miR-616 that regulates the expression of GATA3 though binding to its 3 ′ -untranslated region using Western blotting, quantitative real-time PCR, and dual-luciferase reporter assays. Specifically, the expression level of LOC285194 was lower in the affected skin of patient with psoriasis compared to the unaffected skin. Furthermore, Rakhshan et al. [19] showed that one SNP (i.e., rs12826786) of the HOX Transcript Antisense RNA (HOTAIR) is associated with a higher risk of developing psoriasis (TC+TT versus CC: OR = 1:59, p = 0:02). Therefore, we believe that the roles of lncRNAs play in psoriasis deserve to be explored deeply and widely.
In this article, we proposed a new feature selection algorithm, referred to as GEE-TGDR, specifically for longitudinal data mining and feature selection. In the GEE-TGDR method, the corresponding quasilikelihood function of a GEE model is the objective function to be optimized, while the optimization and feature selection are accomplished by the TGDR method. We applied this method to a longitudinal microarray gene expression data that is aimed at assessing the treatment efficacy of two immune therapies for psoriasis patients and identified the relevant lncRNAs that can predict the temporal changes of psoriasis area and severity index (PASI) scores that is utilized to determine if a patient with psoriasis responds to the treatments, with the objectives of revealing the underlying mechanisms of these two treatments from the perspective of lncRNAs.
Following the structures of a review by [20], the article is organized as follows. In Section 2, the details about the proposed GEE-TGDR method are given. In Section 3, the application of the GEE-TGDR method to psoriasis longitudinal lncRNA expression data and the analysis results are presented. Then, the biological relevance of identified lncRNA signature to psoriasis is discussed in detail. In Section 4, the limitations of the present study in addition to contributions and future work are discussed. Lastly, conclusions are given.

Materials and Methods
2.1. Experimental Data. The microarray dataset [17] used to characterize the proposed GEE-TGDR algorithm was in the Gene Expression Omnibus (GEO) database (https://www .ncbi.nlm.nih.gov/geo/) under the accession number of GSE85034. There were 179 arrays in this experiment, including the gene expression profiles of 30 patients with moderate to severe psoriasis at the baseline nonlesion skins and baseline lesion skins and at weeks 1, 2, 4, and 16. Of the 30 patients, half were administrated with adalimumab (ADA), and the other half were treated with methotrexate (MTX). One patient on the ADA arm had no expression measurements of week 16 since his/her psoriasis area and severity index (PASI) score already had experienced a 75% decrement at the week 4. In original paper, a treatment response was based on a reduction of 75% in PASI score after week 12 or later. Longitudinal profiles of PASI scores (baseline lesion skins, at weeks 1, 2, and 4) were the outcomes of interest, and the lncRNA expression values of the baseline lesional skins serve as potential predictors to investigate if they are relevant to the PASI scores of psoriasis patients over time.
In this study, the preprocessed data were directly downloaded from the GEO database. No alternative preprocessing had been carried out. By matching the gene symbols of lncRNAs in the GENCODE (https://www.gencodegenes.org/ ) database (version 32) to those of genes annotated by the Illumina HumanHT-12 V 4.0 bead chips, 662 unique lncRNAs were identified and included in the downstream analysis.

Statistical Methods.
In this paper, we conceive a new novel feature selection algorithm called GEE-TGDR specifically for selecting relevant features associated with the temporal changes of longitudinal outcomes, in which GEE is equipped with TGDR just as its name implies. We briefly described both GEE and TGDR methods before proceeding to the proposed integration. Here, to keep it the most relevant, we focused on the case of continuous outcomes.

Threshold Gradient Descent
Regularization. For continuous outcomes, the TGDR algorithm is based on a linear model, where a response variable Y i (i = 1, ⋯, n, n is the sample size) is modelled by a P-dimensional vector of observed covariates X ip (here, p = 1, ⋯, P) as E ðY | XÞ = X T β. Here, β ' s represent the coefficients of covariates for the magnitudes of association between covariates and the outcome. For continuous outcomes, a normal distribution is usually assumed, and then, the corresponding likelihood function is used as a response function/an objective function in the TGDR algorithm. With some algebraic simplification, the response function can be written as The TDGR algorithm started from that the β ' s were initially set at zero's (corresponding to the null model). Using Δv to denote a small positive increment (e.g., 0.01) in the gradient descent search, and for iteration k, (1) Upon current estimate β ðkÞ , a negative gradient matrix g with its p th component as g p ðkÞ are calculated as (2) Let f ðkÞ represent the threshold vector of size P at iteration k and I ðxÞ is an indicator (if the condition x is true, this indicator returns 1; otherwise, its value is 0), then its p th component (for the p th gene) is (3) Update β p ðk+1Þ = β p ðkÞ + Δv × g p ðkÞ × f p ðkÞ and k = k + 1 (4) Repeat steps 1-3 for K times. K can be regarded as a tuning parameter, with a large value corresponding to a dense model (more nonzero coefficients) and a small value to a sparse model (less nonzero coefficients). The optimal value of K is determined by crossvalidations (CVs).
In the TGDR method, no explicit penalty term is added to the objective function (i.e., response function). The regularization on coefficients (thus the selection of features) is made possible by introducing the threshold function f ðkÞ in step 2, which determines if the gradient of a coefficient is large enough to descent or more precisely speaking to be updated. For more detailed description of the TGDR method, the works [9,21] are referred.

Generalized Estimating Equation.
In the longitudinal notation, the j th time point/measurement of the i th subject, a t-dimensional vector of response variables Y ij (here, i = 1, ⋯, n and j = 1, ⋯, t) and covariates X ijp (here, p = 1, ⋯, P represents p th covariate) are observed. Thus Y i: = ðY i1 , ⋯, Y it Þ T denotes the vector of responses at t different time points for subject i, and X ij = ðX ij1 , ⋯, X ijP Þ T is P covariates for subject i at time point j.
In the GEE model, the first two marginal moments of Y ij are denoted by μ ij ðβÞ = E ðY ij | X ij Þ (the expectation of Y ij given X ij ) and σ 2 ðβÞ = V ðY ij Þ (the variance of Y). Here, β ' s are the coefficients representing the magnitude of association between covariates and outcomes, with β jp representing how attribute p is associated with the value of outcome Y :j (meaning the outcome at time point j). Those β ' s are parameters of interest. Furthermore, the distribution of Y ij is assumed to belong to an exponential family with a canonical link function. Let Here, R i ðαÞ is an t × t working correlation matrix with α as the finite dimensional parameter vector for correlations, which would be usually estimated by the residual-based moment method. In a GEE model, the quasilikelihood function can be written as 3 BioMed Research International Four structures are commonly used for the working correlation matrix R i ðαÞ-first-order autoregressive (AR1), exchangeable, unstructured, and independent structure.

GEE-TGDR.
The conventional TGDR method only deals with univariate outcomes. As far as longitudinal outcomes that are multivariate are concerned, the method needs to be extended.
In this study, we proposed to replace the likelihood function with the corresponding quasilikelihood function and to extend TGDR as GEE-TGDR. With Δv denoting a small positive increment (e.g., 0.01) in gradient descent search, then at k iteration, (1) Upon current estimate β ðkÞ , a negative gradient matrix g with its ðj, pÞ th component as g jp (2) Let f ðkÞ j represent the threshold vector of size P for the j th time point (j = 1, ::, t) at iteration k, then its p th component (for the p th gene) is In this study, we only developed the GEE-TGDR algorithm for continuous outcomes given in the motivated database; PASI scores which are continuous were the outcomes of interest, then the corresponding expectations of Y i ' s given Here, let j = 1, 2, ::, t represent the time points measured; then X ij are for the gene expression profiles at time point j for subject i, and β j are for the corresponding coefficients of those gene expression values at time point j. Figure 1 gives the graphical illustration of the GEE-TGDR algorithm for continuous longitudinal outcomes.
Since the outcomes were continuous, the mean squared error (MSE) statistic was calculated to evaluate the performance of resulting gene signatures. It is worth pointing out that for the outcomes of other types, an extension suitable for the underlying data type of GEE-TGDR algorithm is straightforward, with the corresponding quasilikelihood function serving as the objective function/response function.

Statistical Language.
Statistical analysis was carried out in the R language version 3.6.1 (http://www.r-project.org).

Identified LncRNA Signatures.
In this study, we propose to extend the feature selection algorithm TDGR to account for correlation structure of longitudinal data. This is accomplished by defining the objective function of TDGR as the corresponding quasilikelihood function, which as in GEE is specified based on the first two moments and a working correlation matrix. TDGR-GEE is described in the Materials and Methods section. In this section, we illustrate the application of the proposed method while looking for biomarkers that predict clinical resolution of psoriasis after being treated with two immune therapies.
Gene expression profiles of baseline lesional skin biopsies were obtained for 30 subjects followed up to 16 weeks after treatment with adalimumab and methotrexate. Clinical resolution at weeks 1, 2, and 4 was measured by PASI. In this example, we would like to identify a signature of genes whose baseline expression values correlate with changes in PASI, our continuous longitudinal outcome. WE used 662 lnRNA as covariates in the proposed GEE-TGDR model, under 4 different working correlation structures. The performance statistics (i.e., MSEs) and identified lncRNA genes are presented in Table 1.
In this application, the results obtained under working correlation structures exchangeable, unstructured, and independent barely differ, with similar sets of biomarkers leading to similar performance. This reflects a well-known robust characteristic of GEE, where when predictors are correctly given, the GEE estimates remain consistent even if the correlation structures are misspecified. Under the AR1 structure, GEE-TGDR identified only one lncRNA as being related to PASI scores, leading to an underfitting and inferior to the performance when compared to the other three correlation structures.
Due to the patient burden and budgetary restrictions, longitudinal omics data are usually very short and unevenly spaced. In this case, AR1 is not well suited and the unstructured correlation may be the most suitable structure, even though that this structure corresponds to a model with more nuisance parameters involved in the corresponding working correlation structure.
Crossvalidation (CV) results gave us an idea for the variability in the model performance in this regard; CV results indicated that all correlation structures but AR1 structure provided similar results, with both the exchangeable and independent structures having the least MSEs but a bigger variability and the unstructured structure having a larger MSE but the smaller variations.
Even though that at individual time points, the identified features varied substantially for the unstructured, exchangeable, and independent working correlation structures (Figure 2), the unions of lncRNA lists across time are essentially the same, including 9 lncRNAs identified by all these three structures and one lncRNA selected by the independent structure alone (Figure 3).

Comparison with Competing Methods.
In order to further characterize the GEE-TGDR method, a comparison with two competing methods was made. One competing method under consideration is the GEE-screening method [1] in which a GEE model was fit with the PASI scores over time as the outcome and the expression values of a certain lncRNA as the covariate. The GEE-screening method filtered genes one by one. Of note, in the GEE-TGDR method and the GEE-screening model, we only considered the unstructured working correlation structure. In the other competing method, namely, linear mixed model-based screening  5 BioMed Research International method, a GEE model was replaced by a linear mixed model (the outcome and the covariate are the same as those in the GEE-screening model), and the intercept term was regarded as a random effect. The lncRNAs with corresponding p values of the coefficients <0.05 were selected as being relevant in both competing methods. Then, a support vector machine model was fit using the response status as the outcome and the identified lncRNAs by a specific method as predictors. According to the 10-fold crossvalidation results (to estimate the predictive performance of each method), the GEE-TGDR method achieved the best predictive accuracy ( Table 2). Of note, even GEE-TGDR has the best performance compared to the other two competitive methods, its predictive accuracy is estimated as 70%, which is far from 100%, leaving a large space to be improved.

Biological Relevance.
In order to gain biological insightidentified biomarkers, we evaluated the relevance to psoriasis of the 10 identified lncRNA using disease confidence scores, where a high score represents a solid support by the literature  BioMed Research International according to the GeneCards database. None of the 10 lncRNAs were directly related to psoriasis while 5 lncRNAs, listed in a descending order for the confidence scores and thus descending support by the literature according to the GeneCards database, MIR205, XIST, SNHG5, LINC01139, and SDHAP2 were associated with immunity. Little meaningful information was extracted from currently annotated lncRNA databases, no surprisingly since that psoriasis remains largely unexplored from the perspective of lncRNAs. We thus focused on studying the mRNAs correlated or targeted by these lncRNAs. Specifically, we identified the genes whose baseline lesional expression was strongly correlated with at least one of the 10 lnRNA (|Spearman correlation coefficient | >0:6, 5) and identified 225 mRNAs genes. According to the GeneCards database [22], approximately 30% of these mRNAs (64) were directly related to psoriasis, most notably IL10, FABP5, KRT16, CCR6, IL18, STAT3, GATA3, and SERPINB3, providing some validation of the lncRNA biomarkers identified by the GEE-TGDR method. In contrast, among the 29 target mRNAs identified by the lncRNA Disease 2.0 database [23] as targeted by the 10 lnRNA panel (all of which were identified by the correlation approach), GeneCards claimed that CCR10, AOC3, UBB, and WNK4 were directly related to psoriasis, but only CCR10 had a large confidence score for its relevancy to psoriasis. Of note, among the 10 lncRNAs, only RAMP2-AS1, PAX1P1-AS1, TMEM99, and LIN01018 have many correlated mRNAs, but the other five have few or no correlated mRNAs at all.

Enriched Pathways by Target mRNAs.
A gene-set overrepresentation analysis was carried out on the 225 mRNAs identified as targeted by the 10 lnRNA biomarker panel using the STRING software [24] on KEGG and GO collections. About 346 enriched biological process (BP) terms, 23 molecular function (MF) terms, and 21 cellular component (CC) terms were identified in the GO collection reflecting the immune pathophysiology of the disease. The top 3 enriched KEGG pathways [25] reflected the inflammatory processes not only identifying inflammatory bowel diseases (FDR < 0:001) and cytokine-cytokine receptor interaction (FDR = 0:005) but also zeroing on the hallmark pathway in psoriasis: Th17 differentiation (FDR = 0:031).
Lastly, among the 225 mRNA, we selected the top 10 in terms of psoriasis-relevance (confidence score for relevancy > 15 ) and constructed a lncRNA-mRNA interaction network, visualized by Cytoscape software [26] (Figure 4). We observed that the target mRNAs are highly connected, with IL10 serving as a hub gene. It is well-known that IL10 is an immunosuppressive cytokine and enables to maintain immunological homeostasis [27]. Based on this, we anticipate that identified lncRNAs may regulate the expression of important cytokines such as IL10 and warrant further investigation.

Limitations and Future
Work. At current stage, the GEE-TGDR method has several limitations. First, no grouping structure is taken into account, and thus, the GEE-TGDR method belongs to the conventional embedded feature selection category. So far, accumulated studies [28][29][30][31] have shown that a pathway-based method that considers grouping information is superior to its gene-based counterpart in which grouping information is ignored. Thus, how to extend the proposed GEE-TGDR method to account for correlations among genes is a research avenue we will pursue in the near future.
Second, the TGDR method is much slower than the coordinate descent (CD) [15] method as shown by our previous study [4]. Given that the GEE-TGDR extension has the TGDR method as an optimization strategy, its speed of convergence is expected to be very slow. A method that combines the merits of these two algorithms together is definitely in demand. Alternatively, a sine cosine algorithm [20] may be integrated into the gradient descent step for a faster updating and a better tuning of hyperparameters (tuning parameters). Furthermore, the step increment Δv is fixed at a constant value in the current version. In the future, this parameter will be modified to update along the iterations, as in the Adam algorithm, which may boost the computing efficiency and avoid being stuck in a local minimum value as well.
Third, the GEE-TGDR method only takes time-invariant covariates in its current version. For longitudinal gene expression profiles, a summary score would be utilized to summarize each gene's expression values over time as one overall value. Consequently, covariates became timeinvariant again. For example, the mean values of lncRNA expression profiles at baseline and week 1 can be used to represent the corresponding lncRNAs and then as the covariates to investigate they are associated with PASI scores at week 1, week 2, and week 4 or the change of PASI scores at those time points from the baseline levels. On the other hand, the GEE-TGDR method can be certainly extended to handle time-varying covariates, which can examine the impact of dynamic changes in gene expression values on the outcomes of interest and thus facilitate a timely adjustment on treatment strategies accordingly. Lastly, right now, the only type of outcomes is continuous; yet certainly, it can be extended to handle outcomes of other types, with the corresponding quasilikelihood function acting as the objective function. 33.33% * The predictive errors were calculated on the basis of 10-fold crossvalidations. Here, the response status, i.e., if the PASI score experienced a reduction of 75% from the baseline affected skin after week 12 or later. Size: the number of identified lncRNAs by a specific method; here, the sizes trained on the whole dataset were given; in crossvalidations, these numbers were subject to changes since the training sets were a subset of the whole dataset. For GEE-TGDR-and GEE-based screening, only unstructured working correlation matrix was considered.

Contributions.
In this study, we propose a new feature selection algorithm that is capable of analyzing longitudinal outcomes and investigating the associations between gene expression profiles and the temporal changes of outcomes. In the psoriasis application, overfitting might be possible on the basis of the large discrepancy in MSE statistics between the whole training set and the crossvalidations. Even worse but more realistic, overfitting and underfitting may accompany each other to exist in a feature selection process. Since for real-world applications, the true relevant genes are unknown so the biological relevance is usually resorted to abstract some insight about the appropriation of identified gene lists. Nevertheless, for psoriasis and the underlying mechanism of immune treatments to combat this disease, little has been investigated from the perspective of lncRNAs to mine such relevant information. To the best of our knowledge, our work here is one of first efforts to unveil the mechanisms of psoriasis and its immune treatments using lncRNA expression profiles and a feature selection method specific for longitudinal data.
After the limitations of the GEE-TGDR method are addressed in the near future, we believe that a lncRNA signature will be harvested to tell precisely which patients would respond to a specific treatment from those who would not and thus facilitating personalized regimens or at least complementing other molecular markers for precise treatment strategies.

Conclusions
In this study, we proposed a novel feature selection algorithm-GEE-TGDR-capable of handling longitudinal outcomes and identifying relevant genes associated with the temporal changes of such outcomes.
Our future work will focus on eliminating the limitations of the GEE-TGDR method. In addition, extensions of the current procedure to analyze other types of outcomes rather than continuous ones and a more efficient and faster implementation of updating coefficients are at the top of this list.
It is worth mentioning that besides dealing with longitudinal clinical outcomes, the GEE-TGDR can be adopted to inference the associations between lncRNAs and mRNAs and thus construct lncRNA-mRNA interaction networks. For example, using well-known cancer-related mRNAs as outcomes, the lncRNAs that may potentially regulate/target those mRNAs could be found with the aid of the GEE-TGDR method, which is also one of our future works. Therefore, we anticipate a widespread application of the GEE-TGDR method in omics data analysis.

Conflicts of Interest
No competing interests have been declared. Here, only mRNAs with high enough confidence scores for the relevancy to psoriasis were considered. From the network, it is observed that IL10 is a hub gene directly connecting several other mRNAs and three identified lncRNAs. Four lncRNAs were highlighted in yellow, and the other six lncRNAs without correlated mRNAs were omitted from the graph.