Identification of a 20-Gene Expression-Based Risk Score as a Predictor of Clinical Outcome in Chronic Lymphocytic Leukemia Patients

Despite improvements in treatment options, chronic lymphocytic leukemia (CLL) remains an incurable disease, and patients show a heterogeneous clinical course, with many of them requiring therapy. In the current work, we used microarray gene expression data to build a 20-gene expression (GE)-based risk score that is predictive of patients' overall survival and improves risk classification. The GE-based risk score identified a high-risk group associated with significantly shorter overall survival (OS) and time to treatment (TTT) (P ≤ .01), comprising 19.6% and 13.6% of the patients in two independent cohorts. The GE-based risk score and NRIP1 and TCF7 gene expression remained independent prognostic factors in multivariate Cox analyses, and combining the GE-based risk score with NRIP1 and TCF7 gene expression enabled the identification of three clinically distinct groups of CLL patients. Therefore, this GE-based risk score represents a powerful tool for risk stratification and outcome prediction of CLL patients and could thus be used to guide clinical and therapeutic decisions prospectively.


Introduction
Chronic lymphocytic leukemia (CLL), the most common leukemia in Western countries, is characterized by the clonal proliferation and accumulation of neoplastic B lymphocytes in the blood, bone marrow, lymph nodes, and spleen. CLL shows a heterogeneous clinical course: many patients have an indolent disease, while others suffer from rapid disease progression and need early treatment [1]. Clinical staging systems based on physical examination and routine laboratory tests are the first basis for assessing different prognostic subgroups in patients with CLL [1]. However, these staging systems have a limited capacity to predict clinical outcome at an early stage of the disease and do not predict the likelihood of response to treatment in an individual with advanced disease [2].
Several biomarkers have been identified as prognostic factors in CLL. These include somatic hypermutations in the rearranged variable regions of the immunoglobulin heavy chain (IgVH) genes, which involve around 30-40% of patients. Patients with unmutated IgVH genes had a significantly shorter median overall survival (OS) than those with mutated ones [3]. IgVH mutation status, along with deletions at 11q22-q23 (11q-) and/or 17p13 (17p-), have been identified as independent prognostic factors in CLL patients [4,5].
Meanwhile, with the advent of microarray technology and gene expression profiling (GEP) analyses, additional markers have been investigated for their potential prognostic impact in CLL. Of these, LPL (Lipoprotein lipase) [6], ZAP70 (zeta-associated protein 70) [7], CLLU1 (Chronic lymphocytic leukemia up-regulated 1) [8], and TCL1A (T-cell leukemia/lymphoma 1A) [9] have been demonstrated to be predictive for clinical outcome. Expression of microRNAs (e.g., miR-29c and miR-223) could also be of prognostic significance in CLL [10]. These markers, combined with others, were used to develop multigene expression-based prognostic scores. In 2006, Zucchetto et al. constructed a scoring system based on six surface expression molecules [11]. In a study by Rodríguez et al. [12], a predictor model based on the expression of seven genes allowed the characterization of three groups of patients with distinct OS and treatment-free survival (TFS) in two independent cohorts of patients. In 2010, Kienle et al. identified a four-gene combination, based on ZAP70, TCF7 (Transcription factor 7), DMD (Dystrophin), and ATM (Ataxia telangiectasia mutated) expression, as a predictor of IgVH mutation status in 88% of cases [13]. Stamatopoulos et al. developed a qPCR score, based on the expression of three markers (ZAP70, LPL, and miR-29c), that was able to significantly predict OS and TFS by dividing patients into three groups [14]. More recently, Herold et al. developed an eight-gene expression-based risk score which showed additional prognostic value for OS and TFS compared with the established genetic markers and Binet staging [15].
We report here the design of a GE-based risk score, involving 20 genes, whose value is strongly prognostic in 2 independent cohorts of CLL patients.

Patients.
Gene expression microarray data from three independent cohorts of patients diagnosed with CLL were used. Publicly available gene expression data from 2 cohorts of newly diagnosed CLL patients were used to construct the GE-based risk score [15]. The first cohort, used as the training cohort, comprised 107 patients, and the second, used as the validation cohort, comprised 44 patients [15]. Peripheral blood or bone marrow samples were analyzed by Affymetrix oligonucleotide microarrays [15]. A third cohort of 130 newly diagnosed patients, with available Affymetrix gene expression data, was used as a validation cohort for time to treatment analyses [16]. Clinical characteristics of patients and the number and schedules of treatments were previously published [15,16]. Interphase FISH data of the training cohort were previously published [17]. Affymetrix gene expression data are publicly available via the online Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/) under accession numbers GSE22762, GSE39671, and GSE25571. The data were normalized using the robust multichip average (RMA) method [15,16].

Gene Expression Profiling and Statistical Analyses.
The statistical significance of differences in overall survival between groups of patients was calculated by the log-rank test. Multivariate analysis was performed using the Cox proportional hazards model. Survival curves were plotted using the Kaplan-Meier method. All analyses were performed with R 2.10.1 and Bioconductor version 2.5.
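As an illustration of the Kaplan-Meier method mentioned above, the following minimal Python sketch computes a survival curve and the median survival from follow-up times and event indicators. It is not the R code used in this study; the function names and the toy data are hypothetical and serve only to show the product-limit calculation.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times  : follow-up time per patient (e.g., months)
    events : 1 if death was observed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)  # all patients leaving the risk set at time t
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


def median_survival(curve):
    """First time at which survival is <= 0.5; None if the median is not reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Toy data (hypothetical, not taken from the cohorts analyzed here).
times = [2, 3, 3, 5, 8, 8, 12, 15]
events = [1, 1, 0, 1, 0, 1, 0, 0]
curve = kaplan_meier(times, events)
median = median_survival(curve)
```

A median of `None` corresponds to the "not reached" median survival reported for low-risk groups.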

Selection of Prognostic Genes on the Training Set.
Probe sets were selected for prognostic significance using the Maxstat R function (R 2.10.1 and Bioconductor version 2.5) and Benjamini-Hochberg multiple testing correction [18], yielding 22 significant probe sets in the two independent cohorts of patients with CLL (Table 1).
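The Benjamini-Hochberg step can be illustrated with a short Python sketch. It adjusts a list of raw P values and keeps those with an adjusted value below 0.05, as in the selection described above. The raw P values are hypothetical, and the function is a generic implementation of the procedure rather than the code used in this study.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw P values for five probe sets.
raw = [0.001, 0.008, 0.039, 0.041, 0.20]
adj = benjamini_hochberg(raw)
selected = [i for i, q in enumerate(adj) if q < 0.05]
```

In this toy example the third and fourth probe sets have raw P values below 0.05 but are not retained after correction.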

Validation in the Independent Cohort of Patients.
The GE-based risk score of each CLL patient was calculated individually, and patients were grouped according to the prognostic models and cutoffs from the training cohort. The prognostic value of this scoring was evaluated using log-rank statistics and Cox models.

Gene Set Enrichment Analysis (GSEA).
We compared gene expression levels between CLL patients with a high GE-based risk score and those with a low risk score and selected the significantly differentially expressed genes for gene set enrichment analysis (GSEA). Gene set enrichment analysis was carried out by computing overlaps with canonical pathways and gene ontology gene sets obtained from the Broad Institute [19].
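One common way to score the overlap between a list of differentially expressed genes and a curated gene set is the upper tail of the hypergeometric distribution. The sketch below shows this calculation in Python. It is given under the assumption that a hypergeometric test is an acceptable stand-in for the overlap computation; the function name and the numbers are hypothetical and do not reproduce the analysis reported here.

```python
from math import comb


def overlap_pvalue(universe_size, gene_set_size, query_size, overlap):
    """P(X >= overlap) for X ~ Hypergeometric(universe, gene set, query).

    universe_size : number of genes that could have been selected
    gene_set_size : number of genes in the curated gene set
    query_size    : number of differentially expressed genes
    overlap       : number of genes shared by the two lists
    """
    upper = min(gene_set_size, query_size)
    total = comb(universe_size, query_size)
    tail = sum(
        comb(gene_set_size, k)
        * comb(universe_size - gene_set_size, query_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 10 genes in total, a gene set of 4, a query of 3,
# of which 2 fall in the gene set.
p = overlap_pvalue(10, 4, 3, 2)
```

When many gene sets are tested, the resulting P values would normally be corrected for multiple testing.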

GE-Based Risk Score in CLL.
Using the Maxstat R function and Benjamini-Hochberg multiple testing correction [18], 22 probe sets were found to have prognostic value for OS (adjusted P value < 0.05) in two independent cohorts of patients with previously untreated CLL (GSE22762, n = 107 and n = 44 [15]) (Table 1). These 22 probe sets represented 20 unique genes and were used to build a GE-based risk score as reported [20]. Figures 1(a) and 1(b) show expression of the 22 prognostic probe sets and the GE-based risk score in patients' tumor samples of the training cohort (ranked according to increasing GE-based risk score). When used as a continuous variable, the GE-based risk score had prognostic value in the two cohorts of patients with CLL (P ≤ 10⁻⁴, data not shown). Patients of the training cohort (n = 107) were ranked according to increasing prognostic score and, for a given score value X, the difference in survival between patients with a GE-based risk score ≤X or >X was computed. A maximum difference in overall survival (OS) was obtained with X = −32.3, splitting patients into a high-risk group (19.6% of patients, GE-based risk score > −32.3) with a median OS of 13.4 months and a low-risk group (80.4% of patients, GE-based risk score ≤ −32.3) whose median survival was not reached (Figure 2(a)). The prognostic value of the GE-based risk score was validated in an independent cohort of CLL patients (n = 44) (Figure 2(b)). Interestingly, a high GE-based risk score was associated with a shorter median time to treatment in two independent cohorts of CLL patients, that is, 2.1 months and 25.2 months for patients with GE score > −32.5 versus 47.7 and 78 months for patients with GE score ≤ −32.5 (P = 7.9E−9 and P = 0.01, resp.) (Figures 3(a) and 3(b)).
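The cutoff search described above (ranking patients by score and, for each candidate value X, comparing survival of patients with a score ≤X or >X) can be sketched in Python as a scan that maximizes a two-group log-rank statistic. This is a simplified illustration and not the Maxstat implementation, which relies on maximally selected rank statistics with an adjusted P value. The function names, the `min_group` parameter, and the toy data are hypothetical.

```python
def logrank_chi2(times, events, in_high):
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    observed = expected = variance = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        n = sum(1 for u in times if u >= t)  # patients at risk
        n_high = sum(1 for u, h in zip(times, in_high) if u >= t and h)
        d = sum(1 for u, e in zip(times, events) if u == t and e)
        d_high = sum(
            1 for u, e, h in zip(times, events, in_high) if u == t and e and h
        )
        observed += d_high
        expected += d * n_high / n
        if n > 1:
            frac = n_high / n
            variance += d * frac * (1 - frac) * (n - d) / (n - 1)
    return (observed - expected) ** 2 / variance if variance > 0 else 0.0


def best_cutoff(scores, times, events, min_group=2):
    """Scan cutoffs X; return (X, statistic) maximizing the log-rank statistic."""
    best = (None, 0.0)
    for x in sorted(set(scores))[:-1]:
        in_high = [s > x for s in scores]
        n_high = sum(in_high)
        # Skip splits that leave one group too small to compare.
        if min(n_high, len(scores) - n_high) < min_group:
            continue
        stat = logrank_chi2(times, events, in_high)
        if stat > best[1]:
            best = (x, stat)
    return best


# Toy data (hypothetical): six patients with scores, follow-up in months,
# and death indicators.
scores = [-50, -45, -40, -35, -30, -25]
times = [60, 55, 50, 48, 5, 3]
events = [0, 0, 0, 0, 1, 1]
cutoff, statistic = best_cutoff(scores, times, events)
```

Because the cutoff is chosen to maximize the statistic, the unadjusted P value at the selected cutoff is optimistic, which is why validation in an independent cohort matters.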
To investigate the prognostic value of the GE-based risk score with regard to time to first treatment in CLL patients with a good prognosis, the analysis was performed in patients without Del17p, Del11q, or trisomy 12, which are known to be associated with a poor prognosis [21]. A high GE-based risk score was associated with a shorter time to treatment in patients with a cytogenetically defined good prognosis (4.7 months for patients with GE score > −32.5 versus 65.4 months for patients with GE score ≤ −32.5, P = 1E−5) (Figure 3(c)). Cox analysis was used to determine whether the GE-based risk score provides additional prognostic information compared to previously identified gene expression-based prognostic markers such as ADAM29 (a disintegrin and metalloprotease domain 29), AKAP12 (a kinase (PRKA) anchor protein 12), DMD, LPL, NRIP1 (Nuclear receptor-interacting protein 1), SEPT10 (Septin 10), SPG20 (Spastic paraplegia 20), TCF7, TCL1A, TPM1 (Tropomyosin 1), and ZAP70 gene expression, Herold's GEP-based prognostic score (PS8), and Del17p (Table 2) [22-27]. None of these genes were included in the current 20 prognostic genes. Using univariate analyses, the GE-based risk score, ADAM29, AKAP12, DMD, LPL, NRIP1, SEPT10, SPG20, TCF7, TCL1A, TPM1, and ZAP70 gene expression, PS8, and Del17p were prognostic (P < 0.05, Table 2(a)). When compared two by two, the GE-based risk score remained significant when tested with NRIP1, SPG20, TCF7, or TPM1 expression, PS8, or Del17p (P < 0.01, Table 2(b)). When all parameters were tested together, only the GE-based risk score and NRIP1 and TCF7 gene expression kept prognostic value (Table 2(c)).

[Figure 1. Axes: genes versus CLL patients (ordered by increasing GE-based risk score); expression scale from low to high.]

Combining Prognostic Information of GE-Based Risk Score and NRIP1 and TCF7 Expression into a Single Staging.
Since the GE-based risk score and NRIP1 and TCF7 expression displayed independent prognostic information, the prognostic information of the GE-based risk score was combined with that of TCF7 and NRIP1 gene expression into a single staging. Kaplan-Meier analysis with the 5 patient groups of the training cohort was performed (Figure 4(a)). When 2 consecutive groups showed no prognostic difference, they were merged, resulting in a single staging splitting patients into a Group I comprising 72.9% of patients with low GE-based risk score/high TCF7 or NRIP1 expression and low GE-based risk score/high TCF7 and NRIP1 expression, a Group II comprising 11.2% of patients with low GE-based risk score/low TCF7/low NRIP1 expression and high GE-based risk score/high TCF7 and/or high NRIP1 expression, and a Group III comprising 15.9% of patients with high GE-based risk score/low TCF7/low NRIP1 expression (Figure 4(b)). Median OS was not reached in Group I patients, whereas patients of Groups II and III had a median OS of 46.2 months and 10 months, respectively (Figure 4(b)).
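The merged three-group staging described above amounts to a small decision rule, sketched here in Python. The score cutoff of −32.3 is the value reported for overall survival in the training cohort; the cutoffs used to call TCF7 and NRIP1 expression high or low are not restated here, so the sketch takes these calls as precomputed booleans. The function and parameter names are hypothetical.

```python
SCORE_CUTOFF = -32.3  # GE-based risk score cutoff from the training cohort


def combined_stage(ge_score, tcf7_high, nrip1_high, cutoff=SCORE_CUTOFF):
    """Assign a patient to Group "I", "II", or "III".

    ge_score   : GE-based risk score of the patient
    tcf7_high  : True if TCF7 expression is classified as high
    nrip1_high : True if NRIP1 expression is classified as high
    """
    high_score = ge_score > cutoff
    any_gene_high = tcf7_high or nrip1_high
    if not high_score:
        # Low score: Group I if at least one gene is high, otherwise Group II.
        return "I" if any_gene_high else "II"
    # High score: Group II if at least one gene is high, otherwise Group III.
    return "II" if any_gene_high else "III"


# Hypothetical patients.
examples = [
    combined_stage(-40.0, True, False),
    combined_stage(-40.0, False, False),
    combined_stage(-20.0, True, True),
    combined_stage(-20.0, False, False),
]
```

Group II therefore pools two of the five initial combinations, which reflects the merging of consecutive groups with no prognostic difference.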

Discussion
Following the introduction of microarray methodology in haematological malignancies research, many studies have investigated the prediction of reliable prognostic patient subtypes on the basis of their specific gene expression signatures [20,29,30]. CLL, although initially reported as an indolent malignancy, is characterized by a highly heterogeneous clinical course, with many patients eventually progressing and requiring therapy [31]. Several large-scale gene expression-based profiling analyses in this malignancy have led to the identification of prognostic factors [11,13,22] and the development of prognostic signatures for patients' risk stratification [12,15]. We report here a new GE-based risk score in CLL specimens based on the expression levels of 20 genes documented by 22 probe sets, splitting patients of two independent cohorts into 2 risk categories. None of the 20 genes constituting the GE-based risk score overlap with the previously published prognostic gene signatures for patients' risk stratification [12,15]. Interestingly, when compared using multivariate analysis, only the current GE-based risk score and NRIP1 and TCF7 expression kept prognostic value. The NRIP1 gene, also known as RIP140, is a nuclear receptor coregulator with an important role in energy homeostasis and a potential involvement in breast cancer [23,32]. Several reports indicate that NRIP1 could either inhibit target gene transcription or act as a transcriptional activator. NRIP1 has been recently described as a novel cell-cycle regulated gene whose expression is directly controlled by E2F transcription factors and increases through their binding to the promoter region [33]. Few studies have analyzed the deregulation of this gene's expression in haematological diseases: NRIP1 has been found to be significantly upregulated in acute myeloid leukemia with complex karyotypes and abnormal chromosome 21 [34]. In CLL, NRIP1 was shown to be differentially expressed with regard to IgVH mutational status [22,35].
TCF7 is a member of a family of HMG box-containing factors that are known to associate with β-catenin in the nucleus to mediate Wnt signaling [36]. The canonical Wnt/β-catenin signaling pathway has been shown to play a role in the control of the proliferation, survival, and differentiation of hematopoietic cells [37]. Recent gene expression analyses showed that several members of the Wnt family are overexpressed in CLL cells when compared to their normal counterparts from healthy donors, and this uncontrolled Wnt signaling may contribute to the defect in apoptosis that characterizes this malignancy [38]. The involvement of this pathway in the pathogenesis of several cancers, such as colorectal cancer and melanoma, has also been reported [39,40]. However, there is a significant body of evidence showing that Wnt proteins can function as growth factors for progenitor cells of the B-cell lineage. Indeed, by analyzing the B-cell compartment using LEF1-deficient mice, Reya and colleagues showed a marked reduction of B220+ cells in the fetal liver and perinatal bone marrow caused by both increased apoptosis and decreased proliferation [24]. In the same way, abnormal B-cell development has been observed in mice knocked out for the Wnt receptor Frizzled 9 [25]. In the present study, low expression of TCF7 with a high GEP risk score was correlated with poor survival. Mice deficient in the TCF7 gene develop intestinal and mammary adenomas, suggesting a role for TCF7 as a tumor suppressor [26]. Furthermore, TCF7 has also been reported to be expressed in hematopoietic stem cells, and its loss diminishes hematopoietic stem/progenitor cell function [27]. These data suggest that the role of Wnt in B-cell malignancies is controversial, as it may have potential oncogenic as well as tumor suppressor functions. Moreover, Kienle et al. tested the ability of the TCF7 gene to predict the genetic risk in CLL patients, defined by IgVH status, V3-21 usage, 11q-, and 17p-. TCF7 expression provided a high rate of correct assignment of patients at genetic risk [13]. The prognostic impact of our GE-based score associated with NRIP1 and TCF7 gene expression should be tested in the context of IgVH mutational status, ZAP70 protein expression, and TP53 mutational status.
Among the 20 genes we identified, overexpression of ERCC1 correlated with a very poor prognosis (HR = 15.0143 and 15.6883 for the 203720_s_at and 203719_s_at probe sets, resp., Table 1). It has been known for many years that treatment of CLL patients with alkylating agents is associated with low rates of complete remission and no improvement in OS [41]. The ability of CLL cells to efficiently repair alkylator-induced DNA damage through DNA repair genes might explain this lack of response. Indeed, ERCC1 forms with Xpf/ERCC4 an endonuclease complex that is involved in nucleotide excision repair (NER) and in the repair of drug-induced crosslinks between two complementary strands of DNA, known as interstrand crosslinking (ICL) [42]. For instance, there is evidence that increased expression of ERCC1 in CLL lymphocytes explains the development of resistance to DNA crosslinking agents, for example, nitrogen mustards [43]. In addition, Clingen et al. demonstrated that sensitivity to SJG-136, a highly efficient ICL agent that reacts with guanine bases in a 5′-GATC-3′ sequence in the DNA minor groove, was dependent to some extent on ERCC1 expression in CHO cells [44]. Fludarabine could enhance the DNA ICL capacity of SJG-136 in primary human CLL cells and thereby offer a rationale for its clinical use in combination with SJG-136 [45]. Furthermore, F11782, a novel dual catalytic inhibitor of topoisomerases I and II known to be a potent inhibitor of NER, could be of therapeutic interest in the GE-based high-risk group of CLL patients [46]. More recently, a function of PARP in NER DNA repair was demonstrated, and clinical-grade PARP inhibitors in association with chemotherapy could reverse the resistance of CLL cells to DNA crosslinking agents [47].

Figure 4: Combination of the prognostic information of GE-based risk score and NRIP1 and TCF7 gene expression. (a) Kaplan-Meier analyses were performed to combine the prognostic information of GE-based risk score and NRIP1 and TCF7 gene expression. Patients were scored from 1 to 5 according to GE-based risk score in TCF7 high or low and NRIP1 high or low groups. (b) After merging consecutive groups with no prognostic difference, 3 patient groups with different overall survival (OS) were obtained: I, II, and III (patients of the training cohort, n = 107). Legend: low GEP risk score/NRIP1 low/TCF7 low; low GEP risk score/NRIP1 high or TCF7 high; low GEP risk score/NRIP1 high and TCF7 high; high GEP risk score/NRIP1 high and/or TCF7 high; high GEP risk score/NRIP1 low/TCF7 low.
Interestingly, GSEA analysis highlighted, in tumor cells of patients within the high-risk GE-based score group, a significant enrichment of genes downregulated in CLL patients with mutated IgVH chain and of genes upregulated in CLL patients expressing high levels of lipoprotein lipase (Supplementary Figure S2 and Supplementary Tables S1 and S2), in particular the already known poor-prognosis factors LPL, DMD, AKAP12, and SEPT10 (Supplementary Table S2) [31]. Interestingly, enrichment for IRF4 gene expression was identified in the GE-based high-risk group. The t(1;6)(p35.3;p25.2) translocation, exclusively found in unmutated CLL, is associated with the involvement of the IRF4 (Interferon regulatory factor 4) gene. This translocation is observed with high-risk chromosomal aberrations including deletions of 11q and 17p and appears to be associated with an aggressive clinical course [48]. In CLL tumors with a low GE-based risk score, GSEA analysis highlighted an enrichment of genes involved in chemokine signaling pathways (Supplementary Figure S3 and Tables S3 and S4). Of interest, we identified an enrichment of genes involved in the CXCR4 signaling pathway or in the interactions between the CLL tumor cells and their microenvironment (CCL3, CCL4, and CD49d) (Supplementary Table S3). CLL cells express high levels of functional CXCR4, and signaling through this receptor reduces spontaneous and drug-induced apoptosis and also facilitates CLL cell migration beneath stromal cells [49,50]. More recently, it was demonstrated that the tyrosine kinase inhibitor Dasatinib inhibits CXCR4 signaling in CLL cells and impairs their migration in response to chemokine stimulation [51]. Dasatinib may constitute a potential therapeutic approach in these subgroups of CLL patients. Activated CLL cells secrete CCL3 and CCL4 for the recruitment of immune cells (T cells and monocytes) for cognate interactions. CD49d integrin (VLA-4), expressed on CLL cells, cooperates with chemokine receptors in establishing cell-to-cell adhesion with stromal cells [52]. These data suggest that tumor CLL cells of the GE-based low-risk subgroup are more dependent on interactions with their microenvironment to support their survival and proliferation.

Conclusion
Given the heterogeneity of CLL patients, the current GE-based risk score combined with NRIP1 and TCF7 expression could help in identifying high-risk patients who may benefit from intensive therapeutic strategies and new targeted treatments.