A Novel Prognostic Model for Acute Myeloid Leukemia Based on Gene Set Variation Analysis

Acute myeloid leukemia (AML) is a malignant hematological malignancy with a poor prognosis. Risk stratification of patients with AML is mainly based on the characteristics of cytogenetics and molecular genetics; however, patients with favorable genetics may have a poor prognosis. Here, we focused on the activity changes of immunologic and hallmark gene sets in the AML population. Based on the enrichment score of gene sets by gene set variation analysis (GSVA), we identified three AML subtypes by the nonnegative matrix factorization (NMF) algorithm in the TCGA cohort. AML patients in subgroup 1 had worse overall survival (OS) than subgroups 2 and 3 (P < 0.001). The median overall survival (mOS) of subgroups 1–3 was 0.4, 2.2, and 1.7 years, respectively. Clinical characteristics, including age and FAB classification, were significantly different among each subgroup. Using the least absolute shrinkage and selection operator (LASSO) regression method, we discovered three prognostic gene sets and established the final prognostic model based on them. Patients in the high-risk group had significantly shorter OS than those in the low-risk group in the TCGA cohort (P < 0.001) with mOS of 2.2 and 0.7 years in the low- and high-risk groups, respectively. The results were further validated in the GSE146173 and GSE12417 cohorts. We further identified the key genes of prognostic gene sets using a protein-protein interaction network. In conclusion, the study established and validated a novel prognostic model for risk stratification in AML, which provides a new perspective for accurate prognosis assessment.


Introduction
Acute myeloid leukemia is a heterogeneous myeloid neoplasm, characterized by uncontrolled proliferation and impaired diferentiation of myeloid blasts. Despite the development of novel antileukemic drugs, the prognosis of AML has remained poor, with a 5-year overall survival rate of less than 30% [1]. Tus, it is of great importance to identify prognostic biomarkers to predict prognosis and better understand the pathogenesis of AML. Te European Leuke-miaNet (ELN) recommended that patients should be stratifed according to their cytogenetics and molecular genetics, which are the main basis of risk stratifcation for AML, and treated with chemotherapy or allogeneic stem cell transplantation, respectively [2]. However, patients with favorable genetics may have a poor prognosis [3], which necessitates new methods to improve the prognosis assessment system.
Previous studies on prognosis assessment in AML have mainly focused on a single gene or a group of genes related to specifc biological processes [4][5][6], instead of comprehensively integrating multiple pathways in AML expression gene sets. For example, autophagy-related signatures [7], ferroptosis-related genes [8], and immune-related signatures [9] were reported to be related to the prognosis of AML, respectively. Here, we applied gene set variation analysis (GSVA) to detect subtle changes in the pathway within an entire gene expression set [10], and then the patients were classifed according to the enrichment score (ES) of gene sets in an unsupervised manner.
To our knowledge, subtypes based on the activity changes of gene sets in AML have not been determined. In the present study, we identifed three subtypes of AML based on the activity changes of hallmark and immunologic gene sets in AML and established a prognostic model using TCGA gene expression data. Subsequently, the prognostic model was further validated in GEO cohorts. Our study provides a novel perspective for the prognostic model of AML by identifying the activity changes of gene sets in the AML population instead of focusing on changes in a single gene.

Data Collection and Processing.
One hundred and twenty-fve patients were used as the training cohort (patients with AML M3 were excluded), and their fragments per kilobase per million mapped reads (FPKM)-normalized RNA-Seq data and clinical information were downloaded from the TCGA database via the GDC data portal (https:// portal.gdc.cancer.gov/repository). Two gene expression profles of AML, GSE146173 (n � 246) and GSE12417 (n � 242), and the corresponding probe annotation platform fles (GPL18460 and GPL96) were downloaded from the GEO (https://www.ncbi.nlm.nih.gov/geo/). Data normalization and log2 transformations were performed on all gene expression datasets.

Gene Set Variation Analysis.
A total of 4922 Immune-SigDB gene subsets of C7 and hallmark gene sets were downloaded from MSigDB (https://www.gsea-msigdb.org/ gsea/msigdb/index.jsp) for single sample gene set variation analysis (ssGSVA) with the GSVA R package. We used GSVA to evaluate the changes in pathway activity over the TCGA cohort and ES to reveal the activation degree [10]. Accordingly, the gene expression matrix from TCGA and GEO cohorts was transformed into the pathway ES matrix, which was uploaded as Supplementary Tables 1 and 2.

Nonnegative Matrix Factorization (NMF) Algorithm and Supervised Hierarchical
Clustering. In the training cohort, Cox regression analysis was used to flter the gene sets related to prognosis (R package CancerSubtypes) with P < 0.001. Subsequently, patients were divided into three types (C1-3) by NMF based on the ESs. According to the clustering results of the training set (TCGA), the 100 most signifcantly upregulated gene sets in each cluster (if less than 100, all gene sets were selected) were selected as subgroup features. Accordingly, patients from the GSE146173 dataset were used as an independent validation cohort for supervised hierarchical clustering. Te relevant fles generated during the calculation were uploaded as Supplementary Tables 3-5.

Prognostic Model Construction and Validation.
Diferential gene sets of each two cluster were analyzed with R package limma, fltered by P < 0.05 and |logFC| > 0.02, and intersected subsequently. A univariate Cox regression analysis was used to screen gene sets associated with survival, fltered by P < 0.05 (R package survival). Te least absolute shrinkage and selection operator (LASSO) regression method was performed to further screen prognostic gene sets and obtain their corresponding coefcients (R package glmnet). Te "maxit" was set to 1000. Te fnal risk score was calculated as follows: risk score = n i Coef i × X i . "Coef" represents the regression coefcient of each gene set, "X" represents the ES of gene sets in the prognostic model, "n" represents the total number of gene sets in the prognostic model, and "i" represents the gene set that comprises the model. Te patients were classifed as high-and low-risk groups according to the median risk score. Te relevant fles generated during the calculation were uploaded as Supplementary Tables 6-11.

Statistical and Bioinformatic
Analyses. Data analysis and graphical visualization were performed using R software (version 4.0.3, https://www.r-project.org) and GraphPad Prism (version 8.0). Heatmaps were used to show the ESs of gene sets of patients in various clusters and risk classifcations. Te hub genes were identifed by constructing a protein-protein interaction (PPI) network in the STRING database.
Te violin map is drawn by GraphPad Prism to show the gene expression diferences of diferent groups. Kaplan-Meier survival curves and the log-rank test were used to perform survival analysis of patients in diferent groups (R package survival and survminer). Te chi-square test and Fisher's exact test were used to analyze the diferences in clinical features of patients in diferent groups. P < 0.05 was considered statistically signifcant, and all tests were twotailed.

Identifcation of Subtypes of AML Using Immunologic and
Hallmark Gene Sets. To investigate the diference in the activity of immunologic and hallmark gene sets in AML patients, the hallmark and C7 gene sets were downloaded from the MsigDB database. Te study fowchart is illustrated in Figure 1. Hallmark gene sets are coherently expressed signatures that represent well-defned biological states or processes. C7 gene sets are immunologic signature gene sets, defned by microarray gene expression data from immunologic research.
Firstly, we obtained ES of immunologic and hallmark gene sets in the TCGA cohort (n � 125) with the GSVA method. Cox regression analysis was used for feature fltering through the R package CancerSubtype. Te optimal number of clusters (K) was generated (K � 3, Figures 2(a) and 2(b)) using the factoextra package. AML patients are divided into three diferent subtypes (Figure 2(c)) by the NMF method, which is an efective dimension reduction method for cancer subtype identifcation. Te silhouette width value was 0.92 in the silhouette width plots (Figure 2(d)), suggesting a fne match between an AML sample and its identifed subtype. Te patients with AML in subgroup 1 had worse overall survival than subgroups 2 and 3 (P < 0.001, Figure 2(e)), whose immunity was shown as a state of overall activation by heatmap ( Figure 3). Te median overall survival (mOS) of subgroups 1-3 was 0.4, 2.2, and 1.7 years, respectively. Moreover, age classifcation was signifcantly diferent among each subgroup (P � 0.0012, Figure 3 and Table 1). In addition, the number of patients with M5 and M1 FAB classifcations (37.5% and 29.17%, P � 0.0044) was found to be more abundant in subtype 1 than in the other two subtypes.

Identifcation and Validation of Distinct Gene Sets of Tree
Subgroups. According to the clustering results (Subtype1-3) of the training set (TCGA), the 100 most upregulated gene sets in each cluster (all gene sets were selected if they were less than 100) were used to conduct supervised hierarchical clustering in another gene expression profling data (GSE146173) (the ES of immunologic and hallmark gene sets in the GEO cohort was displayed in Supplementary Table 4). As shown in Figure 4(a), patients with subtypes 2 and 3 had a more favorable prognosis than subtype 1 (P � 0.014). Te heatmap displayed the ES of gene sets in each subtype in the GSE146173 cohort ( Figure 4(b)). Similar to the TCGA cohort, the heatmap showed that most of the immune pathways of subtype 1 in the GSE146173 cohort tended to be "hot," suggesting the activation state of overall immunity.

Construction and Validation of the Prognostic Model.
To identify the representative gene sets that distinguish the three subgroups, we obtained the diferential gene sets between each of the two subgroups and intersected them. One hundred and twenty-fve representative gene sets were obtained ( Figure 5(a) and Supplementary Table 6). As shown in Figure 5(b), ESs of 125 gene sets in the three subgroups were diferent. Univariate Cox regression was used to select the gene sets related to survival, and 62 gene sets remained for further analysis (Supplementary Table 7). Ten we investigated prognostic gene sets with TCGA data by the LASSO regression method. Te model with the minimum λ (0.06081) was selected as the best model, where the gene sets were reduced from 62 to 3 (Figures 6(a) and 6(b)). Tus, the gene sets of the fnal prognostic model and their corresponding coefcients were obtained (Supplementary Table 8). Te risk score is defned as follows: Te patients were divided into high-and low-risk groups, with the median score as the cutof value. Patients in the high-risk group had a poorer prognosis than those in the low-risk group in the TCGA cohort (P < 0.001, Figure 6(c)). Te mOS of low-and high-group was 2.2 and 0.7 years, respectively. Subsequently, the prognostic model was further validated in GSE146173 (P � 0.046, Figure 6(d)) and GSE12417 (P � 0.034, Figure 6(e)). Te mOS of the low- DEGs of high-and low-risk groups     Journal of Oncology GSE146173 cohort but not in the GSE12417 cohort. Te high ES of GSE36891_UNSTIM_VS_PAM_TLR2_STIM_PER-ITONEAL_MACROPHAGE_DN was only associated with poor prognosis in the TCGA cohort ( Supplementary  Figures 1(a)-1(c)).

Exploration of Key Genes of Prognostic Gene Sets.
Te PPI networks of GSE35543_IN_VITRO_ITREG_VS_CON VERTED_EX_ITREG_DN (Figure 7(a)) and HALL-MARK_CHOLESTEROL_HOMEOSTASIS (Figure 7(b)) were built through the String website, and the results were imported into Cytoscape for further analysis. Te top 50 genes were selected based on the ranking of gene-connecting nodes (Figures 7(c) and 7(d) and Supplementary Table 15). To further identify the key genes that have an infuence on the prognosis, we intersected the diferentially expressed genes (DEGs) between the high-and low-risk groups in the TCGA cohort (P < 0.05 and |log FC| > 1, Supplementary  Table 16) with the 50 candidate genes from gene sets GSE35543_IN_VITRO_ITREG_VS_CONVERTED_EX_I-TREG_DN (Figure 7(e)) and HALLMARK_CHOLESTER-OL_HOMEOSTASIS (Figure 7(f )     Journal of Oncology RTED_EX_ITREG_DN, three intersection genes were obtained, including TIMP1, ZFP36, and LGALS3. For gene set HALLMARK_CHOLESTEROL_HOMEOSTASIS, four intersection genes were obtained, which were LDLR, LGALS3, S100A11, and ANXA5. Amongst these genes, TIMP1 and LDLR have the largest number of nodes in the two gene sets. As shown in Figures 7(g)-7(j), TIMP1 and LDLR were signifcantly increased in the high-risk groups in the two GEO cohorts (P < 0.0001), which further confrms the ability of the prognostic model to distinguish patients between diferent risks and provides clues for further research.

Discussion
AML is a malignant myeloid neoplasm with a signifcant amount of heterogeneity in tumor biology and poor clinical outcomes. Traditional analysis strategies have focused on comparing the diferential gene expression between two groups of interested populations. However, some pathways may be signifcantly regulated without a signifcant change in single gene expression. To meet this challenge, we applied GSVA to display the activity changes of each pathway in immunologic and hallmark gene sets in AML samples. According to the ESs of immunologic and hallmark gene sets in TCGA samples, AML patients were clustered into three subtypes with the NMF algorithm. Patients with subtypes 2 and 3 had a more favorable prognosis than patients with subtype 1, which was validated in the GSE146173 cohort. Subsequently, one hundred and twenty-fve diferential gene sets were further screened using univariate Cox regression and LASSO regression methods, and 3 prognostic gene sets were identifed. Based on these three gene sets, we constructed the fnal prognosis model and divided the patients into high-risk and low-risk groups in TCGA, GSE146173, and GSE12417 cohorts. Our results identifed three prognostic gene sets, and they were related to macrophages, Treg, and cholesterol homeostasis, respectively. In the past decade, increasing evidence revealed the role of macrophages in the etiology and progression of both solid tumors and blood malignancies. Macrophages in solid tumor tissue and the bone marrow microenvironment are referred to as tumorassociated macrophages (TAMs) and leukemiaassociated macrophages (LAMs), respectively. Based on their functional phenotypes, macrophages are divided into two phenotypes, M1 and M2. In brief, M1 macrophages have antitumor efects, whereas M2 is believed to promote tumor progression [11]. Yang et al. revealed that the acquisition of M1 characteristics in leukemiaassociated macrophages (LAMs) through activation of the IRF7-SAPK/JNK pathway was associated with prolonged survival in mice. Besides, the high-level expression of CD163, which is the typical marker of M2 macrophages, was associated with worse overall survival in AML patients [12].
Previous studies have shown that the cholesterol synthesis of AML cells increased signifcantly after being exposed to chemotherapy [13]. Inhibiting cholesterol synthesis can kill AML cells and sensitize them to chemotherapeutic drugs [14,15], suggesting that AML cells require higher levels of cholesterol to maintain survival compared with normal cells and that the imbalance of cholesterol homeostasis may lead to treatment failure. A phase 2 study of pravastatin in combination with idarubicin and cytarabine reported an encouraging response rate for relapsed AML [16] and improved overall survival for patients in the lowrisk group [17]. Cholesterol homeostasis plays an important role in the survival of AML cells and eventually afects the prognosis of patients with AML.
Regulatory T cells (Tregs) are considered to play a key role in immune suppression and angiogenesis via secreting immunosuppressive cytokines and molecules, performing cytolytic functions, disrupting metabolism, and attenuating the capacity of dendritic cells (DCs) [18]. Patients with AML had a higher frequency of Tregs in peripheral blood and bone marrow than healthy participants [19,20], while patients    who achieved complete remission had a comparable frequency of Tregs compared with healthy controls. Wan et al. revealed that Tregs lead to immune escape by inhibiting the function of CD4+CD25− T cells [19]. Tese results indicate that Tregs are involved in the tumorigenesis of AML.
Amongst three prognostic gene sets, high ES of HALLMARK_CHOLESTEROL_HOMEOSTASIS and GSE35543_ IN_ VITRO_ ITREG_ VS_ CONVERTED_ EX_ ITREG_ DN was correlated with short overall survival in at least two cohorts. We further investigated the key genes in these gene sets. Te key genes of GSE35543_ IN_ VITRO_ ITREG_ VS_ CONVERTED_ EX_ ITREG_ DN included TIMP1, ZFP36, and LGALS3. Te key genes of HALL-MARK_CHOLESTEROL_HOMEOSTASIS included LDLR, LGALS3, S100A11, and ANXA5. Amongst these genes, TIMP1 and LDLR have the largest number of nodes in the PPI network of these two gene sets. TIMP1 and LDLR have been reported to be associated with the prognosis of AML [21][22][23].
Tere were several limitations in this study. First, our data were acquired from the TCGA database, which lacked important clinical information, including treatment regimens and event-free survival, limiting further investigation of the prognostic model. Second, the prognostic model was conducted using publicly accessible datasets and requires multicenter clinical samples to assess its prognostic value.

Conclusion
Briefy, the study provided a full view of activity changes in hallmark and immunologic gene sets and established a novel prognostic model for risk stratifcation in AML, which provides a new perspective for optimizing prognostic assessment strategies in the future.