A New Stemness-Related Prognostic Model for Predicting the Prognosis in Pancreatic Ductal Adenocarcinoma

Objective This study is aimed at identifying stemness-related genes in pancreatic ductal adenocarcinoma (PDAC). Methods The RNA-seq data of PADC patients were downloaded from The Cancer Genome Atlas (TCGA) and Genotype-Tissue Expression (GTEx) databases. The mRNA expression-based stemness index (mRNAsi) and epigenetically regulated mRNAsi (EREG-mRNAsi) of PADC patients were evaluated. The mRNAsi-related gene sets in PADC were identified by weighted gene coexpression network analysis (WGCNA). The key genes were further analyzed using functional enrichment analysis. The Kaplan-Meier survival analysis and the Cox proportional hazards model were used to evaluate the prognostic value of the key genes. Prognostic hub genes were used to establish nomograms. The receiver operating characteristic (ROC) curves, concordance index (C-index), and calibration curves were used to assess the discrimination and accuracy of the nomogram. Finally, these results were validated in the Gene Expression Omnibus (GEO) database. Results A total of 36 key genes related to mRNAsi were identified by WGCNA. A prognostic gene signature compromising seven genes (TPX2, ZWINT, UBE2C, CCNB2, CDK1, BUB1, and BIRC5) was established to predict the overall survival (OS) of PADC patients. The Cox regression analysis revealed that the risk score was an independent prognostic factor for PADC. Patients were then divided into the high-risk and low-risk groups. The ROC curves, C-index, and calibration curves indicated good performance of the prognostic signature in the TCGA and GEO datasets. Moreover, the nomogram incorporating clinical parameters showed better sensitivity and specificity for predicting the OS of PADC patients. Conclusion The stemness-related prognostic model successfully predicted the OS of PADC patients and could be used for the treatment of PADC.


Introduction
Pancreatic ductal adenocarcinoma (PDAC) is the most prevalent type of pancreatic neoplasm, accounting for more than 90% of the total number of pancreatic tumors. The 5-year survival rate of PDAC is less than 10%. Moreover, PDAC is expected to become the second leading cause of cancerrelated death by 2030 [1,2]. Surgical resection represents the only chance for cure and the advances in adjuvant che-motherapy have improved the long-term outcomes of PDAC patients [3]. Early diagnosis and effective intervention are the major factors for favorable outcomes in PADC. However, current treatments often cause trauma, which impairs the quality of life of patients. Molecular targeted therapy, which inhibits cancer growth, progression, and metastasis by targeting specific molecular biomarkers, has emerged as a promising treatment strategy with better efficacy and fewer trauma-related complications [4]. However, the identification of molecular biomarkers remains challenging due to insufficient understanding of the pathogenic mechanisms of PDAC.
Cancer stem cells (CSCs) have been shown to promote tumor recurrence, metastasis, and drug resistance owing to their self-renewal ability, proliferation capacity, and multilineage differentiation potential [5]. Emerging evidence has indicated that the stemness of CSCs is associated with high intratumoral heterogeneity in most types of cancers, including PADC [6,7]. Intratumoral heterogeneity, which includes phenotypic diversity (e.g., cell surface markers and epigenetic abnormality), has been reported to drive disease progression and cause treatment failure. These markers and gene mutation types are often used for pathological classification and clinical treatment of tumors [8]. Pancreatic CSCs, first described in 2007 [9], account for less than 1% of all pancreatic cancer cells [10]. They are responsible for the development, metastasis, and chemoresistance of PDAC. The activation of CSC-related biomarkers and signaling pathways, including CD133, CD24, CD44, MYC, WNT/βcatenin, and Notch, has been reported in PDAC [11]. Although previous studies have explored the unlimited self-renewal capacity of pancreatic CSCs and their roles in tumorigenesis and chemoresistance, investigations on the molecular mechanisms of pancreatic CSCs are still warranted. In this study, we aimed to uncover the heterogeneity of PADC from the perspective of the stemness features of CSCs.
The stemness properties of CSCs are mainly characterized by mRNA expression-based stemness index (mRNAsi), such as epigenetically regulated mRNAsi (EREG-mRNAsi) [5]. The mRNAsi score has been used to identify new CSC markers and to indicate carcinogenic dedifferentiation [12]. Previous studies have reported that the mRNAsi score is associated with the stemness features of CSCs and can be calculated by the one-class logistic regression (OCLR) machine learning algorithm, suggesting a significant correlation between mRNAsi and the prognosis of PADC [5]. However, genes related to the stemness features of CSCs have not been fully identified and the biological functions of these genes remain largely unknown. The analysis of differentially expressed genes (DEGs) has been used to identify key genes related to tumorigenesis, but it is unable to elaborate the connections between those genes. The gene network analysis is a tool used to investigate the complex process of tumorigenesis [8]. The weighted gene coexpression network analysis (WGCNA) has been used to explore coexpression modules associated with the clinical characteristics of cancer patients, including mRNAsi [13,14].
In the current study, WGCNA was used to identify stemness-related modules and key genes. The functions of these genes during the development of PADC were explored using Gene Ontology (GO) analysis, Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis, and Gene Set Enrichment Analysis (GSEA). A novel prognostic model comprising seven genes (i.e., TPX2, ZWINT, UBE2C, CCNB2, CDK1, BUB1, and BIRC5) was then constructed and validated using the TCGA and GEO databases. Finally, prognostic genes were used to establish nomograms. The new model and nomogram might be a powerful tool for the prediction of the prognosis of PADC patients.

Dataset Sources.
Patients who met the following criteria were selected from the TCGA PADC cohort: (1) patients with histologically confirmed primary PADC, (2) patients with RNA-seq data, and (3) patients with complete clinicopathological data, such as age, gender, TNM stage, grade, and overall survival (OS). The corresponding RNA-seq data and clinicopathological information of PADC patients were collected from the TCGA in the GDC database (https:// portal.gdc.cancer.gov/). The expression of genes in normal tissues was obtained from the GTEx database. All data were available on September 9 th , 2020. The RNA-seq data, including 169 normal samples and 142 tumor samples, were merged into a matrix file using a merge script in Perl language (https://www.perl.org/).

mRNAsi in PADC and Its Clinical
Significance. The mRNAsi used for assessing the degree of oncogenic dedifferentiation was obtained from previous studies [5]. The mRNAsi score of PADC samples was calculated using the PADC dataset in the TCGA database by OCLR. The gene expression-based stemness index ranges from low (zero) to high (one) stemness. Significant differences in mRNAsi between tumors and nontumors were determined by the Wilcoxon test. To evaluate the prognostic value of the mRNAsi score, an OS analysis according to the mRNAsi score was performed using the "survival" package in the R software (v3.6.1, https://CRAN.R-project.org/package= survival). The Kaplan-Meier (K-M) analysis of samples with high or low mRNAsi scores was carried out. The Wilcoxon test was then performed to investigate the correlation between the mRNAsi score and patient's age.

DEGs in PADC.
The Wilcoxon test was used to identify DEGs between PADC and normal tissue samples. A false discovery rate (FDR) of <0.05 and a |log2 fold change ðFCÞ | of ≥ 1 were set as the cut-off criteria for defining DEGs.

WGCNA Construction and Module
Preservation. An R package "WGCNA" was used to investigate the correlations among genes by constructing modules. The modules were obtained by clustering genes with similar expression patterns [14]. A total of 1750 DEGs with high precision and accuracy were selected for subsequent analysis. To achieve high-scale independence and mean connectivity, the soft-thresholding power was calculated using a gradient test ranging from 1 to 20 and determined by the pickSoftThreshold function. Then, the topological overlap matrix (TOM) was calculated based on the corresponding soft-thresholding power. The TOM was used to identify the modules of highly coexpressed genes and to make the networks less sensitive to spurious connections. Genes with high absolute correlations were further clustered into the same module. The modules were defined by cutting the clustering tree into branches using a dynamic tree cutting algorithm and then assigned to different colors for visualization. The module dendrograms were 2 BioMed Research International constructed using hierarchical clustering analysis based on TOM-based dissimilarity. To avoid oversplitting, correlated modules (r < 0:25) were merged, and each module was labeled with the corresponding color [15].

Identification of Modules Associated with Clinical
Characteristics. Module eigengenes (MEs) are considered the major components in the principal component analysis for each gene module. The expression patterns of all genes can be summarized as a single expression profile within a given module [16]. The gene significance (GS) was calculated to demonstrate the correlation between genes and clinical characteristics. The module significance (MS), referring to the average GS of all genes in the module, was used to determine the correlation between modules and the charac-teristics of samples. The module with a higher correlation with the MS was identified for further analysis.
2.6. Identification of Key Genes in the Red Module. The key genes in the coexpressed network were highly interconnected with the nodes in a module and were used to explore the biological functions of identified dysregulated genes [17]. The module membership (MM) was defined as the correlation of the gene expression profile with MEs. The key genes in the module were defined as cor. GS > 0:5 and cor. gene MM > 0:8. The R package "corrplot" was used to calculate Pearson's correlation coefficient among these genes. KEGG analyses were performed using the R package "clus-terProfiler" [18,19]. An FDR of ≤0.05 was considered statistically significant. A protein-protein interaction (PPI) network of the key genes in key modules was constructed using the STRING (https://www.string-db.org/) [20]. Hub nodes were identified when the combined score was ≥0.8, and the connectivity degree was ≥20.
2.8. Establishment of a Prognostic Model. The K-M survival analysis and the stratified Cox proportional hazards analysis were used to evaluate the prognostic value of the key genes. Then, the risk score of each prognostic gene was calculated based on the sum of Si (expression level of the key gene) and β (corresponding coefficient) generated from the Cox model. PADC patients were divided into the high-risk and low-risk subgroups according to the median risk score, and their OS was analyzed using the K-M analysis. According to the method proposed by Blanche et al. [21], the predictive accuracy of each prognostic signature was accessed by calculating Uno's inverse-probability of censoring weighting estimation of the time-dependent receiver-operator characteristic (ROC) area under the curve (AUC) values (time spanning from 6 to 24 months) using the "timeROC" package (version 0.3). Finally, the expression of the signature genes was visualized in the heatmap using the R package "pheatmap." 2.9. Prognostic Value of the Prognostic Model. The risk scores and clinicopathological parameters, including age, gender, TNM stage, grade, OS, postoperative radiotherapy (postopera-tive_tx_tx), radiotherapy, and alcohol consumption history, were included in the univariate and multivariate Cox regression analyses. Based on the results of the multivariate Cox regression analysis, a nomogram was established using the "rms" package  (version 5.1.2) [22]. The nanogram was used to predict the 6-, 12-, and 24-month OS of PADC patients in the TCGA dataset. Subsequently, a time-dependent ROC curve was plotted to assess the sensibility and specificity of the nomogram using the R package "timeROC." The AUC was also calculated. The predictive accuracy of the nomogram was assessed by the nomogram calibration curve and Harrell's C-index. The calibration curve was plotted to determine the consistency between the predicted and observed OS. The C-index was calculated using a bootstrap method with 1000 resamples to assess the discrimination ability of the nomogram.

Validation of the Prognostic Signature in the GEO
Database. To minimize bias caused by small sample size, the prognostic capacity of the model was validated using the GSE62452 dataset in the GEO database (http://www.ncbi.nlm .nih.gov/geo/). The optimal risk score cut-off value for the GEO dataset was the same as that for the primary TCGA   2.11. Gene Set Enrichment Analysis (GSEA). To identify the potential function of the prognostic model, GSEA (https:// www.gsea-msigdb.org/gsea/index.jsp) was performed using the GSEA software (v4.0.3). GSEA determines whether a priori defined set of genes shows statistically significant, concordant differences between two biological states [23]. Based on the molecular signature database (v. 6.2), C2 (curated gene sets), C5 (GO gene sets), and C6 (oncogenic signatures) were analyzed to identify KEGG pathways, biological processes (BP), cellular components (CC), molecular functions (MF), and dysregulated oncogenic signatures. The samples in TCGA were divided into two groups ( the high-risk score group vs. low-risk score group) according to the median expression value of each gene. An enriched gene set with a nominal p value of <0.05, a |NES| of >1, and a FDR q value of <0.25 was considered statistically significant.
2.12. Statistical Analysis. Statistical analysis was performed using the R software (v3.6.1). In the K-M survival analysis, p value and hazard ratio (HR) with 95% confidence interval (CI) were generated by the log-rank test and univariate Cox proportional hazards regression analysis. Pearson's Chisquare test was performed to assess the significance between groups. The results of multivariate Cox regression analysis were visualized by the nomogram. C-index, time-dependent ROC curves, and calibration curves were used to evaluate the nomogram. A p value of <0.05 was considered statistically significant.  BioMed Research International

Data Processing and Survival
Analysis. CSC-related RNA-seq data were obtained from the TCGA PADC dataset, including 142 PADC samples. The data of 169 normal samples were collected from the TCGA and GTEx databases. The gene expression-based stemness index for PADC was extracted by OCLR [5]. The mRNAsi indicates the degree of similarity between tumor cells and stem cells. The EGER-mRNAsi is an index that describes the epigenetic regulation of CSC-related genes. As shown in Figure 1(a), there was a significant difference in age between those younger than 65 and those older than 65 years. The survival analysis showed that the level of mRNAsi was significantly different from the survival time of PADC patients (Figure 1(b), adjusted p < 0:0017), further suggesting that the stemness features of CSCs were related to the survival outcomes of PADC patients.

Identification of DEGs in PADC.
A total of 1750 DEGs were identified using the Wilcoxon test, including 845 downregulated genes and 905 upregulated genes. The expression of these genes was visualized using the R package "beeswarm." The volcano plot and heatmap show the DEGs between PADC and normal tissues (Figure 1(c)).

Analysis of the Key Gene in the Red
Module. Genes in the same module exhibit common expression patterns. In the red module, 36 key genes out of 1750 DEGs were related to the mRNAsi. We further compared the DEGs with the minimum p value (p < 0:01) in PADC and normal tissues and found that all of them were upregulated in tumor tissues (Figure 3(a)). Meanwhile, we generated a heatmap using the R package

Functional Enrichment and PPI Network Analyses.
To further explore the biological functions of 36 key genes in PADC, GO enrichment and KEGG pathway analyses were performed. The functions of these genes were mainly enriched in 30 pathways, including BP (biological process), CC (cellular component), and MF (molecular function) (Figure 4(a)). BP was found to be relevant to mitotic nuclear division. CC was related to the spindle. MF includes microtubule binding. As shown in Figure 4(b), KEGG pathway analysis indicated that hub genes were enriched in the cell cycle signaling pathway. The signaling pathway with an FDR-corrected p value of <0.01 was considered significant. Furthermore, a PPI network of key genes was constructed to identify gene-gene interaction (Figure 4(c)).

Identification and Evaluation of the Stemness-Related
Prognostic Model. To evaluate the prognostic value of stemness-related genes, a univariate Cox regression analysis was performed. The results showed that 36 key genes were associated with the OS of PADC patients. Among them, 35 OS-related genes were identified as risk factors (Figure 5(a)). Next, a prognostic model comprising seven genes was developed based on the red module using the stepwise multivariate Cox regression analysis. The seven genes were TPX2, ZWINT, UBE2C, CCNB2, CDK1, BUB1, and BIRC5. The risk score was calculated based on the Cox coefficients of these genes: Then, patients were divided into two groups, the highrisk group (risk score ≥ 1:103) and the low-risk group (risk score < 1:103). The survival of all PADC patients and  Figure 5(b). The K-M curves revealed that the high-risk group had a significantly worse prognosis compared to the low-risk group (p < 0:0001) (Figure 6(a)). Furthermore, in the validation dataset, the 6-, 12-, and 24-month AUCs of the time-dependent ROC curves were 0.815, 0.767, and 0.744, respectively ( Figure 6(b)), suggesting high efficiency of the seven-gene prognostic signature in predicting the OS of PADC patients. The C-index of the risk score was 0.718 (95% CI; 0.148-1.289). The calibration curves of the nomogram for 6-, 12-, and 24-month survival probabilities are shown in Figure 6(c).

Construction and Validation of a Prognostic Nomogram.
A prognostic nomogram for predicting the OS of PADC patients was developed based on the clinical data of 311 patients from the TCGA database. The parameters of the nomogram included risk score, gender, tumor size, TNM stage, grade, postoperative_tx_tx, radiotherapy, and alcohol consumption history. The forest plots of the univariate and multivariate Cox regression analyses based on 8 clinicopathological characteristics were used to evaluate the independent prognostic value of the signature (Figures 7(a)  3.8. Validation of the Seven-Gene Prognostic Model. Next, we evaluated the predictive power of the prognostic gene signature in the PADC cohort from the GEO database (GSE62452). Similar procedures were carried out to assess the performance of the stemness index-associated signature. Patients were divided into the low-risk and high-risk groups according to their risk scores. Then, the OS of the two groups was compared. As shown in Figure 8(a), the high-risk cohort had a significantly poorer prognosis compared with the lowrisk cohort (p < 0:021). Also, the time-dependent ROC showed that the AUC for 24-month OS was 0.714 ( Figure 8(b)). The C-index for external validation set was 0.698 (95% CI, 0.625-0.770). The calibration curves of the nomogram predicting the 6-, 12-, and 24-month OS are shown in Figure 8(c). These results were consistent with those obtained in the TCGA database, suggesting the  3.9. Gene Set Enrichment Analysis (GSEA). To further explore the functions of the seven genes in PADC, GSEA was performed. The results showed that these genes were associated with cell cycle mitotic, homologous recombination, mitotic nuclear division, oocyte meiosis, DNA replication, and the p53 signaling pathway (Figure 9).

Discussion
PDAC is a heterogeneous malignancy with high morbidity and mortality. CSCs, a small proportion of cells within the tumor, possess unlimited proliferative potential and share similar properties with cancer cells. Previous studies have suggested that the regulatory mechanisms underlying the self-renewal of stem cells and tumor cells are similar. More-over, tumor cells may be derived from normal stem cells [24][25][26]. CSCs were first identified in a leukemia model, in which CD34 + CD38 − leukemic cells showed the characteristics of bone marrow hematopoietic stem cells [27,28]. Solid tumor CSCs (CD44 + CD24 −/low Lineage − cells) have been found in breast, ovarian, prostate, colon, pancreatic, liver, and lung cancers, suggesting that CD34 + and CD44 + are typical CSC markers [29,30]. In addition, CSCs in PDAC have been identified using cell surface markers, including CD133, CD44, CD24, and epithelial-specific antigen (ESA)/EpCAM [31]. It has been shown that the metastasis, chemoresistance, and relapse of PDAC are driven by CSCs. However, the molecular mechanisms responsible for the stemness of pancreatic CSCs remain unclear. A better understanding of the functions of these cells and the development of CSC inhibitors may contribute to tumor eradication [32][33][34].
The existence of CSCs in tumor tissues and the stem-like features of CSCs suggest that drugs targeting the stemness characteristics of CSCs may be used for cancer treatment. Tathiane et al. proposed the use of mRNAsi and EGER-mRNAsi to evaluate the stemness characteristics of CSCs [5]. Although the risk stratification of the stemness index has been investigated in pancancer populations, the comprehensive prognostic value of the stemness index for PADC is still unknown. Moreover, few studies have analyzed the genes related to the stemness characteristics of CSCs in PADC. Therefore, we aimed to develop a stemness-related prognostic signature for patients with PADC. In this study, the K-M curves showed that there was a significant difference in the OS between the groups with low and high mRNAsi (mRNAsi/tumor purity) scores. The stemness index-related modules and genes were identified using WGCNA. The red module showed the highest positive correlation with the mRNAsi. A total of 36 genes related to the stemness of CSCs were obtained from this module. The GO, KEGG, and GSEA analyses showed that these genes were enriched in mitotic nuclear division, nuclear chromosome segregation, spindle checkpoint function, microtubule binding, and cell cycle. Next, a model comprising seven key mRNAsi-related genes (i.e., TPX2, ZWINT, UBE2C, CCNB2, CDK1, BUB1, and BIRC5) was constructed using univariate and multivariate Cox regression analyses. We found that the survival model accurately predicted the prognosis of PADC patients in both the TCGA and GEO databases. The ROC curves showed that this model has high predictive power for predicting the OS of PADC patients. Furthermore, some prognostic parameters (i.e., tumor size, risk score, and radiotherapy) were significantly correlated with the OS of PADC patients. Taken together, these results indicated that these genes might be related to the occurrence and development of PADC through regulating cell cycle, which were consistent with the study by Bai et al. [35]. Prospective studies are needed to verify the prognostic value of the stemness index-related signature in PADC.
Previous studies have reported that the dysregulation of the seven key genes contributes to the development of tumors [36][37][38]. In addition, these hub genes are closely associated with cell cycle events, such as mitotic cyclin destruction and cell cycle progression, and may be involved in carcinogenesis. TPX2 encodes a microtubule-associated protein. The upregulation of TPX2 levels has been found to promote the proliferation and invasion of renal cancer cells [39]. ZWINT and CDK1, which correct erroneous centromere-microtubule attachment and regulate the mitotic spindle checkpoint, are mainly involved in cell cycle control in adrenocortical carcinoma [40]. The dysregulation of UBE2C is associated with the upregulation of Ki-67, a proliferative marker, and poor overall survival in colorectal carcinoma [41,42]. Aaron et al. showed that aberrant expression of CCNB2 was closely related to cell cycle-driven subpopulation in advanced prostate cancer [43]. The BIRC5 gene encodes survivin, an antiapoptotic protein that has been defined as a target in many cancers, including PDAC, and is overexpressed in PDAC [44,45]. Few studies have investigated the genes related to the stemness characteristics of CSCs in PADC. Here, we showed that the seven genes regulated cell division cycle in PADC, suggesting that they may contribute to the initiation, metastasis, and recurrence of PADC. These findings support the development of therapies targeting the seven genes for PADC treatment. Future studies on the molecular mechanisms of these genes and the development of tailored targeted therapies are warranted.

Conclusion
In summary, we established a stemness index-related signature and developed a prognostic nomogram in combination with prognosis-related clinicopathological characteristics. This model might be used to predict the OS of patients with PADC. Furthermore, TPX2, ZWINT, UBE2C, CCNB2, CDK1, BUB1, and BIRC5 may orchestrate the stemness, proliferation, and invasion of tumor cells. These genes might be potential prognostic biomarkers and therapeutic targets in PADC.

PADC:
Pancreatic ductal adenocarcinoma mRNAsi: mRNAsi expression base-index EREG-mRNAsi: Epigenetic regulation based-index DEGs: Differentially expressed genes WGCNA: Weighted gene coexpression network analysis CSCs: Cancer stem cells FDR: False discovery rate TOM: Topological overlap measure GS: Gene significance MS: Module significance ME: Module eigengenes MM: Module membership GO: Gene Ontology analysis Low risk High risk Figure 9: Gene set enrichment analysis between the high-and lowrisk group in TCGA PADC cohort.