Construction of an Epithelial-Mesenchymal Transition-Related Model for Clear Cell Renal Cell Carcinoma Prognosis Prediction

Background A rising amount of data demonstrates that the epithelial-mesenchymal transition (EMT) in clear cell renal cell carcinomas (ccRCC) is connected with the advancement of the cancer. In order to understand the role of EMT in ccRCC, it is critical to integrate molecules involved in EMT into prognosis prediction. The objective of this project was to establish a prognosis prediction model using genes associated with EMT in ccRCC. Methods We acquired the mRNA expression profiles and clinical information about ccRCC from TCGA database. In this study, we measured differentially expressed EMT-related genes (DEEGs) by two comparison groups (tumor versus normal tissues; “stages I-II” versus “stages III-IV” tumor tissues). Based on classification and regression random forest models, we identified the most important DEEGs in predicting prognosis. Afterwards, a risk-score model was created using the identified important DEEGs. The prediction ability of the risk-score model was calculated by the area under the curve (AUC). A nomogram for prognosis prediction was built using the risk-score in combination with clinical factors. Results Among the 72 DEEGs, the classification and regression random forest models identified six hub genes (DKK1, DLX4, IL6, KCNN4, RPL22L1, and SPDEF), which exhibited the highest importance values in both models. Through the expression of these six hub genes, a novel risk-score was developed for the prognosis prediction of ccRCC. ROC curves showed the risk-score performed well in both the training (0.749) and testing (0.777) datasets. According to the survival analysis, individuals who were separated into high/low-risk groups had statistically different outcomes in terms of prognosis. Besides, the risk-score model also showed outstanding ability in assessing the progression of ccRCC after treatment. In terms of nomogram, the concordance index (C-index) was 0.79. Additionally, we predicted the differences in response to chemotherapy drugs among patients from low- and high-risk groups. Conclusion Gene signatures related to EMT could be useful in predicting ccRCC prognosis.


Introduction
RCC accounts for 2 to 3% of all cancers worldwide [1]. Almost 403,000 people are diagnosed with RCC each year, and 175,000 people die from it [2]. There is a range of histological classification groups, but kidney renal clear cell carcinoma (KIRC, ccRCC) is the most prevalent and contributes to the majority of renal cancer-related deaths. KIRC can remain clinically occult in the absence of significant clinical symptoms, and patients are initially diagnosed when they are already at a late stage of the TNM. In general, cases of late diagnosis are associated with lower survival rates, which results in a lower five-year survival rate for KIRC patients. In stage I, the five-year diseasespecific survival for RCC patients ranges from 80 to 95 percent, but it will drop to less than 10% for stage IV patients [3]. For these RCC patients who had a lower survival rate and high risk, more elaborate and customized treatment plans were necessary. As a result, prognostic models that are capable of accurately identifying patients at high risk are urgently needed.
The EMT process describes the transition of epithelial cells to mesenchymal cells in a series of steps, and it is characterized by a loss of polarity, a breakdown in the integrity barrier, and an increase in invasion [4]. Many studies have highlighted the significance of EMT in cancer metastasis and pharmaceutical resistance [5]. The abnormal EMT signature is associated with various acquired capabilities, such as resistance to chemotherapy and immunotherapy, in addition to migration and invasion [6]. Recently, an EMT signature was shown to be linked to immune cell signaling, providing novel insights into the link between EMT and immune activation [7]. There are potential therapeutic opportunities because of the association between EMT and immune cells. Although EMT-related signatures have been linked to ccRCC metastasis and prognosis, limited studies have been conducted to determine if they can be employed as indicators for early detection and prognosis assessment.
In the current study, random forest models were developed to identify the most important genes associated with KIRC patient survival time and survival status. A prognostic risk-score model for KIRC was developed by the expression of six important genes. The AUC values and survival analysis results demonstrated the feasibility and accuracy of the riskscore model. A nomogram was constructed to predict overall survival (OS) in KIRC after incorporating the risk-score and clinical data parameters. Together, our findings demonstrate the importance of risk-score and nomogram for the prediction of survival for patients with KIRC.

Data Collection.
Level three of mRNA sequencing data of cancer patients with KIRC was collected from TCGA (https:// tcga-data.nci.nih.gov/tcga/). The expression data of 539 KIRC and 72 normal kidney samples were chosen for further investigation. The form of the downloaded gene expression data was "fragments-per-kilobase-million" (FPKM). The original data was then converted into "transcript-per-million" (TPM). Among 539 KIRC samples, the numbers of stage I, stage II, stage III, and stage IV were 268, 57, 123, and 83.

Identification of Differentially Expressed Genes (DEGs).
The R package "edgeR" was chosen to obtain DEGs between KIRC and normal tissues [8]. The DEGs filtering criteria were established at a p value of less than 0.05 and a jlog 2 FoldChangej greater than 0.5. Similarly, DEGs between early stage ("stages I-II") and advanced stage ("stage III-IV") tumor tissues were obtained by the same method and screening criteria. We downloaded 1184 genes related to EMT from the dbEMT online database [9], and then we obtained the DEEGs by integrating the DEGs and EMTrelated genes through the R package "VennDiagram" [10].

Selection of Biomarkers by Machine Learning.
In order to construct a model that has perfect prediction performance, we used machine learning models to select the genes that are significantly correlated with prognosis. The expression values of DEEGs were normalized by the "log2ðx + 1Þ" and "minmax" normalization methods. A classification and a regression model were constructed by the random forest (RF) algorithm. The classification RF (cRF) was built for the assessment of the survival status of KIRC patients. The regression RF (rRF) was built for the prediction of the survival time of KIRC patients.
The importance values of genes in two models were calculated, and the six genes with the greatest importance values were chosen for further study as hub genes.
2.5. Construction of the Risk Model. The expression profiles of TCGA-KIRC were separated randomly into training (70%) and testing (30%) datasets. In the training of KIRC patients, univariate Cox analysis was performed to assess the coefficients of genes. The risk-score was evaluated by the equation: risk − score = ðcoefficient × expression of gene 1Þ + ðcoefficient × expression of gene 2Þ + ⋯ + ðcoefficient × expression of gene XÞ. KIRC individuals were separated into low and high groups by the median risk-score, respectively. With the log-rank test, survival curves for low-and highrisk individuals were compared, including OS and progression-free interval events (PFI). The "survivalROC" R package was selected to calculate the AUC value to evaluate the predictive ability.
2.6. Stratification Analysis. TCGA-KIRC individuals were stratified into subgroups by age (≥60 years vs. <60 years), gender (female vs. male), and TNM stages (T1/T2 vs. T3/ T4, N0 vs. N1, and M0 vs. M1). The "Wilcoxon rank-sum" test was selected to discover the risk-score distribution with the R package "ggpubr." 2.7. Nomogram Development. A nomogram including clinical variables (age and stage) and the risk-score was designed to estimate the likelihood of one, three, and five-year OS. Cindex values vary between 0.5 and 1.0, representing no discriminating ability and excellent discriminating capacity, respectively. The fit of the generated and reference lines indicates the high accuracy of the nomogram model.

Chemotherapeutic Response Prediction.
The responses to chemotherapeutic drugs were predicted for samples by the R package "pRRophetic" [12]. With a prediction model based on Genomics of Drug Sensitivity in Cancer (GDSC) data and expression profiles of TCGA-KIRC samples, the package could predict the IC50 of each drug for each patient. The IC50 refers to the dosage required for halving the number of viable cells, and it is a measure of the drug's therapeutic effectiveness and can also be used for assessing the tolerance of tumor cells to drugs.
2.9. Evaluation of the Tumor Microenvironment (TME). ESTI-MATE [13] and CIBERSORT [14] were utilized in R to determine each KIRC sample's TME status. For example, ESTIMATE predicted the level of stromal, immune, and tumor is scored based on the expression profiles of TCGA-KIRC samples. The relative levels of 22 tumor-infiltrating lymphocytes (TILs) in KIRC samples were predicted by the CIBERSORT 2 Disease Markers algorithm. To ensure the prediction results are credible, p value < 0.05 was used as the selection criterion.

Identification of DEEGs and Functional Enrichment
Analysis. A total of 8905 significantly DEGs were identified between KIRC and normal kidney samples, of which 5660 were upregulated and 3245 were downregulated in KIRC samples than in normal samples (Figures 1(a) and 1(b)). Similarly, 2052 significantly DEGs were found between early stage ("stages I-II") and advanced stage ("stages III-IV") tumor tissues, of which 1453 were upregulated and 599 were downregulated in the advanced stage than in early stage KIRC samples (Figures 1(b) and 1(c)). After an intersection of EMT-related genes and DEGs by Venn diagram, 72 DEEGs were found (Figure 2(a)). Following that, functional enrichment analysis was used to investigate the probable molecular processes behind DEEGs. The enriched biological process (BP) terms were "inflamma-tory_response" and "cytokine_mediated_signaling_pathway" (Supplementary Table 1). The enriched molecular function (MF) was the terms of "cytokine_activity" and "receptor_ ligand_activity" (Supplementary Table 2). The significant cellular component (CC) terms were "collagen_containing_ extracellular_matrix" and "secretory_granule_lumen" (Supplementary Table 3). Furthermore, the KEGG analysis indicated that DEEGs were strongly linked to pathways in "IL17_signaling" and "viral_protein_interaction_with_ cytokine_and_cytokine_receptor" (Supplementary Table 4). Besides, the hallmark pathway analysis showed that "epithelial_mesenchymal_transition" and "inflammatory_ response" (Supplementary Table 5). . KIRC patients were classified as the high or low group, respectively. The AUC of the risk-score was 0.749, suggesting a high prognostic prediction ability (Figure 2(e)). According to the survival curve (OS), there was a substantial difference in OS between groups (p value < 0.001) (Figure 2(f)). We then validated the 6 gene model in the testing dataset. The risk-score distributions, survival status, survival time, and transcriptomic levels of individuals were ordered using the risk-score (Supplementary Figure 1A-C). 79 and 80 KIRC individuals were classified as high or low-risk, and the AUC value was 0.777 (Supplementary Figure 1D). According to the survival curve (OS), there was a substantial difference in OS between groups (p value = 0:0011) (Supplementary Figure 1E).

Selection of EMT-Related
We then validated the 6 genes to predict the progression of KIRC patients. The distributions of risk-scores, prognosis,   Figure 2A-C). 254 and 255 KIRC patients were labeled as high or low risk, respectively, and the AUC was 0.722 (Supplementary Figure 2D). Discrepancies in PFI were found between high and low groups (p value < 0.001) (Supplementary Figure 2E). These results suggest that our risk-score model could be an accurate indicator for OS and PFI prediction.

Relationship between Prognostic Signature and Clinicopathological Features.
A correlation between the prognostic signature and clinical and pathological characteristics was then examined. The results indicated a positive correlation between the risk core and poor prognosis. For example, risk-score was found in the advanced stages of KIRC, such as stage IV (Figure 3(a)), T4 (Figure 3(b)), N1 (Figure 3(c)), and M1 (Figure 3(d)). In contrast, the correlations of the risk-score with age (Figure 3(e)) and laterality (Figure 3(f)) were not significant.

Stratification Analysis.
In the groups of "stages I-II" and "stages III-IV," patients with higher risk had worse OS (Supplementary Figure 3A-B). Similarly, we demonstrated that risk-score could predict the OS of T1-T2 or T3-T4 patients   Figure 3F). Afterward, we conducted the univariate/multivariate Cox regression to validate the independent prognostic role of risk-score. Univariate analysis calculated the p values of age (p value < 0.01), laterality (p value = 0:994), stage (p value < 0.01), and risk-score (p value < 0.01). Subsequent multivariate analysis demonstrated that age (coefficients: 0.037, p value < 0.01), stage (coefficients: 0.52, p value < 0.01), and risk-score (coefficients: 0.76, p value < 0.01) were   Disease Markers negatively correlated with OS. These findings suggest that the risk-score is an independent predictor of survival in KIRC patients.
3.6. Construction of a Predictive Nomogram. By combining the risk-score and various clinical indicators, a nomogram was created to assess the survival rate (Figure 4(a)). The nomogram has a C-index of 0.79, and the risk-score clearly demonstrated greater importance than age and stage did. The prediction and reference calibration curves showed a great fit in predicting one, three, and five years of OS (Figures 4(b)-4(d)), which proves the prediction ability of the nomogram.
3.7. Difference in Sensitivity to Chemotherapies. The responsive predictive values of the risk-score for chemotherapy drugs (Figures 5(a)-5(h)) were calculated by IC50 values. Bortezomib, cisplatin, sunitinib, temsirolimus, and vinblastine all had lower IC50 values in the high-risk group, indicating that patients with a higher risk-score were more responsive to these medications. In the low-risk group, however, the IC50 value of sorafenib was much lower, indicating that individuals with a lower risk-score were more susceptible to it.
3.8. Correlation between the Risk-Score and TME. The CIBERSORT method was used to determine the percentage     A total of 423 samples were analyzed and found to be statistically significant. Fractions of "follicular_helper_T" and "Tregs" were higher among high-risk TCGA-KIRC samples ( Figure 6(a)), while the values of "CD4_memory_T" and "NK" cells were higher among low-risk TCGA-KIRC samples ( Figure 6(b)). Using the ESTIMATE technique, we also examined the differences between risk categories in terms of TME scores (Figure 6(b)). The Wilcoxon rank-sum test suggested that the immune and stromal scores in TCGA-KIRC samples were significantly higher, while the tumor purity was higher in the lower risk-score TCGA-KIRC samples. Using the Kaplan-Meier method, the prognosis of

Discussion
KIRC is particularly prone to invasion and metastasis, which may explain its poor prognosis. About 25-30% of KIRC patients have metastases at the time of diagnosis [15], and about 60% have metastases within the initial 2-3 years after diagnosis [16]. EMT is critical for tumor invasion, tumor metastasis, and tumor cell proliferation [17]. As a result, we developed a prognostic risk model for six EMT-related genes and evaluated its reliability and relationship with survival. Additionally, we checked the link between risk and response to the pharmacological therapy. Currently, Cox regression [18] and LASSO regression [19] analyses are prevalent for identifying prognostic genes and constructing prediction models. In our study, we used machine learning models to identify the prognostic genes. Machine learning has many advantages since it can achieve a higher accuracy value with fewer genes, and it also gains the prevalence of in multiple studies [20][21][22]. For example, in breast cancer, a machine learning model was provided to predict the immune subtype of breast cancer [21]. The major obstacle to using machine learning models on survival data is that it contains two variables: survival status and time. Thus, we built a classification model and a regression model for predicting the survival status and time, respectively. The necessary data for these two models were the expression values of DEEGs after normalization. Based on the prediction results of these two models, we could precisely plot the survival curve for each patient. Through this method, we also successfully identified the most important genes for predicting the prognosis of KIRC.
Through the EMT process, tumors including kidney cancer could gain the potential for aggressiveness and metastasis. The activation of the EMT process is complex, but our study found that immune cells may make a significant contribution to EMT in a variety of ways. For example, some kinds of immune cells may secrete immunosuppressive molecules, hence promoting cancer progression. In our study, we discovered that Tregs were more abundant in high-risk than in low-risk samples. Tregs have been shown to impair anticancer immunity by impairing protective immunosurveillance and thwarting efficient antitumor immune responses [23]. Among high-risk samples that were linked to invasion and negative prognosis, we found that immune and stromal cells were increased but tumor purity was decreased. These results suggest that the number of immune and stromal cells might exert crucial roles in tumor development. Together, we suppose that the stromal cells and Tregs among TME increase the migration of tumor cells, which leads to a worse prognosis.
DKK1 is a Wnt signaling pathway suppressor, and its dysregulation has recently been identified as a possible biomarker for cancer development and prognosis in a variety of malignancies [24]. The amount of DKK1 expression is inversely related to the number of CD8+ T cells. DLX4, often referred to as BP1, may play a crucial role in tumor development by supporting proliferation and EMT [25]. A previous study confirmed that DLX4 contributed to the proliferation and migration of KIRC [25]. In RCC patients, high levels of interleukin-6 (IL-6) are linked to a poor prognosis [26]. IL-6 is a key diver that promotes EMT and enhances migration and invasion in KIRC tissues [27]. KCNN4 expression is higher in KIRC than in normal tissues, and its level is linked to the tumor stage and grade [28]. RPL22L1 is a ribosomal protein, and previous studies have confirmed that RPL22L1 expression is greater in cancer tissue and is linked to a worse prognosis [29,30]. SPDEF has a complex correlation with the prognosis of cancer patients. For example, upregulation of SPDEF is associated with poor prognosis in prostate cancer [31], but it could also serve as a suppressor in colorectal cancer [32].
There are some strengths in this study. Firstly, DEEGs were derived from two comparison groups (tumor versus normal tissues; "stages I-II" versus "stages III-IV" tumor tissues) and EMT-related genes, which guarantee the clinical significance of DEEGs. Secondly, machine learning models have the ability to predict both survival time and status. Thirdly, we selected the hub DEEGs by machine learning, which increased the prediction ability of these DEEGs. For example, ROC curves showed the risk-score performed well in both the training (0.749) and testing (0.777) datasets. In terms of nomogram, the concordance index (C-index) was 0.79. Numerous limitations should be noted in our research as well.
To begin with, the risk-score and nomogram were constructed using a publicly available dataset. More datasets that contain the expression data and clinical information of KIRC samples are needed to validate our results. Then, the underlying mechanisms between 6 DEEGs and KIRC progression should be clarified. Prior to clinical usage, further laboratory experiments on the six-gene signature are required.

Conclusion
In summary, EMT is critical for the advancement of cancer and is linked with worse survival in individuals with KIRC. We developed a risk-score model and a nomogram using the EMT-related genes for predicting OS in KIRC, which might enable tailored therapy and clinical decision-making for KIRC patients.

Data Availability
The datasets generated for this study can be found in TCGA. Further inquiries can be directed to the corresponding authors.