Construction and Evaluation of a Tumor Mutation Burden-Related Prognostic Signature for Thyroid Carcinoma

Thyroid carcinoma is a type of prevalent cancer. Its prognostic evaluation depends on clinicopathological features. However, such conventional methods are deficient. Based on mRNA, single nucleotide variants (SNV), and clinical information of thyroid carcinoma from The Cancer Genome Atlas (TCGA) database, this study statistically analyzed mutational signature of patients with this disease. Missense mutation and SNV are the most common variant classification and variant type, respectively. Next, tumor mutation burden (TMB) of sample was calculated. Survival status of high/low TMB groups was analyzed, as well as the relationship between TMB and clinicopathological features. Results revealed that patients with high TMB had poor survival status, and TMB was related to several clinicopathological features. Through analysis on DEGs in high/low TMB groups, 381 DEGs were obtained. They were found to be mainly enriched in muscle tissue development through enrichment analysis. Then, through Cox regression analysis, a 5-gene prognostic signature was established, which was then evaluated through survival curves and receiver operation characteristic (ROC) curves. The result showed that the signature was able to effectively predict patient's prognosis and to serve as an independent prognostic risk factor. Finally, through Gene Set Enrichment Analysis (GSEA) on high/low-risk groups, DEGs were found to be mainly enriched in signaling pathways related to DNA repair. Overall, based on the TCGA-THCA dataset, we constructed a 5-gene prognostic signature through a trail of bioinformatics analysis.


Introduction
With the universalness of imaging for diagnosis and monitoring, the thyroid carcinoma (THCA) cases have been increasing throughout the world [1]. Although most patients with thyroid carcinoma respond to therapies like surgical intervention and iodine-131, 10% of them succumb to poor prognosis and malignant progression [2]. It can be perceived that diagnostic and therapeutic strategies for malignant thyroid carcinoma remain to be improved. Hence, accurate diagnosis and assessment of tumor malignancy, as well as personalized therapeutic options, may aid in improving the survival status of patients with this disease. Tumor mutation burden (TMB) is being defined as the number of somatic mutations per megabase of genome, which varies across malignancies [3]. Currently, TMB is broadly recognized to be associated with DNA damage repair (DDR). For instance, in 2018, Nassar et al. [4] found that DDR-related genes are correlated with TMB in a cohort study for gene mutational features in urothelial carcinoma. That is, the severer the loss of DDR-related gene function, the higher the corresponding TMB is in patients. Likewise, Parikh et al. [5] revealed that loss of DDR function is positively related to TMB. It is well known that loss of DDR function results in malignant progression of tumors [6], thus, TMB can partially reflect tumor malignancy. Besides, several studies analyzed the correlation between TMB and prognosis of patients through bioinformatics methods. For example, Xie et al. [7] concluded that compared with low-TMB patients, the prognosis is more unfavorable in high-TMB patients. Overall, there is a   Computational and Mathematical Methods in Medicine correlation between TMB level and prognosis of patients with thyroid carcinoma. Screening gene signatures based on public databases is an effective strategy to search for prognostic biomarkers. Nowadays, researchers have searched for possible prognostic biomarkers based on varying features of multiple cancer types and have constructed various prognostic models, which provide relevant evidence for the prognosis of patients with related cancers. An example is that Liu et al. [8] generated a prognostic model of autophagy-related genes in nonsmall cell lung cancer. Guo and He [9] analyzed gene expression in the SLC30A family and manifested a likely prognostic marker for gastric cancer. Sui et al. [10] generated an immunocyte-related prognosis model that can predict the prognosis of patients effectively and efficacy of chemotherapy by analyzing immunocyte infiltration level in breast cancer. However, TMB-related prognostic model for thyroid carcinoma has not been fully studied yet.
Here, based on the dataset from The Cancer Genome Atlas (TCGA) database, we first perform statistical analysis on the mutational signature of thyroid carcinoma. Then, we analyzed relationships among TMB, survival time, and several clinicopathological features of patients. Afterward, we conducted Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses on differentially expressed genes (DEGs) in high/low TMB groups. Based on these DEGs, we generated a 5-gene prognostic signature for thyroid carcinoma. In conclusion, the establishment of the prognostic signature may present an opportunity for its clinical application as a prognosticator of patients with this disease.

Data Acquisition and Mutational Signature Analysis.
TCGA-THCA dataset was downloaded from TCGA (https://portal.gdc.cancer.gov/) database. The dataset included mRNA expression data (normal: 58; tumor: 510), single nucleotide variants (SNV) data (VarScan2 Annotation, tumor: 487), and common clinical profiles. The R package "maftools" [11] was utilized to conduct mutational signature analysis on samples of thyroid carcinoma patients in the TCGA-THCA dataset. This statistical analysis included variant classification, SNV class, variants frequency, and mutated gene of samples.

Correlation Analysis of TMB in Thyroid Carcinoma and
Clinicopathological Features. The number of somatic mutations (nonsynonymous mutations) per megabase in the coding region of tumor genome in samples was defined as TMB value. Samples were stratified into high-and low-TMB groups per the upper quartile of TMB value. "Survival" package [12] was employed to plot survival curves of high/low TMB groups. Correlation between various clinicopathological features and TMB was analyzed and the variability was detected by Wilcoxon.

DEGs and Enrichment
Analyses. The R package "edgeR" [13] was used to conduct differential expression analysis (|logFC | >1, padj < 0:05) on mRNA expression data in high/low TMB groups. TMB-related DEGs were obtained. The R package "clusterprofiler" [14] was employed to carry out GO and KEGG enrichment analyses on DEGs. Terms and signaling pathways subjected to p value < 0.05 and q value < 0.05 were selected as significantly enrichment results.

Establishment of the TMB-Related Prognostic Signature.
Among TCGA-THCA samples, those with survival time greater than 0 were chosen and randomly grouped into training set (n = 350) and test set (n = 151). Univariate Cox regression analysis was carried out in the training set by using R package "survival," and genes related to survival were obtained (p < 0:05). These genes were then subjected to LASSO regression analysis. Cross-validation was conducted to select the optimal penalty parameter (lambda) to rule out genes with relatively high correlation. Afterward, the "survival" package was applied for multivariate Cox regression analysis on the selected genes by LASSO analysis. The prognostic signature was constructed, by which risk score of each patient was computed following the formula: n represents the number of prognosis-related genes. Coⅇf j represents weighted correlation coefficients of each gene. Gene expression j represents the expression of prognosis-related genes.

Prognostic Signature Evaluation and Gene Set
Enrichment Analysis (GSEA). In the training set, samples were classified into high-and low-risk groups with the median risk score of patients as the cut-off value. The R package "survival" was utilized to plot survival curves of high-and low-risk groups in the training set and the test set, respectively. Then, survival status was observed. The R package "timeROC" [15] was employed to draw receiver operation characteristic (ROC) curves in two sets, respectively, thereby evaluating the performance of the prognostic signature. Afterward, risk score was considered as a feature, combined with common clinicopathological features (sex, age, T staging, N staging, and tumor stage), univariate and multivariate Cox regression analyses were done. Next, R package "rms" [16], combined with clinical information of patients and risk score, was utilized to draw a nomogram of 3-year and 5-year survival. Finally, the R package "foreign" (https://cran.r-project.org/web/packages/foreign/ index.html) was used to plot calibration curves to validate performance of the nomogram. Besides, to investigate main signaling pathways that variated across samples in high-and low-risk groups, GSEA software was downloaded from website (http://www.gsea-msigdb.org/gsea/index.jsp) for analysis of the expression profiles of patients in high-and lowrisk groups [17].

Correlation Analysis of TMB and Clinicopathological
Features. The number of somatic mutations per megabase in the coding region of tumor genome is termed as TMB. TMB value of each sample was calculated to investigate the relationship between TMB and clinicopathological features of patients. Then, thyroid carcinoma samples were classified into high-and low-TMB groups per the upper quartile. Subsequently, survival curves were plotted and survival status in high/low TMB groups was observed. As illustrated in Figure 2(a), compared with high TMB group, survival of patients in low TMB group was more favorable. Furthermore, the correlation of TMB and various clinicopathological features was analyzed. The results presented that TMB was conspicuously upregulated in patients older than 65 (Figure 2(b)). TMB values in male patients were notably higher than those in female patients (Figure 2(c)). TMB values in patients in T3 − 4 were markedly higher than those of patients in T1 − 2 (Figure 2(d)). TMB values in patients in N1-3 were markedly higher than those of patients in N0 (Figure 2(e)). TMB values in patients in M1 were markedly higher than those of patients in M0 (Figure 2(f)). TMB values in patients in stage III-IV were remarkably higher than those of patients in stage I-II (Figure 2(g)).    Table 1). Afterward, GO and KEGG enrichment analyses were performed on 381 DEGs. GO enrichment analysis revealed that these DEGs were mainly enriched in terms such as muscle tissue development, collagen-containing extracellular matrix, and receptor-ligand activity (Figure 3(b)). KEGG enrichment analysis manifested that these DEGs were mainly enriched in signaling pathways such as protein digestion and absorption, pancreatic secretion, endocrine and other factor-regulated calcium reabsorption (Figure 3(c)). Thus, we speculated that these DEGs were mainly involved in biological functions and signaling pathways related to muscle tissue development.

Construction and Assessment of the Prognostic Signature.
Based on the above DEGs, to construct TMB-related prognostic signature for thyroid carcinoma, tumor samples from the TCGA-THCA dataset were divided into the training set (n = 350) and the test set (n = 151). To initially screen survival-related genes, 33 survival-related genes were obtained through univariate Cox regression analysis in the training set (Supplementary Table 2). Next, LASSO Cox regression analysis was conducted to further screen out 7 genes from 33 genes. Thus, genes that were too closely related to each other were excluded (Figures 4(a) and 4(b)). Finally, based on the above 7 genes, a 5-gene prognosis signature for thyroid carcinoma was established via multivariate Cox regression analysis (risk score = 0:5117 * BMP8A + 0:1719 * ADARB2 + 0:1205 * SALL3 + 0:4577 * PPBP + 0:2235 * SCN1A) (Figure 4(c)). Afterward, based on the prognostic signature, risk score of each sample in the training set was calculated. Patients with thyroid carcinoma were divided into high-and lowrisk groups with the medium risk score of the training set as the cut-off value. Survival of patients and expression of 5 feature genes were analyzed (Figures 4(d)-4(f)).
To assess the performance of the prognostic signature, as well as its independence as a prognosticator, survival curves were plotted in the training set and the test set. Subsequently, survival analysis on patients in high-and low-risk groups was conducted. The results disclosed that survival of patients in the low-risk group was conspicuously better than those in the high-risk group (Figures 5(a) and 5(b)). Then, ROC curves were drawn in the training set and the test set to assess the performance of the prognostic signature.  Computational and Mathematical Methods in Medicine multivariate Cox regression analyses were carried out on risk scores of patients. The results presented that in univariate Cox regression analysis, T staging, tumor stage, and risk score were markedly correlated with patient's prognosis. While in multivariate Cox regression analysis, only risk score remarkably affected patient outcomes (Figures 5(e) and 5(f)). Hence, according to results of univariate and multivariate Cox regression analyses, risk score could be an independent prognostic risk indicator. Finally, a nomogram was plotted to predict 3year and 5-year survival of patients with thyroid carcinoma by combining various clinicopathological features and risk scores ( Figure 5(g)). Calibration curves were drawn to evaluate the performance of the nomogram (Figures 5(h) and 5(i)). It could be seen that 3-year and 5-year survival of patients could be effectively predicted by the nomogram.

GSEA.
To inquiry about signaling pathways varied in patients in high-and low-risk groups, GSEA was performed on high-and low-risk groups. As presented in Figures 6(a)-6(c), significant differences showed in activation of signaling pathways including nucleotide excision repair, mismatch repair, and DNA replication in high-and low-risk groups.
In conclusion, differentially activated signaling pathways may be responsible for prognosis difference between highand low-risk groups.

Discussion
Based on the TCGA-THCA dataset, this study carried out statistical analysis on mutation information of thyroid carcinoma. BRAF gene mutation was the most common one. BRAF, as a member of serine/threonine protein kinase family, participates in MAPK/ERK signaling pathway in cells, and its mutation is related to thyroid canceration [18]. Numerous investigations mentioned high mutation rate of BRAF gene in various thyroid carcinomas [19][20][21][22], wherein T1799A nucleotide transversion is the most common oncogenic mutation of BRAF gene. This mutation leads to con-version of the 600th valine of BRAF protein to glutamate, which enhances the activity of serine/threonine protein kinase of BRAF protein. Overall, both mutational signature analysis in this study and earlier studies exhibited that mistranslation mutations of BRAF gene in thyroid carcinoma mutations play a pivotal role in thyroid carcinogenesis. TMB varies in different solid tumors. In this study, TMB of patients with thyroid carcinoma was in the range of 0.02~1.63 (Supplementary Table 1). Nevertheless, literature reported that 50% of patients with lung squamous cell carcinoma and 71% of patients with melanoma have a TMB greater than 10. Investigators speculated that high TMB of the abovementioned solid tumor patients mainly results from long-term exposure to chronic mutagenic exposures (smoking or ultraviolet radiation) and long-term superimposed mutations [3]. Notwithstanding the limited floating range of TMB in thyroid carcinoma, our analysis illustrated a significant difference in survival time of patients in high and low TMB groups. That is, the survival of patients in the high TMB group is dismal (Figure 2(a)). As such, in 2020, Xie et al. [7] supported the view through a series of bioinformatics analyses that patients with papillary thyroid carcinoma with high TMB have unfavorable outcomes. Hence, combined our results with published literature, we could speculate that the prognosis of thyroid carcinoma patients with high TMB is poor.
According to American Joint Committee on Cancer staging manual eighth edition, in the TNM staging of thyroid carcinoma, T3 of thyroid carcinoma is defined as tumor > 4 cm limited to the thyroid, or gross extrathyroidal extension invading only strap muscles [23]. Meanwhile, a study [24] pointed out that the probability of local recurrence can be predicted according to strap muscle invasion of thyroid carcinoma. Hence, strap muscle invasion may relate to malignancy of thyroid carcinoma. In this investigation, DEGs were obtained by differential expression analysis on patients with high and low TMB. These genes were   (c) Figure 6: GSEA enrichment analysis in high-and low-risk groups. Enrichment of high-and low-risk groups in DNA replication pathway (a), mismatch repair pathway (b), and nucleotide repair pathway (c).