Integrative Modeling of Multiomics Data for Predicting Tumor Mutation Burden in Patients with Lung Cancer

Immunotherapy has been widely used in the treatment of lung cancer, and one of the most effective biomarkers for the prognosis of immunotherapy currently is tumor mutation burden (TMB). Although whole-exome sequencing (WES) could be utilized to assess TMB, several problems prevent its routine clinical application. To develop a simplified TMB prediction model, patients with lung adenocarcinoma (LUAD) in The Cancer Genome Atlas (TCGA) were randomly split into training and validation cohorts and categorized into the TMB-high (TMB-H) and TMB-low (TMB-L) groups, respectively. Based on the 610 differentially expressed genes, 50 differentially expressed miRNAs and 58 differentially methylated CpG sites between TMB-H and TMB-L patients, we constructed 4 predictive signatures and established TMB prediction model through machine learning methods that integrating the expression or methylation profiles of 7 genes, 7 miRNAs, and 6 CpG sites. The multiomics model exhibited excellent performance in predicting TMB with the area under curve (AUC) of 0.911 in the training cohort and 0.859 in the validation cohort. Besides, the significant correlation between the multiomics model score and TMB was observed. In summary, we developed a prognostic TMB prediction model by integrating multiomics data in patients with LUAD, which might facilitate the further development of quantitative real time-polymerase chain reaction- (qRT-PCR-) based TMB prediction assay.


Introduction
Lung cancer is one of the most common malignancies worldwide, and it is the first leading cause of tumor-related mortality with an increasing incidence in recent years [1]. It was reported that 2.1 million new cases of lung cancer were diagnosed around the world in 2018, which accounted for 11.6% of all new cancer patients [2,3]. Despite the improvements in chemotherapy and targeted therapy, the 5-year overall survival (OS) for patients with lung cancer remained poor [1,4]. Nevertheless, immunotherapy, especially the application of immune checkpoint inhibitors (ICIs), had made a great breakthrough in the treatment of cancer and dramatically increased survival rate and quality of life for patients with lung cancer [5][6][7][8][9].
As the most successful representative of immunotherapy, programmed cell death-1/programmed cell death ligand-1 (PD-1/PD-L1) inhibitor had shown better performance over conventional chemotherapy in terms of OS, response rate, and progression-free survival (PFS) for the treatment of lung cancer [10,11]. Furthermore, a large amount of clinical research had also demonstrated that immunotherapy alone or in combination with chemotherapy could be used for the first-line treatment of patients with metastatic lung cancer [12][13][14][15]. It was reported that patients with higher PD-L1 expression had better outcomes compared to patients with lower or no PD-L1 expression using anti-PD-L1 antibody clone 22C3 [16]. Unfortunately, only 10%-20% of nonsmall-cell lung cancer (NSCLC) patients have considerable curative effects, and most patients cannot benefit from immunotherapy [17][18][19]; therefore, biomarkers are urgently needed to rationalize the utilization of immunotherapy for patients.
Tumor mutation burden (TMB) emerged recently as a reliable biomarker that significantly correlated with immunotherapy efficacy across a wide spectrum of tumor types. TMB is defined as the number of somatic mutations per megabase (Mb) of the genome examined. Previous studies found that higher TMB was associated with improved objective response, durable clinical benefit, and PFS in NSCLC patients under immunotherapy [20]. It had been reported that PFS among stage IV patients with high TMB was significantly longer with PD-1/PD-L1 plus cytotoxic T-lymphocyteassociated protein 4 (CTLA-4) treatment than with chemotherapy [21]. Moreover, through analyzing 7,033 patients with different types of cancer, TMB was found to be a useful biomarker for predicting response of ICIs across different types of cancer, and higher TMB (highest 20% in each histology) was associated with better OS [22].
Whole-exome sequencing (WES) is considered the gold standard for evaluating TMB, but it is time-consuming and carries high cost [23]. Thus, targeted next generation sequencing (NGS) has been adopted as an alternative approach for predicting TMB. Memorial Sloan Kettering-Integrated Mutation Profiling of Actionable Cancer Targets (MSK-IMPACT) (468 genes) and FoundationOne companion diagnostic (CDx) (324 genes) are two extensively utilized targeted NGS methods, and both of them have been approved by the U.S. Food and Drug Administration (FDA) for clinical application. Despite the fact that targeted NGS is effective in predicting TMB, various problems arise for its routine clinical application, such as the limit of detection, germline mutation exclusion, and standard cutoff threshold determination [24]. Moreover, targeted NGS is still time-consuming and expensive compared with other clinical molecular tests [24]. In an effort to establish a simplified, cost-effective approach to predict TMB in patients with lung adenocarcinoma (LUAD), we intended to integrate a multiomics data to develop a predicting model for TMB.
In this study (Figure 1), patients with lung adenocarcinoma (LUAD) in The Cancer Genome Atlas (TCGA) were first split into training and validation cohorts. Then patients in training cohort were divided as TMB-high (TMB-H) and TMB-low (TMB-L), and the differentially expressed genes, miRNAs, and differentially methylated CpG sites were identified. Subsequently, a multiomics TMB prediction model (TPM) involving expression profiles of selected genes, miR-NAs, and methylation profiles of CpG sites was established. Finally, patients from the validation cohort were used to verify the performance of TPM.  Table S1) [25,26]. The somatic mutation profiles (mutation annotation format, MAF) were processed by Mutect software. Missense mutations, nonsense mutations, splicesite mutations, frameshift insertions, frameshift deletions, in-frame insertions, or in-frame deletions identified in the samples were regarded as positive mutations. Gene expression profiles of 594 samples were annotated through g:Profiler website [27] and normalized using the scale method in limma package [28]. The low abundance profiles were eliminated. DNA methylation profiles were annotated using IlluminaHumanMethylation450kanno.ilmnl12.hg19 R package. Quality control for DNA methylation profiles was conducted through minfi R package to eliminate certain CpG sites [29], in which the single-nucleotide polymorphisms (SNPs) existed [30], multiple mapping to human reference genome was found [31], and the methylation information of any samples was not available. In addition, CpG sites located in sex chromosomes were excluded for analysis [32]. miRNA expression profiles of 495 samples including 450 samples from LUAD tissue and 45 samples from matched normal lung tissue in TCGA database were downloaded from the University of California Santa Cruz (UCSC) Xena database (https://xena.ucsc.edu/public). Then, the miRNA expression profiles were transformed into reads per million (RPM), and miRNAs expressed in more than 10% of patients with LUAD were extracted. The clinical information of 522 patients with LUAD from TCGA database was obtained using TCGAbiolinks R package [25], which covered id, age, gender, tumor stage, state, weight, Body Mass Index (BMI), alcohol history, height, days to last follow-up, years smoked, and race (Table 1).

TMB Calculation and Classification of Patients.
Somatic mutation profiles were processed by Mutect software, and the identified somatic mutations, including base substitution, deletions, and insertions, were filtered according to the following criteria: (1) the minimum sequencing coverage for mutations should be greater than or equal to 10; (2) the variant allelic fraction should be greater than or equal to 5%. TMB was calculated as the total count of somatic mutations identified divided by 38 Mb, which is the length of exons in human genome. According to previously reported cutoff threshold of 10 in patients with LUAD [21,33], patients were divided into TMB-H (TMB ≥ 10) and TMB-L (TMB < 10). Density plot of TMB-distribution for all patients with LUAD and boxplot of correlation between tumor stage and TMB was drawn by ggplot2 R package.

2
BioMed Research International

Multiomics Analysis between TMB-H and TMB-L Patients.
Differentially expressed genes between TMB-H and TMB-L patients in training cohort were identified through limma R package with -log10 adj.p value > 2 (adj.p value < 0.01) and log2 FoldChange > 3 [28] and then illustrated in volcano plot and heatmap by ggplot2 and pheatmap R package, respectively. Differentially expressed miRNAs between TMB-H and TMB-L patients in training cohort were identified through limma R package with -log10 adj.p-value > 1.30 (adj.p-value < 0.05) and log2 FoldChange > 0:35 [28] and then illustrated in volcano plot and heatmap by ggplot2 and pheatmap R package, respectively. In addition, target genes of the differentially expressed miRNAs were searched and analyzed through miRWalk website tool (http://mirwalk.umm.uni-heidelberg.de/) [35]. Differentially methylated CpG sites between TMB-H and TMB-L patients in training cohort were identified through limma R package with -log10 adj.p value > 1.30 (adj.p value < 0.05) and log2 FoldChange > 0:15 [28] and then illustrated in volcano plot and heatmap by ggplot2 and pheatmap R package, respectively.

Functional Enrichment Analysis.
We first converted gene symbols into ENTREZ ID via org.Hs.eg.db R package, and then Gene Ontology (GO) and Kyoto Encyclopedia of Genes (KEGG) analysis of differentially expressed genes were implemented using ggplot2, enrichplot, and clusterProfiler R packages [36]. Meanwhile, GO and KEGG enrichment analysis were conducted for target genes of differentially expressed miRNAs using the same method described above.    BioMed Research International and multiomics signature (15 genes + 15 miRNAs + 15 CpG sites) using differentially expressed genes, miRNAs, and differentially methylated CpG sites between TMB-H and TMB-L patients in the training cohort. Next, the least absolute shrinkage and selection operator (LASSO) logistic regression model analysis was performed to select the optimal biomarker signature for predicting TMB through glmnet R package [37]. The predictive performance for each biomarker signature was evaluated by lambda.min and matched area under curve (AUC). Finally, differentially expressed or methylated genes, miRNAs, and CpG sites identified with nonzero regression coefficients were used to construct the TPM. The TPM score was calculated using the regression coefficients from LASSO analysis to weigh the expression or methylation of the chosen biomarkers. The validation cohort was used to evaluate the performance of the TPM through assessing the predicting sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and AUC.

Principal Component Analysis (PCA).
Differentially expressed genes, miRNAs, and differentially methylated CpG sites identified through LASSO analysis were used to perform PCA. Expression or methylation profiles of the genes, miRNAs, and CpG sites were extracted from each patient, and ggfortify R package was utilized to conduct the PCA.
2.8. ROC Analysis. ROC curve analysis was conducted using pROC R package to investigate the performance of TPM in predicting TMB [38].
2.9. Correlation Analysis and Regression Analysis. Correlation between TPM score and TMB was analyzed by cor.test R function with the two-side Pearson's method. Samples were plotted in two-dimensional plots with the TPM score and TMB value. Regression analysis between TPM score and TMB was performed using lm R function.

Results
3.1. TMB-Based Division of Patients with LUAD. WES data of tumor tissue from a total of 567 patients were acquired from TCGA-LUAD database, and clinical characteristics of the patients were summarized (Table 1); the mean age of the patients was 65.8, among which 242 individuals were males and 280 individuals were females (Table 1). TMB was calculated as the number of somatic mutations identified per megabase (Mb) in tumor tissue of each patient. It was found that most of patients with LUAD had a TMB ranging from 0 to 40 (Figure 2 Figure S2A). KEGG pathway enrichment analysis suggested that the differentially expressed genes were mainly related to pyrimidine metabolism (Supplementary Figure S2B). These results demonstrated that differentially expressed genes might be correlated with carcinogenesis-related processes [39]. In addition, GO and KEGG enrichment analysis were also performed for the target genes of differentially expressed miRNAs, which were found to be mainly enriched in netrin-activated signaling pathway, DNA-binding transcription activator, and single-stranded RNA binding (Supplementary Figure S3A, Figure S3B). In addition, most of the differentially methylated CpG sites were found to locate in gene body regions (Supplementary Figure S4), and 5 CpG sites were found to locate in the TSS1500 (sequence region from -200 to -1500 nt upstream of the transcription start site) and TSS200 (sequence region -200 nt upstream of the transcription start site) region of genes (Supplementary Figure S4, Supplementary Table S2).

Machine Learning-Based Construction of TMB Prediction Model.
To develop TPM based on the differences identified through multiomics analysis, we firstly generated 4 possible prediction biomarker signatures including gene signature, miRNA signature, CpG site signature, and multiomics signature, which were composed of expression profiles of top 45 differentially expressed genes, top 45 differentially expressed miRNAs and top 45 differentially methylated CpG sites between TMB-H and TMB-L patients, respectively (Supplementary Table S3). To further compare predicting efficacy of the four signature, we implemented LASSO logistic analysis to select the optimal signature from training cohort. The optimal biomarkers for the 4 prediction signature were obtained with nonzero regression coefficients ( Figure 6, Table 2), and as a result, the multiomics signature with maximum measure (0.868) was selected as the optimal biomarker signature for predicting TMB (Figure 6(d), Table 2). PCA using the shrunk multiomics signature suggested that TMB-H patients and TMB-L patients could be separated obviously (Supplementary Figure S5). Based on  Group_list  cg20546259  cg16794961  cg06627916  cg10317815  cg06967998  cg17242596  cg08080489  cg03784083  cg07383535  cg21069019  cg10627389  cg11504078  cg02031308  cg08641973  cg12095807  cg01195676  cg11554507  cg07277061  cg19452338  cg27490043  cg19887883  cg17687285  cg11538307  cg12190306  cg26229018  cg23222169  cg13167263  cg08431709  cg08528474  cg14471784  cg07241908  cg04046889  cg19886365  cg26337816  cg18944144  cg21816377  cg24716042  cg14013053  cg08457980  cg24553235  cg24057718  cg03286742  cg00704227  cg00446536  cg26193468  cg26680793  cg17265300  cg19958021  cg24637231 (Figure 7(a)). Besides, the p value of a two-side t-test was 3:40e − 48 between TPM score and TMB (Figure 7(b)), which suggested that TPM score was highly correlated with TMB in patients with LUAD.

Evaluation of the Predicting Accuracy of TPM in the Validation Cohort.
To evaluate the predicting efficacy of TPM constructed in training cohort, expression or methylation profiles of genes, miRNAs, and CpG sites in patients from validation cohort were used as input parameters for calculating TPM score. According to the threshold of -3.366, 41 patients from the TMB-H group were predicted as TMB-H, and 66 patients from the TMB-L group were predicted as TMB-L. In summary, the TPM has a sensitivity of 0.911, a specificity of 0.750, and an accuracy of 0.805 in predicting TMB in the validation cohort, and the PPV was 0.651 and NPV was 0.943 (Supplementary Table S4). ROC analysis revealed the AUC of the TPM in validation cohort was 0.859 (Figure 8(a)), and p value of the two-side t-test was 1:19e − 14 between the TPM score and TMB (Figure 8(b)). These results suggested that the TPM

Discussion
Immunotherapy has been demonstrated particularly successful in NSCLC treatment and is being adopted as a firstline treatment option worldwide [12,21,40]. Nevertheless, only a small portion of unselected patients can benefit from immunotherapy [24,41]. Therefore, biomarkers for patient selection become important to achieve effective therapy. TMB has been recognized as the effective prognostic biomarker in NSCLC patients according to the results from numerous clinical trials [21,22,42,43]. Although targeted NGS has been proved to be an alternative approach of WES for the prediction of TMB, the accuracy and simplicity of targeted NGS remain challenging as various parameters should be taken into consideration [44]. In this study, we developed a mathematic multiomics model that could precisely predict TMB in patients with LUAD, and the prediction accuracy of the model was validated in an independent cohort with high sensitivity and specificity. Furthermore, as the input parameter in this model includes expression profiles of 7 genes, 7 miRNAs, and the methylation profiles of 6 CpG sites, which could be obtained through quantitative real time-polymerase chain reaction (qRT-PCR). This model paved the way for further development of the simplified qRT-PCR-based clinical assay for TMB prediction.
Through multiomics analysis, we integrated gene/ miRNA expression and DNA methylation data to reflect subtle alterations of the tumor microenvironment to precisely predict TMB for better prognosis of patients with LUAD in immunotherapy. Fragments per kilobase per million mapped reads (FPKM) of YBX2, HLTF, KLC3, WRNIP1, CKS1B, RNF26, ZYG11A, and RPM of hsa-miR-571, hsa-miR-586, hsa-miR-151b, hsa-miR-378i, hsa-miR-6727-5p, hsa-miR-502-3p, and hsa-miR-6798-3p as well as beta-value of cg02031308, cg03286742, cg04046889, cg12095807, cg16794961, and cg24553235 were extracted from RNA-seq, miRNA-seq and Illumina HumanMethyla-tion450 BeadChip, respectively, for calculating TPM score. Although the FPKM, RPM, and beta value involved in the TPM were based on high-throughput sequencing or chip analysis, it is feasible to convert them to cycle threshold (Ct) value in qRT-PCR analysis and thus simplify the prediction of TMB by using benchtop qRT-PCR instrument. The conversion of FPKM in different samples to Ct values could be probably through comparison of the targeted gene expression to reference gene expressions, such as actin and eukary-otic elongation factor (eEF), which have relative consistent expression under different tumor microenvironment, and beta value of CpG sites could also be converted into Ct value though the quantitative MethyLight technology [68]. To our best knowledge, this is the first time to construct the TPM for patients with LUAD from multiomics view.

Conclusion
In summary, the present study developed a multiomics risk model with high specificity and sensitivity in predicting TMB for patients with LUAD and laid the base for establishing a more simplified and cost-effective TMB test assay. Nevertheless, this study was solely bioinformatics research, and clinical sample validation for the TPM had not been implemented. The training cohort and the validation cohort used in this study were relatively small in size and required further expansion to increase the accuracy.

Data Availability
Somatic mutation profiles of 567 samples, gene expression profiles of 594 samples, DNA methylation profiles of 507 samples, and miRNA expression profiles of 495 samples were obtained from TCGA database.

Conflicts of Interest
The author(s) declare(s) that they have no conflicts of interest. Team Project of Shenzhen University (843-00000325) and the start-up funds from Shenzhen University (to J.W.). We would like to thank TCGA project organizers as well as all study participants.

Supplementary Materials
Supplementary Table S1: summary of TCGA-LUAD data. Supplementary Table S2: five CpG sites located in the TSS1500 and TSS200 region of the genes. Supplementary  Table S3: different biomarker signatures used in the training cohort. Supplementary Table S4: the performance of TPM in the validation cohort. Supplementary Figure S1: the differences in the abundance of tumor-infiltrating immune cells between TMB-H and TMB-L patients. Student's t-test was used; error bars indicated standard deviation; * p < 0:05.