A Novel Prognostic Signature Revealed the Interaction of Immune Cells in Tumor Microenvironment Based on Single-Cell RNA Sequencing for Lung Adenocarcinoma

Background The tumor immune microenvironment (TIME) played an important role in immunotherapy prognosis and treatment response. Immune cells constitute a large part of the tumor microenvironment and regulate tumor progression. Our research is dedicated to studying the infiltrating immune cell in lung adenocarcinoma (LUAD) and seeking potential targets. Methods The scRNA-seq data were collected from our FDZSH and two public datasets. The code for cell-type mapping algorithms was downloaded from the CIBERSORTx portal. The bioinformatics data of LUAD patients could be approached from The Cancer Genome Atlas (TCGA) portal. Weighted gene coexpression network analysis (WGCNA) and least absolute shrinkage and selection operator (LASSO) analyses were performed to construct a risk model. TIMER2 and TIDE helped with the immune infiltration estimation, while PROGENy helped the cancer-related pathways' enrichment analysis. GSE31210 dataset and IMVigor ICB therapy cohort validated our findings as the external validation datasets. Results We clustered the scRNA-seq dataset (integrating our FDZSH datasets and other public datasets) into 23 subpopulations. After curated cell annotation, we implemented Cibersort and WGCNA analysis to anchor the brown module and natural killer cell cluster1 due to the most relationship with tumor trait. The overlap of the brown module gene, natural killer cell signature, and DEGs of tumor and adjacent normal samples was screened by LASSO Cox regression. The obtained 5-gene risk model showed an excellent prognostic performance in the validation dataset. Furthermore, there was a correlation between risk score and tumor-infiltrating immune cells and tumor genomics abnormity. Patients with higher risk scores had a significantly lower immunotherapy response rate. Conclusion Our observations implied that immune cells played a pivotal role in TIME and established a 5-gene signature (including IDH2, ADRB2, SFTPC, CCDC69, and CCND2) on the basement of nature killer markers targeted by WGCNA analysis. The significance of clinical outcome and immunotherapy response prediction was validated robustly.


Introduction
Lung cancer is the most common cancer and the most prevalent cause of tumor-related death in the world [1]. Lung adenocarcinoma (LUAD) accounts for 85% of cases [2,3]. Despite the significant advance in LUAD multidiscipline treatment, including surgery, chemotherapy, radiotherapy, and especially targeted therapy, the five-year survival rate of patients with LUAD remains discouragingly low. As the merging therapy, immunotherapy holds tremendous promise in controlling or even eradicating residual disease and improving cancer treatment and prognosis [4], but many patients still do not respond to anti-PD-1/PD-L1 immunotherapy. The current opinion is that the reciprocal regulation between tumor cells and tumor-infiltrating immune cells shapes the immune status of the TME [5] and may  Journal of Immunology Research determine the outcome of cancer progression. Hence, the comprehensive analysis of the immune cells in LUAD patients facilitates the exploitation of novel biomarkers to predict the treatment response and disease prognosis.
With the development of sequencing technology over the past decade, molecular prognostic markers of tumors based on RNAseq technology emerged in endlessly. As a hot new technology in transcriptional analysis, scRNA-seq technology enables single-cell sequencing technology to reveal cellular gene expression that cannot be detected by bulk RNA sequencing [6]. Single-cell sequencing is mostly used to identify cell subgroups and pedigree analysis initially. With the maturity of scRNA-seq, single-cell sequencing technology in analyzing the tumor microenvironment is starting to become mainstream in oncology research. Currently, the exploration and in-depth analysis of scRNA-seq data of tumor specificity are still of great significance for the mass use of bulk RNA sequencing to characterize different cell subsets for using bulk RNA sequencing to characterize different cells subsets in large quantities.
This study integrated scRNA-seq data from our hospital and two external public databases with curated cell identity annotations. Furthermore, weighted gene coexpression network analysis (WGCNA) was implanted on the normal and tumor samples of the TCGA-LUAD cohort to explore the key module and cluster associated with tumor status. Consequently, the hub gene was selected to construct a five-gene risk model by the least absolute shrinkage and selection operator (LASSO) Cox regression algorithm and AUC validation. Our downstream analysis proclaimed the prospect and functionality of the signature in immune infiltration, mutational status, oncogenic pathways, and clinical prognosis; our results provide new sight into the immune cells in tumor heterogeneity and biomarker mining.

Materials and Methods
2.1. Single-Cell RNA Sequencing Data Collection. Fourteen primary LUAD patients who had received surgical resection in the Department of Thoracic Surgery in Zhongshan Hospital (FDZSH) were included for scRNA sequencing [7]. The diagnosis of lung adenocarcinoma was confirmed in each case by histopathological analysis. The other two public datasets, two independent LUAD patient cohorts, were downloaded from ArraryExpress (accession numbers E-MTAB-6149 and E-MTAB-6653) and Human Cell Atlas Data Coordination Platform (accession number PRJEB31843).

The Process of scRNA Dataset Integration and Cell
Annotation. Preparation for single-cell transcriptomic sequencing followed the protocol for the 10x Genomics Chromium Single-Cell platform. Our previous published literature described the detailed tissue processing, and the single-cell suspension was described in our previous published literature [7].
We followed the Seurat v3 guidelines for the routine procedure. Cells expressing less than 200 genes or greater than 7000 genes or more than 20% mitochondrial genes were removed in the cell QC procedure. After normalization and PCA dimension reduction, the harmony [8] R package was utilized for removing the batch effect. Cell clustering was based on PCA dimensionality reduction using the first 20 PCs and a resolution value of 0.4. Marker genes manually identified cell annotation in the CellMarker database (http:// biocc.hrbmu.edu.cn/CellMarker/) with the assistance of Sin-gleR [9] and scType [10]. The marker genes of each cluster were conducted by the function FindAllMarkers() with the default parameters. C l u s t e r 0 C l u s t e r 1 C l u s t e r 2 C l u s t e r 3 C l u s t e r 4 C l u s t e r 5 C l u s t e r 6 C l u s t e r 7 C l u s t e r 8 C l u s t e r 9 C l u s t e r 1 0 C l u s t e r 1 1 C l u s t e r 1 2 C l u s t e r 1 3 C l u s t e r 1 4 C l u s t e r 1 5 C l u s t e r 1 6 C l u s t e r 1 7 C l u s t e r 1 8 C l u s t e r 1 9 C l u s t e r 2 0 C l u s t e r 2 1 C l u s t e r 2 2  The differential gene analysis between the 56 paired tumor and normal sample of TCGA-LUAD was performed using the R package limma [11] with the threshold (FDR < 0:05 and Log ð FoldchangeÞ > 1).
2.4. The CIBERSORTx and WGCNA Analysis. The cell type abundance of TCGA samples from our scRNA-seq annotation was calculated by CIBERSORTx [12] using our customized signature matrix. The customized cell-type-specific signature genes were created by the creation feature of CIBERSORTx (https://cibersortx.stanford.edu). WGCNA was performed using the R package WGCNA [13] (version 1.70). Firstly, we chose both normal and tumor TCGA-LUAD samples for the analysis; the outlier sample was removed by setting the cutHeight 140. The soft power was determined by the pickSoftThreshold() function and setting the networkTYPE = "unsigned." Power of 5 was chosen. We set WGCNA "mergeCutHeight" to 0.25 and merged the similar small module to identify nine modules. Of the nine WGCNA modules, the gray module includes genes that do not coexpress and are unassigned to a coexpression network; therefore, the gray module was excluded from our analysis.

LASSO Algorithm Model
Construction. LASSO is a widely used regression method appropriate for analyzing data with high dimensions and strong relationships like  Each row corresponds to a module, and each column corresponds to a cluster cell abundance. The color scale on the right shows module-trait correlations from 1 (red) to −1 (blue). Each cell at the row-column intersection's background color represents the correlation coefficient between the modules and clusters. The red color indicates a high degree of positive correlation, and the blue indicates a high degree of negative correlation between each module and the clusters. Each cell also contains the corresponding p signatures (bottom symbols).   Journal of Immunology Research high-throughput data. We used the R package glmnet to perform the LASSO Cox regression analysis as previously described [14], including a built-in crossvalidation function to adjust the L1 regularization variable lambda for candidate feature selection. The R caret package (https://cran.r-project .org/package=caret) was applied to build the classification of the TCGA-LUAD cohort and assess machine learning classifiers for the classification task.

Immune Infiltration Estimation and Genomic Analysis.
The immune infiltration and cell composition estimation of TCGA patients were based on the TIDE (http://tide .musc.edu/) and TIMER2 (http://timer.cistrome.org/). The MAF file was used to illustrate the distribution of mutation frequency and status by the R package maftools [15]. Tumor mutational burden (TMB) was calculated as the number of somatic base substitutions or indels per megabase (Mb) of the coding region target territory of the test (currently, 1.11 Mb). The stemness data was from previous researchers' articles [14,16,17].

11
Journal of Immunology Research type annotation of our datasets, we formulated the CIBER-SOT progress to calculate the cell abundance based on the transcriptome profile of our scRNA-seq. Considering the heterogeneity of each cluster, we included all the clusters rather than cell types to establish the signatures. The cell abundance of tumor and normal samples in the TCGA-LUAD database showed significant differences among 21 cell subpopulations (Figure 3 WGCNA analysis was performed to determine the correlation between gene expression module and cell abundance of specific cell subpopulations. All 585 TCGA-LUAD samples mingled with 59 adjacent and 526 tumor samples. The hierarchical clustering results and deleted outlier sample are shown in Figure 3(b). Figure 3(c) showed that the soft threshold selection process and a scale-free network were successfully conducted ( Figure S2). Finally, we obtained nine modules after merging small modules (Figure 3(d)). The correlation analysis between WGCNA modules and tumor phenotype showed that the brown module was most likely relevant to the tumor. Correspondingly, the closest cell cluster was the NK cell (cluster 1). Further investigation of the brown module genes was conducted. The GO and KEGG enrichment results are shown in Table S2 and Figure S2  The validation of our model was based on the TCGA-LUAD cohort. The whole cohort was divided into the train and validated dataset based on the recursive feature elimination classification algorithm, often used in machine learning.

The Immune Landscape of Different Risk Groups in LUAD.
The immune cells in the tumor microenvironment play an essential role in tumor prognosis. The association between the risk score and infiltration of immune cells was explored. The ESTIMATE formula computed the Immune-Score and TumorPurity. Corresponding to the above result, the high risk score tumors had statistically significantly lower levels of immune infiltration (Figure 6(a)) (Immune-Score: high vs. low: 1204 vs. 1812, p < 0:001) and higher tumor infiltration (TumorPurity: 0.692 vs. 0.599, p < 0:001 ). Additionally, the mRNAsi stemness score in high group was higher than low group (Figure 6(b)) (0.350 vs. 0.302, p < 0:001). The correlation analysis between the immune cell abundance and a risk score is shown in Figures 6(c)-6(l). The risk score was negatively correlated with the abundance of CD8+ T cell, NK cell, neutrophil cell, macrophage M1, myeloid dendritic cell, and memory B cell. CD4+ T cell and macrophage M0 positively correlate with a risk score. The TIDE prediction score in the low-risk group was significantly lower than in the high-risk group, representing the positive correlation between risk score and tumor immune escape (Figure 6(m)). The barplot in Figure 6(n) shows the predicted response of immunotherapy in the low and high groups corresponding to the abovementioned results. A conclusion could be inferred from the above results that the risk model could participate in regulating tumor microenvironment via immune cell infiltration and had a predictive value in immunotherapy response.

The Correlation between Risk Score and Genomic
Features. The investigation of the genomic features was conducted to reveal the tumor characteristics in TIME. The distribution of variants and somatic interactions of the low-and high-risk groups was shown in Figure 7(a) and Figure S3. The commonly driven genes like EGFR, KRAS, and KEAP1 mutated in a mutually exclusive manner. The differentially mutated oncopathways between low-and high-risk groups are shown in a forest plot in Figure 7(b). The pROGENY analysis also revealed that many oncopathways were enriched in the high-risk group. Drive gene KEAP1 has a significantly higher mutation frequency in the high-risk group (28% vs. 9%, p < 0:001). Consistent with the analysis above, the patients in the high-risk group hold an elevated tumor mutation burden (6.13 vs. 4.58, p < 0:001). The above results suggested that this model revealed that immune cell infiltration in the TIME affects the tumor's genomics status and mutation load.
3.6. The Analysis of Clinical Characteristics and Construction of Nomogram. As Table S3 and Figures 8(a)-8(e) show, the risk score has no connection to the age, but the base characteristics between the two groups are not balanced. The overall survival (OS) results revealed that patients with lower risk scores exhibited better survival prognoses (Figure 8(f)). The median time of survival in the low-risk score group was 50.5 (95% CI: 40.9-NA) months, whereas high-risk score patients had a considerably shorter median survival time (37.83 [32.5, 48.4] months, p < 0:0001). Due to the baseline imbalance and confounding factors. We identified that the risk score group was a crucial independent prognostic factor by univariate and multivariate Cox regression analysis (Table S3, Figure 8(g)). The nomogram was established based on the significant factors, including group, age median group, smoke, and pathologic stage (C-index: 0.687). The validation of the nomogram was implemented through 1-, 3-, and 5-year calibration curve plots (Figure 8(i)), which demonstrated that our nomogram model performed well on the robustness and efficacy. . Moreover, we also collected other three immuned related signature models [18][19][20], the comparison of the four model was conducted in the independent GEOdataset because the training dataset of four models was TCGA-LUAD cohort; Figures 9(g) and 9(h) showed that the AUC value of our model was significantly higher than other models in various time points.

Discussion
Our research merged multiple single-cell datasets, annotated the main cell type, and identified their cluster-specific marker genes. A five-gene risk model was obtained by the NK cell cluster marker gene screened in WGCNA analysis due to the closest relationship to tumor traits. Subsequent analysis validated the independent predicted value and well performance in immunotherapy response and revealed the crucial role of immune cells within the TME in tumor progression and metastasis. Since the rapid development and accessibility of scRNAseq in cancer research, promising findings in cancer evolu-tion, metastasis, and TME have been reached [21,22]. Previous studies have demonstrated that single-cell transcriptome analysis could apply specific signature genes to estimate cell type abundances of bulk transcriptome [23]. Schaum et al. [24] performed the CIBERSORTx deconvolution algorithm on annotated scRNA-seq to quantify the abundance of immune cells in 17 organs at ten ages based on their massive bulk RNA seq data which confirmed her findings with scRNA-seq. Recent studies reported applying scRNA-seq and bulk RNA seq data to analyze the tumor heterogeneity and immune cells in ovarian cancer [25], glioma [26], and esophageal squamous cell carcinoma [27]. Jerby-Arnon et al. [28] identified a cancer cell-related T resistance program to predict the immunotherapy response in melanoma patients. Here, we merged our dataset with two other independent datasets to expand the applicability of our signature in LUAD and targeted the NK cell cluster 1 as the essential signature by the linkage of CIBERSORTx and WGCNA analysis. Interestingly, the cluster 1 special expressed gene SFTPC was identified as one of the 5-gene risk models, demonstrating the findings' sturdiness. Altogether, the development of scRNA-seq data promoted the investigation of novel biomarkers in the specific cancer type.
Our research also showed that immune cells are TIME's backbone in LUAD. In our silicon analysis, NK cell signature and its subsequent LASSO selection constructed the 5-gene risk model, which had a significant negative correlation with the NK cell abundance. We inferred that the risk score represents the exclusion of NK cells. A previous study illustrated that NK cells were lower in NSCLC than in noncancerous lung tissue [29], holding the bridge of innate
On the contrary, NK cell infiltration could also generate durable and long-lasting antitumor immune responses against lung cancer. Our results showed that the lower risk score had a significantly higher immunotherapy response rate, a lower activated oncogenic pathways rate, and flat genomic abnormity. Generally, we suggested that our risk score inferred the infiltration of NK cells in the TIME of lung cancer.
Our research consequently identified the overlap of DEGs and NK cell cluster markers as the candidate for the risk model, and LASSO Cox regression helped us determine isocitrate dehydrogenase (NADP(+))2(IDH2), adrenoceptor beta 2(ADRB2), surfactant protein C (SFTPC), coiled-coil domain containing 69 (CCDC69), and cyclin D2 (CCND2). IDH2 is often considered to have a similar prognostic effect to IDH1 in glioma [35]. Li et al. [36] reported IDH2' as an indicator of poor prognosis and concluded that IDH2 promotes the Warburg effect and tumor proliferation through HIF1α in lung cancer. SFTPC encodes the pulmonary-associated surfactant protein C, a hydrophobic surfactant protein for maintaining stable pulmonary tissue. Moreno-Rubio et al. [37] reported the overexpression of SFTPC in long-term survival NSCLC patients, while a Norwegian group reported similar findings. They found that SFTPC and SFTPA mRNAs could be potential markers in regional nodes and peripheral blood in lung cancer [38]. In summary, we believe that the deep exploration of the molecular mechanism of the five gene model in TIME would facilitate the development of novel diagnostic biomarkers.
This study provides a new perspective on understanding the immune cells in TIME and sets a novel risk model; limitations to our research highlight the need for further work to optimize our work. Firstly, the internal validation of the model showed a good performance, and further external real-world validation is needed. Downstream and functional experiments underlying the mechanism of immune cells and model genes could help discover potential therapeutic targets. We plan to pursue applying the risk model to the diagnosis of early-rate LUAD.

Conclusions
Our study utilized the scRNA-seq data to identify the heterogeneous cell population in LUAD, applied the CIBER-SORTx algorithm to map the cell type into the bulk RNAseq, revealed the key role of immune cells, especially natural killer cells, in TIME, and constructed the 5-gene model with the robust prognostic prediction and potential to evaluate immunotherapy response.