A Cell Differentiation Trajectory-Related Signature for Predicting the Prognosis of Lung Adenocarcinoma

Objective To screen the cell differentiation trajectory-related genes and build a cell differentiation trajectory-related signature for predicting the prognosis of lung adenocarcinoma (LUAD). Methods LUAD single cell mRNA expression profile, TCGA-LUAD transcriptome data were obtained from GEO and TCGA databases. Single-cell RNA-seq data were used for cell clustering and pseudotime analysis after dimensionality reduction analysis, and the cell differentiation trajectory-related genes were acquired after differential expression analysis conducted between the main branches. Then, the consensus clustering analysis was carried out on TCGA-LUAD samples, and the GSEA analysis was performed, then the differences on the expression levels of immune checkpoint genes and immunotherapy response were compared among clusters. The prognostic model was constructed, and the GSE42127 dataset was used to validate. A nomogram evaluation model was used to predict prognosis. Results Two subsets with distinct differentiation states were found after cell differentiation trajectory analysis. TCGA-LUAD samples were divided into two cell differentiation trajectory-related gene-based clusters, GSEA found that cluster 1 was significantly related to 20 pathways, cluster 2 was significantly enriched in three pathways, and it was also shown that clusters could better predict immune checkpoint gene expression and immunotherapy response. A six cell differentiation-related genes-based prognostic signature was constructed, and the patients in the high-risk group had poorer prognosis than those in the low-risk group. Moreover, a nomogram was constructed based on the prognostic signature and clinicopathological features, and this nomogram had strong predictive performance and high accuracy. Conclusion The cell differentiation-related signature and the prognostic nomogram could accurately predict survival.


Introduction
Lung cancer is a common cancer with the highest incidence rate and mortality rate worldwide [1]. Te incidence rate and mortality rate of lung cancer in China have been increasing in recent years, and it is reported that the incidence and mortality of lung cancer in China are 17.9% and 23.8% in 2020, respectively [2]. Lung cancer can be categorized into small cell lung cancer and nonsmall cell lung cancer according to the diferent histopathological characteristics [3]. Lung adenocarcinoma (LUAD) is the most common type of lung cancer, which belongs to nonsmall cell lung cancer, accounting for 45.5% of lung cancer [4]. LUAD is also a heterogeneous tumor, and its biological behavior is afected by a complex intracellular gene regulatory network. Although the treatment of lung cancer has made good progress in recent years, the survival rate of LUAD is still not ideal, and the fve-year survival rate of LUAD is only 20-30% [5,6].
In the process of cell division and diferentiation, owing to the reprogramming of genome and epigenome and DNA replication errors, individual cells can present diferent genomes, transcriptome and epigenome [7]. Te development of single-cell RNA-seq ofers an opportunity to comprehensively describe genetic complexity at the cellular level, containing copy number variations, gene fusions, gene expression levels, etc. [8]. Cell diferentiation is closely related to tumorigenesis, with certain genes potentially functioning as diferentiation regulators. Te ability of cells to respond to cell-cell contact by restraining cell circulation and further circulation depends on diferentiation. Tis ability loss is a feature hallmark in tumorigenesis which is due to a failure of diferentiation [9,10]. However, the cell diferentiation-related genes and cell diferentiation-related signature in LUAD have been rarely reported.
Tus, in this study, based on the LUAD single-cell mRNA expression profle and TCGA-LUAD transcriptome data, we screened the cell diferentiation trajectory-related genes and built a cell diferentiation trajectory-related signature for predicting the prognosis of LUAD. Tis study provides novel insights and therapeutic targets for LUAD.

Data Collection.
Te LUAD single-cell mRNA expression profle of the GSE149655 dataset was obtained from the GEO database, among which two LUAD samples were analyzed in this study, containing GSM4506699 and GSM4506701. Besides, the TCGA-LUAD transcriptome and clinical data were acquired from the UCSC database (http:// genome.ucsc.edu/) to perform consensus clustering analysis and construction of the prognosis model, and the GSE42127 dataset including 176 LUAD samples was used as the validation set of the prognostic model.

Data Preprocessing and Principal Component Analysis
(PCA) Dimensionality Reduction Analysis. Te quality control and statistical analysis of single-cell RNA-seq data were conducted utilizing the Seurat package [11,12] in R. Filter according to the following quality control standards: (a) genes found in <5 cells were fltered out; (b) cells with a total number of detected genes <200 were fltered out; (c) cells with mitochondrial gene expression ≥25% were fltered out. Te data in the GSE149655 dataset were logarithmically normalized, and the PCA was used to determine the available dimensions and screen-related genes.

Cell Clustering and Pseudotime Analysis.
Unsupervised cluster analysis of cells was carried out by using the function of FindNeighbors and FindClusters in the Seurat package, and the nonlinear dimensionality reduction method was used. In addition, the scCATCH [13] in R was employed to annotate the cell type in the cluster. Moreover, the pseudotime on single-cell data was performed using Monocle [14]. Ten, the diferential expression analysis was conducted between the main branches with the cutof value of |log2FC| ≥ 1 and P < 0.01, and the cell diferentiation trajectory-related genes were acquired, which were considered as the marker genes.

Consensus Clustering Analysis on TCGA Samples.
Te consensus clustering analysis on the TCGA samples (based on the marker genes in each branch) was carried out using the ConsensusClusterPlus package [15] with the parameters of clusterAlg � 'km dist', distance � 'euclidean', and repeat times � '1000'. Te Chi square test was utilized to evaluate the signifcant diferences in the distribution of clinical characteristics (tumor stage and age) of diferent subtypes. (GSEA). Total 51 hallmark gene sets were downloaded from the molecular signatures database (MSigDB), then clusterprofler (version: 3.8.1, http://bioconductor.org/packages/release/bioc/html/ clusterProfler.html) in R was employed to determine whether any signatures were enriched in two clusters by GSEA analysis based on TCGA-LUAD expression matrix. Signifcantly enriched hallmarks were chosen according to a P value < 0.05.

Immune Checkpoint Genes and Immunotherapy
Response across Two Clusters. Te diferences of the six immune checkpoint expression levels among diferent subtypes were evaluated. Te possibility of subtypes responding to immunotherapy was predicted using the submap algorithm in TIDE [16] and GenePattern [17] websites.

Generation and Validation of a Prognostic Model.
Te single-factor Cox proportional hazards regression analysis was performed on the marker genes in TCGA-LUAD, and the genes with the cutof value of P < 0.05 were screened. Ten, multivariate Cox analysis and stepwise regression were carried out to identify the prognosis-related genes, and after that the prognostic model was built. To assess the prognosis value of the prognostic model, the risk score was calculated using predict function with the following formula: riskscore � h 0 (t) * exp(β 1 X 1 + β 2 X 2 + . . . + β n X n ) (β: regression coefcient, h 0 (t): benchmark risk function, exp: the nth power of natural number). Te patients were categorized into high-risk and low-risk groups based on the median risk score value. Besides, the GSE42127 dataset was utilized to validate the prognostic model.

Quality Control and PCA Dimensionality Reduction
Analysis. After fltering according to the quality control standards, a total of 2,962 cells were included in this study. Te number of detected genes was correlated with sequencing depth, with a total of 20,240 corresponding genes (Figure 1(a)). Ten, the PCA was used to determine the available dimensions and identify the related genes and a total of 20 principal components (PCs) with the threshold of P < 0.05 (Figure 1(b)).

Cell Clustering and Pseudotime Analysis.
Unsupervised cluster analysis of the 2,962 cells was carried out by using the function of FindNeighbors and FindClusters in the Seurat package. Ten, the FindClusters tool was used to cluster the cells (Figure 2(a)), and the heatmap of the top 20 genes in the cell cluster is shown in Figure 2(b). In addition, scCATCH in R was employed to annotate the cell type in clusters (Table 1; Figure 2(c)). Besides, two main branches (branch I and branch II) were obtained ( Figure 2(d)), and the cell diferentiation trajectory-related genes in each branch were acquired after differential expression analysis, which were considered as the marker genes. Ten, the 38, 93 marker genes in branch I and branch II were obtained, respectively, and the total of 131 marker genes were used for subsequent analysis.

Two Cell Diferentiation Trajectory-Related Gene-Based
Clusters from TCGA Dataset. Based on the expression pattern of 131 marker genes, the consensus clustering analysis was carried out on TCGA samples by utilizing the ConsensusClusterPlus package, and total two clusters were obtained, including cluster 1 and cluster 2 ( Figure 3(a)). Te Kaplan-Meier analysis showed that the overall survival (OS) of LUAD patients in cluster 1 was higher than that in cluster 2 ( Figure 3   the relationships of clusters and clinic information is shown in Figure 3(c). GSEA found that cluster 1 was signifcantly related to 20 pathways, including allograft rejection, epithelial mesenchymal transition, IL6-JAK-STAT3 signaling, and so on, and the top 5 involved pathways are shown in Figure 3(d). Cluster 2 was signifcantly enriched in three pathways, containing MYC-targets-V2, spermatogenesis, and unfolded protein response (Figure 3(e)).

Expression Pattern of Immune Checkpoints and Immunotherapy Response.
Te expression levels of six immune checkpoints among diferent subtypes were compared, and the results are shown in Figure 4(a); the expression levels of the six immune checkpoint genes signifcantly increased in cluster 1 when compared to that in cluster 2. In addition, the possibility of subtypes responding to PD1 and CTLA4 inhibitors was predicted using the submap algorithm in TIDE and GenePattern websites, and the result showed that LUAD

Construction of a Nomogram for Predicting Patient OS.
To explore the relation between clinicopathological features and the prognosis model, age, sex, MNT, stage, and the risk score in TCGA-LUAD samples were analyzed, and the result uncovered that stage and the risk score were independent prognostic factors in patients with LUAD (P < 0.01; Table 2). Besides, a nomogram was built with the stage and risk score (Figure 6(a)), and the calibration plots revealed that the nomogram might accurately estimate the mortality (Figure 6(b)).

Discussion
It has been suggested that cancer should be regarded as a disease of cell diferentiation [10]. Tus, this study aimed to screen the cell diferentiation trajectory-related genes and built a cell diferentiation trajectory-related signature for predicting the prognosis of LUAD. A total of 131 cell differentiation trajectory-related genes were obtained, then consensus clustering analysis was conducted on TCGA samples, and total two clusters were obtained, including cluster 1 and cluster 2. Te Kaplan-Meier analysis showed that the OS of LUAD patients in cluster 1 was higher than that in cluster 2. GSEA found that cluster 1 was signifcantly related to 20 pathways, including allograft rejection, epithelial mesenchymal transition, and IL6-JAK-STAT3 signaling, and cluster 2 was signifcantly enriched in three pathways, containing MYC-targets-V2, spermatogenesis, and unfolded protein response. Te cell-biological program termed the epithelial mesenchymal transition plays an important role in both development and cancer progression [18]. Numerous studies have found that the IL6-JAK-STAT3 signaling pathway was activated abnormally in a variety of tumor tissues, which has an immense infuence on tumor progression [19][20][21]. Schulze et al. have revealed that the MYC target V2 scores are associated with tumor aggressiveness and survival outcomes in ER-positive primary tumors, as well as in metastatic breast cancer [22]. Zhou et al.
found that the levels of spermatogenesis-associated protein increased signifcantly in the serum of patients with lung cancer compared with those in healthy controls [23]. Te unfolded protein response is a prosurvival mechanism triggered by accumulation of unfolded or misfolded proteins in the endoplasmic reticulum, and unfolded protein response signalling plays important roles in cancer progression [24]. Besides, the emergence of immunotherapy makes people have a new understanding of the treatment of tumor, and immune checkpoints have become a potential and effective treatment [25,26]. In this study, the expression levels of the six immune checkpoint genes signifcantly increased in cluster 1 when compared to that in cluster 2. In addition, LUAD patients in cluster 1 was more sensitive to anti-PD1 therapy, suggesting that cluster 1 has the potential to determine the specifc LUAD patients who are immunogenic and more responsive to immune checkpoints. Based on the expression matrix of 131 marker genes in TCGA-LUAD, six prognosis-related genes were used to construct the prognostic model, containing CD69, CLIC6, CTSL, EPHX1, LMO3, and MS4A7. CD69 is the earliest cell surface marker of activated T cells [27]. Cibrián et al. showed that CD69 regulates the regulatory T (Treg) cell diferentiation and the of IFN-c, IL-17, and IL-22 secretion [28]. Martín et al. found that CD69 related to Jak3/Stat5 proteins regulates T17 cell diferentiation [29]. de la et al. have revealed that CD69 participates in immune cell homeostasis and regulates T cellmediated immune response by controlling T17 cell diferentiation [30]. Previous studies have found that CLIC6 is overexpressed in breast cancer and endometrial cancer [31,32]. CTL upregulation is common in a variety of human tumors and has been widely correlated to metastasis, invasiveness, and poor prognosis [33]. EPHX1 plays signifcant roles in the detoxifcation and activation of tobacco-derived carcinogens, as well as lung cancer, and the low activity genotype of EPHX1 gene is related to the reduction of lung cancer risk in whites [34]. Sun [38]. Tus, we suspected that these six cell diferentiation trajectory-related genes played vital roles in the LUAD patient survival.

Genetics Research
Besides, the prognostic model could efciently stratify the patient outcomes and was verifed in an independent dataset. Patients in the high-risk group had poorer prognosis than those in the low-risk group. Te AUCs survival times at 1-year, 3-year, and 5-year were 0.706, 0.685, and 0.639 for TCGA dataset and 0.76, 0.65, and 0.616 in the GSE42127 dataset, suggesting that the performance of the gene signature was reliable. In addition, the stage and risk score were independent prognostic factors in patients with LUAD for clinical-decision support. Moreover, a nomogram was built with the stage and risk score. Because of its intuitive visual performance and personalized application, the nomogram has become a popular tool for tumor prognosis [39,40]. Consistently, in this study, the nomogram might accurately estimate the survival probabilities for LUAD patients.
Also, this study has numerous limitations. First, the data analyzed in this study were acquired from public databases, and external validation was needed. Second, the clinical features obtained from TCGA database are limited, and potential prognostic factors, such as smoking, targeted drug therapy, and personal history, should be considered in this study. Tird, the six cell diferentiation-related genes and the prognostic model analyzed in this study are needed to validate in clinical samples.

Conclusion
In summary, this study has constructed a reliable prognostic risk model that is closely associated with cell diferentiation trajectory, which can better predict survival and provide insights into potential markers for therapeutic strategies in lung cancer patients.

Data Availability
Te data that support the fndings of this study are available in TCGA and GEO databases.

Conflicts of Interest
Te authors declare that they have no conficts of interest.

Authors' Contributions
Conception and design of the research were conducted by TZ; acquisition of data was carried out by FY and JZ; analysis and interpretation of data were conducted by YZ and TZ; statistical analysis was performed by XH; drafting of the manuscript was performed by FY; revision of the manuscript for important intellectual content was conducted by YZ. All authors read and approved the fnal manuscript.