Defining Diffuse Large B-Cell Lymphoma Immunotypes by CD8+ T Cells and Natural Killer Cells

Background There is a poor prognosis for diffuse large B-cell lymphoma (DLBCL), one of the most common types of non-Hodgkin lymphoma (NHL). Through gene expression profiles, this study intends to reveal potential subtypes among patients with DLBCL by evaluating their prognostic impact on immune cells. Methods Immune subtypes were developed based on CD8+ T cells and natural killer cells calculated from gene expression profiles. The comparison of prognoses and enriched pathways was made between immune subtypes. Following this validation step, samples from the independent data set were analyzed to determine the correlation between immune subtype and prognosis and immune checkpoint blockade (ICB) response. To provide a model to predict the DLBCL immune subtypes, machine learning methods were used. The virtual screening and molecular docking were adopted to identify small molecules to target the immune subtype biomarkers. Results A training data set containing 432 DLBCL samples from five data sets and a testing dataset containing 420 DLBCL samples from GSE10846 were used to develop and validate immune subtypes. There were two novel immune subtypes identified in this study: an inflamed subtype (IS) and a noninflamed subtype (NIS). When compared with NIS, IS was associated with higher levels of immune cells and a better prognosis for immunotherapy. Based on the random forest algorithm, a robust machine learning model has been established by 12 hub genes, and the area under the curve (AUC) value is 0.948. Three small molecules were selected to target NIS biomarkers, including VGF, RAD54L, and FKBP8. Conclusion This study assessed immune cells as prognostic factors in DLBCL, constructed an immune subtype that could be used to identify patients who would benefit from ICB, and constructed a model to predict the immune subtype.


Introduction
DLBCL, responsible for nearly 40% of non-Hodgkin lymphoma, is a hematological cancer of B cells [1,2]. Data on the global epidemiology of DLBCL are scarce, but it is estimated that 7 out of 100,000 people in America suffer from this disease [3]. For patients with DLBCL, chemotherapy agents are the first treatment choice [1]. Although about 65% of DLBCL patients could survive longer than 5 years [4], more than 30% of DLBCL patients still suffer from relapse and ineffective chemotherapy agents [5]. Considering that there are limited treatment options [6], key biomarkers and therapeutic targets are urgently needed.
PD-L1-positive malignant cells can suppress immune surveillance through a variety of mechanisms, one of which involves the decreased T-cell activity by the PD-1/PD-L1 pathway [11]. Compared with the PD-1-negative subgroup, the DLBCL subgroup with PD-L1+ has an unfavorable prognosis and a reduced overall survival (OS) [12]. An evaluation of the efficacy of pembrolizumab (PD-1 antibody) in combination with R-CHOP in untreated patients with DLBCL demonstrated a 90% overall response rate and a 77% complete response rate. [13]. An additional study revealed that out of five relapsed DLBCL patients, two of them achieved complete remissions through anti-PD-1 therapy and one of them achieved partial remission through anti-PD-1 therapy [14]. ere are currently several monoclonal antibodies being developed and evaluated for the treatment of DLBCL that target the PD-1/PD-L1 pathway [5].
Using 432 samples of DLBCL collected from five data sets in the current study, we identified two immune subtypes. In the training and testing data sets, we have analyzed the association between immune subtypes and prognosis, immune cells, and immune pathways. Using the random forest algorithm, 12 genes were selected for the construction of the machine learning model to predict immune subtypes for patients with DLBCL. e machine learning model was further tested through the use of an independent data set. We constructed a 12-gene panel to predict the prognosis of DLBCL patients and validated the prediction using a validation data set of DLBCL patients.

Data Collection and Identification of Immune Subtypes.
e training set consisted of 432 DLBCL samples and GSE11318 (N � 37) [15], GSE21846 (N � 29) [16], GSE23501 (N � 69) [17], GSE32918 (N � 249) [18], and TCGA-DLBCL (N � 48). For the validation of immune subtypes, the 420 DLBCL samples from GSE10846 were chosen as the data set to use for testing [19]. A summary of the demographic data of these DLBCL samples is listed in Table 1. e effect of immune subtypes on immune checkpoint blockade (ICB) response was studied using 65 tumor samples from GSE35640 [20]. All of these samples were retained, including the RNA expression matrix, clinical parameters, and survival data. e mutations of DLBCL samples from TCGA-DLBCL were downloaded from the R package "TCGAmutations". Tumor mutational burden (TMB) was calculated by dividing the number of nonsynonymous mutations by 38, where 38 is the estimate of the exome size. By using the "GSVA" package, the proportion of 28 types of immune cells was obtained by the expression matrix [21]. According to the values of natural killer cells as well as CD8 T cells, immune subtypes can be determined.

Differentially Expressed Gene (DEG) Identification and
Gene Set Enrichment Analysis (GSEA). We determined the log2FoldChange (FC) values between the inflamed subtype (IS) and noninflamed subtype (NIS) in each training data set using the R language package "limma" [22]. en, according to p value < 0.05 and |log2FoldChange| > 0.5 as the threshold, we used the "RobustRankAggreg" package to find the common and robust DEGs between NIS and IS samples [23]. e package RobustRankAggreg (RRA) could detect genes and proteins that rank consistently better than expected. It could also calculate a significance score for each gene/protein. is method was found to be robust to outliers, noise, and errors [23]. is method was also extensively investigated and used for selecting DEGs in previous articles [24,25]. In addition, we performed the enrichment analysis according to the log2FoldChange value of robust DEGs. e parameters of GSEA analysis were set as "minSize � 1", "maxSize � 1000", and "nperm � 500".

Gene Selection by Cox Regression Analysis and Random
Forest. A univariate Cox regression analysis was performed on the expression profiles of robust DEGs to identify prognostic genes, and immune subtype-related biomarkers were selected from robust DEGs by a cut-off of p value < 0.05 in the univariate Cox regression analysis. Random forest was trained in each training dataset to calculate the importance value of each gene. e top 5 downregulated genes and top 5 upregulated genes with the highest mean value of importance were selected.

2.4.
e Construction of the Immune Subtype Classifier. ree machine learning algorithms were used in this study, including random forest (RF), support vector machine (SVM), and artificial neural network (ANN). An automatic tuning process was used to adjust the parameters in a 5-fold cross-validation loop. In each loop, parameter values were chosen using a random search with 15 iterations, and model performance was assessed. e best classifier was selected by the area under the curve (AUC) value from 5-fold crossvalidation. e prediction performance was further evaluated by the AUC value from the testing data set.

Statistical
Analyses. R language software was used to analyze the data. e difference in continuous data across groups was analyzed using the t-test.
e Kaplan-Meier (KM) and log-rank analyses were used to analyze survival curves. In this study, p value < 0.05 was considered to be statistically significant.
2.6. Molecular Docking. Virtual screening and molecular docking were computational methods to identify potential small molecules that could target proteins [26]. A list of 1604 small-molecule drugs approved by the FDA was selected and downloaded from the ZINC15 database [27]. e protein structures of three selected proteins (VGF, RAD54L, and FKBP8) were obtained from AlphaFold [28]. AutoDock Vina, a virtual screening software, was used to select the small molecule with the lowest binding energy with the protein [29]. en, AutoDock, a semiflexible molecular docking software, was used to identify the binding pose of the selected small molecule with the protein [30]. After molecular docking, PyMOL software was used to visualize the binding structures of small molecules and proteins after docking.

Identification of Immune Subtypes.
e study flow diagram is shown in Figure 1. In GSE32918, 12 immune cell types were protective (Cox coefficient < 0) and 2 immune cells were hazardous (Cox coefficient > 0, Figure 2(a)). In GSE11318, natural killer T cells were found to be significantly positively related to prognosis. In GSE23501, immature B cells were found to be significantly positively related to prognosis. In TCGA-DLBCL and GSE21846, none of the immune cells had a significant impact on prognosis. e immune cell data from different data sets were then combined. In the combined data set, activated CD8 T cells, CD56 bright natural killer cells, effector memory CD8 T cells, natural killer cells, natural killer T cells, T follicular helper cells, and type 1 T helper cells had a lower p value (p value <0.01) with prognosis.
As two cytotoxic effector cells of the immune system, activated CD8 T cells and natural killer cells were chosen because they have been implicated in cancer immunotherapy. An association between the number of natural killer cells and the number of activated CD8 T cells was significant in the combined data set (correlation coefficient: 0.41; p value < 0.0001; Figure 2(b)). Using coordinate axes and diagonal, we obtained two stable immune subtypes: the inflamed subtype (IS) and the noninflamed subtype (NIS). Samples from IS had more CD8 T lymphocytes along with higher levels of natural killer cells that were both above the diagonal (Figure 2(c)). In contrast, the NIS samples were below the diagonal and had lower levels of activated CD8 T cells and natural killer cells. Using K-M analysis, we were able to determine the correlation between immune subtypes and survival. e overall survival of patients in NIS was significantly shorter than that of patients in subtype IS ( Figure 2(d)). In the testing data set, two immune subtypes were derived by the same method (Figure 2(e)). e NIS and IS were statistically different in terms of overall survival, with IS having a better prognosis and NIS suffering from a worse prognosis (Figure 2(f )). Comparing the percentage of pathological stages between the two immune subtype groups is not significant (p value � 0.145, Supplementary Table 1).

Comparison of Immune Cells and Immune Function among the Immune Subtypes.
e combined training data set indicated that the IS was more likely to show infiltration of most types of immune cells, such as T cells in the TME compared with the NIS (Figures 3(a)). However, B cells containing activated B cells (p value < 0.01), immature B cells (p value � 0.01), and memory B cells (p value � 0.63) were higher in the NIS. For a majority of immune functions, they were enriched in the IS (Figures 3(b)). But the B cell receptor signaling was higher in the NIS (p value <0.01), which is consistent with the results of immune cell infiltration. In the testing data set (GSE10846), the same results were observed. For example, greater levels of most immune cells ( Supplementary Figure 1 Besides, we also calculated the TMB distribution between the two immune subtypes (Supplementary Figure 2). Although the TMB value appears higher in the NIS, the difference in the TMB value between the two immune subtypes was not significantly different.
e expression values of PD-1 and PD-L1 were compared between the two immune subtypes in each data set. In Supplementary Figure 3, the PD-1 expression value was found to be significantly higher in the IS than NIS at GSE10846, GSE11318, GSE21846, GSE23501, and GSE32918 (p value < 0.05). In Supplementary Figure 4, the PD-L1 expression value was found to be significantly higher in the IS than NIS at GSE10846, GSE23501, GSE32918, and TCGA-DLBCL (p value < 0.05).

Identification of DEGs and Enrichment Analyses.
In each training data set, the log2FoldChange values for each gene were obtained by using the "limma" package.
us, the DEGs lists that contained gene names, log2FoldChange values, and p values were obtained. However, the potential DEGs that are crucial for DLBCL development will be hugely reduced if the DEGs lists from different data sets are directly merged. us, the RRA method was applied to combine the results from the five data sets with minimal bias. 2409 DEGs    Figure 5).
To obtain the enriched pathways, the GO-BP, GO-CC, GO-MF, REACTOME, and KEGG enrichment analyses were applied. For GO-BP analysis, the upregulated genes in the NIS were related to metabolic pathways such as "catabolic process", "organonitrogen-compound-metabolic process", and "cellular catabolic process" (Supplementary Table 2). e upregulated genes in the IS were associated with immune pathways such as "response-to-external stimulus", "defense response", and "positive regulation of adaptive immune response" (Supplementary Table 3). For GO-CC analysis, the upregulated genes in the NIS were related to "cell junction", "anchoring junction", and "nucleolus". e upregulated genes in the IS were associated with "extracellular space" and "extracellular matrix". For GO-MF analysis, the upregulated genes in the NIS were related to "RNA binding", "Poly-A-RNA binding", and "structural molecule activity". e upregulated genes in the IS were associated with "receptor binding", "identical protein binding", "transition metal-ion binding", and "cytokine activity". For KEGG analysis, the upregulated genes in the NIS were mainly related to "ribosome", "hypertrophic-cardiomyopathy-HCM", and "adherens junction". e upregulated genes in the IS were associated with immune pathways "natural-killer-cell-mediated cytotoxicity" and "complement and coagulation cascades". For REACTOME analysis, the upregulated genes in the NIS were related to metabolic pathways. e upregulated genes in the IS were enriched in immune pathways "innate immune system", "interferon-alpha-beta signaling", and "chemokine-receptors-bind-chemokines".

Gene Selection and Construction of the Immune Subtype
Classifier. A univariate Cox regression analysis was performed to further narrow down the 2409 DEGs. In total, 623 genes were found to be associated with prognosis, 157 of which were associated with poor prognosis and 466 of which were associated with good prognosis. e importance of these 623 genes was evaluated by using the random forest algorithm based on the importance evaluator. e top five most important genes in GSE11318 were FASLG, CCR5, GZMK, TMEM155, and GIMAP4 (Figure 4(a)). FGL2, CPVL, ITK, DUSP3, and FKBP8 comprised the top five genes in GSE21846 (Figure 4(b)). TNFSF13B, SH2D1A, RARRES3, CD2, and GZMK ranked as the top 5 important genes in GSE23501 (Figure 4(c)). In the analysis of GSE32918, LAP3, RARRES3, GZMK, IL2RB, and FCER1G were the top five genes involved (Figure 4(d)). e top 5 genes in the TCGA-DLBCL were FCER1G, SFXN3, CFB, TNFSF13B, and STOM (Figure 4(e)). e top 6 upregulated DEGs with the greatest mean importance value (GZMK, FCER1G, RARRES3, TNFSF13B, SH2D1A, and CCR5) and the top 6 downregulated DEGs with the highest importance value (VGF, RAD54L, TTC27, PAQR4, AP1S1, and FKBP8) were chosen for model creation (Figure 4(f )).
e expression values of these 6 upregulated (GZMK, FCER1G, RARRES3, TNFSF13B, SH2D1A, and CCR5) and 6 downregulated DEGs (VGF, RAD54L, TTC27, PAQR4, AP1S1, and FKBP8) in the IS and NIS from the testing data set are shown in Supplementary Figure 6. Furthermore, K-M survival curves were created to analyze the relationships between the expression levels of the 12 genes and OS. In the combined training data set, upregulated genes were associated with a better prognosis, while downregulated genes were associated with a worse prognosis ( Figure 5). As shown      Supplementary Figure 8. GZMK, FCER1G, RARRES3, TNFSF13B, SH2D1A, and CCR5 were found to be positively correlated with immune cells such as T cells, and these genes were negatively correlated with B cells. In contrast, VGF, RAD54L, TTC27, PAQR4, AP1S1, and FKBP8 were found to be negatively correlated with immune cells such as T cells, and these genes were positively correlated with B cells. As shown in Table 2, random forest (RF) yields an AUC of 0.908, support vector machine (SVM) yields 0.907, and artificial neural network (ANN) yields 0.898. e random forest was selected since it had the highest AUC value. e constructed random forest model reached an AUC of 0.948 in the testing data set (Figure 6(a)). Based on the expression data of GSE35640, the immune subtype of cancer patients who received the treatment of immunotherapy was predicted. We found that the response rate to immunotherapy for the IS was higher than the NIS (0.57 vs 0.19) (Figure 6(b)). Consequently, the constructed model can serve as a useful tool to select patients who are likely to benefit from immunotherapy.

Virtual Screening and Molecular Docking Analysis.
According to the virtual screening results from AutoDock Vina, the small molecules with the lowest free energy for each protein were selected. ZINC242548690, ZINC29416466, and ZINC203686879 were selected to target VGF, RAD54L, and FKBP8, respectively. Next, AutoDock and PyMOL were used to dock and visualize small molecules and proteins. e binding poses of protein-molecule complexes were ranked by the binding free energy. e 3D images of binding poses with the lowest energy are shown in Figure 7. In addition, 2, 1, and 2 H-bonds were found among these three protein-molecule complexes.      Journal of Oncology (ABCs) and germinal center B cells (GCBs) [31]. DLBCL subtypes have been or will be taken into consideration for the treatment of DLBCL. A prior study, for example, performed a thorough genetic analysis to identify five distinct DLBCL molecular subtypes [31]. e purpose of this study was to analyze the immune subtypes of DLBCL based on specific immune cells and to evaluate the reliability of the findings. Based on CT8 T cells and natural killer cells, two distinct subtypes were identified in our study: the inflamed subtype (IS) and the noninflamed subtype (NIS). e IS was associated with immune cells such as T cells. But the NIS was associated with B cells and B-cell-related pathways.

Discussion
Furthermore, survival analysis showed that IS had a better prognosis than the NIS. Several studies have revealed that the TME plays a significant role in the ICB therapy response rate [32]. In this work, supervised machine learning approaches were used to build models that could predict DLBCL patients' immune subtypes.
e impact of immune subtypes on ICB responsiveness was then proven. We discovered that the IS had a greater response rate to immunotherapy than the NIS (0.57 vs 0.19). As a result, our machine learning model may provide a method for selecting DLBCL patients for ICB therapy. For the NIS patients, six genes (VGF, RAD54L, TTC27, PAQR4, AP1S1, and FKBP8) were selected as the subtype biomarkers. Since three of them (VGF, RAD54L, and FKBP8) showed notable differences in prognosis from the testing data set, they were selected as the target proteins to identify the potential small-molecule drugs. Based on the virtual screening and molecular docking analysis, three small molecules were finally selected as the novel therapeutic drugs for NIS patients. e NIS might be sensitive to three selected small molecules: ZINC242548690, ZINC29416466, and ZINC203686879. ZINC242548690 (digoxin) is a cardiac glycoside, but many studies suggested it could increase the effect of anticancer therapy [33]. ZINC29416466 (saquinavir) is an available human immunodeficiency virus protease inhibitor and could inhibit proteasome activity in mammalian cells as well as act on the HIV-I protease [34]. Saquinavir was found to induce apoptosis in human cancer cells and could become a new class of cytotoxic chemotherapy drugs [34,35]. ZINC203686879 (velpatasvir) is one of the hepatitis virus inhibitors [36], and studies are needed to validate the effect of velpatasvir on tumor cells. is study aims to provide novel therapies to the need for personalized and precise treatment for DLBCL patients.
ere are certain limitations to our research. In vitro and in vivo testing should be done on the effects of ZINC242548690, ZINC29416466, and ZINC203686879 on DLBCL tumor development. Additionally, the immune cells in this study were solely predicted by the R GSVA package.
ere will be a more precise evaluation of immune cells if experiments or multiple bioinformatics methods can be conducted.

Conclusion
Our study discovered the correlations of immune cells with prognosis. Based on the CD8 T cell and natural killer cells, DLBCL samples were divided into NIS and IS. Multiple cohorts evaluated and confirmed the associations of immune subtypes with prognosis and ICB therapy responsiveness. In conclusion, we constructed an accurate and robust machine learning model that may facilitate the prediction of immune subtypes and DLBCL patient selection for ICB treatment.
Data Availability e data sets were downloaded from the open-access data sets without limitations, and the resources are listed in "Data Collection" and "Identification of Immune Subtypes" of the "Materials and Methods" section.

Ethical Approval
All the expression data and clinical information were retrieved from publicly available data sets, which were free to download and analyze without limitations. Investigators of each study obtained approval from their local ethics committee and informed patient consent.

Supplementary Materials
Supplementary  Table 2. e enriched GO-BP, GO-CC, GO-MF, KEGG, and REACTOME pathways associated with DEGs of the NIS predicted by GSEA analysis. Supplementary Table 3. e enriched GO-BP, GO-CC, GO-MF, KEGG, and REACTOME pathways of the IS predicted by GSEA analysis. (Supplementary Materials)