Machine Learning Analysis of Immune Cells for Diagnosis and Prognosis of Cutaneous Melanoma

Tumor infiltration, known to associate with various cancer initiations and progressions, is a promising therapeutic target for aggressive cutaneous melanoma. Then, the relative infiltration of 24 kinds of immune cells in melanoma was assessed by a single sample gene set enrichment analysis (ssGSEA) program from a public database. The multiple machine learning algorithms were applied to evaluate the efficiency of immune cells in diagnosing and predicting the prognosis of melanoma. In comparison with the expression of immune cell in tumor and normal control, we built the immune diagnostic models in training dataset, which can accurately classify melanoma patients from normal (LR AUC = 0.965, RF AUC = 0.99, SVM AUC = 0.963, LASSO AUC = 0.964, and NNET AUC = 0.989). These diagnostic models were also validated in three outside datasets and suggested over 90% AUC to distinguish melanomas from normal patients. Moreover, we also developed a robust immune cell biomarker that could estimate the prognosis of melanoma. This biomarker was also further validated in internal and external datasets. Following that, we created a nomogram with a composition of risk score and clinical parameters, which had high accuracies in predicting survival over three and five years. The nomogram's decision curve revealed a bigger net benefit than the tumor stage. Furthermore, a risk score system was used to categorize melanoma patients into high- and low-risk subgroups. The high-risk group has a significantly lower life expectancy than the low-risk subgroup. Finally, we observed that complement, epithelial-mesenchymal transition, and inflammatory response were significantly activated in the high-risk group. Therefore, the findings provide new insights for understanding the tumor infiltration relevant to clinical applications as a diagnostic or prognostic biomarker for melanoma.


Introduction
Melanoma is the most aggressive type of cutaneous cancer derived from the melanocyte lineage, with the highest metastasis and mortality rate [1]. Despite melanoma contributing to only 5% of all skin-related cancers, it accounts for approximately 80% of deaths related to skin tumors. Like most other cancers, surgical enucleation and drug therapy are difficult to treat once it has metastasized [2][3][4]. In addition, it is difficult to detect early, and the majority of patients with melanoma were diagnosed at an advanced stage [5,6]. Currently, cancer treatment guidance and prognosis prediction are largely determined by the TNM staging system. However, the clinical experience revealed that many patients, even within the same TNM stage, have differences in overall survival [7]. e clinical limitations of the TNM stage are increasingly becoming apparent. us, it is critical to identify novel biomarkers for early diagnosis and prognostic prediction.
Growing studies have recently reported that the tumor microenvironment plays a crucial role in the initiation and development of numerous malignant tumors [8,9]. In the tumor microenvironment, the type, location, and function of immune cells are intimately associated with the clinical outcome [10,11]. Tumor cells regarded as antigens will attract immune cells and leukocytes by many chemokines to influence the immune response. Moreover, the immune escape of a tumor cell was considered a crucial factor in tumorigenesis [12][13][14]. e prognostic value of the tumor microenvironment also had been demonstrated in numerous melanoma experiments. High immune infiltration in melanoma is also shown to be associated with a favorable prognosis. Currently, immune checkpoint inhibitors, such as CTLA-4, PD-1, and PD-L1 inhibitors, are pivotal therapies for the treatment of advanced melanoma [15,16]. erefore, it is promising to explore the tumor microenvironment-associated potential novel biomarkers for early diagnosis, prognosis prediction, and melanoma patient management. Besides tumor cells, melanoma also commonly includes various types of immune cells, which may be regarded as a potential diagnostic signature to classify tumors from suspected patients. us, systematically evaluated infiltrating immune cells were recognized as significant supplemental biomarkers to the TNM stage for diagnosis and prognosis prediction. Fortunately, the numerous transcriptome profiles deposited in the availability of public databases could provide immense data to investigate the infiltration of immune cells in melanoma.
Machine learning is a powerful tool to analyze and summarize complex datasets, which can provide various computational approaches to predict clinical diseases [17]. Previously, several algorithms have been successfully applied to diagnose and predict diseases, including logistic regression model [18], support vector machine [19], random forests analysis [20], and artificial neural network [21]. Compared to classical methods, machine learning often has remarkably high sensitivity and specificity. Also, machine learning is the best choice to process the increasingly growing genomic data and clinical information in oncology research and predict the susceptibility, survival, and recurrence of cancer. erefore, in this research, we, firstly, estimate the proportions of 24 immune cells in 944 samples (653 tumors and 291 normal controls) according to their gene expression at the mRNA level. Next, we use multicategory machine learning to identify several important immune cell signatures and construct diagnostic and prognostic models, which manifest important implications in terms of melanoma patient diagnosis and prognosis.

Melanoma Collection and Normal Controls.
Melanoma patients were collected from public datasets. e eligible datasets were downloaded from the GEO database (https:// www.ncbi.nlm.nih.gov/geo) and the UCSC Xena website (http://xena.ucsc.edu/public-hubs/). Finally, five datasets consisting of 944 samples, including TCGA, GSE3189, GSE15605, GSE46517, and GSE54467, were screened out for this study. R software was used to process the raw transcriptome profiles of RNA sequencing data. To begin, the probe IDs were annotated using platform annotation metadata. e median expression value will be generated to reflect the gene expression level for the same gene corresponding to multiple IDs. en, genes with a variance of 0 will be eliminated for their low expression level. After that, log2 (x + 1) conversion was used to normalize the raw matrix data.

Estimation of Immune Cell
Types. ssGSEA algorithm was applied to estimate the proportions of immune cells and transform the normalized gene expression data into 24 human immune cell types expression, which included the dendritic cells (DCs), immature DCs (iDC), activated DCs (aDC), plasmacytoid DCs (pDC), natural killer (NK) cells,  CD56dim NK cells, CD56bright NK cells, Mast cells,  macrophages, neutrophils, eosinophils, B cells, cytotoxic  cells, and T cells. Also, the T cells were subdivided into the T  central memory cells (Tcm), T effector memory cells (Tem),  CD8 T cells, regulatory T cells (Treg), Tgd cells, T follicular  helper cells (TFH), and T helper cells, namely 1, 2, and 17 [22,23].

Diagnostic Analysis.
Firstly, these samples were categorized into tumor tissue groups and normal tissue groups. en, different analyses of immune cells between the tumor and normal cells were carried out, and the p values less than 0.05 were regarded as the differently-expressed immune cells (DEICs). en, we used the Upset plot to explore the overlap of DEICs among multiple datasets. To develop a diagnostic model with selected DEICs, five machine learning methods, including random forests (RF), logistic regression (LR), support vector machines (SVM), neural network (NNET), and least absolute shrinkage and selection operator (LASSO), combined with five-fold cross-validation, were systematically performed to construct the models in the TCGA dataset. e specificity and sensitivity of the diagnostic models were assessed by the receiver operating characteristic (ROC) curves. Principal component analysis was applied to determine whether these DEICs could definitely classify tumors from normal controls. e diagnostic model was also validated in another three independent datasets. Besides, the diagnostic score of each sample was calculated using the LR coefficients and corresponding expression level. e formula is N i�1 (coef i × exp r i ), which could well distinguish the normal and tumor tissues. Furthermore, immunohistochemistry was performed to compare the different infiltrating immune cells between melanoma and normal skin tissues. We estimated the immunohistochemical images by combining the percentage of positively stained cells with the staining intensity score.

Prognostic Analysis.
To explore the most significant immune cells in the prognosis of melanoma, the qualified melanoma samples in TCGA were equally divided into training and testing samples at random. Multicategory machine learning methods, including LASSO, SVM-recursive feature elimination (SVM-RFE), and RF-feature selection (RF-FS), were conducted to identify the important immune cells in the training dataset. en, the Cox regression method was performed to develop a prognostic model with selected immune cells.
e risk scores were generated by the formula: N i�1 (coef i × exp r i ), where N stands for the number of cells, expr i indicates the expression of cells, and coef i indicates the coefficient of cox regression. e patients in the training dataset were subsequently estimated by the risk formula, and then, the patients were divided into high-and low-risk groups based on the best cutoff risk score. Kaplan-Meier survival curves were created to assess the differences between the high-and low-risk groups, and log-rank tests were used to determine the significance. e predictive accuracy of the model for 5-year overall survival was estimated by the area under the curve (AUC) of receiver operating characteristic curves (ROC). Moreover, to demonstrate the result's robustness, the immune cell-related signature was further validated in the testing dataset and GSE54467. Finally, nomograms were constructed in this study based on patients' clinical characteristics and risk scores. To compare the prediction and actual survival, calibration curves were drawn. Decision curves were also plotted to discriminate the clinical usefulness of the nomogram and tumor stage.

Stratified Analysis.
To investigate the connection between the risk score distribution and clinical features, the stratified analysis of clinical features, containing age, stage, gender, race, vital status, and tumor status, was conducted. Moreover, the univariate and multivariate cox regressions were carried out to evaluate the prognostic value of the risk score and clinical features. Next, to study the possible biological characteristics between the high-and low-risk groups, gene expression data correlated to immune checkpoint regulators and epithelial-mesenchymal transition (EMT) were investigated [24][25][26][27]. Firstly, the expression data of these genes were extracted, and then, they were classified into high-and low-risk groups using the optimal cutoff value. en, stratified analyses of the corresponding genes were conducted.
2.6. Gene Set Enrichment Analysis. GSEA was performed to discover the significant pathways enriched in molecular mechanisms of Low-vs. high-risk groups using the "clus-terProfiler" package in the R software. Firstly, all genes were produced by the "Limma" differential analysis of low-vs. high-risk groups and preranked using the log2 fold change of the expression values. en, the cancer hallmark set (h.all.v7.0.symbols) and the KEGG set (c2.cp.kegg.v7.0.symbols) in GSEA were performed to explore the significant pathways associated with different groups of melanoma. e random sample permutations were 1000, and the q value <0.05 was the significance threshold.

Statistical Analysis.
Every statistical analysis was executed using the R package (v.3.6.0) and corresponding packages.
e Upset plot was drawn by the "UpSetR" package. e ssGSEA method was estimated by the "GSVA" package. LASSO and LR analysis were calculated by the "glmnet" package. SVM and SVM-RFE methods were conducted by the "e1017" package. RF and RF-FS algorithms were applied by "randomForest" and "varSelRF" packages, respectively. e NNET method was performed by the "nnet" package. e optimal cutoff value was generated by applying the "survminer" package. Kaplan-Meier survival curves and ROC curves were drawn by "survival" and "survivalROC" packages, respectively. GSEA analysis was performed by the "clusterProfiler" package. In all statistical tests, p < 0.05 was considered statistically significant.

Melanoma Collection and Normal Controls.
Totally, 944 samples were selected for the subsequent analysis, which were acquired from the five datasets, including the TCGA of melanoma, GSE3189, GSE15605, GSE46517, and GSE54467. e TCGA of melanoma was obtained from the UCSC Xena database, which included 372 melanomas and 233 healthy controls. GSE3189 contained 45 melanoma tumors and 25 normal controls. GSE15605 included 58 melanoma tumors and 16 normal healthy controls. GSE46517 contained 121 samples, which included 104 melanoma tumors and 17 normal controls. GSE54467 dataset had 74 melanoma samples alone with no healthy control. Moreover, 870 samples obtained from the TCGA of melanoma, GSE3189, GSE15605, and GSE46517 were used for diagnostic analysis. 446 melanoma samples obtained from the TCGA of melanoma and GSE54467 were conducted for prognostic analysis.
e complete analysis workflow in this study is illustrated in Figure 1.

Differentially Expressed Immune Cells (DEICs).
Firstly, the 24 immune cell expression matrix was calculated by ssGSEA. According to the standard of differential analysis, 19 DEICs were identified in TCGA, where 11 cells were highly infiltrated and 8 cells were lowly infiltrated ( Figure 2(a)). 13 DEICs contained 4 highly infiltrated and 9 lowly infiltrated cells and were discovered in GSE3189 ( Figure 2(b)). 11 DEICs were observed in GSE15605, which consisted of 6 highly infiltrated cells and 5 lowly infiltrated cells (Figure 2(c)). Besides, 7 highly infiltrated cells and 6 lowly infiltrated cells were identified in GSE46517 ( Figure 2(d)). Eventually, 6 overlaps of DEICs were found in the four datasets ( Figure 2(e)).
ese immune cells, including iDC, DC, Eosinophils, NK CD56bright cells, Mast cells, and Treg, were selected for subsequent research.

Constructing Diagnostic Model.
To construct a diagnostic model by applying the 6 identified DEICs and assess the effectiveness, five machine learning algorithms were comprehensively conducted to diagnose the melanoma samples from the normal healthy controls. Additionally, we also applied 5-fold cross-validation to estimate the accuracy of each model in the TCGA dataset.
ese diagnostic models were also validated in GSE3189, GSE15605, and GSE46517. Similarly, the ROC curves suggested over 90% AUC to identify melanomas from normal patients in all datasets (Figures 2(g)-2(i)). Principal components analysis illustrated that tumors and normal controls could be well distinguished according to the expression of six DEICs (Figure 2(j)). Next, we built a diagnostic score model with these DEICs using the LR method and used the diagnostic formula to calculate the score of each sample. e distributions of diagnostic scores in melanoma and healthy control were revealed in Figure 2(k). e violin plots manifested that the diagnostic values were differently distributed significantly in tumor samples and normal samples. To prove the results from the database, we used immunohistochemistry to confirm the infiltration of these immune cells. It also increased in melanoma compared to normal skin samples (Figures 3(a) and 3(b)).

Developing Prognostic Model.
After eliminating patients with no survival information, 432 melanoma samples were downloaded from the TCGA of melanoma and GSE54467. Firstly, the TCGA of the melanoma dataset was classified into training samples (N � 179) and testing samples at random (N � 179). e statistical results of the training and testing sample clinical information are displayed in Table 1    and no differences were observed between the two datasets. Next, combining the feature selection results of the LASSO method (Figure 4(a)), RF-FS method (Figure 4(b)), and SVM-RFE method (Figure 4(c)) showed that four overlapping immune cells were selected out (Figure 4(d)). en, we used these immune cells in the training dataset to develop a risk score system by Cox regression. Next, the risk model rendered a risk score for each sample. e risk score distributions, overall survival (OS) time, vital status, and the corresponding expression of immune cells in the training (Figures 4(e)-4(g)), testing (Figures 4(h)-4(j)), and GSE54467 (Figures 4(k)-4(m)) datasets were respectively shown. en, we used the optimal cut-off value to classify melanoma patients in the training dataset into high-or lowrisk groups. e curves of Kaplan-Meier (KM) survival analysis manifested that the high-risk patient has a shorter survival time than a low-risk patient with a log-rank test p � 0.003 ( Figure 5(a)). e ROC curves manifested that the 5-year of AUC was 0.664 ( Figure 5(b)). Moreover, to verify the robustness and applicability of the result, validation tests were performed in the testing set and the GSE54467 set. e testing and GSE54467 sets were classified into subrisk (high or low) groups accordingly. KM curves indicated that lowrisk patients had significantly longer survival time than highrisk patients, regardless of being tested with log-rank p < 0.001 ( Figure 5(c)) and GSE54467 with log-rank p � 0.002 ( Figure 5(e)). e 5-year of AUC in testing was 0.832 ( Figure 5(d)) and the 5-year of AUC in GSE54467 was 0.729 ( Figure 5(f )).

Nomogram Building and Validating.
To supply a simple and accurate method for OS prediction, the nomogram was built on the basis of clinical information and risk scores of patients in the training dataset ( Figure 6(a)). e points of each parameter were then added to get a total point, which can predict the likelihood of OS at 3 and 5 years. In comparison to the ideal model, the calibration plots indicated that the nomogram worked well (Figure 6(d)). Moreover, similar nomograms were also constructed in the testing ( Figure 6(b)) and GSE54467 (Figure 6(c)) datasets to prove the results. Surprisingly, the calibration plots in the testing (Figure 6(e)) and GSE54467 (Figure 6(f )) datasets for nomogram predicting 3-and 5-years' OS also worked well in comparison with the ideal model. Similarly, the nomograms' decision curve indicated that the nomogram model offered a bigger net benefit and a better clinical utility than the tumor stage, no matter in the training (Figure 6(g)), testing ( Figure 6(h)), and GSE54467 ( Figure 6(i)) datasets.

Relationships between Risk Model with Clinical Features and Gene Phenotypes.
e associations between the risk model and clinical features were explored, and the violin plot manifested that the risk score only associated with the vital status and tumor status (Figure 7(a)). Other clinical characteristics, such as age, gender, race, and stage, had no effect on the risk score. Furthermore, univariate and multivariate Cox regressions were used to compare the prognostic value of the risk score and clinical features in training, testing, and GSE54467 datasets (Table 2). e univariate Cox analysis revealed that age, tumor status, stage, and risk sore were significantly associated with overall survival, however, the multivariate Cox analysis indicated that only the risk score was associated with OS significantly and could be considered an independent risk factor in training (HR � 3.517, p � 0.005), testing (HR � 1.869, p � 0.042), and GSE54467 datasets (HR � 2.661, p < 0.000). To investigate the correlations between the risk model and selected immune checkpoint-related genes, the subgroup analysis of immune checkpoint-related genes was performed. e violin plot revealed that CD28, CTLA4, ICOS, PDCD1, TIGIT, CD274, CD226, CD40, and CD40LG in the high-risk group had a higher expression value than those in the low-risk group (Figure 7(b)). Interestingly, the subgroup analysis of EMTrelated genes showed that a majority of the EMT-related genes were differently expressed between the high-and lowrisk groups. e genes expression levels of CTNNB1, FGF2, EGFR, SNAI2, ZEB1, CXCL12, SNAI1, and PDGFB were significantly higher in the high-risk group (Figure 7(c)).

Gene Set Enrichment Analysis.
According to the selection standard, multiple significant cancer hallmark pathways were enriched, such as allograft rejection, complement, EMT, and inflammatory response (Figure 7(d)). Additionally, KEGG enrichment showed that complement and coagulation cascades, natural killer cell-mediated cytotoxicity, ECM receptor interaction, and T cell receptor signaling were positively active in the high-risk group (Figure 7(e)).

Discussion
In recent years, melanoma patients are becoming younger, with more advanced metastasis and a higher risk of death. Despite numerous advanced therapeutic methods being used to treat melanoma, such as chemotherapy, radiotherapy, and immunotherapies, their survival rate remains low [1,3]. Besides, the traditional classification is often ineffective and lacks clinical benefits. erefore, researchers are struggling to explore the new biomarkers to better diagnose and predict prognosis. Huang et al. identified eight immune-related gene biomarkers that could predict the prognosis of melanoma [28]. An RNA sequencing-based 12-gene signature was established by applying univariate and multivariate regression models to predict the prognosis of melanoma patients [29]. Lu et al. discovered a five-miRNA signature by analyzing the microarray dataset in the GEO database, which could be regarded as an independent prognostic biomarker in melanoma patients [30]. Recently, the tumor immune microenvironment in melanoma has become a research hotspot and is under active investigation [31]. Moreover, the immune cell types differentially distributed in the tumor tissue on diagnosis have attracted great interest in recent years. erefore, in this study, we systematically analyzed the immune microenvironment and tried to establish a more evaluable and precise signature for advanced melanoma patients. Various differential expressions of genes were recently analyzed to diagnose tumors. Nevertheless, little research attention looked at the effects of the immune cell on the diagnosis of melanoma. Firstly, we conducted the ssGSEA method to assess the relative expression of 24 kinds of human immune cells. Compared to normal tissues, the distribution of the immune cell was significantly higher in the tumor tissues. e overlapping DEICs were identified and put into machine learning analysis. e high sensitivity and specificity of multiple machine learning algorithms indicated that DECI was an efficient indicator for the diagnosis of melanoma. In addition, we built a diagnostic score model by logistic regression method, which could effectively distinguish the melanomas from the normal controls, replying that the immune system is closely associated with the tumorigenesis of melanoma. Similar results have been reported that the infiltration of the immune cell can be used to diagnose colon cancer, even all digestive system cancers [32,33]. In this sense, immune infiltration opened a novel strategy for diagnosing and treating melanoma.
To subsequently investigate the prognostic value of the immune infiltration in melanoma, LASSO, RF-FS, and SVM-RFE methods were jointly applied to select the potential immune cells for building the prognostic model. Finally, four types of immune cells, including 2 cells, T helper cells, Macrophages, and iDC, were employed to develop the risk model by the Cox regression method, which was also validated in the internal and external datasets.   0  20  40  60  80  100  120  140  160  180  200  220  240 Linear Predictor   Among these immune cells, some have been proven to be associated with melanoma. For instance, approximately 70% of melanoma metastatic lymph nodes were detected in the distribution of immature DCs, which may take an immunosuppressive function in melanoma [34]. e 1 and 2 cells are in a somewhat balanced condition in the normal immunological milieu. e imbalance of 1/ 2 is referred to as the 2 bias, which severed the inhibitory effect on 1 responses [35]. One of the mechanisms of tumor immune escape is the 2 bias. Studies have demonstrated that the dominance of the 2 cells could regulate chronic inflammation, which led to the metastasis of melanoma. Moreover, Falleni et al. proved that macrophage accumulation was a poor predictor of melanoma in a patient and might be considered a possible therapeutic target [36]. To assess the accuracy of prognostic prediction, we also built a nomogram-integrated risk score and clinical information. e calibration curve for the 3 and 5 years of outcomes showed that the nomogram worked well compared with the ideal model. Besides, in comparison with the tumor stage, the decision curve plots depicted that the nomogram model can acquire more benefit.
e multivariate cox analysis also suggested that the risk score of immune cells-related biomarkers might be considered an independent prognostic factor in melanoma.
Based on the optimal cutoff of risk score, melanomas were classified into subrisk groups. e KM curves revealed that high-risk patients have a poor prognosis. us, to explore the underlying mechanism with different subgroups, the stratified analyses of clinical characteristics and gene phenotypes were performed. e risk score distribution of clinical features showed that the risk score was only correlated to vital status and tumor status. Presently, checkpoint blockade immunotherapies represent a promising strategy for cancer therapy and acquired extensive investigations [37,38]. However, the efficacy of immunotherapies is dramatically varied in individual patients and different subtypes of cancer. In our research, the expression of immune checkpoint-related genes, including CD28, CTLA4, ICOS, PDCD1, TIGIT, CD274, CD226, CD40, and CD40LG, were highly expressed in high-risk patients. Besides, epithelial-mesenchymal transition (EMT) recognized the indictor for the invasion and progression of many cancers [39,40]. e selected EMT-related genes are also included in our research, and the results also manifested that most of them are highly expressed in the high-risk group. erefore, we have reason to suspect that our immune cellrelated biomarker is linked to melanoma prognosis.
To further investigate the potential biological mechanism in the high-risk phenotype, the GSEA method was applied to analyze the candidate pathways. e results showed that the high-risk phenotype was positively associated with cancer hallmarks, such as allograft rejection, complement, EMT, and inflammatory response, which supported the previous findings that EMT and immuneassociated genes were highly expressed in the high-risk group. e complement system, an essential constituent of innate immunity, affects tumor growth and metastasis by regulating chronic inflammation. Moreover, the KEGG pathway analysis showed that complement and coagulation cascades, ECM receptor interaction, natural killer cell-mediated cytotoxicity, and T cell receptor signaling pathways were enriched in the high-risk phenotype, which largely consisted with cancer hallmark analysis. e ECM-receptor  Figure 7: Stratified analysis and GSEA analysis. (a) e relationships between the distributions of the risk score and clinical variables, which contained age, race, gender, stage, vital status, and tumor status. (b) Box-violin plots of immune checkpoint-related genes between the highand low-risk groups. (c) Box-violin plots of epithelial-mesenchymal transition (EMT)-related genes between the high-and low-risk groups. * represents p < 0.05; * * represents p < 0.01; * * * represents p < 0.001; * * * * represents p < 0.0001. (d, e) GSEA of high-vs. low-risk score groups using the gene sets of the cancer hallmark pathway (h.all.v7.0.symbols) (d) and KEGG pathway (c2.cp.kegg.v7.0.symbols) (e).
interaction pathway is crucial in the metastasis of the tumor [41]. e importance of the ECM-receptor interaction pathway revealed that the tumor cells and the environment have a dynamic interaction [42].

Conclusion
To sum up, our study discovered several differential immune cells and proved the efficiency of immune cells in diagnosing and predicting the prognosis of melanoma. In clinical application and management, the developed diagnosis and prognosis models may give an easier and more accurate prediction for melanoma patients. However, our experiments are limited to bioinformatic analysis, and further experiments should be performed in vitro and in vivo.

Conflicts of Interest
e authors declare that they have no conflicts of interests.