A Methylation Diagnostic Model Based on Random Forests and Neural Networks for Asthma Identification

Background. Asthma is a chronic disease that significantly impacts human life and health. Traditional treatments for asthma have several limitations. Artificial intelligence aids in cancer treatment and may also accelerate our understanding of asthma mechanisms. We aimed to develop a new clinical diagnosis model for asthma using artificial neural networks (ANN). Methods. Datasets (GSE85566, GSE40576, and GSE137716) were downloaded from the Gene Expression Omnibus (GEO). Differentially expressed CpGs (DECs) were identified and subjected to Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis. Random forest (RF) and ANN algorithms were then used to select characteristic DECs and build the clinical model. Two external validation datasets (GSE40576 and GSE137716) were used to validate the diagnostic ability of the model. Results. The methylation analysis tool ChAMP identified 121 up-regulated and 20 down-regulated DECs. GO analysis showed enrichment of actin cytoskeleton organization and cell-substrate adhesion, and KEGG analysis showed enrichment of shigellosis and serotonergic synapse. RF analysis identified 10 crucial DECs (including cg05075579, cg20434422, cg03907390, cg00712106, cg05696969, cg22862094, cg11733958, cg00328720, and cg13570822), and the ANN built the clinical model from these 10 DECs. In the two external validation datasets, the area under the curve (AUC) was 1.000 for GSE137716 and 0.950 for GSE40576, supporting the reliability of the model. Conclusion. Our findings provide new methylation markers and a clinical diagnostic model for asthma diagnosis and treatment.


Introduction
Asthma is a chronic, heterogeneous respiratory disease that affects people of all age groups. In recent years, asthma-related morbidity and mortality have increased annually. The clinical manifestations of asthma are mainly respiratory symptoms, and its main pathological features include chronic airway inflammation, airway hyperresponsiveness, and airway remodeling [1][2][3]. Immunoglobulin E (IgE), interleukin-5 (IL-5) and its receptor, and the interleukin-4 (IL-4) receptor are used as molecular targets in the clinical diagnosis of asthma; however, their specificity varies, individual differences are very large, and the clinical treatment of asthma patients is still inadequate [4,5]. DNA methylation, a major component of human epigenetics, has a profound effect on the occurrence and development of various diseases [6,7]. There is substantial evidence that the mechanisms and characteristics of asthma depend on methylation patterns. Gaffin et al. [8] studied DNA methylation in peripheral blood mononuclear cells and airway epithelial cells of atopic asthmatic, non-atopic asthmatic, and healthy children and confirmed that multiple CpG sites in the ADRB2 gene promoter region were associated with reduced dyspnea in children. RNA methylation has also provided new options for asthma treatment [9,10].
Multiple studies have attempted to distinguish asthma patients from healthy individuals by identifying CpG loci; however, the results have not been encouraging [11]. Reliable quantitative measurement using fewer markers is a viable option. The application of machine learning in the medical field has significantly accelerated research into disease mechanisms [12,13]. Machine learning can describe the complexity and unpredictability of human diseases, as reported in various studies [14][15][16]. Cao et al. [17] identified key genes for Th2-high asthma using weighted gene co-expression network analysis. There is currently no standard diagnostic model for screening and early detection of asthma. Machine learning methods such as random forests (RF) and artificial neural networks (ANN) have developed rapidly and are frequently used in biomarker research [18][19][20][21]. To our knowledge, this is the first study to analyze the methylation profile of asthma samples with machine learning (RF and ANN) to obtain differentially expressed CpGs (DECs). The diagnostic performance of our model was evaluated with the receiver operating characteristic (ROC) curve, and external validation datasets confirmed its efficiency. This study aimed to identify asthma by analyzing methylation data. The workflow of the study is shown in Figure 1.

Methods and Materials
The methylation data were analyzed with the champ function of the ChAMP package, and a heat map of the top 1000 CpGs was drawn from the ChAMP results. The threshold was deltaBeta < -0.05 and p-value < 10^-8, and gene symbols were matched according to the 450K methylation array for the subsequent GO and KEGG analyses (clusterProfiler, version 4.3.3). All analyses were performed in R.
2.3. Random Forest (RF) Classification. The DECs obtained by ChAMP were first screened and classified using the R package randomForest (version 4.7.1). The number of variables tried at each node (mtry) was chosen by computing the average misclassification rate (err.rate) over all DECs and selecting the value that minimized it. In this study, the optimal number of variables per node was seven, and the optimal number of trees in the random forest was 600. The top 10 DECs ranked by Gini importance were selected as specific candidates for asthma. A heat map of these DECs was drawn with pheatmap (version 1.0.12) to show their classification ability.
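To make the Gini-based ranking more concrete, the following minimal Python sketch computes the impurity decrease produced by a single split on one CpG. It is an illustration only, not the randomForest implementation: as we understand it, MeanDecreaseGini accumulates such decreases over all nodes that split on a variable and averages them over the trees. The beta values, labels, and cut points are hypothetical.

```python
from typing import Sequence


def gini_impurity(labels: Sequence[int]) -> float:
    """Gini impurity 1 - sum(p_k^2) of a set of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_decrease(values: Sequence[float], labels: Sequence[int], cut: float) -> float:
    """Parent impurity minus the size-weighted impurity of the two children
    obtained by splitting the samples at values <= cut."""
    left = [y for v, y in zip(values, labels) if v <= cut]
    right = [y for v, y in zip(values, labels) if v > cut]
    n = len(labels)
    children = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return gini_impurity(labels) - children


# Hypothetical beta values for one CpG; 1 = asthma, 0 = control.
beta = [0.82, 0.79, 0.85, 0.40, 0.35, 0.45]
status = [1, 1, 1, 0, 0, 0]

clean_split = gini_decrease(beta, status, cut=0.60)  # separates the classes fully
mixed_split = gini_decrease(beta, status, cut=0.80)  # leaves one asthma sample on the wrong side
```

A CpG whose splits separate the two groups cleanly yields larger decreases and therefore ranks higher in importance.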

Artificial Neural Network Model Construction. The artificial neural network model of the important candidate variables was constructed using the R package neuralnet (version 1.44.2). As a rule of thumb, the number of hidden neurons should be 2/3 of the size of the input layer plus 2/3 of the size of the output layer, and it should lie between the sizes of the input and output layers. The methylation profile data were normalized to the range 0 to 1 before being passed to neuralnet. The output was set to normal and asthma, and the output of the first hidden layer (the input of the output layer) was taken as the result of the gene weights. Training terminated when the absolute partial derivatives of the error function fell below the threshold of 0.01.
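The normalization and network shape described here can be sketched in a few lines of Python. This is a hedged illustration, not the neuralnet model itself: the layer sizes (10 inputs, 7 hidden neurons, 2 outputs) are taken from the Results, while the logistic activation, the random weights, and the sample values are assumptions made for the example.

```python
import math
import random
from typing import List


def min_max_scale(column: List[float]) -> List[float]:
    """Rescale one CpG's values to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)
    return [(x - lo) / (hi - lo) for x in column]


def logistic(z: float) -> float:
    """Logistic activation (assumed; the text does not name the activation)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs: List[float], weights: List[List[float]], biases: List[float]) -> List[float]:
    """One fully connected layer: logistic(W x + b), one weight row per neuron."""
    return [logistic(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


# Layer sizes follow the Results (10 inputs, 7 hidden neurons, 2 outputs).
N_IN, N_HIDDEN, N_OUT = 10, 7, 2
rng = random.Random(0)  # hypothetical random weights, not the trained model
w_hidden = [[rng.uniform(-1, 1) for _ in range(N_IN)] for _ in range(N_HIDDEN)]
b_hidden = [rng.uniform(-1, 1) for _ in range(N_HIDDEN)]
w_out = [[rng.uniform(-1, 1) for _ in range(N_HIDDEN)] for _ in range(N_OUT)]
b_out = [rng.uniform(-1, 1) for _ in range(N_OUT)]

scaled = min_max_scale([0.2, 0.5, 0.8])       # example column of beta values
sample = [rng.random() for _ in range(N_IN)]  # one already-scaled sample (10 DECs)
hidden = dense(sample, w_hidden, b_hidden)    # hidden-layer output
scores = dense(hidden, w_out, b_out)          # one score each for normal and asthma
```

In a trained model the weights would come from backpropagation, stopping once the error gradient falls below the threshold stated above.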

CpGs Landscape of GSE85566.
Methylation plays a key role in various diseases, as reported previously [25][26][27]. The champ.DMP function of the ChAMP package was used to analyze the methylation profile of the GSE85566 dataset (74 asthma samples and 41 normal samples) to characterize the methylation landscape of asthma samples and to identify differentially methylated CpG sites. The heat map of the top 1000 CpGs (asthma and normal samples) is displayed in Figure 2(a). To find methylation targets that differentiate asthma from healthy samples, the DECs (asthma vs. healthy) of this methylation chip dataset were identified with champ.DMP, and the results are presented in the volcano plot (Figure 2(b)). With the threshold set to adj.P.Val < 10^-8 and deltaBeta < -0.05, 121 up-regulated and 20 down-regulated DECs were obtained.
The up-regulated and down-regulated DECs are shown in the heat map (Figure 2(c)). In the heat map, the asthma group (blue) and the healthy group (red) are almost separable, but some asthma samples are still mixed into the healthy group. Thus, the ability of the DECs to distinguish asthma from healthy samples still needs to be improved.
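The DEC selection step can be pictured as a filter over the differential methylation table. The Python sketch below is illustrative only and rests on stated assumptions: the deltaBeta cutoff is applied to the absolute value so that both directions can be reported, a positive deltaBeta is labeled "up", and the rows and CpG names are hypothetical placeholders rather than values from GSE85566.

```python
from typing import Dict, List, NamedTuple


class DmpRow(NamedTuple):
    cpg: str           # probe identifier
    delta_beta: float  # difference in mean beta value between groups
    adj_p: float       # adjusted p-value


def split_decs(rows: List[DmpRow], p_cut: float = 1e-8,
               delta_cut: float = 0.05) -> Dict[str, List[str]]:
    """Keep rows with adj_p < p_cut and |delta_beta| > delta_cut,
    then split them by the sign of delta_beta (assumed convention)."""
    up: List[str] = []
    down: List[str] = []
    for row in rows:
        if row.adj_p >= p_cut or abs(row.delta_beta) <= delta_cut:
            continue
        (up if row.delta_beta > 0 else down).append(row.cpg)
    return {"up": up, "down": down}


# Hypothetical example rows (placeholder probe names).
rows = [
    DmpRow("cg_demo_1", 0.12, 1e-10),   # passes, positive change
    DmpRow("cg_demo_2", -0.09, 3e-9),   # passes, negative change
    DmpRow("cg_demo_3", 0.20, 1e-5),    # fails the p-value cutoff
    DmpRow("cg_demo_4", 0.01, 1e-12),   # fails the effect-size cutoff
]
decs = split_decs(rows)
```

The direction labels depend on which group is treated as the reference, so the sign convention should be checked against the actual champ.DMP output.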

GO and KEGG Analyses of DECs.
GO and KEGG analyses were used to understand the biological function and regulation of the DECs. GO results indicated enrichment of regulation of actin cytoskeleton organization and cell-substrate adhesion (Figure 3(a)). KEGG analysis showed enrichment in shigellosis and serotonergic synapse (Figure 3(b)). These results further support a key role for methylation in the pathogenesis of asthma. Distinguishing asthmatic from normal individuals through a single CpG site or a multi-CpG model remains an urgent problem.
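Enrichment analyses of this kind are usually over-representation tests. As a minimal sketch, assuming a hypergeometric test of the sort used in over-representation analysis, the Python function below computes the probability of seeing at least k annotated genes among n selected genes. All counts in the example are hypothetical and are not taken from the study.

```python
from math import comb


def enrichment_p_value(k: int, n: int, K: int, N: int) -> float:
    """Upper-tail hypergeometric probability P(X >= k).

    N: genes in the background, K: background genes annotated to the term,
    n: genes in the selected list, k: selected genes annotated to the term.
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical toy example: 10 background genes, 4 annotated, 3 selected, 2 hits.
p_toy = enrichment_p_value(k=2, n=3, K=4, N=10)

# Hypothetical larger example: 8 of 100 selected genes fall in a 200-gene term.
p_term = enrichment_p_value(k=8, n=100, K=200, N=20000)
```

In practice the resulting p-values are adjusted for the number of terms tested before a term is reported as enriched.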
3.3. Differential CpGs (DECs) in the Random Forest (RF). The above results provided a preliminary understanding of the key role of methylated CpGs in asthma. Although the DECs helped to differentiate asthma from healthy samples, the separation was not satisfactory (Figure 2(c)). These DECs were therefore used as the input of the random forest classifier. To make the error rate as small as possible, we calculated the mean error rate (err.rate); the number of variables per node was set to 7, and the final random forest model used 600 trees to ensure that the error was stable (Figure 4(a)). Variable importance in the random forest model was obtained with the Gini coefficient method (MeanDecreaseAccuracy and MeanDecreaseGini; Figure 4(b)). The top 10 DECs by importance (cg05075579, cg20434422, cg03907390, cg00712106, cg05696969, cg22862094, cg11733958, cg00328720, cg13570892, and cg03325522) were selected as follow-up candidates for classification. Among these DECs, cg05075579 was the most important, with a mean decrease in the Gini index much higher than that of the other DECs (Table 1). The heat map (Figure 4(c)) showed that these 10 DECs clustered asthma samples together better than the DECs in Figure 2(c).
[Figure panel (c); legend: Asthma Sample, Control Sample]
Computational and Mathematical Methods in Medicine

The Construction of Artificial Neural Network Model.
The random forest classifier identified the 10 most important DECs, which had a significant discriminative effect between asthma and healthy samples. The artificial neural network calculated the weights of these 10 DECs in the GSE85566 methylation profile, with an input layer of 10 nodes, a hidden layer of seven nodes, and an output layer of two nodes, and a new model was constructed (Figure 5(a)). To evaluate the neural network model, we chose 10-fold cross-validation: the data were randomly divided into training and validation sets, and the results were visualized with the pROC package (Figure 5(b)). In addition, we used the confusion matrix function of the caret package to evaluate the accuracy of the neural network model (accuracy: 0.9739). Based on these results, we developed a novel model that uses methylation profiles to classify asthma and healthy samples.
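The cross-validation scheme can be sketched as follows. This is a minimal Python illustration of a random 10-fold split, not the authors' R code; the sample count of 115 comes from the 74 asthma and 41 normal samples of GSE85566, while the seed and the unstratified shuffling are assumptions.

```python
import random
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int = 10,
                   seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Shuffle sample indices and return k (train, validation) index pairs.
    Every sample lands in exactly one validation fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most one
    splits = []
    for i, validation in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train, validation))
    return splits


# 74 asthma + 41 normal samples in GSE85566.
splits = k_fold_indices(115, k=10)
fold_sizes = [len(validation) for _, validation in splits]
```

Each model is fitted on the training indices of one split and scored on the matching validation indices, and the ten scores are then summarized.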

ROC Identification of the Dataset.
The sections above showed the classification of asthma and normal samples by the neural network. We then used two methylation datasets (GSE40576 and GSE137716) to evaluate the classification performance of the model with receiver operating characteristic (ROC) curves (Figures 6(a) and 6(b)). For GSE137716, the AUC was 1.000, with sensitivity and specificity of 100% at the best threshold; for GSE40576, the AUC was 0.950, with sensitivity and specificity of 0.959 and 0.969, respectively. For comparison, the SVM, CART, and XGBoost algorithms (Table 2) gave AUCs of 0.825, 0.773, and 0.619, respectively, for GSE40576 and 0.938, 0.818, and 0.881, respectively, for GSE137716. These results indicate that our neural network model classified asthmatic patients with high accuracy.
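The AUC and the best threshold reported here can be illustrated with a short Python sketch. It computes the AUC as the fraction of asthma/normal pairs that the score ranks correctly and picks the threshold that maximizes sensitivity plus specificity (Youden's index). The choice of Youden's index is an assumption, since the text does not say how the best threshold was defined, and the scores and labels are hypothetical.

```python
from typing import Sequence, Tuple


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def best_threshold(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float, float]:
    """Return (threshold, sensitivity, specificity) maximizing Youden's index.
    A sample is called positive when its score is >= threshold."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        sens, spec = tp / n_pos, tn / n_neg
        if best is None or sens + spec > best[1] + best[2]:
            best = (t, sens, spec)
    return best


# Hypothetical model scores; 1 = asthma, 0 = normal.
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
auc = roc_auc(scores, labels)
threshold, sensitivity, specificity = best_threshold(scores, labels)
```

An AUC of 1.000 means that every asthma sample scored higher than every normal sample, which is why sensitivity and specificity can both reach 100% at a single threshold.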

Discussion
To our knowledge, this was the first study to use DNA methylation-based machine learning to identify a series of asthma-related methylation loci (DECs). Interestingly, the selected methylation signatures were associated with actin cytoskeleton organization, cell-substrate adhesion, shigellosis, and serotonergic synapse, supporting the hypothesis that airway structural reorganization in asthma results from epigenetic changes in DNA methylation [28,29]. Ten specific DECs were then identified with RF, and an ANN model was built from the weight coefficients calculated by the ANN. The model had high accuracy and stability (the AUCs of the external validation datasets were 1.000 and 0.950).
Recently, owing to the rapid advancement of computing power, artificial intelligence methods such as machine learning have been widely employed in medicine, including disease diagnosis and prognosis, thereby accelerating our understanding of various diseases and helping clinicians with patient management. Multiple studies have developed novel models to predict clinical outcomes of asthma [30][31][32]. In this study, we focused on the key role of epigenetics (methylation) in asthma. The asthma-related DECs were obtained through differential analysis, 10 crucial candidate DECs were identified with the random forest classifier, and asthma-related neural classification scores were generated by the artificial neural network. We also compared the classification efficiency of individual CpGs with that of the model.
Figure caption (ROC curves): The points marked on each ROC curve are the optimal thresholds, placed at the inflection point, with sensitivity and specificity given in parentheses. AUC is the area under the ROC curve; the x-axis is specificity and the y-axis is sensitivity.
We characterized the methylation landscape of the GSE85566 data and obtained 141 differentially expressed CpGs (121 up-regulated and 20 down-regulated). GO analysis suggested enrichment in regulation of actin cytoskeleton organization [33], cell-substrate adhesion [34], and response to nutrient levels, and KEGG analysis identified shigellosis, serotonergic synapse, and Yersinia infection as potential signaling pathways. In addition, the 10 DECs obtained through MeanDecreaseGini importance screening of the random forest model provided the basis for constructing the neural network model. The model was highly accurate (accuracy: 0.9739), and validation with two other datasets confirmed its high classification performance (AUC: 1.000 and 0.950, respectively). We compared our model with other currently available machine learning algorithms (SVM, CART, and XGBoost) [35,36] and found that the diagnostic ability of the methylation model constructed by ANN was higher than that of the other models.
There are several limitations to this study. First, our analysis was based on public online databases, and differences between datasets introduce confounding factors that may bias the results. Second, our results could not be validated in clinical patient samples. Third, owing to the paucity of available methylation data, our datasets include data from children's peripheral blood mononuclear cells, which may have affected the results. In future work, we will verify our results in prospective studies, with the aim of applying them in clinical practice and giving doctors a basis for formulating treatment.

Conclusion
In summary, our neural network model based on methylation epigenetics has significant clinical value for the prediction of asthma and may benefit its early diagnosis.

Data Availability
The data of this study were downloaded and compiled from the GEO database (https://www.ncbi.nlm.nih.gov/gds/?term=); the data used to support the results of this study are available from the corresponding author.

Conflicts of Interest
This research does not include any research conducted by any author on human participants or animals. The authors declare no competing interests.