Serum and Cerebrospinal Fluid Cytokine Biomarkers for Diagnosis of Multiple Sclerosis

Background. Multiple sclerosis (MS) is a chronic debilitating disorder characterized by persisting damage to the brain caused by autoreactive leukocytes. Leukocyte activation is regulated by cytokines, which are readily detected in MS serum and cerebrospinal fluid (CSF). Objective. Serum and CSF levels of forty-five cytokines were analyzed to identify MS diagnostic markers. Methods. Cytokines were analyzed using multiplex immunoassay. ANOVA-based feature scores and Pearson correlation coefficients were calculated to select the features used as input to machine learning models to predict and classify MS. Results. Twenty-two and twenty cytokines were altered in CSF and serum, respectively. The MS diagnosis accuracy was ≥92% when any five randomly selected biomarkers from this set were used. Interestingly, the highest accuracy (99%) of MS diagnosis was demonstrated when CCL27, IFN-γ, and IL-4 were among the five selected cytokines, suggesting their important role in MS pathogenesis. Also, these binary classifier models had an accuracy in the range of 70-78% (serum) and 60-69% (CSF) in discriminating between the progressive (primary and secondary progressive) and relapsing-remitting forms of MS. Conclusion. We identified a set of cytokines from serum and CSF that could be used for MS diagnosis and classification.


Introduction
Multiple sclerosis (MS) is a chronic inflammatory disease of the central nervous system (CNS). It is believed that inflammation in CNS leads to myelin degradation and axon damage [1]. Studies suggest that autoreactive T lymphocytes contribute to the immune assault against myelin, neuronal death, and subsequent plaque formation [2].
Analysis of the CSF from MS patients demonstrated high levels of multiple cytokines, suggesting ongoing inflammation and leukocyte activation [3,4]. Analysis of MS CSF revealed the activation of cytokines associated with Th1 (IFN-γ, TNF-α, and IL-2) and Th2 (IL-4, IL-5, IL-13, and IL-6) leukocytes [5]. We have shown differential activation of cytokines in CSF and serum of MS patients, which supported the leading role of Th1 lymphocytes in the CNS pathogenesis of MS. Also, the upregulation of CCL27 was demonstrated in serum and CSF of MS patients, suggesting a role of this cytokine in neuroinflammation [6]. Although proinflammatory cytokines are commonly detected in the CSF of MS patients, data remain limited.
The analysis of serum and CSF could identify novel biomarkers, aiding the diagnosis of various diseases and improving the therapeutic outcome. Analysis of serum and CSF samples could facilitate the early diagnosis of MS with more accuracy and cost efficiency as compared to expensive techniques such as CT or MRI scans. As an alternative approach, machine learning methods have been successfully employed for the prediction of multiple diseases, including MS [7-9]. Using this approach, we have previously identified serum cytokines which could be used for the diagnosis of MS [10].
There is no single diagnostic test for MS [11], and the current diagnosis is based on clinical symptoms and MRI data. Each of these diagnostic criteria has limitations; therefore, the identification of novel biomarkers of the disease remains critical. Serum samples are often collected as a routine procedure for the diagnosis of MS, and they could be used to identify the disease biomarkers. We have shown that multiple cytokines are affected in MS [6,10,12]; however, these studies had a limited patient group and a restricted number of cytokines analyzed. In addition, we employed several machine-learning approaches to identify cytokines which could provide high accuracy of MS diagnosis.
The present study is aimed at identifying serum and CSF cytokine-based markers for the diagnosis of MS from a panel of forty-five cytokines. ANOVA-based feature selection and Pearson correlation coefficients were used to select the most relevant cytokines from the studied panel to differentiate MS patients from non-MS controls. Further, five different machine learning models were developed using the selected serum and CSF cytokines to diagnose MS and to distinguish the progressive (primary progressive MS, PPMS; and secondary progressive MS, SPMS) from the relapsing-remitting (RRMS) forms of the disease.

Materials and Methods
The study was organized into 5 stages (Figure 1): (1) Dataset collection: the levels of different cytokines were analyzed in serum and CSF from MS patients and non-MS controls. (2) Dataset managing and Feature Selection: the feature score (ANOVA) and r score (Pearson correlation coefficient) were calculated between two groups (serum and CSF of MS vs. non-MS; serum vs. CSF of MS patients and non-MS controls). (3) Model Training and Testing: the cytokines (features) having a high feature score and r score > 0.5 were selected for developing the machine learning models. (4) Model evaluation and Cross Validation: the data of the selected features were then given as input to the five (gNB, KNN, DT, XGB, and RF) machine learning models to predict the outcome of the disease. (5) Results Analysis: the performance of each model was evaluated by parameters such as accuracy and AUC values.
2.1. Study Subjects, Samples. One hundred one MS cases (28 males, 73 females; mean age 35.6 ± 12.52 years) admitted to the Republican Clinical Neurological Center, Republic of Tatarstan, Russian Federation, were included in this study. MS diagnosis was based on clinical presentation and brain MRI results. Serum and CSF were collected from each patient. Additionally, CSF was collected from 25 individuals, herein referred to as non-MS controls (9 males, 16 females; mean age 38.5 ± 9.2 years). The non-MS individuals who provided CSF samples were diagnosed with tension-type headache, residual encephalopathy, unspecified demyelinating disease of the central nervous system, cerebrovascular diseases, progressive multifocal leukoencephalopathy, and migraine with aura. Separately, serum samples were also collected from 101 non-MS controls.

Ethics Statement.
This study was done in accordance with the recommendations of the Biomedicine Ethic Expert Committee of Republican Clinical Neurological Center, Republic of Tatarstan, Russian Federation, and the study was approved (protocol no. 218, 11.15.2012) by this committee. All subjects gave written informed consent in accordance with the Declaration of Helsinki. Data on three cytokines (IL-1Ra, IL-2ra, and IL-17) were missing for some patients; therefore, these three cytokines were excluded from the dataset used in building the machine learning models.

Feature Selection.
As the analysis of all 45 cytokines (herein referred to as features) would have increased the complexity of the machine learning algorithm, the dimensions of the dataset were reduced by feature selection. Feature selection is a form of dimension reduction: the reduced dimension results in a smaller dataset and easier interpretation of the data. Feature selection also helps reduce overfitting, which negatively influences the performance of machine learning models. To reduce the dimensions, ANOVA-based feature selection was carried out in Python. The Pearson correlation coefficient (r) was also calculated in Python for each cytokine to measure the association between the two variables. The r score represents the degree of association between the variables: a value above 0.5 indicates a strong correlation, a value between 0.3 and 0.49 indicates a moderate correlation, and a value below 0.29 indicates a low correlation [13]. In the current study, we considered the features with r score values above 0.5 that were significantly different at p < 0.05 for building the machine learning models.
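The two selection scores can be illustrated with a minimal standard-library sketch. It assumes that the r score is the Pearson correlation between a cytokine's level and a binary group label (1 = MS, 0 = non-MS) and that the 0.5 cutoff is applied to its absolute value; the study does not spell out either detail. The cytokine names and values are hypothetical toy data, not study measurements.

```python
from math import sqrt

def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups (each a list of floats)."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical toy levels: (MS group, non-MS group). Not study data.
levels = {
    "cytokine_A": ([5.1, 6.0, 5.6, 6.3], [2.0, 2.4, 1.9, 2.2]),
    "cytokine_B": ([3.0, 2.1, 2.8, 2.5], [2.9, 2.2, 2.6, 2.7]),
}

selected = []
for name, (ms, ctrl) in levels.items():
    f_score = anova_f([ms, ctrl])
    # Assumption: correlate the level with the group label (1 = MS, 0 = non-MS).
    r = pearson_r(ms + ctrl, [1] * len(ms) + [0] * len(ctrl))
    if abs(r) > 0.5:  # cutoff used in the study; abs() is an assumption
        selected.append(name)
    print(f"{name}: F = {f_score:.2f}, r = {r:.2f}")
```

In practice a library routine such as scikit-learn's `f_classif` would likely replace the hand-written F statistic; the explicit version is shown only to make the computation visible.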

Machine Learning Methods.
Five machine-learning models, k-Nearest Neighbor (KNN), Decision Tree (DT), XGBoost (XGB), Gaussian Naïve Bayes (gNB), and Random Forest (RF), were used. The required packages and tuning parameters used to obtain the optimum results with these models are summarized in Table 1. Models were trained on the selected cytokines to predict the target (MS vs. non-MS) and to classify the progressive (primary and secondary progressive) and relapsing-remitting forms of the disease.
TN is true negative; TP is true positive; FP is false positive; FN is false negative.
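For orientation, accuracy can be written in terms of these four counts using its standard definition. The counts in the sketch are hypothetical and are not taken from the study.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all cases that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical confusion-matrix counts for a held-out test set.
print(accuracy(tp=28, tn=29, fp=1, fn=2))
```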

Repeated K-Fold Cross Validation. K-fold validation
Random forest (RF): RandomForestClassifier; criterion = "gini", n_estimators = 50 [15].
Forty-nine cases were diagnosed with RRMS, and 21 and 31 cases were diagnosed with PPMS and SPMS, respectively (Table 2). The mean duration of the disease was 4.98 ± 6.65 years. The Expanded Disability Status Scale (EDSS) score and Multiple Sclerosis Severity Score (MSSS) were 2.88 ± 1.5 and 4.5 ± 2.20, respectively. MRI detected lesions in the subcortical region, corpus callosum, and pons in MS patients. Twenty-five patients received disease-modifying therapy, while 76 patients had no treatment.
Table 1. These cytokine levels in serum and CSF from MS patients and non-MS controls were analyzed using the ANOVA-based feature score selection method. Subsequently, the Pearson correlation coefficient (r score) for each feature was calculated. Feature and r score data of serum and CSF cytokines in MS and non-MS are summarized in Supplementary Tables 2 and 3, respectively. A value of r > 0.5, representing a strong correlation between groups (serum and CSF of MS vs. non-MS; serum vs. CSF of MS and non-MS), was selected as the cutoff. Cytokines (features) having a high feature score and r > 0.5 were used for the machine learning analysis (Figures 2 and 3). These selected cytokines were also found to be significantly different at p < 0.05.
Mediators of Inflammation
String analysis also identified differences in the interaction of cytokines affected in serum and CSF of MS (Figures 4(a) and 4(b)).

Proposed Predictive Model.
Five different models were trained on the dataset with the selected features to predict MS, and the proposed algorithms are depicted in Figure 5. The cytokine dataset of patients and controls was divided into training (70%) and testing (30%) datasets for both serum and CSF. An equal number of MS patients (101) and non-MS controls (101) was used for the serum analysis. However, the CSF dataset contains an unequal number of patients (101 samples) and controls (25 samples), making the data unbalanced. Because of this unbalanced dataset, and to overcome the effects of overfitting, we used recursive testing.
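The 70/30 split and the repeated testing can be sketched as index bookkeeping. This is a minimal illustration, assuming that "recursive testing" corresponds to the repeated k-fold cross validation named in the Methods; the number of folds (`k=5`), the number of repeats (`repeats=3`), and the seeds are hypothetical choices that the study does not report.

```python
import random

def train_test_split_indices(n, test_fraction=0.3, seed=0):
    """Shuffle sample indices and split them into (train, test) lists."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]

def repeated_kfold(n, k=5, repeats=3, seed=0):
    """Yield (train, test) index lists for k folds, reshuffled on every repeat."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for i, test in enumerate(folds):
            train = [j for m, fold in enumerate(folds) if m != i for j in fold]
            yield train, test

# Serum dataset: 101 MS patients + 101 non-MS controls.
train_idx, test_idx = train_test_split_indices(202)
print(len(train_idx), len(test_idx))

# CSF dataset: 101 MS patients + 25 non-MS controls.
splits = list(repeated_kfold(126, k=5, repeats=3))
print(len(splits))
```

For an unbalanced dataset such as the CSF one, a stratified variant that preserves the class ratio in every fold (for example scikit-learn's `RepeatedStratifiedKFold`) may be preferable; whether the study stratified its folds is not stated.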
Three independent datasets, including ten cytokines commonly affected in CSF and serum, twelve cytokines uniquely changed in CSF, and ten cytokines uniquely affected in serum (identified in Figure 3), were selected as an input to five state-of-the-art machine-learning models to predict MS.
From each cytokine dataset, five cytokines were randomly selected as input to the five machine-learning models to evaluate the accuracy of MS prediction. Interestingly, all combinations of five randomly selected cytokines showed an accuracy of MS diagnosis ≥ 92% (Table 3).
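The random-subset experiment can be sketched as follows, using a hand-written KNN classifier as a stand-in for one of the five models. Everything here is illustrative: the cytokine names are placeholders, the data are synthetic Gaussian values with a deliberately large class separation, and `k=3` neighbours and 5 folds are assumed settings rather than the values from Table 1.

```python
import random
from math import dist

def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    nearest = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], x))[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if votes * 2 > k else 0

rng = random.Random(42)
cytokines = [f"cytokine_{i}" for i in range(10)]  # hypothetical feature names

# Synthetic data: label 1 ("MS") is shifted by +3 in every feature.
X, y = [], []
for label in (0, 1):
    for _ in range(40):
        X.append([rng.gauss(3.0 * label, 1.0) for _ in cytokines])
        y.append(label)

# Randomly pick five of the ten features and keep only those columns.
chosen = rng.sample(range(len(cytokines)), 5)
X5 = [[row[j] for j in chosen] for row in X]

# 5-fold cross validation of the KNN classifier on the reduced dataset.
order = list(range(len(X5)))
rng.shuffle(order)
folds = [order[i::5] for i in range(5)]
correct = 0
for test in folds:
    held_out = set(test)
    train = [i for i in order if i not in held_out]
    train_X = [X5[i] for i in train]
    train_y = [y[i] for i in train]
    correct += sum(knn_predict(train_X, train_y, X5[i]) == y[i] for i in test)

acc = correct / len(X5)
print([cytokines[j] for j in chosen], f"accuracy = {acc:.2f}")
```

Repeating the draw of `chosen` many times and recording the accuracy of each draw would mirror the comparison across random five-cytokine combinations.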
An example of the accuracy of MS diagnosis is presented by using KNN, one of the models tested, for randomly  Figure 6).
Interestingly, the accuracy of MS diagnosis was the highest, reaching 99%, when the cytokines CCL27, IL-4, and IFN-γ were included among the five randomly selected cytokines. The area under the curve (AUC), which represents the reliability of the model, is also demonstrated using gNB, one of the randomly generated models using five serum cytokines. This AUC value was more than 0.95, indicating high reliability of the model (Figure 7).
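As a reminder of what the AUC measures, it equals the probability that a randomly chosen MS case receives a higher model score than a randomly chosen non-MS case, with ties counted as one half. The sketch computes it directly from that definition; the labels and scores are hypothetical and not model output from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities of MS for six test samples.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
print(round(roc_auc(labels, scores), 3))
```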
All five machine learning models (KNN, DT, XGB, gNB, and RF) demonstrated relatively similar accuracy indicating that any of them can be used for the prediction of MS (Table 3).
Twenty and twenty-two cytokines affected in serum and CSF, respectively, were also taken as input to classify the progressive (PPMS and SPMS) and RRMS forms of the disease. The accuracy was found to be in the range of 70-78% and 60-69% for serum and CSF, respectively (Table 4). In each cytokine dataset, five randomly chosen cytokines were used in each model to calculate the accuracy with k-fold cross validation.
*Cytokines commonly affected in serum and CSF of MS; **cytokines uniquely affected in CSF; ***cytokines uniquely affected in serum.
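The binary grouping used for this classification can be made explicit with the case counts reported for the cohort (49 RRMS, 21 PPMS, 31 SPMS). The sketch also computes the accuracy of always predicting the larger class; this baseline is not reported in the study and is included only as an assumed reference point for reading the 70-78% and 60-69% figures.

```python
from collections import Counter

# Case counts reported for the MS cohort.
counts = {"RRMS": 49, "PPMS": 21, "SPMS": 31}

# Binary target: progressive (PPMS + SPMS) vs. relapsing-remitting.
to_binary = {
    "RRMS": "relapsing-remitting",
    "PPMS": "progressive",
    "SPMS": "progressive",
}

binary = Counter()
for form, n in counts.items():
    binary[to_binary[form]] += n

# Accuracy obtained by always predicting the larger class.
majority_baseline = max(binary.values()) / sum(binary.values())
print(dict(binary), round(majority_baseline, 3))
```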

Discussion
Diagnosis of MS remains a challenge, as the disease has multiple clinical forms and symptoms can relapse and disappear [16]. Changes in a large number of cytokines, soluble biomarkers of inflammation and leukocyte activation, have been demonstrated in serum and CSF [6,12]. These data suggest that some cytokines could have diagnostic and prognostic value in MS [10]. We have previously applied machine learning models to diagnose MS using limited data on cytokines affected in the serum of MS patients [10], which produced a relatively low-confidence result. Therefore, we expanded the number of cytokines (a total of 45), which included interleukins, growth factors, and chemokines, so that the computational data analysis would identify a group of biomarkers differentiating MS with a high level of confidence. We found that twenty-two and twenty cytokines in CSF and serum, respectively, were affected (strong correlation, r > 0.5) in MS as compared to non-MS. All five models showed high accuracy of MS diagnosis, ranging from 90 to 99%. In our previous report [10], four models (SVM, DT, RF, and KNN) were applied, and only the RF model showed an accuracy of 90.91% with a limited cytokine dataset (eight cytokines). In the present analysis, the RF model showed an accuracy of ≥92%, which is similar to our previous report [10]. As compared to the basic models (KNN, DT, gNB, and RF), XGB uses the ensemble approach [17]. XGB also demonstrated remarkable accuracy, ranging between 92 and 98%. Seven (IL-1β, IL-2, IL-4, IL-8, IL-10, IFN-γ, and TNF-α) of the eight cytokines used in the previous report [10] were also found in the current study, indicating the reliability of the published and current results. However, by using the larger number of biomarkers, we were able to determine that a minimum of five cytokines is required to achieve the highest accuracy of MS diagnosis.
Also, these models were able to identify PPMS, SPMS, and RRMS forms of the disease by taking input of twenty (serum) and twenty-two (CSF) cytokines with an accuracy of 70-78% and 60-69%, respectively.
When five cytokines were randomly selected, the group including CCL27, IL-4, and IFN-γ combined with any other three cytokines demonstrated the highest precision of MS differentiation from non-MS. These data corroborate our previous report on the potential role of CCL27 in MS pathogenesis [18]. Originally identified in skin [19], this cytokine was later shown to be expressed in brain neuroglia [20]. It was shown that CCL27 could enhance inflammation by releasing IL-4 [21]. Studies have also demonstrated that CCL27 could trigger T memory cells [22] to produce IL-4 and IFN-γ [23]. Although these data provide limited evidence on the link between CCL27 and MS pathology, our observation of the high level of this cytokine in MS serum and CSF suggests its role in the pathogenesis of the disease. This assumption was confirmed by Monfared et al., who demonstrated an increased serum CCL27 level in MS [24].
One of the interesting observations was that some cytokines are differentially affected in serum and CSF in MS. IL-1α, IL-1β, and IL-18, affected in MS CSF, have been linked to a strong inflammatory reaction [25]. Two of these cytokines, IL-1β and IL-18, are products of the activated inflammasome, which regulates the inflammatory response [26]. These findings support the key role of inflammation in brain pathology and also support the use of inflammasome inhibitors as therapeutics for MS [27]. In addition to inflammation, interleukins and chemokines affected in serum and CSF could direct leukocyte migration, targeting Th1 cells [28].

Conclusion
We have identified ten cytokines (IL-1α, IL-4, IL-18, CCL7, CCL27, IFN-γ, LIF, M-CSF, SCF, and TNF-α) which can be analyzed either in serum or CSF to differentiate MS from non-MS. Also, the random selection of any five cytokines from the dataset of altered cytokines in serum and CSF could diagnose MS with an accuracy between 90 and 99%. Interestingly, the highest accuracy of MS diagnosis (99%) was demonstrated when CCL27, IL-4, and IFN-γ were selected, suggesting their important role in MS pathogenesis. Also, the accuracy of the models in distinguishing progressive (PPMS and SPMS) from RRMS forms was 70-78% (serum) and 60-69% (CSF).

Data Availability
The corresponding author has full access to the data and will share it with researchers on request after they sign an agreement not to use the data for any purpose other than the intended research.