Machine Learning-Based Model to Predict the Disease Severity and Outcome in COVID-19 Patients



Introduction
Coronavirus disease 2019 (COVID-19) emerged in China in December 2019. As of January 2021, over 95 million cases have been reported around the world, with a mortality rate of 2% of the total closed cases [1]. This rapid pandemic expansion represents a global concern and a serious threat to public health and the economy worldwide. To prevent the infection from spreading, most countries restricted social interaction through precautionary measures such as isolation and quarantine. However, many infected patients did not receive proper treatment because of late diagnosis and the novel and unknown nature of the virus. Recently, many researchers have focused on developing new methodologies to screen infected patients at different stages and to find notable associations between a patient's clinical features and the chance of succumbing to the disease [2,3]. Current studies have determined that artificial intelligence (AI) and machine learning (ML) techniques can play a key role in reducing the effect of the virus spread [4][5][6]. Applications of ML to patients' data fall under a range of research directions [7]. One of the most important is predicting the infection rate and mortality rate and building a model to classify patients based on their clinical findings [8,9]. These investigations are extremely important and would greatly assist people in the health sector to be well prepared and to take all necessary precautions to minimize the pandemic spread. The aim of this research is to develop a prediction model to calculate the severity of the disease in COVID-19 patients, using risk factors that can be monitored remotely while the patient is at home. Moreover, the study explores the impact of vital signs, chronic diseases, preliminary clinical investigations, and demographic features on predicting survival versus mortality of COVID-19 patients. The study used COVID-19 patients' data from the King Fahad University Hospital, containing clinical findings and demographic information, to validate the model performance and effectiveness. All risk factors or vital signs that can be measured through widely used sensors were included in the study, such as blood oxygen level, temperature, pulse rate, and blood pressure. The model will serve as an early warning system to identify at-risk patients in a timely manner.
1.1. Related Work. Early detection and diagnosis using AI techniques help to prevent the spread of and to combat the COVID-19 pandemic, using different data such as CT scans, X-rays, clinical data, and blood sample data.
Yan et al. [10] predicted the criticality and survival chances of patients with severe COVID-19 infection based on different risk factors and demographic information. The dataset used consists of 375 records from patients admitted to Tongji Hospital from January 10th to February 18th, 2020, including 201 survivors and 174 deceased within the same period. They used an XGBoost (XGB) model and identified only three main clinical features as significant, i.e., lactic dehydrogenase (LDH), lymphocyte, and high-sensitivity C-reactive protein (Hs-CRP), selected from more than 300 features. The proposed model was validated using data from 29 patients. The key finding of the research was the model's ability to predict the risk of death with 0.95 precision and 0.90 prediction accuracy. Such models will equip physicians with a tool for identifying critical conditions, thereby helping to reduce the mortality rate. Even though these findings are of great importance, the research has some limitations that affect the reliability of the reported results. These limitations stem from the small size of the validation dataset, namely, records of only 29 patients.
Similarly, Wong and So [11] used XGB with another dataset to predict severe and fatal cases and to identify the risk factors associated with COVID-19. The dataset was retrieved from the United Kingdom Biobank (UKBB) and includes 93 different variables collected between 16 March 2020 and 19 July 2020. Two studies were conducted based on the sample groups. For the first study, the data were clinical prediagnostic data from 1747 records of COVID-19-infected patients containing both severe and fatal cases. The accuracy achieved was 0.668 for the severity class and 0.712 for the fatality class. For the second study, the data were taken from the negative cases, the general population with no COVID-19 infection, consisting of 489987 records. The same model was applied, and the accuracy achieved was similar to that of the first study: 0.669 for the severity class and 0.749 for the fatality class. It is worth mentioning that the researchers identified the five most significant risk factors for severe and fatal cases, with age being the top factor for both. Other factors include obesity, impaired renal function, multiple comorbidities, and cardiometabolic abnormalities.
Sun et al. [12] developed a prediction model using the support vector machine (SVM) to predict severe cases among COVID-19 patients. In the study, they used the clinical and laboratory features that are significantly associated with these cases. Using 336 cases of COVID-19 patients, 26 severe/critical and 310 noncritical, they found that the main features discriminating mild from severe cases are age, growth hormone secretagogues (GHSs), immune feature cluster of differentiation 3 (CD3) percentage, and total protein. They found that the proposed model was effective and robust in predicting patients in severe condition, with up to 0.775 accuracy.
Another study, conducted by Yao et al. [13], also applied the SVM model to classify COVID-19 patients according to the severity of their symptoms. They applied SVM with a binary class label to a total of 137 records, including urine and blood test results and combining both severely ill patients and patients with mild symptoms. The results showed that around 32 factors have high correlations with severe COVID-19, with an accuracy of 0.815. It is worth mentioning that, amongst all factors, age and gender most affected the classification of cases as severe or mild. Patients aged around 65 had more severe cases than others. Moreover, male patients were at a higher risk of developing severe COVID-19 symptoms. In terms of the urine and blood test samples, blood test features showed more significant differences between severe and mild cases than urine test features.
Another study used a dataset containing demographic and clinical data for 115 COVID-19 patients in nonsevere condition and 68 COVID-19 patients in severe condition. Four features were selected as the most significant for discriminating mild from severe cases: age, high-sensitivity C-reactive protein level, lymphocyte count, and d-dimer level. The model was evaluated, and the results showed that the prediction was effective, with an area under the receiver operating characteristic curve (AUROC) of 0.881, sensitivity of 0.839, and specificity of 0.794. Bertsimas et al. [15] used a sample of 3927 COVID-19 patients to predict mortality risk using XGB. The study used demographic and clinical features of the patients from 33 hospitals. The model achieved an accuracy of 0.85 and an AUC of 0.90. Moreover, Sánchez-Montañés et al. [16] developed an LR-based mortality prediction model using 1969 COVID-19-positive patients. The study found age and O2 to be the significant features and achieved an AUC of 0.89, sensitivity of 0.82, and specificity of 0.81.
In [5], supervised machine learning techniques were investigated to predict the COVID-19 outbreak. SVM was used for prediction over a dataset of 303 patients obtained from the WHO. The proposed scheme exhibits an accuracy of 0.967 during the testing phase. Similarly, An et al. [17] developed a model to predict the mortality of COVID-19 patients using several machine learning algorithms such as LASSO, SVM (linear and RBF), RF, and KNN. The models were trained to identify three cases, i.e., mortality and survival, and mortality and survival within 14 and 30 days after the initial diagnosis. Linear SVM achieved the highest performance, with an AUC of 0.962, sensitivity of 0.92, and specificity of 0.91. The study found age, diabetes mellitus, and cancer to be significant factors in the mortality prediction for COVID-19 patients.
In conclusion, several studies have demonstrated the importance of machine learning, specifically for predictive analysis. Some of these studies performed prediction and forecasting, yet there is still a need for further exploration and for extending the findings associated with COVID-19 using a real dataset of clinical records. The summary of the related studies is shown in Table 1.
The model proposed in this study attempts to predict and forecast the patients who are at risk and to identify the main risk factors associated with COVID-19. The targeted patients are isolated at home. The dataset (clinical findings) was retrieved from King Fahad University Hospital in the Kingdom of Saudi Arabia. The main aim of the study is to develop a preemptive warning model that can identify at-risk COVID-19 patients who are monitored in quarantine at home. This paper is organized as follows: Section 2 introduces the materials and methods, and Section 3 shows the experimental setup and results. Finally, the conclusion and future work are presented in Section 4.

Methodology
The following section covers the dataset description and the methodology used. Due to the class imbalance in the dataset, the synthetic minority oversampling technique (SMOTE) was used.

Dataset Description.
The study was conducted in the Department of Computer Science of Imam Abdulrahman bin Faisal University (IAU) and approved by the Deanship of Scientific Research of IAU under the research grant IRB-2020-09-160. The data were collected from King Fahad University Hospital, Dammam, Kingdom of Saudi Arabia (KSA). The dataset contains the demographic and clinical data of COVID-19-positive patients in the period from 30 April 2020 to 24 July 2020. It includes all the positive patients who were admitted to King Fahad University Hospital during the specified data collection period. There are 287 COVID-19 patient records in the dataset with a binary class label, namely, "survived" and "deceased." The number of survived patients is 243, and 44 patients are deceased.
The distribution of instances per class label is shown in Figure 1, while the description of the dataset is given in Table 2. The field BodyTemp 1 in the table indicates the first body temperature taken at the time of the patient's admission to the hospital, whereas BodyTemp 2 indicates the last body temperature reading taken before the patient's discharge. Similarly, SOB indicates shortness of breath, chr_dm indicates chronic disease diabetes mellitus, chr_htn indicates hypertension, chr_cardiac represents cardiovascular diseases, chr_dlp represents dyslipidemia, and chr_ckd indicates chronic kidney disease. The baseline characteristics of the numeric attributes of the dataset are represented in terms of mean ± standard deviation (SD). By contrast, the categorical attributes are measured by a count. The characteristics of the features in the dataset are presented in Table 3.

Preprocessing.
Preprocessing is one of the key steps in data analysis and prediction. Several preprocessing techniques were applied to the dataset. The dataset contains data of all the patients admitted to the hospital. Some symptoms or vital signs occurred with very low frequency and were therefore removed from the dataset. All symptoms occurring in 50% or more of the patients were added to the feature set, while the symptoms occurring in 2% to 49% of the patients were accumulated into one feature that was assigned a unique code.
The first three symptoms, fever, cough, and shortness of breath (SOB), were defined as symptom features, while the remaining symptoms were incorporated as a new attribute "sym_others." Five percent of the patients in the study were asymptomatic at the time of initial diagnosis and were considered part of the sym_others attribute. Similarly, the three chronic diseases with the highest frequency (i.e., diabetes, high blood pressure, and cardiac disease) were included as features, whereas all other chronic disease types with more than 1 occurrence were incorporated as one feature "chr_others." After this initial preprocessing, an encoding scheme was applied to the categorical features. As the dataset contains a small number of missing values, imputation was performed using the K-means technique.
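The frequency-based grouping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function name `group_rare_symptoms`, the record layout (one set of symptom names per patient), and the toy data are hypothetical, and the cut-offs are passed as parameters that default to the 50% and 2% values mentioned in the text.

```python
from collections import Counter


def group_rare_symptoms(records, keep_at=0.50, drop_below=0.02, other="sym_others"):
    """Return one dict of binary features per patient.

    records: list of sets of symptom names (hypothetical layout).
    Symptoms seen in >= keep_at of the patients become their own feature,
    symptoms seen in [drop_below, keep_at) are merged into `other`,
    and rarer symptoms are dropped.
    """
    n = len(records)
    counts = Counter(s for rec in records for s in rec)
    frequent = sorted(s for s, c in counts.items() if c / n >= keep_at)
    merged = {s for s, c in counts.items() if drop_below <= c / n < keep_at}

    rows = []
    for rec in records:
        row = {s: int(s in rec) for s in frequent}
        # One flag that is set if the patient has any of the merged symptoms.
        row[other] = int(any(s in merged for s in rec))
        rows.append(row)
    return rows


if __name__ == "__main__":
    toy = [
        {"fever", "cough"},
        {"fever", "headache"},
        {"fever", "cough", "sob"},
        {"cough"},
    ]
    for row in group_rare_symptoms(toy):
        print(row)
```

The same pattern would apply to the chronic disease columns, with "chr_others" as the merged feature name.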

Prediction Model.
In the study, three classification algorithms were used: logistic regression (LR), random forest, and extreme gradient boosting (XGB). A brief description of the classification algorithms is given below.

Logistic Regression.
Logistic regression is one of the most widely used statistical classification algorithms for binary and multiclass problems. The logistic function is used to predict the probability of the class label [18]. The functional form of the hypothesis is

$$g = C^{T} X,$$

where C is the list of regression coefficients and X is the list of the features. Written out in terms of the individual coefficients,

$$g = \beta_0 + \sum_{i=1}^{n} \beta_i x_i,$$

where $\beta_i$ represents the regression estimators, also known as predicted weights, for the selected features in the data and $\beta_0$ represents the intercept of the equation. Since the dataset used in the study consists of 25 features in total, the logistic regression model for our study is

$$g = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_{25} x_{25}.$$

The model predicts a record as survived or deceased according to the value of $S(g)$. For optimal selection of the regression estimators, maximum-likelihood estimation is used. The sigmoid function (logistic function) is used to map the attributes to the class label. The functional form of the sigmoid is

$$S(g) = \frac{1}{1 + e^{-g}},$$

where e is Euler's number. In LR, a regularization parameter is used to reduce the chance of model overfitting. The logistic regression was optimized using grid search to obtain optimized hyperparameters. The parameter set for logistic regression used in our study is shown in Table 4.
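As a minimal sketch of how the linear predictor and the sigmoid combine into a prediction, the following Python snippet computes the class probability for one record. The function names, the toy coefficients, and the 0.5 decision threshold are assumptions made for illustration; they are not values reported in the study.

```python
import math


def sigmoid(g):
    """Logistic function S(g) = 1 / (1 + e^(-g)), written to avoid overflow."""
    if g >= 0:
        return 1.0 / (1.0 + math.exp(-g))
    z = math.exp(g)
    return z / (1.0 + z)


def predict_proba(x, beta, beta0):
    """Probability of the positive class for one record x."""
    g = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return sigmoid(g)


def predict_label(x, beta, beta0, threshold=0.5):
    """Binary label; the 0.5 threshold is an assumed default."""
    return int(predict_proba(x, beta, beta0) >= threshold)


if __name__ == "__main__":
    # Hypothetical coefficients for a two-feature record.
    print(predict_proba([1.0, 2.0], beta=[0.5, -0.25], beta0=0.0))
```

In practice the coefficients would be fitted by maximum likelihood rather than set by hand.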

Random Forest. Random forest is an ensemble-based classification and regression model initially proposed by
Zhang [19].Random forest can be used for feature selection as well.It uses the bootstrapping data sampling method for partitioning of the data into training and testing sets.e model iteratively generates the trees for every bootstrap.e final prediction is made using the mean vote for each class.It is the combination of all generated decision trees.A decision tree is the hierarchical classification algorithm.e selection of the decision node is made using entropy, information gain, gain ratio, and Gini-index, respectively.In our study, we used information gain and entropy, as shown in the following equations: where E(Y) represents the entropy of the target, while Entropy(X, Y) is the entropy of the attributes with the target, in which X � x 1 , x 2 , . . ., x n   is the set of attributes in the dataset.e attribute with the highest information gain will be the root attribute, as follows: It combines the predictions made by multiple trees using randomly selected vectors represented by θ T .e selected   Scientific Programming vectors are independent with the previously selected vectors. is results in the collection of trees represented by h(x).e generalization error of decision tree is represented as follows: where P X,Y is the probability of set of the attributes to map to class label Y. e parameters used in our study for random forest classifier are shown in Table 5.
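To make the split criterion concrete, the snippet below computes the entropy of a label list and the information gain of a single categorical attribute. It is a sketch for illustration only; the function names and the toy labels are assumptions, and a real decision tree would repeat this computation for every candidate attribute at every node.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the target minus the weighted entropy after splitting on `values`."""
    n = len(labels)
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


if __name__ == "__main__":
    labels = ["survived", "survived", "deceased", "deceased"]
    print(information_gain(["a", "a", "b", "b"], labels))  # attribute separates the classes
    print(information_gain(["a", "a", "a", "a"], labels))  # attribute carries no information
```

The attribute with the largest gain would be chosen as the split at that node.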

Extreme Gradient Boosting.
The extreme gradient boosting (XGB) algorithm is an ensemble-based classification and regression technique. It is the regularized form of the gradient boosting algorithm. Because of data imbalance, the gradient boosting algorithm sometimes suffers from model overfitting. In the XGB algorithm, however, the regularization parameter reduces the risk of model overfitting. Like random forest, XGB is a tree-based ensemble classifier. The boosting data resampling method attempts to enhance the model accuracy by minimizing the misclassification error [19]. It is an iterative approach: the records that were not successfully predicted in the previous iteration are used in the next iteration for training the model, and the process is repeated until the model achieves an optimal result. The regularization parameter reduces the variance in the model by increasing the weights of the misclassified instances. The increase in weight decreases model underfitting. However, for reducing the bias of the model, penalty regularization was used to control model overfitting without leading to a high misclassification rate.
The XGB algorithm has several parameters, and the optimal combination of parameters enhances the performance of the model. For parameter optimization, the grid search technique was used. The parameters used in the XGB algorithm are shown in Table 6.
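Grid search itself is simple to express: every combination of candidate parameter values is scored, and the best one is kept. The sketch below shows the idea in plain Python. The function name `grid_search`, the parameter names in the toy grid, and the toy scoring function are assumptions for illustration; in a real experiment the scoring function would train the classifier with the given parameters and return a cross-validated metric.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best one.

    param_grid: dict mapping a parameter name to a list of candidate values.
    score_fn:   callable taking a dict of parameters and returning a score
                (higher is better).
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


if __name__ == "__main__":
    # Toy stand-in for "train the model and return its validation score".
    def toy_score(p):
        return -((p["max_depth"] - 4) ** 2) - (p["learning_rate"] - 0.1) ** 2

    grid = {"max_depth": [2, 4, 6], "learning_rate": [0.01, 0.1, 0.3]}
    print(grid_search(grid, toy_score))
```

The cost grows with the product of the list lengths, which is why grids are usually kept small.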

Performance Evaluation.
The performance of the models was evaluated using standard evaluation measures: accuracy, precision, sensitivity, specificity, and F-score. The receiver operating characteristic (ROC) curve and the area under the curve (AUC) were also used for comparing the classifiers. The ROC curve is widely used for exploring the trade-off between the true-positive rate (sensitivity) and the false-positive rate (1 - specificity) of a diagnostic test. With TP, TN, FP, and FN denoting the numbers of true positives, true negatives, false positives, and false negatives, the measures are defined as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

where the accuracy of the model represents the proportion of the test records that are correctly classified.

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}.$$

Sensitivity is the proportion of the positive class labels that are correctly predicted. It is also known as the true-positive rate (TPR) or recall.

$$\mathrm{Specificity} = \frac{TN}{TN + FP}.$$

Specificity, also known as the true-negative rate (TNR), is the proportion of the negative class labels that are correctly predicted as negative.

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad F\text{-score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$

where F-score is the harmonic mean of precision and recall.
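The measures above follow directly from the four confusion-matrix counts. The following Python sketch computes them for a pair of label lists; the function names and the toy labels are assumptions for illustration, and ratios with a zero denominator are reported as 0.0 by convention here.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for binary labels."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def evaluate(y_true, y_pred, positive=1):
    """Accuracy, precision, sensitivity, specificity, and F-score as a dict."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)

    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "f_score": ratio(2 * precision * sensitivity, precision + sensitivity),
    }


if __name__ == "__main__":
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
    print(evaluate(y_true, y_pred))
```

Which class is treated as positive (for example, "deceased") changes sensitivity and specificity, so it should be stated alongside any reported values.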

Experimental Setup and Results
Data imbalance is one of the challenges in data analysis and can lead to model overfitting. The dataset in this study also suffers from data imbalance, as presented in Figure 1. The number of records is 243 for the survived category and 44 for the death category. The K-nearest neighbor- (KNN-) based synthetic minority oversampling technique (SMOTE) was used to address this imbalance. The following tables present the performance of the classifiers in terms of accuracy, sensitivity, specificity, and F-score. The results showed that random forest outperformed the other models with SMOTE data. Table 7 presents the performance of the classifiers using all features, Table 8 presents the outcome using the top 20 features, Table 9 presents the results with the top 15 features, and Table 10 presents the comparison with the top 10 features.
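The core idea of KNN-based SMOTE is to create each synthetic minority record by interpolating between a real minority record and one of its nearest minority neighbours. The Python sketch below illustrates that idea only; it is not the implementation used in the study. The function name `smote`, the default `k=5`, the fixed seed, and the toy points are assumptions, and the target number of synthetic records is left to the caller (fully balancing 243 against 44 records would require 199 synthetic records, which is one possible choice, not a detail stated in the text).

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority-class points.

    Each synthetic point lies on the segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    deceased = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for point in smote(deceased, n_new=3, k=2):
        print(point)
```

Oversampling is normally applied to the training portion only, so that synthetic records never leak into the data used for evaluation.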
Experimental results revealed that random forest outperformed the other classifiers using the top 20 features with SMOTE data, with an accuracy of 0.952, sensitivity of 0.949, specificity of 0.956, and F-score of 0.955. The AUC-ROC curves for logistic regression, random forest, and extreme gradient boosting using the top 20 features are shown in Figures 3, 4, and 5, respectively. Random forest achieved an AUC of 0.99. However, random forest achieved its highest specificity, 1, using the top 15 features.
Logistic regression, on the other hand, underperformed the other classifiers with the top 20, 15, and 10 features using SMOTE data, with accuracies of 0.86, 0.82, and 0.84, respectively. The AUC-ROC curve shows that LR achieved 0.91. However, LR in our study performed better than in another study conducted by Yao et al. [13]. They used the LR model to identify the severity of COVID-19 patients, and their results achieved an AUC-ROC of 0.881.
A number of studies focused on predicting severity or mortality have noted that age is one of the top features that help to predict the severity of cases [10][11][12][13]. In our study, age was ranked among the top 10 of the 25 features used in our prediction model. In addition, our study outperformed the other studies covered in the literature review, with an accuracy of 0.952 and an AUC of 0.99. This study covers the prediction of the survival and death of COVID-19-positive patients using demographic features, vital signs, and chronic diseases. The overall result demonstrates the significance of the proposed study, with an accuracy of 0.95 and an AUC value of 0.99 using 20 features. The study was performed using a real dataset from the King Fahad University Hospital. Moreover, the dataset contains a very small number of missing data.

Conclusion
The COVID-19 pandemic outbreak has devastated the whole world and led to a state of worldwide health emergency. Several efforts have been made to combat this pandemic. In this study, we aimed to explore the impact of vital signs, chronic disease, preliminary clinical data, and demographic features on predicting the mortality and survival of COVID-19 patients using supervised machine learning algorithms. Because of the low mortality rate among COVID-19 cases, the dataset suffers from data imbalance. The SMOTE technique was used to alleviate the data imbalance. The results showed that random forest outperformed the other models using 10-fold cross-validation. The grid search technique was applied for parameter optimization. The study achieved an accuracy of 0.952 and an AUC of 0.99. Despite the significant outcome achieved by the proposed model, there is still a need for improvement. The models need to be validated using multiple datasets. Furthermore, in the future, we will incorporate and explore the impact of other clinical features and laboratory results that were identified as significant in previous studies.

Figure 1: Number of records per class label.

Figure 2: Correlation of top 20 features in the dataset.

Table 1: Related studies on mortality prediction for COVID-19 patients.

Table 2: Description of the dataset.

Table 3: Characteristics of the samples in the dataset.

Table 4: Logistic regression parameters using grid search optimization.

Table 7: Performance comparison of classifiers using all features (25) using original and SMOTE data.

Table 8: Performance comparison of classifiers using top 20 features using original and SMOTE data.

Table 9: Performance comparison of classifiers using top 15 features using original and SMOTE data.
Despite its several advantages, the study can be further improved by increasing the number of patients. Furthermore, the study needs to incorporate other laboratory tests such as lactate dehydrogenase (LDH), neutrophils, lymphocytes, and high-sensitivity C-reactive protein. Several features identified as significant in the literature need to be included for predicting the mortality risk in COVID-19 patients.