Decision Tree Ensembles to Predict Coronavirus Disease 2019 Infection: A Comparative Study



Introduction
SARS-CoV-2 is a new strain of coronavirus causing Covid-19; it was first identified in Wuhan city, Hubei province, China [1,2]. It was declared a pandemic by the World Health Organization (WHO) [1,2]. As of 24 November 2020, more than 59.2 million confirmed cases of Covid-19 and more than 1.3 million deaths had been reported to WHO [3].
Covid-19 is a respiratory infection and in severe cases may progress to Covid-19 pneumonia [4]. The reverse transcription polymerase chain reaction (RT-PCR) test is considered the gold standard for Covid-19 diagnosis [5], but it requires at least 24 hours to produce a result. Clinicians therefore use chest imaging tests to diagnose Covid-19 while awaiting RT-PCR test results.
Machine learning algorithms are very effective for prediction [6]. Chest X-ray images and computed tomography (CT) scans have been used to train machine learning models [7][8][9]; the trained models are then used to predict Covid-19 positive cases. Using point-of-care ultrasound, machine learning methods can accurately predict from initial lung scans which Covid-19 patients are at a greater risk of death [10]. Machine learning methods can also help distinguish Covid-19 infection from community-acquired pulmonary infection [11].
Various studies have been carried out to investigate whether Covid-19 positive cases can be predicted using commonly taken laboratory tests such as hematocrit, hemoglobin, platelets, red blood cells, lymphocytes, and leukocytes [12][13][14].
Some of these datasets are imbalanced [12,14]; in other words, the number of positive cases is very small compared to the number of negative cases. Various machine learning methods have been specifically developed for such datasets [15,16]. The performance measures for these datasets should be carefully selected because, in some domains such as medical diagnosis, the penalty for a false negative is higher than the penalty for a false positive [17].
An ensemble is a combination of many classifiers [18]. An ensemble performs better than an individual classifier if the member classifiers are accurate and diverse. Classifier ensembles have been used in many applications [19,20]. A decision tree is a popular machine learning algorithm [6]. Random forests [21], eXtreme Gradient Boosting (XGBoost) [22], bagged decision trees [23], and so forth are examples of decision tree ensembles. Ensembles of decision trees are accurate and quite robust to the selection of parameters. Decision tree ensembles have also been developed for imbalanced datasets [24,25]. Various Covid-19 datasets based on laboratory tests [12][13][14] are imbalanced; however, they have not been studied using machine learning algorithms developed for imbalanced datasets. This paper investigates the application of decision tree ensembles to a Covid-19 dataset that is based on commonly taken laboratory tests. The dataset is imbalanced, as the ratio of positive class to negative class data points is 1 to 6.5. Therefore, decision tree ensembles that are specifically developed for imbalanced datasets are also applied. Undersampling and oversampling techniques [15,16] are used to address the data imbalance problem. As the data is imbalanced, classification accuracy is not an appropriate performance measure to compare different classifiers. F-measure, precision, recall, area under the precision-recall curve, and area under the receiver operating characteristic curve [17] are employed to compare the performances of different decision tree ensembles.

The paper is organized in the following way. Section 2 discusses research works that apply machine learning algorithms to various kinds of datasets to predict Covid-19 positive cases. Information on the Covid-19 dataset, the decision tree ensembles, and the performance measures used in the experiments is presented in Section 3. Section 4 presents the experiments and discussion. The paper ends with a conclusion and future work section.

Related Work
Machine learning algorithms have been applied to various types of datasets such as X-ray images, CT scans, and point-of-care ultrasound to predict Covid-19 positive cases [7,10]. However, in this section, we concentrate on research works that apply machine learning algorithms to predict Covid-19 positive cases using only routinely collected laboratory findings of the patients [12][13][14].
Batista et al. [13] collected data from 235 adult patients at the Hospital Israelita Albert Einstein in São Paulo, Brazil, from 17 to 30 March 2020. 15 variables, comprising age and gender in addition to 13 laboratory tests such as hemoglobin, platelets, and red blood cells, were used to create a prediction model. Five machine learning algorithms (neural networks, random forests, XGBoost, logistic regression, and support vector machines (SVM)) were tested on this dataset. The best predictive performance was obtained by the SVM algorithm. Schwab et al. [14] used anonymised data from a cohort of 5644 patients seen at the Hospital Israelita Albert Einstein in São Paulo, Brazil, in the early months of 2020. In the dataset, the rate of positive patients was around 10%. They used 97 routine clinical, laboratory, and demographic measurements as the features and applied the same five machine learning algorithms as Batista et al. In their experiments, XGBoost performed the best. They also carried out experiments for predicting hospitalisation and ICU admission: random forests performed best for predicting hospital admission for Covid-19 positive patients, whereas SVM performed best for predicting ICU admission. Alakus and Turkoglu [12] modified the data used by Schwab et al. [14].
The new dataset has 18 laboratory findings of 600 patients. In the dataset, the rate of positive patients was around 13%. They carried out a comparative study of deep learning approaches. Six model types (Artificial Neural Network (ANN), Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), Recurrent Neural Network (RNN), CNNLSTM, and CNNRNN) were applied to this dataset. LSTM performed best under the 10-fold cross-validation approach. The datasets used by Schwab et al. [14] and Alakus and Turkoglu [12] are imbalanced; therefore, it is interesting to apply algorithms that have been developed for imbalanced datasets.

Data and Classifiers Used in the Experiments
This section discusses the Covid-19 dataset, the decision tree ensembles, and the performance measures used in our experiments.
3.1. Data. 111 laboratory findings of 5644 patients seen at the Hospital Israelita Albert Einstein in São Paulo, Brazil, were collected to detect Covid-19 in the early months of 2020 [14]. Alakus and Turkoglu [12] selected the 18 lab findings that play the most important role in Covid-19 and removed the data points that had missing values for these 18 lab findings. The final dataset has 600 data points: 520 data points are negative and 80 data points are Covid-19 positive. The ratio of negative data (majority class) to positive data (minority class) is 6.5 to 1, which makes the dataset imbalanced. This dataset is used in the experiments.
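As a minimal sketch (the class counts are taken from the description above; the variable names are ours), the imbalance of the dataset can be quantified before any training:

```python
# Class counts from the dataset described above:
# 520 negative (majority) and 80 positive (minority) data points.
n_negative = 520
n_positive = 80

imbalance_ratio = n_negative / n_positive               # majority : minority
minority_fraction = n_positive / (n_negative + n_positive)

print(f"imbalance ratio  : {imbalance_ratio:.1f} to 1")   # 6.5 to 1
print(f"minority fraction: {minority_fraction:.3f}")      # 0.133
```

The minority fraction computed here is also the AUPRC baseline discussed later in the paper.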

3.2. Decision Tree Ensembles.
A decision tree is a very successful classifier that has been applied in many domains [6]. Decision trees are built using a recursive partitioning process in which the data points at each node are split using the selected split criterion. A path from the root node to a leaf is a rule which is used for prediction. An ensemble of classifiers consists of many classifiers [18]; the final decision is the combination of all the member classifiers. An ensemble generally performs better than its individual members if those members are accurate and diverse. Decision tree ensembles are quite robust to the selection of parameters and perform well. In the experiments, many decision tree-based ensembles are used. As the data is imbalanced, decision tree-based ensembles that are developed for imbalanced datasets are also used. We will discuss these ensembles and their implementation.
3.2.1. C4.5 Decision Tree. Various split criteria have been proposed for decision trees. C4.5 uses the information gain ratio split criterion, which reduces the bias towards multivalued attributes [26]. For all the ensembles, we use C4.5 or a variant of it as the base classifier.
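The gain ratio criterion can be illustrated with a small NumPy sketch (our own illustrative code, not the C4.5 implementation used in the experiments): information gain is divided by the "split information", which penalises splits into many small branches and thus reduces the bias towards multivalued attributes.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(labels, partition):
    """C4.5's split criterion: information gain divided by split information.

    `partition` assigns each data point to a branch of the candidate split.
    """
    n = len(labels)
    branches, counts = np.unique(partition, return_counts=True)
    # Weighted entropy of the children after the split.
    children = sum((c / n) * entropy(labels[partition == b])
                   for b, c in zip(branches, counts))
    info_gain = entropy(labels) - children
    # Split information penalises splits with many small branches.
    split_info = entropy(partition)
    return info_gain / split_info if split_info > 0 else 0.0

# Toy example: a perfectly separating, balanced binary split.
y = np.array([0, 0, 0, 1, 1, 1])
split = np.array([0, 0, 0, 1, 1, 1])
print(gain_ratio(y, split))  # 1.0
```

A split that does not separate the classes (e.g. an alternating partition) yields a gain ratio close to 0, so the criterion ranks the separating split first.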

3.2.2. Decision Tree Ensembles. Different kinds of ensemble methods have been proposed. Some of them are general methods, such as bagging [23] and AdaBoost [27], which can be used with any classifier, whereas others are specific to decision trees, such as random forests [21]. XGBoost [22] is a scalable gradient tree boosting ensemble method, which has produced excellent results in many domains.
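The standard ensembles above can be sketched with scikit-learn (a sketch under stated assumptions: scikit-learn's CART-style trees stand in for C4.5, synthetic data generated with `make_classification` stands in for the Covid-19 dataset, and XGBoost is omitted to keep the example to one library):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Covid-19 data: 600 points, roughly 6.5:1 imbalance.
X, y = make_classification(n_samples=600, n_features=18, weights=[0.867],
                           random_state=0)

# Ensemble size 50, as in the paper's experimental setup.
ensembles = {
    "bagging":       BaggingClassifier(DecisionTreeClassifier(),
                                       n_estimators=50, random_state=0),
    "adaboost":      AdaBoostClassifier(n_estimators=50, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# 10-fold cross-validation, reporting AUROC for each ensemble.
aucs = {}
for name, clf in ensembles.items():
    scores = cross_val_score(clf, X, y, cv=10, scoring="roc_auc")
    aucs[name] = scores.mean()
    print(f"{name:13s} AUROC = {aucs[name]:.3f}")
```

Swapping the scoring string to `"average_precision"` gives the AUPRC estimates used later in the paper.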

3.2.3. Decision Tree Ensembles for Imbalanced Datasets.
Many approaches have been proposed to handle imbalanced datasets. Undersampling of the majority class and oversampling of the minority class are two important approaches to reduce the imbalance of a dataset [15,16]. Random undersampling (RUS) [15,16] selects some data points from the majority class and combines them with the minority class to reduce the imbalance. The Synthetic Minority Oversampling Technique (SMOTE) [28] generates synthetic minority data points which are combined with the original dataset to reduce the imbalance. SMOTEBoost [29] combines the SMOTE algorithm with the boosting procedure, whereas SMOTEBagging [30] is a combination of SMOTE and the bagging algorithm. RUSBoost [31] combines data undersampling and boosting. RUSBagging [32] is the combination of random undersampling and bagging. Balanced random forests [25] use undersampling of the majority class to create balanced data for each tree of the random forest. Different packages such as Weka [33], imblearn [32], ebmc [34,35], and XGBoost [36] are used in the experiments. Table 1 shows the classifiers and the related packages used in the experiments.
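To make the RUS idea concrete, here is a minimal NumPy sketch of random undersampling (our own illustration, not the imblearn implementation used in the experiments; the `ratio` parameter is the minority-to-majority ratio after sampling):

```python
import numpy as np

def random_undersample(X, y, ratio=1.0, seed=0):
    """Randomly drop majority-class points until
    len(minority) / len(kept majority) == ratio."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    majority = classes[np.argmax(counts)]
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y == majority)
    n_keep = int(len(min_idx) / ratio)            # majority points to keep
    keep = rng.choice(maj_idx, size=n_keep, replace=False)
    idx = np.concatenate([min_idx, keep])
    return X[idx], y[idx]

# 520 negatives vs 80 positives, as in the Covid-19 dataset.
X = np.random.default_rng(0).normal(size=(600, 18))
y = np.array([0] * 520 + [1] * 80)
X_bal, y_bal = random_undersample(X, y, ratio=1.0)
print(np.bincount(y_bal))  # [80 80]
```

Setting `ratio=0.5` keeps 160 majority points, the configuration that later proves best for XGBoost in the experiments.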
The default parameter values were used for all the classifiers, except that the ensemble sizes were fixed at 50. A 10-fold cross-validation procedure was used in the experiments. The average results are presented in Tables 2-8.

3.3. Performance Measures.
There are different measures to compute the performance of classifiers. Accuracy is one of the most commonly used performance measures; however, for imbalanced datasets, it is not very useful [17]. Accuracy, precision, recall, and F1-measure are used in the experiments [17]; we discuss these measures in detail in the appendix. The area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) are also used to compare the performance of decision tree ensembles [17].
(i) AUROC: the ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at different decision threshold settings. The area under the curve is used as a performance measure. AUROC values range from 0 to 1, and the baseline (random classifier) is 0.5 [37].
(ii) AUPRC: the precision-recall curve plots precision against recall at different decision threshold settings. The area under the curve is used as a performance measure. AUPRC takes values from 0 to 1. Unlike AUROC, the baseline is not constant for AUPRC: it depends on the ratio of positive and negative samples and is equal to positive/(positive + negative) [37]. For the dataset used in the experiments, it is equal to 80/600 ≈ 0.13. It has been demonstrated [37] that, for imbalanced datasets, AUPRC is a better performance measure than AUROC.
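The different baselines of the two measures can be verified with a small sketch (our own illustration using scikit-learn's metrics; an uninformative classifier scores near 0.5 AUROC but only near the positive-class prevalence, 80/600 ≈ 0.13, for AUPRC):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Labels mimicking the experimental data: 80 positives out of 600.
y_true = np.concatenate([np.ones(80), np.zeros(520)])
# An uninformative "random" classifier: scores unrelated to the labels.
y_score = rng.random(600)

auroc = roc_auc_score(y_true, y_score)
auprc = average_precision_score(y_true, y_score)

print(f"AUROC ~ {auroc:.2f} (baseline 0.50)")
print(f"AUPRC ~ {auprc:.2f} (baseline {80 / 600:.2f})")
```

This is why a seemingly decent AUROC can mask near-baseline precision-recall behaviour on imbalanced data.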

Results and Discussion
Different experiments were carried out to study the performance of different decision tree ensembles. This section presents the results of those experiments.

4.1. The Comparative Study of Various Decision Tree Ensembles. Different types of decision tree ensembles are compared using various performance measures. Results are presented in Table 2. For two (accuracy and precision) of the six performance measures, the standard classifiers performed best, whereas for the other four (F1-measure, recall, AUROC, and AUPRC) the decision tree ensembles for imbalanced datasets performed best. Random forests performed best for accuracy and precision. Balanced random forest (RUS) performed best for three performance measures: recall, F1-measure, and AUPRC. RUSBagging performed best for AUROC. AUROC and AUPRC are widely used performance measures for imbalanced datasets; RUSBagging gave the best AUROC with a value of 0.881, whereas balanced random forest (RUS) gave the best AUPRC with a value of 0.561. The study demonstrates that decision tree ensembles for imbalanced datasets perform better on this Covid-19 dataset.

4.2. Standard Decision Tree Ensembles with Different Sampling Techniques. Sampling is an approach to overcome the imbalance of datasets. We use the standard decision tree ensembles with two sampling approaches: SMOTE and RUS. As AUROC and AUPRC are the most commonly used performance measures for imbalanced datasets, only these performance measures were used. Results with the SMOTE oversampling method are presented in Table 3. We carried out experiments with different ratios of minority class to majority class data points; the results on the original data points are also presented. The results show that the best AUROC was 0.872, obtained by random forest on the original dataset, whereas the best AUPRC was 0.648, obtained by bagging on the original dataset. For bagging and random forest, the results improved with oversampling. Except for a single decision tree, the SMOTE oversampling method had a negative effect on the performance of the other ensembles. The presence of noisy minority points may be the reason for the poor performance of the SMOTE oversampling method [38]. RUS was applied to the majority class to change the ratio of minority class to majority class. The best results (AUROC and AUPRC) were obtained with XGBoost: the best AUROC was 0.873 with the data with a ratio of 0.5, whereas the best AUPRC was 0.554. XGBoost and AdaBoost performed best for the dataset with a ratio of 0.5, whereas random forest performed best for the original data. The results suggest that the sampling methods did not have similar effects on all the ensembles. However, the best results among all the classifiers were obtained with XGBoost for the data with a ratio of 0.5 created with RUS.
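For readers unfamiliar with SMOTE's interpolation step, here is a minimal NumPy sketch (our own simplification, not the imblearn implementation used in the experiments): each synthetic point lies on the line segment between a minority sample and one of its k nearest minority neighbours, which is also why noisy minority points propagate into the synthetic data.

```python
import numpy as np

def smote(X_min, n_synthetic, k=5, seed=0):
    """Minimal SMOTE: interpolate each synthetic point between a random
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise distances among minority points only.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]   # k nearest per point

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))
        j = rng.choice(neighbours[i])
        gap = rng.random()                      # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Oversample 80 minority points up to parity with 520 majority points.
X_min = np.random.default_rng(1).normal(size=(80, 18))
X_new = smote(X_min, n_synthetic=520 - 80)
print(X_new.shape)  # (440, 18)
```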

4.3. Effects of the Ensemble Size. An ensemble is a combination of classifiers, and the number of classifiers in an ensemble is its size. An experiment was carried out to study the effect of ensemble size on different classifier ensembles. For this study, the ensembles that performed better with default values were selected, and the AUROC and AUPRC performance measures were used. Results for AUROC and AUPRC are presented in Tables 5 and 6, respectively. The results suggest that the performance generally improves slightly or remains constant with size. This is consistent with the theory of classifier ensembles: most of the performance improvement comes with the first few classifiers in an ensemble, and adding more classifiers may not be very useful [39].
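This kind of size sweep can be sketched as follows (our own illustration with a random forest on synthetic stand-in data, not the paper's experiment; the typical pattern is a quick rise followed by a plateau):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly the 6.5:1 imbalance of the Covid-19 dataset.
X, y = make_classification(n_samples=600, n_features=18, weights=[0.867],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Sweep the ensemble size and record the test AUROC at each size.
aucs = []
for size in [1, 5, 10, 25, 50, 100]:
    clf = RandomForestClassifier(n_estimators=size, random_state=0)
    clf.fit(X_tr, y_tr)
    aucs.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
    print(f"size {size:3d}: AUROC = {aucs[-1]:.3f}")
```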

4.4. Effects of the Age Variable.
The data has 18 lab findings as well as the age of each patient. The previous experiments [12] used the 18 lab tests for Covid-19 prediction but did not use age. The age of a patient plays a very important role in the severity of Covid-19; therefore, it is important to understand the effect of age on Covid-19 prediction. An experiment was carried out with a dataset of 19 attributes (18 lab findings + age variable), and the results were compared with those for 18 attributes. The experimental settings were the same as those for the dataset with 18 attributes. Two sets of experiments were done, one with an ensemble size of 50 and the other with an ensemble size of 100. Results are presented in Tables 7 and 8. For the AUROC performance measure with an ensemble size of 50, seven out of nine ensembles performed better on the dataset with 19 attributes; similarly, with an ensemble size of 100, six out of nine ensembles performed better. For the AUPRC performance measure, eight out of nine ensembles performed better on the dataset with 19 attributes, with both ensemble sizes of 50 and 100. The results suggest that including the age variable in the dataset can improve the prediction performance.
Following are the findings of the experiments: (I) The Covid-19 dataset based on laboratory tests [12] is an imbalanced dataset; however, it had not been studied using machine learning algorithms developed for imbalanced datasets. We studied it using decision tree ensembles for imbalanced datasets, and the study demonstrates that these ensembles perform better on this Covid-19 dataset. (II) Experiments with different sampling techniques suggest that the sampling methods did not have similar effects on all the general decision tree ensembles; however, the best results among all the classifiers were obtained with XGBoost for the data with a ratio of 0.5 created with RUS. (III) The previous experiments [12] did not use the AUPRC performance measure, which is a better performance measure than AUROC [37]. The results could be misleading with inappropriate performance measures: as shown in Table 2, balanced random forest (RUS) performs best for AUPRC, whereas RUSBagging performs best for AUROC. Hence, the classifier should be selected on the basis of the AUPRC performance measure. (IV) The dataset has 19 attributes (18 lab tests + age). The previous experiments [12] used the 18 lab tests for Covid-19 prediction; we also studied the effect of the age attribute and found that including it with the 18 lab tests can improve the Covid-19 prediction accuracy.

Conclusion and Future Work
The prediction of Covid-19 positive cases is an important step in managing Covid-19 patients, and machine learning algorithms can be useful for this classification task. Various kinds of decision tree ensembles were applied to a Covid-19 dataset consisting of commonly taken laboratory tests; such datasets are therefore easy to collect. The dataset has a class imbalance problem. The results demonstrate that decision tree ensembles developed for imbalanced datasets perform better than standard decision tree ensembles.
This suggests that the selection of classification methods should be based on the properties of the data: if the data is imbalanced, classifiers developed for imbalanced datasets should be used. Similarly, appropriate performance measures should be used for a given classification problem; otherwise, the results could be misleading. The results also suggest that combining the age variable with the other laboratory tests can improve the prediction performance. In the future, we will compare the performance of decision tree ensembles with other types of classifiers such as SVM and deep learning classifiers. We will further study the combination of laboratory tests with X-ray data for the prediction of Covid-19 positive cases. The prediction of the severity of Covid-19 using datasets with laboratory tests will also be investigated.

Appendix
We discuss various performance measures in detail. We first define the terms used to define these measures for a binary dataset. The dataset has two classes, positive and negative; in our experiments, the minority class is taken as the positive class.
True positive (TP): correctly predicted positive class data points.
False positive (FP): negative class data points predicted as the positive class.
True negative (TN): correctly predicted negative class data points.
False negative (FN): positive class data points predicted as the negative class.
P: total positive class data points.

Data Availability

The data are publicly available.

Conflicts of Interest
The authors declare no conflicts of interest.