A Novel Benchmark Dataset for COVID-19 Detection during Third Wave in Pakistan

Coronavirus disease (COVID-19) is a highly severe infection caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The polymerase chain reaction (PCR) test is essential to confirm COVID-19 infection, but it has certain limitations: reagents are scarce, the test is time-consuming, and it requires expert clinicians. Clinicians suggest that the PCR test is not a reliable automated COVID-19 patient detection system. This study proposes a machine learning-based approach to evaluate the role of PCR in COVID-19 detection. We collected real data containing 603 COVID-19 samples from the Pakistan Institute of Medical Sciences (PIMS) Hospital in Islamabad, Pakistan, during the third COVID-19 wave. The experiments are separated into two sets. The first set comprises 24 features, including PCR test results, whereas the second comprises 24 features without the PCR test. The findings demonstrate that the decision tree achieves the best detection rate for positive and negative COVID-19 patients in both scenarios. The findings suggest that PCR contributes little to detecting COVID-19 patients. The findings also aid in the early detection of COVID-19, mainly when PCR test results are insufficient for diagnosing COVID-19, and can help developing countries with a paucity of PCR tests and specialist facilities.


Introduction
Coronavirus is a highly severe infection caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [1]. It has quickly spread across the world [2][3][4]. It is a highly lethal disease that has caused more than 500,000 deaths across 216 countries. It affects daily human activity, and early detection is critical to preventing the spread of this contagious virus. COVID-19 can currently be detected using the reverse transcription polymerase chain reaction (RT-PCR) test and the rapid antigen test (RAT) [5,6]. These tests are not 100% reliable in identifying COVID-19; 20% false positives have been recorded, and the method is very time-consuming [7]. The RAT test can identify IgM and IgG antibodies. Its main drawbacks are a sensitivity of 18.8% and a specificity of 78.1%. The majority of underdeveloped nations cannot efficiently carry out these tests. Consequently, additional components and testing methods that are more readily available and need fewer computing resources are being promoted [8].
Pakistan is one of the countries most affected by COVID-19, with 253,604 COVID-19-positive cases [26,27]. The first COVID-19 report was published in Pakistan on February 26, 2020. According to this report, three COVID-19 cases appeared in Pakistan within two days, with no link between the individuals [28]. The number of COVID-19 cases climbed with time, with 139,230 positive cases discovered until June 12, after which the aggregate number of cases fell. Until July 25, there were 273,113 confirmed case reports. The COVID-19 outbreak threatens the healthcare and medical systems of underdeveloped countries like Pakistan.
There are not enough clinicians or medical and healthcare resources in Pakistan to meet these needs. A lack of resources, especially in smaller cities and villages, makes it challenging to treat patients and reduce mortality rates. Although Pakistan has limited resources and a poor healthcare system, it has effectively enhanced its level of preparedness for COVID-19 [29].
Pakistan has so far undergone three distinct COVID-19 waves. The first wave arrived in Pakistan in May 2020, with daily increases in confirmed cases and new deaths. It came to an end in the middle of July. The first wave had a low mortality rate and passed quickly, and cases and mortality rates decreased rapidly after peaking.
Pakistan's COVID-19 crisis calmed following the first wave, with fewer new deaths and positive cases. However, at the start of November 2020, the second wave of COVID-19 arrived, and new cases and deaths started increasing. The second wave was modest in intensity, primarily affected the southern province of Sindh, and reached its peak in the middle of December 2020. The third wave in Pakistan began in the middle of March 2021. When the third wave began, the mortality rate and the number of COVID-19 cases reached a climax.
This wave primarily impacted the Punjab and Khyber Pakhtunkhwa provinces. The third wave reached its apex in April 2021; then, COVID-19 cases and the number of deaths decreased. Section 2 discusses prior studies on the subject. The dataset is described in Section 3. The suggested approach is presented in Section 4. Section 5 presents and evaluates the experimental setup and results. Section 6 contains the conclusion and future work.

Literature Review
This section reviews several previous works similar to ours. We reviewed the literature in terms of the techniques used for diagnosing COVID-19, the samples obtained, the samples selected for analysis, and the constraints observed in each study. The PCR test is performed once the RNA has been extracted correctly using a specified clinical methodology.
It is currently a one-of-a-kind approach for diagnosis. Unfortunately, it still has many constraints. The test requires advanced equipment and skilled personnel [30]. Testing a single sample is impractical due to the high cost and the time required, almost 4 to 5 hours, so PCR machines are run with a batch of samples to reduce cost. Incorrect results are expected at a rate of 3 to 30% [31]. This false-negative rate is risky since the patient is not isolated and may contribute to the spread of the disease. In addition to PCR testing, CT scans can be used to detect the virus [32,33]. Unfortunately, a CT scan cannot provide a precise diagnosis of this infection. It is also not easily accessible everywhere and might expose patients to needless radiation [34]. As a result, clinicians do not advise CT scans and chest radiographs (CXR) for all patients [35]. Clinical and standard blood tests can be used to identify COVID-19 in an affordable and timely manner. Existing studies use a single model, multiple models, or a combination of different models, and several have used various ML models to diagnose COVID-19.
The authors in [36] used routine blood parameters to train an ML model and diagnose COVID-19. They employed 11 of the original 49 parameters. The study included 235 patients, of which 105 were confirmed COVID-19 patients. They used accuracy, specificity, and sensitivity as assessment measures and obtained 95.95% accuracy, 95.13% specificity, and 96% sensitivity, respectively. The research focuses on early diagnosis and prompt treatment. A random forest technique is used to mine 11 major blood indicators. The 49 blood test parameters produced by commercial blood test (CBT) equipment are then used to construct a tool set for assisted discrimination.
The critical problem encountered in that investigation is that the ability to identify COVID-19 cases among patients with common symptoms was not validated, since such cases were difficult to obtain at the time.
The authors in [37] proposed an ML technique and used LR and RF models to detect COVID-19. The data come from 52 COVID-19-infected patients whose CT scans were taken at 5 hospitals in China from January 23, 2020, to February 8, 2020. The proposed models estimate the length of hospital stay for patients with COVID-19 pneumonia [38]. The outcome indicates whether the patient has a short hospital stay of fewer than 10 days or a long hospital stay of more than 10 days. The authors in [39] advised using a chest X-ray (CXR) since it is less costly, quicker, and more commonly used. Their study uses X-ray imaging to distinguish COVID-19-induced pneumonia from other types. It includes 1144 X-ray images from seven distinct classes. The study does not provide a conclusive COVID-19 diagnosis, but it does aid in screening patients in emergency care.
The authors in [40] collected a dataset from San Rafael Hospital. It contains the blood parameters of 279 patients, of which 177 are confirmed COVID-19 patients. The most critical characteristics for diagnosis of COVID-19, out of a total of 279, were aspartate aminotransferase (AST), lactate dehydrogenase (LDH), lymphocyte count (LC), C-reactive protein (CRP), and white blood cell (WBC) count. The ML models in this study attained accuracy ranging from 82% to 86%.
The authors in [41] used several machine learning models to predict the presence of coronavirus. They used RF, DNN, and XGB models for classification, and the XGB model achieved the best results of the three. The study involves data from 160 COVID-19 patients from Slovenia's University Medical Centre Ljubljana. Furthermore, data from 5,333 COVID-19-negative patients were added to the dataset.
Compared to previous studies, we concentrate on proposing a strategy for early diagnosis of COVID-19 in Pakistan and analyzing the contribution of the PCR test to COVID-19 detection. The proposed approach selects the best attributes for early COVID-19 diagnosis using machine learning methods. We utilized a dataset of 603 inpatients varying in age, gender, and comorbidity. Table 1 summarizes the remaining articles and their primary characteristics.

Dataset and Preliminaries
This section describes the overall data collection procedure for positive and negative COVID-19 patients from the Pakistan Institute of Medical Sciences (PIMS) Hospital. We gathered data on inpatients for five months, from August 2021 to December 2021, collecting the dataset every week. During the data collection phase, we faced various challenges. One of the primary challenges was choosing the crucial features that play a part in determining a COVID-19-positive patient. For this purpose, we took advice from a healthcare specialist, and with the assistance of a clinical specialist, we completed the process of dataset selection. Initially, our team collected the dataset in hardcopy form; we then selected only those columns recommended by the clinical expert and finalized our dataset.
This dataset contains demographic information, various symptoms, HRCT scan, blood test, and disease histories for 603 inpatients of various genders and ages. The dataset contains 403 COVID-19-positive patients and 200 COVID-19-negative patients. Initially, we had 38 columns in our dataset, and on the suggestion of a clinical expert, we eliminated five columns from the dataset ("Pulse," "Serum Albumin," "BP systolic," "HB," and "BP systolic") since these columns do not help in COVID-19 detection. The feature importance assigned to each column by the clinical expert is presented in Figure 1.

Proposed Approach
This research presents a unique ML classifier training and feature selection technique for identifying relevant factors in COVID-19 patients. The proposed technique comprises collecting COVID-19 patient datasets, choosing features indicated by clinical experts, data analysis, correlation analysis, feature extraction, model selection, and final prediction. This section discusses the methodology recommended for this research investigation. Figure 2 depicts an overview of the suggested method. Initially, we collected the dataset with the help of a clinical expert. The next step is to turn the hardcopy data into a soft copy in CSV format. The three main phases in data preparation are eliminating irrelevant characteristics, removing missing values, and label encoding. COVID-19 severity and nonseverity may be determined by a patient's first symptoms and clinical expert-selected characteristics. Feature selection algorithms are used to limit the dataset to the most critical characteristics, which are then fed into the machine learning model.
Five prominent ML classifiers, namely multi-layer perceptron (MLP), K-nearest neighbor (KNN), support vector machine (SVM), Naive Bayes (NB), and decision tree (DT), combined with a unique feature selection procedure, were utilized to detect positive and negative COVID-19 patients. The ML models are first trained on training data and then used to make predictions on new, unseen data.
There are numerous approaches in the literature for feature reduction, such as evaluating all feasible subsets, forward selection, or backward elimination, and some transformation-based techniques like PCA, fuzzy c-means, ICA, and their different versions. Unfortunately, these approaches have flaws that require a proper statistical analysis [49][50][51]. For example, to select the optimal feature subset, the all-possible-subsets technique requires $2^p$ rounds, where $p$ refers to the number of features. In this study, we engage a clinical expert who understands the benefits and drawbacks of each feature in the dataset to select optimum features. We also incorporate other relevant feature extraction approaches, such as random forest features, extreme gradient boosting (XGB) features, CatBoost features, and chi-square features. We compare the features derived by the different feature extraction techniques to the features specified by the clinical expert; the random forest feature extraction approach retrieves the most suitable features, which aid the ML classifier in COVID-19 identification. The comparison of all the feature extraction techniques is presented in Table 2.
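The cost argument and the comparison against the expert's selection can be illustrated with a minimal sketch. The feature names and rankings below are hypothetical placeholders, not the study's Table 2, and the overlap score is only one assumed way to compare an automatic ranking with an expert-chosen set.

```python
def exhaustive_subset_count(p: int) -> int:
    """Number of candidate subsets an all-possible-subsets search must score."""
    return 2 ** p


def overlap_with_expert(ranked_features, expert_features, k):
    """Fraction of the expert-selected features found in the top-k of a ranking."""
    top_k = set(ranked_features[:k])
    expert = set(expert_features)
    return len(top_k & expert) / len(expert)


# Hypothetical expert selection and hypothetical importance rankings
# (most important first) from two selection techniques.
expert = ["LDH", "CRP", "Age", "HRCT"]
ranked_rf = ["LDH", "CRP", "HRCT", "Fever", "Age"]
ranked_chi2 = ["Fever", "Gender", "LDH", "Cough", "CRP"]

subsets_for_24_features = exhaustive_subset_count(24)
rf_overlap = overlap_with_expert(ranked_rf, expert, k=4)
chi2_overlap = overlap_with_expert(ranked_chi2, expert, k=4)

print(subsets_for_24_features, rf_overlap, chi2_overlap)
```

With 24 features, an exhaustive search would already have to score more than 16 million subsets, which is why a ranking-based comparison against the expert's choice is far cheaper.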

Preprocessing and Correlation Analysis.
Data preprocessing is essential in machine learning because the quality of the data and the meaningful information taken from it directly influence the model's ability to learn. The three critical steps in data preprocessing are imputation of missing values, removal of nonuseful features, and label encoding using the category code technique. While preprocessing, we found that our dataset had no null values. In the next step, we removed the nonuseful columns that did not contribute to COVID-19 detection. In the third step, we applied the category code strategy, which transforms categorical data into numerical values. Because this method requires each categorical column to be of the "category" data type, we first inspect the data type of each column and convert it to "category" where needed. After preprocessing, 33 columns and 603 rows remain.
We then move on to correlation analysis. Pearson's correlation coefficient (PCC) is used to examine the relationship between features in order to eliminate nonessential, duplicate, and redundant features from the data, as shown in Figure 3. The correlation coefficient ranges between −1 and 1. If the value is near −1, the features are negatively correlated; if the value is close to 1, the features are closely related, which significantly impacts model performance. We set the threshold to 0.95: if the correlation value exceeds the threshold, we drop the feature; if it is below the threshold, we keep the feature. Only a few features in the dataset are this strongly correlated. Because the columns "lymphocyte count" and "IL6" are significantly correlated with other features, we excluded them from the dataset. After the correlation analysis, 31 attributes remain.
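The encoding and correlation-filter steps can be sketched in plain Python. This is an illustrative sketch, not the study's code: the column names and values are made up, integer codes are assigned in sorted category order (an assumption that mirrors common "category code" behavior), and comparing the absolute value of the coefficient against the 0.95 threshold is also an assumption.

```python
import math


def category_codes(values):
    """Map each distinct category to an integer code, in sorted category order."""
    lookup = {cat: code for code, cat in enumerate(sorted(set(values)))}
    return [lookup[v] for v in values]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / (norm_x * norm_y)


def drop_correlated(columns, threshold=0.95):
    """Return the later column of every pair whose |r| exceeds the threshold."""
    names = list(columns)
    dropped = []
    for i, first in enumerate(names):
        if first in dropped:
            continue
        for second in names[i + 1:]:
            if second in dropped:
                continue
            if abs(pearson(columns[first], columns[second])) > threshold:
                dropped.append(second)
    return dropped


# Hypothetical toy columns: "ldh_copy" is a linear copy of "ldh".
ldh = [200, 340, 410, 520, 610]
columns = {
    "ldh": ldh,
    "ldh_copy": [2 * v + 1 for v in ldh],
    "fever": category_codes(["yes", "no", "yes", "no", "yes"]),
}
to_drop = drop_correlated(columns)
print(to_drop)
```

Keeping the first column of a correlated pair and dropping the second is an arbitrary tie-breaking choice; a clinical expert could instead decide which of the two is more informative.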

Feature Selection.
The clinical expert understands the benefits and drawbacks of each dataset attribute. The expert evaluated each feature of the dataset and identified the 24 most important attributes, which are given in Table 2. To identify essential features using artificial intelligence and machine learning-based techniques, we first explored the impact of each feature on severity via feature importance analysis using random forest, XGB, CatBoost, and chi-square feature selection techniques. After analyzing the feature importance scores, we compared the top 24 features selected by each feature selection approach with the clinical expert-selected features. We found that the random forest technique retrieves the most suitable features.

Multi-Layer Perceptron.
The MLP consists of an input layer, one or multiple hidden layers, and finally, one output layer at the end that performs all computations. MLP is a supervised learning technique that learns a function $f(Z): \mathbb{R}^{i} \to \mathbb{R}^{n}$ from the given data, where $i$ is the number of input dimensions and $n$ is the number of output dimensions. We have the feature set $F = \{f_1, f_2, \ldots, f_{24}\}$ with target label $T_l$. Every node is a neuron that performs classification or regression by using a nonlinear activation function. We use most of the default parameters for the MLP model; the exceptions are ReLU as the activation function and the Adam optimizer with a learning rate of 0.0001 and 200 iterations.

K-Nearest Neighbor.
The KNN model is a supervised learning approach that can be used to solve classification and regression tasks. For classification, it selects the K data points from the training data that are closest to the new data point and assigns the new point the class that is most common among these neighbors. We performed experiments with 3, 5, and 7 neighbors and observed that the results were almost the same. Thus, we kept the default setting for the number of neighbors. All other parameter values are left at their defaults. We used 5 neighbors and Euclidean distance to calculate the distance value in this research.
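The neighbor vote described above can be sketched in a few lines of standard-library Python. The two-dimensional points and labels are hypothetical toy data; only the choice of 5 neighbors and Euclidean distance follows the text.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = sorted(
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9), (2.9, 3.0)]
train_y = ["negative", "negative", "negative", "positive", "positive", "positive"]

near_first_group = knn_predict(train_X, train_y, (1.1, 1.0), k=5)
near_second_group = knn_predict(train_X, train_y, (3.0, 3.0), k=5)
print(near_first_group, near_second_group)
```

An odd k with two classes avoids tied votes, which is one practical reason values such as 3, 5, and 7 are commonly tried.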

Support Vector Machine.
The support vector machine (SVM) is another fundamental approach that is widely used in machine learning research. The SVM model is a popular classification algorithm because it produces high accuracy while consuming fewer computational resources. SVM is a supervised learning algorithm used for classification and regression; however, it is most frequently used in classification tasks. It is well known for its ability to identify abnormalities in higher-dimensional data such as audio data. The scikit-learn package is used to construct this approach. The RBF kernel parameter of the SVM model is set to 4 with a gamma value of 0.001, and the probability setting is enabled in the SVM model parameters.

Naive Bayes.
The NB model is an ML classifier based on Bayes' theorem that makes strong independence assumptions. It is a combination of multiple probability models with strict independence assumptions. In simple terms, a Naive Bayes classifier assumes that the presence of a single feature in a class is unrelated to the occurrence of any other feature. It is employed in various fields, including spam or ham categorization, sentiment classification, COVID-19, and other medical diagnostics, and it is most frequently used for classification. Despite its simplicity, Naive Bayes has in some settings outperformed more sophisticated classification algorithms. In this study, we use the default parameters of the Naive Bayes classifier.

Decision Tree.
A decision tree (DT) is a rule-based classifier that recursively splits the dataset until it is left with leaf nodes. DT works based on information gain and entropy measures. Entropy measures how impure a particular node is, and information gain reveals how informative a particular node is for predicting the classes. For this study, we set several parameters according to the current scenario. We set cross-validation to 10 folds, and the rest of the parameters are set to default. We set the confidence factor to 1.0.
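Entropy and information gain for a single threshold split can be sketched as follows. The LDH values and labels are hypothetical, and the sketch shows plain information gain only; it is not a reproduction of the tree learner used in the study.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on `value <= threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    # Weighted entropy of the two child nodes (empty children contribute nothing).
    children = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - children


# Hypothetical LDH readings with their class labels.
ldh = [180, 210, 250, 400, 450, 520]
labels = ["neg", "neg", "neg", "pos", "pos", "pos"]

gain_clean_split = information_gain(ldh, labels, threshold=300)
gain_poor_split = information_gain(ldh, labels, threshold=200)
print(gain_clean_split, gain_poor_split)
```

A tree learner would evaluate such gains for every candidate feature and threshold and place the most informative split at the root.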

Experimental Result and Discussion
This study examined data from 603 patients; 400 were COVID-19-positive, and 203 were COVID-19-negative patients. The dataset is separated into two sections: 80% of the dataset is utilized for training the ML models, while the remaining 20% is used for testing. Selecting a model and training it on a dataset is only one component of the work. Applying multiple assessment measures to evaluate the model's capabilities on previously unseen data is vital in developing any machine learning model.
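The 80/20 split can be sketched as a shuffled index split. The seed and the choice to round the test share up are assumptions; the source does not state how the split was rounded or whether it was stratified. A test share of 121 samples would be consistent with the reported accuracies, since 119/121 ≈ 98.347% and 117/121 ≈ 96.694%.

```python
import math
import random


def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into train and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = math.ceil(n_samples * test_fraction)  # assumption: round test share up
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(603)
print(len(train_idx), len(test_idx))
```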
This study included four assessment metrics: accuracy, precision, recall, and F1-score. These measures were determined using a confusion matrix, in which a true positive (TP) is a patient who is predicted as COVID-19-positive and is COVID-19-positive, a true negative (TN) is a patient who is predicted as COVID-19-negative and is COVID-19-negative, a false positive (FP) is a patient who is predicted as COVID-19-positive but is COVID-19-negative, and a false negative (FN) is a patient who is predicted as COVID-19-negative but is COVID-19-positive. The measures employed were as follows.
Accuracy is the ratio of correctly classified COVID-19-positive and COVID-19-negative patients to all correctly and incorrectly classified cases. The equation for accuracy is
\[ \mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}. \]
Precision is defined as the ratio of patients correctly detected as COVID-19-positive to all patients predicted as COVID-19-positive:
\[ \mathrm{Precision} = \frac{TP}{TP + FP}. \]
Recall is defined as the model's sensitivity, which refers to how effectively the classifier can recognize COVID-19-positive patients:
\[ \mathrm{Recall} = \frac{TP}{TP + FN}. \]
The F1-measure is calculated as the harmonic mean of the recall and precision measurements:
\[ F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \]
Cross-validation is a typical approach for evaluating ML models' performance and an essential strategy to guard against overfitting.
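Cross-validation can be sketched as a partition of the sample indices into k folds, each serving once as the held-out part. The sketch below uses contiguous, unshuffled folds for 603 samples and k = 10; whether the study shuffled or stratified its folds is not stated, so that part is an assumption.

```python
def k_fold_indices(n_samples, k=10):
    """Split range(n_samples) into k (train, test) index pairs."""
    # The first n_samples % k folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


folds = k_fold_indices(603, k=10)
for train, test in folds:
    # A model would be fitted on `train` and scored on `test` here.
    print(len(train), len(test))
```

Averaging the score over all folds gives a performance estimate in which every sample is used for testing exactly once.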
This work uses accuracy, precision, recall, F1-score, and the AUC score to evaluate model performance. We use a weighted average to evaluate the precision, recall, and F1-score.
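The weighted average can be sketched by computing each metric once per class and weighting it by the number of true samples of that class. The confusion-matrix counts below are hypothetical and are not taken from the study's tables.

```python
def binary_metrics(tp, fp, fn):
    """Precision, recall, and F1 for one class treated as the target class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def weighted_report(tp, tn, fp, fn):
    """Accuracy plus support-weighted precision, recall, and F1 over both classes."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    positive = binary_metrics(tp, fp, fn)
    # For the negative class the roles swap: TN are its hits,
    # FN are its false alarms, and FP are its misses.
    negative = binary_metrics(tn, fn, fp)
    support_pos, support_neg = tp + fn, tn + fp
    total = support_pos + support_neg
    weighted = tuple(
        (p * support_pos + n * support_neg) / total for p, n in zip(positive, negative)
    )
    return accuracy, weighted


# Hypothetical counts for a 121-sample test set.
accuracy, (w_precision, w_recall, w_f1) = weighted_report(tp=80, tn=39, fp=1, fn=1)
print(accuracy, w_precision, w_recall, w_f1)
```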
This work employed a ten-fold cross-validation approach to assess the performance of the suggested models. The experiments are implemented using Google Colaboratory. The experiments in this study were divided into two phases: in the first phase, we performed experiments that included the patients' PCR test results along with the other features, and in the second phase, we excluded the PCR test results to determine the importance of the PCR test in diagnosing COVID-19. The investigation reveals that the experimental findings in both scenarios are nearly the same, indicating that PCR is not the only parameter that indicates whether a patient is COVID-19-positive. In many circumstances, such as human error, machine error, and improperly obtained samples, the PCR result may be inaccurate; thus, it is essential to ensure that PCR test findings are not the only factor in excluding COVID-19 [52]. Various other characteristics contribute to identifying COVID-19-positive patients, as indicated in the data. The results with the PCR test are shown in Table 3, while those without the PCR test are shown in Table 4.
The results in Table 3 demonstrate that the DT model yields the highest accuracy, 98.347%, compared with the other machine learning models (KNN, MLP, SVM, and NB). With the DT model, precision, recall, and F1-score are 98.134%, 98.245%, and 98.024%, respectively. Furthermore, the AUC score of the DT model is 99%. We plot the AUC curves of all models in Figure 4; the AUC curve of the DT model is depicted in Figure 4(e).
The findings in Table 4 show that in the second phase of experiments, the DT model again has the highest accuracy, 96.694%, compared to the other machine learning models (KNN, MLP, SVM, and NB). With the DT model, precision, recall, and F1-score are 97.365%, 97.231%, and 97.153%, respectively. Furthermore, the AUC score of the DT model is 98%. We plot the AUC curves of all models in Figure 5; Figure 5(e) depicts the AUC curve of the DT model. The experimental findings show that the DT model outperformed the other ML models in accuracy in both scenarios. The highest AUC score demonstrates that the DT model worked well on the dataset. We may also deduce that the performance of these algorithms may increase if they are fed additional data. Because we apply the proposed approach to a real dataset to detect this virus, our efforts can benefit the community by supporting the analysis of clinical information and the taking of appropriate steps.
The J48 decision tree is depicted in a systematic visual manner in Figure 6. The decision tree shows the prediction of positive and negative COVID-19 cases. Ten features participate in predicting COVID-19-positive and COVID-19-negative cases. LDH is the root node and is considered the best node, providing the most information for the COVID-19 prediction process. The J48 tree has 98.70% precision, recall, F1-measure, and TP rate, and a 98.90% ROC area. Using a function of the attribute values, each internal node with outgoing edges partitions the instance space into two subspaces. The J48 decision tree classifies and forecasts instances as positive or negative based on the input factors.

Conclusion
Most prior research has been limited to chest CT scan and chest X-ray images, using DL algorithms to identify positive and negative COVID-19 patients. Although these approaches may accurately diagnose COVID-19 cases, they cannot be used for every patient due to high radiation exposure, high cost, and the limited availability of equipment. As a result, distinguishing between positive and negative COVID-19 cases remains a substantial challenge. According to these considerations and based on the analysis of previous research, no diagnostic model for identifying COVID-19 cases utilizing numerous clinical features has been developed. This study therefore aims to use ML classification algorithms to predict COVID-19 patients based on 24 clinical parameters. Five ML models (i.e., MLP, KNN, SVM, NB, and DT) are employed to diagnose COVID-19 patients using 24 clinical features. The models were evaluated using various metrics (accuracy, precision, recall, F1-score, and AUC score). The experiments were carried out in two phases: with PCR test results and without PCR test results. In both cases, the DT classifier outperformed the other four ML classifiers. With an accuracy of 98.347% with the PCR test attribute and 96.694% without it, DT is the best classifier, also from a clinical perspective, for predicting COVID-19 cases based on the 24 clinical variables utilized in this investigation.
When clinicians depend entirely on PCR test results to declare a patient COVID-19-positive, there is a risk of false-positive and false-negative patients. As a result, disease treatment would be delayed, allowing false-negative patients to spread the disease rapidly. The prediction models might be helpful in early diagnosis, especially when the PCR test result is insufficient for diagnosing COVID-19 infection. This study might therefore help clinicians increase their prediction rate of confirmed COVID-19 infections. The findings are also likely to benefit other countries, especially developing countries with a scarcity of PCR tests and specialist facilities. The limitations of this research are the small sample size and the data balancing issue. In the future, we intend to obtain a more accurate dataset for better COVID-19 detection.

Data Availability
The data used to support the findings of this study have not been made available because of the agreement with the PIMS Hospital.

Conflicts of Interest
The authors declare no conflicts of interest.