On-Ground Distributed COVID-19 Variant Intelligent Data Analytics for a Regional Territory

Department of Computer Systems Engineering, Mehran University of Engineering and Technology, Jamshoro, Pakistan; Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro, Pakistan; Department of Electrical Engineering, Mehran University of Engineering and Technology, Jamshoro, Pakistan; Department of Computer Education, Sungkyunkwan University, Seoul, Republic of Korea


Introduction
The World Health Organization (WHO) declared COVID-19 a deadly virus [1,2], and countries need to act in unison to prevent further transmission of the disease. A pandemic is a disease that spreads worldwide [3]. Throughout history, the world has witnessed many pandemics; the most recent before COVID-19 was the H1N1 flu pandemic of 2009. The first few cases of COVID-19 were reported to the WHO on 31 December 2019 in the city of Wuhan, Hubei province, China, where several people were afflicted with pneumonia of undetermined cause. In January 2020, officials identified a novel virus that had not yet been named [4,5] and that subsequently became known as the 2019 novel coronavirus [6]. After samples were obtained and the genetics of the virus were analyzed, it was established that this virus caused the outbreak. The WHO named it Coronavirus 2019 (COVID-19) in February 2020 [7], and some studies found that it is associated with SARS-CoV-2 [8,9]. With a population of 204 million, Pakistan saw its first case in February 2020 [10]. Because Pakistan has the fifth largest population in the world, it is essential to understand how the virus will progress there. This paper therefore addresses the problem of forecasting the future trend of COVID-19-positive cases in Pakistan using the COVID-19 dataset from [11]. Machine learning is widely used to handle large datasets, and it can help in this regard. We specifically test three methods, namely, linear regression, random forest, and XGBoost. In this paper, we predict COVID-19-positive cases in the Pakistani regions of Sindh, Punjab, Gilgit Baltistan, Balochistan, Khyber Pakhtunkhwa, and Azad Jammu and Kashmir using these three ML algorithms, and we compare the results in order to find the algorithm that gives the highest forecast accuracy on this dataset.
A real-time forecasting scheme based on ML models is presented, which allows citizens and the government of Pakistan to take action proactively [12,13]. This paper predicts future COVID-19 pandemic trends by employing open-source data science libraries and machine learning tools in Python. The primary objectives of this study are as follows: (i) to source [12], preprocess, visualize, and analyze the COVID-19 data of Pakistan; (ii) to identify the parameters required for COVID-19 modeling and derive these variables for all three forecasting algorithms used; (iii) to rectify and eliminate biases; (iv) to model and predict the future trend of the COVID-19 pandemic; and (v) to visualize and discuss the results. The COVID-19 dataset of Pakistan has not been tested on a large scale using machine learning algorithms. This paper contributes by applying machine learning algorithms to indigenous datasets from Pakistan, which can significantly help in assessing the situation and planning actions accordingly. The paper is structured as follows: Section 2 presents an overview of the literature related to COVID-19 forecasting; Section 3 explains the methodology for predicting COVID-19; Section 4 presents the results for all three machine learning models and illustrates the relationship between parameters; and Section 5 summarizes this work.

Related Work
Kavadi et al. [1] developed a mathematical model to assess and estimate the growth of the worldwide COVID-19 pandemic. A machine learning generalized inverse Weibull model was implemented to evaluate the potential risks associated with the coronavirus, and cloud computing was employed to ensure precise, real-time prediction of the growth of the pandemic. Nemati et al. [3] implemented a model to highlight the efforts of the Pakistan government to fight COVID-19. Their paper presents the coronavirus situation in Pakistan at the time and provides information about the hospital facilities provided for COVID-19 patients. The results show that the recovery rate is higher than the mortality rate in Pakistan, that Balochistan has more hospitals for COVID-19 patients, and that Azad Jammu and Kashmir has the fewest. Isolation zones were built in Pakistan, and the study shows that the Punjab and Khyber Pakhtunkhwa regions have more isolation wards and better medical facilities. Ardabili et al. [4] proposed the PDR-NML method (partial derivative regression and nonlinear machine learning) to predict the pandemic trends of COVID-19. The results show that the proposed ML method is more effective than other state-of-the-art methods on the Indian population; thus, it can be an innovative tool for helping other countries make their predictions. The authors also used PPDLR to normalize the features required for timely prediction and PDLFR for robust and accurate prediction, and they observed that machine learning performed better for data analysis than artificial intelligence. Lalmuanawma et al. [5] predicted the trend of COVID-19, using the Fb-prophet model to establish the pandemic curve and forecast its direction. A disadvantage of this study is that it used a limited dataset; the work is integrated into the logistic model.
Three significant points have been summarized based on the modeling results for Indonesia, Peru, Brazil, India, and Russia. According to estimations based purely on mathematical aspects, the peak of the virus would be witnessed globally in late October, and 14.12 million people were expected to be impacted on a cumulative basis. Rustam et al. [7] implemented the autoregressive integrated moving average (ARIMA) model to predict the new daily COVID-19 cases in Saudi Arabia for four weeks. The authors compared four prediction models in this study, namely, the autoregressive model, the moving average, a combination of both (ARMA), and integrated ARMA (ARIMA), to identify the best model fit. The results show that the ARIMA model is more effective than the other models. Pandey et al. [8] aimed to forecast the COVID-19-positive cases in India and Odisha by using linear regression and multiple linear regression, and observed that both models provided remarkable accuracy for the prediction of the COVID-19 pandemic. Roy et al. [10] summarized four machine learning algorithms to forecast the number of COVID-19-infected people. The COVID-19 data between 20/01/2020 and 18/09/2020 for the USA, Germany, and the world were obtained from the World Health Organization. The performance of all algorithms was compared according to the RMSE, APE, and MAPE criteria, and it was observed that these models could be used to diagnose the COVID-19 data over time. To forecast COVID-19-positive cases, Ayyoubzadeh et al. [11] used XGBoost, K-means, and long short-term memory (LSTM) neural networks to construct a prediction model. It was observed that K-means-LSTM provides higher accuracy, with an error score of 601.20%.

Wireless Communications and Mobile Computing

Methodology
In this study, machine learning algorithms were applied, and each algorithm was evaluated based on the different parameters shown in Figure 1. The work involves a few significant steps: data collection, data preprocessing, application of machine learning algorithms, evaluation, and comparative analysis.
3.1. Data Collection. The data used in this work was accessed from http://covid.gov.pk [12,14]. The information related to COVID-19 cases in Pakistan was compiled from different sources, including Kaggle and the World Health Organization (WHO) [6,15-17], and a cumulative dataset was created from a mix of these sources. The data taken from http://covid.gov.pk/ was not in the required CSV format and contained some unnecessary data that was not needed to predict positive cases in Pakistan, so data preprocessing was carried out. The dataset includes the hospital data of COVID-19-positive patients, deceased patients, recovered patients, total deaths, and the number of swab tests conducted every day in each region of Pakistan. It contains all the COVID-19 patient data for the specified data collection period.

Data Preprocessing.
After collection, the data was transformed into the required CSV format. To rectify the issue of systemic bias, the moving-average method was adopted, which is typically used to assess time series by computing averages of successive subsets of the complete dataset. In this context, each subset spans seven days. Initially, the moving average was computed as the average of the first seven-day subset. Then, the subset was shifted to the next fixed subset, and this continued until all the subsets had been processed. Essentially, this method smooths the data by mitigating anomalies such as the weekend bias. In Figures 2-7, the dataset variables are plotted as time series. Figure 2 displays the daily new COVID-19-positive cases in Pakistan, which are essential for forecasting. Figure 3 displays the weekly average of COVID-19-positive cases. Figure 4 represents total COVID-19-positive cases across Pakistan, and Figure 5 represents total deaths across Pakistan. Figure 6 displays daily new reported COVID-19 cases in the regions of Pakistan, whereas Figure 7 illustrates the data of COVID-19 patients who are in serious condition.
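The seven-day smoothing described above can be illustrated with a minimal sketch. It assumes that the window slides forward one day at a time, which the text does not state explicitly; the function name moving_average and the daily counts are hypothetical and chosen only to show how a weekend dip is averaged out.

```python
def moving_average(values, window=7):
    """Return the moving average of `values` over full windows of `window` points.

    Assumption: the window advances one position at a time, so the result
    has len(values) - window + 1 entries.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        return []
    running_sum = sum(values[:window])
    averages = [running_sum / window]
    # Slide the window: add the newest day and drop the oldest one.
    for i in range(window, len(values)):
        running_sum += values[i] - values[i - window]
        averages.append(running_sum / window)
    return averages


# Hypothetical daily counts for two weeks, with lower reporting on days 6 and 7.
daily_cases = [100, 110, 120, 130, 140, 60, 40,
               100, 110, 120, 130, 140, 60, 40]
smoothed = moving_average(daily_cases, window=7)
# Every 7-day window contains one full weekly cycle, so the dip disappears.
print(smoothed)
```

In a Pandas workflow the same smoothing is usually written with the rolling-window API; the plain-Python version is shown here only to make the arithmetic explicit.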

Applying Machine Learning Algorithms.
After preprocessing, random forest, XGBoost, and linear regression models were applied to predict COVID-19-positive cases in Pakistan [18]. A linear regression model was employed to model the COVID-19 trend. It was trained using positive cases and new positive cases data at both the national and provincial levels in Pakistan. In regression, the coefficient of determination $R^2$ is a statistical measure of how closely the regression predictions match the actual data points. A value of $R^2$ equal to 1 denotes that the regression predictions align exactly with the data. Thus, the closer the value of $R^2$ is to 1, the more effective the model is in predicting trends [19]. The random forest algorithm is a popular supervised machine learning algorithm, and it is employed for classification and regression [20,21]. It is an ensemble machine learning method built from multiple decision trees: N decision trees produce N outputs, which are then combined into a single prediction.
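How the N outputs of N decision trees become one prediction can be sketched as follows. This is an illustration under the usual convention (averaging for regression, majority vote for classification), not the authors' code; the function names and the sample outputs are hypothetical.

```python
from collections import Counter
from statistics import mean


def aggregate_regression(tree_outputs):
    """Combine numeric outputs of individual trees by averaging them."""
    return mean(tree_outputs)


def aggregate_classification(tree_votes):
    """Combine class labels predicted by individual trees by majority vote."""
    return Counter(tree_votes).most_common(1)[0][0]


# Hypothetical outputs of N = 3 trees for one day.
forecast = aggregate_regression([120, 130, 125])
trend = aggregate_classification(["rise", "rise", "fall"])
print(forecast, trend)
```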

Forecasting the Trend of Positive COVID-19 Cases across
Pakistan Regions. The COVID-19 outbreak has badly affected essential aspects of life around the world. To control this outbreak, smart lockdowns have been imposed all over the country, particularly in the highly affected areas of Pakistan. This study provides an idea of the increase of COVID-19 in Pakistan and its provinces. It will also help Pakistan and its citizens make appropriate decisions to handle the situation by following proper SOPs and guidelines.

Forecasting the Trend of Positive COVID-19 Cases Using
Linear Regression Algorithm. In this study, a detailed description of linear regression is presented. In addition, all the tests performed for the validity of linear regression are analyzed and discussed. We used linear regression to forecast the value of a dependent variable from the provided independent variable data [22][23][24]. It was observed that there is a linear relationship between the independent and dependent variables. In our study, we considered X as the independent variable and Y as the dependent variable, and the value of Y is predicted by using the following equation:

$$y_i = f(X), \tag{1}$$

where $X = [x_1, x_2, x_3, \cdots, x_P]$ is a vector of P input parameters and $Y = [y_1, y_2, y_3, \cdots, y_Q]$ is a vector of Q output parameters. X is also called the independent variables, and Y the response variables. In machine learning, regression is a method to find the relation between X and $y_i$. When the relationship is modeled using a linear predictor function, that is, assuming the system is linear, equation (1) is represented by

$$y_i = a_0 + X a + \epsilon. \tag{2}$$

Here, $a$ is the vector of regression coefficients, $a_0$ is the intercept, and $\epsilon$ represents the vector of model error. Expanding the above equation gives

$$y_i = a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_P x_P + \epsilon. \tag{3}$$

In equation (3), $a$ and $\epsilon$ are estimated by using standard methods. Let the estimated coefficients be denoted by $\hat{a}$; the fitted response is then

$$\hat{y}_i = \hat{a}_0 + \hat{a}_1 x_1 + \hat{a}_2 x_2 + \cdots + \hat{a}_P x_P. \tag{4}$$

The $R^2$ (coefficient of determination) is given by

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}, \tag{5}$$

where the sums run over all observations. Here, Y represents the forecast of total positive cases in Pakistan, the X variable represents the date, $a_0$ denotes the Y-intercept, and $a_1$ indicates the slope. The linear regression model is built by learning the values of $a_0$ and $a_1$ from a given dataset, where $R^2$ is the measure of the proportion of variation in y explained by the P input parameters. In this study, R is used to determine the values of $a$, $R^2$, and $\epsilon$, and $\bar{y}$ is the mean of all observations.
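For the single-input case described above (date as X, case count as Y), the intercept, the slope, and the coefficient of determination can be computed in closed form. The following is a minimal sketch using ordinary least squares; the function names and the five-day sample are hypothetical, and the study itself reports using library implementations rather than this code.

```python
def fit_simple_linear_regression(x, y):
    """Return (a0, a1) minimizing the squared error of y = a0 + a1 * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    a1 = sxy / sxx                 # slope
    a0 = y_mean - a1 * x_mean      # intercept
    return a0, a1


def r_squared(y, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    if ss_tot == 0:
        raise ValueError("y values must not all be identical")
    return 1 - ss_res / ss_tot


# Hypothetical data: day index and cumulative positive cases.
days = [0, 1, 2, 3, 4]
cases = [10, 12, 15, 19, 24]
a0, a1 = fit_simple_linear_regression(days, cases)
fitted = [a0 + a1 * d for d in days]
score = r_squared(cases, fitted)
print(a0, a1, score)
```

A forecast for a future day d is then a0 + a1 * d; a score close to 1 indicates that the straight line explains most of the variation in the sample.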

Forecasting the Trend of Positive COVID-19 Cases Using
Random Forest Algorithm. To implement the random forest model, we first took the COVID-19 dataset of Pakistan as input and then trained the random forest model on that dataset. The input features are treated as independent variables, and the actual number of COVID-19 cases is regarded as the dependent variable [25]. The random forest model was used for forecasting the COVID-19-positive cases in the territories of Pakistan. The implementation of this model is described in the following flowchart.
Random forest consists of many decision trees. In general, the higher the number of decision trees, the more accurate the results; there is a direct relation between the outcome and the number of decision trees in random forest [26][27][28]. The primary purpose of this algorithm is to improve prediction accuracy by aggregating multiple classifiers. The random forest algorithm is widely used for classification and prediction, and it can be applied to many fields such as forecasting, data analysis, text classification, and face recognition [29]. This algorithm combines multiple decision trees and classifier models. The construction process of random forest is described in Figure 8. In our study, the prediction process is divided into two significant parts: the first part is the growth of the decision trees, and the second part is the voting process. The growth process is divided into three steps: first, random selection of the training set; second, random forest construction; and third, node splitting. In the node-splitting process, the feature with the smallest Gini coefficient is selected for the split. The Gini coefficient is calculated as follows:

$$\mathrm{Gini}(K) = 1 - \sum_{j} p_j^2, \tag{6}$$

where $p_j$ represents the probability of category $M_j$ in the sample set K, and

$$\mathrm{Gini}_{\mathrm{split}}(K) = \frac{|M_1|}{|M|}\,\mathrm{Gini}(M_1) + \frac{|M_2|}{|M|}\,\mathrm{Gini}(M_2), \tag{7}$$

where $|M|$ represents the number of samples in the sample set K and $|M_1|$ and $|M_2|$ represent the numbers of samples in subsets $M_1$ and $M_2$. It was observed that the random forest algorithm provides better performance due to the random selection of the feature set and training set. In this study, the $R^2$ and mean square error for random forest were calculated using evaluation metrics [30]. The formula for calculating $R^2$ is given below:

$$R^2 = 1 - \frac{\sum_i (x_i - \hat{x}_i)^2}{\sum_i (x_i - \bar{x})^2}. \tag{8}$$

In the above equation, $x_i$ represents the actual values, $\hat{x}_i$ represents the predicted values, and $\bar{x}$ represents the average of all values. If the value of $R^2$ is close to 1, then the model is good for forecasting.
The formula for calculating the root mean square error (RMSE) is shown below:

$$\mathrm{RMSE} = \sqrt{\frac{1}{K} \sum_{i=1}^{K} (x_i - \hat{x}_i)^2}. \tag{9}$$

Here, $x_i$ represents the actual value, $\hat{x}_i$ represents the predicted value, K indicates the number of samples, and $i = 1, 2, 3, \cdots, K$.
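The Gini-based node split described in this section can be sketched in a few lines. The example assumes class labels at the node and a single numeric feature, and it tries the midpoints between consecutive feature values as candidate thresholds; the function names and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini coefficient of a set of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(left, right):
    """Size-weighted Gini coefficient of the two subsets produced by a split."""
    n = len(left) + len(right)
    if n == 0:
        return 0.0
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(feature, labels):
    """Return (threshold, score) with the smallest weighted Gini coefficient."""
    values = sorted(set(feature))
    best_t, best_score = None, float("inf")
    for low, high in zip(values, values[1:]):
        t = (low + high) / 2  # candidate threshold between two observed values
        left = [lab for f, lab in zip(feature, labels) if f <= t]
        right = [lab for f, lab in zip(feature, labels) if f > t]
        score = gini_split(left, right)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Toy node: one feature and a hypothetical two-class label.
feature = [1, 2, 3, 4]
labels = ["low", "low", "high", "high"]
threshold, score = best_threshold(feature, labels)
print(gini(labels), threshold, score)
```

Here the unsplit node has a Gini coefficient of 0.5, and the threshold 2.5 separates the two classes completely, so its weighted coefficient is 0.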

Forecasting the Trend of Positive COVID-19 Cases Using
Extreme Gradient Boost (XGBOOST) Algorithm. Extreme gradient boosting (XGBoost) is a widely used and highly effective machine learning algorithm.
It combines weak classifiers into a robust classifier. The process is repeated according to the needs of the desired model, which in this study is to forecast the positive cases of COVID-19 in the territories of Pakistan. XGBoost is a tree learning model that takes the decision tree as its basic unit, and the final learning model of XGBoost consists of many decision trees. It is an improved algorithm based on the gradient boosting tree. It uses CART or a linear classifier as the base learner of the gradient boosting algorithm. The XGBoost algorithm has several advantages for prediction problems. The model is defined as

$$\hat{y}_i = \sum_{m=1}^{T} f_m(a_i), \quad f_m \in A. \tag{10}$$

Here, A represents the function space formed by all tree models and $f_m$ represents a regression tree:

$$A = \{\, f(a) = R_{u(a)} \,\}. \tag{11}$$

Here, u shows the mapping relationship from a to the leaf node, and R represents the weight of the leaf node.
The objective function of the defined model is as follows:

$$\mathrm{Obj} = \sum_{i} l(b_i, \hat{y}_i) + \sum_{m=1}^{T} \Omega(f_m). \tag{12}$$

In the above equation, $b_i$ represents the actual value and $\hat{y}_i$ represents the forecast value; the first part is the learning loss, and the second part represents the sum of the complexity of each tree. The complexity of the T-th tree is indicated by

$$\Omega(f_T) = \gamma Z + \frac{1}{2} \lambda \lVert J \rVert^2, \tag{13}$$

where Z is the number of leaf nodes, J is the leaf weight, $\lambda$ is the penalty coefficient of the leaf weight, and $\gamma$ is the penalty coefficient of the profit function of the segmented leaf node.
The gradient boost strategy is used in the XGBoost algorithm to generate a regression tree after every iteration, which is added to the existing model:

$$\hat{y}_i^{(T)} = \hat{y}_i^{(T-1)} + f_T(a_i). \tag{14}$$

By combining equations (14), (11), and the objective function, we get

$$\mathrm{Obj}^{(T)} = \sum_{i} l\left(b_i, \hat{y}_i^{(T-1)} + f_T(a_i)\right) + \Omega(f_T) + C, \tag{15}$$

where C is the constant term.

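Now, applying a second-order Taylor expansion to the above equation, we get

$$\mathrm{Obj}^{(T)} \approx \sum_{i} \left[ l\left(b_i, \hat{y}_i^{(T-1)}\right) + g_i f_T(a_i) + \frac{1}{2} h_i f_T^2(a_i) \right] + \Omega(f_T) + C. \tag{16}$$

Here, $g_i = \partial_{\hat{y}^{(T-1)}}\, l(b_i, \hat{y}_i^{(T-1)})$ and $h_i = \partial^2_{\hat{y}^{(T-1)}}\, l(b_i, \hat{y}_i^{(T-1)})$ represent the first and second derivatives of the loss function, and C is a constant. By removing constants, we get

$$\mathrm{Obj}^{(T)} \approx \sum_{i} \left[ g_i f_T(a_i) + \frac{1}{2} h_i f_T^2(a_i) \right] + \Omega(f_T). \tag{17}$$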
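To make the role of the first and second derivatives concrete, the sketch below continues from the second-order expansion using a result from the standard XGBoost formulation that is not stated in the text above: for a fixed tree structure, the weight of a leaf that minimizes the simplified objective is -G / (H + lambda), where G and H are the sums of the first and second derivatives of the samples in that leaf. The example assumes a squared-error loss l = (1/2)(b - y)^2, for which g = y - b and h = 1; the function names and sample values are hypothetical.

```python
def gradients_squared_error(actual, predicted):
    """First and second derivatives of 0.5 * (actual - predicted)**2
    with respect to the prediction."""
    g = [p - b for b, p in zip(actual, predicted)]
    h = [1.0 for _ in actual]
    return g, h


def optimal_leaf_weight(g, h, lam):
    """Leaf weight minimizing sum(g*w + 0.5*h*w**2) + 0.5*lam*w**2."""
    return -sum(g) / (sum(h) + lam)


def leaf_objective(g, h, lam, gamma):
    """Objective contribution of one leaf at its optimal weight."""
    G, H = sum(g), sum(h)
    return -0.5 * G * G / (H + lam) + gamma


# Three samples that fall into the same leaf; the current model predicts 11 for each.
actual = [10.0, 12.0, 14.0]
previous_prediction = [11.0, 11.0, 11.0]
g, h = gradients_squared_error(actual, previous_prediction)
weight = optimal_leaf_weight(g, h, lam=1.0)
objective = leaf_objective(g, h, lam=1.0, gamma=0.0)
print(weight, objective)
```

With lam set to 0 the weight equals the mean residual of the leaf (1.0 in this sample), so the penalty coefficient can be read as shrinking each tree's correction toward zero.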

Results and Discussion
The COVID-19 virus has infected many people, and the number of infected people may increase in the future. The machine learning approach shows promising results for the forecast of COVID-19-positive cases. Statistical models are essential techniques for analyzing infectious disease data in real time. In this research, a real-time COVID-19 forecast is built for the regions of Pakistan.
Our prediction models performed very well in predicting the daily new confirmed COVID-19-positive cases in the regions of Pakistan. All the steps involved in building the proposed model are implemented in Python: the Pandas library is used for data loading and preprocessing, Matplotlib is used to plot the curves, and the Scikit-learn library is used for implementation of the classifier. The experiments were executed on a Dell system with an i7 processor and 64 GB of RAM. For further evaluation, metrics (accuracy, precision, support, recall, F1-score, and sensitivity) are used to measure the quality of the machine learning models. We have proposed a prediction model that works over six months to predict COVID-19 activity by incorporating the previous incidence of COVID-19. Our proposed model performed well for all regions of Pakistan. The performance of the algorithms was evaluated by using mean square error (MSE), root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) [15,31], together with the evaluation metrics. This proposed model has several advantages compared to other reported works on a similar topic. All our forecasting models performed well for the COVID-19 forecast, but random forest and XGBoost provide better accuracy. We used a large amount of data, which improved the performance of all ML models. Figures 9-14 show the results of the COVID-19 trend using linear regression in the regions of Pakistan. The red bars are the training data, whereas the blue bars are the predicted trend with the indicated model scores. Rising blue bars mean that positive cases are increasing day by day. In Figure 9, red bars represent the actual COVID-19-positive cases in Sindh, Pakistan, whereas blue bars represent the predicted COVID-19-positive cases. According to the prediction, this figure shows that Sindh may have a higher number of positive cases in May.
In Table 1, the error metrics show that the MSE for Sindh is 0.202, the MAE is 2.024, the RMSE is 4.859, and the MAPE is 0.011. Furthermore, in Table 2, the linear regression model is evaluated using evaluation metrics on the Sindh region, for which the accuracy is 86%, support is 28%, precision is 82%, recall is 1%, F1-score is 82%, and sensitivity is 1%. Figure 10 represents the forecast for Punjab, Pakistan. According to the prediction, this figure shows that Punjab may have a higher number of positive cases in May. In Table 1, the error metrics show that the MSE for Punjab is 0.202, the MAE is 2.024, the RMSE is 4.859, and the MAPE is 0.011. Furthermore, in Table 2, the linear regression model is evaluated on the Punjab region, for which the accuracy is 82%, support is 27%, precision is 72%, recall is 1%, F1-score is 83%, and sensitivity is 1%. Figure 11 represents the forecast for Gilgit Baltistan, Pakistan. According to the prediction, this figure shows that Gilgit Baltistan may have a higher number of positive cases in January and February, after which cases slowly decrease. In Table 1, the error metrics show that the MSE for Gilgit Baltistan is 0.202, the MAE is 2.024, the RMSE is 4.859, and the MAPE is 0.011. Furthermore, in Table 2, the linear regression model is evaluated on the Gilgit Baltistan region, for which the accuracy is 84%, support is 94%, precision is 84%, recall is 1%, F1-score is 96%, and sensitivity is 1%. Figure 12 represents the forecast for Khyber Pakhtunkhwa, Pakistan. In Table 1, the error metrics show that the MSE for Khyber Pakhtunkhwa is 0.202, the MAE is 2.024, the RMSE is 4.859, and the MAPE is 0.011.
Furthermore, in Table 2, the linear regression model is evaluated using evaluation metrics on the Khyber Pakhtunkhwa region, for which the accuracy is 86%, support is 27%, precision is 76%, recall is 1%, F1-score is 82%, and sensitivity is 1%. Figure 13 represents the forecast for Balochistan, Pakistan. In Table 1, the error metrics show that the MSE for Balochistan is 0.202, the MAE is 2.025, the RMSE is 4.860, and the MAPE is 0.011. Furthermore, in Table 2, the linear regression model is evaluated on the Balochistan region, for which the accuracy is 82%, support is 28%, precision is 71%, recall is 1%, F1-score is 82%, and sensitivity is 1%. Figure 14 represents the forecast for Azad Jammu and Kashmir, Pakistan. In Table 1, the error metrics show that the MSE for Azad Jammu and Kashmir is 0.202, the MAE is 2.024, the RMSE is 4.859, and the MAPE is 0.011. Furthermore, in Table 2, the linear regression model is evaluated on the Azad Jammu and Kashmir region, for which the accuracy is 74%, support is 30%, precision is 5%, recall is 1%, F1-score is 83%, and sensitivity is 1%. Using the above random forest methodology, a visualization of records in terms of actual versus predicted values is shown in Figures 15-20. In Table 3, the error metrics show that the MSE for Sindh is 0.006, the MAE is 2.035, the RMSE is 3.389, and the MAPE is 0.006. Furthermore, in Table 4, the random forest model is evaluated using evaluation metrics on the Sindh region, for which the accuracy is 93%, support is 136%, precision is 84%, recall is 82%, F1-score is 90%, and sensitivity is 92%. Figure 16 shows that Punjab may have a higher number of positive cases in March, April, and May.
In Table 3, the error metrics show that the MSE for Punjab is 0.149, the MAE is 2.035, the RMSE is 3.389, and the MAPE is 0.006. Furthermore, in Table 4, the random forest model is evaluated using evaluation metrics on the Punjab region, for which the accuracy is 93%, support is 154%, precision is 85%, recall is 75%, F1-score is 88%, and sensitivity is 92%. Figure 17 represents the forecast for Khyber Pakhtunkhwa, which may have a higher number of positive cases in April and May. In Table 3, the error metrics show that the MSE for Khyber Pakhtunkhwa is 0.022, the MAE is 2.035, the RMSE is 3.389, and the MAPE is 0.006. Furthermore, in Table 4, the random forest model is evaluated on the Khyber Pakhtunkhwa region, for which the accuracy is 93%, support is 154%, precision is 84%, recall is 84%, F1-score is 89%, and sensitivity is 92%. Figure 18 represents the forecast for Gilgit Baltistan. In Table 3, the error metrics show that the MSE for Gilgit Baltistan is 0.002, the MAE is 2.035, the RMSE is 3.389, and the MAPE is 0.006. Furthermore, in Table 4, the random forest model is evaluated on the Gilgit Baltistan region, for which the accuracy is 95%, support is 117%, precision is 90%, recall is 76%, F1-score is 92%, and sensitivity is 90%. Figure 19 represents the forecast for Balochistan, where the blue bars indicate that Balochistan may have a higher number of COVID-19-positive cases in April, May, and June. In Table 3, the error metrics show that the MSE for Balochistan is 0.013, the MAE is 2.035, the RMSE is 3.389, and the MAPE is 0.006. Furthermore, in Table 4, the random forest model is evaluated on the Balochistan region, for which the accuracy is 93%, support is 156%, precision is 92%, recall is 79%, F1-score is 86%, and sensitivity is 92%. Figure 20 represents the forecast for Azad Jammu and Kashmir, where the blue bars indicate that Azad Jammu and Kashmir may have a higher number of COVID-19-positive cases in April, May, and June. In Table 3, the error metrics show that the MSE for Azad Jammu and Kashmir is 0.126, the MAE is 2.030, the RMSE is 3.389, and the MAPE is 0.006. Furthermore, in Table 4, the random forest model is evaluated on the Azad Jammu and Kashmir region, for which the accuracy is 93%, support is 181%, precision is 85%, recall is 74%, F1-score is 85%, and sensitivity is 92%. Using the above XGBoost methodology, a visualization of records in terms of actual versus predicted values is shown in the graphs below. Figures 21-26 show the results of the COVID-19 trend using the XGBoost model in the regions of Pakistan. The red bars are the training data, whereas the blue bars are the predicted trend. In Figure 21, red bars represent the actual COVID-19-positive cases in Sindh, Pakistan, whereas blue bars represent the predicted COVID-19-positive cases. According to the prediction, this figure shows that Sindh may have a higher number of positive cases in May.
In Table 5, the error metrics show that the MSE for Sindh is 0.074, the MAE is 0.579, the RMSE is 1.389, and the MAPE is 0.003. Figure 22 shows that Punjab may have a higher number of positive cases in April and May. In Table 5, the error metrics show that the MSE for Punjab is 0.394, the MAE is 1.332, the RMSE is 3.17, and the MAPE is 0.007. Figure 23 represents the forecast for Balochistan. In Table 5, the error metrics show that the MSE for Balochistan is 0.304, the MAE is 1.169, the RMSE is 2.807, and the MAPE is 0.006. Figure 24 represents the forecast for Khyber Pakhtunkhwa. In Table 5, the error metrics show that the MSE for Khyber Pakhtunkhwa is 0.198, the MAE is 0.836, the RMSE is 2.008, and the MAPE is 0.004. Figure 25 represents the forecast for Gilgit Baltistan. In Table 5, the error metrics show that the MSE for Gilgit Baltistan is 0.049, the MAE is 0.944, the RMSE is 2.266, and the MAPE is 0.005. Figure 26 represents the forecast for Azad Jammu and Kashmir. In Table 5, the error metrics show that the MSE for Azad Jammu and Kashmir is 0.049, the MAE is 0.472, the RMSE is 1.135, and the MAPE is 0.002.

Comparative Analysis.
Linear regression, random forest, and XGBoost algorithms are used to predict COVID-19 cases, and it is observed that the random forest algorithm is better than linear regression. The random forest provides high accuracy for the prediction of positive COVID-19 cases in Pakistan. To compare the performance of linear regression, XGBoost, and random forest estimation method, mean square error (MSE), root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) are used [32,33].
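The four error measures named above follow directly from their standard definitions. The sketch below is an illustration with hypothetical values, not the evaluation code of the study; it assumes that MAPE is expressed as a fraction rather than a percentage and that no actual value is zero.

```python
import math


def error_metrics(actual, predicted):
    """Return MSE, RMSE, MAE, and MAPE (as a fraction) for paired sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)  # RMSE is the square root of MSE
    mae = sum(abs(e) for e in errors) / n
    # MAPE divides by the actual value, so actual values must be nonzero.
    mape = sum(abs(e / a) for e, a in zip(errors, actual)) / n
    return {"mse": mse, "rmse": rmse, "mae": mae, "mape": mape}


# Hypothetical actual and predicted daily case counts.
actual_cases = [100, 200, 400]
predicted_cases = [110, 190, 400]
metrics = error_metrics(actual_cases, predicted_cases)
print(metrics)
```

Lower values are better for all four measures; MSE and RMSE penalize large individual errors more heavily than MAE does.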

Evaluation Metrics.
Since prediction error is inevitable [34], the accuracy of all algorithms is checked. To identify the model with the best predictive power, we considered six evaluation metrics: accuracy, precision, sensitivity, recall, support, and F1-score [35,36]. Tables 2 and 4 show the performance results of the machine learning algorithms for the regions of Pakistan for our proposed model. It is observed that linear regression and random forest show comparable results, with random forest performing comparatively better than linear regression. However, this paper also proposes using the XGBoost algorithm, which performs better than both of these ML algorithms.
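The evaluation metrics listed above are defined on class labels. The text does not state how labels were derived from the forecasts, so the sketch below assumes binary labels (for example, 1 for a day on which cases rise); that labeling, the function name, and the sample data are hypothetical. Under the standard definitions used here, sensitivity is the same quantity as recall, and support is the number of true instances of the positive class.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall (sensitivity), F1-score, and support
    for the positive class of a binary labeling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,       # also called sensitivity
        "f1": f1,
        "support": tp + fn,     # number of true positive-class instances
    }


# Hypothetical labels for eight days.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
report = classification_metrics(y_true, y_pred)
print(report)
```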
4.3. Correlation. Correlation is used to measure the interrelation between two variables and the direction of their relationship. The value of the correlation coefficient always lies between -1 and +1. As the coefficient approaches 0, the relationship between the variables becomes weak. A positive (+) sign indicates a positive relationship between variables, and a negative (-) sign indicates a negative relationship. There are several types of correlation: point-biserial correlation, Kendall rank correlation, Spearman correlation, and Pearson correlation [37,38].

Pearson Correlation.
Through Pearson correlation, we can measure the relationship between linearly related variables. It is a widely used correlation. In this type of correlation, the variables whose correlation is to be found are assumed to be normalized; if they are not, normalization should be performed first [39]. The relationship between the two variables must be linear, assuming that the data is equally distributed about the regression line. Correlation between dataset features provides detailed information about the features and the degree of influence that they have on the target value. The heat map of the Pearson correlation between the features of the dataset is shown in Figure 27. It reveals a strong positive correlation between new positive cases and patients hospitalized with symptoms. There is also a strong correlation between total cases and deaths. Figure 28 shows the correlation between new positive cases and recoveries, and there is also a strong correlation between total cases and total recoveries.
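The Pearson coefficient behind such a heat map can be computed for any pair of features as the covariance divided by the product of the standard deviations. The following is a minimal sketch; the function name and the two short series are hypothetical and are not taken from the dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length of at least 2")
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    covariance = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    x_spread = math.sqrt(sum((a - x_mean) ** 2 for a in x))
    y_spread = math.sqrt(sum((b - y_mean) ** 2 for b in y))
    if x_spread == 0 or y_spread == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (x_spread * y_spread)


# Hypothetical daily values for two features.
new_positive = [10, 20, 30, 40]
hospitalized = [3, 5, 9, 11]
r = pearson(new_positive, hospitalized)
print(r)
```

A value near +1, as in this sample, corresponds to a strongly colored cell in the heat map; computing the coefficient for every pair of features yields the full correlation matrix.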

Conclusion
This deadly virus has killed many people all around the world. It causes a dangerous disease that spreads from one human to another and severely damages the lungs.
In this paper, we have proposed machine learning methods for forecasting COVID-19-positive cases in the regions of Pakistan. Random forest, XGBoost, and linear regression algorithms were used as prediction models. After evaluating these algorithms, it was found that the random forest and XGBoost algorithms provide better accuracy than linear regression, and both provide a high prediction rate. The evaluation results of the proposed model indicate that using the dataset variables as predictors can lead to high forecasting accuracy. These predictions will be helpful for researchers, government authorities, and health industry planners to manage services and arrange medical infrastructure accordingly. Additionally, the correlation matrix reveals that COVID-19-positive patients and hospitalized patients are strongly correlated. The proposed model can also help other countries forecast COVID-19-positive cases.
In the future, this model can be extended to implement various other ML algorithms and prediction methodologies.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.