Neonatal Disease Prediction Using Machine Learning Techniques

Neonatal diseases are among the main causes of morbidity and a significant contributor to under-five mortality worldwide. Understanding of the pathophysiology of these diseases has improved, and different strategies have been implemented to minimize their burden. However, improvements in outcomes remain inadequate. The limited success is due to several factors, including the similarity of symptoms, which can lead to misdiagnosis, and the inability to detect disease early enough for timely intervention. In resource-limited countries like Ethiopia, the challenge is more severe. Low access to diagnosis and treatment due to the shortage of neonatal health professionals is one of the shortcomings. Because of the shortage of medical facilities, many neonatal health professionals are forced to determine the type of disease based only on interviews, which may not give a complete picture of all variables contributing to neonatal disease. This can make the diagnosis inconclusive and may lead to misdiagnosis. Machine learning has great potential for early prediction when relevant historical data are available. We applied a stacking classification model to the following four main neonatal diseases: sepsis, birth asphyxia, necrotizing enterocolitis (NEC), and respiratory distress syndrome. These diseases account for 75% of neonatal deaths. The dataset was obtained from the Asella Comprehensive Hospital and was collected between 2018 and 2021. The developed stacking model was compared to three related machine learning models: XGBoost (XGB), Random Forest (RF), and Support Vector Machine (SVM). The proposed stacking model outperformed the other models, with an accuracy of 97.04%. We believe that this will contribute to the early detection and accurate diagnosis of neonatal diseases, especially in resource-limited health facilities.


Introduction
The neonatal period is a critical time in human life when a newborn baby has to adapt to a new environment and complete several physiological adjustments that are essential for life [1]. Neonatal mortality is a significant contributor to under-five mortality [1]. According to estimates for 2018, more than 2.4 million children died in their first month of life [2]. The neonatal mortality rate differs between regions and nations. One-third of the world's neonatal deaths occur in sub-Saharan Africa, with about 34 deaths per 1000 live births. The risk of neonatal death is approximately 55 times higher in the country with the highest mortality rate than in the country with the lowest [3]. The neonatal mortality rate in Ethiopia is about 30 per 1000 live births [4]. The region is falling short of achieving Sustainable Development Goal 3 (SDG-3) [5].
The leading neonatal diseases are sepsis, respiratory distress syndrome, birth asphyxia, and necrotizing enterocolitis, accounting for 26%, 23%, 19%, and 7% of neonatal deaths, respectively [6][7][8]. In Ethiopia, the most common diseases leading to neonatal death are sepsis, birth asphyxia, necrotizing enterocolitis (NEC), and respiratory distress syndrome (RDS) [4]. Contributing factors for neonatal death include shortages of neonatologists and pediatricians, inadequate diagnostic tools, diagnostic delay, and a lack of quality care and treatment for neonatal conditions [9]. Some neonatal diseases have similar symptoms, which often results in the inappropriate use of antibiotics and increases the risk of developing antimicrobial resistance. For instance, neonatal sepsis presents very similarly to diseases such as perinatal asphyxia and necrotizing enterocolitis, which makes it difficult to diagnose and treat accurately. In resource-limited countries like Ethiopia, neonatal diseases exert a heavy burden on families, society, and the health system. There are preventive and curative strategies to mitigate the impact, but improvements in outcomes have been limited. Preventive approaches focus on maternal health before birth, such as maternal immunization and efforts to guarantee a healthy pregnancy [10,11]. With respect to curative approaches, diagnostic tools are limited, and diagnostic results take a long time; the delay often leads to a neonate's condition rapidly deteriorating [12]. This has serious repercussions, including chronic lung disease, neurodevelopmental abnormality, and long-term impairment that necessitate continuous hospitalization [13][14][15][16]. There are also significant increases in expenses and burdens for both survivors and caregivers. Hence, early identification of neonatal disease with appropriate antibiotic therapy can be effective in reducing neonatal death, reducing cost, and lowering antibiotic resistance in the community [17].
Detection of diseases at an early stage with minimum cost is an area of interest to many researchers [18]. Previous studies have shown the effectiveness of machine learning techniques in early recognition for timely preemptive clinical intervention [19]. There have been successful applications of single classifiers, ensemble techniques, stacking, and hybrid machine learning methods [20]. Late-onset sepsis (LOS) is one of the major contributors to morbidity and mortality in neonates, and its early detection is critical to reduce related illness and death. Machine learning techniques have been used effectively for the early recognition of LOS [21]. By identifying disease onset before it becomes clinically evident and starting antibiotic medication on time, it may be possible to avert negative outcomes in newborns.
In this study, we used a stacking machine learning model to classify the following four major neonatal diseases: sepsis, birth asphyxia, necrotizing enterocolitis (NEC), and respiratory distress syndrome, which together account for 75% of neonatal deaths. The dataset was obtained from the Asella Comprehensive Hospital and was collected between 2018 and 2021. Comparisons have been made between the developed stacking model and selected machine learning models, namely XGBoost, Random Forest (RF), and Support Vector Machine (SVM), with and without feature selection.
The remainder of the paper is organized into four sections. Section 2 discusses related work on neonatal disease prediction. Section 3 contains materials and methods, covering the dataset, preprocessing, the proposed machine learning model, and evaluation. Section 4 presents the experiments, results, discussion, and evaluation of the proposed method. Lastly, Section 5 concludes with the major findings and inferences.

Related Works
Machine learning approaches have great potential, given that high-risk neonates receive intensive care that is becoming more and more complex. Machine learning has been used in numerous studies to forecast neonatal illnesses and mortality. Selected related studies on neonatal disease prediction are discussed below.
Supervised machine learning techniques have been used for the diagnosis of neonatal diseases, and their application to the analysis of neonatal data has been comprehensively explored by Shirwaikar et al. [22]. They critically analyzed and discussed the methods and performance metrics of supervised techniques used on neonatal data and suggested ways to improve performance. From their review, ensemble techniques have better predictive power than SVM, neural networks, and decision trees.
Sheikhtaheri et al. applied machine learning techniques to improve the prediction of neonatal mortality and its risk [23]. The dataset was collected in Iran in two phases. The factors that lead to infant death, including diseases, were first identified before training, testing, and evaluating the effectiveness of several algorithms, such as ANN, RF, CHART, SVM, and ensembles. SVM had the best accuracy, at 94%.
Using a backpropagation (BP) learning algorithm, Chowdhury et al. trained a multilayer perceptron to identify a design pattern for the prediction of neonatal illnesses. They compared their approach with different algorithms that had previously been used for the prediction of neonatal diseases, such as conjugate gradient descent and quick propagation. The proposed model was tested on 94 cases with different symptoms and signs as parameters and obtained 75% accuracy [24]. Safdari et al. developed an expert system with fuzzy logic that predicts the risk of neonatal death. To gain knowledge, they created questionnaires and distributed them to neonatologists [25]. Then, they combined computational and fuzzy models based on an inference system for the prediction of neonatal death risk. They used MATLAB for model building and C# for the graphical user interface (GUI). The model has a 90% accuracy.
Shirwaikar et al. applied machine learning techniques to predict episodes of apnea in preterm neonates, considering only neonates no older than one week. The dataset comprises 229 neonates admitted to the neonatal intensive care unit (NICU). SVM, RF, and decision trees were used to predict apnea episodes; RF outperformed the other machine learning models with an accuracy of 88% [26]. They developed a machine learning-based automated solution to predict apnea in neonates.
Mani et al. developed machine learning models to predict LOS using secondary data from electronic medical records (EMR) [17]. Comparisons were made between the predictions of the machine learning models and the sepsis treatments administered by physicians. The outcome was impressive: eight of the nine machine learning algorithms tested outperformed physicians in terms of treatment sensitivity, and all nine were superior in terms of specificity.
There are studies in Ethiopia to predict neonatal diseases and mortality. Bitew et al. modeled the risk of under-five mortality in Ethiopia using RF, LR, and KNN [27]. They tried to identify important sociodemographic determinants using the 2016 EDHS dataset. RF had the highest accuracy, at 67.2%. Different regions of Ethiopia have different under-five mortality rates. A summary of selected related works is shown in Table 1.

Materials and Methods
In this study, four high-burden neonatal diseases, namely sepsis, birth asphyxia, necrotizing enterocolitis (NEC), and respiratory distress syndrome, have been classified using a stacked machine learning approach. The dataset was obtained from Asella Comprehensive Hospital. Figure 1 shows the overall workflow.
The proposed architecture is shown in Figure 2. Steps from collecting relevant data through evaluation have been followed. The dataset undergoes preprocessing, including cleaning, handling missing values, and transforming the data. Recursive feature elimination with cross-validation has been chosen as the feature selection technique to identify relevant features. The preprocessed data was then fed into SVM, RF, and XGB, and the results of the three selected models were combined to form the stacking model. The models' performance was evaluated using stratified k-fold cross-validation (k = 10) with and without feature selection. These steps and techniques are discussed in the following sections.

Data Collection.
Data used for this research was obtained from the patient cards of neonates admitted to the NICU of Asella Comprehensive Hospital, Asella, Oromia, Ethiopia, during the period 2018 to 2021. The hospital keeps each patient's record in a manual format, so the primary task in data collection was to carefully encode each instance into a soft copy. The data was compiled from neonatal disease discharge summaries and examination cards. The three-year dataset has 2298 instances with 20 features and includes admission information, delivery information, symptoms, laboratory results, and X-ray results. A description of the features of the dataset is shown in Table 2. Experts working in the NICU reviewed the patient history dataset. To enhance our understanding of the situation and features, we conducted interviews with pediatricians. We also assessed local and international literature on neonatal disease.

Preprocessing.
The dataset of the study contains incomplete, noisy, inconsistent, inaccurate, and irrelevant values. Preprocessing has been carried out before modeling, as shown in Figure 3.

Cleaning Data and Missing Values Handling.
Missing values can be handled in several ways, including dropping them if they have an insignificant impact on individual instances, replacing them with a global constant, imputation, and predicting the missing values. In the dataset, 12 features contain missing values, as shown in Table 3. The missing values were filled in via imputation, using mean values for numeric features and mode values for categorical features.
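As a sketch, this imputation step could look as follows with scikit-learn's SimpleImputer; the column names below are hypothetical examples, not taken from the study's dataset. Mean imputation is applied to the numeric column and mode (most frequent) imputation to the categorical one.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame; the column names are hypothetical, not from the study's dataset.
df = pd.DataFrame({
    "birth_weight": [2.9, np.nan, 3.4, 2.5],        # numeric feature
    "delivery_mode": ["svd", "cs", np.nan, "svd"],  # categorical feature
})

# Mean imputation for numeric features, mode (most frequent) for categorical.
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.columns.difference(num_cols)

df[num_cols] = SimpleImputer(strategy="mean").fit_transform(df[num_cols])
df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])
```

After this step, the missing birth weight is replaced by the column mean and the missing delivery mode by the most frequent category.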

Handling Imbalanced Data and Feature Scaling.
The dataset has a slight class imbalance, which has been handled by setting the class-weight hyperparameter. A standard scaler has been used for feature scaling in this study.
The standard scaler transforms each feature value as z = (x − u) / s, where x is the score of a sample, u is the training sample mean, and s is the standard deviation.
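The formula can be checked against scikit-learn's StandardScaler, which implements exactly this transformation; the toy values are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature column; the values are illustrative only.
X = np.array([[1.0], [2.0], [3.0], [4.0]])

Z = StandardScaler().fit_transform(X)

# StandardScaler applies z = (x - u) / s with the population standard deviation.
u, s = X.mean(), X.std()
Z_manual = (X - u) / s
```

The scaled column has zero mean and unit standard deviation, as the formula implies.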

Selection of Features.
One of the preprocessing steps is identifying the feature set that is relevant to generate the best possible result at a feasible computational cost. It is the process of deciding which features, typically from a large number of inputs, are the most important, because not all features will necessarily be useful. Hence, the primary goal of feature selection is to choose an essential set of features that reduces computational cost without compromising the performance of the model. Clinical studies frequently use filter, wrapper, and embedded feature selection approaches [28][29][30][31]. In the filter approach, the most important features are chosen by evaluating the correlation between each feature and the target; it is independent of the machine learning algorithm. Another popular approach is the wrapper method, which treats the selection of a feature set as a search problem in which several combinations are generated, estimated, and compared with one another. Univariate selection, recursive feature elimination, and sequential forward selection are commonly used methods. Recursive feature elimination (RFE) is an effective technique that is efficient at picking out the most essential features. Hence, recursive feature elimination with cross-validation (RFECV) has been chosen for this study.
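A minimal RFECV sketch with scikit-learn is shown below; synthetic data stands in for the hospital dataset, which is not public, and the estimator choice is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the neonatal dataset (20 features, as in the study).
X, y = make_classification(n_samples=300, n_features=20, n_informative=6,
                           n_redundant=4, random_state=42)

# RFECV repeatedly drops the weakest-ranked features and uses
# cross-validated accuracy to decide how many features to keep.
rfecv = RFECV(estimator=RandomForestClassifier(n_estimators=30, random_state=42),
              step=1, cv=StratifiedKFold(5), scoring="accuracy")
rfecv.fit(X, y)

selected = rfecv.support_  # boolean mask over the 20 input features
```

Unlike plain RFE, which needs the target number of features in advance, RFECV picks that number automatically from the cross-validation scores.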

Modeling.
Instead of individual learners, we used the stacking approach, which is one of the most successful approaches to classification and regression problems. If appropriately applied, multilevel stacking generates more precise results than individual models. In stacking, individual model predictions from the prior level are used as input for models in the subsequent level, the meta-learner [32]. It combines multiple classifiers or models M1, M2, ..., Mn on a single dataset S [33]. S consists of examples si = (xi, yi), i.e., pairs of feature vectors (xi) and their classifications (yi). It starts with the generation of base-level classifiers C1, ..., Cn, where Ci = Mi(S). Second, the output of the base-level classifiers is used as input by the meta-level learner. Cross-validation has been applied to create a training set for the meta-level classifier. The procedure continues as shown in Figure 4. Three base-level learners (SVM, RF, and XGB) have been combined for stacking, with and without feature selection. The model-building workflow is shown in Figure 5, and the base-level learners are discussed in the following subsections:
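The described setup can be sketched with scikit-learn's StackingClassifier. This is a conceptual sketch on synthetic data: a GradientBoostingClassifier stands in for XGBoost (which requires the separate xgboost package), and the meta-learner choice is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic four-class problem mirroring the four-disease setting.
X, y = make_classification(n_samples=400, n_features=12, n_informative=6,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Base-level learners; GradientBoostingClassifier stands in for XGBoost.
base = [
    ("svm", SVC(probability=True, random_state=0)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("gb", GradientBoostingClassifier(random_state=0)),
]

# Cross-validated base-level predictions become inputs to the meta-learner.
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)
stack.fit(X_tr, y_tr)
acc = stack.score(X_te, y_te)
```

The `cv=5` argument makes scikit-learn build the meta-level training set from out-of-fold base predictions, matching the cross-validation step described above.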

Support Vector Machine (SVM).
SVM is a collection of similar classification and regression learning methods. It can be linear, multiple, or nonprobabilistic. The primary goal is to find the best possible boundary between classes. To classify data, SVM creates a hyperplane or set of hyperplanes in a high-dimensional space, as shown in Figure 6. Data points on opposite sides of a hyperplane belong to different classes. The larger the hyperplane's distance from the closest training data points, the better the separation for classification; hence, the larger the margin, the smaller the classifier's error.

Random Forest (RF).
Random Forest is an ensemble classifier that can solve classification and regression problems and is composed of decision trees. The technique generates a forest of several decision trees at random; the result is more precise when there are more trees in the forest. RF operates by first selecting K data points at random from the training sample and building decision trees associated with the selected data points. After choosing the number N of decision trees to build, these two steps are repeated until N trees exist. Finally, it collects the prediction of each decision tree and assigns each new data instance to the category with the most votes.
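The majority-vote idea can be illustrated by inspecting the individual trees of a fitted forest. Note that this is a conceptual approximation: scikit-learn's RandomForestClassifier actually averages predicted class probabilities rather than counting hard votes, so the two answers can occasionally differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=1)
rf = RandomForestClassifier(n_estimators=25, random_state=1).fit(X, y)

# Each tree votes for a class; the forest's answer is (approximately)
# the class with the most votes across all trees.
tree_votes = np.stack([tree.predict(X) for tree in rf.estimators_])
majority = np.apply_along_axis(
    lambda v: np.bincount(v.astype(int)).argmax(), 0, tree_votes)
```

On most samples, the hand-counted majority vote agrees with `rf.predict(X)`.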

XGBoost (XGB).
XGBoost is an extended version of gradient-boosted decision trees designed for speed and performance in machine learning. XGBoost is used for both classification and regression tasks. Important features of XGBoost are as follows: (i) parallelization: training is implemented across multiple CPU cores; (ii) regularization: XGBoost uses different regularization terms to avoid overfitting; (iii) nonlinearity: the ability to capture nonlinear patterns in data; (iv) built-in cross-validation; and (v) scalability.

Hyperparameter Tuning.
Hyperparameter tuning is the process of selecting a group of hyperparameters to optimize performance. Tuning can be carried out manually or automatically. In manual tuning, different sets of hyperparameters are selected and tested; this is tiresome and may not be feasible when there are many hyperparameters to try. With automatic approaches, an optimization algorithm is used to select the optimal set of hyperparameters. In this study, we have used the automatic method. The two most popular algorithms are grid search and random search. Grid search is a common technique for hyperparameter optimization that conducts a complete search over a predetermined subset of the algorithm's hyperparameter space; candidates are generated during training from a particular grid of parameter values. High-dimensional spaces are problematic for this approach. Grid search is inferior to random search, especially when only a small number of hyperparameters affect the performance of the machine learning algorithm. Hence, a random search has been used in this study.
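A random search over a Random Forest's hyperparameters might be sketched as follows with scikit-learn's RandomizedSearchCV; the parameter values are illustrative, not those used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=3)

# Sample 8 of the 16 possible combinations at random instead of trying all.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=3),
    param_distributions={"n_estimators": [10, 50, 100, 200],
                         "max_depth": [None, 3, 5, 10]},
    n_iter=8, cv=3, random_state=3)
search.fit(X, y)
```

`n_iter` caps the number of sampled combinations, which is what makes random search cheaper than an exhaustive grid search in large spaces.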

Evaluation.
Evaluation techniques have been used to assess the performance of the proposed model. Performance evaluation may use holdout or cross-validation. By testing a model on data other than the data used to train it, holdout evaluation attempts to provide an objective assessment of learning performance: the dataset is randomly divided into training and testing subsets, the machine learning models are trained on the training set, and their performance is then tested on the unseen testing set. K-fold cross-validation evaluates a model's performance on unseen test data across multiple splits. The stratified form of k-fold cross-validation enforces that the class distribution in each split matches that of the entire training dataset. Because the class distribution is slightly imbalanced, we believe stratified k-fold cross-validation is appropriate, and it has been used in this study. The performance of the selected models has been evaluated using various metrics, including precision, recall, accuracy, and F1-score. When classification is conducted, four kinds of results can occur: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). From these counts, precision = TP/(TP + FP), recall = TP/(TP + FN), and accuracy = (TP + TN)/(TP + TN + FP + FN). The F1-score is the harmonic mean of precision and recall: F1 = 2 × (precision × recall)/(precision + recall). Table 4 shows the confusion matrix.
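The metric definitions can be expressed directly as functions of the confusion-matrix counts; the counts below are toy values, not the paper's results.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)          # TP / (TP + FP)
    recall = tp / (tp + fn)             # TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Toy counts, not the paper's experimental results.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
# p = 0.90, r = 0.75
```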

Results and Discussion
In this section, dataset exploration, feature selection, modeling, and evaluation are discussed. The results of the selected models and the newly developed stacking model were compared. The best-performing model has been deployed using a Flask server. A comparative discussion of the results with those of previous studies has also been made.

Dataset Exploration.
The total size of the dataset is 2298 instances, with 20 features including the target class. The four dominant neonatal diseases considered in the study are sepsis, respiratory distress syndrome (RDS), necrotizing enterocolitis (NEC), and perinatal asphyxia (PA). Their distribution is shown in Figure 7: 711 instances of sepsis, 648 instances of respiratory distress syndrome (RDS), 527 instances of perinatal asphyxia (PA), and 412 instances of necrotizing enterocolitis (NEC). There is a slight class imbalance. As shown in Figure 8, 59.9% of the mothers had antenatal care follow-up during their pregnancy. As shown in Figure 9, 49.3% of neonates were born at term, 4.6% were born preterm, and 46.1% were born post-term.

Feature Relevance.
The ranking of features based on their relevance is shown in Figure 10. Feature selection methods have been applied in order to select relevant feature sets for better predictive performance of the classifiers at an acceptable computational cost. Recursive feature elimination with cross-validation (RFECV) was used in the training of the SVM, RF, XGB, and stacking ensemble models. As a result, 12 features were selected.
Models were built on the multiclass dataset with and without feature selection techniques. Stratified 10-fold cross-validation has been used along with the other evaluation methods, as previously discussed. The performance of stacking, SVM, RF, and XGB is first discussed using the original features of the neonatal disease dataset, without any feature selection.

Journal of Healthcare Engineering
The performance of SVM is shown in the confusion matrix in Figure 11. 104 instances of NEC out of 105, 119 instances of PA out of 127, 154 instances of RDS out of 163, and 164 instances of sepsis out of 180 have been correctly classified. It wrongly classified 1 instance of NEC as PA, 2 instances of PA as NEC, 1 instance of PA as RDS, and 5 instances of PA as sepsis. The other misclassifications can likewise be seen in the figure. The normalized confusion matrix is displayed in Figure 11(b); it is identical to Figure 11(a) except that it displays correctly identified instances as proportions. Figure 12 shows the confusion matrix used to assess Random Forest's performance. It correctly classified 102 instances of NEC out of 105, 120 instances of PA out of 127, 158 instances of RDS out of 163, and 166 instances of sepsis out of 180. RF misclassified only 1 instance of NEC as PA and 2 instances of NEC as sepsis; the other misclassifications can be seen in the figure. The proportions of correctly classified instances are shown in the normalized confusion matrix.
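The normalization described here (each row of the confusion matrix divided by the number of true instances of that class) corresponds to scikit-learn's `normalize="true"` option; the labels below are made up for illustration, not the study's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels only, not the study's predictions.
y_true = ["NEC", "NEC", "PA", "PA", "RDS", "sepsis", "sepsis", "sepsis"]
y_pred = ["NEC", "PA",  "PA", "PA", "RDS", "sepsis", "sepsis", "RDS"]

labels = ["NEC", "PA", "RDS", "sepsis"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
# normalize="true" divides each row by the number of true instances
# of that class, putting the per-class recall on the diagonal.
cm_norm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
```

Each row of `cm_norm` sums to 1, which is why the normalized matrices in the figures show decimals.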
The other classifier used is XGB, and its performance is shown in Figure 13. It correctly classified 105 instances out of 105, 121 out of 127, 154 out of 163, and 163 out of 180 as NEC, PA, RDS, and sepsis, respectively. The misclassifications can be seen in the figure. The confusion matrix in Figure 13(b) is identical to Figure 13(a), except that it has been normalized.
The evaluation results of the RF, XGB, SVM, and stacking models without feature selection are summarized in Table 5. Stacking's score is the highest on all four performance metrics: precision, recall, F1-score, and accuracy.
The next set of experiments used RFE to choose the best feature subset with the objective of enhancing model performance. The evaluation results of the RF, XGB, SVM, and stacking models with feature selection are discussed below.
The evaluation result of SVM using recursive feature elimination with cross-validation is shown in Figure 14. 104 instances of NEC out of 105, 120 instances of PA out of 127, 156 instances of RDS out of 163, and 167 instances of sepsis out of 180 have been correctly classified. There are few wrongly classified instances. Figure 14(b) shows the normalized confusion matrix for SVM with RFECV.
The confusion matrix of the Random Forest model with recursive feature elimination and cross-validation is illustrated in Figure 15. 102 instances of NEC out of 105, 120 instances of PA out of 127, 158 instances of RDS out of 163, and 166 instances of sepsis out of 180 have been correctly classified. There are a few wrongly classified instances. Figure 15(b) shows the normalized evaluation results for RF with RFECV.
The performance evaluation results of the XGBoost model with recursive feature elimination and cross-validation are illustrated in Figure 16. 105 instances of NEC out of 105, 121 instances of PA out of 127, 154 instances of RDS out of 163, and 163 instances of sepsis out of 180 have been correctly classified. There are few wrongly classified instances. Figure 16(b) shows the normalized confusion matrix for XGBoost with RFECV.
The confusion matrix of the stacking model with recursive feature elimination and cross-validation is illustrated in Figure 17. 105 instances of NEC out of 105, 123 instances of PA out of 127, 158 instances of RDS out of 163, and 171 instances of sepsis out of 180 have been correctly classified. There are very few wrongly classified instances.
The stratified 10-fold cross-validation with recursive feature elimination results of SVM, RF, XGB, and stacking are shown in Table 6. Stacking's score is the highest on all four performance metrics: precision, recall, F1-score, and accuracy. It outperformed the three other models on all metrics.

Although direct comparisons are difficult due to differences in datasets, populations, and other factors, we found that the developed stacking model performs better than the results of previous works, as shown in Table 7.
One of the main results is the improved performance obtained by combining base models through stacking. Different experiments have been carried out to improve predictive performance. The APGAR score, CRP (C-reactive protein), resuscitation, LLVW (low lung volume and whiteout), ICSCR (intercostal and subcostal retractions), blood cultures, SpO2 (oxygen saturation), GA (gestational age), WBC (white blood cells), seizures, RR (respiratory rate), weight, and grunting are the major features used to predict neonatal diseases. The stacking model outperforms the three base models (Random Forest, Support Vector Machine, and XGB) with and without feature selection. Models with RFECV perform better than models trained on the original features. The stacking model's accuracy, precision, recall, and F1-score are 97.04%, 97.21%, 97.38%, and 97.30%, respectively.
[Rows of the comparison table spilled here: (3) an artificial neural network model for neonatal disease diagnosis [24], ANN, accuracy 75%; (4) a fuzzy expert system to predict the risk of neonatal death [25], fuzzy inference system, accuracy 90%; (5) machine learning techniques for neonatal apnea prediction [26], DT, SVM, and RF, highest accuracy 88% with RF; (6) medical decision support for early detection of late-onset neonatal sepsis [17], SVM, NB and its variants TAN and AODE, K-nearest neighbor, CART, RF, LR, and LBR, with the machine learning algorithms outperforming clinicians in sensitivity and specificity; (7) a machine learning approach for predicting under-five mortality determinants in Ethiopia from the 2016 EDHS [27], RF, LR, and KNN, highest accuracy 67.2% with RF; (8) the proposed stacking model, highest accuracy 97.04%.]

Conclusion
Deaths caused by neonatal diseases are a significant global contributor to under-five mortality. There are advancements to combat the challenge, including an enhanced understanding of the pathophysiology of the diseases and technological assistance for diagnosis and treatment, but the improvement is limited. The similarity of disease symptoms, which may lead to misdiagnosis, and the inability to diagnose early enough for timely intervention are among the factors contributing to this limited success. Neonatal disease is a major child health challenge in resource-limited countries like Ethiopia. In Ethiopia, neonatal mortality accounts for 43.3% of under-five mortality, which indicates that it must receive adequate attention and prioritization to sustain the intended progress in the reduction of child mortality. Early detection of neonatal diseases is believed to make an important contribution. In this study, the main aim was to detect and classify four major neonatal diseases (NEC, PA, RDS, and sepsis) using machine learning techniques. The data was gathered at Asella Comprehensive Hospital in Oromia, Ethiopia; it has 2298 instances and 20 features. Different preprocessing techniques were applied to the dataset, including handling missing values with mean imputation, standard scaling, converting categorical features with label encoders, and class balancing. Further, recursive feature elimination with cross-validation was applied to choose a relevant set of features. Modeling was then carried out using four machine learning models (stacking, RF, XGB, and SVM) with stratified 10-fold cross-validation. The performance evaluation showed that stacking with RFECV feature selection outperformed the other models, with an accuracy of 97.04%. We believe that this will be useful for accurate diagnosis and early detection of neonatal diseases.

Data Availability
All the data related to this study will be provided upon request to the corresponding author.