Development of Hepatitis Disease Detection System by Exploiting Sparsity in Linear Support Vector Machine to Improve Strength of AdaBoost Ensemble Model

School of Computer Science and Engineering, University of Electronic Science and Technology of China (UESTC), Chengdu, China
Department of Computer Science, MNS University of Engineering and Technology Multan, Multan, Pakistan
Department of Computer Science, COMSATS University Islamabad, Lahore Campus, Lahore, Pakistan
School of Information and Software Engineering, University of Electronic Science and Technology of China (UESTC), Chengdu, China
School of Information and Communication Engineering, University of Electronic Science and Technology of China (UESTC), Chengdu, China
Department of Electrical Engineering, University of Science and Technology, Bannu, Pakistan


Introduction
Hepatitis is considered a major chronic liver disease worldwide. The liver is the heaviest and one of the largest organs of the human body [1], and it is one of the key organs responsible for several functions, including bile secretion, protein formation, and elimination of toxins from the body. Hence, inflammation of the liver (caused by hepatitis) results in liver dysfunction, and consequently the health of the subject deteriorates. The symptoms of hepatitis differ from patient to patient, with some subjects showing no signs at all. Well-known symptoms include yellowish eyes and skin, abdominal pain, poor appetite, and tiredness [2,3]. Hepatitis can be acute or chronic depending on its duration: if it lasts for less than six months it is acute, whereas if it lasts for more than six months it is chronic [4]. It has been reported that hepatitis results in more than a million deaths each year. Diagnosis of hepatitis through conventional methods is difficult and requires expensive medical tests [5]. Diagnosis of such a disease through an intelligent system, in contrast, reduces the cost and examines the patient in a shorter time. Hence, the development of intelligent diagnostic systems for this type of disease prediction is very important.
In this paper, we develop a hybrid intelligent diagnostic system. To improve the strength of the AdaBoost predictive model, we propose to use an L1-penalized linear SVM. The L1 penalty makes the linear SVM sparse, enabling it to eliminate redundant features by driving their coefficients to zero through sparse solutions. After elimination of redundant features through the sparse linear SVM, the remaining features are supplied to the AdaBoost model for classification. To analyze the impact of the sparse linear SVM on the AdaBoost model, we performed two types of numerical experiments: in the first, we developed the conventional AdaBoost model, while in the second, we constructed a learning system by stacking the sparse SVM with the AdaBoost model. The performance of both models was evaluated on a publicly available hepatitis disease dataset. Experimental results demonstrate that the sparse linear SVM enhances the accuracy of the conventional AdaBoost for hepatitis disease prediction based on the collected clinical features. Additionally, the sparse linear SVM reduces the AdaBoost model's complexity, as the optimal subset contains fewer features. The rest of the manuscript is organized as follows. The dataset and the proposed sparse linear SVM and AdaBoost-based learning system are elaborated in Section 2. Section 3 discusses the validation schemes and evaluation metrics used in the manuscript. Section 4 discusses the experimental setup and the obtained results, and the last section concludes the paper.

Dataset Description.
The hepatitis dataset consists of 155 samples, each containing 19 features. Details of the 19 commonly used features for the hepatitis dataset are given in Table 1. The label of the dataset is binary, i.e., it can take the value 1 or 2, where 1 means the sample belongs to a patient who died, while 2 means the sample belongs to a subject who survived. There are 32 samples with label 1 and 123 samples with label 2, i.e., the dataset contains 123 samples belonging to the healthy class and 32 samples belonging to the patient class. In machine learning, the data are split into two parts, namely training and testing: the training part is used to train the model, and its performance is checked by testing the trained model on the testing data. In this study, the dataset is divided into training and testing sets using a 70-30 partition. Hence, out of the 155 samples, 108 are used for training and the remaining 47 for testing. Of the 108 training samples, 23 belong to the patient class and 85 to the healthy class; of the 47 testing samples, 9 belong to the patient group and 38 to the healthy group. The low representation of the patient class is a limitation of the dataset.
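The 70-30 partition described above can be sketched with scikit-learn's `train_test_split`; the feature matrix below is a synthetic stand-in for the hepatitis data, used only to show that a 70-30 split of 155 samples yields the 108/47 counts reported here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 155 x 19 hepatitis feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(155, 19))            # 155 samples, 19 features
y = np.array([1] * 32 + [2] * 123)        # 1 = died, 2 = survived

# 70-30 partition: ceil(0.30 * 155) = 47 test samples, 108 training samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

print(X_train.shape[0], "training samples,", X_test.shape[0], "testing samples")
```

Note that a plain random split does not guarantee the exact 23/85 and 9/38 class counts reported above; `stratify=y` would preserve the class ratio in both partitions.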

Proposed Method.
As discussed above, in this paper we exploit the sparsity in the linear SVM to improve the strength of the AdaBoost ensemble model. Initially, the L1-penalized linear SVM is used to generate sparse features, i.e., to process the full set of features, null the redundant ones, and yield a subset containing only the relevant features. The subset of features generated by the sparse linear SVM is then supplied to the model for classification. The sparsity of the linear SVM is controlled by its hyperparameter λ; hence, for distinct values of λ, different features will be nullified, resulting in different subsets of features. Thus, to achieve better hepatitis prediction accuracy, it is necessary to develop a sparse linear SVM that nullifies the most redundant or irrelevant features and generates a subset of the most relevant ones. This can be accomplished by tuning the hyperparameter λ. To better comprehend the functioning of the proposed learning system, it is pertinent to briefly discuss the L1-penalized linear SVM model and its formulation, which follows.
Support vector machines (SVMs) are considered powerful learning methods and have been widely used in various biomedical and health informatics problems [30]. During training, an SVM constructs an optimal hyperplane that best separates the data points of the two classes (in the case of binary classification) [31]. A major reason that motivates machine learning researchers to use SVMs is their strong generalization capability to unseen data, together with their dependence on a very small number of hyperparameters [32].
The training data can be written as

$$\mathcal{D} = \{(p_i, q_i)\}_{i=1}^{S}, \qquad p_i \in \mathbb{R}^{Q}, \quad q_i \in \{-1, +1\},$$

where $p_i$ stands for the $i$-th instance, $Q$ represents the dimension of the original feature space of the hepatitis data, and $q_i$ denotes the class label, i.e., presence or absence of hepatitis disease. The value of $Q$ is 19 for the hepatitis dataset considered in this paper. The SVM model determines a hyperplane $g(x) = \beta^{T} x + \delta$, where $\delta$ represents the bias and $\beta$ denotes the weight vector. Based on the training data, the hyperplane $g(x)$ of the SVM maximizes the margin while minimizing the classification error [33]. The margin is the distance between the closest negative and the closest positive instances; in other words, the hyperplane maximizes the margin distance $2/\|\beta\|_2$. The SVM uses a set of slack variables $\theta_i$, $i = 1, \ldots, S$, and a penalty parameter $\lambda$, and attempts to minimize $\|\beta\|_2^2$ together with the misclassification errors [34]. This is formulated as follows:

$$\min_{\beta,\,\delta,\,\theta}\ \frac{1}{2}\|\beta\|_2^2 + \lambda \sum_{i=1}^{S} \theta_i$$

subject to

$$q_i\left(\beta^{T} p_i + \delta\right) \geq 1 - \theta_i, \qquad \theta_i \geq 0, \qquad i = 1, \ldots, S,$$

where $\theta_i$ is the slack variable that calibrates the degree of misclassification and the Euclidean norm (L2-norm) is the penalty term. A variant of the SVM introduced by Bradley and Mangasarian replaces the L2-norm with the L1 penalty function [35]. The L1-penalized SVM produces sparse solutions and has a feature selection property, owing to its ability to suppress irrelevant or noisy features automatically; hence it can be used for feature selection. The formulation of the L1-penalized SVM is as follows:

$$\min_{\beta,\,\delta,\,\theta}\ \|\beta\|_1 + \lambda \sum_{i=1}^{S} \theta_i \quad \text{subject to} \quad q_i\left(\beta^{T} p_i + \delta\right) \geq 1 - \theta_i, \quad \theta_i \geq 0.$$

From the above formulations, it can be seen that, for different settings of the hyperparameter λ of the L1 SVM, different features will be nulled, and consequently a different subset of features will be produced [36]. The goal is to tune λ so as to produce the subset of features that yields the best hepatitis disease prediction accuracy. This is done using an exhaustive search methodology.
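The L1-penalized linear SVM above can be realized with scikit-learn's `LinearSVC`; a minimal sketch on synthetic stand-in data follows. Note the mapping is an assumption of this sketch: scikit-learn's regularization strength `C` weights the hinge loss, so it plays the role of λ above, with smaller `C` yielding a sparser β.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data; feature 0 carries the signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(155, 19))
y = (X[:, 0] > 0).astype(int)

# L1 penalty requires dual=False; small C drives many coefficients to zero.
l1_svm = LinearSVC(penalty="l1", dual=False, C=0.05, max_iter=10000)
selector = SelectFromModel(l1_svm, threshold=1e-5).fit(X, y)

# Keep only the features whose coefficients survived the L1 penalty.
X_reduced = selector.transform(X)
print(X_reduced.shape[1], "features kept out of", X.shape[1])
```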
After the subset of features is produced, it is applied to the AdaBoost machine learning model, which is used for the classification task.
AdaBoost (also known as the adaptive boosting classifier) is an ensemble learning model. It utilizes a boosting approach to construct a metaclassifier by combining the strengths of base classifiers, i.e., weak estimators. The boosting operation converts the weak estimators into a stronger, boosted model: during boosting, a weighted sum of the base learners is evaluated to produce the final output of the boosted model. This is reflected in the following formulation:

$$F(x) = \operatorname{sign}\left(\sum_{m=1}^{E} \alpha_m B_m(x)\right),$$

where the $m$-th base classifier is denoted by $B_m$, $\alpha_m$ denotes the weight of the $m$-th classifier or estimator, and $E$ denotes the total number of estimators used for constructing the eventual AdaBoost model. To implement the AdaBoost model, we used the scikit-learn Python API [37]. The primary objective of this paper is to investigate and exploit the sparsity in the L1-regularized linear SVM to further improve the strength of the AdaBoost model. To meet this objective, we develop a cascade of the L1 sparse linear SVM and the AdaBoost model. The full feature set is supplied to the L1 SVM, which produces different subsets of features depending on the value of its hyperparameter λ; the performance of each subset is evaluated by applying it to the AdaBoost model. Thus, in the initial stage, we discretize the λ hyperparameter; after discretization, we search for the optimal value of λ that produces the subset of features with the best classification performance. The whole process of the proposed method is shown in Figure 1. As the figure shows, a subset of features is first generated using a specific value of λ. This subset is given to the AdaBoost model, which is trained over values of E, and the subset's performance is evaluated under the optimal E.
Next, another subset of features is generated using another discrete value of λ, and the AdaBoost model is again trained and evaluated under the optimal value of E. The process is repeated until all subsets of features have been evaluated and tested. At the end, the optimal subset of features is selected based on performance.
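The cascade search described above can be sketched as a nested loop over a discretized penalty grid and the number of estimators E. This is an illustrative sketch on synthetic data, with a hypothetical grid of penalty values (scikit-learn's `C` standing in for λ), not the paper's exact search.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a linear signal in features 0 and 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(155, 19))
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

best = (0.0, None, None)                       # (accuracy, C, E)
for C in [0.01, 0.05, 0.1, 0.5, 1.0]:          # discretized penalty values
    # Step 1: L1 SVM nulls redundant features, yielding a feature subset.
    sel = SelectFromModel(
        LinearSVC(penalty="l1", dual=False, C=C, max_iter=10000),
        threshold=1e-5).fit(X_tr, y_tr)
    if sel.transform(X_tr).shape[1] == 0:      # all features nulled: skip
        continue
    # Step 2: tune AdaBoost's number of estimators E on that subset.
    for E in range(1, 21):
        clf = AdaBoostClassifier(n_estimators=E, random_state=0)
        clf.fit(sel.transform(X_tr), y_tr)
        acc = clf.score(sel.transform(X_te), y_te)
        if acc > best[0]:
            best = (acc, C, E)

print("best accuracy %.3f at C=%s, E=%s" % best)
```

In practice the inner evaluation would use the chosen validation scheme rather than a single held-out split.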

Evaluation of the Proposed Method
In the literature, researchers have utilized various metrics for performance evaluation of their proposed methods. For a more realistic evaluation of the performance of our proposed method, we utilized four evaluation metrics: accuracy (ACC), specificity (Spec.), sensitivity (Sen.), and the Matthews correlation coefficient (MCC). Accuracy gives information about the total number of correctly classified subjects (whether healthy or patients). Specificity conveys the percentage of healthy subjects that are classified correctly. Similarly, sensitivity represents the percentage of patients that are classified correctly. MCC measures the overall quality of a binary classification. The formulas for these metrics are as follows:

$$\text{ACC} = \frac{tp + tn}{tp + fp + tn + fn},$$

$$\text{Sen} = \frac{tp}{tp + fn}, \qquad \text{Spec} = \frac{tn}{tn + fp},$$

$$\text{MCC} = \frac{tp \cdot tn - fp \cdot fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}},$$

where $tp$, $tn$, $fp$, and $fn$ denote true positives, true negatives, false positives, and false negatives, respectively.
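These four metrics can be computed directly from confusion-matrix counts. The counts below are illustrative only, chosen to be consistent with a 47-sample test set (9 patients, 38 healthy); they are not taken from the paper's Figure 2.

```python
import math

def metrics(tp, tn, fp, fn):
    """ACC, Sen, Spec, and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)                  # recall on the patient class
    spec = tn / (tn + fp)                 # recall on the healthy class
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, sen, spec, mcc

# Hypothetical counts: 4 of 9 patients and all 38 healthy subjects detected.
acc, sen, spec, mcc = metrics(tp=4, tn=38, fp=0, fn=5)
print(f"ACC={acc:.4f} Sen={sen:.4f} Spec={spec:.4f} MCC={mcc:.4f}")
```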

Results and Discussion
In this section, the experimental setup and the obtained results are analyzed and discussed. All experiments (both the conventional machine learning experiments and the proposed-method experiments) were performed in Python using scikit-learn. The experiments were run on an Intel Core i5 processor with 8 GB RAM and a 64-bit operating system. For comparison, we performed two types of experiments: first, the conventional AdaBoost model was developed for hepatitis disease prediction; second, the proposed hybrid model was developed to predict hepatitis disease based on the filtered set of features.

Simulation of the Conventional AdaBoost Model on Hepatitis Data.
In this experiment, we develop the conventional AdaBoost model for the hepatitis disease data. The model is trained on 70% of the dataset and tested on the remaining 30%. An exhaustive grid search algorithm is used to find the optimized version of the AdaBoost model. The results for both optimal and nonoptimal hyperparameters are given in Table 2. It is evident from the table that the best performance of 82.97% accuracy, 11.11% sensitivity, 100% specificity, and an MCC of 0.302 is obtained at the optimal hyperparameter, i.e., E = 3.
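A grid search over E of the kind described above can be sketched with scikit-learn's `GridSearchCV`; the data and the grid range here are hypothetical stand-ins, not the paper's exact configuration.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the hepatitis data.
rng = np.random.default_rng(0)
X = rng.normal(size=(155, 19))
y = (X[:, 2] > 0.3).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Exhaustive search over the number of estimators E.
grid = GridSearchCV(AdaBoostClassifier(random_state=0),
                    {"n_estimators": list(range(1, 31))}, cv=5)
grid.fit(X_tr, y_tr)

print("optimal E:", grid.best_params_["n_estimators"],
      "| test accuracy: %.3f" % grid.score(X_te, y_te))
```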

Simulation of the Proposed Method Using the Sparse Linear SVM and AdaBoost Model on Hepatitis Data.
In this experiment, the proposed learning system is developed using both models, i.e., the sparse linear SVM and the AdaBoost model. The simulation results are reported in Table 3. As can be seen, different values of λ for the sparse SVM generate feature subsets of different sizes. For subsets of sizes N = 1 to 10, no improvement in performance is observed; from N = 10 onwards, however, the performance of the system changes. It is evident from the table that the best performance of 89.36% is obtained at N = 16, i.e., with a subset of only 16 features, whereas the best performance on the full feature set, i.e., with the conventional AdaBoost model, is 82.97%, as shown in the last row of the table. Hence, coupling the conventional AdaBoost model with the sparse linear SVM improves the performance by 6.39%.
To statistically analyze the results on the testing data, we utilize the confusion matrix. As discussed above, with the 70-30 partition, 108 samples are used for training and 47 for testing, the testing set comprising 9 patient and 38 healthy samples. The predicted results of the proposed L1 SVM-AdaBoost model are depicted in the confusion matrix in Figure 2.
To further show that coupling the sparse linear SVM with the conventional AdaBoost model enhances its performance, we use the area under the ROC curve (AUC). The AUC of the conventional AdaBoost model is 0.587, while that of the proposed method is 0.649. Hence, the ROC charts further validate that coupling with the sparse linear SVM enhances the performance of AdaBoost on the hepatitis disease data.
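The AUC comparison above can be reproduced with scikit-learn's `roc_auc_score`, which ranks continuous classifier scores against the true labels. The labels and scores below are illustrative values, not the paper's model outputs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative true labels (1 = patient) and classifier decision scores.
y_true = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

# AUC = probability a random positive outranks a random negative.
print("AUC = %.3f" % roc_auc_score(y_true, scores))  # prints "AUC = 0.778"
```

For an AdaBoost model, the scores would come from `clf.decision_function(X_test)` or `clf.predict_proba(X_test)[:, 1]` rather than hard class predictions.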

Comparison of the Proposed Method with Other Methods Applied to Hepatitis Data.
The above discussion validates that the learning system proposed in this paper significantly augments the strength of the conventional AdaBoost model. In this section, the effectiveness of the developed learning system is further validated by comparing its performance with some well-known models presented in previous studies. The prediction accuracies and brief details of the models are given in Table 4. It is evident that our proposed method outperforms 23 other machine learning models.
Analyzing Table 4, it can be seen that previous works have exploited various machine learning methods to improve hepatitis disease prediction accuracy. For example, Stern and Dobnikar developed methods based on discriminant analysis (including linear and quadratic discriminant analysis) and achieved a classification accuracy of 85.8% with quadratic discriminant analysis. Similarly, Ozyildirim and Yildirim developed a number of models in search of the optimum model with the best classification accuracy; they obtained the highest classification accuracy of 83.75% using radial basis function (RBF). Moreover, the previous methods in Table 4 analyzed their approaches by considering classification accuracy only, whereas in this paper we analyzed the results of the proposed hybrid method with a number of metrics and demonstrated its robustness using two key metrics, namely classification accuracy and the area under the curve (AUC).

Limitations of the Study.
Although this paper demonstrated the effectiveness of exploiting sparsity in the feature space to improve the performance of machine learning models, the main limitation is the low sensitivity rate, which stems from the low representation of the patient class in the dataset. The hepatitis disease dataset is imbalanced: out of 155 samples, 123 belong to the healthy class and only 32 to the patient class. Recent research has pointed out that machine learning models trained on such imbalanced classes show biased performance against the minority class, i.e., very poor performance on it [40], while being biased towards the majority class, on which they perform very well. In the case of the hepatitis disease dataset, the minority class is the patient class and the majority class is the healthy class. From the results, it can be seen that the majority class has 100% detection accuracy (i.e., 100% specificity), while the minority class has a poor detection accuracy of 44%. Future studies should collect balanced datasets, i.e., with the same representation of both classes; machine learning models trained in such a balanced scenario are expected to show better sensitivity. Moreover, the exhaustive search method for hyperparameter optimization is time-consuming; in the future, the application of metaheuristic algorithms [41,42] should be explored.

Conclusion and Future Work
This work developed an automatic hepatitis disease detection system using machine learning methods. The AdaBoost model was developed for hepatitis disease prediction, and to improve its classification strength, sparsity in the linear SVM model was exploited. The sparse SVM eliminated redundant or irrelevant features and thus improved the prediction accuracy of the AdaBoost model; it also proved helpful in decreasing the time complexity of the AdaBoost model. Moreover, as evident from the simulation results, the proposed method surpassed many previously published methods in terms of hepatitis disease prediction accuracy. Given these quantitative results, it can be safely concluded that the proposed methodology can also be exploited to improve the performance of other machine learning models and can thus help make quality decisions in various other disease detection problems as well.
As discussed above, although the proposed method can be used as a tool to improve the performance of machine learning models, the obtained accuracy still needs a considerable amount of improvement. Thus, in future studies, more robust cascaded models should be developed using deep learning approaches for classification. Additionally, the low sensitivity caused by the low representation of the patient class in the dataset is a limitation of this study that should be treated as an open challenge for future work. Future studies should collect extended hepatitis disease datasets with balanced class distributions.

Data Availability
All the data used in this study are available at the UCI Machine Learning Repository.

Conflicts of Interest
The authors declare that they have no conflicts of interest.