Development of Health Parameter Model for Risk Prediction of CVD Using SVM

Current methods of cardiovascular risk assessment use health factors that are often based on the Framingham study. However, these methods have significant limitations due to their poor sensitivity and specificity. We compared the parameters of the Framingham equation with a linear model fitted by regression analysis to establish the effect of training the model on a local database. A support vector machine (SVM) was used to determine the effectiveness of a machine learning approach that uses the Framingham health parameters for risk assessment of cardiovascular disease (CVD). The results show that while the linear model trained on the local database was an improvement on the Framingham model, the SVM-based risk assessment model had higher sensitivity and specificity in predicting CVD. This indicates that, using the health parameters identified by the Framingham study, a machine learning approach overcomes the low sensitivity and specificity of the Framingham model.


Introduction
Cardiovascular disease (CVD) is the single biggest cause of mortality worldwide [1]. Globally, an estimated 17.5 million deaths were attributable to CVD in 2005 [2,3]. Early identification of persons at higher risk of CVD allows timely implementation of preventative strategies against cardiac episodes that lead to death or disability [3,4]. For this purpose, risk factors for CVD [4] such as cholesterol, hypertension, and diabetes have been identified, and various risk assessment models and techniques have been developed [5].
The commonly used risk assessment models for CVD prediction are the Framingham Risk Score [1], Reynolds Risk Score [6], QRISK [7], Prospective Cardiovascular Munster Heart Study (PROCAM) [8], the Systematic COronary Risk Evaluation (SCORE) system [9], and UKPDS [10]. Many of these have been adapted in primary care as simplified charts, tables, computer programs, and web-based tools which are routinely referred to in policy documents and guidelines.
The accuracy of the Framingham Risk Score is superior to that of any single risk factor. However, its predictive power leaves room for improvement because its sensitivity and specificity are not very high [11][12][13][14]. It has been observed that the overall absolute coronary risk assigned to individuals in the United Kingdom has been significantly overestimated [11]. This highlights the need to refine the prediction models.
There can be a number of reasons for the poor prediction of CVD risk when using the Framingham equation. This research studied two possible causes of the equation's poor sensitivity and specificity: differences in demographics and the linearity assumption of the model. The first possible cause is the difference between the demographics of the population being studied and those of the Framingham population [4], which was used to develop the Framingham equation. The issue of demographics was showcased in a 2015 study which reported that the magnitude of the effect of different cardiovascular risk factors upon a patient is highly dependent on their ethnicity [15]. If this is the reason, the equation parameters (model) would have to be redefined for different demographics. Redefining the parameters may not always be possible because the modelling requires a large amount of longitudinal data, which may not be available outside major hospital centres. However, identifying the cause of differences between groups can lead to better understanding and also provide the rationale for developing new databases.
Another reason for poor performance may be the type of model. The Framingham equation and other similar techniques are generalized linear equations. However, the relationship between the multiple factors associated with the health of a large number of people may require a more complex representation that is not suited to linear approximation. To overcome this problem, the model must be redeveloped without the constraint of linearity.
This work tested whether population difference or model type is the cause of the poor outcomes of the Framingham model. This was done by developing risk assessment models on a longitudinal population database and comparing a population-specific linear equation with a machine learning method. The commonly accepted health parameters described by the Framingham model were used, and the scope of this study was to compare a machine learning technique, linear regression, and direct use of the Framingham model in associating these parameters with disease. The linear equation was used to test whether customisation, using coefficients obtained from the local database, would improve the results. To determine whether the parameters used by the Framingham model are relevant to a different database, this study measured the sensitivity and specificity obtained using a support vector machine (SVM). While machine learning is generally expected to provide improved results, this study specifically tested whether the parameters used in the Framingham model are relevant to a different database.

Database.
To ensure that the study had adequate power, a large longitudinal database was required. Such population databases provide the natural proportions of cases and controls found in the real world. A longitudinal database also provides the population baseline for demographics and ethnicity.
In this study, the Blue Mountains Eye Study (BMES) [16] database was used. This database was created from a population-based cohort study which recorded eye and other health outcomes in an urban Australian population older than 49 years. The majority of this population (∼99%) was of European descent. Baseline participants (n = 3654) represented 82.4% of those eligible in the selected postcode areas.
The study population was followed up at 5-year intervals, and the latest follow-up examination was conducted 15 years after the baseline examination. The study was approved by the Western Sydney Area Health Service Human Research Ethics Committee. Written informed consent was obtained from all participants prior to recording their data.
The 5- and 10-year follow-up data were used in this study. The database contained the health and other parameters identified by the Framingham study [1]: gender, smoking status, cholesterol (combined and high-density), systolic and diastolic blood pressure, body mass index, diabetes, and hypertension. These are described in detail in Table 1.
People who had CVD episodes before the baseline examination or who died during the follow-up period due to a noncardiovascular aetiology were excluded from the study. After these exclusions, the study comprised 2770 subjects. After a further 364 patients were excluded due to missing data, the remaining database of 2406 people had 1450 females and 956 males. The CVD cases were divided into two groups: hard and soft. Incident "hard CVD" included myocardial infarction, stroke, bypass surgery for coronary artery disease (CAD), or death from CAD. Self-reported angina was categorized as a "soft CVD" incident outcome. The mortality data were obtained by linkage with the Australian National Death Index (NDI), and all nonexact matches were manually analyzed and accepted only if the mismatch was in a single noncritical characteristic. In this set, 535 subjects (267 women and 268 men) had incident CVD (hard and soft) events in a period greater than 5 years but less than 10 years, as shown in Table 1.

Data Management.
The data were randomly divided into two non-overlapping subsets, training data and test data, using Scikit [21]. The training data consisted of 1896 samples (approximately 80% of the total), and the remaining 510 samples (approximately 20%) were used for testing. This data is available online in accordance with privacy regulations.
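The random split described above can be illustrated with a minimal sketch. It uses only the Python standard library rather than the Scikit call used in the study, which is not given in the text; the function name `split_indices` and the seed are assumptions made for illustration.

```python
import random

def split_indices(n_total, n_test, seed=0):
    """Shuffle subject indices and split them into disjoint train/test lists."""
    indices = list(range(n_total))
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    rng.shuffle(indices)
    # The first n_test shuffled indices form the test set, the rest the training set.
    return indices[n_test:], indices[:n_test]

# Sizes taken from the text: 2406 subjects, 510 of them held out for testing.
train_idx, test_idx = split_indices(n_total=2406, n_test=510)
print(len(train_idx), len(test_idx))
```

Because the two lists are slices of one shuffled sequence, no subject can appear in both subsets.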
Pattern recognition and risk prediction techniques applied to population health data may suffer when these datasets are highly imbalanced. To overcome this imbalance, Synthetic Minority Oversampling Technique (SMOTE) [22] was used to boost the minority class (CVD case) numbers by 400% in the training data by artificially generating samples using a nearest neighbour approach [23].
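The oversampling idea can be sketched as follows. This is a simplified illustration of SMOTE-style interpolation, not the implementation used in the study. It assumes that a 400% boost means four synthetic samples per original minority sample, and the function name `smote`, the neighbour count `k`, and the toy feature values are all hypothetical.

```python
import math
import random

def smote(minority, n_new_per_sample=4, k=5, seed=0):
    """Create synthetic minority samples by interpolating towards near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest minority neighbours of x by Euclidean distance, excluding x.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(n_new_per_sample):
            nb = rng.choice(neighbours)
            gap = rng.random()  # random position on the segment from x to nb
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic

# Toy minority class with two made-up features per subject (e.g. age, BMI).
cases = [[60.0, 27.5], [72.0, 31.0], [66.0, 24.0], [58.0, 29.0]]
new_cases = smote(cases, n_new_per_sample=4, k=2)
print(len(new_cases))  # four synthetic samples per original case
```

Each synthetic point lies on a line segment between two real minority samples, so the new samples stay inside the region already occupied by the CVD cases.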

Framingham Risk Equation.
The Framingham model provides a gender-specific model for various cardiovascular outcomes and is the basis for estimating cardiovascular risk profiles and for a number of major public health policies [24]. We used the 10-year general cardiovascular risk prediction Framingham equation (FEq) for our analysis [1], with the regression coefficients and hazard ratios shown in Table 2.
The outcome of the equation is the risk of CVD over the following 10 years. It was applied to the data on each subject in the test database (described in Data Management) and a risk percentage was obtained. These predictions were compared with the known CVD episodes from the records. To interpret the risk percentage against the recorded CVD episodes, weighted statistical analysis was performed on the training data to find the risk threshold that optimally separates cases from controls.
For the training data, this threshold was found to be 22.3%, and it was used on the test data to separate cases from controls. According to the parameters in FEq, samples above the age of 79 were "not classifiable."
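The decision rule implied by this threshold and age limit can be sketched as below. The function name `classify_feq` is hypothetical, the example subjects are made up, and treating a risk exactly equal to the threshold as a case is an assumption, since the text does not say how ties were handled.

```python
RISK_THRESHOLD = 0.223  # 22.3% threshold reported for the training data
MAX_AGE = 79            # FEq does not cover subjects older than this

def classify_feq(age, risk):
    """Label a subject from the 10-year FEq risk (a fraction between 0 and 1)."""
    if age > MAX_AGE:
        return "not classifiable"
    # Assumption: a risk at or above the threshold is labelled a case.
    return "case" if risk >= RISK_THRESHOLD else "control"

# Made-up subjects: (age in years, FEq risk).
for age, risk in [(65, 0.30), (65, 0.10), (82, 0.30)]:
    print(age, risk, classify_feq(age, risk))
```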

Logistic Regression Analysis (LRA).
LRA develops a linear equation to best model a database with multiple features and two outcomes. The regression is performed to maximize the separation between the two outcomes. Consider that there are $n$ samples in the database that belong to two classes, and there are $m$ features (predictors). With the two classes, (i) CVD and (ii) no-CVD, logistic regression using the probability function was used to determine the relationship between the predictors and the outcome. This was based on the conditional probability described in the equation below:

$$P(\mathrm{CVD} \mid \mathbf{x}) = \frac{1}{1 + \exp\left(-\left(\beta_0 + \sum_{i=1}^{m} \beta_i x_i\right)\right)}$$

In this equation, the probability of CVD given the predictor vector $\mathbf{x}$ is obtained by considering each predictor, $x_i$, and $\beta_i$ is the regression coefficient which indicates the relevance of the predictor, that is, the contribution of the predictor to the outcome class. LRA was trained to obtain the parameters of each feature using the training section and tested using the test section of the data (as described in Data Management). The default value $P(\mathrm{CVD} \mid \mathbf{x}) > 0.5$ was used for classification. The prediction was performed on the test data ($n$ = 510 subjects) and compared with prior knowledge of the CVD episodes. LRA overcomes two weaknesses of the Framingham equation: the age limit of 79 years and the predefined coefficients.
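As an illustration of how a fitted logistic model is applied, the sketch below evaluates the standard logistic form, probability = 1 / (1 + exp(-z)) with z an intercept plus a weighted sum of the predictors, and classifies at the 0.5 cutoff. The coefficient values and the two-feature subject are made up for the example and are not the coefficients fitted on the BMES data.

```python
import math

def cvd_probability(x, beta0, beta):
    """Logistic model: P(CVD | x) = 1 / (1 + exp(-(beta0 + sum(beta_i * x_i))))."""
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, beta0, beta, cutoff=0.5):
    """Label a subject as CVD when the modelled probability exceeds the cutoff."""
    return "CVD" if cvd_probability(x, beta0, beta) > cutoff else "no-CVD"

# Hypothetical coefficients for two predictors: age (years) and current smoker (0/1).
beta0, beta = -5.0, [0.05, 0.5]
print(predict([70, 1], beta0, beta))   # z = -1.0, probability below 0.5
print(predict([100, 1], beta0, beta))  # z = 0.5, probability above 0.5
```

A positive coefficient raises the modelled probability as its predictor increases, which is why the magnitude of each coefficient can be read as the relevance of that predictor.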

Support Vector Machine (SVM).
SVM is a set of related supervised learning methods that are used for prediction and regression analysis, with applications in fields such as clinical and population-based data [25], text classification, bioinformatics, handwriting recognition, and image analysis. As a first step, the SVM was trained using the training subset (refer to Data Management) as the input, and the target output was the known history of CVD episodes (as defined earlier) during the 5 to 10 years after time zero. The parameters for the SVM, namely the kernel, $C$, and $\gamma$, were identified using the grid search method reported by Bergstra and Bengio [26]. This method [26] exhaustively generates candidate values from a grid of specified values for the two parameters. All possible combinations of parameter values were fitted on the dataset and evaluated with an output score. Based on the score, the following parameter values were used in this study: (i) Radial Basis Function (RBF) kernel, (ii) $C$ = 100, (iii) $\gamma$ = 0.01.
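The exhaustive search over parameter combinations can be sketched as follows, assuming the two tuned values are the usual RBF-SVM penalty C and kernel width gamma. The grid values are hypothetical because the grid used in the study is not given in the text, and `toy_score` is only a placeholder for the score that would come from fitting and validating an SVM.

```python
import itertools
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel value: exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def grid_search(param_grid, score_fn):
    """Score every combination of grid values and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    # Placeholder only: constructed to peak at C = 100, gamma = 0.01 so the
    # sketch runs without data. It says nothing about the study's real scores.
    return -((math.log10(params["C"]) - 2) ** 2
             + (math.log10(params["gamma"]) + 2) ** 2)

# Hypothetical grid; the values searched in the study are not stated.
grid = {"C": [1, 10, 100, 1000], "gamma": [0.001, 0.01, 0.1]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params)
```

In practice the score function would train an SVM with the given parameters and return a validation metric, so the cost of the search grows with the product of the grid sizes.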
This SVM model was used to rank the parameters in terms of their relevance based on the weights obtained during training (Table 4) [20]. The trained SVM was tested on the test subset (510 samples). This strategy ensured that the test data was independent of the training data. Diagnostic odds ratios were calculated [17] to compare its performance with that of the Framingham model and logistic regression analysis. Table 3 shows the relevance of the features as obtained from the ranking of the logistic regression coefficients for the BMES dataset, while Table 4 reports the ranking of these features based on SVM weights. Comparing the results from Tables 2-4, it is observed that the three most relevant factors (features) are the same for the three methods [1]: age, BMI, and current smoker.

Results
A confusion matrix shows the extent of the mislabelling performed by the prediction algorithm. Tables 5-7 show the confusion matrices for FEq, LRA, and SVM, respectively. Each row represents the instances in a predicted class, while each column represents the instances in an actual class. From these results, it is observed that, of a total of 104 CVD cases, 40 were correctly predicted using FEq, 50 using LRA, and 71 using SVM.
The confusion matrices also show that the number of false positives was 108 using FEq, 68 using LRA, and 57 using SVM. The number of cases falsely identified as controls was 37 by FEq, 54 by LRA, and 33 by SVM. However, while SVM and LRA classified all the test samples (104 cases and 406 controls), 27 cases and 46 controls were unclassifiable by FEq because these people were older than 79 years. This is a major limitation of FEq, especially in an ageing population in which a significant proportion of people are older than 79 years.
The sensitivity and specificity obtained from SVM analysis, logistic regression, and FEq are shown in Table 8, which also lists the 95% confidence interval (CI) of each value. The sensitivity obtained from FEq was 0.52 (95% CI: 0.4096 to 0.6275), from LRA was 0.48 (95% CI: 0.3817 to 0.5809), and from SVM was 0.682 (95% CI: 0.589 to 0.764). This shows that the sensitivities of FEq and LRA are comparable, while that of SVM is higher, and SVM thus provides better risk assessment. This is confirmed by the ROC curves shown in Figure 1 and by the area under the ROC curve (AUC), which is highest for SVM (Table 8).
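The sensitivity and specificity values can be recomputed from the confusion-matrix counts reported above. The sketch below assumes that the true negatives equal the classified controls minus the false positives, and that the 27 cases and 46 controls unclassifiable by FEq are left out of its totals; the function name is hypothetical.

```python
def sensitivity_specificity(tp, fn, fp, tn):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Counts from the text. True negatives are derived (an assumption) as the
# classified controls minus the false positives; FEq classified 406 - 46 controls.
counts = {
    "FEq": dict(tp=40, fn=37, fp=108, tn=406 - 46 - 108),
    "LRA": dict(tp=50, fn=54, fp=68, tn=406 - 68),
    "SVM": dict(tp=71, fn=33, fp=57, tn=406 - 57),
}

results = {name: sensitivity_specificity(**c) for name, c in counts.items()}
for name, (sens, spec) in results.items():
    print(f"{name}: sensitivity={sens:.3f}, specificity={spec:.3f}")
```

Under these assumptions the computed values agree with those reported for the three methods to within about 0.001.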
From Table 8, it is observed that the specificity of the SVM classifier (0.859) was the highest when compared with FEq (0.70) and LRA (0.832). It is also observed that the diagnostic odds ratio was highest for SVM.
Computational and Mathematical Methods in Medicine
The statistical significance of the differences in prediction performance was tested by comparing the AUC measured from the ROC curves for SVM, LRA, and FEq [18,19]. When comparing the SVM technique with LRA and FEq, there were significant differences between SVM and FEq (p < 0.0002) and between SVM and LRA (p < 0.02).

Discussion
These findings show that a large number of cases and controls are unclassifiable when using the Framingham equation (FEq) because of the age constraint of the equation; in this database, 27 cases, corresponding to ∼26% of all cases, were not classifiable. This is a major weakness because, in our ageing society, a significant proportion of the population is older than 79 years. The results show that FEq identified only 40 of the 104 cases correctly. LRA classified all samples and identified 50 of the 104 cases correctly, while SVM identified 71 cases correctly. This shows that while LRA overcame some of the limitations of FEq, this was not sufficient, and its labelling of the outcome lacked sensitivity and specificity.
The results also showed that there were a large number of false positives with FEq: 108 of the 406 controls, or approximately 27%, were misclassified as cases. This number reduced to 68 (∼17%) when LRA was used and 57 (∼14%) when SVM was used. The diagnostic odds ratio is 2.52 for FEq, 3.05 for LRA, and 13.17 for SVM. SVM gave the highest number of correct predictions and the lowest numbers of false positives and false negatives, and it classified all the samples.
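The diagnostic odds ratio (DOR) is the ratio of the odds of a positive prediction among cases to the odds of a positive prediction among controls, DOR = (TP × TN) / (FP × FN). A minimal sketch for FEq and SVM follows, using the counts reported in Results and assuming that true negatives equal the classified controls minus the false positives; the function name is hypothetical.

```python
def diagnostic_odds_ratio(tp, fn, fp, tn):
    """DOR = (TP * TN) / (FP * FN)."""
    return (tp * tn) / (fp * fn)

# Counts from Results; FEq classified 406 - 46 = 360 controls, SVM all 406.
dor_feq = diagnostic_odds_ratio(tp=40, fn=37, fp=108, tn=360 - 108)
dor_svm = diagnostic_odds_ratio(tp=71, fn=33, fp=57, tn=406 - 57)
print(f"FEq: {dor_feq:.2f}")  # 2.52
print(f"SVM: {dor_svm:.2f}")  # 13.17
```

A DOR of 1 means the prediction carries no information about the outcome, so larger values indicate better discrimination between cases and controls.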
This study has shown that the machine learning approach gave a significantly better AUC. The study also demonstrated that the health parameters identified by the Framingham model are relevant for other populations, such as that of the Blue Mountains in Australia, provided that the weaknesses of the earlier model are overcome using a machine learning approach. It should be noted that SVM is only one example of a machine learning classifier and was selected to demonstrate the effectiveness of machine learning based classification of health parameters.

Conclusions
This study has compared the linear model and SVM approaches to classify the health features that are used by Framingham equation. To ensure that there is no bias due to differences in the database, all the analyses were performed on one database, BMES, which is a population based database that is well regarded for quality, duration, and size [14].
LRA and FEq are both based on a linearity assumption. However, the FEq parameters were determined historically using the Framingham database, while LRA was trained on the local database and classifies all subjects irrespective of age. This would explain why LRA had improved true positive prediction of CVD (50 compared with 40), although there was also an increase in the false negatives (54 compared with 37 for FEq). Overall, the SVM performed significantly better. This may be attributed to SVM not being restricted by linearity, which allows for nonlinear separation between the case and control classes. It may also be due to the database being local. In conclusion, we propose that using an SVM with a local database may provide improved risk assessment. However, this needs to be tested on more databases and with more health parameters.
It is also important to note that this work used only the health parameters that were identified in the Framingham study, whereas it is now established that there are a number of other relevant parameters that need to be considered. Thus, it is essential that new databases with all the health parameters be developed and classified using SVM.
Support vector machines and other similar machine learning approaches are very useful in providing the flexibility that is lacking in linear models. However, such an approach has the shortcoming of being a black box, and it is essential that the training data be balanced and representative of the complete database. Data points that appear as outliers pose a further difficulty: they are often hard to control, and erroneous training can lead to incorrect outcomes. Thus, it is essential for the test results to be monitored by experts. It is also important for the software to identify outliers automatically, which would trigger supervised assessment.