Given the growing prevalence of heart diseases, their efficient diagnosis is of great importance. Statistical inference is the tool that most physicians use for diagnosis, though in many cases it is not powerful enough. Clustering patient instances makes it possible to find groups for which statistical models can be built more efficiently. However, the performance of such an approach depends on the features used as clustering attributes. In this paper, a methodology that combines unsupervised feature selection and grouping to improve the performance of statistical analysis is considered. We assume that the sets of attributes used in the clustering and statistical analysis phases should be different and uncorrelated. Thus, a method that selects reversed correlated features as attributes for cluster analysis is considered. The proposed methodology has been verified by experiments on three real datasets of cardiovascular cases. The results have been evaluated in terms of the number of detected dependencies between parameters. The experimental results showed the advantage of the presented approach over other feature selection methods and over statistical inference without clustering support.
Nowadays, data play a very important role in medical diagnostics since, due to equipment development, an increasing amount of data can be collected and thus a huge volume of information concerning patient characteristics can be acquired. However, the possibilities of using data in medical diagnosis depend on the efficacy of the applied techniques. In practice, medical diagnostics are mainly supported by statistical inference, though in many cases it does not appear effective enough. It is worth emphasising that in medicine, the results of analysis are expected to be implemented in real life, and thus the efficiency and usefulness of the methods should be taken into consideration. To obtain valuable recommendations for diagnostic statements, more sophisticated analytical methods are required. Including data mining algorithms in the process seems to be appropriate. Those techniques were recognized as efficient by Yoo et al. [
Integrating statistical analysis and data mining may not only improve the effectiveness of the obtained results, but also, by finding new dependencies between attributes, enable a multiperspective approach to medical diagnosis.
The research concerning the integration of cluster analysis and statistical methods on medical data, for defining the phenotypes of clinical asthma, has been presented in [
In this paper, we examine combining statistical inference and cluster analysis as a methodology supporting cardiovascular medical diagnosis. Including clustering in the preprocessing phase allows identifying groups of similar instances, for which respective parameters can be evaluated efficiently and thus statistical models of good quality can be created. Such an approach has been proposed in [
In the current research, we introduce a modification to the RCA that concerns the choice of the first attribute. Moreover, we extend the study [
In this paper, we validate the performance of the investigated methodology applied to datasets of real patient records via numerical experiments. We consider three datasets with different proportions between the numbers of instances and attributes. The experimental results are evaluated by statistical inference performed on clusters. The results demonstrate that statistical inference performed on clusters enables the detection of new relationships that have not been discovered in the whole datasets; thus, significant benefits of using the proposed hybrid approach for improving medical diagnosis can be recognized. The proposed feature selection algorithm compares favourably with the other considered techniques: in all the analysed cases, it attained the best results regarding the numbers of discovered dependencies.
The remainder of the paper is organised as follows. In the next section, the cardiovascular disease diagnosis problem is introduced and the whole methodology is described including its overview, the RCA feature selection, and all the considered algorithms. Next, the experiments carried out for the methodology evaluation are presented regarding the dataset characteristics, and the results obtained at all the stages of the proposed method are discussed. The final section presents the study’s conclusions and delineates future research.
The detection and diagnosis of heart diseases are of great importance due to their growing prevalence in the world population. Heart diseases result in severe disabilities and higher mortality than other diseases, including cancer. They cause more than 7 million deaths every year [
Heart diseases include a diverse range of disorders: coronary artery diseases, stroke, heart failure, hypertensive heart disease, rheumatic heart disease, heart arrhythmia, and many others. Therefore, the detection of heart diseases from various factors is a complex issue, and the underlying mechanisms vary, depending on the considered problem and the conditions that affect the heart and the whole cardiovascular system. Moreover, there are many additional socioeconomic, demographic, and gestational factors that affect heart diseases, and are considered as their main reasons [
To improve early detection and diagnosis of heart abnormalities, new factors and dependencies that may indicate cardiovascular disorders are sought. Statistical data analysis supports the evaluation of the characteristics of the parameters in medical datasets and helps in discovering their mutual dependencies. However, in some situations the significance of statistical inference between medical attributes may be obscured by a wide range of values, subsets of relatively dissimilar instances, or outliers. Thus, there is a strong need for new techniques that will support statistical inference in finding parameter dependencies and thereby improve medical diagnosis.
The considered methodology for supporting the process of medical diagnosis by patient dataset analysis consists of three main steps. They are preceded by data preparation, which aims at adjusting original datasets to analysis needs. The proposed steps can be presented as follows:
1. Feature selection, based on statistical analysis of correlation coefficients, which enables appointing the set of attributes for clustering
2. Finding groups of similar characteristics, including a validation technique used to determine the appropriate number of clusters
3. Statistical analysis performed in clusters to find new dependencies between all the considered parameters
The general overview of the method is shown in Figure
Overview of the methodology.
Patient records usually contain many attributes that may be used for supporting medical diagnosis. However, the performance of the diagnostic process may depend on the choice of the attributes in all the phases of the considered methodology. The quality of the results obtained in the final step depends not only on the choice of parameters used for finding correlations but also on the quality of the patient groups, and thus on the subset of attributes used in the clustering process. Therefore, the process of feature selection for cluster analysis is crucial for the whole presented methodology of medical diagnosis.
Since statistical inference is the main supporting tool preferred by physicians, we propose the reversed correlation algorithm (RCA), which uses correlation coefficients but in a reversed order. This means that we look for features that are the least correlated with all their predecessors.
First, we start building a subset of features with the attribute that is the least correlated with the others. Then, correlation coefficients between the chosen feature and the rest of the parameters are calculated. The attribute with the lowest correlation value is chosen as the second feature. The obtained subset of two features is then extended by adding the attribute whose correlation coefficient with the subset is the lowest among the remaining parameters. The process of appending the features with the lowest correlation values is repeated until all the remaining correlation coefficients indicate statistically significant dependencies (the respective values exceed the thresholds) or the number of features in the subset reaches the predetermined percentage of the total number of attributes. The whole procedure is presented in the Algorithm
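The selection loop described above can be illustrated with a short Python sketch. It is not the authors' algorithm listing but a minimal interpretation under stated assumptions: the function names `pearson` and `rca_select`, the default values of `threshold` and `max_fraction`, the use of the summed absolute correlation to pick the first attribute, and the use of the maximum absolute correlation with the already selected features as the "correlation with the subset" are all hypothetical choices made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant column is treated as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def rca_select(columns, threshold=0.3, max_fraction=0.5):
    """Select weakly correlated features.

    columns      -- dict mapping attribute name to a list of numeric values
    threshold    -- hypothetical |r| limit above which a dependency counts
                    as significant
    max_fraction -- hypothetical cap on the share of attributes selected
    """
    names = list(columns)
    # Absolute pairwise correlations between all attributes.
    r = {(a, b): abs(pearson(columns[a], columns[b]))
         for a in names for b in names if a != b}
    max_size = max(1, int(max_fraction * len(names)))

    # First feature: the attribute least correlated with all the others
    # (assumption: measured by the sum of absolute correlations).
    first = min(names, key=lambda a: sum(r[a, b] for b in names if b != a))
    selected = [first]

    while len(selected) < max_size:
        rest = [c for c in names if c not in selected]
        if not rest:
            break
        # Assumption: a candidate's correlation with the subset is its
        # strongest absolute correlation with any selected feature.
        score = {c: max(r[c, s] for s in selected) for c in rest}
        best = min(rest, key=score.get)
        if score[best] > threshold:
            break  # every remaining attribute is significantly correlated
        selected.append(best)
    return selected
```

Calling `rca_select` on a dictionary of attribute columns returns the attribute names in the order in which they were added; both stopping rules from the description (threshold exceeded, size limit reached) are represented.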
In order to compare the results of the proposed feature selection algorithm, two other techniques have been considered: the opposite approach represented by the correlation-based feature selection (CFS) and an extension of the Relief algorithm called ReliefF.
Correlation-based feature selection (CFS) ranks attributes according to a heuristic evaluation function based on correlations [
There exist different variations of CFS that employ different attribute quality measures, such as Symmetrical Uncertainty, normalized symmetrical Minimum Description Length (MDL), or Relief.
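For orientation, the merit heuristic commonly associated with CFS in the literature rewards subsets whose features correlate strongly with the class and weakly with each other. The sketch below computes that merit for a given subset; the function name `cfs_merit` and its argument layout are hypothetical, and whether the variant used in the experiments has exactly this form is an assumption.

```python
import math


def cfs_merit(feature_class_corr, feature_feature_corr):
    """Merit of a feature subset: k * r_cf / sqrt(k + k * (k - 1) * r_ff).

    feature_class_corr   -- correlations between each selected feature
                            and the class attribute (one value per feature)
    feature_feature_corr -- correlations between pairs of selected features
                            (empty when the subset has a single feature)
    """
    k = len(feature_class_corr)
    # Mean absolute feature-class correlation.
    r_cf = sum(abs(v) for v in feature_class_corr) / k
    # Mean absolute feature-feature correlation (0 for a single feature).
    if feature_feature_corr:
        r_ff = sum(abs(v) for v in feature_feature_corr) / len(feature_feature_corr)
    else:
        r_ff = 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)
```

With two features that each correlate 0.5 with the class, the merit is higher when the features are mutually uncorrelated than when they are fully redundant, which is the opposite preference to that of RCA only in its use of class information, since CFS is a supervised method.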
Relief algorithm, described in [
ReliefF algorithm has been proposed in [
Cluster analysis is an unsupervised classification technique, which can be used for grouping complex multidimensional data. In contrast to supervised methods, the profiles of the obtained groups are not obvious, and additional techniques for discovering the meaning of the clusters are required in many cases [
In further investigations, which aim at evaluating the presented technique regarding its efficiency on cardiovascular data, simple popular clustering algorithms will be considered, for such techniques are expected to be comprehensible for physicians.
We will examine two different clustering approaches: deterministic and probabilistic. The first approach will be represented by
The goal of a statistical model is to find the most probable set of clusters on the basis of training data and prior expectations. As a representative of these techniques, EM (expectation-maximization) algorithm, based on the finite Gaussian mixtures model, has been investigated. EM generates probabilistic descriptions of clusters in terms of means and standard deviations [
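To make the probabilistic description concrete, the following sketch runs EM for a mixture of two Gaussians on a single numeric attribute and returns the mixing weights, means, and standard deviations. It is a simplified illustration only: the restriction to one dimension and two components, the initialisation at the minimum and maximum values, the fixed iteration count, and the names `normal_pdf` and `em_two_gaussians` are assumptions, and the cross-validated choice of the number of clusters is not reproduced.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of the normal distribution N(mean, sd^2) at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def em_two_gaussians(values, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    lo, hi = min(values), max(values)
    means = [lo, hi]                 # assumption: start at the extremes
    sd0 = (hi - lo) / 4 or 1.0       # fall back to 1.0 for constant data
    sds = [sd0, sd0]
    weights = [0.5, 0.5]

    for _ in range(iterations):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            p = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(p) or 1e-300  # guard against division by zero
            resp.append([pk / total for pk in p])

        # M-step: re-estimate weights, means, and standard deviations.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # leave an empty component unchanged
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            sds[k] = max(math.sqrt(var), 1e-6)  # keep the density well defined
    return weights, means, sds
```

Each instance can then be assigned to the component with the larger responsibility, which yields the kind of two-group split discussed for medical data.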
The choice of the optimal number of clusters is one of the most important parts of the clustering process. In the case of the
Thus, considering two clustering methods equipped with different techniques for choosing the optimal number of clusters may help in confirming the right choice. However, it is worth noting that in medicine the usual intent is to split the whole dataset into two groups, and thus the number of clusters is very often equal to two [
Before carrying out statistical inference, descriptive statistics should be assessed. Such an approach allows detecting errors that were not identified during the data preparation phase. The main descriptors to be evaluated are central tendency measures (arithmetic mean, median, and mode) as well as dispersion measures (range and standard deviation). Next, an appropriate test is run as a part of the statistical analysis process. The test should be chosen according to the type and the structure of the analysed data, regarding such characteristics as attribute types, the scale type, the number of experimental groups and their dependencies, as well as the test power. Additionally, the selection should be consistent with the requirements of the USMLE (The United States Medical Licensing Examination). In the presented research, the tests usually applied in medical diagnostics are considered [
- Kolmogorov–Smirnov test, which is used to check the normality of distribution of the attributes
- Unpaired two-sample Student’s t-test
- Mann–Whitney U test
- Pearson’s correlation coefficient
The performance of the proposed methodology has been examined by experiments conducted on the real datasets collected for supporting heart disease diagnosis. The statistical analysis results obtained for clusters have been compared with the ones taken for the whole datasets.
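The comparison between the whole dataset and the clusters can be sketched as follows. The example counts attribute pairs whose absolute Pearson correlation reaches a limit, first for all instances and then separately for each cluster. It is an illustration under assumptions: the names `count_strong_pairs` and `compare_whole_vs_clusters` and the `min_abs_r` limit are hypothetical, and the sketch uses the magnitude of the coefficient only, without the significance testing applied in the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant column is treated as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def count_strong_pairs(rows, min_abs_r=0.5):
    """Count attribute pairs with |r| >= min_abs_r.

    rows -- list of dicts mapping attribute name to a numeric value
    """
    names = sorted(rows[0])
    count = 0
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            xa = [row[a] for row in rows]
            xb = [row[b] for row in rows]
            if abs(pearson(xa, xb)) >= min_abs_r:
                count += 1
    return count


def compare_whole_vs_clusters(rows, labels, min_abs_r=0.5):
    """Return the count for the whole dataset and a dict of per-cluster counts."""
    whole = count_strong_pairs(rows, min_abs_r)
    per_cluster = {}
    for label in sorted(set(labels)):
        members = [row for row, lab in zip(rows, labels) if lab == label]
        per_cluster[label] = count_strong_pairs(members, min_abs_r)
    return whole, per_cluster
```

When two groups show opposite trends between the same pair of attributes, the dependency cancels out in the whole dataset but is visible inside each cluster, which is the effect the methodology aims to exploit.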
The experiments were carried out on three datasets:
- “HEART”
- “IUGR”
- “CORONARY”
The “HEART” dataset consists of 30 cases collected to discover dependencies between arterial hypertension and left ventricle systolic functions. The “IUGR” dataset includes 47 instances of children born with intrauterine growth restriction (IUGR), gathered to find dependencies between abnormal blood pressure and being born small for gestational age. The data for both datasets were collected in the Children’s Cardiology and Rheumatology Department of the Second Chair of Paediatrics at the Medical University of Lodz.
Each dataset was characterized by two types of parameters: the main and the supplementary ones, all of them gathered for discovering new dependencies. The attributes correspond to high blood pressure and include echocardiography and blood pressure assessment, prenatal and neonatal history, risk factors for IUGR, and family survey of cardiovascular disease, as well as nutritional status. There were no missing values within the attributes. The full medical explanations of the data are given in [
The “CORONARY” dataset also refers to cardiovascular problems. It comes from the UCI Machine Learning Repository [
The summary of characteristics for all the datasets is presented in Table
- The number of instances in the “HEART” dataset is smaller than the number of parameters
- The number of instances in the “IUGR” dataset is comparable with the number of attributes
- In the “CORONARY” dataset, the number of instances is greater than the number of parameters
The characteristics of datasets.
Dataset | Instances | Main attributes | Supplementary attributes |
---|---|---|---|
HEART | 30 | 14 | 35 |
IUGR | 47 | 6 | 40 |
CORONARY | 303 | 10 | 44 |
Tables
Characteristics of attributes for the “HEART” dataset.
Attribute(s) | Description | Range | Median/mean (mean range) | SD (SD range) |
---|---|---|---|---|
Main attributes | ||||
BMI | Current body mass index | 17.00 to 25.00 | 22.16 | 1.64 |
Birth_weight | Birth weight | 2500 to 4000 | 3158 | 392.00 |
SBP, DBP, ABPM-SBP, ABPM-DBP | Average systolic/diastolic blood pressure taken manually and by ABPM | 61 to 150 | 74.87 to 136.97 | 5.22 to 7.04 |
HR | Heart rate | 44 to 91 | 75.97 | 11.20 |
Risk factors | Risk factors | True/false | — | — |
Supplementary attributes | ||||
IVSd, IVSs, PWDd, PWDs, LVDd, LVDs | Left ventricular dimensions | 5.00 to 56.00 | 8.00 to 46.03 | 1.51 to 9.02 |
EF, SF | Systolic function | 34 to 84 | 40 to 70 | 3 to 5 |
Sm, Sml, V/S/SR long/rad/circ | Tissue Doppler echocardiography parameters | −37 to 40.17 | −27.25 to 29.64 | 0.42 to 6.35 |
Characteristics of attributes for the “IUGR” dataset.
Attribute(s) | Description | Range | Median/mean (mean range) | Mode ( |
---|---|---|---|---|
Main attributes | ||||
Birth_weight | Birth weight | 1980–2850 | 2556.70 | 2700 (7) |
Head_circ | Head circumference | 29–35 | 33 | 32 (16) |
Gest_age | Gestational age | 38–42 | 39 | — |
Apgar | Apgar score at 1 min | 7–10 | 9 | 9 (23) |
5_Percentile | Growth chart factor | True/false | — | False (25) |
Supplementary attributes | ||||
SBP, DBP | Average systolic/diastolic blood pressure | 55–137 | 55–115 | 5.03–8.73 |
SBP load, DBP load | Blood pressure loads | 0–96 | 9–20 | 10–21 |
LVm | Left ventricular mass (Simone, Devreux) | 17.65–93.21 | 30.26–59.11 | 6.91–12.91 |
Risk factors | Risk factors | True/false | — | — |
Characteristics of attributes for the “CORONARY” dataset.
Attribute(s) | Description | Range | Median/mean (mean range) | Mode ( |
---|---|---|---|---|
Main attributes | ||||
Q wave, St elevation, St depression, Tinversion, LVH, poor R progression | ECG parameters | Yes/no | — | — |
FBS | Fasting blood sugar | 62–400 | 119 | 52 |
EF-TTE | Ejection fraction—transthoracic echocardiography | 15–60 | 47 | 9 |
Region RWMA | Regional wall motion abnormalities | 0–4 | 0 (217) | — |
Supplementary attributes | ||||
Age | Age | 30–86 | 58.00 | 10.39 |
Weight | Weight | 48–120 | 73.83 | 11.89 |
Sex | Sex | Male/female | — | Male (176) |
BMI | BMI | 18–41 | 27.25 | 4.10 |
DM, HTN, current smoker, ex-smoker, FH, obesity, CRF, airway disease, thyroid disease, CHF, DLP | Diabetes mellitus, hypertension, current smoker, ex-smoker, family history, obesity, chronic renal failure, cerebrovascular accident, airway disease, thyroid disease, congestive heart failure, dyslipidemia | Yes/no | — | — |
Edema, weak peripheral pulse, lung rales, systolic murmur, diastolic murmur, typical chest pain, dyspnea | Symptom and examination parameters | Yes/no | — | — |
Cr, TG, LDL, HDL, BUN, ESR, HB, K, Na, WBC, lymph, neut, PLT | Laboratory parameters (creatinine, triglyceride, low density lipoprotein, high density lipoprotein, blood urea nitrogen, erythrocyte sedimentation rate, haemoglobin, potassium, sodium, white blood cell, lymphocyte, neutrophil, platelet) | 0.5–18,000 | 1.05–7652.04 | 0.24–2413.74 |
For each dataset, only parameters concerning main characteristics were considered as initial attributes used for grouping. The selection of the appropriate features for building clusters has been performed by using three different techniques:
- The reversed correlation algorithm (RCA)
- CFS method
- ReliefF algorithm
The parameters necessary to run the RCA algorithm were chosen according to principles commonly approved in statistics (see [
In the case of the ReliefF algorithm, the threshold for the number of attributes included in the subset of selected features was set to
The subsets of features presented in Table
Feature selection results.
Dataset | FS algorithm | Size | Supplementary attributes |
---|---|---|---|
HEART | RCA | 6 | Physical_activity, fundus, BMI, HR, height, birth_weight |
HEART | CFS | 1 | Weight |
HEART | ReliefF | 6 | Physical_activity, family_interview, weight, fundus, height, BMI |
IUGR | RCA | 3 | Apgar_score, ponderal_index, 5_percentile |
IUGR | CFS | 1 | Birth_weight |
IUGR | ReliefF | 3 | Head_circ, ponderal_index, birth_weight |
CORONARY | RCA | 4 | FBS, EF-TTE, St depression, LVH |
CORONARY | CFS | 5 | Q wave, Tinversion, FBS, EF-TTE, region RWMA |
CORONARY | ReliefF | 5 | Region RWMA, Tinversion, St depression, St elevation, Q wave |
In the next step of the experiments, the clusters for diagnosed patients were created by using two clustering algorithms:
Clusters were built regarding the main characteristics and the parameters indicated by feature selection methods, namely RCA, CFS, and ReliefF.
In the case of the EM algorithm, the best number of clusters was indicated by using cross-validation. To choose the best number of clusters for
Validation of clustering for the HEART dataset.
Validation of clustering for the IUGR dataset.
Validation of clustering for the CORONARY dataset.
The results of clustering are presented in Table
Clustering results.
Dataset (1) | FS algorithm (2) | Cluster algorithm (3) | No of clusters (4) | Clustering schema: cluster sizes (5) |
---|---|---|---|---|
HEART | Main attributes | EM | 2 | 7, 23 |
HEART | Main attributes | k-means | 2 | 8, 22 |
HEART | RCA | EM | 2 | 6, 24 |
HEART | RCA | k-means | 2 | 6, 24 |
HEART | CFS | EM | 1 | 30 |
HEART | CFS | k-means | 2 | 11, 19 |
HEART | ReliefF | EM | 4 | 6, 4, 15, 3 |
HEART | ReliefF | EM | 2 | 21, 9 |
HEART | ReliefF | k-means | 2 | 6, 24 |
IUGR | Main attributes | EM | 2 | 22, 25 |
IUGR | Main attributes | k-means | 2 | 22, 25 |
IUGR | RCA | EM | 2 | 25, 22 |
IUGR | RCA | k-means | 2 | 25, 22 |
IUGR | CFS | EM | 2 | 12, 35 |
IUGR | CFS | k-means | 2 | 16, 31 |
IUGR | ReliefF | EM | 4 | 7, 12, 18, 10 |
IUGR | ReliefF | k-means | 3 | 13, 14, 20 |
CORONARY | Main attributes | EM | 4 | 22, 49, 1, 231 |
CORONARY | Main attributes | k-means | 4 | 148, 50, 71, 34 |
CORONARY | RCA | EM | 2 | 71, 232 |
CORONARY | RCA | k-means | 2 | 232, 71 |
CORONARY | CFS | EM | 3 | 101, 17, 185 |
CORONARY | CFS | k-means | 2 | 213, 90 |
CORONARY | ReliefF | EM | 3 | 89, 23, 191 |
CORONARY | ReliefF | k-means | 2 | 213, 90 |
Correlation values obtained for the clusters were compared with the ones obtained for the whole group of diagnosed patients for each of the feature selection techniques. The comparison confirmed the effectiveness of the proposed methodology. For each dataset, we obtained a greater number of statistically significant correlations in clusters, which may lead to improved medical diagnosis in the future. By significant correlations we mean values with correlation coefficient
Numbers of statistically significant correlations detected in the whole datasets and in clusters.
Dataset | Whole dataset | Main features (EM) | Main features (k-means) | RCA (EM) | RCA (k-means) | CFS (EM) | CFS (k-means) | ReliefF (EM) | ReliefF (k-means) |
---|---|---|---|---|---|---|---|---|---|
HEART | 14 | 29 | 30 | 28 | 28 | 14 | 28 | 28 | 28 |
IUGR | 11 | 15 | 15 | 16 | 16 | 11 | 11 | 16 | 15 |
CORONARY | 14 | 15 | 20 | 16 | 16 | 15 | 16 | 16 | 16 |
One can easily notice that the results attained by the unsupervised RCA feature selection technique and supervised ReliefF algorithm were comparable; however, the first method outperforms the second one in the case of the IUGR dataset and
The process of computer-aided medical studies is usually based on only one of the data analysis methods, most often a statistical approach. In this paper, we present an approach that integrates a feature selection technique and clustering with statistical inference, to improve medical diagnosis by finding out new dependencies between parameters. We consider using the new feature selection technique based on reversed correlations (RCA), combining it with two clustering algorithms: EM and
The experiments have shown that the proposed hybrid approach provides significant benefits. Statistical inference performed in clusters enabled the detection of new relationships that had not been discovered in the whole datasets, regardless of the applied feature selection algorithm and clustering technique. Moreover, the proposed RCA technique attained results at least as good as the other considered feature selection methods, but, as opposed to CFS and ReliefF, it is an unsupervised approach, which implies a more flexible application. It is also worth emphasising that the presented approach has been checked using datasets with different proportions between the number of instances and the number of attributes. The experimental results have shown that the proposed methodology performs well on datasets with a small number of instances; moreover, the largest growth in the number of correlations concerns the dataset where the number of instances is smaller than the number of attributes. Such situations occur very often in patient datasets.
Future research will focus on further investigations that aim at improving medical diagnostics by using hybrid approaches combining data mining and statistical inference. First, more datasets with different proportions between the number of instances and the number of attributes should be examined. The research area should also be broadened to the diagnostics of other types of diseases. Further research should also identify combinations of feature selection and clustering algorithms that perform well together with statistical inference.
The dataset “CORONARY” that supports the findings of this study is openly available at the UCI Machine Learning Repository at
The authors declare that there is no conflict of interest regarding the publication of this paper.
The authors received funding from the Institute of Information Technology, Lodz University of Technology.