Integrating Correlation-Based Feature Selection and Clustering for Improved Cardiovascular Disease Diagnosis

Based on the growing problem of heart diseases, their efficient diagnosis is of great importance to the modern world. Statistical inference is the tool that most physicians use for diagnosis, though in many cases it does not appear powerful enough. Clustering of patient instances allows finding out groups for which statistical models can be built more efficiently. However, the performance of such an approach depends on the features used as clustering attributes. In this paper, the methodology that consists of combining unsupervised feature selection and grouping to improve the performance of statistical analysis is considered. We assume that the set of attributes used in clustering and statistical analysis phases should be different and not correlated. Thus, the method consisting of selecting reversed correlated features as attributes of cluster analysis is considered. The proposed methodology has been verified by experiments done on three real datasets of cardiovascular cases. The obtained effects have been evaluated regarding the number of detected dependencies between parameters. Experiment results showed the advantage of the presented approach compared to other feature selection methods and without using clustering to support statistical inference.


Introduction
Nowadays, data play a very important role in medical diagnostics since, due to equipment development, an increasing amount of data can be collected and thus a huge volume of information concerning patient characteristics can be acquired. However, the possibilities of using data in medical diagnosis depend on the efficacy of the applied techniques. In practice, medical diagnostics are mainly supported by statistical inference, though in many cases it does not appear effective enough. It is worth emphasising that in medicine, the results of analysis are expected to be implemented in real life and thus the efficiency and usefulness of the methods should be taken into consideration. To obtain valuable recommendations for diagnostic statements, more sophisticated analytical methods are required. Including data mining algorithms in the process seems to be appropriate. Those techniques were recognized as efficient by Yoo et al. [1], who indicated that the application of descriptive and predictive methods is useful in biomedical as well as healthcare areas. In addition, stand-alone statistical analysis cannot be supportive in many cases, especially when correlations between attributes, considered as important by physicians, cannot be found. Such a situation usually takes place for datasets of great standard deviation values [2]. What is more, dissimilarities or inconsistencies within the datasets can appear due to incorrect measurements or distortions. The presence of these kinds of deviations may lead to the rejection of a true hypothesis; for example, such a situation takes place when datasets are of small sizes. In these cases, supporting medical diagnosis becomes a complicated task, particularly when the number of attributes exceeds the number of records.
Integrating statistical analysis and data mining may not only improve the effectiveness of the obtained results, but also, by finding new dependencies between attributes, enable a multiperspective approach to medical diagnosis.
The research concerning the integration of cluster analysis and statistical methods on medical data, for defining the phenotypes of clinical asthma, has been presented in [3].
The research was proposed against other models of asthma classification and, according to the authors, it might have played a supporting role for different phenotypes of a heterogeneous asthma population. Data mining methods have been used in several clinical data systems. A survey of these systems and the applied techniques has been presented in [4]. Data mining techniques have also been considered in different clinical decision support systems for heart disease prediction and diagnosis in [2]. However, in the investigation results, the authors stated that the examined techniques are not satisfactory enough. Moreover, a solution for the identification of treatment options for patients with heart diseases is still lacking. Statistical inference of heart rate and blood pressure was investigated in [5]. The authors examined the correlation between raw data, then they examined the correlation between filtered data, and finally they applied the least squares approximation. In all the cases, the obtained correlation coefficients seemed to be unpredictable random numbers.
In this paper, we examine combining statistical inference and cluster analysis as a methodology supporting cardiovascular medical diagnosis. Including clustering in the preprocessing phase allows identifying groups of similar instances, for which respective parameters can be evaluated efficiently and thus statistical models of good quality can be created. Such an approach has been proposed in [6] to improve the performance of statistical models in hypertension problems in cardiovascular diagnosis. In the paper [7], a new reversed correlation algorithm (RCA) of an automatic unsupervised feature selection complemented the methodology. The RCA algorithm consisted of choosing subsequent features as the least correlated with their predecessors.
In the current research, we introduce a modification to the RCA that concerns the choice of the first attribute. Moreover, we extend the study [7] by comparing the performance of the considered algorithm with two other feature selection methods: correlation-based CFS and ReliefF. We also examine the effectiveness of the presented methodology regarding not only the statistical approach, but also the deterministic clustering algorithm with the elbow criterion for determining the best number of clusters. Additionally, during the experiments we broaden the range of patients involved by changing the considered datasets. In the current research, instead of one of the three datasets gathered from children [7], we use a reference "CORONARY" dataset with a higher number of patient records. The dataset was derived from the UCI repository [8].
In this paper, we validate the performance of the investigated methodology applied to datasets of real patient records via numerical experiments. We consider three datasets of different proportions between the numbers of instances and attributes. The experimental results are evaluated by statistical inference performed on clusters. The results demonstrate that the statistical inference performed on clusters enables detection of new relationships, which have not been discovered in the whole datasets; thus, significant benefits of using the proposed hybrid approach for improving medical diagnosis can be recognized. The proposed feature selection algorithm outperforms the other considered techniques, as in all the analysed cases we attained the best results regarding the numbers of discovered dependencies.
The remainder of the paper is organised as follows. In the next section, the cardiovascular disease diagnosis problem is introduced and the whole methodology is described including its overview, the RCA feature selection, and all the considered algorithms. Next, the experiments carried out for the methodology evaluation are presented regarding the dataset characteristics, and the results obtained at all the stages of the proposed method are discussed. The final section presents the study's conclusions and delineates future research.

Heart Disease Diagnosis
Problem. The detection and diagnosis of heart diseases are of great importance due to their growing prevalence in the world population. Heart diseases result in severe disabilities and higher mortality than other diseases, including cancer. They cause more than 7 million deaths every year [9,10].
Heart diseases include a diverse range of disorders: coronary artery diseases, stroke, heart failure, hypertensive heart disease, rheumatic heart disease, heart arrhythmia, and many others. Therefore, the detection of heart diseases from various factors is a complex issue, and the underlying mechanisms vary, depending on the considered problem and the conditions that affect the heart and the whole cardiovascular system. Moreover, there are many additional socioeconomic, demographic, and gestational factors that affect heart diseases, and are considered as their main reasons [11][12][13].
To improve early detection and diagnosis of heart abnormalities, new factors and dependencies that may indicate cardiovascular disorders are sought. Statistical data analysis supports the evaluation of the characteristics of the parameters in medical datasets and helps in discovering their mutual dependencies. However, in some situations the significance of statistical inference between medical attributes may be impaired by a wide range of values, subsets of relatively dissimilar instances, or outliers. Thus, there is a strong need for new techniques that will support statistical inference in finding parameter dependencies and thereby improve medical diagnosis.
2.2. The Method Overview. The considered methodology for supporting the process of medical diagnosis by patient dataset analysis consists of three main steps. They are preceded by data preparation, which aims at adjusting original datasets to analysis needs. The proposed steps can be presented as follows:
(1) Feature selection, based on statistical analysis of correlation coefficients, which enables appointing the set of attributes for clustering
(2) Finding groups of similar characteristics, including a validation technique used to determine the appropriate number of clusters
(3) Statistical analysis performed in clusters to find new dependencies between all the considered parameters
The general overview of the method is shown in Figure 1. We assume that clustering and statistical analysis are applied on separate subsets of attributes. The descriptions of the main steps of the methodology are presented in Subsections 2.3, 2.4, and 2.5.

Feature Selection.
Patient records usually contain many attributes that may be used for supporting medical diagnosis. However, the performance of the diagnostics process may depend on the choice of the attributes in all the phases of the considered methodology. The quality of results obtained in the final step depends not only on the choice of parameters used for finding correlations, but also on the quality of patient groups and thus on the subset of attributes used in the clustering process. Therefore, the process of feature selection for cluster analysis is crucial for the whole presented methodology of medical diagnosis.
Regarding the main supporting tool, which is statistical inference according to physician preferences, we propose the reversed correlation algorithm (RCA) that uses correlation coefficients but in a reversed order. This means that we look for features that are the least correlated with all their predecessors.
First, we start building a subset of features with the attribute that is the least correlated with the others. Then, correlation coefficients between the chosen feature and the rest of the parameters are calculated. The attribute with the lowest correlation value is indicated as the second feature. The obtained subset of two features is further extended by adding the attribute of the correlation coefficient with the lowest value between the subset and the rest of the parameters. The process of appending the features of the lowest correlation values is repeated until all the correlation coefficients indicate statistically significant dependencies (respective values exceed thresholds) or the number of features in the subset is equal to the determined percentage of the total number of attributes. The whole procedure is presented in Algorithm 1.
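The selection loop described above can be illustrated by a minimal Python sketch. It is not the reference implementation: the function names (`pearson`, `rca_select`), the use of the absolute Pearson coefficient, the interpretation of "the least correlated with the others" as the smallest sum of absolute correlations, and the use of the strongest correlation with any already selected feature as the subset-to-candidate score are all assumptions made for illustration. The significance test on the p value is omitted, and the threshold is checked before a feature is added.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column: treated as uncorrelated (assumption)
    return sxy / math.sqrt(sxx * syy)

def rca_select(data, max_fraction=0.5, r_threshold=0.3):
    """Pick weakly correlated features; data maps feature name -> values."""
    names = list(data)
    corr = {(a, b): abs(pearson(data[a], data[b]))
            for a in names for b in names if a != b}
    max_features = max(1, int(max_fraction * len(names)))
    # First feature: smallest total absolute correlation with all the others.
    first = min(names, key=lambda f: sum(corr[f, g] for g in names if g != f))
    selected = [first]
    while len(selected) < max_features:
        rest = [f for f in names if f not in selected]
        if not rest:
            break
        # Score of a candidate: its strongest correlation with the subset.
        score = {f: max(corr[f, s] for s in selected) for f in rest}
        best = min(rest, key=score.get)
        if score[best] >= r_threshold:
            break  # every remaining feature is correlated with the subset
        selected.append(best)
    return selected

# Hypothetical toy data: "b" duplicates "a", while "c" and "d" are weakly related.
toy = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 12],
    "c": [1, -1, 1, -1, 1, -1],
    "d": [3, 1, 2, 3, 1, 2],
}
print(rca_select(toy))
```

With `max_fraction=0.5` and `r_threshold=0.3`, the sketch mirrors the settings N = 50% n and R = 0.3 used later in the experiments.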
In order to compare the results of the proposed feature selection algorithm, two other techniques have been considered: the opposite approach represented by the correlation-based feature selection (CFS) and an extension of the Relief algorithm called ReliefF.
Correlation-based feature selection (CFS) ranks attributes according to a heuristic evaluation function based on correlations [14]. The function evaluates subsets made of attribute vectors, which are correlated with the class label, but independent of each other. The CFS method assumes that irrelevant features show a low correlation with the class and therefore should be ignored by the algorithm. On the other hand, redundant features should be screened out, as they are usually strongly correlated with one or more of the other attributes. The criterion used to assess a subset of l features can be expressed as follows:

$$M_S = \frac{l \, t_{cf}}{\sqrt{l + l(l-1) \, t_{ff}}},$$

where $M_S$ is the evaluation of a subset S consisting of l features, $t_{cf}$ is the average correlation value between features and class labels, and $t_{ff}$ is the average correlation value between two features. There exist different variations of CFS that employ different attribute quality measures, such as Symmetrical Uncertainty, normalized symmetrical Minimum Description Length (MDL), or Relief.
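As a small numerical illustration of the CFS criterion, the sketch below evaluates the merit of a subset from its size and the two average correlations. It assumes the standard form of the merit from [14] (numerator l times the average feature-class correlation, denominator the square root of l + l(l-1) times the average feature-feature correlation); the function name `cfs_merit` and the input values are hypothetical.

```python
import math

def cfs_merit(l, t_cf, t_ff):
    """Merit of a subset of l features (assumed standard CFS form).

    t_cf: average feature-class correlation,
    t_ff: average feature-feature correlation.
    """
    return (l * t_cf) / math.sqrt(l + l * (l - 1) * t_ff)

# Four features equally correlated with the class (0.5):
print(cfs_merit(4, 0.5, 0.0))  # mutually independent features
print(cfs_merit(4, 0.5, 1.0))  # fully redundant features
```

The comparison shows why the criterion favours subsets whose members are correlated with the class but not with each other: redundancy enlarges the denominator and lowers the merit.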
The Relief algorithm, described in [15], concerns the evaluation of attributes based on the similarity of the neighbouring examples in the set of analysed instances [16]. For the given set of training instances, sample size, and the relevancy threshold τ, Relief detects features that are statistically consistent with the target task. Relief picks an instance X from the set and its two nearest neighbours: one of the same class, called the "near-hit", and one of the opposite class, called the "near-miss". Then, it updates the feature weight vector W for every triplet and uses it to determine the average relevance feature vector. The algorithm selects those features for which the value of the average weight, called relevance level, exceeds the given threshold value τ.
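The weight update of Relief can be sketched as follows for a two-class problem. This is a simplified illustration, not the procedure of [15]: instead of random sampling, every instance is visited once so that the result is deterministic; differences are absolute values normalised by the attribute range; and the nearest neighbours are found with the sum of these differences. The function name `relief_weights` and the toy data are hypothetical.

```python
def relief_weights(X, y):
    """Average Relief relevance of each attribute (two-class sketch)."""
    n, d = len(X), len(X[0])
    # Range of every attribute, used to scale differences into [0, 1].
    span = [max(r[j] for r in X) - min(r[j] for r in X) or 1.0 for j in range(d)]

    def diff(j, a, b):
        return abs(a[j] - b[j]) / span[j]

    def dist(a, b):
        return sum(diff(j, a, b) for j in range(d))

    w = [0.0] * d
    for i in range(n):
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        near_hit = min(hits, key=lambda k: dist(X[i], X[k]))
        near_miss = min(misses, key=lambda k: dist(X[i], X[k]))
        for j in range(d):
            # Reward attributes that differ on the near-miss,
            # penalise attributes that differ on the near-hit.
            w[j] += (diff(j, X[i], X[near_miss]) - diff(j, X[i], X[near_hit])) / n
    return w

# Hypothetical data: attribute 0 separates the classes, attribute 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.9, 0.8], [1.0, 0.2], [0.8, 0.5]]
y = [0, 0, 0, 1, 1, 1]
weights = relief_weights(X, y)
tau = 0.0  # relevancy threshold (illustrative value)
print([j for j, wj in enumerate(weights) if wj > tau])
```

Attributes whose average weight exceeds τ are kept, which corresponds to the relevance level criterion described above.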
The ReliefF algorithm has been proposed in [16]. Contrary to Relief, it is not limited to two-class problems, it is more effective, and it can deal with noisy or incomplete data, for missing values of attributes are treated probabilistically. Similarly to Relief, ReliefF randomly selects an instance X, but it searches for the determined number of the nearest neighbours from the same class, called "nearest hits," and the same number of the nearest neighbours from every different class ("nearest misses"). Then, it updates the vector W of estimations of the qualities for all the attributes depending on their values for X and sets of hits and misses.

Cluster Analysis.
Cluster analysis is an unsupervised classification technique, which can be used for grouping complex multidimensional data. In contrast to supervised methods, the profiles of obtained groups cannot be obviously stated and using additional techniques for discovering the meaning of clustering is required in many cases [17]. On the other hand, statistical analysis is the most popular tool used in the medical field. Therefore, in this area, combining clustering and statistical inference may not only enable patient grouping, but also finding dependencies between their characteristics and thus supporting medical diagnostics.

Figure 1: Overview of the methodology (parameter selection, cluster analysis, statistical analysis).
In further investigations, which aim at evaluating the presented technique regarding its efficiency on cardiovascular data, simple popular clustering algorithms will be considered, for such techniques are expected to be comprehensible for physicians.
We will examine two different clustering approaches: deterministic and probabilistic. The first approach will be represented by the k-means algorithm, which, in comparison to other techniques, demonstrated good performance for medical data regarding accuracy as well as lower root mean square error [18]. The k-means algorithm is one of the most popular partitioning methods, where clusters are built around k centers, by minimizing a distance function. The goal of the algorithm is to find the set of clusters for which the sum of the squared distance values between their points and respective centers is minimal. As the distance function, the Euclidean metric is used, which has been applied in most of the cases [19,20]. The first k centers are usually chosen at random, which does not guarantee finding optimal clusters. To increase the chance of finding the optimum, the algorithm is usually launched several times with different initial choices and the result of the smallest total squared distance is indicated [20].
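The restart strategy can be sketched in plain Python as below. It is only an illustration under stated assumptions: initial centers are drawn at random from the data points, squared Euclidean distance is used, and the number of restarts (10) and the seed are arbitrary; the names `kmeans_once` and `kmeans` are hypothetical and do not refer to the WEKA implementation used in the experiments.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def kmeans_once(points, k, rng, iters=100):
    """One k-means run from random initial centers; returns (wcss, centers)."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        # New center = mean of the group; an empty group keeps its old center.
        new_centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[c]
            for c, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    wcss = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return wcss, centers

def kmeans(points, k, restarts=10, seed=0):
    """Keep the run with the smallest total squared distance."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(restarts))
    return min(runs, key=lambda run: run[0])

# Hypothetical data: two well-separated groups of three points.
pts = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
best_wcss, best_centers = kmeans(pts, 2)
print(best_wcss, sorted(best_centers))
```

The returned within-cluster sum of squares is the quantity that is later plotted against the number of clusters in the elbow criterion.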
The goal of the second, probabilistic approach is to find the most probable set of clusters on the basis of training data and prior expectations. As a representative of these techniques, the EM (expectation-maximization) algorithm, based on the finite Gaussian mixtures model, has been investigated. EM generates probabilistic descriptions of clusters in terms of means and standard deviations [17]. The algorithm iteratively calculates the maximum likelihood estimates in parametric models in the presence of missing data [21]. EM enables using cross-validation for selecting the number of clusters and thus obtaining its optimal value [20]. That feature allows avoiding the determination of the number of clusters at the beginning of the algorithm.
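To make the alternation of the expectation and maximization steps concrete, the following sketch fits a mixture of two one-dimensional Gaussians. It is a deliberately reduced illustration: the number of components is fixed at two, the data are univariate, the initial means are taken from the lower and upper halves of the sorted sample, and no cross-validation over the number of clusters is performed. The function names and the sample values are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of the normal distribution N(mean, sd^2) at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def em_two_gaussians(xs, iters=50):
    """EM for a two-component 1-D Gaussian mixture; returns means, sds, weights."""
    xs = sorted(xs)
    half = len(xs) // 2
    means = [sum(xs[:half]) / half, sum(xs[half:]) / (len(xs) - half)]
    sds = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: probability that each point belongs to each component.
        resp = []
        for x in xs:
            p = [weights[j] * normal_pdf(x, means[j], sds[j]) for j in range(2)]
            total = sum(p) or 1e-300  # guard against numerical underflow
            resp.append([pj / total for pj in p])
        # M-step: re-estimate the parameters from the weighted points.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sds[j] = max(math.sqrt(var), 1e-3)  # avoid a collapsing component
            weights[j] = nj / len(xs)
    return means, sds, weights

# Hypothetical measurements forming two groups around 1 and 5.
sample = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, sds, weights = em_two_gaussians(sample)
print(means, sds, weights)
```

The resulting means and standard deviations are the probabilistic cluster descriptions mentioned above.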
The choice of the optimal number of clusters is one of the most important parts of the clustering process. In the case of the k-means algorithm, the elbow technique was used. It is based on the statement that the number of clusters should increase together with the increase of the quantity of information. The last number of clusters, for which a gain value was augmented, should be indicated as optimal. On the graph, where the validation measure is plotted against the number of clusters, that point is presented as an angle, and called the elbow. There are cases when angles cannot be unambiguously identified, and the number of clusters indicated by the elbow technique should be confirmed by other methods.
Thus, considering two clustering methods equipped with different techniques for choosing the optimal number of clusters may help in confirming the right choice. However, it is worth noticing that in medicine there exists the usual intent to split the whole dataset into two groups and thus the number of clusters is very often equal to two [18]. Besides, in medical applications, the number of collected instances is very small and the high number of clusters may result in small group sizes and in less reliable medical inference, as the consequence of the lack of statistical tests of high power [19,22].
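One simple way to locate the elbow automatically is to look for the sharpest bend of the validation curve. The sketch below is such a heuristic and only an assumption about how the point may be detected: it normalises the within-cluster sums of squares to the range [0, 1] and returns the number of clusters with the largest second difference. The function names and the example values are hypothetical, and consecutive integer values of k are assumed.

```python
def normalise(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def elbow(wcss_by_k):
    """Return k with the largest second difference of the normalised curve.

    wcss_by_k maps consecutive numbers of clusters to the within-cluster
    sum of squares obtained for them.
    """
    ks = sorted(wcss_by_k)
    w = normalise([wcss_by_k[k] for k in ks])
    bend = {ks[i]: w[i - 1] - 2 * w[i] + w[i + 1] for i in range(1, len(ks) - 1)}
    return max(bend, key=bend.get)

# Hypothetical validation values for k = 1..5.
curve = {1: 100.0, 2: 40.0, 3: 30.0, 4: 25.0, 5: 22.0}
print(elbow(curve))
```

When the curve has no pronounced bend, the second differences are all similar and the answer is unreliable, which matches the remark that the elbow should then be confirmed by other methods.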

Statistical Analysis.
Before carrying out statistical inference, the assessment of measures of descriptive statistics should be performed. Such an approach allows detecting errors that were not identified during the data preparation phase. As the main descriptors, for which the evaluation is indicated, one should mention central tendency measures (arithmetic mean, median, and mode) as well as dispersion measures (range and standard deviation). Next, an appropriate test is run as a part of the statistical analysis process. The test should be chosen according to the type and the structure of analysed data regarding such characteristics as attribute types, the scale type, the number of experimental groups, and their dependencies, as well as the test power. Additionally, the selection should be consistent with the requirements of the USMLE (The United States Medical Licensing Examination). In the presented research, the tests usually applied in medical diagnostics are considered [2]:
(i) Kolmogorov-Smirnov test, which is used to check the normality of distribution of the attributes
(ii) Unpaired two-sample Student's t-test for the significance of a difference between two normally distributed values of attributes
(iii) Mann-Whitney U test, which is a nonparametric test for the determination of significant differences, where attributes are in nominal scales

Algorithm 1: Proposed feature selection algorithm using reversed correlations.
Input: F /* set of all the features */; P /* statistical significance level */; R /* a threshold for correlation coefficient levels */; N /* the maximum of features for the subset */;
Output: F_s /* selected subset of features */;
(1) Initialize F_s with feature f_j ∈ F that is the least correlated with other ones;
(2) do
(3)   Compute C_ij(F_s, F \ F_s) as a vector of correlation coefficients between F_s and each f_i ∈ {F \ F_s};
(4)   Choose f_j ∈ {F \ F_s} with the lowest value of correlation coefficient in a vector C_ij(F_s, F \ F_s);
(5)   Include f_j in F_s
(6) while (s < N AND p > P AND C_ij(F_s, F \ F_s) < R).

Pearson's correlation coefficient r_P(x, y) is used to express the impact of one variable measured in an interval or ratio scale on another variable in the same scale. Spearman's correlation coefficient r_S(x, y) is used when one or both of the variables are measured with an ordinal scale, or when the variables are expressed on an interval scale but the relationship is not linear.
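The difference between the two correlation coefficients can be seen in a short sketch. It assumes the usual definitions: Pearson's coefficient computed from centred products, and Spearman's coefficient computed as Pearson's coefficient of the ranks, with tied values receiving their average rank. The function names and the example values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# A monotonic but non-linear relationship (hypothetical values).
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(pearson(x, y), spearman(x, y))
```

For such a monotonic, non-linear relationship Spearman's coefficient reaches its maximum, whereas Pearson's coefficient stays below it, which is the case named above for preferring the rank-based measure.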

Results and Discussion
The performance of the proposed methodology has been examined by experiments conducted on the real datasets collected for supporting heart disease diagnosis. The statistical analysis results obtained for clusters have been compared with the ones taken for the whole datasets.

Data Description.
The experiments were carried out on three datasets. The "HEART" dataset consists of 30 cases collected to discover dependencies between arterial hypertension and left ventricle systolic functions. The "IUGR" dataset includes 47 instances of children born with intrauterine growth restriction (IUGR), gathered to find out dependencies between abnormal blood pressure and being born as small for gestational age. The data of both datasets were collected in the Children's Cardiology and Rheumatology Department of the Second Chair of Paediatrics at the Medical University of Lodz.
Each dataset was characterized by two types of parameters: the main and the supplementary ones, all of them gathered for discovering new dependencies. The attributes correspond to high blood pressure and include echocardiography and blood pressure assessment, prenatal and neonatal history, risk factors for IUGR, and family survey of cardiovascular disease, as well as nutritional status. There were no missing values within the attributes. The full medical explanations of the data are given in [13,23].
The "CORONARY" dataset also refers to cardiovascular problems. It comes from the UCI Machine Learning Repository [8]. The dataset contains the records of 303 patients, each of which is described by 54 features. The attributes were arranged in four groups of features: demographic, symptom and examination, ECG, and laboratory and echo ones [24][25][26].
The summary of characteristics for all the datasets is presented in Table 1. The datasets have been chosen to ensure diversification of the mutual proportion between the number of instances and the number of attributes:
(i) The number of instances in the "HEART" dataset is smaller than the number of parameters
(ii) The number of instances in the "IUGR" dataset is comparable with the number of attributes
(iii) In the "CORONARY" dataset, the number of instances is greater than the number of parameters
Tables 2-4 describe the selection of the parameters with the main statistical descriptors: the values of range, median or mean, and standard deviation (SD).

Selecting Relevant Features.
For each dataset, only parameters concerning main characteristics were considered as initial attributes used for grouping. The selection of the appropriate features for building clusters has been performed by using three different techniques:
(1) The reversed correlation algorithm (RCA)
(2) CFS method
(3) ReliefF algorithm
The parameters necessary to run the RCA algorithm were chosen according to principles commonly approved in statistics (see [24,28]):
(i) N = 50% n for the maximal number of features
(ii) R = 0.3 for the maximal value of correlation coefficients
(iii) P = 0.05 for the maximal value of statistical significance p value
In the case of the ReliefF algorithm, the threshold for the number of attributes included in the subset of selected features was set to N = 50% n.
The subsets of features presented in Table 5 were obtained as the results of the proposed feature selection process. The first column of the table represents names of datasets, the second column represents the names of the feature selection algorithms, and the following columns contain the number and names of selected features in the order indicated by the algorithms.

Data Clustering.
In the next step of the experiments, the clusters for diagnosed patients were created by using two clustering algorithms: k-means and EM implemented by WEKA Open Source software [20]. Clusters were built regarding the main characteristics and the parameters indicated by feature selection methods, namely RCA, CFS, and ReliefF.
In the case of the EM algorithm, the best number of clusters was indicated by using cross-validation. To choose the best number of clusters for k-means clustering, the elbow criterion has been applied and the within-cluster sum of squares has been considered as a validation measure. The charts of validation measures plotted against the number of clusters with marked elbow points for HEART, IUGR, and CORONARY datasets, respectively, are presented in Figures 2-4. For better result visualisation, the values of the within-cluster sum of squares were normalized.
The results of clustering are presented in Table 6, where the first column describes datasets, the second column contains the names of the feature selection methods, and the last two columns present the number of clusters and clustering schemes.

Statistical Inference.
Correlation values obtained for the clusters were compared with the ones taken for the whole group of diagnosed patients in terms of different selection techniques. Comparison of results confirmed the effectiveness of the proposed methodology. For each dataset, we obtained a greater number of statistically significant correlations in clusters, which may lead to improved medical diagnosis in the future. By significant correlations we mean values with correlation coefficient r ≥ 0.3 and p value ≤ 0.05 [27,28]. The biggest growth of the number of correlations concerns the HEART dataset, where the number of instances is smaller than the number of parameters. The numbers of detected correlations are presented in Table 7.
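The counting of significant correlations within one cluster can be sketched as follows. The sketch is an illustration only and departs from the procedure used in the experiments in two labelled ways: the p value is estimated with a permutation test (a substitute chosen so that the example needs no statistical library), and the absolute value of Pearson's r is compared with the threshold 0.3. The function names, the attribute names "a", "b", "c", and the data are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(x, y, n_perm=999, seed=0):
    """Two-sided permutation estimate of the p value of the correlation."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def count_significant(cluster, r_min=0.3, p_max=0.05):
    """Count attribute pairs with |r| >= r_min and p <= p_max in one cluster."""
    names = sorted(cluster)
    count = 0
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(cluster[a], cluster[b])
            if abs(r) >= r_min and permutation_p_value(cluster[a], cluster[b]) <= p_max:
                count += 1
    return count

# Hypothetical cluster of ten patients described by three attributes.
cluster = {
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "b": [1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 6.9, 8.2, 8.8, 10.1],
    "c": [1, -1, 1, -1, 1, -1, 1, -1, 1, -1],
}
print(count_significant(cluster))
```

Applying such a count to every cluster and to the whole dataset gives numbers of the kind compared in Table 7.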
One can easily notice that the results attained by the unsupervised RCA feature selection technique and the supervised ReliefF algorithm were comparable; however, the first method outperforms the second one in the case of the IUGR dataset and the k-means technique. Since in many cases a supervised technique of feature selection cannot be used due to the lack of information on labels, one can expect that the RCA method would be applicable more often than the ReliefF algorithm.

Conclusions
The process of computer-aided medical studies is usually based on only one of the data analysis methods, most often a statistical approach.In this paper, we present an approach that integrates a feature

The experiments have shown that the proposed hybrid approach provides significant benefits. The statistical inference performed in clusters enabled detection of new relationships, which have not been discovered in the whole datasets, regardless of the applied feature selection algorithm and the clustering technique. Moreover, the proposed RCA technique attained results at least as good as other considered feature selection methods, but as opposed to CFS and ReliefF, it belongs to unsupervised approaches, which implies a more flexible application. It is also worth emphasising that the presented approach has been checked using datasets of different mutual proportions between the number of instances and the number of attributes. The experimental results have shown that the proposed methodology performs well on datasets with a small number of instances and, what is more, the biggest growth of the number of correlations concerns the dataset where the number of instances is smaller than the number of attributes. Such situations very often take place in the case of patient datasets.
Future research will focus on further investigations that aim at improving medical diagnostics by using hybrid approaches combining data mining and statistical inference. First, more datasets should be examined regarding different mutual proportions between the number of instances and the number of attributes. The research area should be broadened to diagnostics for the diseases of other types. Further research should also include indicating the effective integration of feature selection and clustering algorithms that will perform well combined with statistical inference.

Table 1: The characteristics of datasets.

Table 2: Characteristics of attributes for the "HEART" dataset.

Table 3: Characteristics of attributes for the "IUGR" dataset.

Table 4: Characteristics of attributes for the "CORONARY" dataset.

Table 5: Feature selection results.

Table 7: Numbers of statistically significant correlations detected in the whole datasets and in clusters.