Evaluation of Modified Categorical Data Fuzzy Clustering Algorithm on the Wisconsin Breast Cancer Dataset

The early diagnosis of breast cancer is an important step in the fight against the disease. Machine learning techniques have shown promise in improving our understanding of the disease. As medical datasets contain data points that cannot be precisely assigned to a class, fuzzy methods have been useful for studying these datasets. Breast cancer datasets are sometimes described by categorical features. Many fuzzy clustering algorithms have been developed for categorical datasets. However, most of these methods use Hamming distance to define the distance between two categorical feature values. In this paper, we use a probabilistic distance measure to compute the distance between a pair of categorical feature values. Experiments demonstrate that this distance measure performs better than Hamming distance for the Wisconsin breast cancer data.


Introduction
Breast cancer is the most common form of cancer amongst women [1]. Early and accurate detection of breast cancer is the key to the long survival of patients [1]. Machine learning techniques are being used to improve diagnostic capability for breast cancer [2][3][4]. The Wisconsin breast cancer dataset has been a popular dataset in the machine learning community [5]. Various classification techniques, such as decision trees [6], support vector machines [7], and a fuzzy-genetic algorithm [8], have been used to study this dataset. In medical datasets, it is sometimes difficult to assign a data point to exactly one of the groups. Fuzzy methods are better equipped to handle these kinds of datasets [9][10][11].
Clustering divides the data points into different groups (clusters) according to a similarity measure [12]. Data points in the same cluster are similar, whereas data points in different clusters are dissimilar. Clustering algorithms can be divided into two groups [12,13]: hard clustering algorithms and fuzzy clustering algorithms. In hard clustering, a data point belongs to exactly one cluster. In fuzzy clustering, however, a data point has a degree of membership in every cluster.
The k-means algorithm [14] is a very popular hard clustering algorithm because of its linear complexity. It is an iterative algorithm that computes the mean of each feature over the data points present in a cluster. This makes the algorithm inappropriate for datasets that have categorical features. Huang [15] extends the k-means algorithm to datasets having categorical features. Instead of the mean, the mode is used to represent a cluster, and Hamming distance is used to calculate the membership of a data point. Under Hamming distance, if the feature values are the same for two data points the distance is taken as 0; otherwise the distance is taken as 1.
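As a small illustration of the two ingredients just described, the following Python sketch computes a per-feature mode for a cluster and the Hamming distance from a data point to that mode. The function names and the toy data are hypothetical and are not taken from [15]; ties in the mode are broken by first occurrence, which is an assumption of this sketch.

```python
from collections import Counter

def hamming_distance(a, b):
    """Count the features on which two categorical data points differ."""
    if len(a) != len(b):
        raise ValueError("data points must have the same number of features")
    return sum(1 for x, y in zip(a, b) if x != y)

def cluster_mode(points):
    """Most frequent value of each feature (ties: first value encountered)."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*points)]

# Hypothetical toy cluster with two categorical features.
cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
mode = cluster_mode(cluster)
dist = hamming_distance(("blue", "large"), mode)
```

Every mismatch contributes exactly 1, whichever values are involved, so the measure cannot express that some values are closer to each other than others.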
Hierarchical clustering algorithms [12] can be applied to categorical datasets; however, they have high computational complexity. This makes them less useful for large datasets.
Fuzzy clustering has shown great promise in understanding medical datasets [10,11]. It has been shown that fuzzy clustering can improve the classification performance of various classifiers for the diagnosis of breast cancer [16]. Fuzzy c-mean (FCM) [17,18] is one of the most popular clustering techniques. The original FCM clustering technique can only handle numeric features. Using the methodology of FCM, the fuzzy k-mode algorithm [19] was proposed for categorical datasets. This method uses Hamming distance and hard cluster centres. Kim et al. [20] propose a fuzzy clustering algorithm that uses fuzzy cluster centres. This algorithm performs better than the fuzzy k-mode algorithm [20].
Most fuzzy clustering algorithms for categorical datasets use Hamming distance. However, Lee and Pedrycz [21] show that simple matching measures such as Hamming distance cannot capture the true similarities among categorical feature values; hence an appropriate distance measure should be used to improve the performance of fuzzy clustering algorithms with fuzzy cluster centres.
Various dissimilarity measures have been proposed for categorical feature values [23]. Ahmad and Dey [22] present a dissimilarity measure for categorical features and show that the k-mode clustering algorithm can be improved with this measure. Ahmad and Dey [24] use this distance measure to propose a clustering algorithm for datasets having numerical and categorical features. Ahmad and Dey [25] also suggest a subspace clustering algorithm with this dissimilarity measure. Motivated by the success of the dissimilarity measure for clustering categorical data, Ji et al. [26] use the distance measure for fuzzy clustering of mixed datasets. Ahmad and Dey [27] present a fuzzy clustering method that uses a distance measure whose distances are recalculated in each iteration.
The Wisconsin breast cancer dataset has been studied extensively in the machine learning field [25, 30, 31, 32]. Each feature of the Wisconsin breast cancer dataset has ten categories (1 to 10). It has been a popular dataset for analysing clustering algorithms for categorical datasets [25, 30, 31, 32]. In this paper, we apply the clustering algorithm proposed by Ji et al. [26] to the Wisconsin breast cancer dataset. In this way we show the applicability of the distance measure proposed by Ahmad and Dey [22] to the analysis of a categorical breast cancer dataset. This paper is organized as follows. We discuss the fuzzy c-mean clustering algorithm in Section 2. Section 3 reviews the method that computes the distance between two categorical feature values. Section 4 discusses the method to compute the fuzzy centroid for categorical datasets and the distance between a data point and a cluster centre [26]. Experimental results are presented in Section 5. Section 6 presents the conclusion and future work.

Fuzzy c-Mean Clustering Algorithm
Fuzzy c-mean (FCM) [17,18] is a popular clustering algorithm. In this section, we discuss FCM.
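The following information is given: a dataset $X = \{x_1, x_2, \dots, x_n\}$ of $n$ data points and the number of clusters $c$. FCM computes the cluster centres $V_i$ ($i = 1, 2, \dots, c$), where $V = \{V_1, V_2, \dots, V_c\}$, and the fuzzy membership matrix $U = [u_{ij}]$, where $u_{ij}$ is the membership of data point $x_j$ in cluster $i$. This is done by iteratively minimizing the objective function $J_m$ presented below ($m > 1$ is a predefined real number that controls the fuzziness):

$$J_m(U, V) = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} d_{ij}^{2}, \tag{1}$$

where $d_{ij}$ is the distance between data point $x_j$ and cluster centre $V_i$.

For numeric data, $V_i$ and $u_{ij}$ are computed as follows:

$$V_i = \frac{\sum_{j=1}^{n} u_{ij}^{m} x_j}{\sum_{j=1}^{n} u_{ij}^{m}}, \tag{2}$$

$$u_{ij} = \left[ \sum_{k=1}^{c} \left( \frac{d_{ij}}{d_{kj}} \right)^{2/(m-1)} \right]^{-1}. \tag{3}$$

The steps of the FCM-based algorithm are presented as follows.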
Step 1. Select a stopping value $\varepsilon$. Initialize the fuzzy membership matrix $U$ by creating $c \times n$ random numbers (one per cluster and data point) in the interval [0, 1].

do
Step 2. Compute the cluster centres.
Step 3. Compute the distances from the centres and use these distances to update the fuzzy membership matrix $U$.
Step 4. Calculate the objective function $J_m$.
While (the difference between two successive computed values of $J_m$ is more than the given stopping value $\varepsilon$).
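The four steps above can be sketched in plain Python for numeric data as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the function name `fcm`, the default parameters, the iteration cap, the squared Euclidean distance, and the normalisation of the random initial memberships so that they sum to 1 for each data point are all choices made here.

```python
import random

def fcm(points, c, m=2.0, eps=1e-6, max_iter=300, seed=0):
    """Fuzzy c-means on numeric tuples; returns (centres, u, objective)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Step 1: c x n random memberships. Normalising each column so that a
    # point's memberships sum to 1 is an assumption added in this sketch.
    u = [[rng.random() + 1e-9 for _ in range(n)] for _ in range(c)]
    for j in range(n):
        col = sum(u[i][j] for i in range(c))
        for i in range(c):
            u[i][j] /= col
    previous = None
    for _ in range(max_iter):
        # Step 2: cluster centres as membership-weighted means.
        centres = []
        for i in range(c):
            w = [u[i][j] ** m for j in range(n)]
            total = sum(w)
            centres.append(tuple(
                sum(w[j] * points[j][k] for j in range(n)) / total
                for k in range(dim)))
        # Step 3: squared Euclidean distances, then membership update.
        d2 = [[sum((x - v) ** 2 for x, v in zip(p, centre)) for p in points]
              for centre in centres]
        for j in range(n):
            hits = [i for i in range(c) if d2[i][j] == 0.0]
            for i in range(c):
                if hits:  # the point coincides with at least one centre
                    u[i][j] = 1.0 / len(hits) if i in hits else 0.0
                else:
                    # (d_ij / d_kj)^(2/(m-1)) written with squared distances
                    u[i][j] = 1.0 / sum(
                        (d2[i][j] / d2[k][j]) ** (1.0 / (m - 1.0))
                        for k in range(c))
        # Step 4: objective function; stop when it changes by at most eps.
        objective = sum(u[i][j] ** m * d2[i][j]
                        for i in range(c) for j in range(n))
        if previous is not None and abs(previous - objective) <= eps:
            break
        previous = objective
    return centres, u, objective

# Hypothetical toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, u, objective = fcm(points, c=2)
labels = [max(range(2), key=lambda i: u[i][j]) for j in range(len(points))]
```

Taking the largest membership per data point, as in the last line, is one simple way to turn the fuzzy result into hard labels for evaluation.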

The Distance between Two Categorical Feature Values
Ahmad and Dey [22] propose an algorithm to calculate the distance between two categorical feature values in an unsupervised framework. Unlike Hamming distance, this measure does not restrict the distance between two categorical values to 0 or 1. The distance is calculated from the co-occurrence of the two feature values with the values of the other features. The distance between categorical feature values $x$ and $y$ of feature $A_l$ against the feature $A_q$, for a subset $\omega$ of the values of $A_q$, is defined as follows:

$$\delta^{lq}_{\omega}(x, y) = P(\omega / x) + P(\sim\omega / y), \tag{4}$$

where $P(\omega / x)$ is the conditional probability that a data point having value $x$ for $A_l$ has a value of $A_q$ in $\omega$, and $\sim\omega$ is the complement of $\omega$. The distance between the feature values $x$ and $y$ of $A_l$ against feature $A_q$ is denoted by $\delta^{lq}(x, y)$ and is defined by $\delta^{lq}(x, y) = P(\omega / x) + P(\sim\omega / y) - 1$, where $\omega$ is the subset of values of $A_q$ that maximizes the quantity $P(\omega / x) + P(\sim\omega / y)$. To compute the distance between $x$ and $y$, we compute the distances between $x$ and $y$ against every other feature. The average of these distances is taken as the distance, $\delta(x, y)$, between $x$ and $y$ in the dataset. Distances between every pair of feature values are employed to calculate the distance between a data point and a cluster centre.
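A minimal Python sketch of this computation is given below. It relies on the observation that the maximizing subset can be built value by value: a value of the other feature is placed in the subset whenever its conditional probability given the first value is at least its conditional probability given the second value, so the maximum equals the sum, over the values of the other feature, of the larger of the two conditional probabilities. The function name `pair_distance`, the row-of-tuples data layout, and the toy dataset are assumptions of this sketch.

```python
from collections import Counter

def pair_distance(data, l, x, y):
    """Co-occurrence distance between values x and y of feature index l,
    averaged over every other feature."""
    if x == y:
        return 0.0
    rows_x = [row for row in data if row[l] == x]
    rows_y = [row for row in data if row[l] == y]
    others = [q for q in range(len(data[0])) if q != l]
    if not rows_x or not rows_y or not others:
        raise ValueError("both values must occur and another feature must exist")
    total = 0.0
    for q in others:
        count_x = Counter(row[q] for row in rows_x)
        count_y = Counter(row[q] for row in rows_y)
        # For each value v of feature q, keep the larger of P(v / x) and
        # P(v / y); this realises the maximising subset without enumeration.
        best = sum(max(count_x[v] / len(rows_x), count_y[v] / len(rows_y))
                   for v in set(count_x) | set(count_y))
        total += best - 1.0
    return total / len(others)

# Hypothetical toy dataset with two categorical features.
data = [("a", "p"), ("a", "p"), ("b", "q"), ("b", "q"), ("c", "p"), ("c", "q")]
d_ab = pair_distance(data, 0, "a", "b")
d_ac = pair_distance(data, 0, "a", "c")
```

In the toy data, values "a" and "b" never share a value of the second feature, while "c" overlaps with both, so "a" is farther from "b" than from "c"; Hamming distance would assign 1 to both pairs.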

Modified Centre and the Distance from the Modified Centre
For categorical datasets, the mode is used as the cluster centre [19]. However, a single feature value does not represent a cluster centre well; hence information is lost. Ji et al. [26] use the fuzzy centroid concept [20] with the distance measure suggested by Ahmad and Dey [22] for fuzzy clustering of categorical datasets. The fuzzy centroid for a cluster $i$ of a categorical dataset with $p$ features is defined as

$$\tilde{V}_i = [\tilde{v}_{i1}, \tilde{v}_{i2}, \dots, \tilde{v}_{ip}]. \tag{5}$$

Assume that the $l$th feature has $t$ different values. Thus,

$$\tilde{v}_{il} = \{(a_l^{1}, w_{il}^{1}), (a_l^{2}, w_{il}^{2}), \dots, (a_l^{t}, w_{il}^{t})\}, \tag{6}$$

where $w_{il}^{k}$ is the association of value $a_l^{k}$ (the $k$th feature value for the $l$th feature) with cluster $i$:

$$w_{il}^{k} = \frac{\sum_{j=1}^{n} u_{ij}^{m}\, \gamma(x_{jl} = a_l^{k})}{\sum_{j=1}^{n} u_{ij}^{m}}, \tag{7}$$

where $u_{ij}$ is the membership of data point $x_j$ in cluster $i$, $m$ is the fuzziness parameter, $n$ is the number of data points, and $\gamma(x_{jl} = a_l^{k}) = 1$ for a data point having $l$th feature value $x_{jl} = a_l^{k}$ and $\gamma(x_{jl} = a_l^{k}) = 0$ for a data point having $l$th feature value $x_{jl} \neq a_l^{k}$.

The distance between a data point having categorical feature value $x$ in the $l$th dimension and the centre of cluster $i$ is defined as

$$d(x, \tilde{v}_{il}) = \sum_{k=1}^{t} w_{il}^{k}\, \delta(x, a_l^{k}), \tag{8}$$

where $a_l^{k}$ is the $k$th feature value of the $l$th categorical feature.

$\delta(x, a_l^{k})$ is calculated by the method discussed in Section 3. For a dataset having $p$ features, this distance is calculated for each feature value of the data point, and the sum of these distances is the distance between the data point and the centre. In FCM, the distances between data points and cluster centres are used to calculate the fuzzy membership matrix. Hence, this distance measure is employed to compute the fuzzy membership matrix.
The cluster centre definition and the distances between cluster centres and data points discussed in this section can be used with the FCM algorithm discussed in Section 2 to create a fuzzy clustering algorithm for categorical datasets [26]. The steps of the fuzzy clustering algorithm for categorical data are as follows.
Step 1. Select a stopping value $\varepsilon$. Initialize the fuzzy membership matrix $U$ by creating $c \times n$ random numbers (one per cluster and data point) in the interval [0, 1].

do
Step 2. Compute the cluster centres (fuzzy centroids).
Step 3. Compute the distances from the centres by using (8). Either Hamming distance or the distance measure discussed in Section 3 is used in this step. Use these distances to update the fuzzy membership matrix $U$.
Step 4. Calculate the objective function $J_m$.
While (the difference between two successive computed values of $J_m$ is more than the given stopping value $\varepsilon$).
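The centre and distance computations used in these steps can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the implementation of [26]: each fuzzy centroid is stored as one dictionary per feature that maps a feature value to its weight, the weights are assumed to be sums of memberships raised to the fuzziness parameter and normalised to sum to 1, and the per-value distance is passed in as a function. Hamming distance is used as the default only to keep the example self-contained; the names `fuzzy_centroids`, `distance_to_centroid`, and `hamming` are hypothetical.

```python
from collections import defaultdict

def hamming(l, x, y):
    """Per-value Hamming distance; the feature index l is unused here."""
    return 0.0 if x == y else 1.0

def fuzzy_centroids(data, u, m):
    """Return centroids[i][l] = {value: weight} for cluster i, feature l.
    Normalising the weights to sum to 1 is an assumption of this sketch."""
    n_features = len(data[0])
    centroids = []
    for memberships in u:  # one row of the membership matrix per cluster
        powered = [value ** m for value in memberships]
        total = sum(powered)
        centre = []
        for l in range(n_features):
            weights = defaultdict(float)
            for row, w in zip(data, powered):
                weights[row[l]] += w / total
            centre.append(dict(weights))
        centroids.append(centre)
    return centroids

def distance_to_centroid(row, centre, delta=hamming):
    """Sum over features of the weighted distances to the centroid values."""
    return sum(weight * delta(l, x, value)
               for l, x in enumerate(row)
               for value, weight in centre[l].items())

# Hypothetical toy data with crisp memberships so the result is easy to check.
data = [("a", "p"), ("a", "p"), ("b", "q")]
u = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
centroids = fuzzy_centroids(data, u, m=1.1)
near = distance_to_centroid(("a", "p"), centroids[0])
far = distance_to_centroid(("b", "q"), centroids[0])
```

A co-occurrence-based distance can be substituted by passing a different `delta` function with the same three arguments; the resulting distances then feed the usual FCM membership update.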

Results and Discussion
The experiments were carried out on the Wisconsin breast cancer data. This dataset has 699 data points. Each data point is represented by 9 features. Sixteen data points have missing values.
Missing feature values were replaced by the mode of that feature. The information about these features is given in Table 1. There are two groups in this dataset: benign and malignant. The benign group has 458 data points, whereas the malignant group has 241 data points. Each feature has ten categories (1-10). We ran fuzzy clustering with fuzzy centroids using both Hamming distance and the distance measure proposed by Ahmad and Dey [22] to see how the incorporation of the distance measure affects the quality of the clustering.
To assess the quality of clustering, it is assumed that a preclassified dataset is provided, and the overlap between the achieved clustering and the ground truth classification is measured by the clustering error:
Clustering error = (the number of data points not in desired clusters)/(the number of data points).
Experiments were carried out at different values of $m$: 1.1, 1.5, and 1.9. Random initialization was used for both clustering algorithms. The clustering algorithms were run 100 times in each setting (different $m$ values), and the average results are presented in Table 2. We also present the performance of various other clustering algorithms on the Wisconsin breast cancer dataset. We performed the experiments for the fuzzy k-modes clustering algorithm; the average result of 10 runs with $m = 1.1$ is presented. Other clustering results are taken from [30]. These results are presented in Table 3. Confusion matrices for different setups are presented in Tables 4-9.
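The clustering error can be computed as in the following Python sketch. The text does not state how clusters are matched to classes, so this sketch assumes that each cluster is assigned to its majority class and that every data point outside that class counts as an error; the function name and the toy labels are hypothetical.

```python
from collections import Counter

def clustering_error(cluster_labels, true_labels):
    """Fraction of data points that fall outside the majority class of
    their cluster (majority matching is an assumption of this sketch)."""
    if len(cluster_labels) != len(true_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    per_cluster = {}
    for cluster, truth in zip(cluster_labels, true_labels):
        per_cluster.setdefault(cluster, Counter())[truth] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in per_cluster.values())
    return (len(true_labels) - correct) / len(true_labels)

# Hypothetical hard labels obtained by taking the largest membership per point.
clusters = [0, 0, 0, 1, 1]
truth = ["benign", "benign", "malignant", "malignant", "malignant"]
error = clustering_error(clusters, truth)
```

With majority matching, two clusters can be mapped to the same class; a one-to-one matching between clusters and classes would be a stricter alternative.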
Clustering results suggest that, for all values of $m$, the fuzzy clustering algorithm with the Ahmad and Dey [22] distance measure performed better; for example, for $m = 1.1$, the average clustering error for the proposed algorithm was 5.0%, whereas the average clustering error with Hamming distance was 10.4%. This shows that the application of the distance measure improved the clustering results. Table 3 suggests that the fuzzy clustering algorithm with the Ahmad and Dey [22] distance measure performed better than the other clustering algorithms.
The other interesting observation is that, with Hamming distance, the clustering algorithm placed malignant data points in the benign cluster. In other words, it had difficulty in assigning malignant data points correctly, whereas the clustering algorithm with the Ahmad and Dey [22] distance measure assigned malignant data points more accurately. To understand this point better, we compared the memberships of different data points for these two algorithms. Figures 1 and 2 show the memberships of different data points for the benign cluster. They show that, with Hamming distance, even the malignant data points have high memberships for the benign cluster. However, with the distance measure proposed by Ahmad and Dey [22], the memberships separate the two groups better. Figures 3 and 4 show the memberships of different data points for the malignant cluster. They suggest that, with Hamming distance, the membership values of benign data points are low; however, the membership values of malignant data points are not as high as those with the proposed algorithm. These observations demonstrate that better membership values were achieved with the distance measure proposed by Ahmad and Dey [22].

Conclusion and Future Work
Early and correct detection is the key to the cure of breast cancer. Machine learning techniques are important diagnostic tools for breast cancer. Fuzzy clustering algorithms have shown great promise in the analysis of breast cancer. The Wisconsin breast cancer dataset has been treated as a categorical dataset in different studies because its features have categories (1-10). Ahmad and Dey [22] suggested a distance measure that has been successfully used in many clustering algorithms for categorical datasets. We used this distance measure for fuzzy clustering of the Wisconsin breast cancer dataset. Our results suggest that it gives a lower clustering error than the same fuzzy clustering algorithm with Hamming distance. The experimental results also suggest that the membership values achieved with the distance measure proposed by Ahmad and Dey [22] better matched the given class information. In future, we will apply this distance measure to other medical datasets. Various other fuzzy clustering algorithms for categorical datasets have been suggested [33][34][35]; in future, we will study the applicability of the distance measure proposed by Ahmad and Dey [22] to these algorithms. A comparative study of other distance measures will also be carried out [36]. Cluster centre initialization is a problem, as different random initializations lead to different clustering results [37]. In future, we will apply different cluster centre initialization methods to overcome this problem.

Conflict of Interests
The author declares that there is no conflict of interests regarding the publication of this paper.