We take the concept of typicality from cognitive psychology and apply it to the interpretation of numerical data sets and color images through fuzzy clustering algorithms, particularly the GKPFCM, in order to extract better information from the processed data. The Gustafson Kessel Possibilistic Fuzzy c-means (GKPFCM) is a hybrid algorithm based on a relative typicality (membership degree, Fuzzy c-means) and an absolute typicality (typicality value, Possibilistic c-means). Using both typicalities makes it possible to learn and analyze data as well as to relate the results to the theory of prototypes. To demonstrate these results we use a synthetic data set and a digitized image of a glass in a first example, and images from the Berkeley database in a second example. The results show the advantages of the information obtained about numerical data sets when the different meanings of the two typicalities are taken into account and both values are available from the clustering algorithm. This approach allows the identification of small homogeneous regions, which are difficult to find.
The objective of clustering algorithms is to find an internal structure in a numerical data set in order to separate it into
The clustering algorithms help us to get a simplified representation of a numerical data set into
Also, the clustering algorithms that partition a given space in a hard, fuzzy, probabilistic, or possibilistic way, according to a data set and after a learning process, provide a set of prototypes as the most representative elements of each group. Classic k-means or hard c-means (HCM) [
In this work we propose to take the interpretation of the typicality concept according to the cognitive psychology point of view, such that it is possible to obtain a more natural interpretation of data. Rosch and Mervis [
A prototype is based on the notion of typicality; all members belonging to the same category do not represent it in the same way; that is, some members are more typical than others. In [
In this work we use the concepts of typicality and membership degrees in order to categorize linguistic concepts, looking for a better understanding of the information extracted from a numerical data set and digital images through segmentation using the Gustafson Kessel Possibilistic Fuzzy c-means (GKPFCM) clustering algorithm [
The remainder of the paper is organized as follows. In Section
In the search for prototypes proposed by [
The categories of some concepts can be vague or fuzzy; that is, there are objects whose membership to the category is uncertain, and this is not due to a lack of knowledge but to the lack of a clear rule defining the edges of the categories [
For a better understanding of the differences between typicality and vagueness, we take an example from the work of [
The membership degree of the penguin and the dove to the category of birds is 1. However, the dove is more typical than the penguin. For the concept of
When clustering algorithms are applied to numerical data sets, the members of each group are required to have similar values for each of the features; otherwise, some members would be identified as members of other groups. In the theory of prototypes, on the other hand, each member of a group must have all of the features, and members continue to belong to the group even if they have atypical values for some features. This means that clustering algorithms identify disjoint subgroups, each one containing typical and atypical data, and the only data common to the subgroups is noise. Figure
Illustration of the relation between prototype theory and fuzzy clustering.
The partitional clustering algorithms have a great similarity to the theory of prototypes. The latter is concerned with building categories about concepts, whereas the clustering algorithms focus on the unsupervised classification of numerical data; however, both approaches have the same objective.
In this section we give an interpretation of the typicality of the fuzzy clustering algorithms, based on a psychological and cognitive interpretation, as presented in the previous section and as a way to gain greater knowledge than usual from numerical data sets.
Ruspini [
With the FCM, the calculus of the membership degree of a point
The Possibilistic c-means (PCM) clustering algorithm was proposed in [
The prototypes are selected as the best examples to represent categories or groups according to a given criterion, and they have the most important features. In the case of birds, for example, the dove is more typical than the ostrich and the penguin, because it has more of the features of a bird. However, ostriches and penguins are still members of the category of birds. Therefore, there is an internal resemblance among the members of a group and an external dissimilarity to the members of other categories, even when several categories share some features, as happens with birds and reptiles, since both kinds of animals reproduce by laying eggs.
A similar situation happens with a numerical data set; that is, it is possible to take into account an external dissimilarity and an internal resemblance through distance measures, which quantify the similarity or the dissimilarity among patterns and prototypes of the different groups. Among the most used distance measures are the Euclidean and the Mahalanobis distances. The former is used when the correlation of patterns is low; in this case the algorithm identifies groups with spherical forms. The latter is preferred for patterns with medium or high correlation. However, the correct choice of distance measure depends on the available data and the statistical distribution of the features (attributes).
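The difference between the two distance measures can be illustrated with a minimal sketch. The prototype `v`, the covariance matrix `cov`, and the two test points are hypothetical values chosen for illustration; they are not taken from the data sets used in this work.

```python
import numpy as np

def euclidean_sq(x, v):
    """Squared Euclidean distance between a pattern x and a prototype v."""
    d = x - v
    return float(d @ d)

def mahalanobis_sq(x, v, cov):
    """Squared Mahalanobis distance, using the covariance of the group."""
    d = x - v
    # Solve cov * y = d instead of forming the inverse explicitly.
    return float(d @ np.linalg.solve(cov, d))

# Hypothetical elongated group with strongly correlated features.
v = np.array([0.0, 0.0])
cov = np.array([[4.0, 3.6],
                [3.6, 4.0]])
along = np.array([2.0, 2.0])    # lies along the main axis of the group
across = np.array([2.0, -2.0])  # lies across the main axis

e_along, e_across = euclidean_sq(along, v), euclidean_sq(across, v)
m_along = mahalanobis_sq(along, v, cov)
m_across = mahalanobis_sq(across, v, cov)
print(e_along, e_across)   # identical under the Euclidean distance
print(m_along, m_across)   # very different under the Mahalanobis distance
```

Both points are equally far from the prototype in the Euclidean sense, whereas the Mahalanobis distance treats the point lying across the main axis of the group as much further away, which is the behavior wanted for correlated features.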
Thus, the FCM and the PCM algorithms provide information about a numerical data set (see Figure
Relative typicality (FCM) and absolute typicality (PCM).
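As a hedged illustration of the two kinds of typicality, the following sketch computes the standard FCM membership and the standard PCM typicality from a matrix of squared distances. The one-dimensional prototypes, the three points, and the scale values in `gamma` are assumptions made for this example only.

```python
import numpy as np

def fcm_membership(d2, m=2.0):
    """Relative typicality: d2 has shape (c, n); each column sums to 1."""
    inv = np.maximum(d2, 1e-12) ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=0, keepdims=True)

def pcm_typicality(d2, gamma, eta=2.0):
    """Absolute typicality: depends only on the distance to each prototype."""
    return 1.0 / (1.0 + (d2 / gamma[:, None]) ** (1.0 / (eta - 1.0)))

# Hypothetical 1-D example: prototypes at 0 and 10; points at 1, 5, and 100.
prototypes = np.array([0.0, 10.0])
points = np.array([1.0, 5.0, 100.0])
d2 = (points[None, :] - prototypes[:, None]) ** 2
gamma = np.array([4.0, 4.0])  # assumed scale parameter per group

U = fcm_membership(d2)
T = pcm_typicality(d2, gamma)
print(np.round(U, 3))
print(np.round(T, 4))
```

The outlier at 100 still receives memberships close to 0.5 in both groups, because the memberships must sum to 1, while its typicality is close to 0 for both groups. This is the practical difference between the relative and the absolute typicality.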
Numerical data sets are complex because they generally have a lot of data, and they could be incomplete, or data could be imprecise, uncertain, and even vague in certain cases. Thus, the extracted knowledge from them must agree with the features and the data set, and this information must be easily understood by users. The first step is then to identify the prototypes, afterwards the groups, and finally Rosch's typicality of data.
The PFCM algorithm is based on the Euclidean distance, so the identified clusters are constrained to spherical shapes, as described previously. To give more flexibility to the algorithm, such that the identified clusters are better adapted to the distribution of groups in the data set, we decided to use the Gustafson Kessel Possibilistic Fuzzy c-means (GKPFCM) algorithm proposed by Ojeda-Magana et al. [
Provide an initial value for the prototypes (centers).
Find the value of the parameter
Choose standard parameters by GK-B [
Choose standard parameters by PFCM (
Calculate the covariance matrices for each group according to
and estimate the covariance as suggested in [ where
Finally
Calculate the distance among data and prototypes:
Determine the membership matrix
Determine the typicality matrix
Modify the prototypes
Verify that the error is within the proposed tolerance
If the error is greater than
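The steps above can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the standard Gustafson-Kessel norm-inducing matrix, the standard PFCM updates for memberships, typicalities, and prototypes, a small ridge term as a stand-in for the GK-B covariance estimation, and a scale `gamma` computed from an initial Euclidean FCM step. The function names, the default parameter values, and the synthetic data are all hypothetical.

```python
import numpy as np

def fcm_u(d2, m):
    """FCM membership from squared distances of shape (c, n)."""
    inv = np.maximum(d2, 1e-12) ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=0, keepdims=True)

def gkpfcm_sketch(X, V0, a=1.0, b=1.0, m=2.0, eta=2.0, tol=1e-5, max_iter=200):
    X = np.asarray(X, dtype=float)
    V = np.array(V0, dtype=float)          # initial prototypes
    c = len(V)
    p = X.shape[1]
    # Initial Euclidean distances, memberships, and scale parameter (assumption).
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
    U = fcm_u(d2, m)
    gamma = (U**m * d2).sum(axis=1) / (U**m).sum(axis=1)
    T = np.ones_like(U)
    for _ in range(max_iter):
        W = U**m
        for i in range(c):
            diff = X - V[i]
            # Fuzzy covariance matrix of group i.
            F = (W[i][:, None] * diff).T @ diff / W[i].sum()
            F += 1e-6 * np.eye(p)  # ridge: simplified stand-in for GK-B
            # Norm-inducing matrix with unit determinant (volume constraint 1).
            A = np.linalg.det(F) ** (1.0 / p) * np.linalg.inv(F)
            d2[i] = np.maximum(np.einsum("nj,jk,nk->n", diff, A, diff), 0.0)
        U = fcm_u(d2, m)                                   # membership matrix
        T = 1.0 / (1.0 + (b * d2 / gamma[:, None]) ** (1.0 / (eta - 1.0)))
        G = a * U**m + b * T**eta
        V_new = (G @ X) / G.sum(axis=1, keepdims=True)     # prototype update
        shift = np.linalg.norm(V_new - V)
        V = V_new
        if shift < tol:                                    # tolerance check
            break
    return V, U, T

# Hypothetical data: two elongated, correlated groups.
rng = np.random.default_rng(0)
cov = [[4.0, 3.6], [3.6, 4.0]]
X = np.vstack([rng.multivariate_normal([0.0, 0.0], cov, 100),
               rng.multivariate_normal([12.0, 0.0], cov, 100)])
V, U, T = gkpfcm_sketch(X, V0=[[1.0, 1.0], [11.0, -1.0]])
print(np.round(V, 2))
```

The returned `U` plays the role of the relative typicality and `T` that of the absolute typicality, so both are available after a single run.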
As previously discussed, one example from the theory of prototypes is the category of birds, which includes birds that can fly and birds that cannot. The first kind of bird is more typical than the second. If we try to score the characteristic of “flying,” the less typical birds would get a very low score, which places them further from the prototype. In the same way, when working with typicality through fuzzy clustering algorithms, the typicality can be used after groups are identified, and data further away from the prototypes can be qualified as atypical. However, as there are no constraints in a numerical data set, data points that are extremely far away from the prototypes could be considered as noise if they do not belong to another group.
Using a hybrid algorithm, such as the GKPFCM, it is possible to identify the groups and to use the typicality values of the algorithm in a way that different subsets of data in each group can be defined depending on how typical they are. In order to do this, it is necessary to establish thresholds dividing each group into typical, atypical, and noise data [
Applying the GKPFCM clustering algorithm we can identify groups and their prototypes. In this particular work and in order to maintain a more direct relationship with the theory of prototypes, we will only divide each group into typical and atypical data points, using the typicality values (matrix Propose the parameters Run the GKPFCM algorithm to estimate the relative typicality provided by the From the From the
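The division of each group into typical and atypical points can be sketched as follows. The sketch assumes that each point is assigned to the group with the highest membership and is then labeled typical when its typicality to that group reaches a threshold; the matrices and the threshold value of 0.3 are hypothetical and serve only as an illustration.

```python
import numpy as np

def split_by_typicality(U, T, threshold):
    """Return the group label of each point and a typical/atypical flag.

    U, T: arrays of shape (c, n) with memberships and typicality values.
    """
    labels = U.argmax(axis=0)                      # group of each point
    own_t = T[labels, np.arange(T.shape[1])]       # typicality to own group
    typical = own_t >= threshold
    return labels, typical

# Hypothetical values for three points and two groups.
U = np.array([[0.9, 0.6, 0.2],
              [0.1, 0.4, 0.8]])
T = np.array([[0.80, 0.10, 0.05],
              [0.02, 0.05, 0.70]])
labels, typical = split_by_typicality(U, T, threshold=0.3)
print(labels)   # [0 0 1]
print(typical)  # [ True False  True]
```

The second point belongs to the first group by membership but is flagged as atypical, since its typicality to that group is below the threshold.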
In the next two subsections we apply this approach to a synthetic numerical data set and to a digitized image of a glass in a first example and four color images from the Berkeley database in a second example.
For the synthetic numerical data set we use similar data as that presented by Lesot and Kruse [
(a) Synthetic data set
To reduce the effects of noise in the prototype it is necessary to make a good choice of parameters
For this work, we use the values
Figure
(a) Orthogonal projection of
The presence of noise in the
(a) Partition based on the external dissimilarity. (b) Partition based on the internal resemblance.
On the other hand, the evaluation of the
For example, selecting a threshold of
As it is shown in Figure
One of the most challenging problems in computer vision is the segmentation of images, since an image must be divided into regions of interest and the objects in the scene must be clearly identified. The aim of segmentation is to divide an image into nonoverlapping regions that are homogeneous in some features such as gray levels of pixels, color, texture, and depth, or even a combination of some of these. The level to which the subdivision is carried out depends on the problem being solved.
Partitional clustering algorithms are considered for image segmentation, because of the great similarity between segmentation and clustering, although clustering was developed for feature space, whereas segmentation was developed for the spatial domain of an image.
From a general point of view, the segmentation of images can be divided into two types: region based or edge based. In this work we focus on the former approach, where the objects in the image result from homogeneous regions inside the RGB color space. We use two examples to show these results, the first one concerning the analysis of the differences with the typicality of Rosch, whereas the second one is based on the absolute typicality in order to improve the results of segmentation.
As a first test we have used an image of a glass where we identify this object and separate it from the background. As can be seen in Figure
Original image of a glass.
The clustering algorithms work in the features space; the RGB color space for this work. For the final result we use the mean value for each cluster. So, the quality of the segmentation can be evaluated through the similarity between the real colors of the image and the identified colors based on the concept of typicality.
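The use of the mean value of each cluster for the final result can be sketched as below. The function name and the tiny 2 x 2 image are hypothetical; the label array is assumed to come from any of the clustering algorithms applied to the RGB values of the pixels.

```python
import numpy as np

def recolor_by_cluster_mean(image, labels):
    """Replace every pixel by the mean RGB color of its cluster.

    image: array of shape (h, w, 3); labels: integer array of shape (h, w).
    """
    h, w, ch = image.shape
    pixels = image.reshape(-1, ch).astype(float)
    flat = labels.reshape(-1)
    out = np.zeros_like(pixels)
    for k in np.unique(flat):
        mask = flat == k
        out[mask] = pixels[mask].mean(axis=0)
    return out.reshape(h, w, ch)

# Hypothetical 2 x 2 image with two clusters.
image = np.array([[[10, 0, 0], [20, 0, 0]],
                  [[0, 100, 0], [0, 200, 0]]])
labels = np.array([[0, 0],
                   [1, 1]])
segmented = recolor_by_cluster_mean(image, labels)
print(segmented[0, 0], segmented[1, 1])
```

Comparing `segmented` with the original image gives a simple view of how well the identified colors match the real ones.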
In order to compare the results, two clustering algorithms have been used. The first algorithm is the FCM, used to identify two clusters in the image. The algorithm provides the
As can be seen in Figure
Segmentation results of the clustering algorithm: FCM at the upper image, GKPFCM (
As we are interested in the typicality concept for knowledge discovery in numerical data sets, we have used the GKPFCM algorithm such that the absolute and the relative typicalities are available, and the results can be improved using them. Figure
Comparing the results of the previous images, the FCM gives comparable or even better results than the GKPFCM based on the relative typicality. However, this last algorithm uses a measure that allows better adaptation of the clusters to the distribution of data, and hence the results provide more homogeneous regions. This can be seen more easily through the absolute typicality whose results are shown in Figure
Regarding the results of the absolute typicality, this provides additional information that can be used to get more homogeneous segmented regions and to identify the atypical pixels inside them. This process has been called
Four images from the Berkeley database (see Figure
Segmentation results of the four images from the Berkeley database, with four partitional clustering algorithms.
The first image from the Berkeley database concerns an airplane where we can find three objects: the airplane (region1), the clouds (region2), and the sky (region3). The segmentation with the k-means and the FCM results in an airplane where almost half of its pixels, region1, belong to the sky, region3.
On the other hand, the GK-B gives better results than the previous algorithms, as the segmented regions represent the objects in the image more closely. The GKPFCM algorithm was also applied, and the corresponding results are analyzed according to the relative and absolute typicalities. The results are not so good in the first case, as region1 is totally associated with region2. However, the results are much better when the absolute typicality is used, and the threshold is established at
The only drawback of the previous results is that the atypical pixels, in the RGB feature space, are located at both extremes of the ellipsoids. This depends on the particular distribution of pixels in the feature space and the value assigned to the threshold. For the airplane, the atypical pixels allow a much better identification of the object. However, in this case the atypical set also includes the lighter pixels of the sky, which can be seen at the lower left side of the resulting image.
The image of the field was also segmented into three regions: the stones (region1), the bushes (region2), and the earth (region3). Neither the k-means, the FCM, nor the GK-B was able to detect region2; only the GKPFCM detected it, and it gives better results when a threshold
The image of the horses was segmented into four classes, and we got good results with all the algorithms. Nevertheless, if the images are observed carefully, many details are lost. Take for example the white line in front of the horse's head. The k-means, FCM, and GK-B algorithms associate this object with other regions and, even if the number of classes is increased, this region is not detected. On the other hand, the GKPFCM with a very low threshold
The last image was that of a parachute. This image was also segmented into three regions. Here we find that the k-means, the FCM, and the GKPFCM-U are incapable of correctly identifying the parachute. The exceptions are the GK-B and the GKPFCM. For the last algorithm a threshold
The results of the previous examples show that the usefulness of the absolute typicality is clear, especially for the segmentation of color images. In this case the typicality value can be viewed as a means to apply a homogenization procedure, as the division of each region into typical and atypical pixels gives results with the most uniform pixels.
One of the major challenges when looking for patterns in data sets and images is to find the most homogeneous groups in a feature space. In this work, we have used the GKPFCM clustering algorithm, which has the following features.
It adapts better to the natural shape of the data, that is, convex hyperellipsoids.
It provides both the relative and the absolute typicalities.
It has parameters, whose particular function has been explained, to give more importance to either dissimilarities or resemblances.
With the internal resemblance provided by the algorithm, we can identify the atypical data of each class, and more homogeneous regions can be formed without the need to increase the number of groups.
The first point is a result of using the Mahalanobis distance in the GKPFCM algorithm, which leads to a better partition of the feature space. However, the drawback of the algorithm is its limited ability to identify nonconvex groups; that is, as the nonconvexity becomes more severe, the quality of the results diminishes.
The second point follows from the availability of the relative and absolute typicalities, which are directly associated, through the relationship between fuzzy clustering and the theory of prototypes, with an external dissimilarity and an internal resemblance. This allows for a better categorization and, additionally, gives the degree of typicality of each object to each category. The simplest cases are the k-means, which provides a discrete relative typicality, and the FCM, which provides a continuous relative typicality in the interval
Through the four parameters of the GKPFCM algorithm (
With the absolute typicality, or the internal resemblance, it has been possible to identify the atypical data inside each cluster. The result is a set of typical data and a set of atypical data, the former representing a more homogeneous region in this case. Nevertheless, the atypical set is not necessarily homogeneous, as the data could be located at both extremes of the corresponding ellipsoid in the RGB color space.
In this work the images are in the RGB color space, even though there are other proposed color spaces in order to improve the results of image processing [
Taking advantage of the typicality values, the absolute typicality, or the internal resemblance, we are able to enhance the segmentation process and to get more homogeneous regions, at least for the typical data. This is a promising result, as we are able to find very small homogeneous regions, which are very difficult to identify, even if the number of regions to segment is increased to a very large value.
Categorizing data into concepts, analogous to the theory of prototypes, allows us to understand the problem of unsupervised classification (also known as clustering) and to propose an approach to look for particular points inside each of the categories. This was shown using a synthetic numerical data set, a digitized image of a glass, and four images from the Berkeley database. In these cases, the external dissimilarity and internal resemblance are used in a better way, and more information can be obtained from the same data compared to a classical approach. The existence of hybrid algorithms, such as the GKPFCM or the PFCM, allows us to get both values at the same time, providing more information about the internal structure of data sets. In this work we have related classifications made by human beings to those made by automatic algorithms. This approach is particularly useful when looking for special cases inside an image. For this reason we have attempted to join the theory of prototypes and the partitional clustering algorithms.
The authors wish to thank The National Council for Science and Technology (CONACyT) in Mexico and the Departamento de Sistemas de Información (CUCEA) and the Departamento de Ingeniería de Proyectos (CUCEI) at the Universidad de Guadalajara for the help provided to complete this study.