A Novel Robust Fuzzy Rough Set Model for Feature Selection

School of Instrument Science and Engineering, Southeast University, Nanjing 210000, China; School of Control Science and Engineering, Shandong University, Jinan 250100, China; Department of Anesthesiology, The Third Xiangya Hospital, Central South University, Changsha 410013, China; Science and Technology on Information Systems Engineering Laboratory, The 28th Research Institute of CETC, Nanjing 210000, China


Introduction
In the current era of big data, datasets are massive in scale and high-dimensional in representation. The high dimensionality arises mainly because data often contain a large number of redundant or irrelevant features, which seriously reduces the processing capacity and time efficiency of pattern classification as well as the resolution ability of decision making. High-dimensional data also pose great challenges to fast, timely, and accurate data mining. Therefore, how to effectively select features for such data has become one of the hot topics in machine learning [1, 2]. The purpose of feature selection is to remove irrelevant and redundant features from the original feature set, while preserving learning performance, so as to find a feature subset containing all or most of the classification information of the original feature space; this mitigates the "curse of dimensionality" and improves learning performance. Feature selection (or attribute reduction) is therefore necessary and has become a research hotspot in machine learning. Meanwhile, fuzzy rough set (FRS) theory is not only an objective and effective mathematical tool for handling incomplete and uncertain information [3][4][5][6][7] but also a powerful and effective computing paradigm for feature selection [8][9][10]. In recent years, FRS theory has received wide attention and has been applied in data mining, machine learning, pattern recognition, and other fields [11][12][13].
FRS can effectively deal with the fuzziness and vagueness of data. However, the upper and lower approximations of the classical FRS model are computed from the nearest sample of a given target sample, so the classical FRS model, constrained by this nearest sample, is extremely sensitive to noise.
To improve the robustness of classical FRS and reduce the influence of noisy samples on the approximations, many robust FRS models have been proposed [14][15][16][17][18][19][20]. An important application of FRS theory is feature selection, also known as attribute reduction, which removes features while maintaining the consistency between features and decision labels. Inconsistency manifests as two samples having the same feature values but different decision labels. By deleting redundant or irrelevant features, the classification performance of a learning algorithm can be improved and running time and storage saved, so that the actual problem can be understood more clearly on the basis of FRS. Feature selection based on FRS removes redundant and irrelevant features from data without changing the classification ability of the data. An advantage of the FRS method is that, during data processing, it uses only the information in the data itself, without any prior knowledge or additional information. Generally speaking, there are two types of FRS-based feature selection methods: heuristic methods based on dependency [21][22][23] and structured methods based on a discernibility matrix [24][25][26][27][28].
The dependency-based heuristic methods use the positive region and the dependency function [21][22][23] as the feature evaluation criterion and obtain a feature subset by heuristic search. Specifically, FRS-based feature selection proceeds by forward or backward search while keeping a certain metric constant. Pawlak [1] introduced the concept of the positive region, which describes the consistency between features and decision labels, to select features. Hu and Cercone [23] designed an algorithm that computes a reduct while keeping the positive region of the decision unchanged. The other kind of method is the structured method based on a discernibility matrix. By introducing the discernibility matrix, a Boolean discernibility function is constructed, and the minimal form of this function is obtained by logical operations, thus yielding all possible reducts of the decision table. Based on the discernibility information in the matrix, many scholars have studied the computation of reducts [24][25][26][27][28]. For example, Yao and Zhao [24] transformed a discernibility matrix into its minimal form by designing an absorption-based algorithm, so that the union of the remaining elements constituted a reduct. Because this algorithm needs to compute all elements of the discernibility matrix, it consumes a lot of computation time. To save running time and storage space, Chen et al. [27, 28] determined the minimal elements of the discernibility matrix through a sample pair selection (SPS) method and designed a fast reduction algorithm based on the minimal elements.
In our previous work [29], we proposed a novel FRS model, namely the fuzzy rough set with representative sample (RS-FRS) model, constructed to reduce the influence of noisy samples. In the RS-FRS model, the fuzziness of sample membership is taken into consideration. Using the fuzzy equivalence approximation space, other subsets of the domain space can be approximated more precisely than with a conventional FRS model. The RS-FRS model does not require preset parameters, and our pilot study indicates that such a framework can reduce model complexity and human intervention. However, the previous work needs further research in several aspects. First, the theorems supporting the model were not thoroughly derived, calling for additional theoretical work to strengthen the mathematical background. Moreover, the verification of the proposed model was not comprehensive, since the model was tested using only the KNN classifier. Finally, the previous work lacked solid statistical validation, given its nature as a pilot study exploring the method's potential effectiveness.
In this manuscript, we extensively address these drawbacks. We conduct detailed derivations of Properties 2 and 3 to complete the RS-FRS model at the theoretical level. The completed theory also supports the construction of the feature selection algorithm, which is central to the RS-FRS framework. In addition to the previous KNN classifier, the performance evaluation of the proposed method is extended by validating the algorithm with CART and LSVM classifiers. Our new results show satisfactory accuracy and robustness, further supporting the method's effectiveness. Finally, a comprehensive statistical analysis of the results is conducted and reported by applying Friedman tests and Bonferroni-Dunn tests to the model outputs. In summary, the previous pilot study is strengthened with theoretical, experimental, and statistical evidence demonstrating the performance and robustness of the proposed RS-FRS method.

Preliminaries
The equivalence classes of the Pawlak rough set are crisp subsets of the domain. These crisp information granules cannot reflect the fuzziness in reasoning. In practical classification learning, the features describing samples may be fuzzy, or the relations between samples may be fuzzy relations computed from numerical attributes. Therefore, FRS came into being as an extension of the Pawlak rough set model.
For a nonempty universe U, if R is a fuzzy binary relation satisfying reflexivity, symmetry, and sup-min transitivity, then R is a fuzzy equivalence relation. The fuzzy equivalence class [x]_R is generated by R with respect to a sample x ∈ U; [x]_R is a fuzzy set on U, also referred to as the fuzzy neighborhood of x.

Definition 1. Given a nonempty finite domain U = {x_1, x_2, ..., x_n} and a fuzzy equivalence relation R on U, for ∀x_i ∈ U, the fuzzy equivalence class of x_i is [x_i]_R = r_i1/x_1 + r_i2/x_2 + ... + r_in/x_n, a fuzzy subset of U. The membership degree of ∀x_j ∈ U to [x_i]_R is [x_i]_R(x_j) = r_ij. The set of fuzzy equivalence classes forms a basic conceptual system for approximating any subset of the domain, called the fuzzy equivalence approximation space FAS = 〈U, R〉. The upper and lower approximations of X ∈ F(U) in 〈U, R〉 are defined, for ∀x ∈ U, as

  R̄X(x) = sup_{y∈U} min(R(x, y), X(y)),
  R̲X(x) = inf_{y∈U} max(1 − R(x, y), X(y)).  (1)

In equation (1), the upper approximation at a sample x ∈ U is determined by R(x, y) and X(y), and the lower approximation is determined by 1 − R(x, y) and X(y), where X(y) is the membership degree of the sample y ∈ U with respect to the fuzzy set X ∈ F(U).
At present, the existing fuzzy rough set models [14][15][16][17][18][19][20] assume that the decision attribute D divides the sample set into several crisp decision classes, so the membership degree X(y) in the upper and lower approximations of the FRS model is a binary function taking values 0 or 1. The upper and lower approximations of the classical FRS model then reduce to

  R̄X(x) = sup_{y∈X} R(x, y),
  R̲X(x) = inf_{y∉X} (1 − R(x, y)).  (2)
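The approximations in equations (1) and (2) can be computed directly from a fuzzy relation matrix. The sketch below (Python with NumPy; the function name is ours, not from the paper) implements equation (1) with the standard min/max operators; passing a crisp 0/1 membership vector recovers the classical model of equation (2).

```python
import numpy as np

def fuzzy_approximations(R, X):
    """Equation (1) with standard operators T = min, S = max, N(a) = 1 - a.

    R: (n, n) matrix of a fuzzy equivalence relation, R[i, j] = R(x_i, x_j).
    X: (n,) membership vector of a fuzzy set X in F(U); a crisp 0/1 vector
       recovers the classical approximations of equation (2).
    """
    upper = np.max(np.minimum(R, X[None, :]), axis=1)        # sup_y min(R(x,y), X(y))
    lower = np.min(np.maximum(1.0 - R, X[None, :]), axis=1)  # inf_y max(1-R(x,y), X(y))
    return lower, upper

# Two samples: x1 in X, x2 not in X, similarity R(x1, x2) = 0.8.
R = np.array([[1.0, 0.8], [0.8, 1.0]])
lower, upper = fuzzy_approximations(R, np.array([1.0, 0.0]))
# lower(x1) = 1 - R(x1, x2) = 0.2: the nearest different-class sample dominates,
# which is exactly the noise sensitivity discussed above.
```

Note how lower(x1) is governed entirely by the closest sample outside X, illustrating why a single mislabeled neighbor can collapse the classical lower approximation.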

The Proposed Model
At present, the existing FRS models [14][15][16][17][18][19][20] assume that the decision attribute D divides the sample set U into several crisp decision classes, so the membership degree X(y) in equation (1) is a binary function taking values 0 or 1, and the upper and lower approximations of the classical FRS model degenerate to equation (2). However, this strategy of dividing the sample set into crisp decision classes is extremely sensitive to noisy samples; when noisy samples exist, X(y) carries fuzzy uncertainty. Defining X(y) only as a binary 0/1 function does not meet the requirements of practical applications and cannot reflect the real membership relations between samples and each equivalence class. Therefore, determining the membership of samples is another important challenge for fuzzy rough set models.

Representative Sample.
Unlike the existing FRS models, we consider that D divides U into several fuzzy decision classes and define a "representative sample" to calculate the fuzzy membership of samples. Specifically, a representative sample is found for each label, and the fuzzy membership degree of a target sample with respect to each label is then calculated from the distance between the target sample and the representative sample, yielding a robust FRS model.

Definition 2 (see [29]). Let 〈U, A ∪ D〉 be a fuzzy decision system, where the sample set U = {x_1, x_2, ..., x_n} has m attributes A = {a_1, a_2, ..., a_m}. The decision attribute D divides U into r crisp equivalent decision classes U/D = {D_1, D_2, ..., D_i, ..., D_r} (1 ≤ i ≤ r). The representative sample RS_i of the class D_i is defined as

  RS_i = arg min_{x ∈ D_i} Σ_{y ∈ D_i} d(x, y),  (3)

where d(x, y) is the distance between two samples in the class D_i. In this paper, the Euclidean distance is used as a basic implementation of d(x, y).
Definition 3 (see [29]). Let 〈U, A ∪ D〉 be a fuzzy decision system, where the sample set U = {x_1, x_2, ..., x_n} has m attributes A = {a_1, a_2, ..., a_m}. The decision attribute D divides U into r crisp equivalent decision classes U/D = {D_1, D_2, ..., D_i, ..., D_r} (1 ≤ i ≤ r), and the representative sample of the class D_i is RS_i. The membership degree of a sample x ∈ U with respect to class D_i is defined, inversely proportional to the distance from x to RS_i and normalized over all classes, as

  D_i(x) = (1/d(x, RS_i)) / Σ_{j=1}^{r} (1/d(x, RS_j)),  (4)

where d(x, RS_i) is the distance between sample x and representative sample RS_i. According to Definition 3, the membership degree of sample x with respect to each equivalence class is determined by calculating the distance between x and the representative sample of each class.
It can be seen that D_i(x) fully reflects the fuzziness of sample membership. The larger the value of D_i(x), the higher the degree to which sample x belongs to class D_i; the smaller the value, the lower the degree. In a dataset, the samples located in the boundary region may be noise samples that have been mislabeled. By calculating the fuzzy membership, the degree to which boundary samples belong to each class can be determined.
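As a concrete illustration, the following sketch computes a medoid-style representative sample per class (Definition 2) and a normalized inverse-distance membership in the spirit of Definition 3; the exact membership formula follows [29], so the normalization used here is our assumption for illustration, and the function names are ours.

```python
import numpy as np

def representative_samples(X, y):
    """Definition 2, read as a class medoid: the sample of D_i minimizing
    the total Euclidean distance to all samples of D_i."""
    reps = {}
    for c in np.unique(y):
        Xc = X[y == c]
        D = np.linalg.norm(Xc[:, None, :] - Xc[None, :, :], axis=2)
        reps[c] = Xc[np.argmin(D.sum(axis=1))]
    return reps

def fuzzy_membership(x, reps, eps=1e-12):
    """One reading of Definition 3: membership to class D_i inversely
    proportional to d(x, RS_i), normalized over all classes (assumed form)."""
    classes = sorted(reps)
    d = np.array([np.linalg.norm(x - reps[c]) for c in classes])
    w = 1.0 / (d + eps)  # eps guards against x coinciding with a representative
    return dict(zip(classes, w / w.sum()))

X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])
y = np.array([0, 0, 1, 1])
m = fuzzy_membership(np.array([0.2, 0.1]), representative_samples(X, y))
# The query point lies near class 0, so m[0] > m[1], and the memberships sum to 1.
```

The memberships of a boundary point are split between the classes rather than forced to 0 or 1, which is exactly the fuzziness the text attributes to D_i(x).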

FRS Model with Representative Sample. Based on Definitions 2 and 3, we propose an FRS model with representative samples (RS-FRS).
Definition 4 (see [29]). Let 〈U, A ∪ D〉 be a fuzzy decision system, where the sample set U = {x_1, x_2, ..., x_n} has m attributes A = {a_1, a_2, ..., a_m}, and let the decision classes D_i have the fuzzy memberships D_i(x) given by Definition 3. The upper and lower approximations of the RS-FRS model are defined, for ∀x ∈ U, as

  R̄_RS D_i(x) = sup_{y∈U} min(R(x, y), D_i(y)),
  R̲_RS D_i(x) = inf_{y∈U} max(1 − R(x, y), D_i(y)).  (5)

The lower approximation of the FRS model indicates the certainty that a sample belongs to its decision class in the fuzzy approximation space, and the upper approximation indicates the possibility that it does. Therefore, the lower approximation can be used for classification and feature selection. The samples located on the classification boundary are the most likely to be noise samples. Because these samples are close to samples of other classes, they are often the ones used to compute the lower approximation of the classical FRS model, which makes the lower approximation smaller. In the proposed RS-FRS model, we consider not only the distance between the target sample and its nearest sample from a different class but also the fuzzy membership of that nearest sample. The fuzzy membership degree expands the lower approximation of the model and reduces the influence of noise samples on it; thus, RS-FRS is robust. The main difference between the RS-FRS model and the classical FRS is that the classical model ignores the fuzziness of sample membership when computing the upper and lower approximations. This easily leads to errors in the approximations of the classical FRS model when datasets contain noise samples. Therefore, the classical FRS can only maintain the maximum fuzzy dependence but cannot handle noisy information well.
In contrast, the RS-FRS model considers the fuzziness of the sample membership degree and can approximate other subsets of the domain space more precisely with the fuzzy equivalence approximation space, so the data are fitted better. Compared with existing robust FRS models, the RS-FRS model does not need parameters to be set in advance, which effectively reduces model complexity and human intervention.
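The robustness mechanism can be seen numerically. In the sketch below (illustrative values, not the paper's data), a mislabeled point sits inside class A; with crisp 0/1 memberships the classical lower approximation of a class-A sample collapses, while a fuzzy membership D_A(y) > 0 for the mislabeled point keeps the RS-FRS lower approximation of equation (5) large.

```python
import numpy as np

# Four 1-D samples; x4 = 0.05 is labeled B but lies inside class A (noise).
x = np.array([0.0, 0.1, 1.0, 0.05])
R = np.maximum(0.0, 1.0 - np.abs(x[:, None] - x[None, :]))  # fuzzy relation

def lower_approx(R, D):
    """Lower approximation: inf_y max(1 - R(x, y), D(y))."""
    return np.min(np.maximum(1.0 - R, D[None, :]), axis=1)

crisp_A = np.array([1.0, 1.0, 0.0, 0.0])      # classical: noise point gets 0
fuzzy_A = np.array([0.95, 0.95, 0.10, 0.90])  # RS-FRS-style membership (illustrative)

classical = lower_approx(R, crisp_A)[0]  # 0.05: ruled by the mislabeled neighbor
rs_frs = lower_approx(R, fuzzy_A)[0]     # 0.90: D_A(x4) = 0.9 expands it
```

The classical value 0.05 is exactly 1 − R(x1, x4), i.e., dictated by the noise point, whereas the RS-FRS value reflects that x4 still overwhelmingly resembles class A.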

Related Properties of the FRS Model with Representative Sample.
For the standard max operator S(x, y) = max(x, y), the standard min operator T(x, y) = min(x, y), and the standard complement operator N(x) = 1 − x, some properties of the RS-FRS model are discussed. If other fuzzy operators are used [13], the relevant conclusions can be generalized similarly.
Property 1 (see [29]). For ∀A ∈ F(U), the upper and lower approximations of the RS-FRS model are dual: R̲_RS A = N(R̄_RS(N(A))) and R̄_RS A = N(R̲_RS(N(A))), where N(A)(x) = 1 − A(x). □

Property 2. For ∀A ∈ F(U), if x is a normal sample, the following statements hold: (1) R̲_RS A ⊆ A; (2) A ⊆ R̄_RS A.

Proof. Because x is a normal sample, x itself can be used to compute its lower approximation R̲_RS A(x). By the reflexivity of R, R̲_RS A(x) ≤ max(1 − R(x, x), A(x)) = max(0, A(x)) = A(x); therefore, R̲_RS A ⊆ A. Similarly, x can be used directly to compute its upper approximation R̄_RS A(x), and R̄_RS A(x) ≥ min(R(x, x), A(x)) = A(x); therefore, A ⊆ R̄_RS A. □

Property 3. For ∀A ∈ F(U), the following statements hold: (1) R̲_RS A ⊇ R̲_S A; (2) R̄_RS A ⊆ R̄_T A, where R̲_S A and R̄_T A are the lower and upper approximations of the classical FRS model in equation (2) under the standard operators.

Proof. ∀x ∈ U, according to equation (2) in Section 2, under the standard operators the classical lower approximation uses the crisp membership 0 for samples outside the decision class, whereas the RS-FRS model uses a fuzzy membership not smaller than 0; the infimum, attained at the nearest sample of a different class, can therefore only increase, so R̲_RS A ⊇ R̲_S A. Dually, the classical upper approximation uses the crisp membership 1 for samples inside the class, whereas the RS-FRS model uses a fuzzy membership not greater than 1; therefore, R̄_RS A ⊆ R̄_T A. □

Property 1 establishes the duality between the upper and lower approximations. Property 2 gives the inclusion relationship between the approximations and the original fuzzy set. Property 3 relates the approximations of the RS-FRS model to those of the classical FRS model. The presentation and derivation of these properties provide a theoretical basis for the design of the subsequent algorithms.

Feature Selection Based on FRS Model with Representative Sample
Although the discernibility-matrix-based method [27] can be used for feature selection, it has two obvious disadvantages. (1) Wasted computing time: all feature subsets can be obtained with the discernibility matrix method, but when the result of feature selection is used for pattern recognition or classification, a single feature subset is sufficient; in other words, computing the rest is invalid time cost. (2) Wasted storage space: the discernibility matrix method must store all elements of the matrix and reduce them through the absorption operator, whereas the key to finding one feature subset is determining the minimal elements; compared with storing only the minimal elements, the matrix method wastes storage. Therefore, in this paper we use the improved discernibility matrix method, namely the sample pair selection (SPS) algorithm [28]. Since each minimal element is determined by at least one sample pair, the SPS algorithm can select the sample pairs corresponding to the minimal elements in the sample set.
Based on this, we propose a feature selection algorithm based on the RS-FRS model with SPS to save time and space costs.
The discernibility matrix of this decision system is an n × n matrix, denoted by M_D(U, A), whose element c_ij is

  c_ij = {a ∈ A : 1 − R_a(x_i, x_j) ≥ λ_i}, if x_j ∉ [x_i]_D,
  c_ij = ∅, otherwise,

where λ_i is the lower approximation of x_i under the RS-FRS model. In the RS-FRS model, the relative discrimination relation of a conditional attribute a_t with respect to the decision attribute D is defined as the binary relation

  DIS′(R_{a_t}) = {(x_i, x_j) ∈ U × U : 1 − R_{a_t}(x_i, x_j) ≥ λ_i, x_j ∉ [x_i]_D}.

By relating DIS′(R_{a_t}) to the elements c_ij of the discernibility matrix, we let N_ij denote the number of attributes discriminating the sample pair (x_i, x_j); obviously, N_ij = |c_ij|. According to the above definitions and analysis, a feature selection algorithm based on the RS-FRS model with SPS can be described as Algorithm 1.
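A minimal sketch of the SPS idea follows: build the discernibility sets c_ij for cross-class sample pairs at level λ_i, keep the minimal elements, and greedily cover them with attributes. The data layout and the greedy covering step are our simplifications for illustration, not the authors' exact Algorithm 1.

```python
def sps_select(Ra, labels, lam):
    """Ra: dict attribute -> n x n fuzzy relation under that attribute alone.
    labels: crisp decision labels; lam[i]: RS-FRS lower approximation of x_i.
    Returns a greedily chosen attribute subset covering the minimal elements."""
    n = len(labels)
    pairs = {}
    for i in range(n):
        for j in range(n):
            if labels[i] != labels[j]:
                # c_ij: attributes that discriminate (x_i, x_j) at level lam[i]
                c = frozenset(a for a in Ra if 1 - Ra[a][i][j] >= lam[i])
                if c:
                    pairs[(i, j)] = c
    # minimal elements: discernibility sets with no strict subset among the others
    minimal = {p: c for p, c in pairs.items()
               if not any(c2 < c for c2 in pairs.values())}
    selected, uncovered = [], set(minimal)
    while uncovered:
        # pick the attribute discriminating the most still-uncovered pairs
        best = max(Ra, key=lambda a: sum(a in minimal[p] for p in uncovered))
        selected.append(best)
        uncovered = {p for p in uncovered if best not in minimal[p]}
    return selected

labels = [0, 0, 1, 1]
rel = lambda v: [[1.0 if labels[i] == labels[j] else v for j in range(4)]
                 for i in range(4)]
Ra = {"a1": rel(0.1), "a2": rel(0.9)}  # only a1 separates the classes strongly
reduct = sps_select(Ra, labels, lam=[0.5] * 4)
```

On this toy system every cross-class pair is discriminated only by a1 (1 − 0.1 = 0.9 ≥ λ_i, while 1 − 0.9 = 0.1 < λ_i for a2), so the greedy cover selects a1 alone, mirroring how SPS avoids storing and absorbing the full matrix.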

Datasets.
In our experiments, 12 datasets from the open-access UCI repository [34] are utilized, as described in Table 1. The UCI repository is a commonly used collection of standard machine learning benchmark datasets. For example, the dataset "wine" contains the results of a chemical analysis of wines grown in the same region of Italy but derived from three different cultivars.
The analysis determined the quantities of 13 constituents found in each of the three types of wines. The attributes are alcohol, malic acid, ash, alkalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines, and proline.

Comparison of Classification Accuracy.
On three classifiers, we compare the classification accuracy of the different FRS models on the original data and on noisy data. The corresponding noise levels are 0%, 5%, and 10%. The 0% noise level means the original data (assumed to contain no noise); a 5% (or 10%) noise level means that 5% (or 10%) of the samples in the original data are mislabeled at random. To ensure the reliability of the experiments, we carry out 10-fold cross-validation for each independent random noise injection. Tables 2-4 report the results. From a global perspective over the three classifiers, the average precision of the RS-FRS model is higher than that of the other FRS models on both the original data (0%) and the noisy data (5% and 10%). Furthermore, examined per dataset, classifier, and noise level: with KNN, the RS-FRS model is optimal on 8 of the 12 datasets on the original data (0%) and on 15 of the 24 dataset-noise combinations on the noisy data (5% and 10%) in Table 2. Similar results are shown in Tables 3 and 4. With CART, the RS-FRS model is superior on 7 of 12 datasets on the original data and on 16 of 24 combinations on the noisy data in Table 3. With LSVM, the RS-FRS model is optimal on 6 of 12 datasets on the original data and on 16 of 24 combinations on the noisy data in Table 4. It is worth noting that for wine, soy, hepatitis, ICU, and WPBC, the 7 FRS models have the same or similar performance at the 0% noise level. This is because, under our assumption that the original data contain no noise, the 7 FRS models find the same sample as the nearest sample of the target sample when calculating its upper and lower approximations.
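The mislabeling noise used in the experiments can be sketched as follows; the function name and seeding are ours, but the operation, randomly relabeling a fixed fraction of samples to a different class, matches the 5%/10% protocol described above.

```python
import numpy as np

def add_label_noise(y, level, seed=0):
    """Relabel a fraction `level` of samples to a uniformly chosen wrong class."""
    rng = np.random.default_rng(seed)
    y_noisy = np.asarray(y).copy()
    classes = np.unique(y_noisy)
    # choose the indices to corrupt, without replacement
    flip = rng.choice(len(y_noisy), size=int(round(level * len(y_noisy))),
                      replace=False)
    for i in flip:
        y_noisy[i] = rng.choice(classes[classes != y_noisy[i]])
    return y_noisy

y = np.repeat([0, 1, 2], 40)   # 120 samples, 3 classes
y5 = add_label_noise(y, 0.05)  # exactly 5% of labels flipped
```

Each corrupted label is guaranteed to change class, so the number of disagreements with the clean labels equals the nominal noise level, which is what the robustness comparison in the next subsection relies on.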

Robustness Analysis.
In addition, the degree to which the average precision of the different models decreases as noise is gradually added (Tables 2-4) is used as the basis for comparing their robustness. In Figures 1-3, the FRS, β-PFRS, K-trimmed FRS, K-means FRS, K-median FRS, SFRS, and RS-FRS models are numbered 1 to 7 in sequence.
As shown in Figures 1-3, as the noise level increases, the performance of every FRS model declines on all three classifiers. This phenomenon is consistent with the actual situation, because a noisy sample may be regarded as the nearest sample of a given target sample when these models calculate its lower approximation; therefore, the classification accuracy of these FRS models decreases. Among these models, however, the classical FRS model declines most sharply and changes most dramatically, because it depends the most on the nearest sample and is the least robust. The proposed RS-FRS model presents the smallest decline and the least drastic change, indicating that our strategy of dividing the sample set into fuzzy decision classes is in line with reality and has the best robustness.

Statistical Analysis.
To further explore whether there are significant differences in the average classification performance of the seven FRS models, we adopt the Friedman test [35] and the Bonferroni-Dunn test [36]. The Friedman statistic is defined as

  χ²_F = (12N / (k(k + 1))) (Σ_{i=1}^{k} R_i² − k(k + 1)²/4),
  F_F = ((N − 1) χ²_F) / (N(k − 1) − χ²_F),

where R_i is the average ranking of model i over all datasets, k is the total number of models, and N is the number of datasets. F_F obeys the Fisher distribution with (k − 1) and (k − 1)(N − 1) degrees of freedom. In this paper, k = 7 and N = 36. At significance level α = 0.1, F(6, 210) = 1.77. Table 5 shows the Friedman statistics and corresponding critical values under the different classifiers. If F_F is greater than the critical value, the null hypothesis is rejected, and we conclude that there is a significant difference among the performances of the models; otherwise, the null hypothesis cannot be rejected and no significant difference can be claimed. If Friedman's null hypothesis is rejected, the Bonferroni-Dunn test can further analyze the relative performance between each comparison model and the RS-FRS model, with the RS-FRS model regarded as the control. For two models to differ significantly, the difference between their average rankings must exceed the critical difference (CD):

  CD = q_α sqrt(k(k + 1) / (6N)),

where k is the total number of models, N is the number of datasets, and q_α is the critical value of the Bonferroni-Dunn test at the corresponding significance level. At significance level α = 0.1, q_α = 2.394; therefore, CD = 1.2190 (k = 7, N = 36).

Algorithm 1: Feature selection based on the RS-FRS model with SPS.
Input: a set of condition attributes A; a set of samples U.
Output: selected feature subset S.
(1) For ∀x_i ∈ U, compute the lower approximation λ_i of the RS-FRS model according to equation (5);
(2) Compute every DIS′(R_{a_t}) and DIS′(R_A), and add a_t into S.
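The two statistics above are easy to check numerically; the sketch below reproduces the reported CD = 1.2190 for k = 7, N = 36, q_α = 2.394 (function names are ours).

```python
import math

def friedman_ff(avg_ranks, N):
    """Friedman chi-square and its F_F refinement from average ranks
    of k models over N datasets."""
    k = len(avg_ranks)
    chi2 = 12.0 * N / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4.0)
    return (N - 1) * chi2 / (N * (k - 1) - chi2)

def critical_difference(k, N, q_alpha):
    """Bonferroni-Dunn CD = q_alpha * sqrt(k(k+1) / (6N))."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * N))

cd = critical_difference(7, 36, 2.394)  # ~1.2190, as used in the CD diagrams
# Sanity check: identical average ranks (k+1)/2 = 4 give chi2 = 0, hence F_F = 0.
ff = friedman_ff([4.0] * 7, 36)
```

If all seven models ranked identically, F_F would be 0 and the null hypothesis could never be rejected, matching the interpretation of the test given above.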
Figures 4-6 show the CD diagrams for the three classifiers, in which the average rankings of all algorithms are drawn along the horizontal axis. In other words, the scales from 1 to 7 on the horizontal axis are the average rankings of the seven algorithms over the 12 datasets, with higher rankings indicating better algorithms; the highest-ranked (optimal) algorithm is located on the right side of the axis. If the average rankings of RS-FRS and a comparison algorithm are connected by a thick line (within the CD value), RS-FRS has performance comparable to that algorithm; if the difference between their average rankings exceeds the CD value, we consider their performance to be significantly different.

Conclusions
The development of robust FRS models is a hot spot in FRS theory and offers advantages for feature selection on noisy data. In this paper, a nonparametric fuzzy membership degree is defined through fuzzy granular computation, and an FRS model based on representative samples is proposed. First, the representative sample of each class is found and the distance between the target sample and each representative sample is calculated; the fuzzy membership degree of the target sample with respect to every class is then determined, the FRS model based on representative samples is proposed, and the relevant properties of the model are studied. On this basis, a feature selection algorithm based on the representative-sample FRS is designed. Experimental results show that the RS-FRS model and the feature selection algorithm proposed in this paper are feasible, effective, and robust for processing uncertain information systems with noise, which expands the application field of FRS and the research on feature selection. However, this work has some limitations. Feature selection here is carried out using only the lower approximation of the RS-FRS model; the lower approximation of a sample reflects the degree of certainty with which it belongs to a class, while the upper approximation reflects the degree of possibility. In future research, we will consider designing feature selection algorithms that use both the upper and lower approximations.

Data Availability
In this paper, the authors use the UCI database, which is publicly available at http://www.ics.uci.edu/mlearn/MLRepository.html.