Neighborhood Hypergraph Based Classification Algorithm for Incomplete Information System

Classification in incomplete information systems is an active research topic in intelligent information processing. The hypergraph is a recent intelligent method for machine learning. However, the traditional hypergraph model handles incomplete information systems poorly, for two reasons: (1) hyperedges are generated randomly in the traditional hypergraph model; (2) existing methods are ill-suited to incomplete information systems because of their missing values. In this paper, we propose a novel classification algorithm for incomplete information systems based on the hypergraph model and rough set theory. First, we initialize the hypergraph. Second, we classify the training set with the neighborhood hypergraph. Third, under the guidance of rough set theory, we replace the poor hyperedges. After that, we obtain a good classifier. The proposed approach is tested on 15 data sets from the UCI machine learning repository and compared with existing methods such as C4.5, SVM, NaiveBayes, and KNN. The experimental results show that the proposed algorithm performs better in terms of Precision, Recall, AUC, and F-measure.


Introduction
Many information systems encountered in real life are incomplete information systems [1]. When the precise value of some attributes in an information system is not known, that is, missing or only partially known, the system is called an incomplete information system (IIS). Classification in incomplete information systems is an active research topic in the field of intelligent information processing. There are several approaches to dealing with incomplete information systems. One is to remove the samples with missing values. Another is to replace each missing value with the most common value [2]. These approaches are simple, but they might destroy the original distribution of the data [3]. More complex approaches have also been presented in the literature. Among the different data analysis theories and methods, rough sets [4] are the most frequently used. Several extended rough set models [5,6] deal with incomplete information systems, such as the tolerance relation, the limited tolerance relation, and the nonsymmetric similarity relation.
Hypernetwork was first proposed by Sheffi [7]. It has been presented as a probabilistic model for learning higher-order correlations using a hypergraph structure consisting of a large number of hyperedges. A hypernetwork can be represented as a hypergraph. Previous studies have shown that hypernetworks can be evolved to solve various machine learning problems. Segovia-Juarez et al. and Wang et al. [8,9] use the hypernetwork model to realize DNA molecules; Kim and Zhang [10] use hypernetworks for pattern classification.
Previous research has shown that the original hypergraph model performs well in classification. However, it still has some shortcomings. (1) The conventional hypergraph can only deal with discrete data, so continuous data must be discretized first. (2) The traditional hypergraph model creates new hyperedges randomly. For an incomplete information system, the missing values in a new hyperedge must also be supplemented. Because the hypergraph relies on measures such as filling attribute values randomly and replacing hyperedges randomly, the decision and classification ability learned from the training set is likely to be impaired.
To address the problems mentioned above, we introduce the neighborhood rough set. Rough set theory, proposed by Pawlak in 1982 [11-13], can be seen as a mathematical approach to vagueness. It has been successfully applied to various fields such as pattern recognition, machine learning, signal analysis, intelligent systems, decision analysis, knowledge discovery, and expert systems. The core concepts of rough set theory are approximations. Using the concepts of lower and upper approximations, knowledge hidden in information systems may be discovered and expressed in the form of decision rules. In other words, certain rules can be induced directly from the lower approximation, and possible rules can be derived from the upper approximation. The study of approximation spaces has therefore been developed widely. Once rough set theory is applied to the hypergraph, we can supervise the hyperedge replacement process and improve the generalization ability of hyperedges as well. Lin [14] pointed out that neighborhood spaces are more general topological spaces than equivalence spaces and introduced the neighborhood relation into rough set methodology. Hu et al. [15] discussed the properties of neighborhood approximation spaces and proposed the neighborhood-based rough set model. They then used the model to build a uniform theoretical framework for neighborhood-based classifiers. The neighborhood-based rough set overcomes the inability of classic rough set theory to deal with continuous data.
In this paper, we employ the hypergraph model and rough set theory to build a neighborhood hypergraph model. We then propose a classification algorithm for incomplete information systems based on the neighborhood hypergraph. The algorithm is composed of the following three steps. (1) Initialize the hyperedge set: generate hyperedges for every sample in the training set, treating samples with missing values separately. (2) Classify the training set: classify the training set with the hyperedge set and determine whether to replace hyperedges according to the accuracy of the classification. (3) Replace hyperedges: under the guidance of rough set theory, replace the unsuitable hyperedges. Compared with existing methods implemented on the WEKA platform, the experimental results show that the proposed classification algorithm performs better than the other algorithms.
The remainder of the paper is organized as follows. The basic concepts of the hypergraph and neighborhood hypergraph models are given in Section 2. The neighborhood hypergraph classification algorithm for incomplete information systems is developed in Section 3. Section 4 presents the experimental analysis. Finally, the paper is concluded in Section 5.

Hypergraph and Neighborhood Hypergraph Model
2.1. The Definition of Hypergraph. In 1970, Berge and Minieka [16] used the hypergraph to define the hypernetwork. This was the first systematic establishment of undirected hypergraph theory, and it was applied to operations research through matroids.
Definition 1 (hypergraph [16]).

Neighborhood Hypergraph Based on IIS
Definition 2 (the relevant degree between samples [17]). Let $x_1$ and $x_2$ be two samples in an incomplete system, each with $N$ attributes. Suppose $x_1$ and $x_2$ have the same value on $S$ attributes, different values on $P$ attributes, and an uncertain value on $F$ attributes (at least one of the two samples has a missing value on the attribute). We define $S/N$ as the homogeneous degree of $x_1$ and $x_2$, $P/N$ as the antagonism degree of $x_1$ and $x_2$, and $F/N$ as the discrepant degree of $x_1$ and $x_2$. We employ $\mu(x_1, x_2) = S/N + (F/N)i + (P/N)j$ to represent the relevant degree of $x_1$ and $x_2$, where $\mu(x_1, x_2)$ denotes the relevant degree of $x_1$ and $x_2$. However, the relevant degree defined in [17] can only be computed for discrete features; it is not valid for continuous features. Here, we consider two continuous attribute values equal if they differ by no more than a certain range.
For instance, let $X = \{x_1, x_2, x_3, x_4\}$, let $a$ be an attribute of $X$, and let $f(x, a)$ denote the value of sample $x$ on attribute $a$, with $f(x_1, a) = 0.3$, $f(x_2, a) = 0.27$, $f(x_3, a) = 0.35$, $f(x_4, a) = 0.1$, and $h = 0.05$. For attribute $a$, $x_i$ is equal to $x_1$ whenever $|f(x_i, a) - f(x_1, a)| \le h$. Hence $x_2$ and $x_3$ are both equal to $x_1$ on attribute $a$, while $x_4$ is not.
Definition 3 (the neighborhood of a sample in IIS). Lin [14] introduced the neighborhood model in 1988. Hu et al. [15] discussed the properties of neighborhood approximation spaces and proposed the neighborhood-based rough set model. Hu et al. [18] employed the neighborhood rough set to classify data, using the Minkowski distance to calculate the neighborhood threshold of a sample, which involves all the attribute values of that sample. However, for an incomplete information system, it is difficult to compute the distance because of the missing values. Thus, we present an extended neighborhood of a sample in an incomplete information system by combining it with the relevant degree.
Given arbitrary $x_i \in X$ and $B \subseteq C$, the neighborhood $n_B(x_i)$ of $x_i$ on attribute set $B$ is defined in terms of $\mu_B(x, x_i)$, the relevant degree between $x$ and $x_i$. $n_B(x_i)$ denotes the sample set within the neighborhood of $x_i$.
According to the definition, we can find out easily that Combining with the neighborhood rough set theory, we define the neighborhood hypergraph as follows.
In some of the literature, such as [19], the vertices of a hypergraph represent the attributes of samples. In this paper, however, the vertices of the hypergraph represent samples, and different samples on one hyperedge have the same attribute set (see Figure 2).
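The comparison rule of Definition 2 can be illustrated with a small Python sketch. It only counts matching, differing, and uncertain attributes and returns the homogeneous, antagonism, and discrepant degrees; it does not combine them into a single relevant degree. The function names, the use of `None` to mark a missing value, and the `continuous` index set are assumptions made for illustration and are not taken from the paper.

```python
from typing import Sequence, Set, Tuple


def attribute_equal(u, v, is_continuous: bool, h: float) -> bool:
    """Continuous values match when they differ by at most h; discrete values must be identical."""
    if is_continuous:
        return abs(u - v) <= h
    return u == v


def relation_degrees(x1: Sequence, x2: Sequence,
                     continuous: Set[int], h: float = 0.05) -> Tuple[float, float, float]:
    """Return the (homogeneous, antagonism, discrepant) degrees of two samples."""
    same = different = uncertain = 0
    for a, (u, v) in enumerate(zip(x1, x2)):
        if u is None or v is None:        # at least one value is missing (assumed encoding)
            uncertain += 1
        elif attribute_equal(u, v, a in continuous, h):
            same += 1
        else:
            different += 1
    n = len(x1)
    return same / n, different / n, uncertain / n
```

With the values of the example above, `attribute_equal(0.27, 0.3, True, 0.05)` and `attribute_equal(0.35, 0.3, True, 0.05)` hold, while `attribute_equal(0.1, 0.3, True, 0.05)` does not.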

Definition 6 (the neighborhood hyperedge set of a sample).
Given $G = \langle X, E \rangle$ and the attribute set $B$ ($B \subseteq C$), the hyperedge set which is included by sample $x$, denoted $E_B(x)$, is defined as follows.
Definition 7 (the sample set related to a hyperedge). Given $G = \langle X, E \rangle$, $\forall x \in X$, and attribute set $B$ ($B \subseteq C$), for arbitrary $e \in E$, the sample set related to $e$ is defined as $X_B(e) = \{x \mid e \in E_B(x), x \in X\}$. Given arbitrary $E' \subseteq E$ and attribute set $B$ ($B \subseteq C$), the sample set related to $E'$ is defined as follows.
Definition 8 (the confidence degree of a hyperedge). Given $G = \langle X, E \rangle$, for arbitrary $e \in E$, assume that $D(e) = \{d \mid d \in D\}$, where $D$ denotes the decision set. $X_B(e)$ is the sample set related to hyperedge $e$ on attribute set $B$ ($B \subseteq C$). According to the decisions $D$, $X_B(e)$ is divided into $n$ equivalence classes: $X_1, X_2, \ldots, X_n$. When $X_B(e) \ne \emptyset$, the confidence degree of $e$, $\mathrm{Conf}_B(e)$, is defined by formula (5).
Definition 9 (the upper approximation, lower approximation, boundary region, and negative domain of a hyperedge set for the sample decision set). Given $G = \langle X, E \rangle$, $C$ is the attribute set of the samples and $D$ is the decision set of the samples. For arbitrary hyperedge set $E'$ ($E' \subseteq E$), according to the decisions $D$, the hyperedge set $E'$ is divided into $n$ equivalence classes: $E_1, E_2, \ldots, E_n$. For arbitrary $B \subseteq C$, the upper approximation, lower approximation, boundary region, and negative domain of decision $D$ related to attribute set $B$ are, respectively, defined as follows.
The lower approximation of decision $D$ related to attribute set $B$ is also called the positive domain. The size of the positive domain reflects the separable degree of the classification problem in a given attribute space: the bigger the positive region is, the smaller the boundary is. To explain how to divide the upper approximation, lower approximation, and boundary region, we give an example (see Example 10).

A Classification Algorithm Based on Neighborhood Hypergraph for IIS
The hypergraph classification process is generally divided into three steps. (1) Step 1: initialize the hypergraph. (2) Step 2: classify the training set. (3) Step 3: replace hyperedges, that is, improve the accuracy of classification by replacing the unsuitable hyperedges. Steps 2 and 3 are iterated until the accuracy reaches a specific threshold. In the traditional hypergraph classification method, hyperedges are generated randomly, which makes it difficult to find and replace poorly performing hyperedges in the replacement phase. Therefore, we introduce rough set theory to separate the hyperedge set into three parts: the upper approximation, the lower approximation, and the boundary region. In addition, the hyperedges in the boundary region are handled specially, which improves classification accuracy. The flow chart of the algorithm is shown in Figure 4.

Initialize Hypergraph.
Initializing the hypergraph means generating hyperedges from samples. For each new hyperedge, we need to initialize both the condition attributes and the decision attribute. In an incomplete information system, there are two kinds of samples: those without missing values and those with missing values. We call them type 1 and type 2, respectively. We create three hyperedges for each type 1 sample and five hyperedges for each type 2 sample. Furthermore, we need to fill the missing values in the new hyperedge. The specific process of creating a new hyperedge is as follows.
(1) Inherit the Condition Attributes Randomly. A hyperedge has the same number of attributes as the sample, and its attribute values are inherited from the sample. Given a sample, we randomly select 70% of the attributes in the hyperedge and inherit their values from the sample. The remaining attributes of the hyperedge are processed as follows. If the attribute is continuous, the hyperedge takes the average of the attribute values over all samples. If the attribute is discrete, we generate the value randomly from the domain of the attribute.
For samples with missing values, we need to fill the value in the new hyperedge whenever a value inherited from the sample is missing. If the attribute is continuous, we fill it with the average value over all samples whose decision value is equal to that of the hyperedge. Otherwise, we fill it with the most frequent attribute value.
(2) Inherit the Decision Attribute Directly. The decision attribute of the new hyperedge is inherited directly from the sample that creates the hyperedge.
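The hyperedge creation procedure above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: `None` marks a missing value, `continuous` is a set of attribute indices, the most frequent value is taken over all samples, and the fallback to the overall mean when no sample of the same class has a value is an added safeguard. The function name `make_hyperedge` and its parameters are hypothetical.

```python
import random
from collections import Counter
from statistics import mean


def make_hyperedge(sample, decision, data, decisions, continuous, rng, keep_ratio=0.7):
    """Build one hyperedge (attribute values, decision) from a sample."""
    n = len(sample)
    kept = set(rng.sample(range(n), round(keep_ratio * n)))  # attributes inherited from the sample
    edge = []
    for a in range(n):
        column = [row[a] for row in data if row[a] is not None]
        if a in kept and sample[a] is not None:
            value = sample[a]                                 # inherited directly
        elif a in kept:                                       # inherited but missing: fill it
            if a in continuous:
                same_class = [row[a] for row, d in zip(data, decisions)
                              if d == decision and row[a] is not None]
                value = mean(same_class or column)            # class mean, else overall mean
            else:
                value = Counter(column).most_common(1)[0][0]  # most frequent value
        elif a in continuous:
            value = mean(column)                              # average over all samples
        else:
            value = rng.choice(sorted(set(column)))           # random value from the domain
        edge.append(value)
    return edge, decision                                     # decision is inherited directly
```

Calling the function three times for a sample without missing values and five times for a sample with missing values, each time with the same `random.Random` instance, would follow the counts given in the text.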

Classify the Training Set.
Given a sample $x$, the neighborhood threshold of $x$ is defined by formula (7), where $x_i \in X$, $x_i \ne x$, and the parameter of the formula is a fixed value. For each sample in the training set, its class is determined by voting among the hyperedges in its neighborhood hyperedge set. To obtain better hyperedges, we compute the accuracy from the classification result on the training set. If the accuracy is higher than the threshold, we output the hyperedge set. Otherwise, hyperedges should be replaced.
Definition 11 (the decision subset of $E_B(x)$). Given $G = \langle X, E \rangle$, $E_B(x)$ is the neighborhood hyperedge set of sample $x$; the set of hyperedges in $E_B(x)$ whose decision is $d$, denoted $E_B(x)_d$, is defined as follows. Given a sample $x$, its classification process is as follows: (1) Compute the neighborhood hyperedge set of sample $x$: $E_B(x)$.
We use the classification rules above to classify the training set. Once the accuracy is no less than 0.95 or the number of iterations reaches 10, we output the hypergraph. Otherwise, we replace the unsuitable hyperedges.
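The voting rule and the stopping condition described above can be sketched as follows. The sketch assumes that the decisions of the hyperedges in the neighborhood of a sample have already been collected into a list; how ties are broken and what happens when the neighborhood is empty are not specified in the text, so the choices below (first-seen decision wins, `None` for an empty neighborhood) are assumptions.

```python
from collections import Counter


def classify(neighbor_decisions):
    """Return the decision with the most votes among the neighborhood hyperedges."""
    if not neighbor_decisions:
        return None                       # assumed behavior for an empty neighborhood
    return Counter(neighbor_decisions).most_common(1)[0][0]


def accuracy(predicted, actual):
    """Fraction of training samples whose predicted decision matches the true one."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def should_stop(acc, iterations, min_accuracy=0.95, max_iterations=10):
    """Stop once the accuracy is at least 0.95 or 10 iterations have been run."""
    return acc >= min_accuracy or iterations >= max_iterations
```

For example, `classify(["yes", "no", "yes"])` returns `"yes"`, and `should_stop(0.5, 3)` is false, so the hyperedge replacement step would run next.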

Replace Hyperedges.
During hyperedge initialization, we fill missing values and replace some attribute values in each hyperedge. As a result, some hyperedges are not suitable for classifying samples. To achieve better performance, we replace the poor hyperedges by generating new hyperedges; this is hyperedge replacement.
Rough set theory divides the hyperedge set into the upper approximation, lower approximation, boundary region, and negative region. The confidence degree of hyperedges in the lower approximation is 1. The confidence degree of hyperedges in the boundary region is between 0 and 1. Hyperedges whose confidence degree is 0 belong to the negative region. Hyperedges in the lower approximation are all retained because they are very helpful for classification. In contrast, hyperedges in the negative region are counterproductive for classification, so they are replaced. Hyperedges in the boundary region are handled with a threshold: when the confidence degree of a hyperedge is less than the threshold, it is replaced. In this way, hyperedge replacement becomes more targeted and effective. When replacing hyperedges, priority is given to hyperedges generated by samples with missing values.
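A minimal sketch of this partition and of the replacement order is given below. It takes the confidence degree of each hyperedge as given. The tuple layout `(edge_id, confidence, from_incomplete_sample)` and the function names are hypothetical, and the comparison uses "at most the threshold", following Case 3 and Algorithm 1 (Conf ≤ 0.5), which differs slightly from the "less than" wording in the paragraph above.

```python
def region(confidence):
    """Map a confidence degree to the region described in the text."""
    if confidence == 1:
        return "lower"        # always retained
    if confidence == 0:
        return "negative"     # always replaced
    return "boundary"         # decided by the threshold


def select_for_replacement(edges, threshold=0.5):
    """Return hyperedges to replace, those built from samples with missing values first.

    edges: list of (edge_id, confidence, from_incomplete_sample) tuples (assumed layout).
    """
    bad = [e for e in edges if e[1] <= threshold]
    # sorted() is stable, so the original order is kept within each priority group
    return sorted(bad, key=lambda e: not e[2])
```

With a threshold of 0.5, a hyperedge with confidence 0.8 in the boundary region is kept, while one with confidence 0.5 or 0 is scheduled for replacement.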

Search the Unsuitable Hyperedges. Consider the following steps.
(1) Set the confidence degree threshold for hyperedges to 0.5.
(2) Find the hyperedges in the hyperedge set whose confidence degree does not exceed the threshold.
Given a hyperedge $e$, according to Definitions 7 and 8, we can calculate the confidence degree of $e$ in the following three cases.
Case 1. If $X_B(e) = \emptyset$, then no sample is related to the hyperedge (see Figure 6).
In this situation, we find the five samples whose relevant degree with $e$ is the largest (not all five samples are shown in Figure 6) and assume that $e$ is related to these five samples. We then calculate the confidence degree according to formula (5). If $\mathrm{Conf}_B(e)$ is greater than the threshold, we keep $e$; otherwise we replace $e$.
Case 2. If $\mathrm{Conf}_B(e)$ is greater than the threshold, hyperedge $e$ is kept.
Case 3. If $0 \le \mathrm{Conf}_B(e)$ and $\mathrm{Conf}_B(e)$ does not exceed the threshold, hyperedge $e$ needs to be replaced. We now present an example to illustrate the process (see Example 13).
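The three cases can be sketched in Python as below. Formula (5) is not reproduced in this text, so the sketch assumes that the confidence degree of a hyperedge is the fraction of its related samples that share its decision, which is consistent with the values 1 and 0 stated for the lower approximation and the negative region. The function names and argument layout are hypothetical.

```python
def confidence(edge_decision, related_decisions):
    """Assumed form of the confidence degree: share of related samples with the edge's decision."""
    return sum(d == edge_decision for d in related_decisions) / len(related_decisions)


def keep_hyperedge(edge_decision, related, ranked_by_relevance, threshold=0.5, fallback=5):
    """Decide whether a hyperedge is kept.

    related: decisions of the samples related to the hyperedge.
    ranked_by_relevance: decisions of all samples, sorted by decreasing relevant degree.
    """
    if not related:                                   # Case 1: no related sample
        related = ranked_by_relevance[:fallback]      # use the five most relevant samples
    if not related:                                   # nothing to compare against
        return False
    return confidence(edge_decision, related) > threshold   # Case 2 keeps, Case 3 replaces
```

For instance, a hyperedge with decision `"y"` and related decisions `["y", "n"]` has confidence 0.5 under this assumption and would be replaced.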

Data Sets. Experimental analysis is conducted on 15 UCI [20] data sets. Four of them are incomplete data sets, namely, Mammographic, Credit Approval, Hungarian, and Postoperative; the other eleven are complete data sets. The complete data sets are modified to obtain incomplete data sets by randomly removing some attribute values. The missing degree ranges from 0.6% to 18.97%. The data sets are outlined in Table 1 (sorted by the size of the data set).
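One way to derive an incomplete data set from a complete one is sketched below. The paper does not describe the exact procedure, so the sketch assumes that cells of the condition attributes are chosen uniformly at random and that a missing value is encoded as `None`; the function name `make_incomplete` is hypothetical.

```python
import random


def make_incomplete(data, missing_rate, rng):
    """Return a copy of data in which a fraction missing_rate of the cells is set to None."""
    cells = [(i, j) for i, row in enumerate(data) for j in range(len(row))]
    k = round(missing_rate * len(cells))      # number of cells to blank out
    result = [list(row) for row in data]      # leave the input untouched
    for i, j in rng.sample(cells, k):
        result[i][j] = None
    return result
```

For a table of 10 rows and 10 condition attributes, `make_incomplete(data, 0.05, random.Random(0))` blanks out exactly 5 cells.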

Experimental Evaluation.
To evaluate the performance of the developed approach, we use accuracy, Precision, Recall, F-measure [21], and the area under the ROC curve.
We assume that $TP$ is the number of relevant samples that are predicted, $FP$ is the number of irrelevant samples that are predicted, $FN$ is the number of relevant samples that are not predicted, and $TN$ is the number of irrelevant samples that are not predicted. Precision, or confidence, denotes the proportion of predicted samples that are relevant:
$$\mathrm{Precision} = \frac{TP}{TP + FP}.$$
Recall, or sensitivity, denotes the proportion of relevant samples that are correctly predicted:
$$\mathrm{Recall} = \frac{TP}{TP + FN}.$$
Ideally both Precision and Recall are high, but in practice they tend to conflict: raising one usually lowers the other. We therefore use the F-measure to consider them together; the F-measure is the harmonic mean of Precision and Recall:
$$F\text{-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
Another appropriate metric for measuring classification performance is the Receiver Operating Characteristic (ROC) [22] graph, in which the tradeoff between benefits and costs can be visualized. The area under the ROC curve (AUC) [23] corresponds to the probability of correctly identifying which of two stimuli is noise and which is signal plus noise. AUC provides a single-number summary of the performance of learning algorithms. Most of the time, the value of AUC lies between 0.5 and 1.0: the higher, the better. When the AUC value is under 0.5, the classifier has no positive effect on classification and should be abandoned.
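The measures above can be computed with a few lines of Python. The sketch uses the standard definitions of Precision, Recall, and F-measure, and estimates AUC as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, counting ties as one half. Returning 0.0 when a denominator is zero is a convention chosen here for illustration, and the function names are hypothetical.

```python
def precision(tp, fp):
    """Proportion of predicted samples that are relevant."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Proportion of relevant samples that are predicted."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_measure(p, r):
    """Harmonic mean of Precision and Recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def auc(positive_scores, negative_scores):
    """Pairwise estimate of the area under the ROC curve."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positive_scores for n in negative_scores)
    return wins / (len(positive_scores) * len(negative_scores))
```

With 8 true positives, 2 false positives, and 8 false negatives, Precision is 0.8, Recall is 0.5, and the F-measure is 8/13.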

Experimental Method and Results
To evaluate the performance of the proposed HyperGraph algorithm, we compared it with other algorithms from the related literature: C4.5, SVM, NaiveBayes, and kNN [24]. C4.5, SVM, and NBC are all classic classification algorithms. They perform well on the majority of data sets in both theory and practice. kNN is a simple and classic algorithm; we chose it because its main idea is majority voting among nearby samples, which is very similar to the proposed method. Their source code is provided by the Weka software [25]. In addition, we compared the proposed algorithm with two rough set methods whose source code is implemented in RIDAS [26]. One of the rough set algorithms handles the data sets directly by using rough set theory (labeled as "Rough (incomplete)" in Tables 2, 3, 4, 5, and 6) [27,28]. The other rough set algorithm fills the missing values of the data sets before classification (labeled as "Rough (complete)" in Tables 2, 3, 4, 5, and 6) [29-31].
The proposed HyperGraph algorithm is implemented in Java. All the results are obtained from 10-fold cross validation [32].
The comparative results of each algorithm on accuracy, Precision, Recall, F-measure, and AUC are shown in Tables 2 to 7 and Figure 8.
We also compare the accuracy with and without missing values by using 22 data sets: 11 complete data sets and 11 incomplete data sets, each derived from one of the complete data sets by a random missing process (the missing degree is 5%). The final results are shown in Table 7.
To compare the five algorithms, the average value of each indicator for the five algorithms is shown in Figure 8.
From Tables 2 and 4 and Figure 8, we can see that HyperGraph has higher average accuracy, Recall, and F-measure. Furthermore, its average AUC value is superior to that of the majority of the algorithms, as shown in Tables 3 and 6 and Figure 8. Table 7 also indicates that the proposed method is suitable for incomplete information systems: it still performs well when the data set has missing values.
Tables 3 and 6 and Figure 8 show that NaiveBayes has higher average Precision and AUC values than HyperGraph, because the NaiveBayes classifier classifies data by using the NBC model. In theory, when the properties of a sample are independent of each other, the NBC model has the minimum misclassification error rate compared with other classification methods [33]. The majority of the data sets used in this paper have independent properties, which gives NaiveBayes a lower average misclassification error rate. Moreover, by their definitions, Precision and AUC are generally inversely related to the misclassification error rate. Thus, NaiveBayes has higher average Precision and average AUC than HyperGraph.
However, as indicated in Tables 3, 5, and 6, the Precision, F-measure, and AUC values of the proposed classifier on the lenses data set are poorer than those of most of the other algorithms. By analyzing the distribution of the lenses data set, we find that most attribute values are extremely close between the two

Figure 2: An example of neighborhood hypergraph.

Figure 3: An example of upper and lower approximation and boundary region. (Legend: sample of class 1, hyperedge of class 1, sample of class 2, hyperedge of class 2.)

Figure 7: An example of Case 3. (Legend: sample of class 1, hyperedge of class 1, sample of class 2, hyperedge of class 2.)
Replace the Unsuitable Hyperedges. Take one hyperedge out of the hyperedge replacement set and get the sample $x$ that generated it. Generate a new hyperedge using this sample $x$ and replace the old hyperedge with the new one. When replacing, hyperedges created by samples with missing values are replaced preferentially.

Algorithm 1: A classification algorithm for incomplete information system based on neighborhood hypergraph.
Input: Training set $X$
Output: Hyper-edge set $E$
Step 1. (Initialize hypergraph)
  FOR each $x$ in $X$ DO
    Create one hyper-edge $e$ of sample $x$:
    First, inherit attributes from the sample randomly and replace the values randomly on the rest of the attributes.
    Second, $e$ inherits the decision attribute of $x$.
    Third, if $x$ has missing values, fill the attribute value according to whether the attribute is continuous or discrete.
  END FOR
Step 2. (Classify the training set)
  According to formula (7), calculate the neighborhood threshold for each sample;
  FOR each $x$ in $X$ DO
    FOR each $e$ in $E$ DO
      According to formula (1), calculate the relevant degree of $x$ and $e$, $\mu(x, e)$.
      IF $\mu(x, e)$ ≥ the neighborhood threshold of $x$ THEN
        $E_B(x) = E_B(x) \cup \{e\}$;
      END IF
    END FOR
    FOR each $e$ in $E_B(x)$ DO
      IF $D(e) == d$ THEN
        $E_B(x)_d = E_B(x)_d \cup \{e\}$; // $d$ is the decision attribute value
      END IF
    END FOR
    Compute the classification of $x$: the decision $d$ for which $|E_B(x)_d|$ is largest.
  END FOR
  Compute the correctly classified ratio of the training set: accuracy;
  IF accuracy ≥ 0.95 or iterations ≥ 10 THEN
    GOTO Step 4;
  ELSE
    GOTO Step 3;
  END IF
Step 3. (Replace hyper-edge)
  replace = 0; // initialize the hyper-edge replacement counter
  FOR each $e_i$ in $E$ DO
    According to Definition 7 and formula (5), calculate the confidence degree of $e_i$: $\mathrm{Conf}_B(e_i)$.
    IF $\mathrm{Conf}_B(e_i)$ ≤ 0.5 THEN
      replace = replace + 1;
      $E = E - \{e_i\}$;
    END IF
  END FOR
  // When replacing hyper-edges, those generated by samples with missing values are replaced first.
  WHILE (replace ≠ 0)
    Generate a new hyper-edge $e_j$ through a process similar to Step 1.
    $E = E \cup \{e_j\}$;
    replace = replace − 1;
  END WHILE
  GOTO Step 2;
Step 4. (Return)
  RETURN $E$;