Rough Sets Data Analysis in Knowledge Discovery: A Case of Kuwaiti Diabetic Children Patients

1 Information Technology Department, Faculty of Computer and Information, Cairo University, 5 Ahmed Zewal Street, Orman, Giza 12613, Egypt
2 Quantitative and Information System Department, College of Business Administration, Kuwait University, P.O. Box 5468, Safat 13055, Kuwait
3 Quantitative Methods and Information Systems Department, College of Business Administration, Kuwait University, P.O. Box 5969, Safat 13060, Kuwait
4 Department of Solar and Space Research, National Research Institute of Astronomy and Geophysics, Helwan, Cairo 11421, Egypt


INTRODUCTION
Recently, diabetes has become one of the most common chronic diseases among children. Several studies have shown that diabetes has a great impact on children's lives. In particular, it has been shown that diabetic children are highly exposed to emotional and behavioral problems compared with normal children [1]. The relationship of diabetes to psychological problems among adolescents has been investigated by many authors [2].
Medical databases have accumulated large quantities of information about patients and their medical conditions. Relationships and patterns within these data could provide new medical knowledge [3-6]. Analysis of medical data is often concerned with the treatment of incomplete knowledge, the management of inconsistent pieces of information, and the manipulation of various levels of data representation. Existing intelligent techniques of data analysis are mainly based on quite strong assumptions (some knowledge about dependencies, probability distributions, a large number of experiments) and are unable to derive conclusions from incomplete knowledge or to manage inconsistent pieces of information.
The classification of a set of objects into predefined homogeneous groups is a problem of major practical interest in many fields, in particular in the medical sciences [5, 7, 8]. Over the past two decades, several traditional multivariate statistical classification approaches, such as linear discriminant analysis, quadratic discriminant analysis, and logit analysis, have been developed to address the classification problem. More advanced and intelligent techniques have also been used in medical data analysis, such as neural networks, Bayesian classifiers, genetic algorithms, decision trees, fuzzy theory, and rough sets. Fuzzy sets [9] provide a natural framework for representing such imprecision.

ROUGH SETS: FOUNDATIONS
Rough set theory is an intelligent mathematical tool proposed by Pawlak [13-15]. It is based on the concept of approximation spaces and approximations of sets and concepts. In rough set theory, the data are collected in a table called a decision table. Rows of a decision table correspond to objects, and columns correspond to features. We assume that each example in the data set is given with a class label indicating the class to which it belongs. We call the class label the decision feature; the remaining features are conditional. Let O and F denote a set of sample objects and a set of functions representing object features, respectively. Assume that B ⊆ F and x ∈ O, and let [x]_B denote the equivalence class [x]_B = {y ∈ O : x ∼_B y}. Rough set theory defines three regions based on the equivalence classes induced by the feature values: the lower approximation B̲X, the upper approximation B̄X, and the boundary BND_B(X). The lower approximation of a set X contains all equivalence classes [x]_B that are subsets of X; the upper approximation B̄X contains all equivalence classes [x]_B that have objects in common with X; and the boundary BND_B(X) is the set B̄X \ B̲X, that is, the set of all objects in B̄X that are not contained in B̲X. A rough set can thus be defined as any set with a nonempty boundary.
The indiscernibility relation ∼_B (also written Ind_B) is a mainstay of rough set theory. Informally, x ∼_B y holds when the objects x and y have matching descriptions on B. Depending on the selection of B, ∼_B is an equivalence relation that partitions the set of objects O into equivalence classes. The set of all classes in the partition is denoted by O/∼_B (also by O/Ind_B) and is called the quotient set. Affinities between objects of interest in a set X ⊆ O and classes in the partition can be discovered by identifying those classes that have objects in common with X. Approximation of the set X begins by determining which elementary sets [x]_B ∈ O/∼_B are subsets of X.
Here we provide a brief explanation of the basic framework of rough set theory, along with some of the key definitions. A review of this basic material can be found in sources such as [13-15].

Information system and approximation
Definition 1 (information system). An information system is a tuple (U, A), where U consists of objects and A consists of features. Every a ∈ A corresponds to a function a : U → V_a, where V_a is the value set of a. In applications, we often distinguish between conditional features C and decision features D, where C ∩ D = ∅. In such cases, we define a decision system (U, C, D).

Definition 2 (indiscernibility relation). Every subset of features B ⊆ A induces an indiscernibility relation

Ind_B = {(x, y) ∈ U × U : a(x) = a(y) for every a ∈ B}.

For every x ∈ U, there is an equivalence class [x]_B in the partition of U defined by Ind_B.

Due to the imprecision that exists in real-world data, a decision table sometimes contains conflicting classifications of objects. A conflicting classification occurs whenever two objects have matching descriptions but are deemed to belong to different decision classes. In that case, the decision table contains an inconsistency.

Definition 3 (lower and upper approximation). In rough set theory, approximations of sets are introduced to deal with inconsistency. A rough set approximates a traditional set using a pair of sets called the lower and upper approximations of the set. Given B ⊆ A, the lower and upper approximations of a set Y ⊆ U are defined, respectively, by

B̲Y = {x ∈ U : [x]_B ⊆ Y},   B̄Y = {x ∈ U : [x]_B ∩ Y ≠ ∅}.

Definition 4 (lower approximation and positive region). The positive region POS_C(D) is defined by

POS_C(D) = ⋃_{X ∈ U/Ind_D} C̲X.

POS_C(D) is called the positive region of the partition U/Ind_D with respect to C ⊆ A, that is, the set of all objects in U that can be uniquely classified into elementary sets of the partition U/Ind_D by means of C [16].
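As a concrete illustration of Definitions 2-4, the following sketch computes the partition induced by a feature subset and the lower and upper approximations of a decision class. The objects and feature names below are hypothetical, invented for illustration; this is not the paper's data.

```python
# Hypothetical decision table: objects -> feature values.
# "a" and "b" are conditional features, "d" is the decision feature.
U = ["x1", "x2", "x3", "x4", "x5", "x6"]
table = {
    "x1": {"a": 1, "b": 0, "d": "yes"},
    "x2": {"a": 1, "b": 0, "d": "no"},   # matches x1 on {a, b}: inconsistency
    "x3": {"a": 0, "b": 1, "d": "no"},
    "x4": {"a": 0, "b": 1, "d": "no"},
    "x5": {"a": 1, "b": 1, "d": "yes"},
    "x6": {"a": 1, "b": 1, "d": "yes"},
}

def partition(objects, features):
    """Quotient set U/Ind_B: group objects with matching descriptions on B."""
    classes = {}
    for x in objects:
        key = tuple(table[x][f] for f in features)
        classes.setdefault(key, set()).add(x)
    return list(classes.values())

def approximations(X, features):
    """Lower/upper approximation of X w.r.t. the partition induced by B."""
    lower, upper = set(), set()
    for cls in partition(U, features):
        if cls <= X:        # class entirely inside X -> lower approximation
            lower |= cls
        if cls & X:         # class overlaps X -> upper approximation
            upper |= cls
    return lower, upper

X = {x for x in U if table[x]["d"] == "yes"}   # decision class "yes"
lower, upper = approximations(X, ["a", "b"])
boundary = upper - lower   # nonempty boundary: X is rough w.r.t. {a, b}
```

Here x1 and x2 have matching descriptions but different decisions, so they fall in the boundary region, making X a rough set with respect to {a, b}.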
Definition 5 (upper approximation and negative region).
The negative region NEG_C(D) is defined by

NEG_C(D) = U − ⋃_{X ∈ U/Ind_D} C̄X,

that is, the set of all objects that can be definitely ruled out as members of X.
Definition 6 (boundary region). The boundary region is the difference between the upper and lower approximations of a set X; it consists of the equivalence classes that have one or more elements in common with X but are not contained in X. It is given by the following formula:

BND_B(X) = B̄X − B̲X.

Reduct and core
An interesting question is whether there are features in the information system (feature-value table) that are more important to the knowledge represented in the equivalence class structure than other features. Often we wonder whether there is a subset of features which by itself can fully characterize the knowledge in the database. Such a feature set is called a reduct. The calculation of the reducts of an information system is a key problem in rough set theory [14, 15, 17]: we need the reducts of an information system in order to extract rule-like knowledge from it.
Definition 7 (reduct). Given a classification task related to the mapping C → D, a reduct is a subset R ⊆ C such that γ(R, D) = γ(C, D), and none of the proper subsets of R satisfies this equality.
The computation of the reducts and the core of the conditional features from a decision table is a way of selecting relevant features. It is a global method in the sense that the resultant reduct represents the minimal set of features necessary to maintain the same classification power given by the original and complete set of features. A more direct manner of selecting relevant features is to assign a measure of relevance to each feature and choose the features with the highest values. Based on the generated reduct system, we generate the list of rules used for building the classifier model: a new object is matched against each object in the reduced decision table (i.e., the reduct system) and classified into the corresponding decision class. The calculation of all the reducts is fairly complex (see [12, 18-20]).
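The dependency degree γ(B, D) = |POS_B(D)|/|U| and the reduct condition of Definition 7 can be sketched with a brute-force search over feature subsets. The toy table below is invented for illustration and is not the diabetic data set; exhaustive search is only feasible for a handful of features.

```python
from itertools import combinations

# Hypothetical decision table: (conditional feature values, decision).
table = [
    ((0, 0, 0), "n"),
    ((0, 1, 1), "y"),
    ((1, 0, 1), "y"),
    ((1, 1, 0), "y"),
]

C = (0, 1, 2)  # indices of the conditional features

def positive_region(feat_idx):
    """Objects whose equivalence class on feat_idx is decision-pure."""
    classes = {}
    for i, (cond, _dec) in enumerate(table):
        key = tuple(cond[j] for j in feat_idx)
        classes.setdefault(key, []).append(i)
    pos = set()
    for members in classes.values():
        if len({table[i][1] for i in members}) == 1:
            pos |= set(members)
    return pos

def gamma(feat_idx):
    """Degree of dependency gamma(B, D) = |POS_B(D)| / |U|."""
    return len(positive_region(feat_idx)) / len(table)

def reducts():
    """All minimal feature subsets preserving gamma(C, D) (brute force)."""
    full = gamma(C)
    found = []
    for r in range(1, len(C) + 1):
        for subset in combinations(C, r):
            if gamma(subset) == full and \
               not any(set(f) <= set(subset) for f in found):
                found.append(subset)
    return found
```

In this toy table no single feature determines the decision, but any two features do, so there are three reducts and the core (their intersection) is empty.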

Significance of the attribute
The significance of features enables us to evaluate features by assigning a real number from the closed interval [0, 1] expressing how important a feature in an information table is. The significance of a feature a in a decision table DT can be evaluated by measuring the effect of removing feature a from the feature set C on the positive region defined by the table DT. The number γ(C, D) = |POS_C(D)|/|U| expresses the degree of dependency between the features C and D, or the accuracy of the approximation of U/D by C. The formal definition of significance is given as follows.
Definition 11 (significance). For any feature a ∈ C, we define its significance ζ with respect to D as follows:

ζ(a, C, D) = (γ(C, D) − γ(C − {a}, D)) / γ(C, D).

Definitions 7-11 are used to express the importance of particular features in building the classification model. For a comprehensive study, we refer to [21]. One importance measure uses the frequency of occurrence of features in reducts. One can also consider various modifications of Definition 7, for example approximate reducts, which preserve information about decisions only to some degree [12]. Further, the positive region in Definition 4 can be modified by allowing approximate satisfaction of the inclusion [x]_C ⊆ [x]_D, as proposed, for example, in the VPRS model [22]. Finally, in Definition 2, the meaning of Ind_B and [x]_B can be changed by replacing the equivalence relation with a similarity relation, which is especially useful when considering numeric features. For further reading, we refer, for example, to [14, 17].
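A minimal sketch of Definition 11 follows. The feature names and dependency degrees below are assumed values for a hypothetical table, not results computed from the paper's data.

```python
# Assumed dependency degrees for a hypothetical decision table.
gamma_C = 1.0            # gamma(C, D): full conditional feature set
gamma_without = {        # gamma(C - {a}, D) after dropping each feature a
    "emotional": 0.6,
    "conduct": 0.7,
    "gender": 1.0,       # dropping gender loses nothing
}

def significance(a):
    """zeta(a, C, D) = (gamma(C, D) - gamma(C - {a}, D)) / gamma(C, D)."""
    return (gamma_C - gamma_without[a]) / gamma_C
```

A feature whose removal leaves the positive region unchanged has significance 0 and is a candidate for elimination.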

Decision rules
In the context of supervised learning, an important task is the discovery of classification rules from the data provided in decision tables. Decision rules not only capture patterns hidden in the data but can also be used to classify new, unseen objects. Rules represent dependencies in the data set, and they represent extracted knowledge that can be used when classifying new objects not present in the original information system. Once the reducts are found, the job of creating definite rules for the value of the decision feature of the information system is practically done. To transform a reduct into a rule, one only has to bind the condition feature values of the object class from which the reduct originated to the corresponding features of the reduct. Then, to complete the rule, a decision part comprising the resulting part of the rule is added; this is done in the same way as for the condition features. To classify objects that have never been seen before, rules generated from a training set are used. These rules represent the actual classifier, which predicts the classes to which new objects belong. The nearest matching rule is determined as the one whose condition part differs from the feature vector of the new object by the minimum number of features. When there is more than one matching rule, a voting mechanism is used to choose the decision value: every matched rule contributes votes for its decision value equal to the number of objects matched by the rule, the votes are added, and the decision with the largest number of votes is chosen as the correct class. Quality measures associated with decision rules can be used to eliminate some of the decision rules.
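The voting scheme described above can be sketched as follows. The rules, their supports, and the attribute values are invented for illustration; they are not the rules generated from the diabetic data.

```python
# Hypothetical rules: (condition dict, decision, support = objects covered).
rules = [
    ({"conduct": "abnormal"}, "abnormal", 12),
    ({"emotional": "normal", "peer": "normal"}, "normal", 20),
    ({"emotional": "abnormal"}, "abnormal", 8),
]

def matches(rule_conds, obj):
    """A rule matches when all of its condition values agree with the object."""
    return all(obj.get(a) == v for a, v in rule_conds.items())

def classify(obj):
    """Each matching rule votes with weight equal to its support."""
    votes = {}
    for conds, decision, support in rules:
        if matches(conds, obj):
            votes[decision] = votes.get(decision, 0) + support
    if not votes:
        # Nearest matching rule: fewest mismatched condition features.
        conds, decision, _ = min(
            rules,
            key=lambda r: sum(obj.get(a) != v for a, v in r[0].items()),
        )
        return decision
    return max(votes, key=votes.get)

obj = {"conduct": "abnormal", "emotional": "abnormal", "peer": "normal"}
```

For `obj`, the first and third rules both match and vote "abnormal" with a combined weight of 20, so that class wins.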

ROUGH SETS DATA ANALYSIS TECHNIQUES
In this section, we discuss in detail the proposed rough sets scheme for analyzing the diabetic children patients' database. The scheme used in this study consists of two main stages: preprocessing and processing. The preprocessing stage includes tasks such as data cleaning, completeness and correctness checks, attribute creation, attribute selection, and discretization. The processing stage includes the generation of preliminary knowledge, such as the computation of object reducts from the data, the derivation of rules from the reducts, and the classification process. These stages lead toward the final goal of generating rules from the information or decision system of the diabetic database. Figure 1 shows the overall steps of the proposed rough sets data analysis scheme.

Preprocessing stage
In order to successfully analyze data with rough sets, a decision table must be created; this is done during data preparation. The data preparation task includes data conversion, data cleansing, data completion checks, conditional attribute creation, decision attribute generation, discretization of attributes, and splitting of the data into analysis and validation subsets. The initial data must be converted into a form to which specific rough set tools can be applied. Data splitting created two subsets, of 252 objects for the data analysis set and 50 objects for the validation set, using a random seed. More details are given later in the data characteristics and description section.

Data completion and discretization of continuous-valued attributes
Real-world data often contain missing values. Since rough set classification involves mining rules from the data, objects with missing values may have an undesirable effect on the rules that are constructed. The aim of data completion is to remove all objects that have one or more missing values. Incomplete information systems are widespread in practical data analysis, and completing an incomplete information system through various completion methods in the preprocessing stage is common in data mining and knowledge discovery. However, these methods may distort the original data and knowledge, and can even render the original data mining system unminable. To overcome these shortcomings of traditional methods, we used the decomposition approach for incomplete information systems proposed in [23].
When dealing with attributes in concept classification, it is obvious that they may have varying importance in the problem being considered. Their importance can be presumed using auxiliary knowledge about the problem and expressed by properly chosen weights. However, the rough set approach to concept classification avoids any additional information aside from what is included in the information table itself. Basically, the rough set approach tries to determine from the data available in the information table whether all the attributes are of the same strength and, if not, how they differ in classificatory power. Therefore, some strategy for the discretization of real-valued attributes has to be used when we apply learning strategies to data with real-valued attributes (e.g., equal-width or equal-frequency intervals). It has been shown that the quality of a learning algorithm depends on the strategy used for real data discretization [24]. Discretization is a data transformation procedure that involves finding cuts in the data sets which divide the data into intervals; values lying within an interval are then mapped to the same value. This process reduces the size of the attribute value sets and ensures that the rules that are mined are not too specific. In this paper, we adopt the rough sets with Boolean reasoning (RSBR) algorithm proposed by Zhong et al. [23] for the discretization of continuous-valued attributes. The main advantage of RSBR is that it combines discretization of real-valued attributes and classification. The main steps of the RSBR discretization algorithm are provided below (refer to Algorithm 1).
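The cut-based mapping of values to intervals can be sketched as below. The attribute names and cut points are assumed for illustration; a real RSBR run would search for a minimal discerning set of cuts rather than fix them by hand.

```python
import bisect

# Assumed cut points per attribute (illustrative, not computed by RSBR).
cuts = {
    "hemoglobin": [7.0, 9.0],   # three intervals: <7.0, [7.0, 9.0), >=9.0
    "child_age": [9.5, 12.0],
}

def discretize(attribute, value):
    """Map a real value to the index of the interval its cuts define."""
    return bisect.bisect_left(cuts[attribute], value)
```

All values falling between the same pair of cuts receive the same interval index, which is what keeps the mined rules from becoming too specific.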

Processing stage
As mentioned before, the processing stage includes generating preliminary knowledge, such as the computation of object reducts from the data, the derivation of rules from the reducts, and the classification process. These steps lead toward the final goal of generating rules from the information or decision system of the diabetic database.

Relevant attribute extraction and reduction
One of the important aspects of the analysis of decision tables extracted from data is the elimination of redundant attributes and the identification of the most important attributes. Redundant attributes are attributes that can be eliminated without affecting the degree of dependency between the remaining attributes and the decision. The degree of dependency is a measure of the ability to discern objects from each other. A minimal subset of attributes preserving the dependency degree is termed a reduct. The computation of the core and reducts from a decision table is a way of selecting relevant attributes [20, 25]. It is a global method in the sense that the resultant reducts represent the minimal sets of features necessary to maintain the same classificatory power given by the original and complete set of attributes. A more direct manner of selecting relevant attributes is to assign a measure of relevance to each attribute and choose the attributes with the highest values.
In decision tables, there often exist conditional attributes that do not provide (almost) any additional information about the objects. Those attributes should be removed, since doing so reduces the complexity and cost of the decision process [17, 20, 25, 26]. A decision table may have more than one reduct, and any of them can be used to replace the original table. Finding all the reducts of a decision table is NP-complete; fortunately, in applications it is usually not necessary to find all of them, and one or a few are sufficient. A natural question is which reducts are best. The selection depends on the optimality criterion associated with the attributes. If it is possible to assign a cost function to attributes, then the selection can naturally be based on a combined minimum cost criterion. In the absence of an attribute cost function, the only source of information for selecting a reduct is the contents of the table. In this paper, we adopt the criterion that the best reducts are the ones with the minimal number of attributes and, if there are more such reducts, with the least number of combinations of values of their attributes (cf. [25, 27]). We introduce a reduct algorithm based on the degrees of dependency and the discrimination factors. The main steps of the reduct generation algorithm are provided below (refer to Algorithm 2).

Rule generation and classification
The generated reducts are used to generate decision rules. A decision rule has, on its left side, a combination of attribute values such that the set of (almost) all objects matching this combination have the decision value given on the rule's right side. The rules derived from the reducts can be used to classify the data. The set of rules is referred to as a classifier and can be used to classify new, unseen data. The main steps of the rule generation and classification algorithm are provided below (refer to Algorithm 3).
When rules are generated, the number of objects that generate the same rule is typically recorded. The quality of the rules generated from the attributes included in a reduct is connected with the quality of that reduct. We are especially interested in generating rules which cover the largest possible parts of the universe: covering the universe with more general rules implies a smaller rule set. We can therefore use this idea in measuring the quality of a reduct. If a rule is generated more frequently across different rule sets, we say that this rule is more important than the others. The rule importance measure [28], R_I, is used to evaluate the quality of a generated rule. It is defined by

R_I = τ_r / ρ_r,

where τ_r is the number of times the rule appears in the rule sets generated from all reducts and ρ_r is the number of reduct sets.
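The rule importance measure R_I = τ_r/ρ_r can be computed directly from the rule sets generated from the individual reducts. The rule strings below are illustrative placeholders, not rules mined from the paper's data.

```python
# One rule set per reduct (hypothetical contents).
rule_sets_from_reducts = [
    {"conduct=abnormal => abnormal", "emotional=normal => normal"},
    {"conduct=abnormal => abnormal", "peer=abnormal => abnormal"},
    {"conduct=abnormal => abnormal", "emotional=normal => normal"},
]

def rule_importance(rule):
    """R_I = tau_r / rho_r: fraction of reduct rule sets containing the rule."""
    tau = sum(rule in rule_set for rule_set in rule_sets_from_reducts)
    rho = len(rule_sets_from_reducts)
    return tau / rho
```

A rule appearing in every reduct's rule set gets importance 1.0, marking it as a strong candidate to keep when pruning.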

Motivation
The present study is a cross-sectional study conducted at Kuwait University in the period September to November 2005. The sample comprised 302 children aged 7-13 years.
Trained interviewers administered questionnaires to parents and caretakers.
Algorithm 1: RSBR discretization.
Input: information system table S with real-valued attributes A_ij, and n, the number of intervals for each attribute.
Output: information table ST with discretized real-valued attributes.
1: for each A_ij ∈ S do
2:   define a set of Boolean variables {C_a1, ..., C_an}, where ⋃_{i=1}^{n} C_ai corresponds to a set of intervals defined on the values of attribute a
3: end for
4: create a new information table S_new by using the set of intervals C_ai
5: find the minimal subset of C_ai that discerns all the objects in the decision classes D, where Φ(i, j) denotes the number of minimal cuts that must be used to discern two different instances x_i and x_j in the information table

Data characteristics and its description
The data for this study were collected by the Statistical Consultation Unit in the College of Business Administration at Kuwait University, Kuwait. Participants (parents or guardians) of diabetic children were interviewed to complete the questionnaire. The interviews were conducted at a governmental hospital. The questionnaire consists of two parts: the first part covers socio-demographic and clinical characteristics of the subjects, and the second part is the strengths and difficulties questionnaire (SDQ) [29].
The socio-demographic and clinical characteristics include respondent nationality, respondent gender, child gender, respondent age, child age, respondent education, child education, family income, duration of diabetes (in years), whether either parent has diabetes, number of brothers who have diabetes, number of times the child entered the hospital because of diabetes, hemoglobin level, and type of diabetes. The strengths and difficulties questionnaire (SDQ) is widely used as a screening tool for emotional and behavioral problems in children aged 4-16 years [1, 24, 30-37]. The SDQ has been translated into more than 40 languages and is available on the internet at www.sdqinfo.com. The SDQ has 25 items, some positive and some negative, comprising 5 subscales of 5 items each. The five subscales are emotional symptoms, conduct problems, hyperactivity, peer problems, and prosocial behavior.
Hyperactivity scale: restless, overactive, cannot stay still for long; constantly fidgeting or squirming; easily distracted, concentration wanders; thinks things out before acting; sees tasks through to the end, good attention span. Each of the 25 items is marked as not true, somewhat true, or certainly true. Except for the five positive items, the scores for each item are 0 for not true, 1 for somewhat true, and 2 for certainly true. The five positive items are scored in the opposite direction: 2 for not true, 1 for somewhat true, and 0 for certainly true. The sum of the scores of each subscale ranges from 0 to 10, and the sum of the scores of the first four subscales of the SDQ gives a total difficulties score ranging from 0 to 40. The author of the SDQ classified scores for each of the subscales and for the total difficulties as normal, borderline, and abnormal (clinical). These classified scores are shown in Table 1.

Algorithm 3: Rule generation and classification.
1: for each reduct r do
2:   for each object x do
3:     construct the decision rule by scanning the reduct r over the object x:
4:     for every c ∈ C do
5:       assign the value v to the corresponding attribute a
6:     end for
7:     construct a decision attribute d
8:     assign the value u to the corresponding decision attribute d
9:   end for
10: end for
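The SDQ item scoring just described (0/1/2 per item, with the five positive items reverse-scored, giving each 5-item subscale a 0-10 range) can be sketched as follows. The example responses are invented; they are not the questionnaire's actual 25 items or any respondent's data.

```python
# Scores for negatively worded items; positive items are reverse-scored.
SCORES = {"not true": 0, "somewhat true": 1, "certainly true": 2}

def item_score(response, positive=False):
    """Positive items are scored in the opposite direction (2/1/0)."""
    s = SCORES[response]
    return 2 - s if positive else s

def subscale_score(responses, positive_flags):
    """Sum of 5 item scores; result lies in the range 0-10."""
    return sum(item_score(r, p) for r, p in zip(responses, positive_flags))

# A 5-item subscale with two reverse-scored items, e.g. the hyperactivity
# scale's "thinks things out" and "sees tasks through" items.
responses = ["certainly true", "somewhat true", "not true",
             "certainly true", "not true"]
positive = [False, False, False, True, True]
```

Summing the first four subscale scores in the same way yields the 0-40 total difficulties score.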
Table 2 shows the percentages of boys and girls and total percentages of children whose scores are in the normal, borderline, and abnormal classes.
The cutoffs were chosen so that roughly 80% of children in the community are categorized as normal, 10% as borderline, and 10% as abnormal. Goodman (1997) in [29] pointed out that the "borderline" cutoffs can be used with high-risk samples where false positives are not a major concern, and the "abnormal" cutoffs can be used for studies of low-risk samples where it is more important to reduce the rate of false positives. In the present study, abnormal and borderline cases were considered positive for mental health problems. According to the total difficulties scores, the results showed that 69.1% of the children have overall mental health problems (55% abnormal and 14.1% borderline). The highest percentage (82.4%) was for emotional problems, whereas the lowest percentage (30%) was for peer relationship problems.

EXPERIMENTAL ANALYSIS
The data set studied in this paper consists of 302 children patients with diabetes. Knowledge representation in rough sets is done via an information system, a tabular form of the object-attribute value relation (refer to Table 1). The first analysis studies the statistical distribution of the attributes. For many data mining tasks, it is useful to learn general characteristics of the given data set, such as central tendency and data dispersion; typical measures of central tendency are the mean and the median. Very often in large data sets there exist samples that are not consistent with the general behavior of the data model; such data are called outliers. Outlier detection is important since outliers may affect classifier accuracy. The simplest approach to outlier detection is to use statistical measures; in our experiments, we use the mean and median to detect the outliers in our data set. Tables 3 and 4 present the statistical analysis and the distribution of the attributes, respectively. By applying the introduced reduct generation algorithm (refer to Algorithm 2), we compute the degrees of dependency and the discrimination factors of the attributes.
Tables 5 and 6 show the discrimination factor for one and five attributes.
From Table 5, we observe that the conduct attribute has the highest discrimination factor, so we choose it as the first attribute in the next combination to generate sets of two attributes: the first is the conduct attribute and the second is each of the remaining conditional attributes in turn. We then compute the discrimination factor for all such sets and choose the highest discrimination factor for two attributes. We repeat the same procedure with three attributes, four attributes, and so forth, until we reach the minimal number of reducts that contains a combination of attributes with the same discrimination factor (see Table 6). Table 7 shows the final generated reduct sets, which are used to generate the list of rules for the classification. A natural use of a set of rules is to measure how well the ensemble of rules is able to classify new and unseen objects. To measure the performance of the rules, we assess how well they classify new cases, so we apply the rules produced from the training set to the test set. Our measuring criteria are sensitivity, specificity, and accuracy. The sensitivity of a classifier gives a measure of how good it is at detecting that an event defined through an object has occurred, while the specificity gives a measure of how good it is at picking up a nonevent defined through the object. These evaluation measures can be calculated from the confusion matrix, as shown in Table 8.
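The three evaluation measures can be read directly off a 2x2 confusion matrix. The counts below are invented for illustration; they are not the results reported in Table 8.

```python
# Hypothetical confusion matrix counts.
TP, FN = 40, 5   # events correctly detected / missed
FP, TN = 3, 52   # non-events falsely flagged / correctly rejected

sensitivity = TP / (TP + FN)            # ability to detect events
specificity = TN / (TN + FP)            # ability to pick up non-events
accuracy = (TP + TN) / (TP + FN + FP + TN)
```

Sensitivity and specificity are reported alongside accuracy because a classifier can score high accuracy on an imbalanced sample while missing most events.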
Table 9 shows the number of generated rules before and after the pruning process. We can observe that the number of generated rules for all algorithms is large, which makes classification unacceptably slow. Therefore, it is necessary to prune the rules during their generation.

Statistical discriminant analysis and empirical results
Discriminant analysis aims at finding weighted linear functions of the predictor variables. These linear discriminant functions are used to classify objects into distinct groups according to their observed characteristics, usually by calculating the scores of the linear functions. In addition, it is of interest to determine the predictor variables that contribute significantly to the linear discriminant functions. The analysis was conducted using a stepwise selection procedure. Since we have three groups (normal, borderline, and abnormal), two discriminant functions were extracted. These two functions (shown in Figure 2) were used to classify the diabetic children into one of the three groups. When determining whether the two discriminant functions are significant in separating patients into the three groups, we found that the first function explains 99.2% of the total variance and the chi-square test of Wilks' lambda is significant (P = 0). In contrast, the second function explains only 0.8% of the total variance and the chi-square test of Wilks' lambda is not significant (P = .171). To know which variables have the greater impact, we examine the standardized canonical discriminant functions; recall that the second function was not significant. For the first function, the emotional factor has the greatest impact (.644), followed by the conduct factor (.639), hyperactivity (.505), peer (.410), and child gender (−.174). Since the three centroids are significantly different, the first function will do a good job of discriminating between the three groups. This result is illustrated in Figure 2.
Table 10 shows the classification results based on the statistical discriminant analysis and cross-validation. Figure 3 shows the overall classification accuracy of three approaches compared with the rough set approach. It shows that the rough set approach performs much better than neural networks, the ID3 decision tree, and statistical discriminant analysis. Moreover, for the neural network and decision tree classifiers, more robust features are required to improve their performance.

CONCLUSIONS AND FUTURE WORKS
In this paper, we have presented an intelligent data analysis approach based on rough set theory for generating classification rules from 302 observed samples of Kuwaiti diabetic children patients. The main objective is to investigate the relationship of diabetes to psychological problems for Kuwaiti children aged 7-13 years. A decomposition approach based on rough set theory has been used to hierarchically extract complete subsets from the incomplete information system. To increase the efficiency of the classification process, the rough sets with Boolean reasoning (RSBR) discretization algorithm is used to discretize the data. Then, the rough set reduction technique is applied to find all reducts of the data, which contain the minimal subsets of attributes associated with a class label for the classification.
The results showed that the rough set approach achieves higher classification accuracy with fewer generated rules than three other approaches: neural networks, the ID3 decision tree, and statistical discriminant analysis.
In conclusion, this study shows that the theory of rough sets is a useful tool for inductive learning and a valuable aid for building expert systems. Further work needs to be done to minimize the experiment duration in order to include experts in the experiments. Combining different computational intelligence techniques has become one of the most important directions in intelligent information processing research, and neural networks have shown a strong ability to solve complex problems such as the one discussed here. From the perspective of the specific rough set approaches to be applied, extending this work by combining rough sets with other intelligent systems, such as neural networks, genetic algorithms, and fuzzy approaches, will be part of our future work.

Figure 1 :
Figure 1: Overall rough sets data analysis scheme.

Figure 3 :
Figure 3: Comparative analysis in terms of classification accuracy.
High scores on each of the first four subscales indicate difficulties, whereas high scores on the prosocial subscale indicate strengths.

Algorithm 2: Reduct generation.
Input: information table (ST) with discretized real-valued attributes.
Output: reduct sets R_final = {r_1 ∪ r_2 ∪ ⋯ ∪ r_n}
1: for each condition attribute c ∈ C do
2:   compute the correlation factor between c and the decision attributes D
3:   if the correlation factor > 0 then

Table 2 :
Percentage of children in normal, borderline, and abnormal groups.

Table 3 :
Statistical results of the attributes.

Table 4 :
Attribute distribution within the classes.

Table 5 :
Discrimination factor for one attribute.

Table 6 :
Discrimination factor for five attributes.

Table 9 :
Number of generated rules before and after pruning.

Table 10 :
Discriminant analysis classification results (93.3% of original grouped cases correctly classified; 91.9% of cross-validated grouped cases correctly classified).
a Cross-validation is done only for those cases in the analysis.In cross validation, each case is classified by the functions derived from all cases other than that case.