Classification Based on both Attribute Value Weight and Tuple Weight under the Cloud Computing

In recent years, cloud computing has attracted increasing attention. Users in a cloud computing environment must deal with massive data, and classification can predict user needs from such data. Traditional classification methods commonly handle covered instances in one of two ways: they either remove an instance after it is covered by a rule, or they decrease the tuple weight of the instance after it is covered by a rule. The rule sets produced by these classifiers may be of limited quality, so they cannot achieve high classification accuracy on some data. In this paper, we present a new classification approach, called classification based on both attribute value weight and tuple weight (CATW). CATW differs from traditional classifiers in two respects. First, CATW uses both attribute value weight and tuple weight. Second, CATW proposes a new measure to select the best attribute values and generate a high-quality classification rule set. Our experimental results indicate that CATW can achieve higher classification accuracy than some traditional classifiers.


Introduction
Cloud computing has become a hot topic in recent years. With the rapid development of information technology and the popularity of cloud computing, it is necessary to mine useful information from massive data [1][2][3][4][5][6][7][8][9]. Classification is one of the most important tasks in data mining and machine learning, and it can predict user needs from large data. First, it builds classification rules from a training dataset. Second, it uses these rules to predict the class labels of new instances.
Traditional classifiers [10][11][12][13][14][15][16][17][18][19] commonly adopt one of two strategies. Some remove an instance after it is covered by a rule, such as FOIL [20] and ELEM2 [21]. Others decrease the tuple weight of an instance after it is covered by a rule, such as PRM and CPAR [22]. We briefly describe these classifiers. When extracting rules, FOIL uses a gain measure to select the best attribute value and generates one classification rule at a time. It removes an instance after it is covered by a rule. As a result, FOIL generates a small rule set and cannot achieve high accuracy on some data. ELEM2 uses a different measure to generate classification rules: it considers the degree of relevance of each attribute-value pair and selects the most relevant pairs. It also removes an instance after it is covered by a rule. PRM modifies FOIL to achieve higher accuracy. PRM does not remove an instance when it is covered by a rule; instead, it gives the instance a tuple weight, which ensures that each instance can be covered more than once. PRM selects only the literal with the best gain to generate a rule. CPAR stands in the middle between exhaustive and greedy algorithms and combines the advantages of both. CPAR selects several of the best attribute values and builds several rules at one time. It does not remove an instance immediately when it is covered by a rule, and it also uses tuple weight to guarantee that each instance can be covered more than once. None of these methods employs attribute value weight, so they cannot obtain a high-quality classification rule set. As a result, they cannot achieve high classification accuracy on some data.
In this paper, we propose a new algorithm, named classification based on both attribute value weight and tuple weight (CATW). CATW uses both attribute value weight and tuple weight. Moreover, CATW uses a new measure to improve the quality of the classification rule set. Our method has the following advantages.
(1) After an instance is covered by a rule, instead of removing it, its weight is decreased by multiplying it by a factor. Thus, we can guarantee that each instance can be covered more than once.
(2) If we only use tuple weight, we cannot change the importance of an attribute-value pair in the dataset. Therefore, CATW uses attribute value weight to reduce the importance of an attribute-value pair after a rule is generated. In this way, CATW increases the chances of finding other optimal attribute-value pairs and can generate more high-quality rules.
The outline of this paper is as follows. Section 2 presents the details of CATW and describes the process of rule generation in CATW. Section 3 discusses how to predict class labels using the rules. The experimental results are presented in Section 4. Finally, we conclude the study in Section 5.

Rule Generation of CATW
The CATW algorithm has three distinguishing features: the attribute value weight, the tuple weight, and the improved measure. First, we describe how tuple weight is used. Second, we introduce the use of attribute value weight. Third, we propose a new measure to generate a high-quality classification rule set. Finally, we show the whole process of generating the rule set.
Definition 1 (a literal). A literal $p$ is an attribute-value pair of the form $(A_i, v)$, where $A_i$ is an attribute and $v$ is a value of attribute $A_i$.
Definition 2 (a classification rule). $r: p_1 \wedge p_2 \wedge \cdots \wedge p_l \rightarrow c$ is called a classification rule if its antecedent is a conjunction of literals $p_1, p_2, \ldots, p_l$ and $c$ is a class label.
A tuple $t$ satisfies the antecedent of $r$ if and only if it contains all literals in $r$. If $t$ satisfies the antecedent of $r$, then $r$ predicts that $t$ has class label $c$.
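Definitions 1 and 2 can be illustrated with a minimal sketch. The names `Rule` and `satisfies`, the dictionary representation of a tuple, and the second example tuple are assumptions made for illustration only; the rule itself is the one used later in Example 3.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# A literal is an (attribute, value) pair, as in Definition 1.
Literal = Tuple[str, str]


@dataclass(frozen=True)
class Rule:
    """A conjunction of literals (the antecedent) and a class label (Definition 2)."""
    antecedent: Tuple[Literal, ...]
    label: str


def satisfies(t: Dict[str, str], rule: Rule) -> bool:
    """A tuple satisfies the antecedent iff it contains every literal of the rule."""
    return all(t.get(attr) == value for attr, value in rule.antecedent)


rule = Rule((("OUTLOOK", "rain"), ("WINDY", "TRUE")), "no")
covered = {"OUTLOOK": "rain", "WINDY": "TRUE"}
not_covered = {"OUTLOOK": "sunny", "WINDY": "TRUE"}  # hypothetical tuple

print(satisfies(covered, rule), satisfies(not_covered, rule))
```

When `satisfies` returns true, the rule predicts `rule.label` for the tuple.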

The Tuple Weight.
In traditional classification, all rules are generated from the training database. If a tuple $t$ is covered by a rule $r$, these classifiers cannot ensure that $r$ is the best rule for $t$. If $r$ is generated from the remaining dataset instead of the whole dataset [22], $r$ may not be the best rule. In order to improve the classification accuracy and increase the number of rules, some traditional classifiers use tuple weight. By relying on tuple weight, these classifiers can delay removing an instance after it is covered by a rule. In our algorithm, after a tuple is covered by a rule, instead of removing it, its weight is decreased by multiplying it by a factor. We set a threshold for the tuple weight. When the weight of tuple $t$ falls below the threshold, we remove $t$ from the training data. In this way, CATW produces more rules, and each tuple can be covered by classification rules more than once.
In our approach, we can set an initial threshold and an end threshold, which lets us limit the number of generated rules according to the actual situation. A small end threshold yields a large number of rules; conversely, a large end threshold yields fewer rules. In our experiment, we set the initial threshold to 1 and the weight factor to 0.75. The end threshold is the third power of the weight factor, so that each instance can be covered three times.
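The effect of the weight factor and the end threshold can be checked with a small sketch. The function name `times_covered` and the `strict` flag are assumptions made for illustration. Exact fractions are used so that the comparison at the boundary is not affected by floating-point rounding.

```python
from fractions import Fraction


def times_covered(initial_weight, factor, end_threshold, strict=True):
    """Count how many rules can cover a tuple before it is removed.

    Each cover multiplies the tuple weight by `factor` (0 < factor < 1).
    The tuple is removed once its weight is below the end threshold
    (strict=True) or at or below it (strict=False).
    """
    weight = Fraction(initial_weight)
    covers = 0
    while True:
        weight *= factor  # the tuple is covered by one more rule
        covers += 1
        removed = weight < end_threshold if strict else weight <= end_threshold
        if removed:
            return covers


factor = Fraction(3, 4)      # weight factor 0.75
end_threshold = factor ** 3  # third power of the weight factor

print(times_covered(1, factor, end_threshold, strict=True))   # 4
print(times_covered(1, factor, end_threshold, strict=False))  # 3
```

Under these assumptions, the number of covers depends on how a weight equal to the threshold is treated: removing the tuple at that point gives three covers, while the strict comparison written in Algorithm 1 allows a fourth.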

The Attribute Value Weight.
Some traditional classifiers only use tuple weight. They do not change the importance of an attribute-value pair in the training data. After a rule is generated, these classifiers may select the same attribute-value pair again. Thus, they may miss some high-quality rules that affect the classification accuracy. CATW uses attribute value weight to reduce the importance of an attribute-value pair after a rule is generated. When a tuple is covered by a rule, our algorithm reduces the importance of the attribute-value pairs contained in the rule. In this way, we increase the chances of finding another optimal attribute-value pair.
Example 3. A training dataset with two classes is shown in Table 1. We use it to demonstrate how attribute value weight is applied.
Suppose the rule $r$ = {OUTLOOK = rain ∧ WINDY = TRUE → PLAY = no} has just been generated. We set the weight factor to 0.8 and treat the examples with PLAY = no as positive examples. After a rule is generated, CATW uses the weight factor to reduce the importance of all attribute values contained in the antecedent of the rule in the positive examples. The result is shown in Table 2.
The results of our experiment indicate that classification accuracy is influenced by attribute value weight. Compared with classifiers that do not use attribute value weight, CATW can achieve higher classification accuracy on some data. Thus, attribute value weight can help to improve the quality of the classification rules.
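The update in Example 3 can be sketched as follows. Since Tables 1 and 2 are not reproduced here, the three positive examples are hypothetical, and storing one weight per attribute in each positive example is an assumption about the bookkeeping rather than the authors' implementation.

```python
def decay_attribute_value_weights(positives, weights, antecedent, factor):
    """Multiply the weight of every antecedent literal found in a positive example."""
    for i, example in enumerate(positives):
        for attr, value in antecedent:
            if example.get(attr) == value:
                weights[i][attr] *= factor


# Hypothetical positive examples (class PLAY = no).
positives = [
    {"OUTLOOK": "rain", "WINDY": "TRUE"},
    {"OUTLOOK": "rain", "WINDY": "FALSE"},
    {"OUTLOOK": "sunny", "WINDY": "FALSE"},
]
# Every attribute value starts with weight 1.
weights = [{attr: 1.0 for attr in example} for example in positives]

# Antecedent of the rule OUTLOOK = rain AND WINDY = TRUE -> PLAY = no.
antecedent = [("OUTLOOK", "rain"), ("WINDY", "TRUE")]
decay_attribute_value_weights(positives, weights, antecedent, 0.8)

for example, w in zip(positives, weights):
    print(example, w)
```

After the update, OUTLOOK = rain and WINDY = TRUE carry weight 0.8 wherever they occur in the positive examples, while all other attribute values keep weight 1. Those two literals therefore contribute less the next time a literal is selected.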

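The Measure of CATW.
In our experiment, we employ two different improved measures: the improved FOIL measure and the improved correlation measure. The improved correlation measure addresses the case in which the logarithmic term of the FOIL gain is too small and $|P^*|$ is too large, so that the value of gain($p$) does not identify the best literal for the rule. We use two different measures, support and correlation confidence, and divide the traditional FOIL measure into two parts. When we select a literal $p$, a global order over the literals is composed. Given two literals $p_1$ and $p_2$, $p_1 > p_2$ denotes that $p_1$ is better than $p_2$.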

Algorithm of CATW.
In this part, we introduce our algorithm in detail. The CATW algorithm is presented in Algorithm 1.

Classification of CATW
Before making any prediction, we use the Laplace expected error estimate [23] to evaluate the quality of each rule. It is defined as follows:

$$\text{Laplace accuracy} = \frac{n_c + 1}{n_{tot} + k},$$

where $k$ is the number of classes, $n_{tot}$ is the total number of examples satisfying the antecedent of the rule, and $n_c$ is the number of those examples that belong to class $c$. To predict the class label of an unknown instance, we use the rules that are matched by the instance. If all of these rules have the same consequent, we assign that label to the instance. If the best rules predict several classes, we calculate the average Laplace accuracy of the rules for each class. Then, we select the class label with the highest average value and assign it to the instance.
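The prediction step can be sketched as follows. The function names and the representation of a matched rule as a (class label, Laplace accuracy) pair are assumptions made for illustration. The sketch expects at least one matching rule, since the text does not say how an instance with no matching rule is handled.

```python
from collections import defaultdict


def laplace_accuracy(n_c, n_tot, k):
    """Laplace accuracy (n_c + 1) / (n_tot + k) of a rule."""
    return (n_c + 1) / (n_tot + k)


def predict(matched_rules):
    """Pick the class whose matched rules have the highest average Laplace accuracy.

    matched_rules: non-empty list of (class_label, laplace_accuracy) pairs
    for the rules whose antecedent the instance satisfies.
    """
    by_class = defaultdict(list)
    for label, accuracy in matched_rules:
        by_class[label].append(accuracy)
    return max(by_class, key=lambda c: sum(by_class[c]) / len(by_class[c]))


# A rule covering 10 examples, 8 of them in its class, with 2 classes.
print(laplace_accuracy(8, 10, 2))  # 0.75
# "yes" averages 0.675 and "no" averages 0.9, so "no" is predicted.
print(predict([("yes", 0.75), ("no", 0.9), ("yes", 0.6)]))
```

When all matched rules share one consequent, `by_class` has a single key and that label is returned, which matches the first case described above.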

Experimental Results
All experiments are performed on 12 different datasets from the UCI data collection. All experiments were conducted using stratified tenfold cross-validation. In cross-validation, the dataset is divided into 10 blocks. Each block is held out once, and the classifier is trained on the remaining 9 blocks. The characteristics of each dataset are shown in Table 3. We perform our experiments on a 2.2 GHz PC with 2 GB of memory, running Microsoft Windows XP.
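A minimal sketch of stratified tenfold cross-validation is given below. The function names are hypothetical, and the round-robin assignment without shuffling is an assumption. The text does not say how the blocks were actually drawn.

```python
from collections import defaultdict


def stratified_folds(labels, k=10):
    """Split example indices into k blocks with similar class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        for index in indices:  # deal each class round-robin over the blocks
            folds[position % k].append(index)
            position += 1
    return folds


def cross_validation_splits(labels, k=10):
    """Yield (train, test) index lists; each block is held out exactly once."""
    folds = stratified_folds(labels, k)
    for held_out in range(k):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test


labels = ["yes"] * 30 + ["no"] * 20
for train, test in cross_validation_splits(labels):
    print(len(train), len(test), sum(labels[i] == "yes" for i in test))
```

With 30 "yes" and 20 "no" examples, every block holds 3 "yes" and 2 "no" examples, so each training set has 45 examples and each test set has 5.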
In Table 4, we use the improved FOIL measure. Figure 1 gives the accuracy of FOIL, CMAR, CPAR, and CATW based on Table 4. CATW uses both attribute value weight and tuple weight and employs the improved FOIL measure. From Figure 1 and Table 4, we can see that CATW can achieve higher accuracy than FOIL, CMAR, and CPAR.
In Table 5, we use the improved correlation measure. Figure 2 gives the accuracy of FOIL, CMAR, CPAR, and CATW based on Table 5. CATW uses both attribute value weight and tuple weight and employs the improved correlation measure. From Figure 2 and Table 5, we can see that CATW with the improved correlation measure can also achieve higher accuracy than FOIL, CMAR, and CPAR.
By comparison, the accuracy of CATW with the improved correlation measure is higher than the accuracy of CATW with the improved FOIL measure. Tables 4 and 5 therefore indicate that it is necessary to use the improved correlation measure. Tables 6 and 7 display the accuracy of CATW for different attribute value weights: in Table 6 CATW employs the improved FOIL measure, and in Table 7 it employs the improved correlation measure. The results in the two tables indicate that (1) the accuracy of the improved correlation measure is higher than the accuracy of the improved FOIL measure and (2) different attribute value weights have different effects on classification accuracy. From all of the above results, we conclude that (1) it is necessary to use attribute value weight and tuple weight; (2) it is necessary to use the improved correlation measure; (3) different attribute value weights have different effects on classification accuracy.

Conclusions and Future Work
With the rapid development of information technology and the popularity of cloud computing, it is necessary to mine useful information from massive data. Traditional classification methods commonly adopt one of two strategies. One removes an instance after it is covered by a rule, without using tuple weight. The other only decreases the tuple weight of an instance after it is covered by a rule. As a result, they cannot achieve high classification accuracy on some data. In this paper, we present a novel approach, CATW. First, CATW uses both attribute value weight and tuple weight. Second, CATW proposes a new measure, the improved correlation measure, which it employs to select the best attribute values and generate a high-quality classification rule set. The results of our experiment indicate that CATW can generate a reasonable number of classification rules and can achieve high classification accuracy. Our experiment also shows that different attribute value weights have different effects on classification accuracy. At present, we have not found a regular pattern for selecting an optimal attribute value weight, and future research will focus on this problem. We will also combine distributed data mining with a cloud computing platform in order to improve the efficiency of CATW.
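Measure of CATW. Some classifiers use FOIL gain to select a literal. FOIL gain measures the information gained from adding a literal $p$ to the current rule $r$. Let $|P|$ be the number of positive examples that satisfy the antecedent of the current rule $r$, and let $|N|$ be the number of negative examples that satisfy the antecedent of $r$. After literal $p$ is added to $r$, $|P^*|$ is the number of positive examples that satisfy the antecedent of the new rule, and $|N^*|$ is the number of negative examples that satisfy the antecedent of the new rule [22]. The FOIL gain of $p$ is defined as

$$\mathrm{gain}(p) = |P^*| \left( \log \frac{|P^*|}{|P^*| + |N^*|} - \log \frac{|P|}{|P| + |N|} \right). \quad (1)$$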
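As a minimal sketch, the traditional FOIL gain can be computed from the four counts as follows. The function name is hypothetical, and the natural logarithm is an assumption: the base only rescales the gain and does not change which literal ranks best.

```python
import math


def foil_gain(p, n, p_star, n_star):
    """FOIL gain of a literal.

    p, n: positive and negative examples satisfying the current rule (p > 0).
    p_star, n_star: positive and negative examples satisfying the rule
    after the literal is added.
    """
    if p_star == 0:
        return 0.0  # the literal covers no positive example
    return p_star * (math.log(p_star / (p_star + n_star)) - math.log(p / (p + n)))


# The literal keeps 2 of 4 positives and excludes all 4 negatives.
print(foil_gain(4, 4, 2, 0))
# The literal keeps the class proportions unchanged, so nothing is gained.
print(foil_gain(4, 4, 2, 2))
```

The first call gives 2 log 2, and the second gives 0 because the proportion of positive examples is the same before and after the literal is added.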
In Tables 4 and 5, Column 1 shows the accuracy of FOIL. Column 2 shows the accuracy of CMAR. Column 3 shows the accuracy of CPAR. Column 4 shows the accuracy of CATW without attribute value weight and with tuple weight 0.75. Column 5 shows the accuracy of CATW with attribute value weight 0.8 and tuple weight 0.75. Column 6 shows the accuracy of CATW with attribute value weight 0.5 and tuple weight 0.75.

Table 1: The training dataset.

Table 2: Attribute value weight in positive examples.

Table 3: Characteristics of UCI datasets.

In the traditional FOIL gain [22], $|P|$ and $|N|$ mean the numbers of positive and negative examples which satisfy the antecedent of the current rule $r$. After literal $p$ is added to $r$, $|P^*|$ means the number of positive examples which satisfy the antecedent of the new rule, and $|N^*|$ means the number of negative examples which satisfy the antecedent of the new rule.

Improved FOIL Measure. In our experiment, $|P|$ means the total tuple weight of the positive examples which satisfy the antecedent of the current rule $r$, and $|N|$ means the total tuple weight of the negative examples which satisfy the antecedent of $r$. After literal $p$ is added to $r$, $|P^*|$ means the total attribute value weight of literal $p$ in the positive examples, and $|N^*|$ means the total attribute value weight of literal $p$ in the negative examples. Therefore, CATW uses both tuple weight and attribute value weight when it measures literal $p$. We call this measure the improved FOIL measure.

Improved Correlation Measure. In the traditional FOIL gain, $|P^*|$ has a large influence on the selection of the best attribute value. For example, if $\log(|P^*|/(|P^*| + |N^*|)) - \log(|P|/(|P| + |N|))$ is too small and $|P^*|$ is too large, the value of gain($p$) does not identify the best literal for the rule.
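The weighted quantities of the improved FOIL measure can be sketched as follows. The dictionary layout of an example (`values`, `tuple_weight`, `av_weight`) and the function name are assumptions made for illustration; the weighted totals are simply substituted for the counts in the FOIL gain formula, with the natural logarithm assumed.

```python
import math


def improved_foil_measure(pos, neg, literal):
    """FOIL gain with weighted totals in place of example counts.

    pos, neg: examples satisfying the current rule. Each example is a dict
    with 'values' (attribute -> value), 'tuple_weight', and 'av_weight'
    (attribute -> attribute value weight).
    """
    attr, value = literal
    p = sum(e["tuple_weight"] for e in pos)  # |P|: total tuple weight
    n = sum(e["tuple_weight"] for e in neg)  # |N|: total tuple weight
    # |P*| and |N*|: total attribute value weight of the literal.
    p_star = sum(e["av_weight"][attr] for e in pos if e["values"].get(attr) == value)
    n_star = sum(e["av_weight"][attr] for e in neg if e["values"].get(attr) == value)
    if p_star == 0:
        return 0.0
    return p_star * (math.log(p_star / (p_star + n_star)) - math.log(p / (p + n)))


def make(outlook):
    return {"values": {"OUTLOOK": outlook}, "tuple_weight": 1.0,
            "av_weight": {"OUTLOOK": 1.0}}


pos = [make("rain"), make("rain")]    # hypothetical positive examples
neg = [make("sunny"), make("sunny")]  # hypothetical negative examples

before = improved_foil_measure(pos, neg, ("OUTLOOK", "rain"))
for e in pos:
    e["av_weight"]["OUTLOOK"] *= 0.5  # the literal was used by an earlier rule
after = improved_foil_measure(pos, neg, ("OUTLOOK", "rain"))
print(before, after)
```

In this toy case, halving the attribute value weight of OUTLOOK = rain halves its score, which is how a literal that was already used becomes less likely to be selected again.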

Table 4: The accuracy of CATW with improved FOIL gain measure.

Table 5: The accuracy of CATW with improved correlation measure.
Input: Training set $D = P \cup N$ ($P$ and $N$ are the sets of all positive and negative examples, respectively)
Output: A set of rules for predicting class labels for examples
Procedure CATW
    attributeWeight ← attribute value weight factor
    tupleWeight ← tuple weight factor
    tupleThreshold ← tuple weight threshold
    rules ← null
    while |P| > 0
        N′ ← N, P′ ← P
        r ← null
        while |N′| > 0 and r.length < max rule length
            find the best attribute value av
                use the improved correlation measure
                combine tuple weight with attribute weight
            add av to r
            remove from P′ all examples not satisfying r
            remove from N′ all examples not satisfying r
        end
        add r to rules
        for each attribute at that is included in antecedent of r in P
            at.weight ← attributeWeight * at.weight
        end
        for each example t in P satisfying r's body
            t.weight ← tupleWeight * t.weight
            if t.weight < tupleThreshold then remove t from P
        end
    end
    return rules
Algorithm 1: Classification based on both attribute value weight and tuple weight (CATW).

Figure 1: The accuracy of CATW with improved FOIL gain measure.
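The control flow of Algorithm 1 can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the improved FOIL measure (with natural logarithm) stands in for the improved correlation measure, whose exact form is not reproduced here. The sketch also uses each attribute at most once per rule, stores a rule only once, and runs on a hypothetical four-example dataset. All function names and dictionary keys are made up for the example.

```python
import math


def satisfies(values, rule):
    """True if the example contains every (attribute, value) literal of the rule."""
    return all(values.get(a) == v for a, v in rule)


def measure(pos, neg, literal):
    """Improved FOIL measure, used here in place of the correlation measure."""
    a, v = literal
    p = sum(e["w"] for e in pos)  # total tuple weight of positives
    n = sum(e["w"] for e in neg)  # total tuple weight of negatives
    p_star = sum(e["av"][a] for e in pos if e["x"].get(a) == v)
    n_star = sum(e["av"][a] for e in neg if e["x"].get(a) == v)
    if p_star <= 0:
        return float("-inf")
    return p_star * (math.log(p_star / (p_star + n_star)) - math.log(p / (p + n)))


def catw_rules(positives, negatives, av_factor=0.8, tuple_factor=0.75,
               threshold=0.75 ** 3, max_len=3):
    def wrap(x):
        return {"x": x, "w": 1.0, "av": {a: 1.0 for a in x}}

    P = [wrap(x) for x in positives]
    N = [wrap(x) for x in negatives]
    rules = []
    while P:
        pos, neg, rule = list(P), list(N), []
        while neg and len(rule) < max_len:
            used = {a for a, _ in rule}
            candidates = sorted({(a, v) for e in pos for a, v in e["x"].items()
                                 if a not in used})
            if not candidates:
                break
            rule.append(max(candidates, key=lambda lit: measure(pos, neg, lit)))
            pos = [e for e in pos if satisfies(e["x"], rule)]
            neg = [e for e in neg if satisfies(e["x"], rule)]
        if not rule:
            break  # no negative examples to separate from
        if tuple(rule) not in rules:
            rules.append(tuple(rule))
        for e in P:
            for a, v in rule:  # decay attribute value weights of the antecedent
                if e["x"].get(a) == v:
                    e["av"][a] *= av_factor
            if satisfies(e["x"], rule):  # decay tuple weights of covered examples
                e["w"] *= tuple_factor
        P = [e for e in P if not e["w"] < threshold]
    return rules


positives = [{"OUTLOOK": "rain", "WINDY": "TRUE"},
             {"OUTLOOK": "sunny", "WINDY": "FALSE"}]
negatives = [{"OUTLOOK": "rain", "WINDY": "FALSE"},
             {"OUTLOOK": "overcast", "WINDY": "TRUE"}]
rules = catw_rules(positives, negatives)
print(rules)
```

Every generated rule covers at least one positive example, and each cover shrinks that example's tuple weight, so the outer loop ends once all positive examples have fallen below the threshold.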
Figure 2: The accuracy of CATW with improved correlation measure.