An Improved Apriori Algorithm Based on an Evolution-Communication Tissue-Like P System with Promoters and Inhibitors



Introduction
Frequent itemsets mining, a subfield of data mining, aims at discovering itemsets that occur with high frequency in large amounts of data. Interesting implicit associations between items can then be extracted from these data, which can help researchers and practitioners make informed decisions. One famous example is "beer and diapers" [1]. Through frequent itemsets mining, the supermarket management discovered a significant correlation between the purchases of beer and diapers, two products that ostensibly had nothing to do with each other. Consequently, they put diapers next to beer. Through this layout adjustment, sales of both beer and diapers increased.
The Apriori algorithm is a typical frequent itemsets mining algorithm, which is suitable for the discovery of frequent itemsets in transactional databases [2]. To process large datasets, many parallel variants have been proposed to improve the computational efficiency of the Apriori algorithm [3][4][5][6]. How to implement the Apriori algorithm in parallel to improve its computational efficiency is still an ongoing research topic. Given the limits that the technology and theory of silicon-based computing are approaching, P systems, a new class of non-silicon-based computing devices, are used in this study. P systems are new bioinspired computing models of membrane computing, which focus on abstracting computing ideas from the study of biological cells, particularly of cellular membranes [7,8]. This study uses an evolution-communication tissue-like P system with promoters and inhibitors (ECPI tissue-like P system) for computation. P systems are powerful distributed and parallel bioinspired computing devices that are computationally equivalent to Turing machines [9][10][11], and they have been applied to many fields. The applications of P systems are based on two types of membrane algorithms, the coupled membrane algorithm and the direct membrane algorithm. The coupled membrane algorithm combines a traditional algorithm with some structural characteristics of P systems, such as dividing the whole system into several relatively independent computing units that can communicate with each other, rebuilding the computing units dynamically, and executing rules in parallel [12][13][14][15][16]. The direct membrane algorithm designs the algorithm directly on the structure, the objects, and the rules of P systems [17][18][19][20][21]. The final goal of membrane computing is to build biocomputers, and the direct membrane algorithm can be transplanted to biocomputers directly, which makes it more meaningful from this perspective. However, the direct membrane algorithm requires transforming the whole traditional algorithm into a P system, which is complex and difficult. To date, only a few simple studies on the direct membrane algorithm exist, focusing on arithmetic operations, logic operations, the generation of graphic languages, and clustering [17][18][19][20][21].
In this study, a novel improved Apriori algorithm based on an ECPI tissue-like P system (ECTPPI-Apriori) is proposed using the parallel mechanism in P systems. The information communication between different computing units in ECTPPI-Apriori is implemented through the exchange of materials between membranes. Specifically, all itemsets are searched in parallel, regulated by a set of promoters and inhibitors. For a database with $t$ fields, $t + 2$ cells are used in the algorithm: one cell is used to enter the data in the database into the system, $t$ cells are used to detect the frequent itemsets, and one specific cell, called the output cell, is used to store the results. The time complexity of ECTPPI-Apriori is compared with those of other parallel Apriori algorithms to show that the proposed algorithm saves time.
The contributions of this study are twofold. From the viewpoint of data mining, new bioinspired techniques are introduced into frequent itemsets mining to improve the efficiency of the algorithms. P systems are natural distributed parallel computing devices which can improve the time efficiency of computation. Besides hardware and software implementations, P systems can be implemented by biological methods. The computing resources needed are only several cells, which decreases the computing resource requirements. From the viewpoint of P systems, the application areas of these new bioinspired devices are extended to frequent itemsets mining. Applications based on the direct membrane algorithm are still limited. This study provides a new application of P systems in frequent itemsets mining, which expands the application areas of the direct membrane algorithm.
The paper is organized as follows. Section 2 introduces some preliminaries about the Apriori algorithm and about ECPI tissue-like P systems. The ECTPPI-Apriori algorithm using the parallel mechanism of the ECPI tissue-like P system is developed in Section 3. In Section 4, an illustrative example is used to show how the proposed algorithm works. Computational experiments using two datasets to show the performance of the proposed algorithm in frequent itemsets mining are reported in Section 5. Conclusions are given in Section 6.

Preliminaries
In this section, some basic concepts and notions of the Apriori algorithm [2] and the ECPI tissue-like P system [7] are introduced.

The Apriori Algorithm.
The Apriori algorithm is a typical frequent itemsets mining algorithm proposed by Agrawal and Srikant [2], which aims at discovering relationships between items in transactional databases.

Definitions
(i) Item: a field in a transactional database is called an item. If a record contains a certain item, "1" is placed in the corresponding field of the record in the transactional database; otherwise, "0" is placed.
(ii) Itemset: a set of items is called an itemset. For notational convenience, an itemset with $h$ items $I_1$, […]

The general procedure of Apriori from Han et al. [1] is as follows.
Input. The database containing $D$ transactions and the support count threshold $k$.
Step 1. Scan the database to compute the support count of each item, and obtain the frequent 1-itemsets $L_1$. Let $h = 2$.
Step 4. Scan the database to compute the support count of each candidate frequent $h$-itemset. Delete those itemsets which do not meet the support count threshold $k$ and obtain the frequent $h$-itemsets $L_h$.
Output. The collection of all frequent itemsets is represented by $L$.
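The level-wise procedure can be illustrated with a minimal Python sketch of the textbook Apriori algorithm. It is not the authors' code: the function name `apriori`, the join-and-prune candidate generation, and the dictionary return type are assumptions made for illustration only.

```python
from itertools import combinations

def apriori(transactions, k):
    """Return {itemset: support count} for all itemsets with support count >= k."""
    transactions = [frozenset(t) for t in transactions]

    def support_counts(candidates):
        # One database scan: count the transactions containing each candidate.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Frequent 1-itemsets.
    items = {frozenset([item]) for t in transactions for item in t}
    frequent = {c: n for c, n in support_counts(items).items() if n >= k}
    all_frequent = dict(frequent)
    h = 2
    while frequent:
        prev = set(frequent)
        # Join: unions of two frequent (h-1)-itemsets that contain exactly h items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == h}
        # Prune: every (h-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, h - 1))}
        # Scan the database and keep candidates that meet the threshold.
        frequent = {c: n for c, n in support_counts(candidates).items() if n >= k}
        all_frequent.update(frequent)
        h += 1
    return all_frequent
```

Each pass of the loop rescans the whole database once per level, which is the sequential cost that parallel variants try to reduce.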

Evolution-Communication Tissue-Like P Systems with Promoters and Inhibitors.
Membrane computing is a new branch of natural computing, which abstracts computing ideas from the structure and the functions of cells or tissues. In nature, each organelle membrane or cell membrane works as a relatively independent computing unit. The amount and the types of materials in each organelle or cell change through chemical reactions. Materials can flow between different organelle or cell membranes to transport information.
Reactions in different organelles or cells take place in parallel, and reactions in the same organelle or cell also take place in parallel. These biological processes are abstracted as the computing processes of membrane computing. This inherent parallelism makes membrane computing a powerful computing method, which has been proven to be equivalent to Turing machines [7][8][9][10][11].
The ECPI tissue-like P system, composed of a network of cells linked by synapses (channels), is a typical membrane computing model. The whole P system is divided into separate regions by these cells, each cell forming one region. Each cell has two main components, multisets of objects (materials) and rules, also called evolution rules (chemical reactions). Objects, as information carriers, are represented by characters.
Rules regulate the ways objects evolve into new objects and the ways objects in different cells communicate through synapses. Rules are executed in a nondeterministic flat maximally parallel manner in each cell. That is, at any step, if more than one rule can be executed but the objects in the cell can support only some of them, then a maximal number of rules is executed, and each rule is executed at most once [22].
The computation halts if no rule can be executed in the whole system. The computational results are represented by the types and numbers of specified objects in a specified cell. Because objects in a P system evolve in a flat maximally parallel manner, regulated by promoters and inhibitors, the systems compute very efficiently [10,22]. Păun [7] provides more details about P systems.
A formal description of the ECPI tissue-like P system is as follows.
An ECPI tissue-like P system of degree $m$ is of the form

$$\Pi = (O, \sigma_1, \sigma_2, \ldots, \sigma_m, \mathrm{syn}, \rho, i_{\mathrm{out}}), \qquad (1)$$

where
(1) $O$ represents the alphabet including all objects of the system.
(2) $\mathrm{syn} \subseteq \{1, 2, \ldots, m\} \times \{1, 2, \ldots, m\}$ represents all synapses between the cells.
(3) $\rho$ defines the partial ordering relationship of the rules; that is, rules with higher orders are executed with higher priority.
(4) $i_{\mathrm{out}}$ represents the subscript of the output cell where the computation results are placed.
(5) $\sigma_1, \ldots, \sigma_m$ represent the $m$ cells. Each cell is of the form

$$\sigma_h = (w_{h,0}, R_h), \quad 1 \le h \le m. \qquad (2)$$

In (2), $w_{h,0}$ represents the initial objects in cell $h$, and $w_{h,0} = \lambda$ means that there is no object in cell $h$. If $a$ represents an object, $a^n$ represents $n$ copies of that object. $R_h$ in (2) represents a set of rules in cell $h$ of the form $(u)_z \to v\, v'_{go}$, where $u$ is the multiset of objects consumed by the rule, $z$ in the subscript is the promoter or the inhibitor, of the form $z = z'$ or $z = \neg z'$, and $v$ and $v'$ are the multisets of objects generated by the rule. A rule can be executed only when all objects in the promoter appear and cannot be executed when any object in the inhibitor appears. The multiset of objects $v$ stays in the current cell, and the multiset of objects $v'$ goes to the cells which have synapses connected from the current cell. The $j$th subset of rules in cell $h$ having similar functions is represented by $r_{hj}$, and the rules in the same subset are connected by $\cup$.
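The applicability condition for a rule with promoters and inhibitors can be sketched in Python as follows. This is only an illustration under stated assumptions: a cell is modelled as a `collections.Counter` multiset, and the helper names `applicable` and `apply_rule` are hypothetical and do not come from the paper.

```python
from collections import Counter

def applicable(cell, consumed, promoters=(), inhibitors=()):
    """A rule can fire if the consumed multiset is available, every promoter
    object is present, and no inhibitor object is present."""
    has_inputs = all(cell[obj] >= n for obj, n in consumed.items())
    promoted = all(cell[obj] > 0 for obj in promoters)
    inhibited = any(cell[obj] > 0 for obj in inhibitors)
    return has_inputs and promoted and not inhibited

def apply_rule(cell, consumed, stay, go, promoters=(), inhibitors=()):
    """Apply one rule once. Returns (new_cell, objects_sent), or None if the
    rule is not applicable. Promoters and inhibitors are only tested, never consumed."""
    if not applicable(cell, consumed, promoters, inhibitors):
        return None
    new_cell = cell - Counter(consumed) + Counter(stay)  # objects staying in this cell
    return new_cell, Counter(go)                         # objects sent along the synapses
```

In a full simulator, the objects returned as `objects_sent` would be added to every cell reachable from the current cell through a synapse.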

The ECTPPI-Apriori Algorithm
In this section, the structure of the P system used in the ECTPPI-Apriori algorithm is presented first, the computational processes in different cells are then discussed in detail, a pseudocode summarizing the operations is presented, and an analysis of the algorithm complexity is provided.
Algorithm and Rules.
Assume a transactional database contains $D$ records and $t$ fields. An object $a_{ij}$ is generated only if the $i$th transaction contains the $j$th item $I_j$ (i.e., there is a 1 in the corresponding field in the transactional database). In this way, the database is transformed into objects, a form that the P system can recognize. The support count threshold is set to $k$. A cell structure with $t + 2$ cells, labeled by $0, 1, \ldots, t + 1$, as shown in Figure 1, is used as the framework for ECTPPI-Apriori. The evolution rules are not shown in this figure due to their length. Transactional databases are usually sparse. Therefore, the number of objects $a_{ij}$ to be processed in this algorithm is much smaller than $Dt$.
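The encoding of the database into objects can be sketched as below. The sketch assumes the database is given as a list of 0/1 rows and represents each object by a 1-indexed pair (record, item); the function name `encode` is hypothetical.

```python
def encode(records):
    """Return the set of (i, j) pairs, 1-indexed, such that record i has a 1 in field j."""
    return {(i, j)
            for i, row in enumerate(records, start=1)
            for j, bit in enumerate(row, start=1)
            if bit == 1}
```

Only the 1-entries produce objects, so a sparse database yields far fewer objects than the number of records times the number of fields.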
When computation begins, objects $a_{ij}$ encoded from the transactional database and objects $\theta^k$ representing the support count threshold $k$ are entered into cell 0. Objects $a_{ij}$ and $\theta^k$ are passed to cells $1, 2, \ldots, t$ in parallel, using the parallel evolution mechanism in tissue-like P systems. The auxiliary objects $\delta_j^k$ are generated in cell 1. Next, the frequent 1-itemsets are produced and objects representing the frequent 1-itemsets are generated in cell 1 by executing the evolution rules in parallel. The objects representing the frequent 1-itemsets are passed to cells 2 and $t + 1$. Cell $t + 1$ is used to store the computational results. The frequent 2-itemsets and the objects representing the frequent 2-itemsets are produced in cell 2 by executing the evolution rules in parallel. The objects representing the frequent 2-itemsets are passed to cells 3 and $t + 1$. This process continues until all frequent itemsets have been produced. Compared with the conventional Apriori algorithm, the computational time needed by ECTPPI-Apriori to generate the candidate frequent $h$-itemsets $C_h$ and to compute the support count of each candidate frequent $h$-itemset can be substantially reduced.
The ECPI tissue-like P system for ECTPPI-Apriori is as follows.
The auxiliary objects $\delta_j^k$ and $\beta_j$, for $1 \le j \le t$, are used to detect the frequent 1-itemsets. The auxiliary objects $\delta_j^k$ store the items in the candidate frequent 1-itemsets using their subscripts. For example, object $\delta_1^k$ means the itemset $\{I_1\}$ is a candidate frequent 1-itemset. The auxiliary objects $\beta_j$ are used to identify the frequent 1-itemsets. The $k$ copies of $\delta_j$ initially in cell 1 indicate that the $j$th item needs to appear in at least $k$ records to make the itemset $\{I_j\}$ a frequent 1-itemset. One object $\delta_j$ is removed from, and one object $\beta_j$ is generated in, cell 1 each time the $j$th item is detected in one more record. Therefore, if no object $\delta_j$ is left and $k$ objects $\beta_j$ have been generated in cell 1, at least $k$ records have been found to contain the $j$th item. The functions of $\delta_{j_1 j_2}^k$ in cell 2, $\delta_{j_1 j_2 j_3}^k$ in cell 3, ..., and $\delta_{j_1 \cdots j_t}^k$ in cell $t$ are similar to that of $\delta_j^k$ in cell 1. The functions of $\beta_{j_1 j_2}$ in cell 2, $\beta_{j_1 j_2 j_3}$ in cell 3, ..., and $\beta_{j_1 \cdots j_t}$ in cell $t$ are similar to that of $\beta_j$ in cell 1. The objects $\alpha_j$, for $1 \le j \le t$, are used to store the items in the frequent 1-itemsets using their subscripts. For example, $\alpha_1$ means the itemset $\{I_1\}$ is a frequent 1-itemset. The functions of $\alpha_{j_1 j_2}$ in cell 2, $\alpha_{j_1 j_2 j_3}$ in cell 3, ..., and $\alpha_{j_1 \cdots j_t}$ in cell $t$ are similar to that of $\alpha_j$ in cell 1.
The evolution rules are object rewriting rules similar to chemical reactions. They take objects, transform them into other objects, and may transport them to other cells.

Computing Process
Input. Cell 0 is the input cell. The objects $a_{ij}$ encoded from the transactional database and objects $\theta^k$ representing the support count threshold $k$ are entered into cell 0 to activate the computation process. Rule $r_{01}$ is executed to put copies of $a_{ij}$ and $\theta^k$ into cells $1, 2, \ldots, t$.
Frequent 1-Itemsets Generation. Frequent 1-itemsets are generated in cell 1. Rule $r_{11}$ is executed to generate the auxiliary objects $\delta_j^k$ for $1 \le j \le t$. Rule $r_{13}$ is executed to detect all frequent 1-itemsets using the internal flat maximally parallel mechanism in the P system. Rule $r_{12}$ cannot be executed because no object $\beta_j$ is in cell 1 at this time. The detection process of the candidate frequent 1-itemset $\{I_1\}$ is taken as an example. The detection processes of other candidate frequent 1-itemsets are performed in the same way. Rule $r_{13}$ is actually composed of multiple subrules working on objects with different subscripts. If object $a_{i1}$ is in cell 1, which means the $i$th record contains the first item, the subrule $\{a_{i1}\delta_1 \to \beta_1\}$ meets the execution condition and can be executed. If object $a_{i1}$ is not in cell 1, which means the $i$th record does not contain the first item, the subrule $\{a_{i1}\delta_1 \to \beta_1\}$ does not meet the execution condition and cannot be executed. Initially, $k$ copies of $\delta_1$ are in cell 1, indicating that the first item needs to appear in at least $k$ records for the itemset $\{I_1\}$ to be a frequent 1-itemset. Each execution of a subrule consumes one $\delta_1$. Therefore, at most $k$ subrules of the form $\{a_{i1}\delta_1 \to \beta_1\}$ can be executed in nondeterministic flat maximally parallel. The checking process continues until all objects $a_{i1}$ have been checked or all of the $k$ copies of $\delta_1$ have been consumed. If all of the $k$ copies of $\delta_1$ have been consumed, the first item appeared in at least $k$ records and the itemset $\{I_1\}$ is a frequent 1-itemset. If some copies of $\delta_1$ are still in this cell after all objects $a_{i1}$ have been checked, the itemset $\{I_1\}$ is not a frequent 1-itemset.
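Rule $r_{12}$ is then executed to process the results obtained by rule $r_{13}$. The 1-itemset $\{I_1\}$ is again taken as an example. If only $p < k$ copies of $\delta_1$ have been consumed by rule $r_{13}$, so that $k - p$ copies of $\delta_1$ are still in this cell, a subrule is executed to delete the objects $\beta_1^p$ and $\delta_1^{k-p}$. If all of the $k$ copies of $\delta_1$ have been consumed, subrule $\{(\beta_1^k)_{\neg\delta_1} \to \alpha_{1,go}\}$ is executed to put an object $\alpha_1$ into cells 2 and $t + 1$ to indicate that the itemset $\{I_1\}$ is a frequent 1-itemset and to activate the computation in cell 2. If no 1-itemset is a frequent 1-itemset, the computation halts.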
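The counting mechanism in cell 1 can be mimicked by a small sequential simulation. It is a sketch only: the loop stands in for the parallel execution of subrules, objects are represented as (record, item) pairs, and the function name `frequent_1_itemsets` is hypothetical.

```python
def frequent_1_itemsets(objects, t, k):
    """objects: set of (i, j) pairs; t: number of items; k: support count threshold.
    Returns the items j whose k counter copies were all consumed."""
    remaining = {j: k for j in range(1, t + 1)}  # k counter copies per item
    hits = {j: 0 for j in range(1, t + 1)}       # one hit object generated per match
    for (i, j) in sorted(objects):
        if remaining[j] > 0:                     # a counter copy is still available
            remaining[j] -= 1
            hits[j] += 1
    # An item is frequent exactly when no counter copy is left.
    return [j for j in range(1, t + 1) if remaining[j] == 0]
```

Because at most k counter copies exist per item, counting stops once the threshold is reached, regardless of how many more records contain the item.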
Frequent 2-Itemsets Generation. The frequent 2-itemsets are generated in cell 2. Rule $r_{21}$ is executed to obtain all candidate frequent 2-itemsets using the internal flat maximally parallel mechanism in the P system. The pair of empty parentheses in the subrules of $r_{21}$ indicates that no objects are consumed when this rule is executed. Rule $r_{24}$ is executed to detect all frequent 2-itemsets using the internal flat maximally parallel mechanism in the P system. Rule $r_{23}$ cannot be executed because no object $\beta_{j_1 j_2}$ is in cell 2 at this time. The detection process of the candidate frequent 2-itemset $\{I_1, I_2\}$ is taken as an example. Rule $r_{24}$ is actually composed of multiple subrules working on objects with different subscripts. If objects $a_{i1}$ and $a_{i2}$ are both in cell 2, which means the $i$th record contains the first and the second items, subrule $\{(\delta_{12})_{a_{i1}a_{i2}} \to \beta_{12}\}$ meets the execution condition and can be executed. If objects $a_{i1}$ and $a_{i2}$ are not both in cell 2, which means the $i$th record does not contain both the first and the second items, subrule $\{(\delta_{12})_{a_{i1}a_{i2}} \to \beta_{12}\}$ does not meet the execution condition and cannot be executed. Initially, $k$ copies of $\delta_{12}$ are in cell 2, indicating that the first and the second items need to appear together in at least $k$ records for the itemset $\{I_1, I_2\}$ to be a frequent 2-itemset. Each execution of these subrules consumes one $\delta_{12}$. Therefore, at most $k$ subrules of the form $\{(\delta_{12})_{a_{i1}a_{i2}} \to \beta_{12}\}$ can be executed in nondeterministic flat maximally parallel. The checking process continues until all objects $a_{ij}$ have been checked or all of the $k$ copies of $\delta_{12}$ have been consumed. If all of the $k$ copies of $\delta_{12}$ have been consumed, the first and the second items appeared together in at least $k$ records and the itemset $\{I_1, I_2\}$ is a frequent 2-itemset. If some copies of $\delta_{12}$ are still in this cell after rule $r_{24}$ is executed, the itemset $\{I_1, I_2\}$ is not a frequent 2-itemset.
Rule $r_{23}$ is executed to process the results obtained by rule $r_{24}$. The 2-itemset $\{I_1, I_2\}$ is again taken as an example. If only $p < k$ copies of $\delta_{12}$ have been consumed by rule $r_{24}$, so that $k - p$ copies of $\delta_{12}$ are still in this cell, a subrule is executed to delete the objects $\beta_{12}^p$ and $\delta_{12}^{k-p}$. If all of the $k$ copies of $\delta_{12}$ have been consumed, subrule $\{(\beta_{12}^k)_{\neg\delta_{12}} \to \alpha_{12,go}\}$ is executed to put an object $\alpha_{12}$ into cells 3 and $t + 1$ to indicate that the itemset $\{I_1, I_2\}$ is a frequent 2-itemset and to activate the computation in cell 3. If no 2-itemset is a frequent 2-itemset, the computation halts.
Each cell $h$ for $3 \le h \le t$ has 4 rules which are similar to those in cell 2. Each cell $h$ performs functions similar to those of cell 2, but for frequent $h$-itemsets.
After the computation halts, all the results, that is, the objects representing the identified frequent itemsets, are stored in cell $t + 1$.
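The chain of cells can be summarized by the following sequential simulation, where each loop iteration plays the role of one cell and the returned list plays the role of the output cell. It is a sketch under assumptions: the candidate generation uses the standard Apriori subset check, which may differ in detail from the paper's rules, and the function name `run_cells` is hypothetical.

```python
from itertools import combinations

def run_cells(objects, t, k):
    """objects: set of (record, item) pairs; t: number of items; k: threshold.
    Returns the frequent itemsets (as sorted tuples) collected by the output cell."""
    rows = {}
    for i, j in objects:
        rows.setdefault(i, set()).add(j)
    output = []
    candidates = [(j,) for j in range(1, t + 1)]   # candidate 1-itemsets in cell 1
    for h in range(1, t + 1):                      # cell h handles h-itemsets
        frequent = []
        for cand in candidates:
            remaining = k                          # k counter copies per candidate
            for items in rows.values():
                if remaining and set(cand) <= items:
                    remaining -= 1                 # one copy consumed per matching record
            if remaining == 0:
                frequent.append(cand)
        if not frequent:
            break                                  # no rule applies: the computation halts
        output.extend(frequent)                    # results passed to the output cell
        # Candidates for cell h + 1 (assumed: all h-subsets must be frequent).
        pool = sorted({j for f in frequent for j in f})
        known = set(frequent)
        candidates = [c for c in combinations(pool, h + 1)
                      if all(s in known for s in combinations(c, h))]
    return output
```

The simulation visits candidates and records one by one, whereas in the P system all subrules of a cell are applied in the same computational step.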

Algorithm Flow.
The conventional Apriori algorithm executes sequentially. ECTPPI-Apriori uses the parallel mechanism of the ECPI tissue-like P system to execute in parallel. The pseudocode of ECTPPI-Apriori is shown in Algorithm 1.

Time Complexity.
The time complexity of ECTPPI-Apriori in the worst case is analyzed. Initially, 1 computational step is needed to put copies of $a_{ij}$ and $\theta^k$ into cells $1, 2, \ldots, t$.
Generating the frequent 1-itemsets needs 3 computational steps. Generating the candidate frequent 1-itemsets $C_1$ needs 1 computational step. Finding the support counts of the candidate frequent 1-itemsets needs 1 computational step, because all candidate frequent 1-itemsets in the database are checked in flat maximally parallel. Passing the results of the frequent 1-itemsets to cells 2 and $t + 1$ needs 1 computational step.
Generating the frequent $h$-itemsets ($1 < h \le t$) needs 4 computational steps. Generating the candidate frequent $h$-itemsets $C_h$ needs 1 computational step. Cleaning the memory used by the objects needs 1 computational step. Finding the support counts of the candidate frequent $h$-itemsets needs 1 computational step, because all candidate frequent $h$-itemsets in the database are checked in flat maximally parallel. Passing the results of the frequent $h$-itemsets to cells $h + 1$ and $t + 1$ needs 1 computational step.
Therefore, the time complexity of ECTPPI-Apriori is $1 + 3 + 4(t - 1) = 4t$ steps, which gives $O(t)$. Note that $O$ is used traditionally to indicate the time complexity of an algorithm and the $O$ used here has a different meaning from that used earlier when ECTPPI-Apriori is described.
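Writing $t$ for the number of fields, the step count can be spelled out term by term. This is a worked restatement of the sum above, not an additional result:

```latex
\begin{aligned}
T(t) &= \underbrace{1}_{\text{input}}
      + \underbrace{3}_{\text{frequent 1-itemsets}}
      + \underbrace{4\,(t-1)}_{\text{frequent $h$-itemsets, } 2 \le h \le t} \\
     &= 4 + 4t - 4 \\
     &= 4t = O(t).
\end{aligned}
```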
Some comparison results between ECTPPI-Apriori and the original as well as some other improved parallel Apriori algorithms are shown in Table 1, where $|C_h|$ is the number of candidate frequent $h$-itemsets and $|L_{h-1}|$ is the number of frequent $(h-1)$-itemsets.

An Illustrative Example
An illustrative example is presented in this section to demonstrate how ECTPPI-Apriori works. Table 2 shows the transactional database used in the example.

Frequent 1-Itemsets Generation. Within cell 1, the auxiliary objects $\delta_j^2$, for $1 \le j \le 5$, are created by rule $r_{11}$ to indicate that each item $I_j$ needs to appear in at least $k = 2$ records for $\{I_j\}$ to be a frequent 1-itemset. Rule $r_{13}$ is executed to detect all frequent 1-itemsets in flat maximally parallel. The detection process of the candidate frequent 1-itemset $\{I_1\}$ is taken as an example. Objects $a_{11}$, $a_{41}$, $a_{51}$, $a_{71}$, $a_{81}$, and $a_{91}$ are in cell 1, which means the first, the fourth, the fifth, the seventh, the eighth, and the ninth records contain $I_1$. The subrules $\{a_{i1}\delta_1 \to \beta_1\}$ for $i \in \{1, 4, 5, 7, 8, 9\}$ meet the execution condition and can be executed. Objects $a_{21}$, $a_{31}$, and $a_{61}$ are not in cell 1, which means the second, the third, and the sixth records do not contain $I_1$. The subrules $\{a_{21}\delta_1 \to \beta_1\}$, $\{a_{31}\delta_1 \to \beta_1\}$, and $\{a_{61}\delta_1 \to \beta_1\}$ do not meet the execution condition and cannot be executed. Initially, 2 copies of $\delta_1$ are in cell 1, indicating that the first item needs to appear in at least 2 records to make the itemset $\{I_1\}$ a frequent 1-itemset. Each execution of a subrule consumes one $\delta_1$. Therefore, 2 of the six executable subrules can be executed in nondeterministic flat maximally parallel. Through the execution of 2 such subrules, both of the 2 copies of $\delta_1$ are consumed and 2 copies of $\beta_1$ are generated. The detection processes of other candidate frequent 1-itemsets are performed in the same way. After the detection processes, […] and $\{I_1, I_2, I_5\}$ are all frequent itemsets in this database.
The changes of objects during the computation processes are listed in Tables 3-6.

Computational Experiments
Two databases from the UCI Machine Learning Repository [23] are used to conduct computational experiments. Computational results on these two databases are reported in this section.

Results on the Congressional Voting Records Database.
The Congressional Voting Records database [23] is used to test the performance of ECTPPI-Apriori. This database contains 435 records and 17 attributes (fields). The first attribute is the party affiliation of the voting congressman, and the 2nd to the 17th attributes are his or her votes on sixteen key issues identified by the Congressional Quarterly Almanac. The first attribute has two values, Democrat or Republican, and each of the 2nd to the 17th attributes has 3 values: yea, nay, and unknown disposition. The frequent itemsets of these attribute values need to be identified; that is, the problem is to find the attribute values which frequently appear together.
Initially, the database is preprocessed. Each attribute value is taken as a new attribute. In this way, each new attribute has only two values: yes or no. After preprocessing, each record in the database has 2 × 1 + 3 × (17 − 1) = 50 attributes. ECTPPI-Apriori can then be used to discover the frequent itemsets. In this experiment, an itemset is called a frequent itemset if it appears in at least 40% of all records; that is, the support count threshold is $k = 174$ (435 × 40%). The frequent itemsets obtained by ECTPPI-Apriori are listed in Table 7.

Results on the Mushroom Database.
The Mushroom database [23] is also used to test ECTPPI-Apriori. This database contains 8124 records, numbered in order from 1 to 8124. Each record represents one mushroom and has 23 attributes (fields). The first attribute is the poisonousness of the mushroom and the 2nd to the 23rd attributes are 22 characteristics of the mushroom. Each of the attributes has 2 to 12 values. The frequent itemsets of these attribute values need to be found; that is, the problem is to find the attribute values which frequently appear together. Initially, the database is preprocessed. Each attribute value is taken as a new attribute. In this way, each new attribute has only two values, yes or no. After preprocessing, each record has 118 attributes. ECTPPI-Apriori can then be used to discover the frequent itemsets. In this experiment, an itemset is a frequent itemset if it appears in at least 40% of all records; that is, the support count threshold is $k = 3250$ (8124 × 40% = 3249.6, rounded up). The frequent itemsets obtained by ECTPPI-Apriori are listed in Table 8.
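The preprocessing and the threshold computation can be sketched as follows. This is an illustration under assumptions, not the authors' preprocessing code: the function names `one_hot` and `support_threshold`, the value labels in the usage example, and the rounding-up rule are chosen here for illustration.

```python
def one_hot(records, domains):
    """records: list of tuples of attribute values; domains: one list of possible
    values per attribute. Returns (columns, rows), where each column is an
    (attribute index, value) pair and each row holds 0/1 entries."""
    columns = [(a, v) for a, values in enumerate(domains) for v in values]
    rows = [[1 if rec[a] == v else 0 for (a, v) in columns] for rec in records]
    return columns, rows

def support_threshold(n_records, percent):
    """Smallest integer count that is at least percent % of n_records
    (integer arithmetic avoids floating-point rounding)."""
    return (n_records * percent + 99) // 100
```

For example, `support_threshold(435, 40)` gives 174 and `support_threshold(8124, 40)` gives 3250, and a domain list with one two-valued attribute followed by sixteen three-valued attributes produces 50 columns.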

Conclusions
An improved Apriori algorithm, called ECTPPI-Apriori, is proposed for frequent itemsets mining. The algorithm uses a parallel mechanism in the ECPI tissue-like P system. The time complexity of ECTPPI-Apriori is improved to $O(t)$ compared to other parallel Apriori algorithms. Experimental results, using the Congressional Voting Records database and the Mushroom database, show that ECTPPI-Apriori performs well in frequent itemsets mining. The results give some hints for improving conventional algorithms by using the parallel mechanism of membrane computing models.
For further research, it is of interest to use other neural-like membrane computing models, such as spiking neural P systems (SN P systems) [8], to improve the Apriori algorithm. SN P systems are inspired by the mechanism of neurons that communicate by transmitting spikes. The cells in SN P systems are neurons that have only one type of object, called a spike. Zhang et al. [24,25], Song et al. [26], and Zeng et al. [27] provided good examples. Also, some other data mining algorithms, such as spectral clustering, support vector machines, and genetic algorithms [1], can be improved by using parallel evolution mechanisms and graph membrane structures.

Algorithm 1: ECTPPI-Apriori.
Input: The objects $a_{ij}$ encoded from the transactional database and objects $\theta^k$ representing the support count threshold $k$.
Rule $r_{01}$: Copy all $a_{ij}$ and $\theta^k$ to cells 1 to $t$.
Rule $r_{11}$: Generate $\delta_j^k$ for $1 \le j \le t$ to form the candidate frequent 1-itemsets $C_1$.
Rule $r_{13}$: Scan each object $a_{ij}$ in cell 1 to count the frequency of each item. If $a_{ij}$ is in cell 1, consume one $\delta_j$ and generate one $\beta_j$. Continue until all $k$ copies of $\delta_j$ have been consumed or all objects $a_{ij}$ have been scanned.
Rule $r_{12}$: If all $k$ copies of $\delta_j$ have been consumed, generate an object $\alpha_j$ to add $\{I_j\}$ to $L_1$ as a frequent 1-itemset and pass $\alpha_j$ to cells 2 and $t + 1$. Delete all remaining copies of $\delta_j$ and delete all copies of $\beta_j$.
For ($2 \le h \le t$ and $L_{h-1} \ne \emptyset$) do the following in cell $h$:
{
Rule $r_{h1}$: Scan the objects $\alpha_{j_1 \cdots j_{h-1}}$ representing the frequent $(h-1)$-itemsets $L_{h-1}$ to generate the objects $\delta_{j_1 \cdots j_h}^k$ representing the candidate frequent $h$-itemsets $C_h$.
Rule $r_{h2}$: Delete all objects $\alpha_{j_1 \cdots j_{h-1}}$ after they have been used by rule $r_{h1}$.
Rule $r_{h4}$: Scan the objects $a_{ij}$ representing the database to count the frequency of each candidate frequent $h$-itemset in $C_h$. If the objects $a_{i,j_1}, \ldots, a_{i,j_{h-1}}$ and $a_{i,j_h}$ are all in cell $h$, consume one $\delta_{j_1 \cdots j_h}$ and generate one $\beta_{j_1 \cdots j_h}$. Continue until all copies of $\delta_{j_1 \cdots j_h}$ have been consumed or all objects $a_{ij}$ have been scanned.
Rule $r_{h3}$: If all $k$ copies of $\delta_{j_1 \cdots j_h}$ have been consumed, generate $\alpha_{j_1 \cdots j_h}$ to add $\{I_{j_1}, \ldots, I_{j_h}\}$ to $L_h$ as a frequent $h$-itemset and put $\alpha_{j_1 \cdots j_h}$ in cells $h + 1$ and $t + 1$. Delete all remaining copies of $\delta_{j_1 \cdots j_h}$ and delete all copies of $\beta_{j_1 \cdots j_h}$.
}
Output: The collection of all frequent itemsets $L$ encoded by objects $\alpha_j$, $\alpha_{j_1 j_2}$, ..., $\alpha_{j_1 \cdots j_t}$.

Table 1: Time complexities of some Apriori algorithms.

Table 7: The frequent itemsets identified by ECTPPI-Apriori for the Congressional Voting Records database.

Table 8: The frequent itemsets identified by ECTPPI-Apriori for the Mushroom database.