With the development of the Internet of Things, wireless and mobile networking have been applied in every field of scientific research and social production. In this context, security and privacy have become decisive factors, and traditional security mechanisms leave openings for criminals to exploit. Association rule mining is an important topic in data mining with broad application prospects in wireless and mobile networking, as it can discover interesting correlations hidden among items in large volumes of data. Apriori, the most influential association rule mining algorithm, must scan the database many times, and its efficiency is low when the database is huge. To address the security-mechanism problem and improve efficiency, this paper proposes a new algorithm that scans the database only once; moreover, the scale of the data to be processed shrinks as the algorithm runs. Experimental results show that the new algorithm can efficiently discover useful association rules.
1. Introduction
With the rapid development of web technology, the number of choices is becoming overwhelming, and it takes a long time to filter, prioritize, and efficiently deliver relevant information so as to alleviate information overload. Recommender systems [1] have grown rapidly to meet users' often ambiguous requirements. They combine statistical methods and knowledge discovery technology, providing users with personalized content and services by searching through large volumes of dynamically generated information. Various approaches for building recommender systems have been developed, based on collaborative filtering, content-based filtering, or hybrid filtering [2–4]. Among these, collaborative filtering is the most mature and the most commonly implemented. Collaborative filtering techniques can be divided into two categories: model-based filtering and memory-based filtering. Model-based filtering learns a model from the user-item ratings, which can be computed offline; once the model is generated, prediction is easy and fast. Many model-based techniques have been proposed, such as Latent Semantic Indexing (LSI) [5], decision trees [6], Bayesian network models [7], and cluster models [8, 9]. Usually, model-based algorithms have better scalability but lower accuracy than memory-based algorithms. Although collaborative filtering is widely used, it still faces one crucial unsolved issue, the data sparsity problem [10–13], which leads to nonoptimal nearest neighbors, because the core of the collaborative filtering algorithm is to find the k-nearest neighbors [14–18]. For lack of reference rating values, this neighbor-search step causes large inaccuracy.
In the traditional collaborative filtering algorithm, similarity metrics such as cosine, Pearson correlation, and adjusted cosine are used to calculate the similarity between users or items [19–22]. All of them perform poorly when applied to big data with high sparsity. This paper proposes a new algorithm that considers both user similarity and item similarity. Matrix prefilling, a preprocessing method based on association rules, is introduced; it has not previously been used when measuring similarity. Experimental results on a real dataset show that the proposed model generates more accurate predictions than the traditional ones. The remainder of this paper is organized as follows. Section 2 briefly introduces association rules, whose concepts and algorithms are used in Section 3. Section 3 presents the algorithm for wireless and mobile networking, which is the highlight of this paper. Experimental results and analyses are presented in Section 4, and Section 5 concludes the paper.
2. Related Work
2.1. Related Concepts of Association Rules
Transaction database D = {t1, t2, …, tk, …, tn} is the set of all transactions, and I = {i1, i2, …, im} is the set of all items in D [23–25]. Every transaction contains a set of items that is a subset of I. An item set is a collection of 0 or more items; an item set containing k items is called a k-item set. The support count is an important property of an item set: it indicates the number of transactions that contain a particular item set. σ(X), the support count of item set X, is defined as
(1) σ(X) = |{ti | X ⊆ ti, ti ∈ D}|,
where |⋅| denotes the number of elements in a set.
A rule is defined as an implication of the form X → Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. Support and confidence are two important measures of association rules. Support indicates the frequency of the rule in a dataset and is defined as
(2) support(X → Y) = σ(X ∪ Y)/N.
N is the total number of transactions.
The confidence of a rule X → Y is the proportion of transactions containing X that also contain Y. It is defined as
(3) confidence(X → Y) = σ(X ∪ Y)/σ(X).
Support and confidence are two important measures for evaluating association rules. Rules with low support may occur only occasionally and are meaningless in most cases; therefore, support is often used to delete such meaningless rules. Confidence measures the accuracy of an association rule: the higher the confidence of the rule X → Y, the more likely Y is to appear in transactions that contain X.
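As a concrete illustration of definitions (1)–(3), support and confidence can be computed directly from a transaction list. The following sketch uses Python with a small toy dataset (the paper's experiments were coded in R; Python is used here only for illustration, and the function names are hypothetical):

```python
# Support and confidence of a rule X -> Y over a toy transaction list.
def support_count(itemset, transactions):
    """sigma(X): number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def support(x, y, transactions):
    """Eq. (2): support(X -> Y) = sigma(X u Y) / N."""
    return support_count(x | y, transactions) / len(transactions)

def confidence(x, y, transactions):
    """Eq. (3): confidence(X -> Y) = sigma(X u Y) / sigma(X)."""
    return support_count(x | y, transactions) / support_count(x, transactions)

transactions = [{"I2", "I5"}, {"I1", "I2", "I4"},
                {"I1", "I3", "I4"}, {"I2", "I3", "I4", "I5"}]
print(support({"I2"}, {"I5"}, transactions))     # 0.5  (2 of 4 transactions)
print(round(confidence({"I2"}, {"I5"}, transactions), 3))  # 0.667 (2 of the 3 containing I2)
```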
2.2. Apriori Algorithm
Apriori is a typical algorithm that generates candidate sets. It uses support-based pruning and a level-wise, breadth-first search to discover the frequent item sets. Apriori uses the two properties below to compress the search space.
Lemma 1.
If an item set X is frequent, then all of its nonempty subsets are frequent too.
Lemma 2.
If an item set X is infrequent, then all of its supersets are infrequent too.
Candidate item set generation is a very critical step. It should ensure that the candidate item sets are complete while avoiding too many unnecessary candidates. This step consists of two parts.
(1) In the join step, two frequent (k-1)-item sets L1 and L2 are joined to generate a candidate k-item set, provided that the first k-2 items of L1 and L2 are identical. The candidate k-item set then consists of these first k-2 items together with the last item of L1 and the last item of L2.
(2) In the pruning step, unnecessary candidates are deleted. According to Lemmas 1 and 2, for each candidate k-item set generated, we examine whether all of its (k-1)-subsets are frequent; if not, the candidate is removed from the candidate k-item sets.
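The join and prune steps above can be sketched as follows; `apriori_gen` is a hypothetical helper name, the frequent 2-item sets are toy data, and items inside each set are kept sorted so that the "first k-2 items" comparison is well defined:

```python
from itertools import combinations

def apriori_gen(frequent_km1, k):
    """Generate candidate k-item sets from frequent (k-1)-item sets.

    frequent_km1: set of frozensets of size k-1.
    """
    sorted_sets = [sorted(s) for s in frequent_km1]
    candidates = set()
    for a, b in combinations(sorted_sets, 2):
        # Join step: the first k-2 items must be identical.
        if a[:-1] == b[:-1]:
            candidates.add(frozenset(a) | frozenset(b))
    # Prune step (Lemmas 1 and 2): every (k-1)-subset must be frequent.
    return {c for c in candidates
            if all(frozenset(s) in frequent_km1 for s in combinations(c, k - 1))}

L2 = {frozenset(p) for p in [("I1", "I2"), ("I1", "I4"), ("I2", "I4"), ("I2", "I5")]}
print(sorted(sorted(c) for c in apriori_gen(L2, 3)))  # [['I1', 'I2', 'I4']]
```

Note that {I2, I4, I5} is produced by the join of {I2, I4} and {I2, I5} but is discarded in the prune step because its subset {I4, I5} is not frequent.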
The Apriori algorithm effectively filters unnecessary candidates and achieves good mining results, especially on short-pattern data. However, one weakness is that the database needs to be scanned many times, which produces tremendous I/O cost. Another weakness is that a large number of candidate item sets may be generated, which costs considerable time and memory space.
2.3. FP-Growth Algorithm
FP-Growth is a classic algorithm without candidate item sets generated. It compresses the data into a structure called FP-tree. The frequent item sets are discovered by doing a recursive search of the FP-tree.
The process of FP-Growth mainly consists of two steps.
(1) Constructing the FP-Tree. During the first scan of the database, the items that satisfy the minimum support are selected and put into a header table in descending order of support. During the second scan, the items contained in each transaction are sorted according to their order in the header table and inserted into the FP-tree; identical paths in the tree are then merged.
(2) Discovering Frequent Item Sets by Searching the FP-Tree. If the FP-tree contains only one path, all the possible item sets are enumerated. Otherwise, for each item in the header table, its conditional pattern base is created in order to construct the conditional FP-tree. The recursion does not stop until the tree is empty.
The FP-Growth algorithm scans the database only twice and avoids generating candidate item sets. Its weakness is that when the database is huge, the FP-tree becomes too large and may not even fit in memory, because all the records in the database are compressed into the FP-tree.
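The first pass described in step (1), building the header table and reordering each transaction before insertion, can be sketched as below. This is a simplified illustration that deliberately omits the tree construction itself; `first_pass` is a hypothetical name:

```python
from collections import Counter

def first_pass(transactions, min_support_count):
    """FP-Growth step 1 only: header table + transaction reordering.

    Returns the header table (frequent items, descending support) and each
    transaction restricted to frequent items, sorted in header-table order.
    """
    counts = Counter(item for t in transactions for item in t)
    header = [item for item, c in counts.most_common() if c >= min_support_count]
    order = {item: i for i, item in enumerate(header)}
    sorted_txns = [sorted((i for i in t if i in order), key=order.__getitem__)
                   for t in transactions]
    return header, sorted_txns

txns = [["I2", "I5"], ["I1", "I2", "I4"], ["I1", "I3", "I4"], ["I2", "I3", "I4", "I5"]]
header, ordered = first_pass(txns, 3)
print(header)   # ['I2', 'I4']
print(ordered)  # [['I2'], ['I2', 'I4'], ['I4'], ['I2', 'I4']]
```

With the transactions reordered this way, identical prefixes share paths when inserted into the FP-tree, which is what allows the tree to compress the database.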
3. The Improved Apriori Algorithm Based on Matrix
To avoid the weaknesses of the Apriori algorithm, this paper proposes an improved algorithm on its basis: the transaction database is converted to a Boolean matrix, and the unnecessary rows and columns of the matrix are deleted to reduce the scale of the data.
3.1. Related Concept
Association rule mining usually operates on transaction databases. If the transaction database is converted to a Boolean matrix, then, on the one hand, the database needs to be scanned only once, reducing the I/O cost, and, on the other hand, storing the data as 0s and 1s may reduce memory consumption.
Definition 3.
Let I = {I1, I2, …, In} be the item set and T = {T1, T2, …, Tm} be the set of transactions in the database, where each transaction in T has a unique transaction id called TID. Transactions are converted into a Boolean matrix as follows: let R = (rij)m×n be the binary relation between T and I, with
(4) rij = 1 if Ij ∈ Ti, and rij = 0 if Ij ∉ Ti, for i = 1, 2, …, m and j = 1, 2, …, n.
An example of a transaction database is in Table 1. The Boolean matrix of the database is in Table 2.
Table 1: A transaction database.

TID | Items
1   | I2, I5
2   | I1, I2, I4
3   | I1, I3, I4
4   | I2, I3, I4, I5
Table 2: A Boolean matrix.

R  | I1 I2 I3 I4 I5
T1 |  0  1  0  0  1
T2 |  1  1  0  1  0
T3 |  1  0  1  1  0
T4 |  0  1  1  1  1
The column vector Ij of the Boolean matrix is defined as Ij = (r1j, r2j, …, rmj). The support count of Ij is
(5) support_count(Ij) = ∑_{i=1}^{m} r_{ij}.
For a k-item set {I1, I2, …, Ik}, the support count is
(6) support_count({I1, I2, …, Ik}) = ∑_{i=1}^{m} (r_{i1} ∧ r_{i2} ∧ ⋯ ∧ r_{ik}).
∧ is the “and” operation: the support count is incremented by 1 exactly for those rows in which r_{i1}, …, r_{ik} are simultaneously 1.
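Definition 3 and formulas (5)-(6) can be illustrated on the data of Tables 1 and 2: build the Boolean matrix R (rows are transactions, columns are items) and count an item set's support by AND-ing its columns row by row. A minimal Python sketch (illustration only; the function name is hypothetical):

```python
# Boolean matrix of the Table 1 transactions, as in Table 2.
items = ["I1", "I2", "I3", "I4", "I5"]
transactions = [{"I2", "I5"}, {"I1", "I2", "I4"},
                {"I1", "I3", "I4"}, {"I2", "I3", "I4", "I5"}]
R = [[1 if item in t else 0 for item in items] for t in transactions]

def support_count(itemset, R, items):
    """Eq. (6): count rows where every column of `itemset` is 1."""
    cols = [items.index(i) for i in itemset]
    return sum(all(row[c] for c in cols) for row in R)

print(R[0])                                   # [0, 1, 0, 0, 1], matching row T1 of Table 2
print(support_count(["I2", "I4"], R, items))  # 2  (transactions T2 and T4)
```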
Lemma 4.
If the number of 1s contained in a row of the Boolean matrix is less than k, this row can be deleted from the matrix when the support of k-item sets is counted.
Proof. According to the definition of support count,
(7) support_count({I1, I2, …, Ik}) = ∑_{i=1}^{m} (r_{i1} ∧ r_{i2} ∧ ⋯ ∧ r_{ik}).
If the number of 1s contained in row i is less than k, there exists some j for which r_{ij} = 0; then r_{i1} ∧ r_{i2} ∧ ⋯ ∧ r_{ik} = 0. Therefore this row makes no contribution to the support count of any k-item set.
Lemma 5.
If an item Ij appears in fewer than k of the frequent k-item sets Lk, the column of Ij can be deleted before the frequent (k+1)-item sets are generated.
Proof. Let Y be a frequent (k+1)-item set; then all of its k-subsets are frequent. Hence, for each Ij ∈ Y, at least k of the frequent k-item sets contain Ij. If Ij appears in fewer than k frequent k-item sets, then Ij cannot be an element of any frequent (k+1)-item set.
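The pruning of Lemmas 4 and 5 can be sketched as two small helpers (hypothetical names, toy data matching the Table 2 matrix):

```python
def prune_rows(R, k):
    """Lemma 4: a row with fewer than k ones cannot support any k-item set."""
    return [row for row in R if sum(row) >= k]

def prunable_columns(frequent_k, items, k):
    """Lemma 5: items appearing in fewer than k frequent k-item sets
    cannot occur in any frequent (k+1)-item set."""
    appearances = {i: sum(i in s for s in frequent_k) for i in items}
    return [i for i in items if appearances[i] < k]

R = [[0, 1, 0, 0, 1], [1, 1, 0, 1, 0], [1, 0, 1, 1, 0], [0, 1, 1, 1, 1]]
print(prune_rows(R, 3))  # the first row (only two 1s) is dropped
print(prunable_columns([{"I1", "I2"}, {"I2", "I4"}], ["I1", "I2", "I4"], 2))
# ['I1', 'I4']: each appears in only one frequent 2-item set
```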
3.2. The Search for k-Nearest Neighbors
After the process above, user similarity is taken into account. The similarity of user i and user j is computed as (8), where I denotes the set of all items and R̄i denotes the average rating of user i:
(8) sim(i, j) = ∑_{c∈I} (R_{i,c} − R̄i)(R_{j,c} − R̄j) / ( √(∑_{c∈I} (R_{i,c} − R̄i)²) · √(∑_{c∈I} (R_{j,c} − R̄j)²) ).
For each user u, finding the k-nearest neighbors means finding a user set U = {U1, U2, …, Uk} with u ∉ U such that sim(u, U1) is the highest similarity value, sim(u, U2) the second highest, and so on.
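Formula (8) and the neighbor search can be sketched as follows. One simplifying assumption is made: the means and sums are taken over the items co-rated by both users, a common practical restriction when the rating data are sparse. All user names and ratings are toy data:

```python
from math import sqrt

def sim(ratings, u, v):
    """Pearson similarity of Eq. (8), restricted to co-rated items."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    mu = sum(ratings[u][c] for c in common) / len(common)
    mv = sum(ratings[v][c] for c in common) / len(common)
    num = sum((ratings[u][c] - mu) * (ratings[v][c] - mv) for c in common)
    den = sqrt(sum((ratings[u][c] - mu) ** 2 for c in common)) * \
          sqrt(sum((ratings[v][c] - mv) ** 2 for c in common))
    return num / den if den else 0.0

def k_nearest(ratings, u, k):
    """The k users most similar to u, in decreasing order of similarity."""
    others = [v for v in ratings if v != u]
    return sorted(others, key=lambda v: sim(ratings, u, v), reverse=True)[:k]

ratings = {"u1": {"a": 1, "b": 2, "c": 3},
           "u2": {"a": 2, "b": 4, "c": 6},
           "u3": {"a": 3, "b": 2, "c": 1}}
print(round(sim(ratings, "u1", "u2"), 3))  # 1.0  (perfectly correlated ratings)
print(k_nearest(ratings, "u1", 1))         # ['u2']
```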
3.3. The Generation of Recommendation
After the step of finding the k-nearest neighbors, the next step is to generate recommendations. Let NNu be the set of k-nearest neighbors of user u and R_{u,i} the rating that user u gives to item i; the calculation is as follows:
(9) R_{u,i} = R̄u + ∑_{n∈NNu} sim(u, n)·(R_{n,i} − R̄n) / ∑_{n∈NNu} sim(u, n), if r_{u,i}.flag = 2;
    R_{u,i} = r_{u,i}, if r_{u,i}.flag = 0.
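The first branch of formula (9) can be sketched as below; the flag-based branch, which simply returns an existing rating, is omitted, only neighbors who rated the item are assumed to contribute, and all names and ratings are hypothetical toy data:

```python
def predict(ratings, sims, u, item, neighbours):
    """Eq. (9), first branch: mean-centred, similarity-weighted prediction."""
    mean_u = sum(ratings[u].values()) / len(ratings[u])
    num = den = 0.0
    for n in neighbours:
        if item in ratings[n]:  # only neighbours who rated `item` contribute
            mean_n = sum(ratings[n].values()) / len(ratings[n])
            num += sims[(u, n)] * (ratings[n][item] - mean_n)
            den += sims[(u, n)]
    return (mean_u + num / den) if den else mean_u

ratings = {"u": {"a": 4, "b": 2},
           "n1": {"a": 5, "b": 3, "c": 4},
           "n2": {"a": 3, "b": 1, "c": 1}}
sims = {("u", "n1"): 1.0, ("u", "n2"): 0.5}
print(round(predict(ratings, sims, "u", "c", ["n1", "n2"]), 3))  # 2.778
```

The prediction starts from the target user's own mean (3.0 here) and shifts it by the neighbours' weighted deviations from their means, so users with different rating scales remain comparable.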
3.4. Description of the Improved Algorithm
The process of the improved algorithm is described in Algorithm 1. First, the database is converted to a Boolean matrix; then, according to Lemmas 4 and 5, the unnecessary rows and columns of the matrix are deleted as the algorithm runs.
Algorithm 1: The procedure of the improved algorithm based on matrix.
Input: dataset D, minimum support minsup
Output: all the item sets that satisfy minsup
(1) Scan transaction database D and convert it to a Boolean matrix.
(2) Calculate the support of every 1-item set. Item sets whose support is not less than minsup compose the frequent 1-item sets L1. Delete the columns of the infrequent items and the rows that contain fewer than two 1s.
(3) for (k = 2; L_{k-1} ≠ ∅; k++) do begin
(4)   Combine the items of the remaining columns to generate candidate k-item sets Ck.
(5)   Calculate the support of each item set in Ck.
(6)   Item sets whose support is not less than minsup compose the frequent k-item sets Lk.
(7)   Delete the columns of items that appear in fewer than k item sets of Lk.
(8)   Delete the rows that contain fewer than k+1 1s.
(9) end
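One possible reading of Algorithm 1 as runnable code is sketched below. This is an assumed interpretation for illustration, not the paper's R implementation: candidate k-item sets are formed from the surviving columns, and the row and column pruning of Lemmas 4 and 5 shrinks the matrix at every level. On the Table 1 transactions with a minimum support count of 2, it finds five frequent 1-item sets and four frequent 2-item sets:

```python
from itertools import combinations

def improved_apriori(transactions, min_count):
    """Matrix-based frequent item set mining after Algorithm 1 (sketch)."""
    items = sorted({i for t in transactions for i in t})
    # One scan: each row of the Boolean matrix kept as its set of 1-columns.
    rows = [set(t) & set(items) for t in transactions]
    # Step (2): frequent 1-item sets; prune infrequent columns and short rows.
    counts = {i: sum(i in r for r in rows) for i in items}
    all_frequent = {frozenset([i]) for i in items if counts[i] >= min_count}
    items = [i for i in items if counts[i] >= min_count]
    rows = [r & set(items) for r in rows]
    rows = [r for r in rows if len(r) >= 2]          # Lemma 4 with k = 2
    frequent, k = set(all_frequent), 2
    while frequent:
        # Step (4): candidates from the remaining columns.
        candidates = [frozenset(c) for c in combinations(items, k)]
        # Steps (5)-(6): count support on the pruned matrix.
        frequent = {c for c in candidates
                    if sum(c <= r for r in rows) >= min_count}
        all_frequent |= frequent
        # Step (7), Lemma 5: drop items in fewer than k frequent k-item sets.
        items = [i for i in items if sum(i in f for f in frequent) >= k]
        # Step (8), Lemma 4: drop rows with fewer than k+1 ones.
        rows = [r & set(items) for r in rows]
        rows = [r for r in rows if len(r) >= k + 1]
        k += 1
    return all_frequent

txns = [{"I2", "I5"}, {"I1", "I2", "I4"}, {"I1", "I3", "I4"}, {"I2", "I3", "I4", "I5"}]
result = improved_apriori(txns, 2)
print(sorted(sorted(s) for s in result if len(s) == 2))
# [['I1', 'I4'], ['I2', 'I4'], ['I2', 'I5'], ['I3', 'I4']]
```

After the 2-item-set level, only columns I2 and I4 survive Lemma 5 and every row falls below the three 1s required by Lemma 4, so the loop terminates without counting any 3-item sets.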
3.5. Evaluation Criteria
Not all association rules are useful, so it is necessary to select the association rules in which we are interested. Support and confidence are two basic criteria for evaluating whether an association rule is useful. However, in some cases these two criteria may give an unexpected suggestion, so this paper additionally uses the criterion called lift. The lift of a rule X → Y is defined as
(10) lift(X → Y) = confidence(X → Y)/support(Y).
Lift is the ratio of a rule's confidence to the consequent's support. If the value of lift is 1, X and Y are independent; if the value is above 1, X and Y are positively correlated; if it is below 1, X and Y are negatively correlated.
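Formula (10) can be checked on the toy transactions of Table 1 (an illustration; the function name is hypothetical):

```python
def lift(x, y, transactions):
    """Eq. (10): lift(X -> Y) = confidence(X -> Y) / support(Y)."""
    n = len(transactions)
    sigma = lambda s: sum(1 for t in transactions if s <= t)
    confidence = sigma(x | y) / sigma(x)
    support_y = sigma(y) / n
    return confidence / support_y

transactions = [{"I2", "I5"}, {"I1", "I2", "I4"},
                {"I1", "I3", "I4"}, {"I2", "I3", "I4", "I5"}]
print(round(lift({"I2"}, {"I5"}, transactions), 3))
# 1.333: above 1, so I2 and I5 are positively correlated
```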
3.6. Performance Analysis
Compared with the Apriori algorithm, the improved algorithm scans the database only once: it converts the transaction database to a matrix, and the remaining steps operate on the matrix without scanning the database again, which reduces the I/O cost. Another advantage is that the scale of the data to be processed shrinks as the algorithm runs. During frequent item set generation, the columns of items that cannot appear in frequent item sets and the rows that make no contribution to the support counts are deleted; therefore the matrix becomes smaller and smaller, and the efficiency improves considerably. Moreover, when transactions contain many items, a Boolean matrix occupies less memory space than a transaction list.
4. Results and Analysis
To assess the performance of the improved algorithm, this paper uses the Apriori algorithm and the improved algorithm to mine frequent item sets from different agricultural databases. The experiments were performed on an Intel i5-2450 processor at 2.5 GHz with 4 GB of memory, running Windows 8. The algorithms were implemented in R.
Table 3 and Figure 1 show the performance of the two algorithms on the UCI dataset named mushroom. The dataset contains 7847 records and 118 items. The minimum confidence is set to 0.5, and the minimum support is set to 0.60, 0.65, 0.70, 0.75, and 0.80 in turn.
Table 3: Runtime of the two algorithms on the mushroom dataset.

Support            | 0.60 | 0.65 | 0.70 | 0.75 | 0.80
Apriori algorithm  | 3.25 | 2.12 | 1.95 | 1.85 | 1.72
Improved algorithm | 2.72 | 1.73 | 1.63 | 1.54 | 1.42

[Figure 1: Runtime of the two algorithms on the mushroom dataset.]
Table 4 and Figure 2 show the performance of the two algorithms on the UCI dataset named soybean. The dataset contains 5264 records and 655 items. The minimum confidence is set to 0.5, and the minimum support is set to 0.75, 0.76, 0.77, 0.78, 0.79, and 0.80 in turn.
Table 4: Runtime of the two algorithms on the soybean dataset.

Support            | 0.75 | 0.76  | 0.77  | 0.78  | 0.79  | 0.80
Apriori algorithm  | 0.29 | 0.20  | 0.17  | 0.12  | 0.10  | 0.07
Improved algorithm | 0.21 | 0.135 | 0.082 | 0.075 | 0.065 | 0.05

[Figure 2: Runtime of the two algorithms on the soybean dataset.]
The results show that the runtime of the improved algorithm is less than that of the Apriori algorithm; the improved algorithm is therefore more effective.
The evaluation measure lift is used to refine the mining result. A subset of the mining result on the mushroom dataset is shown in Algorithm 2: these association rules have high support and confidence, but their lift is 1, meaning that the antecedent and the consequent are independent. Such rules are not the ones this paper expects, even though their support and confidence are high.
Algorithm 2: A subset of the mining result of the mushroom dataset.
5. Conclusions
To avoid the weaknesses of the Apriori algorithm, this paper proposes an improved algorithm based on matrix and applies it to agricultural datasets. Experimental results show that the improved algorithm can efficiently discover useful association rules, because the database is scanned only once and the data to be processed shrinks as the algorithm runs. The improved algorithm is more applicable when the database is huge; when the database is small, however, it is less efficient than Apriori, because the scale of the data is small while the improved algorithm must perform the extra operation of converting the database to a matrix. Further research should focus on optimizing the proposed algorithm to further improve its efficiency on big data; algorithm parallelization can also be taken into account. Our future work is therefore to improve the algorithm so that it is applicable to more kinds of databases. Besides, new evaluation criteria can be used to refine the mining result.
Conflicts of Interest
The authors declare that there are no conflicts of interest.
References
[1] Y. Pang, Y. Jin, Y. Zhang, and T. Zhu, "Collaborative filtering recommendation for MOOC application."
[2] J. Wei, J. He, K. Chen, Y. Zhou, and Z. Tang, "Collaborative filtering and deep learning based recommendation system for cold start items."
[3] L. Qi, W. Dou, and X. Zhang, "Service recommendation based on social balance theory and collaborative filtering," in Proceedings of the 14th International Conference, Lecture Notes in Computer Science, vol. 9936, Springer International Publishing, Basel, Switzerland, 2016, pp. 637–645. doi:10.1007/978-3-319-46295-0_43.
[4] B. Kyoungsoo and S.-G. Cheongju, "Social group recommendation based on dynamic profiles and collaborative filtering."
[5] Y. Gao, "Collaborative filtering recommendation model based on normalization method."
[6] M. Sun, F. Li, and J. Lee, "Learning multiple-question decision trees for cold-start recommendation," in Proceedings of the 6th ACM International Conference on Web Search and Data Mining (WSDM '13), ACM, Rome, Italy, February 2013, pp. 445–454. doi:10.1145/2433396.2433451.
[7] T.-H. Ma, L.-M. Guo, M. Li, M.-L. Tang, Y. Tian, and A. Mznah, "A collaborative filtering recommendation algorithm based on hierarchical structure and time awareness."
[8] H. L. dos Santos, C. Cechinel, R. M. Araujo, and M.-Á. Sicilia, "Clustering learning objects for improving their recommendation via collaborative filtering algorithms."
[9] P. Krupa, A. Thakkar, C. Shah, and K. Makvana, "A state of art survey on shilling attack in collaborative filtering based recommendation system."
[10] P. Mirko and A. Fabio, "Kernel based collaborative filtering for very large scale top-N item recommendation," in Proceedings of the 24th European Symposium on Artificial Neural Networks, 2016, pp. 11–16.
[11] Y. Shen, T.-G. Lv, X. Chen, and Y.-D. Wang, "A collaborative filtering based social recommender system for E-commerce."
[12] S. Rossi, F. Barile, D. Improta, and L. Russo, "Towards a collaborative filtering framework for recommendation in museums: from preference elicitation to group's visits," in Proceedings of the 7th International Conference on Emerging Ubiquitous Systems and Pervasive Networks (EUSPN 2016) / 6th International Conference on Current and Future Trends of Information and Communication Technologies in Healthcare (ICTH '16), Elsevier, London, UK, September 2016, pp. 431–436. doi:10.1016/j.procs.2016.09.067.
[13] B. Urszula, "Differential evolution in a recommendation system based on collaborative filtering," in Lecture Notes in Computer Science, Springer International Publishing, pp. 113–122. doi:10.1007/978-3-319-45246-3_11.
[14] W. Ebisa and W. Vitor, "User-based collaborative filtering recommender systems approach in industrial engineering curriculum design and review process," in Proceedings of the ASEE Annual Conference and Exposition, 2016.
[15] J. Jinhyun, B. Sangwon, and P. Geunduk, "Implementation of a recommendation system using association rules and collaborative filtering," in Proceedings of the 4th International Conference on Information Technology and Quantitative Management (ITQM '16), 2016, pp. 944–952.
[16] Y. K. Ng, "Recommending books for children based on the collaborative and content-based filtering approaches," in Proceedings of Computational Science and Its Applications (ICCSA '16), Lecture Notes in Computer Science, Springer International Publishing, pp. 302–317. doi:10.1007/978-3-319-42089-9_22.
[17] H. H. Qiu, Y. Liu, Z. J. Zhang, and G. X. Luo, "An improved collaborative filtering recommendation algorithm for microblog based on community detection," in Proceedings of the 2014 Tenth International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP), IEEE, Kitakyushu, Japan, August 2014, pp. 876–879. doi:10.1109/IIH-MSP.2014.221.
[18] K. Kim and H. Ahn, "Recommender systems using cluster-indexing collaborative filtering and social data analytics."
[19] C.-Y. Li and K.-J. He, "An optimized map reduce for item-based collaborative filtering recommendation algorithm with empirical analysis."
[20] N. Polatidis and C. K. Georgiadis, "A dynamic multi-level collaborative filtering method for improved recommendations."
[21] R. L. Palak, "An effective collaborative filtering based method for movie recommendation."
[22] M. Liu, Z. Zeng, W. Pan, X. Peng, Z. Shan, and Z. Ming, "Hybrid one-class collaborative filtering for job recommendation."
[23] M. Sridevi and R. R. Rao, "An enhanced personalized recommendation utilizing expert's opinion via collaborative filtering and clustering techniques," in Proceedings of the 2016 International Conference on Inventive Computation Technologies (ICICT), IEEE, Coimbatore, India, August 2016, pp. 1–4. doi:10.1109/INVENTIVE.2016.7823186.
[24] H. Zhang, I. Ganchev, N. S. Nikolov, and M. O'Droma, "A trust-enriched approach for item-based collaborative filtering recommendations," in Proceedings of the 12th IEEE International Conference on Intelligent Computer Communication and Processing (ICCP '16), September 2016, pp. 65–68. doi:10.1109/ICCP.2016.7737124.
[25] L.-Y. Dong, G.-L. Zhu, Q. Zhu, and Y.-L. Li, "Research on collaborative filtering recommendation based on k-means clustering."