A Task Scheduling Algorithm Based on Classification Mining in Fog Computing Environment

Fog computing (FC) is an emerging paradigm that extends computation, communication, and storage facilities towards the edge of a network. In this heterogeneous and distributed environment, resource allocation is very important, and scheduling tasks so as to increase productivity and allocate resources appropriately is a challenge. We schedule tasks on fog computing devices based on a classification data mining technique. A key contribution is a novel classification mining algorithm, I-Apriori, which is based on the Apriori algorithm. Another contribution is a novel task scheduling model and a TSFC (Task Scheduling in Fog Computing) algorithm based on the I-Apriori algorithm. Association rules generated by the I-Apriori algorithm are combined with the minimum completion time of every task in the task set. Furthermore, the task with the minimum completion time is selected to be executed at the fog node with the minimum completion time. We finally evaluate the performance of the I-Apriori and TSFC algorithms through experimental simulations. The experimental results show that the TSFC algorithm performs better in reducing the total execution time of tasks and the average waiting time.


Introduction
Many applications, such as health monitoring or intelligent traffic control, need to receive feedback in a short amount of time, and the latency caused by sending data to the cloud and returning the response to the operator of these programs degrades them [1]. So, in 2012, Bonomi presented a novel concept called fog computing [2]. Fog computing consists of a large number of geographically distributed fog servers, which can be cellular base stations, access points, gateways, switches, and routers with limited capabilities compared to specialized computing facilities such as data centers [3][4][5]. In fog computing, the massive data generated by different kinds of Internet of Things (IoT) [6,7] devices can be processed at the network edge instead of being transmitted to the centralized cloud infrastructure, which addresses bandwidth and energy consumption concerns [8]. Fog computing has become a new computing model that provides local computing resources and storage for end users, rather than relying on cloud computing.
The contradiction [9,10] between computation-intensive applications and resource-limited devices becomes the bottleneck for providing satisfactory quality of experience. This contradiction needs to be solved by task scheduling in the fog computing environment. Task scheduling is widely applied in distributed computing systems and the cloud computing environment [11,12]. Task scheduling in fog computing allocates appropriate resources to application tasks. The task scheduling problem in the fog computing environment can be defined as how to select appropriate resources for each application task so as to minimize completion time, satisfy the users' quality of service (QoS) requirements, improve fog computing throughput, and achieve load-balanced scheduling. Therefore, it is of great practical significance to achieve efficient resource utilization and higher performance in the fog computing environment.
In the fog computing environment, task scheduling depends on whether there are dependencies between the tasks that are scheduled. It can be divided into independent task scheduling and related task scheduling. Related task scheduling is often referred to as dependent task scheduling [13]. In independent task scheduling, there is no dependency relationship or data communication among tasks [14,15]. In dependent task scheduling, tasks depend on one another and exchange data. A typical task scheduling model is built on the basis of graphs, usually called task graphs. The most common task graph is the Directed Acyclic Graph (DAG), so dependent task scheduling is also called DAG scheduling.
Tasks can arrive for scheduling in two ways. One is the batch mode: once all tasks have arrived, they are allocated to the corresponding fog nodes through a scheduling algorithm. The other is the online mode: the arrival time of each task is random, and a task is scheduled to a fog node as soon as it arrives at the RMS (resource management system). Task scheduling of fog nodes has been proved to be an NP-complete problem [16]. Task scheduling is a very important research topic and has been widely and deeply studied [17]. Although many research achievements have been obtained for task scheduling, researchers continue to explore and study it [18]. Research on scheduling tasks in the fog computing environment is not yet well established because of the lack of a fog architecture that manages and allocates resources efficiently. Our research also has a positive influence on some optimization problems [19][20][21][22].
The rest of the paper is organized as follows. In Section 2, we describe the related work. In Section 3, we introduce the classification mining algorithm and an improved I-Apriori algorithm. In Section 4, we introduce a task scheduling model, the scheduling algorithm, and the scheduling process in fog computing. The experimental process and the experimental results of the task scheduling algorithm are analyzed in Section 5, followed by our conclusion in Section 6.

Related Work of Classification Mining.
Classification mining algorithms are widely used in text, image, video, traffic, medical, big data, and other application scenarios. A pipelined architecture for the implementation of axis-parallel binary DTC was proposed in [23]; it dramatically improves the execution time of the algorithm while consuming minimal resources in terms of area. Reference [24] proposed a fast and accurate data classification approach that can learn classification rules from a possibly small set of records that are already classified. The approach is based on the framework of the so-called Logical Analysis of Data (LAD), and its accuracy and stability are better than those of the standard LAD algorithm. Sequence classification was introduced in [25] using rules composed of interesting patterns found in a dataset of labelled sequences and accompanying class labels. The authors measure the interestingness of a pattern in a given class of sequences by combining the cohesion and the support of the pattern. They use the discovered patterns to generate confident classification rules and present two different ways of building a classifier. The patterns that the algorithm discovers represent the sequences well and prove to be more effective for classification tasks than other machine learning algorithms. A Bayesian classification approach for automatic text categorization using class-specific features was proposed in [26]. Unlike conventional text categorization approaches, the method selects a specific feature subset for each class. One notable property of the algorithm is that most feature selection criteria, such as Information Gain (IG) and Maximum Discrimination (MD), can be easily incorporated into it. Comparison with other algorithms demonstrates that the algorithm is effective and indicates its wide potential applications in data mining. Furthermore, we will apply this algorithm to other areas, such as oblivious RAM [27,28], string mapping [29], and match problem [30].

Related Work of Independent Task Scheduling.
For large-scale environments, e.g., cloud computing systems, numerous scheduling approaches have been proposed with the goal of achieving better task execution time on cloud resources [31]. Independent task scheduling algorithms mainly include the MCT algorithm [32], MET algorithm [32], MIN-MIN algorithm [33], MAX-MIN algorithm [33], PMM algorithm, and genetic algorithm. The MCT (Minimum Completion Time) algorithm takes the tasks in arbitrary order and assigns each one to the processor core on which it finishes earliest. As a result, some tasks cannot be allocated to their fastest processor core. The MET (Minimum Execution Time) algorithm takes the tasks in arbitrary order and assigns each one to the processor core that minimizes its execution time. Unlike the MCT algorithm, the MET algorithm does not consider the processor core's ready time, which may lead to serious load imbalance across processor cores. The MIN-MIN algorithm first calculates the minimum completion time of all unscheduled tasks, then selects the task with the smallest minimum completion time and assigns it to the processor core that minimizes its completion time, repeating the process until all tasks are scheduled. Like the MCT algorithm, the MIN-MIN algorithm is based on the minimum completion time. The difference is that the MIN-MIN algorithm considers all unscheduled tasks, whereas the MCT algorithm considers only one task at a time. The MAX-MIN algorithm is similar to the MIN-MIN algorithm: it also first calculates the minimum completion time of all unscheduled tasks, but then selects the task with the largest minimum completion time and assigns it to the processor core with the minimum completion time. The PMM (Priority MIN-MIN) algorithm is an improvement of the MIN-MIN algorithm. Instead of choosing the single task with the earliest completion time, it selects several tasks with smaller earliest completion times and schedules the one with the highest priority among them. The PMM algorithm takes the standard deviation of the task's execution time across processor cores as the priority of the task. The higher the standard deviation, the higher the task priority.
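The differences among MET, MCT, and MIN-MIN can be made concrete with a small sketch. The code below is illustrative only: the function names, the `exec_time[task][node]` matrix layout, and the assumption that every node is free at time 0 are choices made for this example and do not come from the cited implementations.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]  # exec_time[task][node], hypothetical layout


def met(exec_time: Matrix) -> Tuple[Dict[int, int], float]:
    """MET: each task goes to its fastest node; node ready time is ignored."""
    ready = [0.0] * len(exec_time[0])
    assignment = {}
    for task, row in enumerate(exec_time):
        node = min(range(len(row)), key=lambda j: row[j])
        ready[node] += row[node]
        assignment[task] = node
    return assignment, max(ready)


def mct(exec_time: Matrix) -> Tuple[Dict[int, int], float]:
    """MCT: tasks are taken in order; each goes to the node that finishes it earliest."""
    ready = [0.0] * len(exec_time[0])
    assignment = {}
    for task, row in enumerate(exec_time):
        node = min(range(len(row)), key=lambda j: ready[j] + row[j])
        ready[node] += row[node]
        assignment[task] = node
    return assignment, max(ready)


def min_min(exec_time: Matrix) -> Tuple[Dict[int, int], float]:
    """MIN-MIN: repeatedly schedule the unscheduled task with the smallest
    minimum completion time on the node that achieves it."""
    n_nodes = len(exec_time[0])
    ready = [0.0] * n_nodes
    unscheduled = set(range(len(exec_time)))
    assignment = {}
    while unscheduled:
        best = None  # (completion time, task, node)
        for task in sorted(unscheduled):
            for node in range(n_nodes):
                done = ready[node] + exec_time[task][node]
                if best is None or done < best[0]:
                    best = (done, task, node)
        done, task, node = best
        ready[node] = done
        assignment[task] = node
        unscheduled.remove(task)
    return assignment, max(ready)


if __name__ == "__main__":
    example = [[2, 3], [2, 3], [2, 3], [10, 12]]  # 4 tasks, 2 nodes
    for name, algo in (("MET", met), ("MCT", mct), ("MIN-MIN", min_min)):
        print(name, algo(example))
```

On this small matrix, MET places every task on node 0 because that node is always fastest, which illustrates the load imbalance described above; MCT and MIN-MIN spread the load and finish earlier.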
On the one hand, the existing literature on classification algorithms applies decision tree and Bayes classification algorithms to various application scenarios. On the other hand, researchers focus on how to optimize and improve the performance of classification algorithms in combination with cloud computing, distributed computing, big data, grammatical evolution [34,35], and other technologies. Few researchers, however, apply classification mining algorithms to task scheduling.

Classification Data Mining
3.1. Overview. Classification mining [36] is a key technology of data mining. As a supervised learning technique, it builds a model from existing training data sets in order to predict the categories of new data sets. It finds classification rules and predicts new data types through analysis of the training data set. A classification mining algorithm consists of two stages: building the model and using the model. In the first stage, it analyzes the existing training data set, builds a corresponding model, and generates classification rules. In the second stage, it classifies new data sets based on the constructed classification model.
Major classification mining algorithms include random decision forests [37], the decision tree algorithm, the Bayes algorithm, the genetic algorithm, the artificial neural network algorithm [34], and classification algorithms based on association rules. Classification algorithms are widely used in wireless sensor networks, network intrusion detection, call logs, and risk assessment in banks. In this paper, the classification algorithm based on association rules is introduced, and the Apriori algorithm is improved and evaluated.

Mining Model.
Apriori [35,38,39] is a classical classification algorithm based on association rules (CBA). It generates frequent itemsets through an iterative process consisting of two steps. First, it finds the frequent itemsets of a known transaction database, that is, the itemsets whose frequency is greater than or equal to the minimum support threshold, through pruning and connection operations on frequent itemsets. Then, it generates association rules based on the frequent itemsets and the minimum confidence degree.
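The two thresholds used by Apriori, minimum support and minimum confidence, can be illustrated with a minimal sketch. The helper names and the toy transactions below are hypothetical; support is taken here as the fraction of transactions containing an itemset, and the confidence of a rule A → B as support(A ∪ B) / support(A), which are the usual definitions.

```python
from typing import Iterable, List, Set


def support(transactions: List[Set[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = set(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions: List[Set[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """Confidence of the rule antecedent -> consequent."""
    a = set(antecedent)
    both = a | set(consequent)
    return support(transactions, both) / support(transactions, a)


if __name__ == "__main__":
    # Toy data: each transaction lists a task and the node it ran on.
    tx = [{"t7", "n0"}, {"t7", "n0"}, {"t7", "n3"}, {"t1", "n1"}]
    print(support(tx, {"t7"}))             # 3 of 4 transactions
    print(confidence(tx, {"t7"}, {"n0"}))  # 2 of the 3 transactions with t7
```

A rule is kept only when its antecedent and consequent together form a frequent itemset and its confidence reaches the minimum confidence degree.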
The improved association rule mining model is implemented in two steps. (1) First, the transaction database $D$ is scanned to store the transaction identifier (TID) list for each itemset, and the candidate 1-itemsets $C_1$ are generated. The itemsets in $C_1$ whose support is less than the minimum support threshold are deleted, which yields the frequent 1-itemsets $L_1$. (2) The following process is repeated until $L_{k-1}$ is empty. First, $L_{k-1}$ is joined with itself to generate the candidate itemsets $C_k$. Second, the transaction identifier list of each new itemset is obtained by intersecting the transaction identifier lists of the joined itemsets, so the count of each itemset in $C_k$ is obtained directly. Third, the count of each itemset in $C_k$ is compared with the minimum support threshold min_sup; itemsets whose count is greater than or equal to min_sup are kept, and the rest are deleted. The final set of frequent itemsets $L$ is then generated.
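The two steps above can be sketched as follows. This is a minimal illustration of mining with transaction identifier lists, not the authors' Algorithm 1: the function name, the use of Python sets for Tid-lists, and the interpretation of min_sup as a fraction of the transactions are assumptions made for the example.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def frequent_itemsets_tidlist(transactions: List[Set[str]],
                              min_sup: float) -> Dict[FrozenSet[str], int]:
    """Return {itemset: support count}, scanning the database only once."""
    n = len(transactions)

    # Step 1: a single scan builds the Tid-list of every 1-itemset.
    tid: Dict[FrozenSet[str], Set[int]] = {}
    for i, items in enumerate(transactions):
        for item in items:
            tid.setdefault(frozenset([item]), set()).add(i)
    level = {s: t for s, t in tid.items() if len(t) / n >= min_sup}
    frequent = dict(level)

    # Step 2: join the previous level with itself; the Tid-list of a joined
    # itemset is the intersection of the Tid-lists of its two parents.
    k = 2
    while level:
        nxt: Dict[FrozenSet[str], Set[int]] = {}
        for a, b in combinations(sorted(level, key=sorted), 2):
            cand = a | b
            if len(cand) != k or cand in nxt:
                continue
            tids = level[a] & level[b]
            if len(tids) / n >= min_sup:
                nxt[cand] = tids
        frequent.update(nxt)
        level = nxt
        k += 1
    return {s: len(t) for s, t in frequent.items()}


if __name__ == "__main__":
    db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "d"}]
    for itemset, count in frequent_itemsets_tidlist(db, 0.5).items():
        print(sorted(itemset), count)
```

Because the support count of a candidate is simply the size of the intersected Tid-list, no further pass over the transaction database is needed after the first scan.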

Improved Association Rule Mining Algorithm.
Two factors affect the performance of the Apriori algorithm when it produces frequent itemsets. First, it needs to scan the original transaction database every time it generates the frequent k-itemsets, so the transaction database is scanned too many times, which degrades performance. Second, during pruning, the algorithm needs to scan the candidate (k-1)-itemsets to obtain the candidate itemsets, so it scans itemsets many times, which also degrades performance. To address these problems, we improve the process of generating frequent itemsets and propose an improved algorithm, I-Apriori, based on the Apriori algorithm. The I-Apriori algorithm is described in Algorithm 1.
In the I-Apriori algorithm, each time the candidate itemsets $C_k$ are generated, the algorithm stores not only each itemset and its support count but also, more importantly, the transaction identifier list attribute Tid-list. After completing the connection operation between itemsets, the algorithm obtains the list of transaction identifiers and the count of the itemsets directly from the Tid-list attribute and does not need to scan the transaction database again. For these reasons, the I-Apriori algorithm can improve performance effectively.
In the classical Apriori algorithm, the has_infrequent_subset subroutine is called by the apriori_gen subroutine in the main algorithm. With $n$ transactions and $m$ items in the transaction database $D$ and $k$ iterations over frequent itemsets, the time complexity of the Apriori algorithm is $O(k^4 \cdot m \cdot n)$. According to the I-Apriori algorithm shown in Algorithm 1, because the transaction database $D$ needs to be scanned only once, the time complexity of the I-Apriori algorithm is $O(m + n + k^3)$, which is better than $O(k^4 \cdot m \cdot n)$. The larger the transaction database $D$, the greater the number of items, and the more iterations there are, the greater the efficiency advantage of the I-Apriori algorithm.

Experimental Analysis.
The classic Apriori algorithm and the I-Apriori algorithm were both implemented in Java. The hardware environment is an Intel 2.5 GHz CPU with 4 GB of memory, and the operating system is Windows 7. We generated the corresponding frequent itemsets for the transaction database.
When the number of transactions in the transaction database is 200 and the number of items is 20, the execution time needed for the two algorithms to generate frequent itemsets under different minimum support degrees (0.4∼0.8) is shown in Figure 1. When the number of items in the transaction database is 20 and the minimum support degree is 0.4 or 0.6, the execution time needed for the two algorithms to generate frequent itemsets under different numbers of transactions (50∼400) is shown in Figure 2. The values 0.4 and 0.6 were chosen because several experiments showed that the algorithms take longer to execute when the minimum support degree is 0.4 and less time when it is 0.6.
Figure 1 shows that when the minimum support degree is small, the I-Apriori algorithm generates frequent itemsets faster than the Apriori algorithm. As the minimum support degree increases, the difference in execution time between the two algorithms becomes small. When the minimum support degree is large, the I-Apriori algorithm takes longer than the Apriori algorithm. In other words, the I-Apriori algorithm is more efficient when the minimum support degree is small and the number of iterations is large, and the Apriori algorithm is more efficient when the minimum support degree is large and the number of iterations is small. Therefore, the I-Apriori algorithm is suitable for classification mining with a smaller minimum support degree and more iterations. The reason is that a small minimum support degree increases the number of iterations of classification mining; the I-Apriori algorithm then significantly reduces the number of scans of the transaction database, and its execution time is shorter. Conversely, a large minimum support degree decreases the number of iterations. Although the I-Apriori algorithm still reduces the number of scans of the transaction database, it has no advantage over the Apriori algorithm in this case.
In the case of the smaller minimum support degree in Figure 2(a), when the number of transactions is small, the Apriori algorithm generates frequent itemsets faster than the I-Apriori algorithm. As the number of transactions increases, the I-Apriori algorithm becomes clearly more efficient than the Apriori algorithm. In the case of the larger minimum support degree in Figure 2(b), the Apriori algorithm takes longer than the I-Apriori algorithm when the number of transactions is small. As the number of transactions increases, the Apriori algorithm becomes more efficient than the I-Apriori algorithm. Generally speaking, the I-Apriori algorithm is suitable for generating frequent itemsets with a small minimum support degree and a large number of transactions.

Task Scheduling of Fog Computing
Task scheduling in fog computing assigns tasks to fog nodes with different computing powers and arranges their execution order so that the total execution time is as short as possible. All notations used in the paper are listed in Table 1.

Fog Computing System Architecture.
A fog computing system [40] has three tiers in a hierarchical network, as represented in Figure 3. The front-end tier consists of IoT devices, which serve as user interfaces that send requests from users via WiFi access points or the Internet. IoT devices are subject to strict constraints on resources such as CPU and memory, especially when they run a very complex application. The fog tier, which is formed by a set of near-end fog nodes, receives and processes part of the workload of users' requests. The fog tier is generally deployed near the IoT terminals and provides limited computing resources for users. Users can access the computing resources in the fog tier directly, which avoids additional communication delays. The cloud tier consists of multiple servers or cloud nodes. The remote cloud can provide abundant computing resources, but it is located physically far from the users, and the transmission delay is large.

Task Scheduling Model.
To implement task scheduling in fog computing effectively, the classification algorithm is integrated into the task scheduling process. Figure 4 presents the task scheduling model of fog computing. To realize an effective scheduling process between the fog node set $N$ and the task set $T$, the scheduling module consists of two algorithms, i.e., the I-Apriori algorithm and the TSFC (Task Scheduling in Fog Computing) algorithm. First, based on the scheduling transaction set $D$, association rules between the node set and the task set are generated by the I-Apriori algorithm. Second, the association rules are used as the input of the TSFC algorithm to obtain the task scheduling relationship between the fog node set and the task set. Finally, the task scheduling relationship $R$ is inserted into the scheduling transaction set $D$ to provide input data for the next round of task scheduling.

TSFC Scheduling Algorithm.
Based on the I-Apriori algorithm, the TSFC algorithm is designed as shown in Algorithm 2. The basic idea of the algorithm is to give higher priority to the tasks that appear in the task scheduling relational table. The completion time of these tasks in the table is set to a larger value, and the fog node with the minimum completion time is then selected. For the rest of the tasks, a loop repeatedly selects the task with the minimum completion time and assigns it to the fog node with the minimum completion time, until all of the tasks are scheduled. Supposing the numbers of task sets, tasks, and fog nodes are $k$, $n$, and $m$, respectively, the time complexity of the TSFC algorithm is $O(k \cdot n^2 + k \cdot n \cdot m)$.
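Algorithm 2 is not reproduced here, so the following is only a rough sketch of the idea described above, under stated assumptions. It assumes that a task counts as "in the relational table" when its row contains a positive value, that such a task is restricted to the nodes with positive values, and that all nodes are free at time 0. The function name `tsfc_sketch` and the tuple layout of the schedule are hypothetical.

```python
from typing import List, Sequence, Tuple

Slot = Tuple[int, int, float, float]  # (task, node, start, finish)


def tsfc_sketch(exec_time: Sequence[Sequence[float]],
                relation: Sequence[Sequence[float]],
                tasks: Sequence[int]) -> List[Slot]:
    """Schedule rule-bound tasks first, then the rest in MIN-MIN fashion."""
    n_nodes = len(exec_time[0])
    ready = [0.0] * n_nodes
    schedule: List[Slot] = []

    def place(task: int, nodes: Sequence[int]) -> None:
        # Pick, among the allowed nodes, the one with minimum completion time.
        node = min(nodes, key=lambda j: ready[j] + exec_time[task][j])
        start = ready[node]
        ready[node] = start + exec_time[task][node]
        schedule.append((task, node, start, ready[node]))

    # Assumption: a positive entry means an association rule links task and node.
    ruled = [t for t in tasks if any(v > 0 for v in relation[t])]
    rest = [t for t in tasks if t not in ruled]

    for task in ruled:
        place(task, [j for j in range(n_nodes) if relation[task][j] > 0])

    while rest:
        # Select the remaining task with the smallest minimum completion time.
        task = min(rest, key=lambda t: min(ready[j] + exec_time[t][j]
                                           for j in range(n_nodes)))
        place(task, range(n_nodes))
        rest.remove(task)
    return schedule


if __name__ == "__main__":
    times = [[4, 6], [3, 5], [8, 2]]
    rel = [[-1, -1], [-1, -1], [100.0, 0.0]]  # task 2 is tied to node 0 by a rule
    for slot in tsfc_sketch(times, rel, [0, 1, 2]):
        print(slot)
```

In this toy run, task 2 is placed first on node 0 because of its rule, even though node 1 would execute it faster; the remaining tasks are then scheduled by minimum completion time.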

Analysis of Scheduling Process.
To understand and analyze the TSFC algorithm, we walk through the whole scheduling process on a complete case. Suppose that the task set $T$ contains 10 tasks and the node set $N$ includes 4 fog nodes; that is, n=10 and m=4.
The execution time matrix Time[n, m] of task set $T$ and node set $N$ is shown in Table 2.
(1) Transaction database. The transaction set D[z] is shown in Table 3. Each scheduling record between the task set $T$ and the fog node set $N$ is stored as one transaction. A Boolean value describes whether a task or node is scheduled: true means that the task or node is scheduled, and false means that it is not. In addition, it is assumed that the transaction set $D$ contains 10 transactions; that is, z=10.
(3) Task scheduling relational table. According to the association rules generated by the I-Apriori algorithm, the scheduling relationship between the task set and the fog node set is shown in Table 4. In the task scheduling relational table R[n, m], the value of task $t_i$ corresponding to fog node $n_j$ falls into one of three cases. In the first case, if the task $t_i$ and the fog node $n_j$ do not appear in the association rules, every value of the row corresponding to task $t_i$ is equal to −1. In the second case, if task $t_i$ and fog node $n_j$ appear in the association rules, the confidence degree of task $t_i$ on the fog node $n_j$ is calculated. Let the confidence degree of task $t_i$ corresponding to each fog node $n_k$ be $tP_k$; the value of task $t_i$ and fog node $n_j$ in the task scheduling relational table R[n, m] is then equal to $tP_j / \sum_k tP_k$, expressed as a percentage. For example, the scheduling relationship value between $t_7$ and $n_1$ is 0.833/(0.833 + 0.833 + 0.833 + 1.0 + 1.0 + 1.0) * 100=11.15. In the last case, the corresponding scheduling relationship value is equal to 0 when the fog node does not appear in the association rules.
(4) Scheduled tasks TS. The scheduled tasks TS form the list of tasks that need to be scheduled in an experiment. Suppose the arrival time ($AT_i$) of every task is equal to 0. The task set to be scheduled is shown in Table 5.
(5) Task scheduling. Because all of the tasks are independent, the communication cost among tasks is not considered in the TSFC algorithm, and every element of the communication matrix is equal to 0. The task set is scheduled by the TSFC algorithm with Tables 2, 4, and 5 as input. The output is the total execution time (TST) and the average waiting time (AWT) of the scheduled tasks.
Take the first task set {$t_0$, $t_1$, $t_3$, $t_7$} in the scheduled tasks TS as an example. Task $t_7$ is scheduled first because it appears in the association rules, and it is scheduled on fog node $n_0$ or $n_3$. The minimum completion time of the three remaining tasks {$t_0$, $t_1$, $t_3$} in task set 1 is then recalculated, and task $t_0$, with the largest minimum completion time, is selected to be scheduled on fog node $n_2$. Next, the minimum completion time of the remaining two tasks {$t_1$, $t_3$} is recalculated, and task $t_1$ is scheduled on fog node $n_1$. Finally, task $t_3$ is scheduled on fog node $n_3$. The scheduling relationship between the tasks and fog nodes in task set 1 is shown in Figure 5.
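The data structures used in steps (1) to (5) can be sketched as follows. All names are hypothetical, and several points are assumptions rather than definitions taken from the paper: the confidence degrees of one task are normalized over the rules that mention that task, TST is taken as the latest finish time of the schedule, and AWT as the mean of start time minus arrival time.

```python
from typing import Dict, List, Optional, Set, Tuple

Slot = Tuple[int, int, float, float]  # (task, node, start, finish)


def encode_transaction(tasks: Set[int], nodes: Set[int],
                       n_tasks: int, n_nodes: int) -> Dict[str, bool]:
    """One row of the transaction set: True if the task or node was scheduled."""
    row = {f"t{i}": i in tasks for i in range(n_tasks)}
    row.update({f"n{j}": j in nodes for j in range(n_nodes)})
    return row


def relation_row(confidences: Dict[int, float], n_nodes: int) -> List[float]:
    """One row of the relational table for a single task.

    `confidences` maps a node index to the confidence of the rule linking
    the task to that node; it is empty when the task appears in no rule.
    """
    if not confidences:
        return [-1.0] * n_nodes          # first case: task not in any rule
    total = sum(confidences.values())
    return [round(confidences[j] / total * 100, 2) if j in confidences
            else 0.0                      # last case: node not in any rule
            for j in range(n_nodes)]


def tst_and_awt(schedule: List[Slot],
                arrival: Optional[Dict[int, float]] = None) -> Tuple[float, float]:
    """Assumed metrics: TST = latest finish time, AWT = mean(start - arrival)."""
    arrival = arrival or {}
    tst = max(finish for _, _, _, finish in schedule)
    waits = [start - arrival.get(task, 0.0) for task, _, start, _ in schedule]
    return tst, sum(waits) / len(waits)


if __name__ == "__main__":
    print(encode_transaction({7}, {0}, n_tasks=10, n_nodes=4))
    print(relation_row({0: 0.5, 3: 1.5}, n_nodes=4))
    print(tst_and_awt([(2, 0, 0.0, 8.0), (1, 1, 0.0, 5.0), (0, 1, 5.0, 11.0)]))
```

With all arrival times equal to 0, as assumed in step (4), the waiting time of a task in this sketch is simply its start time on the assigned fog node.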

Simulation Experiment and Result Discussion
5.1. Experimental Purpose. To verify the TSFC algorithm proposed in this paper, we compare its performance, under the same experimental conditions, with that of three other independent task scheduling algorithms: MCT, MET, and MIN-MIN.

Simulation Environment.
Based on the simulator toolkit provided by SimGrid [41][42][43], the simulation environment for heterogeneous multiprocessors is built as follows: (1) Nodes are interconnected through high-speed networks. (2) Each fog node can execute tasks and communicate with other fog nodes at the same time, without contention. (3) No task is preempted on a fog node. (4) The fog nodes are heterogeneous.
The computer used in the experiment is configured as follows: Intel Core i5-3210M@2.5 GHz dual-core processor, 8 GB memory. The experiments use 4 and 6 fog nodes, respectively.

Test Data Set.
The input data of the TSFC algorithm include the task execution time matrix, the task scheduling relational table, and the task set. The task execution time matrix contains the execution times of 10 tasks on 4 fog nodes as well as of 10 tasks on 6 fog nodes. The execution time on each node is generated by a random program. The task scheduling relational table is produced by the task scheduling model of fog nodes with the I-Apriori algorithm. The number of tasks in the experiment starts from 100 and increases by 50 each time until it reaches 500.
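The shape of this test data can be sketched as follows. The random program used in the experiments is not described, so the uniform distribution, the value range, and the fixed seed below are assumptions chosen only to make the example reproducible; the function name is hypothetical.

```python
import random
from typing import List


def random_exec_matrix(n_tasks: int, n_nodes: int,
                       low: float = 1.0, high: float = 20.0,
                       seed: int = 0) -> List[List[float]]:
    """Random execution time matrix; row = task, column = fog node.

    The range [low, high] and the uniform distribution are illustrative
    assumptions, not values taken from the experiments.
    """
    rng = random.Random(seed)
    return [[rng.uniform(low, high) for _ in range(n_nodes)]
            for _ in range(n_tasks)]


# Task counts used in the experiments: 100, 150, ..., 500.
task_counts = list(range(100, 501, 50))

if __name__ == "__main__":
    for n_nodes in (4, 6):
        matrix = random_exec_matrix(10, n_nodes)
        print(n_nodes, "fog nodes:", len(matrix), "x", len(matrix[0]))
    print(task_counts)
```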

Result Analysis under 4 Fog Nodes.
The TSFC, MCT, MET, and MIN-MIN algorithms are each used to schedule the task set under 4 fog nodes. The TST and AWT of the four algorithms under different numbers of tasks are shown in Figure 6.

Result Analysis under 6 Fog Nodes.
The TSFC, MCT, MET, and MIN-MIN algorithms are each used to schedule the task set under 6 fog nodes. The TST and AWT of the four algorithms under different numbers of tasks are shown in Figure 7.
We can see from Figures 6(a) and 7(a) that, as the number of tasks increases, the value of TST generated by the TSFC, MCT, MET, and MIN-MIN algorithms increases. However, the value of TST generated by the TSFC algorithm is smaller than those generated by the MCT and MIN-MIN algorithms. As the number of tasks increases, the TSFC algorithm is more efficient than the MCT and MIN-MIN algorithms. When the number of tasks is small, the value of TST generated by the TSFC algorithm is lower than that generated by the MET algorithm. As the number of tasks increases, the value of TST generated by the TSFC algorithm becomes larger than that generated by the MET algorithm. Because the TSFC algorithm takes task completion time as its main parameter, the total completion time of the scheduled tasks moves closer to the optimal solution as the number of tasks increases.
Because the MET algorithm takes the shortest execution time of tasks as its main scheduling parameter, and the execution times of different tasks on the same fog node are proportional, most of the tasks are scheduled on the same fog node, which results in a much higher AWT value. The TSFC algorithm schedules the tasks with the smallest minimum completion time and shortens task waiting time as much as possible, so its AWT value is smaller.
In summary, the values of TST and AWT generated by the TSFC algorithm are better than those of the MCT, MET, and MIN-MIN algorithms. The TSFC algorithm is superior to the MCT, MET, and MIN-MIN algorithms in the experiments.

Conclusion
Fog computing is a new paradigm that attracts a great deal of attention. Providing satisfactory computation performance is a great challenge in the fog computing environment. In this paper, we proposed the I-Apriori algorithm by improving the Apriori algorithm. Experimental results show that the I-Apriori algorithm can effectively improve the efficiency of generating frequent itemsets. A novel task scheduling model and a novel TSFC algorithm for the fog computing environment are proposed based on the I-Apriori algorithm. The association rules generated by the I-Apriori algorithm act as an important parameter of the TSFC task scheduling algorithm. Experimental results show that the TSFC algorithm performs better than other similar algorithms in terms of total task execution time and average waiting time.
Some issues are not addressed in this article, for example, bandwidth between processors and multilayer task scheduling in fog computing. In future work, we will explore these areas. Furthermore, we will apply the TSFC algorithm to other areas, such as oblivious RAM, string mapping, and match problems.

3.4. Algorithm Evaluation. The efficiency of the Apriori algorithm and the I-Apriori algorithm is evaluated based on time complexity and algorithm execution time. 3.4.1. Time Complexity. Suppose the numbers of transactions and items in the transaction database $D$ are $n$ and $m$, and the number of iterations over frequent itemsets in the algorithm is $k$. The time complexity of the classical Apriori algorithm is composed of three layers of nested for loops, the apriori_gen subroutine, and the has_infrequent_subset subroutine called by the apriori_gen subroutine in the main algorithm.

Figure 1: Comparison of execution time.

Figure 2: (a) Comparison of execution time under the minimum support degree 0.4. (b) Comparison of execution time under the minimum support degree 0.6.

Figure 5: Scheduling diagram between tasks and fog node.

Figure 6: (a) Comparison of execution time under 4 fog nodes. (b) Comparison of average waiting time under 4 fog nodes.

Input: transaction database D; min_sup
Output: frequent itemsets L
C_1 = find_candidate_1-itemsets(D);
int count = the number of TID in D;
for each itemset s of C_1 {

Table 2: Execution time matrix Time[n, m] of task set T and fog node set N.

Table 4: Relationship between task and fog node R[n, m].