A DAG Scheduling Scheme on Heterogeneous Computing Systems Using Tuple-Based Chemical Reaction Optimization

A complex computing problem can be solved efficiently on a system with multiple computing nodes by dividing its implementation code into several parallel processing modules or tasks that can be formulated as a directed acyclic graph (DAG). The DAG jobs are mapped to and scheduled on the computing nodes to minimize the total execution time. Searching for an optimal DAG scheduling solution is considered to be NP-complete. This paper proposes a tuple molecular structure-based chemical reaction optimization (TMSCRO) method for DAG scheduling on heterogeneous computing systems, based on a recently proposed metaheuristic method, chemical reaction optimization (CRO). Compared with other CRO-based algorithms for DAG scheduling, TMSCRO has a more reasonable design of the tuple reaction molecular structure and of the four elementary reaction operators. TMSCRO also applies the concepts of constrained critical paths (CCPs), the constrained-critical-path directed acyclic graph (CCPDAG), and the super molecule to accelerate convergence. We have also conducted simulation experiments to verify the effectiveness and efficiency of TMSCRO on a large set of randomly generated graphs and on graphs for real-world problems.


Introduction
Modern computer systems with multiple processors working in parallel can enhance the processing capacity available to an application. Effective scheduling of the parallel modules of the application can fully exploit this parallelism. The application modules may communicate and synchronize several times during processing. On heterogeneous systems that combine GPUs, multicore processors, and CELL processors, or on distributed memory systems, a large communication cost may limit the overall application performance. Effective scheduling can therefore greatly improve the performance of the application.
Scheduling generally defines not only the processing order of application modules but also the processor assignment of these modules. The makespan (i.e., the schedule length), which includes the entire execution and communication cost of all the modules, is used to evaluate the quality of a scheduling solution. On heterogeneous systems [1-4], searching for optimal schedules that minimize the makespan is considered an NP-complete problem. Therefore, two classes of scheduling strategies, heuristic scheduling and metaheuristic scheduling, have been proposed to find suboptimal solutions with lower time overhead.
Heuristic scheduling strategies try to identify a good solution by exploiting heuristics. An important subclass of heuristic scheduling is list scheduling, which builds an ordered task list for a DAG job on the basis of some greedy heuristics. In list scheduling algorithms, the ordered tasks are then selected one by one and allocated to the processors that minimize their start times. In heuristic scheduling, the greedy heuristics narrow the attempted solutions down to a very small portion of the entire solution space, and this restriction of the search leads to low time complexity. However, the more complex a DAG scheduling problem is, the harder it is for greedy heuristics to produce consistent results across a wide range of problems, because the quality of the solutions found relies heavily on the effectiveness of the heuristics.

Metaheuristic scheduling strategies incur a higher time cost than heuristic scheduling strategies, but they can produce consistent, high-quality results on a wide range of problems by performing a directed search of the solution space. Chemical reaction optimization (CRO) is a recently proposed metaheuristic method that has shown its power in dealing with NP-complete problems. As far as we know, there is only one CRO-based algorithm for DAG scheduling on heterogeneous systems, called double molecular structure-based CRO (DMSCRO). DMSCRO has a better makespan and convergence rate than the genetic algorithm (GA) for DAG scheduling on heterogeneous systems. However, the rate of convergence of DMSCRO as a metaheuristic method is still unsatisfactory. This paper proposes a new CRO-based algorithm, tuple molecular structure-based CRO (TMSCRO), for the mentioned problem, encoding the two basic components of DAG scheduling, module execution order and module-to-processor mapping, into an array of tuples. Combined with the elementary reaction operators designed in TMSCRO, this kind of molecular structure has a better capability of intensification and diversification than DMSCRO. Moreover, in TMSCRO, the concepts of constrained critical paths (CCPs) [5] and the constrained-critical-path directed acyclic graph (CCPDAG) are applied to creating the initial population in order to speed up convergence. In addition, the first initial molecule, InitS, which is converted from the scheduling result of the constrained earliest finish time (CEFT) algorithm, is also treated as a super molecule [6] for accelerating convergence.

Algorithm 1 (fragment):
(1) //PHASE 1: Find the constrained critical paths (CCPs)
(2) Find set of critical paths CP according to the description in the second paragraph of Section 3.1.

(1) for each edge $e$ of DAG with start task $v_i$ and end task $v_j$
(2) $\mathrm{CCP}_s$ = BelongCCP($v_i$);
(3) $\mathrm{CCP}_t$ = BelongCCP($v_j$);
(4) if ($\mathrm{CCP}_s \ne \mathrm{CCP}_t$) & (CCPE($\mathrm{CCP}_s$, $\mathrm{CCP}_t$)) does not exist
(5) create CCPE($\mathrm{CCP}_s$, $\mathrm{CCP}_t$)
(6) end if
(7) add Start and End
(8) add edges among Start and CCP nodes
(9) add edges among End and CCP nodes
(10) end for
Algorithm 2: Gen CCPDAG(DAG, CCP) generating CCPDAG.
In theory, a metaheuristic method will gradually approach the optimal result if it runs for long enough; moreover, by the No-Free-Lunch theorem, the search performance of all metaheuristic algorithms is alike when averaged over all possible fitness functions. We have conducted simulation experiments over the graphs abstracted from two well-known real applications, Gaussian elimination and a molecular dynamics application, and also over a large set of randomly generated graphs. The experiment results show that the proposed TMSCRO can achieve similar performance to DMSCRO in the literature in terms of makespan and outperforms the heuristic algorithms.

Algorithm listing (fragment):
(1) InitS = ConvertMole(InitCCPS);
(2) update each $f_i$ in molecule InitS as defined in the last paragraph of Section 5.1.1
(3) MoleN = 1;
(4) while MoleN ≤ PopSize − 1 do
(5) for each $\mathrm{CCP}_i$ in CCP molecule CCPS
(6) find the first successor Succ($i$) in CCPDAG from $i$ to the end;
(7) for each $\mathrm{CCP}_j$, $j \in (i, \mathrm{Succ}(i))$
(8) find the first predecessor Pred($j$) from Succ($i$) to the beginning in CCP molecule CCPS;
(9) if Pred($j$) < $i$
(10) interchange the positions of ($\mathrm{CCP}_i$, $sp_i$) and ($\mathrm{CCP}_j$, $sp_j$) in CCP molecule CCPS;
(11) end if
(12) end for
There are three major contributions of this work.
(1) Developing TMSCRO based on the CRO framework by designing a molecule encoding method and elementary chemical reaction operators that are better suited to intensification and diversification search than those of DMSCRO.
(2) For accelerating convergence, applying CEFT and CCPDAG to the data pretreatment, utilizing the concept of CCPs in the initialization, and using the first initial molecule, InitS, as a super molecule in TMSCRO.
(3) Verifying the effectiveness and efficiency of the proposed TMSCRO by simulation experiments. The simulation results of this paper show that TMSCRO is able to approach a makespan similar to that of DMSCRO, but it finds good solutions faster than DMSCRO by 12.89% on average (by 26.29% in the best case).

Figure 7: Illustration of the task-to-computing-node mapping for decomposition.

Related Work
Most scheduling algorithms can be categorized into heuristic scheduling (including list scheduling, duplication-based scheduling, and cluster scheduling) and metaheuristic (i.e., guided-random-search-based) scheduling. These strategies generate the scheduling solution before the execution of the application. The approaches adopted by these different scheduling strategies are summarized in this section.

Heuristic Scheduling.
Heuristic methods usually provide near-optimal solutions for a task scheduling problem in less than polynomial time. The approaches adopted by heuristic methods search only one path in the solution space, ignoring other possible ones [7]. Three typical kinds of heuristic algorithms for the DAG scheduling problem are discussed below: list scheduling [7,8], cluster scheduling [9,10], and duplication-based scheduling [11,12].
List scheduling [7, 13-21] generates a schedule solution in two primary phases. In phase 1, all the tasks are ordered in a sequence by their assigned priorities, which are normally based on the task execution and communication costs. Two attributes, the b-level and the t-level, are used in most list scheduling algorithms to assign task priorities. In a DAG, the b-level of a node (task) is the length of the longest path from the node to the end node, whereas the t-level of a node is the length of the longest path from the entry node to the node. In phase 2, the processors are assigned to each task in the sequence.
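The two priority attributes can be illustrated with a short sketch. It is a minimal example, not the procedure of any cited algorithm, and it makes two assumptions that the text does not state: the t-level excludes the node's own cost, and the b-level includes it, with communication costs counted on every edge. The names `levels`, `weight`, and `succ` are hypothetical.

```python
def topological_order(nodes, succ):
    """Kahn's algorithm; succ maps node -> {child: communication cost}."""
    indeg = {v: 0 for v in nodes}
    for u in nodes:
        for v in succ.get(u, {}):
            indeg[v] += 1
    ready = [v for v in nodes if indeg[v] == 0]
    order = []
    while ready:
        u = ready.pop()
        order.append(u)
        for v in succ.get(u, {}):
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return order


def levels(weight, succ):
    """Return (t_level, b_level) dictionaries.

    weight: node -> (average) execution cost
    succ:   node -> {child: communication cost}
    Assumption: t-level excludes the node's own cost, b-level includes it.
    """
    order = topological_order(list(weight), succ)
    t = {v: 0 for v in weight}
    for u in order:  # longest path from the entry node
        for v, c in succ.get(u, {}).items():
            t[v] = max(t[v], t[u] + weight[u] + c)
    b = {}
    for u in reversed(order):  # longest path to the end node
        b[u] = weight[u] + max(
            (c + b[v] for v, c in succ.get(u, {}).items()), default=0
        )
    return t, b


weight = {"a": 2, "b": 3, "c": 1, "d": 2}
succ = {"a": {"b": 1, "c": 4}, "b": {"d": 2}, "c": {"d": 1}}
t_level, b_level = levels(weight, succ)
```

In this small graph every node has t-level plus b-level equal to 10, so all four tasks lie on a critical path.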
The heterogeneous earliest finish time (HEFT) scheduling algorithm [16] assigns the scheduling task priorities based on the earliest start time of each task. HEFT allocates a task to the processor which minimizes the task's start time.

Figure 10: Illustration of molecular structure change for synthesis.
The modified critical path (MCP) scheduling [22] considers only one CP (critical path) of the DAG and assigns the scheduling priority to tasks based on their latest start time. The latest start times of the CP tasks are equal to their t-levels. MCP allocates a task to the processor which minimizes the task's start time.
Dynamic-level scheduling (DLS) [23] uses the concept of the dynamic level, which is the difference between the b-level and earliest start time of a task on a processor. Each time the (task, processor) pair with the largest dynamic-level value is chosen by DLS during the task scheduling.
Mapping heuristic (MH) [24] assigns the task scheduling priorities based on the static b-level of each task, which is the b-level without the communication costs between tasks. Then, a task is allocated to the processor which gives the earliest start time.
Levelized-min time (LMT) [17] assigns the task scheduling priority in two steps. Firstly, it groups the tasks into different levels based on the topology of the DAG, and then in each level, the task with the highest priority is the one with the largest execution cost. A task is allocated to the processor which minimizes the sum of the total communication costs with the tasks in the previous level and the task's execution cost.
Two heuristic algorithms for DAG scheduling on heterogeneous systems are proposed in [8]. One algorithm, named HEFT_T, uses the sum of the t-level and the b-level to assign the priority to each task. In HEFT_T, the critical tasks are placed on the same processor where possible, and the other tasks are allocated to the processor that gives the earliest start time. The other algorithm, named HEFT_B, applies the concept of the b-level to assign the priority (i.e., scheduling order) to each task. After the priority assignment, a task is allocated to the processor that minimizes its start time. The extensive experiment results in [8] demonstrate that HEFT_B and HEFT_T outperform (in terms of makespan) other representative heuristic algorithms in heterogeneous systems, such as DLS, MH, and LMT.
Compared with the list scheduling algorithms, the duplication-based algorithms [23, 25-29] attempt to duplicate tasks on the same processor in heterogeneous systems, because duplication may eliminate the communication cost of these tasks and may thus effectively reduce the total schedule length.
The clustering algorithms [8, 11, 30-32] regard task collections as clusters to be mapped to appropriate processors. These algorithms are mostly used in homogeneous systems with an unbounded number of processors, and they use as many processors as possible to reduce the schedule length. Then, if the number of processors used for scheduling is more than the number of available processors, the task collections (clusters) are processed further to fit a limited number of processors.

Figure 13: A molecular dynamics code.

Metaheuristic Scheduling.
In comparison with the algorithms based on heuristic scheduling, the metaheuristic (guided-random-search-based) algorithms use a combinatorial process for solution searching. In general, to perform robustly on many kinds of scheduling problems, the metaheuristic algorithms need to sample a sufficient number of candidate solutions in the search space. Many metaheuristic algorithms have been applied successfully to the task scheduling problem, such as GA, chemical reaction optimization (CRO), energy-efficient stochastic scheduling [33], and so forth.

GA [15, 31, 34-36] is the most widely used metaheuristic method for DAG scheduling. In [15], a scheduling solution is encoded as a one-dimensional string representing an ordered list of tasks to be allocated to a processor. Given the strings of two parent solutions, the crossover operator selects a crossover point randomly and then merges the head portion of one parent with the tail portion of the other. The mutation operator randomly exchanges two tasks in two solutions. The fitness function uses the makespan to evaluate the quality of a scheduling solution.

Chemical reaction optimization (CRO) was proposed recently [20, 30, 37-39]. It mimics the interactions of molecules in chemical reactions. CRO has already shown good performance in solving many problems, such as the quadratic assignment problem (QAP), the resource-constrained project scheduling problem (RCPSP), the channel assignment problem (CAP) [39], task scheduling in grid computing (TSGC) [40], and the 0-1 knapsack problem (KP01) [41]. As far as we know, double molecular structure-based chemical reaction optimization (DMSCRO), recently proposed in [37], is the only CRO-based algorithm with two molecular structures for DAG scheduling on heterogeneous systems. As a CRO-based algorithm, DMSCRO mimics the chemical reaction process in a closed container and obeys energy conservation.
In DMSCRO, one solution for DAG scheduling, including its two essential components, task execution order and task-to-processor mapping, corresponds to a double-structured molecule with two kinds of energy, potential energy (PE) and kinetic energy (KE). The PE of a molecule is the fitness value (objective value), that is, the makespan, of the corresponding solution, which can be calculated by the fitness function designed in DMSCRO, and KE, with a nonnegative value, helps the molecule escape from local optima. Four kinds of elementary reactions are used to perform the intensification and diversification search in the solution space to find the solution with the minimal makespan; the principle of reaction selection is presented in detail in Section 3.2. Moreover, a central buffer is applied in DMSCRO for energy interchange and conservation during the search. However, as a metaheuristic method for DAG scheduling, DMSCRO still has a very large time expenditure, and its rate of convergence needs to be improved. Compared with GA, DMSCRO is closer in model and workload to the TMSCRO proposed in this paper.
Our work is concerned with DAG scheduling problems and the flaws of the CRO-based method for DAG scheduling, and it proposes a tuple molecular structure-based chemical reaction optimization (TMSCRO). Compared with DMSCRO, TMSCRO applies CEFT [5] to data pretreatment to take advantage of CCPs as heuristic information for accelerating convergence. Moreover, the molecular structure and the elementary reaction operators designed in TMSCRO are more reasonable than those in DMSCRO with respect to intensification and diversification when searching the solution space.

Background
3.1. CEFT.
Constrained earliest finish time (CEFT), based on constrained critical paths (CCPs), was proposed for heterogeneous system scheduling in [5]. In contrast to other approaches, the CEFT strategy takes a broader view of the input DAG. Moreover, the CCPs can be scheduled efficiently because they are generated statically. A constrained critical path (CCP) is a collection that contains only tasks ready for scheduling. A task is ready when all its predecessors have been processed. In CEFT, a critical path (CP) is generally the longest path from the start node to the end node of the DAG. The DAG is initially traversed and a critical path is found. The nodes that constitute this critical path are then pruned from the graph, and subsequent traversals of the pruned graph produce the remaining critical paths. While nodes are being removed from the task graph, a pseudo-edge to the start or end node is added if a node has no predecessors or no successors, respectively. The CCPs are subsequently formed by selecting ready nodes from the critical paths in a round-robin fashion. Each CCP is assigned to the single processor that gives the minimum finish time for processing all the tasks in the CCP. Scheduling all the tasks of a CCP together not only reduces the communication cost but also benefits from a broader view of the task graph. Consider the CEFT algorithm generating schedules for $n$ tasks on $|P|$ heterogeneous processors. Some specific terms and their usage are indicated in Table 1.
The CEFT scheduling approach (Algorithm 1) works in two phases. (1) The critical paths are generated according to the description in the second paragraph of Section 3.1. The critical paths are traversed and the ready nodes are inserted into the constrained critical paths $\mathrm{CCP}_j$ (for every index $j$). If no more ready nodes are in a critical path, the constrained critical path takes nodes from the next critical path, following round-robin traversal of the critical paths. (2) All the CCPs are traversed in order (line 12). Then $\mathrm{ST}_p(w, v)$, the maximum of $\mathrm{AT}_p$ and the start time of the predecessors of each node $w$, is calculated (1). $\mathrm{EFT}_p(w)$ is computed as the sum of $\mathrm{ST}_p(w, v)$ and $\mathrm{EC}_p(w)$ (2). $\mathrm{CEFT}_p(\mathrm{CCP}_j)$ is the maximum of the finish times of all the nodes of $\mathrm{CCP}_j$ on the same processor $p$ (3). The processor that minimizes the $\mathrm{CEFT}_p(\mathrm{CCP}_j)$ value is then assigned to the constrained critical path $\mathrm{CCP}_j$ (line 20). After the actual finish time AEFT of each task in $\mathrm{CCP}_j$ is updated, the processor assignment continues iteratively.

CRO.
Chemical reaction optimization (CRO) mimics the process of a chemical reaction in which molecules undergo a series of reactions with each other or with the environment in a closed container. The molecules are the manipulated agents, each with a profile of three necessary properties. (1) The molecular structure $S$: the structure represents the positions of the atoms in a molecule. Depending on the problem, the molecular structure can take the form of a number, a vector, a matrix, or even a graph. (2) (Current) potential energy (PE): PE is the objective function value of the current molecular structure $S$, that is, $\mathrm{PE}_S = f(S)$. (3) (Current) kinetic energy (KE): KE is a nonnegative number that helps the molecule escape from local optima. A central energy buffer is implemented in CRO. The energy in CRO obeys energy conservation and can be exchanged between the molecules and the buffer.
Four kinds of elementary reactions may happen in CRO, which are defined as below.
(1) On-wall ineffective collision: on-wall ineffective collision is a unimolecule reaction involving only one molecule. In this reaction, a molecule $S$ is allowed to change to another one, $S'$, if their energy values accord with the following inequality:
$$\mathrm{PE}_S + \mathrm{KE}_S \ge \mathrm{PE}_{S'}. \quad (1)$$
After this reaction, KE is redistributed in CRO: the new molecule keeps $\mathrm{KE}_{S'} = (\mathrm{PE}_S + \mathrm{KE}_S - \mathrm{PE}_{S'}) \times t$, and the redundant energy is stored in the central energy buffer. Parameter $t$ is a random number from KELossRate to 1, where KELossRate, a system parameter set during the CRO initialization, is the KE loss rate and is less than 1.
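(2) Decomposition: decomposition is the other unimolecule reaction in CRO. A molecule $S$ may decompose into two new molecules, $S'_1$ and $S'_2$, if their energy values accord with inequality (2), in which buf denotes the energy in the buffer, representing the energy interactions between the molecules and the central energy buffer:
$$\mathrm{PE}_S + \mathrm{KE}_S + \mathrm{buf} \ge \mathrm{PE}_{S'_1} + \mathrm{PE}_{S'_2}. \quad (2)$$
After this reaction, buf is updated by (3) and the KEs of $S'_1$ and $S'_2$ are, respectively, computed as (4) and (5), where $\mathrm{Edecomp} = (\mathrm{PE}_S + \mathrm{KE}_S) - (\mathrm{PE}_{S'_1} + \mathrm{PE}_{S'_2})$ and $\delta_1$, $\delta_2$, $\delta_3$, $\delta_4$ are numbers randomly selected from the range $[0, 1]$.
(3) Intermolecular ineffective collision: intermolecular ineffective collision is an intermolecule reaction involving two molecules. Two molecules, $S_1$ and $S_2$, may change to two new molecules, $S'_1$ and $S'_2$, if their energy values accord with the following inequality:
$$\mathrm{PE}_{S_1} + \mathrm{KE}_{S_1} + \mathrm{PE}_{S_2} + \mathrm{KE}_{S_2} \ge \mathrm{PE}_{S'_1} + \mathrm{PE}_{S'_2}. \quad (6)$$
After this reaction, the KEs of $S'_1$ and $S'_2$, $\mathrm{KE}_{S'_1}$ and $\mathrm{KE}_{S'_2}$, share the spare energy Eintermole calculated by (7). $\mathrm{KE}_{S'_1}$ and $\mathrm{KE}_{S'_2}$ are computed as (8) and (9), respectively, where $\eta_1$ is a number randomly selected from the range $[0, 1]$:
$$\mathrm{Eintermole} = (\mathrm{PE}_{S_1} + \mathrm{KE}_{S_1} + \mathrm{PE}_{S_2} + \mathrm{KE}_{S_2}) - (\mathrm{PE}_{S'_1} + \mathrm{PE}_{S'_2}), \quad (7)$$
$$\mathrm{KE}_{S'_1} = \mathrm{Eintermole} \times \eta_1, \quad (8)$$
$$\mathrm{KE}_{S'_2} = \mathrm{Eintermole} \times (1 - \eta_1). \quad (9)$$
(4) Synthesis: synthesis is also an intermolecule reaction. Two molecules, $S_1$ and $S_2$, may be combined into a new molecule, $S'$, if their energy values accord with inequality (10). The KE of $S'$ is computed as (11):
$$\mathrm{PE}_{S_1} + \mathrm{KE}_{S_1} + \mathrm{PE}_{S_2} + \mathrm{KE}_{S_2} \ge \mathrm{PE}_{S'}, \quad (10)$$
$$\mathrm{KE}_{S'} = (\mathrm{PE}_{S_1} + \mathrm{KE}_{S_1} + \mathrm{PE}_{S_2} + \mathrm{KE}_{S_2}) - \mathrm{PE}_{S'}. \quad (11)$$

The canonical CRO works as follows. Firstly, the initialization of CRO sets the system parameters, such as PopSize (the number of molecules in the population), KELossRate, InitialKE (the initial energy of the molecules), buf (the initial energy in the buffer), and MoleColl (a threshold value that determines whether to perform a unimolecule reaction or an intermolecule reaction). Then CRO enters a loop. In each iteration, whether to perform a unimolecule reaction or an intermolecule reaction is first decided in the following way. A number $r$ is randomly selected from the range $[0, 1]$. If $r$ is bigger than MoleColl, a unimolecule reaction is chosen; otherwise, an intermolecule reaction occurs.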
If it is a unimolecule reaction, a parameter $\alpha$ is used as a threshold value to guide the further choice between on-wall collision and decomposition. NumHit is the parameter used to record the total collision number of a molecule; it is updated after the molecule undergoes a collision. If the NumHit of a molecule is larger than $\alpha$, a decomposition is selected. Similarly, a parameter $\beta$ is used to decide between an intermolecular ineffective collision and a synthesis reaction. $\beta$ specifies the least KE of a molecule. A synthesis reaction is chosen when the KEs of both molecules $S_1$ and $S_2$ are less than $\beta$; otherwise, an intermolecular ineffective collision takes place. When the stopping criterion is satisfied (e.g., a better solution cannot be found after a certain number of consecutive iterations), the loop stops and the best solution is the molecule that possesses the lowest PE.
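The reaction selection and the on-wall energy bookkeeping described above can be sketched as follows. This is a minimal illustration under assumptions: the function names and the threshold names `alpha` and `beta` are chosen here for readability, and the on-wall split follows the canonical CRO rule in which the new molecule keeps a random fraction of the spare energy and the buffer receives the rest.

```python
import random


def choose_reaction(r, mole_coll, num_hit, alpha, ke1, ke2, beta):
    """Pick one of the four elementary reactions.

    r:         random number in [0, 1] drawn for this iteration
    mole_coll: MoleColl threshold
    num_hit:   NumHit of the selected molecule (unimolecule case)
    alpha:     decomposition threshold (hypothetical name)
    ke1, ke2:  kinetic energies of the two selected molecules
    beta:      synthesis threshold, the least KE of a molecule (hypothetical name)
    """
    if r > mole_coll:  # unimolecule reaction
        return "decomposition" if num_hit > alpha else "on_wall"
    # intermolecule reaction
    return "synthesis" if (ke1 < beta and ke2 < beta) else "intermolecular"


def on_wall_accepts(pe, ke, pe_new):
    """Energy condition of the on-wall ineffective collision."""
    return pe + ke >= pe_new


def on_wall_energy(pe, ke, pe_new, ke_loss_rate, rng=random):
    """Split the spare energy between the new molecule and the buffer.

    Returns (ke_new, to_buffer); their sum equals pe + ke - pe_new,
    so total energy is conserved.
    """
    spare = pe + ke - pe_new
    t = rng.uniform(ke_loss_rate, 1.0)
    return spare * t, spare * (1.0 - t)


ke_new, to_buffer = on_wall_energy(10.0, 2.0, 11.0, 0.2, random.Random(1))
```

Passing a seeded `random.Random` instance keeps a run reproducible, which is convenient when comparing convergence curves.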

Models
This section discusses the system, application, and task scheduling model assumed in this work. The definition of the notations can be found in the Notations section.

System Model.
In this paper, the target system consists of multiple heterogeneous processors, represented by $P = \{p_i \mid i = 1, 2, 3, \ldots, |P|\}$. They are fully interconnected by a high-speed network. Each task in a DAG can be executed on only one processor of the heterogeneous system. The edges of the graph are labeled with the communication cost that should be taken into account if the start and end tasks of an edge are executed on different processors. The communication cost is zero when the same processor is assigned to two communicating modules.
We assume a static computing system model in which the constraint relations and the execution costs of tasks are known a priori, and in which execution and communication can be performed simultaneously by the processors. In this paper, the heterogeneity is represented by $\mathrm{EC}_p(w)$, the execution cost of a node $w$ using processor $p$. Following the assumption of the MHM model, the heterogeneity in the simulations is set as follows so that a processor has different speeds for different tasks. The value of each $\mathrm{EC}_p(w)$ is randomly chosen within the range $[1 - g\%, 1 + g\%]$ by using a parameter $g$ ($g \in (0, 1)$). Therefore, the heterogeneity level can be formulated as $(1 + g\%)/(1 - g\%)$. In this paper, $g$ is set to the value that makes the heterogeneity level 2 unless otherwise specified.
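As a worked illustration of the heterogeneity setting, write $x$ for the half-width of the sampling interval, so that the heterogeneity level is $(1 + x)/(1 - x)$. Solving for a target level $h$ gives $x = (h - 1)/(h + 1)$, which is $1/3$ for level 2. The sketch below is only an assumption-laden example: the function names are hypothetical, and a uniform distribution is assumed because the text says only that the costs are randomly chosen.

```python
import random


def spread_for_level(level):
    """Half-width x such that (1 + x) / (1 - x) equals the target level."""
    return (level - 1.0) / (level + 1.0)


def random_exec_costs(num_procs, num_tasks, level=2.0, seed=0):
    """Sample an execution-cost matrix costs[p][w] in [1 - x, 1 + x].

    Assumption: costs are drawn uniformly; the source only says 'randomly'.
    """
    rng = random.Random(seed)
    x = spread_for_level(level)
    return [
        [rng.uniform(1.0 - x, 1.0 + x) for _ in range(num_tasks)]
        for _ in range(num_procs)
    ]


x = spread_for_level(2.0)
costs = random_exec_costs(num_procs=3, num_tasks=5, level=2.0, seed=42)
```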

Application Model.
In DAG scheduling, finding an optimal schedule means finding the scheduling solution with the minimum schedule length. The schedule length encompasses the entire execution and communication cost of all the modules and is also termed the makespan. In this paper, the task scheduling problem is to map a set of tasks to a set of processors, aiming at minimizing the makespan. It takes as input a directed acyclic graph $\mathrm{DAG} = (V, E)$, with $|V|$ nodes representing tasks and $|E|$ edges representing constraint relations among the tasks.
$V = (v_1, v_2, \ldots, v_i, \ldots, v_{|V|})$ is a node sequence in which the hypothetical entry node (with no predecessors) $v_1$ and end node (with no successors) $v_{|V|}$, respectively, represent the beginning and the end of execution. The execution cost of $v_i$ on processor $p_k$ is denoted as $\mathrm{EC}_{p_k}(v_i)$, and the average computation cost of $v_i$, denoted as $\overline{W}(v_i)$, can be calculated by (12):
$$\overline{W}(v_i) = \frac{1}{|P|} \sum_{k=1}^{|P|} \mathrm{EC}_{p_k}(v_i). \quad (12)$$
The parameter for the amount of computing power available at each node in a heterogeneous system and its heterogeneity level value are given in the 5th paragraph of Section 6 and Table 1.
The start time of the task $v_i$ on processor $p_k$ is denoted as $\mathrm{ST}_{p_k}(v_i)$, which can be calculated using (13), where Pred($v_i$) is the set of the predecessors of the task $v_i$. The earliest finish time of the task $v_i$ on processor $p_k$ is denoted as $\mathrm{EFT}_{p_k}(v_i)$, which can be calculated using (14).

Algorithm listing (conversion of a CCP molecule CCPS into a reaction molecule $S$):
(1) for $i = 1$; $i \le |V|$; $i$++
(2) for each $\mathrm{CCP}_j$ in molecule CCPS
(3) for each $cv$ in $\mathrm{CCP}_j$
(4) $v_i = cv$;
(5) $f_i = 0$;
Generate a new tuple ($v_i$, $f_i$, $p_i$)
end for
(9) end for
(10) end for
(11) Generate a new reaction molecule $S = ((v_1, f_1, p_1), (v_2, f_2, p_2), \ldots, (v_{|V|}, f_{|V|}, p_{|V|}))$;
(12) for each ($v_i$, $f_i$, $p_i$) in reaction molecule $S$
(13) find the first successor Succ($v_i$) in DAG from $i$ to the end;
(14) for each $v_j \in (v_i, \mathrm{Succ}(v_i))$
(15) find the first predecessor $v_k$ = Pred($v_j$) from Succ($v_i$) to the beginning in reaction molecule $S$;
(16) if $k < i$
(17) interchange the positions of ($v_i$, $f_i$, $p_i$) and ($v_j$, $f_j$, $p_j$) in reaction molecule $S$;
(18) end if
(19) end for
(20) end for
(21) for each $p_i$ in reaction molecule $S$ selected to change randomly
(22) change $p_i$ randomly
(23) end for
(24) return $S$;

The communication to computation ratio (CCR) can be used to indicate whether a DAG is communication intensive or computation intensive. For a given DAG, it is computed as the average communication cost divided by the average computation cost on a target computing system, as formulated in (15).
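The CCR definition can be made concrete with a small sketch. It assumes that the average computation cost of a task is its mean execution cost over all processors and that the average communication cost is the mean over all edges; the dictionary layout and the function names are hypothetical.

```python
def average_cost(exec_cost, task):
    """Mean execution cost of one task over all processors."""
    return sum(row[task] for row in exec_cost.values()) / len(exec_cost)


def ccr(exec_cost, comm_cost):
    """Communication to computation ratio of a DAG.

    exec_cost: processor -> {task: execution cost}
    comm_cost: (u, v) edge -> communication cost
    """
    tasks = list(next(iter(exec_cost.values())))
    avg_comp = sum(average_cost(exec_cost, t) for t in tasks) / len(tasks)
    avg_comm = sum(comm_cost.values()) / len(comm_cost)
    return avg_comm / avg_comp


exec_cost = {"p1": {"a": 2, "b": 4}, "p2": {"a": 4, "b": 6}}
comm_cost = {("a", "b"): 8}
ratio = ccr(exec_cost, comm_cost)
```

Here the average computation cost is 4 and the average communication cost is 8, so the ratio is 2 and the graph would count as communication intensive.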

Design of TMSCRO
TMSCRO mimics the interactions of molecules in chemical reactions with the concepts of molecule, atoms, molecular structure, and energy of a molecule. The structure of a molecule is unique and represents the atom positions in the molecule. The interactions of molecules in the four kinds of basic chemical reactions, on-wall ineffective collision, decomposition, intermolecular ineffective collision, and synthesis, aim to transform molecules into more stable states with lower energy. In DAG scheduling, a scheduling solution, including a task execution order and a processor allocation, corresponds to a molecule in TMSCRO. This paper also designs the operators on the encoded scheduling solutions (tuple arrays). These operators correspond to the chemical reactions and change the molecular structures. Arrays with different tuples represent different scheduling solutions, and the makespan of each scheduling solution can be calculated. The makespan of a scheduling solution corresponds to the energy of a molecule.

Algorithm listing (fragment of the TMSCRO main loop):
if $r$ > MoleColl
(7) Select a reaction molecule $S$ from CROPop randomly;
Call DecompT to generate new molecules $S'_1$ and $S'_2$;
Call Algorithm 3 to calculate $\mathrm{PE}_{S'_1}$ and $\mathrm{PE}_{S'_2}$;
(11) if Inequality (2) holds
(12) Remove $S$ from CROPop;
(13) Add $S'_1$ and $S'_2$ to CROPop;
(14) end if
(15) else
(16) Call OnWallT to generate a new molecule $S'$;
(17) Call Algorithm 3 to calculate $\mathrm{PE}_{S'}$;
(18) if ($S$ = InitS)
(19) InitS = $S'$;
(20) end if
(21) Remove $S$ from CROPop;
(22) Add $S'$ to CROPop;
(23) end if
(24) else
(25) Select two molecules $S_1$ and $S_2$ from CROPop randomly;
(26) if
Call SynthT to generate a new molecule $S'$;
(28) Call Algorithm 3 to calculate $\mathrm{PE}_{S'}$;
(29) if Inequality (10)
In this section, we first present the data pretreatment of the TMSCRO. After the presentation of the encoding of scheduling solutions and the fitness function used in the TMSCRO, we present the design of four elementary chemical reaction operators in each part of the TMSCRO. Finally, we outline the framework of the TMSCRO scheme and discuss a few important properties in TMSCRO.
Table 1 (excerpt):
Communication cost from node $v$ to node $w$, given the processors assigned to node $v$ and to node $w$.
$\mathrm{ST}_p(w, v)$: possible start time of node $w$, which is assigned the processor $p$, with node $v$ being any predecessor of $w$ that has already been scheduled.
$\mathrm{EFT}_p(w)$: finish time of node $w$ using processor $p$.

Table 2: CCP corresponding to the DAG as shown in Figure 1(1).

Molecular Structure, Data Pretreatment, and Fitness Function.
This subsection first presents the encoding of scheduling solutions (i.e., the molecular structure) and the data pretreatment. Then we state the fitness function for optimization designed in TMSCRO.

Molecular Structure and Data Pretreatment.
A reasonable initial population in CRO-based methods may increase the scope of the search over the fitness function [20], supporting faster convergence and resulting in a better solution. Constrained critical paths (CCPs) can be seen as a classification of task sequences constructed by the constrained earliest finish time (CEFT) algorithm, which takes into account all factors in the DAG (i.e., the average execution cost of each task, the communication costs, and the graph topology). Therefore, TMSCRO utilizes the CCPs to create a reasonable initial population based on a broad view of the DAG. The data pretreatment generates the CCPDAG from the DAG and constructs CCPS for the initialization of TMSCRO. The CCPDAG is a directed acyclic graph with |CCP| nodes representing the constrained critical paths $\mathrm{CCP}_i$, two virtual nodes (i.e., start and end) representing the beginning and exit of execution, respectively, and |CE| edges representing dependencies among the nodes. Unlike the edges of the DAG, the edges of the CCPDAG are not labeled with communication overhead. The data pretreatment includes two steps.
(1) The CCPs and the processor allocation of each element of CCP in the DAG are obtained by executing CEFT, which also yields the first initial CCP solution, $\mathrm{InitCCPS} = ((\mathrm{CCP}_1, sp_1), (\mathrm{CCP}_2, sp_2), \ldots, (\mathrm{CCP}_{|\mathrm{CCP}|}, sp_{|\mathrm{CCP}|}))$, in which the pairs $(\mathrm{CCP}_i, sp_i)$ are sorted in the order in which the $\mathrm{CCP}_i$ were generated and $sp_i$ is the processor assigned to $\mathrm{CCP}_i$ by CEFT. For the graph shown in Figure 1, the resulting CCPs are indicated in Table 2.
(2) After the execution of CEFT for DAG, the CCPDAG is generated with the input of CCP and DAG. A detailed description is given in Algorithm 2.
As shown in Algorithm 2, an edge of the DAG with start task $v_i$ and end task $v_j$ is obtained in each iteration of the loop (line 1). BelongCCP($v$) returns the CCP in CCP to which task $v$ belongs (line 2 and line 3). If the two CCPs obtained in this way are different and there is no edge between them yet (line 4), then the edge between them is generated (line 5). Finally, the nodes, start and end, and the edges between them and the CCP nodes are added (line 7, line 8, and line 9). Consider the DAG shown in Figure 1 and the CCPs indicated in Table 2. The resulting CCPDAG is shown in Figure 3.
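The CCPDAG construction walked through above can be sketched in a few lines. This is not the authors' implementation: the function name `gen_ccpdag` and the data layout are hypothetical, and the rule for the virtual nodes is an assumption (start is connected to every CCP node without a predecessor and end to every CCP node without a successor), since the text says only that these edges are added.

```python
def gen_ccpdag(edges, ccps):
    """Build the edge set of a CCPDAG.

    edges: iterable of (u, v) task pairs of the DAG
    ccps:  list of CCPs, each a list of tasks; a CCP is identified by its index
    Returns a set of (source, target) pairs over CCP indices plus the
    virtual nodes "start" and "end".
    """
    # BelongCCP: map every task to the index of the CCP that contains it.
    belong = {task: idx for idx, ccp in enumerate(ccps) for task in ccp}

    ccp_edges = set()
    for u, v in edges:
        a, b = belong[u], belong[v]
        if a != b:  # the set ignores an edge that already exists
            ccp_edges.add((a, b))

    # Assumption: attach the virtual nodes to CCPs lacking a predecessor
    # or a successor.
    has_pred = {b for _, b in ccp_edges}
    has_succ = {a for a, _ in ccp_edges}
    for idx in range(len(ccps)):
        if idx not in has_pred:
            ccp_edges.add(("start", idx))
        if idx not in has_succ:
            ccp_edges.add((idx, "end"))
    return ccp_edges


dag_edges = [(1, 2), (1, 3), (2, 4), (3, 4)]
ccps = [[1, 2], [3], [4]]
ccpdag = gen_ccpdag(dag_edges, ccps)
```

Using a set for the edges replaces the explicit existence test of the listing, because adding a duplicate pair has no effect.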
In this paper, TMSCRO has two kinds of molecular structures, CCPS and S. The CCP molecular structure CCPS is used only in the initialization of TMSCRO and can be formulated as in (16), whereas the reaction molecular structure S converted from CCPS participates in the elementary reactions of TMSCRO. In CCPS, the pairs (CCP i , sp i ) are sorted according to the topology of the CCPDAG, in which CCP i is a constrained critical path (CCP) and sp i is the processor assigned to CCP i . The number of CCPs, |CCP|, cannot exceed the number of tasks in the DAG because each CCP contains at least one task.
A reaction molecule S can be formulated as in (17). It consists of an array of atoms (i.e., tuples) representing a solution of the DAG scheduling problem. Each tuple includes three integers: a task V in the DAG, a flag for the constraint relationship between the tuple and the one before it, and the processor to which the task is allocated. In each reaction molecular structure S, the first integers of the tuples form a topological sequence of the DAG. If the task of the tuple immediately before a given tuple is a predecessor in the DAG of the given tuple's task, the second integer of the given tuple is 1; otherwise it is 0. The sequence of the tuples in a reaction molecular structure represents the scheduling order of the tasks in the DAG:
CCPS = ((CCP 1 , sp 1 ), (CCP 2 , sp 2 ), . . . , (CCP |CCP| , sp |CCP| )). (16)
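As a small illustration of the tuple encoding described above, the following Python sketch builds a reaction molecule from a topological task order. The function name `encode_molecule` and the dictionary-based inputs are hypothetical choices for this example; only the meaning of the three tuple fields is taken from the text.

```python
def encode_molecule(order, preds, procs):
    """Encode a schedule as a list of (task, flag, processor) tuples.

    order: tasks in a topological order of the DAG.
    preds: dict mapping a task to the set of its direct predecessors.
    procs: dict mapping a task to its assigned processor.
    """
    molecule = []
    for i, task in enumerate(order):
        # flag is 1 when the task of the previous tuple is a predecessor
        # of this task in the DAG, otherwise 0 (the first tuple gets 0).
        is_constrained = i > 0 and order[i - 1] in preds.get(task, set())
        molecule.append((task, 1 if is_constrained else 0, procs[task]))
    return molecule
```

A flag of 0 marks a position where a tuple and its former neighbor could trade places without breaking a precedence constraint.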

Fitness Function.
The initial molecule generator is used to generate the initial solutions for TMSCRO to manipulate. The first molecule InitS is converted from InitCCPS. The third part, sp, of each tuple is generated by a random perturbation in the first InitCCPS. Algorithms 3 and 4 give a detailed description and present how to convert a CCPS to an S. Potential energy (PE) is defined as the objective function (fitness function) value of the corresponding solution represented by S. The overall schedule length of the entire DAG, namely, the makespan, is the largest finish time among all tasks, which is equivalent to the actual finish time of the end node in the DAG. For DAG scheduling by TMSCRO, the goal is to obtain the schedule that minimizes the makespan, and Algorithm 5 presents how to calculate the value of the fitness function Fit(S).
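To make the fitness definition concrete, the sketch below computes a makespan for a tuple molecule in Python. It is not Algorithm 5 of the paper. It assumes a simple non-insertion policy (each task starts as soon as its processor is free and all predecessor data has arrived) and zero communication cost between tasks on the same processor; the names `makespan`, `exec_cost`, and `comm_cost` are hypothetical.

```python
def makespan(molecule, preds, exec_cost, comm_cost):
    """Return the largest finish time of a schedule.

    molecule:  list of (task, flag, processor) tuples in scheduling order.
    preds:     dict task -> set of direct predecessors.
    exec_cost: dict (task, processor) -> execution cost.
    comm_cost: dict (pred_task, task) -> communication cost between processors.
    """
    finish = {}     # finish time of each scheduled task
    proc_of = {}    # processor chosen for each scheduled task
    proc_free = {}  # time at which each processor becomes available

    for task, _flag, proc in molecule:
        ready = 0.0
        for p in preds.get(task, ()):
            # Assumption: data on the same processor arrives at no cost.
            delay = 0.0 if proc_of[p] == proc else comm_cost[(p, task)]
            ready = max(ready, finish[p] + delay)
        start = max(ready, proc_free.get(proc, 0.0))
        finish[task] = start + exec_cost[(task, proc)]
        proc_of[task] = proc
        proc_free[proc] = finish[task]
    return max(finish.values())
```

The loop relies on the molecule being a topological sequence, so every predecessor has already been scheduled when a task is reached.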

Elementary Chemical Reaction Operators.
This subsection presents the four elementary chemical reaction operators designed in TMSCRO for sequence optimization and processor allocation optimization: on-wall collision, decomposition, intermolecular collision, and synthesis.

On-Wall Ineffective Collision.
In this paper, the operator OnWallT is used to generate a new molecule from a given reaction molecule for optimization. The example for OnWallT uses the molecule corresponding to the DAG shown in Figure 1(2).
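The following Python sketch illustrates one plausible reading of an on-wall style move on a tuple molecule. It is an assumption-laden illustration, not the paper's step list: it relies on the later remark that OnWallT exchanges the positions of one tuple and its former neighbor and that every operator may change the processor assignment. The function name `on_wall_sketch`, the random choice of the position, and the redrawn processor are all hypothetical.

```python
import random


def on_wall_sketch(molecule, preds, num_procs, rng=random):
    """Swap one tuple with its former neighbor and redraw one processor.

    molecule:  list of (task, flag, processor) tuples (topological order).
    preds:     dict task -> set of direct predecessors.
    num_procs: number of processors, numbered 0 .. num_procs - 1.
    """
    tasks = [t for t, _f, _p in molecule]
    procs = {t: p for t, _f, p in molecule}

    # A tuple with flag 0 does not depend on its former neighbor, so the
    # two adjacent tuples can trade places and the order stays topological.
    candidates = [i for i in range(1, len(molecule)) if molecule[i][1] == 0]
    if candidates:
        i = rng.choice(candidates)
        tasks[i - 1], tasks[i] = tasks[i], tasks[i - 1]
        # Assumed processor perturbation for the tuple that moved forward.
        procs[tasks[i - 1]] = rng.randrange(num_procs)

    # Recompute every flag against the new neighbors.
    result = []
    for k, t in enumerate(tasks):
        flag = 1 if k > 0 and tasks[k - 1] in preds.get(t, set()) else 0
        result.append((t, flag, procs[t]))
    return result
```

Because only two adjacent tuples move, the new molecule stays in the close neighborhood of the old one, which matches the intensification role of this operator.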

Decomposition.
In this paper, the operator DecompT is used to generate two new molecules, S1' and S2', from a given reaction molecule S. DecompT works as follows. (1)  The only difference is that, in step 2, we use (V , , ) instead of (V , , ). (6) The operator keeps the tuples in S1' that are at the odd positions in S, retains the tuples in S2' that are at the even positions in S, and then changes the remaining tuples in S1' and S2' randomly. In the end, the operator generates two new molecules S1' and S2' from S as a diversification search.
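The last step of DecompT (keep the odd positions in one molecule, the even positions in the other, and change the rest randomly) can be sketched as follows. This is a hedged illustration only: it assumes that positions are counted from 1 and that "changing" a tuple means redrawing its processor while the task order stays untouched. The function name `decomp_step6_sketch` is hypothetical, and the earlier steps of the operator are not modeled.

```python
import random


def decomp_step6_sketch(molecule, num_procs, rng=random):
    """Split one molecule into two along odd and even positions.

    molecule:  list of (task, flag, processor) tuples.
    num_procs: number of processors, numbered 0 .. num_procs - 1.
    Returns (s1, s2): s1 keeps the odd-position tuples of the input,
    s2 keeps the even-position tuples; all other tuples get a random
    processor (an assumption about what "changes randomly" means).
    """
    s1, s2 = [], []
    for pos, (task, flag, proc) in enumerate(molecule, start=1):
        if pos % 2 == 1:
            s1.append((task, flag, proc))
            s2.append((task, flag, rng.randrange(num_procs)))
        else:
            s1.append((task, flag, rng.randrange(num_procs)))
            s2.append((task, flag, proc))
    return s1, s2
```

Each child inherits roughly half of the parent's processor assignment and randomizes the other half, which is how one molecule can spread into two distant regions of the solution space.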

Intermolecular Ineffective Collision.
In this paper, the operator IntermoleT is used to generate new molecules S1' and S2' from given molecules S1 and S2. This operator first uses the steps in OnWallT to generate S1' from S1 and then generates the other new molecule S2' from S2 in a similar fashion. In the end, the operator generates two new molecules S1' and S2' from S1 and S2 as an intensification search. Figures 8 and 9 show an example based on the molecule corresponding to the DAG shown in Figure 1(2).

Synthesis.
In this paper, the operator SynthT is used to generate a new molecule S' from given molecules S1 and S2 for optimization. SynthT works as follows. (1) If the number of tuples in the molecule is even, the integer used in the later steps is half of that number; otherwise it is that number plus one, divided by two. (5) The operator keeps the tuples in S' that match at the same position in S1 and S2 and then changes the remaining tuples in S' randomly. As a result, the operator generates S' from S1 and S2 as a diversification search. Figures 10 and 11 show an example based on the molecule corresponding to the DAG shown in Figure 1(2).
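The two surviving steps of SynthT can be illustrated with a short Python sketch. Several points are assumptions rather than the paper's definition: tuples "match" when all three fields are equal, the child takes its task order from the first parent, and a non-matching tuple is "changed" by redrawing its processor. The names `midpoint` and `synth_step5_sketch` are hypothetical.

```python
import random


def midpoint(n):
    """Step (1): n / 2 when n is even, (n + 1) / 2 when n is odd."""
    return n // 2 if n % 2 == 0 else (n + 1) // 2


def synth_step5_sketch(s1, s2, num_procs, rng=random):
    """Step (5): keep matching tuples, randomize the remaining ones.

    s1, s2:    parent molecules, lists of (task, flag, processor) tuples
               of equal length.
    num_procs: number of processors, numbered 0 .. num_procs - 1.
    """
    child = []
    for a, b in zip(s1, s2):
        if a == b:
            # Both parents agree at this position: keep the tuple.
            child.append(a)
        else:
            # Assumption: keep the task and flag of the first parent and
            # draw a new processor at random.
            task, flag, _proc = a
            child.append((task, flag, rng.randrange(num_procs)))
    return child
```

Keeping only what both parents agree on preserves their shared structure while the random part lets the child move away from both of them.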

The Framework and Analysis of TMSCRO.
The framework of TMSCRO for scheduling a DAG job is outlined in Algorithm 6, whose output is the resultant near-optimal solution for the corresponding DAG scheduling problem. In this framework, TMSCRO first initializes the process. Then, the process enters a loop. In each iteration, one of the elementary chemical reaction operators is performed to generate new molecules, and the PE of the newly generated molecules is calculated. The overall working of TMSCRO for DAG scheduling on heterogeneous systems is as presented in the last paragraph of Section 3.2. However, InitS is considered to be a super molecule [6], so it is tracked and participates only in on-wall ineffective collision and intermolecular ineffective collision. This lets it explore as much of the solution space in its neighborhood as possible while preventing InitS from changing dramatically. The iteration repeats until the stopping criteria are met. The stopping criteria may be set based on different parameters, such as the maximum amount of CPU time used, the maximum number of iterations performed, an objective function value below a predefined threshold, or the maximum number of iterations performed without further improvement. The stopping criterion of TMSCRO in the experiments of this paper is that the makespan has not changed for 5000 consecutive iterations of the loop. The time complexity of TMSCRO is (iters × [2 × (| | 2 + | | × | |)], where iters is the number of iterations in TMSCRO, respectively. It is very difficult to theoretically prove the optimality of the CRO (as well as DMSCRO and TMSCRO) scheme [37]. However, by analyzing the molecular structure, the chemical reaction operators, and the operational environment in TMSCRO, it can be shown to some extent that the TMSCRO scheme has three advantages in comparison with GA, SA, and DMSCRO.
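The stopping rule used in the experiments (stop once the best makespan has not changed for a fixed number of consecutive iterations) can be sketched independently of the reaction operators. In this hypothetical Python helper, `step` stands for one iteration of the search loop and returns the makespan of the molecule it produced; the names `run_until_stalled`, `patience`, and `max_iters` are assumptions for the example.

```python
def run_until_stalled(step, initial_best, patience=5000, max_iters=None):
    """Run `step` until the best value stalls for `patience` iterations.

    step:         callable performing one search iteration and returning
                  the makespan of the newly generated solution.
    initial_best: makespan of the best solution before the loop starts.
    patience:     number of consecutive non-improving iterations allowed.
    max_iters:    optional hard cap on the number of iterations.
    Returns (best_makespan, iterations_performed).
    """
    best = initial_best
    stalled = 0
    iters = 0
    while stalled < patience and (max_iters is None or iters < max_iters):
        candidate = step()
        iters += 1
        if candidate < best:
            best = candidate
            stalled = 0   # an improvement resets the counter
        else:
            stalled += 1
    return best, iters
```

The optional `max_iters` cap corresponds to the alternative criterion of a maximum number of iterations mentioned in the text.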
First, just like DMSCRO, TMSCRO enjoys the advantages of GA and SA to some extent, as can be seen from the chemical reaction operators designed in TMSCRO and its operational environment. (1) OnWallT and IntermoleT in TMSCRO exchange the partial structure of two different molecules, like the crossover operator in GA. (2) The energy conservation requirement in TMSCRO guides the search for the optimal solution in a way similar to how the Metropolis algorithm guides the evolution of the solutions in SA.
Second, the constrained earliest finish time (CEFT) algorithm constructs constrained critical paths (CCPs) by taking into account a broader view of the input DAG [5]. TMSCRO applies CEFT and CCPDAG to the data pretreatment and utilizes CCPs in the initialization to create a more reasonable initial population than DMSCRO for accelerating convergence, because a widely distributed initial population in CRO-based methods may widen the scope of the search over the fitness function [20], supporting faster convergence and a better solution. Moreover, to some degree, InitS is similar to the super molecule in super molecule-based CRO or the "elite" in GA [6]. However, the "elite" in GA is usually generated from two chromosomes, while InitS is derived from the whole input DAG by executing CEFT.
Third, the operators and the molecular structure in TMSCRO are designed more reasonably than those in DMSCRO. In CRO-based algorithms, the on-wall collision and intermolecular collision operators are used for intensification, while the decomposition and synthesis operators are used for diversification. The better the operators are, the better the results of the intensification and diversification searches. This feature of CRO is very important, because it gives CRO more opportunities to jump out of local optima and explore wider areas of the solution space.
In TMSCRO, OnWallT and IntermoleT each time exchange only the positions of one tuple and its former neighbor in the molecule, which gives better intensification on sequence optimization than DMSCRO, whose reaction operators OnWall ( 1 ) and Intermole ( 1 , 2 ) [37] ( 1 and 2 are big molecules in DMSCRO) may change the task sequence(s) dramatically. Moreover, since the optimization covers not only the sequence but also the processor assignment, all reaction operators in TMSCRO can change the processor assignment, whereas DMSCRO has only two reactions, on-wall and synthesis [37], for processor assignment optimization. On the one hand, all four elementary reactions of TMSCRO search the processor assignment solution space, so the probability of searching it is 100%, which gives better diversification and intensification on processor assignment optimization than DMSCRO, whose chance of searching this kind of solution space is only 50%. On the other hand, the division of the four reactions in TMSCRO into diversification and intensification is very clear, which is not the case in DMSCRO. In each iteration, the diversification and intensification searches in TMSCRO have the same probability of being conducted, whereas the probability of a diversification or intensification search in DMSCRO is uncertain. This design improves both the speed of convergence and the search result over the whole solution space, which is demonstrated by the experimental results in Section 6.3.

Simulation and Results
Simulations have been performed to test the TMSCRO scheduling algorithm in comparison with the heuristics HEFT B and HEFT T [8] for DAG scheduling and with a metaheuristic algorithm, double molecular structure-based chemical reaction optimization (DMSCRO) [37], using two sets of graph topologies: real world applications (Gaussian elimination and molecular dynamics code) and randomly generated applications. The task graph of Gaussian elimination for an input matrix of size 7 is shown in Figure 12, and a molecular dynamics code graph is shown in Figure 13. Figure 14 shows a random graph with 10 nodes. The baseline performance is the makespan obtained by DMSCRO.
Considering that HEFT B and HEFT T perform better than other heuristic algorithms for DAG scheduling on heterogeneous computing systems, as discussed in the 8th paragraph of Section 2.1, these two algorithms are used as the representative heuristics in the simulation. There are three reasons why we regard the makespan performance of DMSCRO [37] scheduling as the baseline performance.
(1) As far as we know, DMSCRO is the only CRO-based algorithm for DAG scheduling that takes into account the search of both the task order and the processor assignment. (2) As discussed in the 3rd paragraph of Section 2.2, DMSCRO [37] has the closest system model and workload to those of TMSCRO. (3) In [37], the CRO-based scheduling algorithm is considered to absorb the strengths of SA and GA. However, the underlying principles and philosophies of SA are very different from those of DMSCRO, and because DMSCRO has also been shown to be more effective than the genetic algorithm (GA) [15], as presented in [37], we use only DMSCRO to represent the metaheuristic algorithms. We compare TMSCRO with DMSCRO to validate the advantages of TMSCRO over DMSCRO.
The performance has been evaluated by the makespan. The makespan values plotted in the bar graphs of makespan and in the charts of convergence traces are the averages of 50 and 25 independent runs, respectively, to validate the robustness of TMSCRO. The communication cost is calculated from the computation costs and the communication to computation ratio (CCR) values, as formulated in (19):
Communication Cost = CCR * Computation Cost. (19)
The suggested values for the other parameters of the simulation of TMSCRO are listed in Table 3. These values are proposed in [20].

Real World Application Graphs.
The real world application set is used to evaluate the performance of TMSCRO, which consists of two real world problem graph topologies, Gaussian elimination [22] and molecular dynamics code [19].
6.1.1. Gaussian Elimination.
Gaussian elimination is a well-known method for solving a system of linear equations. It converts a set of linear equations to the upper triangular form by applying elementary row operations systematically. As shown in Figure 12, the matrix size of the task graph of the Gaussian elimination algorithm is 7, with 27 tasks in total. In [37], this DAG was used for the simulation of DMSCRO, and we also apply it to the evaluation of TMSCRO in this paper. Since the graph structure is fixed, the only variable parameters are the communication to computation ratio (CCR) value and the number of heterogeneous processors. In the simulation, the CCR values were set as 0.1, 0.2, 1, 2, and 5, respectively. Considering that the identical operation is executed on each processor and the information communicated between heterogeneous processors is the same in Gaussian elimination, the execution cost of each task is supposed to be the same and all communication links have the same communication cost.
The parameters and their values of the Gaussian elimination graphs performed in the simulation are given in Table 4.
The makespan of TMSCRO, DMSCRO, HEFT B, and HEFT T under an increasing number of processors is shown in Figure 15. It can be seen that, as the number of processors increases, the average makespan declines, and the advantage of TMSCRO and DMSCRO over HEFT B and HEFT T also decreases, because when more computing nodes are available to run the same scale of tasks, a less intelligent scheduling algorithm is sufficient to achieve good performance.
As intelligent random search algorithms, TMSCRO and DMSCRO search a wider area of the solution space than HEFT B, HEFT T, or other heuristic algorithms, which narrow the search down to a very small portion of the solution space. This is the reason why TMSCRO and DMSCRO are more likely to obtain better solutions and outperform HEFT B and HEFT T.
The simulation results show that the performance of TMSCRO and that of DMSCRO are very similar. The fundamental reason is that both are metaheuristic algorithms. According to the No-Free-Lunch Theorem in the field of metaheuristics, the performances of all well-designed metaheuristic search algorithms for the optimal solution are the same when averaged over all possible objective functions. In theory, a well-designed metaheuristic algorithm gradually approaches the optimal solution if it runs for long enough. The DMSCRO developed in [37] is well designed, and we use it in the simulations of this paper. Therefore, the similar simulation results of TMSCRO and DMSCRO indicate that the TMSCRO we developed is also well designed. The detailed experiment result is shown in Table 5.
Figure 15 shows that TMSCRO is slightly superior to DMSCRO. The reason is the stopping criterion set in this simulation: the search stops when the makespan stays unchanged for 5000 consecutive iterations of the search loop. As discussed in the last paragraph of Section 5, all metaheuristic methods that search for optimal solutions perform the same when averaged over all possible objective functions, and this stopping criterion lets TMSCRO and DMSCRO run long enough to gradually approach the optimal solution. Moreover, the better convergence of TMSCRO makes it more efficient than DMSCRO in finding good solutions, since it needs far fewer iterations. More detailed experiment results in this regard are presented in Section 6.3.
Figure 16 shows that the average makespan of the four algorithms increases rapidly as CCR increases. The reason is that, as CCR increases, the application becomes more communication intensive, leaving the heterogeneous processors idle for longer. As shown in Figure 16, TMSCRO and DMSCRO outperform HEFT B and HEFT T, and the advantage becomes more obvious as CCR becomes larger. These experimental results suggest that, for communication-intensive applications, TMSCRO and DMSCRO deliver more consistent performance and perform more effectively than the heuristic algorithms HEFT B and HEFT T in a wide range of scenarios for DAG scheduling. The detailed experiment result is shown in Table 6.

Molecular Dynamics Code.
Figure 13 shows the DAG of a molecular dynamics code as presented in [19]. As in the experiment on Gaussian elimination, the structure of the graph is fixed. The varied parameters are the number of heterogeneous processors and the CCR values, which are set to 0.1, 0.2, 1, 2, and 5 in our simulation.
The parameters and their values of the molecular dynamics code graphs used in the simulation are given in Table 7.
As shown in Figures 17 and 18, under different numbers of heterogeneous processors and different CCR values, TMSCRO and DMSCRO outperform HEFT B and HEFT T in average makespan. In Figure 17, it can be observed that the average makespan decreases as the number of heterogeneous processors increases. The average makespan with respect to different CCR values is shown in Figure 18; it increases as the value of CCR increases. The detailed experiment results are shown in Tables 8 and 9, respectively.

Random Generated Application Graphs.
An effective mechanism to generate random graphs for various applications is proposed in [42]. By using the probability for an edge between any two nodes, it can generate a random graph without a bias towards a specific topology.
In the random graph generation of this mechanism, the topological order is used to guarantee the precedence constraints; that is, an edge exists between two nodes V 1 and V 2 only if V 1 < V 2 . For probability pb, ⌊|V| * pb⌋ edges are created from every node to another node ( 1 + (1/pb) * ) mod |V|, where 1 ≤ ≤ ⌊|V| * pb⌋, and |V| is the total number of task nodes in the DAG.
The parameters and their values of the random graphs used in the simulation are given in Table 10. Figure 19 shows that TMSCRO always outperforms HEFT B, HEFT T, and DMSCRO as the number of tasks in a DAG increases. The comparison of the average makespan of the four algorithms under an increasing number of heterogeneous processors is shown in Figures 20 and 21. As can be seen from these figures, TMSCRO performs better than the other three algorithms in all cases. The reasons for the trends in these two figures are the same as those explained for Figure 15. The detailed experiment results are shown in Tables 11, 12, and 13, respectively.
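The precedence rule used in the random graph generation (an edge from V 1 to V 2 is allowed only if V 1 < V 2) can be illustrated with a simplified Python generator. This is not the mechanism of [42]: instead of its edge formula, the sketch simply draws each admissible edge independently with probability pb, which is an assumption made to keep the example short. The function name `random_dag` is hypothetical.

```python
import random


def random_dag(num_tasks, pb, rng=random):
    """Generate the edge list of a random DAG over tasks 0 .. num_tasks - 1.

    An edge (v1, v2) is only ever considered when v1 < v2, so the natural
    numbering of the tasks is a topological order and no cycle can arise.
    Each admissible edge is kept with probability pb.
    """
    edges = []
    for v1 in range(num_tasks):
        for v2 in range(v1 + 1, num_tasks):
            if rng.random() < pb:
                edges.append((v1, v2))
    return edges
```

Restricting edges to increasing task numbers is what guarantees acyclicity, independently of how the edges themselves are selected.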
As shown in Figure 22, the average makespan obtained by TMSCRO increases rapidly as the CCR value increases. This may be because, as CCR increases, the application becomes more communication intensive, leaving the heterogeneous processors idle for longer. The detailed experiment results are shown in Table 14.

6.3. Convergence Trace of TMSCRO.
The experiments in the previous subsections report the final makespan obtained by TMSCRO and DMSCRO, showing that TMSCRO can obtain makespan performance similar to that of DMSCRO. Moreover, in some cases the final makespan achieved by TMSCRO is even better than that of DMSCRO after the stopping criteria are satisfied. In this section, the change of makespan as TMSCRO and DMSCRO progress during the search is demonstrated by comparing the convergence traces of the two algorithms. These experiments further reveal the better convergence of TMSCRO and also help explain why TMSCRO outperforms DMSCRO in some cases.
The parameters and their values of the Gaussian elimination, molecular dynamics code, and random graphs used in the simulation are given in Tables 15, 16, and 17, respectively. Figures 23 and 24 plot the convergence traces for processing Gaussian elimination and the molecular dynamics code, respectively. Figures 25, 26, and 27 show the convergence traces when processing the sets of randomly generated DAGs, where the sets contain DAGs of 10, 20, and 50 tasks, respectively. These figures demonstrate that the makespan decreases quickly as both TMSCRO and DMSCRO progress and that the decreasing trends tail off when the algorithms run for long enough. These figures also show that, in most cases, the convergence traces of the two algorithms are rather different even though the final makespans obtained by them are almost the same. The statistical analysis results over the average coverage rate at 5000 ascending sampling points from the start time to the end time of all the experiments are shown in Table 18 (the significance threshold is set as 0.05). They are obtained by the Friedman test, and each experiment is carried out 25 times. The differences between the two algorithms in performance are significant from a statistical point of view. The reason is that the super molecule gives TMSCRO a stronger convergence capability, especially early in each run. Moreover, the convergence performance of TMSCRO is better than that of DMSCRO. Quantitatively, our records show that TMSCRO converges faster than DMSCRO by 12.89% on average over all the cases (by 23.27% on average in the best case).
In these experiments, the stopping criterion is that the algorithm stops when the makespan remains unchanged for a preset number of consecutive iterations of the search loop (5000 iterations in the experiments). In practice, the algorithms can also stop when the total processing time reaches a preset value (e.g., 180 s). Moreover, TMSCRO and DMSCRO have the same initial population. In this case, the fact that TMSCRO outperforms DMSCRO on convergence means that the makespan achieved by TMSCRO could be much better than that of DMSCRO when the stopping criteria are satisfied. The reason for this can be

S: A reaction molecule (i.e., solution) in TMSCRO
(V , , ): Atom (i.e., tuple) in S
InitCCPS: The first CCP molecule for the initialization of TMSCRO
InitS: The first molecule in TMSCRO
BelongCCP( ): CCP that node belongs to
CCPE(CCPs, CCPe): Edge between CCPs and CCPe
(V): Average computation cost of node V
EC ( ): Execution cost of a node using processor
CM( , , V, ): Communication cost from node V to , if has been assigned to node V and is assigned to node
ST ( , V): Possible start time of node which is assigned the processor with the V node being any predecessor of which has already been scheduled
EFT ( ): Finish time of node using processor
AT : Availability time of
Pred( ): Set of predecessors of node
Succ( ): Set of successors of node
CCR: Communication to computation ratio
 : The parameter to adjust the heterogeneity level in a heterogeneous system
PE: Current potential energy of a molecule
KE: Current kinetic energy of a molecule
InitialKE: Initial kinetic energy of a molecule
 : Threshold value guiding the choice of on-wall collision or decomposition
 : Threshold value guiding the choice of intermolecule collision or synthesis
Buffer: Initial energy in the central energy buffer
KELossRate: Loss rate of kinetic energy
MoleColl: Threshold value to determine whether to perform a unimolecule reaction or an intermolecule reaction
PopSize: Size of the molecules
NumHit: Total collision number of a molecule.