Effective Task Scheduling and IP Mapping Algorithm for Heterogeneous NoC-Based MPSoC

The quality of task scheduling is critical to the network communication efficiency and the performance of an entire NoC (Network-on-Chip) based MPSoC (multiprocessor System-on-Chip). In this paper, the NoC-based MPSoC design process is divided into two steps: scheduling subtasks to processing elements (PEs) of appropriate type and quantity, and then mapping these PEs onto the switching nodes of the NoC topology. After the task model is improved so that it better reflects the real intertask relations, an optimized particle swarm optimization (PSO) algorithm carries out the first step, aiming at lower task running and transfer cost as well as the least task execution time. By referring to the NoC topology and the communication diagram produced by the first step, the second step is carried out with the goal of minimal network transmission delay, low resource consumption, and balanced power consumption. Comparative experiments show that the algorithm achieves favorable resource and power consumption when it is adopted in a system design.


Introduction
The development of integrated circuits has provided strong support for the integration of multiple processing elements (PEs) on a single chip, and the on-chip communication between cores has developed from the bus-based approach to two-dimensional and three-dimensional Network-on-Chip (NoC). The network-based, highly parallel System-on-Chip (SoC) structure has become the inevitable choice for the next generation of complex computer architecture [1]. Nevertheless, the dramatic increase in the number of PEs that can be integrated and in the size of executable tasks has brought new problems and challenges to system design, among which the dividing and scheduling of tasks and IP mapping have become the focus of study.
NoC-based task scheduling and IP mapping, given the tasks, the types and number of PEs available, and the NoC topology, assign tasks to suitable PEs and map the PEs onto the network topology so as to improve system efficiency as much as possible while the whole system meets its power consumption and delay requirements. Their significance includes the following: (1) they serve as the bridge between applications and architecture and determine the task implementation, processing performance, and efficiency on the architecture; (2) as a heterogeneous multicore architecture is usually associated with a particular field, efficient task scheduling can better support applications in that field; and (3) as the size of tasks and of multicore system architectures is increasing, an efficient division of the mapping problem helps improve the quality and efficiency of exploring the mapping space and thereby improves the performance and efficiency of the entire SoC.

Related Work
Current research seldom distinguishes between task scheduling and IP mapping in detail, and modeling and analysis are conducted on the assumption that a PE performs only one subtask (in some algorithms, subtasks are simply treated as PEs). That is to say, the task is abstracted into a simple task model which gives only the calling relationship between subtasks; based on this information, the scheduling algorithm tries to minimize the running time [2][3][4]. The approach has several drawbacks: (1) the heterogeneous nature of NoCs and the communication delay between tasks are usually neglected; (2) as the interdependence among tasks is complex and the model abstracts only the calling relationship between subtasks, other factors cannot be fully reflected and transfer costs among different PEs are inadequately considered. The schedules produced by these models are not satisfactory in practical operation, so continuous recalculation and adjustment are required during system operation, which inevitably brings additional burden to the system and threatens operating efficiency.
In addition, in terms of the time at which the scheduling decision is made, task scheduling can be divided into static scheduling and dynamic scheduling. Static scheduling means that the compiler makes the scheduling decision at compile time; examples are list-based algorithms [5,6], clustering algorithms [7][8][9], and duplication-based algorithms [10,11]. However, the static scheduling model has a drawback: as the model only approximates the communication and execution time among processors, it might disagree with the actual execution of the program or even produce poor scheduling results.
Dynamic scheduling means that a scheduler assigns tasks to appropriate processors at run time according to their performance so that the various requirements on the system can be met. Research in this area mainly employs heuristic algorithms, such as genetic algorithm (GA) [12] and ant-colony-based optimization (ACO) [13,14] heuristic task scheduling, dynamic scheduling based on a task pool [15], particle swarm optimization (PSO) [16,17], optimized evolutionary algorithms [18,19], and dynamic scheduling based on real-time constraints [20]. Although good scheduling results can be attained when these approaches are applied to task partitioning and mapping, in practice their inherent defects easily cause problems during operation: the convergence of the genetic algorithm is slow in its late stage; in the early stage of the ant colony algorithm, inadequate coverage of the search space leads to a gap between its result and the optimum value; and particle swarm optimization is prone to being trapped in local optima.
Meanwhile, with respect to NoC topology, through-silicon via (TSV) technology [21] and optical interconnection technology [22,23] have made possible higher IP core density, wider bandwidth, lower power consumption, and smaller size on integrated circuit chips. However, the resource occupancy and power consumption brought by the NoC itself must be considered. In order to reduce the share of limited resources occupied by the NoC and further decrease power consumption, various kinds of heterogeneous NoC topology have been designed [24][25][26] to suit the differing needs of different types of PEs for network transmission delay and bandwidth. Currently, most algorithms have not taken the effect of heterogeneous topology on system performance into consideration. If PEs of different types, under the premise of balanced power consumption, are mapped to suitable areas according to their performance requirements and the data transmission delay is minimized, the performance of the system can be greatly improved.
Based on the analysis above, the whole design process is divided into two stages. As shown in Figure 1, the first stage is task dividing and scheduling. Once the improved task model faithfully reflects the real intertask relations and the local optimum problem of the particle swarm algorithm is addressed, the optimized PSO algorithm is used to divide a big task into small tasks of proper granularity featuring high cohesion and low coupling according to traffic and calling relationship. There exists high parallelism among these small tasks. These small tasks are then assigned to the corresponding PEs according to the nature of each task, and a communication diagram is generated, so that the first step is achieved with low transfer cost as well as the least task execution time. The process then comes to the IP mapping stage. In this stage, by referring to the communication diagram and to the performance disparity and delay information of the NoC topology, the PEs are mapped onto the switching nodes of the NoC so as to achieve the least network transmission delay with low resource occupancy, balanced power consumption, and fewer resource fragments, so that the system performance does not fluctuate when new tasks need scheduling.
The rest of the paper is organized as follows. Section 3 gives the detailed description of task dividing and scheduling. Section 4 illustrates the process of IP mapping. Comparative experimental results are shown in Section 5. Section 6 concludes the paper.

Task Dividing and Scheduling
Although the types and quantities of PEs integrated in NoC-based heterogeneous multicore systems are expanding, the size of application tasks varies, and current task scheduling algorithms often assign and map a task in accordance with the number of usable PEs. For tasks of small size this may cause problems: on the one hand, as the task is divided into subtasks of extremely small size, communication among subtasks becomes overly frequent, which may prolong the task execution time; on the other hand, inadequate utilization of the performance of the PEs may increase system power consumption and reduce overall system efficiency.
This paper superimposes tasks on a PE until the computing resource of the PE is occupied at an appropriate ratio (the setting is based on the performance requirements of the system as well as of the PEs), and only then are new PEs added. The approach ensures not only that tasks are divided into subtasks of appropriate size but also that every PE invoked is efficiently used, thus improving overall performance.
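The idea of stacking tasks on a PE up to a target occupancy ratio can be illustrated with a minimal sketch. The function name `pack_tasks`, the notion of a scalar task "load", and the parameters `capacity` and `target_ratio` are assumptions made for illustration only; the paper does not specify how the ratio is set or how load is measured.

```python
def pack_tasks(loads, capacity, target_ratio):
    """Stack tasks on the current PE until the next task would exceed
    target_ratio * capacity, then open a new PE (hypothetical helper)."""
    limit = capacity * target_ratio
    pes = [[]]          # each inner list holds the task indices of one PE
    used = 0.0          # load already placed on the current PE
    for task_id, load in enumerate(loads):
        # Open a new PE only if the current one already holds a task
        # and adding this task would pass the occupancy limit.
        if pes[-1] and used + load > limit:
            pes.append([])
            used = 0.0
        pes[-1].append(task_id)
        used += load
    return pes


# Four tasks of load 3 on PEs of capacity 10 with an 80% target ratio.
print(pack_tasks([3, 3, 3, 3], capacity=10, target_ratio=0.8))
```

In this sketch a task larger than the limit still receives a PE of its own, so no task is dropped; a real design would also have to respect PE types and task dependencies.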

Task Model.
A task can be divided into $n$ subtasks among which there exists a certain execution sequence or control logic, and these subtasks are processed by $m$ PEs ($k$ types, $k \le m$). Assuming that the processing time of the $k$ types of PEs for every subtask, the communication overhead among PEs, and the amount of data transmitted among interdependent subtasks are known, the task on a heterogeneous multicore system can be abstracted into a quintuple ($V$, $E$, Type, PCU, $C$): (1) $V$: the task node-set in the DAG application; that is, the vertex $v_i \in V$ represents a subtask.
The target of task dividing and scheduling is to find, while meeting the task processing sequence and the resource limitations, a proper assignment and scheduling strategy which assigns the $n$ subtasks to a proper number of PEs and schedules the execution order of every subtask in a reasonable manner, thus achieving the minimum completion time of the overall task with every subtask respecting the dependency graph. Based on the task model, an improved particle swarm algorithm is used to conduct the computation.

Coding and Decoding.
The resource occupation of every subtask is encoded by indirect encoding. The encoding length depends on the number of subtasks. Every particle corresponds to a certain task assignment strategy.
Assume there exist $n$ subtasks, numbered sequentially, in a task and $m$ PEs available which are classified into $k$ types. For example, when $n = 10$ and $k = 3$, particle (3, 2, 1, 1, 3, 2, 1, 2, 3, 3) is a feasible scheduling scheme: the $j$th element gives the PE type to which subtask $j$ is assigned. The particle is encoded as shown in Table 1, and, as shown in Table 2, decoding the particle yields the subtasks assigned to every type of PE. Then, as shown in Table 3, after the subtasks are assigned, a reasonable number of PEs is allocated to every PE type in accordance with the processing ability and the total amount of tasks to be processed.
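The decoding of the example particle can be reproduced with a short sketch. The function name `decode` is hypothetical; the only assumption is the reading used above, namely that the $j$th element of the particle is the PE type of subtask $j$, with subtasks numbered from 1.

```python
from collections import defaultdict


def decode(particle):
    """Group subtask numbers (1-based) by the PE type stored in the particle."""
    groups = defaultdict(list)
    for subtask, pe_type in enumerate(particle, start=1):
        groups[pe_type].append(subtask)
    # Return a plain dict ordered by PE type for readable output.
    return dict(sorted(groups.items()))


particle = (3, 2, 1, 1, 3, 2, 1, 2, 3, 3)
print(decode(particle))
# {1: [3, 4, 7], 2: [2, 6, 8], 3: [1, 5, 9, 10]}
```

Each key is a PE type and each value is the list of subtasks assigned to that type, which is the information the decoding step is meant to provide.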
It follows from the task model that the running time of every subtask on each type of PE is already known. Let $t_{i,j}$ denote the running time of subtask $i$ on the $j$th type of PE and let $n_j$ denote the number of subtasks assigned to the $j$th type of PE. From these quantities the execution time of the entire task and the overall operation cost are obtained. Assuming that the task set assigned to the $i$th type of PE is $S_i$ and the task set assigned to the $j$th type of PE is $S_j$, the transfer cost between PE type $i$ and PE type $j$ is defined from the communication overheads between the subtasks of the two sets, and the overall transfer cost is obtained from these pairwise costs. The fitness function of time is defined in terms of TFT$_i$, the overall completion time of the $i$th particle; a fitness function of cost is defined from the operation and transfer costs, and the overall fitness function combines the two. The algorithm selects particles with higher fitness values so that they provide a good basis for generating the particles of the next generation.
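To make the role of these quantities concrete, the following sketch evaluates one particle. It is not the paper's fitness function, whose exact form is not recoverable here. The assumptions are: the completion time is approximated by the busiest PE type (subtasks of one type run serially, precedence is ignored), the operation cost is running time multiplied by PCU, a transfer cost is charged only when the two subtasks sit on different PE types, and the two terms are combined with a hypothetical weight `alpha`.

```python
import math


def evaluate(particle, run_time, pcu, transfer, alpha=0.5):
    """Return an illustrative fitness value (higher is better).

    particle  -- PE type (1-based) of each subtask
    run_time  -- run_time[i][j]: time of subtask i on PE type j+1 (inf if not allowed)
    pcu       -- pcu[j]: running cost per unit time of PE type j+1
    transfer  -- {(a, b): cost} for each directed edge between subtasks a and b
    alpha     -- assumed weight between the time term and the cost term
    """
    busy = [0.0] * len(pcu)      # accumulated running time per PE type
    op_cost = 0.0
    for i, pe_type in enumerate(particle):
        j = pe_type - 1
        duration = run_time[i][j]
        if math.isinf(duration):
            return 0.0           # subtask cannot run on this PE type
        busy[j] += duration
        op_cost += duration * pcu[j]
    # Transfer cost is zero when both subtasks share a PE type (simplification).
    trans_cost = sum(c for (a, b), c in transfer.items()
                     if particle[a] != particle[b])
    tft = max(busy)              # assumed completion-time estimate
    total_cost = op_cost + trans_cost
    return alpha / tft + (1.0 - alpha) / total_cost
```

Under these assumptions a particle that finishes sooner or costs less receives a higher value, which matches the selection rule stated above; any faithful implementation would replace the time estimate with a schedule that respects the DAG.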

Position and Velocity Updating.
In every iteration, each particle updates its velocity and position by (10) in accordance with its own best historical position and the best position of the population. The historical best position is replaced by the current position only when the current position has a better fitness value. Here $p_{\mathrm{best}_i}$ is the best position experienced by the $i$th particle, $g_{\mathrm{best}}$ is the best position experienced by all particles in the population, and the inertia weight $w$ is significant for balancing the algorithm's capability of global and local searching. The paper adopts a decreasing inertia weight, in which $w_{\mathrm{start}}$ and $w_{\mathrm{end}}$ represent, respectively, the initial inertia weight and the inertia weight when the maximum number of iterations Gen is reached, and $t$ is the current iteration. With this inertia weight, the algorithm has strong global search capability in the early stage of iteration and more accurate local search capability in the late stage.
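A minimal sketch of such an update is given below. It uses the standard PSO velocity rule and a linearly decreasing inertia weight, both of which are assumptions, since the paper's own equations are not reproduced here. The default values 0.9, 0.4, and 2.0 are conventional choices rather than the paper's settings, and rounding the position to an integer PE type in [1, k] is one possible way to keep particles valid.

```python
import random


def inertia_weight(t, gen, w_start=0.9, w_end=0.4):
    """Linearly decrease the inertia weight from w_start (t = 0) to w_end (t = gen)."""
    return w_start - (w_start - w_end) * t / gen


def update_particle(x, v, p_best, g_best, w, k, c1=2.0, c2=2.0, rng=random):
    """Apply one standard PSO step and clamp positions to PE types 1..k."""
    new_x, new_v = [], []
    for xj, vj, pj, gj in zip(x, v, p_best, g_best):
        r1, r2 = rng.random(), rng.random()
        # Inertia term + pull toward own best + pull toward swarm best.
        vel = w * vj + c1 * r1 * (pj - xj) + c2 * r2 * (gj - xj)
        # Assumed discretization: round, then clamp to a valid PE type.
        pos = min(k, max(1, round(xj + vel)))
        new_v.append(vel)
        new_x.append(pos)
    return new_x, new_v
```

A large weight early on lets the velocity term dominate, which favors exploration; as the weight shrinks, the pull toward the recorded best positions dominates, which favors local refinement.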

Flow of Algorithm
(1) Randomly initialize the position and velocity of the particle swarm based on the description in "Initialization and Fitness Function."
(2) Compute the velocity and position of every particle.
(3) Compute the fitness value of every particle and set $p_{\mathrm{best}_i}$ and $g_{\mathrm{best}}$.
(4) If $p_{\mathrm{best}_i}$ and $g_{\mathrm{best}}$ remain unchanged after many iterations or the algorithm has reached the maximum number of iterations, output the optimum solution, end the iteration, and go to step 6.
(5) Otherwise, return to step 2.
(6) Assign a reasonable number of PEs to every type of PE in accordance with the processing ability and the total amount of tasks to be processed.

IP Mapping
After task dividing and scheduling, the IP communication diagram is formed. In a NoC-based multicore system, the next need is to map these PEs onto NoC nodes in a reasonable way and to minimize the network transmission delay during task execution under the conditions that few resources are occupied and energy consumption is balanced. This is the problem of IP mapping. There are often two orientations in IP mapping: either to minimize the internal communication cost or to minimize the external communication cost [27,28]. Both orientations have their pros and cons. The former might lead to increased competition among external resources and add more computation overhead later in the mapping as the utilization of system resources increases. The latter tends to arrange surplus resources well and decreases competition for external resources with little change in computation overhead; however, as each local mapping area is incomplete, it produces only second-best mapping solutions, thus undermining the global mapping optimization. When designing an IP mapping algorithm, it is necessary to strike a careful balance between the two orientations.
In the meantime, as described above, PEs of different types have different requirements on NoC communication capability. In order to save on-chip resources and decrease system consumption, various heterogeneous network topologies have been designed. Therefore, during IP mapping, the matching between the communication requirements and the on-chip communication capability must be considered comprehensively.
The paper, based on the properties of the PEs to be mapped and the distribution of transmission capability over the topology, maps the PEs with high communication requirement to high-capability areas, balances the internal communication cost against the external one, and achieves on-chip communication with minimum transmission delay and low resource occupancy. The mapping algorithm consists of two parts: the expression of the network topology by a two-dimensional matrix and the IP mapping. They are detailed as follows.

IP Communication Diagram and NoC Topology
(2) $E$ represents the edge set of the communication diagram; that is, $e_{ij} \in E$ indicates that there exists data exchange between $p_i$ and $p_j$;
(3) $W$ represents the communication cost on the undirected edges, and $w_{ij}$ represents the total communication data between $p_i$ and $p_j$.
It is complicated to express a NoC topology directly, especially a three-dimensional NoC. A two-dimensional matrix, however, expresses the topology well, and many properties of matrices can also be applied to topology computation. Therefore, the paper expresses the topology by a two-dimensional matrix before IP mapping.
A three-dimensional mesh topology can be taken as an example. Figure 2(a) shows a 4 * 4 * 2 three-dimensional NoC topology; the red vertices represent bottom switching nodes and the black ones represent upper switching nodes. Figure 2(b) is its two-dimensional expansion diagram, which frees us from the complexity of studying the three-dimensional topology directly. For the convenience of expression and computation, the position of the nodes in the expansion diagram is expressed by a matrix. The position of the nodes in Figure 2(b) can be seen in Figure 2(c). There may exist areas whose communication transmission capability is higher than that of others in order to fulfill the higher communication requirement of some PEs; as shown in Figure 2(c), the green areas represent areas in which there exist switching nodes with higher communication performance. For the integrity of the matrix expression, areas without switching nodes are shaded; in the later computation, nodes in these areas are assumed to be already assigned. Through the approach above, a one-to-one correspondence is formed between the position of every node in the three-dimensional NoC topology and that of every element in the matrix. IP mapping conducts its optimization on the basis of this matrix.

IP Mapping
Definition 3. Communication cost in mapped area, Com cost: the total communication traffic between every pair of mapped PEs, weighted by the Manhattan Distance between their mapped positions.
(1) Start the mapping computation from collection $h_1$: on the topology, choose the area with high communication capability which can contain the minimum set of PEs with high communication requirement in $h_1$ as the beginning area of mapping. Name the set of mapped PEs the assigned area and name the area of occupied switching nodes on the topology the mapped area.
(3) Choose the node which has the maximum communication data with the assigned area as the next PE to be mapped.
(5) Repeat step 3 and step 4 until all PEs are mapped, and then start the algorithm on the next PE diagram to be mapped.
Figure 3 is a simple description of the mapping process. In the IP communication diagram, the red PEs represent PEs with high communication requirement and the blue area represents the assigned area; in the topology, the green area represents the area of switching nodes with high communication capability and the area encircled by the red line represents the mapped area.
The mapping algorithm places PEs with a direct communication relationship on neighboring nodes, keeping the path between source node and destination node as short as possible and free of conflicts with other transmission paths, thus minimizing the delay in the whole mapping area.
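The greedy core of the mapping steps can be sketched as follows. This is an illustrative simplification, not the paper's algorithm: the starting PE and node (`seed_pe`, `seed_node`) are supplied by the caller instead of being derived from the high-capability area, "distance to the mapped area" is assumed to mean the distance to the nearest occupied node, and the tie-break on node degree is omitted so that only the Euclidean tie-break remains. All names are hypothetical, and the sketch assumes there are at least as many free nodes as PEs.

```python
import math


def greedy_map(pes, traffic, free_nodes, seed_pe, seed_node):
    """Map PEs to (row, col) nodes one at a time, heaviest communicator first."""
    def weight(a, b):
        # The communication diagram is undirected, so look up both orders.
        return traffic.get((a, b), 0) + traffic.get((b, a), 0)

    placement = {seed_pe: seed_node}
    free = set(free_nodes) - {seed_node}
    todo = [p for p in pes if p != seed_pe]
    while todo:
        # Next PE: maximum total traffic with the already assigned PEs.
        nxt = max(todo, key=lambda p: sum(weight(p, q) for q in placement))
        used = list(placement.values())
        cx = sum(r for r, _ in used) / len(used)
        cy = sum(c for _, c in used) / len(used)

        def rank(node):
            # Primary key: Manhattan distance to the nearest mapped node.
            md = min(abs(node[0] - r) + abs(node[1] - c) for r, c in used)
            # Secondary key: Euclidean distance to the center of the mapped area.
            ed = math.hypot(node[0] - cx, node[1] - cy)
            return (md, ed, node)   # node itself makes ties deterministic

        best = min(free, key=rank)
        placement[nxt] = best
        free.remove(best)
        todo.remove(nxt)
    return placement


grid = [(r, c) for r in range(2) for c in range(2)]
links = {("A", "B"): 10, ("B", "C"): 5, ("A", "C"): 1}
print(greedy_map(["A", "B", "C"], links, grid, seed_pe="A", seed_node=(0, 0)))
```

In the small example, the PE that exchanges the most data with the seed is placed directly next to it, which is the behavior the paragraph above describes for directly communicating PEs.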

Experiment and Simulation
The performance of the designed algorithm is compared and evaluated from two aspects. The first is the running speed of the task dividing and scheduling algorithm itself. By computing tasks of the same size with GA, ACO, PSO, and the algorithm in this paper, respectively, and comparing the running times, we can assess the efficiency of the algorithm. This part is conducted in Matlab with 200 iterations; the comparison of the time required for running the algorithms is shown in Figure 4. The other is the comparison of the actual mapping effect (Figure 5). By running the scheduling results of the above algorithms in a NoC simulation environment and computing the delay and power consumption of the system, respectively, we can show the superiority of the algorithm of this paper in scheduling.

Conclusion
In this paper, the task scheduling model is further improved, and the operating cost per time unit is employed as a uniform measurement for PEs of different types, which simplifies the algorithm; task dividing and scheduling and IP mapping are handled separately so that the resulting schedule is more efficient and realistic. The scheduling target considers not only the total time spent but also the time cost and resource cost during task running so as to achieve comprehensive optimization of system performance.

Figure 1: Two stages of task scheduling and IP mapping.
Diagram and NoC Topology. The communication diagram can be abstracted into a triple CDAG = ($P$, $E$, $W$), where (1) $P$ represents the set of PEs in the communication diagram; that is, $p_i \in P$ is a PE with an execution task.
The communication cost in the mapped area is
$$\mathrm{Com\ cost} = \sum_{e_{ij} \in E} w_{ij} \times \mathrm{MD}(\mathrm{map}(p_i), \mathrm{map}(p_j)),$$
in which $w_{ij}$ represents the total communication traffic between $p_i$ and $p_j$ in the communication diagram and $\mathrm{MD}(\mathrm{map}(p_i), \mathrm{map}(p_j))$ represents the Manhattan Distance between the mapped positions of $p_i$ and $p_j$ on the topology. The target of the algorithm is to map PEs with high communication requirement to the topology area with high communication capability and to find the mapping scheme which has the minimum Com cost. The algorithm divides the communication diagram into collections $H$ and $L$ according to whether or not the included PEs need to be mapped in an area with high capability. In the collection $H = \{h_1, h_2, \ldots, h_s\}$ with high communication requirement, the sequence is $|h_1| \ge |h_2| \ge \cdots \ge |h_s|$ according to the number of PEs with high communication requirement; in the collection $L = \{l_1, l_2, \ldots, l_r\}$ without high communication requirement, the sequence is $|l_1| \ge |l_2| \ge \cdots \ge |l_r|$ according to the number of PEs contained. The execution steps of the mapping algorithm are as follows.

Figure 2: Topology and its expression by matrix.
The number of subtasks in the DAG application is $n$.
(2) $E$: the edge set in the DAG application; that is, $e_{i,j} \in E$ means that there exists data communication between $v_i$ and $v_j$; the direction of the arrow indicates the direction of data transmission.
(3) Type($V$): the type of the task. For instance, we can use 1, 2, 3, . . . to represent different computing types. In addition, the type-set of tasks corresponds with that of PEs, which means that a task can only be scheduled to a PE matching its type. This can be expressed by the matrix $T = [t_{i,j}]$, where the rows represent the tasks, the columns represent the PEs, $t_{i,j} = \infty$ means that task $v_i$ cannot be executed on PE $j$, and $t_{i,j} = a$ means that task $v_i$ can be executed on PE $j$ with an execution time of $a$.
(4) PCU: the running cost of every type of PE per unit time, in which element PCU$_i$ ($1 \le i \le k$) represents the running cost of the $i$th type of PE per unit time.
(5) $C$: the collection of the communication overheads of the directed edges. $c_{i,j}$ represents the transfer cost between subtasks $v_i$ and $v_j$ over the directed edge $e_{i,j}$. When $v_i$ and $v_j$ are scheduled to the same PE, $c_{i,j}$ equals zero.

Table 1: Example of particle coding.
Initialization and Fitness Function. Assuming that the population size is $N$, the number of subtasks is $n$, and the number of PE types is $k$, the initialization of the population can be described as follows: among the $N$ randomly generated particles, the position of the $i$th particle is represented by the vector $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})$ ($1 \le i \le N$, $1 \le j \le n$), in which $x_{ij}$ ($1 \le x_{ij} \le k$) means that, in the $i$th particle, task $j$ is assigned to a PE of type $x_{ij}$ for operation; the velocity is represented by a vector of the same length.

Table 2: Example of decoding.
Mapping. Before introducing the concrete algorithm, three parameters are given as follows.
Definition 1. Manhattan Distance MD($a$, $b$): in a plane, the Manhattan Distance between point $a(x_1, y_1)$ and point $b(x_2, y_2)$ is defined as
$$\mathrm{MD}(a, b) = |x_1 - x_2| + |y_1 - y_2|.$$
Definition 2. Euclidean Distance ED($a$, $b$): in a plane, the Euclidean Distance between point $a(x_1, y_1)$ and point $b(x_2, y_2)$ is defined as
$$\mathrm{ED}(a, b) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}.$$
(4) Map the PE to the switching node which has the minimum Manhattan Distance to the mapped area. If more than one node meets the requirement, choose the node whose number of available neighboring nodes is nearest to the node degree of the PE; if there is still more than one candidate, choose the switching node which has the minimum Euclidean Distance from the center of the mapped area.
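The two distance definitions and a traffic-weighted communication cost can be written out in a few lines. The cost function below assumes that the communication cost of a mapping is the sum, over all communicating PE pairs, of traffic multiplied by the Manhattan Distance between the mapped positions; the names `com_cost`, `traffic`, and `placement` are hypothetical, and positions are taken to be (row, column) indices in the topology matrix.

```python
import math


def manhattan(a, b):
    """Manhattan Distance between two (row, col) positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def euclidean(a, b):
    """Euclidean Distance between two (row, col) positions."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def com_cost(traffic, placement):
    """Sum of traffic * Manhattan Distance over all communicating PE pairs."""
    return sum(w * manhattan(placement[a], placement[b])
               for (a, b), w in traffic.items())


traffic = {("A", "B"): 10, ("B", "C"): 5, ("A", "C"): 1}
placement = {"A": (0, 0), "B": (0, 1), "C": (1, 0)}
print(com_cost(traffic, placement))   # 10*1 + 5*2 + 1*1 = 21
```

Comparing this value across candidate placements gives a simple way to check whether one mapping keeps heavily communicating PEs closer together than another.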