Using Genetic Algorithms for Hardware Core Placement and Mapping in NoC-Based Reconfigurable Systems

Mapping of cores has been an important activity in NoC-based system design, aimed at finding the best topological location of each core on the NoC such that the metrics of interest are optimized. In recent years, partially reconfigurable systems (PRSs) have included networks-on-chip (NoCs) as their communication structure, adding complexity to the mapping problem. Several works have proposed specific and robust NoC architectures for PRSs, forming indirect and irregular networks, in which case the mapping and placement problems must be treated together. Placement deals with the physical positioning of those cores inside the reconfigurable device. To the best of our knowledge, the mapping-placement problem for those kinds of architectures has not yet been addressed. In this work, a formalization of the design-time hardware core placement and mapping problem in PRS-NoCs is proposed, and methodologies for solving it with genetic algorithms (GAs) are presented. Several GA crossovers and methodologies are compared to obtain the best solution. Results show that the best GA solution obtained, on average, communication costs with a 4% penalty when compared with the global minimum cost, obtained with a semiexhaustive approach. In addition, the algorithm presents low execution times.


Introduction
Run-time reconfiguration (RTR) FPGAs, also known as dynamically reconfigurable FPGAs (DRFPGAs), have been accepted as an important potential alternative for lowering the costs of digital circuits, especially regarding the flexibility of rapidly changing the functions being performed and the reduction of area consumption. However, they add new dimensions to the system-on-chip (SoC) design space, due to the different possibilities of physical, temporal, and functional partitioning of the original application.
Considering the generalized use of robust communication resources in current complex SoCs, structured communication means, such as networks-on-chip (NoCs), have been included in partially reconfigurable systems, generating PRSs based on NoCs, or PRS-NoCs, under different architectures [1][2][3][4]. Besides having the high scalability and modularity provided by NoCs, these PRSs can free the designer from the details of minimizing data retention and of signal management, allowing him/her to focus on wrappers and computing logic and reducing the design effort.
The NoC-based PRS approach has not yet been widely adopted, partially due to the lack of established tools in the design cycle for process partitioning and mapping or for physical placement, which are recognized as hard problems even for nonreconfigurable NoC-based systems [5]. Although methodologies and tools have been proposed [6][7][8] to deal with the increased design complexity of this class of circuits, solutions to the associated problems are still very ad hoc.
Mapping is the step in the SoC design flow where individual system processes are optimally assigned to NoC entry points (routers), through the reduction of traffic and consumed energy in the network, or of other metrics defined by the designer. At a later point in the design flow, each process will be implemented as a hardware core, a memory, or a programmable processor running a piece of software. Different mapping schemes have been presented for fixed, nonreconfigurable systems [5,9], consisting basically of different algorithmic approaches to different cost-function optimization problems. Mapping in MPSoC environments has also been considered [10], but there it is treated as an operating system scheduling problem, with no regard to hardware cost criteria.
International Journal of Reconfigurable Computing
The reconfiguration capability adds a new dimension to the mapping problem, since different cores are assigned to the same router but are present in the logic fabric at separate moments. There will exist several different configurations in time, or contexts, each one showing a particular traffic profile. An optimized profile can be obtained for each context through the mapping methods for nonreconfigurable systems, but the challenge is to consider all contexts in an interdependent form [11].
The placement problem deals with the allocation of resources (cores) inside the reconfigurable device; that is, given an assigned area, a set of cores must be placed in that area in such a way that they do not overlap each other and do not exceed the space bounds. Traditionally, the placement problem is targeted at a regular NoC grid structure and performed after the core mapping. Some works have treated placement as an execution-time problem. One of the first works, by Ahmadinia et al. [12], proposes an algorithm that finds the nearest possible position for an incoming core relative to the already placed cores. In [13], a complete real-time operating system was developed for task scheduling and placement in FPGAs. Although those approaches treat the placement problem, they cannot be applied to PRS-NoC architectures.
Some works on PRS-NoCs [1][2][3][4] have introduced advanced architectures for PRSs, in which the physical area assigned to the routers is also reconfigurable. The router area may potentially be occupied by reconfigurable cores of varied sizes. The problem of placement and mapping for these architectures is extremely complex: it is a combination of two NP-hard problems, with an explosive number of subcases to be treated. To the best of our knowledge, this problem has not yet been addressed, with the exception of [14,15]. In [14], the authors treated the mapping-placement problem with a focus on a regular and direct NoC architecture, where each node was composed of a reconfigurable slot in which tasks can be allocated and placed as cores. In [15], a smart-exhaustive approach is presented for the mapping-placement problem for irregular and indirect reconfigurable NoCs; however, since the algorithm seeks the global minimum, it was not able to solve the problem for applications with more than 15 cores.
In this work, solutions based on genetic algorithms are presented for design-time hardware core placement and/or mapping in PRS-NoCs with irregular and indirect NoC topologies and heterogeneous cores. The formalization of the problem is provided, and its solution is obtained through genetic algorithms (GAs). To find the best solution, several GA crossovers and population diversification techniques are applied and compared. The best GA solution is compared with the global minimum cost solution, obtained with the semiexhaustive algorithm described in [15]. Results indicate that GAs are a good alternative for solving the mapping-placement problem; the best GA solution obtained, on average, communication costs with a 4% penalty when compared with the global minima. Besides that, GAs applied to this problem presented very low execution times, even for applications with 26 cores and 5 reconfigurable scenarios.
The rest of the paper is organized as follows: in Section 2 some related works are presented and a general overview of complex PRS-NoC architectures is shown in Section 3. The problem formulation is presented in Section 4 and the use of genetic algorithms for its solution is described in Section 5. Section 6 presents the experimental results and, finally, Section 7 shows the conclusions of this work.

Related Work
There are a few works on placement tools for PRSs described in the literature. One of the first works that considered the placement problem in SDRs [12] was for FPGAs with real-time reconfiguration, where the starting point is a scheme of scheduled and partitioned processes. The placement system is composed of three parts: a database of modules, a placement request manager, and the placer. When a new module is about to operate, a request is made to the manager, and the algorithm analyses the occupied space to determine where to place the new module. The placement criterion is the optimization of communication between the modules. The work was improved in [16] with the inclusion of computational geometry techniques for the control of empty spaces.
In [7], a design-time placement method for PRS-NoCs in FPGAs was proposed, considering heterogeneous processing elements (PEs). The placement algorithm, called Dyno-Place, takes into account real aspects of FPGA families, making it possible to automate this process in the design methodology. The placement was organized in five steps: in the first one, the dimensions of the PEs are processed and the modules are ordered; in the second one, a list scheduling algorithm is used for placing modules with the same height sequentially in rows, from bottom to top and from left to right; the procedure places the larger modules first and the smaller ones later. The last three steps make the necessary adaptations for the specific DRFPGA architecture. This work does not treat the mapping problem and does not focus on NoC regularity, allocating PEs at one side of the device while the routers are at the opposite side.
In [13], a reliable reconfigurable real-time operating system (R3TOS) was developed for reconfigurable systems without a NoC. The R3TOS executes scheduling and placement of tasks using several different metrics and methods. The placement algorithms focus on preserving the maximum empty rectangle (MER) for future use for as long as possible. The algorithm analyses the timeline to prevent future fragmentation of modules, keeping the modules packed and, consequently, achieving higher computation densities.
Although these works have made complex explorations of placement in PRSs, they cannot be applied in NoC-based environments (with the exception of [7]). With the exception of [13], the placement algorithms only consider the instantaneous scenario, ignoring the general scenario with all configurations.
The NoC mapping problem has been widely explored in the literature for nonreconfigurable architectures. Basically, the associated methodologies and algorithms have focused on minimizing different metrics, using different algorithmic solutions. The NMAP algorithm [9] minimizes the communication cost for single route multiple paths with an analytic approach; in [17], a study compared various algorithms for minimizing energy consumption: exhaustive search, simulated annealing, greedy incremental, and largest communication, among others. Several works also proposed a multiobjective approach, as in [18], where an immuneadaptive algorithm was presented for reducing both power consumption and latency. A list of other nonreconfigurable NoC mapping works can be found in [5].
The work of Beretta et al. [14] proposes a placement and mapping methodology for simple PRS-NoCs, with a direct and regular NoC topology. The space of the FPGA is divided into big reconfigurable slots connected to routers, in which modules are placed. The placement and mapping are proposed for both application design time and operating time and for multiple applications. In the preprocessing step, the solution is divided into a base mapping and specific configurations. The base mapping is composed of the modules that are used by most of the applications, and the specific configurations are a set of configurations that cannot be reused from the base mapping. First, a genetic algorithm is used for placing the modules of the base mapping. A second mapping is executed for placing the modules that belong to specific configurations. Other algorithms are proposed for real-time mapping: a greedy algorithm for configuration reuse or a SAT solver.
The only work to address the problem of mapping for simple PRS-NoC architectures is the one proposed by Beretta et al. [14]. However, it cannot be applied to complex PRS-NoC architectures like the ones proposed in [1][2][3][4], which are the focus of the present work.
The only work to consider placement and mapping for complex PRS-NoC architectures was described in [15], where a semiexhaustive algorithm was proposed for solving the problem. The algorithm was tested on three synthetic applications, placing permutations of each combination of cores sequentially. The algorithm prunes a permutation when it cannot be mapped because it exceeds the device space. The solution, however, could not treat applications with more than 15 cores due to the explosiveness of the problem.

Complex PRS-NoC Architectures
Complex PRS-NoC architectures are those with an irregular and indirect NoC topology and heterogeneous cores, like the ones proposed by Bobda et al. [1], Jovanovic et al. [3], and Killian et al. [4].
To characterize the complex PRS-NoC, the dynamic network-on-chip (DyNoC), proposed by Bobda et al. [1], is adopted as the base model. This architecture was chosen for its simplicity and because it uses traditional NoC blocks, such as the five-port router. The DyNoC is a mesh-based NoC topology where each router may be connected to a processing element (PE). The routers have five ports: east, west, north, south, and a local connection for a PE. The placement-mapping task is to allocate cores in the areas reserved for PEs.
An example of the architecture is illustrated in Figure 1, showing a mesh of routers (in orange) connected (or not) to local PEs, which can be of single (in blue) or large (other colors) size. In this example, four large reconfigurable cores (C1-C4) are allocated in the network. When a core is placed, it occupies the specific area reserved for PEs; for large cores, the routers inside the occupied area are deactivated, and the communication internal to the cores is made with local buses (represented by dotted lines). Each core communicates with other NoC elements through an associated router located at the upper-right corner of the placement location. Whenever large cores exist, the NoC becomes indirect and irregular: not every router has an associated core, as is the case, for instance, of the three routers at the bottom of the rightmost column in Figure 1.
The communication is made through packets, and the authors propose to use the surrounding XY (S-XY) routing algorithm for the dynamic architecture. Basically, the algorithm detects when a placed module is on the routing path and goes around it. The architecture provides an environment in which a packet always has a path from source to destination, since the placed cores are always surrounded by routers. This is guaranteed by positioning routers on all sides of the mesh, which inverts the relative positioning between cores and routers at the left and bottom borders, as in Figure 1. Always having a path between any two cores is especially interesting for PRSs, since the reconfigurable cores can be placed in any region.

Problem Formulation
Although we target DyNoC as the system model for the mapping and placement problem, in order to simplify the problem formulation and make it more general, all routers are considered to be placed at the upper-right corner of the PEs. Hereinafter, the point of connection for each potential router in the mesh is called an allocation slot.
A reconfigurable device entry space, $S$, has $n \times m$ allocation slots. The set $P$ indicates the set of positions in the device:
$$P = \{p_1, p_2, \ldots, p_{n \times m}\}. \tag{1}$$
A 7 × 7 entry space can be observed in Figure 2, where the lower-left corner has position $p_1 = (0, 0)$, the upper-left corner has position $p_{43} = (6, 0)$, and the upper-right corner has position $p_{49} = (6, 6)$.
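The three corner positions quoted above are consistent with a row-by-row numbering of the slots, starting at the lower-left corner. The following Python sketch illustrates that indexing; the row-major rule and the function names `slot_position` and `slot_index` are assumptions inferred from the three examples, not definitions taken from the paper.

```python
def slot_position(k: int, side: int = 7) -> tuple:
    """Coordinates of slot p_k (1-based) in a side x side entry space.

    Assumes slots are numbered row by row from the lower-left corner,
    which reproduces the three corner examples given in the text.
    """
    if not 1 <= k <= side * side:
        raise ValueError("slot index out of range")
    return divmod(k - 1, side)


def slot_index(position: tuple, side: int = 7) -> int:
    """Inverse of slot_position: 1-based index of a coordinate pair."""
    first, second = position
    return first * side + second + 1
```

Under this assumption, `slot_position(43)` gives `(6, 0)` and `slot_position(49)` gives `(6, 6)`, matching the corners of the 7 × 7 example.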
A scenario is a configuration defined at a given moment. Each reconfiguration is a different scenario. Each scenario $s$ belongs to the set SC of all scenarios.
The classic problem formulations for NoC mapping are not adequate for describing the physical characteristics of the modules, which are necessary for solving the placement problem. Therefore, we propose for this situation an extended application characterization graph (EAPCG).
An EAPCG is given by $G(C, E)$, where each core $c_i \in C$ is a vertex with two weights $(w_i, h_i)$, which represent the width and height of the core in terms of number of slots. The set $C$ is subdivided into two groups: (1) fixed cores, present in all configuration scenarios, described as $\mathrm{Fc}_i \in \mathrm{FC}$, where FC is the set of all fixed cores; (2) reconfigurable cores, present in only one scenario $s$ but missing in the next scenario $s + 1$, described as $\mathrm{Rc}_i(s)$.
Each directed edge $e_{i,j} \in E$ represents the communication from $c_i$ to $c_j$. For each edge, the parameter $b(e_{i,j})$ represents the communication bandwidth between $c_i$ and $c_j$ (in Mbits/s).
The $G(C, E)$ graph is illustrated in Figure 3 through the Big Alpha application, which will be used later in Section 6. This application has thirteen cores: six fixed cores (Fc1-Fc6) and seven reconfigurable cores distributed in three scenarios (RC(1), RC(2), and RC(3)). The communication traffic $b(e_{i,j})$ between the cores is represented on the arcs, and the weights (width $w_i$, height $h_i$) are given in each vertex.
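The mapping function in PRS-NoCs can be defined as
$$\Omega : C \to P, \tag{2}$$
where each core $c_i$ is associated with one reference position $p_i = (x_i, y_i)$, with $\Omega(c_i) = (x_i, y_i)$. To take the core sizes into account, the following function is also defined:
$$R : C \to \mathcal{R}, \tag{3}$$
where $\mathcal{R}$ is the set of all rectangular regions defined by two positions $p_i$ and $p_j$ on opposite corners of a diagonal. $R(c_i) = r_i \in \mathcal{R}$ is a rectangle with origin in $(x_i - w_i + 1, y_i - h_i + 1)$ and end in $(x_i, y_i)$. Each scenario $s$, composed of $\mathrm{FC} \cup \mathrm{RC}(s)$, has a specific mapping $\Omega\Omega(s)$, which is the union of all mappings $R(c_i)$ with $c_i \in \mathrm{FC} \cup \mathrm{RC}(s)$. The placement of the fixed modules remains unchanged for all scenarios $s$. Each mapping has the following restrictions:
$$R(c_i) \cap R(c_j) = \emptyset, \quad \forall c_i, c_j \in \mathrm{FC} \cup \mathrm{RC}(s),\ i \neq j, \tag{4}$$
$$R(c_i) \subseteq S, \quad \forall c_i \in \mathrm{FC} \cup \mathrm{RC}(s). \tag{5}$$
Equation (4) guarantees that the cores will not overlap each other, whilst (5) assures that the placement will not exceed the device entry space.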
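The two placement restrictions (no overlap between cores, and no core outside the entry space) can be checked mechanically from the rectangle definition above. The sketch below is only an illustration: the `Placed` record, the function names, and the use of inclusive integer rectangles are assumptions made for the example.

```python
from itertools import combinations
from typing import NamedTuple


class Placed(NamedTuple):
    x: int  # reference position (end corner of the rectangle)
    y: int
    w: int  # width in slots
    h: int  # height in slots


def region(c: Placed) -> tuple:
    """Inclusive rectangle from (x - w + 1, y - h + 1) to (x, y)."""
    return (c.x - c.w + 1, c.y - c.h + 1, c.x, c.y)


def overlaps(a: Placed, b: Placed) -> bool:
    """True when the two rectangles share at least one slot."""
    ax0, ay0, ax1, ay1 = region(a)
    bx0, by0, bx1, by1 = region(b)
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1


def inside(c: Placed, n: int, m: int) -> bool:
    """True when the rectangle stays within an n x m entry space."""
    x0, y0, x1, y1 = region(c)
    return x0 >= 0 and y0 >= 0 and x1 < n and y1 < m


def valid_mapping(cores, n: int, m: int) -> bool:
    """Check the bounds restriction and the non-overlap restriction."""
    if not all(inside(c, n, m) for c in cores):
        return False
    return not any(overlaps(a, b) for a, b in combinations(cores, 2))
```

For instance, a 2 × 2 core with reference position (0, 0) would extend to negative coordinates and is rejected by `inside`.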
A device architecture graph is a specific graph for each scenario $s$, $\mathrm{Arch}(\Omega\Omega(s)) = (T, \mathrm{Ch})$, where each router $t_i \in T$ is a vertex and each directed edge $\mathrm{ch}_{i,j} \in \mathrm{Ch}$ represents the communication channel between $t_i$ and $t_j$. All routers are connected to each other in vertical and horizontal fashion in an $n \times m$ grid. For each edge, the parameter $\mathrm{bw}(\mathrm{ch}_{i,j})$ indicates the bandwidth capacity between $t_i$ and $t_j$. $\mathrm{Path}(a, b)$ is the set of channels that make up the path from the origin $a$ to the destination $b$, and $\mathrm{hops}(a, b)$ is the number of routers through which the data must pass from $a$ to $b$. Routers can be activated and deactivated, under the following definition: a router $t_i$ is deactivated if, for any router $t_j$ adjacent to $t_i$,
$$\mathrm{bw}(\mathrm{ch}_{i,j}) = 0 \ \wedge\ \mathrm{bw}(\mathrm{ch}_{j,i}) = 0. \tag{6}$$
Equation (6) indicates that a deactivated router $t_i$ does not have connections with any other router. If a router has at least one connection with another router, it is activated. Derived from (2), the core mapping in each scenario $s$ is given as
$$\Omega\Omega(s) = \bigcup_{c_i \in \mathrm{FC} \cup \mathrm{RC}(s)} R(c_i). \tag{7}$$
The existence of large cores in the mapping process implies that the network topology may become indirect. This issue is modeled as follows.
Alongside the slot concept, the routers themselves are positioned in the space $S$. Every core $c_i$ communicates with other cores through an associated router $t(c_i)$ located at the same position as the placement:
$$t(c_i) = \Omega(c_i). \tag{8}$$
An occupied region $\mathrm{Ro}(c_i) \subseteq R(c_i)$ is a region that does not allow active routers. This region exists for large cores, when both height and width are greater than one, and it is delimited by a rectangle with origin in $(x_i - w_i + 1, y_i - h_i + 1)$ and end in $(x_i - 1, y_i - 1)$. Figure 4 illustrates the concept of a region and an occupied region.
The routers inside an occupied region are deactivated whilst other routers are activated.
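As an illustration of the occupied-region rule, the following Python sketch lists the router positions deactivated by a set of placed cores. The tuple layout `(x, y, w, h)` and the function names are assumptions made for this example.

```python
def occupied_region(x: int, y: int, w: int, h: int) -> set:
    """Router positions inside the occupied region of one core.

    The region spans (x - w + 1, y - h + 1) to (x - 1, y - 1), inclusive,
    and is empty unless both width and height are greater than one.
    """
    if w <= 1 or h <= 1:
        return set()
    return {(i, j) for i in range(x - w + 1, x) for j in range(y - h + 1, y)}


def active_routers(n: int, m: int, cores) -> set:
    """All routers of an n x m mesh except those in occupied regions.

    `cores` is an iterable of (x, y, w, h) tuples (a hypothetical layout).
    """
    deactivated = set()
    for core in cores:
        deactivated |= occupied_region(*core)
    return {(i, j) for i in range(n) for j in range(m)} - deactivated
```

A 3 × 2 core with reference position (3, 3) deactivates the two routers at (1, 2) and (2, 2), while a core of width or height one deactivates none.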
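For each scenario $s$, the communication between the modules cannot exceed the bandwidth capacity of the channels:
$$\sum_{\substack{e_{i,j} \in E \\ c_i, c_j \in \mathrm{FC} \cup \mathrm{RC}(s)}} b(e_{i,j}) \times f(\mathrm{ch}_{k,l}, e_{i,j}) \le \mathrm{bw}(\mathrm{ch}_{k,l}), \quad \forall\, \mathrm{ch}_{k,l} \in \mathrm{Ch}, \tag{9}$$
where
$$f(\mathrm{ch}_{k,l}, e_{i,j}) = \begin{cases} 1, & \text{if } \mathrm{ch}_{k,l} \in \mathrm{Path}(t(c_i), t(c_j)), \\ 0, & \text{otherwise}, \end{cases} \tag{10}$$
and $\mathrm{Path}(t(c_i), t(c_j))$ indicates the sequence of channels from the router associated with $c_i$ to the router associated with $c_j$ with the S-XY routing algorithm applied. The mapping objective is to place the cores so as to minimize the communication cost in every configuration scenario, as described by
$$\mathrm{cost}(s) = \sum_{\substack{e_{i,j} \in E \\ c_i, c_j \in \mathrm{FC} \cup \mathrm{RC}(s)}} b(e_{i,j}) \times \mathrm{hops}(t(c_i), t(c_j)), \tag{11}$$
$$\min \sum_{s \in \mathrm{SC}} \mathrm{cost}(s). \tag{12}$$
The objective function is simple; however, when the number of hops is minimized, several other metrics are also optimized, for example, latency, communication delay, and energy consumption. This kind of objective function is widely used in the literature, including [5,9,14,17]. The minimum cost problem is illustrated in Figure 5 for the Big Alpha application of Figure 3. The routers associated with the cores are the black ones. It may be noticed that cores with critical communication, like Fc1 and Fc4, with 50 MB/s, are close to each other. This can also be observed when the core Rc2(2) is placed in the second scenario: it is only two hops away from Fc4, since they send and receive 18 MB/s to each other. Similar communication treatment for other pairs of cores may be observed in the mapping results.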
Here the costs for the first, second, and third scenario are, respectively, 454, 449, and 452, with a total cost of 1355.
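The cost of a scenario is a bandwidth-weighted sum of hop counts, and the total is the sum over scenarios. The Python sketch below shows this calculation under a simplifying assumption: hops are approximated by the Manhattan distance between associated routers, which ignores the detours that S-XY routing may take around large cores. The function names and the toy positions in the usage note are hypothetical.

```python
def hops(a: tuple, b: tuple) -> int:
    """Manhattan distance between two router positions (an approximation)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def scenario_cost(edges: dict, router_of: dict) -> int:
    """Sum of bandwidth x hops over the edges present in one scenario.

    `edges` maps (source, destination) core names to a bandwidth value;
    `router_of` maps a core name to its associated router position.
    """
    return sum(
        bandwidth * hops(router_of[src], router_of[dst])
        for (src, dst), bandwidth in edges.items()
    )


def total_cost(scenario_costs) -> int:
    """Total communication cost: the sum of the per-scenario costs."""
    return sum(scenario_costs)
```

With the per-scenario values reported above, `total_cost([454, 449, 452])` gives 1355. As a toy case with made-up positions, two edges of 50 and 18 units spanning one and two hops, respectively, contribute 50 + 36 = 86.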

Using Genetic Algorithms
In this section, we present the algorithmic solution to be applied at design time for a PRS-NoC in order to obtain the optimized placement and mapping. For solving the problem, we have selected genetic algorithms (GAs). GAs have been successfully used as a solution for a large set of minimization problems, including mapping problems [5,14], and they are able to deal with hard constraints such as the ones previously described. In addition, GAs greatly reduce the execution time when searching for a good solution, being adequate for the mapping and placement problem, which is known to be explosive [5]. The use of GAs for solving the problem formulated in Section 4 is described in detail below.

GA Coding.
For the chromosome coding, we have developed a simplified representation for the placement and mapping process. Since dispersed cores require more routers to communicate, we start from the assumption that packed cores generally lead to a better communication cost. Therefore, we do not consider empty spaces between cores, which is reflected in the chromosome; this makes the computation simpler and prevents fragmentation. Using a generalist approach and considering the empty slots would make the problem complexity proportional to the factorial of the number of slots in the architecture, while the simpler representation makes the problem complexity proportional to the factorial of the number of modules. Each core represents a gene, and the set of cores, ordered according to the order of the scenarios, represents a chromosome.
Figure 6 shows the coding for an application with thirteen cores and three scenarios. The first eight elements of the chromosome represent the six fixed cores plus the two reconfigurable cores that compose reconfigurable scenario 1. The ninth and tenth elements represent the reconfigurable cores of scenario 2, and the last three elements represent the reconfigurable cores of the last scenario; it is implicit that the fixed cores appear in all three scenarios. The position of a core in the chromosome represents the physical position of the module, which can be seen in Figure 5. Cores are placed according to two priorities: (1) from top to bottom; (2) from left to right. The placement-mapping for the first scenario described in Figure 6 is shown in Figure 5(a). In the second scenario, the reconfigurable cores of the first scenario are unplaced and Rc2(2) and Rc3(2) are placed instead, using the top-bottom and left-right priority, as shown in Figure 5(b). Figure 5(c) shows the same process for the last three elements of the chromosome.
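The partition of the chromosome into scenario parts can be sketched in Python as follows. The gene labels used in the usage note, the convention that fixed cores are recognized by the prefix "Fc", and both function names are assumptions made for illustration; the sketch only recovers which cores belong to each scenario and does not reproduce the physical decoding of Figure 5.

```python
def split_chromosome(chromosome, part_sizes):
    """Split a chromosome into one part per scenario.

    For the thirteen-core example, part_sizes would be [8, 2, 3].
    """
    if sum(part_sizes) != len(chromosome):
        raise ValueError("part sizes must add up to the chromosome length")
    parts, start = [], 0
    for size in part_sizes:
        parts.append(list(chromosome[start:start + size]))
        start += size
    return parts


def scenario_cores(parts, s):
    """Cores present in scenario s (1-based).

    The first part holds the fixed cores plus the reconfigurable cores of
    scenario 1; later scenarios reuse the fixed cores, kept in the order in
    which they appear in the first part.
    """
    if s == 1:
        return list(parts[0])
    fixed = [gene for gene in parts[0] if gene.startswith("Fc")]
    return fixed + list(parts[s - 1])
```

For a chromosome whose first part interleaves Fc1 to Fc6 with two scenario-1 cores, `scenario_cores(parts, 2)` returns the six fixed cores followed by the two genes of the second part.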

General Structure.
According to the GA technique, each solution of the problem can be coded as a chromosome. A chromosome is a set of genes, and a set of chromosomes composes a population. A classical GA follows four main steps: (1) generation of a random population; (2) selection of the fittest individuals; (3) crossover of selected individuals; (4) mutation. Steps (2), (3), and (4) are repeated for the refinement of the new population (offspring), and the previous population is substituted (killed) [19].
The implemented algorithm is described in pseudocode in Algorithm 1. The algorithm will be tested later with several types of crossover and with population diversification techniques; however, this general structure will be the reference.
Initially, a set of random solutions is generated (line 2). The "population size" (PZ) parameter defines the number of chromosomes in each population. The fitness function is defined as the inverse of the communication cost, represented by (12).
The main algorithm is composed of two loops: the first one (starting at line 4) refers to the refinement of the population and is limited by the number of generations (NG). In each generation, a second loop (starting at line 5) performs the selection, crossover, and mutation for all elements of the population.
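The two nested loops described above can be outlined as a generic skeleton in Python. This is a sketch of the general structure only, not a transcription of Algorithm 1: the function `run_ga` and its callback parameters are hypothetical names, and the skeleton keeps the best individual seen over all generations.

```python
def run_ga(random_individual, fitness, select, crossover, mutate, pz, ng):
    """Generic GA loop: pz individuals per population, ng generations.

    random_individual() -> chromosome
    fitness(chromosome) -> non-negative number (higher is better)
    select(population, scores) -> one chromosome
    crossover(parent1, parent2) -> iterable of child chromosomes
    mutate(chromosome) -> chromosome
    """
    population = [random_individual() for _ in range(pz)]
    best = max(population, key=fitness)
    for _ in range(ng):  # outer loop: one iteration per generation
        scores = [fitness(individual) for individual in population]
        offspring = []
        while len(offspring) < pz:  # inner loop: build the offspring
            parent1 = select(population, scores)
            parent2 = select(population, scores)
            for child in crossover(parent1, parent2):
                offspring.append(mutate(child))
        population = offspring[:pz]  # the previous population is replaced
        generation_best = max(population, key=fitness)
        if fitness(generation_best) > fitness(best):
            best = generation_best
    return best
```

Passing the operators as callbacks keeps the loop independent of the crossover type and of any diversification technique plugged in later.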
The selection step is performed by the roulette wheel technique [19], in which the probability $P$ of a chromosome being selected is proportional to its fitness value (13). In line 6 the roulette process is used to select chromosomes $cr_{sel1}$ and $cr_{sel2}$. Consider
$$P(cr_i) = \frac{\mathrm{fitness}(cr_i)}{\sum_{j=1}^{\mathrm{PZ}} \mathrm{fitness}(cr_j)}. \tag{13}$$
At the end of each iteration, the previous population is killed and the offspring becomes the new population. Line 27 saves the best solution of the current generation.
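Roulette wheel selection can be sketched in Python as below. The function names are illustrative, and the fallback to a uniform choice when every fitness is zero is an assumption added so that the sketch never divides by zero.

```python
import random


def selection_probabilities(fitness_values):
    """Probability of each chromosome: its fitness over the total fitness."""
    total = sum(fitness_values)
    if total == 0:
        return [1.0 / len(fitness_values)] * len(fitness_values)
    return [value / total for value in fitness_values]


def roulette_select(population, fitness_values, rng=random):
    """Pick one chromosome with probability proportional to its fitness."""
    total = sum(fitness_values)
    if total <= 0:
        return rng.choice(population)  # assumption: uniform fallback
    pick = rng.random() * total  # a point on the wheel, in [0, total)
    accumulated = 0.0
    for individual, value in zip(population, fitness_values):
        accumulated += value
        if pick < accumulated:
            return individual
    return population[-1]  # guard against floating-point round-off
```

Because invalid mappings receive a fitness of zero, they occupy no area on the wheel and are never selected while at least one valid chromosome exists.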
The crossover is executed with a probability (crossover), typically between 0.5 and 1 [20]. The crossover operator is applied separately to each part of the chromosome that represents a scenario. The crossover between chromosomes $cr_{sel1}$ and $cr_{sel2}$ generates two new chromosomes, where $cr_{sel1}$ is the donor of genetic material for the first one and $cr_{sel2}$ is the donor for the second one.
After the crossover, the algorithm executes the mutation according to its probability (usually lower than 0.05 [20]). The mutation process is basically the swapping of two genes in each part of the chromosome. The mutation is useful to avoid focusing on a particular solution, maintaining a wider search space. In line 19, the feasibility of the mapping is verified, that is, whether it is valid and fits into the entry space with respect to the restriction described in (5). If it does, the fitness function is computed; otherwise, the fitness value is zero. In line 26, the offspring becomes the current population.
Figure 6: Placement-mapping chromosome coding for GA.
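A minimal Python sketch of the swap mutation and of the fitness rule follows. Applying the mutation probability independently to each scenario part is an assumption (the text only states that two genes are swapped in each part), and the names `mutate` and `fitness_value` are hypothetical.

```python
import random


def mutate(chromosome, part_sizes, p_mutation, rng=random):
    """Swap two genes inside each scenario part with probability p_mutation."""
    child = list(chromosome)
    start = 0
    for size in part_sizes:
        if size >= 2 and rng.random() < p_mutation:
            i, j = rng.sample(range(start, start + size), 2)
            child[i], child[j] = child[j], child[i]
        start += size
    return child


def fitness_value(cost, feasible):
    """Inverse of the communication cost, or zero for an invalid mapping."""
    if not feasible or cost <= 0:
        return 0.0
    return 1.0 / cost
```

Restricting each swap to a single part keeps every gene in the scenario it belongs to, so a mutated chromosome still encodes the same set of cores per scenario.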

Crossovers.
Normally, crossover operators are applied over the whole chromosome. However, since the representation adopted in this work, as in Figure 6, subdivides the chromosome into parts corresponding to each configuration scenario, an adaptation was performed: the crossover is executed separately in each part of the same chromosome. Three types of crossover are presented.

OX.
The first crossover is the classic ordered crossover [19]. Figure 7 illustrates an example of the OX process in a mapping with three scenarios, where the first chromosome is the donor and the second one is the receiver. Figure 7(a) shows the first step, where two chromosomes are selected: the first chromosome will donate a subset of each part (represented in gray) to the receiver in the same position. In the second step, the receiver eliminates from its code the cores donated by the donor, as shown in Figure 7(b). The remaining cores in the receiver are circularly shifted to the left inside their own area (e.g., the first part performs shifts only in the first eight slots), as illustrated from Figure 7(c) to Figure 7(d), occupying empty spaces and opening the required slots for the donor subset. Finally, Figure 7(d) shows the OX result.
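The following Python sketch implements the textbook form of the ordered crossover on one part, together with a helper that applies an operator part by part. It is offered as an illustration under assumptions: the cut points `a` and `b` are chosen at random per part, and the names `ox` and `crossover_by_parts` are not taken from the paper.

```python
import random


def ox(donor, receiver, a, b):
    """Ordered crossover on one part; donor[a:b] keeps its positions.

    The remaining genes come from the receiver, read circularly from
    position b and written circularly from position b.
    """
    n = len(donor)
    child = [None] * n
    child[a:b] = donor[a:b]
    taken = set(donor[a:b])
    fill = [receiver[(b + k) % n] for k in range(n)
            if receiver[(b + k) % n] not in taken]
    for k, gene in enumerate(fill):
        child[(b + k) % n] = gene
    return child


def crossover_by_parts(donor, receiver, part_sizes, operator, rng=random):
    """Apply a crossover operator independently to each scenario part."""
    child, start = [], 0
    for size in part_sizes:
        a, b = sorted(rng.sample(range(size + 1), 2))  # two distinct cuts
        child += operator(donor[start:start + size],
                          receiver[start:start + size], a, b)
        start += size
    return child
```

For example, with donor `[1, 2, 3, 4, 5, 6, 7, 8]`, receiver `[2, 4, 6, 8, 7, 5, 3, 1]`, and cuts 2 and 5, the child is `[8, 7, 3, 4, 5, 1, 2, 6]`: the receiver genes wrap around the end of the part, which is the shifting behaviour discussed in the next subsection.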

NWOX.
The OX presents a shifting characteristic that may not preserve, in the offspring chromosome, the absolute positions of the parents: the circular shift moves genes from the extreme left to the extreme right. To preserve the absolute positions in the crossover, the nonwrapping ordered crossover (NWOX) [21] was proposed.
Figure 8 shows the NWOX crossover process. Figures 8(a) and 8(b) present the same situation seen in the OX case; in the second figure, the order of Fc6, Fc3, RC2(1), and Fc5 is established. The difference becomes clear in the third step, where the space for the reception of new cores is allocated between the ordered genes, as in Figure 8(c). Note that the relative order of the receiver is preserved. Instead of performing the circular shift and moving the modules Fc6 and Fc3 to the right side of the chromosome as in the OX, these are kept closer to their original positions. The final result of the NWOX is shown in Figure 8(d).
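A compact way to express the nonwrapping behaviour in Python is to remove the donated genes from the receiver and reinsert the donor segment at its original position, without any circular shift. The sketch below operates on one part; the function name and cut-point arguments are assumptions made for the example.

```python
def nwox(donor, receiver, a, b):
    """Nonwrapping ordered crossover on one part.

    Receiver genes keep their relative order: those before the cut slide
    left and those after it slide right, opening positions a..b-1 for the
    donor segment.
    """
    segment = list(donor[a:b])
    taken = set(segment)
    remaining = [gene for gene in receiver if gene not in taken]
    return remaining[:a] + segment + remaining[a:]
```

With donor `[1, 2, 3, 4, 5, 6, 7, 8]`, receiver `[2, 4, 6, 8, 7, 5, 3, 1]`, and cuts 2 and 5, the child is `[2, 6, 3, 4, 5, 8, 7, 1]`; the genes 2 and 6 stay at the left of the part instead of wrapping around as they would in the textbook OX.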

PMX.
The partially matched crossover (PMX) [19] uses a principle of position swapping inside the chromosome.
Figure 9(a) presents the donor and the receiver (top and bottom, resp.). In Figure 9(b) the first swapping process occurs, where genes of the receiver swap positions so that they match the positions of the donor in the gray area. As the gene Fc4 was swapped to a region outside the gray zone, its position must be changed again with a new swap, until the gray region of the receiver becomes identical to the gray zone of the donor, as illustrated in Figure 9(c).
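The swap-based view of PMX described above can be sketched in Python as follows, for one part of the chromosome. The function name and the cut-point arguments are assumptions; the sketch repeats swaps inside the receiver until its segment equals the donor segment.

```python
def pmx(donor, receiver, a, b):
    """Partially matched crossover on one part, expressed as swaps.

    For each position i in the segment a..b-1, the gene donor[i] is located
    in the child and swapped into position i. Positions already matched are
    never disturbed by later swaps, because all genes are distinct.
    """
    child = list(receiver)
    for i in range(a, b):
        j = child.index(donor[i])
        child[i], child[j] = child[j], child[i]
    return child
```

With donor `[1, 2, 3, 4, 5, 6, 7, 8]`, receiver `[3, 7, 5, 1, 6, 8, 2, 4]`, and cuts 3 and 6, the child is `[3, 7, 8, 4, 5, 6, 2, 1]`: the segment 4, 5, 6 is copied from the donor, and the displaced genes 1 and 8 end up where 4 and 5 used to be.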

Population Diversification Techniques.
One of the known problems of GAs is the premature convergence to a local minimum [22]. To deal with this issue, several methodologies have been proposed to force the diversification of the population, maintaining its health. Two techniques were selected among the existing ones, due to their known effectiveness and ease of implementation: the random offspring generation (ROG) [23] and the adaptive genetic algorithm (AGA) [20].

ROG.
One of the factors that lead to premature convergence is a great number of individuals with the same genetic material. When that happens, there is a great probability of performing crossover with two identical parents, leading to a simple cloning process and, consequently, to an unchanged situation.
To deal with this problem, the ROG technique was proposed, in which the genetic codes are verified before each crossover process. If the parents are identical, one of the two offspring chromosomes is completely randomized, and the other one is just cloned.

AGA.
The AGA technique [20] does not have fixed probabilities for crossover and mutation. These probabilities are dynamically changed according to the evolution of the algorithm.
For a good evolution, a GA must have two characteristics: convergence to a minimum (local or global) and wide exploration of the search tree. The balance between these two characteristics is controlled by the probability of crossover ($p_c$) and the probability of mutation ($p_m$). To adapt the probabilities to the momentary situation, the authors proposed varying them according to the fitness value:
$$p_c = k_1 \frac{f_{\max} - f'}{f_{\max} - \bar{f}}, \qquad p_m = k_2 \frac{f_{\max} - f}{f_{\max} - \bar{f}}, \tag{14}$$
where $f_{\max}$ is the greatest fitness of the population, $f'$ is the greater fitness between the crossing individuals, $\bar{f}$ is the population fitness mean, and $f$ is the fitness value of the chromosome that will be mutated. $k_1$ and $k_2$ are control parameters of the AGA and must vary from 0 to 1. In addition, in order for the 100% probability not to be exceeded, the functions must follow the restrictions
$$p_c = k_3 \ \text{ if } f' < \bar{f}, \qquad p_m = k_4 \ \text{ if } f < \bar{f}, \tag{15}$$
where $k_3$ and $k_4$ are also control parameters.
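Both diversification techniques can be sketched in a few lines of Python. Several details are assumptions made for the example: the randomized ROG offspring is shuffled inside each scenario part so that it remains a valid chromosome, the default values of the control parameters are only illustrative, and a constant probability is returned when all individuals have the same fitness to avoid a division by zero.

```python
import random


def rog(parent1, parent2, part_sizes, rng=random):
    """Random offspring generation check.

    Returns None when the parents differ (normal crossover applies);
    otherwise returns (clone, randomized offspring).
    """
    if list(parent1) != list(parent2):
        return None
    randomized, start = [], 0
    for size in part_sizes:
        part = list(parent1[start:start + size])
        rng.shuffle(part)  # assumption: shuffle within each scenario part
        randomized += part
        start += size
    return list(parent1), randomized


def adaptive_pc(f_max, f_mean, f_prime, k1=1.0, k3=1.0):
    """Adaptive crossover probability from the fitness of the better parent."""
    if f_prime < f_mean or f_max == f_mean:
        return k3
    return k1 * (f_max - f_prime) / (f_max - f_mean)


def adaptive_pm(f_max, f_mean, f, k2=0.5, k4=0.5):
    """Adaptive mutation probability from the fitness of the chromosome."""
    if f < f_mean or f_max == f_mean:
        return k4
    return k2 * (f_max - f) / (f_max - f_mean)
```

With these rules the best individual of the population gets a crossover and mutation probability of zero, so good solutions are protected, while below-average individuals are disrupted with the constant probabilities.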

Experimental Results
Synthetic Applications.
Given the difficulty of finding real applications adequate for PRS-NoCs in the technical literature and in open-access databases, nine synthetic applications were created. There are several benchmarks that could fit in simple PRS-NoCs, which were used in our previous work [11]. However, those benchmarks are not suitable for the placement problem.
The nine synthetic applications are based on three arrangements of cores named Alpha, Beta, and Gamma. The arrangements are described in Table 1, where the first column shows the applications, the second one shows the number of fixed cores, and the last five columns show the number of cores for each configuration scenario.
Starting from the basic arrangements, three core size variations are created for each application: (1) small cores, in which the placed cores will not violate the entry space; (2) varied cores, where small and big cores are mixed, with the placement sometimes violating the entry space; (3) big cores, in which the placement will almost always violate the entry space, reducing the number of possible solutions. The application Alpha with big cores is the one presented in Figure 3. The other applications are not illustrated here due to space limitations.

6.2. Semiexhaustive Solution. In order to have an algorithmic reference against which to compare the GA solution, an exact, semiexhaustive solution was developed. The objective is to find the minimum communication cost given by (12).
The semiexhaustive algorithm is a classic branch-and-bound algorithm that searches every permutation of each scenario. For instance, for the application Alpha with big cores (presented in Figure 3), the algorithm tests every permutation of the first scenario and, for each of those permutations, the two possible variations of the second scenario combined with the six possible variations of the third scenario. Each possibility represents a branch; when a branch violates the placement restrictions imposed by (4) and (5), it is bounded. A branch is also bounded when the bandwidth capacity, represented by (9), is exceeded.
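The branch-and-bound idea can be sketched on a deliberately simplified problem. The Python sketch below is not the algorithm used in this work: it assumes a single scenario, a toy cost equal to traffic multiplied by slot distance, and a hypothetical `feasible(core, slot)` predicate standing in for restrictions (4), (5), and (9). It also bounds a branch whose partial cost already reaches the best complete cost found so far, which is a standard addition that is valid only when traffic and distances are nonnegative.

```python
import math

def branch_and_bound(n_cores, n_slots, traffic, distance, feasible):
    """Assign cores 0..n_cores-1 to distinct slots, minimizing the toy cost
    sum(traffic[i][j] * distance[slot_i][slot_j]) over all core pairs."""
    best_cost = math.inf
    best_assignment = None
    assignment = []              # assignment[i] is the slot of core i
    used = [False] * n_slots

    def added_cost(core, slot):
        # Cost contributed by `core` against every core already placed.
        return sum((traffic[core][other] + traffic[other][core])
                   * distance[slot][assignment[other]]
                   for other in range(core))

    def search(core, cost):
        nonlocal best_cost, best_assignment
        if core == n_cores:      # complete assignment, cheaper than incumbent
            best_cost, best_assignment = cost, list(assignment)
            return
        for slot in range(n_slots):
            if used[slot] or not feasible(core, slot):
                continue         # restriction violated: bound this branch
            new_cost = cost + added_cost(core, slot)
            if new_cost >= best_cost:
                continue         # cannot beat the incumbent: bound this branch
            used[slot] = True
            assignment.append(slot)
            search(core + 1, new_cost)
            assignment.pop()
            used[slot] = False

    search(0, 0)
    return best_cost, best_assignment
```

Because every feasible branch is either explored or discarded only when it provably cannot improve the incumbent, the returned cost is the global minimum of the toy problem; the price is a worst-case effort that grows factorially with the number of cores.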
This semiexhaustive approach is similar to the one presented in [15] but more accurate, since it is guaranteed to reach the global minimum. The smart-exhaustive algorithm presented in [15] became stuck in local minima for some of its results.
Table 2 shows the results for the semiexhaustive algorithm. The first column shows the applications, and their variations are shown in the second column. The global minimum cost obtained by the algorithm is shown in the third column, and the algorithm execution time is in the fourth column. The last column shows the percentage of cases that exceeded the entry space and, consequently, were not mapped. The algorithm worked well for the first case (Alpha), with acceptable execution times. It also presented good results for the Beta application; however, as expected, the execution time increased significantly when compared to the first case. The explosiveness of the problem is evident when the last case (Gamma) is analysed: even after 24 hours of simulation, the algorithm was not able to conclude the placement-mapping analysis.
Observing the data presented in Table 1, it is clear that the size of the NoC and the number of fixed and reconfigurable cores increase the algorithmic effort needed to solve the placement-mapping problem. Therefore, for real-life, nonsynthetic applications with a large number of cores, there is a need for an algorithm that can solve the problem in polynomial time.

6.3. GA Results. For the simulation, the three crossovers described in Section 5 were tested. For each type of crossover, three types of GA were considered: the ordinary GA (OGA), with no technique for population diversification; the adaptive GA (AGA); and the GA with random offspring generation (ROG). All algorithmic variants were tested for each of the nine applications, which resulted in 81 simulations. Each simulation was repeated 10 times to obtain an average value, since individual runs differ due to the random generation of the first population.
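The size of the experiment follows directly from the combinations just listed. The Python sketch below enumerates them and averages repeated runs; `run_once` is a hypothetical callable standing in for one GA simulation, and the seeding scheme is an assumption added only to make the repetitions reproducible.

```python
import random
from itertools import product
from statistics import mean

CROSSOVERS = ["NWOX", "PMX", "OX"]
GA_VARIANTS = ["OGA", "AGA", "ROG"]
ARRANGEMENTS = ["Alpha", "Beta", "Gamma"]
CORE_SIZES = ["small", "varied", "big"]
REPETITIONS = 10

# 3 crossovers x 3 GA variants x (3 arrangements x 3 core sizes) = 81 simulations.
grid = list(product(CROSSOVERS, GA_VARIANTS, ARRANGEMENTS, CORE_SIZES))
total_runs = len(grid) * REPETITIONS

def mean_cost(run_once, config, repetitions=REPETITIONS, seed=0):
    """Average the cost returned by `repetitions` independently seeded runs.

    run_once(config, rng) is a hypothetical function that performs one GA
    simulation for `config` and returns its final communication cost.
    """
    return mean(run_once(config, random.Random(seed + r))
                for r in range(repetitions))
```

Averaging over independently seeded runs separates the effect of the crossover and diversification technique from the luck of a particular initial population.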
The algorithms were simulated in MATLAB, in which the GA code was developed. The S-XY algorithm was simulated on the same platform to calculate the communication cost.
For all simulations, it was found empirically that a number of generations NG of 300 offers a good trade-off between quality of solutions and computational effort. Following the same reasoning, the population size was set to 100 individuals. For the ordinary GA, the probabilities of crossover and mutation were set to 0.7 and 0.02, respectively. In the AGA, the control parameters were defined as $k_1 = k_3 = 1$ and $k_2 = k_4 = 0.5$, as suggested by [20].
Table 3 shows the costs obtained from the GA for all the cases. The first column shows the applications, while their variations are listed in the second column. Columns 3, 4, and 5 show the costs for the NWOX crossover with the ordinary GA and the AGA and ROG variations, respectively; in the same sequence, columns 6, 7, and 8 present the costs for PMX, and the same order is repeated in the last three columns for OX.
For a better visualization of results, they are presented in a color scale where the dark green represents the best solution and the red represents the worst solution.
The ordinary GA presents moderate to bad results when compared to the other alternatives. This can be explained by premature convergence to a local minimum.
The approach with the ROG technique shows some good results for the Alpha application and bad results for the other applications. It was noted during the simulations that the variation of the population was too high, which prevented convergence to a global optimum in the larger applications and shows that the technique is not appropriate for this specific mapping problem.
The best results were obtained from the AGA, showing that its mechanisms for population diversification are better suited to the placement-mapping problem.
The PMX was the best of the three types of crossover, which makes the AGA with PMX the best presented solution.
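For readers unfamiliar with PMX (partially mapped crossover), a minimal Python sketch of the textbook operator follows. It assumes that a chromosome is a permutation of core identifiers, which may differ from the encoding defined in Section 5; the function names and the explicit cut-point arguments are illustrative choices.

```python
import random

def pmx(parent_a, parent_b, cut1, cut2):
    """Partially mapped crossover for permutations.

    The child keeps parent_a[cut1:cut2] in place and takes every other
    position from parent_b. A gene of parent_b that would duplicate one in
    the copied segment is replaced by following the position-wise mapping
    parent_a[j] -> parent_b[j] until a gene outside the segment is reached.
    """
    size = len(parent_a)
    child = [None] * size
    child[cut1:cut2] = parent_a[cut1:cut2]
    segment = set(parent_a[cut1:cut2])
    position_in_a = {gene: i for i, gene in enumerate(parent_a)}
    for i in list(range(cut1)) + list(range(cut2, size)):
        gene = parent_b[i]
        while gene in segment:               # duplicate: follow the mapping
            gene = parent_b[position_in_a[gene]]
        child[i] = gene
    return child

def random_pmx(parent_a, parent_b, rng=random):
    """Apply PMX with two cut points drawn uniformly at random."""
    cut1, cut2 = sorted(rng.sample(range(len(parent_a) + 1), 2))
    return pmx(parent_a, parent_b, cut1, cut2)
```

For example, `pmx([1, 2, 3, 4, 5, 6, 7, 8, 9], [9, 3, 7, 8, 2, 6, 5, 1, 4], 3, 7)` returns `[9, 3, 2, 4, 5, 6, 7, 1, 8]`: the segment `4 5 6 7` comes from the first parent, and the conflicting genes 7 and 4 of the second parent are replaced by 2 and 8 through the mapping. The child is always a valid permutation, so no repair step is needed after crossover.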
In order to analyze the general quality of the results obtained with the GA approach, Table 4 compares the GA results with those of the exact, semiexhaustive algorithm. The percentage indicates how far the cost is above the minimum cost; this index is named the cost penalty. The Gamma application (bold font) was compared with the best cost obtained by the GA, since it was not possible to obtain the global minimum cost with the semiexhaustive algorithm, as shown in Table 2.
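The cost penalty can be written as a one-line computation. The Python sketch below assumes the natural reading of the definition above, namely the relative excess of the GA cost over the reference cost expressed as a percentage; the function names are illustrative, and the figures used in the usage note are made up to show the arithmetic.

```python
def cost_penalty(ga_cost, reference_cost):
    """Percentage by which ga_cost exceeds reference_cost."""
    if reference_cost <= 0:
        raise ValueError("reference cost must be positive")
    return 100.0 * (ga_cost - reference_cost) / reference_cost

def average_penalty(ga_costs, reference_costs):
    """Mean cost penalty over paired lists of GA and reference costs."""
    penalties = [cost_penalty(g, r) for g, r in zip(ga_costs, reference_costs)]
    return sum(penalties) / len(penalties)
```

A hypothetical GA cost of 104 against a minimum of 100 gives a penalty of 4%, and a GA that reaches the minimum gives 0%.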
The best solution (PMX-AGA) presented penalties between 0.6% and 11.6%, with an average penalty of 4%. The other crossovers presented similar results: the NWOX-AGA ranged from 0.6% to 20.3% with an average of 6.6%, and the OX-AGA ranged from 0.6% to 17.7% with an average of 5.6%.
The mappings for the Alpha application with big cores presented the worst results. As its occupation of the PRS-NoC was the largest among the synthetic applications, a great number of mappings violated the restrictions, which made it difficult for the algorithm to converge to a global minimum.
The Alpha application took at most 48 seconds to be mapped using the GA. The Beta and Gamma applications took at most 73 and 113 seconds, respectively. All the simulations were executed on a PC with an Intel Core i7-3770 processor and 8 GB of RAM. These execution times show that the algorithm is entirely acceptable for a design-time activity.

Conclusion
This work presented GA solutions for placement and mapping in NoC-based reconfigurable systems. The problem formulation describes all the important aspects of a PRS-NoC with an irregular and indirect communication network. The formulation was the basis for the development of the GA solutions.
The problem was adapted for GAs, which enabled its solution for a great variety of applications. Three crossovers were tested: the NWOX, the PMX, and the OX. Each crossover was tested with the ordinary GA and with two techniques of population diversification: the AGA and the ROG.
The different types of crossover presented similar results, with the PMX slightly better. The ordinary GA tended to converge prematurely to a local minimum and the ROG technique kept the population too varied to converge in the larger applications, while the AGA was better suited to the placement-mapping problem. The AGA results were close to the global minimum cost when combined with any crossover type. The AGA combined with PMX presented the best results, with penalties between 0.6% and 11.6% and an average penalty of 4%.
The entire placement-mapping process took at most 48, 73, and 113 seconds for applications with 13, 16, and 26 cores, respectively. These execution times are entirely acceptable for a design-time activity and show that the algorithm is able to solve the problem for a relatively large number of cores.

Figure 3: The EAPCG of the synthetic application Big Alpha.

Table 1: Characteristics of synthetic applications.

Table 2: Results of the semiexhaustive algorithm.

Table 3: Minimum communication costs obtained from the GAs.

Table 4: Cost penalty of the GAs when compared to the minimal solution.