HotSpot Thermal Floorplan Solver Using Conjugate Gradient to Speed Up

We propose using the conjugate gradient method to solve the thermal resistance model in the HotSpot thermal floorplan tool efficiently. The iterative conjugate gradient solver is well suited to traditional sparse linear systems. We also define the relative sparse matrix that arises in the iterative thermal floorplan under the Simulated Annealing framework; the iterative method for relative sparse matrices can be applied to other iterative framework algorithms. The experimental results show that our incremental iterative conjugate gradient solver runs approximately 11x faster than the LU decomposition method for the case ami49, and the speedup ratio curve shows that the acceleration grows with the number of modules.


Introduction
With the constant improvement of chip speed, the power and temperature of a chip increase accordingly. To cope with rising chip temperatures, thermal-aware floorplanning is widely used in physical design to avoid hotspots on chips. This makes the thermal problem more critical for physical design quality. In [1,2], the authors employ a hierarchical thermal model to decrease the modules' maximum temperature in chip physical design.
They use the B*-tree [3,4] to represent the floorplan/placement with the Simulated Annealing optimization algorithm. The authors' hierarchical thermal floorplan/placement includes two critical steps. First, they cluster the modules by power density, and then they use the Gauss-Seidel method to solve the thermal linear system. In [1,2], the authors only give a theoretical comparison between the Gauss-Seidel method and the traditional LU decomposition matrix solver in terms of loop iteration counts. They assume that the time complexity of the traditional LU decomposition matrix solver is $2n^3/3$ and compare it with the measured loop iteration count of the Gauss-Seidel method, written as $cn^2$, instead of comparing the CPU running time of real programs or solvers.
In [5], the authors compare program running times, but the comparison is not clear-cut because the program running time also includes other cost computations and the SA iteration time. To compare the linear solvers more accurately, in this study we improve the timing measurement method and add preconditioning methods such as the SSOR preconditioner.
In [8], exponentially increasing power densities in current designs, caused by aggressive technology scaling, have made temperature one of the primary design constraints along with timing, area, and power. Many design techniques are adopted during the physical design stage to minimize power, in addition to architectural techniques such as throttling for dynamic thermal management. The authors of [8] propose a practical methodology for better thermal management through floorplan modifications based on thermal hotspots obtained from dynamic simulations, without disturbing the logical connectivity information.
This methodology offers benefits that can be readily realized by doing the analysis early in the design cycle.
It can also improve the placement of thermal sensors and extract additional performance by delaying their triggering, given the lateral heat spreading achieved by better floorplanning.
In [9], with the continued scaling of CMOS technology, on-chip temperature and thermally induced variations have become a major design concern. To effectively limit the high temperature in a chip equipped with a cost-effective cooling system, thermal-specific approaches, besides low-power techniques, are necessary at the chip design level. The high temperature in hotspots and large thermal gradients are caused by high local power density and nonuniform power dissipation across the chip. With the objective of reducing power density in hotspots, the authors proposed two placement techniques that spread cells in hotspots over a larger area. Increasing the area occupied by a hotspot directly reduces its power density, leading to a reduction in peak temperature and thermal gradient. To minimize the overhead introduced in delay and dynamic power, they maintained the relative positions of the coupling cells in the new layout. They compared the proposed methods, in terms of temperature reduction, timing, and area overhead, with a baseline method that enlarges the circuit area uniformly.
The experimental results showed that the proposed methods achieve a larger reduction in both peak temperature and thermal gradient than the baseline method.
The baseline method, although it reduces peak temperature in most cases, has little impact on the thermal gradient.
In [10], the authors note that improper analog placements may degrade circuit performance because the thermal gradient can affect the electrical characteristics of thermally sensitive devices. To mitigate the thermal effect in analog layout design, thermally induced mismatches among matched devices must be reduced in addition to eliminating thermal hotspots.
The study presented the major challenges that the chip thermal gradient poses for analog placement, introduced nonuniform and uniform thermal profiles and the corresponding placement configurations, surveyed key existing techniques for analog placement under both kinds of profile, and provided experimental results for analog placement with thermal consideration.
In [11], the work developed a thermal-aware placer, ThermPL, to reduce both on-chip peak temperature and thermal gradient by combining thermal force and padding techniques with rough legalization in force-directed global placement.
Thermal padding is first adopted to reduce local power density. To make use of thermal force, the authors used the thermal gain basis to capture the temperature distribution of a placement quickly and accurately and to calculate the thermal contribution of cells effectively based on thermal locality. Then, they utilized the proposed innate thermal force, assessed through thermal criticality and capabilities, to spread cells away from hotspots. With the thermal gain basis, ThermPL can efficiently obtain the thermal profile of a placement with a maximum error of 0.65% compared with a commercial tool. Experimental results show that ThermPL can provide 7% and 19% reductions on average in peak temperature and thermal gradient, respectively, with only 4.6% wirelength overhead.
In [12], the authors note that improper analog placements may degrade circuit performance because the thermal impact of power devices can affect the electrical characteristics of thermally sensitive devices. There is little previous work that considers the desired placement configuration between power devices and thermally sensitive devices, which would give a better thermal profile and reduce thermally induced mismatches.
The study first introduced the properties of a desired thermal profile for better thermal matching of the matched devices. It then presented a thermal-driven analog placement methodology to achieve the desired thermal profile and to obtain the best device matching under that profile while satisfying the symmetry and common-centroid constraints. Experimental results based on real analog circuits show that the proposed approach achieves the best analog circuit performance/accuracy, with the least impact from the thermal gradient, among existing works.
In this study, we embed the iterative conjugate gradient method into the HotSpot floorplan to compare the real CPU running time of different solvers under the same compiler and running environment.
The conjugate gradient solver was integrated into the thermal floorplan tool HotSpot [13] alongside its existing LU decomposition solver. We can then compare the running time of the thermal floorplan under the conjugate gradient method and under the LU decomposition solver.
The thermal floorplan solver is selected by a command line parameter of the program, so both solvers run in the same environment: the same CPU and memory, GCC version, and compiler options. Comparing the CPU time of the two solvers is more convincing than a theoretical analysis of loop iteration counts. We also use the SSOR and Jacobi preconditioners to accelerate the conjugate gradient solver.
The HotSpot thermal-aware floorplan employs the thermal model to compute the blocks' temperatures; the temperature metric is combined with the area and wire length metrics, and the matrix of the HotSpot thermal model is a relative sparse matrix within the iterative SA framework algorithm.
The HotSpot thermal floorplan can decrease the maximum block temperature by distributing the power density evenly, avoiding hotspots in the floorplan step of physical design. We introduce an iterative method to solve this kind of relative sparse linear system in the thermal model of the floorplan. This paper's contributions include the following: (i) The relative sparse matrix is defined. It can speed up the convergence of an iterative linear system solver. (ii) The conjugate gradient iterative method is integrated into the HotSpot floorplan thermal model. It is an efficient algorithm that reduces the running time by accelerating the linear solver in HotSpot.
This paper is organized as follows. Section 2 introduces the HotSpot floorplan flow. The relative sparse matrix definition and the thermal resistance model are given in Section 3 and Section 4, respectively. Section 5 shows the experimental results, and conclusions and future research are given in Section 6.

HotSpot Thermal Floorplan
The VLSI physical design floorplan places the blocks on the silicon die without overlap, and the floorplan algorithm must obey the chip constraints while optimizing the cost of the area, wire length, and temperature metrics. The placement temperature is obtained by solving the linear equations of the thermal model.

Introduction of Hot Thermal Floorplan.
The floorplan/placement of physical design is a critical step for thermal-aware design. Hot floorplan is a thermal-aware tool that decreases module temperatures and avoids the concentration of hotspots. The hot floorplan merges the thermal cost with the traditional area and wire length costs in the iterative Simulated Annealing algorithm, and solving for the module temperatures within the iterative Simulated Annealing algorithm is time-consuming.
We also use the HotSpot model to guide the thermal floorplan/placement by computing static temperatures and gathering module temperature statistics; the thermal cost is then integrated with the area and wire length costs for thermal-aware physical design.
HotSpot builds a thermal model to compute the dynamic block or grid temperature on chip. We integrate the conjugate gradient solver to accelerate the thermal block model computation in the floorplan SA algorithm.

Relative Sparse Matrix of Iterative Framework Algorithm
There are more sophisticated algorithms for solving sparse linear equations that avoid processing the zero entries of a sparse matrix [6,7], such as iterative methods for linear systems. A relative sparse matrix is not strictly a sparse matrix and tends to be dense; "relative sparse" means that only a few "interesting" entries differ between one matrix and another.

Relative Sparse Matrix.
Relative sparse matrix definition: if the matrix $R = A_{k+1} - A_k$ is sparse, that is, $R$ has only a few nonzero entries, we say that the matrices $A_k$ and $A_{k+1}$ are relatively sparse; in other words, $A_{k+1}$ is a sparse matrix relative to $A_k$. Assume there are two linear systems, $A_k x = b$ and $A_{k+1} x = b$, solved one after the other in a sequential iterative framework algorithm: the first step solves $A_k x = b$ and the second step solves $A_{k+1} x = b$. For example, the Simulated Annealing algorithm solves $A_k x = b$ and $A_{k+1} x = b$ sequentially with the same vector $b$, while the matrix changes from $A_k$ to $A_{k+1}$.
For such a sequence of relative sparse matrices, iterative methods are employed to solve the linear equations so that fewer iterations are needed for convergence. The detailed operations are as follows. In the first step, the linear system $A_0 x = b$ is solved from scratch, giving the solution $x_0$, and we then reuse $x_0$ as the initial estimate for the linear system $A_1 x = b$. In the same way, we reuse the solution $x_k$ of $A_k x = b$ as the initial estimate for $A_{k+1} x = b$, which speeds up the convergence of the iterative method. Because the change from $A_k$ to $A_{k+1}$ is small, the matrix $A_{k+1}$ is sparse relative to $A_k$ even though $A_k$ and $A_{k+1}$ are full matrices rather than traditional sparse matrices in terms of entry density.
We call this way of computing relative sparse linear systems the incremental updating solution method.
In the iterative algorithm, the solution of the previous iteration is preserved as an intermediate solution and serves as the initial estimate for the linear system of the current iteration.
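As a minimal sketch of the definition above, the following Python snippet counts the nonzero entries of $R = A_{k+1} - A_k$ for two dense matrices. The function names, the tolerance `tol`, and the threshold `max_fraction` are illustrative assumptions; the definition itself does not fix a numerical threshold for "a few".

```python
def difference_nonzeros(a_prev, a_next, tol=0.0):
    """Count entries of R = a_next - a_prev with magnitude above tol."""
    return sum(
        1
        for row_prev, row_next in zip(a_prev, a_next)
        for p, q in zip(row_prev, row_next)
        if abs(q - p) > tol
    )


def is_relatively_sparse(a_prev, a_next, max_fraction=0.1, tol=0.0):
    """Hypothetical test: R counts as sparse if few of its entries are nonzero."""
    total = len(a_prev) * len(a_prev[0])
    return difference_nonzeros(a_prev, a_next, tol) <= max_fraction * total


# A dense 5x5 symmetric matrix: every entry is nonzero.
n = 5
a_k = [[10.0 if i == j else 1.0 for j in range(n)] for i in range(n)]

# One perturbation changes a single symmetric pair of entries.
a_k1 = [row[:] for row in a_k]
a_k1[0][1] = a_k1[1][0] = 1.5

changed = difference_nonzeros(a_k, a_k1)
print(changed, is_relatively_sparse(a_k, a_k1))
```

Both matrices here are fully dense, yet only 2 of their 25 entries differ, which is the situation the definition is meant to capture.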

Relative Sparse Matrix in Thermal Floorplan of SA Framework Algorithm

Thermal Model Introduction.
In the floorplan of VLSI physical design, the circuit modules are placed randomly on the die using Simulated Annealing; once the Simulated Annealing generates a floorplan of circuit modules, we calculate the cost metrics: the die area, the wire length between circuit modules, and the maximum temperature of the circuit modules given by the thermal model. Simulated Annealing (SA) is an iterative optimization algorithm, and the thermal model is incorporated into it. Thermal conduction in the die is complex [14], and it can be abstracted as a thermal resistance model [15] relating T, P, and R, where T and P are the vectors representing temperature and power consumption, respectively, and the thermal resistance R is a square, symmetric matrix. Once the circuit blocks are determined, the blocks' power vector P does not change; it is a constant vector. The entries of the thermal resistance matrix R change with the details of the placement. R is a dense matrix rather than a sparse matrix, but it matches the relative sparse matrix definition within the iterative Simulated Annealing framework algorithm. The Simulated Annealing algorithm changes the placement only slightly from one stage to the next before computing the new cost metrics. For example, moving a block from one location to another unused location perturbs only that block's position, so the thermal conduction between blocks and the temperatures of most blocks change only slightly as well. The updated thermal resistance matrix is therefore dense but has only a few "interesting" changes in its entries: the new thermal resistance matrix $R_{n+1}$ is sparse relative to $R_n$.

LU Decomposition Solving Linear Equation.
Consider a system of linear equations in the matrix form $Ax = b$: given the matrix $A$ and the vector $b$, the solution $x$ is to be found. The matrix $A$ is LUP decomposed such that $PA = LU$. The linear equations can then be transformed equivalently into the LU form

Mobile Information Systems
$LUx = Pb$. (3)
The LU solver proceeds in two logical steps: (i) First step: solve the lower triangular system $Ly = Pb$ for $y$. (ii) Second step: solve the upper triangular system $Ux = y$ for $x$. The cost of solving a system of linear equations in this way is approximately $2n^3/3$ floating point operations if the matrix $A$ has size $n$ [16].
The LU decomposition is a direct method for solving linear equations.
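The two-step LU solve described above can be sketched in Python as follows. This is a generic textbook LUP routine with partial pivoting, written for illustration; it is not HotSpot's own LU code, and the function names are hypothetical.

```python
def lup_decompose(a):
    """Factor P A = L U. L (unit diagonal) and U are packed into one matrix."""
    n = len(a)
    lu = [row[:] for row in a]
    perm = list(range(n))  # perm[i] = index of the original row now at row i
    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(lu[r][col]))
        if lu[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        if pivot != col:
            lu[col], lu[pivot] = lu[pivot], lu[col]
            perm[col], perm[pivot] = perm[pivot], perm[col]
        for row in range(col + 1, n):
            lu[row][col] /= lu[col][col]
            factor = lu[row][col]
            for k in range(col + 1, n):
                lu[row][k] -= factor * lu[col][k]
    return lu, perm


def lup_solve(lu, perm, b):
    """Solve A x = b from the packed factors of P A = L U."""
    n = len(lu)
    # First step: forward substitution, L y = P b.
    y = [0.0] * n
    for i in range(n):
        y[i] = b[perm[i]] - sum(lu[i][j] * y[j] for j in range(i))
    # Second step: back substitution, U x = y.
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(lu[i][j] * x[j] for j in range(i + 1, n))) / lu[i][i]
    return x


a = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
lu, perm = lup_decompose(a)
x = lup_solve(lu, perm, b)
print(x)
```

The factorization is the cubic-cost part. In an SA loop the matrix changes at every step, so a direct solver has to redo the factorization each time and cannot benefit from the previous solution.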

Incrementally Iterative Conjugate Gradient Solver and Convergence.
There are many iterative linear solvers, including conjugate gradient, Gauss-Seidel, and successive over-relaxation. In this study, the conjugate gradient method is used to solve the thermal model in the floorplan.

Convergence of Incrementally Iterative Conjugate Gradient Solver.
The conjugate gradient method converges if the matrix is symmetric and positive definite. The condition number associated with the linear equation $Ax = b$ gives a bound on how inaccurate the solution $x$ will be after approximation.
The condition number of a matrix is the product of two operator norms: $\kappa(A) = \|A^{-1}\| \, \|A\|$. If $A$ is normal, then $\kappa(A) = |\lambda_{\max}(A)/\lambda_{\min}(A)|$, where $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ are the eigenvalues of $A$ with the largest and smallest modulus, respectively. The convergence of CG depends on this condition number $\kappa(A)$.
Denote the initial guess by $x_0$. At the start of the SA algorithm, we can assume $x_0 = [0, \ldots, 0]$. After the first solve, the conjugate gradient solver reuses the solution of the $k$th solve as the initial estimate for the $(k+1)$th solve, $k \ge 0$. This incremental updating solution method can accelerate the convergence of the solver.
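A standard textbook bound on the conjugate gradient error, which is not stated in the original text and is quoted here only as supporting context, makes both effects explicit. Within a single solve, let $x_0$ be the initial estimate, $x_j$ the $j$th conjugate gradient iterate, $x^*$ the exact solution, and $\|v\|_A = \sqrt{v^T A v}$ the energy norm:

```latex
\| x_j - x^* \|_A \;\le\; 2 \left( \frac{\sqrt{\kappa(A)} - 1}{\sqrt{\kappa(A)} + 1} \right)^{j} \| x_0 - x^* \|_A
```

A smaller $\kappa(A)$ shrinks the contraction factor, which is the role of preconditioning, and an initial estimate close to $x^*$, such as the previous SA solution, shrinks the starting error on the right-hand side. Both suggest that fewer iterations are needed to reach a given precision.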
The conjugate gradient method is motivated by the fact that the solution $x^*$ of $Ax = b$ is also the unique minimizer of the quadratic function $f(x) = \frac{1}{2} x^T A x - x^T b$. This suggests taking the first basis vector $p_1$ to be the negative of the gradient of $f(x)$ at $x = x_0$. The gradient of $f(x)$ at $x_0$ equals $Ax_0 - b$, so we take $p_1 = b - Ax_0$.
The remaining basis vectors are conjugate to the gradient. Let $r_k$ be the residual at the $k$th step: $r_k = b - Ax_k$. Note that $r_k$ is the negative gradient of $f$ at $x = x_k$, so the gradient descent method would move in the direction $r_k$. Here, we insist that the directions $p_k$ be conjugate to each other. We also require that the next search direction be built out of the current residual and all previous search directions, which is reasonable enough in practice.

Pseudocode of Conjugate Gradient Algorithm.
The algorithm below solves $Ax = b$, where $A$ is a real, symmetric, positive-definite matrix. The input vector $x_0$ can be an approximate initial solution or 0.
The pseudocode of the conjugate gradient solver is shown in Algorithm 1.
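Since Algorithm 1 is not reproduced in this text, the following Python sketch shows the standard unpreconditioned conjugate gradient iteration for a dense symmetric positive-definite matrix stored as a list of rows. It is an illustration, not the authors' C++ implementation: the helper names, the default tolerance, and the use of the Euclidean residual norm as the stopping test are assumptions.

```python
import math


def mat_vec(a, x):
    """Dense matrix-vector product."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def conjugate_gradient(a, b, x0=None, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A.

    Returns (x, iterations). x0 may be a previous solution or None (zeros).
    """
    x = [0.0] * len(b) if x0 is None else list(x0)
    r = [bi - axi for bi, axi in zip(b, mat_vec(a, x))]  # residual b - A x
    p = r[:]                                             # first search direction
    rs_old = dot(r, r)
    iterations = 0
    while math.sqrt(rs_old) >= tol and iterations < max_iter:
        ap = mat_vec(a, p)
        alpha = rs_old / dot(p, ap)                      # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old                           # keeps directions A-conjugate
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
        iterations += 1
    return x, iterations


a = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x, iterations = conjugate_gradient(a, b)
print(x, iterations)
```

Passing a previous solution as `x0` is all that is needed to turn this into the incremental variant: if `x0` already satisfies the tolerance, the loop body never runs.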

Precondition of Conjugate Gradient Method.
In most cases, preconditioning is necessary to ensure fast convergence of the conjugate gradient method. The preconditioned conjugate gradient method takes the following form.
We consider a preconditioned system $M^{-1}Ax = M^{-1}b$, where $M$ is a nonsingular matrix. Jacobi preconditioning: the simplest preconditioner consists of just the diagonal of the matrix, $M = \mathrm{diag}(A)$. This is known as the Jacobi preconditioner. The SSOR preconditioner, like the Jacobi preconditioner, can be derived from the coefficient matrix without any work. If the original, symmetric matrix is decomposed as $A = D + L + L^T$ into its diagonal, lower, and upper triangular parts, the SSOR matrix is defined as $M = (D + L) D^{-1} (D + L)^T$. The pseudocode of the preconditioned conjugate gradient solver is shown as Algorithm 1.
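The two preconditioners can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the SSOR form uses relaxation factor $\omega = 1$, that is, $M = (D + L) D^{-1} (D + L)^T$, the matrix is dense and stored as a list of rows, and all function names are hypothetical. Applying $M^{-1}$ never forms an inverse: Jacobi divides by the diagonal, and SSOR performs one forward and one backward triangular sweep.

```python
import math


def mat_vec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def jacobi_apply(a, r):
    """z = M^{-1} r with M = diag(A)."""
    return [ri / a[i][i] for i, ri in enumerate(r)]


def ssor_apply(a, r):
    """z = M^{-1} r with M = (D + L) D^{-1} (D + L)^T (omega = 1 assumed)."""
    n = len(r)
    u = [0.0] * n
    for i in range(n):  # forward sweep: (D + L) u = r
        u[i] = (r[i] - sum(a[i][j] * u[j] for j in range(i))) / a[i][i]
    z = [0.0] * n
    for i in reversed(range(n)):  # backward sweep: (D + L)^T z = D u
        upper = sum(a[j][i] * z[j] for j in range(i + 1, n))
        z[i] = (a[i][i] * u[i] - upper) / a[i][i]
    return z


def preconditioned_cg(a, b, apply_minv, x0=None, tol=1e-10, max_iter=1000):
    """Preconditioned CG; apply_minv(a, r) returns M^{-1} r."""
    x = [0.0] * len(b) if x0 is None else list(x0)
    r = [bi - axi for bi, axi in zip(b, mat_vec(a, x))]
    z = apply_minv(a, r)
    p = z[:]
    rz_old = dot(r, z)
    iterations = 0
    while math.sqrt(dot(r, r)) >= tol and iterations < max_iter:
        ap = mat_vec(a, p)
        alpha = rz_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        z = apply_minv(a, r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz_old) * pi for zi, pi in zip(z, p)]
        rz_old = rz_new
        iterations += 1
    return x, iterations


a = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x_jacobi, it_jacobi = preconditioned_cg(a, b, jacobi_apply)
x_ssor, it_ssor = preconditioned_cg(a, b, ssor_apply)
print(it_jacobi, it_ssor)
```

The solver takes the preconditioner as a function argument, which mirrors the way the solvers in the experiments are switched without changing the rest of the program.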

Incrementally Inheriting the Initial Estimate from the Previous Solution.
The HotSpot thermal floorplan uses the iterative Simulated Annealing algorithm. The Simulated Annealing algorithm changes the placement from one configuration to the next by moving only one or two blocks, so most blocks' temperatures change only slightly. If the initial estimate is inherited from the previous temperature result, the number of iterations needed for convergence can be reduced. This is why the SA framework floorplan algorithm employs the iterative conjugate gradient solver to accelerate convergence for the thermal model. The pseudocode of the SA thermal floorplan algorithm with the conjugate gradient thermal solver is shown in Algorithm 2.
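Since Algorithm 2 is not reproduced in this text, the sketch below imitates only the solver part of the SA loop: a dense symmetric positive-definite matrix is perturbed in one symmetric pair of entries per step, and each system is solved twice, once from a zero initial estimate and once from the previous solution. The test matrix, the perturbation sequence, and all names are made up for illustration and do not come from HotSpot.

```python
import math


def mat_vec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def cg(a, b, x0, tol=1e-10, max_iter=1000):
    """Conjugate gradient from initial estimate x0; returns (x, iterations)."""
    x = list(x0)
    r = [bi - axi for bi, axi in zip(b, mat_vec(a, x))]
    p = r[:]
    rs_old = dot(r, r)
    iterations = 0
    while math.sqrt(rs_old) >= tol and iterations < max_iter:
        ap = mat_vec(a, p)
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
        iterations += 1
    return x, iterations


n = 6
# Dense, symmetric, strictly diagonally dominant (hence positive definite).
a = [[float(n) if i == j else 1.0 / (1 + abs(i - j)) for j in range(n)]
     for i in range(n)]
b = [1.0] * n
moves = [(0, 1), (2, 4), (1, 5), (3, 4)]  # hypothetical SA perturbations

cold_iters, warm_iters = [], []
x_prev = [0.0] * n  # the first solve has no previous solution to inherit
for step in range(len(moves) + 1):
    x_cold, it_cold = cg(a, b, [0.0] * n)  # always restart from zero
    x_warm, it_warm = cg(a, b, x_prev)     # inherit the previous solution
    cold_iters.append(it_cold)
    warm_iters.append(it_warm)
    x_prev = x_warm
    if step < len(moves):
        i, j = moves[step]
        a = [row[:] for row in a]  # next matrix differs in two entries only
        a[i][j] += 0.05
        a[j][i] += 0.05

print("cold start iterations:", cold_iters)
print("warm start iterations:", warm_iters)
```

Both variants reach the same solution within the tolerance. How many iterations the inherited estimate saves depends on the matrices, the size of the perturbation, and the tolerance, so the printed counts should be read as an illustration rather than as a reproduction of the reported timings.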

Stopping Criteria for Iteration Solver.
The residual norm $r_{norm}$ is computed from the residual $r$ at every iteration. The iterations terminate as soon as $r_{norm} < aim_{precision}$, that is, once the residual norm falls below the desired precision.
The overall iteration cost of the conjugate gradient solver in the floorplan is $O(k^3)$ [17], and that of the preconditioned conjugate gradient solver is $O(k^{2.5})$; in a two-dimensional problem $n = k^2$, and in a three-dimensional problem $n = k^3$.
The linear equations can be solved by LU decomposition in time $2n^3/3$. The analysis and the experiments both support the advantage of the conjugate gradient incremental updating solution method in the SA iterative framework algorithm of the thermal floorplan.

Results of Experiments
The conjugate gradient algorithm is integrated into the open source HotSpot floorplan [15], written in the C and C++ programming languages. The HotSpot thermal floorplan uses the SA (Simulated Annealing) [18] optimization algorithm. The experiments run on Ubuntu with an Intel Core i5-2300 CPU at 2.80 GHz and 12 GB of memory. The benchmarks are the MCNC [19] benchmark circuits. The block power trace is generated by the random function of a Perl script, and the power density of each block ranges from $10^5$ W/m$^2$ to $10^7$ W/m$^2$ [14].
Comparing the CPU time of two linear solvers within the same program is more convincing; the conjugate gradient algorithm is implemented in C++ and merged into the hot floorplan [15], where it is compared with the hot floorplan's default LU decomposition solver. The two linear solvers are switched by a command line parameter, so the run environment of the program, such as the CPU, memory, GCC version, and compile options, is the same.
Table 1 shows that the run time of the conjugate gradient solver without preconditioning (CG normal in the table) is close to that of the LU decomposition solver on the MCNC cases; run times are given in seconds. The conjugate gradient solver without preconditioning achieves an average speedup of about 1.49x over the LU decomposition solver; the average time of a single LU decomposition solve is 0.00387 second (3.87E − 03 in the table), and the average time of a single solve by the conjugate gradient solver without preconditioning is 0.00191 second (1.91E − 03 in the table).
Table 2 shows that the conjugate gradient solver with Jacobi preconditioning runs faster than the LU decomposition solver, with an average speedup of 4.32x. The average time of a single solve by the conjugate gradient solver with Jacobi preconditioning is 0.000567 second (5.67E − 04 in the table).
Table 3 shows that the run time of the conjugate gradient solver with SSOR preconditioning is lower still than that of the conjugate gradient solver with Jacobi preconditioning; the conjugate gradient solver with SSOR preconditioning achieves an average speedup of 5.18x over the LU decomposition solver, and the average time of a single solve is 0.000473 second (4.73E − 04 in the table). The speedup is 11x for test case ami49. The conjugate gradient solver with SSOR preconditioning achieves the best speedup ratio relative to the LU decomposition solver.
Figure 1 shows the run time of the conjugate gradient solver versus the LU decomposition solver on the MCNC benchmark. Naming in the figure: CG is the conjugate gradient solver without preconditioning; Jacobi is the conjugate gradient solver with Jacobi preconditioning; SSOR is the conjugate gradient solver with SSOR preconditioning. The conjugate gradient solver without preconditioning takes longer than the LU decomposition solver on the two small cases and less time on the large cases; the conjugate gradient solver with Jacobi or SSOR preconditioning takes less time than the LU decomposition solver. Preconditioning is important for the iterative conjugate gradient solver.
Figure 2 shows the run time ratio of the conjugate gradient solver to the LU decomposition solver on the MCNC benchmark. The naming rules are consistent with Figure 1. The experimental ratio curve shows that the acceleration of our iterative conjugate gradient solver grows with the number of modules.
Following a reviewer's suggestion, we also ran the larger GSRC benchmark examples to test the scalability of our algorithm. However, we need to emphasize that the HotSpot floorplan is designed for CPU floorplans with a small number of modules and is not suited to large placement instances, so the running time is relatively long. Table 4 shows that the conjugate gradient solver with SSOR preconditioning runs faster than the LU decomposition solver, with an average speedup of 17x and a speedup of 24x for test case n300. The conjugate gradient solver with SSOR preconditioning achieves the best speedup ratio relative to the LU decomposition solver.

Conclusions and Future Work
The conjugate gradient solver is often used for large sparse matrix computations, and the HotSpot thermal floorplan can be sped up by using a sparse-matrix iterative linear solver.
The experiments show that the iterative conjugate gradient solver is faster than the direct LU decomposition solver on the MCNC benchmark.
The relative sparse matrix theory could be applied to other iterative framework algorithms, and the relative sparse matrix concept is extensible. Future work may include the following: (i) Extend the iterative linear solver with other preconditioning methods. (ii) Explore sparse linear system theory to speed up our program, because there have been many theoretical innovations in solving linear systems in the last two decades.

Figure 1: Run time comparison for different block numbers.

Table 1: The conjugate gradient solver without precondition.

Table 2: The conjugate gradient solver with the Jacobi precondition.

Table 4: The conjugate gradient solver with SSOR precondition run time on the GSRC benchmark.