Scan chain insertion is one of the mandatory logic insertion tasks in a design flow. Scanning a design is a very efficient way of improving its testability, but it impacts size and performance, depending on the stitching ordering of the scan chain. In this paper, we propose a graph-based stitching algorithm for automatic and optimal scan chain insertion at the RTL. Our method is divided into two main steps. The first one builds graph models for inferring logical proximity information from the design, and the second one uses classic approximation algorithms for the traveling salesman problem to determine the best scan-stitching ordering. We show how this algorithm decreases the cost of both scan analysis and implementation, by measuring total wirelength on placed and routed benchmark designs, both academic and industrial.
The design flow of an integrated circuit (IC), meaning the software applications that allow the designer to move from its specification to its concrete realization, involves many stages of optimization problems, usually from system level to layout [
With both recent advances of the semiconductor industry and new market constraints, Time To Market (TTM) and product quality are becoming major issues. The circuit must flawlessly meet customer expectations in terms of functionality, speed, quality, reliability, and cost. In such a challenging economic environment, and given the significant level of complexity reached by ICs, manufacturing testing is more than ever an important factor in the design problem. Today, chip testing should be short, efficient, and cost-effective. A significant amount of research on complex design-for-test problems is ongoing in both universities and industry. Design For Test (DFT) techniques are becoming key, since they are considered during the chip design process and flow. The cost of manufacturing chips is now comparable to the cost of testing them, so the semiconductor community needs low-cost, high-quality test solutions. New and efficient DFT solutions are greeted with higher expectations than ever. Most current DFT solutions result in unpredictable design development time and costs, and directly impact TTM.
In this context, DFT tools are receiving growing attention with the advent of core-based System On Chip (SOC) design. In particular, when cores from different vendors are integrated together on the system on chip (as is often done nowadays), the difficulty of testing grows rapidly. Among the most apparent issues are core access, system diagnosis, test reuse, test compaction, tester qualification and Intellectual Property (IP) protection. The ITRS (International Technology Roadmap for Semiconductors) identifies key technological challenges and needs facing the semiconductor industry through the end of the next decade. Difficult near-term and long-term testing and test equipment challenges were reported in [
Several of the DFT solutions for ASIC and ASIP designs are based on the internal Scan DFT technique. Scan is potentially an efficient technique, if used properly. Currently, full Scan is the most widely used structured DFT approach where all the design sequential elements belong to the scan architecture or scan chains [
However, traditional Scan solutions make such an extension very difficult, since engineers need to handle gate-level netlists without benefiting from what happens during synthesis optimization. Also, Scan implementation decisions are considered post synthesis. This is too late in comparison to RTL design decisions. Adopting Scan at the Register Transfer Level (RTL) will cover new design and manufacturing needs, strengthen verification, and consolidate reusable design methodologies by closing the gap between RTL and design for test. Inserting scan at the RTL has many advantages, such as benefiting from the synthesis process (i.e., better optimization in terms of area and timing) and the ability to debug testability issues early in the design flow. The possibility to insert scan at the RTL dates back to the late nineties [
To analyze the optimal design location where full scan chains need to be inserted, the following considerations are required. First, to implement a scan chain, one has to have knowledge of the Flip-Flops (FFs) of the design. Second, the best place to insert any new task in a flow is at the earliest possible point. The motivations for an early scan insertion are threefold: the complexity of the objects grows as one goes forward in a flow, so data handling gets more costly; whatever is added to the design is better integrated if integration happens earlier in the flow; and finally, should the treatment lead to iterations (either redesigning or iterations in the flow), the sooner the iterations, the lower the cost. Third, at the point where insertion is finalized, all the information required for the insertion has to be available. If this information depends at least partly on the insertion itself, this leads to iterations (see below). Fourth, insertion can start at some level in the flow and end at a later point in the same flow. Again, this can be undesirable, but also unavoidable once some design decisions on the insertion process have been taken. Since FFs are usually known after synthesis, once a gate-level netlist is available, this sounds like a reasonable point in the flow to insert scan. But the replacement of normal FFs by scan FFs has to be done before synthesis ends, since timing closure is affected by it. We will see in the next section that FF detection can happen before synthesis.
On the other hand, we will argue in Section
The traditional way to insert full scan in a design is to replace the FFs by scanned FFs during synthesis; possibly connect them into a chain; and then during place and route iterations, (re)connect them to try to minimize the total Manhattan length of the added connections. The exact place where one should reoptimize the ordering of the chain is a matter of debate, and heavily depends on the flow used. For an in-depth discussion of this topic in the case of the Synopsys flow, see [
Most of the work done on scan chains insertion and stitching ordering optimization assumes that placement information is available [
The knowledge of FFs is necessary to implement a scan chain. It has been suggested several times independently that one does not need a netlist to have knowledge of FFs of the design; this is the basis of higher-level scan insertion [
Most studies on higher-level scan either do not mention optimization [
In this paper, we try to close the gap between RTL Scan insertion and actual scan stitching optimization, while retaining the model of one-time insertion without any later reoptimization; see Figure
Classical scan insertion versus our method.
More specifically, we tackle the following optimization problem: given an RTL description of the design (Verilog or VHDL) we have to find a stitching ordering of the memory elements in the scan chain, which minimizes the impact of test circuitry on chip features (area, power) while keeping testing time at its minimum.
The remainder of this paper is organized as follows. Section
To the best of our knowledge, this work is the first to offer a formal treatment of the scan stitching ordering problem as a discrete optimization problem in the case of RT-level scan insertion. We formalize the problem; give reductions from the several-chains problem to the one-chain problem; and solve the one-chain problem in two steps. The first step, which can be seen as a kind of preprocessing, allows us to build a graph representative of the actual optimization. Then in a second step, algorithms for the TSP are adapted to actually build the chain.
A part of this work has been presented before in [
The main contributions of the paper are summarized below: we analyze what is a suitable objective for measuring the quality of scan stitching orderings; we give a mathematical formulation of the problem of scan insertion at the register transfer level (RTL); we give two reduction procedures to solve the several-chains variants based on a routine solving the one-chain case; we solve the one-chain case with a two-step approach; and we evaluate our algorithm on both academic and industrial designs.
Our algorithm provides a basis for considering the implementation of scan chains as soon as the RTL of a block is available; the authors think that the lack of optimization has been a big obstacle to its adoption in design flows. The methodology has been validated by our industrial partner DeFacTo Technologies in the tool HiDFT-SIGNOFF. It is a first step towards considering the integration of scan at the RTL.
We illustrate Scan insertion at RTL with the VHDL language. Scan insertion is a three-step process: first, identify which signals and variables will give rise to flip-flops in the netlist; second, decide which ordering will be used to chain the FFs together; third, change the RTL code to offer a (new) testing mode for the design.
Since the object of this work is the second step of this whole process, we now restrict our attention to its first and third steps. This section is included mainly to make this article self-contained; for a more thorough presentation, the reader is referred to [
We first present how flip-flops are identified; then we illustrate RTL editing for introducing scan at RTL; finally, we examine what could be a good measure of the impact of scan insertion on a design.
FF identification is performed process by process. Each process is searched for variables and signals. All signals in a process will be translated into a FF in the netlist. To determine which variables will give rise to FFs, one has to first identify clocking signals. Then two cases can happen. To determine which applies in a particular case, one has to recall that processes execute cyclically; see Figure
Two use cases for variables in VHDL.
Either the variable is assigned between the clocking event in the process and the point where the variable is used (case 1). In that case, no FF is generated. Or the clocking event lies between the point where the variable is assigned and that where the variable is used (case 2). In that case, memorization has to occur, and a FF is generated.
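The two cases can be sketched as a small check. This is only an illustration under stated assumptions, not the tool's implementation: it assumes the process body has already been flattened into an ordered list of events, and the event kinds (`"clock"`, `"assign"`, `"use"`) and the function name `needs_flip_flop` are hypothetical.

```python
def needs_flip_flop(events, var):
    """Return True if `var` must be memorized across a clocking event.

    `events` is the process body in textual order, as (kind, name) pairs
    with kind in {"clock", "assign", "use"} (hypothetical encoding).
    The process is assumed to execute cyclically, so the list wraps around.
    """
    n = len(events)
    for i, (kind, name) in enumerate(events):
        if kind != "use" or name != var:
            continue
        # Walk backwards (cyclically) from the use towards the latest assignment.
        for step in range(1, n + 1):
            k, nm = events[(i - step) % n]
            if k == "clock":
                return True   # case 2: the clock lies between assignment and use
            if k == "assign" and nm == var:
                break         # case 1: assigned after the clock, before the use
    return False
```

For instance, a variable assigned right after the clocking event and then used falls under case 1, while a variable used before it is reassigned in the same cycle falls under case 2.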
To illustrate the introduction of a scan chain at RTL through RTL code editing, we consider the simple process in Algorithm
process(clock)
begin
  if clock'event and clock = '1' then
  end if;
end process;
In order to introduce a testing mode behavior to the design, we simply describe in VHDL what the process is going to do in test mode. The additional VHDL code is shown in Algorithm
process(clock)
begin
  if clock'event and clock = '1' then
  end if;
end process;
In most of the literature about scan ordering optimization, wirelength is the objective used to guide the optimization process. In these works, placement of the FFs is known; so wirelength is a quantity that is directly available during optimization. We will give our own argument for considering wirelength the most useful parameter to optimize stitching orderings.
When adding logic to a design for non-functional reasons, one tries to minimize the increase in size of the design. Additional area due to scan comes in two parts. First, FFs have to be instrumented, either by turning them into scanned FFs or with the help of a multiplexer at their input. In the former case, the area cost depends only on the number of FFs in the design; in the latter, the area cost can be less than the maximum if some optimization takes place during synthesis. In both cases, that cost is bounded independently of the stitching ordering. Second, wires have to be added between the output net of a FF and the next FF in the scan chain (either the scanned FF, or one input of the multiplexer in front of it). This does not represent an area cost in itself, since routing happens in the upper metal layers. But it does add to the difficulty of placement and routing, and can lead to a severely degraded situation if the stitching order is not chosen with care. In that case, the area can grow if routing is no longer possible within the available space. We consider this a much more important factor than the additional cell area; therefore our measure of ordering quality will be based on it.
It is not possible to express the impact of added wires on placement and routing as a simple function of these wires; the impact will depend also on the sparsity of the design. But since we take a worst-case approach to the impact of wires, we will use the added wirelength due to scan insertion as our optimization function.
This choice will be validated by the variability that is observed on this parameter—see Section
We now quickly review all the other parameters that scan insertion could have an impact on.
To help ensure timing closure, the maximum length of the added wires would be a reasonable measure. But since in our case synthesis happens after scan insertion, it will be up to the synthesis tool (and later on, the placement and routing tools) to ensure timing closure, using all the flexibility of gate sizing to achieve it. Note also that ensuring a low total added wirelength means that not too many added wires will be long, which lowers both the impact on timing and the additional load on the synthesis tool.
Power in functional mode grows with wirelength, so minimizing wirelength will also lower the impact on functional power. Power in test mode is dominated by switching: the values shifted through the scan chain force the values in FFs to change more often than they would in functional mode, leading to power supply issues. A common remedy is to take advantage of don't-care values in test vectors, filling them in a way that minimizes switching. This also works with RTL scan. There have been many attempts in other works at minimizing power during test mode by changing the stitching ordering. We have not followed this trail; it would be worthwhile to try combining it with our approach based on minimizing wirelength.
Finally, observability and controllability are not impacted by the stitching ordering. We note here that although the combinational parts of the circuit have the same observability and controllability in a full scan design whether it is implemented at gate level or at RT-level, in the case of RT-level scan there will be more stuck-at faults, since the multiplexers are no longer merged with the FFs. Hence fault coverage values tend to differ, although not by much, between the two methods of scan insertion.
Having established that wirelength should be minimized, we now turn to the precise description of the optimization problem we will consider in order to try to find low wirelength stitching orderings.
To determine the scan chains that best meet the needs of different integrated circuit designers, while respecting the constraints and restrictions related to the electronic problem, we propose a new scan stitching algorithm for the automatic insertion of optimal scan chains at the RTL. Our algorithm is structured into two phases, as shown in Figure
Stitching algorithm for optimal scan chain at RTL.
The challenge here is to be able to take into consideration still at high level what is going to happen later on during the placement and routing (P&R) steps. To build this model, we extract from the RTL description information on the expected proximity of the memory elements in the layout.
This phase is divided into three steps, as follows.
First, we perform a preprocessing step called Elaboration. Although scan logic can be described at RTL, the very elements that fault models refer to (nets, gates and FFs) do not exist yet at this level. Hence we first need to translate the RTL code into these elements. This is what synthesis does. But it is not feasible to have a full synthesis at this step in the flow: once RTL scan is inserted, the synthesis step still has to be done, and we do not want to duplicate that effort.
Our solution relies on a lightweight synthesis as the first step of the RTL scan insertion. This synthesis is done in terms of a virtual library of generic (non-physical) gates; it does not try to optimize logic, timing or gate sizes; it does not need the user to do any fine-tuning; in short, it is done transparently. The user of the scan insertion tool only provides his RTL code, and will get back a scanned RTL code: he will never see the netlist that comes out of the lightweight synthesis. Indeed, in our method, this netlist serves only one purpose, namely graph extraction.
The second step after the design elaboration is to build the undirected
Example design and associated graphs
Note that the vertices are partitioned into two sets: nets on one hand, and gates and memory elements on the other; every edge in the graph has one end on each side. Therefore,
The graph
The third step is to extract information from the design that is necessary for scan chain stitching optimization (second phase). This information will be given in the form of an edge-valuated Proximity Graph (in short
The
Path lengths are restricted to a threshold value
To compute
Figure
Using this new formalism, the chaining of memory elements in the circuit corresponds to a partition of the vertices of
In the case of a single scan chain, our problem reduces to the Traveling Salesman Problem [
Also, single scan chains are not an option for big designs, where testing time would be prohibitive if scanning were not done using more than one chain.
Before explaining the second phase, we present two algorithmic devices to reduce the general chaining problem to the single-chain case. Then we show how our problem can be reduced to the Traveling Salesman Problem and present some algorithms to solve it. Finally, we describe the two steps of the second phase.
The simplest way of reducing the several-chains scanning problem to its single-chain subcase is by appropriate post-processing. Once one has computed a single scan chain for the whole design, this chain can be split into the desired number of segments (the actual scan chains). This device is really fast, and it is not expected that the quality of the output of the whole process will be much degraded as compared to that of the single-chaining algorithm. Also, it is very easy to implement, and is our recommendation for the cases where fast enough single-chaining algorithms are available.
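The chain-splitting device can be sketched in a few lines. This is a minimal sketch: the function name `split_chain` is hypothetical, and cutting into segments of nearly equal length is an assumption made here (test time is driven by the longest chain), not something the text above prescribes.

```python
def split_chain(order, num_chains):
    """Split one stitching order into `num_chains` contiguous segments
    whose lengths differ by at most one (assumed balancing policy)."""
    if num_chains < 1:
        raise ValueError("num_chains must be positive")
    base, extra = divmod(len(order), num_chains)
    chains, start = [], 0
    for i in range(num_chains):
        # The first `extra` segments take one additional element.
        size = base + (1 if i < extra else 0)
        chains.append(order[start:start + size])
        start += size
    return chains
```

Because every segment is a contiguous slice of the single chain, all stitching connections inside a segment are kept; only the connections at the cut points are dropped.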
A more elaborate device is the preprocessing of the graph
Even if the partitioning is handled by an appropriate algorithmic package, implementing this solution is not as easy as chain splitting. But using it brings with it another benefit: the TSP problems to solve in this case are restricted to the length of the scan chains. For test application reasons, these are in actuality limited; although technically feasible, one seldom meets scan chains of 10000 FFs.
Hence this solution helps ease up the problem of computing the costs matrix for the input of the TSP, which is actually the longest step of our methodology.
Another benefit is that when using partitioning, the whole method scales with only a linear increase in running times if the maximum size of scan chains is kept constant.
We now discuss how TSP algorithms can be applied for the single-chain stitching problem.
There are two distinct frameworks for applying TSP algorithms to our problem. The first one allows any algorithm to be used, but it puts constraints onto the size of input graphs that can be fed to it.
The second one is the particular case of an algorithm that can be used directly on
The input of a TSP routine is a symmetric matrix representing the costs for edges of a complete graph. In order to apply a TSP algorithm to solve the chaining problem, one needs first to compute the whole costs matrix (except diagonal elements). In our model, we attribute (as cost) to a pair
It is the preprocessing step that imposes a serious limitation on the possibility to use this scheme. Indeed, the
If only designs smaller than this are to be treated, then this method is definitely worth trying, all the more that one has then the ability to test and compare different TSP algorithms.
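To make the preprocessing cost concrete, here is a sketch of how a full cost matrix could be computed. It assumes that the cost of a pair is the length of a shortest path between the two vertices in a weighted, undirected graph given as an adjacency dictionary; the function name `cost_matrix` and the data layout are hypothetical.

```python
import heapq

def cost_matrix(adj):
    """All-pairs shortest-path costs for a weighted undirected graph.

    `adj` maps each vertex to a dict {neighbor: edge_weight}.
    Returns (vertices, matrix), where matrix[i][j] is the shortest-path
    cost between vertices[i] and vertices[j] (inf if disconnected).
    """
    vertices = sorted(adj)
    index = {v: i for i, v in enumerate(vertices)}
    n = len(vertices)
    matrix = [[float("inf")] * n for _ in range(n)]
    for src in vertices:
        # One Dijkstra run per source vertex.
        dist = {src: 0}
        heap = [(0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[u]:
                continue  # stale heap entry
            for v, w in adj[u].items():
                nd = d + w
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    heapq.heappush(heap, (nd, v))
        for v, d in dist.items():
            matrix[index[src]][index[v]] = d
    return vertices, matrix
```

The matrix has one entry per pair of vertices, so memory grows quadratically with the number of FFs and one shortest-path search is run per vertex, which is why this scheme is practical only up to a certain design size.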
A famous textbook algorithm for the TSP, and also a typical example of an approximation algorithm with guaranteed quality, the 2-approximation for the TSP also has nice properties when used on the scan chaining problem.
An algorithm for a minimization problem is called an
It may seem unintuitive that one can prove a quality bound without even knowing the value of the optimal solution. Actually, such a proof can be derived through the use of a lower bound on the value of the optimal solutions, through showing that the solution given is less than
In the case of the TSP, two approximation algorithms are found in every textbook on the subject: one has an approximation factor of 2, (we call it “the 2-approximation for the TSP”), and the other is Christofides’ algorithm, with an approximation factor of 1.5. Both use in their proof, as a lower bound on the optimal tour length, the value of a minimum spanning tree of the graph.
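The role of the spanning-tree lower bound can be written out for the factor-2 case. The notation is introduced here for illustration only: $w(\cdot)$ is total edge cost, $\mathrm{OPT}$ an optimal tour, $\mathrm{MST}$ a minimum spanning tree, and $T$ the tour returned by the algorithm; costs are assumed non-negative and to satisfy the triangle inequality.

```latex
\[
  w(\mathrm{MST}) \le w(\mathrm{OPT}),
  \qquad
  w(T) \le 2\, w(\mathrm{MST})
  \quad\Longrightarrow\quad
  w(T) \le 2\, w(\mathrm{OPT}).
\]
```

The first inequality holds because deleting one edge from an optimal tour leaves a spanning tree. The second holds because $T$ is obtained by shortcutting a walk that traverses every tree edge twice, and shortcuts cannot increase the cost under the triangle inequality.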
Please note that these approximation ratios are only proven theoretical bounds, and are not enough to compare the empirical behavior of these algorithms. Still, Christofides is bound to give solutions that are not more than 50% away from the optimal one; this would entice one to use this algorithm.
Alas, Christofides uses Minimum-Weight Perfect Matching as a subroutine, which has two impacts on its use. First, weights have to be precomputed, and we stumble against the same blocks as mentioned in the previous section. Second, Weighted Perfect Matching needs
Without any regret, we now turn to the 2-approximation algorithm, which is seldom considered a useful practical choice because of its loose guarantee and not-so-good performance on common benchmarks.
The algorithm has two steps. First, a minimum spanning tree is computed; then a root is chosen, and the tree is explored and its vertices output in postfix ordering.
Only the first step actually looks at the input graph; and its only requirement is that the graph be connected. If
Thus, using the 2-approximation, we bypass the pre-computation of the cost matrix. This means that bigger input graphs can be considered if this algorithm is used, as compared to other TSP algorithms.
Although this algorithm was meant for the TSP, that is, for an input being a complete graph with values on all edges, one can prove that the 2-approximation guarantee still holds when the algorithm is applied to a connected graph. In this case, the tour cost has to be understood as the sum of the costs of the edges, where the cost of a nonexistent edge is the length of a shortest path between its endpoints in the input graph; hence our choice to model the cost of nonexistent edges in this way.
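The two steps of the 2-approximation can be sketched as follows. This is an illustrative sketch, not the HiDFT-SIGNOFF implementation: it assumes the graph is given as an adjacency dictionary, uses Prim's algorithm for the spanning tree (any minimum spanning tree algorithm would do), and the name `two_approx_order` is hypothetical.

```python
import heapq

def two_approx_order(adj, root):
    """Order vertices by a postorder walk of a minimum spanning tree.

    `adj` maps each vertex to {neighbor: weight}; the graph must be
    connected. Only existing edges are inspected, so no full cost
    matrix is needed.
    """
    # Step 1: Prim's algorithm, recording the tree as parent -> children.
    children = {v: [] for v in adj}
    visited = {root}
    heap = [(w, root, v) for v, w in adj[root].items()]
    heapq.heapify(heap)
    while heap and len(visited) < len(adj):
        w, u, v = heapq.heappop(heap)
        if v in visited:
            continue
        visited.add(v)
        children[u].append(v)
        for x, wx in adj[v].items():
            if x not in visited:
                heapq.heappush(heap, (wx, v, x))
    if len(visited) != len(adj):
        raise ValueError("graph must be connected")

    # Step 2: iterative postorder walk; a vertex is output after its subtree.
    order, stack = [], [(root, False)]
    while stack:
        v, expanded = stack.pop()
        if expanded:
            order.append(v)
        else:
            stack.append((v, True))
            for c in reversed(children[v]):
                stack.append((c, False))
    return order
```

The returned list is a stitching order for a single chain. The spanning tree step touches each existing edge a bounded number of times, which is what lets this approach skip the quadratic cost matrix.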
The final step is the editing of the RTL code of the design. In this step, we insert into the original RTL description the additional RTL constructs required to implement the scan testing logic.
The method we propose to optimize the stitching ordering of RTL scan chains has been implemented in an experimental version of HiDFT-SIGNOFF. This tool relies on a commercial software library for the parsing and elaboration of the design. This elaboration step is what we called “lightweight synthesis” in Section
HiDFT-SIGNOFF has the ability to stitch FFs into several scan chains. This possibility is important from the practical point of view: making several scan chains is the easiest way to reduce test application time. HiDFT-SIGNOFF implements the two strategies presented in Section
For graph partitioning, the METIS library [
In order to validate our method, we did experiments on a number of designs, both in VHDL and Verilog. We used the benchmarks
Table
Designs description.
Design | No. of FF | Language |
---|---|---|
b09 | 28 | vhdl |
b10 | 17 | vhdl |
b11 | 31 | vhdl |
b12 | 121 | vhdl |
b13 | 53 | vhdl |
b14 | 245 | vhdl |
b15 | 449 | vhdl |
b17 | 1415 | vhdl |
b18 | 3320 | vhdl |
b19 | 6642 | vhdl |
Simple-Spi (SS) | 132 | verilog |
Biquad | 204 | verilog |
Ac-97 | 2289 | verilog |
Since our optimization criterion is wirelength, and since we consider congestion during routing an important issue, experiments were carried through to the place and route step. Two flows were considered, according to Figure
Table
Characteristics of the graphs
Design | No. of FF | Design graph vertices | Design graph edges | Proximity graph vertices | Proximity graph edges |
---|---|---|---|---|---|
b09 | 28 | 401 | 719 | 28 | 378 |
b10 | 17 | 525 | 947 | 17 | 136 |
b11 | 31 | 1094 | 1845 | 31 | 465 |
b12 | 121 | 3879 | 6993 | 121 | 7260 |
b13 | 53 | 633 | 1042 | 53 | 1378 |
b14 | 245 | 27269 | 46778 | 245 | 29890 |
b15 | 449 | 33179 | 56798 | 449 | 100576 |
b17 | 1415 | 100358 | 171759 | 1415 | 697990 |
b18 | 3320 | 271320 | 462326 | 3320 | 2330318 |
b19 | 6642 | 531161 | 906201 | 6642 | 7061713 |
Simple-Spi | 132 | 1427 | 8646 | 132 | 8778 |
Biquad | 204 | 357 | 20706 | 204 | 20910 |
Ac-97 | 2289 | 18727 | 2381495 | 2289 | 707911 |
The computation of the graph
The figures for both wirelength after place and route, and computation times, have been gathered in Table
Computation times are given only for the two steps of the stitching optimization method, discarding the time for parsing and elaboration. The column
We base our analysis of the comparison of wirelength between gate-level and RTL scan on the slack column of Table
Wirelengths and insertion times.
Design | No. of SC | Wirelength (GL-S) | Wirelength (RTL-S) | Slack (%) | Insertion time, graph construction (ms) | Insertion time, TSP (ms) |
---|---|---|---|---|---|---|
ITC 99 Benchmarks (VHDL) | | | | | | |
b09 | 2 | 2.06 | 1.63 | | | |
b10 | 2 | 1.94 | 2.06 | 5.97 | 10 | |
b11 | 3 | 4.77 | 5.03 | 5.38 | 20 | 10 |
b12 | 10 | 12.6 | 12.92 | 2.59 | 200 | 20 |
b13 | 5 | 3.39 | 3.32 | −2.25 | 20 | |
b14 | 24 | 63.64 | 64.2 | 0.96 | 3 s | 110 |
b15 | 24 | 126.3 | 122 | | 6 s | 320 |
b17 | 70 | 395.6 | 370.4 | | 30 s | 3 s |
b18 | 300 | 1187.2 | 922.6 | | 3 m | 10 s |
b19 | 600 | 2329.6 | 1858.2 | | 5 m | 20 s |
Opencore designs (Verilog) | | | | | | |
Simple-Spi | 10 | 11.5 | 10.4 | | 100 | 20 |
Biquad | 20 | 34.1 | 31.3 | | 350 | 80 |
Ac-97 | 200 | 247.7 | 238.5 | | 30 s | 8 s |
Table
Fault coverage for RTL scan.
Design name | Fault coverage |
---|---|
ITC 99 Benchmarks (VHDL) | |
| |
b09 | 99.86% |
b10 | 99.85% |
b11 | 99.93% |
b12 | 99.97% |
b13 | 99.92% |
b14 | 99.99% |
b15 | 99.97% |
b17 | 99.47% |
b18 | 99.81% |
b19 | 99.81% |
| |
Opencore designs (Verilog) | |
| |
Simple Spi | 98.35% |
Biquad | 99.96% |
Ac-97 | 99.80% |
Computation times are mainly dominated by the setup of the graph
After giving some motivations for inserting scan at the RT-level, we have presented what we believe is an important challenge for RTL scan, namely finding a good stitching in one single pass, working only at the RT-level. Therefore, the purpose of the present work is to solve the following problem: given an RT-level description of the design (Verilog or VHDL), we have to find a stitching ordering of the memory elements in the scan chain which minimizes the impact of test circuitry on chip performance (area, power) and testing time. We have then motivated our choice of wirelength as the prime parameter to optimize.
To solve this problem, we proposed a new scan stitching algorithm for optimal scan chain insertion at RTL. The techniques used are derived from the combinatorial optimization and operations research domains. Indeed, our algorithm is divided into two main steps. The first step proposes a mathematical model describing the electronic problem. The second one offers a resolution methodology.
The model we propose is based on graph theory. To build it, we extract from the RTL description information on the proximity of the memory elements (existing paths between flip-flops, clock domains, and various other relations extracted from hierarchical analysis) and translate it into two graphs
Finally, we integrated our tool in an industrial design flow and performed experiments over several academic and industrial design benchmarks. Numerical evidence showed that we are able to limit the cost due to scan insertion in a reasonable computing time, without impacting DFT quality, especially fault coverage. The method seems better than the traditional one for middle-sized designs. The industrial interest for our algorithms and tools is confirmed by our industrial partners. Our RT-level scan optimization algorithm has been incorporated into the tool HiDFT-SIGNOFF by DeFacTo Technologies. When flip-flops are added to or removed from existing scan chains, it is important to reflect such changes directly in the golden RTL code. The DeFacTo tool HiDFT-SIGNOFF allows this by introducing the new scan architecture and including the new flip-flop ordering. In this way, our work represents progress in the state of the art, as previous works are not automated and/or were evaluated only on small designs.
Last but not least, an important contribution of our work is to make a neat separation between the DFT problem and the mathematical model. This separation allows the same software to work for several successive technology nodes.
Our initial results give rise to a number of new directions for further research. These are summarized below. First, one could investigate whether combining local and global optimization in the manner of [