Scan chain insertion is one of the mandatory logic insertion tasks in design. Scanning a design is a very efficient way of improving its testability, but it impacts size and performance, depending on the stitching ordering of the scan chain. In this paper, we propose a graph-based stitching algorithm for automatic and optimal scan chain insertion at the RTL. Our method is divided into two main steps. The first builds graph models to infer logical proximity information from the design; the second uses classic approximation algorithms for the traveling salesman problem to determine the best scan-stitching ordering. We show how this algorithm decreases the cost of both scan analysis and implementation by measuring total wirelength on placed and routed benchmark designs, both academic and industrial.
The design flow of an integrated circuit (IC), meaning the set of software applications that allow the designer to move from specification to concrete realization, involves many optimization problems at successive stages, usually from system level to layout [
With both recent advances in the semiconductor industry and new market constraints, Time To Market (TTM) and product quality are becoming major issues. The circuit must flawlessly meet customer expectations in terms of functionality, speed, quality, reliability, and cost. In such a challenging economic environment, and given the significant level of complexity reached by ICs, manufacturing testing is more than ever an important factor in the design problem. Today, chip testing should be short, efficient, and cost-effective. A significant amount of research is ongoing on complex design-for-test problems in both universities and industry. Design For Test (DFT) techniques are becoming key since they are considered during the chip design process and flow. The cost of manufacturing chips is now comparable to the cost of testing them, so the semiconductor community needs low-cost, high-quality test solutions. New and efficient DFT solutions are greeted with higher expectations than ever. Most current DFT solutions result in unpredictable design development time and development costs, and directly impact TTM.
In this context, DFT tools are receiving growing attention with the advent of core-based System On Chip (SOC) design. In particular, when cores from different vendors are integrated together on the system on chip (as is often done nowadays), the difficulty of testing grows rapidly. Among the most apparent issues are core access, system diagnosis, test reuse, test compaction, tester qualification and Intellectual Property (IP) protection. The ITRS (International Technology Roadmap for Semiconductors) identifies key technological challenges and needs facing the semiconductor industry through the end of the next decade. Difficult near-term and long-term testing and test equipment challenges were reported in [
Several of the DFT solutions for ASIC and ASIP designs are based on the internal Scan DFT technique. Scan is potentially an efficient technique, if used properly. Currently, full Scan is the most widely used structured DFT approach where all the design sequential elements belong to the scan architecture or scan chains [
However, traditional Scan solutions make such an extension very difficult, since engineers need to handle a gate-level netlist without benefiting from what happens during synthesis optimization. Also, Scan implementation decisions are considered post synthesis. This is too late in comparison to RTL design decisions. Adopting Scan at the Register Transfer Level (RTL) will cover new design and manufacturing needs, strengthen verification, and consolidate reusable design methodologies by closing the gap between RTL and design for test. There are many advantages to inserting scan at the RTL, such as benefiting from the synthesis process (i.e., better optimization in terms of area and timing) and the ability to debug testability issues early in the design flow. The possibility to insert scan at the RTL dates back to the late nineties [
To determine the best point in the design flow at which to insert full scan chains, the following considerations are required. First, to implement a scan chain, one has to have knowledge of the flip-flops (FFs) of the design. Second, the best place to insert any new task in a flow is at the earliest possible point. The motivations for an early scan insertion are threefold: the complexity of the objects grows as one goes forward in a flow, so data handling gets more costly; whatever is added to the design is better integrated if integration happens earlier in the flow; and finally, should the treatment lead to iterations (either redesign or iterations in the flow), the sooner the iterations, the lower the cost. Third, at the point where insertion is finalized, all the information required for the insertion has to be available. If this information depends at least partly on the insertion itself, this leads to iterations (see below). Fourth, insertion can start at some level in the flow and end at a later point in the same flow. Again, this can be undesirable, but also unavoidable once some design decisions on the insertion process have been taken. Since FFs are usually known after synthesis, once a gate-level netlist is available, this sounds like a reasonable point in the flow to insert scan. But the replacement of normal FFs by scan FFs has to be done before synthesis ends, since timing closure is affected by it. We will see in the next section that FF detection can happen before synthesis.
On the other hand, we will argue in Section
The traditional way to insert full scan in a design is to replace the FFs by scanned FFs during synthesis; possibly connect them into a chain; and then, during place and route iterations, (re)connect them to try to minimize the total Manhattan length of the added connections. The exact place where one should reoptimize the ordering of the chain is a matter of debate, and heavily depends on the flow used. For an in-depth discussion of this topic in the case of the Synopsys flow, see [
Also, scan implementation decisions are considered post synthesis. This is too late in comparison to RTL design decisions. Adopting scan at the RTL level will cover new design and manufacturing needs, strengthen verification and consolidate reusable design methodologies by closing the gap between RTL and design for test. There are many advantages to inserting scan at the RTL level like benefiting from the synthesis process (i.e., better optimization in terms of area and timing).
Most of the work done on scan chains insertion and stitching ordering optimization assumes that placement information is available [
The knowledge of FFs is necessary to implement a scan chain. It has been suggested several times independently that one does not need a netlist to have knowledge of the FFs of the design; this is the basis of higher-level scan insertion [
Most studies on higher-level scan either do not mention optimization [
In this paper, we try to close the gap between RTL scan insertion and actual scan stitching optimization, while retaining the model of one-time insertion without any later reoptimization; see Figure
Classical scan insertion versus our method.
More specifically, we tackle the following optimization problem: given an RTL description of the design (Verilog or VHDL) we have to find a stitching ordering of the memory elements in the scan chain, which minimizes the impact of test circuitry on chip features (area, power) while keeping testing time at its minimum.
The remainder of this paper is organized as follows. Section
To the best of our knowledge, this work is the first to offer a formal treatment of the scan stitching ordering problem as a discrete optimization problem in the case of RT-level scan insertion. We formalize the problem; give reductions from the several-chains problem to the one-chain problem; and solve the one-chain problem in two steps. The first step, which can be seen as a kind of preprocessing, allows us to build a graph representative of the actual optimization. Then, in a second step, algorithms for the TSP are adapted to actually build the chain.
A part of this work has been presented before in [
The main contributions of the paper are summarized below:
we analyze what is a suitable objective for measuring the quality of scan stitching orderings,
we give a mathematical formulation of the problem of scan insertion at the register transfer level (RTL),
we give two reduction procedures to solve the several-chains variants based on a routine solving the one-chain case,
we solve the one-chain case in a two-step approach,
we evaluate our algorithm on both academic and industrial designs.
Our algorithm provides a basis for considering the implementation of scan chains as soon as the RTL of a block is available; the authors think that the lack of optimization has been a big obstacle to its adoption in design flows. The methodology has been validated by our industrial partner DeFacTo Technologies in the tool HiDFTSIGNOFF; it is a first step towards considering the integration of scan at the RTL.
We illustrate scan insertion at RTL with the VHDL language. Scan insertion is a three-step process: first, identify which signals and variables will give rise to flip-flops in the netlist; second, decide which ordering will be used to chain the FFs together; third, change the RTL code to offer a (new) testing mode for the design.
Since the object of this work is the second step of this process, we restrict our attention in this section to its first and third steps. This section is included mainly to make this article self-contained; for a more thorough presentation, the reader is referred to [
We first present how flipflops are identified; then we illustrate RTL edition for introducing scan at RTL; finally, we examine what could be a good measure of the impact of scan insertion on a design.
FF identification is performed process by process. Each process is searched for variables and signals. Every signal in a process is translated into an FF in the netlist. To determine which variables will give rise to FFs, one first has to identify clocking signals. Then two cases can happen. To determine which applies in a particular case, one has to recall that processes execute cyclically; see Figure
Two use cases for variables in VHDL.
Either the variable is assigned between the clocking event in the process and the point where the variable is used (case 1). In that case, no FF is generated. Or the clocking event lies between the point where the variable is assigned and that where the variable is used (case 2). In that case, memorization has to occur, and a FF is generated.
To illustrate the introduction of a scan chain at RTL through RTL code edition, we consider the simple process in Algorithm
process(clock)
begin
  if clock'event and clock = '1' then
  end if;
end process;
In order to introduce a testing mode behavior to the design, we simply describe in VHDL what the process is going to do in test mode. The additional VHDL code is shown in Algorithm
process(clock)
begin
  if clock'event and clock = '1' then
  end if;
end process;
In most of the literature about scan ordering optimization, wirelength is the objective used to guide the optimization process. In these works, placement of the FFs is known; so wirelength is a quantity that is directly available during optimization. We will give our own argument for considering wirelength the most useful parameter to optimize stitching orderings.
When adding logic to a design for nonfunctional reasons, one tries to minimize the increase in size of the design. Additional area due to scan comes in two parts. First, FFs have to be instrumented either into scanned FFs, or with the help of a multiplexer at their input. In the former case, the area cost depends only on the number of FFs in the design; in the latter, the area cost can be less than the maximum if some optimization takes place during synthesis. In both cases, that cost is bounded independently from the stitching ordering. Second, wires have to be added between the output net of a FF and the next FF in the scan chain—either the scanned FF, or one input of the multiplexer in front of it. This does not represent an area cost in itself, since routing happens in the upper metallic layers. But it does add to the difficulty of placement and routing, with the possibility to have a very degraded situation if the stitching order is not chosen with care. In that case, the area can grow if routing is not possible anymore with the available space. We consider this a much more important factor than the additional cell area; therefore our measure of ordering quality will be based on it.
It is not possible to express the impact of added wires on placement and routing as a simple function of these wires; the impact will also depend on the sparsity of the design. But since we take a worst-case approach to the impact of wires, we will use the added wirelength due to scan insertion as our objective function.
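As a minimal sketch of how this objective can be evaluated once a placement is known, the function below sums the Manhattan distances between consecutive flip-flops of a stitching order. The function name, the dictionary-based placement, and the coordinates are illustrative assumptions; they are not taken from the tool described in this paper.

```python
from typing import Dict, Hashable, Sequence, Tuple

Point = Tuple[float, float]


def added_wirelength(order: Sequence[Hashable],
                     placement: Dict[Hashable, Point]) -> float:
    """Total Manhattan length of the wires linking consecutive FFs of a chain.

    `placement` maps each flip-flop to hypothetical (x, y) coordinates.
    """
    total = 0.0
    for a, b in zip(order, order[1:]):
        (xa, ya), (xb, yb) = placement[a], placement[b]
        total += abs(xa - xb) + abs(ya - yb)
    return total


# Four flip-flops placed on the corners of a 4 x 3 rectangle (made-up data).
placement = {"ff0": (0, 0), "ff1": (0, 3), "ff2": (4, 3), "ff3": (4, 0)}
perimeter_order = added_wirelength(["ff0", "ff1", "ff2", "ff3"], placement)
crossing_order = added_wirelength(["ff0", "ff2", "ff1", "ff3"], placement)
```

On this toy placement the order that follows the rectangle's sides adds less wire than the one that crosses its diagonals twice, which is the kind of variability the ordering is meant to control.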
This choice will be validated by the variability that is observed on this parameter—see Section
We now review quickly all the other parameters scan insertion could have an impact on.
To help ensure timing closure, the maximum length of the added wires would be a reasonable measure. But since in our case synthesis happens after scan insertion, it will be up to the synthesis tool (and later on, the placement and routing tools) to ensure timing closure, using all the flexibility of gate sizing to achieve it. Note also that ensuring a low total added wirelength means that not too many added wires will be long, hence lowering the impact on timing and the additional load for the synthesis tool.
Power while in functional mode grows with wirelength; so minimizing wirelength will also lower the impact on functional power. Power while in test mode is dominated by switching: the values shifted through the scan chain force the values in FFs to change more often than they would in functional mode, leading to power supply issues. A common remedy is to take advantage of don't-care values in test vectors and fill them in a way that minimizes switching. This also works with RTL scan. There have been many attempts in other works at minimizing power during test mode by changing the stitching ordering. We have not followed this trail; it would be worthwhile to try combining it with our approach based on minimizing wirelength.
Finally, observability and controllability are not impacted by the stitching ordering. We note here that although combinational parts of the circuit have the same observability and controllability in a full scan design whether implemented at gate level or at RT level, in the case of RT-level scan there will be more stuck-at faults, since the multiplexers are no longer combined with FFs. Hence fault coverage values tend to differ, although not by much, between the two methods for scan insertion.
Having established that wirelength should be minimized, we now turn to the precise description of the optimization problem we will consider in order to try to find low wirelength stitching orderings.
To determine the scan chains that best meet the needs of different integrated circuit designers, while respecting the constraints and restrictions of the electronic problem, we propose a new scan stitching algorithm for the automatic insertion of optimal scan chains at the RTL. Our algorithm is structured into two phases, as shown in Figure
Stitching algorithm for optimal scan chain at RTL.
The challenge here is to be able to take into consideration still at high level what is going to happen later on during the placement and routing (P&R) steps. To build this model, we extract from the RTL description information on the expected proximity of the memory elements in the layout.
This phase is divided into three steps, as follows.
First, we perform a preprocessing step called Elaboration. Although scan logic can be described at RTL, the very elements that fault models refer to (nets, gates and FFs) do not exist yet at this level. Hence we first need to translate the RTL code into these elements. This is what synthesis does. But it is not feasible to run a full synthesis at this step in the flow: once RTL scan is inserted, the synthesis step still has to be done, and we do not want to duplicate that effort.
Our solution relies on a lightweight synthesis as the first step of the RTL scan insertion. This synthesis is done in terms of a virtual library of generic (non-physical) gates; it does not try to optimize logic, timing or gate sizes; it does not need the user to do any fine-tuning; in short, it is done transparently. The user of the scan insertion tool only provides RTL code and gets back scanned RTL code, never seeing the netlist that comes out of the lightweight synthesis. Indeed, in our method, this netlist serves only one purpose, namely graph extraction.
The second step after the design elaboration is to build the undirected
Example design and associated graphs
Note that the vertices are partitioned into two sets: nets on one hand, and gates and memory elements on the other; every edge in the graph has one end on each side. Therefore,
The graph
The third step is to extract information from the design that is necessary for scan chain stitching optimization (second phase). This information will be given in the form of an edge-valued Proximity Graph (in short
The
Path lengths are restricted to a threshold value
To compute
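The two graph constructions of this phase can be sketched as follows. This is only an illustration under stated assumptions: the netlist is given as a dictionary from cell names to the nets on their pins, cell and net names are distinct, proximity is measured as the number of edges on a shortest path in the undirected design graph, and pairs farther apart than the threshold are simply left out. The function names `build_design_graph` and `proximity_graph` are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def build_design_graph(cells: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Undirected bipartite graph: every edge links a cell (gate or FF) to a net."""
    adj: Dict[str, Set[str]] = {}
    for cell, nets in cells.items():
        adj.setdefault(cell, set())
        for net in nets:
            adj[cell].add(net)
            adj.setdefault(net, set()).add(cell)
    return adj


def proximity_graph(adj: Dict[str, Set[str]], ffs: List[str],
                    threshold: int) -> Dict[Tuple[str, str], int]:
    """Map each FF pair (a, b) with a < b to its path length, if <= threshold."""
    ff_set = set(ffs)
    prox: Dict[Tuple[str, str], int] = {}
    for src in ffs:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if dist[u] == threshold:
                continue  # do not explore beyond the threshold
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for dst, d in dist.items():
            if dst in ff_set and src < dst:
                prox[(src, dst)] = d
    return prox


# Toy netlist: ff_a -> g1 -> ff_b -> g2 -> g3 -> ff_c (made-up names).
cells = {"ff_a": ["n1"], "g1": ["n1", "n2"], "ff_b": ["n2", "n3"],
         "g2": ["n3", "n4"], "g3": ["n4", "n5"], "ff_c": ["n5"]}
design = build_design_graph(cells)
near = proximity_graph(design, ["ff_a", "ff_b", "ff_c"], threshold=4)
```

Bounding the search keeps each traversal local, so the cost of building the proximity information depends on the threshold rather than on the size of the whole design.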
Figure
Using this new formalism, the chaining of memory elements in the circuit corresponds to a partition of the vertices of
In the case of a single scan chain, our problem reduces to the Traveling Salesman Problem [
Also, single scan chains are not an option for big designs, where testing time would be prohibitive if scanning were not done using more than one chain.
Before explaining the second phase, we present two algorithmic devices to reduce the general chaining problem to the single-chain case. Then we show how our problem can be reduced to the Traveling Salesman Problem and some algorithms to solve it. Finally, we describe the two steps of the second phase.
The simplest way of reducing the several-chains scanning problem to its single-chain subcase is by appropriate postprocessing. Once one has computed a single scan chain for the whole design, this chain can be split into the desired number of segments (the actual scan chains). This device is really fast, and it is not expected that the quality of the output of the whole process will be much degraded as compared to that of the single-chaining algorithm. Also, it is very easy to implement, and is our recommendation for the cases where fast enough single-chaining algorithms are available.
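Chain splitting is simple enough to sketch in a few lines. The version below assumes that the segments should be contiguous and as balanced as possible (lengths differing by at most one); the function name `split_chain` is hypothetical and the balancing rule is our assumption, not a documented behavior of the tool.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_chain(order: Sequence[T], n_chains: int) -> List[List[T]]:
    """Cut one stitching order into n_chains contiguous, balanced segments."""
    if not 1 <= n_chains <= len(order):
        raise ValueError("n_chains must be between 1 and the number of FFs")
    base, extra = divmod(len(order), n_chains)
    chains: List[List[T]] = []
    start = 0
    for i in range(n_chains):
        # The first `extra` chains receive one additional flip-flop.
        size = base + (1 if i < extra else 0)
        chains.append(list(order[start:start + size]))
        start += size
    return chains


chains = split_chain(["ff0", "ff1", "ff2", "ff3", "ff4", "ff5", "ff6"], 3)
```

Because the segments are contiguous, every wire kept inside a segment is one that the single-chain solution had already selected; only the wires at the cut points are dropped.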
A more elaborate device is the preprocessing of the graph
Even if the partitioning is handled by an appropriate algorithmic package, implementing this solution is not as easy as chain splitting. But using it brings with it another benefit: the TSP problems to solve in this case are restricted to the length of the scan chains. For test application reasons, these are in actuality limited; although technically feasible, one seldom meets scan chains of 10000 FFs.
Hence this solution helps ease the problem of computing the cost matrix for the input of the TSP, which is actually the longest step of our methodology.
Another benefit is that when using partitioning, the whole method scales with only a linear increase in running times if the maximum size of scan chains is kept constant.
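To make the partition-then-chain idea concrete, here is a deliberately naive sketch. The partitioner is a breadth-first region-growing routine written for illustration only: it is a stand-in for a dedicated partitioning package and makes no attempt to minimize the number of cut edges. The function name `grow_partitions` and the integer vertex labels are assumptions.

```python
from collections import deque
from typing import Dict, List


def grow_partitions(adj: Dict[int, List[int]], n_parts: int) -> List[List[int]]:
    """Split the vertices into n_parts balanced parts by growing BFS regions."""
    base, extra = divmod(len(adj), n_parts)
    unassigned = set(adj)
    parts: List[List[int]] = []
    for i in range(n_parts):
        quota = base + (1 if i < extra else 0)
        part: List[int] = []
        queue: deque = deque()
        while len(part) < quota:
            if not queue:
                # Start, or restart after a dead end, from the smallest free vertex.
                queue.append(min(unassigned))
            v = queue.popleft()
            if v not in unassigned:
                continue
            unassigned.remove(v)
            part.append(v)
            queue.extend(w for w in sorted(adj[v]) if w in unassigned)
        parts.append(part)
    return parts


# Two triangles joined by the single edge 2-3 (made-up proximity structure).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
parts = grow_partitions(adj, 2)
```

Each part can then be handed to the single-chain routine on its own, so the size of every TSP instance is bounded by the chosen chain length rather than by the size of the design.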
We now discuss how TSP algorithms can be applied to the single-chain stitching problem.
There are two distinct frameworks for applying TSP algorithms to our problem. The first one allows any algorithm to be used, but it puts constraints onto the size of input graphs that can be fed to it.
The second one is the particular case of an algorithm that can be used directly on
The input of a TSP routine is a symmetric matrix representing the costs of the edges of a complete graph. In order to apply a TSP algorithm to solve the chaining problem, one first needs to compute the whole cost matrix (except diagonal elements). In our model, we attribute (as cost) to a pair
It is the preprocessing step that imposes a serious limitation on the possibility to use this scheme. Indeed, the
If only designs smaller than this are to be treated, then this method is definitely worth trying, all the more that one has then the ability to test and compare different TSP algorithms.
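The preprocessing needed by a generic TSP routine can be sketched as an all-pairs shortest-path computation. The sketch assumes that the cost of a pair with no direct edge is the length of a shortest path between its endpoints, and that vertices are numbered from 0; the function name `cost_matrix` and the edge-list input format are hypothetical.

```python
import heapq
from typing import List, Tuple


def cost_matrix(n: int, edges: List[Tuple[int, int, float]]) -> List[List[float]]:
    """Shortest-path cost between every pair of vertices 0..n-1.

    Runs Dijkstra's algorithm once per source vertex on an undirected,
    non-negatively weighted graph given as (u, v, weight) triples.
    """
    adj: List[List[Tuple[int, float]]] = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    matrix: List[List[float]] = []
    for src in range(n):
        dist = [float("inf")] * n
        dist[src] = 0.0
        heap = [(0.0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[u]:
                continue  # stale heap entry
            for v, w in adj[u]:
                if d + w < dist[v]:
                    dist[v] = d + w
                    heapq.heappush(heap, (dist[v], v))
        matrix.append(dist)
    return matrix


costs = cost_matrix(3, [(0, 1, 4), (1, 2, 6), (0, 2, 12)])
```

The result is a dense matrix with one entry per pair of memory elements, which illustrates why this step becomes the bottleneck as the number of flip-flops grows.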
The 2-approximation for the TSP is a famous textbook algorithm and a typical example of an approximation algorithm with guaranteed quality; it also has nice properties when used on the scan chaining problem.
An algorithm for a minimization problem is called an
It may seem unintuitive that one can prove a quality bound without even knowing the value of the optimal solution. Actually, such a proof can be derived through the use of a lower bound on the value of the optimal solutions, through showing that the solution given is less than
In the case of the TSP, two approximation algorithms are found in every textbook on the subject: one has an approximation factor of 2 (we call it “the 2-approximation for the TSP”), and the other is Christofides’ algorithm, with an approximation factor of 1.5. Both use in their proof, as a lower bound on the optimal tour length, the value of a minimum spanning tree of the graph.
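For the factor of 2, the role of the spanning-tree lower bound can be written as a short chain of inequalities. The notation here is introduced only for this illustration: $w(\cdot)$ is the total edge cost, assumed non-negative and satisfying the triangle inequality, $T$ is a minimum spanning tree, $H^{*}$ is an optimal tour, and $H$ is the tour returned by the algorithm.

```latex
\[
  w(T) \le w(H^{*}),
  \qquad
  w(H) \le 2\,w(T)
  \quad\Longrightarrow\quad
  w(H) \le 2\,w(H^{*}).
\]
```

The first inequality holds because deleting one edge from any tour leaves a spanning tree; the second because a walk around the tree uses each tree edge twice, and shortcutting repeated vertices cannot increase the cost under the triangle inequality.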
Please note that these approximation ratios are only proven theoretical bounds, and are not enough to compare the empirical behavior of these algorithms. Still, Christofides' algorithm is guaranteed to give solutions that are no more than 50% away from the optimal one; this would entice one to use this algorithm.
Alas, Christofides' algorithm uses Weighted Perfect Matching as a subroutine, which has two impacts on its use. First, weights have to be precomputed, and we stumble against the same obstacles as mentioned in the previous section. Second, Weighted Perfect Matching needs
Without any regret, we now turn to the 2-approximation algorithm, which is seldom considered a useful practical choice because of its loose guarantee and not-so-good performance on common benchmarks.
The algorithm has two steps. First, a minimum spanning tree is computed; then a root is chosen, and the tree is explored and its vertices output in postfix order.
Only the first step actually looks at the input graph; and its only requirement is that the graph be connected. If
Thus, using the 2-approximation, we bypass the precomputation of the cost matrix. This means that bigger input graphs can be considered if this algorithm is used, as compared to other TSP algorithms.
Although this algorithm was meant for the TSP, that is, for an input that is a complete graph with values on all edges, one can prove that the 2-approximation guarantee still holds when the algorithm is applied to a connected graph. In this case, the tour cost has to be understood as the sum of the costs of the edges, where the cost of a nonexistent edge is the length of a shortest path in the input graph between its endpoints; hence our choice to model the cost of nonexistent edges in this way.
The final step is the editing of the RTL code of the design. In this step, we insert into the original RTL description the additional RTL constructs required to implement the scan testing logic.
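The two steps of the 2-approximation, applied directly to a sparse connected graph, can be sketched as below. This is an illustration rather than the tool's implementation: it assumes the graph is given as adjacency lists of (neighbor, weight) pairs, uses Prim's algorithm for the spanning tree, and the function name `mst_postorder_chain` is hypothetical.

```python
import heapq
from typing import Dict, Hashable, List, Tuple


def mst_postorder_chain(adj: Dict[Hashable, List[Tuple[Hashable, float]]],
                        root: Hashable) -> List[Hashable]:
    """Stitching order from a minimum spanning tree traversed in postfix order."""
    # Step 1: Prim's algorithm, recording each vertex's children in the tree.
    children: Dict[Hashable, List[Hashable]] = {v: [] for v in adj}
    visited = {root}
    heap = [(w, root, v) for v, w in adj[root]]
    heapq.heapify(heap)
    while heap:
        _, parent, v = heapq.heappop(heap)
        if v in visited:
            continue
        visited.add(v)
        children[parent].append(v)
        for nxt, w in adj[v]:
            if nxt not in visited:
                heapq.heappush(heap, (w, v, nxt))
    if len(visited) != len(adj):
        raise ValueError("the input graph must be connected")

    # Step 2: iterative postfix traversal (children before their parent).
    order: List[Hashable] = []
    stack = [(root, False)]
    while stack:
        v, expanded = stack.pop()
        if expanded:
            order.append(v)
        else:
            stack.append((v, True))
            for child in reversed(children[v]):
                stack.append((child, False))
    return order


# Small made-up proximity graph on four flip-flops.
graph = {
    "a": [("b", 1), ("c", 3), ("d", 5)],
    "b": [("a", 1), ("c", 1)],
    "c": [("b", 1), ("a", 3), ("d", 1)],
    "d": [("c", 1), ("a", 5)],
}
chain = mst_postorder_chain(graph, "a")
```

Only the edges actually present in the graph are ever examined, which is what allows this approach to skip the dense cost matrix.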
The method we propose to optimize the stitching ordering of RTL scan chains has been implemented in an experimental version of HiDFTSIGNOFF. This tool relies on a commercial software library for the parsing and elaboration of the design. This elaboration step is what we called “lightweight synthesis” in Section
HiDFTSIGNOFF has the ability to stitch FFs into several scan chains. This possibility is important from the practical point of view: making several scan chains is the easiest way to reduce test application time. HiDFTSIGNOFF implements the two strategies presented in Section
For graph partitioning, the METIS library [
In order to validate our method, we did experiments on a number of designs, both in VHDL and Verilog. We used the benchmarks
Table
Designs description.
Design  No. of FF  Language 

b09  28  vhdl 
b10  17  vhdl 
b11  31  vhdl 
b12  121  vhdl 
b13  53  vhdl 
b14  245  vhdl 
b15  449  vhdl 
b17  1415  vhdl 
b18  3320  vhdl 
b19  6642  vhdl 
SimpleSpi (SS)  132  verilog 
Biquad  204  verilog 
Ac97  2289  verilog 
Since our optimization criterion is wirelength, and since we consider congestion during routing an important issue, experiments were carried through to the place and route step. Two flows were considered, according to Figure
Table
Characteristics of the graphs
Design  No. of FF 





b09  28  401  719  28  378 
b10  17  525  947  17  136 
b11  31  1094  1845  31  465 
b12  121  3879  6993  121  7260 
b13  53  633  1042  53  1378 
b14  245  27269  46778  245  29890 
b15  449  33179  56798  449  100576 
b17  1415  100358  171759  1415  697990 
b18  3320  271320  462326  3320  2330318 
b19  6642  531161  906201  6642  7061713 
SimpleSpi  132  1427  8646  132  8778 
Biquad  204  357  20706  204  20910 
Ac97  2289  18727  2381495  2289  707911 
The computation of the graph
The figures for both wirelength after place and route, and computation times, have been gathered in Table
Computation times are given only for the two steps of the stitching optimization method, discarding the time for parsing and elaboration. The column
We base our analysis of the comparison of wirelength between gatelevel and RTL scan on the slack column of Table
Wirelengths and insertion times.
Design  No. of SC  Wirelength (GLS)  Wirelength (RTLS)  Slack (%)  Insertion time (ms): …  Insertion time (ms): TSP
ITC 99 Benchmarks (VHDL)
b09  2  2.06  1.63  …  …  …
b10  2  1.94  2.06  5.97  10  …
b11  3  4.77  5.03  5.38  20  10
b12  10  12.6  12.92  2.59  200  20
b13  5  3.39  3.32  −2.25  20  …
b14  24  63.64  64.2  0.96  3 s  110
b15  24  126.3  122  …  6 s  320
b17  70  395.6  370.4  …  30 s  3 s
b18  300  1187.2  922.6  …  3 m  10 s
b19  600  2329.6  1858.2  …  5 m  20 s
Opencore designs (Verilog)
SimpleSpi  10  11.5  10.4  …  100  20
Biquad  20  34.1  31.3  …  350  80
Ac97  200  247.7  238.5  …  30 s  8 s
Table
Fault coverage for RTL scan.
Design name  Fault coverage 

ITC 99 Benchmarks (VHDL)  
 
b09  99.86% 
b10  99.85% 
b11  99.93% 
b12  99.97% 
b13  99.92% 
b14  99.99% 
b15  99.97% 
b17  99.47% 
b18  99.81% 
b19  99.81% 
 
Opencore designs (Verilog)  
 
Simple Spi  98.35% 
Biquad  99.96% 
Ac97  99.80% 
Computation times are mainly dominated by the setup of the graph
After giving some motivations for inserting scan at the RT level, we have presented what we believe is an important challenge for RTL scan, namely finding a good stitching in one single pass, working only at the RT level. The purpose of the present work is therefore to solve the following problem: given an RT-level description of the design (Verilog or VHDL), find a stitching ordering of the memory elements in the scan chain that minimizes the impact of test circuitry on chip performance (area, power) and testing time. We have then motivated our choice of wirelength as the prime parameter to optimize.
To solve this problem, we proposed a new scan stitching algorithm for optimal scan chain insertion at RTL. The techniques used are derived from the combinatorial optimization and operations research domains. Indeed, our algorithm is divided into two main steps. The first step proposes a mathematical model describing the electronic problem. The second one offers a resolution methodology.
The model we propose is based on graph theory. To build it, we extract from the RTL description information on the proximity of the memory elements (existing paths between flip-flops, clock domains, and various other relations extracted from hierarchical analysis) and translate it into two graphs
Finally, we integrated our tool in an industrial design flow and performed experiments on several academic and industrial design benchmarks. Numerical evidence showed that we are able to limit the cost due to scan insertion in a reasonable computing time, without impacting DFT quality, especially fault coverage. The method seems better than the traditional one for middle-sized designs. The industrial interest in our algorithms and tools is confirmed by our industrial partners. Our RT-level scan optimization algorithm has been incorporated into the tool HiDFTSIGNOFF by DeFacTo Technologies. When flip-flops are added to or removed from existing scan chains, it is important to goldenize the RTL code, that is, to reflect such changes directly in the RTL code. The DeFacTo tool HiDFTSIGNOFF allows this by introducing the new scan architecture and by including the new ordering of the flip-flops. In this way, our work represents progress in the state of the art, as previous works are not automated and/or were evaluated only on small designs.
Last but not least, an important contribution of our work is to make a neat separation between the DFT problem and the mathematical model. This separation allows the same software to work for several successive technology nodes.
Our initial results give rise to a number of new directions for further research. These are summarized below. First, one could investigate whether cumulating local and global optimization in the manner of [