Integrated On-Line and Off-Line Error Detection Mechanisms in the Coding Theory Framework

In this paper we present an approach for combining on-line concurrent checking (CC) with off-line built-in self-test (BIST). We show that monitoring the output of a concurrent checker during manufacturing testing reduces the aliasing probability, and that a short periodic BIST reduces the probability of not detecting an error in the computing mode. We present a technique for optimal selection of error-detecting codes for combined on-line CC and off-line space-time compression of test responses for BIST, and we estimate the probabilities of not detecting an error for the approach based on integrating CC and BIST. We also present a technique for on-line error detection in space-time compressors of test responses for BIST.


I. INTRODUCTION
Off-line testing techniques such as Built-In Self-Test (BIST) and boundary scan have been the focus of VLSI test engineers concerned with product quality. BIST techniques have been widely used for manufacturing testing and repair. The block diagram for a BIST design is given in Fig. 1.
For the block diagram of Fig. 1, space compression (SC) may be useful for overhead reduction only when the device-under-test has a large number of outputs. It was shown in [1,2] that optimal space compressors (SCs) are linear networks computing syndromes of linear error-detecting codes. If an $(n, s, d)$ code $V_{SC}$ is used for space compression, then the corresponding SC can be constructed from XOR gates only, the SC has $m = n - s$ outputs ($m \ll n$, see Fig. 1), and any error which distorts signals on at most $d - 1$ output lines of the device-under-test (D) at any given moment cannot be masked in the SC. Moreover, any error which is detected by $V_{SC}$ cannot be masked in the corresponding SC.
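The syndrome view of a space compressor can be illustrated with a minimal Python sketch. It assumes a toy $(7, 4, 3)$ Hamming code in place of $V_{SC}$; the names `H`, `space_compress` and `is_masked` are hypothetical and chosen only for this illustration.

```python
from itertools import combinations

# Assumed toy code: the (7, 4, 3) Hamming code. Column j of the check
# matrix is the binary expansion of j + 1, so all columns are nonzero
# and pairwise distinct.
N, M = 7, 3
H = [[(col >> row) & 1 for col in range(1, N + 1)] for row in range(M)]

def space_compress(y):
    """Return z = H y over GF(2); each output bit is one XOR tree."""
    return tuple(sum(h * bit for h, bit in zip(row, y)) % 2 for row in H)

def is_masked(e):
    """By linearity, an error e is masked iff its syndrome H e is zero."""
    return not any(space_compress(e))

def errors_of_weight(w):
    """Yield all n-bit error patterns with exactly w ones."""
    for positions in combinations(range(N), w):
        yield tuple(1 if i in positions else 0 for i in range(N))

# Number of masked error patterns for each error multiplicity.
masked_by_weight = {
    w: sum(is_masked(e) for e in errors_of_weight(w)) for w in range(1, N + 1)
}
```

In this toy case no error of multiplicity 1 or 2 (that is, at most $d - 1$) is masked, while some errors of multiplicity 3 are, because they are codewords of the assumed code.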
The most popular technique for time compression (TC) of test responses is based on multiple-input linear feedback shift registers (MISRs). The structure of MISR feedback taps is defined by the corresponding binary generating polynomial $P(x)$ [3]. For an $m$-bit MISR the degree of $P(x)$ is equal to $m$, and to decrease the probability of masking an error in the TC (time aliasing), in many cases $P(x)$ should be primitive [3]. Any $m$-bit MISR can be described as a network computing syndromes for the corresponding linear nonbinary $(T, T-1, 2)$ Reed-Solomon code [11], $V_{TC}$, over $GF(2^m)$, where the length, $T$, of the code is equal to the number of test patterns applied [4]. If $l > 1$ MISRs are used for self-testing or self-diagnosis (multisignature analysis [5,6]), then the corresponding TC is the network computing syndromes for the $(T, T-l, l+1)$ Reed-Solomon code, $V_{TC}$, over $GF(2^m)$.
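The description of an MISR as a syndrome computation over $GF(2^m)$ can be sketched in Python as follows. This is a behavioral model under assumptions, not a gate-level description of any particular MISR structure: the state is treated as an element of $GF(2^m)$, each clock multiplies it by $\alpha$ and adds the next $m$-bit symbol. The function name `misr_signature` and the integer encoding of field elements are hypothetical.

```python
def misr_signature(symbols, m, poly):
    """Signature after shifting in a sequence of m-bit symbols.

    `poly` holds the coefficients of P(x) as bits, including x^m
    (for example 0b1011 for x^3 + x + 1). The result equals
    sum_t symbols[t] * alpha^(T-1-t) in GF(2^m).
    """
    state = 0
    for z in symbols:
        state <<= 1            # multiply the state by alpha
        if state >> m:         # reduce using P(alpha) = 0
            state ^= poly
        state ^= z             # add the next input symbol
    return state

# Assumed toy parameters: m = 3, P(x) = x^3 + x + 1, T = 7.
M_BITS, POLY = 3, 0b1011
fault_free = [3, 1, 4, 1, 5, 2, 6]
error = [0, 0, 7, 0, 0, 0, 0]
faulty = [a ^ b for a, b in zip(fault_free, error)]

# The error is aliased iff the signature of the error sequence is zero.
aliased = misr_signature(error, M_BITS, POLY) == 0
```

Because the signature is linear, the faulty signature differs from the fault-free one by exactly the signature of the error sequence; an error confined to a single test pattern is never aliased, which corresponds to distance 2 of the $(T, T-1, 2)$ code.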
Analysis of space and time aliasing probabilities in terms of weight distributions of the corresponding codes $V_{SC}$ and $V_{TC}$ for different error models is presented in [4]. Generalization of these results to board-level or system-level testing was given in [6].
Since system availability is becoming a key feature of computer systems, on-line error-detecting techniques based on concurrent checking (CC) are extremely important for the design of fault-tolerant systems. The block diagram for on-line error detection is presented in Fig. 2. For the block diagram of Fig. 2, the parity prediction network, R(D), is built in such a way that for any input, if there are no errors in the original device, D, and in R(D), then the extended output $(y_1, \ldots, y_k, y_{k+1}, \ldots, y_n)$ is a codeword of the binary $(n, k, d)$ systematic code $V_{CC}$. All errors resulting in distortions of at most $d - 1$ bits in $(y_1, \ldots, y_k, y_{k+1}, \ldots, y_n)$ will be detected on-line by the CC. Used as $V_{CC}$ are, for example, parity prediction $(k+1, k)$ codes, duplication $(2k, k)$ codes, $(k + \lceil \log_2(k+1) \rceil, k)$ nonlinear Berger codes and $(n, n - \lceil \log_2(n+1) \rceil)$ Hamming codes for computer memories [7]. Concurrent checkers (CCs) are networks computing syndromes for the corresponding codes $V_{CC}$ and checking whether these syndromes are equal to zero. Techniques for design of self-checking CCs can be found, e.g., in [7-10].
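As an illustration of one of the codes listed above, the following Python sketch models a Berger code with $k$ information bits and $\lceil \log_2(k+1) \rceil$ check bits. It uses the common convention that the check symbol is the binary count of zeros among the information bits; that convention, and the helper names `berger_check`, `encode` and `checker_detects`, are assumptions made for this sketch.

```python
def berger_check(info):
    """Check symbol: number of zeros in `info`, as a bit list (MSB first)."""
    r = len(info).bit_length()      # equals ceil(log2(k + 1))
    zeros = list(info).count(0)
    return [(zeros >> i) & 1 for i in reversed(range(r))]

def encode(info):
    """Systematic codeword: information bits followed by check bits."""
    return list(info) + berger_check(info)

def checker_detects(word, k):
    """True if the received word is not a Berger codeword."""
    return berger_check(list(word[:k])) != list(word[k:])

# k = 8 information bits give n = 12, i.e. a (12, 8) Berger code.
codeword = encode([1, 0, 1, 1, 0, 0, 1, 0])
```

With this convention the checker flags every unidirectional error (any set of 1-to-0 flips, or any set of 0-to-1 flips), since such an error moves the zero count and the stored check value in opposite directions.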
Example 1 To illustrate the relations between off-line space-time compression of test responses, on-line CC and the corresponding codes $V_{SC}$, $V_{TC}$ and $V_{CC}$, let us consider the case when the original device, D, is the control ROM for the MC68020.
In this case the number of outputs of the original device is $k = 116$, and for concurrent checking one can select a $(123, 116, 3)$ Hamming code as $V_{CC}$ [11]. This results in $r = 7$ redundant outputs for R(D) and on-line detection of all single and double errors in the $n = 123$ outputs of the expanded ROM. In addition, errors distorting three or more output bits will go undetected with probability $2^{-7}$ (assuming all these errors are equiprobable).
For space compression of responses from the expanded ROM a $(123, 95, 9)$ BCH code [11] can be used. In this case the SC will have $m = 28$ outputs, all errors resulting in a distortion of at most 8 out of 123 output bits of the expanded ROM will not be masked in the SC, and the probability of masking (space aliasing) for errors distorting more than 8 bits is $2^{-28}$.
For time compression of output sequences from the SC in this case one can use a Reed-Solomon code with distance 2 over $GF(2^{28})$ [11]. Then the corresponding TC can be implemented by the MISR with primitive generating polynomial $P(x) = x^{28} \oplus x^3 \oplus 1$, and the probability of time aliasing is $2^{-28}$. The total overhead in terms of equivalent two-input gates for on-line and off-line error detection based on the above codes is about 15%.
Thus, a strong relationship exists between concurrent checking and BIST, since both are based on computing syndromes of corresponding error-detecting codes $V_{CC}$, $V_{SC}$ and $V_{TC}$. This provides a natural framework to integrate both approaches.
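The parameter arithmetic of Example 1 can be retraced with a short Python sketch. The helper `hamming_check_bits` is hypothetical; it assumes that a (possibly shortened) Hamming code with $r$ check bits can carry up to $2^r - 1 - r$ information bits, which is the standard bound for single-error-correcting codes.

```python
def hamming_check_bits(k):
    """Smallest r such that a (shortened) Hamming code with r check
    bits can carry k information bits, i.e. 2^r - 1 - r >= k."""
    r = 1
    while 2 ** r - 1 - r < k:
        r += 1
    return r

k = 116                        # outputs of the control ROM
r = hamming_check_bits(k)      # redundant outputs of R(D)
n = k + r                      # outputs of the expanded ROM

s = 95                         # dimension of the (123, 95, 9) code
m = n - s                      # outputs of the space compressor

cc_miss = 2.0 ** -r            # equiprobable errors of multiplicity >= 3
space_aliasing = 2.0 ** -m     # errors of multiplicity > 8
time_aliasing = 2.0 ** -m      # m-bit MISR with primitive P(x)
```

Under these assumptions the sketch gives $r = 7$, $n = 123$ and $m = 28$, matching the values quoted in Example 1.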
We note that current design approaches have major drawbacks since mechanisms for on-line and off-line error detection are chosen separately, without consideration of any potential interactions. These approaches are not tailored to the most efficient combined utilization of the available silicon area. Also, in treating on-line and off-line techniques separately, performance gains are lost. For example, in the design of off-line fault detection/diagnosis hardware, the fact that the circuit may also include support logic for on-line error detection is not considered.
Although there is a significant overlap between on-line and off-line error detection techniques, only a few papers have been published exploring the potential for combining these approaches. The idea of merging on-line and off-line BIST was suggested in [22,23], where partitioning of the logic, placement and design of test circuitry were discussed. In [12,13], a concurrent BIST (CBIST) approach has been proposed. Resources are modified for this approach such that, during system operation, they can observe normal inputs and outputs. When a normal input matches a test pattern, the circuit output is compressed into a developing signature. Another approach (UBIST) was proposed in [14]. For this approach, test vector generators are used for off-line testing of concurrent checkers, ensuring that all test patterns are applied. It may, though, be difficult to implement UBIST for devices with many input lines. A combination of UBIST with boundary scan was suggested in [24]. Applications of linear cellular automata for concurrent checking, signature analysis and boundary scan have been studied in [25].
Modifications of conventional BIST designs to improve fault coverages for stuck-at faults by monitoring output of a parity predictor during manufacturing testing were proposed in [24].
Applications of this approach for ISCAS benchmarks and ROMs are also presented in [26].
Implementations of built-in testable error detection and error correction circuits have been presented in [15]. The problem of parity calculation for parity checking with BIST and pseudorandom testing was considered in [16].
In this paper we will investigate the general framework presented in Fig. 3, where the off-line built-in self-test mechanisms of Fig. 1 are coupled with the on-line error detection mechanisms of Fig. 2. In the proposed scheme, the original device, D, augmented with a parity (code) predictor, R(D), is not required to be fault-secure.
In Section II we will discuss how much improvement one can achieve by monitoring the output of a concurrent checker during off-line manufacturing testing. We will also show that on-line concurrent checking alone may not be sufficient for detection of stuck-at faults, and that a short periodic off-line BIST can drastically increase the probability of detection for these faults. Section III will be devoted to optimal selection of the codes $V_{CC}$, $V_{SC}$ and $V_{TC}$ for concurrent checking, space compression and time compression. In Section IV we present an analysis of probabilities of error detection for the combined BIST and concurrent checking scheme. Design issues related to on-line concurrent checking of space and time compressors of test responses will be discussed in Section V.

II. MANUFACTURING AND FIELD TESTING BY COMBINING BIST AND CONCURRENT CHECKING (CC)
First, let us show that monitoring the output of the concurrent checker (CC) during off-line manufacturing BIST results in a drastic decrease of the aliasing probability. To estimate aliasing we have to choose a model describing the distribution of errors at the outputs $y_1, \ldots, y_k, y_{k+1}, \ldots, y_n$ ($n = k + r$) of the device.
There are two important components of any error model: temporal and spatial. The first, temporal, models the correlation between the errors caused by different input vectors [19,20,21]. In this case, if input vectors are random, then there is no time correlation between errors due to any two input vectors.
Let $e(t) = y(t) \oplus \tilde{y}(t)$, where $y(t)$ and $\tilde{y}(t)$ are the fault-free and distorted outputs at the moment $t$. Then the space distribution of errors at the output of the device is given by probabilities $p_0, p_1, \ldots, p_{2^n-1}$ ($p_i = \Pr\{e(t) = i\}$ for any $t$, $p_0 + p_1 + \cdots + p_{2^n-1} = 1$). Methods of statistical determination of parameters for the above models and their applicability can be found in [17]. These models have been widely used in estimations of off-line aliasing (see, e.g., [4,17,18,19,21]).
In the case when space compression (SC) is used before time compression to reduce the overhead required for BIST (Fig. 3), to estimate the probability of aliasing in a time compressor, the distribution $p_0, p_1, \ldots, p_{2^n-1}$ of errors in the device should be replaced by the distribution $q_0, q_1, \ldots, q_{2^m-1}$ of distortions $\Delta z(t)$ at the output of the SC, where $q_i = \Pr\{\Delta z(t) = i\}$. If the space compression is based on the $(n, n-m)$ code, $V_{SC}$, with check matrix $H_{SC}$, then
$$z(t) = H_{SC}\, y(t); \quad z(t) \oplus \Delta z(t) = H_{SC}\,(y(t) \oplus e(t)); \quad \Delta z(t) = H_{SC}\, e(t) \qquad (1)$$
and
$$q_i = \sum_{j:\; H_{SC}\, j = i} p_j. \qquad (2)$$
For example, for $2^n$-ary symmetrical errors at the output of the device [17], we have symmetrical $2^m$-ary errors $\Delta z(t)$ at the output of the SC with
$$q_i = \begin{cases} 1 - p + (2^{n-m} - 1)\, p\, (2^n - 1)^{-1} & \text{for } i = 0; \\ 2^{n-m}\, p\, (2^n - 1)^{-1} & \text{for } i \neq 0; \end{cases}$$
where $p = \Pr\{e(t) \neq 0\}$ for any $t$.
Example 2 Suppose that the original device has $k = 4$ output lines and the $(5, 4, 2)$ single-error-detecting code, $V_{CC}$, is used for concurrent checking (CC). Then there are 15 nonzero 5-bit error patterns with even weights (numbers of ones) which are not detectable by the CC.
If the 3-bit MISR based on the primitive polynomial $P(x) = x^3 \oplus x \oplus 1$ is used for time compression and $T = 7$ test patterns are applied, then $e = (e(0), \ldots, e(6))$ is not detected by both concurrent checking and signature analysis iff the number of ones in $e(t)$ is even ($t = 0, 1, 2, \ldots, 6$) and
$$\alpha^6 (H_{SC}\, e(0)) \oplus \alpha^5 (H_{SC}\, e(1)) \oplus \cdots \oplus \alpha (H_{SC}\, e(5)) \oplus H_{SC}\, e(6) = 0,$$
where $\alpha$ is primitive in $GF(2^3)$ and $P(\alpha) = \alpha^3 \oplus \alpha \oplus 1 = 0$ ($\alpha^i \neq \alpha^j$, $i \neq j$, $i, j = 0, 1, \ldots, 6$). For $p = \Pr\{e(t) \neq 0\} = 0.1$ (for any $t$), the probability, $P_{ON}$, of not detecting $e \neq 0$ by concurrent checking is $P_{ON} = 0.211$. Similarly, for the probability, $P_{OFF}$, of not detecting $e = (e(0), \ldots, e(6)) \neq 0$ by the off-line signature analysis, using results from [4] for the aliasing probability for symmetrical errors, we have $P_{OFF} = 0.076$ and, finally, for the probability, $P_{ON,OFF}$, of not detecting $e$ by both CC and the MISR we have $P_{ON,OFF} = 0.037$. (Exact formulas for $P_{ON}$, $P_{OFF}$ and $P_{ON,OFF}$ will be presented in Section IV.)
Off-line and on-line error detection techniques can be designed to complement each other. A periodic BIST can be aimed at precisely those stuck-at faults that escape detection by CC.
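The value of $P_{ON}$ quoted in Example 2 can be approximated with a small Python sketch. It is one plausible reading, not the exact formula of Section IV: it assumes the $2^n$-ary symmetric error model, independence of errors across test patterns, and that $P_{ON}$ is the probability that $e \neq 0$ while every $e(t)$ has even weight. The variable names are hypothetical.

```python
from math import comb

n, T, p = 5, 7, 0.1

# Nonzero 5-bit patterns of even weight, which the (5, 4, 2) code misses.
undetectable = sum(comb(n, w) for w in range(2, n + 1, 2))

# Symmetric model: each nonzero pattern has probability p / (2^n - 1).
per_pattern = p / (2 ** n - 1)

# Probability that one test pattern raises no CC alarm: either no error
# or an even-weight error.
silent_step = (1 - p) + undetectable * per_pattern

# Probability that the whole sequence is silent, minus the error-free case.
p_on = silent_step ** T - (1 - p) ** T
```

Under these assumptions the sketch counts the 15 undetectable patterns and yields a value of about 0.212, close to the figure of 0.211 given in the example.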
To justify this approach, we note first that CCs, in most cases, provide poor fault coverage with respect to stuck-at faults at primary input lines; these faults, though, form an important class of permanent faults, since in many cases interconnections between components are less reliable than the components themselves. For example, parity prediction CCs for adders or multipliers [7], based on comparing parities of input operands, internal carries and outputs, cannot detect many input stuck-at faults.
In addition to this, since the overhead for CCs grows rapidly with an increase in the error-detecting capability of the corresponding $(n, k)$ code, $V_{CC}$, only codes with a very limited error-detecting capability (e.g., parity codes or Hamming codes for memories) have been used. This may result in a low fault coverage for internal stuck-at faults. The following three examples illustrate this situation.

TABLE I Representation of $GF(2^3)$ with $P(x) = x^3 \oplus x \oplus 1$, $P(\alpha) = 0$
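The body of the table is not present in this text. A representation of $GF(2^3)$ for $P(x) = x^3 \oplus x \oplus 1$ can be regenerated with the following Python sketch, which assumes that an element $a_2\alpha^2 \oplus a_1\alpha \oplus a_0$ is encoded as the bit string $a_2 a_1 a_0$; the layout of the original table may differ. The function name `gf_power_table` is hypothetical.

```python
def gf_power_table(m, poly):
    """Return [alpha^0, alpha^1, ..., alpha^(2^m - 2)] as integers,
    where `poly` encodes P(x) including the x^m term."""
    table, x = [], 1
    for _ in range(2 ** m - 1):
        table.append(x)
        x <<= 1                # multiply by alpha
        if x >> m:             # reduce using P(alpha) = 0
            x ^= poly
    return table

POWERS = gf_power_table(3, 0b1011)   # P(x) = x^3 + x + 1
for i, value in enumerate(POWERS):
    print(f"alpha^{i} = {value:03b}")
```

Since $P(x)$ is primitive, the seven powers are pairwise distinct and cover all nonzero elements of the field.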
Finally, there is the problem of detecting stuck-at faults in the CC itself. This problem can be solved by using self-checking checkers. However, self-checking checkers require considerable additional overhead.
Example 3 Consider a class of networks $H_m$ ($m = 1, 2, \ldots$) with $2^m$ inputs and $m + 1$ outputs defined recursively by Fig. 4. ($H_m$ ($m = 1, 2, \ldots$) have been used for decoding of extended Hamming codes.) The parity predictor, $R(H_4)$, and the CC for $H_4$ based on the $(6, 5, 2)$ parity code are presented in Fig. 5. From Fig. 5 one can see that out of 86 single stuck-at faults (SSFs) in $H_4$, 30 cannot be detected by the CC.
As $m$ grows, the fault coverage for SSFs detectable by the parity prediction in $H_m$ converges to 50%. (One can improve this fault coverage by using CCs based on codes with distances more than 2, but already for $(n, m+1, 3)$ Hamming codes with $r = n - (m+1) \geq \lceil \log_2(n+1) \rceil$, the corresponding parity predictors require about $2^{m-1} \lceil \log_2(n+1) \rceil$ two-input gates.) We note, however, that a fault coverage close to $1 - 2^{-(m+1)}$ can be obtained for SSFs in this case by off-line testing using an $(m+1)$-bit MISR.
The previous example demonstrates that for linear devices and CCs based on linear codes, fault coverages for SSFs may be very low. We will show now that this may also be true for nonlinear devices and CCs based on more powerful and not necessarily linear codes (e.g., Berger codes).
Example 4 Let us consider a class $Q_m$ of devices with $m$ inputs and $2^{m-1}$ outputs, defined recursively by Fig. 6 ($m = 2, 3, 4, \ldots$). For $Q_4$ only 7 out of 25 SSFs cannot be detected by the CC, based on the parity prediction $(9, 8)$ code, or by a nonlinear $(12, 8)$ Berger code. As $m$ grows, the fraction of these undetectable SSFs in $Q_m$ converges to 1/3.
Example 5 Let us consider an $(N \times k)$ ROM with concurrent parity checking. In this case, stuck-at faults in the address decoder that result in the selection of a wrong cell, as well as multiple faults within a cell, will not be detected. These stuck-at faults can be detected with probability $1 - 2^{-(k+1)}$ by off-line testing, using a $(k+1)$-bit MISR at the output of the ROM.
The previous examples indicate that CC alone may result in a low fault coverage for SSFs, and that a short periodic BIST detects in most cases a very large percentage of SSFs. Techniques for optimal selection of error-detecting codes for CC and BIST which complement each other are presented in the next section.

III. OPTIMAL SELECTION OF ERROR-DETECTING CODES FOR CONCURRENT CHECKING (CC) AND SPACE-TIME COMPRESSION (STC) OF TEST RESPONSES
The space compressor (SC) is a combinational network with $n$ inputs $y(t) = (y_1(t), \ldots, y_n(t))$ and $m$ outputs $z(t) = (z_1(t), \ldots, z_m(t))$ such that $m \ll n$ and, for a given class $E$ of errors, the SC has the following error-propagating property. Let $\tilde{y}(t) = y(t) \oplus e(t)$, $e(t) \in E$, and let the outputs of the SC for $y(t)$ and $\tilde{y}(t)$ be $z(t)$ and $\tilde{z}(t)$. Then the SC is error propagating iff for any $e(t) \in E$, $z(t) \neq \tilde{z}(t)$.
An SC is self-testing iff the fault-free response of the device is a test detecting all single stuck-at faults in the SC [1,2]. (The major difference between self-testing error-propagating (STEP) SCs and totally self-checking (TSC) checkers is that the inputs and outputs of STEP SCs are not necessarily codewords of a concurrent code $V_{CC}$. This implies that a STEP SC could have a single output; a TSC checker, however, must have at least two outputs [7].)
It was shown in [1,2] that STEP SCs can be designed as networks computing syndromes for linear codes. If an $(n, s, d)$ code, $V_{SC}$, is chosen for space compression, then $m = n - s$ and $E$ is the set of all errors with multiplicity at most $d - 1$.
In the case when CC and space compression are integrated (Fig. 3), there is the problem of optimal selection of the corresponding $(n, k)$ codes, $V_{CC}$, for concurrent checking and $(n, s)$ codes, $V_{SC}$, for space compression. For example, if $V_{CC} = V_{SC}$, then any error missed by the CC will also be missed by the SC. This is a bad choice for $(V_{CC}, V_{SC})$.
Below, we describe our approach to this problem, which minimizes the fraction, $\eta$, of errors which are undetectable by the CC and masked in the space compression process; i.e.,
$$\eta = |V_{CC} \cap V_{SC}|\, 2^{-k}, \quad 2^{-m} \leq \eta \leq 1, \quad m = n - s. \qquad (4)$$
If $V_{CC}$ is the parity $(n, n-1, 2)$ code, then $V_{SC}$ should have an odd distance to minimize $\eta$. In this case, $|V_{CC} \cap V_{SC}| = 2^{n-m-1}$ and $\eta = 2^{-m}$.
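The effect of the distance parity of $V_{SC}$ on $\eta$ can be checked by brute force on a small case. The Python sketch below assumes $n = 7$, takes the $(7, 6, 2)$ parity code as $V_{CC}$, and compares a $(7, 4, 3)$ Hamming code (odd distance, $m = 3$) with its even-weight $(7, 3, 4)$ subcode (even distance, $m = 4$) as $V_{SC}$. The code choices and helper names are illustrative assumptions.

```python
from itertools import product

def code_from_check_matrix(rows, n):
    """All n-bit words w with H w = 0 over GF(2)."""
    return {
        w for w in product((0, 1), repeat=n)
        if all(sum(h * b for h, b in zip(row, w)) % 2 == 0 for row in rows)
    }

n = 7
hamming_rows = [[(col >> row) & 1 for col in range(1, n + 1)] for row in range(3)]
parity_row = [1] * n

V_cc = code_from_check_matrix([parity_row], n)                      # (7, 6, 2)
V_sc_odd = code_from_check_matrix(hamming_rows, n)                  # (7, 4, 3)
V_sc_even = code_from_check_matrix(hamming_rows + [parity_row], n)  # (7, 3, 4)

def eta(v_cc, v_sc):
    """Fraction of CC-undetectable errors that are also masked in the SC."""
    return len(v_cc & v_sc) / len(v_cc)

eta_odd = eta(V_cc, V_sc_odd)     # m = 3
eta_even = eta(V_cc, V_sc_even)   # m = 4
```

In this toy case the odd-distance choice attains the lower bound $2^{-m} = 2^{-3}$, whereas the even-distance code, although it uses one more SC output, gives the same $\eta = 2^{-3}$, twice its own lower bound of $2^{-4}$, because it is contained in $V_{CC}$.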
For the general case, to minimize $\eta$, one can use as $V_{CC}$ and $V_{SC}$ shortened BCH or shortened cyclic Hamming codes [11] such that the sets of roots of their generating polynomials do not intersect.
This can be implemented in the following way.
Then $k = n - av$, $m = aw$ and $\eta = 2^{-aw} = 2^{-m}$.
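A small instance of this kind of construction, with $n = 7$, $a = 3$ and $v = w = 1$, can be verified by brute force. The Python sketch below assumes that one code is defined by the root $\alpha$ and the other by the root $\alpha^3$ of $GF(2^3)$ with $\alpha^3 = \alpha \oplus 1$; which of the two plays the role of $V_{CC}$ and which of $V_{SC}$ is an assumption of the sketch, as are the names `EXP` and `syndrome`.

```python
from itertools import product

# Powers alpha^0 .. alpha^6 in GF(2^3), alpha^3 = alpha + 1, encoded as
# integers whose bits are the coefficients of alpha^2, alpha, 1.
EXP = [1, 2, 4, 3, 6, 7, 5]

def syndrome(word, root_power):
    """Evaluate sum_i word[i] * (alpha^root_power)^i in GF(2^3)."""
    s = 0
    for i, bit in enumerate(word):
        if bit:
            s ^= EXP[(root_power * i) % 7]   # field addition is XOR
    return s

words = list(product((0, 1), repeat=7))
V_cc = {w for w in words if syndrome(w, 1) == 0}   # root alpha
V_sc = {w for w in words if syndrome(w, 3) == 0}   # root alpha^3

common = V_cc & V_sc
eta = len(common) / len(V_cc)
```

Both codes have $2^4$ codewords, their intersection contains only the all-zero and all-one words, and $\eta = 2/16 = 2^{-3} = 2^{-m}$, in agreement with the case $n = 7$ treated in Example 6.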
If $V_{CC}$ is a $t$ out of $n$ nonlinear nonsystematic code [7] (all codewords of $V_{CC}$ have weight $t$, $t < n$), then any code $V_{SC}$ with a distance $d$ such that $d > t$ is optimal; if $d < t$, then $d$ should be odd.
Example 6 Let $n = 7$ ($a = 3$). We will construct two $(7, 4, 3)$ Hamming codes $V_{SC}$ and $V_{CC}$ such that $|V_{CC} \cap V_{SC}| = 2$ and $\eta = 2^{-m} = 2^{-3}$. (The representation of $GF(2^3)$ based on the primitive polynomial $x^3 \oplus x \oplus 1$ is given in Table I.)
Selecting $\alpha_1 = \alpha$ and $\alpha_2 = \alpha^3$, we obtain the check matrices $H_{CC}$ and $H_{SC}$ [matrices not recoverable; surviving rows: 0100111, 1110100] and $V_{CC} \cap V_{SC} = \{0000000, 1111111\}$.
Example 7 Let us apply the above approach to the design of a CC and an SC for the control ROM for the MC68020 with $k = 116$ outputs from Example 1. We assume that a shortened $(123, 116, 3)$ Hamming code $V_{CC}$ detecting single and double errors is used for on-line error detection, and a shortened 8-error-detecting $(123, 95, 9)$ BCH code $V_{SC}$ is used for space compression (see Example 1).
Then the best pair $(V_{CC}, V_{SC})$ minimizing $\eta$ is defined by the following check matrices ($\alpha_1 = \alpha$, [remaining roots not recoverable], $\alpha$ is primitive in $GF(2^7)$, $\alpha^{127} = \alpha^0 = 1$) [check matrices $H_{CC}$ and $H_{SC}$ not recoverable]. In this case, the SC has $m = 4 \cdot 7 = 28$ outputs and the fraction of errors not detectable by the CC and masked in the SC is $\eta = 2^{-28}$.
Let us consider now the problem of optimal selection of time compressors (TCs) for a redundant device with a given CC code $V_{CC}$. We assume that $V_{CC}$ is a binary linear $(n, k)$ code ($n < 2k$), the TC is an $m$-bit MISR and there is no space compression ($m = n$). Then the TC is a network computing the syndrome for the $(T, T-1, 2)$ Reed-Solomon code, $V_{TC}$, over $GF(2^n)$, where $T$ is the number of test patterns applied. Let us denote the check matrix for this code as $H_{TC}$, with elements $h_i \in GF(2^n)$.
[...] find from (9) in a unique way $e(0) \in V_{CC}$. Thus, the fraction of errors which are not detected by off-line BIST and CC in this case is equal to [...].
If a primitive MISR is used as a TC, then $H_{TC} = (\alpha^{T-1}, \alpha^{T-2}, \ldots, \alpha, 1)$ ($\alpha$ is primitive in $GF(2^n)$; $\alpha^i \neq \alpha^j$; $i \neq j$; $i, j = 0, 1, \ldots, 2^n - 2$), and from (9) $e$ is not detected by CC and BIST for $T \leq 2^n - 1$ iff $e(t) \in V_{CC}$ for any $t$ and
$$e(0)\alpha^{T-1} \oplus e(1)\alpha^{T-2} \oplus \cdots \oplus e(T-2)\alpha \oplus e(T-1) = 0. \qquad (10)$$
Now let us fix $e(0) \in V_{CC}, \ldots, e(T-3) \in V_{CC}$. Then we have the following equations for $e(T-2)$ and $e(T-1)$ [...].
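Condition (10) can be explored by brute force on a toy case. The Python sketch below assumes $n = m = 3$, the $(3, 2)$ parity code as $V_{CC}$ (so $n < 2k$), a MISR with $P(x) = x^3 \oplus x \oplus 1$ and $T = 3$; these parameters and the names `misr_signature` and `V_CC` are chosen only for illustration and do not come from the text.

```python
from itertools import product

M_BITS, POLY = 3, 0b1011     # assumed MISR: P(x) = x^3 + x + 1

def misr_signature(symbols):
    """Horner evaluation of sum_t e(t) * alpha^(T-1-t) in GF(2^3)."""
    state = 0
    for z in symbols:
        state <<= 1          # multiply by alpha
        if state >> M_BITS:  # reduce using P(alpha) = 0
            state ^= POLY
        state ^= z           # add e(t)
    return state

# Codewords of the (3, 2) parity code, i.e. the even-weight 3-bit words.
V_CC = [w for w in range(2 ** M_BITS) if bin(w).count("1") % 2 == 0]

T = 3
# Error sequences invisible to the CC (every e(t) in V_CC) that also
# satisfy condition (10), i.e. leave the MISR signature unchanged.
undetected = [e for e in product(V_CC, repeat=T) if misr_signature(e) == 0]
fraction = len(undetected) / len(V_CC) ** T
```

The list `undetected` includes the all-zero sequence; removing it leaves the nonzero error sequences that escape both the CC and the signature analysis in this toy setting.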
Example 9 Let us consider the control ROM from Example 1 with $n = 123$, $k = 116$, $r = 7$ and $m = 28$. For $p = 10^{-5}$ and $T = 2^{15}$ we have from (18), (26) and (34) $P_{ON} = 1.85 \times 10^{-3}$, $P_{OFF} = 1.04 \times 10^{-9}$ and $P_{ON,OFF} = 6.88 \times 10^{-12}$. Probabilities of not detecting an error as functions of test length $T$ for concurrent checking, signature analysis, and combined concurrent checking and signature analysis for this ROM are presented in Fig. 8 for $p = 10^{-4}$.

V. CONCURRENT CHECKING OF SPACE AND TIME COMPRESSORS IN THE OFF-LINE TESTING MODE
Let us consider now the problem of concurrent checking of SCs and TCs to detect permanent and intermittent errors in the process of off-line testing.
A self-testing error-propagating SC is a set of $m$ XOR trees with input $y(t) = (y_1(t), \ldots, y_n(t))$ and output $z(t) = (z_1(t), \ldots, z_m(t))$ ($m \leq n$), where
$$z(t) = y(t) H^{tr}. \qquad (35)$$
[...] SC ($m = 28$) and time compression by the MISR with primitive generating polynomial $x^{28} \oplus x^3 \oplus 1$. Then the CC for the SC and the TC can be implemented by the $(29, 28, 2)$ parity code. The additional overhead required for this CC (in terms of equivalent two-input gates) is less than 5%.

VI. CONCLUSIONS
In this paper we proposed an approach for combining on-line concurrent checking and off-line BIST based on space-time compression of test responses to maximize probabilities of error detection for both manufacturing and field testing.
An approach for optimal selection of error-detecting codes for concurrent checking and space-time compression of test data has been developed, and probabilities of error detection for combined on-line and off-line techniques have been estimated.
An approach for concurrent checking of space and time compressors for test responses was proposed.
The presented techniques can be used for design of fault-tolerant devices with BIST.
FIGURE 1 Block Diagram for Built-In Self-Test.

FIGURE 2 Block Diagram for On-line Error Detection.

FIGURE 8 Probabilities of not Detecting an Error as Functions of Test Length $T$ for Concurrent Checking $P_{ON}$, Signature Analysis $P_{OFF}$ and Combined Concurrent Checking and Signature Analysis $P_{ON,OFF}$ for the Control ROM from Example 1 ($n = 123$, $k = 116$, $m = 28$) for $p = 10^{-4}$.
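The three probabilities plotted in Fig. 8 and quoted in Example 9 can be approximated with a short Python sketch. The formulas below are assumptions, since equations (18), (26) and (34) are not reproduced in this text: they use the $2^n$-ary symmetric error model with independent test patterns, take $P_{OFF} \approx 2^{-m}(1 - (1-p)^T)$ and $P_{ON,OFF} \approx 2^{-m} P_{ON}$, and the function name `undetected_probabilities` is hypothetical.

```python
def undetected_probabilities(n, k, m, p, T):
    """Approximate (P_ON, P_OFF, P_ON_OFF) for T test patterns."""
    no_error = (1 - p) ** T

    # Per pattern, the CC stays silent if e(t) is a codeword of V_CC:
    # the zero word or one of 2^k - 1 equiprobable nonzero codewords.
    cc_silent = (1 - p) + p * (2 ** k - 1) / (2 ** n - 1)

    p_on = cc_silent ** T - no_error        # nonzero error, CC silent
    p_off = 2.0 ** -m * (1 - no_error)      # nonzero error, signature aliased
    p_on_off = 2.0 ** -m * p_on             # both mechanisms miss the error
    return p_on, p_off, p_on_off

# Parameters of Example 9: control ROM, p = 1e-5, T = 2^15.
p_on, p_off, p_on_off = undetected_probabilities(
    n=123, k=116, m=28, p=1e-5, T=2 ** 15
)

# Sweep over test lengths for p = 1e-4, as in Fig. 8.
curve = {
    T: undetected_probabilities(n=123, k=116, m=28, p=1e-4, T=T)
    for T in (2 ** j for j in range(5, 16))
}
```

Under these assumptions the sketch gives values that agree with those quoted in Example 9 to the stated three significant digits, which suggests the approximations are at least consistent with the exact formulas in this parameter range.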