Design and Analysis of Digital Ratioed Compressors for Inner Product Processing *

Inner product calculations are often required in digital neural computing. The critical path of the inner product of two binary vectors is the carry propagation delay generated from the individual product terms. In this work, two architectures for arranging digital ratioed compressors are presented to reduce the carry propagation delay in the critical path. In addition, the carry propagation delay estimates of these compressor building blocks are derived and compared. The theoretical analysis and the Verilog simulation both indicate that one of the compressor building blocks presented here might offer a sub-optimal solution for the basic building blocks used in digital hardware realizations of the inner product computation.


INTRODUCTION
Much effort has been devoted to the realization of neural networks, mainly owing to their attractive pattern recognition features [1, 2]. In the computation of neural networks, the inner product of two vectors is likely one of the most frequently used mathematical operations. Carry propagation unavoidably occurs when the neural networks are dedicated to either discrete or digital signals.
For instance, the recall of pattern pairs stored in a discrete bidirectional associative memory (BAM) requires computing a summation of the form

$Y = th\left(\sum_{i=1}^{n} Y_i (X_i \cdot X)\right)$,

where $X$ is the input pattern, $Y$ is the output pattern, the $X_i$'s and $Y_i$'s are the stored pattern pairs, and $th()$ is a threshold function. Notably, the components of every vector are either bipolar or binary. If $n$ is large in the above calculation, then the carry propagation of the inner product of the vectors will likely become the critical delay of the entire neural computation.

*This research was partially supported by National Science Council under grant NSC 88-2219-E-110-001. Based on "A comparison of two alternative architectures of digital ratioed compressor design for inner product processing", by C.-C. Wang, C.-J.
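The BAM recall summation described above can be sketched in a few lines of Python. This is only an illustration: the function names `threshold` and `bam_recall`, the tie-breaking rule at zero, and the small bipolar patterns are assumptions that do not come from the paper.

```python
# Hypothetical sketch of discrete BAM recall: Y = th(sum_i Y_i * (X_i . X)).
# Function names, the tie rule at zero, and the sample patterns are assumed.

def threshold(v):
    """Bipolar threshold: +1 for non-negative input, -1 otherwise (assumed tie rule)."""
    return 1 if v >= 0 else -1


def bam_recall(x, stored_x, stored_y):
    """Recall the output pattern associated with the input pattern x."""
    width = len(stored_y[0])
    acc = [0] * width
    for xi, yi in zip(stored_x, stored_y):
        # Inner product X_i . X: the operation whose carry chain limits speed.
        weight = sum(a * b for a, b in zip(xi, x))
        for k in range(width):
            acc[k] += weight * yi[k]
    return [threshold(v) for v in acc]


# Two orthogonal bipolar input patterns and their associated outputs.
stored_x = [[1, -1, 1, -1], [1, 1, -1, -1]]
stored_y = [[1, 1, -1], [-1, 1, 1]]
recalled = bam_recall(stored_x[0], stored_x, stored_y)
```

Each stored pair contributes one inner product, so the number of additions grows with both the pattern length and the number of stored pairs, which is why the carry chain of the summation matters.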
Many high-speed logic design styles have been proposed. However, each of them suffers from its own difficulties. For example, domino logic [3] cannot be non-inverting; NORA [4] has the charge sharing problem; all-N-logic [5] and robust single phase clocking [6] cannot operate correctly under clocks with short rise time or fall time, and thus cannot be easily integrated with other parts of a logic design; single-phase logic [7] and Zipper CMOS [8] contain slow P-logic blocks. Though Zhang et al. [9] proposed a compressor design to fix such problems by employing a so-called C²PL (complex CPL), several physical design factors are not fully considered or implemented. First, the NMOS transistors used for the pass logic cannot be of minimal size. Second, the sizes of the driving inverters have to be properly tuned. Third, the original design of [9] not only has poor fan-in and fan-out capability, but also produces very asymmetrical rise and fall delays, which are very likely to cause glitch hazards and unwanted power consumption. Fourth, no further analysis of the reduction of carry propagation delay is performed. In this paper, two alternative architectures of digital ratioed compressor building blocks based on 3-2 compressors are presented, in which the problems mentioned above are all resolved. An analytical form of the carry propagation delay estimate for these two architectures is also derived. Finally, HSPICE and Verilog simulation results are presented to verify the correctness of our observations.

FRAMEWORK OF RATIOED COMPRESSOR BUILDING BLOCKS

2.1. Basic Compressor Building Block Design
A 3-2 compressor is basically a full adder. The equations of a full adder are as follows:

$S = (a \oplus c)b' + (a \oplus c)'b = Fb' + F'b$
$C = (a \oplus c)b + (a \oplus c)'c = Fb + F'c$,    (1)

where $F$ denotes $(a \oplus c)$. The feature of such a compressor is that the output represents the number of 1's given in the inputs.

2.2. Ratioed 3-2 Compressor Design
Though a 3-2 compressor can be realized by a full adder, and Zhang et al. [9] proposed a C²PL design for 3-2 and 7-3 compressors, several design issues addressed in Section 1 are still ignored in their work. Figure 1 shows the schematic diagrams of the two types of 3-2 compressors based on complex complementary pass-transistor logic (C²PL) proposed in [9]. We use TSMC 0.6 μm 1P3M technology to re-design the 3-2 compressors, and the schematic diagrams of the ratioed 3-2 compressors are shown in Figure 2. In Section 3 of this paper, we demonstrate that the re-designed 3-2 compressor circuits overcome all of the problems mentioned in Section 1.
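As a behavioural cross-check of Eq. (1), independent of any transistor-level design, the following Python sketch evaluates the two output equations for all eight input combinations. The function name `compressor_3_2` and the use of 0/1 integers for logic levels are assumptions made for illustration.

```python
from itertools import product


def compressor_3_2(a, b, c):
    """Return (carry, sum) of a 3-2 compressor following Eq. (1), with F = a XOR c."""
    f = a ^ c
    s = (f & (1 - b)) | ((1 - f) & b)      # S = F b' + F' b
    carry = (f & b) | ((1 - f) & c)        # C = F b  + F' c
    return carry, s


# Exhaustive truth table over all eight input combinations.
table = {bits: compressor_3_2(*bits) for bits in product((0, 1), repeat=3)}
```

Reading the pair (carry, sum) as a two-bit binary number gives the number of 1's among the three inputs, which is the property stated for the compressor.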

2.3. A Primitive Architecture of Digital Ratioed Compressors
A 7-3 compressor building block can be constructed by cascading four 3-2 compressors, as shown in Figure 3. A 15-4 compressor building block can be formed similarly from two 7-3 compressors and three 3-2 compressors, as shown in Figure 4. Based on this design methodology, a $(2^n-1)$-to-$n$ compressor can be composed of two $(2^{n-1}-1)$-to-$(n-1)$ compressors and $(n-1)$ 3-2 compressors.
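The recursive composition rule can be illustrated with a small behavioural model in Python. The wiring is an assumption, since the figures are not reproduced here: the inputs are split into two sub-blocks plus one leftover bit, and that leftover bit enters the final row of 3-2 compressors as a carry-in. The names `full_adder`, `compress`, and `to_int` are hypothetical.

```python
def full_adder(a, b, c):
    """3-2 compressor: returns (carry, sum)."""
    total = a + b + c
    return total >> 1, total & 1


def compress(bits):
    """Compress 2**n - 1 input bits into an n-bit count (LSB first).

    Returns (count_bits, used), where `used` is the number of 3-2 compressors.
    The input length is assumed to be of the form 2**n - 1.
    """
    if len(bits) == 1:
        return [bits[0]], 0            # degenerate 1-to-1 block, no hardware
    half = (len(bits) - 1) // 2        # 2**(n-1) - 1 inputs per sub-block
    left, used_l = compress(bits[:half])
    right, used_r = compress(bits[half:2 * half])
    carry = bits[-1]                   # the remaining bit enters as carry-in
    out = []
    for x, y in zip(left, right):      # ripple through (n-1) 3-2 compressors
        carry, s = full_adder(x, y, carry)
        out.append(s)
    out.append(carry)
    return out, used_l + used_r + len(left)


def to_int(bits):
    """Interpret an LSB-first bit list as an integer."""
    return sum(b << i for i, b in enumerate(bits))
```

For seven inputs this model uses four 3-2 compressors, matching the 7-3 block described above. The ripple through the last row is what lengthens the critical path as n grows.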
Since the total delay of such a design is approximately proportional to the number of 3-2 compressors on the critical path, let $D_n$ denote the number of 3-2 compressors on the critical path when $2^n-1$ bits are fed into the $(2^n-1)$-to-$n$ compressor block. By observing the structure of the compressor blocks, we can deduce $D_2$, $D_3$, and $D_n$ as

(2)

By solving the above recurrence relation, we obtain

Apart from the delay of a single building block, we also have to count the processing stages needed for summing the individual inner product terms. The number of processing stages is roughly estimated as $\ln(n/M)/\ln(n/(2^n-1))$, where $n$ denotes the total number of output bits of the basic building block, and $M$ represents the bit count of the data inputs.
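The stage estimate quoted above is easy to evaluate numerically. The sketch below is a direct transcription of the expression; the function name `estimated_stages` and the sample values are assumptions chosen for illustration.

```python
import math


def estimated_stages(n, m):
    """Rough stage count ln(n/M) / ln(n / (2**n - 1)) from the text.

    n: output bits of one (2**n - 1)-to-n building block (n >= 2).
    m: bit count of the data inputs (M in the text).
    """
    return math.log(n / m) / math.log(n / (2 ** n - 1))


one_block = estimated_stages(3, 7)     # 7 input bits fit a single 7-3 block
larger = estimated_stages(3, 49)       # 49 input bits need several stages
```

With n = 3 and M = 7 the estimate is exactly one stage, since a single 7-3 block absorbs all of the inputs. The value is real-valued in general, which agrees with the text calling it a rough estimate.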

Compressor Building Blocks
The second architecture presented in this work to reduce the carry propagation delay of the critical path is shown in Figure 5. This architecture, inspired by the design methodology of systolic arrays, consists solely of parallelized 3-2 compressor building blocks at every processing stage.
Although it is difficult to derive an analytical form for the total delay of $(2^n-1)$-to-$n$ compressors in the systolic-like architecture, an upper bound on this delay can still be computed in light of the result given in Eq. (4); i.e.,

(5)

where $c$ is a small integer used to offset the bias between the estimated and the correct value of the total delay introduced in Eq. (4). Note that $c$ is much smaller than the first, dominant term enclosed in the ceiling function in Eq. (5).

Compared with the primitive architecture presented in Section 2.3, the systolic-like architecture improves the delay of the inner product calculation from $O(n^2)$ to $O(n)$. This improvement stems from the parallel computation at each processing stage, as shown in Figure 5.

Re-designed Building Blocks
To verify the correctness of our theoretical analysis, we use HSPICE and Verilog to conduct a series of simulations. The improvement in the asymmetry between rise delay and fall delay over the original design is illustrated through HSPICE simulations. The simulation results are tabulated in Table I.

Delay Simulations
The Verilog simulations are run for 20000 iterations each on the primitive architecture and the systolic-like architecture of 127-7 compressor building blocks. Table II compares the carry propagation delay of the two architectures of 127-7 compressor building blocks when they sum 127 data inputs.
The results demonstrate that the systolic-like architecture of digital ratioed compressors indeed yields the smaller carry propagation delay.

4. CONCLUSION
In this paper, a re-designed ratioed 3-2 compressor is presented to correct several problems in the design of Zhang et al. [9]. The equations for counting the number of 3-2 compressors on the critical path of $(2^n-1)$-to-$n$ compressors are derived and used to compare the performance of two digital ratioed compressor architectures. Our simulation results show that the systolic-like architecture gives a sub-optimal performance through the parallelized arrangement of 3-2 compressors at each stage of processing.

TABLE I
The comparison of rise delay and fall delay in the original design and the re-designed 3-2 compressor

TABLE II
The comparison of carry propagation delay for the two architectures of 127-7 compressors