Design of an ASIC Chip for Skeletonization of Graylevel Digital Images

This paper describes the design of an ASIC chip for thinning of graylevel images. The chip implements a Min-Max skeletonization algorithm and is based on a pipeline architecture where each stage of the pipeline performs masking operations on the graylevel images. The chip operates in real time at a frequency of 8 MHz and utilizes about 321 mils × 410 mils of silicon area.


INTRODUCTION
There has been a recent trend to develop special architectures for image processing which exploit the inherent parallelism found both in image data and in image processing algorithms. Unless this parallelism is suitably exploited, the sheer number of operations to be performed per second for even low level image processing tasks precludes economic real time processing with currently available processing speeds. It is, therefore, desirable to produce compact, high performance image processors that can execute basic image processing functions needed in different applications in real time. By operating in real time, it is possible to utilize all available data and hence achieve the highest possible throughput; in addition, the need for external buffer memories, which add to system complexity, cost and size, is avoided. Although several such VLSI ASIC chips have been developed [1-4] for implementing various image processing operations, none has so far been reported for thinning or skeletonization of a digital image.
Thinning procedures play a central role in a broad range of problems in image processing. Before an object is recognized, it is necessary to process the input image and identify the required features. Among the low level image processing operations commonly employed, thinning of the input image helps in clearly identifying the image boundary.
The skeleton of a region may be defined via the medial axis transformation (MAT) [5]. The MAT of a region R with border B may be described as follows. For each point p in R, we find its closest neighbour in B. If p has more than one such neighbour, then it is said to belong to the medial axis of R. Although the MAT of a region yields an intuitively pleasing skeleton, a direct implementation of the above definition is generally prohibitive from the computational point of view because it potentially involves calculating the distance from every interior point to every point on the boundary of a region. A number of algorithms have been proposed for improving the computational efficiency while at the same time producing a medial axis representation of a given region. Typically, most thinning algorithms iteratively delete edge points of a region subject to the constraints that the deletion of these points does not (i) delete end points, (ii) break connectedness and (iii) cause excessive erosion of the region. These algorithms can be broadly classified into two groups. The first group of algorithms operates on binary images only. Hence a graylevel image must first be converted into an equivalent binary image by suitable selection of thresholds before these algorithms are applied. The second group of algorithms has been designed for finding the skeleton directly from the graylevel images. The latter have the advantage that thinning of the image can be performed without the need for thresholding.
B. MAJUMDAR, V. V. RAMAKRISHNA, P. S. DEY AND A. K. MAJUMDAR
In the present paper an ASIC design for skeletonization of digital images on a single VLSI chip is described. The reported design is based on a Min-Max graylevel skeletonization algorithm developed by Goetcherian [6], which operates directly on graylevel images and therefore belongs to the second category mentioned above. The selection of the algorithm has been prompted by its usefulness in keeping the hardware complexity of the chip fairly low.

MIN-MAX SKELETONIZATION
Goetcherian [6] has shown that many algorithms developed for binary images can be adapted for use with graylevel images by substituting MIN and MAX functions for AND and OR functions and by treating the logical inverse of a binary image as being equivalent in graylevel images to one (the maximum gray value) minus the image. In the present work, Goetcherian's technique has been applied to a skeletonization algorithm developed by Arcelli et al. [7] for thinning of digital binary images with the help of a set of 3 × 3 masks. The result of any operation is placed in the central position of the mask. The masks are shown in Fig. 1 and the order in which they are applied is: A1, B1, A2, B2, A3, B3, A4, B4.
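The MIN/MAX substitution can be illustrated with a minimal sketch. The function names and the choice of 255 as the maximum gray value (8-bit pixels) are assumptions made for illustration; they are not taken from Goetcherian's paper or from the chip design.

```python
# Graylevel analogues of binary logic, following the substitution described
# in the text: AND -> MIN, OR -> MAX, NOT -> (maximum gray value) - value.
MAX_GRAY = 255  # assumption: 8-bit pixels


def g_and(a: int, b: int) -> int:
    """Graylevel counterpart of logical AND."""
    return min(a, b)


def g_or(a: int, b: int) -> int:
    """Graylevel counterpart of logical OR."""
    return max(a, b)


def g_not(a: int) -> int:
    """Graylevel counterpart of logical inversion."""
    return MAX_GRAY - a


# On a binary image encoded as 0 / MAX_GRAY the three functions reduce to
# ordinary Boolean logic, which is why binary algorithms carry over.
for a in (0, MAX_GRAY):
    for b in (0, MAX_GRAY):
        assert g_and(a, b) == (MAX_GRAY if (a and b) else 0)
        assert g_or(a, b) == (MAX_GRAY if (a or b) else 0)
    assert g_not(a) == (0 if a else MAX_GRAY)
```

For intermediate gray values the same functions simply return the smaller, the larger, or the complemented value, so no threshold has to be chosen.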
Each of the masks consists of two subgroups, white and black, shown in Fig. 1 by the unshaded positions and the positions shaded black, respectively. The positions marked 'x' in the masks are 'don't care' positions and need not be considered for a particular mask. The operation of each mask can be divided into the following three stages:
(i) Find whether all the pixels in the white subgroup are lower in graylevel value than the pixel on which the mask is currently centred.
(ii) Find whether the pixels in the black subgroup are greater in value than the central pixel.
(iii) If both the conditions given above in (i) and (ii) are satisfied, then replace the pixel in the central position of the mask by the maximum of the graylevel values of the pixels in the white subgroup.
Following Goetcherian's technique, for every pixel in an image, its eight neighbours have to be considered for performing any computation. Therefore, the first major task is to provide every mask with pixel values corresponding to a central pixel and its eight neighbouring pixels. The memory unit has, therefore, been designed so that it provides the pixels of one column in three rows simultaneously. To obtain all the eight neighbours of a particular pixel simultaneously, a set of six registers (R1 to R6), clocked by an internally generated 3-phase clock, has been incorporated in the design.

FIGURE 1 The masks (panels A1, A2, A3, A4).

CHIP ARCHITECTURE
The overall datapath for the complete operation is shown in Fig. 2. The chip comprises eight blocks, with each block performing the operation corresponding to one mask of Goetcherian's algorithm. Each block has three RAM banks, each of 512 bytes capacity. The algorithm specifies that the operation of the i-th mask should be performed on the output of the (i - 1)th mask. The blocks implementing the operations of the masks are, therefore, connected in a pipeline style. Each mask requires data corresponding to three consecutive rows for performing any operation and each of them extends the image on all four sides. A mask can start computation when it receives (i) data corresponding to the second row from the previous mask, (ii) the data corresponding to the first row which has already been stored in its memory bank, and, (iii) the second row of data which it is currently receiving.

The Datapath
The datapath can be followed with reference to the schematic diagram of a mask shown in Fig. 3(a). There are two sets of registers, (R1, R2, R3) and (R4, R5, R6). These two register sets are clocked by the phases PHI2 and PHI3, respectively, of a three phase non-overlapping clock. The inset in Fig. 3(a) shows the phases of the clock. The three outputs of the memory unit go to the first set of registers as inputs while the outputs of the latter are connected to the second set of registers. The outputs of the second set of registers (R4, R5, R6), which are clocked by PHI3, provide the (j - 1)th column pixels of the rows (i - 1), i and (i + 1). The outputs of the first set (R1, R2, R3), clocked by PHI2, give the jth column pixels of the same three rows. The outputs of the memory unit provide the (j + 1)th column pixels in turn. The pixel Pij therefore occurs at the output of one of the first set of registers (R1, R2, R3). Since the memory unit outputs change at the end of the clock phase PHI1, the outputs of the memory unit will hold the (j + 1)th column pixels at the end of PHI1, at which time all the pixels neighbouring Pij are available at the appropriate positions. The computation to be performed in each mask has to be completed before the values at the outputs of the registers and the memory unit change.
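The way the two register sets and the memory outputs together present a 3 × 3 neighbourhood can be modelled in software. The sketch below is a behavioural illustration only: the generator name `windows`, the `Column` type and the mapping of registers to rows are assumptions, the clock phases are collapsed into a single shift per column, and image extension at the borders is left out.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Column = Tuple[int, int, int]  # pixels of rows i-1, i, i+1 in one column


def windows(columns: Iterable[Column]) -> Iterator[List[List[int]]]:
    """Yield 3x3 neighbourhoods from a stream of three-row columns."""
    second: Optional[Column] = None  # models R4, R5, R6: column j-1
    first: Optional[Column] = None   # models R1, R2, R3: column j
    for mem in columns:              # memory-unit outputs: column j+1
        if first is not None and second is not None:
            # Each window row lists columns j-1, j, j+1 of one image row;
            # the central pixel Pij is first[1].
            yield [[second[r], first[r], mem[r]] for r in range(3)]
        # Shift: the second register set takes over the contents of the
        # first set, and the first set latches the memory outputs.
        second, first = first, mem


cols = [(1, 4, 7), (2, 5, 8), (3, 6, 9), (10, 11, 12)]
for w in windows(cols):
    print(w)
```

Two columns must pass through the registers before the first complete window appears, which mirrors the latency of the hardware datapath.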

Implementation of Mask Operations
The mask operations have been implemented with the help of the module Find-Max shown in Fig. 3(a) and (b). Since the white subgroup consists of three pixels in each mask and the black subgroup of two pixels (Fig. 1), a total of five 8-bit comparators are required for implementing operations (i) and (ii) above. For finding the maximum of the pixels in the white subgroup, two 8-bit comparators and two 8-bit multiplexers have been used. Another multiplexer is used to select either the central pixel value or the maximum of the pixel values in the white subgroup, depending on the conditions given in (i) and (ii), for onward transmission to the next memory unit (the central pixel value should not be changed in the current unit because the pixel will be used for computations centred on other pixels). The image must be extended at its four sides for operations centred on pixels constituting the boundary of the image. Since 3 × 3 masks are used, we need to extend each row by one pixel on either side. Similarly, a whole row of extension pixels needs to be added before the first row and after the last row. In the reported chip, extended pixels have been assigned the graylevel value of zero. Image extension has been carried out without introducing new pixels into the image, since that would have resulted in operations centred on those pixels. Four control signals, C3, C4, C5 and C6 (Fig. 3(a)), are used for extension of the image at the four sides. The pixel values have been multiplexed to zero at the appropriate time instants in a clock cycle.
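A software sketch of one mask stage may help to fix the three conditions and the zero-valued extension. The function `apply_mask` and the `WHITE` and `BLACK` offset lists are hypothetical: the actual positions of the white and black subgroups are defined in Fig. 1 and are not reproduced here, so the layout below is a placeholder chosen only to have three white and two black pixels.

```python
from typing import List, Sequence, Tuple

Offset = Tuple[int, int]  # (row offset, column offset) relative to the centre


def apply_mask(img: List[List[int]],
               white: Sequence[Offset],
               black: Sequence[Offset]) -> List[List[int]]:
    """One pipeline stage: apply a single 3x3 mask to a graylevel image."""
    rows, cols = len(img), len(img[0])

    def px(r: int, c: int) -> int:
        # Extension pixels outside the image read as graylevel zero.
        return img[r][c] if 0 <= r < rows and 0 <= c < cols else 0

    # Results are written to a copy, standing in for the next memory unit,
    # so that the current stage always reads unmodified pixels.
    out = [row[:] for row in img]
    for r in range(rows):
        for c in range(cols):
            centre = img[r][c]
            w = [px(r + dr, c + dc) for dr, dc in white]
            b = [px(r + dr, c + dc) for dr, dc in black]
            # Conditions (i) and (ii): the role of the five comparators.
            if all(v < centre for v in w) and all(v > centre for v in b):
                out[r][c] = max(w)  # Find-Max followed by the output mux
    return out


# Hypothetical layout, NOT taken from Fig. 1: white = top row of the mask,
# black = two pixels of the bottom row.
WHITE = [(-1, -1), (-1, 0), (-1, 1)]
BLACK = [(1, 0), (1, 1)]

image = [[1, 2, 3],
         [9, 5, 9],
         [0, 7, 8]]
print(apply_mask(image, WHITE, BLACK))
```

Chaining eight such calls, one per mask in the order A1, B1, ..., A4, B4, would correspond to one pass of the image through the pipeline.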

The Memory Unit
The memory unit (Fig. 4) is organized as three memory banks, each of 512 words with 8 bits per word. The nth row is stored sequentially, as it arrives, in the first memory bank and the (n + 1)th row in the second, while the (n + 2)th row is being written into the third bank. The masks get the three pixels in the required sequence as follows. Let the jth column of the (n + 2)th row be the current data at the input to the chip. These data are latched by clock PHI1. By reading out simultaneously the jth location of banks 1 and 2, we get the jth column pixels of the nth and (n + 1)th rows. The jth column pixel of the (n + 2)th row is currently available at the output of the latch and can be sent through a suitable bus driver to the different masks.

The Address Generator
A 9-bit linear feedback shift register (LFSR) has been used for the address generation. The LFSR implements the polynomial function (1 + x^4 + x^9) for maximum length sequence generation. Only one XOR gate is required to implement the feedback of the LFSR. The maximum length sequence LFSR will not generate the all zero state and, therefore, it has to be reset for one cycle in every 512 cycles. However, once in the all zero state, the LFSR will not be able to come out of it by itself at the next clock pulse even if the reset signal is removed. Therefore we must seed the LFSR with some other state during the following clock pulse. This has been done using the set input line (Fig. 4), whence the LFSR will produce a new address on each clock pulse. This resetting and seeding of the LFSR is done by the RAM controller.
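The address sequence described above can be reproduced with a short behavioural model. This is a sketch under stated assumptions: the bit numbering (feedback taken from stages 9 and 4 and shifted in at the low end) and the all-ones seed applied by the set line are illustrative choices, and the names `lfsr_step` and `row_addresses` are hypothetical.

```python
from typing import List


def lfsr_step(state: int) -> int:
    """Advance a 9-bit LFSR by one clock; a single XOR forms the feedback."""
    fb = ((state >> 8) ^ (state >> 3)) & 1  # stages 9 and 4
    return ((state << 1) | fb) & 0x1FF


def row_addresses(seed: int = 0x1FF) -> List[int]:
    """Return the 512 addresses used while one image row is stored.

    Cycle 0 is the reset cycle (address 0). In the next cycle the register
    is seeded (assumed here: all ones), after which it free-runs through
    the remaining non-zero states.
    """
    addrs = [0]
    state = seed
    for _ in range(511):
        addrs.append(state)
        state = lfsr_step(state)
    return addrs


addrs = row_addresses()
print(len(addrs), len(set(addrs)))
```

The addresses are not in counting order, but since every row is written and read with the same sequence, each of the 512 locations of a bank is used exactly once per row.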

The Control Paths
There are three controllers in the chip: the RAM controller and the two chip controllers CONTROL 1 and CONTROL 2. All three controllers have the following features: (a) the state machine cycles through a fixed sequence of states and (b) the state transitions do not depend on the outcome of the computations performed in the masks. The RAM controller supplies the signal CEB (Chip Enable Bar) to initiate a read or write, the three Write Enable Bar signals WEB1, WEB2 and WEB3 for the three RAM banks, the status signals S0 and S1 to control the multiplexers, and the set and reset pulses for the address generator. The inputs to the RAM controller are the signals EOLN (end of line), EOF (end of frame) and the chip select CS. The clock is obtained by ORing PHI1 and PHI2.

Chip Controllers
The chip controllers CONTROL 1 and CONTROL 2 take as their inputs the signals EOF and EOLN, together with RESET, which initializes the controllers and the chip. The controllers provide the control signals needed for extension of the image at the four sides as well as the control signals required for the interface between two consecutive masks. They also provide the signals DV (data valid), NEWFR (new frame) and NEWLN (new line).
The state diagram for CONTROL 1 is shown in Fig. 5. It has 25 states and is clocked by PHI1. The output signals are C3, C4, C5, C6, DV, NEWFR and NEWLN. The controller goes to state I whenever the chip RESET line is made active, and remains there until an EOF signal is detected, whereupon it moves to state A. It waits in state A for one row while the first row is being stored and, when an EOLN signal is detected, it moves to AX. Fig. 5 illustrates the state transitions through one complete frame, and the control signals generated by CONTROL 1 are shown in Fig. 6. Chip controller CONTROL 2 is used to interface between two consecutive masks. The inputs to the controller are the 'end of frame' and 'end of line' signals of the previous mask. For the first mask, A1, these are identical to the EOF and EOLN signals input to the chip. For the other masks these signals are the EF (end of frame) and EL (end of line) signal outputs of CONTROL 2 of the previous mask. The state transition diagram for CONTROL 2 is illustrated in Fig. 7.

Operating Frequency
Since the chip requires a three-phase non-overlapping clock, a clock generator has been designed using JK flip-flops. The input to the chip can be a single clock, which is converted to the three phase non-overlapping clock internally by the clock generator. The operating frequency of the clock is decided by the rate at which data come into the chip (fresh data are supposed to come into the chip on every positive pulse of PHI2). For a 512 × 512 image and a frame rate of 30 per second, a clock period of 127 ns is obtained. The minimum operating frequency of the chip would thus be around 8 MHz. In the present work, a higher operating frequency could not be envisaged because the pseudo-static RAMs used in the design have a cycle time of 100 ns, which allows only one read or write per cycle. The operating frequency can be considerably enhanced if faster RAMs are used.

IMPLEMENTATION OF THE DESIGN
The chip was implemented using VTI (VLSI Technology Inc.) CAD tools based on a two metal CMOS process on a SUN 3 workstation. CMOS standard cells, having a minimum line geometry of 1.5 µm, were used for implementing the datapaths. The controllers and the memory were implemented as compiled cells using the Finite State Machine (FSM) compiler and the CMOS Random Access Memory (RAM) compiler available with the VTI tools. The on-chip memory unit comprises three pseudo-static RAM banks with a cycle time of 100 ns.
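The clock period quoted in the discussion of the operating frequency follows from simple arithmetic on the pixel rate. The sketch below assumes one pixel per clock cycle and ignores any blanking intervals; the variable names are illustrative.

```python
# Pixel-rate arithmetic for a 512 x 512 image at 30 frames per second,
# assuming one pixel enters the chip per clock cycle (no blanking).
ROWS, COLS, FRAMES_PER_SECOND = 512, 512, 30
RAM_CYCLE_NS = 100.0  # cycle time of the pseudo-static RAM banks

pixel_rate = ROWS * COLS * FRAMES_PER_SECOND  # pixels per second
period_ns = 1e9 / pixel_rate                  # clock period in nanoseconds
frequency_mhz = pixel_rate / 1e6              # clock frequency in MHz

print(f"pixel rate   : {pixel_rate} pixels/s")
print(f"clock period : {period_ns:.1f} ns")
print(f"frequency    : {frequency_mhz:.2f} MHz")

# One RAM access of 100 ns fits in a clock period, but two would not,
# which is why only one read or write per cycle is possible.
assert RAM_CYCLE_NS < period_ns < 2 * RAM_CYCLE_NS
```

The computed period of about 127 ns corresponds to a frequency just under 8 MHz, in line with the figures given in the text.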
A timing simulation of the design was carried out using the VTI Sim simulator and the Timing Verifier. VTI Sim is a mixed mode, event driven simulator. It simulates transistor logic with timing for MOS technologies and predicts logic levels, approximate voltages and approximate timing of changes in circuit nodes. The Timing Verifier performs a static timing analysis on complete integrated circuits or subcircuits and is used after the logic functionality is verified using VTI Sim. The chip has been found to operate satisfactorily for real time processing of 512 × 512 video data at a clock rate of 8 MHz. In order to evaluate the performance of the present design, the simulator was fed with inputs corresponding to synthetic patterns. In most cases the skeletonized version generated was found to be satisfactory. A typical test pattern corresponding to the letter 'L' and its skeletonized version generated with the help of the simulator are shown in Fig. 8.
The Chip Compiler utility of the VTI tools was used for automatic placement and routing. The Chip Compiler provides facilities for interactively modifying the floorplan of the chip, for routing between arbitrary blocks, standard cell routing, power routing and pad ring placement. It also provides an estimate of the final chip area and the number of transistors used. After several such place and route operations, with manual modifications wherever necessary, the overall chip area, including I/O pads, was found to be 321 × 410 mils². Approximately 79,000 transistors have been used in the core area. The final layout of the chip with the I/O pads is shown in Fig. 9. It has a pin count of 35 and as such would require a 40-pin package.

CONCLUSION
The architecture of an ASIC chip for skeletonization of graylevel images in real time has been presented. No such on-chip thinning of digital images has been available so far. The chip operates in real time and can provide image data continuously at a clock rate of 8 MHz. The hardware complexity of the design is extremely modest. The pseudo-static RAM which has been used has a cycle time of 100 ns. This allows only one read or write per cycle. As already mentioned in Section 3.6, the operating frequency can be enhanced by using faster RAMs. Also, if we can execute a read during one phase of the clock cycle and write into the same bank during the next phase, we would require only two RAM banks per mask, thus reducing the chip area. The area of the chip can also be appreciably reduced by moving the memory off-chip.

FIGURE 3(a) Schematic of mask A (Inset: Three phases of the clock).

FIGURE 3(b) Schematic of module Find-Max.
FIGURE 4 Schematic of memory unit.

FIGURE 5 State diagram for chip controller CONTROL 1.
FIGURE 6 Control signals generated by CONTROL 1.

FIGURE 7 State diagram for chip controller CONTROL 2.

FIGURE 8 A test pattern before and after skeletonization.
FIGURE 9 Final chip layout.
