Parameterized Hardware Design on Reconfigurable Computers: An Image Processing Case Study



Introduction
Reconfigurable Computers (RCs) are traditional computers extended with co-processors based on reconfigurable hardware such as FPGAs. Representative RC systems include SGI RC100 [1], SRC-6 [2], and Cray XD1 [3]. These enhanced systems are capable of providing significant performance improvement for scientific and engineering applications [4]. The performance of a hardware design on an FPGA device depends on both the intrinsic parallelism of the design and the characteristics of the FPGA co-processor architecture, which consists of the FPGA device itself and the surrounding data interface. Due to the limited size of the internal Block RAM memory, it is not practical to store large amounts of data inside the FPGA device. Therefore, external but local SRAM modules are generally connected to the hardware co-processor for data storage, as in the example shown in Figure 1. Furthermore, an FPGA co-processor can directly access the host memory through the interconnect, which generally provides a sustained bandwidth of up to several GB/s. The number and data width of the available local memory banks and interconnect channels play important roles in the hardware implementation on an FPGA device and, in many cases, determine the parallelism a design can achieve. In this work, image processing algorithms are adopted as a case study to demonstrate how hardware designs can be parameterized by the data I/O of the FPGA co-processor in order to achieve the best performance.
Image processing applications are capable of gaining a significant performance improvement with hardware design [5,6]. Two categories of image processing applications are selected to represent hardware designs that present different data I/O requirements. The first category of applications is image registration, which requires the use of local memory for data access. The second category of applications is edge detection, which directly reads from or writes to the interconnect in a streaming fashion. Image registration is a very important image processing task. It is used to align or match pictures taken under different conditions (at different times, from different angles, or from different sensors). A vast majority of automatic image processing systems require the use of image registration processes. Common image registration applications include target recognition (identification of a small target inside a much bigger image), satellite image alignment in order to detect changes such as land usage or forest fires [7], matching stereo images to recover depth information, and superposing medical images taken at different moments for diagnosis [8,9]. As an example of the applications in the second category, edge detectors encompass image processing algorithms that identify the positions of edges in an image. Edges are discontinuities in terms of intensity or orientation or both and generally represent meaningful characteristics of the image (e.g., boundaries of objects). Commonly, edge detectors are used to filter relevant information in an image. Thus, they greatly reduce the amount of processing needed for interpreting the information contents of an image. One of the most important edge detection algorithms is the Canny edge detection [10]. The Canny edge detection operator was developed by Canny in 1986 and uses a multistage algorithm to detect a wide range of edges in images. It remains a state-of-the-art edge detector used in many applications.
The implementation of image registration and edge detection on reconfigurable computers has been previously reported in [11,12], respectively. In this paper, we not only present the details of the hardware design itself, but also examine the role of the data I/O of the FPGA co-processor in the design process. More precisely, we demonstrate how the design of image processing applications is parameterized by the local memory architecture and the DMA interface on reconfigurable computers. Furthermore, the hardware processing time of both applications is formalized in terms of clock cycles. The remaining text is organized as follows. In Section 2, we discuss the related work based on a literature survey. In Section 3, two related image registration algorithms, the exhaustive search algorithm and the DWT-based search algorithm, and their hardware implementations are discussed. Section 4 focuses on the design of Canny edge detection in hardware. The implementation of all applications on the Cray XD1 reconfigurable computer is presented in Section 5. Finally, Section 6 concludes this work.

Related Work
Since image registration is a computation-demanding process in general, hardware (e.g., an FPGA device) is leveraged to improve the processing performance. In [8], Dandekar and Shekhar introduced an FPGA-based architecture to accelerate mutual information (MI)-based deformable registration during computed tomography (CT)-guided interventions. Their reported implementation was able to reduce the execution time of MI-based deformable registration from hours to a few minutes. Puranik and Gharpure presented a multilayer feedforward neural network (MFNN) implementation in Xilinx XL4085 for template search in standard sequential scan and detect (SSDA) image registration [13]. In [14], Liu et al. proposed a PC-FPGA geological image processing system in which the FPGA was used to implement Fast Fourier Transform (FFT)-based automatic image registration. In [15], El-Araby et al. prototyped an automatic image registration methodology for remote sensing using a reconfigurable computer. However, these previous works emphasized only the image registration algorithm itself. The FPGA data I/O as a factor in the design was not discussed in detail.
Low-level image processing operators, such as digital filters, edge detectors, and digital transforms, are good candidates for hardware implementation. In [16], a generic architectural model for implementing image processing algorithms of real-time applications was proposed and evaluated. In [17], a Canny edge detection application written in Handel-C and implemented in an FPGA device was discussed. The proposed architecture is capable of producing one edge pixel every clock cycle. The work in [18] illustrated how to use design patterns in the mapping process to overcome image processing constraints on FPGAs. However, most of the previous works, for example, [16-18], focused on the algorithms alone and did not consider the platform characteristics as a factor in the design.

Image Registration
In this section, we discuss how to implement image registration algorithms in hardware to exploit the processing parallelism, which is bounded by the local memory architecture.
3.1. Background.
Image registration can be defined as a mapping between two images, the reference image R and the test image T, both spatially and with respect to intensity [19]. If these images are defined as two 2D arrays of a given size denoted by $I_1$ and $I_2$, where $I_1(x, y)$ and $I_2(x, y)$ each map to their respective intensity values, then the mapping between images can be expressed as
$$I_2(x, y) = g(I_1(f(x, y))), \tag{1}$$
where f is a 2D spatial-coordinate transformation and g is a 1D intensity or radiometric transformation. More precisely, f is a transformation that maps two spatial coordinates, x and y, to new spatial coordinates x' and y':
$$(x', y') = f(x, y). \tag{2}$$
g is used to compensate for gray value differences caused by different illuminations or sensor conditions.
According to [19], image registration can be viewed as the combination of four components: (1) a feature space, that is, the set of characteristics used to perform the matching and which are extracted from the reference and test images; (2) a search space, that is, the class of potential transformations that establish the correspondence between the reference and test images; (3) a search strategy, which is used to decide how to choose the next transformation from the search space; (4) a similarity metric, which evaluates the match between the reference image and the transformed test image for a given transformation chosen in the search space.
The fundamental characteristic of any image registration technique is the type of spatial transformation or mapping used to properly overlay two images. The most common transformations are rigid-body, affine, projective, perspective, and global polynomial. Rigid-body transformation is composed of a combination of a rotation ($\theta$), a translation ($t_x$, $t_y$), and a scale change (s). An example is shown in Figure 2. It typically has four parameters, $t_x$, $t_y$, s, and $\theta$, which map a point $(x_1, y_1)$ of the first image to a point $(x_2, y_2)$ of the second image as follows:
$$p_2 = t + s R p_1, \quad \text{that is,} \quad \begin{pmatrix} x_2 \\ y_2 \end{pmatrix} = \begin{pmatrix} t_x \\ t_y \end{pmatrix} + s \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x_1 \\ y_1 \end{pmatrix}, \tag{3}$$
where $p_1$ and $p_2$ are the coordinate vectors of the two images, t is the translation vector, s is a scalar scale factor, and R is the rotation matrix. Since the rotation matrix R is orthogonal, the angles and lengths in the original image are preserved after the registration. Because of the scalar scale factor s, rigid-body transformation allows changes in length relative to the original image, but they are the same along both the x and y axes. (Please note that both $(x_1, y_1)$ and $(x_2, y_2)$ are coordinates in the same Cartesian coordinate system with the origin O.)
Computing the correlation coefficient is the basic statistical approach to registration and is often used for template matching or pattern recognition. A correlation coefficient is a similarity measure or match metric, that is, it gives a measure of the degree of similarity between a template (the reference image) and an image (the transformed test image). The correlation coefficient between the reference image R and the image T', which is the test image after rigid-body transformation, is given as
$$C(R, T') = \frac{\sum_{x,y} (R(x, y) - \mu_R)(T'(x, y) - \mu_{T'})}{\sqrt{\sum_{x,y} (R(x, y) - \mu_R)^2 \sum_{x,y} (T'(x, y) - \mu_{T'})^2}}, \tag{4}$$
where $\mu_R$ and $\mu_{T'}$ are the means of the images R and T'. If the image R matches T', the correlation coefficient will have its peak at the corresponding transformation. Therefore, by computing correlation coefficients over all possible transformations, it is possible to find the transformation that yields the peak value of the correlation coefficient.
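The two building blocks described above, the rigid-body mapping of a point and the correlation coefficient between two images, can be sketched in a few lines of Python. This is only an illustrative software model: the function names `rigid_transform` and `correlation_coefficient` are hypothetical, images are assumed to be equally sized lists of rows, and the conventional counterclockwise rotation matrix is assumed.

```python
import math

def rigid_transform(x1, y1, theta, tx, ty, s=1.0):
    """Map (x1, y1) to (x2, y2) = t + s * R(theta) * (x1, y1)."""
    c, sn = math.cos(theta), math.sin(theta)
    x2 = tx + s * (c * x1 - sn * y1)
    y2 = ty + s * (sn * x1 + c * y1)
    return x2, y2

def correlation_coefficient(ref, img):
    """Correlation coefficient between two equally sized 2D intensity lists."""
    r = [p for row in ref for p in row]
    t = [p for row in img for p in row]
    mu_r = sum(r) / len(r)
    mu_t = sum(t) / len(t)
    num = sum((a - mu_r) * (b - mu_t) for a, b in zip(r, t))
    den = math.sqrt(sum((a - mu_r) ** 2 for a in r) *
                    sum((b - mu_t) ** 2 for b in t))
    # A constant image has zero variance; report 0.0 instead of dividing by zero.
    return num / den if den else 0.0
```

An image correlated with itself, or with any copy whose intensities are scaled and offset by positive constants, yields a coefficient of 1, which is why the metric tolerates global brightness differences between R and T'.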
In this work, rigid-body transformation is selected for the registration between two images and the correlation coefficient is used to measure the similarity.Further, we assume that both the reference image and the test image are 8-bit grayscale and share the same size.
Given a search space $(\Delta\Theta, \Delta X, \Delta Y)$, theoretically all tuples $(\theta, t_x, t_y)$ are to be tested to find the tuple that generates the maximum correlation coefficient between the reference image and the transformed test image. (In this work, the scale factor s is fixed at 1.) Figure 3 shows the two steps to test each tuple. The first step is to apply a rigid-body transformation on the test image T to get T'. The second step is to calculate the correlation coefficient between T' and the reference image R. As shown in Figure 3(b), only the pixels of both images within the shaded region are used during the calculation. Since the test image T is rotated and translated to obtain T', some parts of T' are beyond the shaded region. In other words, some portions of the shaded region (shown as crossed regions in the left Cartesian coordinate system of Figure 3(b)) do not belong to T'. The pixels belonging to these crossed regions are treated as zeros in the calculation. In the remaining part of this section, two different approaches based on rigid-body transformation are discussed in detail. The first approach literally tests the whole search space to find the best tuple $(\theta, t_x, t_y)$. The second approach applies DWT on both the reference image and the test image to reduce the search resolutions in order to improve the search efficiency.

Exhaustive Search Algorithm.
As its name implies, the exhaustive search algorithm tests all possible tuples $(\theta, t_x, t_y)$ with fixed search resolutions $\delta_\theta$, $\delta_x$, and $\delta_y$ on the respective dimensions, in order to find the tuple that produces the highest correlation coefficient between the transformed test image and the reference image. If this algorithm is implemented on a scalar microprocessor, these tuples have to be tested in sequence. However, if the same algorithm is implemented in hardware, multiple tuples can be tested in parallel to improve the performance.
Since the size of an image is normally bigger than the amount of available Block RAM inside an FPGA device, the local external memory is used to store images. Assume that there are P + 2 individual local memory banks connected directly to the FPGA device, that one bank keeps the reference image R, and that one bank keeps the test image T. The remaining P banks are used to store P transformed test images T', each obtained with a different tuple $(\theta, t_x, t_y)$, as shown in Figure 4. If we further assume that each memory bank has its own independent read and write ports, P transformations of the test image can be carried out concurrently. The calculation of correlation coefficients between the image R and the P different images T' can be performed in parallel as well.
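The grouping of the search space into passes of P tuples can be illustrated with a small software sketch. The helper name `tuple_batches` and the enumeration order are assumptions made for illustration; each yielded list stands for the P transformed images that would be held in the P memory banks during one pass.

```python
from itertools import product

def tuple_batches(thetas, txs, tys, p):
    """Yield lists of at most p (theta, tx, ty) tuples, one list per hardware pass."""
    batch = []
    for tup in product(thetas, txs, tys):
        batch.append(tup)
        if len(batch) == p:
            yield batch
            batch = []
    if batch:
        # The last pass may use fewer than p banks.
        yield batch
```

For example, 12 tuples with P = 4 are covered in 3 passes, whereas a scalar processor would need 12 sequential tests.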
Given the coordinate of one pixel in the original image, (3) is used to calculate the coordinate of the corresponding pixel in the transformed image. Then, the intensity of the pixel $(x_1, y_1)$ in T can be written into T' at the coordinate $(x_2, y_2)$. If we assume that there are S pixels in the original image and the hardware implementation is fully pipelined, the transformation step would take approximately S clock cycles. Furthermore, some extra clock cycles are needed to initialize the intensities of all pixels within the shaded region to zero, for two reasons. First, there are several regions within which the pixels do not belong to T', as shown in Figure 3(b). Second, there may exist artifacts, that is, pixels whose coordinates are within both the shaded region and T' but which are never written due to discretization, as shown in Figure 5. If the intensities of these pixels are left uninitialized, the accuracy of the correlation coefficient may be affected. Therefore, it is necessary to initialize the intensities of all pixels in the shaded region to zero in the first place. Since the access ports of the memory bank are multiple bytes wide, multiple pixels can be initialized in one clock cycle. If we assume that the data width is D bytes, then the initialization process would take roughly S/D clock cycles. Overall, the transformation step of P tuples would take ((D + 1)/D)S clock cycles. The mean intensity $\mu_{T'}$ of each T' can be calculated during the transformation step; hence, it takes no extra time. The mean intensity $\mu_R$ can be precalculated by the microprocessor and forwarded to the FPGA device later since it remains unchanged during the whole image registration process.
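A minimal software sketch of this forward-mapping step is given below. It is not the hardware design: the function name `transform_image` is hypothetical, the rotation origin is assumed to be pixel (0, 0), the scale factor is fixed at 1, and coordinates are rounded to the nearest pixel. The zero initialization comes first, so that both out-of-range regions and discretization holes end up as zeros.

```python
import math

def transform_image(test, theta, tx, ty):
    """Forward-map every pixel of `test` into a zero-initialized image."""
    rows, cols = len(test), len(test[0])
    out = [[0] * cols for _ in range(rows)]      # initialization step
    c, sn = math.cos(theta), math.sin(theta)
    for y1 in range(rows):
        for x1 in range(cols):
            # Equation (3) with s = 1, rounded to the pixel grid.
            x2 = round(tx + c * x1 - sn * y1)
            y2 = round(ty + sn * x1 + c * y1)
            if 0 <= x2 < cols and 0 <= y2 < rows:
                out[y2][x2] = test[y1][x1]
    return out
```

Because several source pixels can round to the same destination while some destinations receive none, a rotated image generally contains unwritten pixels inside its footprint, which is the artifact illustrated in Figure 5.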
Although the calculation of the correlation coefficient in (4) between R and the images T' is more complicated than the transformation step, it takes the same time as the initialization process since D pixels can be read and processed in the same clock cycle. Altogether, these three steps, namely initialization, transformation, and correlation coefficient calculation, would take ((D + 2)/D)S clock cycles for testing P tuples $(\theta, t_x, t_y)$. If the entire search space consists of $\Delta\Theta \cdot \Delta X \cdot \Delta Y$ tuples, the whole registration process would take
$$\frac{\Delta\Theta \cdot \Delta X \cdot \Delta Y}{P} \cdot \frac{D + 2}{D} \cdot S \tag{5}$$
clock cycles. Clearly, the image registration time in hardware can be significantly reduced by increasing the number of local memory banks. Widening the data width of the access port of local memory can improve the performance as well; however, the improvement quickly approaches its upper bound because
$$\lim_{D \to \infty} \frac{D + 2}{D} = 1. \tag{6}$$
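The cost model above is easy to evaluate numerically. The sketch below is an illustration under stated assumptions: the function name `exhaustive_search_cycles` is hypothetical, and the ceiling on the number of passes is an added assumption for search spaces whose size is not a multiple of P.

```python
import math

def exhaustive_search_cycles(num_tuples, s, p, d):
    """Estimated clock cycles: ceil(num_tuples / p) passes of ((d + 2) / d) * s cycles."""
    passes = math.ceil(num_tuples / p)
    return passes * (d + 2) * s / d

def per_pass_factor(d):
    """Cycles per pass relative to S; approaches 1 as the port widens."""
    return (d + 2) / d
```

With these definitions, doubling P halves the estimate, while widening the port from D = 1 to D = 8 lowers the per-pass factor from 3 to 1.25, and no width can push it below 1.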

DWT-Based Search Algorithm.
Although the exhaustive search algorithm is quite straightforward, it is computation-demanding as well. In [20], a DWT-based image registration approach was proposed. As shown in Figure 6, both the test image and the reference image go through several levels of Discrete Wavelet Transform (DWT) before image registration is applied. After each level of DWT, the image size is shrunk to 1/4 of the previous level. In the meantime, the image resolution is reduced to half. For example, if k levels of DWT are applied on both the test image $T_0$ and the reference image $R_0$, two series of images, $T_0, T_1, \ldots, T_k$ and $R_0, R_1, \ldots, R_k$, are obtained. The registration process starts from the exhaustive search between $T_k$ and $R_k$ over the search space $(\Delta\Theta, \Delta X, \Delta Y)$ with the search resolutions $2^k \delta_\theta$, $2^k \delta_x$, and $2^k \delta_y$ on the respective dimensions. The registration result between $T_k$ and $R_k$, $(\theta_k, t_{xk}, t_{yk})$, becomes the center of the search space of the registration between $T_{k-1}$ and $R_{k-1}$. In other words, the registration between $T_{k-1}$ and $R_{k-1}$ is over the search space $(\theta_k \pm 2^k \delta_\theta, t_{xk} \pm 2^k \delta_x, t_{yk} \pm 2^k \delta_y)$. However, the search resolution is refined to $2^{k-1} \delta_\theta$, $2^{k-1} \delta_x$, and $2^{k-1} \delta_y$ on the respective dimensions. In general, when the registration process traces back one level, the search scope is reduced to half on each dimension, and the search resolution is doubled on each dimension. This search strategy is illustrated in Figure 7. Table 1 details the search space and search resolution at each step, taking rotation as an example.
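The coarse-to-fine schedule can be sketched for a single dimension as follows. This is a toy model rather than the registration itself: the names `level_candidates` and `coarse_to_fine_1d` are hypothetical, and a single `score` function stands in for the correlation coefficient that would really be computed on the decomposed images of each level.

```python
def level_candidates(center, level, delta):
    """Candidate values on one dimension at DWT level `level` (below the top level k)."""
    step = (2 ** level) * delta      # search resolution at this level
    half_range = 2 * step            # +/- 2^(level + 1) * delta around the center
    return [center - half_range + j * step for j in range(5)]

def coarse_to_fine_1d(score, lo, hi, delta, k):
    """Exhaustive search at level k, then refinement down to level 0."""
    step = (2 ** k) * delta
    n = int(round((hi - lo) / step))
    best = max((lo + j * step for j in range(n + 1)), key=score)
    for level in range(k - 1, -1, -1):
        best = max(level_candidates(best, level, delta), key=score)
    return best
```

Each refinement level offers 5 candidates per dimension, so a three-dimensional search over $(\theta, t_x, t_y)$ tests 5 × 5 × 5 = 125 tuples per level.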
Different DWT decomposition processes have to be carried out in sequence. Similarly, the search processes at different levels need to be performed one after the other. For these two reasons, the original image and the decomposed images can be stored in the same memory bank, as shown in Figure 8, in which $R_k$ or $T_k$ denotes the decomposed image at level k.
If we use the same assumptions as in Section 3.2, that is, both original images consist of S pixels, the data width of the local memory is D bytes, and there are P + 2 independent local memory banks, then the DWT decomposition step alone will take [...] clock cycles. The search between the decomposed reference image and the test image at each level can use the same method described in Section 3.2. By observing Table 1, it is found that the search space at each level, except level k, consists of 125 tuples $(\theta, t_x, t_y)$, that is, 5 candidate values on each of the three dimensions. Therefore, the search from level 0 to level k − 1 will take [...] clock cycles. The search at level k itself takes [...] clock cycles.
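As a rough sketch only, and not the authors' own formulas, the two search costs can be estimated by assuming that the cost model of Section 3.2 applies unchanged at every level, that the image at level i contains $S/4^i$ pixels, and that the number of tuples at level k shrinks by a factor of $2^k$ on each dimension. The symbols $C_i$ and $C_k$ are introduced here purely for illustration. The DWT decomposition cost depends on the filter implementation and is not estimated.

```latex
% Assumed cost of levels i = 0, ..., k-1 (125 tuples each, P tuples tested per pass)
C_i \approx \left\lceil \frac{125}{P} \right\rceil \cdot \frac{D+2}{D} \cdot \frac{S}{4^{i}},
\qquad
\sum_{i=0}^{k-1} C_i \approx \left\lceil \frac{125}{P} \right\rceil \cdot \frac{D+2}{D}
  \cdot \frac{4S}{3}\left(1 - 4^{-k}\right)

% Assumed cost of the exhaustive search at the coarsest level k
C_k \approx \frac{\Delta\Theta \cdot \Delta X \cdot \Delta Y}{8^{k} P}
  \cdot \frac{D+2}{D} \cdot \frac{S}{4^{k}}
```

Under these assumptions the refinement levels together cost less than 4/3 of a single 125-tuple search on the full-size image, and the level-k search is cheaper than the full exhaustive search by a factor of $32^k$.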

Edge Detection
Edge detection aims at identifying pixels in a digital image at which the image brightness changes sharply, that is, where it has discontinuities. Most edge detection algorithms involve a convolution between the image and a kernel. Convolution provides a way of "multiplying together" two arrays of numbers, generally of different sizes but of the same dimensionality, to produce a third array of numbers of the same dimensionality. This can be used in image processing to implement operators whose output pixel values are simple linear combinations of certain input pixel values. In an image processing context, one of the input arrays is normally just a grayscale image. The second array is usually much smaller and is also two-dimensional (although it may be just a single coefficient). The second array is known as the kernel, as shown in Figure 9. If the image has M rows and N columns, and the kernel has m rows and n columns, then the output image will consist of M − m + 1 rows and N − n + 1 columns. Mathematically, we can write the convolution between the image I and the kernel K as
$$O(x_1, y_1) = \sum_{x_2 = 0}^{m - 1} \sum_{y_2 = 0}^{n - 1} I(x_1 + x_2, y_1 + y_2) \, K(x_2, y_2), \tag{10}$$
where $x_1$ runs from 0 to M − m and $y_1$ runs from 0 to N − n.
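A direct software rendering of this definition is shown below as a reference model. The function name `convolve_valid` is hypothetical, images and kernels are assumed to be lists of rows, and, following the definition above, the kernel is applied without flipping.

```python
def convolve_valid(image, kernel):
    """Return the (M - m + 1) x (N - n + 1) output of sliding `kernel` over `image`."""
    M, N = len(image), len(image[0])
    m, n = len(kernel), len(kernel[0])
    out = [[0] * (N - n + 1) for _ in range(M - m + 1)]
    for x1 in range(M - m + 1):          # output row
        for y1 in range(N - n + 1):      # output column
            out[x1][y1] = sum(image[x1 + x2][y1 + y2] * kernel[x2][y2]
                              for x2 in range(m) for y2 in range(n))
    return out
```

Every output pixel depends only on its own m × n neighborhood of the input, which is what allows a hardware design to compute several output pixels in the same clock cycle.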
From Figure 9 and (10), we can see that the calculations of different pixels in the output image are independent of each other. Therefore, the intensities of multiple output pixels can be computed in parallel in a hardware design. Since the input image reading and the output image writing are both sequential, the data storage in local memory can be avoided. Instead, the user logic can access the interconnect directly for fetching the source image and storing the result image. In order to optimize the hardware performance, the interconnect and the user logic are chained into a pipeline, and the data run through the pipeline as a stream. We call this architecture the Streaming Data Transfer Mode, shown in Figure 10. Two DMA engines work in parallel to retrieve raw data blocks from and return result data blocks to the main memory. Under ideal circumstances, the reading DMA engine receives one raw data block from the input channel and the writing DMA engine puts one result data block to the output channel every clock cycle.
International Journal of Reconfigurable Computing
Being one stage of the overall pipeline, the design of the algorithm logic is parameterized by the characteristics of the other components in the pipeline, that is, the data width of the interconnect between the FPGA device and the microprocessor (μP). Because the interconnect fabric is generally multiple bytes wide, several pixels are fed into the algorithm logic in the same clock cycle. To maximize the throughput of the overall architecture, the algorithm logic has to be capable of performing the operations on multiple pixels concurrently and of taking new data input every clock cycle. In the following discussion, we assume that (1) the image is in 8-bit grayscale, (2) the image size is M × N and the kernel size is m × n, and (3) the data width of the interconnect is D bytes. Furthermore, pixels in the original image are delivered into the algorithm logic in a stream, starting with the pixel at the top-left corner and ending with the pixel at the bottom-right corner.
The diagram of the algorithm logic is shown in Figure 11. The architecture consists of four components: one Line Buffer, one Data Window, an array of PEs, and the Data Concatenating Block.
The quantity of PEs is D, that is, the data width of the interconnect in bytes. Every PE is fully pipelined and is capable of taking a new input, that is, one block of m × n pixels, every clock cycle. The output of one PE is the intensity of one pixel in the result image.
The Data Window is a two-dimensional register array of m × (n + D − 1) in charge of providing image blocks to the downstream PEs. Analogously, the Data Window scans the original image from left to right and from top to bottom with a horizontal stride of D pixels and a vertical stride of 1 pixel. Figure 12 demonstrates the scanning from left to right and shows the contents of the data window in two consecutive steps. Once the data window receives a valid input, that is, one block of m × D pixels, it does a D-byte left shift for all rows, so that the rightmost n − 1 columns are retained and the new input is appended after them.
The Line Buffer is a two-dimensional register array of (m − 1) × N. The original image is transferred into the FPGA as a stream. However, PEs request image blocks that span different rows. The purpose of the line buffer is to keep all pixels in registers until one m × D block forms. One m × D block, shown in gray in Figure 11, comprises the newly arrived D pixels and another (m − 1) × D pixels that reside at the head of every row of the line buffer. Every time the line buffer receives D pixels of the original image, it performs two actions simultaneously. The first action is to deliver the newly formed image block to the data window. The second action is to do a D-byte left shift in a zigzag form, in which two neighboring rows are linked together by connecting their tail and head.
The design of the Data Concatenating Block is straightforward because it simply concatenates the outputs of the upstream PEs. Once it receives valid output from the PEs, that is, D consecutive pixels of the output image, it sends them into the output channel.
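The interplay of the four components can be checked with a behavioral software model. The sketch below is not register-accurate: the function name `stream_convolve` is hypothetical, the zigzag shift of the line buffer is replaced by keeping the previous m − 1 rows as whole lists, and the row length N is assumed to be a multiple of D. One iteration of the inner loop over `c` corresponds to one clock cycle in which D pixels arrive.

```python
def stream_convolve(image, kernel, d):
    """Behavioral model: D pixels in per step, up to D output pixels out per step."""
    M, N = len(image), len(image[0])
    m, n = len(kernel), len(kernel[0])
    assert N % d == 0, "sketch assumes the row length is a multiple of D"
    line_buffer = [[0] * N for _ in range(m - 1)]    # previous m - 1 rows
    window = [[0] * (n + d - 1) for _ in range(m)]   # m x (n + D - 1) data window
    out = [[0] * (N - n + 1) for _ in range(M - m + 1)]
    for r in range(M):
        current = []
        for c in range(0, N, d):
            new_pixels = image[r][c:c + d]
            current.extend(new_pixels)
            # m x D block: buffered pixels above plus the newly arrived pixels.
            block = [row[c:c + d] for row in line_buffer] + [new_pixels]
            # Shift the window left by D columns and append the block on the right.
            window = [w[d:] + b for w, b in zip(window, block)]
            if r < m - 1:
                continue                              # not enough rows yet
            for j in range(d):                        # D PEs, one output pixel each
                col = c - n + 1 + j
                if 0 <= col <= N - n:                 # skip windows crossing a row start
                    out[r - m + 1][col] = sum(
                        window[a][j + b] * kernel[a][b]
                        for a in range(m) for b in range(n))
        line_buffer = (line_buffer + [current])[1:]   # oldest row leaves the buffer
    return out
```

The model produces the same result as a direct evaluation of the convolution for any D that divides N, which illustrates that only the PE count and the window width change with the interconnect width.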
In general, the four components in Figure 11 form a pipeline chain and each of them is fully pipelined as well. This architecture is able to accept D pixels of the original image and output D pixels of the result image every clock cycle. In the case of a multiple-stage algorithm, such as the four-stage Canny edge detector, the stages can be chained together, each consisting of these components with different parameters and functionalities. Under an ideal scenario, it would take (M × N)/D clock cycles to perform an edge detection operation on an input image. However, the real performance is generally upper-bounded by the sustained bandwidth of the interconnect.
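The relation between the ideal cycle count and the bandwidth bound can be made concrete with a short calculation. The function name `edge_detection_time`, the clock frequency, and the bandwidth figure used in the example are hypothetical values chosen for illustration, not measurements from this work.

```python
def edge_detection_time(rows, cols, d, clock_hz, sustained_bw_bytes_per_s):
    """Return (ideal cycles, estimated seconds) for one 8-bit grayscale image."""
    cycles = rows * cols / d                  # D pixels consumed per clock cycle
    compute_s = cycles / clock_hz
    # One byte per pixel must cross the interconnect in each direction.
    transfer_s = rows * cols / sustained_bw_bytes_per_s
    return cycles, max(compute_s, transfer_s)
```

For a 1024 × 1024 image with D = 8, an assumed 200 MHz clock, and an assumed 1 GB/s sustained bandwidth per direction, the pipeline needs 131072 cycles (about 0.66 ms) while the transfer needs about 1.05 ms, so the interconnect is the limiting factor in that scenario.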

Implementation and Results
The two image registration algorithms and a Canny edge detection algorithm have been implemented on the Cray XD1 reconfigurable computer with Xilinx XC2VP50 FPGA devices. On the Cray XD1 platform, each FPGA device is connected to four local SRAM modules, 4 MB each, as shown in Figure 13. Every local memory module has separate reading and writing ports connected to the FPGA device, and is able to accept reading or writing transactions every [...] concurrency is allowed, given that more independent local memory modules are available.
For the category of applications that use the interconnect for data access, a Canny edge detection algorithm is implemented following the architecture in Figure 11. The Canny edge detector comprises four stages, in which each stage takes the output from the preceding stage and feeds its output to the following stage, as follows: [...] Because the interconnect is 8 bytes wide, 8 edge detection operators are implemented and execute in parallel. Figure 14 shows the original image and the corresponding output after applying the Canny edge detection algorithm. In the Canny edge detection algorithm, the processing of each pixel in the output image involves neighboring pixels in the input image. Furthermore, this processing consists of 4 stages and has a latency of hundreds of clock cycles. Due to the fully pipelined hardware design, the FPGA device is capable of computing 8 pixels of the output image every clock cycle no matter how complicated the computation of each pixel is. In contrast, the pixels in the output image are computed one by one in the software implementation, and it takes thousands of CPU cycles to compute one pixel. As a result, the hardware implementation of the Canny edge detection algorithm achieves a 544× speedup compared with the corresponding software implementation. A higher speedup can be achieved if multiple images are processed simultaneously, provided that several interconnect channels are connected to the same FPGA co-processor and there are enough hardware resources to implement multiple instances of the edge detection operators.

Conclusions
In this paper, we demonstrate how the parallelism of a hardware design on reconfigurable computers is parameterized by the co-processor architecture, particularly the number and the data width of local memory banks and the interconnect. Image registration algorithms based on rigid-body transformation are adopted as a case study to represent applications that use local memory. Two related but different algorithms, the exhaustive search algorithm and the DWT-based search algorithm, are described in detail. For the exhaustive search algorithm, the performance is linearly proportional to the available number of local memory banks. On the other hand, the DWT-based search algorithm improves the efficiency by applying DWT decomposition on both the reference and test images before the search in order to reduce the search scope. Compared with software implementations, hardware implementations of exhaustive search and DWT-based search achieve 10× and 2× speedup, respectively. For the category of applications that directly access the host interconnect, edge detection is selected as a case study. A streaming data transfer mode in which the user logic and the interconnect are chained into a pipeline is proposed. A user logic hardware architecture whose parallelism is determined by the data width of the interconnect is discussed in detail. A Canny edge detection application following the proposed architecture is capable of achieving a 544× speedup compared with the corresponding software design.
As future work, we will extend our work to the Tile 64 platform [21]. Parameters such as the external memory bandwidth and the intertile bandwidth will be taken into account to implement image processing algorithms. We expect to take advantage of Tile 64's capability of creating multicore pipelines internally.

Figure 1: General architecture of a reconfigurable computer.

Figure 3: Two steps in image registration: (a) rigid-body transformation on the test image T, (b) calculation of the correlation coefficient between the transformed test image T' and the reference image R.

Figure 4: Local memory data storage layout in the exhaustive search algorithm for image registration.

Figure 5: Artifacts due to discretization in rigid-body transformation ((a) the original image; (b) the transformed image).

Figure 6: DWT decomposition of an image.

Figure 7: Decrease the search space and increase the search resolution in the search process.

Figure 8: Store the original and decomposed images in the same memory bank in the DWT-based image registration.

Figure 9: An example small image (a) and kernel (b) to illustrate convolution.
The Data Window is a two-dimensional register array of m × (n + D − 1) in charge of providing image blocks to the downstream PEs. Analogously, the Data Window scans the original image from left to right and from top to bottom.

Figure 12: The content of Data Window in two consecutive steps (assume the kernel size is 3 × 3 and D is 8).

Table 1: Search strategy summary for rotation.
Figure 11: Diagram of the algorithm logic in edge detection.