This paper presents the design and FPGA implementation of an algorithm that computes the similarity between neighboring frames of a video sequence from luminance information. Taking advantage of the flexibility of reconfigurable logic devices, we have designed a hardware implementation of this algorithm, which is used in video segmentation and indexing. The experimental results show the tradeoff between concurrent and sequential resources, and the functional blocks needed to reach maximum operating speed with minimum silicon area. To evaluate system efficiency, we compare the performance of the hardware solution with that of software running on general-purpose processors with and without a SIMD instruction set.

The capacity of Reconfigurable Logic for massive concurrent computation makes it well suited to implementing complex algorithms. This feature has increased the potential of digital design, as shown by several papers that have proposed FPGA implementations of mathematical algorithms. In almost all cases the topology and the performance obtained depend on the specific application, so each case must be analyzed individually. References [

This paper presents the advantages of concurrency and parallelism in implementing video temporal segmentation with Reconfigurable Logic Devices, emphasizing the advantages and limitations of FPGA technology and its development tools. We chose to implement a function that has been thoroughly studied in [

In Section

Section

Section

Finally, the experimental results section shows the quantitative results obtained by comparing computing time using the proposed hardware solution to that of pure software running on a PC.

The combination of CLB logic with embedded blocks, together with the data supply, is a critical aspect of the design that must be taken into account.

The main drawback of using a reconfigurable hardware device is the loss of performance imposed by the reconfiguration circuit itself [

Another important issue is the time delay introduced by routing. This factor justifies the inclusion on the FPGA of different-quality routes for different purposes (clocks, near, long, etc.).

The present study will show the many possible alternatives for combining all these elements in a specific design. Those selected in each case will depend on the specific implementation being conducted.

Video processing is characterized by a very high demand for data, and an FPGA offers only limited on-chip storage. An efficient interface with external memory, capable of supplying and receiving data at the necessary rate, is therefore required. In this case, we used a Xilinx Spartan-3 development board with a static external memory bank that is 32 bits wide and has a 10-ns access time. This bus width allows us to read and process four bytes at a time. Reading and updating data stored in the Spartan BlockRAM requires two clock cycles per operation. This memory is synchronous and can operate at clock frequencies of up to 200 MHz, so a two-cycle operation takes 10 ns, which matches the 10-ns access time of the external memory.

Our aim is the development of a specific architecture for video segmentation. This video analysis technique enables the detection of low-level semantic units in the video, which are called shots. Many applications are based on shot information to carry out higher level analysis, such as video classification, summarization, visual index generation, or sequence comparison [

One of the main features of an image is its luminance histogram, defined as the frequency of occurrence of the pixels’ luminance values in each frame. The similarity between two frames is inversely related to the distance between the vectors representing their characteristics. In the case of a luminance histogram, this can be defined through the following normalized equation [

In order to implement and optimize expression (

1. calculation of the histogram,

2. 1D windowing using a window size of three to calculate

3. sum of products,

4. square root,

5. division.

Multipliers needed for stage 3 were not implemented in VHDL because the Spartan 3 device has dedicated 18-bit multiplier blocks providing much better performance than that attainable using CLB logic.

The real bottleneck occurs at the stage responsible for obtaining the histogram, and its interface with external memory is the limiting factor for the overall circuit performance. Therefore, it is advisable to try to reduce the amount of data to be processed. As shown in [

Considering the results obtained in [

Circuit for calculating the histogram.

Port B input has been forced to “0” to clear the accumulators in the same clock cycle, leaving the hardware ready for processing the following frame. Several registers were inserted between output adders and at the address input of the BlockRAMs to pipeline the stage, so as to optimize the interface with external memory and double the internal clock frequency.
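The behavior of the histogram stage can be modeled in software as follows. This is only a minimal sketch, not the VHDL circuit: the function names are hypothetical, the byte order within the 32-bit word is an assumption, and mapping an 8-bit luminance value to one of the 64 levels by keeping its six most significant bits is also an assumption.

```python
def unpack_word(word):
    """Split a 32-bit memory word into four 8-bit samples.

    Assumption: the least significant byte holds the first sample.
    """
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def histogram64(words):
    """Accumulate a 64-level luminance histogram from 32-bit words."""
    bins = [0] * 64
    for word in words:
        for luminance in unpack_word(word):
            # Assumption: 256 luminance values are folded into 64 levels
            # by dropping the two least significant bits.
            bins[luminance >> 2] += 1
    return bins


# A 1600-pixel frame arrives as 400 words of four samples each.
frame = [0x00040810] * 400
hist = histogram64(frame)
print(sum(hist))
```

Each word updates four accumulators, which is the software counterpart of processing four bytes per memory read.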

To reduce the influence of slight variations in image luminance, each term of the histogram has been replaced with the sum of its adjacent neighbors. Figure

Concurrent windowing definition.

Figure

Grouping the windowed sum terms.

Figure

Circuit for calculating the windowed terms.

To prevent the delay introduced by the next calculation from piling up, BlockRAMs 4 and 5 retain data belonging to a frame while the following one is being processed in a different memory page. The BlockRAM dual port feature allows for decoupling and behaves like another pipeline stage. The only difference in this case is that the retention unit is the whole frame and not a single clock cycle.
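The window-of-three sum applied to the histogram can be illustrated with a short software model. It is a sketch under two assumptions that the text above does not settle: the center term is included in the sum, and positions outside the histogram contribute zero. The function name is hypothetical.

```python
def window3(hist):
    """Replace each histogram term with the sum over a window of three.

    Assumptions: the window is centered on the term itself and the
    histogram is zero-padded at both ends.
    """
    n = len(hist)
    windowed = []
    for i in range(n):
        left = hist[i - 1] if i > 0 else 0
        right = hist[i + 1] if i < n - 1 else 0
        windowed.append(left + hist[i] + right)
    return windowed


print(window3([1, 2, 3, 4]))
```

A pixel whose luminance drifts into a neighboring level still contributes to the same windowed term, which is how the windowing reduces sensitivity to slight luminance variations.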

Bearing in mind that 64 is the number of established histogram levels, each one represented by a 16-bit integer, the proposed structure can process frames containing up to 65,536 pixels. Taking as a reference a frame of only 1600 pixels, Table

Number of clock cycles per operation.

Operation | Clock cycles |
---|---|
Histogram calculation | 800 |
Intermediate storage | 32 |
Windowing | 11 |

At the same time that storing and windowing are being performed on the previous histogram, the input stage builds the following histogram. After the first frame is complete, the remaining calculations are done in parallel within the 800 clock cycles taken by the first stage.

To implement this stage, we have modified the circuit depicted in Figure

Two of the partial sums needed to calculate the similarity coefficient as defined by (

Circuit for calculating the windowed correlation (includes windowed calculation pipelined).

Starting from the histograms of two frames stored in BlockRAM, the stage takes only 28 clock cycles to complete two windowed sums. The key to such high performance lies in the concurrency of operations, the organization of the data, and the pipeline structure. It is also easy to interface this block with its neighbors because the input and output buses are only 32 and 64 bits wide, respectively. In contrast, the internal bus-width parallelism reaches 192 bits in the multiplier layer.
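As a reference for what the sum-of-products stage computes, the following sketch evaluates the partial sums for two windowed histograms and combines them into a similarity value. It assumes the similarity has the usual normalized cross-correlation form (cross sum divided by the product of the square roots of the two energy sums); the exact expression is the one defined by the paper's equation, and the function names are hypothetical. Floating point is used for brevity, whereas the circuit uses fixed-point arithmetic.

```python
import math


def sums_of_products(a, b):
    """Return the cross sum and the two energy sums of two vectors."""
    cross = sum(x * y for x, y in zip(a, b))
    energy_a = sum(x * x for x in a)
    energy_b = sum(y * y for y in b)
    return cross, energy_a, energy_b


def similarity(a, b):
    """Normalized correlation of two windowed histograms.

    Assumption: similarity = cross / (sqrt(energy_a) * sqrt(energy_b)).
    """
    cross, energy_a, energy_b = sums_of_products(a, b)
    denominator = math.sqrt(energy_a) * math.sqrt(energy_b)
    return cross / denominator if denominator else 0.0


print(similarity([3, 6, 9, 7], [3, 6, 9, 7]))
```

Under this assumed form, identical histograms give a value of one and histograms with no overlapping levels give zero.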

The purpose of this stage is to obtain the square root of the denominator in (

Circuit for obtaining the square root and terms of division.

It takes 16 clock cycles to process a 32-bit input word. The sum term
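A sequential square root that needs one iteration per result bit can be modeled as below. This is a sketch of the classic digit-by-digit (restoring) method, chosen here because it yields a 16-bit root of a 32-bit operand in 16 iterations; whether the circuit uses exactly this variant is an assumption, and the function name is hypothetical.

```python
def isqrt32(value):
    """Integer square root of a 32-bit unsigned value in 16 iterations."""
    if not 0 <= value < (1 << 32):
        raise ValueError("value must fit in 32 bits")
    root = 0
    remainder = 0
    for i in range(16):
        # Bring down the next two operand bits, most significant first.
        pair = (value >> (30 - 2 * i)) & 0b11
        remainder = (remainder << 2) | pair
        trial = (root << 2) | 1  # candidate subtrahend: 4 * root + 1
        root <<= 1
        if remainder >= trial:
            remainder -= trial
            root |= 1  # this result bit is 1
    return root


print(isqrt32(1600 * 1600))
```

Each loop iteration corresponds to one clock cycle of a sequential implementation, and the result is half as wide as the operand.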

Following the sequential stage responsible for calculating the square root, a multiplying block provides the product of the denominator in (

The division between two unsigned integers employing a shift and subtract algorithm is done by a circuit similar to the one shown in Figure
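The shift-and-subtract division of two unsigned integers can be sketched in software as follows. This models the restoring form of the algorithm with one quotient bit per iteration; the 32-bit default width and the function name are assumptions made for illustration.

```python
def divide_u32(dividend, divisor, width=32):
    """Unsigned shift-and-subtract division, one quotient bit per step."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must be nonzero")
    quotient = 0
    remainder = 0
    for i in range(width - 1, -1, -1):
        # Shift the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        quotient <<= 1
        if remainder >= divisor:
            remainder -= divisor  # subtraction succeeds: quotient bit is 1
            quotient |= 1
    return quotient, remainder


print(divide_u32(100, 7))
```

With a 32-bit operand the loop runs 32 times, one iteration per clock cycle in a sequential circuit.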

The stages referred to above require a set of control and synchronization signals to properly manage data flow. Therefore, a control circuit is required to generate the appropriate timing signals. Figure

State diagram of the control machine.

The design of this finite state machine is crucial, because the generated signals have to drive all the calculating modules, such that the fan-out and route length of these lines determine the maximum system clock frequency.

Because of the intensive use of pipeline architecture, the system exhibits a finite latency time from which a new value is output each 800 clock cycles (5.7

Clearly, the bottleneck is the width of the bus that connects the external memory with the above-described system. Widening this bus from 32 to 128 bits would make it possible to process 16 pixels at a time, reducing the time needed to complete the calculation to 200 clock cycles. Even with this addition, the usage factor of the remaining stages is never higher than 55% of the total time. However, the inactivity of these stages has no influence on the total processing time as they work in parallel with the histogram stage.
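The cycle counts quoted above can be reproduced with a few lines of arithmetic. The sketch assumes one pixel per byte and two clock cycles per memory access, and it interprets the usage factor as the summed cycle counts of the remaining stages divided by the histogram time; that interpretation and the names used are assumptions.

```python
# Cycle counts of the stages that follow the histogram calculation.
OTHER_STAGE_CYCLES = {
    "intermediate storage": 32,
    "windowed correlation": 28,
    "square root": 16,
    "product": 1,
    "division": 32,
}


def histogram_cycles(pixels, bus_bits, cycles_per_access=2):
    """Clock cycles needed to read one frame through the memory bus."""
    pixels_per_word = bus_bits // 8            # one pixel per byte
    words = -(-pixels // pixels_per_word)      # ceiling division
    return words * cycles_per_access


def downstream_usage(pixels, bus_bits):
    """Fraction of the frame time in which the other stages are busy."""
    return sum(OTHER_STAGE_CYCLES.values()) / histogram_cycles(pixels, bus_bits)


print(histogram_cycles(1600, 32), histogram_cycles(1600, 128))
print(downstream_usage(1600, 128))
```

For a 1600-pixel frame this gives 800 cycles with the 32-bit bus and 200 cycles with a 128-bit bus, and the remaining stages then occupy 109 of those 200 cycles, consistent with the 55% bound.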

The entire design was simulated and implemented using the software package ISE 8.2i from Xilinx and tested on a development board “SPARTAN-3 Starter Board” from Digilent provided with an XC3S1000 FT256 speed grade 4 chip, 1 Mbyte of SRAM (256 Kb

The software-hardware comparison uses four compressed videos from an MPEG-7 content set, belonging to different genres (drama, sport, and news) and with an average length of 25,000 frames. The algorithm was implemented using fixed-point arithmetic. Overflow was avoided by selecting suitable resource sizes for each stage. The main considerations are discussed in the following.

The bus width at the multiplication output is twice as wide as the incoming factors and only half as wide for the square root case. The size of the accumulators for the sums of products enables storing the quantity

The Xilinx ISE synthesis tool provided the report shown in Table

Summary of resources usage.

Selected device: 3s1000ft256-4

Resource | Used | Available | Ratio |
---|---|---|---|
Slices | 952 | 7680 | 12% |
Slice Flip Flops | 1170 | 15360 | 7% |
4 input LUTs | 1531 | 15360 | 9% |
Bonded IOBs | 80 | 173 | 46% |
BRAMs | 7 | 24 | 29% |
MULT18X18s | 7 | 24 | 29% |
GCLKs | 3 | 8 | 37% |
DCMs | 1 | 4 | 25% |

Source: ISE 8.2i provided by XILINX.

The interface with the external memory was debugged using an Agilent 64622D oscilloscope.

Close examination of the summary of the number of clock cycles taken by each individual stage shown in Table

Summary of number of clock cycles taken per operation.

Operation | Clock cycles |
---|---|
Histogram calculation | 800 |
Intermediate storage | 32 |
Windowed correlation (2 sums) | 28 |
Square root | 16 |
Product | 1 |
Division | 32 |

To evaluate circuit performance, we compared a software implementation presented in Table

Hard-Soft solution comparison.

Implementation | ms | cycles |
---|---|---|
PC (sequential programming) | 2500 | 5000 |
PC (sequential optimized) | 750 | 1500 |
PC (SSE with Intrinsics) | 350 | 700 |
FPGA (SPARTAN 3-VHDL) | 200 | 28 |

In this paper we have presented an FPGA implementation to calculate similarities between two frames. Our design works with the DC coefficients of compressed video frames to compute the frame histogram. Similarity is calculated by applying a windowed cross-correlation to frame histograms. Our implementation has efficiently solved the problems arising from the management of constant data flow from video frames. Thus, we have designed a windowed sum of products stage which completes the calculation of 64 sum terms in only 28 clock cycles.

Although our system is a hardware implementation of a video segmentation technique, these results can be extended to other data processing applications, because any calculation involving data correlation takes the form of a sum of products. Success in such applications will depend strongly on both the ability to maintain a constant data flow through the system and the choice of numerical resolution for each stage.

This work was supported in part by the Ministry of Education and Science (CICYT) of Spain under Contract TIN2006-01078 and Junta de Andalucia under Contract TIC-02800.