Neuromorphic Configurable Architecture for Robust Motion Estimation

The human visual system recovers motion in almost any visual situation with enviable robustness, performing enormous calculation tasks continuously, efficiently, and effortlessly. There is clearly a great deal to learn from our own visual system. Several optical flow algorithms exist, but none of them deals efficiently with noise, illumination changes, second-order motion, occlusions, and similar difficulties. The main contribution of this work is the efficient implementation of a biologically inspired motion algorithm that takes templates from nature as inspiration for architecture design and relies on a specific model of human visual motion perception: the Multichannel Gradient Model (McGM). This novel customizable architecture for neuromorphic robust optical flow can be built on an FPGA or ASIC device using properties of the cortical motion pathway, and it constitutes a useful framework for building future complex bioinspired systems that run in real time despite high computational complexity. This work reports resource usage and performance data, together with a comparison against current systems. Owing to its bioinspired nature and robustness, this hardware has many application fields, such as object recognition, navigation, or tracking in difficult environments.


Introduction
Bioinspired systems emulate the behavior of biological ones. Neuromorphic approximations [1] are based on the way nervous systems build physical architectures and carry out computations, attending to morphology, information coding, robustness against damage, and so on. Neuromorphic systems usually deliver good primitives for building more complex systems, the output of each system being simpler than its input. This data reduction helps in integrating the responses associated with all the information channels [2].
Regarding the estimation of pixel motion within an image sequence, there are many models and algorithms, which can be classified into matching-domain approximations [3], energy models [4], and gradient models [5]. For this last family, different studies [6][7][8] show that it is an admissible choice for keeping a tolerable tradeoff between accuracy and computing resources. Designing systems that operate efficiently requires dealing with many challenges, such as robustness, static patterns, illumination changes, different kinds of noise, contrast invariance, and so on. If bioinspired behavior is also required, that is, the ability to detect the correct motion in optical illusions, or the avoidance of operations such as matrix inversion or iterative methods that are not biologically justified, the model must be selected carefully so that it meets these requirements. The model chosen here is the Multichannel Gradient Model (McGM) [9][10][11][12].
Motivated by these previous results and analyses, we present the architecture and implementation of a customizable optical flow embedded processing core running in real time. The system works within a codesign scheme that is able to manage complex situations in real environments [13] better than other algorithms [14] and to mimic some behavior of mammals [15]. This paper is organized as follows. First, the stages of the McGM model are explained very briefly. After that, we tackle the precision study of every conceptual stage, obtaining the set of bit width values that models the filters and the bit width required in each stage for the results to meet the statistical error metric requirements. From this study, we design the customizable architecture, following the original model plus several hardware modifications that improve the feasibility of the system. Examples are the design of IIR filters to replace the original FIR filters, due to the memory limitation of the prototyping platform, and the use of several information channels with few bits each, replicating the nature of the brain (a large number of neurons with very little precision, as opposed to a few channels with huge information capacity) [14]. We then explain the coarse pipeline processing architecture and the platform and language used in our system. Finally, quality results, the associated hardware cost, and comparisons with other implementations are shown.

Multichannel Gradient Model (McGM)
The original algorithm was proposed by Johnston and Clifford. We have applied Johnston's description of the McGM model [9], adding several variations to improve the viability of the hardware implementation. Figure 1 shows a simplified scheme of the algorithm.

IIR Filtering
A temporal IIR filter is modeled from its original FIR description because of the limited memory available in our prototyping platform [15,16]. The result is a recursive filter with only two frames of latency, where o and i are the output and input of the filter, respectively, and $a_i$, $b_i$ are the coefficients from our previous work [14,15], with $a_1 = e^{-1/\alpha}/\alpha^2$, $b_1 = 2e^{-2/\alpha}$, $b_2 = e^{-2/\alpha}$, and $\alpha$ driving the peak of the temporal impulse response function. The filter is calibrated with a peak at frame 10, following a critical flicker fusion limit of 60 Hz, in accordance with the evidence from the human visual system [11]. Following the original algorithm, we need to compute the derivatives of order zero, one, and two, which form the first triplet of information to be processed, as shown in Figure 1.
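
As an illustration of a recursive temporal filter of this kind, the following Python sketch assumes a second-order recursion o[n] = a1·i[n-1] + b1·o[n-1] - b2·o[n-2]. The recursion form, the function names, and the choice b1 = 2e^(-1/α) are assumptions made for this example (that choice gives a double pole at e^(-1/α), so the impulse response is (n/α²)·e^(-n/α) and peaks at n = α); none of them is taken from the original implementation.

```python
import math

def iir_coefficients(alpha):
    """Coefficients of the assumed recursion (double pole at exp(-1/alpha))."""
    p = math.exp(-1.0 / alpha)
    a1 = p / alpha ** 2      # input gain, as given in the text
    b1 = 2.0 * p             # ASSUMPTION: 2*exp(-1/alpha), see lead-in
    b2 = p * p               # exp(-2/alpha), as given in the text
    return a1, b1, b2

def temporal_iir(samples, alpha=10.0):
    """Filter one pixel's time series with o[n] = a1*i[n-1] + b1*o[n-1] - b2*o[n-2]."""
    a1, b1, b2 = iir_coefficients(alpha)
    i1 = o1 = o2 = 0.0       # previous input and the two previous outputs
    out = []
    for x in samples:
        o = a1 * i1 + b1 * o1 - b2 * o2
        out.append(o)
        i1, o1, o2 = x, o, o1
    return out

# Impulse response for alpha = 10 frames.
impulse = [1.0] + [0.0] * 59
response = temporal_iir(impulse, alpha=10.0)
peak_frame = max(range(len(response)), key=response.__getitem__)
```

Only the previous input and two previous outputs are stored per pixel, which is the memory saving that motivates a recursive filter over a long FIR one. With `alpha=10.0` the impulse response of this sketch peaks at frame 10.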

FIR Spatial Filtering
A set of spatial FIR filters is modeled by an impulse response corresponding to bidimensional Gaussians and their separable derivatives, where σ is the spread of the Gaussian and H_n is the Hermite polynomial of order n. The convolution is done in a separable way, taking derivatives in the x and y directions up to sixth and second order, respectively, for reasons of biological inspiration and robustness [11][12][13]. The aim of this stage is to cover a wide enough spread of information channels so that the remaining ones can still contribute to the computation when some of them are null, for example because of noise. We therefore have three spatial structures, each containing a pyramidal set of filters corresponding to Gaussians and their different derivatives.
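
The relation between Gaussian derivatives and Hermite polynomials can be sketched in Python as follows. The sketch assumes the standard unit-area Gaussian exp(-x²/(2σ²))/(σ√(2π)), whose n-th derivative is (-1/(σ√2))^n · H_n(x/(σ√2)) times the Gaussian itself; the normalization used in the McGM model may differ, and the function names, the value of σ, and the 23-tap default are illustrative choices.

```python
import math

def hermite(n, x):
    """Physicists' Hermite polynomial H_n(x), via H_{k+1} = 2x H_k - 2k H_{k-1}."""
    h_prev, h = 1.0, 2.0 * x
    if n == 0:
        return h_prev
    for k in range(1, n):
        h_prev, h = h, 2.0 * x * h - 2.0 * k * h_prev
    return h

def gaussian_derivative_kernel(order, sigma, taps=23):
    """Sampled 1-D kernel for the order-th derivative of a Gaussian."""
    half = taps // 2
    scale = 1.0 / (sigma * math.sqrt(2.0))
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    kernel = []
    for x in range(-half, half + 1):
        g = norm * math.exp(-(x * x) / (2.0 * sigma * sigma))
        kernel.append(((-scale) ** order) * hermite(order, x * scale) * g)
    return kernel

def separable_kernel_2d(order_x, order_y, sigma, taps=23):
    """2-D kernel as the outer product of two 1-D kernels (rows = y, cols = x)."""
    kx = gaussian_derivative_kernel(order_x, sigma, taps)
    ky = gaussian_derivative_kernel(order_y, sigma, taps)
    return [[wy * wx for wx in kx] for wy in ky]

k0 = gaussian_derivative_kernel(0, sigma=2.0)   # smoothing kernel
k1 = gaussian_derivative_kernel(1, sigma=2.0)   # first derivative
k2d = separable_kernel_2d(1, 0, sigma=2.0)      # d/dx of the 2-D Gaussian
```

Because the 2-D kernel is an outer product, the convolution can be applied as a row pass followed by a column pass, which is what makes the separable implementation cheaper than a full 2-D mask.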

Steering Stage
The steering stage projects the spatiotemporal filters calculated in the previous stages onto different orientations. With n and m the derivative orders in the x and y directions, respectively, θ the projection angle, D the derivative operator, and G_0 the Gaussian expression, the general expression of the filter rotated in space is a linear combination of filters belonging to the same order basis [14]. This transformation, expression (5), has to be applied to each value.
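
The linear-combination property can be illustrated with a short Python sketch. It assumes the usual steering identity, in which the rotated operator is (cos θ·Dx + sin θ·Dy)^n · (-sin θ·Dx + cos θ·Dy)^m, and expands it into weights for the basis filters Dx^(n+m-k)·Dy^k. The function names and the ordering of the basis are choices made for this example and are not claimed to match expression (5) term by term.

```python
import math

def _times_linear(poly, ax, ay):
    """Multiply a homogeneous polynomial in (Dx, Dy) by (ax*Dx + ay*Dy).

    poly[k] is the coefficient of Dx^(d-k) * Dy^k, where d = len(poly) - 1.
    """
    out = [0.0] * (len(poly) + 1)
    for k, w in enumerate(poly):
        out[k] += w * ax       # the Dx factor leaves the power of Dy unchanged
        out[k + 1] += w * ay   # the Dy factor raises the power of Dy by one
    return out

def steering_weights(n, m, theta):
    """Weights of the basis filters Dx^(n+m-k) Dy^k, k = 0..n+m, at angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    poly = [1.0]
    for _ in range(n):
        poly = _times_linear(poly, c, s)     # derivative along the orientation
    for _ in range(m):
        poly = _times_linear(poly, -s, c)    # derivative across the orientation
    return poly

def steer(basis_responses, n, m, theta):
    """Combine the basis filter responses at one pixel into the steered response."""
    weights = steering_weights(n, m, theta)
    return sum(w * r for w, r in zip(weights, basis_responses))

# 24 orientations, one every 15 degrees, for a second-order derivative.
angles = [2.0 * math.pi * k / 24 for k in range(24)]
weights_per_angle = [steering_weights(2, 0, a) for a in angles]
```

Since the weights depend only on the angle and the orders, they can be computed once and stored, so that steering a pixel costs one multiply-accumulate per basis filter.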

Taylor Expansion Stage
In this stage a truncated Taylor expansion is built around each point of the space-time image in order to further enhance the algorithm. This requires every oriented filter calculated previously. The expansion is highly versatile and provides a robust information structure of the sequence in space and time. Each Taylor expansion is then differentiated with respect to x, y, and t; these derivatives are called X, Y, and T, and they form the sextet used in the quotient stage.
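
For reference, a generic truncated Taylor expansion of the image brightness I around a space-time point can be written as below. The offsets p, q, r and the truncation orders n, m, l are notation introduced for this sketch; in the configuration described earlier the spatial orders would be n = 6 and m = 2, with temporal derivatives up to order two. The exact weighting used in the McGM model may differ from this textbook form.

```latex
I(x+p,\,y+q,\,t+r) \;\approx\;
\sum_{i=0}^{n}\sum_{j=0}^{m}\sum_{k=0}^{l}
\frac{p^{i}\,q^{j}\,r^{k}}{i!\,j!\,k!}\,
\frac{\partial^{\,i+j+k} I}{\partial x^{i}\,\partial y^{j}\,\partial t^{k}}(x,y,t)

% Differentiating the expansion with respect to x raises the x-order of every
% term by one (Y and T are obtained in the same way for y and t):
X \;=\; \frac{\partial}{\partial x}\, I(x+p,\,y+q,\,t+r) \;\approx\;
\sum_{i=0}^{n}\sum_{j=0}^{m}\sum_{k=0}^{l}
\frac{p^{i}\,q^{j}\,r^{k}}{i!\,j!\,k!}\,
\frac{\partial^{\,i+j+k+1} I}{\partial x^{i+1}\,\partial y^{j}\,\partial t^{k}}(x,y,t)
```

Every partial derivative on the right-hand side is the response of one of the Gaussian derivative filters, so the expansion is a weighted sum of filter outputs that are already available.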

Quotient Stage (General Primitives) and Following Stages
This is the last stage belonging to the common path. Here a quotient is computed for every component of the sextet, from the measured products of the derivatives of the steered Taylor expansions. The architecture of the core then branches into two separate paths, modulus and phase, which work independently with different bit operations and contain products, several quotients, and even trigonometric operations such as the arctangent, which are performed in software. The details of the software stages can be found in previous works [14,15]; their final aim is to recover a dense representation of motion. We therefore obtain two values for each input pixel, the modulus and phase of the velocity, which are equivalent to the velocity projections in the x and y directions.
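
The equivalence between the (modulus, phase) pair and the velocity projections in x and y is the usual polar/Cartesian conversion, sketched below in Python. The function names are illustrative, and `math.atan2` stands in for the arctangent mentioned above; this is not the model's own velocity computation, only the change of representation.

```python
import math

def to_modulus_phase(vx, vy):
    """Velocity projections -> (modulus, phase in radians, range (-pi, pi])."""
    return math.hypot(vx, vy), math.atan2(vy, vx)

def to_projections(modulus, phase):
    """(modulus, phase in radians) -> velocity projections (vx, vy)."""
    return modulus * math.cos(phase), modulus * math.sin(phase)

modulus, phase = to_modulus_phase(3.0, 4.0)
vx, vy = to_projections(modulus, phase)
```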

Precision Study (Bit Width Analysis)
We have designed a specific strategy to define the bit width required in each conceptual stage of the algorithm described above. The basic idea is to transform every calculation in the model by applying a chained quantization process. For the sake of clarity, if the parameters of a convolution are the bit width of the input I, the length of the filter L, and the mask size M, the output bit width O can be computed simply by shifting the range. Applying this method in each stage, we obtain the set of values that defines the transformation between the floating-point domain and the integer domain, achieving a tradeoff between bit width and affordable error.
International Journal of Reconfigurable Computing
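
A worst-case version of this range computation can be sketched in Python. The sketch assumes unsigned data and interprets M as the bit width of the mask coefficients, so that the output needs O = I + M + ceil(log2 L) bits; both the interpretation of M and the formula are assumptions made for this example, and the function names are hypothetical.

```python
def conv_output_bits(input_bits, coeff_bits, filter_length):
    """Worst-case bit width of an unsigned convolution output.

    Each product needs input_bits + coeff_bits bits, and summing
    filter_length products adds ceil(log2(filter_length)) bits.
    """
    growth = (filter_length - 1).bit_length()   # ceil(log2(L)) for L >= 1
    return input_bits + coeff_bits + growth

def worst_case_sum(input_bits, coeff_bits, filter_length):
    """Largest value the convolution sum can reach with full-scale operands."""
    return filter_length * ((1 << input_bits) - 1) * ((1 << coeff_bits) - 1)

# 8-bit pixels, 8-bit coefficients, 23-tap spatial filter.
bits_spatial = conv_output_bits(8, 8, 23)
# Same operand widths for a 5-tap blur filter.
bits_blur = conv_output_bits(8, 8, 5)
```

In a chained process, the output width of one stage becomes the input width of the next one, and each stage can then be truncated below the worst case as long as the error metrics stay within the affordable range.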
As error metrics, we have used the most common ones in the specialized literature, namely Barron's metric [8] and Galvin's pair of metrics [6], where Vc and Ve are the correct and experimental velocities, respectively, and g⊥ is the normal component of the Galvin difference vector. We have also taken into account simple error measures (absolute and relative) for modulus and phase. Regarding the stimuli, we have used synthetic compositions of sine waves of different spatial frequencies and the well-known diverging tree and translating tree stimuli [17], commonly used to evaluate optical flow. As a result, we obtain the set of precision parameters applied in the model for the range of affordable error. Figure 2 shows the bit width of the stages performed in hardware, and Table 1 contains the final values chosen for an FIR blur filter of length 5 pixels, an FIR spatial filter of 23 pixels, and an IIR temporal filter equivalent to an FIR length of 21 frames; a more detailed analysis is available in [14].
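
These measures can be sketched in Python as follows. Barron's metric is implemented in its widely used form, the angle between the space-time vectors (vx, vy, 1) of the correct and estimated velocities. The magnitude of the difference vector and the simple modulus and phase errors are included as plain definitions; the function names are illustrative, and the normal-component metric is not covered because its definition is not given here.

```python
import math

def barron_angular_error(vc, ve):
    """Angle in degrees between (vx, vy, 1) built from correct and estimated flow."""
    dot = vc[0] * ve[0] + vc[1] * ve[1] + 1.0
    norm = (math.sqrt(vc[0] ** 2 + vc[1] ** 2 + 1.0)
            * math.sqrt(ve[0] ** 2 + ve[1] ** 2 + 1.0))
    cos_psi = max(-1.0, min(1.0, dot / norm))   # guard against rounding
    return math.degrees(math.acos(cos_psi))

def difference_magnitude(vc, ve):
    """Euclidean length of the vector difference between the two velocities."""
    return math.hypot(vc[0] - ve[0], vc[1] - ve[1])

def modulus_phase_errors(vc, ve):
    """Absolute errors in modulus and in phase (radians, wrapped to [0, pi])."""
    mod_err = abs(math.hypot(vc[0], vc[1]) - math.hypot(ve[0], ve[1]))
    dphi = math.atan2(ve[1], ve[0]) - math.atan2(vc[1], vc[0])
    phase_err = abs(math.atan2(math.sin(dphi), math.cos(dphi)))
    return mod_err, phase_err

v_correct, v_estimated = (1.0, 0.0), (0.0, 1.0)
angular = barron_angular_error(v_correct, v_estimated)
magnitude = difference_magnitude(v_correct, v_estimated)
mod_err, phase_err = modulus_phase_errors(v_correct, v_estimated)
```

The extra unit component in Barron's metric keeps the angle well defined when one of the velocities is zero, which a purely 2-D angle would not.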

Codesign Process
The system has been designed as a codesign process working with an asynchronous pipeline (micropipeline). The PC feeds the FPGA with a stream of frames through a memory bank connected to the PCI bus. The board takes a continuous stream of pixels at its input (1 byte/pixel), whereas 32 bits are used at the output; the results are sent back to the PC, where they are reordered and written to the hard disk. We have selected Handel-C to implement this core, using the DK tool [18]. As the prototyping board, we have used an AlphaData RC1000, which includes a Virtex 2000E-BG560 chip and 4 SRAM banks of 2 MB each [19]. The memory banks can be accessed both from the FPGA and from the PCI bus. Figure 3 shows the communication scheme of the codesign system between the external memory banks, the FPGA, and the host platform.
We have implemented a version of the model with defined bit width precision, which we call the "semihardware" version or SmHW; the next step is to implement different hardware cores in order to examine the tradeoff between accuracy and efficiency. We have developed two kinds of architectures in the FPGA, called "basic" (HWbas) and "extended" (HWext). The SW version is implemented using temporal FIR filtering and 24 orientations (one every 15°). The SmHW version keeps the same number of orientations, although it implements the IIR filters and its Taylor expansion is not complete (only 65% of the weights are used). The basic architecture has one order of spatial differentiation less than the versions above and only 18 orientations (one every 20°), the rest of the parameters remaining constant. The extended architecture has one order less than the basic one and also reduces the number of orientations to 8 (one every 45°). Table 2 summarizes the main differences between these versions in terms of the nature of the temporal filter, the final spatial derivative order, the number of orientations, and the density of the weights used in the expansion.

Results
We have analyzed the resources required by the platform and the number of cycles (NCs) of each stage, as listed in Table 3. Every stage of both architectures has been designed to be customizable, scalable, and modular. The basic architecture computes an initial blur filter that removes aliasing components, IIR temporal filtering that performs the temporal derivatives, FIR spatial filtering that performs the spatial derivatives, and steering filtering that projects the results onto the whole space (the SW prefix denotes stages that are performed in software). This architecture contains the processing scheme shared by most gradient-based optical flow models, so it can be considered a motion preprocessor [15,16]. The extended architecture covers more stages and is focused on the specific McGM algorithm: it implements all the stages mentioned above plus the Taylor expansion, the Taylor product (the products of its derivatives), and the quotient stage, as shown in Figure 4.

Hardware Cost
The basic architecture, which implements 4 stages, consumes 41% of the board slices, with every stage using parameter values very close to those of the original model (derivatives in x up to order 5, 18 orientations in the steering stage). The extended architecture, in contrast, requires 97% of the development board.

Performance
Regarding the number of cycles, we have found the Xilinx timing analyzer tool [20] to be very conservative; the throughput can be increased by around 25%-35% if the system is clocked manually starting from the values obtained. The slowest stage in the basic architecture is the FIR filtering, while the last stages designed need the maximum number of Block RAMs and slices, because the computation replicates the spatial convolution (FIR filter) concurrently for n orientations up to order m in x. In the extended architecture, however, we must keep resources for the following stages, so some contributions are removed and the processing is parallelized in discrete groups, which replace the fully concurrent processing of the whole set. For instance, the steering stage is performed with fewer terms and a reduced level of parallelization, requiring almost twice as many cycles. By keeping enough resources free in the prototyping board in this way, we can extend the model to additional stages. Figure 4 shows the global codesign scheme and the two architectures involved, representing the transactions between the external RAM (grey blocks) and the stages. The IIR filter stage has to keep 3 frames in bank number one, the steering stage reads the orientation weights from bank number three, and the send/receive module connects the input/output data between the FPGA and the host system via the PCI bus using DMA transfers. Figure 5 shows the performance of the whole systems with chained stages in terms of pixels processed per second; it is possible to compute 177 frames/second at a resolution of 128 × 96 pixels in the basic architecture and 37.9 frames/second in the extended architecture.

Quality of the Results
An accuracy analysis has been carried out to examine the quality of the results under different transformations and metrics, as shown in Figure 6. The phase and modulus metrics (differences between values) behave well under the implementation changes, and Barron's metric keeps its accuracy roughly in proportion under these changes, but the Galvin and Galvin perpendicular metrics suffer when the implementation changes from SW to HW. This is due to the nature of the metric, which gives an idea of how the algorithm copes with the aperture problem [8]; this topic is discussed in previous work [14]. Although each version restricts the precision parameters one step further, ending with the extended architecture, the error values remain reasonably bounded in general.

Some Visual Results
Figure 7 shows some visual results corresponding to different versions of our system, specifically SW versus HWbas. While the SW version keeps a calculation density close to 100% (middle row in Figure 7), HWbas loses some points because of the bit width precision (bottom row in Figure 7), that is, the number of bits of the parameters in each stage. The input sequence, called diverging tree (upper row in Figure 7), has a divergent structure in which the modulus is expected to vary little and the phase changes regularly over 360°. Since we are working with synthetic sequences, the error can be estimated without any ambiguity. We have also used the translating tree sequence, in which the modulus changes from left to right and the phase is almost constant.

Comparison with Other Approaches
There are other gradient optical flow models implemented in hardware [21,22], based on the Lucas and Kanade algorithm [23] and on Horn and Schunck approximations [24,25]. Table 4 lists the average error for different metrics, although we compare only Barron's metric, since the cited authors do not provide other measurements. In terms of error, our implementation provides better results than the other approaches, even with a calculation density of 100%. Nevertheless, the final results improve if the points where the scene structure changes, that is, points whose temporal derivative is below a given threshold, are filtered out. This is caused by the least squares process performed at the end of the algorithm to calculate the final modulus and phase values. The filtered points would force the slope of the linear regression to be very small, making the velocity value almost null.
Regarding throughput, we are able to process more than 2000 Kpixel/s in the basic architecture and about 1000 Kpixel/s in the extended one. This places our implementation between those in [23,26], which is enough for real-time purposes, although it could be improved by using a board with more resources than the one used here and by increasing the level of parallelism.
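
Pixel rates and frame rates are related through the frame size, as the following Python sketch shows for the 128 × 96 resolution used in this work. The function names are illustrative; the 177 frames/second input is the figure quoted for the basic architecture in the Performance section.

```python
def frames_to_kpixels_per_second(fps, width=128, height=96):
    """Frame rate -> throughput in Kpixel/s for a given frame size."""
    return fps * width * height / 1000.0

def kpixels_to_frames_per_second(kpps, width=128, height=96):
    """Throughput in Kpixel/s -> frame rate for a given frame size."""
    return kpps * 1000.0 / (width * height)

basic_kpps = frames_to_kpixels_per_second(177)        # about 2175 Kpixel/s
fps_at_1000_kpps = kpixels_to_frames_per_second(1000)  # about 81.4 frames/s
```

The same conversion gives the frame rate that could be expected at other resolutions, since the pixel rate of the pipeline is what the hardware fixes.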
The error for the diverging and translating tree sequences [17] is shown in Figure 7; it is obtained with the different metrics given by expressions (12)-(13).

Conclusion
We have developed an FPGA-based implementation of a bioinspired robust motion estimation system whose complexity is higher than that of other gradient-based models commonly used in the literature. The precision study calibrates the model and adjusts the bit width needed to keep a tradeoff between accuracy and efficiency, acting as a bridge between software and hardware and estimating the cost of converting every stage from floating point to fixed point. From the results of this precision study, different hardware modules have been designed and organized into two highly parallel architectures. The first one, referred to as the basic architecture and common to gradient-based optical flow models, is a superconvolutive processor oriented along multiple angles. It could be used as a starting point for many computer vision algorithms not necessarily restricted to motion estimation, such as change detection, stereo, or even biometric techniques such as real-time signature recognition. The second architecture, called extended, is focused on the Multichannel Gradient Model and includes the truncated Taylor expansion representation of the spatiotemporal information of the scene, its three derivatives with respect to space and time, and the quotients of the products of these derivatives. The remaining stages, called velocity primitives and corresponding to expressions (8)-(9), are performed in software within the codesign process; there, the final modulus value is a quotient of determinants and the final phase is an arctangent. This extension can be implemented on a board with more resources than the Virtex 2000E and, depending on the accuracy required, with a structure based on LUTs or with a CORDIC core. Both architectures are scalable and modular, and can be extended to a device with more resources than our prototyping platform.
Additionally, the resources consumed, the throughput, and the accuracy of the designed coprocessors have been evaluated. All models are built as asynchronous segmented architectures (micropipelines). Regarding quality, the average error has been compared using Barron's metric, since other authors do not provide results with other metrics; the throughput of the design has also been compared with other implementations. This work generates dense optical flow maps at up to 80 frames/second and 185 frames/second at a resolution of 128 × 96 in the extended and basic architectures, respectively. The present contribution opens the door to embedding complex bioinspired systems that require a huge amount of computation. We are currently improving the system to extend the model to a fully standalone platform and to deal with stereo vision. Several application fields are envisaged, such as motion illusion detection or video compression.

Figure 1: General scheme of the McGM algorithm.

Figure 2: Evaluation of the bit width needed in the modulus (a) and phase (b) converting the data to fixed point.

Figure 3: Scheme of the communication process.

Figure 4: Scheme of the two architectures working with an asynchronous pipeline.

Figure 5: Throughput of the pipeline (Kpps) and frequency corresponding to basic and extended architectures.

Figure 7: Some visual results corresponding to the software version versus the basic architecture (diverging tree sequence). The left-hand side shows the velocity modulus and the right-hand side the velocity phase.

Table 1: Parameters of each stage (100% density). F_1: temporal IIR; F_2: spatial FIR; W_3: steering weight; W_4: Taylor expansion; O i r: bit width of the output of stage i.

Table 2: Summary of the different implementations.

Table 3: Slices and memory requirements and number of cycles for basic and extended architectures.

Table 4: Summary of the different implementations for the Yosemite sequence. NP means not provided.