Multi-Softcore Architecture on FPGA

. To meet the high performance demands of embedded multimedia applications, embedded systems are integrating multiple processing units. However, they are mostly based on custom-logic design methodology. Designing parallel multicore systems using availablestandardsintellectualpropertiesyetmaintaininghighperformanceisalsoachallengingissue.Softcoreprocessorsandfield programmablegatearrays(FPGAs)areacheapandfastoptiontodevelopandtestsuchsystems.ThispaperdescribesaFPGA-based designmethodologytoimplementarapidprototypeofparametricmulticoresystems.AstudyoftheviabilityofmakingtheSoC usingtheNIOSIIsoft-processorcorefromAlteraisalsopresented.TheNIOSIIfeaturesageneral-purposeRISCCPUarchitecture designedtoaddressawiderangeofapplications.Theperformanceoftheimplementedarchitectureisdiscussed,andalsosome parallelapplicationsareusedfortestingspeedupandefficiencyofthesystem.Experimentalresultsdemonstratetheperformance oftheproposedmulticoresystem,whichachievesbetterspeedupthantheGPU(29.5%fasterfortheFIRfilterand23.6%fasterfor thematrix-matrixmultiplication).


Introduction
The requirements for higher computational capacity and lower cost of future high-end real-time data-intensive applications like multimedia and image processing (filtering, edge detection, correlation, etc.) are rapidly growing.As a promising solution, parallel programming and parallel systems-ona-chip (SoC) such as clusters, multiprocessor systems, and grid systems are proposed.Such sophisticated embedded systems are required to satisfy high demands regarding energy, power, area, and cost efficiency.Moreover, many embedded systems are required to be flexible enough to be reused for different software versions.Multicore systems working in SIMD/SPMD (Single Instruction Multiple Data/Single Program Multiple Data) fashion have been shown to be powerful executers for data-intensive computations [1,2] and prove very fruitful in pixel processing domain [3,4].Their parallel architecture strongly reduces the amount of memory access and the clock speed, thereby enabling higher performance [5,6].Such multicore systems are made up of an array of processing elements (PEs) that synchronously execute the same program on different local data and can communicate through an interconnection network.These parallel architectures are characterized by their regular design, which enables designcost effective scaling of the platform for different performance applications by simply increasing or reducing the number of processors.Although these architectures have accomplished a great deal of success in solving data-intensive computations, their high price and the long design cycle have resulted in a very low acceptance rate [7].Designing specific solutions is a costly and risky investment due to time-to-market constraints and high design and fabrication costs.Embedded hardware architectures have to be flexible to a great extent to satisfy the various parameters of multimedia applications.To overcome these problems, embedded system designers are increasingly relying on field programmable gate arrays (FPGAs) as target design platforms.Recent FPGAs provide a high level of logic element density [8].They are replacing ASICs (Application Specific Integrated Circuits) in many products due to their increased capacity, smaller nonrecurring engineering costs, and programmability [9,10].It is easy to put many cores and memories on an FPGA device, which today has over 100 million transistors available.Such platforms allow creating increasingly complex multiprocessor system-on-a-chip (MPSoC) architectures.FPGAs also support the reuse of standard or custom IP (intellectual property) blocks, which  further decreases system development time and costs.A good example is the use of softcores, such as the NIOS II [11] and the MicroBlaze [12] for Altera and Xilinx FPGAs, respectively, to implement FPGA-based embedded systems.Thus, the importance of multicore FPGAs platforms steadily grows since they facilitate designing parallel systems.Based on configurable logic (softcores), such systems provide a good basis for parallel, scalable, and adaptive systems [13].
In a context of high performance, low technology cost, and application code reusability objectives, we target FPGA devices and softcore processors to implement and to test the proposed multicore system.The focus of this work is to study the viability of softcore replication design methodology to implement a multicore SPMD architecture dedicated to compute data-intensive signal processing applications.Three benchmarks that represent different application domain were tested: FIR filter, spatial Laplacian filter, and matrix-matrix multiplication, characteristic for 1D digital signal processing, 2D image filtering, and linear algebra computations, respectively.All tested applications involve the computation of a large number of repetitive operations.
The remainder of this paper is organized as follows.The next section briefly introduces the multicore system with focus on the proposed FPGA-based design methodology.In Section 3, the main characteristics of the NIOS II softcore processor are presented.The methodology followed to adapt the NIOS II IP to the parallel SoC requirements is also described.Section 4 discusses some experimental results showing the performance of the softcore-based parallel system.Section 5 gives an overview of related work.A performance analysis and comparison of the implemented system with other architectures and a discussion on the architectural aspects are also provided.Finally, Section 6 presents the concluding remarks and further work.

Softcore-Based Replication Design Methodology
In this section, we first give an overview over our multicore architecture including the different used components.Then, we describe the replication design methodology based on the use of the NIOS II softcore as a case study.

Multicore SoC Overview.
The proposed parallel multicore system works in SPMD fashion.As presented in Figure 1, it consists of a controller that accesses its sequential data memory (DATAM).It synchronously controls the whole system composed of a 2D grid of processing elements (PEs) interconnected via two types of interconnection networks thanks to network interfaces (NI).Indices  and  present the number of columns and the total number of PEs (  *  ), respectively.Index  is calculated by All processors execute the same instructions coming from a parallel instruction memory (InstM).The controller serves as a master and can only execute sequential program whereas each PE executes the same parallel program while operating on its local data.The execution of a given instruction depends on the activity of each PE, identified by a unique number in the array.In fact, each processor in the multicore system is characterized by its identity to specify its own job.The multiple processors can communicate via two types of networks: a neighbourhood network and a crossbar network (global router) that assures point-to-point communications between processors and also performs parallel I/O data transfer.The neighbourhood network interconnects the PEs in a NEWS fashion so that every PE can communicate with its nearest neighbour in a defined direction (north, south, east, or west).It mainly consists of a number (equal to the number of PEs) of routing elements (RE) which are arranged in a grid fashion and are responsible of routing the data between PEs.The routers may be customized at compile time to implement three different topologies (Mesh, Torus, and Xnet).It is also possible to change dynamically from one topology to another according to the application needs.The global router assures the irregular communications between the different components and can be configured in four possible bidirectional communication modes: PE-PE, PE-Controller, PE-I/O Peripheral, and Controller-I/O Peripheral.These different described communication patterns are handled by the communication instructions (presented in Section 3.2) as defined by the designer.To achieve the required architecture performance and configurability, a fully parameterizable and structural description of the networks of the proposed architecture was done at the RTL (Register Transfer Level) using the VHDL (Very High Speed Integrated Circuits Hardware Description Language).
The proposed multicore SoC is a parameterizable architecture template and thus offers a high degree of flexibility.Some of its parameters can be reconfigured at synthesis time (such as the number of PEs as well as the memory size), whereas other parameters can be reconfigured at runtime like the neighbourhood network topology.Table 1 details these different parameters.This multicore prototype serves as a base to test different configurations with different processors.The designer just needs to use a given processor IP and adapt the network interface to that processor.Such characteristics make the system scalable and flexible enough to satisfy a wide range of data-parallel applications needs.
The following subsection highlights the FPGA-based replication design methodology followed to build the multicore system.A case study is then highlighted based on the NIOS II softcore.

Design Approach.
To design a parallel multicore SoC, we propose in this paper a replication design methodology, which consists in using a softcore to implement the different processors in the multiprocessor array.Following this manner, the design process is easier and faster than traditional design methods.As a consequence, it offers a large gain in the development time.The additional advantage of this methodology is the facility of programming since we use the same instruction set for all processors.The aim is to propose a flexible, programmable, and customizable architecture to accelerate parallel applications.
Figure 1 illustrates the multicore architecture based on the replication design methodology.The designer has to carefully respect the system requirements in order to integrate a given processor IP.In the proposed system, the controller is in charge of controlling the global router functioning and transferring or reading data to or from PEs respectively.The PE executes parallel instructions and communicates with neighbouring PEs through the neighbourhood network or with the controller and peripherals through the global router.To accomplish these different functionalities, the controller has to be interfaced with the global router whereas the PE has to be interfaced with the neighbouring network as well as the global router (as shown in Figure 1).In both cases, the networks only need data and address signals issued from the different processors.Table 3 details the global router network interface with the processors.It is scalable so that it is adapted to a parametric number of PEs.This is performed by mapping each PE data and request to input ports configured as a vector of length equal to the number of PEs.The neighbourhood network interface only contains data and addresses coming or going to PEs.It consists of a number of routers (equal to the number of PEs), each one connected to one PE.Connections between the routers allow transferring data through several paths across the architecture.Both implemented on-chip networks are reliable, in the sense that data sent from one core is guaranteed to be delivered to its destination core.The results obtained for the network interface hardware cost, on the Stratix V FPGA [14], are summarized in Table 2.It is remarkable from the synthesis results given in Table 2 that the implemented network adapters present a cheap solution requiring a significant small logic area on the FPGA.This demonstrates the simplicity of the interface and confirms that changing the processor IP is an easy task.Thus, small design time is required to reconfigure the architecture.
The next section discusses more precisely the use of the NIOS II IP to implement a multicore configuration following the replication methodology.

NIOS II-Based Multicore Architecture Overview
3.1.Architecture Description.The NIOS II embedded softcore is a reduced instruction set computer, optimized for Altera FPGA implementations.A block diagram of the NIOS II core is depicted in Figure 2 [11].
It can be configured at design time (data width, amount of memory, included peripherals, etc.) in order to be optimized for a given application.The most relevant configurations are the clock frequency, the debug level, the performance level, and the user-defined instructions.The performance level configuration enables the user to choose one of the three provided processors: the NIOS II/fast, which results in a larger processor that uses more logic elements but is faster and has more features; the NIOS II/standard, which creates a processor with balanced relationship between speed and area and some special features; the NIOS II/economic, which generates a very economic processor in terms of area, but very simple in terms of data processing capability.
In this work, we use the NIOS II economic version in order to put a large number of processors in one FPGA device since we target a multicore SoC.NIOS II processors use Avalon bus for connecting memories and peripherals.So, the implemented network interface is based on Avalon interface.The Avalon interface is basically an interface that creates a common interface from different interfaces of all memory and peripheral components of the system.In this work, we use the Avalon-Memory Mapped interface (Avalon-MM) [15].It is an address-based read/write interface synchronous to the system clock.A wrapper, considered as the network interface, has been implemented in order to be able to integrate the NIOS with other components of the architecture, mainly the neighbourhood network and the global router.This wrapper is a custom component which has two interfaces: one Avalon slave interface to be connected to the NIOS and one conduit interface (for exporting signals to the top level) to be connected to the networks.To assure communication, we only need addresses and data coming from processors as well as the read/write enable signal to indicate if it is a read or write operation.The NIOS-based wrapper is responsible of transferring data and addresses of the attached processor to the appropriate network (dependently on the address coding).Moreover, it communicates the data and addresses received from networks to the processor depending on its identity number.
To design systems which integrate a NIOS II processor, Quartus II has as complement a SW called Qsys [16].Qsys is used to specify the NIOS II processor cores, memory, and other components the system requires.Qsys automatically generates the interconnect logic to integrate the components in the hardware system.The NIOS II-based SoC synthesis flow is summarized in Figure 3.
In the implemented architecture, each processor is connected to a timer which provides a periodic system clock tick, a system ID peripheral which allows uniquely identifying the PE, and to a local data memory.In addition, the controller is connected to a JTAG UART which allows communicating PC with the NIOS core through the USB-Blaster download cable and providing debug information.Each processor is also connected to a custom peripheral which integrated the network interface, allowing transferring the processor requests to the other components in the multicore architecture.An example of 4-core architecture is shown schematically in Figure 4.In the architecture, different subsystems are connected by network interfaces (NI) to networks (neighbourhood network and global router).A network interface is used to send and receive data transmitted over the network.Each subsystem contains the processor and some peripherals or accelerators.The total number of PEs, defined by the number of rows and columns, can be specified before synthesis.
The design steps followed to implement the multicore architecture show the simplicity of the replication methodology to be applied on any softcore processor and the configurability of the proposed SoC prototype.To build a multicore system, two steps have to be performed: to adapt the network interface to be connected to the chosen processor and to define the corresponding communication instructions based on the processor instructions since the HW networks are managed by the SW program.The latter step is more detailed in the next paragraph.

SW Programming.
The main important instructions that depend on the used softcore are communication instructions.Such instructions are defined to assure communicating data through both networks.In this work, they are based on standard C and specific NIOS instructions.Accessing and communicating with NIOS II peripherals can be accomplished using the HW abstraction layer (HAL) interface of the NIOS II processor.In fact, the HAL provides the C macros IORD (IO Read) and IOWR (IO Write) that expand to the appropriate assembly instructions [17].These instructions allow accessing and communicating with different peripherals connected to the NIOS II processor.Each communication instruction consists in writing or reading data from the defined address.In our case, the data has a 32-bit length and the address has a 12-bit length.The address field is different depending on the communication network.In the case of the global router the most significant bit (the bit at the 12th position) is set to "1, " whereas in the case of the neighbourhood network it is set to "0. " To assure global router communications, we distinguish different instructions as illustrated below.(iii) RECEIVE instruction allows receiving data through the global router: data = IORD (NI BASE, address).It analogously takes the same address field as the SEND instruction.
In the case of the neighbouring communication, SEND and RECEIVE instructions are also encoded from IORD and IOWR NIOS macros.Compared to the global router instructions, they only differ in the address field, which contains the direction to which the data will be transferred (0: north, 1: east, 2: west, 3: south, 4: northeast, 5: northwest, 6: southeast, and 7: southwest).
Altera provides the NIOS Integrated Development Environment tool to implement and compile the parallel SW program.The following section presents the implementation and experimental results of the proposed multicore design.

Synthesis Results.
The used development board is the Stratix V GX FPGA that provides a hardware platform for developing and prototyping high-performance application designs [14].
It is equipped with a Stratix V (5SGXEA7K2F45C2) FPGA, which has 622000 logic elements (LEs) and 50-Mbit embedded memory.It presents many peripherals, such as memories, expansion pins, and communication ports.The used SW tools are the Quartus II 11.1, the Qsys, and the NIOS II IDE that allow designers to synthesize, program, and debug their designs and build embedded systems on Altera FPGAs.
Table 4 shows synthesis results for different multicore configurations varying the number of cores as well as the memory size.These systems contain both neighbouring and global router networks.The maximum frequency resulted from each SoC design is also measured.
It is clearly obvious that the consumed FPGA area increases as the number of processors increases.As illustrated in Table 4, the parallel architectures occupy proportional areas while increasing their size.A 32-core configuration just consumes 8% of the available logic elements as well as 6% of memory blocks.This shows the efficiency of the implemented architecture.It is so possible to implement up to 300 cores in the Stratix V 5SGXEA7K2F45C2 FPGA.Table 4 also presents the clock frequency resulting values which slightly drop due to the used networks.

Tested
Benchmarks.This section gives better insight into the use and performance of the parametric multicore system by executing different benchmarks.As benchmark, we considered three algorithms to represent different application domain.They are FIR filter, spatial Laplacian filter, and matrix-matrix multiplication, characteristic for 1D digital signal processing, 2D image filtering, and linear algebra computations, respectively.The applications are implemented in C language, using the NIOS II IDE.The results for execution time and speed are measured for different multicore configurations.Based on the results obtained and the system requirement, suitable architecture can be chosen for the targeted application.In fact, the nature of the application determines the scaling that needs to be applied to the SoC in order to meet the performance.
(1) FIR Filter.Finite Impulse Response (FIR) filtering is one of the most popular DSP algorithms.Such computations require increased number of multiplications and additions as well as multiple memory read operations.A FIR filter is implemented with the following equation: where  is the input signal and   are the filter coefficients. is the filter order or number of taps.An -order FIR filter requires  multiplications and  additions for every output signal sample and 2 memory read operations (for input signal and filter coefficients).We implement the FIR filter on different multicore configurations to study the scalability of the system and the influence of the type of the interconnection network.The results, shown in Figure 5  of the implemented FIR filter on different multicore configurations.Two communication mechanisms are tested in this case study: the neighbouring communication and the point-to-point communication, through the integration of an Xnet network and a crossbar network, respectively.It is clear from Figure 5 that the designed system is scalable and efficient while integrating more cores on the chip.When the number of cores increases, a larger part of the output signal can be calculated at the same time.Then, communication instructions will be decreased, which enhances the speedup of the system.In fact, the 64-core configuration with an Xnet network has a speedup 7 times higher than the 2-core configuration.As expected, the multicore architecture with a neighborhood interprocessor network is more efficient to compute FIR filtering operations.Thus, based on flexible communication network, the designer can choose the best interconnect suited to their application and its requirements.Furthermore, efficient communication is critical to achieve high performance since the interconnect scheme can significantly affect the type of algorithms running on the architecture.
(2) Spatial Laplacian Filter.We consider a 2D filter such as Laplacian filter, which smooths a given image (of size 800 × 480 pixels) by convolving it with a Laplacian kernel (3 × 3 kernel [9]).The parallel algorithm implemented in this work has been proposed in [18].Such application can be performed in parallel, so that multiple pixels can be processed at a time.Each core reads an amount of pixels (depending of the image size) from an external SDRAM memory and stores them in vectors.Then, it multiplies the vector elements by the filter kernel coefficients.After that, all resultant elements are added and divided for scaling.As a result, each core simultaneously produces a convolution of  pixels, where  is the number of cores integrated in the system.In this algorithm, we need to integrate the global router to assure data transfers from SDRAM to data memories of the processors.The controller is so responsible of controlling and configuring the network.The execution time results and speedup values are shown in Figure 6 varying the number of cores.Figure 6 clearly shows that the speedup exponentially increases while increasing the number of cores.It shows a good scalability since doubling the number of cores provides an average speedup of 1.6.We deduce that the multicore system is efficient to compute image processing applications.For example, we reach a speedup 5 times over a sequential execution when integrating 8 PEs, which is quite appreciable.It is also noted that the execution time slightly decreases between 16 and 32 cores due to the time needed to accomplish communications.
(3) Matrix-Matrix Multiplication.One of the basic computational kernels in many data parallel codes is the multiplication of two matrices ( =  × ).Matrix multiplication is considered an important kernel in computational linear algebra.It is a very compute-intensive task, but also rich in computational parallelism, and hence well suited for parallel computation [19].In this work, the tested multicore configuration integrates 64 PEs.We study the influence of the interconnection network topology on the application performance.The PEs are arranged in 8 × 8 grid.To perform multiplication, all-to-all row and column broadcasts are performed by the PEs.The following code for PE(, ) is executed by all PEs simultaneously, in a 64-core configuration.In this case, matrices  and , of size 128 × 128, are partitioned into submatrices (, ) and (, ) of size 16 × 16.
/ * East and West data transfer * / For k = 1 to 7 do Send A(i,j) to PE(i,(j+k) mod 8) / * North and South data transfer * / For k = 1 to 7 do Send B(i,j) to PE((i+k) mod 8,j) The obtained time results are depicted in Figure 7 with differently sized matrices.Figure 7 demonstrates that the used network influences the total execution time.In our case, it was easy to test different topologies thanks to the implemented mode instruction.From the SW level, the programmer can choose the interconnection network topology they want to use.The results show that the Torus network is the most appropriate neighbourhood network for the matrixmatrix multiplication.
To further evaluate the performance of the implemented multicore configurations, we measure the communicationto-computation ratio (CCR) for the three tested benchmarks based on a 32-core configuration.Table 5 depicts the obtained results.It is clearly shown that the measured communicationto-computation ratio is small, which indicates that the number of processing cycles exceeds the number of communication cycles.This allows having a good speedup.
Table 6 shows the parallel efficiency for the tested benchmarks depending on the number of PEs.The parallel efficiency decreases as the number of cores increases.We notice that the efficiency drops down to 12%, when executing the FIR filter with 64 cores.In this case, the additional cores do not deliver additional performance due to the needed extra communications.Through these results, we can choose the best number of cores to parallelize a given application.

Multicore Architectures and Performance Analysis
5.1.Related Work.Several on-chip multicore platforms have been proposed [20].Different multiprocessor architectures with the NIOS II softcore were also presented in [21].These platforms are designed to raise the efficiency in terms of GOPS per Watt.In this work, a particular attention is given to architectures working in SIMD/SPMD fashion and prototyped on FPGA devices.One of the famous high performance architectures is the GPGPU.The Fermi processor is an example [22].It is composed of 512 cores.However, these architectures present some limits in terms of the access memory bottleneck since all cores share the same memory and the high power consumption making them not suitable in the embedded system field.
In [23], the authors present a parameterizable coarsegrained reconfigurable architecture.Different interconnect topologies can be reconfigured, and static as well as dynamic parameters are described.This work mainly concentrates on providing a dynamic reconfigurable network by developing a generic interconnect wrapper module.The PEs used in the design have limited instruction memory and can only execute specific instructions needed in digital signal processing.
In [24], the authors present massively parallel programmable accelerators.The authors propose a resource-aware parallel computing paradigm, called invasive computing, where an application can dynamically use resources.They show that the invasive computing allows reducing energy consumption by dynamically powering off the idle regions at a given time.
A many-core computing accelerator for embedded SoCs, called Platform 2012, is presented in [25].It is based on multiple processor clusters and works in MIMD fashion.One cluster can hold a number of PEs (STxP70-V4 processor) varying from 1 to 16. P2012 is a Globally Asynchronous Locally Synchronous (GALS) fabric of clusters connected through an asynchronous global NoC.
The Cell Broadband Engine Architecture (CBEA) [18] is a multicore system, which integrates a PowerPC PE and 8 Synergistic PEs dedicated to compute intensive applications.It has been shown to be significantly faster than existing CPUs for many applications.
Many modern commercial microprocessors also have SIMD capability, which has become an integral part of the Instruction Set Architecture (ISA).Design methods include combining a SIMD functional unit with a processor or exploiting the SIMD instructions extension [26] present in softcores to run data-parallel applications.SIMD extensions are used, for example, in ARMs Media Extensions [27] and PowerPCs AltiVec [28].The SIMD-style media ISA extensions are presented in most high-performance generalpurpose processors (e.g., Intels SSE1/SSE2, Suns VIS, HPs MAX, Compaqs MVI, and MIPs MDMX).Their drawback is due to their data access overhead for instructions, which becomes an obstacle to the performance enhancement.
In [29], the authors present heterogeneous multicore architectures with CPU and SIMD 1-dimensional PE array (SIMD-1D) and MIMD 2-dimensional PE array (MIMD-2D) accelerators.A custom hardware address generation unit (AGU) is used to generate preprogrammed addressing patterns in smaller number of clock cycles compared to the ALUbased address generation.The SIMD-1D accelerator has 8 PEs and a 16 bit × 256 16-bank shared memory connected through a crossbar interconnection network.Through a matrixmatrix multiplication benchmark, the authors demonstrate that the proposed SIMD-1D architecture is 15 times powerefficient than GPU for the fixed-point calculations.
Another FPGA-based massively parallel SIMD processor is presented in [30].The architecture integrates 95 simple processors and memory on a single FPGA chip.The authors show that their proposed architecture can be used to efficiently implement an RC4 key search engine.An applicationspecific multicore architecture designed to perform a simple cipher algorithm is presented in [31].The architecture consists of four main modules: the sequential processor that is responsible for the interpretation and execution of a userdefined program, the queuing module that is responsible for scheduling tasks received from the sequential processor, four parallel processors that are idle until they get tasks from the queuing module, and the parallel memory that holds the data and key for processing and allows for concurrent access from all parallel processors.There is also a monitoring module which interprets user-control signals and sends back their results via a specified output.The architecture can integrate more parallel cores; however this can be done manually and so requires long design time.
These different platforms are summarized in Table 7.

Performance Analysis.
The multicore architecture, discussed in this paper, is suitable for data-intensive computations.Comparing with embedded general-purpose graphics processing units, such architectures have both huge number of processing units; however, there exist also some differences.GPUs used in multiprocessor systems-on-a-chip have direct access to the main memory, shared with the processor cores.In GPUs, there is no neighbouring communication between the cores.Thus, exchanging data is made via the shared memory.In contrast, the PEs in the proposed multicore SoC can communicate with each other and also other components via two networks.GPUs are mainly dedicated to massively regular data-parallel algorithms.However, the proposed multicore architecture can handle both regular and irregular data parallel algorithms thanks to its communication features.To validate this reasoning, our multicore SoC and a commercial embedded GPU are evaluated for selected algorithms (FIR filter and matrix-matrix multiplication).A Samsung Exynos 5420 Arndale Octa Board is used.It features a dual-core ARM Cortex-A15 that runs at 1.7 GHz and an ARM Mali T-604 GPU operating at 533 MHz [32].The GPU has a theoretic peak performance of 68.22 GOPS.For the comparison, a 16-core configuration (the PEs are connected via a Xnet network) was considered.The multicore architecture operated at 337 MHz.The considered architecture has a peak performance of 10.78 GOPS.The performance comparison is illustrated in Table 8.Execution time measures and the achieved fraction of the peak performance are highlighted.
Table 8 shows that our proposed multicore system achieves better speedup than the GPU (29% faster for the FIR filter and 23% faster for the matrix-matrix multiplication).Our system is also more efficient since it achieves a better fraction of the peak performance than the GPU.This proves the well resource utilization in the multicore on-chip implementation.The aforementioned results demonstrate the performance of the proposed architecture.We think that our system can be coupled with GPU to form a high-performance multiprocessor system.
Tanabe et al. [33] propose an FPGA-based SIMD processor.The proposed architecture is based on SuperH processor that was modified to include more numbers of computation units and then performs SIMD processing.A JPEG case study was presented; however no idea about programming was given.The achieved clock frequency of the SIMD processor is too low (55 MHz) while many softcores run faster these days.Table 9 compares between our multi-softcore architecture and the architecture described in [33] composed of one control unit and a number of computation units (CU) in terms of power consumption and frequency.As illustrated, our system achieves good results since it presents a higher frequency and a lower power consumption compared to the SIMD processor.The replication design methodology makes the design of a multi-softcore architecture easier and can integrate a parametric number of PEs compared to the used dedicated implementation [33] that makes adding more computational units to the architecture a heavy task.Compared to the multicore architecture described in [31] which is application specific, our proposed architecture is parametric and scalable and can satisfy a wide range of dataintensive processing applications.In addition, our system integrates two types of flexible communication networks that can efficiently handle communication schemes present in many applications.In terms of performance, our multicore architecture can reach a speedup of 1.9 with two cores and 3.2 with four cores.In comparison, the achieved results presented in [31] show a speedup reaching 1.85 with two cores and 2.92 with four cores active.
The preceding performance comparison confirms the promising advantages and performance of the proposed FPGA-based multicore architecture.This paper proves that an FPGA-based multicore design built with softcores is powerful to compute data-intensive processing.The presented architecture is easy to design, configurable, flexible, and scalable.Thus, it can be tailored to different data-parallel applications.

Conclusion
In this paper, we have presented a softcore-based multicore implementation on FPGA device.The proposed architecture is parametric and scalable, which can be tailored according to application requirements such as performance and area cost.Such a programmable architecture is well suited for dataintensive applications from the areas of signal, image, and video processing applications.Through experimental results, we have demonstrated better performance gains of our system in comparison to state-of-the-art embedded architectures.The presented architecture provides a great deal of performance as well as flexibility.
Future work will focus on studying the energy efficiency of the presented system, as power consumption is considered the hardest challenge for FPGAs [34].We want to explore the usage of custom accelerators in the proposed architecture.We also envision studying a possible dynamic exploitation of the available level of parallelism in the multicore system based on the application requirements.

Figure 5 :
Figure 5: Execution time of FIR filter on different multicore configurations and two types of networks.

Table 2 :
Area occupancy of the network interfaces.

Table 4 :
NIOS II/e-based multicore SoC implementation results on FPGA.

Table 5 :
Communication-to-computation ratio for a 32-core configuration.

Table 8 :
Performance results of Mali-T604 GPU and 16-core architecture.Numbers in brackets denote the difference between GPU's and our architecture's values.