Embedded system design is increasingly based on single-chip multiprocessors because of high performance and flexibility requirements. Embedded multiprocessors on FPGA provide additional flexibility by allowing customization through the addition of hardware accelerators when a parallel software implementation does not provide the expected performance, while the overall multiprocessor architecture is kept for additional applications. This provides a transition path from software-only parallel implementation while avoiding a pure hardware implementation. An automatic design flow is proposed, well suited to data-flow signal processing exhibiting both pipelined and data-parallel modes of execution. Fork-Join model-based software parallelization is explored to find the best parallelization configuration. A C-based synthesis coprocessor is added to improve performance at the cost of more hardware resources. The Triple Data Encryption Standard (TDES) cryptographic algorithm on a 48-PE single-chip distributed memory multiprocessor is selected as an application example of the flow.
The International Technology Roadmap for Semiconductors (ITRS) [
In this paper an automatic design flow is proposed for data-parallel and pipelined signal processing applications on an embedded multiprocessor with NoC, and the block cipher TDES is implemented on a 48-core single-chip distributed memory multiprocessor. The proposed flow first explores parallel software implementation through multi-FPGA emulation of the embedded multiprocessor, designed for pipelined and data-parallel applications with emphasis on data locality. In a second phase the flow adds hardware accelerators, connected as coprocessors to the embedded multiprocessor through high-level synthesis, in order to finally propose a better-performing solution to the designers.
The paper is organized as follows. Section
The proposed work is related to embedded multiprocessors, system-level synthesis design flows, and design space exploration of customizable embedded multiprocessors. A few commercial multicore implementations have been proposed. They can be globally divided into two categories: (1) general purpose and (2) application specific. ARM ARM11MPcore [
Several embedded multiprocessor design flows have been reported in the literature. These flows differ in their evaluation methods, system specification, and application specification. The first difference is the evaluation method on which the exploration is based. SoCDAL [
The heterogeneous design flow used for this study is described in Figure
Automatic heterogeneous design flow.
Having achieved maximum design space exploration through parallel programming, the second step is to explore coprocessor-based TDES parallel execution by incrementally adding TDES coprocessors generated by C-based synthesis. A final step compares both paths to select the most appropriate implementation. If the performance of software parallelization meets the system design objective, the parallelized application is executed on the multiprocessor. If not, coprocessors are added to the multiprocessor to improve system performance at the cost of more hardware resources.
The automatic design flow approach requires a tight and efficient integration of EDA tools. The Zebu-UF multi-FPGA platform is used and different EDA tools from EVE [
The Zebu-UF platform is based on 4 Xilinx Virtex-4 LX200 FPGAs, built as an extended PCI card via a motherboard/daughtercard approach.
The 4-FPGA system can emulate the equivalent of up to 6 million ASIC gates in a single card. Zebu-UF also includes on-board memory capacity based on 64 MBytes of SSRAM and 512 MBytes of DRAM memory chips via an additional memory board, which plugs into the PCI motherboard. Table
EVE Zebu-UF4 platform details.
Modules | Descriptions |
---|---|
FPGA | 4 Virtex-4 LX200 |
DRAM | 512 MBytes |
SSRAM | 64 MBytes |
ICE | Smart and direct |
Eve Zebu-UF4 Platform.
Design automation tools from 4 commercial companies are combined to generate the multi-FPGA MPSoC and the parallelized execution files for emulation, as described in Figure
Workflow of multi-FPGA MPSoC.
The Xilinx EDK tool is used to generate the Small-Scale Multiprocessor (SSM), which is described in Section
Hardware accelerators are blocks of logic that are either automatically generated or manually designed to offload specific tasks from the system processor. Many math operations are performed more quickly and efficiently when implemented in hardware than in software. On FPGAs, hardware accelerators can be implemented as complex, multicycle coprocessors with pipelined access to any memory or peripheral in the system. They can utilize FPGA resources (such as on-chip memory and hard-macro multipliers) to implement local memory buffers and multiply-accumulate circuits. Using as many master ports as necessary, they can initiate their own read and write operations and access any I/O pin in the system. Hardware accelerators are a great way to boost the performance of software code and take full advantage of the high-performance architecture of FPGAs. The design and software simulation of a hardware accelerator uses a CAD tool called ImpulseC. This software allows the creation of applications intended for FPGA-based programmable hardware platforms similar to [
The ImpulseC compiler translates and optimizes ImpulseC programs into appropriate lower-level representations, including Register Transfer Level (RTL) VHDL descriptions as in Figure
(a) Block diagram of accelerator connection forms, (b) C-based HW accelerated system design workflow.
The main concurrency feature is pipelining. As pipelining is only available in inner loops, loop unrolling becomes the solution for obtaining large pipelines. Parallelism is extracted automatically; explicit multiprocess descriptions can also be written to express parallelism manually.
ANSI C types and operators are available, such as int and float, as well as hardware types like int2, int4, and int8. Float-to-fixed-point translation is also available.
The only way to control pipeline timing is through a constraint on the delay of each pipeline stage; the number of pipeline stages, and thus the throughput/latency, is thereby tightly controlled.
All arrays are stored either in RAM or in a set of registers according to a compilation option.
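As a hedged illustration of these features, the fragment below sketches what a pipelined ImpulseC-style process could look like. The co_stream API calls and the CO PIPELINE / SET stageDelay pragmas follow our reading of the ImpulseC documentation; the process name, the 32-bit stream width, and the placeholder round function are our own assumptions, not code from the paper.

```c
/* Illustrative ImpulseC-style process: reads 32-bit words from an input
 * stream, applies a placeholder round function, and writes results out.
 * Stream names and the round computation are assumptions for illustration. */
#include "co.h"

void round_process(co_stream input, co_stream output)
{
    int32 data;

    co_stream_open(input, O_RDONLY, INT_TYPE(32));
    co_stream_open(output, O_WRONLY, INT_TYPE(32));

    while (co_stream_read(input, &data, sizeof(data)) == co_err_none) {
#pragma CO PIPELINE           /* pipeline the inner loop body              */
#pragma CO SET stageDelay 32  /* constrain the delay of each pipeline stage */
        data = (data << 1) ^ data;   /* placeholder for one cipher round   */
        co_stream_write(output, &data, sizeof(data));
    }

    co_stream_close(input);
    co_stream_close(output);
}
```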
An automatic design flow for C-based hardware accelerators based on the previously described tool has been proposed and applied [
Data-parallel and pipelined applications require architectures with close communication between processing elements and local memory resources for computation and data access. Mesh-based architectures are suitable for data-parallel and pipelined applications. However, most mesh-based architectures have a single processing element per switch, while a cluster-based architecture allows more concurrency on local data.
The architecture of the small-scale multiprocessor, presented in Figure
Small-scale multiprocessor IP.
Processor tile.
The distributed memory SSM IP is composed of several IPs described in Table
IPs of SSM multiprocessor.
IP component | Description | Source | Version | Qty |
---|---|---|---|---|
Processor | MicroBlaze soft core IP | Xilinx | 5.00 b | 12 |
Memory | 96 KB soft core IP | Xilinx Coregen | v.2.4 | 8 |
Network on chip switch | Soft core IP (VHDL) | Arteris Danube library | 1.10 | 4 |
Interchip | Soft core IP (VHDL) | Arteris Danube library | 1.10 | 1 |
The design is OCP-IP [
The design of the network on chip is based on Arteris Danube Library. The Arteris Danube library [
The Arteris switch uses worm-hole routing for packet transfer. Packets are composed of header, necker (request only), and data cells. Each field in a cell carries specific information. Most of the fields in header cells are parameterizable, and some are optional. Typical request and response packet structures are shown in Figure
Arteris typical request and response packet.
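For illustration only, a request packet along the lines described above could be modeled as in the sketch below; the field names and widths are our assumptions, not the actual Danube cell layout.

```c
/* Hypothetical model of a worm-hole request packet: one header cell, one
 * necker cell (requests only), then a run of data cells. Field names and
 * widths are illustrative assumptions, not the Arteris Danube layout. */
#include <stdint.h>

typedef struct {
    uint16_t route;    /* destination routing information           */
    uint8_t  opcode;   /* request type, e.g. read or write          */
    uint8_t  length;   /* number of data cells that follow          */
} header_cell_t;

typedef struct {
    uint32_t address;  /* target address, carried in the necker     */
} necker_cell_t;

typedef struct {
    header_cell_t header;
    necker_cell_t necker;      /* present in request packets only   */
    uint32_t      data[16];    /* payload cells, worm-hole routed   */
} request_packet_t;
```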
48-processor multiprocessor architecture.
Performance monitoring network on each FPGA.
The previous SSM IP can be easily extended and composed up to a 48-processor multiprocessor organized as a
Due to its large size and prohibitive simulation time, emulation on an FPGA platform is targeted for the performance evaluation of this multiprocessor architecture as well as for accurate area data. The emulation platform is the Eve Zebu-UF4 Platform.
Network-on-chip traffic is monitored through a hardware performance monitoring unit. The overall concept is to probe the Network Interface Units (NIUs) on both the initiator and target sides.
A statistics collector is added for each group of 3 processors and 4 on-chip SRAMs, and its data are collected through a dedicated performance monitoring network on chip. The monitoring unit and the performance monitoring network on chip use dedicated hardware resources and thus do not interfere with or affect the measurement. There is a significant tradeoff between emulation frequency and hardware monitoring area.
The TDES algorithm was introduced as a security enhancement to the aging DES; a complete description of the TDES and DES algorithms can be found in [
The TDES is a symmetric block cipher that has a block size of 64 bits and can accept several key sizes (112 or 168 bits), which are eventually extended into a 168-bit key; the algorithm is achieved by pipelining three DES algorithms while providing 3 different 56-bit keys (also called the key bundle), one key per DES stage. The DES is a block cipher with a 64-bit block size and a 56-bit key.
The TDES starts by dividing the data block into two 32-bit blocks, which are passed into a Forward Permutation (FP) and then criss-crossed in what is known as the Feistel scheme (or Feistel network) while being passed into a cipher Feistel function
Feistel function
TDES encryption and decryption schemes (Feistel network).
Feistel decryption network
Feistel encryption network
The TDES algorithm clearly processes data in a pipelined fashion in both the encryption and the decryption modes.
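The three-stage structure can be summarized as the usual encrypt-decrypt-encrypt (EDE) composition over the key bundle. In the sketch below, des_encrypt_block and des_decrypt_block stand for single-DES primitives assumed to be provided elsewhere (for instance by a reference implementation); they are assumptions of this sketch, not the paper's code.

```c
#include <stdint.h>

/* Single-DES primitives assumed to be provided elsewhere. */
uint64_t des_encrypt_block(uint64_t block, uint64_t key56);
uint64_t des_decrypt_block(uint64_t block, uint64_t key56);

/* TDES encryption: the standard encrypt-decrypt-encrypt composition over
 * the key bundle (k1, k2, k3). Each stage can run on its own PE, which is
 * what yields the pipeline described above. */
uint64_t tdes_encrypt_block(uint64_t p, uint64_t k1, uint64_t k2, uint64_t k3)
{
    return des_encrypt_block(des_decrypt_block(des_encrypt_block(p, k1), k2), k3);
}

/* TDES decryption reverses the stages with the keys in opposite order. */
uint64_t tdes_decrypt_block(uint64_t c, uint64_t k1, uint64_t k2, uint64_t k3)
{
    return des_decrypt_block(des_encrypt_block(des_decrypt_block(c, k3), k2), k1);
}
```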
Block cipher algorithms have different operation modes; the simplest is called Electronic Code Book (ECB): in this mode the block cipher is used directly, as illustrated in Figure
ECB operation mode for the TDES block cipher.
The problem with ECB is that encrypting the same block with the same key gives an identical output. As a countermeasure against revealing any information, blocks are chained together using different modes like Cipher Feed Back (CFB), Output Feed Back (OFB), and Cipher Block Chaining (CBC), illustrated in Figure
CBC operation mode for the TDES block cipher.
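A minimal sketch of the two modes, reusing the hypothetical tdes_encrypt_block from above: in ECB each block is independent, which is what allows the data-parallel mapping used later, while in CBC each ciphertext is XORed into the next plaintext, making encryption sequential across blocks.

```c
#include <stddef.h>
#include <stdint.h>

uint64_t tdes_encrypt_block(uint64_t p, uint64_t k1, uint64_t k2, uint64_t k3);

/* ECB: blocks are independent, so iterations may run in parallel. */
void tdes_ecb_encrypt(const uint64_t *pt, uint64_t *ct, size_t n,
                      uint64_t k1, uint64_t k2, uint64_t k3)
{
    for (size_t i = 0; i < n; i++)
        ct[i] = tdes_encrypt_block(pt[i], k1, k2, k3);
}

/* CBC: each block depends on the previous ciphertext, so encryption
 * is inherently sequential across blocks. */
void tdes_cbc_encrypt(const uint64_t *pt, uint64_t *ct, size_t n,
                      uint64_t iv, uint64_t k1, uint64_t k2, uint64_t k3)
{
    uint64_t prev = iv;
    for (size_t i = 0; i < n; i++) {
        ct[i] = tdes_encrypt_block(pt[i] ^ prev, k1, k2, k3);
        prev = ct[i];
    }
}
```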
The TDES represents a data parallel and pipelined signal processing application. So far TDES has been implemented as a hardware IP core [
Based on the C implementation from the National Institute of Standards and Technology (NIST) [
Two different implementation methods are compared on the 48-PE multiprocessor: multiple-data multipipeline software parallel implementation and coprocessor data parallel implementation.
To fully use the 48-PE multiprocessor, two kinds of parallelism, data and task parallelism, are studied and combined to achieve the best performance. Data parallelism means that different data are distributed across different processing nodes; these nodes process their received data in parallel. In task parallelism, the 48 calls to an
The two parallelisms are combined to work as a Fork-Join model, shown in Figure
Fork-Join Model of data and task parallelism.
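As a minimal sketch of the fork side of this model: the source divides the packet stream evenly across pipelined groups (data parallelism), and each group then pushes packets through its pipeline stages (task parallelism). The group count, packet type, and send primitive below are assumptions for illustration, not the paper's code.

```c
/* Illustrative fork-join dispatch: split the packet stream evenly across
 * pipelined groups; each group processes its share through its pipeline. */
#include <stddef.h>
#include <stdint.h>

#define NUM_GROUPS 4 /* data-parallel branches of the fork (assumed) */

void send_to_group(int group, const uint64_t *packets, size_t count);

void fork_dispatch(const uint64_t *packets, size_t total)
{
    size_t per_group = total / NUM_GROUPS; /* assume an even split */
    for (int g = 0; g < NUM_GROUPS; g++)
        send_to_group(g, packets + (size_t)g * per_group, per_group);
    /* The join side mirrors this: a sink gathers per_group results
     * from each group before declaring the run complete. */
}
```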
As described in the architecture section, the target architecture is a 48-PE multiprocessor organized as a
One mapping example result is given in Figure
Fork-Join Model mapped to 48-PE multiprocessor.
24 MicroBlaze PEs are chosen for implementation.
All data are divided into 4 blocks and the 24 MicroBlaze PEs are divided into 4 groups correspondingly.
To encrypt each data block, each pipelined group has
In each pipelined group, each MicroBlaze PE will calculate
In the first step, only one pipelined group is used to map the TDES application. Different sizes of data are sent by the source: from 10 packets up to 400 000 packets, as in Figure
Results of one pipelined group with different size of data.
Another important observation is that a small number of packets treated by a pipelined group with many PEs takes about as much time as the same packets treated by a group with few PEs; as an extreme example, 10 packets treated by a 24-PE pipeline use almost the same time as 10 packets treated by a 1-PE pipeline. So task parallelism is suitable for large-scale data applications.
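A simple latency model (ours, not taken from the paper) makes this behavior plausible. Assume a packet needs total compute time $t$, a pipeline of $S$ PEs splits it into stages of $t/S$, and each stage hop adds a communication overhead $c$:

$$T(S, N) \approx (S + N - 1)\left(\frac{t}{S} + c\right).$$

For small $N$, the pipeline fill term $(S-1)(t/S + c)$ and the accumulated per-hop overhead dominate, so a 24-PE pipeline gains little over a single PE; for large $N$, the throughput term $N\,t/S$ dominates and deeper pipelines pay off.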
In the second step, task parallelism and data parallelism are combined to find a good tradeoff between the two kinds of parallelism. In this example, at most 24 PEs are used; the different combinations of data and task parallelism are listed in Table
Combination of data and task parallelism (24 cores case).
Number of pipelined groups | Number of PEs in 1 pipelined group | Number of | Packets for 1 pipelined group |
---|---|---|---|
24 | 1 | 48 | 4000 |
12 | 2 | 24 | 8000 |
8 | 3 | 16 | 12 000 |
6 | 4 | 12 | 16 000 |
4 | 6 | 8 | 24 000 |
3 | 8 | 6 | 32 000 |
2 | 12 | 4 | 48 000 |
1 | 24 | 2 | 96 000 |
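The first, second, and last columns of the table follow from simple arithmetic: 24 PEs and a fixed total of 96 000 packets are divided evenly among the pipelined groups. The few lines of C below reproduce those columns as a check (the third, truncated column is left out, since its meaning is not recoverable here).

```c
/* Reproduces the data/task parallelism combinations above: 24 PEs total,
 * 96 000 packets total, split evenly across the pipelined groups. */
#include <stdio.h>

int main(void)
{
    const int total_pes = 24;
    const int total_packets = 96000;
    const int groups[] = { 24, 12, 8, 6, 4, 3, 2, 1 };

    for (int i = 0; i < 8; i++) {
        int g = groups[i];
        printf("%2d groups x %2d PEs -> %5d packets per group\n",
               g, total_pes / g, total_packets / g);
    }
    return 0;
}
```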
Figure
Single pipelined group versus multiple pipelined groups.
Tradeoff between task and data parallelism (24 core limited case).
In Figure
The hardware performance monitoring network records the latency of each packet sent by the 24 MicroBlaze processors. The latency of packets is shown in Figure
NoC monitoring.
Data parallelism with coprocessors is another method to parallelize the TDES application on the multiprocessor. A coprocessor is used to execute complex math operations, which can greatly improve system performance. In this case, each MicroBlaze processor of the 48-PE multiprocessor has a coprocessor that performs the whole TDES function; the MicroBlaze processors are only in charge of communication: they get the original data from the source, send them to the coprocessor, wait until the coprocessor sends back the results, and finally send the results back to the destination.
The ImpulseC tools are used to generate the TDES coprocessor directly from the C code to VHDL source, which greatly improves design productivity.
The TDES coprocessor is designed for the Xilinx MicroBlaze processor using an FSL interface. The Triple DES IP was synthesized for a Virtex-4 platform by Xilinx ISE. The maximum frequency is reported by the synthesis. The generated 5-stage pipeline implementation uses 2877 slices and 12 RAM blocks with a maximum frequency of 169.75 MHz. The occupation for the same IP using LUTs instead of RAM blocks is 4183 slices. The maximum frequency for this version is 162.20 MHz. 12 instances of the generated IP were successfully implemented on an Alpha Data XRC-4 platform (Virtex-4 FX-140) using 12 MicroBlaze processors within a Network-on-Chip.
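On the MicroBlaze side, driving such an FSL-attached coprocessor reduces to a send/receive loop. The sketch below uses the Xilinx putfsl/getfsl macros (from fsl.h in EDK-era toolchains); the FSL channel number and the two-word framing of a 64-bit block are our assumptions, not the paper's protocol.

```c
/* Hypothetical MicroBlaze-side driver for the FSL-attached TDES
 * coprocessor: stream a 64-bit block out as two 32-bit words, then
 * read the encrypted block back. FSL channel 0 and the word framing
 * are assumptions of this sketch. */
#include <stdint.h>
#include "fsl.h"   /* Xilinx FSL access macros: putfsl/getfsl */

uint64_t tdes_via_coprocessor(uint64_t block)
{
    uint32_t hi = (uint32_t)(block >> 32);
    uint32_t lo = (uint32_t)block;

    putfsl(hi, 0);          /* push the block to the coprocessor  */
    putfsl(lo, 0);
    getfsl(hi, 0);          /* block until the result comes back  */
    getfsl(lo, 0);

    return ((uint64_t)hi << 32) | lo;
}
```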
The chosen architecture for the Triple-DES hardware accelerator is the 5-stage pipeline shown in Figure
This IP was synthesized for a Xilinx Virtex-4 LX-200 FPGA. RAM blocks can be saved by using LUTs, if necessary. Table
HLS-based TDES IP versus optimized IPs.
Helicon | Xilinx | HLS (RAM) | HLS (LUT) | |
---|---|---|---|---|
Slices | 467 | 16181 | 2877 | 4183 |
Max frequency (MHz) | 196 | 207 | 170 | 162 |
Throughput at 100 MHz | 255.6 Mbps | 6.43 Gbps | 305 Mbps | 305 Mbps |
5 Stage pipeline TDES.
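These throughput figures are consistent with the pipeline structures, under the assumption of one 64-bit block per accepted input: a fully pipelined core accepting a block every cycle at 100 MHz yields 64 bits × 100 MHz = 6.4 Gbps, close to the Xilinx figure, while 305 Mbps corresponds to roughly one 64-bit block every 21 clock cycles.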
In this case, all 48 MicroBlaze processors run the TDES application in parallel. When they finish all their packets, they send a finish flag to the synchronization memory. MicroBlaze 0 verifies that all the PEs have finished their jobs and records the execution time. Results of the TDES application using coprocessors on the 48-PE multiprocessor are shown in Figure
Parallel software versus coprocessors on a 48-PE multiprocessor.
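A hedged sketch of the synchronization described above, assuming a memory-mapped flag array whose base address and layout we invent purely for illustration:

```c
/* Illustrative end-of-run synchronization: each PE sets its flag in the
 * shared synchronization memory; PE 0 polls until all 48 flags are set.
 * The base address and flag layout are assumptions of this sketch. */
#include <stdint.h>

#define NUM_PES   48
#define SYNC_BASE ((volatile uint32_t *)0xC0000000u) /* hypothetical address */

void report_done(int pe_id)
{
    SYNC_BASE[pe_id] = 1;             /* finish flag for this PE */
}

void wait_all_done(void)              /* runs on MicroBlaze 0 */
{
    for (int pe = 0; pe < NUM_PES; pe++)
        while (SYNC_BASE[pe] == 0)
            ;                         /* spin until PE reports in */
    /* record the execution time here */
}
```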
High-performance embedded applications based on image processing and signal processing will increasingly be moved to embedded multiprocessors. An automatic design flow has been proposed for data-parallel and pipelined signal processing applications on an embedded multiprocessor with NoC, applied to the cryptographic application TDES. The proposed flow first explores parallel software implementation through execution on a multi-FPGA emulator, with task placement exploration and task granularity analysis. Hardware-based network-on-chip monitoring drives the task placement process to reduce communication bottlenecks. In the second phase, hardware accelerators generated by high-level synthesis are added to explore the area/performance tradeoff while still privileging a multiprocessor basis for the implementation. Future work will add reconfigurability management of dedicated areas for hardware accelerators as well as automatic parallelization.