VLIW DSPs can greatly enhance Instruction-Level Parallelism (ILP), providing the capacity to meet the performance and energy efficiency requirements of sensor-based systems. However, exploiting VLIW DSPs in the sensor-based domain imposes a heavy challenge on software toolkit design. In this paper, we present our methods and experiences in developing a system toolkit flow for a VLIW DSP designed specifically for sensor-based systems. Our system toolkit includes a compiler, assembler, linker, debugger, and simulator. We present experimental results for the compiler framework, which incorporates several state-of-the-art optimization techniques for this VLIW DSP. The results indicate that our framework can greatly improve performance and reduce energy consumption compared with code generated without it.
Very Long Instruction Word (VLIW) architecture [
VLIW architecture is now widely used in commercial processor designs, such as NXP’s TriMedia media processors, Analog Devices’ SHARC DSP, Texas Instruments’ C6000 DSP family, STMicroelectronics’ ST200 family based on the Lx architecture, Tensilica’s Xtensa LX2 processor, and Intel’s Itanium IA-64 EPIC, in both the embedded and nonembedded domains.
While VLIW can be exploited to greatly improve ILP, it also poses a large challenge for the development of a software toolkit for the architecture. Since the compiler carries the main responsibility for code generation on a VLIW architecture, VLIW’s advantages come largely from an intelligent compiler that can schedule as many instructions as possible in parallel to maximize the total ILP [
In this paper, we present our methods and experiences in developing a system toolkit flow for a scalable VLIW DSP architecture. Our system toolkit consists of a compiler, assembler, linker, debugger, and simulator. The compiler is retargeted from the Open64 compiler, and various issues are addressed to support optimizations including software pipelining and automatic SIMD code generation. The assembler, linker, and debugger are developed based on Binutils. Finally, a cycle-approximate simulator has been developed based on Gem5. Benchmarks are evaluated on this framework, and the results indicate that our framework can greatly enhance performance.
The remainder of this paper is organized as follows: Section
Magnolia is designed for sensor-based embedded systems [
An FU library has already been developed, which includes four different types of FUs: Unit A, Unit M, Unit D, and Unit F. Unit A, Unit M, and Unit D are fixed-point units, while Unit F is a floating-point unit. Unit A is dedicated to executing arithmetic, logical, and shift operations. Unit M executes multiplication operations, as well as some arithmetic and logical operations. Unit D is in charge of memory access and program control, as well as some arithmetic and logical operations. Unit F carries out all floating-point operations, including floating-point vector operations [
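The division of labor among the four unit types can be summarized as a capability table. The sketch below is illustrative only: the operation classes and the exact set of operations each unit accepts are our assumptions for exposition, not the actual Magnolia encoding.

```cpp
#include <cassert>

// Hypothetical operation classes and FU types, modeled on the textual
// description above (assumptions, not the real Magnolia ISA tables).
enum class Op { Arith, Logic, Shift, Mul, MemAccess, Control, Float };
enum class Unit { A, M, D, F };

// Returns true if the given unit type can execute the given operation class.
bool canExecute(Unit u, Op op) {
    switch (u) {
    case Unit::A:  // arithmetic, logical, and shift operations
        return op == Op::Arith || op == Op::Logic || op == Op::Shift;
    case Unit::M:  // multiplication, plus some arithmetic/logical operations
        return op == Op::Mul || op == Op::Arith || op == Op::Logic;
    case Unit::D:  // memory access and program control, plus arith/logic
        return op == Op::MemAccess || op == Op::Control ||
               op == Op::Arith || op == Op::Logic;
    case Unit::F:  // all floating-point operations, including FP vectors
        return op == Op::Float;
    }
    return false;
}
```

A table of this shape is what an instruction scheduler consults when deciding which unit a candidate instruction may be bound to.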
The Magnolia architecture supports both fixed-point and floating-point instructions. The instruction width is 32 bits. To meet the ever-increasing computational demands of embedded applications, the architecture supports SIMD instructions to greatly enhance Data-Level Parallelism (DLP). Special-purpose instructions are also provided to accelerate certain sensor-based embedded applications.
The Magnolia architecture provides both a fixed-point register file and a floating-point register file. Registers in the fixed-point register file are 128 bits wide, while the floating-point registers are 256 bits wide. The number of registers in each register file is scalable, and both register files are visible to the programmer.
Traditionally, during the register allocation phase of compilation, if there are not enough registers, the compiler must create additional store and load operations and insert them into the original instruction queue, to spill the data of a symbolic register to memory and restore it to a register later. However, accessing memory is much slower than accessing registers and slows down execution. Because the compiler must aggressively exploit ILP on a VLIW architecture, register pressure increases, and spilling therefore happens often.
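The spill/restore insertion described above can be sketched as a small rewriting step on an instruction list. This is a minimal illustration under assumed conditions (textual instructions, a single def and a single later use, a stack-slot addressing syntax we invented); a real allocator works on structured IR and handles many more cases.

```cpp
#include <string>
#include <vector>

// Insert a spill store right after the defining instruction (defIdx) and a
// restore load right before the next use (useIdx) of symbolic register `reg`.
// `slot` is the assumed stack-frame offset of the spill location.
std::vector<std::string> insertSpill(std::vector<std::string> code,
                                     size_t defIdx, size_t useIdx,
                                     const std::string& reg, int slot) {
    std::string mem = "[sp+" + std::to_string(slot) + "]";
    // Insert the restore first so defIdx is still valid afterwards.
    code.insert(code.begin() + useIdx, "ld " + reg + ", " + mem);
    code.insert(code.begin() + defIdx + 1, "st " + reg + ", " + mem);
    return code;
}
```

Every spill pair like this costs two memory accesses, which is exactly the overhead the spill register file mechanism discussed next is meant to reduce.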
Thus a mechanism called spill register file [
Figure
An example of Magnolia architecture.
The software flow is illustrated in Figure
The software flow.
The compiler for Magnolia VLIW DSP architecture is designed based on Open64 [
To retarget Open64 to support compilation for the Magnolia architecture, three major tasks must be completed: (1) implementing the machine description files for the Magnolia architecture; (2) constructing a code generator for the Magnolia architecture; (3) implementing optimization techniques for the Magnolia architecture.
The retargetability of the Open64 compiler comes from its machine description files, which describe three main categories of information about the target architecture. Information about the Instruction Set Architecture (ISA) describes the details of the instructions in the instruction set, such as function, number of operands, data types, assembly code format, and addressing modes. Information about the Application Binary Interface (ABI) describes the interface between an application program and the libraries or other parts of the application program, such as data types, data sizes, data alignment, and the calling convention. Information about the processor model describes the resources of the target architecture, such as the function and the number of each kind of functional unit.
Thus, to support the Magnolia architecture, the corresponding machine description files must be generated.
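To make the ISA category of description concrete, the sketch below shows the kind of per-instruction record such files might carry. The field names, mnemonics, and unit letters are illustrative assumptions for this paper's Magnolia description, not Open64's actual targ_info interface.

```cpp
#include <cassert>
#include <cstring>

// One hypothetical ISA description entry: name, operand count, operand
// data type, and the functional-unit class that executes the instruction.
struct IsaEntry {
    const char* mnemonic;  // assembly name (format simplified to the name)
    int numOperands;       // number of operands
    const char* dataType;  // operand data type
    const char* unit;      // functional-unit class
};

static const IsaEntry kMagnoliaIsa[] = {
    {"add",  3, "i32", "A"},  // arithmetic      -> Unit A
    {"mpy",  3, "i32", "M"},  // multiplication  -> Unit M
    {"ldw",  2, "i32", "D"},  // memory access   -> Unit D
    {"fadd", 3, "f32", "F"},  // floating point  -> Unit F
};

// Look up an entry by mnemonic; returns nullptr if the ISA does not
// describe such an instruction.
const IsaEntry* findEntry(const char* mnemonic) {
    for (const IsaEntry& e : kMagnoliaIsa)
        if (std::strcmp(e.mnemonic, mnemonic) == 0) return &e;
    return nullptr;
}
```

The code generator phases discussed below consume exactly this kind of table: Code Expansion uses the mnemonic and operand information, while Resource Binding uses the functional-unit class.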
The Open64 compiler is mainly composed of three parts: the Front-End, the Middle-End, and the Back-End.
The Front-End translates programs written in a high-level language into Open64’s intermediate representation, WHIRL (Winning Hierarchical Intermediate Representation Language). The Front-End of Open64 supports C/C++/Fortran. The Middle-End is composed of several phases, each of which performs a target-machine-independent optimization on the WHIRL. The Back-End is in charge of code generation, building assembly code from the WHIRL.
As the Front-End is entirely target-machine-independent, it needs no modification to support the Magnolia architecture. The retargeting work on the Middle-End is discussed below.
The Back-End of Open64 is retargeted to support code generation for the Magnolia architecture. It can be roughly divided into three phases: Code Expansion, Resource Binding, and Code Emission. The details of the implementation of these three phases can be found in [
The Code Expansion phase analyzes the WHIRL structure and translates its operations into instructions of the Magnolia architecture. During the implementation of this phase, two major tasks must be completed: (1) establishing the correspondence between WHIRL operations and the instructions in the Magnolia machine description files; (2) building the correct Magnolia instruction format according to the machine description files.
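Task (1) amounts to a lookup from WHIRL operators to target mnemonics. The sketch below uses real Open64 WHIRL operator names (OPR_ADD, OPR_MPY, OPR_LDID, OPR_STID), but the Magnolia mnemonics and the mapping itself are illustrative assumptions.

```cpp
#include <cassert>
#include <map>
#include <string>

// Minimal sketch of the Code Expansion correspondence table: each WHIRL
// operator is mapped to an assumed Magnolia mnemonic.
std::string expandWhirlOp(const std::string& whirlOp) {
    static const std::map<std::string, std::string> kOpMap = {
        {"OPR_ADD",  "add"},   // integer add
        {"OPR_MPY",  "mpy"},   // integer multiply
        {"OPR_LDID", "ldw"},   // load of a named variable
        {"OPR_STID", "stw"},   // store to a named variable
    };
    auto it = kOpMap.find(whirlOp);
    return it != kOpMap.end() ? it->second : "<unmapped>";
}
```

In the real compiler, of course, each translation also builds the full instruction with operands in the format dictated by the machine description files, which is task (2).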
The Resource Binding phase binds instructions to resources in the architecture, such as execution cycles, execution FUs, and registers for operands. During the implementation of this phase, the order of the intermediate steps must be carefully arranged to avoid deadlock, as binding an instruction to different resources may cause conflicts. The cooperation mechanism with the machine description files must also be handled in this phase.
The Code Emission phase translates the bound instructions into assembly format. In this phase, the correct assembly format of each instruction must be extracted from the machine description files for the Magnolia architecture.
The Middle-End of Open64 is retargeted for the Magnolia architecture; our compiler’s Middle-End is mainly composed of two parts: a loop optimizer and a global optimizer. The loop optimizer performs transformations on loops to optimize the compiled code. The global optimizer uses Static Single Assignment (SSA) form as the program representation and performs def-use analysis, alias classification and pointer analysis, induction variable recognition and elimination, copy propagation, dead code elimination, partial redundancy elimination, and other typical optimizations. The details of the implementation of these two optimization phases can be found in [
Many existing approaches perform automatic SIMD code generation at a late stage of the compilation process, because more information is available there. The disadvantage, however, is that these techniques cannot effectively exploit the data parallelism in loops, so the resulting code quality can be suboptimal.
Therefore, in this work, a high-level technique is used to generate SIMD code by examining loop code. The SIMD code generation takes place at an early stage of the compilation process, just after the input source file has been transformed into the WHIRL structure. This approach requires only basic knowledge of the target machine’s ISA, so it is easily retargetable. The data packing is done at the same time as the SIMD code is generated, which eases the work of Resource Binding, especially register allocation, in the Back-End.
Our automatic SIMD code generation technique focuses on loops. A preprocessing engine is introduced into the generation process; it is responsible for filtering out loops that do not suit our technique. Several directive rules are used to choose the right candidates for the subsequent processing.
After a candidate loop is selected, the compiler traverses the loop to annotate operations that could be grouped into SIMD code. All candidates are then evaluated according to a set of rules. After the evaluation, the compiler reconstructs the WHIRL structure, replacing the candidate operations in the loop with SIMD operations according to the evaluation result. The data are also aligned and regrouped into packed data for the SIMD operations.
The SIMD operations in the WHIRL structure are finally translated into SIMD instructions of the Magnolia ISA in the Code Expansion phase of our compiler, and the data for the SIMD instructions are prepared in the Resource Binding phase.
The assembler, linker, and debugger for the Magnolia architecture are developed based on the open-source GNU Binary Utilities. Two major issues need to be solved: (1) maintaining correct instruction parallelism: according to the Magnolia assembly format, instructions executed in parallel in one clock cycle must be arranged in a pattern in which their functional units appear in ascending order; otherwise, the assembler cannot identify the instruction parallelism correctly. Thus, when generating assembly code, the Magnolia compiler must check and rearrange the issue order of instructions so that the information about instruction parallelism is delivered to the assembler correctly, and the assembler is designed to recognize this information and issue the correct instruction queue. (2) Avoiding real-time errors: as the Magnolia architecture is designed for embedded systems, where real-time errors must be avoided, the linker is designed to perform static linking.
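The ascending-order convention can be sketched as a small packet-forming pass: the assembler scans the functional-unit number of each instruction and starts a new issue packet whenever the order stops ascending. The integer FU numbering here is an assumption for illustration.

```cpp
#include <cassert>
#include <vector>

// Group a sequence of instructions (represented only by their assumed
// functional-unit numbers) into issue packets: a strictly ascending run
// stays in one packet; a non-ascending FU number starts the next cycle.
std::vector<std::vector<int>> formPackets(const std::vector<int>& fuOrder) {
    std::vector<std::vector<int>> packets;
    for (int fu : fuOrder) {
        if (packets.empty() || fu <= packets.back().back())
            packets.push_back({});        // order broken: new issue packet
        packets.back().push_back(fu);
    }
    return packets;
}
```

The compiler side of the contract is the mirror image: before emission it must sort the instructions of each cycle into ascending FU order so that this grouping reconstructs the intended parallelism.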
The simulator provides a platform to validate the design of the software toolkit, evaluate the processor architecture, and accelerate the hardware development process. Efficient modeling of the processor architecture and fast simulation are critical for the development of both the hardware and the software of a VLIW architecture. The simulator for the Magnolia architecture is designed based on the Gem5 simulator.
Gem5 [
Gem5 provides plenty of simulation object models with implementation details, including memory, CPU models, buses, caches, and so forth. However, the Gem5 simulator supports neither EPIC processor models nor VLIW ISA simulation. To construct our simulator, we created a processor model with VLIW features and added a description of the Magnolia ISA to the original Gem5 system; the original Gem5 loader and simulation core were also modified. The Magnolia ISA is implemented using the Gem5 ISA description system, which processes the Magnolia ISA description to generate a decoder function; the decoder translates each machine instruction into a C++ instruction object. The C++ instruction object is then treated as a basic data type of the simulator and is used by the simulation core and the other simulation objects.
The most important issue in the implementation of our simulator is enabling the simulation of parallel instruction execution. In the original Gem5 design, instructions are processed in sequence; thus, when processing VLIW packets, conflicts can arise among instructions operating on the same register. To avoid these conflicts, our simulator duplicates the register file. The processor reads registers from the original register file, while the duplicate is used for writing. When all the instructions in a dispatch packet have been processed, a register-file updating function is invoked to update all register values, maintaining the coherence of the register data and thus enabling the simulation of parallel instruction execution.
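The duplicated register file behaves like a double buffer, as the minimal sketch below shows (the register count, integer registers, and method names are assumptions, not Gem5 or Magnolia interfaces). Every instruction in a packet reads the committed copy and writes the shadow copy; the update runs once per packet.

```cpp
#include <array>
#include <cassert>

// Double-buffered register file for simulating parallel execution of a
// VLIW packet on an inherently sequential simulation core.
class VliwRegFile {
    std::array<int, 32> cur{};   // committed state: read by instructions
    std::array<int, 32> next{};  // shadow copy: written by instructions
public:
    void beginPacket()       { next = cur; }  // carry over unwritten regs
    int  read(int r) const   { return cur[r]; }
    void write(int r, int v) { next[r] = v; }
    void commitPacket()      { cur = next; }  // the updating function
};
```

The classic test case is a packet that swaps two registers: because both instructions read the pre-packet values, the exchange succeeds even though the simulator processes them one after the other.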
As mentioned before, the Magnolia architecture uses the order of functional units to indicate instruction parallelism. If the functional units of two adjacent instructions are in ascending order, the two instructions are issued concurrently; otherwise, the issue of the latter instruction is delayed to the next cycle. The benefit is that saving this one bit doubles the encoding space of the instruction set. However, the scheme is not compatible with the RISC execution style, so during the design of our simulator we added an instruction-parallelism judgment mechanism to the original Gem5 to support this feature.
Chapman et al. [
Wu et al. [
Chang et al. [
Wittenburg et al. [
Several works have discussed the utilization and improvement of TI’s DSP Starter Kit, including [
Steiger and Grentzinger [
Several simulators for VLIW architectures already exist, such as VLIWDLX [
Gem5 [
Experiments were done by running programs from DSPstone [
Performance results measured on Magnolia simulator.
These programs are first compiled by the Magnolia compiler, then assembled and linked, and finally loaded into the simulator to measure performance. The blue bars show the performance of code generated by the compiler without any optimization. The yellow bars show the performance with basic instruction scheduling optimization. The purple bars show the performance with optimizations such as EBO and WOPT. The green bars show the performance with loop optimizations. Finally, the red bars show the performance obtained with the automatic SIMD code generation technique.
The results indicate that our compiler can achieve speedups of up to around 5x compared with code compiled without any optimizations. In some cases, the SIMD autogeneration technique brings no further performance improvement, because either the benchmark contains no loops or the number of loop iterations does not satisfy our preprocessing rules; in such situations, the technique does not apply. In future work, we will try to improve the applicability of our SIMD autogeneration technique, both for deeper loop nests and for loops with more sophisticated structures.
Figure
Energy consumption results measured on Magnolia simulator.
Clearly, our optimizations significantly reduce the energy consumed in executing these programs, making the architecture more suitable for sensor-based applications.
In this paper, we have presented our methods and experiences in developing a system toolkit for a VLIW DSP architecture. The toolkit includes a compiler, assembler, linker, debugger, and simulator. We presented our methods for developing these tools, along with our experiences in dealing with the issues encountered in the process. Results evaluated using DSPstone benchmarks indicate significant improvements in both performance and energy consumption. The experiences presented in this paper may benefit architecture designers and toolkit developers interested in similar VLIW DSP architectures.
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This work was supported by the National Natural Science Foundation of China (no. 61201182).