Application Specific Instruction-set Processors (ASIPs) expose to the designer a large number of degrees of freedom. Accurate and rapid simulation tools are needed to explore the design space. To this end, FPGA-based emulators have recently been proposed as an alternative to pure software cycle-accurate simulators. However, the advantages of on-hardware emulation are reduced by the overhead of the RTL synthesis process that needs to be run for each configuration to be emulated. The work presented in this paper aims at mitigating this overhead by exploiting a form of software-driven platform runtime reconfiguration. We present a complete emulation toolchain that, given a set of candidate ASIP configurations, identifies and builds an overdimensioned architecture capable of being reconfigured via software at runtime, emulating all the design space points under evaluation. The approach has been validated against two different case studies, a filtering kernel and an M-JPEG encoding kernel. Moreover, the presented emulation toolchain couples FPGA emulation with activity-based physical modeling to extract area and power/energy consumption figures. We show how the adoption of the presented toolchain significantly reduces the design space exploration time, while introducing an overhead lower than 10% for the FPGA resources and lower than 0.5% in terms of operating frequency.
A common feature of modern embedded systems is the need for highly optimized application-specific processing elements. Application-specific instruction-set processors (ASIPs) are often the only solution that satisfies the required functional and physical constraints while providing, at the same time, high flexibility and programmability.
These processors are typically performance- and power-optimized for a specific application domain. Instruction-set extensions often include vector processing and SIMD support, as well as complex domain-specific arithmetic operations (e.g., MAC for digital signal processing). In terms of architecture organization, it is common to find register files with particular configurations (depth, data width, or number of ports), separate local memories for different kinds of application data, customized data channels that implement real-time data flow into and out of the processing units, or synchronization ports shared with other SoC blocks.
As a consequence of such extreme configurability, efficiently exploring the hardware-software customization of such systems requires appropriate emulation techniques that provide fast but accurate performance estimates. Along with the classical characterization of hardware modules and applications through functional metrics (e.g., execution time, cache performance, and resource congestion), there is increasing interest in obtaining early estimates of physical metrics, such as area occupation and power/energy consumption. To meet these requirements, hardware-based emulation techniques have been proposed as a more scalable alternative to cycle-accurate software-based simulation.
FPGA devices, with their high flexibility, have been shown to serve well for hardware-based emulation but, at the same time, the achievable advantages are limited by the overhead introduced by the physical synthesis/implementation flow. This overhead increases the emulation time and thus reduces the number of design space points that can be explored.
In this work, we address the FPGA-based on-hardware emulation for design space exploration of ASIP processors. We have developed a toolchain that uses software-driven runtime reconfiguration of the emulating platform to enable the evaluation of different architectural configurations after a single synthesis/implementation process, thus maximizing the speed-up of the overall evaluation process. The mechanism employed to reconfigure at runtime the emulating platform does not rely on standard partial reconfiguration capabilities that are offered by current FPGA devices and toolchains. Instead, we developed an algorithm to identify the logic to be placed on the FPGA and the hardware modules that support the actual logic reconfiguration.
The main contributions of this paper can be summarized as follows. We conceived an algorithm to identify an overdimensioned platform that can be used to emulate different candidate architectures employing a software-based reconfiguration mechanism. We developed the hardware modules that allow the runtime reconfiguration mechanism to take place. We present the application of the algorithm and toolchain on an industrial ASIP design and programming flow [ We discuss the advantages and evaluate the resource overhead of the proposed approach with two use cases, a filtering kernel and an M-JPEG encoder.
The ultimate result of our work is a toolset capable of automatically building a reconfigurable prototyper that can serve as an emulation platform to evaluate a set of candidate design points, selected by a design space exploration (DSE) algorithm. The definition of such an algorithm is beyond the scope of this paper.
The remainder of the paper is organized as follows. First, an analysis of the state of the art in architectural emulation is presented in Section
Today, the vast majority of architectural simulation is performed, at maximum accuracy (i.e., cycle level) in software. Among the most famous solutions, many are still sequential, like SimpleScalar [
Despite these innovations in the field, there is general consensus that classic software approaches are no longer able to provide cycle accuracy for complex multi-core hardware-software designs in reasonable simulation times. A promising alternative approach aims at preserving cycle accuracy by resorting to the extremely high speeds that can be achieved by running the target architecture on some kind of configurable hardware. FPGA devices naturally provide this configurable hardware substrate, potentially enabling hundreds of MHz of pure emulation speed already in the earliest stages of the design flow. In addition, their integration capacity, scalability, relatively low cost, and decreasing power consumption figures suggest that FPGAs are going to be the reference platform for hardware-based emulation in the near future [
The most important contribution to the field of large-hardware FPGA platforms for simulation of complex systems is brought by the RAMP project [
Other examples of hardware-based full-system emulators are [
In [
The main use scenario of our toolchain pairs it with a DSE engine, which creates the candidate architectures according to the design requirements, requests the emulation of a set of them, and iteratively proceeds with the exploration. Some examples that can benefit from using our approach during the evaluation phase are [
As already introduced in Section
The algorithm that we conceived to build the WCC configuration, together with the hardware modules that implement the mechanisms for its runtime configuration, was designed to preserve full cycle accuracy. In detail, the number of clock cycles that an arbitrary set of instructions takes to execute on the WCC configuration, when configured to emulate a candidate configuration on the FPGA, has to be exactly the same as it would take on that candidate configuration when synthesized and emulated alone on the FPGA. Similarly, all the functional metrics (congestion, latency, and CPI) will be exactly the same, when measured in terms of clock cycles. The only differences between the WCC configuration and each single candidate configuration, when synthesized and placed on the FPGA, will be the number of utilized resources and the operating frequency of the FPGA emulator. Overall, this mechanism preserves the correct functional behavior of the ASIP processor and the binary compatibility of the WCC configuration with the executable code that would run on every candidate topology. The algorithm that we use to synthesize the WCC configuration is described in Section
The design flow that implements the proposed prototyping technique refers, as baseline, to the industrial ASIP customization flow. This flow has been extended to provide the needed support for runtime configuration. On the hardware side, we added some further HDL generation capabilities, that have been integrated with the baseline flow and will be explained in Section
This section will present the ASIP architecture template which we take as reference for our exploration purposes. We also define which variables identify the design space that we intend to explore. The considered processor template belongs to the class of VLIW ASIPs and is composed of instances of industrial IPs, based on a flexible
Figure
- register files: they hold intermediary data between processing operations and are configurable in terms of width, depth, and number of read and write ports. Typically, ASIPs generated from this architecture template include many register files;
- issue slots: they are the basic units of execution within the processor. Every issue slot includes a set of function units (FUs), which implement the operations actually executable and compose the execution stage. From an operating viewpoint, every issue slot receives an operation code from the instruction decoding phase and, accordingly, accesses the register files and activates the proper function units;
- logical memories: these are containers for hardware implementing memory functionality;
- interconnect modules: configurable connections automatically instantiated and configured. They implement the required connectivity within the different processor building blocks. In detail, connections can appear on the forward propagation data path along the processor slice, but also on the backward path to implement the writeback into the different register files. The interconnect module from the register files to issue slots is called argument select network (ASN), while the interconnect from the issue slots to the register files is called result selection network (RSN).
Reference VLIW ASIP template.
Baseline prototyping flow (evaluation time for
Prototyping flow extended with runtime reconfiguration capabilities (evaluation time for
In this work, we start from Silicon Hive’s Pearl processor and then enable a design space exploration that covers most of the degrees of freedom exposed by the PAT, but not all of them. We consider every candidate processor configuration as composed of one
Moreover, we restrict every processing slice to one issue slot and one register file. This means that, in order to have more than one issue slot inside a processor configuration, more than one processing slice is required. Finally, FUs inside every issue slot are taken from a pool of predefined FUs. Although the industrial methodology supports full extensibility of the instruction set through the definition of custom instructions, this possibility is not taken into account in this work.
Having defined these limitations on the range of possible template configurations, the design point including only the control slice is the simplest stand-alone processor configuration that can be evaluated. Other design points are processor configurations that instantiate an arbitrary number of additional processing slices and feature different parameterizations of the building blocks included in them. As a result of what was introduced so far, the design space under consideration is thus determined by the following degrees of freedom.
Although the way of specifying the parallel datapaths to be instantiated in the processors is limited, the configurability of the template remains very wide. With further engineering effort, which is beyond the scope of our research activity, the techniques and tools that we present can be extended to overcome the mentioned limitations while preserving the correctness of the theoretical approach.
Figure
On the software side, in order to perform the evaluation of the architecture with the desired application, the target source code is compiled by means of an automatically retargetable compiler. The compiler is able to optimize the instruction flow with respect to the instruction set and to the architecture of the processor under prototyping. In order to do so, it extracts information on the underlying ASIP architecture from the same configuration specification provided in input by the DSE engine. The compiler then retargets itself according to the considered ASIP configuration specification. After compilation, the program can be executed on the ASIP actually implemented on FPGA.
The hardware structures, within the scope of this work, have been instrumented with dedicated activity probes and counters, capable of collecting performance (in terms of numbers of cycles) and the activity figures (i.e., number of accesses) of the blocks in the considered ASIP. In addition to estimation of functional metrics, these activity traces are used to estimate physical metrics, such as dynamic power consumption. This estimation is performed through a layer of analytical modeling that will be described in Section
The left side of Figure
After the execution, the collected emulation data is translated to estimated physical metrics by means of dedicated area, frequency, and energy models and fed back to the DSE engine. The translation is performed by means of a set of analytical expressions that allow the evaluation of the energy, area, and critical path contributions of the functional blocks inside the library. Such analytical expressions depend both on the static parameterization of the blocks and on their switching activity. The power modeling layer is able to separately account for leakage and dynamic power consumption. Such models refer to a target technology library.
The modeling phase has to acquire information on the architecture configuration, therefore it takes as input a description of the current configuration with a dedicated file generated during the RTL generation phase. Examples of the relevant architectural information are the number, kind, and depth of the different issue slots, number of ports, width and depth of the register files, size of the memories, and so on. The on-hardware emulation then provides the activity traces necessary to perform the analytic estimation of interest.
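The activity-based estimation described above can be illustrated with a minimal sketch. It is not the model used in the paper: the `BlockModel` type, the function name, and all coefficient values are hypothetical, and it simply assumes a constant leakage power per block plus a fixed dynamic energy per counted access.

```python
from dataclasses import dataclass


@dataclass
class BlockModel:
    """Hypothetical per-block coefficients (illustrative, not from the paper)."""
    leakage_mw: float            # static power, assumed constant for the block
    energy_per_access_pj: float  # dynamic energy charged per counted access


def estimate_energy_uj(models, access_counts, cycles, freq_mhz):
    """Return (leakage, dynamic) energy in microjoules.

    models        -- block name -> BlockModel
    access_counts -- block name -> number of accesses from the activity probes
    cycles        -- execution cycles measured on the FPGA prototype
    freq_mhz      -- assumed clock frequency of the prospective implementation
    """
    exec_time_us = cycles / freq_mhz  # cycles divided by MHz gives microseconds
    total_leakage_mw = sum(m.leakage_mw for m in models.values())
    leakage_uj = total_leakage_mw * exec_time_us * 1e-3  # mW * us = nJ
    dynamic_pj = sum(models[name].energy_per_access_pj * count
                     for name, count in access_counts.items())
    return leakage_uj, dynamic_pj * 1e-6  # pJ -> uJ


if __name__ == "__main__":
    models = {"register_file": BlockModel(2.0, 5.0),
              "function_unit": BlockModel(1.0, 10.0)}
    counts = {"register_file": 1000, "function_unit": 500}
    print(estimate_energy_uj(models, counts, cycles=100_000, freq_mhz=100.0))
```

Keeping the leakage and dynamic terms separate mirrors the statement that the power modeling layer accounts for the two contributions independently.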
For space reasons, we do not list here all the model formulae for estimation of area occupation, leakage, and dynamic power consumption of every functional block. However, Tables
Area models dependency recap. Subscripts for operations separate operation count for the single issue slot (IS) from the overall processor count
Block | Area |
---|---|
FU | |
IS mux logic | |
Register File | |
Decoder | |
Result select network | |
Sequencer | constant |
FIFOs | |
Program memory | |
Data memory |
Power models dependency recap. OPC stands for operations per cycle. Program memory is assumed to have 100% access rate.
Block | Leakage power | Dynamic power |
---|---|---|
FU | ||
IS mux logic | ||
Register file | ||
Decoder | ||
Result select network | ||
Sequencer | constant | constant |
FIFOs | ||
Program memory | ||
Data memory |
On reading the tables, we point out that the
The models, that are the formulae used to calculate power and area and whose main dependencies are reported in Tables
To allow fast prototyping of multiple candidate interconnect configurations inside the system, the baseline flow has been extended with a utility that analyzes the whole set of configurations under prototyping, synthesizes the WCC, and creates the configurable hardware and the software functions needed to map each candidate configuration on top of the overdimensioned hardware.
Figure
The synthesis algorithm is described in Section
Algorithm
The WCC is defined iteratively while analyzing all the candidate configurations. At every iteration, it is updated according to the design point currently under analysis. At iteration The number of issue slots inside For every issue slot For every issue slot
To clarify the WCC construction algorithm, let us consider a possible design space with the following three different candidate configurations:
- a first candidate configuration with 3 issue slots, respectively instantiating the functional units
- a second candidate configuration with 2 issue slots, respectively instantiating the functional units
- a third candidate configuration with 4 issue slots, respectively instantiating the functional units
Decomposing the outer loop of the algorithm, the WCC is iteratively constructed as follows.
- The WCC instantiates 3 issue slots, with functional units
- The WCC is then modified to instantiate 3 issue slots, with functional units
- Finally, the WCC is modified to instantiate 4 issue slots, with functional units
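The iterative construction can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: it assumes that the WCC takes the maximum number of issue slots, the union of the function units per issue slot, and the largest register file depth per slot. The function unit names and the `rf_depth` field are hypothetical.

```python
def build_wcc(candidates):
    """Build a worst-case configuration (WCC) covering all candidates.

    candidates -- list of configurations; each configuration is a list of
                  issue slots, and each issue slot is a dict with a set of
                  function unit names ("fus") and a register file depth
                  ("rf_depth"). Both field names are illustrative.
    """
    wcc = []
    for config in candidates:            # one iteration per design point
        for index, slot in enumerate(config):
            if index == len(wcc):        # candidate has more slots than the WCC
                wcc.append({"fus": set(), "rf_depth": 0})
            wcc[index]["fus"] |= slot["fus"]  # union of function units
            wcc[index]["rf_depth"] = max(wcc[index]["rf_depth"],
                                         slot["rf_depth"])
    return wcc


if __name__ == "__main__":
    # Three candidates with 3, 2, and 4 issue slots, as in the example above;
    # the function unit names are made up for illustration.
    candidates = [
        [{"fus": {"alu", "mul"}, "rf_depth": 16},
         {"fus": {"alu"}, "rf_depth": 8},
         {"fus": {"lsu"}, "rf_depth": 8}],
        [{"fus": {"alu"}, "rf_depth": 32},
         {"fus": {"shift"}, "rf_depth": 8}],
        [{"fus": {"alu"}, "rf_depth": 8},
         {"fus": {"alu"}, "rf_depth": 8},
         {"fus": {"lsu"}, "rf_depth": 8},
         {"fus": {"mul"}, "rf_depth": 8}],
    ]
    for slot in build_wcc(candidates):
        print(sorted(slot["fus"]), slot["rf_depth"])
```

With these inputs the sketch grows the WCC to 3, 3, and finally 4 issue slots, matching the progression of slot counts in the example.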
The software runtime reconfiguration capability is supported by two hardware modules, automatically generated and instantiated in the overdimensioned WCC architecture based on the set of different input configurations that are passed to the exploration engine.
The first module is the
The configurable instruction adapter is in charge of translating the instructions produced by the compiler, which retargets itself for each candidate ASIP configuration, into an instruction executable on the WCC. All the sequences in the instruction related to a given slice of the configuration under evaluation are adapted in size and dispatched to the corresponding slice of the WCC. The value of each control directive is modified to ensure the instruction will provide the correct functionality on the overdimensioned prototype, despite the presence of additional hardware. All slices that do not exist in the configuration under test are disabled using dedicated opcodes.
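The adaptation step can be modeled in software as follows. This is a behavioral sketch only, not the generated hardware: the `NOP` value, the per-slice opcode translation tables, and the slice mapping are hypothetical stand-ins for the "dedicated opcodes" and the mapping strategy mentioned in the text.

```python
NOP = 0  # hypothetical opcode that disables a WCC slice


def adapt_instruction(fields, slice_map, opcode_maps, wcc_slices):
    """Translate one candidate instruction into a WCC instruction word.

    fields      -- opcode of each slice in the candidate configuration
    slice_map   -- slice_map[i] is the WCC slice hosting candidate slice i
    opcode_maps -- opcode_maps[i] translates candidate opcodes of slice i
                   into the opcodes of the (larger) WCC slice
    wcc_slices  -- number of slices instantiated in the WCC
    """
    wide = [NOP] * wcc_slices  # slices absent from the candidate stay disabled
    for cand_index, opcode in enumerate(fields):
        wide[slice_map[cand_index]] = opcode_maps[cand_index][opcode]
    return wide


if __name__ == "__main__":
    # A two-slice candidate mapped onto slices 0 and 2 of a four-slice WCC.
    print(adapt_instruction(fields=[3, 1],
                            slice_map=[0, 2],
                            opcode_maps=[{3: 5}, {1: 1}],
                            wcc_slices=4))
```

In the real flow one such mapping would exist per candidate configuration, selected by the configuration identifier; here a single mapping is passed explicitly to keep the example short.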
Figure
Example of instruction adapting.
In the extended flow, where the number of issue slots is potentially different (more issue slots are usually instantiated in the WCC) from the one of each candidate configuration, the instruction adaptation is necessary to execute the same instruction binary on the WCC. The adapter is adequately programmed via software through a memory-mapped register write, in order to obtain information on the configuration identifier. According to this value, it then decodes the different instruction fields, generates a new (longer) instruction word, and dispatches the new opcodes according to the mapping strategy. In the example of Figure
The second hardware module, the
Software support for reconfiguration is realized simply writing a memory-mapped register, which stores a unique configuration identifier and acts as an architecture selector, directly accessible by a function call at C application level. The automatic flow provides, in the form of a simple API, the function that accesses this register. The value stored in the register, as already described in Section
In case the user does not modify the application code to instrument it with the function calls that select the architecture configuration under test, the extended framework provides an alternative way, which employs Xilinx System Generator (SysGen) [
SysGen provides the capability of automatically creating the hardware and software support for data exchange between the FPGA board and a host processor. It also enables stimuli generated by a software-based simulation environment (such as a Matlab function itself or an HDL simulated stimuli generator like ModelSim) to be fed as input to the hardware. In this way, after every execution, the user can choose the next configuration to evaluate directly from the Simulink graphic interface and automatically restart the application.
As we specified in Section
These possible drawbacks have been evaluated in detail while implementing the flow extensions. In order to minimize the area occupation of the WCC netlist, the minimum needed subset of hardware resources is identified by the WCC synthesis algorithm. The mapping of the instruction fields inside the instruction adapter is optimized to reduce as much as possible the number of different adaptations that must be supported by the circuitry. This minimizes its area and, since the adapter is a combinational module, its propagation delay, so as not to impact the overall working frequency of the prototype. The resulting area/frequency overhead is quantified in Section
In this section, we present two use case scenarios for the previously described runtime reconfiguration techniques. The first use case involves exploration of a possible configuration space for the architecture of a single ASIP, running an image filtering kernel application. The second use case involves exploration of a multi-ASIP system, composed of three processors, a packet-based on-chip interconnection switch, two different shared modules (a UART controller and a hardware Test&Set semaphore bank for lock-based synchronization) and a MicroBlaze platform control processor. The running application is an M-JPEG encoder, partitioned and mapped on the three ASIP processing elements. For both sets of experiments, the adopted hardware FPGA-based platform features a Xilinx Virtex5 XC5VLX330 device, counting over 2M equivalent gates.
We will present the results obtained while performing the single-ASIP architecture selection process over a set of 30 different ASIP configurations. The explored design points were identified considering different permutations of the following processor architectural parameter values:
A filtering kernel was compiled for every candidate configuration, and the resulting binaries were executed on the WCC prototype, adequately reconfigured. Although the above-mentioned design space could seem small, this design case is very realistic. In fact several DSE engines, like [
In Figure
Use case results. Every configuration is labeled with a 4-tuple, whose elements represent the total number of issue slots, the register file capacity (in 32 bit words), the number of fully featured issue slots, the number of data memories. Execution cycles are reported for the different configurations under emulation. Moreover, the modeled power consumption (expressed in
Area occupation
Total execution cycles
Power consumption of FUs
Energy consumption of FUs
In this paper, to avoid the disclosure of sensitive industrial information, we do not report absolute power, frequency, and area numbers. We backannotate the emulation results referring to “comparative” numbers for energy and area contributions for the functional blocks in the ASIPs. The area, power, and energy figures are thus reported respectively in
From the performed analysis a designer could estimate
Power consumption for each FU in a particular configuration, composed of four issue slots, with register files of 16 entries, reported in
The cycle-accurate correctness of the emulation of a candidate configuration on the WCC is guaranteed by construction of the WCC architecture. In fact, every instruction that traverses the ASIP datapath, both in the candidate configuration and in the WCC, undergoes exactly the same logic path. The WCC architecture does not insert any new pipeline stage in the instruction path with respect to the ASIP. What can change is only the operating frequency of the WCC, and thus the resulting wall-clock execution time, due to the more complicated combinational logic (e.g., the instruction adapter), but the emulated CPI will be exactly the same (as a count of clock cycles). To confirm this behavior, we compared the results obtained by prototyping all the candidate configurations on the WCC architecture with the results of their stand-alone evaluation. As expected, we found that exactly the same “functional-related” figures (execution time, latency, and switching activity) are obtained. Cycle/signal level accuracy is thus confirmed for the presented approach.
In order to evaluate the speed-up enabled by the proposed approach, we first consider the time needed for the same design exploration performed using the classic approach. This requires the designer to go through the implementation flow for all the candidate design points. The related time depends on the complexity of the considered design point. For the set of candidates in this example, implementation time ranged from roughly 20 minutes to 45 minutes, for a total of 15 hours. When using our approach, on the other hand, only one synthesis is required. The synthesis of the WCC required roughly 45 minutes, allowing for a 20x speed-up. In this analysis we do not account for the time needed for the execution of the kernels on the prototypes, since it is negligible (a few seconds) with respect to the implementation time.
We now present a second use case, which validates the proposed emulation approach in the exploration of a multi-ASIP system. This kind of use case is important for several reasons. Firstly, since cycle-accurate simulation becomes slower as the size of the simulated system increases, obtaining execution traces of multicore systems by means of software-based methods quickly becomes impractical. As opposed to software-based simulators, the emulation speed achievable with FPGA devices is not affected, in general, by the system size. The only requirement is that the system under prototyping fits in the target configurable device. Therefore, a multi-ASIP system exploration should, in terms of overall speed-up, favor FPGA-based emulation approaches over pure software simulation.
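The speed-up figure follows directly from the numbers quoted above, as this short worked calculation shows. The function and variable names are illustrative; the input values (30 configurations, 15 hours in total, 45 minutes for the WCC) are the ones reported in the text.

```python
def exploration_speedup(baseline_total_min, wcc_synthesis_min):
    """Ratio between per-candidate implementation and a single WCC synthesis.

    Kernel execution time on the prototype is ignored, as in the text.
    """
    return baseline_total_min / wcc_synthesis_min


if __name__ == "__main__":
    n_configs = 30               # candidate configurations in the first use case
    baseline_total_min = 15 * 60  # 15 hours of stand-alone implementation runs
    wcc_synthesis_min = 45       # single synthesis of the WCC

    print(baseline_total_min / n_configs)  # average minutes per candidate
    print(exploration_speedup(baseline_total_min, wcc_synthesis_min))
```

The average of 30 minutes per candidate is consistent with the reported 20 to 45 minute range, and 900 minutes divided by 45 minutes gives the stated 20x.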
Moreover, the use case shows that it is possible to cross-optimize the microarchitectures of the ASIPs, exploiting results obtained from a complete system prototyping. In fact, we show how the designer is able to observe the mutual influence among the processors and between the processors and the surrounding environment (interconnect infrastructure, peripherals, and shared memories) without relying on further software-based simulation steps.
We present the prototyping results obtained by the execution of an M-JPEG encoder on a parallel MPSoC composed of a host processor and three ASIPs, interconnected by means of a Network-on-Chip subsystem. The system is represented in Figure
Multi-ASIP platform under exploration.
The application is partitioned into four parallel computing tasks, communicating through FIFO channels, according to a programming model based on Kahn Process Networks ([ the host processor is in charge of initializing the program and data memories inside the ASIPs and executing a parallel task (named the second task then involves the DCT-encoding calculations and is mapped on the first ASIP (ASIP1); the third task takes care of the quantization process ( the fourth and last task performs the variable length encoding part (VLE) and is executed by the third ASIP (ASIP3).
In order to explore the microarchitectures of the single ASIPs, ASIP1 and ASIP2 were enriched with the support for fast prototyping. ASIP3, on the contrary, has been implemented as a single static configuration. By doing so, we are able to investigate the impact that the customization of ASIP1 and ASIP2 has on the metrics related to the execution of the VLE task, which runs on ASIP3.
In detail, in the presented results the number of issue slots and the kind of function units included in ASIP1 and ASIP2 were the variables that defined the exploration space. We let the issue slot counts of ASIP1 and ASIP2 assume all the possible combinations of values from 2 to 4.
Figure
Second use case results. Every horizontal axis captures the number of issue slots inside processors ASIP1 and ASIP2. Execution cycles are reported for the different configurations under emulation. The IPC for every ASIP is also reported.
Execution times
IPC per ASIP
The power results were instead acquired as a sum of the power dissipation occurring in the ASIP1 and ASIP2 processors, for their different configurations. ASIP3 was not considered in the power estimation process, since its architectural configuration has been kept constant.
Looking only at the execution time results, the architectural modifications performed inside the ASIP2 processor (which is assigned the quantization task) appear to have no impact on the ASIP3 execution time. This means that, as far as execution time is concerned, the DCT task is the most demanding, and optimizing its execution is key to obtaining an overall performance improvement.
Interestingly, changing the number of issue slots from 2 to 3 does not affect the execution time. We could argue that this is related to how the VLIW compiler uses the available issue slots to exploit the parallelism available in the DCT task.
We are also able to provide an estimation of the system power and energy consumption. Figure
Second use case results. Every horizontal axis captures the number of issue slots inside processors ASIP1 and ASIP2. Execution cycles are reported for the different configurations under emulation. The modeled power consumption (expressed in
Power consumption
Energy consumption
By looking at the power numbers, on the other hand, we see that the architectural changes to both the ASIP1 and ASIP2 processor configurations have an impact on the overall results. The numbers suggest that the most convenient architectural configuration, in terms of energy, features 2 issue slots in the ASIP1 processor and 4 in the ASIP2 processor.
In the use case exploration, we kept the NoC switch configuration constant and changed the configurations of ASIP1 and ASIP2. This means that the candidate configurations with the highest number of issue slots could have experienced more network congestion than the ones with fewer issue slots, simply because of the increased memory references issued per cycle. Since this difference exists among the various candidate configurations, the WCC architecture exactly reproduces it when configured to emulate the different candidate configurations. This is another consequence of the cycle accuracy that the WCC architecture has with respect to the direct emulation of each single candidate configuration.
Both the functional results and the cycle counts obtained with our FPGA approach and the baseline software simulation were completely equivalent. However, while cycle-accurate software simulation required a few minutes (roughly five on average per configuration), onboard execution on the FPGA prototype required only a few seconds (roughly two) to emulate each candidate architecture. A synthesis/implementation flow, performed on an Intel Quad-Core machine with commercial tools, required less than half an hour to complete. Such time obviously depends on the size of the system, but can be estimated in the order of one hour for moderately complex systems. According to these numbers, the time saved by the presented approach increases with the number of candidate topologies under prototyping, and the approach soon outperforms software-based simulation (for approximately ten candidate design points involved in the design process). Moreover, software-based simulation is not always effective. For example, if design cases imply the evaluation of runtime middleware policies or network routing protocols, whose effectiveness must be measured on execution times much longer than a single processing kernel, simulation times become unaffordable, making FPGA emulation the only available evaluation method and our approach fundamental for its application to DSE.
In this section, we present a quantitative analysis of the overhead introduced by the addition of reconfiguration support, in terms of occupied hardware resources and critical path degradation. To this end, referring to the previously presented use case, we compare the results obtained by a synthesis of the WCC and of the most resource-demanding configuration under test (when implemented for a stand-alone evaluation, without support for runtime reconfiguration). Again, the WCC is generated to support 30 different configurations. This degree of exploration is reasonably interesting for this kind of application, though larger pools of configurations can always be used.
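The break-even point between the two methods can be estimated with a simple calculation. The sketch below is only a back-of-the-envelope model: it assumes a single synthesis run followed by one emulation per candidate, compared against one software simulation per candidate, and uses the approximate timings quoted above (five minutes per simulation, two seconds per emulation, and a synthesis time between half an hour and one hour). The function name is illustrative.

```python
def break_even_candidates(synthesis_s, sim_s_per_cfg, emu_s_per_cfg):
    """Smallest number of candidates N for which
    synthesis_s + N * emu_s_per_cfg < N * sim_s_per_cfg (integer seconds)."""
    saved_per_cfg = sim_s_per_cfg - emu_s_per_cfg
    if saved_per_cfg <= 0:
        raise ValueError("emulation must be faster than simulation per candidate")
    return synthesis_s // saved_per_cfg + 1


if __name__ == "__main__":
    sim_s = 5 * 60  # roughly five minutes per software simulation
    emu_s = 2       # roughly two seconds per FPGA emulation
    print(break_even_candidates(30 * 60, sim_s, emu_s))  # half-hour synthesis
    print(break_even_candidates(60 * 60, sim_s, emu_s))  # one-hour synthesis
```

Under these assumptions the break-even falls at 7 candidates for a half-hour synthesis and at 13 for a one-hour synthesis, which brackets the figure of approximately ten design points mentioned in the text.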
As can be noticed from Table
FPGA hardware overhead figures.
Configuration | Occupied slices | Slice registers | Slice LUTs |
---|---|---|---|
Largest configuration | 19859 | 6923 | 16387 |
WCC | 21278 | 6931 | 17951 |
Overhead | +7.1% | +0.1% | +9.5% |
A comparison of the maximum operating frequency reported by logic synthesis is also presented, since this is a limiting factor for the overall emulation speed. Results presented in Table
FPGA critical path overhead figures.
Configuration | Critical path |
---|---|
Largest configuration | 9.809 ns |
WCC | 9.817 ns |
The results show how the overhead reduction mechanisms explained in Section
In this work, an approach to ASIP configuration selection, based on FPGA-based emulation platforms, is presented and evaluated. The main point of strength of the proposed approach is that, by using the runtime software-based reconfiguration capabilities of the hardware platform, several emulation steps could be performed after a single-FPGA synthesis and implementation run. In such a way, we showed how different VLIW ASIP architectures could be emulated on-hardware by
In addition to the functional correctness of the emulation approach, the experimental data showed that the overhead introduced by overprovisioning hardware resources in the worst case configuration that is actually implemented on hardware does not preclude the feasibility of the approach, neither in terms of FPGA slices (less than 10% overhead) nor in terms of critical path and operating frequency (negligible overhead). In addition to the classical functional metrics (e.g., execution time, access rate, IPC, and resource congestion), the presented framework has been shown to produce physical metrics (e.g., area occupation, static leakage power, dynamic power, and energy consumption) for a prospective implementation of the ASIP system. Moreover, the reconfiguration mechanism for configuration selection is compatible with a preexisting wider framework for FPGA emulation of ASIP-based multicore architectures. The presented use cases validate the usefulness of the framework as an effective support to quantitative design space exploration or simply as an environment for rapid prototyping of complex multicore platforms.
Future developments of the proposed software-based reconfiguration approach include an accurate comparison, in terms of reconfiguration speed and resource overhead, with proprietary (Xilinx) tools that exploit the hardware partial reconfiguration capability of the newest FPGA devices. On the modeling side, we intend to tune the models with further experiments. Also, the models for the function units need to be adapted depending on the type of the module. For example, it is likely that quadratic models work better for units that involve multipliers. However, this may not be the case for adder units.
The research leading to these results has received funding from the European Community Seventh Framework Programme (FP7/2007–2013) under Grant agreement no. 248424, MADNESS Project, from ARTEMIS JU - ASAM Project, and by the Region of Sardinia, Young Researchers Grant, PO Sardegna FSE 2007–2013, L.R.7/2007 “Promotion of the scientific research and technological innovation in Sardinia.”