Design Example of Useful Memory Latency for Developing a Hazard Preventive Pipeline High-Performance Embedded-Microprocessor

The existence of structural, control, and data hazards presents a major challenge in designing an advanced pipeline/superscalar microprocessor. An efficient cache-RAM-disk memory hierarchy greatly enhances the microprocessor’s performance. However, there are complex relationships between the memory hierarchy and the functional units in the microprocessor. Most past architectural design simulations focus on the instruction hazard detection/prevention scheme from the viewpoint of the function units. This paper emphasizes that additional onboard memory can be well utilized to handle hazardous conditions. When an instruction encounters a hazard, the memory latency can be utilized to avoid the performance degradation caused by the hazard prevention mechanism. By using the proposed technique, a better architectural design can be rapidly validated on an FPGA at the start of the design stage. In this paper, the simulation results show that the proposed methodology has better performance and lower power consumption than the conventional hazard prevention technique.


Introduction
In current computer architecture, multiple-instruction (pipeline, superscalar) microprocessors are used to improve on the efficiency of a single-instruction microprocessor. A multiple-cycle processor usually adopts four stages (instruction fetch, decode, execute, and writeback). Because a pipeline overlaps these stages, the CPI (cycles per instruction) of a pipeline (multiple-instruction) microprocessor is several times smaller than that of a single-instruction microprocessor. Generally, the pipeline architecture is combined with the RISC (reduced instruction set computer) methodology to design high-performance processors.
Pipeline microprocessor hazards occur when multiple instructions are executed at the same time. The pipeline architectural hazards introduced in [1,2] prevent the program instructions from being executed in parallel. In general, there are three types of hazards: structural, control, and data hazards. A structural hazard means that the hardware components (resources) are insufficient to support the execution of the pipeline instructions in the same clock cycle. A frequently occurring case is a conflict over a shared single-port memory, which cannot support a read and a write operation at the same time. The second type is the control hazard, which arises when the currently executing instruction cannot make a decision because that decision relies on results that are not yet available. An example is the branch instruction, which cannot correctly decide whether to jump during its execution cycle because the most recent jump condition has not been obtained when the decision is made.
The third type, the data hazard, occurs when the current instruction's operands refer to the result of an earlier instruction, but that result is not yet stable (the earlier instruction has not reached the writeback stage), as shown in Figure 1(a), where the reference hazard occurs between instructions I1 and I2. Most conventional designs analyze the hazard conditions and add a mechanism (such as inserting a NOP instruction) to resolve the hazard and obtain better pipeline performance, as shown in Figure 1(b). Memory latency is normally unwelcome because the access delay degrades the microprocessor's performance. An idea of how to use the memory latency as a hazard prevention mechanism is shown in Figure 2. There is a half clock cycle of memory latency for a fetched instruction that is loaded from the slow-speed main memory (SRAM, DRAM). Hence, the pipeline does not need to insert the NOP and does not incur the performance degradation penalty.
The different types of memory on the system board can help designers rapidly create a better pipeline hazard prevention architecture, such as a hazard prevention mechanism that puts otherwise unproductive memory latency to use to reduce the hazard penalty. To design a better pipeline architecture, two factors should be considered simultaneously: the first is the instruction's hazard conditions, and the second is the memory access latency.
Most past architectural designs focus on the instruction hazard detection and prevention scheme from the aspect of the pipeline function units. The pipeline-stall and forwarding techniques proposed in [1] neglect the use of useful memory latency. One possible reason is that memory in real microprocessors requires a particular design to meet performance requirements, and this cannot be fully emulated by an FPGA. Thus, the FPGA is mostly used to verify the functional correctness of the fetch/decode/execution/writeback stages.
The freedom to try different system architectures is not available in a chip. Currently, the FPGA is utilized to rapidly validate a feasible architectural design. The FPGA emulation process can help designers quickly adopt different memory architectures to reduce the hazard penalties during the system design phase. Better microprocessor performance requires suitable cache-RAM-disk memory capacities in the hierarchy design. However, there are complex relationships (e.g., levels of cache, RAM access time, and capacity) between the memory hierarchy and the function units.
In this paper, we propose a pipeline architectural design obtained from the FPGA validation phase, in which the FPGA is not merely used to perform functional verification. As the memory latencies differ for different types of memory, an architectural design that applies useful memory latency can be rapidly validated on the FPGA. By adopting different types of FPGA board memory, we might find a better hazard detection/prevention mechanism.
When a designer attempts to utilize different types of memory on the FPGA board, the performance obtained with the different memory latencies (internal register, flash, RAM, or ROM) within the core architecture needs to be compared.
We emphasize that the additional onboard memory can be well utilized to handle the hazards. The FPGA board not only helps the designer to validate the function units but also offers guidance that helps the designer find a better architecture and reduce the hardware overhead of detecting/preventing hazards.
A 16-bit X86-light pipeline RISC microprocessor (14 instructions) was developed to validate our idea for a specific application (e.g., matrix multiplication). The design is a Harvard architecture, in which the data and the instructions are placed in separate memories. In this paper, we focus on a realistic hazard prevention design that uses memory latency, and we verify our results using an FPGA implementation of this microprocessor. To demonstrate the different paradigms of solving the hazard problems with and without the FPGA onboard memory, two design approaches are adopted. Method-1 is the hazard detection and prevention mechanism design using additional hardware. Method-2 uses onboard memory latency to replace the hardware proposed in Method-1 and validates the results with a Xilinx FPGA demo board (XESS XSV-800). In Method-2, the data memory (DM) and instruction memory (IM) of the pipeline processor are separate SRAMs on the board.
In this paper, first, the data and control hazard detection and prevention techniques (forwarding, NOP insertion, and stall) for our architecture are introduced. Second, two validation approaches are used to verify the design architecture. For Method-1, the Instruction-Memory and Data-Memory are synthesized and embedded within the design architecture, so the test programs need to be loaded into the Instruction-Memory and synthesized with the design at the same time. Method-2 synthesizes only the core architecture into the FPGA and places the Instruction-Memory and Data-Memory in SRAM on the board. For Method-2, the test program is loaded from the IM and the output is written to the Data-Memory during the validation process. Method-2 also supports flexible validation environments for quickly reevaluating the design architecture.
There are two contributions of this work. First, in the design phase, memory latency can be effectively utilized to avoid the hazard issues and to design a simpler and faster pipeline architecture; for example, the data hazard resolving mechanisms do not need to be embedded into the design. Second, unlike the conventional pipeline processor design that uses internal registers as the processor's IM and DM, Method-2 uses the instruction memory access latency of the SRAM on the system board to prevent the data hazard problems arising from pipeline operation. The functional testbench does not need to be synthesized with the design, so simple verification methods can be used to rapidly validate the prototype pipeline core architecture. A more flexible verification environment can be adopted for large numbers of varied test programs.
Both approaches were successful in evaluating the designs. The synthesizable RISC architecture runs at 35 MHz on the board, and the clock frequency of Method-2 is two times faster than that of Method-1.
As to the organization of this paper, Section 1 comprises the introduction. Section 2 presents the design and hazard analysis of our RISC processor. The two FPGA verification methodologies are proposed in Section 3. Section 4 gives the experimental results. This paper is concluded in Section 5.

The Pipeline Hazard and Memory Latency Surveys
The study in [3] investigates the relative importance of memory latency, memory bandwidth, and branch predictability in determining processor performance. The proposed basic machine model assumes a dynamically scheduled processor with a large instruction window. This study claims that, for a system with unlimited memory bandwidth and perfect branch predictability, memory latency is not a significant limit to performance. A simulation model with the SPEC92 benchmarks is used to study the performance.
Reference [3] shows that, for many benchmarks, even the best existing branch prediction mechanism with very large table sizes results in several times lower performance than perfect branch prediction. This means that perfect branch predictability is the most important factor. That paper assumes that memory bandwidth is not usually a significant limit for advanced technology. However, this assumption might not be achievable, as memory bandwidth is always a bottleneck in the current Harvard system architecture. Few current advanced (multicore, multithreaded) designs have perfect branch predictability while tolerating high memory latency.
Reference [4] claims that repeatable timing is more achievable than predictable timing. This research describes a micro-pipelining architecture and a memory hierarchy that deliver repeatable timing and can provide better performance than past techniques. The program threads are interleaved in a pipeline to eliminate pipeline hazards, and a hierarchical memory architecture is outlined that hides memory latencies.
In [4], the multithread architecture applies pipeline operation by interleaving the memory access operations. The repeatable timing can speed up the pipelining architecture, as the pipeline interleaving covers both the pipeline processor and the DRAM access. This specific architecture might not be suitable for a generic system architecture. A designer who applies the proposed technique needs to consider the instruction dependence conditions.
The research in [5] reviews the RISC microprocessor architecture and presents a microthreading approach to RISC microarchitecture. That paper focuses on speculation, which has a high cost in silicon area and execution time, whereas a compiler can almost always find some instructions in each loop that can be executed before the dependency is encountered. The proposed approach attempts to overcome the performance penalty from instruction control (branch, loop) statements and the data missing problem. The proposed technique can tolerate high-latency memory by avoiding speculation in instruction execution. However, without utilizing memory latency well, the compiler alone cannot obtain the best performance improvement for complicated programs.

The Demonstration RISC Microprocessor Architectures
The single-instruction and pipeline versions of the demo architecture are written in Verilog HDL and validated using a Xilinx XCV-800 FPGA [6].

The Pipeline Architecture (Method-1).
Our pipeline architecture is shown in Figure 5. The instruction/data memory is synthesized using the FPGA internal logic elements. There are four stages (fetch, decode, execute, and writeback) and three addressing modes (direct, indirect, and immediate) in this design, and a computed result is obtained in every clock cycle from this four-layer pipeline architecture. The data hazards of pipeline processors are generated from data dependences on the same registers, for example, RAW (read after write) and WAW (write after write). The control hazard arises from the branch instruction, which can cause a wrong next instruction to be fetched during pipeline operation. These hazards have been discussed in previous studies [1,2].

The Hazard Analysis of Pipeline Architecture.
Each instruction operation requires 4 clock cycles for Method-1. Incorrect pipeline operation can occur when the next instruction starts executing in the following clock cycle. These conditions are termed structural, control, and data hazards.
A data hazard arises when an instruction in the decode stage uses the same registers as a previous instruction that is still in the execution (or writeback) stage. An example is the read after write (RAW) data hazard: when [SUB r1, r0] follows [ADD r0, #10], a RAW hazard occurs on register r0. The write after write (WAW) hazard causes a register overwrite situation; for example, when [STA m[r0], r0] follows [MOV r0, #10], register r0 is subject to a WAW hazard. Thus, the hazard detection and correction circuits were added to resolve these hazard problems in Figure 5.
For example, a data hazard occurs when two instructions are executed back to back: when [SUB r1, r0] follows [ADD r0, #10], the read after write (RAW) hazard occurs on register r0. The updated value r0 = r0 + 10 is not available until the writeback stage of the ADD instruction, so it is not ready to serve as the source register r0 of the SUB instruction during its decode stage. We do not list the other types of hazards, such as WAW, in detail. Figure 6 shows the data hazard occurring between the two instructions add r0, #10 and add r10, #10.
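The adjacent-instruction RAW case described above can be sketched in Python. This is only a minimal illustration, not the authors' Verilog detection circuit: the names `Instr`, `parse`, `is_register`, and `find_raw_hazards` are hypothetical, registers are assumed to be written as `r<number>`, and only the single rule from the text is modeled (the source register of an instruction equals the target register of the instruction directly before it).

```python
import re
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    opcode: str
    target: str  # target operand, e.g. "r0"
    source: str  # source operand, e.g. "r1" or an immediate such as "#10"


def parse(text: str) -> Instr:
    """Parse 'OPCODE target, source' (the two-operand format used in the text)."""
    opcode, operands = text.strip().split(None, 1)
    target, source = (part.strip() for part in operands.split(","))
    return Instr(opcode.upper(), target, source)


def is_register(operand: str) -> bool:
    """Assumption: registers are written r<number>; anything else is not a register."""
    return re.fullmatch(r"r\d+", operand) is not None


def find_raw_hazards(program: List[str]) -> List[Tuple[int, str]]:
    """Return (index, register) for each instruction whose source register
    is the target register of the instruction directly before it."""
    instrs = [parse(line) for line in program]
    hazards = []
    for i in range(1, len(instrs)):
        prev, cur = instrs[i - 1], instrs[i]
        if is_register(prev.target) and cur.source == prev.target:
            hazards.append((i, cur.source))
    return hazards


program = ["ADD r0, #10", "SUB r1, r0"]
print(find_raw_hazards(program))  # [(1, 'r0')]
```

The sketch flags the SUB instruction (index 1) on register r0, which is the example sequence used in the text. The fuller rule set of the paper's appendices is not reproduced here.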
The instruction format is "opcode, target operand, source operand". This data hazard can occur in several ways, which we categorize in Table 1. In Table 1, a hazard occurs when the target register (target operand) of the instruction in the execution stage matches the source register (source operand) of any third-row instruction in the decode stage. We list only the simple case here; the other detailed rules are shown in Appendix B, and several types of data hazards are categorized in Appendix C. To resolve this hazard, when the aforementioned code sequence is detected, the instruction does not need to wait for the previous instruction to complete (its writeback cycle). Instead, the ALU passes the computed result directly to the next instruction as its input. This method is called forwarding.
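The forwarding idea can be illustrated with a small operand-selection sketch in Python. The function `read_operand`, the register values, and the semantics assumed for SUB (target = target - source) are illustrative assumptions, not taken from the paper's design.

```python
from typing import Dict, Optional


def read_operand(reg: str, regfile: Dict[str, int],
                 ex_target: Optional[str], ex_result: Optional[int]) -> int:
    """Select the operand value for the instruction in the decode stage.

    If the instruction in the execute stage is about to write `reg`, take
    the ALU result directly (forwarding); otherwise read the register file.
    """
    if ex_target == reg and ex_result is not None:
        return ex_result
    return regfile[reg]


# Hypothetical register contents before the sequence ADD r0, #10 ; SUB r1, r0.
regfile = {"r0": 5, "r1": 40}

# ADD r0, #10 is in the execute stage: the ALU result exists but has not
# been written back to the register file yet.
alu_result = regfile["r0"] + 10

# SUB r1, r0 is in the decode stage and needs the new value of r0.
stale = regfile["r0"]
forwarded = read_operand("r0", regfile, ex_target="r0", ex_result=alu_result)
sub_result = regfile["r1"] - forwarded  # assumed semantics: r1 = r1 - r0

print(stale, forwarded, sub_result)  # 5 15 25
```

Without forwarding, the SUB instruction would read the stale value 5; with the ALU result forwarded, it uses 15 without waiting for the writeback cycle of the ADD instruction.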

Memory Latency Utilized by Pipeline Architecture Method-2.
The memory load functional units require four clock cycles, as shown in Figure 8. We use the FPGA experimental board (Xilinx XCV 800) to demonstrate our idea, and we focus on using the RAM of this board. An SRAM read/write operation on the XESS XSV-800 board requires four clock cycles; the detailed timing diagram is shown in Figure 9.
Because a memory read/write requires four clock cycles, we can arrange eight cycles to execute one instruction. One do-nothing cycle must be inserted, so that eight clock cycles in total are needed to execute an instruction. The inserted do-nothing cycle is used to align the operation of the pipeline instructions. This methodology also gives more freedom for hazard prevention. The two-level pipeline has better performance and less hazard-processing hardware. Because the simpler architecture requires no hazard detection circuit, it allows more complex advanced functional cores to be placed inside.
Figure 10 shows the pipeline clock cycle plan of Method-2. Four clock cycles are required to read an instruction from the instruction memory or to write a result to the data memory. Note that the signal lines should be ready before the write operation; this is not shown in the figure. The latency can be utilized to overlap the fetch stage of one instruction with the decode/execute/writeback/do-nothing stages of the previous instruction, for example, instructions I2 and I3 in Figure 11. No hazards occur when we move the instruction and data memory outside the FPGA to the random access memory on the board. The simpler redesigned two-layer pipeline architecture is shown in Figure 7. There is no hazard detection/correction circuitry, and an output is obtained every four clock cycles. If the access delay is not utilized, each stage is extended to four clock cycles to fit the slowest (fetch) stage, so each instruction expands to 16 clock cycles (4 clock cycles for each stage). However, the hazard problems can still be avoided in this four-layer pipeline architecture.
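The overlap described above can be checked with a small cycle-plan model in Python. This is a sketch under stated assumptions, not the authors' timing plan from Figure 10: the names `method2_schedule` and `is_raw_safe` are hypothetical, cycles are numbered from 0, and the do-nothing cycle is assumed to follow writeback, in the order the stages are listed in the text.

```python
from typing import Dict, List

FETCH_CYCLES = 4  # SRAM read latency stated in the text
CORE_STAGES = ["decode", "execute", "writeback", "do-nothing"]  # one cycle each


def method2_schedule(n: int) -> List[Dict[str, range]]:
    """Cycle ranges (0-based) for each stage of n consecutive instructions.

    The fetch of instruction i+1 starts as soon as the fetch of instruction i
    ends, so it overlaps the four one-cycle stages of instruction i.
    """
    plan = []
    for i in range(n):
        start = i * FETCH_CYCLES
        stages = {"fetch": range(start, start + FETCH_CYCLES)}
        for k, name in enumerate(CORE_STAGES):
            cycle = start + FETCH_CYCLES + k
            stages[name] = range(cycle, cycle + 1)
        plan.append(stages)
    return plan


def is_raw_safe(plan: List[Dict[str, range]]) -> bool:
    """True if every instruction decodes strictly after its predecessor's writeback."""
    return all(
        plan[i + 1]["decode"].start > plan[i]["writeback"].start
        for i in range(len(plan) - 1)
    )


plan = method2_schedule(3)
for i, stages in enumerate(plan, start=1):
    print(f"I{i}:", {name: (r.start, r.stop - 1) for name, r in stages.items()})

cycles_per_instruction = plan[0]["do-nothing"].stop - plan[0]["fetch"].start
print("cycles per instruction:", cycles_per_instruction)  # 8
print("next decode always follows previous writeback:", is_raw_safe(plan))  # True
```

In this model each instruction takes eight cycles, a new instruction completes every four cycles, and the decode of an instruction always comes after the writeback of its predecessor, which is why no detection or forwarding logic is needed between adjacent instructions.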
Both the Method-1 and Method-2 architectures can be adopted and, as we stated, using the onboard RAM offers greater flexibility for extension. When we need to match the memory read/write cycle, there is one alternative, as shown in Figure 12. All function units extend their operating cycles to four; thus the total for executing an instruction increases to 16 clock cycles. In this architecture, an output is produced every four clock cycles. A wider range of architectures can be adopted when different types of onboard memory, such as Flash and ROM, are put to sufficient use. This gives the designer more freedom in choosing different design styles.
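The cycle counts of the two pipelined alternatives can be compared with a short Python calculation. The helper `total_cycles` and the program length of 100 instructions are illustrative assumptions, and the non-pipelined 16-cycle entry is a hypothetical baseline added only for contrast; no stalls or memory misses are modeled.

```python
def total_cycles(n_instructions: int, latency: int, interval: int) -> int:
    """Cycles to finish n instructions when each one takes `latency` cycles
    and a new instruction starts every `interval` cycles."""
    if n_instructions <= 0:
        return 0
    return latency + (n_instructions - 1) * interval


n = 100  # hypothetical program length
configs = {
    "non-pipelined, 16 cycles each (hypothetical baseline)": (16, 16),
    "16-cycle pipeline (four cycles per stage)": (16, 4),
    "8-cycle pipeline (Method-2)": (8, 4),
}
for name, (latency, interval) in configs.items():
    print(f"{name}: {total_cycles(n, latency, interval)} cycles")
```

Under these assumptions both pipelines deliver one result every four cycles, so they differ only in the latency of a single instruction (16 versus 8 cycles), while the 8-cycle plan also avoids the hazard detection hardware.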
Handling hazards in this architecture is simpler than in the no-RAM version, because each stage has four cycles. Each functional unit in the local elements within the FPGA requires only one clock cycle to execute its operation. The remaining three cycles (3/4) provide more flexibility for solving the hazard issues than the tight case in which each stage is assigned one cycle. Figure 13 shows a data dependence being resolved by the forwarding operation.
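The three spare cycles per stage can be made concrete with a small timing model in Python. This is an interpretation under explicit assumptions, not the timing of Figure 13: a new instruction is assumed to enter every four cycles, the ALU is assumed to compute in the first cycle of its execute window, and the helper `stage_window` is a hypothetical name.

```python
STAGE_CYCLES = 4  # every stage is stretched to the four-cycle SRAM access time
UNIT_CYCLES = 1   # a function unit inside the FPGA needs one cycle (from the text)
STAGES = ["fetch", "decode", "execute", "writeback"]


def stage_window(instr_index: int, stage: str) -> range:
    """Cycles (0-based) during which instruction `instr_index` occupies `stage`
    in the 16-cycle pipeline, with a new instruction entering every four cycles."""
    start = (instr_index + STAGES.index(stage)) * STAGE_CYCLES
    return range(start, start + STAGE_CYCLES)


producer_execute = stage_window(0, "execute")  # cycles 8..11
consumer_decode = stage_window(1, "decode")    # cycles 8..11 (the same window)

# Assumption: the ALU computes in the first cycle of its window, so the
# forwarded value is usable from the following cycle onward.
result_ready = producer_execute.start + UNIT_CYCLES
usable = [c for c in consumer_decode if c >= result_ready]

print("slack cycles per stage:", STAGE_CYCLES - UNIT_CYCLES)  # 3
print("cycles in which the consumer can take the forwarded value:", usable)
```

In this model the dependent instruction has three cycles within its own decode window in which the forwarded result is already available, instead of the single tightly timed cycle of a one-cycle-per-stage pipeline.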

The Experimental Result Analysis
In a conventional design concept, an FPGA chip is only used to validate the functional correctness of instruction operations in the fetch/decode/execution/writeback stages. In this research, the FPGA board not only helps the designer to validate the function units but also helps to find new hazard detection/prevention strategies that have less hardware overhead than those in past research. The performance evaluation of a prototype pipeline design needs to utilize the different types of memory on the FPGA board as much as possible in order to derive a better hazard detection and prevention mechanism. There are four pipes for Method-1 and two pipes for Method-2. Thus, Method-2 has less hardware overhead, and its synthesis results are better than those from Method-1. The working frequency of Method-2 is 62.8% higher than that of Method-1.
Table 3 compares three architectures: the single-instruction design, the 16-cycle pipeline (Figure 12), and the 8-cycle pipeline (Method-2, Figure 10). All of the specifications are obtained from the synthesis reports of the FPGA tool. For a fair comparison, it should be mentioned that all three designs place their instruction/data memory in the onboard RAM.
The SINGLE and 8-cycle versions are nearly equivalent, which indicates that the pipeline architecture performance is mainly limited by the IM and DM access time. Several benchmark test programs are used for the proposed design. In Table 4, the measured results are obtained from Bubble-Sort on a large volume of data. Table 4 measures the power consumption. The 8-cycle pipeline architecture consumes less power at all three clock frequencies.

Discussions
In this paper, the proposed pipeline architecture utilizes the memory access latency to improve the performance when hazards occur. As the memory access speed differs for different types of memory, the pipeline operation can be sped up by making good use of the memory access latency cycles. The proposed methods assume that instructions and data can always be obtained from memory (the hit situation). This means that the instructions and data can be found in the chip's internal registers or the onboard memory, and that these memories are large enough to store the required instructions and data during program execution.
wait cycles for instruction and data memory missing situations. A flexible system architecture needs to account for the instruction and data hit/miss conditions of various memory architectures. However, such a pipeline architecture is hard to design because it must include different memory waiting cycles.

Conclusion
We find that a good cycle timing plan is the most important issue in designing a pipeline CPU. The processor performance depends on how well the clock cycles, the control, and the data flow are managed. The design also has a good chance of being improved if one considers that memory latency can be utilized for hazard prevention. When the microprocessor can utilize different types of memory (internal register, flash, RAM, or ROM) on the system board, this gives flexibility and helps to achieve a system architecture with better performance. Onboard memory is hard to use and is therefore regularly ignored during the prototype verification phase. The designer should not forget the onboard FPGA memory, although it differs from the real CPU memory and is inconvenient to use (the read/write operations must be coordinated with the kernel function units). The verification phase might be the best chance to reevaluate the preliminary design, because one might find a better hazard-free structure based on memory latency. In our experience, the designer can obtain a greater number of different architectures by spending time on trying the onboard memory. The functional test program should take the memory access delay into consideration when the executed programs are moved to the outside memory in the second method. The design that applies memory latency for hazard prevention has better performance and lower power consumption than the conventional design.

Figure 2: Using memory latency to design a hazard prevention mechanism.

Figure 3 shows the microarchitecture of the single-instruction version. There are four stages used in this processor, namely, the instruction fetch, decode, execute, and writeback stages. The next instruction is fetched only after the current instruction's results are written back to memory (register/data memory). The timing diagrams are shown in Figure 4. Each instruction requires 4 clock cycles in this style, with each stage operating at the positive clock edge. Some instructions do not need to execute all four steps, as shown in Table 5. Verilog HDL code is used to design the hardware function blocks, and the Xilinx FPGA simulation/synthesis environment is used to carry out the experiments. The single cycle architecture shown in Figure 3 executes the opcode on each positive clock edge. A simple description of the function unit is as follows: instruction

Figure 3: The single clock cycle architecture.

Figure 4: The function units are activated by each positive clock edge simultaneously.

Figure 6: The data hazard example.

Figure 11: The three instruction execution cycles of Method-2.

Table 2: There are more benefits from pipeline Method-2.

Table 3: The design specifications' comparisons.

Table 4: The performance/area comparisons of the synthesis results.

Table 2 shows that the synthesis results from Method-2 are better than those from Method-1.

The read operation timing diagram of the XCV800 onboard RAM.