Functional verification has become one of the main bottlenecks in the cost-effective design of embedded systems, particularly for symmetric multiprocessors. Verification as a whole is estimated to account for up to 60% of design resources, including time, computing resources, and personnel. Simulation-based verification is a long-standing approach to locating design errors in symmetric multiprocessors. Its greatest challenge is creating the reference model of the symmetric multiprocessor. In this paper, we propose an efficient symmetric multiprocessor reference model, Hybrid Model, written in SystemC. SystemC provides a high-level simulation environment and simulates faster than traditional hardware description languages. Hybrid Model has been applied to the verification of a 32-bit symmetric multiprocessor. Experimental results show that the proposed model is a fast, accurate, and efficient symmetric multiprocessor reference model that helps designers locate design errors easily and accurately.
Recently, the symmetric multiprocessor (SMP) has become a leading trend in the development of advanced embedded systems. Meanwhile, with the rapid improvement of hardware manufacturing technologies and the help of computer-aided design (CAD) tools, SMP systems have become increasingly powerful and complex. As a result, design verification takes up a large part of the total design period of an SMP system. The verification method directly determines the efficiency of SMP system verification and thus affects the whole design cycle.
A variety of techniques have been deployed to efficiently and effectively detect design errors in SMP systems. These techniques can be divided into three categories: formal verification, simulation-based verification, and hardware emulation [
The major drawback of the mainstream simulation-based approach is the difficulty of creating an efficient reference model of the design under test (DUT) in a short time. The success of simulation-based verification depends on the accuracy and quality of the reference model in use. An efficient and accurate reference model helps designers locate errors easily and quickly. Many researchers have proposed reference models for presilicon processor verification. In simulation-based verification, most processor projects use a simulator as the reference model. These simulators are normally inherited from an earlier stage of processor development, in which they are used for performance evaluation under benchmark [
In a simulation process, function coverage analysis is needed to measure and report the quality of testing. It helps the verification team check whether the function points they want to simulate are covered during the testing phase. Sometimes, guided by the function coverage analysis, direct tests have to be written by hand to cover the missing cases. Function coverage is usually collected from the RTL (Register Transfer Level) code, where each function point is indicated by one signal or a set of signals. Because the verification team is unfamiliar with the RTL code, it is difficult for them to observe the function points there. When the signals needed by a function point do not exist in the RTL code, the verification team has to turn to the designers, who must then add signals that serve no purpose in the system function. Function coverage analysis therefore requires interaction between the verification team and the designers, which makes it error-prone. The verification team is, however, familiar with the reference model, which they created themselves. If they collect function coverage from the reference model rather than from the RTL code, the coverage result can be more accurate, and they can write the direct tests more effectively.
The main contribution of our work is an efficient SMP reference model, Hybrid Model (HM), written in SystemC. As an SMP reference model, HM is simpler and faster than a timing-accurate model (TM) and more accurate than a function-accurate model (FM). The second contribution is the definition of a timing sequence called Dependent Timing Sequence (DTS), which serves as the timing interface between the two component models of HM. The final contribution is that function coverage analysis can be obtained from HM. The verification team can thus obtain a more accurate coverage result quickly and write direct tests more effectively.
As shown in Figure
Efficient symmetric multiprocessor verification.
In the validation process, a test case is applied to the SMP system and to HM simultaneously: the SMP system executes, and HM simulates, the instructions of the test case one by one. For each instruction, the CPU pipeline executes it and produces the execution results of the CPU pipeline. If the instruction is a load/store instruction, the CPU pipeline sends it to LSM, which executes it and produces the execution results of LSM. Together these form the execution results of the whole SMP system. On the HM side, CPM first simulates the instruction and produces the simulation results of the CPU pipeline. If the instruction is a load/store instruction, CPM pipes its timing stream to CCM via DTS. The timing stream triggers CCM to simulate, and CCM produces the simulation results of LSM. Together these form the simulation results of the whole SMP system. The tool then compares the execution results with the simulation results to check correctness. As soon as any discrepancy occurs, the tool stops the simulation and collects information about the offending instruction, such as its execution results and simulation results, for the verification team. These messages make it convenient for the verification team to locate errors.
An important part of HM is the CPU Pipeline Model (CPM), which is function-accurate. It acts as the reference model of the CPU pipeline. CPM models only the function of the CPU pipeline, not its timing. As shown in Figure
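The per-instruction flow described above can be sketched as follows. This is a minimal illustration in plain Python rather than SystemC, and it is not the tool's actual implementation: the names `Instr`, `Mismatch`, and `run_lockstep` are hypothetical, the four callables stand in for the CPU pipeline, LSM, CPM, and CCM, and the DTS hand-off from CPM to CCM is folded into the `ccm` call.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Optional


@dataclass
class Instr:
    """Hypothetical instruction record; only the fields this sketch needs."""
    name: str
    is_load_store: bool = False


@dataclass
class Mismatch:
    """Information reported to the verification team on the first discrepancy."""
    index: int
    instr: Instr
    stage: str       # "pipeline" or "lsm"
    executed: Any    # execution result from the DUT
    simulated: Any   # simulation result from the reference model


def run_lockstep(
    test: Iterable[Instr],
    dut_pipeline: Callable[[Instr], Any],
    dut_lsm: Callable[[Instr], Any],
    cpm: Callable[[Instr], Any],
    ccm: Callable[[Instr], Any],
) -> Optional[Mismatch]:
    """Compare DUT and model instruction by instruction; stop at the first mismatch."""
    for index, instr in enumerate(test):
        executed, simulated = dut_pipeline(instr), cpm(instr)
        if executed != simulated:
            return Mismatch(index, instr, "pipeline", executed, simulated)
        if instr.is_load_store:
            # Assumption: the DTS hand-off to CCM is hidden inside `ccm`.
            executed, simulated = dut_lsm(instr), ccm(instr)
            if executed != simulated:
                return Mismatch(index, instr, "lsm", executed, simulated)
    return None  # every instruction matched
```

Returning the first `Mismatch` instead of continuing mirrors the stop-on-discrepancy behaviour: the record carries both result sets for the failing instruction.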
Block diagram of CPM.
The simulation results of the CPU pipeline can be obtained rapidly. They include key information about the SMP system, for example, the PC, the register values, and the state of the target processor. The tool compares the simulation results produced by CPM with the execution results produced by the DUT, and any discrepancy indicates an error in the DUT. If no discrepancy occurs and the instruction being simulated is not a load/store instruction, the simulation of this instruction finishes successfully. If it is a load/store instruction, CPM sends the complete timing information of the instruction to CCM via DTS. If an error occurs, the simulation stops at once, and the simulation results and execution results are reported directly to help the verification team locate and fix the error.
The other important part of HM is the Cache Coherence Model (CCM), which is timing-accurate. CCM is the reference model of LSM. Because CCM is timing-accurate, it has to model the details of LSM. However, only the details that affect the function points the verification team wants to simulate need to be modeled. The function points are defined manually by the verification team; they combine the characteristics of the DUT with a series of events that must be verified. In practice, these events are identified by observing the signals and states of the DUT. Once the verification team has listed the events, they serialize the closely related events and outline their features. Finally, events with a close data relationship are placed in one process, according to the serialized events and the data-structure relationships among them. The resulting processes are implemented in SystemC, run in parallel, and communicate with each other through FIFOs.
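The idea of parallel processes that exchange events through a FIFO can be illustrated with a small sketch. It uses Python threads and `queue.Queue` as a stand-in for SystemC processes and FIFO channels; the two process bodies and the event format are hypothetical and do not reflect how CCM actually partitions its events.

```python
import queue
import threading


def run_two_processes(addresses):
    """Run a producer and a consumer process connected by one FIFO channel."""
    fifo = queue.Queue()   # FIFO channel between the two processes
    handled = []           # addresses in the order the consumer saw them

    def lookup_process():
        # Hypothetical first process: emits one event per incoming address.
        for addr in addresses:
            fifo.put(("lookup_done", addr))
        fifo.put(None)     # sentinel: no more events

    def update_process():
        # Hypothetical second process: consumes events in FIFO order.
        while True:
            event = fifo.get()
            if event is None:
                break
            handled.append(event[1])

    threads = [
        threading.Thread(target=lookup_process),
        threading.Thread(target=update_process),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handled
```

Because the channel is a FIFO with a single producer and a single consumer, the consumer sees the events in the order they were produced even though the two processes run concurrently.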
Figure
Block diagram of hardware. RB is used to preserve load/store transactions and maintain their order. WB keeps store miss transactions, LB is responsible for preserving load miss transactions, and STQ keeps store hit transactions. COHU maintains cache coherence between cores and NCOHU deals with the transactions unrelated to cache coherence.
Because the main memory has a low load/store speed, the hardware uses buffers in the NCOHU to hold load/store transactions unrelated to cache coherence. Software memory, however, is fast to access, so CCM needs no buffers for memory access. Similarly, more than one transaction may attempt to access the cache at the same time, whereas the cache is a one-port element, so the hardware needs buffers to hold the outstanding cache requests. CCM can accept and execute all the requests simultaneously, so it needs no buffer for these transactions either. Abstracting these buffers away does not affect the function and also reduces the implementation time of CCM. However, some hardware structures cannot be abstracted, because any discrepancy between the hardware and CCM there may cause fatal functional mistakes.
Because the interconnection usually works faster than the CPU, some of the transactions related to cache coherence need to be held in COHU. COHU maintains the order of these transactions so that the execution results are accurate. CCM has to handle these load/store transactions in the same way as the hardware to obtain the right simulation results. Figure
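To illustrate why such buffers can be abstracted, the following sketch compares a one-port cache that drains a request buffer one entry per cycle with a model that applies all requests in a single step. Both functions, the request format, and the cycle counts are hypothetical simplifications, not the structure of the real LSM or CCM.

```python
from collections import deque


def hardware_like(requests):
    """One-port cache: buffered requests are served one per cycle."""
    buffer = deque(requests)   # outstanding requests wait here
    cache, cycles = {}, 0
    while buffer:
        addr, value = buffer.popleft()
        cache[addr] = value
        cycles += 1
    return cache, cycles


def model_like(requests):
    """Abstracted model: all requests are applied in one step, no buffer."""
    cache = {}
    for addr, value in requests:   # request order is still preserved
        cache[addr] = value
    return cache, 1
```

For the same ordered request list, both functions end with the same cache contents and differ only in the number of cycles, which is the sense in which the buffer is a timing detail rather than a functional one.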
The execution result depends on the arbitration. (a) CPU0 executes before CPU1. (b) CPU1 executes before CPU0.
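The dependence on arbitration order can be reproduced with a small sketch. The two programs below (CPU0 stores to an address, CPU1 loads from it) are a hypothetical example chosen for illustration and are not taken from the figure; `run_in_order` is likewise an assumed helper name.

```python
def run_in_order(order, programs, memory):
    """Execute each CPU's transactions in the given arbitration order."""
    mem = dict(memory)   # copy so the caller's memory is untouched
    loaded = {}          # value observed by each CPU's load
    for cpu in order:
        for op, addr, value in programs[cpu]:
            if op == "store":
                mem[addr] = value
            else:  # "load"
                loaded[cpu] = mem[addr]
    return mem, loaded


# Hypothetical programs: CPU0 writes 1 to address 0x100, CPU1 reads it.
PROGRAMS = {
    0: [("store", 0x100, 1)],
    1: [("load", 0x100, None)],
}
INITIAL = {0x100: 0}

cpu0_first = run_in_order([0, 1], PROGRAMS, INITIAL)
cpu1_first = run_in_order([1, 0], PROGRAMS, INITIAL)
```

With CPU0 arbitrated first, CPU1 observes the new value; with CPU1 first, it observes the initial value. A reference model that orders the two transactions differently from the hardware would therefore report a discrepancy even when the hardware is correct.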
Figure
Block diagram of CCM.
Dependent Timing Sequence (DTS) is the timing interface between CPM and CCM. For every instruction, CPM simulation and CPU pipeline execution proceed simultaneously, and the tool continuously compares the simulation results with the execution results. If no error is found in the CPU pipeline and the instruction being simulated is a load/store transaction, CPM is responsible for delivering the timing information of this transaction to DTS. CPM knows all the timing information of the transaction except the cycle number, which tells CCM when to begin its simulation. CPM obtains this cycle number from the execution results of the hardware. In this way, CPM assembles the complete timing sequence of the transaction and pipes it to DTS, so that DTS contains all the timing information CCM needs. CCM then reads the timing information from DTS and begins its simulation. Figure
Timing information in DTS.
As different kinds of CCMs may need different timing information, the information in DTS should be adjusted to meet the timing requirements of CCM.
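One possible shape for a DTS entry is sketched below. The field names, the `hw_trace` lookup key, and the helper `complete_entry` are all assumptions made for illustration; the actual fields depend on the timing requirements of the CCM in use. The sketch only shows the division of labour described above: CPM fills in everything except the start cycle, which is taken from the hardware execution results.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class DtsEntry:
    """Hypothetical timing record passed from CPM to CCM."""
    core_id: int
    op: str                             # "load" or "store"
    address: int
    data: Optional[int]                 # store data, None for loads
    start_cycle: Optional[int] = None   # unknown to CPM on its own


def complete_entry(
    partial: DtsEntry,
    hw_trace: Dict[Tuple[int, str, int], int],
) -> DtsEntry:
    """Fill in the start cycle from the hardware execution results."""
    cycle = hw_trace[(partial.core_id, partial.op, partial.address)]
    return replace(partial, start_cycle=cycle)
```

A CCM variant that needs more or less timing detail would change the fields of `DtsEntry` without changing how CPM and CCM are connected.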
Because HM is written by the verification team and includes only the function points under consideration, the function coverage report can be obtained quickly. Moreover, because the proposed approach isolates function coverage analysis from system design, it avoids many unnecessary errors in the function coverage report and makes the analysis more accurate.
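Collecting function coverage inside the reference model can be as simple as counting hits per function point. The class below is a minimal sketch under that assumption; `CoverageCollector` and the function point names in the usage example are hypothetical and are not the function points used in the experiments.

```python
class CoverageCollector:
    """Counts how often each declared function point is exercised."""

    def __init__(self, points):
        self.hits = {point: 0 for point in points}

    def hit(self, point):
        # Called by the reference model when it simulates a function point.
        if point not in self.hits:
            raise KeyError(f"undeclared function point: {point}")
        self.hits[point] += 1

    def uncovered(self):
        """Function points that still need a direct test."""
        return sorted(p for p, n in self.hits.items() if n == 0)

    def ratio(self):
        """Fraction of declared function points hit at least once."""
        covered = sum(1 for n in self.hits.values() if n > 0)
        return covered / len(self.hits)


cov = CoverageCollector(["load_miss", "store_hit", "store_miss"])
cov.hit("load_miss")
cov.hit("load_miss")
cov.hit("store_hit")
```

After these calls, `cov.uncovered()` lists the one function point that no test has reached, which is the information the verification team needs when writing a direct test.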
We selected the CK810MP of Hangzhou C-SKY Microsystems Co., Ltd., to evaluate the feasibility of HM. As shown in Figure
Architecture of CK810MP.
Figure
Verification platform.
The test generator generated 4000 tests of 100 instructions each, including the boot sequence used to initialize the CK810 core. In the first experiment, we compared the simulation speeds of these four models of CK810MP. To obtain differentiated results, the 4000 tests were randomly divided into 10 test groups of different sizes, with the number of tests increasing gradually from the first group to the tenth. These test groups were then fed to each of the reference models of the CK810 quad-processor system to compare their simulation speeds. Figure
Simulation speed.
Further, we focused on the function-accurate model of the CPU pipeline in HM (denoted as CP-FM) and the timing-accurate model of the CPU pipeline in TM (denoted as CP-TM) to explain why HM has an obvious speed advantage over TM. The test groups were fed to CP-FM and CP-TM to compare their simulation speeds. Figure
Comparison of simulation speeds of the function-accurate model and timing-accurate model of CPU pipeline.
In the second experiment, we compared the accuracy of these four models, measured by the number of errors each of them found. The 4000 tests in the simulation environment were randomly divided into 10 test groups of 400 tests each. These test groups were then applied to the CK810MP system and to each of its four reference models. Figure
(a) Error number found by test groups; (b) accumulated errors.
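The grouping used in this experiment, and the relation between panels (a) and (b), can be sketched as follows. The function names and the fixed random seed are assumptions, and no error counts from the experiment are reproduced here; the sketch only shows a random split into equal groups and a running total over per-group counts.

```python
import random
from itertools import accumulate


def split_into_groups(tests, group_count, seed=0):
    """Randomly split the tests into `group_count` groups of equal size."""
    shuffled = list(tests)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    size, remainder = divmod(len(shuffled), group_count)
    if remainder:
        raise ValueError("tests cannot be split into equal groups")
    return [shuffled[i * size:(i + 1) * size] for i in range(group_count)]


def accumulated(errors_per_group):
    """Running total of errors found, group by group."""
    return list(accumulate(errors_per_group))


groups = split_into_groups(range(4000), 10)   # 10 groups of 400 test ids
```

Given the per-group error counts of panel (a), `accumulated` yields the curve of panel (b).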
As soon as the four reference models are written, they are put into operation in the CK810 quad-processor verification. At this point, however, the models are not yet exactly the correct golden models defined by the specification, especially the TM. The CPU pipeline of the CK810 quad-processor is a complex dual-issue superscalar 10-stage pipeline; hence some inconsistency between TM and the correct timing-accurate model is unavoidable at the beginning of simulation, and eliminating it takes a lot of time. Before the TM becomes a correct timing-accurate model, it may produce wrong simulation results because of a timing inconsistency, while the processor produces wrong execution results because of a design error. If the wrong simulation results and the wrong execution results happen to be the same, TM wrongly concludes that the hardware is correct. Figure
An example of exclusive transactions. (a) The correct implementation. (b) The wrong execution result caused by a design error. (c) The wrong simulation result caused by the timing inconsistency.
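The masking effect follows directly from the fact that the tool compares only the execution result with the simulation result and never sees the golden result. The sketch below makes this explicit; the function `classify`, its labels, and the example outcome strings are hypothetical and serve only to separate the cases.

```python
def classify(golden, executed, simulated):
    """Classify one comparison given the (normally unknown) golden result."""
    hardware_correct = executed == golden
    if executed == simulated:
        # The tool sees no discrepancy in either of these two cases.
        return "pass" if hardware_correct else "masked"
    # The tool reports a discrepancy in either of these two cases.
    return "false_alarm" if hardware_correct else "detected"


# Hypothetical outcome of a store exclusive that should fail.
GOLDEN = "store_exclusive_fails"
WRONG = "store_exclusive_succeeds"
```

`classify(GOLDEN, WRONG, WRONG)` is the masked case: a design error in the hardware and a timing inconsistency in the model produce the same wrong result, so no discrepancy is raised.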
As the simulation goes on, the verification team gradually modifies all these models until they become the correct golden models. At that stage, if some timing errors of the CPU pipeline do not influence the function of the CK810 quad-processor, the TM can discover these timing errors but the HM cannot. As an example, the interval between the load exclusive transaction and the first store exclusive transaction in Figure
To compare the accuracy of four reference models further, we analyzed the coverage of function points that we want to simulate. Figure
Coverage of function points.
Further, we focused on CP-FM and CP-TM to compare their accuracy and to explain why HM has an obvious speed advantage over TM while maintaining similar accuracy, using the test groups used in Figure
(a) Error number found by test groups; (b) accumulated errors found by the function-accurate model and timing-accurate model of CPU pipeline.
An accurate and efficient symmetric multiprocessor reference model is proposed in this paper. Function coverage analysis can be obtained from it to help the verification team write direct tests more accurately. The reference model has been applied to the verification of a 32-bit symmetric multiprocessor. The experimental results show that our proposed model finds about 4 times as many errors as a function-accurate model, and that its simulation speed is about 30 times that of a timing-accurate model under the same conditions. Compared with the timing-accurate model, our proposed model is easier to create and faster, while their abilities to find errors are similar. These advantages come from the functional design of the CPU pipeline model. With the help of our proposed model, the verification team can locate design errors more quickly and verify the interconnection more efficiently, and the time for symmetric multiprocessor verification can be shortened significantly.
The authors declare that there is no conflict of interest regarding the publication of this paper.
The authors would like to thank the members of the multiprocessor project team at Hangzhou C-SKY Microsystems Co., Ltd., especially Ke Wang, Xiaomeng Zhang, Teng Hu, and Xiaofei Jin, for their cooperation and help in this work.