An Evaluation of Parallel Synchronous and Conservative Asynchronous Logic-Level Simulations

A recent paper by Bailey contains a theorem stating that the idealized execution times of unit-delay synchronous and conservative asynchronous simulations are equal, provided that an unlimited number of processors is available and the evaluation time of each logic element is equal. It is further shown that these conditions result in a lower bound on the execution times of both synchronous and conservative asynchronous simulations. Bailey's conclusions are derived under the strict assumption that the inputs to a circuit remain fixed during the entire simulation. We remove this limitation and, by extending the analyses to multi-input, multi-output circuits with an arbitrary number of input events, show that conservative asynchronous simulation extracts more parallelism and in general executes faster than synchronous simulation. Our conclusions are supported by a comparison of the idealized execution times of synchronous and conservative asynchronous algorithms on ISCAS combinational and sequential benchmark circuits.


INTRODUCTION
Reliable design of digital VLSI systems requires extensive logic simulations consuming enormous amounts of CPU time. Parallel processing offers a viable way to improve upon this time. Two main classes of algorithms exist for parallel logic simulation, known as the synchronous and asynchronous algorithms. In synchronous simulation (sometimes referred to as centralized-time simulation), a centralized clock for the simulation time is maintained. All logic elements experiencing input events at the current simulation time are processed and then the clock is advanced by one time unit to the next simulation time. In contrast, asynchronous simulation (also called distributed simulation) does not require any centralized clock to coordinate its execution. Instead, all events carry the simulation time information (timestamp) themselves. In conservative asynchronous simulation, a logic element is ready for evaluation as soon as all of its inputs have received a token (a logical value and its timestamp). When a logic element evaluates, it produces an output based on the logical value of the input tokens and consumes the input token(s) with the lowest timestamp. The output has a timestamp equal to the timestamp of the consumed input token(s) plus the delay of the logic element itself. In the "conservative" form of asynchronous simulation, the time order of tokens is always guaranteed and only "safe" evaluations are allowed, i.e., evaluations guaranteeing a correct result.
In implementing the event-driven principle (i.e., sending an output token to the fanout elements only if there is a change in its logical value), the conservative asynchronous simulation can deadlock. A deadlock is a situation where no element can evaluate because at least one of its inputs is missing a token. This occurs frequently in the simulation of circuits with feedback, because if the output that is feeding back did not change, no token will be sent to that input, causing a deadlock.
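The evaluation rule and the blocking behaviour described above can be illustrated with a minimal Python sketch. It is not the algorithm of the appendices: the class name `Element`, the per-input `current` values (assumed to start at 0), and the `(value, timestamp)` token layout are assumptions made only for illustration.

```python
from collections import deque


class Element:
    """Hypothetical logic element with one token queue per input."""

    def __init__(self, func, n_inputs, delay=1):
        self.func = func                      # logic function over the input values
        self.delay = delay                    # delay of the element (unit delay by default)
        self.queues = [deque() for _ in range(n_inputs)]
        self.current = [0] * n_inputs         # last consumed value per input (assumed 0 initially)

    def ready(self):
        # Conservative rule: every input must hold at least one token.
        return all(len(q) > 0 for q in self.queues)

    def evaluate(self):
        """Consume the lowest-timestamp token(s) and return the output token."""
        if not self.ready():
            return None                       # blocked: some input is missing a token
        t_min = min(q[0][1] for q in self.queues)
        for i, q in enumerate(self.queues):
            if q[0][1] == t_min:
                self.current[i] = q.popleft()[0]
        return (self.func(self.current), t_min + self.delay)


and_gate = Element(lambda values: int(all(values)), n_inputs=2)
and_gate.queues[0].extend([(1, 5), (0, 15)])
and_gate.queues[1].append((1, 10))
print(and_gate.evaluate())   # (0, 6): consumes the token stamped 5
print(and_gate.evaluate())   # (1, 11): consumes the token stamped 10
print(and_gate.evaluate())   # None: input 1 is empty, so the element is blocked
```

The third call shows the situation that leads to deadlock: a token is still waiting on input 0, but the element cannot evaluate until input 1 receives another token.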
There are two ways to handle deadlocks (proposed by Chandy and Misra [2][3]): one is deadlock avoidance by the use of NULL or redundant messages, the other is deadlock detection and recovery. Bailey [1] develops the execution time of asynchronous simulation without considering the overhead due to handling of deadlocks. We do consider this overhead in the execution times of ISCAS-85 [4] and ISCAS-89 [5] benchmark circuits.
In the development of execution times of synchronous and conservative asynchronous simulation, Bailey first describes the circuit to be simulated in terms of a simulation dependency graph, $G$, which is a directed graph of events with each vertex representing an event in the circuit. The vertices in the graph are labeled with events and the edges are labeled with delays in the circuit. Both the events and the delays have positive integer values. If a parent event causes a child event, then there is an edge in $G$ from the parent event vertex to the child event vertex with a delay of the logic element corresponding to the child event. The execution times of synchronous and conservative asynchronous simulation are developed in terms of this graph. In Bailey's analyses, a fixed execution sequence is assumed, the evaluation time of each vertex in the graph is equal, an unlimited number of processors is available and the inputs to a circuit remain fixed during a simulation. Under these assumptions, it is then proved that the unit-delay simulation is a lower bound on the execution times of both synchronous and conservative asynchronous simulations and that these execution times are equal.
We continue a similar development here but relax the assumption that the inputs to a circuit are to remain fixed during a simulation. Since most practical simulations require testing the circuit with a large set of different inputs, it is more meaningful to analyze the parallelism and execution time of synchronous and asynchronous simulations under varying input conditions. As will be shown later in this paper, the presence of multiple input events allows the conservative asynchronous simulation to extract more parallelism due to its capability to process events belonging to different simulation times, and thus quite different conclusions are obtained as compared to [1]. In analyzing parallelism in the execution of a simulation, we examine both pipelining and concurrency in the processing of events. Pipelining corresponds to processing a stream of events on a line in a circuit. It is a measure of how quickly the next consecutive event on an input line of a logic element can be processed after the previous event has been consumed. Concurrency refers to parallel evaluation of different logic elements at a given time, in response to events on their inputs.
In the remainder of this paper, we analyze the synchronous and asynchronous simulations individually and develop bounds on the execution times for general multi-input, multi-output circuits experiencing an arbitrary number of input events. A relative comparison of the synchronous and conservative asynchronous simulation execution times is then presented to show that the conservative asynchronous simulation may execute faster. It is also shown that the conservative asynchronous simulation can further improve upon its execution time by employing safe lookahead, i.e., an evaluation based on a controlling value being present on one of a logic element's inputs. Finally, a comparison of the idealized execution times of the synchronous and conservative asynchronous simulation algorithms on ISCAS combinational and sequential benchmark circuits is presented to support our conclusions. It is shown that the conservative asynchronous simulation implementing the deadlock avoidance scheme exploits better pipelining and concurrency in element evaluations and, even with the overhead of NULL messages, executes faster than both the synchronous simulation and the conservative asynchronous simulation implementing the deadlock detection and recovery scheme. Except for allowing inputs to change during a simulation, the remaining assumptions throughout this paper are similar to Bailey [1], i.e., all logic elements have a unit delay, an unlimited number of processors is available, and each processor evaluates a logic element in $E$ time units.
An initial version of this paper was presented at the 6th IEEE Symposium on Parallel and Distributed Processing, October 1994.

EXECUTION TIME OF SYNCHRONOUS SIMULATION
Bailey shows the execution time of unit-delay synchronous simulation, $\tau_{s,u}$, to be

$$\tau_{s,u} = E \cdot (\mathrm{depth}(G) + 1) \qquad (1)$$

where $G$ is the simulation dependency graph. The limitation of (1) is that it is only valid for a single-input circuit with a single input event, or for multiple-input circuits in which an event occurs at exactly the same simulation time on the different inputs. To allow for multiple input events, equation (1) needs to be modified to take into account the pipelining effect due to a sequence of events on an input, and the possible concurrency due to events on different inputs. As an example of pipelining, if two events separated by one time unit are received on the input of a single-input circuit, then the execution time is

$$\tau_{s,u} = E \cdot ((\mathrm{depth}(G) + 1) + 1)$$

i.e., it increases by only one evaluation time. This is because while a logic element at depth $i$ in $G$ is executing the first event, the element at depth $i-1$ is executing the second event. On the other hand, if the two consecutive input events on a line are separated in simulation time by $t_{sep}$ time units and $t_{sep} \ge \mathrm{depth}(G)$, then

$$\tau_{s,u} = E \cdot (\mathrm{depth}(G) + 1) \cdot 2$$

i.e., the execution time is double that of a single event due to the lack of pipelining. In general, for a single-input circuit with a sequence of $e$ input events, the execution time of unit-delay synchronous simulation is bounded by

$$E \cdot ((\mathrm{depth}(G) + 1) + e - 1) \le \tau_{s,u} \le E \cdot (\mathrm{depth}(G) + 1) \cdot e.$$
For a completely general circuit, we must allow for an arbitrary number of external inputs, with each input experiencing a different number of events at different simulation times. The calculation of execution time then requires that the simulation dependency graph be identified for each external input. We denote by $G_i$ the section of the simulation dependency graph $G$ corresponding to an input $i$. Let $n$ be the number of external inputs and $e_i$ be the number of input events on an input $i$. Then the best-case execution time for unit-delay synchronous simulation is given by (2). It occurs when all input events on a line are separated by one time unit to extract maximum pipelining, and different inputs receive events at the same simulation time to achieve maximum concurrency.

$$\tau_{s,u} = \max_{i=0}^{n-1} \left( E \cdot ((\mathrm{depth}(G_i) + 1) + e_i - 1) \right) \qquad (2)$$

The worst-case execution time is given by (3) and occurs when all input events are separated in simulation time by an interval greater than or equal to the depth of the simulation dependency graph, such that there is no pipelining or concurrency (between different external input events).

$$\tau_{s,u} = \sum_{i=0}^{n-1} E \cdot (\mathrm{depth}(G_i) + 1) \cdot e_i \qquad (3)$$

We illustrate the best- and worst-case execution times using an example. An exclusive-OR circuit is shown as an interconnection graph in Figure 1. Figure 2 shows the simulation dependency graph $G_i$ for each input. The vertices in the graph are labeled with events and the edges are labeled with the delays in the circuit.
FIGURE 1 Circuit Interconnection Graph for an Exclusive-OR Circuit

If the inputs A and B in Figure 2 each experience two events at times 5 and 6, then from (2) the best-case execution time is $\tau_{s,u} = E \cdot ((3 + 1) + 2 - 1) = E \cdot 5$. The pipelining and concurrency in execution are shown in Figure 3 by mapping the simulation dependency graph to the execution time (assuming $E = 1$). To show the worst-case execution time, if input A experiences two events at times 5 and 15, and input B experiences events at times 10 and 20, then from (3), $\tau_{s,u} = E \cdot (3 + 1) \cdot 2 + E \cdot (3 + 1) \cdot 2 = E \cdot 16$. Figure 4 shows this by mapping the simulation dependency graph to the execution time for $E = 1$. Note that there is no pipelining or concurrency between different input events in this case because of the wide separation of events in simulation time in relation to the circuit depth.
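The two bounds for synchronous simulation can be checked numerically with a short sketch. The function names and the list-based argument layout below are illustrative assumptions; the arithmetic follows Equations (2) and (3) and reproduces the exclusive-OR example (depth 3 on each input, two events per input).

```python
def sync_best_case(depths, events, E=1):
    """Equation (2): maximum over inputs of E * ((depth_i + 1) + e_i - 1)."""
    return max(E * ((d + 1) + e - 1) for d, e in zip(depths, events))


def sync_worst_case(depths, events, E=1):
    """Equation (3): sum over inputs of E * (depth_i + 1) * e_i."""
    return sum(E * (d + 1) * e for d, e in zip(depths, events))


# Exclusive-OR example: inputs A and B, depth 3 each, two events each.
depths = [3, 3]
events = [2, 2]
print(sync_best_case(depths, events))    # 5
print(sync_worst_case(depths, events))   # 16
```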
For unit-delay conservative asynchronous simulation, assuming an unlimited number of processors, Bailey shows the execution time, $\tau_{c,u}$, to be

$$\tau_{c,u} = E \cdot (\mathrm{depth}(G) + 1) \qquad (4)$$

with $G$ the simulation dependency graph. It can be easily verified that (4) is not valid for any simulation other than a single-input circuit receiving a single event. Taking into account the effect of multiple input events, we develop the expressions for the best- and worst-case execution times of conservative asynchronous simulation for general multi-input, multi-output circuits. It should be noted that while the synchronous simulation processes all events at each simulation time completely before proceeding to the next time, the asynchronous simulation can concurrently process events belonging to different simulation times as they are produced. This allows asynchronous simulation to exploit pipelining, due to a sequence of events on an input of a logic element, as well as concurrency, due to different logic elements receiving events at the same execution time (not necessarily at the same simulation time). As an example, if two different logic elements are ready for evaluation because all their inputs have received tokens, and the tokens for the two logic elements contain different timestamps, the asynchronous model allows them to execute concurrently whereas the synchronous simulation will allow only one execution at a time.
For a general circuit with $n$ inputs and $e_i$ events on an input $i$, the best-case execution time of conservative asynchronous simulation is given by (5).

$$\tau_{c,u} = \max_{i=0}^{n-1} \left( E \cdot ((\mathrm{depth}(G_i) + 1) + e_i - 1) \right) \qquad (5)$$

Here $G_i$ is the section of the simulation dependency graph corresponding to input $i$. The best case occurs when there is maximum pipelining and concurrency available in simulation. Note that unlike synchronous simulation, the separation in terms of simulation time is not a factor for exploiting either pipelining or concurrency in asynchronous simulation.
The worst-case execution time for asynchronous simulation is caused by reduced parallelism due to the way it processes events. In asynchronous simulation, each logic element has to sequence the input events in terms of their timestamps to guarantee correct behavior. During evaluation, a logic element consumes the input token with the lowest timestamp and produces an output with a timestamp equal to the timestamp of the consumed token plus the delay of the element itself. Thus even if the events appearing on different inputs of a logic element were generated in parallel, a number of output events equal to the sum of all input events have to be generated sequentially in the worst case, thereby reducing the concurrency in simulation. An example of this is shown in Figure 5, where the two inverters concurrently process events belonging to different simulation times, but when passing through the AND gate, the generation of events is serialized on its output because of the differences in the input timestamps. In Figure 5, the execution time for the generation of each event is denoted as a subscript to the event and it is assumed that $E = 1$. The execution time for the output of a logic element equals one more than the maximum execution time on the front of its inputs. This is because, in conservative asynchronous simulation, a logic element is not ready for evaluation until it has received all of its inputs. In Figure 5, a trailing ellipsis (...) indicates additional events on an input, thus allowing the consumption of all events in the example.

FIGURE 5 An Example Showing Serialization of Generation of Output Events (format: event @ simulation time, with the execution time as a subscript)
The example shown in Figure 5 demonstrates that multi-input logic elements may reduce the concurrency in asynchronous simulation by serializing the generation of events if they receive events that are separated in simulation time on their different inputs. Taking this effect into account, the worst-case execution time for conservative asynchronous simulation is given by (6).

$$\tau_{c,u} = \max_{\text{paths}} \; \sum_{k} E \cdot \left( \sum_{i=0}^{fanin_k - 1} e_{k,i} \; - \sum_{i=0}^{fanin_{k-1} - 1} e_{k-1,i} \; + 1 \right) \qquad (6)$$

where $e_{k,i}$ denotes the number of events at input $i$ of a logic element at level $k$ in a given input-to-output path, and the outer sum runs over the levels $k$ of that path. Before applying (6), the number of events at each output of a logic element is computed by accumulating the number of events on the fanin lines of that element. Equation (6) is derived from the circuit interconnection graph. Starting from an input-to-output path, it accumulates the events at the inputs of a logic element at each level $k$, as indicated by the term $\sum_{i=0}^{fanin_k - 1} e_{k,i}$ (inputs are labeled from 0 to $fanin_k - 1$). The pipelining effect is taken into account by subtracting the number of events at the previous level through the term $\sum_{i=0}^{fanin_{k-1} - 1} e_{k-1,i}$ (this term is zero for $k = 1$). This accumulation of events is carried out for each input-to-output path and the execution time of the simulation is the maximum over all these paths. By applying (6) to the circuit of Figure 5, it can be seen that $\tau_{c,u}$ for the AND gate is $E \cdot 6$. To explain this further, both inverters in Figure 5 have two input events which can be executed in parallel, so for the $k = 1$ level the execution time is $E \cdot [2 \text{ (for this level)} - 0 \text{ (for previous level)} + 1] = E \cdot 3$. For the $k = 2$ level, the AND gate has 2 input events on each of its inputs, so the worst-case execution time would be $E \cdot [4 \text{ (for this level)} - 2 \text{ (for previous level, to account for the pipelining effect for } k = 1) + 1] = E \cdot 3$. Combining the execution time for all levels, we obtain a total time of $E \cdot 6$.
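The level-by-level accumulation described above can be sketched in a few lines of Python. The function names and the representation of a path as a list of per-level event totals are assumptions for illustration; each list entry is the total number of events at the inputs of the element at that level, as in the worked example for Figure 5.

```python
def async_worst_case_path(events_per_level, E=1):
    """Accumulate E * (events at level k - events at level k-1 + 1) along one path."""
    total = 0
    previous = 0                      # the previous-level term is zero for k = 1
    for events in events_per_level:
        total += E * (events - previous + 1)
        previous = events
    return total


def async_worst_case(paths, E=1):
    """Worst-case time is the maximum over all input-to-output paths."""
    return max(async_worst_case_path(path, E) for path in paths)


# Figure 5: inverter level with 2 events, AND level with 2 + 2 = 4 events.
print(async_worst_case([[2, 4]]))        # 6
# Three-level path with 2, 4 and 8 events at successive levels.
print(async_worst_case([[2, 4, 8]]))     # 11
```

Because consecutive terms cancel, the sum along one path reduces to $E$ times the number of events at the last level plus the number of levels (4 + 2 = 6 and 8 + 3 = 11 in the two cases above), which gives a quick sanity check on hand calculations.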

COMPARISON OF SYNCHRONOUS AND CONSERVATIVE ASYNCHRONOUS SIMULATION
The best- and worst-case execution times for synchronous and conservative asynchronous simulation are given by Equations (2)-(3) and (5)-(6) respectively. In comparing the best cases, it can be seen that Equation (2) for synchronous simulation is identical to Equation (5) for conservative asynchronous simulation. However, there are differences in the requirements for achieving this minimum time. The best case for synchronous simulation occurs when the events on an input are separated in simulation time by only one time unit to exploit maximum pipelining, and events on different inputs occur at the same simulation time to get maximum concurrency. The conservative asynchronous simulation does not have this requirement and is capable of exploiting both pipelining and concurrency for widely separated events. For instance, the asynchronous simulation of a chain of inverters executes in the minimum time given by Equation (5) regardless of the separation time of input events. In contrast, the synchronous simulation requires input events to be separated by only one time unit to achieve the best execution time. Note also that in most practical simulations, the input data to a circuit is held stable for at least the delay through the circuit. Thus the asynchronous simulation may achieve the minimum time but the synchronous simulation cannot, as the input events are almost always separated by more than one time unit in practical simulations.
In order to achieve the lowest possible execution time when there are multi-input logic elements involved, the conservative asynchronous simulation does require that the events on different inputs of a logic element have the same timestamps. This condition allows for consumption of multiple input events, thus minimizing the effect of serialization in the generation of output events. Hence this condition ultimately requires a fixed simulation time difference in the external input events (depending upon the delay of the path from each input of a logic element to the external input) to achieve the best execution time. This is a rather stringent requirement, as can be seen from an example. If the first input of an AND gate receives events through a chain of two inverters connected to an external input and the other input is an external input, then the external input events have to be separated by 2 simulation time units to result in minimum execution time in asynchronous simulation.
The minimum time given by Equation (5) would not be obtainable for most circuits because of the conflicting timing requirements from multiple paths through the circuit. Figure 6 illustrates this point using the data from Figure 3. The minimum execution time given by Equation (5) is not achieved by asynchronous simulation because events at the inputs to the AND gates are not optimally separated in time, leading to some serialization. The execution of this example using the asynchronous simulation takes 6 time units in comparison to 5 time units required for the synchronous approach, which accomplishes the task in minimum time.

FIGURE 6 Asynchronous Simulation Execution for the Exclusive-OR Circuit with Input Events Separated by One Time Unit (format: event @ simulation time, with the execution time as a subscript)
In short, the requirements on both synchronous and asynchronous simulation to achieve the best execution time as given by Equations (2) and (5) are quite strict. The best execution time may not be observed for either type of simulation. The requirements for Equation (2) would never be met in practical simulations, in which the input data is often held constant for at least the delay through the entire circuit. Likewise, the requirements for Equation (5) would not be achievable by most circuits having recombination of paths with different delays, although this is mitigated by not having an output event for each input event, as has been assumed in the development above.
The worst-case execution times for synchronous and asynchronous simulation are given by Equations (3) and (6) respectively. The synchronous simulation exhibits the worst-case execution time when all input events are separated by time intervals greater than or equal to the depth of the simulation dependency graph. In this case it is unable to exploit any pipelining or concurrency. However, asynchronous simulation can extract some pipelining and concurrency independent of the time separation of input events. This pipelining and concurrency are reduced when passing through multiple-input gates due to the serialization in generation of output events. For example, the asynchronous simulation shown in Figure 5 takes 6 time units (worst-case time for asynchronous simulation) to complete and has concurrent evaluations in the two inverters. The synchronous simulation cannot execute the events at the inputs of the two inverters concurrently since they belong to different simulation times, and thus it takes 8 time units (worst-case time for synchronous simulation) to complete. Another comparison is made in Figure 7, where the asynchronous simulation for the exclusive-OR circuit takes 11 time units to complete as opposed to 16 time units for synchronous simulation (which is shown in Figure 4). The execution time for the conservative asynchronous simulation can also be verified by applying Equation (6) to the exclusive-OR circuit in Figure 7 for the two input events on each input, yielding $[(2) - (0) + 1] + [(4) - (2) + 1] + [(8) - (4) + 1] = 11$ time units to execute. Thus, other than for some very special circuits, e.g., a completely serial circuit or a circuit with only one gate, the worst-case execution time of conservative asynchronous simulation will be less than that of synchronous simulation.
As shown by the above analyses and examples, Theorem 4 in Bailey's paper [1], that the execution times of unit-delay synchronous and asynchronous simulations are equal, is not valid for simulations experiencing multiple input events. Generally, the conservative asynchronous simulation can exploit better pipelining and concurrency as compared to the synchronous simulation for events that vary widely in their timestamps, and thus results in less execution time. The execution time of conservative asynchronous simulation can be further improved by incorporating safe lookahead as described in the next subsection.

Improving Asynchronous Simulation by Incorporating Lookahead
Asynchronous simulation can exploit lookahead to further improve upon its execution time. Lookahead corresponds to a prediction of the output when not all input tokens of an element have been received. In the conservative asynchronous simulation, lookahead should always produce a correct prediction of the output. This can be achieved by performing an evaluation based on a controlling input (e.g., 0 is the controlling value for an AND gate, and 1 for an OR gate). The presence of an input token with a controlling value is sufficient to determine the output. Hence, by incorporating lookahead in the conservative algorithm, it is not necessary to wait until a token is present on all inputs before an element can be evaluated. If any of the tokens at the front of an input queue of a logic element has a controlling value, the element is evaluated. The output token produced has a timestamp equal to the highest timestamp of the controlling input tokens plus the delay of the element.
In order to implement lookahead, each element maintains a "lookahead counter" and a location to store its controlling value. The lookahead counter stores the highest timestamp of the tokens at the front of the input queues that have a controlling value. Any incoming input token having a timestamp less than this lookahead count is absorbed. Thus many input tokens can be absorbed in one evaluation and an output produced with a much higher timestamp than would be possible without using lookahead. This minimizes the number of messages and improves the execution time of the conservative asynchronous simulation. The pseudocode for the lookahead-based conservative asynchronous algorithm is shown in Appendix B.
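A minimal Python sketch of the lookahead counter for an AND gate is given below. It is not the pseudocode of Appendix B: the class and method names are hypothetical, and the choice to consume the controlling token that set the counter (the `<=` comparison) is an assumption, since the text only states that tokens with a smaller timestamp are absorbed.

```python
from collections import deque


class LookaheadAndGate:
    """Hypothetical AND gate with a lookahead counter (controlling value 0)."""

    CONTROLLING = 0

    def __init__(self, n_inputs, delay=1):
        self.queues = [deque() for _ in range(n_inputs)]
        self.delay = delay
        self.lookahead = -1           # lookahead counter; -1 means "not yet set"

    def receive(self, i, value, timestamp):
        """Queue a token on input i, or absorb it if the counter already covers it."""
        if timestamp < self.lookahead:
            return False              # absorbed
        self.queues[i].append((value, timestamp))
        return True

    def try_lookahead(self):
        """Evaluate early if some front token carries the controlling value."""
        fronts = [q[0] for q in self.queues if q]
        controlling = [ts for value, ts in fronts if value == self.CONTROLLING]
        if not controlling:
            return None               # no lookahead possible
        self.lookahead = max(controlling)
        for q in self.queues:
            # Assumption: "<=" also consumes the controlling token itself.
            while q and q[0][1] <= self.lookahead:
                q.popleft()
        return (self.CONTROLLING, self.lookahead + self.delay)


gate = LookaheadAndGate(n_inputs=2)
gate.receive(0, 1, 5)
gate.receive(1, 0, 20)                # controlling value at timestamp 20
print(gate.try_lookahead())           # (0, 21)
print(gate.receive(0, 1, 12))         # False: absorbed, 12 < lookahead counter 20
```

In this run a single evaluation absorbs the token stamped 5 and later discards the token stamped 12, so one output token stamped 21 stands in for several ordinary evaluations.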
The lookahead scheme can be implemented on multi-input AND, NAND, OR and NOR gates. Inverters and exclusive-OR gates do not have any controlling values as such and thus cannot take advantage of lookahead. Lookahead can also be applied to edge-triggered flip-flops since, after the triggering edge has been detected, the output can be correctly predicted up to the next triggering edge.
Conservative asynchronous simulation of a combinational circuit composed of multi-input AND and OR type gates can generally improve its execution time by about 50% by employing lookahead. This can be seen by assuming the probability that the output of a gate is 0 to be 0.5, i.e., the output is 0 half the time and 1 the rest of the time. The number of gate evaluations using lookahead will thus be reduced by half, because half the time at least one of the inputs will have a controlling value. For sequential circuits, the conservative asynchronous simulation based on the deadlock avoidance scheme can have a much higher performance gain by using lookahead. This is because, in addition to the reduced gate evaluations, lookahead greatly reduces the number of NULL messages needed to avoid the deadlocks in feedback loops. Some results on benchmark circuits are presented in the next section that demonstrate the effectiveness of lookahead.

EVALUATION ON BENCHMARK CIRCUITS
We measured the execution times of combinational ISCAS-85 [4] and sequential ISCAS-89 [5] benchmark circuits on both synchronous and conservative asynchronous simulation algorithms. All circuits were simulated under unit delay, as unit delay has been shown to be the lower bound on the execution time of either the synchronous or the conservative asynchronous algorithm [1]. In the implementation of the synchronous algorithm, a timing wheel is used whose time slots contain events that can be executed in parallel. Thus for a given data set (with an unlimited number of processors and one time unit for evaluation of an element), the execution time of synchronous simulation is equal to the number of non-empty time slots.
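The slot-counting measure can be illustrated with a small Python sketch. It assumes, as in the analysis of the earlier sections, that every input event propagates through all levels of a unit-delay circuit and produces one event per level; the function name and the dictionary-based timing wheel are illustrative assumptions, not the implementation used for the measurements.

```python
from collections import defaultdict


def sync_execution_time(input_events, depth, E=1):
    """Count non-empty timing-wheel slots for a unit-delay circuit.

    input_events maps an input name to the simulation times of its events.
    Assumption: each input event causes one event at every level 0..depth.
    """
    wheel = defaultdict(list)         # simulation time -> events in that slot
    for name, times in input_events.items():
        for t in times:
            for level in range(depth + 1):
                wheel[t + level].append((name, level))
    return E * len(wheel)             # one evaluation time per non-empty slot


# Exclusive-OR example (depth 3): closely spaced versus widely spaced events.
print(sync_execution_time({"A": [5, 6], "B": [5, 6]}, depth=3))      # 5
print(sync_execution_time({"A": [5, 15], "B": [10, 20]}, depth=3))   # 16
```

Under this assumption the two calls reproduce the best-case and worst-case values obtained from Equations (2) and (3) for the exclusive-OR example.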
For conservative asynchronous simulation, we first implemented the algorithm presented in [6,9], which uses an avoidance scheme to handle deadlocks. This algorithm was then further improved upon by incorporating lookahead. Our lookahead implementation used lookahead on multiple-input gates as well as flip-flops. The pseudocodes for the conservative asynchronous algorithm and the improved form incorporating lookahead are given in Appendices A and B respectively. In this algorithm, NULL messages are generated only if there is a possibility of a deadlock. This is detected when one of the inputs of a logic element becomes empty as a result of an evaluation. In this case, the output is sent to its fanout elements regardless of a change from its previous value. Note that this is an optimization over Chandy and Misra's always-send-NULL-message strategy in [2][3].
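The send decision described above can be summarized in a small Python sketch. The function name, its arguments, and the string labels are hypothetical; the sketch only captures the rule that an unchanged output is forwarded as a NULL message when an input queue has just become empty.

```python
def message_to_send(new_value, old_value, queue_lengths_after_eval):
    """Decide what an element sends to its fanout after an evaluation.

    Returns "EVENT" for a changed output, "NULL" for an unchanged output
    that is sent anyway because an input queue became empty, or None.
    """
    if new_value != old_value:
        return "EVENT"                # ordinary event-driven behaviour
    if any(length == 0 for length in queue_lengths_after_eval):
        return "NULL"                 # possible deadlock: send redundant message
    return None                       # unchanged output, nothing to send


print(message_to_send(1, 0, [1, 1]))  # EVENT
print(message_to_send(1, 1, [0, 2]))  # NULL
print(message_to_send(1, 1, [1, 2]))  # None
```

By contrast, an always-send strategy would return a message in the third case as well, which is the source of the extra traffic that the optimization avoids.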
In our implementation, we have an input queue of size 16 for all inputs to a logic element. For an asynchronous simulation, performance improves as the input queue size is increased and usually saturates for a queue size of about 5. In our execution time measurements, an unlimited number of processors is assumed, with one unit evaluation time for a logic element and zero communication time for distributing tokens to the fanout of a gate. This is consistent with [1] and chosen so that the parallelism in an algorithm can be determined independent of the communication overhead. However, as communication time increases, the synchronous and asynchronous algorithms would perform relatively the same. The total time units to complete the asynchronous simulation were measured using the same data set as used for the synchronous simulation. Table I shows the characteristics of the benchmark circuits and the data set. Data for the ISCAS-85 combinational circuits (c prefix) consisted of 30 random sets. The length of a set for a particular circuit was adjusted so that the circuit would reach a stable state before the next data was entered, i.e., the length of a data set corresponds to the maximum depth of the circuit. Data for the ISCAS-89 sequential circuits consisted of 40 random sets. Data was preceded by several clock cycles to reset the flip-flops in the circuit. Data was changed only during the middle of the positive clock pulse, and remained constant for a single clock period. Clock cycle times were adjusted for different circuits so that the circuit would reach a stable state before the next clock cycle.
The results of the execution times of the two algorithms on combinational and sequential benchmark circuits are shown in Table II. It can be seen that the execution time of asynchronous simulation with lookahead is much lower than that of the synchronous simulation for all circuits. On average, the conservative asynchronous simulation is almost three and a half times faster than synchronous simulation for combinational circuits, and two times faster for sequential circuits. The redundant or NULL messages used in the asynchronous algorithm cause the overall execution time of conservative asynchronous simulation to increase because extra evaluations may take place at the element receiving these messages. The sequential circuit simulations generate a large number of NULL messages to avoid a large number of deadlocks (see Table IV). The execution time data in Table II includes this effect and, despite the overhead of NULL messages, the asynchronous simulation still outperforms synchronous simulation for combinational as well as sequential circuits when lookahead is employed.
We carried out a similar comparison between the synchronous algorithm and an asynchronous algorithm based on the deadlock detection and recovery scheme. In the deadlock detection and recovery scheme, the circuit is allowed to deadlock, which is a condition in which no logic element can evaluate because at least one of its inputs is missing a token. After a deadlock has been detected, the circuit recovers by computing a global minimum time "gmt" (which is the smallest time of an unconsumed event in the circuit) and updating token timestamps which are less than gmt to gmt [7]. Table III shows a comparison on benchmark circuits between the synchronous algorithm and the asynchronous algorithm based on the deadlock detection and recovery scheme (DDR). In Table III, it is assumed that the circuit recovers from a deadlock in 0 time. Even with this unrealistic assumption, the conservative asynchronous simulation based on the deadlock detection and recovery scheme performs worse than the synchronous simulation. This is because the deadlock detection and recovery scheme loses much of the pipelining when the circuit deadlocks, causing its performance to be worse than the synchronous simulation.
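The recovery step can be sketched in Python as follows. The data layout is an assumption made for illustration: each line keeps a `known_until` timestamp (the timestamp of its most recent token) and a list of `pending` timestamps of unconsumed events; the function name is hypothetical and this is one interpretation of the scheme in [7], not its implementation.

```python
def recover_from_deadlock(lines):
    """Advance stale token timestamps to the global minimum time (gmt).

    lines maps a line name to {"known_until": int, "pending": [int, ...]}.
    Returns gmt, or None when no unconsumed event remains.
    """
    pending = [t for line in lines.values() for t in line["pending"]]
    if not pending:
        return None                   # nothing left to simulate
    gmt = min(pending)                # smallest time of an unconsumed event
    for line in lines.values():
        if line["known_until"] < gmt:
            line["known_until"] = gmt # token timestamps below gmt are raised to gmt
    return gmt


lines = {
    "a": {"known_until": 3, "pending": [12, 15]},
    "b": {"known_until": 7, "pending": []},
    "c": {"known_until": 20, "pending": [25]},
}
print(recover_from_deadlock(lines))                       # 12
print([lines[name]["known_until"] for name in "abc"])     # [12, 12, 20]
```

After recovery, line "b" no longer blocks its element for times up to 12, but every such recovery restarts the flow of events from a single simulation time, which is consistent with the loss of pipelining noted above.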
It can be seen from Table III that when the number of deadlocks is relatively small (e.g., the c2670 circuit), the asynchronous simulation approaches the synchronous simulation execution times, and its performance is relatively worse when the number of deadlocks is high. The results in Table III agree with other researchers' conclusions about the relatively poor performance of the conservative asynchronous deadlock detection and recovery scheme as compared to the synchronous scheme. Soulé [7] has done a similar comparison on a variety of circuits under the same assumptions as ours (i.e., an unlimited number of processors and zero time to recover from a deadlock) and found the asynchronous simulation using the deadlock detection and recovery scheme to perform worse than the synchronous simulation. Soulé also examined the conservative asynchronous avoidance scheme and found it to be extremely poor relative to the synchronous simulation, but his implementation did not include the NULL message optimization that we have described; instead, a NULL message was sent out after every evaluation (Chandy and Misra's always-send-NULL-message scheme).
Table IV compares the NULL message overhead in different conservative asynchronous schemes based on deadlock avoidance. It can be seen that the conservative asynchronous scheme with lookahead has the least overhead in terms of NULL messages as compared to actual events in the circuit. Even though, for sequential circuits, the number of NULL messages in the lookahead-based avoidance scheme is two to three times more than the number of events, the execution time is still better than that of the synchronous simulation because of the increased pipelining and concurrency in event processing.
All ISCAS benchmark circuits were tested in this work. However, to keep the paper to a reasonable length, we report the results on only a few of these circuits. More results on other circuits can be found in [8]. The results on the remaining circuits are similar to the ones presented in this paper. Further, in an implementation on a data flow architecture based hardware accelerator with a limited number of processors [9], the performance of the synchronous and the optimized conservative asynchronous algorithms shows results similar to those we report in this paper.
Overall, the conservative asynchronous algorithm executes faster than the synchronous simulation because it can concurrently evaluate logic elements whose inputs carry timestamps that differ from those of other elements' inputs, and because it exploits better pipelining along with lookahead. The conservative asynchronous algorithm implementing the deadlock avoidance scheme maintains better pipelining of events on the input(s) of a logic element and thus executes faster than the deadlock detection and recovery scheme, in which the pipelining effect is lost when the circuit deadlocks.
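The evaluation rule that underlies this behavior (an element evaluates once every input holds a token, consumes the token(s) with the lowest timestamp, and stamps its output with that timestamp plus its own delay) can be sketched in Python as follows. This is an illustrative sketch only: the class `Element` and its method names are assumptions, and NULL messages and lookahead are omitted.

```python
from collections import deque


class Element:
    """A logic element with one token queue per input (hypothetical sketch)."""

    def __init__(self, func, delay, n_inputs):
        self.func = func      # logic function applied to the input values
        self.delay = delay    # propagation delay of the element
        # Each token is a (timestamp, value) pair, queued in time order.
        self.queues = [deque() for _ in range(n_inputs)]

    def ready(self):
        """An element is ready only when every input holds at least one token."""
        return all(len(q) > 0 for q in self.queues)

    def evaluate(self):
        """Perform one safe evaluation; return the output token or None."""
        if not self.ready():
            return None
        heads = [q[0] for q in self.queues]
        t_min = min(t for t, _ in heads)
        value = self.func(*[v for _, v in heads])
        # Consume only the token(s) carrying the lowest timestamp.
        for q in self.queues:
            if q[0][0] == t_min:
                q.popleft()
        return (t_min + self.delay, value)


if __name__ == "__main__":
    xor = Element(lambda a, b: a ^ b, delay=1, n_inputs=2)
    xor.queues[0].append((5, 1))
    xor.queues[1].extend([(3, 0), (8, 1)])
    print(xor.evaluate())  # (4, 1)
    print(xor.evaluate())  # (6, 0)
    print(xor.evaluate())  # None: input 0 has no token left
```

Because each element decides locally when it is ready, two elements holding tokens with unrelated timestamps can evaluate at the same step, which is the concurrency that a single centralized clock cannot exploit.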

CONCLUSIONS
In this paper, we have extended Bailey's analysis of synchronous and conservative asynchronous logic simulation by considering multiple input events. By taking into account both event pipelining and concurrency due to multiple input events, we developed expressions for the best- and worst-case execution times of synchronous and conservative asynchronous simulations for general multi-input, multi-output circuits. We then showed that the conservative asynchronous simulation can exploit better pipelining and concurrency when input event times vary widely, and thus can execute faster than the synchronous simulation in general.
Our conclusions are supported by the simulation execution times of combinational ISCAS-85 and sequential ISCAS-89 benchmark circuits on the synchronous and conservative asynchronous algorithms.
Even with the overhead of NULL messages, the conservative asynchronous simulation using the optimized deadlock avoidance scheme exploits better pipelining and concurrency, and thus executes faster than both the synchronous simulation and the conservative asynchronous simulation based on the deadlock detection and recovery scheme, which loses all its pipelining when deadlocks occur.
Thus our work presents important conclusions different from those previously proved in [1], and shows the effectiveness of conservative asynchronous simulation over synchronous simulation in terms of parallelism and execution time when a lookahead scheme is employed. Although the overhead associated with asynchronous simulation (maintaining input queues in each logic element, etc.) is higher than that of synchronous simulation, which makes it unattractive for software implementations, our work shows that it has high potential for hardware acceleration of logic simulation.

FIGURE 3 Execution Time for the Exclusive-OR Circuit Showing Pipelining and Concurrency in Synchronous Simulation

FIGURE 7 Asynchronous Simulation Execution for the Exclusive-OR Circuit with Widely Separated Input Events

TABLE I Characteristics of the Benchmark Circuits and the Data Used for Evaluating Simulation Algorithms

TABLE II Execution Times of Benchmark Circuits on the Synchronous Algorithm and an Asynchronous Algorithm Using the Avoidance

TABLE III Execution Times of Benchmark Circuits on the Synchronous Algorithm and an Asynchronous Algorithm Using the Deadlock Detection and Recovery Scheme

TABLE IV Comparison of NULL Message Overhead in Different Asynchronous Conservative Schemes Based on Deadlock Avoidance