Scalable Clustered Time Warp and Logic Simulation

We introduce in this paper Clustered Time Warp (CTW), an algorithm for the parallel simulation of discrete event models on a general-purpose distributed memory architecture. CTW has its roots in the problem of distributed logic simulation. It is a hybrid algorithm which makes use of Time Warp between clusters of LPs and a sequential algorithm within the clusters, whereas Time Warp is traditionally implemented between individual LPs.


INTRODUCTION
A great deal of effort has gone into parallel logic simulation because reducing the running time of uniprocessor simulators can have a significant impact on the design of VLSI systems. The simulation of these systems has, in fact, become a bottleneck in the overall design process. [3] is an excellent survey of the work done in parallel logic simulation.
Recently, research in the area has turned from synchronous algorithms (such as the oblivious strategy in which all of the gates in a circuit are evaluated at each time step of the simulation), to the use of both conservative and optimistic asynchronous algorithms.
Conservative algorithms [4] are known to have low memory usage. On the other hand, avoiding, or detecting and breaking, deadlocks can greatly reduce the performance of these algorithms. This is especially true when large models with small computational granularity, such as those found in the domain of logic simulation, are considered. In general, conservative algorithms depend a great deal on lookahead to achieve good performance [8]. Given the large number of cycles in a circuit [2], this might present a serious drawback.
Optimistic algorithms [11] are very attractive for logic simulation since they can extract a great deal of parallelism and they are deadlock-free. Nevertheless, Time Warp studies have often pointed out the problems caused by the large amount of memory a simulation might require. Furthermore, it is unclear whether Time Warp remains efficient as the size of the simulation model grows.
The ideal algorithm would be one that would have the memory needs of conservative algorithms and the potential of optimistic algorithms to extract a great deal of parallelism.
Digital circuits are constructed by interconnecting functional units, which are themselves composed of different blocks. At the lowest level a block can be modeled as some combinatorial logic connected to a series of clocked registers or latches.
Figure 1 illustrates the hardware model of logic circuits [19]. We distinguish three phases:
1. An initialization vector is applied to the input latches and, once the signal is stable, clock G0 is activated.

FIGURE 1 General circuit structure.

2. The propagation vector travels throughout the combinatorial logic and reaches the output latches.
3. The output vector is then sampled when clock G1 is activated.
This suggests that the signal activity within the blocks is rather chaotic whereas the activity between the blocks tends to be more regular. The key idea would then be to use a conservative approach to synchronize all the gates of one block, and to use an optimistic approach to synchronize these blocks.
In the following pages, we present a new hybrid algorithm for the asynchronous parallel simulation of digital circuits (the algorithm can of course be applied to other types of simulations). The algorithm makes use of Time Warp between clusters of LPs running on different processors and uses a sequential algorithm within the clusters. We also demonstrate experimentally that the algorithm scales well to the simulation of large models with low computational granularity, while Time Warp does not. We christen the algorithm Clustered Time Warp [1].
The remainder of the paper is organized along the following lines. Section 2 contains a description of other hybrid algorithms. Section 3 describes the Clustered Time Warp algorithm along with an illustrative example. Section 4 contains experimental results in which Clustered Time Warp is compared to Time Warp. Section 5 describes our work on the scalability of CTW. Section 6 contains our conclusions.

RELATED WORK
A number of attempts have been made to combine the optimistic and conservative approaches. [20] allows a process to proceed optimistically but avoids sending potentially erroneous messages to other LPs. [13] employs a window protocol to prevent LPs from getting too far apart in simulated time.
In [17] the authors present an algorithm, Local Time Warp, in which LPs are grouped into clusters; Time Warp is used within the clusters and a conservative algorithm is used between clusters. By way of comparison, our algorithm makes use of Time Warp between clusters and a sequential algorithm within each cluster. We feel that Clustered Time Warp is more appropriate for massive fine-grained simulations such as digital logic simulation than Local Time Warp for two reasons: first, CTW takes advantage of the fact that LPs in a cluster share the same address space, thereby making their synchronization and scheduling straightforward; second, by using a conservative algorithm between clusters, Local Time Warp might reduce the parallelism of the simulated model. In any event, as no performance results were presented for Local Time Warp, it is difficult to compare it to CTW.
3. CLUSTERED TIME WARP

3.1. Clusters

In the Clustered Time Warp approach, the model is partitioned into clusters of LPs prior to the simulation. The motivation behind this idea is that logical processes modeling the gates which belong to the same functional unit can be grouped together. There is no restriction on the size or on the number of clusters, except that a cluster must reside on a single processor and cannot be split among processors. Each cluster is associated with a Cluster Environment (CE) which is in charge of scheduling the LPs. The Cluster Environment also takes care of all the communication with the other clusters; as a consequence, the CE manages an input queue and an output queue, called the Cluster Input Queue (CIQ) and the Cluster Output Queue (COQ) respectively.

Events
When an LP sends an event to another LP located in a different cluster, it gives that event to the Cluster Environment, which keeps a copy of it in its Cluster Output Queue as an antimessage, just like an LP would do in a pure Time Warp environment. The CE then sends the event to the cluster which hosts the destination LP of the event. When the receiving cluster gets the event, its CE simply enqueues it in the CIQ. Such events, which cross cluster boundaries, are referred to as external events. If an LP sends an event to another LP which is located in the same cluster, then it bypasses the Cluster Environment and enqueues the event directly into the input queue of the receiving LP. Events whose sending and receiving processes are located in the same cluster are referred to as internal events.
Events in the CIQ are sorted in increasing order of receive time whereas events in the COQ are sorted in decreasing order of sending time. The reason why different ordering strategies are used is simple. In a pure conservative approach, an event contains only one timestamp, which represents the moment at which that event occurred in the physical system. Processes sort the received events in increasing order of timestamp so as to be able to easily retrieve the event with the smallest timestamp value. In an optimistic approach, a process has two types of queues: an input queue, which stores received events in a similar way to a conservative system, and an output queue, which stores copies of events sent to other processes. When a straggler is received, the process rolls back by restoring an earlier state and sends antimessages. During this last operation, the process goes through its output queue to locate copies of events which were caused by messages whose receive time was larger than that of the straggler. In order to make this operation efficient, events stored in the output queues need to be sorted in decreasing order of sending time.
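The two queue orderings above can be sketched as follows. This is an illustrative sketch, not the paper's implementation; the class and field names are assumptions. The point is that keeping the COQ in decreasing send-time order makes the antimessages newer than a straggler a prefix of the queue.

```python
import bisect

class Event:
    def __init__(self, send_time, recv_time):
        self.send_time = send_time
        self.recv_time = recv_time

class ClusterQueues:
    def __init__(self):
        self.ciq = []   # external events, increasing receive time
        self.coq = []   # antimessage copies, decreasing send time

    def enqueue_external(self, ev):
        # keep the CIQ sorted by increasing receive time
        keys = [e.recv_time for e in self.ciq]
        self.ciq.insert(bisect.bisect_right(keys, ev.recv_time), ev)

    def record_sent(self, ev):
        # negate the key so the list stays in decreasing send-time order
        keys = [-e.send_time for e in self.coq]
        self.coq.insert(bisect.bisect_right(keys, -ev.send_time), ev)

    def antimessages_after(self, t):
        # copies with send time > t form a prefix of the COQ
        out = []
        for e in self.coq:
            if e.send_time <= t:
                break
            out.append(e)
        return out

q = ClusterQueues()
for st, rt in [(5, 9), (2, 7), (8, 11)]:
    q.enqueue_external(Event(st, rt))
    q.record_sent(Event(st, rt))
assert [e.recv_time for e in q.ciq] == [7, 9, 11]
assert [e.send_time for e in q.coq] == [8, 5, 2]
assert [e.send_time for e in q.antimessages_after(4)] == [8, 5]
```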
In Clustered Time Warp, a message has nearly the same structure as in Time Warp. It contains the identification of the sending LP and that of the receiving LP, a sign to differentiate messages from antimessages, a send time and a receive time, and the data needed for model evaluation. The difference with Time Warp lies in the way logical processes are identified. In Time Warp, an LP is identified in the whole system by a single name. In Clustered Time Warp, the name is composed of two parts: one that identifies the cluster and one that identifies the LP within the cluster. This naming methodology makes the implementation of a dynamic load-balancing algorithm much simpler. Instead of keeping a routing table in each processor of all the logical processes in the system, all that is needed is to keep the location of the cluster, which will then be in charge of forwarding the event to the appropriate LP. If the cluster happens to have been moved to another processor, only one entry needs to be changed in the routing table instead of changing the entries of all the LPs contained in that cluster.
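The benefit of the two-part naming scheme can be sketched in a few lines; the cluster and LP names below are hypothetical, chosen only to illustrate that migrating a cluster touches a single routing entry.

```python
# Per-processor routing table keeps only cluster locations; the cluster
# itself forwards an event to the LP named in the second half of the id.
cluster_of = {"C1": "P0", "C2": "P1"}   # cluster -> hosting processor

def route(lp_id):
    cluster, _lp = lp_id    # two-part name: (cluster, LP-within-cluster)
    return cluster_of[cluster]

assert route(("C2", "g17")) == "P1"

# Migrating cluster C2 to processor P2 changes one entry, no matter how
# many LPs C2 contains:
cluster_of["C2"] = "P2"
assert route(("C2", "g17")) == "P2"
assert route(("C1", "g3")) == "P0"
```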
There are three different types of messages in our simulation system: normal messages, which contain the events generated by the simulation itself; antimessages, which are necessary to cancel wrong computations; and control messages, which are needed to perform distributed computations such as the calculation of the GVT estimate, termination detection, or the collection of statistics.
In a system working under proper conditions, normal messages are the dominant source of communication overhead. Antimessages and control messages are comparatively less frequent, but their transmission delay is far more critical than that of normal messages. For example, the longer an antimessage takes to reach its destination, the more useless work the system is likely to perform, and therefore the longer it will take to cancel that work. Similarly, the longer a GVT token takes to be passed around, the less accurate is the GVT estimate, hence making the fossil collection mechanism less efficient. It is therefore necessary for antimessages and control messages to be given a higher priority than normal messages in order to ensure their fast delivery, especially when the traffic is heavy. Our simulation system is assumed to rest upon a network layer which provides reliable communication channels between the processors and in which messages can have different priorities. Furthermore, the Clustered Time Warp approach does not assume a communication system with FIFO properties.

Scheduling
The Cluster Environment is responsible for scheduling the LPs in the cluster, and each processor schedules all its CEs. A smallest-timestamp-first scheduling policy is used since it reduces the number of rollbacks. Lin and Lazowska [12] present a thorough study of the scheduling problem in which they confirm the advantage of the smallest-timestamp-first policy, and they even suggest making it preemptive. As a consequence, all the events stored in the CIQ and in the LPs' input queues are also put in a priority heap. The event at the top of the heap is the one which has the smallest timestamp; hence the destination LP of that event will be the next process to be scheduled in the cluster.
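The heap-based smallest-timestamp-first policy can be sketched with a standard binary heap; the event timestamps and LP names below are illustrative, not taken from the paper's example.

```python
import heapq

# One priority heap per cluster, holding (timestamp, destination LP)
# pairs for every pending event in the CIQ and the LP input queues.
heap = []
heapq.heappush(heap, (11, "LP3"))
heapq.heappush(heap, (7, "LP1"))
heapq.heappush(heap, (9, "LP2"))

# The LP whose pending event has the smallest timestamp runs next.
ts, next_lp = heapq.heappop(heap)
assert (ts, next_lp) == (7, "LP1")
assert heap[0] == (9, "LP2")   # next candidate after LP1
```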

Timezones
Since Time Warp is used between clusters, stragglers may arrive at any time. Therefore a mechanism must be created in order for the Cluster Environment to determine which LPs to roll back and which antimessages to send to cancel incorrect computations. This task is achieved through the use of timezones.
From the cluster's point of view, the simulation is decomposed into a series of adjacent and non-overlapping time intervals called timezones. When the simulation starts, each cluster has only one timezone with interval [0, +∞[. Each time a cluster receives a message from another cluster whose receive time is t, it finds the timezone interval [ti, ti+1[ into which t fits (i.e., ti ≤ t < ti+1) and splits it into two new timezones with intervals [ti, t[ and [t, ti+1[. Timezones are then stored in a table in increasing order of time.
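The timezone table can be sketched as a sorted list of interval start times (an assumed representation: zone i covers [starts[i], starts[i+1][, and the last zone is unbounded).

```python
import bisect

starts = [0.0]   # one initial timezone [0, +inf[

def split(t):
    """Split the timezone containing external receive time t at t."""
    i = bisect.bisect_right(starts, t)
    if starts[i - 1] != t:       # t strictly inside [t_i, t_{i+1}[: split it
        starts.insert(i, t)

split(7)
split(10)
split(7)                         # a second arrival at time 7: no new zone
assert starts == [0.0, 7, 10]    # zones [0,7[, [7,10[, [10,+inf[
```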

Logical Processes
Logical processes have a single input queue and no output queue. They also maintain their own logical clock, whose value is called the Local Simulation Time (or LST). The behavior of the clock is similar to that of a process' clock in a pure conservative system. If a process LPi, with clock LSTi, is about to consume message mp with timestamp t(mp), then the following operations are performed:
1. LSTi ← max(LSTi, t(mp)).
2. LPi processes mp.
3. LSTi ← LSTi + service time.
Furthermore, the LP also keeps track of the Timestamp of the Last Event it processed (or TLE). The TLE is different from the LVT (Local Virtual Time) introduced by Jefferson [11]. In Time Warp, the LVT corresponds to the timestamp of the next event the logical process is going to consume, whereas in Clustered Time Warp, the TLE value corresponds to the timestamp of the last event the LP processed.
When an LP is scheduled for processing, it first checks into which timezone the receive time of the event it is going to consume fits.If that timezone is different from that of the last event the LP processed, then the LP performs a checkpoint by saving its state.Otherwise it directly consumes the event.In short, the LP checkpoints each time it changes timezones.
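The checkpoint-on-zone-change rule above can be sketched as follows; this is a toy model (the zone table and state representation are assumptions), intended only to show when a state is saved.

```python
import bisect

starts = [0, 7, 10]   # example zones [0,7[, [7,10[, [10,+inf[

def zone(t):
    # index of the timezone containing virtual time t
    return bisect.bisect_right(starts, t) - 1

class LP:
    def __init__(self):
        self.tle = 0              # timestamp of last processed event
        self.state = "initial"    # current simulation state (toy)
        self.checkpoints = []

    def consume(self, recv_time):
        # checkpoint only when the new event lies in a different timezone
        if zone(recv_time) != zone(self.tle):
            self.checkpoints.append((self.tle, self.state))
        self.tle = recv_time
        self.state = f"after-{recv_time}"

lp = LP()
lp.consume(3)   # stays in [0,7[: no checkpoint
lp.consume(8)   # enters [7,10[: checkpoint taken
lp.consume(9)   # stays in [7,10[: no checkpoint
assert len(lp.checkpoints) == 1
assert lp.checkpoints[0] == (3, "after-3")
```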
Each LP is therefore composed of a process in charge of the actual event evaluation, a Local Simulation Time (LST), the Time of the Last Event it processed (TLE), a message input queue, and a state queue.
Figure 2 shows the structure of a cluster.
3.6. Rolling Back

Suppose the cluster receives a straggler with receive time ts. As we have just seen, the Cluster Environment creates a new timezone for the straggler. It then rolls back all the LPs in the cluster which have a TLE greater than ts to a checkpoint prior to ts. In addition, the CE will send all the necessary antimessages stored in the COQ whose sending time is greater than ts.
Since LPs do not perform a checkpoint every time they process an event, they might have to roll back to a state well before the receive time of the straggler or the antimessage received by the cluster. Therefore LPs need to coast forward as in Time Warp, re-processing all events whose receive time is prior to ts, and not resending messages already produced before ts. The major difference with Time Warp is that LPs can remove from their input queues all of the internal messages which have a send time greater than the timestamp of the straggler or the antimessage which caused the rollback. This does not affect the correctness of the simulation, as all the LPs in the cluster are rolled back; hence, all of the necessary internal messages will be regenerated. Note that the external messages stored in the Cluster Input Queue are not removed since their sending processes are located in different clusters and, as a consequence, such messages are not regenerated.
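The filtering rule above, applied to one LP's input queue after a straggler with receive time ts, can be sketched as follows (the dictionary-based event representation is an assumption):

```python
# Internal messages with send time > ts are discarded -- they will be
# regenerated because every LP in the cluster rolls back -- while
# external messages are kept, since their senders are not rolled back.
def rollback_queue(input_queue, ts):
    return [e for e in input_queue
            if e["external"] or e["send_time"] <= ts]

q = [
    {"id": "m2", "external": False, "send_time": 7},
    {"id": "m4", "external": False, "send_time": 11},
    {"id": "m5", "external": True,  "send_time": 12},
]
kept = rollback_queue(q, 10)
assert [e["id"] for e in kept] == ["m2", "m5"]   # internal m4 discarded
```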
Because the events in the cluster are processed in strict timestamp order (i.e., lowest timestamp first), the descendents of the straggler will be placed correctly in the heap, and events at all of the LPs in the cluster will be processed in the correct order. It is important to note that individual LPs never send antimessages.

Inputs
TLE is the receive time of the last event processed by the logical process.
e is the event to be processed, where t(e) is its receive time and ts(e) is its send time.
Σ is the state queue of the logical process.
Ω is the input queue of the logical process.

begin
    select the timezone Zr with interval [ti, ti+1[ such that t(e) ∈ [ti, ti+1[
    if TLE ∉ Zr then checkpoint
    TLE ← t(e)
    LST ← max(LST, TLE)
    simulate event e
    LST ← LST + service time
    for all events e' to send do
        ts(e') ← TLE
        t(e') ← LST
        if the destination LP of e' is in the same cluster then
            insert e' into the input queue of the destination LP
        else
            give e' to the CE for it to send
        endif
    endfor
end

FIGURE 5 An LP processes an event e.
Inputs
COQ is the Cluster Output Queue.
e is the event to be sent.

begin
    (1) send event e to the destination cluster
    (2) create the antimessage ē of e
    (3) COQ ← COQ + {ē}
end

FIGURE 6 An LP passes to the CE an event e to be sent.

3.8. Example

3.8.1. Receiving Messages

Figure 7a shows the space-time graph at a cluster composed of three logical processes. The x-axis represents the virtual time and the y-axis represents the location of the three LPs. Figure 7b shows the arrival of message m1, whose receive time is 7 and whose destination process is LP1.
Since m1 has been sent by an LP located in a different cluster, the Cluster Environment creates a new timezone starting at 7, which is indicated by the vertical line. Prior to the arrival of m1, the cluster had only one timezone with interval [0, +∞[. When m1 has been received by the cluster, there exist two new timezones with intervals [0, 7[ and [7, +∞[.

3.8.2. Processing Messages

Now LP1 is scheduled to process m1. Since m1 is located in timezone [7, +∞[ and LP1 is in timezone [0, 7[, the process performs a checkpoint and saves its state. The checkpoint is represented by the circle in Figure 8. Then, the process advances its local clock to the value of the receive time of m1 (indicated by the bold horizontal bar) and processes m1. A black triangle indicates that the message has been consumed, while a white triangle shows an unprocessed message. LP1 is now in timezone [7, +∞[. The processing of m1 triggers the sending by LP1 of messages m2 and m3, with receive times 9 and 11 and whose destination processes are LP2 and LP3 respectively. Since these two messages were generated within the cluster, no new timezone is created.
LP2 is now scheduled to process m2 since the receive time of this message is smaller than that of m3.Like LP1, LP2 saves its state before entering a new timezone, advances its local clock, and processes m2.Similarly, LP3 is scheduled in its turn, its state is saved and m3 is consumed.This triggers the sending of a new message m4 whose destination process is LP2 and receive time is 13.
All of the LPs are now in timezone [7, +∞[. Note that message m4 generated by LP3 did not create any new timezone because both the sending and the receiving processes are located in the same cluster. Such messages are referred to as internal messages. Similarly, messages sent between clusters are referred to as external messages. LP2 is now scheduled to process m4, but since m4 is located in the same timezone [7, +∞[ as LP2, the process does not save its state and directly consumes m4 (Fig. 8b).

3.8.3. Rolling Back
Suppose now that the cluster receives message m5 with receive time 10 and whose destination process is LP1. Since m5 is an external message, the cluster splits timezone [7, +∞[ into two timezones with intervals [7, 10[ and [10, +∞[. As Figure 9a indicates, LP2 and LP3 have already processed messages with a timestamp larger than that of m5 (which makes m5 a straggler). In order to preserve the correctness of the system, LP2 and LP3 are both rolled back to a state prior to the receive time of m5. Note that LP1 does not need to be rolled back since it did not process any message with a timestamp larger than that of straggler m5. After rolling back the processes, all the internal messages with a sending time larger than the receive time of the straggler are discarded, since they will be regenerated if necessary by the rolled-back LPs.
In the example, m4 has already been removed from the input queue since it has been processed by LP2.
Figure 9b shows the state of the cluster once the straggler m5 has been received, LP2 and LP3 are rolled back, and m4 has been discarded. Note that messages m2 and m3 have now been marked as not having been processed. The cluster now contains three timezones with intervals [0, 7[, [7, 10[, and [10, +∞[. LP2 can now coast forward, resaving its state and reprocessing m2. As for LP3, it does not need to coast forward since it does not have any event to process with a timestamp smaller than that of the straggler m5. Figure 10a shows the state of the cluster once LP2 has completed the coast-forward operation. The cluster can now resume its normal behavior by scheduling LP1 to process m5. Since LP1 is going to enter a new timezone, its state is saved (Fig. 10b).
LP3 is then scheduled next, saves its state before entering the new timezone [10, +∞[, processes m3 and sends m6 to LP2. Note that LP3 skipped directly over timezone [7, 10[ and did not perform a second checkpoint, since it would have been useless as no messages are processed by LP3 in that timezone. Finally, LP2 processes m6 after saving its state before entering timezone [10, +∞[.

Antimessages
Consider that the cluster is in the state depicted in Figure 10b and receives the antimessage of m5.
All LPs which have processed a message with a timestamp larger than or equal to the timestamp of m5 are then rolled back to a state prior to m5. Message m5 is now removed from the input queue, and the two timezones [7, 10[ and [10, +∞[ are merged into one single timezone [7, +∞[, as there are no more external input messages located in that interval. Figure 11a shows the state of the cluster once all LPs have been rolled back and message m5 has been annihilated. The cluster resumes and LP3 is now scheduled to process m3, which causes m7 to be generated and sent to LP2. Finally, LP2 processes m7 (Fig. 11b).
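Using the same sorted-list-of-start-times representation assumed earlier for the timezone table, the merge performed when an antimessage annihilates the only external event at a boundary is simply the deletion of that boundary:

```python
starts = [0, 7, 10]          # zones [0,7[, [7,10[, [10,+inf[

def merge_at(t):
    """Merge the two timezones meeting at boundary t (t != 0)."""
    if t in starts and t != 0:
        starts.remove(t)

merge_at(10)                 # the external event at time 10 is annihilated
assert starts == [0, 7]      # zones [7,10[ and [10,+inf[ become [7,+inf[
```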
3.9. Estimating the GVT

Our fossil collection algorithm differs somewhat from that of Time Warp. In the Clustered Time Warp approach, a state prior to the GVT must be saved, while in Time Warp this is not necessary.
The reason for this is that it is possible to roll back to a point prior to the GVT, because not every event is checkpointed. Similarly, the events prior to the GVT in the LP input queue cannot all be removed. As it is possible for the LP to roll back to a state prior to the GVT, events with timestamps smaller than the GVT might have to be reprocessed while the LP coasts forward. Once an estimate of the GVT has been calculated, all the LPs can discard the states prior to the GVT but one, preferably the one whose timestamp is closest to the GVT. Then, all the events whose receive time is smaller than the timestamp of the oldest remaining state can also be discarded. Figure 12 shows the pseudocode executed by each logical process when a new GVT estimate has been calculated.

FIGURE 11 (a) m5 is annihilated by its antimessage, the cluster rolls back, and (b) m3 is reprocessed.
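The fossil-collection rule in the text can be sketched as follows. This is a simplified model (states as (timestamp, value) pairs, events as bare receive times): keep the newest pre-GVT state as the coast-forward anchor, then drop everything older than it.

```python
def fossil_collect(states, events, gvt):
    """states: (timestamp, value) pairs sorted by timestamp;
    events: receive times sorted in increasing order."""
    pre_gvt = [s for s in states if s[0] < gvt]
    anchor = pre_gvt[-1][0] if pre_gvt else 0   # newest pre-GVT state survives
    states = [s for s in states if s[0] >= anchor]
    events = [e for e in events if e >= anchor]
    return states, events

states = [(2, "a"), (5, "b"), (8, "c"), (12, "d")]
events = [3, 6, 9, 13]
states, events = fossil_collect(states, events, gvt=10)
assert states == [(8, "c"), (12, "d")]   # (2,"a") and (5,"b") discarded
assert events == [9, 13]                 # events 3 and 6 discarded
```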
In the current implementation of Clustered Time Warp, a token-ring passing algorithm [14] is used, since the architecture used to develop the system (the BBN Butterfly) does not contain a large number of nodes (a maximum of 32 nodes).
Furthermore, even though the memory of the machine is physically distributed, the shared-memory paradigm guarantees atomic message delivery. Had the system been implemented on top of a communication network, an extra mechanism would have had to be developed to ensure that no message is hidden in a communication channel during the GVT calculation.

3.10. Space-based Checkpointing Techniques

All existing dynamic checkpointing techniques are time-based, since logical processes choose to change the checkpoint interval based on their rollback history. The checkpointing technique that naturally results from the Clustered Time Warp approach described previously is the first space-based checkpointing technique. In other words, the checkpoint interval of an LP depends on the origin of the messages it receives, and not on the rollback history of the process. In this section, we introduce two other variants of the original checkpointing technique developed for Clustered Time Warp.

Inputs
CIQ is the Cluster Input Queue.
COQ is the Cluster Output Queue.
GVT is the new estimated GVT value.
Σ is the state queue of the logical process.

begin
    let σold ⊆ Σ be the set of states S with TLE(S) < GVT
    discard all states in σold except the most recent one
    discard all events whose receive time is smaller than the timestamp of the oldest remaining state
end

FIGURE 12 Fossil collection performed by each LP when a new GVT estimate has been calculated.

3.10.1. Clustered Rollback, Clustered Checkpoint

In the Clustered Time Warp algorithm, when a straggler or an antimessage arrives at the cluster, all of the LPs which have processed an event with a receive time larger than that of the straggler or of the antimessage will be rolled back. The decision to roll back is therefore taken at the cluster level; thus we refer to this technique as clustered rollback.
Checkpointing is performed each time an LP changes timezone.Since timezones are dynami- cally created by the Cluster Environment depend- ing upon the arrival of messages coming from other clusters, we denote this mechanism as clustered checkpoint.
Clustered Rollback-Clustered Checkpoint (CRCC) is the rollback and checkpointing techni- que that naturally results from the Clustered Time Warp approach.
This technique has the advantage of reducing memory consumption by discarding all of the messages in invalidated timezones, as they will be regenerated. However, the expense of forcing these LPs to roll back each time an antimessage or a straggler arrives at the cluster is not negligible, especially if most of the events generated by the LPs within that cluster are not causally related to the event which caused the rollback. In such a case, only a few LPs actually need to be rolled back.

3.10.2. Local Rollback, Clustered Checkpoint

Since there is a risk of wasting computational resources in CRCC, due to the fact that all the LPs in a cluster are rolled back even if it is not necessary for them to do so, a compromise was sought in which the decision to roll back is made by the logical process itself.
In this new scheme, when a straggler or an antimessage is received by the cluster, the Cluster Environment updates the timezone table accord- ingly and places the event into the input queue of the receiving LP.LPs now behave much as they do in a pure Time Warp system: rolling back when they detect the arrival of a straggler in their input queue and sending antimessages when needed.
Hence, logical processes also need an output queue to keep track of the messages they send, in order to cancel wrong computations in case they have to roll back. As a direct consequence, the cluster no longer needs an input queue or an output queue; therefore the CIQ and the COQ can be discarded, and the Cluster Environment ends up only taking care of updating the timezone table when external events come into the cluster.
This technique is called Local Rollback- Clustered Checkpoint (LRCC) since the decision to roll back is made at the LP level, and checkpointing is still performed at the cluster level via the timezone table.
Although this scheme might incur less overhead in terms of computation, it is more expensive in terms of memory, since all the events in the LP input queue as well as those in the LP output queue have to be kept, as they will not be regenerated.

3.10.3. Local Rollback, Local Checkpoint

In this variant of Clustered Time Warp, an LP checkpoints only if it receives an external message, in other words a message that has been generated by an LP located in a different cluster. This scheme is simpler in the sense that LPs no longer need to check whether they are entering a new timezone. Furthermore, the Cluster Environment no longer needs to maintain a timezone table. Hence, compared to the other techniques described above, this scheme requires the least computational overhead.
In this scheme, the decisions to roll back and to checkpoint are both made at the LP level. Even though it is evident that an LP will have fewer checkpoints compared to the schemes described earlier, it is not at all obvious that it will save more memory. On the contrary, and although it appears counter-intuitive, this scheme can be more greedy. Since the distance between checkpoints is greater, the number of events an LP needs to keep (in order to coast forward if it rolls back to a state prior to the GVT) tends to grow. Therefore, there is a trade-off: the fewer states an LP saves, the more events it needs to keep. In the case of logic simulation, the size of an event is far from negligible compared to that of a state. Therefore the distance between checkpoints should not grow excessively if we want to keep memory usage to a minimum.

4.1. The Multiprocessor Environment
In this section, we evaluate the performance of Clustered Time Warp and the different checkpointing techniques introduced in the preceding section. Our algorithm is compared to pure Time Warp and to a variant of Time Warp using periodic state saving.
We used a BBN Butterfly GP1000 shared-memory multiprocessor for our experiments. The Butterfly is an MIMD machine composed of 32 processor nodes. Each node has MC68020 and MC68881 processors with 4 megabytes of memory; a high-speed multistage crossbar switch interconnects the processors. From a processor's point of view, remote and local memory references are identical, thus creating a global virtually shared memory space. The crossbar switch is a banyan network composed of 4 × 4 switch elements and is interfaced with each node by an AM2901 microprocessor whose purpose is to ensure the atomicity of memory operations performed on remote references.
An asynchronous message-passing layer was implemented on top of the shared memory so that the results obtained from running the different algorithms do not depend on the presence of shared variables, which would make any comparison unfair. Furthermore, a future port of the simulator to distributed memory architectures will be made easier. If our simulation system had used the shared-memory paradigm of the BBN Butterfly to share data structures such as event queues or process states, extrapolation of the performance results to distributed memory architectures would have been too hypothetical.
The message-passing layer provides two non-blocking communication primitives: send() and receive(). Messages can have either a low or a high priority. If a high-priority message is waiting, it is delivered to the processor before any low-priority message, regardless of its arrival time. Otherwise, if no high-priority message is waiting, low-priority messages are delivered to the processor in the order in which they were received.

4.2. Simulation System
Our logic simulation model uses three discrete logic values: 1, 0, and undefined. To model the propagation delay, each gate has a constant service time. All of the common logic gates were implemented: AND, NAND, OR, NOR, XOR, XNOR, NOT, and D-type flip-flops.
The circuits used in our study are digital sequential circuits selected from the ISCAS'89 Benchmarks. We present the results obtained from simulations of two of the largest circuits, s35932 and s38584 (Tab. I), since they are both representative of the results we obtained with the other circuits and have different characteristics. For example, circuit s38584 has a relative asynchronous parallelism nearly twice as high as that found in circuit s35932. The relative asynchronous parallelism was defined as the average number of events an asynchronous algorithm can process concurrently, divided by the total number of gates in the simulated circuit.
A program was written to read the netlists of the ISCAS benchmark circuits and to partition them into clusters. We used a string partitioning algorithm because of its simplicity, and especially because results have shown that it favors concurrency over cone partitioning; see for example [6]. The algorithm is similar to an in-order tree walk [7]. A gate connected to a primary input is first selected and assigned to a cluster. Its output is then followed and the same procedure is applied for each succeeding gate. When the cluster contains the desired number of gates, a new cluster is created and the algorithm resumes. Figure 13 shows a potential string assignment for circuit s27 for a cluster size of 4.
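The string walk described above can be sketched as a depth-first traversal of the netlist; the fanout-map representation and gate names below are assumptions for illustration, not the benchmark format.

```python
# Follow each gate's output depth-first, assigning gates to the current
# cluster; when the cluster reaches the desired size, start a new one.
def string_partition(fanout, primary_inputs, cluster_size):
    clusters, current, seen = [], [], set()

    def walk(g):
        if g in seen:
            return
        seen.add(g)
        current.append(g)
        if len(current) == cluster_size:
            clusters.append(current[:])
            current.clear()
        for succ in fanout.get(g, []):
            walk(succ)

    for g in primary_inputs:
        walk(g)
    if current:                       # flush the last, possibly smaller cluster
        clusters.append(current[:])
    return clusters

fanout = {"g1": ["g2"], "g2": ["g3", "g4"], "g4": ["g5"]}
clusters = string_partition(fanout, ["g1"], cluster_size=2)
assert clusters == [["g1", "g2"], ["g3", "g4"], ["g5"]]
```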
A simulation run can be decomposed into three phases. First, each processor starts up by loading the gates assigned to it and by creating their corresponding LPs. Then, each gate which has an initialized state produces an event for the gates connected to it. Some of these gates will be triggered and will propagate their changes throughout the circuit. After a while the system becomes stable, and events stop being generated. During the third phase, input vectors (previously randomly generated) are read and the simulation is run. Once the termination of the system is detected, statistics are collected.

4.3. Experiments

We conducted two categories of experiments: one to determine the effects of cluster size on the performance of each algorithm, and a second set of experiments to compare the performance (memory and execution time) of the algorithms with that of Time Warp. Because previous studies [5, 18] have shown that lazy cancellation does not actually perform better, we used an aggressive cancellation strategy in all our experiments. For each simulation run, three metrics were used to evaluate the performance of the algorithms: the simulation time, the peak number of states, and the peak memory usage.

Simulation Time
We define T to be the simulation time such that T = tn - t0, where t0 and tn are the real times at which the first and the last event, respectively, were processed by the system. T is expressed in seconds.

Peak Number of States
During a simulation run, each process LPi constantly monitors the size of its state queue. Let s_LPi(t) be the size of the state queue of LPi at real time t, with t0 <= t <= tn. We define the number of states of processor Pk at real time t to be s_Pk(t), the sum of s_LPi(t) over all LPi in Pk. Let the peak number of states of processor Pk be S_Pk = Max(s_Pk(t)) for t0 <= t <= tn. We define the peak number of states of a simulation as S = Max(S_Pk) over all Pk in Π, where Π is the set of processors involved in the simulation. The peak number of states is therefore the maximum number of states required by any host during the entire simulation.
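The peak-number-of-states metric amounts to a maximum, over all sampling instants and processors, of per-processor sums. A minimal sketch (the sampling representation is an assumption for illustration):

```python
def peak_number_of_states(samples):
    """samples: list of snapshots; each snapshot maps a processor name to the
    list of per-LP state-queue sizes observed at one sampling instant.
    Returns the maximum, over all processors and instants, of the summed
    state-queue sizes on a single processor."""
    peak = 0
    for snapshot in samples:
        for lp_sizes in snapshot.values():
            peak = max(peak, sum(lp_sizes))
    return peak

# Two hypothetical snapshots for processors P0 and P1.
samples = [{"P0": [3, 2], "P1": [4]}, {"P0": [1, 1], "P1": [6, 2]}]
print(peak_number_of_states(samples))  # 8 (P1 at the second instant)
```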

Peak Memory Usage
In addition to its state queue, LPi also monitors the size of both its input event queue and its output event queue. Let w_LPi(t) be the total number of events stored in these two queues at real time t. Similarly, each cluster Cj monitors the size of both its input queue and its output queue; let w_Cj(t) be the number of events they contain. Let c_s be the size of a state and c_e the size of an event. We define the memory usage of processor Pk at real time t as:

m_Pk(t) = sum over LPi in Pk of (c_s * s_LPi(t) + c_e * w_LPi(t)) + c_e * (sum over Cj in Pk of w_Cj(t))

Note that when the CRCC checkpointing technique is used, the LP output queues are always empty since LPs do not need an output queue. Similarly, w_Cj(t) = 0 for the other techniques since there is no cluster output queue and no cluster input queue.
Let the peak memory usage of processor Pk be M_Pk = Max(m_Pk(t)) for t0 <= t <= tn. We define the peak memory usage of a simulation as M = Max(M_Pk) over all Pk in Π, where Π is the set of processors involved in the simulation. The peak memory usage is therefore the maximum memory required by any host during the entire simulation and depends only on the number of states and the number of events stored in memory.
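The per-processor memory formula can be made concrete as follows. This is an illustrative sketch; the byte sizes and queue lengths are hypothetical:

```python
def memory_usage(state_sizes, event_counts, cluster_event_counts,
                 state_bytes, event_bytes):
    """Memory used on one processor at a given instant.  state_sizes and
    event_counts are per-LP lists (state-queue length, input+output event
    count); cluster_event_counts covers the cluster-level queues."""
    lp_mem = sum(state_bytes * s + event_bytes * w
                 for s, w in zip(state_sizes, event_counts))
    cluster_mem = event_bytes * sum(cluster_event_counts)
    return lp_mem + cluster_mem

# Hypothetical sizes: 64-byte states, 48-byte events.
print(memory_usage([3, 2], [5, 1], [4], state_bytes=64, event_bytes=48))  # 800
```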

Varying the Cluster Size
In this category of experiments, we ran a series of circuit simulations for each algorithm on a fixed number of processors (20). The only parameter changed during the tests was the size of the clusters. In the first run, the size was such that each processor hosted only one cluster. In the second run there were 2 clusters per processor, 4 in the third, and so on until a maximum of 256 clusters per processor was reached.

Peak Memory Usage
Figure 14 shows the peak memory usage in kilobytes vs. the number of clusters per processor for circuit s35932. The graph indicates a rather stable behavior on the part of LRCC and LRLC, with the minimal memory usage occurring at 2 clusters per processor. At this point LRCC needs 38% less memory than pure Time Warp to run the simulation, and LRLC 22% less.
As for CRCC, we observe a rather high memory usage when each processor contains only one cluster. This is indirectly due to the synchronization overhead incurred by the algorithm itself. When a straggler is received by a cluster, all the processes whose TLE is greater than the receive time of that straggler have to be rolled back. This operation is expensive since one straggler can roll back several hundreds of processes, even though most of these processes are not causally related to that straggler. This has the effect of desynchronizing the LPs, thus increasing the risk of rollbacks in other processors. The problem disappears when 2 clusters per processor are used: in this case, the cluster size is halved and the effect of a straggler becomes less dramatic. The memory usage for the CRCC checkpointing technique decreases until 4 clusters per processor, at which point it becomes constant. The data show up to a 40% difference in maximal memory usage between CRCC and Time Warp. Figure 15 shows the peak memory usage for circuit s38584. On the whole, the checkpointing techniques of Clustered Time Warp do not perform as well as in the previous case. For example, LRCC requires between 5 and 10% less memory than Time Warp and LRLC about 4 to 15% less. As for CRCC, the memory consumption is rather high from 1 to 4 clusters per processor. After that point, the memory usage drops to reach a minimal value at 128 clusters per processor, where the memory requirements are about 43% smaller than Time Warp's.
The difference in the peak memory consumption between the two circuits is due to the fact that circuit s38584 has a relative asynchronous parallelism nearly half that of circuit s35932 (see Tab. I).
This characteristic of circuit s38584 has two consequences. First, because fewer events are being processed in parallel, the Clustered Time Warp approach has less opportunity to take advantage of its sparse checkpointing techniques.
Consider, for example, an LP that receives only one event between two GVT computations. In such a case the checkpoint interval does not really matter, since the LP will have to perform at least one checkpoint anyway. Thus, in a simulation in which LPs process very few events, the memory saved by any checkpointing technique will not be very significant.
In addition, when a circuit having a small parallelism is simulated, the event population in the system is likely to be relatively small too, hence reducing the number of process states that have to be saved. Because fewer objects are being manipulated by the system, the estimated GVT tends to be closer to the actual GVT, and therefore the fossil collection mechanism is able to remove most of the useless states and events. As a direct consequence, the memory usage reduction that can be achieved by Clustered Time Warp is attenuated.

Simulation Time
Figures 16 and 17 show the simulation time vs. the number of clusters per host. We observe that CRCC has a significant overhead when compared to Time Warp. This is mainly due to the fact that some LPs are unnecessarily rolled back. Also, each time a cluster receives a straggler or an antimessage, the cluster has to check all of its LPs to find out whether or not they have to be rolled back. This overhead becomes more pronounced when the cluster size is large. From 64 clusters per processor onward, the simulation time for CRCC becomes constant and is about 34% higher than that obtained with pure Time Warp.
For both LRCC and LRLC, the simulation time is approximately constant for any cluster size.
LRCC is about 10% slower than pure Time Warp since clusters need to update their timezone table regularly, and because LPs check the table each time they are about to process an event. As for LRLC, it is about 5 to 15% faster than Time Warp because fewer states are saved. Consequently, the fossil collection mechanism has less work to do and can catch up quickly.
Relative to Time Warp, the fact that LRCC performs slightly better for circuit s38584 and LRLC performs better for circuit s35932 is again a direct consequence of the parallelism available in the circuit. LRCC is slower than Time Warp because of the overhead created by the timezone management; a smaller parallelism implies a smaller overhead, and thus better relative performance. Similarly, LRLC is faster than Time Warp because the checkpoint interval is sparse and the overhead of the garbage collection mechanism is reduced.
However, if the parallelism gets small, the event population becomes small too, and fewer fossil objects have to be collected. Therefore, the reduction of the garbage collection overhead is less significant.

Summary
Based on these results, we chose the cluster size for each algorithm which gave the best performance in order to use them in our second set of experiments.
For LRCC and LRLC, we chose one cluster per processor. In the case of CRCC, we chose 32 and 128 clusters per processor for circuits s35932 and s38584 respectively. In the second set of experiments we observed the behavior of the algorithms while varying the number of processors from 8 to 24. In addition, we also show the performance of a Periodic State Saving mechanism (PSS), a modified version of pure Time Warp in which the checkpoint interval is constant and larger than one. In our study, we chose a checkpoint interval of 3 as it proved to be an optimal value for a wide range of simulation types [15].

Varying the Number of Processors

Peak Number of States
The main reason why checkpointing techniques are used in optimistic algorithms is to reduce memory usage. Nevertheless, no study has so far demonstrated that a larger checkpoint interval necessarily results in a smaller memory usage. Figure 18 shows an example of two logical processes LP1 and LP2 whose checkpoint intervals are 3 and 2 respectively. Triangles represent events and circles represent checkpoints. Suppose a new GVT estimate is calculated and both LPs are about to collect their fossil objects. In addition to the state prior to the GVT, LPs need to keep all of the succeeding events in order to be able to restore their state during the coast-forward phase of rollback recovery. For this reason, LP1 does not actually have any fossil objects, whereas LP2 can delete 2 fossil events and 1 fossil state. Consequently, even though LP1 has a larger checkpoint interval, its memory usage is larger than that of LP2.
This problem is actually more important in the case of logic simulation, where the event size is of the same order as the state size. If the distance between checkpoints becomes too large, the memory used to keep events (needed for the coast-forward phase) could exceed the memory saved by skipping checkpoints, in which case the overall space performance of the algorithm might not be improved.
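The coast-forward constraint behind this trade-off can be made concrete: the last checkpoint at or before GVT must be kept, together with every event after it. A minimal sketch (timestamps are illustrative, and this is not the authors' fossil collector):

```python
def fossils(checkpoints, events, gvt):
    """Given sorted checkpoint timestamps and event timestamps of one LP,
    return the (checkpoints, events) that may be reclaimed.  The last
    checkpoint at or before GVT must be kept, along with every later event,
    so that the LP can coast forward after a rollback."""
    kept = max((c for c in checkpoints if c <= gvt), default=None)
    dead_ckpts = [] if kept is None else [c for c in checkpoints if c < kept]
    dead_events = [] if kept is None else [e for e in events if e < kept]
    return dead_ckpts, dead_events

# Hypothetical LP: checkpoints at t=1 and t=7, events in between, GVT = 8.
print(fossils(checkpoints=[1, 7], events=[1, 3, 5, 7, 9], gvt=8))
```

With a sparser checkpoint sequence, `kept` moves further back in time and more events must be retained, which is exactly the effect the text describes.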
To illustrate this problem, we measured the peak memory usage of each algorithm, as well as the peak number of states.
In Figures 19 and 20, we show the peak number of states for each algorithm vs. the number of processors for circuits s35932 and s38584 respectively. For both circuits, and regardless of the number of processors, all algorithms require less state saving than Time Warp. However, the LRLC checkpointing technique is by far the cheapest, storing some 70% fewer states than Time Warp in some cases. CRCC, LRCC and PSS all use approximately 30 to 40% fewer states than Time Warp.
Peak Memory Usage

In Figures 21 and 22, we show the peak memory usage of each algorithm vs. the number of processors for circuits s35932 and s38584 respectively.
In all cases, the proposed algorithms consume less memory than pure Time Warp. The phenomenon we described previously can now be observed. For circuit s35932, when compared to Time Warp, the CRCC checkpoint protocol, which saved half as many states as LRLC (see Fig. 19), actually performs much better than LRLC when the overall memory usage is considered (see Fig. 21). Similarly, when compared to Time Warp, the periodic state saving technique with a checkpoint interval of 3 (PSS) saves only between 9 and 16% of the memory usage, whereas it saved between 30 and 35% of the states.
These results show the importance of taking events into consideration for the design of checkpointing techniques for optimistic algorithms.
The same phenomenon is observed for circuit s38584 (Fig. 22). In this case, even though the activity of the circuit is much smaller than that of circuit s35932, the CRCC checkpoint protocol uses between 15 and 40% less memory than Time Warp depending on the number of processors being used. Also, despite the fact that the PSS protocol saved between 28 and 37% of the states, the total memory usage was actually reduced only by about 10 to 13%.

Simulation Time
In Figures 23 and 24, we present the simulation time of each algorithm vs. the number of processors. We observe that both LRCC and LRLC perform comparably to Time Warp. CRCC is from 30 to 60% slower than pure Time Warp in these examples. We note that this difference becomes less significant as the number of processors increases (since the memory is itself more distributed among the processors).

Speedup
In order to measure the speedup obtained with the parallel simulation system, we developed a sequential simulator. In this case, since the simulation is performed on a single processor, there is no need for synchronization; therefore no checkpointing is performed and events are deleted as soon as they are processed. As a consequence, no GVT algorithm is needed and the fossil collection mechanism is simply switched off. The scheduling of the processes is performed with a single heap and a minimum-timestamp-first policy. The sequential simulation of circuits s35932 and s38584 took 283 and 291 seconds respectively.
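The single-heap, minimum-timestamp-first scheduling described above can be sketched as follows. The event representation and the toy event handler are assumptions for illustration, not the authors' simulator:

```python
import heapq

def sequential_simulate(initial_events, handler):
    """Minimum-timestamp-first sequential simulation: a single heap schedules
    all events; each processed event may generate new events via `handler`,
    and processed events are simply discarded (no checkpointing, no GVT,
    no fossil collection).  Events are (timestamp, target, payload) tuples."""
    heap = list(initial_events)
    heapq.heapify(heap)
    processed = 0
    while heap:
        ts, target, payload = heapq.heappop(heap)   # smallest timestamp first
        for new_event in handler(ts, target, payload):
            heapq.heappush(heap, new_event)
        processed += 1
    return processed

# Toy model: each event before t=3 on gate g0 spawns one follow-up at t+1.
count = sequential_simulate([(0, "g0", None)],
                            lambda t, g, p: [(t + 1, g, p)] if t < 3 else [])
print(count)  # 4 events processed, at t = 0, 1, 2, 3
```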
Results are shown in Figures 25 and 26. As we have seen in Table I, the parallelism available in circuit s35932 is much higher than that available in circuit s38584 (the relative parallelism is twice as high); as a consequence, the speedup obtained from the parallel simulation of circuit s35932 is higher than that of circuit s38584. When the number of processors is relatively small, the overhead of the synchronization algorithm becomes more significant, and we observe that the speedup is actually better for the circuit with less concurrency. This clearly shows that the performance of asynchronous algorithms depends highly on the intrinsic parallelism available in the simulated circuits, but also on the ability of these algorithms to keep their overhead relatively small.

Summary

Figures 27 and 28 summarize the results by comparing each algorithm with pure Time Warp for circuits s35932 and s38584 respectively. For each algorithm, we give the minimum, the maximum and the average percentage difference from pure Time Warp for the maximum number of states, the peak memory usage, and the simulation time.
We first observed that each algorithm saves a substantial number of states, especially LRLC. However, these results do not necessarily translate directly into those obtained for total memory usage.
However, when the clustered checkpointing mechanism of CTW is employed (i.e., LRCC and CRCC), the performance is better in terms of memory consumption. This is due to the fact that in simulation models such as logic simulation, in which the size of the state of the LPs is approximately the same as the size of the events, it is important to consider the increase in memory needed to store the supplementary events retained because of the checkpoint interval.
As to the simulation time, only CRCC is much slower than pure Time Warp, whereas the other algorithms exhibited a speed comparable to Time Warp.
The results also point out a stable behavior of the algorithms with respect to the number of clusters employed.With this range of choices among checkpointing algorithms, it is possible to choose an algorithm depending upon the memory requirements of the simulation.

SCALABILITY
In addition to large memory consumption, it is also possible for the number of rollbacks in Time Warp to increase without bound. Phenomena such as cascading rollbacks, echoing and the dog-chasing-its-tail are examples of this problem [13].
In this section we briefly summarize some of our results on the scalability of CTW (with CRCC checkpointing) as compared to that of Time Warp.As space limitations preclude a detailed discussion, the interested reader is directed to [1].
We define the scalability of a Time Warp based system to be the rate at which the proportion of rolled back events to committed events increases relative to the size of the simulated model. We say that a Time Warp based system is unstable if the number of rolled back events during a simulation run is not bounded, making it impossible for the simulation to terminate in a finite amount of time.
The small number of large digital circuits publicly available makes it difficult to examine the scalability of our algorithms in the context of logic simulation. Consequently, we employed queuing networks in our experiments. This choice enabled us to relate the size and topology of the network to the performance of the algorithms. We used three different network models in our experiments [9]: a pipeline model, a hierarchical model and a distributed model. Each node in all of the models represents an n x n cluster of logical processes. In order to evaluate the scalability of Time Warp and CTW, we simply varied the cluster sizes. In our experiments the number of processes ranged from 10,000 to 60,000. Links were bidirectional and routing was random. Three metrics were used to characterize the behavior of the simulation: the throughput, the proportion of rolled back events, and the maximum memory usage. The throughput is the number of committed events per second; it provides a measure of how fast the simulation advances in real time. We employed deterministic, uniform and shifted exponential service distributions at each of the nodes.
For all of the models and distributions, we noted that the throughput of Time Warp decreases from 2 to 5 times faster than that of CTW. The memory requirements of CTW are 3 times smaller than those of Time Warp for the pipeline model and twice as small for the other models. As an example of this behavior, Figures 29 to 31 depict the behavior of the distributed model.

CONCLUSIONS

We have introduced, in this paper, Clustered Time Warp, an algorithm for the parallel simulation of large discrete event models which have a low computational granularity. While CTW was implemented on a distributed memory architecture, it can also be implemented in a shared memory environment.
CTW is a hybrid algorithm in that it makes use of Time Warp between clusters of LPs and a sequential algorithm within each cluster. We described three variants of CTW, each occupying a different point on the memory versus execution time trade-off continuum; in order of increasing memory utilization (and decreasing execution time), these variants are CRCC, LRCC and LRLC. The performance of CTW was examined by making use of both gate-level circuit simulation and queuing network models. The logic simulations investigated the memory versus execution time trade-offs of the three variants of CTW along with Time Warp and Time Warp with periodic state saving. Two of the largest benchmarks in the ISCAS'89 benchmark suite were simulated.
We also investigated the stability of CTW compared to that of Time Warp in the context of several queuing network models. At issue in these experiments was the known propensity for Time Warp to be subject to rollback explosions. LP populations varying between 10,000 and 60,000 were employed in these studies, along with a fixed number (20) of processors. In all cases, CTW (using CRCC) proved to be far more resistant to these rollback explosions than Time Warp. The throughput of Time Warp decreased from two to five times faster than that of CTW, while the memory requirements of CTW were a third of Time Warp's for the pipeline model and half of those for the two other models. These experiments would appear to indicate that CTW has the capability of limiting the rollback explosions associated with Time Warp; the degradation in performance due to phenomena such as cascading rollbacks and the dog-chasing-its-tail can then be contained.
A number of issues related to CTW remain to be investigated. One such topic is the question of cluster size. While we determined appropriate cluster sizes for our experiments empirically, it is reasonable to assume that different models simulated on different platforms would result in different cluster sizes. Some form of automated procedure to determine cluster sizes would certainly be desirable.
Other topics are the issues of model partitioning and load balancing. Dynamic load balancing algorithms for CTW are described in [1]. A related issue is that of flow control between clusters. It might be possible to develop an integrated load balancing and flow control algorithm which could help maintain the stability of CTW in a distributed memory environment. These issues are analogous to the work done on the memory management of Time Warp in a shared memory environment.
Finally, and most importantly, it is important to evaluate the performance of CTW in realistic simulations, for example register-level VLSI simulations of circuits with 250,000 to 500,000 gates. Each of these questions is the focus of on-going research efforts.
We remain optimistic.

Figures 3 and 4 give the pseudocode executed by each logical process in a Clustered Time Warp system.

Figures 5 and 6 give the pseudocode executed by each cluster environment in a Clustered Time Warp system.

FIGURE 3 The logical process is about to process an event.

FIGURE 5 The cluster has received event e.

FIGURE 7 (a) The system starts and (b) message m is received for the LP.

FIGURE 10 (a) LP2 coasts forward and (b) the cluster resumes and proceeds normally.

FIGURE 12 Operations performed by the LP once a new GVT estimate is calculated.

FIGURE 13 Example of a string partitioning for circuit s27 (cluster size 4).

FIGURE 14 Memory vs. Number of clusters per processor (circuit s35932).

FIGURE 15 Memory vs. Number of clusters per processor (circuit s38584).

FIGURE 16 Simulation time vs. Number of clusters per processor (circuit s35932).

FIGURE 17 Simulation time vs. Number of clusters per processor (circuit s38584).


FIGURE 24 Simulation time vs. Number of processors (circuit s38584).

FIGURE 29 Throughput for the distributed model.

FIGURE 30 Rollbacks for the distributed model.
t_rbk is the timestamp to which the process should roll back; the LP is told by the CE to roll back to time t_rbk.

FIGURE Timezone management performed by the cluster upon receipt of event e:
begin
  delete timezone Zi with interval [ti, ti+1[ such that tr(e) is in Zi
  if e is an antimessage then
    delete timezone Zi-1 with interval [ti-1, ti[
    create timezone Z with interval [ti-1, ti+1[
  else
    create timezone Z with interval [ti, tr(e)[
    create timezone Z' with interval [tr(e), ti+1[
  endif
  for each LP in the cluster with a TLE >= tr(e) do
    tell the LP to roll back to a state prior to tr(e)
end

In the example, the timezone [7, +inf[ is split into two new timezones with intervals [7, 10[ and [10, +inf[, as Figure 9a shows.