Execution of VHDL Models Using Parallel Discrete Event Simulation Algorithms

In this paper, we discuss the use of parallel discrete event simulation (PDES) algorithms for execution of hardware models written in VHDL. We survey central event queue, conservative distributed and optimistic distributed PDES algorithms, and discuss aspects of the semantics of VHDL and VHDL-92 that affect the use of these algorithms in a VHDL simulator. Next, we describe an experiment performed as part of the Vsim Project at the University of Adelaide, in which a simulation kernel using the central event queue algorithm was developed. We present measurements taken from this kernel simulating some benchmark models. It appears that this technique, which is relatively simple to implement, is suitable for use on small scale multiprocessors (such as current desktop multiprocessor workstations), simulating behavioral and register transfer level models. However, the degree of useful parallelism achievable on gate level models with this technique appears to be limited.


INTRODUCTION
ver the course of the last decade, VLSI circuits being manufactured have escalated in com- plexity.This is a direct result of advances in manu- facturing technology.Simulation is now a vital tool used by VLSI engineers, both for verifying that a design meets its functional requirements, and for generating test vectors against which behavior of a manufactured circuit can be compared.
In order to simulate a design, the structure and intended behavior must be described.Structure can be described using circuit schematics, but behavior must be described using a hardware description lan- guage (HDL).An HDL is esssentially a specialized programming language, designed for this purpose.In 1987, the IEEE standardized a hardware descrip- tion language for digital circuits and systems called VHDL [8].Since then, VHDL has gained wide- spread use internationally, with simulators being made available by the main vendors of computer aided engineering software.Its adoption has been accelerated by the United States Department of De- fence requirement that circuits designed for it under contract be delivered with documentation in VHDL.Independently, many engineers world-wide have seen the need for a common design language to allow them to interchange design information, and to use design tools with standard interfaces.
The current problem in VLSI engineering is that simulation runs take excessive amounts of time to execute.Run times of several days are not uncom- mon for complex circuits, causing increased design costs and longer time-to-market.The trend is for run times to lengthen, since the combinatorial complexity of simulation is outstripping performance im- provements in computers.The delays induced may make a circuit uncompetitive when it is finally man- ufactured.The temptation is to reduce the amount of simulation done, at the risk of releasing a circuit with functional errors or having to re-iterate through the prototyping cycle.The underlying cause of the problem is that most of the simulation tools available commercially perform sequential execution of a model.However, VHDL is not inherently a sequen- tial language; indeed, it has been designed to allow parallel execution with deterministic results.
The Vsim Project at the University of Adelaide is investigating means of speeding up simulation by ex- ecuting VHDL models in parallel on multiprocessors and multicomputers.This can be achieved by processing a circuit's responses to numerous signal events in parallel.Parallel simulation is not a trivial problem, as the components of a model executing on different processors must appear to be kept in synchronization to avoid causality errors.The suc- cess of this approach depends on the degree of parallelism allowable within a simulation model, which in turn depends on the data dependencies within that model.Success also depends on the degree to which a simulator can find additional opportunities for parallelism without appearing to introduce causality er- rors.
There is a large body of research dealing with general techniques for managing parallel discrete event simulation (PDES).See [6] for a survey of the field.
These techniques model a system as a collection of logical processes, which interact by scheduling and reacting to events that occur at instants in simulation time.This framework is appropriate for simulating VHDL models, since such models consist of pro- cesses that represent the behavior of circuit com- ponents, and that schedule and react to changes of values of circuit signals.
The experiments being performed in the Vsim Project involve developing VHDL simulators using an appropriate selection of the published techniques, and determining which techniques give best speedup for different kinds of VHDL models.It is expected that VHDL simulation will be amenable to speedup using PDES algorithms, but the degree of parallelism implicit in real VHDL models is currently not clear, nor are the patterns of data dependencies relevant to this problem well understood.
As a first step, this paper identifies a number of PDES algorithms used for discrete event simulation in general, and evaluates their applicability to exe- cuting VHDL simulation models.In Section 2, we survey a number of PDES algorithms, then in Section 3 we discuss the characteristics of VHDL that are relevant when considering parallel simulation, and consider the applicability of the various PDES al- gorithms to execution of VHDL models.In Section 4, we describe an initial experiment performed using one of the PDES algorithms and present measure- ments made during execution of some benchmark models.

SURVEY OF PDES ALGORITHMS
Discrete event simulation algorithms view a physical system as a collection of interacting physical processes.A simulation model consists of a number of logical processes, each corresponding to a physical process, which communicate by exchanging time- stamped event messages.Each logical process con- tains local state, which includes a local clock to mea- sure the passage of simulation time.A logical process responds to an event message when its local clock reaches the time-stamp of the message.It then sim- ulates the action of the corresponding physical process by modifying its internal state using the infor- mation in the message, and by sending further messages with later time-stamps to other logical processes.
Correct execution of a simulation model can be guaranteed if the following causality constraint is sat- isfied: that a process receives and acts upon events in non-decreasing time-stamp order.Since a process may modify its local state when it acts upon an event, observing this constraint guarantees that the process' response to a later event cannot affect the response to an earlier event, as we would expect.
In some cases, this constraint is stricter than nec- essary to achieve correct execution.We say two events are independent if processing one has no ef- fect on the outcome of processing the other.Inde- pendent events can be processed in any order, ir- respective of their time-stamps.
A typical sequential implementation of discrete event simulation uses a time ordered central queue of events remaining to be simulated at later simu- lation times.The logical processes are implemented as coroutines, being resumed in turn as the simula- tion progresses.Conceptually, the local clocks of all the logical processes are subsumed by a global sim- ulation time clock.Simulation proceeds by the al- gorithm shown in Figure 1.The body of the loop implements what is called a simulation cycle.In each cycle, an event at a single instant in simulation time is processed.Since the queue is ordered by the time- stamp of the events, the events are processed in non- decreasing time-stamp order, and thus the causality initialize all logical processes this should cause some events to be scheduled while event queue not empty loop set the global simulation time to the time-stamp of the first event in the queue remove the first event in the queue and resume the destination process enqueue any new events scheduled by the process end loop FIGURE Discrete event simulation algorithm used in a sequential simulator.

Discrete Event Simulation
Centralized Queue ] Distributed Conservative Deadlock Avoidance Deadlock Recovery FIGURE 2 A taxonomy of discrete event simulation algorithms.constraint is satisfied.Note that the order of pro- cessing of events scheduled for the same simulation time is not specified, nor need it be.Since such events correspond to simultaneous events in the physical system being modeled, there is no inherent ordering.
There are three major classes of parallel discrete event simulation (PDES) algorithms that have been published in the literature: central queue PDES, con- servative distributed PDES, and optimistic distrib- uted PDES.Each of these has different approaches to dealing with parallelism, synchronization and communication of data within a model.Figure 2 shows a taxonomy of these algorithms.software processes.They execute a loop that receives an event message from the kernel, processes the mes- sage, then suspends.A process schedules an event message for another process by calling back to the kernel process, which updates the event queue.
The amount of parallelism available using this method is unclear, and may vary significantly be- tween models in different applications.If many events occur with exactly the same time-stamp, this method may give us significant speedup.However, in real models which include device propagation de- lays, we expect that events will be spread more evenly through simulation time.Hence the processes which receive these events will not be executed in parallel, even if the events are independent.Another problem is that the central queue is a shared resource between the processes executing during one simu- lation cycle, and is thus a potential bottleneck.
There are a number of possible ways to increase parallelism using a centralized event queue algo- rithm.One simple technique is to avoid serializing access to the central queue, so that it does not be- come a bottleneck when new events are added.An example of how this may be done is by dividing the queue into a number of subqueues, and placing events destined for different processes on different queues, using the destination process identification as a hash key.More elaborate techniques involve searching for independent events beyond the current time step.This search involves some model specific knowledge about the behavior of processes.
2.1 Centralized Queue PDES One simple way to achieve parallelism in executing a simulation model is to modify the sequential al- gorithm shown above, to process all events selected in one time-step in parallel.Since these events all have the same time-stamp, the causality constraint is not violated.The basic simulation algorithm, shown in Figure 3, is implemented by a kernel process, which maintains the central queue of event mes- sages.Logical processes are implemented as parallel 2.2 Conservative Distributed Algorithms In the conservative distributed PDES technique, in- formation about scheduled events is distributed amongst the receiving processes.Different processes may respond in parallel to events occurring at dif- ferent simulation times, but a process only receives an event message when it can determine that no caus- ality error will result.This is the meaning of the name 'conservative' for this technique.initialize all logical processes this should cause some events to be scheduled while event queue not empty loop set the global simulation time to the time-stamp of the first event in the queue repeat remove the first event in the queue and resume destination process until the time-stamp ofthe first event in the queue is greater than the current time wait until all processes have suspended end loop FIGURE 3 Centralized queue PDES algorithm.initialize loop set local-clock to the earliest input link-clock if any input link with link-clock = local-clock is not empty then remove first event from one of these links process he event may involve scheduling events on output links else wait until an event arrives on auy link with link-clock local-clock end if end loop FIGURE 4 Conservative PDES algorithm.
Each logical process in the simulation maintains its own local clock recording the simulation time it has reached.Events are sent between logical pro- cesses via links, which act as FIFO queues.The links are created at simulation startup, connecting com- municating logical processes.The conservative al- gorithm must ensure that a process does not receive an event with time-stamp earlier than the current local clock.This can be ensured in part by having processes send event messages in nondecreasing time-stamp order.In addition, each link has a clock associated with it, recording the time-stamp of the earliest message queued in the link.If the link is empty, the link clock records the time-stamp of the last message sent over the link.Each process re- peatedly selects a link with earliest clock, and pro- cesses the first event message from the link.If all of the links with earliest link clock are empty, the process blocks until a message arrives on one of them.The algorithm executed by each process is shown in Figure 4.
Unfortunately this algorithm may lead to deadlock of part or all of the simulation.If, as in Figure 5, a set of logical processes are connected by a cycle of empty links and those links are the ones which are selected by each process in the above algorithm, then all processes in the cycle will be blocked perma- nently.Techniques for dealing with deadlock include deadlock avoidance, and deadlock detection and re- covery.Deadlock avoidance involves the use of null mes- sages, which carry no data and are sent solely for synchronization purposes.When a process finishes processing an event which arrived at time it sends a null message on all of its output links with time- stamp + 6min, where min is the lower bound on the time-stamp increment for a message through the process.This increment can be thought of as the mini- mum propagation delay through the corresponding physical process.The null .messageindicates to the receiver that the sender will not generate an output event with time-stamp earlier than + min, SO the receiver may safely update its local clock.It also generates null messages on its output links, adding its minimum time-stamp increment.In this way, null messages are propagated around a cycle until some local clock value reaches the link clock of some other link not in the cycle.
There are two possible problems with this dead- lock avoidance technique.Firstly, it will only work if the sum of the time-stamp increments around any cycle of processes is non-zero.Thus it is not suitable for zero-delay abstract behavioral models.Secondly, the extra message traffic generated by sending null messages may be considerable.However, this can be reduced by sending null messages on a demand basis.A is waiting for a message from C, since the link from C has earlier link clock than A's other input link.Processes B and C are also waiting on empty links in the cycle.No process can pro- ceed, so the network is deadlocked.
If a process must block, it first requests null messages from those processes on the sending sides of the links on which it is about to block.
Deadlock detection and recovery involves a mech- anism (such as that described in [3]) to detect dead- lock in the distributed computation.Deadlock can then be broken by processing the messages with the smallest time-stamp.An advantage of this technique is that cycles in which the sum of the minimum time increments is zero are permissible.A disadvantage in typical implementations is that the mechanism to detect and recover from deadlocks is centralized and therefore may become a bottleneck.Furthermore, using that mechanism, only global deadlocks are de- tected, allowing local deadlock in a subnet of the simulation to remain undetected and reduce parallelism.
Shared memory implementations of conservative algorithms make possible several optimizations [4,  6, 11].For example, deadlock avoidance can be im- plemented by having a process which is about to block on an empty link directly inspect the local clock of the process at the other end of the link.This obviates the null message traffic that would otherwise be required.

Optimistic Distributed Algorithms
One major drawback with conservative algorithms is the fact that they may not fully exploit the potential parallelism in a model.If there is any chance that one process may affect another, then the process executions are serialized.On the other hand, opti- mistic algorithms, using the Time Warp paradigm [10], allow execution to proceed without regard to causality violations, and take corrective action when violations are detected.The time warp algorithm forms the basis of the experimental,parallel VHDL simulator developed at the University of Cincinnati [131.
As with conservative algorithms, each process in- cludes a local clock which measures simulation time for that process (called virtual time), and a queue of time-stamped event messages to be processed.How- ever, with optimistic algorithms, each process proceeds in simulation time at its own pace, on the as- sumption that no causality errors result.If at some point in simulation time an event arrives with time- stamp less than the local clock (a straggler event), then the receiving process must roll back.It must undo all state changes back to a simulation time at or before the time-stamp of the straggler, and re- commence.In order to be able to perform roll back, a process must save its state from time to time, and must also keep a list of all messages it has received so that it can process them again after rolling back.Furthermore, it must keep a record of event mes- sages it has sent to other processes, since some of them may result from execution which is subse- quently rolled back.When roll back occurs, the pro- cess must send an 'anti-message' for each message sent with time-stamp greater than the straggler.If the message to be cancelled is still in the destination's input queue, it is simply deleted.On the other hand, if the destination process has already processed the message, it too must be rolled back.This process continues recursively throughout the network of pro- cesses until all incorrect computation has been un- done.
In order to prevent unbounded growth of saved state, time warp simulators use the notion of Global Virtual Time (GVT), which is less than or equal to all processes' local virtual time clocks.Since no pro- cess need ever roll back past GVT, a process need only store enough state history to roll back to GTV.State for earlier times can be discarded.GVT is also used to determine when operations such as I/O that cannot be rolled back may be committed.
There are a number of optimizations to the time warp algorithm that may be used in a VHDL sim- ulator.For example, when a process rolls back, in- stead of immediately cancelling events it has generated, it may first check to see if the new events it generates are the same as the old ones.If so, the original events need not be cancelled.This is known as lazy cancellation.In the best case, lazy cancella- tion may allow execution to complete faster than critical path analysis would suggest.This may occur if the optimistic execution of a process turns out to have produced the correct results when a straggler arrives.The successor processes will already have completed correct execution, and need not be rolled back.In the worst case, delaying cancellation may allow incorrect computation to spread further through the network than normal cancellation, thus requiring more roll back in other processes and prolonging execution time for the simulation.The actual behavior can be expected to be somewhere between these extremes, depending on the model and input data.
Another optimization of time warp which may be effective in circuit simulation is lazy re-evaluation.If processing a straggler event does not change the state of a process, then it is not necessary to re-process the events that were rolled back, since they would produce exactly the same results.Instead, the pro- cess can 'jump forward' again to the virtual time to which it had previously made progress.This opti- mization may prove valuable in models that do not rely on internal state to respond to events, such as combinatorial circuit elements.
Optimistic algorithms rely on the frequency of roll back being sufficiently low that it does not significantly affect the progress of the simulation.However, in extreme cases, the processes could spend more time performing roll backs than useful com- putation, which may cause the parallel simulation to execute slower than a sequential simulation.Optimistic algorithms also presume that the space need to save state information is not excessive.Compared to conservative algorithms, optimistic algorithms avoid the possibility of deadlock, thus providing more opportunity for parallelism, albeit at the ex- pense of performing some computation which is subsequently discarded.The time warp algorithm is significantly more complex than conservative approaches, and there is no guarantee that it will induce a better speedup.Experimental evidence is needed to determine which technique produces best results for different types of models.

CHARACTERISTICS OF VHDL MODELS
Each of the PDES algorithms surveyed in the previous section are general purpose, and can thus serve as the basis for a parallel simulator for VHDL models.However, there are a number of aspects of the semantics of VHDL that affect how these algo- rithms are used.In this section, we comment on some of these aspects and discuss their implications for simulator implementations.We assume that the reader is familiar with VHDL, and refer the less familiar reader to a number of sources that describe the language [1, 5, 12] and to the IEEE Standard Language Reference Manual [8].We also consider implementation of some of the new features added to the language in the VHDL-92 draft standard [2,9].
One important distinction to draw when discussing PDES algorithms in the context of VHDL simulation is the different use of the term 'event'.A PDES 'event' corresponds to a VHDL 'transaction', that is, a time-stamped communication of information from one process to other processes.VHDL distin- guishes the special case of the new value on a signal being different from the old value with the term 'event'.To avoid confusion in the following discus- sion, we will use the terms transaction when refer- ring to a PDES event in the context of VHDL, and the term 'VHDL event' when referring to a trans- action that changes a signal value.The reason for making this distinction is that the PDES algorithm used must communicate transactions, not just VHDL events, between processes connected by signals. 3.1 Processes and Signals in VHDL Models PDES algorithms share a common view of a model to be simulated: they treat it as a collection of logical processes, which schedule events.Each logical pro- cess is a piece of sequential code, which is triggered by events scheduled for it by other logical processes.In a central queue PDES simulator, events are sent to a logical process by the kernel process.In a dis- tributed PDES simulator, events are sent from one logical process to another via event message links.
In a VHDL model, the processes are explicitly defined, and the links are defined by the signal nets.
(Compare this with some other modeling languages, in which the logical processes and event scheduling links are implicit, and must be extracted by code analysis.)A VHDL model consists of a hierarchical structure of component instances and processes, in- terconnected with signals.These are specified using VHDL entities (which represent objects to be used in the design), and corresponding architecture bodies (which implement the functionality of the entities).
An architecture body can contain component in- stances, with ports that are interconnected by signals.
Figure 6(a) shows an example of a hierarchical model.Each component instance is bound to an entity/ architecture pair (by a configuration specification), and can be elaborated by substituting the corre- sponding architecture body.The signals of the sub- stituted architecture body are connected with those of the enclosing body via the component instance's ports.The component instances in the substituted body can then be recursively elaborated.The recur- sive elaboration bottoms out when an architecture body is reached that contains only processes, and no further component instances.A process is a sequen- tial body of code that can affect the values of signals using signal assignment statements.A process con- taining a signal assignment for some signal s is said to contain a driver for s.
This elaboration procedure is carried out statically on a design as a precursor to executing the simula- tion.The result of elaboration is a collection of VHDL processes, each containing drivers for one or In (a), the entity Y has an architecture B, which contains two processes P1 and P2.The entity X has an architecture A, which contains two component instances C1 and C2, both of which are bound to the pair Y/B.In (b), the structure is elaborated, with architecture body B substituted for each of C1 and C2.The result is a network with two instances of P1 (C1.P1 and C2.P1) and two instances of P2 (C1.P2 and C2.P2).
VHDL process corresponds directly to a PDES log- ical process, and each signal net is implemented as a collection of PDES event message links.Whereas in the general PDES algorithms, a logical process sends an event message directly to a destination process, in a VHDL simulation a VHDL process sched- ules a transaction on a signal.This involves the driver for the signal duplicating ('fanning-out') the trans- action to all other VHDL processes which reference the signal, namely, all those VHDL processes which connect to the signal net for input.The semantics of VHDL process resumption also require an extension of the general PDES algorithms.In the general algorithms, a logical process is resumed when an event message is received, and executes code to respond to that message.A VHDL process, however, only resumes when a wait state- ment terminates.This may occur as a result of: a VHDL event on a signal in the sensitivity set on which the process has suspended (if no wait condition is specified), a VHDL event on a signal in the sensitivity set, and the wait condition being true (if a condition is specified), or the time-out period expiring (if one is specified).However, this does not fully describe the condi- tions to which the process must respond.It must also respond to any transaction that arrives on any input link and update the input value accordingly.In the case of conservative PDES algorithms, the process must update its local clock using the link clocks of all input links, even if the VHDL process is not wait- ing on some of the signals.In the case of optimistic algorithms, the process must respond to straggler messages on input links-for signals the VHDL pro- cess is not waiting for.These semantics suggest an implementation of a process in terms of a logical process, which executes the PDES algorithms, and which encapsulates a VHDL process (see Figure  7).
The logical process acts as a local kernel for the VHDL process, filtering input messages, and resum- ing the VHDL process when the wait statement on which it is suspended terminates.The VHDL process uses the services of the local kernel to send output messages as described throughout the rest of this paper.
In addition to the processes described above that define the behavior of a model, VHDL also allows the designer to specify a block of code for resolving the value of a signal driven by a number of sources.When a transaction occurs on any of the sources, a resolution function is invoked, which calculates the new value from the values of the sources.One way to implement resolution functions in a PDES simu- lator is to encapsulate the resolution function in a PDES process.The process has input links to receive transaction messages from each of the sources for the resolved signal, and has output links to processes that use the resolved signal value.The process re- sponds to input messages by computing the new sig- nal value and scheduling a transaction message on its output with no delay.(This must be a delay of 0fs+0delta to preserve VHDL semantics; see dis- cussion below on the VHDL semantics of simulation time.) A related issue is the implementation of VHDL GUARD signals, which require the evaluation of a guard expression.The implementation can be done similarly, with the guard expression being encapsulated in a PDES process.In this case, the input links to the process receive transaction messages for sig- nals denoted in the guard expression, and the output input output links links VHDL process FIGURE 7 Implementation of a simulation process as a VHDL process encapsulated in a logical process.link corresponds to the GUARD signal itself.Again, the process must evaluate the expression and sched- ule an output transaction with no delay.
A further complication arises from the fact that signals in a VHDL model may be of composite types (arrays or records).The semantics of the language require such signals to be treated as collections of scalar subelements.The semantics of assignment and sensitivity for composite signals are defined in terms of aggregating the effects of assignment and sensi- tivity to the scalar subelements.Hence a straight forward implementation could apply the techniques used for scalar signals to each subelement of a com- posite signal.An optimizing simulator might attempt to treat the composite signal as a single signal that communicates composite values, provided the model semantics are preserved.The major complication arises with resolved composite signals, where the res- olution function must resolve composite values from a number of sources.The difficulty is that transac- tions may occur on individual subelements of differ- ent sources at different times, and that each source may be composed of several subsources arising from drivers in different VHDL processes.The driving values must be collected together to form the com- posite values passed to the resolution function.The most straight forward approach is to perform this combination in the PDES process encapsulating the resolution function, as described above.The PDES process must have input links from each of the sub- element sources contributing to each composite value to be passed to the resolution function.

Simulation Time and Delta Delays
Whereas the general PDES algorithms discussed in Section 2 use a simple view of simulation time, VHDL simulators require a structured representa- tion for simulation time.This comes about from the specification of the run-time semantics of VHDL in terms of a sequential iteration of a simulation cycle.The cycle basically involves the following steps" 1. advancing simulation time to the time of the earliest scheduled transaction, 2. updating signals with the values of transactions scheduled for this simulation time, 3. resuming processes sensitive to VHDL events resulting from the signal updates, 4. waiting until the resumed processes have all suspended.If a process that is resumed in step 3 schedules a transaction with zero delay, the corresponding signal is not updated until the next iteration of the simu- lation cycle.No process sees the effect of the trans- action until the next cycle.This form of delay is called a delta delay, since the simulation time is the same for the next cycle, but some apparent time has passed.It is quite possible to write a VHDL model whose meaning depends on the correct ordering of process executions in successive simulations cycles after a delta delay, as shown in Figure 8. Indeed, zero delay behavioral models, and some low level models for switched circuit elements, require these semantics to be observed by a simulator.
The potential problems that arise lead simulator implementors to treat simulation time as a pair, con- sisting of a time value in fs, and a count of delta cycles at that time step.This presents no great dif- ficulty in a sequential simulator, or in a parallel sim- ulator using the central event queue technique, since the kernel can implement the VHDL simulation cycle as described above.However, in a simulator using a distributed algorithm, local clocks and time- stamps in transaction messages must include both the architecture a of e is signal sl, s2 bit := '0'; begin pl :process begin wait for I fs; sl <= '1' after 0 fs; wait" end process pl; p2: process begin wait for I fs; s2 <= '1' after 0 fs; wait; end process p2; p3: process begin wait until s '1 '; assert s2 '1' report "Wrong order!"; wait; end process p3; end a; FIGURE 8 A simple model that depends on correct ordering of process resumptions in the presence of delta delays.If correct VHDL semantics are observed, pl and p2 are resumed in the first cycle at time fs, then p3 is resumed after a delta cycle and no assertion violation occurs.On the other hand, if a simulator were to resume pl first, then p3 before p2, p3 would see the value '0' on s2, and raise an assertion violation.
time value and the delta cycle count, and compari- sons must be made using both elements.
The semantics of the VHDL simulation cycle also require another variation to the general PDES al- gorithms.If two transactions with the same time- stamp are sent to a logical process, one of which causes a VHDL event resulting in resumption of the VHDL process, both transactions must be received and the signal value updated before the VHDL pro- cess is resumed.Compare this with the general con- servative algorithm shown in Section 2.2, in which a process responds to an event as soon as the local clock reaches the link clock.The variation to the general algorithm shown in Figure 9 avoids the possibility of the VHDL process being resumed in re- sponse to one transaction, only to have another transaction with the same time-stamp arrive later.
The outer loop performs the normal conservative algorithm, so long as the process is not triggered by a VHDL event on a signal in its sensitivity set.When it is triggered, the inner loop performs the conserv- ative algorithm on input links, receiving transactions and updating signals, so long as there are link clocks equal to the local clock.When all input links have progressed to a later time, the process can guarantee that there will be no more transactions for the current time, so it can resume the VHDL process body.
A simulator using the time warp algorithm requires relatively little modification to deal with this aspect of the semantics of VHDL process resumption.
When a VHDL process is triggered by a VHDL event on a signal, it may resume immediately, on the optimistic assumption that no further transactions will arrive with the same time-stamp.If an input trans- action subsequently arrives with time-stamp equal to the local clock when the VHDL process was re- sumed, the transaction should be treated as a strag- gler.
A further complication arises when 'postponed processes' proposed for the VHDL-92 language, are considered.Such a process is sensitive to VHDL events in the normal way, but does not necessarily resume execution in the simulation cycle in which a triggering event occurs.Instead, it is resumed after all of the delta cycles for that time-step have been completed.While these semantics are relatively straightforward to implement in a sequential simu- lator or a central event queue parallel simulator, care must be taken in distributed parallel implementa- tions.
If a conservative algorithm is used, a postponed process may not respond to a transaction until it can guarantee that no other transaction at the same time- step will arrive.Figure 10 illustrates a scenario where the normal conservative algorithm would fail for a postponed process.
A possible variation to the conservative algorithm of Figure 9 for use by a postponed process is shown in Figure 11.In this case, the inner loop performs the conservative algorithm on input links, receiving initialize loop set local-clock to the earliest input link-clock if any input link with link-clock local-clock is not empty then receive first event from one of these links and update signal if VHDL process is triggered then while there is an input link with link-clock local-clock loop if any input link with link-clock local-clock is not empty then receive first event from selected link and update signal else wait until an event arrives on any link with link-clock local-clock end if end loop ---after loop, all input links have link-clock at later time than local-clock resume the VHDL process may involve scheduling events on output links end if else wait until an event arrives on any link with link-clock local-clock end if end loop   10 A scenario where the normal conservative algo- rithm would fail for a postponed process.The postponed process P has two input links, one of which is empty, with a link clock at + 0delta.The other contains a queued transaction with time- stamp.t+ 0delta, which causes a VHDL event that triggers the process.Immediate resumption of P would be incorrect, since the subsequent arrival of the transaction at simulation time + ldelta should cause the corresponding signal to be updated first.transactions and updating signals, so long as their link clocks remain in the current time-step.When all input links have progressed to a later time-step, the process can guarantee that there will be no more delta cycles, so it can resume the postponed VHDL process body.The process body may schedule trans- actions on output links, but the semantics of VHDL-92 require that they have non-zero delay, otherwise further delta cycles in the same time-step would re- sult.
A simulator using the time warp algorithm can deal with postponed processes relatively easily.A post- poned process may resume immediately it is trig- gered, on the optimistic assumption that there will be no further delta cycles in the time-step.(The re- quirement that transactions it schedules have non- zero delay still applies, however.)An input trans- action arriving subsequently with the time compo- nent of the time-stamp equal to the time component of the local clock when the process was resumed should be treated as a straggler.A simple expedient for implementing this is to reserve a special value of delta cycle count which is deemed to be greater than any other value, and setting the delta component of the local clock to this value when the postponed pro- cess is resumed.

Transport and Inertial Delay Semantics
The semantics of transport and inertial delay in VHDL necessitate further modifications to the basic PDES algorithms, due to the fact that a transaction scheduled by one signal assignment may subse- quently be deleted by another.If a signal assignment specifies transport delay, any previously scheduled transactions on the signal with time-stamps later than or equal to that of the new transaction are deleted.If inertial delay is used, any previously scheduled transactions with earlier time-stamps are also ex- amined, and some of those may be deleted.In both cases, the transactions to be deleted can not, in general, be determined until run-time, since a model initialize loop set local-clock to the earlies7nput link-clock if any input link with link-clock local-clock is not empty then receive first event from one of these links and update signal if process is triggered then while there is an input link with link-clock.timelocal-clock.timeloop set local-clock to the earliest input link-clock if any input link with link-clock local-clock is not empty then receive first event from selected link and update signal else wait until an event arrives on any link with link-clock local-clock end if end loop --after loop, all iput liks have lik-clock at later time-step than local-clock resume the postponed process may ivolve schedulin events o output links end if else wait until an event arrives on any link with link-clock 1octal-clock end if end loop FIGURE 11 Modified conservative PDES algorithms that correctly handles postponed processes.may include dynamically variable delay values in sig- nal assignments.We compare this to other simpler hardware description languages, such as that used by Soul6 in his parallel simulators [15], in which a fixed statically determined delay is specified for each pro- cess, and simple transport delay semantics are used.This means that each signal assignment schedules a transaction on a signal for a later simulation time than transactions from previously executed assignments, and no transactions need be deleted.
For parallel simulators using conservative algo- rithms, the prime concern in this context is that trans- actions be sent over message links in non-decreasing time-stamp order.If transport delay is used, the sim- ulator needs to determine a lower bound on the delay value for any assignments to a signal in a process (6min), since transactions later than the current sim- ulation time (T) plus this delay may be deleted, and therefore should not be sent.(This delay is the same value used for time-stamping null messages.)The process should hold a transaction scheduled on an output link until T + 3mn passes the time-stamp of the transaction.The worst case is 6min 0 fS, in which case the process cannot send a transaction until its local clock passes the time-stamp of the transaction.
If a process uses inertial delay, deletion of a sched- uled transaction with time-stamp earlier than T + 6min depends on the values in transactions scheduled by subsequently executed signal assignments.In gen- eral, these are determined at run time, and cannot be predicted, thus a transaction may not be sent until the local clock passes the time-stamp of the trans- action.However, in many cases it may be possible to determine these values through static analysis of the model code (e.g., using the techniques suggested by Willis and Siewiorek [16]), and thus guarantee that transactions won't be deleted.In these cases, a transaction may be sent before the local clock reaches its time-stamp, thus increasing the parallelism achievable.This is an example of the use of lookahead in a conservative algorithm.The literature on conservative algorithms suggests that the quality of lookahead information used is important in im- proving the performance of these algorithms (see [6]).
The proposal for VHDL-92 to include a pulse re- jection limit in an inertial delay specification provides opportunity to improve on the above behavior.The pulse rejection limit is a time interval, counting back- wards from the delay value specified in a signal as- signment, in which previously scheduled transactions may be deleted.Any transactions between the cur- rent time and the beginning of this interval are not deleted.Thus if the simulator can determine a lower bound on delays (6m,) and an upper bound on pulse rejection limits (Pm,x), transactions with time-stamps between T and T / (min Omax)can be sent.How- ever, the difficulty mentioned above also applies to determining Pmax, namely that, in general, the inter- val is determined at run time.
While considerable complexity is added to con- servative algorithms in handling these semantics, time warp algorithms can deal with transaction dele- tion much more simply.When a transaction is sched- uled in a time warp simulator, the process can send it immediately.If the transaction must subsequently be deleted, the process can simply send an anti- message to cause cancellation of the transaction and roll back of any optimistically executed computation.Note, however, that this does not prevent application of lookahead techniques to avoid sending a trans- action.If the simulator can determine that a trans- action will be deleted, it can avoid the cost of sending and subsequently cancelling the transaction.

Shared Variables
The proposal for shared variables in VHDL-9219] creates considerable extra complication for distrib- uted PDES algorithms.At the original time of writ- ing of this paper, the proposal included mechanisms for synchronizing processes access to shared vari- ables, allowing a designer to manage or exclude non- determinism in a model.However, since then, the IEEE ballotting body has voted to remove these mechanisms from the formal language definition, al- lowing uncontrolled access to shared variables.This step has been the cause of much contention, and it is unclear at this stage whether the originally proposed synchronization mechanisms will be adopted as an informal adiuct to the language.We have not yet addressed the issue of handling uncontrolled shared variables in a parallel VHDL simulator, as it is unclear what the expected semantics should be.The following discussion addresses inclusion of con- trolled shared variables, as described in the original proposal.
The proposed semantics are that shared variables may be declared in blocks that enclose processes, and the processes may refer to shared variables using access statements.Each access statement names a set of shared variables, and the runtime system is re- quired to guarantee exclusive access to these vari- ables while the sequential statements in the access statement are executed.The mechanism used by the runtime system to provide mutual exclusion between processes is required to be deadlock free.This is made simpler by the act that processes may not sus- pend whilst within an access statement, nor may they dynamically nest access statements.
In a sequential simulator, or a parallel simulator using a central event queue algorithm, shared vari- ables can be managed by the kernel.It can apply any of the well known resource allocation techniques used by operating systems and described in texts on the subject (for example, [14]).However, a simulator using a distributed PDES algorithm must also allow for the fact that different processes requesting access to shared variables may be at different simulation times.To preserve the language semantics, the im- plementation must synchronize processes that use shared variables, so that accesses occur in the correct order in simulation time.An obvious way to do this is to encapsulate each shared variable in a monitor process, which is responsible for granting access to one process at a time, and for serializing requests in the correct simulation time order.A process requests access to a shared variable by sending a time-stamped request message to the monitor process.When ex- clusive access is granted, the monitor process replies with the value in another message with the same time-stamp.When the requesting process has com- pleted its access statement, it relinquishes exclusive access by sending a third message, possibly contain- ing an updated value.
In a simulator using a conservative algorithm, the monitor process treats the request message channels as input links, complete with link clocks, and applies the conservative approach to serialize requests cor- rectly.On the other hand, in a simulator using an optimistic algorithm, a monitor process simply queues requests and grants exclusive access provided requests arrive in non-decreasing time-stamp order.When a straggler request arrives, it rolls the value of the shared variable back, and sends anti-messages for the grant messages it has since sent to other processes.These, in turn, must roll back and re-execute the access statements.

A CENTRAL QUEUE PDES IMPLEMENTATION
The Vsim Project at the University of Adelaide aims to develop a number of concurrent simulation ker- nels employing various of the different algorithms surveyed above.The aim is to use these simulators to fulfil a number of goals: to measure the amount of parallelism implicit in typical VHDL models, to determine types of structural dependencies via kernel instrumentation, and to use the information obtained from prototype kernels to guide the design and development of more advanced kernels.

Initial Implementation
For the production of the first simulation kernel we chose to adopt the centralized queue approach.A number of factors motivated this choice.Firstly, the way in which VHDL semantics are expressed seems to favour simulators with a central queue.While this might be regarded as a deficiency of the VHDL def- inition in terms of abstraction, it makes for a rela- tively simple mapping of the general algorithm to the specifics of VHDL.Our experience is that the pro- cess of development of such a kernel leads to a greater understanding of VHDL semantics, which can then be applied in the development of kernels for which the mapping is non-trivial.Secondly, using the centralized queue approach identifies a useful lower bound on the amount of parallelism available for a given model.Thirdly, a centralized kernel is easiest to instrument, whereas instrumentation of a distributed system would present additional problems.
To enable full concentration on the development of the kernel we chose to utilize an off-the-shelf front-end analyzer and code generator.This saved in development cost, but also precluded research in a number of interesting areas.For example, it made it difficult to perform static analysis of models for model-specific adaptation in the kernel.It also pre- cluded bunching of several (small) VHDL processes into a single simulation process, to control the gran- ularity of the final processes.
The front-end we chose was the CAD Language Systems Inc. (CLSI) Retargetable VHDL Code Gen- erator (RVCG).Kernels using this system are library files which are linked with the generated code object files for a given model to produce a standalone ex- ecutable for that model.In broad outline a simula- tion run proceeds as follows: Elaboration: the kernel initially has control, and calls functions provided by the generated code to elaborate the model.This elaboration process (which is single-threaded) includes calls to kernel entry points to create process, signal, driver and other VHDL objects.Control re- turns to the kernel at the completion of elab- oration. 2. Initialization: the kernel executes all VHDL processes.Each VHDL process executes as a separate thread.Since the sequential code of each VHDL process is part of the generated code, the kernel must callback into this code as necessary. 3. Execution: the kernel performs the simulation cycle as given in the Language Reference Man- ual [8], Chapter 12), re-activating VHDL pro- cess threads as required.The simulator currently runs on a sequential ma- chine (a single processor SparcStation), and is being used to collect measurement in accordance with the first aim stated above.The kernel design is structured in such a way that porting to a parallel architecture should present minimal problems.We have devel- oped the prototype kernel to stage where it supports a subset of VHDL sufficient to represent any elab- orated model.Some structural modeling aspects are not supported, but they can be manually translated into the behavioral subset which is supported.At this stage the instrumentation necessary to report structural dependency information is still to be de- veloped.12 Proportion of execution time spent in the different phases of the simulation cycle.The 4-bit register models describe a simple transparent latch register.The behavioral models consists of two processes, and the structural model consists of D-flipflop and gate elements.The DLX and multiplier models are larger high-level and low-level models respectively.Simulator overhead (light-weight process scheduling and context switching) is not in- cluded, as it is relatively insignificant.

Results
The results shown here are derived from two benchmark models.The first is a register transfer level model of the non-pipelined DLX processor, de-  scribed in [7].This serves as a simple high-level model, as the datapath is modeled using delta-delay components, and the control section and test bench memory are behavioral.There are 21 processes, 45 scalar signals (mostly of type bit) and 32 bit vector signals (mostly of 32 elements each) in the model.
The second benchmark measured was a low level model of a 32-bit shift-and-add multiplier, consisting of flip-flop, full-adder and gate components, with a simple state machine controller.This benchmark in- cludes simple timing for each of the components in the model.There are 132 processes, 10 scalar signals of type bit and 8 bit vector signals of 32 elements each.While we recognize that the results reported here are model dependent, the two benchmarks we have chosen are representative of two frequently used classes of VHDL models, and thus yield sig- nificant insights into simulator behavior.
The first result of interest concerns the proportion of execution time spent in updating signal values and determining the set of processes to resume in a given cycle, versus actually executing those processes.Whereas in general simulations the latter dominates, in VHDL the reverse is the case, as Figure 12 clearly shows.The reason is that in most VHDL models, most processes perform very little computation, whereas the semantics of signal assignment and signal update in VHDL are quite complex, irrespective of whether a PDES algorithm or a simple sequential simulation algorithm is used.This result lead us to a modification of the general centralized queue al- gorithm, to be described in Section 4.3.
The results obtained for the low-level 32-bit mul- tiplier are superficially encouraging.Figure 13 re- veals a large percentage of cycles with more than 90 Num Cycles Potential Parallelism in Cycle FIGURE 13 Histogram of cycle frequencies vs potential degree of parallelism for the multiplier model.execution time axis, each consisting of a queue processing phase and a process execution phase.The vertical axis represents the number of processes potentially executing in parallel.VHDL processes are above the dashed line, and queue update processes below.
processes resumed.Figure 14 shows a section of the parallelism profile for this model.The profile reveals the number of processes that could potentially exe- cute in parallel in each simulation cycle.This section, which is typical of the entire parallelism profile, shows that the vast majority of processes executing within a cycle in which 90 or more processes are resumed only execute for a very short time; usually they perform a single signal assignment.There are two problems with this kind of behavior: the sched- uling overhead is likely to dominate the execution time for such processes, and, within such cycles, a small number of processes execute for a significantly longer time than the majority.The execution of these processes must be completed before the next simu- lation cycle may begin.It should be noted that the first problem is a result of mapping each VHDL pro- cess to a separate logical process.Even in a sequen- tial simulator, logical processes still need to be sched- uled.The best strategy for a solution to both problems appears to be a modified code generator which 'bunches' several small VHDL processes into a single logical process, resulting in a decrease in the number of processes, but an increase in their granularity to a more useful size.We expect that this strategy would significantly improve performance of low-level simulations, for example, at the gate level.However, it is a difficult problem to solve, and the feasibility and effectiveness of such a strategy re- mains to be explored.
The parallelism profile for the DLX model does not exhibit this behavior to the same degree: the granularity of processes in this model is greater on average and also more even.However, as Figure 15  shows, the amount of concurrency obtained is very small.Two points are worth noting about this bench- mark.Firstly, the model contains only 21 processes, since the system is modeled at a high level of ab- straction.That is, the problem lies partially in the lack of parallelism implicit in the model rather than in the algorithm.Secondly, even though the amount of parallelism achieved is small, it is still significant.We consider this benchmark to be suitable for exe- cution on small scale multiprocessors, such as the multiprocessor workstations and servers that have appeared recently.

Modified Implementation
In light of the proportion of simulation time spent in the signal-update phase, as revealed by Figure 12, we made a VHDL specific modification to the sim- ulation algorithm.In addition to running each VHDL process as a separate thread, we multi- threaded the signal update phase.We replaced the single queue by multiple queues, each encapsulated in a thread which updates elements in the queue during the update phase of the simulation cycle.This corresponds to multiple concurrent 'kernel processes', to adopt the nomenclature of the language definition.Within a given cycle, there must be no dependencies between elements in different queues.We achieved this by using the object identification of the driven signal as a key to determine the queue into which to place transactions.Thus all transactions scheduled for a given signal are entered into the same queue, even if the transactions arise from different drivers in the model.
The multi-threaded signal update phase performs quite well, indeed for models such as the DLX, more concurrency is obtained in an 8-threaded update phase than in process execution.Two areas specific to multi-threaded signal update which could be fur- ther investigated are improving the distribution of elements across the queues, and determining an op- timal number of queues and update threads.

CONCLUSIONS
In this paper we have surveyed a number of PDES algorithms, which serve to improve the performance of general simulation problems by making use of multiple processors.We have pointed out the poten- tial advantages of using these algorithms in the con- text of executing VHDL simulation models.In par- ticular, the productivity improvements and the reduced time to market that can be gained from shorter simulation run times makes parallel execu- tion attractive.We have also pointed out some of the ways in which the algorithms must be adapted to form the core of a VHDL simulator, and suggested implementation strategies for both VHDL and VHDL-92.
An implementor of a VHDL simulator choosing a PDES algorithm must be aware of a number of factors: (1) the characteristics of the models expected to be simulated (implicit parallelism and data de- pendencies), (2) the relative complexities of imple- menting the algorithms, and (3) the characteristics of the target machine.While it is generally believed that optimistic algorithms give the best performance and speedup, other algorithms may be appropriate when all engineering considerations are taken into account.
As a first step towards understanding the trade- offs, we have implemented a simulator using the cen- tralized event queue algorithm.Our initial measure- ments lead us to an improvement on the basic al- gorithm, namely parallelizing the signal update phase of the simulation cycle.With this modification in place, measurements of two sample models indicate that the technique is appropriate for use in simulating high-level and medium-level models on small-scale multiprocessors.This is exactly the case in a devel- opment environment early in the design cycle, where engineers using multiprocessor workstations need fast turn-around for proof of concept models.Measurements of a low-level model indicate that this al- gorithm is less appropriate for such models, since the overheads of the run-time system dominate over VHDL process execution.A simulator using an op- timistic algorithm on a massively parallel multicom- puter may be more appropriate at this later phase of the design cycle.Also, the development of compiler optimization techniques to cope with this problem deserves further research.Further measurement is needed using a more extensive benchmark suite, both on our simulator and on simulators using other PDES algorithms, to gain a clearer understanding of the trade-offs.

FIGURE 6
FIGURE 6 Elaboration of a simple example hierarchical model.

FIGURE 9
FIGURE 9 Modified conservative PDES algorithms that correctly handles VHDL process resumption, based on the VHDL simulation cycle semantics.

FIGURE 14
FIGURE 14 Excerpt from the parallelism profile for the multiplier model.Individual simulation cycles are shown along the horizontal

FIGURE 15
FIGURE 15  Histogram of cycle frequencies vs potential degree of parallelism for the DLX model.