Investigation of a Superscalar Operand Stack Using FO4 and ASIC Wire-Delay Metrics

Complexity in processor microarchitecture, and the related issues of power density, hot spots, and wire delay, are seen as major concerns for design migration into the low-nanometer technologies of the future. This paper evaluates the hardware cost of an alternative to register-file organization, the superscalar stack issue array (SSIA). We believe this is the first such reported study using discrete stack elements. Several possible implementations are evaluated, using a 90 nm standard cell library as a reference model, yielding delay data and FO4 metrics. The evaluation, including reference to ASIC layout, RC extraction, and timing simulation, suggests a 4-wide issue rate of at least four Giga-ops/sec at 90 nm and opportunities for twofold future improvement by using more advanced design approaches.


Introduction
Current trends in semiconductor technology, and in particular the International Technology Roadmap for Semiconductors [1], suggest that future concerns in microarchitecture at the VLSI level will pose significant challenges. These include increasing power density [2], progressively severe thermal hot spots in increasingly complex designs [3], the impact of growing static power [4], and the problem of wire versus gate delay and power scaling [5,6]. Such problems are often most acutely exposed in key mainstream processor components such as the cache, register-related logic such as reorder buffers and rename logic, and the register file itself. Any alternative scheme to the traditional register-based computing paradigm can therefore open up the possibility of new approaches to these problems. However, register files are so highly optimized that measuring alternatives now requires complete layout of an optimal design for comparison, followed by timing and power analysis, rather than anything as simple as a functional comparison of abstract logic. This paper focuses upon one possible unexplored option for operand storage whose structure is an alternative to that of a register file. The questions we examine are (a) can a LIFO (last-in-first-out) stack support superscalar operand access and (b) what is its performance relative to established mainstream approaches.
This work is undertaken with a 90 nm UMC CMOS process library; however, we ultimately utilize FO4 as a delay metric [7] in order to provide a general measure of performance that can be scaled to other process nodes. The work is undertaken using standard cell digital libraries and not at the transistor level. Although this is therefore not an optimal solution, it permits rapid assessment of multiple implementation schemes and semicustom design of the most promising candidate. This also means that performance results act as a conservative (lower) limit on potential performance.
Inevitably this work has some relevance to the age-old argument of stack processors versus register machines, particularly as significant improvements have been made in stack-processor code efficiency and code optimization [8-10] and in the design of architectures capable of multiple instruction issue with out-of-order completion and/or previous attempts to parallelize stack structures [11-14].
However, the fundamental focus and aim of this paper is not to compare two competing processor paradigms but simply to establish in its own right the feasibility of a stack structure capable of permitting efficient access to operands in a superscalar access mode. It is also possible, for example, to have stacks employed in situations where they are not the basis for complete programmable processors. Therefore, the application space for such a solution is recognizably, but not exclusively, in the CPU design space.
In a traditional operand stack, multiple operands are held in a stack area organized notionally as a pushdown stack or LIFO buffer. Often the physical implementation employs a pointer into a small memory area. However, this leads to serious bottlenecks and there is little difference between this approach and a register file, especially where multiple operands need to be accessed in the same clock cycle (a requirement for superscalar operand access). In CPU-oriented applications, it is not unusual to simply map a virtual stack onto an architectural register file, but this undoubtedly offers a poor representation of a stack-oriented system since it is only an emulation of a true stack. It does however potentially allow superscalar execution since register files can be superscalar, though with significant hardware cost in terms of hazard avoidance and related logic. A true stack structure therefore appears at first sight to be restricted in its ability to act outside of its serial LIFO mode of operation. However, in this paper we demonstrate a discrete stack with multiport capability and superscalar functionality.

Superscalar Stack Issue Array (SSIA)
We propose a novel stack structure, the superscalar stack issue array (SSIA), whereby a small number of independent hardware stack cells are capable of being accessed collectively by multiple ports concurrently. However, such operations are inseparable from their stack effects (push, pop, or no effect). Consequently, collective stack reordering is performed as a single-cycle internal operation, such that operands can be dispatched and stack state kept coherent even with multiple actions in a group. The use of a novel tag scheme permits out-of-order write-back. All of these attributes can be achieved with basic logic structures. An unexpected benefit of destructive readout, usual in a stack structure, is that it eliminates the RAW/WAR dependency problem, which hinders register-based processors to such an extent as to demand renaming, reorder buffers, and virtual registers, with significant power, area, and thermal penalties. Therefore, an SSIA module can support superscalar operand issue and out-of-order completion without encountering RAW/WAR hazards or requiring a renaming scheme for its contents. These could conceivably be important benefits as described.
The tagging scheme allows delayed write-back to the stack structure, whereby any uncommitted stack cell content is supplanted by a unique tag which simply reserves and occupies the stack cell in question until the write-back can be completed. The unique aspect of this tagging scheme is that tags move with stack content, and therefore operands and results are actually nomadic in their behavior; we refer to these incomplete contents as nomads. There is no possibility of multiple writes being able to target the same reserved space and hence no RAW/WAR hazard.
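The tagging behavior described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the class and method names (`Tag`, `TaggedStack`, `issue`, `write_back`) are hypothetical, the stack is modeled as a Python list rather than as discrete multiplexed cells, and operand readiness, stalls, and single-cycle group reordering are not modeled. It shows only that a unique tag reserves a cell, moves with the stack content, and lets a younger operation complete before an older one.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Tag:
    """Hypothetical placeholder (a "nomad") reserving one stack cell."""
    ident: int


Cell = Union[int, Tag]


class TaggedStack:
    """Behavioral model only; the last list element is the top of stack."""

    def __init__(self) -> None:
        self.cells: List[Cell] = []
        self._next_ident = 0

    def push(self, value: Cell) -> None:
        self.cells.append(value)

    def issue(self) -> Tuple[Tag, Cell, Cell]:
        """Destructively read two operands and reserve the result cell."""
        b = self.cells.pop()
        a = self.cells.pop()
        tag = Tag(self._next_ident)
        self._next_ident += 1
        self.cells.append(tag)  # the tag occupies the cell until write-back
        return tag, a, b

    def write_back(self, tag: Tag, value: int) -> None:
        """Replace the unique cell holding `tag`, wherever it has moved to."""
        self.cells[self.cells.index(tag)] = value


def demo() -> List[Cell]:
    s = TaggedStack()
    for v in (1, 2, 3, 4):
        s.push(v)
    t_add, a, b = s.issue()        # reads 3 and 4, leaves a tag behind
    s.push(5)
    s.push(6)                      # the first tag now sits deeper in the stack
    t_mul, c, d = s.issue()        # reads 5 and 6, leaves a second tag
    s.write_back(t_mul, c * d)     # the younger operation completes first
    s.write_back(t_add, a + b)     # the older operation completes later
    return s.cells
```

Because every tag is unique and each operand is consumed when read, each write-back in this model has exactly one possible destination.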
If we consider how this looks at the logic level and how it differs from a standard stack, it may become clearer. The extra inputs on the 5-way multiplexers facilitate placement of an immediate operand via the topmost multiplexer or selection of an optional "stack fill" value m fn from memory in the bottom-most multiplexer. How such memory-to-stack data movement is handled has many options, and we do not explore these here; however, even trivial buffer schemes of small capacity or a memory queue approach are feasible solutions [13,15,16]. Thus, this scheme achieves in-order issue with out-of-order completion. This is investigated further in the literature [11,13,14,17].
Analyzing this SSIA structure, one can make several observations. First of all, it appears that each issue slot contributes a logic delay to the overall cycle time of the system. The logic delay for cascaded SSIA structures is typically linear with respect to issue width, which is not true of register files; indeed, area growth in register files can be cubic or quadratic as a function of port count [18] and can have considerable impact on design scaling [19]. A further observation is that SSIA operand access delays are unequal. Again this is not true of a register file, where all operands are emitted from parallel read ports with approximately equal latency. For a superscalar stack scheme, each operand pair is accessed with a different latency, ranging from near-zero latency to a larger latency at the final issue slot. This may or may not be exploitable to some advantage; we simply observe the characteristic at this stage rather than speculating too far. However, we would expect to observe a similar story for power behavior, where the peak power will not be a linear multiple of issue width because each slot will peak at a different time. Total power over the whole cycle should however scale with issue width.

Implementation Scope and Variations
It should be clear, by analyzing these several cases of advanced stack operand stores, that the basic component building block of such a system is a multiplexer of some given characteristics, primarily the number of inputs or input ordinal. It also follows that the best multiplexer design (in terms of speed, area, or power) will yield the best structure for the proposed stack-operand management scheme for the same optimization targets. An obvious choice is to use standard cell multiplexer components; however, this proves to be suboptimal where standard cell multiplexer choices are limited to a few cases and are not necessarily implemented in the best way to facilitate ganging into suitable combinations. In order to evaluate this, it is necessary to consider the building block design at the logic level, its fan-out behavior, the logical effort, localized wire delay, and ultimately the full structure layout when a number of issue slots are cascaded and the logic is duplicated across the full stack word width (typically 16, 32, or 64 data bits per operand, for example).
We also note at this point that the depth of the discretely implemented stack is usually much greater than four elements. The top four elements only represent the "hot-zone" of the stack, where activity is more complex. Below this is a region we refer to as the "tidal zone," whereby all elements pop or push in unison, like the ebb and flow of a tide, but are not reordered. Figure 3 highlights this difference in stack element zone behavior. It is noted that these deeper stack elements may be clock-gated or subject to dynamic power management when they are "empty." The nature of the stack makes this easily identifiable. We do not, however, consider this in the analysis presented here.
The tidal stack-element multiplexing arrangements are less complex. A 3-to-1 multiplexer selects either the existing cell state or one of two neighboring cells (above or below). How deep this tidal zone should be is a matter for further research. It should be as small as possible to reduce area/power cost, but a tidal zone of, say, 4 or 8 elements might not be sufficient to allow tags to achieve delayed write-back within the deeper stack before content migrates into memory or cache. We therefore assume, for the present, a discrete stack portion consisting of multiplexer ordinals as a set {3, 3, 3, 3, 5, 4, 4, 5} reading bottom to top. The {5, 4, 4, 5} portion, being the "hot-zone," is symbolized by Figure 2, and the {3, 3, 3, 3} portion, being a "tidal zone," is as in Figure 3. This configuration is assumed in the rest of the paper. There are numerous possibilities for implementing multiplexers, each with its own advantages and disadvantages. Some options are summarized below: (1) combinational logic using AND/OR/MUX, (2) tristate selection of multiple inputs, (3) pass-gate transistor based solutions.
In this paper, we focus on cases (1) and (2), and we present timings as an FO4 delay metric and as an absolute timing. In both cases, we derive FO4 results from a 90 nm, 1 volt process but also quote absolute timing for comparison between design choices. This allows FO4 delays to be comparable across process nodes up to a point. In later sections, we compare the SSIA with alternatives using similar process nodes.
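The tidal-zone behavior can be sketched behaviorally as follows. This is a minimal illustration, not the hardware description used in the study: the function name `tidal_step`, the select labels, and the list-based representation (index 0 is the deepest cell) are all assumptions made for the example. Each cell either holds its value or takes the value of a neighbor, which corresponds to the three inputs of the 3-to-1 multiplexer.

```python
from typing import List, Optional, Tuple

# Multiplexer ordinals assumed in the text, reading bottom to top.
MUX_ORDINALS = [3, 3, 3, 3, 5, 4, 4, 5]
TIDAL_ZONE = MUX_ORDINALS[:4]   # {3, 3, 3, 3}
HOT_ZONE = MUX_ORDINALS[4:]     # {5, 4, 4, 5}


def tidal_step(
    cells: List[Optional[int]],
    select: str,
    from_above: Optional[int] = None,
    from_below: Optional[int] = None,
) -> Tuple[List[Optional[int]], Optional[int]]:
    """Move every tidal cell in unison; return (new cells, value leaving the zone).

    cells[0] is the deepest cell and cells[-1] borders the hot zone.
    'push' moves all content one place deeper, 'pop' one place shallower.
    """
    if select == "hold":
        return list(cells), None
    if select == "push":
        # Each cell takes its upper neighbor; the deepest value spills out.
        return cells[1:] + [from_above], cells[0]
    if select == "pop":
        # Each cell takes its lower neighbor; the top value enters the hot zone.
        return [from_below] + cells[:-1], cells[-1]
    raise ValueError(f"unknown select: {select}")
```

For example, `tidal_step([1, 2, 3, 4], "push", from_above=9)` yields the new zone `[2, 3, 4, 9]` and the spilled value `1`, which in a real system would migrate toward memory or cache.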
In this paper, we identify several standard cells of interest for our design objectives and comparative implementations: (i) AO22: dual two-input AND gates feeding into an OR gate; (ii) AOI22: equivalent to AO22 but omitting the final inverter. For our experiments we always select the X1 variation of the available standard cell. This is not the fastest option but achieves low area cost. The trade-off between larger, faster cells and smaller, slower cells is not straightforward: leakage power differs, and larger cells will lead to longer wires in the full structure and alter RC loading effects.
Using these cells, we evaluate the ganged tristate approach of Figure 4, and also four more typical implementations, based upon the schemes illustrated in Figures 5(a), 5(b), 6(a), and 6(b). We can thus define five initial models for evaluation, each referred to by a short notation (such as the CEN and TNN models discussed in later sections).
In addition to the basic switching structure, each model has a constant delay associated with the state retention flip-flop/latch and tag-match data-insertion multiplexer. Referring back to Figure 2, it can be shown that each multiplexing stage cascades into its successor stage with a fan-out of 5 (FO5C), with the exception of the final stage (or only stage in a nonsuperscalar case). The final stage feeds into the storage bit cell, possibly via the tag match insert logic, which can be placed either immediately before the bit-cell input or immediately after the bit-cell output. Therefore, it has a fan-out of 1 rather than the fan-out of 5 encountered when cascading to further stages. We therefore have case FO1T (feeding a tag-logic stage) and case FO1L (feeding a latch stage). The loading effect of these choices then results in three fan-out loading cases that need to be calculated for each multiplexer style, relating to the position in the cascade chain: (i) FO5C: fan-out 5, cascaded; (ii) FO1T: fan-out 1 to the tag insert mux; (iii) FO1L: fan-out 1 to the latch.
We evaluate this worst case here, and the results of our evaluations are presented in Table 1. From this initial analysis, it appears that the TNN model has the best speed advantage, and that either the FO1T or FO1L variant is preferable for speed.
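The relationship between position in the cascade and output loading case can be written down compactly. The following is a sketch only: the function name `stage_loads` and the `tag_before_latch` flag are hypothetical, and the function simply encodes the rule stated above that every stage except the last drives a fan-out of 5, while the last stage drives either the tag insert mux or the latch.

```python
from typing import List


def stage_loads(issue_width: int, tag_before_latch: bool = True) -> List[str]:
    """Return the output loading case for each multiplexer stage in the chain.

    All stages but the last cascade into a successor (FO5C). The final stage
    drives either the tag insert mux (FO1T) or the latch directly (FO1L).
    """
    if issue_width < 1:
        raise ValueError("issue_width must be at least 1")
    final = "FO1T" if tag_before_latch else "FO1L"
    return ["FO5C"] * (issue_width - 1) + [final]
```

Under these assumptions, a 4-wide chain gives `["FO5C", "FO5C", "FO5C", "FO1T"]`, and a nonsuperscalar (single-stage) case gives only the final-stage loading.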

Tristate Multiplexer Models in Detail
In order to assess the tristate multiplexer strategy with as much accuracy as possible, we utilized several evaluations. First of all, evaluations based upon the 90 nm process data sheets were used to generate an initial timing estimate for the ganged tristate multiplexer. The total load capacitance ($C_{\mathrm{load}}$) driven by the active tristate in the group was calculated as a function of fan-out ($F$) and ganging ($G$) plus the next stage inverter, such that load capacitance equated to

$$C_{\mathrm{load}} = 2.668 \times F + 1.743 \times (G - 1). \tag{1}$$

The value 2.668 refers to the input capacitance of the assumed next stage (an inverter INVX1 cell), whilst the value of 1.743 refers to the capacitance of the outputs of the other inverters (referred to in the data sheet as input capacitance). Both values are extracted from the cell-library data sheets. These estimates appear to be sufficient for a first-order comparison of implementation methods. However, this may not be wholly accurate: signal behavior, layout, and wire effects have some importance, as will be highlighted later in this paper.
Having evaluated the timing models in each case and repeated the analysis for each of the FO5C, FO1T, and FO1L output loading cases, we have enough data to perform an evaluation of delay variation as a function of issue width. However, a complete analysis still needs to incorporate the constant delay of the latch and tag insertion timing parameters themselves and not just the effect they have on output loading of preceding stages. This is outlined in Sections 4.1 and 4.2.
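As a worked instance of the load capacitance relation above, the following sketch evaluates it for a given fan-out and ganging. The function and parameter names are illustrative, and the result is in whatever capacitance unit the cell-library data sheet uses for the two constants (the text does not restate the unit at this point).

```python
# Constants quoted in the text from the cell-library data sheets.
C_NEXT_STAGE_INPUT = 2.668   # input capacitance of the next-stage INVX1 cell
C_OTHER_OUTPUT = 1.743       # capacitance of each other (inactive) ganged output


def load_capacitance(fan_out: int, ganging: int) -> float:
    """Load seen by the active tristate: next-stage inputs plus other ganged outputs."""
    if fan_out < 1 or ganging < 1:
        raise ValueError("fan_out and ganging must be at least 1")
    return C_NEXT_STAGE_INPUT * fan_out + C_OTHER_OUTPUT * (ganging - 1)
```

For a 5-way ganged multiplexer driving a fan-out of 5, this gives 2.668 × 5 + 1.743 × 4 = 20.312, and for the same multiplexer driving a fan-out of 1 it gives 2.668 + 6.972 = 9.640.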

4.1. Tag Insertion Multiplexer.
The MUX2 standard cell (X1 type) drives a single fan-out load for the data input of the latch QDBAH (X1 type). QDBAHX1 has an input capacitance of 1.976 fF for the D input node. This allows us to calculate a delay of 70 ps using the straight-line fit to MUX2 propagation delay data versus capacitive loading of the driven QDBAHX1 latch input.

4.2. State Storage Latch.
The latch itself has several important timing requirements. Setup and hold time amounts to just under 35 picoseconds. Data propagation from D to Q varies according to output load. This will be fan-out 5 in all cases, but the target cell could be any one of MUX2, MLX2, AO22, AOI22, or INVT depending upon the implementation. We therefore have a further table (Table 3) for values derived for this latch propagation delay and also the final total with tag logic included. Table 3 presents the data latch timing and the total timing (tag stage plus latch timings). This shows that the variation in timing of the bit cell latch for different input variants is marginal.

First-Order Timing Projections
With timings projected for each component under all of the encountered input and output conditions, it is possible to assemble a timing analysis for a stack issue-logic slice for each of the component models introduced. At this stage, the subcomponent delays can simply be summed according to the configurations used. So, for example, the CEN model using combinational noninverting logic has configurations with respect to issue width as given in Table 4.
Replacing "CEN" with any of the other models allows the same cascading to be used to calculate the respective delays of each model.These are given in Tables 5 and 6 and plotted as graphs in Figures 7 and 8. Tables 5 and 6 show the absolute timing for the chosen 90 nm process, and the FO4 delay figure extracted on the basis that FO4 delay for 90 nm technology is 45 ps.By way of confirmation, our experimental timing measures derived an average FO4 delay of around
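The conversion between absolute timing and the FO4 metric used in these tables is a simple ratio. The sketch below assumes the 45 ps FO4 delay for 90 nm quoted in the text; the function name `to_fo4` is illustrative.

```python
FO4_DELAY_PS_90NM = 45.0  # FO4 delay assumed in the text for the 90 nm process


def to_fo4(delay_ps: float, fo4_ps: float = FO4_DELAY_PS_90NM) -> float:
    """Express an absolute delay in picoseconds as a multiple of the FO4 delay."""
    if fo4_ps <= 0:
        raise ValueError("fo4_ps must be positive")
    return delay_ps / fo4_ps
```

For example, a 450 ps path corresponds to 10 FO4 delays at this node. Passing a different `fo4_ps` gives a first-order rescaling to another process, which is only as good as the FO4 approximation itself.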

ASIC Layout and Core Evaluations
After performing timing estimates based primarily upon reported standard cell characteristics, it appeared that the tristate model offered the best delay characteristics without resorting to full custom cell design. This implementation model was then evaluated further during a collaborative visit to the Circuits and Systems Research Centre (CSRC), University of Limerick, by one of the authors. VHDL-coded descriptions of the 5m1 multiplexer, using tristate internal selection, were synthesized to create a standard-cell based mux core. After manual tidying, the 5m1 cell appeared as illustrated in Figure 9.
Examining the cell critical wires as in Figure 10, wire lengths are approximately 13.7 μm (ganged wire), 2.7 μm (input), and 2.0 μm (output). After DRC and LVS checks, the 5m1 cell was cropped and rechecked by removal or addition of one or more tristates to create a range of cells from 2m1 through to 8m1. At this stage, a number of effects were then considered in order to get an accurate delay estimate for the building block. These are outlined as follows.
6.1. Slew Rates. Our tests revealed that the slew rates of the input test signals have an appreciable effect on timings, adding more than 10 ps to measurements for a conservative slew rate of 100 ps. We used a buffer chain to condition the input test signals and give realistic signal properties for our tests.

6.2. Standard Cell RC Delay.
With RC behavior included, the delays for our multiplexer combinations are increased noticeably. The data sheets only provide for transistor characteristics and not layout-related wire and metal effects. Adding wire-related delays for the ganging interconnect in the multiplexer further increases the delay. Simulation data for these measurements is given in Table 7. At this stage it becomes obvious that the initial timing projections given in Section 5 are optimistic (that is, they underestimate delay) when compared against real layout and RC extracted timings (as shown in Table 7). Examination of the internal behavior of a 5m1 ganged tristate multiplexer core shows the effect of ganging and the unequal switching times for tPHL and tPLH transients. This is illustrated in Figure 11, which shows an input signal of matched slew rate causing internal node switching on the ganged tristate, which in turn drives the final output inverter. Both tPHL and tPLH are shown; it is clear that the transient on the internal node is the critical issue for this design. Using a dynamic precharging method on the ganged bus (when all tristates are disabled) might have a significant impact on this problem, allowing faster operation. This would certainly be an area for further investigation.

6.3. Fully Cascaded Structure and Bus Interconnect.
To make comparative analysis easier, we derived a model for tristate behavior under given output loading conditions and multiplexer input counts. We began with the plot shown in Figure 12, which shows schematic timing data (blue), RC extracted data (red), and our final model (green: Figure 13) incorporating a distribution bus metal structure suitable for driving a 5-4-4-5 multiplexer row, which is the required worst case for the cascaded n-wide issue structure described in this paper. We also performed simulations of multiplexer cores at schematic and layout levels, and with cascaded cells, to derive timings approaching those for a full ASIC implementation. The data in Figure 13 is used for the final performance estimates, leading to the revised component timings for the TNN model, as given in Tables 8(a), 8(b), and 8(c), with wire effects of Table 9 and predicted access and cycle times in Table 10. This incorporates schematic-derived timing data for the latch and tag stage (see Tables 8(b) and 8(c)) and wire effects (Table 9).
The full layout structure is shown in several figures. A single row of hot-zone elements is shown in Figure 14. When several of these rows are stacked one below another, interlocking the input and output bus lines, the structure appears as in Figure 15, which represents a 4-wide "hot-zone" structure for a single bit of stack word width. Hot-zone control wire supply lines are required for each issue slot, 20 per issue slot. These are capable of being routed over the cells via a higher metal layer (seen running vertically top to bottom in Figure 15). These control lines do not impact upon the structure's overall area and standard cell packing density and have no influence on the dimensions of buses used to connect critical data paths between stacked issue slots. Finally, Figure 16 shows the additional cells, added to the left-hand side of the layout, representing the {3, 3, 3, 3} "tidal" stack zone attached to each hot-zone issue slot module, with shared control lines coming from the left-hand side in this case (though over-cell routing is possible).
The practical consideration to be made here is what impact the interconnect buses (between hot-zone modules) have upon signal delay due to added capacitance. These bus lines are singularly driven but fan out to 5 destinations. We performed simulation of cascaded rows using the bus structure shown in Figure 16 in order to account for this and found that bus-related delays were typically of the order of 20 ps.

Comparative Performance
In previous sections, we developed a structural model for a superscalar operand stack and presented timing estimates for five different (relatively straightforward) implementation strategies. The next question is how fast a superscalar stack is with respect to register file alternatives. However, given the radically different nature of the superscalar operand store, as compared to the register file paradigm, blind like-for-like comparisons are difficult to assert. Consider a likely configuration of our superscalar stack unit, with perhaps four top-of-stack "hot-zone" elements (representing the {5, 4, 4, 5} multiplexer arrangement) and between four and eight deeper "tidal" stack elements implemented with 3-to-1 multiplexers. If we take the latter case, then we have a total of 12 stack elements. The question is raised: is this SSIA equivalent to a register file of 12 registers? Our rationale for comparison is as follows.
7.1. Word Count. Such a stack as described above can push operands deeper than the 12 discrete elements implemented directly in logic, so it may not be accurate to describe it as having equivalent capacity to a register file of 12 words. Conversely, the number of "useful" registers in a register file under certain workloads is a small fraction of full capacity. Many registers are either awaiting write-back or contain dead data that will never be read again. Useful lifespans are typically measured in low tens of clock cycles [39], so much so that some researchers have even proposed discharging unused registers to save power [40]. Stack content is almost exclusively live and useful, and it is much rarer to have dead "nomads"; thus, direct comparison is misleading. Empty stack elements are always easily identifiable, and this correlates with our earlier comment about empty stack elements being able to be power or clock gated in a more sophisticated SSIA solution. Moreover, the fact that redundant register content is seen as a significant concern for static power wastage suggests that the stack approach has potential here, as it rarely contains redundant content, which could be exploited for static/dynamic power reductions.

7.2. Port Count.
Typical superscalar register files are arranged into n write ports and 2n read ports for an n-issue machine. A 12-port register file is often organized as 8 read ports and 4 write ports. Such a configuration matches the expectations of a 4-issue machine, with 4 pairs of operands (8 reads) and 4 possible results (4 writes). An SSIA will have an identical read-port count (a 4-issue stack provides 8 operands). However, each stack element is capable of retiring an uncommitted value to the SSIA, so it might be thought that there are as many write ports as there are elements in the discrete logic stack portion. There are, however, limitations on practical numbers of retirement buses, and in practice the ability to retire 2, 3, or 4 results to the stack is more realistic. This has an impact on the design of the tag match logic (which we have not considered in this analysis) and therefore makes low orders of concurrent writes more desirable.
A further complication is that each issue slot can write an operand to the stack, albeit only to the top element. Does each of these write channels count as a write port in the traditional sense? With the restricted nature of the destination (top of stack only), it seems sensible not to count these as true write ports in their own right. The postulated 12-deep stack with an issue width of four could retire four ALU results and also accept up to four new data values, whilst reading out 8 operands. The extreme interpretation is that this is therefore an 8+8 port operand store. Given that the 8 write channels are actually two groups of four channels with functionally different behaviors, we do not think this is the best way to compare like-for-like (as far as this is possible). We therefore come to a fairly loose conclusion: a suitable SSIA model for comparison would assume the equivalent of one write port and two read ports per issue slot. Note that the stack size has no direct impact on the number of write ports assumed, as this is largely a function of issue width; however, this is a parameter that could be investigated further.
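The port-count equivalence adopted for comparison (one write port and two read ports per issue slot) can be stated as a short helper. The function name `equivalent_ports` is illustrative, and the mapping reflects only the loose convention chosen above, not a property of any particular register file.

```python
from typing import Tuple


def equivalent_ports(issue_width: int) -> Tuple[int, int, int]:
    """Return (read ports, write ports, total ports) assumed for an SSIA
    of the given issue width: two reads and one write per issue slot."""
    if issue_width < 1:
        raise ValueError("issue_width must be at least 1")
    reads = 2 * issue_width
    writes = issue_width
    return reads, writes, reads + writes
```

Under this convention a 4-wide SSIA is compared against a 12-port register file organized as 8 read ports and 4 write ports.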
To perform a comparison of SSIA versus register file performance, we collated a range of over forty FO4 timing data citations and timings for register files as reported in the literature [20-37]. These are detailed in Table 11. Access-time data appears to be more widely reported than cycle times (28 citations versus 13 in our sample). We consider both the access time and the cycle time of our SSIA in comparison to the register file. The population of access-time data for register files (from Table 11) was large enough to derive both best-case and worst-case envelopes and also a general trend for port count versus access delay, based on the whole data set. The data for cycle time was not sufficiently populated to support meaningful trend extrapolation. However, the position of the SSIA cycle time predictions within the group highlights its relative timing behavior.
Plotting this analysis gives the graphs shown in Figure 17, together with the timing characteristics for the tristate-multiplexer SSIA model. The comparative analysis presented in Figure 17 shows that the predicted models for the proposed SSIA configuration have a delay characteristic that is very competitive with the register file for both access time and cycle time. This is certainly true for operand pair issue widths of 1, 2, 3, and 4 (equating to port counts of 3, 6, 9, and 12). One can observe that the SSIA model appears to have an increased delay penalty relative to the register file for extreme port counts. For cycle times, where data was only available for register files up to 12 ports, the SSIA model appears competitive across the issue width/port count range plotted. Overall, it can be stated, based upon the data available in this comparison, that SSIA has highly competitive performance across both access and cycle times for up to 8 read ports and issue widths up to 4.
Taking the worst case delays from Table 10 (cycle times), it is possible to make tentative frequency estimates for a pipelined architecture limited by the superscalar store, with frequencies of around 2.5 GHz, 1.7 GHz, 1.3 GHz, and 1.0 GHz for issue widths of 1, 2, 3, and 4, respectively. A complete layout for one bit of an n-wide SSIA is given in Figure 18.
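The tentative frequency estimates can be related back to cycle time and to an aggregate issue rate with simple arithmetic. The sketch below uses only the approximate frequencies quoted above; the function names are illustrative, and the issue rate assumes every slot issues on every cycle, which is an upper bound rather than a measured throughput.

```python
# Approximate frequencies (GHz) quoted in the text, keyed by issue width.
FREQ_GHZ = {1: 2.5, 2: 1.7, 3: 1.3, 4: 1.0}


def cycle_time_ps(issue_width: int) -> float:
    """Cycle time implied by the quoted frequency (1 GHz corresponds to 1000 ps)."""
    return 1000.0 / FREQ_GHZ[issue_width]


def peak_issue_rate_gops(issue_width: int) -> float:
    """Upper-bound issue rate in Giga-ops/sec, assuming all slots issue each cycle."""
    return issue_width * FREQ_GHZ[issue_width]
```

On these figures the 4-wide case gives 4 × 1.0 GHz = 4.0 Giga-ops/sec, and the implied cycle times are roughly 400 ps, 588 ps, 769 ps, and 1000 ps for issue widths of 1 to 4.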

Conclusions
In this paper, we have considered a novel approach to operand management, using a stack based approach with a scheme which we believe is a new and novel approach for multiple operand issue, permitting superscalar in-order issue and out-of-order completion.Our methodology has been detailed and a model for building suitable stack structures is demonstrated.The cascading nature of the logic structure has two aspects: first of all, it allows a linear growth in logic cost and area, as well as delay characteristics and power consumption, which are potentially advantageous.However, the cascading nature also means that at some point (an issue width of 8, equivalent to a port count of 20, for example) the SSIA as presented here becomes less desirable as those costs accumulate.However, with realistic issue widths the cascading effect does not reach an uncompetitive behavior.Delay evaluations have been made, utilizing industry standard tools for core building block characterization, and making reasonable assumptions for configurations of systems equating to various issue widths.We have compared our projected performance data against a range of existing register file models and found comparable performance is viable.There are a number of possible enhancements that could be made to this base design in order to improve performance.These include (i) advanced bit-cell design, (ii) ganged pass gates rather than tristates, (iii) look-ahead schemes to reduce cascading, (iv) dynamic precharge of common node, (v) selective clock-gating of empty cells.
Combining these possibilities, we envisage that cycle delays might be halved and perhaps more.This suggests that a 2 GHz 4-wide issue stack is conceivable with 90 nm CMOS with careful optimization and more innovative design, implying operand issue bandwidth somewhere in the region of 8 to 12 Gigawords per second at 90 nm.Naturally, more advanced processes will offer further performance gains.
One of the most interesting aspects of the superscalar stack is its ability to deliver issue slot operands at different times within the cycle time window, a unique behavior which we believe will allow more freedom in layout floor planning at a higher level of abstraction.This is particularly interesting if wire pipelining is introduced [41].We also expect to observe (and perhaps to enhance further) a power-spreading effect that reduces peak power over machine cycle time scales due to the ripple-through effects of logic in our structures.This has some potential to reduce power density hot spots in the operand store as well as reducing peak power spikes.When combined with the knowledge that "useful" register lifespans are often a small fraction of the power-hungry register file [39,40,42], SSIA starts to look more interesting as a possible candidate upon which to base superscalar systems.Work on more advanced multiplexer design continues to be a current topic [43], and there is substantial scope to learn from this and improve upon the implementations reported here.
In the wider context, it has been fashionable to consider the stack machine as outdated. However, the potential for such architectures to deliver complex operand and instruction issue models highlights fresh opportunities and offers a new twist in the development of stack machines and related queue machines. Combining this with the significantly better stack code optimization frameworks and models highlighted earlier [8-10, 14, 15] suggests that stack machines might be overdue a fresh examination, in view of the trend toward many simpler cores per chip rather than fewer but more complex ones. We believe that new avenues have been opened up by our initial study: in answering one question we have uncovered many others. A more comprehensive VLSI-oriented study is a highly desirable next step in this work. Collectively these objectives will allow a complete design characterization for a prototype superscalar stack processor to be achieved. We therefore expect to continue to evaluate these novel SSIA architectures in the future and hope to report further findings in due course.
), 5(b), 6(a), and 6(b). We had predicted that the tristate selection method would perform better than combinational logic, based upon initial logical effort analysis. The tristate model assumes that the preceding decode stage generates one-hot selection signals. Other cases assume encoded selection signals, generated in the same way.

Figure 7: Issue width versus cycle time for various implementations.

Figure 10: Significant wires in the 5m1 tristate cell, with dimensions in nm.

Figure 11: External and internal signal relationships for 5m1 RC extracted simulation. CADENCE timing data plot (redrawn for clarity).

Figure 13: Data sheet versus RC delay with bus metal. Data sheet (blue), distribution bus (green).

Figure 14: Hot-zone row {5, 4, 4, 5} configuration, and interconnect bus detail.Four outputs at the bottom are used to drive the four bus lanes of the next cell (equivalent to those at the top of this cell).

Figure 17: Comparison between SSIA and register file timing. (a) Access times and (b) cycle times, plotted on the same scales.
) and 17(b), which show the population of register file FO4 characteristics for access time (Figure 17(a)) and cycle time (Figure 17(b)) alongside
Figure 1 illustrates a fundamental stack implementation scheme. Each stack element bit is translated from its current state {a, b, c, d} into its respective next state {a′, b′, c′, d′} by use of a multiplexer arrangement, which performs local reordering to reflect the stack effect of the applied action.
In a standard scalar stack, the next state is fed back to the state retention flip-flops. However, if multiple reorder-mux stages are cascaded, then the state feedback represents the stack effects of a group of actions, and the intermediate stages provide multiple operand pairs simultaneously, as illustrated in Figure 2. It is also notable that only one state transition is observed at the state retaining flip-flop, no matter how many issue slots are cascaded. A superscalar stack will therefore have significantly fewer such transitions than a serial/scalar stack given the same sequence of actions; in effect, moving from a scalar to a superscalar structure saves dynamic flip-flop power. By adding logic to permit insertion of a tag-matched value from a common data bus (CDB), it is possible for stack cells to contain tags reserving locations for delayed write-back values and for these to "retire" when a CDB tag is matched.
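The cascading behavior described above can be sketched as a small behavioral model. This is an illustrative sketch under stated assumptions, not the circuit evaluated in this paper: the action names (`binop`, `swap`, `dup`, `drop`), the `Tag` class, and the functions `reorder_stage`, `issue_group`, and `retire` are all hypothetical, and a Python list stands in for the fixed-depth stack cells.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Tag:
    """Hypothetical placeholder reserving a cell for a delayed write-back value."""
    ident: int


Cell = Union[int, Tag]


def reorder_stage(state: List[Cell], action: str,
                  tag_id: int) -> Tuple[Tuple[Cell, ...], List[Cell]]:
    """Model one reorder-mux stage; index 0 is the top of the stack.

    Returns (operands issued by this slot, next intermediate state).
    """
    if action == "binop":   # issue the top two cells, reserve a tagged result cell
        return (state[0], state[1]), [Tag(tag_id)] + state[2:]
    if action == "swap":    # exchange the top two cells, issue nothing
        return (), [state[1], state[0]] + state[2:]
    if action == "dup":     # copy the top cell
        return (), [state[0]] + state
    if action == "drop":    # discard the top cell
        return (), state[1:]
    raise ValueError(f"unknown action: {action}")


def issue_group(state: List[Cell], actions: List[str]):
    """Cascade one stage per issue slot; the stored state is written only once."""
    current = list(state)
    issued = []
    for slot, action in enumerate(actions):
        operands, current = reorder_stage(current, action, tag_id=slot)
        issued.append(operands)
    return issued, current  # 'current' is the single value fed back to the flip-flops


def retire(state: List[Cell], tag_id: int, value: int) -> List[Cell]:
    """Replace a matching tag with the value broadcast on the common data bus."""
    return [value if cell == Tag(tag_id) else cell for cell in state]


operands, next_state = issue_group([1, 2, 3, 4, 5], ["binop", "swap", "binop"])
print(operands)                   # operands seen at each cascaded issue slot
print(retire(next_state, 2, 99))  # state after the CDB returns tag 2
```

In this model the intermediate states exist only inside `issue_group`, mirroring the observation that a single state transition reaches the retention flip-flops regardless of how many slots are cascaded.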

Table 1: Timing of 5m1 component models.

Table 3: Bit cell and tag timing data. FO4 delays in brackets.

Table 6: Access times, in ps, for combined tpHL and tpLH timing. The results show that TNN is the best choice at this level of analytical detail. The combinational logic model CEN is by far the worst option even for a scalar issue model, and TNN is almost twice as fast as CEN for the highest issue width examined.

Table 7: Delays with and without local wires, for TNN MUX.

Table 8: (a) Timing models, various simulation modes. (b) Simulation timings for TNN multiplexer driving various next stages. (c) Schematic timings for LATCH and TAG-MUX components.

Table 9: Delay data for simple wire of length 0 µm to 60 µm.

Table 10: Cycle/access times versus issue width.

Table 11: Reported register file delay times.
a Ports are stated as T (R + W), where T is the port total and the bracketed figures (R + W) represent read and write ports where known.
b Delays are stated as FO4 delay, assuming 1 FO4 delay equates to an approximation scale of 2 nm per ps [38].
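The FO4 normalization used in the table footnote can be sketched as follows. This is a minimal sketch that assumes the "2 nm per ps" scale means one FO4 delay in ps is approximately the process feature size in nm divided by 2; the function names and the 450 ps example delay are hypothetical and are not data from this paper.

```python
def fo4_delay_ps(feature_nm: float, nm_per_ps: float = 2.0) -> float:
    """Approximate one FO4 delay in ps for a process of the given feature size.

    ASSUMPTION: the footnote's scale of 2 nm per ps is read as
    FO4 (ps) = feature size (nm) / 2.
    """
    return feature_nm / nm_per_ps


def delay_in_fo4(delay_ps: float, feature_nm: float) -> float:
    """Express an absolute delay as a process-independent FO4 count."""
    return delay_ps / fo4_delay_ps(feature_nm)


# Hypothetical example: a 450 ps access time reported for a 90 nm design.
print(fo4_delay_ps(90.0))         # one FO4 delay at 90 nm, in ps
print(delay_in_fo4(450.0, 90.0))  # the same delay expressed in FO4 units
```

Normalizing in this way lets delays reported at different process nodes be compared on a common scale, which is the purpose of the FO4 figures in the comparison.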