Instruction Scheduling Across Control Flow

Instruction scheduling algorithms are used in compilers to reduce run-time delays in compiled code by reordering or transforming program statements, usually at the intermediate language or assembly code level. Considerable research has been carried out on scheduling code within the scope of basic blocks, i.e., straight line sections of code, and very effective basic block schedulers are now included in most modern compilers, especially those for pipelined processors. In previous work (Golumbic and Rainish, IBM J. Res. Dev., vol. 34, pp. 93-97, 1990), we presented code replication techniques for scheduling beyond the scope of basic blocks that provide reasonable improvements in the running time of the compiled code, but which still leave room for further improvement. In this article we present a new method for scheduling beyond basic blocks called SHACOOF. This new technique takes advantage of a conventional, high quality basic block scheduler by first suppressing selected subsequences of instructions and then scheduling the modified sequence of instructions using the basic block scheduler. A candidate subsequence for suppression can be found by identifying a region of the program control flow graph, called an S-region, which has a unique entry and a unique exit and meets predetermined criteria. This enables scheduling of a sequence of instructions beyond basic block boundaries, with only minimal changes to an existing compiler, by identifying beneficial opportunities to cover delays that would otherwise have been beyond its scope.


INTRODUCTION
Instruction scheduling is a process of rearranging or transforming program statements before execution by a processor in order to reduce possible run-time delays between compiled instructions. An instruction scheduler is normally implemented as part of a compiler [1], and usually operates at an intermediate language (IL) or assembly code level. Such transformations must preserve data dependency and are subject to other constraints. Instruction scheduling can be particularly advantageous when compiling for pipelined machine architectures, which allow increased throughput by overlapping instruction execution. For example, if there is a delay of one cycle between fetching and using a value V, it would be desirable to cover this delay with an instruction that is independent of V and is ready to be executed.
Previous work on the implementation of instruction scheduling concentrated on scheduling within basic blocks [2-7]. A basic block is a sequence of consecutive instructions for which the flow of control enters at the beginning of the sequence and exits at the end thereof without a branch possibility, except at the point of exit. A basic block scheduler attempts to interleave independent instructions within each basic block so as to eliminate wasted machine cycles. Such schedulers are quite effective for programs with long basic blocks, common in some scientific applications. Branch instructions, however, restrict the effectiveness of pipelined architecture in ways that cannot be handled with only basic block transformations.
In an earlier work [8], we investigated code replication techniques for scheduling beyond the scope of basic blocks, resulting in reasonable improvements of running time of the compiled code. However, the approach described there still leaves room for further improvement, and is unrelated to the new method presented here.
In this paper, we present a technique called SHACOOF, for ScHeduling Across COntrOl Flow, which extends capabilities well beyond basic blocks. This technique enables reductions in run-time delays due to branches, loops, etc., and enables pipelined architectures to be exploited in ways that would not otherwise be possible. The method depends on a new, but simple, decomposition in which successively larger portions of the program control flow graph are replaced by summary pseudo instructions, resulting in new, larger sections of straight line code that can be scheduled by existing techniques.
In Section 2, we introduce the notion of an S-region, and describe in Section 3 how S-regions are used to identify subsequences of instructions for suppression. A detailed example showing how the SHACOOF instruction scheduler covers delays in a simple yet typical program is given in Section 4, along with a discussion of implementation issues.

S-REGIONS
The fundamental idea of the SHACOOF method is to identify a candidate subsequence of instructions to be suppressed, treating it as one would a subroutine, making it transparent to the basic block scheduler. These subsequences correspond to regions of the program flow graph having a unique entry, a unique exit, and satisfying certain minimality conditions. We call these S-regions. By suppressing the S-region, regarding it as a pseudo instruction, and preserving data dependency, instructions can then be moved over it using the current basic block scheduler. We now make these definitions precise.
Let G be the program flow graph, with vertices corresponding to the basic blocks and with a directed edge from block B1 to B2 if B2 is a successor of B1 [1]. Let pred(N) and succ(N) denote the set of all predecessors and successors of vertices of set N, respectively. A subgraph S of G is called an S-region with entry x and exit y (x ≠ y) if the following four conditions hold: [...] (4) There is no region S' ⊂ S, with entry x' and exit y' satisfying (1)-(3), with x = x' or y = y'.
In particular, the definition implies that every path from G - S to S goes through x, and that every path from S to G - S goes through y. It is also easy to show the following.
Lemma. There is at most one S-region with entry x or with exit y.
Proof. Suppose there are two S-regions S1 and S2 with the same entry x. Condition 4 implies that one is not contained in the other, so there exists a z ∈ S1 (z ≠ x) such that z ∉ S2. But there must be a path from x to z, so z ∈ succ(S2 - y2), where y2 is the exit of S2. Condition 3 implies that z ∈ S2 - x, a contradiction. □
S-regions may be nested or chained as illustrated in Figure 1. This example shows part of a flow graph, where each vertex B1 to B23 represents a basic block and S0, ..., S6 denote S-regions. In the example, S0 - S1 - S2 and S3 - S5 are chained S-regions; region S4 is nested inside region S3, etc. S-regions can be generated or recognized in O(e log e) time, where e is the number of edges in the flow graph, using modifications of standard algorithmic design techniques [1, 9].
Without the minimality condition, the notion of an S-region is similar to that of a statement in a well-structured language; however, in addition to dealing with arbitrary program constructs, we also disallow certain well-structured statements. The S-regions S0 and S2, for example, are very unstructured. Note too in the example that the set of blocks {B12, B13, B14, B15} is not an S-region, although under certain definitions it might be considered a statement. S-regions were introduced in graph theory [10], where they were called hammocks, but have remained unstudied with the exception of one other compiler application [11].
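The boundary property noted above (every path into S passes through x, every path out of S passes through y) can be checked edge by edge. The following Python fragment is only a sketch under stated assumptions: the adjacency-set representation and the name has_unique_entry_and_exit are illustrative and not taken from the paper, and the check covers only this implied property, not the minimality condition (4) or the full definition.

```python
from typing import Dict, Hashable, Iterable, Set

# Assumed representation: block id -> set of successor block ids.
Graph = Dict[Hashable, Set[Hashable]]


def has_unique_entry_and_exit(graph: Graph, region: Iterable[Hashable],
                              entry: Hashable, exit_: Hashable) -> bool:
    """Check that `region` is entered only at `entry` and left only at `exit_`.

    A path from outside the region first reaches it along a single edge,
    so it suffices to inspect the edges that cross the region boundary.
    Minimality (condition 4) is deliberately not checked here.
    """
    s = set(region)
    if entry not in s or exit_ not in s or entry == exit_:
        return False
    for src, succs in graph.items():
        for dst in succs:
            if src not in s and dst in s and dst != entry:
                return False  # side entrance into the region
            if src in s and dst not in s and src != exit_:
                return False  # side exit out of the region
    return True
```

For a diamond-shaped graph 1 -> 2 -> {3, 4} -> 5 -> 6, the block set {2, 3, 4, 5} passes this check with entry 2 and exit 5, whereas {2, 3, 5} fails because the edge from 2 to 4 leaves the set at a block other than the exit.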

THE USE OF S-REGIONS IN SHACOOF
The decomposition into S-regions provides a mechanism for eliminating remaining delays (no-ops) that result from branches and loops. For an S-region S having entry x and exit y, the candidate subsequence P of instructions to be suppressed comprises:
1. The last instruction b of the entry block x if b is a branch statement
2. The label l at the top of the exit block y if l exists**
3. All instructions in all remaining blocks of S
By suppressing P, regarding it as one pseudo instruction, we can then apply a basic block scheduler (like that in reference 6) to the straight line code consisting of the block x (without b if b ∈ P), the single pseudo instruction P, and the block y (without l if l ∈ P).
** We assume an IL for which every basic block begins with a label (with no instruction) only if there is a need for it, such as being the target of a branch.
A plurality of candidate subsequences can be identified for suppression in this way, by generating all S-regions, or generating them on demand, and replacing successively larger portions of the flow graph by such pseudo instructions when delays still remain, resulting in new, larger sections of straight line code that can be scheduled by the basic block scheduler.
For the example in Figure 1, the chain [2, P0, 5, P1, 7, P2, 10] (where P0, P1, and P2 represent the pseudo instructions for the S-regions S0, S1, and S2, respectively) is straight line code, and it now becomes possible to move instructions from 7 up to 2 or from 5 down to 10, etc., assuming, in the normal manner, that data dependency permits. Note that the instruction subsequence for S5 (which the pseudo instruction P5 will replace) consists of only instructions from basic block 21, because basic block 20 has no branch and basic block 22 has no label.
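The construction of the candidate subsequence P, and of the straight line code handed to the basic block scheduler, can be sketched as follows. This is an illustrative Python fragment rather than the authors' implementation: the Instr record, its kind field, and the function name straight_line_view are assumptions introduced for the example, and the region is assumed to be given as a list of block ids with the entry first and the exit last.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Instr:
    name: str          # hypothetical instruction identifier
    kind: str = "op"   # "op", "branch", "label", or "pseudo"


def straight_line_view(blocks: Dict[str, List[Instr]],
                       region_order: List[str],
                       pseudo_name: str = "PSEUDO"
                       ) -> Tuple[List[Instr], List[Instr]]:
    """Return (code seen by the basic block scheduler, suppressed subsequence P).

    `region_order` lists the blocks of one S-region: entry first, exit last.
    """
    entry, *middle, exit_ = region_order
    head = list(blocks[entry])   # copies, so the input blocks stay intact
    tail = list(blocks[exit_])
    suppressed: List[Instr] = []
    # 1. Last instruction of the entry block, if it is a branch.
    if head and head[-1].kind == "branch":
        suppressed.append(head.pop())
    # 3. All instructions of the remaining (intermediate) blocks.
    for block_id in middle:
        suppressed.extend(blocks[block_id])
    # 2. Label at the top of the exit block, if there is one.
    if tail and tail[0].kind == "label":
        suppressed.append(tail.pop(0))
    pseudo = Instr(pseudo_name, "pseudo")
    return head + [pseudo] + tail, suppressed
```

The suppressed list is kept in program order so that it can be restored in place of the pseudo instruction after scheduling.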
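The chain for Figure 1 can be reproduced mechanically from the entry and exit blocks of the chained S-regions. The short Python sketch below assumes each region is given as an (entry, exit) pair of block numbers, in chain order; the function name collapse_chain and the "P<i>" naming of pseudo instructions are illustrative choices, not notation fixed by the paper.

```python
from typing import List, Tuple, Union


def collapse_chain(regions: List[Tuple[int, int]]) -> List[Union[int, str]]:
    """Replace each chained S-region by a pseudo instruction name.

    `regions` holds (entry, exit) block numbers; consecutive regions must
    share a block (the exit of one is the entry of the next).
    """
    chain: List[Union[int, str]] = []
    for i, (entry, exit_) in enumerate(regions):
        if i == 0:
            chain.append(entry)
        elif entry != regions[i - 1][1]:
            raise ValueError("regions are not chained")
        chain.append(f"P{i}")
        chain.append(exit_)
    return chain
```

With the entries and exits listed for S0, S1, and S2 in Figure 1, collapse_chain([(2, 5), (5, 7), (7, 10)]) yields the chain 2, P0, 5, P1, 7, P2, 10.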

IMPLEMENTATION ISSUES
The SHACOOF instruction scheduler forms part of an experimental version of one of the IBM XL family of compilers for the IBM RISC System/6000 computers. It can be called into operation one or more times as required during compilation and is applied to instructions at an IL level. It can be used either conservatively (only after register allocation) or aggressively* (also before register allocation). On the SPEC Benchmark program EQNTOTT, it provides a further 6% run-time speedup over basic block scheduling alone.
The control flow graph provides the information on the structure of the basic blocks that is needed for the identification of the S-regions, including the determination of their respective entry node x, exit node y, and intermediate nodes. An S-region table can be built in which data on x, y, and the intermediate nodes for each S-region are stored, and which implicitly tags the instructions of the S-region. In other words, by accessing the S-region table, the instructions that belong to a candidate subsequence for suppression can be identified. As the instructions are not physically tagged, there are no tags to be removed after scheduling.
* As with many compiler optimization techniques involving code motion, SHACOOF instruction scheduling may tend to increase register pressure by lengthening the "live" area of some variables. Therefore, very aggressive instruction scheduling applied before global register allocation may cause extra spill code to be inserted and may limit certain other optimizations such as coalescing [12-15].
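One possible shape for such an S-region table is sketched below in Python. The record layout (SRegionRow), the block names, and the lookup function regions_containing are assumptions made for illustration; the paper states only that the table stores the entry, the exit, and the intermediate nodes of each S-region.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class SRegionRow:
    entry: str
    exit: str
    intermediate: FrozenSet[str]

    def blocks(self) -> FrozenSet[str]:
        """All blocks of the region, including its entry and exit."""
        return self.intermediate | {self.entry, self.exit}


def regions_containing(table: Dict[str, SRegionRow], block: str) -> List[str]:
    """Ids of the S-regions containing `block`, innermost (smallest) first.

    Membership is derived from the table alone, so no instruction
    carries a physical tag.
    """
    hits = [rid for rid, row in table.items() if block in row.blocks()]
    return sorted(hits, key=lambda rid: len(table[rid].blocks()))
```

Because an instruction's block determines which regions it belongs to, nothing has to be written into the instruction records before scheduling or cleaned up afterwards.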
It is common practice in modern compilers to maintain the sequence of IL instructions as a type of linked list. Before calling the SHACOOF scheduler, however, we modify the pointers in the instruction sequence list so that the basic blocks occur in the logical sequence dictated by the control flow graph, so that the entry block of an S-region precedes all intermediate blocks, which in turn precede the exit node. This neither modifies the content of the instruction fields (merely the pointers accompanying them) nor modifies the control flow graph. After this modification, it is a simple matter to sequence through all the instructions of an S-region from the entry node to the exit node.**
** It should be noted that the order of the instructions within a basic block in the input instruction sequence, determined by the pointers, will correspond to the logical sequence in accordance with control flow. This will not normally be the case over basic block boundaries, due to branch and jump instructions between basic blocks.
In the example of Tables 1 and 2, the suppressed subsequence P ends with the first instruction of the exit block (which is a label). Table 3 illustrates the instruction sequence as perceived by the basic block scheduler after P has been suppressed; P is called here "PSEUDO" and numbered SUPR. Following the suppression of instructions I.8 to I.10, the new instruction sequence from I.4 to I.12 in Table 3 forms a piece of straight line code that the basic block scheduler handles in a conventional manner, resulting in the code illustrated in Table 4.
Instruction I.11 has been moved across the PSEUDO instruction in order to cover the pipeline delay between the compare of I.11 and the branch of I.12. The final output instruction sequence with the suppressed instructions restored is shown in Table 5.
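Moving an instruction across the pseudo instruction is legal only if data dependency is preserved. One way to let an unmodified basic block scheduler enforce this is to give the pseudo instruction the combined register definitions and uses of everything it replaces. The Python sketch below illustrates that idea under stated assumptions: the Op record, the register names, and the functions summarize and independent are hypothetical, memory dependences are ignored, and the paper does not specify how its implementation represents the dependences of a suppressed subsequence.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class Op:
    name: str
    defs: FrozenSet[str] = frozenset()  # registers written
    uses: FrozenSet[str] = frozenset()  # registers read


def summarize(name: str, suppressed: Iterable[Op]) -> Op:
    """Build a pseudo instruction whose defs/uses cover all suppressed ops."""
    ops = list(suppressed)
    defs = frozenset().union(*(op.defs for op in ops))
    uses = frozenset().union(*(op.uses for op in ops))
    return Op(name, defs, uses)


def independent(a: Op, b: Op) -> bool:
    """True if a and b may be reordered: no flow, anti, or output dependence."""
    return not (a.defs & b.uses or a.uses & b.defs or a.defs & b.defs)
```

Under this model, a compare that reads and writes only registers untouched by the suppressed instructions is independent of the pseudo instruction and may be hoisted above it, while an instruction that overwrites a register read inside the suppressed region may not.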

SUMMARY
The reordering of selected instructions at compile time can exploit the potential parallelism inherent in the code. In order to take most advantage of pipelined architectural features, we have presented the SHACOOF technique, which enlarges the "vision" of an instruction scheduler beyond basic blocks. This technique augments an already good basic block scheduler and extends its capability in covering delays with only minimal changes to existing compilers.
S0 consists of B2-B5 with entry B2 and exit B5
S1 consists of B5-B7 with entry B5 and exit B7
S2 consists of B7-B10 with entry B7 and exit B10
S3 consists of B11-B20 with entry B11 and exit B20
S4 consists of B16-B19 with entry B16 and exit B19
S5 consists of B20-B22 with entry B20 and exit B22
S6 consists of B1-B23 with entry B1 and exit B23
FIGURE 1 A flow graph.

Table 1. An Example of a Program Explaining the Operation of SHACOOF

Table 1 illustrates a sample program for scheduling and Table 2 gives the intermediate level machine instruction sequence corresponding to lines 2 through 6 of the program, annotated with [...] and exit at B4. The three instructions labeled with a plus sign indicate the subsequence P which is to be suppressed, that is, the last instruction of basic block B2 (which is a branch), all instructions of basic block B3 (in this example there is only one), and the first instruction of basic block B4.

Table 2. An Annotated Sequence of Machine Instructions Corresponding to Lines 2-6 of the Program of Table 1
B3+ I.10 CL.3: label B4

Table 3. The Sequence of Instructions Following Suppression

Table 4. The Sequence of Instructions Following Scheduling

Table 5. The Output Sequence of Instructions