An Empirical Study of Precise Interprocedural Array Analysis

In this article we examine the role played by the interprocedural analysis of array accesses in the automatic parallelization of Fortran programs. We use the PTRAN system to provide measurements of several benchmarks to compare different methods of representing interprocedurally accessed arrays. We examine issues concerning the effectiveness of automatic parallelization using these methods and the efficiency of a precise summarization method.


INTRODUCTION
Effective program parallelization, like any compiler optimization, can benefit from increased precision during its analysis phase. However, increased precision often implies an increase in compilation time and/or storage, forcing a tradeoff between precision and efficiency. If the benefits of increased precision outweigh the degradation in efficiency, a precise analysis should be utilized.
In this article we assess the effectiveness and efficiency of a precise form of interprocedural array analysis for automatic parallelization. Specifically, we examine a method employed to represent the interprocedural accesses of arrays. Using the PTRAN system, we measure this method on several benchmarks.

BACKGROUND
Traditionally, compilers have processed programs at the subroutine level. In the absence of subroutine calls, standard intraprocedural analysis techniques [4,5] can be applied. However, due to the use of modular programming techniques, programs are often written with multiple subroutines. When an intraprocedural analysis encounters a subroutine call, information regarding how the called routine accesses its parameters and global variables is absent. Without this information, conservative assumptions must be made. For a parallelizing compiler, this can imply superfluous dependencies that lead to a loss of parallelism. Thus, it seems imperative that as much information as possible be captured regarding the side effects of subroutine calls. Interprocedural analysis attempts to provide this information.
Procedure integration or inlining can be viewed as an alternative to interprocedural analysis. When inlining is performed, the body of a called subroutine is substituted for the call statement with appropriate changes made to the naming of the formal parameters. Although selectively performing inlining can be beneficial, the cost of an enlarged program renders it infeasible as a general solution to handling all subroutine calls [6,7]. Thus, inlining is used as a complement, rather than an alternative, to interprocedural analysis. The relationship between inlining and our precise summary method is discussed in Section 3.1.
Traditionally, to determine the side effects of a call statement, two prior analyses of the called routine are performed. For definitions, a flow-insensitive analysis is computed for the routine, recording nonlocal variables that may be defined. In contrast, to determine what uses should be created by a call, a flow-sensitive analysis is employed to find upward-exposed uses (a use on a definition-free path from the subroutine entry) of nonlocal variables in the called routine [8]. A flow-sensitive analysis of definitions, which can determine which variables must be defined, can be used to supplement the flow-insensitive analysis.
The results of these side-effect analyses are represented by two sets for each routine. The PMOD(P) set contains all global variables and parameters of routine P that may be defined. The PUSE(P) set contains all global variables and parameters of routine P that have an upward-exposed use.
This approach is illustrated in Figure 1. As subroutine P contains definitions of the formal parameter A and the global B, calls to P assume that both of these variables are modified (PMOD(P) = {A, B}). Although both variables are referenced in subroutine P, only A is upward exposed with respect to the subroutine entry; no definition-free path exists from the subroutine entry to the use of B. Thus, a use is created for A, but not B, at the call site of P (PUSE(P) = {A}).
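The classical summaries can be sketched concretely. The following is an illustrative Python fragment (not PTRAN code) that computes flow-insensitive PMOD and upward-exposed PUSE over a routine simplified to a straight-line list of statements; a real analysis works over a control-flow graph, and all names here are invented for the example.

```python
# Illustrative sketch (not PTRAN code): classical mod/exposed-use summaries
# for one routine, simplified to a straight-line statement list. Each
# statement is a (defined_vars, used_vars) pair; real analyses work on a
# control-flow graph and must consider all paths from the entry.

def summarize(statements, nonlocals):
    pmod = set()      # flow-insensitive: nonlocals that may be defined
    puse = set()      # flow-sensitive: upward-exposed uses of nonlocals
    defined = set()   # variables defined so far on this path
    for defs, uses in statements:
        for v in uses:
            if v in nonlocals and v not in defined:
                puse.add(v)   # use reached by a definition-free path
        for v in defs:
            if v in nonlocals:
                pmod.add(v)
            defined.add(v)
    return pmod, puse

# Modeled after Figure 1: A is used before being defined, B is defined first.
stmts = [(set(), {"A"}), ({"B"}, set()), (set(), {"B"}), ({"A"}, set())]
pmod, puse = summarize(stmts, {"A", "B"})
```

Run on this statement sequence, the sketch reproduces the sets of Figure 1: PMOD = {A, B} and PUSE = {A}.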
Consider the example in Figure 2, where A and B are arrays. Because an array access only references one element of the array, array definitions are treated as preserving, i.e., they do not kill any definitions that reach them. Thus, the use of B is viewed as upward exposed in P. In the interest of efficiency, classical interprocedural analysis represents array accesses by treating them in the same manner as scalars. Thus, it regards an access to an element of the array as an access to the whole array. Although this method retains efficiency, it suffers a loss of precision.
For example, in Figure 2 only the first element of A is used, while a proper subset (1, ..., 100) of the elements are defined. Likewise, parts of B are neither modified (odd elements) nor referenced (elements > 100) in P. Because A and B are arrays, simply stating that they are modified or used disregards subscript information describing which part of the array is accessed.
To address the loss in precision of the classical approach, several approaches have been suggested to represent portions of an accessed array. These techniques differ in the amount of precision they provide, as well as the storage and time required in processing the suggested representations. The spectrum of Figure 3 orders these methods. Movement to the right on the spectrum represents improved precision as well as diminished efficiency. The following is a list of the techniques, with a brief description of each in terms of precision:

1. RS-Regular Sections [9-12]: Several variants have been described, some of which include strides and bounds information using triplet notation. Others allow for diagonal references and triangular sections.
2. DAD-Data Access Descriptors [13,14]: More general than all RS variants because trapezoidal shapes can be represented.
3. Reg-Regions [15]: Allows more general shapes than DAD by using linear inequalities to describe the shape's boundaries.
4. AI-Atom Images [9,16,17]: Represents full subscript information for each dimension as a linear combination of iteration variables and formal parameters. Loop bounds are also retained.
5. Lin-Linearization [18]: Similar to AI except all subscripts are linearized into one dimension.
6. IOmega-Interprocedural Omega Test [19]: Full subscript information is captured in the form of an integer programming projection so that the Omega exact dependence test [20] can be applied. Although multiple projections are merged into a single projection, the size of the integer projection is increased by adding extra variables.
7. FIDA-Full Interprocedural Dependence Analysis [21]: Combines the AI and Lin techniques.
In addition to these techniques, and the classical technique just discussed, we include a pessimistic approach in our spectrum, which performs no interprocedural analysis. For correctness, it assumes that each routine modifies and uses all parameters and global variables. Although this scheme is imprecise, it can be highly efficient because no summary information needs to be recorded. In fact, most production compilers perform this type of analysis by default. Furthermore, this method must be selectively employed when some routines of a program are not available for analysis.
To the right of the pessimistic approach is the classical mod/exposed-use approach utilized for scalars [8] and described above. In our experiment we compare these two approaches with a more precise, but less efficient, approach called FIDA [21]. An overview of FIDA is given in Section 3.
A number of advanced techniques (RS, DAD, Reg) lie between classical and FIDA on the spectrum. These techniques offer more precision (at the cost of less efficiency) than the classical approach, yet they are more efficient (and less precise) than the more precise techniques (AI, Lin, IOmega, FIDA).
The key difference between these two groups of advanced techniques is how they handle multiple accesses to the same array in a routine. Information about each access is retained in full with the precise techniques. For AI, Lin, and FIDA, this information is represented by a list of descriptors. For IOmega it is represented by modifying the projection function. By contrast, the more efficient advanced techniques represent multiple accesses with one descriptor. Thus, no matter how many accesses to a variable are made in a routine, only one descriptor is retained. However, this one-descriptor approach has two disadvantages. For efficiency, these techniques place more restrictive constraints on the expressiveness of their descriptors than what is employed for intraprocedural array accesses. This results in a less precise representation than is used for intraprocedural accesses. Moreover, the union of two descriptors cannot always be performed precisely (i.e., union is not closed over the descriptors). Representing the union approximately introduces further imprecision.
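To make the union imprecision concrete, here is a small Python sketch (our own illustration, not code from any of the cited systems) using a regular-section style (lower, upper, stride) triplet for one array dimension. Because the union of two sections is not always a section, an enclosing section must be returned, and the stride that is lost is exactly the kind of imprecision described above.

```python
# Illustrative sketch: a regular-section style (lower, upper, stride) triplet
# for one array dimension, with an approximate union. The union of two
# sections is not always a section, so a single enclosing section is
# returned; any stride information that is lost is the imprecision.
from math import gcd

def union(a, b):
    (la, ua, sa), (lb, ub, sb) = a, b
    stride = gcd(sa, sb, abs(la - lb)) or 1   # coarsest stride covering both
    return (min(la, lb), max(ua, ub), stride)

evens = (2, 100, 2)   # elements 2, 4, ..., 100
odds = (1, 99, 2)     # elements 1, 3, ..., 99
merged = union(evens, odds)   # (1, 100, 1): stride information is lost
```

When the two sections happen to share a stride and phase, e.g. `union((1, 99, 2), (3, 99, 2))`, the result stays precise; the even/odd case above does not.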
The FIDA approach combines the functionality of Lin proposed by Burke and Cytron [18] and AI suggested by Li and Yew [16,17,22,23]. It is more precise than these two approaches because it draws from the benefits of both: simultaneity by coupling subscript positions (Lin) and more opportunities for proving independence by recording subscript expressions separately (AI). The distinguishing characteristic between each of these approaches and the previous ones is that multiple access descriptors are not combined, thereby making the union operation closed. Although this improves precision, it also implies that a list of accesses is associated with a call site. The result is that a dependence test of a particular variable between two calls can require l1 * l2 dependence tests, where l1 and l2 are the descriptor list lengths corresponding to the first and second calls, respectively.
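The l1 * l2 cost can be sketched as follows; this Python fragment is illustrative only, with `independent` standing in for a real dependence test.

```python
# Illustrative sketch of the cost noted above: testing a variable between two
# call sites pairs every descriptor of one list with every descriptor of the
# other, up to l1 * l2 tests. `independent` stands in for a real dependence
# test and is an assumption of this sketch.

def calls_independent(list1, list2, independent):
    tests = 0
    for d1 in list1:
        for d2 in list2:
            tests += 1
            if not independent(d1, d2):
                return False, tests   # one dependence settles the question
    return True, tests                # all l1 * l2 pairs proven independent

# Two descriptors against three: proving independence requires all six tests.
ok, n = calls_independent(["a1", "a2"], ["b1", "b2", "b3"],
                          lambda x, y: True)
```

Note the asymmetry that matters later in the article: proving independence requires visiting all l1 * l2 pairs, while a single failed test ends the traversal early.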

FIDA
FIDA, like Lin [18] and AI [16,17], is a precise interprocedural array summary scheme motivated by the information required for standard dependence analysis. Our motivation for developing FIDA is to assess an upper bound on the precision and efficiency of array access representations. This approach captures the same type of dependence information that is available intraprocedurally. This allows all array accesses to be analyzed in a uniform manner regardless of whether they are intraprocedural or interprocedural. In particular, standard dependence analysis techniques can be employed. The next section describes the information retained in each descriptor. In Section 3.2 we present some of the implementation highlights of FIDA in the PTRAN system, leaving the full details to Hind [21].

Functionality
As mentioned in Section 2, each nonlocal array access is described by an access descriptor. An access descriptor contains information about subscripts, loop nests and bounds, and the declared shape of the array. As with intraprocedural dependence analysis in PTRAN, we allow a linear combination of induction variables in the subscripts and loop bounds. To capture the effects of arguments, we also allow a linear combination of unmodified formal parameters in the subscripts and loop bounds, and in the dimension statement defining the shape of the accessed array. When processing a call site, the corresponding arguments will be substituted for these formal parameters.
Consider Figure 4a, where subroutine P contains a definition of the array parameter A. When summarizing P, the context of this definition (subscripts, loop nest and bounds, and dimension information) is retained. At a call site of P, this information is propagated, substituting actual parameters for their corresponding formals. This method provides functionally similar information to that obtained from data dependence analysis after inlining. It differs in that only the information of interest is "inlined": superfluous information (for the purposes of the dependence test) is not collected.
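A minimal sketch of such a descriptor and its translation at a call site might look like the following Python fragment. The field names and the {name: coefficient} subscript encoding are invented for this example (they are not PTRAN's data structures), and the translation shown handles only the simple case where each actual is a plain name.

```python
# Illustrative sketch of an access descriptor and its translation at a call
# site. The field names and the {name: coefficient} subscript encoding are
# invented for this example, and the translation handles only the simple
# case where each actual parameter is a plain name.
from dataclasses import dataclass

@dataclass
class AccessDescriptor:
    array: str         # name of the accessed array
    subscripts: list   # one {name: coeff} map per dimension; "" is a constant
    loop_bounds: dict  # loop variable -> (lower, upper)
    shape: list        # declared extent of each dimension

def translate(desc, binding):
    """Rewrite formal-parameter names to the actuals at a call site."""
    subs = [{binding.get(n, n): c for n, c in dim.items()}
            for dim in desc.subscripts]
    return AccessDescriptor(binding.get(desc.array, desc.array),
                            subs, desc.loop_bounds, desc.shape)

# Roughly in the spirit of Figure 4a: inside P the formal A is defined at
# A(i, 3); at the call site the actual bound to A is some caller array X.
d = AccessDescriptor("A", [{"i": 1, "": 0}, {"": 3}], {"i": (1, 100)}, [100, 3])
at_call = translate(d, {"A": "X"})   # the same access, expressed via X
```

Translation leaves the subscript and loop context intact while renaming, which is what lets the call-site access be tested against ordinary intraprocedural accesses.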
Figure 4b represents a functional view of the information that would be present using FIDA. (No code modification is actually performed.) By using FIDA, we can detect that the outer loop surrounding the call in Figure 4a can be executed in parallel. Less precise interprocedural analysis would force serial execution of this loop.
Where the shapes of references are consistent, both Lin and subscript-by-subscript analysis are performed. Furthermore, Lin [18] is employed to handle cases where array dimensions and sizes are not consistent across routines, or where offsets into array arguments are used.
Note that this method allows traditional dependence testing schemes to be employed. In particular, we utilize the Burke-Cytron hierarchical dependence method [18] as well as the following dependence tests: GCD, Banerjee-Wolfe, and trapezoidal Banerjee-Wolfe [24,25].
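As a reminder of what the simplest of these tests does, here is a hedged one-dimensional sketch of the GCD test in Python: a dependence between subscripts a*i + c1 and b*j + c2 requires an integer solution of a*i - b*j = c2 - c1, which exists only if gcd(a, b) divides c2 - c1. Loop bounds are ignored here, so the sketch can only prove independence or answer "maybe".

```python
# Hedged one-dimensional sketch of the GCD dependence test: a dependence
# between subscripts a*i + c1 and b*j + c2 needs an integer solution of
# a*i - b*j = c2 - c1, which exists only if gcd(a, b) divides c2 - c1.
# Loop bounds are ignored, so the test proves independence or says "maybe".
from math import gcd

def gcd_test(a, c1, b, c2):
    g = gcd(a, b)
    if g != 0 and (c2 - c1) % g != 0:
        return "independent"
    return "maybe dependent"

# A(2*i) written against A(2*j + 1) read: gcd(2, 2) = 2 does not divide 1,
# so the two references can never touch the same element.
result = gcd_test(2, 0, 2, 1)
```

The bounds-aware Banerjee-Wolfe tests named above refine exactly the "maybe dependent" answers this simple divisibility check leaves behind.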

Implementation Highlights
In this section we present a high level description of FIDA (Fig. 5), which is broken into three phases for each routine being analyzed.* A FIDA descriptor is one of two types: access or call site. An access descriptor represents an actual reference (read or write) to the array. A call site descriptor is created when a nonlocal array access exists due to a call site, i.e., an access descriptor exists at a call site.
During the def/use generation phase, definitions (uses) are created at call sites in the classical manner using the PMOD (PUSE) set. However, when a definition (use) corresponds to a variable for which FIDA descriptors exist, this definition (use) is marked as a special FIDA def (use). This keeps the number of definitions (uses) the same as in the classical approach, leaving data flow analysis unaffected by FIDA.
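The phase can be sketched as follows; this Python fragment is our own simplification, in which a call-site definition is merely tagged when FIDA descriptors exist, so the definition count matches the classical approach.

```python
# Illustrative simplification of the def/use generation phase: definitions at
# a call site come from PMOD exactly as in the classical approach, and an
# entry is merely tagged when FIDA descriptors exist, so downstream data
# flow analysis sees the same number of definitions either way.

def gen_call_defs(pmod, fida_descriptors):
    return [{"var": v, "fida": v in fida_descriptors} for v in sorted(pmod)]

# Two variables in PMOD, descriptors recorded only for the array A: two
# definitions are created either way, and only A is marked as a FIDA def.
defs = gen_call_defs({"A", "B"}, {"A": ["A(i)"]})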
A FIDA def (use) is used to communicate with the dependence analysis phase. During this phase, the context of a FIDA descriptor (subscript reference, loop information, and dimension information) may be required. When this is the case, we utilize the FIDA description information by substituting references to formal parameters with their corresponding actuals.
In the PTRAN system, dependence analysis is performed on demand as determined by a cost model of the target architecture. Under this approach only dependencies that will provide useful parallelism if disproven are tested. If breaking a dependence will not result in any useful parallelism, the dependence is not tested. For example, once a loop is marked sequential due to either insufficient granularity or some other dependence that cannot be disproven, dependence analysis of other loop-carried dependencies is not beneficial and is not performed.
This technique increases the efficiency of dependence analysis by eliminating some dependencies from consideration. It is also beneficial in the context of FIDA, as descriptor translation is directly tied to dependence analysis. If dependence analysis information is not required for a particular call site, translation is not performed.
* Currently the FIDA algorithm is limited to Fortran 77, as it does not handle recursion. However, as it is similar to AI, we anticipate that techniques to handle recursion with this approach [16] will apply, as well as those mentioned in Havlak and Kennedy [12].

For each routine, P, in a bottom-up traversal of the call graph: 1. Def/Use Generation • For each call site in P: Create a FIDA def (use) for each array argument and global variable if it is in the PMOD (PUSE) set for the called routine.
2. Dependence Analysis (Performed on demand) • If a FIDA def/use is involved: Translate the FIDA (call site or access) descriptor(s) to the call site environment using the appropriate arguments. This may require propagating through multiple call site descriptors.

3. Summarization
• For each nonlocal array reference: Create an access descriptor (subscript expressions, loop bound and nesting information, and dimension information).
• For each call site with a summarized nonlocal array reference: Create a call site descriptor (argument expressions, loop bound and nesting information, and dimension information).
• Collect the FIDA descriptors created in the previous two steps into lists associated with each nonlocal array variable.
FIGURE 5 An overview of the FIDA algorithm.
This characteristic distinguishes FIDA from all previous methods. For each routine, translated descriptors are cached to avoid redundant translations.
During the summarization phase the "context" for each nonlocal (formal or global) array access is captured in a FIDA descriptor (access or call site). An access descriptor represents an explicit reference. A call site descriptor represents an implicit reference via a call site.
Callahan [9] states that the amount of summary information can grow exponentially with the depth of the call graph. We avoid this potential exponential increase of storage by postponing the propagation of call site descriptors until the information is required by dependence analysis. Thus, the number of FIDA descriptors can grow (at worst) linearly with respect to the program.
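The lazy scheme can be illustrated with a Python sketch (routine names and structures are invented for the example): a call site descriptor stores a link to the callee's list rather than a copy, so stored summaries grow linearly with the program, and expansion through the links happens only when dependence analysis demands it.

```python
# Illustrative sketch of the lazy scheme: a call site descriptor stores a
# link to the callee's descriptor list rather than a copy, so the stored
# summaries grow linearly with the program. Expansion through the links is
# performed only on demand. Routine names here are invented for the example.

summaries = {}   # routine name -> list of descriptors

def summarize(routine, own_accesses, callees):
    # one entry per explicit access plus one link per call site
    summaries[routine] = ([("access", routine, a) for a in own_accesses] +
                          [("call", c) for c in callees])

def expand(routine):
    """Demand-driven expansion through call site descriptors."""
    out = []
    for d in summaries[routine]:
        out.extend(expand(d[1]) if d[0] == "call" else [d])
    return out

summarize("SAXPY", ["A(i)"], [])
summarize("SGEFA", ["B(k)"], ["SAXPY"])
stored = sum(len(v) for v in summaries.values())   # 3 entries stored in total
full = expand("SGEFA")                             # 2 accesses when expanded
```

Because Fortran 77 forbids recursion, the call graph is acyclic and `expand` terminates; this mirrors the restriction noted in the footnote to Section 3.2.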

THE EXPERIMENT
The PTRAN parallelization system [1-3] was used for our experiment. In addition to detecting parallelism, PTRAN has also been shown to be a useful vehicle for gathering experimental data [26]. We ran several Fortran benchmarks, varying the levels of interprocedural analysis and recording various metrics.
The benchmarks we ran are:

1. Perfect Club: The Perfect Club benchmarks are a collection of applications that were contributed by various large system vendors and that have been used to characterize supercomputer performance.
2. SPEC [28]: The System Performance Evaluation Cooperative benchmark programs are designed to establish a fair method of evaluating workstation performance on typical customer applications. The experiment includes members of the Fortran subset of Release 1.
3. LINPACK [29]: The LINPACK library is a collection of linear algebra subroutines. We modified the main subroutines to give values to their parameters if they are used in a dimension statement.
As the environment in which an experiment is performed affects the results obtained, we present an overview of our environment in the next section.

The Environment
PTRAN takes a Fortran 77 program and automatically detects parallelism, producing a parallel Fortran program.In this section we describe the environment by specifying the target model, the analysis and transformations performed by PTRAN, and two Fortran 77 language issues.

Parallelism Model
The PTRAN target model of parallelism allows loops to be designated as parallel (DOALL) or sequential. In addition to loop-level parallelism, nonloop parallelism is allowed in a "cobegin ... coend" style, with a DAG of sequencing constraints allowed among parallel begin ... end blocks [30]. IBM Parallel Fortran [31] and PCF [32] are examples of languages that fit our model.

Analysis
The PTRAN system includes a rich collection of program analyses. As a description of these analyses is beyond the scope of this article, we refer the reader to the cited articles for details, and list a summary below:

1. Interprocedural analysis [1]: alias analysis (see "Standard Fortran Versus Fortran Practice" in Section 4), constant propagation, and mod and exposed use
2. Program dependence graph for nonloop parallelism [33]
3. SSA-based data flow analysis [34] and the sparse evaluation graph [35]
4. Demand-driven dependence analysis
5. Dependence tests using the Burke-Cytron hierarchical framework [18]: GCD, Banerjee-Wolfe, trapezoidal Banerjee-Wolfe [24,25]
6. Standard intraprocedural analysis: constant propagation, induction variable analysis, loop normalization
7. Static cost analysis for architecture-specific effective parallelization [3]

In Section 4.3 we describe how the cost analysis phase is used in one of the metrics.

Transformations
Privatization is the only transformation (other than constant propagation) implemented in the version of PTRAN used in the experiment.† This fact, combined with our target loop model, implies that only loops that are parallelizable in their original form (with the aid of loop privatization) are marked parallel.
Scalar privatization for loops and nonloops is performed [30,37]. To enhance the effect of privatization, interprocedural analysis includes flow-sensitive kill information for formal parameters. We also perform array privatization when dependence analysis can prove its legality.‡ This privatization may require run-time support or additional storage to ensure proper "copy out" semantics.
† Although a general loop distribution algorithm has been implemented in PTRAN [36], its interface with the cost model is not complete. Thus, it is not included in this experiment.
‡ This is not as powerful as the element-level data flow analysis for arrays previously described [38-41].

Standard Fortran Versus Fortran Practice
The benchmarks we measure are written in Fortran 77. Although the Fortran 77 standard does not allow them, two well-known programming practices appear in these benchmarks. The first makes use of the underlying storage model most often implemented by Fortran compilers, which allows arrays to exceed their declared bounds. Although this is not legal Fortran 77, it is nevertheless done in practice.
Fortran 77 also prohibits assignment to an interprocedurally aliased parameter or common variable [42]. Several examples in the Fortran 77 benchmarks violate this prohibition.
A parallelizing compiler for Fortran 77 must decide whether to adhere to the standard or to accept common practices that are prohibited by it. PTRAN handles this problem by defining two switches that can be set to allow either of these features. As these features are present in the test programs we analyze, we allow both features in this experiment.

Interprocedural Analysis Parameters
The PTRAN system computes classical mod/exposed use interprocedural analysis by default. For our experiments we implemented two additional levels of interprocedural analysis. The levels of interprocedural analysis are:

1. Pessimistic: No interprocedural information is known. To uphold safety, all globals and formal parameters are assumed to be both modified and used by a called routine. Likewise, conservative alias information is assumed (see "Standard Fortran Versus Fortran Practice").
2. Classical: Flow-insensitive mod and flow-sensitive use analysis is performed on each called routine before call sites are processed, as described in Section 2.
3. FIDA: The precise scheme for arrays described in Section 3. Classical interprocedural analysis is used for scalars.

Metrics
Because the goal of the experiment is to measure the effectiveness and efficiency of various approaches, our metrics fall into two categories: those that measure parallelism detection and those that measure compilation overhead. We describe each in the next two sections.

Effectiveness for Parallelization
We utilize two metrics to measure the effectiveness of all levels of interprocedural array analysis and a third metric to measure the effectiveness of FIDA.
The first two metrics are the number of parallelized loops and the ideal speedup.
Ideal speedup is a static measure of the parallelized program, which disregards the costs associated with parallelism overhead (startup and management) and assumes an unlimited number of processors. It is found by statically estimating the cost of instructions along the critical path in both sequential and parallel cases and computing the ratio of the two [3]. Therefore, it is an upper bound on the amount of obtainable speedup.
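As a toy illustration of this metric (with made-up static cost estimates, not PTRAN's cost model):

```python
# Toy illustration of the ideal-speedup metric, with made-up static cost
# estimates (not PTRAN's cost model): the statically estimated sequential
# cost divided by the cost along the parallel critical path, assuming
# unlimited processors and no parallelism overhead.

def ideal_speedup(sequential_costs, parallel_critical_costs):
    return sum(sequential_costs) / sum(parallel_critical_costs)

# A fully parallelized loop of 100 iterations costing 10 units each: the
# parallel critical path is a single iteration.
s = ideal_speedup([10] * 100, [10])   # an upper bound of 100x
```

Real speedup would be lower, since startup, management, and processor limits are all ignored by construction.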
To obtain a more detailed measure of effectiveness, we inspect the results of dependence analysis. Of particular interest are those dependence tests where the information provided by our interprocedural analyses differs: dependence candidates involving call site array accesses. We refer to these candidates as the target dependence candidates. These candidates are used in our third effectiveness metric, the independence success rate, which is defined as the number of target dependence candidates proven independent divided by the number of target dependence candidates.
As dependence testing in our experiment is the same for all forms of interprocedural array analysis, only the precision of the input information can affect its result. In the case of the target dependence candidates, both pessimistic and classical analyses are not precise enough to prove independence. This results in a success rate of 0% for these approaches. In contrast, FIDA can provide enough information to prove independence, making a nonzero success rate achievable. Thus, this metric captures how often precise information is potentially beneficial. Unlike the previous metrics, it is not dependent on the transformations that are performed.

Efficiency
We measure two types of efficiency for FIDA: storage and time. We assess storage efficiency by measuring the number of access descriptors for formal parameters, the number of access descriptors for common blocks, and the number of call site descriptors.
These metrics give an estimate of the amount of storage required by this technique regardless of whether the information is used or not. Recall that the number of access descriptors corresponds to the total number of definitions and uses of nonlocal arrays in the program. A call site descriptor is created when a nonlocal array is accessed through a call. This number is bounded by the number of call sites in the program.
We capture time efficiency by recording two pairs of metrics for each nonlocal array variable: the maximum and average length of all access lists, and the maximum and average examined part of those lists.
The first pair of metrics describes the amount of information associated with a FIDA def/use. This information is composed of a list of access descriptors linked by call site descriptors. The maximum length represents an upper bound on the amount of translation that can be performed for any definition or use in the program. The average length represents the average upper bound on translation for the definitions and uses of the program.
The second pair of metrics identifies how much of this information is actually processed. As FIDA performs both the translation of arguments to formal parameters and the subsequent dependence analysis on demand, the second pair of metrics is a good measure of the time efficiency of our approach.

Results
In this section we present the results of our experiment using the parameters and metrics described in the previous section.§ We ran the Perfect, SPEC, and LINPACK benchmarks and report the results in two parts: effectiveness and efficiency. Three programs that are not included are FPPPP (unrelated compilation error) and SPICE (irreducible flow graph) in the SPEC benchmarks, and SPEC77 (storage overflow) in the Perfect benchmarks.

EHectiveness
The second column of Table 1 reports the number of loops in each program. The next three columns describe how many of these loops are parallelized using the three forms of interprocedural analysis. A comparison of the results of the first two forms of analysis seems to suggest an error, as in some cases the difference in the number of parallelized loops is greater than the number of loops with calls. However, recall that the pessimistic analysis does not capture any interprocedural information. Thus, not only must it assume that all call sites modify their arguments and global variables, but also that worst case aliasing exists (see "Standard Fortran Versus Fortran Practice"). As the numbers suggest, this conservative aliasing assumption has a drastic effect on the number of parallelizable loops.
§ These results correct an earlier version of this article [43].
Excluding pessimistic analysis, the levels of interprocedural analysis differ only in how they summarize interprocedurally accessed arrays. As interprocedural accesses arise only at call sites, only loops with calls are affected by whether classical interprocedural analysis or FIDA is performed. Thus, these loops represent an upper bound on the potential increase of parallelized loops due to a more precise interprocedural analysis. The sixth column of Table 1 identifies the number of loops that contain subroutine or function calls. To the right of this column is the number of these loops that are parallelized for the three levels of interprocedural array analysis.
Comparing the classical approach (where arrays are treated like scalars) with FIDA, we see three routines in the LINPACK benchmarks where additional parallel loops are detected: SGEDI, SPODI, and SSVDC. Each of these loops contains calls to the much documented routine SAXPY, where independent columns of a matrix are modified on different loop iterations.‖
The number of parallel loops can be a misleading metric, as some loops are more critical to the running time of a program than others. Table 2 presents ideal speedup figures for the three interprocedural analysis techniques. Although some programs exhibited a dramatic increase in the number of parallel loops between pessimistic and classical analysis, the increase in ideal speedup is sometimes more modest. We attribute this to the fact that some loops that are parallelized are not critical to a program's execution. Nevertheless, some programs show a substantial increase (FLO52Q, DYFESM) in ideal speedup when classical interprocedural analysis is used. Once again, the benefit of a precise technique is limited to the three LINPACK programs, two of which show significant improvement.
‖ A slight modification to the SAXPY code was performed to simulate constant folding of the value returned by the MOD built-in function. Similar modifications were made in [11,12].
Table 3 illustrates the effect of FIDA on target dependence candidates, i.e., dependencies involving a call site where the corresponding formal or common block element is an array. This table reports the number of target dependence candidates (CAND) and the number of these proven independent due to the additional information provided by FIDA. The last column gives the success rate. As no subscript information is present using the pessimistic or classical approach, each of these candidates would be classified as a dependence (success rate = 0%).
In 6 of the 32 programs a nonzero success rate is found; dependencies were eliminated solely due to the more precise array access information provided by FIDA. However, in three of these programs, the removal of these dependencies did not result in an increase in parallelism.
In 26 of the 32 programs, using a precise interprocedural analysis technique does not enhance automatic parallelization. This does not imply that automatic parallelization of these programs cannot benefit from precise interprocedural array analysis. By transforming loops with dependencies, parallelization can often be obtained. For example, in Blume and Eigenmann [44] the authors show significant speedup in ARC2D by performing some sophisticated transformations by hand. To perform these transformations automatically, precise interprocedural analysis is usually required. Simple transformations such as loop distribution can also benefit from precise information [17].

FIDA Efficiency
In this section we present efficiency results for FIDA using the metrics described in "Efficiency" in Section 4. The metrics concerning space are given in Table 4: the number of access descriptors and the number of call site descriptors. Access descriptors are divided into accesses of formal parameters (FP) and common blocks (CB).
Recall that an access descriptor is created for each access to a nonlocal array. A call site descriptor is created for each call site that is associated with at least one FIDA def/use. These descriptors, which are created regardless of whether they are used, represent the amount of space overhead for FIDA. The size of each access descriptor is dependent on the number of dimensions, the number of formal parameters occurring in the descriptor, and the depth of the loop nest. The size of the call site descriptor is dependent on these characteristics as well as the number of arguments in the call.
The number of access descriptors does not necessarily correlate with program size. For example, the ratios of statements to access descriptors in ARC2D and OCEAN differ by about a factor of 6. Furthermore, the proportion between formal parameter and common block descriptors varies widely. This proportion is an attribute of the method of data communication between subroutines. In ARC2D, an average of almost seven formal parameters per routine is found. In DYFESM, where common blocks are more prevalent, this ratio is less than one [26].
Whereas Table 4 represents information overhead, Table 5 illustrates how this information is used. In the second and third columns of Table 5 we report the maximum and average length of the access descriptor list associated with a particular formal parameter or common block. These lists do not require any additional storage (except the pointers required to link the descriptors together), but they do capture the magnitude of information associated with each nonlocal array.
The large list length associated with the Perfect program MG3 requires explanation. Through a chain of calls, portions of an array of 60,000 elements are passed through many routines. Each routine accesses parts of the array and calls several other routines that also access it. At the end of these call chains is a routine, CPASSM, which makes 208 references to the array. As CPASSM is called eight times by each of several routines, the list of descriptors grows quickly.
As this list comprises several duplicate sublists, each of which contains a potentially unique call site descriptor, it does not require a great deal of storage. Thus, no storage or performance penalty is paid for this excessive size unless the list is examined; moreover, a simple optimization could eliminate the duplicate sublists. As dependence analysis is performed on demand, an element on the list is inspected only if all preceding elements have proven independence. The last two columns of Table 5 record the number of list elements inspected. Notice that even though some programs have a large maximum list length, the length of the list that is actually inspected is usually small. This result is consistent with the effectiveness results reported in the previous section. Once independence cannot be proven for a FIDA reference, whatever remains of its list is not translated or tested. As the previous section showed that few programs exhibited an increase in parallelism using FIDA, a limited amount of list inspection is expected. This illustrates an important advantage of this approach when parallelization is the goal: much of the overhead (list traversal and translation) is incurred only when it may be beneficial. When independence cannot be proven, the compilation performance penalty is negligible.
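The demand-driven traversal just described, which stops as soon as independence cannot be proven, can be sketched as follows. The `proves_independent` predicate is a stand-in for PTRAN's actual dependence tester, which we do not model here.

```python
from typing import Callable, List, Tuple

def inspect_list(descriptors: List[str],
                 proves_independent: Callable[[str], bool]) -> Tuple[bool, int]:
    """Walk an access-descriptor list on demand.

    Each element is translated and tested only if every preceding element
    was proven independent; the first failure stops the traversal, so the
    remainder of the list is never translated or tested.
    Returns (all_independent, number_of_elements_inspected).
    """
    inspected = 0
    for d in descriptors:
        inspected += 1
        if not proves_independent(d):
            return False, inspected   # dependence assumed; stop early
    return True, inspected

# A long list whose first element already fails the test: only one element
# is inspected, so a large maximum list length costs little in practice.
long_list = ["d%d" % i for i in range(193)]
result = inspect_list(long_list, lambda d: False)
```

This mirrors the observation in Table 5: the maximum list length bounds the worst case, but the examined length stays near one whenever independence fails early.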
An inspection of QCD2 explains the large lists examined. The program contains a subroutine, MULT, with 108 references to each of the two array parameters A and B; as each reference generates an access descriptor, 108 descriptors are created for each array. This routine is called twice within a loop (in routine ROTMEA). The first column of an 18 by 2 matrix is written by the first call, and the second column is written by the second call. As no overlap exists between the two calls, each of the 108 references on both lists is translated and tested. Although this results in an exorbitant number of dependence tests, a simple program transformation could reduce both lists to one element. This transformation would not only reduce the number of translations and dependence tests by a factor of 108 (to 2 and 1, respectively), but also reduce the list size, and hence the number of access descriptors, from 108 to 1.

Consider the results for LINPACK in Table 5. In 2 of the 17 routines (SGEDI and SPODI), the average examined list length is relatively close to the average list length. This contrasts with the other programs, where the average length examined is close to one. Tables 1 to 3 show that independence and new parallelism are detected for these two routines, whereas they are not for the remaining routines. Once again, the extra processing implied by examining the access lists is only paid when it might be beneficial.

To further assess the frequency of dependence tests involving interprocedurally accessed array references, we provide two additional tables. Table 6 records the number of interprocedural dependence candidates involving arrays, partitioning these into arrays passed as parameters and arrays that reside in common blocks.
The second column reports the total number of dependence candidates tested in our demand-driven approach. The third and fourth columns give the number of dependence candidates for formal parameters and common blocks, respectively, duplicating these columns from Table 3. The percentage of interprocedural arrays tested is small. This illustrates that the early focus on intraprocedural dependence analysis has been justified.
To better judge the magnitude of the FIDA success ratios, Table 7 reports the success ratios for dependence tests with at least one FIDA reference and those with no FIDA references.

Discussion
The results of the previous two sections illustrate two points: (1) precise interprocedural analysis, by itself, did not dramatically improve the parallelization of these programs; and (2) a demand-driven implementation allows such analysis to be performed efficiently.

The sophisticated interprocedural and intraprocedural analyses used in this experiment did not dramatically affect program parallelization. From this, we do not feel that one may conclude there is no reason to perform a precise interprocedural analysis. In fact, recent work [44,45] has shown that advanced transformations can significantly improve parallelization and has called for precise interprocedural analysis information. In particular, we feel that loop distribution and array privatization would benefit greatly from our analysis. Other interprocedural transformations have also been suggested [46]. If a precise form of analysis is required to perform these transformations, the efficiency of such an analysis is paramount. Due to its demand-driven implementation, FIDA is reasonably efficient in the context of automatic parallelization.

RELATED WORK
In previous work [9-19], emphasis has been placed on improving the precision of interprocedural analysis for array accesses. Although experimental results appear in some of these articles, most have been limited to the parallelization of LINPACK, with few empirical results concerning efficiency.
Li and Yew [16,17] evaluate the effectiveness of their approach by reporting the number of parallel loops containing calls in the LINPACK benchmarks. No information is presented pertaining to the efficiency of their approach, beyond stating that it runs 2.6 times faster than the Parafrase implementation of [15].
Havlak and Kennedy [11,12] evaluate their implementation of bounded regular sections using LINPACK and a collection of other programs. They measured the efficiency of their implementation in real time as part of PFC. They report the number of calls in parallel loops as well as the number of dependencies removed using their approach.
Although a direct comparison with these two works would be illustrative, it is not possible, as only Li and Yew report the number of parallel loops with calls without loop distribution. Under this scenario, they report a total of six parallel loops in five LINPACK routines, all but one of which we parallelize.~ The failure to parallelize the loop in SGEFA is not a result of the precision of dependence analysis, but rather a consequence of PTRAN not being able to evaluate the function ISAMAX at compile time. Furthermore, in routines SQRDC and SGEDI, we detect an additional parallel loop containing a call. Neither of these loops was parallelized by Li and Yew [17].
The observation that more precise interprocedural analysis alone is not enough for effective parallelization was also made by Irigoin et al. [47], but no experimental numbers were presented. They call for better programming practice as well as new compilation techniques such as array privatization.

SUMMARY
This work has presented an experiment designed to capture the effectiveness and efficiency of interprocedural analysis of array accesses in the context of parallelization. It has shown that classical interprocedural analysis can provide a significant improvement over pessimistic interprocedural analysis.

~ Their original work incorrectly reported two additional loops in SSIFA as being parallel without loop distribution [Li, Personal Communication, 1992].
In QCD2, in addition to the maximum list of 193 elements, a list of 108 elements exists and is fully examined. This program contains a subroutine (MULT) with three array parameters of size 18 (A, B, and C). In the subroutine, C is computed as a function of A and B. Although this computation could have been performed in a nest of loops, it was written as 18 assignment statements, each of which contains six uses of both arrays A and B.
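The unrolled-versus-loop contrast just described can be illustrated by a toy count of textual array references: 18 unrolled statements with six uses each yield 108 references (and hence 108 access descriptors) per array, while the equivalent loop contains a single reference site per array. This is our own model of reference counting, not PTRAN's.

```python
# Toy model: each distinct textual reference to array A yields one
# access descriptor.  Unrolled form: 18 assignment statements, each
# using A six times.
unrolled_refs = ["A(use %d in stmt %d)" % (u, s)
                 for s in range(18) for u in range(6)]

# Looped form: the same computation written as a loop contains a single
# textual reference to A (executed many times, but summarized once).
looped_refs = ["A(I)"]
```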

Table 1 .
Number of Parallel Loops for the Perfect Club, SPEC, and LINPACK Benchmarks * One parallel loop can be found using range analysis on the function call isamax.

Table 2 .
Ideal Speedup for the Perfect Club, SPEC, and LINPACK Benchmarks

Table 4 .
FIDA Storage Efficiency for the Perfect Club, SPEC, and LINPACK Benchmarks

Table 5 .
FIDA Performance Efficiency for the Perfect Club, SPEC, and LINPACK Benchmarks

Table 6 .
Dependence Tests Using Interprocedurally Accessed Arrays