POLYSHIFT Communications Software for the Connection Machine System CM-200

We describe the use and implementation of a polyshift function PSHIFT for circular shifts and end{o(cid:11) shifts. Polyshift is useful in many scienti(cid:12)c codes us-ing regular grids, such as (cid:12)nite di(cid:11)erence codes in several dimensions, multigrid codes, molecular dynamics computations, and in lattice gauge physics computations, such as Quantum Chromodynamics (QCD) calculations. Our implementation of the PSHIFT function on the Connection Machine systems CM{2 and CM{200 o(cid:11)ers a speedup of up to a factor of 3{4 compared to CSHIFT when the local data motion within a node is small. The PSHIFT routine is included in the Connection Machine Scienti(cid:12)c Software Library (CMSSL).

1 Introduction E cient and minimal data motion is critical for high performance in most computer architectures.The polyshift function presented in this paper addresses this issue.The impact of the data motion on performance depends upon the memory architecture of the system.Memory systems have been slower than processors, almost as long as electronic computers have been built.Though the technological reasons for this fact has changed over time, it is expected to be the case also for the foreseeable future.Memory hierarchies (registers, cache, main memory, etc.) and parallel memories (banked and interleaved memories) have been used extensively for a long time to achieve a desired level of performance at an acceptable price.The e ciency of these architectures depend critically upon locality of reference.Massively parallel supercomputer architectures achieve the required memory bandwidth by using thousands of processing units with local memories.We refer to a processor, its local memory and associated communications circuitry, as a node.A communications system interconnects the nodes.Preserving locality of reference assumes several new characteristics in distributed memory architectures.Data placement among memory modules a ects the lower bounds for latency and bandwidth.The routing disciplines determine how close to the bounds the actual data motion time is.The goal in allocating data to the memory units is to make most data references be references to local memory, yet achieve good load balance.Whenever references must be nonlocal, then the placement should be such that the communication time is minimized with a good (optimal) routing strategy.Ideally, data is mapped to nodes such that nonlocal references always are references to adjacent nodes.The ability to accomplish this task depends both on the data reference pattern, and the network topology (mesh, binary cube, tree, ring, etc.).The access time to data in nonlocal memory depends both upon the network topology and the routing mode (for example, circuit switched, packet switched, or wormhole routing 2]).Many problems in the natural and mathematical sciences and in engineering can be solved by discretizing the governing equations onto a regular grid (lattice) in two, three, or several dimensions.One such example is Quantum Chromodynamics (QCD) calculations, which use a four{dimensional space{time regular lattice.The computational requirements for QCD are enormous.The desired lattice sizes are of the order of 100 million grid points.For each such grid a range of parameter values must be covered.Each set of parameters, known as a con guration, requires 10 14 { 10 15 oating{point operations 3].Clearly, a high e ciency in utilizing the computational and communication resources is highly desirable.An early implementation of a QCD code on the Connection Machine system CM{2 resulted in a performance of 0.9 G op/s in 32{bit precision for a 2048 node con guration.Code restructuring and other optimizations improved the performance by close to a factor of six to 5.2 G op/s.A large fraction of the performance enhancement in the QCD application was due to code restructuring to allow for concurrent bidirectional communications in each of four dimensions simultaneously and to avoid extraneous local memory moves.The polyshift function described here is a generalization of the communications routines developed for the QCD application.The polyshift function is included in the Connection Machine Scienti c Software Library (CMSSL) 20,17] as the routine PSHIFT.The PSHIFT routine is critical for the performance of many scienti c programs based on nite di erence techniques, multigrid techniques, as well as molecular dynamics applications.In this paper, we describe this software, along with features of the Connection Machine system CM{200 which support it, and give performance numbers and analyses.The PSHIFT software technology is also used in a special compiler known as the stencil or convolution compiler, now available in CMSSL for the CM{2 and CM{200.A prototype version of this compiler was described in 1].The stencil compiler will be described elsewhere.In section 2, we describe the programming model of the CM{200 used by the compilers and the run{time system, as well as the hardware features which are used for the implementation of PSHIFT.Section 3 describes the software architecture of the PSHIFT routine, and Section 4 discusses its implementation in some more detail.Section 5 describes the interface of the PSHIFT library routine to Connection Machine Fortran 18, 19], a subset of Fortran 90 14] with extensions.Calling sequences and supported functionality are reported.Section 6 presents some performance measurements and a performance model.We conclude with a section summarizing our experience from developing and using the PSHIFT library routine, and discussing possible future enhancements and generalizations. 2 The Connection Machine Model CM{200

Data allocation
The CM{2 and CM{200 support a programming model with a global address space.(For the remainder of this article, unless otherwise stated, descriptions of the CM{200 hardware and software will also apply to the CM{2.)Data arrays declared in any of the supported languages are by default distributed evenly over all memory units.The default allocation of arrays to memory units is entirely based upon array shape.This allocation is known as a canonical layout, and is determined by the geometry manager at run{time.For each array, the nodes are con gured as an array of nodes with the same rank as the data array.Thus, for one{dimensional data arrays the nodes form a linear array; for a two{dimensional data array, the nodes form a two{dimensional array of nodes, etc.The geometry manager also decides which elements are mapped into the same node, and to which node each aggregate of data is mapped.On the Connection Machine systems, a set of consecutive elements 9] along each axis are mapped into the same memory unit.If the number of elements along an axis is not evenly divisible by the number of nodes assigned to that axis, then some nodes may not be assigned any elements.Of the nodes that are assigned elements, all but one receive the same number of elements.Thus, for a one{dimensional data array of M elements mapped to N nodes, d M N e successive elements are assigned to the rst bM=d M N ec nodes.On nodes that receive fewer than d M N e elements, memory is nevertheless allocated for this number of elements.For rank two or higher arrays, one or more axes may be padded in this manner.Nonvalid data elements are identi ed by setting bits of a garbage mask.This mask has one bit for each data element.Other mappings of interest with respect to either load balance (such as in LU decomposition 8,9,11,12]) or communications requirements (such as in FFT computations 10, 21]) are cyclic mapping 9], and combinations of consecutive and cyclic mappings, such as block cyclic mappings 7,8,13,22].The proposed languages Fortran D 5] and High Performance Fortran 4] support such data allocations.However, the current Connection Machine languages do not support cyclic mappings, or combinations thereof with the consecutive mapping.
The consecutive allocation de nes which elements are assigned to the same node, but does not specify how aggregates of elements are assigned to nodes.On the CM{200, the default assignment is such that a pair of successive indices along any axis are either mapped into the same memory unit, or into the memory units of adjacent nodes.This mapping, known as NEWS order, uses a binary{re ected Gray code 6,15] for the encoding of node addresses.The standard binary encoding is referred to as SEND order.
In the default NEWS order, allocation blocks i and i + 1 are assigned to adjacent nodes, while in the SEND order allocation block i is assigned to node i.In a SEND ordered assignment, blocks N 2 1 and N 2 are assigned to nodes at a distance of log 2 N apart.
The LAYOUT compiler directive is available to individually choose which order is to be used for the axes of a given data array.The default strategy for assigning nodes to axes of multidimensional data arrays is to make the local length of each of the data array axes the same, if possible.For example, in two dimensions the geometry manager attempts to make the local data array approximately square, and in three dimensions it attempts to make the local array approximately a cube.This strategy minimizes the average number of remote references per local reference when the references along the di erent array axes are equally frequent for all array elements.In other words, this minimizes the surface area of a subgrid for a given volume.For example, a 32 by 32 array mapped onto a four{ dimensional binary cube would result in subarrays on each node of size 8 by 8, i.e., each instance of an axis being assigned a two{dimensional subcube.However, because of certain low level details of the architecture, the geometry manager does not always succeed in creating subarrays with axes of equal lengths, even when that should be possible.The shape of the local arrays, but not their size, is controlled by assigning weights to the various axes using the LAYOUT compiler directive.The number of local array elements is not a ected by the LAYOUT directives as long as the array size is such that the lengths of the axes of the selected processing array shape divides the corresponding data array axes.A high weight for an axis relative to the weight of other axes increases its local length at the expense of the length of the other axes.The SEND and NEWS order speci ers have no e ect on the local array shape.They only a ect the node to which subarrays are assigned.Further control of axes layouts can be obtained by using the SERIAL speci er in the LAYOUT compiler directive.This speci er forces an axis to be entirely local to a memory unit, so that a distinct copy of the entire axis resides in every memory unit.Finally, the allocation of an array can also be controlled by the ALIGN compiler directive.With this directive one array is aligned with another array in a speci ed way.The run{time system maintains information about array type, location, shape, and layout in an array descriptor for each array.

Data motion primitives
Fortran 90 and CM Fortran provide the intrinsic functions CSHIFT and EOSHIFT for circular shift, and end{o shift.These intrinsic functions shift an entire Fortran array in a given dimension by a given amount.CSHIFT is a circular shift, while EOSHIFT is an end{o shift, with incoming boundary data speci ed as literal data, or a scalar or array{valued variable.Since these functions are the array{syntax expressions of o set array indices, their use is very common in scienti c programming.The PSHIFT routine allows a user to specify, in a single statement, data motion equivalent to one or more calls to CSHIFT and EOSHIFT.The CM{2 and CM{200 implementations of PSHIFT also o er enhanced performance, compared to calls to CSHIFT and EOSHIFT through an e cient implementation of multiple communication operations.In the CM{200 systems, an exchange of data between adjacent nodes requires the same time as a one way communication between a pair of adjacent nodes.Moreover communication along several axes can be performed concurrently.This ability for concurrent communication between nodes is exploited by PSHIFT to e ciently perform multiple shifts when compared to CSHIFT.The speci c node hardware which supports the concurrent exchange of data is described next.

Node architecture
The CM{200 can have up to 2048 nodes, each of which is equipped with a 32{ or 64{bit (internal) oating{point unit (FPU), 4 MBytes of Dynamic Random Access Memory (DRAM) operated in page mode, and a network interface.There is also a set of 32 bit{serial processors, and some associated hardware for translating data between the bit{serial representation and the 32{bit wide representation required by the FPU.64{ bit operations are supported by the 64{bit FPU, but the data paths external to the FPU are 32{bits wide.This requires consecutive loads and stores of the two 32{bit halfwords associated with each 64{bit oating point word.During memory loads, a DRAM page fault adds one cycle.That is, striding out of a 4k (32{bit) word DRAM page causes a single{cycle stall.Stores always require close to two cycles.The clock frequency is 10 MHz (7 MHz for the CM{2).The bit{serial processors are physically arranged on two \CM Chips", each of which is the terminus for 12 communications wires.One wire on each CM Chip is connected to the other CM Chip on the node, the others go to other nodes.Because of this doubling of the CM Chips and their associated communications wires, the machine is e ectively interconnected as a binary hypercube with up to 11 dimensions, depending on machine size, and with connections between adjacent nodes consisting of a pair of 1{bit wide channels.Figure 1 gives a block diagram of a node and Figure 2 shows how a three{dimensional cube of CM Chips can be thought of as a two{dimensional cube with a pair of channels between each pair of nodes.The CM{200 is described in detail in 16].
The basic communications operation between a pair of adjacent nodes is an exchange, i.e., a pair of nodes can exchange data in the time required for one of the nodes to send data to the other node.Moreover, it is possible to concurrently exchange data on all channels of a node.In a 2048 node system, up to 22 (2 11) data elements can be exchanged concurrently.The network interface contains three register les of 32 32{bit registers each, known as \transposers" A, B and C. The transposers function also to convert data between the 1{bit wide bit{serial representation (employed by the bit{serial CM processors), and the 32{bit wide representation required by the FPU.In this case, each transposer converts a block of 32 bit{serial words of 32 bits, to a 32{bit parallel word, and vice versa.When functioning as part of the communications system, the transposers serve as the source and destination of the data exchanged between a pair of adjacent nodes.In a typical operation, transposer A is loaded with two 32{bit data elements for each adjacent node to which data shall be sent.After the exchange, transposer B contains the data received { two 32{bit data elements from each adjacent node.\Slots" 1 through 11 in the transposer register le correspond to one set of 11 1{bit channels, and slots 17 through 26 correspond to the other set of 11 1{bit channels.The correspondence between transposer slots and cube dimensions is illustrated in Figure 3.

Virtual machine model
The Connection Machine systems implement a virtual machine model in which one data element (e.g. a CM Fortran array element) is assigned to one virtual processor.Conceptually, the user data arrays are fully distributed amongst the virtual processors, and all operations take place fully in parallel.However, to actually implement a data distribution which assigns one data element to one physical processor would result in highly ine cient use of the hardware, and also the inability to run di erent sized problems on the same system con guration.The virtual machine model allows the distribution of a large number of data elements onto a smaller number of physical processors, so that multiple data elements are assigned to a single physical processor, alleviating these problems.(Details of this distribution were discussed above.)Now consider a typical shift operation in CM Fortran.The operation CSHIFT(A, 1,1) implies that all references to A(I; :; :) after the shift operation reference element A((I + 1) mod N; :; :) prior to the shift, where N is the length of the shifted axes.In the current implementation of CSHIFT all elements of A are moved precisely as stated in the call to the function CSHIFT.EOSHIFT is implemented analogously, and so is PSHIFT.But, since, in general, there are many array elements assigned to each node, the virtual machine model results in a large number of memory moves local to a node.As the size of a subarray increases, the time for local memory moves becomes comparable to, and eventually exceeds, the time for communication.Thus the performance advantage of PSHIFT decreases with increasing size of the local subarrays.However, the local memory moves could be avoided by suitable manipulation of pointers, i.e., by appropriately modifying the address calculation in referencing array ele-ments.The CM{200 compilers do not currently maintain a set of pointers.In the PSHIFT function, the data motion between nodes and the support of the virtual machine model are implemented as separate modules.Both modules are required for consistency with the memory model used by the CM Fortran compiler.

PSHIFT software architecture
The polyshift operation is supported by the three routines PSHIFT, PSHIFT SETUP and DEALLOCATE PSHIFT SETUP.The PSHIFT routine is further divided internally into modules corresponding to 1. the data motion between nodes, 2. local data motion to support the virtual machine model,

the boundary conditions
We now describe the routines in more detail.The purpose of the PSHIFT SETUP function is to carry out operations that are the same for each call to the routine PSHIFT, which performs the shift operations, and thus can be performed once and for all.Largely, the PSHIFT SETUP function performs address calculations and generates a custom routine to accomplish the requested set of shifts.On the CM{200, each array axis is assigned a subcube of nodes.For a data array mapped to the nodes in NEWS order, traversing a given data array axis corresponds to traversing the binary subcube to which it is allocated.For a shift operation it is necessary to determine, for each node, the dimensions for incoming and outgoing data in the positive and negative axis direction.This information is held in communication tables on the CM.One table, of at most 4 32{bit words, is required for each shift.The fact that di erent cube dimensions are used for an axis is clearly seen in Figure 2. In order to perform the required calculations the PSHIFT SETUP function must gather information about the mapping of the array to the nodes and the array data type from the array descriptor.While the axis encoding (NEWS or SEND), the data type, and the array rank are all known at compile{time, neither the actual array extents nor the machine size is known until run{time.Using these nal pieces of information, the geometry manager determines the actual layout of the arrays.It is only at this point that PSHIFT has enough information to perform the actual setup functionality required by the given arrays.All data arrays having the same shape, same data type, and same layout directives, are mapped to the CM{200 nodes in the same way.Thus, one call to the setup routine su ces for a given set of shifts for all such arrays.The PSHIFT routine performs the actual data motion.The data movement is performed in three steps: 1. Perform all of the on{node memory{to{memory moves (in support of the virtual machine model).2. Perform the required exchange of data between nodes.3.If needed, move the boundary data for EOSHIFT to the appropriate nodes.
The rst step performs any local memory moves that are needed.If the source and destination arrays are the same for any shift, then the source is rst copied to a temporary array to avoid overwriting data needed in step 2. No local memory moves are required when the shift distance is greater than the length of the local axis segment for an axis.If this is not the case, local memory moves can only be avoided by code restructuring in which the boundary is extracted from the original array.The second step performs the data motion required to exchange data between nodes.When data are needed from an adjacent node, a single exchange across the hypercube channels is needed.PSHIFT can handle any shift distance up to twice the subgrid extent along the shift axis and shifts of distance S = L2 k for k 1, where L is the length of the subgrid along the shift axis.Shifts of length S = L2 k are referred to as power{of{2 shifts.For such shifts, all array elements are moved a distance of two nodes (require two exchanges).This fact is due to the properties of the binary{re ected Gray coding used in distributing the array elements.The third step moves boundary data for EOSHIFT into the appropriate nodes.This step is only required for end{o shifts with array or scalar variable boundaries.For end{o shifts, PSHIFT also allows either a constant 0 or a constant 1 to be speci ed for the boundary values.These boundaries can often be handled without requiring this third step.This aspect of PSHIFT is discussed further in the next section.The subfunction DEALLOCATE PSHIFT SETUP deallocates the data structures allocated in the PSHIFT SETUP routine.The only CM memory allocated by PSHIFT SETUP is for the communication tables, since these tables are di erent for each processor node.Temporary arrays, if needed by PSHIFT, are allocated and deallocated on each call to PSHIFT.

PSHIFT Implementation
In this section we describe the implementation details of the PSHIFT routines.The desired level of control over data motion required that much of the code for the CM{200 be expressed in CMIS (Connection Machine Instruction Set).The remainder of the polyshift code was written in C. Lying somewhere between assembly language and microcode, CMIS allows low{level control of the features of a Connection Machine system without a need for the programmer to be concerned with the lowest level details, such as setting up pipelines in the node.The CMIS functionality includes memory{ to{memory transfers, memory{FPU pipelines, and special instructions for concurrent communication on multiple channels of a node.The PSHIFT SETUP routine accepts a set of arguments that specify a prototypical array and a set of shifts to be performed on arrays of the same size and layout as the prototype.This information is used to generate a custom IMP (Internal Macro Procedure, consisting of CMIS and IMP instructions) that will perform the shifts requested.Communication tables are computed for each node.These tables are used by the IMP to load and unload the correct transposer slots for each exchange of data across the hypercube connections.The IMP is written to a le, assembled, and then loaded into the SRAM (Static Random Access Memory) of the sequencer(s) for the CM{200 conguration being used.The PSHIFT SETUP function returns an integer ID, which is used as an argument to the PSHIFT routine to identify the previously generated IMP.The same ID can be used for all calls to PSHIFT for the same set of shifts on any arrays with the same layout and data type as the prototype array passed to PSHIFT SETUP.Di erent calls to the PSHIFT SETUP function, and di erent IDs are needed for di erent array layouts, and/or a di erent set of shifts.The IMPs generated by PSHIFT SETUP perform the following steps: Repeat until all exchanges are completed 1. Load transposer A, indirectly with communication tables, from the source arrays 2. Exchange data with adjacent nodes 3. Unload transposer B, indirectly with communication tables, to the destination arrays The IMPs generated by the PSHIFT SETUP function perform all of the communications required by the shifts, except for assigning the array, scalar, and in some cases, constant 1.0 and 0.0 values to the boundaries.These boundary values are assigned with calls to Connection Machine Run{Time Library routines after the IMP has completed.Boundaries assigned the constant 0.0 are handled in the IMPs by loading 0's into transposer A slot 0 and modifying the communication tables to read from transposer B slot 16 for nodes that require the boundary elements.Each boundary node in e ect sends the appropriate constant to itself.This communications path is an artifact of the original CM{2 bit{serial architecture and was used to communicate between the pair of CM{2 chips found on each node (see Figure 1).The same technique is used if all end{o shifts specify a constant boundary value of 1.0 (loading a 1.0 instead of a 0.0 into transposer A slot 0).In the case that some end{o shifts specify a constant 1 boundary and some specify a constant 0.0, then the constant 0.0 boundaries are handled in the IMP and the constant 1.0 is broadcast to those boundary elements requiring the value, after the IMP has completed.Due to limited space in SRAM, and for performance reasons, there are actually two setup routines.The PSHIFT SETUP routine generates IMPs in which all loops are completely unrolled.The PSHIFT SETUP LOOPED routine generates IMPs in which the loops are unrolled only slightly e.g., 4 data exchanges are performed in each iteration of a loop.The PSHIFT SETUP routine generates faster routines due to the high overhead of loop control statements in IMPs (especially on the CM{2) at the expense of larger IMPs and higher setup times.In fact, the PSHIFT SETUP routine can fail in some cases if it produces an IMP that is too large to t into the available SRAM.In these cases the PSHIFT SETUP LOOPED routine must be used.The calling convention is the same for PSHIFT SETUP and PSHIFT SETUP LOOPED.Other restrictions and performance considerations are discussed in more detail in the next two sections.
For shifts that cannot be handled by PSHIFT, PSHIFT SETUP sets an internal ag, causing the shift to be performed through calls to the Connection Machine Run{Time Library.
If none of the requested shifts can be handled by IMPs then the performance of PSHIFT will be approximately the same as for the equivalent set of calls to the intrinsic routines CSHIFT and EOSHIFT.For a shift along any axis a, with subgrid axis extent L, and a shift distance S, a shift will be handled by an IMP if the following restrictions are met: Axis a is not padded.Axis a is distributed in NEWS order.jSj 2L or S = L(2 k ) for k 1.
Also, no more than two shifts are allowed per axis.

PSHIFT CM Fortran interface
PSHIFT is accessed from CM Fortran using the calls described in Figure 4.The shift{ type arguments are prede ned constants de ned in the CMSSL include le \cmsslcmf.h".They may be one of CMSSL CSHIFT, CMSSL EOSHIFT SCALAR, CMSSL EOSHIFT ARRAY, CMSSL EOSHIFT 0, or CMSSL EOSHIFT 1, in any combination.Boundary values for end{o shifts may be the constant values 0.0 or 1.0, a front{end scalar variable, or a CM array (of rank one less than the source and destination arrays).
To compare the use of CSHIFT and PSHIFT, we present in Figure 5 code fragments representing the calculation of a ve{point, two{dimensional stencil.Notice that in using PSHIFT, temporary storage must be managed by the user, while in the corresponding code using CSHIFT, the compiler manages temporary storage.In the example, the temporary arrays are named N, E, W, and S. The advantage of PSHIFT lies in the implementation of the required communication.The library routine PSHIFT performs the speci ed communications concurrently, while the CM{200 compilers do not instantiate multiple CSHIFT or EOSHIFT operations concurrently.Thus, in our example, the CM Fortran code requires four communications instead of one.An optimized CM Fortran compiler would not only perform the required communication concurrently, but also avoid unnecessary local data motion, and optimize the register usage in the oating{point unit.A prototype compiler of this nature has been implemented for stencils applied to two{dimensional arrays.This compiler is known as a stencil or convolution compiler 1].A generalization of this prototype compiler has now been completed for the CM{200 and is part of CMSSL 3.1 for the CM{2 and CM{200.A CM{5 version is now under development.The CM{2/200 version of the stencil compiler uses a module of the PSHIFT function for data motion between nodes.
For the QCD computations mentioned earlier PSHIFT replaces eight communications performed sequentially by one concurrent communication exchanging data along all four axes at once.The use of PSHIFT in a fragment of a QCD application is shown in Figure 6.
A nal example is the calculation of a 27{point, three{dimensional stencil, given as a complete subroutine in Figure 7.This example also illustrates the re{use of intermediate communications results in a complicated communications pattern, as indicated by the appearance of several PSHIFT destination arguments as both later PSHIFT source arguments, and as operands in the arithmetic expression.

Timings
In this section we present performance data for the PSHIFT routines.The goal of PSHIFT is to provide improved performance over the intrinsic functions CSHIFT and EOSHIFT.
The PSHIFT routine supports the virtual machine model, like CSHIFT and EOSHIFT.For small subarrays of high rank the speedup is expected to be relatively high, while for large subarrays of low rank the on{node memory moves will dominate, and PSHIFT is not expected to o er much improvement over CSHIFT or EOSHIFT.Furthermore, the advantage of PSHIFT is the highest when the lengths of the axes of the local subarray are the same.For instance, if the local subarray is of shape 100 10, then only 10 element exchanges along the two axes are overlapped for a shift distance of one.Then, 90 elements must be exchanged without any concurrency.In the default layout targeted by the geometry manager, the local segments of the array axes are of as equal length as possible.Thus, in such a layout, a maximum overlap between communications along di erent axes is achieved.The timings reported in Tables 1 through 4 were designed to verify the performance behavior relative to the intrinsic functions.All timings were performed using the version of PSHIFT that is included in the Connection Machine Scienti c Software Library (CMSSL) Version 3.0.The timings were carried out at the Advanced Computing Laboratory of Los Alamos National Laboratory on a Connection Machine system CM{200 with a Sun{4 front{end.The system was operated using the Connection Machine System Software Version 6.1, and Version 2.1 of the CM Fortran compiler.In order to get accurate timings, each call was repeated a large number of times, and the time measured for all calls.The number of calls was chosen such that the total time was approximately the same for the di erent cases.Thus, for example 10,000 calls were used for shifting a one{dimensional array, while 100 calls were used for shifting four{dimensional arrays along all four axes.The times given in tables 1 through 4 are given in milliseconds per call.The time required for the PSHIFT SETUP calls varied from approximately 0.1 second up to 4 or 5 seconds.There was no correlation of setup time to the array size or shifts speci ed.The large amount of disk I/O required for writing, assembling, and loading the IMPs accounts for the majority of the setup time.Arrays of type REAL*8 were used for all timings.The shape of the subarrays were chosen to measure the maximum relative performance gain of PSHIFT over the intrinsic functions.Thus, in two dimensions square subarrays were used and in three dimensions cubic subarrays were used.This is the default layout on the Connection Machine systems.
All the data in Tables 1 through 4 are for shift distances of 1 along all axes of the arrays.The Elapsed time shown is the total time including time in which the front{ end is busy while the CM is idle.The CM{time is the time that the CM was busy.The CM{time will always be less than or equal to the Elapsed time.
The speedup of PSHIFT over CSHIFT is summarized in Figure 8.The speedups are computed using the Elapsed times.As expected, the speedup decreases with the size of the local subarrays.For large subarrays, the execution time is dominated by on{node memory moves to support the virtual machine model, and the performance of PSHIFT and CSHIFT is practically the same.For small subarrays the time for a shift is dominated by exchanging data with neighboring nodes.In the one{dimensional case the expected speedup of PSHIFT over CSHIFT for a shift in both directions of the axis is two.However,   CSHIFT with a shift distance of 1 along all four axes of rank four subarrays of type REAL*8.
the speedup is actually higher due to di erent implementation techniques.This fact is most notable for the front{end time.For four{dimensional subarrays the expected speedup of PSHIFT over CSHIFT for shifts in both directions of all four axis is eight.But, in this case the measured speedup is signi cantly less.Part of the reason is that though the actual exchange of data between a node and its neighboring nodes is fully concurrent, all memory operations in a node are serial.A more careful analysis of the expected performance is given next.

Performance Model
From the architecture of PSHIFT we can derive the following model for the time to execute a single call to PSHIFT for a k{dimensional subarray of shape L 0 L 1 L k 1 = V : where L i is the length of subgrid axis i, k is the dimension (rank) of the subgrid, N sl is the number of slices (one slice is 32 bits) per element for the given data type, S i is the shift distance for axis i, T 0 is a startup time, independent of k and L i , T mm is the time for the memory{to{memory transfer of a single slice of data, T mt is the time to transfer a slice of data from memory to a transposer, T tm is the time to transfer a slice of data from a transposer to memory, T ex is the time to do one exchange.The ceiling function occurs in the last term because the exchange operation swaps two slices at a time; if the number of slices to be sent over the wires is not even, the number of exchanges required is the next multiple of two.
For a shift distance of one and L i = L, 0 i < k, combining terms yields T total = a + bL (k 1) + cL k : (2) The L (k 1) term is the one associated with the actual between{processor communications; the L k term is associated with on{processor memory{to{memory tra c.Thus depending on the relative values of the coe cients b and c, the PSHIFT execution time can become dominated by either term for a given subgrid extent and array rank.
We can obtain approximate values for the time of an exchange and the time for a memory transfer from the rank one array data.In this case, there is only one exchange per call to PSHIFT regardless of the subgrid length.Thus, any dependence of the timing on subgrid length comes entirely from the memory{to{memory portion of the code.For the smallest subgrid (length 4), the PSHIFT CM{time is totally dominated by the single exchange, since one array element is sent in each direction, and only six array elements are moved within the processor's memory.(We are anticipating here that an exchange takes much longer than a memory transfer.)Similarly, for the largest rank one subgrid, the time is dominated by the memory transfer time.The timings indicate the CM{time for a single call to PSHIFT with a rank one array and a subgrid length four requires 41 microseconds.
To obtain the time per memory transfer, we take the timing in Table 1 for a subgrid length of 16384, subtract o the 41 microseconds corresponding to the subgrid length four shift, and divide the remainder by twice 16380 (the total number of memory{ to{memory transfers not accounted for by the subgrid length four times).From this we obtain a time of .81microseconds per slice moved.As noted above, the time per exchange is much greater than the time for a memory{to{memory transfer.However, as the subgrid size increases, the execution time will eventually be dominated by on{ processor memory{to{memory time.Figure 9 gives the time to execute PSHIFT as a function of the rank one subgrid length.An interesting feature is the kink in the curve occurring at a subgrid length of 64.This can be understood in terms of the Connection Machine memory architecture, wherein memory locations are addressed in pages.Changing the address of consecutive memory accesses by more than one page will require an extra step of resetting the page address; hence the e ective time of memory{to{memory transfers depends on whether the source and target locations are in the same page or not.The slope of the curve increases for this case.Figure 10 and Figure 11 give the results of tting the above model to the rank two and rank three data.As can be observed, the t is quite good for the rank two data.For the rank three data, we have only four data points (so a cubic function of L can be t exactly too the data points).

Conclusions and discussion
PSHIFT gives the best performance gain over the CM Fortran intrinsics for 1. subarrays with axes of equal lengths, 2. bidirectional shifts, 3. arrays of high dimensionality, 4. small subarrays.
The rst three rules simply result from the observation that parallelism is lost when they are violated.The fourth comes about because, as the subarray size increases, the execution times of both CSHIFT and PSHIFT become dominated by the on{node memory{to{memory movement of the data.There is no remedy for this situation as long as the virtual machine model is fully supported.However, it would be possible for a compiler to analyze the context of the communications pattern to determine what is the ultimate destiny of the shifted data.In the cases where this analysis succeeds, it would be possible to generate code so that unshifted data is used within a node, thereby eliminating the unnecessary local memory moves.Such a technology is not currently present in the CM Fortran compiler.Compiler features such as these are a research topic at this time, and are being considered by Rice and Syracuse Universities in developing a compiler for Fortran D 5], as well as at Thinking Machines Corporation.Another approach towards reducing or eliminating the on{node memory{to{memory moves, is to identify classes of calculations which use patterns of shifts and consume the output locally, and provide library routines or a special{purpose compiler to generate optimized code for these classes.Both these approaches have been taken by Thinking Machines Corporation.The Stencil Library is an example of the former approach, and the Stencil Compiler 1] is an example of the second approach.

Figure 2 .
Figure2.Eight nodes can be thought of as a three dimensional hypercube,

Figure 8 : 1 .Figure 9 :
Figure 8: The speedup of PSHIFT compared to CSHIFT for some arrays and shift distances of 1.

Table 1 :
Timings on a Connection Machine system CM{200 for calls to PSHIFT and CSHIFT with a shift distance of 1 on rank one subarrays of type REAL*8.

Table 2 :
Timings on a Connection machine system CM{200 for calls to PSHIFT and CSHIFT with a shift distance of 1 along both axes of rank two subarrays of type REAL*8.

Table 3 :
Timings on a Connection Machine system CM{200 for calls to PSHIFT and CSHIFT with a shift distance of 1 along all three axes of rank three subarrays of type REAL*8.

Table 4 :
Timings on a Connection Machine system CM{200 for calls to PSHIFT and