An extended two-phase method for accessing sections of out-of-core arrays

A number of applications on parallel computers deal with very large data sets that cannot fit in main memory. In such applications, data must be stored in files on disks and fetched into memory during program execution. Parallel programs with large out-of-core arrays stored in files must read/write smaller sections of the arrays from/to files.


Introduction
Parallel computers are being used increasingly to solve large computationally intensive as well as data-intensive applications, such as large-scale computations in physics, chemistry, biology, engineering, medicine, and other sciences. The data required by many of these applications must be stored in files on disks, as it is too large to fit in main memory [8]. The program must perform I/O to access data from disks. Examples of such applications are Hartree-Fock calculations in chemistry, very large Fast Fourier Transforms to detect faint radio pulsars, seismic data processing, weather and climate modeling, 3D turbulence simulations, scattering and radiation problems in computational electromagnetics, and several others [1].
Multidimensional arrays are widely used as data structures in scientific programs. Scientific applications with large out-of-core data sets may therefore have one or more out-of-core multidimensional arrays stored in files. At run time, the program must fetch smaller sections of these arrays from files, perform computation, and, if necessary, store the results back to files. Different processors may need different sections of the arrays depending on the data distribution, and the sections may have strides in each dimension.
In this paper, we describe a method, called the extended two-phase method, for parallel programs to access sections of out-of-core arrays efficiently. In this method, the requesting processors cooperate in reading or writing data, a process known as collective I/O. Specifically, processors cooperate to combine several I/O requests into fewer larger granularity requests, reorder requests so that the file is accessed in proper sequence, and eliminate simultaneous I/O requests for the same data. In addition, the extended two-phase method partitions the total I/O workload among processors dynamically, depending on the access requests. Compared to a static partitioning scheme, dynamic partitioning results in a more balanced distribution of I/O among processors and therefore performs considerably better.
We present extensive performance results comparing the extended two-phase method with a direct (non-collective) method on the Intel Touchstone Delta. For this purpose, we use two real parallel applications (out-of-core matrix multiplication and an out-of-core Laplace's equation solver) as well as several synthetic access patterns. We found that the extended two-phase method performed considerably better than the direct method for a wide range of access patterns, array sizes, and numbers of processors.
The rest of this paper is organized as follows. In Section 2, we describe the I/O access patterns of two out-of-core parallel applications and thus motivate the need for the extended two-phase method. The method itself is explained in Section 3. In Section 4, we describe a simple static scheme for partitioning I/O among processors and then show how the partitioning can be improved by using a dynamic scheme. Extensive performance and scalability results are presented in Section 5. We draw overall conclusions in Section 6.

Two Out-of-Core Parallel Applications
Here we describe the I/O access patterns of two out-of-core parallel applications: matrix multiplication and a Laplace's equation solver.

Out-of-Core Matrix Multiplication
We consider an out-of-core GAXPY algorithm for matrix multiplication, described in [3]. Let A, B, and C be n × n matrices such that C = AB. The matrices can be represented in terms of their individual columns as

$$A = [a_1, \ldots, a_n], \quad B = [b_1, \ldots, b_n], \quad C = [c_1, \ldots, c_n].$$

The GAXPY algorithm for computing C = AB is

$$c_j = \sum_{k=1}^{n} b_{kj}\, a_k, \qquad j = 1 : n,$$

where $b_{kj}$ is the element in row k and column j of B. In other words, to compute the j-th column of C, we need the j-th column of B and all columns of A.
An out-of-core GAXPY algorithm for matrix multiplication can be implemented as follows. In the first step, processors read two-dimensional sub-blocks of matrix A into main memory such that the sub-blocks of all processors together span entire rows (see Figure 1). The processors also read two-dimensional sub-blocks of matrix B into memory such that the sub-blocks of all processors together span entire columns. The data now present in memory is sufficient to compute the first two-dimensional sub-block of matrix C. This computation requires a global sum operation. The processors then write the newly computed sub-block of C to the file. In the following step, processors read the next set of sub-blocks of B (shown by dashed lines in Figure 1), reuse the sub-blocks of A fetched in the previous step, and calculate the second sub-block of C. This process is repeated until all the sub-blocks in the first block of rows of C are computed. The above process is then repeated with the sub-blocks from the next set of rows of A, shown by dashed lines. The entire matrix C is computed in this fashion. Note that, at any time, each processor has only one sub-block of each of the matrices A, B, and C in memory.

Out-of-Core Laplace's Equation Solver
We consider a Laplace's equation solver that uses a Jacobi iteration method. This is a stencil computation where the value at each point is computed by using the values at its neighbors in each of the four directions.

    do k = 1, niter
        A(i, j) = (B(i-1, j) + B(i+1, j) + B(i, j-1) + B(i, j+1)) / 4,   i, j = 1 : n
        Exchange A and B
    end do

An out-of-core Laplace's equation solver can be implemented as follows. Divide the out-of-core array into two-dimensional sub-blocks such that two blocks (one for old values, one for new values) can fit at a time in the memory of each processor. Assign blocks to processors in a round-robin fashion as shown in Figure 2. Each processor reads one block at a time from the file containing the array. Processors can either communicate boundary rows and columns or read them directly from the file. After a processor computes new values, it writes the new block to a file containing the new array. This process is repeated on other sub-blocks of the array to complete one iteration. The algorithm is repeated for further iterations until it converges.
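As a rough illustration of the stencil itself, the following in-core Python sketch performs the Jacobi update on a small grid. It is not the out-of-core implementation discussed here: the names `jacobi_sweep` and `jacobi`, the list-of-lists grid, and the choice to hold the boundary values fixed are all assumptions made for this example.

```python
def jacobi_sweep(old):
    """Return a new grid after one Jacobi sweep.

    Each interior point becomes the average of its four neighbours in
    the old grid. Boundary values are copied unchanged (an assumption
    of this sketch, not something the text specifies).
    """
    n = len(old)
    new = [row[:] for row in old]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = (old[i - 1][j] + old[i + 1][j]
                         + old[i][j - 1] + old[i][j + 1]) / 4.0
    return new


def jacobi(grid, niter):
    """Apply niter sweeps; 'exchange A and B' becomes a rebinding."""
    b = [row[:] for row in grid]
    for _ in range(niter):
        a = jacobi_sweep(b)
        b = a  # the new values become the old values of the next sweep
    return b
```

In an out-of-core setting, `old` would be one sub-block read from the file together with its boundary rows and columns, and `new` would be written back to the file holding the new array.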

Accessing Out-of-Core Array Sections
In the above applications, processors access two-dimensional sub-blocks of out-of-core arrays. This type of access pattern also occurs in other applications, such as out-of-core LU solvers [10]. Since arrays are usually stored in a file in either column-major order (as in Fortran) or row-major order (as in C), the data required by each processor is not located contiguously in the file. In many cases, the requests of different processors are interleaved in the file. To read non-contiguous data with the interfaces currently provided by parallel file systems, each processor must explicitly seek to the appropriate location in the file, read a small chunk of data, then seek to the next location, and so on. We call this the direct method. The Vesta and PIOFS file systems on the IBM SP [5, 9] and the nCUBE file system [6] do provide support for the user to specify a logical view of the data to be read and use a single call to read data. Each processor's request, however, is serviced independently, and the file systems do not perform collective I/O.
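To make the cost of the direct method concrete, here is a minimal Python sketch of the seek-and-read sequence one processor would issue for a two-dimensional sub-block of a column-major array. The function names, the zero-based inclusive bounds, and the use of an in-memory `io.BytesIO` object in place of a parallel file are assumptions made for illustration only.

```python
import io


def direct_read_requests(n_rows, row_lo, row_hi, col_lo, col_hi, elem_size):
    """List the (offset, length) pairs needed to read a sub-block.

    The array is assumed to be stored in column-major order with n_rows
    rows, so each column of the sub-block is one contiguous piece and
    consecutive pieces are separated by gaps.
    """
    length = (row_hi - row_lo + 1) * elem_size
    requests = []
    for col in range(col_lo, col_hi + 1):
        offset = (col * n_rows + row_lo) * elem_size
        requests.append((offset, length))
    return requests


def direct_read(f, requests):
    """Issue one seek and one small read per request."""
    chunks = []
    for offset, length in requests:
        f.seek(offset)
        chunks.append(f.read(length))
    return b"".join(chunks)
```

The number of requests grows with the number of columns in the sub-block, and every processor issues its own sequence independently of the others.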
The drawback of the direct method is that the parallel file system may receive a large number of low-granularity requests from multiple processors in any order. As I/O latency is very high, such access requests perform poorly. For many access patterns, such as in the above applications, the I/O performance can be improved by using the collective knowledge of the access requests of all processors. Processors can cooperate among themselves to perform I/O in large chunks and in the proper order, a process known as collective I/O. The extended two-phase method specifies a procedure for performing

Extended Two-Phase Method
The two-phase method, proposed in [7, 4], is a collective I/O technique for reading an entire in-core array from a file into a distributed array in main memory, and conversely, for writing a distributed in-core array to a file. I/O is done in two phases. In the first phase, processors always read data assuming a conforming distribution. A conforming distribution is defined as a distribution of an array among processors such that each processor's local array is stored contiguously in the file, resulting in each processor reading a single large chunk of data. For an array stored in a file in column-major order, a column-block distribution is the conforming distribution. In the second phase, data is redistributed among processors to the desired distribution. Since I/O cost is orders of magnitude more than communication cost, the cost incurred by the second phase is negligible. This two-phase approach is found to perform well for all array distributions [7, 4].
We have extended the basic two-phase method to access sections of out-of-core arrays. This extended two-phase method performs I/O for out-of-core arrays efficiently by dynamically partitioning the I/O workload among processors, depending on the access requests; combining several I/O requests into fewer larger granularity requests; reordering requests so that the file is accessed in proper sequence; and eliminating simultaneous I/O requests for the same data.

Reading Sections of Out-of-Core Arrays
We first describe the extended two-phase method for reading array sections. For the purpose of explanation, we consider the case where each processor must read a section (specified in terms of a lower bound, upper bound, and stride in each dimension) of a two-dimensional array stored in a file in column-major order. In general, the extended two-phase method can be used for arrays with any number of dimensions, stored in any order in the file, and accessed by a subset of the total number of processors.
The extended two-phase method divides the I/O workload among processors by assigning ownership to portions of the file. A processor can directly access only the portion of the file it owns, called its file domain. For a file stored in column-major order, the file domain of each processor is some set of columns of the array. Section 4 describes two ways of assigning file domains to processors.
Assume that each processor must read a section $(l_1 : u_1 : s_1,\ l_2 : u_2 : s_2)$ of the out-of-core array, in global coordinates. The sections required by different processors may be identical, overlapping, or distinct. In the first step of the extended two-phase method, processors exchange their own access information (the indices $l_1, u_1, s_1, l_2, u_2, s_2$) with other processors, so that each processor knows the access requests of other processors. This information is stored in a data structure called the file access descriptor (FAD). The FAD contains exactly the same information on all processors. This exchange phase is not required if the collective I/O interface itself provides information about the access requests of other processors.
Since each processor knows its own file domain and the access requests of other processors, it can determine what portion of the data in its file domain is needed by other processors. This is done by computing the intersection of the requests of other processors from the FAD and its own file domain. This information is stored in a data structure called the file domain access table (FDAT). The FDAT of a processor thus contains information indicating which portions of its file domain have been requested by other processors.
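The intersection step can be sketched as follows in Python. The representation is an assumption made for this example, not the authors' data layout: the FAD is a list indexed by processor number, each entry is a pair of `(lower, upper, stride)` triples for rows and columns with inclusive bounds, and a file domain is an inclusive range of columns. The helper names `clip_strided` and `build_fdat` are hypothetical.

```python
def clip_strided(lo, hi, stride, dom_lo, dom_hi):
    """Clip the strided range lo:hi:stride to the interval [dom_lo, dom_hi].

    Returns (first, last, stride) for the elements of the range that fall
    inside the interval, or None if there are none.
    """
    if lo >= dom_lo:
        first = lo
    else:
        # Smallest lo + k*stride that is >= dom_lo (ceiling division).
        k = -(-(dom_lo - lo) // stride)
        first = lo + k * stride
    upper = min(hi, dom_hi)
    if first > upper:
        return None
    last = first + ((upper - first) // stride) * stride
    return (first, last, stride)


def build_fdat(fad, dom_lo, dom_hi):
    """Intersect every request in the FAD with one column-range file domain.

    The result maps a processor number to the part of its request that
    lies in this file domain; processors with no data here are omitted.
    """
    fdat = {}
    for proc, (rows, cols) in enumerate(fad):
        clipped = clip_strided(cols[0], cols[1], cols[2], dom_lo, dom_hi)
        if clipped is not None:
            fdat[proc] = (rows, clipped)
    return fdat
```

Because the FAD is identical on every processor, each processor can run this computation locally without further communication.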
Each processor must now read data from its file domain as specified by the FDAT. For example, Figure 3 shows the file domain of processor 0 and, for some access pattern, the portions of this file domain that have been requested by other processors. A simple way of reading is to read all the data needed by processor 0, followed by that needed by processor 1, and so on, in order of processor number. This method, however, may result in too many small accesses that are not in sequence. For reading the data efficiently, processors must analyze the FDAT and use a read strategy that accesses the file in sequence and contiguously.
We use the following general method for this purpose. Each processor calculates the minimum of the lower bounds and the maximum of the upper bounds of all sections in its FDAT. This effectively determines the smallest section containing all the data that must be read from the file domain (for example, section ABCD in Figure 3). This section may also contain some data that is not required by any processor. If the processor attempts to read only the useful data, it may result in a number of small strided accesses. To avoid this, the processor uses an optimization we proposed previously, called data sieving [14, 13]. The processor reads a column (for column-major order) of the section at a time in a single operation into a temporary buffer. This may include some unwanted data. The useful data is extracted from the temporary buffer and placed in communication buffers, depending on which processors need the data. The entire section is read from the file domain in this fashion. The processor may read more than one column at a time, if sufficient memory is available to do sieving on the set of columns. This forms the first phase of the extended two-phase method.
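A minimal Python sketch of data sieving for a single column is given below. It assumes a column-major file, inclusive zero-based row bounds, and a dictionary `requests` mapping a processor number to the `(lower, upper, stride)` row range that processor needs from this column; these names, and the use of a seekable file object such as `io.BytesIO`, are assumptions made for illustration.

```python
import io


def sieve_read_column(f, n_rows, col, requests, elem_size=1):
    """Read one column's useful data with a single contiguous read.

    Returns a dict mapping each requesting processor to the bytes it
    needs, i.e. the contents of its communication buffer.
    """
    # Smallest row range covering every request in this column.
    lo = min(r[0] for r in requests.values())
    hi = max(r[1] for r in requests.values())

    # One large read into a temporary buffer; it may contain unwanted data.
    f.seek((col * n_rows + lo) * elem_size)
    buf = f.read((hi - lo + 1) * elem_size)

    # Extract the useful elements for each requesting processor.
    send = {}
    for proc, (r_lo, r_hi, r_stride) in requests.items():
        parts = []
        for row in range(r_lo, r_hi + 1, r_stride):
            start = (row - lo) * elem_size
            parts.append(buf[start:start + elem_size])
        send[proc] = b"".join(parts)
    return send
```

The trade-off is extra memory and some wasted bytes in exchange for far fewer I/O calls; an element requested by several processors is read from the file only once.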
The second phase of the extended two-phase method consists of communicating the data read in the first phase to the respective processors. From the information in the FDAT, each processor determines what data must be sent to which processor. In addition, since each processor knows the file domains of other processors and its own access request, it can calculate how much data to receive from other processors and where to store it in memory.
The two phases of the extended two-phase method can either be done distinctly, by performing all I/O first and then communication, or be overlapped (pipelined), by reading smaller portions of data and communicating them.

Writing Sections of Out-of-Core Arrays
The algorithm for writing sections is essentially the reverse of the algorithm for reading sections. From the FAD, each processor determines what portions of its write request are located in the file domains of other processors; those portions must be sent to the respective processors. From the FDAT, each processor determines what portions of the write requests of other processors are located in its own file domain; those portions must be received from the respective processors. This communication forms the first phase of the extended two-phase method for writing sections.
Data is written to the file in the second phase. The FDAT is analyzed in the same way as in the read algorithm. Each processor calculates the minimum and maximum of all indices in its FDAT, which determines the smallest section containing all the data to be written to the file domain. The processor uses data sieving [14, 13] to write the useful data in this section. Note that, since there may be "holes" between the useful data to be written, an extra read operation is required before writing. This extra read is not required if the useful data is located contiguously in the file.
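The extra read before a sieved write can be illustrated with the following Python sketch for one column. The function name, the argument layout, and the in-memory file object are assumptions for this example; `values` is taken to hold the new elements back to back, in row order.

```python
import io


def sieve_write_column(f, n_rows, col, row_lo, row_hi, row_stride,
                       values, elem_size=1):
    """Write rows row_lo:row_hi:row_stride of one column in one write call."""
    offset = (col * n_rows + row_lo) * elem_size
    length = (row_hi - row_lo + 1) * elem_size

    if row_stride == 1:
        # The useful data is contiguous: no holes, so no extra read.
        buf = bytearray(length)
    else:
        # Holes exist: read the current contents first so that the
        # elements between the useful ones are written back unchanged.
        f.seek(offset)
        buf = bytearray(f.read(length))

    # Place the new elements at their positions in the buffer.
    for k, row in enumerate(range(row_lo, row_hi + 1, row_stride)):
        start = (row - row_lo) * elem_size
        buf[start:start + elem_size] = values[k * elem_size:(k + 1) * elem_size]

    # One large contiguous write.
    f.seek(offset)
    f.write(buf)
```

The read-modify-write sequence doubles the I/O volume for strided writes, which is why the contiguous case is worth detecting.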
If the sections requested to be written by different processors have some elements in common, there is a data-consistency problem. The result depends on the particular implementation of the extended two-phase method. In our implementation, if there are write requests from multiple processors to the same location, the data from the highest numbered processor is written to the file.

Partitioning the I/O Workload
In the extended two-phase method, processors cooperate to perform I/O. The exact partitioning of the I/O workload among processors depends on how file domains are defined. In general, I/O can be partitioned either statically or dynamically. Note that we are referring to a logical partitioning of the file among processors; the file is not physically repartitioned into separate files.

Static Partitioning
One way of partitioning I/O (for an array stored in column-major order) is to assign a block of columns of the entire out-of-core array to each processor, as if the array were distributed among processors in a column-block fashion. The file domain of each processor is therefore a block of columns of the array, stored contiguously in the file. The size of each file domain can be determined from the size of the array and the number of processors and is independent of the access requests. This is called a static partitioning scheme. Figure 4(A) shows the file domains of four processors, with static partitioning of I/O.
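A static column-block partitioning can be sketched in a few lines of Python. The function name and the handling of a column count that does not divide evenly (the last blocks are shortened or left empty) are assumptions made for this example.

```python
def static_file_domains(n_cols, n_procs):
    """Assign each processor a contiguous block of columns of the whole array.

    Returns a list indexed by processor number; each entry is an inclusive
    (first_col, last_col) pair, or None if the processor gets no columns.
    """
    block = -(-n_cols // n_procs)  # ceiling division
    domains = []
    for p in range(n_procs):
        lo = p * block
        hi = min(lo + block, n_cols) - 1
        domains.append((lo, hi) if lo <= hi else None)
    return domains
```

For an array with 4096 columns and 16 processors, every file domain is 256 columns wide, regardless of which columns are actually requested.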

Dynamic Partitioning
The main drawback of static partitioning is that the partitioning is independent of the access requests. For many access patterns, static partitioning may result in an imbalance of I/O among processors: some processors may perform more I/O than others, and some may not perform any I/O at all. For example, consider the access pattern in Figure 4. With static partitioning, the access requests span the file domains of only two processors (1 and 2); therefore, only two processors perform all the I/O. In addition, if we increase the size of the out-of-core array, keeping the number of processors fixed, the size of each file domain also increases, and the access requests span the file domains of fewer processors, resulting in greater I/O imbalance.
A dynamic partitioning scheme, based on access requests, can divide the I/O workload more evenly and therefore improve I/O throughput. Figure 4(B) illustrates such a partitioning scheme. For a file stored in column-major order, each processor calculates the first and last among the columns of the sections requested by all processors. The section formed by these columns and all the rows of the out-of-core array is called the bounding section. The bounding section includes the sections requested by all processors and is located contiguously in the file. File domains are determined by dividing the bounding section among processors in a column-block fashion, as Figure 4(B) shows.
If the requested sections span all the columns of the out-of-core array, the dynamically selected file domains are identical to those determined statically. If the requested sections span only a few columns, however, dynamic partitioning provides a much better balance of I/O among processors (as Figure 4 shows). It also reduces the memory requirements of the extended two-phase method, because the file domain of each processor is smaller. With static partitioning, if all requested sections are located in a single processor's file domain, all the requested data may not fit in the memory of that processor. Consequently, I/O and communication may need to be done in several stages. This situation is less likely to occur with dynamic partitioning, because the requested data is more evenly divided among processors.
For an array stored in row-major order, file domains are determined as follows. Each processor calculates the first and last among the rows of the sections requested by all processors. The bounding section is the section formed by these rows and all the columns of the out-of-core array. File domains are determined by dividing the bounding section among processors in a row-block fashion.
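The bounding-section computation can be sketched as follows in Python. The sketch works along whichever dimension is slowest-varying in the file (columns for column-major order, rows for row-major order). The function name, the `(lower, upper, stride)` request triples with inclusive bounds, and the rounding used when the bounding section does not divide evenly are assumptions made for this example.

```python
def dynamic_file_domains(requests, n_procs):
    """Divide the bounding section of all requests into contiguous blocks.

    requests holds one (lower, upper, stride) triple per processor along
    the slowest-varying dimension. Returns a list indexed by processor
    number of inclusive (first, last) index pairs, or None for a
    processor that receives no part of the bounding section.
    """
    first = min(lo for lo, _, _ in requests)
    # Last index actually touched by each strided request.
    last = max(lo + ((hi - lo) // s) * s for lo, hi, s in requests)

    width = last - first + 1
    block = -(-width // n_procs)  # ceiling division
    domains = []
    for p in range(n_procs):
        lo = first + p * block
        hi = min(lo + block - 1, last)
        domains.append((lo, hi) if lo <= hi else None)
    return domains
```

When the requests span the whole array, the result coincides with a static column-block (or row-block) partitioning; when they span only a few columns, the work is still spread over all processors.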
Figure 5 summarizes the extended two-phase method for reading sections of out-of-core arrays, with dynamic partitioning of I/O.
1. Exchange access information with other processors and fill in the file access descriptor (FAD).
2. Calculate the smallest section, called the bounding section, that includes the sections requested by all processors.
3. Determine the file domain of each processor by dividing this bounding section among processors in a column-block manner for arrays stored in column-major order or a row-block manner for arrays stored in row-major order.
4. Compute the intersection of the FAD and this processor's file domain, and fill in the file domain access table (FDAT).
5. Calculate the minimum of the lower bounds and the maximum of the upper bounds of all sections in the FDAT to determine the smallest section containing all the data needed from the file domain.
6. Read this section by using data sieving, and communicate the data to the requesting processors.

We used the Intel Touchstone Delta for an experimental study of the performance of the extended two-phase method. The Touchstone Delta has 512 compute nodes (each an Intel i860/XR microprocessor) and 32 I/O nodes (each an Intel 80386 microprocessor). Each I/O node is connected to two disks, resulting in a total of 64 disks. Intel's Concurrent File System (CFS) provides parallel access to files. By default, CFS stripes files across all 64 disks in 4-Kbyte blocks. See [2] for a detailed discussion of the performance of CFS. We studied the performance of the extended two-phase method versus the direct method extensively for several synthetic access patterns as well as for two real out-of-core parallel applications: matrix multiplication and a Laplace's equation solver. We report the results of these experiments below.

Synthetic Access Patterns
We used three basic types of synthetic access patterns:
1. Common sections: All processors access the same section of the array.
2. Overlapping sections: Parts of the section requested by a processor may overlap with parts of the sections requested by other processors.
3. Distinct sections: The section requested by each processor does not have any data in common with the section requested by any other processor.

Reading Common Sections
Table 1 shows the performance of the direct and extended two-phase methods for reading common sections (4K × 4K array, 16 processors). Figure 6 illustrates the approximate location of each of these sections in the array. We measured the performance of the extended two-phase method with both static and dynamic partitioning. In all cases, the extended two-phase method performed considerably better than the direct method, because it read the common section only once and broadcast it to other processors. In the direct method, on the other hand, all processors read the same section from the file simultaneously, resulting in extra I/O overhead. In all cases, the extended two-phase method took much less time with dynamic partitioning. With static partitioning, each processor's file domain was of size 4K × 256. Therefore, all sections, except those in case V, were located in the file domains of only a few processors. With dynamic partitioning, on the other hand, the I/O requests were evenly divided among all available processors, resulting in higher I/O throughput. Since the section in case V spanned all 4096 columns, the statically and dynamically selected file domains were identical, and so was the performance. For case V, the extended two-phase method performed considerably better than the direct method, because the direct method resulted in a large number of small requests spread across the entire file.

Reading Overlapping Sections
Table 2 shows the time taken for reading various overlapping sections. Figure 7 illustrates the approximate location of each of these sections in the array. To represent these overlapping sections for all processors concisely, we use the following notation. Each processor's request is denoted by $(l_1 + ov1 \cdot p : u_1 + ov1 \cdot p : s_1,\ l_2 + ov2 \cdot p : u_2 + ov2 \cdot p : s_2)$, where $p$ is the processor number and $ov1$, $ov2$ are some constants. The amount of overlap can be changed by varying $ov1$ and $ov2$.
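The notation can be expanded mechanically, as the following Python sketch shows. It assumes that the constants ov1 and ov2 are multiplied by the processor number p, so that each processor's section is the base section shifted by a per-processor offset; the function name `section_for` and the tuple layout are hypothetical.

```python
def section_for(p, l1, u1, s1, l2, u2, s2, ov1, ov2):
    """Return processor p's request as two (lower, upper, stride) triples.

    The base section (l1:u1:s1, l2:u2:s2) is shifted by ov1*p in the
    first dimension and by ov2*p in the second dimension.
    """
    dim1 = (l1 + ov1 * p, u1 + ov1 * p, s1)
    dim2 = (l2 + ov2 * p, u2 + ov2 * p, s2)
    return (dim1, dim2)
```

With a 64-column base section and ov2 = 32, for instance, the sections of consecutive processors share half of their columns; with ov2 = 64 or more they no longer share any.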
The extended two-phase method with dynamic partitioning performed the best in all cases. The sections in cases I and II were of the same size, but they differed in the amount of overlap; the sections in case I had more overlap than those in case II. Since the total number of columns of the out-of-core array spanned by the sections in case I was less than that spanned by the sections in case II, it took less time to read the sections in case I. The sections in cases IV, V, and VI spanned only a few columns. For these cases, the direct method performed better than the extended two-phase method with static partitioning, because static partitioning resulted in only a few processors performing I/O. The extended two-phase method with dynamic partitioning, however, performed better than the direct method, since the I/O workload was better distributed. The worst case for the direct method was case VII, which spanned all columns of the array. The sections in case VIII were overlapping in both dimensions, and again the extended two-phase method with dynamic partitioning took the least time.

Reading Distinct Sections
Table 3 shows the time taken for reading distinct sections. Figure 8 illustrates the approximate location of these sections in the array. We use the same notation as above, $(l_1 + ov1 \cdot p : u_1 + ov1 \cdot p : s_1,\ l_2 + ov2 \cdot p : u_2 + ov2 \cdot p : s_2)$, for representing distinct sections. The overlap factors $ov1$ and $ov2$ must be large enough to ensure that the sections are distinct.
In case I, the requests of different processors were situated in separate locations in the file, because the sections requested were located along rows. As a result, I/O in the extended two-phase method with dynamic partitioning was identical to that in the direct method, and they took the same time. The extended two-phase method with static partitioning took longer than the direct method, because only a few processors performed I/O. The sections in cases II through IV were located along columns, and the requests of different processors were interleaved in the file. The extended two-phase method therefore performed considerably better for these cases. Static partitioning did not perform well for the sections in case II, because they spanned only a few columns. The best case for the extended two-phase method was case IV, since the sections spanned all columns. The sections in cases V and VI were partly interleaved in the file, and even for these cases, the extended two-phase method performed the best.
Figure 8: The distinct sections listed in Table 3 (not to scale)

Writing Distinct Sections
We considered only the case where each processor writes a distinct section to the file, because other cases, such as writing overlapping or common sections, are unlikely to occur. Table 4 shows the time taken for writing distinct sections. The sections chosen were the same as those for reading (Table 3, Figure 8). As for reading distinct sections, the direct method and the extended two-phase method with dynamic partitioning took the same time for writing the sections in case I, whereas the extended two-phase method with static partitioning took longer. In the other cases, the extended two-phase method with dynamic partitioning performed considerably better than the direct method.

Accessing Sections with Non-Unit Strides
We also tested the performance for accessing sections with non-unit strides. When an array section has a non-unit stride, each element requested is strided in the file. The only way of reading such array sections using a direct method is to seek explicitly to each individual element and read only that element. This results in very low granularity of data transfer, which is very expensive. The extended two-phase method overcomes this drawback of the direct method by reordering requests and using data sieving for larger granularity accesses.
Table 5 shows the performance for reading sections with non-unit strides. The sections in case I spanned almost the entire array, with stride equal to the number of processors. As a result, static and dynamic partitioning took the same time. The sections in cases II and III were located diagonally across the out-of-core array. The sections in case IV were located along columns, and the sections in case V were located along rows. In all cases, the extended two-phase method was more than 20 times faster than the direct method. Table 6 shows the performance of the extended two-phase method for writing sections with non-unit strides. The sections chosen were the same as in Table 5. Even for writing sections, the extended two-phase method improved I/O performance considerably.

Scalability
We also studied the scalability of the extended two-phase method for large numbers of processors, large array sections, and large out-of-core arrays. Since dynamic partitioning always performed better than, or at least as well as, static partitioning, we considered only dynamic partitioning for the scalability experiments. Table 7 shows the timings obtained by varying the number of processors requesting array sections from 4 to 128, for both reading and writing. We selected a few sections in each category: common, overlapping, distinct, and non-unit strides. Note that, as the number of processors was increased, the total amount of I/O performed also increased.
Table 7: Scalability of the extended two-phase method. The number of processors accessing sections was varied from 4 to 128. Array size 4K × 4K real numbers (single precision), time in seconds. DR = direct read, ETP = extended two-phase method with dynamic partitioning, DW = direct write.
The extended two-phase method scaled well with the number of processors. In many cases, the time taken increased only slightly as the number of processors was increased, indicating that we obtained higher I/O throughput by increasing the number of processors. For example, for the sections in case I, the time taken increased from 1.282 sec. to only 2.130 sec. when the number of processors was increased from 4 to 128. In some cases, such as case II, the time taken even decreased.
The direct method performed quite poorly when the number of processors was increased, especially for cases II, IV, and VIII. The extended two-phase method also scaled well for writing sections. For small numbers of processors, the extended two-phase method took longer for writing, because of the extra read before each write. For large numbers of processors ( 16), however, the extended two-phase method performed better than the direct method in spite of the extra read. For sections with non-unit strides, the extended two-phase method performed considerably better than the direct method.
Table 8 shows the performance for accessing large sections of a large out-of-core array of size 16K × 16K single precision real numbers (file size 1 Gbyte). Figure 9 shows the approximate location of these sections in the array. We considered common, overlapping, and distinct sections for reading and distinct sections for writing. The trend in the results was the same as for a 4K × 4K array (Table 7). The direct method performed much worse for accessing large sections than for small sections, whereas the extended two-phase method performed consistently well for sections of any size. Figures 10 and 11 compare the relative performance of the two methods for reading and writing the sections in case VI of Table 8.

Real Applications
We also studied the performance of the extended two-phase method with dynamic partitioning versus the direct method for two real out-of-core parallel applications: matrix multiplication and a Laplace's equation solver.

Matrix Multiplication
Table 9 shows the I/O time for out-of-core matrix multiplication for different array sizes and numbers of processors. The I/O time was calculated as the maximum of the time taken by all processors, for all I/O (reading and writing) required in the out-of-core matrix multiplication algorithm described in Section 2. Note that in the extended two-phase method, the I/O time includes the time for data communication. In all cases, the extended two-phase method performed better than the direct method. Figure 12 shows that the percentage improvement in I/O time provided by the extended two-phase method over the direct method varied from 22% to 75%.

Laplace's Equation Solver
Table 10 shows the I/O time for an out-of-core Laplace's equation solver for different array sizes and numbers of processors. The I/O time is the maximum of the time taken by all processors for all I/O (reading and writing) required in the out-of-core Laplace's equation solver algorithm described in Section 2. As in the case of matrix multiplication, the extended two-phase method performed better than the direct method. The percentage improvement in I/O time provided by the extended two-phase method over the direct method is shown in Figure 13. The percentage improvement was lower than in the case of matrix multiplication, possibly because of the difference in the I/O access patterns of the two applications. Recall that in out-of-core matrix multiplication, matrix B is accessed in blocks along columns. The results with synthetic access patterns in Section 5.1 indicate that the extended two-phase method performs very well for such accesses.

Conclusions
The extended two-phase method is clearly superior to a direct method for accessing sections of out-of-core arrays. In our experiments with real applications as well as several synthetic access patterns, the extended two-phase method outperformed the direct method significantly.
The extended two-phase method also provides much flexibility in partitioning the I/O workload among processors. We have described one dynamic partitioning scheme that performed significantly better than a static partitioning scheme, but it may be possible to do even better. For example, instead of dividing the bounding section among processors in a column-block fashion, it could be divided in a block-cyclic fashion, so that if the bounding section includes some unwanted columns, they are evenly distributed. Another approach is to divide I/O among processors in such a way that the I/O requests from different processors go to different disks or I/O nodes. Furthermore, if the ratio of processors to disks on the machine is very high, it is possible to have only a few processors perform I/O, thereby reducing contention for the I/O system.
The extended two-phase method can be used for accessing arrays with any number of dimensions and any storage order. For the dynamic partitioning scheme we have proposed, the file domains for an n-dimensional array can be obtained by first calculating the n-dimensional bounding section of all requests, and then dividing it among processors such that the file domain of each processor is located contiguously in the file.
Array sections other than those that can be represented by a lower bound, upper bound, and stride in each dimension (for example, sections with non-uniform strides) can also be accessed by using the extended two-phase method. This requires a more general notation for representing such sections. The data structures, such as FAD and FDAT, must be modified to handle such requests, but the basic idea remains the same.
It is not necessary for all processors running the application to call the extended two-phase read/write routine. Even a subset of processors may call the routine and participate in the two-phase process. The I/O workload can then be divided among the processors in this subset.
The extended two-phase method is not specific to any particular machine, file system, or architecture; it can be easily implemented by using any file-system interface, or by using portable interfaces such as MPI-IO [16], resulting in portable implementations. It can also be easily modified and tuned for any particular system by defining file domains appropriately and possibly using a different algorithm for interprocessor communication.
The best way to use the extended two-phase method is to implement it as a library routine that can be called from an application program. We have implemented it in the PASSION runtime library [15], which is available on the World-Wide Web at http://www.cat.syr.edu/passion.html.
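The bounding-section calculation and the two ways of dividing it mentioned above can be sketched as follows. This is a minimal illustration under assumptions, not the PASSION implementation: each request is a list of hypothetical `(lower, upper, stride)` triplets with inclusive bounds, one per dimension; only the last dimension (the columns of a two-dimensional array) is divided; and the block size of the block-cyclic variant is a free parameter.

```python
def bounding_section(requests):
    """Smallest n-dimensional box (inclusive bounds) containing every request.
    Each request is a list of (lower, upper, stride) triplets, one per dimension."""
    ndim = len(requests[0])
    box = []
    for d in range(ndim):
        lo = min(r[d][0] for r in requests)
        # Last index actually touched: a strided request may stop short of `upper`.
        hi = max(r[d][0] + ((r[d][1] - r[d][0]) // r[d][2]) * r[d][2]
                 for r in requests)
        box.append((lo, hi))
    return box

def column_block(lo, hi, nprocs):
    """Split columns lo..hi into nprocs contiguous blocks of nearly equal size."""
    base, extra = divmod(hi - lo + 1, nprocs)
    domains, start = [], lo
    for p in range(nprocs):
        size = base + (1 if p < extra else 0)
        domains.append(list(range(start, start + size)))
        start += size
    return domains

def block_cyclic(lo, hi, nprocs, block):
    """Deal blocks of `block` columns to processors in round-robin order."""
    domains = [[] for _ in range(nprocs)]
    for i, col in enumerate(range(lo, hi + 1)):
        domains[(i // block) % nprocs].append(col)
    return domains
```

With the block-cyclic division, a run of unwanted columns inside the bounding section is spread over several processors instead of falling entirely into one processor's block, at the cost of each processor's columns no longer forming a single contiguous range.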

Figure 2: I/O access pattern in an out-of-core Laplace's equation solver

... collective I/O to access out-of-core array sections. Other examples of collective I/O are disk-directed I/O [11] and server-directed collective I/O [12].

Figure 3: Processor 0 must read the requested data from its file domain. Section ABCD is the smallest section containing all the requested data. Processor 0 reads this section by using an optimization called data sieving.

Figure 5: Extended two-phase method for reading sections of out-of-core arrays with dynamic partitioning of I/O

Figure 10: Scalability results, 16K × 16K array, time for reading sections in case VI of Table 8

Figure 11: Scalability results, 16K × 16K array, time for writing sections in case VI of Table 8

Figure 12: Percentage improvement in I/O time of out-of-core matrix multiplication by using extended two-phase method versus direct method

Figure 13: Percentage improvement in I/O time of out-of-core Laplace's equation solver by using extended two-phase method versus direct method

Table 1: Comparison of direct method and extended two-phase method (static and dynamic partitioning) for reading common sections. Array size 4K × 4K real numbers (single precision), 16 processors, time in seconds.

Figure 6: The common sections listed in Table 1 (not to scale)

Table 2: Comparison of direct method and extended two-phase method (static and dynamic partitioning) for reading overlapping sections. Array size 4K × 4K real numbers (single precision), 16 processors, time in seconds.

Figure 7: The overlapping sections listed in Table 2 (not to scale)

Table 3: Comparison of direct method and extended two-phase method (static and dynamic partitioning) for reading distinct sections. Array size 4K × 4K real numbers (single precision), 16 processors, time in seconds.

Table 4: Comparison of direct method and extended two-phase method (static and dynamic partitioning) for writing distinct sections. Array size 4K × 4K real numbers (single precision), 16 processors, time in seconds.

Table 5: Comparison of direct method and extended two-phase method (static and dynamic partitioning) for reading sections with non-unit strides. Array size 4K × 4K real numbers (single precision), 16 processors, time in seconds.

Table 6: Comparison of direct method and extended two-phase method (static and dynamic partitioning) for writing sections with non-unit strides. Array size 4K × 4K real numbers (single precision), 16 processors, time in seconds.

Table 10: I/O time in seconds for an out-of-core Laplace's equation solver using direct method and extended two-phase method with dynamic partitioning (ETP).