Algorithm Development for Distributed Memory Multicomputers Using CONLAB

CONLAB (CONcurrent LABoratory) is an environment for developing algorithms for parallel computer architectures and for simulating different parallel architectures. A user can experimentally verify and obtain a picture of the real performance of a parallel algorithm executing on a simulated target architecture. CONLAB gives high-level support for expressing computations and communications in a distributed memory multicomputer (DMM) environment. A development methodology for DMM algorithms that is based on different levels of abstraction of the problem, the target architecture, and the CONLAB language itself is presented and illustrated with two examples. Simulation results for, and real experiments on, the Intel iPSC/2 hypercube are presented. Because CONLAB is developed to run on uniprocessor UNIX workstations, it is an educational tool that offers interactive (simulated) parallel computing to a wide audience. © 1993 by John Wiley & Sons, Inc.

Communication via message passing is achieved with send and receive primitives. A message can be sent to one or more processes and a type is specified for the message. Reception of messages can be done by specifying the sender and/or message type. Communication can be either synchronous or asynchronous. When synchronous communication is used the sender waits until the message is completely received by the receiver, whereas for asynchronous communication the sender continues execution immediately after submitting the message.
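The difference between synchronous and asynchronous sends, and reception filtered by sender and/or message type, can be illustrated with a small Python sketch. This is not CONLAB syntax: the class Mailbox and its methods send and receive are hypothetical names chosen for illustration, and threads stand in for processes.

```python
import threading


class Mailbox:
    """Toy mailbox of one receiving process (hypothetical, not CONLAB)."""

    def __init__(self):
        self._pending = []                    # (sender, msg_type, payload, ack)
        self._cond = threading.Condition()

    def send(self, sender, msg_type, payload, synchronous=False):
        ack = threading.Event()
        with self._cond:
            self._pending.append((sender, msg_type, payload, ack))
            self._cond.notify_all()
        if synchronous:
            ack.wait()                        # wait until the receiver has the message
        # asynchronous: fall through and continue immediately

    def receive(self, sender=None, msg_type=None):
        """Block until a message matching the given sender and/or type exists."""
        with self._cond:
            while True:
                for idx, (src, typ, payload, ack) in enumerate(self._pending):
                    if sender in (None, src) and msg_type in (None, typ):
                        del self._pending[idx]
                        ack.set()             # releases a synchronous sender
                        return src, typ, payload
                self._cond.wait()
```

With an asynchronous send the caller continues at once and the message waits in the mailbox. With synchronous=True the caller resumes only after a matching receive has taken the message.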

The Time Model for Simulation of DMM Architectures
The execution of the processes on the virtual processors is controlled by a time model that com- [...] -guish between the two types of communication.
2. Receive time is the time it takes for the receiver to deal with a message that has arrived at the node. Note that receive time only concerns buffered communication.
3. Delay time is the time it takes for the message to reach the target node after the sender has initiated the send operation.
4. Total communication time. When unbuffered communication is used the receiver is blocked on receive when the message arrives. The total communication time for unbuffered communication is the time it takes for the message to be fully transmitted to the receiver from the point where the sender issued the send call.

The process P sends a message to the process Q, using the send statement. Process Q receives the message from the process P, with the receive call.
If Q initiates the receive after P has made the call to send then buffered communication is used. Figure 1 shows the relationships between the time components when buffered communication is achieved in the above program.
If the receive call is executed before the call to send then unbuffered communication is used, which is depicted in Figure 2. In Figure 2 the delay time decides when the receiver changes state from idle, waiting for a message, to active in communication. During simulation the delay time is used to determine if the message has reached the receiving node by the time the receive call is done. This information is used to decide if buffered or unbuffered communication is to be accomplished.
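The decision between buffered and unbuffered communication can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CONLAB simulator: the function name simulate_receive and its parameters are hypothetical, all times are in the same arbitrary unit, and a message that arrives exactly when the receive is issued is assumed to count as buffered.

```python
def simulate_receive(send_call, delay_time, receive_call,
                     receive_time, total_comm_time):
    """Return (mode, time at which the receiver has the message).

    send_call, receive_call: simulated times at which P calls send
    and Q calls receive. The remaining arguments are the time
    components described in the text.
    """
    arrival = send_call + delay_time          # message reaches Q's node
    if arrival <= receive_call:
        # The message was already waiting at the node: buffered.
        # Assumption: the receiver then only pays the receive time.
        return "buffered", receive_call + receive_time
    # The receiver is already blocked in receive when the message
    # arrives: unbuffered. Total communication time is measured from
    # the point where the sender issued the send call.
    return "unbuffered", send_call + total_comm_time
```

For example, with a delay time of 5 and a send issued at time 0, a receive issued at time 10 is classified as buffered, while a receive issued at time 3 is classified as unbuffered.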

Performance Measuring
The functions timer, arithtime, and commtime are used to calculate the timing characteristics of a CONLAB program. The timer function returns the current time value of a process (it is initialized to zero when the process starts) and it can be used to measure the elapsed execution time of a process. Similarly, the functions arithtime and commtime return the current times for arithmetic computations and communications, respectively (both initialized to zero when the process starts).
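One way to picture the three timing functions is as counters kept per simulated process. The Python sketch below is an assumption about how such bookkeeping could look, not a description of the CONLAB implementation: the class ProcessClock and the methods charge_arithmetic, charge_communication, and wait are hypothetical, and only the names timer, arithtime, and commtime come from the text.

```python
class ProcessClock:
    """Per-process simulated clocks, all zero when the process starts."""

    def __init__(self):
        self._now = 0      # overall simulated time of the process
        self._arith = 0    # time spent in arithmetic computations
        self._comm = 0     # time spent in communications

    def charge_arithmetic(self, cost):
        self._arith += cost
        self._now += cost

    def charge_communication(self, cost):
        self._comm += cost
        self._now += cost

    def wait(self, cost):
        # Assumption: idle time advances the overall clock only.
        self._now += cost

    def timer(self):
        return self._now

    def arithtime(self):
        return self._arith

    def commtime(self):
        return self._comm
```

Elapsed time for a code region is then the difference between two timer() readings taken before and after it.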

ParaGraph
Trace files generated by CONLAB can be used in the visualization tool ParaGraph [9]. ParaGraph, which primarily is intended as a postprocessor to the instrumentation package PICL [10], includes several graphs describing algorithm performance when used with CONLAB.

ALGORITHM DEVELOPMENT METHODOLOGY
The methodology for developing parallel algorithms in CONLAB is based on different levels of abstraction of the problem, the target parallel architecture, and of the CONLAB language itself. By following the methodology the user will implement, at as high abstraction level as possible, functions and processes in CONLAB that define the architecture topology, communication on the topology, node and host algorithms, and a user interface. Below, we describe the different levels of abstraction. The host and node processes define the distributed algorithm, which is a composition of communicating processes. The concept of a host process and node processes is motivated by the commercially available DMM architectures, and in CONLAB, they are considered equal. Within each process all computations are sequential and performed within a single address space (the local memory of the node).
The simulation function can be seen as the problem (or application) level of the abstraction.
By changing problem sizes and/or the number of

TWO EXAMPLES OF ALGORITHM DEVELOPMENT FOR DMM ARCHITECTURES
In this section we illustrate the algorithm development methodology for DMM architectures, described in the pr[...]

Block Torus Matrix Multiplication
Given $n \times n$ matrices [...].
In the distributed algorithm the processors are organized in a two-dimensional mesh where communication takes place between nearest neighbors as well as around the edge in both directions. In order to compute $C_{ij}$ at node (i, j) we need access to block row i of A and block column j of B.
Initially, the matrices A and B are distributed block-wise among the nodes in the mesh so that node (i, j) has the blocks $A_{ik}$ and $B_{kj}$, where $k = (i + j - 2) \bmod s + 1$. This means that the matrix A is skewed row-wise and B is skewed column-wise. This distribution scheme allows node (i, j) to compute $C_{ij}$ with only nearest-neighbor communication in the two-dimensional mesh. The main steps of the algorithm [13] for node (i, j) look as follows.

    $C_{ij} := 0$
    for t = 1 : s
        [...] processors [...] mod s + 1
        $C_{ij} := C_{ij} + A_{ik} * B_{kj}$
        Send $A_{ik}$ to nearest neighbor to the west
        Send $B_{kj}$ to nearest neighbor to the north
    end

Notice, the first time node (i, j) receives an A-block and a B-block they are delivered by the host, but the following blocks originate from the nodes to the east and south, respectively. It would also be possible to let node (i, j) initially hold blocks $A_{ij}$ and $B_{ij}$.
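The block torus scheme can be checked with a sequential Python sketch that simulates all nodes of an s-by-s torus in one process. It is a minimal illustration under stated assumptions, not the authors' CONLAB code: the function names are hypothetical, indices are 0-based (so the initial skew becomes k = (i + j) mod s), row indices are assumed to grow southward and column indices eastward, and the matrix order n is assumed to be divisible by s.

```python
def matmul(X, Y):
    """Plain dense product of two list-of-lists matrices."""
    return [[sum(X[r][k] * Y[k][c] for k in range(len(Y)))
             for c in range(len(Y[0]))] for r in range(len(X))]


def matadd(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def block(M, i, j, b):
    """Extract block (i, j) of size b-by-b from matrix M."""
    return [row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]


def torus_matmul(A, B, s):
    n = len(A)
    assert n % s == 0, "sketch assumes s divides n"
    b = n // s
    # Initial skewed distribution: node (i, j) holds A_ik and B_kj
    # with k = (i + j) mod s (0-based form of the formula in the text).
    a = [[block(A, i, (i + j) % s, b) for j in range(s)] for i in range(s)]
    bb = [[block(B, (i + j) % s, j, b) for j in range(s)] for i in range(s)]
    c = [[[[0] * b for _ in range(b)] for _ in range(s)] for _ in range(s)]
    for _ in range(s):
        # Local update on every node: C_ij := C_ij + A_ik * B_kj
        for i in range(s):
            for j in range(s):
                c[i][j] = matadd(c[i][j], matmul(a[i][j], bb[i][j]))
        # A-blocks are sent west: node (i, j) receives from (i, j + 1).
        a = [[a[i][(j + 1) % s] for j in range(s)] for i in range(s)]
        # B-blocks are sent north: node (i, j) receives from (i + 1, j).
        bb = [[bb[(i + 1) % s][j] for j in range(s)] for i in range(s)]
    # Gather the C-blocks into one n-by-n result.
    C = [[0] * n for _ in range(n)]
    for i in range(s):
        for j in range(s):
            for r in range(b):
                C[i * b + r][j * b:(j + 1) * b] = c[i][j][r]
    return C
```

After s steps every node has seen each index k exactly once, so the gathered result should equal the ordinary product of A and B. Comparing torus_matmul(A, B, s) with matmul(A, B) on small integer matrices is a quick sanity check.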