Investigation of Various Mesh Architectures with Broadcast Buses for High-Performance Computing

Extensive comparative analysis is carried out of various mesh-connected architectures 
that contain sparse broadcast buses for low-cost, high-performance parallel computing. 
The two basic architectures differ in the implementation of bus intersections. The first 
architecture simply allows row/column bus crossovers, whereas the second architecture 
implements such intersections with switches that introduce further flexibility. Both 
architectures have lower cost than the mesh with multiple broadcast, which has buses 
spanning each row and each column, but the former architectures maintain to a high 
extent the powerful properties of the latter mesh. The architecture that employs switches 
for the creation of separable buses is even shown to often perform better than the 
higher-cost mesh with multiple broadcast. Architectures with separable buses that 
employ store-and-forward routing often perform better than architectures with 
contiguous buses that employ the high-cost wormhole routing technique. These 
architectures are evaluated with respect to cost and efficiency in implementing several 
important operations and application algorithms. The results prove that these 
architectures are very promising alternatives to the mesh with multiple broadcast while 
their implementation is cost-effective and feasible.


I. INTRODUCTION
The mesh architecture is used frequently in parallel processing because of its low VLSI complexity and its support for scalability. However, its main drawbacks, namely large diameter and large average internode distance, dramatically affect its communication capabilities. Although other popular interconnection networks, such as the direct binary hypercube, have smaller values for the latter pair of parameters, their major drawback is that they do not permit the application of incremental growth techniques [7], and their VLSI implementation becomes a Herculean task for massively parallel systems [2,8]. To allow the efficient implementation of distant data transfers, several enhancements have been proposed for the mesh. The addition of a single global bus is one such enhancement [12,13]. Although it is often assumed that the propagation time of messages on the global bus is independent of the size of the mesh, this assumption may not be acceptable for practical systems [3].
*Tel.: (973) 596-5651, Fax: (973) 596-5680. E-mail: ziavras@megahertz.njit.edu
To avoid bottlenecks caused by the single global bus, the mesh parallel computer can instead be augmented by adding multiple broadcast buses, where each bus connects a subset of PEs (processing elements) in the mesh. One such popular architecture is the mesh with multiple broadcast [1,21,23]. It is a mesh-connected parallel computer where all PEs on each row and each column are connected to a shared row bus and a shared column bus, respectively. The performance of this architecture is comparable to that of the pyramid computer for several image processing problems. Rectangular meshes with multiple broadcast may perform better than square ones with the same number of PEs, when row and column buses are considered, as the former systems contain more buses [3,15,29]. The mesh with multiple broadcast is the most relevant architecture in this paper. The rest of this section briefly describes other important variations of the mesh architecture.
The CHiP parallel computer [4] consists of a mesh (grid) of PEs with programmable switches interposed between neighboring PEs. The local memory in switches stores interconnection patterns to be implemented at run time. The reconfigurable mesh, or mesh with reconfigurable bus, is a square mesh of PEs where all PEs are connected to a global broadcast bus that spans all rows and columns [14,22]. Switches are located at all column/row bus intersections, and PEs control their neighboring switches, which can divide the global bus into subbuses. The cost of this architecture may be prohibitively high because of its large number of switches, whereas the assumption of fixed-time data transfers may be unrealistic.
Many algorithms have good theoretical performance on the reconfigurable mesh [17]. In the non-cross-over model, the four communication ports of a PE can be connected together to form only planar connections [27]. In the higher-cost cross-over model, non-planar connections can be formed as a PE may connect together its north-south and east-west ports independently.
The PEs in the polymorphic torus are located at the vertices of a two-dimensional torus network [5,16,25]. Switches are distributed over the nodes of the torus network, as for the reconfigurable mesh. Each switch can implement a complete graph of a PE's four ports. It supports high communication bandwidth, downplaying any hardware repercussions. Switches must be reconfigured each time to match the structure of the program graph. This architecture has higher hardware cost than the reconfigurable mesh because crossbar switches are used to control all possible connections among the four NEWS (North-East-West-South) directions.
An X-shaped grid interconnects the PEs within each chip of the BLITZEN massively parallel processor array and extends across chip boundaries [18]. The network is dynamically reconfigured to support either NEWS or diagonal connections. The custom chip of BLITZEN contains 128 one-bit PEs arranged in an $8 \times 16$ array [18]. The standard configuration of this system contains 16,384 PEs arranged in a $128 \times 128$ array.
The mesh of trees is constructed from a grid of PEs by adding PEs and wires to form on top of it a complete binary tree on each row and each column [9]. The cost of these additional PEs may be a drawback in the implementation of this architecture. Several slight variations of the mesh of trees have also been introduced [11].
Other mesh-connected parallel computers are obtained by superimposing one or more global meshes on an underlying mesh of PEs [6]. The mesh with a single global mesh is constructed by starting with several regular meshes and connecting together with a global mesh the lead PEs at the top leftmost corners of these meshes. Additional links between PEs in the original meshes can be used to form a single underlying mesh that contains all PEs. The creation of a mesh with $l$ global meshes is also possible. The first global mesh connects the lead PEs of a set of regular meshes. The second global mesh connects the lead PEs of a subset of meshes with a single global mesh. This process is repeated, and finally, the $l$th global mesh connects a subset of meshes containing $l-1$ global meshes.
To improve the performance of the mesh with multiple broadcast, processor-controlled switches can be used to partition the row and column buses for reconfiguration purposes [26]. For example, a switch can be inserted after every other PE on each row or each column bus in order to produce a mesh with separable buses. A lower-cost modification of this architecture does not require that all PEs be connected to row and column broadcast buses [24]. The latter architecture employs one row/column bus for every $n^k$ rows/columns of PEs in the $n \times n$ mesh, where $k < 1/2$. More specifically, PEs in blocks of size $n^k \times n^k$ are interconnected using only local links as in the mesh. The PE in the upper left corner of each block is connected to a pair of separable row and column buses. Data broadcasting algorithms first use local block links, then broadcast buses until full row or full column broadcasts are completed, and finally local block links again. Algorithms with limited global communication requirements, such as semigroup and prefix computations, can be implemented efficiently on these architectures. It has been shown that rectangular meshes are optimal for semigroup, prefix, and convex hull computations [24,29]. However, we assume here only square meshes because our objective is to evaluate architectural differences of systems as they are related to data transfer operations.
The mesh with multiple broadcast and its aforementioned variation with separable buses achieve very good performance at the expense of very high hardware cost. Their existing cost-performance comparisons with alternative architectures are very limited. The objective of this paper is to show that families of low-cost mesh-connected architectures can achieve performance comparable to that of the higher-cost mesh with multiple broadcast. More specifically, this paper investigates in detail two families of mesh-connected architectures with sparse broadcast buses (i.e., buses that do not cover every row and every column of the mesh). One of the architectures employs switches for the implementation of separable buses. In contrast to the work in [24] that assumes hierarchical sectioning of broadcast buses, these switches are located at all bus intersections and also connect to both row and column buses. Thus, they reduce the hardware cost further while improving the system's flexibility for many operations. Also, we assume that the underlying structure is a single/complete mesh, while [24] assumes independent submeshes that can communicate only via broadcast buses. Another innovation of this research is that extensive analysis is carried out for two classes of systems, namely those that employ the lower-cost store-and-forward routing technique and those employing the higher-cost wormhole routing technique.
The paper is organized as follows. Section II introduces the two families of mesh-connected architectures for the study. The development for these architectures of important operations and algorithms, mainly for comparison with the mesh with multiple broadcast, is the main objective in the succeeding sections. Algorithms for global broadcasting are presented in Section III. The implementation of some other fundamental data movement operations is presented in Section IV. Section V presents advanced prefix computation and graph component-labeling algorithms. Finally, Section VI contains conclusions.

II. MORB MESHES WITH MULTIPLE ORTHOGONAL BROADCAST BUSES
This section presents promising models (families) of low-cost meshes with multiple orthogonal broadcast buses, or multiple-orthogonal-broadcasts (MORB) meshes, for the construction of low-cost, high-performance parallel computers. A MORB mesh contains sparse row and column broadcast buses. Therefore, it contains only a subset of the broadcast buses found in the mesh with multiple broadcast. Another innovation of this work is that both store-and-forward and wormhole-routing switching systems are investigated in detail. Two families of MORB meshes are studied here. The first family contains meshes with contiguous row/column buses, whereas the second family contains meshes with separable buses implemented with switches located at all intersections of broadcast buses.
Additional flexibility is introduced in the latter case (i.e., MORB meshes with separable buses) due to the switches, which can divide broadcast buses into subbuses for a much larger number of small-range broadcast operations. The properties of two-dimensional SIMD meshes are studied here. In the rest of the paper, MORB(n,p) represents an $n \times n$ grid of PEs (i.e., a regular mesh) with p row and p column broadcast buses superimposed on it; $r/2$ is always chosen to be a natural number, where $r = (n-1)/(p-1)$. The notation for the corresponding system with separable buses is MORB^R(n,p). Throughout this paper we use the assignments $r = (n-1)/(p-1)$, $R = n/p$, $N = n-1$, and $P = p-1$.
Several sets of word-wide buses are embedded recursively into the regular $n \times n$ mesh in order to produce the MORB(n,p). The first set contains four distinct buses that connect together the PEs on the rows and columns at the boundaries of the mesh. That is, the 4N PEs on row 0, row N, column 0, and column N are connected to this first set of four buses; each PE in a corner of the mesh is connected to both a row bus and a column bus. The second set of two buses coincides with the lines that divide the mesh into four equal-sized quadrants. More sets of buses are introduced by dividing each quadrant of PEs into four equal-sized subquadrants, and this procedure is repeated for the introduction of more buses. Figure 1 shows the structure of the MORB(17,5).
The term bus segment is used here to denote any contiguous part of a broadcast bus that falls between two consecutive bus crossovers. It is assumed that PEs connected to two (i.e., row and column) buses can receive data on both buses simultaneously, while they can transmit data on only one bus at a time. More than one PE can transmit simultaneously on the same bus only if all of them send the same value. The $n \times n$ mesh with multiple broadcast, or multiple-broadcast mesh [21], denoted here by MB(n), is identical to the MORB(n,n). The broadcasting structure of the MORB mesh is similar to a special instance of the mesh with separable buses that has only one sectioning level for row and column broadcast buses [24].
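The layout described above can be made concrete with a small Python sketch that marks which PEs of a MORB(n,p) are attached to at least one broadcast bus. It assumes that buses lie on the rows and columns whose indices are multiples of $r = (n-1)/(p-1)$, which is where the recursive subdivision places them when r is an integer; the function name `bus_attached` is hypothetical and is not taken from the paper.

```python
def bus_attached(n, p):
    """Return the set of (row, col) PEs attached to at least one bus in MORB(n, p).

    Assumption of this sketch: bus-carrying rows/columns are 0, r, 2r, ..., n-1
    with r = (n-1)/(p-1) an integer.
    """
    if p < 2 or (n - 1) % (p - 1) != 0:
        raise ValueError("this sketch assumes r = (n-1)/(p-1) is an integer")
    r = (n - 1) // (p - 1)
    lines = range(0, n, r)  # indices of rows/columns that carry a bus
    on_row_bus = {(i, j) for i in lines for j in range(n)}
    on_col_bus = {(i, j) for i in range(n) for j in lines}
    return on_row_bus | on_col_bus


if __name__ == "__main__":
    attached = bus_attached(17, 5)
    # Distinct PEs on buses: 2*p*n minus the p*p intersections counted twice.
    print(len(attached), 2 * 5 * 17 - 5 * 5)
```

For the MORB(17,5), the bus-carrying rows and columns are 0, 4, 8, 12, and 16, and a PE such as (2,2) in the middle of a square is not attached to any bus.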
A simple programmable switch is placed at each intersection of row and column buses in the MORB^R(n,p). These switches can partition row/column buses into subbuses for the simultaneous implementation of many broadcasts of limited range. This capability is not present in the mesh with multiple broadcast. A realistic approach is taken here for the implementation on the MORB^R(n,p) of broadcast operations with a short communication (clock) cycle. More specifically, only the following types of PE coverage by bus broadcasts are allowed.
(1) An entire row or column of PEs attached to the same broadcast bus.
(2) An entire row or column of PEs attached to the same broadcast bus, as well as any number of PEs attached to single column or row bus segments, respectively, that touch the former bus.
(3) A contiguous part of a row or column of PEs attached to the same broadcast bus, as well as any number of PEs attached to any number of single column or row bus segments, respectively, that touch any of the former PEs.
These limited broadcast types guarantee low hardware complexity for the switches (otherwise, high-cost precharged circuits are needed to facilitate distant data transfers) and a small clock cycle comparable to that of the mesh with multiple broadcast, thus supporting system scalability. In further comparison, the reconfigurable mesh [10] contains a much larger number of switches with significantly higher complexity because $O(n^2)$ switches may be traversed by a packet during a single data transfer (therefore making the incorporation of precharged circuits absolutely essential). Additionally, long clock cycles may become inevitable in the implementation of distant bus transfers on the reconfigurable mesh. The mesh with fewer separable buses [24] cannot implement directly the PE coverages numbered (2) and (3). It can implement only the first type of PE coverage. Also, for the same fixed distance between consecutive switches, the latter system requires double the number of switches found in the MORB^R mesh, because it uses distinct switches for row and column broadcast buses. Although the implementation on the MORB^R mesh of the PE coverages numbered (2) and (3) may require switches of slightly higher complexity, the reduced number of switches results in significant cost savings. Additionally, another major innovation of this paper is that both store-and-forward and wormhole-routing switching systems are studied extensively.
Hardware cost analysis for these families of meshes and relevant comparison with the mesh with multiple broadcast, the most relevant existing system, follow. In the rest of this section only, MORB mesh stands for both the MORB and MORB^R meshes. Also, throughout the paper bus intersection stands for bus crossover.

PROPOSITION 1
The m-th set of broadcast buses in the MORB(n,p) contains $2^{m-1}$ (full-length) buses for $m > 1$, and four buses for $m = 1$. Each bus covers n PEs.
Proof The first set of four broadcast buses (i.e., for $m = 1$) covers the PEs on the boundaries of the mesh. For $m = 2$, the number of additional broadcast buses is equal to two, i.e., $2^1$, because the original mesh is divided into four quadrants by two buses that intersect in the center of the mesh.
The proof for $m > 2$ is based on mathematical induction. Assume that the kth set contains $2^{k-1}$ buses, where $k \geq 2$. For the (k+1)th set, the number of row (column) buses introduced is double the number of row (column) buses in the kth set. Therefore, the number of buses in the (k+1)th set is $2^k$. Hence, the result.

COROLLARY 1
Ignoring duplication of PEs at bus intersections, the m-th set of broadcast buses covers $2^{m-1} n$ PEs for $m > 1$, and $4n$ PEs for $m = 1$.
Proof Each row or column bus covers n PEs. From Proposition 1, the mth set contains $2^{m-1}$ buses for $m > 1$; therefore, the total number of PEs covered by the mth set of buses is $2^{m-1} \times n$. It is $4n$ PEs for $m = 1$. PEs attached to two broadcast buses are considered twice in the calculations here, once for each bus.
PROPOSITION 2 Ignoring duplication of PEs at bus intersections, the total number of PEs directly connected to broadcast buses in the MORB(n,p) is $2pn$.
Proof Each row or column bus covers n PEs. There exist 2p buses; therefore, the total number of PEs directly connected to broadcast buses is $2pn$.
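The counting in Propositions 1 and 2 can be checked numerically. The Python sketch below lists the number of buses in each set and confirms that the sets add up to 2p buses. It assumes that $p = 2^k + 1$ for some integer $k \geq 0$, which is what the recursive quadrant subdivision produces; the function name `bus_sets` is hypothetical.

```python
def bus_sets(p):
    """Number of full-length buses in each set of a MORB(n, p) mesh.

    Assumption of this sketch: p = 2**k + 1, as produced by recursively
    halving the mesh. Set 1 has four boundary buses; set m > 1 has 2**(m-1).
    """
    if p < 2:
        raise ValueError("p must be at least 2")
    k = (p - 1).bit_length() - 1
    if 2 ** k + 1 != p:
        raise ValueError("this sketch assumes p = 2**k + 1")
    sets = [4]  # first set: the four boundary buses
    for m in range(2, k + 2):
        sets.append(2 ** (m - 1))
    return sets


if __name__ == "__main__":
    n, p = 17, 5
    sets = bus_sets(p)
    print(sets)           # buses per set
    print(sum(sets) * n)  # PEs covered, counting intersections twice (2pn)
```

For the MORB(17,5) this gives three sets of 4, 2, and 4 buses, i.e., 10 = 2p buses covering 2pn = 170 PE attachments.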
The total number of switches in the MORB^R(n,p) is equal to $p^2$. This is also the total number of bus intersections in the MORB^R(n,p).
In contrast, the total number of bus intersections in the MB(n) mesh with multiple broadcast is $n^2$, which results in much higher VLSI complexity for $n > p$ because of the implementation of more wire crossovers.
COROLLARY 2 Assume that each PE needs a single communication port for its connection to a broadcast bus. The total number of ports in the MORB(n,p) for connections to broadcast buses is equal to $2pn$.
Proof According to Proposition 2, the total number of PEs connected to broadcast buses is $2pn$ if bus intersections are not accounted for. Therefore, the total number of communication ports is equal to $2pn$.
The ratio of the total numbers of ports used for connections to broadcast buses in the MB(n) and MORB(n,p) meshes, respectively, is $(2n^2)/(2pn) = R > 1$. Obviously, for fixed mesh size the value of this ratio increases with a decrease in p (i.e., the number of broadcast buses in the MORB mesh). Considering also the cost of missing bus segments, the cost of the MORB mesh may be dramatically lower than that of the MB mesh. For fair hardware cost comparison with the MORB^R(n,p) mesh, the cost of simple switches at bus intersections in the latter system must be accounted for. There exist $p^2$ such switches.
Assuming that the cost of a broadcast bus segment between two neighboring PEs is $c_1$, the cost of a switch is $c_2$, the cost of a port is $c_3$, and the cost of implementing a bus wire crossover is $c_4$, then the ratio of the cost overheads for the provision of hardwired broadcast capabilities on the MB(n) and MORB^R(n,p) meshes, respectively, is $(2nN c_1 + 2n^2 c_3 + (n^2 - 4n + 4) c_4)/(2pN c_1 + p^2 c_2 + 2pn c_3)$. The value of this cost ratio should be expected to be greater than 1 in all practical cases (note: it is always true that $p < n$). For $c_1 = c_2 = c_3 = c_4$, which could be an assumption very close to reality, the value of the above ratio is $(5n^2 - 6n + 4)/(p(p + 4n - 2))$; this value is approximately equal to $(5/4)R$ (i.e., it is always greater than 1) for $n > p \gg 1$. This analysis shows that even the robust MORB^R mesh architecture, which has higher cost than the basic MORB mesh architecture, results in very significant cost savings when compared to the MB mesh architecture.
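As a quick numerical check of the cost ratio above, the following Python sketch evaluates it exactly and compares it with the closed form for equal unit costs. The function name `hardware_cost_ratio` is hypothetical, and the unit costs are assumed to be integers so that exact rational arithmetic can be used.

```python
from fractions import Fraction


def hardware_cost_ratio(n, p, c1=1, c2=1, c3=1, c4=1):
    """Cost overhead of MB(n) divided by that of MORB^R(n, p).

    c1: bus segment, c2: switch, c3: port, c4: wire crossover
    (integer unit costs are an assumption of this sketch).
    """
    N = n - 1
    mb = 2 * n * N * c1 + 2 * n * n * c3 + (n * n - 4 * n + 4) * c4
    morb_r = 2 * p * N * c1 + p * p * c2 + 2 * p * n * c3
    return Fraction(mb, morb_r)


if __name__ == "__main__":
    n, p = 17, 5
    ratio = hardware_cost_ratio(n, p)
    closed_form = Fraction(5 * n * n - 6 * n + 4, p * (p + 4 * n - 2))
    print(ratio == closed_form, float(ratio), 1.25 * n / p)
```

For the MORB^R(17,5) with equal unit costs, the exact ratio is 1347/355, or about 3.8, against the rough estimate $(5/4)R = 4.25$; the estimate tightens as p becomes small relative to n.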
Nevertheless, MORB meshes can become better alternatives than the mesh with multiple broadcast only if they can achieve performance comparable to or better than that of the latter for frequently used operations and algorithms. Relevant comparisons follow in the rest of the paper.

III. GLOBAL BROADCASTING
A significant innovation of the analysis in this paper is that it is carried out for both the store-and-forward and wormhole-routing switching techniques. Before we dwell on the presentation and analysis of algorithms, let us describe timing models for these techniques. In store-and-forward switching, a packet is always stored temporarily in the packet buffer of any encountered intermediate processor. The network latency is given by $(M/B)L$, where M is the packet length (in number of bits, including the header), B is the bandwidth of a channel (in bits/sec), and L is the length (in number of hops) of the path travelled by the packet. In wormhole routing, each packet is divided into a number of flits (flow control digits) for transmission. As the header flit opens the selected path, the other flits of the packet are pipelined along the same path. If the header flit is blocked because of an unavailable channel, the remaining flits are also blocked along the same path; they are stored temporarily in flit buffers. The network latency is approximately equal to $(M_f/B)L + (M - M_f)/B$, where $M_f$ is the flit length (we assume that the header is one flit long). For practical cases with $M_f \ll M$, the path length L does not have any significant effect on the network latency (assuming that L is not much larger than M).
The following conventions are used in this paper for the sake of simplicity. All network latencies are expressed here in numbers of communication cycles; a communication cycle is assumed to have the same duration as a clock cycle and is consumed for the transmission on a word-wide channel (all channels are assumed to be word-wide) of a single word between any two neighboring processors. The length of packets in our calculation of network latencies is expressed in number of communication cycles by assuming that $B = 1$ word/cycle; as a consequence, M and $M_f$ represent numbers of communication cycles. The network latency is then approximated by $ML$ and $L + M$ for the store-and-forward and wormhole-routing techniques, respectively. Whenever L appears in an equation as an additive term because of wormhole routing, it represents communication cycles.
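The two latency models can be compared with a few lines of Python. This is a minimal sketch of the formulas stated above, with the defaults $B = 1$ word/cycle and a one-word flit; the function names are hypothetical.

```python
def store_and_forward_latency(M, L, B=1):
    """Latency (M/B) * L: the whole packet is buffered at every hop."""
    return (M / B) * L


def wormhole_latency(M, L, Mf=1, B=1):
    """Latency (Mf/B) * L + (M - Mf)/B: flits are pipelined behind the header."""
    return (Mf / B) * L + (M - Mf) / B


if __name__ == "__main__":
    M, L = 8, 10  # packet of 8 words travelling 10 hops
    print(store_and_forward_latency(M, L))  # grows as M * L
    print(wormhole_latency(M, L))           # grows as L + M - Mf
```

With an 8-word packet over 10 hops, store-and-forward needs 80 cycles while wormhole routing needs 17, which is why the approximation $L + M$ is used for the latter.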
The problem of global broadcasting on MORB meshes, where a single PE copies a value into all other PEs in the system, is investigated here. This section is devoted exclusively to global broadcasting because this is an operation encountered very frequently in application algorithms. The primary goal is to illustrate that in practical cases the reduced number of broadcast buses does not seriously affect the broadcasting capability of the MORB mesh, when compared to the MB mesh. Since the algorithm of global broadcasting presented here is common to the MORB and MORB^R meshes, MORB mesh in this section stands for both the MORB and MORB^R meshes.

THEOREM 1
The worst-case time for a global broadcast on the MORB(n,p) is r NEWS communication steps and three bus broadcasting steps, for $p < n$. It is just two bus broadcasting steps for $p = n$ (i.e., as for the MB mesh).
Proof Each side of the smallest-sized squares formed by broadcast buses covers $1 + r$ PEs. Therefore, the maximum of the shortest distances of PEs from their nearest PE which is attached to a broadcast bus is $r/2$ (this is an integer number for $p < n$, as noted in Section II). This is also the largest possible number of NEWS communication steps required to reach the nearest broadcast bus. One broadcasting step is then needed to broadcast the value on the full bus to which the receiving PE is attached. Two additional bus broadcasting steps (i.e., for row and column broadcast buses) are needed in the worst case for all PEs attached to broadcast buses to receive the value. The same number of NEWS communication steps as in the beginning (i.e., $r/2$) are then required to copy the value from the upper and left boundaries of the aforementioned squares into the PEs within the squares. It is easy to see that only two bus broadcasting steps are required for $p = n$. Hence, the result.
This paper focuses on detailed time analysis of fundamental operations and algorithms for the MORB and MB meshes. Two delay models for bus broadcasts are used. These models are similar to those assumed for the reconfigurable mesh in asymptotic time analysis [22]. The unit-time delay (UTD) model assumes that a bus broadcast takes $O(1)$ time. The log-time delay (LTD) model assumes that a bus broadcast takes $O(\log l)$ time, where $l$ is the number of switches or bus crossovers traversed. All logarithms here are to the base 2 and the emphasis is on detailed worst-case time analysis. It is assumed that each arithmetic or logic ALU operation requires $t_{ALU}$ clock cycles, whereas each data transfer operation with a neighbor in the mesh requires $t_{NEWS}$ clock cycles; therefore, $t_{NEWS} = M$. Finally, a single bus broadcast takes $t_{bus}$ clock cycles under the UTD model and $t_{bus} \log l$ clock cycles under the LTD model, where $l$ is the number of switches or bus crossovers traversed.
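The distance argument in the proof of Theorem 1 can be checked by brute force. The Python sketch below computes, over all PEs of a MORB(n,p), the largest number of NEWS hops needed to reach a bus-attached PE. It assumes buses on rows and columns whose indices are multiples of r; the function name `max_distance_to_bus` is hypothetical.

```python
def max_distance_to_bus(n, p):
    """Largest NEWS distance from any PE to its nearest bus-attached PE.

    Assumption of this sketch: buses lie on rows/columns 0, r, 2r, ..., n-1.
    """
    r = (n - 1) // (p - 1)

    def to_line(x):
        # Hops along one axis to the nearest bus-carrying row or column.
        return min(x % r, r - x % r)

    return max(min(to_line(i), to_line(j)) for i in range(n) for j in range(n))


if __name__ == "__main__":
    for n, p in [(17, 5), (33, 5), (9, 3)]:
        r = (n - 1) // (p - 1)
        # Worst case: r/2 hops in, three bus steps, r/2 hops out.
        print(n, p, max_distance_to_bus(n, p), r // 2)
```

The maximum equals $r/2$ in each case, so the two NEWS phases together account for the r communication steps stated in the theorem.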
The store-and-forward and wormhole-routing switching techniques are chosen throughout this paper. The analysis of broadcasting first assumes the store-and-forward technique. Based on Theorem 1, the worst-case time for a global broadcast on the MORB(n,p) is $T_{global}(n,p) = r\,t_{NEWS} + 3\,t_{bus}$ and $T'_{global}(n,p) = r\,t_{NEWS} + 3 \log p\; t_{bus}$ under the UTD and LTD models, respectively, for $p < n$. For $p = n$, the values are $T_{global}(n,n) = 2\,t_{bus}$ and $T'_{global}(n,n) = 2 \log p\; t_{bus}$, respectively, for the two models. Let us define as the cost of implementing global broadcasting on the MORB(n,p) the product $cost_{global}(n,p) = PE\_buses(n,p) \times T_{global}(n,p)$, where PE_buses stands for the total number of PEs attached to broadcast buses and is given in Proposition 2. This cost is in reference to the regular $n \times n$ mesh without broadcast buses, and can be used to compare the MORB and MB meshes for this operation; the MB(n) mesh has the same PE_buses as the MORB(n,n), that is, $2n^2$. If $q = t_{bus}/t_{NEWS}$, the cost ratio $cost_{global}(n,p)/cost_{global}(n,n)$ becomes $p(r + 3q)/(2nq)$ and $p(r + 3q \log p)/(2nq \log p)$ under the UTD and LTD models, respectively. The smaller the value of this ratio, the better the MORB(n,p) system is for global broadcasting (i.e., it has a better balance between hardware cost and performance). Under the UTD model, this ratio is approximately equal to $(r + 3q)/(2rq)$. Similarly, the time ratio $T_{global}(n,p)/T_{global}(n,n)$ is approximated by $(r + 3q)/(2q)$ for the UTD model.
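The cost and time ratios for global broadcasting can be evaluated directly from the definitions. The following Python sketch does so for the store-and-forward case; the function names are hypothetical, and the LTD branch simply scales each bus broadcast by $\log_2 p$ as in the expressions above.

```python
import math
from fractions import Fraction


def t_global(n, p, t_news, t_bus, model="UTD"):
    """Worst-case global broadcast time on MORB(n, p), store-and-forward."""
    bus = t_bus if model == "UTD" else t_bus * math.log2(p)
    if p == n:  # the MB(n) mesh: two bus broadcasts only
        return 2 * bus
    r = Fraction(n - 1, p - 1)
    return r * t_news + 3 * bus


def cost_global(n, p, t_news, t_bus, model="UTD"):
    """PE_buses(n, p) * T_global(n, p), with PE_buses = 2pn (Proposition 2)."""
    return 2 * p * n * t_global(n, p, t_news, t_bus, model)


def cost_ratio(n, p, t_news, t_bus, model="UTD"):
    """cost_global(n, p) / cost_global(n, n)."""
    return (cost_global(n, p, t_news, t_bus, model)
            / cost_global(n, n, t_news, t_bus, model))


if __name__ == "__main__":
    n, p, t_news, t_bus = 17, 5, 1, 4  # q = t_bus / t_news = 4, r = 4
    print(cost_ratio(n, p, t_news, t_bus))  # equals p(r + 3q) / (2nq)
    print(t_global(n, p, t_news, t_bus) / t_global(n, n, t_news, t_bus))
```

For $n = 17$, $p = 5$, and $q = 4$, the cost ratio is 10/17 and the time ratio is 2, i.e., the MORB mesh is twice as slow for this operation at well under the cost-time product of the MB mesh.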
Figure 2 shows these cost and time ratios as functions of the parameters q and r. The first graph shows that in practical cases (i.e., $r > 4$) the cost ratio is much less than 1, while the second graph shows that the (execution) time ratio (i.e., speedup) is less than 5. It can be concluded that the MORB mesh results in very significant cost savings, while it achieves good performance in global broadcasting. Since global broadcasting is the operation for which the MB mesh achieves its most impressive result, the results above are very encouraging for the viability of MORB meshes.
FIGURE 2 (a) Cost ratio $cost_{global}(n,p)/cost_{global}(n,n)$ and (b) time ratio $T_{global}(n,p)/T_{global}(n,n)$ under the UTD model, where $r = (n-1)/(p-1)$ and $q = t_{bus}/t_{NEWS}$.
With wormhole routing, we have $cost^{w}_{global}(n,p)/cost^{w}_{global}(n,n) = p(r + t_{NEWS} + 3t_{bus}) \ldots$ under the UTD model. This ratio is much less than 1 in all practical cases. Wormhole routing results in dramatic performance improvement because it corresponds to a much lower cost ratio.

IV. OTHER DATA MOVEMENT OPERATIONS
The efficient implementation of popular data movement operations that involve on-the-fly computations is critical for the successful employment of parallel computers [10]. For example, semigroup computations appear quite often in parallel algorithms [15]. The semigroup computation that employs the associative operator $\otimes$ calculates $a_1 \otimes a_2 \otimes a_3 \otimes \cdots \otimes a_N$ for a set of N data items. The operation can be add, multiply, maximum, minimum, etc. Without loss of generality, the example chosen here is the computation of the logical OR of n data items, stored one datum per $PE_{i,j}$ on the ith row of the $n \times n$ mesh, with the result stored in the leftmost $PE_{i,0}$, for all $0 \leq i, j \leq N$. The next theorem is pertinent for systems without wormhole routing.
THEOREM 2 Given the MORB(n,p) where each PE stores a datum, the logical OR of the data on individual rows/columns can be found in parallel in time $T_{rows\_OR}(n,p) = (r - 1 + 2(n-p))\,t_{NEWS} + (rp - 1)\,t_{ALU} + N\,t_{bus}$ under the UTD model. The $t_{bus}$ term for the LTD model is $r \log(p!)\; t_{bus}$, where $p!$ represents the factorial of p.
Proof Let the operation be applied to row data. In the first phase the OR operation is applied sequentially from right to left within row segments that fall between column broadcast buses. More specifically, at the end of the first phase each PE with Cartesian coordinates $(i, r j_1)$ contains the result of the OR operation among the data stored in the PEs with coordinates $(i, r j_1 + j_2)$, for all $0 \leq j_1 \leq p-2$ and $0 \leq j_2 \leq r-1$. This phase requires time $t_1 = (r-1)(t_{NEWS} + t_{ALU})$, in number of clock cycles. The second phase is composed of P subphases, where each subphase k, for $k = 1, 2, \ldots, P$, is composed of r cycles that transfer all values from the kth column broadcast bus to the PEs on column 0 (attached to the 0th broadcast bus) on the same row as the corresponding senders. The OR operation is also applied to the received data. Both the NEWS network and the broadcast buses are used for the implementation of these data transfers. Hence, the result.
Therefore, this algorithm consumes time $O(n)$ under the UTD model. The time is $O(n + (n/p)\log(p!))$ under the LTD model. In the rest of this paper, asymptotic time analysis is presented only for the UTD model because the latter provides very realistic performance while it is simple. Also, we always assume the UTD model for systems with wormhole routing.
With wormhole routing, this algorithm also consumes time $O(n)$. The ratio of NEWS communication overheads of the store-and-forward and wormhole-routing techniques, for $t_{NEWS} = M$, is shown in Figure 3 as a function of the parameters r and M. It can be concluded that wormhole routing results in dramatic performance improvement.
A word of caution is in order here. This problem may consume less time if data from more PEs than those corresponding to a single row segment are combined first. However, because of the comparative nature of this paper, which emphasizes architectural differences, its purpose is not necessarily to always devise the best possible algorithm to solve a given problem. For the most efficient implementation of this operation, the technique devised for the second phase of the algorithm in Corollary 3 can be used. Depending on the values of n and p, the naive solution that does not incorporate broadcast buses may result in better performance; the latter solution has execution time $(n-1)\,t_{NEWS} + (n-1)\,t_{ALU}$.
FIGURE 3 The ratio of NEWS communication overheads of the store-and-forward and wormhole-routing techniques for finding the logical OR of data on individual rows/columns, under the UTD model.
The next corollary expands further by investigating the case of producing a single global result on systems without wormhole routing. The corresponding algorithm emphasizes optimality and its exact implementation depends on the hardware characteristics of the system under consideration.
COROLLARY 3 Given the MORB(n,p) where each PE stores a datum, the logical OR among all data in the system can be found and stored in $PE_{0,0}$ in time $T_{all\_OR}(n,p) = \tau + t_c$ and $T'_{all\_OR}(n,p) = \tau + t'_c$ under the UTD and LTD models, respectively, where $\tau = t_1 + t_3$ with $t_1$ and $t_3$ as given in the proof, $t_c = (m-1)\,t_{NEWS} + (\lceil n/m \rceil - 1)\,t_{bus} + (m + \lceil n/m \rceil - 2)\,t_{ALU}$, and $t'_c = (m'-1)\,t_{NEWS} + \left[\lceil n/m' \rceil - 1 + \sum_{l=1}^{\lceil n/m' \rceil - 1} \log \lfloor (l m' + 1)/p \rfloor \right] t_{bus} + (m' + \lceil n/m' \rceil - 2)\,t_{ALU}$.
Assuming a single broadcast bus that covers n PEs, the values of m and m' are such that if the partial OR results are first found for contiguous segments of m or m' PEs, respectively, and these partial results are then combined using the broadcast bus, the value of $t_c$ or $t'_c$ is minimized under the UTD or LTD model, respectively.
Proof The algorithm for this reduction operation comprises three phases. In the first phase, every PE which is not attached to a row broadcast bus sends its value to its closest PE which is attached to a row broadcast bus. The OR operation is applied to contained and received values in all intermediate steps. This phase consumes time $t_1 = (r/2)(t_{NEWS} + t_{ALU}) + t_{ALU}$ under both delay models. At this point, the OR reduction operator must be applied only by PEs attached to a row broadcast bus.

In the second phase, the OR results are found in parallel for all rows of PEs which are attached to a row broadcast bus. This is accomplished in optimal time as follows. First, assume the UTD model. This phase is carried out by first finding the partial OR results for contiguous segments of size m on each broadcast row and then using the broadcast bus to collect these partial results in the corresponding PEs of column 0; the optimal size m is determined later in this proof. Actually, there exist $\lfloor n/m \rfloor$ segments of size m and one segment of size $n - \lfloor n/m \rfloor m$ on each row. The total number of segments on a row is $\lceil n/m \rceil$. Because of the SIMD execution mode, this phase requires time $t_c = (m-1)(t_{NEWS} + t_{ALU}) + (\lceil n/m \rceil - 1)(t_{bus} + t_{ALU})$ or $t_c = (m-1)\,t_{NEWS} + (\lceil n/m \rceil - 1)\,t_{bus} + (m + \lceil n/m \rceil - 2)\,t_{ALU}$. The value of m that minimizes $t_c$ is found by $dt_c/dm = 0$ (with $n/m$ in place of $\lceil n/m \rceil$) and is given by $m = \left[n\,(t_{bus} + t_{ALU})/(t_{NEWS} + t_{ALU})\right]^{1/2}$. For the LTD model the value of $t'_c$ for contiguous segments of size m' is given by $t'_c = (m'-1)(t_{NEWS} + t_{ALU}) + \sum_{l=1}^{\lceil n/m' \rceil - 1} \left[(\log \lfloor (l m' + 1)/p \rfloor + 1)\,t_{bus} + t_{ALU}\right]$, where 0 is used instead of the logarithm if $\lfloor (l m' + 1)/p \rfloor = 0$. Thus, $t'_c = (m'-1)\,t_{NEWS} + \left[\lceil n/m' \rceil - 1 + \sum_{l=1}^{\lceil n/m' \rceil - 1} \log \lfloor (l m' + 1)/p \rfloor\right] t_{bus} + (m' + \lceil n/m' \rceil - 2)\,t_{ALU}$.

The third phase applies the OR operator to data stored in the PEs on column 0 which are attached to a row broadcast bus. Data are combined sequentially in pairs using the broadcast bus and starting with the PE that has the largest row number. This phase takes time $t_3 = p\,(t_{bus} + t_{ALU})$ under both delay models. The total time is given by $T_{all\_OR}(n,p) = t_1 + t_c + t_3$ and $T'_{all\_OR}(n,p) = t_1 + t'_c + t_3$ under the UTD and LTD models, respectively. Hence, the result.

Assuming a system with wormhole routing, we substitute $(r/2) + M$ for $(r/2)\,t_{NEWS}$ in $t_1$, and $m - 1 + M$ for $(m-1)\,t_{NEWS}$ in $t_c$. The new value of m, denoted $m_w$, is found by the same minimization. For $t_{NEWS} = M$, the difference between the communication overheads of the store-and-forward and wormhole-routing techniques is $((r/2) + m_w - 1)(M - 1) - 2M$. This difference is shown in Figure 4 as a function of the parameters n, p, and M, for $t_{bus} = t_{ALU} = 1$. The pure computation times (i.e., ignoring communication overheads) for the curves from the bottom to the top of the graph are 56, 56, 77, 85, 112, and 136, respectively. Thus, wormhole routing may result in dramatic performance improvement.
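The trade-off behind the segment size in the second phase can be illustrated with a short numeric sketch. It evaluates the UTD expression for the second-phase time and compares the real-valued minimizer with an exhaustive search over integer segment sizes. The function names and the unit default timings are assumptions made only for this illustration.

```python
import math

def t_c(m, n, t_news=1, t_bus=1, t_alu=1):
    """UTD second-phase time for segment size m on a broadcast row of n PEs."""
    segments = math.ceil(n / m)
    return (m - 1) * (t_news + t_alu) + (segments - 1) * (t_bus + t_alu)

def closed_form_m(n, t_news=1, t_bus=1, t_alu=1):
    """Real-valued minimizer obtained with n/m in place of ceil(n/m)."""
    return math.sqrt(n * (t_bus + t_alu) / (t_news + t_alu))

def best_integer_m(n, t_news=1, t_bus=1, t_alu=1):
    """Exhaustive search over the integer segment sizes 1..n."""
    return min(range(1, n + 1), key=lambda m: t_c(m, n, t_news, t_bus, t_alu))
```

For equal unit delays the minimizer is simply the square root of n, so short segments and few bus transfers are balanced against each other.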
The semigroup computation can be performed in time $O(n^{2/5})$ on a rectangular $n^{6/5} \times n^{4/5}$ MORB mesh with $r = n^{2/5}$ [24]. However, we consider only square meshes in this paper (as noted in the Introduction). The MORB$^R$ mesh with separable buses is treated now for the same operations as above.
THEOREM 3 Given the MORB$^R$(n, p) where each PE stores a datum, the logical OR of the data on individual rows/columns can be found in parallel in time $T^{R}_{rows\_OR}(n,p) = (r-1)\,t_{NEWS} + (r - 1 + \lceil \log p \rceil r)\,t_{ALU} + \lceil \log p \rceil r\,t_{bus}$ under the UTD model. The $t_{bus}$ term for the LTD model is $r \sum_{i=1}^{\lceil \log p \rceil} \log(2^{i-1} + 1)\,t_{bus}$.
Proof The first phase is identical to that for the MORB(n, p) mesh, requiring time $t_1 = (r-1)(t_{NEWS} + t_{ALU})$. At this point, only PEs attached to a vertical broadcast bus contain partial results to be combined in the next stage. Binary trees are then emulated to repeatedly combine pairs of values within each row until a single result is produced for the row. This emulation is easy to implement because of the separability of buses.
More specifically, $\lceil \log p \rceil$ iterations are needed to combine the partial results on each row. Each iteration consumes r cycles, the same as the number of PEs within a (column) bus segment.

FIGURE 4 The difference between the communication overheads (in number of cycles) of the store-and-forward and wormhole-routing techniques for finding the logical OR among all data in the MORB(n, p), under the UTD model.
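The binary-tree combination used in this proof can be sketched sequentially as follows. The sketch combines the p partial results of one row pairwise and counts the iterations, and a second helper adds up the two phase times of the proof under the UTD model. The function names, the base-2 logarithm, and the unit default timings are assumptions of this illustration.

```python
import math

def tree_or(values):
    """Combine values pairwise, level by level; return (result, iterations)."""
    level = list(values)
    iterations = 0
    while len(level) > 1:
        # Each surviving PE ORs its value with that of its partner, if it has one.
        level = [level[k] | level[k + 1] if k + 1 < len(level) else level[k]
                 for k in range(0, len(level), 2)]
        iterations += 1
    return level[0], iterations

def rows_or_time_utd(r, p, t_news=1, t_bus=1, t_alu=1):
    """(r-1)(t_NEWS + t_ALU) for phase one plus r*ceil(log2 p)*(t_bus + t_ALU)."""
    levels = math.ceil(math.log2(p))
    return (r - 1) * (t_news + t_alu) + r * levels * (t_bus + t_alu)
```

With p = 65 partial results the tree needs seven iterations, which is the ceiling of the base-2 logarithm of 65.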
Therefore, this operation consumes time $O((n/p) \log p)$ under the UTD model. The separability of buses on the MORB$^R$(n, p) reduces dramatically the time complexity of the algorithm in comparison to the MORB(n, p) mesh that requires time O(n) for this operation (see Theorem 2). For the MORB$^R$(n, p) with wormhole routing, we substitute $r - 1 + M$ for $(r-1)\,t_{NEWS}$. The communication overhead is thus reduced by $r(M-1) - 2M + 1$; this reduction is very significant for large values of r and M. Figure 5 shows the speedup ($T^{W}_{rows\_OR}(n,p)/T^{R}_{rows\_OR}(n,p)$) as a function of n, p, and M, for $t_{bus} = 1$ (i.e., word-wide buses) and $t_{bus} = M$. The MORB(n, p) and MORB$^R$(n, p) use wormhole routing and store-and-forward, respectively. This figure illustrates the versatility of the lower-cost mesh with separable buses that does not employ the higher-cost wormhole-routing technique.
The next corollary expands further by investigating the problem of producing a single global result on the MORB$^R$ mesh with separable buses.

Proof The algorithm for this reduction operation comprises three phases. In the first phase, every PE which is not attached to a row broadcast bus sends its value to its closest PE which is attached to a row broadcast bus. The OR operation is applied to contained and received values in all intermediate steps. This phase consumes time $t_1 = (r/2)(t_{NEWS} + t_{ALU}) + t_{ALU}$ under both delay models. At this point, the OR reduction operation must be applied only for PEs attached to a row broadcast bus. In the second phase, the OR results are found in parallel for all rows of PEs which are attached to a row broadcast bus. This is accomplished by first combining sequentially the data within row segments attached to broadcast buses and then reducing the values further by emulating binary trees, as in the proof of Theorem 3. This phase takes time $t_2 = (r-1)(t_{NEWS} + t_{ALU}) + \lceil \log p \rceil (t_{bus} + t_{ALU})$ under the UTD model. For the LTD model this phase takes time $t'_2 = (r-1)\,t_{NEWS} + (r - 1 + \lceil \log p \rceil)\,t_{ALU} + \sum_{i=1}^{\lceil \log p \rceil} \log(2^{i-1} + 1)\,t_{bus}$. At the end of the second phase the partial results are stored in those PEs of column 0 which are located at bus intersections. These p values are combined in the third phase by emulating a binary tree on column 0. This phase takes time $t_3 = \lceil \log p \rceil (t_{bus} + t_{ALU})$ under the UTD model. For the LTD model this phase takes time $t'_3 = \lceil \log p \rceil\,t_{ALU} + \sum_{i=1}^{\lceil \log p \rceil} \log(2^{i-1} + 1)\,t_{bus}$. Hence, the result.

COROLLARY 5 Given the MB(n) mesh with multiple broadcast where each PE stores a datum, the logical OR among all data in the system can be found and stored in the $PE_{0,0}$ in time $T_{all\_OR}(n,n) = 2t_c$ and $T'_{all\_OR}(n,n) = 2t'_c$ under the UTD and LTD models, respectively, where $t_c$ and $t'_c$ are as in Corollary 3.
Proof The process described for the second phase of the algorithm in the proof of Corollary 3 is applied twice. This process consumes time $t_c$ and $t'_c$ under the UTD and LTD models, respectively. The first time it finds in parallel the logical OR results for individual rows and stores them on the leftmost column. The second time this process is applied to the latter column in order to store the final result in the $PE_{0,0}$.
Therefore, this algorithm consumes time $O(n^{1/2})$ under the UTD model. The asymptotic time required on the MORB(n, p) is larger for $O(n^{1/2}) > O(p)$. At lower hardware cost the MORB$^R$(n, p) may perform better than the MB(n) for $O(n^{1/2}) < O(p)$. Also, it is important to notice that the asymptotic performance of the lower-cost MORB mesh is similar to that of the MB mesh for practical cases where $O(p) = O(n^{1/2})$.
As earlier, PE_buses represents the total number of PEs attached to broadcast buses and is given by Proposition 2. This cost is in reference to the regular n xn mesh without broadcast buses, and can be used to compare the MORB, MORBR, and MB meshes for this operation; the MB(n) mesh has the same PE_buses as the MORB(n, n), that is 2n .
If $t_{bus} = t_{NEWS} = t_{ALU}$, the cost ratios ($cost_{all\_OR}(n,p)/cost_{all\_OR}(n,n)$) and ($cost^{R}_{all\_OR}(n,p)/cost_{all\_OR}(n,n)$) under the UTD model are shown in Figure 6 for practical values of n and p. Figure 6 shows that the cost ratio is very small for both families of MORB meshes. It is even smaller for the MORB$^R$(n, p) mesh because its separable buses support more efficient implementation of this global OR operation at a very small additional hardware cost.
For systems with wormhole routing, the additional communication overhead of the versatile MORB$^R$(n, p) mesh when compared to the higher-cost MB(n) mesh is $O(r + \log p - n^{1/2})$. In practical cases this difference in communication overheads is either insignificant or negative, thus supporting our hypothesis that MORB$^R$ meshes can often perform better than higher-cost meshes with multiple broadcast. Figure 7, which shows the difference $T_{all\_OR}(n,n) - T^{R,W}_{all\_OR}(n,p)$, supports our claim that MORB$^R$ meshes with wormhole routing achieve much better performance in practical cases than MB meshes with multiple broadcast.
The last algorithm can also be used to find, in the same amount of time, the maximum or minimum of $n^2$ values stored one value per PE. This is a very common operation. An alternative algorithm that implements this operation on the reconfigurable mesh was presented in [10]. The latter algorithm is adapted here for its application to MORB meshes. The proofs in the next two theorems show the modifications needed in the algorithm for its application to MORB and MORB$^R$ meshes. The execution times stated in Corollaries 3 and 4 are then used in a comparative analysis involving the two algorithms that can solve the same problem.
Proof Without loss of generality, maximum in this proof stands for maximum or minimum. An algorithm developed for the reconfigurable mesh is modified here for the MORB mesh. The maximum among p values $x_0, x_1, \ldots, x_{p-1}$ initially stored on a single row of the reconfigurable $p \times p$ mesh is determined as follows [10]. Every $PE_{i,j}$ first receives $x_j$, for all $0 \le i, j < p$; a broadcast operation that involves all p columns is applied for this purpose. Then, every $PE_{i,i}$ uses a row broadcast to send its value $x_i$ to all $PE_{i,j}$s on the same row, for all $0 \le i, j < p$. Every $PE_{i,j}$ then compares the values $x_i$ and $x_j$ it contains and produces '1' if $x_j < x_i$; otherwise it produces '0'. If the result of the OR operation among all these values on column j is 0, then $x_j$ is the maximum.
The OR results for all columns are produced in parallel and stored in different PEs of row 0. These OR results can be found by bus splitting so that values from pairs of PEs are combined each time in a binary tree fashion. As more than one PE on row 0 may then contain the maximum value, a bus splitting technique is used on row 0 to send this value to the $PE_{0,0}$. More specifically, every PE on row 0 that contains the maximum value disconnects the bus on row 0 from its neighbor to the right and only PEs that contain the maximum value are allowed to broadcast. This way, the $PE_{0,0}$ receives the maximum value.
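The comparison scheme described above for the reconfigurable mesh can be mimicked sequentially. The sketch below builds the p × p table of comparison bits, ORs each column, and reports the columns whose OR is 0. The function name and the returned list of winning columns are assumptions of this illustration; the bus operations themselves are not modeled.

```python
def max_by_comparison(xs):
    """Return (maximum, winning columns) using the all-pairs comparison scheme."""
    p = len(xs)
    # flag[i][j] is the bit produced by PE(i, j): 1 if x_j < x_i, otherwise 0.
    flag = [[1 if xs[j] < xs[i] else 0 for j in range(p)] for i in range(p)]
    # column_or[j] is the OR of column j, as stored in the PE of row 0.
    column_or = [int(any(flag[i][j] for i in range(p))) for j in range(p)]
    # Every column whose OR is 0 holds the maximum; ties give several winners.
    winners = [j for j in range(p) if column_or[j] == 0]
    return xs[winners[0]], winners
```

When the maximum occurs more than once, several columns win, which is why the text relies on bus splitting on row 0 to deliver a single copy to the corner PE.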
The same algorithm is applied here for data on each broadcast row of the MORB(n, p), after the smallest possible number of local maxima are found for blocks of the mesh, with these local maxima being stored in PEs attached to broadcast buses. Therefore, the algorithm begins with every PE which is not attached to a broadcast bus sending its value to its closest PE which is attached to a broadcast bus. After every data transmission, the maximum value is found of the already contained value and any value received. This phase takes time $t_1 = (r/2)(t_{NEWS} + t_{ALU}) + t_{ALU}$; the last addition is due to PEs attached to broadcast buses that receive two values simultaneously. At this point, only the PEs attached to broadcast buses contain values which will be involved in the next phase of the algorithm. In the second phase, the values stored in the PEs attached to broadcast buses are combined further using NEWS connections. At the end of this phase, only the PEs at the intersections of broadcast buses contain values that remain to be combined. This second phase takes time $t_2 = (r/2)(t_{NEWS} + t_{ALU}) + 3t_{ALU}$; the last term is due to the simultaneous arrival of four values at some bus intersections.

FIGURE 7 The difference $T_{all\_OR}(n,n) - T^{R,W}_{all\_OR}(n,p)$, in number of cycles, under the UTD model.
Therefore, in the third phase the algorithm described in the beginning of this proof for the reconfigurable mesh is used p times, once for each set of values stored on a row broadcast bus (i.e., similarly to data on a single row of the reconfigurable mesh). The following analysis assumes that in the kth execution of the third phase the maximum among the values on the (k-1)th broadcast row is found and stored in the $PE_{0,(k-1)r}$, where k = 1, 2, ..., p. Thus, at the end of the third phase the PEs of row 0 located at bus intersections are involved to find the maximum and store it in the $PE_{0,0}$, in time $\tau$.
Each application of the iterative process in the third phase requires time $\tau = 3t_{bus} + t_{ALU} + p\,(t_{bus} + t_{ALU})$ under the UTD model; each iteration involves a column broadcast, a row broadcast, an ALU comparison, a logical OR operation on columns, and the collection of the maximum value on row 0 using a single row broadcast (it was assumed in the Introduction that many PEs are allowed to send simultaneously the same value on the same broadcast bus). Therefore, the third phase of the algorithm requires time $t_3 = (p+1)\,\tau$ under the UTD model.
Therefore, the asymptotic time of this algorithm is $O((n/p) + p^2)$ under the UTD model. The algorithm presented in Corollary 3 can solve the same problem in time $O((n/p) + p + n^{1/2})$. It can be concluded that in practical cases, where $p^2 > n^{1/2}$, the latter algorithm has lower time complexity. It can be seen that this is also true for systems with wormhole routing.
The next theorem investigates the same problem for the MORB$^R$(n, p) with separable buses. The switches present in the MORB$^R$(n, p) can be exploited for a closer match with the algorithm developed in Theorem 4 for the reconfigurable mesh.
THEOREM 5 If data items are stored one per PE in the MORB$^R$(n, p), the maximum or minimum among the $n^2$ values can be found and stored in the $PE_{0,0}$ in time $T^{R}_{max/min}(n,p) = r\,t_{NEWS} + [r + 4 + (p+1)(\lceil \log p \rceil + 1)]\,t_{ALU} + (p+1)(\lceil \log p \rceil + 3)\,t_{bus}$ under the UTD model. The $t_{bus}$ term for the LTD model is [4log(p,)-41og(([ 1 1),) [ + (p + )logp + (p + Z i=1 log(2 -+ 1) + 2] tbu.
Proof Without loss of generality, maximum in this proof stands for maximum or minimum. As in the proof of Theorem 4, first the smallest possible number of local maxima are found in the fastest possible way for blocks of the mesh, with these maxima finally stored in PEs attached to broadcast buses. The time for this phase is $t_1 = (r/2)(t_{NEWS} + t_{ALU}) + t_{ALU}$. At this point, only the PEs attached to broadcast buses contain values which will be involved in the next phase of the algorithm. In the second phase, the values stored in the PEs attached to broadcast buses are combined by finding the maxima for subsets of PEs as in Theorem 4; NEWS connections are used for this purpose. At the end of this phase, only the PEs located at the intersections of broadcast buses contain values that remain to be combined. This phase takes time $t_2 = (r/2)(t_{NEWS} + t_{ALU}) + 3t_{ALU}$.

Therefore, in the third phase the algorithm for the reconfigurable mesh is used p times, once for each set of values stored on a row broadcast bus. The following analysis assumes that in the kth execution of the third phase the maximum of the values on the (k-1)th broadcast row is found and stored in the $PE_{0,(k-1)r}$, where k = 1, 2, ..., p. Thus, at the end of the third phase the PEs on row 0 are involved to find the maximum and store it in the $PE_{0,0}$, by executing the same algorithm once more. The iterative process in the third phase includes a (parallel) column broadcast, a row broadcast, an ALU comparison, an OR reduction operation on columns using binary tree emulations, and bus splitting on row 0. The third phase requires time $t_3 = (p+1)\left(3t_{bus} + t_{ALU} + \lceil \log p \rceil (t_{bus} + t_{ALU})\right)$ under the UTD model.
Therefore, this operation consumes time $O((n/p) + p \log p)$ on the MORB$^R$(n, p) under the UTD model. This performance is better than that of the MORB(n, p) in Theorem 4, which is $O((n/p) + p^2)$. It is the result of higher utilization of bus segments and increased number of parallel ALU operations. However, this new algorithm does not perform better in practical cases than the algorithm for the MORB$^R$(n, p) in Corollary 4 that can solve the same problem in time $O((n/p) + \log p)$.
The MORB$^R$(n, p) without wormhole routing even performs better than the higher-cost MORB(n, p) with wormhole routing for this algorithm. The additional execution time of the latter is $(p+1)(p - 1 - \lceil \log p \rceil)(t_{ALU} + t_{bus}) - M - r(M-1)$, which is much larger than 0 in all practical cases (e.g., for p = 65 and $t_{ALU} = t_{bus} = 1$, this difference is positive for all $M \le 114$ packets). This result also shows that MORB$^R$ meshes with separable buses do not necessarily require the higher-cost wormhole-routing technique to achieve very good performance.
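The sign change in the example can be checked numerically. The text gives p and the unit delays but not r, so the sketch below assumes r = 65 and a base-2 logarithm; under these assumptions the expression is positive up to M = 114 and negative from M = 115. The function name and the search range are also choices made only for this illustration.

```python
import math

def extra_time(M, p=65, r=65, t_alu=1, t_bus=1):
    """Additional execution time of the MORB with wormhole routing (UTD model)."""
    levels = math.ceil(math.log2(p))
    return (p + 1) * (p - 1 - levels) * (t_alu + t_bus) - M - r * (M - 1)

# Largest packet count for which the difference is still positive (r = 65 assumed).
largest_M = max(M for M in range(1, 1000) if extra_time(M) > 0)
```

Because the expression decreases linearly in M, a single threshold separates the positive and negative regions.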

V. ADVANCED ALGORITHMS
The effectiveness of MORB and MORB$^R$ meshes in implementing efficiently prefix computations and graph connected-component labeling is investigated in this section. Both operations are very important in parallel processing.

A. Prefix Computation
For a prefix computation we are given $n^2$ inputs $a_1, a_2, \ldots, a_{n^2}$ and a binary associative operator $\otimes$. The $n^2$ outputs $b_1, b_2, \ldots, b_{n^2}$ are such that $b_k = a_1 \otimes a_2 \otimes \cdots \otimes a_k$, for any $1 \le k \le n^2$. This problem is fundamental in parallel processing and has several applications [20, 28]. Hardwired implementation of prefix operations is very common for massively parallel computers.
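As a concrete reference point, the row-major prefix can be written as a sequential sketch on a two-dimensional list, using the same decomposition that the lemma of this section analyzes for the regular mesh: a scan of every row, a scan of the rightmost column, and a final update of each row. The function name and the list-of-lists representation are assumptions of this illustration, and timing is not modeled.

```python
from itertools import accumulate
import operator

def mesh_prefix(grid, op):
    """Row-major prefix of a 2D list under an associative operator op."""
    # Phase 1: left-to-right prefix within every row.
    rows = [list(accumulate(row, op)) for row in grid]
    # Phase 2: top-to-bottom prefix over the rightmost column.
    last = list(accumulate((row[-1] for row in rows), op))
    for k, row in enumerate(rows):
        row[-1] = last[k]
    # Phase 3: row k combines the rightmost value of row k-1 into its other PEs.
    for k in range(1, len(rows)):
        carry = rows[k - 1][-1]
        rows[k][:-1] = [op(carry, v) for v in rows[k][:-1]]
    return rows

grid = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
flat = [v for row in mesh_prefix(grid, operator.add) for v in row]
```

String concatenation is used in the usage lines because it is associative but not commutative, so the order of the operands in the third phase matters.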
An algorithm developed for the regular mesh without broadcast buses is modified here to make it suitable for MORB and MORB$^R$ meshes. The next lemma is relevant.

LEMMA
Given a set of $n^2$ values $a_1, a_2, \ldots, a_{n^2}$, distributed one per PE in the row-major order, and a binary associative operator $\otimes$, the parallel prefix problem for this operator can be solved on the regular mesh without broadcast buses in time $T_{mesh,prefix}(n) = 3N\,t_{NEWS} + (2N+1)\,t_{ALU}$.
Proof The prefix computation problem is solved on the regular mesh without broadcast buses in three phases. In the first phase, the prefix computation is carried out sequentially from left to right on every row. This phase takes time $t_1 = N(t_{NEWS} + t_{ALU})$. In the second phase, the prefix computation is applied to the rightmost column sequentially, from top to bottom. This phase takes the same amount of time as the first phase (i.e., $t_2 = t_1$). In the third phase, the content of the rightmost PE on every row k-1 is sent to all PEs on row k, except for the rightmost PE, for k = 1, 2, ..., N, and the operator is applied to the contained and received values. The most efficient implementation of this phase consumes time $t_3 = N\,t_{NEWS} + t_{ALU}$. Hence, the result.

Therefore, the prefix computation problem consumes time O(n) on the regular mesh. Wormhole routing improves the timing for the third step to $N + M + t_{ALU}$. The total time then becomes $T^{W}_{mesh,prefix}(n) = (2N+1)(M + t_{ALU}) + N$. The next theorem deals with the MORB mesh.

THEOREM 6 Given a set of $n^2$ values $a_1, a_2, \ldots, a_{n^2}$, distributed one per PE in the row-major order, and a binary associative operator $\otimes$, the parallel prefix problem for this operator can be solved on the MORB(n, p) in time $T_{prefix}(n,p) = 5(r-1)\,t_{NEWS} + (3r + n + p)\,t_{ALU} + (2r + N + p - 1)\,t_{bus}$ under the UTD model. The $t_{bus}$ term for the LTD model is $(r(1 + \log p) + N + p - 1)\,t_{bus}$.
Proof In the first phase, the prefix computation is carried out in parallel for all of the row segments that fall between any two consecutive column buses. That is, this phase groups together all PEs, on the same row i, with Cartesian coordinates $(i, r j_1 + j_2)$, where $0 \le i \le N$, $0 \le j_1 \le p-2$ and $1 \le j_2 \le r$. This phase involves left-to-right shifts within these row segments and requires time $t_1 = (r-1)(t_{NEWS} + t_{ALU})$ under both models.

In the second phase, the following process is repeated sequentially in parallel from the leftmost to the rightmost smallest-sized squares formed by broadcast buses. The switches are set so that the r previously calculated partial prefix values which are located on the left side of such squares are transferred one-by-one to the PE each time that has the same x coordinate with the initial sender but lies on the opposite side of the corresponding square. The $\otimes$ operator is then applied each time to the received value and the contained partial prefix value. Therefore, this phase requires time $t_2 = p\,r\,(t_{bus} + t_{ALU})$ or $t_2 = N(t_{bus} + t_{ALU})$ under both models; the first term p represents the total number of iterations whereas r represents the total number of separate parallel data transfers in each iteration.

In the third phase, using NEWS connections the partial prefix values stored in the PEs with addresses $(i, r j_1)$ are sent sequentially to the PEs with addresses $(i, r j_1 + j_2)$, for all $0 \le j_1 \le p-2$ and $1 \le j_2 < r$, and the $\otimes$ operator is applied to the received value and the previous prefix value stored in each PE. The most efficient implementation consumes time $t_3 = (r-1)\,t_{NEWS} + t_{ALU}$.

In the fourth phase, the prefix computation is carried out on the rightmost column in a top-down fashion. In a way similar to that for the first phase, the prefix computation is carried out in parallel for all of the segments that fall between two consecutive row buses. Then, partial prefix values are sent in a sequential top-down manner between PEs on the rightmost column attached to consecutive row buses and the $\otimes$ operator is applied. The fourth phase takes time $t_4 = t_1 + p\,(t_{bus} + t_{ALU})$ or $t_4 = (r-1)(t_{NEWS} + t_{ALU}) + p\,(t_{bus} + t_{ALU})$, under both delay models. Time $t_5$ equal to $t_3$ is then needed in the fifth phase to perform prefix computations within the segments used in the fourth phase.

In the sixth phase, the r prefix values at the rightmost side of each smallest-sized rightmost square of buses are broadcast one-by-one to all PEs at bus intersections which are also attached to the upper bus of each such referenced square. The column buses are then used to send these prefix values for rows to PEs attached to column buses so that the PE on row k receives the value coming from the rightmost PE on row k-1. Then, the $\otimes$ operator is applied. This sixth phase takes time $t_6 = r(2t_{bus} + t_{ALU}) - t_{bus}$ and $t'_6 = r[(1 + \log p)\,t_{bus} + t_{ALU}] - t_{bus}$ under the UTD and LTD models, respectively, assuming column-bus splits just above their intersections with row buses. The seventh phase is carried out similarly to the third phase, the only difference being that the value transmitted is the one received in the sixth phase. Therefore, this phase takes time $t_7$ equal to $t_3$. By summing up the times for the seven phases, we get the times stated in this theorem.
Therefore, the parallel prefix algorithm requires time O(n) on the MORB(n, p), under the UTD model. This is also the asymptotic time for the regular mesh without broadcast buses. However, the constants in $T_{prefix}(n,p)$ are smaller. For example, assume that n = 513, p = 65, $t_{NEWS} = 1$, $t_{bus} = 1$, and $t_{ALU} = 1$. Then, the speedup ($T_{mesh,prefix}(n)/T_{prefix}(n,p)$) = (3072/1227) = 2.5037 is significant. Wormhole routing reduces the time for each of the steps 3, 5, and 7 from $(r-1)\,t_{NEWS} + t_{ALU}$ to $r - 1 + M + t_{ALU}$. To derive the best performance of the MB(n) for this problem, we find the value of p that minimizes the execution time of the MORB(n, p). This value is $p_{optimal} = \left[1 + \left[(n-1)(5t_{NEWS} + 3t_{ALU} + 2t_{bus})/(t_{ALU} + t_{bus})\right]^{1/2}\right]$. Other types of data ordering may result in lower time complexity for prefix computations [24]. However, most often data are stored in the row-major order.
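The minimizing value of p can be illustrated with a short sketch that evaluates the UTD expression of Theorem 6 and its stationary point. The relation r = (n-1)/(p-1), treated here as a real number, is an assumption of this sketch, as are the function names and the unit default timings; the outer bracket of the printed formula is not applied.

```python
import math

def prefix_time_utd(n, p, t_news=1, t_bus=1, t_alu=1):
    """Theorem 6 expression with r = (n-1)/(p-1) taken as a real number."""
    N = n - 1
    r = N / (p - 1)
    return (5 * (r - 1) * t_news + (3 * r + n + p) * t_alu
            + (2 * r + N + p - 1) * t_bus)

def p_optimal(n, t_news=1, t_bus=1, t_alu=1):
    """Stationary point of prefix_time_utd with respect to p."""
    k = 5 * t_news + 3 * t_alu + 2 * t_bus
    return 1 + math.sqrt((n - 1) * k / (t_alu + t_bus))
```

The terms proportional to r shrink as p grows while the terms proportional to p grow, which is why a single interior minimum exists.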
The algorithm for the MORBR(n, p) follows.
Proof In the first phase, the prefix computation is carried out in parallel for all of the row segments that fall between any two consecutive column buses. That is, this phase groups together all PEs on the same row i, with Cartesian coordinates $(i, r j_1 + j_2)$, where $0 \le i \le N$, $0 \le j_1 \le p-2$, and $1 \le j_2 \le r$. This phase involves left-to-right shifts within these row segments and requires time $t_1 = (r-1)(t_{NEWS} + t_{ALU})$ under both models.
In the second phase, an algorithm that employs binary tree emulation is used [11]. Its implementation on the MORB$^R$(n, p) is carried out very efficiently due to the robustness supported by the switches. Let us focus on the p PEs which are attached to the same row broadcast bus. The first iteration combines pairs of partial prefix values coming from consecutive such PEs and stores for each such pair the result into the PE on the right side. The $\otimes$ operator is then applied to received and contained values. In the second iteration, every other PE that received a value in the previous iteration, starting with the leftmost such PE, sends its value to the two consecutive such PEs that follow on its right side. The $\otimes$ operator is then applied to received and contained values. Generally, in the ith iteration, every other PE which is the rightmost PE of a group of PEs that received the same value in the preceding iteration, starting with the leftmost such PE, sends its value to the $2^{i-1}$ consecutive PEs that follow on its right side. The $\otimes$ operator is then applied to received and contained values. A total of $\lceil \log p \rceil$ iterations are needed. Actually each iteration carries out the above process for all of the r PEs on each column bus segment. This phase consumes time $t_2 = r \lceil \log p \rceil (t_{bus} + t_{ALU})$ under the UTD model. For the LTD model the time is $t'_2 = r\left(\lceil \log p \rceil\,t_{ALU} + \sum_{i=1}^{\lceil \log p \rceil} \log(2^{i-1} + 1)\,t_{bus}\right)$.
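The iteration pattern of this second phase can be sketched sequentially for the p values on one row bus. In each iteration the rightmost PE of every left-hand group sends its value to the PEs of the group on its right, and the group size doubles. The function name and the list representation are assumptions of this illustration, and bus timing is not modeled.

```python
def tree_prefix(values, op):
    """Inclusive prefix by doubling group sizes; return (prefixes, iterations)."""
    vals = list(values)
    p = len(vals)
    iterations = 0
    g = 1  # number of PEs that receive a value from one sender in this iteration
    while g < p:
        for start in range(0, p, 2 * g):
            sender = start + g - 1  # rightmost PE of the left-hand group
            # The sender's value is the left operand, so order is preserved.
            for k in range(sender + 1, min(start + 2 * g, p)):
                vals[k] = op(vals[sender], vals[k])
        g *= 2
        iterations += 1
    return vals, iterations
```

Senders are never modified in the iteration in which they send, so the updates within an iteration are independent and could proceed in parallel on separate bus segments.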
In the third phase, the values stored in PEs attached to column broadcast buses are shifted into the r-1 PEs on their right side and the $\otimes$ operator is applied each time. The most efficient implementation of this phase takes time $t_3 = (r-1)\,t_{NEWS} + t_{ALU}$. At this point the prefix computation is complete for all rows in the mesh. In the fourth phase, the prefix computation is carried out within the rightmost column. This phase takes time $t_4 = t_2/r$ and $t'_4 = t'_2/r$ under the UTD and LTD models, respectively. In the fifth phase, values are broadcast from the rightmost column within all rows and the $\otimes$ operator is then applied. This phase takes time $t_5 = r\,t_{bus} + t_{ALU} + t_3$ under the UTD model, and $t'_5 = r(\log p\,t_{bus} + t_{ALU}) + t_3$ under the LTD model. Hence, the result.

Therefore, this operation consumes time $O((n/p) \log p)$ on the MORB$^R$(n, p) under the UTD model, and the value of the speedup ($T_{prefix}(n,p)/T^{R}_{prefix}(n,p)$) = $O(p/\log p)$ shows the superiority of the MORB$^R$ mesh with separable buses.
Assume as earlier that n = 513, p = 65, $t_{NEWS} = 1$, $t_{bus} = 1$, and $t_{ALU} = 1$. Then, the values of the speedups ($T_{mesh,prefix}(n)/T^{R}_{prefix}(n,p)$) = (3072/165) = 18.6181 and ($T_{prefix}(n,p)/T^{R}_{prefix}(n,p)$) = (1245/165) = 7.5455 are extremely impressive. Figure 8 shows that the MORB$^R$(n, p) without wormhole routing performs much better than the higher-cost MORB(n, p) with wormhole routing, for this algorithm; it is assumed that $t_{bus} = M$, $t_{NEWS} = M$, and $t_{ALU} = 1$. Figure 9 shows the speedup $T_{prefix}(n,n)/T^{R}_{prefix}(n,p)$ for the comparison of the MORB$^R$(n, p) with the higher-cost MB(n) mesh with broadcast buses; we assume that $t_{ALU} = 1$. The results show that the MORB$^R$(n, p) performs better than the mesh with multiple broadcast MB(n) in all practical cases.
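The value 165 used in both speedups can be reproduced by adding the five phase times of the preceding proof under the UTD model. The sketch assumes r = (n-1)/(p-1), which gives r = 8 here, and a base-2 logarithm; the function name is hypothetical.

```python
import math

def morb_r_prefix_phases(n, p, t_news=1, t_bus=1, t_alu=1):
    """UTD times of the five phases of the prefix algorithm on the MORB^R."""
    r = (n - 1) // (p - 1)
    levels = math.ceil(math.log2(p))
    t1 = (r - 1) * (t_news + t_alu)   # scans within row segments
    t2 = r * levels * (t_bus + t_alu) # tree emulation on the row buses
    t3 = (r - 1) * t_news + t_alu     # shifts into the r-1 PEs on the right
    t4 = levels * (t_bus + t_alu)     # equals t2 / r, rightmost column
    t5 = r * t_bus + t_alu + t3       # broadcast from the rightmost column
    return [t1, t2, t3, t4, t5]

phases = morb_r_prefix_phases(513, 65)
total = sum(phases)
```

The second phase dominates the total, which reflects the $r \lceil \log p \rceil$ bus transfers of the tree emulation.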

B. Graph Component Labeling
Algorithms for graph component labeling on the MORB(n, p) and MORB$^R$(n, p) meshes are presented here. Assume an undirected graph G(V, E), where V represents the set of vertices, with $|V| \le n$, and E represents the set of edges. The graph is represented by its adjacency matrix A where the element $a_{ij}$ is 1 if the vertices i and j are directly connected; otherwise, $a_{ij} = 0$. The connected component labeling algorithm assigns a label $l_i$ to each vertex i, where $l_i$ is the minimum of the vertex indices belonging to the same connected component with i. Its special case of 2D image connected component labeling is very popular [19].
Any two vertices in a connected component are connected through a path that employs one or more graph edges.
The graph component labeling algorithms presented here are based on an algorithm for the polymorphic torus [16]. A brief description of the latter algorithm for the $n \times n$ polymorphic torus follows. The algorithm uses two parallel variables, namely vrtx_num and label. The initialization sets $vrtx\_num_{i0} = vrtx\_num_{0i} = i$ on the first column and the first row, and these values do not change during the execution of the algorithm. $label_{i0} = label_{0i}$ contain the label assigned to vertex i and are updated in each iteration. The initialization sets $label_{i0} = label_{0i} = i$.
The main body of the algorithm implements a loop of $1 + \lceil \log n \rceil$ iterations. Each iteration executes two functions, namely shortcut and hook. shortcut assigns as a label to each vertex i the current label of the vertex with index equal to i's label. hook assigns a new label to each vertex i which is the minimum, over the set of vertices with the same label as vertex i, among the minima of the labels of the vertices connected to at least one such vertex. The implementation of these two functions is as follows. Shortcut broadcasts the labels from the first column to the east. Both the labels and the vertex indices are broadcast from the first row to the south. Labels broadcast from the first row are then sent to the west and copied into the variable label of PEs on the first column; the label chosen for this copy operation comes from the leftmost PE of each row that receives the same value for the vertex index broadcast from the first row and for the label broadcast from the first column. Finally, the labels are copied from the first column into the first row.
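The functional definitions of shortcut and hook given above can be sketched sequentially on an adjacency matrix. The sketch departs from the description in two labeled ways: it repeats the two steps until the labels stop changing instead of running a fixed number of iterations, and hook also considers each vertex's own label so that labels never increase. Both choices, and the function name, are assumptions of this illustration; no bus operations are modeled.

```python
def label_components(adj):
    """Label every vertex with the smallest vertex index in its component."""
    n = len(adj)
    label = list(range(n))
    while True:
        # shortcut: vertex i takes the current label of the vertex indexed by i's label.
        short = [label[label[i]] for i in range(n)]
        # hook, step 1: smallest label among vertex i itself and its neighbours.
        near = [min([short[i]] + [short[j] for j in range(n) if adj[i][j]])
                for i in range(n)]
        # hook, step 2: smallest such value over all vertices sharing a label.
        best = {}
        for i in range(n):
            best[short[i]] = min(best.get(short[i], near[i]), near[i])
        new = [best[short[i]] for i in range(n)]
        if new == label:
            return label
        label = new
```

Labels only decrease and always remain indices of vertices in the same component, so the loop ends with the minimum index of each component.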
Each implementation of hook requires (label) column broadcasting from row 0, (label) row broadcasting from column 0, finding the minima within columns among labels received from the west by employing only the PEs that contain 1 for the adjacency matrix element (these minima are then stored in the PEs of the first row), broadcasting these minima from row 0 within the columns, finding the minima among just broadcast values within rows by choosing only the PEs that received in the beginning the same label from the north and west (these minima are then stored as labels in the PEs of column 0), and, finally, copying these labels from each $PE_{i0}$ into the $PE_{0i}$ (as in the last step of shortcut, the labels stored on the first column are copied into the first row). This graph component labeling algorithm requires time $O(\log^2 n)$ on the polymorphic torus.

THEOREM 8 The graph component labeling problem can be solved on the MORB(n, p) in time $T_{labeling}(n,p) = (1 + \lceil \log n \rceil)\left[3(r - 1 + 2(n-p))\,t_{NEWS} + (3rp - 1)\,t_{ALU} + 3(3r + N)\,t_{bus}\right]$ under the UTD model. The $t_{bus}$ term for the LTD model is 3 r log(p3p ")tbus.
Proof Assume a graph G(V, E) with up to n vertices. Let row 0 and column 0 contain in the beginning the indices of the assigned vertices, while the adjacency matrix A is already stored in the $n \times n$ mesh, such that the $PE_{ij}$ contains $a_{ij}$. Thus, there is no need for initialization. A loop of $1 + \lceil \log n \rceil$ iterations is carried out, similarly to the one described previously for the polymorphic torus. Each implementation of shortcut requires a (parallel) column broadcast, a row broadcast (since the vertex index does not change, there is no need to broadcast it), a comparison in each PE of two values, an operation that finds the PE with the smallest index on each row for which the comparison produces affirmative result, an operation that copies the label from the latter PE of each row into the PE on the first column, and an operation that copies these labels from each $PE_{i0}$ into the $PE_{0i}$. The column and row broadcasts, and the subsequent comparison are implemented in total time $t_1 = 2r\,t_{bus} + t_{ALU}$ and $t'_1 = 2r \log p\,t_{bus} + t_{ALU}$, under the UTD and LTD models, respectively. The operation that locates the PE with the smallest index on each row for which the comparison is affirmative, and copies the label from the latter PE into the corresponding PE on the first column consumes time $t_2 = T_{rows\_OR}(n,p)$ and $t'_2 = T'_{rows\_OR}(n,p)$, under the UTD and LTD models, respectively; we assume that the index of the source PE is attached to the transmitted label. In the last phase of shortcut, the labels stored on the first column are copied into the first row; the $PE_{0i}$ receives the label of the $PE_{i0}$, for i = 1, 2, ..., N. This phase is implemented as follows. Using broadcast buses, each $PE_{i0}$ sends its label to the PE indexed by j, where j = [(i/r)] r. Then, broadcast buses are employed by each such PE in order to send the r values it received to the r distinct PEs $PE_{0,j+s}$, for s = 0, 1, 2, ..., r-1.
This process requires time $t_3 = 2r\,t_{bus}$ and $t'_3 = 2r \log p\,t_{bus}$, under the UTD and LTD models, respectively. Ignoring temporarily the last column-to-row copy operation, each implementation of hook involves three broadcast operations on rows/columns, one internal PE comparison, and two operations to find the minimum within rows/columns. The three broadcast operations and the comparison require total time $t_4 = 3r\,t_{bus} + t_{ALU}$ and $t'_4 = 3r \log p\,t_{bus} + t_{ALU}$, under the UTD and LTD models, respectively.
The two operations that find the minimum within individual rows/columns require total time $t_5 = 2T_{rows\_OR}(n,p)$ and $t'_5 = 2T'_{rows\_OR}(n,p)$, under the UTD and LTD models, respectively. The last copy operation requires time $t_6 = t_3$ and $t'_6 = t'_3$, under the UTD and LTD models, respectively. Hence, the result.

Therefore, this algorithm consumes time $O(n \log n)$ on the MORB(n, p) mesh under the UTD model. With wormhole routing the term representing NEWS data transfers is replaced by $3(r - 1 + M)(1 + 2p)$.
Proof The same algorithm as in the proof of Theorem 8 is considered here. However, some steps in its implementation are altered to take advantage of the switches on separable buses. The implementation of shortcut requires two row/column (parallel) broadcast operations and a comparison that consume total time $t_1 = 2r\,t_{bus} + t_{ALU}$ and $t'_1 = 2r \log p\,t_{bus} + t_{ALU}$, under the UTD and LTD models, respectively. Also, the operation that copies into column 0 the labels of PEs with the smallest index on each row for which the comparison produces affirmative result consumes time $t_2 = T^{R}_{rows\_OR}(n,p)$ and $t'_2 = T^{\prime R}_{rows\_OR}(n,p)$, under the UTD and LTD models, respectively.
Finally, the operation that copies the labels from column 0 into row 0 consumes time similar to that in the proof of the preceding theorem. More specifically, this process requires time $t_3 = 2r\,t_{bus}$ and $t'_3 = 2r \log p\,t_{bus}$, under the UTD and LTD models, respectively.
The implementation of hook requires three row/column broadcast operations and a comparison that consume total time $t_4 = 3r\,t_{bus} + t_{ALU}$ and $t'_4 = 3r \log p\,t_{bus} + t_{ALU}$, under the UTD and LTD models, respectively. The two operations that find the minimum among labels on each row/column consume total time $t_5 = 2T^{R}_{rows\_OR}(n,p)$ and $t'_5 = 2T^{\prime R}_{rows\_OR}(n,p)$, under the UTD and LTD models, respectively. The last operation in hook consumes time $t_6 = t_3$ and $t'_6 = t'_3$, under the UTD and LTD models, respectively. Hence, the result.