Analysis of Parallel Multidimensional Wave Digital Filtering Network on IBM Cell Broadband Engine

As an alternative approach for the numerical integration of physical systems, the MDWDF technique has become of importance in the field of numerical analysis due to its attractive features, for example, massive parallelism and high accuracy both inherent in nature. In this study, speed-up efficiencies of a MDWDF network are studied for the linearized shallow water system, which plays an important role in fluid dynamics. To achieve the goal, the full parallelism of the MDWDF network is established in the first place based on the chained MD retiming technique. Following the implementation on the IBM Cell Broadband Engine (Cell/BE), excellent performance of the full parallel architecture is revealed.The IBMCell/BE containing 1 power processor element (PPE) and 8 synergistic processor elements (SPEs) perfectly fits the architecture of the retimed MDWDF model. Empirical results have demonstrated that the full parallelized model with 8 processors (1PPE + 7SPEs) outperforms the other three models: partial right/left-loop retimed models and the full sequential model with 4× improvements for scheduled grids 51 × 51. In addition, for scheduled fine grids 201 × 201, the full parallel model is shown to possess significant performance over these models by up to 7× improvements.


Introduction
Physical system modeling is an important discipline in all fields as it enables the development of system simulation engines for physically realistic processes such as fluid flow, electrical, and acoustical phenomena.The most popular kind of models for multidimensional (MD) physical systems can be represented by sets of linear and/or nonlinear partial differential equations (PDEs) with properly imposed initial and boundary conditions.Many numerical approaches existing to analyze the behavior of such physical systems include finite elements (FEs) and finite differences (FDs).The FEs permit easy inclusion of local grid refinement and handling of complex geometries [1].These methods, however, are computationally expensive and harder to directly and correctly set up the simulation plane than the frequently used finite difference methods.The latter approaches, on the other hand, have difficulties in handling irregular boundaries and need extra care of the local grid refinement in order to increase its measurement accuracy [2].As FEs and FDs require a large number of grid storage in order to get results even at or around a point of interest with acceptable accuracy, the requirements for excessive CPU runtime and storage consumption are often unnecessarily large in such cases.Furthermore, the computational load of the difference method due to local grid refinement also prevents its use for real-time hardware synthesis.
Built on properties of the traveling wave formulation of lumped electrical elements to the modeling and simulation of a system represented by PDEs, a novel approach named wave digital filtering (WDF) network [3] had been proposed in the past due to its excellent features that fit requirements of practical interest [4,5].Unlike most types of digital filter, every delay element in a WDF network can be interpreted physically as holding the current state of a mass or spring (or capacitor or inductors) [3].Furthermore, because the rules for interconnecting the elementary models are based on scattering theory, all signals explicitly computed via a WDF 2 Journal of Computational Engineering network are interpreted as traveling wave components of physical variables [4,6].Making use of the WDF network paradigm and analogies with electrical networks, the MDWDF technique, thus, draws a maximum advantage of essential physical properties of such systems, in particular of causality, passivity, stability, finite propagation velocity, and so forth, which can be all translated to the actually considered physical problems so as to preserve important relationships between variables [4,5,[7][8][9][10].Proceeding in this way, it has the unique advantage of simultaneously offering the second-order accuracy, high robustness and fault tolerance, massive parallelism, full localness (also for taking into account arbitrary boundary conditions and shapes), and explicit or at least semiexplicit computability.
Fettweis and fellow researchers [3][4][5]7] have carried out the pioneering work in the area of this subject.Our interest in the past and currently has revolved around the software toolbox development and hardware designing aspects [8][9][10][11][12][13].In particular, the full parallel architecture of MDWDF network based on the chained MD retiming technique [14], a class of software pipelining, is focused on partitioned nested loop across distinct iterations in terms of spatial and temporal updating of its corresponding nonparallel MDWDF network.With the full parallelism adopted within loops, it is then desirable to exploit the multiple core hardware architecture with safe and efficient multithreading paradigm per core to further boost the performance of the MDWDF network.This will be a significant advantage of parallel implementation.Clearly, the proposed method is only attractive if it can be shown that the resultant simulation models provide an efficient technique to the solution of certain PDEs representing mechanical behaviors of the physical system when it is compared with conventional approaches such as finite elements.
Adopting the full parallel architecture [11], in this study, the analysis of CPU runtime speed-up efficiencies is empirically implemented on the IBM Cell Broadband Engine (Cell/BE) to further improve performance of the MDWDF network representing the linearized shallow water (LSW) system, which plays an important role in fluid dynamics.The IBM Cell/BE facilitates 1 power processor element (PPE) core processor providing system functions together with 8 synergistic processor element (SPE) coprocessors optimized for efficient data processing.As the key design goal of the full parallel MDWDF network is to maximize the performance in terms of speed-up efficiency, the IBM Cell/BE perfectly fits the architecture of the retimed MDWDF network, which requires at most 1 main processor and 7 routine processors running independently.Empirical results have demonstrated that the full parallelized model making use of 8 processors (1PPE + 7SPEs) significantly outperforms other three models: models with partial right/left-loop retimed bodies and the full sequential model by at least 3× and 4× improvements for scheduled grids 51 × 51, respectively.In addition, for scheduled grids 201 × 201, the full parallel model is shown to possess significantly performance over these models by up to 7× improvements.

2𝐷 LSW System and Its Corresponding Multidimensional
Wave Digital Filtering Network.Let us consider a linearized shallow water (LSW) system characterized by a set of PDEs for the surface displacement,  [1,8,16]: Here the horizontal velocities V 1 and V 2 are directed along the spatial domain of  and  direction, respectively.Furthermore, ℎ is the total water depth defined as the sum of the undisturbed water depth (a constant mean depth)  and the free surface elevation  measured upward from the undisturbed surface; that is, ℎ =  + .The gravity acceleration  and the Coriolis parameter  are constants.This hydrostatic formulation is the combined physical processes of radiation, refraction, diffraction, and reflection when the system is confined in a bounded domain with nonempty piecewise boundary.In an unbounded domain, the LSW system, however, generates divergent and nonreflective waves based on some artificial open boundaries [2].As a result, this modeling is quite successful in fluid dynamics to predict the dynamics of the surface gravity waves and is usually applied to the global modeling on large scale oceanographic or atmospheric quantities like transports and surface elevation, Tsunami modeling, and simulation [2,6].
Applying the standard procedure known from the MD wave digital filtering [4,5] for transforming the set of PDEs to its equivalent discrete passive model, a lumped MD-passive Kirchhoff circuit (MDKC) and its discrete approximation of MDWDF network [11] depicted in Figures 1(a) and 1(b), respectively, are obtained for the numerical integration of the LSW system.Since the resulting network behaves in the same way as the continuous one, it also preserves passivity for the discrete dynamical system, thus ensuring full robustness and stability of the algorithm [4,12].The reader is referred to [8] for more details of converting the given physical system to form the MDWDF network via the MDKC.

Chained Multidimensional
Retiming.The objective of the chained MD retiming technique applied to the MDWDF network of Figure 1(b) is to legally change the delay of edges in the MD data flow graph (MDFG) such that nonzero delays on all edges can be obtained in order to achieve the full parallelism of the MDWDF network.Here the chained MD retiming is based on the push up scheduling technique [14,16] where a good example illustrated in Figure 2(b) represents a retimed MDFG from a valid MDFG depicted in Figure 2(a).We note that the form of MDFG is described as a cyclic data flow graph with the tuple (, , , ) to represent a node-weighted and edge-weighted graph where  is the set for i = 0 to . . .for j = 0 to . . . of computation nodes in the loop body,  denotes the set of directed edges representing the dependence between two nodes,  is a function representing the MD delay between two nodes, and  is the discrete time required for computing a certain node [14,16].
Define a scheduled subspace of a realizable MDFG by  = { : () ⋅  ≥ 0, ∀ ∈ }.The technique of the chained MD retiming [14,16] for scheduling MD problems starts with finding a schedule vector  ∈  + where  + is a strictly positive scheduled subspace defined by Given any scheduled element , a legal retiming function  of a realizable MDFG for each node of a loop body can be obtained when it is orthogonal to  according to [11,14].As a consequence, the chained MD scheduling generates a retiming vector () for each computing node  in the MDFG such that the MD delay is pushed from incoming edges of  to its outgoing edges, and new delay of each edge is given by It is of importance to note that following the principle of MD retiming [14,16], () can be obtained immediately without any difficulty by the following formula: where   is the maximum length of the existing chains with the highest level number  in the construction process of a MDFG, while  is the level number of the node .We notice that if an incoming edge possesses zero delay, it is necessary to apply the same retiming function to that incoming node leaving the sum of delays of the loop body unchanged.Provided the loop body depicted in Figure 3(a) of the MDWDF network, the corresponding MDFG in Figure 3(b) is scheduled by which each operation of the network loop body in Figure 3(b) is executed in one clock cycle by one time unit.As a result, the scheduled table in Table 1(a) is obtained for reference, which clearly shows that the traverse of an iteration of the loop requires a minimum of 7 clock cycles.Choosing the maximum length of the existing chain (D2, E2, F23, G23, H2 or D3, E3, F23, G23, H3) by   = 5 ((D1, E1, EF1, Gc1, F1, G1, H1 or D4, E4, EF4, Gc2, F4, G4, H4) by   = 7, resp.) with the highest clock cycle  = 4 ( = 6, resp.) for the left-loop body (the rightloop body, resp.) of the MDFG of Figure 3(b), (4) yields the retiming vectors listed in Table 2 for each node in the MDFG.Substituting the retiming vector obtained into (3), a full retimed MDFG using the MD push up scheduling is then established as illustrated in Figure 4(a) with its corresponding retimed MDFG loop body in Figure 4(b Clock cycle (level number) Operations demonstrated the achievement of full parallel architecture by the retimed MDFG of Figure 4(a).In view of the retimed MDFG, all edges contain nonzero delay, which eliminates the intraiteration dependencies.Furthermore, the total delays of each loop remains unchanged.To achieve the full parallelism, clearly the left-loop body must require at most 5 parallel processors synchronously executing all operations, while at most 7 parallel processors are required for the right-loop body.The loop body of the retimed MDFG illustrated in Figure 4(b) containing prologue and epilogue processes is executed by P0 to provide the necessary data for the parallel loops, which complementarily completes the process.Thus, provided at most 7 parallel processors named P1, . . ., P7, the retimed MDFG has achieved significantly full parallelism by reducing the number of necessary clock cycles into one pass.

Hardware Experiment
In this section, we present empirical simulations for the study of computational efficiency among different MDWDF networks all implemented on the IBM Cell/BE within SONY PS3 with architecture depicted in Figure 5(a) [15].Comparisons between full parallel and partial/or sequential networks are also given to demonstrate the excellent performance of the proposed full parallel architecture.
In view of Figure 5(a) also according to [15], the most important differences to conventional multicore CPUs is that the Cell/BE is not a homogeneous system with multiple copies of the same core.Instead, the architecture possesses uniquely itself.Apart from the parallel process implemented on the SPE coprocessors, there are processes of prologue, epilogue, and program initiation, which are controlled fully by the PPE core.
Similar advantages of using the full parallel model are also applied, which outperforms the right-loop and left-loop retimed models.As can be seen from Figures 8(c) and 8(d), the full parallel model has gained up to 3.59× and 3.86× faster than the right-loop and left-loop retimed models, respectively.The cause of less efficiency can be found in Figure 6(b) where the left-loop body, although it technically requests 5 SPEs according to the scheduled retimed MDFG in Table 2, is implemented sequentially only by the PPE core resulting in a huge burden of 5× average CPU runtime of Having the computational complexity increased 16 times by arranging the grids of 201 × 201, Figures 9(a)-9(d) clearly demonstrate that the full parallel architecture of MDWDF network has achieved its goal by significantly boosting the performance of nonparallel MDWDF network.More specifically, the full parallel model outperforms models of sequential and partial parallelism of right/left-loop bodies by the least 3.72×, 5.73×, and 6.27× runtimes, respectively, whose detail is listed in the second-half part of Table 3.More specifically, we look at Table 3 concerning about the performance of full parallel model against that of the full sequential one for two scheduled grids.Despite the increase of computational complexity, the full parallel model still remains at the range of gain [3×, 4×] over the full sequential model.By contrast, the full parallel model receives a bigger range of gains with [3×, 6×] and [3×, 7×] over the rightand left-loop retimed bodies, respectively.The increment of gain over these partial parallel models may be attributed to improper allocation of massive data transfer between the PPE core and these SPEs coprocessors.These results, nevertheless, have clearly suggested that the scheduled 64bit power architecture for full parallelism of the MDWDF network model provides the key performance advantage by fully equipping the 8 decoupled SPE SIMD engines (1PPE + 7SPEs) with dedicated resources including large register files and DMA channels.

Conclusion
In this study, we have studied the significant computer runtime performance for different models of MDWDF network implemented on the IBM cell/BE.The full parallelism of the MDWDF network was carried out by working together with 8 decoupled SPU SIMD engines, each with dedicated resources including DMA channels.In particular of interest, a custom design of 64-bit power architecture was developed, which facilitates the PPE core processor to control the prologue, epilogue, and program initiation, while the massive parallel processes of the right-and left-loop bodies were performed by at most 7 SPE coprocessors.This has rendered the Cell/BE to utilize 8 times more SIMD capability (for up to 16way data parallelism) than the conventional processors with vector architecture extensions, such as the PPC970 in the G5.Simulation results have demonstrated that the CPU runtime speed of the full parallel model outperforms that of the other 3 models by up to 4× improvement for scheduled grids 51 × 51; while for scheduled grids 201 × 201, the full parallel model can significantly gain by up to 7× improvement.

Figure 2 :
Figure 2: (a) A valid MDFG and its corresponding loop body.(b) A retimed MDFG and its corresponding loop body.

Figure 3 :
Figure 3: (a) The loop body representing the MDWDF network.(b) A MDFG corresponding to the loop body.

Figure 6 :Figure 8 :
Figure 6: (a) The right-loop body retimed MDFG of the MDWDF network.(b) The corresponding loop body of the right-loop body retimed MDFG.

Figure 8 (
b) shows that the full parallel model receives a significant improvement with up to 4.35× more efficiency than the full sequential one based on details given in Table

Table 1 :
) and the scheduled table can be traced back as listed in Table 1(b) at each iteration.As compared to the level of clock cycle in Table 1(a), analytical results listed in Table 1(b) have simply Schedule tables: (a) MDFG.(b) The retimed MDFG.

Table 3 :
Maximum CPU runtime for different types of model.