On the Parallelization of Stream Compaction on a Low-Cost SDC Cluster

Many highly parallel algorithms generate large volumes of data containing both valid and invalid elements, and high-performance solutions to the stream compaction problem prove extremely important in such scenarios. Although parallel stream compaction has been extensively studied on GPU-based platforms and, more recently, on the Intel Xeon Phi platform, no study has yet considered its parallelization on a low-cost computing cluster, even though general-purpose single-board computing devices are gaining popularity among the scientific community due to their high performance per dollar and per watt. In this work, we consider the case of an extremely low-cost cluster composed of four Odroid C2 single-board computers (SDCs), showing that stream compaction can also obtain important speedups on this kind of platform. To do so, we derive two parallel implementations for the stream compaction problem using MPI. Then, we evaluate them considering a varying number of processes and/or SDCs, as well as different input sizes. In general, we see that unless the number of elements in the stream is too small, the best results are obtained when eight MPI processes are distributed among the four SDCs that make up the cluster. To add value to the obtained results, we also consider the execution of the two parallel implementations for the stream compaction problem on a very high-performance but power-hungry 18-core Intel Xeon E5-2695 v4 multicore processor, finding that the Odroid C2 SDC cluster constitutes a much more efficient alternative when both the resulting execution time and the required energy are taken into account. Finally, we also implement and evaluate a parallel version of the stream split problem, which also stores the invalid elements after the valid ones. Our implementation shows good scalability on the Odroid C2 SDC cluster and a more balanced computation/communication ratio than the stream compaction problem.


Introduction
Continuous improvements in the technologies used to build computers have recently made possible the fabrication of extremely low-cost general-purpose single-board computing devices. Nowadays, one can buy one of these tiny computers for a few dollars and make it run the Windows 10 or Ubuntu Linux operating systems [1,2]. Among the variety of vendors providing these single-board computers (SBCs), perhaps the most renowned ones are Raspberry Pi and Odroid. Although the initial aim of these devices was to promote the teaching of basic computer science in schools [3,4] and developing countries [5-7], the recent appearance of single-board computers with multicore ARM CPU chips and several gigabytes of main memory also provides a desirable hardware platform for the project-based learning paradigm in computer science and engineering education [8-11]. These devices have also attracted the interest of a multitude of projects trying to take advantage of their very low cost/performance ratio (e.g., for scientific computing [12-14]), in contrast with other energy-efficient but more expensive alternatives [15].
Whereas Raspberry Pi SBCs seem to focus more on a "stand-alone" scenario, Odroid devices provide a higher processor frequency, more main memory, and higher-bandwidth Ethernet capabilities. In particular, the Raspberry Pi 3 model B, launched in February 2016, features a 1.2 GHz, 4-core ARM Cortex-A53 CPU chip, 1 GB of main memory, and a 10/100 Ethernet port. Compared with its predecessor, the Raspberry Pi 2 model B released in February 2015, it adds wireless connectivity (2.4 GHz Wi-Fi 802.11n and Bluetooth 4.1). In contrast, the Odroid C2 sacrifices wireless connectivity in favor of a higher clock frequency (1.5 GHz, 4-core ARM Cortex-A53 CPU chip), larger main memory (2 GB), and a Gigabit Ethernet connection. These characteristics make Odroid C2 devices more appropriate for building high-performance low-cost clusters able to meet the demands of some scientific applications.
On the other hand, a common characteristic of many highly parallel algorithms is that they generate large volumes of data containing both valid and invalid elements. In these scenarios, high-performance solutions to the data reduction problem are extremely important. Stream compaction (also known as stream reduction) has been proposed to "compact" an input stream that mixes valid and invalid elements into a subset containing only the valid elements [16]. Stream compaction is thus found in many applications, ranging from data mining and machine learning (to prune invalid nodes after each parallel breadth-first tree traversal step [17]) to deferred shading (to obtain the subset of pixels whose rays intersect, which allows for better workload balancing among the participating threads [18,19]) or, more specifically, to speeding up dosimetric computations for radiotherapy based on Monte Carlo methods (where the computations for photons that run longer than others are compacted [20]) and to the voxelization of surfaces and solids [21].
Formally, given a list of elements $i_1, i_2, \ldots, i_n$ belonging to the set $I$ and a predicate function $F : I \to \{\text{true}, \text{false}\}$, stream compaction divides $I$ into valid and invalid elements (those that satisfy the predicate $F$ and those that do not) and keeps the relative order of all the valid elements in the output ($O$) [18]. As shown in Algorithm 1, the serial stream compaction of $I$ under the predicate function $F$ is $O = \{\, i \in I \mid F(i) = \text{true} \,\}$. Therefore, the output $O$ simply contains all valid elements copied from the input $I$. An example of the execution of Algorithm 1 can be observed in Figure 1. The list of input elements is composed of numbers between 0 and 4. The serial stream compaction selects all elements that are not zero (assuming that zero represents the invalid value), based on the predicate function $F$, as shown in the lower part of Figure 1. Although Algorithm 1 is simple, its parallelization is not trivial because the output position of each valid element cannot be obtained until all its preceding elements have been examined [22].
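The serial definition above can be illustrated with a minimal Python sketch. The function name `compact_serial` and the sample input are assumptions made for illustration. They are not taken from Algorithm 1 or Figure 1.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def compact_serial(stream: List[T], is_valid: Callable[[T], bool]) -> List[T]:
    """Copy the valid elements of `stream` to a new list, preserving their order."""
    output: List[T] = []
    for element in stream:
        # The predicate plays the role of F: keep the element only if F(i) is true.
        if is_valid(element):
            output.append(element)
    return output


if __name__ == "__main__":
    # Hypothetical input with values between 0 and 4; zero marks an invalid element.
    data = [3, 0, 1, 0, 0, 4, 2, 0]
    print(compact_serial(data, lambda x: x != 0))  # prints [3, 1, 4, 2]
```

The loop makes the sequential dependency visible: the output slot of each valid element depends on how many valid elements precede it.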
Parallel stream compaction has been extensively studied on GPU-based platforms [16,18,22-25], and more recently, parallel implementations for the Intel Xeon Phi processor have also been proposed [26]. In this work, we consider the case of an extremely low-cost cluster composed of four Odroid C2 single-board computers (SDCs), showing that stream compaction can also obtain important speedups on this kind of platform. To do so, we derive two parallel implementations for the stream compaction problem using MPI. Then, we evaluate them considering a varying number of processes and/or SDCs, as well as different input sizes. In general, we see that unless the number of elements in the stream is too small, the best results are obtained when 8 MPI processes are distributed among the 4 SDCs that make up the cluster.
This manuscript extends a preliminary version of this work [27] with the following two important additional contributions: (i) To highlight the importance of our study, we also consider the execution of the two parallel implementations for the stream compaction problem on a very high-performance but power-hungry 18-core Intel Xeon CPU E5-2695 v4. Overall, the obtained results show that the Odroid C2 SDC cluster constitutes an appealing alternative to a traditional high-end multicore processor in those contexts in which both low-cost and energy efficiency requirements are present. (ii) We derive a parallel version of the stream split problem, which appends the invalid elements to the output stream of valid elements. We evaluate it on the Odroid C2 SDC cluster, observing good scalability that leads to important speedups, and a better balance between computation and communication requirements than in the stream compaction problem.
The rest of the paper is organized as follows. The parallelization strategies that we have implemented and evaluated for the stream compaction problem are explained in Section 2. In Section 3, we give the details of the cluster of Odroid C2 SDCs used for the evaluation and then present the experimental results. The parallelization of the stream split problem and its results on the Odroid C2 SDCs are presented in Section 4. Finally, Section 5 draws some important conclusions from this work.

Parallel Stream Compaction on a Cluster of Odroid C2 SDCs
In this section, we present the two parallelization strategies that we have considered in this work. In both cases, we have implemented them using MPI [28].

Parallel Stream Compaction.
Our parallel stream compaction scheme, shown in Algorithm 2, is based on the implementation proposed in the Thrust library [29]. Its inputs are a vector of a given length, the predicate function, the number of processes, and the pid of each process. We have divided Algorithm 2 into four phases, namely: Validation phase (lines 4-8), Scan phase (lines 9-12), Communication phase (lines 13-21), and Scatter phase (lines 22-26). During the Validation phase, the input vector (I) is examined in parallel, and according to the predicate function, each process records the validity of each of its assigned elements in the array temp (where 1 represents a valid element and 0 an invalid one). The parallel Scan phase needs an additional array (scan) to compute the so-called prefix-sum [30-32], where each element is the sum of all its preceding elements excluding itself. In this way, each process obtains in parallel the number of valid elements (nvalid) in its portion of the stream. Following this, in the Communication phase, each process, identified by a pid, sends the number of valid elements that it has found to all the processes with higher pids. All the processes, except the first one, receive these counts and compute the position (pos) of the first of their valid elements. Finally, during the Scatter phase, based on the scan and temp arrays, all valid elements are transferred from the input array to the output one (I and O, resp.), preserving the order in which these elements appear in the input array. Figure 2 shows an example of an execution with four MPI processes for a list of input elements composed of numbers ranging between 0 and 4. In this case, the predicate function F selects all elements that are not zero. The input vector of 16 positions is divided among the four MPI processes (P0, P1, P2, and P3). All the processes carry out the Validation and Scan phases in parallel. The position (pos) computed by each process is shown below the vector scan. Finally, the output O is built taking into account the temp and scan vectors, as well as the previously computed pos.
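The four phases described above can be traced with a minimal single-process Python sketch in which a loop over pids stands in for the MPI processes. This is an illustration under stated assumptions, not the authors' MPI code: the helper `block_ranges`, the function name `compact_four_phase`, the block distribution, and the sample data are all hypothetical. The message exchange is replaced by summing the counts of the lower pids.

```python
from typing import Callable, List


def block_ranges(n: int, nprocs: int) -> List[range]:
    """Hypothetical helper: contiguous index ranges, one per simulated process."""
    base, extra = divmod(n, nprocs)
    ranges, start = [], 0
    for pid in range(nprocs):
        size = base + (1 if pid < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def compact_four_phase(stream: List[int], is_valid: Callable[[int], bool],
                       nprocs: int) -> List[int]:
    blocks = block_ranges(len(stream), nprocs)
    temp, scan, nvalid = [], [], []
    for block in blocks:
        # Validation phase: 1 marks a valid element, 0 an invalid one.
        flags = [1 if is_valid(stream[i]) else 0 for i in block]
        # Scan phase: exclusive prefix-sum of the local flags.
        local_scan, running = [], 0
        for flag in flags:
            local_scan.append(running)
            running += flag
        temp.append(flags)
        scan.append(local_scan)
        nvalid.append(running)
    # Communication phase (simulated): each pid adds up the counts that the
    # processes with lower pids would have sent to it.
    pos = [sum(nvalid[:pid]) for pid in range(nprocs)]
    # Scatter phase: write each valid element at pos + its local scan value.
    output = [0] * sum(nvalid)
    for pid, block in enumerate(blocks):
        for k, i in enumerate(block):
            if temp[pid][k] == 1:
                output[pos[pid] + scan[pid][k]] = stream[i]
    return output


if __name__ == "__main__":
    data = [1, 0, 3, 0, 0, 0, 2, 4, 4, 1, 0, 2, 0, 3, 0, 0]
    print(compact_four_phase(data, lambda x: x != 0, 4))
    # prints [1, 3, 2, 4, 4, 1, 2, 3]
```

In a real MPI program the `pos` computation would correspond to point-to-point sends and receives between ranks. The sketch keeps only the arithmetic.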

Parallel Work-Efficient Stream Compaction.
In [26], a work-efficient stream compaction algorithm is presented that aims to improve the computational complexity of the parallel stream compaction shown in Algorithm 2. Again using MPI, we have developed a parallel version of this work-efficient stream compaction, which we show in Algorithm 3. Now, during the Validation phase (lines 5-10), each process saves the validity of each element in the array scan and stores its number of valid elements in the vector V. Therefore, the additional array of integers (temp) needed in Algorithm 2 is no longer necessary. In the Communication phase (lines 11-26), all processes except the first one send their number of valid elements to the first process (the one with pid 0), which executes the inclusive prefix-sum on vector V [31], where each element is the sum of all its preceding elements including itself. Then, each position of the array V is sent back to the corresponding process. Following this, each process executes the Scan and Scatter phases. Figure 3 illustrates an example with four MPI processes for a list of elements ranging between 0 and 4 and a predicate function F that selects all elements that are not zero. As in the previous example, 4 input elements are assigned to each MPI process, and the Validation phase directly produces the validity of each element in vector scan together with the number of valid elements that each process finds. The latter is stored in vector V. Then, process P0 executes the inclusive prefix-sum on vector V and sends the output back to the rest of the processes, as indicated by the arrows in Figure 3. Finally, each process enters the Scan and Scatter phases taking into account the corresponding shifting value calculated by P0.
Excerpt of the algorithm listing (lines 16-18):
(16) if pid > 0 then
(17) for i ← 0 to pid in parallel do
(18) Receive
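The work-efficient scheme described in this subsection can be sketched in Python with simulated processes. The assumptions are ours: the function name `compact_work_efficient`, the block boundaries, and the sample data are hypothetical, the Scan and Scatter phases are fused into a running counter for brevity, and the messages to and from pid 0 are replaced by list operations.

```python
from typing import Callable, List


def compact_work_efficient(stream: List[int], is_valid: Callable[[int], bool],
                           nprocs: int) -> List[int]:
    n = len(stream)
    # Assumed block distribution: process p owns stream[bounds[p]:bounds[p + 1]].
    bounds = [(pid * n) // nprocs for pid in range(nprocs + 1)]
    # Validation phase: validity flags go straight into scan, counts into V.
    scan = [[1 if is_valid(x) else 0 for x in stream[bounds[p]:bounds[p + 1]]]
            for p in range(nprocs)]
    V = [sum(flags) for flags in scan]
    # Communication phase (simulated): pid 0 gathers V and runs an inclusive
    # prefix-sum, then process i would receive V[i - 1] as its shifting value.
    for i in range(1, nprocs):
        V[i] += V[i - 1]
    shift = [0] + V[:-1]
    # Scan and Scatter phases (fused here): a counter starting at the shift
    # gives the output position of each local valid element.
    output = [0] * V[-1]
    for p in range(nprocs):
        position = shift[p]
        for flag, element in zip(scan[p], stream[bounds[p]:bounds[p + 1]]):
            if flag:
                output[position] = element
                position += 1
    return output


if __name__ == "__main__":
    data = [1, 0, 3, 0, 0, 0, 2, 4, 4, 1, 0, 2, 0, 3, 0, 0]
    print(compact_work_efficient(data, lambda x: x != 0, 4))
    # prints [1, 3, 2, 4, 4, 1, 2, 3]
```

Compared with the four-phase scheme, only one integer per process travels in each direction, and the temp array disappears.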

Experiments
We have built a cluster composed of four Odroid C2 nodes. Each node contains a 1.5 GHz quad-core 64-bit ARM Cortex-A53 CPU and 2 GB of RAM. All the nodes are interconnected through a Gigabit Ethernet switch.
The operating system installed on each node is Ubuntu 16.04.02 LTS. On this cluster, we have installed MPICH (v3.2) as the MPI library implementation.
We have executed and measured the two parallelization strategies for stream compaction presented in Section 2 on this cluster. The baseline for all the comparisons is the sequential version of Algorithm 2 without the Communication phase. Moreover, we have configured different parallel execution scenarios for the two parallel versions of the stream compaction problem explained before. We consider parallel executions with 2, 4, 8, and 16 MPI processes, running on the same Odroid C2 board or on different boards (up to 4). We have chosen several input data sizes for our tests. In particular, we consider input arrays with 1M, 8M, 32M, and 64M integer elements ranging between 0 and 4. In all cases, the predicate function treats all nonzero numbers as valid. The 64M input set is the largest configuration that we could run given the 2 GiB limit that the Odroid SDC imposes. Figures 4(a), 4(b), 5(a), and 5(b) show the execution times (in milliseconds) that are observed for input data sizes of 1M, 8M, 32M, and 64M elements, respectively. In all these figures, from left to right, we first present the result obtained for the sequential version (Sequential), and then we show the results for the parallel stream compaction (Compaction) and parallel work-efficient stream compaction (Compaction-Shifted) parallelization strategies, respectively. For each one of them, we consider 2, 4, and 8
From Figure 4(a), we can see that the two proposed parallelization strategies for the stream compaction problem obtain noticeable speedups with regard to the sequential version when they are executed on a single Odroid C2 board with 2 or 4 MPI processes. However, the executions on different Odroid C2 boards are slower than the sequential version when the input array is this small (1M elements). The difference is that in the first case all communications take place on the same board and, therefore, can be performed with low latency. In contrast, when communications involve several Odroid C2 boards, the time required for communication outweighs the small processing time needed to compact such a small number of elements (the communication time constitutes between 65% and 92% of the execution time). Moreover, the executions on a single Odroid C2 SDC with 8 MPI processes (2 MPI processes per core) are also slower than the sequential version, revealing (as expected) that a configuration with more than one MPI process per core increases the communications, which represent up to 58% of the total time, and potentially slows computations. Taking a closer look at the results for one Odroid C2 board and the 1M input size, we see that the speedups of the parallel stream compaction strategy for 2 and 4 MPI processes are 1.68 and 1.64, respectively. Similarly, the work-efficient stream compaction parallelization strategy obtains speedups of 2.25 and 1.63 for 2 and 4 MPI processes, respectively. Therefore, in both cases, fewer MPI processes, and hence less communication among processes, bring the best results. These two parallel versions do not scale due to the small computation/communication ratio that they exhibit (approximately 4 and 2 for 2 and 4 processes, respectively, for both proposals), which decreases as the number of processes grows.
In general, from Figures 4(b), 5(a), and 5(b), we can see that as the input data size increases, so do the speedups obtained by the two parallelization strategies analyzed in this work when more cores are involved. The exception is the configuration with 16 processes running on 4 Odroid C2 boards (4 processes per board), which reaches lower speedups than the one with 8 processes running on 4 Odroid C2 boards (2 processes per board).
More specifically, Figure 4(b) shows the results for 8M elements. In this case, the two proposed parallelization strategies obtain significant speedups with regard to the sequential version when executed on a single Odroid C2 board with 2, 4, or 8 MPI processes. Additionally, the scalability is good for 2 and 4 MPI processes, with speedups of 1.69 and 2.06 for the parallel stream compaction strategy and 2.14 and 2.74 for the parallel work-efficient stream compaction approach. Therefore, for medium input data sizes, the computation/communication ratio is appropriate (approximately 100 and 7 for 2 and 4 processes, respectively). Although the two parallelization strategies also achieve gains for the configuration with 2 processes per core on a single Odroid C2 (8P-1C2, with speedups of 1.44 and 1.31, resp.), these speedups are (as expected) lower than those of the 4P-1C2 case. Running twice as many MPI processes as there are cores available introduces extra scheduling overhead and leads to a worse use of the cores' resources (such as caches). On the other hand, the executions on different Odroid C2 SDCs (except for 8P-4C2 and 16P-4C2) present important speedups and good scalability for 2, 4, and 8 MPI processes for the two proposed parallelization strategies.
Thus, the cluster copes well with the increase in the number of processes per Odroid C2, and the communication latency among the different boards of the cluster does not hamper performance. The performance differences between the two parallelization strategies start to appear in the 8P-4C2 case. Whereas the most efficient strategy (namely, Compaction-Shifted) achieves its highest speedup for this configuration, the other one cannot improve on the results reached by 4P-4C2, demonstrating its more limited scalability for medium-sized workloads. Finally, the large number of processes involved in 16P-4C2 results in excessively small computation/communication ratios, which is the reason for the poor results observed in both cases (the fraction of time due to communications reaches 87%).
As we can observe in Figures 5(a) and 5(b), larger input data sizes result in significant gains for the two parallel stream compaction strategies in all the configurations. For both input data sizes, Compaction and Compaction-Shifted obtain speedups that are close to those observed for the 8M element case when executed on a single Odroid C2 SDC with 2, 4, or 8 MPI processes. However, the resulting speedups become even more important as the number of involved cores grows. Moreover, they scale nicely for 2, 4, and 8 MPI processes, achieving their highest values for 8 MPI processes running on 4 Odroid C2 SDCs (5.10 and 5.06 for the parallel stream compaction and input data sizes of 32M and 64M, resp., and 6.96 and 7.04 for the parallel work-efficient stream compaction and input data sizes of 32M and 64M, resp.). It is also worth noting that even for these large input sizes, the results reached for the 16P-4C2 configuration are worse than those of the 8P-4C2 configuration in both cases, although the differences between them become narrower as the input data size increases.
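To put the reported 8P-4C2 speedups in perspective, the short Python sketch below divides each speedup by the number of processes. Parallel efficiency is a standard derived metric that we add for illustration. The text does not report it, and the function name `parallel_efficiency` is hypothetical.

```python
def parallel_efficiency(speedup: float, processes: int) -> float:
    """Fraction of the ideal linear speedup that is actually achieved."""
    return speedup / processes


# Speedups quoted above for 8 MPI processes on 4 Odroid C2 boards (8P-4C2).
reported = {
    ("Compaction", "32M"): 5.10,
    ("Compaction", "64M"): 5.06,
    ("Compaction-Shifted", "32M"): 6.96,
    ("Compaction-Shifted", "64M"): 7.04,
}

if __name__ == "__main__":
    for (strategy, size), speedup in reported.items():
        eff = parallel_efficiency(speedup, 8)
        print(f"{strategy:18s} {size}: efficiency = {eff:.2f}")
```

Under this metric, the quoted figures correspond to efficiencies between roughly 0.63 and 0.88, with the work-efficient variant closer to linear scaling.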

Energy Efficiency Results.
To give readers a more complete view that can help put our results in context, we also consider the case of executing the parallel stream compaction and parallel work-efficient stream compaction approaches described in Sections 2.1 and 2.2, respectively, on a conventional, high-performance multicore processor. In particular, we have considered a state-of-the-art Intel Xeon E5-2695 v4 multicore processor running at 2.10 GHz. This processor has 18 cores, and its price is approximately 8× that of the complete cluster. Our machine has a dual-socket configuration. The comparison between the 4 Odroid C2 cluster and the Intel Xeon takes into consideration the execution times of each version on each platform and the reported thermal design power (TDP) in each case (16 W for the Odroid C2 and 120 W for the Intel Xeon processor). We have measured the energy consumption of the Intel Xeon processor using RAPL [33]. Although the measured power stays below 120 W for the 1M input data size, it exceeds this TDP for the rest of the input data sizes. Therefore, we have used the TDP as an average estimate of the power consumption. Figures 6 and 7 show the total energy consumption (in joules) for parallel stream compaction and parallel work-efficient stream compaction, respectively. Again, results for input data sizes of 1M, 8M, 32M, and 64M elements are reported. In both figures, we show the results for 2, 4, 8, and 16 MPI processes running on 4 Odroid C2 boards (having 1, 2, or 4 processes per board as appropriate (2P-OC2, 4P-OC2, 8P-OC2, and 16P-OC2, resp.)) and running on the Intel Xeon using 2, 4, 8, and 16 cores (2P-Xeon, 4P-Xeon, 8P-Xeon, and 16P-Xeon, resp.).
In the figures, we can see that the trend observed for the low-cost SDC cluster does not hold in the case of the Intel Xeon, for which the best results are obtained with 16 cores. The fact that, in this case, all communications occur on the same chip significantly reduces the overhead of involving a larger number of cores in the computation.
This is also evidenced by the fact that speedups are obtained even for the small problem sizes. However, even though the computational power of the Intel Xeon is much greater than that of the Odroid C2 cluster, the very large TDP of the Intel Xeon penalizes its results when energy efficiency is also considered. In particular, the best results for the Odroid C2 cluster (obtained when 2 processes run on 4 boards) clearly outperform those achieved when 16 processes are executed using 2 Intel Xeon chips, demonstrating that the Odroid C2 SDC cluster constitutes an appealing alternative to a traditional high-end multicore processor in those contexts in which both low-cost and energy efficiency requirements are present.
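The energy comparison in this subsection rests on multiplying a TDP by an execution time. The Python sketch below shows that arithmetic under explicit assumptions: it treats the quoted 16 W and 120 W figures as the power of each complete platform, which the text does not state precisely given the four boards and the dual-socket configuration. The execution times are hypothetical placeholders, not measured results, and both function names are ours.

```python
def energy_joules(tdp_watts: float, time_ms: float) -> float:
    """Energy estimate: power (W) times execution time (converted from ms to s)."""
    return tdp_watts * time_ms / 1000.0


def break_even_slowdown(tdp_low: float, tdp_high: float) -> float:
    """How many times slower the low-power platform may run before it
    consumes as much energy as the high-power one."""
    return tdp_high / tdp_low


if __name__ == "__main__":
    ODROID_TDP_W = 16.0   # figure quoted in the text; scope assumed (see above)
    XEON_TDP_W = 120.0    # figure quoted in the text
    # Hypothetical execution times in milliseconds, for illustration only.
    odroid_time_ms, xeon_time_ms = 400.0, 100.0
    print(energy_joules(ODROID_TDP_W, odroid_time_ms))      # prints 6.4
    print(energy_joules(XEON_TDP_W, xeon_time_ms))          # prints 12.0
    print(break_even_slowdown(ODROID_TDP_W, XEON_TDP_W))    # prints 7.5
```

With these assumed power figures, the low-power platform remains the more energy-efficient option as long as it is less than 7.5 times slower.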

Extension to Stream Split
There are some applications, for example, radix sort [34] or random forest-based data classifiers [35], in which the invalid elements need to be appended to the end of the output stream, after the valid elements. This is the so-called stream split problem. In this work, we have also developed a parallel solution to the stream split problem, which we present in Algorithm 4. The stream split algorithm is very similar to the parallel work-efficient stream compaction version presented in Algorithm 3. The main differences are that now the first process must send the number of valid elements to all the processes with higher pids (line 23), the different processes (except the first one) receive the number of valid elements (line 27), and if an element is invalid, it would be [...]
[...] show the execution times (in milliseconds) that are observed for input data sizes of 1M, 8M, 32M, and 64M elements, respectively. In all these figures, from left to right, we first present the results obtained for the sequential version (Sequential), and then we show the results for the parallel stream split (Split). For each one of them, we consider 2, 4, and 8 MPI processes running on one Odroid C2 board (2P-1C2, 4P-1C2, and 8P-1C2, resp.) and on 2 Odroid C2 boards, having 1, 2, and 4 MPI [...] 9(a), and 9(b), we can see that the trend observed for the different input sizes is very similar to that already explained for the Compaction and Compaction-Shifted proposals. The exception is the configuration with 16 processes running on 4 Odroid C2 boards (4 processes per board) for input sizes from 8M to 64M elements, which reaches higher speedups than the rest of the configurations and than those observed in the two previous approaches. Speedups of 4.74, 6.66, and 8.36 with regard to the sequential version are achieved for 8M, 32M, and 64M elements, respectively. Now, the increase in computation due to the storage of the invalid elements compensates for the communication requirements, and significant speedups are obtained.
Excerpt of the algorithm listing (lines 10-21):
(10) end for
(11) if pid > 0 then
(12) Send V[pid] to process pid 0
(13) end if
(14) if pid = 0 then
(15) for i ← 1 to npid do
(16) Receive
(19) for i ← 1 to npid do
(20) Send V[i − 1] to process pid i
(21) end for
More specifically, the scalability is good for 2, 4, and 8 MPI processes running on 2 Odroid C2 boards, with speedups of 1.52, 1.75, and 2.08 for the parallel stream split approach for 8M elements. Therefore, for medium input data sizes, the computation/communication ratio is appropriate. Likewise, the scalability is good for 2, 4, and 8 MPI processes for larger input data sizes. For example, for 64M elements, the speedups achieved are 2.25, 3.63, and 4.52 for 2, 4, and 8 MPI processes executing on 2 Odroid C2 boards. Moreover, the scalability is good for 2, 4, 8, and 16 MPI processes running on 4 Odroid C2 boards, with speedups of 1.47, 2.76, 3.47, and 4.74 for medium input sizes (8M elements). Large input sizes follow a similar trend, reaching, for instance, 2.25, 4.39, 7.02, and 8.36 for 64M elements.
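The stream split described in this section can be sketched in Python with simulated processes, reusing the structure of the work-efficient compaction. This is an illustration under our own assumptions, not Algorithm 4 itself: the function name `split_parallel`, the block boundaries, and the way each process derives the position of its first invalid element from the total number of valid elements are hypothetical choices.

```python
from typing import Callable, List


def split_parallel(stream: List[int], is_valid: Callable[[int], bool],
                   nprocs: int) -> List[int]:
    n = len(stream)
    # Assumed block distribution: process p owns stream[bounds[p]:bounds[p + 1]].
    bounds = [(pid * n) // nprocs for pid in range(nprocs + 1)]
    flags = [[1 if is_valid(stream[i]) else 0
              for i in range(bounds[p], bounds[p + 1])] for p in range(nprocs)]
    # Per-process valid counts, then the inclusive prefix-sum computed by pid 0.
    V = [sum(f) for f in flags]
    for i in range(1, nprocs):
        V[i] += V[i - 1]
    # Extra step with respect to compaction: every process learns the total
    # number of valid elements, which is where the invalid region starts.
    total_valid = V[-1]
    output = [0] * n
    for p in range(nprocs):
        valid_pos = V[p - 1] if p > 0 else 0
        # Invalid elements before this block = elements before it minus valid ones.
        invalid_pos = total_valid + (bounds[p] - valid_pos)
        for k, flag in enumerate(flags[p]):
            element = stream[bounds[p] + k]
            if flag:
                output[valid_pos] = element
                valid_pos += 1
            else:
                output[invalid_pos] = element
                invalid_pos += 1
    return output


if __name__ == "__main__":
    # Odd numbers are treated as valid here so that the order of the invalid
    # elements is visible in the result.
    print(split_parallel(list(range(10)), lambda x: x % 2 == 1, 4))
    # prints [1, 3, 5, 7, 9, 0, 2, 4, 6, 8]
```

Every input element is written exactly once, which is consistent with the observation above that stream split performs more computation per communicated value than compaction.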

Conclusions
In this work, we have studied the parallelization of the stream compaction problem on a low-cost cluster of single-board computers. In particular, we have built the low-cost cluster from 4 Odroid C2 SDCs interconnected through a typical Gigabit Ethernet switch. We have implemented two parallel versions of the stream compaction problem using MPI. Then, we have evaluated them considering a varying number of processes and/or SDCs, as well as different input sizes. In general, we see that when the number of elements in the stream is too small, the most important benefits are observed when all participating processes are on the same Odroid board. In this case, the low computation/communication ratio for a small number of input elements cannot make up for the overhead entailed by the inter-SDC communications. As the number of elements in the input stream increases, so does the number of processes that can profitably participate in parallel executions, and important speedups are reached. Overall, the best results are reached when eight MPI processes are distributed among the four SDCs that make up the cluster. In this case, speedups of up to 5.10 and 7.04 are obtained for the Compaction and Compaction-Shifted strategies, respectively, for the largest problem sizes considered in this work (input data sizes of 32M and 64M). Moreover, to add value to the obtained results, we also consider the execution of the two parallel implementations for the stream compaction problem on a very high-performance but power-hungry 18-core Intel Xeon E5-2695 v4 multicore processor, finding that the Odroid C2 SDC cluster constitutes a much more efficient alternative when both the resulting execution time and the required energy are taken into account. Finally, the parallelization of the stream split problem is implemented and evaluated on the Odroid C2 SDC cluster. In this case, for input data sizes starting from 8M elements, important speedups are achieved and the computation/communication ratio is more balanced due to the storage of the invalid elements. In summary, the best results are obtained for the configuration of 16 MPI processes running on 4 Odroid C2 boards. In this case, speedups of 6.66 and 8.36 are reached for input data sizes of 32M and 64M elements, respectively.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.