As technology advances, on-chip computing platforms are evolving from single-core bus-based systems to many-core network-based systems, which integrate a number of computing cores running in parallel and adopt an on-chip network providing concurrent pipelined communication. These many-core network-based systems are referred to as OLPCs in this paper.
Scalability is one of the most important features of homogeneous OLPCs. In homogeneous OLPCs, as the network size is scaled up, the network communication latency increases and becomes one of the most significant factors affecting system performance. Therefore, we first propose two abstract concepts: the equivalent serial packet and the equivalent serial communication.
The contributions of the paper are summarized as follows. (i) Since homogeneous OLPCs match data-parallel applications well and vice versa, our study exhibits a workable way to formulate and evaluate the speedup performance of data-parallel applications on homogeneous OLPCs before application programming and hardware design. (ii) Two abstract concepts, the equivalent serial packet and the equivalent serial communication, are proposed to quantify the network communication latency. (iii) Based on Amdahl's Law, we propose a performance model of homogeneous OLPCs for data-parallel applications (see Section ). (iv) A cycle-accurate homogeneous OLPC experimental platform is built, and three real data-parallel applications are mapped onto it to validate the effectiveness of the proposed performance model.
The rest of the paper is organized as follows. Section
The development of on-chip computation presents two trends. One is towards a growing number of processors integrated on a chip [
With respect to performance analysis, Amdahl’s Law [
Homogenous OLPCs are a suitable architecture for data-parallel applications and vice versa. Regularity and scalability are the key features of homogenous OLPCs. Figure
(a) Sketch map of homogenous OLPCs and (b) an example of data partitioning of data-parallel applications.
The problem we consider is the performance of homogeneous OLPCs for data-parallel applications, with a detailed analysis of the communication latency. The program running on an OLPC is divided into several subprograms running on different processor nodes. Each subprogram can be abstracted as a set of subtasks and communications (see Figure ). We make the following assumptions: the noncommunication time and the communication time of the subprogram assigned to each node are the same across nodes, that is, the subprogram on each node contains the same number of subtasks and communications; the execution time of every subtask is the same; and the time of every communication is the same.
The subprogram running on a processor node is abstracted as a set of subtasks and communications.
Figure
To facilitate the analysis, we first define a set of symbols in the Notations section.
Communication latency contains two parts: minimal (noncontention) latency and contention latency.
The minimal latency is determined by the distance between the two communicating nodes. We use the hop count to calculate this latency. Table
Calculation of Hop Count in
Uniform | Hotspot |
---|---|
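For a k × k 2D mesh, commonly used textbook estimates of the average hop count under the two traffic models are sketched below; these are standard results and may differ from the exact expressions used in the table above.

$$
h_{\text{uniform}} \approx \frac{2\,(k^{2}-1)}{3k} \approx \frac{2k}{3},
\qquad
h_{\text{hotspot}} \approx \frac{k^{2}-1}{2k} \approx \frac{k}{2},
$$

where the hotspot estimate assumes all traffic is directed to the central node of the mesh.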
The contention latency mainly depends on the behavior of the parallel applications running on OLPCs. In general, it is difficult to quantify the contention latency exactly: “when to communicate,” “which processor core starts a message passing,” and “where the destination is” all lead to different contention latencies. If no contention occurs, transmitting a packet over one hop takes 1 cycle (
Next, we establish the communication latency model in three steps.
With packet switching, the average time of transmitting a packet in the network is
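A minimal sketch of such an expression, consistent with the definitions above (the symbols are my own, not necessarily the paper's):

$$
T_{\text{packet}} = h \cdot t_{\text{hop}} + T_{\text{contention}},
$$

where $h$ is the average hop count, $t_{\text{hop}}$ is the time to forward a packet over one hop (1 cycle without contention), and $T_{\text{contention}}$ is the extra waiting time caused by network contention; the first product term is the minimal (noncontention) latency.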
In general, a communication issued by a processor node contains one or more packets, all launched by the same processor node. Packet transmissions may overlap. In the best case, each packet in a communication is launched one cycle after the preceding packet and is transmitted in the on-chip network without waiting for the completion of the preceding packet's transmission; the packet transmissions fully overlap. In the worst case, all packets are transmitted serially; that is, a packet is not transmitted until the previous one has finished. The overlap among packet transmissions improves performance by shortening the network communication latency.
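As a rough illustration of the two extremes, assuming a communication of $p$ packets and a one-cycle injection interval in the fully overlapped case (my notation):

$$
T_{\text{comm}}^{\text{best}} \approx T_{\text{packet}} + (p-1)\cdot 1\ \text{cycle},
\qquad
T_{\text{comm}}^{\text{worst}} = p \cdot T_{\text{packet}}.
$$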
To measure the time of a communication, we define an abstract concept: the equivalent serial packet.
Packets in a communication issued by a processor node.
From (
The program is parallelized on
Therefore, in order to quantify the network contention and measure the communication overhead of the entire program, we define an abstract concept: the equivalent serial communication.
If communications are concurrent but all fall within the same local area, resulting in a hotspot, the network contention becomes heavy, the total communication time of the program is longer, and hence the number of equivalent serial communications is larger. If communications are concurrent and uniformly distributed over the entire on-chip network, the network contention is light; in this case, the total communication time of the program is shorter and hence the number of equivalent serial communications is smaller. If communications are sequential, although the network contention is not heavy, the total communication time of the program is always long and hence the number of equivalent serial communications is large. In the best case, where all nodes are fully concurrent and there is no network contention, the number of equivalent serial communications equals the number of real communications in each node (
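In my notation ($n$ communications in the parallel part of the program, $N$ processor nodes, $n_{esc}$ equivalent serial communications), this description implies the bounds

$$
\frac{n}{N} \;\le\; n_{esc} \;\le\; n,
$$

with the lower bound for fully concurrent, contention-free communication and the upper bound for fully serialized communication.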
Communications in the entire program.
Examples of communications in a 3
From (
From (
Network contention is hard to quantify exactly. The concrete behavior of parallel applications leads to different traffic patterns, packet generation rates, and other factors, all of which influence the network contention. In this section, by introducing two abstract concepts, the equivalent serial packet and the equivalent serial communication, we model the network communication latency of the entire program.
In this subsection, inspired by Amdahl's Law, we establish the performance model for homogeneous OLPCs, incorporating the network communication latency. We elaborate the performance model under both the uniform and hotspot traffic models.
As with Amdahl's Law, we assume that the total problem size is fixed as the number of computing nodes increases. The parallel part of the program is sped up: the parallel work assigned to each processor node decreases as the system size increases. We thus obtain the performance model shown in the formula below:
By including (
The last product item in the denominator describes the communication overhead. If this item is ignored, (
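Since the formula itself is not reproduced above, here is a hedged sketch of a form consistent with this description, using symbols of my own choosing ($m$ serial subtasks, $n$ parallel subtasks/communications, $t_{sub}$ time per subtask, $N = k^{2}$ processor nodes, $n_{esc}$ equivalent serial communications, $T_{comm}$ time of one communication):

$$
S(N) \;=\; \frac{(m + n)\, t_{sub}}{\,m\, t_{sub} + \dfrac{n}{N}\, t_{sub} + n_{esc}\, T_{comm}\,}.
$$

Dropping the last product term in the denominator (the communication overhead) recovers the classical Amdahl form $S = (m+n)/(m + n/N)$.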
The behavior of parallel programs determines the communication patterns, affecting the value of
Assuming
Since
We take the limit of
Let
Let
From formula (
The extreme minimal value of
The OLPC hosts at least one processor core, so when
when
The ratio between the serial part and the parallel part in a program determines the upper limit of the speedup.
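For reference, the classical Amdahl bound implied by this statement, in the same hedged notation as above (serial-to-parallel ratio $f = m/n$):

$$
\lim_{N \to \infty} S(N) \;\le\; \frac{m + n}{m} \;=\; 1 + \frac{1}{f}.
$$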
As we can see,
Performance trends under uniform traffic model.
From the aforementioned formula transformation and Figure
The increase of the network size (
Both the incremental ratio and the limit of
As
The increase of
Assuming
As in Section
We take the limit of
Let
The extreme maximal value of
With Formulas (
Let
Because when
when
To further discuss the effect of
Although the increase of the network size (
Both the incremental/decremental ratio and the maximal value of
As
The increase of
Performance trends under hotspot traffic model.
The homogenous OLPC experimental platform.
Overall, the performance under the hotspot traffic model is worse than that under the uniform traffic model.
Based on the performance analysis in this subsection, we can draw the following conclusions.
In this section, we map three real data-parallel applications onto our cycle-accurate homogeneous OLPC experimental platform to validate and demonstrate the effectiveness of our performance analysis.
Figure
We use Wavefront Computation, Vector Norm, and the Block Matching Algorithm in Motion Estimation as application examples and perform experiments on various instances of the three applications. Wavefront Computation and Vector Norm are mostly used in wireless communication, computer vision, and image/video processing. The Block Matching Algorithm in Motion Estimation is one of the basic components of image/video processing.
Wavefront Computations are common in scientific applications. Given a matrix (see Figure
Two ways of data storing are realized to reflect the two traffic models. One is “
Both an integer matrix and a floating-point matrix are implemented to vary the noncommunication time:
The Wavefront Computation operates on a matrix of size 256 × 256, on the homogeneous OLPC with the network size varying from 1 × 1
(a) Wavefront Computation; (b) its parallelization.
Vector Norm is used to compute the magnitude (length) of a vector. Figure
Two ways of data storing are realized to reflect the two traffic models. One is “
Both the integer data type and the floating-point data type are implemented to vary the noncommunication time:
The
(a) Vector Norm and (b) its parallelization.
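As an illustrative sketch (not the paper's actual code) of the data-parallel structure described above, the following Python fragment mimics the partitioning: each simulated node squares and sums its own data subset (the parallel Step 1), and a single node then serially accumulates the partial sums and takes the square root (the serial part).

```python
import math

def vector_norm_parallel(data, num_nodes):
    """Illustrative sketch of the parallelization of Vector Norm.

    Step 1 (parallel part): each node computes the sum of squares of its
    own data subset. Step 2 (serial part): one node accumulates the
    partial sums and takes the square root.
    """
    # Partition the vector into one contiguous subset per node.
    chunk = (len(data) + num_nodes - 1) // num_nodes
    subsets = [data[i * chunk:(i + 1) * chunk] for i in range(num_nodes)]

    # Step 1: per-node partial sums of squares (conceptually concurrent).
    partial_sums = [sum(x * x for x in subset) for subset in subsets]

    # Step 2: serial accumulation and square root on a single node.
    return math.sqrt(sum(partial_sums))

if __name__ == "__main__":
    vec = [1.0, 2.0, 2.0]  # norm is 3.0
    print(vector_norm_parallel(vec, num_nodes=4))
```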
Motion Estimation is one of the important parts of the H.264/AVC standard, which aims to obtain high coding efficiency and good picture quality [
We also realize two ways of data storing to reflect the two traffic models. One is “
Only the integer data type is considered, since the data in image processing are integers.
We use a Search Window of size 128 × 128 (i.e., 16384 candidate reference blocks), on the homogeneous OLPC with the network size varying from 1 × 1
(a) Block Matching Algorithm in Motion Estimation and (b) its parallelization.
To compare our theoretical analysis with the real simulation results, we first estimate the theoretical speedups of the three applications.
The program of Wavefront Computation can be fully parallelized, thus
The subtask on each node is
For “
The program of Vector Norm is partially parallelized. The serial part consumes much time.
Step 1 is parallel. In Step 1, the subtask on each node is
For “
The Reference Frame has been computed and stored in on-chip local memories during the previous Motion Estimation. In the current Motion Estimation, the “Block Matching” processing does not start until the Current Block of the Current Frame has been transferred from the off-chip DRAM into the on-chip memory. The elapsed time of transferring the Current Block from the off-chip DRAM into the on-chip memory is the serial part of the Block Matching Algorithm. In our OLPC platform, the central PM node features an External Memory Interface connecting to the off-chip DRAM. The External Memory Interface reads a datum from the DRAM in 20 cycles, and the size of the Current Block is 16 × 16. Hence, for “
The subtask on each node is the comparison of the Current Block with a candidate Reference Block. The time of such a subtask (including computation and local memory references) is collected in our experiment:
For “
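For concreteness, a back-of-the-envelope estimate of this serial transfer time, assuming one DRAM read per pixel of the 16 × 16 Current Block:

$$
T_{\text{serial}} \approx 16 \times 16 \times 20 = 5120 \ \text{cycles}.
$$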
Then, using the Formula (
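Because the referenced formula is not reproduced above, the following Python sketch shows one illustrative way to evaluate the hedged speedup model sketched earlier under the two traffic models. The symbol names, the mesh hop-count estimates, and the contention assumptions (uniform traffic fully concurrent, hotspot traffic fully serialized at the central node) are my own simplifications, not the paper's exact formula or measured parameters.

```python
def avg_hops(k, traffic="uniform"):
    """Textbook average hop-count estimates for a k x k 2D mesh (an assumption,
    not necessarily the entries of the paper's hop-count table)."""
    if k == 1:
        return 0.0
    if traffic == "uniform":
        return 2.0 * (k * k - 1) / (3.0 * k)   # random node-to-node traffic
    return (k * k - 1) / (2.0 * k)             # hotspot: all traffic to the central node


def speedup(k, m, n, t_sub, packets_per_comm, traffic="uniform", t_hop=1.0):
    """Hedged speedup model: total work / (serial + parallel/N + communication).

    m, n             : subtask count in the serial / parallel part
    t_sub            : cycles per subtask (noncommunication time)
    packets_per_comm : packets issued per communication
    """
    N = k * k
    t_packet = avg_hops(k, traffic) * t_hop    # minimal per-packet latency
    t_comm = packets_per_comm * t_packet       # one communication, packets sent serially
    # Assumed contention behavior: uniform traffic stays fully concurrent
    # (best case, n/N equivalent serial communications); hotspot traffic
    # serializes at the central node (worst case, n equivalent serial comms).
    n_esc = n / N if traffic == "uniform" else n
    total = (m + n) * t_sub
    return total / (m * t_sub + (n / N) * t_sub + n_esc * t_comm)


if __name__ == "__main__":
    # Sweep mesh sizes 1x1 .. 8x8 and report the speedup curve and its peak.
    params = dict(m=10, n=10000, t_sub=50, packets_per_comm=4)
    for traffic in ("uniform", "hotspot"):
        curve = {k: round(speedup(k, traffic=traffic, **params), 2) for k in range(1, 9)}
        best_k = max(curve, key=curve.get)
        print(traffic, curve, "peak at", f"{best_k}x{best_k}")
```

With these illustrative parameters, the uniform-traffic speedup grows monotonically with the mesh size, whereas the hotspot-traffic speedup peaks at a finite mesh size and then decreases, mirroring the qualitative trends discussed in the analysis.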
The real speedups of the three applications are calculated based on the simulation results on our homogenous OLPC experimental platform (because the sequential part in the program of Vector Norm dominates, the performance improvement is limited).
The effect of network size on the performance reflects the scalability of homogenous OLPCs. Figures
Effect of traffic models: Wavefront Computation with (a) integer data type and (b) floating-point data type.
Effect of traffic models: Vector Norm with (a) integer data type and (b) floating-point data type.
Effect of traffic models: Block Matching Algorithm in Motion Estimation.
Effect of noncommunication/communication ratio: Wavefront Computation with (a) uniform traffic model and (b) hotspot traffic model.
Effect of noncommunication/communication ratio: Vector Norm with (a) uniform traffic model and (b) hotspot traffic model.
Figure
For the uniform traffic model, consistent with the theoretical speedup performance model, the real speedup increases as the network size is scaled up, no matter whether the data type is integer or floating-point. This is because the contention latency induced by uniform traffic is not enough to cancel the performance improvement introduced by the parallelization, although it can slow down the performance improvement.
Because the hotspot traffic model incurs heavy contention latency, the speedup increases while the network size is small but begins decreasing once the network size is scaled up beyond a certain finite value. Using (
Because the hotspot traffic model incurs much more network contention latency than the uniform traffic model, the speedup with the hotspot traffic model is smaller than that with the uniform traffic model for the same network size. The difference becomes larger as the network size increases.
Figure
For the same network factors, the theoretical and real speedups for the floating-point data type are higher than those for the integer data type. This is as expected, because when the noncommunication time increases, the portion of communication latency becomes less significant, thus achieving higher performance.
For the hotspot traffic model, the increase of the noncommunication/communication ratio shifts the optimal network size (
The target architectures and applications of our study are homogeneous OLPCs and data-parallel applications, respectively. Homogeneous OLPCs are on-chip computing platforms that integrate a number of computing cores running in parallel and an on-chip network providing concurrent pipelined communication, while data-parallel applications represent a wide range of applications whose data sets can be partitioned in parallel into data subsets handled individually by the same program running on different processor cores. Scalability is the common characteristic of both. Considering that homogeneous OLPCs and data-parallel applications match each other very well in nature, and hence that data-parallel applications can obtain good speedup on homogeneous OLPCs, the performance model proposed in this paper is applicable to homogeneous OLPCs running data-parallel applications.
The performance model is general for homogeneous OLPCs and a variety of data-parallel applications. For any particular application, a customized many-core platform such as an application-specific architecture or a hardware accelerator will be superior, but a NoC-based homogeneous OLPC is better than such custom-designed hardware when a variety of data-parallel applications share the same OLPC. A custom-designed many-core platform is application-specific and hence falls outside the scope of general homogeneous OLPCs. A GPU (Graphics Processing Unit) is such a hardware accelerator for graphics processing, as the name suggests. Although a GPGPU (General-Purpose GPU) exhibits generality to some extent by providing programmability, it is still specific, because the programmable GPU adopts a special structure for accelerating graphics-processing applications, and the interconnections in a GPGPU are specialized for operations such as stream processing and data shuffling that are common in graphics processing. Therefore, GPGPUs are not within the scope of homogeneous OLPCs. Besides data-parallel applications, there exist other applications that do not have the scalability feature, so the proposed model is not applicable to the performance analysis of those applications.
The proposed performance model is not suitable for many-core platforms in specific application areas or for applications without the characteristic of scalability. The purpose of the model is to offer a general but workable way to estimate and evaluate the performance of homogeneous OLPCs for data-parallel applications. Because the network communication latency and the way data-parallel applications store their data are two of the most significant factors affecting the performance of homogeneous OLPCs, the speedup performance model stresses and models both of them in detail. Consequently, processor behavior such as the cache hierarchy and cache misses is not considered. We assume that all of the data are moved from the external memory to the appointed on-chip memory regions of the different nodes before the system handles the data, and the performance is measured from the time when the system begins handling the data. Even if the data were fed continuously from the external memory, part of that latency could be hidden during data handling; since we emphasize analyzing the effect of the data storing ways, the model does not describe the situation in which data are moved from the external memory during processing.
Understanding the speedup potential that homogeneous OLPC computing platforms can offer is fundamental to the continual pursuit of higher performance. This paper has focused on analyzing the performance of homogeneous OLPCs for data-parallel applications. Because the enhancement of application performance in OLPCs may be restricted by the increasing network communication latency even as the number of cores increases, one main issue for the analysis is to properly capture the network communication. We first detailed a network communication latency model by proposing two abstract concepts (the equivalent serial packet and the equivalent serial communication
In the future, we plan to extend the performance model by considering the cache hierarchy, cache misses, and external memory accesses. Another direction is to study the effect of topologies and communication protocols on the performance models of homogeneous OLPCs.
Number of nodes in each dimension
The number of processor nodes,
The number of subtasks in the serial part of a program
The number of subtasks or communications in the parallel part of a program
The ratio between the serial part and the parallel part in a program,
The time of a communication
The execution time of a subtask, that is, noncommunication time
Average hop count of transmitting a packet
The time of transmitting a packet in one hop
Average time of transmitting a packet in the network
The number of packets in a communication
The number of equivalent serial packets in a communication
The number of equivalent serial communications in a program
The communication overhead of a program on OLPCs
Speedup
Maximal speedup
Minimal speedup.
The authors declare that there is no conflict of interests regarding the publication of this paper.
The research is partially supported by the Hunan Natural Science Foundation of China (no. 2015JJ3017), the Doctoral Program of the Ministry of Education in China (no. 20134307120034), and the National Natural Science Foundation of China (no. 61402500).