Utilizing heterogeneous platforms for computation has become a general trend, making portability an important issue. OpenCL (Open Computing Language) serves this purpose by enabling portable execution on heterogeneous architectures. However, unpredictable performance variation across platforms has become a burden for programmers who write OpenCL applications. This is especially true for conventional multicore CPUs, since the performance of general OpenCL applications on CPUs lags behind the performance of their counterparts written in conventional parallel programming models for CPUs. In this paper, we evaluate the performance of OpenCL applications on out-of-order multicore CPUs from the architectural perspective. We evaluate OpenCL applications on various aspects, including API overhead, scheduling overhead, instruction-level parallelism, address space, data location, data locality, and vectorization, comparing OpenCL to conventional parallel programming models for CPUs. Our evaluation reveals unique performance characteristics of OpenCL applications and provides insight into optimization metrics for better performance on CPUs.
Heterogeneous architectures have gained popularity, as can be seen from AMD’s Fusion and Intel’s Sandy Bridge [
Even though OpenCL provides portability across multiple architectures, portability issues still remain in terms of performance. Unpredictable performance variations on different platforms have become a burden for programmers who write OpenCL applications. The effective optimization technique differs depending on the architecture where the kernel is executed. In particular, since OpenCL shares many similarities with CUDA, which was developed for NVIDIA GPUs, many OpenCL applications are not well optimized for modern multicore CPUs. The performance of general OpenCL applications on CPUs lags behind what programmers expect from their experience with conventional parallel programming models: OpenCL applications often show very poor performance on CPUs compared to applications written in conventional programming models.
The reasons we consider CPUs as OpenCL compute devices are as follows. CPUs can be utilized to increase the performance of OpenCL applications by using both CPUs and GPUs (especially when a CPU would otherwise be idle). Because modern CPUs have more vector units, the performance gap between CPUs and GPUs has narrowed. For example, even for massively parallel kernels, CPUs can sometimes outperform GPUs, depending on input sizes. On workloads with high branch divergence or high instruction-level parallelism (ILP), the CPU can also be better than the GPU.
A major benefit of using OpenCL is that the same kernel can be easily executed on different platforms. With OpenCL, it is easy to dynamically decide which device to use at run-time. OpenCL applications that select a compute device between CPUs and GPUs at run-time can be easily implemented. However, if the application is written in OpenMP, for example, it is not trivial to split an application to use both CPUs and GPUs.
Here, we evaluate the performance of OpenCL applications on modern out-of-order multicore CPUs from the architectural perspective, that is, how the application utilizes the hardware resources of the CPU. We thoroughly evaluate OpenCL applications on various aspects that can change their performance. We revisit generic performance metrics that have been only lightly evaluated in previous work, especially for OpenCL kernels running on CPUs. Using these metrics, we also identify the current limitations of OpenCL and possible performance improvements. In summary, the contributions of this paper are as follows. We provide programmers with a guideline for understanding the performance of OpenCL applications on CPUs. Programmers can verify whether an OpenCL kernel fully utilizes the computing resources of the CPU. We discuss the effectiveness of OpenCL applications on multicore CPUs and possible improvements.
The main objective of this paper is to provide a way to understand OpenCL performance on CPUs. Even though OpenCL can be executed on both CPUs and GPUs, most previous work has focused only on GPU performance issues. We believe that our work improves the understanding of OpenCL on CPUs and helps programmers by reducing the overhead of implementing a separate CPU-optimized version from scratch. Some previous studies of OpenCL on CPUs discuss some of the aspects presented in this paper, but they lack both quantitative and qualitative evaluations, making them hard to use when programmers want to estimate the performance impact of each aspect.
Section
In this section, we describe the background of several aspects that affect OpenCL application performance on CPUs: API overhead, thread scheduling overhead, instruction-level parallelism, data transfer, data locality, and compiler autovectorization. These aspects have been emphasized in academia and industry to improve application performance on CPUs on multiple programming models. Even though most of the architectural aspects described in this section are well-understood fundamental concepts, most OpenCL applications are not written considering these aspects.
OpenCL has high overhead for launching kernels, which is negligible in other conventional parallel programming models for CPUs. In addition to the kernel execution on the compute device, OpenCL requires API function calls in the host code to coordinate the execution of kernels; these calls are pure overhead. The general steps of an OpenCL application are as follows [

1. Open an OpenCL context.
2. Create a command queue to accept the execution and memory requests.
3. Allocate OpenCL memory objects to hold the inputs and outputs for the kernel.
4. Compile and build the kernel code online.
5. Set the arguments of the kernel.
6. Set workitem dimensions.
7. Kick off kernel execution (enqueue the kernel execution command).
8. Collect the results.
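These steps map one-to-one onto host-side API calls. The following is a minimal sketch under stated assumptions: `src` holds the kernel source, the kernel is named `square`, and the helper name `run_square` is ours, not from the benchmarks; error checking is omitted for brevity.

```c
#include <CL/cl.h>
#include <stddef.h>

void run_square(const char *src, const float *host_in, float *host_out, size_t n)
{
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);

    /* Steps 1-2: context and command queue. */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

    /* Step 3: memory objects holding the kernel input and output. */
    cl_mem in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(float), NULL, NULL);
    cl_mem out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, NULL);
    clEnqueueWriteBuffer(q, in, CL_TRUE, 0, n * sizeof(float), host_in, 0, NULL, NULL);

    /* Step 4: online (JIT) compilation of the kernel source. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "square", NULL);

    /* Steps 5-7: kernel arguments, workitem dimensions, launch. */
    clSetKernelArg(k, 0, sizeof(cl_mem), &in);
    clSetKernelArg(k, 1, sizeof(cl_mem), &out);
    size_t global = n;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);

    /* Step 8: collect the results. */
    clEnqueueReadBuffer(q, out, CL_TRUE, 0, n * sizeof(float), host_out, 0, NULL, NULL);
}
```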
The complex steps of OpenCL applications result from the OpenCL design philosophy, which emphasizes portability across multiple architectures. Since the goal of OpenCL is to make a single application run on multiple architectures, its designers made the programming model as flexible as possible. Figure
OpenCL platform model.
By contrast, conventional parallel programming models for multicore CPUs do not need this flexibility for supporting various platforms. Many of the OpenCL APIs that take significant execution time in an OpenCL application have no counterpart in conventional parallel programming models. The compute device and the context are implicit in conventional programming models: users do not have to query the platform or compute devices or explicitly create a context.
Another example of the unique characteristics of OpenCL compared to conventional programming models is the “just-in-time compilation” [
Therefore, to determine the actual performance of applications, the time spent executing OpenCL API functions should also be considered. From our evaluation, we find that the API overhead is larger than the actual computation in many cases.
Unlike other parallel programming languages such as TBB [
OpenCL execution model.
It is common for OpenCL applications to launch a massive number of threads for their kernels, expecting speedup through parallel execution. However, the portability of OpenCL applications in terms of performance is not maintained across different architectures. In other words, an optimal decision on how to parallelize (partition) a kernel for GPUs does not usually guarantee good performance on CPUs. The partitioning decision of a kernel is made by changing
First, the number of workitems and the amount of work done by a workitem affect performance differently on CPUs and GPUs. A massive number of short workitems hurts performance on CPUs but helps performance on GPUs. The performance difference comes from the different architectural characteristics between CPUs and GPUs. On GPUs, a single workitem is processed by a scalar processor (
The number of workitems affects the instruction-level parallelism (
A modern superscalar processor executes more than one instruction concurrently by dispatching multiple independent instructions during a clock cycle to utilize the multiple functional units in CPUs. Superscalar CPUs use hardware that checks data dependencies between instructions at run-time and schedules instructions to run in parallel [
One of the performance problems of OpenCL applications on CPUs is that the kernel is usually written to exploit TLP, not ILP. The OpenCL programming model is an SIMT model, and it is common for an OpenCL application to have a massive number of threads. Since independent instructions computing different elements are separated into different threads, most instructions in a single workitem in the kernel depend on previous instructions, so most OpenCL kernels have low ILP within a workitem.
The second important component is the workgroup size. Workgroup size determines the amount of work in a workgroup and the number of workgroups of a kernel. On GPUs, a workgroup or multiple groups are executed on a streaming multiprocessor (
An OpenCL programmer can explicitly set the workgroup size or let the OpenCL implementation decide. If NULL is passed as the local work size, the OpenCL implementation chooses the workgroup size itself.
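For reference, the local work size is the sixth argument of clEnqueueNDRangeKernel; the `NULL` entries in the configuration tables later in this paper use this form. A minimal sketch, reusing the queue `q` and kernel `k` from the earlier setup sketch:

```c
size_t global = 1000000;  /* total number of workitems */
size_t local  = 256;      /* workitems per workgroup   */

/* Explicit workgroup size: the programmer partitions the NDRange. */
clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local, 0, NULL, NULL);

/* Implementation-chosen workgroup size: pass NULL for the local size. */
clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
```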
Many proposals to reduce the scheduling overhead by serialization have been presented [
In general, a parallel programming model can have two types of address space options: unified memory space and disjoint memory space [
On the contrary, even though it is harder for programmers to program, OpenCL exposes a disjoint memory space to programmers. This is because most heterogeneous computing platforms have disjoint memory systems, owing to the different memory requirements of different architectures. OpenCL assumes as its target a system where communication between the host and compute devices is performed explicitly over a system interconnect, such as PCI-Express. However, this assumption of discrete memory systems does not hold when we use CPUs as compute devices for kernel execution: the host and the compute device share the same memory system resources, such as the last-level cache, on-chip interconnect, memory controllers, and DRAM.
The drawback of disjoint memory address space is that it requires the programmer to explicitly manage data transfer between the host and compute devices for kernel execution. In common OpenCL applications, the data should be transferred back and forth in order to be processed by the host or the compute device [
One such rewriting effort is changing the memory allocation flag. OpenCL provides the programmer multiple options for memory object allocation flags when the programmer calls clCreateBuffer.
OpenCL also provides different APIs for data transfer between the host and compute devices. The host can enqueue commands to read data from an OpenCL memory object that is created by
Utilizing SIMD units has been one of the key performance optimization techniques for CPUs [
Various methods have been proposed to utilize the SIMD instruction: using optimized function libraries such as Intel IPP [
It is quite natural for programmers to expect that a programming model difference has no effect on compiler autovectorization on the same architecture. For example, if a kernel is written in both OpenCL and OpenMP and both implementations are written in a similar manner, programmers would expect that both codes are vectorized in a similar fashion, thereby giving similar performance numbers. Even though it depends on the implementation, this is not usually true. Unfortunately, today’s compilers are very fragile about vectorizable patterns, which depend on the programming model. Applications should satisfy certain conditions in order to fully take advantage of compiler autovectorization [
Where threads are placed can affect performance on modern multicore CPUs. Threads can be assigned to cores in different ways, and different placements can produce different performance. The performance impact of placement grows as the number of processors in the system increases.
The performance difference can occur for multiple reasons. For example, because of the different latencies of the interconnection network, threads that are far apart take longer to communicate with each other, whereas threads on adjacent cores communicate more quickly. Also, an application that requires data sharing among adjacent threads can benefit if these adjacent threads are assigned to nearby cores. Proper placement can also eliminate communication overhead by utilizing a shared cache. For this performance reason, most conventional parallel programming models support affinity, such as
Unfortunately, thread affinity is not supported in OpenCL. An OpenCL workitem is a logical thread, which is not tightly coupled with a physical thread even though most parallel programming languages provide this feature. The reason for the lack of this functionality is that the OpenCL design philosophy emphasizes portability over efficiency.
We identify the lack of affinity support as one of the performance limitations of OpenCL on CPUs compared to other programming languages for CPUs, and we present a potential solution to enhance OpenCL performance on CPUs: we found that thread affinity enables OpenCL applications to utilize caches better. An example is presented in Section
Given the preceding background on the anticipated effects of architectural aspects to understand the OpenCL performance on CPUs, the goal of our study is to quantitatively explore these effects.
The experimental environment for our evaluation is described in Table
Experimental environment.
CPUs | Intel Xeon E5645
---|---
# Cores | 4
Vector width | SSE 4.2, 4 single-precision FP
Caches | L1D/L2/L3: 64 KB/256 KB/12 MB
FP peak performance | 230.4 GFlops
Core frequency | 2.40 GHz
DRAM | 4 GB

GPUs | NVidia GeForce GTX 580
---|---
# SMs | 16
Caches | L1/Global L2: 16 KB/768 KB
FP peak performance | 1.56 TFlops
Shader clock frequency | 1544 MHz

O/S | Ubuntu 12.04.1 LTS
---|---
Platform | Intel OpenCL Platform 1.5 for CPU
Compiler | Intel C/C++ compiler 12.1.3
We use different applications for each evaluation. To verify the API overhead, we use the NVIDIA OpenCL Benchmarks [
List of NVidia OpenCL benchmarks for API overhead evaluation.
Benchmark
---
oclBandwidthTest, oclBlackScholes, oclConvolutionSeparable, oclCopyComputeOverlap, oclDCT8x8, oclDXTCompression, oclDeviceQuery, oclDotProduct, oclHiddenMarkovModel, oclHistogram, oclMatrixMul, oclMersenneTwister, oclMultiThreads, oclQuasirandomGenerator, oclRadixSort, oclReduction, oclSimpleMultiGPU, oclSortingNetworks, oclTranspose, oclTridiagonal, oclVectorAdd
Configurations of simple applications.
Benchmark | Kernel | Global work size | Local work size |
---|---|---|---|
Square | Square | 10000, 100000, 1000000, 10000000 | NULL |
Vectoraddition | vectoadd | 110000, 1100000, 5500000, 11445000 | NULL |
Matrixmul | matrixMul | 800 × 1600, 1600 × 3200, 4000 × 8000 | 16 × 16 |
Reduction | reduce | 640000, 2560000, 10240000 | 256 |
Histogram | histogram256 | 409600 | 128 |
Prefixsum | prefixSum | 1024 | 1024 |
Blackscholes | blackScholes | 1280 × 1280, 2560 × 2560 | 16 × 16 |
Binomialoption | binomialoption | 255000, 2550000 | 255 |
Matrixmul(naive) | matrixMul | 800 × 1600, 1600 × 3200, 4000 × 8000 | 16 × 16 |
Configurations of the Parboil benchmarks.
Benchmark | Kernel | Global work size | Local work size
---|---|---|---
CP | Cenergy | 64 × 512 | 16 × 8
MRI-Q | computePhiMag | 3072 | 512
MRI-FHD | RhoPhi | 3072 | 512
We use wall-clock execution time. To measure stable execution times without fluctuation, we iterate the kernel execution until the total execution time of the application reaches a sufficiently long running time, 90 seconds in our evaluation. This is long enough to include multiple kernel executions for all applications in our evaluation. From the measured time we calculate the average kernel execution time per kernel invocation, and we use normalized throughput to clearly present performance differences across sections.
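In our notation (introduced here for clarity), if $N$ kernel invocations complete in total time $T_{\mathrm{total}} \geq 90\,\mathrm{s}$, then

$$T_{\mathrm{kernel}} = \frac{T_{\mathrm{total}}}{N}, \qquad \mathrm{Throughput}_{\mathrm{norm}} = \frac{1/T_{\mathrm{kernel}}}{1/T_{\mathrm{kernel}}^{\mathrm{base}}} = \frac{T_{\mathrm{kernel}}^{\mathrm{base}}}{T_{\mathrm{kernel}}},$$

where the base configuration is the one each later section compares against.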
As we discussed in Section
Execution time distribution of kernel execution and auxiliary API functions.
For detailed analysis, we categorize the OpenCL APIs into 16 categories and group several categories together in the figures for visibility. Figure
Execution time distribution of each category of API function for
Figure
Execution time distribution of each category of API functions.
Execution time distribution of
The list of OpenCL kernels in the application is represented by the
JIT compilation overhead is another source of the API overhead. Figure
Execution time distribution of
In this section, we can see the high overhead of explicit context management (Section
It should be noted that the workload size for the evaluation in Section
Associated with the discussion in Section
Number of workitems for each application.
Benchmark | Base | 10x | 100x | 1000x |
---|---|---|---|---|
Square_1 | 10000 | 1000 | 100 | 10 |
Square_2 | 100000 | 10000 | 1000 | 100 |
Square_3 | 1000000 | 100000 | 10000 | 1000 |
Square_4 | 10000000 | 1000000 | 100000 | 10000 |
Vectoradd_1 | 110000 | 11000 | 1100 | 110 |
Vectoradd_2 | 1100000 | 110000 | 11000 | 1100 |
Vectoradd_3 | 5500000 | 550000 | 55000 | 5500 |
Performance of Square and Vectoraddition applications with different workload per workitem.
From Figure
Compared to CPUs, which have high overhead for handling many workitems, GPUs have low overhead for maintaining a large number of workitems, as our evaluation shows. Furthermore, reducing the number of workitems degraded performance on GPUs significantly, because we could then no longer take advantage of the many processing units on GPUs.
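To make the coalescing concrete, the following hypothetical variants of the Square kernel show the transformation; the actual benchmark kernels may differ in detail. The baseline launches one workitem per element; the coalesced version launches N/WORK workitems, each handling WORK elements (the "10x" configuration corresponds to WORK = 10).

```c
// Baseline: one element per workitem; N workitems are launched.
__kernel void square(__global const float *in, __global float *out) {
    int i = get_global_id(0);
    out[i] = in[i] * in[i];
}

// Coalesced: each workitem handles `work` consecutive elements,
// so only N/work workitems are launched.
__kernel void square_coalesced(__global const float *in,
                               __global float *out, int work) {
    int base = get_global_id(0) * work;
    for (int j = 0; j < work; ++j)
        out[base + j] = in[base + j] * in[base + j];
}
```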
One reason for the performance improvement from allocating more work per workitem is the reduced number of instructions. Figure
The number of dynamic instructions of Square and Vectoraddition applications with different workload per workitem including (L) instructions from OpenCL APIs and (R) kernel only.
For this evaluation, we implement a tool based on Pin [
The ratio of instructions from kernel over the instructions around clEnqueueNDRangeKernel for Square and Vectoraddition applications with different workload per workitem.
Figure
Performance of Parboil benchmarks with different workloads per workitem.
The number of dynamic instructions of Parboil benchmarks with different workload per workitem.
As we discussed in Section
To evaluate the ILP effect on both the CPU and the GPU, we implemented a set of compute-intensive microbenchmarks that share common characteristics. Every benchmark has an identical number of dynamic instructions and memory accesses. Each benchmark also has the same instruction mix, such as the ratio of branch instructions to total instructions. The only difference between the benchmarks is the ILP, which we vary via the number of independent instructions. Starting from the baseline implementation, we increase the number of operand variables so that the number of independent instructions increases. For example, in the four-way-ILP variant, four operand variables are updated independently instead of one.
Figure
Performance of ILP microbenchmark on the CPU and the GPU.
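A minimal sketch of the microbenchmark construction (our reconstruction, not the exact benchmark code): both kernels execute the same number of floating-point multiplies, but the second spreads them over four independent accumulators, so up to four multiplies can be in flight at once.

```c
// ILP 1: every multiply depends on the previous one (one chain).
__kernel void ilp1(__global float *out, float x) {
    float a = x;
    for (int i = 0; i < 4096; ++i)
        a = a * x;
    out[get_global_id(0)] = a;
}

// ILP 4: four independent dependence chains, same multiply count.
__kernel void ilp4(__global float *out, float x) {
    float a = x, b = x, c = x, d = x;
    for (int i = 0; i < 1024; ++i) {
        a = a * x; b = b * x; c = c * x; d = d * x;
    }
    out[get_global_id(0)] = a + b + c + d;
}
```

An out-of-order core can overlap the four chains of ilp4 on its multiple functional units, while ilp1 is limited by the latency of each dependent multiply.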
Associated with the discussion in Section
Workgroup size for each application.
Benchmark | Base | Case_1 | Case_2 | Case_3 | Case_4 |
---|---|---|---|---|---|
Square | NULL | 1 | 10 | 100 | 1000 |
Vectoraddition | NULL | 1 | 10 | 100 | 1000 |
Matrixmul | 16 × 16 | 1 × 1 | 2 × 2 | 4 × 4 | 8 × 8 |
Blackscholes | 16 × 16 | 1 × 1 | 1 × 2 | 2 × 2 | 2 × 4 |
Matrixmul(naive) | 16 × 16 | 1 × 1 | 2 × 2 | 4 × 4 | 8 × 8 |
Performance of applications with different workgroup size on CPUs and GPUs.
The benchmarks fall into three categories, depending on their behavior. The first category consists of
The left figure of Figure
(U) The number of dynamic instructions of Square, Vectoraddition, and naive implementation of Matrixmul with different workgroup size on CPUs. (L) The ratio of instructions from kernel over the instructions around clEnqueueNDRangeKernel for Square, Vectoraddition, and naive implementation of Matrixmul with different workgroup size.
Performance of Matrixmul with different workgroup size on CPUs and GPUs.
As we can see from Figure
The number of dynamic instructions of Matrixmul with different workgroup size on CPUs.
Performance of Blackscholes with different workgroup size on CPUs and GPUs.
Unlike other applications,
The number of dynamic instructions of Blackscholes with different workgroup size on CPUs.
Figure
Performance of Parboil benchmarks with different workgroup size on CPUs.
The number of dynamic instructions of Parboil benchmarks with different workgroup size on CPUs.
Here, we summarize our findings on thread scheduling for OpenCL applications.

- Allocating more work per workitem by manually coalescing multiple workitems reduces scheduling overhead on CPUs (Section
- High ILP increases performance on CPUs but not on GPUs (Section
- Workgroup size affects performance on both CPUs and GPUs. In general, a large workgroup size increases performance by reducing scheduling overhead on CPUs and enables high TLP utilization on GPUs. Workgroup size can also affect cache usage (Section
Associated with the discussion in Section
Different APIs for data transfer:
- explicit transfer;
- mapping.

Kernel access type when referenced inside a kernel:
- the kernel accesses the memory object as read-only/write-only;
- the kernel accesses the memory object as read/write.

Where to allocate a memory object:
- allocation on the device memory;
- allocation on the host-accessible memory on the host (pinned memory).
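The first dimension can be sketched as follows, assuming a buffer `buf` of `size` bytes on a command queue `q`; the variable names are ours. Explicit transfer copies between a host pointer and the memory object, while mapping hands the host a pointer into the memory object itself.

```c
/* Explicit transfer: the runtime copies host_ptr into the buffer,
   and copies the results back out after kernel execution. */
clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, size, host_ptr, 0, NULL, NULL);
/* ... enqueue and run the kernel ... */
clEnqueueReadBuffer(q, buf, CL_TRUE, 0, size, host_ptr, 0, NULL, NULL);

/* Mapping: the host writes through a pointer into the buffer itself;
   on a CPU device this can avoid the copy entirely. */
float *p = (float *)clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_WRITE,
                                       0, size, 0, NULL, NULL, NULL);
/* ... fill p with the input data ... */
clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);
/* ... run the kernel, then map again with CL_MAP_READ to read results ... */
```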
The throughput we present here includes the data transfer time between the host and compute devices, not just the kernel execution throughput on the compute device. For example, if the data transfer time between the host and the compute device equals the kernel execution time, the application throughput is half of the kernel-only throughput. The way we calculate the throughput of an application is illustrated in
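In our notation, for an application that processes $W$ bytes (or items) of data,

$$\mathrm{Throughput}_{\mathrm{app}} = \frac{W}{T_{h \to d} + T_{\mathrm{kernel}} + T_{d \to h}},$$

so when the total transfer time $T_{h \to d} + T_{d \to h}$ equals $T_{\mathrm{kernel}}$, the application throughput is half of the kernel-only throughput $W / T_{\mathrm{kernel}}$.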
We compare the performance of different data-transfer APIs on all possible allocation flags. (The combinations are as follows: (1) read-only/write-only memory object + allocation on the device; (2) read-only/write-only memory object + allocation on the host; (3) read-write memory object + allocation on the device; (4) read-write memory object + allocation on the host.) Figure
Normalized application throughput of mapping over explicit data transfer for all combinations on other dimensions. The performance of mapping APIs is superior to explicit data transfer on all possible combinations.
Different APIs change data transfer time. Figure
Normalized data transfer (host to device) throughput of mapping over explicit data transfer for all combinations on other dimensions.
Normalized data transfer (device to host) throughput of mapping over explicit data transfer for all combinations on other dimensions.
We also report the performance of Parboil benchmarks with different APIs for data transfer [
Data transfer time with different APIs for data transfer. (Left) host to device and (Right) device to host.
The difference in data transfer time is due to the different behaviors of the APIs. When the host code explicitly transfers data between the host and the compute device, the OpenCL run-time library must allocate a separate memory object for the device and copy the data between the memory object allocated by the
We also verify the performance effect of specifying a memory object as read-only/write-only or as read/write. Figure
Normalized application throughput of read-only/write-only memory objects over read/write memory objects for all combinations on other dimensions. There is no noticeable performance difference.
Finally, we also verify the performance effect of the allocation location of memory objects. Programmers can allocate the memory object on the host memory or the device memory. Figure
Normalized application throughput of the pinned memory over the device memory for all combinations on other dimensions. Where to allocate a memory object does not change the performance much on CPUs.
In this section, we find that the mapping APIs perform better than the explicit data transfer APIs: they reduce data transfer time by eliminating the copying overhead on CPUs. The allocation location and the kernel access type do not affect performance on CPUs.
We evaluate the possible effect of programming models on vectorization, even though vectorization is more a matter of compiler implementation. For the evaluation, we port the OpenCL kernels to OpenMP counterparts that perform identical computations, mapping the multiple workitems of an OpenCL kernel onto a loop. We use the Intel C/C++ 12.1.3 compiler and the Intel OpenCL Platform 1.5. Programmers would expect that running the same computation in OpenCL and OpenMP gives comparable performance. However, the results show that this assumption does not hold: for the evaluated benchmarks, the OpenCL kernels outperform their OpenMP counterparts. Figure
Performance impact of vectorization.
Noncontiguous memory access:
- four consecutive floats may be loaded directly from memory in a single SSE instruction, but if the four floats to be loaded are not consecutive, the load requires multiple instructions;
- loops with a nonunit stride are an example of the above scenario.

Data dependence:
- vectorization requires changes in the order of operations within a loop, since each SIMD instruction operates on several data elements at once;
- but such a change of order might not be possible due to data dependencies.
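Both conditions can be illustrated with small loops; these are illustrative examples, not code from the evaluated benchmarks.

```c
/* Vectorizable: unit stride and no cross-iteration dependence,
   so four floats can be processed per SSE instruction. */
for (int i = 0; i < n; ++i)
    c[i] = a[i] * b[i];

/* Hard to vectorize: nonunit stride, so the four floats needed by
   one SIMD operation are not consecutive in memory. */
for (int i = 0; i < n; i += 8)
    c[i] = a[i] * b[i];

/* Not vectorizable as written: c[i] depends on c[i - 1], so the
   order of operations inside the loop cannot be changed. */
for (int i = 1; i < n; ++i)
    c[i] = c[i - 1] + a[i];
```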
We evaluate the performance benefit of CPU affinity using OpenMP. We use
We use a simple application for this evaluation. Its aim is to verify the effect of binding threads to cores on cache utilization. Performance can improve when the OpenCL run-time library maps the logical threads of a kernel onto physical cores in a way that reuses the cached data of the previous kernel execution. The application we use consists of two kernels:
Table
Performance impact of CPU affinity.
Aligned
 | Core 0 | Core 1 | Core 2 | Core 3 | Core 4 | Core 5 | Core 6 | Core 7
---|---|---|---|---|---|---|---|---
Computation 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
Computation 2 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
Misaligned
 | Core 0 | Core 1 | Core 2 | Core 3 | Core 4 | Core 5 | Core 6 | Core 7
---|---|---|---|---|---|---|---|---
Computation 1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
Computation 2 | 6 | 3 | 4 | 0 | 2 | 1 | 7 | 5
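For comparison, the following is a minimal sketch of one common way to pin OpenMP threads to cores on Linux, using sched_setaffinity; this is illustrative and not necessarily the exact mechanism used in our runs. An identity map yields the aligned configuration in the table above; a permuted map yields the misaligned one.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <omp.h>

/* Pin OpenMP thread t to core map[t]. */
static void pin_threads(const int *map) {
    #pragma omp parallel
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(map[omp_get_thread_num()], &set);
        sched_setaffinity(0, sizeof(set), &set);  /* 0 = calling thread */
    }
}

/* Aligned: computation 2 reuses the cores of computation 1.
   Misaligned: a permutation like the one in the table above. */
static const int aligned[8]    = {0, 1, 2, 3, 4, 5, 6, 7};
static const int misaligned[8] = {6, 3, 4, 0, 2, 1, 7, 5};
```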
As we expect, the (a)
As the results show, even though OpenCL emphasizes portability, adding affinity support to OpenCL may provide a significant performance improvement in some cases. Hence, we argue that coupling logical threads with physical threads (cores on the CPU) is needed in OpenCL, especially for CPUs. The granularity for the assignment could be a workgroup; in other words, the programmer could specify the core where a specific workgroup is executed. This functionality would help improve the performance of OpenCL applications. For example, data from different kernels can be shared without an additional memory request if the programmer assigns workgroups to cores in consideration of the data sharing between kernels; the data can then be shared through the private caches of the cores.
Multiple research studies have been done on how to optimize OpenCL performance on GPUs. The GPGPU community provides TLP [
Several publications refer to the performance of OpenCL kernels on CPUs. Some focus on algorithms and some refer to the performance difference by comparing it with GPU implementation and OpenMP implementation on CPUs [
Ali et al. compare OpenCL with OpenMP and Intel’s TBB on different platforms [
Seo et al. discuss OpenCL performance implications for the NAS parallel benchmarks and give a nice overview of how they optimize the benchmarks by first getting an idea of the data transfer and scheduling overhead and then coming up with ways to avoid them [
One of the references that is very helpful to understand the performance behavior of OpenCL is a document from Intel [
We evaluate the performance of OpenCL applications on modern multicore CPU architectures. Understanding the performance in terms of architectural resource utilization is helpful for programmers. In this paper, we evaluate various aspects, including API overhead, thread scheduling, ILP, data transfer, data locality, and compiler-supported vectorization. We verify the unique characteristics of OpenCL applications by comparing them with conventional parallel programming models such as OpenMP. The key findings of our evaluation are as follows.

- OpenCL API overhead is not negligible on CPUs (Section
- Allocating more work per workitem, thereby reducing the number of workitems, helps performance on CPUs (Section
- Large ILP helps performance on CPUs (Section
- A large workgroup size is helpful for better performance on CPUs (Section
- On CPUs, mapping APIs perform better than explicit data transfer APIs. Memory allocation flags do not change performance (Section
- The programming model can affect compiler-supported vectorization. The conditions for code to be vectorized can be complex (Section
- Adding affinity support to OpenCL may help performance in some cases (Section
Our evaluation shows that, considering the characteristics of CPU architectures, OpenCL applications can be optimized further for CPUs, and programmers need to consider these insights to achieve portable performance.
The authors declare that there is no conflict of interests regarding the publication of this paper.
The authors would like to thank Jin Wang and Sudhakar Yalamanchili, Inchoon Yeo, the Georgia Tech HPArch members, and the anonymous reviewers for their suggestions and feedback. We gratefully acknowledge the support of the NSF CAREER award 1139083 and Samsung.