Efficiently exploiting SIMD vector units is one of the most important aspects in achieving high performance of the application code running on Intel Xeon Phi coprocessors. In this paper, we present several effective SIMD vectorization techniques such as less-than-full-vector loop vectorization, Intel MIC specific alignment optimization, and small matrix transpose/multiplication 2D vectorization implemented in the Intel C/C++ and Fortran production compilers for Intel Xeon Phi coprocessors. A set of workloads from several application domains is employed to conduct the performance study of our SIMD vectorization techniques. The performance results show that we achieved up to 12.5x performance gain on the Intel Xeon Phi coprocessor. We also demonstrate a 2000x performance speedup from the seamless integration of SIMD vectorization and parallelization.
The Intel Xeon Phi coprocessor is based on the Intel Many Integrated Core (Intel MIC) architecture, which consists of many small, power-efficient, in-order cores, each of which has a powerful 512-bit vector processing unit (SIMD unit). Key characteristics include 60 cores with 240 threads (4 threads/core) at 1.053 GHz, 1 TeraFLOP double-precision theoretical peak performance, 8 GB of memory with 320 GB/s bandwidth, a 512-bit wide SIMD vector engine, 32 KB L1 and 512 KB L2 cache per core, and fused multiply-add (FMA) support.
One Teraflop theoretical peak performance is computed as follows: 1.053 GHz × 60 cores × 8 double precision elements in SIMD vector × 2 flops per FMA. As such, any compute bound applications trying to achieve high performance on Intel Xeon Phi coprocessors need to exploit a high degree of parallelism and wide SIMD vectors. Using a 512-bit vector unit, 16 single precision (or 8 double precision) floating point (FP) operations can be performed as a single vector operation. With the help of the fused multiply-add (FMA) instruction, up to 32 FP operations can be performed at each core at each cycle. In comparison to the current 128-bit SSE and 256-bit AVX vector extensions, this new coprocessor can pack up to 8x and 4x the number of operations into a single instruction, respectively.
Wider SIMD vector units cannot be effectively utilized by simply extending the vectorizer for the Intel SSE and Intel AVX architectures. Consider the simple case of a scalar loop that executes fewer iterations than fit into a single 512-bit vector: a conventional vectorizer would leave the entire loop in scalar form.
Furthermore, architectural and microarchitectural differences between Intel Xeon Phi coprocessors and Intel Xeon processors necessitate that new compiler techniques be developed. This paper focuses on three SIMD vectorization techniques and makes the following contributions. First, we propose an extended compiler scheme to vectorize short trip-count loops and peeling and remainder loops, which are classified as “less-than-full-vector” cases, using the masking capability supported by the Intel MIC architecture. Second, we describe our data alignment strategies for achieving optimal performance through vectorization, as the Intel MIC architecture is much more demanding on memory alignment than the Intel AVX architecture. Third, we describe our 2-dimensional vectorization method, which goes beyond conventional loop vectorization for small matrix transpose and multiplication operations by fully utilizing the long SIMD vector units and the swizzle, shuffle, and masking support of the Intel MIC architecture.
The rest of this paper is organized as follows. We first describe the compiler support for loop and function vectorization, then present the three vectorization techniques, report the performance results, discuss related work, and conclude.
This section describes the Intel C/C++ and Fortran compiler support for the Intel Xeon Phi coprocessor at a high level with respect to loop vectorization and the translation and optimization of SIMD constructs. The compiler framework performs the following steps:
- Perform automatic loop analysis, and identify and analyze programmer-annotated functions and loops by parsing and collecting function and loop vector properties. In addition, the framework can apply interprocedural analysis and optimization with profiling and call-graph creation for automatic function vectorization.
- Generate vectorized function variants with properly constructed signatures via function cloning and vector signature generation.
- Vectorize SIMD loops that are identified by the compiler or annotated using SIMD extensions (#pragma simd can be used to vectorize outer loops), as well as cloned vector function bodies and all arguments, by leveraging and extending our automatic loop vectorizer.
- Enable classical scalar, memory, and loop optimizations and parallelization effectively, before or after loop and function vectorization, to achieve good performance.
SIMD vector compilation infrastructure for function and loop vectorization.
The Intel Xeon Phi coprocessor provides long (512-bit) SIMD vector hardware support for exploiting more vector-level parallelism. The long SIMD vector unit requires packing more scalar loop iterations into a single vector loop iteration, which also leaves more iterations in the peeling and/or remainder loops nonvectorized, because those iterations do not fill the full SIMD vector of the Intel MIC architecture (the less-than-full-vector case). For example, consider the short trip-count loop as shown in Algorithm
When the loop is vectorized for Intel SSE2 with vector length 4 (128-bit), the remainder loop has 3 iterations. When the loop is vectorized for the Intel MIC architecture with vector length 16 (512-bit), the remainder loop has 15 iterations; similarly, if the loop is unrolled by 16, the remaining 15 iterations are left in scalar execution form. Thus, vectorizing the peeling and remainder loops (and short trip-count loops in general) is very important for the Intel MIC architecture. This section describes how to apply vectorization, with masking support, to peeling and remainder loops using special guarding masks that prevent the SIMD code from exceeding the original loop and memory access boundaries. At a high level, the following steps describe our vectorization scheme without vectorization of the peeling and remainder loops:
s0: select alignment, vector length, and unroll factor.
s1: generate alignment setup code.
s2: compute the trip count of the peeling loop.
s3: emit the scalar peeling loop.
s4: generate the vector loop initialization code.
s5: emit the main vector loop.
s6: compute the trip count of the remainder loop.
s7: emit the scalar remainder loop.
Given the simple example as shown in Algorithm
On the Intel MIC architecture the vector length is 512 bits, which requires 64-byte alignment for efficient memory accesses. To achieve 64-byte aligned memory loads/stores, we need to pack 16 float (32-bit) elements for each single vector iteration and generate a peeling loop. Pseudocode
Note that we performed loop unrolling for the main vectorized loop, which allows the hardware to issue more instructions per cycle by hiding memory access latency and reducing branching. To enable “less-than-full-vector” (i.e., peeling loop, remainder loop, or short trip-count loop) vectorization, the loop vectorization scheme is extended as follows:
s0: select alignment, vector length, and unroll factor.
s1: generate alignment setup code.
s2: compute the trip count of the peeling loop; create two vectors of 16 elements for generating the peeling-loop mask.
s3: emit the vectorized peeling loop with masking operations.
s4: generate the main vector loop initialization code.
s5: emit the main vector loop.
s6: compute the trip count of the remainder loop; create two vectors of 16 elements for generating the remainder-loop mask.
s7: emit the vectorized remainder loop with masking operations.
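The masked execution of a less-than-full-vector iteration can be emulated in plain C for illustration. The compiler emits real mask registers and masked vector instructions; here the mask is an ordinary bit vector and the lanes are a scalar loop.

```c
/* Scalar emulation of masked "less-than-full-vector" execution: one
   vector iteration covers VL elements, and a guard mask keeps the last
   iteration from touching elements past the original loop bound n. On
   the actual hardware the mask would drive a masked vector store. */
#define VL 16

static void saxpy_masked(float a, const float *x, float *y, int n) {
    for (int i = 0; i < n; i += VL) {
        unsigned mask = (n - i >= VL) ? 0xFFFFu : ((1u << (n - i)) - 1u);
        for (int lane = 0; lane < VL; ++lane)
            if (mask & (1u << lane))   /* masked-out lanes are untouched */
                y[i + lane] = a * x[i + lane] + y[i + lane];
    }
}
```

For a trip count of 7, this executes a single masked vector iteration instead of 7 scalar iterations, which is exactly the ConvolutionFFT2D situation discussed below.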
Pseudocode
In the cases of short trip-count loop vectorization of peeling and remainder loops with runtime trip-count and alignment checking, loops are vectorized as efficiently as possible. These loops are vectorized with optimal vector lengths and an optimal amount of profitable unrolling regardless of a known loop trip count. This provides better utilization of SIMD vector hardware without sacrificing the performance of short loops. This scheme allows us to completely eliminate scalar execution of the loop in favor of masked SIMD vector code generation. Special properties of the mask are used to match unmasked code generation in most cases. For example, masked scalar memory loads that could be unsafe under an empty mask are considered safe under a remainder mask since it is never empty.
Without the capability of short trip-count loop vectorization, the loops in the ConvolutionFFT2D benchmark with 7 iterations and double-precision data type would end up in fully scalar execution. Applying vectorization with masking to these short trip-count loops results in a ~2x to ~5x speedup for the 7-iteration short trip-count (or less-than-full-vector) loops in the ConvolutionFFT2D benchmark on the Intel MIC architecture.
The Intel Xeon Phi coprocessor is much more sensitive to data alignment than the Intel Xeon E5 processor, so developing Intel MIC oriented alignment strategies and optimization schemes is one of the keys to achieving optimal performance. Similar to Intel SSE4.2, the SIMD load+op instructions require vector-size alignment, which is 64-byte alignment for the Intel MIC architecture. However, simple load/store instructions require the alignment information to be known at compile time on the Intel Xeon Phi coprocessor. Unlike prior Intel SIMD extensions, the Intel Initial Many Core Instruction (Intel IMCI) set has no special unaligned load/store instructions. This is overcome by using unpacking loads and packing stores that can deal with unaligned (element-aligned) memory locations. Due to their unpacking and packing nature, these instructions cannot be directly used for masked loads/stores, except under special circumstances. The faulting nature of masked memory access instructions in Intel IMCI adds extra complexity: such an instruction may address data outside paged memory and fault even if the actual data access is masked out. The exceptions are the gather/scatter instructions.
Therefore, the compiler aggressively performs data alignment optimizations using traditional techniques such as alignment peeling and alignment multiversioning.
Alignment peeling creates a preloop that executes several iterations on unaligned data in order to reach an aligned memory address. As a result, most iterations are executed using aligned SIMD operations. The preloop itself can be vectorized with masking as described above.
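The preloop trip count is just the distance to the next 64-byte boundary, measured in elements. A minimal sketch (the helper name is ours, and it assumes the pointer is at least element-aligned):

```c
#include <stdint.h>

/* Number of scalar peel iterations needed so that &p[peel] is 64-byte
   aligned. 64 bytes = one full 512-bit vector of floats. */
static int peel_count(const float *p) {
    uintptr_t mis = (uintptr_t)p & 63u;  /* bytes past the last 64B boundary */
    return mis ? (int)((64u - mis) / sizeof(float)) : 0;
}
```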
For unmasked unaligned (element-aligned) vector loads and stores, the compiler uses unpacking/packing load and store instructions. They are safe in this scenario and perform much better than gather/scatter instructions. If the compiler cannot prove the safety of the entire address range of a particular memory access, it inserts a zero-mask check in order to avoid a memory fault. All instructions with the same mask are emitted under a single check to avoid execution under the empty mask and to eliminate multiple checks of the same condition.
Unpacking and packing instructions may fault when used with a mask, as they may address masked-out invalid memory, and on-the-fly data conversion may fault even without masking. Thus, for unaligned masked and/or converting loads/stores, the compiler uses gather/scatter instructions instead for safety, even though this degrades performance. Memory faults can never happen if each memory access has at least one vector (64 bytes) of memory paged after its initial address. This can be achieved by padding each data section in the program and each dynamically allocated object with 64 bytes. For developers who are willing to do this padding to achieve optimal performance from masked code, the compiler knob -opt-assume-safe-padding was introduced. Under this knob, unaligned masked and/or converting load/store operations are emitted as unpacking loads/packing stores. In unmasked converting cases, as well as cases with peel/remainder masks, the compiler emits loads/stores directly; the mask works in this case since it is dense. For an arbitrary masking scenario, an unmasked load unpack instruction is used, which is safe due to the padding assumption, followed by a masked move (blend). The “nonempty-mask” check guarantees that the 64-byte padding is always enough for safety; that is, at least one item within the vector is to be loaded, so the tail end of the memory access is within 64 bytes of meaningful data.
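The padding contract a developer would uphold for -opt-assume-safe-padding can be sketched as an allocation wrapper; the wrapper name is our illustration, not part of the compiler interface.

```c
#include <stdlib.h>

/* Sketch of the safe-padding contract: allocate one extra vector
   (64 bytes) past the requested size, so a vector-wide access whose first
   element is in bounds can never reach unpaged memory. C11 aligned_alloc
   requires the size to be a multiple of the alignment, so round up. */
static void *alloc_safe_padded(size_t bytes) {
    size_t padded = ((bytes + 64 + 63) / 64) * 64;
    return aligned_alloc(64, padded);
}
```

The 64-byte alignment additionally lets the compiler use aligned full-vector loads and stores on the buffer.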
The safe-padding optimization has provided notable improvements on a number of benchmarks, for example, 10% gain on BlackScholes and selected Molecular Dynamics kernels.
Frequently seen in HPC workloads, operations on small matrices are a growing, profitable set of calculations for vectorization on Intel Xeon Phi coprocessors. With the wider SIMD unit support, the Intel C/C++ and Fortran compilers are enhanced to vectorize common operations on small matrices along 2 dimensions. Small matrices are matrices whose data can reside entirely in one or two 512-bit SIMD registers. Consider the example Fortran loop nest with 32-bit float (or real) type as shown in Algorithm
With nonunit stride references present in the inner loop of Algorithm
The vectorization approach is detailed below with vector intrinsic pseudocode. For visualization, Tables
Contents of vector register A_v512 after load.
Contents of vector register B_v512 after load.
Vector register contents after the first shuffle.
Vector register contents after the second shuffle.
Vector register contents after the third shuffle.
Vector register contents after the final shuffle.
Vector register contents after load with broadcast.
Vector register contents illustrating swizzle.
C_v512 vector register contains the elementwise product of t1_v512 and t2_v512.
t1_v512 vector register contents illustrating the final load with broadcast.
t2_v512 vector register contents illustrating the final swizzle.
Final C_v512 vector register contains the sum of the existing values of C_v512 and the elementwise products of t1_v512 and t2_v512.
First, the array data is loaded into vector registers. With the wider SIMD vector unit, the compiler is able to load the entire A and B matrices, each into a single vector register.
(a) Matrices A and B are loaded into two SIMD registers:
For more details see Table
For more details see Table
Next, the compiler optimizes the multiplication operation between matrix A and matrix B, through a series of data layout transformations and vector multiplication and addition operations. The compiler identifies a matrix multiplication in this loop and permutes the elements in matrix A and matrix B setting up simple vector multiplications and additions.
(b) We can simplify the multiplication through a transposition of the elements of A, followed by a multiply and add of each row of B with each row of the transposed A. We start by transposing the elements of A.
For more details see Table
For the transpose operation, we use the new Intel MIC _mm512_mask_shuf128x32() intrinsic. As on classic Intel architectures, this shuffle intrinsic is bound by the four 128-bit “lanes” in each vector register. Thus, the intrinsic takes an argument for the permutation pattern across the four 128-bit lanes, as well as a permutation pattern for the four 32-bit elements within each of those lanes. The arguments are as follows:
_m512 res = _mm512_mask_shuf128x32(_m512 v1, (I16) vmask, _m512 v2, (SI32) perm128, (SI32) perm32)
res: the result vector.
v1: the blend-to vector; its values are blended with the shuffled elements of v2 according to the write mask.
vmask: the write mask; a bit vector specifying which elements of v1 to overwrite with the shuffled elements of v2.
v2: the incoming data vector; it holds the elements to be shuffled.
perm128: the 128-bit lane permutation; it specifies the permutation order of the vector’s 128-bit lanes.
perm32: the elementwise permutation; it specifies the permutation order of the four 32-bit elements within each 128-bit lane.
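The described semantics can be emulated in plain C on a 16-float register image. The 2-bits-per-selector encoding of perm128 and perm32 below is an illustrative assumption, not the exact IMCI encoding.

```c
/* Emulation of the described mask/shuffle behavior on a 16-float register
   image: perm128 selects the source 128-bit lane for each destination lane
   (2 bits per lane), perm32 reorders the four 32-bit elements inside every
   lane, and vmask blends the shuffled v2 elements into v1. The 2-bit
   selector encodings are illustrative assumptions. */
static void shuf128x32_emu(float res[16], const float v1[16],
                           unsigned short vmask, const float v2[16],
                           unsigned perm128, unsigned perm32) {
    for (int lane = 0; lane < 4; ++lane) {
        int src_lane = (perm128 >> (2 * lane)) & 3;
        for (int e = 0; e < 4; ++e) {
            int src_e = (perm32 >> (2 * e)) & 3;
            int dst = 4 * lane + e;
            float shuffled = v2[4 * src_lane + src_e];
            res[dst] = ((vmask >> dst) & 1) ? shuffled : v1[dst];
        }
    }
}
```

With both permutation patterns set to the identity (0xE4, i.e. selectors 3,2,1,0) and a full write mask, the result is simply v2.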
For more details see Table
For more details see Table
For more details see Table
For more details see Table
After the elements of matrix A have been permuted through transposition, each element of A and B is now in the correct position within each vector unit for a vector product, resulting in the same behavior as the dot product of rows and columns.
(c) Next, we perform the multiplication of each row of the transposed A with each row of B, maintaining a sum of the products from row to row:
For more details see Table
Another useful intrinsic in this optimization is the Intel MIC _mm512_swizzle_ps() intrinsic. It is similar to the shuffle above, except that it only permutes the 32-bit elements within each 128-bit lane, applying the same pattern to every lane. The arguments are as follows:
_m512 res = _mm512_swizzle_ps(_m512 v1, SI32 perm)
res: the result vector.
v1: the incoming data vector to be permuted.
perm: the permutation pattern applied within each 128-bit lane.
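As with the shuffle, the swizzle can be emulated on a 16-float register image; the 2-bit-per-element encoding of perm is again an illustrative assumption (the real intrinsic takes named patterns such as _MM_SWIZ_REG_AAAA).

```c
/* Emulation of the described swizzle: one element permutation pattern is
   applied identically within each of the four 128-bit lanes; elements
   never cross lanes. The 2-bit selector encoding is an assumption. */
static void swizzle_ps_emu(float res[16], const float v1[16], unsigned perm) {
    for (int lane = 0; lane < 4; ++lane)
        for (int e = 0; e < 4; ++e)
            res[4 * lane + e] = v1[4 * lane + ((perm >> (2 * e)) & 3)];
}
```

For example, a pattern of all-zero selectors broadcasts element 0 of each lane across that lane, which is the coefficient broadcast used in the multiply steps below.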
For more details see Table
For more details see Table
Each subsequent multiplication must be accumulated for each row. These multiplications and additions are the corresponding dot product of rows and columns found in matrix multiplication, but because of the earlier transpose, no further permuting is required:
For more details see Table
For more details see Table
For more details see Table
After the simplified matrix multiplication, the loop further requires that the results be stored in the C matrix. With all elements correctly computed and residing in a single vector register, only one store operation is generated.
(d) Finally, the result vector unit of values is stored to the C array:
The 512-bit long SIMD vector unit of the Intel MIC architecture supports consuming both matrix dimensions for 2D vectorization, fitting an entire small matrix (4 × 4 float type) into one 512-bit SIMD vector register. This enables more efficient and flexible vectorization and optimization of small matrix operations. For example, a naive scalar version of the single-precision 4 × 4 matrix multiply executes 128 memory loads, 64 multiplies, 64 additions, and 16 memory stores. The small matrix 2D vectorization reduces this to 2 vector loads from memory, 4 multiplications, 4 shuffles, 4 swizzles, 3 additions, and 1 vector store to memory, a reduction of approximately 15x in the number of instructions.
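The algebra behind the instruction count can be checked with a scalar sketch: row i of C accumulates A[i][k] times row k of B, for k = 0..3. On the coprocessor all four rows of each matrix sit in one 512-bit register, so each k step collapses into one swizzle-broadcast of the A coefficients plus one vector multiply (with the accumulations covered by the 3 additions).

```c
/* Scalar sketch of the 2D-vectorized 4x4 multiply: C_row_i += A[i][k] *
   B_row_k. This loop nest is written row-wise to mirror the vector
   sequence; it is an illustration of the algebra, not the emitted code. */
static void matmul4x4(const float A[4][4], const float B[4][4],
                      float C[4][4]) {
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j)
            C[i][j] = A[i][0] * B[0][j];   /* first broadcast-and-multiply */
        for (int k = 1; k < 4; ++k)        /* three accumulation steps */
            for (int j = 0; j < 4; ++j)
                C[i][j] += A[i][k] * B[k][j];
    }
}
```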
This section presents the performance results measured on an Intel Xeon Phi coprocessor system using a set of workloads and microbenchmarks.
We have selected a set of workloads to demonstrate the performance benefits and importance of SIMD vectorization on the Intel MIC architecture. These workloads exhibit a wide range of application behavior that can be found in areas such as high performance computing, financial services, databases, image processing, searching, and other domains. These workloads include the following.
NBody computations are used in many scientific applications, such as astrophysics.
Convolution is a common image filtering computation used to apply effects such as blur and sharpen. For a given 2D image and a 5 × 5 spatial filter containing weights, this convolution computes the weighted sum for the neighborhood of the 5 × 5 set of pixels.
Back projection is commonly used for performing cone-beam image reconstruction of CT projection values.
The 1D convolution is widely used in applications such as radar tracking, graphics, and image processing.
In-memory tree-structured index search is a commonly used operation in database applications. This benchmark consists of multiple parallel searches over a tree with different queries, where the path through the tree is determined by comparing the query value with the node value at each tree level.
The detailed configuration of the Intel Xeon Phi coprocessor used for the performance study and for evaluating the effectiveness of the SIMD vectorization techniques is provided in the table below.
Target system configuration.
System parameters | Intel Xeon Phi coprocessor
---|---
Chips | 1
Cores/threads | 61/244
Frequency | 1 GHz
Data caches | 32 KB L1, 512 KB L2 per core
Power budget | 300 W
Memory capacity | 7936 MB
Memory technology | GDDR5
Memory speed | 2.75 GHz (5.5 GT/s)
Memory channels | 16
Memory data width | 32 bits
Peak memory bandwidth | 352 GB/s
SIMD vector length | 512 bits
All benchmarks were compiled as native executables using the Intel 13.0 product compilers and run on the Intel Xeon Phi coprocessor system specified in the table above.
The performance scaling is derived from OpenMP-only execution versus OpenMP with 512-bit SIMD vector execution on the Intel Xeon Phi coprocessor system described at the beginning of this section. When a workload contains 32-bit single-precision computations, up to 16-way vectorization may be achieved; when a workload contains 64-bit double-precision computations, up to 8-way vectorization may be achieved.
The figure below presents the performance results of these workloads.
Performance results of workloads.
To examine the impact of the less-than-full-vector loop vectorization, a simple microbenchmark was written with three small kernel functions.
The measured gains are shown in the figure below.
Performance gain with “less-than-full-vector” loop vectorization.
The same kernel loops were used to examine the impact of data alignment.
The resulting gains are shown in the figure below.
Performance gain with data alignment.
Small matrix operations such as addition and multiplication are important parts of many HPC applications. A number of classic compiler optimizations, such as loop complete unrolling, partial redundancy elimination (PRE), scalar replacement, and partial summation, have been developed to achieve optimal vector execution performance. Conventional inner- or outer-loop vectorization of the 3-level loop nests of 4 × 4 matrix operations does not perform well on the Intel Xeon Phi coprocessor, because it uses the 512-bit SIMD unit less effectively: for the 32-bit float data type, vectorizing either the inner or the outer loop yields only 4-way vectorization instead of 16-way vectorization, and it has side effects on classic optimizations such as partial redundancy elimination, partial summation, and operator strength reduction.
The performance impact is shown in the figure below.
Performance gain/loss with SIMD vectorization.
The classical loop optimizations are not as effective as in the single matrix multiplication case due to the transpose operation of matrix A.
Effectively exploiting the power of a coprocessor like the Intel Xeon Phi requires exploiting both thread- and vector-level parallelism. While the parallelization topic is beyond the scope of this paper, we would like to highlight that the SIMD vector extensions can be seamlessly integrated with threading models such as OpenMP.
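A minimal sketch of this integration is the combined OpenMP construct, which splits loop iterations across threads and vectorizes each thread's chunk; without an OpenMP-enabled compilation (-fopenmp/-qopenmp), the pragma is ignored and the loop simply runs serially with the same result.

```c
#include <stddef.h>

/* Combined threading + SIMD: "parallel for" distributes iterations across
   threads, "simd" vectorizes each thread's chunk. The kernel here is a
   generic scale-and-add chosen for illustration. */
static void scale_add(float *restrict y, const float *restrict x,
                      float a, size_t n) {
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```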
In the Mandelbrot workload, combining the OpenMP “parallel for” and “SIMD” constructs yields a large speedup over serial execution, as shown in the figure below.
OpenMP and SIMD performance speedup of the Mandelbrot workload.
Compiler vectorization technology has been studied extensively over the past three-plus decades, and a rich body of SIMD vectorization capabilities has been incorporated into a number of industry and research compilers. Compared to conventional loop vectorization, our techniques specifically target the long SIMD vector units and the masking, swizzle, and shuffle support of the Intel MIC architecture. In addition, programming language extensions such as OpenMP provide explicit vector programming capabilities to programmers.
Driven by the increasing prevalence of SIMD architecture in the Intel Xeon Phi coprocessor, we proposed and implemented new vectorization techniques to explore the effective use of its long SIMD units. This paper presented several practical SIMD vectorization techniques: less-than-full-vector loop vectorization, Intel MIC specific data alignment optimizations, and 2D vectorization of small matrix operations for the Intel Xeon Phi coprocessor. A set of workloads from several domains was employed to evaluate the benefits of our SIMD vectorization techniques. The results show that we achieved up to a 12.5x performance gain on the Intel Xeon Phi coprocessor. The Mandelbrot workload demonstrated the seamless integration of SIMD vector extensions with threading and showed a 2067.91x performance speedup with the combined use of the OpenMP “parallel for” and “SIMD” constructs using the Intel C/C++ compilers on an Intel Xeon Phi coprocessor system.
Intel C/C++ and Fortran compilers are highly enhanced for programmers to harness the computational power of Intel Xeon Phi coprocessors for accelerating highly parallel applications found in chemistry, visual computing, computational physics, biology, financial services, pixel, multimedia, graphics, and HPC applications by effectively exploiting the use of the Intel MIC architecture SIMD vector unit beyond traditional loop SIMD vectorization.
The authors declare that there is no conflict of interests regarding the publication of this paper.