Effective SIMD Vectorization for Intel Xeon Phi Coprocessors

Efficiently exploiting SIMD vector units is one of the most important aspects in achieving high performance of the application code running on Intel Xeon Phi coprocessors. In this paper, we present several effective SIMD vectorization techniques such as less-than-full-vector loop vectorization, Intel MIC specific alignment optimization, and small matrix transpose/multiplication 2D vectorization implemented in the Intel C/C++ andFortran production compilers for Intel XeonPhi coprocessors. A set ofworkloads from several application domains is employed to conduct the performance study of our SIMD vectorization techniques. The performance results show that we achieved up to 12.5x performance gain on the Intel Xeon Phi coprocessor. We also demonstrate a 2000x performance speedup from the seamless integration of SIMD vectorization and parallelization.


Introduction
The Intel Xeon Phi coprocessor is based on the Intel Many Integrated Core (Intel MIC) architecture, which consists of many small, power efficient, in-order cores, each of which has a powerful 512-bit vector processing unit (SIMD unit) [1].It is designed to serve the needs of applications that are highly parallel, make extensive use of SIMD vector operations, or are memory bandwidth bound.Hence, it is targeted for highly parallel, high performance computing (HPC) workloads [2] in a variety of fields such as computational physics, chemistry, biology, and financial services [3].The Intel Xeon Phi Coprocessor 5110P has the following key specifications: (i) 60 cores, 240 threads (4 threads/core), (ii) 1.053 GHz, (iii) 1 TeraFLOP double precision theoretical peak performance, (iv) 8 GB memory with 320 GB/s bandwidth, (v) 512 bit wide SIMD vector engine, (vi) 32 KB L1, 512 KB L2 cache per core, (vii) fused multiply-add (FMA) support.
One Teraflop theoretical peak performance is computed as follows: 1.053 GHz × 60 cores × 8 double precision elements in SIMD vector × 2 flops per FMA.As such, any compute bound applications trying to achieve high performance on Intel Xeon Phi coprocessors need to exploit a high degree of parallelism and wide SIMD vectors.Using a 512-bit vector unit, 16 single precision (or 8 double precision) floating point (FP) operations can be performed as a single vector operation.With the help of the fused multiply-add (FMA) instruction, up to 32 FP operations can be performed at each core at each cycle.In comparison to the current 128-bit SSE and 256-bit AVX vector extensions, this new coprocessor can pack up to 8x and 4x the number of operations into a single instruction, respectively.
Wider SIMD vector units cannot be effectively utilized by simply extending the vectorizer for Intel SSE and Intel AVX architecture.Consider the following simple example.There exists a scalar loop that executes N-iterations.Using the vector length of VL, a vector loop would execute floor (N/VL) full vector iterations followed by N mod VL scalar remainder iterations.Unless  is sufficiently larger than VL, executing N mod VL scalar iterations can still be a significant portion of the vector execution of such a loop.In what follows, we  will discuss two approaches in handling such "less-than-fullvector" situations: the first technique is masked vectorization and the second technique is small matrix optimization and 2-dimensional (2D) vectorization.Furthermore, architectural or microarchitectural differences between Intel Xeon Phi coprocessors and Intel Xeon processors necessitate that new compiler techniques be developed.This paper focuses on three SIMD vectorization techniques and makes the following contributions.
(i) We propose an extended compiler scheme to vectorize short trip-count loops and peeling and remainder loops that are classified as "less-than-full-vector" cases, with a masking capability supported by the Intel MIC architecture.
(ii) We describe our specific data alignment strategies for achieving optimal performance through vectorization, as the Intel MIC architecture is much more demanding on memory alignment than the Intel AVX architecture [4].
(iii) We describe our 2-dimensional vectorization method which is beyond the conventional loop vectorization for small matrix transpose and multiplication operations by fully utilizing long SIMD vector units, swizzle, shuffle, and masking support on the Intel MIC architecture.
The rest of this paper is organized as follows: Section 2 provides a high-level overview of Intel C/C++ and Fortran compilers.In Section 3, the compiler details of "less-thanfull-vector" loop vectorization are described and discussed.Specific data alignment strategies for the Intel Xeon Phi coprocessor and the schemes of performing data alignment optimization are discussed in Section 4. Section 5 presents the 2D vectorization methods for small matrix transpose and multiplication.Section 6 discusses related work.Section 7 provides the performance results with a set of workloads and microbenchmarks.Section 8 concludes the paper.

Compiler Architecture for Vectorization
This section describes the Intel C/C++ and Fortran compiler support for the Intel Xeon Phi coprocessor at a high level with respect to loop vectorization and the translation and optimization of SIMD vector extensions [5][6][7].The compiler translates serial C/C++ and Fortran code via automatic loop analysis or based on annotations using the SIMD pragma and vector attributes into SIMD instruction sequences.The compilation process is amenable to many optimizations such as loop parallelization, memory locality optimizations, classic loop transformations and optimizations, redundancy elimination, and dead code elimination before and after the loop/function vectorization.Figure 1 depicts the SIMD compilation infrastructure of the Intel C/C++ and Fortran compilers for automatic loop vectorization and compiling SIMD pragma, vector function annotations, and associated clauses.The framework consists of four major parts.
(i) Perform automatic loop analysis and identify and analyze programmer annotated functions and loops by parsing and collecting function and loop vector properties.In addition, our compiler framework can apply interprocedural analysis and optimization with profiling and call-graph creation for automatic function vectorization.
(ii) Generate vectorized function variants with properly constructed signatures via function cloning and vector signature generation.
(iii) Vectorize SIMD for loops that are identified by the compiler or annotated using SIMD extensions (#pragma SIMD can be used to vectorize outer loops) and cloned vector function bodies and all arguments by leveraging and extending our automatic loop vectorizer.
(iv) Enable classical scalar, memory, and loop optimizations and parallelization effectively, before or after loop and function vectorization, for achieving good performance.

Less-than-Full-Vector Loop Vectorization with Masking
Intel Xeon Phi coprocessor provides long (512-bit) SIMD vector hardware support for exploiting more vector-level parallelism.The long SIMD vector unit imposes the requirement of packing more scalar loop iterations into a single vector loop iteration, which also results in more iterations in the peeling loop, and/or in the remainder loop remaining nonvectorized, due to the fact that they do not constitute the full SIMD vector (or less-than-full-vector) unit of Intel MIC architecture.For example, consider the short trip-count loop as shown in Algorithm 1.
When the loop is vectorized for Intel SSE2 with vector length = 4 (128-bit), the remainder loop will have 3 iterations.When the loop is vectorized for the Intel MIC architecture with vector length = 16 (512-bit), the remainder loop will have 15 iterations.In another situation, if the loop is unrolled by 16, then the remainder loop will have 15 iterations, leaving the remaining 15 iterations in a scalar execution form.Thus, vectorizing the peeling and remainder loops (i.e., short trip-count loop in general) is very important for the Intel MIC architecture.This section describes how to apply vectorization, with masking support, to peeling and remainder loops (i.e., short trip-count loop) with special guarding masks to prevent the SIMD code from exceeding original loop and memory access boundaries.At a high level, the following steps describe our vectorization scheme without vectorization of peeling and remainder loops.
(iii) s2: compute the trip count of the peeling loop.
(v) s4: generate the vector loop initialization code.
(vii) s6: compute the trip count of the remainder loop.
Given the simple example as shown in Algorithm 2, the loop trip-count "" and the pointer "" (&[0]) have a memory alignment that is unknown at compile time.
On the Intel MIC architecture the vector length is 512 bits, which requires 64-byte alignment for efficient memory accesses.To achieve 64-byte aligned memory loads/stores, we need to pack 16 float (32-bit) elements for each single vector iteration and generate a peeling loop.Pseudocode 1 shows the vectorized loop based on the vectorization steps [s0, s1, . . ., s7] described above.The "less-than-full-vector" loops, that is, the peeling and remainder loops, are not vectorized.
Note that we performed loop unrolling for the main vectorized loop, which allows the hardware to issue more instructions per cycle by hiding memory access latency and reducing branching.To enable the "less-than-full-vector" (i.e., peeling loop, remainder loop, or short trip-count loop) vectorization, the loop vectorization scheme is extended as below.
(iii) s2: compute the trip count of peeling loop.(iv) s3: emit the vectorized peeling loop with masking operations.
(v) s4: generate the main vector loop initialization code.
(vii) s6: compute the trip count of the remainder loop.
In the cases of short trip-count loop vectorization of peeling and remainder loops with runtime trip-count and alignment checking, loops are vectorized as efficiently as possible.These loops are vectorized with optimal vector lengths and an optimal amount of profitable unrolling regardless of a known loop trip count.This provides better utilization of SIMD vector hardware without sacrificing the performance of short loops.This scheme allows us to completely eliminate Pseudocode 2: Pseudocode with vectorizing "less-than-full-vector" loops using mask.scalar execution of the loop in favor of masked SIMD vector code generation.Special properties of the mask are used to match unmasked code generation in most cases.For example, masked scalar memory loads that could be unsafe under an empty mask are considered safe under a remainder mask since it is never empty.
Without adding the capability of short trip-count loop vectorization, the loops in the ConvolutionFFT2D benchmark with 7 iterations and double precision data type would end up as a fully scalar execution.Applying vectorization with masking to these short trip-count loops results in a ∼2x to ∼5x speedup for the 7-iteration short trip-count (or less-than-fullvector) loops in the ConvolutionFFT2D benchmarks on the Intel MIC Architecture.

Alignment Strategy and Optimization
The Intel Xeon Phi coprocessor is much more sensitive to data alignment than the Intel Xeon E5 processor, so developing an Intel MIC oriented alignment strategy and optimization schemes is one of the key aspects for achieving optimal performance.
(i) Similar to Intel SSE4.2, the SIMD load+op instructions require vector size alignment, which is 64-byte alignment for the Intel MIC architecture.However, simple load/store instructions require the alignment information to be known at compile time on the Intel Xeon Phi coprocessor.
(ii) Different from prior Intel SIMD extensions, all SIMD load/store instructions including gather/scatter require at least element size alignment.Misaligned elements will cause a fault.This necessitates the Intel MIC architecture ABI [8] to require that all memory accesses be elementwise aligned.
(iii) There are no special unaligned load/store instructions in the Intel Initial Many Core Instruction (Intel IMCI) set.This is overcome by using unpacking loads and packing stores that are capable of dealing with unaligned (element-aligned) memory locations.Due to their unpacking and packing nature, these instructions cannot be directly used for masked loads/stores, except under special circumstances.
(iv) The faulting nature of masked memory access instructions in Intel IMCI adds extra complexity to those instructions addressing data outside paged memory and may fail even if actual data access is masked out.The exceptions are gather/scatter instructions.
Therefore, the compiler aggressively performs data alignment optimizations using traditional techniques such as alignment peeling and alignment multiversioning.
Alignment peeling implies the creation of a preloop that executes several iterations on unaligned data in order to reach an aligned memory address.As a result, most of these iterations are executed using aligned SIMD operations.The preloop can be vectorized with masking as described in Section 2. Unfortunately, this scheme works only for one set of coaligned memory addresses, and the others are assumed to be unaligned.In addition, our multiversioning optimization can be applied to the second set of coaligned locations by examining them dynamically.Aligned or unaligned operations are used based on the results of the examination.
For unmasked unaligned (element-aligned) vector loads and stores, the compiler uses unpacking/packing load and store instructions.They are safe in this scenario and perform much better than gather/scatter instructions.If the compiler cannot prove the safety of the entire address range of a particular memory access, it inserts a zero-mask check in order to avoid a memory fault.All instructions with the same mask are emitted under a single check to avoid execution under the empty mask and to eliminate multiple checks of the same condition.
Unpacking and packing instructions may cause fault when they are used with a mask, as they may address masked-out invalid memory.On-the-fly data conversion may cause fault even without masking.Thus, for unaligned masked and/or converting loads/stores, the compiler uses gather/scatter instructions instead of safety, even though this degrades performance.Memory faults would never happen if each memory access had at least one vector (64 bytes) of memory paged after its initial address.This can be achieved by padding each data section in the program and each dynamically allocated object with 64 bytes.For developers who are willing to do the padding to achieve optimal performance from masked code, the compiler knob-opt-assumesafe-padding was introduced.Under this knob, unaligned masked and/or converting load/store operations are emitted as unpacking loads/packing stores.
(i) In unmasked converting cases, as well as cases with peel/remainder masks, the compiler emits loads/stores directly.The mask in this case will work since it is dense.
(ii) For an arbitrary masking scenario, an unmasked load unpack instruction is used, which is safe due to the padding assumption, followed by a masked move (blend).The "nonempty-mask" check guarantees that the 64-byte padding is always enough for safety; that is, at least one item within the vector is to be loaded.Thus, the tail end of the memory access is within 64 bytes from meaningful data.
The safe-padding optimization has provided notable improvements on a number of benchmarks, for example, 10% gain on BlackScholes and selected Molecular Dynamics kernels.

Small Matrix Operations 2D Vectorization
Frequently seen in HPC workloads, operations on small matrices are a growing, profitable set of calculations for vectorization on Intel Xeon Phi coprocessors.With the wider SIMD unit support, the Intel C/C++ and Fortran compilers are enhanced to vectorize common operations on small matrices along 2 dimensions.Small matrices are matrices whose data can reside entirely in one or two 512-bit SIMD real, dimension(4,4):: A, B, C real sum integer j, l, i do j = 1, 4 do l = 1, 4 sum = 0.0 do i = 1, 4 sum = sum + A(i,l) * B(i,j) enddo C(l,j) = sum enddo enddo Algorithm 3: Small matrix multiplication summation.
Table 1: Contents of vector register A v512 after load.
Table 2: Contents of vector register B v512 after load.
With nonunit stride references present in the inner loop of Algorithm 3, the conventional inner loop vectorization will not provide the most efficient vectorization of the loop nest.The outer loop vectorization faces similar issues.The Intel C/C++ and Fortran compiler employs the wider SIMD vector unit of the Intel MIC architecture and vectorizes this example loop across all three loop nesting levels, named as 2-dimensional (2D) vectorization on small matrices.
The vectorization approach is detailed below with vector intrinsic pseudocode.For visualization, Tables 1-13 depict a snapshot of the various vector unit contents after each corresponding instruction.Tables 1-13 represent a vector unit, whose name is in the leftmost column and its contents in the rightmost four columns.Of the rightmost four columns, the lowest addressed element is in the top left corner and each consecutive element follows a row-major addressing order.
Table 5: Vector register contents after second shuffle.
Table 6: Vector register contents after third shuffle.
Table 7: Vector register contents after the final shuffle.
Table 8: Vector register contents after load with broadcast.
First, array data is loaded into a vector unit.With a wider SIMD vector unit, the compiler is able to load the entire A and B matrix each into a single vector unit.
(a) Matrices A and B are loaded into two SIMD registers: //Load A matrix from memory into vector register, For more details see Table 1.
//Load B matrix from memory into vector register, For more details see Table 2.
Table 13: Final C v512 vector unit contains sum of existing values of C v512 and elementwise products t2 v512 and t v512.
Next, the compiler optimizes the multiplication operation between matrix A and matrix B, through a series of data layout transformations and vector multiplication and addition operations.The compiler identifies a matrix multiplication in this loop and permutes the elements in matrix A and matrix B setting up simple vector multiplications and additions.
(b) We can simplify the multiplication needed through a transposition of the elements of A, followed by a multiply and add of each row B and with each row of transposed A. We start by transposing the elements of A.
//First, create a vector unit of zeros.
For the transpose operation, we use a set of new Intel MIC mm512 mask shuf128 × 32( ) intrinsic calls.Similarly in classic architecture, this shuffle intrinsic is bound by four 128-bit "lanes" in each vector register.Thus, this intrinsic contains arguments for permutation patterns for each of the four 128-bit lanes, as well as a permutation pattern for each of the four 32 bit boundaries within each of those lanes.The arguments are as follows: m512 res = mm512 mask shuf128 × 32( m512 v1, (I16) vmask, m512 v2, (SI32)perm128, (SI32)perm32), (i) res: result vector unit, (ii) v1: blend-to-vector unit; the values in this vector unit will be blended with the shuffled elements of the v2, according to the write mask, (iii) vmask: write mask; the write mask is a bit vector specifying which elements to overwrite in v1 with the shuffle elements of v2, (iv) v2: incoming data vector unit; this vector unit holds the elements which are to be shuffled, (v) perm128: 128-bit lane permutation; this value specifies the permutation order of the vector unit's 128-bit lanes, (vi) perm32: elementwise permutation; this value specifies the permutation order of the each of the four 32 bit boundaries within each 128-bit lane, //Begin transpose operation by shufflingelements into //desired order.Shuffle used to insert matrix diagonal //into transpose result vector unit, For more details see Table 4.
//Shuffle the next four elements and blend-in with the //elements written from previous shuffle, For more details see Table 5.
For more details see Table 7.
After the elements of matrix A have been permuted through transposition, each element of A and B is now in the correct position within each vector unit for a vector product, resulting in the same behavior as the dot product of rows and columns.
(c) Next, we perform the multiplication of each row of the transposed A with each row of B, maintaining a sum of the products from row to row: //Load the first row of A  v512 and broadcast that row to //each of the remaining three rows t1 v512 = mm512 extload ps(A  v512[0 : 4], MM FULLUPC NONE, MM BROADCAST 4 × 16, 0).
For more details see Table 8.
Another useful intrinsic used in this optimization is the Intel MIC mm512 swizzle ps( ) intrinsic.This intrinsic is similar to that of the shuffle above except it only permutes each 128-bit lane and not each of the 32 boundaries within those lanes.The arguments are as follows: m512 res = mm512 swizzle ps( mm512 v1, SI32 perm) (i) res: result vector unit, (ii) v1: incoming data vector unit to be permuted, (iii) perm: permutation pattern for each 128-bit lane, //Load the first row of B v512 and broadcast that row to //each of the remaining three rows t2 v512 = mm512 swizzle ps (B v512, MM SWIZ REG AAAA).
For more details see Table 9.
For more details see Table 10.
//Load the second row of A  v512 and broadcast that row //to each of the remaining three rows Each subsequent multiplication must be accumulated for each row.These multiplications and additions are the corresponding dot product of rows and columns found in matrix multiplication, but because of the earlier transpose, no further permuting is required: For more details see Table 11.
//Load the fourth row of B v512 and broadcast that row to //each of the remaining three rows t2 v512 = mm512 swizzle ps (B v512, MM SWIZ REG DDDD).
For more details see Table 12.
For more details see Table 13.
After the simplified matrix multiplication, the loop further requires that results be stored in the C matrix.With all elements correctly computed and residing in vector unit only one store operation is generated.
(d) Finally, the result vector unit of values is stored to the C array: //The elements of vector register C v512 are then stored //to memory at &C [1] The 512-bit long SIMD vector unit of the Intel MIC architecture supports consumption of both matrix dimensions for 2D vectorization, fitting an entire small matrix (4 × 4 float type) into one 512-bit SIMD vector register.This enables more efficient flexible vectorization and optimizations for small matrix operations.For example, the scalar version of single precision 4 × 4 matrix multiply computation naively executes 128 memory loads, 64 multiplies, 64 additions, and 16 memory stores.The small matrix 2D vectorization reduces instructions to 2 vector loads from memory, 4 multiplications, 4 shuffles, 4 swizzles, 3 additions, and 1 vector store to memory for a reduction of approximately 15x in number of instructions.

Performance Evaluation
This section presents the performance results measured on an Intel Xeon Phi coprocessor system using a set of workloads and microbenchmarks.
6.1.Workloads.We have selected a set of workloads to demonstrate the performance benefits and importance of SIMD vectorization on the Intel MIC architecture.These workloads exhibit a wide range of application behavior that can be found in areas such as high performance computing, financial services, databases, image processing, searching, and other domains.These workloads include the following.
6.1.1.NBody.NBody computations are used in many scientific applications such as astrophysics [9] and statistical learning algorithms [10].The main computation involves two loops that iterate over the bodies and computes a pairwise interaction between them.

2D 5 × 5 Convolution.
Convolution is a common image filtering computation used to apply effects such as blur and sharpen.For a given 2D image and a 5 × 5 spatial filter containing weights, this convolution computes the weighted sum for the neighborhood of the 5 × 5 set of pixels.[11].The input consists of a set of 2D images that are "back-projected" onto a 3D volume in order to construct a 3D grid of density values.

Radar (1D Convolution).
The 1D convolution is widely used in applications such as radar tracking, graphics, and image processing.

Tree Search.
In memory tree structured index search is a commonly used operation in database applications.This benchmark consists of multiple parallel searches over a tree with different queries, where the path through the tree is determined based on the comparison of results of the query and node value at each tree level.

System Configuration. The detailed information on the configuration of the Intel Xeon Phi
Coprocessor used for the performance study and for evaluating the effectiveness of SIMD vectorization techniques is provided in Table 14.

Performance Results
. All benchmarks were compiled as native executable using the Intel 13.0 product compilers and run on the Intel Xeon Phi coprocessor system specified in Table 14.To demonstrate the performance gains obtained through the SIMD vectorization, two versions of the binaries were generated for each workload.The baseline version was compiled with OpenMP parallelization only (-mmicopenmp -novec); the vectorized version is compiled with vectorization (default ON) and OpenMP parallelization (mmic -openmp).The performance scaling is derived from the OpenMPonly execution and OpenMP with 512-bit SIMD vector execution on the Intel Xeon Phi coprocessor system that we described at beginning of this section.That is, when the workload contains 32-bit single precision computations, 16-way vectorization may be achieved.When the workload contains 64-bit double-precision computations, 8-way vectorization is achieved.
Figure 2 shows the normalized SIMD performance speedup of five workloads.The generated SIMD code of these workloads achieved SIMD speedup ranging from 2.25x to 12.45x.Besides those classical HPC applications with regular array accesses and computations, the workload with a large amount of branching codes, such as tree search used in database applications, achieves 2.25x speedup as well with SIMD vectorization based on the masking support in the Intel MIC architecture.

Impact of Less-than-Full-Vector Loop Vectorization.
To examine the impact of the less-than-full-vector loop vectorization, a simple microbenchmark was written with three  small kernel functions: intAdd, floatAdd, and doubleAdd.Each of them has a short trip-count loop that takes 3 arrays, a, b, c of size 31, and does an elementwise addition with respect to int, float, and double data types.The vector length is 16 iterations for loops in the intAdd and floatAdd kernels and 8 iterations for the loop in the doubleAdd kernel function.This experimental setup ensures the intAdd and floatAdd loops contain a 15-iteration remainder loops, and the doubleAdd loop contains a 7-iteration remainder loop which can be vectorized with the "less-than-full-vector" loop SIMD vectorization technique using masking support described in the Section 2.
Figure 3 shows performance gains from vectorization without "less-than-full-vector" loop vectorization and with "less-than-full-vector" loop vectorization for three short tripcount loops in the intAdd, floatAdd, and doubleAdd kernel functions.The generated SIMD code of these loops achieves a speedup ranging from 2.89x to 3.32x without "less-than-full-vector" loop vectorization.With "less-thanfull-vector" loop vectorization, the performance speedup is improved significantly and ranges from 3.28x to 7.68x.Note that, in this measurement, all data are 64-byte aligned, there are no peeling loops generated, and the aligned memory load/store instructions such as vmovaps and vmovapd [1] are generated to achieve optimal performance.The next subsection shows the data alignment impact on the Intel MIC architecture.

Impact of Data
Alignment.These kernel loops used in Section 6.3.1 are reused for this measurement.In this study, the difference is that we do not provide alignment information of the arraysa, b, and c.Without alignment information, given these loops are short trip-count loops with constant trip count, the compiler generates SIMD instructions: (i) vloadunpackld and vloadunpackhd to load data from unaligned memory locations and vpackstoreld and vpackstorehd [1] to store data to unaligned memory locations for the vectorized main loop, (ii) vgatherdps and vscatterdps instructions [1] to load and store for the vectorized remainder loop with write mask.As shown in Figure 4, with data alignment information, the performance of SIMD execution is 1.45x, 1.41x, and 1.32x better than unaligned cases with respect to int, float, and double types of three kernel functions.The alignment optimization described in Section 3 is critical to achieving optimal performance on Intel MIC architecture.

Impact of Small Matrix 2D Vectorization.
Small matrix operations such as addition and multiplication have served as important parts of many HPC applications.A number of classic compiler optimizations such as loop complete unrolling, partial redundancy elimination (PRE), scalar replacement, and partial summation have been developed to achieve optimal vector execution performance.The conventional inner or outer loop vectorization for 3-level loop nests of 4 × 4 matrix operations is not performing well on Intel Xeon Phi coprocessor due to (i) less effective use of 512-bit long SIMD unit, for example, for 32-bit float data type, when either inner loop or outer loop is vectorized.In this case 4-way vectorization is used instead of 16-way vectorization, (ii) side-effects on classic optimizations, for example, the partial redundancy elimination, partial summation, and operator strength reduction, when the loop is vectorized.As shown in Figure 5, the convention loop vectorization on small matrix (4 × 4) operations does cause performance degradation.For both cases of single precision and double precision matrix (4 × 4) multiplications, the performance degradation is ∼50% when comparing against cases without vectorization, which are used as the baseline performance.
In the case of the paired matrix multiplication, there are two matrix (4 × 4) multiplications done in a single loop nest, and matrix B is transposed for computing sumy (for more details see Algorithm 4).
The classical loop optimizations are not as effective as for the single matrix multiplication case due to the transpose operation of matrix B and paired matrix multiplications in the loop.Thus, the performance achieved with classical loop optimization is on-par with applying conventional loop vectorization, and no notable performance difference is observed as shown in Figure 5. Promisingly, applying the small matrix 2D vectorization we proposed in Section 4, we achieved a performance speedup 1.15x/1.04xfor single matrix (4 × 4 float/double type) multiplication and a speedup 5.42x/4.18xfor paired matrix (4 × 4 float/double type) transpose and multiplication, which demonstrates the effectiveness of small matrix 2D vectorization using long SIMD vector unit supported by Intel Xeon Phi coprocessor.

Seamless Integration with Threading
Effectively exploiting the power of a coprocessor like Xeon Phi requires that both thread-and vector-level parallelism are exploited.While the parallelization topic is beyond the scope of this paper, we would still like to highlight that the SIMD vector extensions can be seamlessly integrated with threading models such as OpenMP * 4.0 supported by the Intel compilers.Given the Mandelbrot example Mandelbrot computes a graphical image representing a subset of the Mandelbrot set (a well-known 2D fractal shape) out of a range of complex numbers.It outputs the number of points inside and outside the set.
In the mandelbrot workload, the function "mandel" in the mandelbrot program is a hot function and a candidate for SIMD vectorization, so we can annotate it with #pragma omp declare SIMD.At the caller site, the hot loop is a double nested for loop, the outer for loop is asserted with "omp parallel for" for threading, and the inner loop is asserted with "omp SIMD" for vectorization as shown in Algorithm 5. Note that the "guided" scheduling type is used for achieving a good load balance, as each call to "mandel" function does varying amount of work in terms of execution time due to "break" exit of the loop.
Figure 6 shows that the SIMD vectorization alone delivers a ∼16x speedup, built with option -mmic -openmp -std=c99-O3 over the serial execution.The OpenMP parallelization delivers a 62.09x speedup with 61 threads using 61 cores with Hyperthreading OFF, a speedup 131.54x with 244 threads (61 cores with Hyperthreading ON, 4 HT threads per core) over the serial execution.The OMP PARALLEL FOR and SIMD combined execution delivers an OMP PAR + SIMD speedup 2067.9x with 244 threads, running on an Intel Xeon Phi system, which has 61-core on the chip with Hyperthreading ON.The performance scaling from 1 thread to 61 threads is close to linear.In addition, the Hyperthreading support delivers a ∼2x performance gain by comparing the 244-thread speedup with the 61-thread speedup, which is better than the well-known 20%-30% expectation on the performance gain from Hyperthreading technology due to the nature of less computing resource contention in the workload, and 4 busy HT threads did hide latency well.For the system information details see Section 6.2.

Related Work
The compiler vectorization technology [12] had been one of the key loop transformations for traditional vector machine decades ago.However, the recent proliferation of modern SIMD architecture [1,4] poses new constraints such as data alignment, masking for control flow, nonunit stride access to memory, and the fixed-length nature of SIMD vectors that shall demand more advanced vectorization technologies and vectorization friendly programming language extensions [7].
In the past three plus decades, the rich body of SIMD vectorization capabilities has been incorporated in a number of industry and research compilers [5,6,[12][13][14][15][16].These include works based on ICC (the Intel compiler) [5,6], XLC (the IBM compiler) [13,16], VAST [17], GCC [18,19], and the SUIF compiler [20].However, there are many unknown program factors such as loop trip count, memory access stride and patterns, alignment, and control flow complexity at compiletime that pose challenges to the modern optimizing compiler's ability to apply advanced and practical vectorization techniques and fulfill the semantic gap between application programs and the modern processors such as Intel Xeon Phi coprocessor for harnessing its computational power.
Compared to the conventional loop vectorization [5,12,20], the "less-than-full-vector" vectorization technique brings extra performance benefits for those vectorizable short   trip-count loops, especially when the processor provides the long SIMD unit masking capability like the Intel Xeon Phi coprocessor.Our alignment optimizations are built on top of existing dynamic alignment optimizations as presented in [5,6].However, the alignment strategy described in this paper is designed to satisfy the requirement of Intel MIC architecture with optimal SIMD instruction selection and mask utilization for safe and optimal performance.Beyond traditional singlelevel loop vectorization [5,12,16,18,19,21], the small matrix operation 2D vectorization increases vector-parallelism and improves the utilization efficiency of the long SIMD vector unit, swizzle, shuffle, broadcast, and mask support in Intel MIC architecture significantly.In addition, programming language extensions such as OpenMP * SIMD extensions [22,23] and Cilk Plus [3,7] function vectorization and loop vectorization through the compiler has been paving the way to enable more effective vector-level parallelism [7,22] in both C/C++ and Fortran programming languages.To support these SIMD vector programming models on the Intel Xeon Phi coprocessor effectively, the practical and effective vectorization techniques described in this paper are essential for achieving optimal performance and ensuring SIMD code execution safety on an Intel Xeon Phi coprocessor system.

Conclusions
Driven by the increasing prevalence of SIMD architecture in the Intel Xeon Phi coprocessor, we proposed and implemented new vectorization techniques to explore the effective use of its long SIMD units.This paper presented several practical SIMD vectorization techniques such as lessthan-full-vector loop vectorization, Intel MIC specific data alignment optimizations, and small matrix operations 2D vectorization for the Intel Xeon Phi coprocessor.A set of workloads from several domains was employed to evaluate the benefits of our SIMD vectorization techniques.The results show that we achieved up to 12.5x performance gain on Intel Xeon Phi coprocessor.Mandelbrot workload demonstrated the seamless integration of SIMD vector extensions with threading and showed a 2067.91xperformance speedup with the combined use of OpenMP "parallel for" and "SIMD" constructs using Intel C/C++ compilers on an Intel Xeon Phi coprocessor system.
Intel C/C++ and Fortran compilers are highly enhanced for programmers to harness the computational power of Intel Xeon Phi coprocessors for accelerating highly parallel applications found in chemistry, visual computing, computational physics, biology, financial services, pixel, multimedia, Collect function annotations or function analysis informationFunction cloning and vector signature generation Function vectorization Identify and analyze functions by parsing annotations and profiling information to decide function cloning and signatures for matching caller and callee on target CPU/GPU Memory/loop optimizations loop vectorization/parallelization Scalar optimization and code generation Phase-2: optimizing with program annotation or profiling feedback information Phase-1: compile and execute with call graph and profiling enabled Vectorized and optimized binary code

Figure 1 :
Figure 1: SIMD vector compilation infrastructure for function and loop vectorization.

Figure 4 :
Figure 4: Performance gain with data alignment.

Table 4 :
Vector register contents after first shuffle.

Table 11 :
t1 v512 vector register contents illustrating final load with broadcast.

Table 14 :
Target system configuration.