The optimization process of a H.264/AVC encoder on three different architectures is presented. The architectures are multi- and singlecore and SIMD instruction sets have different vector registers size. The need of code optimization is fundamental when addressing HD resolutions with real-time constraints. The encoder is subdivided in functional modules in order to better understand where the optimization is a key factor and to evaluate in details the performance improvement. Common issues in both partitioning a video encoder into parallel architectures and SIMD optimization are described, and author solutions are presented for all the architectures. Besides showing efficient video encoder implementations, one of the main purposes of this paper is to discuss how the characteristics of different architectures and different set of SIMD instructions can impact on the target application performance. Results about the achieved speedup are provided in order to compare the different implementations and evaluate the more suitable solutions for present and next generation video-coding algorithms.
In the last years the video compression algorithms have played an important role in the enjoying of multimedia contents. The passage from analog to digital world in multimedia environment cannot be performed without compression algorithms. DVDs, Blu-Ray, and Digital TV are typical examples. The compression algorithm used in DVDs is MPEG-2, and Blu-Ray supports VC-1 standardized with the name SMPTE 421M [
The H.264/AVC [
It results that a H.264 encoder addressing high definition (HD) resolutions needs to support High Profiles in order to be part of an effective video application. On the other hand, the already great complexity of the H.264 algorithm is further increased by supporting FRExt. In particular, this leads to implement two new modules: the
In case of mobile devices, the H.264 complexity issues together with the constraints of limited power consumption and the typical need of real-time operations in video-based applications draw a difficult scenario for video application developers.
The HD resolution involves a large amount of data, and the compression algorithms are high computational demand applications, often used as benchmark to measure the processor performance. In order to support real-time video encoding and decoding, specific architectures are developed. Multicore architectures have the potential to meet the performance levels required by the real-time coding of HD video resolutions. But in order to exploit multicore architectures, several problems have to be faced. The first issue is the subdivision of an encoder application in modules that can be executed in parallel. In this case, the main difficult is the strong data dependency in video encoder algorithms. Parallel architectures can be more easily exploited using other kind of algorithm like computer graphics, rendering technology or cryptography, where the data dependency is not as strong as in video compression. Once a good partitioning is achieved, the optimization of a video encoder should take advantage of the data level parallelism to increase the performance of each encoder module running on the architecture’s processing element. A common approach is to use the SIMD instructions to exploit the data level parallelism during the execution; otherwise, ASIC design can be adopted for critical kernel. SIMD architectures are widely used for their flexibility. SIMD ISAs are added at most market spread processor: Intel’s MMX, SSE1, SSE2, SSE3, SSE4; Amd 3DNow!; ARM’s NEON; Motorola’s AltiVec (also known as Apple’s Velocity Engine or IBM’s VMX).
In this paper, we will show how the data level parallelism is exploited by SIMD and which instructions are more useful in video processing. Different instruction set architectures (ISAs) will be compared in order to show how the optimization can be driven and how different ISA features can lead to different performance. This paper is intended to be a great help to both software programmers that have to choose for the most suitable SIMD ISA for developing a video-based application and for ISA designers that want to create a generic instruction set being able to give good performance on video applications. In that regard, the authors will select a set of generic SIMD instructions that can speed up video codec applications, detailing the modules that will profit from the introduction of each instruction. Besides describing the optimization methods, the paper indicates a few guidelines that should be followed to partition the encoder in separate modules.
Even though the work focuses on H.264/AVC, most of the proposed solutions will also apply to the earlier mentioned standards as well as to more recent video compression algorithms as scalable video coding (SVC) [
This paper is organized as follows. Section
The basic concept of SIMD instructions is the possibility of fill vector registers with multiple data in order to execute the same operation on several elements. One of the major bottlenecks in the SIMD approach is the overhead due to the data handling needed to feed the vector registers. Typical required operations are extra memory accesses, packing data, element permutation inside vectors, and conversion from vector to scalar results. All these preliminary operations limit the vector dimension and the performance enhancement achievable with SIMD optimization.
In literature, several studies regarding the SIMD optimization of video-coding applications are available [
In SIMD processors the memory access has an important impact on performance. The unaligned access is not usually possible in SIMD ISA, and when possible it is discouraged due to additional instruction latency. The programmers usually take care of handling the unaligned load adding further overhead to vector data organization. Moreover, the need of unaligned load is always present in video-coding algorithm especially in motion estimation (ME) and motion compensation (MC), where the pixel blocks selected by motion vectors are frequently at misaligned positions even if the start of a frame is memory aligned. Often, the position of a block we need to access cannot be known in advance, and this leads to unpredictable misalignment in data loaded from memory. In Intel’s architectures, starting from SSE2 the support to unaligned load has been added, but the performance is strongly reduced either if the load operation crosses the cache boundary or, with SSE3, if the load instruction needs store-to-load forwarding. In AltiVec, it is necessary to load two adjacent positions and shift data in order to achieve one unaligned load, a usually adopted approach to overcome the misalignment access issue. This problem is common in digital signal processor (DSP) as well. Usually, DSP do not support unaligned loads, but due to the large use of DSP in video application several producers have added the support to this kind of operation. For example, Texas Instruments family TMS320C64x supports unaligned load and store operations of 32 and 64 bit element, but with only one of the two memory ports [
The MediaBreeze SIMD processor was proposed to reduce the bottlenecks in SIMD implementations [
Another typical approach to reduce the SIMD overhead is the usage of multibank vector memory where data is stored interleaved. The drawback is the increase of hardware cost for supporting the addresses generation.
An alternative to SIMD implementation on programmable processor architectures is the hardwired processor. Usually, it is only used when performance and low power consumption are essential requirements [
In order to optimize the H.264 encoder, we chose three different ISAs. The adopted architectures are ST240, xSTream, and P2012, all developed by STMicroelectronics. The former is a single-processor architecture, and the others are multicore platforms. In the following, the three architectures will be briefly described, giving special attention to the SIMD instruction set.
We chose these architectures for their novelty and for the possibility to have a complete toolchain (code generation, simulation, profiling, etc.) for developing an application in an optimal way. Each toolchain allowed a complete observability of the system. In this way, it was possible to evaluate the effectiveness of every author’s solution. Observability is a very important characteristic when developing/optimizing an application. Using a real system it is not always possible to reach the degree of observability you have using a simulator and a suitable toolchain. Moreover, in an architecture under development as P2012 we had the possibility to contribute to the SIMD instruction set and, more important, to evaluate the contribution of each particular SIMD to the performance of the target video codec application. The three instruction sets present suitable characteristics for our research; they are generic instruction set, but ST240 includes a few video-specific instructions; we can analyse the impact of different vector register sizes; even if xSTream and P2012 share many characteristics, only xSTream supports horizontal SIMD (this is a special feature; e.g., other SIMD extensions as Intel SSE and ARM NEON do not have the same support); in P2012 platform, we were able to define and insert new SIMD instructions.
Besides the type of instructions, the SIMD extensions differ in both size and precision. These differences allow analyzing the impact of different architecture solutions on the global performance.
The ST240 is a processor of STMicroelectronics ST200 family based on LX technology jointly developed with Hewlett Packard [ 4-issue Very Long Instruction Word (VLIW) 64-32-bit general purpose registers 32KB D-Cache and 32KB I-Cache 450 MHz clock frequency 8-bit/16-bit arithmetic SIMD.
In the H.264 encoder SIMD optimization, the most significant instructions of the ST240 ISA are the following: the SIMD add.ph and sub.ph which perform, respectively, the packed 16-bit addition or subtraction; the perm.pb instruction which performs byte permutations and the muladdus.pb which multiplies an unsigned byte by a signed byte in each of the byte lanes and then sums across the four lanes to produce a single result. Furthermore, several data manipulation instructions are defined: pack.pb packs 16-bit values to byte elements ignoring the upper half; shuffeve.pb and shuffodd.pb, respectively, perform 8-bit shuffle of even and odd lanes. Two averaging operations (avg4u.pb and avgu.pb) are also defined in the instruction set.
One important operation in video-coding algorithms, the absolute value of the difference, abs (a-b), can be performed with the absubu.pb instruction (Figure
SAD operation.
xSTream is a multiprocessor dataflow architecture for high-performance embedded multimedia streaming applications designed at STMicroelectronics [
xSTream is constituted by a parallel distributed and shared memory architecture. It is an array of processing elements connected by a Network on Chip (NoC) with specific hardware for management of communication [
xSTream architecture.
The main elements in Figure
The XPEs are based on ST231 VLIW processors [ 2-issue VLIW, 128-bit vector registers, up to 512 KB local memory cache, up to 1 GHz clock frequency, and 16-bit/32-bit arithmetic SIMD.
In order to achieve excellent performance, the XPE core tries to exploit available parallelism at various levels. It supports a plethora of SIMD instructions to exploit available data-level parallelism. These instructions concurrently execute up to four operations on 32-bit operands or eight operations on 16-bit operands. The core supports wide 128-bit load/store.
The xSTream architecture handles scalar and vector operands.
Vector operands are 128-bit wide and consist of either eight 16-bit half-words or four 32-bit words, as shown in Figure
Vector operand.
In the xSTream ISA each SIMD instruction has an additional operand allowing permuting the result’s element positions or replicating any element in the other positions. This feature considerably increases the SIMD flexibility because the results have often to be reordered for further elaboration. This is especially true for video-coding algorithm with operations performed on several steps where the input of next step is usually the output of previous one. The permutation operand allows this with the cost of only one additional cycle. This leads to reduced costs to perform all the operation needed for data reordering.
The XPE supports horizontal SIMD as well. This kind of SIMD allows operations among elements in the same vector, and it is a key feature for speeding up execution in several H.264 functional units, as we will see in next sections.
Platform 2012 is a high-performance programmable architecture for high computational demanding embedded multimedia applications, currently under joint development by STMicroelectronics and Commissariat à l’énergie atomique et aux énergies alternatives (CEA) [
The P2012 architecture (Figure 32-bit RISC processor (up to 2 instructions per cycle), 128-bit vector registers, 256 KB of memory shared by all the processors (per cluster), 600 MHz clock frequency, and 16-bit/32-bit arithmetic SIMD.
P2012 scheme.
The P2012 basic modules can be easily replicated to provide scalability [
Vector operands are 128-bit wide and consist of either eight 16-bit half-words or four 32-bit words. In order to increase the SIMD flexibility, instructions able to permute data positions inside the vector operands are defined in the instruction set. The support to horizontal SIMD is limited at operation involving only two adjacent elements inside a vector, but its presence is fundamental for typical video-coding operation like sum of absolute difference (SAD).
Whatever platform we choose, we will have a limited number of SIMD instructions because of hardware constraints. For this reason, besides precision and size, one of the key issues while choosing a SIMD extension is generality versus application-specific instructions. The former can show good speedups for a large variety of applications. The latter can reach greater performance, but limited to a particular family of applications. Of course, there are a lot of solutions that lay in the middle.
The vector register size impacts performance, hardware reliability, and costs. The choice of the optimal size and precision of SIMD instructions is a key factor for reaching the desired performance for the target application. The axiom larger SIMD equal to better performance may be valid for applications having no constraints and data dependencies in either spatial or temporal field. It is not the H.264 encoder condition. In general, algorithms with a heavy control flow are very difficult to vectorize, and the SIMD optimization does not always lead to the desired performance enhancement.
The application developers should choose the dimension that best fit their needs, as well as ISA designers should take into account the requirements of the application families they are targeting. As stated in [
In order to support real-time video encoding addressing HD resolutions, multiprocessor architectures seem to be an optimal solution, as earlier explained. Moreover, we would like to test the multicore architectures with an application of high interest but not so suitable for these kind of architectures in order to stress the architecture design and to evaluate possible issues and finding solutions that could be also useful for other applications.
The first programmer’s task dealing with this type of platforms is the subdivision of the encoder application in modules that can execute in parallel. The H.264 encoder partitioning plays a fundamental role in multicore architectures as xSTream and P2012, where each functional block has to meet the resources of processor elements, and the interconnection system must fulfil the memory bandwidth needed to feed the modules. The designer choice becomes more complex when some modules can run in parallel avoiding stalls in pipeline [
Even if a detailed description of the encoder partitioning is beyond the scope of this paper, we can here depict some issues we faced approaching this process and the solutions we adopted.
First of all, it is worth to take into account the data dependency inside the H.264 encoder. Temporal data dependency is implicit in the Motion Estimation mechanism; the coding of the current frame always depends on the previously encoded frame(s) that are used as reference. Thus, there is always a temporal data dependency, except if the current frame is an I picture. Anyway, the encoding process also shows a spatial data dependency between macroblocks, that is, the basic encoding block comprising
MB neighbours.
Spatial data dependency can even occur inside a MB. The prediction of a
In this scenario, we cannot encode two frames in parallel, because of the temporal data dependency, and we cannot concurrently process different MBs, because of the spatial data dependency, unless the MBs belong to different slices. Thus, one opportunity is to concurrently process every slice, but this solution has two drawbacks; it strongly depends on the particular encoder configuration, and it requests to implement the whole encoder on every processor element. Therefore, the only chance to partition the encoder is during the MB processing. This does not mean to separately process
The encoder partitioning should now derive from an evaluation of the functional units that can be concurrently computed, taking into account the amount of data that needs to be exchanged between the different cores.
If we suppose that each module will run on a different core, we must consider both the chunk of data each core needs to exchange with the interconnected cores and the frequency of such communications. Therefore, for an optimal module partitioning, it is important to analyse the encoder data flow. Basically, this analysis should result in a list of selected modules with a set of input and output data for every list’s entry, as shown in Table
Encoder data flow.
Module | Input | Output | Input from local buffers |
---|---|---|---|
MV prediction | MV A | MV pred | MV B, C, D |
Motion estimation | Search window, original MB, MV predictor | MV, cost, best intermode, MB predictor | |
Intraprediction | Original MB, reconstructed MB A | Cost, best intramode, MB predictor | Reconstructed MB B, C, D |
Residual coding | Original MB, MB predictor | Residual signal, coded MB parameters | |
IDCT-DeQuant and reconstruction | Residual signal | Reconstructed MB | |
Deblocking filter | Reconstructed MB A | Decoded MB | Reconstructed MB B |
Entropy coding | Coded MB parameters | Output stream |
Encoder partition diagram.
Each module will keep local memory buffers containing the data required to process the current MB. For example, the Intraprediction module needs to store a row of reconstructed MBs plus one MB (the left MB) in order to be able to predict the current MB. The deblocking filter will need to store the same number of reconstructed MBs as well. These local storages are filled by producer modules as soon as they complete the respective tasks. In the previous example, “IDCT-DeQuant & Reconstruction” is the producer for intraprediction; when the MB reconstruction has completed for
For the sake of simplicity we did not put into Figure
The here described partitioning seems to both fulfil the data dependency constraints and exploit the few opportunities of parallel execution available in a H.264 encoder. Moreover, the computational weight of the encoder components is quite well distributed among the different cores. The only exception is the ME, which is the most time-consuming module. In our encoder we utilize the SLIMH264 ME algorithm [
Among the issues the designer should take into account, there is still the memory bandwidth needed to feed the modules. From Table
Search window update.
The H.264 encoder modules work on a block basis. Even though the basic block of the coding process is the macroblock, consisting of
Usually, inside each module the computations require 16-bit precision for intermediate results. Thus, a typical situation is as follows: load 8-bit samples from memory; switch to 16-bit precision and compute the results; store the results to memory as 8-bit samples.
Some of the modules, or at least some parts of them, require a 32-bit precision. Among them, it is worth noting a few computations for pixels interpolation and the Quantization and Inverse-Quantization process.
In order to evaluate the different performance achievable with the three different ISAs, we have inserted the SIMD instructions in an already optimized ANSI C code which is used as reference to evaluate the achieved speedup. For a better understanding of the presented work, the comparison is not only carried out at global level, but for every H.264 functional unit.
In the following, the implementation detail of the sum of absolute difference (SAD) and the Hadamard filter will be shown for all the three addressed ISAs. Among all the several modules implementations, we have chosen to describe these particular operations for different reasons: the SAD is one of the most time-consuming operations in video-compression algorithms; the implementation of Hadamard filter is a good example for describing how an ANSI C implementation can be rewritten to best fit the available SIMD ISA. The access to data stored in memory will be discussed as well because it is a typical issue in optimizing video compression algorithms using SIMD instructions. A complete description of the encoder SIMD implementation on the ST240 processor can be found in [
The sum of absolute differences is a key operation for a large variety of video-coding algorithms. The number of times this operation is executed during a coding process can vary depending on the encoder implementation and it strongly depends on the motion estimation module, that it is not covered by the H.264 standard definition. Anyway, independently of specific implementations, this operation is a key factor for the whole-encoder performance.
Here, we will show three different SAD implementations using SIMD instructions, and we will compare them with an optimized ANSI C code.
Given the essential role the SAD plays in video coding algorithms, some instruction sets include specific instructions to speed up such operation. Here, we will compare SIMD instruction sets having different size and different degree of specializations.
Using the ST240 32-bit wide SIMD extension, the optimization of the SAD computation has been quite straightforward thanks to the SIMD instruction
The SAD finds the “distance” between two
SAD implementation.
The achieved speedup is shown in Table
SAD performance.
Cycles | Operations | Load | Store | |
---|---|---|---|---|
ANSI C version | 36 | 134 | 8 | 0 |
SIMD version | 14 | 30 | 8 | 0 |
The xSTream and P2012 architectures support 128-bit-wide vector registers, and they can perform 8-bit, 16-bit, or 32-bit arithmetic SIMD operations. Usually, SAD is performed using 8-bit precision, allowing for each SIMD calculation a capability to handle sixteen elements. Using vertical SIMD instructions, it is easy to achieve the absolute difference among several elements stored in two vectors, but the addition of the elements stored in a single vector is onerous because usually it requires several vertical SIMD inefficiently utilized. Both P2012 and xSTream ISAs have horizontal addition of SIMD instructions, but with different capability. In xSTream, it is allowed adding all the elements stored in the same vector, producing a scalar result. In P2012, VECx horizontal addition is limited to only add two adjacent elements inside a vector; in this way, four SIMD instructions must be used in order to achieve the scalar result of the SAD operation. This difference significantly impacts the encoder optimization. For example, when the SAD is calculated to evaluate the predictor cost in Intra
Predictor cost calculation.
Even if in P2012 ISA the lacking of a horizontal SIMD for addition partially wastes the obtained great gain, we still complete the SAD operation using six VECx instructions and one scalar instruction, as shown in Figure
We consider very interesting the Hadamard SIMD optimization because it involves a large number of instructions and can be considered a typical case study.
Although the Hadamard transform it is not currently used in the rest of the encoder, the intraprediction module utilizes such transform to find the best
In the ST240 code, the optimization has started considering that Hadamard can be subdivided into two different phases: horizontal and vertical. The horizontal phase can be subdivided into 4 rows as well as the vertical phase into 4 columns, as shown in the portion of pseudocode in Table
Hadamard phases.
Horizontal phase | Vertical phase |
---|---|
/* first row */ | /* first column */ |
|
|
|
|
|
|
|
|
|
|
/* second row */ | /* second column */ |
|
|
|
|
|
|
|
|
|
|
/* third row */ | /* third column */ |
|
|
|
|
|
|
|
|
|
|
/* fourth row */ | /* fourth column */ |
|
|
|
|
|
|
|
|
In the pseudocode,
Once we have all the differences contained in packed 16-bit values subdivided into even and odd pairs, we can rewrite the first row of the horizontal Hadamard transform as
In such a way, we can exploit the packed 16-bit addition and subtraction to obtain the high and low halves of the
Anyway, since the vertical phase of Hadamard is yet to come, there is no need to compute such values at this point. In fact, we can rewrite the
We can use the low and high halves of the intermediate coefficients to compute the low and high halves of the final coefficients
Hadamard vertical phase with ST240 SIMD.
The Hadamard optimization with ST240 SIMD is quite complex. Due to shortness of SIMD, the standard algorithm has been modified in order to better match the SIMD ISA features.
Using the xSTream and P2012 architectures, we followed a different approach. Our goal was the exploitation of 128-bit-wide SIMD minimizing the data reordering. Considering that the Hadamard transform can be defined as
The only issue is obtaining a good implementation of the FFT butterfly with SIMD, avoiding wasting all the gain achieved using fast algorithm with the data reordering needed to implement the calculation. Our approach consists of a modified butterfly that allows using always the same butterfly structure for every level, even if we have to reorder data between stages (Figure
Hadamard modified butterfly.
The output values coming from every butterfly can be calculated for 16 samples at a time using two SIMD instructions, one calculating the additions and one calculating the differences. In this way, we have the advantage of computing the output of each level using a simple SIMD implementation, at the cost of swapping intermediate results between different levels.
Even if xSTream and P2012 share this implementation mechanism, we have measured different performance. In this case, the difference depends on the different types of data manipulation instructions. The xSTream ISA having the third operand allowing the permutation of results inside vectors is more flexible and can implement the above algorithm with a reduced number of instructions respect to VECx P2012 ISA. Algorithm
/*first level: one 16 samples butterfly*/ /*(s0 ÷ s7)+(s8 ÷ s15)*/ vaddh out_low = in_low, in_high /*(s0 ÷ s7)−(s8 ÷ s15)*/ vsubh out_high = in_low, in_high /*data reordering*/ /*0 1 2 3 8 9 10 11*/ vmrgbl in_low = out_low, out_high, perm /*4 5 6 7 12 13 14 15*/ vmrgbu in_high = out_low, out_high, perm /*second level: two 8 samples butterfly*/ vaddh out_low = in_low, in_high vsubh out_high = in_low, in_high /*data reordering*/ /*0 1 8 9 4 5 12 13*/ vmrge in_low = out_low, out_high /*2 3 10 11 6 7 14 15*/ vmrgo in_high = out_low, out_high /*third level: four 4 samples butterfly*/ vaddh out_low = in_low, in_high vsubh out_high = in_low, in_high /*data reordering*/ /*0 8 2 10 4 12 6 14*/ vmrgeh in_low = out_low, out_high /*1 9 3 11 5 13 7 15*/ vmrgoh in_high = out_low, out_high /*fourth level: eight 2 samples butterfly*/ vaddh out_low = in_low, in_high vsubh out_high = in_low, in_high
As previous exposed, a key factor to achieve a good performance improvement with SIMD optimization is the efficient handling of unaligned load operations. In general, programmers should structure the application data in order to avoid or minimize misaligned memory accesses. In video compression algorithm, the motion compensation is surely a case where it is not possible avoid unaligned memory accesses because it is impossible to predict motion vectors and consequently align data.
None of the three addressed architectures support unaligned load instructions. Therefore, it is important to efficiently use aligned accesses to load misaligned data from memory. The three ISAs support instructions to concatenate two vectors. This allows a solution consisting in two steps: first, we use two aligned load instructions for loading data in two vector registers, and, then, we concatenate and shift their elements in order to extract a single vector containing the needed data, as shown in Figure
Unaligned load.
If an ISA does not define a SIMD performing this type of concatenate operation, then the unaligned load will be implemented with an extra cost due to the use of additional instructions for merging data between the two vectors.
Algorithm
It is very important that these concatenate instructions can take the offset argument not as a constant value but as a variable value; otherwise, modules such as motion compensation would not get any benefit from using them. For example, the Intel SSSE3 “palignr” instruction concatenates two operands and shift right the composite vector by an offset for extracting an aligned results, but the offset must be a compile-time constant value. This is a big issue for a module as motion compensation, in which it is impossible to know in advance the offset of a misaligned address.
In the H.264 encoder, the most cycle-demanding modules have been optimized using SIMD instructions: motion estimation and compensation, DCT, Intraprediction, and so forth. The best way to compare different instruction sets in order to judge the effectiveness of both SIMD extensions and code optimizations is to measure the speedup obtained with the SIMD-based implementation versus the ANSI C version of the same source code. In order to separate the effect of SIMD performance improvement from ANSI C optimizations, we have inserted SIMD instructions in previously optimized ANSI C modules.
The results are provided in terms of average cycles spent to process one macroblock. The xSTream and P2012 architectures share the same modules subdivision. For the single-core DSP ST240, the subdivision is less fine, and related modules are joined together. In the reported tests, the presence of the ST240 processor is important because it allows comparing the single-processor elements of the multicore platforms to a single-core architecture. Tests are performed on a set of video sequences addressing different resolutions, and average results are resumed in Table
ISA comparison.
It is worth analyzing in detail these results to understand how different instruction sets lead to different performance. The xSTream processor elements take advantage from the “horizontal add” instruction that allow an efficient computation of the SAD operations: it is evident in the ME module, where xSTream spends about 25% fewer cycles than P2012 (84,342 versus 114,776 cycles/MB). The higher speedup obtained by P2012 is mainly due to the less-efficient ANSI C code generated by the P2012 compiler. We already described as the ST240 can exploit a specific instruction for the SAD operation. In fact, its result is not far from the architectures having 128-bit-wide vector registers (the 200, 380 cycles/MB also include motion compensation). From these results, we can state that the support for horizontal SIMD will not only give a great performance improvement for the SAD operation, but it significantly impacts the whole ME module.
As earlier said, data manipulation instructions are a key factor to fully exploit SIMD implementations because operations such as transposing matrices or data reordering become frequent in this type of optimizations. An experimental result confirming this consideration can be seen in the DCT/Q/IQ/IDCT
Intraprediction can exploit the horizontal permute instruction as well; the intraprediction modes involving diagonal directions require the permutation of elements inside the resulting vectors. For similar reasons, ST240 achieves great speedup factors in DCT and intramodules (resp., 1.7 and 1.6), considering that a 32-bit-wide SIMD can only perform two 16-bit-arithmetic operations.
There are other several SIMD instructions that in our opinion are to be considered as key instructions for optimizing video codec applications. Here, we assume an instruction set will already include SIMD for all the common arithmetic operations, compare, select, shift, and memory operations.
In previous sections, we already discussed about the impact of the unaligned memory access to the video codec performance. All the encoder modules are affected by the performance of the unaligned memory operations, but it becomes a keyfactor for motion estimation and compensation. An instruction concatenating two vectors and producing a vector at the desired offset is fundamental to implement an unaligned load instruction. As stated in Section
Inside most of the modules, the computations require a 16-bit precision for intermediate results, but the input and output data contained into the noncompressed YUV images are 8-bit values. Thus, a typical operation at the beginning of a module is to load 8-bit input values and extend them to 16-bit precision. At the end, the output data precision is usually demoted down to 8-bit saturating the values before storing the results. Therefore, even if the support to 8-bit operations is not required, it would be very useful that an instruction set will include SIMD instructions for promoting and demoting precision in a fast way. An optimal solution will also combine promotion with load operations and demotion with store instructions.
Usually, the video codec algorithms try to avoid the division operations because of its computational cost. When needed, divisors are power of two, and the division is substituted with a shift right with rounding as follows:
Therefore, even if most instruction sets already include this type of instruction, it is important to remind its utility. Often, the shift right with rounding is used for averaging two or more values, as in the intraprediction and deblocking filter. In our implementation, one of the reasons the ST240 achieves a good speedup in the intraprediction module is the presence of an average SIMD instruction in the instruction set.
Table
SIMD instructions for video coding.
Instruction description | Affected modules | Notes |
---|---|---|
|
ME, intraprediction | Speeds up SAD |
|
Intraprediction, DCT/Q/IQ/IDCT | Allows zig-zag scan and speeds up intra diagonal modes |
|
Motion estimation and compensation | Allows software implementation of unaligned load |
|
All the main modules | It will speed up the load and store operations for several modules |
|
ME, intraprediction, deblocking filter | Speeds up SAD in conjunction with horizontal add; used in deblocking filter |
|
IDCT, deblocking filter, motion compensation | Speeds up |
|
Intraprediction, deblocking filter, motion compensation | Speeds up |
Cycles/MB spent in each module for each ISA.
xSTream | P2012 | ST240 | |||||||
ANSI C | SIMD | Gain factor | ANSI C | SIMD | Gain factor | ANSI C | SIMD | Gain factor | |
|
|||||||||
Luma motion compensation | 4788 | 2257 | 2.1x | 8286 | 3965 | 2.1x | |||
Croma motion compensation | 3064 | 658 | 4.7x | 3626 | 1282 | 2.8x | 265559 | 200380 | 1.3x |
Motion estimation | 303769 | 84342 | 3.6x | 603182 | 114776 | 5.3x | |||
|
|||||||||
Intra 4 × 4 | 24366 | 10076 | 2.4x | 38234 | 15760 | 2.4x |
|
|
|
Intra 8 × 8 | 15396 | 4997 | 3.0x | 26972 | 9455 | 2,9x | |||
|
|||||||||
DCT/Q/IQ/IDCT 4 × 4 | 14994 | 7616 | 2.0x | 20473 | 9088 | 2.3x |
|
|
|
DCT/Q/IQ/IDCT 8 × 8 | 18660 | 3498 | 5.3x | 24486 | 11636 | 2.1x |
This paper presents efficient implementations of the H.264/AVC encoder on three different ISAs. The optimization process exploits the SIMD extensions of the three architectures for improving the performance of the most time-consuming encoder modules. For each addressed architecture, experimental results are presented in order to both compare the different implementations and evaluate the speedup versus the optimized ANSI C code.
The paper discusses how SIMD size and different instruction sets can impact the achievable performance. Several issues affecting video-coding SIMD optimization are discussed, and authors’ solutions are presented for all the architectures.
Most instruction sets have specific SIMD instructions for video coding. Even though these instructions can lead to great performance improvements, they could be useless for other application families. In this paper, we identify a set of generic SIMD instructions that can significantly improve the performance of video applications.
Besides presenting the SIMD optimization for the most time-demanding modules, the paper describes how a complex application as the H.264/AVC encoder can be partitioned to a multicore architecture.
The authors would like to thank STMicroelectronics’s Advanced System Technology Laboratories for their support. This work is supported by the European Commission in the context of the FP7 HEAP project (#247615).