We propose a Domain-Specific Architecture for elementary function computation to improve throughput while reducing power consumption as a model for more general applications: support fine-grained parallelism by eliminating branches, and eliminate the duplication required by coprocessors by decomposing computation into instructions which fit existing pipelined execution models and standard register files. Our example instruction set architecture (ISA) extension supports scalar and vector/SIMD implementations of table-based methods for calculating all common special functions, with the aim of improving throughput by (1) eliminating the need for tables in memory, (2) eliminating all branches for special cases, and (3) reducing the total number of instructions. Two new instructions are required: a table lookup instruction and an extended-precision floating-point multiply-add instruction with special treatment for exceptional inputs. To estimate the performance impact of these instructions, we implemented them in a modified Cell/B.E. SPU simulator and observed an average throughput improvement of 2.5 times for optimized loops mapping single functions over long vectors.
Elementary function libraries are often called from performance-critical code sections and hence contribute greatly to the efficiency of numerical applications; together with linear algebra libraries, they largely determine the performance of many important applications. Current hardware trends affect this performance in several ways: longer pipelines and wider superscalar dispatch favour implementations which distribute computation across different execution units and present the compiler with more opportunities for parallel execution, but they make branches more expensive; Single-Instruction-Multiple-Data (SIMD) parallelism makes handling special cases via branches very expensive; memory throughput and latency, which are not advancing as fast as computational throughput, hinder the use of lookup tables; and power constraints now limit performance more than area does.
The last point is interesting and gives rise to the notion of “dark silicon,” in which circuits are designed to be left unused or underused to save power. The consequences of these thermal limitations on silicon usage have been analyzed [
Our proposal is less radical: instead of adding specialized coprocessors, we add novel fully pipelined instructions to existing CPUs and GPUs, use the existing register file, reuse existing silicon for expensive operations (e.g., fused multiply-add), and eliminate costly branches, while adding embedded lookup tables, which are a very effective use of dark silicon. In the present paper, we demonstrate this approach for elementary function evaluation, that is,
To optimize performance, our approach takes the successful accurate table approach of Gal et al. [
Although fixed powers (including square roots and reciprocals) of most finite inputs can be efficiently computed using Newton-Raphson iteration following a software or hardware estimate [
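The Newton-Raphson refinement mentioned here is easy to sketch in software. The following minimal Python illustration is not the hardware algorithm; the seed value and iteration count are illustrative assumptions:

```python
def refine_recip(a, seed, iterations=4):
    """Refine an initial reciprocal estimate by Newton-Raphson.

    Each step computes r <- r * (2 - a*r), which roughly doubles the
    number of correct bits, so a low-precision estimate converges
    quickly for normal finite inputs.
    """
    r = seed
    for _ in range(iterations):
        r = r * (2.0 - a * r)
    return r

# A crude seed (here ~3 correct bits for 1/3) reaches close to
# double precision in four steps.
approx = refine_recip(3.0, 0.3)
```

Because the relative error squares on each step, the iteration count needed depends only logarithmically on the seed precision, which is why a small hardware estimate table suffices for fixed powers.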
For evaluation of the approach, the proposed instructions were implemented in a Cell/B.E. [
In the following, the main approach is developed, and the construction of two representative functions,
Driven by hardware floating-point instructions, the advent of software pipelining and the shortening of pipeline stages favoured iterative algorithms (see, e.g., [
In proposing Instruction Set Architecture (ISA) extensions, one must consider four constraints: (1) the limit on the number of instructions imposed by the size of the machine word, together with the desire for fast (i.e., simple) instruction decoding; (2) the limit on arguments and results imposed by the architected number of ports on the register file; (3) the limit on total latency required to prevent an increase in maximum pipeline depth; and (4) the need to balance increased functionality against increased area and power usage.
As new lithography methods cause processor sizes to shrink, the relative cost of increasing core area for new instructions is reduced. The net cost may even be negative if the new instructions can reduce code and data size, thereby reducing pressure on the memory interface (which is more difficult to scale).
To achieve a performance benefit, ISA extensions should do one or more of the following: reduce the number of machine instructions in compiled code, move computation away from bottleneck execution units or dispatch queues, or reduce register pressure.
Considering the above limitations and ideals, we propose adding two instructions, the motivation for which follows below:
It is easiest to see them used in an example. Figure
Data flow graph with instructions on vertices, for
The dotted box in Figure
The gray lines indicate the data flow between the two lookups for three possible implementations: (i) the dependency is carried directly through an architected register; (ii) it is carried indirectly through a hidden FIFO; or (iii) it is carried indirectly through hidden registers selected by tags.
In the first case, the dependency is direct. In the second two cases the dependency is indirect, via registers internal to the execution unit handling the lookups.
All instruction variations have two register inputs and one or no outputs, so they will be compatible with existing in-flight instruction and register tracking. On lean in-order architectures, the variants with indirect dependencies—(ii) and (iii)—reduce register pressure and simplify modulo loop scheduling. This would be most effective in dedicated computational cores like the SPUs in which preemptive context switching is restricted.
The variant (iii) requires additional instruction decode logic but may be preferred over (ii) because tags allow
In low-power environments, the known long minimum latency between the
To facilitate scheduling, it is recommended that the FIFO or tag set be sized to a power of two greater than or equal to the latency of a floating-point operation. In this case, the number of registers required will be less than twice the unrolling factor, which is much lower than what is possible for code generated without access to such instructions. The combination of small instruction counts and reduced register pressure eliminates the obstacles to inlining these functions.
We recommend that
A key advantage of the proposed new instructions is that the complications associated with exceptional values (0,
Iterative methods with table-based seed values cannot achieve this in most cases because, to prevent over/underflow for high and low input exponents, matched adjustments are required before
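The failure mode for iterative methods at the exceptional points can be seen in a small example (Python, illustrative only):

```python
import math

def newton_step(a, r):
    # One Newton-Raphson step toward 1/a: r * (2 - a*r).
    return r * (2.0 - a * r)

# A correctly signed infinite seed for recip(0) cannot survive
# refinement: the product 0 * inf inside the step is NaN under the
# IEEE-754 rules, so the iteration destroys the infinite result.
# This is why matched scaling adjustments (or, in our proposal,
# special treatment inside the instructions) are needed.
bad = newton_step(0.0, math.inf)   # NaN, not inf
ok = newton_step(2.0, 0.4)         # finite inputs refine normally
```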
By using the table-based instruction twice, once to look up the value used in range reduction and once to look up the value of the function corresponding to the reduction, and introducing an extended-range floating-point representation with special handling for exceptions, we can handle both types of exceptions without extra instructions.
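As an illustration of the double-lookup idea, the following Python sketch evaluates log2 with a hypothetical 64-entry table: the first lookup supplies a reciprocal-like multiplier for range reduction, and the second supplies the matching function value. It handles only positive normal inputs and omits the extended-range representation and exceptional-case handling that the proposed instructions provide in hardware:

```python
import math

N = 64  # table size; an assumption for illustration only

# First lookup: multiplier r for each mantissa interval.
# Second lookup: t = -log2(r), the function value matching that reduction.
RECIP = [1.0 / (1.0 + (i + 0.5) / N) for i in range(N)]
LOG2T = [-math.log2(r) for r in RECIP]

def table_log2(x):
    e = math.frexp(x)[1] - 1          # unbiased exponent, so m is in [1, 2)
    m = x / 2.0 ** e
    i = int((m - 1.0) * N)            # index from the leading mantissa bits
    y = m * RECIP[i] - 1.0            # reduced argument, |y| <~ 1/(2N)
    # Short polynomial for log2(1 + y), degree 4 here for illustration.
    p = (y - y**2 / 2 + y**3 / 3 - y**4 / 4) / math.log(2.0)
    return e + LOG2T[i] + p
```

Note that no branches are needed on this path: the exponent, the two lookups, and the polynomial combine with adds and fused multiply-adds only.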
In the case of finite inputs, the value
The second
In Table
Take, for example, the first cell in the table, for recip, computing
Contrast this with the handling of approximate reciprocal instructions. For the instructions to be useful as approximations
The other cases are similar in treating
In Table
Finally, for exponential functions, which return fixed finite values for a wide range of inputs (including infinities), it is necessary to override the range reduction so that it produces an output which results in a constant value after the polynomial approximation. In the case of exponential, any finite value which results in a nonzero polynomial value will do, because the second
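A software model of this override for exp2 might look like the following sketch (the 256-entry table size matches the results table; the saturation thresholds 1024 and −1075 are the IEEE-754 double overflow and underflow bounds; the branch here stands in for the substitution that the lookup hardware performs branch-free):

```python
import math

N = 256                                  # entries, as in the exp2 table
LN2 = math.log(2.0)
EXP2T = [2.0 ** (i / N) for i in range(N)]

def table_exp2(x):
    # For saturating inputs, substitute a fixed scale (0 or inf) and
    # override the reduced argument with a harmless finite value, so
    # the final multiply yields the constant result.
    scale = None
    if x >= 1024.0:
        scale, x = math.inf, 0.0
    elif x <= -1075.0:
        scale, x = 0.0, 0.0
    n = math.floor(x * N)
    k = n % N                            # index into the 2**(k/N) table
    e = (n - k) // N                     # integer part of x
    u = (x - n / N) * LN2                # reduced argument, 0 <= u < ln2/N
    p = 1.0 + u * (1.0 + u * (0.5 + u / 6.0))   # short e**u polynomial
    v = EXP2T[k] * p                     # v is finite and nonzero
    return scale * v if scale is not None else math.ldexp(v, e)
```

Because the overridden reduction guarantees a nonzero finite polynomial value, multiplying it by the substituted scale of 0 or infinity produces exactly the required constant without any special-case branch in the inner loop.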
A simplified data flow for the most complicated case,
Bit flow graph with operations on vertices, for
Starting from the top of the graph, the input (b) is used to generate two values (c) and (d),
Partial decoding of subnormal inputs (g) is required for all of the functions except the exponential functions. Only the leading nonzero bits are needed for subnormal values, and only the leading bits are needed for normal values, but the number of leading zero bits (h) is required to properly form the exponent for the multiplicative reduction. The only switch (i) needed for the first
On the right hand side, the lookup (e) for the second
The integer part has now been computed for normal inputs, but we need to switch (s) in the value for subnormal inputs which we obtain by biasing the number of leading zeros computed as part of the first step. The apparent 75-bit add (t) is really only 11 bits with 10 of the bits coming from padding on one side. This fixed-point number may contain leading zeros, but the maximum number is
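In software terms, the partial decoding of subnormals and the leading-zero count described above correspond to the following sketch (field extraction only; the figure's switches and bias adjustments are omitted):

```python
import struct

def decode_double(x):
    """Extract IEEE-754 fields, partially decoding subnormals.

    For subnormal inputs there is no hidden bit, so the count of
    leading zero bits in the fraction is needed to form the exponent
    used by the multiplicative range reduction.
    """
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    biased = (bits >> 52) & 0x7FF
    frac = bits & ((1 << 52) - 1)
    if biased == 0:                      # subnormal (or zero)
        lz = 52 - frac.bit_length()      # leading zero bits in fraction
        return -1023 - lz, frac, lz      # exponent of the leading set bit
    return biased - 1023, (1 << 52) | frac, 0
```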
If the variants (ii) and (iii) are implemented, either the hidden registers must be saved on context/core switches, or such switches must be disabled during execution of these instructions, or execution of these instructions must be limited to one thread at a time.
Two types of simulations of these instructions were carried out. First, to test accuracy, our existing Cell/B.E. functional interpreter, see [
Accuracy, throughput, and table size (for SPU/double precision).

| Function | Cycles/double (with new instructions) | Cycles/double (without) | Speedup (%) | Max error (ulps) | Table size | Poly order |  |
|---|---|---|---|---|---|---|---|
| recip | 3 | 11.3 |  | 0.500 | 2048 | 3 |  |
| div | 3.5 | 14.9 |  | 1.333 | recip | 3 |  |
| sqrt | 3 | 15.4 |  | 0.500 | 4096 | 3 | 18 |
| rsqrt | 3 | 14.6 |  | 0.503 | 4096 | 3 |  |
| cbrt | 8.3 | 13.3 |  | 0.500 | 8192 | 3 | 18 |
| rcbrt | 10 | 16.1 |  | 0.501 | cbrt | 3 |  |
| qdrt | 7.5 | 27.6 |  | 0.500 | 8192 | 3 | 18 |
| rqdrt | 8.3 | 19.6 |  | 0.501 | qdrt | 3 | 18 |
| log2 | 2.5 | 14.6 |  | 0.500 | 4096 | 3 | 18 |
| log21p | 3.5 | n/a | n/a | 1.106 | log2 | 3 |  |
| log | 3.5 | 13.8 |  | 1.184 | log2 | 3 |  |
| log1p | 4.5 | 22.5 |  | 1.726 | log2 | 3 |  |
| exp2 | 4.5 | 13.0 |  | 1.791 | 256 | 4 | 18 |
| exp2m1 | 5.5 | n/a | n/a | 1.29 | exp2 | 4 |  |
| exp | 5.0 | 14.4 |  | 1.55 | exp2 | 4 |  |
| expm1 | 5.5 | 19.5 |  | 1.80 | exp2 | 4 |  |
| atan2 | 7.5 | 23.4 |  | 0.955 | 4096 | 2 | 18 |
| atan | 7.5 | 18.5 |  | 0.955 | atan2 | 2 + 3 |  |
| asin | 11 | 27.2 |  | 1.706 | atan2 | 2 + 3 + 3 |  |
| acos | 11 | 27.1 |  | 0.790 | atan2 | 2 + 3 + 3 |  |
| sin | 11 | 16.6 |  | 1.474 | 128 | 3 + 3 | 52 |
| cos | 10 | 15.3 |  | 1.025 | sin | 3 + 3 |  |
| tan | 24.5 | 27.6 |  | 2.051 | sin | 3 + 3 + 3 |  |
(a) Values returned by

| Function | Finite > 0 |  |  |  | Finite < 0 |
|---|---|---|---|---|---|
| recip |  | 0, 0 | 0, 0 | 0, |  |
| sqrt |  | 0, | 0, NaN | 0, 0 | 0, NaN |
| rsqrt |  | 0, 0 | 0, NaN | 0, | 0, NaN |
| log2 |  | 0, | 0, NaN | 0, | 0, NaN |
| exp2 |  | 0, | NaN, 0 | 0, 1 |  |
|  | Finite |  |  | NaN |
|---|---|---|---|---|
| Finite |  |  |  |  |
|  |  |  |  |  |
| NaN |  |  |  |  |

|  | Finite |  |  | NaN |
|---|---|---|---|---|
| Finite ≠ 0 |  | 2 | 2 | 2 |
|  |  |  |  |  |
| NaN0 |  |  |  |  |
| NaN1 |  |  |  |  |
| NaN2 |  |  |  |  |
| NaN3 |  |  |  |  |
Throughput, measured in cycles per double, for implementations of elementary functions with (upper bars) and without (lower bars) the novel instructions proposed in this paper.
We have demonstrated considerable performance improvements for fixed-power, exponential, and logarithm calculations by using novel table lookup and fused multiply-add instructions in simple branch-free accurate-table-based algorithms. Performance improved less for trigonometric functions, but that improvement will grow with more cores and/or wider SIMD. These measurements ignore the power savings from reduced instruction dispatch, fewer function calls and branches, and fewer memory accesses for large tables, which suggests that these algorithms will continue to scale longer than conventional ones.
For target applications, just three added opcodes pack a lot of performance improvement, but designing the instructions required insights into the algorithms, and even a new algorithm [
The authors declare that there is no conflict of interests regarding the publication of this paper.
The authors thank NSERC, MITACS, Optimal Computational Algorithms, Inc., and IBM Canada for financial support. Some work in this paper is covered by US Patent 6,804,546.