Massively Parallel Searching for Better Algoritlnns or How to Do a Cross Product with Five Multiplications

A number of "tricks" are known that trade multiplications for additions. The term "tricks" reflects the way these methods seem not to proceed from any general theory, but instead jump into existence as recipes that work. The Strassen method for 2 x 2 matrix product with seven multiplications is a well-known example, as is the method for finding a complex number product in three multiplications. We have created a practical computer program for finding such tricks automatically, where massive parallelism makes the combinatorially explosive search tolerable for small problems. One result of this program is a method for cross products of three-vectors that requires only five multiplications. © 1996 John Wiley & Sons, Inc.


INTRODUCTION
Humans have talents that are hard to program.Driving a car, recognizing continuous speech, and playing chess have proved far more challenging to computers than they have to people.To this class we can probably add algorithm optimization.How is it that a mathematician can look at a simple algorithm for complex product like u.,_aXc-bxd and discover that the number of multiplication op-erations can be reduced to three.One method, which seems far from obvious. is The conventional method for 2 X 2 matrix products calls for eight multiplications and four additions.(In this article, we equate additions and subtractions in assessing operation count because they are computationally similar).
The trick behind the Strassen method for 2 X 2 matrix products [20] is even more abtruse and baffling than the shortcut for complex products (Yuval presents a possible derivation, see [22]): Where is the pattern in these methods?Although the theory of trilinear forms [ 16,17] has helped guide some shortcuts, it appears that these methods are the result of inexplicable intuitive leaps by some very bright people.
In August 1991, the authors began an expeti-rnent to see if computers, especially massively parallel computers, could discover these tricks, and perhaps find new ones.Beginning with brute force search methods that ran for days on very small algorithms, the program became gradually more sophisticated and able to handle interesting problems within the limits of our patience.Recently, an nCCBE 2 with 256 processors revealed that it is possible to compute the cross product of two 3dimensional vectors using only five multiplications, and this method is new to the best of our knowledge.The conventional method requires six multiplications and three subtractions:

k
The algorithm with five multiplications, found by the computer program, is: The rest of the article describes our search program.Shortcut ways of computing the given expressions are found by doing a search among all possible expressions that can be derived from the given set of variables and operations.Because this exhaustive enumeration has a combinatorially large number of expressions to explore, strategies are developed which reduce the search.Several "pruning" strategies are u;;ed to avoid the exploration of unpromising subtrees.Parallel computers are employed to conduct the search in parallel, to achieve higher speed.

PROBLEM SPECIFICATION
The problem can be defined as follows: Given: A set of variables a budget of M multiplications and A additions and a set of goal expressions Find: A sequence of (expl op exp2) triples where exp 1 and exp2 are selected from either the set of variables or previously computed expressions, and op is chosen from theremaining multiplications or additions that compute the goal expressions, and that minimize the total number of operations or the number of operations of a particular type.
We are interested in minimizing the number of operations of a particular type because the relative cost of the operation types is different in general.For example, in Strassen's matrix product algorithm, the variables involved could themselves be matrices, and matrix products are costlier than matrix sums.Any algorithm for finding the product of two k X k matrices in Jl,f multiplications can be used recursivelv to find the product of two n X n [1,20].For such a problem, we are obviously interested in minimizing the number of multiplications, even at the cost of increasing the total number of addition operations.
For the class of problems addressed in this artide, addition ( + ), subtraction (-), and multi plication (X) operations are sufficient.Addition and multiplication are commutative, while subtraction is not.For the purposes of uniformity, we introduce a notation "reverse subtraction" ( ~ ), with the associated meaning that exp 1 ~ exp2 is the same as exp2 -exp1.With this, we can impose an ordering on exp1 and exp2 and explore only cases with exp 1 < exp2 without omission.Also, we use the term ''add'' to denote anv of addition/ subtraction/reverse subtraction.Throughout the article, we assume that operation cost is data independent.We also use the term ''product'' to refer to multiplication of entities like complex numbers and matrices and reserve the term "multiplication" for real numbers.

MODELING AS A SEARCH PROBLEM
Consider the set of all possible expressions that can be derived from the given set of variables and operations.These can be thought of as a graph with each node representing an expression (Fig.

1)
. \V e use the tenn ''expression'' for a node as long as it results in no confusion, because each node can be specified by the expression it represents.The graph G(V, E) can be defined as follows: 1.Each of the given set of variables forms a node in the graph.

2.
If two expressions e 1 , e~ E V, then (e 1 op e 2 ) E V for every choice of operation op.
We can also think of edges from e 1 and e 2 to the node (e 1 op e 2 ).Clearly, this graph represents all possible expressions that can be generated from the given variables and operation types.Note that each node represents a unique expression and the way of generating it starting from the variables can easilv be found bv following the edges into the node repr~senting this .expression.However, two or more expressions could be mathematically equivalent.
The number of operations required to compute any expression in the graph can be found by recursively following the edges coming into the expression.The number of operations for computing e 1 op e 2 is one more than the operations required for computing e 1 and e 2 , except that expressions common to the paths for e 1 and e:2.need be computed only once.
A set of nodes in this graph whose expressions are mathematically equivalent to the goal expressions constitutes a way of computing the goal expressions with the associated number of operations.The problem then translates to finding the appropriate set of nodes such that the associated number of operations is the minimum over all such possible sets.
Because there is no budget for the number of operations, the graph has an infinite number of nodes.The nodes of the graph can easily be ordered according to the number of operations required to compute the expressions at the nodes.The nodes at level 0 constitute the given set of variables.The nodes at level i consist of expressions that can be computed using exactly i opera-  tions.We can easily place a bound on the search because we are interested in minimizing the number of operations.If a known way of generating the goal expressions needs k operations, we need only look at nodes at levels less than k.
Because the graph has a combinatorially explosive number of nodes, it would be impractical to Search() For input1 = 1 to level-1: For input2 = 1 to level-1:

THE BASIC SEARCH ALGORITHM
The Search algorithm is simply a depth-first search on the tree described in the previous section.Note that we do not allow operations involving specific integers; only letters involving unknown quantities.The naive form of the basic algorithm is easily stated: Input: m variables, a budget consisting of M multiplications, and A adds, and n goal expressions to be determined.
Output: A way of computing the goal expressions without exceeding the budgeted number of operations or a statement that no such method exists.
Method: Initialize the first m levels to the input variables.Initialize level to m + 1.
For each operation type left in the budget: Apply operation to the expressions at input1 and input2.
Remove the operation from the budget.
If the new expression is a goal expression not yet found, mark the goal expression as found.If all the goals are found, retum the solution.

If level< m + M + A, Search().
Cnmark the goal expression found, if any.Restore operation to the budget.End For.End For.End For.generate and store the graph during the search process.Also, solutions are difficult to identify, because matching subgraphs to the goal expressions (as algebraic equivalents) is itself a combinatorial problem.To avoid these problems, it is convenient to do a depth-first search on a tree as shown in Figure 2.
The tree is constructed as follows: The given set of variables forms the first m levels of the tree, one node at each level.The children of node i are all the expressions that can be formed using an operation and any two expressions on the path from the root to node i.The goal is to search for a path in the tree which contains expressions equivalent to goal expressions.Because each node accounts for one operation, paths containing more than the budgeted number of operations need not be explored.
In principle, questions like, "Is there a method of finding the product of two complex numbers involving three real multiplications and five real adds?"can be answered by exhaustive search.This approach is naive in general because there are so many possible algorithms with these constraints.Because the two input expressions can be taken from any previously computed expression, the number of input combinations is (A + Af + m-1)! 2 /(m -1)! 2 .There are (A ~i\-1) ways to place the multiplications in the set of steps.Because the "add'' operations can be any of addition, subtraction, or reverse subtraction, there are 3 1 possible sets of add operations for any specified placing of multiplications.The total number of elements in the search space by the naive method is

M
For the complex product with four inputs and a budget of three multiplications and five adds, this value is approximately 6.02 X 10 17 .
A massively parallel collection of 8000 processors, each checking one million nodes per second, would take more than 2 years in the worst case that there is only one such algorithm and that it is the last one checked.The existence of multiple algorithms in the search space and termination of the search when the first is found might reduce this time by an order of magnitude, but it would still not be the sort of problem casually attempted with 1992 technology!
To make the problem more tractable, we began the accumulation of "pruning rules" for eliminating subtrees in the search without any danger of missing a solution.These pruning rules are conservative because they still allow the resulting tree to be called an exhaustive search.Later we consider pruning rules that seem to greatly assist in finding clever methods that trade multiplications for adds, but cannot be used to prove the nonexistence of a method with a given operation budget.

CONSERVATIVE PRUNING RULES
To motivate the need for pruning rules, look at what is probably the simplest trick in all of elementary algebra: a 2 -b 2 =(a+ b)(a-b).The factoring reduces the number of multiplications from two to one, at the cost of increasing the number of adds from one to two.If we set up a search tree using Algorithm A, the first step could be any of mented by changing loop control variables instead of using explicit "if" statements.Some of the pruning strategies are rather straightforward and obvious.Nevertheless, they are very important in view of the significant amount of reduction in the search.
• Before we enumerate the pruning rules, it is necessary to introduce some notions.The program always keeps track of the path from the root of the tree to the current node along with the way each expression on the path is derived.Expressions are represented as polynomials in the input variables.The terms of the polynomials are ordered to make it easy for addition and subtraction operations and to check mathematical equality of expressions.The expressions on the path are numbered starting from the root.The way an expression is derived is represented by the numbers denoting the parent expressions and the operation used.Figure 3 makes the ideas clear.
We can also define a lexicographic ordering on the expressions based on the way they are derived.
Let/ 1 op 1 g 1 and/ 2 op 2 g 2 denote two expressions e 1 and e 2 on a path.We say e 1 precedes e 2 in lexicographic order if g 1 < g 2 or if g 1 = g 2 and op 1 precedes op 2 in lexicographic order or if g 1 = g 2 , op 1 = op 2 , and/ 1 < / 2 .The ordering on operations is defined to be"~"<"-"<"+"<" X".There are several possible choices for defining a lexicographic ordering and there is nothing special about the specific order chosen.The ordering is useful in defining some pruning rules.A number of pruning strategies are discussed below.For each pruning rule that is used, we state the rule along with a brief justification wherever it is appropriate.

equal to the total number of operations plus the number of variables.
Each expression on a path (except the variables) is formed by one operation.Therefore, paths of length more than the budgeted number of operations plus the number of variables cannot provide the desired solution.

Restrict the number of operations of each particular ~ype.
lf the budget of operations of a particular type is consumed on a path from the root to a node, all children of the node using the particular operation can be pruned.

Subtracting an expression from itself is not useful, and can be excluded from the search tree.
This assumes that 0 is not a goal expression.

If the expression at a node is the same as the expression at one of its parent nodes, delete the node and the subtree under it.
This is obvious because no shortcut will have the same expression computed twice.
In Figure 4, the node labeled b(3 : 0 -2) is deleted because it represents the same expression as the node labeled b ( 1).

Eliminate expressions with a leading negative term, except when one of the goal expressions contains a leading negative term.
Recall that the terms of expressions are ordered and hence there is no ambiguity in deciding if an expression has a leading negative term.An expression is negative if it has a leading negative term and positive otherwise.For each path P( 1 ) containing a negative expression, we show the existence of another path p!'21 which does not contain such expressions and is a solution if p(ll is a solution.p!'2) satisfies the property that every expression in it is the same as the corresponding expression in p(ll or its negative.We construct p!'2 starting from the root, such that at each stage the constructed partial path satisfies the above property.To start, the first m levels constitute the variables and hence satisfy the property.Let e~1 be the next expressi.onto be added to P( 2 1 and e1 1 ) be the corresponding expression on pr 1 I.Let e~3 1 I = e\ 1 1 op e&ll.We can show that for every choice of op and for every possible combination of signs of e1 1 and e2 , e3 • can be constructed such that ek 21 = -e~1 1 if e1 11 is negative and e~2 1 = e1 1 1 otherwise.For example, if e1 11 = e\ 1 1 + e& 1 1 and e~1 ) and e& 1 1 are negative, let e~2 1 = e\ 2 1-e& 2 1.Because e~2 1 = ei 11 and e! 2 1 = -e( 1 ) e( 2 ) = -er 1 ) -e( 1 ) = -e( 11 as de- sired.If all goal expressions are positive and P( 1 1 contains a solution, then p!'21 must also contain a solution by the above construction.Therefore, we can prune the subtree under any expression that is negative.

The level of the second input node should be the same as or greater than the level of the first input node.
This rule was explained in the beginning of this section.

Eliminate exploring paths with an identical set of expressions, by requiring expressions on each path to be in increasing lexicographic order starting from the root.
For each path P! 1 1 that does not contain expressions in increasing lexicographic order, we can show the existence of another path p! 2 1 containing the same expressions as p!ll, but in the proper order.Let e\ 1 1 and e& 1 ) be two expressions on P( 1 i that do not conform to the lexicographic order.Without loss of generality, let ei 1 ) be the expression nearer to the root.Because e& 1 1 precedes e\ 1 1 in lexicographic order, the expressions needed to compute e& 1 ) appear before e\ 1 1 on the path, and therefore e\ 1 1 and e& 1 I can be swapped.We can construct p! 2 1 by systematically swapping every two expressions on p(ll that do not conform to the lexicographic order.Because P! 1 1 and P! 2 1 contain the same expressions, p('2) represents a solution if p!11 represents a solution.
Using this rule, we can prune each node that precedes its parent in lexicographic order.In Figure 5, the two paths shown have identical expressions and hence the subtre.esunder them are identical.The tree is pruned at node labeled a -b(3: 0 -1) because it precedes its parent in lexicographic order.This avoids computing the same subtree twice.8.Among all the children of a node with mathematically equivalent expressions, choose the one that is smallest in the lexicographic ordering.This rule is also useful to avoid computing duplicate subtrees.The expression that is smallest in the lexicographic ordering is chosen because by rule 7, the subtree under such a node contains the subtree under a node representing the same expression and having the same parent node.For example, in Figure 6, the node labeled bc(5: 3 ~ 4) is pruned as it succeeds the node labeled b -c(5: 2 -3) in lexicographic order and has a common parent.
9. Levels to explore should be at least as many as goal expressions to be found.
A path representing a solution should contain all the goal expressions.Paths with not enough room for all the goals need not be explored.10.In a shortcut, each expression is used at least once.Each expression is formed by at most two expressions on the current path.Also, the number of expressions on the path is bounded.This places a limit on the number of expressions that can be used in forming subsequent expressions on the path.If there is no possibility that all the expressions (except the goal expressions) are used at least once, the tree can be pruned at this node.
T' ' ' ""' (3 "' ' f §i li:." -' '  Search tree with two children of a node representing the same expression. Let l = length of the path constructed so far, g = number of goals yet to be found, and u = number of unused expressions on the path.As only paths of length at most m + M + A are explored, the constructed path can be extended to contain (m + M + A) -l more expressions and at most 2 X (m + M +A -l) different expressions could be used to construct these.There are u expressions on the path explored that are unused and (m + M + A -l -g) expressions on any extended path that have to be used.Therefore, any extension of the path could be a shortcut and a solution only if Otherwise, the subtree under the path constnicted so far can be pruned.
It should be noted that some of the pruning rules are more expensive than the others.This can be viewed as moving the tree traversal cost to the node expansion cost, and the trade off in cost should be considered.For example, rule 4 involves determining whether the expression is identical to one of its ancestors.Implementing this rule is fairly cheap and effective when the node is close to the root because there are few nodes to compare and large subtrees to prune.However, when it is close to the maximum level (as defined by rule 1 ), the cost of comparison increases dramatically but the payoff decreases significantly.Rule 8 involves comparing the expression to its siblings generated before, making it impractical to implement even for nodes fairly close to the root.In fact, the siblings are not available in a depth-first search as only the path from the root to the current node is stored.In such a case, a restricted version of the rule might be more useful.For example, rule 8 implies that any expression in first degree of length two should be constructed using only the given variables and this can be checked with just two comparisons.Except for rules 4 and 8, the remaining rules can be implemented using a constant number of operations irrespective of the position in the tree of the node being tested.
There are other pruning rules that guarantee that if a solution were to be found in one of the deleted subtrees, then an equally economical solution is guaranteed to be found in another subtree to be explored.These rules usually stem from the commutativity of the operation the goal expressions represent or from symmetry considerations.
As an example, consider the complex product, which is a commutative operation.
Exchanging a with c and b with din any solution that computes the complex product also gives a solution for the same.Hence, in the search program, if a path from the root to a node can be derived from another path by the above exchange, the subtree under that node need not be explored.In fact, we can easily derive the missed solutions from the solutions found.
Such a pruning rule is very useful because it prunes the tree at the first level.For, if a path can be derived from another, the first node in the path can also be derived and the tree should have been pruned at the first node on the path itself.
As another example of a similar rule, consider the 2 X 2 matrix product.Interchanging the rows of the first matrix and/ or the columns of the second matrix does not affect the goal expressions to be computed.This prunes three of four nodes at the first level of the tree.

AGGRESSIVE PRUNING RULES
As the size of the search tree is exponential both in the number of variables and the number of operations, we need as many pruning rules as possible to be able to tackle interesting problems in reasonable time.Due to this, we added some pruning rules that seem to be intuitively appealing, without the guarantee that they eliminate only unpromising subtrees.These rules greatly assist in finding clever solutions but cannot be used to prove the nonexistence of any methods with a given operation budget: 1. Eliminate expressions with degree higher than the highest degree among goal expressions.
2. If the goal expressions are all homogeneous, do not allow expressions with terms of different degree.
3. If the goal expressions do not contain terms of the form n X expression, where n is an integer, In I > 1, and expression is a product of input variables, then we can also exclude all operations that result in such terms.This set of pruning rules, along with the conservative pruning rules, reduces the number of possibilities for the a 2 -6 2 search space from 15,552 to a much more manageable 24 expressions.The savings are more dramatic for larger search trees, of course.The resulting search tree for the computation of a 2 -6 2 is shown in Figure 7.
A few more pruning rules were found useful in dealing with goal expressions denoting the product of two mathematical entities like complex numbers, vectors, and matrices.None of the goal expressions for such a product contains a term with two variables drawn from the same operand and none of the known shortcuts contains any intermediate expressions with such terms.
We define the structure of a term to be the sequence of operand names from which the variables forming the term are drawn.For example, with respect to the complex product (a+ ib) X (c + id), the term be has the structure (exp1, exp2), where exp1 and exp2 refer to the first and the second operand, respectively.The following pruning rules are based on the structures of expressions.Due to these pruning rules, each expression is made of the terms of the same structure and the structure of any term can be taken as the structure of the entire expression.
4. +,operations are allowed only on operands of the same structure.5. X is allowed only on operands of different structures.
For example, in the search tree for computing the complex product, (a -b) X (a + b) is not allowed whereas (a + b) X (c + d) is allowed.

THE PARALLEL PROGRAM
Besides the speed gained by using the pruning rules, the search may be done in parallel to achieve further speed.The parallel approach is simply that of master-slave load allocation.In the serial version, a depth-first search of the tree is done until a path containing goal expressions is found.In the parallel program, several processors can be used with each processor exploring a subtree of the entire tree.The master-slave approach is inherently suited to multiple instruction multiple data (MIMD) computers.The primary obstacle to single instruction multiple data (SIMD) computers is the large number of branching tests from the pruning rules resulting in disparate control flow [3].For the MIMD approach, one processor is used as the master processor delegating work (subtrees) to the other processors.The master processor does a depth-first search of the entire tree.However, on reaching a specified number of levels, it delegates the underlying subtree to an idle processor instead of exploring it.The master processor then backtracks and continues the search.Each of the slave processors performs a depth-first search on the subtrees assigned to it and reports the solution to the master, if found.To avoid waiting for work, each slave processor is given its next subtree while it is computing the current one.With this, all the slave processors are busy most of the time.On 256 nodes, efficiencies of 99.8% are typical.
A key parameter in the parallel program is the number of levels the master explores before distributing the subtrees.If this is too small, there might not be enough subtrees to allocate to all the processors.Also, there is greater danger of load imbalance with fewer subtrees.lf this is too large, the master processor needs to do most of the work and the slaves remain idle.The choice depends on the size of the problem and the number of processors used.An appropriate value can easily be found by experimentation.For problems like complex product or matrix product on a system of 16 to 1024 nodes, exploring three levels on the master processor seems to be a good choice.
Notice that a superlinear speedup is possible, if we stop as soon as a solution is found.Because the tree is searched in parallel by many processors, there is a good possibility that some processor might get "lucky" and find the solution [19].We have found that this is indeed the case, most of the time.

Integer Powers of a Number
An example of an algorithm optimization discussed in Knuth [10] is that of raising a number to an integer power by repeated multiplications.By saving and reusing partial products, the number of multiplications to compute an is usually much less than the n -1 multiplications that would be done by a simple loop.for example, a 1 " can be computed with five multiplications using a 2 = a X a, a-1 = a 2 X a 2 , a 3 = a 4 X a, a 10 =a" X a", a 15 = ato X a".
Even this simple problem illustrates that minimizing operations results in tricks, i.e., methods that do not seem derivable by straightforward stepwise thinking.A straightforward method might be to express the exponent as a binary number, and compute products corresponding to every "1" in that number.For example, a 1 " = a1+'2+4+B, so a 2 = a X a, a 4 = a 2 X a 2 , a 8 = a 4 X a\ a 12 = a 8 X a\ a 14 = a 12 X a 2 , a 15 = a 14 X a.Because this method uses six multiplications, it is inferior to the ''clever'' method of more subtle derivation.
A further complication is that divide operations can sometimes reduce the work to get to a power, depending on the relative cost of multiplications and divides (and on the cost of handling divisions by zero).For instance, a• 31 can be computed by repeated squaring to get a 32 , and then dividing by a to get a 31 .The five multiplications and one division required will be faster than the seven operations needed using multiplications alone, if there is no cost for the a = 0 exception and division takes less time than two multiplications.
The optimization program of Section 4 easily generates optimal algorithms for the first powers of a number up to an exponent of 100.We simplified the program to consider only one initial input, a, and used the mathematical equivalence of the ring formed by multiplication and exponentiation to the ring formed by addition and multiplication.That is, the goal expression of 15a using only addition and subtraction is equivalent to a goal expression of a 1 :; using only multiplication and division.
To automate the discovery of minimal operation counts, the algorithm was surrounded by the following control structure, where m(n) is the number of operations to find an: Increment mlesl.End While Record method with mtest operations.Increment n.

End While
Only conservative pruning rules involving adds were applied.The aggressive pruning rules are either irrelevant or inappropriate, since we wish goal expressions of the form na, where n is an integer.For extra speed, the Search program was simplified.s--10--lD--40--"3~91 __ , ----...:.., An nCuBE 2 with 64 processors took 1.5 minutes to find all of the optimal methods tabled in Knuth, assuming no divisions.Divisions were then introduced with various weightings, because division takes longer than multiplication on most 1992 computers, by amounts that vary from computerto computer.The optimal sequence of operations with no divisions is shown in Figure 8 and with a division weighted the same as a multiplication is shown in Figure 9.The sequence of optimal operations when a division is weighted as two or more multiplications is the same as that with no divisions, for integer powers less than or equal to 100.
Note that division is much less help than one might think.Given the high cost of division on many machines, plus the cost of exception handling when a = 0, the use of multiplications alone appears very attractive for low integer powers.For integer powers higher than 200 or so, other methods such as those given in [ 6 J become ad vantageous.

Complex Product
We have already mentioned the trick for computing the product of complex numbers (a + ib) and (c + id) with only three multiplications.The search program found 12 different ways of doing this, 2 of which are given below: An exhaustive search with a budget of three multiplications and five adds took 21 h and 10 min on a 256-processor nCUBE 2. This represents approximately 10 14 integer operations.Reduction of the budget to four adds revealed no way to compute the goal expressions ac -bd and ad + be, providing a computational "proof" of the nonexistence of a method with fewer adds.
Aggressive pruning rules can be used to reduce the search time: Expressions involving ab or cd are disallowed, as are expressions involving sums ±a ± c, ±a ± d, ±b ± c, or ±b ± d (aggressive pruning rule 4).Nonhomogeneous expressions, expressions with degree higher than 2 (highest degree among goal expressions), and expressions having terms with coefficients other than ± 1 (see Section 6) are disallowed.Although we have no proof that optimal methods cannot use such steps, we empirically observed the efficacy of these pruning rules and make use of them in related problems such as cross product and matrix-matrix product.
For the complex product, a search with the aggressive pruning rules took only 0.18 son a 64-processor nCUBE, yet found all 12 methods.Aggressive pruning rules destroy proof of optimality, but speed the discovery of new methods.Without aggressive pruning rules, the method that forms the alternative title of this article would probably not have been tractable with current high-speed computers.

Cross Product
Among many applications, computer graphics uses the cross product of three-vectors extensively to determine intersections of rays and polygons o~ to create plane normals.The conventional algorithm is as follows: Given three vectors and ALGORITHM OPTIMIZATION

213
as the three goal expressions.While working on a synthetic scene generation program, we wondered if there might be a way to compute a cross product in less than six multiplications, at the cost of extra adds.The Sandia nCCBE 2, with 1024 total processors, was invaluable for this search; we used 256 of its processors for a period of 5 h and 40 min and discovered the method with the first set of stems distributed by the master.A search that went through all the stems would have taken a month, even with aggressive pruning rules.The method is shown at the end of Section 1.We believe this is the first publication of the method.

Strassen Products
The Strassen algorithm for 2 X 2 matrix products [20] was cited in the Introduction.Applied recursively to matrices of size N, it reduces the complexity of matrix-matrix products from order N: 3 to order N 1 og2 7 .The lack of an obvious pattern in the algorithm was the original motive for the work described in this article.We intended to generate the Strassen method by exhaustive search, and then attempt to find simlar tricks for 3 X 3 and 4 X 4 matrix products.For 3 X 3 matrix products, a method using only 23 multiplications has been found [13], presumably by hand.With optimal methods for 2 X 2, 3 X 3, and 4 X 4, we hoped to find a pattern that would reduce the exponent for matrix products without requiring very large values of N, the matrix size.
The budget for a Strassen product, using the Winograd improvement, is 22 operations (7 multiplications and 15 additions).Because the search grows exponentially in the number of operations in the budget, the search is significantly harder than those described m previous sections (Table 1).
Based on the number of "stems" generated by the parallel algorithm and the rate at which progress is made through that set of stems, we estimate that exhaustive search for the Strassen matrix product on the 1024-processor nGlJBE 2, even with aggressive pruning rules, would take many centuries.We continue to seek exploitable symmetries and search orderings that will result in discovery of methods, if not search of the entire space.

Buneman-Type Methods Using Extra Identities
The conventional method for rotating a two-vector (x, y) by 0 radians is to multiply it by a matrix of the form [ cos 0 -sin 0 sin 0] cos 0 to obtain (x', y').This requires four multiplications and two additions.This operation is critical, e.g., in the fast fourier transform (FFT) algorithm.If the rotation is to be done for many two-vectors, as it is for the FFT, one regards the cos 0 and sin 0 values as constants instead of as input variables.Therefore, it is legitimate to consider precomputing alternative constants based on 0, but not consider the effort to do so in the operation count.Buneman [ 5] discovered a trigonometric identity that accomplishes a two-vector rotation with three multiplications and three additions, assuming zero cost for computing tan(0/2): A similar approach exists for reducing the numher of multiplications for computing a fourth-degree polynomial [18] from 4 to 3, where precomputation of constants based on the polynomial coefficients can be amortized over many evaluations of the polynomial for various arguments.
The Search program described in this article does not currently have any way of discovering this type of trick.We do not yet see a brute force approach likely to generate such tricks automatically.

Massively Parallel Compiler Optimization
Current commercially available MPP systems either compile source programs on a single node of the ensemble or use a front-end computer that has a traditional serial architecture.Small-scale parallelism for the phases of compilation can be accomplished via pipelining, but that approach does not scale to thousands of processors.
The Search method described here suggests a strategy for a way of using large numbers of processors during compilation.A small number of processors can do conventional compilation using pipeline parallelism, but basic blocks that look amenable to optimization can be farmed out to the rest of the ensemble and run through the Search process.When conventional compilation is completed, the processors that were given sections to optimize can be polled for the best optimization found so far, and the optimizations can be collected as a final pass.The user might supply a time constraint on compilation, so the search for reduced operation counts can be limited by the user's patience instead of the completion of conventional compilation.Here, "operation" includes any instruction in the target computer's repertoire, not just arithmetic operations.Also, overlap of instructions and the number of clock cycles for each instruction would have to be used in the cost metric instead of the simplistic operation counts used in Search.It might make more sense to budget clock cycles than operations or instructions for MPP compiler optimization.This is fertile ground for future research.
As a test of the massively parallel compiler optimization concept, we did the following experiment: Given the C expressions Unoptimized compilation gave a method to compute fusing eight multiplies and three adds.
The Search program found that f could be computed with three multiplies and two adds in 0.18 s on a 16-processor nCUBE.The Shortcut found by the Search program is (a + b) (a 2 + 6 2 ).The SUN (Sun Release 4.1) and DEC (Ultrix) compilers were only able to reduce the op-eration count to seven multiplies and three adds.The Gnu compiler managed six multiplies and three adds.
A naive computation of g requires 10 multiplications and 3 adds, as found by unoptimized compilation.The SUN and DEC compilers could not find any improvements.The Gnu compiler found a method with nine multiplies and four adds.The optimum method ((a+ b) 3 -ab 2 ) taking 4 multiplications and 2 adds is found by the Search program in 0.36 s on a 64-processor nCUBE.

ALGORITHM OPTIMIZATION 215
It appears that even simple algebraic expressions can be improved by a factor of 2 over current compiler technology using the Search approach.

LIMITATIONS OF USE 1 0.1 Numerical Stability
The Search program can be used to derive methods for computing expressions with minimum number of operations but such a method is not guaranteed to be numerically stable.Thus, analysis of the numerical stability of the derived algorithm is necessary before advocating its use.
Numerical stability of algorithms for real and complex matrix multiplication is discussed in [7,8,15].Miller [15] analyzes the tradeoff between u(m,i,j,k) = pxi + peta + pzeta -pxi * peta -peta * pzeta -pzeta * pxi + pxi * peta * pzeta end do end do end do end do return end We tried to optimize the core of the computation given by the statement inside the loops, which, after a convenient renaming of the variables, is a + b + c -ab -be -ca + abc.He defines the notion of Strong stability and Brent stability, which is a much weaker requirement with Strong stability implying Brent stability.For a thorough discussion of these notions, see [ 15].It is shown that any algorithm for matrix multiplication Applying a similar analysis to the new Croos Product method discovered by the Search program, we found that six multiplications are necessary for Strong stability and that the method discovered by the Search program possesses Brent stability.A similar result is true for the complex multiplication methods.

0.2 Use of Constants
Shortcuts for several expressions involve clever use of numerical constants.For example, the expression abc -ab -ac -be + a + b + c is equivalent to (a -1)(b -1)(c-1) + 1, requiring only two multiplications and four adds.a 4 -8a 3 + 24a 2 - 32a + 16 can be computed as (a -2) 4 using only two multiplications and one add.Such clever methods involving the use of constants cannot be found by the Search program.

SUMMARY AND CONCLUSIONS
With the speed of computers continuing to increase about 60% per year, it is prudent to examine some of the problems and solution methods traditionally thought of as "intractable."Contrary to the philosophy taught in most computer science programs, even programs of combinatorially explosive complexity can yield interesting results for small problems.The computer-aided discovery of a cross product with five multiplications is an example of the kind of problem we can now pose to the fastest machines available.
In 1992, computers were capable of about 10 11 integer operations per second, so a run within the limits of human patience might involve 1 0 17 operations.At the current rate of performance improvement, computers will eventually be fast enough to "discover" the Strassen product trick in a 2 week run.As we find better rules for pruning the search tree, we might well move this date up by many years, and be able to attempt larger problems with the MPP computers of the future.

FIGURE 1
FIGURE 1 Partial search l!raph for a two-input problem.

FIGURE 2
FIGURE 2 Partial search tree for a two-input problem.

FIGURE 3 A
FIGURE 3 A typical path showing expressions and how they are derived.

FIGURE 4 A
FIGURE 4 A path showing two identical expressions.

FIGURE 5
FIGURE 5 Search tree with two paths having identical set of expressions.

FIGURE 9
FIGURE 9 Power tree with divisions weighed the same as multiplications.

f
= a * a * a + a * a * b + a * b * b + b * b * b and g=a*a*a+3*a*a*b+2*a*b*b + b * b * b, we used the Search program to optimize the calculation off and g.Then, we tried various C compilers with optimizations enabled.

Table 1 .
Minimum Operation Budgets Required for Various Tasks

Table 2
shows the number of operations used by various compilers in computing this expression.An optimal way of computing this expression, as found by the Search program in 24.6 son 32 pro-

Table 2 .
Number of Operations Used by Various Compilers and the Search Program in Computing a + b + c -ab -be -ea + abc Strong stability should have at least n 3 multiplications and that the Strassen's method possesses Brent stability.