Division-based versus general decomposition-based multiple-level logic synthesis



INTRODUCTION
The term logic synthesis refers to all transformations in the design of digital hardware in which binary data are involved. In this paper, we will only consider a subset of logic synthesis methods, namely methods for the transformation of a multiple-output binary function into a (near-)optimal multiple-level network of primitive logic blocks. The term primitive logic block refers to any binary function that can be mapped one-to-one onto a primitive hardware building block in a certain technology. A primitive hardware building block is the smallest hardware unit considered that is used to implement binary functions in a certain technology. Examples of primitive hardware building blocks are the gates in the library of standard-cell implementations or configurable logic blocks (CLBs) for Xilinx FPGA [54] implementations.
Up to 1980, very special cases of multiple-level logic synthesis received the most attention, namely the transformation of a binary function into an optimal two-level network (e.g. AND-OR-NOT, OR-AND-NOT, NAND-NAND, NOR-NOR and AND-EXOR implementations), and the transformation into multiple-level EXOR, AND-EXOR, AND-OR-NOT or MUX networks. This interest resulted from the fact that these networks could be easily modelled, minimized and mapped one-to-one onto networks of typical primitive hardware building blocks provided by the electronic technologies of that time, for example those in TTL and ECL technologies and on PLAs or PALs.
Multiple-level networks often allow a more compact implementation of combinational logic in comparison with two-level networks. They enable the separate implementation of common sub-expressions and the sharing of these sub-expressions among multiple functions or sub-functions. However, many functions do not result in compact AND-EXOR, AND-OR-NOT or MUX networks. On the other hand, modern microelectronic technology provides us with a huge number of various primitive hardware building blocks which can be used for obtaining more compact networks. Exploitation of these abilities requires new, appropriate multiple-level synthesis methods. The introduction of a new generation of FPGAs [49] has recently generated a very strong stimulus for research in multiple-level logic synthesis. The internal structure of these FPGAs is in fact a programmable multiple-level network and therefore, these devices require the use of multiple-level logic synthesis techniques in order to exploit their abilities. Unfortunately, the synthesis of general multiple-level networks is much more complicated than the synthesis of two-level logic. The main reason for this is the difficulty of defining the nature of the "optimal solution" in the multiple-level synthesis problem. For example, in the case of two-level AND-OR-NOT logic, the "optimal solution" is the solution with the minimal number of product terms, which is a relatively good measure for the complexity of the implemented network.
In multiple-level logic, the structure of the logic is less uniform and can be considerably more complex than in two-level logic. Also, the design decisions have a much more substantial impact on the many factors that determine the total quality of a multiple-level logic network: area, speed, power dissipation, testability etc. Furthermore, these factors are no longer simple functions of the implementation structure, as was the case for two-level logic.
During the last decade many different approaches have been proposed to solve the multiple-level logic synthesis problem. The most important of them are the following:
• algebraic and Boolean division techniques [2][7][8][10][42][45],
• multiple-level BDD and other decision graph approaches [6][13][14][37],
• algorithms based on the minimisation of the communication complexity between blocks [23],
• multiple-valued logic based approaches [36][43][52],
• methods based on spectral analysis techniques ([17] contains an overview),
• methods based on the iteration of gate transforms and gate reductions, the so-called transduction methods [40].
In recent years a number of papers have been published which implicitly or explicitly use a new concept of general structural decomposition as a synthesis paradigm (e.g. [15][26][27][30][32]). The distinctive feature of these methods is that they are all special cases of the general full-decomposition presented in Section 3. In the general full-decomposition approach, an incompletely specified multiple-output binary function is decomposed into a network of communicating sub-functions (logic blocks) in such a way that this network realizes the specified behaviour, satisfies the specified constraints and optimizes the given objectives. Decomposition decisions are based on an analysis of the structure of the information streams in the function and the relations between this structure and the specified constraints and objectives. The constraints imposed by hardware building blocks and their possible interconnections are innately taken into account. This approach has a number of interesting properties, including the following:
• In many multiple-level synthesis approaches, Boolean expressions are used to describe functions. Very often these approaches use only a limited set (a minimum functionally complete set) of Boolean operators (e.g. AND-OR-NOT) and not the full set of operators implemented by a certain library of hardware building blocks. To implement the minimised expression, a transformation step called technology mapping must be performed in order to transform the expression into a network of hardware building blocks. If the repertoire of primitive logic blocks offered by a certain technology library differs substantially from the set of Boolean operators used during synthesis, the work completed during synthesis is almost futile, because the real synthesis must be performed during the technology mapping. The synthesis methods based on general decomposition integrate the technology mapping phase into the synthesis: a network of logic blocks that can be mapped one-to-one onto a network of primitive hardware building blocks is constructed.
• The internal structure of Xilinx FPGAs and similar fine-granularity FPGAs is in fact a programmable multiple-level logic block structure which can be innately modelled using the theory in Section 3. Therefore, methods for the synthesis of these types of FPGAs can be relatively easily constructed using this theory.
This paper aims to present a comparative analysis of the general decomposition-based and division-based multiple-level logic synthesis approaches. We will investigate the properties of the decomposition-based logic synthesis methods by introducing the general full-decomposition concept, presenting and discussing the existing decomposition methods, and comparing the decomposition methods with the classical multiple-level synthesis methods based on the division of Boolean expressions. We have chosen the division-based algorithms as our reference, because they are by far the most popular ones (as measured by the number of publications on this subject).
The remainder of this paper is organised as follows: Sections 2 and 3 contain introductions to the theory and the most important results obtained in division-based and decomposition-based logic synthesis, respectively. In Section 4, a comparison between these two classes is presented. Some concluding remarks can be found in Section 5.

DIVISION-BASED LOGIC SYNTHESIS
The fundamentals of division-based multiple-level logic synthesis were introduced by Robert K. Brayton in 1982 [7][8][10]. In this section, we will review the most important aspects of Brayton's original method and discuss its advantages, disadvantages and a few extensions to this method.

Basic Theory
Division-based multiple-level logic synthesis is based on the manipulation of Boolean expressions. The theoretical framework for these expressions consists of binary Boolean algebras. These binary Boolean algebras are well known [5][21][47] and therefore, we will not repeat them here.
A Boolean variable is a single coordinate in a Boolean space. A literal is a Boolean variable or its complement. A cube c is a set of literals such that x ∈ c ⇒ x' ∉ c; for example, {a, b} represents the cube ab; {a, a'} is not a cube. Two trivial cubes, 0 and 1, exist; they are defined as the Boolean functions 0 and 1 respectively. A Boolean expression is a set of cubes. We will not write expressions as sets of cubes, but we will use the well-known sum-of-product-term representations. The translation between these two notations is straightforward. A Boolean expression is called non-redundant if no cube of the expression properly contains another cube. For example, the expression a + ab is redundant because {a} ⊆ {a, b}. The expression a + a'b is non-redundant. The support sup(f) of an expression f is sup(f) = {x | ∃c ∈ f: x ∈ c ∨ x' ∈ c}. Less formally, the support is the set of Boolean variables which appear either complemented or uncomplemented in the expression f. Two functions f and g are said to have disjoint support if and only if sup(f) ∩ sup(g) = ∅.
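The definitions above can be made concrete with a small sketch. Here literals are modelled as (variable, polarity) pairs, a cube as a frozenset of literals and an expression as a set of cubes; all function names are illustrative, not from the paper.

```python
def is_cube(lits):
    # a set of literals is a cube only if no variable occurs in both
    # polarities: {a, a'} is not a cube
    return not any((v, not pos) in lits for (v, pos) in lits)

def support(expr):
    # sup(f): variables appearing complemented or uncomplemented in f
    return {v for cube in expr for (v, pos) in cube}

def non_redundant(expr):
    # drop every cube that properly contains another cube: a + ab -> a
    return {c for c in expr if not any(d < c for d in expr)}
```

With this representation, a + ab reduces to a, reflecting the redundancy example in the text.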
The expressions describe the structure of a multiple-output Boolean function. Parentheses are used to specify the multiple-level character of the expressions; such a parenthesised expression is called a factored form and the process of obtaining a factored form of a Boolean function is called factorisation. An incompletely specified single-output function F is represented by a triple of completely specified single-output functions F = (f, d, r), where f represents the on-set, i.e. all input patterns for which F evaluates to 1. Similarly, d and r are representations of the don't care-set and the off-set, respectively. The functions f, d and r should be mutually exclusive and they should together form the complete Boolean space (i.e. all vectors from the Boolean input space should belong to precisely one of these functions). The product of two expressions f and g (denoted f·g) is defined as f·g = {ci ∪ dj | ci ∈ f ∧ dj ∈ g}. The product set is made explicitly non-redundant after calculating the product. If f and g have disjoint support, then f·g is called the algebraic product, otherwise f·g is called the Boolean product. A completely specified function g is a Boolean divisor of an (incompletely specified) function F = (f, d, r) if two completely specified functions h and r exist such that g·h ⊄ d and f ⊆ g·h + r ⊆ f + d, where g·h represents the Boolean product; h is called the quotient and r is called the remainder of the division, similar to the names used in number calculus. However, in our case, h and r are not necessarily unique. For example, the function f = a'c + ab + bc + d + e, where cube e is a cube from the don't care set, has a Boolean divisor (a' + b) with a quotient (a + c) and a remainder d, as (a' + b)(a + c) + d = a'c + ab + bc + d ⊆ f. Because (a' + b) and (a + c) have intersecting input supports, the product is a Boolean product.
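The product of two expressions can be illustrated directly from its definition. The sketch below computes f·g = {ci ∪ dj | ci ∈ f ∧ dj ∈ g} for expressions over uncomplemented variables (so no contradictory cubes can arise) and makes the result non-redundant, as the text prescribes; the function name is ours.

```python
def product(f, g):
    """Product of two expressions given as sets of cubes (frozensets of
    variable names); the result is made explicitly non-redundant."""
    raw = {ci | dj for ci in f for dj in g}
    # drop every cube that properly contains another cube of the result
    return {c for c in raw if not any(d < c for d in raw)}
```

For instance, (a + b)(a + c) yields the raw cubes a, ab, ac and bc, and the non-redundancy step removes ab and ac, leaving a + bc.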
The disadvantage of the Boolean division approach is that the set of Boolean divisors is usually very large and therefore good divisors cannot be easily found. Consequently, an alternative approach has been considered. It is based on the fact that the sum-of-product-term representation for two-level functions is almost canonical and efficient common algebraic factors can be identified as common factors of the product terms. The idea is motivated by the fact that manipulations of sum-of-product terms are in most cases quickly performed (many algebraic operations have linear time complexity). The disadvantage of this idea is that it does not guarantee optimal solutions. Brayton uses an alternative approach based on this idea. In this approach, the incompletely specified function F is minimised to obtain a two-level minimal representation of the on-set f of F (using the two-level minimiser Espresso [9]) and algebraic division is used to manipulate f. A completely specified function p is an algebraic divisor of a completely specified function f if q and r exist such that p ≠ 0, q ≠ 0 and f = p·q + r, where p·q is the algebraic product; q is once again called the quotient. In the remainder of this article we will denote the quotient as q = f/p; r is called the remainder of the division; q and r need not be unique. However, if we define q as the largest set for which f = p·q + r, then q and r are unique. This special type of algebraic division is called weak division. For example, (a + b) is an algebraic divisor of the function f = (a + b)(c + d + e) + g. Both q = c + d, r = ae + be + g and q = c + d + e, r = g are valid quotient/remainder pairs for f divided by (a + b). If the calculations are restricted to weak division, then the second pair is the unique quotient/remainder pair. Note that in f = p·q + r the names of the divisor and the quotient are arbitrary; if p is a divisor and q is the associated quotient, then we can equally call q the divisor and p the quotient.
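Weak division itself is straightforward to sketch. Assuming cubes are frozensets of variable names, the quotient f/p is the intersection, over the cubes pi of p, of the cofactor sets {c \ pi | c ∈ f, pi ⊆ c}; taking the intersection makes q the largest valid quotient, which is exactly what weak division requires. The helper name is ours.

```python
def weak_division(f, p):
    """Weak division of expression f by divisor p (both sets of cubes).
    Returns the unique quotient q = f/p and remainder r with f = p*q + r."""
    cofactor_sets = [{c - pi for c in f if pi <= c} for pi in p]
    q = set.intersection(*cofactor_sets)          # largest valid quotient
    pq = {pi | qj for pi in p for qj in q}        # algebraic product p*q
    r = {c for c in f if c not in pq}             # leftover cubes
    return q, r
```

For the example in the text, dividing f = (a + b)(c + d + e) + g, written as the cube set {ac, ad, ae, bc, bd, be, g}, by p = a + b yields the unique weak-division pair q = c + d + e and r = g.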
A cube c is said to (algebraically) divide expression f evenly if and only if ∀d ∈ f: c ⊆ d. An expression is said to be cube-free if no cube divides the expression evenly (e.g. ab + c is cube-free, but ab + ac is not cube-free, since cube a divides ab + ac evenly). The primary divisors of a Boolean function f are the elements of the set D(f) defined as: D(f) = {f/c | c is a cube}. The set of kernels of a Boolean function f is the set: K(f) = {g | g ∈ D(f) ∧ g is cube-free}. In other words, the kernels of an expression f are the cube-free primary divisors of f. The cube c used to obtain kernel k (k = f/c) is called the co-kernel of k, and we use C(f) to denote the set of co-kernels of f. A kernel k ∈ K(f) is said to be of level n if k ∈ Kⁿ(f) and k ∉ Kⁿ⁻¹(f), where K⁰(f) = {k ∈ K(f) | K(k) = {k}} and Kⁿ(f) = {k ∈ K(f) | ∃l ∈ K(k): l ∈ Kⁿ⁻¹(f)}. For example, for the function f = a(b + c) + d the set of primary divisors is D(f) = {a; b + c; a(b + c) + d; 1}. The kernel sets for this function are: K⁰(f) = {b + c} and K¹(f) = {b + c; a(b + c) + d}.
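A simple (unoptimised) enumeration of kernel/co-kernel pairs follows directly from the definitions: candidate co-kernels are here restricted to intersections of pairs of cubes (plus the trivial co-kernel 1), which suffices for small examples but can miss deeper co-kernels in general. This is only a sketch under the cube-set representation, not Brayton's actual kerneling algorithm.

```python
from itertools import combinations

def divide_by_cube(f, c):
    # quotient f/c for a single cube c
    return frozenset(cube - c for cube in f if c <= cube)

def is_cube_free(expr):
    # cube-free: at least two cubes sharing no common literal
    return len(expr) >= 2 and not frozenset.intersection(*expr)

def kernels(f):
    """Map each co-kernel (a cube) to its kernel (a cube-free quotient)."""
    candidates = {frozenset()} | {ci & cj for ci, cj in combinations(f, 2)}
    return {c: divide_by_cube(f, c)
            for c in candidates if is_cube_free(divide_by_cube(f, c))}
```

For f = a(b + c) + d, written as the cube set {ab, ac, d}, this yields the co-kernel a with kernel b + c, and the trivial co-kernel 1 with kernel f itself.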
Theorem 1: Functions f and g have a common multiple-cube divisor d if and only if ∃kf ∈ K(f), ∃kg ∈ K(g): |kf ∩ kg| ≥ 2 [7].
This theorem states that two functions have a common multiple-cube divisor only if the intersection of a kernel of f and a kernel of g contains more than one cube. It is the fundamental theorem used in the factorisation algorithms presented in the next sections.

Standard Factorisation Algorithm
The algorithm discussed below was introduced by R.K. Brayton in 1982 ([7][8][10]). We will call it the standard factorisation algorithm, because all other factorisation algorithms are based on it. Each method presented in this paper will be characterised by sketching an outline of the major steps and by discussing the impact of these steps on the quality of the result. The standard factorisation method can be characterised as follows:
• The fundamentals of the method are based on the notion of kernel/co-kernel pairs. This notion is easy to comprehend, which makes reasoning about the synthesis process easy. Although the set of kernels can become very large, it is considered to be reasonably small for most practical synthesis problems. Kernels form a very small subset of all algebraic divisors of a function. Algebraic divisors are in turn a special subset of another group of divisors: the Boolean divisors. Therefore, the set of kernels of a Boolean expression is a very small subset of all possible divisors and can be too restrictive to find near-optimal solutions.
• The input to the algorithm is the two-level, locally minimised on-set f' of an incompletely specified function F = (f, d, r), obtained using the two-level minimiser Espresso [9]. This makes incompletely specified multiple-output functions much easier to handle, but the loss of the don't cares before the algorithm actually starts will almost certainly lead to a less satisfactory implementation.
In the type of factorisation problems presented here, two optimisation criteria are generally considered.
Firstly, the area occupied by the implementation of the circuit and, secondly, the maximal speed of the implementation of the circuit. The area of the implementation is usually split into two components: the active area (the area used for active elements (transistors)) and the routing area (the area used for wiring). The maximum speed of the implementation is usually determined by its critical path; this is defined as the worst-case response time of any output to a change in one or more of its inputs.

In the standard factorisation algorithm, the number of literals is used as the optimisation criterion. The number of literals is a quite accurate measure for the implementation of AND-OR-NOT based Boolean expressions, because each literal is an input of a gate and the difference between the minimum and the maximum number of inputs for a gate of a certain technology is usually small (typically 2 or 3). Although this measure is reasonably accurate for the active area of a physical implementation, it is not adequate for the routing area or for the delay of the circuit. Since the routing area of a complex multiple-level logic circuit can be much larger than the active area, and heavily parenthesised expressions increase the delay, the number of literals can be a weak selection mechanism for large logic circuits. It is important to realise that the number of literals is only an adequate measure for the active area if we use an AND-OR-NOT based technology. If more complex gates are possible (like the AND-NOR-21 gate: f = (a·b + c)') then a non-trivial technology mapping is required, as this must transform a set of Boolean expressions with minimum literal count into a gate network using the minimum amount of area. This problem is known to be NP-hard. The same problem can occur when the Boolean expression uses operators with too many inputs: for example, the expression f = a·b·c·d·e·h·i·j probably requires technology mapping, because not many technologies support 8-input AND gates. Furthermore, for FPGAs which have an internal structure of programmable networks of look-up tables (like Xilinx FPGAs), the number of literals is in fact completely irrelevant, because these look-up tables can implement any function which has a limited number of inputs and outputs (regardless of how many literals are needed to describe this function). The general full-decomposition theory presented in Section 3 does not have this disadvantage, as in this theory functions are modelled as networks of arbitrary building blocks.
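The literal-count measure discussed above is trivial to compute for a two-level (sum-of-cubes) representation; for factored forms one counts the literals of every factor. A minimal sketch for the two-level case, assuming the cube-set representation:

```python
def literal_count(expressions):
    """Total number of literals over a collection of two-level expressions,
    each given as a set of cubes; a cube contributes one literal per variable."""
    return sum(len(cube) for expr in expressions for cube in expr)
```

For instance, abce + abd + ae counts 4 + 3 + 2 = 9 literals.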
Each kernel/co-kernel pair obtained using Theorem 1 is applied to all functions and the gain in the number of literals obtained by this pair is calculated. The kernel/co-kernel pair with the highest literal gain is selected and applied to all functions. Although this approach is fast and easy to understand, we consider it to be a weak point of the algorithm. Firstly, the selection is based on the momentary gain in the number of literals of a kernel. It does not predict the total gain in the number of literals that can be expected by choosing this kernel. This is important because the choice of a kernel may block other kernel/co-kernel pairs and, after choosing the currently best kernel, no good kernel/co-kernel pairs may remain. The second objection is that the selection algorithm is greedy. A greedy search strategy should only be used for solution spaces which are more or less continuous and regular. If the solution space is very rough, a greedy algorithm often finds a local instead of a global optimum. The solution space of logic synthesis is very rough [28] and therefore greedy algorithms are not suitable. A third disadvantage is that the selected kernel is applied globally. This means that a kernel/co-kernel pair with a large gain in some functions is not only applied to the functions where it results in a gain in the number of literals, but also to functions where it is a very bad choice, and therefore may block good kernels for those functions. All these factors together provide evidence that, to guarantee a near-optimal gain in the number of literals, a more sophisticated kernel selection algorithm should be used.
The kernel/co-kernel pairs of Brayton are not algebraically compatible. This means that after choosing and applying a certain kernel, other kernel/co-kernel pairs may no longer be valid. The quality factors of the other kernel/co-kernel pairs can also change. Therefore, after each kernel selection the kernel/co-kernel set needs to be recalculated. This is a rather expensive operation and therefore Brayton tries to select a few kernels/co-kernels before rebuilding the kernel/co-kernel set, which means that the algorithm works with somewhat inaccurate estimations and extra checks are required to assess whether a kernel/co-kernel pair is still feasible. The algorithm continues to select the kernel/co-kernel pair with the highest gain in the number of literals, until the gain is smaller than a threshold value X. Then, all single-cube divisors which have a literal gain larger than X are extracted. If no more multiple-cube or single-cube divisors can be found, then the threshold value X is decreased and the extraction process is continued. The choice of the sequence of values of X is determined by experiments (empirical data).

Lexicographical Factorisation
The lexicographical factorisation algorithm [1][2][3][45] was developed in the Laboratoire de Conception de Systèmes Intégrés of the Institut National Polytechnique de Grenoble. It aims to improve Brayton's method by removing some of the disadvantages previously mentioned. Its basic idea is to find and use an order of the input variables in the factorised expression. This approach leads to a multiple-level random logic implementation with an improved routing factor compared to the standard factorisation algorithm of Brayton. In lexicographical factorisation, the variables are factored out in order of appearance in a certain variable ordering.
Suppose we have a function f = abc + abd + ae + bc + be + cd and an input order {a, b, c, d, e}; then the factorised form of f with respect to this order is f = a(b(c + d) + e) + b(c + e) + cd. The construction of the input order is based on kernel/co-kernel pairs. The precedence relation induced by the kernel/co-kernel pair (k, c) states that the variables in co-kernel c precede the variables of kernel k. For example, the function f = ac(bd' + b'd) has a kernel/co-kernel pair (bd' + b'd, ac). The precedence relation is: a and c precede b and d. A factorisation (k, c) is compatible with a reference order if and only if its precedence relation is respected by the reference order. For example, ba(c + d) is compatible with all reference orders in which b and a precede c and d. These reference orders are: {a, b, c, d}, {b, a, c, d}, {a, b, d, c} and {b, a, d, c}. Two factorisations (k1, c1) and (k2, c2) are lexicographically compatible if and only if at least one input order exists with which they are both compatible. Suppose we have two elementary factorisations on some function: (ce + d, b) and (d + f, ce). These factorisations are compatible because they are both compatible with the reference order {a, b, c, e, d, f}.
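Factoring with respect to a fixed variable order can be sketched recursively: pull out the first variable of the order that still occurs, factor its cofactor with the remaining order, and treat the cubes without that variable the same way. The sketch below ignores complemented variables and is our illustration, not the published algorithm.

```python
def lex_factor(cubes, order):
    """Factor a set of cubes (frozensets of variable names) by extracting
    variables in the given order; returns the factored form as a string."""
    cubes = list(cubes)
    if not cubes:
        return "0"
    if all(len(c) == 0 for c in cubes):
        return "1"
    for i, v in enumerate(order):
        with_v = [c - {v} for c in cubes if v in c]
        without_v = [c for c in cubes if v not in c]
        if with_v:
            inner = lex_factor(with_v, order[i + 1:])
            if inner == "1":
                head = v
            elif "+" in inner:
                head = v + "(" + inner + ")"
            else:
                head = v + inner          # single-cube quotient: no parentheses
            rest = lex_factor(without_v, order[i:]) if without_v else ""
            return head + (" + " + rest if rest else "")
    # variables not covered by the order: emit the remaining cubes as-is
    return " + ".join("".join(sorted(c)) for c in cubes)
```

Applied to f = abc + abd + ae + bc + be + cd with the order {a, b, c, d, e}, this reproduces the factored form a(b(c + d) + e) + b(c + e) + cd from the example above.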
The lexicographical factorisation method can be characterised as follows:
• The lexicographical algorithm, like standard factorisation, has the locally two-level minimised Boolean functions as its inputs.
• The input order is created in the first step. Part of the input order can be imposed externally to account for external factors (e.g. late arrival times of some inputs). If this enforced input ordering is not complete, then the following input-ordering algorithm is used to complete the order. A list of all kernel/co-kernel pairs is first constructed. The pairs are sorted with respect to their global gain in the number of literals. The pair with the largest gain in the number of literals which does not violate the input order is selected, and the input order is updated with regard to the precedence relation of the selected kernel/co-kernel pair. These steps are repeated until no more compatible kernel/co-kernel pairs can be found.
• The lexicographical factorisation algorithm uses a greedy selection algorithm to find a good input order. The constructed input order may not be the best one (in addition to the fact that any input order is a large restriction on the set of kernels). Therefore, a more sophisticated algorithm may be necessary to find a near-optimal input order. Related to this problem is the fact that lexicographical factorisation uses the number of literals to estimate the quality of an implementation.
• One of the main advantages of lexicographical factorisation over standard factorisation is the following theorem, proven in [45]:
Theorem 2: Lexicographically compatible kernel/co-kernel pairs are algebraically compatible.
Lexicographical factorisation therefore does not require the recalculation of the set of kernel/co-kernel pairs after one pair is selected from the set. The result is that lexicographical factorisation is a much faster algorithm when compared with standard factorisation.
The factorisation is then performed, respecting the variable ordering just created. Since the variable ordering is known, factorisation is very simple. Negated variables are factored out immediately after the non-negated variable. In the final step, common sub-expressions are identified and implemented as sub-functions. Because of the input order, the search for common sub-expressions is very efficient. It is performed as the last step, because high priority is given to internal simplifications of the expressions and this results in a low number of wires and short wires. Lexicographical factorisation results in implementations which have a much smaller routing area compared to the standard factorisation algorithm of Brayton. Respecting the variable ordering can however result in an increase of the active area. Many good kernels cannot be used because the input order is very restrictive. Therefore, the method only produces good results for circuits with a high routing factor (the amount of active area used is larger than the active area obtained using Brayton's method). Most large circuits are known to have a large routing factor and experiments [45] have shown that the lexicographical factorisation algorithm produces better results in less time for large circuits.
Unfortunately, only external inputs are accounted for in this method. The method does not explicitly try to avoid long lines which result from internal sub-function creation (although implicitly this problem is partially solved because the variable ordering keeps related sub-functions close to each other).

Concurrent Decomposition Algorithm

This method was introduced by Janusz Rajski and Jagadeesh Vasudevamurthy of McGill University in Montreal, Canada [41][42][50]. It is based on testability-preserving transformations, and the factorised multiple-level network is fully tested by a complete test set derived from the original two-level circuit. The characteristic feature of the concurrent decomposition method is that it limits itself to the use of double-cube divisors (i.e. kernels with only two cubes), single-cube divisors with only two literals, and the complements of these single- and double-cube divisors. It has been found that these divisors (in spite of their simplicity) can be used to synthesise circuits with small area and short delay times. Furthermore, by restricting the calculations to double-cube divisors and single-cube divisors with two literals, the calculations of these divisors become polynomial-time operations (whereas the calculation of the set of kernels can require an exponential amount of time). A double-cube divisor is a cube-free multiple-cube divisor with only two cubes. The set D(f) of all double-cube divisors is defined as D(f) = {d | ∃i, j: 1 ≤ i ≤ n, 1 ≤ j ≤ n, i ≠ j: d = {ci \ (ci ∩ cj), cj \ (ci ∩ cj)}}, where n is the number of cubes of f and ci represents cube i of f. (ci ∩ cj) is called the base of double-cube divisor d. Note that the definition of a double-cube divisor does not exclude empty bases (i.e. the situation in which ci and cj are disjoint). The complexity of the construction of D(f) is O(n²).
Given the function f = ade + ag + bcde + bcg, the double-cube divisors of f are:
• de + g, obtained from the cubes ade and ag or from the cubes bcde and bcg;
• a + bc, obtained from the cubes ade and bcde or from the cubes ag and bcg;
• ade + bcg, obtained from the cubes ade and bcg;
• ag + bcde, obtained from the cubes ag and bcde.
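Generating D(f) is indeed a simple O(n²) pairwise pass, sketched below for the cube-set representation (assuming f is non-redundant, so neither cube of a pair is contained in the other); the function name is ours.

```python
from itertools import combinations

def double_cube_divisors(f):
    """All double-cube divisors of f: for every pair of cubes, remove the
    common base ci∩cj and keep the two disjoint remainder cubes."""
    divisors = set()
    for ci, cj in combinations(f, 2):
        base = ci & cj
        divisors.add(frozenset({ci - base, cj - base}))
    return divisors
```

Applied to f = ade + ag + bcde + bcg from the example above, this produces exactly the four divisors listed: de + g, a + bc, ade + bcg and ag + bcde (the first two each arise from two different cube pairs).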
The set of double-cube divisors D(f) is partitioned into a number of subsets Dx,y,s(f), where x represents the number of literals in the first cube, y the number of literals in the second cube and s the number of literals in the support of the divisor. Without loss of generality one can assume x ≤ y. Note that max(x, y) ≤ s ≤ x + y. The special sets Dexor(f) and Dexnor(f) denote the EXOR and the EXNOR double cubes respectively; D2,2,2(f) = Dexor(f) ∪ Dexnor(f). Sx(f) is used to denote the set of single-cube divisors of f with exactly x literals. A feature of concurrent factorisation is that it does not only take elements of D(f) and S(f) as divisors, but also uses the complements of these divisors. In [42] an important theorem is formulated that describes the cases in which the complement of a double-cube divisor is also a divisor:
Theorem 3: Let f and g be two Boolean functions. If
a) di ≠ sj' for every di ∈ D1,1,2(f), sj ∈ S2(g), and
b) di ≠ dj' for every di ∈ Dexor(f), dj ∈ Dexnor(g), and
c) di ≠ dj' for every di ∈ Dexnor(f), dj ∈ Dexor(g), and
d) di ≠ dj' for every di ∈ D2,2,3(f), dj ∈ D2,2,3(g), and
e) di ≠ sj' for every di ∈ D1,1,2(g), sj ∈ S2(f),
then f has neither a complement double-cube divisor nor a complement single-cube divisor in g.
This theorem is quite important from the practical viewpoint: when checking for a complement divisor of a d ∈ D(f), we have to search for it only in a small subset of Dx,y,s(f) or S2(f). As in standard factorisation, a theorem for finding divisors common to two functions exists [42].
Theorem 4: Expressions f and g have a common multiple-cube divisor if and only if D(f) ∩ D(g) ≠ ∅.
Finding multiple-cube divisors based on the sets D(f) and D(g) leads to a higher run-time efficiency, because the set of double-cube divisors is much smaller when compared to the set of all kernels. The method can be characterised as follows:
• The double-cube divisors are constructed by an O(n²) algorithm, where n is the number of terms. Each double-cube divisor d is stored in the appropriate Dx,y,s and its implementation cost is calculated. If d has a single-cube complement (which can be easily checked using Theorem 3) then the cost for the single-cube complement is also taken into account in the implementation cost. The quality of a Boolean expression is estimated by the number of literals in the expression. The quality factor specifies the gain in the number of literals which results from choosing both this double-cube divisor and its complement. Double-cube divisors are extracted separately for each function. The extraction of double-cube divisors is only done once for the entire factorisation process. A similar algorithm is used to select all single-cube divisors.
• The divisor selection algorithm is a greedy algorithm selecting the (single- or double-cube) divisor (and possibly its complement divisor) with the largest gain in the number of literals. Our objections to this measure have already been discussed in section 2.2 and hold for concurrent decomposition as well. The set of double-cube divisors is not algebraically compatible. However, because the complexity of the double-cube generation algorithm is quadratic, this algorithm is much faster when compared to standard factorisation (which occasionally can have exponential time complexity).
In order to preserve testability, the algorithm has to be used on single-output functions and not on multiple-output functions. This means that good sub-expressions cannot be shared among different output functions. Furthermore, the transformation of a multiple-output function into a set of single-output functions can increase the number of product terms by at most a factor equal to the number of outputs of the multiple-output function. In an attempt to overcome this problem, the single-output Boolean functions are searched for common parts prior to executing the concurrent decomposition algorithm. The common parts are implemented as sub-functions. It must be noted that the search for common parts involves common product terms, not common sub-cubes; i.e. the intermediate variables occur in the functions as product terms with this intermediate variable as the only variable in the product term, and they are not literals of a cube. The algorithm preserves testability, but this is a severe restriction. In [42] a number of rules are specified which should be fulfilled in order to preserve testability. These rules state that single cube extraction, double cube extraction and concurrent extraction on single-output functions preserve testability. Similar rules state that it is very difficult to preserve testability in multiple-output circuits (i.e. Boolean expressions with common sub-expressions). Therefore, the concurrent decomposition methods act on single-output functions. It seems reasonable to expect a further gain in the number of literals if the testability condition is dropped and common sub-functions and common double cube divisors are allowed.

Example of Division-Based Synthesis Methods
In this section, the division-based algorithms that have been presented are illustrated by an example. Given the function w(a,b,c,d,e,f,g) defined as w = abce + abd + aef + bef + efg. w is a single-output, completely specified function and is minimal with respect to the number and the size of its terms. The standard factorisation algorithm presented in section 2.2 applied to w results in the function x(a,b,c,d,e,f,g): x = ab(ce + d) + ef(a + b + g) (see also Figure 1a). The standard factorisation has obtained a solution with 10 literals. The lexicographical factorisation algorithm cannot find this solution because the co-kernel/kernel pairs used are incompatible: the pair (ce + d, ab) requires (among others) the precedence relation "a precedes e", whereas the second co-kernel/kernel pair of x, (a + b + g, ef), requires the precedence relation "e precedes a", which is incompatible with the first relation. The lexicographical factorisation finds the following factorisation that respects the variable ordering {d,e,c,f,a,b,g}: y(a,b,c,d,e,f,g) = d·y1 + e(c·y1 + f(y2 + g)) with y1 = ab and y2 = a + b (see Figure 1b). This function also requires 10 literals. As can be seen, a sub-function y1 has been introduced. Because the lexicographical factorisation algorithm uses NANDs as an internal representation, it was able to identify ab and a + b as common sub-expressions. If this realisation is compared with standard factorisation, it shows the properties of lexicographical factorisation: each input is connected to only one gate, and therefore the routing complexity for the inputs is reduced. It can also be seen that the method does not try to minimise wires for sub-functions, as the sub-function y1 is routed globally. It is however possible to specify a threshold on the gain in the number of literals for sub-functions. If the gain of a certain sub-function is larger than the threshold, the sub-function is created and applied globally (expecting a large active-area gain, but a small extra area for wiring); otherwise it is implemented locally (costing extra active area, but negligible routing area). Furthermore, the lexicographical factorisation algorithm needs more gates than standard factorisation (an increase in active area). The real power of the lexicographical factorisation algorithm can only be shown on large examples, where the gain in active area and the extra routing area for sub-functions is compensated by a large reduction in the routing area for the inputs. We refer to the benchmark results in [45] to illustrate the effectiveness of the lexicographical factorisation for large circuits. Using lexicographical factorisation, a partial input order can also be enforced. Suppose we want the inputs a, b and c to be extracted first (because these inputs have late arrival times due to external circumstances); the input ordering is first completed as {a,b,c,d,e,f,g} and the following factorisation is then found: z = ab(ce + d) + (az1 + bz1 + gz1) with z1 = ef (see Figure 1c), which requires 13 literals. It should be noted that the second part of z cannot be written as z1(a + b + g) without violating the precedence relation. Also, it should be noted that the critical path of a and b has been reduced from 6 gates to 2 gates at the expense of a somewhat more complex routing and an increase in active area (more inputs per gate). The concurrent factorisation algorithm searches explicitly for the complement of the kernel a + b, whereas in lexicographical factorisation this equivalence was found implicitly. Concurrent factorisation finds the same factorisation as lexicographical factorisation (function y) (see Figure 1b). As can be seen, only double cube divisors and cubes with two literals are used.
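The equivalence of w and its factorisations can be checked exhaustively. The sketch below assumes the reading y1 = ab with a + b as its companion sub-expression and z1 = ef; these sub-function names follow the example:

```python
from itertools import product

def w(a, b, c, d, e, f, g):      # original two-level form
    return a & b & c & e | a & b & d | a & e & f | b & e & f | e & f & g

def x(a, b, c, d, e, f, g):      # standard factorisation, 10 literals
    return a & b & (c & e | d) | e & f & (a | b | g)

def y(a, b, c, d, e, f, g):      # lexicographical factorisation
    y1 = a & b                   # sub-function y1 = ab; a + b is also shared
    return d & y1 | e & (c & y1 | f & ((a | b) | g))

def z(a, b, c, d, e, f, g):      # factorisation under a partial input order
    z1 = e & f                   # sub-function z1 = ef
    return a & b & (c & e | d) | (a & z1 | b & z1 | g & z1)

# exhaustive check over all 2^7 input combinations
for v in product((0, 1), repeat=7):
    assert w(*v) == x(*v) == y(*v) == z(*v)
```

All four expressions realise the same Boolean function; they differ only in literal count, gate count and routing, which is exactly the trade-off discussed above.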

GENERAL DECOMPOSITION-BASED MULTIPLE-LEVEL LOGIC SYNTHESIS
In this section, we will present a theory of general full-decomposition for combinational machines and give an overview of the decomposition-based methods for multiple-level combinational logic synthesis. Basic definitions are presented in sub-section 3.1, the theory of general full-decomposition can be found in sub-section 3.2, and some special decomposition cases are the topic of sub-section 3.3.

Basic Definitions
A (completely specified) combinational machine M is an algebraic system defined by M = (I, O, h), where: I - a finite non-empty set of inputs, O - a finite non-empty set of outputs, and h - the output function h: I → O.
The design requirements do not always completely specify a machine; for example, certain input values may never occur due to external constraints, or the machine may be realized in such a way that some of the input values of the realization are not used for implementing the inputs of the originally specified machine. From the behavioural viewpoint, the designer does not care what the output value will be for such an input value. In all such situations one talks about so-called "don't care" conditions. "Don't cares" are commonly denoted by "-". In order to account for them, the combinational machine definition should be slightly modified by changing the definition of the machine function h. For the single-output machine: h: I → O ∪ {-}. For the multiple-output machine: h = [hi], hi: I → Oi ∪ {-} and O = [Oi]. A combinational machine without "don't cares" will be referred to as completely specified and one with "don't cares" as incompletely specified. Machine M' = (I',O',h') is a realisation of machine M = (I,O,h) (see Figure 2) if and only if the relations φ: I → I' (a function) and θ: O' → O (a surjective partial function) exist, so that ∀x∈I: h(x) = θ(h'(φ(x))). The machine composed as a structure consisting of φ, M' and θ, being the realisation structure for M defined by M', will be denoted by str(M'). It is possible to prove that if M' is a realisation of M then, for all possible inputs, the outputs produced by machine M and its realisation M' are identical after renaming.
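The realisation condition h(x) = θ(h'(φ(x))) can be checked mechanically. Below is a minimal sketch with a hypothetical toy machine (a two-bit parity function realized by a machine working on renamed inputs and numeric outputs); the function name and the tables are ours:

```python
def realizes(h, h2, phi, theta):
    """Check that M' = (I', O', h2) realizes M = (I, O, h) via the input
    mapping phi: I -> I' and output mapping theta: O' -> O, i.e.
    h(x) == theta(h2(phi(x))) for every input x of M."""
    return all(h[x] == theta[h2[phi[x]]] for x in h)

# hypothetical machine M: parity of a 2-bit input
h = {'00': 'even', '01': 'odd', '10': 'odd', '11': 'even'}
# hypothetical realisation M' with renamed inputs p..s and outputs 0/1
h2 = {'p': 0, 'q': 1, 'r': 1, 's': 0}
phi = {'00': 'p', '01': 'q', '10': 'r', '11': 's'}
theta = {0: 'even', 1: 'odd'}
assert realizes(h, h2, phi, theta)
```

The outputs of M and M' are indeed identical after renaming through θ, as the definition requires.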
Partitions and partition pairs, originally introduced by Hartmanis [19], are useful for modelling information and information flows inside and between machines. Let S be any set of elements. A partition π on S is defined as follows: π = {Bi | Bi ⊆ S, Bi ∩ Bj = ∅ for i ≠ j, and ∪i Bi = S}, i.e. a partition π on S is a set of disjoint subsets of S whose set union is S. For a given s ∈ S, the block of a partition π containing s is denoted by [s]π, while [s]π = [t]π denotes that s and t are in the same block of π. Similarly, the block of a partition π containing S', where S' ⊆ S, is denoted by [S']π. The partition containing each element of S in a separate block is called a zero partition and is denoted by πS(0). The partition containing all the elements of S in one block is called an identity partition and is denoted by πS(1). Let π1 and π2 be two partitions on S. The product π1·π2 is the partition on S such that [s]π1·π2 = [t]π1·π2 if and only if [s]π1 = [t]π1 and [s]π2 = [t]π2. The sum π1 + π2 is the partition on S such that [s]π1+π2 = [t]π1+π2 if and only if a chain s = s0, s1, ..., sn = t exists in S for which either [si]π1 = [si+1]π1 or [si]π2 = [si+1]π2, 0 ≤ i ≤ n−1.
From the above definitions, it follows that the blocks of π1·π2 are obtained by intersecting the blocks of π1 and π2, while the blocks of π1 + π2 are obtained by uniting all the blocks of π1 and π2 which contain common elements. π2 is greater than or equal to π1 (π1 ≤ π2) if and only if each block of π1 is included in a block of π2. Thus π1 ≤ π2 if and only if π1·π2 = π1, if and only if π1 + π2 = π2. Any partition π on S can be interpreted as an equivalence relation defined on S, with the equivalence classes being the blocks of π. Using this interpretation, the partition π gives information about the elements of S with precision to the equivalence class. With this information, it is possible to distinguish elements from different classes, although it is impossible to distinguish elements from the same class. The partial-ordering relation ≤ denotes the fact that if π1 ≤ π2 then π1 (and so its associated equivalence relation) provides information about the elements of S that is at least as precise as the information given by π2 (and its associated equivalence relation).
A zero partition provides complete information about the elements of S and an identity partition gives no information. The partition product can be interpreted as a product of the appropriate equivalence relations introduced by these partitions; it represents the combined information about the elements of S provided by both relations together. The partition sum can be interpreted as a sum of the appropriate equivalence relations introduced by these partitions, and it represents the combined abstraction of both relations.
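The product, sum and ordering of partitions can be implemented directly on sets of blocks. A minimal Python sketch (the function names are ours; the two partitions are the ones used in the worked example later in this section):

```python
def product_p(p1, p2):
    """Partition product: intersect every block of p1 with every block of p2."""
    return {b1 & b2 for b1 in p1 for b2 in p2} - {frozenset()}

def sum_p(p1, p2):
    """Partition sum: repeatedly merge blocks of p1 and p2 that share elements."""
    blocks = [set(b) for b in p1 | p2]
    merged = True
    while merged:
        merged = False
        for i in range(len(blocks)):
            for j in range(i + 1, len(blocks)):
                if blocks[i] & blocks[j]:
                    blocks[i] |= blocks[j]
                    del blocks[j]
                    merged = True
                    break
            if merged:
                break
    return {frozenset(b) for b in blocks}

def le(p1, p2):
    """p1 <= p2 iff every block of p1 lies inside some block of p2."""
    return all(any(b1 <= b2 for b2 in p2) for b1 in p1)

p1 = {frozenset("ab"), frozenset("cd"), frozenset("efgh")}
p2 = {frozenset("ab"), frozenset("ce"), frozenset("dfgh")}
assert product_p(p1, p2) == {frozenset("ab"), frozenset("c"), frozenset("d"),
                             frozenset("e"), frozenset("fgh")}
assert sum_p(p1, p2) == {frozenset("ab"), frozenset("cdefgh")}
```

Note that product_p(p1, p2) ≤ p1 ≤ sum_p(p1, p2), illustrating that the product refines and the sum coarsens both operands.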

Example
In Table I, the function table of an incompletely specified Boolean function is presented. The function has 3 input bits (x1, x2, x3) and two output bits (y1 and y2).
For example, take x = f: h(f) = z (see Table I) and θ(h'(φ(f))) = θ(h'(6)) = θ(8) = z. Verification of this relation for other combinations can be performed by the reader as an exercise. For machine M = (I,O,h) as defined above, πI(0) = {a; b; c; d; e; f; g; h} is the zero input partition and πI(1) = {a,b,c,d,e,f,g,h} is the input identity partition. Let π1 = {a,b; c,d; e,f,g,h} and π2 = {a,b; c,e; d,f,g,h} be partitions on the set I. The product π1·π2 = {a,b; c; d; e; f,g,h} denotes the combined information of π1 and π2; for example, in π1 the symbols c and d are equivalent, and hence partition π1 cannot distinguish between these two symbols, whereas π2 can make the distinction between c and d (because they are in different blocks of π2). The product π1·π2 represents the partition that makes the union of the distinctions of π1 and π2 and combines in one block only those elements which are equivalent in both partitions. The sum π1 + π2 = {a,b; c,d,e,f,g,h} represents the information about distinctions present in both partitions. Finally, for the partitions π3 = {a,b; c,d; e,f,g,h} and π4 = {a,b; c,d,e,f,g,h}, π3 ≤ π4, because π3 makes all distinctions that π4 makes. This can also be checked by the definition of ≤: π3 ≤ π4 implies π3·π4 = π3, and indeed π3·π4 = {a,b; c,d; e,f,g,h} = π3. ≤ is a partial ordering relation and therefore need not be defined for all pairs of partitions: π5 = {a,b; c,d; e,f,g,h} and π6 = {a,c,e; b,f; d,g,h} are not related by ≤, since π5·π6 = {a,b; c,d; e,f,g,h}·{a,c,e; b,f; d,g,h} = {a; b; c; d; e; f; g,h}, which equals neither π5 nor π6. Given M = (I,O,h), let πI be a partition on I and let πO be a partition on O.
(πI, πO) is an I-O partition pair if and only if ∀A∈πI: k(A) ⊆ C, C∈πO (where k(A) = {k(x) | x∈A}); i.e. (πI, πO) is an I-O partition pair if and only if each block of πI unambiguously determines the block of πO in which the output is contained. If (πI, πO) is a partition pair, then πI is called the first partition of the pair and πO is called the second partition of that pair. Let πI be a partition on I. The minimal second partition which forms an I-O partition pair with πI as a first partition will be denoted mI-O(πI). The maximal first partition which forms an I-O partition pair with πO as a second partition will be denoted MI-O(πO). It can be proved [19] that: mI-O(πI) = Π {πj | (πI, πj) is an I-O partition pair}. For a given πI, mI-O(πI) describes the largest amount of information which can be computed about the output of M knowing the block of πI which contains the input. MI-O(πO) describes the least amount of information which must be known about the input of M in order to be able to compute the information about the output with precision to πO. π is an input partition induced by an output partition πO' (notation: π = ind(πO')) if and only if: ∀x,y∈I: if [h(x)]πO' = [h(y)]πO' then [x]π = [y]π. In other words, if π is an input partition induced by an output partition πO' and it is known that the present output y of M is contained in a block C∈πO', then it is known that the current input of M is contained in a block B∈π, where block B is unambiguously indicated by block C. It is possible to prove that π is an input partition induced by an output partition πO' if and only if π ≥ MI-O(πO'), i.e. the smallest input partition induced by a certain πO' is π = MI-O(πO').
For the purpose of a bit decomposition (in which the input/output bits are appropriately distributed instead of the input/output symbols), the concept of bit partitions has been introduced [25]. Let B = {b1, b2, ..., b|B|} be a set of (input or output) bits. Let T = {t1, t2, ..., t|T|} be a set of (input or output) symbols. Each input/output bit bk∈B introduces a two-block partition πT(bk) on the set of symbols (bit value patterns) T (in the case of incompletely specified machines, on the subset of T for which the value of this bit is specified). One block of πT(bk) contains the symbols for which bit bk has the value 0 and the other block contains the symbols for which bk has the value 1. The product of the partitions πT(bk) for all the bits bk∈B unambiguously defines the set of all input/output symbols, i.e. it is a zero partition. A partition πB on the set of bits B, πB = {b1; b2; ...; bk; (bk+1, ..., b|B|)}, is called a bit-partition. In a bit-partition the important bits (for distinguishing between certain symbols) b1, ..., bk are kept in separate blocks and the don't care bits bk+1, ..., b|B| are kept in a single block called a don't care block (denoted by dcb(πB)). The product (·) and sum (+) operations, as well as the ordering relation (≤), for bit partitions are defined in the same way as for "normal" partitions, with the following supplement: the product of a block (important or don't care) with an important block is an important block in the product partition, whereas the sum of a block (important or don't care) with a don't care block is a don't care block in the sum partition. The zero bit-partition is defined as a bit-partition with an empty don't care block. The identity bit-partition is defined as a bit-partition with all elements in the don't care block. If π = ind(πB) then, having the values of the important bits of πB, the block of π can be computed. If πB = ind(π) then, having the block of π, the values of all the important bits from πB can be computed.

Example (continued)
The function from Table I and its associated machine description M = (I,O,h) is again used. Given the partition πI = {a,c; b,f; d,e; g,h} on set I and the partition πO = {w,x; y,z} on set O, (πI, πO) is an I-O partition pair; this can be easily checked by examining all blocks of πI following the definition (e.g. h{a,c} = {x,w}, which is a subset of a block of πO, etc.). mI-O(πI) = {w,x; y; z} represents the maximal information about the output of M that can be calculated using πI. Similarly, MI-O(πO) = {a,c,g,h; b,d,e,f} represents the minimal input information that is necessary as an input to calculate πO.
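The operators mI-O and MI-O can be computed directly from a function table. Since Table I is not reproduced here, the sketch below uses a hypothetical single-output table h that is consistent with the values quoted in this example (h{a,c} = {x,w}, the given partition pair, and the stated results); the function names are ours:

```python
def m_io(h, pi_i):
    """Smallest output partition forming an I-O pair with pi_i: merge
    outputs that occur in the image of the same input block."""
    blocks = [set(h[s] for s in B) for B in pi_i]
    merged = True
    while merged:
        merged = False
        for i in range(len(blocks)):
            for j in range(i + 1, len(blocks)):
                if blocks[i] & blocks[j]:
                    blocks[i] |= blocks[j]
                    del blocks[j]
                    merged = True
                    break
            if merged:
                break
    return {frozenset(b) for b in blocks}

def M_io(h, pi_o):
    """Largest input partition forming an I-O pair with pi_o: group inputs
    whose outputs fall in the same block of pi_o."""
    block_of = {o: B for B in pi_o for o in B}
    groups = {}
    for x, ox in h.items():
        groups.setdefault(block_of[ox], set()).add(x)
    return {frozenset(g) for g in groups.values()}

# hypothetical function table consistent with the values quoted in the text
h = {'a': 'x', 'c': 'w', 'b': 'y', 'f': 'y', 'd': 'z', 'e': 'z',
     'g': 'w', 'h': 'x'}
pi_i = [frozenset("ac"), frozenset("bf"), frozenset("de"), frozenset("gh")]
pi_o = {frozenset("wx"), frozenset("yz")}
assert m_io(h, pi_i) == {frozenset("wx"), frozenset("y"), frozenset("z")}
assert M_io(h, pi_o) == {frozenset("acgh"), frozenset("bdef")}
```

Both computed partitions match the values given for mI-O(πI) and MI-O(πO) above.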
Let πO = {w,x; y; z}; then {a,c,g,h; b,d,e,f} and {a,c,g,h; b,f; d,e} are both induced input partitions of πO. The function from Table I will now be considered as a binary function instead of a symbolic function, so that bit-partitions on the input bits can be made. The set of input bits is called X, i.e. X = {x1, x2, x3}. In fact, an input bit represents a symbolic partition which contains in its first block all symbols for which this bit is 0 and in its second block all symbols for which this bit is 1: πX(x1) = {a,b,c,d; e,f,g,h}, πX(x2) = {a,b,e,f; c,d,g,h}, etc. A bit partition on X is, for example, the partition πX = {x1; x2; (x3)}. Given any partition τ, τ is a symbol partition induced by πX if and only if τ ≥ Π πX(bk), the product taken over the bits bk ∈ X − dcb(πX), which can be calculated as follows: τ ≥ πX(x1)·πX(x2), i.e. τ ≥ {a,b,c,d; e,f,g,h}·{a,b,e,f; c,d,g,h}, i.e. τ ≥ {a,b; c,d; e,f; g,h}. Similarly, an example of a bit partition induced by the symbol partition {a,c,e; b,d,f,h; g} is the partition πX = {x3; (x1,x2)}. In this case it is the only possible bit partition.
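Bit partitions follow directly from the binary codes of the symbols. The codes below are an assumption (Table I is not reproduced here), chosen so that they reproduce the bit partitions πX(x1) and πX(x2) quoted in the text, with x1 as the first bit:

```python
# hypothetical binary codes for the eight input symbols (x1 = first bit)
codes = {'a': '000', 'b': '001', 'c': '010', 'd': '011',
         'e': '100', 'f': '101', 'g': '110', 'h': '111'}

def bit_partition(k):
    """Two-block symbol partition pi_X(x_{k+1}) introduced by one input bit."""
    zero = frozenset(s for s, c in codes.items() if c[k] == '0')
    one = frozenset(s for s, c in codes.items() if c[k] == '1')
    return {zero, one}

assert bit_partition(0) == {frozenset("abcd"), frozenset("efgh")}
assert bit_partition(1) == {frozenset("abef"), frozenset("cdgh")}

# product over the important bits x1 and x2 of the bit-partition {x1; x2; (x3)}:
# the coarsest symbol partition induced by it
prod = {b1 & b2 for b1 in bit_partition(0) for b2 in bit_partition(1)}
prod -= {frozenset()}
assert prod == {frozenset("ab"), frozenset("cd"),
                frozenset("ef"), frozenset("gh")}
```

The computed product {a,b; c,d; e,f; g,h} is exactly the lower bound on τ derived in the example.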

General Full-Decomposition
A theory of general decomposition of sequential machines is presented in [29]. In this paper we are concerned with the synthesis of combinational logic; however, a combinational machine is merely a special case of a sequential machine with one state and a trivial next-state function. Therefore, the general decomposition theory can also be applied to our problem. A special case of the general full-decomposition theory [29], related to combinational circuits, is presented below.
In a general full-decomposition of a combinational machine M = (I,O,h) we need to find a composition of n cooperating partial machines Mi = (Ii,Oi,hi), as well as the mappings φ: I → ΠIi and θ: ΠOi → O, in order that the composition of the Mi together with the mappings φ and θ realizes machine M. The implementation of the general decomposition model requires three components: the input coder (pre-processor) φ, the simultaneously working communicating component machines (main processors) Mi, and the output decoder (post-processor) θ. The component machines, input coder and output decoder are implemented as combinational circuits. The model is general and contains all elements necessary for the construction of circuit networks which implement combinational circuits: parallel processing elements with possibilities for information exchange between them; divergent pre-processing elements for abstracting and splitting information and representing it in the appropriate form; and convergent post-processing elements for joining and combining information from parallel processors and representing it in the appropriate form. The full-decomposition can be characterised by the type of connections between the component machines and by the type of input/output encoding/decoding. In a general composition, each partial machine can use (partial) output information from another in order to compute its own output. However, two special cases of a general composition are possible: a parallel and a serial composition. In a parallel composition, no connections exist between the partial machines; each partial machine is able to compute its own output independently. In a serial composition, the machines are ordered and only the component machines Mi, i ≥ j, can use information from the machines Mj in order to compute their own outputs. The formal definition of a general composition is given below.
A general composition is said to be in canonical form if and only if the connection rules Coni compute vector values and have the following form: Coni(y1, ..., yn) = (Con1,i(y1), ..., Conn,i(yn)); i.e. the (partial) output information of a machine j, 1 ≤ j ≤ n, is transmitted separately to the input of a certain machine i, 1 ≤ i ≤ n, without combining it with the (partial) output information from other partial machines k, 1 ≤ k ≤ n, k ≠ j. A general composition is said to be in maximally pre-processed form if the connection rules Coni compute scalar values, i.e. the information transmitted from the various partial machines to a certain machine is combined prior to connection to the input of this machine. Of course, compositions in partially pre-processed form, lying between the two above extremes, are also possible. For a general composition GC of n combinational machines, the following holds. Theorem 5 A combinational machine M has a general full-decomposition with n partial machines if and only if n partition doubles (πIi, π*Ii) exist that satisfy conditions analogous to conditions (1)-(3) of theorem 6 below. A special case of theorem 5 for two combinational machines is presented below. For simplicity of presentation, we will use this case in the sequel of the paper.
Theorem 6 A combinational machine M has a general full-decomposition with two partial machines and without local connections if and only if two partition doubles (πI, π*I) and (τI, τ*I) exist that satisfy the following conditions:
The combinational machine str(MGC) is a general full-decomposition of machine M if and only if the general composition machine MGC realizes M (see Figure 3 for the case of two partial machines). In a similar way, formal definitions for parallel and serial compositions and decompositions can be introduced. Theorem 6 has been proved in [29]; the construction used in the proof is sketched here. Machine M2 computes h2(A2,B1') = [{x | x∈A2 ∧ x∈B1'}]τ*I, and machine M1 computes h1(A1,B2') analogously. Since condition (1) of theorem 6 is satisfied, h1(A1,B2') and h2(A2,B1') are unambiguously defined, i.e. M1 and M2 are two well-defined deterministic combinational machines which are able to compute their outputs from their inputs. Let Con1,2 and Con2,1 be two functions defined as follows: Con1,2: π*I → τI and Con2,1: τ*I → πI. These are well-defined functions which are able to unambiguously compute their values from their arguments. It is clear that the above-detailed construction of the machines M1 and M2 and the functions Con1,2 and Con2,1 is a general composition of the machines M1 and M2 without local connections. Since condition (2) is satisfied, it is possible to construct the general composition of M1 and M2 as a legal composition, i.e. the exchanged information can be computed (directly or indirectly) from the primary input information of the partial machines.
Let φ: I → πI × τI be the function φ(x) = ([x]πI, [x]τI), and let θ: π*I × τ*I → O be a surjective partial function with θ(B1,B2) = h(B1∩B2) if B1∩B2 ≠ ∅. Since (π*I·τ*I, πO(0)) is an I-O partition pair (condition (3)), the output of the original machine M can be unambiguously computed by θ from the outputs of the partial machines M1 and M2, while φ(x) unambiguously computes the inputs of M1 and M2. Therefore, the general composition of the machines M1 and M2 as defined above realizes the output behaviour of machine M. The construction of the decompositional realisation structure following theorem 6 is strictly analogous. By repeated use of the general full-decomposition model or its special cases, all functionally correct combinational circuit structures can be constructed.
Since we are building a subfunction for implementing π*E, the partitions π*A, π*B and π*E are all partitions on I, and the partition pair property is reduced to the ≤ property. It must be shown that the partition doubles satisfy all these conditions. As each block of πA is included in a block of π*A, πA ≤ π*A. Since all output information of machine A is used as input information for machine B, it can be assumed that τB = π*A, and condition (1) is then satisfied whenever condition (2) is also satisfied. For condition (2) we first need to calculate the product πA·πB = {0,8; 1,9; 2,10; 3,11; 4,12; 5,13; 6,14; 7,15}. Using this result, it is easy to see that condition (2) is satisfied. Finally, for condition (3) we need to calculate π*A·π*B = {0,8; 4,12; 1,2,3,9,10,11; 5,6,7,13,14,15}. With this product partition, it is obvious that condition (3) is satisfied and hence the decomposition is correct. In Table III, the function tables for the different logic blocks of the implementation of f2 of f are presented. It is interesting to note that this approach uses different gates: OR, NAND, EXOR and the gate ab. All these gates are obtained innately and directly from decomposition, without the use of (non-trivial) technology mappings.

Special Cases of General Full-Decomposition
To date, none of the methods that have been published has been able to produce near-optimal solutions for a general full-decomposition with multiple component machines. All the published results relate to special cases of the presented model. In this section, a number of special cases will be discussed.

Input-bit parallel full-decomposition
In parallel decomposition, no information flows between the partial machines, and therefore the partitions π and τ in theorem 5 are reduced to the identity partition πI(1). In input-bit decomposition, the input decoder is reduced to the appropriate distribution of the input bit lines, and this results in the replacement of the input partitions π and τ by the bit-partitions πIB and τIB. In this way, the following theorem is obtained from theorem 5.
Theorem 7 A combinational machine M has a non-trivial input-bit parallel full-decomposition with two component machines (see Figure 5) if and only if two partition doubles (πIB, π*I) and (τIB, τ*I) exist that satisfy the conditions: (1) πI ≤ π*I and τI ≤ τ*I, where πI = ind(πIB) and τI = ind(τIB).
A well-known and extensively studied special case of the input-bit parallel full-decomposition is the input-encoder problem. In this case, the output decoder θ is implemented as a PLA and the input encoders Mi have mutually exclusive sets of input bits. The problem is often modelled using multiple-valued logic [43][44][46]; the concept of multiple-valued logic is in fact very similar to that of partition theory. The general function of two-bit encoders is to replace the inputs a, a', b and b' of the PLA with the signals a + b, a + b', a' + b and a' + b'. It can be proved that the size of the new PLA (without the size of the two-bit encoders which generate these signals from a and b) cannot be larger than the size of the original PLA. However, the total size (the size of the new PLA plus the encoders) can be larger. Fortunately, in many practical cases a considerable gain in the total area can be obtained. The major problem with this method is the choice of the pairs of inputs. In [43][46] a heuristic algorithm is presented which tries to find sub-optimal pairs of inputs. The drawback of this method is that it ignores interactions between pairs of inputs.
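The effect of the two-bit encoder can be checked mechanically: every product term over the pair (a, b) can be rewritten as a product of a subset of the four sum signals, which is why the new PLA cannot grow. A small sketch (the standard minterm identities shown are our illustration, not taken from [43][46]):

```python
from itertools import product

def encoder(a, b):
    """The four sum signals a+b, a+b', a'+b, a'+b' produced by a two-bit
    encoder for the input pair (a, b)."""
    na, nb = 1 - a, 1 - b
    return (a | b, a | nb, na | b, na | nb)

# every product term over (a, b) becomes a product of a subset of the sums,
# e.g. the minterm a.b is (a+b)(a+b')(a'+b) and a.b' is (a+b)(a+b')(a'+b')
for a, b in product((0, 1), repeat=2):
    s_ab, s_anb, s_nab, s_nanb = encoder(a, b)
    assert (a & b) == (s_ab & s_anb & s_nab)
    assert (a & (1 - b)) == (s_ab & s_anb & s_nanb)
```

Each minterm over a pair of inputs needs only three of the four encoded columns, so no product term of the original PLA expands.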
A similar, but more sophisticated and general, way to solve the input-encoder problem was presented by Ciesielski et al. [15][52]. They implement the input encoders using PLAs. The encoders can have any number of input signals, and some of the input bits may be fed directly to the output decoder. Characterisation: in the first step, the set of inputs is partitioned into a number of disjoint subsets. Two heuristic approaches are presented for this: the first is based on integer programming, whereas the second is based on a modified min-cut algorithm. Benchmark results from a large and varied set of machines show that the integer programming approach is better, but the graph-partitioning approach is faster.
A classical multiple-valued minimization is then used to find the best implementation of the PLA θ.
The results that have been presented (in the form of benchmarks) are very promising; the only drawback of this approach is that the input sets may not intersect.
Another special case of the input-bit parallel decomposition was considered by W. Wan et al. [51]. Given a certain incompletely specified multiple-output function f(x1, ..., xn), the method presented in [51] searches for two disjoint subsets A and B of all inputs of f (i.e. A,B ⊆ {x1, ..., xn}), a multiple-output function F and k single-output functions g1, ..., gk, in order that the function F(g1(A), g2(A), ..., gk(A), B) realizes the "care" behaviour of f. This approach differs from the input-encoder problem presented earlier in this section because the input sets used for the encoder blocks gk are not required to be mutually disjoint. The decomposition is targeted towards Xilinx FPGAs. The input set A is limited to 4 elements, hence the functions gk have no more than 4 inputs. Because a gk can implement any function of 4 variables at the same cost, the goal of the decomposition is to implement as much functionality as possible in the functions gk and to make the function F as simple as possible. If the number of inputs of F is too large, the decomposition algorithm can be applied iteratively to F. Unfortunately, it has been impossible to characterise and evaluate this method more precisely, because further details of the method, other than the outline sketched in [51], were not available to us. However, the benchmark results presented in [51] show that the method performs well compared to previous methods that have been presented for synthesis on Xilinx logic blocks.

FIGURE 5 Input-bit parallel full-decomposition of M into two component machines M1 and M2.

Luba et al. introduced yet another special case (see Figure 6) [31], where one of the component machines (M2) is replaced by an identity function. The first step of this algorithm is to find the inputs IB2 which have to be fed directly to the output decoder θ. The search algorithm tries to find the best set of inputs, so that the number of inputs of the output decoder θ does not exceed a user-specified bound (for Xilinx clbs this bound is set to 4). The algorithm then tries to find an implementation for machine M1 using a minimum number of inputs (i.e. IB1 ∪ IB3 contains a minimum number of elements). First a disjoint decomposition is tried (i.e. IB3 has no elements). If this fails, inputs from IB2 are added to IB3 until machine M1 can be constructed. Unfortunately, no heuristics are described and no results on large benchmark sets are presented; therefore it is impossible to estimate the efficiency of this method for large circuits. However, the results that are presented are very promising.
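The feasibility test underlying such LUT-oriented decompositions is the classical column-multiplicity check: count the distinct cofactors of f with respect to a candidate bound set A. The sketch below is a simplified, disjoint-decomposition version of this check (the methods of [51] and [31] are more general; the function name and example are ours):

```python
from itertools import product

def column_multiplicity(f, n, bound):
    """Number of distinct cofactors of an n-input function f with respect
    to the 'bound' set of input positions.  If it is at most 2**k, then k
    encoder functions g1..gk over the bound set suffice so that
    f(x) = F(g1(A), ..., gk(A), B) (a simple disjoint decomposition)."""
    free = [i for i in range(n) if i not in bound]
    patterns = set()
    for a in product((0, 1), repeat=len(bound)):
        col = []
        for b in product((0, 1), repeat=len(free)):
            x = [0] * n
            for i, v in zip(bound, a):
                x[i] = v
            for i, v in zip(free, b):
                x[i] = v
            col.append(f(*x))
        patterns.add(tuple(col))
    return len(patterns)

# f = x0.x1.x2 + x3.x4: the bound set {0,1,2} collapses into one signal
f = lambda x0, x1, x2, x3, x4: (x0 & x1 & x2) | (x3 & x4)
mu = column_multiplicity(f, 5, [0, 1, 2])
assert mu == 2   # two distinct columns, so a single 3-input g suffices
```

Here the whole bound set is absorbed by one encoder function g = x0·x1·x2, so F has only the inputs g, x3 and x4 and fits a single 4-input clb.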
In recent years, a number of methods for the more general input-bit parallel decomposition problem have been presented. In [22][23][24], the set of input bits is partitioned into two disjoint subsets. The goal of this method is to minimize the number of bits needed for communication between M1, M2 and the output decoder θ. Although this method does not explicitly use partition theory, it can easily be formulated with this theory. The strength of this method is that it allows the estimation of the communication complexity without having to construct the machines M1 and M2 and the output decoder θ. Characterisation: good heuristic solutions for the calculation of the communication complexity without the actual need to construct the blocks. This algorithm can have a linear complexity for circuits with low communication complexities. Heuristic procedures for the partitioning of the input sets exist.
Benchmark results on large examples are relatively good.

Bit-parallel full-decomposition
In the bit-parallel full-decomposition, both the input decoder and the output decoder are reduced to the appropriate distribution of the input/output bit lines (see Figure 7). The theorem for this type of decomposition can be obtained from theorem 7 by replacing the output partitions πO and τO with the bit-partitions πOB and τOB. Theorem 8 A combinational machine M has a non-trivial bit-parallel full-decomposition with two component machines (see Figure 7) if two partition doubles (πIB, πOB) and (τIB, τOB) exist that satisfy the conditions: (1) for all obk ∉ dcb(πOB), (πI, πO(obk)) are I-O partition pairs, where πI = ind(πIB).
(3) πOB·τOB = πOB(0). Solutions to this decomposition problem have been presented independently in [27], [30] and [20]. In [27] the problem is called the output decomposition problem. An output decomposition consists of partitioning the set of Boolean functions (outputs) into a number of disjoint subsets, each implemented by a separate component machine. The output decomposition in [27] aims at partitioning a multiple-output function into a minimal number of limited programmable logic building blocks (such as PLAs, PALs, etc.) and at minimizing the number of interconnections between the blocks. The problem is modelled as a multi-dimensional constrained optimization problem with constraints imposed on the numbers of inputs, outputs and terms. It is solved by a special multi-dimensional packing algorithm.
First, the information processing structure of the original combinational machine and its relation to the characteristics of the building blocks are analyzed. From information about the correlations between the input, term and output variables, as well as information about the constraints, the expected minimum number of building blocks and the expected numbers of input bits, output bits and terms per building block are computed. The expected values show how difficult it is to satisfy each of the constraints with a given number of blocks and indicate the amount of attention that must be paid to each of the constraints during the partitioning process. The active input bits and terms for each single-output function are also computed. Based on this information, affinities (from the viewpoint of a certain partitioning problem) between any two (single or multiple-output) functions can be computed.
With the above information, a limited number of near-optimal solutions are constructed in parallel by performing a multi-dimensional packing using a beam-search algorithm. Since the decision making during the search is based on uncertain information, the search is guided by heuristic elaborations of the rule of minimizing the uncertainty of choices. At each step, the decisions are taken which ensure the highest certainty of achieving the optimal solutions and, under this condition, the decisions that minimize the uncertainty of information for the future choices. The information used directly for decision making consists of relations between the characteristics of the single-output functions and the constraints imposed by (partially constructed) building blocks, and correlations between the single-output functions and the functions in (partially constructed) blocks. Published experimental results show that the method is very effective; in almost all cases it was able to find the global optimum in reasonable time, even for very complex functions (e.g. a function with 131 inputs, 253 terms and 91 outputs (cpio) or a function with 45 inputs, 428 terms and 43 outputs (apex1)). The search algorithm has a number of parameters which enable a trade-off between the quality of the solutions and the required computation time.
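The beam search itself follows the standard scheme: at each depth, keep only the best `width` partial solutions and expand them in parallel. A generic sketch (the actual scoring in [27] embodies the uncertainty-minimizing heuristics described above, which are not reproduced here; all names are ours):

```python
def beam_search(start, expand, score, width, is_complete):
    """Generic beam search keeping the `width` best partial solutions.

    expand(state) yields successor states; score(state) is minimised;
    is_complete(state) tests for full solutions.
    Returns the best complete solution found.
    """
    beam = [start]
    best = None
    while beam:
        candidates = []
        for state in beam:
            for nxt in expand(state):
                if is_complete(nxt):
                    # Record complete solutions, keep the cheapest.
                    if best is None or score(nxt) < score(best):
                        best = nxt
                else:
                    candidates.append(nxt)
        # Prune: only the `width` most promising partial states survive.
        beam = sorted(candidates, key=score)[:width]
    return best
```

The `width` parameter is exactly the kind of knob mentioned above that trades solution quality against computation time: width 1 degenerates to a greedy search, while a very large width approaches exhaustive enumeration.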
A similar method was published a few months later in [20]. It uses less information about the original multiple-output function than the method presented in [27] and elaborates the information less precisely. An interesting concept not present in the method published in [27] is that of relaxing the term constraints and dynamically processing the terms.
Another approach to bit-parallel full-decomposition is presented in [30]. In this paper the problem is referred to as parallel decomposition. Characterisation: The actual parallel decomposition is preceded by argument reduction. Argument reduction is a technique which minimizes the number of inputs of a Boolean function, as opposed to classical minimisation, which aims at finding a minimum number of product terms. It is used to find function representations which use a minimum number of input variables. This process is analogous to term reduction, which finds the minimum number of product terms.
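For completely specified functions, the essence of argument reduction is easy to state: an input is redundant exactly when flipping it never changes the function value, and all such inputs can be dropped. (For incompletely specified functions the problem is harder, because don't cares create freedom in deciding when two cofactors may be made equal.) A minimal sketch over a truth-table representation; the helper name is an assumption, not the algorithm of [30]:

```python
from itertools import product

def reduce_arguments(f, n):
    """Drop inputs on which f does not depend.

    f: dict mapping 0/1 input tuples of length n to output values.
    Returns (essential_indices, reduced_function_dict).
    An input i is essential when flipping it changes f somewhere.
    """
    essential = []
    for i in range(n):
        if any(f[x] != f[x[:i] + (1 - x[i],) + x[i + 1:]] for x in f):
            essential.append(i)
    # Re-express f over the essential inputs only.
    g = {}
    for x in product((0, 1), repeat=n):
        g[tuple(x[i] for i in essential)] = f[x]
    return essential, g
```

For example, for f(a, b, c) = a XOR c the middle input is vacuous and the reduced representation uses only two variables.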
A parallel decomposition algorithm uses the results of the argument reduction and constructs two-block parallel decompositions. Unfortunately, in [30] only the idea of parallel decomposition is presented, with no algorithms and heuristics. Only a few experimental results are shown; however, these are very promising.
In a later paper [32], this decomposition method is combined with the input-bit parallel decomposition method mentioned in the previous section. This already allows for the construction of complex networks of blocks, but it is not yet a general full-decomposition in its most complete form. In this paper, Luba stated that the bit-parallel full-decomposition should be used as a preliminary step to the more general input-bit parallel decomposition procedure. A heuristic is then presented which shows how these two different decomposition approaches should be alternated and which parameters should be used to tune these algorithms. Some parts of the method use exhaustive or greedy algorithms, and the selection of parallel or serial decomposition is empirical. Results presented for quite small circuits show that this algorithm works reasonably well for ACTEL cells and very well for Xilinx cells. The question is, however, whether the algorithm will work effectively and efficiently for complex circuits.

4. COMPARISON OF THE APPROACHES
In the previous two sections, we presented the concepts of division-based and general decomposition-based multiple-level logic synthesis and discussed a number of synthesis algorithms that use these concepts. From this discussion it should be clear that the division-based approach is a very special case of the general decomposition approach, limited to decompositions with partial machines and decoders exclusively implemented with AND, OR and NOT, or NAND, or NOR gates. Therefore, the methods which are based on division can easily be transformed to equivalent decomposition methods. An example is presented in [35], where division-based synthesis is used to perform a special input-bit parallel decomposition. For special cases of logic implementations in the form of exclusively AND-OR-NOT, NAND or NOR networks, the solutions with division-based synthesis may be appropriate, under the condition that they take into account all the important objectives and constraints and involve effective and efficient algorithms.
The general decomposition approach has a number of advantages over the division-based approach. The main advantage of general full-decomposition is its general character. The general decomposition model and theorem enable the modelling and construction of all possible combinational network structures, while the traditional logic synthesis methods, including the division-based methods, model circuits in terms of special minimum or almost minimum functionally complete systems. Such a functionally complete system is able to express each function, but it models functions as networks composed exclusively of special sub-functions included in a certain functionally complete system (e.g. AND-OR-NOT, NAND, NOR, EXOR-AND, MUX), while the general decomposition approach models them in terms of all possible sub-functions. If a certain element library includes more types of primitive gates than those included in a certain minimum functionally complete system, or includes look-up tables, technology mapping is necessary. The network synthesized using exclusively the gates from the minimal functionally complete system must be mapped into a network composed of any elements from the library. If the repertoire of sub-functions offered by a certain implementation technology differs substantially from the set of gates provided by a given minimal functionally complete system, the work done by a traditional synthesis method is almost futile. Since the initial network is constructed without any regard to the future implementation, to guarantee an implementable or optimal solution the technology mapping must again perform synthesis, using the previously synthesised network only as a functional specification. With decomposition-based synthesis this problem does not exist: the synthesis process constructs a network of functional blocks which are in one-to-one correspondence with physical hardware blocks.
A further advantage of decomposition-based synthesis is its total character. During the decomposition, attention is paid not only to the active elements (operators) but to all the elements and aspects which can influence the quality of the results (i.e. inputs, outputs, interconnects and functionality) and to their interrelations. In the division-based approach all these aspects, except for the active elements, are completely ignored. The only exception is the lexicographical factorisation [45], which takes the interconnections into account by accepting a predefined input ordering during synthesis. Of course, it is possible to improve or further develop division-based methods by taking into account all these elements and their relations to the actual objectives and constraints, but this will not enlarge their range of application to circuits substantially different from AND-OR-NOT circuits.
Another important aspect is the use of don't cares in incompletely specified functions. In division-based synthesis, incompletely specified functions are first minimized using a two-level minimizer and then the actual synthesis is performed. In this way all don't cares are removed and hence the design freedom is drastically reduced. The synthesis problem is transformed to that of finding an implementation of a completely specified function (without considering the influence of the don't cares on the realisation of the actual design). These don't cares are lost forever. Also, the multiple-level structure itself can introduce don't cares [5][11]. It has been shown that very complex and time-consuming techniques are necessary to effectively use don't cares in division-based synthesis [4]. The decomposition approach does not require prior two-level minimisation and innately uses the freedom given by don't cares in order to optimize the resulting network structure (see for example [31]).
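The loss of freedom caused by premature two-level minimisation can be made concrete: whether an input can be dropped from an incompletely specified function depends on how the don't cares are later filled in. A small sketch (the representation and helper name are assumptions):

```python
def can_drop_input(f, n, i):
    """Check whether don't cares allow input i to become redundant.

    f maps 0/1 tuples of length n to 0, 1 or None (don't care).
    Input i can be dropped iff no pair of points differing only in
    bit i is specified to conflicting values: any such conflict
    forces a dependence on i.
    """
    for x in f:
        if x[i] == 0:
            y = x[:i] + (1,) + x[i + 1:]
            if f[x] is not None and f[y] is not None and f[x] != f[y]:
                return False
    return True
```

In the test below, the two don't cares make each of the two inputs individually droppable; once a two-level minimiser fixes the don't cares to concrete values, this choice is no longer available to the structural synthesis.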
In Table IV, some synthesis results are presented that compare division-based synthesis with general decomposition-based synthesis. The results are taken from [34]. The goal of the experiments was implementation with the minimum number of primitive logic blocks, these being 5-input, 2-output look-up tables. Table IV compares the number of clbs needed to implement a number of benchmark circuits from the MCNC logic synthesis benchmark set [53]. In the second column (marked Luba) the results of the general full-decomposition method described in [34] are presented. The remaining columns present the results of different division-based algorithms. These algorithms use (modified versions of) one of the factorisation algorithms presented in section 2 to minimise the two-level representation, followed by a technology mapping phase. The goal of the technology mapping phase is to transform the AND-OR-NOT network obtained by division into a feasible network with a minimum number of clbs. A feasible network is a network in which blocks are only used if they do not violate the constraints imposed on them (in our case: the blocks can implement any Boolean function that uses no more than 5 different input bits and 2 different output bits). In the third column (labelled MIS-PGA) the results are presented for the method described in [38][39]. This method heavily depends on the classical kernel/co-kernel minimisation and tries to map the minimum network to a feasible structure. In column 4, the results for the same set of benchmarks are presented for the ASYL system [48]. The ASYL system implements, among others, the lexicographical factorisation algorithm. In [48] a very simple but effective modification of the lexicographical factorisation algorithm is presented, which is able to construct feasible networks. The technology mapper Chortle [18] (column 5 of the table) presents a method for the technology mapping of multiple AND-OR-NOT networks (obtained using any division-based synthesis algorithm) into a network of clbs.
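The feasibility notion used in Table IV amounts to a simple check over a block netlist; a sketch, assuming blocks are given as (inputs, outputs) pairs of signal names:

```python
def is_feasible(network, max_in=5, max_out=2):
    """Check that every block respects the look-up-table constraints.

    network: list of (inputs, outputs) pairs, each a collection of
    signal names. The network is feasible iff no block uses more
    than max_in distinct inputs or max_out distinct outputs.
    """
    return all(len(set(ins)) <= max_in and len(set(outs)) <= max_out
               for ins, outs in network)
```

A technology mapper's job is precisely to turn an arbitrary AND-OR-NOT network into one for which this predicate holds, with as few blocks as possible.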
As the table shows, the general decomposition approach gives results which are in all cases better than those of the division-based approaches. These dramatic improvements are obtained for two major reasons. First, in the general decomposition approach we deal with partitions representing any function with a certain number of inputs and outputs, and not with AND-OR-NOT based functions; therefore, the synthesis process automatically finds feasible networks using the set of all available primitive logic blocks, without the need for technology mapping. Second, the general decomposition approach can handle incompletely specified functions, whereas division-based synthesis uses two-level minimised functions (where the design freedom is removed a priori).

5. CONCLUSIONS
In this paper, we have discussed and compared multiple-level logic synthesis methods based on the division of Boolean expressions, and logic synthesis methods based on the theory of general full-decompositions. It is clear that synthesis based on division can be considered as a special case of general decomposition-based synthesis. To date, more research work has been done in the field of division-based synthesis, but preliminary results obtained from the general decomposition-based methods show their large potential and have revealed a number of very promising properties. In section 2, we also showed that the known division-based algorithms have a number of weaknesses which can be remedied without leaving the division paradigm. In section 4, we showed that division-based synthesis has a number of fundamental weaknesses which require a more general solution, i.e. general full-decomposition. Recently, substantial progress has been made in developing algorithms for many special cases of the general decomposition. However, a lot of work still has to be performed in order to efficiently exploit the opportunities created by modern microelectronic technology.

FIGURE 1 Different division-based realisations of function w. (a) Standard factorisation. (b) Lexicographical factorisation without predefined input order, and concurrent factorisation algorithm. (c) Lexicographical factorisation with partially defined input order {a,b,c}.

FIGURE 2 Realisation of machine M by machine M'.
(πI, πO(0)) is an I-O partition pair.

FIGURE 3 General full-decomposition of machine M into two component machines M and M2 without local connections.


FIGURE 4 Comparison of division-based and general full-decomposition based realisations for function f. (a) Division-based gate network. (b) General full-decomposition based gate network. (c) Step 1 of the general full-decomposition based algorithm. (d) Step 2 of the general full-decomposition based algorithm.

FIGURE 6 Input-bit parallel decomposition as proposed by Luba et al.
The partition product π1·π2 is the partition on S such that [s]π1·π2 = [t]π1·π2 if and only if [s]π1 = [t]π1 and [s]π2 = [t]π2. The partition sum π1 + π2 is the partition on S such that [s]π1+π2 = [t]π1+π2 if and only if a sequence s = s0, s1, ..., sn = t, with si ∈ S for i = 1..n, exists for which either [si−1]π1 = [si]π1 or [si−1]π2 = [si]π2 for each i.
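These two definitions translate directly into code; a sketch assuming a partition is represented as a mapping from elements to block labels (the function names are ours):

```python
def partition_product(p1, p2):
    """Partition product: s, t share a block iff they do in both p1 and p2.

    Partitions are dicts mapping each element of S to a block label;
    the pair of labels serves as the product block label.
    """
    return {s: (p1[s], p2[s]) for s in p1}

def partition_sum(p1, p2):
    """Partition sum: transitive closure of 'same block in p1 or p2'.

    Implemented with union-find: elements sharing a block in either
    partition are merged, realising the chain condition s0, ..., sn.
    """
    parent = {s: s for s in p1}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    for p in (p1, p2):
        groups = {}
        for s, b in p.items():
            groups.setdefault(b, []).append(s)
        for members in groups.values():
            for s in members[1:]:
                parent[find(s)] = find(members[0])
    return {s: find(s) for s in p1}
```

For example, with p1 = {1,2 | 3,4} and p2 = {1 | 2,3 | 4}, the product is the zero partition (all singletons) while the sum collapses everything into a single block, since 1~2, 2~3 and 3~4 chain together.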