Formal Codesign Methodology with Multistep Partitioning

A codesign methodology is proposed which is suitable for control-dominated systems but can also be extended to more complex ones. Its main purpose is to optimize the trade-off between hardware performance and software reprogrammability and reconfigurability. The proposed methodology is intended to cover the development of the whole system. It deals in greater detail with the steps that can be carried out without any particular assumption regarding the target architecture. These steps concern splitting up the specification of the system into a set of individually synthesizable elements, and then grouping them for the subsequent mapping stage. In order to decrease the complexity of each partitioning attempt, a two-step algorithm is proposed, thus permitting a wide exploration of possible solutions. The methodology is based on the TTL language, an extension of the T-LOTOS Formal Description Technique which provides a large number of operators as well as a formal basis. Finally, an example covering the complete design cycle, except for the allocation stage, is provided.


INTRODUCTION
The design of complex systems comprising hardware and software elements is of considerable interest on account of its extremely varied applications, thanks to the current availability of low-cost hardware devices. It is of fundamental importance to optimize both the cost and the performance of such systems; various studies have been carried out on this kind of design, which is commonly called codesign. Codesign is an approach to the development of systems composed of both hardware and software modules [1,2]. Its main purpose is to optimize the trade-off between hardware performance and software reprogrammability and reconfigurability. Moreover, the aim of codesign is to be able to design a whole system without excessive preliminary constraints on mapping the modules onto hardware and software parts [3, 4].
At present the sector which seems to offer the best prospects for applying codesign methodologies is that of embedded control-dominated systems, thanks to their low complexity. In these systems output signals are caused directly by input signals, which generally means that the systems do not require extremely complex processing of the input signals.
Embedded systems are often used in life-critical situations, where reliability and safety are more important criteria than performance. For this reason we believe that the design approach should be based on the use of a formal model to describe the behaviour of the system before a decision on its implementation is taken.
In this paper we propose a codesign methodology which is not only suitable for the above-mentioned systems, but can also be extended to more complex ones. The systems to which our codesign methodology is currently applied are control-dominated ones.
In order to achieve the final partitioning it is necessary to define the processor, hardware components and interfaces, generally referred to as the target architecture. The proposed methodology currently refers to an architecture including a single general-purpose processor and a few application-specific hardware components (ASIC or FPGA), a single-bus master software component and a single-level memory hierarchy [5][6][7].
As the proposed methodology is intended to cover the development of the whole system, that is, from the specifications in terms of both time and behaviour to the implementation of its components (the software components by using a programming language, the hardware ones by synthesis), certain choices have to be made, especially that of the technique used to describe the system.
The language used for specification of the system is TTL [8] (Templated T-LOTOS), an extension of T-LOTOS [9] specially developed for use in codesign. TTL is also a valid tool for the subsequent stages of development, on account of its formal basis and the operators it provides. TTL allows consistency to be tested using mathematical properties instead of simulation approaches; in this sense the methodology is said to be formal. The issues this paper deals with in greater detail are the way in which the specification of the system is split up into a set of individually synthesizable elements, and the way in which they are grouped prior to the mapping stage. These choices are made without the need for any particular assumption regarding the target architecture, which only needs to be chosen before mapping. The paper does not deal in detail with the problem of mapping, because most of the approaches in the literature can be used.
The final part of the paper presents a case study in order to evaluate the proposed design methodology.
The steps needed to go from specification to implementation are sketched in Section 2. Section 3 gives a brief description of the formal technique used in the design. Section 4 describes in more detail the process of specification and decomposition. Section 5 explains preclustering, which is a part of partitioning. Section 6 discusses implementation and Section 7 introduces a case study for the present methodology. Section 8 provides the authors' conclusions.

AN OUTLINE OF CODESIGN METHODOLOGY
Figure outlines the main steps which go from specifications to implementation of the system. The first step in the methodology is the development of the specifications, using TTL. The specification stage is followed by the splitting stage, which is subdivided into two steps: refinement and decomposition. The first splitting step, called refinement, makes the description of the system less abstract, thus passing from specification of the requirements of the system (maximum abstraction) to a structured representation (minimum abstraction). In practice the refinement step includes several cycles of successive refinement. The TTL language provides adequate support during the whole stage, thanks to its operators and formal basis.
The second splitting step, called decomposition, consists of dividing the specifications up into a set of elements (called tasks) which can be synthesized separately. Decomposition is based on syntactic and semantic characteristics of the specification, as discussed in greater detail in Subsection 4.2 (a similar approach to specification can be found in [10]).
Thanks to this approach the computational complexity required is quite low.
At this stage, however, there are still no constraints on whether an element is to be implemented in hardware or software.
In the proposed methodology, the set of tasks obtained from the decomposition step undergoes two further stages which together perform partitioning. In the first (preclustering) the number of tasks is reduced below a certain threshold (by grouping them into so-called clusters) in order to make the subsequent mapping stage less computationally complex. Preclustering is followed by a mapping stage in which the clusters are classified as hardware or software, grouped together if necessary and mapped onto the target architecture; this last stage performs the functions usually referred to as partitioning in codesign. Dividing partitioning into two stages speeds the operation up and improves the cost-performance trade-off of the system being developed.
The software partitions obtained are then translated into C, the hardware partitions into synthesizable VHDL. There is, however, nothing to prevent the choice of other languages; the current choice was dictated by the wide availability of tools for these languages. If the results of the mapping phase are unsatisfactory, the clustering process can be repeated starting from any stage, varying, for instance, the number of clusters produced by preclustering. This flexibility is a distinctive feature of the proposed methodology, made possible by the choice of TTL and the use of the same language throughout the design. Finally, the interface and the software scheduler are also generated, on the basis of the target architecture and the hardware and software partitions.

TTL: THE SPECIFICATION LANGUAGE
In the literature the problem of the technique to be used to describe a system has been widely discussed.
In this paper we use TTL (Templated T-LOTOS) as the specification language. It is an extension of T-LOTOS [9,17] which is suitable for the codesign approach.
The main extensions TTL introduces to T-LOTOS are modularization (allowing, for example, the use of libraries), use of templates (allowing the definition of a generic process) and the introduction of an iterative construct (loop) [8].
The main features of TTL are:
A high degree of abstraction. This makes it possible to concentrate on what is to be done without being affected by problems regarding actual implementation. For example, the high degree of abstraction of TTL guarantees that the language is suitable for describing both hardware and software, regardless of the target architecture.
Concurrency. This feature makes it possible to model systems made up of various parts which evolve in parallel, a situation typical of hardware systems.
The possibility of inserting time references. This makes it possible to specify the timing constraints and estimate the evolution in time of the system. This feature is necessary for real-time systems.
The possibility of using component libraries. This allows time to be saved in the specification stage and leads to more efficient design thanks to the reuse of already developed, and thus carefully tested and optimized, components.
Formal basis. This allows a mathematical approach (as opposed to a simulation one) to be used to test the consistency of each refinement step with respect to the previous one. Moreover, the formal basis allows us to check that the specification possesses useful properties such as deadlock freedom, liveness and compliance with time constraints.
To manage the time constraints, we have identified two kinds of time attributes which can describe a wide range of situations: min/max and rate constraints [5]. In TTL, min/max constraints can be expressed directly by the time attributes of TTL actions, as follows:
min constraint of t1 time units on a given action a. This means that action a has to occur after a delay of at least t1 time units. This is written in TTL as a{t1..∞}.
max constraint of t1 time units on a given action a. This means that action a has to occur within a maximum of t1 time units. This is written in TTL as a{0..t1}.
min/max constraint on a given action a. This is a combination of the two previous cases. If the minimum delay is t1 and the maximum t2, it is written as a{t1..t2}.
fixed delay of t1 time units on a given action a. It is possible to fix a definite delay by writing a{t1}.
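The four kinds of time attribute can all be read as intervals of admissible delays. The following Python sketch illustrates that reading only; it is not TTL syntax or tooling. The class and function names are hypothetical, and treating both bounds as inclusive is an assumption based on the wording "at least" and "within a maximum of".

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeWindow:
    """Interval [lo, hi] of admissible delays, in abstract time units.

    Assumption: both bounds are inclusive.
    """
    lo: float
    hi: float

    def admits(self, delay: float) -> bool:
        """Return True if an action occurring after `delay` respects the window."""
        return self.lo <= delay <= self.hi


def min_constraint(t1: float) -> TimeWindow:
    # The action must occur after at least t1 time units.
    return TimeWindow(t1, math.inf)


def max_constraint(t1: float) -> TimeWindow:
    # The action must occur within t1 time units.
    return TimeWindow(0, t1)


def min_max_constraint(t1: float, t2: float) -> TimeWindow:
    # Combination of the two previous cases.
    if t1 > t2:
        raise ValueError("minimum delay must not exceed maximum delay")
    return TimeWindow(t1, t2)


def fixed_delay(t1: float) -> TimeWindow:
    # A fixed delay is the degenerate window [t1, t1].
    return TimeWindow(t1, t1)
```

Seen this way, a fixed delay is simply a min/max constraint whose two bounds coincide.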
The language does not allow rate constraints on actions to be specified directly. This is not a problem, however, as for practical purposes a rate delay can always be expressed as a min/max or a fixed delay [18]. TTL has been developed in such a way as to use all the existing tools for T-LOTOS (e.g., Lola [19]). In fact it is possible to translate a TTL specification into T-LOTOS using only syntactical transformations. TTL can be supported by a set of graphic tools which allow the designer to specify the behaviour of the system in a simple, immediate, familiar way. A possible approach would be like the one followed in [20], which illustrates a technique by which it is possible to go from specification of the system by time diagrams to a T-LOTOS specification; since TTL is a superset of T-LOTOS, a similar tool can be built.
The language has two components: the first is the description of the behaviour of processes and their interaction, and is mainly based on the CCS [21] and CSP [22] models; the second is the description of the data structure and expressions, and is based on ACT ONE [23], a language for the description of Abstract Data Types (ADTs).
The syntax of the most important TTL operators is summarized in Table I; a complete description of TTL syntax and semantics can be found in [24]. The process of specifying a system is generally composed of several refinement steps. It starts from a system-level description and proceeds by splitting the system into increasingly smaller pieces, until it reaches a level at which the single pieces can either be constructed by combining library components or be described directly.
Figure 2 shows the refinement process during the preliminary stages of codesign.
Level 0 coincides with the top-level system specification: at this level it is preferable to describe the system in as abstract a way as possible. The next n steps go from the abstract description of the system to a concrete one: at each refinement step the functional blocks are split into more elementary ones, leaving the behaviour of the system unchanged. Consistency between the description of the system at level n and that at level n-1 is verifiable thanks to the formal basis of the language. Traditionally the consistency between what is specified at level n and level n-1 was checked by simulation. The use of a language like TTL supports a better approach to system specification. The aim of the refinement step is to obtain specifications which can be efficiently implemented and, at the same time, represent the same system as level 0.
The division into modules also requires definition of the signals that have to be exchanged among the modules, which are usually called internal signals.
Exploiting the formal basis of the language, TTL aids the designer throughout the refinement process, giving mathematical certainty that the descriptions in the various steps are consistent.
The modularity of TTL also makes it possible to implement and include library components which have already been tested and used. This plays an important role in this step of the methodology, as the replacement of certain blocks with library modules which have a hardware counterpart notably increases the efficiency of the system.

Decomposition
The aim of the system decomposition stage is to identify the main functional blocks of the system being developed. These blocks represent the bricks which will be used in the subsequent stages to perform partitioning. Decomposition consists of splitting up the TTL specifications until a set of tasks which can be synthesized separately is found.
The main aim of codesign methodologies is to identify the blocks which permit the trade-off between performance and manufacturing costs to be optimized. It is, however, not desirable at this stage to make choices that constrain whether a block must be mapped onto hardware or software. This would reduce the degree of freedom in the partitioning phase and therefore the possibility of obtaining a near-optimal system. In addition, making implementation choices at this stage would reduce the possibility of re-designing the system. Therefore, to obtain the maximum independence from the final implementation, the decomposition criteria used must not involve choices which depend on the target architecture or on considerations concerning implementation. On the basis of the considerations made so far, the data used must be obtained from the characteristics of the specifications alone.
Given the features of TTL there are two possible alternatives for the choice of parts we consider to be elementary.
Considering the single TTL constructs to be elementary, i.e., considering the operators which make up the behavioural expressions (external offer, choice, etc.).
Considering the processes to be elementary.
The first hypothesis can be discarded straight away as it would lead to an excessively high number of tasks, thus introducing too high a degree of complexity and fragmentation. The second is more plausible, also in view of the subsequent translation from TTL into the language which will be used to implement the system.
Due to the tool currently used to translate the specification into synthesizable languages, if other processes are instanced inside a given process they have to be part of it. This restriction can be removed by using an ad hoc TTL synthesizer or some other translator which also accepts generic processes.
The decomposition process starts from the main specification and decomposes it according to the parallelism between the various processes. Figures 3a and 4a give some examples of decomposition into tasks. In the first example three tasks are obtained, as they instance no other processes and are each parallel with the other two. In the second example the result of the decomposition process is two tasks, as process P2 instances P3 and so they constitute a single atomic element.
Decomposition can be performed automatically by means of a recursive algorithm which applies the considerations made previously.
The starting point of the decomposition algorithm is the tree which represents the hierarchy of TTL processes according to how they are instanced; the trees for the processes in the example given above are shown in Figures 3b and 4b. Each node in the tree can be labelled with an attribute which indicates whether it is made up of a parallel combination of other processes. The possible values of this attribute are: para, which indicates that the node is made up of a parallel combination of other processes (as in the specification in the first example in Fig. 3); nopara, which means the opposite of para. The attribute nopara is also assigned to processes which instance themselves (like process P1 in Fig. 3).
Figure 5 shows an example of a tree of processes, where the single nodes are marked with the relative attribute. The algorithm for the decomposition into tasks is described in Figure 6 (the function do_task is described in C-like language).
In the algorithm, attrib(node) indicates the function which returns the value of the node attribute, while branch_k-th(node) indicates the function which gives the k-th branch of the node and Task_i is the i-th task. When applied to the example in Figure 5 the algorithm gives the following results: T1 = P1, T2 = P3, T3 = P4.
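The recursive idea can be sketched in Python as follows. This is not the do_task function of Figure 6, which is not reproduced here. It is a minimal illustration assuming that a para node is expanded branch by branch, while a nopara node becomes one task together with every process instanced beneath it. The Node class and the example tree are hypothetical; the tree only mirrors the situation described earlier, in which P2 instances P3.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A process in the instantiation tree (hypothetical representation)."""
    name: str
    attrib: str                      # "para" or "nopara"
    branches: List["Node"] = field(default_factory=list)


def subtree_names(node: Node) -> List[str]:
    """Collect the node and every process instanced below it."""
    names = [node.name]
    for branch in node.branches:
        names.extend(subtree_names(branch))
    return names


def do_task(node: Node, tasks: List[List[str]]) -> List[List[str]]:
    """Append to `tasks` the tasks found in the subtree rooted at `node`."""
    if node.attrib == "para":
        # A pure parallel combination: each branch is examined separately.
        for branch in node.branches:
            do_task(branch, tasks)
    else:
        # Assumption: a nopara node is atomic, so it forms a single task
        # together with all the processes it instances.
        tasks.append(subtree_names(node))
    return tasks


# Hypothetical specification: P1 in parallel with P2, where P2 instances P3.
spec = Node("Spec", "para", [
    Node("P1", "nopara"),
    Node("P2", "nopara", [Node("P3", "nopara")]),
])
tasks = do_task(spec, [])
```

On this tree the sketch yields two tasks, one containing P1 alone and one containing P2 together with P3.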
By adopting this algorithm it is possible for a given specification to be decomposed into a low number of tasks. However, too low a number of tasks would mean few alternatives in the partitioning stage and therefore little chance of exploring hw/sw trade-offs.
The optimal case is when the specification is composed of a hierarchy of processes of two types: 1) processes which are only a parallel combination of other processes; and 2) processes which instance themselves. In this case, in fact, the tasks are equivalent to the branches of the process tree, and so their number is the maximum possible.
To achieve close to optimal results, the initial specification of the system has to be written in a style that will favour the partitioning process. In practice, it has to be written in such a way that the processes fall into one of the following categories: processes which instance themselves; processes obtained by means of a parallel combination of several processes.
It should be pointed out that these rules should only be taken as suggestions as to the specification style to be adopted and not as TTL constraints.

PARTITIONING
After splitting the specifications up into tasks according to the criteria outlined above, the partitioning stage starts. It aims to map tasks onto hardware or software components. In our approach the partitioning is divided into two stages in order to reduce the complexity and the computational cost, which are critical in developing complex systems; these stages are called preclustering and mapping. The main difference between the two stages is that mapping is performed after choosing the target architecture (e.g., the type of processor, hardware circuits, bus, etc.), according to the actual delay introduced by the modules and their manufacturing (monetary) costs, while preclustering groups the tasks together according to their "coupling degree". We focus our attention on the preclustering stage, showing an algorithm which is able to perform it at a very low computational cost. The results of preclustering can be used without any change by most of the mapping strategies to be found in the literature.

Pre-clustering
The aim of preclustering is to reduce the number of tasks to be partitioned with the purpose of reducing the complexity of the problem of partitioning.
The number of "sets of tasks" (which will be called clusters) generated by preclustering is obviously of critical importance for mapping. If this number is too high the complexity of the problem is not significantly reduced; whereas if it is too low, the mapping will not achieve a good cost-performance trade-off. This is due to the fact that the only stage where delays and manufacturing costs are taken into account is mapping.
A lower computational cost would suggest executing preclustering until it reaches such a low number of clusters that they can be allocated without making any choices in the mapping stage. On the other hand, the greater number of parameters taken into account during the mapping stage would suggest giving it as many optimization chances as possible by providing a large number of clusters. The best solution is probably a compromise between the two strategies. The most suitable number of clusters that preclustering has to provide to the mapping stage is very difficult to establish a priori, and up to now our methodology has proceeded by trial and error.
However, we are working on partially automating this choice, basing it on data collected during previous design cycles and interactions with designers.
The preclustering algorithm adopted to group tasks attempts to minimize the coupling degree among the resulting clusters, where the coupling degree is defined as the "number of interactions between two tasks". We believe the coupling degree is critical for implementation of the final device, mainly because the higher it is, the higher the communication will be, which increases the cost connected with interfaces. The preclustering stage works on the system before the choice of target architecture, so it is not possible to know the manufacturing cost or the delay cost. The coupling degree, instead, can be evaluated and it appears to be a valid heuristic for reducing the complexity of the problem: in fact, by reducing the coupling degree, tasks with higher interactions will be grouped together and will be mapped onto the same partition (either software or hardware).
On the basis of the tasks output by the decomposition process, the preclustering algorithm constructs a weighted (with respect to the coupling degree) graph of the various tasks and works on this to group them into separate clusters.
Construction of the graph is preceded by classification of the task interaction points by identifying the type of data exchanged with the other tasks. Then each type of data is associated with a weight which depends on the amount of interaction introduced by the transaction.
The weighted graph has a one-to-one correspondence with the set of tasks output by the decomposition process. More specifically:
Each task corresponds to a vertex in the graph;
Each interaction point corresponds to an edge with a weight given by the function couplingDegree(interactionPoint), which gives the weight associated with the type of gate.
Figure 7 gives a simple example to clarify the concept. We assign a weight of one to the Boolean type and a weight of sixteen to the Int type; the resulting graph is shown in Figure 8. The function couplingDegree therefore returns, for each of the gates g1, g2, g3 and g4, either 1 (Boolean gate) or 16 (Int gate); in particular, couplingDegree(g4) = 16.
Given a graph with p nodes v1, v2, ..., vp, it is possible to associate with it an adjacency (or distance) matrix, p × p in size, in which the element aij is equal to the weight of the edge which connects nodes vi and vj (if the edge does not exist we assume that it has a weight of 0).
If, in the previous example, we decide to provide two clusters as input to the mapping stage algorithm, it would be natural to combine task (2) and task (3) as they are the ones which interact the most.
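As an illustration of how the weighted graph can be stored, the Python sketch below builds a symmetric adjacency matrix from a list of interaction points. The weights of 1 for Boolean and 16 for Int come from the example above. The gate topology is hypothetical, since Figure 7 is not reproduced here, and summing the weights of several gates between the same pair of tasks is an assumption.

```python
from typing import List, Tuple

# Weight associated with each type of data exchanged through a gate.
TYPE_WEIGHT = {"Bool": 1, "Int": 16}


def coupling_degree(gate_type: str) -> int:
    """Weight of an interaction point, determined by the type of its gate."""
    return TYPE_WEIGHT[gate_type]


def adjacency_matrix(p: int, gates: List[Tuple[int, int, str]]) -> List[List[int]]:
    """Build the p x p matrix of the task graph.

    Each gate is (task_i, task_j, gate_type) with 0-based task indices.
    A missing edge keeps weight 0. Assumption: parallel gates between the
    same two tasks add up.
    """
    a = [[0] * p for _ in range(p)]
    for i, j, gate_type in gates:
        w = coupling_degree(gate_type)
        a[i][j] += w
        a[j][i] += w
    return a


# Hypothetical three-task system (indices 0, 1, 2), not the one in Figure 7.
gates = [(0, 1, "Bool"), (1, 2, "Int"), (0, 2, "Bool")]
A = adjacency_matrix(3, gates)
```

In this hypothetical system the heaviest edge joins the tasks with indices 1 and 2, so they are the natural candidates for grouping.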
Figure 9 shows the preclustering algorithm written using a C-like syntax.
In the algorithm, A[i, j] is the weighted graph matrix (of p nodes), n ≤ p is the final number of clusters and C is the set of all clusters. Initially C contains all the tasks provided by decomposition, and c = p.
When execution terminates, the set C will contain the n clusters which minimize the total coupling degree function globalCouplingDegree(C), defined as the sum of the weights A[h, k] over all pairs of distinct clusters h and k in C. Figure 10 shows the various steps of the algorithm when applied to the simple example in Figure 7, with n = 2. Figure 11, on the other hand, shows a more complex example, in which p = 5 and n = 3.
The proposed algorithm is optimal at minimizing the globalCouplingDegree function for a given number of final clusters, in the sense that the final configuration is one in which the function reaches the absolute minimum. There may, however, be several configurations in which globalCouplingDegree takes on the minimum value.
In the algorithm, the clusters r and s to be grouped together are those for which the element A[r, s] is the maximum of all the elements in the matrix. If there are several elements which take on the maximum value, the algorithm used in the examples chooses one at random. Some enhancements could be made in order to improve the effectiveness of the algorithm in choosing the best element. To make this clearer, let us consider the example shown in Figure 12, which only reproduces the portion of the graph we are interested in. On the basis of the algorithm illustrated above, task (2) could be clustered with either (1) or (3, 4), the latter obtained by a previous iteration of preclustering. A variation of the algorithm suggests clustering task (2) with task (1), as this solution, while minimizing the coupling degree, produces final clusters with a lower number of tasks each. This modification, along with some others, has not been shown for the sake of simplicity, but they are implemented in the working program. Having smaller clusters means it is easier to explore hw/sw trade-offs and consequently obtain a better final solution.
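The greedy grouping described above can be sketched in Python as follows. This is an illustrative reading, not the listing of Figure 9. It assumes that when two clusters are merged their weights towards every other cluster are summed, and it breaks ties deterministically in favour of the smaller merged cluster (in the spirit of the variation mentioned above) rather than at random. The function names are hypothetical.

```python
from typing import List, Tuple


def precluster(a: List[List[int]], n: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Merge the p tasks of matrix `a` into n clusters.

    Returns (clusters, w), where clusters[i] lists the task indices of
    cluster i and w is the reduced cluster-to-cluster weight matrix.
    """
    p = len(a)
    if not 1 <= n <= p:
        raise ValueError("n must satisfy 1 <= n <= p")
    clusters = [[i] for i in range(p)]      # initially one cluster per task
    w = [row[:] for row in a]               # work on a copy of the matrix
    while len(clusters) > n:
        best = None
        for r in range(len(clusters)):
            for s in range(r + 1, len(clusters)):
                size = len(clusters[r]) + len(clusters[s])
                key = (w[r][s], -size)      # max weight, then smaller merge
                if best is None or key > best[0]:
                    best = (key, r, s)
        _, r, s = best
        clusters[r] = clusters[r] + clusters[s]
        # Assumption: the merged cluster inherits the summed weights.
        for k in range(len(clusters)):
            if k != r and k != s:
                w[r][k] += w[s][k]
                w[k][r] = w[r][k]
        # Remove row and column s from the matrix.
        del clusters[s]
        del w[s]
        for row in w:
            del row[s]
    return clusters, w


def global_coupling_degree(w: List[List[int]]) -> int:
    """Total weight between distinct clusters, each pair counted once."""
    return sum(w[r][s] for r in range(len(w)) for s in range(r + 1, len(w)))


# Hypothetical three-task graph with one Int edge (16) and two Boolean edges (1).
A = [[0, 1, 1], [1, 0, 16], [1, 16, 0]]
clusters, W = precluster(A, 2)
```

For this small hypothetical graph the tasks joined by the edge of weight 16 end up in the same cluster, and the remaining coupling between the two clusters is 2.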

Mapping
This is the stage where the various clusters output by the preclustering stage are classified as hardware or software and allocated to the target architecture. The purpose of this stage is to allocate each module either to software or to hardware, trying to maximize the performance of the system and minimize the monetary cost of manufacturing. To achieve this result the system must find the best allocation for each module. The mapping is influenced by the target architecture chosen, because it imposes requirements on size (of the hardware part, available memory, etc.) and on the interfaces between hardware and software. Moreover, the scheduling algorithm has a strong impact on the performance of the system [25,26].
The methodology has been devised in such a way as to leave a wide choice of partitioning methods. The mapping problem is not addressed by this paper; some interesting strategies can be found in [27] and [28], each of which can easily be integrated into our methodology and benefits from the reduction in the number of input tasks. The last stage in the development of a device, performed using typical codesign techniques, is the definition of the interfaces (i.e., software drivers and their hardware counterpart) between the modules and of the scheduling algorithm needed to manage the active tasks.
Such an algorithm is needed because the various modules allocated to software use shared resources, such as the CPU, and also because it is required to manage the exchange of information between hardware and software.
The choice of the interfaces affects the performance of the system as a whole and is closely correlated with the scheduling algorithm. Interfaces and scheduling algorithms can indeed be said to represent a single feature of the system and they have to be chosen when the target architecture is defined [29].
The scheduling algorithm is essentially the operating system of the device being developed and its main aim is to activate all the software tasks correctly and in the right sequence and, at the same time, manage synchronization of the hardware modules; all these operations have to be performed in such a way as to respect the time constraints of the device (max delay, max response time, etc.). The choice of the appropriate scheduling algorithm has to reach a compromise between the need for a complete, reliable manager and the need to avoid using excessive resources, especially memory and CPU time, to manage itself. This last point is even more important when the device comes under the category of control-dominated systems, where the management routines for single signals are relatively simple and so do not require very long processing times.
Two possible kinds of scheduling algorithm are interrupt-driven and soft-managed.
The interrupt-driven technique is based on the use of classical interrupt management techniques to schedule both software and hardware tasks.
Each task (hardware or software) which requires the exchange of an output signal generates an interrupt which activates the related routine. Even though it is logically simple and immediate, the interrupt-driven algorithm introduces the complexity inherent in the problem of saving the context of any routine which may be active when the interrupt occurs and managing priorities. In addition, this algorithm requires memory in which to store context information and a device to manage several interrupt lines so as to be able to cope with all the hardware tasks present.
The soft-managed technique is based on a simple polling algorithm, modified to deal with synchronizing the various tasks. Here the resources needed to manage the algorithm itself are very few, but care must be taken to prevent the time required by the polling cycle from introducing an excessive delay in the management of signals.
The technique also has to allow the parallel evolution of all the hardware clusters at least until they require input/output from other modules.The algorithm may also allow some tasks to be queried more frequently if their delay requirements are greater.
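The idea of querying some tasks more frequently than others can be pictured with a small Python sketch. It is not the scheduler of the methodology: the task names and the notion of a polling period expressed in cycles are hypothetical, and synchronization between tasks is deliberately left out.

```python
from typing import Dict, List


def polling_schedule(periods: Dict[str, int], cycles: int) -> List[List[str]]:
    """Return, for each polling cycle, the names of the tasks queried.

    `periods` maps a task name to its polling period in cycles; a task with
    period 1 is queried at every cycle, one with period 3 every third cycle.
    """
    for name, period in periods.items():
        if period < 1:
            raise ValueError(f"period of {name!r} must be at least 1")
    trace = []
    for cycle in range(cycles):
        # Tasks are visited in the order in which they were declared.
        polled = [name for name, period in periods.items() if cycle % period == 0]
        trace.append(polled)
    return trace


# Hypothetical tasks: a tightly constrained signal and a slower one.
trace = polling_schedule({"level_sensor": 1, "control_panel": 3}, cycles=4)
```

The longest path through one cycle bounds the delay added to signal management, which is why the number of tasks queried per cycle has to be kept under control.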
The choice of the scheduling technique also affects how the software modules are translated from TTL to C because, according to the choice made, different management interfaces and different signal synchronization techniques will have to be inserted. It will be necessary to follow the rendezvous rules imposed by the TTL synchronization protocol, as happens with all the techniques of the same family.
The scheduler can be described in TTL, so it is possible to check, by simulation of global behaviour, that the system comprising the scheduler and the modules behaves correctly before passing on to the actual synthesis of the hardware modules, which is costly in terms of time. In the future, by exploiting TTL's capacity to describe time quantitatively, it will be possible to obtain the scheduler program in such a way as to respect the time constraints by construction.

From TTL to VHDL
As said above, a TTL specification comprises a behaviour part and a data part. These two parts require different translation procedures.

Data Part
This part is translated by establishing a relation between the types of data in TTL and those of VHDL, in the sense that each type in one language is made to correspond to a type in the other.

Behaviour Part
This part of a TTL description is made up of a set of processes combined by binary operators. It is possible to identify three types of semantic elements to be translated: events, processes and operators.
Events: Synchronization in TTL is achieved by means of multi-way rendezvous. VHDL, on the other hand, achieves synchronization by using signals. It is therefore necessary to decompose the sophisticated TTL rendezvous into VHDL signals.
Processes: A TTL process is quite similar to a VHDL entity where the PORTS can be seen as low-level gates.
Operators: TTL operators are translated into the instructions provided by VHDL.
In this phase of the methodology it is possible to use a tool comprising two modules; the first translates from TTL into T-LOTOS (an extended version of LOTOS including time) and the second from T-LOTOS into VHDL.
The first step involves expanding the modules, templates and loops of TTL to obtain the specification in standard LOTOS with explicit time constraints (which in turn is quite easy to translate into T-LOTOS). For the second step it is possible to use Harpo [30], which accepts T-LOTOS as input and outputs VHDL. Harpo is currently being developed but already presents interesting features, such as the possibility of generating synthesizable VHDL code. A drawback, however, is the fact that the code generated is too large.

EXAMPLE: PONDAGE POWER PLANT CONTROLLER
As an example of application of the proposed method, we present a system to control the production of electricity in a hydroelectric plant. The aim of the example is to show the applicability of the method to quite complex real systems.

7.1. Specifications
The controller essentially has two functions: it has to check the level of the reservoir to make sure it does not exceed a certain limit, and then directly control the production of electrical power.
The system provides for two functioning modes, manual and automatic. In the first mode the parameters involved in power production are supplied manually from the outside, while in the second mode everything is controlled automatically by a daily production program.
The controller presented comprises several blocks: the clock, the daily program, the control panel, the regulator and a set of actuators. Figure 13 shows the structural interconnection between the various blocks, which we reached after performing several refinement steps on the abstract specifications of the system. Figure 14 shows the main TTL specification of the system.
In giving a detailed description of the features of the individual blocks, we will make use of the modularity offered by TTL.
Control Panel
The Control Panel sets the functioning mode for the system (manual or automatic). Figure 15 shows the declaration of the Control Panel. It comprises a public process called main and three private processes (which cannot be exported). Figure 15 also gives a definition of the main process. As can be seen, the Control Panel module is in turn a parallel combination of four processes: CNTRL, B1, B2 and B3.
The CNTRL process (a definition of which is given in Fig. 16) has two functions: it detects the occurrence of the signal auto_t (as opposed to manu_t) and informs the regulator of the automatic (as opposed to manual) functioning mode by emitting the signal auto (manual); it memorizes emission of the signals prgstart, prgstop and prgwidth, so as to restore normal functioning when passing from manual to automatic.
The processes B1, B2 and B3 perform a sort of logical OR on the input signals, so as to guarantee correct functioning both in the automatic mode and during the transition from manual to automatic. Figure 17 gives a definition of these three processes.
Daily Program
This block manages the daily automatic production of electricity. Figure 18 shows the declaration of the module.
The Daily Program module is a parallel combination of two processes, DP1 and DP2 (see Fig. 18); the first turns the plant on and off, while the second manages differentiated production of electricity according to the time of day.

prgwidth?x:int4;pwidth!x;B3[pwidth,prgwidth,p3 [] p3?x:int4;pwidth!x;B3[pwidth,prgwidth,p3] endproc
FIGURE 17 Control panel B1, B2 and B3 process definition (fragment).
On the basis of the time signal, the process DP1 (see Fig. 18 for a definition) turns the plant on and off. It is turned on at 6.00 am (by emitting the prgstart signal) and turned off at 9.00 pm (by emitting the prgstop signal).
The process DP2 (see Fig. 18) uses the time signal to regulate the level of production of electricity. It acts indirectly on the aperture of the valve (signal prgwidth: 0 completely closed, 10 completely open) which regulates the flow of water into the power plant. According to the daily requirements, production is divided into three time bands: from 9.00 pm on the previous day to 5.00 am on the next day, aperture 0, corresponding to no production of electricity; from 6.00 am to 6.00 pm, 70% of maximum production; from 7.00 pm to 8.00 pm, 50% of maximum production.
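The behaviour of DP1 and DP2 can be summarized by two small functions of the hourly time signal. The sketch below is in Python, not TTL; it assumes that the hour is an integer from 0 to 23 and that 70% and 50% of maximum production correspond to prgwidth values 7 and 5 on the 0 to 10 scale. The function names are hypothetical.

```python
def plant_command(hour):
    """DP1: return 'prgstart', 'prgstop' or None for the hourly time signal."""
    if hour == 6:       # 6.00 am: turn the plant on
        return "prgstart"
    if hour == 21:      # 9.00 pm: turn the plant off
        return "prgstop"
    return None


def programmed_width(hour):
    """DP2: return the prgwidth value (0..10) for the hourly time signal."""
    if 6 <= hour <= 18:
        return 7        # 70% of maximum production (assumed to map to aperture 7)
    if 19 <= hour <= 20:
        return 5        # 50% of maximum production (assumed to map to aperture 5)
    return 0            # 9.00 pm to 5.00 am: valve closed, no production
```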
Clock
Automatic management of production requires knowledge of the real time, which is provided by the block called Clock.
Figure 19 shows the statement of the module and definition of the main process. As can be seen, the Clock block has been decomposed into a parallel combination of two processes, counter and ck. Figure 19 also gives a definition of the ck process, which produces a tick every second, exploiting the possibility TTL offers of inserting quantitative time references into the description.
For the purpose of automatically managing the production of electricity, it is sufficient for the time signal to be emitted every hour. We therefore implemented the counter process (see Fig. 19) which uses a typical construct of TTL (loop) to create a counter which puts out the information every 3600 ticks.
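The counting behaviour can be illustrated with a short Python generator. It is a sketch of the idea only, not a rendering of the TTL loop construct; the starting hour of 0 and the wrap-around at 24 are assumptions.

```python
def counter(ticks, ticks_per_hour=3600):
    """Yield the current hour (0..23) each time `ticks_per_hour` ticks have elapsed."""
    hour = 0    # assumed starting hour
    count = 0
    for _ in ticks:
        count += 1
        if count == ticks_per_hour:
            count = 0
            hour = (hour + 1) % 24   # wrap around at midnight
            yield hour


# Two full hours plus ten extra ticks produce two time signals.
hours = list(counter(range(2 * 3600 + 10)))
```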
Regulator
This block deals directly with control of the plant. It has two main functions: checking that the level of the reservoir does not exceed a certain emergency threshold, in which case it tries to restore normality by acting on an outlet valve; checking the level of production, and turning the plant on and off by manual or automatic controls.
Figure 20 describes the statement of the regulator module, and the main process. Let us analyze the functioning of the processes which make up the regulator.
The processes R1 and R2 function in a similar way (see Fig. 21). The former turns the plant on by opening the valve (open signal) of the duct which goes from the reservoir to the plant; the latter turns the plant off by closing the valve (close signal). Both processes deal with correct management of the input signals in relation to the functioning mode (manual or automatic).
The process R3 (see Fig. 22) manages the level of production in the plant by acting on the valve which regulates the flow entering the plant. By means of the width signal, a sensor communicates the current aperture of the valve, which is compared with what has been programmed (in the automatic functioning mode) or set manually (manual functioning mode). On the basis of the difference between the two values, it acts on the motor which regulates the valve, emitting the signals inc and dec (which indicate the direction in which the motor has to move to increase or decrease the angle of aperture) and the signal fwidth (which indicates the relative angle of rotation of the valve). This process is also constructed in such a way that the automatic and manual functioning modes are managed appropriately.
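The comparison performed by R3 amounts to a signed difference between the target aperture and the current one. A minimal Python sketch, with hypothetical function names and with the mode represented by the strings "auto" and "manual", might look like this:

```python
def select_target(mode, programmed, manual):
    """Pick the target aperture according to the functioning mode."""
    return programmed if mode == "auto" else manual


def regulate(current_width, target_width):
    """Return (direction, fwidth): direction is 'inc', 'dec' or None."""
    diff = target_width - current_width
    if diff > 0:
        return "inc", diff     # open the valve by `diff` steps
    if diff < 0:
        return "dec", -diff    # close the valve by `-diff` steps
    return None, 0             # already at the target aperture


command = regulate(current_width=3, target_width=select_target("auto", 7, 4))
```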
The process R4 (Fig. 23) controls the level of the reservoir. The current level is provided by the signal level (0 to indicate that the reservoir is empty, 10 that it is completely full). The safety level is set to a value of eight, which corresponds to 80% of the maximum capacity. If this level is exceeded, the process R4 activates signals to open an outlet valve so as to bring the situation back to normal.
Act1, Act2 and Act3
The blocks Act1, Act2 and Act3 deal with interfacing between the control system and the actuators which drive the valves.
More specifically: Act1 runs the motor which controls the valve of the duct going from the reservoir to the plant. There are two possible positions for this valve: open and closed. Act2 controls the outlet valve which serves to keep the level of the reservoir below a certain safety level. Here again there are only two possible positions: open and closed. Act3 serves as an interface between the system and the stepper motor which controls the inlet valve. There are eleven positions for this valve, from zero to ten, which correspond to angles of aperture from 0% to 100% (and therefore indirectly to the level of production).
Act1 and Act2 have a similar structure, the statement of which is given in Figure 24, where a definition of the main process is also given. Act3, as said above, serves as an interface with a stepper motor which can move by steps towards increasing or decreasing angles, according to whether a signal up or dw is sent. The aim of Act3 is to send as many up (or dw) signals as the steps supplied by fwidth. Figure 24 gives the declaration of the module and a definition of the main process.
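The role of Act3 can be illustrated by a function that expands one regulator command into a train of stepper pulses. This Python sketch assumes that inc maps to up and dec maps to dw; the function name is hypothetical.

```python
def stepper_pulses(direction, fwidth):
    """Return the pulses sent to the stepper motor for one regulator command."""
    pulse = {"inc": "up", "dec": "dw"}[direction]   # assumed signal mapping
    return [pulse] * fwidth                         # one pulse per step


pulses = stepper_pulses("inc", 3)
```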

Decomposition
The specification of the system is given in such a way as to obtain the maximum number of tasks in the decomposition phase.Figure 25 shows the tree which represents the hierarchy of TTL processes on the basis of how they are instanced.

Partitioning
Pre-Clustering
Application of the pre-clustering algorithm passes through construction of the adjacency matrix for the weighted graph. We assume that the function couplingDegree has the following values: 5 for the time signal (minimum number of bits required to represent the 24 hours of the day); 4 for prgwidth, pwidth, p3, fwidth (needed to represent the 11 positions of the valve); for all the other signals.
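The bit counts quoted above follow from ceil(log2(24)) = 5 and ceil(log2(11)) = 4. The Python sketch below reproduces this arithmetic and builds a small symmetric adjacency matrix from it. Several points are assumptions made only for illustration: the default weight of 1 for the remaining signals, the choice to sum the weights when two tasks share more than one signal, and the three example links, which are not taken from Figure 26.

```python
from math import ceil, log2


def bits_needed(n_values):
    """Minimum number of bits needed to encode `n_values` distinct values."""
    return max(1, ceil(log2(n_values)))


COUPLING = {"time": bits_needed(24)}             # 24 hours of the day
for name in ("prgwidth", "pwidth", "p3", "fwidth"):
    COUPLING[name] = bits_needed(11)             # 11 valve positions
DEFAULT_COUPLING = 1                             # assumed weight for all other signals


def coupling_degree(signal):
    return COUPLING.get(signal, DEFAULT_COUPLING)


def adjacency_matrix(n_tasks, links):
    """Build a symmetric weighted matrix from (task_i, task_j, signal) triples."""
    matrix = [[0] * n_tasks for _ in range(n_tasks)]
    for i, j, signal in links:
        weight = coupling_degree(signal)
        matrix[i][j] += weight                   # weights of shared signals are summed
        matrix[j][i] += weight
    return matrix


# Hypothetical links between three tasks.
links = [(0, 1, "time"), (1, 2, "prgwidth"), (1, 2, "prgstart")]
matrix = adjacency_matrix(3, links)
```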
In this case the matrix of the graph is the one shown in Figure 26. Applying the clustering algorithm to this matrix with n = 10, we get the following clusters: C1 = T1, C2 = T2, C3 = T3 + T4 + T11 + T12 + T13, C4 = T5, C5 = T6, C6 = T7, C7 = T8 + T15, C8 = T9, C9 = T10, C10 = T14. In this example, we chose to reduce the number of tasks from 15 to 10, on account of particular efficiency requirements. As mentioned previously, in fact, the final number of clusters has to be chosen in such a way as to: reduce the number of tasks as far as possible (and consequently the complexity of the subsequent mapping phase); not impose constraints on the mapping onto the target architecture.
If we had decided to take the final number of clusters to be used as input for the mapping algorithm down to seven, the pre-clustering algorithm would have put out the following clusters: C1 = T1 + T2 + T3 + T4 + T8 + T11 + T12 + T13 + T15, C2 = T5, C3 = T6, C4 = T7, C5 = T9, C6 = T10, C7 = T14. For reasons linked to minimization of the coupling degree, the cluster C1 is composed of nine tasks; such a critically large cluster would represent a hard constraint in the mapping stage. This means that a given cluster could be mapped without taking into account the parameters directly linked with the target architecture, which should be decisive for mapping.
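The effect of the target number of clusters n can be explored with a simple greedy merge, sketched below in Python. This is a generic agglomerative procedure swapped in for illustration: it repeatedly merges the two clusters with the highest total coupling until n clusters remain. It is not claimed to be the pre-clustering algorithm of the methodology, and the 4-task matrix is a toy example, not the matrix of Figure 26.

```python
def pre_cluster(matrix, n):
    """Greedily merge the most strongly coupled clusters until `n` remain."""
    clusters = [[i] for i in range(len(matrix))]

    def coupling(a, b):
        # Total weight of the edges running between clusters a and b.
        return sum(matrix[i][j] for i in a for j in b)

    while len(clusters) > n:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                w = coupling(clusters[x], clusters[y])
                if best is None or w > best[0]:
                    best = (w, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]   # merge y into x
        del clusters[y]
    return clusters


toy = [
    [0, 5, 1, 0],
    [5, 0, 1, 0],
    [1, 1, 0, 4],
    [0, 0, 4, 0],
]
result = pre_cluster(toy, 2)
```

Lowering n forces further merges along ever weaker links, which is how a single oversized cluster such as the nine-task C1 above can arise.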
In traditional design methodologies, the way in which specification of the system was made was a constraint for the subsequent mapping on the target architecture, as there was a tendency to map the blocks which functionally constituted the specification (e.g., in this case the regulator, the control panel, the daily program, etc.) directly onto hardware or software.In our case, instead, the composition of a cluster is not linked to the functions it performs but is a result of application of an algorithm which minimizes the degree of coupling between the parts of the system.For example, it would have been hard to envisage a cluster like C3, which is made up of processes belonging to different functional blocks and which in substance represents an optimal choice with respect to the coupling degree parameter.

RELATED WORK AND CONCLUSIONS
In this section we will examine the different approaches that can be found in literature to solve each aspect of codesign.Several techniques have been proposed to tackle the specification of hardware and software; in the following we will sketch the characteristics of some of them.
Esterel [31] is a synchronous language based on FSM. The synchronous hypothesis states that time is described as a sequence of instants, between which no action can take place. This hypothesis permits the system to be modelled using only a single FSM exhibiting a totally predictable behaviour. Unfortunately the resulting FSM is generally fairly large, thus making it difficult to specify systems with a large amount of concurrency.
Another technique belonging to the FSM family is StateCharts [32]. It is a graphical specification language which allows hierarchical decomposition, timing, concurrency and subroutines. It allows a concise specification and clear documentation, but it is lacking in the specification of software submodules.
Among the other languages used for co-specification we can cite two examples: Cx, the entry language for COSYMA [33], which extends ANSI C with delays, tasks and task communication, and Hardware C [34], which can be translated into a flow graph.
In the methodology introduced in this paper the specification language used is TTL. It is derived from T-LOTOS, an FDT based on the CCS and CSP process algebras. TTL appears suitable for describing control-dominated systems, as discussed throughout the paper.
As shown in the paper, a key problem in codesign methodologies is the validation of the model of the system being developed. Simulation is still the main tool used for this purpose and consists of comparing the model against a set of specifications. Many methods have been proposed in literature; they differ in their method of coupling hardware and software components. For example, in [35] a single custom simulator is used for both hardware and software, whereas another approach proposes using a software process running on a host computer loosely connected with a hardware simulator [36].
TTL aims to perform verification on the specification. Formal verification is the process of checking that the behaviour of the system satisfies a given property, also described using a formal method. This approach has been widely adopted to verify the correctness of protocols and it appears useful in hardware/software property checking. It also allows the congruence between two successive refinement steps to be checked without using a simulation approach. For these reasons we refer to our methodology as a "formal codesign methodology".
Several solutions to the partitioning problem are proposed in literature. Some use a graph model to represent the operations performed by devices and associate a cost to them [33]. Others perform the partitioning together with the implementation of the scheduling algorithm as, for instance, in [29] where the specification is made with a hardware description language and synthesis tools are used to estimate the costs. The basic idea of performing scheduling and partitioning together is to minimize the response time.
Our methodology divides the partitioning stage into two steps. The first (pre-clustering) is based only on the properties of the system and aims to reduce the complexity of the problem. This is obtained by a simple algorithm whose complexity is very low, especially compared with that of the mapping algorithm. The second step groups the remaining clusters and maps them onto the target architecture. The strategy used to reduce the complexity of mapping is based on minimization of the interaction among clusters.
Finally, some problems dealing with mapping have been discussed, including the choice of the scheduling algorithms needed to allow hardware and software modules to coexist. Proper choice of the scheduling algorithm is, however, an open problem to which further studies must be devoted.

FIGURE 3 Decomposition example of parallel process only.

FIGURE 5 Example of decomposition with tree labelling.

FIGURE 7 Example of gate classification after decomposition.

FIGURE 10 Example of simple pre-cluster algorithm application.

FIGURE 11 Another example of pre-cluster algorithm application.

FIGURE 12 A sample of graph showing enhanced algorithm.

FIGURE 13 Complete scheme of a pondage power plant controller.

FIGURE 20 Regulator module declaration and main process definition.
FIGURE 21 Regulator R process definition.
It starts with: step 1. Create the process tree from the specification and label the nodes; step 2. Set i = 0 and call the function do_task(tree).
FIGURE 26 Weighted graph matrix of pondage power plant controller.