Design, Implementation, and Test of a Multi-Model Systolic Neural-Network Accelerator

A multi-model neural-network computer has been designed and built. A compute-intensive application in the field of power-system monitoring, using the Kohonen neural network, has then been ported onto this machine. After a short description of the system, this article focuses on the programming paradigm adopted. The performance of the machine is also evaluated and discussed.


INTRODUCTION
Neural networks are gaining recognition as a novel technique that solves large classes of problems better than traditional algorithms. One of the problems that neural networks encounter in practical applications is the huge computing power they require. Conversely, one of the aspects that make neural networks interesting is their high degree of intrinsic parallelism. Together, these two elements provide solid ground for dedicated computers designed for connectionist algorithms [1].
This paper presents a special-purpose machine on which several popular neural algorithms can be run. The system achieves massive parallelism thanks to a systolic array with up to 40 × 40 processing elements. The paper aims at outlining the many problems that arise in practice when a user programs and uses this kind of machine. For this purpose, the hardware structure of the machine is overviewed in section 2. Section 3 shows how the machine is programmed, from the implementation of low-level routines for the systolic array up to the user library routines. Each of these software layers raises different problems in terms of performance: the array microcode has to exploit the hardware in the best way, while the higher-level routines should hide all the approximations and algorithmic modifications introduced by the dedicated hardware. The performance assessment is discussed in section 4. Finally, the use of the system for an application of the Kohonen network in power-system security assessment is described in section 5. Section 6 draws some conclusions on the project as a whole.

MANTRA I SYSTEM
The MANTRA I computer is a massively parallel machine dedicated to neural-network algorithms. It has been designed to provide the basic operations for the following models: (1) Single-layer networks (Perceptron and delta rule); (2) Multilayer feedforward networks (back-propagation rule); (3) Fully connected recurrent networks (Hopfield model); and (4) Self-organizing feature maps (Kohonen model). A description of these algorithms can be found in any classic introductory book on neural networks (e.g., [2]). The Kohonen feature maps are used in section 3 to illustrate how these algorithms are mapped on the system.
The MANTRA I accelerator is based on a bidimensional systolic array composed of custom processing elements (PEs) named GENES IV. In the present section, the hardware of the machine is overviewed, starting from its system integration in a network of workstations and proceeding down to the internal architecture of the machine and of its computational core.

MANTRA I System Integration
The MANTRA I machine is controlled by a TMS320C40 Digital Signal Processor (DSP) from Texas Instruments. Two of its six 8-bit built-in communication links connect the machine to another TMS320C40 processor inside a SUN SPARCstation. From a software point of view, the intermediate DSP is transparent (figure 2).
The MANTRA I machine (the systolic array and its control processor) is completely controlled by the front-end workstation but could easily be integrated into any other computer system based on TMS320C40 processors.

MANTRA I System Architecture
The structure of the MANTRA I system [3] is shown in figure 3. The control module is the SISD system based on the DSP. It controls the parallel or SIMD module by dispatching horizontally coded instructions through a FIFO. The SIMD module is frozen when no instruction is pending. Three FIFOs are used to feed data to the SIMD module and two to retrieve results. Temporary results can be held in four static RAM banks connected to the systolic array. The large DSP dynamic RAM can be used when the capacity of the static RAM is insufficient to contain the application. Two units based on look-up tables, noted $\sigma(\cdot)$ and $\sigma'(\cdot)$, are inserted on the data path and are typically used to compute the nonlinear function of neuron outputs. The latter unit is coupled with a linear array of auxiliary arithmetic units called GACD1, required in some phases of supervised algorithms.

GENES IV Processing Element
The systolic array at the heart of the SIMD part of the machine is a square mesh of GENES IV PEs [4], each connected by serial lines to its four neighbours, as shown in figure 4(a). All input and output operations are performed by the PEs located on the North-West to South-East diagonal.
Each PE, whose structure is shown in figure 4(a), contains one element of a matrix $W$ (weight unit). The WGTin–WGTout path (shown in figure 4(b) but omitted in figure 4(a)) is used to load and retrieve matrices. Two vectors $\tilde{I}_h$ and $\tilde{I}_v$ are input at each cycle. i-th row of the stored matrix, usually containing the weights of a neuron. These operations have been chosen to implement most popular neural-network algorithms, including those mentioned at the beginning of section 2.
All of the operations may also be performed on the transposed matrix $W^T$ (with $\tilde{I}_h$ and $\tilde{I}_v$ as well as $\tilde{O}_h$ and $\tilde{O}_v$ exchanged). This is shown in table 1 only for the operation mprod$^T$. For problems involving matrices and vectors larger than the physical array size, the task can be divided into smaller sub-matrices and sub-vectors treated sequentially. The partial sums of several consecutive mprod, mprod$^T$, and euclidean operations can be accumulated thanks to the additive term $\tilde{I}_h$ or $\tilde{I}_v$. The weight unit consists of two registers: one is used for the current computation while the other is connected to the WGTin–WGTout path. This makes it possible to load a matrix in the background, without any overhead.
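The decomposition into sub-matrices with accumulated partial sums can be illustrated with a short sketch. It assumes, since table 1 is not reproduced here, that one mprod pass returns the additive input plus the matrix-vector product; the function names `mprod` and `tiled_mprod` and the array side `n` are illustrative only.

```python
def mprod(block, i_v, i_h):
    """Assumed array pass: out[i] = i_h[i] + sum_j block[i][j] * i_v[j]."""
    return [acc + sum(w * x for w, x in zip(row, i_v))
            for row, acc in zip(block, i_h)]


def tiled_mprod(W, x, n):
    """Multiply W by x using only n x n sub-matrices, as a small array would."""
    out = []
    for r0 in range(0, len(W), n):             # one block of output rows
        rows = W[r0:r0 + n]
        acc = [0] * len(rows)                  # additive input starts at zero
        for c0 in range(0, len(x), n):         # walk along the input vector
            block = [row[c0:c0 + n] for row in rows]
            acc = mprod(block, x[c0:c0 + n], acc)   # chain the partial sums
        out.extend(acc)
    return out


W = [[1, 2, 3, 4, 5],
     [0, 1, 0, 1, 0],
     [2, 2, 2, 2, 2]]
x = [1, 0, 2, 1, 3]
print(tiled_mprod(W, x, 2))   # prints [26, 1, 14]
```

Each inner call stands for one pass of a sub-matrix through the array; feeding the previous partial result back as the additive term is what removes the need for a separate summation step.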
Since an instruction is associated with each pair of input vectors, a new operation can be started on each cycle and processed in a pipelined fashion.
The result is available $2N$ cycles later. The computation is performed on signed fixed-point values. The inputs and the weights are coded on 16 bits. The weights have 16 additional bits, but these are used only during learning (weight update operations). Outputs are computed on 40 bits.
A VLSI chip with a sub-array of 2 × 2 PEs has been designed in CMOS 1 µm standard-cell technology. It contains 71,690 transistors (3,179 standard cells) on a die measuring 6.3 × 6.1 mm².

MANTRA I SOFTWARE
Several problems arise when putting to work a specialized computer like MANTRA I. Some of them hardly come to light at the early stages of prototype testing and only manifest themselves when running a real application. Users are not supposed to program MANTRA I directly but have a set of libraries available on the front-end workstation. The first neural-network algorithm implemented on MANTRA I is the Kohonen self-organizing feature map, because this model is required by the target application described in section 5.
Section 3.1 is devoted to the description of the Kohonen algorithm. Section 3.2 describes the mapping of the Kohonen routine on the systolic array to yield a basic Kohonen program in fixed-point arithmetic. Section 3.3 outlines the problems in the actual production of microcode for the array and describes the approach taken to handle the task. Finally, section 3.4 contains the details of the software interface between the lower programming level and the user level, and explains how this interface hides from the user some constraints specific to the systolic hardware.

Kohonen Self-Organizing Feature Maps
Kohonen's Self-Organizing Feature Maps (SOFMs) are among the most widely used unsupervised artificial neural-network models.
Their learning algorithm performs a non-linear mapping from a high-dimensional input space onto a set of neurons [5]. These neurons are organized as regular maps and a topological relation between them is defined. Two-dimensional grids or meshes, as shown in figure 5, are typical topologies, but hexagonal grids or more exotic topologies are possible as well.
All neurons share the same inputs. Training is an iterative process: for every input vector $x$, its similarity with the weights of each neuron $i$ is measured in the $n$-dimensional input space. The most frequently used similarity measure is the Euclidean distance:
$$\delta_{in}(x, W_i) = \sum_{j=1}^{n} (x_j - W_{i,j})^2, \quad \text{for } i = 1, 2, \ldots, m. \tag{1}$$
Other common similarity measures include the Manhattan distance and the scalar product. Usually, the input space has a much higher dimensionality than the topological space of the map. The winner neuron $I \in \{1, 2, \ldots, m\}$ is defined as the neuron whose weight vector is the closest to the input vector:
$$\delta_{in}(x, W_I) \le \delta_{in}(x, W_i), \quad \forall i \in \{1, 2, \ldots, m\}. \tag{2}$$
During the learning phase, the weights are updated as:
$$W_i := W_i + \alpha \, \beta(\delta_{map}(i, I)) \, (x^T - W_i), \quad \text{for } i = 1, 2, \ldots, m, \tag{3}$$
where $\delta_{map}(i, I)$ is the distance between neuron $i$ and the winner $I$ on the topological map. The neighbourhood function $\beta$ restricts the update to neurons close to the winner. The basic idea of the update rule is to bring the winner and its neighbours closer to the input vector. The adaptation gain $\alpha$ should decrease during the training process to ensure its convergence.
Apart from the differences arising from the choice of the similarity measure $\delta_{in}$ and the distance on the map $\delta_{map}$, variations exist in the way the winner is detected and the weights updated [6].
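As a point of reference for the rest of the section, one iteration of the basic on-line algorithm can be sketched as follows. This is a minimal illustration, not the code run on the machine: the names `alpha` (adaptation gain), `beta` (neighbourhood function) and `map_distance` (distance on the map) are chosen here for readability, and the squared distance is assumed as the similarity measure.

```python
def squared_distance(x, w):
    """Similarity measure: sum of squared component differences."""
    return sum((xj - wj) ** 2 for xj, wj in zip(x, w))


def kohonen_step(weights, x, alpha, beta, map_distance):
    """One on-line iteration: find the winner, then pull its neighbours towards x."""
    distances = [squared_distance(x, w) for w in weights]
    winner = distances.index(min(distances))      # smallest index wins ties
    for i, w in enumerate(weights):
        gain = alpha * beta(map_distance(i, winner))
        weights[i] = [wj + gain * (xj - wj) for xj, wj in zip(x, w)]
    return winner


# One-dimensional map of four neurons with scalar inputs.
weights = [[0.0], [1.0], [5.0], [9.0]]
rectangular = lambda d: 1.0 if d <= 1 else 0.0    # neighbourhood of radius 1
chain = lambda i, j: abs(i - j)                   # distance on a 1-D map

winner = kohonen_step(weights, [1.5], 0.5, rectangular, chain)
print(winner, weights)   # prints 1 [[0.75], [1.25], [3.25], [9.0]]
```

The winner and its two neighbours move halfway towards the input, while the fourth neuron, outside the neighbourhood, is left untouched.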

Mapping Kohonen Networks on the Systolic Array
On MANTRA I, the first step of the computation consists in evaluating the Euclidean distances between the input vector and the synaptic weights of each neuron with the euclidean operation (table 1). The winner is then implicitly identified by processing the vector containing the $m$ distances ($m$ being the number of neurons) with the min operation. The sigma unit is used to convert $+\infty$ to 0 and any other value to 1. The result is a binary vector of $m$ elements, all equal to 0 except for the neuron(s) closest to the input. The vector containing the $m$ elements $\alpha \, \beta(\delta_{map}(i, I))$ is computed by multiplying the above binary vector by an $m \times m$ matrix (mprod operation). The element $(i, j)$ of this symmetric matrix is $\alpha \, \beta(\delta_{map}(i, j))$; the matrix therefore contains all the information about the topology of the map. The advantage of this formulation is that there are no constraints on the dimensionality of the map, on the arrangement of the neurons (orthogonal, hexagonal, or other grids), or on the shape of the neighbourhood (e.g., rectangular or triangular).
Finally, the weights can be modified by injecting this update vector as $\tilde{I}_h$ and the original input vector as $\tilde{I}_v$, and by performing the kohonen operation. Weight matrices larger than the systolic array can be decomposed into sub-matrices, which are then time-multiplexed on the array.
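The chain of array operations described above can be mimicked in a few lines. The sketch below is a functional model only, under stated assumptions: the min operation is assumed to keep the minimal entries and replace all others by $+\infty$, the sigma look-up table then yields the binary winner vector, and the topology matrix is assumed to already include the adaptation gain. Function names mirror the operation names of the text but their exact semantics on the hardware are not taken from table 1.

```python
INF = float("inf")


def euclidean(W, x):
    """Squared distance between x and every row of W."""
    return [sum((xj - w) ** 2 for xj, w in zip(x, row)) for row in W]


def min_op(d):
    """Assumed behaviour: minimal entries are kept, all others become +inf."""
    lowest = min(d)
    return [v if v == lowest else INF for v in d]


def sigma(v):
    """Look-up table converting +inf to 0 and any other value to 1."""
    return [0 if e == INF else 1 for e in v]


def mprod(M, v):
    """Plain matrix-vector product."""
    return [sum(m * e for m, e in zip(row, v)) for row in M]


def kohonen_op(W, update, x):
    """Row i moves towards x by the fraction update[i]."""
    return [[w + u * (xj - w) for w, xj in zip(row, x)]
            for row, u in zip(W, update)]


def topology_matrix(m, alpha, beta, map_distance):
    """m x m matrix holding gain times neighbourhood for every neuron pair."""
    return [[alpha * beta(map_distance(i, j)) for j in range(m)]
            for i in range(m)]


def systolic_step(W, x, topo):
    winners = sigma(min_op(euclidean(W, x)))
    return kohonen_op(W, mprod(topo, winners), x), winners


chain = lambda i, j: abs(i - j)
topo = topology_matrix(4, 0.5, lambda d: 1.0 if d <= 1 else 0.0, chain)
W, winners = systolic_step([[0.0], [1.0], [5.0], [9.0]], [1.5], topo)
print(winners, W)   # prints [0, 1, 0, 0] [[0.75], [1.25], [3.25], [9.0]]
```

With a single winner the result coincides with the sequential rule. When two neurons are equally close, the binary vector contains two ones and both neighbourhoods are updated, which is the multiple-winner behaviour discussed below.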
As is the case with most dedicated hardware systems for neural networks, the described mapping produces an algorithm which is, in several respects, slightly different from the basic algorithm. The main differences are:
Fixed-point arithmetic. GENES IV PEs are designed for fixed-point number representation. In the absence of general analytical results on the required precision (see also section 4.3), simulations of the application described in section 5 have been used to determine the precision to be implemented.
Batch update. A characteristic of the systolic architecture is that the time required to produce a result (latency) is much longer than the delay between two successive inputs (inverse of the throughput). Therefore, it is natural to process batches of input vectors to hide the latency. For this purpose, the same weights are used to compute the distances for all these vectors. Hence, the last vectors of the batch do not take advantage of the weight modifications that would have resulted from the first ones.
Learning parameters discretization. In the described implementation, all the information on the topology of the map is contained in a relatively large matrix. This matrix depends on the learning and neighbourhood coefficients $\alpha$ and $\beta$, both evolving with time during the learning process. While for some shapes of the function $\beta$ an update of the topology matrix inside the array is possible, the current implementation recomputes a new matrix in the SISD module. This implies that, for efficiency purposes, the learning factors $\alpha$ and $\beta$, instead of continuously evolving during the learning process, should be changed as seldom as possible. However, this does not appear to be a major obstacle to the convergence of the algorithm.
Multiple winners. In traditional implementations, when multiple neurons have the same minimal distance from the input, one is arbitrarily selected (e.g., the one with the smallest index); in MANTRA I, on the contrary, all these neurons get updated. This multiple update is similar to the sequential presentation of several input vectors, each slightly closer to one of the winners. Hence, it represents a small distortion of the probability distribution and should not hinder the convergence of the network. A more severe consequence is that, if two or more neurons have the same weights and the neighbourhood is null, the neurons can no longer be separated, which reduces the mapping capabilities of the network. The weight resolution of MANTRA I is rather high and this problem should seldom occur. Additional techniques could also be applied to minimise the problem in critical cases.

Scheduling the Systolic Array
Hand-coding programs for MANTRA I is an extremely complex task. As shown in figure 3, a single instruction controls all pipeline stages of the SIMD module in parallel. This implies that an instruction may depend on several activities started at different times in the past and sometimes conceptually independent. The complexity and reconfigurability of the pipeline make it difficult to mimic its delays in the instruction decoder. However, it would be desirable to code the whole processing of a data set in the same instruction. The basic idea that has been implemented is to describe complete operations on groups of data independently from other concurrent activities [7]. Operations include, for instance, the complete matrix downloading process or the computation of a set of distances.
Each of these algorithm building blocks, called microtasks (MTs), is similar to a microoperation in a microcoded processor, and abstracts the programmer from low-level machine details. In contrast with traditional microoperations, MTs extend over many cycles, often on the order of twice the number of PEs per side. If no special action is taken, executing MTs in sequence causes the parallelism of the machine to be lost. MTs should start as soon as the required resources are available and, in general, before all phases of the previous MT have completed. This process is very similar to issuing instructions in a pipelined processor: it requires verifying the availability of resources and data and generating stall cycles if these conditions are not met.
In contrast to what happens in pipelined processors, here the sections of the pipeline are heavily interdependent and data are synchronously transmitted between stages. Therefore, if one stage is stalled, the others will be halted as well. Hence, hazards must be forecast and an MT can only be started if it can be completely performed. The blocks of microcode are therefore treated as rigid entities and the program is compacted by overlapping each MT with the neighbouring ones, provided that no conflict arises. For each cycle, the horizontally coded MANTRA I instruction is then determined by the MTs taking place in the different parts of the machine.
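The idea of overlapping rigid microtasks can be sketched with a simple greedy placement. This is a simplified illustration and not the compaction algorithm of [7]: each MT is modelled as a list of resource sets, one per cycle, only resource conflicts are checked (data dependencies are ignored), and the resource names are hypothetical.

```python
def compact(microtasks):
    """Place each rigid MT at the earliest conflict-free start cycle.

    Returns the start cycle of every MT and the merged cycle-by-cycle
    resource usage, from which horizontal instructions could be derived.
    """
    schedule = []        # schedule[c] = set of resources busy at cycle c
    starts = []
    earliest = 0
    for mt in microtasks:
        start = earliest
        # Slide the whole block forward until none of its cycles collides.
        while any(start + k < len(schedule) and schedule[start + k] & used
                  for k, used in enumerate(mt)):
            start += 1
        while len(schedule) < start + len(mt):
            schedule.append(set())
        for k, used in enumerate(mt):
            schedule[start + k] |= used
        starts.append(start)
        earliest = start     # MTs are issued in program order
    return starts, schedule


pass_through = [{"in"}, {"array"}, {"out"}]    # hypothetical 3-cycle MT
background_load = [{"wgt"}, {"wgt"}]           # hypothetical 2-cycle MT
starts, schedule = compact([pass_through, pass_through, background_load])
print(starts, len(schedule))   # prints [0, 1, 1] 4
```

Written serially, the three MTs of the example would take eight cycles; after overlapping they occupy four, and the background load costs no extra cycle because it uses a resource that the other MTs leave free.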
As a result of this kind of program optimization, the programmer can describe all activities as if completely serialized, including typically concurrent tasks such as background weight matrix exchanges. Compared to a hypothetical hand-coded implementation, the loss in performance observed in the implementation of the Kohonen algorithm appears negligible [7]. The compaction algorithm is also very quick, making it possible to prepare the horizontal code just before run-time, when the problem size is completely defined.

3.4 High-level Implementation
3.4.1 Requirements. The low-level Kohonen procedure does not hide all the machine-dependent aspects that should be made transparent to the user. A further software layer is required with two primary purposes: (1) Automatic conversion of the user's floating-point data (weights, input vectors) to and from fixed-point representation. (2) Discrete variation of all parameters that should evolve in time during the learning process, namely the learning coefficient $\alpha$ and the radius of the neighbourhood function $\beta$.
3.4.2 Constraints imposed by the hardware. The hardware architecture imposes a few constraints in order to get the best performance from the MANTRA I machine. First, the epoch should be at least $2N$ long, where $N$ is the number of PEs per side in the bidimensional systolic array. This is because $2N$ is also approximately the depth of the pipeline realized by the systolic machine. In the current configuration, $N = 20$, so that the optimal epoch is $T = 40$ (larger values do not improve the parallelization but make the algorithm more remote from the sequential version). The number of input vectors passed to one call of the low-level procedure should be a multiple of the chosen epoch, also for performance considerations.
Another set of constraints concerns the number of input vectors and the epoch. Each time one of these is changed, the microcode for the systolic unit is prepared and recompacted internally in the low-level Kohonen procedure (section 3.3). This process causes an overhead in the computation that should be avoided as far as possible. Therefore, the value of the epoch and the number of input vectors should, if at all possible, be kept constant across the multiple calls to the low-level procedure that occur during a learning process.
3.4.3 Implementation. The MANTRA I machine communicates with user processes, running on the Unix front-end computer, in a client/server fashion. No more than one user process at a time may be connected to the machine. A user process issues remote procedure calls executed by the server program running on the MANTRA I control processor. The potential parallelism between the front-end workstation and the MANTRA I SISD component has not been exploited so far: for the sake of software simplicity, processes on the Unix system wait idle until the end of the remote procedure call.
The software upper layer is implemented in the server program on the MANTRA I DSP. The procedure provided to the end user implements a complete Kohonen algorithm. It is included in a library that users can link to their own programs running on the workstation. Thus the MANTRA I machine operates as an accelerator for the workstation.
The user has to provide the high-level procedure with the neural-network size, floating-point input data, a function $\alpha_u(t)$ describing the evolution of the adaptation gain during learning, and a two-dimensional function $\beta_u(d, t)$ describing the evolution of the neighbourhood function $\beta(d)$. Finally, the desired number of learning iterations should be provided. The procedure returns the weight matrix after training. If desired, the weights may also be extracted at intermediate points during learning, in order to monitor the training process.
The software upper layer first searches the training set for the maximum and minimum values of each parameter of the input vectors. Each input component $x_i$ in the interval $[x_{min}, x_{max}]$ is then mapped to the interval $[-2^{14}, 2^{14} - 1]$; the weights are initialized with small random values. The Kohonen update rule then ensures that no overflow in the fixed-point computation will ever occur during learning.
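A conversion of this kind can be sketched as follows. The text does not specify the exact mapping, so a linear rescaling with rounding to the nearest integer is assumed here; `make_quantizer`, `to_fixed` and `to_float` are hypothetical names.

```python
LO, HI = -2**14, 2**14 - 1   # target range quoted in the text


def make_quantizer(training_set):
    """Build per-component converters from the extremes of the training set."""
    n = len(training_set[0])
    mins = [min(v[j] for v in training_set) for j in range(n)]
    maxs = [max(v[j] for v in training_set) for j in range(n)]

    def to_fixed(x):
        q = []
        for xj, lo, hi in zip(x, mins, maxs):
            # A constant component is mapped to mid-range (an assumption).
            t = (xj - lo) / (hi - lo) if hi > lo else 0.5
            q.append(LO + round(t * (HI - LO)))
        return q

    def to_float(q):
        return [lo + (qj - LO) / (HI - LO) * (hi - lo)
                for qj, lo, hi in zip(q, mins, maxs)]

    return to_fixed, to_float


training = [[0.0, 10.0], [1.0, 20.0], [0.5, 15.0]]
to_fixed, to_float = make_quantizer(training)
print(to_fixed([0.0, 20.0]))   # prints [-16384, 16383]
```

Since inputs and weights stay within 15 bits, the difference between an input and a weight always fits in a signed 16-bit word, which may be one reason for leaving a bit of headroom.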
The number of iterations required by the user is divided into a number of intervals of equal length $l$. This length is, if possible, a multiple of the required epoch length. For each of these intervals, the low-level Kohonen procedure will be called with $l$ input vectors, randomly chosen in the training set, and with constant values for $\alpha$ and for the function $\beta(d)$, computed by discretizing the original $\alpha_u(t)$ and $\beta_u(d, t)$ functions provided by the user.
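The discretization of the learning schedule can be illustrated with a small sketch. It assumes that the user functions are sampled at the start of each interval and that the interval length is rounded down to a multiple of the epoch; both choices, as well as the function names and the example schedules, are assumptions made for illustration.

```python
def interval_length(total_iterations, n_intervals, epoch):
    """Equal interval length, rounded down to a multiple of the epoch."""
    raw = total_iterations // n_intervals
    return max(epoch, (raw // epoch) * epoch)


def discretize(alpha_u, beta_u, total_iterations, n_intervals, epoch, max_d):
    """Return one (length, alpha, beta_table) triple per low-level call."""
    length = interval_length(total_iterations, n_intervals, epoch)
    plan = []
    for k in range(n_intervals):
        t = k * length                       # sample at the interval start
        beta_table = [beta_u(d, t) for d in range(max_d + 1)]
        plan.append((length, alpha_u(t), beta_table))
    return plan


def alpha_u(t):
    """Example gain decreasing linearly over 1000 iterations."""
    return 0.5 * (1 - t / 1000)


def beta_u(d, t):
    """Example rectangular neighbourhood whose radius shrinks from 2 to 0."""
    return 1.0 if d <= 2 * (1 - t / 1000) else 0.0


plan = discretize(alpha_u, beta_u, total_iterations=1000,
                  n_intervals=5, epoch=40, max_d=2)
for length, alpha, beta_table in plan:
    print(length, round(alpha, 3), beta_table)
```

Because the interval length and the epoch stay the same for every entry of the plan, the microcode would only need to be compacted once, while the topology matrix is rebuilt once per interval.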

PERFORMANCE ASSESSMENT
The performance of the machine depends on the efficiency of the different levels that build up the user library. The first component is the efficiency of the low-level routines used to access the systolic hardware. To that, one should add the effects of the algorithmic modifications outlined in sections 3.2 and 3.4.2. This section presents performance measurements of the low-level routines and discusses the impact of the key modifications on the connectionist performance. For the latter purpose, a general framework to judge the performance of hardware dedicated to neural networks is introduced.

Performance of the Low-level Subroutine
The peak performance in connection updates per second (CUPS) for single-layer networks can be roughly computed as
$$P = \frac{N^2 \, f}{N_{PS} \, n_{op}} \, U, \tag{4}$$
where $N^2$ is the total number of PEs, $f$ is the clock frequency, $N_{PS} = 40$ is the number of clock cycles per operation (bit-serial communication), and $U$ is the utilization rate. The constant $n_{op}$ is the number of operations required to update a connection; for instance, in a Kohonen network with the same number of inputs and neurons, $n_{op} = 4$ (euclidean, min, mprod, and kohonen). Considering the largest possible configuration of the MANTRA I system (40 × 40 PEs) running at a clock frequency of $f = 8$ MHz, the peak performance ($U = 100\,\%$) of the system is 80 MCUPS for the Kohonen network. Equation (4) gives the performance for single layers; for multilayer networks, the global performance is given by a weighted harmonic mean of the individual layer performances. For instance, a back-propagation network with one hidden layer and the same number of inputs and neurons on both layers would have a peak performance of 128 MCUPS.
In practice, the performance degrades from the ideal value because of several factors. Namely, the utilization rate $U$ is the product of three independent elements: a spatial utilization rate, expressing the fact that, depending on the size of the map, some PEs may be left idle during some phases of the computation; a temporal utilization rate, indicating the ratio of active instructions (array instructions other than no operation) over the complete microprogram; and a dynamic utilization rate, which is the ratio of clock cycles when the SIMD module is active over the total. The latter models situations in which the control DSP delays the parallel module because data or instructions are unavailable.
The performance $P$ has been measured on the current configuration of the MANTRA I prototype (8 MHz, 20 × 20 PEs). The results are shown in figure 6. In the most favourable conditions, a fraction slightly below 70 % of the ideal performance has been measured.
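The figures quoted in this section can be checked with a few lines of arithmetic. The sketch assumes that the peak performance is the number of PEs times the clock frequency, divided by the cycles per operation and by the operations per connection update, and scaled by the utilization; this form reproduces the 80 MCUPS quoted for the Kohonen network. The per-layer operation counts used for back-propagation (3 and 2) are guesses that happen to reproduce the 128 MCUPS figure, not values taken from the text.

```python
def peak_cups(n_side, f_hz, n_op, n_ps=40, utilization=1.0):
    """Connection updates per second for a single layer (assumed formula)."""
    return n_side ** 2 * f_hz * utilization / (n_ps * n_op)


def harmonic_mean(perfs, weights=None):
    """Weighted harmonic mean of per-layer performances."""
    weights = weights or [1.0] * len(perfs)
    return sum(weights) / sum(w / p for w, p in zip(weights, perfs))


full = peak_cups(40, 8e6, n_op=4)                        # largest configuration
proto = peak_cups(20, 8e6, n_op=4)                       # current prototype
proto_70 = peak_cups(20, 8e6, n_op=4, utilization=0.7)   # bound on the measurement

# Hypothetical operation counts for the hidden and output layers.
backprop = harmonic_mean([peak_cups(40, 8e6, n_op=3), peak_cups(40, 8e6, n_op=2)])

print(full / 1e6, proto / 1e6, round(proto_70 / 1e6, 1), round(backprop / 1e6, 1))
# prints 80.0 20.0 14.0 128.0
```

Under these assumptions, the prototype peaks at 20 MCUPS for the Kohonen network, so a utilization slightly below 70 % corresponds to a little under 14 MCUPS.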

Neural-Networks Performance on Dedicated Hardware
In addition to the MANTRA I machine, other programmable machines, based on custom digital chips and aimed at running neural algorithms, have been described in the literature (see [1] for a review). Among the most interesting, in terms of advertised performance and versatility, one can mention the CNAPS machine from Adaptive Solutions [8], the MY-NEUPOWER machine from Hitachi [9] and the SYNAPSE machine from Siemens [10]. All these machines are essentially based on SIMD parallel architectures and somehow modify the algorithms to make them more suitable to the hardware [11].
The kind of modifications listed in section 3.2 is far from being peculiar to MANTRA I: for instance, almost all of the machines dedicated to neural networks implement fixed-point arithmetic since it is one of the fundamental sources of simplification of the hardware. Similarly, batch processing is a typical requirement of pipelined machines or of systems that associate a heavy cost with partitioning large networks on smaller hardware. In the latter case, it is sensible to average this cost over a batch of input vectors.
At the same time, these problems are not specific to Kohonen maps but, in different ways, affect most algorithms. Concerning batch update, while some connectionist models have an intrinsically batch nature (such as conjugate gradient optimization techniques), others are natively conceived to perform a weight update after each vector is presented (e.g., stochastic back-propagation or Kohonen self-organizing maps) and may suffer from an injudicious conversion to batch update. The problem of incorporating these effects in a fair assessment of the achieved speed-up is addressed in the next section.
measures the quadratic error incurred by representing the input data through the closest neurons in the map.) The speed-up achieved by a neuro-computer compared to a reference machine should be defined by
$$S_M(E_0) = \frac{t_{cc}(E_0)}{t_{hw}(E_0)}, \tag{5}$$
where $E_0$ is some predetermined value of the convergence metric, and $t_{hw}(E_0)$ and $t_{cc}(E_0)$ are respectively the times necessary for the neuro-computer and for a reference computer (for instance, a conventional workstation using double-precision floating-point variables) to reach the desired convergence value $E_0$ (figure 7).
In order to link this new definition of speed-up to the traditional CUPS ratings, a measure of the quality of convergence of the algorithms may be introduced: the algorithmic efficiency of a neuro-computer implementation of model $M$ can be defined as follows:
$$A_M(E_0) = \frac{k_{cc}(E_0)}{k_{hw}(E_0)}, \tag{6}$$
where $k_{hw}(E_0)$ and $k_{cc}(E_0)$ are respectively the number of iterations necessary for the neuro-computer and for a reference computer to reach the same $E_0$ (figure 8). The efficiency, always positive, will typically be below unity, showing the losses introduced by the hardware constraints. The definitions (5) of speed-up and (6) of algorithmic efficiency then yield
$$S_M(E_0) = \frac{\tau_{cc}}{\tau_{hw}} \, A_M(E_0), \tag{7}$$
where $\tau_{hw}$ and $\tau_{cc}$ are the times necessary to process one learning iteration on the neuro-computer and on the reference computer, respectively. The ratio $\tau_{cc} / \tau_{hw}$ expresses the traditional notion of hardware performance measured as the ratio of the CUPS on the special-purpose hardware and on the reference system. Equation (7) weighs the hardware speed-up with the algorithmic efficiency of the implementation. A good implementation should have an efficiency as close as possible to unity. The more the algorithm has to be tuned to fit the hardware constraints, the smaller the resulting efficiency.
This suggests that there may be a trade-off between improving the parallelization efficiency in a neuro-computer implementation and preserving an acceptable algorithmic efficiency. A compromise might lead to the optimum global speed-up.
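The relation between raw hardware speed and effective speed-up can be made concrete with a short numerical sketch. Efficiency is taken as the ratio of the iterations needed by the reference computer to those needed by the neuro-computer, and the speed-up as the ratio of per-iteration times multiplied by that efficiency. The iteration counts 8 and 28 are those reported later in this paper for the fully batch version; the hardware ratio of 20 is purely hypothetical.

```python
def algorithmic_efficiency(k_cc, k_hw):
    """Iterations needed by the reference over iterations needed by the hardware."""
    return k_cc / k_hw


def speed_up(tau_cc, tau_hw, k_cc, k_hw):
    """Effective speed-up: per-iteration time ratio weighted by the efficiency."""
    return (tau_cc / tau_hw) * algorithmic_efficiency(k_cc, k_hw)


a = algorithmic_efficiency(8, 28)
s = speed_up(tau_cc=20.0, tau_hw=1.0, k_cc=8, k_hw=28)   # hypothetical 20x hardware
print(round(a, 2), round(s, 1))   # prints 0.29 5.7
```

In this example, a machine that processes one iteration twenty times faster than the reference delivers less than a sixfold reduction in the time needed to reach the same convergence value.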

Effects of the Hardware Constraints
The implementation constraints on MANTRA I induce important modifications of the Kohonen algorithm, as described in section 3.4.2. These may lead to a poor algorithmic efficiency or even prevent convergence altogether. The two most important ones are the quantization on a finite number of bits of the input signals and synaptic weights, and the batch updating of the weights.

Quantization effects. Contrary to other neural networks, the Kohonen algorithm with quantized weights and inputs has received little attention so far. Three factors influencing its correct convergence can be identified [12]. First, there is clearly a minimal number of bits required to encode the weights, depending on the input distribution and dimension, as well as on the number of neurons. Second, the adaptation gain must decrease slowly enough, or have an initially large value, since otherwise the weight updates get rounded to zero before the algorithm has converged. Finally, the neighbourhood function should decrease with the distance from the winner neuron, especially if the input dimension is low. These qualitative results were confirmed by a mathematical analysis based on the Markovian formulation of the algorithm [13], giving the necessary and sufficient conditions for the self-organization of the map in the case where the input and weight spaces are one-dimensional. Roughly speaking, the results proved for the continuous case [14] also apply in the quantized case if the number of bits is large enough.

Batch updating. As explained in section 3.2, the Kohonen algorithm is implemented in a modified batch version on MANTRA I. The batch mode is fundamental to exploit the parallelism of the systolic array [15]. Up to now, the differences between the batch and the classical on-line versions have not been studied in the literature for this particular model. The MANTRA I version is not purely batch, since the winner neuron is computed with the value of the weights at the beginning of the epoch but the weight update is an on-line operation. This implementation proved to be the most economical in terms of hardware complexity. In fact, although more counterintuitive, the convergence of the implemented algorithm has been proved in the case of scalar inputs and time-invariant parameters, whereas the convergence of the pure batch algorithm can only be proved with more restrictive assumptions on the parameters [16]. In the one-dimensional case, it has been proved, using the properties of Markov chains, that when the neighbourhood function is rectangularly shaped and time-invariant, and the adaptation gain is also time-invariant, the weights self-organize with probability 1. When the adaptation gain decreases to zero, it has also been proved, using the Ordinary Differential Equation (ODE) method [17], that the weights converge with probability 1 to the same asymptotic values as the ones reached by the original, unmodified algorithm.
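The difference between the classical on-line version and the modified batch version can be shown on a toy example. The sketch below follows the description above: in the modified version all winners of an epoch are determined from the weights frozen at the start of the epoch, while the updates are still applied one vector at a time. Ties are resolved by smallest index for simplicity, and all names and numerical values are illustrative.

```python
def sq_dist(x, w):
    return sum((a - b) ** 2 for a, b in zip(x, w))


def winner_of(weights, x):
    d = [sq_dist(x, w) for w in weights]
    return d.index(min(d))


def apply_update(weights, x, win, alpha, beta, map_distance):
    """Move every neuron towards x according to its distance from the winner."""
    for i, w in enumerate(weights):
        g = alpha * beta(map_distance(i, win))
        weights[i] = [b + g * (a - b) for a, b in zip(x, w)]


def online_epoch(weights, batch, alpha, beta, map_distance):
    """Classical version: each winner sees all previous updates."""
    for x in batch:
        apply_update(weights, x, winner_of(weights, x), alpha, beta, map_distance)


def modified_batch_epoch(weights, batch, alpha, beta, map_distance):
    """Winners from start-of-epoch weights, updates applied sequentially."""
    frozen = [list(w) for w in weights]
    winners = [winner_of(frozen, x) for x in batch]
    for x, win in zip(batch, winners):
        apply_update(weights, x, win, alpha, beta, map_distance)


batch = [[6.0], [6.0], [4.5]]
only_winner = lambda d: 1.0 if d == 0 else 0.0
chain = lambda i, j: abs(i - j)

w_online = [[0.0], [10.0]]
online_epoch(w_online, batch, 0.5, only_winner, chain)
w_batch = [[0.0], [10.0]]
modified_batch_epoch(w_batch, batch, 0.5, only_winner, chain)
print(w_online, w_batch)   # prints [[0.0], [5.75]] [[2.25], [7.0]]
```

In the on-line run the second neuron has already moved close enough to win the third vector, whereas with frozen weights that vector is still attributed to the first neuron. The two versions therefore follow different trajectories within an epoch, which is why their convergence has to be compared explicitly.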
Comparative simulations for a wide range of learning parameters have been performed to support the theoretical results. On a real-size benchmark (speech codebook classification, 10 × 10 neurons, 12 inputs), the quadratic quantization error has been measured as a performance indication of the self-organizing map ($E$). Figure 9 shows the evolution of the convergence metric for the three versions of the algorithm. Already after the 8th epoch, the original algorithm is within 20 % of the best result, assessed after 200 epochs. The fully batch version is distinctly slower during the whole convergence and needs 28 epochs to fall below the same threshold. Thus, the algorithmic efficiency of the batch version compared to the reference algorithm is only $A_M = 8/28 = 0.29$. For lower values of $E_0$, the efficiency will become even lower and eventually reach 0, since the asymptotic minimum error reached is larger in the neuro-computer than in the reference algorithm. However, the best final error achieved by the MANTRA I algorithm with batches of 50 vectors is within a few ppm of the result of the standard version. Moreover, the algorithmic efficiency at the same $E_0$ is 0.57 and for lower values of $E_0$ it becomes close to 1. These results indicate that the batch nature of the MANTRA I algorithm does not severely hinder the convergence speed of the model.

POWER-SYSTEM APPLICATION
Putting the MANTRA I machine to work on a real application was one of the main objectives of the project. A target application in the field of power-system security assessment has been chosen because of its heavy computational requirements. Section 5.1 briefly describes the application. Section 5.2 gives an estimate of the computational power required. Finally, section 5.3 gives an overview of the convergence quality and of the performance currently reached on the prototype for this application.

Application to Power-System Security Assessment
The application of Kohonen self-organizing feature maps to power-system security has been rst developed and implemented on a conventional workstation 18].The purpose of this application is to predict whether the power ows in the branches (lines and transformers) and the bus voltages of a system will, after an unforeseen outage, exceed the supported limit of the corresponding components or not.Power ows in a power system may be computed by solving a set of nonlinear equations using an iterative method, for instance Newton-Raphson's.Given a current operating point, clas-sical static security analysis considers every possible combination of outages and iteratively computes a power ow for each.This heavy computational burden prevents real-time computation on sequential hardware.Moreover, conventional simulation provides only quantitative results, leaving the interpretation of the current system state and its potential stochastic evolution to the powersystem operators.Since decisions in power transmission system control centres often have to be taken under time pressure and stress, a fast, concise, and synthetic representation of security assessment results is essential.
In this approach, the operational points of the multi-dimensional power-system state space are mapped onto a two-dimensional Kohonen network by dividing the security space into categories. The centres of each class are the neurons located at the coordinates defined by the weight vectors. The two-dimensional picture of the network gives a quite accurate interpretation of the situation.
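The mapping of an operating point onto the two-dimensional network can be sketched as follows. This is an illustrative sketch only, not the MANTRA I implementation: it assumes a squared Euclidean distance, a row-major layout of the neurons on the grid, and a quantization error defined as the mean squared distance to the winning neuron. All function names and the toy values are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def best_matching_unit(weights: List[Vector], x: Vector, cols: int) -> Tuple[int, int]:
    """Return the (row, column) grid position of the neuron closest to x.

    Assumes the neurons are stored row by row, with `cols` neurons per row.
    """
    winner = min(range(len(weights)), key=lambda i: squared_distance(weights[i], x))
    return divmod(winner, cols)


def quantization_error(weights: List[Vector], data: List[Vector]) -> float:
    """Mean squared distance between each input vector and its winning neuron."""
    total = sum(min(squared_distance(w, x) for w in weights) for x in data)
    return total / len(data)


# Toy 2 x 2 map over a two-dimensional state space (made-up values).
weights = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
operating_point = (0.9, 0.2)

position = best_matching_unit(weights, operating_point, cols=2)
error = quantization_error(weights, [operating_point, (0.0, 0.0)])
print(position, round(error, 3))
```

In the toy example the operating point is closest to the third neuron, which sits at row 1, column 0 of the grid.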

Computational Requirements
In the proposed application, a new security analysis and a new learning process are to be performed every day to account for the daily changing operating conditions. A utility, even if it controls only a small part of the system, should take a major part of the network into account to yield accurate results. The Swiss high-voltage transmission network, for instance, consists of roughly 150 buses and 250 lines. The learning phase requires the processing of input vectors composed of 300 elements. The number of neurons in the feature map may not realistically be much more than 1000, since for each of them an expensive power-flow computation has to be performed after the neural-network training to interpret the results. Considering that $10^5$ iterations is an ordinary training length for a large Kohonen network, the number of connection updates necessary for a real-world power system may be roughly estimated at a minimum of $300 \times 1000 \times 10^5 = 3 \times 10^{10}$ connection updates for a daily training. Such an amount of computation takes more than 8 hours on a conventional workstation (approximately 1 million connection updates per second), which is too slow for daily operation. An increase in computational power of one to two orders of magnitude is required. Accelerators such as the one proposed in this paper could drastically reduce the learning time for larger power systems to acceptable levels.
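The estimate above is simple arithmetic and can be checked with a short script. The sketch below uses hypothetical helper names; the 10 and 100 MCUPS rates are not measurements but merely stand for the "one to two orders of magnitude" improvement mentioned in the text.

```python
def connection_updates(input_dim: int, neurons: int, iterations: int) -> int:
    """Total connection updates: every weight of every neuron at every iteration."""
    return input_dim * neurons * iterations


def training_hours(updates: int, mcups: float) -> float:
    """Training time in hours at a sustained rate given in MCUPS
    (millions of connection updates per second)."""
    seconds = updates / (mcups * 1e6)
    return seconds / 3600.0


updates = connection_updates(input_dim=300, neurons=1000, iterations=10**5)
print(f"{updates:.1e} connection updates")

# 1 MCUPS is the workstation figure from the text; 10 and 100 MCUPS are
# hypothetical rates illustrating a gain of one and two orders of magnitude.
for rate in (1.0, 10.0, 100.0):
    print(f"{rate:6.1f} MCUPS -> {training_hours(updates, rate):5.2f} h")
```

At 1 MCUPS the script gives about 8.3 hours, consistent with the "more than 8 hours" quoted for a conventional workstation.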

MANTRA I Performance on the Power-System Application
As described in section 4, the performance of the system has been assessed by attempting to separate the hardware speed-up from the algorithmic effects. The test experiments are described in the following subsections.
5.3.1 Quality of Convergence. Figure 10 shows the evolution of the quantization error for one run of the power-system application on the MANTRA I machine compared to a run of the original algorithm. The training set has been generated using the IEEE 24-bus 38-line power system, one of the smallest standard test systems developed for benchmark studies of power-system software. With a dimension of 76 for the input vectors, this power system was well suited for the test of the MANTRA I and could fit into the DSP dynamic RAM of the prototype. The same set of learning parameters has been used for the original algorithm and for the MANTRA I. Figure 10 allows a rough evaluation of the algorithmic efficiency of MANTRA I in this application. Each unit on the x-axis corresponds to 120 training iterations.
It should be noted that, because of the combined effect of integer arithmetic, batch implementation, and multiple winners, MANTRA I converges with a slightly higher final error than the sequentially implemented floating-point version of the Kohonen algorithm. For a target error 50 % above the minimum error of the original algorithm, the original algorithm and the MANTRA I implementation need 12 × 120 and 16 × 120 iterations, respectively, to reach the target. This yields an algorithmic efficiency of 12/16 = 75 %.
To confirm that the algorithmic efficiency on MANTRA I is high and not very far from unity, it was tested with additional data from other applications. Test runs confirmed that the discrepancies between the MANTRA I version and the original version of the Kohonen algorithm are smaller than the standard deviation of the original algorithm itself. For instance, averaging ten runs for each of several sets of learning parameters in an application to speech-vector codebook quantization, MANTRA I actually performed better in approximately 50 % of the cases.
As with most neural-network algorithms, the performance of the Kohonen algorithm, including the version running on MANTRA I, is quite sensitive to an inappropriate choice of the training parameters. Because the Kohonen algorithm converges stochastically towards an equilibrium point, different random initializations of the weight vectors coupled with a non-ideal choice of training parameters can result in different error rates. Experiments with comparatively small sets of training data (for instance, 1562 vectors in the case of the power-system data, each presented several times, chosen at random) indicated that, over several runs, differences in the final quantization error were higher than expected; the reasons for this behaviour have to be investigated further. Whether the ideal learning parameters are identical for the original and the MANTRA I version of the Kohonen algorithm is also an open question at present. Even though no evidence of a strong change in robustness has been found, more extensive experiments should be conducted to produce statistically significant observations.
In conclusion, the final error rate reached by MANTRA I appears acceptable for the power-system application, and the number of iterations required is approximately equivalent to that needed by a traditional floating-point version of the Kohonen algorithm.
5.3.2 Hardware Speed-up. Table 2 shows the actual MCUPS performance reached by the current machine with the Kohonen algorithm on the power-system application, for different problem sizes. For efficiency reasons, the number of neurons is chosen as a multiple of the number of PEs per side of the array (20 in the current machine), so that as few processors as possible remain idle. The actual performance depends heavily on the problem size; the machine performs badly on small problems, for which conventional workstations could be sufficient. However, for larger problems, up to 14 MCUPS have been reached for the Kohonen algorithm, while a conventional platform performs around 1 MCUPS on the same algorithm [11].
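The sizing rule and the speed-up figures of this section can be illustrated with a small sketch. Several assumptions are made here that go beyond the text: neurons are assumed to be processed in groups of one array side (20 PEs), utilization is taken as the ratio of useful to padded neurons, and the overall gain is assumed to be the product of the hardware speed-up and the algorithmic efficiency. The function names are hypothetical, and combining the 14 MCUPS peak with the 75 % efficiency measured on a different problem size is only indicative.

```python
def padded_neuron_count(neurons: int, pes_per_side: int = 20) -> int:
    """Round the neuron count up to the next multiple of the array side."""
    groups = (neurons + pes_per_side - 1) // pes_per_side  # integer ceiling
    return groups * pes_per_side


def utilization(neurons: int, pes_per_side: int = 20) -> float:
    """Fraction of the allocated neuron slots doing useful work (assumed model)."""
    return neurons / padded_neuron_count(neurons, pes_per_side)


def overall_gain(accel_mcups: float, ref_mcups: float, efficiency: float) -> float:
    """Hardware speed-up scaled by the algorithmic efficiency (assumed product)."""
    return (accel_mcups / ref_mcups) * efficiency


print(padded_neuron_count(90), utilization(90))    # 90 neurons occupy 100 slots
print(padded_neuron_count(100), utilization(100))  # a multiple of 20 wastes nothing

# 14 MCUPS and 1 MCUPS come from this section; 0.75 is the algorithmic
# efficiency reported in section 5.3.1.
print(overall_gain(14.0, 1.0, 0.75))
```

Under these assumptions the combined gain is about 10, in line with a speed-up of roughly one order of magnitude over a conventional workstation.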

CONCLUSION
The MANTRA I machine has been designed, built, and tested. This paper concentrated on the phases beyond hardware development, toward practical applications and real use. Topics such as programming and practical performance are unfortunately seldom described in the literature [1] and appear to be often disregarded as secondary to the hardware design itself. The MANTRA I experience shows that a number of key problems arise only when one tries to put the hardware to work and abandons toy problems to tackle real ones.
A first problem is the criticality of the low-level programming of this sort of dedicated machine, which may lead either to an overwhelming programming complexity or to poor hardware utilization and performance. The difficulties arise because many independent execution units (e.g., systolic arrays, look-up tables) are exposed to the programmer's view and because of long pipelines. Techniques to preserve the hardware efficiency and at the same time to structure the code have been presented.
Second, neural networks are not as insensitive to algorithmic modifications and to reduced precision as is often supposed, especially by hardware designers. For instance, the MANTRA I system has to use a counter-intuitive version of the Kohonen algorithm, for which convergence properties similar to those of the original algorithm have been proved. Simulations have also been presented to confirm the theoretical results.
Finally, other problems had to be taken into account when interfacing the dedicated hardware itself with a conventional computational server and designing an efficient programming environment.
Despite the many difficulties, the MANTRA I prototype, with one fourth of the supported processing elements, displays a performance about one order of magnitude above that of a conventional workstation. Still, to provide users with dedicated machines whose performance on neural networks is close to that of supercomputers but at desktop prices, larger machines should be built with more sophisticated technologies (e.g., custom layout instead of standard cells, higher integration and clock rate). More advanced packaging technology would also be required to solve the severe reliability problems that have been experienced on the current prototype.
On the grounds of the experience gained, future research should address the following critical directions: (1) Find more flexible and powerful programming models that make the low-level programming of novel neural-network models possible for trained users, and thus offer the flexibility that users require. (2) Find techniques to perform the microcode compaction on-line in hardware, to avoid an overhead that may become impractical for algorithms whose control flow is heavily data dependent. These include emerging connectionist models such as evolutive networks. (3) Improve the generality of the basic PE architecture to support this broadening of the algorithmic target.
Only by addressing the above problems as a whole, rather than restricting oneself to the last of them, can dedicated systems for neural networks become attractive for potential users and competitive with large and expensive computational servers.

Figure 2: The MANTRA I system integration.

Figure 1: The MANTRA I system.

Figure 5: Kohonen feature map arranged on a two-dimensional grid.

Figure 6: Performance of the system in millions of connections updated per second.

Figure 7: Convergence speed of a neurocomputer compared to a conventional computer as a function of time.

Figure 8: Convergence speed of a neurocomputer compared to a conventional computer as a function of algorithm iterations.

Figure 9: Comparison, on floating-point simulations, of the convergence speed of the Kohonen MANTRA I batch implementation with that of the pure batch algorithm and with the original on-line algorithm. The graph measures the quantization error of the network after every learning step consisting of T = 50 vectors.

Figure 10: Measurement on the power-system application of the convergence speed of the Kohonen MANTRA I implementation compared to a floating-point implementation.

Table 1: GENES IV array basic operating modes.
shows the operations that can be performed.W i represents the i ( Ĩv T ?W i )

Table 2: Implementation of the Kohonen algorithm: measured sustained performance for different problems.