A Parallel Processing Approach to Transition Prediction for Laminar Flow Control System Design

The performance of transport aircraft can be considerably improved if the process by which the wing boundary layer becomes turbulent can be controlled and extensive areas of laminar flow maintained. In order to design laminar flow control systems, it is necessary to be able to predict the movement of the transition location in response to changes in control variables, e.g., surface suction. At present, the technique which is available to industry requires excessively long computational time - so long that it is not suitable for use in the "design process." Therefore, there is a clear need to produce a system which delivers results in near realtime, i.e., in seconds rather than hours. This article details how parallel computing techniques on a KSR-1 produce these performance improvements.


1: Introduction
In order to sustain an aircraft in straight and level flight, two fundamental conditions must be satisfied.The first is that the lifting force generated by the wings must be equal to the weight and the second is that the thrust from the engines must be equal to the drag.
The drag force opposes the motion of the aircraft and acts in the direction of flight.It is made up of two components, these being the 'pressure' drag and the 'viscous' drag.Pressure drag is produced by the variation of the air pressure acting on the aircraft surface and is closely related to the lift force -in fact, it is the penalty that must be paid to produce the lift.Viscous drag on the other hand, is the result of tangential, or shearing forces, because, as a result of viscosity, air sticks to the surface.
On a typical transport aircraft in the cruise condition, the two components of the drag are approximately equal i.e: 50% of the drag is due to the action of viscosity.This means that a substantial fraction of the fuel which an aircraft carries is there to overcome viscous drag.It follows that, if the viscous drag can be reduced, a substantial saving in fuel and, consequently, operating cost is possible.
In flight, the effect of air viscosity is confined to a very thin layer close to the aircraft -the so-called 'boundary layer'.On the surface viscous forces require that the flow speed must be zero.However, a few millimetres above the surface the air must be travelling at almost the flight speed of the aircraft.
The motion of air within this thin boundary layer takes one of two possible forms.It may set up a simple, steady flow where each layer of air slips slowly over its neighbour -this is called the laminar condition.The alternative is an unsteady flow with almost random fluctuations causing the various layers of air to be mixed violently with one another -this is the turbulent condition.
Not surprisingly, a turbulent flow produces more viscous drag than the laminar one -the ratio is approximately ten to one.At very low speeds, flows are always laminar.However, as speed is increased, there comes a stage at which laminar flow becomes unstable to the small disturbances which are always present in reality.At higher speeds, these disturbances are amplified and, when the amplitudes are sufficiently large, there is a breakdown and turbulent flow is produced.
It follows that, for a given aircraft configuration, there is a flight speed above which the boundary layer flow will be turbulent and the viscous drag will be large.Unfortunately, for all but the smallest aircraft, the boundary layers are turbulent in the cruise condition.
Until recently this situation was accepted as unavoidable and aircraft designs have been optimised on the assumption that boundary layer flow would always be turbulent.However, it has been known for over 60 years that, if some of the air in the boundary layer could be sucked through the surface, then the speed at which laminar flow becomes unstable can be increased.
The effect of suction is sufficiently powerful for laminar flows to be achieved at typical aircraft cruise conditions when the suction velocity is only 0.05% of the flight speed (i.e:only about 10cm/s!).If suction could be engineered, the drag of an aircraft could be reduced by as much as 30%.This would represent a quantum leap in aircraft performance since, in the current commercial climate, an aircraft which could deliver a 5% drag improvement relative to its competitors would capture the entire market.
There are two major obstacles to the development of an aircraft which uses the surface suction technique for Laminar Flow Control (LFC).The first is the provision of a suitable porous surface through which the air can be drawn.This has always been a serious problem since, in the past, surfaces which were porous did not have good load bearing properties.However, this difficulty is now effectively resolved because of the recent development of the laser perforating technique.This enables traditional aerospace materials -titanium, aluminium, steel and even composite material to be drilled with millions of holes as small as 50µm in diameter, placed in any desired pattern, with any desired spacing.This leaves the second problem which is that, in order to produce a design for a laminar flow control system, it is necessary to be able to estimate the conditions under which the boundary layer flow will undergo a 'transition' from the laminar to turbulent state.Moreover, it is necessary to be able to produce these estimates sufficiently quickly so that a conventional design process is not slowed down.
The problem of predicting the conditions necessary for the onset of transition is a particularly challenging one.In fact, at present, there is no complete theory for transition.Nevertheless, various semi-empirical methods have been developed over the years and some are appropriate for use in design.
Of these, the one which is most accurate and allows for all the important parameters i.e: flow compressibility (mach number), surface temperature (heat transfer) and wall transpiration (suction) is the 'e N ' method.This is based on an approximate formulation of the stability problem for a boundary layer which, when solved by a suitable numerical technique, produces dispersion relations for the unstable disturbances.These relations are used to track the disturbance amplitude development and an empirical criterion is used to determine the breakdown (transition onset) condition.
However, whilst the e N technique allows for all the physical effects which can influence transition, it requires a great deal of computation -so much so that the elapse time between predictions is far too long for it to be described as a design tool.One possible solution to this problem is the application of parallel computing techniques.
This paper chronicles the parallelisation of a laminar to turbulent transition prediction code, developed by the LFC Group, in the Department of Engineering, at the University of Manchester.
Section 2 introduces modelling techniques for the onset of turbulence and Section 3 describes the parallelism inherent in the solution method.Section 4 introduces the KSR1 and discusses both the scalar optimisations and parallelisation methods used.Finally, Section 5 summarises the main conclusions of the paper and Section 6 discusses future work.

2.1: Solution Methods
The general problem of predicting the onset of transition in a flow is extremely complex.Strictly speaking, the complete approach requires the full unsteady Navier-Stokes equations to be solved for a range of disturbances, which span the complete spectrum of free-stream fluctuations, surface roughness, surface vibration and sound.Such a calculation would have to be performed with very fine resolution of length and time scales and it would be necessary to specify every possible form of disturbance in order to ascertain which were amplified most rapidly.Moreover, the computations would have to be carried out sufficiently far downstream to capture the non-linear processes which lead to laminar flow breakdown and the ultimate establishment of turbulent flow.Even with the most powerful machines currently available, such calculations are only possible for simple flows under highly restrictive and ultimately unrealistic conditions e.g: fully developed pipe flow with temporally developing disturbances.For engineering purposes, when the basic flows are much more complex, an alternative approximate approach is called for.
A major simplification of the problem is produced by limiting the consideration to the development of small amplitude disturbances in a boundary layer flow, since this allows linearisation of the governing equation.This approach was first proposed in the 1920's by Prandtl's group in Göttingen [1].The complete analysis is available in many standard texts e.g: Mack [2] but, in essence, the arguments run as follows.
• The instantaneous fluid properties are expressed in terms of a mean component plus a fluctuating component e.g: , etc.
• It is assumed that the complete unsteady flow satisfies the Navier-Stokes equations.
• The amplitudes of the disturbed quantities are assumed to be sufficiently small for products of fluctuating components to be negligible.
• The mean flow is assumed to satisfy the steady, boundary layer equations.
• The normal-to-surface component of the velocity is assumed to be negligibly small compared with the streamwise component.Thus the flow is taken to be parallel.
Having taken the above steps the resulting equations for the disturbance components are found (by inspection) to have harmonic solutions.Since the problem has been linearised, a general disturbance can be constructed by superimposing normal modes of the form (EQ 1) where, in general, , , and are complex quantities.By substituting expressions for the disturbance quantities of the form of Equation 1 into the governing equation, a system of equations is obtained which can be used to determine the characteristics of travelling waves propagating through the flow.These waves are known as Tollmien-Schlichting (T-S) waves.It is interesting to note that, originally, the above stability analysis was carried out in the absence of any experimental evidence that such waves could exist.Verification of the existence of T-S waves and their pre-cursor role in the process of boundary layer transition was not provided until the 1940's, by Schubauer and Skramstadt [3].
Finally, in order to produce further simplification, the flow may be assumed to be incompressible; it is then possible to show -see Squire [4] -that, for a two-dimensional boundary layer, the most unstable waves propagate in the mean flow direction i.e: . Consequently, from the point of view of transition prediction, only two-dimensional disturbances need to be considered.This being the case the stability problem reduces to the solution of a single, fourth order ordinary differential equation: In the early days of stability computation, the final step was to assume that the disturbance grew in time but not in space i.e: was zero.This produced a well posed mathematical problem.However, in the physical world in which waves are observed to propagate in the mean flow direction a slightly more realistic approach is to assume that the disturbances grow in space but not in time i.e: is zero.This spatial form for the T-S wave is the preferred option for use in transition prediction.
Solutions to the O-S equation permit the calculation of the dispersion relation for the disturbance waves of a specified frequency.The secondary problem then is how to use this information to predict the onset of transition.It has already been noted that breakdown of laminar flow occurs when the amplitude of amplified travelling disturbances becomes large.From Equation 1, it is immediately apparent that for a wave of fixed frequency, if, at the point of neutral stability, x 0 , the disturbance amplitude is A 0 then at station x, where x>x 0 : (EQ 4) In general, the boundary layer will change its thickness and Reynolds number between any two stations and, consequently, will vary with x.However for the purposes of evaluating , it is assumed that locally the flow does not vary with x.Hence, the amplitude ratio relation is only approximate.Nevertheless, it is a quantity which can be readily calculated and it does bear some relationship to the stability of the mean flow.
By examining a range of experiments in which transition was observed, it has been proposed that transition onset correlates with the condition where the wave which has undergone the greatest total amplification has just reached an amplitude ratio of e 9 i.e: (EQ 5) This is the basis of the so called e N transition method when the critical (transition onset) value for N in low disturbance environments is 9.

2.2: Numerical Method
The boundary layer (see Figure 1) will, in general, vary in thickness along the aerofoil.Therefore, we solve the O-S equation at a number of equally spaced positions along the aerofoil, finding the wave amplification rates at each; these positions are denoted 'stations'.A particular wave's amplification rate corresponds to the imaginary part of its wavenumber .At each station the O-S equation is solved (finding ) for a number of equally spaced frequencies; the number and bounds of these is specified.
An initial wavenumber approximation , must be supplied by the user for the lowest frequency at the first station.When the actual wavenumber corresponding to that frequency has been calculated, the frequency is perturbed and a new wavenumber found for the perturbed frequency.The original and perturbed values are then used to make an approximation to the wavenumber solution, for the next frequency, at the same station (Figure 2a shows the dependencies, at the first station, denoted by vertical arrows).
When the instabilities for each frequency at a station have been calculated the results are stored.The wavenumber corresponding to the lowest frequency (f 1 in Figure 2a) is then used as an approximation to the wavenumber for the lowest frequency at the next station (Figure 2b shows the dependencies here, denoted by horizontal arrows); this method is possible as the boundary layer and velocity profile are slowly varying (therefore adjacent solutions have similar values).Thus, once the initial wavenumber approximation has been supplied, subsequent approxima-

FIGURE 2. Dependencies of frequencies and stations
To solve the O-S equation, finding a wavenumber solution from an initial approximation at each frequency and station, we use a shooting method [5].The shooting method solves boundary value problems (to find the solution for an ordinary differential equation between two points with known boundary conditions).In this case the freestream gives one boundary condition and the aerofoil the other (see Equation 3).An approximation is supplied at one boundary and the system of ordinary differential equations integrated to the other boundary (from the freestream to the aerofoil).This is repeated with another approximation.In our algorithm the two approximations, Z (1) and Z (3) , are integrated at the same time.The integrator used is a fourth order Runge-Kutta integrator.
The shooting method we use suffers from the problem of parasitic error growth.For this case (a two-dimensional wave in a two-dimensional boundary layer) the two solutions Z (1) and Z (3) each consist of four components.Z (3)  grows more rapidly with decreasing y than Z (1) .The parasitic error follows Z (3) and when the difference in magnitude of Z (3) and Z (1) becomes sufficiently large Z (1) is no longer independent of Z (3) .Before this occurs we apply Gram-Schmidt orthonormalisation; in fact we do this for each iteration.The large solution Z (3) is normalised component by component to give the new solution: where the overbar refers to a complex conjugate and {} a scalar product.Z (1) is then replaced by: (EQ 7) where underbar refers to the quantity in the numerator.The numerical integration proceeds with the new values of Z (1) and Z (3) .
A linear combination of Z (1) and Z (3) , can be found / ⁄ = which satisfies the boundary condition at the aerofoil but will not satisfy the condition unless the wavenumber approximation is an eigenvalue of the equation (the correct value).The residual can therefore be found.
The real part of the wavenumber approximation is perturbed by a small amount and the integration repeated.The imaginary part of the wavenumber approximation is then perturbed by a small amount and the integration repeated.Corrections and to the initial approximations and are obtained from the residual and numerical approximations to derivatives using Equations 8 and 9: (EQ 8) The corrected and are used to start a new iteration and the process continues until and are reduced below a preset criterion.This method is a quasi Newton-Raphson search.

2.3: Algorithm in Context
So far we have not discussed how initial data, such as wavenumber and velocity profiles, for the aerofoil are calculated.This section overviews this process.
We begin with an aerofoil and its corresponding pressure distribution.A simplified diagram of an aerofoil is given in Figure 3.

FIGURE 3. Aerofoil
This shape has a surface static pressure distribution which is of the form given by Figure 4.
Once a particular aerofoil shape is chosen the pressure distribution, corresponding to that shape, is used as input to a 'mean flow code'.This program calculates the various boundary layer parameters needed at each station, such as Reynolds number, boundary layer thickness and velocity profile; its output is used as input to the 'stability code'.These two programs are separated so that certain control parameters can be set and others checked before running  The control parameters we set are the initial wavenumber approximation for the first frequency at the first station, the frequency range to be examined and the number of frequencies within this range.
It is known from previous experience that in practice two dimensional instability waves tend to occur within the region 500 to 5,000 Hz.Therefore, the first time the instability program is run, a spread of frequencies across this range is examined, for a number of stations along the aerofoil; typically 50 to 100 stations and 5 to 10 frequencies.The first run is effectively exploratory, to find which frequencies and stations to concentrate on.We then re-run the program with a greater number of frequencies and stations over the range of interest; typically 100 to 400 stations and 10 to 40 frequencies.This process may be repeated two or three times before an 'sufficiently accurate' picture is obtained.
The output consists of two files, a large diagnostic file which records virtually all relevant variables, and a file which gives the amplification rate ( ) for each frequency at each station.
is the amplification rate at a station (for a particular frequency).To convert this into an amplitude ratio, needs to be integrated along the aerofoil.The natural log of the amplitude ratio is then plotted against position along the aerofoil for each frequency; an example is shown in Figure 5.
Figure 5 shows a number of amplitude ratio plots for various frequencies.It is the profile of these frequencies, i.e: the largest amplitude ratio at any given position along the aerofoil, that is important when predicting where turbulence starts.This is because turbulence only begins above a certain (experimentally determined) value of N, regardless of which frequency first reaches it.position pressure . Amplitude ratios

2.4: Available Parallelism
This section discusses the parallelism apparent from the numerical method described in Section 2.2.

2.4.1: Pipelined Station Parallelism
When examining Figure 2b it is clear that computation for frequencies at a station can begin once the wavenumber for the first frequency (f 1 ) of the previous station has been calculated; this gives a parallel pipeline effect demonstrated in Figure 6.

FIGURE 6. Overlapped solutions
In Figure 6 the maximum overlap (number of stages in the pipeline) is the number of frequencies per station.In fact the overlap is greater than this as a station does not rely on the perturbed frequencies wavenumber solution for the previous station; it can proceed as soon as the actual wavenumber for the frequency at the previous station has been calculated.This effect is shown in Figure 7.

FIGURE 7. Maximum station overlap
Thus the potential number of stages in the pipeline (assuming all wavenumber frequency pairs take the same time) is two times the number of frequencies per station.As will be seen in the next section the pipeline is effectively reduced to that shown in Figure 6 when the frequency and perturbed frequency are parallelised.

2.4.2: Frequency and Perturbed Frequency
As described in Section 2.2, the wavenumber is solved for each frequency and for that frequency perturbed.The perturbation allows a wavenumber approximation to be extrapolated for the next frequency at that station.
At first glance one would expect these two solutions to be independent; however to decrease the number of iterations needed to converge for the perturbed frequency, our sequential algorithm uses the wavenumber solution of the frequency itself; this imposes a dependency.
To enable these two solutions to proceed in parallel the same wavenumber approximation used for the frequency can be used for the perturbed frequency.This change in the algorithm may increase the amount of computation as the solution to the perturbed frequency could take longer to converge.
Another possible problem is that for a particularly bad wavenumber approximation the perturbed frequency may not converge -therefore the parallel algorithm is potentially less stable than the sequential algorithm.This problem can be eliminated as, on failure, the parallel algorithm can revert to its sequential form.For all test cases examined so far this has not been required.
The parallelism so far described is shown in Figure 8:

2.4.3: Wavenumber Approximations
For each frequency, the wavenumber approximation and perturbations of its real and imaginary parts ( and ) are integrated through the boundary layer.As suggested in Section 2.2, these are independent and can therefore be calculated in parallel.
The parallelism so far described is shown in Figure 9: Figure 9 shows a potential loss of efficiency when parallelising at the frequency and perturbed frequency level described in the previous section.In the example given one solution converges in two iterations whilst the other takes three.

2.4.4: Integration of Solutions
Each wavenumber approximation and perturbed values of its real and imaginary parts has two dependent solutions Z (1) and Z (3) .These are integrated from the edge of the boundary layer to the aerofoil, see Section 2.2.Each integration is independent, however orthonormalisation is applied at each step, see Figure 10.

3: Programming Techniques
The codes described in this paper were written and are maintained by the LFC Group in the Aeronautics Department of The University of Manchester, U.K. The stability code is called Melissa.
Melissa is written in standard Fortran77 and as a result has run without modification on all platforms tried.The code itself is approximately 1000 lines long.
The O-S solver utilised in Melissa was taken from an earlier, more general purpose code, written in FortranIV.Due to both the language used and the original authors coding practice in this earlier code, the O-S solver section of Melissa has a typical 'dusty deck' form.

3.1: Code Restructuring
To help understand the algorithm used to solve the O-S equation and make the code more readable certain code restructuring was performed.
Loops implemented with a counter and conditional branch using a GOTO were converted into DO loops.Redundant loops and code segments associated with the equation solver performing obsolete functions were removed.Tangled control flow and conditional GOTO's were converted into their IF THEN ELSE form.A number of redundant input variables and input variables read more than once were removed.Large code fragments were converted to subroutines to aid readability.Finally some COMMON blocks were removed and variables passed as arguments.

3.2: The Kendall Square Research KSR1
The scalability of shared memory multiprocessors has traditionally been limited to tens of processors due to memory access contention.As a result it has been widely accepted that distributed memory is the key to scalable parallel machines, however these machines have been notori-integ Z (1)   integ Z (3)   orthonormalise Z (1) and Z (3)   α+∆α i α α+∆α r ously difficult to program.The KSR1 is a distributed memory machine that provides a single address space, supported by proprietary hardware [6].The advantage being a shared memory programming model for the user.This technique has been termed virtual shared memory (VSM).This term can cause confusion as the KSR1 also supports virtual memory (VM) with an address space of 1 million megabytes (2 40 ).
Each KSR1 processor is a 20MHz RISC-style superscalar 64-bit unit operating at 20 Mips and 40 Mflop/s (peak).A KSR1 system contains from 8 to 1088 processors with a peak performance range from 320 to 43,520 Mflop/s.
Each processor has 0.5Mbyte of subcache, split equally between instructions and data, and 32Mbyte of cache.It is therefore a non-uniform memory access (NUMA) style memory system.In this system instructions and data are not bound to specific physical locations, rather they migrate to where they are being referenced; this is termed a cacheonly memory architecture (COMA).
The interconnect topology is a 2 level hierarchy of slotted uni-directional rings, known as ring0 and ring1.Each ring0 can have a maximum of 32 processor memory pairs and has a bandwidth of 1 Gbyte/s.The ring1 connects up to 34 ring0's and has a bandwidth of 1-4 Gbyte/s depending on configuration.The KSR1 at Manchester is a 64 processor machine.
A thread (termed pthread by KSR) is a sequential flow of control within a process and is the underlying mechanism used to execute the parallel constructs available to Fortran programmers.These constructs -parallel regions, parallel sections and tile families -form a high level interface to pthreads.The user inserts these parallel constructs, seen as comments to other compilers, around appropriate blocks of code.A pthread library for thread creation, barriers, locks, condition variables etc. can be accessed directly by the programmer if a finer level of control is required.

3.3: Scalar optimisation on the KSR1
The core element of Melissa, the O-S equation solver originally included the case of oblique waves (threedimensional waves); the values of these were set to zero in the input files.The redundant code associated with this was removed.
Two core functions, in which nearly all computation takes place, were each called twice with the same input parameters.These two calls were replaced by a single call and the result shared.
The innermost functions are called millions of times and have little work within them.These were manually inlined reducing the calling overhead.
In combination these scalar optimisations produced over four fold improvement in solution time, see Section 4.3.

3.4.1: Parallel stations
In Section 2.4.1 we describe the potential parallelism available by overlapping station solutions, see Figure 6.As the overlap is equal to the number of frequencies per station the maximum parallelism that can be usefully employed is equal to the number of frequencies.
If, for example there are 4 frequencies, thread 1 will be used to calculate stations 1, 5, 9 etc., thread 2 will calculate stations 2, 6, 10 etc., and so on; see Figure 11.

FIGURE 11. Maximum Station Parallelism
This ensures that all threads are kept as busy as possible and minimises the number of threads used.As mentioned in Section 2.4.1 this is, in fact, an oversimplification.To create the appropriate number of threads on the KSR1 we use a parallel region directive.A function is called within the parallel region which returns a unique value to each thread and is used in combination with an explicit modulo function to ensure each thread only calculates the appropriate stations.
We now need to delay the thread at the next station until the thread at the current station has calculated the α solution for its first frequency.To implement the above we use a 'mutex'; this allows only one thread through a section of code at a time (the first frequency calculation is effectively an ordered critical section) and a condition variable to ensure that the pthreads obtain the mutex in the correct order.These were implemented using calls to appropriate KSR pthread libraries.The code implementing this is shown below.The condition flag (flag(I)) is initialised to false for all instances except for I=1.If the 'wrong' thread grabs the mutex it yields on a pthread_cond_wait() until it is woken by the thread which has calculated the previous station wavenumber.When a thread has finished calculating the wavenumber required for the next station to start, it unlocks the mutex, sets flag(I+1) to true, and wakes the next thread if it is sleeping on the condition variable; this is shown below:

IF (I.NE.NSTAT) flag(I+1)=.true. call pthread_mutex_unlock(mu1, istat) IF (I.NE.NSTAT) THEN call pthread_cond_signal (icond(I+1), istat) ENDIF
A problem found in the use of parallel regions and condition flags is in the KSR1's Fortran77 optimisation; it assumes the code runs sequentially making appropriate optimisations.It is therefore possible for the compiler to 'optimise out' a condition variable as, sequentially, the condition is always true.On the KSR1 such explicit locks need to be declared volatile.This is a Fortran77 extension which stops the compiler from optimising that variable.making wavenumber results for the first frequency global so that the next station could read them and sequentialising the output to a file; this occurs after all the wavenumbers have been calculated at a station.This latter change was implemented with a naive spin lock condition variable.
It ensues that we cannot neglect the time taken for each thread to complete I/O before calculating the wavenumber solutions for its frequencies.This overhead can be reduced by adding another lock and condition variable, allowing the thread for a station to proceed with its I/O as soon as the I/O has been dealt with by the previous station, see Figure 12.

3.4.2: Parallel frequencies
As discussed in Section 2.4.2, the algorithm requires modification to calculate the wavenumber for the frequency and perturbed frequency in parallel.This involves employing the same wavenumber approximation by both the frequency and perturbed frequency.To implement this we need only change one IF statement.To run this in parallel we add the KSR tile directive given below:

.) END DO C*KSR* END TILE
The teamid argument to the tile directive is discussed in Section 3.4.6.

3.4.3: Parallel wavenumber approximations
As discussed in Section 2.4.3 the wavenumber approximation and perturbations of its real and imaginary parts are integrated through the boundary layer.As in the previous section to implement this we simply need to add the KSR tile directive given below:

.) END DO C*KSR* END TILE
The teamid argument to the tile directive is discussed in Section 3.4.6.

3.4.4: Integration of Solutions
As discussed in Section 2.4.4 each integration step of the two dependent solutions Z (1) and Z (3) can be executed in parallel, however this is not true for the whole integration due to the orthonormalisation of the solutions, see Figure 10.This was implemented using the KSR parallel sections directive, see below: The teamid argument to the parallel sections directive is discussed in Section 3.4.6.

3.4.5: Inner functions
When the code is examined we observe some inner functions which could be called in parallel.For example functions U and U1 independently search for the appropriate velocity and acceleration data respectively.However, although a large proportion of the computational time is spent in these routines, the time per call is too small to obtain any benefit on the KSR1.

3.4.6: Teams and nested parallelism
When a parallel construct is encountered, the 'appropriate' number of threads are allocated to the work within it.To achieve this, threads which have already been created and have finished their previously allocated work are utilised; such threads are kept in an 'idle pool'.If there are not enough threads in the idle pool then threads are created to make up the shortfall.
As threads start referencing instructions and variables, the appropriate data is fetched from remote processors (if there is no instance of that data in the correct state locally).When threads finish their work they return to the idle pool.
If another parallel construct is encountered later, which accesses the same variables, it is sensible to use the same threads and processors that executed previously, as the processors local memory may still have valid copies of data.If the same construct is encountered, then the instruction cache may also still have valid copies of instruction code, and the thread will have the correct data structures associated with it.The idea of data re-use here is, of course, the same as re-use of cache on a single processor machine.
To ensure the same threads and processors are used for subsequent parallel constructs we use 'teams' of threads.A team is created with a defined number of threads.This team can then be referenced in association with a number of parallel constructs and the same threads associated with the team will be utilised each time it is called.
Teams of threads were implemented for each level of parallelism in the code.Another important effect of using teams is observed when the granularity of work is close to that of the start up cost of the construct and the construct is called many times; this is because a team reduces subsequent construct start up overheads by a significant margin.
A subtle problem in the use of teams arises when parallelism is nested.This is due to run time thread creation.A team groups together a number of threads including the currently active thread, thus sets of calls to create appropriately sized teams at the beginning of the code are not desired.It must be the active thread, arriving at the parallel construct, that creates the team.Therefore the team is created (once) at the same nested level as the parallel construct.

4: Performance realisation 4.1: Test Data
As outlined in Section 2.3, we first apply a low resolution search to find the frequencies and aerofoil positions of interest; we then do a more detailed analysis around these areas.The low resolution search typically uses 5-10 frequencies and 50-100 stations.The high resolution analysis typically uses 10-40 frequencies and 100-400 stations.
In this paper we use two test cases to represent these different resolutions.Test case A has 4 frequencies and 40 stations.Test case B has 40 frequencies and 100 stations.

4.2: Results
The results reported in this paper were achieved on the KSR1-64 at Manchester University.Timing runs were taken in a multi-user environment.To ensure the exclusive use of the appropriate number of processors during the runs and to minimise interference from other users, cells were allocated exclusively to the program during its execution.This was achieved with the command 'allocate_cells -A n', where n is the number of processors and -A avoids 'loaded' cells such as those with ethernet cards.All results shown in the next sections are the average of three consecutive runs and all timings were made using the unix timer 'time'.
Results for varying numbers of processors are presented as temporal performance graphs [7].The dotted line in each graph is the 'naive ideal' line , (often termed 'linear speedup') where is the elapsed time for the serial program for a fixed problem size and is the elapse time for the same problem size on processors.
Other results are presented in tabular form in which we give absolute time and solutions per second.A solution is defined as the calculation of the wavenumber for one frequency (and its perturbed complement) at a station.
The results described in this paper extend work

4.3: Single cell
A number of scalar optimisations have been applied to the code during the course of our work.Table 1 shows the performance improvements obtained.
The solution rates differ for the two cases as case B has a better convergence rate for the iterative method used to solve the O-S equation.

4.4: Parallel stations
The temporal performance for cases A and B for increasing numbers of processors are shown in Figure 13 and Figure 14 respectively.

FIGURE 13. Case A Parallel Stations
The two results in each graph (S and Sio) represent the advantage of performing I/O as soon as it is possible to do so; this effectively increases the pipeline length, see Figure 12.
The performance results 'drop away' from the naive ideal line due to two major factors.The first is that we run out of overlapped work to do; note that this is not a sharp cut off, as the pipeline overlap varies during the run due to the iterative method used.The second effect is due to the overhead of 'filling up' the pipeline.This effect is more prominent in case B (Figure 14), as the ratio of the pipeline length (a function of the number of frequencies), to the number of stations, is larger than in case A.

4.5: Parallel frequencies
To implement parallel frequencies we modified the algorithm (see Section 2.3.2).Table 2 shows performance comparisons for the serial code and the parallel version run both serially, and in parallel, for test cases A and B.
The parallel algorithm runs more slowly than its sequential counterpart.The temporal performance of cases A and B with the parallelism of parallel frequencies and parallel stations combined is shown in Figure 15 and Figure 16 respectively.Again the two graphs show the advantage of performing I/O as soon as it is possible to do so.Parallelising at the frequency level has the advantage of reducing the pipeline length but has the disadvantage of modifying the algorithm to a less efficient form.Figure 15 and Figure 16 show that parallel frequencies are only beneficial when the pipeline length is large, i.e: close to the number of stations and that for small numbers of processors it is better to parallelise at the station level only.
The interesting features in the parallel station and parallel frequency line (S+F) in Figure 16

4.6: Parallel wavenumber solutions
Table 3 shows the performance results when solving wavenumber solutions in parallel.
The temporal performance of cases A and B, when the parallelism of parallel wavenumbers and parallel stations are combined, are shown in Figure 17 and Figure 18 respectively.

FIGURE 17. Case A Parallel Stations and Wavenums
The results of parallelising at the station level only, are also shown for comparison.There is a clear advantage in parallelising at the wavenumber level as we obtain improved performance in both case A and B. In case A the improvement is due to the increase in the amount of parallelism available.In case B this improvement results in better performance for the number of processors used; a major part of this is due to the reduction of the pipeline overhead.

4.7: Parallel integration
When the integration was parallelised as described in Section 3.4.4,the program ran more slowly.The degradation in performance is due to the startup overhead of the KSR parallel section construct being close or greater than the work inside each section.To reduce this overhead we expanded the KSR parallel sections (see Section 3.4.4)outside the integration loop, thus reducing the number of section startups.This was achieved by duplicating the loop and synchronising manually via shared variables  Clearly we only obtain a modest improvement in the solution time when parallelising sections.

5: Conclusions
We have shown that for a typical low resolution search, we can reduce the solution time from minutes to near interactive times.In test case A, we have reduced the solution time from 6 minutes to under 8 seconds on 24 processors (this is the maximum number of processors that can be usefully employed on this problem size).
We have also demonstrated that for a typical high resolution search, we can reduce the solution time from hours to seconds.In test case B, we have reduced the solution time from over 1hour 51 minutes to under 40 seconds on 54 processors.
These results were obtained by a combination of scalar optimisation and parallelisation.Scalar optimisation accounts for over four fold improvement in the results.
These results open up the possibility of much larger runs involving more stations and frequencies.However, more significantly, we demonstrate near interactive performance for smaller runs, enabling a transformation of search methods and opening up new possibilities for this methods use in industry.
The algorithm has a reasonable amount of functional parallelism (for the standards of today's machines) but would not be classified as being massively parallel; useful parallelism on the KSR1 is limited to six times the number of frequencies in a run.The majority of this parallelism is coarse grained with little communication and should therefore be efficient on both shared memory and distributed memory machines.
There is further parallelism available, within which a large amount of the total execution time is spent; however the time per call is too small to exploit on the KSR1.
Conceptually, a message passing style would be a natural way to exploit the functional parallelism available and synchronisation needed in this algorithm, particularly at the 'level' of parallel stations.However, the combination of a shared memory model and dynamic thread control on the KSR1, coupled with automatic thread and data migra- tion, offered significant advantages when parallelising the existing sequential program.These features enabled each level of parallelism to be separately parallelised with little code modification; they also enabled each level and combinations of levels to be incrementally developed and tested; finally, we could effectively ignore thread placement and the migration of shared data.

6: Future Work
The algorithm described here models incompressible fluid flow.A compressible version is now operational; this has similar characteristics but is more computationally intensive.The compressible code will be the subject of future parallelisation effort.
The dependency of a station on the previous station may be broken by 'inputting' an accurate initial wavenumber approximation either at every station or, more feasibly, groups of stations.This would increase the amount of parallelism available and reduce the functional pipeline overhead, see Figure 19.The accuracy of initial wavenumber approximations is one of the current focuses of the LFC group.

FIGURE 19. New Parallelisation Strategy
The algorithm described here is also constrained to a two-dimensional flow.A three-dimensional version of their model is now operational.The three-dimensional code will again increase the computational requirements.In this case a different parallelisation strategy will also have to be adopted, as the three-dimensional version imposes new orderings on the computation (such as dependencies across the wing).
Work aimed at improving the efficiency of a search procedure in one of the core routines, eliminating any further redundant work and reducing the parallel overheads of air flow suction holes to control the boundary layer the time consuming stability program.

FIGURE 8 .
FIGURE 8. Station and Frequency Parallelism

FIGURE 14 .
FIGURE 14. Case B Parallel Stations up may also yield further performance benefits.

Case B Parallel Stations and Wavenums
1, see below.

Table 4
shows the performance of these different versions.