Executive Summary

Amphiphiles are molecules with hydrophobic tails and hydrophilic heads. When dispersed in solvents, they self-assemble into complex mesophases, including the beautiful cubic gyroid phase. The goal of the TeraGyroid experiment was to study defect pathways and dynamics in these gyroids. The UK's supercomputing facilities and the USA's TeraGrid were coupled together, through a dedicated high-speed network, into a single computational Grid for research work that peaked around the Supercomputing 2003 conference. The gyroids were modelled using lattice-Boltzmann methods; the parameter space was explored using many 128³ and 256³ grid-point simulations, whose results informed the world's largest three-dimensional time-dependent simulation, with 1024³ grid points. The experiment generated some 2 TBytes of useful data. In terms of Grid technology, the project demonstrated the migration of simulations (using Globus middleware) to and fro across the Atlantic, exploiting the availability of resources. Integration of the systems accelerated the time to insight. Distributed visualisation of the output datasets enabled the parameter space of the interactions within the complex fluid to be explored from a number of sites, informed by discourse over the Access Grid. The project was sponsored by EPSRC (UK) and NSF (USA), with trans-Atlantic optical bandwidth provided by British Telecommunications.

Calculations exploring the parameter space of the interactions within the complex fluid were steered from University College London and Boston, informed by scientific discourse over the Access Grid. Steering requires reliable, near-real-time data transport across the Grid to visualization engines. The output datasets were visualized at a number of sites, including Manchester University, University College London and Argonne National Laboratory, using both commodity clusters (Chromium at ANL) and SGI Onyx systems.
In the following we summarise the extent to which the Project met its stated objectives:
- Securing high-level recognition for UK computational science on the world stage: the demonstration of the experiment at SC'03 resulted in the project winning the HPC Challenge competition in the category of "Most Innovative Data Intensive Application". Grid aspects of the project have been written up for a paper presented at GGF10, and scientific publications are in preparation.
- Enhanced expertise and experience to benefit the UK HEC and e-Science communities: the project resulted in the first transatlantic federation of major high performance computing facilities through the use of Grid technology. The project pointed to the need to develop robust applications that can exploit a range of different high-end systems architectures.
- Added value to the construction/operation of the UK e-Science grid: the technical teams developed valuable interworking relationships, as well as experience of the practical issues in federating such facilities by a "high performance backplane". The results feed directly into the proposed ESLEA project.
- Data to inform strategic decisions about future UK involvement in global grids: remote visualisation, checkpointing and steering require access to very high bandwidths, which need to be dedicated and reservable.
- Long-term strategic technical collaboration with ETF sites: the project established and strengthened relationships, in particular between the visualisation and Globus communities in the UK and the USA. The project gave an insight into what is happening in the USA and established strong technical relationships between operational HEC Grid centres in both countries.
- Long-term scientific collaborations with users of the ETF sites: the project built on a strong and fruitful existing collaboration, and such a relationship should be the cornerstone of future projects.
- Experiments with clear scientific deliverables: an explicit science plan was developed, approved and executed. Data analysis from the experiment is still ongoing.
- Choice of applications to ensure that benefits are returned to the computational science and e-Science communities: the discretisation-grid-based technology is key to a number of other applications, in particular in the area of computational engineering, and the experiences have already been reported through publications accepted by GGF10.
- A reciprocal arrangement that encourages ETF participants to contribute complementary demonstrators: it was not possible, given the late approval date, to put effort into supporting a separate US-driven project.
This was a significant achievement, which helped to confirm that the UK has a world-class presence in the area of High-end Computing and e-Science. Lessons learned include:
• The proposal was developed without peer review, although the experiment built on existing peer-reviewed projects. Collaborative ventures like this are valuable for exploring how to work with counterparts and for teasing out technical issues. They gain enormous "intangible" benefits, as noted above. Such projects may not always be appropriate for peer review if that review concentrates purely on the scientific output.
• The original intent was to have two experiments, one from the UK and one from the USA. In the event, there was none from the USA. Future experiments should be bilateral; given that these experiments will be planned and approved much further in advance of the target date, this should not be a problem in the future.
• The cost of carrying out the experiment was higher than initially estimated, as computer time was not budgeted for explicitly in the original proposal. A considerable amount of work had to be undertaken to port, optimise and validate the simulations on large numbers of processors. There needs to be some flexibility in resource allocation, as it is the nature of these exploratory experiments that it is not always possible to quantify the exact resource requirements in advance of undertaking a considerable amount of work.
• There was some disruption to the other users of the HPCx and CSAR services. Planning in future years should be improved to a) reduce the disruption to a minimum, and b) give users plenty of warning of any scheduled downtime. No complaints were received from the users about the level of disruption; indeed, a number of positive messages were received.
• HPCx and CSAR worked well together, a good example of collaboration between the two services, and the project also provided a good example of collaboration between the UK High-end Computing and e-Science Programmes.
• Site configuration issues are important, as most applications assume a specific system performance model. It became clear during the project that applications need to characterise performance on a range of target systems in terms of CPU, memory bandwidth, local I/O bandwidth, parallel file system bandwidth and network interconnect speeds between the file system and the outside world.
• Visualisation facilities within the UK are not capable of meeting the needs of the largest-scale capability simulations undertaken within this project.
• Effort needs to be put into developing a scalable solution for Certificate Authorities to accept each other's certificates. Work is needed on managing communications between systems with multiple access addresses, and the external visibility of high-end computing systems should be defined during the specification of their requirements.
• Technical interconnection by a high-speed link, even in this "simplest scenario", still required significant planning due to the large amount of equipment en route. Unless high-end machines are purchased with this concept in mind, they will most likely not have the required high-performance network interfaces or transport stacks in place to utilise the available bandwidth.
It is difficult for anyone to make an impact at SC, but the success of the experiment attracted a great deal of interest, with a number of press releases and media articles. This helped the UK to a much stronger showing at SC2003 than at previous meetings.

Introduction
The TeraGyroid project was an ambitious experiment to investigate the new opportunities for computational science created by the Grid, for example by demonstrating new scientific capabilities and international collaborative working, while at the same time establishing technical collaborations to support the development of national and international Grids. The federated resources of the UK High-end Computing facilities and the US TeraGrid were harnessed in an accelerated programme of computational materials science that peaked during the Supercomputing 2003 conference (SC'03). The application (LB3D), provided by the RealityGrid project (http://www.realitygrid.org), used lattice-Boltzmann simulations of complex fluids to study the defect dynamics and visco-elastic properties of the gyroid mesophase of amphiphilic liquid crystals [1,2,3,4]. The resources available on the combined US-UK grids enabled and accelerated the investigation, at unprecedented time and length scales, of phenomena that had hitherto been out of reach.
A central theme of RealityGrid is the facilitation of distributed and collaborative exploration of parameter space through computational steering and on-line, high-end visualization [5,6]. The TeraGyroid experiment realised this theme in dramatic fashion at SC'03. A series of demonstrations under the guise of "Transcontinental RealityGrids for Interactive Collaborative Exploration of Parameter Space" (TRICEPS) won the HPC Challenge competition in the category of "Most Innovative Data Intensive Application" [7]. The same work was also demonstrated live before an Access Grid audience during the SC Global showcase "Application Steering in a Collaborative Environment" [8]. A family of simulations at different resolutions was spawned and steered into different regions of parameter space by geographically distributed scientists on both sides of the Atlantic. The collaborative environment of Access Grid was key, not only for project planning, but also for managing the distributed experiment and discussing the evolution of the simulations. Figure 1 shows a snapshot of the Access Grid screen as seen in Phoenix during SC Global.

The remainder of this report is organised as follows. In section 2, we summarise the participants in the project and their respective roles. In section 3, we describe the science behind the experiment, the LB3D simulation software, the encapsulating RealityGrid environment, and the porting of this application to, and its performance on, the various facilities. In section 4, we discuss the Grid software infrastructure: Globus, the components used within the project, and the developments that had to be undertaken. Section 5 discusses the visualization of the output from the various facilities, and section 6 describes the networking infrastructure implemented for the experiment. Section 7 describes the scientific achievements. Section 8 summarises management information related to the organisation, promotion and resources used within the project, and section 9 summarises lessons learned.

The TeraGyroid Testbed and Project Partners
The TeraGyroid project brought together three distinct groups. The testbed and networks are depicted schematically (in much simplified form) in Figure 2. It was known from previous experience [9] that the end-to-end bandwidth available on production networks, coupled with the need to share this fairly with other competing traffic, would seriously hamper the ability to migrate jobs across the Atlantic, which requires transfer of large checkpoint files (1 TByte for the largest 1024³ simulation), and to generate data on one continent and visualize it on the other. Fortunately, BT donated two 1 Gbps links from London to Amsterdam which, in conjunction with the MB-NG project (http://www.mb-ng.net), the SuperJANET development network, the high-bandwidth SurfNet provision and the TeraGrid backbone, completed a dedicated experimental network to the SC'03 exhibition floor at Phoenix.
3 The Science and the Lattice-Boltzmann Model

The gyroid mesophase
An amphiphile is a chemical species whose molecules consist of a hydrophobic tail attached to a hydrophilic head.This dichotomy causes the molecules to self-assemble into complex morphologies when dispersed in solvents, binary immiscible fluid mixtures or melts.
Mesophases have been studied in detail for a long time and are of vast technological significance. Some mesophases are liquid crystalline, with features intermediate between a liquid and a solid. The gyroid is one of these: similar to phases ubiquitous in biological systems, it has important applications in membrane protein crystallisation, controlled drug release and biosensors [10]. The gyroid exhibits weak crystallinity and the presence of defects, which play an important role in determining its mechanical properties.
Gyroid mesophases have been observed to form in certain regions of a model parameter space. Instead of a "perfect" crystalline gyroid with no defects in its structure, several regions containing a gyroid phase are generated, each with a slightly different orientation; there are interface or "defect" regions between the individual gyroid domains, a situation analogous to magnetic domains. Figure 3 shows a volume-rendered dataset of a 128³ system, the minimum size that confers stability on the gyroid phase while avoiding appreciable finite-size effects. One can easily see the different domains, which stay intact even after very long simulation times. The close-up shows how perfectly the gyroid forms within a given domain. Our previous simulations of liquid crystalline mesophases show that surfactant-surfactant interactions need not be long-ranged in order for periodically modulated, long-range ordered structures to self-assemble [2]. The objective of TeraGyroid is to study defect pathways and dynamics in gyroid self-assembly via the largest set of lattice-Boltzmann (LB) simulations ever performed.

Lattice-Boltzmann Model
The LB3D lattice-Boltzmann model provides a hydrodynamically correct mesoscale fluid simulation method which describes the equilibrium, kinetic and flow properties of complex surfactant-containing fluids, with applications to bulk and confined geometries [3]. The mesoscopic, particulate nature of the lattice-Boltzmann model means that it can be applied to the study of condensed matter on length and time scales intermediate between the microscopic and the macroscopic, while retaining the hydrodynamic interactions that are critical to the evolution of structure. A distinguishing feature of the model used in this work is that it takes a "bottom-up" approach, simulating the behaviour of fluids by specifying mesoscale interactions between them, rather than imposing postulated continuum behaviour.
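The basic collide-and-stream cycle underlying such methods can be illustrated with a minimal single-component D2Q9 lattice-Boltzmann step in Python/NumPy. This is a toy sketch only, not LB3D: the production code is Fortran 90, three-dimensional and multi-component, with the amphiphile interactions that make the gyroid possible.

```python
import numpy as np

# D2Q9 lattice: nine discrete velocities and their quadrature weights
C = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
W = np.array([4/9] + [1/9] * 4 + [1/36] * 4)

def equilibrium(rho, u):
    # feq_i = w_i * rho * (1 + 3 c_i.u + 4.5 (c_i.u)^2 - 1.5 u.u)
    cu = np.einsum('id,dxy->ixy', C, u)
    usq = np.einsum('dxy,dxy->xy', u, u)
    return W[:, None, None] * rho * (1 + 3*cu + 4.5*cu**2 - 1.5*usq)

def step(f, tau=0.8):
    rho = f.sum(axis=0)                            # density moment
    u = np.einsum('id,ixy->dxy', C, f) / rho       # velocity moment
    f += (equilibrium(rho, u) - f) / tau           # BGK collision
    for i, (cx, cy) in enumerate(C):               # streaming (periodic)
        f[i] = np.roll(np.roll(f[i], cx, axis=0), cy, axis=1)
    return f, rho
```

Collision relaxes each site towards local equilibrium; streaming moves the distributions to neighbouring sites, and both steps conserve mass exactly, which is why uniform initial conditions stay uniform.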
The lattice-Boltzmann model captured within the LB3D code is fully computationally steerable using a library developed within RealityGrid, allowing researchers to interact with simulations from local workstations while the compute jobs are running on a massively parallel platform. LB3D was recently awarded a gold star rating for its scaling characteristics on 1024 processors of HPCx. The performance of the code has been optimised by EPCC, who also added parallel HDF5 support; parallel HDF5 allows every processor to write to the same file concurrently, which reduces the memory requirements of the code and improves its scalability substantially. Such a scalable code, combined with access to massive computing power, allows much larger amounts of condensed matter to be probed, effectively bridging the gap in length and time scales between microscopic and macroscopic descriptions that has afflicted simulations for decades.
The TeraGyroid project undertook simulations involving lattices of over one billion sites, run for extended simulation times to follow the slow dynamics. It is not only the size of the simulations that is unprecedented: we are also not aware of any other "bottom-up" approach to studying defect dynamics in self-assembled mesophases. This is not surprising, since the computational effort required is at least on the scale of what was available within the TeraGyroid project.
The TeraGyroid experiment represents the first use of collaborative, steerable, spawned and migrated processes based on capability computing. Scientific scenarios within the experiment included:
• exploration of the multi-dimensional fluid coupling parameter space with 64³ simulations, accelerated through steering;
• study of finite-size periodic boundary condition effects, exploring the stability of the density of defects in the 64³ simulations as they are scaled up to 128³ and 256³;
• exploration of the stability of crystalline phases to perturbations as the lattice size is increased over long time scales; and
• exploration of the stability of the crystalline phases to variations in effective temperature through simulated annealing experiments.

Applications Porting
LB3D is a state-of-the-art application written in Fortran 90 with dynamic memory allocation. The model size is characterised by the number of grid points in three dimensions multiplied by the number of variables associated with each grid point. For a three-dimensional grid with 1024 grid points in each dimension there are approximately 10⁹ points (1 Gpoint), and the code uses up to 128 variables at each point, which at double precision (8 bytes per variable) amounts to 1 TByte of main memory. This ignores the requirement for temporary arrays and the vagaries of how memory is allocated and de-allocated dynamically. Problems were experienced in porting the application to TeraGrid sites, but these were cleared up once a version of the Intel compiler was found that compiled the code correctly.
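The 1 TByte figure follows directly from these numbers; a back-of-envelope check, taking the 128-variable upper bound quoted above:

```python
grid_points = 1024 ** 3       # lattice sites in the largest run (~1.07e9)
vars_per_point = 128          # upper bound on variables per site
bytes_per_var = 8             # double precision
total_bytes = grid_points * vars_per_point * bytes_per_var
print(total_bytes / 2 ** 40)  # → 1.0 (exactly 1 TiB, before temporaries)
```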
Lesson: Site configuration issues are important, as most applications assume a system performance model that is usually tailored (for obvious reasons) to the centre where they undertake most of their work. It became clear during the project that high-performance applications need to characterise performance on a range of target systems in terms of CPU, memory bandwidth, local I/O bandwidth, parallel file system bandwidth and network interconnect speeds between the file system and the outside world.
One of the major achievements of the project was to undertake (possibly) the world's largest ever lattice-Boltzmann materials simulation, running on LeMieux at Pittsburgh Supercomputing Center:
• 1024³ lattice sites, giving finite-size-effect-free dynamics;
• 2048 processors and 1.5 TB of memory;
• 1 minute per time step on 2048 processors, for 3000 time steps;
• 1.2 TB of visualization data.
This is the class of simulation that should be running routinely on the next generation of UK High-end Computing facility. Access to facilities in the US TeraGrid project demonstrated the scalability of this class of simulations, and subsequent sections illustrate the scope of scientific questions that can be addressed in this particular research area.

RealityGrid Environment
In RealityGrid, an application is instrumented for computational steering through the RealityGrid steering library. A fully instrumented application, such as LB3D, supports steering operations including emitting and consuming data samples, taking checkpoints, and winding back. Emit and consume semantics are used because the application should not be aware of the destination or source of the data. Windback here means reverting to the state captured in a previous checkpoint without stopping the application. In RealityGrid, the act of taking a checkpoint is the responsibility of the application, which is required to register each checkpoint with the library. LB3D supports malleable checkpoints, by which we mean that the application can be restarted on a different number of processors on a system of a different architecture. Checkpoint/recovery is a key piece of functionality for computational steering. Sometimes the scientist realises that an interesting transition has occurred and wants to study the transition in more detail; this can be accomplished by winding back the simulation to an earlier checkpoint and increasing the frequency of sample emissions for on-line visualization. An even more compelling scenario arises when computational steering is used for parameter space exploration, as in TRICEPS.
The simulation evolves under an initial choice of parameters until the first signs of emergent structure are seen, and a checkpoint is taken. The simulation evolves further, until the scientist recognises that the system is beginning to equilibrate, and takes another checkpoint. Suspecting that allowing the simulation to equilibrate further will not yield any new insight, the scientist now rewinds to an earlier checkpoint, chooses a different set of parameters, and observes the system's evolution in a new direction. In this way, the scientist assembles a tree of checkpoints (our use of checkpoint trees was inspired by GRASPARC [11]) that sample different regions of the parameter space under study, while carefully husbanding his or her allocation of computer time. The scientist can always revisit a particular branch of the tree at a later time should this prove necessary. This process is illustrated in Figure 4, in which a lattice-Boltzmann simulation is used to study the phase structure of a mixture of fluids. Here one dimension of the parameter space is explored by varying the surfactant-surfactant coupling constant.
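The checkpoint tree amounts to a simple parent/child structure over steering metadata. The class below is illustrative only: the real implementation uses persistent Grid services (described next), and the parameter name `g_ss` is a made-up stand-in for the surfactant-surfactant coupling constant.

```python
class Checkpoint:
    """One node in a checkpoint tree, holding steering metadata."""
    def __init__(self, params, files, parent=None):
        self.params = params      # steered parameters at checkpoint time
        self.files = files        # location of the checkpoint files
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def path_to_root(self):
        node, path = self, []
        while node is not None:
            path.append(node)
            node = node.parent
        return path

# Emergent structure seen: take a checkpoint, then let it equilibrate.
root = Checkpoint({'g_ss': 0.1}, '/data/ckpt0000')
equil = Checkpoint({'g_ss': 0.1}, '/data/ckpt0001', parent=root)
# Rewind to the earlier checkpoint and branch with a new coupling.
branch = Checkpoint({'g_ss': 0.3}, '/data/ckpt0002', parent=root)
```

Branching from `root` rather than `equil` is exactly the "rewind and re-steer" move: two children of the same node record two regions of parameter space explored from the same starting state.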
Figure 5 shows a schematic representation of the LB3D three-tier architecture for computational steering, for the case in which a visualization component is connected to the simulation component. One or both of these components may be steered using the steering client. Both components are started independently and can be attached and detached dynamically. The service-oriented middle tier solves numerous issues that are problematic in a two-tier architecture. The services that support the checkpoint tree are not shown in the figure. Each node in the tree is implemented as a persistent, long-lived Grid service containing metadata about the simulation, such as input decks, the location of checkpoint files, and so on. The persistence of the tree is achieved by exploiting the ability of OGSI::Lite to mirror, transparently, the service data of a Grid service in a database. On-line, real-time visualization is an important adjunct for many steering scenarios, but we avoid confusing the two, deliberately separating visualization from steering. Control and status information are the province of steering; visualization is concerned with the consumption of samples emitted by the simulation. The reverse route can be used to emit samples from the visualization system to the simulation, where they might be interpreted, for example, as a new set of boundary conditions. The volume of data exchanged between simulation and visualization is large, and high-performance transport mechanisms are needed. We provide several, such as: (a) writing to and reading from disk, relying on a shared file system or a daemon responsible for transferring files from source to destination; or (b) direct process-to-process communication using sockets. Our current implementation of (b) is based on Globus-IO, but this introduces a dependency that greatly complicates the process of building the steering library and any applications that use it.
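Mechanism (b) amounts to length-prefixed framing of samples over a socket. A minimal sketch in Python, using plain sockets rather than Globus-IO; the 8-byte header is our choice for the example, not the steering library's actual wire format:

```python
import socket
import struct

def emit(sock, payload: bytes):
    # Prefix each sample with its length so the consumer knows
    # where one frame ends and the next begins.
    sock.sendall(struct.pack('!Q', len(payload)) + payload)

def consume(sock) -> bytes:
    size, = struct.unpack('!Q', _recv_exact(sock, 8))
    return _recv_exact(sock, size)

def _recv_exact(sock, n):
    # TCP delivers a byte stream, not messages, so loop until n bytes arrive.
    buf = b''
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError('peer closed mid-frame')
        buf += chunk
    return buf
```

The emit/consume naming mirrors the semantics described above: neither side needs to know who is on the other end of the socket.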

Grid Software Infrastructure
Simulations and visualizations must be launched, and services must be deployed. We do this using a suite of scripts that use the command-line interfaces to the GRAM and GridFTP components of the Globus Toolkit. We also provide a graphical tool, or "wizard", that allows the user to choose an initial condition (which can be done by browsing the checkpoint tree and selecting a pre-existing checkpoint), edit the input deck, choose a computational resource, launch the job (automatically starting the SGS), and start the steering client. The wizard also provides capabilities to select a visualization technique and visualization resource, start the visualization component and connect it to the simulation. The wizard handles the task of migrating a running job to a different resource on the grid, which involves taking a checkpoint, transferring the checkpoint files, and restarting the application on the new host. The wizard shells out to command-line scripts, which in general require some customisation for each application, to accomplish these tasks. With the exception of Globus-IO, we do not program to the Globus APIs directly.
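The migration step can be sketched as a thin wrapper that builds the two command lines. All hostnames, paths, binary names and flags below are invented for illustration; only the basic `globus-url-copy` (GridFTP) and `globus-job-run` (GRAM) usage patterns are assumed.

```python
def migration_commands(ckpt, src_host, dst_host, nproc):
    """Build (copy, restart) argv lists for a checkpoint-based migration."""
    # Transfer the checkpoint between GridFTP servers.
    copy = ['globus-url-copy',
            f'gsiftp://{src_host}/{ckpt}',
            f'gsiftp://{dst_host}/{ckpt}']
    # Restart on the destination via GRAM; binary and -restart flag
    # are hypothetical stand-ins for the application-specific script.
    restart = ['globus-job-run', f'{dst_host}/jobmanager',
               '-np', str(nproc),
               '/opt/lb3d/lb3d', '-restart', ckpt]
    return copy, restart

copy, restart = migration_commands('scratch/ckpt_0042.h5',
                                   'login.hpcx.ac.uk',
                                   'tg-login.example.org', 512)
```

Because LB3D checkpoints are malleable, `nproc` on the destination need not match the processor count of the source run.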
Note that we did not use the MDS components of Globus (GIIS and GRIS). This is not because of any doubt about the importance of such services, which we view as essential for the long-term future of the grid, but because our experiences with MDS as deployed on the grids available to us gave us reason to doubt the robustness of the services and the utility of the information published through them. Instead, we maintain lists of computational and visualization resources in client-side configuration files read by the wizard.
The RealityGrid registry ran on a Sony PlayStation 2 in Manchester. The Globus Toolkit was installed on all systems, with various versions (2.2.3, 2.2.4, 2.4.3 and 3.1) in use. Our use of Globus (GRAM, GridFTP and Globus-IO) exposed no incompatibilities between these versions, but we note that the GT 3.1 installation included the GT 2 compatibility bundles.
For the TeraGyroid simulation to work it was necessary to implement a 64-bit version of Globus GT2 on the HPCx IBM system, in particular the globus_io libraries, which are used to communicate data between the steering code and the computational code running on the HPCx nodes. Whilst we had previous experience of Globus GT2 on AIX systems and had already installed the GT2.2 version released as an installp binary for AIX, this did not contain any 64-bit code and also did not have the required libraries. Establishing the Grid software infrastructure took a lot longer than anticipated, nearly a month in total. The following are the steps that had to be taken to enable HPCx for the experiment:
i) It was decided, following compatibility tests, to use the latest GT2.4.3 source.
ii) Trial builds were done in 32-bit mode on a Power 3 system and the HPCx test and development system. This was successful.
iii) Work then focussed on building a 64-bit code from source. This was unsuccessful, as some code would not build and users' certificates could not be read.
iv) It was discovered that the version of OpenSSL (0.9.6i) shipped with GT2.4.3 is obsolete and does not work on AIX 5 in 64-bit mode; the new v0.9.7c does. However, it was not straightforward to retro-fit the open source distribution into the Globus package framework so that the two would build together.
v) We finally managed to compile OpenSSL independently, change the library names to match those required by the Globus distribution, and then build the rest. The build still would not accept e-Science certificates.
vi) This was discovered (by Mike Jones, Manchester) to be a problem with the way the old certificates had expected an "eMail=" field; in fact an "=eMailAddress" field was produced, which is now the accepted standard. Fortunately it was possible to change a simple character string to make this work.
vii) This resulted in a working version, and system changes could be made to enable the Globus gatekeeper to be started from inetd and to map users' certificate names onto local user names.
viii) The same software was ported onto login.hpcx.ac.uk.
ix) Changes were made to port the software onto a complementary login node using the DNS name for the trans-Atlantic MB-NG network (changes were required in local DNS tables to get correct reverse lookup).
x) HPCx has head (login) nodes and worker (development/interactive/production) nodes, the latter on a private network. This is common practice on large systems, especially clusters.
Stephen Booth (EPCC) wrote a generic port forwarder, started via inetd when a node process attempts to contact an external IP address. This was used to get the visualization data off the nodes and to send back steering parameter changes.
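The essence of such a forwarder is a bidirectional byte relay between a connection accepted on the private network and an upstream connection to the external target. The sketch below handles a single connection; the real EPCC forwarder was inetd-started and production-hardened, neither of which this toy attempts.

```python
import socket
import threading

def _pipe(src, dst):
    # Copy bytes one way until EOF, then pass the half-close along.
    while True:
        data = src.recv(4096)
        if not data:
            try:
                dst.shutdown(socket.SHUT_WR)
            except OSError:
                pass
            return
        dst.sendall(data)

def serve_one(srv, target):
    """Relay a single connection accepted on `srv` to `target` and back."""
    client, _ = srv.accept()
    upstream = socket.create_connection(target)
    t = threading.Thread(target=_pipe, args=(client, upstream))
    t.start()                      # client -> upstream in a second thread
    _pipe(upstream, client)        # upstream -> client in this thread
    t.join()
```

From the worker node's point of view the forwarder is indistinguishable from the external visualization host, which is exactly what lets steering traffic cross the private-network boundary.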
Lesson: Unfortunately, a large amount of effort had to be invested in working with, or around, the Globus Toolkit, supposedly the standard middleware for Grid applications and that underpinning both the UK e-Science Grid and the US TeraGrid. The RealityGrid experience over two years has been that construction of applications or Grid services using Globus is not straightforward. In addition, a vast amount of systems administrator time was required to compile and install it on the many machines involved in the project, often due to its tendency to require custom-patched versions of third-party libraries rather than building on top of existing infrastructure. We have found that it is much easier to construct Grid services using OGSI::Lite, a lightweight implementation of the emerging Open Grid Services Infrastructure standard [12].

Visualization
Amphiphilic fluids produce exotic mesophases with a range of complex morphologies. These can only be fully understood using sophisticated visualization software, with semi-immersive or fully immersive virtual reality techniques being essential in some cases. The complexity of these datasets makes visualization of large images itself a major graphics challenge. Using the VTK library, we are able to view up to 512³ datasets of the gyroid morphology using a PC (2 GB of memory and hardware volume rendering are minimum requirements). Visualization of billion-node models requires 64-bit hardware and multiple rendering units (such as multipipe OpenGL or Chromium, with which SGI IR4 pipes and TeraGrid Viz resources are equipped). The size of the datasets we generated challenges most visualization packages available today; indeed, the 1024³ datasets were visualized for the first time on the show floor at SC'03, using a ray-tracing algorithm developed at the University of Utah running on 100 processors of an SGI Altix. Access to large-scale visualization resources is therefore essential.

Lesson: The UK high-end visualization capabilities (and those of TeraGrid) are not yet capable of dealing with large-scale grid-based simulations of order 1 Gpoint.
Scientists and developers collaborated using Access Grid nodes located in Boston, London, Manchester, Martlesham and Phoenix. The tiled display on the left of Figure 1 was rendered in real time at the TeraGrid visualization cluster located at ANL, using Chromium, a cluster facility. The display on the right was rendered, also in real time, on an SGI Onyx system in Manchester. The visualizations are written in VTK, with patches to refresh automatically whenever a new sample arrives from the simulation. The video streams were multicast to the Access Grid using the FLXmitter library. SGI OpenGL VizServer™ was useful for allowing a remote collaborator to take control of the visualization. When necessary, the steering client was shared using VNC. An IRC back channel also proved invaluable. All Access Grid and VizServer traffic was routed over the MB-NG production networks.

Collaboration
The event provided a good opportunity to trial the complexity of establishing a collaboration between all of the relevant network bodies in the chain. This is a larger number than one might initially guess:
- University of Manchester Computing Service
- Daresbury Laboratory Networking Group
- MB-NG and UKERNA
- UCL Computing Service
- BT
- SurfNET (NL)
- Starlight (US)
- Internet-2 (US)
Coordination among all of these was required to establish configuration, routing and commissioning. A set of meetings was held between the network managers in the UK, the Netherlands and the US to solve the technical problems and agree on what the network should provide. Once the network was in place, the networking people were joined by application people running tests to solve end-to-end problems related to inconsistencies between test traffic and real application traffic. In fact this went exceptionally well, demonstrating both good will and that the right peer relations exist between the responsible technical staff. The configuration, showing router positioning, is shown in the figure below.
BT deserves special mention for project management and for the provision of the link from UCL to SURFnet in Amsterdam. This required interworking differing networking elements, such as native Gigabit Ethernet, CWDM, and concatenated STM-16 across two STM-64 SDH rings.
HPCx, the CSAR Origin 3800, and the visualization supercomputers in Manchester and UCL were all "dual-homed", connected simultaneously to the dedicated network and to the SuperJanet 4 academic backbone. Complex routing tables were required at the UK end, while route advertisement sufficed at the US end. Dual routing enabled researchers to log in and issue low-data-rate control and steering commands via the production network whilst enabling high-bandwidth transfer of files via the dedicated experimental network. This exercise has provided very valuable pre-UKLight experience, which will feed into the methods of operation that must be in place when we come to exploit the UKLight experimental links, which come online in early 2004.
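The dual-routing policy amounts to a lookup from traffic class to network. A minimal sketch follows; the network labels and the default choice are illustrative only, not the actual SC'03 routing configuration.

```python
# Hypothetical sketch of the dual-homed routing policy: low-rate control
# traffic uses the production backbone, bulk checkpoint traffic uses the
# dedicated experimental network. Labels are invented for illustration.
ROUTES = {
    "steering": "SuperJanet4",   # low data rate control/steering commands
    "checkpoint": "MB-NG",       # high-bandwidth checkpoint file transfers
}

def pick_network(purpose):
    """Return the network that should carry traffic of the given class,
    defaulting to the production backbone."""
    return ROUTES.get(purpose, "SuperJanet4")
```

For example, `pick_network("checkpoint")` would select the dedicated MB-NG path, while an unclassified login session falls back to the production network.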
Lesson: We have the appropriate peer-relation chain to enable connection from the UK to the USA. As a result, complex dedicated circuits can be provisioned at short notice.

Performance and Achieved Transport Rates
The installed link, composed of a mix of layer-1 technologies (OC-192, OC-48, 1 Gigabit and 10 Gigabit Ethernet), had an overall capacity of 2 x 1 Gbit/s. Between any two hosts, up to 1 Gbit/s could be achieved.
From Manchester to Chicago we were able to achieve a UDP rate of 989 Mbit/s using jumbo frames and 957 Mbit/s using a standard MTU size; both tests were memory to memory. However, during SC2003 the maximum achieved throughput for actual checkpoint files (using TCP) was only 656 Mbit/s. This was primarily due to:
- TCP stacks on end systems: it is well understood that standard TCP has performance problems in high-bandwidth, high-latency environments. Since the machines involved in the tests were SGIs running the IRIX operating system, no kernel patches were available to try protocols other than the standard vanilla TCP stack.
- MTU mismatch: at the time of the transfers there were still issues with configuring the hosts with MTUs of 8192, so standard MTUs were used.
- Disks on end systems: there is a performance hit on the end-to-end transfer when standard end-system disks are used. High-performance disk systems are required for optimum performance.
- Network interface card: the configuration of the interface card driver is significant in achieving the best end-to-end performance. We did not have time to investigate all the different options for the SGI end hosts.

This is typical, and demonstrates in practice what is well known: once the protocol limitation is removed, practical transport rates are limited by the end systems. All of these issues need to be addressed in the end systems well in advance in order to take advantage of multi-Gbit/s network links.
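The TCP-stack limitation can be made concrete with a bandwidth-delay product calculation. The sketch below uses illustrative figures (1 Gbit/s capacity, a ~100 ms trans-Atlantic round-trip time), not measured values from the experiment.

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product: the number of bytes that must be in
    flight to keep a path of this bandwidth and round-trip time full."""
    return bandwidth_bps * rtt_s / 8

# Illustrative figures: a 1 Gbit/s path with a ~100 ms trans-Atlantic RTT
# needs a ~12.5 MB TCP window; a classic 64 KB default window can fill
# only a tiny fraction of the pipe.
needed = bdp_bytes(1e9, 0.1)      # 12.5 MB in flight required
default = 64 * 1024               # 65536-byte default window
utilisation = default / needed    # roughly half a percent of capacity
```

This is why an untuned stack cannot approach line rate on such a path regardless of disk or NIC configuration.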
The plot shows the achieved rate in a test immediately following the SC2003 event. The measurements shown in this plot were conducted between a Linux machine based at Manchester and an SGI IRIX machine. Initially, up to 700 Mbit/s was observed with out-of-the-box ("vanilla") TCP, but once packets are lost the rate remains below 50 Mbit/s for the duration of the test. Better performance is observed when the sender's TCP stack is changed to High Speed TCP (HSTCP) or Scalable TCP, as shown in the plot. The connection itself was very reliable: we ran overnight tests lasting 12 hours using a specialist piece of network equipment (SmartBits) to generate continuous high-rate traffic streams at line rate, and these showed 0% loss on the link up to Chicago.
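The collapse of vanilla TCP after loss is consistent with the standard Mathis et al. throughput model, sketched below; the MSS, RTT and loss-rate figures are assumptions chosen for illustration, not values measured on this link.

```python
import math

def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate):
    """Steady-state throughput estimate for standard TCP (Mathis et al.):
    rate ~ (MSS / RTT) * (C / sqrt(p)), with C ~ 1.22 for periodic loss."""
    return (mss_bytes * 8 * 1.22) / (rtt_s * math.sqrt(loss_rate))

# With a 1460-byte MSS, a 100 ms RTT and a loss rate of 1e-4, standard
# TCP sustains only ~14 Mbit/s -- far below the 1 Gbit/s link capacity.
rate = mathis_throughput_bps(1460, 0.1, 1e-4)
```

HSTCP and Scalable TCP modify the congestion-control response precisely to weaken this inverse-square-root dependence on loss.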

Scientific achievements
During the weeks around, and mainly at the time of, Supercomputing 2003 in Phoenix, USA, about two terabytes of scientific data were generated. The analysis of such huge amounts of data is a very time-consuming task and will take much longer than the simulations themselves. In addition, we are grateful to have been given the chance to consume the CPU time we were allocated for the period up to Supercomputing 2003 but could not use up by then. This gives us the opportunity to try to answer questions arising during the analysis of the data already generated.
It is important to make sure our simulations are virtually free of finite-size effects. We therefore simulated different system sizes from 64^3 to 1024^3. Most of the simulations ran for about 100,000 timesteps, but some reach up to 750,000 and we may eventually pass the one million mark. For 100,000 timesteps we found that 256^3 or even 128^3 simulations do not suffer from finite-size effects, but we still have to check our data for reliability after very long simulation times, since we are especially interested in the long-term behaviour of the gyroid mesophase. This includes trying to generate a 'perfect' crystal. Even after 600,000 timesteps, differently orientated domains can be found in a 128^3 system. Even though the number of individual domains decreased substantially, we cannot find a 'perfect' gyroid. Instead, defects are still moving around, and it is of particular interest to study the exact behaviour of the defect movement. Since all generated gyroids are different, this is best done by gathering statistics within the data of each simulation, i.e. counting the number of defects per data set, the direction of their movement, their speed, their lifetime and other interesting parameters. Gathering statistics requires that large numbers of measurements be available, which is the reason for the large (512^3 and 1024^3) systems we simulated: only a large system contains a sufficient number of defects.
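The statistics-gathering step can be sketched as follows. The defect-detection stage itself is not shown, and the data layout (one list of defect positions per stored sample) is an assumption made for illustration.

```python
from statistics import mean

def defect_statistics(frames):
    """Given a time series of defect catalogues (one list of (x, y, z)
    defect positions per stored sample), return the mean defect count
    and the mean change in defect count between consecutive samples."""
    counts = [len(frame) for frame in frames]
    deltas = [b - a for a, b in zip(counts, counts[1:])]
    return mean(counts), (mean(deltas) if deltas else 0.0)
```

A negative mean delta indicates coarsening over the run: domains merging and defects annihilating, which is exactly the long-time behaviour of interest.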

Simulating a 512^3 or 1024^3 system raises some technical problems. First, the memory requirements exceed the amount of RAM available on most of the supercomputers we have access to, limiting us to two or three facilities. Second, achieving a sufficient simulation length requires a lot of CPU time: for example, running on 2048 CPUs of Lemieux we are only able to do about 100 timesteps of a 1024^3 system per hour (including I/O). Third, such large simulations generate data files of 4 GB each and checkpoints of 0.5 TB each. It is quite awkward to move such amounts of data around using the infrastructure available today. Due to the amount of manual intervention required and restrictions in machine availability, these huge simulations, probably the biggest lattice-Boltzmann simulations ever performed, are still ongoing.
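The storage figures quoted can be sanity-checked with simple arithmetic. The per-site state size below is an assumed figure, chosen only to reproduce the ~0.5 TB checkpoint size; the actual lattice-Boltzmann state layout is not specified here.

```python
def lattice_state_bytes(n, bytes_per_site):
    """Total state for an n^3 lattice at bytes_per_site of data per site."""
    return n ** 3 * bytes_per_site

# Assuming ~460 bytes of checkpoint state per lattice site (an
# illustrative figure), a 1024^3 system checkpoints at roughly 0.5 TB.
checkpoint_tb = lattice_state_bytes(1024, 460) / 1e12
```

The same arithmetic shows why only two or three facilities could hold the working set in RAM at all.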
For a 64^3 or 128^3 system it is very easy to run a simulation from scratch for 100,000 timesteps, but for a 1024^3 system this would require more than two million CPU hours, or 42 days on Lemieux. Obviously, it is very hard to justify the use of such an amount of resources for a single simulation. We therefore decided to scale up a smaller system after a gyroid had formed, by filling the 1024^3 lattice with 512 identical 128^3 systems. This was possible because we use periodic boundary conditions. In order to reduce the artefacts introduced by the periodic upscaling, we perturbed the system and let it evolve. We anticipate that the unphysical effects introduced by the upscaling process will decay after a comparably small number of timesteps, resulting in a system comparable to one started from a random mixture of fluids. This has to be justified by comparison with data from test runs performed on smaller systems.
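The upscaling trick, filling the large lattice with periodic copies of a small equilibrated one, can be demonstrated on a toy lattice. Nested lists are used here purely for illustration; the production code naturally operates on distributed arrays.

```python
def upscale(system, factor):
    """Tile a cubic lattice `factor` times along each axis. Because the
    small system has periodic boundary conditions, the tiled copies join
    seamlessly -- a toy analogue of filling the 1024^3 lattice with
    512 copies of a 128^3 gyroid."""
    n = len(system)
    m = factor * n
    return [[[system[x % n][y % n][z % n]
              for z in range(m)]
             for y in range(m)]
            for x in range(m)]
```

For instance, a 2^3 system upscaled by a factor of 2 gives a 4^3 lattice in which site (3, 3, 3) holds the same value as site (1, 1, 1) of the original.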
Another experiment, performed in order to speed up the generation of a well-established gyroid, and which is still ongoing, uses the influence of the surfactant temperature on the mobility of the surfactant particles. After differently oriented gyroid domains have formed, the system evolves only very slowly towards a perfect crystal. We therefore increase the surfactant temperature, which starts to break up the gyroid. After a short while, we push the temperature back to its original value, hoping that the system will anneal into a state containing fewer distinct domains. This is a typical application where computational steering is useful, since we can immediately see how the system behaves and use this information to adapt the temperature change.
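The temperature intervention amounts to a piecewise schedule that a steering client might apply. The function and parameter names below are invented for illustration; in practice the operator adjusts the kick interactively based on what the visualization shows.

```python
def surfactant_temperature(t, t_base, t_high, kick_start, kick_length):
    """Piecewise temperature schedule: run at t_base, raise to t_high for
    kick_length timesteps starting at kick_start (breaking up the gyroid),
    then return to t_base so the system can anneal into fewer domains."""
    if kick_start <= t < kick_start + kick_length:
        return t_high
    return t_base
```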
Other ongoing experiments relate to the stability of the gyroid mesophase: we are interested in the influence of perturbations on a gyroid. Using computational steering we are able to tune the strength of the perturbation and find out whether the crystal stays intact or breaks up. Similar experiments are being done by applying constant or oscillatory shear. Here we study the gyroid's stability as a function of shear rate, and expect to find evidence of the non-Newtonian properties of the fluid.
Figure 8 shows a cropped plane from a volume-rendered example dataset of the upscaled 1024^3 system. The periodic artefacts introduced by the upscaling can still be seen here.
The RealityGrid environment provides not only a very convenient way to handle jobs on HPC resources, but also naturally incorporates the use of high-performance visualization facilities; without this feature, computational steering would be of very limited use to us. Without the new platform-independent checkpointing and migration routines, we would not have been able to adapt our scientific requirements to the available compute resources as easily. We also benefitted substantially from the code optimizations and I/O improvements introduced by various people within the RealityGrid project.
None of the experiments we have done, or are still doing, would have been possible without the software developed within the RealityGrid project. Computational steering allowed us to monitor our simulations and to use the available resources as efficiently as possible. We rehearsed and staged several distributed, collaborative sessions between the geographically dispersed participating scientists, making use of Access Grid. These collaborators were based at UCL, Manchester, Boston University and BTexact (Ipswich), as well as Phoenix during SC03, and each was able to run and steer simulations. Indeed, Access Grid proved an indispensable tool for coordinating the project, having been used successfully to conduct both regular multi-site management meetings and small ad hoc conferences that might otherwise have required a telephone conference or a physical meeting. IRC channels proved very effective for user communication during assigned supercomputing slots before and during SC03.

Coordination/ Organisation
A proposal was developed by K. Roy and W. T. Hewitt (University of Manchester) for a joint experiment between the UK facilities at Daresbury Laboratory and CSAR at Manchester and centres participating in the Extensible Terascale Facility (ETF) in the USA. The aim of the proposal was to demonstrate the potential of the ETF by running a UK e-Science example on a combined grid at SC2003, using RealityGrid.
A number of comments were received on this proposal. They expressed the view that, for this experiment to be genuinely useful, it would be necessary to ensure that it was not just another transatlantic metacomputing demo for Supercomputing. Rather, it should be an experiment demonstrating the new opportunities available through the Grid, for example new scientific capabilities and international collaborative working, and it should seek to establish technical collaborations to support the development of national and international Grids.
Given the UK HEC focus on capability-class applications, the project would be successful if it undertook a simulation exploiting a community code running on the integrated resources of the ETF and UK facilities, which amount to some 6 TB of memory and some 5,000 processors. An application able to exploit this scale of computational resource would be well on the way to the class of application targeted at the early phase of an HPCy-class system. It is worth noting that world Grid capabilities are changing rapidly, with a doubling of HPCx performance (6 TFlop/s Linpack) next year and the installation of the ETF facilities: 10 TFlop/s at NCSA and 4 TFlop/s at SDSC.

Lessons Learned
Inevitably, the application and visualization must be ported to, and deployed on, each resource on the grid where they are to run. We do not find that Globus-based grids today do much to facilitate this. Instead, the increasing variety of resources that the grid makes accessible means that users and developers spend an ever-increasing fraction of their time porting applications to new systems and configuring them for the vagaries of each.
Considerable negotiation was necessary to persuade TeraGrid sites to recognise user and host certificates issued by the UK e-Science Certificate Authority (CA), and to persuade UK sites to recognise the five different CAs used within TeraGrid. In our case, agreement was facilitated by a sense of common purpose. In general, however, establishing the trust relationships necessary to federate two grids is a process that will scale poorly if every systems administrator has to study the CP and CPS of every CA; there is a clear role here for Policy Management Authorities (see http://www.gridpma.org).
It is normally the task of the RealityGrid migration wizard on the user's laptop to initiate the transfer of checkpoint files using the third-party file-transfer capability of GridFTP. This proved problematic when one of the systems involved was dual-homed, and was complicated by the fact that the systems were not accessible from everywhere via their MB-NG addresses. The problem would not have occurred if GridFTP provided the ability to specify different addresses for control and data channels. We worked around the problem by setting up a checkpoint transfer service on one of the dual-homed systems from which both networks were visible. This transfer service was aware of which hosts were dual-homed, which interface would provide the best bandwidth for each pair of endpoints, which host certificate to associate with an interface for Globus authentication, which options to globus-url-copy yielded the best performance, and which checkpoint replica was "closest" to the site where it was required. The transfer service was implemented as a Perl script and invoked remotely using globusrun, which introduces a significant start-up overhead.
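The routing knowledge embedded in the transfer service can be modelled as a lookup over endpoint pairs. The host and interface names below are invented for illustration (the actual service was a Perl script, as noted above).

```python
# Hypothetical routing table for a checkpoint transfer service: for each
# pair of endpoints, the interface expected to give the best bandwidth.
# A frozenset key makes the lookup symmetric in source and destination.
BEST_INTERFACE = {
    frozenset(["hpcx", "csar"]): "mb-ng",
    frozenset(["csar", "anl-viz"]): "transatlantic-1g",
}

def choose_interface(src, dst):
    """Pick the interface for a transfer between two endpoints, falling
    back to the routable production network when no entry exists."""
    return BEST_INTERFACE.get(frozenset([src, dst]), "production")
```

Layering knowledge like this in one service avoided teaching every client about dual-homed hosts, per-interface certificates, and globus-url-copy tuning.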
It is often difficult to identify whether a problem is caused by middleware or by the network. We are used to the symptoms caused by the presence of firewalls, and were generally able to resolve these without undue difficulty once the relevant administrators were identified. However, even more baffling problems arose. One, which manifested as an authentication failure, was ultimately traced to a router unable to process jumbo frames. Another was eventually traced to inconsistent results of reverse DNS lookups on one of the TeraGrid systems. A third, manifesting as grossly asymmetric performance between FTP get and put operations, was never completely resolved, but MTU settings appear to have been implicated.
The maximum performance we achieved for trans-Atlantic file transfers during SC'03 was about 600 Mbit/s, a respectable fraction of the theoretical upper limit of 1 Gbit/s. But for some pairs of endpoints, the best we could achieve was a good deal less than this, for a variety of reasons. UDP tests between UCL and Phoenix yielded over 95% of the theoretical peak, which suggests that there is still room for improvement in end-equipment configuration (TCP/IP stacks, disk I/O) and in GridFTP.
In order for the simulation to exchange data with the visualization, at least one process of the simulation must establish a connection to the visualization system. This was not possible on HPCx, where the back-end compute nodes are confined to a private network. On other cluster systems, such as Lemieux at PSC or Newton at CSAR, only some nodes have direct connections to the internet. In the latter case it is possible, by arrangement with the system administrator, to pin one process of the application to a node with internet connectivity. In the former case, the only option is to forward internet connections via the front-end machine or head node. On HPCx, the forwarding of connections was carried out using bespoke software written by Stephen Booth at EPCC.
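The core of such a user-space forwarder is a loop relaying bytes between two connections. The sketch below is a generic stand-in for this technique, not the actual EPCC software; in a real forwarder two such loops run concurrently, one per direction.

```python
def pump(src, dst, bufsize=65536):
    """Copy bytes from one socket-like object to another until EOF.
    Running two of these loops (one per direction, e.g. in threads) on a
    head node relays traffic between a private back-end compute node and
    an external visualization host."""
    while True:
        chunk = src.recv(bufsize)
        if not chunk:          # empty read signals EOF: stop relaying
            break
        dst.sendall(chunk)
```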

Figure 1 :
Figure 1: Access Grid screen as seen in Phoenix during the SC Global session on application steering.

Figure 3 :
Figure 3: A volume-rendered dataset of a 128^3 system after 100,000 simulation timesteps. Multiple gyroid domains have formed, and the close-up shows the extremely regular, crystalline gyroid structure within a domain.

• Pause, resume, detach and stop
• Set values of steerable parameters
• Report values of monitored (read-only) parameters
• Emit "samples" to remote systems for e.g. on-line visualization
• Consume "samples" from remote systems
• Checkpoint and windback
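The operations above can be summarised in a minimal steering interface. The class and method names below are invented for illustration and are not the actual RealityGrid steering API.

```python
class SteeredSimulation:
    """Toy model of a steerable simulation supporting the operations
    listed above (names are illustrative only)."""

    def __init__(self, params):
        self.params = dict(params)   # steerable parameters
        self.running = True
        self._checkpoints = []

    def pause(self):
        self.running = False

    def resume(self):
        self.running = True

    def set_param(self, name, value):
        self.params[name] = value    # steer: change a parameter in flight

    def monitor(self, name):
        return self.params[name]     # report a (read-only) monitored value

    def checkpoint(self):
        self._checkpoints.append(dict(self.params))

    def windback(self):
        # restore the most recent checkpoint, discarding later changes
        self.params = self._checkpoints.pop()
```

For example, a client could checkpoint, raise the surfactant temperature, watch the system break up, and wind back if the kick was too strong.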

Figure 4 .
Figure 4. Parameter space exploration gives rise to a tree of checkpoints.

Table 1 .
We also used a number of visualization systems including SGI Onyx systems at Manchester, UCL, NCSA and Phoenix (on loan from SGI and commissioned on site), and the TeraGrid visualization cluster at ANL.

Lesson: With a longer time to plan the project (formal go-ahead on an indicative plan was only granted in early September), the project could have developed more precise estimates of resources and a more relaxed schedule of slots for development work, with more warning. Whilst the experiment was disruptive to users, neither of the UK sites received any complaints from users during the project.
The indicative plan (with estimated resource requirements) to undertake a scientific experiment on the gyroid phases of ternary fluids on the integrated UK-TeraGrid high-end computing resource was discussed with Rick Stephens and Peter Beckmann from the TeraGrid project at the 'All Hands' meeting in Leicester in early September 2003. It is a small miracle that an award-winning demonstration was undertaken some two months later at Supercomputing in November 2003. The following diary illustrates some of the activity:
- September 2003: first joint meeting of UK and US sites via Access Grid. Discussed background, potential scenarios and requirements, networking, Globus infrastructure, securing ids on TeraGrid facilities and application porting. Established peer-group interactions.
- 1 October 2003: review of UK progress; discussion of where to visualize the largest datasets and of the size of calculations that could be run on each facility. Scientific programme of runs to be developed and placed on the wiki. Detailed resource request plan and schedule of outages to be prepared. Draft flier and advance publicity to be prepared.
- 10 October 2003: progress review; BT to provide network to NetherLight by 31 October. Acceptance of UK certificates by TeraGrid required. Resource requests submitted to EPSRC and TeraGrid sites. DNS names of all systems required. Need to identify which visualization servers could cope with the largest datasets. Visibility of production nodes to the outside world highlighted as an issue at PSC and HPCx.
- 20 October 2003: scientific project plan published. Problems in porting the 64-bit Globus version for AIX. Resources and schedule of test slots agreed.
1 Gpoint simulation not able to fit into HPCx memory.
- 31 October: application built on all systems; visualization for 768^3 demonstrated on Chromium and SGIs. Flier completed. Network in place and being tested.
- 30 October, 4 November, 6 November, 11 November, 13 November, 15 November: six 6-hour development slots on all facilities.
- 18-21 November: experiment and demonstration at SC'03.

All of these meetings were conducted via Access Grid, with 20+ participants at many of them. The meetings could more sensibly have been split into separate political and technical sessions, but time did not permit. The split into project teams worked well, with the team leaders progressing actions vigorously between meetings. Explicit publications from the project include: "The TeraGyroid Experiment", S. M. Pickles, R. J. Blake, B. M. Boghosian, J. M. Brooke, J. Chin, P. E. L. Clarke, P. V. Coveney, R. Haines, J. Harting, M. Harvey, S. Jha, M. A. S. Jones, M. McKeown, R. L. Pinning, A. R. Porter, K. Roy, and M. Riding.