LATENCY ON DISTRIBUTED MEMORY MULTIPROCESSORS

This paper describes many of the issues in developing an efficient interface for communication on distributed memory machines and proposes a portable interface. Although the hardware component of message latency is less than one microsecond on many distributed memory machines, the software latency associated with sending and receiving typed messages is on the order of 50 microseconds. The reason for this imbalance is that the software interface does not match the hardware. By changing the interface to match the hardware more closely, applications with fine grained communication can be run on these machines. Based on several tests that we have run on the iPSC/860, we propose an interface that will better match current distributed memory machines. The model used in the proposed interface consists of a computation processor and a communication processor on each node. Communication between these processors and other nodes in the system is done through a buffered network. The information transmitted is either data or procedures to be executed on the remote processor. The dual processor system is better suited than a single processor system for efficiently handling asynchronous communications. The ability to send either data or procedure invocations provides flexibility for minimizing message latency, depending on the type of communication being performed. This paper describes the tests performed and the resulting proposed interface.

Thus, a new interface should be designed to take advantage of the hardware typically found in newer machines and to allow application specific knowledge to be used to exploit this hardware more efficiently. This can be done by making the interface look more like the underlying hardware, which in turn gives the user more control of the hardware.
An extremely low level interface like this will probably not be of interest to a broad range of users because of the added complexity it will have when compared to the simple semantics of sends and receives. However, there are many users that will be able to take advantage of it.
One reason that Active Messages are more efficient than sends and receives is that an active message, upon arrival at a processor, is processed immediately by a specified routine that was designed explicitly for that message. Therefore, the operating system has very little to do with the message and latencies can be controlled by the specified routine.
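The direct dispatch just described can be sketched in C as follows. The packet layout, handler table, and names (am_packet, am_deliver) are our own illustration of the idea, not the Active Messages implementation itself.

```c
/* Sketch of active-message dispatch: each message names a routine that
 * runs immediately on arrival, so the operating system is not involved.
 * The packet format and handler names are hypothetical. */
#include <assert.h>

typedef struct {
    int handler;        /* index of the routine to run on arrival */
    double payload;     /* single-word message data */
} am_packet;

typedef void (*am_handler)(double payload, double *state);

static void am_accumulate(double x, double *state) { *state += x; }
static void am_overwrite(double x, double *state)  { *state = x; }

static am_handler handler_table[] = { am_accumulate, am_overwrite };

/* Called when a packet arrives: dispatch directly, no buffering. */
void am_deliver(const am_packet *p, double *state) {
    handler_table[p->handler](p->payload, state);
}
```

The point of the sketch is that delivery is a single indirect call; any checking or buffering beyond this adds to the software latency.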
Although Active Messages will probably operate more efficiently than the send receive model, it is not clear that there is not a more efficient model to use. The overhead required to select, verify, and call the correct routine, along with the effects of interrupting the processor and the cache, will probably be considerably more than the hardware latency, a time that we would like to match with the software latency. An example of where such fine grained communication would be required is a pipeline algorithm that will be described in more detail in the next section. In this algorithm, a message consists of a single floating point value and the number of operations between communications is small.
The interrupt logic on the 860 chip, used to handle asynchronous communication events, is a very large component of the message latency. It should be noted, however, that in the test that is usually performed to measure latency, bouncing messages between neighboring nodes, interrupts will not occur. In this test each of two nodes repeatedly waits for an incoming message and then immediately sends a message to the other processor. When a processor waits for a message to arrive, the interrupt logic is turned off and the processor sits in a very tight loop waiting for the status register to change before pulling the message from the input buffer. In a more realistic situation where a message arrives before it is needed, causing an interrupt, the message latency can increase from 70 us to 130 us. (The higher time was measured when each process would wait for a message by continuously executing the probe function until a message arrived.)
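The tight polling loop used in the bounce test might look like the following sketch, where the status register and input buffer are simulated in memory; net_port, net_deposit, and poll_receive are hypothetical names, not part of the NX/2 interface.

```c
/* Sketch of polled receive: instead of taking an interrupt, the processor
 * spins on a status register until a message is flagged, then pulls the
 * message from the input buffer. Here the "hardware" is simulated. */
#include <assert.h>

typedef struct {
    volatile int status;   /* nonzero once a message has arrived */
    double input_buffer;   /* single-word message, as in the pipeline test */
} net_port;

/* Simulated hardware: deposit a message and raise the status flag. */
void net_deposit(net_port *port, double value) {
    port->input_buffer = value;
    port->status = 1;
}

/* The tight loop; interrupts stay disabled while we wait. */
double poll_receive(net_port *port) {
    while (port->status == 0)
        ;                      /* spin on the status register */
    port->status = 0;
    return port->input_buffer;
}
```

Polling avoids the cost of saving and restoring processor state, which is why the bounce test understates the latency seen when messages arrive asynchronously.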
Interrupts are so expensive on the 860 chip mostly because of the large state of the processor. This includes 32 floating point registers, 32 integer registers, an add, multiply, and load pipeline, and fairly complex instruction modes that all must be saved and reconstructed before resuming normal processing. The result is that it takes on the order of 1000 instructions to handle an interrupt. This does not include any of the time to process a message.
Other sources of increased latency include the time to do a trap into the operating system and the effects on the cache of messages asynchronously arriving at a node. The time to execute an operating system call, although not as severe as a communication interrupt, is roughly 50 instructions. We have not measured the cost that an interrupt will have on the instruction and data caches, but we expect that it would be substantially more than the hardware latency.
The operating system on the iPSC/860 (NX/2) controls the communication hardware and interrupt mechanisms. The communication model is based on sending and receiving contiguous blocks of typed messages. NX/2 must handle the general case of having any message of any size arrive at any time without the operating system crashing. This requires a complex system that handles buffer management, handshake protocols, interrupts, security, and other issues. The operating system, although quite complex, handles this general case very well.
Another aspect of the operating system that adds to the latency of communication is that the operating system uses the same communication network as the applications. Because of this, NX/2 must be reliable and secure; this requirement adds to the overall message latency. Outgoing messages must be checked for valid addresses, and the processor must assume that any system message can arrive at any time and must be handled properly.
Using the same hardware for both the operating system and user applications, without hardware support for message security, is a problem that will make the goal of reducing the latency to a few instructions very difficult if not impossible. The only good solution for these problems is to handle them in hardware. It should be noted that the CM-5 is one machine that does not have these problems. We will not consider this problem any further and will assume that a single user has control of the entire machine.
The software latency associated with sending messages is primarily due to a complex general purpose operating system that can handle any situation. However, most of the time an application does not need the power of a general operating system. Many times the application writer has specific information that can be used to take advantage of the hardware. Examples of this include the knowledge of how large a buffer is required, that only a single type of message will be sent, that the data from a message will always be placed in a specific location, etc. Thus, we want to give the user more control of the hardware to take care of these special cases. This is done at the risk of adding complexity to the send receive model, but we feel that this added complexity will be worth the added potential for writing efficient programs. In order to test this idea we modified the NX/2 operating system to give the user more control of the hardware than is possible using the Intel primitives. In the modified version the bulk of the time spent was in the trap handler saving and restoring the processor state. This final program is much more complicated than the other programs because of the complex nature of how the processes synchronize. In the other programs the synchronization is done based on the FIFOs, while the synchronization involved in the last program is similar to that found in shared memory systems. Although it was not measured, we also expect that the time required to handle the synchronization is significantly more than that in the other programs.
In summary, these tests imply several characteristics for any proposed communication system. First, interrupts are very expensive and should be avoided. This will become more important as the size of the processor state that must be saved and restored grows with the complexity of the microprocessors used. Second, the operating system should be minimally involved with communications. As seen above, where communications can be generated with one or two assembly instructions, any interaction with the operating system will be very expensive. This removal of the operating system implies that issues typically handled by the underlying system, such as security and IO, should be handled elsewhere. Finally, many numerical algorithms have a repetitive communication pattern, such as a pipeline, and the ability to set up the communication paths once and then reuse them can make very efficient use of the hardware.

5 Implementing a Sparse Matrix Solve

In this section we discuss some of the issues in implementing a more complex, fine grained algorithm. To implement these types of problems, it is critical to have the ability for one processor to affect the memory of another processor with a minimum of effort.
In the above example this requires the ability to fetch data from a processor once a specific condition has been met on that processor. In this model both processors are general purpose processors, but in reality the communication processor could be a custom processor for handling communications.
An interesting problem with using two processors as opposed to one is the effect of the cache. If a single processor handled both communication and computation and the data to be sent were residing in the cache at the send, then there would be no traffic between the memory and the processor; the processor would send the data directly to the network. If two processors are used then the data will have to first be flushed from the main processor's cache so that the second processor can use it. For small messages it will probably be more efficient to have the computation processor send the data.
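The dual-processor model can be approximated with two threads and a shared one-slot mailbox. This is a minimal sketch assuming shared memory between the two processors; the names (mailbox, post, take, run_demo) are ours and not part of any machine's interface.

```c
/* Sketch of the computation/communication split: the computation thread
 * posts messages into a one-slot mailbox, and a separate "communication
 * processor" thread drains them without interrupting the computation. */
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int    full;        /* 1 while the slot holds an unsent message */
    double slot;        /* the message handed to the comm processor */
} mailbox;

static void mailbox_init(mailbox *mb) {
    pthread_mutex_init(&mb->lock, NULL);
    pthread_cond_init(&mb->cond, NULL);
    mb->full = 0;
}

static void post(mailbox *mb, double v) {     /* computation side */
    pthread_mutex_lock(&mb->lock);
    while (mb->full) pthread_cond_wait(&mb->cond, &mb->lock);
    mb->slot = v;
    mb->full = 1;
    pthread_cond_signal(&mb->cond);
    pthread_mutex_unlock(&mb->lock);
}

static double take(mailbox *mb) {             /* communication side */
    pthread_mutex_lock(&mb->lock);
    while (!mb->full) pthread_cond_wait(&mb->cond, &mb->lock);
    double v = mb->slot;
    mb->full = 0;
    pthread_cond_signal(&mb->cond);
    pthread_mutex_unlock(&mb->lock);
    return v;
}

typedef struct { mailbox *mb; int n; double sum; } comm_arg;

static void *comm_main(void *p) {   /* stands in for sending to the network */
    comm_arg *a = p;
    for (int i = 0; i < a->n; i++)
        a->sum += take(a->mb);
    return NULL;
}

double run_demo(void) {
    mailbox mb;
    mailbox_init(&mb);
    comm_arg a = { &mb, 3, 0.0 };
    pthread_t t;
    pthread_create(&t, NULL, comm_main, &a);
    post(&mb, 1.0);                 /* computation hands off three messages */
    post(&mb, 2.0);
    post(&mb, 3.0);
    pthread_join(&t, NULL);
    return a.sum;
}
```

Note that the hand-off through shared memory is exactly where the cache traffic discussed above appears: the posted value must leave the computation thread's cache before the other side can use it.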
The processors communicate with other nodes by receiving information from the network using the input buffer or by sending information.

    source  ::= internal_buffer | network_buffer
    is_more ::= boolean

The send routines have three parameters. The first describes the destination of the information, the second describes the information being sent, and the third is used to chain together several sends so as to send multiple packets of information separated over space or time. The destination parameter can indicate that the information is sent to a buffer on one or more nodes, to the internal buffer on the present node connected to the other processor, or to the same location that a previous chain of messages has been sent to. For example, data can be sent directly to another node (corresponding to a csend on the iPSC/860), or a procedure can be sent to the communication processor that will in turn send the data (corresponding to an isend on the iPSC/860). The final parameter is a boolean variable that indicates whether this message corresponds to a chain that will continue at a later time. This is essentially a mechanism to allow a channel to be opened between the current processor and another processor (or set of processors). The ability to create a channel and send multiple packets of information through it has two advantages. The first is that the cost of setting up the channel can be spread out over multiple messages, and the second is that it can be used to ensure that multiple messages will arrive without interruption by messages from other nodes.
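A sketch of such a send routine follows, with the network simulated as a packet log so the chaining behavior is visible. The names (send_data, DEST_SAME, is_more) are illustrative assumptions, not the actual proposed syntax.

```c
/* Sketch of the three-parameter send: a destination, the information, and
 * an is_more flag that keeps a channel open so the next send can reuse it. */
#include <assert.h>

enum dest_kind { DEST_NODE, DEST_INTERNAL, DEST_SAME };  /* SAME = reuse channel */

typedef struct { enum dest_kind kind; int node; } destination;

#define MAX_PKTS 16
typedef struct {
    int    count;
    int    node[MAX_PKTS];    /* where each packet went */
    double data[MAX_PKTS];
    int    open;              /* 1 while a chain holds the channel */
    int    open_node;
} sim_network;

void send_data(sim_network *net, destination d, double value, int is_more) {
    int node = (d.kind == DEST_SAME) ? net->open_node : d.node;
    net->node[net->count] = node;
    net->data[net->count] = value;
    net->count++;
    net->open = is_more;      /* channel stays open for the next packet */
    net->open_node = node;
}
```

As a usage sketch, opening a channel to a node with is_more set and then sending with DEST_SAME amortizes the channel setup over the whole chain, which is the first advantage noted above.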
A similar mechanism exists in hardware on the iWarp machine, developed at Intel. In the iWarp, a channel can be created between two nodes in the machine and data can then be sent through the channel with very low latency.
The receive routines take two parameters. The first describes where the information will come from and is either the network buffer or the internal buffer. The second describes the type of information that is to be received. This can be either data or a procedure. If it is the former, then a pointer to where the data should be stored, along with the number of bytes to be read, is also included in the parameter. If the information is a procedure then there is no additional information.
In this case the receive call will not return until after the procedure has been received. Note that procedure messages can have parameters following immediately after the message using the chaining facility described in section 6.
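The two receive types might be sketched as follows, assuming a hypothetical packet format; receive, INFO_DATA, and INFO_PROC are illustrative names rather than the proposed interface itself.

```c
/* Sketch of the two receive types: a data packet is copied to a
 * caller-supplied address; a procedure packet carries a routine that is
 * executed on this processor before the receive returns. */
#include <assert.h>
#include <string.h>

enum info_kind { INFO_DATA, INFO_PROC };

typedef void (*remote_proc)(void);

typedef struct {
    enum info_kind kind;
    const void *data;      /* valid when kind == INFO_DATA */
    int nbytes;            /* number of bytes to copy */
    remote_proc proc;      /* valid when kind == INFO_PROC */
} packet;

static int proc_ran = 0;
static void mark(void) { proc_ran = 1; }   /* example remote procedure */

/* Receive: either copy nbytes to dst, or run the received procedure. */
void receive(const packet *p, void *dst) {
    if (p->kind == INFO_DATA)
        memcpy(dst, p->data, p->nbytes);
    else
        p->proc();         /* does not return until the procedure has run */
}
```

In a real system the procedure would arrive over the network rather than as a local function pointer; the sketch only shows the control flow of the two cases.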
The final type of routine is used to see if there is information of the correct type in one of the two buffers.
There are two types of parameters in this call. The first describes which buffer is to be examined and the second describes the type of data that is expected. The output of this routine describes the state of the specified buffer to be one of the following: there is information of the correct type, there is no information, or there is information but it is of the wrong type.
It would apparently be more efficient to implement this system using a preprocessor that would translate the calls into more precise code.
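The three-valued result described above can be sketched directly; the enum and function names here are illustrative assumptions.

```c
/* Sketch of the probe routine: report whether the selected buffer holds
 * information of the expected type, no information, or the wrong type. */
#include <assert.h>

enum msg_type     { MSG_NONE, MSG_DATA, MSG_PROC };
enum probe_result { PROBE_MATCH, PROBE_EMPTY, PROBE_WRONG_TYPE };

typedef struct { enum msg_type held; } buffer;   /* simulated buffer state */

enum probe_result probe(const buffer *b, enum msg_type expected) {
    if (b->held == MSG_NONE)  return PROBE_EMPTY;
    if (b->held == expected)  return PROBE_MATCH;
    return PROBE_WRONG_TYPE;
}
```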
The second issue concerning efficiency is security. On machines like the Intel Paragon and the Intel iPSC/860, the same network that is used for applications is used for operating system calls. Therefore, in order to ensure the integrity of the operating system, the application must first make a trap into the operating system before using the communication hardware.
There are three potential solutions to this problem. The first is to ignore this problem when an application has sole use of the machine. In this case, if the operating system is corrupted it will have to be rebooted.
The second is a more general solution and simply requires the trap handler to be reduced to minimize the overhead associated with traps. The current overhead of doing a trap on the iPSC/860 is roughly 50 instructions. This could possibly be reduced to 15 or 20. This would still be too large to send a large number of single words at a time. Therefore, to send noncontiguous data, there would also need to be commands to send multiple blocks of contiguous data and data separated by a fixed stride.
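A strided send could be built from a small packing helper like the following sketch, so a single trap covers many words; pack_strided is a hypothetical name.

```c
/* Sketch of a strided-send helper: gather `count` blocks of `blocklen`
 * doubles, each `stride` doubles apart, into one contiguous packet so a
 * single trap (and a single send) covers all of them. */
#include <assert.h>
#include <string.h>

void pack_strided(const double *src, int count, int blocklen, int stride,
                  double *packet) {
    for (int i = 0; i < count; i++)
        memcpy(packet + i * blocklen,      /* next slot in the packet */
               src + i * stride,           /* next block in memory */
               blocklen * sizeof(double));
}
```

For example, packing a column of a row-major matrix is the case count = rows, blocklen = 1, stride = row length.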
Another problem with implementing this interface efficiently has to do with the types of networks found on some distributed memory machines.
In order to transmit large blocks of data with a high bandwidth, the data must arrive in the order in which it is sent. On the CM-5 this is not guaranteed in hardware and must therefore be handled in software. This has two consequences. The first is that the data must be handled an extra time, which uses memory bandwidth, and the second is that the data can not be left in the buffer until it is needed. This also causes extra memory traffic.
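The software reordering described above can be sketched as follows, assuming each packet carries a sequence number; the names are illustrative. The extra copy into the message buffer is exactly the added memory traffic noted in the text.

```c
/* Sketch of software reordering for a network (like the CM-5's) that does
 * not guarantee in-order delivery: each packet carries a sequence number
 * and is copied into its slot in the message buffer on arrival. */
#include <assert.h>

typedef struct { int seq; double word; } seq_packet;

/* Place each arriving packet at position `seq` in the message buffer. */
void reassemble(const seq_packet *arrivals, int n, double *message) {
    for (int i = 0; i < n; i++)
        message[arrivals[i].seq] = arrivals[i].word;
}
```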
The final issue is implementing this interface on a system that has no form of communication processor. In this case the second processor must be emulated on the first processor using interrupts. Assuming that interrupts are not very expensive, this would be a valid solution.
However, on a machine such as the iPSC/860, where interrupts are very expensive, this would not be practical.
The application should not have to trap into the operating system when sending and receiving data. This implies that there must be hardware support for isolating users from each other. Further, it is important that messages can be processed without interrupting the computation processor. This suggests that some form of processor be used for handling communication. Finally, there must be relatively fast synchronization between the communication system and the computation processor. Based on this hardware model we have defined a low level interface that more closely approximates the hardware found in many distributed memory machines.