A Migration Method of MPI Program Combining Local Library Replacement and Instruction Translation

Binary translation is a main method for solving software compatibility among different instruction set architectures (ISAs), yet binary translators mainly deal with serial programs rather than parallel programs. We propose a hybrid method combining local library replacement and instruction translation, based on a formal model built to describe equivalence when migrating MPI programs between different clusters. The shared code in an MPI program (MPI library function calls) is handled by executing local libraries, and the other parts are handled by dynamic binary translation. In addition, to handle local library functions, we propose a method of program flow redirection built on two algorithms together with hierarchical encapsulation of local libraries. A framework called MPI-QEMU is designed to migrate 64-bit MPI programs from the X86-64/Linux platform to the domestic SW platform, and it is verified by experiment.


Introduction
Dating back to the 1960s, binary translation technology [1][2][3] was developed as a tool for the equivalent migration of programs. Binary translation is defined as an equivalent migration process of an instruction sequence from one machine to another. It can be divided into three types according to the way of translation [4]: interpretive execution, static translation, and dynamic translation [5]. Interpretive execution simulates instructions one by one; static translation applies the mode of "executing after translating"; dynamic translation applies the mode of "executing while translating." The advent of multicore processors enabled the combination of parallel processing and binary translation to further improve the efficiency of program migration [6]. For example, PQEMU [7] uses multithreading mechanisms for translation; HQEMU [8] runs the binary translator and the dynamic optimizer on separate cores of a multicore platform, so as to coordinate the translator with an LLVM-based optimizer. As new ISA computer systems attach increasing importance to high-performance applications, binary translation has been applied to parallel programs and CUDA programs [9]. The work [10] proposes a dynamic binary translation framework, Ocelot, based on MCUDA [11], which implements the dynamic translation of CUDA programs to a multicore CPU platform. The work [12] points out that related research lacks implicit synchronization processing when translating CUDA programs to a multicore platform and puts forward an implicit synchronization detection method based on dependency analysis. The author of [13] proposes a divide-and-conquer method and uses static translation to migrate CUDA binary programs to the domestic SW platform [14].
MPI (Message Passing Interface) [15] programs account for a large share of parallel programs, so their successful migration has important demonstration and guidance significance. However, little related research has been carried out at home or abroad, and there are almost no research results that can be used as a reference. As a result, the migration of MPI programs faces several difficulties and challenges. The first is how to properly deploy the MPI program binary translator. A traditional binary translator processes a serial program whose runtime environment is a standalone machine, so the translator is deployed on a standalone machine. An MPI program, by contrast, is a parallel program whose runtime environment is a cluster; running it requires a copy of the MPI program on each node of the cluster, which in turn requires a binary translator on each node. Therefore, as a whole, multiple MPI program binary translators work at the same time, and each translator independently translates the MPI program on its own node. The second challenge is how to ensure synchronization and consistency among the MPI processes during dynamic translation. The translator is only in charge of loading and dynamically translating the MPI program on its node; it does not need to interact with the MPI program or with the translators on other nodes. Therefore, the translator itself does not have to be an MPI program. Although encapsulating the translator into an MPI process is also a possible solution, there is no need to do so considering the additional encapsulation and communication cost. Indeed, it is the MPI processes, translated by the translators on different nodes, that perform the real message interaction. Interaction among MPI processes is implemented by calling MPI library functions. If instruction translation is applied to the MPI library functions, the dynamic generation of the translated code makes it very difficult to keep the communication among the MPI processes synchronous and consistent in time and space. In addition, when the MPI program to be translated contains a large number of MPI library function calls, translation may become inefficient because the same library functions are translated repeatedly. Thus, the processing of MPI library functions is the key to the migration of MPI programs and is also the focus of this paper.
In view of this, this paper studies passive MPI program migration. First, it puts forward a formal representation of the binary translation of MPI programs, based on which an MPI program migration model, MPI-QEMU, is designed, leading to a migration method that combines code-sharing local library replacement with instruction translation. The method combines dynamic instant translation with the message interaction protocol of the processes, so that message interaction function calls are intercepted and redirected during dynamic translation. It also shields the interaction details among MPI processes and handles the MPI library transparently.
Aiming at the dynamic binary translation of MPI programs, the main contributions of this paper are as follows:
(1) This paper proposes a formal representation method for MPI program binary translation, which provides theoretical support for the equivalent migration of MPI programs.
(2) A migration method of MPI program combining local library replacement and instruction translation is designed to solve the migration of MPI programs.
(3) This paper proposes a method of program flow redirection during dynamic translation, to implement the local replacement of library function calls.
(4) A project called MPI-QEMU is designed, which migrates MPI binary applications from the X86 platform to the domestic SW platform and thereby expands, to a certain extent, the scope of software supported by a domestic supercomputer.
Section 2 of this paper gives the formal representation of MPI program binary translation; Section 3 introduces a migration method of MPI program combining local library replacement and instruction translation; Section 4 expounds the implementation process of local library replacement; Section 5 shows the experimental data. In this paper, the installation platform of the binary translation system is called the local platform; the platform to be simulated is called the source platform; the program to be translated is called the source program and is a binary executable file; the translated program is called the local program. In addition, the domestic SW (ShenWei) processor mentioned in this paper is an independently developed Chinese processor that has been used in a Chinese supercomputer system. On November 14, 2016, the Chinese Sunway TaihuLight supercomputer using the SW processor took first place in the new ranking published by the site https://Top500.org in Salt Lake City, USA. The approach proposed in this paper requires Linux as the OS and is not OS-independent.

The Formal Representation of the MPI Program Migration
MPI is a message interaction interface for parallel processes that is widely used on distributed memory architectures. In fact, it is a standard specification of a message passing function library, defining the interface library in a language-independent form. It supports a variety of programming languages such as C, C++, and Fortran, with different implementations on multiple platforms; predefined message operation functions are used to send and receive information and implement the parallel processing of tasks. The migration of an MPI program means making the MPI program run on clusters of different architectures while keeping its semantics consistent. In order to better understand the migration of the MPI program and find a solution, this section uses a formal method. First, the running of an MPI program on a single cluster is formalized, introducing a quintuple that represents the cluster and an instruction sequence interpretation function that represents a program running on the cluster. Second, according to the computable equivalence theory of Turing machines, the mapping of a program among clusters is formalized, that is, the equivalence of a program running on different clusters. Through the mapping from source platform instructions to local platform instructions, a binary translator prototype for general program migration among clusters is represented. Then, the composition of the MPI program is formalized, based on which the binary translator is improved. Finally, the method of MPI program migration is derived.

The Program Running on a Cluster. The MPI program mainly runs in a cluster environment, which usually consists of multiple computing nodes; the abstract representation of a cluster needs to consider the machine state of multiple nodes. In order to share information between processes on different computing nodes, data must be sent, received, or broadcast through the network; the interaction between nodes is the fundamental characteristic that distinguishes a cluster from a group of independent computers. The cluster $C$ composed of $m$ nodes can be represented by the following quintuple:
$$C = (S_C, I_C, \Theta_C, \Gamma, \Phi) \tag{1}$$
In the formula, the interactions between the nodes of the cluster are implemented by the elements of the set $\Gamma$, and messages are sent or received through the elements of $\Gamma$; $\Phi : S_C \times \Gamma \to S_C$ represents the interpretation function of the cluster for a message interaction in a given state.
Since all $\gamma \in \Gamma$ operations of the cluster can be simulated by $I_C$ sequences, the representation of a cluster can be simplified. Let $C = (S, I, \Theta)$ represent the cluster composed of $m$ computing nodes, in which $S = S_1 \times S_2 \times \cdots \times S_m$ represents the set of cluster states; $I = I_1 \times I_2 \times \cdots \times I_m$ represents the set of instruction vectors of the cluster; and $\Theta = \theta_1 \times \theta_2 \times \cdots \times \theta_m : I \times S \to S$ represents the interpretation function with which the cluster executes an instruction vector.
When the cluster executes a program, the continuous instruction sequence to be executed can be represented as follows. Let $t = \langle I_i, I_{i+1}, \ldots, I_{i+n} \rangle$ represent an ordered sequence of instruction vectors, $t \in I^*$. The execution of $t$ is represented by a function $\Theta^*$ that applies the interpretation function $\Theta$ to the instruction vectors of $t$ in order, starting from the current cluster state; the function $\Theta^*$ thus represents the running of the program in the cluster.
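The simplified cluster model can be made concrete with a small sketch. The Python fragment below is illustrative only: the node state (a single integer), the toy instructions `inc` and `nop`, and the helper names `theta`, `Theta`, and `Theta_star` are assumptions introduced here and are not part of the paper's model. It shows the cluster interpretation function applied node by node, and its extension to a sequence of instruction vectors as a left fold.

```python
from functools import reduce
from typing import Sequence, Tuple

State = Tuple[int, ...]        # cluster state: one integer per node (toy assumption)
InstrVec = Tuple[str, ...]     # instruction vector: one instruction per node


def theta(instr: str, s: int) -> int:
    """Toy per-node interpretation function: 'inc' adds 1, 'nop' does nothing."""
    if instr == "inc":
        return s + 1
    if instr == "nop":
        return s
    raise ValueError(f"unknown instruction: {instr}")


def Theta(vec: InstrVec, state: State) -> State:
    """Cluster interpretation: apply each node's instruction to that node's state."""
    if len(vec) != len(state):
        raise ValueError("instruction vector and state must cover the same nodes")
    return tuple(theta(i, s) for i, s in zip(vec, state))


def Theta_star(t: Sequence[InstrVec], state: State) -> State:
    """Run a sequence of instruction vectors in order (left fold over Theta)."""
    return reduce(lambda s, vec: Theta(vec, s), t, state)


# Two nodes, three instruction vectors.
t = [("inc", "nop"), ("inc", "inc"), ("nop", "inc")]
final_state = Theta_star(t, (0, 0))
```

An empty sequence leaves the state unchanged, which matches reading the sequence interpretation as the repeated application of the single-step function.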
Figure 1: Formal representation of migrating general programs between clusters.
As shown by the dotted box of Figure 1, the transformation from $t$ to $\Omega_I(t)$ is actually an instruction translation process, which outlines a binary translator prototype for general program migration among clusters.

The Composition of the MPI Program.
The foregoing section introduces the migration of general programs among clusters. However, the special nature of the MPI program means that instruction translation alone cannot simply be taken as the migration method for MPI programs.
An MPI program running on a cluster contains many interaction operations. Each interaction operation corresponds to an MPI function of the platform; these operations are limited in number but are called frequently. Since MPI is an open, cross-platform standard, these interaction operations are shared. Because the sequence $w$ and the sequence $\Omega_{rep}(w)$ are compiled from the same shared code, they have the same program semantics and function.
Therefore, in the migration of the MPI program from the source platform $C_s$ to the local platform $C_l$, the translation replacement function $\Omega_{rep}(w)$ can be used for the $W_{share}$ part instead of instruction translation.
$W_s$ represents the program of the source platform and consists of $W_{share}$ and $W_{private}$, which can be processed by different strategies, so as to derive the migration method of the MPI program.
When the program of the source platform is simulated on the local platform $C_l$, the $\Omega_I$ function is used directly for the $W_{private}$ part to translate the source platform instructions into local platform instructions; for the $W_{share}$ part, the local library provided by the local platform is used, that is, it is implemented through the function $\Omega_{rep}$. On the local platform, some additional processing is needed to ensure data consistency during platform switching. $I_{prologue}$ represents the operation of switching from the local platform $C_l$ to the source platform; $I_{epilogue}$ represents the operation of switching back to $C_l$ after the simulated execution of the corresponding code. These two instruction sequences are typically used to store the register state to memory, pass parameters to the mapped instructions, and preread register values and memory information; each usually contains only a few instructions, but they are called frequently in dynamic binary translation. The above model makes it possible to migrate the MPI program.
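The two strategies can be illustrated with a minimal sketch. Everything named here is hypothetical: the piece kinds `"share"` and `"private"`, the stand-in functions `omega_i` and `omega_rep`, the string form of the emitted instructions, and the local library symbol `sw_MPI_Send` are assumptions for illustration, not the paper's implementation. The sketch only shows the dispatch: shared pieces become a local library call wrapped in prologue and epilogue sequences, and private pieces go through instruction translation.

```python
from typing import Dict, List, Tuple

# A program is modelled as a list of (kind, payload) pieces (illustrative assumption).
Piece = Tuple[str, str]


def omega_i(src_instr: str) -> List[str]:
    """Stand-in for instruction translation of a private (non-MPI) piece."""
    return [f"local:{src_instr}"]


def omega_rep(func: str, local_lib: Dict[str, str]) -> List[str]:
    """Stand-in for local library replacement, wrapped in prologue/epilogue."""
    return ["prologue", f"call {local_lib[func]}", "epilogue"]


def translate(program: List[Piece], local_lib: Dict[str, str]) -> List[str]:
    """Dispatch each piece to replacement (shared) or translation (private)."""
    out: List[str] = []
    for kind, payload in program:
        if kind == "share":
            out += omega_rep(payload, local_lib)
        elif kind == "private":
            out += omega_i(payload)
        else:
            raise ValueError(f"unknown piece kind: {kind}")
    return out


local_lib = {"MPI_Send": "sw_MPI_Send"}   # hypothetical local symbol name
program = [("private", "mov rax, 1"), ("share", "MPI_Send"), ("private", "ret")]
translated = translate(program, local_lib)
```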

A Migration Method of MPI Program Combining Local Library Replacement and Instruction Translation
The implementation of MPI-QEMU is as follows. First, the MPI executable file to be migrated is loaded into local memory by the loading module; during loading, information about the MPI function calls in the executable file (for example, function name, function call address, and function address) is extracted and stored in a special structure. Then, taking the basic block as a unit, the source binary file is disassembled to generate the corresponding intermediate code. When an MPI function call is encountered, the program control flow is redirected according to the function call address identified in the loading process, and the instruction sequence that calls the local MPI library function is generated and executed. Otherwise, the intermediate instructions are translated into local instructions for execution in accordance with the translation rules; during execution, each translated basic block is stored in the code cache for reuse. On the local platform, the MPI executable file is executed in a simulated environment realized by the dynamic binary translator, without considering the implementation details of the underlying library; when MPI-QEMU processes a call to an MPI library function, it executes the local MPI library function instead of dynamically translating the source one, and it does not need to consider how messages are sent and received. Accordingly, under this framework, dynamic binary translation combines the executable file with the local library; the executable file indirectly calls local code to implement message interaction.
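The control flow described above can be sketched as a toy dispatch loop. This is not QEMU code: the class `MiniTranslator`, its fields, and the example addresses are hypothetical, and "translation" is reduced to creating a closure. The sketch shows only the two behaviours named in the text, namely reuse of translated blocks from a code cache and redirection of known MPI call sites to local library wrappers without translation.

```python
from typing import Callable, Dict


class MiniTranslator:
    """Toy dispatch loop: cache translated blocks, redirect known MPI call sites."""

    def __init__(self, mpi_call_sites: Dict[int, str],
                 local_mpi: Dict[str, Callable[[], str]]):
        self.mpi_call_sites = mpi_call_sites   # call address -> MPI function name
        self.local_mpi = local_mpi             # function name -> local wrapper
        self.code_cache: Dict[int, Callable[[], str]] = {}
        self.translations = 0                  # number of blocks actually translated

    def _translate_block(self, addr: int) -> Callable[[], str]:
        """Stand-in for disassembly plus code generation of one basic block."""
        self.translations += 1
        return lambda: f"executed translated block @{addr:#x}"

    def run_block(self, addr: int) -> str:
        if addr in self.mpi_call_sites:        # redirect: no instruction translation
            return self.local_mpi[self.mpi_call_sites[addr]]()
        if addr not in self.code_cache:        # translate once, then reuse
            self.code_cache[addr] = self._translate_block(addr)
        return self.code_cache[addr]()


tr = MiniTranslator({0x401030: "MPI_Barrier"},
                    {"MPI_Barrier": lambda: "local MPI_Barrier called"})
trace = [tr.run_block(a) for a in (0x401000, 0x401030, 0x401000)]
```

In this run the block at `0x401000` is executed twice but translated only once, while the MPI call site never enters the code cache.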

The Implementation of Local Library Replacement
The processing of MPI library functions is the key to implementing MPI program migration. This paper puts forward a method of program flow redirection to identify MPI function calls in the process of disassembling.
The extraction and storage of the library function call information are completed through two loops at the program loading stage. An empty func_info queue is initialized before the first loop starts. In each iteration of the first loop, a member of the .rel.plt section is read and a new object, representing a dynamically linked function, is added to the func_info queue; the offset field value of the extracted member is taken as the function address of the object; the offset of the member within .rel.plt is then reused to calculate the corresponding position in the .sym section, and the name field value there is extracted as the function name. The second loop fills in the function call address of each object in the func_info queue; before each modification, the function address must be validated for consistency. When the function address of the object being processed is consistent with the function address calculated from the .plt field, the address of the current member, its index, and its width are used to compute the function call address. The algorithm design is shown in Algorithm 2.
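The two loops can be sketched on already-parsed tables. This is a simplified model under stated assumptions, not the paper's Algorithm 2: the relocation entries are given as `(offset, symbol index)` pairs, the symbol names as a list, and each PLT stub is described by the address it jumps through (`plt_targets`); the names `FuncInfo` and `build_func_info`, the stub width of 16 bytes, and all addresses are hypothetical. Real ELF parsing, including any reserved first PLT entry, is deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class FuncInfo:
    name: str
    func_addr: int                   # offset field of the relocation entry
    call_addr: Optional[int] = None  # address of the PLT stub, filled in loop 2


def build_func_info(rel_plt: Sequence[Tuple[int, int]], symbols: Sequence[str],
                    plt_base: int, plt_width: int,
                    plt_targets: Sequence[int]) -> List[FuncInfo]:
    """rel_plt: (offset, sym_index) pairs; plt_targets[i]: address used by stub i."""
    queue: List[FuncInfo] = []
    # Loop 1: one func_info object per relocation entry (name + function address).
    for offset, sym_index in rel_plt:
        queue.append(FuncInfo(name=symbols[sym_index], func_addr=offset))
    # Loop 2: validate the function address against the PLT stubs, then compute
    # the call address from the stub's index and width.
    for info in queue:
        for index, target in enumerate(plt_targets):
            if target == info.func_addr:
                info.call_addr = plt_base + index * plt_width
                break
    return queue


symbols = ["", "MPI_Init", "printf", "MPI_Send"]
rel_plt = [(0x601018, 1), (0x601020, 2), (0x601028, 3)]
plt_targets = [0x601018, 0x601020, 0x601028]
func_info = build_func_info(rel_plt, symbols, plt_base=0x400500,
                            plt_width=16, plt_targets=plt_targets)
```

An entry whose function address matches no stub keeps `call_addr` as `None`, so a failed consistency check is visible rather than silently producing an address.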

Opcode Interception.
At the translation stage, an opcode interception algorithm is designed to redirect the program flow. As each instruction is disassembled, it is checked to determine whether it is an MPI function call. When an MPI function call is identified, the program jumps to the local MPI function encapsulation for execution without instruction translation, using the library function call information cached in the func_info structure; otherwise, traditional instruction translation is carried out. Algorithm 3 shows the opcode interception algorithm. When translation is carried out with the basic block as a unit, each instruction is disassembled to obtain its opcode and identify the instruction type. If it is a library function call (e.g., the opcode of the call instruction on the x86 platform is E8), the function name is checked to see whether it exists in the cache array and has the prefix mpi_; if so, parameter passing is carried out and a local MPI function call instruction is generated. During call instruction interception, the operand of the call instruction and the address of the next instruction are used to compute the call address of the current library function. This call address is used as the key to search the cache array; if the search succeeds, the instruction is an MPI function call instruction.
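The address computation in the interception step can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 3: the function `intercept`, the lookup table keyed by call address, and the example bytes are hypothetical, and the prefix check is made case-insensitive as an assumption so that C binding names such as `MPI_Send` match. For the x86 `E8` near call, the target is the address of the next instruction plus the signed 32-bit operand.

```python
import struct
from typing import Dict, Optional

CALL_REL32 = 0xE8   # x86 near call with a 32-bit relative operand
CALL_LEN = 5        # 1 opcode byte + 4 operand bytes


def intercept(code: bytes, pc: int, base: int,
              func_by_call_addr: Dict[int, str]) -> Optional[str]:
    """Return the MPI function name if the instruction at pc calls one, else None."""
    off = pc - base
    if off < 0 or off + CALL_LEN > len(code) or code[off] != CALL_REL32:
        return None
    (rel32,) = struct.unpack_from("<i", code, off + 1)   # signed, little-endian
    target = pc + CALL_LEN + rel32                       # next instruction + operand
    name = func_by_call_addr.get(target)
    if name is not None and name.lower().startswith("mpi_"):
        return name
    return None


base = 0x400600
# Encodes: call 0x400520 ; nop ; call 0x400510
code = (bytes([0xE8]) + struct.pack("<i", 0x400520 - (base + 5)) +
        bytes([0x90]) +
        bytes([0xE8]) + struct.pack("<i", 0x400510 - (base + 11)))
table = {0x400520: "MPI_Send", 0x400510: "printf"}
hits = [intercept(code, pc, base, table) for pc in (base, base + 5, base + 6)]
```

Only the first call is intercepted; the `nop` is not a call, and the second call resolves to a non-MPI function that would go through ordinary translation.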

Local Library Function Encapsulation and Call.
The library function call is implemented by executing the local library encapsulated in advance, which means that the MPI functions must be encapsulated before execution. This paper applies a hierarchical encapsulation of library functions with the following levels: implementation language → MPI library function name → parameter types, number of parameters, and return value type. This division gradually narrows the scope of the search when matching a library function, so as to improve the hit rate. During implementation, parameter passing and return value processing are carried out with the help of the context. Because the MPI library contains a limited number of functions, the amount of work is small, and it can be done by referring to the MPI library of a platform.
Due to the complexity and diversity of library functions, a series of issues, such as parameter acquisition, return value capture, format string analysis, and acquisition of the data elements of structures, need to be handled in the library function encapsulation, which makes it a challenging job. When the MPI program to be translated contains a large number of MPI library function calls, the local library replacement method can significantly improve efficiency.
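The hierarchical lookup can be sketched as nested dictionaries keyed by language, then function name, then signature. The registry contents, the type-name strings, and the stand-in wrapper `fake_mpi_comm_rank` are hypothetical; a real wrapper would marshal the guest arguments and call the local MPI library.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# Signature key: (parameter types, return type). All entries are illustrative.
Signature = Tuple[Tuple[str, ...], str]
Registry = Dict[str, Dict[str, Dict[Signature, Callable[..., int]]]]


def fake_mpi_comm_rank(comm: int, rank_out: list) -> int:
    """Stand-in for a wrapper around the local MPI_Comm_rank."""
    rank_out.append(0)   # pretend this process has rank 0
    return 0             # 0 plays the role of MPI_SUCCESS


REGISTRY: Registry = {
    "c": {
        "MPI_Comm_rank": {
            (("MPI_Comm", "int*"), "int"): fake_mpi_comm_rank,
        },
    },
}


def lookup(language: str, name: str, params: Sequence[str],
           ret: str) -> Optional[Callable[..., int]]:
    """Narrow the search level by level: language, then name, then signature."""
    by_name = REGISTRY.get(language, {})
    by_sig = by_name.get(name, {})
    return by_sig.get((tuple(params), ret))


wrapper = lookup("c", "MPI_Comm_rank", ["MPI_Comm", "int*"], "int")
out: list = []
status = wrapper(0, out) if wrapper is not None else -1
missing = lookup("fortran", "MPI_Comm_rank", ["MPI_Comm", "int*"], "int")
```

Each level discards candidates that cannot match, so a miss at the language or name level returns without comparing any signatures.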

Experimental Test
Performance is reported as the ratio $v_{\text{mpi-qemu}} / v_{\text{native}}$, where $v_{\text{mpi-qemu}}$ and $v_{\text{native}}$ represent the execution speed under MPI-QEMU and on the source platform, respectively. If the ratio is greater than 1, MPI-QEMU is faster; otherwise, the source platform is faster.
(1) IMB Test. IMB focuses on testing the performance of each MPI interaction function. This paper adopts four IMB test items, that is, IMB-MPI1, IMB-EXT, IMB-NBC, and IMB-RMA. The test results are shown in Tables 3-6; each table has three columns, which are, from left to right, the MPI library function name, the number of processes, and the ratio. Of the 46 MPI functions contained in the four test items, only 2 have a ratio below 0.9; the ratios of the others are above 0.9 and close to 1, and 1 function even has a ratio greater than 1. This indicates that the library function encapsulation algorithm can make full use of the local MPI environment to accelerate each message interaction; MPI-QEMU has almost the same performance as execution on the source platform.
The speedup comes close to the theoretical process acceleration ratio: by calculation, the ratios of the four test cases are 85.1%, 92.9%, 94.5%, and 96.4%, with an average acceleration ratio of 92.2%.
For the cross-platform part of the MPI executable file, the local library function call is applied to implement migration through program flow redirection at three stages; for the non-cross-platform part, binary translation based on instruction translation is applied to implement migration. The experiment verified the correctness and effectiveness of the proposed method. As a next step, we will explore the migration of large-scale commercial MPI binary programs and new ways to improve migration efficiency.

Definition 2. $W_{private}(C_s, C_l) = \{w : \text{the private part matching the source code of } W_s \text{ in the source platform } C_s, \text{ not shared with the local platform } C_l\}$. According to Definitions 1 and 2, $W_s = W_{share}(C_s, C_l) \cup W_{private}(C_s, C_l)$.

2.4. The MPI Program Migration. The program logic of an MPI executable file on the source platform is unique and difficult to analyze, but the behavior of calling shared library functions is predictable. When the MPI program is migrated from the source platform $C_s$ to the local platform $C_l$, different strategies are applied to the $W_{share}$ part and the $W_{private}$ part, which together constitute the MPI program: the translation replacement function $\Omega_{rep}()$ is used for the $W_{share}$ part, while the instruction translation function $\Omega_I()$ is used for the $W_{private}$ part. The binary translation function is thus defined piecewise over these two parts.
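A piecewise form consistent with this description can be sketched as follows. The name $\Omega$ for the combined function, the concatenation operator $\oplus$, the subscripts $s$ and $l$ for the source and local clusters, and the placement of the prologue and epilogue switching sequences around the replaced call are assumptions made here for illustration, not necessarily the paper's exact notation.

```latex
\Omega(w) =
\begin{cases}
  I_{prologue} \oplus \Omega_{rep}(w) \oplus I_{epilogue}, & w \in W_{share}(C_s, C_l), \\
  \Omega_I(w), & w \in W_{private}(C_s, C_l).
\end{cases}
```

Under this reading, every piece of the source program falls into exactly one branch, so the translated program is the concatenation of the per-piece results.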

Figure 2: The migration model of the MPI program.
$S_C = S_1 \times S_2 \times \cdots \times S_m$ represents the state of the cluster, a Cartesian product composed of the states of the $m$ nodes; $I_C = I_1 \times I_2 \times \cdots \times I_m$ represents the instruction set of the cluster, a Cartesian product composed of the instruction sets of the $m$ nodes; $\Theta_C = \theta_1 \times \theta_2 \times \cdots \times \theta_m : I_C \times S_C \to S_C$ represents the state transition of the cluster, an interpretation function of the cluster for the instruction set in a given state; $\Gamma$ represents the cluster interactions, including all instructions that can change the state of the cluster.
The Mapping of the Program among the Clusters. By the computable equivalence of Turing machines, any program running on a universal Turing machine can be computed by another universal Turing machine. Applying this theory to the cluster environment, we can conclude that a program running on the cluster of the source platform can be simulated by the local cluster, and the simulation is implemented by executing the source platform instructions on the local platform. Instruction execution leads to changes in the state of the cluster. To this end, the state mapping function and the instruction sequence mapping function are defined, respectively, as follows:
$\Omega_S : S_s \to S_l$: mapping the cluster state of the source platform to that of the local platform.
$\Omega_I : I_s^* \to I_l^*$: mapping the instruction sequence of the program in the source platform cluster to that of the local platform.
Let $C_s$ represent the cluster of the source platform; let $C_l$ represent the local cluster; let $S_{(s,0)}$ represent the initial state of $C_s$; and let $t$ represent the instruction sequence of the program on $C_s$. Figure 1 shows the process by which a program of the source platform cluster is mapped to the local cluster. $C_s$ uses $\Theta_s^*$ to interpret and execute $t$, causing a change in the state of $C_s$; $\Omega_S$ maps $S_{(s,0)}$ to derive the initial state of $C_l$, $\Omega_S(S_{(s,0)})$; $\Omega_I$ maps $t$ to derive the instruction sequence of $C_l$, $\Omega_I(t)$; $C_l$ uses $\Theta_l^*$ to interpret and execute $\Omega_I(t)$, causing a change in the state of $C_l$.
Let $W_s$ represent the program of the source platform $C_s$; let $W_l$ represent the program of the local platform $C_l$; let $W_{share}$ represent the cross-platform part of the program. $W_{share}$ may vary with different platforms, so we define it depending on the platform.
Definition 1. $W_{share}(C_s, C_l) = \{w : \text{the cross-platform part matching the source code of } W_s \text{ in the source platform } C_s\}$. Correspondingly, $W_{share}(C_l, C_s)$ is the cross-platform part in the platform $C_l$. Let $\Omega_{rep}$ be the injective function from the sequence $W_{share}(C_s, C_l)$ to the corresponding sequence $W_{share}(C_l, C_s)$: $\Omega_{rep} : W_{share}(C_s, C_l) \to W_{share}(C_l, C_s)$.
Mathematical Problems in Engineering
As mentioned in the above section, the migration of the MPI program can be classified for processing according to the code types of its constituent parts, based on which the migration model shown in Figure 2 is established. The model ignores the underlying implementation details and shows, from the macro perspective, a method combining local library replacement and instruction translation.
3.1. The Migration Model of the MPI Program. The MPI program on the source platform to be translated can be regarded as composed of several parts divided by the MPI function calls. During migration, an MPI function call belongs to the cross-platform part, so it can be processed by the $\Omega_{rep}$ function (i.e., the local library replacement method); a non-MPI part can be processed by the $\Omega_I$ function (the traditional dynamic translation method).
3.2. The Framework and Implementation Process of Migrating MPI Program. Based on this model, this paper constructs a framework for migrating MPI programs, called MPI-QEMU, on the basis of the open-source dynamic binary translator QEMU. MPI-QEMU can call local MPI library functions through software instrumentation, and the system framework design is shown in Figure 3.

Table 2: Test cases and the result of correctness test.
This paper uses the IMB and NPB test sets to test the correctness of MPI-QEMU, with the test items and results shown in Table 2. The result shows that using

Table 4: Test cases of IMB-EXT.

Table 5: Test cases of IMB-NBC.
(2) NPB Test. The NPB test set applies commonly used classic algorithms and modules to reflect the system's ability to solve practical engineering problems. This paper uses the MPI version of NPB to test the performance of MPI-QEMU in a multiprocess pattern. Figures 4-7 show the test results of NPB-IS, NPB-CG, NPB-EP, and NPB-FT, respectively. The horizontal axis represents the number of processes and the vertical axis represents the running time of the program in seconds. The experimental results show that as the number of processes increases, the running time is greatly reduced, and the running time is almost inversely proportional to the number of processes.
This paper carried out a beneficial attempt at migrating the MPI program. A migration method of MPI program combining local library replacement and instruction translation is proposed.
Figure 6: Test result of NPB-EP.