Communication Optimization Technology Based on Network Dynamic Performance Model

This work analyses different communication modes in supercomputing applications, proposes a communication dynamic performance model based on topology awareness, and realizes a prototype system for all-to-all communication and stencil communication optimization based on this model. Basic tests on the optimization of all-to-all communication and stencil communication were carried out on the Sunway TaihuLight system and achieved obvious optimization results. Several applications, including molecular dynamics simulation and turbulence simulation, have been optimized and tested. The average performance has been improved obviously. It can be expected that, for other large-scale applications, this optimization method can also be used to obtain significant improvement in communication performance.


Introduction
Although supercomputers have been making breakthroughs in peak computing rates, their application levels have lagged behind. While researchers strive to improve the application level of high performance computing, researchers in the field [10,11] designed a method called STAR-MPI (self-tuning adaptive routines for MPI collective operations), which can dynamically select the algorithm for collective communication in a network with unpredictable performance. This method tests various possible schemes and uses a prediction mechanism to discard low-performance algorithms and save testing time. Vadhiyar et al. [12] used an automatic optimization technique similar to FFTW for collective communication tuning: first, the optimal buffer size for an algorithm is tested under a given number of processes; then the performance of different algorithms is tested for a given message size; finally, these steps are repeated for different numbers of processes to determine the optimal collective communication algorithm for each process count. Subramoni et al. [13] analyzed the factors causing network congestion in a large-scale InfiniBand cluster, represented the dynamic topology characteristics of the system by generating a path matrix, and optimized the Alltoall implementation, which achieved a 12% performance improvement for P3DFFT on a 4,096-core network. Mamadou et al. [14] used the P-LogP point-to-point model to predict the performance of different algorithms and determine the optimal Alltoall algorithm according to the dynamic changes of the system network load, and achieved good results on InfiniBand and Gigabit Ethernet networks. Patarasuk and Yuan [15] optimized large-message All-Reduce on tree network structures, enabling each process to send and receive the minimum amount of data while avoiding blocking, and achieved performance improvements on Myrinet, InfiniBand, and Ethernet clusters. Kandalla et al. [16] modeled communication performance by detecting the topology information of a large-scale InfiniBand network, analyzed the performance overhead of collective communication, and optimized the Gather and Scatter routines. Ma et al. [17] generated topology-aware Broadcast and All-Gather implementations based on process distance, network hardware topology, and runtime communicator information. Gallardo et al. [18] implemented MPI Advisor, an easy-to-use software tool that lets programmers dynamically monitor application execution and tune the MPI environment to improve performance. Bhatele et al. [19] inferred the possible causes of network communication blocking by dynamically monitoring the performance of the application.
Other studies optimize collective communication for the characteristics of the system network. MPI collective communication is usually designed under the assumption that one node can communicate with only one other node at a time. Chan et al. [20] improved several collective communication functions, including Broadcast, Reduce, Scatter, Gather, All_gather, Reduce_scatter, and All-Reduce, by exploiting the fact that one node can communicate with multiple other nodes at the same time in the IBM Blue Gene/L system. Faraj et al. [21] showed that, in a network composed of cut-through and store-and-forward switches, when the message is large enough, the subnet formed by a minimum spanning tree can achieve nearly optimal performance for all-to-all broadcast communication. Zhang and Deng [22] proposed that, on a Torus network, the average distance between nodes could be reduced more effectively and broadcast performance improved by strategically adding shortcut connections rather than by adding network dimensions. Song and Hollingsworth [23] proposed a new broadcast algorithm using MPI-2 one-sided communication and a pipe-logging mechanism; quantitative analysis with the P-LogP parallel computing model and experiments verified that the algorithm performs better than the traditional algorithm. Mamidala et al. [24] analyzed performance scalability and the trade-off between performance and memory consumption when implementing collective communication and one-sided communication with InfiniBand Reliable Connection (RC) and Unreliable Datagram (UD) transports. In systems using an InfiniBand network, MPI communication functions usually use the RC transport mode. However, in large-scale networks, to save the memory consumed by establishing full connections in RC, Koop et al. [25] suggested using UD to implement MPI collective communication functions. Qian and Afsahi [26] implemented several RDMA multiport communication functions on a QsNetII cluster based on the multirail characteristics of its network. Hasanov [27] optimized the parallel matrix multiplication algorithm on large-scale network systems by reducing communication overhead. Mistry et al. [28] found that the switching components of an InfiniBand network can become the bottleneck of Alltoall communication.
Some researchers have optimized collective communication through process-node and process-CPU core mappings. Karlsson et al. [29] improved the broadcast performance of multidimensional process groups in different dimensions by applying hierarchically optimized process-CPU core mapping. Balaji et al. [30] analyzed how the process-node mapping in the three-dimensional Torus network topology of the Blue Gene/P system influences application performance and supplied application communication pattern information to optimize communication performance before the application is loaded. Based on the Torus network topology, Mittal et al. [31] designed nonblocking data routing methods for subcommunicators formed by multiple noncontiguous nodes that communicate concurrently in a loosely synchronized manner and verified the performance on the Blue Gene/P system. Bhatele et al. [32] developed a tool called Rubik that optimizes the communication performance of subcommunicators in an application by adjusting the process-node mapping. Karlsson et al. [33] optimized multidimensional MPI collective communication on a multidimensional Torus network and reduced the communication traffic between nodes on the Jaguar system by changing the process-CPU core mapping. Zahavi et al. [34] proposed that, when an application runs on a fully or partially populated fat tree, the MPI process-node mapping should reflect the structural characteristics of the network, and verified by simulation that their nonblocking routing method achieves higher performance in Alltoall communication.

Communication Characteristics of Different Types of Applications
In order to carry out research on communication performance optimization technology based on topological structure, it is necessary to study the characteristics of the communication modes used by applications on the supercomputing system. Therefore, the communication characteristics of a turbulent flow application and a crystal silicon solidification process simulation application are studied.

All-to-All Communication.
The communication of direct numerical turbulence simulation applications is characterized by all-to-all communication.
The core of direct numerical turbulence simulation is the Fourier transform of a three-dimensional cube, which is also the most difficult part to optimize. The data volume of this part is large: for a 3D cube with a side length of 16,384, it reaches 16 TB. Standard practice requires the entire data set to be transposed, which results in frequent data transfers, with one data transfer per iteration time step and more than 10 such cube FFTs. The calculation is designed as follows. The original data are stored as an ordinary three-dimensional array whose right-most dimension is the contiguous one. The whole cube holds N^3 single-precision complex numbers, and array dimension notation is used to describe the data layout. We use P processes to participate in the calculation. The data are divided equally into P parts, and each process is allocated N/P slices of N * N elements; that is, the cube is sliced along the first dimension and the slices are assigned to the processes. Then the local FFT of the two-dimensional matrices is completed in each process. Next, an all-to-all communication takes place among all processes to complete a transpose of the 3D data on the x dimension, turning it into a contiguous dimension within a single process. To do this, the second dimension also needs to be split across the processes, and then a local data transpose is done; that is, z and y are swapped.
It is found that there are significant performance differences when different nodes are used for communication. In Figure 1, the abscissa represents different node groups; each group has 64 nodes, there are 32 groups in total for all-to-all communication, and the different curves represent 5 performance measurements. The performance of different groups differs significantly, while the performance of the nodes within the same group is fairly stable.
This shows that effective performance improvement can be expected from optimizing the implementation of complex collective communication by changing the process-core mapping.
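The slab decomposition and the all-to-all transpose described above can be illustrated with a small pure-Python simulation. This is only a sketch under stated assumptions: the function name `slab_transpose`, the axis labels (axis 0 distributed before the exchange, axis 1 distributed after it), and the dictionary-based storage are illustrative choices and are not taken from the paper. It shows that each of the P * P messages carries (N/P)^2 * N elements and that every process ends up holding N^3/P elements.

```python
def slab_transpose(n, p):
    """Simulate one all-to-all that turns axis-0 slabs into axis-1 slabs."""
    if n % p:
        raise ValueError("n must be divisible by p")
    s = n // p  # slab thickness owned by each process

    # Before: process r owns every element (i, j, k) with i in its slab.
    # Each value encodes its own coordinates so placement can be checked.
    before = [
        {(i, j, k): (i * n + j) * n + k
         for i in range(r * s, (r + 1) * s)
         for j in range(n)
         for k in range(n)}
        for r in range(p)
    ]

    # All-to-all: process r sends to process q the part of its slab
    # whose axis-1 index j falls inside q's range.
    send = [
        [{key: val for key, val in before[r].items()
          if q * s <= key[1] < (q + 1) * s}
         for q in range(p)]
        for r in range(p)
    ]

    # After: process q merges the p blocks it received (one per sender).
    after = [{} for _ in range(p)]
    for r in range(p):
        for q in range(p):
            after[q].update(send[r][q])
    return send, after


send, after = slab_transpose(n=4, p=2)
print(len(send[0][1]))  # elements in one message: (n/p)^2 * n
print(len(after[0]))    # elements held by one process afterwards: n^3 / p
```

Because every process exchanges a block with every other process, the slowest node pair bounds the whole step, which is consistent with the group-to-group differences reported for Figure 1.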

Stencil Communication.
The communication of the silicon solidification process simulation application follows the stencil communication mode. We tested the effect of different communication patterns and process dimensional distribution patterns on performance.
In the one-dimensional communication mode, each process sends data of unit message length (2 K) to its 26 surrounding neighbors at the same time. After the communication, each process has received all messages from its 26 surrounding neighbors. An example of the one-dimensional communication mode is shown in Figure 2.
In the two-dimensional communication mode, in the first communication, each process sends data of unit message length (2 K) to its 8 surrounding neighbors at the same time. After this communication, each process has received all messages from its 8 surrounding neighbors. In the second communication, each process sends a message containing its own data and the data of its 8 neighbors (2 K * 9) to its upper and lower neighbors at the same time. After this communication, each process has received all the messages from its 26 surrounding neighbors. An example of the two-dimensional communication mode is shown in Figure 3.
In the 3D communication mode, in the first communication, each process sends data of unit message length (2 K) to its left and right neighbors at the same time. After this communication, each process has received the messages from these 2 neighbors. In the second communication, each process sends a message containing its own data and the data of its two neighbors (2 K * 3) to its front and back neighbors at the same time. After this communication, each process has received all the messages from its 8 surrounding neighbors. In the third communication, each process sends a message containing its own data and the data of its 8 neighbors (2 K * 9) to its upper and lower neighbors at the same time. After this communication, each process has received all the messages from its 26 surrounding neighbors. An example of the 3D communication mode is shown in Figure 4. The performance comparison of the three communication modes is shown in Figure 5. It can be seen that the 3D 2-2-2 mode has obvious performance advantages. The ordering of processes in different dimensions determines the complexity of the neighbor relationships, which is also critical to performance. Under three different process arrangements, the 3D communication mode described above is adopted. We test the performance trend of computing plus communication (unit message length 2 K), communication only (unit message length 2 K), and communication only (unit message length 8 K) under different sizes. It can be seen that the choice of communication mode has a significant impact on performance, and it can also be expected that process-core mapping optimization will further improve communication performance.
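The three-stage exchange of the 3D mode can be checked with a short simulation. The sketch below assumes a periodic process grid with at least three processes per dimension so that all neighbors are distinct; the function name `staged_exchange` and the grid sizes are illustrative and not part of the original work. It confirms that after exchanging along the three axes in turn, every process holds its own block plus those of all 26 neighbors, while sending only 2 messages per stage of 1, 3, and 9 unit blocks.

```python
from itertools import product


def staged_exchange(dims=(4, 4, 4)):
    """Simulate the axis-by-axis halo exchange on a periodic 3D process grid."""
    coords = list(product(*(range(d) for d in dims)))
    # held[c] is the set of original blocks that process c currently stores.
    held = {c: {c} for c in coords}
    stats = []  # per stage: (messages sent by a process, unit blocks per message)

    for axis in range(3):
        units = len(held[coords[0]])  # every process holds the same count
        new_held = {}
        for c in coords:
            merged = set(held[c])
            for step in (-1, 1):  # the two neighbors along this axis
                neighbor = list(c)
                neighbor[axis] = (c[axis] + step) % dims[axis]
                merged |= held[tuple(neighbor)]
            new_held[c] = merged
        held = new_held
        stats.append((2, units))
    return held, stats


held, stats = staged_exchange()
print(stats)                             # messages and message size per stage
print(len(held[(0, 0, 0)]))              # own block plus all neighbors
print(sum(m * u for m, u in stats))      # total unit blocks sent per process
```

Compared with the one-dimensional mode, which needs 26 separate messages per process, the staged scheme moves the same 26 unit blocks in only 6 messages, at the cost of three dependent communication steps.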

Communication Dynamic Performance Model
In addition to the physical structure of the network, this scheme uses a dynamic performance model of the network for optimization, which is an innovative aspect of this study. The work of this paper is carried out on the Sunway TaihuLight supercomputer.
The Sunway TaihuLight supercomputer consists of 40 computing cabinets and 8 network cabinets. Each computing cabinet contains four supernodes, each composed of 32 computing plug-ins. Each plug-in is composed of four operation node boards, and one operation node board contains two high-performance "Shenwei 26010" processors. One cabinet therefore has 1024 processors, and the whole machine has 40,960 processors. Each processor has 260 cores, the motherboard is designed for two nodes, and each CPU has 32 GB of DDR3-2133 memory soldered on board.
This optimization method may not be directly applicable to other nontree network structures; the corresponding performance model should be established according to the specific network structure. However, the idea in this paper can be used for reference.
In the model M, a, b ∈ N(M) are nodes, and d is a real number indicating the network communication performance between node a and node b. What the model describes is therefore a fully connected directed graph weighted by the network performance between nodes, as shown in Figure 6. The technical route proposed in this paper is to test the communication performance of each link of the system (including the communication between storage components within a node and the network between nodes) through an example test set, so as to build the dynamic topology model of the communication of the whole heterogeneous system. The communication instance test set can include the following:
① Within a node, the data transmission performance between storage components is tested at different granularities to fully describe the "distance" between each pair of storage components.
② Between nodes, the bandwidth and delay of communication are tested at different transmission granularities to describe the "distance" between nodes. Stress tests measure the throughput and other constraints of the network switches at all levels.
③ The interplay between the performance of various concurrent transfers is tested and modeled.
④ This profiling process should be conducted in an efficient and automated manner and can be repeated at intervals during application execution to update the dynamic topology model.
The dynamic communication model is constructed by detecting the dynamic topology of data communication. It is represented by a graph structure: each vertex in the graph represents a network node, and each edge represents network characteristics, such as bandwidth, between the node pair.
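A minimal sketch of how such a weighted directed graph might be assembled from profiling results is given below. The names `build_performance_graph` and `update_graph`, the use of the median to summarize repeated measurements, and the exponential smoothing used when a later profiling round is merged in are all assumptions made for illustration; the paper does not specify how measurements are aggregated or refreshed.

```python
from statistics import median


def build_performance_graph(samples):
    """Turn (src, dst, seconds) timing samples into a graph {src: {dst: d}}.

    d is the median round-trip time of the directed pair; lower is better.
    """
    by_edge = {}
    for src, dst, seconds in samples:
        by_edge.setdefault((src, dst), []).append(seconds)
    graph = {}
    for (src, dst), times in by_edge.items():
        graph.setdefault(src, {})[dst] = median(times)
    return graph


def update_graph(graph, fresh, alpha=0.5):
    """Blend a fresh profiling round into the model.

    alpha weights the new measurement (exponential smoothing, an assumption);
    edges that were not measured before simply take the new value.
    """
    for src, row in fresh.items():
        for dst, d in row.items():
            old = graph.setdefault(src, {}).get(dst, d)
            graph[src][dst] = (1 - alpha) * old + alpha * d
    return graph


samples = [("a", "b", 1.0), ("a", "b", 3.0), ("a", "b", 2.0), ("b", "a", 4.0)]
graph = build_performance_graph(samples)
graph = update_graph(graph, {"a": {"b": 4.0}})
print(graph)
```

Keeping the two directions of a node pair as separate edges matches the directed graph of the model; a symmetric variant would simply average the two entries.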
Considering the network dynamic performance model, process-core mapping optimization is carried out for applications with different communication characteristics:
① Different types of communication characteristics place different requirements on communication. For example, whereas all-to-all communication depends on the network relationships between all nodes, stencil 2-2-2 communication depends only on the network relationships between associated neighbor nodes.
② The dynamic communication graph is taken as a complete graph, and the optimal subgraph is sought to match the performance requirements of the communication characteristics mentioned above.
③ The node characteristics of the subgraph should conform to the known physical structure model of the network.
④ The effectiveness of process-core mapping optimizations is validated with examples: in addition to the all-to-all communication and stencil 2-2-2 communication modes described above, other MPI collective communication modes can be considered for validation. For example, for the broadcast communication mode, it is necessary to construct a subgraph that forms a tree structure corresponding to the implementation of broadcast and to make this tree structure optimal.
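For the broadcast case in item ④, one simple way to obtain a tree-shaped subgraph is a minimum spanning tree over the performance graph. The sketch below uses Prim's algorithm as a stand-in heuristic; this choice, the function name `broadcast_tree`, and the example distances are assumptions for illustration and are not the method of the paper, which does not state how the optimal tree is found. Node identifiers are assumed to be mutually comparable (for example integers) so that ties in the heap can be broken.

```python
import heapq


def broadcast_tree(dist, root):
    """Return {child: parent} for a minimum spanning tree rooted at root.

    dist[a][b] is a symmetric cost between nodes a and b (lower is better).
    """
    parent = {}
    visited = {root}
    heap = [(d, root, v) for v, d in dist[root].items()]
    heapq.heapify(heap)
    while heap and len(visited) < len(dist):
        d, u, v = heapq.heappop(heap)
        if v in visited:
            continue  # a cheaper edge already attached v to the tree
        visited.add(v)
        parent[v] = u
        for w, dw in dist[v].items():
            if w not in visited:
                heapq.heappush(heap, (dw, v, w))
    return parent


# Hypothetical costs: nodes 1-2 and nodes 3-4 share a switch, other pairs do not.
dist = {
    1: {2: 1.0, 3: 5.0, 4: 5.0},
    2: {1: 1.0, 3: 5.0, 4: 5.0},
    3: {1: 5.0, 2: 5.0, 4: 1.0},
    4: {1: 5.0, 2: 5.0, 3: 1.0},
}
print(broadcast_tree(dist, root=1))
```

In this example only one expensive cross-switch edge appears in the tree, and the remaining transfers stay inside a switch. A tree that minimizes total edge cost does not necessarily minimize broadcast completion time, so this should be read as a starting point rather than an optimal construction.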

Optimize All-to-All Communication Based on the Dynamic Performance Model
This section takes collective communication optimization based on the dynamic performance model as an example to demonstrate the design idea of the scheme. As shown in Figure 7, the bandwidth and delay of communication between nodes were tested at different transmission granularities under communication pressure.
According to the bandwidth and latency characteristics, the topology of the nodes is represented as a fully connected graph in which the distance between nodes represents the network performance between them. In this dynamic topology, nodes 1-4 are located in the switch network of one layer, while nodes 5-8 are located in the switch network of another layer.
If process-core mapping optimization for all-to-all communication is carried out at this point, nodes 2-3 are the best choice when two nodes are needed, and nodes 1-4 are the best choice when four nodes are needed. The dynamic topology structure can be used not only to optimize node selection and process-core mapping but also to optimize the implementation of collective communication. One example is the broadcast communication among the eight nodes in the figure. Following the design of the dynamic performance model, a prototype system was implemented and tested.
Through the example test set, the communication performance of each link of the system is tested, and the dynamic topology model of the whole system communication is built. The test method is as follows: considering only the network communication performance between main cores, repeated ping-pong communications are carried out between all node pairs at the same time. Several rounds are conducted to record the communication performance between each pair of nodes. The dynamic communication model is expressed as a graph structure. Based on this graph, the optimal fully connected subgraph is sought, and all-to-all communication performance is tested. The algorithm to find the optimal fully connected subgraph is shown in Figure 8.
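The greedy procedure of Figure 8 can be written as a short Python sketch. The function name `select_nodes` and the example distances are hypothetical; the distances are chosen only to mimic the situation described earlier, with nodes 1-4 under one switch layer, nodes 5-8 under another, and nodes 2 and 3 being the closest pair. The sketch assumes a symmetric cost where lower values mean better communication performance.

```python
from itertools import combinations


def select_nodes(dist, required):
    """Greedy selection in the spirit of Figure 8.

    dist[a][b] is a symmetric cost between nodes a and b (lower is better).
    """
    nodes = list(dist)
    if required < 2 or required > len(nodes):
        raise ValueError("required must be between 2 and the number of candidates")
    # Seed the selection with the two closest candidate nodes.
    a, b = min(combinations(nodes, 2), key=lambda pair: dist[pair[0]][pair[1]])
    selected = [a, b]
    while len(selected) < required:
        candidates = [n for n in nodes if n not in selected]
        # Pick the node whose summed distance to the selected set is smallest.
        new_one = min(candidates, key=lambda n: sum(dist[n][s] for s in selected))
        selected.append(new_one)
    return selected


def example_dist():
    """Hypothetical costs: 1.0 inside a switch layer, 4.0 across layers."""
    layer = {n: 0 if n <= 4 else 1 for n in range(1, 9)}
    dist = {a: {b: 1.0 if layer[a] == layer[b] else 4.0
                for b in layer if b != a}
            for a in layer}
    dist[2][3] = dist[3][2] = 0.5  # nodes 2 and 3 are the closest pair
    return dist


print(select_nodes(example_dist(), 2))
print(sorted(select_nodes(example_dist(), 4)))
```

The procedure is greedy, so it yields a good fully connected subgraph quickly but is not guaranteed to find the globally optimal one.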
Following the implementation described above, a program with all-to-all communication characteristics is tested by changing the process-core mapping based on the network dynamic performance model.
Since the test is carried out in a shared partitioned environment and the workload and network load change at any time, the following factors are considered for the test. The test program performs several rounds of MPI_Alltoall communication. Each batch of tests is repeated several times, and data with obviously abnormal performance results (an order of magnitude difference between the results of two adjacent tests) are eliminated. The test job before optimization is submitted by command and uses the default node allocation mode. When the optimized test job is submitted, the nodes and mapping mode selected by the optimization are specified. To ensure fair competition, the two types of jobs are submitted from different terminals at the same time. If the nodes selected for the two jobs overlap, the two test jobs are submitted in turn.
From the test results shown in Table 1, it can be seen that a significant improvement in communication performance is achieved after node selection and process-core mapping optimization based on the dynamic topology structure. The values in the table represent the time in seconds needed to complete a round of communication. For jobs with a large communication data volume and node count, the performance improvement after optimization is more obvious. It is also expected that the larger the job node count is, the easier it is to benefit from node selection and process-core mapping optimization.
Note that the test is carried out in a shared partitioned environment. The workload and network load change at any time, so the measured speedup may differ from run to run (which also reflects the real usage scenario). After several repeated tests, the optimization effect is clearly visible.

Optimize Stencil Communication Based on the Dynamic Performance Model
This section takes stencil communication optimization based on the dynamic performance model as an example to demonstrate the design idea of the scheme.
The communication performance between every pair of nodes on the network is tested. Nodes that communicate well with each other are combined into small stencil blocks (2 by 2 by 2), and larger stencil blocks (4 by 4 by 4) are then built from the smaller ones. This process iterates until a block of the node size required by the application is constructed, as shown in Figure 9.
Through the example test set, the communication performance of each link of the system is tested, and the dynamic topology model of the whole system communication is built. This process is similar to the one used for all-to-all communication optimization and is not repeated here. The algorithm to construct a stencil communication node block is shown in Figure 10.
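The bottom-up block construction can be sketched as follows. The grouping rule used here (greedily collecting the k closest items, with the distance between two blocks taken as the mean pairwise node distance) and the names `block_distance`, `group`, and `build_stencil_blocks` are assumptions made for illustration; the paper only states that well-communicating nodes are combined into 2 by 2 by 2 blocks and that larger blocks are built from smaller ones. The sketch also does not assign positions inside a block, which a full implementation would need for the final process-core mapping.

```python
def block_distance(dist, a, b):
    """Mean pairwise cost between two blocks given as tuples of node ids."""
    return sum(dist[x][y] for x in a for y in b) / (len(a) * len(b))


def group(blocks, dist, k):
    """Greedily merge blocks into groups of k mutually close blocks.

    Blocks left over when fewer than k remain are dropped.
    """
    remaining = list(blocks)
    merged = []
    while len(remaining) >= k:
        chosen = [remaining.pop(0)]
        while len(chosen) < k:
            best = min(
                remaining,
                key=lambda b: sum(block_distance(dist, b, c) for c in chosen),
            )
            remaining.remove(best)
            chosen.append(best)
        merged.append(tuple(n for blk in chosen for n in blk))
    return merged


def build_stencil_blocks(dist, required, k=8):
    """Grow node blocks by a factor of k (8 = 2 x 2 x 2) until they are large enough."""
    blocks = [(n,) for n in dist]  # start from single nodes
    size = 1
    while size < required:
        blocks = group(blocks, dist, k)  # larger blocks replace the smaller ones
        if not blocks:
            raise ValueError("not enough candidate nodes")
        size *= k
    return blocks


# Hypothetical costs: even and odd node ids sit under different switches.
nodes = range(16)
dist = {a: {b: 1.0 if a % 2 == b % 2 else 5.0 for b in nodes if b != a}
        for a in nodes}
print(build_stencil_blocks(dist, required=8))
```

With the example distances, the sixteen candidates split into one block of even-numbered nodes and one block of odd-numbered nodes, so each 2 by 2 by 2 block stays under a single switch.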
Following the implementation described above, a program with stencil communication characteristics is tested by changing the process-core mapping based on the network dynamic performance model.
Since the test is carried out in a shared partitioned environment and the workload and network load change from time to time, the following factors are considered for the test. The test program performs several rounds of 3D mode stencil communication. Each batch of tests is repeated several times, and data with obviously abnormal performance results are eliminated. The test job before optimization is submitted by command and uses the default node allocation mode. When the optimized test job is submitted, the nodes and mapping mode selected by the optimization are specified. To ensure fair competition, the two types of jobs are submitted from different terminals at the same time. If the nodes selected for the two jobs overlap, the two test jobs are submitted in turn. As the message length decreases, the number of communication iterations is increased so that the elapsed time remains easy to measure.
From the test results shown in Table 2, it can be seen that a significant improvement in communication performance is achieved after node selection and process-core mapping optimization based on the dynamic topology. The values in the table represent the time in seconds needed to complete a round of communication.
    The initial set of selected nodes is empty;
    Select the two closest nodes from all candidate nodes;
    While (number of selected nodes < number of required nodes) {
        Select the node newOne from the candidate nodes such that the sum of the distances between newOne and all selected nodes is minimal;
        Add the newOne node to the selected node set;
    }
Figure 8: Algorithm for finding the optimal fully connected subgraph.

Application Optimization Examples
At present, several applications, including molecular dynamics simulation and turbulence simulation, have been optimized using this technique. The performance of these applications on the Sunway TaihuLight system was tested.
Molecular dynamics simulation is a computer simulation method, usually implemented in software, that follows the basic principles of Newtonian mechanics, takes molecular motion as the main object of simulation, and tracks the motion of all particles in the system under study as it evolves over time. Molecular dynamics simulation can not only capture molecular motion but also reveal the microscopic details of atomic motion. The communication mode of this application is the stencil mode. For the molecular dynamics simulation application, the single-step communication performance before and after optimization is compared in Table 3. The values in the table represent the time in seconds needed to complete a round of communication.
A common numerical method for turbulence simulation is to directly solve the NS equation with periodic boundary conditions using the Fast Fourier Transform method, more accurately known as the potential pseudo-spectral method, which has the advantage of handling periodic boundary conditions and has high-order accuracy. A typical turbulence program requires more than 12 arrays to represent different physical components. The communication mode of this application is the all-to-all communication mode. For the turbulence simulation application, the single-step communication performance before and after optimization is compared in Table 4. The values in the table represent the time in seconds needed to complete a round of communication.
It can be seen from the above data that this technology effectively optimizes the communication performance of each application. For the molecular dynamics simulation application in particular, communication performance improved by about a factor of two at half-machine and full-machine scale on the Sunway TaihuLight system, as shown in Figure 11. The time in the figure represents the time in seconds needed for one round of communication.
This technology also improves the scalability of application communication performance. In Figure 12, the horizontal axis is the number of nodes used by the application, and the vertical axis is the single-step communication time. The time in the figure represents the time in seconds needed for one round of communication. It can be seen that the single-step communication time after optimization scales better than before optimization.
    The initial node block set is empty;
    Construct smaller node blocks from all candidate nodes and add them to the node block set;
    While (node block size < number of nodes required) {
        Construct larger node blocks from the smaller node blocks in the node block set;
        Add the larger node blocks to the node block set;
        Clear the smaller node blocks from the node block set;
    }
Figure 10: Algorithm to construct stencil communication node block.

Conclusions
In this paper, a communication performance optimization technology based on topological structure is presented. The communication characteristics of different types of applications are analyzed, and the implementation of a dynamic topology detection mechanism for data communication is studied. Taking into account both the physical structure of the network and the network dynamic performance model, complex collective communication is optimized by improving the process-core mapping. Several applications, including molecular dynamics simulation and turbulence simulation, have been optimized and tested. The average performance has been improved obviously. It can be expected that, for other large-scale applications, this optimization method can also be used to obtain significant improvement in communication performance.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare no conflicts of interest.