Modeling the Training Iteration Time for Heterogeneous Distributed Deep Learning Systems

Distributed deep learning systems effectively respond to the increasing demand for large-scale data processing in recent years. However, the significant investment in building distributed learning systems with powerful computing nodes places a huge financial burden on developers and researchers. It would be good to predict the precise benefit, i.e., how many times of speedup can be obtained compared with training on a single machine (or a few), before actually building such big learning systems. To address this problem, this paper presents a novel performance model of the training iteration time for heterogeneous distributed deep learning systems, based on the characteristics of the parameter server (PS) system with the bulk synchronous parallel (BSP) synchronization style. The accuracy of our performance model is demonstrated by comparing against real measurement results on TensorFlow when training different neural networks on various kinds of hardware testbeds: the prediction accuracy is higher than 90% in most cases.


Introduction
In recent years, we have witnessed the boom of deep learning, which has been successfully applied in (and changed) numerous areas, including image processing [1], speech processing [2], natural language processing [3], human-machine gaming [4], autonomous driving [5], and health care [6]. As the complexity of application scenarios and user requirements grows fast, deep learning models scale larger and larger, taking massive data as training input, which exceeds the processing capability of a single machine. Therefore, large-scale computer clusters are used to speed up the training of such big neural network models, driving the rapid development of distributed deep learning systems [7][8][9][10].
It requires significant investment to build such distributed learning systems, which consist of multiple high-end servers connected to each other with high-speed networks. For example, the Google Brain project, which began exploring the use of very large-scale deep learning systems in 2011, costs an average of more than $100 million per year. For researchers in general universities, a distributed learning system often contains several computing nodes, where each node has several GPUs/CPUs/FPGAs/TPUs and requires a network connection of at least 10 Gbps, which can cost up to $100,000. For such a large investment, it would be good to predict the precise benefit, i.e., how many times of speedup can be obtained compared with training on a single machine (or a few), before actually building such big learning systems. However, this question is still hard to answer.
Although several recent works [11][12][13][14][15][16][17][18] have tried mathematically modeling (some combined with experimental methods) the computation and communication parts of distributed training, they are still not accurate enough in practice. Particularly, when modeling the communication time in a multimachine training environment, they simply assume that all worker nodes (machines) have the same processing speed and are synchronized during communication (passing parameters/gradients), fairly sharing the network bandwidth. However, such a communication model is too idealistic for large distributed learning systems in practice, whose worker nodes can be highly heterogeneous [19,20]. Under such practical scenarios, previous models neglect that worker machines may overlap or interleave with each other in complex ways on both computation and communication during each training iteration. For example, the worker with the best computing capability pushes gradients as soon as it finishes calculation, which overlaps with the calculation of the other workers.
To address the above problem, we present a novel performance model of the training iteration time for heterogeneous distributed deep learning systems. By carefully analyzing the steps of computation and communication during a training iteration, we model the iteration time by inductively calculating the overlap and interference among multiple worker nodes. With our model, one can easily predict the speedup on different distributed hardware platforms knowing only the target neural network and its configurations, before actually building the system and training on it.
We have evaluated the accuracy of our performance model by comparing against real measurement results on TensorFlow [21] when training different neural networks on various kinds of hardware testbeds. Experimental results show that our model can predict the training iteration time with an average accuracy over 95%, with the worst accuracy being 78.3%, under various conditions (training Alexnet and Vgg11/16/19 using various numbers of machines, with or without GPUs, connected with 1 Gbps or 10 Gbps networks).
The rest of the paper is organized as follows. Section 2 introduces the background and contributions of our research. Section 3 gives an overview of the input and output of our prediction model. The floating-point operation statistics formula is derived in Section 4. Training time models of stand-alone/distributed platforms with CPU/GPU are built in Section 5. The experimental results are presented in Section 6.
1.1. Notes. The performance model presented in this paper only focuses on (one of) the most widely used architectures of distributed deep learning systems, i.e., the data-parallel parameter server (PS) system with the bulk synchronous parallel (BSP) synchronization style [22,23], which is shown to have better performance in both practice and theory [24,25]. However, our performance model can easily be extended with the same methods to different architectures such as all-reduce [26] and the stale synchronous parallel (SSP) synchronization style [27,28]; the extension is omitted in this paper. As in previous works (e.g., [15]), we model the computation part based on the number of floating-point operations and the communication part based on the network bandwidth and delay. We assume that distributed deep learning systems can fully utilize the hardware processing capability (which is the case in most current implementations), so implementation issues are not considered in our performance model. Moreover, this paper focuses on predicting the training speedup when using various hardware resources; hence, we do not model the convergence time for training a neural network to a desired accuracy, since the required number of iterations to converge will not change when using a different number or different kinds of machines, as long as the other configurations of the neural network remain unchanged.

Background, Related Works, and Problems
2.1. Background. In March 2017, Jeff Dean, head of Google Brain, gave a speech titled "Building Intelligent Systems with Large Scale Deep Learning" at UC Santa Barbara [29], where he predicted that machine learning expertise could be replaced by super computing power. Distributed machine learning systems are positioned to provide greater computing power. In distributed machine learning systems, the main parallelism models include model parallelism (different machines (GPU/CPU, etc.) in a distributed system are responsible for different parts of the network model) and data parallelism (different machines hold copies of the whole model, each machine is assigned different data, and the computation results of all the machines are then merged in some way). The main system architectures include PS (this architecture isolates the calculation of each worker node, and each worker node only interacts with the server) and All-Reduce (this architecture integrates data from different training nodes and distributes the results to all training nodes after the integration is complete). The main parameter update methods include BSP (Bulk Synchronous Parallel), ASP (Asynchronous Parallel), and SSP (Stale Synchronous Parallel).
Among these combinations of system architecture and parameter update method, PS + BSP is the most popular one and is demonstrated to have better performance in both practice and theory [24,25]. With the PS architecture, the bottleneck bandwidth is utilized up to twice as efficiently as with All-Reduce. Furthermore, PS + BSP can also be used asynchronously.
Therefore, the performance model presented in this paper only focuses on the data-parallel parameter server (PS) system with the bulk synchronous parallel (BSP) synchronization style.

2.2. Related Work. Previous works predicting the training iteration time include [11][12][13][14][15][17]. They build pure mathematical models, or combine them with experimental measurements, to model the computation part and the communication part. All of them assume that all workers start to pull and push in a synchronized way, and calculate the communication time as parameter_size (or gradient_size) divided by bandwidth. In the computation part, the references [11,13,17] combine mathematical models and experimental measurements, while [12,14,15] predict the training iteration time with mathematical models alone.

2.3. Problems and Our Contributions. The iteration time modeling is not accurate in previous work, since distributed learning systems have heterogeneous computation hardware, and each computing unit (GPU/CPU/FPGA/TPU) may have different computing power. As shown in Figure 1(a), previous work models the process as synchronized, which requires the multiple computing units in the system to have the same processing speed. However, as shown in Figure 1(b), the computing process is generally unsynchronized in practice, since heterogeneous hardware has different computation times. Therefore, the accuracy of the total iteration time model can be improved by further considering the effect of heterogeneous computation hardware.

Therefore, we combine the performance and working principles of the training equipment, the neural network structure, and the network bandwidth to establish a mathematical model of the one-round iteration time for distributed training under PS + BSP. Our analysis is based on stochastic gradient descent (SGD), since SGD and its variants are the main algorithms for training DNN models. Moreover, sigmoid is one of the best-known activation functions because of its continuity and differentiability; since it was proposed early and has been widely used, it serves as the first step of our modeling of activation functions, and other activation functions will be modeled in future work. Finally, we measured the iteration time of a variety of deep neural network (DNN) models and compared it with the predicted iteration time; the experimental results demonstrate that our prediction model is highly accurate.

Performance Model Overview
According to the floating-point calculation formulas of neural networks derived in Section 4, the floating-point statistics module in Figure 2 outputs the specific number of floating-point operations of various neural networks to the next module. Afterwards, combining the floating-point operations with the learning system's network bandwidth and hardware parameters, an iteration time prediction module for both single machines and distributed systems is established in Section 5. The prediction times output by the iteration time prediction module are demonstrated to be accurate in Section 6.

Derivation of Floating Point Operations Calculation Formula
According to our investigation, although some floating-point operation statistics of neural networks have been published on academic or technical platforms, our verification found errors in them. Therefore, out of scientific rigor, the floating-point operation statistics formula is derived in this section by combining the specific calculation processes of forward-propagation and back-propagation.

4.1. Formula Derivation of Convolutional Floating Point Operations. The element-wise matrix multiplication (dot product) is the most basic calculation in the forward-propagation convolution operation. Its premise is that the two multiplied matrices must have the same size. After the two matrices are element-wise multiplied and the products are summed, the output is a single number. This calculation can be expressed by formula (1), where a_ij and b_ij are the elements of the n-order matrices A and B, respectively.
Now, we use a 5 × 5 × 3 sample and two 2 × 2 × 3 convolution kernels F0 and F1 to illustrate the process of forward-propagation. The 5 × 5 × 3 sample is expanded to obtain three 5 × 5 matrices, as shown in Figure 3. First of all, the first three 2 × 2 submatrices of the three sample matrices and the three 2 × 2 matrices of the convolution kernel F0 are, respectively, subjected to element-wise multiplication (marked in red in Figure 3). Secondly, the sum of the results of the three multiplications plus F0's bias value 1 equals the first value 4 in the output matrix (also marked in red in Figure 3). The specific calculation process is given in formula (2) and comprises a total of 24 floating-point operations. It is noted that each parameter other than the bias value corresponds to two floating-point operations (a multiplication and an addition, with the bias absorbed into the final addition). Thus, the number of floating-point operations f in this part can be generalized to formula (3), where inch represents the input's third dimension (the number of channels), and kw and kh, respectively, represent the width and height of the convolution kernel. Third of all, move the "matrix to be multiplied" in the input sample by one position. Similarly, perform element-wise multiplication with the convolution kernel F0, and finally add the bias value 1 to obtain the second output value 7. Following the above method, the first 4 × 4 output matrix can be calculated.
Next, we calculate the value of each element of the second 4 × 4 output matrix. The only difference between this process and the previous steps is that the convolution kernel F0 is replaced with F1. After this convolution operation, the second two-dimensional tensor in the output tensor is obtained, so that the complete 4 × 4 × 2 output tensor is available.
It can be observed from this example that each element in the output 4 × 4 × 2 tensor corresponds to 24 floating-point operations; that is, the number of floating-point operations consumed by this example is 4 × 4 × 2 × 24 = 768, i.e., outn × outh × outw × 24, where the value 24 is obtained by formula (3), and outn, outh, and outw are, respectively, the matrix number, height, and width of the output tensor. Therefore, combining this with formula (3), the per-layer floating-point operation count of convolution can be deduced as formula (4).
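The displayed formulas (1), (3), and (4) did not survive text extraction (formula (2) is simply the worked-out instance of (1) for this example and is omitted). Based on the symbol definitions above, they presumably take the following form; this is our reconstruction, not the authors' original typesetting:

\[
c = \sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}\, b_{ij} \quad (1)
\]
\[
f = 2 \times inch \times kw \times kh \quad (3)
\]
\[
f_{\mathrm{conv}} = outn \times outh \times outw \times f \quad (4)
\]

For the example above, (3) gives f = 2 × 3 × 2 × 2 = 24, and (4) gives 2 × 4 × 4 × 24 = 768, consistent with the per-element count.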
An actual neural network is usually composed of multiple convolutional layers plus fully connected layers, as in Alexnet and VggNet. The parameters of the fully connected layers usually account for more than 90% of the entire neural network; although, unlike in a convolutional layer, the same parameter is not recalculated multiple times, their floating-point cost cannot be ignored. The input and output of a fully connected layer are usually one-dimensional tensors. Therefore, its floating-point operations can also be calculated with formula (4), where kh, kw, outh, and outw are all set to 1, and inch and outn are, respectively, the numbers of elements contained in the two one-dimensional tensors before and after the layer.
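As an illustration, here is a minimal Python sketch of formulas (3) and (4) as reconstructed above, including the fully connected special case; the function and argument names are ours, not from the paper:

```python
def conv_layer_flops(inch, kh, kw, outn, outh, outw):
    """Forward FLOPs of one convolutional layer, per formula (4).

    inch: input channels; kh, kw: kernel height/width;
    outn, outh, outw: matrix number, height, width of the output tensor.
    """
    f = 2 * inch * kh * kw          # formula (3): cost of one output element
    return outn * outh * outw * f   # formula (4): cost of the whole output tensor

def fc_layer_flops(n_in, n_out):
    """Fully connected layer as the kh = kw = outh = outw = 1 special case."""
    return conv_layer_flops(inch=n_in, kh=1, kw=1, outn=n_out, outh=1, outw=1)

# The worked example from the text: 5x5x3 input, two 2x2x3 kernels, 4x4x2 output.
assert conv_layer_flops(inch=3, kh=2, kw=2, outn=2, outh=4, outw=4) == 768
```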

4.2. Formula Derivation of Back-Propagation Floating Point Operations. Back-propagation is the process of calculating the gradient, and the calculation of the gradient is performed based on the results of forward-propagation.
According to the calculation of forward-propagation, (5) and (6) can be derived, where σ is the activation function (here, we choose the Sigmoid function), z^(l) is the result obtained by convolving the forward-propagation input of the l-th layer and adding the bias value b^(l), and a^(l) is the output of the l-th layer after the activation function σ is applied to the forward-propagation training result.
The mean square error is the most commonly used loss function, so the loss function is given as (7) based on the mean square error. The factor 1/2 offsets the coefficient produced by differentiation and has no effect on the overall analysis.
The gradients of the weight w and the bias value b of the last layer of the neural network (the output layer), ∂C(w,b)/∂w^(l) and ∂C(w,b)/∂b^(l), are calculated as (8) and (9). The common part of (8) and (9) can be written as the error term of the output layer. The value of σ′(z^(l)) is given in equation (16). The gradient of the output layer (the l-th layer) has been calculated in (8) and (9); in the same way, the gradient of the (l−1)-th layer can be calculated. Since the error of the output layer l has been calculated above, according to back-propagation theory, the error of the current layer is a composite function of all the neuron errors of the previous layer (in back-propagation order); that is, the error of the previous layer can be used to express the error of the current layer. Therefore, for the calculation of the intermediate-layer gradients, there is the recursive formula (17).
Finally, the new parameters w and b can be obtained by (18) and (19), where η is the learning rate (generally a fixed value).
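The displayed equations (5)-(9) and (17)-(19) were lost in extraction. Assuming the standard back-propagation derivation the text describes (mean-square-error loss, sigmoid activation), they presumably read as follows (our reconstruction; δ^(l) denotes the output-layer error term that the text calls the "common part" of (8) and (9)):

\[
z^{(l)} = w^{(l)} a^{(l-1)} + b^{(l)}, \qquad a^{(l)} = \sigma\!\left(z^{(l)}\right) \quad (5,\,6)
\]
\[
C(w,b) = \frac{1}{2}\left\lVert y - a^{(l)} \right\rVert^{2} \quad (7)
\]
\[
\frac{\partial C(w,b)}{\partial w^{(l)}} = \delta^{(l)} \left(a^{(l-1)}\right)^{T}, \qquad
\frac{\partial C(w,b)}{\partial b^{(l)}} = \delta^{(l)}, \qquad
\delta^{(l)} = \left(a^{(l)} - y\right) \odot \sigma'\!\left(z^{(l)}\right) \quad (8,\,9)
\]
\[
\delta^{(l-1)} = \left(\left(w^{(l)}\right)^{T} \delta^{(l)}\right) \odot \sigma'\!\left(z^{(l-1)}\right) \quad (17)
\]
\[
w^{(l)} \leftarrow w^{(l)} - \eta\, \frac{\partial C(w,b)}{\partial w^{(l)}}, \qquad
b^{(l)} \leftarrow b^{(l)} - \eta\, \frac{\partial C(w,b)}{\partial b^{(l)}} \quad (18,\,19)
\]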
In summary, it can be seen from (8) that each parameter of the last layer requires 4 + sig floating-point operations, where sig is the number of floating-point operations required to evaluate the Sigmoid function; the value of sig is derived in Section 4.3. Combined with the 2 floating-point operations in (18), the total number of floating-point operations per weight is 6 + sig. The gradient corresponding to each parameter of the middle layers can likewise be counted as 6 + sig operations via formulas (17) and (18). The gradient ∂C(w,b)/∂b^(l) of the bias value b corresponding to w is calculated as in equation (9), and this value has already been computed in equation (8); therefore, equation (9) incurs no additional floating-point operations, and there are 2 floating-point operations in equation (19). From the above, the statistical formulas for back-propagation floating-point operations are given in equations (22)-(24) in Section 4.3.

4.3. Derivation of Total Floating Point Operations. The total floating-point count of forward-propagation needs to be calculated layer by layer. According to formula (4), the floating-point count of the i-th layer is given by formula (20), where inch represents the number of input channels of the i-th layer, kh and kw, respectively, represent the height and width of the i-th layer's convolution kernel, and outn, outh, and outw, respectively, represent the matrix number, height, and width of the output tensor of the i-th layer. The total floating-point count of forward-propagation is given by formula (21).
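Formulas (20) and (21) did not survive extraction; from the definitions above, they presumably take the form (our reconstruction):

\[
F_{i} = 2 \times inch_{i} \times kh_{i} \times kw_{i} \times outn_{i} \times outh_{i} \times outw_{i} \quad (20)
\]
\[
F_{\mathrm{fp}} = \sum_{i} F_{i} \quad (21)
\]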
The sum of the numbers of parameters w and b is the total number of parameters of the neural network (para). The number of b (para_b) is the product length × width × height of the tensor output by each layer. The number of w is the total number of parameters minus the number of b (para − para_b). Therefore, the floating-point count of back-propagation is given by equations (22)-(24), where FLOPs_w represents the number of floating-point operations required for one weight in one back-propagation pass, and FLOPs_b represents the number of floating-point operations required for one bias value in one back-propagation pass.
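Equations (22)-(24) were also lost; from the per-parameter counts derived in Section 4.2, they presumably read (our reconstruction, with (24) summing over all parameters):

\[
FLOPs_{w} = 6 + sig \quad (22), \qquad FLOPs_{b} = 2 \quad (23)
\]
\[
F_{\mathrm{bp}} = (para - para_{b}) \times FLOPs_{w} + para_{b} \times FLOPs_{b} \quad (24)
\]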
From formula (15), the value of sig is 2 + x, where x represents the number of floating-point operations consumed to calculate the exponential function e^x. Since the calculation of e^x in Python3 is essentially an approximation obtained from the Taylor expansion of e^x (equation (25)), we obtain equation (26). Here, i is the number of terms of the Taylor expansion; this value is not fixed in the underlying implementation of Python3, which adjusts it dynamically according to the actual value being computed, increasing the calculation accuracy by 3 digits each time. Due to the complexity and uncertainty of the parameters in deep learning, we take the value of i from experimental results; across multiple experiments, i is approximately equal to 20.
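Equations (25) and (26) presumably take the following form (a reconstruction from the surrounding text; x_e(i) denotes the floating-point cost of evaluating the i-term expansion and is our notation):

\[
e^{x} \approx \sum_{n=0}^{i} \frac{x^{n}}{n!} \quad (25)
\]
\[
sig = 2 + x_{e}(i), \qquad i \approx 20 \quad (26)
\]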
In summary, in one iteration round with a batch size of bs, the total number of floating-point operations (FLOPs) is expressed as formula (27).

T represents the total time of a one-round iteration, and T_cpu is obtained by formula (28). T_pull is the total time for all workers to pull parameters, and T_push is the total time for all workers to push gradients; both are communication times, and the communication cost is mainly manifested in T_push. n is the number of workers, P and G are, respectively, the sizes of the parameters and gradients (converted to bits), and B is the bandwidth of the parameter server (bit/s).
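Putting these definitions together, the distributed CPU model that this paragraph describes presumably combines as follows for the baseline case in which all workers compute in lockstep (a reconstruction; the original display equation did not survive extraction):

\[
T_{\mathrm{push}} = \frac{nG}{B}, \qquad T_{\mathrm{pull}} = \frac{nP}{B}, \qquad
T = T_{\mathrm{cpu}} + T_{\mathrm{push}} + T_{\mathrm{pull}}
\]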

Predictive Modeling of Distributed Training Time Based on CPU and GPU
The one-round iteration time model for platforms with unequal computing power and equal communication power is illustrated in Algorithm 1. The main scenario considered in this paper is the case where multiple workers as well as the PS are connected to the same switch. In this case, it is common for every machine to have equal communication power, and the link between the PS and the switch is the main communication bottleneck. Figure 4 gives a brief explanation of Algorithm 1. In the figure, nodes w0–w3 have the same push time and pull time, but their calculation times differ and gradually increase. To calculate the one-round iteration time T, first initialize T to the sum of the calculation time of w0 and the push time. Second, since T = T_cpu0 + T_push ≥ T_cpu1 and T_cpu1 > T_cpu0, the push of w0 has not yet ended when w1 finishes calculating. Because of the communication bottleneck on the link between the PS and the switch, the gradient of w1 enters the sending queue and waits for the gradient of w0 to be sent; the push of w1 thus finishes T_push after the push of w0 ends, which is shown in yellow. Therefore, the total time from the start of the local calculation of w0 to the end of the push of w1 equals the sum of the calculation time of w0, the push time of w0, and the push time of w1, which becomes the new value of T. As shown in Figure 4, this value of T can be visualized by moving the push legend of w1 directly behind the push legend of w0. Third, since the calculation time of w2 is larger than the current value of T, T is directly set to the sum of the calculation time of w2 and the push time of w2. Fourth, similar to the second step, the new value of T is the calculation time of w2 plus the push time of w2 plus the push time of w3. Finally, add the pull time to T to obtain the one-round iteration time (by default, the parameter server uses a balancing algorithm to ensure that all computing nodes receive the new parameters at the same time). A sketch of this procedure is given below.
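Since Algorithm 1 itself is not reproduced here, the following minimal Python sketch implements the procedure exactly as the Figure 4 walkthrough describes it; all names are ours, and we assume every worker shares the single bottleneck link to the PS, so pushes serialize:

```python
def iteration_time_cpu(compute_times, t_push, t_pull):
    """One-round iteration time under PS + BSP with unequal compute
    and equal communication power (a sketch of Algorithm 1).

    compute_times: per-worker local computation times, any order.
    t_push / t_pull: per-worker push / pull time over the bottleneck link.
    """
    T = 0.0
    for t_cpu in sorted(compute_times):   # workers finish computing in this order
        if t_cpu > T:
            # The link is idle when this worker finishes: its push starts at once.
            T = t_cpu + t_push
        else:
            # Earlier pushes still occupy the link: this push queues behind them.
            T += t_push
    # BSP: every worker pulls the new parameters after the last push completes.
    return T + len(compute_times) * t_pull
```

Note that when communication dominates (every t_push-extended T stays above the remaining compute times), the loop degenerates to T_min plus n pushes plus the pulls, which is exactly formula (32) below.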
In our analysis, the ratio of computation time to communication time and the local computation time of each worker are not fixed. When the ratio of computation time to communication time differs, the one-round iteration time is calculated differently according to Algorithm 1 and Algorithm 2. In particular, when the communication time is greater than the computation time of each worker, the one-round iteration time simplifies to formula (32), T = T_min + T_push + T_pull, where T_min is the local computation time of the fastest worker (the one with the best computing capability among the n workers). In this case, the bottlenecked push process starts as soon as the fastest worker completes its computation, and the iteration time is not related to the computation times of the other workers.
In other words, under a communication bottleneck, replacing some of the workers with ones of lower computational performance does not change the iteration time, but it significantly saves computational resources compared with using equal computing power. It is therefore best to make the communication time as close to the computation time as possible. When there is a communication bottleneck, upgrading to a higher-bandwidth network device yields a better performance improvement than upgrading to a better computing device, and the reverse is also true.

Training Time Modeling of Stand-Alone GPU Platform.
The floating-point computing power of a GPU is usually expressed in floating-point operations per second (FLOPS). The maximum GPU floating-point throughput with stream processors is gpuhz × sp × fc, where gpuhz represents the GPU clock speed, sp represents the number of stream processors, and fc represents the number of floating-point operations that each stream processor can perform in a single clock cycle. From this, the single-iteration time model of a single-GPU computer can be established as formula (33), in which the value of FLOPs is obtained by formula (27). The value of fc depends on whether the Fused Multiply-Add (FMA) instruction set is used: if it is, fc = 2; otherwise, fc = 1.
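For concreteness, a hedged Python sketch of the two peak-throughput models, formula (28) for the single-core CPU and formula (33) for the single GPU (function names are ours):

```python
def standalone_cpu_time(flops, cpuhz, simd):
    """Formula (28): one-round iteration time on a single CPU core,
    assuming SIMD completes `simd` operations per clock cycle."""
    return flops / (cpuhz * simd)

def standalone_gpu_time(flops, gpuhz, sp, fc=1):
    """Formula (33): one-round iteration time on a single GPU.
    fc = 2 when fused multiply-add (FMA) is used, otherwise fc = 1."""
    return flops / (gpuhz * sp * fc)
```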

Training Time Modeling of Distributed GPU Platform.
Since sending data is executed by the CPU, a training device that uses a GPU for distributed training can execute the two tasks of training and sending data in parallel. In this scenario, the first push of a training node occurs as soon as back-propagation finishes calculating the gradient of the first layer. In addition, since most of today's GPUs are clocked at more than 1.5 GHz, even with only one stream processor the calculation speed is 1.5 × 32 = 48 Gbit/s; therefore, only if the sending speed were greater than 48 Gbit/s would the sending time be less than the calculation time. However, current GPUs usually have hundreds of stream processors or more, so the calculation speed is obviously much faster than the sending speed. Consequently, the calculation time in the one-round iteration time model for distributed multimachine training with GPUs can be regarded as the forward-propagation time plus the time to complete the first layer of back-propagation; the latter can be obtained by dividing the total back-propagation time by the number of pushes. The number of pushes is determined experimentally and is shown in Table 1. The one-round iteration time modeling algorithm for the distributed GPU platform is shown in Algorithm 2. We now give a brief explanation of Algorithm 2 in conjunction with Figure 5. In the figure, w0–w3 are four training nodes, and the red square represents the forward-propagation time. The blue squares and the green squares represent the back-propagation time and the push time, respectively, and both are divided into several small segments according to the number of pushes. As mentioned above, the calculation speed is greater than the sending speed, so each green segment is slightly longer than the corresponding blue segment. First, when w0 finishes calculating the first blue segment, it starts to send the gradient. When the second blue segment is calculated, since the first gradient has not yet been sent, it enters the sending queue and waits; that is, the green segment representing the push of the second gradient is placed behind the green segment representing the push of the first gradient, and the same holds for subsequent segments. Second, when w1 calculates its first gradient (the first blue segment) and starts sending, since the gradient of w0 has not been fully sent yet (green squares), the gradient of w1 enters the waiting queue, and its push time is added directly after the push time of w0. At this point, the total time T equals the forward-propagation time of w0 (red square), plus the calculation time of the first gradient of w0 (the first blue segment), plus the push times of w0 and w1 (green and yellow squares). In the third step, the forward-propagation time of w2 plus its first-layer back-propagation time is greater than T, so T is set to that value plus the push time of w2. Fourth, the forward-propagation time of w3 plus its first-layer back-propagation time is less than T, so the push time of w3 is simply added to T. Finally, by adding the total pull time (pull × 4, the four gray squares in the figure) to T, one round of iteration time is obtained. A sketch of this procedure is given below.
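The following minimal Python sketch mirrors Algorithm 2 as the Figure 5 walkthrough describes it (names are ours; we assume computation is faster than sending, so once a worker's first segment is queued, its remaining segments follow back-to-back on the link):

```python
def iteration_time_gpu(forward_times, backward_times, n_push, t_push_seg, t_pull):
    """One-round iteration time on a distributed GPU platform
    (a sketch of Algorithm 2).

    forward_times / backward_times: per-worker forward / total backward time.
    n_push: number of gradient pushes per iteration (see Table 1).
    t_push_seg: time to send one gradient segment over the bottleneck link.
    t_pull: per-worker pull time.
    """
    # A worker's first push is ready once the first back-prop layer is done,
    # approximated as the total backward time divided by the number of pushes.
    ready = sorted(f + b / n_push for f, b in zip(forward_times, backward_times))
    push_total = n_push * t_push_seg  # one worker's segments go back-to-back
    T = 0.0
    for r in ready:
        if r > T:
            T = r + push_total        # link idle: this worker pushes immediately
        else:
            T += push_total           # link busy: segments queue behind earlier ones
    return T + len(ready) * t_pull    # finally every worker pulls new parameters
```

When communication dominates, this again reduces to the fastest worker's readiness time plus the total push and pull times, i.e., formula (32).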
Similarly, when the communication time is greater than the computation time of each worker, the one-round iteration time can also be summarized as formula (32); the iteration time is then related only to the computation time of the fastest worker.

Experimental Preparation and Experimental Environment
Introduction. Existing neural-network papers generally do not report the number of floating-point operations, and there is no reliable statistical work on the floating-point operation counts of neural networks. As preparatory work for the performance evaluation, we computed statistics for the forward-propagation and back-propagation of a number of commonly used convolutional neural networks according to the formulas in Section 4. In addition, we obtained the number of gradient pushes in one round of iteration for some neural networks from the corresponding distributed experiments. The results are shown in Table 1.

Experimental Results of CPU and GPU on the Stand-Alone Platform. The prediction times and actual measured times of one round of iteration on the stand-alone CPU platform are shown in Table 2 and Figure 6. The results in this setting are quite good, with an average accuracy of more than 90%, which demonstrates the effectiveness of the proposed prediction model and shows that our modeling of the computation time is essentially correct. There are two main sources of error: the randomness of system scheduling, and the limited floating-point precision of Python 3.7. Due to these two factors, even if the same experiment is performed on the same device multiple times, the results will not be exactly the same. The prediction times and measured times of one round of iteration on the stand-alone GPU platform are shown in Table 3 and Figure 7. Since the FMA instruction is not used in these experiments, the value of fc is 1. There are two sources of error in stand-alone GPU training. First, the computing power is so strong that training finishes too quickly for the timing function to capture it accurately. Second, some training models such as Alexnet are too small to occupy all of the GPU's stream processors, which means the GPU's computing power is not fully utilized and results in prediction errors. It can be seen that the accuracy of single-machine GPU training increases with the scale of the neural network. Vgg19 is the largest of these four neural networks, and its accuracy is above 96% in the measurements with bs = 32 and bs = 64. The scale of Vgg16 is slightly smaller than that of Vgg19, and its accuracy is stable at around 90%. However, Alexnet and Vgg11 show relatively large fluctuations in accuracy between bs = 32 and bs = 64 due to their smaller scale, which confirms the above-mentioned sources of error.

Experimental Results of CPU and GPU on the Distributed Platform. The CPU frequency of the two training nodes is 2 GHz (Intel® Xeon™ E5-2620), and the CPU frequency of the parameter server node is 2.2 GHz (Intel® Xeon™ E5-2660). The prediction times and measured times of one round of iteration on the distributed CPU platform are shown in Table 4 and Figure 8. The performance on Alexnet is only mediocre: the accuracy is 78.3% and 82.3% when bs is 32 and 64, respectively. However, the accuracy on the other three convolutional neural networks reaches around 95%.
Since the distributed platform is built on two virtual machines on one physical machine, we speculate that the parameter server performs only one pull operation on the physical machine, and the two virtual machines obtain the updated parameters through shared memory. Due to the small scale of Alexnet, its training time is short, so under-counting the time of one pull has a large impact on accuracy.
If the time of one pull is added to the measured time, the accuracy increases from 78.3% and 82.3% to about 86% and 89%, respectively. The other three convolutional neural networks are much larger than Alexnet, so under-counting one pull has little impact on their accuracy. The distributed GPU experiment uses two physical machines, each with one GPU (NVIDIA Quadro RTX 4000); the CPU of the parameter server node is an Intel® Xeon™ E5-2660. It can be seen from Table 5 and Figure 9 that the accuracy is relatively stable, mainly around 95%. This is because the GPU's computing power is so strong that its calculation time is negligible within one round of iteration; the measured time is therefore mainly composed of the communication time of parameter and gradient transmission, which keeps the accuracy high and stable.
We also ran experiments on a distributed heterogeneous GPU platform by adding a heterogeneous GPU (GeForce GTX 1060 6 GB), which has only about half the computing performance of the other GPU. Since the communication time in our experimental environment is greater than the computation time, the iteration time can be formulated by formula (32). It can be seen from Table 6 and Figure 10 that the accuracy is relatively stable and over 93%. Again, this is because the GPU's computing power is strong and its calculation time is negligible within one round of iteration. To further verify the modeling accuracy, we increased the network bandwidth to 10 GbE; the results are shown in Table 7 and Figure 11. The accuracy is mainly over 80%, and the iteration time is reduced significantly by alleviating the network bottleneck. However, the accuracy decreases for two reasons. (1) Communication time error: we found experimentally that there is a startup overhead a in the pull and push processes which is not related to the batch size or the bandwidth we use. In our experimental environment, the startup overhead a is about 0.5 s, which cannot be ignored in our 10 GbE experiments. (2) Computation time error: in the GPU training process, there exists a memory-copy step from host to GPU, which cannot be ignored when the local computation time of a worker is short.
As for Resnet50, the startup overhead can be ignored in our experiments, and its percentage of computation is much higher than that of the others, so it still maintains high accuracy.
In this section, we built a distributed PS system based on Linux and TensorFlow/benchmarks. Using synchronous communication algorithms, distributed measurements were carried out on various types of CPUs and GPUs. The analysis of the results shows that, although the sources of error cannot be eliminated by mathematical modeling, the mathematical model established in this paper still achieves a high accuracy of around 90%.

Conclusions and Further Study
Based on the PS architecture, this paper studies in depth the key factors that affect the performance of distributed machine learning training and combines these factors to establish one-round iteration time prediction models for training on stand-alone/distributed CPU/GPU platforms. In addition, this paper designs rigorous experiments to verify that the accuracy of the proposed performance prediction model is higher than 90% in most cases; the highest accuracy, 99.4%, is achieved in the prediction for Vgg19 on the stand-alone CPU platform. As for future work, on the one hand, considering that increasing the number of training nodes will lead to communication bottlenecks, we will study one-round iteration time modeling for multiple parameter servers; on the other hand, we plan to incorporate local SGD synchronization communication into the scope of the model.

Figure 1: Comparison of the communication time modeled in previous works and in practice. (a) Modeled as synchronized, (b) but unsynchronized in practice.
Figure 2: One round of iterative time modeling framework.

Figure 3: Illustration of the forward-propagation convolution example in Section 4.1.

Figure 4: One-round iteration time modeling algorithm with nonequal computing power CPUs.

Figure 7: Comparison between prediction time and measurement time on the stand-alone GPU platform. (a) bs = 32 and (b) bs = 64.

Figure 8: Comparison between prediction time and measurement time on the distributed CPU platform. (a) bs = 32 and (b) bs = 64.

Figure 9: Comparison between prediction time and measurement time on the distributed GPU platform. (a) bs = 32 and (b) bs = 64.

Figure 10: Comparison between prediction time and measurement time on the distributed heterogeneous GPU platform with 1 GbE. (a) bs = 16 and (b) bs = 32.

Figure 11: Comparison between prediction time and measurement time on the distributed heterogeneous GPU platform with 10 GbE. (a) bs = 16 and (b) bs = 32.
Training Time Modeling of Stand-Alone CPU Platform. Single Instruction Multiple Data (SIMD) technology enables multiple data items to be processed in parallel within the same CPU clock cycle with only one instruction. In general, the degree of data parallelism depends on the width of the CPU registers: a 64-bit register can be split into 8 8-bit registers, so 8 8-bit data operations can be completed at the same time and the efficiency is increased by 8 times; similarly, it can be divided into 2 32-bit or 4 16-bit registers. According to this operating principle of the CPU, the one-round iteration time model of a single-core CPU can be established as formula (28), where T_cpu represents the predicted training time, the value of FLOPs is obtained by formula (27) in Section 4.3, cpuhz represents the CPU frequency, and simd represents the number of data items that can be operated on in parallel after optimization with SIMD instructions.

Table 2: Prediction time (PT), measurement time (MT), and accuracy on the stand-alone CPU platform.

Table 3: Prediction time (PT), measurement time (MT), and accuracy on the stand-alone GPU platform.

Table 4: Prediction time (PT), measurement time (MT), and accuracy on the distributed CPU platform.

Table 5: Prediction time (PT), measurement time (MT), and accuracy on the distributed GPU platform.

Table 6: Prediction time (PT), measurement time (MT), and accuracy on the distributed heterogeneous GPU platform connected with 1 GbE.

Table 7: Prediction time (PT), measurement time (MT), and accuracy on the distributed heterogeneous GPU platform connected with 10 GbE.