Pipelined Training with Stale Weights in Deep Convolutional Neural Networks

Introduction
Machine learning (ML), in particular convolutional neural networks (CNNs), has advanced at an exponential rate over the last few years, enabled by the availability of high-performance computing devices and the abundance of data. Today, CNNs are applied in a variety of fields, including computer vision [1], biological and medical science [2], social media [3], image analysis and classification [4, 5], and urban planning [6], to name a few.
However, modern CNNs have grown in size and complexity to demand considerable memory and computational resources, particularly for training.
This growth sometimes makes it difficult to train an entire network with a single accelerator [7-9]. Instead, the network is partitioned among multiple accelerators, typically by distributing its layers among the available accelerators, as shown in Figure 1 for an example 8-layer network. The 8 layers are divided into 4 computationally balanced partitions, $P_0, \ldots, P_3$, and each partition is mapped to one of the 4 accelerators, $A_0, \ldots, A_3$. Each accelerator is responsible for the computations associated with the layers mapped to it.
However, the nature of the backpropagation algorithm used to train CNNs [10] is that the computations of a layer are performed only after the computations of the preceding layer in the forward pass of the algorithm and only after the computations of the succeeding layer in the backward pass. Further, the computations for one batch of input data are only performed after the computations of the preceding batch have updated the parameters (i.e., weights) of the network. These dependences underutilize the accelerators, as shown by the space-time diagram in Figure 2; only one accelerator can be active at any given point in time.
The underutilization of accelerators can be alleviated by pipelining the computations of the backpropagation algorithm over the accelerators [7-9, 11, 12], that is, by overlapping the computations of different input batches on the multiple accelerators. However, this overlap causes an accelerator to potentially use weights that are yet to be updated by an accelerator further down in the pipeline. The use of such stale weights can negatively affect the statistical efficiency of the network, prevent the convergence of training, or produce a model with lower inference accuracy [7-9, 11, 12].
Existing pipelined training approaches either avoid the use of stale weights (e.g., with the use of microbatches [8]), constrain the training to ensure the consistency of the weights within an accelerator (e.g., using weight stashing [9]), utilize weight adjustments (e.g., weight prediction [11]), or limit the use of pipelining to very small networks (e.g., [13]). However, these approaches underutilize accelerators [8], inflate memory usage to stash multiple copies of weights [9], or are unable to handle large networks [13].
In this work, we explore pipelining that allows for the full utilization of accelerators while using stale weights. This results in a pipelining scheme that, compared to existing schemes, is simpler to implement, fully utilizes the accelerators, and has lower memory overhead. We evaluate this pipelining scheme using 4 CNNs: LeNet-5 (trained on MNIST), AlexNet, VGG, and ResNet (all trained on CIFAR-10). These CNNs are commonly used in the literature for the evaluation of pipelined training, and they represent models with a wide range of parameter sizes and complexity. We analyze the impact of weight staleness and show that if pipelining is introduced in early layers in the network, training does converge and the quality of the resulting models is comparable to that of models obtained with nonpipelined training. For the 4 networks, the drop in accuracy is 0.4%, 4%, 0.83%, and 1.45%, respectively. However, inference accuracies drop significantly when the pipelining is deeper in the network, up to 12% for VGG and 8.5% for ResNet. This drop makes the models trained with pipelining inferior to ones trained without it. On the one hand, limiting pipelining to early layers is often not a limitation since the early convolutional layers in the network typically contribute the bulk of the computations and thus are the ones to use and benefit from pipelining. On the other hand, we also address this drop in accuracy with a hybrid scheme that combines pipelined and nonpipelined training to maintain inference accuracy while still delivering performance improvements.
We demonstrate the potential of our approach to pipelined training using ResNet-56/110/224/362 trained on CIFAR-10 and CIFAR-100 with PyTorch on a 2-GPU system. We show that our pipelined training delivers a speedup of up to 1.8X with a drop of no more than about 2-3% in inference accuracy.
Thus, this work makes the following contributions: (...)
The remainder of this paper is organized as follows. Section 2 briefly describes the backpropagation algorithm for the training of CNNs. Section 3 reviews the current literature on pipelined training. Section 4 details our pipelining scheme and how nonpipelined backpropagation and pipelined backpropagation are combined. Section 5 highlights some of the implementation details. Experimental evaluation is presented in Section 6. Finally, Section 7 gives concluding remarks and directions for future work. A set of appendices provides the training hyperparameters, more detailed results on memory usage, and a proof of convergence for our scheme.

The Backpropagation Algorithm
The backpropagation algorithm [10] consists of two passes: a forward pass that calculates the output error and a backward pass that calculates the error gradients and updates the weights of the network. The two passes are performed for input data one minibatch at a time.
In the forward pass, a minibatch is fed into the network, propagating from the first to the last layer. At each layer $l$, the activations of the layer, denoted by $x^{(l)}$, are computed using the weights of the layer, denoted by $W^{(l)}$. When the output of the network (layer $L$), $x^{(L)}$, is produced, it is used with the true data label to obtain a training error $e$ for the minibatch.
In the backward pass, the error $e$ is propagated from the last to the first layer. The error gradients with respect to the preactivations of layer $l$, denoted by $\delta^{(l)}$, are calculated. Further, the error gradients with respect to the weights of layer $l$, $\partial e/\partial W^{(l)}$, are computed using the activations from layer $l-1$ (i.e., $x^{(l-1)}$) and $\delta^{(l)}$. Subsequently, $\delta^{(l)}$ is used to calculate $\delta^{(l-1)}$. When $\partial e/\partial W^{(l)}$ has been computed for every layer, the weights are updated using the error gradients.
In the forward pass, the activations of layer $l$, $x^{(l)}$, cannot be computed until the activations of the previous layer, i.e., $x^{(l-1)}$, are computed. In the backward pass, $\partial e/\partial W^{(l)}$ can only be computed once $x^{(l-1)}$ and $\delta^{(l)}$ have been computed. Moreover, $\delta^{(l)}$ depends on $\delta^{(l+1)}$. Finally, for a given minibatch, the backward pass cannot be started until the forward pass is completed and the error $e$ has been determined.
The above dependences ensure that the weights of the layers are updated using the activations and error gradients calculated from the same batch of training data in one iteration of the backpropagation algorithm. Only when the weights are updated can the next batch of training data be fed into the network. These dependences limit parallelism when a network is partitioned across multiple accelerators and allow only one accelerator to be active at any point. This results in underutilization of the accelerators. It is this limitation that pipelining addresses.
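The dependences described in this section can be made concrete with a small sketch. The Python code below is illustrative only: it assumes a chain of scalar layers with tanh activations and a squared-error loss (choices made here for brevity, not taken from the networks evaluated in this paper), and the function name `train_step` is hypothetical. It shows that the forward loop must visit layers in increasing order, the backward loop in decreasing order, and that the weights change only after every gradient is available.

```python
import math

def train_step(weights, x0, y, lr=0.1):
    """One backpropagation iteration on a chain of scalar tanh layers."""
    L = len(weights)

    # Forward pass: x[l] needs x[l-1], so layers are visited in order 1..L.
    x = [x0]
    for l in range(1, L + 1):
        x.append(math.tanh(weights[l - 1] * x[l - 1]))
    e = 0.5 * (x[L] - y) ** 2  # training error for this sample

    # Backward pass: delta^(l) needs delta^(l+1), so layers are visited L..1.
    grads = [0.0] * L
    delta = (x[L] - y) * (1.0 - x[L] ** 2)  # delta^(L)
    for l in range(L, 0, -1):
        grads[l - 1] = delta * x[l - 1]  # de/dW^(l) uses x^(l-1) and delta^(l)
        if l > 1:
            # delta^(l-1) is derived from delta^(l) and W^(l)
            delta = delta * weights[l - 1] * (1.0 - x[l - 1] ** 2)

    # Weights are updated only after all gradients have been computed.
    new_weights = [w - lr * g for w, g in zip(weights, grads)]
    return new_weights, e

weights = [0.5, 0.5, 0.5]
_, first_error = train_step(weights, x0=1.0, y=0.5)
for _ in range(200):
    weights, last_error = train_step(weights, x0=1.0, y=0.5)
print(f"error before: {first_error:.5f}, after: {last_error:.5f}")
```

Because each loop iteration consumes the result of the previous one, no two layers of the same minibatch can be processed at the same time, which is the source of the underutilization noted above.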

Literature Review
There has been considerable work that explores parallelism in the training of deep networks. In data parallelism [14-19], each accelerator has a copy of the model. The accelerators process different minibatches of training data simultaneously in iterations, aggregating gradients to update weights at the end of each iteration. This is done synchronously [14, 17] or asynchronously [16]. More related to our work is model parallelism [16, 20-23], in which a large model is partitioned across different accelerators, each responsible for updating the weights for its portion of the model. The data dependences described in Section 2 allow only one accelerator at a time to be active, resulting in underutilization. Pipelined parallelism addresses this underutilization and is the focus of our work. Below, we review salient work on pipelined parallelism in training.
Early work on pipelined training focuses on small networks and does not study pipelined parallelism in detail. Petrowski et al. [24] introduced the idea of pipelined backpropagation in neural network training. However, they realized the idea for only a 3-layer perceptron on a torus of 16 processors. Mostafa et al. [13] implemented a proof-of-concept validation of pipelined backpropagation training for a 3-layer fully connected binary-state neural network with truncated error on an FPGA. However, the implementation does not have coarse-grained layer-wise pipelined parallelization.
More recently, PipeDream [9] implemented pipelined training for large neural networks such as VGG-16, Inception-v3, and S2VT across multiple GPUs. It limits the usage of stale weights by a technique referred to as weight stashing. The technique keeps multiple versions of the weights during training to ensure that the correct (i.e., nonstale) weights are used in each pipeline stage. This technique results in high inference accuracies and high utilization of the accelerators but increases the memory footprint of training.
GPipe [8] implements a library in TensorFlow to enable pipelined parallelism for the training of large neural networks. It pipelines microbatches within each minibatch to keep the gradients consistently accumulated. This eliminates the use of stale weights during training but at the expense of "pipeline bubbles" that degrade performance. GPipe utilizes these bubbles to reduce memory footprint by recomputing forward activations instead of storing them during the backward pass of training. The approach results in high inference accuracies with no increase in memory footprint, but the pipeline bubbles underutilize the accelerators, resulting in lower performance.
Huo et al. [12] implemented decoupled backpropagation (DDG) using delayed gradient updates. They showed that DDG guarantees convergence through a convergence analysis. Similar to PipeDream, DDG uses multiple copies of the weights, thus increasing memory footprint. Further, DDG pipelines only the backward pass of training, leaving the forward pass unpipelined, which underutilizes resources. Huo et al. [25] followed up by proposing feature replay (FR), which recomputes activations during the backward pass, similar to GPipe, reducing memory footprint and improving inference accuracy over DDG. Nonetheless, also similar to GPipe, the recomputations lower speedups.
Chen et al. [11] introduced weight prediction to mitigate weight staleness.Although their pipelined training shows improvement in throughput, they trained their networks for only 5000 iterations and it is not clear if their method can achieve standard model quality; their resulting model accuracies are much lower than typical for the models they train.
Guan et al. [26] presented XPipe, which combines elements of GPipe and PipeDream implementations of pipelined training to improve efficiency by allowing the overlapping of the pipelines of multiple microbatches from different minibatches.Nonetheless, they avoid the use of stale weights using weight prediction.
Kosson et al. [27] extended weight prediction in a fine-grained pipelined scheme that inserts pipeline registers between every pair of layers and limits the minibatch size to 1, aiming for a hardware implementation. They used a weight adjustment scheme to tackle weight staleness.
Park et al. [28] described HetPipe, which combines data parallelism in the form of virtual workers with the pipelined parallelism of PipeDream, targeting heterogeneous clusters of GPU workstations. Jia et al. [29] proposed FlexFlow, a framework that explores data and model parallelism in training, but they did not consider pipelined parallelism. Li et al. [30] proposed Pipe-SGD, which pipelines computation and communication as opposed to the forward and backward passes. The model is not partitioned across the accelerators. Instead, the pipelining is used to overlap the communication of weight updates with computation, hiding communication time and limiting the staleness to only 1 cycle. Therefore, large models may not fit on an accelerator.
A common theme of the above body of work is that it employs various techniques to avoid the use of stale weights. These techniques introduce either computational inefficiencies or memory footprint increases. In this work, we propose the use of stale weights and study their impact on the quality of trained models. We show that when pipelining is implemented in the early network stages or when hybrid training is used, we can train models with high prediction accuracy, smaller memory footprint, and higher performance.
For example, in contrast to PipeDream and DDG, we do not maintain multiple copies of weights, reducing memory footprint.In contrast to GPipe and Huo et al. [25], our approach has no pipeline bubbles and does not replicate computations, resulting in better performance.Further, compared to Chen et al. [11], our pipelined training can produce models with a final quality that is comparable to the standard model quality for VGG-16 and ResNet with different depths on CIFAR-10/CIFAR-100 datasets.

Proposed Pipelined Training Method
4.1. Pipelined Backpropagation. We illustrate our pipelined backpropagation implementation with the $L$-layer network shown in Figure 3, using conceptual pipeline registers. Two registers are inserted between layers $l$ and $l+1$, one register for the forward pass and a second for the backward pass. The forward register stores the activations of layer $l$ ($x^{(l)}$). The backward register stores the gradients $\delta^{(l+1)}$ of layer $l+1$.
This defines a 4-stage pipelined backpropagation. The forward pass for layers 1 to $l$ forms forward stage $FS_1$. The forward pass for layers $l+1$ to $L$ forms forward stage $FS_2$. Similarly, the backward pass for layers $l+1$ to $L$ and 1 to $l$ forms backward stages $BKS_1$ and $BKS_2$, respectively. The forward and backward stages are executed in a pipelined fashion on 3 accelerators: one for $FS_1$, one for both $FS_2$ and $BKS_1$, and one for $BKS_2$ (we combine $FS_2$ and $BKS_1$ on the same accelerator to reduce weight staleness, as will become evident shortly). In cycle 0, minibatch 0 is fed to $FS_1$. The computations of the forward pass are done as in the traditional nonpipelined implementation. In cycle 1, layer $l$ activations $x^{(l)}$ are fed to $FS_2$ and minibatch 1 is fed to $FS_1$. In cycle 2, the error for minibatch 0 computed in $FS_2$ is directly fed to $BKS_1$, the activations of layer $l$, $x^{(l)}$, are forwarded to $FS_2$, and minibatch 2 is fed to $FS_1$. This pipelined execution is illustrated by the space-time diagram in Figure 4 for 5 minibatches. The figure depicts the minibatch processed by each accelerator in cycles 0 to 6. At steady state, all the accelerators are active in each cycle of execution.
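A space-time table of this kind can be generated with a few lines of Python. The sketch below is not the authors' tooling. It assumes that every stage takes exactly one cycle, that accelerator $a$ (numbered from 0) works on minibatch $c - a$ in cycle $c$, and that $FS_2$ and $BKS_1$ for a given minibatch complete within the same cycle on their shared accelerator. The last assumption is the reading that makes 5 minibatches occupy cycles 0 to 6; the function name `schedule` is hypothetical.

```python
def schedule(num_accelerators, num_minibatches):
    """Return rows[cycle][accelerator] = minibatch index, or None when idle."""
    cycles = num_minibatches + num_accelerators - 1
    rows = []
    for c in range(cycles):
        row = []
        for a in range(num_accelerators):
            mb = c - a  # minibatch reaching accelerator a in cycle c
            row.append(mb if 0 <= mb < num_minibatches else None)
        rows.append(row)
    return rows

# Stage(s) hosted by each of the 3 accelerators in the 4-stage example.
STAGES = ["FS1", "FS2+BKS1", "BKS2"]

table = schedule(len(STAGES), 5)
for c, row in enumerate(table):
    cells = ", ".join(
        f"{stage}: {'-' if mb is None else mb}" for stage, mb in zip(STAGES, row)
    )
    print(f"cycle {c}: {cells}")
```

Under these assumptions the pipeline fills during cycles 0 and 1, every accelerator is busy in cycles 2 to 4, and the pipeline drains in cycles 5 and 6.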
The above pipelining scheme utilizes weights in $FS_1$ that are yet to be updated by the errors calculated by $FS_2$ and $BKS_1$. At steady state, the activations of a minibatch in $FS_1$ are calculated using weights that are 2 execution cycles old, or 2 cycles stale. This is reflected in Figure 4 by indicating the weights used by each forward stage and the weights updated by each backward stage. The weights of a forward stage are subscripted by how stale they are (negative subscripts). Similarly, the weights updated by a backward stage are subscripted by how delayed they are (positive subscripts).
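The effect of computing with weights that are a fixed number of cycles old can be illustrated on a toy problem. The following Python sketch is an assumption-laden illustration, not an experiment from this paper: it minimizes the one-dimensional quadratic $0.5(w - t)^2$ with gradient descent, where each gradient is evaluated at the weight value from `delay` updates earlier. The function name `delayed_sgd`, the target value, and the learning rate are all hypothetical choices.

```python
from collections import deque

def delayed_sgd(delay, steps=200, lr=0.1, target=3.0, w0=0.0):
    """Gradient descent on 0.5 * (w - target)**2 using gradients from stale weights."""
    # history[0] always holds the weight value from `delay` updates ago.
    history = deque([w0] * (delay + 1), maxlen=delay + 1)
    w = w0
    for _ in range(steps):
        stale_w = history[0]
        grad = stale_w - target  # gradient evaluated at the stale weight
        w = w - lr * grad        # update is applied to the current weight
        history.append(w)        # oldest entry is dropped automatically
    return w

print("no staleness     :", delayed_sgd(delay=0))
print("2 cycles of delay:", delayed_sgd(delay=2))
```

With this small learning rate both runs approach the target, the delayed one more slowly. The toy problem says nothing about deep networks by itself; it only shows the update rule that staleness implies.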
Further, since the updates of the weights by $BKS_2$ require activations calculated for the same minibatch in $FS_1$ for all layers in the stage, it is necessary to save these activations until the error gradients with respect to the weights are calculated by $BKS_2$. Only when the weights are updated using the gradients can these activations be discarded.
In the general case, we use $K$ pairs of pipeline registers (each pair consisting of a forward register and a backward register) inserted between the layers of the network. We describe the placement of the register pairs by the pipeline placement vector, $PPV = (p_1, p_2, \ldots, p_K)$, where $p_i$ represents the layer number after which a pipeline register pair is inserted. Such a placement creates $(K+1)$ forward stages, labeled $FS_i$, $i = 1, 2, \ldots, K+1$, and $(K+1)$ backward stages, labeled $BKS_i$, $i = 1, 2, \ldots, K+1$. Forward stage $FS_i$ and backward stage $BKS_{K-i+2}$ correspond to the same set of layers. Specifically, stage $FS_i$ contains layers $p_{i-1}+1$ to $p_i$, inclusive, with the conventions $p_0 = 0$ and $p_{K+1} = L$. We assign each forward stage and each backward stage to an accelerator, with the exception of forward stage $FS_{K+1}$ and backward stage $BKS_1$, which are assigned to the same accelerator to reduce weight staleness by an execution cycle. In total, $2K+1$ accelerators are used.
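The mapping from a placement vector to stages and accelerators can be sketched in Python as follows. The function name `stages_from_ppv`, the dictionary keys, and the 0-based accelerator numbering are assumptions introduced for illustration; the boundary conventions $p_0 = 0$ and $p_{K+1} = L$ are likewise assumed so that the first and last stages are well defined.

```python
def stages_from_ppv(ppv, num_layers):
    """Map a pipeline placement vector to stage layer ranges and accelerators."""
    K = len(ppv)
    bounds = [0] + list(ppv) + [num_layers]  # p_0 = 0 and p_{K+1} = L (assumed)
    plan = []
    for i in range(1, K + 2):
        backward_stage = K - i + 2  # BKS_{K-i+2} covers the same layers as FS_i
        plan.append({
            "forward_stage": i,
            "backward_stage": backward_stage,
            "layers": (bounds[i - 1] + 1, bounds[i]),   # inclusive range
            "forward_accelerator": i - 1,               # FS_i on accelerator i-1
            # BKS_1 shares accelerator K with FS_{K+1}; later stages follow it.
            "backward_accelerator": K + backward_stage - 1,
        })
    return plan

# PPV (3, 5, 7) on a 20-layer network gives 4 forward and 4 backward stages.
for entry in stages_from_ppv((3, 5, 7), 20):
    print(entry)
```

For this example the plan uses accelerators 0 through 6, that is, $2K + 1 = 7$ accelerators, and the last forward stage and first backward stage both land on accelerator 3.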
We quantify weight staleness as follows. A forward stage $FS_i$ and backward stage $BKS_{K-i+2}$ use the same weights, which are $2(K-i+1)$ cycles old. Further, a forward stage $FS_i$ must store the activations of all layers in the stage for all $2(K-i+1)$ cycles, until they are used by the corresponding backward stage $BKS_{K-i+2}$. We refer to these saved activations as intermediate activations. We define the degree of staleness as $2(K-i+1)$. For each pair of stages $FS_i$ and $BKS_{K-i+2}$, let there be $N_i$ weights in their corresponding layers. The layers before the last pipeline register pair always use stale weights. Thus, we define the percentage of stale weights as $\left(\sum_{i=1}^{K} N_i\right) / \left(\sum_{i=1}^{K+1} N_i\right)$.
On the one hand, the above pipelined execution allows a potential speedup of $2K+1$, using as many accelerators, over the nonpipelined implementation, keeping all the accelerators active at steady state. On the other hand, the use of stale weights may prevent training convergence or may result in a model that has an inferior inference accuracy. Further, it requires an increase in storage for activations. Our goal is to assess the benefit of this pipelined execution and the impact of its downsides.
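As a worked example of these two metrics, the Python sketch below computes the degree of staleness of every stage and the percentage of stale weights from a list of per-stage weight counts. The function name `staleness_report` and the weight counts in the usage lines are hypothetical; they are not measurements from the networks evaluated here.

```python
def staleness_report(weights_per_stage):
    """weights_per_stage[i-1] holds N_i, the weight count of FS_i / BKS_{K-i+2}."""
    K = len(weights_per_stage) - 1  # K register pairs create K+1 stage pairs
    # Degree of staleness of stage i is 2(K - i + 1); the last stage has 0.
    degrees = [2 * (K - i + 1) for i in range(1, K + 2)]
    # Every stage before the last register pair (stages 1..K) uses stale weights.
    stale = sum(weights_per_stage[:K])
    percent_stale = 100.0 * stale / sum(weights_per_stage)
    return degrees, percent_stale

# Hypothetical network with K = 2 register pairs and 1000 weights in total.
degrees, percent = staleness_report([100, 300, 600])
print("degree of staleness per stage:", degrees)
print("percentage of stale weights:", percent)
print("potential speedup (2K + 1):", 2 * (len(degrees) - 1) + 1)
```

In this example the two early stages hold 400 of the 1000 weights, so 40% of the weights are stale, even though the first stage is 4 cycles stale and the second only 2.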
Appendix C presents an analytical proof of the convergence of our pipelined training scheme.

Implementation
We implement pipelined training in two ways: simulated in Caffe [31] (version 1.0.0), where the whole training process is performed in one process with no parallelism, and actual, with parallelism across accelerators, in PyTorch [32] (version 1.0.0.dev20190327). The simulated implementation is used to analyze statistical convergence, inference accuracy, and the impact of weight staleness for a large number of stages/accelerators, unconstrained by parallelism and communication overhead. In contrast, the actual implementation reports real performance and serves as a proof-of-concept implementation that demonstrates the performance potential of pipelined training with stale weights. PyTorch is used instead of Caffe to leverage its support for collective communication protocols and its flexibility in partitioning a network across multiple accelerators. The versions of Caffe and PyTorch we use have no support for pipelined training. Thus, both were extended to provide such support.
We develop a custom Caffe layer in Python, which we call a Pipeline Manager Layer (PML), to facilitate the simulated pipelining. During the forward pass, a PML registers the input from a previous layer and passes the activation to the next layer. It also saves the activations for the layers connected to it to be used in the backward pass. During the backward pass, a PML passes the appropriate error gradients. It uses the corresponding activations saved during the forward pass to update weights and generate error gradients for the previous stage, using existing weight update mechanisms in Caffe.
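The bookkeeping that such a layer performs can be sketched independently of Caffe. The class below is a hypothetical stand-in written in plain Python: it does not use the Caffe layer API and is not the authors' PML. It only shows the idea of saving activations per minibatch in the forward direction and retrieving the matching ones when the gradients for that minibatch arrive.

```python
class PipelineRegisterPair:
    """Toy forward/backward register pair keyed by minibatch id (illustrative)."""

    def __init__(self):
        # minibatch id -> activations waiting for their backward pass
        self.saved = {}

    def forward(self, minibatch_id, activations):
        """Save the activations and pass them on to the next stage."""
        self.saved[minibatch_id] = activations
        return activations

    def backward(self, minibatch_id, grad_from_next_stage):
        """Return the saved activations with the incoming gradients, then free them."""
        activations = self.saved.pop(minibatch_id)
        return activations, grad_from_next_stage

    def in_flight(self):
        """Number of minibatches whose activations are still stored."""
        return len(self.saved)

reg = PipelineRegisterPair()
for mb in range(3):                      # three minibatches enter the pipeline
    reg.forward(mb, [0.1 * mb, 0.2 * mb])
acts, grads = reg.backward(0, [1.0, -1.0])  # gradients for minibatch 0 arrive
print(acts, grads, reg.in_flight())
```

The number of entries held at steady state grows with the degree of staleness of the stage, which is where the additional activation memory comes from.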

Applied Computational Intelligence and Soft Computing
To implement actual hardware-accelerated pipelined training, we partition the network onto different accelerators (GPUs), each running its own process. Activation and gradient data are communicated among accelerators using an asynchronous send/receive communication protocol, but all communication must go through the host CPU, since point-to-point communication between accelerators is not supported in PyTorch. This increases communication overhead. Similar to the PMLs in Caffe, the activations computed on one GPU are copied to the next GPU (via the CPU) in the forward pass, and the error gradients are sent (again via the CPU) to the preceding GPU during the backward pass. The GPUs run concurrently, achieving pipeline parallelism.

Evaluation
6.1. Setup, Methodology, and Metrics. Simulated pipelining is evaluated on a machine with one Nvidia GTX1060 GPU with 6 GB of memory and an Intel i9-7940X CPU with 64 GB of RAM. The performance of actual pipelining is evaluated using two Nvidia GTX1060 GPUs, each with 6 GB of memory, hosted in an Intel i7-9700K machine with 32 GB of RAM.
We elect to use the above CNNs for two reasons. First, they are commonly used in the evaluation of pipelined training (e.g., VGG in PipeDream [9] and ResNet in GPipe [8], which we compare to in our evaluation). Second, these networks have increasing sizes, ranging from the small LeNet to the large VGG and the progressively larger ResNets. This range in size allows us to effectively assess the impact of stale weights on pipelined training. We leave the use of larger networks, such as BERT [39] or DLRM [40], to future work.
We evaluate the effectiveness of pipelined training in terms of its training convergence and its Top-1 inference accuracy, compared to those of nonpipelined training. We use the speedup to evaluate performance improvements. The speedup is defined as the ratio of the training time of the nonpipelined implementation on a single communication-free GPU to the training time of the pipelined implementation.

Training Convergence and Inference Accuracy.
Pipelined training is done using 4, 6, 8, and 10 pipeline stages. Table 1 shows where the registers are inserted in the networks using their PPVs (defined in Section 4). Pipeline registers are inserted among groups of convolutional layers, resulting in up to 8 pipeline stages for AlexNet and ResNet-20 and 10 pipeline stages for LeNet-5 and VGG-16.
Figure 5 shows the improvements in the inference accuracies for both pipelined and nonpipelined training as a function of the number of training iterations (each iteration corresponds to a minibatch). The figure shows that for all the networks, both pipelined training and nonpipelined training have similar convergence patterns. They converge in more or less the same number of iterations for a given number of pipeline stages, albeit to different inference accuracies. This indicates that our approach to pipelined training with stale weights does converge, similar to nonpipelined training.
Table 2 shows the inference accuracies obtained after up to 30,000 iterations of training. For LeNet-5, the inference accuracy drop is within 0.5%. However, for the other networks, there is a small drop in inference accuracy with 4 and 6 stages. AlexNet has about a 4% drop in inference accuracy, but for VGG-16, the inference accuracy drop is within 2.4%, and for ResNet-20, the accuracy drop is within 3.5%. Thus, the resulting model quality is generally comparable to that of a model trained without pipelining.
However, with deeper pipelining (i.e., 8 and 10 stages), inference accuracies drop significantly. There is a 12% and an 8.5% inference accuracy drop for VGG-16 and ResNet-20, respectively. In this case, the model quality is not comparable to that of nonpipelined training. This result confirms what is reported in the literature [9] and is attributed to the use of stale weights.

Impact of Weight Staleness.
We wish to better understand the impact of the number of pipeline stages and their location in the network on inference accuracy.We focus on ResNet-20 because of its relatively small size and regular structure.It consists of 3 residual function groups with 3 residual function blocks within each group.In spite of this relatively small size and regular structure, it enables us to create pipelines with up to 20 stages by inserting pipeline register pairs within residual function blocks.
We conduct two experiments. In the first, we increase the number of pipeline stages (from earlier layers to later layers) and measure the inference accuracy of the resulting model. The results are shown in Table 3, which gives the inference accuracy of pipelined training after 100,000 iterations as the number of pipeline stages increases. The 8-stage pipelined training is created by a PPV of (3, 5, 7), and the subsequent pipeline schemes are created by adding pipeline registers after every 2 layers after layer 7. Clearly, the greater the number of stages, the worse the resulting model quality. Figure 6 depicts the inference accuracy as a function of the percentage of weights that are stale. The curve labeled "increasing stages" shows that the drop in inference accuracy increases as the percentage of stale weights increases.
In the second experiment, we investigate the impact of the degree of staleness described in Section 4. Only one pair of pipeline registers is inserted. The position of this register pair slides from the beginning of the network to its end. At every position, the percentage of stale weights remains the same as in the first experiment, but all stale weights have the same degree of staleness. The result of this experiment is shown by the curve labeled "sliding stage" in Figure 6. The curve shows that the inference accuracy also drops as the percentage of stale weights increases. However, it also indicates that the drop in inference accuracy remains more or less the same as in the first experiment, in which the degree of staleness is higher. Thus, the percentage of stale weights appears to be what determines the drop in inference accuracy, not the degree of staleness of the weights. The percentage of stale weights is determined by where the last pair of pipeline registers is placed in the network. It is the position of this pair that determines the loss in inference accuracy. Therefore, it is desirable to place this last pair of registers as early as possible in the network so as to minimize the drop in inference accuracy.
While at first glance this may seem to limit pipelining, it is important to note that the bulk of computations in a CNN is in the first few convolutional layers in the network. Inserting pipeline registers for these early layers can result in a large number of stages that are computationally balanced. For example, our profiling of the runtime of ResNet-20 shows that the first three residual functions take more than 50% of the training runtime. This favors more pipeline stages at the beginning of the network. Such placement has the desirable effect of reducing the drop in inference accuracy while obtaining relatively computationally balanced pipeline stages.

Effectiveness of Hybrid Training.
We demonstrate the effectiveness of hybrid training, also using ResNet-20. Figure 7 shows the inference accuracy for 20K iterations of pipelined training followed by either 10K or 20K iterations of nonpipelined training. This inference accuracy is compared to 30K iterations of either nonpipelined or pipelined training with PPV (5, 12, 17). The figure demonstrates that hybrid training converges in a similar manner to both pipelined and nonpipelined training. Table 4 shows the resulting inference accuracies. The table shows that the 20K + 10K hybrid training produces a model with accuracy that is comparable to that of the nonpipelined model. Further, with an additional 10K iterations of nonpipelined training, the model quality is slightly better than that of the nonpipelined model. This demonstrates the effectiveness of hybrid training.
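The structure of a hybrid run, a pipelined phase followed by a nonpipelined phase, can be sketched on a toy problem. The Python code below is an illustration under strong simplifying assumptions: it reuses a one-dimensional quadratic objective instead of a CNN, models pipelining as gradients evaluated on weights that are `delay` updates old, and scales the 20K + 10K iteration counts down by a factor of 100. The function name `hybrid_sgd` and all numeric values are hypothetical.

```python
def hybrid_sgd(pipelined_iters, nonpipelined_iters, delay=2, lr=0.1,
               target=3.0, w0=0.0):
    """Run delayed-gradient updates first, then switch to ordinary updates."""
    past = [w0]   # past[t] is the weight value after t updates
    modes = []
    for it in range(pipelined_iters + nonpipelined_iters):
        pipelined = it < pipelined_iters
        d = delay if pipelined else 0            # no staleness after the switch
        read_w = past[max(0, len(past) - 1 - d)]  # weights the gradient sees
        grad = read_w - target                    # gradient of 0.5*(w - target)**2
        past.append(past[-1] - lr * grad)
        modes.append("pipelined" if pipelined else "nonpipelined")
    return past[-1], modes

w, modes = hybrid_sgd(pipelined_iters=200, nonpipelined_iters=100)
print("final weight:", w)
print("phase lengths:", modes.count("pipelined"), modes.count("nonpipelined"))
```

The point of the sketch is the control flow: the same weights carry over from the pipelined phase, and only the staleness of the gradients changes at the switch.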

Pipelined and Hybrid Training Performance.
Our evaluation using simulated pipelining explored pipelines with up to 20 pipeline stages (up to 10 accelerators). In this section, we implement and evaluate a proof-of-concept implementation with actual pipelining. The goal is to demonstrate that pipelined training with stale weights, with and without hybrid training, does deliver performance improvements.
Specifically, we implement 4-stage pipelined training for ResNet-56/110/224/362 on a 2-GPU system. Each GPU is responsible for one forward stage and one backward stage. Thus, the maximum speedup that can be obtained is 2. We train every ResNet for 200 epochs on the CIFAR-10 dataset and 300 epochs on the CIFAR-100 dataset. Tables 5 and 6 show the inference accuracies with and without pipelining for the CIFAR-10 and CIFAR-100 datasets. They also show the speedups of pipelined training over the nonpipelined one. The tables indicate that the quality of the models produced by pipelined training is comparable to that achieved by the simulated pipelining implementation. The tables also show that a speedup exists for all networks. Indeed, for ResNet-362, the speedup is 1.8X. This is equivalent to about 90% utilization for each GPU. Finally, the tables reflect that as the networks get larger, the speedup improves. This is because for larger networks, the ratio of computation to communication overhead is higher, leading to better speedups.
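The relation between the reported speedup and GPU utilization is simple arithmetic, sketched below in Python. The function names and the training times in the usage lines are hypothetical and chosen only so that the ratio equals 1.8; they are not measurements from Tables 5 and 6.

```python
def speedup(t_nonpipelined, t_pipelined):
    """Ratio of nonpipelined training time to pipelined training time."""
    return t_nonpipelined / t_pipelined

def utilization(speedup_value, num_gpus):
    """Fraction of the ideal speedup (one per GPU) that is actually achieved."""
    return speedup_value / num_gpus

# Hypothetical timings, in arbitrary units, that give a speedup of 1.8X.
s = speedup(t_nonpipelined=180.0, t_pipelined=100.0)
u = utilization(s, num_gpus=2)
print(f"speedup = {s:.2f}X, utilization = {u:.0%}")
```

With 2 GPUs the ideal speedup is 2, so a measured speedup of 1.8X corresponds to each GPU doing useful work about 90% of the time.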
Moreover, we combine the ... 7). More analysis of memory increase appears in Appendix D.

Comparison to Existing Work.
We compare our pipelined training scheme with two key existing systems: PipeDream [9] and GPipe [8].We believe that PipeDream and GPipe are representative of existing key approaches that implement pipelined training, including decoupled backpropagation (DDG) [12] and feature replay (FR) [25] (discussed in Section 3).We compare on the basis of three aspects: the pipelining scheme, performance, and memory usage.
Our pipelining scheme is simpler than those of PipeDream and GPipe in that we neither require weight stashing nor divide minibatches into microbatches. This leads to less communication overhead and is amenable to rapid realization in machine learning frameworks such as PyTorch or in actual hardware such as Xilinx's xDNN FPGA accelerators [41].
Our pipelining scheme, like PipeDream, eliminates pipeline bubbles, leading to better performance. For example, we obtain a speedup of 1.7X for ResNet-110 using 2 GPUs, in contrast to GPipe, which obtains a speedup of roughly 1.3X for ResNet-101 using 2 TPUs. We also obtain similar performance compared to PipeDream for similar networks. As the number of pipeline stages grows, pipeline bubbles have a larger negative effect on performance: GPipe's bubble overhead on a 4-partition pipelined ResNet-101 using 4 TPUs is double that of the 2-partition pipelined ResNet-101.
Our scheme uses less memory compared to PipeDream, although it introduces more memory overhead compared to GPipe. PipeDream saves intermediate activations during training, as we do. However, it also saves multiple copies of a network's weights for weight stashing. The memory footprint increase due to weight stashing depends on the total weight memory compared to activation memory, the number of active minibatches in the training pipeline, the minibatch size, and the training dataset. In some cases, weight stashing can have a significant impact on memory footprint. For example, for AlexNet trained on CIFAR-10 with a minibatch size of 128 using 4-stage pipelined training, in which the weight memory is much larger than the activation memory, PipeDream's memory footprint increase is 177% more than ours. A more detailed memory usage comparison is presented in Appendix D.

Concluding Remarks
We propose and evaluate a pipelined execution scheme of backpropagation for the training of CNNs. The scheme uses stale weights, fully utilizes accelerators, does not significantly increase memory usage, and results in models with prediction accuracies comparable to those obtained with nonpipelined training. The use of stale weights has been recognized in the literature to significantly affect prediction accuracies. Thus, existing schemes avoid or limit the use of stale weights [7-9, 12]. In contrast, we explore the impact of stale weights and demonstrate that it is the placement of the last pair of pipeline registers that determines the loss in inference accuracy. This allows us to implement pipelining in the early layers of the network with little loss in accuracy while reaping computational benefits. Limiting pipelining to such early layers is not a disadvantage since the bulk of computations is in the early convolutional layers. Nonetheless, when deeper pipelining is desired, we introduce hybrid training and show that it is effective in mitigating the loss of prediction accuracy for deep pipelining, while still providing computational speedups. Our scheme has the advantage of simplicity and low memory overhead, making it attractive when accelerator memory is constrained, in particular for specialized hardware accelerators.
Our evaluation using several CNN networks/datasets confirms that training with our scheme does converge and does produce models with inference accuracies that are comparable to those obtained with nonpipelined training.Our proof-of-concept implementation on a 2-GPU system shows that our scheme achieves a speedup of up to 1.82X, demonstrating its potential.
is work can be extended in a number of directions.One direction is to evaluate the approach with a larger number of accelerators since pipelined parallelism is known to scale naturally with the number of accelerators.Another is to evaluate the approach on larger datasets, such as ImageNet.Finally, our pipelining scheme lends itself naturally to hardware implementation due to its simplicity.us, another direction for future work is to evaluate pipelined parallelism using Field Programmable Gate Array (FPGA) or ASIC accelerators.

A. Training Hyperparameters for Simulated Training
LeNet-5 is trained on the MNIST dataset with stochastic gradient descent (SGD) using a learning rate of 0.01 with the inverse learning policy, a momentum of 0.9, a weight decay of 0.0005, and a minibatch size of 100 for 30,000 iterations. The progression of inference accuracy during training is recorded with 300 tests.
AlexNet is trained on the CIFAR-10 dataset with SGD with Nesterov momentum using a learning rate of 0.001 that is decreased by 10X twice during training, a momentum of 0.9, a weight decay of 0.004, and a minibatch size of 100 for 250,000 iterations. One test is performed every epoch to record the progression of inference accuracy.
VGG-16 is trained on the CIFAR-10 dataset with SGD with Nesterov momentum using a learning rate starting at 0.1 that is decreased by half every 50 epochs during training, a momentum of 0.9, and a weight decay of 0.0005.
ResNet is trained on the CIFAR-10 dataset with SGD using a learning rate starting at 0.1 and 0.01 for nonpipelined and pipelined training, respectively, that is decreased by 10X twice during training, a momentum of 0.9, a weight decay of 0.0001, and a minibatch size of 128 for 100,000 iterations. Batch normalization is used during training throughout the network. One test is performed every 100 iterations to record the progression of inference accuracy.

B. Training Hyperparameters for Actual Training
For the baseline nonpipelined training, ResNet-56/110/224/362 is trained on the CIFAR-10 and CIFAR-100 datasets for 200 and 300 epochs, respectively, with SGD using a learning rate of 0.1 that is decreased by a factor of 10 twice (at epochs 100 and 150 for CIFAR-10 and at epochs 150 and 225 for CIFAR-100), a momentum of 0.9, a weight decay of 0.0001, and a minibatch size of 128. Batch normalization is used during training throughout the network.
For the 4-stage pipelined training, the hyperparameters are the same as the nonpipelined baseline, except for the BKS 2 learning rate.
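The step schedule described for the nonpipelined baseline can be sketched as a small function. This is a minimal illustration, not the authors' training code: the function name `step_lr`, its parameters, and the convention that a drop takes effect from the listed epoch onward (0-indexed) are assumptions.

```python
def step_lr(epoch, base_lr=0.1, milestones=(100, 150), factor=0.1):
    """Learning rate at a given epoch for a step schedule.

    Assumption: the rate is multiplied by `factor` once for every
    milestone that has been reached (epoch >= milestone).
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops


# CIFAR-10 baseline as described: drops at epochs 100 and 150.
cifar10 = [step_lr(e) for e in (0, 99, 100, 150, 199)]

# CIFAR-100 baseline as described: drops at epochs 150 and 225.
cifar100 = [step_lr(e, milestones=(150, 225)) for e in (0, 150, 225, 299)]
```

With these defaults the rate stays at 0.1 until the first milestone, then falls by a factor of 10 at each milestone.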

C. Convergence Analysis
Experimental evaluation shows that our pipelined training converges for large networks. Nonetheless, a convergence analysis provides a theoretical foundation for our pipelined training across networks. Our analysis is analogous to that of Bottou et al. [42] and Huo et al. [12] in that it shows that our pipelined training algorithm has a convergence rate similar to those of both decoupled parallel backpropagation and nonpipelined stochastic gradient descent. Our training algorithm is summarized in Algorithm 1. We show that this algorithm converges in a fashion similar to Huo et al. [12].
We start by making the same assumptions as in [12, 42]. Specifically, we make the Lipschitz-continuous gradient assumption, which guarantees that $\|\nabla f(u) - \nabla f(v)\|_2 \le L \|u - v\|_2$, where $f(\cdot)$ is the error function, $L > 0$, and $u, v \in \mathbb{R}^d$. We also make the bounded variance assumption, which guarantees that $\|\nabla f_{x_i}(w)\|_2^2 \le M$ for any sample $x_i$ and $\forall w \in \mathbb{R}^d$, where $M > 0$. Because the stochastic gradient is unbiased, $\mathbb{E}[\nabla f_{x_i}(w)] = \nabla f(w)$, the variance of the stochastic gradient is guaranteed to be less than $M$.
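The variance bound follows from the usual relation between the second moment and the variance. The short derivation below uses only the unbiasedness of the stochastic gradient and the bound $M$ on its squared norm; it is a sketch added for clarity and is not reproduced from the original text.

```latex
\begin{align*}
\mathbb{E}\,\|\nabla f_{x_i}(w) - \nabla f(w)\|_2^2
  &= \mathbb{E}\,\|\nabla f_{x_i}(w)\|_2^2
     - 2\,\mathbb{E}[\nabla f_{x_i}(w)]^{\top}\nabla f(w)
     + \|\nabla f(w)\|_2^2 \\
  &= \mathbb{E}\,\|\nabla f_{x_i}(w)\|_2^2 - \|\nabla f(w)\|_2^2 \\
  &\le \mathbb{E}\,\|\nabla f_{x_i}(w)\|_2^2 \;\le\; M .
\end{align*}
```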
Based on these two assumptions, if there are $K$ forward stages in our pipelined scheme, each iteration of Algorithm 1 satisfies inequality (C.1) $\forall t \in \mathbb{N}$. This can be shown as follows. From the Lipschitz-continuous gradient assumption, we obtain inequality (C.2). From the weight update rule in Algorithm 1, we take the expectation on both sides of inequality (C.2). From the resulting inequalities (C.3) and (C.4), we obtain an inequality that proves inequality (C.1). From inequality (C.1), if the value of the learning rate $\eta_t$ is picked such that the right-hand side of inequality (C.1) is less than zero, the error function is decreasing. Therefore, using this property, we can analyze the convergence of Algorithm 1 for a fixed learning rate and a decreasing learning rate.
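For reference, the standard bound implied by a Lipschitz-continuous gradient, which arguments of this kind typically start from, is given below. Instantiating it with $u = w_t$ and $v = w_{t+1}$ is an assumption about how the derivation proceeds, and the iterate notation $w_t$ is likewise assumed rather than taken from Algorithm 1.

```latex
% Holds for any function f whose gradient is L-Lipschitz continuous.
\[
f(v) \;\le\; f(u) + \nabla f(u)^{\top}(v - u) + \frac{L}{2}\,\|v - u\|_2^2 ,
\qquad \forall\, u, v \in \mathbb{R}^d .
\]
% Assumed instantiation for one training iteration:
\[
f(w_{t+1}) \;\le\; f(w_t) + \nabla f(w_t)^{\top}(w_{t+1} - w_t)
  + \frac{L}{2}\,\|w_{t+1} - w_t\|_2^2 .
\]
```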
For a fixed learning rate $\eta$, we show that Algorithm 1 converges. Given the Lipschitz-continuous gradient and bounded variance assumptions, a fixed learning rate $\eta_t = \eta$, $\forall t \in \{0, 1, \ldots, T - 1\}$, with $\eta L \le 1$, and assuming that the optimal solution that minimizes our error function $f(w)$ is $w^*$, the output of Algorithm 1 satisfies inequality (C.6). This inequality holds because, when $\eta_t$ is constant and $\eta_t = \eta$, taking the expectation of inequality (C.1) gives inequality (C.7). Summing inequality (C.7) from $t = 0$ to $T - 1$ and noting that, since $w^*$ is the optimal solution for $f(w)$, $f(w^*) - f(w_0) \le \mathbb{E}[f(w_T)] - f(w_0)$, we obtain inequality (C.6).
In inequality (C.6), when $T \to \infty$, the average norm of the error gradient is bounded by $\eta^2 LKM$, which is finite. This shows that Algorithm 1 converges for a fixed learning rate $\eta$.
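The effect of stale weights with a fixed learning rate can be illustrated on a toy problem. The sketch below is not Algorithm 1: it runs noise-free gradient descent on the one-dimensional quadratic $f(w) = \frac{1}{2} L w^2$, with each update using a gradient evaluated at weights that are `delay` steps old. The function name, the quadratic objective, and the delay model are all assumptions chosen for illustration.

```python
from collections import deque


def delayed_gradient_descent(eta, delay, steps, w0=1.0, curvature=1.0):
    """Gradient descent on f(w) = 0.5 * curvature * w**2 with stale gradients.

    Each update uses the gradient at the weight from `delay` steps ago,
    mimicking (in a very simplified way) training with stale weights.
    """
    # history[0] is the weight from `delay` steps ago, history[-1] the current one.
    history = deque([w0] * (delay + 1), maxlen=delay + 1)
    w = w0
    for _ in range(steps):
        stale_w = history[0]
        w = w - eta * curvature * stale_w  # gradient of the quadratic at stale_w
        history.append(w)
    return w


# eta * curvature = 0.1 satisfies the condition eta * L <= 1 used in the analysis.
fresh = delayed_gradient_descent(eta=0.1, delay=0, steps=200)
stale = delayed_gradient_descent(eta=0.1, delay=3, steps=200)
```

In this toy setting both runs drive the weight toward the minimizer at zero, so a small fixed learning rate tolerates the delay. The example says nothing about the rates or constants in the analysis above.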

D. Memory Usage Comparison
The pipelining scheme in this work uses less memory than PipeDream, although it introduces more memory overhead than GPipe. PipeDream saves intermediate activations during training, and so does our scheme. However, PipeDream also saves multiple copies of a network's weights for weight stashing, increasing the memory footprint further.
The memory footprint increase due to weight stashing depends on the total weight memory compared to activation memory, the number of active minibatches in the training pipeline, the minibatch size, and the training dataset.
When the weight memory is smaller than the activation memory for a given minibatch size, the memory increase due to weight stashing is not significant. For example, PipeDream's memory increase percentage is only 1% worse than ours for ResNet-20 even though 4 copies of weights would be saved by PipeDream, as shown in Table 9 (torchsummary in PyTorch is also used to report memory usage for weights and activations for a network and to calculate the additional memory required by the additional copies of activations and weights). This result also holds for ResNet with other depths since the amount of weights and activations grows linearly with the depth of the network.
However, when the weight memory is larger than the activation memory for a given minibatch size, weight stashing has a significant impact on memory footprint. For AlexNet and VGG-16 trained on CIFAR-10 with a minibatch size of 128 using 4-stage pipelined training, the weight memory is much larger than the activation memory. Weight stashing requires 4 additional copies of the weights to be saved, one per active minibatch in the pipeline, resulting in a much larger memory increase: a 214% increase in memory footprint for AlexNet, which is 177 percentage points more than ours (37%), and a 124% increase for VGG-16, which is 49 percentage points more than ours (75%), as shown in Table 9.
The minibatch size also has an impact on the memory footprint because it directly influences the total amount of activation memory required during training: the larger the minibatch size, the more activation memory is required. Figure 8 shows the memory increase percentage for our scheme and that of PipeDream as a function of minibatch size for the 4-stage pipelined training of LeNet-5, AlexNet, VGG-16, and ResNet-20 in Table 9. When the minibatch size is small, weight stashing has a significant impact on memory for all networks. As the minibatch size increases, the memory increase for our scheme and for PipeDream becomes similar for ResNet-20. However, for AlexNet and VGG-16, PipeDream still requires more memory than ours due to weight stashing.
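The dependence on weight memory, activation memory, and minibatch size can be made concrete with a deliberately simplified model. This is a sketch under stated assumptions, not the accounting used for Table 9 or Figure 8: the baseline footprint is taken to be weights plus activations only, activation memory is assumed to scale linearly with minibatch size, and all function names, parameters, and numbers below are hypothetical.

```python
def footprint_increase_pct(weight_mem, act_mem_per_sample, batch_size,
                           extra_act_fraction, extra_weight_copies):
    """Percentage increase in memory over a nonpipelined baseline.

    Simplified model (assumption): baseline = weights + activations.
    extra_act_fraction: extra saved activations as a fraction of activation memory.
    extra_weight_copies: extra weight copies (0 without weight stashing).
    """
    act_mem = act_mem_per_sample * batch_size
    baseline = weight_mem + act_mem
    extra = extra_act_fraction * act_mem + extra_weight_copies * weight_mem
    return 100.0 * extra / baseline


def stashing_gap(weight_mem, act_mem_per_sample, batch_size, copies=4):
    """Extra increase, in percentage points, attributable to weight stashing."""
    with_stash = footprint_increase_pct(
        weight_mem, act_mem_per_sample, batch_size, 0.5, copies)
    without = footprint_increase_pct(
        weight_mem, act_mem_per_sample, batch_size, 0.5, 0)
    return with_stash - without


# Hypothetical weight-heavy network (arbitrary units): gap shrinks with batch size.
gap_small_batch = stashing_gap(weight_mem=100.0, act_mem_per_sample=0.1, batch_size=8)
gap_large_batch = stashing_gap(weight_mem=100.0, act_mem_per_sample=0.1, batch_size=512)

# Hypothetical activation-heavy network: the gap is small to begin with.
gap_act_heavy = stashing_gap(weight_mem=1.0, act_mem_per_sample=1.0, batch_size=128)
```

In this model the stashing gap equals 100 × copies × W / (W + A), so it is large when weights dominate the footprint and shrinks as activation memory grows with the minibatch size, which matches the qualitative trend described above.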
Moreover, the input size affects the memory footprint due to weight stashing because it directly affects the amount of activation and weight memory: the larger the input size, the more activation and weight memory is required. Figure 9 shows the memory increase percentage for our scheme and PipeDream as a function of minibatch size for the 4-stage pipelined training of VGG-16 on ImageNet [44]. For a minibatch size of 32, PipeDream uses 28% more memory than ours due to weight stashing (PipeDream uses a minibatch size of 32 for the training of VGG-16 on ImageNet).

Figure 4: Illustration of pipelined computations of each cycle.
Pipelined/Nonpipelined Backpropagation. Hybrid training combines pipelined training with nonpipelined training. We start with pipelined training and, after a number of iterations, we switch to nonpipelined training. This can address drops in the inference accuracy of the resulting models caused by weight staleness, but it reduces the performance benefit since the accelerators are underutilized during nonpipelined training. The extent of the speedup obtained by hybrid training with a given number of accelerators is determined by the number of iterations used for pipelined and nonpipelined training. Assume that $n_{np}$ iterations are used to reach the best inference accuracy for nonpipelined training, and that in hybrid training, $n_p$ iterations ($n_p \le n_{np}$) are pipelined, followed by $n_{np} - n_p$ iterations of nonpipelined training, to reach the same inference accuracy as nonpipelined training. The speedup of hybrid training with respect to nonpipelined training with $2K + 1$ accelerators is $n_{np} / \left(n_p/(2K + 1) + (n_{np} - n_p)\right)$. For large $K$, the speedup approaches an upper bound of $n_{np}/(n_{np} - n_p)$.
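The speedup expression above is easy to evaluate numerically. The following is a minimal sketch; the function names and the iteration counts in the example are hypothetical, and the formula inherits the idealization that a pipelined iteration on $2K + 1$ accelerators takes $1/(2K + 1)$ of the time of a nonpipelined one.

```python
def hybrid_speedup(n_np, n_p, num_accelerators):
    """Speedup of hybrid training over nonpipelined training.

    n_np: iterations needed by nonpipelined training.
    n_p: iterations run pipelined in hybrid training (0 <= n_p <= n_np).
    num_accelerators: 2K + 1 in the notation of the text.
    """
    if not 0 <= n_p <= n_np:
        raise ValueError("n_p must satisfy 0 <= n_p <= n_np")
    if num_accelerators < 1:
        raise ValueError("num_accelerators must be at least 1")
    pipelined_time = n_p / num_accelerators
    nonpipelined_time = n_np - n_p
    return n_np / (pipelined_time + nonpipelined_time)


def hybrid_speedup_bound(n_np, n_p):
    """Upper bound on the speedup as the number of accelerators grows."""
    if n_p == n_np:
        return float("inf")  # fully pipelined: no nonpipelined tail limits the speedup
    return n_np / (n_np - n_p)


# Hypothetical example: half of 100 iterations pipelined on 2K + 1 = 5 accelerators.
speedup = hybrid_speedup(n_np=100, n_p=50, num_accelerators=5)  # 100 / (10 + 50)
bound = hybrid_speedup_bound(n_np=100, n_p=50)                  # 100 / 50
```

The bound makes the trade-off explicit: the nonpipelined tail of $n_{np} - n_p$ iterations limits the achievable speedup regardless of how many accelerators are added.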

Table 2: Inference accuracy for simulated pipelined training.

Table 5: Inference accuracy and speedup of actual pipelined/hybrid training for CIFAR-10.

Table 6: Inference accuracy and speedup of actual pipelined/hybrid training for CIFAR-100.
VGG-16 is trained with a minibatch size of 100 for 250,000 iterations. Since VGG-16 is relatively more difficult to train than the other models, batch normalization and dropout are used during training throughout the network. One test is performed every epoch to record the progression of inference accuracy.
Table 8 shows the learning rates for all ResNet networks evaluated.