TensorFlow has provided many excellent tutorials for performing distributed training, although most of these examples depend heavily on the Keras API, which may limit users who want to implement more complex models and training methods. This tutorial provides a concise example of how to use it tf.distribute.MirroredStategy with custom training loops in TensorFlow 2.4. To do this, we customize CycleGAN [1] guides Hard and TensorFlow and enable multi-graphics training.

Remember that CycleGAN is used as an example due to its (relatively) complex losses and calculation methods compared to related tutorials, although this guide does not apply to CycleGAN itself.

The complete code used in this tutorial is available at github.com/bryanlimy/tf2-cyclegan.


In general, any existing custom training circuit code in TensorFlow 2 can be converted to work tf.distribute.Strategy In 6 steps:

  1. Format tf.distribute.MirroredStrategy
  2. Share tf.data.Dataset
  3. Calculating and combining copy loss
  4. Format templates, optimizers, and checkpoints tf.distribute.MirroredStrategy
  5. Apply the calculation to each copy
  6. The returned values ​​and metrics are combined

Packaging requirements

This tutorial requires the following Python packages:


In addition, CUDA, cuDNN and NCCL are required. Look tensorflow.org/install/gpu learn more about GPU support in TensorFlow. If you use aconda in a virtual environment, you can install the above libraries with the following command:

conda install -c nvidia cudatoolkit=11.0 cudnn nccl

You can use the command nvidia-smi check the number of video cards available in the system. For example, I’m using a system with 8 Nvidia GeForce 2080Ti graphics cards:

nvidia-smi output [image by author]

Format tf.distribute.MirroredStrategy

First import the required packages:

We can then format tf.distribute.Strategy:

Note that TensorFlow automatically detects all available devices when devices=None. I prefer to keep the environment variable manually CUDA_VISIBLE_DEVICES at the beginning of each run devices argument programmatically. Note that the GPU ID is nvidia-smi does not necessarily correlate CUDA_VISIBLE_DEVICES.

Run the above code snippet on 3 GPUs (i.e. CUDA_VISIBLE_DEVICES=0,1,2) would result in the following:

Number of devices: 3

Data set settings

Loading horse2zebra data set TensorFlow datasets and process 4 data sets (train_horses, train_zebras, test_horses and test_zebras) in the same way as Hard tutorialincluding normalization and magnification. In addition, we take the first 5 samples from the test series and create plot_ds for drawing.

It is important to keep the following in mind when building tf.data.Dataset for decentralized training:

  • Make sure that all data sets have the same number of samples before packing tf.data.Dataset.zip. horse2zebra the dataset consists of an uneven number of horse and zebra pictures, so we have to call tf.data.Dataset.take below.
  • Batch each data set GLOBAL_BATCH_SIZE instead of BATCH_SIZE.
  • tf.data.Dataset.cache and tf.data.Dataset.prefetch data sets that ensure sufficient amounts of data are entered into the graphics cards without stopping the calculation.
  • You can check out the excellent tutorial better performance through the tf.data API by TensorFlow. I will also write a tutorial tf.dataincluding writing and reading TFRecords effectively, so stay tuned.

Once we have built data sets, we can share them simply by callingstrategy.experimental_distribute_dataset method. Please note that we do not share plot_ds because we are going to draw a conclusion for one device.

Build generators and separators

The architectures of the generators and separators are identical Hard tutorial.

Sample – specific loss calculation

The loss works tf.losses and tf.keras.losses returns by default the average loss for a given batch. The average loss calculated with distributed training should be proportional to the global batch size instead of each copy. Therefore, I define my own loss calculation, where the function returns the sample-specific loss. For example, if the label tensor has a shape (32, 256, 256, 3), then the restored average loss has a shape (32,). An additional step is needed to calculate the average batch loss.

Things to consider when writing decentralized activities for decentralized training:

  • Check the shape of the loss function.
  • If loss functions are used tf.keras.losses , the loss reduction must be one of the following: NONE or SUM.
  • Note that some functions are tf.losses and tf.keras.losses, like tf.losses.mean_squared_error and tf.losses.binary_crossentropy, reduce the final dimension of the output, so we need to extend the last dimension of both labels and projections to ensure the correct output format before sample reduction. See function BCE below for reference.

We can then define the loss functions used in CycleGAN. I added a helper function reduce_mean which restores the global average loss. You can also use tf.nn.compute_average_loss perform the same function.

Format templates, optimizers, and checkpoint

We can now initialize models, optimizers and tf.train.Checkpoint indoors tf.strategy.MirroredStrategy scope of application.

The following is not related to a decentralized curriculum, but makes life easier. I added Summary a class for easy TensorBoard recording and a helper function for drawing the image cycle. Here we create a directory OUTPUT_DIR='runs/' to save model checkpoints and a TensorBoard summary.

Custom train and test phase

train_step Here the function is very similar to TensorFlow tutorial except that it returns a dictionary of tensors. With shared training, each copy would play train_step separately.

Example: train_step:

{'loss_G/loss': <tf.Tensor: shape=(), dtype=float32, numpy=0.99999964>, 'loss_G/cycle': <tf.Tensor: shape=(), dtype=float32, numpy=5.1983647>, 'loss_G/identity': <tf.Tensor: shape=(), dtype=float32, numpy=2.5991883>, 'loss_G/total': <tf.Tensor: shape=(), dtype=float32, numpy=8.797552>, 'loss_F/loss': <tf.Tensor: shape=(), dtype=float32, numpy=1.0>, 'loss_F/cycle': <tf.Tensor: shape=(), dtype=float32, numpy=3.6904013>, 'loss_F/identity': <tf.Tensor: shape=(), dtype=float32, numpy=1.8451972>, 'loss_F/total': <tf.Tensor: shape=(), dtype=float32, numpy=6.5355983>, 'loss_X/loss': <tf.Tensor: shape=(), dtype=float32, numpy=0.49949202>, 'loss_Y/loss': <tf.Tensor: shape=(), dtype=float32, numpy=0.49720365>}

Then we call strategy.run split the count between all copies, and each copy calculates and returns a corresponding result for the item train_step. Because of us train_step the function returns a dictionary of tensors, so the returned values strategy.run from tensor dictionaries. I added a helper function reduce_dict which simply repeats all the items in the dictionary and sums all the values ​​for each key.

Things to consider when distributing the calculation:

  • We can replicate a data set both inside and outside it tf.function.
  • Because the results obtained from each copy are proportional to the global batch size, we calculateSUM instead of MEAN with strategy.reduce.
  • Check the returned values train_step (or any function you plan to spread the computation). For example, if your function returns one tensor, then the value returned from strategy.run would be a dictionary of tensors.

Example: strategy.run(train_step, args=(x, y)) With 3 GPUs (i.e. CUDA_VISIBLE_DEVICES=0,1,2):

{'loss_G/loss': PerReplica:{
0: <tf.Tensor: shape=(), dtype=float32, numpy=0.33333334>,
1: <tf.Tensor: shape=(), dtype=float32, numpy=0.33333334>,
2: <tf.Tensor: shape=(), dtype=float32, numpy=0.33333334>
}, 'loss_G/cycle': PerReplica:{
0: <tf.Tensor: shape=(), dtype=float32, numpy=1.2934016>,
1: <tf.Tensor: shape=(), dtype=float32, numpy=1.4167924>,
2: <tf.Tensor: shape=(), dtype=float32, numpy=1.2807639>
}, 'loss_G/identity': PerReplica:{
0: <tf.Tensor: shape=(), dtype=float32, numpy=0.6467091>,
1: <tf.Tensor: shape=(), dtype=float32, numpy=0.70838636>,
2: <tf.Tensor: shape=(), dtype=float32, numpy=0.6403792>
}, 'loss_G/total': PerReplica:{
0: <tf.Tensor: shape=(), dtype=float32, numpy=2.2734442>,
1: <tf.Tensor: shape=(), dtype=float32, numpy=2.458512>,
2: <tf.Tensor: shape=(), dtype=float32, numpy=2.2544765>
}, 'loss_F/loss': PerReplica:{
0: <tf.Tensor: shape=(), dtype=float32, numpy=0.33333334>,
1: <tf.Tensor: shape=(), dtype=float32, numpy=0.33333334>,
2: <tf.Tensor: shape=(), dtype=float32, numpy=0.33333334>
}, 'loss_F/cycle': PerReplica:{
0: <tf.Tensor: shape=(), dtype=float32, numpy=0.6893108>,
1: <tf.Tensor: shape=(), dtype=float32, numpy=1.1769139>,
2: <tf.Tensor: shape=(), dtype=float32, numpy=1.5197902>
}, 'loss_F/identity': PerReplica:{
0: <tf.Tensor: shape=(), dtype=float32, numpy=0.34465796>,
1: <tf.Tensor: shape=(), dtype=float32, numpy=0.58845943>,
2: <tf.Tensor: shape=(), dtype=float32, numpy=0.75989777>
}, 'loss_F/total': PerReplica:{
0: <tf.Tensor: shape=(), dtype=float32, numpy=1.3673022>,
1: <tf.Tensor: shape=(), dtype=float32, numpy=2.0987067>,
2: <tf.Tensor: shape=(), dtype=float32, numpy=2.6130214>
}, 'loss_X/loss': PerReplica:{
0: <tf.Tensor: shape=(), dtype=float32, numpy=0.16675813>,
1: <tf.Tensor: shape=(), dtype=float32, numpy=0.16679758>,
2: <tf.Tensor: shape=(), dtype=float32, numpy=0.16684371>
}, 'loss_Y/loss': PerReplica:{
0: <tf.Tensor: shape=(), dtype=float32, numpy=0.16681099>,
1: <tf.Tensor: shape=(), dtype=float32, numpy=0.1668036>,
2: <tf.Tensor: shape=(), dtype=float32, numpy=0.16669472>

Example:reduce_dict :

{'loss_G/loss': <tf.Tensor: shape=(), dtype=float32, numpy=1.0>, 'loss_G/cycle': <tf.Tensor: shape=(), dtype=float32, numpy=3.990958>, 'loss_G/identity': <tf.Tensor: shape=(), dtype=float32, numpy=1.9954746>, 'loss_G/total': <tf.Tensor: shape=(), dtype=float32, numpy=6.986433>, 'loss_F/loss': <tf.Tensor: shape=(), dtype=float32, numpy=1.0>, 'loss_F/cycle': <tf.Tensor: shape=(), dtype=float32, numpy=3.386015>, 'loss_F/identity': <tf.Tensor: shape=(), dtype=float32, numpy=1.6930151>, 'loss_F/total': <tf.Tensor: shape=(), dtype=float32, numpy=6.07903>, 'loss_X/loss': <tf.Tensor: shape=(), dtype=float32, numpy=0.5003994>, 'loss_Y/loss': <tf.Tensor: shape=(), dtype=float32, numpy=0.5003093>}

The test phase follows the same procedure.

Training loop

The only thing left is to build the train and test loops in the same way as all typical custom training circuits. We save the checkpoint and draw 5 samples on the TensorBoard every 10 eras. Please note that checkpoint.read must be called into the strategy when returning a checkpoint.

Example of a printout after one era:

Epoch 001/200
Train: 100%|██████████████████████| 178/178 [12:29<00:00, 4.21s/it]
Test: 100%|█████████████████████████| 20/20 [00:24<00:00, 1.22s/it]
MAE(X, F(G(X))): 0.2599 MAE(X, F(X)): 0.2823
MAE(Y, G(F(Y))): 0.2492 MAE(Y, G(Y)): 0.2793
Elapse: 773.84s

1 GPU vs. 3 GPU results

To ensure that we can achieve similar results after splitting the computation, I trained 2 models with the same hyperparameters on 1 GPU and 3 GPUs. In general, we can use a higher learning speed and a larger batch size, although I liked all learning speeds 2e-4 For simplicity. On average, one era took ~ 325 seconds to make one GPU, while 3 GPUs took ~ 167 seconds. It is important to analyze the inlet pipe and training procedure to ensure the correct speed, when using multiple graphics cards, it is possible that the overhead costs of decentralized training outweigh the benefits of parallelization.

The following diagrams and figures show the training statistics and results of generators and separators after 200 eras.


Please enter your comment!
Please enter your name here