Distributed Training


Distributed training makes it possible to train models that are too large for a single device and to speed up training by spreading the work across multiple workers.

Types:

Data parallelism

Dividing data into partitions and copying the complete model to all workers. Each worker operates on a different partition, and model updates are synchronized across workers. img source: https://community.deeplearning.ai/t/mlep-course-3-lecture-notes/

  • Synchronous training: (example: all-reduce architecture) Each worker trains on its current mini-batch of data, applies the updates, and waits for updates from the other workers before proceeding (see the sketch after this list). img source: https://youtu.be/S1tN9a4Proc

  • Asynchronous training: (example: parameter-server architecture) Workers train independently on their mini-batches of data and update the shared variables asynchronously. This can be more efficient because no worker waits on the others, but stale updates may reduce accuracy and slow convergence.
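
The synchronous all-reduce pattern can be illustrated with a small, framework-free sketch. Everything below (the linear-regression loss, shapes, learning rate, and number of workers) is made up for illustration; real systems perform the all-reduce over the network rather than with a simple mean in one process.

```python
import numpy as np

# Toy synchronous data parallelism: every worker holds a full copy of the
# weights, computes gradients on its own data shard, and the gradients are
# averaged (the "all-reduce") before every worker applies the same update.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(400, 5)), rng.normal(size=400)
w = np.zeros(5)                       # identical weight copy on every worker

num_workers, lr = 4, 0.1
shards = np.array_split(np.arange(400), num_workers)   # data partitions

for step in range(100):
    # 1. Each worker computes gradients on its current mini-batch/shard.
    grads = [X[s].T @ (X[s] @ w - y[s]) / len(s) for s in shards]
    # 2. All-reduce: average the gradients across workers.
    g = np.mean(grads, axis=0)
    # 3. Every worker applies the same update, so replicas stay identical.
    w -= lr * g
```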

Model parallelism:

Segmenting the model into different parts and training them concurrently on different workers. Workers synchronize the shared parameters during the forward and backpropagation steps, so the model itself has to be made distribution-aware.
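
As a rough sketch of the idea, TensorFlow lets you pin different parts of a model to different devices by hand; the layer sizes and the two-GPU device names below are assumptions for illustration, and real model-parallel training usually relies on dedicated libraries rather than manual placement.

```python
import tensorflow as tf

# Manual model parallelism: the first block of layers lives on one device and
# the second block on another, so activations (and gradients) flow between them.
inputs = tf.keras.Input(shape=(784,))

with tf.device("/gpu:0"):                         # assumes two local GPUs
    x = tf.keras.layers.Dense(512, activation="relu")(inputs)

with tf.device("/gpu:1"):
    outputs = tf.keras.layers.Dense(10, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```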

Distribution Strategies:

One Device Strategy:

No parallelization; everything runs on a single specified device. Mainly useful for testing code before switching to a real distribution strategy.
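
A minimal sketch with TensorFlow's tf.distribute API (the device name is an assumption; use "/cpu:0" if no GPU is available):

```python
import tensorflow as tf

# Everything runs on one explicitly chosen device; handy for debugging
# tf.distribute code before moving to a real multi-device strategy.
strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(optimizer="adam", loss="mse")
```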

Mirrored Strategy

This strategy is typically used for synchronous training on one machine with multiple GPUs.

  • Creates one model replica per GPU; each variable is mirrored across the replicas and kept in sync
  • Weight updates are performed with efficient cross-device communication algorithms (all-reduce algorithms); a minimal sketch follows below
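
A minimal sketch of how this looks in code; the model, optimizer, and loss are placeholders, and `model.fit` would be called with a real dataset:

```python
import tensorflow as tf

# One replica per visible GPU; variables are mirrored across replicas and
# gradients are combined with an all-reduce before each synchronous update.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():                 # variables created here are mirrored
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# model.fit(train_dataset, epochs=5)   # the training loop itself is unchanged
```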
Parameter-server Strategy

Some machines act as workers and others as parameter servers: the parameter servers hold the model's variables, and each worker computes gradients on its data and reads/updates those variables, typically asynchronously. img source: https://youtu.be/S1tN9a4Proc
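
As a sketch only: TensorFlow exposes this through an (experimental) strategy that reads the cluster layout, for example from a `TF_CONFIG` environment variable describing the chief, workers, and parameter servers. The snippet below assumes such a cluster is already configured and will not run standalone.

```python
import tensorflow as tf

# Assumes TF_CONFIG describes a cluster with workers and parameter servers.
cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)

with strategy.scope():    # variables are placed on the parameter servers
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(optimizer="adam", loss="mse")
```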

Central Storage Strategy

Variables are not mirrored; they are kept on the CPU, and computation is replicated across the machine's local GPUs.

TPU Strategy

Similar to the mirrored strategy, but runs synchronous training on TPU cores instead of GPUs.
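
Both can be instantiated in a few lines; the sketch below is illustrative only and requires local GPUs (central storage) or an attached TPU (TPU strategy) to actually run:

```python
import tensorflow as tf

# Central storage: variables live on the CPU, compute is replicated on the
# machine's GPUs (still under tf.distribute.experimental at the time of writing).
central_strategy = tf.distribute.experimental.CentralStorageStrategy()

# TPU strategy: synchronous training across TPU cores; the resolver locates
# the TPU worker (e.g. on Cloud TPU or Colab).
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
tpu_strategy = tf.distribute.TPUStrategy(resolver)
```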

References

  1. https://community.deeplearning.ai/t/mlep-course-3-lecture-notes/54454
  2. https://youtu.be/S1tN9a4Proc