Week 1: Distributed Training, Data Parallelism & Gradient Aggregation
Design distributed AI training systems: data and model parallelism, gradient aggregation strategies, parameter servers, and fault-tolerant training.
- Implement data-parallel training with PyTorch DDP
- Understand model parallelism for LLM training
- Design fault-tolerant training with checkpointing
- Optimize communication in distributed training (gradient compression)
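The last objective above, gradient compression, can be sketched in a few lines of plain Python. This is an illustrative top-k sparsification toy (the function names are made up for this sketch; production systems operate on tensors, e.g. via DDP communication hooks):

```python
def topk_compress(grad, k):
    # Keep only the k largest-magnitude entries of the gradient vector and
    # transmit them as (index, value) pairs: 2*k numbers instead of len(grad).
    idx = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    return [(i, grad[i]) for i in sorted(idx)]

def topk_decompress(pairs, n):
    # Reconstruct a dense gradient; dropped entries become zero.
    # (Practical schemes add error feedback: the dropped residual is
    # accumulated locally and folded into the next step's gradient.)
    out = [0.0] * n
    for i, v in pairs:
        out[i] = v
    return out
```

For example, `topk_compress([0.1, -5.0, 0.2, 3.0], 2)` keeps only the two dominant coordinates, cutting communication in half for this vector while preserving most of the gradient's energy.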
This first lecture establishes the foundational framework for Distributed Systems for AI. By the end of this session, you will have the conceptual grounding and practical starting point needed for the rest of the course.
Key Concepts
The lecture introduces the four main pillars of this course:
- Data Parallelism: AllReduce & Ring-AllReduce
- Model Parallelism & Tensor Parallelism
- Parameter Servers vs. All-Reduce
- Fault Tolerance: Checkpointing & Elastic Training
Each will be explored in depth over the 14-week curriculum, with hands-on projects reinforcing theory at every stage.
This Week's Focus
Focus on mastering the first two pillars: Data Parallelism (AllReduce & Ring-AllReduce) and Model Parallelism & Tensor Parallelism. These are the prerequisites for everything in Week 2. The concepts build on each other, so do not skip the practice exercises.
AIE302 Project 1: Distributed Training Benchmark
Implement and benchmark data-parallel training on a vision model using PyTorch DDP across multiple GPUs or machines. Measure scaling efficiency, communication overhead, and fault recovery.
- PyTorch DDP training implementation
- Scaling efficiency curve (1→2→4 GPUs)
- Communication overhead analysis
- Fault injection test with checkpoint recovery
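The fault-injection deliverable can be prototyped without GPUs. Below is a minimal sketch of a training loop that checkpoints periodically and resumes after a simulated failure; the `train` helper and JSON checkpoint format are hypothetical stand-ins for a real loop using `torch.save`/`torch.load`:

```python
import json
import os

def train(steps, ckpt_path, ckpt_every, fail_at=None):
    # Toy training loop: resume from the last checkpoint if one exists.
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "weight": 0.0}
    while state["step"] < steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["weight"] += 1.0  # stand-in for one SGD update
        state["step"] += 1
        if state["step"] % ckpt_every == 0:
            # Persist state. Real systems write to a temp file, fsync,
            # then atomically rename so a crash can't corrupt the checkpoint.
            with open(ckpt_path, "w") as f:
                json.dump(state, f)
    return state
```

After a crash at step 7 with checkpointing every 3 steps, a restart resumes from step 6 and only the un-checkpointed steps are repeated; that repeated work is exactly what the fault-recovery measurement in this project quantifies.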
The following questions represent the style and difficulty of what you'll see on the midterm and final. Start thinking about them now.
Explain the Ring-AllReduce algorithm. Why is it preferred over parameter servers for large-scale training?
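To build intuition for this question, here is a pure-Python simulation of Ring-AllReduce's two phases, scatter-reduce followed by all-gather (the function name and data layout are illustrative, not a real collective API):

```python
def ring_allreduce(nodes):
    # nodes[i][c] is node i's local copy of chunk c; the vector on each
    # of the n nodes is split into n chunks.
    n = len(nodes)
    data = [[list(chunk) for chunk in node] for node in nodes]

    # Phase 1: scatter-reduce. At step s, node i sends chunk (i - s) mod n
    # to its ring neighbor (i + 1) mod n, which adds it into its own copy.
    # After n-1 steps, node i holds the fully reduced chunk (i + 1) mod n.
    for s in range(n - 1):
        outgoing = [(i, (i - s) % n, list(data[i][(i - s) % n]))
                    for i in range(n)]
        for i, c, payload in outgoing:
            dst = (i + 1) % n
            data[dst][c] = [a + b for a, b in zip(data[dst][c], payload)]

    # Phase 2: all-gather. At step s, node i forwards the reduced chunk
    # (i + 1 - s) mod n around the ring; receivers overwrite their copy.
    for s in range(n - 1):
        outgoing = [(i, (i + 1 - s) % n, list(data[i][(i + 1 - s) % n]))
                    for i in range(n)]
        for i, c, payload in outgoing:
            data[(i + 1) % n][c] = payload
    return data
```

Note the bandwidth argument hiding in the loops: each node sends one chunk per step for 2(n-1) steps, so per-node traffic is about twice the vector size regardless of how many nodes participate, whereas a parameter server's ingress bandwidth grows linearly with the number of workers.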
What is the difference between synchronous and asynchronous distributed gradient descent? What are the trade-offs?
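The trade-off in this question can be seen on a toy quadratic. The sketch below (illustrative helpers; staleness is modeled explicitly as an index into parameter history) contrasts synchronous averaging behind a barrier with asynchronous updates that apply stale gradients:

```python
def grad(x):
    return 2.0 * x  # gradient of f(x) = x**2

def sync_sgd(x0, workers, steps, lr):
    # Synchronous: all workers compute gradients at the *same* parameters;
    # the average is applied once per step (i.e., a barrier every step).
    x = x0
    for _ in range(steps):
        g = sum(grad(x) for _ in range(workers)) / workers
        x -= lr * g
    return x

def async_sgd(x0, steps, lr, staleness):
    # Asynchronous: no barrier, so an update may use a gradient computed
    # at parameters that are `staleness` updates old. Mild staleness only
    # perturbs convergence; large staleness times lr can diverge outright.
    history = [x0]
    x = x0
    for t in range(steps):
        x -= lr * grad(history[max(0, t - staleness)])
        history.append(x)
    return x
```

With a small learning rate both variants reach the optimum, but raising the learning rate while gradients are several steps stale makes the asynchronous iterates oscillate and blow up, which is the core stability trade-off against the synchronous barrier's straggler cost.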
How does gradient checkpointing reduce memory usage in large model training?
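A simplified counting model makes the answer to this question concrete. The functions below are an illustrative back-of-the-envelope model, not PyTorch's actual `torch.utils.checkpoint` accounting:

```python
import math

def peak_activations(n_layers):
    # Plain backprop on a chain of layers: every activation is kept from
    # the forward pass until its backward step -> O(n) activation memory.
    return n_layers

def peak_activations_checkpointed(n_layers, segment):
    # Gradient (activation) checkpointing: keep only one activation per
    # segment boundary during the forward pass, then recompute a segment's
    # interior activations right before running its backward.
    # Peak = all checkpoints + one segment being recomputed at a time.
    checkpoints = math.ceil(n_layers / segment)
    return checkpoints + (segment - 1)
```

For 64 layers with segments of 8, the peak drops from 64 stored activations to 8 + 7 = 15; choosing the segment length near sqrt(n) gives O(sqrt(n)) memory at the cost of roughly one extra forward pass of recomputation.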