🎓 University of Aliens — Course Portal
⚙️ AI Engineering (AIE302) · Week 1 of 14 · BSc Year 3 · ⏱ ~50 min

Week 1: Distributed Training, Data Parallelism & Gradient Aggregation

Design distributed AI training systems: data and model parallelism, gradient aggregation strategies, parameter servers, and fault-tolerant training.

AIE302 — Lecture 1 · BSc Y3
🎬 Lecture video (Creative Commons licensed)
🎯 Learning Objectives
  • Implement data-parallel training with PyTorch DDP
  • Understand model parallelism for LLM training
  • Design fault-tolerant training with checkpointing
  • Optimize communication in distributed training (gradient compression)
Topics Covered This Lecture
  • Data Parallelism: AllReduce & Ring-AllReduce
  • Model Parallelism & Tensor Parallelism
  • Parameter Servers vs All-Reduce
  • Fault Tolerance: Checkpointing & Elastic Training
📖 Lecture Overview

This first lecture establishes the foundations of distributed training for AI systems. By the end of this session, you will have the conceptual grounding and practical starting point needed for the rest of the course.

Why this matters: modern models are frequently too large, or too slow, to train on a single device. The four pillars introduced here (data and model parallelism, gradient aggregation strategies, parameter servers, and fault-tolerant training) underpin everything that follows; make sure you understand the core concepts before proceeding to Week 2.

Key Concepts

The lecture introduces the four main pillars of this course:
  • Data Parallelism: AllReduce & Ring-AllReduce
  • Model Parallelism & Tensor Parallelism
  • Parameter Servers vs All-Reduce
  • Fault Tolerance: Checkpointing & Elastic Training
Each will be explored in depth over the 14-week curriculum, with hands-on projects reinforcing theory at every stage.
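The synchronous data-parallel loop behind the first pillar can be previewed in a few lines. A minimal NumPy sketch (the worker count, vector size, and learning rate are made-up illustration values); the averaging step is exactly what an AllReduce collective computes:

```python
import numpy as np

# Hypothetical toy setup: 4 workers, each holding a gradient vector
# computed on its own shard of the data. Synchronous data parallelism
# averages these gradients so every replica applies the same update.
rng = np.random.default_rng(0)
num_workers = 4
local_grads = [rng.normal(size=8) for _ in range(num_workers)]

# AllReduce(sum) followed by division by the world size
# is gradient averaging.
avg_grad = sum(local_grads) / num_workers

# Every worker applies the identical averaged gradient,
# so the replicated parameters stay in sync after the step.
w = np.zeros(8)        # replicated model parameter
lr = 0.1               # illustrative learning rate
w -= lr * avg_grad     # same update on all workers
print(avg_grad.shape)
```

In practice, frameworks such as PyTorch DDP perform this averaging for you during the backward pass rather than as a separate step.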

# Quick Start: verify your environment is ready for AIE302 import sys print(f"Python {sys.version}") # Check key libraries are installed try: import numpy, pandas, matplotlib print("✅ Core libraries ready") except ImportError as e: print(f"❌ Missing: {e} — run: pip install numpy pandas matplotlib")

This Week's Focus

Focus on mastering the first two pillars: data parallelism (AllReduce and Ring-AllReduce) and model/tensor parallelism. These are the prerequisites for everything in Week 2. The concepts build on each other; do not skip the practice exercises.
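The second focus area, tensor parallelism, can also be previewed with a toy example. A hedged NumPy sketch (the matrix sizes and the 2-way column split are illustrative assumptions): a linear layer's weight matrix is split column-wise across workers, each computes a slice of the output, and concatenating the slices recovers the full result.

```python
import numpy as np

# Minimal sketch of tensor (intra-layer) model parallelism for a single
# linear layer y = x @ W, split column-wise across 2 workers. Each
# worker stores only its shard of W and computes a shard of y.
rng = np.random.default_rng(1)
x = rng.normal(size=(3, 4))   # batch of activations (replicated)
W = rng.normal(size=(4, 6))   # full weight matrix; in real tensor
                              # parallelism it is never materialized
                              # on a single device

W0, W1 = W[:, :3], W[:, 3:]   # column shards on worker 0 and worker 1
y0 = x @ W0                   # partial output on worker 0
y1 = x @ W1                   # partial output on worker 1

# An all-gather over the output shards reconstructs the full output.
y = np.concatenate([y0, y1], axis=1)
print(np.allclose(y, x @ W))  # matches the unsharded computation
```

The same column/row-split idea, applied to attention and MLP blocks, is the basis of tensor parallelism in large LLM training, which the lecture covers in depth.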

📋 Project 1 of 3 (projects are 50% of the final grade)

AIE302 Project 1: Distributed Training Benchmark

Implement and benchmark data-parallel training on a vision model using PyTorch DDP across multiple GPUs or machines. Measure scaling efficiency, communication overhead, and fault recovery.

  • PyTorch DDP training implementation
  • Scaling efficiency curve (1→2→4 GPUs)
  • Communication overhead analysis
  • Fault injection test with checkpoint recovery
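For the scaling-efficiency deliverable, the metric itself is simple: speedup over the 1-GPU baseline, divided by the GPU count. A small sketch with placeholder timings (the epoch times below are invented for illustration, not benchmark results):

```python
# Scaling efficiency: speedup relative to 1 GPU, divided by the number
# of GPUs. An efficiency of 1.0 means perfect linear scaling; real runs
# fall below it because of communication overhead.
def scaling_efficiency(t1: float, tn: float, n: int) -> float:
    """Efficiency of n-GPU training given per-epoch wall times."""
    speedup = t1 / tn
    return speedup / n

# Hypothetical per-epoch times in seconds; replace with your measurements.
epoch_times = {1: 100.0, 2: 55.0, 4: 32.0}
for n, tn in epoch_times.items():
    eff = scaling_efficiency(epoch_times[1], tn, n)
    print(f"{n} GPU(s): efficiency = {eff:.2f}")
```

Plotting efficiency against GPU count gives the required scaling curve; the gap below 1.0 is largely the communication overhead you are asked to analyze.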
Grading breakdown:
  • 3 Projects: 50%
  • Midterm Exam: 20%
  • Final Exam: 30%
📝 Sample Exam Questions

These represent the style and difficulty of questions you'll see on the midterm and final. Start thinking about them now.

Conceptual Short Answer

Explain the Ring-AllReduce algorithm. Why is it preferred over parameter servers for large-scale training?
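To build intuition for this question, here is a pure-Python toy simulation of the two phases of Ring-AllReduce. The chunk schedule follows the standard ring; real implementations such as NCCL move chunks in parallel over actual network links, which this sequential sketch only imitates.

```python
import numpy as np

# Toy Ring-AllReduce with p workers, each holding one gradient vector
# split into p chunks. For intuition only.
def ring_allreduce(tensors):
    p = len(tensors)
    chunks = [np.array_split(t.astype(float), p) for t in tensors]

    # Phase 1 (reduce-scatter): in p-1 steps each worker passes one
    # chunk to its right neighbor, which adds it to its own copy.
    # Afterwards worker i holds the fully summed chunk (i + 1) % p.
    for step in range(p - 1):
        send_idx = [(i - step) % p for i in range(p)]
        buf = [chunks[i][send_idx[i]].copy() for i in range(p)]
        for i in range(p):
            chunks[(i + 1) % p][send_idx[i]] += buf[i]

    # Phase 2 (all-gather): p-1 more steps circulate the summed chunks
    # until every worker has all of them.
    for step in range(p - 1):
        send_idx = [(i + 1 - step) % p for i in range(p)]
        buf = [chunks[i][send_idx[i]].copy() for i in range(p)]
        for i in range(p):
            chunks[(i + 1) % p][send_idx[i]] = buf[i]

    return [np.concatenate(c) for c in chunks]

# Each worker ends with the elementwise SUM of all inputs (divide by p
# for the average). Per-worker traffic is about 2 * (p - 1) / p of the
# vector size, roughly constant in p: this bandwidth argument is the
# core of why the ring beats a central parameter server at scale.
grads = [np.arange(8.0) + i for i in range(4)]
out = ring_allreduce(grads)
print(np.allclose(out[0], sum(grads)))
```

Contrast this with a parameter server, where the server's link must carry every worker's full gradient and becomes the bottleneck as p grows.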

Analysis Short Answer

What is the difference between synchronous and asynchronous distributed gradient descent? What are the trade-offs?
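A tiny numeric contrast of the two regimes, using f(w) = w² with gradient 2w (the learning rate and 2-worker setup are arbitrary illustration choices):

```python
# Synchronous vs. asynchronous SGD on f(w) = w^2, gradient 2w.
lr = 0.1

# Synchronous: both workers compute gradients at the SAME parameter
# value, the gradients are averaged, and one update is applied.
w_sync = 1.0
g = (2 * w_sync + 2 * w_sync) / 2   # averaged gradient
w_sync -= lr * g                    # 1.0 -> 0.8

# Asynchronous: worker B reads a STALE copy of w, worker A's update
# lands first, then B's stale gradient is applied anyway.
w_async = 1.0
stale = w_async                     # B reads parameters
w_async -= lr * (2 * w_async)       # A's update: 1.0 -> 0.8
w_async -= lr * (2 * stale)         # B's stale gradient: 0.8 -> 0.6

print(w_sync, w_async)
```

The sketch shows the trade-off in miniature: asynchronous training applies more updates without waiting for stragglers, but some of those updates use gradients computed against outdated parameters, which can hurt convergence.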

Applied Code / Proof

How does gradient checkpointing reduce memory usage in large model training?
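A back-of-the-envelope sketch of the memory argument, assuming a chain of n layers with a checkpoint stored every k layers. The n/k + k count is a rough model for intuition, not an exact accounting of any framework:

```python
import math

# Gradient (activation) checkpointing: instead of storing every layer's
# activation for the backward pass, store only every k-th one and
# recompute the others segment by segment during backward. Peak stored
# activations drop from ~n to roughly n/k + k, which is minimized near
# k = sqrt(n), at the cost of roughly one extra forward pass.
def peak_activations(n_layers: int, k: int) -> int:
    checkpoints = math.ceil(n_layers / k)  # activations kept permanently
    segment = k                            # recomputed live in one segment
    return checkpoints + segment

n = 100
print(peak_activations(n, 1))   # no checkpointing: ~n stored
print(peak_activations(n, 10))  # k near sqrt(n): ~2*sqrt(n) stored
```

For n = 100 layers the rough model drops peak storage from about 100 activations to about 20, which is the O(sqrt(n)) memory result usually cited for this technique.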