🎓 University of Aliens — Course Portal
⚙️ AI Engineering (AIE302) · Week 1 of 14 · BSc Year 3 · ⏱ ~50 min

Week 1: Distributed Training, Data Parallelism & Gradient Aggregation

Design distributed AI training systems: data and model parallelism, gradient aggregation strategies, parameter servers, and fault-tolerant training.

AIE302 — Lecture 1 · BSc Y3
🎬 Lecture video (Creative Commons licensed)
🎯 Learning Objectives
  • Implement data-parallel training with PyTorch DDP
  • Understand model parallelism for LLM training
  • Design fault-tolerant training with checkpointing
  • Optimize communication in distributed training (gradient compression)
Topics Covered This Lecture
  • Data Parallelism: AllReduce & Ring-AllReduce
  • Model Parallelism & Tensor Parallelism
  • Parameter Servers vs All-Reduce
  • Fault Tolerance: Checkpointing & Elastic Training
📖 Lecture Overview

This first lecture establishes the foundations of distributed training for AI systems. By the end of this session, you will have the conceptual grounding and practical starting point needed for the rest of the course.

Why this matters: modern models are frequently too large, or too slow, to train on a single device. The four pillars introduced here (data and model parallelism, gradient aggregation strategies, parameter servers, and fault-tolerant training) underpin everything that follows; make sure you understand the core concepts before proceeding to Week 2.

Key Concepts

The lecture introduces the four main pillars of this course:
  • Data Parallelism: AllReduce & Ring-AllReduce
  • Model Parallelism & Tensor Parallelism
  • Parameter Servers vs All-Reduce
  • Fault Tolerance: Checkpointing & Elastic Training
Each will be explored in depth over the 14-week curriculum, with hands-on projects reinforcing theory at every stage.
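The synchronous data-parallel loop behind the first pillar can be previewed in a few lines. A minimal NumPy sketch (the worker count, vector size, and learning rate are made-up illustration values); the averaging step is exactly what an AllReduce collective computes:

```python
import numpy as np

# Hypothetical toy setup: 4 workers, each holding a gradient vector
# computed on its own shard of the data. Synchronous data parallelism
# averages these gradients so every replica applies the same update.
rng = np.random.default_rng(0)
num_workers = 4
local_grads = [rng.normal(size=8) for _ in range(num_workers)]

# AllReduce(sum) followed by division by the world size
# is gradient averaging.
avg_grad = sum(local_grads) / num_workers

# Every worker applies the identical averaged gradient,
# so the replicated parameters stay in sync after the step.
w = np.zeros(8)        # replicated model parameter
lr = 0.1               # illustrative learning rate
w -= lr * avg_grad     # same update on all workers
print(avg_grad.shape)
```

In practice, frameworks such as PyTorch DDP perform this averaging for you during the backward pass rather than as a separate step.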

# Quick Start: verify your environment is ready for AIE302 import sys print(f"Python {sys.version}") # Check key libraries are installed try: import numpy, pandas, matplotlib print("✅ Core libraries ready") except ImportError as e: print(f"❌ Missing: {e} — run: pip install numpy pandas matplotlib")

This Week's Focus

Focus on mastering the first two pillars: data parallelism (AllReduce and Ring-AllReduce) and model/tensor parallelism. These are the prerequisites for everything in Week 2. The concepts build on each other; do not skip the practice exercises.
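The second focus area, tensor parallelism, can also be previewed with a toy example. A hedged NumPy sketch (the matrix sizes and the 2-way column split are illustrative assumptions): a linear layer's weight matrix is split column-wise across workers, each computes a slice of the output, and concatenating the slices recovers the full result.

```python
import numpy as np

# Minimal sketch of tensor (intra-layer) model parallelism for a single
# linear layer y = x @ W, split column-wise across 2 workers. Each
# worker stores only its shard of W and computes a shard of y.
rng = np.random.default_rng(1)
x = rng.normal(size=(3, 4))   # batch of activations (replicated)
W = rng.normal(size=(4, 6))   # full weight matrix; in real tensor
                              # parallelism it is never materialized
                              # on a single device

W0, W1 = W[:, :3], W[:, 3:]   # column shards on worker 0 and worker 1
y0 = x @ W0                   # partial output on worker 0
y1 = x @ W1                   # partial output on worker 1

# An all-gather over the output shards reconstructs the full output.
y = np.concatenate([y0, y1], axis=1)
print(np.allclose(y, x @ W))  # matches the unsharded computation
```

The same column/row-split idea, applied to attention and MLP blocks, is the basis of tensor parallelism in large LLM training, which the lecture covers in depth.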

📋 Project 1 of 3 (projects are 50% of the final grade)

AIE302 Project 1: Distributed Training Benchmark

Implement and benchmark data-parallel training on a vision model using PyTorch DDP across multiple GPUs or machines. Measure scaling efficiency, communication overhead, and fault recovery.

  • PyTorch DDP training implementation
  • Scaling efficiency curve (1→2→4 GPUs)
  • Communication overhead analysis
  • Fault injection test with checkpoint recovery
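For the scaling-efficiency deliverable, the metric itself is simple: speedup over the 1-GPU baseline, divided by the GPU count. A small sketch with placeholder timings (the epoch times below are invented for illustration, not benchmark results):

```python
# Scaling efficiency: speedup relative to 1 GPU, divided by the number
# of GPUs. An efficiency of 1.0 means perfect linear scaling; real runs
# fall below it because of communication overhead.
def scaling_efficiency(t1: float, tn: float, n: int) -> float:
    """Efficiency of n-GPU training given per-epoch wall times."""
    speedup = t1 / tn
    return speedup / n

# Hypothetical per-epoch times in seconds; replace with your measurements.
epoch_times = {1: 100.0, 2: 55.0, 4: 32.0}
for n, tn in epoch_times.items():
    eff = scaling_efficiency(epoch_times[1], tn, n)
    print(f"{n} GPU(s): efficiency = {eff:.2f}")
```

Plotting efficiency against GPU count gives the required scaling curve; the gap below 1.0 is largely the communication overhead you are asked to analyze.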
Grading breakdown:
  • 3 Projects: 50%
  • Midterm Exam: 20%
  • Final Exam: 30%
📝 Sample Exam Questions

These represent the style and difficulty of questions you'll see on the midterm and final. Start thinking about them now.

Conceptual Short Answer

Explain the Ring-AllReduce algorithm. Why is it preferred over parameter servers for large-scale training?
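To build intuition for this question, here is a pure-Python toy simulation of the two phases of Ring-AllReduce. The chunk schedule follows the standard ring; real implementations such as NCCL move chunks in parallel over actual network links, which this sequential sketch only imitates.

```python
import numpy as np

# Toy Ring-AllReduce with p workers, each holding one gradient vector
# split into p chunks. For intuition only.
def ring_allreduce(tensors):
    p = len(tensors)
    chunks = [np.array_split(t.astype(float), p) for t in tensors]

    # Phase 1 (reduce-scatter): in p-1 steps each worker passes one
    # chunk to its right neighbor, which adds it to its own copy.
    # Afterwards worker i holds the fully summed chunk (i + 1) % p.
    for step in range(p - 1):
        send_idx = [(i - step) % p for i in range(p)]
        buf = [chunks[i][send_idx[i]].copy() for i in range(p)]
        for i in range(p):
            chunks[(i + 1) % p][send_idx[i]] += buf[i]

    # Phase 2 (all-gather): p-1 more steps circulate the summed chunks
    # until every worker has all of them.
    for step in range(p - 1):
        send_idx = [(i + 1 - step) % p for i in range(p)]
        buf = [chunks[i][send_idx[i]].copy() for i in range(p)]
        for i in range(p):
            chunks[(i + 1) % p][send_idx[i]] = buf[i]

    return [np.concatenate(c) for c in chunks]

# Each worker ends with the elementwise SUM of all inputs (divide by p
# for the average). Per-worker traffic is about 2 * (p - 1) / p of the
# vector size, roughly constant in p: this bandwidth argument is the
# core of why the ring beats a central parameter server at scale.
grads = [np.arange(8.0) + i for i in range(4)]
out = ring_allreduce(grads)
print(np.allclose(out[0], sum(grads)))
```

Contrast this with a parameter server, where the server's link must carry every worker's full gradient and becomes the bottleneck as p grows.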

Analysis Short Answer

What is the difference between synchronous and asynchronous distributed gradient descent? What are the trade-offs?
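A tiny numeric contrast of the two regimes, using f(w) = w² with gradient 2w (the learning rate and 2-worker setup are arbitrary illustration choices):

```python
# Synchronous vs. asynchronous SGD on f(w) = w^2, gradient 2w.
lr = 0.1

# Synchronous: both workers compute gradients at the SAME parameter
# value, the gradients are averaged, and one update is applied.
w_sync = 1.0
g = (2 * w_sync + 2 * w_sync) / 2   # averaged gradient
w_sync -= lr * g                    # 1.0 -> 0.8

# Asynchronous: worker B reads a STALE copy of w, worker A's update
# lands first, then B's stale gradient is applied anyway.
w_async = 1.0
stale = w_async                     # B reads parameters
w_async -= lr * (2 * w_async)       # A's update: 1.0 -> 0.8
w_async -= lr * (2 * stale)         # B's stale gradient: 0.8 -> 0.6

print(w_sync, w_async)
```

The sketch shows the trade-off in miniature: asynchronous training applies more updates without waiting for stragglers, but some of those updates use gradients computed against outdated parameters, which can hurt convergence.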

Applied Code / Proof

How does gradient checkpointing reduce memory usage in large model training?
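A back-of-the-envelope sketch of the memory argument, assuming a chain of n layers with a checkpoint stored every k layers. The n/k + k count is a rough model for intuition, not an exact accounting of any framework:

```python
import math

# Gradient (activation) checkpointing: instead of storing every layer's
# activation for the backward pass, store only every k-th one and
# recompute the others segment by segment during backward. Peak stored
# activations drop from ~n to roughly n/k + k, which is minimized near
# k = sqrt(n), at the cost of roughly one extra forward pass.
def peak_activations(n_layers: int, k: int) -> int:
    checkpoints = math.ceil(n_layers / k)  # activations kept permanently
    segment = k                            # recomputed live in one segment
    return checkpoints + segment

n = 100
print(peak_activations(n, 1))   # no checkpointing: ~n stored
print(peak_activations(n, 10))  # k near sqrt(n): ~2*sqrt(n) stored
```

For n = 100 layers the rough model drops peak storage from about 100 activations to about 20, which is the O(sqrt(n)) memory result usually cited for this technique.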