🎓 University of America — Course Portal

Week 1: Low-Latency Serving, Batching & SLA Management

Build real-time ML inference systems: model serving optimization, streaming inference pipelines, dynamic batching strategies, and SLA management.

🎬 Lecture video: AIE303 — Lecture 1 (Creative Commons licensed)
🎯 Learning Objectives
  • Serve ML models with <10ms p99 latency using TorchServe or Triton
  • Implement dynamic batching for throughput optimization
  • Build streaming inference pipelines with Kafka
  • Design SLA contracts and measure against them
Topics Covered This Lecture
Model Serving: TorchServe, Triton, vLLM
Dynamic Batching & Throughput Optimization
Streaming Inference with Kafka
SLA Design & Latency Profiling
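The last topic above, SLA design and latency profiling, ultimately comes down to measuring tail latencies. Here is a minimal sketch; the fake_infer workload and the nearest-rank percentile helper are illustrative stand-ins, not course-provided code:

```python
import math
import random
import time

def fake_infer():
    """Stand-in for a model call; sleeps 1-5 ms (hypothetical workload)."""
    time.sleep(random.uniform(0.001, 0.005))

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample covering p% of the distribution."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Collect per-request latencies in milliseconds
latencies = []
for _ in range(200):
    start = time.perf_counter()
    fake_infer()
    latencies.append((time.perf_counter() - start) * 1000)

for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies, p):.2f} ms")
```

SLA targets are stated against these tail percentiles (p99, not the mean), because a small fraction of slow requests dominates user-perceived latency.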
📖 Lecture Overview

This first lecture establishes the foundational framework for Real-Time Inference Pipelines. By the end of this session, you will have the conceptual grounding and practical starting point needed for the rest of the course.

Why this matters: Real-time inference is where trained models meet production constraints. Serving optimization, streaming pipelines, dynamic batching, and SLA management together determine whether a model is usable at scale. This lecture sets up everything that follows; make sure you understand the core concepts before proceeding to Week 2.

Key Concepts

The lecture introduces the four main pillars of this course:
  • Model Serving: TorchServe, Triton, vLLM
  • Dynamic Batching & Throughput Optimization
  • Streaming Inference with Kafka
  • SLA Design & Latency Profiling
Each will be explored in depth over the 14-week curriculum, with hands-on projects reinforcing theory at every stage.

# Quick Start: verify your environment is ready for AIE303
import sys

print(f"Python {sys.version}")

# Check key libraries are installed
try:
    import numpy, pandas, matplotlib
    print("✅ Core libraries ready")
except ImportError as e:
    print(f"❌ Missing: {e} — run: pip install numpy pandas matplotlib")

This Week's Focus

Focus on mastering the first two pillars: model serving (TorchServe, Triton, vLLM) and dynamic batching for throughput optimization. These are the prerequisites for everything in Week 2. The concepts build on each other; do not skip the practice exercises.
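To preview the dynamic batching idea before the full treatment later in the course, here is a toy micro-batcher: it flushes a batch when enough requests are queued or when the oldest request has waited too long. The class, parameter names, and defaults are illustrative (production servers such as Triton expose similar knobs), not a real serving API:

```python
import time

class MicroBatcher:
    """Toy dynamic batcher: flushes when max_batch_size requests are queued,
    or when max_wait_ms has elapsed since the first queued request."""

    def __init__(self, max_batch_size=8, max_wait_ms=5.0):
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self._queue = []
        self._first_arrival = None

    def submit(self, request):
        """Queue one request; return a full batch if one is ready, else None."""
        if not self._queue:
            self._first_arrival = time.perf_counter()
        self._queue.append(request)
        return self._maybe_flush()

    def _maybe_flush(self):
        waited_ms = (time.perf_counter() - self._first_arrival) * 1000
        if len(self._queue) >= self.max_batch_size or waited_ms >= self.max_wait_ms:
            batch, self._queue = self._queue, []
            return batch  # hand the whole batch to the model in one call
        return None

batcher = MicroBatcher(max_batch_size=4, max_wait_ms=50.0)
batches = [b for b in map(batcher.submit, range(10)) if b is not None]
print(batches)  # two full batches of 4; requests 8 and 9 stay queued until timeout
```

The trade-off is visible in the two parameters: a larger max_batch_size raises GPU throughput, while a larger max_wait_ms adds queueing delay to every request's latency budget.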

📋 Project 1 of 3 (projects total 50% of the final grade)

AIE303 Project 1: Low-Latency Inference Service

Deploy a computer vision model as a low-latency REST API using NVIDIA Triton. Implement dynamic batching, measure p50/p95/p99 latency, and optimize to meet a 20ms SLA.

  • Triton inference server deployment
  • Dynamic batching configuration and benchmark
  • Latency profiling (p50/p95/p99/p999)
  • Optimization report: model quantization vs batching trade-offs
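For the Triton deployment, dynamic batching is enabled in the model's config.pbtxt. A minimal sketch follows; the model name, platform, and batch sizes are placeholder values you will tune for your own model, not required settings:

```
name: "vision_model"
platform: "onnxruntime_onnx"
max_batch_size: 32
dynamic_batching {
  preferred_batch_size: [ 4, 8, 16 ]
  max_queue_delay_microseconds: 100
}
```

max_queue_delay_microseconds bounds how long Triton holds requests to form a batch, so it feeds directly into the 20ms SLA budget you must meet.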
Grading Breakdown
  • 3 Projects: 50%
  • Midterm Exam: 20%
  • Final Exam: 30%
📝 Sample Exam Questions

These represent the style and difficulty of questions you'll see on the midterm and final. Start thinking about them now.

Conceptual Short Answer

What is dynamic batching in model serving? How does it trade latency for throughput?

Analysis Short Answer

Explain INT8 quantization. What accuracy-latency trade-off does it offer?
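To build intuition for this question, here is a toy sketch of symmetric INT8 quantization in pure Python: floats in [-max_abs, max_abs] are mapped onto integers in [-127, 127] via a single scale factor. Real frameworks (e.g. PyTorch, TensorRT) use per-channel scales and calibration, which this deliberately omits:

```python
def quantize_int8(weights):
    """Symmetric INT8 quantization: q = round(w / scale), scale = max|w| / 127."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats; error is bounded by scale / 2 per weight."""
    return [qi * scale for qi in q]

w = [0.42, -1.3, 0.07, 1.27]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
print(q, f"max reconstruction error: {max_err:.4f}")
```

The trade-off in the exam question is exactly this rounding error (accuracy cost) versus 4x smaller weights and faster integer arithmetic (latency gain).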

Applied Code / Proof

Design a streaming inference pipeline for real-time fraud scoring on payment transactions.
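One way to start sketching an answer is as a chain of generator stages: source, scoring, alerting. Everything here is a simulation; in production the source stage would be a Kafka consumer loop, and fraud_score would be a served model call rather than a hard-coded rule:

```python
import random

def transaction_stream(n=100, seed=0):
    """Simulated payment transactions (stands in for a Kafka consumer)."""
    rng = random.Random(seed)
    for i in range(n):
        yield {"txn_id": i, "amount": round(rng.uniform(1, 5000), 2)}

def fraud_score(txn):
    """Placeholder scorer: larger amounts look riskier (not a real model)."""
    return min(txn["amount"] / 5000, 1.0)

def score_stream(stream, threshold=0.9):
    """Score each event as it arrives; emit only high-risk alerts."""
    for txn in stream:
        score = fraud_score(txn)
        if score >= threshold:
            yield {**txn, "score": score}

alerts = list(score_stream(transaction_stream()))
print(f"{len(alerts)} high-risk transactions flagged")
```

A full answer would add the pieces this sketch elides: per-event latency budget, batching at the scoring stage, and a dead-letter path for malformed events.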