Week 1: Low-Latency Serving, Batching & SLA Management
Build real-time ML inference systems: model serving optimization, streaming inference pipelines, dynamic batching strategies, and SLA management.
- Serve ML models with <10ms p99 latency using TorchServe or Triton
- Implement dynamic batching for throughput optimization
- Build streaming inference pipelines with Kafka
- Design SLA contracts and measure against them
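Before any serving framework enters the picture, it helps to be concrete about what "p99 latency" means. The following is a minimal sketch of computing p50/p95/p99 from recorded per-request timings and checking them against a latency budget; the simulated latency distribution and the 20 ms target are illustrative assumptions, not part of any serving framework.

```python
import random

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value covering pct% of samples."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[rank]

# Simulated per-request latencies in milliseconds (illustrative only);
# real numbers would come from timing actual inference requests.
random.seed(0)
latencies_ms = [random.lognormvariate(1.5, 0.5) for _ in range(10_000)]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

SLA_MS = 20.0  # hypothetical latency budget for this sketch
print(f"p50={p50:.2f}ms p95={p95:.2f}ms p99={p99:.2f}ms "
      f"SLA met: {p99 <= SLA_MS}")
```

Note that an SLA is almost always stated against a high percentile (p99, not the mean): averages hide the tail, and it is the tail that users experience under load.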
This first lecture establishes the foundational framework for Real-Time Inference Pipelines. By the end of this session, you will have the conceptual grounding and practical starting point needed for the rest of the course.
Key Concepts
The lecture introduces the four main pillars of this course:
- Model Serving: TorchServe, Triton, vLLM
- Dynamic Batching & Throughput Optimization
- Streaming Inference with Kafka
- SLA Design & Latency Profiling
Each will be explored in depth over the 14-week curriculum, with hands-on projects reinforcing theory at every stage.
This Week's Focus
Focus on mastering the first two pillars: Model Serving (TorchServe, Triton, vLLM) and Dynamic Batching & Throughput Optimization. These are the prerequisites for everything in Week 2. The concepts build on each other, so do not skip the practice exercises.
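The core dynamic-batching idea can be sketched in a few lines: requests accumulate until either a maximum batch size is reached or a maximum wait time elapses, whichever comes first. This is a simplified stand-in for what TorchServe and Triton do internally; the class and parameter names (`MicroBatcher`, `max_batch_size`, `max_wait_ms`) are ours, not from either framework.

```python
import queue
import time

class MicroBatcher:
    """Collect requests until max_batch_size is reached or max_wait_ms elapses."""

    def __init__(self, max_batch_size=8, max_wait_ms=5.0):
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_ms / 1000.0
        self._queue = queue.Queue()

    def submit(self, request):
        self._queue.put(request)

    def next_batch(self):
        """Block for the first request, then fill the batch until the deadline."""
        batch = [self._queue.get()]  # wait for at least one request
        deadline = time.monotonic() + self.max_wait_s
        while len(batch) < self.max_batch_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(self._queue.get(timeout=remaining))
            except queue.Empty:
                break
        return batch

# Usage: ten requests arrive at once; the batcher groups them as 8 + 2.
batcher = MicroBatcher(max_batch_size=8, max_wait_ms=5.0)
for i in range(10):
    batcher.submit(f"req-{i}")
first = batcher.next_batch()   # fills immediately to max_batch_size
second = batcher.next_batch()  # takes the remaining 2 after waiting up to 5 ms
print(len(first), len(second))  # 8 2
```

The latency/throughput trade-off lives in `max_wait_ms`: a longer wait produces larger batches (better GPU utilization, higher throughput) but adds up to that wait to every request's latency.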
AIE303 Project 1: Low-Latency Inference Service
Deploy a computer vision model as a low-latency REST API using NVIDIA Triton. Implement dynamic batching, measure p50/p95/p99 latency, and optimize to meet a 20ms SLA.
- Triton inference server deployment
- Dynamic batching configuration and benchmark
- Latency profiling (p50/p95/p99/p99.9)
- Optimization report: model quantization vs batching trade-offs
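As a starting point for the deployment deliverable, a minimal Triton `config.pbtxt` enabling dynamic batching might look like the sketch below. The model name, platform, tensor names, and dimensions are placeholders you would replace with your own model's; the `dynamic_batching` block is where the batching behavior you will benchmark is configured.

```protobuf
name: "resnet50"
platform: "onnxruntime_onnx"
max_batch_size: 32

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

dynamic_batching {
  preferred_batch_size: [ 4, 8, 16 ]
  max_queue_delay_microseconds: 100
}
```

For the benchmark, sweep `max_queue_delay_microseconds` and record how p99 latency and throughput move in opposite directions; that curve is the heart of the optimization report.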
Sample Exam Questions
These questions represent the style and difficulty of what you'll see on the midterm and final. Start thinking about them now.
What is dynamic batching in model serving? How does it trade latency for throughput?
Explain INT8 quantization. What accuracy-latency trade-off does it offer?
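A toy sketch of the trade-off behind this question, using symmetric per-tensor INT8 quantization in pure Python (the weight values are made up): each FP32 weight is mapped to an integer in [-127, 127] via a single scale factor, shrinking storage 4x and enabling integer arithmetic, at the cost of a bounded rounding error.

```python
def quantize_int8(values):
    """Symmetric INT8 quantization: map floats to [-127, 127] via one scale."""
    scale = max(abs(v) for v in values) / 127.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate FP values from the integer codes."""
    return [q * scale for q in quantized]

weights = [0.91, -0.42, 0.07, -1.27, 0.64]  # toy FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

print(q)        # integer codes, 1 byte each instead of 4
print(max_err)  # worst-case round-trip error, bounded by scale / 2
```

The error bound of half a quantization step is why accuracy loss is usually small but not zero, and why outlier weights (which inflate the scale) hurt the most.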
Design a streaming inference pipeline for real-time fraud scoring on payment transactions.
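One possible shape for an answer to the last question, sketched with an in-memory queue standing in for a Kafka topic so it runs without a broker. The topic name, the `score_transaction` rule, and the 0.5 alert threshold are invented for illustration; in the real pipeline the consumer loop would read from Kafka and the score would come from a served model.

```python
import queue
import threading

# In-memory queue standing in for a Kafka topic such as "payments".
payments_topic = queue.Queue()
alerts = []

def score_transaction(txn):
    """Toy fraud score: large amounts and new accounts raise risk (illustrative)."""
    score = 0.0
    if txn["amount"] > 1000:
        score += 0.6
    if txn["account_age_days"] < 30:
        score += 0.3
    return score

def consumer():
    """Consume until the None poison pill, score each transaction, emit alerts."""
    while True:
        txn = payments_topic.get()
        if txn is None:
            break
        if score_transaction(txn) >= 0.5:
            alerts.append(txn["id"])

worker = threading.Thread(target=consumer)
worker.start()

# Producer side: three payment events, then a shutdown marker.
events = [
    {"id": "t1", "amount": 25,   "account_age_days": 400},
    {"id": "t2", "amount": 5000, "account_age_days": 12},
    {"id": "t3", "amount": 1200, "account_age_days": 90},
]
for e in events:
    payments_topic.put(e)
payments_topic.put(None)
worker.join()
print(alerts)  # ['t2', 't3']
```

A full design would also cover the pieces this sketch omits: consumer-group parallelism for throughput, at-least-once delivery and idempotent scoring, and an SLA on end-to-end scoring latency rather than model latency alone.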