Week 1: Observability, SLOs & AI-Specific Failure Modes
Monitor AI systems in production: observability patterns, alerting pipelines, SLO design, incident response, and root cause analysis for AI failures.
- Implement full observability (metrics, traces, logs) for ML systems
- Define SLOs and error budgets for AI services
- Design automated alerting for model degradation
- Conduct AI-specific incident post-mortems
This first lecture establishes the foundational framework for AI System Monitoring & Reliability. By the end of this session, you will have the conceptual grounding and practical starting point needed for the rest of the course.
Key Concepts
The lecture introduces the four main pillars of this course:
- Observability: Metrics, Traces & Logs
- SLOs, SLAs & Error Budgets
- AI-Specific Monitoring: Data Quality, Drift & Accuracy
- Incident Response & Post-Mortems for AI
Each pillar will be explored in depth over the 14-week curriculum, with hands-on projects reinforcing theory at every stage.
This Week's Focus
Focus on mastering the first two pillars: Observability (Metrics, Traces & Logs) and SLOs, SLAs & Error Budgets. They are the prerequisites for everything in Week 2, and the concepts build on each other, so do not skip the practice exercises. A small worked example of error budget arithmetic follows.
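To make the relationship between an SLO and its error budget concrete, here is a minimal Python sketch of the arithmetic. The 99.9% target, the request volume, and the failure count are illustrative assumptions, not values prescribed by the course.

```python
# Minimal sketch of error-budget arithmetic for an availability SLO.
# The 99.9% target and the request/failure counts below are illustrative
# assumptions, not course-specified values.

SLO_TARGET = 0.999             # 99.9% of requests must succeed
WINDOW_REQUESTS = 10_000_000   # requests expected in the measurement window

# The error budget is the share of requests allowed to fail within the window.
error_budget = (1 - SLO_TARGET) * WINDOW_REQUESTS
print(f"Error budget: {error_budget:,.0f} failed requests")   # 10,000

observed_failures = 6_500
budget_consumed = observed_failures / error_budget
print(f"Budget consumed: {budget_consumed:.0%}")               # 65%

# One common policy: once the budget is exhausted before the window ends,
# freeze risky releases and prioritize reliability work instead.
if budget_consumed >= 1.0:
    print("Budget exhausted: freeze feature releases")
```

The release-freeze rule at the end is one way an error budget guides release decisions: spend remaining budget on launches, and stop launching when it runs out.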
AIE304 Project 1: AI Observability Platform
Instrument an ML inference service with full observability using Prometheus, Grafana, and Jaeger. Define SLOs, build alert rules, and write runbooks for the top 3 failure modes. A minimal instrumentation sketch appears after the deliverables list below.
- Prometheus metrics + Grafana dashboards
- Distributed tracing with Jaeger
- SLO document with error budget policy
- Runbooks for 3 failure modes with auto-remediation
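To show what instrumenting an inference service can look like in practice, here is a minimal Python sketch using the prometheus_client library. The metric names, labels, latency buckets, and the predict() stub are illustrative assumptions rather than project requirements; Grafana dashboards, alert rules, and Jaeger tracing would build on metrics like these.

```python
# Minimal sketch: expose request and latency metrics from an ML inference
# service so Prometheus can scrape them. Metric names, labels, and the
# predict() stub are illustrative assumptions, not project requirements.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "inference_requests_total",
    "Total inference requests, labeled by model and outcome",
    ["model", "outcome"],
)
LATENCY = Histogram(
    "inference_latency_seconds",
    "End-to-end inference latency in seconds",
    ["model"],
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5),
)

def predict(features):
    """Stand-in for a real model call."""
    time.sleep(random.uniform(0.01, 0.2))
    return {"score": random.random()}

def handle_request(features, model="demo-model"):
    start = time.perf_counter()
    try:
        result = predict(features)
        REQUESTS.labels(model=model, outcome="success").inc()
        return result
    except Exception:
        REQUESTS.labels(model=model, outcome="error").inc()
        raise
    finally:
        LATENCY.labels(model=model).observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(8000)   # serves /metrics for Prometheus to scrape
    while True:
        handle_request({"x": 1.0})
```

The outcome-labeled counter is what an availability SLO would be computed from, and the latency histogram supports percentile-based latency SLOs and alert rules on budget burn.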
Sample Exam Questions
The questions below represent the style and difficulty of what you'll see on the midterm and final. Start thinking about them now.
What is the difference between monitoring and observability? Why does AI add new monitoring challenges?
Define SLO, SLA, and error budget. How does an error budget guide release decisions?
List three AI-specific failure modes that don't occur in traditional software systems.