Week 1: MDPs, Bellman Equations & Q-Learning
Learn how agents learn from interaction: Markov Decision Processes, Bellman equations, Q-learning, policy gradients, actor-critic methods, and deep RL. Learning objectives:
- Formalize RL problems as Markov Decision Processes
- Implement Q-learning and SARSA for tabular MDPs
- Derive the policy gradient theorem
- Build and train a DQN agent for an Atari game
This first lecture establishes the foundational framework for Reinforcement Learning. By the end of this session, you will have the conceptual grounding and practical starting point needed for the rest of the course.
Key Concepts
The lecture introduces the four main pillars of this course:
- MDPs: States, Actions, Rewards, Transitions
- Dynamic Programming: Value & Policy Iteration
- Q-Learning, SARSA & Temporal Difference Learning
- Deep RL: DQN, Policy Gradient, PPO
Each will be explored in depth over the 14-week curriculum, with hands-on projects reinforcing theory at every stage.
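For orientation, here is the Bellman optimality equation for the optimal action-value function Q*, stated in standard textbook notation rather than anything course-specific (γ ∈ [0, 1) is the discount factor, P the transition model, R the expected reward):

```latex
% Bellman optimality equation for Q*: the value of taking action a in
% state s and acting optimally thereafter.
Q^*(s, a) = \mathbb{E}\left[ r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \;\middle|\; s_t = s,\ a_t = a \right]

% Equivalent form with an explicit transition model P(s' \mid s, a)
% and expected reward R(s, a, s'):
Q^*(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \max_{a'} Q^*(s', a') \right]
```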
This Week's Focus
Focus on mastering the first two pillars: MDPs (states, actions, rewards, transitions) and dynamic programming (value and policy iteration). These are the prerequisites for everything in Week 2. The concepts build on each other, so do not skip the practice exercises.
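Since value iteration is this week's focus, here is a minimal sketch of the algorithm on a toy tabular MDP. The two-state transition and reward arrays are made-up placeholders for illustration, not course data:

```python
import numpy as np

# Toy MDP: P[s, a, s'] is the transition probability, R[s, a] the
# expected immediate reward. Both arrays are hypothetical placeholders.
n_states, n_actions, gamma = 2, 2, 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman optimality backup:
    # Q[s, a] = R[s, a] + gamma * sum_{s'} P[s, a, s'] * V[s']
    Q = R + gamma * (P @ V)
    V_new = Q.max(axis=1)
    if np.abs(V_new - V).max() < 1e-8:  # stop once values have converged
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)  # greedy policy w.r.t. the converged values
print("V* =", V, "greedy policy =", policy)
```

Policy iteration, the other algorithm named in this week's focus, uses the same backup but alternates full policy evaluation with greedy policy improvement.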
AI302 Project 1: DQN Agent for CartPole/LunarLander
Train a Deep Q-Network (DQN) agent to solve CartPole-v1 and LunarLander-v2 from OpenAI Gym. Implement experience replay, target networks, and epsilon-greedy exploration; a starter skeleton follows the deliverables list below. Deliverables:
- DQN implementation with experience replay and target network
- Training curve showing reward over episodes
- Hyperparameter sensitivity study
- Video of trained agent solving the environment
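To help you start, here is a hedged skeleton wiring together the three required pieces: an experience replay buffer, epsilon-greedy exploration, and a periodically synced target network. It assumes PyTorch and the classic Gym API (where env.step returns four values); the network size and hyperparameters are illustrative guesses, not tuned values:

```python
import random
from collections import deque

import gym
import numpy as np
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

def make_net():
    # Small MLP mapping an observation to one Q-value per action.
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                         nn.Linear(128, n_actions))

q_net = make_net()
target_net = make_net()
target_net.load_state_dict(q_net.state_dict())  # start the two nets in sync
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

buffer = deque(maxlen=50_000)  # experience replay buffer
gamma, batch_size, sync_every = 0.99, 64, 500
epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995
step_count = 0

for episode in range(500):
    obs, done, ep_reward = env.reset(), False, 0.0
    while not done:
        # Epsilon-greedy: explore with probability epsilon, else act greedily.
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                qvals = q_net(torch.as_tensor(obs, dtype=torch.float32))
            action = int(qvals.argmax())

        next_obs, reward, done, _ = env.step(action)  # classic Gym API
        buffer.append((obs, action, reward, next_obs, float(done)))
        obs, ep_reward, step_count = next_obs, ep_reward + reward, step_count + 1

        if len(buffer) >= batch_size:
            # Sample a random minibatch of past transitions to break the
            # correlation between consecutive experiences.
            batch = random.sample(buffer, batch_size)
            s, a, r, s2, d = (torch.as_tensor(np.asarray(x), dtype=torch.float32)
                              for x in zip(*batch))
            # TD target comes from the frozen target network for stability.
            with torch.no_grad():
                target = r + gamma * (1 - d) * target_net(s2).max(dim=1).values
            q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
            loss = nn.functional.mse_loss(q, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        if step_count % sync_every == 0:
            target_net.load_state_dict(q_net.state_dict())  # periodic sync

    epsilon = max(eps_min, epsilon * eps_decay)  # decay exploration per episode
    print(f"episode {episode}: reward {ep_reward:.0f}, epsilon {epsilon:.2f}")
```

For LunarLander-v2, swap the environment name; the discrete-action loop is unchanged, though you will likely need more episodes and a larger network.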
Practice Questions
The following questions represent the style and difficulty of what you'll see on the midterm and final. Start thinking about them now.
- Write the Bellman optimality equation for Q*(s,a) and explain each term.
- What is the difference between on-policy (SARSA) and off-policy (Q-learning) methods? (A code hint follows these questions.)
- Explain the exploration-exploitation tradeoff. Describe three exploration strategies beyond ε-greedy.
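As a hint for the second question, the two tabular update rules differ only in the bootstrap term of the TD target. A minimal sketch, assuming a NumPy Q-table and made-up placeholder values for the transition:

```python
import numpy as np

# Q-table for a hypothetical MDP with 5 states and 2 actions.
Q = np.zeros((5, 2))
alpha, gamma = 0.1, 0.99  # step size and discount (illustrative values)

# One observed transition (placeholder numbers): action a in state s gave
# reward r and next state s2; a2 is the action the policy chooses next.
s, a, r, s2, a2 = 0, 1, 1.0, 3, 0

# SARSA (on-policy): bootstrap with Q[s2, a2], the value of the action the
# behavior policy actually takes next, so the target tracks that policy.
Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])

# Q-learning (off-policy): bootstrap with max_a' Q[s2, a'], the greedy value,
# regardless of which action the behavior policy actually takes.
Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
```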