Week 1: MDPs, Bellman Equations & Q-Learning
Learn how agents learn from interaction: Markov Decision Processes, Bellman equations, Q-learning, policy gradients, actor-critic methods, and deep RL. Learning objectives:
- Formalize RL problems as Markov Decision Processes
- Implement Q-learning and SARSA for tabular MDPs
- Derive the policy gradient theorem
- Build and train a DQN agent for an Atari game
This first lecture establishes the foundational framework for Reinforcement Learning. By the end of this session, you will have the conceptual grounding and practical starting point needed for the rest of the course.
Key Concepts
The lecture introduces the four main pillars of this course:
- MDPs: States, Actions, Rewards, Transitions
- Dynamic Programming: Value & Policy Iteration
- Q-Learning, SARSA & Temporal Difference Learning
- Deep RL: DQN, Policy Gradient, PPO
Each will be explored in depth over the 14-week curriculum, with hands-on projects reinforcing theory at every stage.
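For orientation, here is the Bellman optimality equation for the optimal action-value function Q*, stated in standard textbook notation rather than anything course-specific (γ ∈ [0, 1) is the discount factor, P the transition model, R the expected reward):

```latex
% Bellman optimality equation for Q*: the value of taking action a in
% state s and acting optimally thereafter.
Q^*(s, a) = \mathbb{E}\left[ r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \;\middle|\; s_t = s,\ a_t = a \right]

% Equivalent form with an explicit transition model P(s' \mid s, a)
% and expected reward R(s, a, s'):
Q^*(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \max_{a'} Q^*(s', a') \right]
```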
This Week's Focus
Focus on mastering the first two pillars: MDPs (states, actions, rewards, transitions) and dynamic programming (value and policy iteration). These are the prerequisites for everything in Week 2. The concepts build on each other, so do not skip the practice exercises.
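Since value iteration is this week's focus, here is a minimal sketch of the algorithm on a toy tabular MDP. The two-state transition and reward arrays are made-up placeholders for illustration, not course data:

```python
import numpy as np

# Toy MDP: P[s, a, s'] is the transition probability, R[s, a] the
# expected immediate reward. Both arrays are hypothetical placeholders.
n_states, n_actions, gamma = 2, 2, 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman optimality backup:
    # Q[s, a] = R[s, a] + gamma * sum_{s'} P[s, a, s'] * V[s']
    Q = R + gamma * (P @ V)
    V_new = Q.max(axis=1)
    if np.abs(V_new - V).max() < 1e-8:  # stop once values have converged
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)  # greedy policy w.r.t. the converged values
print("V* =", V, "greedy policy =", policy)
```

Policy iteration, the other algorithm named in this week's focus, uses the same backup but alternates full policy evaluation with greedy policy improvement.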
AI302 Project 1: DQN Agent for CartPole/LunarLander
Train a Deep Q-Network (DQN) agent to solve CartPole-v1 and LunarLander-v2 from OpenAI Gym. Implement experience replay, target networks, and epsilon-greedy exploration; a starter skeleton follows the deliverables list below. Deliverables:
- DQN implementation with experience replay and target network
- Training curve showing reward over episodes
- Hyperparameter sensitivity study
- Video of trained agent solving the environment
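To help you start, here is a hedged skeleton wiring together the three required pieces: an experience replay buffer, epsilon-greedy exploration, and a periodically synced target network. It assumes PyTorch and the classic Gym API (where env.step returns four values); the network size and hyperparameters are illustrative guesses, not tuned values:

```python
import random
from collections import deque

import gym
import numpy as np
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

def make_net():
    # Small MLP mapping an observation to one Q-value per action.
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                         nn.Linear(128, n_actions))

q_net = make_net()
target_net = make_net()
target_net.load_state_dict(q_net.state_dict())  # start the two nets in sync
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

buffer = deque(maxlen=50_000)  # experience replay buffer
gamma, batch_size, sync_every = 0.99, 64, 500
epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995
step_count = 0

for episode in range(500):
    obs, done, ep_reward = env.reset(), False, 0.0
    while not done:
        # Epsilon-greedy: explore with probability epsilon, else act greedily.
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                qvals = q_net(torch.as_tensor(obs, dtype=torch.float32))
            action = int(qvals.argmax())

        next_obs, reward, done, _ = env.step(action)  # classic Gym API
        buffer.append((obs, action, reward, next_obs, float(done)))
        obs, ep_reward, step_count = next_obs, ep_reward + reward, step_count + 1

        if len(buffer) >= batch_size:
            # Sample a random minibatch of past transitions to break the
            # correlation between consecutive experiences.
            batch = random.sample(buffer, batch_size)
            s, a, r, s2, d = (torch.as_tensor(np.asarray(x), dtype=torch.float32)
                              for x in zip(*batch))
            # TD target comes from the frozen target network for stability.
            with torch.no_grad():
                target = r + gamma * (1 - d) * target_net(s2).max(dim=1).values
            q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
            loss = nn.functional.mse_loss(q, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        if step_count % sync_every == 0:
            target_net.load_state_dict(q_net.state_dict())  # periodic sync

    epsilon = max(eps_min, epsilon * eps_decay)  # decay exploration per episode
    print(f"episode {episode}: reward {ep_reward:.0f}, epsilon {epsilon:.2f}")
```

For LunarLander-v2, swap the environment name; the discrete-action loop is unchanged, though you will likely need more episodes and a larger network.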
Practice Questions
The following questions represent the style and difficulty of what you'll see on the midterm and final. Start thinking about them now.
- Write the Bellman optimality equation for Q*(s,a) and explain each term.
- What is the difference between on-policy (SARSA) and off-policy (Q-learning) methods? (A code hint follows these questions.)
- Explain the exploration-exploitation tradeoff. Describe three exploration strategies beyond ε-greedy.
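As a hint for the second question, the two tabular update rules differ only in the bootstrap term of the TD target. A minimal sketch, assuming a NumPy Q-table and made-up placeholder values for the transition:

```python
import numpy as np

# Q-table for a hypothetical MDP with 5 states and 2 actions.
Q = np.zeros((5, 2))
alpha, gamma = 0.1, 0.99  # step size and discount (illustrative values)

# One observed transition (placeholder numbers): action a in state s gave
# reward r and next state s2; a2 is the action the policy chooses next.
s, a, r, s2, a2 = 0, 1, 1.0, 3, 0

# SARSA (on-policy): bootstrap with Q[s2, a2], the value of the action the
# behavior policy actually takes next, so the target tracks that policy.
Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])

# Q-learning (off-policy): bootstrap with max_a' Q[s2, a'], the greedy value,
# regardless of which action the behavior policy actually takes.
Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
```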