🎓 University of America — Course Portal
🤖 Artificial Intelligence · Week 1 of 14 · BSc Y3 · ⏱ ~50 min

Week 1: MDPs, Bellman Equations & Q-Learning

Learn how agents learn from interaction: Markov Decision Processes, Bellman equations, Q-learning, policy gradients, actor-critic methods, and deep RL.

AI302 — Lecture 1 · BSc Y3
🎬 CC Licensed Lecture · 📺 MIT OpenCourseWare (CC BY-NC-SA)
🎯 Learning Objectives
  • Formalize RL problems as Markov Decision Processes
  • Implement Q-learning and SARSA for tabular MDPs
  • Derive the policy gradient theorem
  • Build and train a DQN agent for an Atari game
Topics Covered This Lecture
MDPs: States, Actions, Rewards, Transitions
Dynamic Programming: Value & Policy Iteration
Q-Learning, SARSA & Temporal Difference Learning
Deep RL: DQN, Policy Gradient, PPO
📖 Lecture Overview

This first lecture establishes the foundational framework for Reinforcement Learning. By the end of this session, you will have the conceptual grounding and practical starting point needed for the rest of the course.

Why this matters: This lecture introduces the formalism (MDPs and Bellman equations) on which every later algorithm in the course builds. Make sure you understand the core concepts before proceeding to Week 2.

Key Concepts

The lecture introduces the four main pillars of this course:
  • MDPs: States, Actions, Rewards, Transitions
  • Dynamic Programming: Value & Policy Iteration
  • Q-Learning, SARSA & Temporal Difference Learning
  • Deep RL: DQN, Policy Gradient, PPO
Each will be explored in depth over the 14-week curriculum, with hands-on projects reinforcing theory at every stage.

# Quick Start: verify your environment is ready for AI302
import sys
print(f"Python {sys.version}")

# Check key libraries are installed
try:
    import numpy, pandas, matplotlib
    print("✅ Core libraries ready")
except ImportError as e:
    print(f"❌ Missing: {e} — run: pip install numpy pandas matplotlib")

This Week's Focus

Focus on mastering the MDP formalism (states, actions, rewards, transitions) and dynamic programming (value and policy iteration). These are the prerequisites for everything in Week 2. The concepts build on each other, so do not skip the practice exercises.
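To preview this week's dynamic-programming focus, here is a minimal value-iteration sketch on a toy two-state, two-action MDP. The transition probabilities and rewards below are invented purely for illustration; the update rule itself is the standard Bellman optimality backup V(s) ← max_a [R(s,a) + γ Σ_s' P(s'|s,a) V(s')].

```python
import numpy as np

# Hypothetical toy MDP: 2 states, 2 actions (numbers chosen for illustration)
# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9  # discount factor

# Value iteration: repeatedly apply the Bellman optimality backup
V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * P @ V          # Q[s, a] = R(s,a) + gamma * E[V(s')]
    V_new = Q.max(axis=1)          # greedy backup over actions
    if np.max(np.abs(V_new - V)) < 1e-8:
        break                      # converged
    V = V_new

policy = Q.argmax(axis=1)          # greedy policy w.r.t. the converged values
print("V* =", V, "greedy policy =", policy)
```

Note how the stopping criterion bounds the change between sweeps: because the backup is a γ-contraction, a small change per sweep implies the values are close to V*.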

📋 Project 1 of 3 · Projects count for 50% of the Final Grade

AI302 Project 1: DQN Agent for CartPole/LunarLander

Train a Deep Q-Network (DQN) agent to solve CartPole-v1 and LunarLander-v2 from OpenAI Gym. Implement experience replay, target networks, and epsilon-greedy exploration.

  • DQN implementation with experience replay and target network
  • Training curve showing reward over episodes
  • Hyperparameter sensitivity study
  • Video of trained agent solving the environment
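Two of the project's required ingredients, experience replay and ε-greedy exploration with an annealing schedule, can be sketched without the neural network or Gym dependency. The class and function names below are illustrative, not a required project structure; the defaults (capacity, decay steps) are placeholder values you should tune.

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform experience replay: store transitions, sample i.i.d. minibatches."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # old transitions fall off the end

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def linear_epsilon(step, start=1.0, end=0.05, decay_steps=10000):
    """Common schedule: anneal epsilon linearly, then hold at `end`."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)
```

The third ingredient, the target network, is typically a periodic copy of the online network's weights (e.g. `target.load_state_dict(online.state_dict())` every N steps in PyTorch), which keeps the TD targets stable between syncs.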
  • 3 Projects: 50%
  • Midterm Exam: 20%
  • Final Exam: 30%
📝 Sample Exam Questions

These represent the style and difficulty of questions you'll see on the midterm and final. Start thinking about them now.

Conceptual Short Answer

Write the Bellman optimality equation for Q*(s,a) and explain each term.
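As a reference point for this question (standard RL notation, with γ the discount factor), the equation being asked for is:

```latex
Q^*(s,a) \;=\; \mathbb{E}\left[\, r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \;\middle|\; s_t = s,\ a_t = a \right]
```

A full answer should identify each term: the immediate reward r_{t+1}, the discount factor γ, the maximization over next actions a' (which encodes acting optimally thereafter), and the expectation over the environment's transition dynamics.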

Analysis Short Answer

What is the difference between on-policy (SARSA) and off-policy (Q-learning) methods?

Applied Code / Proof

Explain the exploration-exploitation tradeoff. Describe 3 exploration strategies beyond ε-greedy.
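As a starting point for this question, one widely used alternative to ε-greedy is Boltzmann (softmax) exploration, which samples actions with probability proportional to exp(Q/τ). The sketch below uses only the standard library; the temperature default is arbitrary.

```python
import math, random

def boltzmann_action(q_values, temperature=1.0):
    """Softmax/Boltzmann exploration: sample an action with probability
    proportional to exp(Q/temperature). High temperature -> near-uniform;
    low temperature -> near-greedy."""
    m = max(q_values)  # subtract the max before exp() for numerical stability
    prefs = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(prefs)
    probs = [p / total for p in prefs]
    # Sample an action index from the categorical distribution `probs`
    r = random.random()
    cum = 0.0
    for a, p in enumerate(probs):
        cum += p
        if r < cum:
            return a
    return len(q_values) - 1  # guard against floating-point rounding
```

Unlike ε-greedy, this is value-aware: clearly bad actions are sampled rarely rather than uniformly. Other strategies worth reviewing for the exam include optimistic initialization, UCB-style bonuses, and intrinsic/count-based novelty rewards.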