Week 1: Self-Attention, Positional Encoding & the Transformer Block
Deep dive into transformer architecture: self-attention, BERT and GPT, RLHF fine-tuning, prompt engineering, and the engineering of large language models.
- Implement multi-head self-attention from scratch
- Understand BERT (encoder) vs GPT (decoder) architecture trade-offs
- Apply RLHF and DPO for LLM alignment
- Engineer effective prompts using few-shot and chain-of-thought techniques
This first lecture establishes the foundational framework for Transformer Architectures & LLMs. By the end of this session, you will have the conceptual grounding and practical starting point needed for the rest of the course.
Key Concepts
The lecture introduces the four main pillars of this course:
- Transformer Block: Attention, FFN, LayerNorm
- BERT vs GPT: Encoder vs Decoder
- Instruction Tuning, SFT & RLHF
- Prompt Engineering & In-Context Learning

Each will be explored in depth over the 14-week curriculum, with hands-on projects reinforcing theory at every stage.
This Week's Focus
Focus on mastering the transformer block (attention, FFN, LayerNorm) and the encoder vs decoder distinction between BERT and GPT. These are the prerequisites for everything in Week 2. The concepts build on each other, so do not skip the practice exercises.
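To make the attention component of the transformer block concrete, here is a minimal NumPy sketch of multi-head self-attention for a single sequence. This is an illustrative implementation, not the course's PyTorch reference code; the function names, the per-sequence (no batch dimension) shapes, and the use of full `d_model x d_model` projection matrices are assumptions made for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads, mask=None):
    """x: (T, d_model); Wq/Wk/Wv/Wo: (d_model, d_model); mask: (T, T) bool,
    True where attention is allowed. Returns (T, d_model)."""
    T, d_model = x.shape
    d_head = d_model // n_heads

    def project(W):
        # Project, then split into heads: (T, d_model) -> (n_heads, T, d_head)
        return (x @ W).reshape(T, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = project(Wq), project(Wk), project(Wv)
    # Scaled dot-product scores: (n_heads, T, T)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # block disallowed positions
    out = softmax(scores) @ v                     # (n_heads, T, d_head)
    # Merge heads back and apply the output projection.
    return out.transpose(1, 0, 2).reshape(T, d_model) @ Wo
```

Passing a lower-triangular boolean `mask` turns this into the causal attention used in GPT-style decoders; omitting it gives the bidirectional attention used in BERT-style encoders.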
AI304 Project 1: Mini-GPT Language Model
Implement and train a small GPT-style language model from scratch on a domain corpus (e.g., Shakespeare or Python code). Your implementation should include BPE tokenization, causal attention, and sampling with temperature/top-k.
- Full GPT implementation in PyTorch (<300 lines)
- BPE tokenizer implementation
- Training on custom corpus with perplexity tracking
- Text generation with temperature/top-k sampling
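The temperature/top-k sampling step in the last deliverable can be sketched in a few lines. This is a plain-NumPy version with illustrative names; the project itself expects a PyTorch implementation.

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_k=None, rng=None):
    """Sample a token id from raw logits with temperature and optional top-k."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature  # sharpen or flatten
    if top_k is not None:
        # Keep only the top_k largest logits (ties may keep a few extra).
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)
    probs = np.exp(logits - logits.max())  # stable softmax; exp(-inf) -> 0
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

Low temperature concentrates probability mass on the highest-scoring tokens; `top_k=1` reduces to greedy decoding.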
The following sample questions represent the style and difficulty of what you'll see on the midterm and final. Start thinking about them now.
Explain the difference between an encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) transformer.
What is RLHF? Describe the 3 stages: SFT, reward model training, and PPO optimization.
Why does causal language modeling require a triangular attention mask? Illustrate with an example.
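For the last question, a small worked example may help. In causal language modeling, position i is trained to predict token i+1, so it must not attend to positions after i; a lower-triangular mask enforces exactly that. The snippet below (NumPy, illustrative) shows the mask and the resulting attention weights for uniform scores:

```python
import numpy as np

T = 4
mask = np.tril(np.ones((T, T), dtype=bool))  # True where attention is allowed
print(mask.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
# Row i: token i may attend to columns 0..i only (itself and the past).

scores = np.zeros((T, T))                     # uniform scores for illustration
scores = np.where(mask, scores, -np.inf)      # -inf before softmax -> weight 0
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
# Row 0 puts all weight on position 0; row 1 splits it 0.5/0.5; and so on.
```

Without the mask, position i could peek at token i+1, the very token it is asked to predict, and training would collapse into a trivial copy task.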