Week 1: Text Processing, Word Embeddings & Language Models
From tokenization to transformers: understand how machines process human language, build NLP pipelines, and fine-tune large language models.
- Build text preprocessing pipelines (tokenize, lemmatize, embed)
- Train and use word embeddings (Word2Vec, GloVe, FastText)
- Fine-tune a pretrained BERT model on a classification task
- Understand GPT-style autoregressive language models
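To preview the first objective, here is a minimal preprocessing sketch using only the Python standard library. The stopword set and the regex tokenizer are toy placeholders for illustration; in the course you would typically use spaCy or NLTK, which also handle lemmatization properly.

```python
import re

# Toy stopword list -- real pipelines use curated lists from NLTK/spaCy.
STOPWORDS = {"the", "a", "an", "is", "are", "to", "and", "of"}

def tokenize(text: str) -> list[str]:
    """Lowercase and split on runs of non-word characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

def preprocess(text: str) -> list[str]:
    """Tokenize, then drop stopwords."""
    return [t for t in tokenize(text) if t not in STOPWORDS]

print(preprocess("The cats are sitting on the mat."))
# → ['cats', 'sitting', 'on', 'mat']
```

A production pipeline would add lemmatization ("sitting" → "sit") between these two steps; that requires a morphological model and is best left to a library.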
This first lecture establishes the foundational framework for Natural Language Processing. By the end of this session, you will have the conceptual grounding and practical starting point needed for the rest of the course.
Key Concepts
The lecture introduces the four main pillars of this course:
- Text Preprocessing & Tokenization
- Word Embeddings & Semantic Similarity
- Transformer Architecture
- Fine-tuning BERT & GPT Models
Each will be explored in depth over the 14-week curriculum, with hands-on projects reinforcing theory at every stage.
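The semantic-similarity pillar boils down to one computation: cosine similarity between embedding vectors. The sketch below uses made-up 3-dimensional vectors for illustration; real Word2Vec or GloVe embeddings are 100-300 dimensional and learned from corpora.

```python
import math

# Invented toy vectors -- NOT real embeddings, just shaped so that
# "king" and "queen" point in similar directions and "apple" does not.
embeddings = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.10, 0.20, 0.90],
}

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between u and v: dot product over norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

print(cosine(embeddings["king"], embeddings["queen"]))  # close to 1.0
print(cosine(embeddings["king"], embeddings["apple"]))  # much lower
```

The same function, applied to learned vectors, is what powers nearest-neighbor word lookups and analogy tests.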
This Week's Focus
Focus on mastering: Text Preprocessing & Tokenization and Word Embeddings & Semantic Similarity. These are the prerequisites for everything in Week 2. The concepts build on each other — do not skip the practice exercises.
DS302 Project 1: Sentiment Analysis with Fine-Tuned BERT
Fine-tune a pretrained BERT model on a sentiment analysis dataset (IMDB or Yelp reviews) and compare it against a TF-IDF + logistic regression baseline.
- BERT fine-tuning notebook with HuggingFace Transformers
- Baseline model for comparison
- Accuracy, F1, and confusion matrix analysis
- Error analysis: 10 examples where BERT fails
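To demystify the baseline deliverable, here is a from-scratch sketch of what TF-IDF features compute, using a tiny invented corpus. For the actual project you would likely use scikit-learn's `TfidfVectorizer` and `LogisticRegression`; the smoothed-idf formula below mirrors scikit-learn's default, but the documents are placeholders.

```python
import math
from collections import Counter

# Tiny invented corpus standing in for IMDB/Yelp reviews.
docs = [
    "great movie loved it",
    "terrible movie hated it",
    "great acting great plot",
]

def tf_idf(docs: list[str]) -> list[dict[str, float]]:
    tokenized = [d.split() for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in tokenized for t in set(doc))
    # Smoothed idf (the formula scikit-learn uses by default).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    weights = []
    for doc in tokenized:
        tf = Counter(doc)
        weights.append({t: tf[t] * idf[t] for t in tf})
    return weights

weights = tf_idf(docs)
# "great" appears in 2 of 3 docs, so it gets a lower idf than
# "loved", which appears in only 1 -- rarer terms are up-weighted.
```

Feeding these sparse weights into a logistic regression gives the baseline that your fine-tuned BERT must beat.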
Sample Exam Questions
These questions represent the style and difficulty of what you will see on the midterm and final. Start thinking about them now.
Explain how Word2Vec learns word embeddings. What is the difference between CBOW and Skip-gram?
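One way to internalize the CBOW/Skip-gram distinction is to look at the training pairs each objective consumes. The sketch below (my own illustration, not Word2Vec's actual implementation, which also involves negative sampling and subsampling) generates those pairs for a window of size 2: Skip-gram predicts each context word from the center word, while CBOW predicts the center word from its whole context.

```python
def skipgram_pairs(tokens: list[str], window: int = 2):
    """(center, context) pairs: input is the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens: list[str], window: int = 2):
    """(context, center) pairs: input is the averaged context."""
    pairs = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, center))
    return pairs

sent = ["the", "cat", "sat", "on", "mat"]
print(skipgram_pairs(sent)[:3])
# → [('the', 'cat'), ('the', 'sat'), ('cat', 'the')]
```

Note that one sentence yields many Skip-gram pairs but only one CBOW pair per position, which is part of why Skip-gram tends to work better for rare words.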
Describe the self-attention mechanism. Why is it preferable to an RNN for long sequences?
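For intuition on this question, here is scaled dot-product self-attention over a tiny sequence in plain Python (real implementations use matrix libraries and learned Q/K/V projections; here the projections are assumed to be the identity for readability). Every position attends to every other position in a single step, whereas an RNN must propagate information through each intermediate state.

```python
import math

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(x: list[list[float]]) -> list[list[float]]:
    """x: sequence of d-dim vectors; Q = K = V = x for illustration."""
    d = len(x[0])
    out = []
    for q in x:
        # Scaled dot-product scores of this query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in x]
        w = softmax(scores)
        # Output is a weighted average of all value vectors.
        out.append([sum(wi * v[j] for wi, v in zip(w, x))
                    for j in range(d)])
    return out

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(seq)  # each row mixes information from all rows
```

The path length between any two positions is 1 hop here, versus O(n) hops in an RNN, which is the core of the answer to this question.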
What is the difference between zero-shot, few-shot, and fine-tuning for LLM adaptation?