Week 1: Text Processing, Word Embeddings & Language Models
From tokenization to transformers: understand how machines process human language, build NLP pipelines, and fine-tune large language models.
- Build text preprocessing pipelines (tokenize, lemmatize, embed)
- Train and use word embeddings (Word2Vec, GloVe, FastText)
- Fine-tune a pretrained BERT model on a classification task
- Understand GPT-style autoregressive language models
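To preview the first objective, here is a minimal preprocessing sketch using only the Python standard library. The stopword set and the regex tokenizer are toy placeholders for illustration; in the course you would typically use spaCy or NLTK, which also handle lemmatization properly.

```python
import re

# Toy stopword list -- real pipelines use curated lists from NLTK/spaCy.
STOPWORDS = {"the", "a", "an", "is", "are", "to", "and", "of"}

def tokenize(text: str) -> list[str]:
    """Lowercase and split on runs of non-word characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

def preprocess(text: str) -> list[str]:
    """Tokenize, then drop stopwords."""
    return [t for t in tokenize(text) if t not in STOPWORDS]

print(preprocess("The cats are sitting on the mat."))
# → ['cats', 'sitting', 'on', 'mat']
```

A production pipeline would add lemmatization ("sitting" → "sit") between these two steps; that requires a morphological model and is best left to a library.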
This first lecture establishes the foundational framework for Natural Language Processing. By the end of this session, you will have the conceptual grounding and practical starting point needed for the rest of the course.
Key Concepts
The lecture introduces the four main pillars of this course:
- Text Preprocessing & Tokenization
- Word Embeddings & Semantic Similarity
- Transformer Architecture
- Fine-tuning BERT & GPT Models
Each will be explored in depth over the 14-week curriculum, with hands-on projects reinforcing theory at every stage.
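The semantic-similarity pillar boils down to one computation: cosine similarity between embedding vectors. The sketch below uses made-up 3-dimensional vectors for illustration; real Word2Vec or GloVe embeddings are 100-300 dimensional and learned from corpora.

```python
import math

# Invented toy vectors -- NOT real embeddings, just shaped so that
# "king" and "queen" point in similar directions and "apple" does not.
embeddings = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.10, 0.20, 0.90],
}

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between u and v: dot product over norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

print(cosine(embeddings["king"], embeddings["queen"]))  # close to 1.0
print(cosine(embeddings["king"], embeddings["apple"]))  # much lower
```

The same function, applied to learned vectors, is what powers nearest-neighbor word lookups and analogy tests.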
This Week's Focus
Focus on mastering: Text Preprocessing & Tokenization and Word Embeddings & Semantic Similarity. These are the prerequisites for everything in Week 2. The concepts build on each other — do not skip the practice exercises.
DS302 Project 1: Sentiment Analysis with Fine-Tuned BERT
Fine-tune a pretrained BERT model on a sentiment analysis dataset (IMDB or Yelp reviews) and compare it against a TF-IDF + logistic regression baseline.
- BERT fine-tuning notebook with HuggingFace Transformers
- Baseline model for comparison
- Accuracy, F1, and confusion matrix analysis
- Error analysis: 10 examples where BERT fails
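To demystify the baseline deliverable, here is a from-scratch sketch of what TF-IDF features compute, using a tiny invented corpus. For the actual project you would likely use scikit-learn's `TfidfVectorizer` and `LogisticRegression`; the smoothed-idf formula below mirrors scikit-learn's default, but the documents are placeholders.

```python
import math
from collections import Counter

# Tiny invented corpus standing in for IMDB/Yelp reviews.
docs = [
    "great movie loved it",
    "terrible movie hated it",
    "great acting great plot",
]

def tf_idf(docs: list[str]) -> list[dict[str, float]]:
    tokenized = [d.split() for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in tokenized for t in set(doc))
    # Smoothed idf (the formula scikit-learn uses by default).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    weights = []
    for doc in tokenized:
        tf = Counter(doc)
        weights.append({t: tf[t] * idf[t] for t in tf})
    return weights

weights = tf_idf(docs)
# "great" appears in 2 of 3 docs, so it gets a lower idf than
# "loved", which appears in only 1 -- rarer terms are up-weighted.
```

Feeding these sparse weights into a logistic regression gives the baseline that your fine-tuned BERT must beat.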
Sample Exam Questions
These questions represent the style and difficulty of what you will see on the midterm and final. Start thinking about them now.
Explain how Word2Vec learns word embeddings. What is the difference between CBOW and Skip-gram?
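One way to internalize the CBOW/Skip-gram distinction is to look at the training pairs each objective consumes. The sketch below (my own illustration, not Word2Vec's actual implementation, which also involves negative sampling and subsampling) generates those pairs for a window of size 2: Skip-gram predicts each context word from the center word, while CBOW predicts the center word from its whole context.

```python
def skipgram_pairs(tokens: list[str], window: int = 2):
    """(center, context) pairs: input is the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens: list[str], window: int = 2):
    """(context, center) pairs: input is the averaged context."""
    pairs = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, center))
    return pairs

sent = ["the", "cat", "sat", "on", "mat"]
print(skipgram_pairs(sent)[:3])
# → [('the', 'cat'), ('the', 'sat'), ('cat', 'the')]
```

Note that one sentence yields many Skip-gram pairs but only one CBOW pair per position, which is part of why Skip-gram tends to work better for rare words.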
Describe the self-attention mechanism. Why is it preferable to an RNN for long sequences?
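For intuition on this question, here is scaled dot-product self-attention over a tiny sequence in plain Python (real implementations use matrix libraries and learned Q/K/V projections; here the projections are assumed to be the identity for readability). Every position attends to every other position in a single step, whereas an RNN must propagate information through each intermediate state.

```python
import math

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(x: list[list[float]]) -> list[list[float]]:
    """x: sequence of d-dim vectors; Q = K = V = x for illustration."""
    d = len(x[0])
    out = []
    for q in x:
        # Scaled dot-product scores of this query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in x]
        w = softmax(scores)
        # Output is a weighted average of all value vectors.
        out.append([sum(wi * v[j] for wi, v in zip(w, x))
                    for j in range(d)])
    return out

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(seq)  # each row mixes information from all rows
```

The path length between any two positions is 1 hop here, versus O(n) hops in an RNN, which is the core of the answer to this question.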
What is the difference between zero-shot, few-shot, and fine-tuning for LLM adaptation?