🎓 University of America — Course Portal
📊 DS302 Data Science · Week 1 of 14 · BSc Y3 S1 · ⏱ ~50 min

Week 1: Text Processing, Word Embeddings & Language Models

From tokenization to transformers: understand how machines process human language, build NLP pipelines, and fine-tune large language models.

🎬 Lecture 1 video — MIT OpenCourseWare (CC BY-NC-SA)
🎯 Learning Objectives
  • Build text preprocessing pipelines (tokenize, lemmatize, embed)
  • Train and use word embeddings (Word2Vec, GloVe, FastText)
  • Fine-tune a pretrained BERT model on a classification task
  • Understand GPT-style autoregressive language models
Topics Covered in This Lecture
Text Preprocessing & Tokenization
Word Embeddings & Semantic Similarity
Transformer Architecture
Fine-tuning BERT & GPT Models
📖 Lecture Overview

This first lecture establishes the foundational framework for Natural Language Processing. By the end of this session, you will have the conceptual grounding and practical starting point needed for the rest of the course.

Why this matters: this lecture sets up everything that follows — make sure you understand the core concepts before proceeding to Week 2.

Key Concepts

The lecture introduces the four main pillars of this course: Text Preprocessing & Tokenization, Word Embeddings & Semantic Similarity, Transformer Architecture, Fine-tuning BERT & GPT Models. Each will be explored in depth over the 14-week curriculum, with hands-on projects reinforcing theory at every stage.

```python
# Quick Start: verify your environment is ready for DS302
import sys

print(f"Python {sys.version}")

# Check key libraries are installed
try:
    import numpy, pandas, matplotlib
    print("✅ Core libraries ready")
except ImportError as e:
    print(f"❌ Missing: {e} — run: pip install numpy pandas matplotlib")
```

This Week's Focus

Focus on mastering: Text Preprocessing & Tokenization and Word Embeddings & Semantic Similarity. These are the prerequisites for everything in Week 2. The concepts build on each other — do not skip the practice exercises.
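To make the preprocessing step concrete, here is a minimal pipeline sketch using only the standard library. The `preprocess` helper is illustrative, not part of any course codebase; real pipelines would typically use spaCy or NLTK, which also handle lemmatization and language-aware tokenization.

```python
import re

def preprocess(text: str) -> list[str]:
    """Lowercase, strip punctuation, and tokenize on whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # replace punctuation with spaces
    return text.split()                        # whitespace tokenization

tokens = preprocess("Tokenization splits text into units; embeddings map them to vectors.")
print(tokens)
```

Whitespace tokenization is the simplest scheme; later weeks will contrast it with subword tokenizers (e.g. WordPiece, used by BERT), which handle out-of-vocabulary words more gracefully.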

📋 Project 1 of 3 (the three projects together are worth 50% of the final grade)

DS302 Project 1: Sentiment Analysis with Fine-Tuned BERT

Fine-tune a pretrained BERT model on a sentiment analysis dataset (IMDB or Yelp reviews). Compare against baseline TF-IDF + logistic regression.

  • BERT fine-tuning notebook with HuggingFace Transformers
  • Baseline model for comparison
  • Accuracy, F1, and confusion matrix analysis
  • Error analysis: 10 examples where BERT fails
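As a starting point for the baseline, here is a TF-IDF + logistic regression sketch, assuming scikit-learn is installed. The four hard-coded reviews are a toy stand-in for the IMDB/Yelp data used in the actual project:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the IMDB/Yelp reviews used in the real project
train_texts = [
    "great movie, loved every minute",
    "absolutely wonderful acting and story",
    "terrible plot, a waste of time",
    "boring and badly written",
]
train_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# TF-IDF features feeding a linear classifier
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
baseline.fit(train_texts, train_labels)

print(baseline.predict(["wonderful acting"]))
```

For the real project, swap in the full dataset, hold out a test split, and report accuracy and F1 so the BERT comparison is apples-to-apples.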
Grading Breakdown
  • 3 Projects — 50%
  • Midterm Exam — 20%
  • Final Exam — 30%
📝 Sample Exam Questions

These represent the style and difficulty of questions you'll see on the midterm and final. Start thinking about them now.

Conceptual Short Answer

Explain how Word2Vec learns word embeddings. What is the difference between CBOW and Skip-gram?
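One way to see the CBOW/Skip-gram distinction is in how each builds training pairs from a context window. This is a sketch of the pair-generation step only; real Word2Vec also learns projection matrices and uses tricks such as negative sampling:

```python
def training_pairs(tokens, window=2, mode="skipgram"):
    """Generate (input, target) pairs in word2vec style.

    CBOW:      input = context words, target = center word (many-to-one)
    Skip-gram: input = center word,   target = each context word (one-to-many)
    """
    pairs = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        if mode == "cbow":
            pairs.append((context, center))
        else:
            pairs.extend((center, c) for c in context)
    return pairs

sent = ["the", "cat", "sat", "on", "the", "mat"]
print(training_pairs(sent, window=1, mode="skipgram")[:4])
```

Skip-gram produces more pairs per sentence and tends to do better on rare words; CBOW averages the context and trains faster.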

Analysis Short Answer

Describe the self-attention mechanism. Why is it preferable to an RNN for long sequences?
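A single-head sketch of scaled dot-product self-attention in NumPy may help here (multi-head splitting, masking, and the output projection are omitted). Note that every output row mixes information from all positions in one matrix multiply, whereas an RNN must step through the sequence:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # every position scores every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over positions
    return weights @ V                                # weighted mix of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                           # 5 tokens, model dim 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)
```

Because the score matrix is computed in parallel, path length between any two tokens is 1, which is the key advantage over RNNs for long-range dependencies (at O(n²) memory cost in sequence length).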

Applied Code / Proof

What is the difference between zero-shot, few-shot, and fine-tuning for LLM adaptation?
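The three adaptation strategies differ in what changes: zero-shot and few-shot only change the prompt text, while fine-tuning updates the model's weights on a training set. A prompt-construction sketch (the task wording and `build_prompt` helper are illustrative, not from any library):

```python
def build_prompt(task, examples=None, query=""):
    """Zero-shot: instruction + query. Few-shot: instruction + labeled examples + query.
    Fine-tuning is not a prompt at all: it updates model weights via gradient descent."""
    lines = [task]
    for text, label in (examples or []):  # no examples → zero-shot
        lines.append(f"Review: {text}\nSentiment: {label}")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

zero = build_prompt("Classify the sentiment of each review.", query="Loved it!")
few = build_prompt(
    "Classify the sentiment of each review.",
    examples=[("Great film.", "positive"), ("Awful pacing.", "negative")],
    query="Loved it!",
)
print(few)
```

Zero-shot costs nothing beyond inference, few-shot spends context-window tokens on examples, and fine-tuning (as in Project 1) requires labeled data and training compute but usually gives the best task performance.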