🎓 University of America — Course Portal
📊 Data Science · DS205 · Week 1 of 14 · BSc Y2 S2 · ⏱ ~50 min

Week 1: Spark Architecture, RDDs & Distributed Processing

Master Apache Spark for large-scale data processing, understand the Hadoop ecosystem, and process streaming data with Kafka.

🎬 Lecture video: University of America, DS205 — Lecture 1 · BSc Y2 S2 (Creative Commons Licensed)
🎯 Learning Objectives
  • Explain Spark's RDD model and lazy evaluation
  • Write Spark SQL and DataFrame transformations
  • Understand partitioning and data locality
  • Process streaming data with Structured Streaming
Topics Covered in This Lecture
Spark Architecture & RDDs
DataFrame & Spark SQL
Partitioning & Optimization
Structured Streaming
📖 Lecture Overview

This first lecture establishes the foundational framework for Big Data Technologies. By the end of this session, you will have the conceptual grounding and practical starting point needed for the rest of the course.

Why this matters: Spark for large-scale processing, the Hadoop ecosystem, and Kafka-style streaming underpin every later week of this course. This lecture sets up everything that follows, so make sure you understand the core concepts before proceeding to Week 2.

Key Concepts

The lecture introduces the four main pillars of this course: Spark Architecture & RDDs, DataFrame & Spark SQL, Partitioning & Optimization, Structured Streaming. Each will be explored in depth over the 14-week curriculum, with hands-on projects reinforcing theory at every stage.

# Quick Start: verify your environment is ready for DS205
import sys
print(f"Python {sys.version}")

# Check key libraries are installed
try:
    import numpy, pandas, matplotlib
    print("✅ Core libraries ready")
except ImportError as e:
    print(f"❌ Missing: {e} — run: pip install numpy pandas matplotlib")

This Week's Focus

Focus on mastering: Spark Architecture & RDDs and DataFrame & Spark SQL. These are the prerequisites for everything in Week 2. The concepts build on each other — do not skip the practice exercises.

📋 Project 1 of 3 · Projects: 50% of Final Grade

DS205 Project 1: Large-Scale Log Analysis with Spark

Use PySpark to analyze a large web server log dataset (>1GB). Compute top pages, user sessions, error rates, and time-series trends. Optimize with partitioning.

  • PySpark analysis notebook
  • 5+ analytical queries in Spark SQL
  • Performance comparison (Spark vs pandas)
  • Partitioning strategy justification
Grading breakdown:
  • 3 Projects: 50%
  • Midterm Exam: 20%
  • Final Exam: 30%
📝 Sample Exam Questions

These represent the style and difficulty of questions you'll see on the midterm and final. Start thinking about them now.

Conceptual Short Answer

Explain the difference between a Spark transformation and an action. Why does this matter for performance?

Analysis Short Answer

What is data skew in Spark and how do you diagnose and fix it?

Applied Code / Proof

Write PySpark code to compute the top 10 most visited URLs from a web log DataFrame.