Week 1: Spark Architecture, RDDs & Distributed Processing
Master Apache Spark for large-scale data processing, understand the Hadoop ecosystem, and process streaming data with Kafka.
- Explain Spark's RDD model and lazy evaluation
- Write Spark SQL and DataFrame transformations
- Understand partitioning and data locality
- Process streaming data with Structured Streaming
This first lecture establishes the foundational framework for Big Data Technologies. By the end of this session, you will have the conceptual grounding and practical starting point needed for the rest of the course.
Key Concepts
The lecture introduces the four main pillars of this course:
- Spark Architecture & RDDs
- DataFrame & Spark SQL
- Partitioning & Optimization
- Structured Streaming
Each will be explored in depth over the 14-week curriculum, with hands-on projects reinforcing theory at every stage.
This Week's Focus
Focus on mastering: Spark Architecture & RDDs and DataFrame & Spark SQL. These are the prerequisites for everything in Week 2. The concepts build on each other — do not skip the practice exercises.
DS205 Project 1: Large-Scale Log Analysis with Spark
Use PySpark to analyze a large (>1 GB) web server log dataset. Compute top pages, user sessions, error rates, and time-series trends, and optimize the job with an appropriate partitioning strategy.
- PySpark analysis notebook
- 5+ analytical queries in Spark SQL
- Performance comparison (Spark vs pandas)
- Partitioning strategy justification
These represent the style and difficulty of questions you'll see on the midterm and final. Start thinking about them now.
Explain the difference between a Spark transformation and an action. Why does this matter for performance?
What is data skew in Spark and how do you diagnose and fix it?
Write PySpark code to compute the top 10 most visited URLs from a web log DataFrame.