Week 1: Distributed Compute, Lambda Architecture & Cloud-Native Design
Design and architect large-scale data systems: distributed compute patterns, cloud-native data lakes, streaming systems, and performance engineering.
- Design Lambda and Kappa architectures for real-time + batch systems
- Implement data lake patterns with Delta Lake or Iceberg
- Optimize Spark jobs for performance (partitioning, caching, skew)
- Architect multi-region, fault-tolerant data systems
This first lecture establishes the foundational framework for Big Data Systems & Architecture. By the end of this session, you will have the conceptual grounding and practical starting point needed for the rest of the course.
Key Concepts
The lecture introduces the four main pillars of this course: (1) Lambda and Kappa architectures; (2) Delta Lake and Iceberg, which bring ACID transactions to data lakes; (3) Spark performance tuning; and (4) data mesh and federated architecture. Each is explored in depth over the 14-week curriculum, with hands-on projects reinforcing the theory at every stage.
This Week's Focus
This week, focus on mastering two topics: Lambda and Kappa architectures, and ACID transactions on data lakes with Delta Lake and Iceberg. Both are prerequisites for everything in Week 2. The concepts build on each other, so do not skip the practice exercises.
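To make the Lambda/Kappa distinction concrete, here is a minimal pure-Python sketch. It is a toy model, not production code: the event log `EVENTS`, the cutoff parameter, and all function names are hypothetical and chosen only for illustration. It shows a Lambda-style query merging a recomputed batch view with an incremental speed view, and a Kappa-style query that replays the same log through a single streaming path.

```python
from collections import Counter

# Hypothetical append-only event log: (event_id, page) in arrival order.
EVENTS = [(1, "home"), (2, "cart"), (3, "home"), (4, "home"), (5, "cart")]


def batch_view(events, up_to_id):
    """Batch layer: recompute page counts from the master log up to a cutoff."""
    return Counter(page for eid, page in events if eid <= up_to_id)


def speed_view(events, after_id):
    """Speed layer: incrementally count events the last batch run has not covered."""
    view = Counter()
    for eid, page in events:
        if eid > after_id:
            view[page] += 1
    return view


def lambda_query(events, batch_cutoff):
    """Serving layer: merge the batch view with the real-time view at query time."""
    return batch_view(events, batch_cutoff) + speed_view(events, batch_cutoff)


def kappa_query(events):
    """Kappa: one streaming code path; reprocessing means replaying the log."""
    view = Counter()
    for _, page in events:
        view[page] += 1
    return view


result = lambda_query(EVENTS, batch_cutoff=3)
print(dict(result))  # {'home': 3, 'cart': 2}
```

Both queries return the same counts here. The operational difference is that the Lambda version maintains two code paths (`batch_view` and `speed_view`) that must be kept consistent, while the Kappa version maintains one path and relies on replaying the log for reprocessing.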
DS502 Project 1: Real-Time Analytics Platform
Design and partially implement a real-time analytics platform that handles both streaming (Kafka → Spark Streaming) and batch (Spark SQL) processing, with both paths unified in Delta Lake.
- Architecture diagram with data flow
- Kafka producer + Spark Streaming consumer
- Delta Lake integration with schema evolution
- Performance benchmark and optimization report
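As a starting point for thinking about the schema evolution deliverable, the following is a minimal pure-Python sketch of additive schema evolution. It is a toy model and not Delta Lake's implementation: the table is a plain dictionary, and `merge_schema`, `append_batch`, and `read_table` are hypothetical helpers. In Delta Lake itself, comparable behavior is typically requested per write with the `mergeSchema` option. The sketch assumes new columns are appended, existing rows read back with nulls for those columns, and a conflicting column type rejects the whole write.

```python
def merge_schema(table_schema, batch_schema):
    """Return the evolved schema: existing columns keep their order, new ones are appended.

    Raises TypeError if a column appears in both schemas with different types.
    """
    merged = dict(table_schema)
    for name, dtype in batch_schema.items():
        if name in merged and merged[name] != dtype:
            raise TypeError(f"column {name!r}: table has {merged[name]}, batch has {dtype}")
        merged.setdefault(name, dtype)
    return merged


def append_batch(table, rows, batch_schema):
    """Append rows, evolving the schema first so a conflict leaves the table untouched."""
    new_schema = merge_schema(table["schema"], batch_schema)  # may raise before any mutation
    table["schema"] = new_schema
    table["rows"].extend(rows)


def read_table(table):
    """Project every row onto the current schema; missing columns read as None."""
    columns = table["schema"]
    return [{col: row.get(col) for col in columns} for row in table["rows"]]


# Hypothetical table and batches, for illustration only.
table = {"schema": {"user": "string", "clicks": "long"}, "rows": []}
append_batch(table, [{"user": "a", "clicks": 3}], {"user": "string", "clicks": "long"})
append_batch(
    table,
    [{"user": "b", "clicks": 1, "country": "DE"}],
    {"user": "string", "clicks": "long", "country": "string"},
)
rows = read_table(table)
```

After the second append, the first row reads back with `country` set to `None`, which mirrors how older data appears once a column is added. A batch declaring `clicks` as `"string"` would raise `TypeError` and leave the table unchanged.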
The following questions represent the style and difficulty of what you'll see on the midterm and final. Start thinking about them now.
Compare Lambda and Kappa architectures. What are the operational trade-offs?
What is ACID compliance in a data lake context? How does Delta Lake achieve it?
Describe three common causes of Spark job slowness and their remedies.