🎓 University of America — Course Portal
📊 Data Science · Week 1 of 14 · ⏱ ~45 min lecture · BSc Year 1

What is Data Science?

The landscape, the lifecycle, the toolkit — and your first hands-on exploration in Python. By the end of this lecture you'll understand what data scientists actually do and why it matters.

DS101 — Week 1: What is Data Science?
AI-Generated Lecture Video · University of America
🎬 AI-Generated Video — Ready for Production: This lecture has a complete script and narration guide. Generate the video avatar using Synthesia, HeyGen, or D-ID. Voice: ElevenLabs. Script length: ~3,200 words (~45 min at moderate pace).
🎯 Learning Objectives

After completing this lecture, you will be able to:

  1. Define data science and explain how it differs from traditional statistics and software engineering.
  2. Describe the end-to-end data science lifecycle — from problem framing to communicating results.
  3. Identify the core tools in a data scientist's toolkit and explain when to use each.
  4. Recognize the types of problems data science can and cannot solve.
  5. Load a dataset in Python using pandas and produce your first exploratory summary.
📖 1. What Is Data Science?

Data science is the discipline of extracting knowledge and actionable insights from structured and unstructured data. It sits at the intersection of three fields: mathematics and statistics, computer science and programming, and domain expertise.

The Classic Definition (Drew Conway, 2010): Data science lives at the intersection of hacking skills (programming), math & statistics, and substantive expertise (domain knowledge). The overlap of all three is data science itself — where the most powerful and impactful work happens. The "danger zone" is the overlap of hacking skills and domain knowledge *without* math & statistics: enough skill to produce results, but not enough to know whether they are valid.

What Data Scientists Actually Do

Despite the hype, data scientists spend the majority of their time on unglamorous tasks: collecting data, cleaning it, verifying it, and preparing it for analysis. A typical breakdown looks like this:

🧹 Data Cleaning (40%): Handling missing values, fixing formats, removing duplicates, and validating data quality.

🔍 Exploration (20%): Understanding distributions, spotting patterns, and forming hypotheses through EDA.

🏗️ Feature Engineering (15%): Creating and transforming variables to improve model performance.

🤖 Modeling (15%): Selecting, training, tuning, and evaluating machine learning models.

📊 Communication (10%): Translating technical findings into clear business insights and recommendations.
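To give the cleaning step some texture, here is a minimal pandas sketch on a made-up, deliberately messy table (all values and column names are invented for illustration):

```python
import pandas as pd

# Hypothetical messy dataset: one missing price, inconsistent city strings,
# and a duplicate row hiding behind the formatting noise
df = pd.DataFrame({
    "price": [250_000, 310_000, None, 310_000, 489_000],
    "city":  ["Austin", " austin ", "Boston", " austin ", "Boston"],
})

# Normalize inconsistent string formats
df["city"] = df["city"].str.strip().str.lower()

# Count, then fill, missing values (the median is a simple, common choice)
print(df["price"].isnull().sum())                  # 1 missing price
df["price"] = df["price"].fillna(df["price"].median())

# Remove exact duplicate rows (only visible after normalization)
df = df.drop_duplicates()
print(len(df))                                     # 4 rows remain
```

Note the ordering: the duplicate only becomes detectable *after* the string formats are normalized, which is why cleaning steps are rarely independent of each other.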

💡 Key Insight: The most important skill in data science is often not knowing the fanciest algorithms — it's the ability to frame the right question, then communicate the answer clearly to non-technical stakeholders.
🔄 2. The Data Science Lifecycle

Every data science project follows a recognizable lifecycle, even if the details vary. Understanding this lifecycle helps you plan projects and anticipate where problems tend to occur.

  1. Problem Framing — Define the business question. What decision needs to be made? What would a good answer look like? This step is often skipped, and skipping it is a leading cause of project failure.
  2. Data Acquisition — Identify and collect data. Sources include databases, APIs, web scraping, surveys, sensors, and third-party vendors.
  3. Data Cleaning & Preprocessing — Transform raw data into a usable format. This is where 40–60% of project time goes in practice.
  4. Exploratory Data Analysis (EDA) — Visualize and summarize the data to understand its structure, distributions, and relationships. Form and test hypotheses.
  5. Feature Engineering & Selection — Create new variables, encode categoricals, handle outliers, and select the most relevant features for modeling.
  6. Modeling — Train and compare machine learning models. Tune hyperparameters. Evaluate using proper train/test/validation splits.
  7. Evaluation & Iteration — Assess the model against business metrics, not just technical ones (accuracy, F1, etc.). Iterate based on findings.
  8. Deployment & Communication — Deploy the model or analysis to production, and communicate results to stakeholders.
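To make steps 6–7 concrete, here is a minimal sketch of a train/test split and evaluation with scikit-learn. The data is synthetic and the feature weights are made up for illustration; the point is the shape of the workflow, not the model.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data standing in for a real project dataset
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))                      # 3 illustrative features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Step 6: hold out a test set BEFORE any modeling decisions are made
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# Step 7: evaluate on data the model has never seen
print(f"Test R^2: {r2_score(y_test, model.predict(X_test)):.3f}")
```

Evaluating only on held-out data is what separates an honest estimate of performance from a memorization score — the single most common beginner mistake in step 7.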
🛠️ 3. The Data Scientist's Toolkit

A data scientist uses a range of tools depending on the task. Here is the core toolkit you'll build proficiency in over this program:

🐍 Python: Primary language. Versatile, readable, and has the richest ecosystem of data science libraries.

🐼 pandas: DataFrames for structured data manipulation. The Excel of programming.

🔢 NumPy: Fast numerical computation on arrays and matrices. The foundation of scientific Python.

📈 Matplotlib / Seaborn: Static visualization libraries for EDA and presentation plots.

🤖 scikit-learn: Standard library for classical machine learning: preprocessing, models, evaluation.

📓 Jupyter Notebooks: Interactive environment for exploration, documentation, and sharing analysis.
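These libraries are designed to interlock. A tiny sketch of the usual NumPy → pandas → Matplotlib hand-off, on synthetic data:

```python
import matplotlib
matplotlib.use("Agg")                 # render off-screen (safe in scripts)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# NumPy generates and crunches the raw numbers...
values = np.random.default_rng(0).normal(loc=50, scale=10, size=1000)

# ...pandas wraps them with labels and convenience methods...
s = pd.Series(values, name="measurement")
print(round(s.mean(), 1))             # sample mean, close to 50

# ...and pandas delegates plotting to Matplotlib under the hood
ax = s.plot(kind="hist", bins=30)
plt.close("all")
```

This layering is why the toolkit feels like one system: pandas stores its columns as NumPy arrays, and its `.plot()` methods are thin wrappers over Matplotlib.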

💻 4. Your First Data Analysis in Python

Let's write our first data analysis script. We'll load a dataset about housing prices and produce a basic summary.

```python
# DS101 — Week 1: First Python Analysis
# University of America, Department of Data Science

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Step 1: Load data
df = pd.read_csv('housing.csv')

# Step 2: Understand the shape
print(f"Dataset shape: {df.shape}")   # (rows, columns)
print(df.dtypes)                      # Column data types
print(df.head())                      # First 5 rows

# Step 3: Check for missing values
print(df.isnull().sum())              # Missing value counts

# Step 4: Basic statistics
print(df.describe())                  # Min, max, mean, std, quartiles

# Step 5: Visualize price distribution
plt.figure(figsize=(10, 4))
df['price'].hist(bins=50, color='steelblue', edgecolor='white')
plt.title('Distribution of Housing Prices')
plt.xlabel('Price ($)')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

# Step 6: Find correlations (numeric columns only; avoids errors on text columns)
corr = df.corr(numeric_only=True)['price'].sort_values(ascending=False)
print("\nTop correlations with price:")
print(corr.iloc[1:6])                 # skip price itself (always correlates at 1.0)
```
🏃 Try It Now: Open a Jupyter Notebook, load the California Housing dataset with sklearn.datasets.fetch_california_housing(), and run a version of the analysis above. Note what stands out about the data before building any model.
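A minimal sketch of that loading step. `as_frame=True` asks scikit-learn to return a pandas DataFrame directly (available in scikit-learn ≥ 0.23; the dataset is downloaded and cached on first use):

```python
import pandas as pd
from sklearn.datasets import fetch_california_housing

# Fetch the dataset as a pandas DataFrame (features + target column)
housing = fetch_california_housing(as_frame=True)
df = housing.frame                    # target column is "MedHouseVal"

print(df.shape)                       # (20640, 9): 8 features + target
print(df.isnull().sum().sum())        # this dataset ships with no missing values
print(df.describe().loc["mean", "MedHouseVal"])
```

From here, the Step 2–6 summary above applies directly — just substitute "MedHouseVal" for the 'price' column.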
📁 PROJECT 1 Weight: 50% of DS101 Grade

Hello Data World

Your first data science project. Choose any freely available dataset (from Kaggle, UCI ML Repository, or data.gov), load it in Python, and produce a complete exploratory analysis. The goal is to tell a coherent story about what you found.

Deliverables:
  • A Jupyter Notebook (.ipynb) with all analysis code — clean, documented, and reproducible.
  • A written summary (300–500 words) describing your dataset, key findings, and at least one surprising insight.
  • Minimum 5 visualizations (histogram, scatter plot, bar chart, correlation heatmap, and one of your choice).
  • A "data quality report" — what was missing, what was messy, what you had to fix.
  • A 2-minute video walkthrough of your notebook (screen recording is fine).
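As a starting point for the required correlation heatmap, here is a hedged sketch using seaborn on an invented dataset (swap in your own DataFrame; the column names and values here are hypothetical):

```python
import matplotlib
matplotlib.use("Agg")                 # render off-screen (safe in scripts)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical dataset standing in for whatever you choose for the project
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "sqft":     rng.normal(1500, 400, 200),
    "bedrooms": rng.integers(1, 6, 200),
})
df["price"] = df["sqft"] * 200 + df["bedrooms"] * 10_000 \
              + rng.normal(0, 20_000, 200)

# Correlation heatmap — one of the five required visualizations
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation Heatmap")
plt.tight_layout()
plt.savefig("heatmap.png")            # or plt.show() in a notebook
```

Pinning the color scale with `vmin=-1, vmax=1` keeps heatmaps comparable across datasets, since correlations always live in that range.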
Grading Breakdown:
  • Code quality & correctness: 40%
  • Analysis depth & insights: 35%
  • Communication & presentation: 25%

📅 Due: 3 weeks from today  |  Submit via Student Portal  |  Late penalty: -5% per day

📝 Midterm Preview — Sample Questions

These are representative questions from previous DS101 midterms. The actual midterm is 90 minutes and is worth 20% of your grade.

Question 1 · Multiple Choice MC

Which of the following best describes the primary role of exploratory data analysis (EDA) in the data science lifecycle?

A) Train a machine learning model
B) Understand patterns and validate assumptions about the data
C) Deploy the final model
D) Write the final report

Question 4 · Short Answer SA

Explain the difference between df.shape, df.describe(), and df.info() in pandas. When would you use each?

Question 7 · Applied Problem Applied

Given a dataset with 10,000 rows and 15 columns, you discover that 3 columns have more than 40% missing values and 2 columns are perfectly correlated (r=1.0). Describe your complete preprocessing strategy and justify each decision.

🎯 Final Exam Preview — Sample Questions

The final exam is 3 hours and covers all 14 weeks. It is worth 30% of your grade. One major integrative problem is always included.

Section A — Conceptual MCQ × 20

20 multiple-choice questions covering the full course. Topics include the data science lifecycle, statistical concepts, Python best practices, visualization principles, and ethics.

Section B — Short Answer SA × 4

Four short-answer questions requiring 100–200 word responses. Topics chosen from EDA, data cleaning, correlation vs. causation, and communicating findings.

Section C — Integrative Problem Major Problem

A complete mini-analysis: given a provided dataset (CSV), perform data loading, cleaning, EDA, visualization, and write a 400-word business report on your findings. You will have access to Python and Jupyter during this section.