What is Data Science?
The landscape, the lifecycle, the toolkit — and your first hands-on exploration in Python. By the end of this lecture you'll understand what data scientists actually do and why it matters.
After completing this lecture, you will be able to:
- Define data science and explain how it differs from traditional statistics and software engineering.
- Describe the end-to-end data science lifecycle — from problem framing to communicating results.
- Identify the core tools in a data scientist's toolkit and explain when to use each.
- Recognize the types of problems data science can and cannot solve.
- Load a dataset in Python using pandas and produce your first exploratory summary.
Data science is the discipline of extracting knowledge and actionable insights from structured and unstructured data. It sits at the intersection of three fields: mathematics and statistics, computer science and programming, and domain expertise.
What Data Scientists Actually Do
Despite the hype, data scientists spend the majority of their time on unglamorous tasks: collecting data, cleaning it, verifying it, and preparing it for analysis. A typical breakdown looks like this:
Data Cleaning (40%)
Handling missing values, fixing formats, removing duplicates, and validating data quality.
Exploration (20%)
Understanding distributions, spotting patterns, and forming hypotheses through EDA.
Feature Engineering (15%)
Creating and transforming variables to improve model performance.
Modeling (15%)
Selecting, training, tuning, and evaluating machine learning models.
Communication (10%)
Translating technical findings into clear business insights and recommendations.
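Since data cleaning dominates the time budget, it is worth seeing what those tasks look like in code. The sketch below uses a tiny invented DataFrame (the column names and values are made up for illustration) to demonstrate format fixes, missing-value imputation, and duplicate removal in pandas:

```python
import pandas as pd

# Invented toy records illustrating typical cleaning steps.
df = pd.DataFrame({
    "city": ["NYC", "nyc ", "Boston", "Boston"],
    "price": ["100", "100", None, "250"],
})

df["city"] = df["city"].str.strip().str.upper()         # normalize inconsistent formats
df["price"] = pd.to_numeric(df["price"])                # cast strings to numbers
df["price"] = df["price"].fillna(df["price"].median())  # impute missing values
df = df.drop_duplicates()                               # remove exact duplicate rows
print(df)
```

Note the order matters: imputing before dropping duplicates means a row that was only "unique" because of a missing value can still be collapsed.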
Every data science project follows a recognizable lifecycle, even if the details vary. Understanding this lifecycle helps you plan projects and anticipate where problems tend to occur.
- Problem Framing — Define the business question. What decision needs to be made? What would a good answer look like? Skipping this step is one of the most common causes of project failure.
- Data Acquisition — Identify and collect data. Sources include databases, APIs, web scraping, surveys, sensors, and third-party vendors.
- Data Cleaning & Preprocessing — Transform raw data into a usable format. This is where 40–60% of project time goes in practice.
- Exploratory Data Analysis (EDA) — Visualize and summarize the data to understand its structure, distributions, and relationships. Form and test hypotheses.
- Feature Engineering & Selection — Create new variables, encode categoricals, handle outliers, and select the most relevant features for modeling.
- Modeling — Train and compare machine learning models. Tune hyperparameters. Evaluate using proper train/test/validation splits.
- Evaluation & Iteration — Assess the model against business metrics, not just technical ones (accuracy, F1, etc.). Iterate based on findings.
- Deployment & Communication — Deploy the model or analysis to production, and communicate results to stakeholders.
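Steps 6 and 7 hinge on evaluating against data the model has never seen. Here is a minimal scikit-learn sketch of a proper train/test split; the synthetic dataset and the choice of linear regression are illustrative assumptions, not part of the course materials:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic regression data: y is a noisy linear function of X.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Hold out 20% of rows so evaluation reflects unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"Test MAE: {mae:.3f}")
```

Evaluating on `X_test` rather than `X_train` is what makes the error estimate honest; for hyperparameter tuning you would carve out a third validation split or use cross-validation.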
A data scientist uses a range of tools depending on the task. Here is the core toolkit you'll build proficiency in over this program:
Python
Primary language. Versatile, readable, and has the richest ecosystem of data science libraries.
pandas
DataFrames for structured data manipulation. The Excel of programming.
NumPy
Fast numerical computation on arrays and matrices. The foundation of scientific Python.
Matplotlib / Seaborn
Static visualization libraries for EDA and presentation plots.
Scikit-learn
Standard library for classical machine learning — preprocessing, models, evaluation.
Jupyter Notebooks
Interactive environment for exploration, documentation, and sharing analysis.
Let's write our first data analysis script. We'll load a dataset about housing prices and produce a basic summary.
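Since the original script is not reproduced here, the sketch below stands in for it, using a small invented DataFrame in place of the real housing file; with an actual dataset you would instead call `pd.read_csv("housing.csv")` (hypothetical filename):

```python
import pandas as pd

# Tiny invented sample standing in for a real housing-prices dataset.
df = pd.DataFrame({
    "bedrooms": [3, 2, 4, 3, None],
    "sqft": [1400, 900, 2100, 1600, 1200],
    "price": [250000, 180000, 420000, 310000, 210000],
})

print(df.shape)         # (rows, columns)
df.info()               # column dtypes and non-null counts
print(df.describe())    # summary statistics for numeric columns
print(df.isna().sum())  # missing values per column
```

These four calls — `shape`, `info()`, `describe()`, and `isna().sum()` — are a reasonable first pass on almost any tabular dataset: they tell you the size, the types, the distributions, and where the gaps are.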
Try it yourself: load the California Housing dataset with sklearn.datasets.fetch_california_housing(), and run a version of the analysis above. Notice what you observe about the data before building any model.
Hello Data World
Your first data science project. Choose any freely available dataset (from Kaggle, UCI ML Repository, or data.gov), load it in Python, and produce a complete exploratory analysis. The goal is to tell a coherent story about what you found.
Deliverables:
- A Jupyter Notebook (.ipynb) with all analysis code — clean, documented, and reproducible.
- A written summary (300–500 words) describing your dataset, key findings, and at least one surprising insight.
- Minimum 5 visualizations (histogram, scatter plot, bar chart, correlation heatmap, and one of your choice).
- A "data quality report" — what was missing, what was messy, what you had to fix.
- A 2-minute video walkthrough of your notebook (screen recording is fine).
📅 Due: 3 weeks from today | Submit via Student Portal | Late penalty: -5% per day
These are representative questions from previous DS101 midterms. The actual midterm is 90 minutes and is worth 20% of your grade.
Which of the following best describes the primary role of exploratory data analysis (EDA) in the data science lifecycle?
A) Training a machine learning model
B) Understanding patterns and validating assumptions about the data
C) Deploying the final model
D) Writing the final report
Explain the difference between df.shape, df.describe(), and df.info() in pandas. When would you use each?
Given a dataset with 10,000 rows and 15 columns, you discover that 3 columns have more than 40% missing values and 2 columns are perfectly correlated (r=1.0). Describe your complete preprocessing strategy and justify each decision.
The final exam is 3 hours and covers all 14 weeks. It is worth 30% of your grade. One major integrative problem is always included.
20 multiple-choice questions covering the full course. Topics include the data science lifecycle, statistical concepts, Python best practices, visualization principles, and ethics.
Four short-answer questions requiring 100–200 word responses. Topics chosen from EDA, data cleaning, correlation vs. causation, and communicating findings.
A complete mini-analysis: given a provided dataset (CSV), perform data loading, cleaning, EDA, visualization, and write a 400-word business report on your findings. You will have access to Python and Jupyter during this section.