Merged
3 changes: 3 additions & 0 deletions .gitignore
Expand Up @@ -2,6 +2,9 @@
.vscode/
*.zip
*.csv
*.xls
*.xlsx
*.xlsm
*.parquet
*.log
**/__pycache__/*
Expand Down
201 changes: 55 additions & 146 deletions 01_materials/labs/01_setup.ipynb

Large diffs are not rendered by default.

385 changes: 292 additions & 93 deletions 01_materials/labs/02_data_engineering.ipynb

Large diffs are not rendered by default.

137 changes: 86 additions & 51 deletions 01_materials/labs/03a_sampling.ipynb
@@ -1,5 +1,14 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Setup\n",
"\n",
"In this notebook, we will demonstrate various sampling methods in Pandas and Dask. To illustrate the methods, we use a dataset on the [annual number of objects launched into space from Our World in Data](https://ourworldindata.org/grapher/yearly-number-of-objects-launched-into-outer-space) and hosted in [Tidy Tuesday's Repository](https://github.com/rfordatascience/tidytuesday/blob/main/data/2024/2024-04-23/readme.md)."
]
},
{
"cell_type": "code",
"execution_count": null,
Expand All @@ -8,9 +17,17 @@
"source": [
"%load_ext dotenv\n",
"%dotenv \n",
"import os\n",
"import sys\n",
"sys.path.append(os.getenv('SRC_DIR'))\n",
"%run update_path.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import dask.dataframe as dd\n",
"from glob import glob\n",
"from utils.logger import get_logger\n",
"_logs = get_logger(__name__)"
]
Expand All @@ -21,11 +38,8 @@
"metadata": {},
"outputs": [],
"source": [
"import dask.dataframe as dd\n",
"import pandas as pd\n",
"import numpy as np\n",
"import os\n",
"from glob import glob"
"outer_space_dt = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-04-23/outer_space_objects.csv')"
]
},
{
Expand All @@ -34,25 +48,27 @@
"metadata": {},
"outputs": [],
"source": [
"ft_dir = os.getenv(\"FEATURES_DATA\")\n",
"ft_glob = glob(os.path.join(ft_dir, '**/*.parquet'), \n",
" recursive = True)\n",
"df = dd.read_parquet(ft_glob).compute().reset_index()"
"outer_space_dt.info()"
]
},
{
"cell_type": "markdown",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sampling in Python"
"idx = outer_space_dt['Year'] >= 2020\n",
"idx &= outer_space_dt['Entity'] != 'World'\n",
"outer_space_dt = outer_space_dt[idx]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"+ There are different packages that allow sampling.\n",
"+ A practical approach is to use pandas/Dask sampling methods."
"# Sampling in Python\n",
"\n",
"There are different packages that allow sampling. A practical approach is to use pandas/Dask sampling methods."
]
},
{
Expand All @@ -61,9 +77,9 @@
"source": [
"## Random Sampling\n",
"\n",
"+ Sample n rows from a dataframe with [`df.sample()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html).\n",
"Sample n rows from a dataframe with [`df.sample()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html).\n",
"\n",
"```\n",
"```python\n",
"DataFrame.sample(\n",
" n=None, frac=None, replace=False, weights=None, \n",
" random_state=None, axis=None, ignore_index=False\n",
Expand All @@ -77,7 +93,7 @@
"metadata": {},
"outputs": [],
"source": [
"df.sample(n = 5)"
"outer_space_dt.sample(n = 10, random_state = 42)"
]
},
{
Expand All @@ -88,17 +104,19 @@
"source": [
"import random\n",
"random.seed(42)\n",
"sample_tickers = random.sample(df['ticker'].unique().tolist(), 30)\n",
"df = df[df['ticker'].isin(sample_tickers)]\n",
"simple_sample_dt = df.sample(frac = 0.1)\n",
"simple_sample_dt.shape, df.shape"
"frac = 0.5\n",
"\n",
"simple_sample_dt = outer_space_dt.sample(frac = frac)\n",
"simple_sample_dt.shape, outer_space_dt.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Look at the distribution of tickers."
"## Stratified Sampling\n",
"\n",
"Use `groupby()` and `.sample()` for stratified sampling."
]
},
{
Expand All @@ -107,7 +125,8 @@
"metadata": {},
"outputs": [],
"source": [
"df['ticker'].value_counts().plot(kind='bar')"
"strat_sample_dt = outer_space_dt.groupby('Entity').sample(frac=frac, random_state=42)\n",
"strat_sample_dt.shape, outer_space_dt.shape"
]
},
{
Expand All @@ -116,26 +135,34 @@
"metadata": {},
"outputs": [],
"source": [
"simple_sample_dt['ticker'].value_counts().plot(kind='bar')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Stratified Sampling\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"\n",
"+ Use `groupby()` and `.sample()` for stratified sampling."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"strat_sample_dt = df.groupby(['ticker']).sample(frac = 0.1)\n",
"strat_sample_dt['ticker'].value_counts().plot(kind='bar')"
"# Prepare data for comparison\n",
"df_orig = outer_space_dt['Entity'].value_counts().reset_index()\n",
"df_orig.columns = ['Entity', 'count']\n",
"df_orig['sample_type'] = 'Original'\n",
"\n",
"df_simple = simple_sample_dt['Entity'].value_counts().reset_index()\n",
"df_simple.columns = ['Entity', 'count']\n",
"df_simple['sample_type'] = 'Simple Random'\n",
"\n",
"df_strat = strat_sample_dt['Entity'].value_counts().reset_index()\n",
"df_strat.columns = ['Entity', 'count']\n",
"df_strat['sample_type'] = 'Stratified'\n",
"\n",
"# Combine all data\n",
"combined_df = pd.concat([df_orig, df_simple, df_strat])\n",
"\n",
"# Create faceted plot\n",
"sns.set_style(\"whitegrid\")\n",
"g = sns.catplot(data=combined_df, x='Entity', y='count', col='sample_type', \n",
" kind='bar', height=5, aspect=1, palette='Set2')\n",
"g.set_xticklabels(rotation=90, ha='right', fontsize=5)\n",
"g.set_titles(\"{col_name}\")\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
Expand All @@ -144,7 +171,15 @@
"source": [
"# Sampling in Dask\n",
"\n",
"+ Stratified sampling in `dask` can be achieved with `groupby().apply()` and a lambda function."
"Stratified sampling in Dask works somewhat differently. The code below raises a `KeyError` (the \"key\" `sample` is not found).\n",
"\n",
"```python\n",
"strat_sample_dd = (dd_dt.groupby('Entity', group_keys=False)\n",
" .sample(frac = frac)\n",
" .compute())\n",
"```\n",
"\n",
"However, stratified sampling in Dask can be done with `groupby().apply()` and a lambda function."
]
},
{
Expand All @@ -153,19 +188,19 @@
"metadata": {},
"outputs": [],
"source": [
"dd_dt = dd.read_parquet(ft_glob)\n",
"dd_dt = dd.from_pandas(outer_space_dt, npartitions=4)\n",
"\n",
"strat_sample_dd = (dd_dt\n",
" .groupby('ticker', group_keys=False)\n",
" .apply(lambda x: x.sample(frac = 0.1))\n",
" .groupby('Entity', group_keys=False)\n",
" .apply(lambda x: x.sample(frac = frac))\n",
" .compute()\n",
" .reset_index())\n",
"strat_sample_dd[strat_sample_dd['ticker'].isin(sample_tickers)]['ticker'].value_counts().plot(kind='bar')"
" .reset_index())\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "dsi_participant",
"display_name": "production-env (3.11.13)",
"language": "python",
"name": "python3"
},
Expand All @@ -179,7 +214,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.21"
"version": "3.11.13"
}
},
"nbformat": 4,
Expand Down
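The two pandas sampling patterns this notebook introduces can be sketched outside the notebook as follows. This is a minimal, self-contained illustration: the `Entity`/`Year` column names, the `frac` variable, and the `sample(frac=..., random_state=...)` calls mirror the diff above, but the toy data values are invented stand-ins for the Our World in Data launches dataset.

```python
import pandas as pd

# Toy stand-in for outer_space_objects.csv (column names match the notebook;
# the values are made up for illustration).
outer_space_dt = pd.DataFrame({
    'Entity': ['USA'] * 6 + ['China'] * 6 + ['India'] * 6,
    'Year': list(range(2018, 2024)) * 3,
    'num_objects': range(18),
})

frac = 0.5

# Simple random sample: ~frac of all rows, ignoring Entity.
simple_sample_dt = outer_space_dt.sample(frac=frac, random_state=42)

# Stratified sample: ~frac of the rows within each Entity group, so every
# entity keeps the same share it had in the full data.
strat_sample_dt = outer_space_dt.groupby('Entity').sample(frac=frac, random_state=42)

print(len(simple_sample_dt))   # 9 of 18 rows
print(strat_sample_dt['Entity'].value_counts().to_dict())  # 3 rows per entity
```

For Dask, the diff works around the missing `GroupBy.sample` by applying the pandas sampler to each group: `dd_dt.groupby('Entity', group_keys=False).apply(lambda x: x.sample(frac=frac)).compute()`.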