Merged
3 changes: 3 additions & 0 deletions .gitignore
Expand Up @@ -2,6 +2,9 @@
.vscode/
*.zip
*.csv
*.xls
*.xlsx
*.xlsm
*.parquet
*.log
**/__pycache__/*
Expand Down
201 changes: 55 additions & 146 deletions 01_materials/labs/01_setup.ipynb

Large diffs are not rendered by default.

385 changes: 292 additions & 93 deletions 01_materials/labs/02_data_engineering.ipynb

Large diffs are not rendered by default.

137 changes: 86 additions & 51 deletions 01_materials/labs/03a_sampling.ipynb
@@ -1,5 +1,14 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Setup\n",
"\n",
"In this notebook, we will demonstrate various sampling methods in Pandas and Dask. To illustrate the methods, we use a dataset on the [annual number of objects launched into space from Our World in Data](https://ourworldindata.org/grapher/yearly-number-of-objects-launched-into-outer-space) and hosted in [Tidy Tuesday's Repository](https://github.com/rfordatascience/tidytuesday/blob/main/data/2024/2024-04-23/readme.md)."
]
},
{
"cell_type": "code",
"execution_count": null,
Expand All @@ -8,9 +17,17 @@
"source": [
"%load_ext dotenv\n",
"%dotenv \n",
"import os\n",
"import sys\n",
"sys.path.append(os.getenv('SRC_DIR'))\n",
"%run update_path.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import dask.dataframe as dd\n",
"from glob import glob\n",
"from utils.logger import get_logger\n",
"_logs = get_logger(__name__)"
]
Expand All @@ -21,11 +38,8 @@
"metadata": {},
"outputs": [],
"source": [
"import dask.dataframe as dd\n",
"import pandas as pd\n",
"import numpy as np\n",
"import os\n",
"from glob import glob"
"outer_space_dt = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-04-23/outer_space_objects.csv')"
]
},
{
Expand All @@ -34,25 +48,27 @@
"metadata": {},
"outputs": [],
"source": [
"ft_dir = os.getenv(\"FEATURES_DATA\")\n",
"ft_glob = glob(os.path.join(ft_dir, '**/*.parquet'), \n",
" recursive = True)\n",
"df = dd.read_parquet(ft_glob).compute().reset_index()"
"outer_space_dt.info()"
]
},
{
"cell_type": "markdown",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sampling in Python"
"idx = outer_space_dt['Year'] >= 2020\n",
"idx &= outer_space_dt['Entity'] != 'World'\n",
"outer_space_dt = outer_space_dt[idx]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"+ There are different packages that allow sampling.\n",
"+ A practical approach is to use pandas/Dask sampling methods."
"# Sampling in Python\n",
"\n",
"There are different packages that allow sampling. A practical approach is to use pandas/Dask sampling methods."
]
},
{
Expand All @@ -61,9 +77,9 @@
"source": [
"## Random Sampling\n",
"\n",
"+ Sample n rows from a dataframe with [`df.sample()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html).\n",
"Sample n rows from a dataframe with [`df.sample()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html).\n",
"\n",
"```\n",
"```python\n",
"DataFrame.sample(\n",
" n=None, frac=None, replace=False, weights=None, \n",
" random_state=None, axis=None, ignore_index=False\n",
Expand All @@ -77,7 +93,7 @@
"metadata": {},
"outputs": [],
"source": [
"df.sample(n = 5)"
"outer_space_dt.sample(n = 10, random_state = 42)"
]
},
{
Expand All @@ -88,17 +104,19 @@
"source": [
"import random\n",
"random.seed(42)\n",
"sample_tickers = random.sample(df['ticker'].unique().tolist(), 30)\n",
"df = df[df['ticker'].isin(sample_tickers)]\n",
"simple_sample_dt = df.sample(frac = 0.1)\n",
"simple_sample_dt.shape, df.shape"
"frac = 0.5\n",
"\n",
"simple_sample_dt = outer_space_dt.sample(frac = frac)\n",
"simple_sample_dt.shape, outer_space_dt.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Look at the distribution of tickers."
"## Stratified Sampling\n",
"\n",
"Use `groupby()` and `.sample()` for stratified sampling."
]
},
{
Expand All @@ -107,7 +125,8 @@
"metadata": {},
"outputs": [],
"source": [
"df['ticker'].value_counts().plot(kind='bar')"
"strat_sample_dt = outer_space_dt.groupby('Entity').sample(frac=frac, random_state=42)\n",
"strat_sample_dt.shape, outer_space_dt.shape"
]
},
{
Expand All @@ -116,26 +135,34 @@
"metadata": {},
"outputs": [],
"source": [
"simple_sample_dt['ticker'].value_counts().plot(kind='bar')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Stratified Sampling\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"\n",
"+ Use `groupby()` and `.sample()` for stratified sampling."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"strat_sample_dt = df.groupby(['ticker']).sample(frac = 0.1)\n",
"strat_sample_dt['ticker'].value_counts().plot(kind='bar')"
"# Prepare data for comparison\n",
"df_orig = outer_space_dt['Entity'].value_counts().reset_index()\n",
"df_orig.columns = ['Entity', 'count']\n",
"df_orig['sample_type'] = 'Original'\n",
"\n",
"df_simple = simple_sample_dt['Entity'].value_counts().reset_index()\n",
"df_simple.columns = ['Entity', 'count']\n",
"df_simple['sample_type'] = 'Simple Random'\n",
"\n",
"df_strat = strat_sample_dt['Entity'].value_counts().reset_index()\n",
"df_strat.columns = ['Entity', 'count']\n",
"df_strat['sample_type'] = 'Stratified'\n",
"\n",
"# Combine all data\n",
"combined_df = pd.concat([df_orig, df_simple, df_strat])\n",
"\n",
"# Create faceted plot\n",
"sns.set_style(\"whitegrid\")\n",
"g = sns.catplot(data=combined_df, x='Entity', y='count', col='sample_type', \n",
" kind='bar', height=5, aspect=1, palette='Set2')\n",
"g.set_xticklabels(rotation=90, ha='right', fontsize=5)\n",
"g.set_titles(\"{col_name}\")\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
Expand All @@ -144,7 +171,15 @@
"source": [
"# Sampling in Dask\n",
"\n",
"+ Stratified sampling in `dask` can be achieved with `groupby().apply()` and a lambda function."
"Stratified sampling in Dask works somewhat differently. The code below raises a `KeyError` (the \"key\" `sample` is not found).\n",
"\n",
"```python\n",
"strat_sample_dd = (dd_dt.groupby('Entity', group_keys=False)\n",
" .sample(frac = frac)\n",
" .compute())\n",
"```\n",
"\n",
"However, stratified sampling in Dask can be done with `groupby().apply()` and a lambda function."
]
},
{
Expand All @@ -153,19 +188,19 @@
"metadata": {},
"outputs": [],
"source": [
"dd_dt = dd.read_parquet(ft_glob)\n",
"dd_dt = dd.from_pandas(outer_space_dt, npartitions=4)\n",
"\n",
"strat_sample_dd = (dd_dt\n",
" .groupby('ticker', group_keys=False)\n",
" .apply(lambda x: x.sample(frac = 0.1))\n",
" .groupby('Entity', group_keys=False)\n",
" .apply(lambda x: x.sample(frac = frac))\n",
" .compute()\n",
" .reset_index())\n",
"strat_sample_dd[strat_sample_dd['ticker'].isin(sample_tickers)]['ticker'].value_counts().plot(kind='bar')"
" .reset_index())\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "dsi_participant",
"display_name": "production-env (3.11.13)",
"language": "python",
"name": "python3"
},
Expand All @@ -179,7 +214,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.21"
"version": "3.11.13"
}
},
"nbformat": 4,
Expand Down
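The two pandas sampling patterns this notebook introduces can be sketched outside the notebook as follows. This is a minimal, self-contained illustration: the `Entity`/`Year` column names, the `frac` variable, and the `sample(frac=..., random_state=...)` calls mirror the diff above, but the toy data values are invented stand-ins for the Our World in Data launches dataset.

```python
import pandas as pd

# Toy stand-in for outer_space_objects.csv (column names match the notebook;
# the values are made up for illustration).
outer_space_dt = pd.DataFrame({
    'Entity': ['USA'] * 6 + ['China'] * 6 + ['India'] * 6,
    'Year': list(range(2018, 2024)) * 3,
    'num_objects': range(18),
})

frac = 0.5

# Simple random sample: ~frac of all rows, ignoring Entity.
simple_sample_dt = outer_space_dt.sample(frac=frac, random_state=42)

# Stratified sample: ~frac of the rows within each Entity group, so every
# entity keeps the same share it had in the full data.
strat_sample_dt = outer_space_dt.groupby('Entity').sample(frac=frac, random_state=42)

print(len(simple_sample_dt))   # 9 of 18 rows
print(strat_sample_dt['Entity'].value_counts().to_dict())  # 3 rows per entity
```

For Dask, the diff works around the missing `GroupBy.sample` by applying the pandas sampler to each group: `dd_dt.groupby('Entity', group_keys=False).apply(lambda x: x.sample(frac=frac)).compute()`.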