episode on data formats and tidy data

coderefinery · Jan 2, 2025 · 1e41eb2 · 1e41eb2
1 parent 8666e9c
commit 1e41eb2
Showing 9 changed files with 649 additions and 2 deletions.
diff --git a/content/img/tidy-data/coffees.png b/content/img/tidy-data/coffees.png
diff --git a/content/img/tidy-data/spreadsheet.png b/content/img/tidy-data/spreadsheet.png
diff --git a/content/img/tidy-data/svalbard-compact.png b/content/img/tidy-data/svalbard-compact.png
diff --git a/content/img/tidy-data/svalbard-tidy.png b/content/img/tidy-data/svalbard-tidy.png
diff --git a/content/img/tidy-data/svalbard-transposed.png b/content/img/tidy-data/svalbard-transposed.png
diff --git a/content/img/tidy-data/svalbard-wide.png b/content/img/tidy-data/svalbard-wide.png
diff --git a/content/index.md b/content/index.md
@@ -29,8 +29,7 @@ running Python scripts from the command line.
 - 11:00 - 12:30
   - {doc}`installation`
   - {doc}`jupyter`
-  - Data formats
-  - Tidy data
+  - {doc}`tidy-data`
 
 **Day 1 afternoon**:
 - 13:30 - 15:00
@@ -66,6 +65,7 @@ running Python scripts from the command line.
 installation.md
 python-basics.md
 jupyter.md
+tidy-data.md
 plotting.md
 gallery.md
 profiling.md

diff --git a/content/tidy-data.md b/content/tidy-data.md
@@ -0,0 +1,262 @@
+# Data formats, tidy data, and data cleaning
+
+```{objectives}
+- Know about different storage formats
+- Know about the tidy data format
+- Be able to reformat tabular data into the tidy data format
+```
+
+Data is not always stored in nicely formatted plain text files.
+Sometimes it is in a spreadsheet or in a less nicely formatted text file.
+In this episode we will discuss strategies for working with such data.
+
+
+## Importing data from spreadsheets
+
+We can create a spreadsheet with the following content (only columns A and B;
+the actual content does not have to be exactly the same):
+```{figure} img/tidy-data/coffees.png
+:alt: An example spreadsheet with weekdays in one column and number of coffees in the other
+
+Example spreadsheet with a side note.
+```
+
+Copy this to a second sheet as well and, for demonstration purposes,
+add some side notes to the second sheet and color
+one or two cells (some people like to encode meaning in cell colors).
+
+Save the spreadsheet as `experiment.xls`.
+
+Now we will try together to read and inspect both sheets in a Jupyter
+Notebook:
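+```python
+import pandas as pd
+
+# first sheet: the plain table
+data = pd.read_excel('experiment.xls', sheet_name="Sheet1")
+data
+
+# second sheet: side notes and colored cells (run this in a separate cell,
+# since a notebook only displays the last expression of a cell)
+data = pd.read_excel('experiment.xls', sheet_name="Sheet2")
+data
+```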
+
+```{discussion}
+- We can import data from spreadsheets
+  ([more documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html))!
+- "Side notes" in spreadsheets can be annoying in this context.
+- Encoding data in cell colors is also a problem: the colors are lost on
+  import. We will avoid both in the future.
+```
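
A minimal sketch of how side notes end up in the imported table, and one way to leave them out. To keep it runnable without a spreadsheet file, it uses a small CSV text as a stand-in for the second sheet; the column names `Weekday` and `Coffees` and the note text are hypothetical.

```python
import io

import pandas as pd

# Hypothetical stand-in for the second sheet, exported as CSV:
# two data columns plus a side note typed into a third, unnamed column.
sheet_with_notes = io.StringIO(
    "Weekday,Coffees,\n"
    "Monday,3,\n"
    "Tuesday,2,machine was broken\n"
    "Wednesday,4,\n"
)

# Reading everything drags the side note in as an extra column.
raw = pd.read_csv(sheet_with_notes)
print(raw.columns.tolist())

# Reading only the columns we actually want leaves the notes behind.
sheet_with_notes.seek(0)
clean = pd.read_csv(sheet_with_notes, usecols=["Weekday", "Coffees"])
print(clean)
```

`pd.read_excel` accepts a `usecols` argument as well, for example `usecols="A:B"` to keep only spreadsheet columns A and B.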
+
+
+## Tidy data
+
+```{figure} img/tidy-data/spreadsheet.png
+:alt: An example spreadsheet not in tidy data format
+
+Example spreadsheet (this is a made-up dataset; apologies to biology
+students/researchers, this is not my domain).
+```
+
+```{discussion} What is the problem with storing data like this?
+- Format: Limited interoperability with other programs
+- Error-prone (see e.g. [this famous example](https://www.washingtonpost.com/news/wonk/wp/2013/04/16/is-the-best-evidence-for-austerity-based-on-an-excel-spreadsheet-error/))
+- Difficult to parse ("understand") by scripts: difficult to automate
+- Not in *tidy format*: difficult to extend/modify
+```
+
+How should we arrange the data?
+
+```{figure} img/tidy-data/svalbard-compact.png
+:alt: Example data arranged in a compact representation
+:width: 30%
+
+Attempt 1: Not great since we would need to split the cell contents at the comma. And how should we deal with multiple sightings?
+```
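
Splitting at the comma is possible, as the sketch below suggests, but it is extra work that a better layout avoids. The small `compact` table here is a hypothetical reconstruction of such a layout (one cell holding several sites), not the exact content of the figure.

```python
import pandas as pd

# Hypothetical compact table: all sites for a species share one cell.
compact = pd.DataFrame(
    {
        "Species": ["arctic fox", "walrus"],
        "Observation sites": ["A, B", "B, C"],
    }
)

# Split each cell at the comma and give every site its own row.
tidy = (
    compact
    .assign(**{"Observation site": compact["Observation sites"].str.split(",")})
    .explode("Observation site")
    .drop(columns="Observation sites")
    .reset_index(drop=True)
)

# Remove the whitespace left over after the comma.
tidy["Observation site"] = tidy["Observation site"].str.strip()
print(tidy)
```

Even after splitting, this layout has no place for the number of sightings per site, which is the second problem noted in the caption.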
+
+```{figure} img/tidy-data/svalbard-wide.png
+:alt: Example data arranged in a wide format
+:width: 60%
+
+Attempt 2: Adding observation sites will force us to add columns.
+```
+
+```{figure} img/tidy-data/svalbard-transposed.png
+:alt: Example data transposed
+:width: 60%
+
+Attempt 3: Adding species will force us to add columns.
+```
+
+```{figure} img/tidy-data/svalbard-tidy.png
+:alt: Example data arranged in tidy data format
+:width: 50%
+
+Tidy data format: Columns are variables, rows are observations/measurements. Easy to add new species and sites.
+```
+
+```{keypoints} Tidy data format
+- [Hadley Wickham: Tidy Data](https://vita.had.co.nz/papers/tidy-data.html)
+- Columns are variables
+- Rows are observations/measurements
+- "Long form"
+- Order does not matter
+- **Easy to extend** with more species and more sites
+  without modifying the code
+- **Structure for storing data** - this does not mean that this is ideal
+  for tables in presentations or publications
+- It is possible to convert between wide form and long form and back
+  (e.g. using `pandas.melt` or `pandas.pivot`), see [this example notebook](https://nbviewer.org/github/coderefinery/python-progression/blob/main/notebooks/tidy-data.ipynb)
+```
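
The conversion between long form and wide form can be sketched as follows. This is only an illustration on a four-row subset of the Svalbard sightings, using `DataFrame.pivot` and `DataFrame.melt`; the variable names are made up for this example.

```python
import io

import pandas as pd

# A small subset of the sightings in tidy (long) form.
tidy_csv = """Species,Observation site,Number of sightings
arctic fox,A,3
arctic fox,B,1
walrus,B,1
walrus,C,1
"""
tidy = pd.read_csv(io.StringIO(tidy_csv))

# Long -> wide: one row per species, one column per site.
# Combinations without a sighting become NaN.
wide = tidy.pivot(
    index="Species",
    columns="Observation site",
    values="Number of sightings",
)
print(wide)

# Wide -> long: back to one row per (species, site) pair,
# dropping the combinations that had no sighting.
long_again = (
    wide.reset_index()
    .melt(
        id_vars="Species",
        var_name="Observation site",
        value_name="Number of sightings",
    )
    .dropna()
)
print(long_again)
```

Note that the counts come back as floating point numbers after the round trip, because the wide form needs NaN for the missing combinations.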
+
+
+## Use a standard format
+
+```text
+Species,Observation site,Number of sightings
+arctic fox,A,3
+arctic fox,B,1
+walrus,B,1
+walrus,C,1
+reindeer,B,10
+reindeer,C,1
+polar bear,A,1
+polar bear,C,1
+seal,A,2
+seal,B,1
+seal,C,2
+```
+
+- **Use a format that is standard in your community, don't invent your own**
+- CSV is often a good choice since most visualization tools can read CSV data
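
As a small illustration of why the tidy CSV form is convenient, the sketch below reads the sightings with pandas and sums them per site and per species. The CSV text is embedded as a string only to keep the example self-contained; normally it would be read from a file.

```python
import io

import pandas as pd

sightings_csv = """Species,Observation site,Number of sightings
arctic fox,A,3
arctic fox,B,1
walrus,B,1
walrus,C,1
reindeer,B,10
reindeer,C,1
polar bear,A,1
polar bear,C,1
seal,A,2
seal,B,1
seal,C,2
"""
sightings = pd.read_csv(io.StringIO(sightings_csv))

# The same two lines keep working when rows for new species or
# new sites are appended to the file.
per_site = sightings.groupby("Observation site")["Number of sightings"].sum()
per_species = sightings.groupby("Species")["Number of sightings"].sum()

print(per_site)
print(per_species)
```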
+
+There are many more formats (adapted from [Python for Scientific Computing](https://aaltoscicomp.github.io/python-for-scicomp/work-with-data/)):
+```{list-table}
+:header-rows: 1
+
+* - Name:
+  - Human<br>
+    readable:
+  - Space<br>
+    efficiency:
+  - Arbitrary<br>
+    data:
+  - Tidy<br>
+    data:
+  - Array<br>
+    data:
+  - Long term<br>
+    storage/sharing:
+
+* - [CSV](https://en.wikipedia.org/wiki/Comma-separated_values)
+  - ✅
+  - ❌
+  - ❌
+  - ✅
+  - 🟨
+  - ✅
+
+* - [Feather](https://arrow.apache.org/docs/python/feather.html)
+  - ❌
+  - ✅
+  - ❌
+  - ✅
+  - ❌
+  - ❌
+
+* - [Parquet](https://parquet.apache.org/)
+  - ❌
+  - ✅
+  - 🟨
+  - ✅
+  - 🟨
+  - ✅
+
+* - [NPY](https://numpy.org/doc/stable/reference/generated/numpy.lib.format.html)
+  - ❌
+  - 🟨
+  - ❌
+  - ❌
+  - ✅
+  - ❌
+
+* - [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format)
+  - ❌
+  - ✅
+  - ❌
+  - ❌
+  - ✅
+  - ✅
+
+* - [NetCDF](https://www.unidata.ucar.edu/software/netcdf/)
+  - ❌
+  - ✅
+  - ❌
+  - ❌
+  - ✅
+  - ✅
+
+* - [JSON](https://en.wikipedia.org/wiki/JSON)
+  - ✅
+  - ❌
+  - 🟨
+  - ❌
+  - ❌
+  - ✅
+
+* - [GeoJSON](https://geojson.org/)
+  - ✅
+  - ❌
+  - 🟨
+  - ❌
+  - ❌
+  - ✅
+
+* - Excel
+  - ❌
+  - ❌
+  - ❌
+  - 🟨
+  - ❌
+  - 🟨
+
+* - Graph formats
+  - 🟨
+  - 🟨
+  - ❌
+  - ❌
+  - ❌
+  - ✅
+
+* - [SQL](https://en.wikipedia.org/wiki/SQL)
+  - ❌
+  - 🟨
+  - ❌
+  - ❌
+  - ❌
+  - ❌
+```
+
+```{note}
+- ✅ : Good
+- 🟨 : OK / depends on the case
+- ❌ : Bad
+```
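
To make the "human readable" versus "space efficiency" trade-off a bit more concrete, here is a rough sketch that stores the same array once as text (CSV-like, via `numpy.savetxt`) and once in the binary NPY format (via `numpy.save`). The array is arbitrary example data, and everything is written to in-memory buffers rather than files.

```python
import io

import numpy as np

# Arbitrary example data: 1000 floating point numbers.
values = np.linspace(0.0, 1.0, 1000)

# Text format: readable in any editor, but every number is spelled out.
text_buffer = io.StringIO()
np.savetxt(text_buffer, values)
text_size = len(text_buffer.getvalue().encode("utf-8"))

# Binary NPY format: not human readable, but 8 bytes per number
# plus a small header.
binary_buffer = io.BytesIO()
np.save(binary_buffer, values)
binary_size = binary_buffer.getbuffer().nbytes

# The binary file can be loaded back without loss.
binary_buffer.seek(0)
restored = np.load(binary_buffer)

print(f"text: {text_size} bytes, binary: {binary_size} bytes")
```

The exact sizes depend on the data and on the text formatting options, so this is an illustration of the tendency rather than a benchmark.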
+
+
+## Data cleaning
+
+Often we want to visualize data sets with inconsistent or missing entries:
+
+```text
+Date,Organization,Number of participants
+2020-09-27,UiT,20
+Oct 10 2020,UiT Norges arktiske universitet,15
+"Nov. 11, 2020",UiT The Arctic University of Norway,40
+2020-12-12,UiT The Arctic University of Norway,-
+```
+
+Data cleaning is a bit outside the scope of this course
+(although we have done some of it in the pandas episode), but it is still good to know:
+- There are tools to clean and merge inconsistent data sets (e.g. [OpenRefine](https://openrefine.org/), see also
+  [this Data Carpentry lesson](https://datacarpentry.org/OpenRefine-ecology-lesson/))
+- This does not have to be done manually
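
For a small table like the one above, a few lines of pandas may already be enough. The following is a sketch under stated assumptions, not a general recipe: it assumes that `-` means "missing" and that all three spellings refer to the same organization, and it parses each date separately with `dateutil` (which is installed together with pandas) because the formats differ from row to row.

```python
import io

import pandas as pd
from dateutil import parser as date_parser

raw_csv = """Date,Organization,Number of participants
2020-09-27,UiT,20
Oct 10 2020,UiT Norges arktiske universitet,15
"Nov. 11, 2020",UiT The Arctic University of Norway,40
2020-12-12,UiT The Arctic University of Norway,-
"""

# Assumption: the "-" placeholder stands for a missing value.
events = pd.read_csv(io.StringIO(raw_csv), na_values="-")

# Parse each date string on its own, since the formats differ per row.
events["Date"] = pd.to_datetime(events["Date"].map(date_parser.parse))

# Assumption: all three spellings refer to the same organization.
canonical = "UiT The Arctic University of Norway"
aliases = {
    "UiT": canonical,
    "UiT Norges arktiske universitet": canonical,
}
events["Organization"] = events["Organization"].replace(aliases)

print(events)
```

For larger or messier data sets, a dedicated tool such as OpenRefine is likely the better choice.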