Merge pull request #19 from ENCCS/fra_fixes
Dataframes and data formats fixes
ffrancesco94 authored Feb 3, 2025
2 parents a6271f8 + 15f138e commit 7ffe615
Showing 2 changed files with 102 additions and 103 deletions.
194 changes: 96 additions & 98 deletions content/dataformats-dataframes.rst
You can perform various operations on a DataFrame, such as filtering,
sorting, grouping, joining, and aggregating data.

The `DataFrames.jl <https://dataframes.juliadata.org/stable/>`_
package offers functionality similar to that of the ``pandas`` library in Python and
the ``data.frame()`` function in R.
``DataFrames.jl`` also provides a rich set of functions for data cleaning,
transformation, and visualization, making it a popular choice for
data science and machine learning tasks in Julia. Just like in Python and R,
the ``DataFrames.jl`` package provides functionality for data manipulation and analysis.


Download a dataset
We will use the Palmer penguins dataset, which contains measurements
of characteristic features of different penguin species.

Artwork by @allison_horst

The dataset is bundled within the ``PalmerPenguins`` package, so we need to add that:

.. code-block:: julia
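
# one standard way to add and load the package (assumed here), via the built-in package manager
using Pkg
Pkg.add("PalmerPenguins")
using PalmerPenguins
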
Here's how you can create a new dataframe:

.. code-block:: julia
using DataFrames
names = ["Ali", "Clara", "Jingfei", "Stefan"]
age = ["25", "39", "14", "45"]
df = DataFrame(name=names, age=age)
.. code-block:: julia-repl
4×2 DataFrame
Row │ name age
.. todo:: Dataframes

The following code loads the ``PalmerPenguins`` dataset into a DataFrame.

.. code-block:: julia
using DataFrames
# Load the PalmerPenguins dataset
table = PalmerPenguins.load()
df = DataFrame(table)
# the raw data can be loaded by
#tableraw = PalmerPenguins.load(; raw = true)
first(df, 5)
.. code-block:: text
5×7 DataFrame
 Row │ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
     │ String   String     Float64?        Float64?       Int64?             Int64?       String?
─────┼────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Adelie   Torgersen            39.1           18.7                181         3750  male
   2 │ Adelie   Torgersen            39.5           17.4                186         3800  female
   3 │ Adelie   Torgersen            40.3           18.0                195         3250  female
   4 │ Adelie   Torgersen   missing        missing             missing       missing      missing
   5 │ Adelie   Torgersen            36.7           19.3                193         3450  female
Note that the ``table`` variable is of type ``CSV.File``; the
PalmerPenguins package uses the `CSV.jl <https://csv.juliadata.org/stable/>`_
package for fast loading of data. Note further that ``DataFrame`` can
accept a ``CSV.File`` object and read it into a dataframe!
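
A quick way to verify this (a sketch; the exact printed types can vary with package versions):

.. code-block:: julia

typeof(table)   # CSV.File
typeof(df)      # DataFrame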

Data can be saved in several common formats such as CSV, JSON, and
Parquet using the ``CSV``, ``JSONTables``, and ``Parquet2`` packages respectively.

An overview of common data formats for different use cases can be found
`here <https://enccs.github.io/hpda-python/scientific-data/#an-overview-of-common-data-formats>`__.

.. tabs::

.. tab:: CSV

.. code-block:: julia
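
# a standard CSV round-trip (assumed here): write the dataframe, then read it back
using CSV
CSV.write("penguins.csv", df)
df = CSV.read("penguins.csv", DataFrame)

.. tab:: JSON
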
.. code-block:: julia
using JSONTables
using JSON3
open("penguins.json", "w") do io
write(io, JSONTables.objecttable(df))
end
df = DataFrame(JSON3.read("penguins.json"))
.. tab:: Parquet

.. code-block:: julia
using Parquet2
Parquet2.writefile("penguins.parquet", df)
df = DataFrame(Parquet2.Dataset("penguins.parquet"))
Inspect dataset
values with a specific value using the ``coalesce`` function or by interpolating missing values
using the ``Interpolations`` package.
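
As a minimal sketch of the first approach (assuming, as in the rest of this section, that the column of interest is ``:bill_length_mm``), missing entries can be replaced by the column mean:

.. code-block:: julia

using Statistics
# mean over the non-missing entries only
fill_value = mean(skipmissing(df.bill_length_mm))
# coalesce.() swaps every missing entry for the fallback value
df.bill_length_mm = coalesce.(df.bill_length_mm, fill_value)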

.. code-block:: julia
# Interpolating missing values
There are two main ways of arranging a data table (see https://www.statology.org/long-vs-wide-data/):
- **Long format**: In this format, each row is a single observation, and each column is a variable. This format is also known as "tidy" data.
- **Wide format**: In this format, each row is a subject, and each column is an observation. This format is also known as "spread" data.

The ``DataFrames.jl`` package provides functions to reshape data between long and wide formats, chiefly ``stack`` (wide to long) and ``unstack`` (long to wide).
Further examples can be found in the `official documentation <https://dataframes.juliadata.org/stable/man/reshaping_and_pivoting/>`__.

.. code-block:: julia
# To convert from wide to long format
# First we create an ID column
df.id = 1:size(df, 1)
df_long = stack(df, Not(:species, :id))
# To convert from long to wide format
df_wide = unstack(df_long, :variable, :value)
# or
# Custom combine function
end
# Unstack DataFrame with custom combine function
unstack(df_long, :species, :variable, :value, combine = custom_combine)
Split-apply-combine workflows
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Oftentimes, data analysis workflows include three steps:

- Splitting/stratifying a dataset into different groups;
- Applying some function/modification to each group;
- Combining the results.

This is commonly referred to as the "split-apply-combine" workflow, which can be
achieved in Julia with the ``groupby`` function to stratify and
the ``combine`` function to aggregate with some reduction operator.
An example of this is provided below:

.. code-block:: julia
using Statistics
# Split-apply-combine
df_grouped = groupby(df, [:species, :island])
df_combined = combine(df_grouped, :body_mass_g => mean)
In this example, ``groupby(df, [:species, :island])`` groups the DataFrame by the ``species`` and ``island`` columns.
Then, ``combine(df_grouped, :body_mass_g => mean)`` calculates the mean of the ``:body_mass_g`` column for each group.
The ``mean`` function is used for aggregation.

The result is a new DataFrame with one row per unique ``:species``-``:island`` combination,
holding the mean body mass of that group.
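
As a small extension of the same pattern (a sketch; the output column names are illustrative), several reductions can be computed in one call and given explicit names, skipping missing values:

.. code-block:: julia

using Statistics
combine(groupby(df, :species),
        :body_mass_g => (x -> mean(skipmissing(x))) => :mean_mass,
        :body_mass_g => (x -> std(skipmissing(x))) => :std_mass,
        nrow => :count)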


(Optional) Creating and merging DataFrames
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Creating DataFrames
~~~~~~~~~~~~~~~~~~~

In Julia, you can create a DataFrame from scratch using the ``DataFrame`` constructor from the ``DataFrames`` package.
This constructor allows you to create a DataFrame by passing column vectors as keyword arguments or pairs.
For example, to create a DataFrame with two columns named ``:A`` and ``:B``, the following works:

``DataFrame(A = 1:3, B = ["x", "y", "z"])``

A DataFrame can also be created from other data structures such as dictionaries, named tuples, vectors of vectors, matrices, and more.
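
For instance, a brief sketch of a few of these constructors:

.. code-block:: julia

using DataFrames
DataFrame(Dict(:A => 1:3, :B => ["x", "y", "z"]))   # from a dictionary of columns
DataFrame((A = 1:3, B = ["x", "y", "z"]))           # from a named tuple of columns
DataFrame([1 2; 3 4], [:A, :B])                     # from a matrix plus column names
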
You can find more information about creating DataFrames in Julia in the `official documentation <https://dataframes.juliadata.org/stable/man/getting_started/>`_.

Merging DataFrames
~~~~~~~~~~~~~~~~~~

You can merge two or more DataFrames using the join functions from the ``DataFrames`` package: ``innerjoin``, ``leftjoin``, ``rightjoin``, ``outerjoin``, ``semijoin``, and ``antijoin``.
You specify the columns used to determine which rows should be combined by passing them as the ``on`` argument.
For example, to perform an inner join on two DataFrames ``df1`` and ``df2`` using the ``:ID`` column as the key, you can use ``innerjoin(df1, df2, on = :ID)``.
You can find more information about joining DataFrames in Julia in the `official documentation <https://dataframes.juliadata.org/stable/man/joins/>`_.
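
A minimal sketch with made-up data (the ``:ID`` values and columns here are purely illustrative):

.. code-block:: julia

using DataFrames
df1 = DataFrame(ID = [1, 2, 3], name = ["Ali", "Clara", "Jingfei"])
df2 = DataFrame(ID = [1, 2, 4], score = [9.5, 7.0, 8.2])
innerjoin(df1, df2, on = :ID)   # only IDs present in both
leftjoin(df1, df2, on = :ID)    # all rows of df1, missing score where no match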


Plotting
- `Plots.jl <http://docs.juliaplots.org/latest/>`_: high-level
API for working with several different plotting back-ends, including `GR`,
`Matplotlib.Pyplot`, `Plotly` and `PlotlyJS`.
- `StatsPlots.jl <https://docs.juliaplots.org/dev/generated/statsplots/>`_: focuses on statistical
use-cases and supports specialized statistical plotting functionalities.
- `Gadfly.jl <http://gadflyjl.org/stable/>`_: based largely on
`ggplot2 for R <https://ggplot2.tidyverse.org/>`_ and the book
`The Grammar of Graphics <https://www.cs.uic.edu/~wilkinson/TheGrammarOfGraphics/GOG.html>`_.
We will be using ``Plots.jl`` and ``StatsPlots.jl`` but we encourage you to explore these
other packages to find the one that best fits your use case.

First we install ``Plots.jl`` and the ``StatsPlots`` package:

.. code-block:: julia
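
# standard way to add and load the plotting packages (assumed here)
using Pkg
Pkg.add("Plots")
Pkg.add("StatsPlots")
using Plots, StatsPlots
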
scatter(df[!, :bill_length_mm], df[!, :bill_depth_mm])
We can adjust the markers using `this list of named colors <https://juliagraphics.github.io/Colors.jl/stable/namedcolors/>`_
and `this list of marker types <https://docs.juliaplots.org/dev/gallery/gr/generated/gr-ref013/#gr_ref013>`_:

.. code-block:: julia
We can add a dimension to the plot by grouping by another column. Let's see if
the different penguin species can be distinguished based on their bill length
and bill depth. We also set different marker shapes and colors based on the
grouping, and adjust the markersize and transparency (``alpha``). Note that
it is also possible to prescribe a palette rather than every colour individually, with
many common palettes available `here <https://docs.juliaplots.org/dev/generated/colorschemes/#Pre-defined-ColorSchemes>`__:

.. code-block:: julia
scatter(df[!, :bill_length_mm],
df[!, :bill_depth_mm],
xlabel = "bill length (mm)",
ylabel = "bill depth (g)",
ylabel = "bill depth (mm)",
group = df[!, :species],
marker = [:circle :ltriangle :star5],
color = [:magenta :springgreen :blue],
11 changes: 6 additions & 5 deletions content/index.rst
Fortran) without sacrificing simplicity and programming productivity
(like in Python or R).

Julia has a rich ecosystem of libraries aimed
towards scientific computing and a powerful built-in package manager
to install and manage their dependencies. Thanks to a rapidly growing
ecosystem of packages for data science and machine learning, Julia is
quickly gaining ground in both academic and industrial domains which deal
with large datasets.

This lesson starts with a discussion of working with data in Julia, how
to use the ``DataFrames.jl`` package and how to visualise data. It then moves
on to linear algebra approaches, followed by classical machine learning
approaches as well as deep learning methods with an example of scientific ML.
Finally, key aspects of regression,
time series prediction and analysis are covered.

If you are new to the Julia language, please make sure to go through this
`introductory Julia lesson <https://enccs.github.io/julia-intro/>`__ before
please visit the lesson `Julia for high-performance scientific computing <https:
motivation
dataformats-dataframes
linear-algebra
sciml
data-science
regression

.. toctree::
:maxdepth: 1
