Commit 7ffe615

Merge pull request #19 from ENCCS/fra_fixes
Dataframes and data formats fixes
2 parents a6271f8 + 15f138e commit 7ffe615

2 files changed

+102
-103
lines changed

content/dataformats-dataframes.rst

Lines changed: 96 additions & 98 deletions
Original file line numberDiff line numberDiff line change
@@ -48,12 +48,12 @@ You can perform various operations on a DataFrame, such as filtering,
4848
sorting, grouping, joining, and aggregating data.
4949

5050
The `DataFrames.jl <https://dataframes.juliadata.org/stable/>`_
51-
package is Julia's version of the ``pandas`` library in Python and
51+
package offers functionality similar to the ``pandas`` library in Python and
5252
the ``data.frame()`` function in R.
53-
DataFrames.jl also provides a rich set of functions for data cleaning,
53+
``DataFrames.jl`` also provides a rich set of functions for data cleaning,
5454
transformation, and visualization, making it a popular choice for
5555
data science and machine learning tasks in Julia. Just like in Python and R,
56-
the DataFrames.jl package provides functionality for data manipulation and analysis.
56+
the ``DataFrames.jl`` package provides functionality for data manipulation and analysis.
5757

5858

5959
Download a dataset
@@ -68,7 +68,7 @@ of characteristic features of different penguin species.
6868

6969
Artwork by @allison_horst
7070

71-
To obtain the data we simply add the PalmerPenguins package.
71+
The dataset is bundled within the ``PalmerPenguins`` package, so we need to add that:
7272

7373
.. code-block:: julia
7474
@@ -90,9 +90,9 @@ Here's how you can create a new dataframe:
9090
using DataFrames
9191
names = ["Ali", "Clara", "Jingfei", "Stefan"]
9292
age = ["25", "39", "14", "45"]
93-
df = DataFrame(; name=names, age=age)
93+
df = DataFrame(name=names, age=age)
9494
95-
.. code-block:: text
95+
.. code-block:: julia-repl
9696
9797
4×2 DataFrame
9898
Row │ name age
@@ -105,24 +105,46 @@ Here's how you can create a new dataframe:
105105
106106
.. todo:: Dataframes
107107

108-
The following code loads the `PalmerPenguins` dataset into a DataFrame.
109-
Then it demonstrates how to write and read the data in CSV, JSON, and
110-
Parquet formats using the `CSV`, `JSONTables`, and `Parquet` packages respectively.
108+
The following code loads the ``PalmerPenguins`` dataset into a DataFrame.
111109

112-
More about `Types of scientific data` one can find at `ENCCS High Performance Data Analytics in Python <https://enccs.github.io/hpda-python/scientific-data/#types-of-scientific-data>`_ training.
110+
.. code-block:: julia
113111
114-
.. tabs::
112+
using DataFrames, PalmerPenguins
113+
#Load the PalmerPenguins dataset
114+
table = PalmerPenguins.load()
115+
df = DataFrame(table);
116+
# the raw data can be loaded by
117+
#tableraw = PalmerPenguins.load(; raw = true)
115118
116-
.. tab:: DataFrame
117-
.. code-block:: julia
119+
first(df, 5)
120+
121+
.. code-block:: text
122+
123+
5×7 DataFrame
124+
Row │ species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
125+
│ String String Float64? Float64? Int64? Int64? String?
126+
─────┼──────────────────────────────────────────────────────────────────────────────────────────────
127+
1 │ Adelie Torgersen 39.1 18.7 181 3750 male
128+
2 │ Adelie Torgersen 39.5 17.4 186 3800 female
129+
3 │ Adelie Torgersen 40.3 18.0 195 3250 female
130+
4 │ Adelie Torgersen missing missing missing missing missing
131+
5 │ Adelie Torgersen 36.7 19.3 193 3450 female
132+
118133
119-
using DataFrames
120-
# Load the PalmerPenguins dataset
121-
table = PalmerPenguins.load()
122-
df = DataFrame(table)
134+
Note that the ``table`` variable is of type ``CSV.File``; the
135+
PalmerPenguins package uses the `CSV.jl <https://csv.juliadata.org/stable/>`_
136+
package for fast loading of data. Note further that ``DataFrame`` can
137+
accept a ``CSV.File`` object and read it into a dataframe!
138+
139+
Data can be saved in several common formats such as CSV, JSON, and
140+
Parquet using the ``CSV``, ``JSONTables``, and ``Parquet2`` packages respectively.
123141

124-
.. tab:: CSV
142+
An overview of common data formats for different use cases can be found
143+
`here <https://enccs.github.io/hpda-python/scientific-data/#an-overview-of-common-data-formats>`__.
144+
145+
.. tabs::
125146

147+
.. tab:: CSV
126148

127149
.. code-block:: julia
128150
@@ -135,47 +157,20 @@ Here's how you can create a new dataframe:
135157
.. code-block:: julia
136158
137159
using JSONTables
160+
using JSON3
138161
open("penguins.json", "w") do io
139-
JSONTables.writetable(io, df)
162+
write(io, JSONTables.objecttable(df))
140163
end
141-
df = open(JSONTables.jsontable, "penguins.json", DataFrame)
164+
df = DataFrame(JSON3.read("penguins.json"))
142165
143166
.. tab:: Parquet
144167

145168
.. code-block:: julia
146169
147-
using Parquet
148-
Parquet.write("penguins.parquet", df)
149-
df = Parquet.read("penguins.parquet", DataFrame)
170+
using Parquet2
171+
Parquet2.writefile("penguins.parquet", df)
172+
df = DataFrame(Parquet2.Dataset("penguins.parquet"))
150173
151-
We now create a dataframe containing the PalmerPenguins dataset.
152-
Note that the ``table`` variable is of type ``CSV.File``; the
153-
PalmerPenguins package uses the `CSV.jl <https://csv.juliadata.org/stable/>`_
154-
package for fast loading of data. Note further that ``DataFrame`` can
155-
accept a ``CSV.File`` object and read it into a dataframe!
156-
157-
.. code-block:: julia
158-
159-
using PalmerPenguins
160-
table = PalmerPenguins.load()
161-
df = DataFrame(table)
162-
163-
# the raw data can be loaded by
164-
#tableraw = PalmerPenguins.load(; raw = true)
165-
166-
first(df, 5)
167-
168-
.. code-block:: text
169-
170-
344×7 DataFrame
171-
Row │ species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
172-
│ String String Float64? Float64? Int64? Int64? String?
173-
─────┼──────────────────────────────────────────────────────────────────────────────────────────────
174-
1 │ Adelie Torgersen 39.1 18.7 181 3750 male
175-
2 │ Adelie Torgersen 39.5 17.4 186 3800 female
176-
3 │ Adelie Torgersen 40.3 18.0 195 3250 female
177-
4 │ Adelie Torgersen missing missing missing missing missing
178-
5 │ Adelie Torgersen 36.7 19.3 193 3450 female
179174
180175
181176
Inspect dataset
@@ -252,17 +247,6 @@ Inspect dataset
252247
values with a specific value using the `coalesce` function or by interpolating missing values
253248
using the `Interpolations` package.
254249

255-
.. code-block:: julia
256-
257-
# Interpolating missing values
258-
using Interpolations
259-
mask = ismissing.(df[:bill_length_mm])
260-
itp = interpolate(df[:bill_length_mm][.!mask], BSpline(Linear()))
261-
df[:bill_length_mm][mask] .= itp.(findall(mask))
262-
263-
It throws the issue because the syntax df[column] is not supported in Julia 1.9.0.
264-
Here is the correct code:
265-
266250
.. code-block:: julia
267251
268252
# Interpolating missing values
@@ -290,16 +274,19 @@ https://www.statology.org/long-vs-wide-data/
290274
- **Long format**: In this format, each row is a single observation, and each column is a variable. This format is also known as "tidy" data.
291275
- **Wide format**: In this format, each row is a subject, and each column is an observation. This format is also known as "spread" data.
292276

293-
The `DataFrames.jl` package provides functions to reshape data between long and wide formats. These functions are `stack`, `unstack`, `melt`, and `pivot`.
294-
Detailed tutorial: https://dataframes.juliadata.org/stable/man/reshaping_and_pivoting/
277+
The ``DataFrames.jl`` package provides functions to reshape data between long and wide formats: ``stack`` converts from wide to long, and ``unstack`` converts from long to wide.
278+
Further examples can be found in the `official documentation <https://dataframes.juliadata.org/stable/man/reshaping_and_pivoting/>`__.
295279

296280
.. code-block:: julia
297281
298282
# To convert from wide to long format
299-
df_long = stack(df, Not(:species))
283+
284+
#First we create an ID column
285+
df.id = 1:size(df,1)
286+
df_long = stack(df, Not(:species, :id))
300287
301288
# To convert from long to wide format
302-
df_wide = unstack(df_long, :species, :variable, :value)
289+
df_wide = unstack(df_long, :variable, :value)
303290
304291
# or
305292
# Custom combine function
@@ -312,54 +299,64 @@ Detailed tutorial: https://dataframes.juliadata.org/stable/man/reshaping_and_piv
312299
end
313300
314301
# Unstack DataFrame with custom combine function
315-
df_wide = unstack(df_long, :species, :variable, :value, combine = custom_combine)
302+
unstack(df_long, :species, :variable, :value, combine = custom_combine)
303+
316304
305+
Split-apply-combine workflows
306+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
317307

318-
(Optional) Reshaping and Pivoting
319-
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
308+
Oftentimes, data analysis workflows include three steps:
320309

321-
The `pivot` function can be used to reshape data (from long to wide format) and also perform aggregation.
310+
- Splitting/stratifying a dataset into different groups;
311+
- Applying some function/modification to each group;
312+
- Combining the results.
313+
314+
This is commonly referred to as the "split-apply-combine" workflow, which can be
315+
achieved in Julia with the ``groupby`` function to stratify and
316+
the ``combine`` function to aggregate with some reduction operator.
317+
An example of this is provided below:
322318

323319
.. code-block:: julia
324320
325321
using Statistics
326322
327-
# Pivot data with aggregation
323+
# Split-apply-combine
328324
df_grouped = groupby(df, [:species, :island])
329-
df_pivot = combine(df_grouped, :body_mass_g => mean)
330-
325+
df_combined = combine(df_grouped, :body_mass_g => mean)
331326
332-
In this example, `groupby(df, [:species, :island])` groups your DataFrame by the `species` and `island` columns.
333-
Then, `combine(df_grouped, :body_mass_g => mean)` calculates the mean of the `body_mass_g` column for each group.
334-
The `mean` function is used for aggregation.
335327
336-
The result is a new DataFrame where each unique value in the `:species` column forms a row, each unique
337-
value in the `:island` column forms a column, and the mean body mass for each species-island combination fills the DataFrame.
328+
In this example, ``groupby(df, [:species, :island])`` groups the DataFrame by the ``species`` and ``island`` columns.
329+
Then, ``combine(df_grouped, :body_mass_g => mean)`` calculates the mean of the ``:body_mass_g`` column for each group.
330+
The ``mean`` function is used for aggregation.
338331

339-
Note that if you don't provide an aggregation function and there are multiple values for a given row-column combination,
340-
`pivot` will throw an error. To handle this, you can provide an aggregation function like `mean`, `sum`, etc.,
341-
which will be applied to all values that fall into each cell of the resulting DataFrame.
332+
The result is a new DataFrame where each unique ``:species``-``:island`` combination forms a row,
333+
and the automatically named ``body_mass_g_mean`` column holds the mean body mass of that group.
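
As a minimal sketch of the same workflow on a small made-up table (``df_demo`` and its values are hypothetical, not the penguin data): ``mean`` returns ``missing`` for any group that contains a missing value, and composing it with ``skipmissing`` is one way to avoid that.

.. code-block:: julia

   using DataFrames
   using Statistics

   # Hypothetical stand-in for the penguin data, with one missing body mass
   df_demo = DataFrame(
       species = ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"],
       island = ["Torgersen", "Torgersen", "Torgersen", "Biscoe", "Biscoe"],
       body_mass_g = [3750, missing, 3800, 5000, 5200],
   )

   # Split: one group per species-island combination
   gdf = groupby(df_demo, [:species, :island])

   # Apply and combine: the result column is auto-named body_mass_g_mean
   df_plain = combine(gdf, :body_mass_g => mean)

   # Skip missing values and choose the output column name explicitly
   df_skip = combine(gdf, :body_mass_g => (mean ∘ skipmissing) => :mean_body_mass_g)

Here ``df_plain`` has a ``missing`` mean for the group that contains a missing measurement, while ``df_skip`` averages only the observed values.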
342334

343335

344-
Creating and merging DataFrames like in SQL
345-
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
336+
(Optional) Creating and merging DataFrames
337+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
346338

347339
Creating DataFrames
340+
~~~~~~~~~~~~~~~~~~~
348341

349-
In Julia, you can create a DataFrame from scratch using the `DataFrame` constructor from the `DataFrames` package.
342+
In Julia, you can create a DataFrame from scratch using the ``DataFrame`` constructor from the ``DataFrames`` package.
350343
This constructor allows you to create a DataFrame by passing column vectors as keyword arguments or pairs.
351-
For example, to create a DataFrame with two columns named `:A` and `:B`, you can use the following code:
352-
`DataFrame(A = 1:3, B = ["x", "y", "z"])`
353-
You can also create a DataFrame from other data structures such as dictionaries, named tuples, vectors of vectors, matrices, and more.
344+
For example, to create a DataFrame with two columns named ``:A`` and ``:B``, the following works:
345+
346+
``DataFrame(A = 1:3, B = ["x", "y", "z"])``
347+
348+
A DataFrame can also be created from other data structures such as dictionaries, named tuples, vectors of vectors, matrices, and more.
354349
You can find more information about creating DataFrames in Julia in the `official documentation <https://dataframes.juliadata.org/stable/man/getting_started/>`_.
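
A short sketch of a few of these construction routes follows. The variable names (``df_kw``, ``df_pairs``, ``df_nt``, ``df_mat``) are illustrative only.

.. code-block:: julia

   using DataFrames

   # Keyword arguments: one column per keyword
   df_kw = DataFrame(A = 1:3, B = ["x", "y", "z"])

   # Pairs of column name => column vector
   df_pairs = DataFrame(:A => 1:3, :B => ["x", "y", "z"])

   # A named tuple of vectors
   df_nt = DataFrame((A = 1:3, B = ["x", "y", "z"]))

   # A matrix together with explicit column names
   df_mat = DataFrame([1 2; 3 4; 5 6], [:A, :B])

The first three calls build equal tables; the matrix form needs the column names (or ``:auto``) as a second argument.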
355350

356351
Merging DataFrames
352+
~~~~~~~~~~~~~~~~~~
357353

358-
Also, you can merge two or more DataFrames using the `join` function from the `DataFrames` package.
354+
Also, you can merge two or more DataFrames using the join functions from the ``DataFrames`` package, such as ``innerjoin``, ``leftjoin``, ``rightjoin``, ``outerjoin``, ``semijoin``, and ``antijoin``.
359355
This function allows you to perform various types of joins, such as inner join, left join, right join, outer join, semi join, and anti join.
360-
You can specify the columns used to determine which rows should be combined during a join by passing them as the `on` argument to the `join` function.
361-
For example, to perform an inner join on two DataFrames `df1` and `df2` using the `:ID` column as the key, you can use the following code: `join(df1, df2, on = :ID, kind = :inner)`.
362-
You can find more information about joining DataFrames in Julia in the `official documentation <https://dataframes.juliadata.org/stable/man/joins/>`_
356+
You can specify the columns used to determine which rows should be combined during a join by passing them as the ``on`` keyword argument to the join function.
357+
For example, to perform an inner join on two DataFrames ``df1`` and ``df2`` using the ``:ID`` column as the key, you can use the following code:
358+
``innerjoin(df1, df2, on = :ID)``.
359+
You can find more information about joining DataFrames in Julia in the `official documentation <https://dataframes.juliadata.org/stable/man/joins/>`_.
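
As a hedged illustration, the sketch below joins two small made-up tables on ``:ID``. Current DataFrames.jl releases provide one function per join type (``innerjoin``, ``leftjoin``, ``outerjoin``, and so on); the tables ``df1`` and ``df2`` and their contents are assumptions for the example.

.. code-block:: julia

   using DataFrames

   # Two hypothetical tables sharing an :ID key
   df1 = DataFrame(ID = [1, 2, 3], name = ["Ali", "Clara", "Jingfei"])
   df2 = DataFrame(ID = [2, 3, 4], age = [39, 14, 45])

   # Keep only rows whose ID occurs in both tables
   df_inner = innerjoin(df1, df2, on = :ID)

   # Keep every row of df1; age is missing where df2 has no match
   df_left = leftjoin(df1, df2, on = :ID)

   # Keep every row of both tables
   df_outer = outerjoin(df1, df2, on = :ID)

Row order after a join is not something to rely on, so sort the result explicitly (for example with ``sort(df_left, :ID)``) if order matters.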
363360

364361

365362
Plotting
@@ -375,9 +372,8 @@ personal preference.
375372
- `Plots.jl <http://docs.juliaplots.org/latest/>`_: high-level
376373
API for working with several different plotting back-ends, including `GR`,
377374
`Matplotlib.Pyplot`, `Plotly` and `PlotlyJS`.
378-
- `StatsPlots.jl <https://github.com/JuliaPlots/StatsPlots.jl>`_: was moved
379-
out from core `Plots.jl`. Focuses on statistical use-cases and supports
380-
specialized statistical plotting functionalities.
375+
- `StatsPlots.jl <https://docs.juliaplots.org/dev/generated/statsplots/>`_: focuses on statistical
376+
use-cases and supports specialized statistical plotting functionalities.
381377
- `Gadfly.jl <http://gadflyjl.org/stable/>`_: based largely on
382378
`ggplot2 for R <https://ggplot2.tidyverse.org/>`_ and the book
383379
`The Grammar of Graphics <https://www.cs.uic.edu/~wilkinson/TheGrammarOfGraphics/GOG.html>`_.
@@ -392,7 +388,7 @@ personal preference.
392388
We will be using `Plots.jl` and `StatsPlots.jl` but we encourage you to explore these
393389
other packages to find the one that best fits your use case.
394390

395-
First we install `Plots.jl` and `StatsPlots` backend:
391+
First we install ``Plots.jl`` and ``StatsPlots.jl``:
396392

397393
.. code-block:: julia
398394
@@ -466,7 +462,7 @@ Multiple subplots can be created by:
466462
scatter(df[!, :bill_length_mm], df[!, :bill_depth_mm])
467463
468464
We can adjust the markers by `this list of named colors <https://juliagraphics.github.io/Colors.jl/stable/namedcolors/>`_
469-
and `this list of marker types <https://docs.juliaplots.org/latest/generated/unicodeplots/#unicodeplots-ref13>`_:
465+
and `this list of marker types <https://docs.juliaplots.org/dev/gallery/gr/generated/gr-ref013/#gr_ref013>`_:
470466

471467
.. code-block:: julia
472468
@@ -483,14 +479,16 @@ Multiple subplots can be created by:
483479
We can add a dimension to the plot by grouping by another column. Let's see if
484480
the different penguin species can be distinguished based on their bill length
485481
and bill depth. We also set different marker shapes and colors based on the
486-
grouping, and adjust the markersize and transparency (``alpha``):
482+
grouping, and adjust the markersize and transparency (``alpha``). Note that
483+
it is also possible to prescribe a palette rather than every colour individually, with
484+
many common palettes available `here <https://docs.juliaplots.org/dev/generated/colorschemes/#Pre-defined-ColorSchemes>`__:
487485

488486
.. code-block:: julia
489487
490488
scatter(df[!, :bill_length_mm],
491489
df[!, :bill_depth_mm],
492490
xlabel = "bill length (mm)",
493-
ylabel = "bill depth (g)",
491+
ylabel = "bill depth (mm)",
494492
group = df[!, :species],
495493
marker = [:circle :ltriangle :star5],
496494
color = [:magenta :springgreen :blue],

content/index.rst

Lines changed: 6 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -9,17 +9,18 @@ Fortran) without sacrificing simplicity and programming productivity
99
(like in Python or R).
1010

1111
Julia has a rich ecosystem of libraries aimed
12-
towards scientific computing and a powerful in-built package manager
12+
towards scientific computing and a powerful built-in package manager
1313
to install and manage their dependencies. Thanks to a rapidly growing
1414
ecosystem of packages for data science and machine learning, Julia is
1515
quickly gaining ground in both academic and industrial domains which deal
1616
with large datasets.
1717

1818
This lesson starts with a discussion of working with data in Julia, how
19-
to use the Dataframes.jl package and how to visualise data. It then moves
19+
to use the ``DataFrames.jl`` package and how to visualise data. It then moves
2020
on to linear algebra approaches, followed by classical machine learning
21-
approaches as well as deep learning methods. Finally, key aspects of regression,
22-
time series prediction and analyses is covered.
21+
approaches as well as deep learning methods with an example of scientific ML.
22+
Finally, key aspects of regression,
23+
time series prediction and analysis are covered.
2324

2425
If you are new to the Julia language, please make sure to go through this
2526
`introductory Julia lesson <https://enccs.github.io/julia-intro/>`__ before
@@ -51,9 +52,9 @@ please visit the lesson `Julia for high-performance scientific computing <https:
5152
motivation
5253
dataformats-dataframes
5354
linear-algebra
55+
sciml
5456
data-science
5557
regression
56-
sciml
5758

5859
.. toctree::
5960
:maxdepth: 1
