docs: improve documentation of prudence argument #589
base: main
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -30,22 +30,19 @@ knitr::opts_chunk$set( | |
Sys.setenv(DUCKPLYR_FALLBACK_COLLECT = 0) | ||
``` | ||
|
||
Unlike traditional data frames, duckplyr defers computation until absolutely necessary, allowing DuckDB to optimize execution. | ||
This article explains how to control the materialization of data to maintain a seamless dplyr-like experience while remaining cautious of memory usage. | ||
|
||
|
||
This article explains how to control the materialization of data to maintain a seamless dplyr-like experience as well as to protect memory. | ||
|
||
```{r attach} | ||
library(conflicted) | ||
library(dplyr) | ||
conflict_prefer("filter", "dplyr") | ||
``` | ||
|
||
## Introduction | ||
## dplyr drop-in replacement: eager data frames | ||
|
||
From a user's perspective, data frames backed by duckplyr, with class `"duckplyr_df"`, behave as regular data frames in almost all respects. | ||
Data frames backed by duckplyr, with class `"duckplyr_df"`, behave as regular data frames in almost all respects from a user's perspective. | ||
In particular, direct column access like `df$x`, or retrieving the number of rows with `nrow()`, works identically. | ||
Conceptually, duckplyr frames are "eager": | ||
Therefore, conceptually, duckplyr frames are "eager". | ||
|
||
```{r} | ||
df <- | ||
|
@@ -60,14 +57,14 @@ df$y | |
nrow(df) | ||
``` | ||
|
||
Under the hood, two key differences provide improved performance and usability: | ||
Under the hood though, two key differences provide improved performance and usability: | ||
|
||
- **lazy materialization**: Unlike traditional data frames, duckplyr defers computation until absolutely necessary, i.e. lazily, allowing DuckDB to optimize execution. | ||
- **prudence**: Automatic materialization is controllable, as automatic materialization of large data might otherwise inadvertently lead to memory problems. | ||
|
||
The term "prudence" is introduced here to set a clear distinction from the concept of "laziness", and because "control of automatic materialization" is a mouthful. | ||
|
||
## Eager and lazy computation | ||
## DuckDB optimization: lazy evaluation | ||
"lazy evaluation" is easy to confuse with how R handles arguments.
oops, this should be "lazy computation", sorry.
||
|
||
For a duckplyr frame that is the result of a dplyr operation, accessing column data or retrieving the number of rows will trigger a computation that is carried out by DuckDB, not dplyr. | ||
In this sense, duckplyr frames are also "lazy": the computation is deferred until the last possible moment, allowing DuckDB to optimize the whole pipeline. | ||
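
A minimal sketch of this deferral, using toy data with hypothetical column names and assuming that both pipeline steps are supported by duckplyr, so that no fallback to dplyr occurs:

```r
library(dplyr)

# Toy data; the column names are made up for illustration.
df <- duckplyr::as_duckdb_tibble(
  data.frame(x = 1:5, y = c(10, 20, 30, 40, 50))
)

# Building the pipeline should only record the operations
# (assumption: both steps are handled by DuckDB).
out <- df |>
  mutate(z = x + y) |>
  summarise(total = sum(z))

# Accessing a column requests the result and triggers the computation.
out$total
```

Because the frame is lavish by default, the column access at the end works without an explicit `collect()`.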
|
@@ -112,10 +109,11 @@ The result becomes available when accessed: | |
system.time(mean_arr_delay_ewr$mean_arr_delay[[1]]) | ||
``` | ||
|
||
### Comparison | ||
### Comparison with similar tools | ||
|
||
The functionality is similar to lazy tables in [dbplyr](https://dbplyr.tidyverse.org/) and lazy frames in [dtplyr](https://dtplyr.tidyverse.org/). | ||
However, the behavior is different: at the time of writing, the internal structure of a lazy table or frame is different from a data frame, and columns cannot be accessed directly. | ||
Users need to explicitly `collect()` the data; the data frame is not "eager" at all. | ||
|
||
| | **Eager** 😃 | **Lazy** 😴 | | ||
|-------------|:------------:|:-----------:| | ||
|
@@ -142,31 +140,65 @@ system.time( | |
|
||
See also the [duckplyr: dplyr Powered by DuckDB](https://duckdb.org/2024/04/02/duckplyr.html) blog post for more information. | ||
|
||
## Prudence | ||
## Memory protection: control of automatic materialization with `prudence` | ||
|
||
Being both "eager" and "lazy" at the same time introduces a challenge: | ||
it is too easy to accidentally trigger computation, | ||
**it is too easy to accidentally trigger computation**, | ||
which is prohibitive if an intermediate result is too large to fit into memory. | ||
Prudence is a setting for duckplyr frames that limits the size of the data that is materialized automatically. | ||
|
||
### Concept | ||
Fortunately, duckplyr frames have a setting called `prudence` that limits the size of the data that is materialized automatically, | ||
and that the user can choose based on the data size. | ||
|
||
### When to automatically materialize? | ||
|
||
Three levels of prudence are available: | ||
|
||
- _lavish_: always automatically materialize, as in the first example. | ||
- _frugal_: never automatically materialize, throw an error when attempting to access the data. | ||
- _thrifty_: only automatically materialize the data if it is small, otherwise throw an error. | ||
- __lavish__: _always_ automatically materialize, as in the first example. | ||
- __frugal__: _never_ automatically materialize, throw an error when attempting to access the data. | ||
- __thrifty__: automatically materialize the data _if it is small_, otherwise throw an error. | ||
|
||
For lavish duckplyr frames, as in the two previous examples, the underlying DuckDB computation is carried out upon the first request. | ||
Once the results are computed, they are cached and subsequent requests are fast. | ||
This is a good choice for small to medium-sized data, where DuckDB can provide a nice speedup but materializing the data is affordable at any stage. | ||
This is the default for `duckdb_tibble()` and `as_duckdb_tibble()`. | ||
|
||
For frugal duckplyr frames, accessing a column or requesting the number of rows triggers an error. | ||
This is a good choice for large data sets where the cost of materializing the data may be prohibitive due to size or computation time, and the user wants to control when the computation is carried out and where the results are stored. | ||
This is a good choice for large data sets where the cost of materializing the data may be prohibitive due to size or computation time, and the user wants to control when the computation is carried out and how (to memory, or to a file). | ||
Results can be materialized explicitly with `collect()` and other functions. | ||
|
||
Thrifty duckplyr frames are a compromise between lavish and frugal, discussed further below. | ||
Thrifty duckplyr frames are a compromise between lavish and frugal, discussed below. | ||
|
||
### Thrift | ||
I found the previous location of this section a bit jarring.
I want to give the readers time to digest "lavish" and "frugal".
What happened to me when reading was that I saw it would be discussed later and thought, ok then. But when the "Thrift" section appeared, it felt misplaced.
||
|
||
Thrifty is a compromise between frugal and lavish. | ||
Materialization is allowed for data up to a certain size, measured in cells (values) and rows in the resulting data frame. | ||
|
||
```{r} | ||
nrow(flights) | ||
flights_partial <- | ||
flights |> | ||
duckplyr::as_duckdb_tibble(prudence = "thrifty") | ||
``` | ||
|
||
With this setting, the data is materialized only if the result has fewer than 1,000,000 cells (rows multiplied by columns). | ||
|
||
```{r error = TRUE} | ||
flights_partial |> | ||
select(origin, dest, dep_delay, arr_delay) |> | ||
nrow() | ||
``` | ||
|
||
The original input is too large to be materialized, so the operation fails. | ||
On the other hand, the result after aggregation is small enough to be materialized: | ||
|
||
```{r} | ||
flights_partial |> | ||
count(origin) |> | ||
nrow() | ||
``` | ||
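
The size check behind the two examples above can be reproduced by hand. This is only a sketch of the arithmetic: it assumes that `flights` is `nycflights13::flights` (336,776 rows, three distinct origins) and uses the 1,000,000-cell threshold quoted above.

```r
# Assumption: `flights` is nycflights13::flights, which has 336,776 rows.
n_rows <- 336776
cell_limit <- 1e6  # threshold quoted in the text above

# select(origin, dest, dep_delay, arr_delay): all rows, 4 columns
cells_select <- n_rows * 4
cells_select < cell_limit  # FALSE: too large to materialize automatically

# count(origin): one row per origin, with columns `origin` and `n`
cells_count <- 3 * 2
cells_count < cell_limit   # TRUE: small enough to materialize
```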
|
||
Thrifty is a good choice for data sets where the cost of materializing the data is prohibitive only for large results. | ||
This is the default for the ingestion functions like `read_parquet_duckdb()`. | ||
|
||
|
||
### Example | ||
|
@@ -201,7 +233,7 @@ flights_frugal[[1]] | |
|
||
This means that frugal duckplyr frames can also be used to enforce DuckDB operation for a pipeline. | ||
|
||
### Enforcing DuckDB operation | ||
### Side effect: Enforcing DuckDB operation | ||
It is an important section but also does not go with the flow of the rest of the vignette.
Good point, moving to "fallback". Later.
In the PR where I edit the fallback vignette, I added a section about this but that refers to this vignette.
||
|
||
For operations not supported by duckplyr, the original dplyr implementation is used as a fallback. | ||
As the original dplyr implementation accesses columns directly, the data must be materialized before a fallback can be executed. | ||
|
@@ -227,7 +259,7 @@ flights_frugal |> | |
By using operations supported by duckplyr and avoiding fallbacks as much as possible, your pipelines will be executed by DuckDB in an optimized way. | ||
|
||
|
||
### From frugal to lavish | ||
### Conversion between prudence levels | ||
|
||
A frugal duckplyr frame can be converted to a lavish one with `as_duckdb_tibble(prudence = "lavish")`. | ||
The `collect.duckplyr_df()` method triggers computation and converts to a plain tibble. | ||
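
As a rough sketch of both conversions on a toy data frame (the column name is hypothetical, and the starting level here is "thrifty" rather than "frugal", purely to reuse a call that appears elsewhere in this vignette):

```r
library(dplyr)

# Start from a frame with restricted automatic materialization.
thrifty <- duckplyr::as_duckdb_tibble(data.frame(x = 1:3), prudence = "thrifty")

# Convert to a lavish frame: still backed by duckplyr.
lavish <- duckplyr::as_duckdb_tibble(thrifty, prudence = "lavish")
inherits(lavish, "duckplyr_df")

# collect() triggers the computation and returns a plain tibble.
plain <- collect(lavish)
inherits(plain, "duckplyr_df")
```

The two `inherits()` calls are there to show which of the results are still duckplyr frames.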
|
@@ -255,54 +287,20 @@ flights_frugal |> | |
class() | ||
``` | ||
|
||
### Comparison | ||
### Comparison with similar tools | ||
|
||
Frugal duckplyr frames behave like lazy tables in dbplyr and lazy frames in dtplyr: the computation only starts when you _explicitly_ request it with `collect.duckplyr_df()` or through other means. | ||
Frugal duckplyr frames behave like lazy tables in dbplyr and lazy frames in dtplyr: the computation only starts when you *explicitly* request it with `collect.duckplyr_df()` or through other means. | ||
However, frugal duckplyr frames can be converted to lavish ones at any time, and vice versa. | ||
In dtplyr and dbplyr, there are no lavish frames: collection always needs to be explicit. | ||
|
||
|
||
## Thrift | ||
|
||
Thrifty is a compromise between frugal and lavish. | ||
Materialization is allowed for data up to a certain size, measured in cells (values) and rows in the resulting data frame. | ||
|
||
```{r} | ||
nrow(flights) | ||
flights_partial <- | ||
flights |> | ||
duckplyr::as_duckdb_tibble(prudence = "thrifty") | ||
``` | ||
|
||
With this setting, the data is materialized only if the result has fewer than 1,000,000 cells (rows multiplied by columns). | ||
|
||
```{r error = TRUE} | ||
flights_partial |> | ||
select(origin, dest, dep_delay, arr_delay) |> | ||
nrow() | ||
``` | ||
|
||
The original input is too large to be materialized, so the operation fails. | ||
On the other hand, the result after aggregation is small enough to be materialized: | ||
|
||
```{r} | ||
flights_partial |> | ||
count(origin) |> | ||
nrow() | ||
``` | ||
|
||
Thrifty is a good choice for data sets where the cost of materializing the data is prohibitive only for large results. | ||
This is the default for the ingestion functions like `read_parquet_duckdb()`. | ||
|
||
|
||
## Conclusion | ||
krlmlr marked this conversation as resolved.
|
||
|
||
The duckplyr package provides | ||
The duckplyr package provides | ||
|
||
- a drop-in replacement for dplyr, which necessitates "eager" data frames that automatically materialize like in dplyr, | ||
- optimization by DuckDB, which means "lazy" evaluation where the data is materialized at the latest possible stage. | ||
- a drop-in replacement for dplyr, which necessitates "eager" data frames that automatically materialize like in dplyr, | ||
- optimization by DuckDB, which means lazy evaluation where the data is materialized at the latest possible stage. | ||
|
||
Automatic materialization can be dangerous for memory with large data, so duckplyr provides a setting called `prudence` that controls automatic materialization: | ||
Automatic materialization can be dangerous for memory with large data, so duckplyr provides a setting called `prudence` that controls automatic materialization: | ||
is the data automatically materialized _always_ ("lavish" frames), _never_ ("frugal" frames) or _up to a certain size_ ("thrifty" frames). | ||
|
||
See `vignette("large")` for more details on working with large data sets, `vignette("fallback")` for fallbacks to dplyr, and `vignette("limits")` for the operations supported by duckplyr. |
this is explained better in one of the sections.
Yes, but is this sentence harmful here?
It could be discouraging to have a first sentence that's "complicated".