duckplyr 1.0.0 #724
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,241 @@ | ||
--- | ||
output: hugodown::hugo_document | ||
|
||
slug: duckplyr-1-0-0 | ||
title: duckplyr fully joins the tidyverse! | ||
date: 2025-02-13 | ||
author: Kirill Müller and Maëlle Salmon | ||
description: > | ||
duckplyr 1.0.0 is on CRAN and part of the tidyverse! | ||
A drop-in replacement for dplyr, powered by DuckDB for speed. | ||
It is the most dplyr-like of dplyr backends. | ||
|
||
photo: | ||
url: https://www.pexels.com/photo/a-mallard-duck-on-water-6918877/ | ||
author: Kiril Gruev | ||
|
||
# one of: "deep-dive", "learn", "package", "programming", "roundup", or "other" | ||
categories: [package] | ||
tags: | ||
- duckplyr | ||
- dplyr | ||
- tidyverse | ||
--- | ||
|
||
```{r include = FALSE} | ||
options( | ||
pillar.min_title_chars = 20, | ||
pillar.max_footer_lines = 7, | ||
pillar.bold = TRUE | ||
) | ||
``` | ||
|
||
|
||
We're very chuffed to announce the release of [duckplyr](https://duckplyr.tidyverse.org) 1.0.0. | ||
This is a new dplyr backend powered by [DuckDB](https://duckdb.org/), a fast in-memory analytical database system[^duckdb]. | ||
It joins the ranks of dplyr backends, alongside [dtplyr](https://dtplyr.tidyverse.org) and [dbplyr](https://dbplyr.tidyverse.org). | ||
You can install it from CRAN with: | ||
|
||
```{r, eval = FALSE} | ||
install.packages("duckplyr") | ||
With tidyverse/tidyverse#346, we can also
Right, but won't the post be published before the PR is merged and the tidyverse package is released on CRAN?
||
``` | ||
|
||
This article shows how duckplyr can be used instead of dplyr for faster computation on data of different sizes, explains how you can help improve the package, and shares a selection of further resources. | ||
|
||
## A drop-in replacement for dplyr | ||
|
||
Imagine you have to wrangle a huge dataset. | ||
Here we generate one using the [data generator from the TPC-H benchmark](https://duckdb.org/2024/04/02/duckplyr.html#benchmark-tpc-h-q1). | ||
|
||
```{r} | ||
lineitem_tbl <- duckdb:::sql("INSTALL tpch; LOAD tpch; CALL dbgen(sf=1); FROM lineitem;") | ||
lineitem_tbl <- tibble::as_tibble(lineitem_tbl) | ||
dplyr::glimpse(lineitem_tbl) | ||
``` | ||
|
||
We could transform the data using dplyr, but we could also use a tool that scales well to ever larger data: duckplyr. | ||
The duckplyr package is a _drop-in replacement for dplyr_ that uses _DuckDB for speed_. | ||
I think you need a sentence or two here as to why it's needed compared to dbplyr.
|
||
You can simply _drop_ duckplyr into your pipeline by loading it; computations are then carried out efficiently by DuckDB. | ||
|
||
[^duckdb]: If you haven't heard about it, you can watch [Hannes Mühleisen's keynote at posit::conf(2024)](https://www.youtube.com/watch?v=GELhdezYmP0&feature=youtu.be). | ||
|
||
Below, we express the standard "TPC-H benchmark query 1" in dplyr syntax, but execute it with duckplyr. | ||
We use a function because this code is reused throughout the article. | ||
|
||
```{r include=FALSE} | ||
options(conflicts.policy = list(warn = FALSE)) | ||
``` | ||
|
||
|
||
```{r} | ||
library(conflicted) | ||
library(duckplyr) | ||
conflict_prefer("filter", "dplyr", quiet = TRUE) | ||
|
||
tpch_dplyr <- function(lineitem) { | ||
lineitem |> | ||
filter(l_shipdate <= !!as.Date("1998-09-02")) |> | ||
summarise( | ||
sum_qty = sum(l_quantity), | ||
sum_base_price = sum(l_extendedprice), | ||
sum_disc_price = sum(l_extendedprice * (1 - l_discount)), | ||
sum_charge = sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)), | ||
avg_qty = mean(l_quantity), | ||
avg_price = mean(l_extendedprice), | ||
avg_disc = mean(l_discount), | ||
count_order = n(), | ||
.by = c(l_returnflag, l_linestatus) | ||
) |> | ||
arrange(l_returnflag, l_linestatus) | ||
} | ||
|
||
tpch_dplyr(lineitem_tbl) | ||
``` | ||
|
||
As with other dplyr backends such as dtplyr and dbplyr, duckplyr lets you get faster results without learning a different syntax. | ||
Unlike other dplyr backends, duckplyr does not require you to change existing code or learn specific idiosyncrasies. | ||
Not only is the syntax the same, the semantics are too! | ||
|
||
Start using duckplyr today by attaching it and running your existing dplyr code. | ||
Many operations will be carried out with DuckDB, faster than with dplyr. | ||
The duckplyr package is fully compatible with dplyr: if an operation cannot be carried out with DuckDB, it is automatically outsourced to dplyr. | ||
Over time, we expect fewer and fewer fallbacks to dplyr to be needed. | ||
|
||
|
||
## How to use duckplyr | ||
|
||
To _replace_ dplyr with duckplyr, you can: | ||
|
||
- Load duckplyr and then keep your pipeline as is. Calling `library(duckplyr)` overwrites dplyr methods, enabling duckplyr for the entire session no matter how data.frames are created. | ||
This is shown in the example above. | ||
|
||
- Create individual "duck frames" using _conversion functions_ like `duckdb_tibble()` or `as_duckdb_tibble()`, or _ingestion functions_ like `read_csv_duckdb()`. | ||
|
||
|
||
In both cases, the data manipulation pipeline uses the exact same syntax as a dplyr pipeline, with the exact same semantics. | ||
The duckplyr package performs the computation using DuckDB. | ||
|
||
```{r} | ||
out <- lineitem_tbl |> | ||
duckplyr::as_duckdb_tibble() |> | ||
tpch_dplyr() | ||
|
||
out | ||
``` | ||
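Duck frames can also be created from scratch rather than converted. The following is a minimal sketch using `duckdb_tibble()`, one of the conversion functions listed above; the `orders` data and its column names are hypothetical and only serve as an illustration.

```r
library(duckplyr)

# Hypothetical toy data, created directly as a duck frame
orders <- duckdb_tibble(
  id = 1:6,
  status = c("open", "closed", "open", "closed", "open", "open"),
  amount = c(10, 20, 30, 40, 50, 60)
)

# The same dplyr verbs as for a regular tibble
totals <- orders |>
  dplyr::filter(amount > 10) |>
  dplyr::summarise(total = sum(amount), n = dplyr::n(), .by = status) |>
  dplyr::arrange(status)

totals
```

The verbs are prefixed with `dplyr::` here only to sidestep masking by `stats::filter()`; with dplyr attached, the unprefixed calls should behave the same way.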
|
||
|
||
For programming, the resulting object is indistinguishable from a regular tibble, except for the additional class. | ||
|
||
|
||
```{r} | ||
typeof(out) | ||
class(out) | ||
out$count_order | ||
``` | ||
|
||
The result can also be written directly to a file. | ||
|
||
```{r} | ||
csv_file <- withr::local_tempfile() | ||
compute_csv(out, csv_file) | ||
fs::file_size(csv_file) | ||
``` | ||
|
||
Operations not yet supported by duckplyr are automatically outsourced to dplyr. | ||
For instance, filtering on grouped data is not supported yet, but it still works thanks to the fallback mechanism. | ||
By default, the fallback is silent. | ||
Here, we make it visible by setting an environment variable. | ||
|
||
```{r} | ||
Sys.setenv(DUCKPLYR_FALLBACK_INFO = TRUE) | ||
|
||
lineitem_tbl |> | ||
duckplyr::as_duckdb_tibble() |> | ||
filter(l_quantity == max(l_quantity), .by = c(l_returnflag, l_linestatus)) | ||
I wonder why you're seeing the message here in the rendered output. Have you set an environment variable?
I'm not sure but it's better, I wanted to show the message and was puzzled when it didn't show up. It's nice to show one gets a message I think?
I'll knit again once the new version is on CRAN, this way the message won't have the typo (tidyverse/duckplyr#611)
||
``` | ||
|
||
|
||
## Benchmark | ||
|
||
duckplyr is often much faster than dplyr. | ||
The comparison below is done in a fresh R session where dplyr is attached but duckplyr is not. | ||
|
||
```{r include = FALSE} | ||
# Undo the effect of library(duckplyr) | ||
methods_restore() | ||
``` | ||
|
||
```r | ||
# Restart R | ||
library(dplyr) | ||
|
||
tpch_dplyr <- function ... | ||
``` | ||
|
||
We use `tpch_dplyr()` as defined above to run the query with dplyr. | ||
The function that runs it with duckplyr only wraps the input data in a duck frame and forwards it to the dplyr function. | ||
The `collect()` at the end is required only for this benchmark to ensure fairness.[^collect] | ||
|
||
[^collect]: If omitted, the results would be unchanged but the measurements would be wrong. The computation would then be triggered by the check. See `vignette("prudence")` for details. | ||
|
||
```{r} | ||
tpch_duckplyr <- function(lineitem) { | ||
lineitem |> | ||
duckplyr::as_duckdb_tibble() |> | ||
tpch_dplyr() |> | ||
collect() | ||
} | ||
``` | ||
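To see why the explicit `collect()` matters, it can help to look at a duck frame whose automatic materialization is switched off. This is a sketch under assumptions: it relies on the `prudence = "stingy"` setting described in `vignette("prudence")`, and the toy data frame `df` is hypothetical. The `tryCatch()` is there because the exact behavior and error message may differ between versions.

```r
library(duckplyr)

# Hypothetical toy input
df <- data.frame(x = 1:5, y = c(2, 4, 6, 8, 10))

# A "stingy" duck frame is assumed to refuse implicit materialization
lazy <- df |>
  as_duckdb_tibble(prudence = "stingy") |>
  dplyr::mutate(z = x * y)

# Asking for the number of rows would require running the query;
# capture the outcome instead of assuming a particular message
outcome <- tryCatch(nrow(lazy), error = function(e) conditionMessage(e))
outcome

# collect() triggers the computation explicitly
res <- dplyr::collect(lazy)
res$z
```

In the benchmark, the same explicit `collect()` ensures that the time spent computing the result is attributed to the duckplyr pipeline rather than to the comparison step.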
|
||
And now we compare the two: | ||
|
||
```{r} | ||
bench::mark( | ||
tpch_dplyr(lineitem_tbl), | ||
tpch_duckplyr(lineitem_tbl), | ||
check = ~ all.equal(.x, .y, tolerance = 1e-10) | ||
) | ||
``` | ||
|
||
In this example, the pipeline run with duckplyr is clearly faster than the pipeline run with dplyr. | ||
It also appears to use much less memory, but this is misleading: DuckDB uses memory outside of R's memory management, so the memory usage is not visible to R. | ||
|
||
## Data larger than memory | ||
|
||
With datasets that approach or surpass the size of your machine's RAM, you want: | ||
|
||
- input data in an efficient format, like Parquet files, which duckplyr reads through ingestion functions like `read_parquet_duckdb()`; | ||
- efficient computation, which duckplyr provides via DuckDB's holistic optimization, without having to adapt your code; | ||
- large results kept out of R's memory by writing them to files using [`compute_parquet()`](https://duckplyr.tidyverse.org/reference/compute_parquet.html) or [`compute_csv()`](https://duckplyr.tidyverse.org/reference/compute_csv.html); | ||
- small results processed seamlessly with dplyr, using all verbs and functions. | ||
|
||
This workflow is fully supported by duckplyr. | ||
See [`vignette("large")`](https://duckplyr.tidyverse.org/articles/large.html) for a walkthrough and more details. | ||
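The steps above can be sketched end to end on a toy file. This is an illustrative sketch, not a recipe for real workloads: the temporary files, the columns `g` and `x`, and the derived column `x2` are all hypothetical, and a tiny Parquet file stands in for a dataset that would not fit in memory.

```r
library(duckplyr)

# Hypothetical input: a small Parquet file standing in for a large dataset
parquet_in <- tempfile(fileext = ".parquet")
duckdb_tibble(g = rep(c("a", "b"), each = 3), x = 1:6) |>
  compute_parquet(parquet_in)

# Ingest the file as a duck frame
big <- read_parquet_duckdb(parquet_in)

# Large intermediate result: write it to a file instead of returning it to R
parquet_out <- tempfile(fileext = ".parquet")
big |>
  dplyr::mutate(x2 = x * 2) |>
  compute_parquet(parquet_out)

# Small result: bring it into R and continue with regular dplyr
small <- read_parquet_duckdb(parquet_out) |>
  dplyr::summarise(total = sum(x2), .by = g) |>
  dplyr::arrange(g) |>
  dplyr::collect()

small
```

The same shape applies with `compute_csv()` when a CSV file is the preferred output.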
|
||
## Help us improve duckplyr! | ||
|
||
Our goals for future development of duckplyr include: | ||
|
||
- Enabling users to provide [custom translations](https://github.com/tidyverse/duckplyr/issues/158) of dplyr functionality; | ||
- Making it easier to contribute code to duckplyr; | ||
- Supporting more dplyr and tidyr functionality natively in DuckDB. | ||
|
||
You can help! | ||
|
||
- Please report any issues, especially regarding unknown incompatibilities. See [`vignette("limits")`](https://duckplyr.tidyverse.org/articles/limits.html). | ||
- Contribute to the codebase after reading duckplyr's [contributing guide](https://duckplyr.tidyverse.org/CONTRIBUTING.html). | ||
- Turn on telemetry to help us hear about the most frequent fallbacks so we can prioritize working on the corresponding missing dplyr translation. See [`vignette("telemetry")`](https://duckplyr.tidyverse.org/articles/telemetry.html) and the [`duckplyr::fallback_sitrep()`](https://duckplyr.tidyverse.org/reference/fallback.html) function. | ||
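As a rough sketch of what opting in to telemetry might look like in a session: `fallback_sitrep()` is the function mentioned above, while the `DUCKPLYR_FALLBACK_COLLECT` environment variable and `fallback_review()` are taken from our reading of the fallback documentation and should be checked against `vignette("telemetry")` for your installed version.

```r
library(duckplyr)

# Report the current fallback logging settings
fallback_sitrep()

# Assumption: setting this variable to 1 enables local logging of fallbacks
Sys.setenv(DUCKPLYR_FALLBACK_COLLECT = 1)

# Assumption: fallback_review() lists locally logged events,
# so they can be inspected before deciding whether to share them
fallback_review()
```

Nothing in this sketch uploads data; it only changes a setting for the current session and displays what has been logged locally.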
|
||
## Additional resources | ||
|
||
Eager to learn more about duckplyr, besides trying it out yourself? | ||
The pkgdown website of duckplyr features several [articles](https://duckplyr.tidyverse.org/articles/). | ||
Furthermore, the blog post ["duckplyr: dplyr Powered by DuckDB"](https://duckdb.org/2024/04/02/duckplyr.html) by Hannes Mühleisen provides some context on duckplyr, including its inner workings. These are also covered in a [section](https://blog.r-hub.io/2025/02/13/lazy-meanings/#duckplyr-lazy-evaluation-and-prudence) of the R-hub blog post ["Lazy introduction to laziness in R"](https://blog.r-hub.io/2025/02/13/lazy-meanings/) by Maëlle Salmon, Athanasia Mo Mowinckel and Hannah Frick. | ||
|
||
## Acknowledgements | ||
|
||
A big thanks to all folks who filed issues, created PRs and generally helped to improve duckplyr and its workhorse [duckdb](https://r.duckdb.org/)! | ||
|
||
[@adamschwing](https://github.com/adamschwing), [@alejandrohagan](https://github.com/alejandrohagan), [@andreranza](https://github.com/andreranza), [@apalacio9502](https://github.com/apalacio9502), [@apsteinmetz](https://github.com/apsteinmetz), [@barracuda156](https://github.com/barracuda156), [@beniaminogreen](https://github.com/beniaminogreen), [@bob-rietveld](https://github.com/bob-rietveld), [@brichards920](https://github.com/brichards920), [@cboettig](https://github.com/cboettig), [@davidjayjackson](https://github.com/davidjayjackson), [@DavisVaughan](https://github.com/DavisVaughan), [@Ed2uiz](https://github.com/Ed2uiz), [@eitsupi](https://github.com/eitsupi), [@era127](https://github.com/era127), [@etiennebacher](https://github.com/etiennebacher), [@eutwt](https://github.com/eutwt), [@fmichonneau](https://github.com/fmichonneau), [@hadley](https://github.com/hadley), [@hannes](https://github.com/hannes), [@hawkfish](https://github.com/hawkfish), [@IndrajeetPatil](https://github.com/IndrajeetPatil), [@JanSulavik](https://github.com/JanSulavik), [@JavOrraca](https://github.com/JavOrraca), [@jeroen](https://github.com/jeroen), [@jhk0530](https://github.com/jhk0530), [@joakimlinde](https://github.com/joakimlinde), [@JosiahParry](https://github.com/JosiahParry), [@kevbaer](https://github.com/kevbaer), [@larry77](https://github.com/larry77), [@lnkuiper](https://github.com/lnkuiper), [@lorenzwalthert](https://github.com/lorenzwalthert), [@lschneiderbauer](https://github.com/lschneiderbauer), [@luisDVA](https://github.com/luisDVA), [@math-mcshane](https://github.com/math-mcshane), [@meersel](https://github.com/meersel), [@multimeric](https://github.com/multimeric), [@mytarmail](https://github.com/mytarmail), [@nicki-dese](https://github.com/nicki-dese), [@PMassicotte](https://github.com/PMassicotte), [@prasundutta87](https://github.com/prasundutta87), [@rafapereirabr](https://github.com/rafapereirabr), [@Robinlovelace](https://github.com/Robinlovelace), 
[@romainfrancois](https://github.com/romainfrancois), [@sparrow925](https://github.com/sparrow925), [@stefanlinner](https://github.com/stefanlinner), [@szarnyasg](https://github.com/szarnyasg), [@thomasp85](https://github.com/thomasp85), [@TimTaylor](https://github.com/TimTaylor), [@Tmonster](https://github.com/Tmonster), [@toppyy](https://github.com/toppyy), [@wibeasley](https://github.com/wibeasley), [@yjunechoe](https://github.com/yjunechoe), [@ywhcuhk](https://github.com/ywhcuhk), [@zhjx19](https://github.com/zhjx19), [@ablack3](https://github.com/ablack3), [@actuarial-lonewolf](https://github.com/actuarial-lonewolf), [@ajdamico](https://github.com/ajdamico), [@amirmazmi](https://github.com/amirmazmi), [@anderson461123](https://github.com/anderson461123), [@andrewGhazi](https://github.com/andrewGhazi), [@Antonov548](https://github.com/Antonov548), [@appiehappie999](https://github.com/appiehappie999), [@ArthurAndrews](https://github.com/ArthurAndrews), [@arthurgailes](https://github.com/arthurgailes), [@babaknaimi](https://github.com/babaknaimi), [@bcaradima](https://github.com/bcaradima), [@bdforbes](https://github.com/bdforbes), [@bergest](https://github.com/bergest), [@bill-ash](https://github.com/bill-ash), [@BorgeJorge](https://github.com/BorgeJorge), [@brianmsm](https://github.com/brianmsm), [@chainsawriot](https://github.com/chainsawriot), [@ckarnes](https://github.com/ckarnes), [@clementlefevre](https://github.com/clementlefevre), [@cregouby](https://github.com/cregouby), [@cy-james-lee](https://github.com/cy-james-lee), [@daranzolin](https://github.com/daranzolin), [@david-cortes](https://github.com/david-cortes), [@DavZim](https://github.com/DavZim), [@denis-or](https://github.com/denis-or), [@developertest1234](https://github.com/developertest1234), [@dicorynia](https://github.com/dicorynia), [@dsolito](https://github.com/dsolito), [@e-kotov](https://github.com/e-kotov), [@EAVWing](https://github.com/EAVWing), 
[@eddelbuettel](https://github.com/eddelbuettel), [@edward-burn](https://github.com/edward-burn), [@elefeint](https://github.com/elefeint), [@eli-daniels](https://github.com/eli-daniels), [@elysabethpc](https://github.com/elysabethpc), [@erikvona](https://github.com/erikvona), [@florisvdh](https://github.com/florisvdh), [@gaborcsardi](https://github.com/gaborcsardi), [@ggrothendieck](https://github.com/ggrothendieck), [@hdmm3](https://github.com/hdmm3), [@hope-data-science](https://github.com/hope-data-science), [@IoannaNika](https://github.com/IoannaNika), [@jabrown-aepenergy](https://github.com/jabrown-aepenergy), [@JamesLMacAulay](https://github.com/JamesLMacAulay), [@jangorecki](https://github.com/jangorecki), [@javierlenzi](https://github.com/javierlenzi), [@Joe-Heffer-Shef](https://github.com/Joe-Heffer-Shef), [@kalibera](https://github.com/kalibera), [@lboller-pwbm](https://github.com/lboller-pwbm), [@lgaborini](https://github.com/lgaborini), [@m-muecke](https://github.com/m-muecke), [@meztez](https://github.com/meztez), [@mgirlich](https://github.com/mgirlich), [@mtmorgan](https://github.com/mtmorgan), [@nassuphis](https://github.com/nassuphis), [@nbc](https://github.com/nbc), [@olivroy](https://github.com/olivroy), [@pdet](https://github.com/pdet), [@phdjsep](https://github.com/phdjsep), [@pierre-lamarche](https://github.com/pierre-lamarche), [@r2evans](https://github.com/r2evans), [@ran-codes](https://github.com/ran-codes), [@rplsmn](https://github.com/rplsmn), [@Saarialho](https://github.com/Saarialho), [@SimonCoulombe](https://github.com/SimonCoulombe), [@tau31](https://github.com/tau31), [@thohan88](https://github.com/thohan88), [@ThomasSoeiro](https://github.com/ThomasSoeiro), [@timothygmitchell](https://github.com/timothygmitchell), [@vincentarelbundock](https://github.com/vincentarelbundock), [@VincentGuyader](https://github.com/VincentGuyader), [@wlangera](https://github.com/wlangera), [@xbasics](https://github.com/xbasics), 
[@xiaodaigh](https://github.com/xiaodaigh), [@xtimbeau](https://github.com/xtimbeau), [@yng-me](https://github.com/yng-me), [@Yousuf28](https://github.com/Yousuf28), [@yutannihilation](https://github.com/yutannihilation), and [@zcatav](https://github.com/zcatav) | ||
|
||
Special thanks to Joe Thorley ([@joethorley](https://github.com/joethorley)) for help with choosing the right words. |
Working on that part, duckdb 1.2.0 is about to be released tomorrow.