duckdb_fetch_arrow() consuming way too much memory in R #1065

Open
rafapereirabr opened this issue Mar 6, 2025 · 1 comment


rafapereirabr commented Mar 6, 2025

Problem

I'm using {duckdb} as a dependency in my package {censobr}. I use {duckdb} to merge large data sets, and there is one particular join that DuckDB cannot handle due to RAM limits. The left join is based on 8 key columns, and the two tables have 20,635,472 and 6,192,332 rows. Please see the reprex below.

P.S. I just wanted to add that {duckdb} is an incredible package and that the R community really appreciates your work on it! Thanks!

Reprex

# install.packages('censobr')

library(censobr)
library(duckdb)
library(dplyr)
library(glue)

# get original tables
# these two lines of code download and cache the data
pop <- censobr::read_population(year = 2010)
hh <- censobr::read_households(year = 2010)

## the reprex code works with this smaller table
# pop <- censobr::read_mortality(year = 2010)


# define key columns
key_vars <- c('code_muni', 'code_state', 'abbrev_state','name_state',
              'code_region', 'name_region', 'code_weighting', 'V0300')

# rename household weight column
hh <- dplyr::rename(hh, 'V0010_household' = 'V0010') |>
  dplyr::compute()

# drop repeated columns
all_common_vars <- names(pop)[names(pop) %in% names(hh)]
vars_to_drop <- setdiff(all_common_vars, key_vars)
hh <- dplyr::select(hh, -all_of(vars_to_drop)) |>
  dplyr::compute()


# create db connection on a temp file
db_path <- tempfile(pattern = 'censobr', fileext = '.duckdb')

con <- duckdb::dbConnect(
  duckdb::duckdb(),
  dbdir = db_path
)

# # Even if I configure memory limit for duckdb to use
# # memory usage goes way up higher than the limit
# memory_limit <- "15GB"
# query_setup <- stringr::str_glue("SET memory_limit = '{memory_limit}';")
# DBI::dbExecute(con, query_setup)

# register data to db
duckdb::duckdb_register_arrow(con, 'pop', pop)
duckdb::duckdb_register_arrow(con, 'hh', hh)


# Create the JOIN condition by concatenating the key columns
join_condition <- paste(
  glue::glue("pop.{key_vars} = hh.{key_vars}"),
  collapse = ' AND '
)

query_match <- glue::glue(
  "SELECT *
    FROM pop
    LEFT JOIN hh
    ON {join_condition};"
)

merge_query <- duckdb::dbSendQuery(
  conn = con,
  query_match, 
  arrow = TRUE
)

# get result of the left join as an arrow table
# this is where things go wrong / out of memory
df_geo <- duckdb::duckdb_fetch_arrow(merge_query)

# check result
head(df_geo) |> collect()

## remove files from cache
# censobr::censobr_cache(delete_file = 'all')

Whenever I run the code above on a machine with 16GB or 32GB of RAM, I get the error message below. Even if I configure a memory limit for DuckDB, memory usage goes way above that limit and I still get the error. The code does work when I run it on a machine with 250GB of RAM.

Error in duckdb_result():
! rapi_execute: Failed to run query
Error: Out of Memory Error: Allocation failure
Run rlang::last_trace() to see where the error occurred
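
For clarity, this is the memory-limit configuration I tried (the commented-out block in the reprex above, uncommented here and using glue::glue() since {glue} is already attached); the query still fails in the same way:

# memory limit that DuckDB ends up exceeding before the OOM error
memory_limit <- "15GB"
DBI::dbExecute(con, glue::glue("SET memory_limit = '{memory_limit}';"))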

Environment

I'm using the latest version of {duckdb} on Windows. See below.

> sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Matrix products: default


locale:
[1] LC_COLLATE=Portuguese_Brazil.utf8  LC_CTYPE=Portuguese_Brazil.utf8   
[3] LC_MONETARY=Portuguese_Brazil.utf8 LC_NUMERIC=C                      
[5] LC_TIME=Portuguese_Brazil.utf8    

time zone: America/Sao_Paulo
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] glue_1.8.0     dplyr_1.1.4    duckdb_1.1.3-2 DBI_1.2.3      censobr_0.4.0 

loaded via a namespace (and not attached):
 [1] R6_2.6.1          tidyselect_1.2.1  magrittr_2.0.3    tibble_3.2.1      pkgconfig_2.0.3  
 [6] generics_0.1.3    lifecycle_1.0.4   cli_3.6.4         vctrs_0.6.5       compiler_4.4.1   
[11] rstudioapi_0.17.1 tools_4.4.1       pillar_1.10.1     rlang_1.1.5       fs_1.6.5  
Collaborator

krlmlr commented Mar 7, 2025

Thanks.

For problems like this, the way to go would be to check a SQL-only version that could run in the DuckDB CLI.

Is V0300 a unique key in the household data? Restricting the join to this key worked for me with a 2 GB RAM limit. (Collecting to a temporary duckdb table. Temporary disk space capped out at 24 GB.)
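
Roughly, with the DBI connection from your reprex, that would look like the untested sketch below (I actually went through duckplyr, see the full code further down):

# cap memory and let DuckDB spill the join to disk
DBI::dbExecute(con, "SET memory_limit = '2GB';")
DBI::dbExecute(con, glue::glue("SET temp_directory = '{tempdir()}';"))

# join on V0300 only and materialize the result as a temporary DuckDB table;
# EXCLUDE drops the key columns that hh shares with pop to avoid duplicate names
DBI::dbExecute(con, "
  CREATE TEMPORARY TABLE df_geo AS
  SELECT pop.*,
         hh.* EXCLUDE (code_muni, code_state, abbrev_state, name_state,
                       code_region, name_region, code_weighting, V0300)
  FROM pop
  LEFT JOIN hh
    ON pop.V0300 = hh.V0300;
")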

I've also noticed that many columns are stored as strings that could perhaps be integers? That would make things more efficient, I believe.
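
For instance, something like this before the join (a hypothetical sketch; I have not checked which columns are actually safe to convert):

# cast code-like string columns to integers so keys compare and hash faster
pop <- pop |>
  dplyr::mutate(dplyr::across(c(code_muni, code_state, code_region), as.integer))
hh <- hh |>
  dplyr::mutate(dplyr::across(c(code_muni, code_state, code_region), as.integer))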

In your package, if you replace arrow_open_dataset() with duckplyr::read_parquet_duckdb(), the data is read directly into DuckDB instead of going through the Arrow route. See ipeaGIT/censobr#61 for an example (and the rough sketch after the code below). With that PR, and joining over just one key, I have (with the aforementioned "out-of-disk memory"):

library(censobr)
library(dplyr)
library(glue)

memory_limit <- "2GB"
query_setup <- stringr::str_glue("SET memory_limit = '{memory_limit}';")
duckplyr::db_exec(query_setup)
duckplyr::db_exec("SET temp_directory = '/tmp';")

# get original tables
# these two lines of code download and cache the data
pop <- censobr::read_population(year = 2010)
hh <- censobr::read_households(year = 2010)

## the reprex code works with this smaller table
# pop <- censobr::read_mortality(year = 2010)


# define key columns
key_vars <- c('code_muni', 'code_state', 'abbrev_state','name_state',
              'code_region', 'name_region', 'code_weighting', 'V0300')

# rename household weight column
hh <- dplyr::rename(hh, 'V0010_household' = 'V0010')

# drop repeated columns
all_common_vars <- names(pop)[names(pop) %in% names(hh)]
vars_to_drop <- setdiff(all_common_vars, key_vars)
hh <- dplyr::select(hh, -all_of(vars_to_drop))

# V0300 seems to be a primary key in the data.
# It seems to be stored as numeric, can it be an integer?
hh |>
  count(V0300) |>
  count(n)

df_geo <- pop |>
  select(-!!key_vars, V0300, everything()) |>
  left_join(hh, by = "V0300", na_matches = "never")

df_geo |>
  explain()

df_geo |>
  compute()
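
As for the swap from arrow_open_dataset() to duckplyr::read_parquet_duckdb() itself, inside censobr it would be roughly as follows (local_file here is a placeholder for the cached parquet path):

# before: the cached parquet file is opened lazily through Arrow
# df <- arrow::open_dataset(local_file)
# after: read it directly into DuckDB via duckplyr
df <- duckplyr::read_parquet_duckdb(local_file)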
