duckdb_fetch_arrow() consuming way too much memory in R #1065

Open
rafapereirabr opened this issue Mar 6, 2025 · 1 comment


rafapereirabr commented Mar 6, 2025

Problem

I'm using {duckdb} as a dependency in my package {censobr}. I use {duckdb} to merge large data sets, and there is one particular join that DuckDB cannot handle due to RAM limits. The left join is based on 8 key columns, and the two tables have 20,635,472 and 6,192,332 rows. Please see the reprex below.

P.S. I just wanted to add that {duckdb} is an incredible package and that the R community really appreciates your work on it! Thanks!

Reprex

# install.packages('censobr')

library(censobr)
library(duckdb)
library(dplyr)
library(glue)

# get original tables
# these two lines of code download and cache the data
pop <- censobr::read_population(year = 2010)
hh <- censobr::read_households(year = 2010)

## the reprex code works with this smaller table
# pop <- censobr::read_mortality(year = 2010)


# define key columns
key_vars <- c('code_muni', 'code_state', 'abbrev_state','name_state',
              'code_region', 'name_region', 'code_weighting', 'V0300')

# rename household weight column
hh <- dplyr::rename(hh, 'V0010_household' = 'V0010') |>
  dplyr::compute()

# drop repeated columns
all_common_vars <- names(pop)[names(pop) %in% names(hh)]
vars_to_drop <- setdiff(all_common_vars, key_vars)
hh <- dplyr::select(hh, -all_of(vars_to_drop)) |>
  dplyr::compute()


# create db connection on a temp file
db_path <- tempfile(pattern = 'censobr', fileext = '.duckdb')

con <- duckdb::dbConnect(
  duckdb::duckdb(),
  dbdir = db_path
)

# # Even if I configure memory limit for duckdb to use
# # memory usage goes way up higher than the limit
# memory_limit <- "15GB"
# query_setup <- stringr::str_glue("SET memory_limit = '{memory_limit}';")
# DBI::dbExecute(con, query_setup)

# register data to db
duckdb::duckdb_register_arrow(con, 'pop', pop)
duckdb::duckdb_register_arrow(con, 'hh', hh)


# Create the JOIN condition by concatenating the key columns
join_condition <- paste(
  glue::glue("pop.{key_vars} = hh.{key_vars}"),
  collapse = ' AND '
)

query_match <- glue::glue(
  "SELECT *
    FROM pop
    LEFT JOIN hh
    ON {join_condition};"
)

merge_query <- duckdb::dbSendQuery(
  conn = con,
  query_match, 
  arrow = TRUE
)

# get result of the left join as an arrow table
# this is where things go wrong / out of memory
df_geo <- duckdb::duckdb_fetch_arrow(merge_query)

# check result
head(df_geo) |> collect()

## remove files from cache
# censobr::censobr_cache(delete_file = 'all')

Whenever I run the code above on a machine with 16GB or 32GB of RAM, I get the error message below. Even if I configure a memory limit for DuckDB, memory usage goes way above that limit and I still get the error. The code does work when I run it on a machine with 250GB of RAM.

Error in duckdb_result():
! rapi_execute: Failed to run query
Error: Out of Memory Error: Allocation failure
Run rlang::last_trace() to see where the error occurred
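
For clarity, this is the memory-limit configuration I tried (the commented-out block in the reprex above, uncommented here and using glue::glue() since {glue} is already attached); the query still fails in the same way:

# memory limit that DuckDB ends up exceeding before the OOM error
memory_limit <- "15GB"
DBI::dbExecute(con, glue::glue("SET memory_limit = '{memory_limit}';"))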

Environment

I'm using the latest version of {duckdb} on Windows. See below.

> sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Matrix products: default


locale:
[1] LC_COLLATE=Portuguese_Brazil.utf8  LC_CTYPE=Portuguese_Brazil.utf8   
[3] LC_MONETARY=Portuguese_Brazil.utf8 LC_NUMERIC=C                      
[5] LC_TIME=Portuguese_Brazil.utf8    

time zone: America/Sao_Paulo
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] glue_1.8.0     dplyr_1.1.4    duckdb_1.1.3-2 DBI_1.2.3      censobr_0.4.0 

loaded via a namespace (and not attached):
 [1] R6_2.6.1          tidyselect_1.2.1  magrittr_2.0.3    tibble_3.2.1      pkgconfig_2.0.3  
 [6] generics_0.1.3    lifecycle_1.0.4   cli_3.6.4         vctrs_0.6.5       compiler_4.4.1   
[11] rstudioapi_0.17.1 tools_4.4.1       pillar_1.10.1     rlang_1.1.5       fs_1.6.5  
Collaborator

krlmlr commented Mar 7, 2025

Thanks.

For problems like this, the way to go would be to check a SQL-only version that could run in the DuckDB CLI.

Is V0300 a unique key in the household data? Restricting the join to this key worked for me with a 2 GB RAM limit. (Collecting to a temporary duckdb table. Temporary disk space capped out at 24 GB.)
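
Roughly, with the DBI connection from your reprex, that would look like the untested sketch below (I actually went through duckplyr, see the full code further down):

# cap memory and let DuckDB spill the join to disk
DBI::dbExecute(con, "SET memory_limit = '2GB';")
DBI::dbExecute(con, glue::glue("SET temp_directory = '{tempdir()}';"))

# join on V0300 only and materialize the result as a temporary DuckDB table;
# EXCLUDE drops the key columns that hh shares with pop to avoid duplicate names
DBI::dbExecute(con, "
  CREATE TEMPORARY TABLE df_geo AS
  SELECT pop.*,
         hh.* EXCLUDE (code_muni, code_state, abbrev_state, name_state,
                       code_region, name_region, code_weighting, V0300)
  FROM pop
  LEFT JOIN hh
    ON pop.V0300 = hh.V0300;
")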

I've also noticed that many columns are stored as strings that could perhaps be integers? That would make things more efficient, I believe.
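
For instance, something like this before the join (a hypothetical sketch; I have not checked which columns are actually safe to convert):

# cast code-like string columns to integers so keys compare and hash faster
pop <- pop |>
  dplyr::mutate(dplyr::across(c(code_muni, code_state, code_region), as.integer))
hh <- hh |>
  dplyr::mutate(dplyr::across(c(code_muni, code_state, code_region), as.integer))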

In your package, if you replace arrow_open_dataset() with duckplyr::read_parquet_duckdb(), the data is read directly into DuckDB instead of going through the Arrow route. See ipeaGIT/censobr#61 for an example (and the rough sketch after the code below). With that PR, and joining over just one key, I have (with the aforementioned "out-of-disk memory"):

library(censobr)
library(dplyr)
library(glue)

memory_limit <- "2GB"
query_setup <- stringr::str_glue("SET memory_limit = '{memory_limit}';")
duckplyr::db_exec(query_setup)
duckplyr::db_exec("SET temp_directory = '/tmp';")

# get original tables
# these two lines of code download and cache the data
pop <- censobr::read_population(year = 2010)
hh <- censobr::read_households(year = 2010)

## the reprex code works with this smaller table
# pop <- censobr::read_mortality(year = 2010)


# define key columns
key_vars <- c('code_muni', 'code_state', 'abbrev_state','name_state',
              'code_region', 'name_region', 'code_weighting', 'V0300')

# rename household weight column
hh <- dplyr::rename(hh, 'V0010_household' = 'V0010')

# drop repeated columns
all_common_vars <- names(pop)[names(pop) %in% names(hh)]
vars_to_drop <- setdiff(all_common_vars, key_vars)
hh <- dplyr::select(hh, -all_of(vars_to_drop))

# V0300 seems to be a primary key in the data.
# It seems to be stored as numeric, can it be an integer?
hh |>
  count(V0300) |>
  count(n)

df_geo <- pop |>
  select(-!!key_vars, V0300, everything()) |>
  left_join(hh, by = "V0300", na_matches = "never")

df_geo |>
  explain()

df_geo |>
  compute()
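
As for the swap from arrow_open_dataset() to duckplyr::read_parquet_duckdb() itself, inside censobr it would be roughly as follows (local_file here is a placeholder for the cached parquet path):

# before: the cached parquet file is opened lazily through Arrow
# df <- arrow::open_dataset(local_file)
# after: read it directly into DuckDB via duckplyr
df <- duckplyr::read_parquet_duckdb(local_file)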
