duckdb_fetch_arrow() consuming way too much memory in R #1065
Comments
Thanks. For problems like this, the way to go would be to check a SQL-only version that could run in the DuckDB CLI; a sketch of what that might look like follows the R code below. I've also noticed that many columns are stored as strings that could perhaps be integers? That would make things more efficient, I believe. In your package, if you replace the original code with something like the following:

library(censobr)
library(dplyr)
library(glue)
memory_limit <- "2GB"
query_setup <- glue::glue("SET memory_limit = '{memory_limit}';")
duckplyr::db_exec(query_setup)
duckplyr::db_exec("SET temp_directory = '/tmp';")
# get original tables
# these two lines of code download and cache the data
pop <- censobr::read_population(year = 2010)
hh <- censobr::read_households(year = 2010)
## the reprex code works with this smaller table
# pop <- censobr::read_mortality(year = 2010)
# define key columns
key_vars <- c('code_muni', 'code_state', 'abbrev_state', 'name_state',
              'code_region', 'name_region', 'code_weighting', 'V0300')
# rename household weight column
hh <- dplyr::rename(hh, 'V0010_household' = 'V0010')
# drop repeated columns
all_common_vars <- names(pop)[names(pop) %in% names(hh)]
vars_to_drop <- setdiff(all_common_vars, key_vars)
hh <- dplyr::select(hh, -all_of(vars_to_drop))
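# (Hypothetical sanity check, not in the original comment: after dropping the
# duplicates, the only columns shared by the two tables should be key_vars.)
# stopifnot(length(setdiff(intersect(names(pop), names(hh)), key_vars)) == 0)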
# V0300 seems to be a primary key in the data.
# It seems to be stored as numeric; can it be an integer?
hh |>
  count(V0300) |>
  count(n)
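# (Hypothetical follow-up, not in the original comment: if V0300 fits in the
# 32-bit integer range, casting it on both sides before the join should make
# the join keys, and the hash table built from them, smaller.)
# pop <- dplyr::mutate(pop, V0300 = as.integer(V0300))
# hh  <- dplyr::mutate(hh, V0300 = as.integer(V0300))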
df_geo <- pop |>
  select(-all_of(key_vars), V0300, everything()) |>
  left_join(hh, by = "V0300", na_matches = "never")
df_geo |>
  explain()

df_geo |>
  compute()
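As a sketch of the SQL-only check suggested above, something like the following could be run in the DuckDB CLI. This is an assumption-laden illustration: the Parquet file names are placeholders, and the join is reduced to the single candidate key V0300 used in the R code.

-- Sketch only: the Parquet paths are placeholders for wherever the
-- censobr extracts live on disk.
SET memory_limit = '2GB';
SET temp_directory = '/tmp';

CREATE VIEW pop AS SELECT * FROM read_parquet('population_2010.parquet');
CREATE VIEW hh  AS SELECT * FROM read_parquet('households_2010.parquet');

-- Same shape as the R pipeline: inspect the plan, then run the left join.
EXPLAIN
SELECT *
FROM pop
LEFT JOIN hh USING (V0300);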
Problem
I'm using {duckdb} as a dependency in my package {censobr}. I use {duckdb} to merge large data sets, and there is one particular join that DuckDB is not able to handle due to RAM limits. The left join is based on 8 key columns, and the two tables have 20,635,472 and 6,192,332 rows. Please see the reprex below.

P.S. I just wanted to add that {duckdb} is an incredible package and that the R community really appreciates your work on it! Thanks!
Reprex
Whenever I run the code above on a machine with 16 GB or 32 GB of RAM, I get the error message below. Even if I configure a memory limit for duckdb, memory usage climbs far above that limit and I still get the error. The code does work when I run it on a machine with 250 GB of RAM.
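For reference, a minimal sketch of one way the limit can be set through {duckdb}'s DBI interface; the database path is a placeholder and this is not the original reprex:

library(DBI)

# Sketch only: open a duckdb connection and cap its memory before the join.
# "census.duckdb" is a placeholder path.
con <- dbConnect(duckdb::duckdb(), dbdir = "census.duckdb")
dbExecute(con, "SET memory_limit = '2GB';")
dbExecute(con, "SET temp_directory = '/tmp';")
# ... run the join here ...
dbDisconnect(con, shutdown = TRUE)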
Environment
I'm using the latest version of {duckdb} on Windows. See below.
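(The version output itself is not reproduced here; a snippet like the one below is one way to collect it.)

# Illustrative only: prints the package version, R version, and OS name.
packageVersion("duckdb")
R.version.string
Sys.info()[["sysname"]]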