
Understanding memory usage and performance of furrr::future_apply #260

Open
alessandro-peron-sdg opened this issue Jun 8, 2023 · 5 comments


@alessandro-peron-sdg

I am facing some issues parallelizing processes with furrr::future_apply.

This is the setting I am having issues with:

    rm(list=ls(all=TRUE))
    
    require(future)
    require(furrr)
    require(dplyr)
    require(readr)
    require(parallel)
    set.seed(123)
    
    # fake data
    my_list <-   replicate(1000000, rnorm(1000), simplify = FALSE)
    
    # function to parallelize
    f_to_parallelize <- function(x){
      
      y <- sum(x)
      
      return(y)
      
    }
    
    # plans to test
    plan(sequential)
    #plan(multisession, workers=2)
    #plan(multisession, workers=6)
    #plan(multisession, workers=15)
    
    l <- future_walk(my_list, f_to_parallelize)
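
For scale, the fake input alone is roughly 8 GB (1e6 elements of 1000 doubles at 8 bytes each), and under a multisession plan those elements have to be serialized and sent to the background R processes. A quick way to check the size:

    # Rough size of the object the workers ultimately need to receive:
    # 1e6 x 1000 x 8 bytes is about 8 GB before any copies are made.
    print(object.size(my_list), units = "auto")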
    

When I profile memory and time for these 4 plans, this is what I get:

[Plot: mem_prof_sum, memory usage over time for the four plans]

I launched 4 different jobs from RStudio Server, while in a separate job I profiled the total memory used by all processes under my user to collect the data for the graph.

This is the output of sessionInfo() for the parallelization jobs:

    R version 4.2.2 Patched (2022-11-10 r83330)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: Ubuntu 20.04.6 LTS
    Matrix products: default
    BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
    LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
    locale:
    [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
    [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
    [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
    [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
    attached base packages:
    [1] parallel stats graphics grDevices utils datasets methods
    [8] base
    other attached packages:
    [1] readr_2.1.2 dplyr_1.1.0 furrr_0.2.3 future_1.24.0
    loaded via a namespace (and not attached):
    [1] rstudioapi_0.13 parallelly_1.30.0 magrittr_2.0.2 hms_1.1.1
    [5] tidyselect_1.2.0 R6_2.5.1 rlang_1.1.1 fansi_1.0.2
    [9] globals_0.14.0 tools_4.2.2 utf8_1.2.2 cli_3.6.0
    [13] ellipsis_0.3.2 digest_0.6.29 tibble_3.1.6 lifecycle_1.0.3
    [17] crayon_1.5.0 tzdb_0.2.0 purrr_1.0.1 vctrs_0.5.2
    [21] codetools_0.2-18 glue_1.6.2 compiler_4.2.2 pillar_1.7.0
    [25] generics_0.1.2 listenv_0.8.0 pkgconfig_2.0.3

Is this behavior normal? I did not expect the steep increase in memory under all the plans, nor the increase in time as I increase the number of workers.

I also tested Sys.sleep(1) in parallel, and there I got the result I expected: time decreases as I increase the number of workers.
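
Presumably that sanity check looked something like the following; this is a hypothetical reconstruction rather than the code from the report, but with a pure wait and essentially no data shipped to the workers, wall time does drop as workers are added:

    plan(multisession, workers = 6)

    # 60 one-second waits: roughly 60 s sequentially, ~10 s plus overhead with 6 workers
    future_walk(1:60, ~ Sys.sleep(1))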

What I am actually trying to parallelize is far more complex than this, i.e. a series of nested wrapper functions that train some time series models, run inference, and write a CSV without returning anything.

I feel like I am missing something very simple, yet I cannot wrap my head around it. What concerns me the most is the memory increase, since the real function is already very memory intensive.
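
One knob that may be relevant here is how furrr chunks the input across workers. A minimal sketch, assuming the per-worker copy of my_list is what drives the memory growth (the scheduling value of 10 is purely illustrative):

    plan(multisession, workers = 2)

    # Split the work into ~10 chunks per worker instead of one large chunk each,
    # so that less of my_list has to be serialized to any worker at a time.
    l <- future_walk(
      my_list,
      f_to_parallelize,
      .options = furrr_options(scheduling = 10)
    )

Whether this actually lowers peak memory in this setting is something to verify against the same memory profile.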

@DavisVaughan
Collaborator

Can you also post the code you used to do the memory profiling?

@alessandro-peron-sdg
Author

Sure, here it is:

    rm(list=ls(all=TRUE))

    library(tibble)
    library(glue)
    library(lubridate)

    run_command <- function(filename) {
      while (TRUE) {
        # Total resident memory (GiB) of all processes owned by the user, polled via ps
        output <- system("ps -u username --no-headers -o rss | awk '{sum+=$1} END {print (sum/1024/1024)}'", intern = TRUE)
        output_df <- tibble(time = format(Sys.time(), "%Y-%m-%d %H:%M:%S"), output = output)

        # Check if the file exists
        if (file.exists(filename)) {
          # Append the output to the existing CSV file
          write.table(output_df, file = filename, append = TRUE, sep = ",", row.names = FALSE, col.names = FALSE)
        } else {
          # Create a new CSV file and write the output
          write.table(output_df, file = filename, append = FALSE, sep = ",", row.names = FALSE, col.names = TRUE)
        }

        Sys.sleep(1)
      }
    }

    run_command("output_file.csv")

@alessandro-peron-sdg
Author

@DavisVaughan any news on this?

@rasmusrhl

I'm having the same issue.

@D3SL

D3SL commented Feb 25, 2024

Same for me as well. I also recently found that several gigabytes of temp files had been created and never cleaned up, and parallelized functions no longer complete as quickly as they used to. I've been using furrr for years with excellent performance, so this is unusual. It feels like something else changed in the R ecosystem that is impacting furrr.

I may have to switch to crew (powered by mirai). It's a shame, because nothing comes close to furrr in terms of syntactic sugar and ease of use.
