[R] Error in rawToChar(out) : long vectors not supported yet #35382

Open
matthewgson opened this issue May 1, 2023 · 8 comments

@matthewgson
Describe the bug, including details regarding any error messages, version, and platform.

I was trying to write long-format data with 1.2B rows and five variables from R.
It takes a long time before raising an error:

> write_parquet(taq_files, 'data/taq_files.parquet')
Error in rawToChar(out) : long vectors not supported yet: raw.c:68

SessionInfo:

> sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] fstcore_0.9.14    fst_0.9.8         gmailr_1.0.1      qs_0.25.4         fasttime_1.1-0    arrow_11.0.0.3    forcats_1.0.0     stringr_1.5.0    
 [9] dplyr_1.1.0       purrr_1.0.1       readr_2.1.3       tidyr_1.3.0       tibble_3.1.8      ggplot2_3.4.0     tidyverse_1.3.2   data.table_1.14.8

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.10         lubridate_1.9.0     assertthat_0.2.1    utf8_1.2.3          R6_2.5.1            cellranger_1.1.0    backports_1.4.1    
 [8] reprex_2.0.2        httr_1.4.4          pillar_1.8.1        rlang_1.0.6         googlesheets4_1.0.1 curl_5.0.0          readxl_1.4.1       
[15] rstudioapi_0.14     googledrive_2.0.0   bit_4.0.5           munsell_0.5.0       broom_1.0.3         compiler_4.2.3      modelr_0.1.10      
[22] pkgconfig_2.0.3     base64enc_0.1-3     tidyselect_1.2.0    fansi_1.0.4         crayon_1.5.2        tzdb_0.3.0          dbplyr_2.3.2       
[29] withr_2.5.0         rappdirs_0.3.3      grid_4.2.3          jsonlite_1.8.4      gtable_0.3.1        lifecycle_1.0.3     DBI_1.1.3          
[36] pacman_0.5.1        magrittr_2.0.3      scales_1.2.1        RcppParallel_5.1.6  cli_3.6.0           stringi_1.7.12      fs_1.6.0           
[43] xml2_1.3.3          ellipsis_0.3.2      generics_0.1.3      vctrs_0.5.2         stringfish_0.15.7   RApiSerialize_0.1.2 rematch2_2.1.2     
[50] tools_4.2.3         bit64_4.0.5         glue_1.6.2          hms_1.1.2           parallel_4.2.3      timechange_0.2.0    colorspace_2.1-0   
[57] gargle_1.2.1        rvest_1.0.3         haven_2.5.1        

Component(s)

R

@nealrichardson
Member

Can you describe the contents of taq_files?

Can you get a traceback() to confirm that https://github.com/apache/arrow/blob/main/r/R/metadata.R#L35 is where the error occurs?

Assuming so, this seems to be about saving additional attributes of the data.frame and its columns, so you'd want to look into why those attributes are so large.
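
For reference, one quick way to see which attribute is inflating the metadata (a sketch, assuming the taq_files object from the report):

# Inspect the data.frame's attributes; the dplyr grouping cache is the
# usual suspect when metadata serialization blows up.
str(attributes(taq_files), max.level = 1)
print(object.size(attr(taq_files, "groups")), units = "MB")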

@matthewgson
Author

I don't recall having this issue before, so I downgraded to version 10.0.1, and it wrote the file without a hassle.
I will loop back and confirm with traceback().

@nealrichardson
Member

Hmm, https://github.com/apache/arrow/commits/main/r/R/metadata.R hasn't meaningfully changed in a long time, and nothing in https://arrow.apache.org/docs/r/news/index.html#arrow-11002 obviously would affect this either, so I'm surprised that downgrading helps.

@matthewgson
Author

matthewgson commented May 1, 2023

taq_files is a data.table generated by calling setDT() on the result of an arrow query against a local parquet file.

> str(taq_files)
Classes ‘data.table’ and 'data.frame':	1276864378 obs. of  5 variables:
 $ sym_root      : chr  "SBAC" "SBAC" "SBAC" "SBAC" ...
 $ cusip6        : chr  "78388J" "78388J" "78388J" "78388J" ...
 $ date          : Date, format: "2015-01-06" "2015-01-06" ...
 $ minute_grouper: chr  "09:29:00" "09:30:00" "09:31:00" "09:32:00" ...
 $ mid_quote     : num  109 110 110 110 110 ...
- attr(*, "groups")= tibble [4,099,401 × 3] (S3: tbl_df/tbl/data.frame)
  ..$ sym_root: chr [1:4099401] "A" "A" "A" "A" ...
  ..$ date    : Date[1:4099401], format: "2015-01-02" "2015-01-05" ...
  ..$ .rows   : list<int> [1:4099401] 
  .. ..$ : int [1:390] 1209805 1209806 1209807 1209808 1209809 1209810 1209811 1209812 1209813 1209814 ...
  .. ..$ : int [1:390] 2476070 2476071 2476072 2476073 2476074 2476075 2476076 2476077 2476078 2476079 ...
  .. ..$ : int [1:391] 447061 447062 447063 447064 447065 447066 447067 447068 447069 447070 ...
  .. .. [list output truncated]
  .. ..@ ptype: int(0) 
  ..- attr(*, ".drop")= logi TRUE
 - attr(*, ".internal.selfref")=<externalptr> 

traceback

> write_parquet(taq_files, 'data/test.parquet')
Error in rawToChar(out) : long vectors not supported yet: raw.c:68
> traceback()
12: rawToChar(out)
11: .serialize_arrow_r_metadata(list(attributes = list(class = c("data.table", 
    "data.frame"), groups = structure(list(sym_root = c("A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
     ...
10: Table__from_dots(dots, schema, option_use_threads())
9: Table$create(x, schema = schema)
8: as_arrow_table.data.frame(x)
7: as_arrow_table(x)
6: doTryCatch(return(expr), name, parentenv, handler)
5: tryCatchOne(expr, names, parentenv, handlers[[1L]])
4: tryCatchList(expr, classes, parentenv, handlers)
3: tryCatch(as_arrow_table(x), arrow_no_method_as_arrow_table = function(e) {
       abort("Object must be coercible to an Arrow Table using `as_arrow_table()`", 
           parent = e, call = caller_env(2))
   })
2: as_writable_table(x)
1: write_parquet(taq_files, "data/test.parquet")

Strangely enough, I was able to write the parquet with arrow 11.0.0.3 after loading the parquet file written by arrow 10.0.1. However, when I replicate the data-gathering process (querying the local parquet file with arrow, then query %>% collect %>% setDT), the write fails.
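
For concreteness, a rough reconstruction of the failing pipeline (a sketch; the select/group_by columns are inferred from the str() output above and may differ from the actual query):

library(arrow)
library(dplyr)
library(data.table)

taq_files <- open_dataset('data/TAQ/') %>%
  select(sym_root, cusip6, date, minute_grouper, mid_quote) %>%
  group_by(sym_root, date) %>%  # grouping inferred from the "groups" attribute
  collect()                     # returns a grouped tibble
setDT(taq_files)                # changes the class, keeps the "groups" attribute
write_parquet(taq_files, 'data/test.parquet')
# Error in rawToChar(out) : long vectors not supported yet: raw.c:68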

@nealrichardson
Member

Ah, I bet it works if you remove the setDT() step, which is in any case not relevant if you're writing to parquet. dplyr caches grouping information in a groups attribute, which can be huge (as you see) and is redundant with the data. We remove that attribute when we write the data, but because setDT() changes the class of the input, it's no longer a grouped_df, so we don't catch it: https://github.com/apache/arrow/blob/main/r/R/metadata.R#L149-L161
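
Two ways to avoid serializing the huge grouping cache, following the explanation above (a sketch; query stands for the arrow dplyr query from the report):

library(arrow)
library(dplyr)

# Option 1: skip setDT() entirely; for grouped_df inputs arrow drops the
# "groups" attribute when writing.
query %>% collect() %>% write_parquet('data/taq_files.parquet')

# Option 2: if a data.table is still needed afterwards, ungroup() first so
# there is no oversized "groups" attribute left to serialize.
df <- query %>% collect() %>% ungroup()
data.table::setDT(df)
write_parquet(df, 'data/taq_files.parquet')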

@matthewgson
Author

matthewgson commented May 10, 2023

I also found that reading a very large table that was written from a grouped_df (grouped by multiple columns) generates an error message like the one below.

> taq = open_dataset('data/TAQ/') %>% select(date, minute_grouper, sym_root, mid_quote, permno) %>%  collect
Warning message:
Invalid metadata$r 

> print(taq)
Error:
! Assigned data `map(.subset(x, unname), vectbl_set_names, NULL)` must be compatible with existing data.
✖ Existing data has 807824462 rows.
✖ Element 1 of assigned data has 3487142834 rows.
ℹ Only vectors of size 1 are recycled.
Run `rlang::last_trace()` to see where the error occurred.
Error during wrapup: long vectors not supported yet: ../../src/include/Rinlinedfuns.h:537
Error: no more error handlers available (recursive errors?); invoking 'abort' restart

> open_dataset('data/TAQ')
FileSystemDataset with 2557 Parquet files
date: date32[day]
minute_grouper: time32[ms]
sym_root: string
sym_suffix: string
best_bid: double
best_ask: double
mid_quote: double
ret_1m: double
permno: int32
cusip: string
ncusip: string
match_lvl: int32

> open_dataset('data/TAQ/') %>% nrow
[1] 3487142834

> single_sample
# A tibble: 2,084,647 × 12
# Groups:   sym_root, sym_suffix, date [8,110]
   date       minute_grouper sym_root sym_suffix best_bid best_ask mid_quote    ret_1m permno cusip    ncusip   match_lvl
   <date>     <time>         <chr>    <chr>         <dbl>    <dbl>     <dbl>     <dbl>  <int> <chr>    <chr>        <int>
 1 2015-01-05 09:29          A        NA             40.1     40.5      40.3 NA         87432 00846U10 00846U10         1
# … with 2,084,637 more rows
# ℹ Use `print(n = ...)` to see more rows
> single_sample %>% class
[1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

Currently using arrow 10.0.0.1

@matthewgson
Author

matthewgson commented May 16, 2023

I converted the data.table to a tibble (data.frame) with as_tibble(), but that was unsuccessful (ver 12.0.0).
My workaround was to reduce the size and remove the groups attribute manually with

attr(df, "groups") <- NULL
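
Put together, the workaround looks roughly like this (a sketch; df stands for the collected table):

attr(df, "groups") <- NULL   # drop the oversized grouping cache
arrow::write_parquet(df, 'data/taq_files.parquet')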

@lucasmation

lucasmation commented Feb 24, 2025

I am having the same problem here. Arrow 17.0.0.1, large dataset in data.table format (234M rows, 111 columns, several of them long string columns; attr(pes, "groups") is NULL).

@matthewgson, were you able to pinpoint what made this work? Removing groups, or "reducing size" (and if the latter, did that mean removing rows or columns, or optimizing column types)?
