[R] Error in rawToChar(out) : long vectors not supported yet #35382

Open
matthewgson opened this issue May 1, 2023 · 8 comments

@matthewgson
Describe the bug, including details regarding any error messages, version, and platform.

I was trying to write long-format data with 1.2B rows and five variables from R.
It takes a long time before raising an error:

> write_parquet(taq_files, 'data/taq_files.parquet')
Error in rawToChar(out) : long vectors not supported yet: raw.c:68

SessionInfo:

> sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] fstcore_0.9.14    fst_0.9.8         gmailr_1.0.1      qs_0.25.4         fasttime_1.1-0    arrow_11.0.0.3    forcats_1.0.0     stringr_1.5.0    
 [9] dplyr_1.1.0       purrr_1.0.1       readr_2.1.3       tidyr_1.3.0       tibble_3.1.8      ggplot2_3.4.0     tidyverse_1.3.2   data.table_1.14.8

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.10         lubridate_1.9.0     assertthat_0.2.1    utf8_1.2.3          R6_2.5.1            cellranger_1.1.0    backports_1.4.1    
 [8] reprex_2.0.2        httr_1.4.4          pillar_1.8.1        rlang_1.0.6         googlesheets4_1.0.1 curl_5.0.0          readxl_1.4.1       
[15] rstudioapi_0.14     googledrive_2.0.0   bit_4.0.5           munsell_0.5.0       broom_1.0.3         compiler_4.2.3      modelr_0.1.10      
[22] pkgconfig_2.0.3     base64enc_0.1-3     tidyselect_1.2.0    fansi_1.0.4         crayon_1.5.2        tzdb_0.3.0          dbplyr_2.3.2       
[29] withr_2.5.0         rappdirs_0.3.3      grid_4.2.3          jsonlite_1.8.4      gtable_0.3.1        lifecycle_1.0.3     DBI_1.1.3          
[36] pacman_0.5.1        magrittr_2.0.3      scales_1.2.1        RcppParallel_5.1.6  cli_3.6.0           stringi_1.7.12      fs_1.6.0           
[43] xml2_1.3.3          ellipsis_0.3.2      generics_0.1.3      vctrs_0.5.2         stringfish_0.15.7   RApiSerialize_0.1.2 rematch2_2.1.2     
[50] tools_4.2.3         bit64_4.0.5         glue_1.6.2          hms_1.1.2           parallel_4.2.3      timechange_0.2.0    colorspace_2.1-0   
[57] gargle_1.2.1        rvest_1.0.3         haven_2.5.1        

Component(s)

R

@nealrichardson
Member

Can you describe the contents of taq_files?

Can you get a traceback() to confirm that https://github.com/apache/arrow/blob/main/r/R/metadata.R#L35 is where the error occurs?

Assuming so, this seems to be about saving additional attributes of the data.frame and its columns, so you'd want to look into why those attributes are so large.
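
For reference, one quick way to see which attribute is inflating the metadata (a sketch, assuming the taq_files object from the report):

# Inspect the data.frame's attributes; the dplyr grouping cache is the
# usual suspect when metadata serialization blows up.
str(attributes(taq_files), max.level = 1)
print(object.size(attr(taq_files, "groups")), units = "MB")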

@matthewgson
Author

I don't recall having this issue before, so I downgraded to version 10.0.1, and it wrote the file without a hassle.
I will loop back and confirm with traceback().

@nealrichardson
Member

Hmm, https://github.com/apache/arrow/commits/main/r/R/metadata.R hasn't meaningfully changed in a long time, and nothing in https://arrow.apache.org/docs/r/news/index.html#arrow-11002 obviously would affect this either, so I'm surprised that downgrading helps.

@matthewgson
Author

matthewgson commented May 1, 2023

taq_files is a data.table generated by calling setDT() on the result of an arrow query against a local parquet file.

> str(taq_files)
Classes ‘data.table’ and 'data.frame':	1276864378 obs. of  5 variables:
 $ sym_root      : chr  "SBAC" "SBAC" "SBAC" "SBAC" ...
 $ cusip6        : chr  "78388J" "78388J" "78388J" "78388J" ...
 $ date          : Date, format: "2015-01-06" "2015-01-06" ...
 $ minute_grouper: chr  "09:29:00" "09:30:00" "09:31:00" "09:32:00" ...
 $ mid_quote     : num  109 110 110 110 110 ...
- attr(*, "groups")= tibble [4,099,401 × 3] (S3: tbl_df/tbl/data.frame)
  ..$ sym_root: chr [1:4099401] "A" "A" "A" "A" ...
  ..$ date    : Date[1:4099401], format: "2015-01-02" "2015-01-05" ...
  ..$ .rows   : list<int> [1:4099401] 
  .. ..$ : int [1:390] 1209805 1209806 1209807 1209808 1209809 1209810 1209811 1209812 1209813 1209814 ...
  .. ..$ : int [1:390] 2476070 2476071 2476072 2476073 2476074 2476075 2476076 2476077 2476078 2476079 ...
  .. ..$ : int [1:391] 447061 447062 447063 447064 447065 447066 447067 447068 447069 447070 ...
  .. .. [list output truncated]
  .. ..@ ptype: int(0) 
  ..- attr(*, ".drop")= logi TRUE
 - attr(*, ".internal.selfref")=<externalptr> 

traceback

> write_parquet(taq_files, 'data/test.parquet')
Error in rawToChar(out) : long vectors not supported yet: raw.c:68
> traceback()
12: rawToChar(out)
11: .serialize_arrow_r_metadata(list(attributes = list(class = c("data.table", 
    "data.frame"), groups = structure(list(sym_root = c("A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
    "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
     ...
10: Table__from_dots(dots, schema, option_use_threads())
9: Table$create(x, schema = schema)
8: as_arrow_table.data.frame(x)
7: as_arrow_table(x)
6: doTryCatch(return(expr), name, parentenv, handler)
5: tryCatchOne(expr, names, parentenv, handlers[[1L]])
4: tryCatchList(expr, classes, parentenv, handlers)
3: tryCatch(as_arrow_table(x), arrow_no_method_as_arrow_table = function(e) {
       abort("Object must be coercible to an Arrow Table using `as_arrow_table()`", 
           parent = e, call = caller_env(2))
   })
2: as_writable_table(x)
1: write_parquet(taq_files, "data/test.parquet")

Strangely enough, I was able to write the parquet with arrow 11.0.0.3 after loading the parquet file written by arrow 10.0.1. However, when I replicate the data-gathering process (querying the local parquet file with arrow, then query %>% collect %>% setDT), the write fails.
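
For concreteness, a rough reconstruction of the failing pipeline (a sketch; the select/group_by columns are inferred from the str() output above and may differ from the actual query):

library(arrow)
library(dplyr)
library(data.table)

taq_files <- open_dataset('data/TAQ/') %>%
  select(sym_root, cusip6, date, minute_grouper, mid_quote) %>%
  group_by(sym_root, date) %>%  # grouping inferred from the "groups" attribute
  collect()                     # returns a grouped tibble
setDT(taq_files)                # changes the class, keeps the "groups" attribute
write_parquet(taq_files, 'data/test.parquet')
# Error in rawToChar(out) : long vectors not supported yet: raw.c:68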

@nealrichardson
Member

Ah, I bet it works if you remove the setDT() step, which is in any case not relevant if you're writing to parquet. dplyr caches grouping information in a groups attribute, which can be huge (as you see) and is redundant with the data. We remove that attribute when we write the data, but because setDT() changes the class of the input, it's no longer a grouped_df, so we don't catch it: https://github.com/apache/arrow/blob/main/r/R/metadata.R#L149-L161
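
Two ways to avoid serializing the huge grouping cache, following the explanation above (a sketch; query stands for the arrow dplyr query from the report):

library(arrow)
library(dplyr)

# Option 1: skip setDT() entirely; for grouped_df inputs arrow drops the
# "groups" attribute when writing.
query %>% collect() %>% write_parquet('data/taq_files.parquet')

# Option 2: if a data.table is still needed afterwards, ungroup() first so
# there is no oversized "groups" attribute left to serialize.
df <- query %>% collect() %>% ungroup()
data.table::setDT(df)
write_parquet(df, 'data/taq_files.parquet')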

@matthewgson
Author

matthewgson commented May 10, 2023

I also found that reading a very large table that was written from a grouped_df (grouped by multiple columns) generates an error message like the one below.

> taq = open_dataset('data/TAQ/') %>% select(date, minute_grouper, sym_root, mid_quote, permno) %>%  collect
Warning message:
Invalid metadata$r 

> print(taq)
Error:
! Assigned data `map(.subset(x, unname), vectbl_set_names, NULL)` must be compatible with existing data.
✖ Existing data has 807824462 rows.
✖ Element 1 of assigned data has 3487142834 rows.
ℹ Only vectors of size 1 are recycled.
Run `rlang::last_trace()` to see where the error occurred.
Error during wrapup: long vectors not supported yet: ../../src/include/Rinlinedfuns.h:537
Error: no more error handlers available (recursive errors?); invoking 'abort' restart

> open_dataset('data/TAQ')
FileSystemDataset with 2557 Parquet files
date: date32[day]
minute_grouper: time32[ms]
sym_root: string
sym_suffix: string
best_bid: double
best_ask: double
mid_quote: double
ret_1m: double
permno: int32
cusip: string
ncusip: string
match_lvl: int32

> open_dataset('data/TAQ/') %>% nrow
[1] 3487142834

> single_sample
# A tibble: 2,084,647 × 12
# Groups:   sym_root, sym_suffix, date [8,110]
   date       minute_grouper sym_root sym_suffix best_bid best_ask mid_quote    ret_1m permno cusip    ncusip   match_lvl
   <date>     <time>         <chr>    <chr>         <dbl>    <dbl>     <dbl>     <dbl>  <int> <chr>    <chr>        <int>
 1 2015-01-05 09:29          A        NA             40.1     40.5      40.3 NA         87432 00846U10 00846U10         1
# … with 2,084,637 more rows
# ℹ Use `print(n = ...)` to see more rows
> single_sample %>% class
[1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

Currently using arrow 10.0.0.1

@matthewgson
Author

matthewgson commented May 16, 2023

I converted the data.table to a tibble (data.frame) with as_tibble(), but that was unsuccessful (ver 12.0.0).
My workaround was to reduce the size and remove the groups attribute manually with

attr(df, "groups") <- NULL
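
Put together, the workaround looks roughly like this (a sketch; df stands for the collected table):

attr(df, "groups") <- NULL   # drop the oversized grouping cache
arrow::write_parquet(df, 'data/taq_files.parquet')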

@lucasmation

lucasmation commented Feb 24, 2025

I am having the same problem here. Arrow 17.0.0.1, large dataset in data.table format (234M rows, 111 columns, several of them long string columns; attr(pes, "groups") is NULL).

@matthewgson, were you able to pinpoint what made this work? Removing groups, or "reducing size" (and if the latter, did that mean removing rows or columns, or optimizing column types)?
