Merged
4 changes: 3 additions & 1 deletion compose.yaml
@@ -441,7 +441,9 @@ services:
ARROW_HOME: /arrow
ARROW_DEPENDENCY_SOURCE: BUNDLED
LIBARROW_MINIMAL: "false"
ARROW_MIMALLOC: "ON"
# explicitly enable GCS when we build libarrow so that binary libarrow
# users get more fully-featured builds
ARROW_GCS: "ON"
volumes: *ubuntu-volumes
command: &cpp-static-command
/bin/bash -c "
1 change: 0 additions & 1 deletion dev/tasks/r/github.packages.yml
@@ -81,7 +81,6 @@ jobs:
env:
{{ macros.github_set_sccache_envvars()|indent(8) }}
MACOSX_DEPLOYMENT_TARGET: "11.6"
ARROW_S3: ON
ARROW_GCS: ON
ARROW_DEPENDENCY_SOURCE: BUNDLED
CMAKE_GENERATOR: Ninja
2 changes: 1 addition & 1 deletion r/tools/nixlibs.R
@@ -597,7 +597,7 @@ build_libarrow <- function(src_dir, dst_dir) {
env_var_list <- c(
env_var_list,
ARROW_S3 = Sys.getenv("ARROW_S3", "ON"),
ARROW_GCS = Sys.getenv("ARROW_GCS", "ON"),
# ARROW_GCS = Sys.getenv("ARROW_GCS", "ON"),
ARROW_WITH_ZSTD = Sys.getenv("ARROW_WITH_ZSTD", "ON")
)
}
193 changes: 193 additions & 0 deletions r/vignettes/developers/binary_features.Rmd
@@ -0,0 +1,193 @@
---
title: "Libarrow binary features"
description: >
Understanding which C++ features are enabled in Arrow R package builds
output: rmarkdown::html_vignette
---

This document explains which C++ features are enabled in different Arrow R
package build configurations, and documents the decisions behind our default
feature set. It is internal developer documentation, not a guide for
installing the Arrow R package; for that, see the
[installation guide](../install.html).

## Overview

When the Arrow R package is installed, it needs a copy of the Arrow C++ library
(libarrow). This can come from:

1. **Prebuilt binaries** we host (for releases and nightlies)
2. **Source builds** when binaries aren't available or users opt out

The features available in libarrow depend on how it was built. This document
covers the feature configuration for both scenarios.

## Prebuilt libarrow binary configuration

We produce prebuilt libarrow binaries for macOS, Windows, and Linux. These
binaries include **more features** than the default source build to provide
users with a fully-featured experience out of the box.

### Current binary feature set

| Platform | S3 | GCS | Configured in |
|----------|----|----|---------------|
| macOS (ARM64, x86_64) | ON | ON | `dev/tasks/r/github.packages.yml` |
| Windows | ON | ON | `ci/scripts/PKGBUILD` |
| Linux (x86_64) | ON | ON | `compose.yaml` (`ubuntu-cpp-static`) |

### Exceptions to our build defaults

Even though GCS defaults to OFF for source builds, we explicitly enable it in
our prebuilt binaries because:

1. **Binary users expect features to "just work"** - they shouldn't need to
rebuild from source to access cloud storage
2. **Build time is not a concern** - we build binaries once in CI, not on
user machines
3. **Parity across platforms** - users get the same features regardless of OS


## Feature configuration in source builds of libarrow

Source builds are controlled by `r/inst/build_arrow_static.sh`. The key
environment variable is `LIBARROW_MINIMAL`:

- `LIBARROW_MINIMAL` unset: Default feature set (Parquet, Dataset, JSON, common compression ON; S3/GCS/jemalloc OFF)
- `LIBARROW_MINIMAL=false`: Full feature set (adds S3, jemalloc, additional compression)
- `LIBARROW_MINIMAL=true`: Truly minimal (disables Parquet, Dataset, JSON, most compression, SIMD)
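
The mapping from `LIBARROW_MINIMAL` to the default used for optional features can be sketched in shell. This is a minimal illustration of the behaviour described in the list above, not a copy of `build_arrow_static.sh`; the function name `arrow_default_param` is hypothetical, and the sketch covers only the ON/OFF default, not the extra features that `LIBARROW_MINIMAL=true` disables.

```shell
# Hypothetical helper: prints the value optional features (S3, jemalloc,
# extra compression) would default to for a given LIBARROW_MINIMAL value.
arrow_default_param() {
  # Only an explicit "false" switches the optional features on;
  # unset (empty) and "true" both leave them off.
  if [ "${1:-}" = "false" ]; then
    echo "ON"
  else
    echo "OFF"
  fi
}

arrow_default_param ""       # LIBARROW_MINIMAL unset
arrow_default_param false    # full feature set
arrow_default_param true     # minimal build
```

Only the middle call prints `ON`, which matches the three cases listed above.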

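### Features enabled by default

These features are built whenever `LIBARROW_MINIMAL` is unset or `false`. As
noted above, `LIBARROW_MINIMAL=true` turns some of them off (Parquet, Dataset,
JSON):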

| Feature | CMake Flag | Notes |
|---------|------------|-------|
| Compute | `ARROW_COMPUTE=ON` | Core compute functions |
| CSV | `ARROW_CSV=ON` | CSV reading/writing |
| Filesystem | `ARROW_FILESYSTEM=ON` | Local filesystem support |
| JSON | `ARROW_JSON=ON` | JSON reading |
| Parquet | `ARROW_PARQUET=ON` | Parquet file format |
| Dataset | `ARROW_DATASET=ON` | Multi-file datasets |
| Acero | `ARROW_ACERO=ON` | Query execution engine |
| Mimalloc | `ARROW_MIMALLOC=ON` | Memory allocator |
| LZ4 | `ARROW_WITH_LZ4=ON` | LZ4 compression |
| Snappy | `ARROW_WITH_SNAPPY=ON` | Snappy compression |
| RE2 | `ARROW_WITH_RE2=ON` | Regular expressions |
| UTF8Proc | `ARROW_WITH_UTF8PROC=ON` | Unicode support |

### Features controlled by LIBARROW_MINIMAL
These are the docs I wish we had a few years ago, thank you for writing them!!


When `LIBARROW_MINIMAL=false`, the following additional features are enabled
(via `$ARROW_DEFAULT_PARAM=ON`):

| Feature | CMake Flag | Default |
|---------|------------|---------|
| S3 | `ARROW_S3` | `$ARROW_DEFAULT_PARAM` |
| Jemalloc | `ARROW_JEMALLOC` | `$ARROW_DEFAULT_PARAM` |
| Brotli | `ARROW_WITH_BROTLI` | `$ARROW_DEFAULT_PARAM` |
| BZ2 | `ARROW_WITH_BZ2` | `$ARROW_DEFAULT_PARAM` |
| Zlib | `ARROW_WITH_ZLIB` | `$ARROW_DEFAULT_PARAM` |
| Zstd | `ARROW_WITH_ZSTD` | `$ARROW_DEFAULT_PARAM` |

### Features that require explicit opt-in

GCS (Google Cloud Storage) is **always off by default**, even when
`LIBARROW_MINIMAL=false`:

| Feature | CMake Flag | Default | Reason |
|---------|------------|---------|--------|
| GCS | `ARROW_GCS` | `OFF` | Build complexity, dependency size |

To enable GCS in a source build, you must explicitly set `ARROW_GCS=ON`.
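
A sketch of why an explicit setting wins. It assumes the flag is read with a `${VAR:-default}` shell expansion; that pattern and the function name `resolve_gcs` are illustrative assumptions, not quoted from `build_arrow_static.sh`.

```shell
# Illustrative only: resolve the GCS flag the way a `${VAR:-default}`
# expansion would. `resolve_gcs` is a hypothetical name.
resolve_gcs() {
  echo "${ARROW_GCS:-OFF}"
}

unset ARROW_GCS
resolve_gcs        # nothing set: falls back to OFF

ARROW_GCS=ON
resolve_gcs        # explicit opt-in takes precedence over the default
```

Under this assumption the value of `LIBARROW_MINIMAL` never enters the expansion, which is why GCS stays off until `ARROW_GCS` itself is set.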

**Why is GCS off by default?**

GCS was turned off by default in [#48343](https://github.com/apache/arrow/pull/48343)
(December 2025) because:

1. Building google-cloud-cpp is fragile and adds significant build time
2. The dependency on abseil (ABSL) has caused compatibility issues
3. Users who need GCS can still enable it explicitly

## Configuration file locations

### libarrow source build configuration

The main build script that controls source builds:

**`r/inst/build_arrow_static.sh`** - CMake flags and defaults
([view source](https://github.com/apache/arrow/blob/main/r/inst/build_arrow_static.sh)).
The environment variables to look for are `LIBARROW_MINIMAL`, `ARROW_*`, and `ARROW_DEFAULT_PARAM`.

### libarrow binary build configuration

Each platform has its own configuration file:

| Platform | Config file | Key settings |
|----------|-------------|--------------|
| macOS | `dev/tasks/r/github.packages.yml` | `LIBARROW_MINIMAL=false`, `ARROW_GCS=ON` |
| Windows | `ci/scripts/PKGBUILD` | `ARROW_GCS=ON`, `ARROW_S3=ON` |
| Linux | `compose.yaml` (`ubuntu-cpp-static`) | `LIBARROW_MINIMAL=false`, `ARROW_GCS=ON` |

## R-universe builds

[R-universe](https://apache.r-universe.dev/arrow) builds the Arrow R package
for users who want newer versions than CRAN. R-universe behavior varies by
platform and architecture:

| Platform | Architecture | Build method | Features |
|----------|--------------|--------------|----------|
| macOS | ARM64 | Downloads prebuilt binary | Full (S3 + GCS) |
| macOS | x86_64 | Downloads prebuilt binary | Full (S3 + GCS) |
| Windows | x86_64 | Downloads prebuilt binary | Full (S3 + GCS) |
| Windows | ARM64 | Not supported | NA |
| Linux | x86_64 | Downloads prebuilt binary | Full (S3 + GCS) |
| Linux | ARM64 | Builds from source | S3 only (no GCS) |

### Why Linux ARM64 builds from source

We only publish prebuilt Linux binaries for the x86_64 architecture. The binary
selection logic in `r/tools/nixlibs.R` (line 263) explicitly checks for this:

```r
if (identical(os, "darwin") || (identical(os, "linux") && identical(arch, "x86_64"))) {
```

When R-universe builds on Linux ARM64 runners, no binary is available, so it
falls back to building from source using `build_arrow_static.sh`. Since GCS
defaults to OFF in that script, Linux ARM64 users don't get GCS support.
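
The effect of that condition on each target can be sketched as follows. This is a hypothetical shell rendering for illustration only: the real check is the R code in `r/tools/nixlibs.R`, and the function name `has_prebuilt_libarrow` and the architecture strings passed to it are assumptions.

```shell
# Hypothetical shell rendering of the R condition quoted above.
has_prebuilt_libarrow() {
  os="$1"
  arch="$2"
  # macOS on any architecture, or Linux on x86_64 only.
  [ "$os" = "darwin" ] || { [ "$os" = "linux" ] && [ "$arch" = "x86_64" ]; }
}

for target in "darwin arm64" "linux x86_64" "linux aarch64"; do
  # Split "os arch" into two positional parameters.
  set -- $target
  if has_prebuilt_libarrow "$1" "$2"; then
    echo "$1/$2: download prebuilt binary"
  else
    echo "$1/$2: build from source"
  fi
done
```

Only the `linux aarch64` pair falls through to the source build, which is the row in the table above that lacks GCS.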

### Enabling GCS for Linux ARM64

To provide full feature parity for Linux ARM64, we would need to:

1. Add an ARM64 Linux build job to `dev/tasks/r/github.packages.yml`
2. Update `select_binary()` in `nixlibs.R` to recognize `linux-aarch64`
3. Add the artifact pattern to `dev/tasks/tasks.yml`
4. Update the nightly upload workflow

See [GH-36193](https://github.com/apache/arrow/issues/36193) for tracking this work.

Alternatively, changing the GCS default in `build_arrow_static.sh` from `OFF`
to `$ARROW_DEFAULT_PARAM` would enable GCS for all source builds, including
Linux ARM64 on R-universe.
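
The difference between the two defaults can be illustrated with plain shell parameter expansion. This is a sketch under the assumption that the flag is resolved with a `${VAR:-default}` pattern; the variable names `current_default` and `proposed_default` are hypothetical.

```shell
# Hypothetical comparison of the two defaults for ARROW_GCS.
ARROW_DEFAULT_PARAM="ON"   # the value implied by LIBARROW_MINIMAL=false
unset ARROW_GCS            # the user has not opted in explicitly

current_default="${ARROW_GCS:-OFF}"
proposed_default="${ARROW_GCS:-$ARROW_DEFAULT_PARAM}"

echo "current:  ARROW_GCS=$current_default"
echo "proposed: ARROW_GCS=$proposed_default"
```

With the proposed default, GCS would follow `LIBARROW_MINIMAL` like S3 does, while an explicit `ARROW_GCS=OFF` would still switch it off.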

## Checking installed features

Users can check which features are enabled in their installation:

```r
# Show all capabilities
arrow::arrow_info()

# Check specific features
arrow::arrow_with_s3()
arrow::arrow_with_gcs()
```

## Related documentation

- [Installation guide](../install.html) - User-facing installation docs
- [Installation details](./install_details.html) - How the build system works
- [Developer setup](./setup.html) - Building Arrow for development