"Gentle Introduction to Arrow / Record Batches" #11336 #18051
Conversation
This adds a new user guide page addressing issue apache#11336 to provide a gentle introduction to Apache Arrow and RecordBatches for DataFusion users. The guide includes:

- Explanation of Arrow as a columnar specification
- Visual comparison of row vs columnar storage (with ASCII diagrams)
- Rationale for RecordBatch-based streaming (memory + vectorization)
- Practical examples: reading files, building batches, querying with MemTable
- Clear guidance on when Arrow knowledge is needed (extension points)
- Links back to DataFrame API and library user guide
- Link to DataFusion Invariants for contributors who want to go deeper

This helps users understand the foundation without getting overwhelmed, addressing feedback from PR apache#11290 that DataFrame examples "throw people into the deep end of Arrow."
…navigation

- Add explanation of Arc and ArrayRef for Rust newcomers
- Add visual diagram showing RecordBatch streaming through pipeline
- Make common pitfalls more concrete with specific examples
- Emphasize Arrow's unified type system as DataFusion's foundation
- Add comprehensive API documentation links throughout document
- Link to extension points guides (TableProvider, UDFs, custom operators)

These improvements make the Arrow introduction more accessible for newcomers while providing clear navigation paths to advanced topics for users extending DataFusion. Addresses apache#11336
Run prettier to fix markdown link reference formatting (lowercase convention)
Apply prettier formatting to fix pre-existing formatting issues in:

- query-optimizer.md
- concepts-readings-events.md
- scalar_functions.md
Thanks for picking this up; I left a few comments, though my overall impression is that the guide as-is feels a little disjointed in some of the information presented, and it's confusing to me because I don't know what preexisting knowledge it assumes of readers. Maybe the article would benefit from a tighter focus, leaving more verbose details to external links (such as the Arrow docs).

Then again, I'm not coming from a fresh user perspective, so I'm a bit biased in that regard 😅
---

# REFERENCES
These links below aren't actually visible, so I don't think this header is necessary.
- **Vectorized Execution**: Process entire columns at once using SIMD instructions
- **Better Compression**: Similar values stored together compress more efficiently
- **Cache Efficiency**: Scanning specific columns doesn't load unnecessary data
- **Zero-Copy Data Sharing**: Systems can share Arrow data without conversion overhead
I would be hesitant to mention compression here; being an in-memory format, Arrow data isn't typically compressed (as compared to something like Parquet).
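For concreteness, here is a minimal sketch of what the "vectorized execution" bullet above means in practice: one compute-kernel call processes an entire column rather than looping row by row. This assumes a recent `arrow` crate with the `numeric` kernels; the values are illustrative.

```rust
// Minimal sketch: compute kernels operate on whole columns in one call,
// which is what enables SIMD. Values are illustrative.
use arrow::array::Int32Array;
use arrow::compute::kernels::numeric::add;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let a = Int32Array::from(vec![1, 2, 3, 4]);
    let b = Int32Array::from(vec![10, 20, 30, 40]);
    // One vectorized call processes every row in the column.
    let sum = add(&a, &b)?;
    println!("{sum:?}"); // [11, 22, 33, 44]
    Ok(())
}
```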
**Key Properties**:

- Arrays are immutable (create new batches to modify data)
- NULL values tracked via efficient validity bitmaps
- Variable-length data (strings, lists) use offset arrays for efficient access
I feel the last two properties are a bit mismatched here; they are properties of arrays, not RecordBatches. More importantly, in a guide that is meant to be a gentle introduction, they seem to be placed here randomly. If someone were to read "Variable-length data (strings, lists) use offset arrays for efficient access", there isn't much to glean from that information (that is relevant to the overall theme of the guide) 🤔
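For readers who want to see what the validity-bitmap property in the excerpt above means concretely, a minimal sketch (assuming the standard `arrow` crate API):

```rust
// Minimal sketch: NULLs live in a validity bitmap alongside the values,
// so this array stores three slots but only two valid values.
use arrow::array::{Array, Int32Array};

fn main() {
    let arr = Int32Array::from(vec![Some(1), None, Some(3)]);
    assert_eq!(arr.len(), 3);        // three slots
    assert_eq!(arr.null_count(), 1); // one slot tracked as NULL
    assert!(arr.is_null(1));         // the bitmap marks slot 1 invalid
}
```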
### Why Not Process Single Rows?

- **Lost Vectorization**: Can't use SIMD instructions on single values
- **Poor Cache Utilization**: Jumping between rows defeats CPU cache optimization
- **High Overhead**: Managing individual rows has significant bookkeeping costs
This section feels a bit misplaced, as some of these downsides were already mentioned right above under "Why this matters", so it feels a little inconsistent to have the points stated again right below.
## What is a RecordBatch? (And Why Batch?)

A **[`RecordBatch`]** represents a horizontal slice of a table—a collection of equal-length columnar arrays sharing the same schema.
Suggested change:

A **[`RecordBatch`]** represents a horizontal slice of a table—a collection of equal-length columnar arrays that form a common schema.

I'm not sure about this wording either, but it feels slightly wrong to describe the schema as being shared by the arrays 🤔
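Whatever wording lands, a minimal sketch of the "equal-length columns plus one schema" idea may help ground the discussion (assuming the standard `arrow` crate API; column names and values are illustrative):

```rust
// Minimal sketch: a RecordBatch is equal-length columns plus one schema.
use std::sync::Arc;
use arrow::array::{Int32Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int32, false),
        Field::new("name", DataType::Utf8, true),
    ]));
    // Every column must have the same length (two rows here).
    let batch = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(Int32Array::from(vec![1, 2])),
            Arc::new(StringArray::from(vec![Some("a"), None])),
        ],
    )?;
    assert_eq!(batch.num_rows(), 2);
    assert_eq!(batch.num_columns(), 2);
    Ok(())
}
```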
Sometimes you need to create Arrow data programmatically rather than reading from files. This example shows the core building blocks: creating typed arrays (like [`Int32Array`] for numbers), defining a [`Schema`] that describes your columns, and assembling them into a [`RecordBatch`].

You'll notice [`Arc`] ([Atomically Reference Counted](https://doc.rust-lang.org/std/sync/struct.Arc.html)) is used frequently—this is how Arrow enables efficient, zero-copy data sharing. Instead of copying data, different parts of the query engine can safely share read-only references to the same underlying memory. [`ArrayRef`] is simply a type alias for `Arc<dyn Array>`, representing a reference to any Arrow array type.
This wording implies `Arc` is the key to Arrow, which can be misleading considering that's more of an implementation detail on the Rust side 🤔
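To illustrate the sharing behavior the excerpt describes (without overstating `Arc` as an Arrow concept), a small sketch showing that cloning an [`ArrayRef`] only bumps a reference count; assumes the standard `arrow` crate:

```rust
// Minimal sketch: cloning an ArrayRef bumps a reference count;
// the underlying buffers are shared, never copied.
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, Int32Array};

fn main() {
    let col: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
    let shared = Arc::clone(&col); // cheap: no data is copied
    assert_eq!(Arc::strong_count(&col), 2);
    assert_eq!(shared.len(), 3); // both handles see the same memory
}
```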
2. [Library User Guide: DataFrame API](../library-user-guide/using-the-dataframe-api.md) - Detailed examples and patterns
3. [Custom Table Providers](../library-user-guide/custom-table-providers.md) - When you need Arrow knowledge

## Further reading
I feel we can trim some of these references; for example, including IPC is probably unnecessary for the goal of this guide.
## Next Steps: Working with DataFrames

Now that you understand Arrow's RecordBatch format, you're ready to work with DataFusion's high-level APIs. The [DataFrame API](dataframe.md) provides a familiar, ergonomic interface for building queries without needing to think about Arrow internals most of the time.

The DataFrame API handles all the Arrow details under the hood - reading files into RecordBatches, applying transformations, and producing results. You only need to drop down to the Arrow level when implementing custom data sources, UDFs, or other extension points.

**Recommended reading order:**

1. [DataFrame API](dataframe.md) - High-level query building interface
2. [Library User Guide: DataFrame API](../library-user-guide/using-the-dataframe-api.md) - Detailed examples and patterns
3. [Custom Table Providers](../library-user-guide/custom-table-providers.md) - When you need Arrow knowledge
It feels weird to have a "Next steps" section about working with the DataFrame API, given this guide itself is meant to be an introduction to Arrow for DataFusion users who may not need to use Arrow directly.
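As a concrete illustration of "the DataFrame API handles all the Arrow details under the hood" from the excerpt above, a hedged sketch that registers a `MemTable` and queries it; RecordBatches only resurface when collecting results. Assumes the `datafusion` crate with a tokio runtime; table name and values are illustrative.

```rust
// Hedged sketch: register a MemTable, query it with the DataFrame API,
// and only see RecordBatches again when collecting results.
use std::sync::Arc;
use datafusion::arrow::array::Int32Array;
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::datasource::MemTable;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let schema = Arc::new(Schema::new(vec![Field::new("x", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
    )?;
    let table = MemTable::try_new(schema, vec![vec![batch]])?;

    let ctx = SessionContext::new();
    ctx.register_table("t", Arc::new(table))?;

    // High-level query; Arrow stays under the hood until collect().
    let results = ctx
        .table("t")
        .await?
        .filter(col("x").gt(lit(1)))?
        .collect()
        .await?;
    let rows: usize = results.iter().map(|b| b.num_rows()).sum();
    assert_eq!(rows, 2); // values 2 and 3 pass the filter
    Ok(())
}
```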
- **interval**: Bin interval.
- **expression**: Time expression to operate on. Can be a constant, column, or function.
- **origin-timestamp**: Optional. Starting point used to determine bin boundaries. If not specified, defaults to 1970-01-01T00:00:00Z (the UNIX epoch in UTC). The following intervals are supported:
This file is autogenerated; if you want to change the doc, please change the userdoc for `pub struct DateBinFunc`.
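For reference, a minimal sketch of calling `date_bin` with the three arguments documented above, via DataFusion's SQL interface (assumes the `datafusion` crate and a tokio runtime; the literal timestamps are illustrative, and the origin argument may be omitted):

```rust
// Minimal sketch: date_bin(interval, expression, origin) via SQL.
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    let df = ctx
        .sql(
            "SELECT date_bin(
                 INTERVAL '15 minutes',
                 TIMESTAMP '2024-01-01T00:07:30',
                 TIMESTAMP '1970-01-01T00:00:00'
             ) AS bucket",
        )
        .await?;
    df.show().await?; // bucket = 2024-01-01T00:00:00
    Ok(())
}
```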
Thanks @sm4rtm4art. I would delegate the `RecordBatch` description, pros/cons, and examples to Arrow itself; otherwise it would be quite complicated to keep the documentation in sync. WDYT?
Which issue does this PR close?
Since this is my first contribution, I suppose I should mention @alamb, the author of issue #11336.
Could you please trigger the CI? Thanks!
Rationale for this change
The Arrow introduction guide (#11336) needed improvements to make it more accessible for newcomers while providing better navigation to advanced topics.
What changes are included in this PR?
Issue #11336 requested a gentle introduction to Apache Arrow and RecordBatches to help DataFusion users understand the foundational concepts. This PR enhances the existing Arrow introduction guide with clearer explanations, practical examples, visual aids, and comprehensive navigation links to make it more accessible for newcomers while providing pathways to advanced topics.
I was unsure whether this fits into `docs/source/user-guide/dataframe.md`.
Are these changes tested?
Applied prettier, as described.
Are there any user-facing changes?
Yes - improved documentation for the Arrow introduction guide at `docs/source/user-guide/arrow-introduction.md`.