
Conversation

sm4rtm4art

Which issue does this PR close?

Closes #11336. Since this is my first contribution, I should mention @alamb, the author of that issue.

Could you please trigger the CI? Thanks!

Rationale for this change

The Arrow introduction guide (#11336) needed improvements to make it more accessible for newcomers while providing better navigation to advanced topics.

What changes are included in this PR?

Issue #11336 requested a gentle introduction to Apache Arrow and RecordBatches to help DataFusion users understand the foundational concepts. This PR enhances the existing Arrow introduction guide with clearer explanations, practical examples, visual aids, and comprehensive navigation links to make it more accessible for newcomers while providing pathways to advanced topics.

I was unsure whether this content fits better in `docs/source/user-guide/dataframe.md`.

Are these changes tested?

Applied prettier, as described.

Are there any user-facing changes?

Yes, improved documentation for the Arrow introduction guide at `docs/source/user-guide/arrow-introduction.md`

Martin added 2 commits October 14, 2025 00:13
This adds a new user guide page addressing issue apache#11336 to provide
a gentle introduction to Apache Arrow and RecordBatches for DataFusion users.

The guide includes:
- Explanation of Arrow as a columnar specification
- Visual comparison of row vs columnar storage (with ASCII diagrams)
- Rationale for RecordBatch-based streaming (memory + vectorization)
- Practical examples: reading files, building batches, querying with MemTable (see the sketch after this commit message)
- Clear guidance on when Arrow knowledge is needed (extension points)
- Links back to DataFrame API and library user guide
- Link to DataFusion Invariants for contributors who want to go deeper

This helps users understand the foundation without getting overwhelmed,
addressing feedback from PR apache#11290 that DataFrame examples 'throw people
into the deep end of Arrow.'
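
For readers skimming this thread, here is a minimal sketch of the MemTable pattern the commit message mentions. This is an editor's illustration rather than code from the guide itself; the table name `t`, column name `v`, and values are invented:

```rust
use std::sync::Arc;

use datafusion::arrow::array::Int32Array;
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::datasource::MemTable;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Build a single-column RecordBatch by hand.
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
    )?;

    // Wrap the batch in an in-memory table and query it with SQL.
    let table = MemTable::try_new(schema, vec![vec![batch]])?;
    let ctx = SessionContext::new();
    ctx.register_table("t", Arc::new(table))?;
    ctx.sql("SELECT v FROM t").await?.show().await?;
    Ok(())
}
```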
…navigation

- Add explanation of Arc and ArrayRef for Rust newcomers
- Add visual diagram showing RecordBatch streaming through pipeline
- Make common pitfalls more concrete with specific examples
- Emphasize Arrow's unified type system as DataFusion's foundation
- Add comprehensive API documentation links throughout document
- Link to extension points guides (TableProvider, UDFs, custom operators)

These improvements make the Arrow introduction more accessible for
newcomers while providing clear navigation paths to advanced topics
for users extending DataFusion.

Addresses apache#11336
@github-actions bot added the `documentation` label (Improvements or additions to documentation) on Oct 14, 2025
Martin added 3 commits October 14, 2025 14:25
Run prettier to fix markdown link reference formatting (lowercase convention)
Apply prettier formatting to fix pre-existing formatting issues in:
- query-optimizer.md
- concepts-readings-events.md
- scalar_functions.md
@Jefffrey Jefffrey (Contributor) left a comment:

Thanks for picking this up; I left a few comments. My overall impression is that the guide as it stands feels a little disjointed in some of the information it presents, and it's confusing to me because I don't know what preexisting knowledge it assumes of readers. Maybe the article would benefit from a tighter focus, leaving more verbose details to external links (such as the Arrow docs).

Then again, I'm not coming from a fresh user perspective, so I'm biased in that regard 😅

Comment on lines +114 to +117
---

# REFERENCES


These links below aren't actually visible, so I don't think this header is necessary.

Comment on lines +69 to +72
- **Vectorized Execution**: Process entire columns at once using SIMD instructions
- **Better Compression**: Similar values stored together compress more efficiently
- **Cache Efficiency**: Scanning specific columns doesn't load unnecessary data
- **Zero-Copy Data Sharing**: Systems can share Arrow data without conversion overhead

I would be hesitant to mention compression here; as an in-memory format, Arrow data isn't typically compressed (as compared to something like Parquet).
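
To make the "vectorized execution" bullet in the excerpt concrete, here is a tiny editor's sketch against arrow-rs; the values are made up:

```rust
use arrow::array::Int32Array;
use arrow::compute;

fn main() {
    let column = Int32Array::from(vec![1, 2, 3, 4]);
    // One kernel call aggregates the whole column instead of looping
    // value by value; internally, arrow-rs can vectorize this work.
    assert_eq!(compute::sum(&column), Some(10));
}
```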

Comment on lines +96 to +100
**Key Properties**:

- Arrays are immutable (create new batches to modify data)
- NULL values tracked via efficient validity bitmaps
- Variable-length data (strings, lists) use offset arrays for efficient access

I feel the last two properties are a bit mismatched here; they are properties of arrays rather than RecordBatches. More importantly, in a guide that is meant to be a gentle introduction, they seem to be placed here randomly. If someone were to read "Variable-length data (strings, lists) use offset arrays for efficient access", there isn't much to glean from that information (that is relevant to the overall theme of the guide) 🤔
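
For anyone unfamiliar with the two properties being discussed, here is what they mean in practice. This is an editor's sketch with invented values, not text from the guide:

```rust
use arrow::array::{Array, Int32Array, StringArray};

fn main() {
    // NULLs are tracked in a validity bitmap rather than stored as
    // sentinel values inside the data buffer.
    let nums = Int32Array::from(vec![Some(1), None, Some(3)]);
    assert!(nums.is_null(1));

    // Variable-length values share one contiguous byte buffer; an
    // offsets buffer records where each value starts and ends.
    let names = StringArray::from(vec!["ab", "", "cde"]);
    assert_eq!(names.value_offsets(), &[0, 2, 2, 5]);
}
```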

Comment on lines +86 to +90
### Why Not Process Single Rows?

- **Lost Vectorization**: Can't use SIMD instructions on single values
- **Poor Cache Utilization**: Jumping between rows defeats CPU cache optimization
- **High Overhead**: Managing individual rows has significant bookkeeping costs

This section feels a bit misplaced: some of these downsides were already mentioned right above under "Why this matters", so it feels a little inconsistent to have the points stated again right below.


## What is a RecordBatch? (And Why Batch?)

A **[`RecordBatch`]** represents a horizontal slice of a table—a collection of equal-length columnar arrays sharing the same schema.

Suggested change
A **[`RecordBatch`]** represents a horizontal slice of a table—a collection of equal-length columnar arrays sharing the same schema.
A **[`RecordBatch`]** represents a horizontal slice of a table—a collection of equal-length columnar arrays that form a common schema.

I'm not sure about this wording either, but it feels slightly wrong to say the schema is shared by the arrays 🤔


Sometimes you need to create Arrow data programmatically rather than reading from files. This example shows the core building blocks: creating typed arrays (like [`Int32Array`] for numbers), defining a [`Schema`] that describes your columns, and assembling them into a [`RecordBatch`].

You'll notice [`Arc`] ([Atomically Reference Counted](https://doc.rust-lang.org/std/sync/struct.Arc.html)) is used frequently—this is how Arrow enables efficient, zero-copy data sharing. Instead of copying data, different parts of the query engine can safely share read-only references to the same underlying memory. [`ArrayRef`] is simply a type alias for `Arc<dyn Array>`, representing a reference to any Arrow array type.

This wording implies Arc is the key to Arrow, though it can be misleading considering that's more of an implementation detail on the Rust side 🤔
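
As a companion to the excerpt under discussion, here is a minimal editor's sketch of the building blocks it names; the column names and values are invented:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), ArrowError> {
    // The Schema describes the columns; Arc lets different parts of the
    // engine share it without copying.
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int32, false),
        Field::new("name", DataType::Utf8, false),
    ]));

    // ArrayRef is a type alias for Arc<dyn Array>: a shared, read-only
    // handle to any array type.
    let ids: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
    let names: ArrayRef = Arc::new(StringArray::from(vec!["a", "b", "c"]));

    // A RecordBatch ties equal-length columns to that schema.
    let batch = RecordBatch::try_new(schema, vec![ids, names])?;
    assert_eq!(batch.num_rows(), 3);
    Ok(())
}
```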

2. [Library User Guide: DataFrame API](../library-user-guide/using-the-dataframe-api.md) - Detailed examples and patterns
3. [Custom Table Providers](../library-user-guide/custom-table-providers.md) - When you need Arrow knowledge

## Further reading

I feel we can trim some of these references; for example including IPC is probably unnecessary for the goal of this guide.

Comment on lines +239 to +249
## Next Steps: Working with DataFrames

Now that you understand Arrow's RecordBatch format, you're ready to work with DataFusion's high-level APIs. The [DataFrame API](dataframe.md) provides a familiar, ergonomic interface for building queries without needing to think about Arrow internals most of the time.

The DataFrame API handles all the Arrow details under the hood - reading files into RecordBatches, applying transformations, and producing results. You only need to drop down to the Arrow level when implementing custom data sources, UDFs, or other extension points.

**Recommended reading order:**

1. [DataFrame API](dataframe.md) - High-level query building interface
2. [Library User Guide: DataFrame API](../library-user-guide/using-the-dataframe-api.md) - Detailed examples and patterns
3. [Custom Table Providers](../library-user-guide/custom-table-providers.md) - When you need Arrow knowledge

It feels weird to have a Next steps about working with the DataFrame API, given this guide itself is meant to be an introduction to Arrow for DataFusion users who may not need to use Arrow directly.
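
For contrast with the Arrow-level examples, here is a minimal editor's sketch of staying entirely at the DataFrame level; the file name and column names are hypothetical:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // The engine reads the file into RecordBatches behind the scenes;
    // the caller only ever sees DataFrame operations.
    let df = ctx
        .read_csv("example.csv", CsvReadOptions::new())
        .await?
        .filter(col("a").gt(lit(10)))?
        .select(vec![col("a"), col("b")])?;
    df.show().await?;
    Ok(())
}
```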

- **interval**: Bin interval.
- **expression**: Time expression to operate on. Can be a constant, column, or function.
- **origin-timestamp**: Optional. Starting point used to determine bin boundaries. If not specified, defaults to 1970-01-01T00:00:00Z (the UNIX epoch in UTC). The following intervals are supported:


this file is autogenerated; if you want to change the doc, please change the user doc for `pub struct DateBinFunc`
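
For context, here are the quoted parameters in action. This is an editor's sketch assuming a hypothetical CSV file `events.csv` with a timestamp column `ts`:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    ctx.register_csv("t", "events.csv", CsvReadOptions::new()).await?;
    // interval = 15 minutes, expression = ts, origin-timestamp = UNIX epoch.
    ctx.sql(
        "SELECT date_bin(INTERVAL '15 minutes', ts, TIMESTAMP '1970-01-01T00:00:00Z') AS bin \
         FROM t",
    )
    .await?
    .show()
    .await?;
    Ok(())
}
```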

@comphead comphead (Contributor) left a comment:

Thanks @sm4rtm4art. I would delegate the RecordBatch description, pros/cons, and examples to Arrow itself; otherwise it would be quite complicated to keep the documentation in sync. WDYT?
