
add sample/issue922_benchmark.rb to compare table adapter approaches vs parquet #1121

Merged
suketa merged 2 commits into main from sample/issue922-benchmark on Feb 28, 2026
Conversation

suketa (Owner) commented Feb 28, 2026

Summary

Adds sample/issue922_benchmark.rb comparing three table function approaches against the parquet approach for exposing a Polars::DataFrame to DuckDB.

Approaches compared

| Approach | Description |
| --- | --- |
| 1 row/call | Baseline: one row per execute callback |
| batch/call | 2048 rows per execute callback; reduces Ruby↔C crossings |
| batch + pre-extract | Batch plus pre-extracted column arrays via `to_a`, eliminating repeated Polars FFI calls |
| parquet | Write to a temp file, read back with `read_parquet()` |

Typical results

1 row/call:            ~0.29s
batch/call:            ~0.23s  (~20% faster)
batch + pre-extract:   ~0.07s  (~76% faster than baseline)
parquet:               ~0.05s

Pre-extracting column arrays is the biggest Ruby-side win, bringing the table adapter to within ~1.5x of parquet.
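The batching and pre-extraction ideas can be sketched in plain Ruby. This is a minimal illustration of the pattern only, not the adapter API from the PR: the 2048 batch size matches the table above, while `pre_extract`, `each_batch`, and the sample column data are hypothetical names invented for this sketch.

```ruby
BATCH_SIZE = 2048

# Pre-extract: convert each column to a plain Ruby array once, up front.
# With Polars this would be one series.to_a call per column, so the per-cell
# FFI crossings disappear from the hot loop.
def pre_extract(columns_hash)
  columns_hash.values # array of plain Ruby arrays, one per column
end

# Yield rows in batches of BATCH_SIZE, as the batch adapters do per execute
# callback, instead of one row per call.
def each_batch(columns)
  row_count = columns.first.length
  (0...row_count).step(BATCH_SIZE) do |start|
    stop = [start + BATCH_SIZE, row_count].min
    batch = (start...stop).map { |i| columns.map { |col| col[i] } }
    yield batch
  end
end

# Hypothetical data standing in for a DataFrame with 5000 rows.
columns = pre_extract('id' => (1..5000).to_a,
                      'value' => Array.new(5000) { |i| i * 2 })
sizes = []
first_batch = nil
each_batch(columns) do |batch|
  first_batch ||= batch
  sizes << batch.length
end
# sizes is [2048, 2048, 904]: full batches plus a short tail.
```

The win measured in the PR comes from moving the Ruby↔Polars boundary out of the per-row path: the only FFI work left is the one `to_a` per column.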

Summary by CodeRabbit

  • New Features
    • Added benchmark utilities for evaluating Polars DataFrame integration with DuckDB across multiple execution approaches.
    • Introduced Parquet-based fallback pathway for comprehensive performance comparison testing.

suketa and others added 2 commits February 28, 2026 16:46
…approach

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…vs parquet

Three table function approaches are compared:
- 1 row per execute call (baseline)
- batch (2048 rows per call) to reduce Ruby<->C crossings
- batch + pre-extracted column arrays to avoid repeated Polars FFI calls

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
coderabbitai bot commented Feb 28, 2026

📝 Walkthrough

A new benchmark file is introduced that implements three Polars DataFrame table-function adapters for DuckDB with different streaming strategies (single-row, batch-per-call, and optimized batch), along with a Parquet-based fallback approach and benchmark harness to compare performance.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Polars DataFrame Table Adapters Benchmark: `sample/issue922_benchmark.rb` | Adds three table-adapter classes (PolarsDataFrameTableAdapter, PolarsDataFrameBatchTableAdapter, PolarsDataFrameOptimizedTableAdapter) implementing different streaming patterns to expose Polars DataFrames as DuckDB table functions; includes a query_via_parquet helper for the Parquet-based fallback; benchmark setup creates test data, registers adapters, measures performance, and cleans up temporary files. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 Three rabbits hop through Polars fields so green,
They bundle rows in batches—swift, serene!
From DataFrame to DuckDB they race,
One row, then batches, optimized pace—
A benchmark feast, no Parquet delay! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 14.29%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title accurately describes the main change: adding a benchmark file that compares table adapter approaches against parquet. |


coderabbitai bot left a comment
Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@sample/issue922_benchmark.rb`:
- Around line 120-123: The query_via_parquet method currently interpolates name
and parquet_path directly into SQL (con.query("CREATE OR REPLACE TABLE #{name}
..." and read_parquet('#{parquet_path}')), which is unsafe and will break on
quotes; change it to use your DB driver's safe APIs: bind the parquet_path as a
parameter or use a quoted-literal helper for the read_parquet argument, and
quote or escape the table identifier using the connection's identifier-quoting
helper (or a prepared/parameterized CREATE/INSERT pattern) instead of string
interpolation; update the two con.query calls in query_via_parquet to construct
SQL with the connection's quote_identifier/quote_literal or parameter binding to
ensure safe identifiers and literals.
- Around line 134-163: Wrap the benchmark resource lifecycle in an ensure block
so the connection and database are always closed and the parquet file always
removed even on errors: acquire db and con via DuckDB::Database.open and
con.connect, run the benchmark calls (con.expose_as_table, queries, and
query_via_parquet) in the begin section, and move con.close, db.close and
parquet cleanup into ensure. Replace the fixed parquet_path with a unique
temporary file (use Tempfile or Dir::Tmpname to generate a temp pathname) and
ensure that tempfile is closed/unlinked in the ensure block after
query_via_parquet completes.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9209357 and f2e27f1.

📒 Files selected for processing (1)
  • sample/issue922_benchmark.rb

Comment on lines +120 to +123
def query_via_parquet(con, data_frame, name, parquet_path)
data_frame.write_parquet(parquet_path)
con.query("CREATE OR REPLACE TABLE #{name} AS SELECT * FROM read_parquet('#{parquet_path}')")
con.query("SELECT * FROM #{name}").to_a

⚠️ Potential issue | 🟠 Major

Harden SQL construction for identifier/literal safety.

name and parquet_path are interpolated directly into SQL. This can break on quotes and is unsafe if reused with non-constant input.

🔧 Proposed fix
 def query_via_parquet(con, data_frame, name, parquet_path)
+  quoted_name = %("#{name.to_s.gsub('"', '""')}")
+  quoted_path = "'#{parquet_path.to_s.gsub("'", "''")}'"
   data_frame.write_parquet(parquet_path)
-  con.query("CREATE OR REPLACE TABLE #{name} AS SELECT * FROM read_parquet('#{parquet_path}')")
-  con.query("SELECT * FROM #{name}").to_a
+  con.query("CREATE OR REPLACE TABLE #{quoted_name} AS SELECT * FROM read_parquet(#{quoted_path})")
+  con.query("SELECT * FROM #{quoted_name}").to_a
 end

Comment on lines +134 to +163
db = DuckDB::Database.open
con = db.connect
con.query('SET threads=1')

DuckDB::TableFunction.add_table_adapter(Polars::DataFrame, PolarsDataFrameTableAdapter.new)
start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
con.expose_as_table(df, 'polars_tf')
con.query('SELECT * FROM polars_tf()').to_a
end_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)

DuckDB::TableFunction.add_table_adapter(Polars::DataFrame, PolarsDataFrameBatchTableAdapter.new)
start_time3 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
con.expose_as_table(df, 'polars_tf_batch')
con.query('SELECT * FROM polars_tf_batch()').to_a
end_time3 = Process.clock_gettime(Process::CLOCK_MONOTONIC)

DuckDB::TableFunction.add_table_adapter(Polars::DataFrame, PolarsDataFrameOptimizedTableAdapter.new)
start_time4 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
con.expose_as_table(df, 'polars_tf_opt')
con.query('SELECT * FROM polars_tf_opt()').to_a
end_time4 = Process.clock_gettime(Process::CLOCK_MONOTONIC)

parquet_path = File.join(Dir.tmpdir, 'issue922_benchmark.parquet')
start_time2 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
query_via_parquet(con, df, 'polars_pq', parquet_path)
end_time2 = Process.clock_gettime(Process::CLOCK_MONOTONIC)

con.close
db.close
File.delete(parquet_path)

⚠️ Potential issue | 🟠 Major

Wrap benchmark lifecycle in ensure and avoid fixed tmp filename.

Connection/database close and parquet cleanup should run even if a query fails. Also, a fixed tmp filename can collide across concurrent runs.

🔧 Proposed fix
-parquet_path = File.join(Dir.tmpdir, 'issue922_benchmark.parquet')
-start_time2 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
-query_via_parquet(con, df, 'polars_pq', parquet_path)
-end_time2 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
-
-con.close
-db.close
-File.delete(parquet_path)
+parquet_path = File.join(Dir.tmpdir, "issue922_benchmark_#{Process.pid}_#{Time.now.to_i}.parquet")
+begin
+  start_time2 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
+  query_via_parquet(con, df, 'polars_pq', parquet_path)
+  end_time2 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
+ensure
+  con&.close
+  db&.close
+  File.delete(parquet_path) if File.exist?(parquet_path)
+end

suketa merged commit 6a9c9e6 into main on Feb 28, 2026
41 checks passed
suketa deleted the sample/issue922-benchmark branch February 28, 2026 10:02