add sample/issue922_benchmark.rb to compare table adapter approaches vs parquet#1121
Commits:
- …approach
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- …vs parquet
  Three table function approaches are compared:
  - 1 row per execute call (baseline)
  - batch (2048 rows per call) to reduce Ruby<->C crossings
  - batch + pre-extracted column arrays to avoid repeated Polars FFI calls

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
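The batching idea described above (2048 rows per call instead of one) can be sketched independently of the adapter API. This is an illustrative helper only — `BATCH_SIZE` and `each_batch` are not part of ruby-duckdb — showing how slicing the source into fixed-size chunks reduces the number of Ruby<->C crossings to one per chunk:

```ruby
# Illustrative batching helper: yields rows in fixed-size chunks so a
# consumer crosses the Ruby<->C boundary once per chunk, not once per row.
BATCH_SIZE = 2048

def each_batch(rows, batch_size = BATCH_SIZE)
  return enum_for(:each_batch, rows, batch_size) unless block_given?

  rows.each_slice(batch_size) { |chunk| yield chunk }
end

# A 5000-row source becomes three crossings: 2048 + 2048 + 904 rows.
sizes = each_batch((1..5000).to_a).map(&:size)
```

With a per-row adapter the same 5000-row source would cost 5000 crossings; the chunked version costs three.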
📝 Walkthrough

A new benchmark file implements three Polars DataFrame table-function adapters for DuckDB with different streaming strategies (single-row, batch-per-call, and optimized batch), along with a Parquet-based fallback approach and a benchmark harness to compare their performance.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@sample/issue922_benchmark.rb`:
- Around line 120-123: The query_via_parquet method currently interpolates name
and parquet_path directly into SQL (con.query("CREATE OR REPLACE TABLE #{name}
..." and read_parquet('#{parquet_path}')), which is unsafe and will break on
quotes; change it to use your DB driver's safe APIs: bind the parquet_path as a
parameter or use a quoted-literal helper for the read_parquet argument, and
quote or escape the table identifier using the connection's identifier-quoting
helper (or a prepared/parameterized CREATE/INSERT pattern) instead of string
interpolation; update the two con.query calls in query_via_parquet to construct
SQL with the connection's quote_identifier/quote_literal or parameter binding to
ensure safe identifiers and literals.
- Around line 134-163: Wrap the benchmark resource lifecycle in an ensure block
so the connection and database are always closed and the parquet file always
removed even on errors: acquire db and con via DuckDB::Database.open and
con.connect, run the benchmark calls (con.expose_as_table, queries, and
query_via_parquet) in the begin section, and move con.close, db.close and
parquet cleanup into ensure. Replace the fixed parquet_path with a unique
temporary file (use Tempfile or Dir::Tmpname to generate a temp pathname) and
ensure that tempfile is closed/unlinked in the ensure block after
query_via_parquet completes.
```ruby
def query_via_parquet(con, data_frame, name, parquet_path)
  data_frame.write_parquet(parquet_path)
  con.query("CREATE OR REPLACE TABLE #{name} AS SELECT * FROM read_parquet('#{parquet_path}')")
  con.query("SELECT * FROM #{name}").to_a
end
```
Harden SQL construction for identifier/literal safety.

`name` and `parquet_path` are interpolated directly into SQL. This can break on embedded quotes and is unsafe if the method is reused with non-constant input.
🔧 Proposed fix

```diff
 def query_via_parquet(con, data_frame, name, parquet_path)
+  quoted_name = %("#{name.to_s.gsub('"', '""')}")
+  quoted_path = "'#{parquet_path.to_s.gsub("'", "''")}'"
   data_frame.write_parquet(parquet_path)
-  con.query("CREATE OR REPLACE TABLE #{name} AS SELECT * FROM read_parquet('#{parquet_path}')")
-  con.query("SELECT * FROM #{name}").to_a
+  con.query("CREATE OR REPLACE TABLE #{quoted_name} AS SELECT * FROM read_parquet(#{quoted_path})")
+  con.query("SELECT * FROM #{quoted_name}").to_a
 end
```
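The quote-doubling in the proposed fix above can be exercised in isolation. These helper methods are illustrative (not a ruby-duckdb API); they mirror the fix's `gsub`-based escaping — doubling the quote character and wrapping the result:

```ruby
# SQL identifier: double any embedded double quotes, wrap in double quotes.
def quote_identifier(name)
  %("#{name.to_s.gsub('"', '""')}")
end

# SQL string literal: double any embedded single quotes, wrap in single quotes.
def quote_literal(value)
  "'#{value.to_s.gsub("'", "''")}'"
end
```

A table name like `odd"name` or a path containing an apostrophe then round-trips through the SQL text instead of breaking it.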
```ruby
db = DuckDB::Database.open
con = db.connect
con.query('SET threads=1')

DuckDB::TableFunction.add_table_adapter(Polars::DataFrame, PolarsDataFrameTableAdapter.new)
start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
con.expose_as_table(df, 'polars_tf')
con.query('SELECT * FROM polars_tf()').to_a
end_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)

DuckDB::TableFunction.add_table_adapter(Polars::DataFrame, PolarsDataFrameBatchTableAdapter.new)
start_time3 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
con.expose_as_table(df, 'polars_tf_batch')
con.query('SELECT * FROM polars_tf_batch()').to_a
end_time3 = Process.clock_gettime(Process::CLOCK_MONOTONIC)

DuckDB::TableFunction.add_table_adapter(Polars::DataFrame, PolarsDataFrameOptimizedTableAdapter.new)
start_time4 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
con.expose_as_table(df, 'polars_tf_opt')
con.query('SELECT * FROM polars_tf_opt()').to_a
end_time4 = Process.clock_gettime(Process::CLOCK_MONOTONIC)

parquet_path = File.join(Dir.tmpdir, 'issue922_benchmark.parquet')
start_time2 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
query_via_parquet(con, df, 'polars_pq', parquet_path)
end_time2 = Process.clock_gettime(Process::CLOCK_MONOTONIC)

con.close
db.close
File.delete(parquet_path)
```
Wrap the benchmark lifecycle in an `ensure` block and avoid a fixed tmp filename.

Connection/database close and parquet cleanup should run even if a query fails. A fixed tmp filename can also collide across concurrent runs.
🔧 Proposed fix

```diff
-parquet_path = File.join(Dir.tmpdir, 'issue922_benchmark.parquet')
-start_time2 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
-query_via_parquet(con, df, 'polars_pq', parquet_path)
-end_time2 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
-
-con.close
-db.close
-File.delete(parquet_path)
+parquet_path = File.join(Dir.tmpdir, "issue922_benchmark_#{Process.pid}_#{Time.now.to_i}.parquet")
+begin
+  start_time2 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
+  query_via_parquet(con, df, 'polars_pq', parquet_path)
+  end_time2 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
+ensure
+  con&.close
+  db&.close
+  File.delete(parquet_path) if File.exist?(parquet_path)
+end
```
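The unique-temp-file half of the suggestion can be sketched with the standard library's `Tempfile`, which generates a collision-free name and pairs naturally with `ensure`-based cleanup. The `with_temp_parquet` wrapper below is illustrative, not part of the benchmark:

```ruby
require 'tempfile'

# Yield a unique temp path, removing the file even if the block raises.
def with_temp_parquet(prefix = 'issue922_benchmark')
  file = Tempfile.new([prefix, '.parquet'])
  path = file.path
  file.close # release the handle; a parquet writer reopens by path
  yield path
ensure
  file&.unlink # unlink deletes the file from disk
end
```

Unlike a fixed `File.join(Dir.tmpdir, ...)` name, concurrent benchmark runs each get their own file, and a failing query no longer leaks it.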
Summary
Adds `sample/issue922_benchmark.rb` comparing three table function approaches against the parquet approach for exposing a `Polars::DataFrame` to DuckDB.

Approaches compared
- 1 row per execute call (baseline)
- batch (2048 rows per call) to reduce Ruby<->C crossings
- batch + pre-extracted column arrays via `to_a` to eliminate repeated Polars FFI calls
- parquet round-trip via `read_parquet()`

Typical results
Pre-extracting column arrays is the biggest Ruby-side win, bringing the table adapter to within ~1.5x of parquet.
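The benchmark's repeated `Process.clock_gettime` pairs can be factored into a single helper; a minimal sketch (the `measure` name is illustrative, not part of the sample):

```ruby
# Time a block with the monotonic clock, which is immune to wall-clock
# adjustments; returns the block's result and the elapsed seconds.
def measure
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - start]
end

rows, seconds = measure { Array.new(10_000) { |i| [i, i.to_s] } }
```

Each `start_timeN`/`end_timeN` pair in the benchmark then collapses to one `measure { ... }` call.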