Skip to content

fix(pegboard): skip protocol version keys in runner pool backfill#4620

Open
NathanFlurry wants to merge 1 commit into04-12-fix_pegboard_create_missing_normal_runner_configsfrom
04-12-fix_pegboard_skip_protocol_version_keys_in_runner_pool_backfill
Open

fix(pegboard): skip protocol version keys in runner pool backfill#4620
NathanFlurry wants to merge 1 commit into04-12-fix_pegboard_create_missing_normal_runner_configsfrom
04-12-fix_pegboard_skip_protocol_version_keys_in_runner_pool_backfill

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Copy link
Copy Markdown
Member Author

NathanFlurry commented Apr 13, 2026

@claude
Copy link
Copy Markdown

claude bot commented Apr 13, 2026

PR Review: fix(pegboard): skip protocol version keys in runner pool backfill

Summary

This PR fixes a bug in runner_pool_backfill.rs where the range scan over the runner_config_subspace was inadvertently pulling in ProtocolVersionKey entries alongside the intended DataKey entries, causing the subsequent read_entry::<DataKey> deserialization to fail on those entries.

The root cause is a key overlap: ProtocolVersionKey is a child key of DataKey sharing the same prefix:

  • DataKey: (RUNNER, CONFIG, DATA, namespace_id, name)
  • ProtocolVersionKey: (RUNNER, CONFIG, DATA, namespace_id, name, PROTOCOL_VERSION)

Since the range scan covers the entire DataKey::entire_subspace() prefix, it returns both types. The fix correctly detects and skips ProtocolVersionKey entries.


Correctness

The fix is correct. The unpack of (Id, String, usize) within the subspace context will only match 3-element suffixes (i.e., ProtocolVersionKey), while plain DataKey entries only have 2 elements and will fail the unpack, falling through to read_entry as expected. The if let Ok pattern cleanly handles both cases.

Updating new_last_key before the skip check is intentional and correct for pagination: skipped protocol version keys still advance the cursor, preventing infinite loops.


Issues

Pre-existing stale log message (not introduced here, but worth fixing):

Line 92 says tracing::warn!("timed out processing pending actors metrics") which appears to be copy-pasted from another workflow. The function is running a backfill, not processing actor metrics. Consider updating to "timed out during runner pool backfill chunk".

No tests added:

The checklist item for tests is unchecked. While backfill migrations are harder to test, a snapshot-based integration test (using test-snapshot-gen per the engine CLAUDE.md) covering a UDB store that has ProtocolVersionKey entries alongside DataKey entries would help prevent regressions.

Fragility of the tuple-shape filter:

The skip logic detects protocol version keys by checking for a 3-element tuple suffix matching (Id, String, usize). If future key types are added under the same (RUNNER, CONFIG, DATA) subspace with a 3-element suffix but a different leaf value, they would silently fall through and attempt to deserialize as DataKey.

An allow-list approach (only process keys that match the DataKey shape) may be safer, depending on whether tuple unpacking is strict about trailing bytes:

// Skip anything that is not a plain DataKey (2-element suffix).
if runner_config_subspace.unpack::<(Id, String)>(entry.key()).is_err() {
    continue;
}
entries.push(tx.read_entry::<keys::runner_config::DataKey>(&entry)?);

If the unpack is strict (no trailing bytes allowed), this makes the intent clearer and is more forward-compatible.


Verdict

The fix is targeted and correct. Main follow-ups: replace the stale log message, and consider flipping the filter to an allow-list for more forward-compatible safety.

Generated with Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant