Skip to content

fix: move runner configs to epoxy#4605

Merged
MasterPtato merged 1 commit intomainfrom
04-10-fix_move_runner_configs_to_epoxy
Apr 13, 2026
Merged

fix: move runner configs to epoxy#4605
MasterPtato merged 1 commit intomainfrom
04-10-fix_move_runner_configs_to_epoxy

Conversation

@MasterPtato
Copy link
Copy Markdown
Contributor

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Copy link
Copy Markdown
Contributor Author

MasterPtato commented Apr 11, 2026

@claude
Copy link
Copy Markdown

claude bot commented Apr 11, 2026

PR Review: fix: move runner configs to epoxy

Summary

This PR refactors how runner pool configurations are stored and discovered across datacenters. Instead of querying remote DCs over HTTP, configs are now written to epoxy (distributed KV) under a new GlobalDataKey encoding the originating DC label. A backfill workflow (runner_pool_backfill) migrates existing UDB-local configs. The PR also adds save_empty caching to the optimistic epoxy KV read path to avoid repeated fanout on negative cache misses.


Bug: GlobalDataKey::unpack Missing DATA Segment

File: engine/packages/pegboard/src/keys/runner_config.rs

TuplePack serializes 7 elements: (RUNNER, CONFIG, GLOBAL, DATA, dc_label, namespace_id, name), but TupleUnpack only destructures 6 elements: (usize, usize, usize, u16, Id, String). The DATA constant (value 4) is skipped.

This means:

  • dc_label always decodes as 4 (the numeric value of DATA), not the actual dc_label
  • namespace_id receives bytes from the actual dc_label field, likely a parse error or silently wrong UUID
  • name receives bytes from namespace_id

Fix: add a usize placeholder for the DATA segment:

let (input, (_, _, _, _, dc_label, namespace_id, name)) =
    <(usize, usize, usize, usize, u16, Id, String)>::unpack(input, tuple_depth)?;

This would cause list_runner_config_enabled_dcs to malfunction and the backfill to produce entries that cannot be unpacked correctly.


Bug: ensure_normal_if_missing Now Overwrites Existing Configs

File: engine/packages/pegboard/src/ops/runner_config/ensure_normal_if_missing.rs

The function doc comment says "Creates a default normal runner config if one does not already exist", but the implementation now calls upsert unconditionally. This will overwrite a serverless config with a Normal {} config every time a runner2 workflow calls EnsureRunnerConfig. The original guard that skipped the write when a config already existed has been removed.

Combined with the existing // TODO: Race comment in upsert.rs, this race is now exercised on every new runner connection.


Issue: mutable Field Placeholder in Empty-Value Cache Write

File: engine/packages/epoxy/src/ops/kv/get_optimistic.rs

The // TODO: What should this be set to? on the mutable: true field in the empty-value CommittedValue should be resolved before merging. Even if semantically irrelevant today (empty-value returns None immediately in get_local.rs), leaving it unresolved is a maintenance hazard if sentinel behavior changes.


Issue: Race Guard Removed from cache_fanout_value

File: engine/packages/epoxy/src/ops/kv/get_optimistic.rs

The old implementation guarded against the race where a commit lands between the fanout read and the cache write by checking if a committed value with an equal or newer version already existed before writing. This guard was removed. If a committed value lands locally in that window, the cache will briefly serve a stale value. For mutable runner configs, this could silently serve an outdated config until TTL expiry. If this regression is intentional, it should be documented.


Issue: Missing epoxy Deletion in runner_config::delete

File: engine/packages/pegboard/src/ops/runner_config/delete.rs

Deleting a runner config removes it from local UDB but does not remove the corresponding GlobalDataKey entry from epoxy. list_runner_config_enabled_dcs will continue returning DC labels for deleted configs indefinitely. If tombstoning is handled separately, this should be documented.


Stale Log Message in Backfill Workflow

File: engine/packages/pegboard/src/workflows/runner_pool_backfill.rs

tracing::warn!("timed out processing pending actors metrics") appears to be copy-pasted from another backfill workflow. Should be something like "timed out reading runner pool configs".


Minor Issues

  1. useless_format in upsert.rs - format!("invalid header name: too long (max 128)") has no format args. Should be a plain .to_string(). Clippy will flag these.

  2. Duplicated protocol_version fetch in list.rs - Both arms of the if input.variant.is_some() check have identical fetch code. Could be deduplicated.

  3. Backfill chunk timeout discards partial progress - When a backfill chunk times out at 2500ms, entries collected before the timeout are discarded. Resuming re-reads and re-writes the same range (idempotent), but progress within a timed-out chunk is entirely lost. Acceptable, but worth documenting.

  4. DatacentersIntoIter placement - The new iterator type is declared between two IntoIterator impls rather than adjacent to DatacentersIter. Minor style inconsistency.


What Looks Good

  • Structured logging with ? formatters throughout. No println!/eprintln! usage.
  • Custom RivetError-derived errors follow the established pattern.
  • Timestamp and key naming conventions followed.
  • No Mutex<HashMap> usage; concurrent operations use buffer_unordered.
  • The save_empty caching addition is a reasonable optimization for the negative-cache case.
  • IntoIterator for DatacentersRepr follows the existing IntoIterator for &DatacentersRepr pattern well.
  • Backfill workflow idempotency via chunk boundary hashing is correctly structured.

The GlobalDataKey::unpack missing DATA segment and the ensure_normal_if_missing unconditional upsert are the two issues that should be fixed before this merges.

@MasterPtato MasterPtato force-pushed the 04-10-fix_move_runner_configs_to_epoxy branch from c15aa52 to 2670edc Compare April 11, 2026 22:07
@MasterPtato MasterPtato marked this pull request as ready for review April 11, 2026 22:07
@MasterPtato MasterPtato force-pushed the 04-10-fix_move_runner_configs_to_epoxy branch from 2670edc to 94f4263 Compare April 12, 2026 19:42
@MasterPtato MasterPtato force-pushed the 04-09-fix_epoxy_add_list_operation branch from e68c174 to 0ac1091 Compare April 12, 2026 19:42
@MasterPtato MasterPtato force-pushed the 04-09-fix_epoxy_add_list_operation branch from 0ac1091 to 84391bd Compare April 12, 2026 19:48
@MasterPtato MasterPtato force-pushed the 04-10-fix_move_runner_configs_to_epoxy branch 2 times, most recently from 4064862 to 39d337d Compare April 12, 2026 19:59
@MasterPtato MasterPtato force-pushed the 04-09-fix_epoxy_add_list_operation branch 2 times, most recently from 42b7a5c to 0b43aa9 Compare April 12, 2026 20:39
@MasterPtato MasterPtato force-pushed the 04-10-fix_move_runner_configs_to_epoxy branch 2 times, most recently from 9198ffc to bd6ef15 Compare April 12, 2026 21:03
@MasterPtato MasterPtato force-pushed the 04-10-fix_move_runner_configs_to_epoxy branch from bd6ef15 to e2a69f1 Compare April 12, 2026 21:21
@MasterPtato MasterPtato force-pushed the 04-09-fix_epoxy_add_list_operation branch from 0b43aa9 to 4cd09cd Compare April 12, 2026 21:22
@MasterPtato MasterPtato force-pushed the 04-09-fix_epoxy_add_list_operation branch from 4cd09cd to 1b7c05c Compare April 12, 2026 23:38
@MasterPtato MasterPtato force-pushed the 04-10-fix_move_runner_configs_to_epoxy branch from e2a69f1 to 40e5746 Compare April 12, 2026 23:38
@MasterPtato MasterPtato force-pushed the 04-10-fix_move_runner_configs_to_epoxy branch from 40e5746 to be5f01c Compare April 13, 2026 00:20
@MasterPtato MasterPtato force-pushed the 04-09-fix_epoxy_add_list_operation branch from 1b7c05c to 570ddec Compare April 13, 2026 00:20
Copy link
Copy Markdown
Contributor Author

MasterPtato commented Apr 13, 2026

Merge activity

  • Apr 13, 12:21 AM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Apr 13, 12:34 AM UTC: Graphite rebased this pull request as part of a merge.
  • Apr 13, 12:34 AM UTC: @MasterPtato merged this pull request with Graphite.

@MasterPtato MasterPtato changed the base branch from 04-09-fix_epoxy_add_list_operation to graphite-base/4605 April 13, 2026 00:31
@MasterPtato MasterPtato changed the base branch from graphite-base/4605 to main April 13, 2026 00:32
@MasterPtato MasterPtato force-pushed the 04-10-fix_move_runner_configs_to_epoxy branch from be5f01c to 2b6e5d9 Compare April 13, 2026 00:33
@MasterPtato MasterPtato merged commit a8761e9 into main Apr 13, 2026
5 of 7 checks passed
@MasterPtato MasterPtato deleted the 04-10-fix_move_runner_configs_to_epoxy branch April 13, 2026 00:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants