Skip to content

[P2] Onboarding Clarity: handler-layer UX + selfcheck, Dima's approach (supersedes #1261, #866)#1292

Closed
bonujel wants to merge 16 commits into
gonka-ai:upgrade-v0.2.14from
bonujel:onboarding_clarity_1261
Closed

[P2] Onboarding Clarity: handler-layer UX + selfcheck, Dima's approach (supersedes #1261, #866)#1292
bonujel wants to merge 16 commits into
gonka-ai:upgrade-v0.2.14from
bonujel:onboarding_clarity_1261

Conversation

@bonujel

@bonujel bonujel commented Jun 3, 2026

Copy link
Copy Markdown

Summary

Supersedes #866, completes #1261. Implements issue #326 (clearer onboarding logs, no false "errors", pre-PoC readiness testing) without touching the broker — onboarding UX is derived at the admin handler layer, and PoC wiring is verified in a separate api selfcheck mode against a mocked chain.

  • Onboarding statusGET /admin/v1/nodes returns an optional onboarding block: MLnode state WAITING_FOR_POC/TESTING/TEST_FAILED, participant INACTIVE_WAITING/ACTIVE_PARTICIPATING, timing, and human messages. Derived read-only (TimingCalculator/StatusReporter/OnboardingStateManager + a 30s-cached ActivityTracker, no per-request chain RPCs). broker.NodeState unchanged.
  • Pre-PoC testingPOST /admin/v1/nodes/:id/test (manual) + auto-test on register/config-change when >1h to PoC. Runs model load → health → inference request and records timings. Pass → WAITING_FOR_POC (the "safe to be offline" wording only after a pass); fail → TEST_FAILED, logged at the API node. TEST_FAILED comes only from the test, never from the broker's operational FAILED.
  • api selfcheck — drives a full synthetic PoC lifecycle (PoCGenerate → PoCValidate → Inference) and reports per-stage PASS/FAIL (exit 0/1).
  • Log cleanup — a "waiting for PoC (starts in 2h 15m)…" line at startup and every 5 min while onboarding; benign startup/inactive logs demoted to Info; genuine failures stay visible.

Why not #866

#866 added UX state to broker.NodeState + new broker commands; reviewers (@patimen, @DimaOrekhovPS) flagged this as complexity on a critical path. Same features, broker untouched:

#866 This PR
broker.NodeState new fields 6 0
broker new commands 2 0
per-request chain RPCs in GET /nodes N+1 0 (30s cache)
UX state computation broker writes back handler derives

Product context in proposals/onboarding-clarity-v1/README.md.

Test plan

  • go build / go vet / go test (broker, admin, selfcheck, participant, mlnodeclient) — green
  • api selfcheck — 6 stages PASS, exit 0
  • End-to-end on a local network (real API node + chain, WireMock as the MLnode): GET /nodes onboarding (additive shape), POST /test SUCCESS (load/health/resp) and the TEST_FAILED path, onboarding logs
  • Against a real MLnode (vLLM) / the public testnet — not yet run

Ryanchen911 and others added 12 commits May 27, 2026 17:32
Compute onboarding UX (participant state, MLnode state, timing,
human-friendly messages) in the admin handler rather than on broker
state. The broker is unchanged; admin GET /nodes returns an enriched
NodeWithOnboarding envelope that wraps the broker's NodeResponse with
an optional Onboarding block.

New components:
- participant.ActivityTracker: background goroutine caches whether the
  participant is in the current active set (refreshed every 30s); admin
  handlers read IsActive() without per-request chain RPCs.
- admin pure helpers (OnboardingStateManager, TimingCalculator,
  StatusReporter): no state, no broker references, no I/O. Outputs
  derive only from inputs.
- apiconfig timing constants: shared defaults (block time, alert lead,
  auto-test slack).

Also demote the "No epoch models available" log in node_worker_commands
from Error to Info: this is normal during onboarding when the
participant has not yet joined the active set, and Error here was the
loudest source of confusion in user reports.

See proposals/onboarding-clarity-v1/README.md for product context.
Replaces the broker-coupled approach from gonka-ai#866 with the separation
suggested by reviewers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ndpoint

Implements the approach suggested by reviewers on PR gonka-ai#866: run the real
broker against a mocked chain in a separate mode to verify the PoC
lifecycle is wired up, with production code paths untouched.

Two additions:

1) admin POST /admin/v1/nodes/:id/test
   Synchronous one-shot validation against a configured MLnode.
   For each model: InferenceUp + load-time measurement; then
   InferenceHealth probe; then Stop. Returns a TestResult with per-
   model load times and overall pass/fail. Per-node mutex prevents
   concurrent tests against the same node (returns 409). The broker
   is not involved — the result is reported back to the caller, not
   written to broker state.

2) `api selfcheck` subcommand
   Builds a real broker with a MockChainBridge (implements
   broker.BrokerChainBridge with hardcoded single-participant/single-
   model epoch data) and a MockClientFactory for MLnode I/O. The
   EventDriver advances the phase tracker and triggers the same
   broker commands OnNewBlockDispatcher would normally queue. The
   Evaluator inspects broker state via the public GetNodes API and
   produces a Report (PASS/FAIL per stage). Exits 0 on PASS, 1 on
   FAIL; the report goes to stderr.

   Initial stages:
   - node-registered
   - epoch-models-populated (chain bridge wiring works end-to-end)
   - hardware-diff-submitted (broker reports hardware on startup)

   Full PoC artifact protocol simulation is intentionally out of
   scope; the foundation is here for follow-up PRs to add more
   stages (StartPoC -> POC transition, validation, inference resume).

Tests:
- admin/mlnode_tester_test.go covers NodeNotFound, success,
  model-load failure, concurrent rejection.
- selfcheck/run_test.go runs the full selfcheck and asserts PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Treat an unknown PoC schedule as a sentinel (-1) rather than 0, so the
  UI never renders "PoC starting soon (in 0s)" before the phase tracker
  has synced; show "schedule syncing" instead.
- mlnode_tester: iterate models in sorted order so FailingModel and the
  per-model load sequence are deterministic (Go map order is random).
- status_reporter: derive the auto-test lead time from the constant
  instead of a hardcoded "1 hour".
- broker: log "no epoch models" at Warn instead of Info, so genuine
  misconfigurations stay visible while normal onboarding stays quiet.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… register

- computeOnboarding now reads MLNodeTester state: TESTING while a one-shot
  test is in flight, and TEST_FAILED (with the failing model) from the
  most recent failed result. Previously the tester result was never
  reflected in GET /nodes, so those states could never appear.
- Auto-test a node when it is registered or its config changes, gated on
  more than AutoTestMinSecondsBeforePoC until the next PoC. This is the
  proposal's "auto-test a new MLnode when there's >1h until PoC",
  implemented purely at the admin layer: fire-and-forget, result flows
  back via computeOnboarding, no write to broker NodeState (the
  broker-coupling that sank gonka-ai#866 is avoided).
- Add handler-level tests: POST /nodes/:id/test (success/failure/404),
  GET /nodes onboarding envelope + TEST_FAILED wiring, and the auto-test
  gate and trigger.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- run.go now steps the synthetic chain through PoCGenerate -> PoCValidate
  -> back to Inference, queueing the same broker commands the production
  OnNewBlockDispatcher would, and asserts the broker drives the node's
  intended status across all three phases (poc-generate, poc-validate,
  inference-resumed). Previously selfcheck only checked the inference-phase
  wiring and the PoC-phase trigger methods were dead code, so "the PoC
  lifecycle works" was never actually verified.
- Restore ENFORCED_MODEL_ID to its prior value on exit so selfcheck leaves
  no process-wide side effects when invoked programmatically or from tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Three log sites fired loudly while a node was simply starting up or had
not yet synced the chain — normal during onboarding, not errors:

- PopulateSingleNodeEpochData "phase tracker not initialized" and
  "epoch state not synced": demoted Warn -> Info (they only occur during
  startup / before sync; post-sync they don't fire, so no loss of signal).
- RegisterNode "failed to populate epoch data": now logged at Info when
  the chain isn't synced yet (benign onboarding), but kept at Warn once
  synced — a real failure there still surfaces.

Logging level/text only; no behavior change. The two "synced but
participant-inactive" sites (parent epoch group empty; node not in epoch)
are intentionally left at Warn: they are only reached after the sync
check, so gating on IsSynced can't distinguish benign-from-real there —
that needs a participant-active signal, deferred to avoid broker logic.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Issue gonka-ai#326 asks for clearer logging when a node is launched and waiting
for PoC; until now the onboarding guidance only existed in the
GET /nodes response, never in the logs.

Add StartOnboardingStatusLogger: a lightweight admin-layer goroutine
(started from main alongside the activity tracker) that, while the
participant is configured-but-inactive, logs the human-facing
"Waiting for next PoC cycle (starts in 2h 15m) - safe to be offline ..."
line at startup and every 5 minutes, switching to "PoC starting soon -
must be online now" inside the alert window, and announcing the
inactive->active transition once before going quiet. Also log the same
status right after a node is registered via the admin API.

Reads cached phase/activity state only — no chain RPCs, no broker writes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…fter a test passes

Per the proposal, "waiting for PoC" should only reassure the operator
once the MLnode's test has passed. BuildMLNodeMessage now takes a
`validated` flag: the calm "validated, safe to be offline until ..."
line (and the plain "schedule syncing" line) are shown only when the
node's most recent test succeeded. Until then it says the node is "not
yet validated - it will be tested before the next PoC", while still
telling the operator to come online when PoC is near.

- computeOnboarding derives `validated` from the per-node MLNodeTester
  last result (TestSuccess) and passes it through.
- The onboarding status logger gates on allNodesValidated() so the
  periodic heartbeat only reassures once every configured node passed.

Logging/messaging only; no state-machine or behavior change (kept the
3 MLnode states; the gate lives in the message wording).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-test failure reported)

Two proposal-fidelity fixes from re-reading proposals/onboarding-clarity-v1:

- "no model for ml node" path: log at Info with the proposal's wording
  ("Participant not yet active - model assignment pending (normal for new
  participants)") instead of Warn. Per the proposal this state is normal
  for an inactive participant; a genuine problem still surfaces via the
  node's FAILED status, so visibility is not lost. (Reverts the earlier
  Warn choice in favor of the proposal's explicit Info requirement.)

- Auto-test results are now reported in the API node logs (proposal:
  "detailed error reporting on TEST_FAILED"): a failed pre-PoC auto-test
  logs at Error with the failing model and error; a pass logs at Info.
  Previously maybeAutoTest discarded the result, so auto-test failures
  only surfaced via GET /nodes and never hit the logs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ST_FAILED

computeOnboarding derived TEST_FAILED from the broker's operational
status (FailureReason != "" && CurrentStatus == FAILED). But an inactive
participant's node is driven to INFERENCE by InferenceUpAllCommand and
"fails" with no-epoch-models — normal onboarding, not a test failure —
which set FailureReason + FAILED and made GET /nodes show TEST_FAILED for
a perfectly healthy waiting node, contradicting the proposal.

Per the proposal, TEST_FAILED means the MLnode validation test failed, so
derive it ONLY from the MLNodeTester result. Touches only the
proposal-added computeOnboarding; the pre-existing broker reconciliation
(InferenceUpAllCommand / getCommandForState, from gonka-ai#160) is unchanged.

Adds a regression test: a node with broker FAILED + no test result is not
shown as TEST_FAILED.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…quest)

Implements the proposal's Proactive Testing steps Ryanchen911#3/gonka-ai#4 (send a test
inference request and validate the response; record response time):

- mlnodeclient: new Client.Inference(ctx, model) posts a minimal
  OpenAI-compatible chat-completion to <inferenceUrl>/v1/chat/completions
  (the same path the public handler proxies to) and validates a usable
  response. Added to the MLNodeClient interface + MockClient.
- MLNodeTester now runs load -> health -> inference and records RespMs;
  a failed inference request marks the node TEST_FAILED.

Per team decision this uses the existing host:port URL construction
rather than waiting on gonka-ai#717 (token-auth); the gonka-ai#717 follow-up is documented
in dev_notes/onboarding_clarity_changes.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
dev_notes/onboarding_observability.md — maps each onboarding lifecycle stage
to the log lines this branch emits and the admin API results (GET /nodes
onboarding block, POST /test, api selfcheck).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@bonujel bonujel marked this pull request as draft June 3, 2026 01:47

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 78bf58ea4e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread decentralized-api/internal/server/admin/mlnode_tester.go
Comment thread decentralized-api/internal/server/admin/mlnode_tester.go Outdated
@bonujel

bonujel commented Jun 3, 2026

Copy link
Copy Markdown
Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6700debe9f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread decentralized-api/internal/server/admin/timing_calculator.go Outdated
Comment thread decentralized-api/internal/server/admin/mlnode_tester.go
…s, versioned test URLs

- maybeAutoTest now skips when the participant is active: an active node is
  serving inference, and the test would reload + Stop it outside the
  broker's knowledge, briefly taking it out of service (Codex P1).
- MLNodeTester builds versioned MLnode URLs (current node version inserted
  into the path, like broker.Node.*UrlWithVersion), so versioned /
  rolling-upgrade deployments hit the right routes instead of falsely
  reporting TEST_FAILED (Codex P2).
- computeOnboarding treats an active participant's node as validated, so an
  already-serving node is not shown as "not yet validated".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6700debe9f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread decentralized-api/internal/server/admin/node_handlers.go Outdated
@bonujel

bonujel commented Jun 3, 2026

Copy link
Copy Markdown
Author

@codex review

@chatgpt-codex-connector

Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

…n + per-node test guard

- ComputeTiming targets this epoch's PoC start while still before it
  (pre-PoC inference phase), only using the next epoch's PoC once past it.
  NextPoCStart() unconditionally overcounted by an epoch, which left
  ShouldBeOnline false and could let auto-test fire inside the real
  online-alert window (Codex P1).
- Replace the participant-wide IsActive() auto-test guard with a per-node
  check (the node is assigned/serving in the current epoch): idle/spare
  nodes are still tested even when the participant has other active nodes
  (Codex P2, was over-suppressing).
- Apply the same guard to the manual POST /nodes/:id/test path, returning
  409 for a serving node instead of reloading + stopping it out of service
  (Codex P2).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@bonujel bonujel marked this pull request as ready for review June 3, 2026 03:10

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b3268625c7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread decentralized-api/internal/server/admin/mlnode_tester.go Outdated
// until ctx is cancelled. Performs one immediate refresh before
// returning so IsActive() is meaningful for the first admin request.
func (t *ActivityTracker) Start(ctx context.Context) {
t.refresh(ctx)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep activity refresh off the startup path

When the chain RPCs used by CurrentEpochGroupData/EpochGroupData are slow or not responding during startup, this synchronous refresh uses the process context with no deadline; main.go calls activityTracker.Start(ctx) before the listener and HTTP servers are started, so an onboarding cache refresh can block the entire API from coming up. Since periodic refreshes already run in the goroutine, consider moving the initial refresh there or wrapping it in a short timeout so startup is not dependent on this optional status cache.

Useful? React with 👍 / 👎.

@bonujel bonujel marked this pull request as draft June 3, 2026 03:17
bonujel and others added 2 commits June 3, 2026 11:30
…-specific validated, non-blocking activity refresh

- MLNodeTester now validates EVERY configured model (load -> health ->
  inference) and stops the node between models, since the MLnode rejects a
  second /inference/up while one is running with 409 (Codex: only the last
  model was probed, and multi-model nodes would falsely fail).
- computeOnboarding sets `validated` from node-specific evidence
  (this node's EpochMLNodes) instead of participant-wide activity, so a
  spare/untested node on an active participant is no longer shown as
  "validated, safe to be offline" (Codex P1).
- ActivityTracker.Start runs its initial refresh inside the goroutine, so a
  slow/unresponsive chain RPC can't block API startup (Codex P2).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ale results, activity on RPC error

- BuildMLNodeMessage honors shouldBeOnline (which reflects the current PoC
  phase): during an active PoC it tells the operator to be online instead
  of "safe to be offline", even though the countdown points to the next
  epoch (Codex P1).
- maybeAutoTest invalidates any prior MLNodeTester result for the node
  before scheduling a retest, so a stale pass from the old configuration
  can't keep a changed node shown as validated (Codex P1).
- ActivityTracker.query returns the error on a subgroup RPC failure so
  refresh keeps the previous value, instead of flipping an active
  participant to inactive on a transient failure (Codex P2).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@bonujel

bonujel commented Jun 3, 2026

Copy link
Copy Markdown
Author

@codex review whether the previous comments have been fixed. Do not introduce unrelated new suggestions unless they are critical.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b3268625c7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread decentralized-api/internal/server/admin/node_handlers.go Outdated
Comment thread decentralized-api/internal/server/admin/mlnode_tester.go Outdated
@bonujel

bonujel commented Jun 3, 2026

Copy link
Copy Markdown
Author

@codex i've addressed the comments you mentioned before. check whether the previous comments have been fixed. Do not introduce unrelated new suggestions unless they are critical.

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Keep it up!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@bonujel

bonujel commented Jun 3, 2026

Copy link
Copy Markdown
Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 02dee2f320

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +34 to +35
if currentHeight >= nextPoC {
nextPoC = es.LatestEpoch.NextPoCStart()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Do not advertise offline safety during an active PoC

When the current phase is a PoC phase, currentHeight >= StartOfPoC() switches the countdown to the following epoch's PoC. computeOnboarding then passes that large countdown to BuildMLNodeMessage, which can tell a validated active node that it is "safe to be offline" even though ShouldBeOnline is true for the current PoC. Keep the active PoC window distinct or make the displayed message honor the current phase.

Useful? React with 👍 / 👎.

Comment on lines +279 to +280
// Config changed — re-test if PoC is far enough away.
s.maybeAutoTest(node.Id)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Invalidate stale test results after configuration changes

When a node that previously passed validation is updated, its old lastTests entry remains in the tester. If auto-testing is skipped because the schedule is unknown, PoC is within one hour, or the node is serving—and even while an asynchronous retest is running—computeOnboarding continues treating the changed configuration as validated and may report that it is safe to be offline. Clear the prior result before scheduling a retest.

Useful? React with 👍 / 👎.

Comment on lines +189 to +190
modelStart := time.Now()
if err := client.InferenceUp(ctx, modelId, cfg.Models[modelId].Args); err != nil {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Test with the same merged model arguments as the broker

When a governance model defines ModelArgs, the production inference path launches it with those epoch arguments merged with the local configuration in node_worker_commands.go, but this validation path sends only cfg.Models[modelId].Args. The test therefore does not exercise the command the broker will actually use and can either falsely fail a model that requires governance arguments or falsely pass a configuration that fails after merging.

Useful? React with 👍 / 👎.

Comment on lines +103 to +104
if err != nil || subResp == nil {
continue

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve activity state on subgroup query failures

When an EpochGroupData RPC fails for the subgroup containing this participant, continuing the loop allows query to return false, so refresh flips a previously active participant to inactive despite its stated policy of retaining the previous value on transient errors. This causes incorrect onboarding status and logs for at least one refresh interval, or indefinitely during repeated subgroup failures; return the error instead of treating the missing subgroup as evidence of inactivity.

Useful? React with 👍 / 👎.

@bonujel bonujel closed this Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants