[P2] Onboarding Clarity: handler-layer UX + selfcheck, Dima's approach (supersedes #1261, #866)#1292
[P2] Onboarding Clarity: handler-layer UX + selfcheck, Dima's approach (supersedes #1261, #866)#1292bonujel wants to merge 16 commits into
Conversation
Compute onboarding UX (participant state, MLnode state, timing, human-friendly messages) in the admin handler rather than on broker state. The broker is unchanged; admin GET /nodes returns an enriched NodeWithOnboarding envelope that wraps the broker's NodeResponse with an optional Onboarding block. New components: - participant.ActivityTracker: background goroutine caches whether the participant is in the current active set (refreshed every 30s); admin handlers read IsActive() without per-request chain RPCs. - admin pure helpers (OnboardingStateManager, TimingCalculator, StatusReporter): no state, no broker references, no I/O. Outputs derive only from inputs. - apiconfig timing constants: shared defaults (block time, alert lead, auto-test slack). Also demote the "No epoch models available" log in node_worker_commands from Error to Info: this is normal during onboarding when the participant has not yet joined the active set, and Error here was the loudest source of confusion in user reports. See proposals/onboarding-clarity-v1/README.md for product context. Replaces the broker-coupled approach from gonka-ai#866 with the separation suggested by reviewers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ndpoint Implements the approach suggested by reviewers on PR gonka-ai#866: run the real broker against a mocked chain in a separate mode to verify the PoC lifecycle is wired up, with production code paths untouched. Two additions: 1) admin POST /admin/v1/nodes/:id/test Synchronous one-shot validation against a configured MLnode. For each model: InferenceUp + load-time measurement; then InferenceHealth probe; then Stop. Returns a TestResult with per- model load times and overall pass/fail. Per-node mutex prevents concurrent tests against the same node (returns 409). The broker is not involved — the result is reported back to the caller, not written to broker state. 2) `api selfcheck` subcommand Builds a real broker with a MockChainBridge (implements broker.BrokerChainBridge with hardcoded single-participant/single- model epoch data) and a MockClientFactory for MLnode I/O. The EventDriver advances the phase tracker and triggers the same broker commands OnNewBlockDispatcher would normally queue. The Evaluator inspects broker state via the public GetNodes API and produces a Report (PASS/FAIL per stage). Exits 0 on PASS, 1 on FAIL; the report goes to stderr. Initial stages: - node-registered - epoch-models-populated (chain bridge wiring works end-to-end) - hardware-diff-submitted (broker reports hardware on startup) Full PoC artifact protocol simulation is intentionally out of scope; the foundation is here for follow-up PRs to add more stages (StartPoC -> POC transition, validation, inference resume). Tests: - admin/mlnode_tester_test.go covers NodeNotFound, success, model-load failure, concurrent rejection. - selfcheck/run_test.go runs the full selfcheck and asserts PASS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Treat an unknown PoC schedule as a sentinel (-1) rather than 0, so the UI never renders "PoC starting soon (in 0s)" before the phase tracker has synced; show "schedule syncing" instead. - mlnode_tester: iterate models in sorted order so FailingModel and the per-model load sequence are deterministic (Go map order is random). - status_reporter: derive the auto-test lead time from the constant instead of a hardcoded "1 hour". - broker: log "no epoch models" at Warn instead of Info, so genuine misconfigurations stay visible while normal onboarding stays quiet. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… register - computeOnboarding now reads MLNodeTester state: TESTING while a one-shot test is in flight, and TEST_FAILED (with the failing model) from the most recent failed result. Previously the tester result was never reflected in GET /nodes, so those states could never appear. - Auto-test a node when it is registered or its config changes, gated on more than AutoTestMinSecondsBeforePoC until the next PoC. This is the proposal's "auto-test a new MLnode when there's >1h until PoC", implemented purely at the admin layer: fire-and-forget, result flows back via computeOnboarding, no write to broker NodeState (the broker-coupling that sank gonka-ai#866 is avoided). - Add handler-level tests: POST /nodes/:id/test (success/failure/404), GET /nodes onboarding envelope + TEST_FAILED wiring, and the auto-test gate and trigger. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- run.go now steps the synthetic chain through PoCGenerate -> PoCValidate -> back to Inference, queueing the same broker commands the production OnNewBlockDispatcher would, and asserts the broker drives the node's intended status across all three phases (poc-generate, poc-validate, inference-resumed). Previously selfcheck only checked the inference-phase wiring and the PoC-phase trigger methods were dead code, so "the PoC lifecycle works" was never actually verified. - Restore ENFORCED_MODEL_ID to its prior value on exit so selfcheck leaves no process-wide side effects when invoked programmatically or from tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Three log sites fired loudly while a node was simply starting up or had not yet synced the chain — normal during onboarding, not errors: - PopulateSingleNodeEpochData "phase tracker not initialized" and "epoch state not synced": demoted Warn -> Info (they only occur during startup / before sync; post-sync they don't fire, so no loss of signal). - RegisterNode "failed to populate epoch data": now logged at Info when the chain isn't synced yet (benign onboarding), but kept at Warn once synced — a real failure there still surfaces. Logging level/text only; no behavior change. The two "synced but participant-inactive" sites (parent epoch group empty; node not in epoch) are intentionally left at Warn: they are only reached after the sync check, so gating on IsSynced can't distinguish benign-from-real there — that needs a participant-active signal, deferred to avoid broker logic. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Issue gonka-ai#326 asks for clearer logging when a node is launched and waiting for PoC; until now the onboarding guidance only existed in the GET /nodes response, never in the logs. Add StartOnboardingStatusLogger: a lightweight admin-layer goroutine (started from main alongside the activity tracker) that, while the participant is configured-but-inactive, logs the human-facing "Waiting for next PoC cycle (starts in 2h 15m) - safe to be offline ..." line at startup and every 5 minutes, switching to "PoC starting soon - must be online now" inside the alert window, and announcing the inactive->active transition once before going quiet. Also log the same status right after a node is registered via the admin API. Reads cached phase/activity state only — no chain RPCs, no broker writes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…fter a test passes Per the proposal, "waiting for PoC" should only reassure the operator once the MLnode's test has passed. BuildMLNodeMessage now takes a `validated` flag: the calm "validated, safe to be offline until ..." line (and the plain "schedule syncing" line) are shown only when the node's most recent test succeeded. Until then it says the node is "not yet validated - it will be tested before the next PoC", while still telling the operator to come online when PoC is near. - computeOnboarding derives `validated` from the per-node MLNodeTester last result (TestSuccess) and passes it through. - The onboarding status logger gates on allNodesValidated() so the periodic heartbeat only reassures once every configured node passed. Logging/messaging only; no state-machine or behavior change (kept the 3 MLnode states; the gate lives in the message wording). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-test failure reported)
Two proposal-fidelity fixes from re-reading proposals/onboarding-clarity-v1:
- "no model for ml node" path: log at Info with the proposal's wording
("Participant not yet active - model assignment pending (normal for new
participants)") instead of Warn. Per the proposal this state is normal
for an inactive participant; a genuine problem still surfaces via the
node's FAILED status, so visibility is not lost. (Reverts the earlier
Warn choice in favor of the proposal's explicit Info requirement.)
- Auto-test results are now reported in the API node logs (proposal:
"detailed error reporting on TEST_FAILED"): a failed pre-PoC auto-test
logs at Error with the failing model and error; a pass logs at Info.
Previously maybeAutoTest discarded the result, so auto-test failures
only surfaced via GET /nodes and never hit the logs.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ST_FAILED computeOnboarding derived TEST_FAILED from the broker's operational status (FailureReason != "" && CurrentStatus == FAILED). But an inactive participant's node is driven to INFERENCE by InferenceUpAllCommand and "fails" with no-epoch-models — normal onboarding, not a test failure — which set FailureReason + FAILED and made GET /nodes show TEST_FAILED for a perfectly healthy waiting node, contradicting the proposal. Per the proposal, TEST_FAILED means the MLnode validation test failed, so derive it ONLY from the MLNodeTester result. Touches only the proposal-added computeOnboarding; the pre-existing broker reconciliation (InferenceUpAllCommand / getCommandForState, from gonka-ai#160) is unchanged. Adds a regression test: a node with broker FAILED + no test result is not shown as TEST_FAILED. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…quest) Implements the proposal's Proactive Testing steps Ryanchen911#3/gonka-ai#4 (send a test inference request and validate the response; record response time): - mlnodeclient: new Client.Inference(ctx, model) posts a minimal OpenAI-compatible chat-completion to <inferenceUrl>/v1/chat/completions (the same path the public handler proxies to) and validates a usable response. Added to the MLNodeClient interface + MockClient. - MLNodeTester now runs load -> health -> inference and records RespMs; a failed inference request marks the node TEST_FAILED. Per team decision this uses the existing host:port URL construction rather than waiting on gonka-ai#717 (token-auth); the gonka-ai#717 follow-up is documented in dev_notes/onboarding_clarity_changes.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
dev_notes/onboarding_observability.md — maps each onboarding lifecycle stage to the log lines this branch emits and the admin API results (GET /nodes onboarding block, POST /test, api selfcheck). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 78bf58ea4e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6700debe9f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…s, versioned test URLs - maybeAutoTest now skips when the participant is active: an active node is serving inference, and the test would reload + Stop it outside the broker's knowledge, briefly taking it out of service (Codex P1). - MLNodeTester builds versioned MLnode URLs (current node version inserted into the path, like broker.Node.*UrlWithVersion), so versioned / rolling-upgrade deployments hit the right routes instead of falsely reporting TEST_FAILED (Codex P2). - computeOnboarding treats an active participant's node as validated, so an already-serving node is not shown as "not yet validated". Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6700debe9f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review |
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
…n + per-node test guard - ComputeTiming targets this epoch's PoC start while still before it (pre-PoC inference phase), only using the next epoch's PoC once past it. NextPoCStart() unconditionally overcounted by an epoch, which left ShouldBeOnline false and could let auto-test fire inside the real online-alert window (Codex P1). - Replace the participant-wide IsActive() auto-test guard with a per-node check (the node is assigned/serving in the current epoch): idle/spare nodes are still tested even when the participant has other active nodes (Codex P2, was over-suppressing). - Apply the same guard to the manual POST /nodes/:id/test path, returning 409 for a serving node instead of reloading + stopping it out of service (Codex P2). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b3268625c7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // until ctx is cancelled. Performs one immediate refresh before | ||
| // returning so IsActive() is meaningful for the first admin request. | ||
| func (t *ActivityTracker) Start(ctx context.Context) { | ||
| t.refresh(ctx) |
There was a problem hiding this comment.
Keep activity refresh off the startup path
When the chain RPCs used by CurrentEpochGroupData/EpochGroupData are slow or not responding during startup, this synchronous refresh uses the process context with no deadline; main.go calls activityTracker.Start(ctx) before the listener and HTTP servers are started, so an onboarding cache refresh can block the entire API from coming up. Since periodic refreshes already run in the goroutine, consider moving the initial refresh there or wrapping it in a short timeout so startup is not dependent on this optional status cache.
Useful? React with 👍 / 👎.
…-specific validated, non-blocking activity refresh - MLNodeTester now validates EVERY configured model (load -> health -> inference) and stops the node between models, since the MLnode rejects a second /inference/up while one is running with 409 (Codex: only the last model was probed, and multi-model nodes would falsely fail). - computeOnboarding sets `validated` from node-specific evidence (this node's EpochMLNodes) instead of participant-wide activity, so a spare/untested node on an active participant is no longer shown as "validated, safe to be offline" (Codex P1). - ActivityTracker.Start runs its initial refresh inside the goroutine, so a slow/unresponsive chain RPC can't block API startup (Codex P2). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ale results, activity on RPC error - BuildMLNodeMessage honors shouldBeOnline (which reflects the current PoC phase): during an active PoC it tells the operator to be online instead of "safe to be offline", even though the countdown points to the next epoch (Codex P1). - maybeAutoTest invalidates any prior MLNodeTester result for the node before scheduling a retest, so a stale pass from the old configuration can't keep a changed node shown as validated (Codex P1). - ActivityTracker.query returns the error on a subgroup RPC failure so refresh keeps the previous value, instead of flipping an active participant to inactive on a transient failure (Codex P2). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@codex review whether the previous comments have been fixed. Do not introduce unrelated new suggestions unless they are critical. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b3268625c7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex i've addressed the comments you mentioned before. check whether the previous comments have been fixed. Do not introduce unrelated new suggestions unless they are critical. |
|
Codex Review: Didn't find any major issues. Keep it up! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 02dee2f320
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if currentHeight >= nextPoC { | ||
| nextPoC = es.LatestEpoch.NextPoCStart() |
There was a problem hiding this comment.
Do not advertise offline safety during an active PoC
When the current phase is a PoC phase, currentHeight >= StartOfPoC() switches the countdown to the following epoch's PoC. computeOnboarding then passes that large countdown to BuildMLNodeMessage, which can tell a validated active node that it is "safe to be offline" even though ShouldBeOnline is true for the current PoC. Keep the active PoC window distinct or make the displayed message honor the current phase.
Useful? React with 👍 / 👎.
| // Config changed — re-test if PoC is far enough away. | ||
| s.maybeAutoTest(node.Id) |
There was a problem hiding this comment.
Invalidate stale test results after configuration changes
When a node that previously passed validation is updated, its old lastTests entry remains in the tester. If auto-testing is skipped because the schedule is unknown, PoC is within one hour, or the node is serving—and even while an asynchronous retest is running—computeOnboarding continues treating the changed configuration as validated and may report that it is safe to be offline. Clear the prior result before scheduling a retest.
Useful? React with 👍 / 👎.
| modelStart := time.Now() | ||
| if err := client.InferenceUp(ctx, modelId, cfg.Models[modelId].Args); err != nil { |
There was a problem hiding this comment.
Test with the same merged model arguments as the broker
When a governance model defines ModelArgs, the production inference path launches it with those epoch arguments merged with the local configuration in node_worker_commands.go, but this validation path sends only cfg.Models[modelId].Args. The test therefore does not exercise the command the broker will actually use and can either falsely fail a model that requires governance arguments or falsely pass a configuration that fails after merging.
Useful? React with 👍 / 👎.
| if err != nil || subResp == nil { | ||
| continue |
There was a problem hiding this comment.
Preserve activity state on subgroup query failures
When an EpochGroupData RPC fails for the subgroup containing this participant, continuing the loop allows query to return false, so refresh flips a previously active participant to inactive despite its stated policy of retaining the previous value on transient errors. This causes incorrect onboarding status and logs for at least one refresh interval, or indefinitely during repeated subgroup failures; return the error instead of treating the missing subgroup as evidence of inactivity.
Useful? React with 👍 / 👎.
Summary
Supersedes #866, completes #1261. Implements issue #326 (clearer onboarding logs, no false "errors", pre-PoC readiness testing) without touching the broker — onboarding UX is derived at the admin handler layer, and PoC wiring is verified in a separate
api selfcheckmode against a mocked chain.GET /admin/v1/nodesreturns an optionalonboardingblock: MLnode stateWAITING_FOR_POC/TESTING/TEST_FAILED, participantINACTIVE_WAITING/ACTIVE_PARTICIPATING, timing, and human messages. Derived read-only (TimingCalculator/StatusReporter/OnboardingStateManager+ a 30s-cachedActivityTracker, no per-request chain RPCs).broker.NodeStateunchanged.POST /admin/v1/nodes/:id/test(manual) + auto-test on register/config-change when >1h to PoC. Runs model load → health → inference request and records timings. Pass →WAITING_FOR_POC(the "safe to be offline" wording only after a pass); fail →TEST_FAILED, logged at the API node.TEST_FAILEDcomes only from the test, never from the broker's operational FAILED.api selfcheck— drives a full synthetic PoC lifecycle (PoCGenerate → PoCValidate → Inference) and reports per-stage PASS/FAIL (exit 0/1).Why not #866
#866 added UX state to
broker.NodeState+ new broker commands; reviewers (@patimen, @DimaOrekhovPS) flagged this as complexity on a critical path. Same features, broker untouched:broker.NodeStatenew fieldsGET /nodesProduct context in
proposals/onboarding-clarity-v1/README.md.Test plan
go build/go vet/go test(broker, admin, selfcheck, participant, mlnodeclient) — greenapi selfcheck— 6 stages PASS, exit 0GET /nodesonboarding (additive shape),POST /testSUCCESS (load/health/resp) and theTEST_FAILEDpath, onboarding logs