[P2] onboarding clarity — Dima's approach (supersedes #866)#1261
Open
Ryanchen911 wants to merge 2 commits into
Open
[P2] onboarding clarity — Dima's approach (supersedes #866)#1261Ryanchen911 wants to merge 2 commits into
Ryanchen911 wants to merge 2 commits into
Conversation
Compute onboarding UX (participant state, MLnode state, timing, human-friendly messages) in the admin handler rather than on broker state. The broker is unchanged; admin GET /nodes returns an enriched NodeWithOnboarding envelope that wraps the broker's NodeResponse with an optional Onboarding block. New components: - participant.ActivityTracker: background goroutine caches whether the participant is in the current active set (refreshed every 30s); admin handlers read IsActive() without per-request chain RPCs. - admin pure helpers (OnboardingStateManager, TimingCalculator, StatusReporter): no state, no broker references, no I/O. Outputs derive only from inputs. - apiconfig timing constants: shared defaults (block time, alert lead, auto-test slack). Also demote the "No epoch models available" log in node_worker_commands from Error to Info: this is normal during onboarding when the participant has not yet joined the active set, and Error here was the loudest source of confusion in user reports. See proposals/onboarding-clarity-v1/README.md for product context. Replaces the broker-coupled approach from gonka-ai#866 with the separation suggested by reviewers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ndpoint Implements the approach suggested by reviewers on PR gonka-ai#866: run the real broker against a mocked chain in a separate mode to verify the PoC lifecycle is wired up, with production code paths untouched. Two additions: 1) admin POST /admin/v1/nodes/:id/test Synchronous one-shot validation against a configured MLnode. For each model: InferenceUp + load-time measurement; then InferenceHealth probe; then Stop. Returns a TestResult with per- model load times and overall pass/fail. Per-node mutex prevents concurrent tests against the same node (returns 409). The broker is not involved — the result is reported back to the caller, not written to broker state. 2) `api selfcheck` subcommand Builds a real broker with a MockChainBridge (implements broker.BrokerChainBridge with hardcoded single-participant/single- model epoch data) and a MockClientFactory for MLnode I/O. The EventDriver advances the phase tracker and triggers the same broker commands OnNewBlockDispatcher would normally queue. The Evaluator inspects broker state via the public GetNodes API and produces a Report (PASS/FAIL per stage). Exits 0 on PASS, 1 on FAIL; the report goes to stderr. Initial stages: - node-registered - epoch-models-populated (chain bridge wiring works end-to-end) - hardware-diff-submitted (broker reports hardware on startup) Full PoC artifact protocol simulation is intentionally out of scope; the foundation is here for follow-up PRs to add more stages (StartPoC -> POC transition, validation, inference resume). Tests: - admin/mlnode_tester_test.go covers NodeNotFound, success, model-load failure, concurrent rejection. - selfcheck/run_test.go runs the full selfcheck and asserts PASS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds onboarding UX improvements and a standalone “selfcheck” mode to reduce operator confusion during the PoC wait period and validate broker wiring without modifying production broker paths.
Changes:
- Introduces a
selfcheckpackage that runs the real broker against a mocked chain and reports PASS/FAIL with staged diagnostics. - Enhances the admin
/nodesresponse with derived onboarding/timing fields and adds a manual node test endpoint. - Adds a participant
ActivityTrackerto cache “active set” membership and reduces noisy broker logging during non-active onboarding.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| proposals/onboarding-clarity-v1/README.md | Product proposal describing onboarding UX + testing goals. |
| decentralized-api/selfcheck/run.go | Implements selfcheck runner wiring a real broker to mock chain/events. |
| decentralized-api/selfcheck/run_test.go | Adds regression test ensuring selfcheck.Run passes on the synthetic setup. |
| decentralized-api/selfcheck/mock_chain_bridge.go | Mocks BrokerChainBridge with single model/node/participant and captures hardware diffs. |
| decentralized-api/selfcheck/event_driver.go | Drives synthetic chainphase transitions and broker commands. |
| decentralized-api/selfcheck/evaluator.go | Evaluates broker state via public APIs and produces a staged report. |
| decentralized-api/selfcheck/config.go | Creates temp-file-backed config manager for selfcheck isolation. |
| decentralized-api/participant/activity_tracker.go | Adds cached “participant active” checker refreshed in background. |
| decentralized-api/main.go | Adds selfcheck CLI mode and wires ActivityTracker into admin server startup. |
| decentralized-api/internal/server/admin/server.go | Extends admin server with phase tracker, activity tracker, and ML node tester + route. |
| decentralized-api/internal/server/admin/server_test.go | Updates server test setup for new NewServer parameters. |
| decentralized-api/internal/server/admin/node_handlers.go | Enriches /nodes response with onboarding status and adds /nodes/:id/test. |
| decentralized-api/internal/server/admin/onboarding_state_manager.go | Adds onboarding state derivation helpers and enums. |
| decentralized-api/internal/server/admin/onboarding_state_manager_test.go | Adds unit tests for onboarding state derivation and messages. |
| decentralized-api/internal/server/admin/timing_calculator.go | Adds timing derivation (blocks/seconds to next PoC) + duration formatting. |
| decentralized-api/internal/server/admin/status_reporter.go | Builds human-facing MLnode/participant guidance strings. |
| decentralized-api/internal/server/admin/mlnode_tester.go | Implements synchronous node validation (model load + health) with concurrency guard. |
| decentralized-api/internal/server/admin/mlnode_tester_test.go | Adds tests for MLNodeTester success/failure/concurrency. |
| decentralized-api/broker/node_worker_commands.go | Downgrades a noisy “no epoch models” log from Error to Info with context. |
| decentralized-api/apiconfig/constants.go | Adds shared timing constants used by admin UX and selfcheck. |
Comment on lines
+80
to
+101
| var seconds int64 | ||
| if timing != nil { | ||
| seconds = timing.SecondsUntilNextPoC | ||
| } | ||
|
|
||
| testFailed := n.State.FailureReason != "" && | ||
| n.State.CurrentStatus == types.HardwareNodeStatus_FAILED | ||
| mlState, _ := DeriveMLNodeState(OnboardingStateInputs{ | ||
| ParticipantActive: active, | ||
| IsTesting: false, | ||
| TestFailed: testFailed, | ||
| SecondsUntilNextPoC: seconds, | ||
| }) | ||
|
|
||
| var userMsg, guidance string | ||
| if active { | ||
| userMsg = BuildMLNodeMessage(mlState, seconds, "") | ||
| guidance = BuildParticipantMessage(participantState) | ||
| } else { | ||
| userMsg = BuildParticipantMessage(participantState) | ||
| guidance = BuildInactiveGuidance(seconds) | ||
| } |
Comment on lines
+41
to
+42
| if os.Getenv("ENFORCED_MODEL_ID") == "" { | ||
| os.Setenv("ENFORCED_MODEL_ID", "disabled") |
| } | ||
| } | ||
|
|
||
| // staticParticipant is a CurrenParticipantInfo whose values never |
| // do automatically. | ||
| func BuildInactiveGuidance(secondsUntilNextPoC int64) string { | ||
| if secondsUntilNextPoC > apiconfig.AutoTestMinSecondsBeforePoC { | ||
| return "MLnode will be tested automatically when there is more than 1 hour until next PoC" |
Comment on lines
+151
to
+161
| for modelId, modelCfg := range cfg.Models { | ||
| modelStart := time.Now() | ||
| if err := client.InferenceUp(ctx, modelId, modelCfg.Args); err != nil { | ||
| result.Status = TestFailed | ||
| result.FailingModel = modelId | ||
| result.Error = err.Error() | ||
| result.LoadMs[modelId] = time.Since(modelStart).Milliseconds() | ||
| return result | ||
| } | ||
| result.LoadMs[modelId] = time.Since(modelStart).Milliseconds() | ||
| } |
Comment on lines
+138
to
+143
| // Info, not Error: this state is normal when the participant | ||
| // has not yet joined the active set for the current epoch. | ||
| // A genuine misconfiguration shows up downstream as a | ||
| // FAILED node status; logging at Error here only spams | ||
| // healthy onboarding flows. | ||
| logging.Info(result.Error+" (participant may not be active in current epoch)", types.Nodes, "node_id", worker.nodeId) |
4 tasks
bonujel
added a commit
to bonujel/gonka
that referenced
this pull request
May 30, 2026
Trim and reorganize dev_notes/onboarding_clarity_changes.md into three parts: requirements (Issue gonka-ai#326 + proposal), what was done (gonka-ai#1261 baseline + our additions), and gaps vs the requirements with brief reasons. Records that notifying the MLnode of TEST_FAILED is intentionally not done (would require MLnode server changes the proposal says aren't needed). Plain text and tables only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
bonujel
added a commit
to bonujel/gonka
that referenced
this pull request
May 30, 2026
Add dev_notes/onboarding_clarity_pr_description.md — a ready-to-paste PR body modeled on gonka-ai#1261 (Summary / Why-not-gonka-ai#866 table / Test plan / Follow-ups), covering the full feature set on this branch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
bonujel
added a commit
to bonujel/gonka
that referenced
this pull request
May 30, 2026
- dev_notes/onboarding_clarity_changes.md: requirements (Issue gonka-ai#326 + proposal), what was done (gonka-ai#1261 baseline + our additions), and gaps with brief reasons. - dev_notes/onboarding_clarity_pr_description.md: a ready-to-paste PR body (Summary / Why-not-gonka-ai#866 / Test plan / Follow-ups), with the test plan reflecting local verification and the real-MLnode gap. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
bonujel
added a commit
to bonujel/gonka
that referenced
this pull request
Jun 3, 2026
- dev_notes/onboarding_clarity_changes.md: requirements (Issue gonka-ai#326 + proposal), what was done (gonka-ai#1261 baseline + our additions), and gaps with brief reasons. - dev_notes/onboarding_clarity_pr_description.md: a ready-to-paste PR body (Summary / Why-not-gonka-ai#866 / Test plan / Follow-ups), with the test plan reflecting local verification and the real-MLnode gap. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
bonujel
added a commit
to bonujel/gonka
that referenced
this pull request
Jun 3, 2026
- dev_notes/onboarding_clarity_changes.md: requirements (Issue gonka-ai#326 + proposal), what was done (gonka-ai#1261 baseline + our additions), and gaps with brief reasons. - dev_notes/onboarding_clarity_pr_description.md: a ready-to-paste PR body (Summary / Why-not-gonka-ai#866 / Test plan / Follow-ups), with the test plan reflecting local verification and the real-MLnode gap. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Supersedes #866. Implements the architecture reviewers asked for: keep the broker untouched, derive onboarding UX at the handler layer, and run PoC-wiring assertions in a separate
api selfcheckmode against a mocked chain.Two commits:
feat(admin): onboarding clarity at handler layer — pure-function helpers (
TimingCalculator,StatusReporter,OnboardingStateManager) +ParticipantActivityTracker(30s cached "is active" — zero per-request chain RPCs).GET /admin/v1/nodesreturns an enrichedNodeWithOnboardingenvelope.broker.NodeStateis unchanged; one log line innode_worker_commands.gois demoted Error→Info because "no epoch models" is normal during onboarding.feat(selfcheck): isolated PoC-wiring self-test + manual mlnode test endpoint —
api selfchecksubcommand spins up the real broker with aMockChainBridgeand reports per-stage PASS/FAIL (exit 0/1). NewPOST /admin/v1/nodes/:id/testruns a one-shot MLnode validation (synchronous, no broker mutation, per-node mutex).Why this PR instead of #866
#866 added UX state to broker.NodeState and new broker commands. Reviewers (@patimen, @DimaOrekhovPS) flagged this as adding UX complexity to a critical/already-complex code path. This PR delivers the same user-facing features without touching broker internals.
GET /nodesSee proposals/onboarding-clarity-v1/README.md for product context.
Test plan
go build ./...go test ./broker/... ./internal/server/admin/... ./selfcheck/... ./participant/...— all greenapi selfchecksmoke test — 3 stages PASS (node-registered, epoch-models-populated, hardware-diff-submitted), exit 0POST /admin/v1/nodes/:id/teston a live testnet nodeGET /admin/v1/nodesresponse shape (added optionalonboardingfield) doesn't break existing admin UI consumers