Skip to content

[P2] onboarding clarity — Dima's approach (supersedes #866)#1261

Open
Ryanchen911 wants to merge 2 commits into
gonka-ai:upgrade-v0.2.13from
Ryanchen911:jf/onboarding_clarity_v2_dima_approach
Open

[P2] onboarding clarity — Dima's approach (supersedes #866)#1261
Ryanchen911 wants to merge 2 commits into
gonka-ai:upgrade-v0.2.13from
Ryanchen911:jf/onboarding_clarity_v2_dima_approach

Conversation

@Ryanchen911

@Ryanchen911 Ryanchen911 commented May 27, 2026

Copy link
Copy Markdown

Summary

Supersedes #866. Implements the architecture reviewers asked for: keep the broker untouched, derive onboarding UX at the handler layer, and run PoC-wiring assertions in a separate api selfcheck mode against a mocked chain.

Two commits:

  • feat(admin): onboarding clarity at handler layer — pure-function helpers (TimingCalculator, StatusReporter, OnboardingStateManager) + ParticipantActivityTracker (30s cached "is active" — zero per-request chain RPCs). GET /admin/v1/nodes returns an enriched NodeWithOnboarding envelope. broker.NodeState is unchanged; one log line in node_worker_commands.go is demoted Error→Info because "no epoch models" is normal during onboarding.

  • feat(selfcheck): isolated PoC-wiring self-test + manual mlnode test endpointapi selfcheck subcommand spins up the real broker with a MockChainBridge and reports per-stage PASS/FAIL (exit 0/1). New POST /admin/v1/nodes/:id/test runs a one-shot MLnode validation (synchronous, no broker mutation, per-node mutex).

Why this PR instead of #866

#866 added UX state to broker.NodeState and new broker commands. Reviewers (@patimen, @DimaOrekhovPS) flagged this as adding UX complexity to a critical/already-complex code path. This PR delivers the same user-facing features without touching broker internals.

#866 This PR
broker.NodeState new fields 6 0
broker new commands 2 0
Per-request chain RPCs in GET /nodes N+1 0 (30s cache)
Location of UX state computation broker writes back handler derives

See proposals/onboarding-clarity-v1/README.md for product context.

Test plan

  • go build ./...
  • go test ./broker/... ./internal/server/admin/... ./selfcheck/... ./participant/... — all green
  • api selfcheck smoke test — 3 stages PASS (node-registered, epoch-models-populated, hardware-diff-submitted), exit 0
  • Manual: hit POST /admin/v1/nodes/:id/test on a live testnet node
  • Manual: confirm GET /admin/v1/nodes response shape (added optional onboarding field) doesn't break existing admin UI consumers

Ryanchen911 and others added 2 commits May 27, 2026 18:11
Compute onboarding UX (participant state, MLnode state, timing,
human-friendly messages) in the admin handler rather than on broker
state. The broker is unchanged; admin GET /nodes returns an enriched
NodeWithOnboarding envelope that wraps the broker's NodeResponse with
an optional Onboarding block.

New components:
- participant.ActivityTracker: background goroutine caches whether the
  participant is in the current active set (refreshed every 30s); admin
  handlers read IsActive() without per-request chain RPCs.
- admin pure helpers (OnboardingStateManager, TimingCalculator,
  StatusReporter): no state, no broker references, no I/O. Outputs
  derive only from inputs.
- apiconfig timing constants: shared defaults (block time, alert lead,
  auto-test slack).

Also demote the "No epoch models available" log in node_worker_commands
from Error to Info: this is normal during onboarding when the
participant has not yet joined the active set, and Error here was the
loudest source of confusion in user reports.

See proposals/onboarding-clarity-v1/README.md for product context.
Replaces the broker-coupled approach from gonka-ai#866 with the separation
suggested by reviewers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ndpoint

Implements the approach suggested by reviewers on PR gonka-ai#866: run the real
broker against a mocked chain in a separate mode to verify the PoC
lifecycle is wired up, with production code paths untouched.

Two additions:

1) admin POST /admin/v1/nodes/:id/test
   Synchronous one-shot validation against a configured MLnode.
   For each model: InferenceUp + load-time measurement; then
   InferenceHealth probe; then Stop. Returns a TestResult with per-
   model load times and overall pass/fail. Per-node mutex prevents
   concurrent tests against the same node (returns 409). The broker
   is not involved — the result is reported back to the caller, not
   written to broker state.

2) `api selfcheck` subcommand
   Builds a real broker with a MockChainBridge (implements
   broker.BrokerChainBridge with hardcoded single-participant/single-
   model epoch data) and a MockClientFactory for MLnode I/O. The
   EventDriver advances the phase tracker and triggers the same
   broker commands OnNewBlockDispatcher would normally queue. The
   Evaluator inspects broker state via the public GetNodes API and
   produces a Report (PASS/FAIL per stage). Exits 0 on PASS, 1 on
   FAIL; the report goes to stderr.

   Initial stages:
   - node-registered
   - epoch-models-populated (chain bridge wiring works end-to-end)
   - hardware-diff-submitted (broker reports hardware on startup)

   Full PoC artifact protocol simulation is intentionally out of
   scope; the foundation is here for follow-up PRs to add more
   stages (StartPoC -> POC transition, validation, inference resume).

Tests:
- admin/mlnode_tester_test.go covers NodeNotFound, success,
  model-load failure, concurrent rejection.
- selfcheck/run_test.go runs the full selfcheck and asserts PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 27, 2026 10:25

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds onboarding UX improvements and a standalone “selfcheck” mode to reduce operator confusion during the PoC wait period and validate broker wiring without modifying production broker paths.

Changes:

  • Introduces a selfcheck package that runs the real broker against a mocked chain and reports PASS/FAIL with staged diagnostics.
  • Enhances the admin /nodes response with derived onboarding/timing fields and adds a manual node test endpoint.
  • Adds a participant ActivityTracker to cache “active set” membership and reduces noisy broker logging during non-active onboarding.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
proposals/onboarding-clarity-v1/README.md Product proposal describing onboarding UX + testing goals.
decentralized-api/selfcheck/run.go Implements selfcheck runner wiring a real broker to mock chain/events.
decentralized-api/selfcheck/run_test.go Adds regression test ensuring selfcheck.Run passes on the synthetic setup.
decentralized-api/selfcheck/mock_chain_bridge.go Mocks BrokerChainBridge with single model/node/participant and captures hardware diffs.
decentralized-api/selfcheck/event_driver.go Drives synthetic chainphase transitions and broker commands.
decentralized-api/selfcheck/evaluator.go Evaluates broker state via public APIs and produces a staged report.
decentralized-api/selfcheck/config.go Creates temp-file-backed config manager for selfcheck isolation.
decentralized-api/participant/activity_tracker.go Adds cached “participant active” checker refreshed in background.
decentralized-api/main.go Adds selfcheck CLI mode and wires ActivityTracker into admin server startup.
decentralized-api/internal/server/admin/server.go Extends admin server with phase tracker, activity tracker, and ML node tester + route.
decentralized-api/internal/server/admin/server_test.go Updates server test setup for new NewServer parameters.
decentralized-api/internal/server/admin/node_handlers.go Enriches /nodes response with onboarding status and adds /nodes/:id/test.
decentralized-api/internal/server/admin/onboarding_state_manager.go Adds onboarding state derivation helpers and enums.
decentralized-api/internal/server/admin/onboarding_state_manager_test.go Adds unit tests for onboarding state derivation and messages.
decentralized-api/internal/server/admin/timing_calculator.go Adds timing derivation (blocks/seconds to next PoC) + duration formatting.
decentralized-api/internal/server/admin/status_reporter.go Builds human-facing MLnode/participant guidance strings.
decentralized-api/internal/server/admin/mlnode_tester.go Implements synchronous node validation (model load + health) with concurrency guard.
decentralized-api/internal/server/admin/mlnode_tester_test.go Adds tests for MLNodeTester success/failure/concurrency.
decentralized-api/broker/node_worker_commands.go Downgrades a noisy “no epoch models” log from Error to Info with context.
decentralized-api/apiconfig/constants.go Adds shared timing constants used by admin UX and selfcheck.

Comment on lines +80 to +101
var seconds int64
if timing != nil {
seconds = timing.SecondsUntilNextPoC
}

testFailed := n.State.FailureReason != "" &&
n.State.CurrentStatus == types.HardwareNodeStatus_FAILED
mlState, _ := DeriveMLNodeState(OnboardingStateInputs{
ParticipantActive: active,
IsTesting: false,
TestFailed: testFailed,
SecondsUntilNextPoC: seconds,
})

var userMsg, guidance string
if active {
userMsg = BuildMLNodeMessage(mlState, seconds, "")
guidance = BuildParticipantMessage(participantState)
} else {
userMsg = BuildParticipantMessage(participantState)
guidance = BuildInactiveGuidance(seconds)
}
Comment on lines +41 to +42
if os.Getenv("ENFORCED_MODEL_ID") == "" {
os.Setenv("ENFORCED_MODEL_ID", "disabled")
}
}

// staticParticipant is a CurrenParticipantInfo whose values never
// do automatically.
func BuildInactiveGuidance(secondsUntilNextPoC int64) string {
if secondsUntilNextPoC > apiconfig.AutoTestMinSecondsBeforePoC {
return "MLnode will be tested automatically when there is more than 1 hour until next PoC"
Comment on lines +151 to +161
for modelId, modelCfg := range cfg.Models {
modelStart := time.Now()
if err := client.InferenceUp(ctx, modelId, modelCfg.Args); err != nil {
result.Status = TestFailed
result.FailingModel = modelId
result.Error = err.Error()
result.LoadMs[modelId] = time.Since(modelStart).Milliseconds()
return result
}
result.LoadMs[modelId] = time.Since(modelStart).Milliseconds()
}
Comment on lines +138 to +143
// Info, not Error: this state is normal when the participant
// has not yet joined the active set for the current epoch.
// A genuine misconfiguration shows up downstream as a
// FAILED node status; logging at Error here only spams
// healthy onboarding flows.
logging.Info(result.Error+" (participant may not be active in current epoch)", types.Nodes, "node_id", worker.nodeId)
@tcharchian tcharchian linked an issue May 27, 2026 that may be closed by this pull request
4 tasks
@tcharchian tcharchian added this to the v0.2.14 milestone May 27, 2026
bonujel added a commit to bonujel/gonka that referenced this pull request May 30, 2026
Trim and reorganize dev_notes/onboarding_clarity_changes.md into three
parts: requirements (Issue gonka-ai#326 + proposal), what was done (gonka-ai#1261 baseline
+ our additions), and gaps vs the requirements with brief reasons. Records
that notifying the MLnode of TEST_FAILED is intentionally not done (would
require MLnode server changes the proposal says aren't needed). Plain text
and tables only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
bonujel added a commit to bonujel/gonka that referenced this pull request May 30, 2026
Add dev_notes/onboarding_clarity_pr_description.md — a ready-to-paste PR
body modeled on gonka-ai#1261 (Summary / Why-not-gonka-ai#866 table / Test plan /
Follow-ups), covering the full feature set on this branch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
bonujel added a commit to bonujel/gonka that referenced this pull request May 30, 2026
- dev_notes/onboarding_clarity_changes.md: requirements (Issue gonka-ai#326 +
  proposal), what was done (gonka-ai#1261 baseline + our additions), and gaps with
  brief reasons.
- dev_notes/onboarding_clarity_pr_description.md: a ready-to-paste PR body
  (Summary / Why-not-gonka-ai#866 / Test plan / Follow-ups), with the test plan
  reflecting local verification and the real-MLnode gap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
bonujel added a commit to bonujel/gonka that referenced this pull request Jun 3, 2026
- dev_notes/onboarding_clarity_changes.md: requirements (Issue gonka-ai#326 +
  proposal), what was done (gonka-ai#1261 baseline + our additions), and gaps with
  brief reasons.
- dev_notes/onboarding_clarity_pr_description.md: a ready-to-paste PR body
  (Summary / Why-not-gonka-ai#866 / Test plan / Follow-ups), with the test plan
  reflecting local verification and the real-MLnode gap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
bonujel added a commit to bonujel/gonka that referenced this pull request Jun 3, 2026
- dev_notes/onboarding_clarity_changes.md: requirements (Issue gonka-ai#326 +
  proposal), what was done (gonka-ai#1261 baseline + our additions), and gaps with
  brief reasons.
- dev_notes/onboarding_clarity_pr_description.md: a ready-to-paste PR body
  (Summary / Why-not-gonka-ai#866 / Test plan / Follow-ups), with the test plan
  reflecting local verification and the real-MLnode gap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[P2] Improve onboarding experience

3 participants