devshard v2 (v0.2.13-devshard-v2) #1289

Open
a-kuprin wants to merge 135 commits into main from devshard-0.2.13-v2

Conversation

@a-kuprin

@a-kuprin a-kuprin commented Jun 1, 2026

Collaborator

This PR prepares the devshard v2 release.

This is the first devshard-only upgrade, which operates independently of usual chain upgrades. Once approved, v2 will run in parallel with the existing v1 devshard runtime.

See the upgrade design doc and the versioned/ package for details.

Upgrade process

  • Release the devshardd binary as a Gonka release artifact
  • Submit a governance proposal to register the new supported version in DevshardEscrowParams.approved_versions (defining the name, binary download URL, and sha256 hash)
  • If the proposal is approved, versiond automatically downloads the binary and serves it under the /devshard/v2 prefix
  • Once /devshard/v2 is available, contributors can test it before gateways switch primary traffic to v2

No manual host steps are expected during this type of upgrade.
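
As a rough illustration of the hash check implied by the `sha256` field in `approved_versions`, the sketch below compares a downloaded binary against an approved digest. The `ApprovedVersion` struct, its field names, and `verifyBinary` are hypothetical; this is not versiond's actual code, only the general shape such a check could take.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// ApprovedVersion mirrors the three values the proposal is said to define.
// The struct and its field names are assumptions made for this sketch.
type ApprovedVersion struct {
	Name      string
	BinaryURL string
	SHA256    string // hex-encoded digest of the binary
}

// verifyBinary returns an error when the downloaded bytes do not hash to
// the digest recorded for the approved version.
func verifyBinary(v ApprovedVersion, binary []byte) error {
	sum := sha256.Sum256(binary)
	got := hex.EncodeToString(sum[:])
	if !strings.EqualFold(got, v.SHA256) {
		return fmt.Errorf("version %q: sha256 mismatch: got %s, want %s", v.Name, got, v.SHA256)
	}
	return nil
}

func main() {
	payload := []byte("devshardd-binary-bytes") // stand-in for a downloaded artifact
	sum := sha256.Sum256(payload)
	v := ApprovedVersion{
		Name:      "v2",
		BinaryURL: "https://example.invalid/devshardd", // placeholder URL
		SHA256:    hex.EncodeToString(sum[:]),
	}
	fmt.Println(verifyBinary(v, payload))            // matching bytes
	fmt.Println(verifyBinary(v, []byte("tampered"))) // modified bytes
}
```

A binary would only be served under its prefix once this kind of check passes; where exactly versiond performs it is not stated in this PR.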

devshard

  • Prune old epoch storage on epoch changes, move SQLite/Postgres schema setup out of hot paths, and select exactly one storage backend per process
  • Remove the seed reveal round, seal completed inference stats, and prune payloads so long-running sessions do not keep all served inferences in RAM or state
  • Re-gossip stale MsgFinishInference transactions so the sequencer can pick them up from another host's mempool
  • Enforce the governance-controlled maximum nonce limit on hosts to reject invalid requests before settlement
  • Separate devshard runtime version from state-root protocol version and stamp protocol v2 at build time
  • Create sessions from on-chain escrow fee snapshots and runtime config instead of hardcoded values (with direct chain fallback until mainnet has the matching NodeManager runtime-config endpoint)
  • Store per-inference validation counters outside the state root in SQLite/Postgres and expose per-slot totals through devshard stats endpoints after inference pruning
  • Add internal devshard traces and metrics through OpenTelemetry and Prometheus
  • Return typed devshard errors for disabled, initializing, and non-retryable states instead of generic failures

decentralized-api

The changes in the decentralized-api/ module are fully backward compatible and do not need to be activated before the next mainnet release.

  • Serve chain-backed devshard runtime config through the NodeManager GetRuntimeConfig gRPC long-poll
  • Add dapi traces and metrics for public inference requests, event listening, validation, chain queries, transaction broadcasts, and ML node calls
  • Propagate trace context across executor forwarding, validation payload fetches, and ML node calls

inference-chain

The changes in the inference-chain/ module are wire-compatible and do not need to be activated before the next mainnet release.

  • Rename the version field to state_root_and_protocol_version in the devshard settlement message proto
  • Move devshard session timeouts, fees, validation rates, vote threshold factor, and grace periods to governance-controlled DevshardEscrowParams
  • Add create_devshard_fee and fee_per_nonce to DevshardEscrow to snapshot active fees at escrow creation

deploy

  • Add join-stack observability with Grafana, Jaeger, Prometheus, Loki, Promtail, and cAdvisor
  • Add dashboards for devshard sessions, chain health, query latency, storage, containers, and node health

Proposed Bounties

| Bounty ID | Sum USDT | Bounty Explanation | GitHub ID |
|---|---|---|---|
| PR #1114, PR #1115 | 3000 | Certik security audit fixes (GEB-62, GEB-59, GEB-60), reported in Issue #1109 | @x0152 |
| Issue #1135 | 30000 | PoC Decode. So far, PoC validation has only covered the prefill step, but most of the real computation in inference happens during decode, which goes unverified. PoC-decode extends it to every decode step, so a node running a different/cheaper model gets caught. It closes the biggest open gap in the network's PoC validation mechanism. spec | Axel-t |
| PR #1035 | 100 | fix(subnetctl): propagate fatal HTTP errors instead of waiting on timeout | @unameisfine |
| PR #1298 | 17000 | Devshard 0.2.13 v2 - release implementation and management | @akup |
| PR #1046 | 4000 | Observability implementation | @qdanik |
| PR #1046 | 2000 | Observability implementation | @blizko |
| branch | 7000 | Emergency troubleshooting | @qdanik |
| -- | 3000 | Gateway - implementation work | @qdanik |
| report | 7000 | Emergency troubleshooting - schema bomb and B200 investigation | kaitaku.ai |
| MiniMax, Additional benchm | 10000 | MiniMax integration + post-deploy bug-fixing + additional benchmarks + community FAQ | kaitaku.ai |
| Issue #1026 | 5000 | VLM inference and validation in Gonka - testing VLM serving validation and adding the necessary tools/scripts (inference + validation for visual language models, threshold calibration across honest/fraud scenarios) | @fedor-konovalenko, MIL team |
| Issue #34 | 5000 | TOPLOC as a validation mechanism. Evaluated using topic to reduce artifact size. The original paper reported near-100% accuracy, but only on small models (Llama-8B); experiment results matched the paper for small models, while accuracy dropped on large models (235B). | @fedor-konovalenko, MIL team |
| docs#1093, docs#1134, docs#992, docs#1094 | 500 | docs: restructure governance section and expand guidance; add MiniMax-M2.7 and Kimi K2.6 model licenses; update host hardware specifications | @Dolper |

akup and others added 30 commits February 25, 2026 18:19
Co-authored-by: Cursor <cursoragent@cursor.com>
Sets DevshardEscrowParams.MaxEscrowsPerEpoch to 500_000.
Skip startup only when the port is set negative; treat 0 as unset and
fall back to 9400. Wire the same default into the join compose file via
NODE_MANAGER_GRPC_PORT so devshard reaches the API without manual config.
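
The port rule in this commit message can be sketched as a small pure function. The names `resolveGRPCPort` and `defaultNodeManagerGRPCPort` are hypothetical; only the three cases (negative, zero, positive) and the 9400 default come from the message above.

```go
package main

import "fmt"

// defaultNodeManagerGRPCPort is the fallback named in the commit message.
const defaultNodeManagerGRPCPort = 9400

// resolveGRPCPort maps the configured value to the port to listen on and
// reports whether the server should start at all.
func resolveGRPCPort(configured int) (port int, start bool) {
	switch {
	case configured < 0:
		return 0, false // explicitly disabled
	case configured == 0:
		return defaultNodeManagerGRPCPort, true // treated as unset
	default:
		return configured, true
	}
}

func main() {
	for _, p := range []int{-1, 0, 9500} {
		port, start := resolveGRPCPort(p)
		fmt.Printf("configured=%d port=%d start=%t\n", p, port, start)
	}
}
```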
A participant restored to ACTIVE inherited the prior ConsecutiveInvalidInferences,
so a single new failure could re-invalidate them immediately. Zero the counter
when transitioning to INVALID and at every upcoming-to-effective promotion.
Replace the hardcoded keeper.DevshardMaxNonce constant with a governance
parameter on DevshardEscrowParams. VerifyDevshardSettlement now receives
the bound from params; the settle msg server reads it before verifying.
The v0.2.13 upgrade handler raises MaxNonce to 1_000_000 and bundles the
existing MaxEscrowsPerEpoch=500_000 bump into the same step.
…2.13

v0.2.12 added MsgRespondDealerComplaints to InferenceOperationKeyPerms
but did not migrate existing cold-to-warm grants, leaving pre-v0.2.12
DAPIs unable to respond to dealer complaints. Walk authz grants, key
each pair off its MsgStartInference grant, and add the missing
authorization with the source grant's expiration. Idempotent.
Wire CreateUpgradeHandler with InferenceKeeper and AuthzKeeper so the
chain runs the v0.2.13 migrations at the upgrade height. No module
ConsensusVersion bump: the handler edits existing collections, no
inference store schema change.
# Devshard storage: Postgres backend + epoch pruning

Drop-in replacement for the unbounded single-file SQLite store on `main`.
SQLite-only deployments need no config change; new binaries auto-migrate
the legacy DB on first boot.

## Architecture

```
HostManager
  -> ManagedStorage           // 30s pruner, retain N=3 epochs
       -> SQLite              // PGHOST unset
       -> HybridStorage       // PGHOST set
            -> Postgres       // primary, sticky per-escrow
            -> SQLite         // local fallback while PG is down
```

Storage is partitioned by `epoch_id` (= `DevshardEscrow.epoch_index`):

- Postgres: `devshard_sessions`, `devshard_diffs`, `devshard_signatures`
  each `PARTITION BY RANGE (epoch_id)`. Partitions are created lazily;
  pruning is `DROP TABLE`.
- SQLite: one `epoch_<N>.db` per epoch plus a `_meta.db` routing index;
  pruning closes the pool and removes the file.
- Hybrid: per-escrow stickiness keeps a session on one backend.

`ManagedStorage` ticks every 30s, computes
`cutoff = max_observed_epoch + 1 - retain`, and prunes everything older.
An `EpochProvider` advances the cutoff on quiet hosts.
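
A minimal sketch of the cutoff arithmetic, assuming epochs are unsigned integers and that "older" means strictly below the cutoff. `pruneCutoff` and `epochsToPrune` are illustrative names, not the functions in this branch.

```go
package main

import "fmt"

// pruneCutoff computes cutoff = maxObservedEpoch + 1 - retain, clamped at 0
// so that a young chain (fewer epochs than retain) prunes nothing.
func pruneCutoff(maxObservedEpoch, retain uint64) uint64 {
	if maxObservedEpoch < retain {
		return 0
	}
	return maxObservedEpoch + 1 - retain
}

// epochsToPrune returns the stored epochs that are strictly older than cutoff.
func epochsToPrune(stored []uint64, cutoff uint64) []uint64 {
	var out []uint64
	for _, e := range stored {
		if e < cutoff {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	cutoff := pruneCutoff(10, 3)
	fmt.Println(cutoff)                                            // 8
	fmt.Println(epochsToPrune([]uint64{5, 6, 7, 8, 9, 10}, cutoff)) // [5 6 7]
}
```

With `retain = 3` and a maximum observed epoch of 10, epochs 8, 9, and 10 survive, which matches the "retain N=3 epochs" note in the architecture diagram.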

## Drop-in guarantees

- `PGHOST` unset -> SQLite-only, identical to before.
- `PGHOST` set -> hybrid mode, same env vars as `payloadstorage`.
- Legacy `/root/.dapi/data/devshard.db` is migrated to
  `/root/.dapi/data/devshard/` on first boot, then renamed
  `*.migrated.<unix>`. Idempotent across restarts.
- Per-host storage. No schema, proto, HTTP, or gossip changes.
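
The first two guarantees amount to a single environment check. The sketch below assumes an empty `PGHOST` counts as unset, which the commit message does not spell out; `selectBackend` and the mode names are hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

type backendMode string

const (
	modeSQLite backendMode = "sqlite" // SQLite-only, the pre-existing behaviour
	modeHybrid backendMode = "hybrid" // Postgres primary with local SQLite fallback
)

// selectBackend picks the storage mode from PGHOST. The lookup function is
// injected so the decision can be exercised without touching the real
// environment. Treating an empty value as unset is an assumption.
func selectBackend(lookup func(string) (string, bool)) backendMode {
	if host, ok := lookup("PGHOST"); ok && host != "" {
		return modeHybrid
	}
	return modeSQLite
}

func main() {
	fmt.Println(selectBackend(os.LookupEnv))
}
```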

## Tradeoffs

For simplicity, partitioning is by `epoch_id` only, not
`(epoch_id, escrow_id)`. Loading a session reads its diffs from the
shared epoch partition (indexed on `escrow_id`). The next step is per-escrow state snapshots (data +
additions) so readers skip the diff replay.
…poch

Reuses the v0.2.10 grace-epoch primitive with UpgradeProtectionWindow=3000.
The pruning test queried latestEpoch at the very end and asserted that
its session partition existed. But the advance-epochs loop exits via
waitForNextEpoch after the last write, so by the time the assertion
runs the chain's current epoch has no devshard activity and therefore
no partition. Capture the epochIndex of the last tick's escrow during
the loop and assert against that partition instead.
Problem:
API startup waited for devshard legacy migration and full session replay before
starting the ML/admin servers. On large devshard state this delayed port 9100 by
minutes even though most endpoints did not need recovered devshard sessions.

Solution:
Gate devshard session routes with a 503 initializing response, run legacy
migration in the background, then mark devshard ready and recover sessions
asynchronously. Requests after migration still lazily recover a single escrow
before serving it.

Flow:
startup -> register gated routes -> start servers
        -> migrate legacy DB -> mark ready -> background recovery

request -> ready? no -> 503 initializing
request -> ready? yes -> session cached? yes -> serve
request -> ready? yes -> session cached? no -> recover escrow -> serve
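
The gating step of this flow could look roughly like the following. It is a sketch using the standard library only; the `readiness` type, the `/session` path, and the response body are assumptions, and lazy per-escrow recovery is left out.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// readiness tracks whether legacy migration has finished.
type readiness struct{ ready atomic.Bool }

func (r *readiness) markReady() { r.ready.Store(true) }

// gate answers 503 until markReady is called, then passes requests through.
func (r *readiness) gate(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		if !r.ready.Load() {
			http.Error(w, "devshard initializing", http.StatusServiceUnavailable)
			return
		}
		next.ServeHTTP(w, req)
	})
}

// okHandler stands in for a real devshard session route.
func okHandler() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, "ok")
	})
}

// probe sends one request through h and returns the status code.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/session", nil))
	return rec.Code
}

func main() {
	r := &readiness{}
	h := r.gate(okHandler())
	fmt.Println(probe(h)) // 503 while migration is still running
	r.markReady()         // migration done; recovery may continue in the background
	fmt.Println(probe(h)) // 200 once the gate is open
}
```

Because the routes are registered before the servers start, the listener can come up immediately and the gate alone decides what callers see.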
* devshard snapshots for hosts

* devshards recoversessions parallel workers

* devshard host snapshot on settlement

---------

Co-authored-by: David and Daniil Liberman <da@liberman.net>
@a-kuprin

a-kuprin commented Jun 9, 2026

Collaborator Author

Added #1326, which fixes an issue we found:

Hosts could diverge from the user on SealedAcc / post_state_root because sealing used a wall-clock grace gate outside the signed diff.

a-kuprin and others added 4 commits June 9, 2026 21:16
* Move devshard inference sealing into deterministic state-machine auto-seal.

Host-local wall-clock prune tiers made seal timing node-dependent and risked diverging state roots. Fold eligible inferences during diff apply using nonce and ConfirmedAt-derived state clock gates, and have the host emit payload-prune events only after the machine seals them.

* Added a short path for sealing inferences:
if an inference is validated or invalidated, don't wait for the grace period; seal it immediately.

Additional check before sealing: the inference must have one of the following statuses:
StatusFinished, StatusValidated, StatusInvalidated, StatusTimedOut
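
One way to read the sealing rules in these commits is the predicate below. It is a sketch under several assumptions: the nonce and clock gates are combined with AND, an undefined state clock blocks every seal including the short path, and `StatusPending`, `sealParams`, and `canSeal` are invented names. Only the four sealable statuses and the existence of the nonce and time gates come from the commit messages.

```go
package main

import "fmt"

type Status int

const (
	StatusPending Status = iota // hypothetical non-sealable status
	StatusFinished
	StatusValidated
	StatusInvalidated
	StatusTimedOut
)

type Inference struct {
	Nonce       uint64
	ConfirmedAt int64
	Status      Status
}

// sealParams holds the two grace gates; the field names are assumptions.
type sealParams struct {
	GraceNonce   uint64
	GraceTimeout int64
}

// canSeal depends only on values derived from applied diffs, never on the
// host's wall clock, so every node reaches the same decision.
func canSeal(inf Inference, latestNonce uint64, stateClock int64, clockDefined bool, p sealParams) bool {
	switch inf.Status {
	case StatusFinished, StatusValidated, StatusInvalidated, StatusTimedOut:
		// sealable status, continue to the gates
	default:
		return false
	}
	if !clockDefined {
		return false // no ConfirmedAt-derived clock yet
	}
	if inf.Status == StatusValidated || inf.Status == StatusInvalidated {
		return true // short path: outcome is final, no grace period
	}
	nonceOK := latestNonce >= inf.Nonce+p.GraceNonce
	timeOK := stateClock >= inf.ConfirmedAt+p.GraceTimeout
	return nonceOK && timeOK
}

func main() {
	p := sealParams{GraceNonce: 10, GraceTimeout: 60} // illustrative values
	inf := Inference{Nonce: 5, ConfirmedAt: 100, Status: StatusFinished}
	fmt.Println(canSeal(inf, 12, 200, true, p)) // nonce gate not yet passed
	fmt.Println(canSeal(inf, 15, 200, true, p)) // both gates passed
}
```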

---------

Co-authored-by: akup <ak@neonavigation.com>
@0xMayoor

Contributor

devshardAssignedUpperBoundForSlot (devshard_settlement.go) is documented as "the maximum number of inference IDs that could have been assigned to a slot" — an upper bound, 1 + (nonce-firstAssigned)/slotCount. but the settle handler uses it as the actual completed count: assignedToSlot, _ := devshardAssignedUpperBoundForSlot(msg.Nonce, ...) → AggregateDevshardHostStatsIntoCurrentEpochStats(participant, *hs, assignedToSlot), which credits completed = assignedPerSlot - missed straight into CurrentEpochStats.InferenceCount. so the credited inference count comes from the settlement nonce, not from work the hosts actually attested.

the nonce isn't bound to real work. in applyCore (devshard/state/machine.go) an empty diff (or MsgFinalizeRound) advances LatestNonce with no StartInference, and the per-nonce fee is only charged in the Active phase — so once you're in Finalizing/Settlement you can advance the nonce up to the max for free. the new host-side max-nonce limit caps the magnitude (~MaxNonce/groupSize per slot, ~1250 at the defaults) but doesn't change that the count is decoupled from work. hosts still sign those empty roots — the only acceptance checker withholds on a stale mempool, not on an inference-less diff — and HostStats.Missed/Cost stay 0 since nothing finished or timed out. so an all-zero HostStats settlement at a high nonce is a valid quorum-signed payload, and each occupied slot's participant gets credited ~1250 "completed".

that's the same counter the downtime punishment reads (accountsettle.go, total = InferenceCount + MissedRequests). a participant who's genuinely down — say 50 served / 50 missed, normally zeroed by MissedStatTest — can settle one max-nonce escrow, fabricate ~1250 completed, drop their apparent miss-rate under p0, and keep the full reward. the same counters also feed getDynamicP0, so a large zero-missed contribution pulls the network-wide baseline down and tightens p0 for everyone.

create/settle is permissionless by default (AllowedCreatorAddresses empty) and slots are sampled from the epoch group, so any active participant can land a slot — one is enough. i have a small go test that runs the real devshardAssignedUpperBoundForSlot → AggregateDevshardHostStatsIntoCurrentEpochStats → CheckAndPunishForDowntime path and shows that same 50/50 participant flip from reward 0 to full reward; happy to share.

not prescribing a fix since that's your design, but the root is using the nonce-derived upper bound as the actual completed count — binding the credit to signed per-slot completed work (or cross-checking against Cost/validations at settle) would close it.
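
The arithmetic in this comment can be worked through with a short sketch. The formula and the 50 served / 50 missed example come from the comment; `slotCount = 800` and `firstAssigned = 0` are assumptions chosen only so that the result lands near the quoted ~1250, and the function names are illustrative rather than the chain's own.

```go
package main

import "fmt"

// assignedUpperBound follows the documented formula:
// 1 + (nonce - firstAssigned) / slotCount.
func assignedUpperBound(nonce, firstAssigned, slotCount uint64) uint64 {
	if slotCount == 0 || nonce < firstAssigned {
		return 0
	}
	return 1 + (nonce-firstAssigned)/slotCount
}

// missRate uses total = completed + missed as the denominator, matching
// "total = InferenceCount + MissedRequests" in the comment.
func missRate(completed, missed uint64) float64 {
	total := completed + missed
	if total == 0 {
		return 0
	}
	return float64(missed) / float64(total)
}

func main() {
	const maxNonce, slotCount = 1_000_000, 800 // slotCount is an assumed group size
	credited := assignedUpperBound(maxNonce, 0, slotCount)
	fmt.Println(credited)                          // 1251
	fmt.Printf("%.3f\n", missRate(50, 50))          // 0.500 before the empty settlement
	fmt.Printf("%.3f\n", missRate(50+credited, 50)) // 0.037 after it
}
```

Under these assumed numbers, a single all-zero settlement moves the apparent miss rate from 50% to under 4%, which is the effect the comment describes.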

@0xMayoor

Contributor

two more verification gaps in the v2 runtime this PR ships — both the same "sibling verifies, twin doesn't" shape, and i've got fixes open against main for each:

fetchSignature (devshard/user/session.go) stores the bytes a host returns from GET /signatures keyed by slot, with only a slot-ownership check and no RecoverAddress — so a host can hand back arbitrary bytes that then get counted toward quorum. its sibling processResponse recovers and matches the address before storing. fix: #1311.

HandleGossipTxs (devshard/transport/server.go) forwards gossiped txs into the mempool after only a group-membership check, with no per-tx proposer-sig verification — so a group member can inject forged txs the host then trusts (e.g. a forged validation vote that suppresses the host's own validation via the mempool oracle). its sibling HandleGossipNonce does RecoverAddress + slot match before storing. fix: #1312.

both are still present on devshard-0.2.13-v2 at the current head — flagging here since they ride along in the code under review.
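
The "recover, then match, then store" shape that both fixes rely on can be sketched as follows. Signature recovery is injected as a function because the real `RecoverAddress` lives in the devshard code and is not reproduced here; `sigStore`, `storeVerified`, and `toyRecover` are hypothetical, and `toyRecover` is deliberately not cryptography.

```go
package main

import (
	"errors"
	"fmt"
)

// recoverFunc stands in for real signature recovery (assumed signature).
type recoverFunc func(digest, sig []byte) (addr string, err error)

type sigStore struct {
	slotOwner   map[uint32]string // slot -> address expected to sign for it
	sigs        map[uint32][]byte // verified signatures only
	recoverAddr recoverFunc
}

func newSigStore(owners map[uint32]string, rec recoverFunc) *sigStore {
	return &sigStore{slotOwner: owners, sigs: make(map[uint32][]byte), recoverAddr: rec}
}

// storeVerified keeps a signature only if the recovered signer owns the slot,
// so unverified bytes can never count toward quorum.
func (s *sigStore) storeVerified(slot uint32, digest, sig []byte) error {
	owner, ok := s.slotOwner[slot]
	if !ok {
		return fmt.Errorf("unknown slot %d", slot)
	}
	addr, err := s.recoverAddr(digest, sig)
	if err != nil {
		return fmt.Errorf("recover signer: %w", err)
	}
	if addr != owner {
		return errors.New("signer does not own slot")
	}
	s.sigs[slot] = sig
	return nil
}

// toyRecover treats the signature bytes as the signer address.
// It exists only to make the sketch runnable.
func toyRecover(_ []byte, sig []byte) (string, error) {
	if len(sig) == 0 {
		return "", errors.New("empty signature")
	}
	return string(sig), nil
}

func main() {
	s := newSigStore(map[uint32]string{0: "host-a"}, toyRecover)
	fmt.Println(s.storeVerified(0, []byte("root"), []byte("host-b"))) // rejected
	fmt.Println(s.storeVerified(0, []byte("root"), []byte("host-a"))) // accepted
	fmt.Println(len(s.sigs))
}
```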

@a-kuprin

Collaborator Author

@0xMayoor

I've seen both; they are candidates for the next release, in 1 or 2 weeks. We just need to keep the scope of this release finite.

a-kuprin and others added 5 commits June 11, 2026 21:34
* Parameters naming and inferenceSealGraceNonce, inferenceSealGraceTimeout moved to EscrowStart
* Don't seal inferences when stateClock is undefined (no confirmedAt value in latest inferences)
It is set in the escrow start message and is unchangeable during the escrow session
Default is 150.
It is required for the e2e testermint test to pass. That test checks that autodealing works
Comment thread devshard/devshardctl Outdated
Comment thread inference-chain/go.mod
@@ -4,7 +4,7 @@ go 1.24.2

replace (
cosmossdk.io/store => github.com/gonka-ai/cosmos-sdk/store v1.1.2-ps1
-github.com/cosmos/cosmos-sdk => github.com/gonka-ai/cosmos-sdk v0.53.3-ps17
+github.com/cosmos/cosmos-sdk => github.com/gonka-ai/cosmos-sdk v0.53.3-ps17-observability

Collaborator

Are we planning to include this as a stable version instead of a feature branch?

Comment thread inference-chain/go.sum
@@ -788,8 +790,8 @@ github.com/golangci/revgrep v0.5.3 h1:3tL7c1XBMtWHHqVpS5ChmiAAoe4PF/d5+ULzV9sLAz
github.com/golangci/revgrep v0.5.3/go.mod h1:U4R/s9dlXZsg8uJmaR1GrloUr14D7qDl8gi2iPXJH8k=
github.com/golangci/unconvert v0.0.0-20240309020433-c5143eacb3ed h1:IURFTjxeTfNFP0hTEi1YKjB/ub8zkpaOqFFMApi2EAs=
github.com/golangci/unconvert v0.0.0-20240309020433-c5143eacb3ed/go.mod h1:XLXN8bNw4CGRPaqgl3bv/lhz7bsGPh4/xSaMTbo2vkQ=
github.com/gonka-ai/cosmos-sdk v0.53.3-ps17 h1:xw8ssDJDfl+/TnD9QMq/EZGzjnoh+6cvROqZE/MwNzU=
github.com/gonka-ai/cosmos-sdk v0.53.3-ps17/go.mod h1:90S054hIbadFB1MlXVZVC5w0QbKfd1P4b79zT+vvJxw=
github.com/gonka-ai/cosmos-sdk v0.53.3-ps17-observability h1:vWph4b1Xzvwj9jV3BVD6RXQLqRmCsGNyPAxePlFIU0Q=

Collaborator

Are we planning to include this as a stable version instead of a feature branch?

Collaborator Author

Stable version, not a feature branch.
Do you have any concerns about this?

Collaborator

The naming v0.53.3-ps17-observability breaks semantic versioning.

@a-kuprin

a-kuprin commented Jun 12, 2026

Collaborator Author

@0xMayoor

> so the credited inference count comes from the settlement nonce, not from work the hosts actually attested.

Basically, in devshard nonceId == inferenceId, but you are right that there are service nonces, like the one carrying MsgFinalizeRound.
devshard is designed to serve a lot of inferences, so this doesn't break the stats.

But again, you are right that we should add a - 1.

@0xMayoor

Contributor

yeah fair @a-kuprin , the active-phase fee bounds it so it's not free like i implied, my bad.
the gap's bigger than -1 though — once finalizing starts the nonce keeps advancing with no fee till
LatestNonce >= FinalizeNonce +len(Group), so it's the whole finalize window not one service nonce.
and that count lands in CurrentEpochStats.InferenceCount which feeds the downtime punishment denom and the dynamicP0 baseline, so it shifts the miss-rate test a bit, not just a display stat.
might be small in normal runs, you'd know better — figured worth subtracting the window not just 1.

@a-kuprin

Collaborator Author

> yeah fair @a-kuprin , the active-phase fee bounds it so it's not free like i implied, my bad. the gap's bigger than -1 though — once finalizing starts the nonce keeps advancing with no fee till LatestNonce >= FinalizeNonce +len(Group), so it's the whole finalize window not one service nonce. and that count lands in CurrentEpochStats.InferenceCount which feeds the downtime punishment denom and the dynamicP0 baseline, so it shifts the miss-rate test a bit, not just a display stat. might be small in normal runs, you'd know better — figured worth subtracting the window not just 1.

Could you please dig it deeper and prepare PR with a fair fix?

@0xMayoor

Contributor

ok dug in, it's bigger than the -1 we landed on — a node that's genuinely down keeps its full epoch reward.

empty diffs are the lever: they bump the nonce with no inference in them, so settlement credits the slot ~nonce/groupsize "completed" off pure nonce while Missed stays 0. run it to max, settle all-zero hoststats, and a 50-done/50-missed node that should be zeroed gets buried under ~1250 fake completed. no collusion needed either — stale mempool is the only thing that makes a host withhold its sig, and an empty session never has pending txs, so honest hosts sign it fine. costs ~2e7 in per-nonce fees, nothing next to the reward it saves.

and it beats both downtime gates, not just the epoch one — the per-block slashing check too. devshard misses only hit that SPRT batched at settlement, so front-load the empty one and it never trips. inactive = zero reward, so that's the part that actually keeps the money. all confirmed with PoCs on the v2 head.

finalize-window subtraction won't close it btw — the empty active nonces survive that, the count has to come from real work not the nonce.
PR coming. @a-kuprin

@a-kuprin

a-kuprin commented Jun 14, 2026

Collaborator Author

@0xMayoor

> empty diffs are the lever

Actually, an empty diff is very undesirable for the protocol and normally shouldn't appear. Another attack with empty diffs is skipping slots and sending real work only to some chosen host in the devshard.

There is still a lot of work to do on devshards to make them stable at the protocol level. I don't think we need to push it right now, as devshard is now primarily used for gateway stabilisation.
But the intent is that every nonce should carry an inference, so I think the settlement solution was designed with this intent in mind, since this host-skipping attack should be prevented by the protocol anyway.

So the main point is that an empty diff is not what the target version of the protocol expects, but the current state is mostly experimental, for gateway purposes, and every gateway is currently whitelisted and trusted.

Of course we can add real inference counting, but it should also be added to the settlement message.

On the one hand, we would be changing part of the protocol to legitimize the host-skip attack I've described. But we need a legal way to skip inferences for some hosts anyway (for example during cPoC, see https://github.com/a-kuprin/gonka/blob/devshard-testenv/devshard/docs/proposals/CPOC_PROTOCOL.md).

So adding a real inference count to the settlement message is what we should have, and it is very easy to add.

@0xMayoor

Contributor

> @0xMayoor
>
> > empty diffs are the lever
>
> Actually, an empty diff is very undesirable for the protocol and normally shouldn't appear. Another attack with empty diffs is skipping slots and sending real work only to some chosen host in the devshard.
>
> There is still a lot of work to do on devshards to make them stable at the protocol level. I don't think we need to push it right now, as devshard is now primarily used for gateway stabilisation. But the intent is that every nonce should carry an inference, so I think the settlement solution was designed with this intent in mind, since this host-skipping attack should be prevented by the protocol anyway.
>
> So the main point is that an empty diff is not what the target version of the protocol expects, but the current state is mostly experimental, for gateway purposes, and every gateway is currently whitelisted and trusted.
>
> Of course we can add real inference counting, but it should also be added to the settlement message.
>
> On the one hand, we would be changing part of the protocol to legitimize the host-skip attack I've described. But we need a legal way to skip inferences for some hosts anyway (for example during cPoC, see https://github.com/a-kuprin/gonka/blob/devshard-testenv/devshard/docs/proposals/CPOC_PROTOCOL.md).
>
> So adding a real inference count to the settlement message is what we should have, and it is very easy to add.

yeah, attested per-slot count in the settlement message is the move — only thing i'd watch is it rides in the signed host_stats, not a value the settler passes in, otherwise you've just moved the trust. nonce stays as the cap.
you tackling the pr or should i?

@Ryanchen911


I found three settlement issues in devshard_settlement.go; it seems all of them can lock or drain escrow funds:

  1. Protocol tag validated against the binary-version allowlist (:103 → :33)
    msg.StateRootAndProtocolVersion (the protocol tag, constant "v2" from domain.go:15) is checked against params.ApprovedVersions[].Name (versiond binary names). These are different concepts — your protocol-version.md says "do not assume it must equal an approved_versions.name entry." It only passes today because the allowlist is empty (len(approved)==0 → nil) or happens to contain "v2". Once approved_versions holds anything else (legacy {name:"v1"}, or a future {name:"v3"} bugfix binary with no protocol bump), every settlement is rejected, escrow.Settled is never set, and funds lock.
    Fix: drop the check (the tag is already bound into the state root via versionHash + quorum sigs), or validate against a separate protocol-version list.

  2. msg.Nonce checked against live MaxNonce, not a per-escrow snapshot (:95)
    Fees are snapshotted at creation (create_devshard_fee/fee_per_nonce), but max_nonce is read live. A session created at max_nonce=1000 that legitimately ran to nonce 800 becomes unsettleable if governance later lowers max_nonce to 500 (800 > 500 → reject) → funds locked/forfeited.
    Fix: snapshot max_nonce onto the escrow at creation and validate msg.Nonce against the snapshot.

  3. Snapshotted fee schedule is never enforced against msg.Fees (:209)
    The only check is totalCost + msg.Fees ≤ escrow.Amount; msg.Fees is never compared to create_devshard_fee + fee_per_nonce * msg.Nonce. So a colluding/buggy host quorum can sign a settlement with Fees inflated up to the full escrow, overpaying validators and eating the creator's refund. The snapshotted schedule is currently decorative.
    Fix: compute expectedFees = create_devshard_fee + fee_per_nonce * msg.Nonce (with overflow checks) and require msg.Fees == expectedFees.
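
The checks proposed in items 2 and 3 could be sketched as below. This assumes fees and nonces fit in `uint64` and that `max_nonce` has been snapshotted onto the escrow; the chain may well use a different numeric type, and `expectedFees` and `checkSettlement` are illustrative names, not existing functions.

```go
package main

import (
	"errors"
	"fmt"
	"math/bits"
)

var errOverflow = errors.New("fee computation overflows uint64")

// expectedFees computes createFee + feePerNonce*nonce with overflow checks.
func expectedFees(createFee, feePerNonce, nonce uint64) (uint64, error) {
	hi, lo := bits.Mul64(feePerNonce, nonce)
	if hi != 0 {
		return 0, errOverflow
	}
	sum, carry := bits.Add64(createFee, lo, 0)
	if carry != 0 {
		return 0, errOverflow
	}
	return sum, nil
}

// checkSettlement validates the nonce against the per-escrow snapshot and
// requires the claimed fees to equal the snapshotted schedule.
func checkSettlement(snapshotMaxNonce, createFee, feePerNonce, nonce, claimedFees uint64) error {
	if nonce > snapshotMaxNonce {
		return fmt.Errorf("nonce %d exceeds escrow max_nonce %d", nonce, snapshotMaxNonce)
	}
	want, err := expectedFees(createFee, feePerNonce, nonce)
	if err != nil {
		return err
	}
	if claimedFees != want {
		return fmt.Errorf("fees %d do not match expected %d", claimedFees, want)
	}
	return nil
}

func main() {
	// Escrow created when max_nonce was 1000; it ran to nonce 800.
	fmt.Println(checkSettlement(1000, 100, 5, 800, 4100)) // schedule matches
	fmt.Println(checkSettlement(1000, 100, 5, 800, 9000)) // inflated fees
}
```

Because the bound comes from the snapshot, a later governance change that lowers the live `max_nonce` would not make this settlement fail.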

Comment thread devshard/state/seal.go
sm.mu.Lock()
defer sm.mu.Unlock()

if err := sm.inferenceStore.DeleteSealedInferences(sm.state.EscrowID); err != nil {

Collaborator

I might be missing some context, but why do we prune sealed tables at startup?
This might cause data loss after restart in some scenarios

@a-kuprin a-kuprin Jun 15, 2026

Collaborator Author

This is the session recovery.

The source of truth here is the diffs, not the local data. Also, sealed inferences are only for observability.

Collaborator

Not a blocker

interface.go:84 says these counters survive restarts, but sealed ones don't: recovery calls DeleteSealedInferences() (clears sealed_validation_obs) and never rebuilds them from diffs. So a session's sealed validation data is lost after restart

Comment thread devshard/state/machine.go

@x0152 x0152 Jun 15, 2026

Collaborator

A DB write error here can mutate state but drop the tx from the diff, which may cause checksum mismatches and rejected diffs.

Note: I don't know why, but GitHub doesn't show the exact line

var applied []*types.DevshardTx
for _, tx := range txs {
	if err := sm.applyTx(tx); err != nil {
		if tx.GetStartInference() != nil {
			sm.restoreMutable(snap)
			return nil, nil, fmt.Errorf("mandatory start inference: %w", err)
		}
		continue // <--
	}
	applied = append(applied, tx)
	...

Collaborator Author

valid concern

SELECT c.relname
FROM pg_class c
JOIN pg_inherits i ON i.inhrelid = c.oid
JOIN pg_class p ON p.oid = i.inhparent
WHERE p.relname IN ('devshard_sessions', 'devshard_diffs', 'devshard_signatures', 'devshard_snapshots')

Collaborator

ensurePartition creates 8 partitions per epoch, but this pruneBefore query only lists 5 parents - leading to unbounded storage growth over time

Collaborator Author

valid, should be fixed

return c.JSON(http.StatusOK, []prometheusTargetGroup{})
}

versions := s.configManager.GetDevshardVersions().Versions

Collaborator

Small fix to avoid a data race:

versions := slices.Clone(s.configManager.GetDevshardVersions().Versions)

@gmorgachev

Contributor

@0xMayoor @Ryanchen911 @x0152
Thanks for the feedback. Let's definitely include all of these in the next devshard release; I hope we'll have it in the next 1-2 weeks.

For v0.2.13-v2, I created the release from the current state and am going to propose it today.

@a-kuprin

a-kuprin commented Jun 15, 2026

Collaborator Author
> 1. Protocol tag validated against the binary-version allowlist (:103 → :33)
> msg.StateRootAndProtocolVersion (the protocol tag, constant "v2" from domain.go:15) is checked against params.ApprovedVersions[].Name (versiond binary names). These are different concepts — your protocol-version.md says "do not assume it must equal an approved_versions.name entry." It only passes today because the allowlist is empty (len(approved)==0 → nil) or happens to contain "v2". Once approved_versions holds anything else (legacy {name:"v1"}, or a future {name:"v3"} bugfix binary with no protocol bump), every settlement is rejected, escrow.Settled is never set, and funds lock.
> Fix: drop the check (the tag is already bound into the state root via versionHash + quorum sigs), or validate against a separate protocol-version list.

More like a documentation mismatch.

> 2. msg.Nonce checked against live MaxNonce, not a per-escrow snapshot (:95)
> Fees are snapshotted at creation (create_devshard_fee/fee_per_nonce), but max_nonce is read live. A session created at max_nonce=1000 that legitimately ran to nonce 800 becomes unsettleable if governance later lowers max_nonce to 500 (800 > 500 → reject) → funds locked/forfeited.
> Fix: snapshot max_nonce onto the escrow at creation and validate msg.Nonce against the snapshot.

Yes, this should be done; settlement should check the snapshot, not the live parameter.

> 3. Snapshotted fee schedule is never enforced against msg.Fees (:209)
> The only check is totalCost + msg.Fees ≤ escrow.Amount; msg.Fees is never compared to create_devshard_fee + fee_per_nonce * msg.Nonce. So a colluding/buggy host quorum can sign a settlement with Fees inflated up to the full escrow, overpaying validators and eating the creator's refund. The snapshotted schedule is currently decorative.
> Fix: compute expectedFees = create_devshard_fee + fee_per_nonce * msg.Nonce (with overflow checks) and require msg.Fees == expectedFees.

Meaningless, since a colluding consensus can do a lot of bad things, such as claiming there was a lot of inference. And we trust the devshard hosts' consensus.
