chore(upgrade): clean leftover state in v0.2.14 #1287

Open
Ryanchen911 wants to merge 7 commits into gonka-ai:upgrade-v0.2.14 from Ryanchen911:ryan/1223-v0.2.14-state-cleanup

Conversation

@Ryanchen911

@Ryanchen911 Ryanchen911 commented Jun 1, 2026


Summary

  • clean leftover inference module state during the v0.2.14 upgrade
  • re-run legacy epoch-group, top miner, training, and PoC v2 cleanup paths idempotently
  • include the missing TrainingTaskKvRecordKeyPrefix in training cleanup

Closes #1223

Tests

  • go test -C /Users/chenjunying/gonka/inference-chain ./app/upgrades/v0_2_14
  • go test -C /Users/chenjunying/gonka/inference-chain ./app/upgrades/v0_2_12 ./x/inference/keeper

Identified leftovers

  • Legacy EpochGroupValidationsMap: replaced by per-inference EpochGroupValidationEntry in v0.2.11; this cleanup migrates any remaining current/previous epoch entries and clears the old aggregate map.
  • TopMiners: cleared in v0.2.12 and no longer used by live paths.
  • Training state: training task state is removed; cleanup now also includes the previously omitted TrainingTaskKvRecordKeyPrefix.
  • Legacy PoC v2 prefixes: replaced by model-aware prefixes 58/59/60 in v0.2.12; old raw prefixes are cleared idempotently.
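
The idempotency claim above can be illustrated with a minimal sketch. It assumes a plain in-memory map standing in for the module's KV store; `ClearPrefix` and the key strings are hypothetical names for illustration, not the PR's actual helpers. The point is only that a prefix clear deletes nothing on a second run, so re-running it in a later upgrade is harmless.

```go
package main

import (
	"fmt"
	"strings"
)

// ClearPrefix deletes every key that starts with prefix and returns how many
// keys were removed. It is a hypothetical stand-in for a prefix-clear helper;
// the map plays the role of a KV store.
func ClearPrefix(store map[string][]byte, prefix string) int {
	removed := 0
	for k := range store {
		if strings.HasPrefix(k, prefix) {
			delete(store, k) // deleting the current key while ranging is safe in Go
			removed++
		}
	}
	return removed
}

func main() {
	store := map[string][]byte{
		"TopMiner/a":   []byte("x"),
		"TopMiner/b":   []byte("y"),
		"Participants": []byte("z"),
	}
	fmt.Println("first run removed:", ClearPrefix(store, "TopMiner/"))
	fmt.Println("second run removed:", ClearPrefix(store, "TopMiner/"))
	fmt.Println("keys left:", len(store))
}
```

The second call finds no matching keys and leaves unrelated keys untouched, which is the property the cleanup paths rely on.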

Copilot AI review requested due to automatic review settings June 1, 2026 03:56
@Ryanchen911 Ryanchen911 force-pushed the ryan/1223-v0.2.14-state-cleanup branch from 81a3dbc to b51d8b1 Compare June 1, 2026 04:04
@tcharchian tcharchian added this to the v0.2.14 milestone Jun 1, 2026
@tcharchian tcharchian requested a review from patimen June 1, 2026 21:58
@tcharchian tcharchian linked an issue Jun 1, 2026 that may be closed by this pull request
@patimen

patimen commented Jun 2, 2026

Contributor

/run-integration

@gmorgachev

Contributor

@Ryanchen911

I think this task should include an analysis of the amount of state / state history used per prefix. Then we can check what can be removed. The state size is still quite big, and we need to understand why.

Adds an offline `inferenced state-stats` command that reports per-store and
per-inference-prefix committed state size, with legacy prefixes flagged as
cleanup candidates. Backed by a StatePrefixCatalog single-source-of-truth that
maps every inference prefix to a readable name.

Addresses the review request to analyze state size by prefix before deciding
what to remove (issue gonka-ai#1223).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Ryanchen911

Author

@gmorgachev good point — agreed that we should drive removal from measured per-prefix size, not just remove the prefixes we already know are dead.

I added an offline analysis command for exactly this: inferenced state-stats (see docs/state-stats.md).

What it does:

  • opens a stopped node's application.db (or a restored snapshot), loads the latest committed height (or --height), and iterates every module KV store;
  • prints a per-store size summary (keys / key bytes / value bytes / total), so we can see which module dominates;
  • for the inference module, attributes every key to a named prefix via a new types.StatePrefixCatalog (single source of truth mapping each prefix in keys.go to a readable label) and flags legacy prefixes — the cleanup candidates;
  • --legacy-only and --top N to focus the view.
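
As a rough illustration of the attribution step described above, here is a minimal sketch. It assumes an in-memory map in place of the real store iterator, and `PrefixEntry`, `Stats`, `classify`, and `Tally` are hypothetical names; the actual shape of `types.StatePrefixCatalog` is not shown in this thread. Each key goes to the longest matching catalog prefix, and anything unmatched is bucketed by its first byte.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// PrefixEntry is a hypothetical catalog row: raw key prefix, readable label,
// and a flag marking the prefix as legacy (a cleanup candidate).
type PrefixEntry struct {
	Prefix []byte
	Name   string
	Legacy bool
}

// Stats accumulates key count and byte sizes for one label.
type Stats struct {
	Keys, KeyBytes, ValueBytes int
}

// classify returns the label of the longest catalog prefix that matches key,
// or an "<unmatched:0xNN>" bucket named after the first key byte.
func classify(catalog []PrefixEntry, key []byte) string {
	best := -1
	for i, e := range catalog {
		if bytes.HasPrefix(key, e.Prefix) &&
			(best < 0 || len(e.Prefix) > len(catalog[best].Prefix)) {
			best = i
		}
	}
	if best >= 0 {
		return catalog[best].Name
	}
	if len(key) == 0 {
		return "<unmatched:empty>"
	}
	return fmt.Sprintf("<unmatched:0x%02x>", key[0])
}

// Tally attributes every key/value pair to a label and sums the sizes.
func Tally(catalog []PrefixEntry, kv map[string][]byte) map[string]*Stats {
	out := map[string]*Stats{}
	for k, v := range kv {
		name := classify(catalog, []byte(k))
		if out[name] == nil {
			out[name] = &Stats{}
		}
		out[name].Keys++
		out[name].KeyBytes += len(k)
		out[name].ValueBytes += len(v)
	}
	return out
}

func main() {
	catalog := []PrefixEntry{
		{Prefix: []byte("TopMiner/"), Name: "TopMiner", Legacy: true},
		{Prefix: []byte("stats/"), Name: "StatsDevelopers"},
	}
	kv := map[string][]byte{
		"TopMiner/a": []byte("xy"),
		"stats/1":    []byte("abcd"),
		"Hzz":        []byte("q"),
	}
	legacy := map[string]bool{}
	for _, e := range catalog {
		legacy[e.Name] = e.Legacy
	}
	stats := Tally(catalog, kv)
	names := make([]string, 0, len(stats))
	for n := range stats {
		names = append(names, n)
	}
	sort.Strings(names) // deterministic output order
	for _, n := range names {
		s := stats[n]
		fmt.Printf("%s keys=%d total=%d legacy=%v\n", n, s.Keys, s.KeyBytes+s.ValueBytes, legacy[n])
	}
}
```

Longest-prefix matching is a design assumption here: it lets a more specific entry take precedence over a broader one that shares the same leading bytes.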

So the workflow to answer "why is state big / what else can we drop":

  1. run state-stats on a current mainnet snapshot → see the biggest prefixes;
  2. anything large + legacy is already removed by this PR's v0.2.14 cleanup (EpochGroupValidations aggregate map, TopMiner, training state, legacy PoC v2);
  3. anything large + non-legacy that looks prunable becomes a follow-up cleanup task, decided from the numbers.

I don't have a mainnet DB locally, so I can't paste the actual breakdown here. If someone with access to a node/snapshot can run inferenced state-stats --home <stopped-node-home> and drop the output here, we can decide on scope: keep this PR as the known-legacy cleanup + analysis tooling, and open a separate issue for any newly-identified large prefixes.

@patimen

patimen commented Jun 9, 2026

Contributor

/run-integration

@patimen

patimen commented Jun 9, 2026

Contributor

@Ryanchen911 To run that, we'd need to have the new binary running on mainnet, wouldn't we? Couldn't we use inferenced export instead? Or is there a problem with that? It seems to have issues locally...

@Ryanchen911

Author

@patimen no mainnet deployment is needed: state-stats is an offline, read-only command. It opens the DB exclusively, so you just run it once against a snapshot or a copy of a node's data dir, with the node stopped.

I think inferenced export won't answer Gleb's question, unfortunately:
export only emits the logical genesis each module's ExportGenesis writes (params, participants, models, bridge…). The leftover/index/cache prefixes we actually want to measure — TopMiner, training state, legacy PoC v2, the various indexes — are not in the export at all, so they'd be invisible.

If running the branch binary against a snapshot is too much friction for this PR, I think we can split it: merge the known-legacy cleanup now, and track the per-prefix size analysis (Gleb's ask) as a separate task where ops can run state-stats on a snapshot whenever convenient. Either way works.

@patimen

patimen commented Jun 12, 2026

Contributor

I have run this on mainnet, and there is no way we're going to be able to do this kind of large-scale pruning in an upgrade handler. Deletion is not cheap in Cosmos 0.53.3; by my estimate it will take a very long time (10-30 minutes) for each major component we need to clear.
We are going to need gradual pruning to remove this state, either with our existing pruner (pruner.go) or something else. For reference, here is what's taking up our state as of height 4520003:

Store Size Breakdown

| Store | Keys | Key Bytes | Value Bytes | Total Size |
|---|---:|---:|---:|---:|
| inference | 78,179,591 | 8,174,257,889 | 17,398,241,842 | 23.8 GiB |
| bls | 42,230 | 1,698,869 | 623,477,291 | 596.2 MiB |
| acc | 2,275,754 | 47,806,043 | 121,778,453 | 161.7 MiB |
| staking | 54,829 | 878,207 | 100,549,500 | 96.7 MiB |
| bank | 2,251,747 | 64,189,672 | 1,218,968 | 62.4 MiB |
| authz | 188,724 | 15,835,938 | 26,864,448 | 40.7 MiB |
| group | 303,560 | 11,812,693 | 11,910,734 | 22.6 MiB |
| streamvesting | 2,169 | 45,529 | 5,471,088 | 5.3 MiB |
| wasm | 20,995 | 1,298,958 | 1,293,529 | 2.5 MiB |
| feegrant | 12,971 | 745,881 | 1,134,762 | 1.8 MiB |
| slashing | 6,263 | 144,245 | 473,258 | 603.0 KiB |
| distribution | 13,176 | 395,506 | 111,319 | 494.9 KiB |
| gov | 98 | 1,281 | 320,722 | 314.5 KiB |
| ibc | 3,098 | 178,254 | 106,744 | 278.3 KiB |
| collateral | 3,720 | 106,730 | 8,249 | 112.3 KiB |
| evidence | 45 | 1,485 | 5,175 | 6.5 KiB |
| genesistransfer | 38 | 1,349 | 4,777 | 6.0 KiB |
| capability | 11 | 245 | 893 | 1.1 KiB |
| transfer | 13 | 480 | 402 | 882 B |
| upgrade | 48 | 495 | 300 | 795 B |
| icahost | 4 | 238 | 79 | 317 B |
| mint | 2 | 2 | 149 | 151 B |
| consensus | 1 | 9 | 49 | 58 B |
| restrictions | 2 | 38 | 5 | 43 B |
| crisis | 1 | 1 | 14 | 15 B |
| bookkeeper | 1 | 12 | 0 | 12 B |
| icacontroller | 1 | 6 | 2 | 8 B |
| params | 0 | 0 | 0 | 0 B |
| nft | 0 | 0 | 0 | 0 B |
| circuit | 0 | 0 | 0 | 0 B |
| feeibc | 0 | 0 | 0 | 0 B |
| TOTAL | 83,359,092 | 8,319,400,055 | 18,292,972,752 | 24.8 GiB |

Inference Prefix Breakdown

| Prefix | Keys | Key Bytes | Value Bytes | Total Size |
|---|---:|---:|---:|---:|
| PoCBatch | 14,774,601 | 975,123,666 | 5,740,211,563 | 6.3 GiB |
| StatsDevelopersByInferenceAndModel | 14,367,831 | 2,134,050,738 | 2,210,474,926 | 4.0 GiB |
| InferenceValidationDetails | 14,957,767 | 1,450,903,399 | 2,874,536,849 | 4.0 GiB |
| StatsDevelopersByTime | 10,167,512 | 1,739,727,404 | 1,457,006,992 | 3.0 GiB |
| StatsDevelopersByInference | 10,167,512 | 1,159,096,368 | 1,526,209,652 | 2.5 GiB |
| PoCValidation | 12,749,190 | 637,459,500 | 1,627,479,855 | 2.1 GiB |
| StatsDevelopersByEpoch | 1,606 | 123,618 | 991,613,039 | 945.8 MiB |
| Inferences | 638,653 | 56,840,117 | 872,791,024 | 886.6 MiB |
| PoCValidationV2 | 177,533 | 15,064,581 | 24,128,878 | 37.4 MiB |
| EpochGroupData | 1,132 | 32,622 | 29,671,320 | 28.3 MiB |
| PreservedNodesSnapshot | 293 | 9,929 | 22,246,426 | 21.2 MiB |
| RandomSeed | 67,802 | 1,966,258 | 12,177,327 | 13.5 MiB |
| EpochPerformanceSummary | 74,123 | 2,223,690 | 4,330,514 | 6.3 MiB |
| `<unmatched:0x48>` | 4,393 | 311,903 | 1,630,109 | 1.9 MiB |
| Participants | 6,867 | 144,207 | 1,274,964 | 1.4 MiB |
| MLNodeWeightDistribution | 6,814 | 418,296 | 946,122 | 1.3 MiB |
| PoCV2StoreCommit | 6,814 | 418,296 | 863,658 | 1.2 MiB |
| ExcludedParticipants | 3,728 | 108,112 | 293,006 | 391.7 KiB |
| DevshardEscrows | 264 | 2,376 | 235,290 | 232.1 KiB |
| BridgeTransactionValidators | 1,365 | 120,120 | 0 | 117.3 KiB |
| ConfirmationPoCEvents | 602 | 10,234 | 49,379 | 58.2 KiB |
| PoCDelegation | 227 | 16,122 | 27,018 | 42.1 KiB |
| ParticipantAllowList | 1,793 | 37,653 | 0 | 36.8 KiB |
| InferencesToPrune | 293 | 28,421 | 0 | 27.8 KiB |
| BridgeMintRefunds | 86 | 5,590 | 13,526 | 18.7 KiB |
| DelegationSnapshot | 1 | 1 | 11,650 | 11.4 KiB |
| BridgeTransactions | 32 | 1,376 | 7,888 | 9.0 KiB |
| Epochs | 293 | 2,637 | 2,073 | 4.6 KiB |
| DevshardEscrowsByEpoch | 264 | 4,488 | 0 | 4.4 KiB |
| Params | 1 | 11 | 3,441 | 3.4 KiB |
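
For readers checking the numbers: in the rows spot-checked here, the Total Size column equals key bytes plus value bytes, shown in binary units with one decimal place. The sketch below reproduces that formatting; `HumanSize` is a hypothetical helper written for this illustration, not the formatter the tool actually uses.

```go
package main

import "fmt"

// HumanSize formats a byte count with binary units (1 KiB = 1024 B) and one
// decimal place, matching the look of the Total Size column above.
// Hypothetical helper, not taken from the state-stats source.
func HumanSize(n uint64) string {
	if n < 1024 {
		return fmt.Sprintf("%d B", n)
	}
	units := []string{"KiB", "MiB", "GiB", "TiB"}
	v := float64(n)
	u := -1
	// Divide by 1024 until the value fits the unit or units run out.
	for v >= 1024 && u < len(units)-1 {
		v /= 1024
		u++
	}
	return fmt.Sprintf("%.1f %s", v, units[u])
}

func main() {
	// inference store: key bytes + value bytes from the store table.
	fmt.Println(HumanSize(8174257889 + 17398241842))
	// PoCBatch prefix: key bytes + value bytes from the prefix table.
	fmt.Println(HumanSize(975123666 + 5740211563))
	// Epochs prefix.
	fmt.Println(HumanSize(2637 + 2073))
}
```

Note that these are raw key and value byte counts; they say nothing about on-disk overhead from the underlying database.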

The first mainnet state-stats run surfaced two large prefixes the catalog did
not recognize: developer-stats indexes (~10.4 GiB under "stats/...") and
hardware-node registrations ("HardwareNodesValues/value/"). Add them so the
breakdown attributes those bytes precisely instead of bucketing them as
<unmatched>.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Ryanchen911

Author

Agreed on all of it, @patimen — gradual pruning via the existing pruner.go, not the upgrade handler. (The legacy cleanup in this PR only Clears the legacy prefixes, which are ~0 bytes on mainnet per the --legacy-only run, so that part stays instant.)

Good news: k.Prune already runs every block in EndBlock and each Pruner deletes at most PruningMax entries per block for epochs older than its threshold (advancing PruningState), so the rate-limiting you want is already built in — we just need to point it at the big prefixes.
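
A minimal sketch of that rate-limiting idea, under stated assumptions: `Store`, `PruneBlock`, and `BlocksToDrain` are hypothetical names, the store is an in-memory map keyed by epoch, and "older than its threshold" is read as "epoch index below current epoch minus threshold". The real pruner also tracks its position in PruningState; this sketch simply rescans on every call.

```go
package main

import (
	"fmt"
	"sort"
)

// Store is an in-memory stand-in for an epoch-keyed collection:
// epoch index -> entry IDs recorded in that epoch.
type Store map[uint64][]string

// PruneBlock deletes at most maxPerBlock entries from epochs older than
// currentEpoch-threshold, oldest epoch first, and returns how many it deleted.
// It models one EndBlock pass of a bounded pruner.
func PruneBlock(s Store, currentEpoch, threshold uint64, maxPerBlock int) int {
	if maxPerBlock <= 0 || currentEpoch <= threshold {
		return 0
	}
	cutoff := currentEpoch - threshold // epochs below cutoff are prunable
	epochs := make([]uint64, 0, len(s))
	for e := range s {
		if e < cutoff {
			epochs = append(epochs, e)
		}
	}
	sort.Slice(epochs, func(i, j int) bool { return epochs[i] < epochs[j] })

	deleted := 0
	for _, e := range epochs {
		if deleted == maxPerBlock {
			break
		}
		n := maxPerBlock - deleted // remaining budget for this block
		if n >= len(s[e]) {
			deleted += len(s[e])
			delete(s, e) // epoch fully drained
		} else {
			s[e] = s[e][n:] // partial drain; the rest waits for later blocks
			deleted += n
		}
	}
	return deleted
}

// BlocksToDrain returns how many blocks are needed to delete total entries at
// maxPerBlock entries per block (ceiling division).
func BlocksToDrain(total, maxPerBlock int) int {
	return (total + maxPerBlock - 1) / maxPerBlock
}

func main() {
	s := Store{1: {"a", "b", "c"}, 2: {"d"}, 9: {"e"}}
	for block := 1; block <= 3; block++ {
		fmt.Printf("block %d deleted %d\n", block, PruneBlock(s, 10, 2, 2))
	}
	// Illustrative rate only: 1000 per block is an assumed value, not the
	// chain's configured PruningMax.
	fmt.Println(BlocksToDrain(14957767, 1000))
}
```

The ceiling division gives a quick way to estimate how many blocks a backlog takes to bleed off once a per-block limit is chosen.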

Mapping the big consumers onto the pruner, with how confident I am that each is safe to drop:

  • InferenceValidationDetails (4.0 GiB) — safe, drop-in. It's keyed by (epochId, inferenceId), so it fits the same pattern as EpochGroupValidationPruner / InferencePruner. And it's only read for the claim window: ClaimRewards rejects anything except the previous epoch ((currentEpochIndex - 1) != msg.EpochIndex), and the validation-params query only reads current/previous epoch. So nothing reads older epochs — pruning epochs older than a small threshold (≥ claim window) breaks nothing. I can add this pruner + tests.

  • PoCBatch (6.3 GiB) / PoCValidation (2.1 GiB) — already pruned; they're just large due to retention window × volume. If we want them smaller we can revisit PocDataPruningEpochThreshold / PocPruningMax, but no new code needed.

  • developer-stats (~10.4 GiB across StatsDevelopersBy*) — biggest chunk, but the trickiest and I am NOT yet sure it's safe to drop. Two problems: (1) 3 of its 4 indexes (by-inference, by-time, by-model) are raw string stores that aren't epoch-prefixed, so the generic epoch-ranged Pruner doesn't apply directly — it needs a custom epoch-driven pruner (walk old epochs via the by-epoch index → resolve inference IDs → delete across all 4 indexes); (2) it backs a query API consumed off-chain, so dropping old history is a retention decision for whoever owns that consumer, not something I can decide from the code.
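
To make the custom developer-stats pruner idea more concrete, here is a rough sketch of the cascade: walk an old epoch via the by-epoch index, resolve inference IDs, and delete across the other indexes. Everything in it is an assumption for illustration: the in-memory maps, the `Record` layout, and the idea that each record carries the keys needed to find its by-time and by-model entries. The real index layout is not described in this thread.

```go
package main

import "fmt"

// Record is a hypothetical per-inference stats row that remembers the keys
// under which it was written to the non-epoch indexes.
type Record struct {
	TimeKey  string
	ModelKey string
}

// StatsIndexes models the four developer-stats indexes as in-memory maps.
type StatsIndexes struct {
	ByEpoch     map[uint64][]string // epoch -> inference IDs (epoch-keyed)
	ByInference map[string]Record   // inference ID -> record
	ByTime      map[string]string   // time key -> inference ID
	ByModel     map[string]string   // model key -> inference ID
}

// PruneEpoch removes up to budget inferences belonging to epoch from all four
// indexes and returns how many were removed. Calling it once per block with a
// small budget keeps the per-block cost bounded.
func PruneEpoch(ix *StatsIndexes, epoch uint64, budget int) int {
	ids := ix.ByEpoch[epoch]
	n := len(ids)
	if budget < n {
		n = budget
	}
	if n <= 0 {
		return 0
	}
	for _, id := range ids[:n] {
		if rec, ok := ix.ByInference[id]; ok {
			delete(ix.ByTime, rec.TimeKey)
			delete(ix.ByModel, rec.ModelKey)
			delete(ix.ByInference, id)
		}
	}
	if n == len(ids) {
		delete(ix.ByEpoch, epoch) // epoch fully pruned
	} else {
		ix.ByEpoch[epoch] = ids[n:] // remainder is handled on a later call
	}
	return n
}

func main() {
	ix := &StatsIndexes{
		ByEpoch:     map[uint64][]string{5: {"i1", "i2"}, 6: {"i3"}},
		ByInference: map[string]Record{"i1": {"t1", "m1"}, "i2": {"t2", "m2"}, "i3": {"t3", "m3"}},
		ByTime:      map[string]string{"t1": "i1", "t2": "i2", "t3": "i3"},
		ByModel:     map[string]string{"m1": "i1", "m2": "i2", "m3": "i3"},
	}
	fmt.Println(PruneEpoch(ix, 5, 1), PruneEpoch(ix, 5, 10))
	fmt.Println(len(ix.ByInference), len(ix.ByTime), len(ix.ByModel))
}
```

Whether old epochs may be dropped at all is still the retention question raised above; the sketch only shows that the deletion itself can be driven from the epoch index and rate-limited.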

Proposed split:

  1. This PR: keep the legacy cleanup + the state-stats tool (now labels the stats/* and hardware-node prefixes instead of <unmatched>).
  2. Follow-up A: add the InferenceValidationDetails pruner — clean, safe ~4 GiB, bled off over blocks. I'll take this.
  3. Follow-up B: developer-stats retention policy + custom pruner, once we confirm how much history the query consumer actually needs.

Development

Successfully merging this pull request may close these issues.

[P1] Clean up the state