chore(upgrade): clean leftover state in v0.2.14 by Ryanchen911 · Pull Request #1287 · gonka-ai/gonka

Ryanchen911 · 2026-06-01T03:56:36Z

Summary

clean leftover inference module state during the v0.2.14 upgrade
re-run legacy epoch-group, top miner, training, and PoC v2 cleanup paths idempotently
include the missing TrainingTaskKvRecordKeyPrefix in training cleanup

Closes #1223

Tests

go test -C /Users/chenjunying/gonka/inference-chain ./app/upgrades/v0_2_14
go test -C /Users/chenjunying/gonka/inference-chain ./app/upgrades/v0_2_12 ./x/inference/keeper

Identified leftovers

Legacy EpochGroupValidationsMap: replaced by per-inference EpochGroupValidationEntry in v0.2.11; this cleanup migrates any remaining current/previous epoch entries and clears the old aggregate map.
TopMiners: cleared in v0.2.12 and no longer used by live paths.
Training state: training task state is removed; cleanup now also includes the previously omitted TrainingTaskKvRecordKeyPrefix.
Legacy PoC v2 prefixes: replaced by model-aware prefixes 58/59/60 in v0.2.12; old raw prefixes are cleared idempotently.

patimen · 2026-06-02T02:49:41Z

/run-integration

gmorgachev · 2026-06-02T03:25:19Z

@Ryanchen911

I think this task should include an analysis of the amount of state / state history used per prefix. Then we can check what can be removed. The state size is still quite big, and we need to understand why.

Adds an offline `inferenced state-stats` command that reports per-store and per-inference-prefix committed state size, with legacy prefixes flagged as cleanup candidates. Backed by a StatePrefixCatalog single-source-of-truth that maps every inference prefix to a readable name. Addresses the review request to analyze state size by prefix before deciding what to remove (issue gonka-ai#1223). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Ryanchen911 · 2026-06-02T09:09:41Z

@gmorgachev good point — agreed that we should drive removal from measured per-prefix size, not just remove the prefixes we already know are dead.

I added an offline analysis command for exactly this: inferenced state-stats (see docs/state-stats.md).

What it does:

opens a stopped node's application.db (or a restored snapshot), loads the latest committed height (or --height), and iterates every module KV store;
prints a per-store size summary (keys / key bytes / value bytes / total), so we can see which module dominates;
for the inference module, attributes every key to a named prefix via a new types.StatePrefixCatalog (single source of truth mapping each prefix in keys.go to a readable label) and flags legacy prefixes — the cleanup candidates;
--legacy-only and --top N to focus the view.
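As a rough illustration of the attribution step described above, here is a minimal Go sketch of mapping keys to named prefixes and tallying sizes. It is not the PR's code: `PrefixEntry`, `KV`, `Stats`, `Attribute`, `Tally`, and the string prefixes are hypothetical stand-ins, and the real `types.StatePrefixCatalog` may be shaped differently. The only detail taken from the thread is the `<unmatched:0xNN>` style label for keys that no catalog entry covers.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// PrefixEntry is a hypothetical catalog row (assumption, not the PR's type).
type PrefixEntry struct {
	Prefix []byte
	Name   string
	Legacy bool
}

// KV is one key/value pair read from a store.
type KV struct {
	Key, Value []byte
}

// Stats accumulates the per-prefix counters shown in the report.
type Stats struct {
	Keys, KeyBytes, ValueBytes int64
	Legacy                     bool
}

// Attribute returns the name and legacy flag of the longest catalog prefix
// that matches key. Keys with no match are labeled by their first byte.
func Attribute(catalog []PrefixEntry, key []byte) (string, bool) {
	best := -1
	for i, e := range catalog {
		if bytes.HasPrefix(key, e.Prefix) &&
			(best < 0 || len(e.Prefix) > len(catalog[best].Prefix)) {
			best = i
		}
	}
	if best >= 0 {
		return catalog[best].Name, catalog[best].Legacy
	}
	if len(key) == 0 {
		return "<unmatched:empty>", false
	}
	return fmt.Sprintf("<unmatched:0x%02x>", key[0]), false
}

// Tally groups key/value pairs by attributed prefix name.
func Tally(catalog []PrefixEntry, kvs []KV) map[string]*Stats {
	out := make(map[string]*Stats)
	for _, kv := range kvs {
		name, legacy := Attribute(catalog, kv.Key)
		s, ok := out[name]
		if !ok {
			s = &Stats{Legacy: legacy}
			out[name] = s
		}
		s.Keys++
		s.KeyBytes += int64(len(kv.Key))
		s.ValueBytes += int64(len(kv.Value))
	}
	return out
}

func main() {
	// Illustrative prefixes only; the real ones live in keys.go.
	catalog := []PrefixEntry{
		{Prefix: []byte("Inferences/"), Name: "Inferences"},
		{Prefix: []byte("TopMiner/"), Name: "TopMiner", Legacy: true},
	}
	kvs := []KV{
		{Key: []byte("Inferences/a"), Value: []byte("payload")},
		{Key: []byte("TopMiner/x"), Value: []byte("old")},
		{Key: []byte{0x48, 0x01}, Value: []byte("?")},
	}
	stats := Tally(catalog, kvs)
	names := make([]string, 0, len(stats))
	for n := range stats {
		names = append(names, n)
	}
	// Largest total first, ties broken by name for stable output.
	sort.Slice(names, func(i, j int) bool {
		a, b := stats[names[i]], stats[names[j]]
		ta, tb := a.KeyBytes+a.ValueBytes, b.KeyBytes+b.ValueBytes
		if ta != tb {
			return ta > tb
		}
		return names[i] < names[j]
	})
	for _, n := range names {
		s := stats[n]
		fmt.Printf("%-20s keys=%d total=%dB legacy=%v\n",
			n, s.Keys, s.KeyBytes+s.ValueBytes, s.Legacy)
	}
}
```

Longest-prefix matching is one possible choice here; it keeps nested prefixes (a parent and a more specific child) from being double-counted under the parent.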

So the workflow to answer "why is state big / what else can we drop":

run state-stats on a current mainnet snapshot → see the biggest prefixes;
anything large + legacy is already removed by this PR's v0.2.14 cleanup (EpochGroupValidations aggregate map, TopMiner, training state, legacy PoC v2);
anything large + non-legacy that looks prunable becomes a follow-up cleanup task, decided from the numbers.

I don't have a mainnet DB locally, so I can't paste the actual breakdown here. If someone with access to a node/snapshot can run inferenced state-stats --home <stopped-node-home> and drop the output here, we can decide on scope: keep this PR as the known-legacy cleanup + analysis tooling, and open a separate issue for any newly-identified large prefixes.

patimen · 2026-06-09T00:18:52Z

/run-integration

patimen · 2026-06-09T22:54:43Z

@Ryanchen911 To run that, we'd need to have the new binary running on mainnet, wouldn't we? Couldn't we use inferenced export instead? Or is there a problem with that? It seems to have issues locally...

Ryanchen911 · 2026-06-10T05:17:50Z

@patimen no mainnet deployment needed — state-stats is an offline, read-only command. It opens the DB exclusively, so you just run it once against a snapshot or a copy of a node's data dir (node stopped).

I think inferenced export won't answer Gleb's question, unfortunately:
export only emits the logical genesis each module's ExportGenesis writes (params, participants, models, bridge…). The leftover/index/cache prefixes we actually want to measure — TopMiner, training state, legacy PoC v2, the various indexes — are not in the export at all, so they'd be invisible.

If running the branch binary against a snapshot is too much friction for this PR, I think we can split it: merge the known-legacy cleanup now, and track the per-prefix size analysis (Gleb's ask) as a separate task where ops can run state-stats on a snapshot whenever convenient. Either way works.

patimen · 2026-06-12T00:21:14Z

I have run this on mainnet... there is no way we're going to be able to do this kind of large-scale pruning in an upgrade handler. Deletion is not cheap in Cosmos 0.53.3; by my estimate it will take a very long time (10-30 minutes) for each major component we need to clear.
We are going to need gradual pruning to remove this state, either with our existing pruner (pruner.go) or something else. For reference, here is the list of what's taking up our state, as of height 4520003:

Store Size Breakdown

Store	Keys	Key Bytes	Value Bytes	Total Size
inference	78,179,591	8,174,257,889	17,398,241,842	23.8 GiB
bls	42,230	1,698,869	623,477,291	596.2 MiB
acc	2,275,754	47,806,043	121,778,453	161.7 MiB
staking	54,829	878,207	100,549,500	96.7 MiB
bank	2,251,747	64,189,672	1,218,968	62.4 MiB
authz	188,724	15,835,938	26,864,448	40.7 MiB
group	303,560	11,812,693	11,910,734	22.6 MiB
streamvesting	2,169	45,529	5,471,088	5.3 MiB
wasm	20,995	1,298,958	1,293,529	2.5 MiB
feegrant	12,971	745,881	1,134,762	1.8 MiB
slashing	6,263	144,245	473,258	603.0 KiB
distribution	13,176	395,506	111,319	494.9 KiB
gov	98	1,281	320,722	314.5 KiB
ibc	3,098	178,254	106,744	278.3 KiB
collateral	3,720	106,730	8,249	112.3 KiB
evidence	45	1,485	5,175	6.5 KiB
genesistransfer	38	1,349	4,777	6.0 KiB
capability	11	245	893	1.1 KiB
transfer	13	480	402	882 B
upgrade	48	495	300	795 B
icahost	4	238	79	317 B
mint	2	2	149	151 B
consensus	1	9	49	58 B
restrictions	2	38	5	43 B
crisis	1	1	14	15 B
bookkeeper	1	12	0	12 B
icacontroller	1	6	2	8 B
params	0	0	0	0 B
nft	0	0	0	0 B
circuit	0	0	0	0 B
feeibc	0	0	0	0 B
TOTAL	83,359,092	8,319,400,055	18,292,972,752	24.8 GiB

Inference Prefix Breakdown

Prefix	Keys	Key Bytes	Value Bytes	Total Size
PoCBatch	14,774,601	975,123,666	5,740,211,563	6.3 GiB
StatsDevelopersByInferenceAndModel	14,367,831	2,134,050,738	2,210,474,926	4.0 GiB
InferenceValidationDetails	14,957,767	1,450,903,399	2,874,536,849	4.0 GiB
StatsDevelopersByTime	10,167,512	1,739,727,404	1,457,006,992	3.0 GiB
StatsDevelopersByInference	10,167,512	1,159,096,368	1,526,209,652	2.5 GiB
PoCValidation	12,749,190	637,459,500	1,627,479,855	2.1 GiB
StatsDevelopersByEpoch	1,606	123,618	991,613,039	945.8 MiB
Inferences	638,653	56,840,117	872,791,024	886.6 MiB
PoCValidationV2	177,533	15,064,581	24,128,878	37.4 MiB
EpochGroupData	1,132	32,622	29,671,320	28.3 MiB
PreservedNodesSnapshot	293	9,929	22,246,426	21.2 MiB
RandomSeed	67,802	1,966,258	12,177,327	13.5 MiB
EpochPerformanceSummary	74,123	2,223,690	4,330,514	6.3 MiB
`<unmatched:0x48>`	4,393	311,903	1,630,109	1.9 MiB
Participants	6,867	144,207	1,274,964	1.4 MiB
MLNodeWeightDistribution	6,814	418,296	946,122	1.3 MiB
PoCV2StoreCommit	6,814	418,296	863,658	1.2 MiB
ExcludedParticipants	3,728	108,112	293,006	391.7 KiB
DevshardEscrows	264	2,376	235,290	232.1 KiB
BridgeTransactionValidators	1,365	120,120	0	117.3 KiB
ConfirmationPoCEvents	602	10,234	49,379	58.2 KiB
PoCDelegation	227	16,122	27,018	42.1 KiB
ParticipantAllowList	1,793	37,653	0	36.8 KiB
InferencesToPrune	293	28,421	0	27.8 KiB
BridgeMintRefunds	86	5,590	13,526	18.7 KiB
DelegationSnapshot	1	1	11,650	11.4 KiB
BridgeTransactions	32	1,376	7,888	9.0 KiB
Epochs	293	2,637	2,073	4.6 KiB
DevshardEscrowsByEpoch	264	4,488	0	4.4 KiB
Params	1	11	3,441	3.4 KiB
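The Total Size column in both tables appears to be key bytes plus value bytes, printed in binary units with one decimal. The Go sketch below reproduces that arithmetic for a few rows copied from the table above and sums the four StatsDevelopersBy* rows. `humanize` and `row` are illustrative names, not the tool's own formatter.

```go
package main

import (
	"fmt"
	"strings"
)

// humanize formats a byte count in binary units with one decimal place.
// It is an assumed reconstruction of the report format, not the tool's code.
func humanize(n int64) string {
	if n < 1024 {
		return fmt.Sprintf("%d B", n)
	}
	units := []string{"KiB", "MiB", "GiB", "TiB"}
	v := float64(n)
	i := -1
	for v >= 1024 && i < len(units)-1 {
		v /= 1024
		i++
	}
	return fmt.Sprintf("%.1f %s", v, units[i])
}

// row holds the raw byte counts of one prefix from the table.
type row struct {
	name                 string
	keyBytes, valueBytes int64
}

func main() {
	rows := []row{
		{"PoCBatch", 975123666, 5740211563},
		{"StatsDevelopersByInferenceAndModel", 2134050738, 2210474926},
		{"InferenceValidationDetails", 1450903399, 2874536849},
		{"StatsDevelopersByTime", 1739727404, 1457006992},
		{"StatsDevelopersByInference", 1159096368, 1526209652},
		{"StatsDevelopersByEpoch", 123618, 991613039},
	}
	var devStats int64
	for _, r := range rows {
		total := r.keyBytes + r.valueBytes
		if strings.HasPrefix(r.name, "StatsDevelopersBy") {
			devStats += total
		}
		fmt.Printf("%-36s %s\n", r.name, humanize(total))
	}
	// Combined size of the four developer-stats indexes.
	fmt.Println("developer-stats total:", humanize(devStats)) // 10.4 GiB
}
```

With these inputs the per-row output matches the table (for example 6.3 GiB for PoCBatch), and the developer-stats sum comes to about 10.4 GiB.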

The first mainnet state-stats run surfaced two large prefixes the catalog did not recognize: developer-stats indexes (~10.4 GiB under "stats/...") and hardware-node registrations ("HardwareNodesValues/value/"). Add them so the breakdown attributes those bytes precisely instead of bucketing them as <unmatched>. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Ryanchen911 · 2026-06-17T02:41:54Z

Agreed on all of it, @patimen — gradual pruning via the existing pruner.go, not the upgrade handler. (The legacy cleanup in this PR only Clears the legacy prefixes, which are ~0 bytes on mainnet per the --legacy-only run, so that part stays instant.)

Good news: k.Prune already runs every block in EndBlock and each Pruner deletes at most PruningMax entries per block for epochs older than its threshold (advancing PruningState), so the rate-limiting you want is already built in — we just need to point it at the big prefixes.

Mapping the big consumers onto the pruner, with how confident I am that each is safe to drop:

InferenceValidationDetails (4.0 GiB) — safe, drop-in. It's keyed by (epochId, inferenceId), so it fits the same pattern as EpochGroupValidationPruner / InferencePruner. And it's only read for the claim window: ClaimRewards rejects anything except the previous epoch ((currentEpochIndex - 1) != msg.EpochIndex), and the validation-params query only reads current/previous epoch. So nothing reads older epochs — pruning epochs older than a small threshold (≥ claim window) breaks nothing. I can add this pruner + tests.
PoCBatch (6.3 GiB) / PoCValidation (2.1 GiB) — already pruned; they're just large due to retention window × volume. If we want them smaller we can revisit PocDataPruningEpochThreshold / PocPruningMax, but no new code needed.
developer-stats (~10.4 GiB across StatsDevelopersBy*) — biggest chunk, but the trickiest and I am NOT yet sure it's safe to drop. Two problems: (1) 3 of its 4 indexes (by-inference, by-time, by-model) are raw string stores that aren't epoch-prefixed, so the generic epoch-ranged Pruner doesn't apply directly — it needs a custom epoch-driven pruner (walk old epochs via the by-epoch index → resolve inference IDs → delete across all 4 indexes); (2) it backs a query API consumed off-chain, so dropping old history is a retention decision for whoever owns that consumer, not something I can decide from the code.
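The custom pruner described for developer-stats (walk an old epoch via the by-epoch index, resolve inference IDs, delete across all indexes) could look roughly like the Go sketch below. Everything in it is hypothetical: `statRecord`, `devStats`, the key layouts, and `pruneEpoch` are in-memory stand-ins chosen for illustration, and the real index structure in the keeper may differ.

```go
package main

import "fmt"

// statRecord is an illustrative developer-stats row (assumed fields).
type statRecord struct {
	InferenceID string
	Epoch       uint64
	Timestamp   int64
	Model       string
}

// devStats holds in-memory stand-ins for the four indexes.
type devStats struct {
	byEpoch     map[uint64][]string   // epoch -> inference IDs
	byInference map[string]statRecord // inference ID -> record
	byTime      map[string]string     // "timestamp/id" -> inference ID
	byModel     map[string]string     // "model/id" -> inference ID
}

func newDevStats() *devStats {
	return &devStats{
		byEpoch:     map[uint64][]string{},
		byInference: map[string]statRecord{},
		byTime:      map[string]string{},
		byModel:     map[string]string{},
	}
}

func timeKey(r statRecord) string  { return fmt.Sprintf("%020d/%s", r.Timestamp, r.InferenceID) }
func modelKey(r statRecord) string { return r.Model + "/" + r.InferenceID }

// add writes one record into all four indexes.
func (d *devStats) add(r statRecord) {
	d.byEpoch[r.Epoch] = append(d.byEpoch[r.Epoch], r.InferenceID)
	d.byInference[r.InferenceID] = r
	d.byTime[timeKey(r)] = r.InferenceID
	d.byModel[modelKey(r)] = r.InferenceID
}

// pruneEpoch removes up to max inferences of one epoch from every index.
// It returns the number removed and whether the epoch is fully drained.
func (d *devStats) pruneEpoch(epoch uint64, max int) (removed int, done bool) {
	ids := d.byEpoch[epoch]
	if max <= 0 {
		return 0, len(ids) == 0
	}
	n := len(ids)
	if n > max {
		n = max
	}
	for _, id := range ids[:n] {
		// The by-inference record supplies the keys of the other indexes.
		if r, ok := d.byInference[id]; ok {
			delete(d.byTime, timeKey(r))
			delete(d.byModel, modelKey(r))
			delete(d.byInference, id)
		}
	}
	if n == len(ids) {
		delete(d.byEpoch, epoch)
		return n, true
	}
	d.byEpoch[epoch] = ids[n:]
	return n, false
}

func main() {
	d := newDevStats()
	d.add(statRecord{"inf-1", 1, 100, "model-a"})
	d.add(statRecord{"inf-2", 1, 101, "model-a"})
	d.add(statRecord{"inf-3", 1, 102, "model-b"})
	d.add(statRecord{"inf-4", 2, 200, "model-a"})

	for {
		n, done := d.pruneEpoch(1, 2)
		fmt.Printf("removed %d, epoch drained: %v\n", n, done)
		if done {
			break
		}
	}
	fmt.Println("records left:", len(d.byInference))
}
```

Resolving the secondary keys from the by-inference record before deleting it is the step that lets an epoch-driven walk reach indexes that are not epoch-prefixed. Whether old history may be dropped at all remains the retention decision noted above.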

Proposed split:

This PR: keep the legacy cleanup + the state-stats tool (now labels the stats/* and hardware-node prefixes instead of <unmatched>).
Follow-up A: add the InferenceValidationDetails pruner — clean, safe ~4 GiB, bled off over blocks. I'll take this.
Follow-up B: developer-stats retention policy + custom pruner, once we confirm how much history the query consumer actually needs.

chore(upgrade): clean leftover state in v0.2.14

b51d8b1

Copilot AI review requested due to automatic review settings June 1, 2026 03:56

Ryanchen911 force-pushed the ryan/1223-v0.2.14-state-cleanup branch from 81a3dbc to b51d8b1 Compare June 1, 2026 04:04

tcharchian added this to the v0.2.14 milestone Jun 1, 2026

tcharchian requested a review from patimen June 1, 2026 21:58

tcharchian linked an issue Jun 1, 2026 that may be closed by this pull request

[P1] Clean up the state #1223

Open

Merge origin/upgrade-v0.2.14 into ryan/1223-v0.2.14-state-cleanup

425e8fd

patimen added 2 commits June 2, 2026 15:20

Merge branch 'upgrade-v0.2.14' into ryan/1223-v0.2.14-state-cleanup

7eea597

Merge branch 'upgrade-v0.2.14' into ryan/1223-v0.2.14-state-cleanup

7ac9f15

Merge branch 'upgrade-v0.2.14' into ryan/1223-v0.2.14-state-cleanup

ff12df7


Ryanchen911 wants to merge 7 commits into gonka-ai:upgrade-v0.2.14 from Ryanchen911:ryan/1223-v0.2.14-state-cleanup
