@PaddyMc PaddyMc commented Mar 25, 2025

Description

We found a case on an Osmosis node where a root key points to a node that doesn't exist. This hangs the pruning process, because fetching the root key fails (it returns ErrVersionDoesNotExist).

There is already code to clean up the dangling node reference, but it is never reached because the function returns ErrVersionDoesNotExist early.

This means that we cannot prune a version of the store, because pruning gets stuck. This PR moves on to the next version in the store if pruning returns a not-found error.
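As a rough illustration of the behaviour described above, here is a minimal sketch using a toy in-memory store. The `toyStore` type, `deleteVersion` and `pruneTo` are hypothetical names made up for this example and are not the IAVL API; only the error name comes from the description.

```go
package main

import (
	"errors"
	"log"
)

// ErrVersionDoesNotExist is a local stand-in for the error named in the
// description; it is not the IAVL definition.
var ErrVersionDoesNotExist = errors.New("version does not exist")

// toyStore is a hypothetical in-memory stand-in for a versioned store.
// roots maps a version to its root node key; nodes records which keys exist.
type toyStore struct {
	roots map[int64]string
	nodes map[string]bool
}

// deleteVersion removes one version. It returns early when the root key is
// absent or points at a missing node, so the cleanup below is never reached
// for a dangling reference.
func (s *toyStore) deleteVersion(v int64) error {
	key, ok := s.roots[v]
	if !ok || !s.nodes[key] {
		return ErrVersionDoesNotExist
	}
	delete(s.nodes, key)
	delete(s.roots, v)
	return nil
}

// pruneTo prunes versions first..to inclusive. On a not-found error it logs
// and moves on to the next version instead of getting stuck on the same one.
// It returns the versions that were skipped.
func pruneTo(s *toyStore, first, to int64) ([]int64, error) {
	var skipped []int64
	for v := first; v <= to; v++ {
		err := s.deleteVersion(v)
		if errors.Is(err, ErrVersionDoesNotExist) {
			log.Printf("error while pruning, moving on: version missing=%d next version=%d", v, v+1)
			skipped = append(skipped, v)
			continue
		}
		if err != nil {
			// Any other error still aborts pruning.
			return skipped, err
		}
	}
	return skipped, nil
}
```

With roots for versions 1 to 3 and the node for version 2 missing, `pruneTo(s, 1, 3)` prunes versions 1 and 3 and reports version 2 as skipped, rather than stopping at version 2.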

Notes about legacy nodes

  • There is a nasty edge case that I think will be hit with legacy nodes, and that we need to consider before merging this in:
  1. If legacy pruning is broken,
  2. legacy pruning will error and set the first version to legacyLatestVersion+1.
  3. This has the side effect of iterating up to the latest non-legacy version available.
  4. This iteration could potentially be large, which matters for:
    1. validators that aren't aware of, or aren't maintaining, their state
    2. log volume, since this PR adds an error log to IAVL for every pruning step that is skipped
    3. chains that haven't fully upgraded to IAVLv1 or that heavily depend on legacy nodes

see:

Downloading state

https://snapshots.testnet.osmosis.zone/

wget -q -O - https://osmosis.fra1.cdn.digitaloceanspaces.com/osmo-test-5/snapshots/v29/osmosis-snapshot-202503251415-27294691.tar.lz4 | lz4 -d | tar -C $HOME/.osmosisd -xvf -

Checking broken stores

Use this PR and run:
osmosis-labs/cosmprund#2

go run main.go check-store-versions /home/ghost/osmosis-states/osmosis-testnet-state/data
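
Conceptually, a check like this looks for versions whose root key does not resolve to an existing node. The sketch below shows that idea on plain maps; it is an assumption about what such a check involves, not the cosmprund implementation, and `danglingVersions` is a hypothetical helper.

```go
package main

import "sort"

// danglingVersions returns, in ascending order, the versions whose root key
// does not resolve to an existing node. roots maps version -> root node key,
// nodes records which node keys exist. Both maps are toy stand-ins for the
// on-disk store.
func danglingVersions(roots map[int64]string, nodes map[string]bool) []int64 {
	var broken []int64
	for v, key := range roots {
		if !nodes[key] {
			broken = append(broken, v)
		}
	}
	// Map iteration order is random in Go, so sort for stable output.
	sort.Slice(broken, func(i, j int) bool { return broken[i] < broken[j] })
	return broken
}
```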

Pruning broken stores

Use this PR and run:
osmosis-labs/cosmprund#2

go run main.go prune /home/ghost/osmosis-states/osmosis-testnet-state/data

State will then be fixed

Things we don't know

Why are there states deleted outside of pruning? Why does this become more apparent with async pruning?

Another version of the fix

cosmos#1048

This fix works in the same way: it simply continues after a version-not-found error, moving past both checks, for version and version+1.

Why this is needed

Currently, if pruning breaks with this error, the chain state starts to grow quickly.

What the fix will look like

Osmosis mainnet with broken state:

osmosis → λ git v28.0.5* → osmosisd start --home ~/osmosis-states/test 2>&1 | grep "version does not exist"
1:34PM ERR Error while pruning, moving on the the next version in the store err="version does not exist" module=server next version=31883938 version missing=31883937
1:34PM ERR Error while pruning, moving on the the next version in the store err="version does not exist" module=server next version=31883939 version missing=31883938

Before this fix, pruning would have hung and the state would bloat.

This is Osmosis testnet with broken state:

1:55PM INF service stop impl="Peer{MConn{176.9.82.221:12556} ade4d8bc8cbe014af6ebdf3cb7b1e9ad36f412c0 out}" module=p2p msg="Stopping Peer service" peer=ade4d8bc8cbe014af6ebdf3cb7b1e9ad36f412c0@176.9.82.221:12556
1:55PM ERR Error while pruning, moving on the the next version in the store err="version does not exist" module=server next version=7202688 version missing=7202687
1:55PM ERR Error while pruning, moving on the the next version in the store err="version does not exist" module=server next version=7202689 version missing=7202688
1:55PM ERR Error while pruning, moving on the the next version in the store err="version does not exist" module=server next version=7202690 version missing=7202689
1:55PM INF commit is for a block we do not know about; set ProposalBlock=nil commit=D8FFDC467BB88FE33CD49519B063E564BFC219668D4734F8A3DC05EC441B06D8 commit_round=0 height=27208281 module=consensus proposal=
1:55PM INF received complete proposal block hash=D8FFDC467BB88FE33CD49519B063E564BFC219668D4734F8A3DC05EC441B06D8 height=27208281 module=consensus
1:55PM INF finalizing commit of block hash=D8FFDC467BB88FE33CD49519B063E564BFC219668D4734F8A3DC05EC441B06D8 height=27208281 module=consensus num_txs=0 root=08CB5DDE28307231EF8D5D5B9BEF99BB803058B08AF1FCE0095E5620E381E14F
1:55PM INF finalized block block_app_hash=BA816B7934B12755217F0F4BDA6264743CDC6E18B3E8858D86CE37618EBA57B7 height=27208281 module=state num_txs_res=0 num_val_updates=0
1:55PM INF executed block app_hash=BA816B7934B12755217F0F4BDA6264743CDC6E18B3E8858D86CE37618EBA57B7 height=27208281 module=state
1:55PM INF committed state block_app_hash=08CB5DDE28307231EF8D5D5B9BEF99BB803058B08AF1FCE0095E5620E381E14F height=27208281 module=state
1:55PM INF Timed out dur=443.168164 height=27208282 module=consensus round=0 step=RoundStepNewHeight
1:55PM ERR Error while pruning, moving on the the next version in the store err="version does not exist" module=server next version=6667061 version missing=6667060
1:55PM INF Timed out dur=1600 height=27208282 module=consensus round=0 step=RoundStepPropose
1:55PM ERR Error while pruning, moving on the the next version in the store err="version does not exist" module=server next version=7203887 version missing=7203886
1:55PM ERR Error while pruning, moving on the the next version in the store err="version does not exist" module=server next version=7203888 version missing=7203887

This represents a large backlog: pruning is at version 7203887 while the chain is at height 27208281.
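
To gauge the backlog from logs like the ones above, something along these lines could be used. This is a sketch only: `missingVersions` and `backlog` are hypothetical helpers, the regular expression is based on the `version missing=N` field in the log lines shown here, and the subtraction assumes store versions correspond to block heights.

```go
package main

import (
	"regexp"
	"strconv"
)

// missingRe matches the "version missing=N" field of the pruning error log.
var missingRe = regexp.MustCompile(`version missing=(\d+)`)

// missingVersions extracts the skipped versions from log lines, in the order
// they appear. Lines without the field are ignored.
func missingVersions(lines []string) []int64 {
	var out []int64
	for _, line := range lines {
		m := missingRe.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		v, err := strconv.ParseInt(m[1], 10, 64)
		if err != nil {
			continue
		}
		out = append(out, v)
	}
	return out
}

// backlog returns how many versions separate the pruning position from the
// chain height, assuming versions and heights use the same numbering.
func backlog(pruneVersion, height int64) int64 {
	return height - pruneVersion
}
```

For the testnet numbers above, `backlog(7203887, 27208281)` gives 20004394 versions.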

@ValarDragon ValarDragon left a comment

Reviewed the high level
