fix: continue pruning if version is not found #1063

Merged

@PaddyMc PaddyMc commented Mar 26, 2025

Description

We found a case on an Osmosis node where a root key points to a node that doesn't exist. This hangs the pruning process because fetching the root key fails (it returns ErrVersionDoesNotExist).

There is already code to clean up the dangling reference node, but it is never reached because the function returns ErrVersionDoesNotExist early.

This means pruning gets stuck and cannot prune that version of the store. This PR moves on to the next version in the store if pruning returns a not-found error.
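A minimal sketch of the skip-and-continue behaviour described above, assuming a simplified in-memory store. The `store` type, `deleteVersion`, and `pruneUpTo` here are hypothetical stand-ins and not IAVL's actual code; only the `ErrVersionDoesNotExist` name and its message come from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrVersionDoesNotExist mirrors the error named in this PR; everything else
// in this sketch is hypothetical.
var ErrVersionDoesNotExist = errors.New("version does not exist")

// store is a toy stand-in that maps a version to its root key.
type store struct {
	roots map[int64][]byte
}

// deleteVersion removes a version, or reports that it is missing.
func (s *store) deleteVersion(version int64) error {
	if _, ok := s.roots[version]; !ok {
		return ErrVersionDoesNotExist
	}
	delete(s.roots, version)
	return nil
}

// pruneUpTo deletes versions in [first, last]. A missing version is logged and
// skipped instead of aborting the run; any other error still stops it.
func (s *store) pruneUpTo(first, last int64) ([]int64, error) {
	var skipped []int64
	for v := first; v <= last; v++ {
		err := s.deleteVersion(v)
		if errors.Is(err, ErrVersionDoesNotExist) {
			fmt.Printf("skipping missing version %d, moving on to %d\n", v, v+1)
			skipped = append(skipped, v)
			continue
		}
		if err != nil {
			return skipped, err
		}
	}
	return skipped, nil
}

func main() {
	// Version 2 is missing, as if it had been deleted outside of pruning.
	s := &store{roots: map[int64][]byte{1: {0x01}, 3: {0x03}, 4: {0x04}}}
	skipped, err := s.pruneUpTo(1, 3)
	fmt.Println(skipped, err, len(s.roots))
}
```

Without the `continue` branch, the loop would return at version 2 on every attempt and versions 3 and later would never be pruned, which is the state growth this PR is trying to avoid.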

Notes about legacy nodes

  • There is a nasty edge case that I think legacy nodes will hit, and that we need to consider before merging this:
  1. If legacy pruning is broken,
  2. legacy pruning will error and set the first version to legacyLatestVersion+1.
  3. This has the side effect of iterating up to the latest non-legacy version available.
  4. This iteration could be large for some validators:
    1. those that aren't aware of, or aren't maintaining, their state
    2. they will see lots of log lines, as this PR adds an error log to IAVL for each pruning that's skipped
    3. chains that haven't fully upgraded to IAVLv1 or that depend heavily on legacy nodes

see:

Downloading state

https://snapshots.testnet.osmosis.zone/

wget -q -O - https://osmosis.fra1.cdn.digitaloceanspaces.com/osmo-test-5/snapshots/v29/osmosis-snapshot-202503251415-27294691.tar.lz4 | lz4 -d | tar -C $HOME/.osmosisd -xvf -

Or, right now, the Polkachu snapshots have an issue with bank and concentratedliquidity:

https://polkachu.com/tendermint_snapshots/osmosis

wget -O osmosis_32290276.tar.lz4 https://snapshots.polkachu.com/snapshots/osmosis/osmosis_32290276.tar.lz4 --inet4-only
Analyzing store versions for outliers...

Stores with potentially excessive versions (may need pruning):
Store 'concentratedliquidity' has 161172 versions (average: 6746.03) - This store may need pruning
Store 'bank' has 80752 versions (average: 6746.03) - This store may need pruning

Stores with large version gaps (may indicate inconsistent pruning):
Store 'concentratedliquidity' has a large version gap: 161171 (from 32129108 to 32290279) - This may indicate inconsistent pruning
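The kind of check that produces a report like the one above can be sketched as follows. This is an illustration only: the `storeStats` type, the `findOutliers` helper, and the threshold factor are assumptions and not the tool's actual heuristic. The filler stores in `main` are made-up data; only the concentratedliquidity and bank version counts and the concentratedliquidity version range come from the output above.

```go
package main

import "fmt"

// storeStats is a hypothetical per-store summary.
type storeStats struct {
	name     string
	versions int // number of versions still retained on disk
}

// findOutliers returns stores whose version count exceeds factor times the
// mean across all stores. The factor is an assumed threshold.
func findOutliers(stats []storeStats, factor float64) []string {
	if len(stats) == 0 {
		return nil
	}
	total := 0
	for _, s := range stats {
		total += s.versions
	}
	mean := float64(total) / float64(len(stats))

	var out []string
	for _, s := range stats {
		if float64(s.versions) > factor*mean {
			out = append(out, s.name)
		}
	}
	return out
}

// versionGap is the span between the oldest and newest retained version.
func versionGap(oldest, newest int64) int64 {
	return newest - oldest
}

func main() {
	stats := []storeStats{
		{name: "concentratedliquidity", versions: 161172},
		{name: "bank", versions: 80752},
	}
	// Made-up filler stores that were pruned normally.
	for i := 0; i < 8; i++ {
		stats = append(stats, storeStats{name: fmt.Sprintf("example%d", i), versions: 100})
	}
	fmt.Println(findOutliers(stats, 2.0))
	fmt.Println(versionGap(32129108, 32290279))
}
```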

I ran this PR against this state on Osmosis mainnet and it fixed the issue; see osmosis-labs/osmosis#9333.

Checking broken stores

Use this PR and run:
osmosis-labs/cosmprund#2

go run main.go check-store-versions /home/ghost/osmosis-states/osmosis-testnet-state/data

Pruning broken stores

Use this PR and run:
osmosis-labs/cosmprund#2

go run main.go prune /home/ghost/osmosis-states/osmosis-testnet-state/data

State will then be fixed

Things we don't know

Why are there states deleted outside of pruning? Why does this become more apparent with async pruning?

Another version of the fix

#1048

This fix works in the same way and just continues after a version-not-found error. It moves past both checks, for version and for version+1.

Why this is needed

Currently, if pruning breaks with this error, the chain state will start to grow quickly.

What the fix will look like

Osmosis mainnet with broken state:

osmosis → λ git v28.0.5* → osmosisd start --home ~/osmosis-states/test 2>&1 | grep "version does not exist"
1:34PM ERR Error while pruning, moving on the the next version in the store err="version does not exist" module=server next version=31883938 version missing=31883937
1:34PM ERR Error while pruning, moving on the the next version in the store err="version does not exist" module=server next version=31883939 version missing=31883938

Before this change, pruning would have gotten stuck here and the state would bloat.

This is osmosis testnet with broken state

1:55PM INF service stop impl="Peer{MConn{176.9.82.221:12556} ade4d8bc8cbe014af6ebdf3cb7b1e9ad36f412c0 out}" module=p2p msg="Stopping Peer service" peer=ade4d8bc8cbe014af6ebdf3cb7b1e9ad36f412c0@176.9.82.221:12556
1:55PM ERR Error while pruning, moving on the the next version in the store err="version does not exist" module=server next version=7202688 version missing=7202687
1:55PM ERR Error while pruning, moving on the the next version in the store err="version does not exist" module=server next version=7202689 version missing=7202688
1:55PM ERR Error while pruning, moving on the the next version in the store err="version does not exist" module=server next version=7202690 version missing=7202689
1:55PM INF commit is for a block we do not know about; set ProposalBlock=nil commit=D8FFDC467BB88FE33CD49519B063E564BFC219668D4734F8A3DC05EC441B06D8 commit_round=0 height=27208281 module=consensus proposal=
1:55PM INF received complete proposal block hash=D8FFDC467BB88FE33CD49519B063E564BFC219668D4734F8A3DC05EC441B06D8 height=27208281 module=consensus
1:55PM INF finalizing commit of block hash=D8FFDC467BB88FE33CD49519B063E564BFC219668D4734F8A3DC05EC441B06D8 height=27208281 module=consensus num_txs=0 root=08CB5DDE28307231EF8D5D5B9BEF99BB803058B08AF1FCE0095E5620E381E14F
1:55PM INF finalized block block_app_hash=BA816B7934B12755217F0F4BDA6264743CDC6E18B3E8858D86CE37618EBA57B7 height=27208281 module=state num_txs_res=0 num_val_updates=0
1:55PM INF executed block app_hash=BA816B7934B12755217F0F4BDA6264743CDC6E18B3E8858D86CE37618EBA57B7 height=27208281 module=state
1:55PM INF committed state block_app_hash=08CB5DDE28307231EF8D5D5B9BEF99BB803058B08AF1FCE0095E5620E381E14F height=27208281 module=state
1:55PM INF Timed out dur=443.168164 height=27208282 module=consensus round=0 step=RoundStepNewHeight
1:55PM ERR Error while pruning, moving on the the next version in the store err="version does not exist" module=server next version=6667061 version missing=6667060
1:55PM INF Timed out dur=1600 height=27208282 module=consensus round=0 step=RoundStepPropose
1:55PM ERR Error while pruning, moving on the the next version in the store err="version does not exist" module=server next version=7203887 version missing=7203886
1:55PM ERR Error while pruning, moving on the the next version in the store err="version does not exist" module=server next version=7203888 version missing=7203887

This represents a large backlog, as pruning is at version 7203887 while the chain is at height 27208281.
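To put a number on that backlog, here is a small sketch using the two values from the log above. The `pruningBacklog` helper is hypothetical, and it ignores any recent versions the node is configured to keep, so the real figure would be somewhat smaller.

```go
package main

import "fmt"

// pruningBacklog returns how many versions lie between the version the pruner
// is currently working on and the chain's current height.
func pruningBacklog(pruningVersion, chainHeight int64) int64 {
	if chainHeight < pruningVersion {
		return 0
	}
	return chainHeight - pruningVersion
}

func main() {
	// Values taken from the testnet log lines above.
	fmt.Println(pruningBacklog(7203887, 27208281)) // prints 20004394
}
```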

Summary by CodeRabbit

  • Bug Fixes
    • Improved error handling during version removal, now logging missing versions while allowing the process to continue smoothly.
    • Refined the handling of orphan node traversal, ensuring operations proceed only when valid data is present, thereby enhancing overall system stability.

@PaddyMc PaddyMc requested a review from a team March 26, 2025 10:45
coderabbitai bot commented Mar 26, 2025

Walkthrough

The changes enhance the deleteVersion method in nodedb.go by improving error handling and logging for cases where a version is missing. The method now checks for an ErrVersionDoesNotExist error when retrieving the root key, logs the error message, and continues execution. Additionally, the traversal of orphan nodes is conditioned on the validity of the root key, with the method ignoring the specific error during traversal.

Changes

File | Change Summary
nodedb.go | Updated deleteVersion to refine error handling by logging for ErrVersionDoesNotExist during root key retrieval and conditionally traversing orphans if rootKey is valid.

Sequence Diagram(s)

sequenceDiagram
    participant NodeDB
    participant Cache as Cache.getRootKey
    participant OrphanTraversal as traverseOrphansWithRootkeyCache

    NodeDB->>Cache: getRootKey(version)
    alt Version does not exist
        Cache-->>NodeDB: ErrVersionDoesNotExist
        NodeDB->>NodeDB: Log error and continue
    else Valid rootKey returned
        Cache-->>NodeDB: rootKey
        NodeDB->>OrphanTraversal: traverseOrphansWithRootkeyCache(rootKey)
        alt traverse error is ErrVersionDoesNotExist
            OrphanTraversal-->>NodeDB: Ignored error
        else Other error occurs
            OrphanTraversal-->>NodeDB: Return error
        end
    end

Suggested labels

backport/v1.x

Suggested reviewers

  • tac0turtle

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
nodedb.go (1)

466-468: Fix typo in error log message.

There's a typo in the error message: "moving on the the next version" (duplicate "the").

-    ndb.logger.Error("Error while pruning, moving on the the next version in the store", "version missing", version, "next version", version+1, "err", err)
+    ndb.logger.Error("Error while pruning, moving on to the next version in the store", "version missing", version, "next version", version+1, "err", err)
🧰 Additional context used
🧬 Code Definitions (1)
nodedb.go (3)
mutable_tree.go (1)
  • ErrVersionDoesNotExist (18-18)
node.go (1)
  • Node (59-75)
cache/cache.go (1)
  • Node (10-12)
🔇 Additional comments (3)
nodedb.go (3)

462-464: Improved error handling for ErrVersionDoesNotExist.

This change modifies the error handling to specifically check for ErrVersionDoesNotExist and continue execution in that case, rather than immediately returning the error. This aligns with the PR's objective to allow pruning to continue when versions are missing.


470-493: Added conditional traversal and ErrVersionDoesNotExist handling.

This change adds a nil check on rootKey before traversing orphans, which prevents potential nil pointer dereferences. It also modifies the error handling to ignore ErrVersionDoesNotExist errors during traversal, consistent with the other changes in this PR.


506-508: Consistent error handling for next version root key.

This change applies the same improved error handling pattern for the next version's root key check, maintaining consistency with the earlier changes.

nodedb.go Outdated
@@ -497,7 +503,7 @@ func (ndb *nodeDB) deleteVersion(version int64, cache *rootkeyCache) error {

     // check if the version is referred by the next version
     nextRootKey, err := cache.getRootKey(ndb, version+1)
-    if err != nil {
+    if err != nil && err != ErrVersionDoesNotExist {
Contributor

Could nextRootKey be nil above if ErrVersionDoesNotExist?

Contributor Author

Yeah it can if both the current version and the next version are missing

Contributor

we don't have a check for it being nil right?
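One way to guard against the nil case raised in this thread is sketched below. The `referencedByNext` helper is hypothetical and is not taken from nodedb.go; it only illustrates why an explicit nil check matters once ErrVersionDoesNotExist is tolerated. In Go, `bytes.Equal(nil, nil)` returns true, so without the guard two missing versions would look as if they shared the same root.

```go
package main

import (
	"bytes"
	"fmt"
)

// referencedByNext reports whether the next version's root key is the same as
// the current one. A nil key (version missing) is treated as "no reference"
// instead of being compared.
func referencedByNext(rootKey, nextRootKey []byte) bool {
	if rootKey == nil || nextRootKey == nil {
		return false
	}
	return bytes.Equal(rootKey, nextRootKey)
}

func main() {
	fmt.Println(referencedByNext(nil, nil))                 // both versions missing
	fmt.Println(referencedByNext([]byte{1}, nil))           // only the next version missing
	fmt.Println(referencedByNext([]byte{1}, []byte{1}))     // same root in both versions
}
```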

aljo242 commented Mar 26, 2025

@Mergifyio backport release/v1.2.x

aljo242 commented Mar 26, 2025

@Mergifyio backport release/v1.3.x

mergify bot commented Mar 26, 2025

backport release/v1.2.x

✅ Backports have been created

mergify bot commented Mar 26, 2025

backport release/v1.3.x

✅ Backports have been created

@aljo242 aljo242 merged commit 8a2e2fe into cosmos:master Mar 26, 2025
6 of 7 checks passed
mergify bot pushed a commit that referenced this pull request Mar 26, 2025
mergify bot pushed a commit that referenced this pull request Mar 26, 2025
@ValarDragon ValarDragon left a comment

LGTM
