
fix: Reading work from discarded state #391

Merged
czareko merged 5 commits into main from fix/deep_forks on Mar 5, 2026

Conversation

@czareko
Collaborator

@czareko czareko commented Mar 4, 2026

The node then stayed idle and stopped producing blocks.

Root cause

  1. ArchiveCanonical pruning – The node keeps state only for blocks on the canonical chain. Non-canonical fork blocks have their state pruned.

  2. get_total_work needs state – Best chain selection calls runtime_api().get_total_work(block_hash) for each leaf. That requires executing the runtime at that block, so the block’s state must exist.

  3. Many forks – With many competing forks (reorg depth 168–179 in the logs), some leaves are tips of fork chains whose state has already been pruned.

  4. Failure mode – When calculate_chain_work hit such a block, it returned an error. The whole best_chain() call failed, so the mining loop could not get a best block and stopped authoring.

Solution

Instead of failing the entire best chain selection when calculate_chain_work returns a “State already discarded” / UnknownBlock error:

  • For leaves: Skip the problematic leaf, add it to ignored chains, and continue evaluating the rest.
  • For current best: If its state is pruned, fall back to find_best_chain(leaves) and pick the best among evaluable leaves.

The node can now choose the best chain from leaves whose state is available and continue authoring blocks.
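The skip-and-continue behaviour described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: `Hash`, `WorkError`, and `select_best_leaf` are hypothetical stand-ins, `u128` replaces the real work type, and the `work_of` closure plays the role of `calculate_chain_work`.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a block hash.
type Hash = u64;

/// Hypothetical error type; the real code works with `sp_consensus::Error`.
#[derive(Debug, PartialEq)]
enum WorkError {
    /// State for this block was pruned, so the runtime cannot run there.
    StatePruned,
    /// Any other failure, which still aborts the selection.
    Other(String),
}

/// Picks the leaf with the most total work, skipping leaves whose state is
/// pruned and recording them in `ignored`. `work_of` stands in for
/// `calculate_chain_work`.
fn select_best_leaf(
    leaves: &[Hash],
    ignored: &mut HashSet<Hash>,
    work_of: impl Fn(Hash) -> Result<u128, WorkError>,
) -> Result<Option<(Hash, u128)>, WorkError> {
    let mut best: Option<(Hash, u128)> = None;
    for &leaf in leaves {
        match work_of(leaf) {
            Ok(work) => {
                // Keep the leaf with strictly more work than the current best.
                if best.map_or(true, |(_, best_work)| work > best_work) {
                    best = Some((leaf, work));
                }
            }
            // Pruned state: skip this leaf instead of failing everything.
            Err(WorkError::StatePruned) => {
                ignored.insert(leaf);
            }
            // Unrelated errors are still propagated.
            Err(other) => return Err(other),
        }
    }
    Ok(best)
}
```

The point of the sketch is only that one pruned leaf no longer poisons the whole selection; any error that is not a pruned-state error still aborts it.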

Contributor

@illuzen illuzen left a comment

This seems correct, but we may want to rewrite this whole section to be more obviously correct

@n13
Collaborator

n13 commented Mar 4, 2026

I am checking this now

@n13
Collaborator

n13 commented Mar 4, 2026

Good, I now have a thorough understanding of the PR. Here's my review:


PR #391 Review: fix: Reading state from discarded chain

Overall: The fix correctly addresses the production issue -- pruned state on non-canonical forks causing the entire best-chain selection to fail, halting block production. The approach is sound: skip leaves whose state is unavailable rather than aborting the whole selection. A few concerns worth raising:

1. String-matching on error messages is fragile

	fn is_state_pruned_error(err: &sp_consensus::Error) -> bool {
		let msg = format!("{:?}", err);
		msg.contains("State already discarded") || msg.contains("UnknownBlock")
	}

This is the weakest part of the PR. Checking Debug-formatted error strings is brittle -- if Substrate changes the wording (e.g., "State was discarded" or "BlockNotFound"), this silently stops matching and the bug resurfaces.

Suggestion: Investigate whether the underlying ApiError can be matched structurally (e.g., via downcast or an enum variant) instead of relying on string contents. If string matching is truly the only option here (Substrate's error types are notoriously opaque), at minimum add a test that constructs these error variants and asserts is_state_pruned_error returns true, so a Substrate upgrade that changes wording will be caught.
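A regression test along the lines suggested here might look like the sketch below. It is an illustration under assumptions: `UpstreamError` is a hypothetical stand-in for the real error type, and the check is made generic over `Debug` so it compiles without Substrate. In the actual codebase the test would have to construct real `sp_consensus::Error` values instead.

```rust
use std::fmt::Debug;

/// Hypothetical stand-in for the upstream error type. The real check takes
/// `&sp_consensus::Error`; its variants are not reproduced here.
#[derive(Debug)]
enum UpstreamError {
    UnknownBlock(String),
    Other(String),
}

/// Same string-matching logic as in the PR, made generic over `Debug`
/// so it can be exercised without Substrate.
fn is_state_pruned_error<E: Debug>(err: &E) -> bool {
    let msg = format!("{:?}", err);
    msg.contains("State already discarded") || msg.contains("UnknownBlock")
}

#[cfg(test)]
mod tests {
    use super::*;

    /// Pins the wording the check depends on, so a change in the
    /// Debug output makes this test fail instead of silently not matching.
    #[test]
    fn detects_pruned_state_errors() {
        let unknown = UpstreamError::UnknownBlock("0x1234".to_string());
        let discarded = UpstreamError::Other("State already discarded".to_string());
        let unrelated = UpstreamError::Other("codec error".to_string());

        assert!(is_state_pruned_error(&unknown));
        assert!(is_state_pruned_error(&discarded));
        assert!(!is_state_pruned_error(&unrelated));
    }
}
```

A test like this only guards against wording changes if it builds its inputs from the real upstream errors; with hand-written stand-ins as above it merely documents the intended behaviour.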

2. find_best_chain can silently skip ALL leaves

In find_best_chain, if every single leaf has pruned state, all are skipped via continue, and the method falls through to:

		best_header.ok_or(sp_consensus::Error::Other("No Valid Chain Found".into()))

This returns a generic "No Valid Chain Found" error with no indication that the real cause was all states being pruned. Worth logging a specific warning before that final error when all leaves were skipped, so operators can distinguish "no valid chain at all" from "all leaves had pruned state."
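One way to make the two cases distinguishable is to count skipped leaves and return a dedicated error when nothing else was evaluable. The sketch below is hypothetical: `LeafOutcome`, `SelectionError`, and `pick_best` are invented names, and `u128` stands in for the real work type.

```rust
/// Hypothetical result of evaluating a single leaf.
#[derive(Debug, Clone, Copy, PartialEq)]
enum LeafOutcome {
    /// Total work could be computed for this leaf.
    Work(u128),
    /// The leaf's state was pruned, so it was skipped.
    Pruned,
}

/// Hypothetical error type separating the two failure causes.
#[derive(Debug, PartialEq)]
enum SelectionError {
    /// There was nothing to evaluate at all.
    NoValidChain,
    /// Leaves existed, but every one of them had pruned state.
    AllLeavesPruned { skipped: usize },
}

/// Returns the highest work among evaluable leaves, or an error that says
/// why no leaf could be chosen.
fn pick_best(outcomes: &[LeafOutcome]) -> Result<u128, SelectionError> {
    let mut best: Option<u128> = None;
    let mut skipped = 0usize;
    for outcome in outcomes {
        match *outcome {
            LeafOutcome::Work(w) => best = Some(best.map_or(w, |b| b.max(w))),
            LeafOutcome::Pruned => skipped += 1,
        }
    }
    match best {
        Some(w) => Ok(w),
        // Every leaf was skipped: report that explicitly.
        None if skipped > 0 => Err(SelectionError::AllLeavesPruned { skipped }),
        None => Err(SelectionError::NoValidChain),
    }
}
```

A caller could log a specific warning on `AllLeavesPruned` before converting it into whatever error the surrounding API expects.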

3. Asymmetry in handling between find_best_chain and best_chain

In best_chain (the main leaf evaluation loop), pruned leaves get added to the ignored list:

+					let _ = self.add_ignored_chain(*leaf_hash);

But in find_best_chain (the no-current-best path), pruned leaves are just skipped without being added to ignored chains. This means the same pruned leaf will be re-evaluated every cycle when there's no current best. Should find_best_chain also add pruned leaves to the ignored list for consistency?

4. let _ = self.add_ignored_chain(...) silently swallows errors

If adding to the ignored list fails (e.g., aux storage issue), the error is discarded. This is probably fine for now since it's not critical-path, but a log::warn! on failure would help debugging in production.

5. Fallback path in best_chain when current best is pruned

+			Err(ref e) if Self::is_state_pruned_error(e) => {
+				log::warn!(
+					target: "qpow",
+					"🍴️ Current best block state was pruned. Falling back to evaluating all leaves."
+				);
+				return self.find_best_chain(leaves).await;
+			},

When the current best's state is pruned, this falls back to find_best_chain(leaves) which evaluates all leaves without reorg depth constraints. This is a reasonable emergency fallback, but it means a pruned-state condition could bypass the max reorg depth check entirely. Worth noting in a comment or at minimum being aware of this trade-off.

6. DRY opportunity

The pruned-state skip pattern is repeated 3 times with slight variations (skip-and-continue, skip-and-add-to-ignored, fallback-to-find-best). I agree with illuzen's comment that this section could benefit from a refactor to make the control flow more obviously correct -- perhaps a helper that wraps calculate_chain_work and returns an Option<U512> (with None for pruned state), handling the logging internally.

Summary

The fix is correct for the immediate problem and unblocks block production. The main risk is the string-based error detection (point 1). I'd merge this as a hotfix but follow up with:

  • Structural error matching or at least a test for is_state_pruned_error
  • A refactor pass to consolidate the repeated match patterns (as illuzen suggested)

@n13
Collaborator

n13 commented Mar 4, 2026

Deleted that AI assessment again - I think it was mostly wrong ;)

@n13
Collaborator

n13 commented Mar 4, 2026

Did some more digging, I think this is fine!

Collaborator

@n13 n13 left a comment

It's good not to crash

It would be very nice if we could match on the error type instead of comparing strings, but that's a minor thing

I don't think even a smart contract would be able to get inside this code path and fake the error message, for example

The only remaining downside is that if we update to a new Substrate version and they have changed the error message, this might break again.

@czareko czareko changed the title from "fix: Reading state from discarded chain" to "fix: Reading state from discarded state" on Mar 4, 2026
@n13
Collaborator

n13 commented Mar 4, 2026

LGTM, adding AI review too

@n13
Collaborator

n13 commented Mar 4, 2026

AI opinion - use your discernment on these plz

Alright, I now have a thorough understanding of the PR. Here's my review:


PR #391 Review: fix: Reading state from discarded state

Verdict: Approve with suggestions. The fix correctly addresses the production issue and the second commit is a nice improvement over the first.

What's good

  1. Correct root-cause fix. The core problem -- get_total_work failing on pruned non-canonical forks and crashing the entire best-chain selection -- is properly addressed. Returning Ok(None) from try_calculate_chain_work for pruned blocks and continuing evaluation of remaining leaves is the right approach.

  2. Structural error matching. The second commit (8873256) addressed the biggest concern from the first round of review. is_state_pruned_error_raw now does a recursive downcast_ref::<ApiError>() looking for ApiError::UnknownBlock, which is much more robust than string matching.

  3. Proper ChainManagementError enum. Replacing ad-hoc sp_consensus::Error::Other(...) strings with a typed enum is a clear improvement. The From<ChainManagementError> for ConsensusError impl correctly maps to the existing upstream variants (ChainLookup, StateUnavailable).

  4. Consistency between find_best_chain and best_chain. Both paths now add pruned leaves to the ignored list, which was flagged as an asymmetry issue in the first review and is now fixed.
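The recursive downcast mentioned in point 2 could look roughly like the sketch below. It is not the code from commit 8873256. It assumes the recursion follows `std::error::Error::source()`, and it uses a local `ApiError` enum and a `Wrapped` error as hypothetical stand-ins for `sp_api::ApiError` and whatever wraps it.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical stand-in for `sp_api::ApiError`; only the variant named in
/// the review is modelled, plus one unrelated variant for contrast.
#[derive(Debug)]
enum ApiError {
    UnknownBlock(String),
    Application(String),
}

impl fmt::Display for ApiError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ApiError::UnknownBlock(msg) => write!(f, "unknown block: {}", msg),
            ApiError::Application(msg) => write!(f, "application error: {}", msg),
        }
    }
}

impl Error for ApiError {}

/// Hypothetical wrapper that hides the original error behind `source()`.
#[derive(Debug)]
struct Wrapped {
    inner: Box<dyn Error + 'static>,
}

impl fmt::Display for Wrapped {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "wrapped: {}", self.inner)
    }
}

impl Error for Wrapped {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&*self.inner)
    }
}

/// Walks the error and its sources, looking for `ApiError::UnknownBlock`.
fn is_state_pruned_error_raw(err: &(dyn Error + 'static)) -> bool {
    if let Some(ApiError::UnknownBlock(_)) = err.downcast_ref::<ApiError>() {
        return true;
    }
    // Not a match at this level: keep looking further down the chain.
    err.source().map_or(false, is_state_pruned_error_raw)
}
```

Unlike the string comparison, this keeps working if the message text changes. It would stop compiling, rather than silently stop matching, if the variant itself were renamed.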

Issues / suggestions

1. Duplicate log messages in try_calculate_chain_work

The pruned and non-pruned error branches log the exact same message:

log::error!(
    "Failed to get total work for chain with head #{}: {:?}",
    current_number,
    e
);

The pruned case should probably be log::warn! rather than log::error!, since it's now an expected/handled condition. This would also help operators distinguish between "expected pruned state" noise and actual failures in logs.

2. Non-pruned errors return StateUnavailable -- is that right?

In try_calculate_chain_work, when is_state_pruned_error_raw returns false, the error is wrapped as ChainManagementError::StateUnavailable. But if it's not a pruned-state error, it could be anything -- a codec error, a different runtime API failure, etc. StateUnavailable seems semantically wrong for these cases. Something like a generic RuntimeApiError variant or just ChainLookup might be more accurate.
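To illustrate the distinction, a typed error could carry a separate variant for non-pruned failures and map it to a different upstream variant. Everything in this sketch is an assumption for illustration: the `RuntimeApi` variant, the `classify_work_error` helper, and the local `ConsensusError` enum (which models only the two upstream variants named in this review) are hypothetical.

```rust
/// Local stand-in modelling only the two upstream variants named in the review.
#[derive(Debug, PartialEq)]
enum ConsensusError {
    ChainLookup(String),
    StateUnavailable(String),
}

/// Hypothetical shape of the typed error, with pruned state kept apart
/// from other runtime API failures.
#[derive(Debug, PartialEq)]
enum ChainManagementError {
    /// The block's state was pruned or discarded.
    StateUnavailable(String),
    /// Any other runtime API failure (hypothetical variant).
    RuntimeApi(String),
}

impl From<ChainManagementError> for ConsensusError {
    fn from(err: ChainManagementError) -> Self {
        match err {
            ChainManagementError::StateUnavailable(msg) => ConsensusError::StateUnavailable(msg),
            ChainManagementError::RuntimeApi(msg) => ConsensusError::ChainLookup(msg),
        }
    }
}

/// Chooses the variant based on whether the failure was a pruned-state error.
fn classify_work_error(is_pruned: bool, detail: &str) -> ChainManagementError {
    if is_pruned {
        ChainManagementError::StateUnavailable(detail.to_string())
    } else {
        ChainManagementError::RuntimeApi(detail.to_string())
    }
}
```

With a split like this, `StateUnavailable` would only ever mean what its name says, and other failures would stay visible as something else.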

3. is_state_pruned_error_raw only checks UnknownBlock

The function name and doc say "state was pruned/discarded" but ApiError::UnknownBlock could also fire for genuinely unknown blocks (e.g., a hash that was never imported). In practice this distinction probably doesn't matter -- you'd want to skip that leaf either way -- but worth being aware that this is slightly broader than "pruned state" detection.

4. Minor: skipped_pruned tracking only in find_best_chain

find_best_chain tracks skipped_pruned and logs a specific warning when all leaves are pruned. The best_chain method doesn't track this. Not critical since best_chain always has the current-best as a fallback, but it could be useful for operator visibility if many leaves are being skipped silently in the main loop.

5. DRY opportunity (echoing illuzen's comment)

The pattern of "get header -> try_calculate_chain_work -> match None to skip+ignore" appears twice (in find_best_chain and in best_chain). A helper like evaluate_leaf(&self, leaf_hash) -> Result<Option<(Header, U512)>, ConsensusError> would consolidate this. Not blocking, but worth a follow-up.
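A possible shape for such a helper is sketched below. It is a simplified model, not a drop-in implementation: `Selector`, its fields, and the `Header` struct are hypothetical, `u128` stands in for `U512`, the method takes `&mut self` for simplicity, and a `None` entry in the `work` map models pruned state.

```rust
use std::collections::{HashMap, HashSet};

type Hash = u64;
/// Stand-in for `U512`.
type Work = u128;

/// Hypothetical minimal header.
#[derive(Debug, Clone, PartialEq)]
struct Header {
    number: u32,
}

/// Local stand-in for the consensus error type.
#[derive(Debug, PartialEq)]
enum ConsensusError {
    ChainLookup(String),
}

/// Hypothetical holder of the data the real selector reads from the client.
struct Selector {
    headers: HashMap<Hash, Header>,
    /// `Some(work)` if state is available, `None` if it was pruned.
    work: HashMap<Hash, Option<Work>>,
    ignored: HashSet<Hash>,
}

impl Selector {
    /// Models `try_calculate_chain_work`: `Ok(None)` means pruned state.
    fn try_calculate_chain_work(&self, hash: Hash) -> Result<Option<Work>, ConsensusError> {
        self.work
            .get(&hash)
            .copied()
            .ok_or_else(|| ConsensusError::ChainLookup(format!("no work entry for {}", hash)))
    }

    /// Header lookup, work calculation, and skip-and-ignore in one place.
    fn evaluate_leaf(&mut self, leaf_hash: Hash) -> Result<Option<(Header, Work)>, ConsensusError> {
        let header = self
            .headers
            .get(&leaf_hash)
            .cloned()
            .ok_or_else(|| ConsensusError::ChainLookup(format!("no header for {}", leaf_hash)))?;
        match self.try_calculate_chain_work(leaf_hash)? {
            Some(work) => Ok(Some((header, work))),
            None => {
                // Pruned leaf: remember it so it is not re-evaluated.
                self.ignored.insert(leaf_hash);
                Ok(None)
            }
        }
    }
}
```

Both `find_best_chain` and `best_chain` could then loop over leaves and treat `Ok(None)` as "skip", which keeps the pruned-state handling in a single function.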

Summary

The fix is correct and safe to merge as a hotfix. The structural error matching via downcast_ref::<ApiError>() is a solid improvement over the string-matching first commit. The remaining suggestions are minor quality-of-life items that can be addressed in a follow-up refactor pass.

@czareko czareko changed the title from "fix: Reading state from discarded state" to "fix: Reading work from discarded state" on Mar 5, 2026
@czareko czareko merged commit 6181fa7 into main Mar 5, 2026
4 checks passed

3 participants