Conversation
illuzen
left a comment
There was a problem hiding this comment.
This seems correct, but we may want to rewrite this whole section to be more obviously correct
|
I am checking this now |
|
Good, I now have a thorough understanding of the PR. Here's my review: PR #391 Review: fix: Reading state from discarded chainOverall: The fix correctly addresses the production issue -- pruned state on non-canonical forks causing the entire best-chain selection to fail, halting block production. The approach is sound: skip leaves whose state is unavailable rather than aborting the whole selection. A few concerns worth raising: 1. String-matching on error messages is fragileThis is the weakest part of the PR. Checking Suggestion: Investigate whether the underlying 2.
|
|
Deleted tht AI assessment again - I think it was mostly wrong ;) |
|
Did some more digging, I think this is fine! |
There was a problem hiding this comment.
It's good not to crash
Would be very nice if we could catch the error type instead of the string compare, but it's a minor thing also
I don't think even a smart contract would be able to get inside this code path and fake the error message, for example
That leaves as only downside if we are updating to a new substrate version and they changed the error message it might break again.
|
LGTM, adding AI review too |
|
AI opinion - use your discernment on these plz Alright, I now have a thorough understanding of the PR. Here's my review: PR #391 Review: fix: Reading state from discarded stateVerdict: Approve with suggestions. The fix correctly addresses the production issue and the second commit is a nice improvement over the first. What's good
Issues / suggestions1. Duplicate log messages in The pruned and non-pruned error branches log the exact same message: log::error!(
"Failed to get total work for chain with head #{}: {:?}",
current_number,
e
);The pruned case should probably be 2. Non-pruned errors return In 3. The function name and doc say "state was pruned/discarded" but 4. Minor:
5. DRY opportunity (echoing illuzen's comment) The pattern of "get header -> try_calculate_chain_work -> match None to skip+ignore" appears twice (in SummaryThe fix is correct and safe to merge as a hotfix. The structural error matching via |
The node then stayed idle and stopped producing blocks.
Root cause
ArchiveCanonical pruning – The node keeps state only for blocks on the canonical chain. Non-canonical fork blocks have their state pruned.
get_total_workneeds state – Best chain selection callsruntime_api().get_total_work(block_hash)for each leaf. That requires executing the runtime at that block, so the block’s state must exist.Many forks – With many competing forks (reorg depth 168–179 in the logs), some leaves are tips of fork chains whose state has already been pruned.
Failure mode – When
calculate_chain_workhit such a block, it returned an error. The wholebest_chain()call failed, so the mining loop could not get a best block and stopped authoring.Solution
Instead of failing the entire best chain selection when
calculate_chain_workreturns a “State already discarded” /UnknownBlockerror:find_best_chain(leaves)and pick the best among evaluable leaves.The node can now choose the best chain from leaves whose state is available and continue authoring blocks.