
Commit c779cb5

Merge pull request #1372 from taxmeifyoucan/pectra-postmortems
Holešky update + Sepolia postmortem
2 parents 5395d8d + def2c8f commit c779cb5

2 files changed: +271 −11 lines


Pectra/holesky-postmortem.md: +119 −11 lines
@@ -1,17 +1,19 @@
# Holešky Pectra Incident Post-Mortem

Author: Tim Beiko, Mario Havel

Status: Resolved

Date: Mar 20, 2025

# Current Status

The Holešky network has been successfully recovered. After 2 weeks of running without finality, the chain finalized again on Mar 10 at 19:21 UTC at epoch `119090`.

The recovery was achieved by coordinating operators to follow the correct chain until participation reached enough validators for finality. Since then, participation has continued to slowly rise and the network appears stable. Detailed recovery efforts and original instructions for validators are described below.

After reaching finality, the consensus layer started processing the backlog of validator events and the exit queue filled up for the next ~1.5 years. Because of this, full validator lifecycle tests and other Electra testing, such as consolidations, are not possible. The Holešky long-term support window was shortened and, to allow for immediate testing, the network is being replaced by [Hoodi](https://github.com/eth-clients/hoodi).

## Recovery Efforts

The initial strategy to coordinate slashing with operators disabling slashing protections, planned at [ACDE#206](https://github.com/ethereum/pm/issues/1306), was not successful. The outcome and new options were discussed at [ACDC#152](https://github.com/ethereum/pm/issues/1323). At that point, slashings and coordination of validators to follow the correct fork did not reach the necessary 66% threshold, and the network remained in a prolonged state of non-finalization. This scenario is one which consensus clients are not designed for and causes significant resource overhead. Client teams implemented fixes and mitigations allowing clients to run more efficiently even through extended periods of non-finalization.
@@ -20,9 +22,10 @@ After [ACDC#152](https://github.com/ethereum/pm/issues/1323), based on outlined

As participation slowly increased, the goal was to prompt as many validators as possible to connect to the correct network before Mar 12. This was successfully achieved on the evening of March 10. A [new finalized epoch](https://light-holesky.beaconcha.in/epoch/119090) of the correct chain was created, which could then be used by clients to sync normally again.

<details>
<summary>Original Validator Instructions for recovery</summary>

Original instructions for Holešky validators to participate and contribute to the recovery:

- Update your clients to a version containing the fix, [list of versions below](#client-releases-and-resources)
- Disable slashing protection as [described below](#Disabling-Slashing-Protection)
@@ -34,7 +37,7 @@ EL clients need to use full sync instead of snap sync. This should be the defaul

### Coordinated Slashings

On [ACDE#206](https://github.com/ethereum/pm/issues/1306), client teams decided to try and coordinate mass Holešky slashings around slot `3737760` (Feb 28, 15:12:00 UTC). The goal was for the network to achieve enough validators online to finalize an epoch on the valid chain at the same time.

### Disabling Slashing Protection

@@ -120,13 +123,21 @@ Consensus layer teams have been releasing patches to improve peering and sync on
- [Holesky block explorer (correct chain)](https://dora-holesky.pk910.de/)
- [Nethermind snapshot](https://nethermind.benaadams.vip/snapshot/nethermind_holesky_3420120_0x204dda_8f25ea_snapshot.tar.bz2)
- [EthPandaOps snapshots](https://ethpandaops.io/data/snapshots/)

</details>

## Postmortems from client teams
- [Lodestar Holesky Rescue Retrospective](https://hackmd.io/@philknows/ByxcAAWnye)
- [Besu Deposit Contract Address Postmortem](https://hackmd.io/@siladu/H1qydmWhyx)
- [Prysm Postmortem](https://github.com/prysmaticlabs/documentation/pull/1028)
- [Who Moved My Testnet? - Reflection on testnet situation by Lucas Saldanha](https://hackmd.io/@lucassaldanha/rJd-9rAikg)
- [Original Incident Debrief call notes](https://ethereum-magicians.org/t/holesky-incident-debrief-february-26-2025/22998)

# Root Cause Analysis

## Execution Layer Issue

The root cause of the initial problem was that several execution clients (Geth, Nethermind, and Besu) had incorrect deposit contract addresses configured for the Holešky testnet. Specifically:

- Holesky's deposit contract address should be `0x4242424242424242424242424242424242424242`
- Some EL clients were using the mainnet deposit contract address or had no specific configuration for Holesky, leading them to use `0x0000...0000`
@@ -156,13 +167,110 @@ This created a negative feedback loop where the valid chain had few blocks and f

# Root Cause Remediations

## Validating Configuration and Fork Parameterization

Better validation of config and fork parameters is necessary, and genesis configuration is being standardized across clients. EL clients are implementing a new RPC method, [`eth_config`](https://hackmd.io/@shemnon/eth_config), to retrieve and validate the correct configuration. An incompatible configuration results in an early error.
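
The core of such validation is a plain config diff: fetch the config a node reports and compare it key by key against the locally expected one, erroring early on any mismatch. The sketch below illustrates this idea only; the field names are assumptions, not the actual `eth_config` response schema.

```python
# Illustrative config cross-validation in the spirit of the proposed
# eth_config method. Field names below are assumed for the example.
def validate_config(local: dict, remote: dict) -> list[str]:
    """Return the sorted list of keys whose values differ between two configs."""
    return sorted(
        key
        for key in set(local) | set(remote)
        if local.get(key) != remote.get(key)
    )

local_cfg = {
    "chainId": 17000,  # Holešky
    "depositContract": "0x4242424242424242424242424242424242424242",
}
remote_cfg = {
    "chainId": 17000,
    "depositContract": "0x0000000000000000000000000000000000000000",
}

mismatched = validate_config(local_cfg, remote_cfg)
if mismatched:
    # a client would refuse to start (or drop the peer) at this point
    print("config mismatch on:", mismatched)
```

With the Holešky incident's actual misconfiguration, this check flags `depositContract` before the client ever follows the wrong chain.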

## User-specified Unfinalized Checkpoint Sync

Clients will enable a custom checkpoint sync block and improve their ability to sync even from an arbitrary non-finalized checkpoint. This would enable users to socially coordinate around a specific chain, forcing the client to sync to it.
A further improvement could be a leader-based coordination system for correct chain identification, plus fixes to invalid-chain pruning capabilities.

> e.g. Prysm added a "sync from head" feature ([PR #15000](https://github.com/prysmaticlabs/prysm/pull/15000), more planned in [#14988](https://github.com/prysmaticlabs/prysm/issues/14988)), Geth is working on a similar feature [#31375](https://github.com/ethereum/go-ethereum/issues/31375), and Teku is considering it, but its codebase heavily relies on a finalized source to start the sync

## Further issues and mitigations

Apart from the original root issue, the incident uncovered a number of other problems caused by the extreme case of long non-finality, especially for CL clients. Clients implemented more fixes and improvements.

### Issues across consensus clients

#### High resource usage

Most clients ended up with excessive memory usage and performance degradation as the non-finality period extended (for example, a Prysm/Geth machine with 300GB+ RAM usage). Without a finalized checkpoint to write to the database, clients had to store a lot of data in memory.

#### Fork Choice issues

An incorrect chain with an invalid block being justified caused issues for fork choice. Clients had to handle many competing forks, with the correct one being a minority without justification.

#### Peer discovery and connection issues

All clients struggled to find peers on the correct chain, even with manual ENR sharing and coordination. Peer scoring and many concurrent forks made it difficult to find a good peer.

#### Slashing protection issues

Every client team needed custom procedures for managing slashing protection. Surround slashing conditions were a problem for validators that attested to the invalid chain.
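
The surround condition that made recovery votes risky can be shown concretely. The rule follows the consensus spec's slashable-attestation check (`is_slashable_attestation_data`): vote A surrounds vote B when A's source epoch is older than B's and A's target epoch is newer. The `Vote` type and the epoch numbers below are illustrative.

```python
from typing import NamedTuple

class Vote(NamedTuple):
    source_epoch: int
    target_epoch: int

def is_surround(a: Vote, b: Vote) -> bool:
    """True if vote `a` surrounds vote `b` — a slashable offence per the spec."""
    return a.source_epoch < b.source_epoch and b.target_epoch < a.target_epoch

# A validator attested on the invalid chain...
old_vote = Vote(source_epoch=115_000, target_epoch=115_001)
# ...then, to rejoin the correct chain, had to vote with an older justified
# source and a newer target — surrounding its earlier vote:
recovery_vote = Vote(source_epoch=114_999, target_epoch=115_050)

assert is_surround(recovery_vote, old_vote)
```

This is why recovery required explicitly disabling slashing protection: a correctly functioning validator client would refuse to sign `recovery_vote`.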

### Client specific improvements

- Lighthouse
  - Developing a "hot tree-states" feature to store data in the hot DB more efficiently during non-finality. It allows storing hot-DB data on disk similarly to the cold DB, without consuming an inordinate amount of disk space.
  - Added `lighthouse/add_peer` endpoint to help nodes find canonical chain peers, especially useful with `--disable-discovery`
  - Optimized `BlocksByRange` to load from fork choice when possible
  - Added `--invalid-block-roots` flag to automatically invalidate problematic blocks (like 2db899...)
  - Improved cache management and other optimizations
- Prysm
  - Added a new flag to allow syncing from a custom checkpoint, [sync from head](https://github.com/prysmaticlabs/prysm/pull/15000)
  - Fixed bugs with attestation aggregation and an attester slashing bug introduced in Electra [#15027](https://github.com/prysmaticlabs/prysm/pull/15027), [#15028](https://github.com/prysmaticlabs/prysm/pull/15028)
  - Fixed REST API performance issues with `GetDuties` endpoint [#14990](https://github.com/prysmaticlabs/prysm/pull/14990)
  - Plans more features for custom sync: [marking invalid blocks](https://github.com/prysmaticlabs/prysm/issues/14989), [optimistic sync option](https://github.com/prysmaticlabs/prysm/issues/14987), [adding blocks manually](https://github.com/prysmaticlabs/prysm/issues/14986), [follow chain by leader's ENR](https://github.com/prysmaticlabs/prysm/issues/14994)
- Lodestar
  - Added feature to [check for blacklisted blocks](https://github.com/ChainSafe/lodestar/pull/7498) and introduced a [new endpoint](https://github.com/ChainSafe/lodestar/pull/7580) to return them
  - Fixed checkpoint state pruning to prevent OOM crashes [#7497](https://github.com/ChainSafe/lodestar/pull/7497), [#7505](https://github.com/ChainSafe/lodestar/pull/7505)
  - Added pruning of persisted checkpoint states [#7510](https://github.com/ChainSafe/lodestar/pull/7510), [#7495](https://github.com/ChainSafe/lodestar/issues/7495)
  - Added feature to use a local state source as checkpoint [#7509](https://github.com/ChainSafe/lodestar/pull/7509)
  - Added new endpoint `eth/v1/lodestar/persisted_checkpoint_state` to return a state based on an optional `rootHex:epoch` parameter [#7541](https://github.com/ChainSafe/lodestar/pull/7541)
  - Improved peer management during sync stalls, adding a check whether a peer is `starved` [#7508](https://github.com/ChainSafe/lodestar/pull/7508)
  - Fixed bug in attestation gossip validation introduced in Electra [#7543](https://github.com/ChainSafe/lodestar/pull/7543)
  - Considered adding pessimistic sync, but it might cause problems with a snap-synced EL and doesn't seem that useful [#7511](https://github.com/ChainSafe/lodestar/pull/7511)
  - Added state persistence for invalid blocks to allow their analysis [#7482](https://github.com/ChainSafe/lodestar/pull/7482)
  - Exploring binary diff states and era files to import state [#7535](https://github.com/ChainSafe/lodestar/pull/7535), [#7048](https://github.com/ChainSafe/lodestar/issues/7048)
- Teku
  - Fixed fork choice bug related to equivocating votes [#9234](https://github.com/Consensys/teku/pull/9234)
  - Fixed sync issues during long non-finality where the sync process kept restarting from an old block, because `protoArray` was initialised with 0 weights and the canonical head became a random chain tip in the past
  - Fixed node restarting sync from the last finalized state
  - Fixed slow block production due to too many single attestations
  - Added sorting for better attestation selection during aggregation
  - Identified and working on various smaller issues: https://github.com/Consensys/teku/issues?q=%5BHOLESKY%20PECTRA%5D
  - To deal with the huge performance overhead, the team created a "superbeacon" node on a new machine with substantial CPU/RAM resources to handle the load
- Nimbus
  - Didn't experience major issues with performance
  - Created the `feat/splitview` branch, which keeps better track of forks
  - Even when a block was `INVALID`, it was added to fork choice and justified, creating a hard situation to recover from, due to the fundamentally optimistic nature of how the engine API works
  - While the `feat/splitview` branch was able to effectively find/explore lots of forks from different nodes, it was often unable to get ELs to respond with anything but `SYNCING`, so it couldn't rule out actually-`INVALID` forks
  - After the network finalized, Nimbus took a while to finish some on-finalization processing, which disrupted slot and block processing for a while. Once it got past that, it was fine, and `feat/splitview` wasn't necessary anymore
- Grandine
  - Fixed increased memory usage that led to OOM errors
- Besu
  - Fixed deposit contract address misconfiguration [#8346](https://github.com/hyperledger/besu/pull/8346)
  - Besu uses a 3rd-party web3 library to process the deposit contract; the team will review usage of external libraries in critical consensus paths [#8391](https://github.com/hyperledger/besu/issues/8391)
  - Fixed snap sync stalling [#8393](https://github.com/hyperledger/besu/issues/8393)
- Geth
  - Fixed deposit contract address configuration [#31247](https://github.com/ethereum/go-ethereum/pull/31247)
  - Working on a `--synctarget` flag to force the client to follow a specific chain [#31375](https://github.com/ethereum/go-ethereum/issues/31375)
  - Identified crashes when trying to add an invalid block after syncing the good branch [#31320](https://github.com/ethereum/go-ethereum/issues/31320)
### Testing and process improvements

#### More non-finality testing

Clients had not been tested under such long non-finality conditions before. Some period of non-finality should become a standard testing procedure.

#### Testnet and fork management

Testnets and hardforks require more careful handling; Holesky/Sepolia/Hoodi should be considered proper staging environments. Testnet setups should be as close to mainnet as possible, and hardfork activation should be handled similarly to mainnet with proper procedures. Some more insights on this topic can be found here: https://hackmd.io/@lucassaldanha/rJd-9rAikg

#### Incident response coordination

The process for incident response needs to be clear and executed across client teams. Especially during hardforks, whether testnet or mainnet, developers and devops need to be on-call and actively monitoring the situation. Communication needs to be clear between clients, without teams working in isolation. A proper standard procedure for incident response needs to be established with clear guidelines and responsibilities.

#### Validator client separation

Modularity through separate validator clients proved valuable, allowing teams to connect their validators to healthy beacon node providers. Moving validator keys between clients can be challenging and time-consuming.

# Timeline of Events

@@ -171,7 +279,7 @@ _Note: I used an LLM to compile this based on the Discord chat transcript. I've
## February 24, 2025

### 22:04-23:00 UTC: Network Split Identified
- 22:04: Multiple users report invalid block issues on the Holešky network after the Pectra upgrade
- 22:05: First reports of validation errors in Lighthouse:
> "Invalid execution payload... validation_error: mismatched block requests hash: got 0x12e7307cb8a29c779310bea59482500fb917e433f6849de7394f9e2f5c34bf31, expected 0xe3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
- 22:07: Confirmation that Erigon is also seeing invalid blocks: "we have bad blocks on erigon too"
