forked from anza-xyz/agave
streamer/TPU: increase STREAM_LOAD_EMA_INTERVAL_COUNT from 10 to 40 #2
Open
stablebits wants to merge 28 commits into rfc-throttling-threshold-v2 from increase-stream_load_ema_interval_count
Conversation
* Add wfsm metric. Add trace logging for peers.
* Remove trace logging, since peers are already logged by gossip
* Remove wrong_shred_stake from wfsm_gossip metric. This will always be 0 and the associated code will be cleaned up in a future PR
* Split update_index function into two: one for cached accounts and the other for frozen
* Updated comment and added debug asserts
* Remove unneeded type declaration
alpenglow: upstream votor & votor-messages as of December
* Make flushing of unrooted slots explicit
* Rename flush_unrooted_cache_slot to flush_unrooted_slot_cache
* Check for unrooted slots; remove changes to tests that are not using the new function
* Inline flush_slot_cache into flush_accounts_cache_slot_for_tests for the DCOU issue
* Resolve the unused-function issue
* bump `bls-signatures` to v3.0
* update vote program with the new syntax
* update genesis-utils with the new syntax
* update `clap-utils` tests
* update `keygen` tests
* update genesis tests
* update votor
* update votor tests
* update `epoch_stakes`
…anza-xyz#9732) use bounded channels between streamers and sigver
…ams/sbf (anza-xyz#10029) chore(deps): bump solana-program-memory in /programs/sbf Bumps [solana-program-memory](https://github.com/anza-xyz/solana-sdk) from 3.0.0 to 3.1.0. - [Release notes](https://github.com/anza-xyz/solana-sdk/releases) - [Commits](https://github.com/anza-xyz/solana-sdk/compare/sdk@v3.0.0...cpi@v3.1.0) --- updated-dependencies: - dependency-name: solana-program-memory dependency-version: 3.1.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* chore(deps): bump chrono from 0.4.42 to 0.4.43 Bumps [chrono](https://github.com/chronotope/chrono) from 0.4.42 to 0.4.43. - [Release notes](https://github.com/chronotope/chrono/releases) - [Changelog](https://github.com/chronotope/chrono/blob/main/CHANGELOG.md) - [Commits](chronotope/chrono@v0.4.42...v0.4.43) --- updated-dependencies: - dependency-name: chrono dependency-version: 0.4.43 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> * Update all Cargo files --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…yz#10041) * chore(deps): bump solana-system-interface from 2.0.0 to 3.0.0 Bumps [solana-system-interface](https://github.com/anza-xyz/solana-sdk) from 2.0.0 to 3.0.0. - [Release notes](https://github.com/anza-xyz/solana-sdk/releases) - [Commits](https://github.com/anza-xyz/solana-sdk/compare/address@v2.0.0...sdk@v3.0.0) --- updated-dependencies: - dependency-name: solana-system-interface dependency-version: 3.0.0 dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> * Update all Cargo files --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…z#10002)
* epoch stakes in thread
* Add comment and asserts for versioned_epoch_stakes
…-bins (anza-xyz#10049) chore(deps): bump solana-system-interface in /dev-bins Bumps [solana-system-interface](https://github.com/anza-xyz/solana-sdk) from 2.0.0 to 3.0.0. - [Release notes](https://github.com/anza-xyz/solana-sdk/releases) - [Commits](https://github.com/anza-xyz/solana-sdk/compare/address@v2.0.0...sdk@v3.0.0) --- updated-dependencies: - dependency-name: solana-system-interface dependency-version: 3.0.0 dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…nza-xyz#10048) decrease QUIC_MAX_TIMEOUT from 60s to 30s. A 60s timeout might be too high given NAT timeouts; 30s is a safe default idle timeout and is the default used in quinn.
scale RX window and max_streams with BDP
anza-xyz#9580) Simulations with the existing EMA-based load metric (stream_throttle.rs) showed that very low-stake staked connections (~0.01% of total stake) could end up with streams-per-100ms quotas similar to unstaked connections even under near-zero load. Data collected on mds1 (mainnet) over a few leader slots also showed low-stake connections being throttled under effectively idle conditions:

[2025-12-04T22:56:59.929547468Z ERROR solana_streamer::nonblocking::stream_throttle] Throttling tpu stream from 3.66.188.50:8016, peer type: Staked(30314578869242), current_load: 11, total_stake: 415746706271632896, max_streams_per_interval: 28, read_interval_streams: 28, throttle_duration: 99.948899ms

In observed cases, effective load was near 0 (3–25 streams per 5ms) while affected connections had quotas of 28–64 streams per 100ms and stakes of ~0.007–0.016% of total stake.

Also:
* Fix update_ema() catch-up behavior so missed slots do not re-apply the same accumulated load.
* available_load_capacity_in_throttling_duration() mixed load values in streams/5ms and streams/50ms. Replaced it with a simpler stake-only quota under load.
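A minimal sketch of the update_ema() catch-up fix mentioned above (hypothetical code, not the actual stream_throttle.rs implementation; assumes alpha = 2/(N+1) with N = 40): the load accumulated while updates were missed is applied once, and the remaining missed intervals become decay-only steps, rather than re-applying the same accumulated load at every step.

```rust
// Smoothing factor alpha = 2 / (N + 1), with N = STREAM_LOAD_EMA_INTERVAL_COUNT = 40.
const ALPHA: f64 = 2.0 / 41.0;

// One EMA update step: new_ema = alpha * latest + (1 - alpha) * ema.
fn ema_step(ema: f64, latest: f64) -> f64 {
    ALPHA * latest + (1.0 - ALPHA) * ema
}

/// Catch up over `missed_intervals` elapsed intervals. The load observed
/// since the last update is counted exactly once; the remaining intervals
/// decay the EMA toward zero instead of re-applying the same load.
fn update_ema(mut ema: f64, accumulated_load: f64, missed_intervals: u32) -> f64 {
    ema = ema_step(ema, accumulated_load);
    for _ in 1..missed_intervals {
        ema = ema_step(ema, 0.0); // decay-only step; load is not re-applied
    }
    ema
}

fn main() {
    // With one missed interval this is a plain EMA step; with four, the
    // three extra intervals only decay the result.
    println!("1 interval:  {:.2}", update_ema(100.0, 500.0, 1));
    println!("4 intervals: {:.2}", update_ema(100.0, 500.0, 4));
}
```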
* Prepopulate zero lamport accounts in store_for_tests
* Update accounts-db/src/accounts_db.rs

Co-authored-by: Brooks <brooks@prumo.org>
* chore(deps): bump flate2 from 1.0.31 to 1.1.8 in /programs/sbf Bumps [flate2](https://github.com/rust-lang/flate2-rs) from 1.0.31 to 1.1.8. - [Release notes](https://github.com/rust-lang/flate2-rs/releases) - [Commits](rust-lang/flate2-rs@1.0.31...1.1.8) --- updated-dependencies: - dependency-name: flate2 dependency-version: 1.1.8 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * Update all Cargo files --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [js-sys](https://github.com/wasm-bindgen/wasm-bindgen) from 0.3.83 to 0.3.85. - [Release notes](https://github.com/wasm-bindgen/wasm-bindgen/releases) - [Changelog](https://github.com/wasm-bindgen/wasm-bindgen/blob/main/CHANGELOG.md) - [Commits](https://github.com/wasm-bindgen/wasm-bindgen/commits) --- updated-dependencies: - dependency-name: js-sys dependency-version: 0.3.85 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
This constant controls the duration of the EMA smoothing window used to reduce sensitivity to short-lived load spikes at the start of a leader slot. Throttling is only triggered when saturation is sustained. The value 40 was chosen based on simulations: at a max target TPS of ~400K, it allows the system to absorb a burst of ~50K transactions over ~40 ms before throttling activates. There is no magic about N=40; the value should be tuned based on the size and duration of spikes we want to tolerate.
Force-pushed 6fc5d7c to 98486db
* Switch networking crates to Rust 2024 edition
* clippy(networking): update formatting for rust 2024
* clippy: fix collapsible ifs
The STREAM_LOAD_EMA_INTERVAL_COUNT constant controls the duration of the EMA smoothing window used to reduce sensitivity to short-lived load spikes at the start of a leader slot. With anza-xyz#9580 in place, throttling is only triggered when saturation is sustained (reaching 95% of the max target).

Problem
With N=10, the smoothing window is too short (see the simulation results below).
Summary of Changes
The value 40 was chosen based on simulations: at a max target TPS of ~400K, it allows the system to absorb a burst of ~50K transactions over ~40 ms before throttling activates.
There is no magic about N=40; the value should be tuned based on the size and duration of spikes we want to tolerate.
This choice was made based on simulations: the alpha in the EMA (new_ema = alpha * latest + (1 - alpha) * ema) is basically 2 / (N + 1), where N is STREAM_LOAD_EMA_INTERVAL_COUNT. The larger N is, the slower the EMA grows (i.e., the larger a burst it can absorb). With N=10 (current code), alpha ≈ 0.18. For example, here's the EMA growth under sustained load of 1K / 5ms:

N=10 (alpha ≈ 0.18)
N=40 (alpha ≈ 0.047)
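The growth curves above can be reproduced with a short sketch (assuming the alpha = 2/(N+1) relation stated above; the helper names are hypothetical, not taken from stream_throttle.rs). It computes how many 5ms intervals the EMA needs to converge to a given fraction of a sustained load:

```rust
// Smoothing factor for an EMA window of N intervals: alpha = 2 / (N + 1).
fn ema_alpha(n: u64) -> f64 {
    2.0 / (n as f64 + 1.0)
}

// Number of 5ms intervals until the EMA of a constant load, starting from
// zero, reaches `threshold_frac` of that load.
fn intervals_to_reach(load: f64, threshold_frac: f64, n: u64) -> u64 {
    let alpha = ema_alpha(n);
    let mut ema = 0.0;
    let mut steps = 0;
    while ema < threshold_frac * load {
        ema = alpha * load + (1.0 - alpha) * ema; // new_ema = alpha*latest + (1-alpha)*ema
        steps += 1;
    }
    steps
}

fn main() {
    // Sustained load of 1K streams per 5ms, as in the charts above.
    for n in [10, 40] {
        println!(
            "N={n}: alpha ≈ {:.3}, EMA reaches 95% of load after {} intervals",
            ema_alpha(n),
            intervals_to_reach(1000.0, 0.95, n),
        );
    }
}
```

In this sketch the EMA reaches 95% of a sustained load after 15 intervals (~75ms) for N=10 but only after 60 intervals (~300ms) for N=40, which is what lets the larger window absorb bigger bursts before throttling.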
Below is simulated ingestion of ~60K transactions over 100ms with a spike at the beginning -- roughly corresponding to a pattern we recently saw on mds1 (mainnet), but at about 10x more traffic.
Note: throttling is activated at 95% of the target load (500K TPS) and deactivated at 90%. The quota of 40K basically means unthrottled.
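The activate-at-95%/deactivate-at-90% rule in the note above is a simple hysteresis; a minimal sketch (hypothetical type, not the actual stream_throttle.rs code):

```rust
/// Hypothetical throttle-state helper illustrating the 95%/90% hysteresis.
struct Throttle {
    active: bool,
}

impl Throttle {
    /// Activate at >= 95% of the target load, deactivate below 90%;
    /// loads in between leave the current state unchanged.
    fn update(&mut self, ema_load: f64, target: f64) -> bool {
        if !self.active && ema_load >= 0.95 * target {
            self.active = true;
        } else if self.active && ema_load < 0.90 * target {
            self.active = false;
        }
        self.active
    }
}

fn main() {
    let mut throttle = Throttle { active: false };
    // Load must cross 95% to activate, then drop below 90% to deactivate.
    for load in [80.0, 96.0, 92.0, 89.0] {
        println!("load {load}% -> throttling: {}", throttle.update(load, 100.0));
    }
}
```

The 5% gap between the two thresholds prevents the throttle from flapping on and off when the load hovers near the activation point.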
N=10
N=40
With N=40, we can absorb ~50K transactions (with a spike) over ~40ms before throttling gets activated.
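The absorption claim can be checked with a back-of-the-envelope simulation (assumed numbers: 5ms EMA intervals, activation at 95% of the 500K TPS target, and the burst modeled as a constant 6250 streams per 5ms, i.e. ~50K transactions over 40ms; function names are hypothetical):

```rust
// How many 5ms intervals a constant burst runs before the EMA crosses the
// throttling threshold (95% of 2500 streams per 5ms, i.e. 500K TPS).
fn intervals_before_throttle(n: u64, burst_per_interval: f64) -> u64 {
    let alpha = 2.0 / (n as f64 + 1.0); // alpha = 2 / (N + 1)
    let threshold = 0.95 * 2500.0;
    let mut ema = 0.0;
    let mut intervals = 0;
    while ema < threshold {
        ema = alpha * burst_per_interval + (1.0 - alpha) * ema;
        intervals += 1;
    }
    intervals
}

fn main() {
    for n in [10, 40] {
        let k = intervals_before_throttle(n, 6250.0);
        println!(
            "N={n}: throttling activates after {k} intervals (~{}ms, ~{:.0}K txs absorbed)",
            k * 5,
            k as f64 * 6.25
        );
    }
}
```

In this simplified model N=10 saturates after only 3 intervals (~15ms), while N=40 rides out the full ~50K/40ms burst and only throttles if it persists past ~50ms, consistent with the figure above.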
Fixes #