Add an optional bandwidth cap to TieredMergePolicy? #14148

Open

mikemccand (Member) opened this issue Jan 17, 2025 · 0 comments

Description

TL;DR: TieredMergePolicy can create massive snapshots if you configure an aggressive deletesPctAllowed, which can hurt searchers (causing page fault storms) in a near-real-time replication world. Maybe we could add an optional (off by default) "rate limit" on how many amortized bytes/sec TMP is merging? This is just an idea / brainstorming / design discussion so far ... no PR.

Full context:

At Amazon (Product Search team) we use near-real-time segment replication to efficiently distribute index updates to all searchers/replicas.

Since we have many searchers per indexer shard, to scale to very high QPS, we intentionally tune TieredMergePolicy (TMP) to very aggressively reclaim deletions. Burning extra CPU / bandwidth during indexing to save even a little bit of CPU during searching is a good tradeoff for us (and in general, for Lucene users with high QPS requirements).
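For concreteness, this is roughly the kind of tuning meant here (illustrative values only, not our actual production settings):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

public class AggressiveDeletesConfig {
  public static IndexWriterConfig newConfig() {
    TieredMergePolicy tmp = new TieredMergePolicy();
    tmp.setDeletesPctAllowed(20.0);       // lower values reclaim deletes more aggressively (more merging)
    tmp.setMaxMergedSegmentMB(5 * 1024);  // max-sized (~5 GB) merged segments, as described below

    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    iwc.setMergePolicy(tmp);
    return iwc;
  }
}
```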

But we have a "fun" problem occasionally: sometimes we have an update storm (an upstream team reindexes large-ish portions of Amazon's catalog through the real-time indexing stream), and this leads to lots and lots of merging and many large (max-sized 5 GB) segments being replicated out to searchers in short order, sometimes over links (e.g. cross-region) that are not as crazy-fast as the within-region networking fabric. Our searchers then fall behind a bit.

Falling behind is not the end of the world: the searchers simply skip some point-in-time snapshots and jump to the latest one, effectively sub-sampling checkpoints as best they can given the bandwidth constraints. Index-to-search latency is hurt a bit, but recovers once the indexer catches up on the update storm.

The bigger problem for us is that we size our shards, roughly, so that the important parts of the index (the parts hit often by query traffic) are fully "hot", i.e. so the OS has enough RAM to hold the hot parts of the index. But when it takes too long to copy and light (switch over to the new segments for searching) a given snapshot, and we skip the next one or two snapshots, the follow-on snapshot that we finally do load may have a sizable part of the index rewritten, and that snapshot may be a big percentage of the overall index, so copying/lighting it will stress the OS into a paging storm, hurting our long-pole latencies.

So one possible solution we thought of is to add an optional (off by default) setMaxBandwidth to TMP so that "on average" (amortized over some time window-ish) TMP would not produce so many merges that it exceeds that bandwidth cap. With such a cap, during an update storm (war time), the index delete percentage would necessarily increase beyond what we ideally want / configured with setDeletesPctAllowed, but then during peace time, TMP could again catch up and push the deletes back below the target.
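A very rough, hypothetical sketch of what this could look like (none of this API exists today; the class name, maxBytesPerSec knob, and the bandwidth bookkeeping are all invented for illustration, and findMerges signatures differ a bit across Lucene versions): extend TMP, track how many merge input bytes have been admitted, and simply defer proposed merges once the amortized rate exceeds the cap; TMP will naturally propose them again on a later pass.

```java
import java.io.IOException;
import org.apache.lucene.index.MergeTrigger;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.index.TieredMergePolicy;

// Hypothetical sketch only -- not a real Lucene API. Defers merges once the
// amortized merge bandwidth since startup exceeds maxBytesPerSec; a real
// implementation would use a sliding window and handle forced merges specially.
public class BandwidthCappedTieredMergePolicy extends TieredMergePolicy {
  private final double maxBytesPerSec;             // invented knob, e.g. 50 * 1024 * 1024
  private final long startNanos = System.nanoTime();
  private long bytesAdmitted;                      // merge input bytes admitted so far

  public BandwidthCappedTieredMergePolicy(double maxBytesPerSec) {
    this.maxBytesPerSec = maxBytesPerSec;
  }

  @Override
  public MergeSpecification findMerges(MergeTrigger trigger, SegmentInfos infos,
                                       MergeContext context) throws IOException {
    MergeSpecification spec = super.findMerges(trigger, infos, context);
    if (spec == null) {
      return null;
    }
    double elapsedSec = Math.max((System.nanoTime() - startNanos) / 1e9, 1.0);
    MergeSpecification capped = new MergeSpecification();
    for (OneMerge merge : spec.merges) {
      long mergeBytes = merge.totalBytesSize();
      // Admit this merge only if the amortized rate stays under the cap;
      // otherwise skip it for now (TMP will re-propose it on a later call).
      if ((bytesAdmitted + mergeBytes) / elapsedSec <= maxBytesPerSec) {
        capped.add(merge);
        bytesAdmitted += mergeBytes;
      }
    }
    return capped.merges.isEmpty() ? null : capped;
  }
}
```

This is only meant to make the idea concrete; the actual accounting (sliding window, interaction with forced merges / forced-deletes merges, whether it lives in TMP itself as setMaxBandwidth or elsewhere) would need real design work.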
