
Optimize ScyllaDB's batch writes #4047


Open
wants to merge 1 commit into base: main

Conversation

ndr-ds
Contributor

@ndr-ds ndr-ds commented Jun 2, 2025

Motivation

It's a known issue that batches are not token aware in ScyllaDB's Rust driver: the default load balancing policies send a batch to a random node, which then forwards the statements to the proper nodes. That means an extra network hop for most batch requests.
So today, someone using the Rust driver who needs atomicity can use batches, but they take a performance hit because the batch isn't token aware.
To get the best batch performance while keeping the per-partition atomicity that batches guarantee, we would need shard-aware batching, which the Rust driver doesn't support yet.
There is some work being attempted upstream on “shard aware batching”, but one of the reviewers argues that the problem can be solved without user code, by creating a custom Load Balancing Policy. That's what this PR does.

Proposal

Build a custom "Sticky" Load Balancing Policy. The policy is specific to a given partition: it remembers the (node, shard) pairs of all replicas that hold this partition's data, and routes every batch we send to one of those replicas, in round-robin fashion, to spread load across them.

We keep an LRU cache keyed on the partition key, whose entries are either Ready or NotReady.
The reason is that we sometimes ask for a token's endpoint information before the Rust driver has refreshed its metadata for the table, so the information isn't available yet. A Ready entry holds the actual "sticky" policy, with the (node, shard) endpoints, and we're good to go. A NotReady entry holds the timestamp of the last attempt to get the endpoints. We always wait at least 2 seconds before trying again, to give the driver time to update itself and to avoid overloading it with endpoint requests. Until then we fall back to the default policy and take a small performance hit, but only for a very limited time.

A NotReady entry can also already hold the Token for that partition, if we managed to calculate it during the last attempt. The Token is a Murmur3 hash of the table spec and partition key, so as long as the table doesn't change, the Token for a given partition key never changes. Since hashing is involved, we cache it to avoid repeating that work.

If we ever decide to autoscale our ScyllaDB deployment based on load, we'll need a mechanism to invalidate these cache entries when that happens.
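
To make the caching scheme above concrete, here is a minimal sketch of the lookup path. It assumes an lru::LruCache and uses placeholder types for the token and the sticky policy (the real code uses the driver's Token and an Arc<dyn LoadBalancingPolicy>); names like CacheEntry and sticky_policy_for are illustrative, not the PR's actual identifiers.

use std::{
    sync::Arc,
    time::{Duration, Instant},
};

use lru::LruCache;

// Placeholder types for illustration only.
struct Token(i64);
struct StickyPolicy; // knows the (node, shard) replicas and round-robins over them

enum CacheEntry {
    Ready(Arc<StickyPolicy>),
    // Instant of the last attempt, plus the token if it was already computed.
    NotReady(Instant, Option<Token>),
}

const RETRY_DELAY: Duration = Duration::from_secs(2);

// Returns the sticky policy for this partition key, or None to signal that the
// caller should fall back to the default policy for now.
fn sticky_policy_for(
    cache: &mut LruCache<Vec<u8>, CacheEntry>,
    partition_key: Vec<u8>,
) -> Option<Arc<StickyPolicy>> {
    let retry = match cache.get(&partition_key) {
        Some(CacheEntry::Ready(policy)) => return Some(policy.clone()),
        Some(CacheEntry::NotReady(last_attempt, _)) => last_attempt.elapsed() >= RETRY_DELAY,
        None => true,
    };
    if retry {
        // The real code would now try to resolve the (node, shard) replicas from the
        // driver's metadata and, on success, store a Ready entry instead.
        cache.put(partition_key, CacheEntry::NotReady(Instant::now(), None));
    }
    None
}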

Test Plan

CI. I also won't merge before benchmarking this code together with the new key space partitioning PR, to make sure the performance is what we expect.

Release Plan

  • Nothing to do / These changes follow the usual release cycle.


@ndr-ds ndr-ds force-pushed the 06-02-optimize_scylladb_s_batch_writes branch 3 times, most recently from 46899e9 to 78c5f0a Compare June 2, 2025 18:31
@ndr-ds ndr-ds force-pushed the 05-22-optimize_scylladb_usage branch from 4acb372 to 6106db7 Compare June 2, 2025 19:05
@ndr-ds ndr-ds force-pushed the 06-02-optimize_scylladb_s_batch_writes branch from 78c5f0a to 0c95618 Compare June 2, 2025 19:05
@ndr-ds ndr-ds changed the base branch from 05-22-optimize_scylladb_usage to graphite-base/4047 June 2, 2025 23:22
@ndr-ds ndr-ds force-pushed the graphite-base/4047 branch from 6106db7 to af51607 Compare June 2, 2025 23:22
@ndr-ds ndr-ds force-pushed the 06-02-optimize_scylladb_s_batch_writes branch from 0c95618 to 768208f Compare June 2, 2025 23:22
@graphite-app graphite-app bot changed the base branch from graphite-base/4047 to main June 2, 2025 23:22
@ndr-ds ndr-ds force-pushed the 06-02-optimize_scylladb_s_batch_writes branch from 768208f to 0b093e3 Compare June 2, 2025 23:22
@ndr-ds ndr-ds marked this pull request as ready for review June 3, 2025 13:09
@ndr-ds ndr-ds force-pushed the 06-02-optimize_scylladb_s_batch_writes branch 2 times, most recently from 09d6c81 to af0e20c Compare June 3, 2025 19:11
@ndr-ds ndr-ds changed the base branch from main to graphite-base/4047 June 3, 2025 21:03
@ndr-ds ndr-ds force-pushed the 06-02-optimize_scylladb_s_batch_writes branch from af0e20c to fa4280f Compare June 3, 2025 21:03
@ndr-ds ndr-ds changed the base branch from graphite-base/4047 to 06-03-truncate_query_output_on_query_node June 3, 2025 21:03
@ndr-ds ndr-ds changed the base branch from 06-03-truncate_query_output_on_query_node to graphite-base/4047 June 3, 2025 21:29
@ndr-ds ndr-ds force-pushed the 06-02-optimize_scylladb_s_batch_writes branch from fa4280f to e1c5311 Compare June 3, 2025 21:29
@graphite-app graphite-app bot changed the base branch from graphite-base/4047 to main June 3, 2025 21:29
@ndr-ds ndr-ds force-pushed the 06-02-optimize_scylladb_s_batch_writes branch 3 times, most recently from 6ada5da to a8361a5 Compare June 4, 2025 03:39
Comment on lines +332 to +336
self.multi_keys
.insert(num_markers, prepared_statement.clone());
Contributor

@MathieuDutSik MathieuDutSik Jun 4, 2025


Why switch back to the construction with two searches, get and then insert?

Contributor Author


DashMap's entry will actually deadlock if you call it while anyone else is using the map :)
So it's not a good idea here; using it was actually a bug. We got "lucky" that CI didn't trigger it much in the PR that was merged, but it would be triggered more and more with these PRs, especially the new partitions one.
CI in this PR was already deadlocking sometimes.

Contributor


Ok, but then I would add a comment on that.
I think the most appropriate would be: "We are not using dashmap::DashMap::entry in order to reduce contention and avoid creating deadlocks".

Contributor


I guess it's generally never safe to hold DashMap references over an await point. (Or is it?)

Contributor Author


@MathieuDutSik The DashMap documentation is pretty clear about the deadlock risk; I think adding another comment here is redundant.

@afck Exactly, it's not safe. That's why it's best to avoid anything that holds a reference into the map for too long, like entry. It is designed for concurrent access, but there are these nuances to be careful about.
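
To illustrate the pattern discussed in this thread, here is a hedged sketch of the two-search construction: get for the fast path, then insert after the statement is prepared, so no map guard is ever held across an .await. PreparedStatement, Client and prepare are hypothetical stand-ins, not the PR's actual types.

use dashmap::DashMap;

// Hypothetical stand-ins for the PR's types.
#[derive(Clone)]
struct PreparedStatement;

struct Client {
    multi_keys: DashMap<usize, PreparedStatement>,
}

impl Client {
    async fn get_multi_key_statement(&self, num_markers: usize) -> PreparedStatement {
        // First search: fast path if the statement was already prepared. The read
        // guard returned by get is dropped here, before any await.
        if let Some(statement) = self.multi_keys.get(&num_markers) {
            return statement.clone();
        }
        let statement = self.prepare(num_markers).await;
        // Second search: insert the freshly prepared statement. Unlike entry(),
        // this never holds a lock on the map while other work happens.
        self.multi_keys.insert(num_markers, statement.clone());
        statement
    }

    async fn prepare(&self, _num_markers: usize) -> PreparedStatement {
        // Stand-in for the actual asynchronous statement preparation.
        PreparedStatement
    }
}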

@afck
Contributor

afck commented Jun 4, 2025

Do we need this if we never have batches that affect multiple shards?

Can we enforce that each view is on a single shard, and then just not batch anything else?

@ndr-ds ndr-ds force-pushed the 06-02-optimize_scylladb_s_batch_writes branch from a8361a5 to b6a6083 Compare June 4, 2025 17:17
@ndr-ds ndr-ds mentioned this pull request Jun 4, 2025
Contributor Author

ndr-ds commented Jun 4, 2025

Do we need this if we never have batches that affect multiple shards?

Yes. Even if all queries in the batch are for the same partition key, the Rust driver will by default still send the batch to a random node.

Can we enforce that each view is on a single shard, and then just not batch anything else?

Yes, but that's a much bigger change I think 😅 both enforcing that each view is on a single shard, and not batching everything else.
I want to do a few follow-ups here where we add that enforcement and stop issuing single-query batches, running those queries directly instead. Even then it's still useful to keep this, because batches remain reasonably performant for people who don't care about atomicity but use batches just to minimize network round trips to the DB.

@ndr-ds ndr-ds changed the base branch from main to graphite-base/4047 June 5, 2025 05:28
@ndr-ds ndr-ds force-pushed the 06-02-optimize_scylladb_s_batch_writes branch from b6a6083 to c8b5e29 Compare June 5, 2025 05:28
@ndr-ds ndr-ds changed the base branch from graphite-base/4047 to 06-04-some_code_cleanups June 5, 2025 05:28
@ndr-ds ndr-ds changed the base branch from 06-04-some_code_cleanups to graphite-base/4047 June 5, 2025 05:47
@ndr-ds ndr-ds force-pushed the graphite-base/4047 branch from 877abcb to 6976d9c Compare June 5, 2025 06:25
@ndr-ds ndr-ds force-pushed the 06-02-optimize_scylladb_s_batch_writes branch from c8b5e29 to 36cfe7b Compare June 5, 2025 06:25
@ndr-ds ndr-ds changed the base branch from graphite-base/4047 to main June 5, 2025 06:25
@@ -98,13 +113,38 @@ const MAX_BATCH_SIZE: usize = 5000;
/// The keyspace to use for the ScyllaDB database.
const KEYSPACE: &str = "kv";

/// The default size of the cache for the load balancing policies.
const DEFAULT_LOAD_BALANCING_POLICY_CACHE_SIZE: usize = 50_000;
Contributor


Does this work now, as of Rust 1.83.0?

Suggested change
const DEFAULT_LOAD_BALANCING_POLICY_CACHE_SIZE: usize = 50_000;
const DEFAULT_LOAD_BALANCING_POLICY_CACHE_SIZE: NonZeroUsize = NonZeroUsize::new(50_000).unwrap();

Contributor Author


Seems like it does!
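
For reference, a minimal standalone version of the suggested change, with the import it needs (as discussed, const Option::unwrap is available as of Rust 1.83.0):

use std::num::NonZeroUsize;

// Compiles on Rust 1.83.0 and later, where Option::unwrap can be used in const context.
const DEFAULT_LOAD_BALANCING_POLICY_CACHE_SIZE: NonZeroUsize =
    NonZeroUsize::new(50_000).unwrap();

fn main() {
    assert_eq!(DEFAULT_LOAD_BALANCING_POLICY_CACHE_SIZE.get(), 50_000);
}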

policy
}
Err(error) => {
// Cache that the policy creation failed, so we don't try again too soon, and don't
Contributor


Is this expected? Should we log if that happens?

Contributor Author


I can add a WARN here, I think, but it shouldn't happen too many times AFAIU.

match policy {
LoadBalancingPolicyCacheEntry::Ready(policy) => policy.clone(),
LoadBalancingPolicyCacheEntry::NotReady(timestamp, token) => {
if Timestamp::now().delta_since(*timestamp)
Contributor


Maybe we should use Instance here instead of Timestamp? Our linera_base timestamp type exists only to define and serialize protocol timestamps as u64s. Since this is only used locally, I'd go with the standard library. This expression could then be timestamp.elapsed().

Contributor


Sorry, I meant Instant.
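
A tiny sketch of what the suggested comparison would look like with std::time::Instant; retry_delay stands in for the configured delay and is an assumed name:

use std::time::{Duration, Instant};

// True if enough time has passed since the last attempt to retry creating the policy.
fn should_retry(last_attempt: Instant, retry_delay: Duration) -> bool {
    last_attempt.elapsed() >= retry_delay
}

fn main() {
    let last_attempt = Instant::now();
    assert!(!should_retry(last_attempt, Duration::from_secs(2)));
}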

@ndr-ds ndr-ds force-pushed the 06-02-optimize_scylladb_s_batch_writes branch from 36cfe7b to cef2f5d Compare June 5, 2025 10:47
Comment on lines 126 to 129
pub struct ScyllaDbClientConfig {
/// The delay before the sticky load balancing policy creation is retried.
pub delay_before_sticky_load_balancing_policy_retry_ms: u64,
}
Contributor


Now that we have a ScyllaDbClientConfig, why not put the DEFAULT_LOAD_BALANCING_POLICY_CACHE_SIZE in it?

Contributor


Also, the replication_factor could have its place here. It does not have to be in this PR, though.

Contributor Author


I could, sure! And on the replication_factor: yeah, that's on my TODO list 😅 I noticed a while back that it doesn't belong in the common config, as it's actually specific to ScyllaDB.
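
A sketch of what the config could look like once the cache size (and later the replication factor) moves into it, as suggested above; the new field names and default values are assumptions, not the PR's actual code:

pub struct ScyllaDbClientConfig {
    /// The delay before the sticky load balancing policy creation is retried.
    pub delay_before_sticky_load_balancing_policy_retry_ms: u64,
    /// The size of the LRU cache for the sticky load balancing policies (assumed field).
    pub load_balancing_policy_cache_size: usize,
    /// The replication factor for the keyspace (assumed field; currently in the common config).
    pub replication_factor: u32,
}

impl Default for ScyllaDbClientConfig {
    fn default() -> Self {
        Self {
            delay_before_sticky_load_balancing_policy_retry_ms: 2_000,
            load_balancing_policy_cache_size: 50_000,
            replication_factor: 3, // illustrative default
        }
    }
}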

Comment on lines +118 to +120
enum LoadBalancingPolicyCacheEntry {
Ready(Arc<dyn LoadBalancingPolicy>),
// The timestamp of the last time the policy creation was attempted.
NotReady(Instant, Option<Token>),
}
Contributor


I have a maybe stupid concern: I think the sharding policy in ScyllaDB is dynamic.
So I wonder whether this affects the caching.

Contributor Author


Good question. For a given partition_key the Token is always the same, since it's always calculated by the same hash function, so the Token won't change. As for the endpoints, AFAIU token ranges only get reshuffled (changing which nodes hold a given token) when we scale ScyllaDB up or down, i.e. add or remove VMs in the NodePool, run ALTER TABLE commands, things like that.
We currently don't do any of that, and supporting it now would complicate the code a bit, so I'd rather leave it for when it's needed.

Contributor

@ma2bd ma2bd Jun 5, 2025


Wait, we don't support changes in the sharding assignment on the ScyllaDb side?

Contributor Author


This sticky load balancing policy currently doesn't, but we can add support for that in a follow-up PR, I think :)

@ndr-ds ndr-ds force-pushed the 06-02-optimize_scylladb_s_batch_writes branch from cef2f5d to aca7d91 Compare June 5, 2025 20:40
Contributor

@MathieuDutSik MathieuDutSik left a comment


No obstacle on my side, but I would like to see benchmarks confirming the improvements before merging.

Benchmarks can be added to the PR description.

Comment on lines 522 to 524
if let Some(policy) = cache.get(partition_key) {
match policy {
LoadBalancingPolicyCacheEntry::Ready(policy) => policy.clone(),
Contributor

@ma2bd ma2bd Jun 5, 2025


No need to nest if let and match:

match cache.get(partition_key) {
    Some(LoadBalancingPolicyCacheEntry::Ready(policy)) => ...
    ...
    None => ...
}

@ndr-ds ndr-ds force-pushed the 06-02-optimize_scylladb_s_batch_writes branch from aca7d91 to 9367ee5 Compare June 5, 2025 23:21
Contributor

@ma2bd ma2bd left a comment


Let's experiment with this on a separate branch first. The fact that this doesn't support changes in the shard assignments by ScyllaDb is a problem.

Contributor Author

ndr-ds commented Jun 16, 2025

I can add support for that in this PR, or in a following one so as not to inflate this one even more (I won't merge this without it; I'll probably merge the full stack at once). I was thinking of building a stack that we know works, then merging the whole stack once we agree it gets things to a good state (instead of using a separate branch).
Happy to do a separate branch as well, though, if we prefer that 😅
