Optimize ScyllaDB's batch writes #4047
Conversation
This stack of pull requests is managed by Graphite.
self.multi_keys
    .insert(num_markers, prepared_statement.clone());
Why switch back to the construction with two searches, `get` and then `insert`?
`DashMap`'s `entry` will actually deadlock if you call it and there's anyone else using the map :)
So it's not a good idea here, and it was actually a bug to use it. We got "lucky" that CI didn't trigger it much in the PR that was merged, but it will be triggered increasingly often with these PRs, especially the new partitions one.
CI in this PR was already deadlocking sometimes.
Ok, but then I would add a comment on that.
I think the most appropriate is: "We are not using `dashmap::DashMap::entry` in order to reduce contention and avoid creating deadlocks".
I guess it's generally never safe to hold `DashMap` references over an `await` point. (Or is it?)
@MathieuDutSik The `DashMap` documentation is pretty clear about the deadlock risk; I think adding another comment here is redundant.
@afck Exactly, it's not safe. Which is why it's best to avoid things that hold a reference to the map for too long, like using `entry`, for example. It is designed for concurrent access, though; there are just these nuances to be careful about.
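For illustration, here is a minimal sketch of the two-search pattern being discussed, with a hypothetical `DashMap<usize, Arc<String>>` standing in for the real prepared-statement cache; each `get`/`insert` call only holds a shard lock for the duration of that single call, unlike an `entry` guard:

```rust
use std::sync::Arc;

use dashmap::DashMap;

// Hypothetical cache: number of bind markers -> prepared statement text.
fn get_or_insert(cache: &DashMap<usize, Arc<String>>, num_markers: usize) -> Arc<String> {
    if let Some(statement) = cache.get(&num_markers) {
        // The read guard is dropped at the end of this block, before we return.
        return statement.value().clone();
    }
    // Two tasks may race here and both prepare a statement; the second insert
    // simply overwrites the first, which is fine for an idempotent cache.
    let prepared = Arc::new(format!("INSERT ... ({num_markers} markers)"));
    cache.insert(num_markers, prepared.clone());
    prepared
}
```

With `entry`, the write guard would instead be held for the whole preparation step, and holding any of these guards across an `.await` point is what can lead to the deadlocks mentioned above.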
Do we need this if we never have batches that affect multiple shards? Can we enforce that each view is on a single shard, and then just not batch anything else?
Yes, even if all queries in the batch are for the same partition key, the Rust driver will still just send the batch to a random node by default.
Yes, but that's a much bigger change, I think 😅 both enforcing that each view is on a single shard, and not batching anything else.
@@ -98,13 +113,38 @@ const MAX_BATCH_SIZE: usize = 5000;
/// The keyspace to use for the ScyllaDB database.
const KEYSPACE: &str = "kv";

/// The default size of the cache for the load balancing policies.
const DEFAULT_LOAD_BALANCING_POLICY_CACHE_SIZE: usize = 50_000;
Does this work now, as of Rust 1.83.0?
- const DEFAULT_LOAD_BALANCING_POLICY_CACHE_SIZE: usize = 50_000;
+ const DEFAULT_LOAD_BALANCING_POLICY_CACHE_SIZE: NonZeroUsize = NonZeroUsize::new(50_000).unwrap();
Seems like it does!
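For reference, a minimal sketch of why this now compiles: `Option::unwrap` became callable in const context as of Rust 1.83.0, so a zero value would be rejected at compile time rather than at runtime. (What the surrounding cache type actually expects as a capacity is not shown here.)

```rust
use std::num::NonZeroUsize;

// Const-evaluated: if the literal were 0, this would fail to compile.
const DEFAULT_LOAD_BALANCING_POLICY_CACHE_SIZE: NonZeroUsize =
    NonZeroUsize::new(50_000).unwrap();

fn main() {
    println!("capacity = {}", DEFAULT_LOAD_BALANCING_POLICY_CACHE_SIZE);
}
```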
    policy
}
Err(error) => {
    // Cache that the policy creation failed, so we don't try again too soon, and don't
Is this expected? Should we log if that happens?
I can do a `WARN` here, I think, but it shouldn't happen too many times AFAIU.
match policy {
    LoadBalancingPolicyCacheEntry::Ready(policy) => policy.clone(),
    LoadBalancingPolicyCacheEntry::NotReady(timestamp, token) => {
        if Timestamp::now().delta_since(*timestamp)
Maybe we should use `Instance` here instead of `Timestamp`? Our `linera_base` timestamp type is just there to define and serialize timestamps in the protocol as `u64`s. Since this is only used locally, I'd go with the standard library. Also, this expression could then be `timestamp.elapsed()`.
Sorry, I meant `Instant`.
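A minimal sketch of what the suggestion amounts to, with a hypothetical `should_retry` helper and the 2-second delay mentioned in the PR description:

```rust
use std::time::{Duration, Instant};

// Hypothetical retry gate: only retry building the sticky policy once the
// configured delay has elapsed since the last failed attempt.
const RETRY_DELAY: Duration = Duration::from_secs(2);

fn should_retry(last_attempt: Instant) -> bool {
    // `last_attempt.elapsed()` replaces `Timestamp::now().delta_since(*timestamp)`.
    last_attempt.elapsed() >= RETRY_DELAY
}

fn main() {
    let last_attempt = Instant::now();
    assert!(!should_retry(last_attempt));
}
```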
pub struct ScyllaDbClientConfig {
    /// The delay before the sticky load balancing policy creation is retried.
    pub delay_before_sticky_load_balancing_policy_retry_ms: u64,
}
Now that we have a `ScyllaDbClientConfig`, why not put the `DEFAULT_LOAD_BALANCING_POLICY_CACHE_SIZE` in it?
Also, the replication_factor could have its place here. It does not have to be in this PR, though.
I could, sure! And on the `replication_factor`, yeah, that's on my TODO list 😅 I noticed a while back that it doesn't belong in the common config, as it's actually specific to ScyllaDb.
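A sketch of what the suggestion could look like; the new field name and the defaults below are assumptions (the 2-second delay and the 50,000 cache size are taken from the description and the constant above):

```rust
pub struct ScyllaDbClientConfig {
    /// The delay before the sticky load balancing policy creation is retried.
    pub delay_before_sticky_load_balancing_policy_retry_ms: u64,
    /// The size of the cache for the load balancing policies.
    pub load_balancing_policy_cache_size: usize,
}

impl Default for ScyllaDbClientConfig {
    fn default() -> Self {
        Self {
            delay_before_sticky_load_balancing_policy_retry_ms: 2_000,
            load_balancing_policy_cache_size: 50_000,
        }
    }
}
```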
enum LoadBalancingPolicyCacheEntry {
    Ready(Arc<dyn LoadBalancingPolicy>),
    // The timestamp of the last time the policy creation was attempted.
    NotReady(Instant, Option<Token>),
}
I have a maybe stupid concern, but I think that the sharding policy in ScyllaDb is dynamic. So I wonder if this affects the caching.
Good question. For a given `partition_key`, the `Token` is always unique, as it's always calculated by the same hash function, so the `Token` won't change. As for the endpoints, AFAIU we'll only reshuffle token ranges (causing the nodes that hold a given token to change) when we scale ScyllaDB up or down, as in adding or removing VMs in the NodePool, or when we run `ALTER TABLE` commands, stuff like that.
We currently don't do that, and it would complicate the code a bit to add support for it now, so I would rather leave it for when it's needed.
Wait, we don't support changes in the sharding assignment on the ScyllaDb side?
This sticky load balancing policy part currently doesn't, but we can add support for that in a follow up PR, I think :)
No obstacle on my side, but I would like to see benchmarks confirming the improvements before merging.
Benchmarks can be added to the PR description.
if let Some(policy) = cache.get(partition_key) {
    match policy {
        LoadBalancingPolicyCacheEntry::Ready(policy) => policy.clone(),
No need for nesting `if let` and `match`:
match cache.get(partition_key) {
    Some(LoadBalancingPolicyCacheEntry::Ready(policy)) => ...
    ...
    None => ...
}
Let's experiment with this on a separate branch first. The fact that this doesn't support changes in the shard assignments by ScyllaDb is a problem.
I can add support for that in this PR, or in a following one, so as not to inflate the size of this one even more (I won't merge this without that; I'll probably merge the full stack at once). I was thinking of creating a stack that we know works, then merging the full stack once we agree it gets things to a good state (instead of doing a separate branch).
Motivation
It is a known issue that batches are not token aware in ScyllaDB's Rust driver: with the default load balancing policies, a batch is sent to a random node, which then forwards the statements to the proper nodes. That also means there's an extra network hop for most batch requests.
So currently, if someone using the Rust driver needs atomicity, they can use batches, but they take a bit of a performance hit because the batch won't be token aware.
For us to get the best performance out of batches, while keeping the per-partition atomicity they guarantee, we would need shard-aware batching, but that's not yet supported in the Rust driver.
There is some work being attempted on "shard aware batching", but one of the reviewers argues that there are ways of solving this problem that don't involve user code: namely, creating a custom load balancing policy, which is what this PR does.
Proposal
Build a custom "Sticky" load balancing policy. This policy is specific to a given partition: it remembers the `(node, shard)` pairs for all the replicas containing that partition's data, and then sends every batch to one of those replicas, in a round-robin fashion to spread load across them.

We'll have an LRU cache keyed on the partition key that contains either a `Ready` value or a `NotReady` value. The reason for this is that there are cases where you try to get the endpoint information for a token and the Rust driver hasn't updated its metadata about the table yet, so that information isn't filled in. If we have a `Ready` value, we already have the actual "sticky" policy with the `(node, shard)` endpoints, and we're good to go. If we have a `NotReady` value, we have a timestamp of when we last attempted to get the endpoints. We always wait at least 2 seconds before trying again, to give the driver time to update itself and to avoid overloading it with these endpoint requests. Until then we use the default policy and take a bit of a performance hit, but that should only last a very limited time.

The `NotReady` state can also contain the `Token` for that partition, in case we managed to calculate it in the last attempt. The `Token` is calculated by doing a `Murmur3` hash of the table's specs and the partition key. If the table doesn't change, that `Token` will never change for this partition key. Since there's hashing involved, we cache it to avoid repeating that work.

If we ever decide to auto-scale our ScyllaDB deployment based on load, we'll need to add a mechanism here to invalidate these cache entries when that happens.
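To make the scheme concrete, here is a self-contained sketch of the cache entries and the decision logic. `Token`, `StickyPolicy`, and the helper below are hypothetical stand-ins: the real policy implements the driver's `LoadBalancingPolicy` trait and the cache is an LRU keyed on the partition key.

```rust
use std::{
    sync::{
        atomic::{AtomicUsize, Ordering},
        Arc,
    },
    time::{Duration, Instant},
};

// Hypothetical stand-in for the driver's token type.
type Token = i64;

struct StickyPolicy {
    /// `(node address, shard)` pairs of the replicas owning this partition's data.
    replicas: Vec<(String, u32)>,
    /// Round-robin cursor used to spread batches across the replicas.
    next: AtomicUsize,
}

impl StickyPolicy {
    /// Pick the next replica in round-robin order.
    fn pick_replica(&self) -> &(String, u32) {
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.replicas.len();
        &self.replicas[i]
    }
}

enum CacheEntry {
    /// The sticky policy is built: batches go straight to one of the replicas.
    Ready(Arc<StickyPolicy>),
    /// Building failed (driver metadata not ready yet): when we last tried, plus the
    /// token if we already computed it, so the Murmur3 hash isn't redone next time.
    NotReady(Instant, Option<Token>),
}

const RETRY_DELAY: Duration = Duration::from_secs(2);

/// Decide what to do for a batch on a given partition.
fn next_step(entry: Option<&CacheEntry>) -> &'static str {
    match entry {
        Some(CacheEntry::Ready(_)) => "send via the sticky policy",
        Some(CacheEntry::NotReady(at, _)) if at.elapsed() < RETRY_DELAY => {
            "use the default policy, retry building later"
        }
        _ => "try to build the sticky policy now",
    }
}
```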
Test Plan
CI + I won't merge before I benchmark this code together with the new key space partitioning PR, to make sure the performance is what we expect.
Release Plan