
[core] Make GCS InternalKV workload configurable to the Policy. #47736

Open · rynewang wants to merge 18 commits into master from dedicated-kv-ioctx
Conversation

@rynewang rynewang commented Sep 18, 2024

Since #48231 we can define a policy for IoContext runs. To make a workload configurable, one needs to fix thread safety for the involved classes and then define a policy for the class.

Makes PeriodicalRunner thread-safe with an atomic bool (a minimal sketch is included below).

Adds a policy that assigns InternalKVManager to the default IO context for:

  1. InternalKV gRPC service
  2. InMemoryStoreClient callbacks
  3. RedisStoreClient callbacks

Notably, gcs_table_storage_ is not controlled by this policy, because it is used by all other non-KV services and we want it to be isolated from InternalKVManager's policy.
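
For illustration only, a minimal sketch of the thread-safety pattern referred to above (this is not the actual PeriodicalRunner; the class and method names here are hypothetical): a plain stop flag that more than one thread reads and writes becomes a std::atomic<bool>, so no extra mutex is needed.

```cpp
#include <atomic>
#include <functional>

// Hypothetical sketch, not the real PeriodicalRunner: the point is only that
// a stop flag touched by more than one thread becomes std::atomic<bool>,
// so checks and updates are race-free without adding a mutex.
class PeriodicalRunnerSketch {
 public:
  void Stop() { stopped_.store(true); }

  void RunOnce(const std::function<void()> &fn) {
    // May run on the io-context thread while Stop() runs on another thread.
    if (!stopped_.load()) {
      fn();
    }
  }

 private:
  std::atomic<bool> stopped_{false};
};
```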

Signed-off-by: Ruiyang Wang <[email protected]>
@rynewang rynewang requested a review from a team as a code owner September 18, 2024 21:20
@rynewang rynewang force-pushed the dedicated-kv-ioctx branch 2 times, most recently from aadcd14 to c84fd80 on September 20, 2024 00:19
@rynewang (Contributor, Author):

Note on the force push:

  1. c84fd80 is the original commit of the internal KV change.
  2. aadcd14 additionally moved GcsNodeManager to its own dedicated thread. This did not work and caused segfaults, since the new thread and the main thread both read and write GcsNodeManager's internal maps.
  3. Now reverting to c84fd80 and merging master.

@rynewang rynewang added the go (add ONLY when ready to merge, run all tests) label on Sep 23, 2024
@rynewang (Contributor, Author):

Note: TSAN caught a new data race: GetAllJobInfo sends RPCs to workers (on the main thread) and does a KVMultiGet (on the new ioctx thread). Both modify a counter that implements the "gather all these tasks, then run the callback" logic. I don't want a new mutex just for this, so I used atomics.

So now both threads access the same atomic<size_t>. They perform an atomic fetch_add to increment the counter and use the returned value to decide whether the callback can run. This also simplifies the code from an int + a bool to a single size_t.
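
A rough sketch of that counter pattern (GatherCounter and OnResult are hypothetical names; the real change lives inside GetAllJobInfo): fetch_add returns the previous value, so exactly one of the concurrent callers observes the count reaching the target and runs the callback.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical sketch of "gather N results, then fire the callback" using a
// single std::atomic<size_t> instead of an int + bool behind a mutex.
// OnResult() may be called concurrently from two threads.
class GatherCounter {
 public:
  GatherCounter(size_t expected, std::function<void()> on_done)
      : expected_(expected), on_done_(std::move(on_done)) {}

  void OnResult() {
    // fetch_add returns the previous value; exactly one caller sees the
    // count reach expected_ and runs the callback, whatever the interleaving.
    if (finished_.fetch_add(1) + 1 == expected_) {
      on_done_();
    }
  }

 private:
  const size_t expected_;
  std::atomic<size_t> finished_{0};
  std::function<void()> on_done_;
};
```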

Comment on lines 844 to 854
RAY_LOG(INFO) << "main_service_ Event stats:\n\n"
<< main_service_.stats().StatsString() << "\n\n";
RAY_LOG(INFO) << "pubsub_io_context_ Event stats:\n\n"
<< pubsub_io_context_.GetIoService().stats().StatsString() << "\n\n";
RAY_LOG(INFO) << "kv_io_context_ Event stats:\n\n"
<< kv_io_context_.GetIoService().stats().StatsString() << "\n\n";
RAY_LOG(INFO) << "task_io_context_ Event stats:\n\n"
<< task_io_context_.GetIoService().stats().StatsString() << "\n\n";
RAY_LOG(INFO) << "ray_syncer_io_context_ Event stats:\n\n"
<< ray_syncer_io_context_.GetIoService().stats().StatsString()
<< "\n\n";
Contributor:

This is user-facing, let's make this user-friendly -- instead of using field name describe what they are

@rynewang (Contributor, Author):

I placed it this way for ease of searching logs - I don't think Ray users (other than Core developers) will ever need to read this...

src/ray/gcs/gcs_server/gcs_job_manager.cc (outdated, resolved)
@jjyao (Collaborator) left a comment:

There is refactoring in this PR. Can we have the refactoring in its own PR first?

* The constructor takes a thread name and starts the thread.
* The destructor stops the io_service and joins the thread.
*/
class InstrumentedIoContextWithThread {
Collaborator:

should be IO instead of Io? lol
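
For reference, a rough sketch of what a wrapper like the one quoted above could look like (assuming plain boost::asio here; the real class presumably wraps Ray's instrumented_io_context and carries extra plumbing): the constructor starts a named thread running the io_context, and the destructor stops the io_context and joins the thread.

```cpp
#include <string>
#include <thread>
#include <boost/asio/executor_work_guard.hpp>
#include <boost/asio/io_context.hpp>

// Illustrative sketch only, not Ray's InstrumentedIoContextWithThread.
class IoContextWithThreadSketch {
 public:
  explicit IoContextWithThreadSketch(const std::string &thread_name)
      : work_(boost::asio::make_work_guard(io_context_)),
        thread_([this] { io_context_.run(); }) {
    // Setting the OS thread name is platform-specific; omitted here.
    (void)thread_name;
  }

  ~IoContextWithThreadSketch() {
    // Stop the event loop and join the worker thread.
    io_context_.stop();
    if (thread_.joinable()) {
      thread_.join();
    }
  }

  boost::asio::io_context &GetIoService() { return io_context_; }

 private:
  boost::asio::io_context io_context_;
  boost::asio::executor_work_guard<boost::asio::io_context::executor_type> work_;
  std::thread thread_;
};
```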

rynewang added a commit that referenced this pull request Sep 26, 2024
Every time GetInternalConfig is called it reads from table_storage, but the value is never mutated. Move it to internal KV as a simple in-memory get (no more reads from Redis). This by itself should slightly improve performance, but together with #47736 it should improve startup latency a lot in thousand-node clusters. In theory we could remove it entirely and just keep it as an InternalKV entry, but let's do things step by step.

Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
@rynewang rynewang changed the title [core] Move GCS InternalKV workload to dedicated thread. [core]Make GCS InternalKV workload configurable to the Policy. Oct 29, 2024
@rynewang (Contributor, Author):

@jjyao ready to review

@@ -104,7 +103,7 @@ void PeriodicalRunner::DoRunFnPeriodicallyInstrumented(
// event loop.
auto stats_handle = io_service_.stats().RecordStart(name, period.total_nanoseconds());
timer->async_wait([this,
fn = std::move(fn),
Collaborator:

Why remove the move?

@rynewang (Contributor, Author):

Because that move of fn did not actually work: fn is a const&, so std::move on it is a no-op and the capture still copies.
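
A standalone illustration (not Ray code) of why that is: std::move applied to a const reference yields a const rvalue, which cannot bind to the move constructor, so the copy constructor is chosen anyway.

```cpp
#include <functional>
#include <iostream>
#include <utility>

// std::move(fn) here is const std::function<void()>&&; it cannot bind to the
// move constructor, so 'copy' is copy-constructed despite the std::move.
void Capture(const std::function<void()> &fn) {
  auto copy = std::move(fn);  // still a copy
  copy();
}

int main() {
  std::function<void()> hello = [] { std::cout << "hello\n"; };
  Capture(hello);
  hello();  // hello is still valid: it was never actually moved from
  return 0;
}
```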

Comment on lines +72 to +73
// Init GCS table storage. Note this is on the default io context, not the one with
// GcsInternalKVManager, to avoid congestion on the latter.
Collaborator:

what does this mean?

@rynewang (Contributor, Author) commented on Oct 30, 2024:

Meaning gcs_table_storage_ should always go with GetDefaultIOContext, not with GetIOContext<GcsInternalKVManager>()

Collaborator:

> to avoid congestion on the latter.

What does this mean?

@rynewang (Contributor, Author):

If gcs_table_storage_ operations ran on GcsInternalKVManager's io context, they could slow down GcsInternalKVManager.
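
To make the intent concrete, an illustrative-only sketch of the idea (not Ray's actual API; the real GetDefaultIOContext / GetIOContext<T>() helpers differ): the policy maps a handler type to an io context, with GcsInternalKVManager opted into a dedicated one while everything else, including gcs_table_storage_, stays on the default.

```cpp
#include <boost/asio/io_context.hpp>

class GcsInternalKVManager;  // forward declaration, for illustration only

// Illustrative-only policy: handlers run on the default io_context unless a
// specialization opts a type into its own dedicated context.
struct IoContextPolicySketch {
  boost::asio::io_context default_io_context;
  boost::asio::io_context kv_io_context;

  template <typename Handler>
  boost::asio::io_context &GetIOContext() {
    return default_io_context;
  }
};

// GcsInternalKVManager work goes to the dedicated KV context; gcs_table_storage_
// keeps posting to default_io_context so KV traffic and table-storage traffic
// cannot slow each other down.
template <>
inline boost::asio::io_context &
IoContextPolicySketch::GetIOContext<GcsInternalKVManager>() {
  return kv_io_context;
}
```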
