[core] Make GCS InternalKV workload configurable to the Policy. #47736
base: master
Conversation
Signed-off-by: Ruiyang Wang <[email protected]>
Force-pushed from aadcd14 to c84fd80 (Compare)
Note: TSAN caught a new data race: GetAllJobInfo sends RPCs to workers (on the main thread) and KVMultiGet (on the new ioctx thread). Both modify a counter that implements a "gather all these tasks, then run the callback" logic. I didn't want a new mutex just for this, so I used atomics: both threads now access the same std::atomic<size_t> and perform atomic operations on it.
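A minimal sketch of that counter pattern (names are hypothetical, not Ray's actual code): every finishing task decrements a shared `std::atomic<size_t>`, and whichever thread observes the count reach zero runs the callback exactly once, with no mutex involved.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical sketch of a "gather N tasks, then run callback" counter.
// TaskDone() may be called concurrently from any thread; exactly one
// caller observes the count hitting zero and runs the callback.
class CountDownCallback {
 public:
  CountDownCallback(size_t n, std::function<void()> cb)
      : remaining_(n), cb_(std::move(cb)) {}

  void TaskDone() {
    // fetch_sub returns the previous value, so exactly one thread sees 1.
    if (remaining_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
      cb_();
    }
  }

 private:
  std::atomic<size_t> remaining_;
  std::function<void()> cb_;
};
```

The acquire-release ordering makes the side effects of all completed tasks visible to the thread that runs the callback, which is what the mutex would otherwise have provided.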
src/ray/gcs/gcs_server/gcs_server.cc
Outdated
RAY_LOG(INFO) << "main_service_ Event stats:\n\n"
              << main_service_.stats().StatsString() << "\n\n";
RAY_LOG(INFO) << "pubsub_io_context_ Event stats:\n\n"
              << pubsub_io_context_.GetIoService().stats().StatsString() << "\n\n";
RAY_LOG(INFO) << "kv_io_context_ Event stats:\n\n"
              << kv_io_context_.GetIoService().stats().StatsString() << "\n\n";
RAY_LOG(INFO) << "task_io_context_ Event stats:\n\n"
              << task_io_context_.GetIoService().stats().StatsString() << "\n\n";
RAY_LOG(INFO) << "ray_syncer_io_context_ Event stats:\n\n"
              << ray_syncer_io_context_.GetIoService().stats().StatsString()
              << "\n\n";
This is user-facing; let's make it user-friendly -- instead of using the field names, describe what they are.
I placed it this way for ease of searching logs - I don't think Ray users (other than Core developers) will ever need to read this...
There is refactoring in this PR. Can we have the refactoring in its own PR first?
src/ray/common/asio/asio_util.h
Outdated
 * The constructor takes a thread name and starts the thread.
 * The destructor stops the io_service and joins the thread.
 */
class InstrumentedIoContextWithThread {
should be IO instead of Io? lol
There is refactoring in this PR. Can we have the refactoring in its own PR first?
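The class under discussion is an RAII owner of an io_context plus its worker thread. A std-only sketch of the same pattern (constructor names and starts a thread running an event loop; destructor stops the loop and joins) may help; this is a hypothetical stand-in using a plain task queue, since the real class wraps an instrumented boost::asio io_context.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical std-only analogue of InstrumentedIoContextWithThread:
// ctor starts a named worker thread; dtor drains pending tasks, stops
// the loop, and joins, so posted work never outlives the object.
class IoContextWithThread {
 public:
  explicit IoContextWithThread(std::string thread_name)
      : name_(std::move(thread_name)), worker_([this] { Run(); }) {}

  ~IoContextWithThread() {
    {
      std::lock_guard<std::mutex> lk(mu_);
      stopped_ = true;
    }
    cv_.notify_all();
    worker_.join();
  }

  // Schedule a task on the worker thread (like io_context::post).
  void Post(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lk(mu_);
      tasks_.push_back(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [this] { return stopped_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // stopped and fully drained
        task = std::move(tasks_.front());
        tasks_.pop_front();
      }
      task();
    }
  }

  std::string name_;  // in the real class, used to name the OS thread
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> tasks_;
  bool stopped_ = false;
  std::thread worker_;  // declared last so it starts after other members
};
```

Because all posted tasks run on one thread, handlers owned by a single context need no locking among themselves, which is the property the per-class io_context policy in this PR relies on.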
GetInternalConfig reads from table_storage every time, but the value is never mutated. This moves it to internal KV as a simple in-memory get (no more reads from Redis). This by itself should slightly improve performance, but with #47736 it should improve startup latency a lot in thousand-node clusters. In theory we could remove it entirely and just keep it as an InternalKV entry, but let's do things step by step. Signed-off-by: Ruiyang Wang <[email protected]>
@jjyao ready to review
@@ -104,7 +103,7 @@ void PeriodicalRunner::DoRunFnPeriodicallyInstrumented(
  // event loop.
  auto stats_handle = io_service_.stats().RecordStart(name, period.total_nanoseconds());
  timer->async_wait([this,
                     fn = std::move(fn),
Why remove the std::move?
Because that std::move of fn did not work: fn is a const&, so the move is a no-op and it copies anyway.
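A small illustration of why: std::move applied through a const reference yields a const rvalue, which cannot bind to the move constructor, so overload resolution silently falls back to the copy constructor (function names below are illustrative).

```cpp
#include <cassert>
#include <string>
#include <utility>

// std::move(s) here produces const std::string&&, which the move ctor
// (taking std::string&&) cannot bind to; the copy ctor is chosen instead,
// and the source string is left untouched.
std::string take_from_const(const std::string &s) {
  return std::string(std::move(s));  // silently copies
}

// With a non-const reference, the move ctor is selected and the
// source's buffer is actually stolen.
std::string take_from_mutable(std::string &s) {
  return std::string(std::move(s));  // genuinely moves
}
```

This is exactly the situation in the lambda capture above: since fn arrives as a const reference, `fn = std::move(fn)` copies, so writing the move only suggests an optimization that never happens.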
// Init GCS table storage. Note this is on the default io context, not the one with
// GcsInternalKVManager, to avoid congestion on the latter.
what does this mean?
Meaning gcs_table_storage_ should always use GetDefaultIOContext, never GetIOContext<GcsInternalKVManager>().
to avoid congestion on the latter.
What does this mean?
If gcs_table_storage_ operations ran on GcsInternalKVManager's io context, they could slow GcsInternalKVManager down.
Since #48231 we can define a policy for which IoContext a workload runs on. To make a workload configurable, one needs to fix thread safety for the involved classes and then define the policy for each class.
Makes PeriodicalRunner thread-safe with an atomic bool. Adds a policy assigning InternalKVManager to the default IO context.
Notably, gcs_table_storage_ is not controlled by this policy, because it's used by all the other non-KV services and we want it isolated from InternalKVManager's policy.
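For illustration, the kind of per-class policy described here can be sketched as a compile-time trait that maps a handler class to the name of the io_context its work runs on. All names below are hypothetical stand-ins, not Ray's actual API; the point is that routing is decided per class, with a shared default.

```cpp
#include <cassert>
#include <string>

// Hypothetical handler classes standing in for the real GCS components.
struct GcsInternalKVManager {};
struct GcsTableStorage {};

// Default policy: run on the shared default io_context.
template <typename Handler>
struct IoContextPolicy {
  static constexpr const char *context_name = "default_io_context";
};

// Per-class override: pin InternalKV work to its own dedicated context.
template <>
struct IoContextPolicy<GcsInternalKVManager> {
  static constexpr const char *context_name = "kv_io_context";
};
```

Under a scheme like this, gcs_table_storage_ simply gets no specialization, so it stays on the default context regardless of how InternalKVManager's policy changes, which is the isolation the description calls out.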