
[Serve] Expose Serve app source in get_serve_instance_details #45522

Merged

Conversation

Contributor

@JoshKarpel commented May 23, 2024

Why are these changes needed?

This change exposes the new Serve app submission API type tracking introduced in #44476 in the dashboard API.

My intent is to eventually introduce an option in KubeRay to only care about the status of DECLARATIVE Serve apps, so that it doesn't care about "dynamically deployed" IMPERATIVE apps.
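As a rough illustration (a sketch only, not actual KubeRay code; the endpoint is the dashboard's /api/serve/applications/ route used elsewhere in this PR, and the serialized source values are assumed here to be strings), a consumer could filter on the new field like this:

import requests

# Sketch: ignore imperatively deployed apps when evaluating Serve health.
# The "declarative"/"imperative" strings are an assumption about how the
# enum will be serialized; see the review discussion below.
details = requests.get(
    "http://localhost:52365/api/serve/applications/", timeout=30
).json()
declarative_statuses = {
    name: app["status"]
    for name, app in details["applications"].items()
    if app.get("source") == "declarative"
}
# Only these statuses would feed into, e.g., a RayService health check.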

Per request, I've indicated that this is a developer API in the field docstring.

Related issue number

Follow-up for #44226 (comment)

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -148,6 +148,7 @@ def autoscaling_app():
         "message": "",
         "last_deployed_time_s": deployment_timestamp,
         "deployed_app_config": None,
+        "source": 1,
Contributor Author

This feels awkward to me - can the enum be a https://docs.python.org/3/library/enum.html#enum.StrEnum instead? Or perhaps if they need to be ints we'd use the existing enum field names to convert at the last moment, just for the API (not internally)?

Contributor

Yes, I think it would be better to change this to a string enum, now that it's exposed through the REST API.
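A minimal sketch of what a string-valued enum could look like (the class and member names here are assumptions, not Ray's actual definitions):

from enum import Enum

class APIType(str, Enum):
    # Hypothetical names; the real enum in Ray may differ.
    UNKNOWN = "unknown"
    IMPERATIVE = "imperative"    # deployed in-process, e.g. via serve.run
    DECLARATIVE = "declarative"  # deployed via a config through the REST API

# Subclassing str means JSON serialization yields "declarative"
# rather than an opaque integer like 1.
assert APIType.DECLARATIVE.value == "declarative"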

@JoshKarpel marked this pull request as ready for review May 23, 2024 14:50
Contributor

@zcin left a comment

Thanks for the contribution Josh! Could you add test(s) to check that the source is exposed correctly through the REST API? You can add it to https://github.com/ray-project/ray/blob/master/dashboard/modules/serve/tests/test_serve_dashboard.py.

@JoshKarpel
Contributor Author

> Thanks for the contribution Josh! Could you add test(s) to check that the source is exposed correctly through the REST API? You can add it to https://github.com/ray-project/ray/blob/master/dashboard/modules/serve/tests/test_serve_dashboard.py.

Aha! I knew these tests must be somewhere! Will add some tests there.

@anyscalesam added the triage (Needs triage, e.g. priority, bug/not-bug, and owning component) and serve (Ray Serve Related Issue) labels May 29, 2024
@akshay-anyscale added the P1 (Issue that should be fixed within a few weeks) label and removed the triage label Jun 3, 2024
@JoshKarpel
Contributor Author

> Thanks for the contribution Josh! Could you add test(s) to check that the source is exposed correctly through the REST API? You can add it to https://github.com/ray-project/ray/blob/master/dashboard/modules/serve/tests/test_serve_dashboard.py.

It looks like the setup for these tests is a little laborious, so I attached it to the existing test_get_serve_instance_details test.
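Roughly, the added check follows the existing test's pattern (this is a sketch, not the exact code in the PR):

import requests
from ray.serve.schema import ServeInstanceDetails

def source_is_exposed(url: str, expected_sources: dict) -> bool:
    # Parse the raw dashboard response into the ServeInstanceDetails schema,
    # then check the source reported for each application.
    details = ServeInstanceDetails(**requests.get(url, timeout=15).json())
    return all(
        details.applications[name].source == source
        for name, source in expected_sources.items()
    )

# Used inside the test with something like:
# wait_for_condition(lambda: source_is_exposed(url, {"app1": expected_source}))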

@JoshKarpel requested a review from zcin June 3, 2024 21:40
@zcin added the go (add ONLY when ready to merge, run all tests) label Jun 5, 2024
@shrekris-anyscale enabled auto-merge (squash) June 5, 2024 21:28
auto-merge was automatically disabled June 6, 2024 14:42

Head branch was pushed to by a user without write access

@JoshKarpel
Contributor Author

@zcin I think I could use some help - I'm getting consistent timeout failures in test_get_serve_instance_details. When I try to run locally on an Intel Mac (and the tests are labelled as flaky on Darwin, though they seem stable for me without my changes), it looks like the Raylet is dying after the test runs:

$ pytest --setup-show -s -x python/ray/dashboard/modules/serve/tests/test_serve_dashboard.py::test_get_serve_instance_details_for_imperative_apps
============================================================================================================================ test session starts =============================================================================================================================
platform darwin -- Python 3.10.12, pytest-7.4.0, pluggy-1.2.0
rootdir: /Users/josh.karpel/projects/ray
configfile: pytest.ini
plugins: anyio-3.7.1, asyncio-0.23.6
asyncio: mode=strict
collected 2 items

python/ray/dashboard/modules/serve/tests/test_serve_dashboard.py
SETUP    S event_loop_policy
        SETUP    F monkeypatch
        SETUP    F pre_envs (fixtures used: monkeypatch)
        SETUP    F ray_start_stop
        SETUP    F url['http://localhost:5...serve/applications/']
        python/ray/dashboard/modules/serve/tests/test_serve_dashboard.py::test_get_serve_instance_details_for_imperative_apps[http://localhost:52365/api/serve/applications/] (fixtures used: event_loop_policy, monkeypatch, pre_envs, ray_start_stop, url)2024-06-07 13:06:32,848	INFO worker.py:1585 -- Connecting to existing Ray cluster at address: 127.0.0.1:6379...
2024-06-07 13:06:32,857	INFO worker.py:1761 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
(ProxyActor pid=22470) INFO 2024-06-07 13:06:36,756 proxy 127.0.0.1 proxy.py:1165 - Proxy starting on node b4a61d3c141f8db44cf0629c69ac764a7833248a6320130901076023 (HTTP port: 8000).
2024-06-07 13:06:36,880	INFO handle.py:126 -- Created DeploymentHandle 'f2cltdgn' for Deployment(name='f', app='app1').
2024-06-07 13:06:36,880	INFO handle.py:126 -- Created DeploymentHandle 'pjzqxetz' for Deployment(name='f', app='app1').
2024-06-07 13:06:36,881	INFO handle.py:126 -- Created DeploymentHandle '5sa1fm11' for Deployment(name='BasicDriver', app='app1').
2024-06-07 13:06:36,881	INFO handle.py:126 -- Created DeploymentHandle 'ca3na2bl' for Deployment(name='f', app='app1').
2024-06-07 13:06:36,882	INFO handle.py:126 -- Created DeploymentHandle '664rf6au' for Deployment(name='BasicDriver', app='app1').
(ServeController pid=22469) INFO 2024-06-07 13:06:36,960 controller 22469 deployment_state.py:1598 - Deploying new version of Deployment(name='f', app='app1') (initial target replicas: 1).
(ServeController pid=22469) INFO 2024-06-07 13:06:36,962 controller 22469 deployment_state.py:1598 - Deploying new version of Deployment(name='BasicDriver', app='app1') (initial target replicas: 1).
(ServeController pid=22469) INFO 2024-06-07 13:06:37,066 controller 22469 deployment_state.py:1844 - Adding 1 replica to Deployment(name='f', app='app1').
(ServeController pid=22469) INFO 2024-06-07 13:06:37,070 controller 22469 deployment_state.py:1844 - Adding 1 replica to Deployment(name='BasicDriver', app='app1').
2024-06-07 13:06:38,928	INFO handle.py:126 -- Created DeploymentHandle '8eze0dg1' for Deployment(name='BasicDriver', app='app1').
2024-06-07 13:06:38,928	INFO api.py:584 -- Deployed app 'app1' successfully.
2024-06-07 13:06:38,938	INFO handle.py:126 -- Created DeploymentHandle 'bje4zo95' for Deployment(name='f', app='app2').
2024-06-07 13:06:38,938	INFO handle.py:126 -- Created DeploymentHandle 'gof08ux9' for Deployment(name='f', app='app2').
2024-06-07 13:06:38,939	INFO handle.py:126 -- Created DeploymentHandle 'gocs3x4n' for Deployment(name='BasicDriver', app='app2').
2024-06-07 13:06:38,940	INFO handle.py:126 -- Created DeploymentHandle 'z2x43h47' for Deployment(name='f', app='app2').
2024-06-07 13:06:38,940	INFO handle.py:126 -- Created DeploymentHandle 'n91zoxkk' for Deployment(name='BasicDriver', app='app2').
(ServeController pid=22469) INFO 2024-06-07 13:06:38,972 controller 22469 deployment_state.py:1598 - Deploying new version of Deployment(name='f', app='app2') (initial target replicas: 1).
(ServeController pid=22469) INFO 2024-06-07 13:06:38,974 controller 22469 deployment_state.py:1598 - Deploying new version of Deployment(name='BasicDriver', app='app2') (initial target replicas: 1).
(ServeController pid=22469) INFO 2024-06-07 13:06:39,077 controller 22469 deployment_state.py:1844 - Adding 1 replica to Deployment(name='f', app='app2').
(ServeController pid=22469) INFO 2024-06-07 13:06:39,079 controller 22469 deployment_state.py:1844 - Adding 1 replica to Deployment(name='BasicDriver', app='app2').
2024-06-07 13:06:41,003	INFO handle.py:126 -- Created DeploymentHandle '6866e1ge' for Deployment(name='BasicDriver', app='app2').
2024-06-07 13:06:41,003	INFO api.py:584 -- Deployed app 'app2' successfully.
All applications are in a RUNNING state.
Finished checking application details.
(ServeController pid=22469) INFO 2024-06-07 13:06:41,623 controller 22469 deployment_state.py:1860 - Removing 1 replica from Deployment(name='f', app='app1').
(ServeController pid=22469) INFO 2024-06-07 13:06:41,624 controller 22469 deployment_state.py:1860 - Removing 1 replica from Deployment(name='BasicDriver', app='app1').
(ServeController pid=22469) INFO 2024-06-07 13:06:43,706 controller 22469 deployment_state.py:2182 - Replica(id='t6du5eho', deployment='f', app='app1') is stopped.
(ServeController pid=22469) INFO 2024-06-07 13:06:43,706 controller 22469 deployment_state.py:2182 - Replica(id='2csbd16s', deployment='BasicDriver', app='app1') is stopped.
(ServeController pid=22469) INFO 2024-06-07 13:06:44,638 controller 22469 deployment_state.py:1860 - Removing 1 replica from Deployment(name='f', app='app2').
(ServeController pid=22469) INFO 2024-06-07 13:06:44,638 controller 22469 deployment_state.py:1860 - Removing 1 replica from Deployment(name='BasicDriver', app='app2').
(ServeController pid=22469) INFO 2024-06-07 13:06:46,695 controller 22469 deployment_state.py:2182 - Replica(id='um2ez93k', deployment='f', app='app2') is stopped.
(ServeController pid=22469) INFO 2024-06-07 13:06:46,696 controller 22469 deployment_state.py:2182 - Replica(id='6ngt7a82', deployment='BasicDriver', app='app2') is stopped.
.
        TEARDOWN F url['http://localhost:5...serve/applications/'](raylet) Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [2024-06-07 13:06:32,900 I 22449 3476490] (raylet) worker_pool.cc:510: Started worker process with pid 22465, the token is 12
    [2024-06-07 13:06:32,903 I 22449 3476490] (raylet) worker_pool.cc:510: Started worker process with pid 22466, the token is 13
    [2024-06-07 13:06:32,906 I 22449 3476490] (raylet) worker_pool.cc:510: Started worker process with pid 22467, the token is 14
    [2024-06-07 13:06:32,909 I 22449 3476490] (raylet) worker_pool.cc:510: Started worker process with pid 22468, the token is 15
    [2024-06-07 13:06:34,335 I 22449 3476523] (raylet) object_store.cc:35: Object store current usage 8e-09 / 2.14748 GB.
    [2024-06-07 13:06:34,583 I 22449 3476490] (raylet) node_manager.cc:611: New job has started. Job id 01000000 Driver pid 22440 is dead: 0 driver address: 127.0.0.1
    [2024-06-07 13:06:34,583 I 22449 3476490] (raylet) worker_pool.cc:691: Job 01000000 already started in worker pool.
    [2024-06-07 13:06:34,649 I 22449 3476490] (raylet) worker_pool.cc:510: Started worker process with pid 22469, the token is 16
    [2024-06-07 13:06:35,743 I 22449 3476490] (raylet) worker_pool.cc:510: Started worker process with pid 22470, the token is 17
    [2024-06-07 13:06:37,097 I 22449 3476490] (raylet) worker_pool.cc:510: Started worker process with pid 22471, the token is 18
    [2024-06-07 13:06:37,118 I 22449 3476490] (raylet) worker_pool.cc:510: Started worker process with pid 22472, the token is 19
    [2024-06-07 13:06:37,143 I 22449 3476490] (raylet) node_manager.cc:1464: NodeManager::DisconnectClient, disconnect_type=1, has creation task exception = false
    [2024-06-07 13:06:39,103 I 22449 3476490] (raylet) worker_pool.cc:510: Started worker process with pid 22473, the token is 20
    [2024-06-07 13:06:39,124 I 22449 3476490] (raylet) worker_pool.cc:510: Started worker process with pid 22474, the token is 21
    [2024-06-07 13:06:41,033 I 22449 3476490] (raylet) node_manager.cc:611: New job has started. Job id 02000000 Driver pid 22451 is dead: 0 driver address: 127.0.0.1
    [2024-06-07 13:06:41,033 I 22449 3476490] (raylet) worker_pool.cc:691: Job 02000000 already started in worker pool.
    [2024-06-07 13:06:43,709 I 22449 3476490] (raylet) node_manager.cc:1464: NodeManager::DisconnectClient, disconnect_type=1, has creation task exception = false
    [2024-06-07 13:06:43,709 I 22449 3476490] (raylet) node_manager.cc:1464: NodeManager::DisconnectClient, disconnect_type=1, has creation task exception = false
    [2024-06-07 13:06:46,698 I 22449 3476490] (raylet) node_manager.cc:1464: NodeManager::DisconnectClient, disconnect_type=1, has creation task exception = false
    [2024-06-07 13:06:46,699 I 22449 3476490] (raylet) node_manager.cc:1464: NodeManager::DisconnectClient, disconnect_type=1, has creation task exception = false


        TEARDOWN F ray_start_stop
        TEARDOWN F pre_envs
        TEARDOWN F monkeypatch
        SETUP    F monkeypatch
        SETUP    F pre_envs (fixtures used: monkeypatch)
        SETUP    F ray_start_stop
        SETUP    F url['http://localhost:8...serve/applications/']
        python/ray/dashboard/modules/serve/tests/test_serve_dashboard.py::test_get_serve_instance_details_for_imperative_apps[http://localhost:8265/api/serve/applications/] (fixtures used: event_loop_policy, monkeypatch, pre_envs, ray_start_stop, url)^C
        TEARDOWN F url['http://localhost:8...serve/applications/']
        TEARDOWN F ray_start_stop
        TEARDOWN F pre_envs
        TEARDOWN F monkeypatch
TEARDOWN S event_loop_policy

It looks like the Raylet crashes during teardown after the first test, then the second test blocks forever (note that I ctrl-c'd it in the above log after a minute or so).

python/ray/dashboard/modules/serve/tests/test_serve_dashboard.py::test_get_serve_instance_details_for_imperative_apps[http://localhost:8265/api/serve/applications/] (fixtures used: event_loop_policy, monkeypatch, pre_envs, ray_start_stop, url)^C

If I remove the serve.run that I added, the Raylet doesn't crash and the test runs as expected. So something about deploying imperatively in this test is making the Raylet crash? Maybe through the

@pytest.fixture(scope="function")
def ray_start_stop():
    subprocess.check_output(["ray", "stop", "--force"])
    wait_for_condition(
        check_ray_stop,
        timeout=15,
    )
    subprocess.check_output(["ray", "start", "--head"])
    wait_for_condition(
        lambda: requests.get("http://localhost:52365/api/ray/version").status_code
        == 200,
        timeout=15,
    )
    yield
    subprocess.check_output(["ray", "stop", "--force"])
    wait_for_condition(
        check_ray_stop,
        timeout=15,
    )
fixture, but I'm not seeing anything that would obviously work differently for imperative vs declarative apps there 🤔

Signed-off-by: Josh Karpel <[email protected]>
@zcin
Contributor

zcin commented Jun 10, 2024

@JoshKarpel It's because adding serve.run to the test means you connect to the controller through the Serve client, and the client caches the handle to the controller. So when you shut down Ray (which this test does in between runs), you also need to shut down Serve to tell the client. I'd suggest calling serve.shutdown() at the end of the test.
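Roughly like this (a sketch with illustrative names, not the final test code):

from ray import serve

def test_get_serve_instance_details_for_imperative_apps(ray_start_stop, url):
    serve.run(SomeApp.bind(), name="app1")  # hypothetical app, deployed imperatively
    try:
        ...  # existing assertions against /api/serve/applications/
    finally:
        # Clear the Serve client's cached controller handle so the next
        # `ray stop` / `ray start` cycle doesn't leave it pointing at a
        # dead controller.
        serve.shutdown()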

@JoshKarpel
Contributor Author

Sorry, I didn't intend for that to send out another review request - still having trouble with the tests. Moved back to draft.

@@ -100,7 +100,7 @@ py_test(
 py_test(
     name = "test_serve_dashboard",
     size = "enormous",
-    srcs = ["modules/serve/tests/test_serve_dashboard.py"],
+    srcs = ["modules/serve/tests/test_serve_dashboard.py", "modules/serve/tests/deploy_imperative_serve_apps.py"],
Contributor Author

I'm not familiar with Bazel - is this the right way to declare that this test needs this file? Or should it go under, e.g., deps?

@@ -418,6 +422,104 @@ def applications_running():
         assert app_details[app].last_deployed_time_s > 0
         assert app_details[app].route_prefix == expected_values[app]["route_prefix"]
         assert app_details[app].docs_path == expected_values[app]["docs_path"]
+        assert app_details[app].source == expected_values[app]["source"]

     # CHECK: all deployments are present
Contributor Author

The diff is a bit confusing here because there's a lot of duplicated code between this existing test case and the new test_get_serve_instance_details_for_imperative_apps below - the code in this test was already here!

Comment on lines -743 to +753

-            "route_prefix": deployment_args.route_prefix
-            if deployment_args.HasField("route_prefix")
-            else None,
+            "route_prefix": (
+                deployment_args.route_prefix
+                if deployment_args.HasField("route_prefix")
+                else None
+            ),
             "ingress": deployment_args.ingress,
-            "docs_path": deployment_args.docs_path
-            if deployment_args.HasField("docs_path")
-            else None,
+            "docs_path": (
+                deployment_args.docs_path
+                if deployment_args.HasField("docs_path")
+                else None
+            ),
Contributor Author

Linter seemed to want these changes, not sure why they weren't already like this 🤔

@JoshKarpel marked this pull request as ready for review September 18, 2024 18:44
@JoshKarpel
Contributor Author

Bump @zcin @edoakes for review, not urgent, just want to make sure it doesn't get dropped :)

Contributor

@zcin left a comment

LGTM!

@zcin merged commit 932a410 into ray-project:master Oct 18, 2024
5 checks passed
@JoshKarpel deleted the issue-44226-expose-serve-app-source branch October 21, 2024 14:12
Jay-ju pushed a commit to Jay-ju/ray that referenced this pull request Nov 5, 2024
Labels
  • go: add ONLY when ready to merge, run all tests
  • P1: Issue that should be fixed within a few weeks
  • serve: Ray Serve Related Issue