Skip to content

Conversation

rjzamora
Copy link
Member

I spent a long time debugging a hang in XGBoost before I noticed this doc-string note about disabling AMM.

It turns out that xgboost.dask.predict(...) uses client.scatter(..., broadcast=True) to replicate the Booster object on all workers. In some cases, the replication process seems to conflict with the active-memory-manager's ReduceReplicas policy - resulting in a hang.

This PR proposes that a clear error be raised by Client.scatter when broadcast=True and AMM is enabled in the config. It also seems fine to produce a warning instead. However, I definitely think it makes sense to be "loud" when the user is likely to run into a problem like this.

@rjzamora rjzamora requested a review from fjetter as a code owner July 23, 2024 18:15
@rjzamora rjzamora marked this pull request as draft July 23, 2024 18:24
Copy link
Contributor

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    29 files  ± 0      29 suites  ±0   12h 5m 45s ⏱️ + 2m 0s
 4 092 tests + 1   3 968 ✅  -  3    112 💤 ±0  12 ❌ + 5 
55 353 runs  +14  52 824 ✅  - 64  2 438 💤 ±0  91 ❌ +79 

For more details on these failures, see this check.

Results for commit 88c8565. ± Comparison against base commit 7013e2e.

Comment on lines +2782 to +2784
if broadcast and dask.config.get(
"distributed.scheduler.active-memory-manager.start"
):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is problematic.

  1. you want to specifically test for ReduceReplicas. Namely, at the moment of writing AMM also serves the RetireWorker policy and could serve other, harmless user-defined policies.
  2. the AMM could have been started by hand with client.amm.start(). The API call clent.amm.running() is there for this.

I would instead suggest adding a new RPC endpoint to

  • distributed.active_memory_manager.AMMClientProxy and to
  • distributed.active_memory_manager.ActiveMemoryManagerExtension.amm_handler

which returns a set of currently running policies, and test that instead.
(caveat: there can be more than one RetireWorker policies running when you have more than one worker closing gracefully at the same time).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @crusaderky - I agree that the config check is not sufficient. Thanks for the suggestions!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants