Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core][aDAG ]Fix unhandled error related issues. #47689

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

rkooo567
Copy link
Contributor

@rkooo567 rkooo567 commented Sep 16, 2024

Why are these changes needed?

  • If DAG is already teardown, we should not do .get
  • unhandled error should be printed and suppressed and should not raise an exception in the destructor

wip tests

Related issue number

Closes #47687

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -661,7 +661,7 @@ def __init__(
# We conservatively set num_shm_buffers to _max_inflight_executions.
# It means that the DAG can be underutilized, but it guarantees there's
# no false positive timeouts.
num_shm_buffers=self._max_inflight_executions,
num_shm_buffers=1,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is unrelated to this PR, but will be merged separately

from ray.tests.conftest import * # noqa


logger = logging.getLogger(__name__)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently I cannot capture logs. wip

Comment on lines +12 to +13
RAY_IGNORE_UNHANDLED_ERRORS_ENV_VAR = "RAY_IGNORE_UNHANDLED_ERRORS"
RAY_IGNORE_UNHANDLED_ERRORS = env_bool(RAY_IGNORE_UNHANDLED_ERRORS_ENV_VAR, False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these should be prefixed with RAY_ADAG as they are only relevant to adag

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually this is the same env var as what we have in Ray itself

if "RAY_IGNORE_UNHANDLED_ERRORS" in os.environ:

imo it is better unifying. I will use the common constant for the env var instead

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Can you add a comment here: people tend to change this to make names consistent if they are not aware of the reuse.

Comment on lines +96 to +105
try:
self.get()
except Exception as e:
if RAY_IGNORE_UNHANDLED_ERRORS:
return

logger.error(
"Unhandled error (suppress with "
f"{RAY_IGNORE_UNHANDLED_ERRORS_ENV_VAR}=1'): {e}"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think python by default ignores the exceptions happen in __del__ and logs to stderr.

What's the difference with this change? Can you post the log before and after the change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that will print unnecessary stacktrace (to a destructor) and and an additional message the "exception raised in destructor is ignored". This will have clean error messages

@@ -2143,6 +2149,7 @@ def teardown(self):
monitor = getattr(self, "_monitor", None)
if monitor is not None:
monitor.teardown(wait=True)
self._is_teardown = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: _was_torn_down

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[aDAG][Core] Weird exception if ref is deleted after adag teardown
3 participants